Checking Formal Specifications by Testing

Formal specification methods hold promise for bridging the wide gap between an intuitive idea for solving a problem by computer, and the executable program that attempts to do the job. The use of formalism is itself a good thing, allowing professionals to understand and analyze their work better. However, formal methods are an aid to human effort, not a panacea. Conventional software testing can be an ideal complement to formally directed development. Tests are concrete and immediately comprehensible to end users, and they are unlikely to miss mistakes because of a pernicious correlation with the formal work. Research is needed on ways to make formal specifications and testing work together to realize the potential of both. Tests should serve to increase confidence that a formal method has been correctly applied. Such tests would free the developers from tedious checking of formalism details, and the success of only a few tests would have real significance for the software's correctness. As an example of a formalism/testing partnership, this talk describes joint work with Sergio Antoy [4] on automatically checking a conventional implementation of an abstract data type against its formal algebraic specification.

∗On leave from Portland State University, Department of Computer Science and Center for Software Quality Research, Portland, OR 97207, USA


Formal Methods and Testing
In the "old days" when formal methods were identified with proving an implementation program correct with respect to a first-order logic specification, this proving was thought to be antithetical to testing. The two methods were viewed as being at opposite ends of a spectrum: the formal work was abstract, precise, and concerned with the function the program computes; testing was practical, sloppy, and concerned only with finite support for that function.
A more modern viewpoint considers the two subjects complementary. Today's formal methods are still abstract and precise, but now much more broadly concerned with expressing and reasoning about specifications. Testing today is still practical and sometimes sloppy, but it can be the most technical and precise of the software-development arts. Specification and testing remain at opposite ends of the development time-line, so there is maximum scope for testing to serve as a check on the whole process.
To mention a promising interaction in which testing benefits from a formal specification, consider automatic random testing. This idea is to subject code to inputs chosen at random, where the random generation of test points is planned and automated using a formal description of a program's input domain. To be effective, random testing must employ far more test points than the usual "by hand" tests. Hence it is practical only if the results can be automatically checked, and here again a formal specification is the answer. Many formal specifications are effective in that they provide a decision procedure for whether or not any input-output pair conforms to the specification.
In the other direction, testing has the potential to detect mistakes in software development, mistakes which may arise from poor communication between end users and software experts (that is, the formal specifications are for the wrong problem), and mistakes in carrying through development (that is, the formal specifications have not been properly followed). Here the technical nature of both disciplines comes to the fore: it should be possible to devise tests that pinpoint common mistakes, and if the tests succeed, to gain great confidence that nothing is wrong.

Formal Methods
Formal methods (hereafter "FMs") are broadly any mathematical techniques used in program development. The intent is clearly to bring abstract thought to bear on the difficult problems of describing what a program is supposed to do and seeing that it does just that. The common narrower meaning of FMs is mathematics for stating and reasoning about specifications. Some FMs have a constructive, algorithmic character, while others are more abstract. At the abstract extreme, intuitive set theory provides most of the mathematical abstractions needed for describing program meaning. At the other extreme, a finite-state-machine description allows the mechanical generation or checking of many aspects of specification. The Z language is close to the abstract end, while the SCR method is near the constructive end. In between, methods may be designed for facilitating mechanical proofs, but not necessarily in a decidable theory; PVS is one such FM.
To usefully interact with software testing, a FM must be capable of serving as an oracle. An oracle is a means of attempting to decide, for any pair (x, y), whether or not y is a result consistent with the specification on input x. An effective oracle is a mechanical decision procedure.

Testing
In principle, a test is a finite set of inputs for a program. A test succeeds if the program outputs are consistent with the specification. A failure is a test point on which the program output is inconsistent with the specification.
Since testing deals only with a finite input set, it cannot establish general properties of the program being tested. But in random testing it is possible to obtain statistical results such as a bound on the probability of failure. A clever choice of test points can also be used to probe for likely failures. In fault-based testing, test points are chosen so as to unambiguously answer worrisome questions about program behavior. For example, a programmer may be concerned about the conditional expression terminating a loop: should the comparison be "≤" or "<"? It is usually possible to find a single test case that will succeed only if the conditional is correct.

Research Agenda
FMs and testing each have difficult technical problems that define current research. It may be that each area can help the other.

FMs in aid of testing
By far the most important interaction between FMs and testing is that a FM provides an oracle. The testing literature almost universally assumes that an oracle is available for the program being tested, but in practice this assumption is mostly false.
"Specification-based" tests are recommended as the only ones capable of detecting programming errors of omission, and as reflecting what a program is expected to accomplish. Not the least important property of specification-based tests is that they can be devised early in the development process, and in parallel with the rest of development. Even the most non-constructive FMs have a well-defined syntax that allows their specifications to be printed in a standard format, and to be processed mechanically; the syntax is often enough to permit automatic test generation.

Testing in aid of FM
FMs have a number of difficulties in principle, and testing can be of some help with each:

Wrong specification. By far the strongest argument against the use of FMs is that they, like any specification method, may fail to capture the intuitive requirements of the software end user. These user requirements are intrinsically vague and imprecise, and in sharpening them as any FM must, human beings may get it wrong. Testing can help by generating revealing, representative cases, which unsophisticated users can understand, to try out the formal specification.

Improper use of the FM. Every FM has pitfalls that allow its practitioners to create apparently meaningful specifications that contain technical mistakes. For example, in any "declarative" FM, it will be possible to write a specification that is correctly implemented by almost any program; at the other extreme a specification may have no correct implementations. As experience with a FM is gained, one might hope to make a catalog of "common blunders," and to find corresponding tests, by analogy to "fault-based" tests for programs, that necessarily expose them.

Inconsistent implementation. Tests are conducted on the final product of software development: the program. They therefore check how faithfully the design and code have followed the specification. The example presented in section 2 to follow falls in this category.

Example: Self-checking ADT Code
As an example of the potential benefits of interaction between the FMs and testing communities, this section summarizes research work primarily done by my colleague Sergio Antoy, to appear elsewhere [4]. In that research, an abstract data type (ADT) is specified using a formal algebraic specification, the ADT is implemented by hand in a conventional programming language like C++, and a largely automatic scheme is used to instrument the program so that as it executes, it checks each result against the specification. The exposition of Antoy's scheme uses the example of a finite set of integers, as taken from Stroustrup's C++ textbook. This ADT has operations given by the signature in figure 1.

Correctness of an ADT
An implementation of an ADT is correct iff certain diagrams for its operations commute. These diagrams display the "abstract" world of the ADT specification and the "concrete" world of the program implementation, connected by a representation mapping R. Figure 2 shows the diagram for the member operation, where b is the value in the state at the lower right that results from calling member with argument values x and S. It was Antoy's idea that it would be possible to automate checking of the commuting diagram for particular values of x and S if the representation mapping were made explicit, that is, if it were programmed as part of the C++ implementation of the ADT. In addition to R, checking the diagram requires the ability to compute:
The abstract function. (∈ in figure 2.) The ADT is specified by a set of equational axioms. These axioms describe a ground word algebra in the names of the operations and conventional representations of the integers, and allow words to be reduced to normal form, using the axioms as left-to-right rewriting rules. (Antoy uses a restricted axiom form that guarantees that the rewrite theory will be identical to the equational theory.) Hence the abstract function at the top of the diagram can be computed by rewriting (in the word algebra).
The concrete function. (member in figure 2.) The C++ implementation provides the ability to compute all concrete functions.

Notations:
The "?" notation signals an exception that terminates rewriting. Rules may have a guard following ":-" to select that rule.
The fourth rule for insert, for example, applies in the case that the inserted element is not already in the set, the set is not too large, and the inserted element is larger than the one currently at the beginning of the word being rewritten. Thus this rule specifies that the canonical form will have the set elements in ascending order.
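The shape of such guarded rules might be, for example (a hypothetical reconstruction for illustration, not the rules of the actual specification):

```
insert(x, S)             ->  ?                        :- size(S) = maxsize
insert(x, insert(y, S))  ->  insert(y, insert(x, S))  :- x > y
```

Here the first rule raises the exception when the set is already full, and the second exchanges adjacent elements so that rewriting drives the word toward ascending order.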

The "Direct" (Rewriting) Implementation
The rules of the specification can be used to compute normal forms in the word algebra. This computation is the same for all specifications, except for the particular rules to be used. A possible implementation strategy for the rewriting is to store the algebraic words in a tree structure, and rewrite by manipulating the tree. We call this the direct implementation because it comes directly from the specification, and it can be mechanically compiled into C++ code.
For example, here is how a portion of the direct implementation might appear for the set example:

    typedef int elem;            // kind of generic
    enum tag { EMPTY, INSERT };  // discriminant tokens

The by-hand implementation is then modified so that its computations are checked against those of the direct implementation. These additions (1) add code for the explicit representation function, (2) call the representation function to convert state values to the abstract world, and (3) compare the results of the direct-implementation computation with this converted result. With this additional code in place, the by-hand implementation will check each of its computations and report any disagreement with the specification. The additional code required for the set example is indicated below.

Figure 3: Construction of the self-checking implementation.

Discussion of Self-checking
The self-checking scheme we have described connects a formal specification and a conventional implementation to provide for automatic run-time checking of each computed result. Thus the hand implementation is provided with an effective test oracle, which allows automated testing. A simple driver program that selects random inputs can be used to conduct far more tests than are usually performed on any code. In experiments with a part of the Java library, we were able to execute 10^6 tests in about 20 minutes on a modern laptop. Self-checking also provides a kind of ultimate multi-version programming. There is no apparent connection between the data structures, algorithms, or other aspects of the way in which the two implementations, the direct using rewriting, and the by-hand using conventional techniques, produce their results. Thus if they agree, it provides some assurance that neither of them is doing the wrong thing as judged by intuition.
It is more controversial to recommend that self-checking code be carried from the test phase of development into released code. The conventional wisdom is that all checks should be removed from "production" code in the interests of speed. (Perhaps the embarrassment of a program that announces that it has just made a mistake is another factor in removing checks.) Rewriting in the direct implementation is sometimes much slower than a conventional implementation, but it is well known that only a very small part of any program lies on the path of time-critical execution, and it would be easy enough to omit checking only on this path. What to do when the implementation disagrees with the specification is a more interesting question.
First, when a failure is detected in a running production program, information can be logged so that fixing the program later is easier. But what can be done about the failed calculation, on which real-time decisions may be riding, and ultimately even human lives? Our position is that it is far better to know that a result is wrong than to erroneously believe that it is right. One possibility is simply to try again. It may be that the failure is isolated, and slight differences in sensor readings or timing may produce a correct result. In the case of a hard failure, there may be a human in the loop who can be given control. But absent all such plausible outcomes, we believe that the potential for failure puts a necessary pressure on the software designer. If the system architect knows that an ADT might fail, then the system design must try to take failure into account, and consider retries, calling for human help, etc. The resulting code will be much safer than if the ADT were blindly trusted. Ultimately, if the architect sees a case in which failure will necessarily lead to disaster, it will be necessary to use extreme measures, for example to rigorously prove that this case cannot occur, or even to inform the customer that perhaps the computer application should be reconsidered.

Figure 3 summarizes the form taken by a complete self-checking implementation, using the set example.