Modelling Program Compilation in the Refinement Calculus

We show how compilation of high-level language programs to assembler code can be formally represented in the refinement calculus. New operators are introduced to widen the modelling language to encompass assembler code. A compilation strategy is then embodied as a set of derived refinement rules.


Introduction
The idea of modelling program compilation as a formal development procedure has surfaced many times in the literature, but has presented a significant challenge. This has resulted in complex models, often using new, unfamiliar formalisms.
Our goal is to develop a model of program compilation within the already-familiar refinement calculus. Normally the refinement calculus translates an abstract requirements specification into a programming language implementation, using guarded command language augmented with specification statements as the underlying modelling notation. In the context of compilation, however, our `specification' is a high-level language (HLL) program, and our desired `implementation' is assembler code.
In this paper we focus on refinement of a program's control flow. In the past this has proven to be a major stumbling block because assembler code does not necessarily obey structured programming principles, and is therefore awkward to model in the guarded command language. Here we present a simple way of modelling assembler language constructs in the guarded command language, thus widening the modelling language to encompass the new application domain. Several simple `compilation laws' and a small example are then given to show how this model can be effectively exploited.

Previous work
Recently, Norvell [15, 16] showed how a simple compilation strategy for a small programming language could be derived from formal models of the HLL and assembler languages. His assembler code model was complicated, however, by the need for assembler instructions to be `interpreted' by an imaginary abstract machine. Earlier, Hale [5, §7.2.2], Fränzle and Müller-Olm [4, p. 303], Hoare [9, §4] [10, §4] and He Jifeng [8, §6.2] used similar models, with the semantics of assembler instruction sequences given by their interpretation.
This need for an interpreter stems from the potentially unstructured control flow of assembler programs: such code cannot be easily represented in the structured guarded command language used by the refinement calculus. Unfortunately the presence of this interpreter introduces a significant paradigm shift during attempts to model compilation as refinement: the source and target languages are represented in very different ways. Indeed, Börger and Durdanović, in defining a compilation-as-refinement strategy for occam programs, go as far as stating, "we could not make reasonable use of any of the many refinement notions in the literature" [2].
Nevertheless, Back [1] demonstrated through a case study that a guarded command language subset is capable of representing assembler-like programs, albeit clumsily. Lermer and Fidge [11] then exploited this model to show how compilation laws could be expressed in the standard refinement calculus.
The approach presented herein is inspired by various aspects of these previous attempts, especially the assembler-language models of Norvell [15, 16] and Back [1], and our own earlier work in the area [11], but is considerably more elegant than any previous method.

Languages
Our `compilation laws' are expressed in the well-known refinement calculus developed by Morgan et al. [12, 14]. As usual, we use an extended guarded command language as the `wide-spectrum' modelling language. Both high-level language statements and individual assembler instructions are then merely subset notations.

Modelling language
As shown in Figure 1, our modelling language consists of Dijkstra's guarded command language, augmented with specification statements [12, §23] [13]. We freely use the usual refinement calculus laws and definitions [12, App. C] for manipulating this language.
S ::= |[ var v • S ]|    (variable block)
    | skip               (null statement)
    | …

Figure 1: Wide-spectrum modelling language for refinement [12]. Let S be a statement in the language; v a variable name; P and Q predicates; E an expression; B a boolean expression. Scoping brackets |[ ]| may be omitted when programs are displayed vertically [13, pp. 55–56].

High-level language statements
Our source language is an Ada-like sequential programming language, featuring the usual structured programming statements for assignment, sequence, choice and iteration. However, as shown in Figure 2, these constructs merely denote a distinguished subset of our modelling language.

Assembler instructions
Our target assembler language instructions act on a number of new variables introduced into the program, representing observable aspects of the target processor [15, p. 193]. Since our concern in this paper is with control flow, we are mainly interested in the program counter pc herein, introducing the other variables informally as needed.
As shown in Figure 3 we need only a few instructions, to load and store data values, perform unconditional jumps and conditional branches, do nothing, and evaluate expressions. Pseudo-instruction `eval' denotes evaluation of a HLL expression, with the result left in a register. Expression evaluation has been well explored elsewhere, and has no impact on control flow, so we refer the interested reader to previous work on formalising compilation of expressions for further detail [3, §7.3.2] [8, §8.2]. In effect, eval is a temporary, intermediate statement in our model, requiring further refinement to primitive assembler instructions, but we do not consider this herein. (The definition of eval in Figure 3 leaves the program counter value unspecified; in use below we are careful to always augment this statement with an explicit final program counter value.)


Creating assembler programs from instructions

So far our language models differ little from their predecessors. The challenge now is to devise suitable operators for composing the assembler instructions in Figure 3 to form complete assembler programs. Furthermore, to support refinement, the operators must work equally well on any statement in our modelling language, not just assembler instructions! There are two aspects to the problem: labelling instructions with the locations at which they will reside in memory, and composing sequences of instructions together. We satisfy these requirements through two new definitions.

Labelling statements
An assembler instruction performs its required function only when the program counter points to the particular location at which the instruction resides, otherwise the instruction "does" nothing [15, §6.3.1]. For instance, the behaviour of instruction s at location l could be modelled by the statement do pc = l → s od.
Instruction s executes only if the program counter currently points to it. The iteration operator allows for the possibility that s is a branch or jump instruction that returns control to location l!
Here we generalise this concept by allowing any statement in our modelling language to be associated with a set of memory locations.
Definition 1 (Labelling) A statement S can be labelled with a set of locations L using the labelling operator `:'.

L : S ≙ do pc ∈ L → S od □
The statement "executes" only while the program counter points to one of the locations. Otherwise the construct has no effect, i.e., behaves like skip. As illustrated below, allowing a set of labels is helpful during refinement because S may prove to be a compound statement that refines to several distinct instructions, residing in several memory locations.
As a syntactic convenience we allow singleton label sets to be written without braces: l : S ≙ {l} : S. Also, since label sets are usually contiguous in practice, we often write l1 .. l2 to denote all locations between l1 and l2, inclusive: l1 .. l2 ≙ {l | l1 ≤ l ∧ l ≤ l2}.
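As an informal illustration only (not part of the calculus), Definition 1's behaviour can be mimicked operationally. The following Python sketch, with hypothetical names of our own (`State`, `labelled`), models L : S as a loop that runs the body only while the program counter lies in the label set, and otherwise does nothing.

```python
# Minimal sketch of Definition 1: a statement labelled with a set of
# locations "executes" only while the program counter lies in that set.
# All names here (State, labelled, set_x) are illustrative, not the paper's.

class State(dict):
    """Program state: variable name -> value, including 'pc'."""

def labelled(labels, body):
    """Model  L : S  as  do pc in L -> S od."""
    def run(s):
        while s['pc'] in labels:
            body(s)
    return run

# A toy instruction at location 7 that sets x and advances the pc.
def set_x(s):
    s['x'] = 42
    s['pc'] += 1

stmt = labelled({7}, set_x)

s = State(pc=7, x=0)
stmt(s)                   # pc points at the instruction: it executes once
print(s['x'], s['pc'])    # -> 42 8

t = State(pc=3, x=0)
stmt(t)                   # pc points elsewhere: behaves like skip
print(t['x'], t['pc'])    # -> 0 3
```

Note how the while-loop also admits the case, mentioned above, of a branch instruction that returns control to its own location.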

Non-sequential statement composition
The next requirement is the ability to compose sequences of instructions together. The challenge is that a `sequence' of assembler instructions is not necessarily performed sequentially! Branches and jumps may cause instructions to be executed in an order totally different from their textual one.
We therefore introduce a "non-sequential" composition operator on labelled statements. To support subsequent refinement, the operator constructs a labelled statement.
Definition 2 (Non-sequential composition) Two labelled statements L1 : S1 and L2 : S2 can be composed using the non-sequential composition operator `⨟'.
(L1 : S1) ⨟ (L2 : S2) ≙ (L1 ∪ L2) : (L1 : S1 ; L2 : S2) □

It may seem curious that our `non-sequential' operator is defined in terms of the sequential composition operator `;'. Using Definition 1, we can express the definition of `⨟' in full as

do pc ∈ L1 ∪ L2 → (do pc ∈ L1 → S1 od ; do pc ∈ L2 → S2 od) od.

It thus gives preference to statement S1 when the program counter points into both L1 and L2.
However, in practice, we always expect L1 and L2 to be disjoint, i.e., L1 ∩ L2 = {}. In this case standard program transformation rules allow us to simplify this definition considerably, because choice between the two statements is always mutually exclusive. Thus, for disjoint L1 and L2,

(L1 : S1) ⨟ (L2 : S2) = do pc ∈ L1 → S1 [] pc ∈ L2 → S2 od.

In this context, therefore, `⨟' is `non-sequential' since it allows its operands to execute in either order, depending on the value of the program counter, and exhibits no bias towards either statement. It also allows repetition as long as the program counter points to either labelled statement. Finally, it is possible to combine the labelling and non-sequential composition operators in ways that are not useful. Given two distinct locations l1 and l2, then

{l1} : (l1 : S1 ⨟ l2 : S2) ≠ {l1, l2} : (l1 : S1 ⨟ l2 : S2).

If the program counter initially equals l2, the specification on the right will perform statement S2, but the one on the left will behave like skip. We consider a statement such as that on the left to be ill-formed.
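For disjoint label sets the operational reading is simple: repeatedly run whichever operand's label set currently contains the program counter. The following Python sketch (illustrative names of our own, disjointness assumed) shows that the textual order of the operands does not determine the execution order.

```python
# Sketch of Definition 2 for DISJOINT label sets: keep running whichever
# operand's label set currently contains the pc. Illustrative names only.

def compose(part1, part2):
    """Model  (L1 : S1) o (L2 : S2)  as a single labelled statement."""
    (l1, s1), (l2, s2) = part1, part2
    labels = l1 | l2
    def run(s):
        while s['pc'] in labels:       # outer label set L1 u L2
            if s['pc'] in l1:
                s1(s)
            else:                      # disjointness: pc must be in l2
                s2(s)
    return labels, run

# Two one-location "instructions"; textual order is NOT execution order.
def jump_to_10(s): s['pc'] = 10        # resides at location 11
def halt(s):       s['pc'] = 99        # resides at location 10

labels, prog = compose(({11}, jump_to_10), ({10}, halt))
s = {'pc': 11}
prog(s)           # 11 -> 10 -> 99: the jump executes first, then halt
print(s['pc'])    # -> 99
```

Starting with pc = 11 the second-listed statement is reached only via the first, exactly as the do-od expansion above prescribes.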
Definition 3 (Well-formedness) When labels are nested, the construct is well-formed only when all `inner' labels are contained within the `outer' label set. That is, a labelling L : (L1 : S1 ⨟ L2 : S2) is well-formed only when L1 ∪ L2 ⊆ L. □

Properties
The following properties of the labelling and non-sequential composition operators are frequently used below. Each can be readily proven by expanding the definitions into their guarded command language equivalents.
Property 1 (Associativity) Non-sequential composition is associative for disjoint sets of labels. If L1, L2 and L3 are all mutually disjoint, then

(L1 : S1 ⨟ L2 : S2) ⨟ L3 : S3 = L1 : S1 ⨟ (L2 : S2 ⨟ L3 : S3). □

In other words, the only way to begin executing code within S1 is via label l1. This is enforced by the initial assumption and the coercion following each statement. We are therefore free to assume that the program counter equals l1 immediately before S1 starts executing. (S1 may make use of labels in L1 other than l1 while it is executing, however.)


Refinement rules for compilation

In this section we define refinement rules for translating a `specification', expressed in the high-level language statements from Section 3.2, into an `implementation', expressed using the assembler instructions from Section 3.3 composed with the operators from Section 4. In effect, these refinement rules define a program compilation strategy.
The refinement goal is to construct an assembler program of the form l1 : i1 ⨟ l2 : i2 ⨟ … ⨟ ln : in, where i1 to in are individual assembler instructions, and l1, …, ln is a contiguous sequence of locations at which the instructions are placed. Since we usually do not know in advance how many instructions will be generated for each high-level language statement, we make frequent use of symbolic constants for labels, on the understanding that these will be instantiated with consecutive numbers once the refinement is complete.

Introduce machine-dependent constructs
Given a high-level language program S, the first step is to introduce new variables modelling machine-specific features. Since we are concerned only with control flow in this paper the only variable of interest here is the program counter. In general, however, the register and memory variables should also be declared, along with a symbol-table relation for associating HLL variables with assembler-level memory locations.
Law 1 (Introduce assembler variables) If variable pc is fresh, and i and f are location-valued constants such that i < f, then

S ⊑ |[ var pc •
      pc := i ;
      i .. f − 1 : S ; [pc = f] ]| □

Here the program counter is set to some initial value i, and f is its final one [11]. Thus the assembler code generated for statement S will reside in locations i up to f − 1, inclusive. Since S did not previously refer to the program counter, we have used a coercion [12, §17.3] to introduce the new requirement that S leaves pc equal to f. All subsequent refinement laws follow this template of explicitly stating the initial and final program counter values [11].

Compiling assignment
An assignment statement must first evaluate the expression E, and then store the result at the address associated with the target variable v. Let register x be one that is currently `unallocated'; in a full compilation strategy this variable would need to be formally declared and its status maintained. Similarly, let addr be a symbol-table lookup function that returns the memory location associated with each HLL variable; in a full compilation strategy this relation would be extended on entering a scope and retracted on exit.
Law 2 (Compile assignment) If i < f − 1, then

{pc = i} ; v := E ; [pc = f]
⊑ i .. f − 2 : eval reg x, E ; [pc = f − 1]
⨟ f − 1 : store addr(v), reg x □

By definition (Figure 3) the final store instruction increments the program counter, so placing it at location f − 1 ensures that the coercion to leave pc equal to f will be satisfied. Similarly, the coercion that constrains the eval instruction to leave the program counter equal to f − 1 ensures that the store instruction will be executed immediately after E has been evaluated. As mentioned in Section 3.3, we do not know how many actual assembler instructions will be required to implement our temporary eval instruction, so we reserve a range of locations (from i to f − 2) for it. In a complete compilation strategy this pseudo-instruction would typically be further refined to a number of load instructions to fetch the operands into registers, followed by arithmetic and logical operations that leave the final result in register x. (If the range of locations is too small for the number of instructions required, then refinement of eval to primitive assembler instructions will be impossible. This is why it is best to avoid allocating absolute labels until refinement is complete.)
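To make the shape of Law 2's result concrete, here is a toy Python execution of the two-instruction pattern. It is our illustration only: we assume eval occupies a single location, take E to be a + b, and invent the `run_assignment`, `mem` and `reg` names.

```python
# Sketch of the code shape Law 2 produces for  v := E, assuming an eval
# pseudo-instruction occupying one location. All names are illustrative.

def run_assignment(mem, i):
    """Execute  i : eval reg, E  then  i+1 : store addr(v), reg."""
    pc, reg = i, None
    while pc in (i, i + 1):
        if pc == i:                    # eval: compute E into a register
            reg = mem['a'] + mem['b']  #   (E taken to be a + b here)
            pc = i + 1                 #   leave pc = f - 1 per the coercion
        else:                          # store: write the register to v
            mem['v'] = reg
            pc += 1                    #   store increments the pc (Fig. 3)
    return mem, pc

mem, pc = run_assignment({'a': 2, 'b': 3}, 100)
print(mem['v'], pc)   # -> 5 102
```

The final pc value (location f = 102 for i = 100) is exactly what the coercion [pc = f] in the law demands.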

Compiling sequential composition
Compiling sequential composition of two HLL statements merely involves allocating consecutive memory blocks for their respective assembler instruction implementations.
Law 3 (Compile sequence) If i < m and m < f, then

{pc = i} ; (S1 ; S2) ; [pc = f]
⊑ i .. m − 1 : S1 ; [pc = m]
⨟ m .. f − 1 : S2 ; [pc = f] □

The two statements are essentially unchanged, but are augmented with coercions requiring them to update the program counter in such a way that they will be executed in the correct sequence.

Compiling choice
Implementing a conditional statement involves first evaluating the boolean expression and then branching to the appropriate alternative. Care must be taken to ensure that when the chosen alternative finishes the construct exits correctly.

Law 4 (Compile choice) If i < j − 1, j < k − 1 and k < f, then

{pc = i} ; if B then S1 else S2 end if ; [pc = f]
⊑ i .. j − 2 : eval reg x, B ; [pc = j − 1]
⨟ j − 1 : brfalse reg x, k
⨟ j .. k − 2 : S1 ; [pc = k − 1]
⨟ k − 1 : jump f
⨟ k .. f − 1 : S2 ; [pc = f] □

Here the code to evaluate B has been placed at location i, that to implement S1 at location j, and that for S2 at location k. The brfalse instruction changes control to S2 if B evaluates to false, otherwise S1 executes. The jump instruction after S1 exits the whole construct after S1 terminates.

Compiling iteration
Compiling iterative statements involves generating instructions to repeatedly evaluate the boolean expression, and execute the statement if the expression is true, otherwise exit the construct. The particular compilation strategy used below places the code to evaluate the expression after the statement. This strategy generally produces more efficient code, saving one jump instruction on all iterations after the first.
Law 5 (Compile iteration) If i < j − 1 and j < f − 1, then

{pc = i} ; while B loop S endloop ; [pc = f]
⊑ i : jump j
⨟ i + 1 .. j − 1 : S ; [pc = j]
⨟ j .. f − 2 : eval reg x, B ; [pc = f − 1]
⨟ f − 1 : brtrue reg x, i + 1 □

Here the code to evaluate B is placed at location j, and that for statement S at location i + 1. Initially the jump changes control to evaluate the expression for the first time. If B is true the brtrue instruction then changes control to execute S. Once S finishes the expression is re-evaluated.
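As an informal illustration (not part of the calculus), the execution order induced by Law 5's layout can be traced with a small Python machine. The names and the location arithmetic below are our own simplifying assumptions: the body and the guard evaluation each occupy exactly one location.

```python
# Toy trace of the layout produced by Law 5 (loop test placed after the
# body). Locations: i -> jump j;  i+1..j-1 -> body;  j -> eval B;
# f-1 -> brtrue back to i+1. Names and layout widths are ours.

def run_loop(state, i, body, guard):
    j = i + 2                 # assume the body occupies one location
    f = j + 2
    pc, reg, trace = i, None, []
    while i <= pc < f:
        trace.append(pc)
        if pc == i:           # jump j: evaluate the guard first
            pc = j
        elif pc == i + 1:     # the loop body S
            body(state); pc = j
        elif pc == j:         # eval reg, B
            reg = guard(state); pc = f - 1
        else:                 # brtrue reg, i+1
            pc = i + 1 if reg else f
    return trace

state = {'n': 2}
trace = run_loop(state,
                 i=0,
                 body=lambda s: s.update(n=s['n'] - 1),
                 guard=lambda s: s['n'] > 0)
print(trace)   # -> [0, 2, 3, 1, 2, 3, 1, 2, 3]
```

The trace shows the saving claimed above: the initial jump at location 0 executes exactly once, after which each iteration falls directly from the body into the guard evaluation.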

Example
As a simple example, we consider a code fragment that calculates the remainder of integer division by repeated subtraction. Let p be the dividend, q the divisor and r the required remainder. The high-level language program is as follows.
{p > 0 ∧ q > 0} ;
r := p ;
while r ≥ q loop r := r − q endloop

The assumption on the first line documents knowledge about the initial values of p and q. It is not used in the following `compilation' [12, p. 15].
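Before compiling, the fragment's behaviour can be checked directly. The following Python transcription is ours; note that the loop guard must be read non-strictly (r ≥ q) for the result to be a true remainder satisfying 0 ≤ r < q.

```python
# Direct transcription of the HLL remainder fragment. The loop guard is
# non-strict (r >= q) so that the result satisfies 0 <= r < q.

def remainder(p, q):
    assert p > 0 and q > 0   # the documented initial assumption
    r = p
    while r >= q:
        r = r - q
    return r

print(remainder(17, 5))   # -> 2
print(remainder(6, 3))    # -> 0
```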
Firstly we introduce assembler-level constructs to the executable statements.
⊑ `by Law 1'
|[ var pc •
   pc := a ;
   a .. d − 1 : (r := p ; while r ≥ q loop r := r − q endloop) ; [pc = d] ]|

Symbolic constants a and d represent the initial and final program counter values. Thus this block of code has been allocated to instruction memory locations a .. d − 1. The proviso on Law 1 introduces the constraint that a < d.

Next the sequential composition operator can be compiled, introducing a new location b, with provisos a < b and b < d.

statement labelled a .. d − 1
⊑ `by introducing {pc = a} [12, p. 183]; Law 3'
a .. b − 1 : r := p ; [pc = b]
⨟ b .. d − 1 : (while r ≥ q loop r := r − q endloop) ; [pc = d]

The first assignment is then readily compiled to instructions in our target assembler language, introducing proviso a < b − 1.

statement labelled a .. b − 1
⊑ `by Law 2'
a .. b − 2 : eval reg 1, p ; [pc = b − 1]
⨟ b − 1 : store addr(r), reg 1

The iteration is compiled next, with provisos b < c − 1 and c < d − 1.

statement labelled b .. d − 1
⊑ `by Law 5'
b : jump c
⨟ b + 1 .. c − 1 : r := r − q ; [pc = c]
⨟ c .. d − 2 : eval reg 2, r ≥ q ; [pc = d − 1]
⨟ d − 1 : brtrue reg 2, b + 1

The assignment statement within the loop is then easily compiled, provided b < c − 2.

statement labelled b + 1 .. c − 1
⊑ `by Property 5; Law 2'
b + 1 .. c − 2 : eval reg 3, r − q ; [pc = c − 1]
⨟ c − 1 : store addr(r), reg 3

(In this case the ultimate implementation of eval needs to ensure that the values of both r and q are in registers, and perform the subtraction, leaving the result in register 3.)

Putting these steps together yields the final assembler program, and we can instantiate the symbolic instruction memory location constants with particular values. For the purposes of illustration, assume each eval `instruction' occupies exactly one location, and let `eval' denote an eval statement that increments the program counter. Combining the provisos accumulated above requires that a < b − 1, b < c − 2 and c < d − 1. Then letting a = 1, b = 3, c = 6 and d = 8, and flattening the nested labels, yields the final `compiled' code.

1 : eval reg 1, p
2 : store addr(r), reg 1
3 : jump 6
4 : eval reg 3, r − q
5 : store addr(r), reg 3
6 : eval reg 2, r ≥ q
7 : brtrue reg 2, 4


Future work

Given this paper's focus on control flow only, we noted above several informal aspects of our model that need fleshing out. The assembler-level register and memory variables, and the compiler-level symbol table relation, should be formally declared and their status maintained as part of the refinement rules. Keeping track of these variables would support many useful optimisations. For instance, we can easily envisage a variant of Law 2 that uses the knowledge that the required expression value is already available in some register x.

Law 6 (Compile assignment of known value)

{pc = i ∧ reg x = E} ; v := E ; [pc = i + 1]
⊑ i : store addr(v), reg x □

(This also overcomes another weakness of the laws above in that they require every statement to occupy at least one memory location; statements logically equivalent to skip currently compile to unnecessary noop instructions.) Similarly, Law 2 is rather naive in that it always stores the result in memory. Instructions 2 and 5 in the example in Section 6 could potentially be eliminated, and replaced with a single store instruction at the end of the code fragment, if we knew when HLL variables must be observably updated in memory.
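As a check on the control flow of the example's compiled code (under the assumed allocation a = 1, b = 3, c = 6, d = 8, with each eval occupying one location), the seven-instruction program can be run on a toy machine. The register and memory encoding below is purely illustrative.

```python
# The compiled remainder program (locations 1..7) run on a toy machine.
# Registers and memory are modelled as dictionaries; encoding is ours.

def run(p, q):
    reg = {}
    mem = {'r': None}
    pc = 1
    while 1 <= pc <= 7:
        if pc == 1:   reg[1] = p;              pc = 2   # eval reg 1, p
        elif pc == 2: mem['r'] = reg[1];       pc = 3   # store addr(r), reg 1
        elif pc == 3:                          pc = 6   # jump 6
        elif pc == 4: reg[3] = mem['r'] - q;   pc = 5   # eval reg 3, r - q
        elif pc == 5: mem['r'] = reg[3];       pc = 6   # store addr(r), reg 3
        elif pc == 6: reg[2] = mem['r'] >= q;  pc = 7   # eval reg 2, r >= q
        elif pc == 7: pc = 4 if reg[2] else 8           # brtrue reg 2, 4
    return mem['r']

print(run(17, 5))   # -> 2
```

Running the machine confirms that control falls out of the loop with r holding the remainder, and makes visible the two store instructions (locations 2 and 5) whose elimination is discussed above.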
Our overall goal is to develop trustworthy compilation techniques for real-time programs. Elsewhere we have proposed a real-time refinement formalism which makes `time' an integral part of the refinement laws [6, 17, 7]. Using this real-time calculus instead of the standard refinement calculus will allow us to consider compilation of time-sensitive HLL statements such as `clock' functions and delay statements.

Conclusion
We have shown how compilation of high-level language programs to assembler code can be modelled in the standard refinement calculus. To achieve this we introduced two new operators for composing assembler programs from individual instructions. The innovation of allowing any statement to be labelled with a set of locations enabled an elegant compilation-as-refinement methodology.

Figure 2: High-level language statement definitions. The HLL language statements on the left are equivalent to the underlying model on the right.

We can therefore omit bracketing around the non-sequential composition operator. (We have also assumed that `:' binds more tightly than `⨟', but more loosely than other modelling language constructs.)

Property 2 (Commutativity) The order of labelled instructions composed using non-sequential composition is irrelevant for disjoint sets of labels. If L1 and L2 are disjoint, then

L1 : S1 ⨟ L2 : S2 = L2 : S2 ⨟ L1 : S1. □

Redundant `outer' labels can be removed, once all statements are explicitly labelled.
An obvious extension to this work is to consider other high-level language statements and constructs. For instance, it is trivial to devise a compilation law for an if statement with no else part, using Law 4 as a guide. More challenging, though, are constructs such as subroutine calls, and concurrency and communication statements.