Integrating Multithreading into the Spineless Tagless G-machine

To reduce the adverse effects of long latency communication operations in distributed implementations of the Spineless Tagless G-machine (STGM), a variant of the original abstract machine is introduced that contains explicit support for multithreading. In particular, source-to-source transformations can be used at the level of the abstract machine code to improve tolerance of long latency communication.
 
The changes to the original STG-language include a separation of demand from case selection together with the introduction of a new construct that provides an abstract notion of thread boundaries and thread synchronization.


Introduction
A static mapping of the components of a parallel program to the physical processor elements of a parallel computer, such that these components communicate in a regular fashion, is only possible for restricted programs or for programs that contain explicit layout information, e.g., [Kel89,DP93]. In contrast, distributed implementations of general lazy functional programs issue demands for remote data in a dynamic and unpredictable fashion; the resulting long latency communication operations have adverse effects on distributed implementations of abstract machines, such as the Spineless Tagless G-machine (STGM) [PS88,Pey92] with its implementations for GRIP [PCS89] and GUM [THJ+96].
The subject of this paper is a variant of the STGM that is designed to reduce the impact of long latency communication on the execution time; to this end, it exploits the inherent fine-grain parallelism contained in functional programs by employing multithreading and supporting stateless threads. A special feature of the new abstract machine, called STGM_MT, is that source-to-source transformations can be used at the level of the abstract machine code to improve tolerance of long latency communication.
Section 2 introduces the basic ideas underlying the development of the STGM_MT. Section 3 describes the changes to the original machine language of the STGM, and Section 4 demonstrates the use of the new constructs. Then, in Section 5, the operational semantics of the new machine language is presented. Related work is discussed in Section 6, and Section 7 contains the conclusions.


The Use of Multithreading

We assume a distributed implementation of the STGM where, on each of the many processor elements (PEs), a number of tasks run interleaved. When the active task of one of these processors attempts the evaluation of the expression (a + b) * (c + d), it may start by accessing the value of the variable a. If the value of a is stored in the memory of another PE, a remote access is triggered (by entering a FETCHME closure; cf. [Pey92]). When such a long latency communication operation stalls the progress of a task, it is common practice to suspend that task and, instead, to execute some other, runnable task. Such a task switch includes an expensive context switch where the current values of some processor registers have to be saved and other values have to be loaded; furthermore, the data in the processor cache is probably useless for the computation performed by the newly selected task.
Returning to the example, the observation crucial for the techniques developed in this paper is that the evaluation of the subexpression c + d (i) is independent of the result of the remote access induced by a, and (ii) can be achieved at a lower context-switching cost than switching to an arbitrary runnable task. The context needed to evaluate c + d is similar to the context needed to evaluate a + b; in particular, both computations are part of the same function activation.
Across function boundaries, a similar observation can be made. When the result of a function activation depends on a long latency operation and there are no more computations in this function activation that are independent of the long latency operation, then it is usually less costly to switch to the execution of some independent computation in the calling function than to activate a completely unrelated task (again, consider the processor cache).
Overall, when the delay of a long latency operation has to be covered by some independent computation, a computation should be chosen whose context is as close to the current context as possible. Such a computation can be chosen by the heuristic that computations that would have been executed shortly after the current computation are closely related in their context, an assumption justified by the locality that computations usually exhibit.
The fine-grain parallelism inherent in functional programs can be used to mechanically partition a program into threads realizing independent computations. Within a function activation, long latency operations can then be covered by switching to any of the ready threads. The abstract machine language described below provides a construct that explicitly displays independent computations at the level of the abstract machine language. These independent computations can be implemented by stateless threads [CGSE93,EAL93]. The crucial feature of stateless threads is that they never suspend, i.e., they start only when all resource requirements are satisfied and execute to completion without any further need for synchronization; in essence, they represent the smallest unit of non-synchronizing computation. There is evidence that the use of stateless threads minimizes the thread switching time while simultaneously allowing the properties of the memory hierarchy to be exploited [CGSE93,H+95].
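To make this discipline concrete, the following Haskell sketch models a stateless thread as a counter of pending inputs plus a body that, once enabled, runs to completion (the names Thread, pending, body, and deliver are our own illustration and are not taken from the cited machines):

  import Data.IORef

  -- A stateless thread: it may start only once all of its inputs are
  -- available; afterwards it runs to completion and never suspends.
  data Thread = Thread
    { pending :: IORef Int   -- number of inputs still missing
    , body    :: IO ()       -- runs exactly once, when pending reaches 0
    }

  -- Deliver one input; the thread is started precisely when the last
  -- outstanding input arrives, so no further synchronization is needed.
  deliver :: Thread -> IO ()
  deliver t = do
    n <- atomicModifyIORef' (pending t) (\n -> (n - 1, n - 1))
    if n == 0 then body t else pure ()

Delivering the last outstanding input is the only synchronization point; this is what keeps thread switching cheap in the schemes cited above.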

The Abstract Machine Language
Starting from the STG-language as described in [Pey92], two principal changes are required to integrate support for multithreading: demand must be separated from case selection, and an abstract notion of thread boundaries and thread synchronization is needed. In the following, a third modification is also applied: an abstract notion of distribution is added in order to be able to observe the effects of multithreading at the abstract level of the STG-language. The variant of the STG-language that is defined in this paper is called the STG_MT-language.
To focus the following presentation on issues relevant to multithreading, support for built-in data types, such as integers and floating-point numbers, is omitted; they can be handled in a similar way as in the original STGM, namely by an explicit treatment of unboxed values. Furthermore, the explicit top level, which contains the global definitions, is omitted; here, too, the mechanisms of the STGM can still be applied in the STGM_MT.

The Grammar
A definition of the grammar of the STG_MT-language can be found in Figure 1.
In comparison to the original STG-language, note the addition of letrem and letpar, and the fact that in a case expression the keywords case and of enclose an identifier and not an arbitrary expression. An intuition of the behaviour of the added or changed constructs is provided in the following subsections.

An Abstract Notion of Threads
In the original STGM, there is only a single kind of binding: it associates variable names with lambda forms. In the STGM_MT, these bindings are called function bindings and are produced by the nonterminal fbind. In addition, value bindings, produced by the nonterminal vbind, are introduced in the STGM_MT.
Value bindings occur only in the letpar construct, which has the following general form:

  letpar v_1# = e_1; ...; v_n# = e_n in e

In contrast to the letrec construct (cf. [Pey92]), no closures are created; instead, the expressions e_1 to e_n are evaluated, and only after all results have been assigned to the v_i# does evaluation proceed with e. Furthermore, the v_i# may not occur free in e_1 to e_n.
The last restriction guarantees the independence that was required in Section 2 for computations that may be used to cover long latency operations. More precisely, it allows e_1 to e_n to be evaluated in an arbitrary order without any need to synchronize on the v_i#. Should the evaluation of any e_i suspend due to a remote access, then it is still possible to continue the computation locally with any e_j where j ≠ i. In short, letpar makes it possible to express the independence of local computations at the level of the abstract machine language. Furthermore, the fact that the evaluation of the body expression e must wait for the delivery of all v_i# can be seen as an abstract form of synchronization barrier.

The hash marks (#) behind the v_i indicate that the v_i# store unboxed values. The treatment of unboxed values in the STGM_MT is related to, but not identical to, the use of unboxed values in the original STGM. In particular, in the original STGM, which follows [JL91], all types of boxed and unboxed values have to be introduced explicitly, whereas in the STGM_MT there is implicitly a corresponding unboxed value for each boxed value. The coercion from boxed to unboxed types and from unboxed to boxed types is made explicit by value bindings and function bindings, respectively. For example, the expression letpar v# = w in ... binds to v# the unboxed value associated with the boxed value stored in w.
Conversely, letrec w = {} \n {} -> v# in ... boxes the unboxed value contained in v#. The deviation from the technique used in the STGM becomes necessary because, in the STGM_MT, demand for evaluation is issued when an expression occurs in the right-hand side of a value binding, while, in the STGM, the value of an expression is only demanded when it is scrutinized by a case expression.
Note that unboxed variables are not allowed in the list of free variables of a function binding, i.e., only boxed values can be stored in the environment of closures, and that it is forbidden to use unboxed variables as arguments to constructors. These restrictions can be relaxed, but they are enforced here to simplify the presentation.
Furthermore, the expressions appearing as right-hand sides of value bindings must not be of functional type, i.e., must not be of type σ → τ. This restriction corresponds to the restriction of the original STGM that case expressions must not inspect values of functional type.
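The independence guarantee of letpar can be made concrete with a minimal Haskell evaluator for a toy expression language (Expr, Env, and eval are our own illustration; unboxed values are modelled as plain Int): because the bound variables may not occur free in the right-hand sides, every binding is evaluated in the outer environment, so any evaluation order yields the same result.

  import qualified Data.Map as M

  data Expr = Lit Int | Var String | Add Expr Expr
            | LetPar [(String, Expr)] Expr  -- letpar v_1# = e_1; ...; v_n# = e_n in e

  type Env = M.Map String Int

  eval :: Env -> Expr -> Int
  eval _   (Lit n)   = n
  eval env (Var v)   = env M.! v
  eval env (Add l r) = eval env l + eval env r
  eval env (LetPar bs e) =
    -- Every right-hand side is evaluated in the *outer* environment env,
    -- so the order of the bindings is irrelevant; here: left to right.
    let env' = foldr (\(v, rhs) acc -> M.insert v (eval env rhs) acc) env bs
    in eval env' e

For example, eval M.empty (LetPar [("x#", Lit 1), ("y#", Lit 2)] (Add (Var "x#") (Var "y#"))) yields 3, independently of which binding is evaluated first.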

Selection Without Demand
In the original STGM, case expressions play two roles: first, they demand the evaluation of the scrutinized expression, i.e., the expression between the keywords case and of; second, they select one of several alternatives by matching the value of the scrutinized expression against the patterns of the alternatives. As mentioned above, value bindings issue demands for evaluation in the STGM_MT; hence, the single purpose of case expressions is pattern matching. Overall, we have the following correspondence:

  Original STGM:  case e of alt_1; ...; alt_n; dft
  STGM_MT:        letpar v# = e in case v# of alt_1; ...; alt_n; dft        (1)

An Abstract Notion of Distribution
In the original presentation of the STGM [Pey92], the potential distribution of the heap of the abstract machine over multiple processing elements is left implicit. To make the need for long latency operations explicit, we expose the potential for distribution in the STGM_MT.
To this end, the concept of a machine instance is introduced. Each machine instance has a local heap and is able to evaluate closures in its local heap independently of the other instances. When the local evaluation depends on a closure stored within another machine instance (we call this a remote closure), a long latency operation is triggered. At the level of the abstract machine code, no assumptions are made about the number of machine instances available.
The letrem construct specifies those closures that may be allocated remotely. In contrast, the closures associated with the function bindings of a letrec are bound to be allocated locally. To simplify matters, there may only be one binding in a letrem, and it must not be recursive; recursion can be introduced by using a letrec in the right-hand side of the single function binding of the letrem. Furthermore, the binding of a letrem must not have any arguments, i.e., it has to represent a nullary function.
Overall, the STG_MT-language allows, and even requires, the partitioning of a parallel program to be specified explicitly, but it abstracts over its mapping (cf. [Fos95] for a definition of these notions).

Using the New Constructs
In summary, the STG_MT-language modifies the original STG-language in three ways: it introduces an explicit, but abstract, notion of (i) local, independent computations (letpar) and (ii) closures that may be allocated on a remote instance (letrem); and (iii) it separates demand from selection. These modifications lead to a number of interesting properties that are discussed in the following.

Generating code for the STGM_MT
The translation of a functional program into the STG_MT-language corresponds closely to the generation of code for the original STGM. The main difference is that we have to observe the correspondence stated in Equation (1); case expressions that are used only for unboxing require no case in the STGM_MT, but only a letpar. In contrast to letpar expressions, letrem constructs are not expected to be generated automatically; instead, they are generated from explicit annotations, i.e., the programmer decides which computations are coarse-grained enough to be worth shipping to another processor element.

Covering Long Latency Operations
Following the stated scheme for code generation, only letpar constructs containing a single value binding are generated. Such code does not exhibit any tolerance to long latency operations. An important characteristic of the STG_MT-language is that simple source-to-source transformations can be used to increase this tolerance. In particular, we can apply the following transformation rule:

  letpar v_1# = e_1 in letpar v_2# = e_2 in e
    ==>  letpar v_1# = e_1; v_2# = e_2 in e                                 (2)

when v_1# is not free in e_2. In the case of the example from Section 2, (a + b) * (c + d), the transformation rule (2) has a dramatic effect on the corresponding STGM_MT code; both versions are sketched after this paragraph. The untransformed code is similar to the case cascade used to represent this computation in the original STGM. The transformed code explicitly represents the independence of those subcomputations that can be used to cover long latency operations. In particular, when demanding a triggers a remote access, the demand of b as well as the demands of c and d, together with the evaluation of add# [cv#, dv#], can be carried out while waiting for the value av#. In the worst case, when all data is remote, at least the accesses to a, b, c, and d are overlapped. In essence, the transformed code is a textual representation of the partial ordering induced on the code by the data dependencies.
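One plausible rendering of the two code versions in STG_MT notation is the following (the variable names av#, bv#, cv#, dv#, s1#, s2# and the unboxed primitives add# and mul# are assumptions of this sketch). Before the transformation, every demand is issued by its own letpar:

  letpar av# = a in
  letpar bv# = b in
  letpar cv# = c in
  letpar dv# = d in
  letpar s1# = add# [av#, bv#] in
  letpar s2# = add# [cv#, dv#] in
  mul# [s1#, s2#]

After collecting independent demands by rule (2) and similar regrouping, the independence becomes explicit:

  letpar av# = a;
         bv# = b;
         s2# = letpar cv# = c; dv# = d in add# [cv#, dv#]
  in letpar s1# = add# [av#, bv#]
     in mul# [s1#, s2#]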
Overall, the separation of demand and selection allows demand, i.e., value bindings, to be moved outwards in order to collect multiple value bindings in a single letpar, as in the following code fragment (the y_i are not free in e_2):

  letpar x# = e_1
  in case x# of
       C {y_1, ..., y_n} -> letpar z# = e_2
                            in case z# of ...

    ==>

  letpar x# = e_1; z# = e_2
  in case x# of
       C {y_1, ..., y_n} -> case z# of ...

Apart from data dependencies, the outward movement of value bindings is stopped by case expressions with more than one alternative; moving a value binding over a case with multiple alternatives can change the termination behaviour of a program.
In principle, function boundaries also stop the outward movement, but this should not become a problem in practice, for the following reason: either the function is very simple, in which case it can be inlined; or it is complex, in which case it usually contains a case with multiple alternatives, which hinders the outward movement anyway.

Distribution
The essentials of a parallel program exploiting pipelined parallelism are displayed by the program fragment sketched below. The code for consume is already transformed; the original code would place the expression consume {xs} into the function binding of a separate letrec, but the immediately following occurrence of the bound variable in the right-hand side of a value binding allows the transformation into the shown code.
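A plausible rendering of the fragment in STG_MT notation is the following; the names produce, stream, consume, combine#, and the list constructors Cons and Nil are assumptions of this sketch, and only the letrem/letpar structure matters:

  letrec consume = {consume} \n {xs} ->
           letpar xsv# = xs                    -- demand the next list cell
           in case xsv# of
                Cons {x, rest} ->
                  letpar xv# = x;              -- possibly remote access to x ...
                         r#  = consume {rest}  -- ... covered by this sibling binding
                  in combine# [xv#, r#]
                Nil {} -> ...
  in letrem stream = {} \u {} -> produce {}    -- may be allocated remotely
     in consume {stream}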
If there are machine instances that need additional work, then closures created with letrem can be shipped to those instances; otherwise, they can also be allocated and evaluated locally; the latter case corresponds to the idea of the absorption of previously sparked children [HMP94]. In the above example, let us assume that stream is allocated remotely. Then, the value of stream and consume {stream} are evaluated on different instances, in parallel; any closures created with letrec while evaluating stream are also allocated, and thus evaluated, on the remote instance. This implies that the access to x in the body of consume triggers a remote access, which is, at least partially, covered by the recursive call to consume (in the same letpar).

The Meaning
To formalize the operational semantics of the STG_MT-language, a transition system is presented in this section; it is derived from the system in [Pey92] and makes the effects of multithreading explicit at the abstract level of the STG_MT-language. The notation used in this section is similar to that used in [Pey92]; details are provided in Appendix A.

Machine Configurations
A machine configuration is represented by a mapping I from names to machine instances. Each instance consists of several components, including a code component, a task pool, an argument stack, a return stack, and a heap. A detailed description of these data structures is provided in Appendix B.
The machine instances in a configuration share a global name space, but the computations within one instance i may only access the components of i. When i needs to access a closure, named o, that is located in the heap of another instance j, it has to request j to evaluate the closure o and to return the WHNF of o back to i. This operation is the single form of long latency operation in the STGM_MT.
In the transition rules, we assume an unbounded number of instances, and each closure allocated by a letrem is created on a not-yet-used instance. This exposes the maximal parallelism of the program. In a concrete implementation, the scheme outlined in Section 4.3 is used, i.e., closures are only distributed upon request from processing elements with an insufficient amount of work.
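The shape of configurations can be sketched in Haskell as follows (all type and field names are our own modelling vocabulary, chosen after the description above and the definitions in Appendix B):

  import qualified Data.Map as M

  type IName = Int   -- instance names
  type HName = Int   -- heap (closure) names
  type FName = Int   -- frame names
  type DName = Int   -- dissemination slot names

  data Value   = BoxedRef HName | UnboxedVal Int
  data Code    = CodeForm String                -- placeholder for Section 5.2's code forms
  data Closure = Closure { environ :: [Value], enterCode :: Code }
  data Frame   = Frame { missing :: Int, locals :: M.Map String Value }
  data DEntry  = Store FName String | Upd HName | Fwd DName | RetTo IName DName
  type Cont    = (DName, Code, [Value])         -- a return continuation

  -- One machine instance; each component matches the description above.
  data Instance = Instance
    { current  :: Code                  -- the code form being executed
    , taskPool :: [(Code, Cont)]        -- waiting and ready-to-run tasks
    , frames   :: M.Map FName Frame
    , argStack :: [Value]
    , retStack :: [Cont]
    , heap     :: M.Map HName Closure   -- the instance-local heap
    , dissem   :: M.Map DName [DEntry]
    }

  -- A configuration maps instance names to instances; the name space is
  -- global, but an instance may only touch its own components directly.
  type Config = M.Map IName Instance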

The Transition Rules
The following transition rules affect either one or two instances at a time. To obtain a parallel, and not merely an interleaving, semantics, we define a parallel transition step to be a set of applications of transition rules such that the set contains at least one element and no instance is affected by more than one transition rule. A transition rule is said to affect an instance if this instance occurs in the rule's pre- or postcondition.
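This notion can be stated compactly in Haskell (a sketch with our own names): abstract each rule application to the set of instances it affects and require pairwise disjointness.

  import qualified Data.Set as S
  import Data.List (tails)

  type IName = Int

  -- A rule application, abstracted to the set of instances occurring in
  -- its pre- or postcondition.
  type RuleApp = S.Set IName

  -- A valid parallel transition step: a non-empty set of applications in
  -- which no instance is affected by more than one rule.
  validParallelStep :: [RuleApp] -> Bool
  validParallelStep apps =
    not (null apps)
      && and [ S.null (a `S.intersection` b)
             | (a:rest) <- tails apps, b <- rest ]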

The Initial State
The initial machine state used to evaluate an expression e consists of a single instance named i. Its current task is to evaluate e within an empty environment. The task pool is empty (i.e., there is no further work), as are the frame map and the argument stack. The single continuation on the return stack indicates that the result of e has to be delivered via the (non-existent) slot d of the dissemination map.
The machine terminates when it attempts to distribute some value over the dissemination slot d; this is the value computed for e.
Intuitively, the roles of the components of an instance are as follows. The task pool contains the tasks that have to be executed on this instance, both those waiting for the completion of a long latency operation and those ready to run. At this point, it is important to clearly distinguish between tasks and threads. Tasks are unrelated, coarse-grain computations that are distributed over the machine instances to gain speedup by parallel evaluation; they are indirectly introduced by the letrem construct. Threads are fine-grain computations that are used to efficiently cover long latency operations; they are clustered into closely related groups represented by the letpar construct. Only when a task contains no more ready-to-run threads and is still waiting for a long latency operation is it suspended and placed in the task pool. Every distributed implementation of the STGM uses tasks, but threads are the distinctive feature of the STGM_MT.
For every letpar construct that is executed, a frame is created; it contains a counter storing the number of value bindings that have not yet been completed and the local environment used to store their values. The argument stack has the same function as in the original STGM, but the return stack assumes the functionality of both the return stack and the update stack of the STGM; this is necessary to deal correctly with updates of closures whose evaluation triggered a long latency operation. The heap is used to store closures, just as in the original STGM. Finally, the dissemination map supports the dissemination of the results of long latency operations to multiple receivers.
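The counting discipline of frames amounts to a few lines of Haskell (a sketch; the names are ours): delivering the value for one binding stores it in the frame's environment and decrements the counter, and the body of the letpar is enabled once the counter reaches zero.

  import qualified Data.Map as M

  type Value = Int   -- stands in for an unboxed value

  -- A frame created for one letpar: the number of still-missing values
  -- and the local environment collecting the delivered ones.
  data Frame = Frame { missing :: Int, locals :: M.Map String Value }

  -- Create the frame for a letpar with n value bindings.
  newFrame :: Int -> Frame
  newFrame n = Frame n M.empty

  -- Deliver the value for the binding named x (cf. the Store entries of
  -- the dissemination map): store it and decrement the counter.
  deliver :: String -> Value -> Frame -> Frame
  deliver x v (Frame n env) = Frame (n - 1) (M.insert x v env)

  -- The body of the letpar may run once no value is missing.
  bodyEnabled :: Frame -> Bool
  bodyEnabled f = missing f == 0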

Applications
Executing the application of a function to some arguments pushes the arguments onto the stack and enters the closure that represents the function. In contrast, the application of a data constructor initiates a return operation:

  I[i ↦ ⟨Eval (f {x_N}) ρ, τ, fs, as, rs, h, ds⟩]
  (1) ⟹ I[i ↦ ⟨Enter (ρ f), τ, fs, ρ[x_N] ++ as, rs, h, ds⟩]

  I[i ↦ ⟨Eval (c {x_N}) ρ, τ, fs, as, rs, h, ds⟩]
  (2) ⟹ I[i ↦ ⟨RetTerm ⟨c, ρ[x_N]⟩, τ, fs, as, rs, h, ds⟩]

We use [·] to indicate repetition; e.g., [x_N] stands for x_1, ..., x_N. Evaluating an unboxed variable returns the unboxed value represented by this variable in the local environment ρ:

  I[i ↦ ⟨Eval x# ρ, τ, fs, as, rs, h, ds⟩]
  ⟹ I[i ↦ ⟨RetTerm (ρ x), τ, fs, as, rs, h, ds⟩]

Entering a Closure
A non-updatable closure is entered by evaluating its code under an environment built from the closure's free variables and the appropriate number of parameters from the argument stack. The body of the closure is a function that, applied to the environment, yields the code form that has to be executed. The environment is constructed by taking length xs arguments from the stack and associating the free variables vs with the environment eos of the closure:

  I[i ↦ ⟨Enter o, τ, fs, as_xs ++ as, rs, h[o ↦ ⟨(vs \n xs -> e), eos⟩], ds⟩]
    when length as_xs = length xs
  ⟹ I[i ↦ ⟨Eval e [vs ↦ eos, xs ↦ as_xs], τ, fs, as, rs, h[o ↦ ⟨(vs \n xs -> e), eos⟩], ds⟩]

Updatable closures are always nullary (cf. [Pey92]). In the original STGM, such closures push an update frame; in the STGM_MT, they create an Upd dissemination entry: as soon as a value is passed to this entry, the closure is updated with it. Depending on the type of the value that is computed by the closure, we distinguish two cases. First, if the type is non-functional, it is sufficient to extend the dissemination entry referenced by the topmost return continuation:

  I[i ↦ ⟨Enter o, τ, fs, [], ⟨d, cont, as_p⟩:rs, h, ds[d ↦ ms]⟩]
    when h o = ⟨(vs \u [] -> e), eos⟩ and o has non-functional type
  ⟹ I[i ↦ ⟨Eval e [vs ↦ eos], τ, fs, [], ⟨d, cont, as_p⟩:rs, h, ds[d ↦ (Upd o):ms]⟩]

Second, if the type is functional, the closure has to be reentered after the update, i.e., a return continuation initiating the reentering is pushed and a new slot d is created in the dissemination map. Note that, in the first case, the argument stack is guaranteed to be empty in type-correct programs.

Finally, entering a (non-updatable) closure needing more arguments than are available indicates that a partial application has to be passed to the topmost return continuation, i.e., the partial application must be distributed using the dissemination slot d referenced by the return continuation. This case occurs when either a thunk (cf. [Pey92]) has to be updated with a partial application or a partial application has to be communicated to a remote instance:

  I[i ↦ ⟨Enter o, τ, fs, as, ⟨d, cont, as_p⟩:rs, h, ds⟩]
    when h o = ⟨(vs \n xs -> e), eos⟩ and length as < length xs
  ⟹ I[i ↦ ⟨MsgPAPP d o_p cont, τ, fs, as_p, rs, h', ds⟩]

A new closure, named o_p and allocated in h', is created; it implements the partial application, and its structure corresponds to the representation of partial applications in the original STGM.
Evaluating a letrem expression allocates a closure on a new instance k. Additional forwarding closures that contain EnterOn code forms are used for two purposes: first, to reference, from the current instance i, the new closure o_k that is allocated on the new instance k; and second, to reference those closures in the environment of the new closure that are located on the current instance i.
In Section 4.3, letrem was used to implement the meta-function par. Following the definition of par in [HMP94], the closure allocated on the remote instance k must be evaluated immediately. To achieve this behaviour, the initial code form of k must be (Enter o) instead of Next.

letpar constructs specify related, but independent, work; furthermore, the evaluation of the body expression has to be synchronized with the delivery of the values demanded in the value bindings. To this end, the code form Sync, together with a new frame f, is employed. The first argument of Sync contains the value bindings that are still to be evaluated. The frame maintains a counter of the number of value bindings whose value has not yet been added to the environment that is also held in the frame. Note that the number of still-awaited values will be greater than the number of bindings in the Sync form when the computation of some values is hindered by long latency operations:

  I[i ↦ ⟨Eval (letpar b_1; ...; b_n in e) ρ, τ, fs, as, rs, h, ds⟩]
  (10) ⟹ I[i ↦ ⟨Sync [b_1, ..., b_n] e f, τ, fs', as, rs, h, ds⟩]
  where fs' = fs[f ↦ ⟨n, ρ⟩]

In a concrete implementation, the guaranteed independence of the value bindings within one letpar can be used to partition the code generated from an STG_MT-program into non-synchronizing threads, i.e., stateless threads.
If there are unprocessed value bindings in a Sync form, one is selected and its right-hand side e_1 is evaluated. A new return continuation is pushed onto the stack; it contains the remaining part of the Sync code form and the values currently on the argument stack. The new slot d in the dissemination map is used to eventually distribute the result of e_1; the dissemination entry (Store f x) indicates that the result has to be stored in the environment of frame f under the local name x:

  I[i ↦ ⟨Sync ((x# = e_1):bs) e_2 f, τ, fs, as, rs, h, ds⟩]
    when fs f = ⟨n, ρ⟩
  (11) ⟹ I[i ↦ ⟨Eval e_1 ρ, τ, fs, [], ⟨d, Sync bs e_2 f, as⟩:rs, h, ds[d ↦ [Store f x]]⟩]

If there are no more unprocessed value bindings in a Sync form, the behaviour depends on the value of the synchronization counter in the associated frame f. If it is zero, all values are available and the body expression can be evaluated:

  I[i ↦ ⟨Sync [] e f, τ, fs[f ↦ ⟨0, ρ⟩], as, rs, h, ds⟩]
  ⟹ I[i ↦ ⟨Eval e ρ, τ, fs, as, rs, h, ds⟩]

Otherwise, the evaluation of the body has to wait for the delivery of the remaining values, but there is no more independent work in the letpar that created this Sync form. Nevertheless, it is usually not necessary to suspend the current task; there may be further work in textually enclosing letpars or in the calling function. In order to utilize such work, the code form RetDelay is used, and the evaluation of the expression e is deferred to a new task cont that is activated only after the long latency operation has completed.

The form RetDelay informs the enclosing computation that a long latency operation has delayed the delivery of the requested value and that no more local work is available. This information is propagated through the return continuations until a Sync form is found that still has some work. When the delayed value eventually becomes available, it is distributed using the new slot d of the dissemination map.

Remote Method Invocation
In the STGM_MT, as in the original STGM, accessing a data structure means evaluating a closure. Hence, the code form EnterOn represents a remote data access; it initiates the evaluation of the closure on the remote instance k by placing a new task enter into the task pool τ_k. This task eventually distributes the result of the remote computation using the dissemination slot d_k, which forwards it to the slot d_i on the instance i that initiated the whole process. Note that the time between the execution of an EnterOn and the delivery of its result is, in general, unbounded. Hence, it is important that the instance i can do some useful work while waiting for the delivery of the result. The delay induced by the remote access is signalled to the enclosing computation with RetDelay. If the return continuation is a Sync form that still has some unevaluated value bindings left, it can continue by evaluating one of these bindings, as they do not depend on the delayed value:

  I[i ↦ ⟨EnterOn k v, τ_i, fs_i, as_i, rs_i, h_i, ds_i⟩,
    k ↦ ⟨task, τ_k, fs_k, as_k, rs_k, h_k, ds_k⟩]
  ⟹ I[i ↦ ⟨RetDelay x [x ↦ d_i], τ_i, fs_i, as_i, rs_i, h_i, ds_i'⟩,
      k ↦ ⟨task, τ_k.enter, fs_k, as_k, rs_k, h_k', ds_k'⟩]
  where enter = ⟨Eval v [v ↦ v], ⟨d_k, Next, []⟩⟩

Here, the new slot d_i is created empty in ds_i', and ds_k' links the slot d_k on k back to the slot d_i on i.

Selection
A case expression selects the appropriate alternative on the basis of the scrutinized value:

  I[i ↦ ⟨Eval (case x# of ...; c {v_N} -> e; ...) ρ, τ, fs, as, rs, h, ds⟩]
    when ρ x = ⟨c, [w_N]⟩
  (15) ⟹ I[i ↦ ⟨Eval e ρ[v_N ↦ w_N], τ, fs, as, rs, h, ds⟩]

  I[i ↦ ⟨Eval (case x# of c_N {vs_N} -> e_N; default -> e) ρ, τ, fs, as, rs, h, ds⟩]
    when ρ x = ⟨c, ws⟩ and c ∉ {c_N}
  (16) ⟹ I[i ↦ ⟨Eval e ρ, τ, fs, as, rs, h, ds⟩]

If no alternative matches, a fatal failure occurs.

Returning a Proper Value
Returning the unboxed form of a data term means distributing it using the dissemination slot d referenced by the topmost return continuation; afterwards, the continuation is executed. When a data term is returned, the argument stack is always empty for type-correct programs:

  I[i ↦ ⟨RetTerm ⟨c, ws⟩, τ, fs, [], ⟨d, cont, as_p⟩:rs, h, ds⟩]
  (17) ⟹ I[i ↦ ⟨MsgTerm d ⟨c, ws⟩ cont, τ, fs, as_p, rs, h, ds⟩]

Returning a Delayed Value
When the code form RetDelay is executed, a long latency operation has delayed the delivery of some intermediate result, and local, but independent, computations should be employed to cover the delay, i.e., not-yet-evaluated value bindings in surrounding letpars should be executed. For this mechanism to work properly, two jobs have to be carried out: first, some independent work has to be found; and second, when the delayed value finally arrives, it has to be introduced into the ongoing computation.
Imagine a Sync code form with multiple value bindings. When the first binding is evaluated, according to Rule (11), a return continuation is pushed that contains the Sync form with the remaining bindings. To utilize the independent work constituted by these bindings, we just have to return to this Sync. This is what RetDelay does, while simultaneously taking care of the second issue, namely preparing the asynchronous delivery of the delayed value. The latter is done by placing Fwd entries in the dissemination slot ρ(v) that will eventually be used to deliver the remote value; as a result, the asynchronous delivery updates closures that have to be updated with the remote value and stores the value in frames whose associated Sync forms wait for that value.
When the topmost return continuation belongs to a closure that has to be updated (this is handled by the "if" in the rule below), this closure has to be overwritten with a new closure that contains a RetDelay code form. This ensures that repeatedly entering the closure does not cause multiple remote accesses; instead, remote accesses are shared.
When there are no arguments on the stack, the delayed value can be forwarded to the entry specified by the return continuation by means of a Fwd entry, which sends any value delivered via this entry on to the entry given in its argument (in the following rule, to d):

  I[i ↦ ⟨RetDelay v ρ, τ, fs, [], rs, h, ds[ρ v ↦ ms]⟩]
    when rs = ⟨d, cont, as_p⟩:rs'
  (18) ⟹ I[i ↦ ⟨cont, τ, fs, as_p, rs', h', ds[ρ v ↦ (Fwd d):ms]⟩]
  where h' = if cont = (Enter o_u)
             then h[o_u ↦ ⟨([v'] \u [] -> RetDelay v'), [ρ v]⟩]
             else h

The above rule, together with Rule (13), implies that the continuations on the return stack are popped in the process of exploiting work that is independent of the delayed value. Only when the stack is completely empty is it necessary to switch to a completely unrelated task. An interesting consequence of this property is that only one stack per machine instance is needed in the STGM_MT, instead of one stack per task.
If there are arguments remaining on the stack, the delayed value is a partial application, which has to be applied to the arguments on the stack when it is eventually delivered. To realize the synchronization between the delivery of the partial application via the dissemination entry ρ(v) and the code performing the application, a new frame f and a task enter are created. The Store form in the dissemination entry puts the partial application into the frame f when it arrives; this, in turn, enables the task enter:

  I[i ↦ ⟨RetDelay v ρ, τ, fs, as, rs, h, ds[ρ v ↦ ms]⟩]
    when rs = ⟨d, cont, as_p⟩:rs' and as ≠ []
  (19) ⟹ I[i ↦ ⟨cont, τ.enter, fs', as_p, rs', h', ds'⟩]
  where fs'   = fs[f ↦ ⟨1, [vs ↦ as]⟩]
        enter = ⟨Sync [] (x vs) f, ⟨d, Next, []⟩⟩
        ds'   = ds[ρ v ↦ (Store f x):ms]
        h'    = if cont = (Enter o_u)
                then h[o_u ↦ ⟨([v] \u [] -> RetDelay v), [ρ v]⟩]
                else h

As mentioned above, an empty return stack indicates that the current task has no more work to offer:

  I[i ↦ ⟨RetDelay v ρ, τ, fs, [], [], h, ds⟩]
  (20) ⟹ I[i ↦ ⟨Next, τ, fs, [], [], h, ds⟩]

Task Management

Next passes control to some arbitrary task from the task pool. The return continuation of the new task is placed on the stack; then the task is executed:

  I[i ↦ ⟨Next, (task, r):τ, fs, [], [], h, ds⟩]
  ⟹ I[i ↦ ⟨task, τ, fs, [], [r], h, ds⟩]

A concrete implementation would, of course, ensure that the selected task is ready to run, e.g., by maintaining a list of such tasks; otherwise, the selected task might just be suspended again immediately.

Dissemination of Messages
The distribution of unboxed data terms is performed by considering the entries of a dissemination slot one after the other and using the value accordingly: storing it into the environment of a frame (Store), updating a closure (Upd), forwarding it to another dissemination slot (Fwd), or sending it to another instance (RetTo). Once the slot is exhausted, execution continues with the task carried by the MsgTerm form:

  I[i ↦ ⟨MsgTerm d ⟨c, ws⟩ task, τ, fs, as, rs, h, ds[d ↦ []]⟩]
  ⟹ I[i ↦ ⟨task, τ, fs, as, rs, h, ds⟩]

  I[i ↦ ⟨MsgTerm d ⟨c, ws⟩ task, τ, fs, as, rs, h, ds⟩]
    when fs f = ⟨m, ρ⟩ and ds d = (Store f x):ms
  (23) ⟹ I[i ↦ ⟨MsgTerm d ⟨c, ws⟩ task, τ, fs', as, rs, h, ds[d ↦ ms]⟩]
  where fs' = fs[f ↦ ⟨m-1, ρ[x ↦ ⟨c, ws⟩]⟩]

  I[i ↦ ⟨MsgTerm d ⟨c, [w_N]⟩ task, τ, fs, as, rs, h, ds⟩]
    when ds d = (Upd o_u):ms
  (24) ⟹ I[i ↦ ⟨MsgTerm d ⟨c, [w_N]⟩ task, τ, fs, as, rs, h_u, ds[d ↦ ms]⟩]
  where h_u updates the closure o_u with the data term ⟨c, [w_N]⟩

  I[i ↦ ⟨MsgTerm d_i ⟨c, [w_N]⟩ task_i, τ_i, fs_i, as_i, rs_i, h_i, ds_i⟩,
    k ↦ ⟨task_k, τ_k, fs_k, as_k, rs_k, h_k, ds_k⟩]
    when ds_i d_i = (RetTo k d_k):ms
  ⟹ I[i ↦ ⟨MsgTerm d_i ⟨c, [w_N]⟩ task_i, τ_i, fs_i, as_i, rs_i, h_i, ds_i[d_i ↦ ms]⟩,
      k ↦ ⟨MsgTerm d_k ⟨c, [w_N']⟩ task_k, τ_k, fs_k, as_k, rs_k, h_k', ds_k⟩]

In the last rule, where the value is transmitted to another instance, the arguments [w_N] must be replaced by forwarding closures [w_N'], allocated in h_k', that use the EnterOn code form (compare this to Rule (9)).
The distribution of partial applications is similar to that of data terms; in the cross-instance case, the partial application itself has to be copied to the receiving instance:

  I[i ↦ ⟨MsgPAPP d_i o_i task_i, τ_i, fs_i, as_i, rs_i, h_i, ds_i⟩,
    k ↦ ⟨task_k, τ_k, fs_k, as_k, rs_k, h_k, ds_k⟩]
    when ds_i d_i = (RetTo k d_k):ms
  ⟹ I[i ↦ ⟨MsgPAPP d_i o_i task_i, τ_i, fs_i, as_i, rs_i, h_i, ds_i[d_i ↦ ms]⟩,
      k ↦ ⟨MsgPAPP d_k o_k task_k, τ_k, fs_k, as_k, rs_k, h_k', ds_k⟩]
  where h_k' extends h_k with a copy o_k of the partial application, a copy of the
        function closure it references, and forwarding EnterOn closures for the
        objects referenced in their environments
The complexity of the last rule is due to the fact that forwarding EnterOn closures have to be created for all the objects referenced in the environments of the transmitted closures.
It is not sufficient to transmit only the partial application (h_i o_i); the closure (h_i o_f) referenced in the partial application's first environment argument has to be transmitted, too. This closure represents the function that was (partially) applied to the w_i. To execute the partial application on the instance k, the closure (h_i o_f) obviously has to be on instance k as well.
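The entry-by-entry processing described in this section can be summarized by a small Haskell sketch (our own names; a real machine would transform the instance states instead of returning descriptions of the effects):

  type IName = Int; type HName = Int; type FName = Int; type DName = Int
  type Value = Int   -- stands in for a disseminated (unboxed) value

  -- The dissemination entries of Appendix B.
  data DEntry = Store FName String | Upd HName | Fwd DName | RetTo IName DName

  -- The abstract effect of processing one entry.
  data Effect = StoreIn FName String Value   -- store in a frame's environment
              | Update  HName Value          -- update a closure
              | Forward DName Value          -- pass on to another slot
              | SendTo  IName DName Value    -- transmit to another instance
    deriving Show

  -- Distribute one value over all entries of a dissemination slot.
  disseminate :: Value -> [DEntry] -> [Effect]
  disseminate v = map interpret
    where
      interpret (Store f x) = StoreIn f x v
      interpret (Upd o)     = Update o v
      interpret (Fwd d)     = Forward d v
      interpret (RetTo k d) = SendTo k d v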

Related Work
The Threaded Abstract Machine (TAM) [CGSE93] is designed to implement the dataflow language Id [Nik90]. It applies multithreading based on stateless threads to tolerate long latency operations. In this respect, it is close to the work presented in this paper, but the realization of this basic idea differs considerably. Instead of source-to-source transformations, the Id compiler based on the TAM builds a structured dataflow graph as an intermediate representation and has to apply sophisticated thread-partitioning schemes [SCG95]. These partitioning algorithms require graphs without cyclic dependencies, which cannot be guaranteed for a lazy language. In contrast to the STGM_MT, which employs asynchronous operations only when a long latency operation is actually encountered, the TAM uses asynchronous operations by default.
In comparison to the parallel implementation of the STGM, it is interesting to note that the STGM_MT needs only a single argument and return stack per machine instance, instead of one stack per task. Furthermore, on entry, closures need not be overwritten with a "queue-me" code pointer; the updating performed in rules (18) and (19) is sufficient and happens only when a remote access occurred. A proper comparison with a distributed implementation of the original STGM, e.g., GUM [THJ+96], has to wait until a first implementation of the STGM_MT is working.

Conclusion
To decrease the impact of long latency operations on the execution time of distributed implementations of the Spineless Tagless G-machine, the STGM_MT extends the abstract machine language with an abstract notion of independent local computations (thread boundaries and thread synchronization) and with an abstract notion of distribution. This enables the use of source-to-source transformations to increase the tolerance of the code to long latency operations.
While the behaviour of the new abstract machine can be studied using the operational semantics provided in this paper, it remains to be shown that the proposed techniques decrease the impact of the communication overhead in an actual implementation of a lazy functional language on parallel computers with distributed memory.


Appendix B: The Data Structures

In the data type Value, data constructors are omitted; this avoids some clutter in the transition rules, and there is no danger that ambiguities arise. The sequences of free and argument variables correspond to the list of environment variables and the list of argument variables, respectively, in the function bindings of the STG_MT-language. The body of a closure is, in contrast to the original STGM, not an expression of the STG_MT-language but a function from environments to code forms; the reason for this change is that we sometimes need to place code forms other than Eval (e.g., EnterOn) into the body of closures.
Return continuations, the elements of the return stack, are triples consisting of the name of a dissemination entry, a code form, and a sequence of values. The meaning is that the currently executed task distributes its result value using the dissemination entry and then continues with the code form after placing the values on the argument stack.

Figure 1: The grammar of the STG_MT-language.

The remaining data structures are defined as follows. Unboxed data terms are pairs of a constructor name and its arguments:

  UValue = ⟨CName, [HName]⟩               -- constructor & arguments

The type CName is a synonym for Name; it is used for names of data constructors. The type of closures is defined as

  Cls = ⟨Env -> Code, [Value]⟩            -- body & environment values

and return continuations are triples

  Cont = ⟨DName, Code, [Value]⟩

The various code forms are defined in the following data type (Eval and Enter are equal to the corresponding forms of the original STGM; the arguments of the remaining variants follow their use in Section 5.2):

  Code = Eval Expr Env                    -- evaluate an expression
       | Enter HName                      -- enter a closure
       | EnterOn IName LName              -- enter a closure on another instance
       | RetTerm UValue                   -- return an unboxed data term
       | RetDelay LName                   -- signal a delayed delivery
       | Sync [VBind] Expr FName          -- synchronize a letpar
       | Next                             -- pass control to some other task
       | MsgTerm DName UValue Code        -- disseminate an unboxed value
       | MsgPAPP DName HName Code         -- disseminate a partial application

Details on the meaning of the variants are provided in Section 5.2. Finally, the elements of the lists of the dissemination map describe the locations where the code forms MsgTerm and MsgPAPP have to place disseminated values:

  DEntry = Store FName LName              -- store in the environment of a frame
         | Upd HName                      -- update a closure
         | Fwd DName                      -- forward to another dissemination entry
         | RetTo IName DName              -- communicate to another instance