Obtaining a COBOL Grammar from Legacy Code for Reengineering Purposes

We argue that maintenance and reengineering tools need to have a thorough knowledge of the language that the code is written in. More specifically, for the family of COBOL languages we present a general method to define COBOL dialects that are based on the actual code that has to be reengineered or maintained. Subsequently, we give some typical examples of maintenance and reengineering tools that have been specified on top of such a COBOL grammar in order to show that our approach is useful and leads to accurate and relatively simple maintenance and reengineering tools.


Introduction
There is a constant need for updating and renovating business-critical software systems for many and diverse reasons: business requirements change, technological infrastructure is modernized, the government changes laws, or the third millennium approaches, to mention a few. So, in the area of software engineering the subject of reengineering becomes more and more important. Reengineering is the analysis and renovation of software; see, for instance, [11] for an introductory paper and [7] for an annotated bibliography on this subject. To aid in the analysis and renovation of software it is crucial to have tool support. We strongly believe that such tools largely benefit from having knowledge of the language they have to analyze or renovate. In other words, tools to aid in reengineering should be based on the grammar of the language they intend to process.
As most of the business-critical software is written in COBOL, we will present a general method describing how to obtain a COBOL grammar, given a number of COBOL programs. Moreover, we explain how to reduce the size of this grammar significantly by means of, as we call it, unification of production rules. Finally, we show that tools that are based on such a compact language definition are useful in reengineering the code written in that language. Thus, the challenge of specifying reengineering tools lies in the (compact) specification of the underlying language. When it comes to defining syntax it is obviously better to use a parser generator, like Lex+Yacc, than to write a parser by hand. Therefore, it is not surprising that, e.g., Software Refinery [24], which is based on generic programming language technology, incorporates a number of reengineering tools (for more connections between reengineering and generic language technology we refer to [6]).
Chris Verhoef was supported by the Netherlands Computer Science Research Foundation (SION) with financial support from the Netherlands Organization for Scientific Research (NWO), project Interactive tools for program understanding, 612-33-002.
To emphasize the importance of tools based on language definitions we give a small example. Suppose that we need a tool that analyzes COBOL programs by looking for date-related variables and that this tool is solely based on lexical scanning. This implies that the tool will also list comments containing the patterns it is looking for. To reduce the number of so-called false positives it is sensible to modify this tool so that listing of comments will be suppressed. In fact, the tool now contains knowledge of the language it is analyzing. Still, many false positives will be found for various reasons. This can mostly be solved in an ad hoc manner by adding extra knowledge of the language. This procedure has to be repeated for a tool that needs to perform another task. In fact, most of the time a tool developer is busy adding knowledge of the language to the tool instead of implementing the actual reengineering or maintenance task of the tool. So in our opinion it is a better idea to approach this in a more structured manner, by specifying COBOL syntax and subsequently specifying tools on top of that syntax. This brings us to the matter of defining COBOL syntax. There is a myriad of COBOL dialects. We will discuss in this paper a general method describing how to obtain a practical COBOL grammar that recognizes a particular dialect. Then we discuss how tools can be specified that aid in reengineering and maintenance tasks. At this point the reader may observe that although putting more knowledge in a tool implies fewer false positives, the processing overhead will increase. However, since legacy systems are notoriously large, the reduction of false positives is a serious matter. In the end it pays off to use a tool that has extensive knowledge of the language. Apart from that, knowledge of the language can even improve the performance of a tool. For instance, a tool that returns the PROGRAM-ID of a given COBOL program only needs to process the IDENTIFICATION DIVISION, whereas a tool that has no knowledge of the language has to process the entire program to find the PROGRAM-ID (if it finds it at all, see Section 5.1 for more details).
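The date-variable example can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the pattern DATE_PATTERN, the function names, and the SAMPLE lines are hypothetical and do not come from the tools discussed in this paper. The first function scans purely lexically; the second adds one piece of language knowledge (a comment marker in column 7) and still reports a false positive inside a literal.

```python
import re

# Hypothetical pattern for date-related names; a real tool would use a richer list.
DATE_PATTERN = re.compile(r"DATE|YEAR|YY", re.IGNORECASE)

def naive_hits(lines):
    """Purely lexical scan: report every line number that matches the pattern."""
    return [n for n, line in enumerate(lines, start=1) if DATE_PATTERN.search(line)]

def comment_aware_hits(lines):
    """Same scan, but skip lines marked as comments in column 7 (* or /)."""
    hits = []
    for n, line in enumerate(lines, start=1):
        is_comment = len(line) >= 7 and line[6] in "*/"
        if not is_comment and DATE_PATTERN.search(line):
            hits.append(n)
    return hits

# Hypothetical fixed-format COBOL lines (columns 1-6 hold the line number).
SAMPLE = [
    "000100* THIS PROGRAM PRINTS THE DATE OF BIRTH",
    "000200 01  BIRTH-DATE      PIC 9(6).",
    "000300 01  CUSTOMER-NAME   PIC X(30).",
    '000400     DISPLAY "UPDATE DONE".',
]
```

On SAMPLE the naive scan reports the comment line as well, while the comment-aware scan still reports the DISPLAY line because UPDATE contains DATE. Removing that hit needs yet more language knowledge, which is the ad hoc cycle described above.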
Having motivated our work, we will give an impression of it. In Figure 1 we depict how to obtain a practical grammar that is based on preprocessed code. We briefly discuss the various phases of that process.
First, we will discuss the left-hand side of Figure 1. The starting point is COBOL code that needs to be analyzed. In our case this was a mix of two COBOL dialects provided by a (financial) company. To be able to analyze and change such code, it is important to minimize effort. This implies that the code has to be preprocessed. We sketch this approach: first we strip the COBOL code so that only relevant columns remain, thus obtaining so-called 666-code. Then we reformulate this code a bit by performing some global substitutions. For instance, we substitute THRU for THROUGH. Then we lexically disambiguate the reduced 666-code thus obtained, for instance, to be able to lexically distinguish the comment marker * from the multiplication operator *. The result is lexically disambiguated 666-code.
The right-hand side of Figure 1 deals with manipulations on the grammar. The open arrows indicate that the grammar in question understands the code the arrow points at. We have an ANSI COBOL grammar at our disposal [22] plus extensions that we found in other sources such as IBM manuals, actual COBOL code, etc. It understands COBOL74, COBOL85, IBM-specific code, etc. in 666-form and reduced 666-form. We reduce the ANSI grammar plus its extensions by the removal of redundant tokens and we remove all the constructs that do not occur in the code; for instance, the END PROGRAM header did not occur in the code. Thus we obtain a restricted grammar that understands reduced 666-code. Then we retokenize the grammar so that it will recognize the lexically disambiguated 666-code. Finally, we unify the grammar.
Roughly, this means that we combine context-free rules into fewer rules. The advantage of this approach is that the grammar will be reduced in size, which leads to a better performance of the generated parser. Of course, the unified grammar will recognize more programs than the retokenized one; it may even recognize incorrect programs. But since the original code is compilable, this causes no problem (see Section 4.4 for more details). We emphasize that our approach is not just a reduction of an existing COBOL85 grammar; rather, it uses that grammar as a reference. If constructs that are not present in the standard are located in the code, we add those to our grammar. We also used our approach to define CICS and SQL syntax to deal with embedded calls in COBOL dialects; the only difference is that we have no standard available to save us some work.
In practice, the above code and grammar manipulations will be carried out in parallel. In order to explain our approach we treat those transformations as conceptual entities.
At the time of writing this paper we are in the preliminary phase of designing a software renovation factory to restructure legacy code from a large financial company. We are using the methods we report on in this paper to construct a grammar that recognizes their code. In this paper we intend to show that our approach is useful for the construction of tools by giving some examples.
In [8] we describe how we can even generate tools for a renovation factory using a grammar as input.

Organization of the Paper
In Section 2 we discuss how to obtain COBOL definitions in general. In Section 3 we discuss in detail the code manipulation part of Figure 1, where we translate COBOL code into lexically disambiguated 666-code that suits our reengineering purposes. The grammar manipulation part of Figure 1 is discussed in Section 4, where we present a general method to define the syntax of COBOL. In Section 5 we discuss the specification of tools based on the COBOL syntax definition. In Section 6 we give our conclusions.

Related Work
In [2] and [3] a formal semantics of a subset of COBOL is presented based on denotational semantics. A reverse engineering tool is based on these ideas, which can be used to reverse engineer COBOL74 programs. This tool is based on a COBOL grammar consisting of 170 complex rules in BNF notation; based on this definition the formal semantics of the control statements is defined. These control statements are isolated in a new language, MiCo. We give a general method describing how to obtain such grammars, we have specified such grammars, and we have specified tools based on such grammars.
In [23] experiences with building a remodularization tool for COBOL using Software Refinery are discussed. We used a system that is similar to theirs to develop tools for COBOL.
In [10] a method is proposed to algebraically specify tools to aid in reengineering and maintenance. Their approach is to base the tools on a database that contains knowledge of the code to be maintained. We store the knowledge of the code in a language definition of that code. They use traditional parser generation techniques (Lex+Yacc) to generate parsers for specific languages before the actual analysis can take place. In our case we use more sophisticated incremental parser generation techniques. See Section 2 for more comparisons and details.

Towards Defining COBOL
In this section we will discuss how to define syntax in general and what techniques we used to define COBOL dialects. Then we discuss an ambiguity problem of COBOL that cannot be solved easily.
First we enumerate the various COBOL dialects and their compilers.
COBOL68, COBOL74, COBOL85. The first, second, and third versions of COBOL. These languages were released in 1968, 1974, and 1985, respectively.
UNISYS COBOL, OS/VS COBOL, COBOL 370, Microfocus COBOL. These are software products: COBOL compilers, their corresponding COBOL dialect, and tools to support programming in that particular COBOL dialect.
We note that many COBOL dialects have the possibility to incorporate embedded SQL, JCL, CICS, etc. We can handle those extensions as well.

Formalisms to Define Syntax
We briefly discuss a number of formalisms in which it is possible to define the syntax of a programming language. They are EAG [20], Lex+Yacc [16,21], Refine [24], and Asf+Sdf [4,14]. First we will briefly characterize them and then we discuss their merits and shortcomings in the light of defining a COBOL syntax. We refer to the surveys [25] and [27] for a general discussion on the above (and other) formalisms to define the syntax and semantics of programming languages.
EAG EAG stands for Extended Affix Grammar. It is an attribute grammar formalism based on two-level van Wijngaarden grammars [28]; it allows semantic-directed parsing and imposes no restrictions on the grammar. We are not aware of a COBOL definition in EAG.
Lex+Yacc Lex+Yacc are well-known formalisms to generate scanners and parsers. They have been used to define COBOL grammars.
Refine Refine is a formalism that is supported by a commercial system to develop reengineering tools; it is based on research activities in the area of knowledge-based systems and programming environments performed at Kestrel. There exists a Refine/COBOL environment to analyze and modify legacy COBOL code.
Asf+Sdf Asf+Sdf stands for Algebraic Specification Formalism and Syntax Definition Formalism. It is an algebraic formalism with user-definable syntax; it is modular and imposes no restrictions on defining a grammar. A programming environment supports this formalism, and from a language definition a complete programming environment, including a parser, can be generated. Asf+Sdf has been used to define COBOL grammars.
Now we discuss relevant properties of the above formalisms in the light of defining COBOL syntax.

Modularity The formalisms EAG, Refine, and Lex+Yacc do not support modularity. Especially for a huge language like COBOL with so many dialects, modularity is essential for developing language definitions for various dialects and to increase grammar reuse.
Asf+Sdf does support the necessary modularity. So with Asf+Sdf we can store the dialect-specific parts of a grammar in separate modules and simply switch from one dialect to another by adapting the import structure. See Section 2.2 for more details on modules.

Parsing techniques The formalisms Refine and Lex+Yacc are based on LR parsing, and while developing a context-free grammar the user may be confronted with shift-reduce and reduce-reduce conflicts. This can be an enormous problem when developing a definition for COBOL, in particular when an existing definition is extended or modified to deal with a new dialect.

Arbitrary grammars The formalisms EAG and Asf+Sdf support the development of arbitrary context-free grammars, even ambiguous grammars. In addition, EAG supports semantic-directed parsing: semantic values, like the types of variables, can influence the parsing of a program. This can be an aid in disambiguating COBOL expressions (see Section 2.3 for more details).

Techniques: Motivation and Choice
Obviously, there is no best formalism to define COBOL syntax. Lex+Yacc is proven technology, but its lack of modularity and its conflicts are serious drawbacks. Refine, which is based on the same parser generation technology, also has these problems. On EAGs we can be brief: they are not an option since EAGs are not very well supported by tools.
We have chosen Asf+Sdf because we think modularity is essential when describing the syntax of COBOL and its myriad of dialects. Moreover, the Asf+Sdf formalism is supported by a programming environment, called the Asf+Sdf Meta-Environment [18], which is based on lazy and incremental scanner and parser generation techniques allowing flexible prototyping of language definitions. It also supports rapid prototyping of tools based on the COBOL syntax; see Section 5 for details.
Another argument in favor of the Asf+Sdf Meta-Environment is that it is quite easy to change the context-free grammar rules of company-specific versions of the language, or to define a context-free grammar for languages that resemble COBOL such as Telon [12] (a language used to generate screens of the user interface of an application). It is also easy to incorporate modules describing CICS and SQL syntax and JCL calls. Since there are many dialects of COBOL and the possibility of embedded calls in them, this is a serious argument in favor of Asf+Sdf.
Asf is based on the notion of a module consisting of a signature defining the abstract syntax of functions and a set of conditional equations defining their semantics.
Sdf allows the definition of concrete (i.e. lexical and context-free) syntax. Abstract syntax is automatically derived from the concrete syntax rules.
Asf+Sdf has already been used for the formal definition of a variety of (programming) languages and for the specification of software engineering problems in diverse areas. See [5] for details on industrial applications.
Asf+Sdf specifications can be executed by interpreting the equations as conditional rewrite rules or by compilation to C. For more information on conditional rewrite systems we refer to [19] and [17]. It is also possible to regard the Asf+Sdf specification as a formal specification and to implement the described functionality in some programming language.

Ambiguities in COBOL
There are various forms of ambiguity, many of them created by associativity and priority conflicts in the context-free syntax rules. Associativity problems can be solved either by introducing brackets in the sentence to be parsed, or by modifying the context-free syntax rules.
The associativity problems can be solved in Sdf by extending the context-free syntax rules with the appropriate associativity attributes. Priority problems can also be solved by introducing brackets in the input sentence, or by giving an ordering on the context-free syntax rules that cause the priority conflicts.
However, there are ambiguities which are context-dependent. We will give a COBOL-specific example: X = Y OR Z (see Figure 2). Depending on the status of Z the root of the parse tree is either = or OR. The program in Figure 2 prints s (success) or f (failure) depending on whether Z is a numeral or a predicate and depending on the values of X, Y, and Z; viz. the results in the table in the program. The conclusion here is that the parse tree depends on the status of Z. Note that the * on line 7 now forces Z to be a predicate, but if the * is on line 8 then Z is a numeral. This problem can be solved in Sdf by defining a more general context-free syntax rule and by adding an extra pass to adjust the constructed parse tree.
Note that this problem would not occur if we used EAG, because of the facility of semantic-directed parsing. The other approaches that we discussed in Section 2.1 do have this problem, but we think it is harmless since we do not need this type of information in reengineering. At least we did not encounter this problem in real-world code.
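The idea of a general rule followed by an adjusting pass can be sketched in Python as follows. This is a minimal illustration, not the Sdf solution itself: the flat node shape ("COND", x, "=", y, "OR", z), the function name adjust, and the set of predicate names are all assumptions made for the example.

```python
def adjust(flat, predicates):
    """Extra pass: turn the generic node ('COND', x, '=', y, 'OR', z) into a tree.

    `predicates` is an assumed symbol table: the set of names declared as
    conditions (predicates) rather than as numeric data items.
    """
    _, x, _, y, _, z = flat
    if z in predicates:
        # Z is a condition on its own: (X = Y) OR Z, so OR is the root.
        return ("OR", ("=", x, y), z)
    # Z is a numeral: X = (Y OR Z), so = is the root.
    return ("=", x, ("OR", y, z))
```

The parser only has to recognize the flat shape; the decision about the root is deferred until the status of Z is known.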

Preprocessing of COBOL programs
As is well known, COBOL is a huge language. One of the problems in specifying COBOL syntax is that many syntactic variations with exactly the same semantics occur in the language. We also know that reducing the size of a COBOL definition leads to better performance and makes the specification of tools easier. So, it is a good idea to preprocess COBOL code by taking away the syntactic freedom that can be dealt with easily in a preprocessor. However, other transformations, like the use of periods instead of scope terminators, are not typical preprocessing tasks. For such tasks it is better to specify a tool on top of the COBOL syntax that takes care of such transformations. Usually, such tools will facilitate maintenance and reengineering of COBOL code. An example is the transformation of implicit to explicit scope terminators (see Section 5.3 for details).
We will devote the rest of this section to a point-by-point discussion of a few typical preprocessing tasks.
Stripping The first steps in the preprocessing phase correspond to the stripping in Figure 1.
Elimination of obsolete language constructs. Namely, the line numbering (columns 1-6) and comments (columns 73-80) can be omitted, thus obtaining 666-code. Since the line numbers and comments may occur at every location that permits a line break, the context-free grammar would explode if we incorporated them. Therefore, we omit them, which is harmless since line numbers and comments do not have a semantics.
Elimination of the continuation mark in the C-Area (this is column 7). The reason is that otherwise continuation marks could occur at arbitrary locations in the context-free grammar. To avoid explosion of the grammar, we remove them. Note that this is harmless since continuation marks do not have a semantics.
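The stripping steps can be sketched in a few lines of Python. This is an illustrative sketch and not the preprocessor used for this paper; the function names are hypothetical, and the treatment of continuation lines is a simplification (continued literals need more careful handling).

```python
def strip_columns(line):
    """Drop columns 1-6 and 73-80; keep column 7 and areas A and B (columns 7-72)."""
    return line.rstrip("\n")[6:72]

def strip_program(lines):
    """Strip every line and remove continuation marks by joining continued lines."""
    out = []
    for raw in lines:
        body = strip_columns(raw)
        if body.startswith("-") and out:
            # Column 7 holds a continuation mark: glue the text onto the previous line.
            # Simplification: continued literals are not treated specially.
            out[-1] = out[-1].rstrip() + body[1:].strip()
        else:
            out.append(body)
    return out
```

Each stripped line is at most 66 characters wide: the 6 columns on the left and the 8 columns on the right are gone.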
Reformulation The second phase corresponds to the reformulation phase in Figure 1.This phase leads to reduced 666-code.Note that we reduce the number of syntactic possibilities without changing the semantics of the code.
Elimination of syntactic variation, to reduce in advance the many alternatives in production rules of the context-free grammar. For example, the construct VARYING [IN] [SIZE] [FROM] (where the square brackets stand for optional keywords) has eight alternatives. We substitute just VARYING for all variations. Another example is to transform (>|GREATER) [OR] (=|EQUAL) to >=, thus reducing from eight alternatives to one (with (A|B) we mean the choice between A and B).
Transformation from mixed-case keywords to uppercase keywords. This leads to a simplification of the grammar.
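A reformulation step of this kind can be sketched with regular expression substitutions in Python. The sketch is illustrative only: the names W, E, SUBSTITUTIONS, and reformulate are hypothetical, and upper-casing the whole line is a simplification that would also change the contents of literals.

```python
import re

# Guards so that keywords are not matched inside longer names such as X-GREATER.
W = r"(?<![A-Z0-9-])"
E = r"(?![A-Z0-9-])"

SUBSTITUTIONS = [
    # (>|GREATER) [OR] (=|EQUAL)  becomes  >=
    (re.compile(rf"(?:>|{W}GREATER{E})\s*(?:{W}OR{E}\s*)?(?:=|{W}EQUAL{E})"), ">="),
    # VARYING [IN] [SIZE] [FROM]  becomes  VARYING
    (re.compile(rf"{W}VARYING{E}(?:\s+IN{E})?(?:\s+SIZE{E})?(?:\s+FROM{E})?"), "VARYING"),
    # THROUGH  becomes  THRU
    (re.compile(rf"{W}THROUGH{E}"), "THRU"),
]

def reformulate(line):
    """Upper-case the line, then collapse syntactic variants to one spelling."""
    line = line.upper()  # simplification: literals are upper-cased too
    for pattern, replacement in SUBSTITUTIONS:
        line = pattern.sub(replacement, line)
    return line
```

For instance, both `IF A GREATER OR EQUAL B` and `if a > = b` come out as `IF A >= B`, so the grammar needs only the single alternative `>=`.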

Lexical disambiguation The last phase corresponds to the lexical disambiguation phase in Figure 1. After that phase we have lexically disambiguated 666-code.
Elimination of lexical ambiguities. One of the problems that we see in COBOL is the overloading of symbols with a different meaning depending on the context. This can cause parse problems. For example, the comment marker in COBOL is * or /, but these symbols are also arithmetic operators. Only in the C-Area do they indicate comments. We substitute a special symbol for the comment signs so that we are able to distinguish them from the operators. The reason for this is technical: most scanning techniques are oriented from left to right, whereas for this problem we would prefer a bidirectional scanning mechanism.
Our preprocessor is a one-page perl(1) script [26]. Translation of original code to lexically disambiguated 666-code takes about 1 second per 3 KLOC.
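The comment-marker substitution can be sketched in Python as follows. This is not the Perl script mentioned above; the function name disambiguate is hypothetical, and the sketch assumes its input is 666-code, so that the C-Area is the first character of each line. The replacement symbol # is the one used for comments in the retokenized grammar of Section 4.

```python
def disambiguate(line):
    """Replace a comment marker (* or /) in the first column of a 666-code line by '#'."""
    if line[:1] in ("*", "/"):
        return "#" + line[1:]
    return line
```

Because only the first column is inspected, an expression such as `COMPUTE A = B * C / D` is left untouched, and the scanner never has to decide what * or / means from context.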

Defining COBOL Syntax
In this section we will discuss a general method to obtain a definition of a COBOL syntax that is based on code that has to be maintained or reengineered. As a reference we will work with a COBOL definition in Sdf that is as complete as possible and that is based on the ANSI standard of COBOL85 [1]. This definition contains approximately 1100 production rules but it is still not complete [22]. We will discuss how to obtain the ANSI grammar, and then we discuss how to obtain a specific grammar that recognizes the code that actually has to be maintained or reengineered. This is not necessarily a reduction of the ANSI grammar: constructions that do not occur in it are added to the specific grammar in the same way the ANSI grammar was developed.

An ANSI COBOL85 Grammar
The standard way to develop a language definition for an existing language is to base it on an official document containing at least the syntax rules. Developing the definition based on syntax rules results in a parser which accepts only syntactically correct programs. In this section we discuss the syntax definition of COBOL85 based on the ANSI standard [1]. A more detailed description of this process is reported in [22].
Ideally this process leads to a more or less complete formal definition of the language; however, the amount of success depends on the complexity and size of the language. Although COBOL85 is a huge language, a considerable part of the language is defined in [22] along the lines that we discuss below. It consists of about 1100 production rules and it took an undergraduate student four months to specify these rules. As far as we know this is the largest specification of COBOL85. We are aware of [2] and [3], where a tool is described that is based on a COBOL grammar of 170 BNF rules.
Almost all context-free grammar rules of the COBOL85 syntax in [1] are translated into rules in the Syntax Definition Formalism (Sdf) [14]. After the definition of a considerable number of rules in Sdf, a number of test programs were parsed by the generated parser to test the correctness and completeness of the context-free grammar for ANSI COBOL85. Here we benefited from the advantage of using the Asf+Sdf Meta-Environment since it supports incremental parser generation from the context-free grammar.
Next, we describe the problems we encountered during the translation of the context-free grammar rules in [1] into Sdf rules. This is a nontrivial process for the following reasons. First of all, the size of the ANSI document is huge. But more important is that the ANSI standard is not very well structured. The ANSI document distinguishes the following sets of rules: the so-called general format that gives the formal definition of the syntax, the syntax rules that add properties to the definitions in general format, and the general rules that, they claim, define the semantics. In fact, the syntax definition is scattered over the general format, the syntax rules, and the general rules. The general format is a first-order approximation of the actual syntax. The syntax rules and the general rules give more information on the syntax in natural language. For example, the arbitrary order of optional constructs is often defined informally in the syntax rules. The general rules define syntactic constraints, whereas they claim to define semantics. We incorporated all their different syntax descriptions using one formalism: Sdf.
In [22] care has been taken of these problems, and although the resulting definition is not yet complete it has a far better structure than the ANSI document and is, therefore, useful as an electronic reference manual. In fact, the grammar in [22] could very well serve as the starting point for grammar manipulations as depicted in Figure 1, leading to a more specific grammar that is geared towards recognizing specific COBOL code that has to be analyzed.
To give the reader an idea of the translation process we present an example. We discuss the translation of the PICTURE string environment from the ANSI standard into Sdf. Since the complete Sdf definition of the PICTURE string environment takes about 90 production rules, we confine ourselves to showing how the following rule in general format is translated to Sdf rules.

{ PICTURE | PIC } IS pic-string
The intended interpretation of the above rule is as follows. Uppercase words represent keywords; lowercase words represent non-terminals. The curly braces indicate that one of the alternatives has to be chosen. Underlined keywords are obligatory, as opposed to those that are not underlined.
The translation to Sdf rules for this construct is below.
The meaning of the above rules is as follows. We mention that Sdf is a BNF-like formalism, except that the left- and right-hand sides are switched. The alternatives of a syntax rule are those rules which have the same non-terminal at the right-hand side. PICTURE-DEF, PIC-LIT, OPT-IS, and PIC-STRING are non-terminals; "PICTURE", "PIC", and "IS" are keywords. The first alternative of OPT-IS is an empty alternative.
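To make this description concrete, the rule set can be written down as data. The sketch below is a Python encoding assumed from the prose (the names RULES and alternatives are hypothetical, and the exact shape of the productions is an assumption); it is not the Sdf text from [22]. Each production lists its symbols first and its result sort last, mirroring the switched orientation of Sdf.

```python
# Each production is (symbols, result sort), mirroring Sdf's "symbols -> SORT".
# Assumed from the description: the exact rule text is not taken from [22].
RULES = [
    (("PIC-LIT", "OPT-IS", "PIC-STRING"), "PICTURE-DEF"),
    (('"PICTURE"',), "PIC-LIT"),
    (('"PIC"',), "PIC-LIT"),
    ((), "OPT-IS"),            # empty alternative: IS may be omitted
    (('"IS"',), "OPT-IS"),
]

def alternatives(sort):
    """Return the alternatives of a sort: all rules with that sort on the right."""
    return [symbols for symbols, result in RULES if result == sort]
```

Looking up OPT-IS yields the empty alternative first and the keyword "IS" second, which is how the optional, non-underlined IS of the general format is expressed.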
In the above way the entire translation can be made, and a major part of it has been carried out in [22]. The ensuing context-free grammar recognizes 666-code (see Figure 1).

A Restricted Grammar
Based on actual code of about 200 COBOL programs, a mixture of OS/VS COBOL and COBOL 370 programs, the Sdf definition discussed in the previous section is adapted and reduced to a restricted grammar. The reason that we want a restricted grammar is to improve the performance of the parser generator, the generated parser, and the tools that use the grammar. The performance is not optimal, since the production rules in the ANSI grammar plus extensions have many alternatives (due to the syntactic freedom that COBOL provides), and these alternatives in turn have many members. This implies that the effort to construct the LR table increases.
First we restrict the grammar by leaving out all those COBOL constructions from the ANSI definition that do not occur in the actual COBOL code. Next we carry out a reduction of the ANSI grammar which corresponds to the reformulation phase applied to the code. The result is a restricted grammar that understands reduced 666-code.

A Retokenized Grammar
The next phase in the preprocessing was to lexically disambiguate the reduced 666-code. The reason to do this is not to optimize the performance but to solve some technical problems that deal with ambiguities. Hence, the grammar needs to be retokenized. To give an impression of the retokenization we treat a few typical aspects of it.
In COBOL a period (.) stands for at least three different things: a scope terminator, a DECIMAL POINT, or in a PICTURE it can be a formatting symbol (if DECIMAL POINT is set to COMMA). In order to distinguish a period that terminates a PICTURE we changed the other two possibilities into a colon (:). Thus the grammar needs to understand colons.
To distinguish the comment markers * and / from the same tokens for the operations multiply and divide we changed comments into #. Hence, the grammar needs to understand that # indicates layout.
Certain scope delimiters can be dealt with more easily when they are renamed into a single symbol. The reason to do this is purely technical. We changed the scope delimiters == in a COPY statement into @. So the grammar needs to be retokenized in order to understand these symbols.

A Unified COBOL Grammar
The grammar manipulations that we discussed thus far share the property that the grammar does not recognize incorrect programs. The restricted grammar can recognize fewer programs than the ANSI definition, and the retokenized grammar parses a lexically disambiguated form of COBOL code (which can be seen as a dialect of COBOL). In this section we will discuss a grammar transformation that we call unification. The unified grammar can parse incorrect programs. At this point we have to make an assumption on the code that we want to analyze: we assume that the code is correct, in the sense that it compiles. Fortunately, our point of departure is COBOL code that is compilable since it is our purpose to maintain or reengineer exactly such COBOL code. The reason that we want to unify the grammar is to further improve the performance of the generated parser and the tools that use the grammar. We emphasize that unification does not imply a reduction of the number of programs that can be parsed; it just means that the grammar is less fine-grained than the original grammar.
Next we will explain what unification is and we give an example of its use. The idea is that we take the least common multiple, to phrase it informally, of all the alternatives of a production rule, thus obtaining a rule that possibly allows more than is permitted by the compiler. Consider the productions:
{A1,...,An} -> A
{B1,...,Bm} -> B
where the set notation {A1,...,An} -> A is used to denote the set of n productions A1 -> A up to and including An -> A. The unification of the above rules is the union of their sets where the output sort is renamed into a common one:
{A1,...,An,B1,...,Bm} -> C
This operation gives a reduction of the number of production rules if there is overlap in {A1,...,An} and {B1,...,Bm}. We call this context-free unification.
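Context-free unification can be sketched as a set operation in Python. The sketch is illustrative: the function name unify and the tuple encoding of productions are assumptions, and it also renames the unified sorts where they occur inside other productions, which the text above leaves implicit.

```python
def unify(productions, sorts, new_sort):
    """Rename every sort in `sorts` to `new_sort` and merge duplicate productions.

    A production is a pair (symbols, result sort), with symbols a tuple.
    """
    unified = set()
    for symbols, result in productions:
        # Rename the sorts wherever they are used (an assumption of this sketch).
        symbols = tuple(new_sort if s in sorts else s for s in symbols)
        if result in sorts:
            result = new_sort
        unified.add((symbols, result))  # the set removes overlapping productions
    return unified
```

With {x, y} -> A and {y, z} -> B, unifying A and B into C leaves the three productions x -> C, y -> C, and z -> C: the overlap y is counted once, which is where the reduction comes from.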
Apart from context-free unification we also consider lexical unification. Consider two lexical productions whose left-hand sides are the regular expressions regexp1 and regexp2. Their unification is a single production rule
regexp -> C
where regexp is a regular expression that matches whenever one of the regular expressions regexp1 or regexp2 matches. We call this lexical unification.
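In terms of ordinary regular expressions, the simplest unified expression is the alternation of the two originals. The Python sketch below illustrates this; the function name unify_lexical is hypothetical and Python's `re` syntax stands in for Sdf's lexical syntax.

```python
import re

def unify_lexical(regexp1, regexp2):
    """Return a compiled pattern that matches whenever regexp1 or regexp2 matches."""
    return re.compile(f"(?:{regexp1})|(?:{regexp2})")
```

A unified expression may also be coarser than the plain alternation, as long as it still matches everything the originals match; the PICTURE string example in this section takes that route.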
The combination of context-free and lexical unification yields a reduction of the number of production rules that is quite dramatic in the case of COBOL. Just as with the ANSI definition we will show how we take care of the PICTURE string environment in the unified grammar. As an indication we give two (simplified) Sdf rules originating from the ANSI grammar. They represent the rules for numeric, e.g. 9(6), and alphanumeric, e.g. X(6), pictures:
[9]([1-9][0-9]*) -> PIC-STRING
[X]([1-9][0-9]*) -> PIC-STRING
By repeated application of context-free and lexical unification of the rules that define PIC-STRING (approximately 90 in the ANSI grammar) we end up with a single production rule to define the complete PICTURE string in the unified grammar:
[0-9XxAa()pZzVvSszBCRD/,$+\-*:]+ -> PIC-STRING
The above rule expresses that one or more characters that are contained in the square brackets form a PICTURE string, called PIC-STRING. So the above unified rule recognizes that there is a PICTURE string, but not whether it is correct. For example, it parses :-) but such a PICTURE string never occurs in compilable COBOL code. In fact, the unified grammar disentangles the structure of the COBOL program, which is exactly our intention.
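The behavior of the unified rule is easy to check with an ordinary regular expression engine. The Python sketch below copies the character class from the rule above; the name PIC_STRING and the helper is_pic_string are hypothetical.

```python
import re

# Character class copied from the unified rule; '+' means one or more characters.
PIC_STRING = re.compile(r"[0-9XxAa()pZzVvSszBCRD/,$+\-*:]+")

def is_pic_string(text):
    """True if the whole text is accepted by the unified PIC-STRING rule."""
    return PIC_STRING.fullmatch(text) is not None
```

The rule accepts the well-formed pictures 9(6) and X(6), and it equally accepts the nonsensical :-), which is harmless under the assumption that the input code compiles.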
In this way we unified the retokenized COBOL grammar. This resulted in a definition of approximately 300 production rules. It took a month of work to obtain this definition. We tested it on a large collection of COBOL programs that had to be reengineered (about 200 programs, varying from 20 to 11,000 LOC). The performance of the parser is 130 LOC/sec.

In this section we will explain how to use a COBOL syntax to specify tools that aid in reengineering and maintenance tasks. We specify such tools on top of the COBOL syntax. This is done with the Algebraic Specification Formalism [4] (the Asf part of Asf+Sdf). With algebraic equations it is convenient to specify the functional behavior of a tool. The Asf+Sdf Meta-Environment supports the rapid prototyping and the development of such a tool. The equations that specify it take COBOL code as input. They perform the specified task by focusing on that part of the code that has to be inspected and/or changed. We will first specify an example tool to make our approach clear and then we discuss some maintenance/reengineering tools. We will provide input/output example fragments to show the functionality of the tools. For more applications of the construction of tools based on grammars we refer to [8], where we generate such tools from a grammar.

Example Tool
We present a simple example to illustrate how to specify tools on top of a COBOL grammar. The task of our example tool pid is to extract from a given COBOL program its PROGRAM-ID. Since the PROGRAM-ID of a COBOL program is located in the IDENTIFICATION DIVISION, we can restrict our tool to inspecting only the identification division:

    pid(I-div E-div D-div P-div) = pid(I-div)

The variables I-div, E-div, D-div, and P-div stand for the four divisions in COBOL. The above equation expresses that the tool will only focus on the IDENTIFICATION DIVISION. The behavior of pid on that division is described by a second equation. In it, the variables Id, Author, Inst, Wdate, Cdate, and Remarks stand for the possible fields in the identification division. This last equation expresses that for a given IDENTIFICATION DIVISION the pid tool returns the variable Id, which is the PROGRAM-ID.
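To make the shape of the two pid equations concrete, here is a minimal sketch in Python. It is not the Asf specification: it assumes that a parser has already turned the program into a structured value, and the class and function names (`Program`, `IdentificationDivision`, `pid_of_division`) are hypothetical. The field names follow the variables Id, Author, Inst, Wdate, Cdate, and Remarks from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IdentificationDivision:
    # Assumption: every field except the PROGRAM-ID is optional.
    program_id: str
    author: Optional[str] = None
    installation: Optional[str] = None
    date_written: Optional[str] = None
    date_compiled: Optional[str] = None
    remarks: Optional[str] = None


@dataclass
class Program:
    # The four divisions; the last three are left opaque in this sketch.
    i_div: IdentificationDivision
    e_div: object = None
    d_div: object = None
    p_div: object = None


def pid_of_division(i_div: IdentificationDivision) -> str:
    # Counterpart of the second equation: return the Id field.
    return i_div.program_id


def pid(program: Program) -> str:
    # Counterpart of pid(I-div E-div D-div P-div) = pid(I-div).
    return pid_of_division(program.i_div)
```

The point of the sketch is that the tool works on the parsed structure, so it never has to search the program text for the literal keyword PROGRAM-ID.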
The pid tool, trivially specified as it is, is capable of extracting the correct PROGRAM-ID from the cryptic COBOL program of Figure 3. The program compiles with the Microfocus COBOL compiler. Of course, it is an artificial program, but it shows that even though the keyword PROGRAM-ID does not appear literally in it, our simple tool extracts the name: Pid-Bull. This is due to the fact that the tool has knowledge of the language the program is written in. Needless to say, grep-like tools would never have found the correct PROGRAM-ID.

Now that we have an idea of how to conveniently specify tools, we will discuss some tools that have been specified to aid in maintenance and reengineering. We describe them briefly and provide examples of their input/output behavior to illustrate their functionality and power.

out2in Tool
In COBOL we can distinguish so-called in-line and out-of-line PERFORM statements. An in-line PERFORM is an iterative construct that is comparable to, for instance, while-loops in PASCAL. An out-of-line PERFORM is actually a procedure call. The out2in tool transforms an out-of-line PERFORM into an in-line one. This is useful since for most analysis tools it is convenient if the physical order of the code equals the execution order. Of course, it is possible to re-implement this functionality in every tool, but it is, in our opinion, a better idea to make one tool that performs this task, so that other tools can use it if necessary. In Figure 4 an example fragment and its transformed output are shown. We can see that the out-of-line construction at the left-hand side is replaced by the contents of the called paragraph Par-name 1, resulting in an in-line construction. Note that an in-line PERFORM may only contain statements, so it is not a simple textual replacement, hence the omission of the periods after Stat 2 and Stat 3 in the in-line version. In fact, the out2in tool also flattens the code that the out-of-line PERFORM calls. The straightforward specification of the out2in tool consists of about 25 Asf equations.
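The transformation described above can be sketched on a toy syntax tree. The following Python fragment is an illustration under simplifying assumptions, not the Asf specification of out2in: statements are opaque, periods are not modeled, a PERFORM carries at most one clause string (such as an UNTIL condition), and constructs like PERFORM ... THRU are ignored. All names (`Stat`, `PerformOut`, `PerformIn`, `out2in`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class Stat:
    text: str                    # an ordinary statement, kept opaque


@dataclass
class PerformOut:
    target: str                  # out-of-line: PERFORM <paragraph> <clause>
    clause: str = ""


@dataclass
class PerformIn:
    clause: str                  # in-line: PERFORM <clause> <body> END-PERFORM
    body: List["Node"]


Node = Union[Stat, PerformOut, PerformIn]


def out2in(stmts: List[Node],
           paragraphs: Dict[str, List[Node]],
           active: Tuple[str, ...] = ()) -> List[Node]:
    """Replace each out-of-line PERFORM by an in-line one.

    The body of the called paragraph is itself transformed, so nested
    out-of-line PERFORMs are flattened as well.
    """
    result: List[Node] = []
    for s in stmts:
        if isinstance(s, PerformOut):
            if s.target in active:
                # Guard added for this sketch: avoid infinite unfolding.
                raise ValueError(f"recursive PERFORM of {s.target}")
            body = out2in(paragraphs[s.target], paragraphs,
                          active + (s.target,))
            result.append(PerformIn(s.clause, body))
        elif isinstance(s, PerformIn):
            result.append(PerformIn(s.clause,
                                    out2in(s.body, paragraphs, active)))
        else:
            result.append(s)
    return result
```

Because the result is a tree of statements rather than text, the periods that end the sentences of the called paragraph simply have no counterpart in the in-line body, which matches the remark that the transformation is not a textual replacement.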

Add-END-IF Tool
In COBOL74 there is no explicit scope terminator for IF statements. All (open) IF statements are closed by a period. Furthermore, an ELSE branch can be closed by another ELSE that matches a "higher" IF. In COBOL85 an explicit scope terminator END-IF has been added as an optional feature. The Add-END-IF tool (aei for short) transforms the implicit scope terminators into explicit ones. In Figure 5 we give an example fragment and its output.
It will be clear that the insertion of explicit scope terminators improves the legibility of the code. So, this tool aids in reengineering and maintenance. In practice, such transformations are part of migrations from COBOL74 to COBOL85 code. The specification of this tool comprises about 25 equations. We note that tools for the insertion of other scope terminators, like END-READ or END-WRITE, can be specified in a similar way.
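The scoping rules just described (a period closes all open IF statements, and an ELSE following a completed IF ... ELSE belongs to a higher IF) can be illustrated with a small token-level sketch in Python. This is a simplification and not the aei specification: it assumes the input is already split into tokens, that the period is a separate token, that no explicit END-IF occurs in the input, and that the words IF and ELSE do not appear inside literals. The function name `add_end_if` is hypothetical.

```python
from typing import List


def add_end_if(tokens: List[str]) -> List[str]:
    """Insert explicit END-IF tokens into a COBOL74-style token stream."""
    out: List[str] = []
    # One entry per open IF; True once that IF has received its ELSE.
    open_ifs: List[bool] = []
    for tok in tokens:
        word = tok.upper()
        if word == "IF":
            open_ifs.append(False)
            out.append(tok)
        elif word == "ELSE":
            # An ELSE after a completed IF ... ELSE belongs to a higher IF,
            # so the inner IF statements are closed first.
            while open_ifs and open_ifs[-1]:
                open_ifs.pop()
                out.append("END-IF")
            if not open_ifs:
                raise ValueError("ELSE without matching IF")
            open_ifs[-1] = True
            out.append(tok)
        elif word == ".":
            # A period closes every IF that is still open.
            while open_ifs:
                open_ifs.pop()
                out.append("END-IF")
            out.append(tok)
        else:
            out.append(tok)
    return out
```

For the token sequence `IF A IF B S1 ELSE S2 ELSE S3 .` the sketch yields `IF A IF B S1 ELSE S2 END-IF ELSE S3 END-IF .`: the second ELSE closes the inner IF, and the period closes the outer one.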
We observed that when others work with our grammar to specify tools, they need more time to develop them. This is due to the fact that they do not know all the intricacies of the grammar. After explanation of difficult COBOL constructs, the development time decreased.

Generation of Tools
One of the advantages of using the Asf+Sdf Meta-Environment is that, given an Sdf definition of COBOL, it is possible to generate tools for COBOL programs. We already saw the incremental generation of parsers during the development of a COBOL grammar. An elaborate discussion of the generation of such tools is beyond the scope of this paper. See [8] for details on how to generate tools for reengineering purposes using a context-free grammar as input. We mention here that the development of unparsers is supported by an unparser generator tool [9], since unparsers aid in reengineering and maintenance. The unparser generator generates a set of equations which can be fine-tuned in order to obtain the desired layout of COBOL programs. The equations translate a COBOL program to a language-independent expression that can be translated to ASCII text or TeX code. The TeX output can be used to produce correct PROCEDURE DIVISION.

The purpose of an unparser is not only to improve the readability of code; it can also be of use when migrating from one compiler to another. The Microfocus compiler does not mind when, for instance, PROGRAM-ID does not start in the A-area, but another compiler might (see the program in Figure 3). Then an unparser is useful to modify existing code to meet the standards of the new compiler. Another application of an unparser is that code can be made migration-ripe by making it uniform, so that certain assumptions about the form of the code can be made in order to facilitate massive automatic migrations.
The generation and fine-tuning of an unparser for COBOL, given the grammar we discussed in Section 4.4, took less than one week of work. The remaining effort stems from the fact that, although equations are generated for each production rule, half of them need small modifications, since it is impossible to predict, in general, how an arbitrary language has to be formatted.
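The A-area issue mentioned above can be made concrete with a small layout sketch in Python. It is not the generated unparser: it only assumes the fixed reference format of COBOL, in which area A starts at column 8 and area B at column 12, and it takes as input a list of items that some earlier step has already classified as headers or statements. The names `layout`, `AREA_A_COLUMN`, and `AREA_B_COLUMN`, and the two item kinds, are hypothetical; line-length limits and continuation lines are not handled.

```python
from typing import Iterable, List, Tuple

AREA_A_COLUMN = 8    # division, section and paragraph headers start here
AREA_B_COLUMN = 12   # statements start here


def layout(items: Iterable[Tuple[str, str]]) -> List[str]:
    """Place each (kind, text) item at the start column of its area."""
    lines: List[str] = []
    for kind, text in items:
        if kind == "header":
            column = AREA_A_COLUMN
        elif kind == "statement":
            column = AREA_B_COLUMN
        else:
            raise ValueError(f"unknown item kind: {kind}")
        # Columns before the chosen area are filled with blanks.
        lines.append(" " * (column - 1) + text.strip())
    return lines
```

Applied to a program whose PROGRAM-ID paragraph starts in an arbitrary column, such a step would move it into the A-area, which is the kind of normalization needed when the target compiler is stricter than the original one.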

Conclusions
In this paper we have argued that, in order to specify reengineering and maintenance tools for COBOL, it is convenient to build such tools on top of a COBOL definition that is geared towards the specific COBOL code that has to be processed by them. We have shown that it is sensible to separate the development of the tools from the development of the grammar that the tools use. To aid in the development of the grammars, we have presented a general method to specify COBOL that uses on the one hand an ANSI COBOL85 grammar plus extensions that we found in IBM manuals, and on the other hand real code that has to be reengineered or maintained. Since there are many COBOL dialects, such general methods are useful in the development of tools for COBOL. It turned out that specifying tools on top of such grammars is now a convenient task. In fact, in [8] we show that tools can now be generated from the grammar.
We used modular techniques to specify the grammars and the tools. This guarantees that (parts of) grammars and tools can be reused without problems, since those techniques support the definition of arbitrary context-free grammars. So we do not have the traditional shift-reduce and reduce-reduce conflicts that come with Lex+Yacc-like formalisms.
Concluding, we hope to have shown that the quality of reengineering and maintenance tools drastically improves when these tools are based on a unified context-free grammar of the language in which the code has been written.

Figure 3: Cryptic example program.

5 Tools Based on COBOL Definitions

Figure 4: Out-of-line to in-line transformation.