Grid Application Performance Prediction: a Case Study in BROADEN

The BROADEN (Business Resource Optimisation for Aftermarket and Design on Engineering Networks) project aims to build a Rolls-Royce Grid as a proving ground for utilising Grid services technology. One of the central applications in this project is XTO (eXtract Tracked Orders), which is used to facilitate distributed diagnostics support for aircraft engines. One of the key elements in BROADEN is the ability to predict the run-time of the XTO application. In this paper a SNAP-based (Service Negotiation and Acquisition Protocol) resource broker is extended to include a QoS (Quality of Service) Manager which utilises a run-time historical database that allows such prediction to take place. Performance results, presented in terms of the predicted run-times that this manager provides, show the potential of this approach on a Grid test-bed.


INTRODUCTION
Grid computing has the potential to provide users with high performance, utility computing in a seamless virtual organisation (VO) (Foster 2003). In such environments users aim to negotiate with resource providers, often belonging to different administrative domains, in order to achieve reliable, predictable application performance. More specifically, users will typically have commitments and performance requirements (e.g. time deadlines) attached to their applications. A key goal of resource management within the Grid is to deliver commitments and assurances on top of the allocated resources, for example compute and storage resources, security and network performance, and an assurance that a specified time deadline can be met. This is of particular importance to users from a commercial environment, since a failure to complete an application in a timely fashion may have severe implications. An example is the BROADEN (Business Resource Optimisation for Aftermarket and Design on Engineering Networks) project (Fletcher 2006), which is used in this paper to provide our experimental evaluation scenario.
The task of providing commitments and assurances is non-trivial, since a Grid system can integrate heterogeneous resources with varying quality and availability. This places importance on the system's ability to monitor the state of these resources. The Grid is a dynamic system in which resources are subject to change due to system performance degradation, system failure, etc.
This paper focuses on Quality of Service provision in the Grid. To do so, it is important to address the dynamic and heterogeneous nature of the Grid. This is a key issue: 1) before an application begins execution, for best resource selection; 2) during run-time, where support for adaptation is provided; and 3) post run-time, as information regarding the application, run-time, environment, etc. is stored as historical data that can be used in future application runs for knowledge extraction, helping to determine whether the user's requirement will be met. It is envisaged that Grid implementations will operate within an economic framework (Buyya 2005). This means cost/performance trade-off decisions must be made pre run-time, requiring mechanisms to support performance prediction. The paper looks specifically into post run-time information regarding the application execution environment, including execution time. A key feature of our approach is that the user is not required to install additional software, or to make alterations to their code that require specialist Grid computing knowledge. This is achieved through a Quality of Service (QoS) Manager which is integrated within a Grid resource broker (Haji 2005).
The QoS Manager implementation in a real Grid environment is described in detail, and performance results are presented in terms of the predicted execution times that this manager provides. Our results show the potential of this approach on a Grid test-bed. The application used in these experiments is XTO (eXtract Tracked Orders) (Austin 2003), an engine data analysis tool that processes data monitored and recorded by the QUICK on-engine system during a flight (Nairac 1999), and is of considerable importance in the BROADEN project.
The paper is organised as follows. Related work is described in Section 2. This is followed in Section 3 by a discussion of the BROADEN project, which provides motivation and context for the research and introduces the application used in the experiments. Section 4 presents the QoS manager design and the methods used to predict application run-time. Section 5 presents experimental results, highlighting the key issues that must be accounted for in a Grid environment. Conclusions and future work are presented in Section 6.

RELATED WORK
Quality of Service (QoS) provision through resource management in the Grid has received considerable attention from the research community (Nabrzyski 2004). A number of projects have investigated scheduling on the Grid, including Nimrod-G (Buyya 2000), Condor-G (Thain 2005), Gridway (Huedo 2006) and Globus (Foster 2005). A number of works have also explored the problem of deadline scheduling in real-time systems (Aydin 1999). However, they mostly consider deadline parameters for individual jobs and are restricted to centralised resource management schemes and resources within a single administrative domain.
AppLeS (Berman 2003) has been developed to enable resource selection based on meeting user QoS requirements (e.g. execution time, turnaround time) and to adapt to changes in resource availability. However, the performance prediction used is developed with specific applications in mind. In addition, applications must be "AppLeS enabled", which means making alterations to the user's application code. GrADS (Grid Application Development Software) (Berman 2001) aims to provide a framework to enable efficient Grid application execution. It supports run-time monitoring and comparison with performance contracts, so that appropriate adaptive actions (e.g. application tuning, migration) can be performed during run-time. However, its approach to performance prediction differs in that historical compute resource performance and load variations are not accounted for (Vraalsen 2001). In addition, like AppLeS, the use of GrADS requires the user's application to be altered in order to utilise the GrADS APIs.
Faerman et al. (Faerman 1999) present an application performance prediction method for determining file transfer times. However, the paper does not discuss resource selection based on the behaviour (i.e. past and current performance) of compute resources. In (Smith 1998), methods for predicting run-times for parallel applications using historical information are discussed. Execution times of 'similar' applications run in the past are used to estimate the execution time. The use of historical information to determine resource reliability is not considered. Application of such methods to Grid scheduling is considered in (Smith 2004). Specifically, run-time prediction of pending jobs is used to estimate execution start times by predicting queue waiting times. While the results indicate that this approach is useful in decreasing waiting times, it does not provide information as to the anticipated execution time of the user's job. The performance prediction tool PACE (Performance Analysis and Characterisation Environment) (Nudd 2004) uses a combination of source code analysis and hardware modelling to provide an application performance prediction. The hardware models are static; this provides the advantage of reusability but does not account for dynamic changes to resource performance. The application of three learning techniques to estimate the resource requirements of a given application run before a scheduling decision is made is found in (Kapadia 1999).

BROADEN OVERVIEW
The importance of supporting run-time prediction can be seen by considering the requirements of a commercial Grid project such as BROADEN, a multi-site, DTI funded project which aims to build a Rolls-Royce Grid as a proving ground for utilising Web/Grid services technology to fully exploit available IT resources. This is to support the following:
• Integrating diagnostic tools for health monitoring across development test and in-service aero-engines;
• Large scale and innovative numerical simulations of high fidelity Computational Fluid Dynamics (CFD) and design optimisation;
• Large scale agent-based modelling of aftermarket processes incorporating logistics and the supply chain.
BROADEN focuses on three distinct development areas: distributed diagnostics for engine health monitoring, high-performance computing for design, and agent-based software development for business modelling. On the distributed diagnostics side it builds on the DAME (Distributed Aircraft Maintenance Environment) (Austin 2003) e-Science project, which aimed to develop a generic test bed for distributed diagnostics. The generic framework is deployed in a proof of concept demonstrator in the context of maintenance applications for civil aerospace engines. The essential theme of DAME was the use of real-time intelligent feature extraction, intelligent data mining and decision support techniques, where expertise and software tools are distributed across the Grid. The enormity of the databases and the need for distributed access to the data make this a particularly challenging problem for the Grid. DAME has demonstrated the potential of Grid-based diagnostics for health-monitoring applications and shown how data growth problems could potentially be addressed within the scope of distributed data assets.
Access to distributed data is typically as important as access to distributed computational resources. Applications for distributed diagnostics require transfers of large amounts of data between storage systems, and access to large amounts of data by many geographically distributed applications and users for analysis and visualisation.
XTO (eXtract Tracked Orders) forms the central application within the project and is used to provide distributed diagnostics support for aircraft engines. XTO analyses vibration data produced by aircraft engines for features that can be used to support the process of making a fault diagnosis. The data is engine information monitored and recorded by an on-engine system during a flight; it is stored in a file system and consists of a single control file and a number of binary files, all in the Rolls-Royce proprietary ZMOD format.
A simple scenario, which forms part of the more sophisticated process involved in using BROADEN to support engine fault detection and diagnosis, is the following:
1. Data is downloaded from an aircraft;
2. A novel event is flagged and marked in the data set;
3. A search through historical data fails to identify the cause of the anomaly. Therefore the system operator launches a feature analysis session, which uses the XTO application;
4. A diagnosis is made, based on the features detected by this analysis.
In this scenario, timely completion of the XTO application (and any other applications used in a diagnosis session) is critical, since failure to make a diagnosis within the engine turnaround time could result, for example, in flight delays, leading to a loss of revenue. The aim of this work is therefore to investigate ways of providing a run-time prediction mechanism for the XTO application under various plugins and execution environments.

BROADEN Tool, Service and Data Architecture
A generic distributed tool, service and data architecture is being developed with aero-engine health monitoring as the exemplar domain (Figure 1).
The architecture permits the integration of tools into the system with the minimum of change to the tools. Underlying the use of tools is the provision of services, which may be centralised, distributed or autonomous. The volume of ZMOD data downloadable from the QUICK system may be up to 1 GB per engine per flight. It is envisaged that the storage requirement of an aero-engine application alone for an operational day is around 1 TB. Therefore the architecture permits the use of distributed nodes to store data (and services), and ensures that other issues such as security are addressed. The Storage Resource Broker (SRB) (SDSC 2006) provides the mechanism to virtualise the storage across nodes. The scenario used for BROADEN envisages geographically dispersed users using a workbench and tools which access distributed services and data. A middleware layer addresses work orchestration issues such as starting, stopping, monitoring and dynamic workflow capabilities. This is achieved through: 1) the Enterprise Service Bus (Chappell 2004), which provides XML messaging, XML transformation, intelligent routing and connectivity, and is used within the architecture as a simple messaging and translation mechanism between the tools of the system; and 2) a Process Management Controller (PMC), a lightweight middleware application for managing services which act on distributed data in a generic manner and for providing high performance communication (Austin 2005).
Of particular interest is the resource broker, which manages the selection of Grid resources in keeping with specified Service Level Agreements (SLAs) (see Section 4).
An overview of the tools, services, and data including the issues and rationale behind the development of the architecture is found in (Fletcher 2006).

QUALITY OF SERVICE PROVISION
BROADEN applications can be executed on demand using computational resources that are hidden from the user. The type of application can vary from computationally intensive simulations to high priority, time critical executions. The resources on which these services are executed can vary in quality and reliability, particularly if demand for these application services is high. In this situation competition for resources is high and they can be easily swamped, leading to a drop in application performance. If this coincides with the execution of a time critical application, the results may be delayed. Therefore QoS requirements may be attached to applications in order to guarantee timely execution. In (Padgett 2006) we proposed a Service Level Agreement (SLA) management system incorporating resource reservation and run-time adaptation. For this system to function, commitments and assurances are specified using SLAs. An SLA is a contract between user and provider, stating the expectations that exist between them.
Sharing resources in a Grid is complicated in that it requires the ability to bridge the differing policy requirements of the resource owners, in order to create a consistent cross-organisation policy domain that delivers the necessary capability to the end user while respecting the policy requirements of each resource owner.
Further complicating the management of Grid resources is the fact that Grid jobs often require the concurrent allocation of multiple resources. This need for simultaneous resource usage necessitates a structured framework in which resources can be co-ordinated across administrative domains. This is addressed through the use of a resource broker developed within the SNAP framework (Czajkowski 2002). The SNAP-based resource broker (Haji 2005), unlike the other brokers presented in Section 2, caters for the three essential fabric QoS phases, namely pre-runtime, runtime and post-runtime, through its TSLA (Task Service Level Agreement), RSLA (Resource Service Level Agreement) and BSLA (Binding Service Level Agreement). An overview of the broker can be seen in Figure 2. The work presented in this paper extends the SNAP-based broker by integrating a QoS manager, which has the responsibility of gathering historical data on the run-times of past job executions. This is achieved through a run-time historical database that captures information as records, including CPU load, speed and type, cache size, hostname, etc., as well as the actual execution run-time. The responsibility of the QoS manager is to apply various statistical approaches (such as those described later in this section) to the data stored in the historical database and to predict the two most important factors surrounding the XTO application: 1) the run-time, and 2) the storage capacity needed to hold the volume of output data.
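As an illustration of the kind of record the run-time historical database might hold, the following is a minimal sketch in Python. The record fields follow the attributes listed above (CPU load, speed and type, cache size, hostname and run-time); the class names, field names, units and the in-memory store are assumptions made for illustration and do not describe the broker's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class RunRecord:
    """One record of a past XTO run (hypothetical field names and units)."""
    engine_id: str              # e.g. "7101xyz1"
    plugins: Tuple[str, ...]    # plugin combination used for the run
    hostname: str
    cpu_type: str
    cpu_speed_mhz: float
    cpu_load: float
    cache_size_kb: int
    run_time_s: float           # actual execution run-time
    output_size_mb: float       # volume of output data produced
    finished_at: datetime       # when the run completed


class RunHistory:
    """In-memory stand-in for the run-time historical database."""

    def __init__(self) -> None:
        self._records: Dict[Tuple[str, Tuple[str, ...]], List[RunRecord]] = {}

    def add(self, record: RunRecord) -> None:
        # Records are grouped by engine and by (order-independent) plugin set.
        key = (record.engine_id, tuple(sorted(record.plugins)))
        self._records.setdefault(key, []).append(record)

    def run_times(self, engine_id: str, plugins: Tuple[str, ...]) -> List[float]:
        """Past run-times for one engine/plugin combination, oldest first."""
        key = (engine_id, tuple(sorted(plugins)))
        records = sorted(self._records.get(key, []), key=lambda r: r.finished_at)
        return [r.run_time_s for r in records]
```

Grouping records by engine and plugin combination keeps each history homogeneous, so that a prediction for a given run is drawn only from comparable past runs.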
In the following, the approach taken is to formulate a run-time prediction for the XTO application. A performance methodology is therefore developed to choose the most appropriate method to predict run-time. The run-time prediction from observations of past runs can be used by the BROADEN end-user. This performance information can also be used by other functions of the system, including scheduling and contributing to the guarantee of QoS contracts. The run-time prediction is based on collecting information for each XTO application run and then applying predictive methods to the previous observations. The gathering of performance information does not affect the behaviour of the BROADEN system in any way. Three prediction methods are used to estimate future run-times (Dushay 1999):
1. Last observation: the most recent single performance observation is taken as the prediction, on the basis that the last performance value is the most likely to reflect the behaviour of future runs:
$$P_n = V$$
2. Sample average: the prediction is the mean of the past performance values within a sample set. This set is defined by a sliding window of size $x$, which corresponds to the $x$ most recent observations. This method is used when performance information is produced on a regular basis and old values become less relevant. The average is taken over at most a maximum window size:
$$P_n = \frac{1}{x} \sum_{i=1}^{x} V_i$$
where $V_1, \ldots, V_x$ are the $x$ most recent observations.
3. Low pass filter: recent performance data constitutes a better predictor than older data. This method uses an exponentially decaying function to obtain an average of recent performance behaviour, using the following formula:
$$P_n = W \, P_{n-1} + (1 - W) \, V$$
where:
$P_n$ is the prediction and the new value of the low pass filter;
$P_{n-1}$ is the previous filter value;
$V$ is the most recent performance observation value;
$W$ is the weighting parameter, a value between 0 and 1. It is generally set to 0.95, so that the weight given to an observation decreases as it grows older, which improves prediction accuracy.
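To make the effect of the weighting parameter concrete, the low pass filter can be unrolled. The derivation below is a sketch that assumes the usual recurrence $P_n = W P_{n-1} + (1 - W) V_n$, with $V_n$ denoting the observation that produces $P_n$ and $P_0$ the initial filter value; the indexing of the observations is an assumption introduced here for the derivation.

```latex
\begin{align*}
P_n &= W P_{n-1} + (1 - W) V_n \\
    &= W \bigl( W P_{n-2} + (1 - W) V_{n-1} \bigr) + (1 - W) V_n \\
    &= (1 - W) \sum_{k=0}^{n-1} W^{k} V_{n-k} + W^{n} P_0 .
\end{align*}
% With W = 0.95 the newest observation has weight 1 - W = 0.05,
% the one before it 0.05 * 0.95 = 0.0475, the next 0.05 * 0.95^2 = 0.045125,
% and so on: each step back in time multiplies the weight by W.
```

Under this assumption, each observation's contribution shrinks geometrically with its age, which is the sense in which older values carry less weight than recent ones.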
All three methods are simple, use a combination of past performance data for a given application run and require little processing time, in the order of seconds depending on the size of the historical database.
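The three predictors can be sketched in a few lines of Python. This is a minimal illustration, not the QoS manager's implementation: the function names are hypothetical, the history is assumed to be a list of past run-times ordered oldest first, the low pass filter is assumed to follow the recurrence $P_n = W P_{n-1} + (1 - W) V$, and seeding the filter with the oldest observation is a choice made here for simplicity.

```python
from typing import Sequence


def last_observation(history: Sequence[float]) -> float:
    """Predict the next run-time as the most recent observed run-time."""
    if not history:
        raise ValueError("history must contain at least one observation")
    return history[-1]


def sample_average(history: Sequence[float], window: int) -> float:
    """Predict the next run-time as the mean of the `window` most recent values."""
    if not history:
        raise ValueError("history must contain at least one observation")
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = history[-window:]  # fewer than `window` values if history is short
    return sum(recent) / len(recent)


def low_pass_filter(history: Sequence[float], weight: float = 0.95) -> float:
    """Predict the next run-time with an exponentially weighted filter.

    Applies P_n = weight * P_(n-1) + (1 - weight) * V over the history,
    seeding the filter with the oldest observation (an assumption).
    """
    if not history:
        raise ValueError("history must contain at least one observation")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie between 0 and 1")
    prediction = history[0]
    for value in history[1:]:
        prediction = weight * prediction + (1.0 - weight) * value
    return prediction
```

Each function makes a single pass over at most the whole history, which is consistent with the small processing cost noted above.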

Overview and Objectives
The prediction methods described in Section 4 have different capabilities in terms of the concepts that they can represent. Experiments were designed on a Grid test-bed to highlight their accuracy as well as their differences.
The experiments made use of ten different engine data sets with various input sizes, each using four different plugins (Fractional, Multiple, Residual_Energy and Step_Change), giving a total of fifteen different plugin combinations. The experiments ran for a period of several weeks at various times of the day, and generated over 45,000 statistical records. The records were stored in the run-time historical database for use by the QoS manager (see Figure 2). Due to space limitations, results for three engine data sizes (large, medium and small) are presented in this paper, referred to as 7101xyz1, 7101xyz2 and 7101xyz3 respectively. With this in mind, the experiments are designed on the basis of the following objectives:
• To gather statistical data on XTO application run-time based on engine data size using various plugins, and to observe the accuracy of the prediction algorithms used, namely last observation, sample average and low pass filter;
• To gather statistical data on the XTO application output data generated after each experimental run.
The experiments were performed on a Grid test-bed consisting of 10 machines, each having a Pentium IV processor (1.2 GHz) and 256 MB RAM. The operating system is Linux 2.4 and all machines have Globus 4 (Foster 2005) installed, with Sun Grid Engine (SGE) as the job scheduler.
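The count of fifteen is consistent with taking every non-empty subset of the four plugins ($2^4 - 1 = 15$). Assuming that is how the combinations were formed, which is not stated explicitly here, they can be enumerated with the following sketch; the function name is hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

PLUGINS = ("Fractional", "Multiple", "Residual_Energy", "Step_Change")


def plugin_combinations(plugins: Sequence[str] = PLUGINS) -> List[Tuple[str, ...]]:
    """Return every non-empty subset of the plugins, smallest subsets first."""
    return [
        combo
        for size in range(1, len(plugins) + 1)
        for combo in combinations(plugins, size)
    ]


if __name__ == "__main__":
    for combo in plugin_combinations():
        print("+".join(combo))
```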

Performance Results
The results in Figures 3-5 show the output data generated for three engine data sizes (large, medium and small) with 15 different plugin combinations. Clearly there is a similar pattern in all three figures (apart from the obvious difference in the scale of the output data size), with the single plugin Multiple generating the largest output data and Residual_Energy generating the smallest. It is this type of data that the QoS manager needs to capture in order to predict and reserve sufficient disk space for future runs. This is particularly important for any new engine data introduced into the BROADEN system. The result of one plugin will enable the QoS manager to estimate the output data for the remaining 14 combinations that will be applied to that engine.
Figures 6-8 present four performance results per plugin combination, namely the actual run-time and the three predicted run-times generated by the QoS manager prior to job execution. It can be seen in Figure 6 that the most accurate prediction for the majority of plugin combinations was the last observation. However, in Figures 7 and 8 the best prediction method was the low pass filter. From the experiments it is observed that the most appropriate method for run-time prediction depends on how recently the data available to the QoS manager was captured. If it is recent in relation to the actual application execution (within a range of hours) then the last observation method is appropriate; otherwise the low pass filter provides a better prediction. Further, there is a pattern in the performance results presented in Figures 6-8; however, it is not as pronounced as that in Figures 3-5. This shows that output data size provides little insight into run-time prediction. Conversely, Figures 6-8 do provide the QoS manager with information to estimate the run-time of a new engine once one set of plugin results has been entered into the system, as the differences in run-time across all plugins for a single engine data set are marginal.
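The observation about recency suggests a simple selection rule, sketched below in Python. This is an illustration under stated assumptions rather than the QoS manager's actual logic: the function name is hypothetical, the six-hour threshold is a placeholder (the results only indicate "within a range of hours"), and the low pass filter is assumed to follow the recurrence $P_n = W P_{n-1} + (1 - W) V$, seeded with the oldest observation.

```python
from datetime import datetime, timedelta
from typing import Sequence, Tuple

# Hypothetical threshold; the experiments only indicate "within a range of hours".
RECENCY_THRESHOLD = timedelta(hours=6)


def choose_prediction(
    run_times: Sequence[float],
    last_run_finished: datetime,
    now: datetime,
    weight: float = 0.95,
    threshold: timedelta = RECENCY_THRESHOLD,
) -> Tuple[str, float]:
    """Pick a prediction method based on the age of the newest observation.

    `run_times` holds past run-times, oldest first. Returns the name of the
    method used and the predicted run-time.
    """
    if not run_times:
        raise ValueError("run_times must contain at least one observation")
    if now - last_run_finished <= threshold:
        # Newest observation is recent: use it directly.
        return "last_observation", run_times[-1]
    # Otherwise smooth over the history with the low pass filter.
    prediction = run_times[0]
    for value in run_times[1:]:
        prediction = weight * prediction + (1.0 - weight) * value
    return "low_pass_filter", prediction
```

In practice the threshold would need to be tuned against the historical database, since the point at which the last observation stops being representative is likely to depend on how quickly resource load changes.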

CONCLUSION AND FUTURE WORK
A key goal of Grid resource management within BROADEN is to deliver commitments and assurances on top of the allocated resources, including an assurance, given before the application runs, that a specified time deadline can be met. This is of particular importance to BROADEN users from a commercial environment, since a failure to complete an application in a timely fashion may have severe implications. The approach taken in this paper is to predict the XTO application run-time, as well as the size of the output data generated, from a Grid application's perspective, based on past performance data. A QoS manager makes use of three prediction methods, which may help a resource broker in scheduling decisions. This informed choice further contributes to guaranteeing QoS in the use of Grid middleware.
Experiments have been designed to gather statistical data based on engine data size using various plugins and to observe the accuracy of the prediction algorithms used, namely last observation, sample average and low pass filter. Using past performance data, the experimental results have been analysed. It has been found that the accuracy of the predictors differed somewhat. The low pass filter was the most accurate method, with the last observation being only slightly less accurate but nevertheless appropriate if it is recent in relation to the actual application execution (within a range of hours).
The primary utility of this work is that it quantifies the accuracy of different prediction methods as well as the consistency of observed data from an operational Grid application. None of the prediction algorithms we used required on-going heavy computation, and the QoS manager does benefit the users of the BROADEN decision support system. The XTO application, which was previously provided on a best-effort basis, can now be offered with QoS guarantees.
The broker has recently been deployed on the Rolls-Royce Grid, which is heterogeneous in nature and consists of different clusters with various architectures and operating systems. Further experiments will therefore be designed to gather statistical data on XTO application run-time, taking into consideration dynamic resource performance and volatility. A thorough evaluation of the historical observed data (CPU load, speed and type, cache size, hostname, run-time) is envisaged in order to further investigate performance prediction and node configurations.
Future work will consist of the integration of the QoS manager within an SLA management system to monitor run-time application performance. Further, the broker will be enhanced to cater for other Rolls-Royce applications, such as compute intensive Message Passing Interface (MPI) Computational Fluid Dynamics.

FIGURE 1: BROADEN Tool, Service and Data Architecture.

FIGURE 3: Output data generated for engine number 7101xyz1 (large), based on various plugin combinations.
FIGURE 6: Actual run-time compared to the predicted times for engine number 7101xyz1 (large), based on various plugin combinations.

FIGURE 8: Actual run-time compared to the predicted times for engine number 7101xyz3 (small), based on various plugin combinations.