6
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Crystal polymorphism in fragment-based lead discovery of ligands of the catalytic domain of UGGT, the glycoprotein folding quality control checkpoint

      research-article

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          None of the current data processing pipelines for X-ray crystallography fragment-based lead discovery (FBLD) consults all the information available when deciding on the lattice and symmetry (i.e., the polymorph) of each soaked crystal. Often, X-ray crystallography FBLD pipelines either choose the polymorph based on cell volume and point-group symmetry of the X-ray diffraction data or leave polymorph attribution to manual intervention on the part of the user. Thus, when the FBLD crystals belong to more than one crystal polymorph, the discovery pipeline can be plagued by space group ambiguity, especially if the polymorphs at hand are variations of the same lattice and, therefore, difficult to tell apart from their morphology and/or their apparent crystal lattices and point groups. In the course of a fragment-based lead discovery effort aimed at finding ligands of the catalytic domain of UDP–glucose glycoprotein glucosyltransferase (UGGT), we encountered a mixture of trigonal crystals and pseudotrigonal triclinic crystals—with the two lattices closely related. In order to resolve that polymorphism ambiguity, we have written and described here a series of Unix shell scripts called CoALLA ( crystal p olymorph and ligand likelihood-based assignment). The CoALLA scripts are written in Unix shell and use autoPROC for data processing, CCP4-Dimple/ REFMAC5 and BUSTER for refinement, and RHOFIT for ligand docking. The choice of the polymorph is effected by carrying out (in each of the known polymorphs) the tasks of diffraction data indexing, integration, scaling, and structural refinement. The most likely polymorph is then chosen as the one with the best structure refinement R free statistic. The CoALLA scripts further implement a likelihood-based ligand assignment strategy, starting with macromolecular refinement and automated water addition, followed by removal of the water molecules that appear to be fitting ligand density, and a final round of refinement after random perturbation of the refined macromolecular model, in order to obtain unbiased difference density maps for automated ligand placement. We illustrate the use of CoALLA to discriminate between H3 and P1 crystals used for an FBLD effort to find fragments binding to the catalytic domain of Chaetomium thermophilum UGGT.

          Related collections

          Most cited references75

          • Record: found
          • Abstract: found
          • Article: found
          Is Open Access

          XDS

          1. Functional specification The program package XDS (Kabsch, 1988a ▶,b ▶, 1993 ▶, 2010 ▶) was developed for the reduction of single-crystal diffraction data recorded on a planar detector by the rotation method using monochromatic X-rays. It includes a set of three programs. XDS accepts a sequence of adjacent non-overlapping rotation images from a variety of imaging-plate, CCD, pixel and multiwire area detectors, infers crystal symmetry and metrics and produces a list of corrected integrated intensities of the reflections occurring in the images in a nearly automatic way. The program assumes that each image covers the same positive amount of crystal rotation and that the rotation axis, incident beam and crystal intersect at one point, but otherwise imposes no limitations on the detector position, on the directions of the rotation axis and incident beam or on the oscillation range covered by each image. XSCALE places the data sets obtained from processing with XDS on a common scale, optionally merges them into one or several sets of unique reflections and reports their completeness and the quality of the integrated intensities. It corrects the data for absorption effects, sensitivity variations in the detector plane and radiation damage. Optionally, it can correct reflections individually for radiation damage by extrapolation to their initial intensities at zero dose. XDSCONV converts reflection data files as obtained from XDS or XSCALE into various formats required by software packages for crystal structure determination. It can generate test reflections or inherit previously selected ones which are used for the calculation of a free R factor to monitor the progress of structure refinement. 2. XDS XDS is organized into eight steps (major subroutines) which are called in succession by the main program. Information is exchanged between the steps by files (see Table 1 ▶), which allows the repetition of selected steps with a different set of input parameters without rerunning the whole program. The files generated by XDS are either ASCII-type files that can be inspected and modified using a text editor or binary control images saved as a byte-offset variant of the CBFlib format (Bernstein & Hammersley, 2005 ▶; Bernstein & Ellis, 2005 ▶). Such images are indicated by the file-name extension .cbf and can be looked at using the open-source program XDS-Viewer written by Michael Hoffer. All files have a fixed name defined by XDS, which makes it mandatory to process each data set in a newly created directory in order to avoid name clashes. Clearly, one should not run more than one XDS job at a time in the same directory. Output files affected by rerunning selected steps (see Table 1 ▶) should also first be given another name if their original contents are meant to be saved. Data processing begins by copying an appropriate input file into the new directory. Input-file templates are provided with the XDS package for a number of frequently used data-collection facilities. The copied input file must be renamed XDS.INP and edited to provide the correct parameter values for the actual data-collection experiment. All parameters in XDS.INP are named by keywords containing an equals sign as the last character and many of them will be mentioned here in context in order to clarify their meaning. Execution of XDS (JOB=XDS) invokes each of the eight program steps as described below. The results and diagnostics from each step are saved in files with the extension .LP attached to the program-step name. These files should always be studied carefully to see whether processing was satisfactory or, in the case of failure, to find out what could have gone wrong. 2.1. XYCORR This program step calculates a lookup table of additive spatial corrections at each detector pixel which is stored in the files X-CORRECTIONS.cbf, Y-CORRECTIONS.cbf. Often, the data images have already been corrected for geometrical distortions, in which case XYCORR produces tables of zeros. For spiral read-out imaging-plate detectors the small corrections resulting from radial (ROFF=) and tangential (TOFF=) offset errors of the scanner are computed. For some multiwire and CCD detectors that deliver geo­metrically distorted images, corrections are derived from a calibration image (BRASS_PLATE_IMAGE=file name). This image displays the response to a brass plate containing a regular grid of holes which is mounted in front of the detector and illuminated by an X-ray point source. Clearly, the source must be placed exactly at the location to be occupied by the crystal during the actual data collection, as photons emanating from the calibration source are meant to simulate all possible diffracted beam directions. For visual control, spots that have been located and accepted from the brass-plate image by XYCORR are marked in the file FRAME.cbf. The following problems can be encountered in this step. (i) A misplaced calibration source can lead to an incorrect lookup table, impairing the correct prediction of the observed diffraction pattern in subsequent program steps. (ii) An underexposed calibration image can result in an incomplete and unreliable list of calibration spots. 2.2. INIT INIT determines three lookup tables, saved as the files BLANK.cbf, GAIN.cbf and BKGINIT.cbf, that are required by the subsequent processing steps for classifying pixels in the data images as background or belonging to a diffraction spot (‘strong’ pixels). These tables should be inspected with the XDS-Viewer program. BLANK.cbf contains a lookup table of the detector noise. It is determined from a specific image recorded in the absence of X-rays (DARK_CURRENT_IMAGE=) or is assumed to be a constant derived from the mean recorded value in each corner of the data images. GAIN.cbf codes for the expected variation of the pixel contents in the background region of a data image. The variance of the contents of a pixel in the background region is GAIN·(pixel contents − detector noise). The variance is determined from the scatter of pixel values within a rectangular box (NBX=, NBY=) of size (2·NBX + 1)·(2·NBY + 1) centred at each image pixel in succession. The table GAIN.cbf is used to distinguish background pixels from ‘strong’ pixels that are part of a diffraction spot. BKGINIT.cbf estimates the initial background at each pixel from a few data images specified by the input parameter BACKGROUND_RANGE=. The lookup table is obtained by adding the X-ray background from each image. Shaded regions on the detector (i.e. from the beamstop), pixels outside a user-defined circular region (TRUSTED_REGION=) or pixels with an undefined spatial correction value are classified as untrustworthy and marked by −3. The following problem can be encountered in this step. Some detectors with insufficient protection from electromagnetic pulses may generate badly spoiled images whose inclusion leads to a completely wrong X-ray background table. These images can be identified in INIT.LP by their un­expected high mean pixel contents and this step should be repeated with a different set of images. 2.3. COLSPOT COLSPOT locates strong diffraction spots occurring in a subset of the data images and saves their centroids in the file SPOT.XDS. The data subset is defined by contiguous image number ranges, where each range is specified by the keyword SPOT_RANGE=. As described in Kabsch (2010 ▶), spots are defined as sets of ‘strong’ pixels that are adjacent in three dimensions. The classification of ‘strong’ pixels is controlled by the decision constants STRONG_PIXEL= and BACKGROUND_PIXEL=. If the total number of ‘strong’ pixels occurring in the specified data images exceeds the upper limit as given by the input parameter MAXIMUM_NUMBER_OF_STRONG_PIXELS=, the weaker ones are discarded. A spot is accepted if it contains a minimum number of ‘strong’ pixels (MINIMUM_NUMBER_OF_PIXELS_IN_A_SPOT=) and if the spot centroid is sufficiently close to the location of the strongest pixel in the spot (SPOT_MAXIMUM-CENTROID=). The following problem can be encountered in this step. Sharp edges such as ice rings in the images can lead to an excessive number of ‘strong’ pixels being erroneously classified as contributing to diffraction spots. These aliens could prevent IDXREF from recognizing the crystal lattice. 2.4. IDXREF IDXREF uses the initial parameters describing the diffraction experiment as provided by XDS.INP and the observed centroids of the spots from the file SPOT.XDS to find the orientation, metric and symmetry of the crystal lattice and refines all or a specified subset of these parameters [input parameter REFINE(IDXREF)=] . On return, the complete set of parameters are saved in the file XPARM.XDS and the original file SPOT.XDS is replaced by a file of identical name, now with indices attached to each observed spot. Spots not belonging to the crystal lattice are given indices 0, 0, 0. XDS considers the run to be successful if the coordinates of at least 70% of the given spots can be explained with reasonable accuracy (input parameter MAXIMUM_ERROR_OF_SPOT_POSITION=); otherwise, XDS will stop with an error message. Alien spots often arise because of the presence of ice or small satellite crystals and continuation of data processing may still be meaningful. In this case, XDS is called again with an explicit list of the subsequent steps specified in XDS.INP (input parameter JOB=DEFPIX XPLAN INTEGRATE CORRECT). IDXREF uses the methods described in Kabsch (1993 ▶, 2010 ▶) to determine a crystal lattice that explains the observed locations of the diffraction spots listed in the file SPOT.XDS. Firstly, a reciprocal-lattice vector referring to the unrotated crystal is computed from each observed spot centroid. Differences between any two reciprocal-lattice vectors that are above a specified minimal length (SEPMIN=) are accumulated in a three-dimensional histogram. These difference vectors will form clusters in the histogram, since there are many different pairs of reciprocal-lattice vectors of nearly identical vector difference. The clusters are found as maxima in the smoothed histogram (CLUSTER_RADIUS=) and a basis of three linearly independent cluster vectors is selected that allows all other cluster vectors to be expressed as nearly integral multiples of small magnitude with respect to this basis. The basis vectors and the 60 most populated clusters with attached indices are listed in IDXREF.LP. If many of the indices deviate significantly from integral values, the program is unable to find a reasonable lattice basis and all further processing will be meaningless. If the space group and unit-cell parameters are specified, a reduced cell is derived and the reciprocal-basis vectors found above are reinterpreted accordingly; otherwise, a reduced cell is determined directly from the reciprocal basis. The parameters of the reduced cell, the coordinates of the reciprocal-basis vectors and their indices with respect to the reduced cell are reported. Based on the orientation and metric of the reduced cell now available, IDXREF indexes up to 3000 of the strongest spots using the local-indexing method. This method considers each spot as a node of a tree and identifies the largest subtree of nodes which can be assigned reliable indices. The number of reflections in the ten largest subtrees is reported and usually shows a dominant first tree corresponding to a single lattice, whereas alien spots are found in small subtrees. Reflections in the largest subtree are used for initial refinement of the basis vectors of the reduced cell, the incident-beam wavevector and the origin of the detector, which is the point in the detector plane nearest to the crystal. Experience has shown that the detector origin and the direction of the incident beam are often specified with insufficient accuracy, which could easily lead to a misindexing of the reflections by a constant offset. For this reason, IDXREF considers alternative choices for the index origin and reports their likelihood of being correct. The parameters controlling the local indexing are INDEX_ERROR=, INDEX_MAGNITUDE=, INDEX_QUALITY= (corresponding to ∊, ϕ and 1 − ℓmin in Kabsch, 2010 ▶) and INDEX_ORIGIN=h 0, k 0, l 0, which is added to the indices of all reflections in the tree. After initial refinement based on the reflections in the largest subtree, all spots which can now be indexed are included. Usually, the detector distance and the direction of the rotation axis are not refined, but if the spots were extracted from images covering a large range of total crystal rotation then better results are obtained by including these parameters in the refinement [REFINE(IDXREF)=] . The refined metric parameters of the reduced cell are used to test each of the 44 possible lattice types as described in Kabsch (2010 ▶). For each lattice type, IDXREF reports the likelihood of its being correct and the conventional unit-cell parameters. The program step concludes with an overview of possible lattice symmetries, but makes no automatic decision for the space group. If the crystal symmetry is unknown, XDS will continue data processing with the crystal being described by its reduced-cell basis vectors and triclinic symmetry. Space-group assignment is postponed to the last program step, CORRECT, when integrated intensities are available. The following problems can be encountered in this step. (i) The indices of many difference-vector clusters deviate significantly from integral values. This can be caused by incorrect input parameters, such as rotation axis, oscillation angle or detector position, by a large fraction of alien spots in SPOT.XDS, by placing the detector too close to the crystal or by an inappropriate choice of the parameters SEPMIN= and CLUSTER_RADIUS= in densely populated images. (ii) Indexing and refinement is unsatisfactory despite well indexed difference-vector clusters. This is probably caused by the selection of an incorrect index origin and IDXREF should be rerun with plausible alternatives for INDEX_ORIGIN= after a visual check of a data image with XDS-Viewer. (iii) Despite successful indexing and refinement, IDXREF stops with the error message INSUFFICIENT PERCENTAGE OF INDEXED REFLECTIONS, complaining that less than 70% of the given spots could be explained. Alien spots often arise because of the presence of ice or small satellite crystals and continuation of data processing may still be meaningful. To continue data processing, just specify the missing processing steps in XDS.INP by JOB=DEFPIX XPLAN INTEGRATE CORRECT and call XDS again. 2.5. DEFPIX DEFPIX recognizes regions in the initial background table (file BKGINIT.cbf) that are obscured by intruding hardware and marks the shaded pixels as untrusted. In addition, pixels that are outside a user-defined resolution range (INCLUDE_RESOLUTION_RANGE=) are marked and eliminated from the trusted region. The marked background table that is thus obtained is saved in the file BKGPIX.cbf which is needed by the subsequent program steps. To recognize the obscured regions in the initial background, DEFPIX generates a control image (file ABS.cbf) that contains values of around 10 000 for unshaded pixels and lower values for shaded pixels. The classification of the pixels into reliable and untrusted pixels is based on the two input parameters VALUE_RANGE_FOR_TRUSTED_DETECTOR_PIXELS= (default 6000 30 000) and INCLUDE_RESOLUTION_RANGE= (default 20.0 0.0). Pixels in the table ABS.cbf with a value outside the ranges specified by the two parameters are marked unreliable (by −3) in the background table BKGPIX.cbf. The following problem can be encountered in this step. If the parameter VALUE_RANGE_FOR_TRUSTED_DETECTOR_PIXELS= specifies a value range that is too narrow, ‘good’ regions will erroneously be excluded from the trusted detector region. Check BKGPIX.cbf with the XDS-Viewer program and if necessary repeat the DEFPIX step with more appropriate values. 2.6. XPLAN XPLAN supports the planning of data collection. It is based upon information provided by the input files XPARM.XDS and BKGPIX.cbf, both of which become available on processing a few test images with XDS. XPLAN estimates the completeness of new reflection data expected to be collected for each given starting angle and total crystal rotation and reports the results for a number of selected resolution shells in the file XPLAN.LP. To minimize the recollection of data, the name of a file containing already measured reflections can be provided by the input parameter REFERENCE_DATA_SET=. The following problems can be encountered in this step. (i) Incorrect results may occur for some space groups, i.e. P42, if the unit cell determined by XDS from processing a few test images implicates reflection indices that are inconsistent with those from the reference data set. However, the correct cell choice can be found by using the old data as a reference and repeating CORRECT with the appropriate reindexing transformation, followed by copying GXPARM.XDS to XPARM.XDS. The same applies if IDXREF was run for an unknown space group and then reindexed in CORRECT. (ii) XPLAN ignores potential reflection overlap owing to the finite oscillation range covered by each image. 2.7. INTEGRATE INTEGRATE determines the intensity of each reflection predicted to occur in the rotation data images (DATA_RANGE=) and saves the results in the file INTEGRATE.HKL. The diffraction parameters needed to predict the reflection positions are initially provided by the file XPARM.XDS. These parameters are either kept constant or refined periodically using strong diffraction spots encountered in the data images. Whether refinement should be carried out at all and which parameters are to be refined can be specified by the user [input parameter REFINE(INTEGRATE)=]. The centroids of the strong spots in the data images are computed from pixels that exceed the background by a given multiple of standard deviations (input parameters SIGNAL_PIXEL=, BACKGROUND_PIXEL=). Strong spots are used in the refinement if their centroids are reasonably close to their calculated position (input parameter MAXIMUM_ERROR_OF_SPOT_POSITION=). For determination of the intensity, approximate values describing the extension and the form of the diffraction spot must be specified. The shapes of all spots become very similar when the contents of each of their contributing image pixels is mapped onto a three-dimensional coordinate system, specific for each reflection, which has its origin on the surface of the Ewald sphere at the terminus of the diffracted beam wavevector (see Kabsch, 2010 ▶). The transformed spot can roughly be described as a Gaussian involving two parameters: the standard deviations of the reflecting range σM (input parameter REFLECTING_RANGE_E.S.D.=σM) and the beam divergence σD (input parameter BEAM_DIVERGENCE_E.S.D.=σD). This leads to an integration region around the spot that is defined by the parameters δM (REFLECTING_RANGE=) and δD (BEAM_DIVERGENCE=), which are typically chosen to be 6–10 times larger than σM and σD, respectively. Appropriate values for these parameters are determined automatically by XDS (Kabsch, 2010 ▶); the user has the option to override the automatic assignments. Integration is carried out by a two-step procedure. In the first pass, spot templates are generated by superimposing the profiles of strong reflections after their mapping to the Ewald sphere. Grid points with a value above a minimum percentage of the maximum in the template (parameter CUT=) are marked for inclusion in the final integration. To allow for variations in their shape, profile templates are generated from reflections located at nine regions of equal size covering the detector surface and additional sets of nine to cover equally sized (parameter DELPHI=) batches of images. The actual integration is carried out in the second pass by profile fitting with respect to the spot shape determined in the first pass. Incomplete reflections below a minimum percentage of the observed reflection intensity (parameter MINPK=) will be discarded. Otherwise, the missing intensity is estimated from the learned reflection profiles. On return from the INTEGRATE step, all spots expected to occur in the last data image are encircled and the modified image is saved as the file FRAME.cbf for inspection. The following problems can be encountered in this step. (i) Off-centred profiles indicate incorrectly predicted reflection positions by using the parameters provided by the file XPARM.XDS (i.e. misindexing by using a wrong origin of the indices), crystal slippage or change in the incident-beam direction. (ii) Profiles extending to the borders of the box indicate too-small values of the parameters BEAM_DIVERGENCE= or REFLECTING_RANGE=. This leads to incorrect integrated intensities because of truncated reflection profiles and un­reliable background determination. (iii) Display of the file FRAME.cbf shows spots which are not encircled. If these unexpected reflections are not close to the spindle and are not ice reflections, then it is likely that the parameters provided by the file XPARM.XDS are wrong. 2.8. CORRECT CORRECT applies correction factors to the intensities and standard deviations of all reflections found in the file INTEGRATE.HKL, determines the space group if unknown and refines the unit-cell parameters, reports the quality and com­pleteness of the data set and saves the final integrated intensities in the file XDS_ASCII.HKL. Some of the employed algorithms are new and are described in Kabsch (2010 ▶). CORRECT accepts reflections from the file INTEGRATE.HKL that are (i) recorded (parameter MINPK=) on specified images (parameter DATA_RANGE=); (ii) within a given resolution range (parameter INCLUDE_RESOLUTION_RANGE=); (iii) outside ice rings (parameter EXCLUDE_RESOLUTION_RANGE=); (iv) not overloaded (parameter OVERLOAD=); and (v) not marked for exclusion in the file REMOVE.HKL. Thus, the user has the option to exclude unreliable reflections from the final data set by repeating the CORRECT step with appropriate parameter values. The intensities of the accepted reflections are first corrected for effects arising from polarization of the incident beam (parameters FRACTION_OF_POLARIZATION=, POLARIZATION_PLANE_NORMAL=) and absorption effects (parameters AIR=, SILICON=, SENSOR_THICKNESS=) arising from differences in path lengths of the diffracted beam. These corrections do not depend on knowledge of the space group. The integrated intensities of the reflections in the file INTEGRATE.HKL may or may not have been indexed in the correct space group; for the purpose of integration, it is important only that all reflections occurring in the data images have been indexed with respect to some unit-cell basis and that their locations on the images were hit exactly. The correct reflection indices in the true space group are always a linear transformation of the original indices used in INTEGRATE.HKL. All lattices consistent with the locations of the reflections saved in INTEGRATE.HKL (decision parameters MAX_CELL_AXIS_ERROR=, MAX_CELL_ANGLE_ERROR=) and their corresponding linear transformations are printed to provide a useful overview similar to that shown in IDXREF.LP. If the space group is not specified, XDS proposes one of the enantiomorphous space groups without screw axes that is compatible with the observed lattice symmetry and explains the intensities of a subset of the reflections (parameter TEST_RESOLUTION_RANGE=) at an acceptable R meas (Diederichs & Karplus, 1997 ▶; Weiss, 2001 ▶) using a minimum number of unique reflections. The criteria for an acceptable R meas are controlled by the decision parameters MIN_RFL_Rmeas= and MAX_FAC_Rmeas=. The user can always override the automatic decisions by specifying the correct space-group number (parameter SPACE_GROUP_NUMBER=) and unit-cell parameters (parameter UNIT_CELL_CONSTANTS=) in XDS.INP and repeating the CORRECT step. This provides a simple way to rename orthorhombic unit-cell parameters, which often becomes necessary if screw axes are present. In addition, the user has the option to specify the following in XDS.INP: (i) a reference data set (parameter REFERENCE_DATA_SET=), (ii) a reindexing transformation (parameter REIDX=) and (iii) three basis vectors if known from processing a previous data set taken at the same crystal orientation in a multi-wavelength experiment (parameters UNIT_CELL_A-AXIS=, UNIT_CELL_B-AXIS=, UNIT_CELL_C-AXIS=). The possibility of comparing the new data with a reference data set is particularly useful for resolving the issue of alternative settings of polar or rhombohedral cells (such as P4, P6 and R3). Also, reference data are quite useful for recognizing misindexing or for testing potential heavy-atom derivatives. For refinement of the unit-cell parameters [parameter REFINE(CORRECT)=], CORRECT uses a subset of the accepted reflections whose observed centroid is sufficiently close to the predicted spot position (parameter MAXIMUM_ERROR_OF_SPOT_POSITION=). The refined set of parameters is saved in the file GXPARM.XDS, which has an identical layout to the file XPARM.XDS produced by IDXREF. If the crystal has not slipped during data collection, these parameters are quite accurate. Other correction factors (parameter CORRECTIONS=) which partially compensate for radiation damage, absorption effects and variations in the sensitivity of the detector surface are determined from the symmetry-equivalent reflections usually found in the data images. The corrections are chosen such that the integrated intensities of symmetry-equivalent reflections come out as similar as possible. The user may control application of the various corrections by specifying the parameter CORRECTIONS= by a combination of the key­words DECAY MODULATION ABSORPTION. Whether Friedel pairs are considered as symmetry-equivalent reflections in the calculation of the correction factors depends on the values of the two parameters STRICT_ABSORPTION_CORRECTION= and FRIEDEL’S_LAW=. The number of correction factors is controlled by the input parameters MINIMUM_I/SIGMA=, NBATCH= and REFLECTIONS/CORRECTION_FACTOR=. The residual scatter in intensity of symmetry-equivalent reflections is used to estimate their standard deviations. Here, the initial estimate v 0(I) (obtained from the INTEGRATE step) for the variance of the reflection intensity I is replaced by v(I) = a[v 0(I) + bI 2]. The two constants a and b are chosen to minimize discrepancies between v(I) and the variance estimated from sample statistics of symmetry-related reflections. Based on the more realistic error estimates for the intensities, outliers are recognized by comparison with other symmetry-equivalent reflections. These outliers are included in the main output file XDS_ASCII.HKL, in which they are marked by a negative sign attached to the estimated standard deviations of their intensity. Classification of a reflection as a misfit is con­trolled by a decision constant which has the default value of WFAC1=1.5. Specification of a lower value such as WFAC1=1.0 by the user will lead to an increasing number of misfits and lower R factors as outliers are not included in the reported statistics. Data quality as a function of resolution is described by the agreement of intensities of symmetry-related reflections and quantified by the R factors R merge and the more robust indicator R meas (Diederichs & Karplus, 1997 ▶; Weiss, 2001 ▶). These R factors as well as the intensities of all reflections with indices of type h00, 0k0 and 00l and those expected to be systematically absent provide important information for identification of the correct space group. Clearly, large R factors or many rejected reflections or large observed intensities for reflections that are expected to be systematically absent suggest that the assumed space group or indexing is incorrect. The presence or absence of anomalous scatterers is specified by the parameter FRIEDEL’S_LAW=. Finally, CORRECT analyzes the distribution of reflection intensities as a function of their resolution and reports outliers from the Wilson plot. Often, these aliens arise from ice rings in the data images. To suppress the un­wanted reflections from the final output file XDS_ASCII.HKL, the user copies them to a file named REMOVE.HKL in the current directory and repeats the CORRECT step. The following problems can be encountered in this step. (i) Incomplete data sets may lead to wrong conclusions about the space group, as some of its symmetry operators might not be involved in the R-factor calculations. (ii) Often, the CORRECT step is repeated several times. It should be remembered that XDS overwrites earlier versions of the output files XDS_ASCII.HKL, GXPARM.XDS etc. 3. XSCALE The scaling program XSCALE (i) puts one or more files obtained from data processing with XDS on a common scale and reports the completeness and quality of the data sets; (ii) offers a choice of either combining symmetry-equivalent observations into a single unique reflection or saving the scaled but unmerged observations in the output file; (iii) allows several output files that are placed on the same scale, a feature that is recommended for MAD data sets taken from the same crystal at different wavelengths; (iv) determines correction factors that partially compensate for absorption effects, sensitivity variations in the detector plane and radiation damage; and (v) can correct reflections individually for radiation damage (Diederichs et al., 2003 ▶). The program uses a new fast algorithm (Kabsch, 2010 ▶) and imposes no limitations on the number of data sets or scaling/correction factors. The easiest way to run XSCALE is to copy a template input file named XSCALE.INP to a new directory and to replace the parameter values by the appropriate values describing the actual scaling run. The input parameters may be given in arbitrary order, except for the parameters defining the input and output reflection files (INPUT_FILE=, OUTPUT_FILE=). Here, an output file is defined first by the parameter OUTPUT_FILE= that will include the scaled and merged reflections from all following input files specified by the parameters INPUT_FILE= until the next occurrence of OUTPUT_FILE= in XSCALE.INP. An arbitrary number of output files can be specified (together with their set of input files) in a single run of XSCALE. All output files are then on the same scale, which is a useful program feature for MAD data sets. The reflections in each output file will be unmerged and Friedel pairs will be considered to be different if this holds for all of the input data sets unless explicitly redefined by the parameters MERGE= and FRIEDEL’S_LAW=. Moreover, each output file accepts an additional parameter that controls how the Friedel pairs of the input files are treated in the calculation of the absorption correction factors. If STRICT_ABSORPTION_CORRECTION=FALSE, Friedel pairs are treated as symmetry-equivalent reflections in these calculations, which could lead to an underestimate of the anomalous differences in the presence of anomalous scatterers. Friedel pairs are only treated as different reflections in the calculations if STRICT_ABSORPTION_CORRECTION=TRUE and FRIEDEL’S_LAW=FALSE. For each input file, a resolution window for accepting reflections (INCLUDE_RESOLUTION_RANGE=), the extent of absorption corrections (CORRECTIONS=DECAY MODULATION ABSORPTION) and the number of correction factors (NBATCH=) can be specified. Finally, each input data set can be corrected for radiation damage by specifying the name of the crystal the data set was obtained from (CRYSTAL_NAME=). Specification of this parameter implicates zero-dose extrapolation of individual reflection intensities to compensate for the effects of radiation damage experienced by the crystal so far (see Diederichs et al., 2003 ▶). Each resulting scaled data set is of XDS_ASCII format. It can be converted into a CCP4-style multi-record MTZ file using the copy feature of the program POINTLESS (Evans, 2006 ▶) available from the web (ftp://ftp.ccp4.ac.uk/ccp4/6.0.2/prerelease/pointless.html) or converted by XDSCONV into the format required by various structure-solution packages. 4. XDSCONV XDSCONV accepts reflection-intensity data files as produced by XSCALE or CORRECT and converts them into the format required by software packages for structure determination. XDSCONV estimates structure-factor moduli based on the assumption that the intensity data set obeys Wilson’s distribution and uses a Bayesian approach to statistical inference as described by French & Wilson (1978 ▶). The output file generated may inherit the test reflections previously used to calculate a free R factor (Brünger, 1992 ▶) or may contain new test reflections selected by XDSCONV. 5. Parallelization of XDS In order to efficiently use modern multiprocessor hardware, a major effort has been undertaken to replace the original code of XDS by routines that can run concurrently with very little need for synchronization. As described above, data processing by XDS is organized into eight steps that must be executed in a fixed order since the result of each step is needed as input for the subsequent ones. Thus, the only way to speed up processing is to make each step faster. The most computationally intensive steps are COLSPOT and INTEGRATE and, to a lesser degree, the routine that refines diffraction parameters in IDXREF and CORRECT. Thus, the highest savings in wall clock time are expected to result from changing these routines so that each one can make efficient use of the multiprocessor hardware. Two methods can be used (simultaneously) to speed up data processing. In the first method, XDS divides the set of data images into approximately equal portions, calls a shell script that starts an independent job for processing each portion of images by the computer cluster and waits until all jobs have finished. The number of such independent jobs can be limited by the user (MAXIMUM_NUMBER_OF_JOBS=); up to 99 jobs are allowed. This method works even if the processors do not share the same address space since the jobs are independent processes that do not communicate at all. The second method uses OpenMP to control execution by a team of threads and relies on a shared-memory multiprocessor platform. This allows the program to exploit data parallelism at a more fine-grained level to speed up refinements and routines for setting up and solving systems of linear equations. The maximum number of threads that can be employed by the parallel version of XDS (xds_par, xscale_par) can be limited by the user (MAXIMUM_NUMBER_OF_PROCESSORS=); up to 32 processors can be used. OpenMP has been chosen for execution control because it hardly adds to the complexity of the program code and most importantly does not require the maintenance of separate versions of the source code depending on whether the program is intended for execution by a team of processors or just by a single CPU. Moreover, OpenMP has become the de facto standard and compilers accepting OpenMP directives are available for most shared-memory multiprocessor platforms. The new version of COLSPOT comprises an initial part, a concurrent procedure and a final part. After initialization each available processor is kept busy analyzing its share of rotation images for strong pixels, which are saved in a processor-specific file. In the final sequential part of COLSPOT all files resulting from the concurrent computations are read and the location addresses, image running numbers and signal values of the strong pixels are stored in a hash table. Strong pixels belonging to the same spot can be located rapidly in this table and the centroids of the spots are saved in the final output file from this step. For the INTEGRATE step, the rotation images are divided into approximately equal portions for independent processing under control by a shell script according to the first method described above. When all jobs have finished, the integrated intensities from each independent file are joined. Minor problems could occur for reflections that receive intensity contributions from images that have been processed by different jobs. Compared with processing as a single job, the observed intensity differences are small and disappear if the different jobs use identical reference profiles and diffraction parameters to predict spot locations [to avoid refinements, specify REFINE(INTEGRATE)=!]. In addition, each of the independent jobs can be executed by a team of processors controlled by OpenMP. The rotation images analyzed by each job are split into a sequence of batches of consecutive images that cover a total rotation range that is large enough to accommodate the integration domain. The batches are evaluated in strictly sequental order; parallel processing is confined to images within each batch. The restructured routine for the INTEGRATE step consists of code regions for parallel execution interspersed by sequential sections. After initialization, strong reflections and their mean size and extent are determined concurrently. The diffraction parameters are refined in parallel processing mode based on the observed spot locations. In the following sequential section a database is generated containing information about all reflections occurring in this batch of images. A subset of strong reflections is also identified that is useful for the subsequent reflection-profile learning pass. The mean profile of these reflections is determined concurrently in a second pass through the images in the batch. Reflection integration by profile fitting is carried out in parallel in the third cycle through the batch. In the final sequential step the results from each job, which have been saved in files, are harvested and intensity contributions to the same reflection from adjacent batches are merged. 6. Availability Documentation and executable versions of the XDS package for widely used computer systems running under Linux or OSX can be obtained from the XDS homepage (http://xds.mpimf-heidelberg.mpg.de/) free of charge for use by academics for noncommercial applications. Additional information can be found at http://strucbio.biologie.uni-konstanz.de/xdswiki/index.php/XDS. For looking at rotation data images and control images generated by XDS, an open-source program XDS-Viewer written by Michael Hoffer can be obtained from http://xds-viewer.sourceforge.net under the GNU General Public License. A graphical interface XDSi (Kursula, 2004 ▶) is available (http://cc.oulu.fi/~pkursula/xdsi.html) that simplifies the operation of XDS.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: found
            Is Open Access

            Overview of the CCP4 suite and current developments

            1. Introduction CCP4 (Collaborative Computational Project, Number 4, 1994 ▶) exists to produce and support a world-leading integrated suite of programs that allows researchers to determine macromolecular structures by X-ray crystallography and other bio­physical techniques. CCP4 aims to develop and support the development of cutting-edge approaches to the experimental determination and analysis of protein structure and to integrate these approaches into the CCP4 software suite. CCP4 is a community-based resource that supports the widest possible researcher community, embracing academic, not-for-profit and for-profit research. CCP4 aims to play a key role in the education and training of scientists in experimental structural biology. It encourages the wide dissemination of new ideas, techniques and practice. In this article, we give an overview of the CCP4 project, past, present and future. We begin with a historical perspective on the growth of the software suite, followed by a summary of the current functionality in the suite. We then discuss ongoing plans for the next generation of the suite which is in development. In this account we focus on the suite as a whole, while other articles in this issue delve deeper into individual programs. We intend that this article could serve as a general literature citation for the use of the CCP4 software suite in structure determination, although we also encourage the citation of individual programs, many of the relevant references for which are included here. While we focus here on the CCP4 software suite, we would emphasize that comparable functionality is available in other software packages such as SHARP/autoSHARP (Vonrhein et al., 2007 ▶), SHELX (Sheldrick, 2008 ▶), ARP/wARP (Langer et al., 2008 ▶), PHENIX (Adams et al., 2010 ▶) and many others. 2. Evolution of the CCP4 software suite The CCP4 software suite is a collection of programs implementing specific algorithms concerned with macromolecular structure solution from X-ray diffraction data. Significantly, it is a collection of autonomous and independently developed programs. While some have been commissioned by the academic committees overseeing the CCP4 project, the majority originate from the community to address a perceived gap in current functionality or to implement newly developed algorithms. The result is a collection of around 200 programs, ranging from large programs which are effectively packages in themselves to small ‘jiffy’ programs. Over the years the suite has grown continuously, with each major release featuring significant new software (see Table 1 ▶). Unsurprisingly, there is overlap of functionality, with several programs performing a particular task, albeit often using different approaches. The question then is how to combine these programs into a software suite, both in terms of ensuring communication between the different programs and in helping both naïve and experienced users to navigate through the suite. Early on in the history of CCP4, there was an agreement for all programs to use the same file formats for data files. Formats were specified for diffraction data (the LCF format, later replaced by the MTZ format) and for electron-density maps (the CCP4 map format), while for atomic coordinates the PDB format was adopted. A software library was developed to facilitate reading and writing of these data formats and thereby ensure standardization of the formats. Originally supporting only Fortran programs, the library was re-written to support both Fortran and C/C++ as well as scripting languages (Winn et al., 2002 ▶). The CCP4 set of libraries has since expanded to cover a wider range of crystallographic tasks, in particular with the addition of the Clipper library (Cowtan, 2003 ▶), the MMDB library (Krissinel et al., 2004 ▶) and the CCTBX library (Grosse-Kunstleve et al., 2002 ▶) from the PHENIX project (Adams et al., 2010 ▶). Crystallographic tasks were performed by writing or adapting scripts (e.g. Unix shell or VMS scripts) to link together a number of programs (Fig. 1a ▶) and the suite can still be run in this way. The programs communicate solely via the data files which are passed between them. The user sets program options based on the program documentation and the expected results from earlier steps. A major change was introduced in 2000 with the release of the graphical user interface ccp4i (Fig. 1b ▶; Potterton et al., 2003 ▶). Task interfaces help the user to prepare run scripts. Details of how to run specific programs are largely hidden, as are the jiffy programs used to perform minor functions such as format conversion. Some limited intelligence in the interface code allows program options to be customized according to properties of the data and/or the desired objective. ccp4i interfaces are now available for all of the commonly used CCP4 programs as well as for several non-CCP4 programs (e.g. ARP/wARP; Langer et al., 2008 ▶). The ccp4i interface also introduced for the first time tools for helping the user to organize data. Jobs that have been run were recorded in a ‘database’ (in reality a directory of files) with tools to access and interpret the files saved there. Jobs are further organized into projects, representing different structure solutions. There are now plans to update the CCP4 GUI (see §4), but the impact of the original ccp4i on the suite should not be underestimated. In the last few years, two other modes of accessing the CCP4 suite have emerged. On the one hand, the latest version of the suite contains four complementary automation pipelines, namely xia2 (Winter, 2010 ▶), CRANK (Ness et al., 2004 ▶), MrBUMP (Keegan & Winn, 2007 ▶) and BALBES (Long et al., 2008 ▶). These pipelines attempt to perform large sections of the full structure solution (e.g. phasing) without user intervention. This is achieved partly through the use of a large number of trials, trying different protocols and performing parameter scanning. Such an approach can be very powerful, using cheap computer power to make many more attempts than a user would manually. Automation pipelines have been realised in the last few years because of the maturity of the underlying programs and the availability of sufficient computer power to support multiple trials. On the other hand, graphical programs for interactive use have become more powerful. Rather than simply reviewing the results of previously run programs and performing interactive model editing, Coot (Emsley et al., 2010 ▶) can launch separate refinement and validation programs (Fig. 1c ▶). Similarly, iMOSFLM can be used to interface the data-processing programs POINTLESS and SCALA. In some ways this is a completely different scenario to the automation pipelines. User interaction is paramount, with crystallo­graphy programs acting as tools to be invoked. The user can become familiar with the data and structure and use this to make intelligent decisions. Such an approach has also become possible because of the maturity of the invoked programs and the availability of sufficient computer power to run the programs interactively. 3. Overview of current functionality In this section, we give an overview of the current functionality of the CCP4 software suite (corresponding to release series 6.1 at the time of writing). We summarize the automation pipelines and individual programs included in the suite; many more details can be found in the accompanying articles in this issue. We present the functionality in the traditional manner, starting at data processing and ending at validation. However, it is becoming increasingly apparent that these neat categories are breaking down. 3.1. Data processing The earliest starting point for entry into the CCP4 suite is a set of X-ray diffraction images. The data-reduction program MOSFLM (Leslie, 2006 ▶) will take a set of diffraction images, identify spots on each image, index the diffraction pattern and thus identify the Bragg peaks, and integrate the spots. The output is a list of integrated intensities and their standard uncertainties labelled by the h, k, l indices. Associated information includes the batch number of the image from which the intensity was obtained, whether the peak was full or partial and the symmetry operation that relates the particular observation to the chosen asymmetric unit. MOSFLM continues to be improved, with support added recently for Pilatus detectors, addition of automatic backstop masking etc. The most visible change is the replacement of the old X-­windows-based interface with the Tcl-based iMOSFLM interface (Fig. 2 ▶), which guides the user in a stepwise manner through the stages of data processing. POINTLESS is a relatively new program whose primary purpose is to identify the Laue group of a crystal from an unmerged data set (Evans, 2006 ▶). The program will also attempt to identify the space group from an analysis of systematic absences. A secondary purpose is to test the choice of indexing and re-index a data set if necessary. Given a choice of space group, the program SCALA (Evans, 2006 ▶) will refine the parameters of a scaling function for an unmerged data set, apply scales to each observation of a reflection and merge all observations of a reflection to give an average intensity. It will also provide an improved estimate of the standard uncertainty of each intensity. The new program CTRUNCATE (which replaces the older TRUNCATE; Stein, unpublished program) can then convert the intensities to structure-factor amplitudes, although downstream programs increasingly use the mean intensities directly. Perhaps more importantly, CTRUNCATE will analyse a data set for signs of twinning, translational noncrystallographic symmetry (NCS), anisotropy and other notable features, since it is best to identify problems before attempting phasing. The program SFCHECK (Vaguine et al., 1999 ▶) will also provide an analysis of a data set, including testing for twinning and translational NCS, estimating the optical resolution and the anisotropy, and plotting the radial and angular completeness. The previous steps of data processing are automated by the xia2 pipeline (Winter, 2010 ▶). From a directory of images, xia2 will identify the type of experiment (multi-wedge, multi-pass, multi-wavelength) and process accordingly. The pipeline will determine the point group, space group and correct indexing. Multiple processing pipelines using alternative underlying programs are supported. At the end, the user should have a set of merged structure-factor amplitudes suitable for input to phasing. 3.2. Experimental phasing CCP4 includes the CRANK pipeline (Ness et al., 2004 ▶), which covers experimental phasing and beyond, and interfaces with several CCP4 and non-CCP4 programs. Heavy-atom sub­structure detection is performed by AFRO/CRUNCH2 (de Graaff et al., 2001 ▶) or by SHELXC/D (Sheldrick, 2008 ▶) and initial phasing is carried out by BP3 (Pannu et al., 2003 ▶; Pannu & Read, 2004 ▶) or SHELXE (Sheldrick, 2008 ▶). Phase improvement is carried out by SOLOMON (Abrahams & Leslie, 1996 ▶), DM (Cowtan et al., 2001 ▶) or Pirate (Cowtan, 2000 ▶) and automated model building by Buccaneer (Cowtan, 2006 ▶; Cowtan, 2008 ▶) or ARP/wARP (Langer et al., 2008 ▶). CRANK thus supports a range of underlying software handling the communication of data and allowing the user to trial different combinations. CCP4 includes a number of additional individual programs, each of which has its own particular strength. The long-standing CCP4 program MLPHARE for phasing still works in straight­forward cases and is fast to use. ACORN (Jia-xing et al., 2005 ▶; Dodson & Woolfson, 2009 ▶) uses ab initio methods for the determination of phases starting from a small fragment which could be a single heavy atom. The use of ab initio methods usually requires atomic resolution data, since it assumes atomicity of the electron density. However, a variant of the so-called free-­lunch algorithm (Jia-xing et al., 2005 ▶) allows the temporary generation of phases to atomic resolution which the ACORN method can utilize. The OASIS program (Wu et al., 2009 ▶) also uses ab initio methods to break the phase ambiguity in SAD/SIR phasing. Phaser (McCoy et al., 2007 ▶) can obtain phase estimates starting from known heavy-atom positions and SAD data. Log-likelihood gradient (LLG) maps are used to automatically find additional sites for anomalous scatterers and to detect anisotropy in existing anomalous scatterers. Phaser can also use a partial model, for example from a molecular-replacement solution that is hard to refine, as a source of phase information to help locate weak anomalous scatterers and thus improved phases. The latter reflects the view of experimental phasing and molecular replacement as just two sources of phase information rather than two separate techniques. 3.3. Molecular replacement CCP4 includes two pipelines for molecular replacement (MR): MrBUMP (Keegan & Winn, 2007 ▶) and BALBES (Long et al., 2008 ▶). Both start from processed data and a target sequence and aim to deliver a molecular-replacement solution consisting of positioned and partially refined models. BALBES uses its own database of protein molecules and domains taken from the PDB and customized for MR, while MrBUMP uses public databases and a set of widely available bioinformatics tools to generate possible search models. BALBES is based around the MR program MOLREP (Vagin & Teplyakov, 1997 ▶, 2010 ▶), while MrBUMP can also use the program Phaser (McCoy et al., 2007 ▶). Both MOLREP and Phaser are also available as stand-alone programs in CCP4. As well as providing rotation and translation functions, whereby a search model is positioned in the unit cell to give an initial estimate of the phases, these programs provide additional functionality, including a significant contribution to automated decision-making. For instance, a single run of Phaser can search for several copies each of several components in the structure of a complex, testing different possible search orders and trying different possible choices of space group. The search model for MR may be an ensemble of structures, a set of models from an NMR structure or an electron-density map. Phases for the target may be available, so that the search model is to be fitted into electron density, or there may be density available from an electron-microscopy experiment. The MR step can be followed by rigid-body refinement and the packing of the MR solution can be checked. Much of this functionality is common to Phaser and MOLREP, but there are a number of differences in implementation, so that both may prove useful in certain circumstances. A crucial component of MR is the selection and preparation of search models. The program CHAINSAW (Stein, 2008 ▶) takes as input a sequence alignment which relates residues in the search model to residues in the target protein and uses this information to edit the search model appropriately. The output model is labelled according to the target sequence. MOLREP (Lebedev et al., 2008 ▶) can take as input the target sequence and performs its own alignment to the search model in order to edit the search model. 3.4. Phase improvement and automated model building Having obtained initial phases from experimental phasing, the next step is phase improvement (density modification) to give a map that can be built into. When phases come from molecular replacement, phase improvement may also be useful to reduce model bias. For a long time, the main CCP4 phase-improvement programs were DM (Cowtan et al., 2001 ▶) and SOLOMON (Abrahams & Leslie, 1996 ▶), which covered the standard techniques of solvent flattening/flipping, histogram matching and NCS averaging. More recently, statistically based methods have been incorporated into the program Pirate (Cowtan, 2000 ▶). Pirate can give better results, but has been found to be inconveniently slow. The latest program Parrot (Cowtan, 2010 ▶) achieves similar improvements but is also fast and automated. Given an electron-density map, automated model building is provided in CCP4 by Buccaneer (Cowtan, 2006 ▶, 2008 ▶). This finds candidate Cα positions, builds these into chain fragments, joins the fragments together and docks a sequence. NCS can be used to rebuild and complete related chains. Since version 1.4, there is support for model (re)building after molecular replacement and for supplying known structural elements such as heavy atoms. The CCP4 suite includes an interface for alternating cycles of model building with Buccaneer with cycles of model refinement with REFMAC5. The supplementary program Sloop (Cowtan, unpublished program) builds missing loops using fragments taken from the Richardson’s Top500 library of structures (Lovell et al., 2003 ▶) to fill gaps in the chain. The chance of finding a good fit falls with increasing size of the gap, but the method may work for loops of up to eight residues in length. RAPPER (Furnham et al., 2006 ▶) provides a conformational search algorithm for protein modelling, which can produce an ensemble of models satisfying a wide variety of restraint information. In the context of CCP4, restraints on the modelling are provided by the electron density and/or the locations of the Cα atoms. The ccp4i interface includes modes for loop building or for building the entire structure. 3.5. Refinement and model completion The aim of macromolecular crystallography is to produce a model of the macromolecule of interest which explains the diffraction images as accurately and completely as possible. Both the form of the model and the parameters of the model need to be defined. Refinement is the process of optimizing the values of the model parameters and in CCP4 is performed by the program REFMAC5 (Murshudov et al., 1997 ▶). REFMAC5 will refine atomic coordinates and atomic isotropic or anisotropic displacement parameters (Murshudov et al., 1999 ▶), as well as group parameters for rigid-body refinement and TLS refinement (Winn et al., 2001 ▶, 2003 ▶). It will also refine scaling parameters and a mask-based bulk-solvent correction. When good-quality experimental phases are available, these can be included as additional data (Pannu et al., 1998 ▶). More recently, it has become possible to refine directly against anomalous data for the cases of SAD (Skubák et al., 2004 ▶) and SIRAS (Skubák et al., 2009 ▶) without the need for estimated phases and phase probabilities. REFMAC5 will also now refine against twinned data (Lebedev et al., 2006 ▶), automatically recognising the twin laws and estimating the corresponding twin fractions. The nonprotein contents of the crystal are often of most interest, such as bound ligands, cofactors, metal sites etc. Correct refinement at moderate or low resolution requires a knowledge of the ideal geometry together with associated uncertainties. In REFMAC5 this is handled through a dictionary of possible ligands (Vagin et al., 2004 ▶), with details held in mmCIF format. Dictionary files can be created through the tools SKETCHER and JLIGAND. Refinement goes hand-in-hand with rounds of model building which add/subtract parts of the model and apply large structural changes that are beyond the reach of refinement. In addition to the automated procedures of Buccaneer and RAPPER described above, there are many model-building tools in Coot (Emsley et al., 2010 ▶). A ccp4i interface to the popular ARP/wARP model-building package (Langer et al., 2008 ▶) has also been available for many years. 3.6. Validation, deposition and publication Validation is the process of ensuring that all aspects of the model are supported by the diffraction data, as well as con­forming with known features of protein chemistry. Although validation has traditionally been viewed as something that is performed at the end of structure determination, just before deposition, it is now appreciated that validation is an integral part of the process of structure solution, which should be carried out continually. CCP4 includes a wide variety of validation tools, all of which should be run to gain a complete picture of model quality. Coot (Emsley et al., 2010 ▶) has a dedicated drop-down menu of validation tools which can and should be applied as the model is being built. Coot can also extract warnings about particular links or outliers from a REFMAC5 log file. Warnings associated with specific atoms or residues are linked directly to the model as viewed in Coot. The ccp4i ‘Validation and Deposition’ module contains further validation tools. As mentioned above, SFCHECK (Vaguine et al., 1999 ▶) provides a number of measures of data quality, but if a model is provided it will also assess the agreement of the model with the data. Sequins (Cowtan, unpublished program) validates the assigned sequence against electron density (generated from experimental phases or from phases calculated from a side-chain omit process) and warns of mis­placed side chains or register errors. RAMPAGE (which is part of the RAPPER package; Furnham et al., 2006 ▶) provides Ramachandran plots based on updated ϕ–ψ propensities. PROCHECK is also included, although the Ramachandran plots are no longer generated, having been superseded by RAMPAGE. R500 (Henrick, unpublished program) checks the stereochemistry in a given PDB file against expected values and lists outliers in REMARK 500 records. The quaternary structure of the protein can be analysed with PISA (Krissinel & Henrick, 2007 ▶). This considers all possible interfaces in the crystal structure, estimates the free energy of dissociation, taking into account solvation and entropy effects, and predicts which interfaces are likely to be of biological significance. The CCP4 molecular-graphics program CCP4mg (Potterton et al., 2002 ▶, 2004 ▶) provides a simple means of generating publication-quality images and movies. As well as displaying coordinates in a wide variety of styles, CCP4mg can display molecular surfaces, electron density, arbitrary vectors and labels. The latest versions are built on the Qt toolkit, giving an enhanced look and feel (Fig. 3 ▶). Structures and views can be transferred between CCP4mg and Coot. 3.7. Jiffies and utilities In addition to the main functionality described above, the CCP4 suite contains a large number of utilities for performing format conversions and various analyses. Reflection data processed in other software packages can be imported with the utilities COMBAT, POINTLESS, SCALEPACK2MTZ, DTREK2SCALA and DTREK2MTZ, while data can be exchanged with other structure-solution packages with CONVERT2MTZ, F2MTZ, CIF2MTZ, MTZ2VARIOUS and MTZ2CIF. There are several useful utilities based on the Clipper library (Cowtan, 2003 ▶), such as CPHASEMATCH, which will compare two phase sets and look for changes in origin or hand. There are also many useful utilities for analysing coordinate files. New programs based on the MMDB library (Krissinel et al., 2004 ▶) include NCONT for listing atom contacts and PDB_MERGE for combining two PDB files. 4. Future plans At the heart of the CCP4 suite are the set of algorithms encoded in individual programs. As always, we include new programs in each major release of the suite and will continue to do so. Since the source of novel software is usually independent developers, the additions to the suite are not centrally planned. Nevertheless, some current themes are clearly recognisable, such as automated model building, in particular for low-resolution data. CCP4 also aims to enhance its functionality related to the maintenance and use of data on small molecules (ligands). Firstly, a considerably larger library of chemical compounds will be provided with the suite. Extended search functions will be provided to allow the efficient retrieval of known com­pounds or their close analogues. Secondly, existing functions for generating restraint data for new ligands will be enhanced by the inclusion of relevant software such as PRODRG (Schüttelkopf & van Aalten, 2004 ▶) into the suite, as well as by the development of new methods for structure reconstruction on the basis of partial similarity to structures in the library. Functionality will be available through a graphical front-end application, JLIGAND. In addition to the core programs, the infrastructure of CCP4 continues to evolve to support the latest working practices. The current CCP4 GUI, ccp4i, was a major innovation and has served us well for over ten years (Potterton et al., 2003 ▶). While it continues to provide a useful interface to the CCP4 suite, there are increasing demands from automation pipelines and users alike. In particular, there is a requirement to provide help on what to try next, advice which can be useful to both scientists and automated software. This depends on a robust assessment of the experimental data and the results of previous processing, which in turn requires good data management. We aim to address these issues through the development of a next-generation CCP4 interface. There will also be changes in the way that CCP4 is delivered to the end user. We have all become used to automated up­dates to the software we use (e.g. Windows Update, Synaptic for Debian-based Linux or application-specific updates such as for Firefox). Some CCP4 programs do alert the users to the availability of newer versions and CCP4mg (Potterton et al., 2002 ▶, 2004 ▶) will update the version on request. A CCP4-wide update mechanism is more difficult given the heterogeneous nature of the suite, but efforts in this direction are under way. A specific example of a remotely maintained crystallography platform is given by the US-based SBGrid Consortium. The CCP4 suite is downloaded to a user’s machine or a local server before being run. This is in contrast to many biology software tools, which are web-based. Reasons for running CCP4 locally include the wallclock time of jobs, the detailed control required and the size of data files. Nevertheless, there is increasing usage of web servers for crystallographic tasks. A server at York (http://www.ysbl.york.ac.uk/YSBLPrograms/index.jsp) runs a number of CCP4 programs, including BALBES and Buccaneer, while CCP4 programs are included in a number of other services, for example the ARP/wARP server at Hamburg (http://cluster.embl-hamburg.de/ARPwARP/remote-http.html). Plans are under way to make more CCP4 functionality available via the web. Finally, the coming years will see increasing integration of crystallography with other techniques, both experimental and theoretical. CCP4 aims to contribute towards efforts, such as the European infrastructure project INSTRUCT, to ease the transfer of data to and from these other domains.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: found
              Is Open Access

              REFMAC5 for the refinement of macromolecular crystal structures

              1. Introduction As a final step in the process of solving a macromolecular crystal (MX) structure, refinement is carried out to maximize the agreement between the model and the X-ray data. Model parameters that are optimized in the refinement process include atomic coordinates, atomic displacement parameters (ADPs), scale factors and, in the presence of twinning, twin fraction(s). Although refinement procedures are typically designed for the final stages of MX analysis, they are also often used to improve partial models and to calculate the ‘best’ electron-density maps for further model (re)building. Refinement protocols are therefore an essential component of model-building pipelines [ARP/wARP (Perrakis et al., 1999 ▶), SOLVE/RESOLVE (Terwilliger, 2003 ▶) and Buccaneer (Cowtan, 2006 ▶)] and are of paramount importance in guiding manual model updates using molecular-graphics software [Coot (Emsley & Cowtan, 2004 ▶), O (Jones et al., 1991 ▶) and XtalView (McRee & Israel, 2008 ▶)]. The first software tools for MX refinement appeared in the 1970s. Real-space refinement using torsion-angle parameterization was introduced by Diamond (1971 ▶). This was followed a few years later by reciprocal-space algorithms for the refinement of individual atomic parameters with added energy (Jack & Levitt, 1978 ▶) and restraints (Konnert, 1976 ▶) in order to deliver chemically reasonable models. The energy and restraints approaches differ only in terminology as they use similar information and both can be unified using a Bayesian formalism (Murshudov et al., 1997 ▶). Early programs used the well established statistical technique of least-squares residuals with equal weights on all reflections (Press et al., 1992 ▶), with gradients and second derivatives (if needed) calculated directly. This changed when Fourier methods, which were developed for small-molecule structure refinement (Booth, 1946 ▶; Cochran, 1948 ▶; Cruickshank, 1952 ▶, 1956 ▶), were formalized for macromolecules (Ten Eyck, 1977 ▶; Agarwal, 1978 ▶). The use of the FFT for structure-factor and gradient evaluation (Agarwal, 1978 ▶) sped up calculations dramatically and the refinement of large molecules using relatively modest computers became realistic. Later, the introduction of molecular dynamics (Brünger, 1991 ▶), the generalization of the FFT approach for all space groups (Brünger, 1989 ▶) and the development of a modular approach to refinement programs (Tronrud et al., 1987 ▶) dramatically changed MX solution procedures. Also, the introduction of the very robust and popular small-molecular refinement program SHELXL (Sheldrick, 2008 ▶) to the macromolecular community allowed routine analysis of high-resolution MX data, including the refinement of merohedral and non-merohedral twins. More sophisticated statistical approaches to MX structure refinement started to emerge in the 1990s. Although the basic formulations and most of the necessary probability distributions used in crystallography were developed in the 1950s and 1960s (Luzzati, 1951 ▶; Ramachandran et al., 1963 ▶; Srinivasan & Ramachandran, 1965 ▶; see also Srinivasan & Parthasarathy, 1976 ▶, and references therein), their implementation for MX refinement started in the middle of the 1990s (Pannu & Read, 1996 ▶; Bricogne & Irwin, 1996 ▶; Murshudov et al., 1997 ▶). It should be emphasized that prior to the application of maximum-likelihood (ML) techniques in MX refinement, the importance of advanced statistical approaches to all stages of MX analysis had been advocated by Bricogne (1997 ▶) for two decades. Nowadays, most MX refinement programs offer likelihood targets as an option. Although ML can be very well approximated using the weighted least-squares approach in the very simple case of refinement against structure-factor amplitudes (Murshudov et al., 1997 ▶), ML has the attractive advantage that it is relatively easy (at least theoretically) to generalize for the joint utilization of a variety of sources of observations. For example, it was immediately extended to use experimental phase information (Bricogne, 1997 ▶; Murshudov et al., 1997 ▶; Pannu et al., 1998 ▶). In the last two decades, there have been many developments of likelihood functions towards the exploitation of all available experimental data for refinement, thus increasing the reliability of the refined model in the final stages of refinement and improving the electron density used in model building in the early stages of MX analysis (Bricogne, 1997 ▶; Skubák et al., 2004 ▶, 2009 ▶). MX crystallography can now take advantage of highly optimized software packages dealing with all of the various stages of structure solution, including refinement. There are several programs available that either are designed to perform refinement or offer refinement as an option. These include BUSTER/TNT (Blanc et al., 2004 ▶), CNS (Brünger et al., 1998 ▶), MAIN (Turk, 2008 ▶), MOPRO (Guillot et al., 2001 ▶), phenix.refine (Adams et al., 2010 ▶), REFMAC5 (Murshudov et al., 1997 ▶), SHELXL (Sheldrick, 2008 ▶) and TNT (Tronrud et al., 1987 ▶). While MOPRO was specifically designed for niche ultrahigh-resolution refinement and is able to model deformation density, all of the other programs can deal with a multitude of MX refinement problems and produce high-quality electron-density maps, although with different emphases and strengths. This contribution describes the various components of the macromolecular crystallographic refinement program REFMAC5, which is distributed as part of the CCP4 suite (Collaborative Computational Project, Number 4, 1994 ▶). REFMAC5 is a flexible and highly optimized refinement package that is ideally suited for refinement across the entire resolution spectrum that is encountered in macromolecular crystallo­graphy. 2. Target functions in REFMAC5 As in all other refinement programs, the target function minimized in REFMAC5 has two components: a component utilizing geometry (or prior knowledge) and a component utilizing experimental X-ray knowledge, where f total is the total target function to be minimized, con­sisting of functions controlling the geometry of the model and the fit of the model parameters to the experimental data, and w is a weight between the relative contributions of these two components. In macromolecular crystallography, the weight is traditionally selected by trial and error. REFMAC5 offers automatic weighting, which is based on the fact that both components are the natural logarithm of a probability distribution. However, this ‘automatic’ weight may lead to unreasonable deviations from ideal geometry (either too tight or too relaxed) in some cases, as the ideal geometry is difficult to describe statistically. For these cases, the weight parameter may need to be selected manually to produce more reasonable geometry, e.g. such that the root-mean-square deviation of the bond lengths from the ideal values is 0.02 Å and at resolutions lower than 3 Å perhaps even smaller. From a Bayesian viewpoint (O’Hagan, 1994 ▶), these functions have the following probabilistic interpretation (ignoring constants which are irrelevant for minimization purposes): From this point of view, MX refinement is similar to a well known technique in statistical analysis: maximum posterior (MAP) estimation. The model parameters are linked with the experimental data via f xray, i.e. likelihood is a mechanism that controls information flow from the experimental data to the derived model. Consequently, it is important to design a likelihood function that allows optimal information transfer from the data to the derived model. f geom ensures that the derived model is consistent with the presumed chemical and structural knowledge. This function plays the role of regularization, reduction of the effective number of parameters and transfer of known information to the new model. If care is not taken, then wrong information may be transferred to the model; removing the effect of such errors may be difficult if possible at all. The design of such functions should be performed using verifiable invariant information and it should be testable and revisable during the refinement and model-building procedures. Functions dealing with geometry usually depend only on atomic parameters. We are not aware of any function used in crystallography that deals with the prior geometry probability distributions of overall parameters. A possible reason for the lack of interest in (and necessity of) this type of function may be that, despite popular belief, the statistical problem in crystallo­graphy is sufficiently well defined and that the main problems are those of model parameterization and completion. The existing refinement programs differ in the target functions and optimization techniques used to derive model parameters. Most MX programs use likelihood target functions. However, their form, implementations and parameterizations are different. Therefore, it should not come as a surprise if different programs give (slightly) different results in terms of model parameters, electron-density maps and reliability factors (such as R and R free). 2.1. X-ray component The X-ray likelihood target functions used in REFMAC5 are based on a general multivariate probability distribution of E observations given M model structure factors. This function is derived from a multivariate complex Gaussian distribution of N = E + M structure factors for acentric reflections and from a multivariate real Gaussian distribution for centric reflections and has the following form: where P = P(|F 1|, …, |FE |; F E+1, …, FN ), Fi = |Fi |exp(ια i }, |F 1|, …, |FE | denote the observed amplitudes, F E+1, …, FN are the model structure factors, C N is the covariance matrix with the elements of its inverse denoted by aij , C M is the bottom right square submatrix of C N of dimension M with the elements of its inverse denoted by cij . We define cij = 0 for i ≤ 0 or j ≤ 0. |C N | and |C M | are the determinants of matrices C N and C M , = (α1, …, α E ) is the vector of the unknown phases of the observations that need to be integrated and is a probability distribution expressing any prior knowledge about the phases. In the simplest case of one observation, one model and no prior knowledge about phases, the integral in (3) can be evaluated analytically. In this case, the function follows a Rice distribution (Bricogne & Irwin, 1996 ▶), which is a non-central χ2 distribution of |F o|2/Σ and |F o|2/2Σ with non-centrality parameters D 2|F c|2/Σ and D 2|F o|2/2Σ with one and two degrees of freedom for centric and acentric reflections, respectively (Stuart & Ord, 2009 ▶), where D in its simplest interpretation is 〈cos(Δxs)〉, a Luzzati error parameter (Luzzati, 1952 ▶) expressing errors in the positional parameters of the model, F c is the model structure factor, |F o| is the observed amplitude of the structure factor and Σ is the uncertainty or the second central moment of the distribution. Both Σ and D enter the equation as part of the covariance matrices C N and C M from (3). Σ is a function of the multiplicity of the Miller indices (∊ factor), experimental uncertainties (σo), model completeness and model errors. For simplicity, the following parameterization is used: The current version of REFMAC5 estimates D and Σmod in resolution bins. Working reflections are used for estimation of D and free reflections are used for Σmod estimation. Although this simple parameterization works in many cases, it may give misleading results for data from crystals with pseudo translation, OD disorder or modulated crystals in general. Currently, there is no satisfactory implementation of the error model to account for these cases. 2.2. Incorporation of experimental phase information in model refinement 2.2.1. MLHL likelihood MLHL likelihood (Bricogne, 1997 ▶; Murshudov et al., 1997 ▶; Pannu et al., 1998 ▶) is based on a special case of the probability distribution (3) where we have one observation, one model and phase information derived from an experiment available as a prior distribution P pr(α), where F o = |F o|exp(ια), F c = |F c|exp(ιαc), α is the unknown phase of the structure factor and α1 and α2 are its possible values for a centric reflection. The prior phase probability distribution P pr(α) is usually represented as a generalized von Mises distribution (Mardia & Jupp, 1999 ▶) and is better known in crystallography as a Hendrickson–Lattman distribution (Hendrickson & Lattman, 1970 ▶), where A, B, C and D are coefficients of the Fourier transformation of the logarithm of the phase probability distribution and N is the normalization coefficient. The distribution is unimodal when C and D are zero; otherwise, it is a bimodal distribution that reflects the possible phase uncertainty in experimental phasing. For centric reflections C and D are zero. 2.2.2. SAD/SIRAS likelihood The MLHL likelihood is dependent on the reliability and accuracy of the prior distribution P pr(α). However, the phase distributions after density modification (or even after phasing), which are usually used as P pr(α), often suffer from inaccurate estimation of the phase errors. Furthermore, MLHL [as well as any other special case of (3) with a non-uniform P pr(α)] assumes independence of the prior phases from the model phases. These shortcomings can be addressed by using experimental information directly from the experimental data, instead of from the P pr(α) distributions obtained in previous steps of the structure-solution process. Currently, SAD and SIRAS likelihood functions are implemented in REFMAC5. The SAD probability distribution (Skubák et al., 2004 ▶) is obtained from (3) by setting E = 2, M = 2, P pr(α) = constant and |F 1| = |F o +|, |F 2| = |(F o −)*|, F 3 = F c +, F 4 = (F c −)*, where F + and F − are the structure factors of the Friedel pairs. The model structure factors are constructed using the current parameters of the protein, the heavy-atom substructure and the inputted anomalous scattering parameters. Similarly, the SIRAS function (Skubák et al., 2009 ▶) is a special case of (3) with E = 3, M = 3, P pr(α) = constant and |F 1| = |F o N |, |F 2| = |F o +|, |F 3| = |(F o −)*|, F 4 = F c N, F 5 = F c +, F 6 = (F c −)*, where |F 1| and F 4 correspond to the observation and the model of the native crystal, respectively, and |F 2|, |F 3|, F 5 and F 6 refer to the Friedel pair observations and models of the derivative crystal. If any of the E observations are symmetrically equivalent, for instance centric Friedel pair intensities, the equation is reduced appropriately so as to only include non-equivalent observations and models. The incorporation of prior phase information by the refinement function is especially useful in the early and middle stages of model building and at all stages of structure solution at lower resolutions, owing to the improvement in the observation-to-parameter ratio. The refinement of a well resolved high-resolution structure is often best achieved using the simple Rice function. Fig. 1 ▶ shows the effect of various likelihood functions on automatic model building using ARP/wARP (Perrakis et al., 1999 ▶). 2.3. Twin refinement The function used for twin refinement is a generalization of the Rice distribution in the presence of a linear relationship between the observed intensities. This function has the form where N o and N model are normalization coefficients. In the first equation, the first term inside the integral, P(I o; F), represents the probability distribution of observations if ‘ideal’ structure factors are known. Here, all reflections that are twinned and that can be grouped together are included. Models representing the data-collection instrument, if available, could be added to this term. The second term, P(F; model), represents a probability distribution of the ‘ideal’ structure factors should an atomic model be known for a single crystal. Here, all reflections from the asymmetric unit that contribute to the observed ‘twinned’ intensities are included. If the data were to come from more than one crystal or if, for example, SAD should be used simultaneously with twinning, then this term would need to be modified appropriately. F c is a function of atomic and overall parameter D. Overall parameters also include Σ and twin-fraction parameters. f represents the way structure factors from the asymmetric unit contribute to the particular ‘twinned’ intensity. The above formula is more symbolic rather than precise; further details of twin refinement will be published elsewhere. REFMAC5 performs the following preparations before starting refinement against twinned data. (i) Identify potential (pseudo)merohedral twin operators by analyses of cell/space-group combination using the algorithm developed by Lebedev et al. (2006 ▶). (ii) Calculate R merge for each potential twin operator and filter out twin operators for which R merge is greater than 0.5 or a user-defined value. (iii) Estimate twin fractions for the remaining twin domains and filter out those with small twin fractions (the default value is 0.05). (iv) Make sure that the point group and twin operators form a group. Strictly speaking this stage is not necessary, but it makes bookkeeping easy. (v) Perform twin refinement using the remaining twin operators. Twin fractions are refined at every cycle. All integrals necessary for evaluation of the minus log-likelihood function and its derivatives with respect to the structure factors are evaluated using the Laplace approximation (McKay, 2003 ▶). 2.4. Modelling bulk-solvent contribution Typically, a significant part of a macromolecular crystal is occupied by disordered solvent. Accurate modelling of this part of the crystal is still an unsolved problem of MX. The contribution of bulk solvent to structure factors is strongest at low resolution, although its effect at high resolution is still non-negligible. The absence of good models for disordered solvent may be one of the reasons why R factors in MX are significantly higher than those in small-molecular crystallography. For small molecules R factors can be around 1%, whereas for MX they are rarely less than 10% and more often around 20% or even higher. REFMAC5 uses two types of bulk (disordered) solvent models. One of them is the so-called Babinet’s bulk-solvent model, which is based on the assumption that the only difference between solvent and protein at low resolution is their scale factor (Tronrud, 1997 ▶). Here, we use a slight modification of the formulation described by Tronrud (1997 ▶) and assume that if protein electron density is convoluted using the Gaussian kernel and multiplied by an appropriate scale factor, then protein and solvent electron densities are equal, where * denotes convolution, denotes the Fourier transform and k babinet = k babinet0exp(−B babinet|s|2/4). Here, we used the convolution theorem, which states that the Fourier transform of the convolution of two functions is the product of their Fourier transforms. The second bulk-solvent model is derived similarly to that described by Jiang & Brünger (1994 ▶). The basic assumption is that disordered solvent atoms are uniformly distributed over the region of the asymmetric unit that is not occupied by the atoms of the modelled part of the crystal structure. The region of the asymmetric unit occupied by the atomic model is masked out. Any holes inside this mask are removed using a cavity-detection algorithm. A constant value is assigned outside this region and the structure factors F mask are calculated using an FFT algorithm. These structure factors, multiplied by appropriate scale factors (estimated during the scaling procedure), are added to those calculated from the atomic model. Additionally, various mask parameters may optionally be optimized. One should be careful with bulk-solvent corrections, especially when the atomic model is incomplete. This type of bulk-solvent model may result in smeared-out electron density that may reduce the height of electron density in less-ordered and unmodelled parts of the crystal. The final total structure factors with scale and solvent contributions included take the following form: where the ks are scale factors, s is the reciprocal-space vector, |s| is the length of this vector, U aniso is the crystallographic anisotropic tensor that obeys crystal symmetry, F mask is the contribution from the mask bulk solvent and F protein is the contribution from the protein part of the crystal. Usually, either mask or Babinet bulk-solvent correction is used. However, sometimes their combination may provide better statistics (lower R factors) than either individually. The overall parameters of the solvent models, the overall anisotropy and the scale factors are estimated using a least-squares fit of the amplitude of the total structure factors to the observed amplitudes, In the case of twin refinement, the following function is used to estimate overall parameters including twin fractions (details of twin refinement will be published elsewhere), where f(α, F) is as defined in (8). Both (11) and (12) are minimized using the Gauss–Newton method with eigenvalue filtering to solve linear equations, which ensures that even very highly correlated parameters can be estimated simultaneously. However, one should be careful in interpretating these parameters as the system is highly correlated. Once overall parameters such as the scale factors and twin fractions have been estimated, REFMAC5 estimates the overall parameters of one of the abovementioned likelihood functions and evaluates the function and its derivatives with respect to the atomic parameters. A general description of this procedure can be found in Steiner et al. (2003 ▶). 2.5. Geometry component The function controlling the geometry has several components. (i) Chemical information about the constituent blocks (e.g. amino acids, nucleic acids, ligands) of macromolecules and the covalent links between them. (ii) Internal consistency of macromolecules (e.g. NCS). (iii) Structural knowledge (known structures, restraints on current interatomic distances, secondary structures). The first component is used by all programs and has been tabulated in an mmCIF dictionary (Vagin et al., 2004 ▶) now used by several programs, including REFMAC5, phenix.refine (Adams et al., 2010 ▶) and Coot (Emsley & Cowtan, 2004 ▶). The current version of the dictionary contains around 9000 entries and several hundred covalent-link descriptions. Any new entries may be added using one of several programs, including Sketcher (Vagin et al., 2004 ▶) from CCP4 (Collaborative Computational Project, Number 4, 1994 ▶), JLigand (unpublished work), PRODRG (Schüttelkopf & van Aalten, 2004 ▶) and phenix.elbow (Adams et al., 2010 ▶). Standard restraints on the covalent structure have the general form where bm represents a geometric parameter (e.g. bonds, angles, chiralities) calculated from the model and bi is the ideal value of this particular geometric parameter as tabulated in the dictionary. Apart from ω (the angle of the peptide bond) and χ (the angles of amino-acid side chains), torsion angles in general are not restrained by default. However, the user can request to restrain a particular torsion angle defined in the dictionary or can define general torsion angles and use them as restraints. In general, it is not clear how to handle the restraint on torsion angles automatically, as these angles may depend on the covalent structure as well as the chemical environment of a particular ligand. 2.6. Noncrystallographic symmetry restraints 2.6.1. Automatic NCS definition Automatic NCS identification in REFMAC5 is performed using the following procedure. (i) Align the sequences of all chains with all chains using the dynamic alignment algorithm (Needleman & Wunsch, 1970 ▶). (ii) Accept the alignment if the number of aligned residues is more than k (default 15) residues and the sequence identity for aligned residues is more than α% (default 80%). (iii) Calculate the global root-mean-square deviation (r.m.s.d.) using all aligned residues. (iv) Calculate the average local r.m.s.d. using the formula where N is the number of aligned residues, j indexes the aligned residues, Nj is the number of corresponding atoms in residue j, n j is the number of atoms in the ith group, rl is the vector of differences between corresponding atomic positions and Rj and tj are the rotation and translation that give the best superposition between atoms in group i. To calculate the r.m.s.d., it is not necessary to calculate the rotation and translation operators explicitly or to apply these transformations to atoms. Rather, it is achieved implicitly using Procrustes analysis, as described, for example, in Mardia & Bibby (1979 ▶). When k = N, the local and global r.m.s.d. coincide. (v) If the r.m.s.d. is less than β Å (default 2.5 Å), then we consider the chains to be aligned. (vi) Prepare the list of aligned atoms. If after applying the transformation matrix (calculated using aligned atoms) the neighbours (waters, ligands) of aligned atoms are superimposed, then they are also added to the list of aligned atoms. (vii) If local NCS is requested, then prepare pairs of corresponding interatomic distances. Steps (i)–(v) are performed once during each session of refinement. Step (vi) is performed during every cycle of refinement in order to allow conformational changes to occur. 2.6.2. Global NCS For global NCS restraints, transformation operators (Rij and tij ) that optimally superpose all NCS-related molecules are estimated and the following residual is added to the total target function, where the weight w is a user-controllable parameter. Note that the transformation matrices are estimated using xi and xj and thus they are dependent on these parameters. Therefore, in principle the gradient and second-derivative calculations should take this dependence into account, although this dependence is ignored in the current version of REFMAC5. Ignoring the contribution of these terms may reduce the rate of convergence, although in practice it does not seem to pose a problem. 2.6.3. Local NCS The following function (similar to the implementation in BUSTER) is used for local NCS restraints, where GM is the Geman–McClure robust estimator function (Geman & McClure, 1987 ▶), which can be written Fig. 2 ▶ shows that for small values of r this function is similar to the usual least-squares function. However, it behaves differently for large r: least-square residuals do not allow conformational changes to occur, whereas this type of function is more tolerant to such changes. 2.6.4. External structure restraints The interatomic distances within the structure being analysed may be similar to a known (deposited) structure, particularly in localized regions. In cases where it makes sense, this information can be exploited in order to aid the refinement of the target structure. In doing so, the target structure is pulled towards the con­formation adopted by the known structure. The mechanism for generic external restraints described by Mooij et al. (2009 ▶) is used for external structure restraints. In our implementation, structural information from external known structures is utilized by applying restraints to the distances between atom pairs based on a presumed atomic correspondence between the two structures. The following function is used for external structure restraints, where the atoms ai belong to the set A of atoms for which a correspondence is known, dij is the distance between the positions of atoms ai and aj , is the corresponding distance in the known structure, σ ij is the estimated standard deviation of dij about and d max ensures that atom pairs are only restrained within localized regions, allowing insensitivity to global conformational changes. External structure restraints should be weighted differently to the other geometry com­ponents in order to allow the restraint strength to be separately specified. Consequently, a weight w ext is applied, which should be appropriately chosen depending on the data quality and resolution, the structural similarity between the external known structure and the target, and the choice of d max. The Geman–McClure function with sensitivity parameter σGM is used to increase robustness to outliers, as with the local NCS restraints. Prior information from the external known structure(s) is generated using the software tool PROSMART. Specifically, this includes the atomic correspondence A, distances , standard deviations σ ij and the distance cutoff d max. Potential sources of prior structural information include different conformations of the target chain (such as those that may result from using different crystallization conditions or in a different binding state) as well as those from homologous or structurally similar proteins. It is possible to use multiple known structures as prior information. The combination of this information results in modified values of and σ ij as appropriate. This allows a structure to be refined utilizing information from a whole class of similar structures, rather than just a single source. Furthermore, it opens up the future possibility for multi-crystal co-refinement. The employed formalism also allows the application of atomic distance restraints to secondary-structure elements (and, in principle, other motifs). Consequently, external restraints may be applied without requiring the prior identification of known structures similar to the target. This is intended to help to refine such motifs towards the expected/presumed local conformation. This technique has been found to be particularly useful for low-resolution crystals and in cases where the target structure is unable to be refined to a satisfactory level. When used appropriately, external structure restraints should increase refinement reliability. Consequently, the difference between the R and R free values is expected to decrease in successful cases. Fig. 3 ▶ shows the refinement statistics resulting from using external restraints to refine a low-resolution bluetongue virus VP4 enzyme (Sutton et al., 2007 ▶). A sequence-identical structure solved at a higher resolution is used as prior information. Refinement statistics are compared after ten refinement cycles with and without using external restraints. Using the external restraints results in a 2.8% improvement in R free. Furthermore, the difference between the R and R free values is reduced from 11.5 to 4.3%, suggesting greatly increased refinement reliability. 2.6.5. ‘Jelly-body’ restraints The ratio of the number of observations to the number of adjustable parameters is very small at low resolution. Even after accounting for chemical restraints, this ratio stays very small and refinement in such cases is usually unstable. The danger of overfitting is very high; this is reflected in large differences between the R and R free values. External structure restraints and the use of experimental phase information (described above) provide ways of dealing with this problem. Unfortunately, it is not always possible to find similar structures refined at high resolution (or at least ones that result in a sufficiently successful improvement in refinement statistics) and experimental phase information is not always available or sufficient. Fortunately, statistical techniques exist to deal with this type of problem. Such techniques include ridge regression (Stuart et al., 2009 ▶), the lasso estimation procedure (Tibshirani, 1997 ▶) and Bayesian estimation with prior knowledge of parameters (O’Hagan, 1994 ▶). REFMAC5 has a regularization function in interatomic distance space that has the form for pairs of atoms i, j from the same chain, with maximum radius d max, which can be controlled (default 4.25 Å). Note that this term does not contribute to the value of the function or its gradient; it only changes the second derivative, thus changing the search direction. It should be noted that a similar technique has been implemented in CNS (Schröder et al., 2010 ▶). Note that if all interatomic distances were constrained, then individual atomic refinement would become rigid-body refinement. The effect of ‘jelly-body’ restraints is the implicit parameterization between the rigid body and individual atoms. This technique has strong similarity to elastic network model calculations (Trion, 1996 ▶). This simple formula has been found to work surprisingly well. 2.6.6. Atomic displacement parameter restraints Unlike positional parameters, where prior knowledge can be designed using basic knowledge of the chemistry of the building blocks of macromolecules and analysis of high-resolution structures, it is not obvious how to design restraints for atomic displace­ment parameters (ADPs). Ideally, restraints should reflect the geometry of the molecules as well as their overall mobility. Various programs use various restraints (Sheldrick, 2008 ▶; Adams et al., 2010 ▶; Konnert & Hendrickson, 1980 ▶; Murshudov et al., 1997 ▶). In the new version of REFMAC5, restraints on ADPs are based on the distances between distributions. If we assume that atoms are represented as Gaussian distributions, then we are able to design restraints based on the distance between such distributions. For a given two distributions in three-dimensional space P(x) and Q(x), the symmetrized Kullback–Liebler (KL) divergence (McKay, 2003 ▶) is defined as follows: It can be verified that the symmetrized KL divergence satisfies the conditions of a metric distance in the space of distributions. The KL divergence can also be represented as follows: This distance changes more smoothly than the L 2 distance between functions and seems to be a useful criterion for the design of approximate probability distributions (McKay, 2003 ▶; O’Hagan, 1994 ▶). When both distributions are Gaussian with mean zero, this distance has an elegant form. Assume that both atoms have Gaussian distribution: In this case, the KL divergence becomes In the case of isotropic ADPs, KL has an even simpler form: REFMAC5 uses restraints based on the KL divergence: The summation is over all atom pairs with distance less than r max. The weights depend on the nature of the bonds as well as on the distance between the atoms. If atoms are bonded or angle-related then the weight is larger. However, the weight is smaller if the atoms are not related by covalent bonds. Moreover, if the distance between the atoms is more than 3 Å then the weight decreases as follows: where w 0,ij is the weight for nonbonded atoms that are closer than 3 Å to each other. 2.6.7. Rigid-bond restraints For anisotropic atoms there are so-called rigid-bond restraints, based on the idea of rigid-bond tests of anisotropic atoms (Hirshfeld, 1976 ▶). The idea is that projections of U values on the bond vector joining two atoms should be similar. In other words, if two atoms are bonded then an oscillation across the bond is more likely than an independent oscillation along the bond. Atoms oscillate along the bond in a concerted fashion. Rigid-bond restraints are designed as follows. Let us assume that two atoms have positions x 1 and x 2 and their corresponding ADPs are U 1 and U 2; the unit vector joining these atoms is then calculated, The projections of corresponding U values on this vector are then calculated as Now, using these projections, the KL divergence is formed for all pairs and added to the target function: Again, the weights depend on the nature of the bonds between the atoms and the distances between them. Note that if the ADPs of both bonded atoms are isotropic then the rigid-bond restraint is equivalent to the above-described KL restraint. 2.6.8. Sphericity restraints To avoid atoms exploding and becoming too elliptical or, even worse, non-elliptical, REFMAC5 uses restraints on sphericity. It is a simple restraint: an isotropic equivalent of the anisotropic tensor, where k indexes the anisotropic atoms, i, j are components of the anisotropic tensor and wk are weights for this particular type of restraint. The weights depend on the number of other restraints (KL, rigid bond) on this atom. Atoms that have fewer restraints have stronger weights on sphericity, since these atoms are more likely to be unstable. It should be noted that similar restraints on ADPs are used in several other refinement programs (Sheldrick, 2008 ▶; Adams et al., 2010 ▶). 3. Parameterization 3.1. General parameters REFMAC5 uses the standard parameterization of molecules in terms of atomic coordinates and isotropic/anisotropic atomic displacement parameters. The refinement of these parameters is performed using an FFT formulation for gradients and approximations for second derivatives. Details of these formulations have been published elsewhere (Murshudov et al., 1997 ▶, 1999 ▶; Steiner et al., 2003 ▶). Once the gradients and approximate second derivatives have been calculated for these parameters, they are used to calculate the derivatives of derived parameters. Derived parameters include those for rigid-body and TLS refinement. 3.2. Rigid body Rigid-body parameterization is achieved as follows. For each rigid group, transformation operators are defined and new positions are calculated from the starting positions using the formula where Rj is the rotation matrix, t origin is the centre of mass of the rigid group and tj is the translational component of the transformation. The x old are the starting coordinates of the atoms and x new are their positions after application of the transformation operators. There are six parameters per rigid group, defining the rotation matrix and the translational component. At each cycle of refinement, an eigenvalue-filtering technique is used to avoid potential singularities arising from the shape of the rigid groups. It should be noted that no terms between rigid groups are calculated for the approximate second-derivative matrix. For large rigid groups this does not pose much of a problem. However, for many small rigid groups it may slow down convergence substantially. In any case, it is not recommended to divide molecules into very small rigid groups. For these cases, ‘jelly-body’ refinement should produce better results. Once derivatives with respect to the positional parameters have been calculated, those for rigid-body parameters are calculated using the chain rule. The current version of REFMAC5 uses an Euler angle parameterization. 3.3. TLS Atomic displacement parameters describe the spread of atomic positions and can be derived from the Fourier transform of a Gaussian probability distribution function for the atomic centre. The atomic displacement parameters are an important part of the model. Traditionally, a single parameter describing isotropic displacements has been used, namely the B factor. However, it is well known that atomic displacements are likely to be anisotropic owing to directional bonding and at high resolutions the six parameters per atom of a fully anisotropic model can be refined. TLS refinement is a way of modelling anisotropic displacements using only a few parameters, so that the method can be used at medium and low resolutions. The TLS model was originally proposed for small-molecule crystallography (Schomaker & Trueblood, 1968 ▶) and was incorporated into REFMAC5 almost ten years ago (Winn et al., 2001 ▶). The idea behind TLS is to suppose that groups of atoms move as rigid bodies and to constrain the anisotropic displace­ment parameters of these atoms accordingly. The rigid-body motion is described by translation (T), libration (L) and screw (S) tensors, using a total of 20 parameters for each rigid body. Given values for these 20 parameters, anisotropic displacement parameters can be derived for each atom in the group (and this relationship also allows one to calculate derivatives via the chain rule). Usually, an extra isotropic displacement parameter (the residual B factor) is refined for each atom in addition to the TLS contribution. The sum of these two con­tributions can be output using the supplementary program TLSANL (Howlin et al., 1993 ▶) or optionally directly from REFMAC5. TLS groups need to be chosen before refinement and constitute part of the definition of the model for the macromolecule. Groups of atoms should conform to the idea that they move as a quasi-rigid body. Often the choice of one group per chain suffices (or at least serves as a reference calculation) and this is the default in REFMAC5. More detailed choices can be made using methods such as TLSMD (Painter & Merritt, 2006 ▶). By default, REFMAC5 also includes waters in the first hydration shell, which it seems reasonable to assume move in concert with the protein chain. Fig. 4 ▶ shows the effect of TLS refinement and orientation of libration tensors. In this case, TLS refinement improves R/R free and the derived libration tensors make biological sense. 4. Optimization REFMAC5 uses the Gauss–Newton method for optimization. For an elegant and comprehensive review on optimization techniques, see Nocedal & Wright (1999 ▶). In this method, the exact second derivative is not calculated, but rather approximated to make sure it is always non-negative. Once derivatives or approximations have been calculated, the following linear equation is built, where H is the approximate second derivative and G is the gradient vector. The contribution of most of the geometrical terms are calculated using algorithms designed for quadratic optimization or least-squares fitting (Press et al., 1992 ▶). To calculate the contribution from the Geman–McClure terms, the following approximation is used (Huber & Ronchetti, 2009 ▶), This approximation ensures that H stays non-negative and consequently directions calculated as a result of the solution of (32) point towards a reduction of the total function. The contribution of the X-ray term to the gradient is calculated using FFT algorithms (Murshudov et al., 1997 ▶). The Fisher information matrix, as described by Steiner et al. (2003 ▶), is used to calculate the contribution of the likelihood functions to the matrix H. Tests have demonstrated that using the diagonal elements of the Fisher information matrix and both diagonal and nondiagonal elements of the geometry terms results in a more stable refinement. Once all of the terms contributing to H and G have been calculated, the linear equation (32) is solved using preconditioned conjugate-gradient methods (Nocedal & Wright, 1999 ▶; Tronrud, 1992 ▶). A diagonal matrix formed by the diagonal elements of H is used as a preconditioner. This brings parameters with different overall scales (positional and B values) onto the same scale and controlling convergence becomes easier. If the conjugate-gradient procedure does not converge in N maxiter cycles (the default is 1000), then the diagonal terms of the H matrix are increased. Thus, if the matrix is not positive then ridge regression is activated. In the presence of a potential (near-) singularity, REFMAC5 uses the following procedure to solve the linear equation. (i) Define and use preconditioner. At this stage, H and G are modified. Define the new matrix by H 1 and vector by G 1. (ii) Set γ = 0. (iii) Define a new matrix: H 2 = H 1 + γI, where I is the identity matrix. (iv) Solve the equation H 2 p = −G 1 using the conjugate-gradient method for linear equations for sparse and positive-definite matrices (Press et al., 1992 ▶). If convergence was achieved in less than N maxiter iterations, then proceed to the next step. Otherwise, increase γ and go to step (iii). (v) Decondition the matrix, gradient and shift vectors. (vi) Apply shifts to the atomic parameters, making sure that the ADPs are positive. (vii) Calculate the value of the total function. (viii) If the value of the total function is less than the previous value, then proceed to the next step. Otherwise, reduce the shifts and repeat steps (vi)–(viii). (ix) Finish the refinement cycle. After application of the shifts, the next cycle of refinement starts. 5. Conclusions Refinement is an important step in macromolecular crystal structure elucidation. It is used as a final step in structure solution, as well as as an intermediate step to improve models and obtain improved electron density to facilitate further model rebuilding. REFMAC5 is one of the refinement programs that incorporates various tools to deal with some crystal peculiarities, low-resolution MX structure refinement and high-resolution refinement. There are also tabulated dictionaries of the constituent blocks of macromolecules, cofactors and ligands. The number of dictionary elements now exceeds 9000. There are also tools to deal with new ligands and covalent modifications of ligands and/or proteins. Low-resolution MX structure analysis is still a challenging task. There are several outstanding problems that need to be dealt with before we can claim that low-resolution MX analysis is complete. Statistics, image processing and computer science provide general methods for these and related problems. Unfortunately, these techniques cannot be directly applied to MX structure analysis, either because of the huge computer resources needed or because the assumptions used are not applicable to MX. In our opinion, the problems of state-of-the-art MX analysis that need urgent attention include the following. (i) Reparameterization depending on the quality and the amount of experimental data. Some tools implemented in REFMAC5 allow partial dealing with this problem. These tools include (a) restraining against known homologous structures, (b) ‘jelly-body’ restraints or refinement along implicit normal modes, (c) long-range ADP restraints based on KL divergence, (d) automatic local and global NCS restraints and (e) experimental phase-information restraints. However, low-resolution refinement and model (re)building is still not as automatic as for high-resolution structures. (ii) Statistical methods for peculiar crystals with low signal-to-noise ratio. Some of the implemented tools, such as likelihood-based twin refinement and SAD/SIRAS refinement, help in the analysis of some of the data produced by such crystals. The analysis of data from such peculiar crystals as OD disorder with or without twinning, multiple cells, translocational disorder or modulated crystals in general remains problematic. (iii) Another important problem is that of limited and noisy data. As a result of resolution cutoff (owing to the survival time of the crystal under X-ray irradiation or otherwise), the resultant electron density usually exhibits noise owing to series termination. If the resolution that the crystal actually diffracts to is the same as the resolution of the data, then series termination is not very serious as the signal dies out towards the limit of the resolution. However, in these cases the electron density becomes blurred, reflecting high mobility of the molecules or crystal disorder. When map sharpening is used, the signal is amplified and series termination becomes a serious problem. To reduce noise, it is necessary to work with the full Fourier transformation. In other words, resolution extension and the prediction of missing reflections becomes an important problem. The dramatic effect of such an approach for density modification at high resolution has been demonstrated by Altomare et al. (2008 ▶) and Sheldrick (2008 ▶). The direct replacement of missing reflections by calculated ones necessarily introduces bias towards model errors and may mask real signal. To avoid this, it is necessary to integrate over the errors in the model parameters (coordinates, B values, scale values and twin fractions). However, since the number of parameters is very large (sometimes exceeding 1 000 000), integration using available numerical techniques is not feasible. (iv) Error estimation. Despite the advances in MX, there have been few attempts to evaluate errors in the estimated parameters. Works attempting to deal with this problem are few and far between (Sheldrick, 2008 ▶). To complete MX structure analysis, it is necessary to develop and implement techniques for error estimation. If this is achieved, then incorrect structures could be eliminated while analysing the MX data and building the model. One of the promising approaches to this problem is the Gauss–Markov random field sampling technique (Hue & Held, 2005 ▶) using the (approximate) second derivative as a field-defining matrix. (v) Multicrystal refinement with the simultaneous multicrystal averaging of isomorphous or non-isomorphous crystals is one of the important directions for low-resolution refinement. If it is dealt with properly then the number of structures analysed at low resolution should increase substantially. Further improvement may consist of a combination of various experimental techniques. For example, the simultaneous treatment of electron-microscopy (EM) and MX data could increase the reliability of EM models and put MX models in the context of larger biological systems. The direct use of unmerged data is another direction in which refinement procedures could be developed. If this were achieved, then several long-standing problems could be easier to deal with. Two such problems are the following. (i) In general, the space group of a crystal should be considered as an adjustable parameter. If unmerged data are used, then space-group assumptions could be tested after every few sessions of refinement and model building. (ii) Dealing with the processes in the crystal during data collection requires unmerged data. One of the best-known such problems is radiation damage.
                Bookmark

                Author and article information

                Contributors
                Journal
                Front Mol Biosci
                Front Mol Biosci
                Front. Mol. Biosci.
                Frontiers in Molecular Biosciences
                Frontiers Media S.A.
                2296-889X
                14 December 2022
                2022
                : 9
                : 960248
                Affiliations
                [1] 1 Biochemistry Department , Oxford Glycobiology Institute , University of Oxford , Oxford, United Kingdom
                [2] 2 Commonwealth Scientific and Industrial Research Organisation , Clayton, VIC, Australia
                [3] 3 Department of Medicine , Surgery and Pharmacy , University of Sassari , Sassari, Italy
                [4] 4 Wellcome Trust Centre for Cell Biology , University of Edinburgh , Scotland, United Kingdom
                [5] 5 Biochemistry Department , University of Oxford , Oxford, United Kingdom
                [6] 6 IBBA-CNR Unit of Milano , Institute of Agricultural Biology and Biotechnology , Milano, Italy
                [7] 7 Department of Molecular and Cell Biology , Leicester Institute of Structural and Chemical Biology , University of Leicester , Leicester, United Kingdom
                Author notes

                Edited by: Anastassios C. Papageorgiou, University of Turku, Finland

                Reviewed by: Manfred S. Weiss, Helmholtz Association of German Research Centers (HZ), Germany

                Max Nanao, European Synchrotron Radiation Facility, France

                *Correspondence: Nicole Zitzmann, nicole.zitzmann@ 123456bioch.ox.ac.uk ; Pietro Roversi, pietro.roversi@ 123456ibba.cnr.it
                [ † ]

                These authors have contributed equally to this work and share first authorship

                This article was submitted to Structural Biology, a section of the journal Frontiers in Molecular Biosciences

                Article
                960248
                10.3389/fmolb.2022.960248
                9794592
                36589243
                3095c6b0-8bd9-414a-a9cd-81491c386d5e
                Copyright © 2022 Caputo, Ibba, Le Cornu, Darlot, Hensen, Lipp, Marcianò, Vasiljević, Zitzmann and Roversi.

                This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

                History
                : 02 June 2022
                : 11 November 2022
                Funding
                Funded by: Medical Research Council , doi 10.13039/501100000265;
                Award ID: MRC–MC PC 15029
                Funded by: Wellcome Trust , doi 10.13039/100010269;
                Award ID: 204801/Z/16/Z 214090/Z/18/Z
                Categories
                Molecular Biosciences
                Original Research

                uggt,crystal polymorphism,structure-based lead discovery,structure determination pipeline,[(morpholin-4yl)methyl]quinolin-8-ol

                Comments

                Comment on this article