Molecular basis of USP7 inhibition by selective small-molecule inhibitors


Abstract

Small molecules are identified that inhibit the ubiquitin-specific protease USP7 with high affinity and specificity as explained by co-crystal structures, and are shown to reduce tumour growth in mice.

Most cited references 47

Features and development of Coot

P Emsley, B Lohkamp, W. Scott … (2010)

1. Introduction Macromolecular model building using X-ray data is an interactive task involving the iterative application of various optimization algorithms with evaluation of the model and interpretation of the electron density by the scientist. Coot is an interactive three-dimensional molecular-modelling program particularly designed for the building and validation of protein structures by facilitating the steps of the process. In recent years, initial construction of the protein chain has often been carried out using automatic model-building tools such as ARP/wARP (Langer et al., 2008), SOLVE/RESOLVE (Wang et al., 2004) and more recently Buccaneer (Cowtan, 2006). In consequence, relatively more time and emphasis is placed on model validation than has previously been the case (Dauter, 2006). The refinement and validation steps become increasingly important and also more time-consuming with lower resolution data. Coot aims to provide access to as many of the tools required in the iterative refinement and validation of a macromolecular structure as possible in order to facilitate those aspects of the process which cannot be performed automatically. A primary design goal has been to make the software easy to learn in order to provide a low barrier for scientists who are beginning to work with X-ray data. While this goal has not been met for every feature, it has played a major role in many of the design decisions that have shaped the software. The principal tasks of the software are the visualization of macromolecular structures and data, the building of models into electron density and the validation of existing models; these will be considered in the next three sections. The remaining sections of the paper will deal with more technical aspects of the software, including interactions with external software, scripting and testing.
2. Program design The program is constructed from a range of existing software libraries and a purpose-written Coot library which provides a range of tools specific to model building and visualization. The OpenGL and other graphics libraries, such as the X Window System and GTK+, provide the graphical user-interface functionality, the GNU Scientific Library (GSL) provides mathematical tools such as function minimizers and the Clipper (Cowtan, 2003) and MMDB (Krissinel et al., 2004) libraries provide crystallographic tools and data types. On top of these tools are the Coot libraries, which are used to manipulate models and maps and to represent them graphically. Much of this functionality may be accessed from the scripting layer (see §8), which allows programmatic access to all of the underlying functionality. Finally, the graphical user interface is built on top of the scripting layer, although in some cases it is more convenient for the graphical user interface to access the underlying classes directly (Fig. 1).
3. Visualization Coot provides tools for the display of three-dimensional data falling into three classes. (i) Atomic models (generally displayed as vectors connecting bonded atoms). (ii) Electron-density maps (generally contoured using a wire-frame lattice). (iii) Generic graphical objects (including the unit-cell box, noncrystallographic rotation axes and similar). A user interface and a set of controls allow the user to interact with the graphical display, for example in moving or rotating the viewpoint, selecting the data to be displayed and the mode in which those data are presented. The primary objective in the user interface as it stands today has been to make the application easy to learn. Current design of user interfaces emphasizes a number of characteristics for a high-quality graphical user interface (GUI).
Such characteristics include learnability, productivity, forgiveness (if a user makes a mistake, it should be easy to recover) and aesthetics (the application should look nice and provide a pleasurable experience). When designing the user interface for Coot, we aim to respect these issues; however, this may not always be achieved and the GUI often undergoes redesign. Ideally, a user who has a basic familiarity with crystallographic data but who has never used Coot before should be able to start the software, display their data and perform some basic manipulations without any instruction. In order for the software to be easy to learn, it is necessary that the core functionality of the software be discoverable, i.e. the user should be able to find out how to perform common tasks without consulting the documentation. This may be achieved in any of three ways. (i) The behaviour is intuitive, i.e. the behaviour of user-interface elements can be either anticipated or determined by a few experiments. An example of this is the rotation of the view, which is accomplished by simply dragging with the mouse. (ii) The behaviour is familiar and consistent, i.e. user-interface elements behave in a similar way to other common software. An example of this is the use of a ‘File’ menu containing ‘Open…’ options, leading to a conventional file-selection dialogue. (iii) The interface is explorable, i.e. if a user needs an additional functionality they can find it rapidly by inspecting the interface. An example of this is the use of organized menus which provide access to the bulk of the program functionality. Furthermore, tooltips are provided for most menus and buttons and informative widgets explain their function. 3.1. User interface The main Coot user interface window is shown in Fig. 2 ▶ and consists of the following elements. (i) In the centre of the main window is the three-dimensional canvas, on which the atomic models, maps and other graphical objects are displayed. 
By default this area has a black background, although this can be changed if desired. (ii) At the top of the window is a menu bar. This includes the following menus: ‘File’, ‘Edit’, ‘Calculate’, ‘Draw’, ‘Measures’, ‘Validate’, ‘HID’, ‘About’ and ‘Extensions’. The ‘File’, ‘Edit’ and ‘About’ menus fulfill their normal roles. ‘Calculate’ provides access to model-manipulation tools. ‘Draw’ implements display options. ‘Measures’ presents access to geometrical information. ‘Validate’ provides access to validation tools. ‘HID’ allows the human-interface behaviour to be customized. ‘Extensions’ provides access to a range of optional functionalities which may be customized and extended by advanced users. Additional menus can be added by the use of the scripting interface. (iii) Between the menu bar and the canvas is a toolbar which provides two very frequently used controls: ‘Reset view’ switches between views of the molecules and ‘Display Manager’ opens an additional window which allows individual maps and molecules to be displayed in different ways. This toolbar is customizable, i.e. additional buttons can be added. (iv) On the right-hand side of the window is a toolbar of icons which allow the modification of atomic models. By default these are displayed as icons, although tooltips are provided and text can also be displayed. (v) Below the canvas is a status bar in which brief text messages are displayed concerning the status of current operations. The user interface is implemented using the GTK+2 widget stack, although with some work this could be changed in the future. 3.2. Controls User input to the program is primarily via mouse and keyboard, although it is also possible to use some dial devices such as the ‘Powermate’ dial. The mouse may be used to select menu options and toolbar buttons in the normal way. In addition, the mouse and the keyboard may be used to manipulate the view in the three-dimensional canvas using the controls shown in Fig. 3 ▶. 
In a large program there is often tension between software being easy to learn and being easy to use. A program which is easy to use provides extensive shortcuts to allow common tasks to be performed with the minimum user input. Keyboard shortcuts, customizations and macro languages are common examples and are often employed by expert users of all types of software. Coot now provides tools for all of these. Much of the functionality of the package is now accessible from both the Python (http://www.python.org) and the Scheme (Kelsey et al., 1998 ▶) scripting languages, which may be used to construct more powerful tools using combinations of existing functions. One example is a function often used after molecular replacement which will step through every residue in a protein, replace any missing atoms, find the best-fitting side-chain rotamer and perform real-space refinement. This function is in turn bound to a menu item, although it would also be possible to bind it to a key on the keyboard. 3.3. Lighting model The lighting model used in Coot is a departure from the approach adopted in most molecular-graphics software. It is difficult to illustrate a three-dimensional shape in a two-dimensional representation of an object. The traditional approach is to use so-called ‘depth-cueing’: objects closer to the user appear more brightly lit and more distant objects are more like the background colour (usually darker). In the Coot model, however, the most brightly lit features are just forward of the centre of rotation. This innovation was accidental, but has been retained because it seemed to provide a more natural image and has generated positive feedback from users once they become accustomed to the new behaviour. It is now possible to offer an explanation for this result. Depth-cueing is an algorithm which adjusts the colours of graphical objects according to their distance from the viewer. Depth-cueing is used in several ways. 
When rendering outdoor scenes, it is used to wash out the colours of distant features to simulate the effect of light scattering in the intervening air. When rendering darkened scenes, the same effect can be used to darken distant objects in order to create the effect that the viewer is carrying a light source which illuminates nearer objects more brightly than distant ones. Note that both of these usages assume a ‘first-person’ view: the observer is placed within the three-dimensional environment. This is also borne out in the controls for manipulating the view: when the view is rotated, the whole environment usually rotates about the observer. However, fitting three-dimensional atomic models to X-ray data is a different situation. It is not useful to place the observer inside the model and rotate the model around them, not least because the scientist is usually more interested in looking at the molecule or electron density from the outside. As a result, it is normal to rotate the view not about the observer but rather about the centre of the feature being studied. Since the central feature is of most interest, it helps the visualization if it is the brightest entity. To properly light the model in this way is relatively slow, so in Coot an approximation is used and the plane perpendicular to the viewer that contains the central feature is most brightly lit. 3.4. Atomic model Coot displays the atoms of the atomic models as points on the three-dimensional canvas. If the points are within bonding distance then a line symbolizing a bond is drawn between the atomic points; otherwise the atoms are displayed as crosses. By default the atoms are coloured by element, with carbon yellow, oxygen red, nitrogen blue, sulfur green and hydrogen white. Bonds have two colours, with one half corresponding to each connecting atom. Additional atomic models are distinguished by different colour coding. The colour wheel is rotated and the element colours are adjusted accordingly. 
However, there is an option to fix the colours for noncarbon elements and the colour-wheel position can be adjusted for each molecule individually. Furthermore, Coot allows the user to colour the atomic model by molecule, chain, secondary structure, B factor and occupancy. Besides showing atomic models, Coot can also display Cα (or backbone) chains only. Again the model can be coloured in different modes, by chain, secondary structure or with rainbow colours from the N-terminus to the C-terminus. Currently, Coot offers some additional atomic representations in the form of different bond-width or ball-and-stick representation for selected residues. Information about individual atoms can be visualized in the form of labels. These show the atom name, residue number, residue name and chain identifier. Labels are shown upon Shift + left mouse click or double left mouse click on an atom (the atom closest to the rotation/screen centre can be labelled using the keyboard shortcut ‘l’). This operation not only shows the label beside the atom in the three-dimensional canvas, but also gives more detailed information about the atom, including occupancy, B factor and coordinates, in the status bar. Symmetry-equivalent atoms of the atomic model can be displayed in Coot within a certain radius either as whole chains or as atoms within this radius. Different options for colouring and displaying atoms or Cα backbone are provided. The symmetry-equivalent models can be labelled as described above. Additionally, the label will provide information about the symmetry operator used to generate the selected model. Navigation around the atomic models is primarily achieved with a GUI (‘Go To Atom…’). This allows the view to be centred on a particular atom by selection of a model, chain ID, residue number and atom name. Buttons to move to the next or previous residue are provided and are also available via keyboard shortcuts (space bar and Shift space bar, respectively). 
Furthermore, each chain is displayed as an expandable tree of its residues, with atoms that can be selected for centring. Additionally, a mouse can be used for navigation, so a middle mouse click centres on the clicked atom. A keyboard shortcut for the view to be centred on a Cα atom of a specific residue is provided by the use of Ctrl-g followed by input of the chain identifier and residue number (terminated by Enter). All atomic models, in contrast to other display objects, are accessible by clicking a mouse button on an atom centre. This allows, for example, re-centring, selection and labelling of the model. 3.5. Electron density Electron-density maps are displayed using a three-dimensional mesh to visualize the surface of electron-density regions higher than a chosen electron-density value using a ‘marching-cubes’-type algorithm (Lorensen & Cline, 1987 ▶). The spacing of the mesh is dictated by the spacing of the grid on which the electron density is sampled. Since electron-density maps are most often described in terms of structure factors, the sampling can be modified by the user at the point where the electron density is read into the program. The contour level may be varied interactively using the scroll wheel on the mouse (if available) or alternatively by using the keyboard (‘+’ and ‘-’). In most cases this avoids the need for multiple contour levels to be displayed at once, although additional contour levels can be displayed if desired. The colour of the electron-density map may be selected by the user. By default, the first map read into the program is contoured in blue, with subsequent maps taking successive colour hues around a colour wheel. Difference maps are by default contoured at two levels, one positive and one negative (coloured green and red, respectively). The electron density is contoured in a box about the current screen centre and is interactively re-contoured whenever the view centre is changed. 
By default, this box covers a volume extending at least 10 Å in each direction from the current screen centre. This is an appropriate scale for manipulating individual units of a peptide or nucleotide chain and provides good interactive performance, even on older computers. Larger volumes may be contoured on faster machines. A ‘dynamic volume’ option allows the volume contoured to be varied with the current zoom level, so that the contoured region always fills the screen. A ‘dynamic sampling’ option allows the map to be contoured on a subsampled grid (e.g. every second or fourth point along each axis). This is useful when using a solvent mask to visualize the packing of the molecules in the crystal. 3.6. Display objects There are a variety of non-interactive display objects which can also be superimposed on the atomic model and electron density. These include the boundaries of the unit cell, an electron-density ridge trace (or skeleton), surfaces, three-dimensional text annotations and dots (used in the MolProbity interface). These cannot be selected, but aid in the visualization of features of the electron density and other entities. 3.7. File formats Coot recognizes a variety of file formats from which the atomic model and electron density may be read. The differences in the information stored in these various formats mean that some choices have to be made by the user. This is achieved by providing several options for reading electron density and, where necessary, by requesting additional information from the user. The file formats which may be used for atomic models and for electron density will be considered in turn. In addition to obtaining data from the local storage, it is also possible to obtain atomic models directly from the Protein Data Bank (Bernstein et al., 1977 ▶) by entering the PDB code of a deposited structure. 
Similarly, in the case of structures for which experimental data have been deposited, the model and phased reflections may both be obtained from the Electron Density Server (Kleywegt et al., 2004 ▶). 3.7.1. Atomic models Atomic models are read into Coot by selecting the ‘Open Coordinates…’ option from the File menu. This provides a standard file selector which may be used to select the desired file. Coot recognizes atomic models stored in the following three formats. (i) Protein Data Bank (PDB) format (with file extension .pdb or .ent; compressed files of this format with extension .gz can also be read). The latest releases provide compatibility with version 3 of the PDB format. (ii) Macromolecular crystallographic information file (mmCIF; Westbrook et al., 2005 ▶) format (extension .cif). (iii) SHELX result files produced by the SHELXL refinement software (extension .res or .ins). In each case, the unit-cell and space-group information are read from the file (in the case of SHELXL output the space group is inferred from the symmetry operators). The atomic model is read, including atom name, alternate conformation code, monomer name, sequence number and insertion code, chain name, coordinates, occupancy and isotropic/anisotropic atomic displacement parameters. PDB and mmCIF files are handled using the MMDB library (Krissinel et al., 2004 ▶), which is also used for internal model manipulations. 3.7.2. Electron density The electron-density representation is a significant element of the design of the software. Coot employs a ‘crystal space’ representation of the electron density, in which the electron density is practically infinite in extent, in accordance with the lattice repeat and cell symmetry of the crystal. Thus, no matter where the viewpoint is located in space density can always be represented. This design decision is achieved by use of the Clipper libraries (Cowtan, 2003 ▶). 
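The ‘crystal space’ idea can be illustrated with a small sketch. It is not Coot or Clipper code: it assumes an orthogonal cell, uses only the lattice repeat (no space-group symmetry), takes the nearest grid point rather than interpolating, and the name `density_at` and the nested-list grid layout are hypothetical. The point is that any coordinate, however far from the origin, wraps back onto the stored grid.

```python
def density_at(grid, cell_lengths, xyz):
    """Nearest-grid-point density at xyz (in Å) for a periodic, orthogonal cell.

    Assumptions (not from the paper): grid is a nested list indexed
    grid[i][j][k] covering exactly one unit cell, and only lattice
    translations are applied (no space-group operators).
    """
    shape = (len(grid), len(grid[0]), len(grid[0][0]))
    index = []
    for coord, length, n in zip(xyz, cell_lengths, shape):
        frac = coord / length              # fractional coordinate, any real value
        index.append(round(frac * n) % n)  # wrap onto the cell by the lattice repeat
    i, j, k = index
    return grid[i][j][k]
```

With this kind of lookup, a position one or more cell lengths away from a point returns the same value as the point itself, which is why a viewpoint anywhere in space can always be given density. A bounded-box map has no such wrap-around step.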
The alternative approach is to just display electron density in a bounded box described by the input electron-density map. This approach is simpler and may be more appropriate in some specific cases (e.g. when displaying density from cryo-EM experiments or some types of NCS maps). However, it has the limitation that no density is available for symmetry-related molecules and if the initial map has been calculated with the wrong extent then it must be recalculated in order to view the desired regions. This distinction is important in that it affects how electron-density data should be prepared for use in Coot. Files prepared for O or PyMOL may not be suitable for use in Coot. In order to read a map file into Coot, it should cover an asymmetric unit or unit cell. In contrast, map files prepared for O (Jones et al., 1991 ▶) or PyMOL (DeLano, 2002 ▶) usually cover a bounded box surrounding the molecule. While it is possible to derive any bounded box from the asymmetric unit, it is not always possible to go the other way; therefore, using map files prepared for other software may lead to unexpected results in some cases, the most common being an incorrect calculation of the standard deviation of the map. If one uses more advanced techniques that involve masking, the electron-density map must have the same symmetry as the associated model molecule. Electron density may be read into Coot either in the form of structure factors (with optional weights) and phases or alternatively in the form of an electron-density map. There are a number of reasons why the preferred approach is to read reflection data rather than a map. (i) Coot can always obtain a complete asymmetric unit of data, avoiding the problems described above. (ii) Structure-factor files are generally smaller than electron-density maps. (iii) Some structure-factor files, and in particular MTZ files, provide multiple sets of data in a single file. 
Thus, it is possible to read a single file and obtain, for example, both best and difference maps. The overhead in calculating an electron-density map by FFT is insignificant for modern computers. 3.7.3. Reading electron density from a reflection-data file Two options are provided for reading electron density from a reflection-data file. These are ‘Auto Open MTZ…’ and ‘Open MTZ, mmcif, fcf or phs…’ from the ‘File’ menu. (i) ‘Auto Open MTZ…’ will open an MTZ file containing coefficients for the best and difference map, automatically select the FWT/PHWT and the DELFWT/DELPHWT pairs of labels and display both electron-density maps. Currently, suitable files are generated by the following software: Phaser (Storoni et al., 2004 ▶), REFMAC (Murshudov et al., 1997 ▶), phenix.refine (Adams et al., 2002 ▶), DM (Zhang et al., 1997 ▶), Parrot (Cowtan, 2010 ▶), Pirate (Cowtan, 2000 ▶) and BUSTER (Blanc et al., 2004 ▶). (ii) ‘Open MTZ, mmcif, fcf or phs…’ will open a reflection-data file in any of the specified formats. Note that XtalView .phs files do not contain space-group and cell information: in these cases a PDB file must be read first to obtain the relevant information or the information has to be entered manually. MTZ files may contain many sets of map coefficients and so it is necessary to select which map coefficients to use. In this case the user is provided with an additional window which allows the map coefficients to be selected. The standard data names for some common crystallographic software are provided in Table 1 ▶. SHELX .fcf files are converted to mmCIF format and the space group is then inferred from the symmetry operators. 4. Model building Initial building of protein structures from experimental phasing is usually accomplished by automated methods such as ARP/wARP, RESOLVE (Wang et al., 2004 ▶) and Buccaneer (Cowtan, 2006 ▶). However, most of these methods rely on a resolution of better than 2.5 Å and yield more complete models the better the resolution. 
The main focus in Coot, therefore, is the completion of initial models generated by either molecular replacement or automated model building as well as building of lower resolution structures. However, the features described below are provided for cases where an initial model is not available. 4.1. Tools for general model building 4.1.1. Cα baton mode Baton building, which was introduced by Kleywegt & Jones (1994 ▶), allows a protein main chain to be built by using a 3.8 Å ‘baton’ to position successive Cα atoms at the correct spacing. In Coot, this facility is coupled with an electron-density ridge-trace skeleton (Greer, 1974 ▶). Firstly, a skeleton is calculated which follows the ridges of the electron density. The user then selects baton-building mode, which places an initial baton with one end at the current screen centre. Candidate positions for the next α-carbon are highlighted as crosses selected from those points on the skeleton which lie at the correct distance from the start point. The user can cycle through a list of candidate positions using the ‘Try Another’ button or alternatively rotate the baton freely by use of the mouse. Additionally, the length of the baton can be changed to accommodate moderate shifts in the α-carbon positions. Once a new position is accepted, the baton moves so that its base is on the new α-carbon. In this way, a chain may be traced manually at a rate of between one and ten residues per minute. 4.1.2. Cα zone→main chain Having placed the Cα atoms, the rest of the main-chain atoms may be generated automatically. This tool uses a set of 62 high-resolution structures as the basis for a library of main-chain fragments. Hexapeptide and pentapeptide fragments are chosen to match the Cα positions of each successive pentapeptide of the Cα trace in turn, following the method of Esnouf (1997 ▶), which is similar to that of Jones & Thirup (1986 ▶). 
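As a rough illustration of matching library fragments to a window of the Cα trace, the sketch below ranks fragments by how well their internal Cα–Cα distances agree with those of the window. This is a deliberately simplified stand-in and not the Esnouf (1997) procedure or Coot's implementation: comparing distance matrices avoids an explicit superposition but cannot distinguish a fragment from its mirror image. The function names and the coordinate format (lists of (x, y, z) tuples in Å) are assumptions.

```python
import math
from itertools import combinations


def internal_distances(points):
    """All pairwise distances within one set of Cα positions."""
    return [math.dist(a, b) for a, b in combinations(points, 2)]


def distance_rmsd(fragment, window):
    """RMS difference between the internal distances of two equal-length Cα sets."""
    da, db = internal_distances(fragment), internal_distances(window)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(da, db)) / len(da))


def best_fragment(library, window):
    """Index and score of the library fragment that best matches the window."""
    scores = [distance_rmsd(fragment, window) for fragment in library]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

Sliding `window` along the trace one residue at a time and calling `best_fragment` at each position would give one candidate fragment per pentapeptide, which could then be merged into a full main-chain trace.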
The fragments with the best fit to the candidate Cα positions are merged to provide a full trace. After this step, one typically performs a real-space refinement of the subsequent main-chain model. 4.1.3. Find secondary structure Protein secondary-structure elements, including α-helices and β-strands, can be located by their repeating electron-density features, which lead to high and low electron-density values in characteristic positions relative to the consecutive Cα atoms. The ‘Find Secondary Structure’ tool performs a six-dimensional rotation and translation search to find the likely positions of helical and strand elements within the electron density. This search has been highly optimized in order to achieve interactive performance for moderately sized structures and as a result is less exhaustive than the corresponding tools employed in automated model-building packages: however, it can provide a very rapid indication of map quality and a starting point for model building. 4.1.4. Place helix here At low resolution it is sometimes possible to identify secondary-structure features in the electron density when the Cα positions are not obvious. In this case, Coot can fit an α-helix automatically. This process involves several stages. (i) A local optimization is performed on the starting position to maximize the integral of the electron density over a 5 Å sphere. This tends to move the starting point close to the helix axis. (ii) A search is performed to obtain the direction of the helix by integrating the electron density in a cylinder of radius 2.5 Å and length 12 Å. A two-dimensional orientation search is performed to optimize the orientation of the cylinder. This gives the direction of the helix. (iii) A theoretical α-helical model (including C, Cα, N and O atoms) is placed in the density in accordance with the position and direction already found. Different rotations of the model around the helix axis must be considered. 
Each of the resulting models is scored by the sum of the density at the atomic centres. At this stage the direction of the helix is unknown and so both directions are tested. (iv) Next, a choice is made between the best-fitting models for each helix direction by comparing the electron density at the Cβ positions. If neither orientation gives a significantly better fit for the Cβ atoms, both helices are presented to the user. (v) Finally, attempts are made to extend the helix from the N- and C-termini using ideal ϕ, ψ values.
4.1.5. Place strand here A similar method is used for placing β-strand fragments in electron density. However, there are three differences compared with helix placement: firstly the initial step is omitted, secondly the length of the fragment (number of residues) needs to be provided by the user and finally the placed fragments are obtained from a database. The first step (optimizing the starting position) is unreliable for strands owing to the smaller radius of the cylinder, i.e. main chain, combined with larger density deviations originating from the side chains. Hence, it is omitted and the user must provide a starting position in this case. The integration cylinder used in determining the orientation of the strand has a radius of 1 Å and a length of 20 Å. The ϕ, ψ torsion angles in β-strands in proteins deviate from the ideal values, resulting in curved and twisted strands. Such strands cannot be well modelled using ideal values of ϕ and ψ; therefore, candidate strand fragments corresponding to the requested length are taken from a strand ‘database’ (top100 or top500; Word, Lovell, LaBean et al., 1999) and used in the search.
4.1.6. Ideal DNA/RNA Coot has a function to generate idealized atomic structures of single or double-stranded A-form or B-form RNA or DNA given a nucleotide sequence.
The function is menu-driven and can produce any desired helical nucleic acid coordinates in PDB format with canonical Watson–Crick base pairing from a given input sequence with the click of a single button. Because most DNA and RNA structures are composed of at least local regions of regular near-ideal helical structural elements, the ability to generate nucleic acid helical models on the fly is of particular value for molecular replacement. Recently, a collection of short ideal A-form RNA helical fragments generated within Coot were used to solve a structurally complex ligase ribozyme by molecular replacement (Robertson & Scott, 2008). Using Coot together with the powerful molecular-replacement program Phaser (Storoni et al., 2004) not only permitted this novel RNA structure to be solved without resorting to heavy-atom methods, but several other RNA and RNA/protein complexes were also subsequently determined using this approach (Robertson & Scott, 2007). Since Coot and Phaser can be scripted using embedded Python components, an automated and integrated phasing system could be developed within the current software framework.
4.1.7. Find ligands The automatic fitting of ligands into electron-density maps is a frequently used technique that is particularly useful for pharmaceutical crystallographers (see, for example, Williams et al., 2005). The mechanism in Coot addresses a number of ligand-fitting scenarios and is a modified form of a previously described algorithm (Oldfield, 2001). It is common practice in ‘fragment screening’ to soak different ligands into the same crystal (Blundell et al., 2002). Using Coot one can either specify a region in space or search a whole asymmetric unit for either a single or a number of different ligand types. In the ‘whole-map’ scenario, candidate ligand sites are found by cluster analysis of a residual map.
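One simple form of cluster analysis on a residual map is to group grid points above a density threshold into connected blobs and treat the largest blobs as candidate sites. The sketch below does this with a breadth-first search over face-adjacent grid points. It is an illustration only: the text does not specify Coot's clustering method, the threshold is a free parameter, crystal periodicity is ignored, and `find_clusters` and `centroid` are hypothetical names.

```python
from collections import deque


def find_clusters(residual, threshold):
    """Group grid points with density above threshold into face-connected clusters.

    residual is a nested list indexed residual[i][j][k]. Returns a list of
    clusters (each a sorted list of (i, j, k) indices), largest first.
    """
    nx, ny, nz = len(residual), len(residual[0]), len(residual[0][0])
    above = {(i, j, k)
             for i in range(nx) for j in range(ny) for k in range(nz)
             if residual[i][j][k] > threshold}
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    clusters = []
    while above:
        start = above.pop()                 # seed a new cluster
        queue, members = deque([start]), [start]
        while queue:
            i, j, k = queue.popleft()
            for di, dj, dk in steps:
                nxt = (i + di, j + dj, k + dk)
                if nxt in above:            # unvisited neighbour above threshold
                    above.remove(nxt)
                    members.append(nxt)
                    queue.append(nxt)
        clusters.append(sorted(members))
    return sorted(clusters, key=len, reverse=True)


def centroid(cluster):
    """Mean grid position of a cluster, usable as a candidate site centre."""
    n = len(cluster)
    return tuple(sum(point[axis] for point in cluster) / n for axis in range(3))
```

Each cluster's centroid gives a position at which a ligand could be tried, and the spread of the cluster's points gives a rough indication of the size and shape of the density to be explained.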
The candidate ligands are fitted in turn to each site (with the candidate orientations being generated by matching the eigenvectors of the ligand to that of the cluster). Each candidate ligand is fitted and scored against the electron density. The best-fitting orientation of the ligand candidates is chosen. Ligands often contain a number of rotatable bonds. To account for this flexibility, Coot samples torsion angles around these rotatable bonds. Here, each rotatable bond is sampled from an independent probability distribution. The number of conformers is under user control and it is recommended that ligands with a higher number of rotatable bonds should be allowed more conformer candidates. Above a certain number of rotatable bonds it is more efficient to use a ‘core + fragment by fragment’ approach (see, for example, Terwilliger et al., 2006 ▶). 4.2. Rebuilding and refinement The rebuilding and refinement tools are the primary means of model manipulation in Coot and are all grouped together in the ‘Model/Fit/Refine’ toolset. These tools may be accessed either through a toolbar (which is usually docked on the right-hand side of the main window) or through a separate ‘Model/Fit/Refine’ window containing buttons for each of the toolbar functions. The core of the rebuilding and refinement tools is the real-space refinement (RSR) engine, which handles the refinement of the atomic model against an electron-density map and the regularization of the atomic model against geometric restraints. Refinement may be invoked both interactively, when executed by the user, and non-interactively as part of some of the automated fitting tools. The refinement and regularization tools are supplemented by a range of additional tools aimed at assisting the fitting of protein chains. These features are discussed below. 4.3. Tools for moving existing atoms 4.3.1. 
Real-space refine zone The real-space refine tool is the most frequently used tool for the refinement and rebuilding of atomic models and is also incorporated as a final stage in a number of other tools, e.g. ‘Add Terminal Residue…’. In interactive mode, the user selects the RSR button and then two atoms bounding a range of monomers (amino acids or otherwise). Alternatively, a single atom can be selected followed by the ‘A’ key to refine a monomer and its neighbours. All atoms in the selected range of monomers will be refined, including any flanking residues. Atoms of the flanking residues are marked as ‘fixed’ but are required to be added to the refinement so that the geometry (e.g. peptide bonds, angles and planes) between fixed and moving parts is also optimized. The selected atoms are refined against a target consisting of two terms: the first being the atomic number (Z) weighted sum of the electron-density values over all the atomic centres and the second being the stereochemical restraints. The progress of the refinement is shown with a new set of atoms displayed in white/pale colours. When convergence is reached the user is shown a dialogue box with a set of χ2 scores and coloured ‘traffic lights’ indicating the current geometry scores in each of the geometrical criteria (Fig. 4 ▶). Additionally, a warning is issued if the refined range contains any new cis-peptide bonds. At this stage the user may adjust the model by selecting an atom with the mouse and dragging it, whereby the other atoms will move with the dragged atom. Alternatively, a single atom may be dragged by holding the Ctrl key. As soon as the atoms are released, the selected atoms will refine from the dragged position. Optionally, before the start of refinement atoms may be selected to be fixed during the refinement (in addition to the atoms of the flanking residues). 4.3.2. 
Sphere refinement One of the problems with the refinement mode described above is that it only considers a linear range of residues. This can cause difficulties, with some side chains being inappropriately refined into the electron density of neighbouring residues, particularly at lower resolutions. Additionally, a linear residue selection precludes the refinement of entities such as disulfide bonds. Therefore, a new residue-selection mechanism was introduced to address these issues: the so-called ‘Sphere Refinement’. This mode selects residues that have atoms within a given radius of a specified position (typically within 4 Å of the centre of the screen). The selected residues are matched to the dictionary and any user-defined links (typically from the mon_lib_list.cif in the REFMAC dictionary), e.g. disulfide bonds, glycosidic linkages and formylated lysines. If such links are found and the (supposedly) bonded atoms are within 3 Å of each other then these extra link restraints are added into the refinement. 4.3.3. Ramachandran restraints At lower resolution it is sometimes difficult to obtain an acceptable fit of the model to the density and at the same time achieve a Ramachandran plot of high quality (most residues in favourable regions and less than 1% outliers). If a Ramachandran score is added to the target function then the Ramachandran plot can be improved. The analytical form for torsion gradients (∂θ/∂x₁ and so on) for each of the x, y, z positions of the four atoms contributing to the torsion angle has been reported previously (Emsley & Cowtan, 2004 ▶) (in the case of Ramachandran restraints, the θ torsions will be ϕ and ψ). The extension of the torsion gradients for use as Ramachandran restraints is performed in the following manner. Firstly, two-dimensional log Ramachandran plots R are generated as tables (one for each of the residue types Pro, Gly and non-Pro or Gly). 
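As an illustration of how such a table might be prepared, the following Python sketch converts a grid of Ramachandran probabilities into log values and gives zero-probability cells finite values that fall off with grid distance from the nearest allowed cell, so that a finite-difference gradient points back towards allowed regions. The function names, the breadth-first (periodic Manhattan) distance measure, the step_penalty parameter and the use of the lowest allowed log value as a floor are assumptions made for this example and are not taken from Coot.

```python
import math
from collections import deque


def log_rama_table(prob, step_penalty=1.0):
    """Build a log-probability table from a 2D grid of probabilities.

    Zero-probability cells receive (lowest allowed log value) minus
    step_penalty times their grid distance to the nearest allowed cell.
    This fill scheme is an assumption for illustration only.
    """
    n_phi, n_psi = len(prob), len(prob[0])
    dist = [[None] * n_psi for _ in range(n_phi)]
    queue = deque()
    for i in range(n_phi):
        for j in range(n_psi):
            if prob[i][j] > 0.0:
                dist[i][j] = 0
                queue.append((i, j))
    if not queue:
        raise ValueError("table has no allowed cells")
    floor = min(math.log(prob[i][j]) for i, j in queue)
    # Breadth-first search outwards from the allowed cells; torsion angles
    # are periodic, so grid indices wrap around.
    while queue:
        i, j = queue.popleft()
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            a, b = (i + di) % n_phi, (j + dj) % n_psi
            if dist[a][b] is None:
                dist[a][b] = dist[i][j] + 1
                queue.append((a, b))
    return [
        [
            math.log(prob[i][j]) if dist[i][j] == 0
            else floor - step_penalty * dist[i][j]
            for j in range(n_psi)
        ]
        for i in range(n_phi)
    ]


def table_gradient(table, i, j, cell_deg):
    """Central-difference estimate of (dR/dphi, dR/dpsi) at cell (i, j)."""
    n_phi, n_psi = len(table), len(table[0])
    d_phi = (table[(i + 1) % n_phi][j] - table[(i - 1) % n_phi][j]) / (2.0 * cell_deg)
    d_psi = (table[i][(j + 1) % n_psi] - table[i][(j - 1) % n_psi]) / (2.0 * cell_deg)
    return d_phi, d_psi
```

In this sketch a disallowed cell never scores higher than any allowed cell, which is one simple way of keeping the gradient in disallowed regions weak but consistently directed towards the allowed ones.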
Where the Ramachandran probability becomes zero the log probability tends to negative infinity and so it is replaced by values which become increasingly negative with distance from the nearest nonzero value. This provides a weak gradient in the disallowed regions towards the nearest allowed region. The log Ramachandran plot provides the following values and derivatives: R(ϕ, ψ), ∂R/∂ϕ and ∂R/∂ψ. The derivative of R with respect to the coordinates is required for the addition into the target geometry and is generated by the chain rule as ∂R/∂x₁ = (∂R/∂θ)(∂θ/∂x₁) (and so on for each of the x, y, z positions of the atoms in the torsion). Adding a Ramachandran score to the geometry target function is not without consequences. The Ramachandran plot has for a long time been used as a validation criterion, therefore if it is used in geometry optimization it becomes less informative as a validation metric. Kleywegt & Jones (1996 ▶) included the Ramachandran plot in the restraints during refinement using X-PLOR (Brünger, 1992 ▶) and reported that the number of Ramachandran outliers was reduced by about a third using moderate force constants. However, increasing the force constants by over two orders of magnitude only marginally decreased the number of outliers. As a result, Kleywegt and Jones note that the Ramachandran plot retains significant value as a validation tool even when it is also used as a restraint. Using the Ramachandran restraints as implemented in Coot with the default weights, the number of outliers can be reduced from around 10% to 5% (typical values). 4.3.4. Regularize zone The ‘Regularize Zone’ option functions in the same way as ‘Real-Space Refine Zone’ except that in this case the model is refined with respect to stereochemical restraints but without reference to any electron density. 4.3.5. Rigid-body fit zone The ‘Rigid-Body Fit Zone’ option also follows a similar interface convention to the other refinement options. A range of atoms is selected and the orientation of the selected group is refined to best fit the density. 
In this case the density is the only contributor to the target function, since the geometry of the fragment is not altered. No constraints are placed on the bonding atoms. If atoms are dragged after refinement, no further refinement is performed on the fragment. 4.3.6. Rotate/translate zone Using this tool, the selected residue selection can be translated and rotated either by dragging it around the screen or through the use of user-interface sliders. No reference to the map is made. The rotation centre can be specified to be either the last atom selected or the centre of mass of the fragment rotated. Additionally, a selection of the whole chain or molecule can be transformed. 4.3.7. Rotamer tools Four tools are available for the fitting of amino-acid side chains. For a side chain whose amino-acid type is already correctly assigned, the best rotamer may be chosen to fit the density either automatically or manually. If the automatic option is chosen then the side-chain rotamer from the MolProbity library (Lovell et al., 2000 ▶) which gives rise to the highest electron-density values at the atomic centres is selected and rigid-body refined (this includes the main-chain atoms of the residues). Otherwise, the user is presented with a list of rotamers for that side-chain type sorted by frequency in the database. The user can then scroll through the list of rotamers using either the keyboard or user-interface buttons to select the desired rotamer. Rotamers are named according to the MolProbity system. Briefly, the χ angles are given letters according to the torsion angle: ‘t’ for approximately 180°, ‘p’ for approximately 60° and ‘m’ for approximately −60° (Lovell et al., 2000 ▶). The other two options (‘Mutate & Auto Fit’ and ‘Simple Mutate’) allow the amino-acid type to be assigned or changed. The ‘Mutate & Auto Fit Rotamer’ option allows an amino-acid type to be selected from a list and then immediately performs the autofit rotamer operation as above. 
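The letter-based rotamer naming described above can be illustrated with a short Python sketch that assigns each χ angle to the nearest of 180°, 60° and −60°. The function names are hypothetical, and the sketch covers only the ‘t’/‘p’/‘m’ letters mentioned in the text, not any further detail of the MolProbity naming scheme.

```python
def chi_letter(chi_deg):
    """Return 't', 'p' or 'm' for the nearest of 180, 60 and -60 degrees."""

    def circular_distance(a, b):
        # Smallest angular separation between two angles, in degrees.
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)

    targets = {"t": 180.0, "p": 60.0, "m": -60.0}
    return min(targets, key=lambda name: circular_distance(chi_deg, targets[name]))


def rotamer_letters(chi_angles):
    """Concatenate the letters for chi1, chi2, ... of one side chain."""
    return "".join(chi_letter(chi) for chi in chi_angles)
```

For example, a side chain with χ1 near −65° and χ2 near 175° would be labelled ‘mt’ under this simplified scheme.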
The ‘Simple Mutate’ option changes the amino-acid type and builds the side-chain atoms in the most frequently occurring rotamer without further refinement. 4.3.8. Torsion editing (‘Edit Chi Angles’, ‘Edit Backbone Torsions’, ‘Torsion General’) Coot has different tools for editing the main-chain and side-chain (or ligand) torsion angles. The main-chain torsion angles, namely ϕ and ψ, can be edited using ‘Edit Backbone Torsion…’. With two sliders, the peptide and carbonyl torsion angles can be adjusted. A separate window showing the Ramachandran plot with the two residues forming the altered peptide bond is displayed with the position of the residues updated as the angles change. Side-chain (or ligand) torsion angles must be defined prior to editing. Either the user manually defines the four atoms forming the torsion angle (‘Torsion General’) or the torsion angles are determined automatically and the user selects the one to edit. In the latter case the bond around which the selected torsion angle is edited is visually marked. Using the mouse, the angle can then be rotated freely. 4.3.9. Other protein tools (‘Flip peptide’, ‘Side Chain 180° Flip’, ‘Cis→Trans’) There are three other tools to perform common corrections to protein models. ‘Flip peptide’ rotates the planar atoms in a peptide group through 180° about the vector joining the bounding Cα atoms (Jones et al., 1991 ▶). ‘Side Chain 180° Flip’ rotates the last torsion of a side chain through 180° (e.g. to swap the OD1 and ND2 side-chain atoms of Asn). ‘Cis→Trans’ shifts the torsion of the peptide bond through 180°, thereby changing the peptide bond from trans to cis and vice versa. 4.4. Tools for adding atoms to the model 4.4.1. Find waters The water-finding mechanism in Coot uses the same cluster analysis as is used in ligand fitting. However, only those clusters below a certain volume (by default 4.2 Å³) are considered as candidate sites for water molecules. 
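The volume filter can be sketched in a few lines of Python. The representation of a cluster as a list of grid-point positions, the voxel_volume argument and the function name are hypothetical; only the 4.2 Å³ default is taken from the text.

```python
def water_candidate_sites(clusters, voxel_volume, max_volume=4.2):
    """Return the centres of clusters small enough to be water candidates.

    clusters     : list of clusters, each a list of (x, y, z) positions in Å
    voxel_volume : volume in Å^3 represented by one grid point (assumed input)
    max_volume   : clusters at or above this volume are rejected
    """
    sites = []
    for points in clusters:
        if not points:
            continue  # ignore empty clusters
        volume = len(points) * voxel_volume
        if volume < max_volume:
            n = len(points)
            centre = tuple(sum(p[k] for p in points) / n for k in range(3))
            sites.append(centre)
    return sites
```

Larger clusters are left for the ligand-fitting and unmodelled-blob tools, which use the same cluster analysis.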
The centre of each cluster is computed and a distance check is then made to the potential hydrogen-bond donors or acceptors in the protein molecule (or other waters). The distance criteria for acceptable hydrogen-bond length are under user control. Additionally, a test for acceptable sphericity of the electron density is performed. 4.4.2. Add terminal residue To build additional residues at the N- and C-termini of protein chains, the MolProbity ϕ, ψ distribution is used to generate a set of randomly selected ϕ, ψ pairs, from which the positions of the N, Cα, O and C atoms of the next two residues are generated. The conformation of these new atoms is then scored against the electron-density map and recorded. This procedure is carried out a number of times (by default 100). The best-fitting conformation is offered as a candidate to the user (only the nearest of the two residues is kept). 4.4.3. Add alternate conformation Alternate conformations are generated by splitting the residue into two sets of conformations (A and B). By default all atoms of the residue are split, or alternatively only the Cα and side-chain atoms are divided. If the residue chosen is a standard protein residue then the rotamer-selection dialogue described above is also shown, along with a slider to specify the occupancy of the new conformation. 4.4.4. Place atom at pointer This is a simple interface to place a typed atom at the position of the centre of the screen. It can place additional water or solvent molecules in unmodelled electron-density peaks and is used in conjunction with the ‘Find blobs’ tool, which allows the largest unmodelled peaks to be visited in turn. 4.5. Tools for handling noncrystallographic symmetry (NCS) Noncrystallographic symmetry (NCS) can be exploited during the building of an atomic model and also in the analysis of an existing model. Coot provides five tools to help with the building and visualization of NCS-related molecules. 
(i) NCS ghost molecules. In order to visualize the similarities and differences between NCS-related molecules, a ‘ghost’ copy of any or all NCS-related chains may be superimposed over a specific chain in the model. The ‘ghost’ copies are displayed in thin lines and coloured differently, as well as uniformly, in order to distinguish them from the original. The superposition may be performed automatically by secondary-structure matching (Krissinel & Henrick, 2004 ▶) or by least-squares superposition. An example of an NCS ghost molecule is shown in Fig. 5 ▶. (ii) NCS maps. The electron density of NCS-related molecules can be superimposed in order to allow differences in the electron density to be visualized. This is achieved by transforming the coordinates of the three-dimensional contour mesh, rather than the electron density itself, in order to provide good interactive performance. The operators are usually determined with reference to an existing atomic model which obeys the same NCS relationships. An example of an NCS map is shown in Fig. 6 ▶. (iii) NCS-averaged maps. In addition to viewing NCS-related copies of the electron density, the average density of the related regions may be computed and viewed. In noisy maps this can provide a clearer starting point for model building. (iv) NCS rebuilding. When building an atomic model of a molecule with NCS, it is often more convenient to work on one chain and then replicate the changes made in every NCS-related copy of that chain (at least in the early stages of model building). This can be achieved by selecting two related chains and replacing the second chain in its entirety, or in a specific residue range, with an NCS-transformed copy of the first chain. (v) NCS ‘jumping’. The view centre jumps to the next NCS-related peer chain and at the same time the NCS operators are taken into account so that the relative view remains the same. This provides a means for rapid visual comparison of NCS-related entities. 5. 
Validation Coot incorporates a range of validation tools from the comparison of a model against electron density to comprehensive geometrical checks for protein structures and additional tools specific to nucleotides. It also provides convenient interfaces to external validation tools: most notably the MolProbity suite (Davis et al., 2007 ▶), but also to the REFMAC refinement software (Murshudov et al., 1997 ▶) and dictionary (Vagin et al., 2004 ▶). Many of the internal validation tools provide a uniform interface in the form of colour-coded bar charts, for example the ‘Density Fit Analysis’ chart (Fig. 7 ▶). This window contains one bar chart for each chain in the structure. Each chart contains one bar for each residue in the chain. The height and colour of the bar indicate the model quality of the residue, with small green bars indicating a good or expected/conventional conformation and large red bars indicating poor-quality or ‘unconventional’ residues. The chart is active, i.e. on moving the pointer over a bar, tooltips provide relevant statistics and clicking on a bar changes the view in the main graphics window to centre on the selected residue. In this way, a rapid overview of model quality is obtained and problem areas can be investigated. In order to obtain a good structure for submission, the user may simply cycle through the validation options, correcting any problems found. The available validation tools are described in more detail in the following sections. 5.1. Ramachandran plot The Ramachandran plot tool (Fig. 8 ▶) launches a new window in which the Ramachandran plot for the active molecule is displayed. A data point appears in this plot for each residue in the protein, with different symbols distinguishing Gly and Pro residues. The background of the plot shows frequency data for Ramachandran angles using the Richardsons’ data (Lovell et al., 2003 ▶). 
The plot is interactive: clicking on a data point moves the view in the three-dimensional canvas to centre on the corresponding residue. Similarly, selecting an atom in the model highlights the corresponding data point. Moving the mouse over a data point corresponding to a Gly or Pro residue causes the Ramachandran frequency data for that residue type to be displayed. 5.2. Kleywegt plot The Kleywegt plot (Kleywegt, 1996 ▶; Fig. 9 ▶) is a variation of the Ramachandran plot that is used to highlight NCS differences between two chains. The Ramachandran plot for two chains of the protein is displayed, with the data points of NCS-related residues in the two chains linked by a line for the top 50 (default) most different ϕ, ψ angles. Long lines in the corresponding figure correspond to significant differences in backbone conformation between the NCS-related chains. 5.3. Incorrect chiral volumes Dictionary definitions of monomers can contain descriptions of chiral centres. The chiral centres are described as ‘positive’, ‘negative’ or ‘both’. Coot can compare the residues in the protein structure to the dictionary and identify outliers. 5.4. Unmodelled blobs The ‘Unmodelled Blobs’ tool finds candidate ligand-binding sites (as described above) without trying to fit a specific ligand. 5.5. Difference-map peaks Difference maps can be searched for positive and negative peaks. The peak list is then sorted on peak height and filtered by proximity to higher peaks (i.e. only peaks that are not close to previous peaks are identified). 5.6. Check/delete waters Waters can be validated using several criteria, including distance from hydrogen-bond donors or acceptors, temperature factor or electron-density level. Waters that do not pass these criteria are identified and presented as a list or automatically deleted. 5.7. Check waters by difference map variance This tool is used to identify waters that have been placed in density that should be assigned to other atoms or molecules. 
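A minimal Python sketch of this check is given here, using the values stated in this section (20 points on each of three spheres with radii 0.5, 1.0 and 1.5 Å, and a reference value of 0.12 e² Å⁻⁶). The density argument stands in for an interpolated difference map and is a hypothetical callable; the Fibonacci-lattice placement of the points and all function names are assumptions, since the text does not say how the points are distributed.

```python
import math


def sphere_points(n):
    """Return n roughly evenly spread unit vectors (Fibonacci lattice, assumed)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for k in range(n):
        z = 1.0 - 2.0 * (k + 0.5) / n
        r = math.sqrt(1.0 - z * z)
        points.append((r * math.cos(golden_angle * k), r * math.sin(golden_angle * k), z))
    return points


def water_variance_sum(density, centre, radii=(0.5, 1.0, 1.5), n_points=20):
    """Sum, over the shells, of the variance of the density sampled on each shell.

    density : callable (x, y, z) -> difference-map value (hypothetical stand-in
              for map interpolation)
    centre  : (x, y, z) position of the water in Å
    """
    cx, cy, cz = centre
    total = 0.0
    for radius in radii:
        values = [
            density(cx + radius * x, cy + radius * y, cz + radius * z)
            for x, y, z in sphere_points(n_points)
        ]
        mean = sum(values) / len(values)
        total += sum((v - mean) ** 2 for v in values) / len(values)
    return total


def is_suspect_water(density, centre, reference=0.12):
    """Flag a water whose surrounding difference density is strongly anisotropic."""
    return water_variance_sum(density, centre) > reference
```

A spherically symmetric difference density gives a summed variance close to zero in this sketch, whereas density that varies with direction around the water position raises it.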
The difference map at each water position is analysed by generating 20 points on each sphere at radii of 0.5, 1.0 and 1.5 Å and the electron-density level at each of these points is found by cubic interpolation. The mean and variance of the density levels is calculated for each set of points. If, for example, a water was misplaced into the density for a glycerol then (given an isotropic density model for the water molecule) the difference map will be anisotropic because there will be unaccounted-for positive density along the bonds to the other atoms in the glycerol. There may also be some negative density in a perpendicular direction as the refinement program tries to compensate for the additional electron density. The variances are summed and compared with a reference value (by default 0.12 e² Å⁻⁶). Note that it only makes sense to run this test on a difference map generated by reciprocal-space refinement (for example, from REFMAC or phenix.refine) that included temperature-factor refinement. 5.8. Geometry analysis The geometry (bonds, angles, planes) for each residue in the selected molecule is compared with dictionary values (typically provided by the mmCIF REFMAC dictionary). Torsion-angle deviations are not analysed (as there are other validation tools for these; see §5.9). The statistic displayed in the geometry graph is the average Z value for each of the geometry terms for that residue (peptide-geometry distortion is shared between neighbouring residues). The tooltip on the geometry graph describes the geometry features giving rise to the highest Z value. 5.9. Peptide ω analysis This is a validation tool for the analysis of peptide ω torsion angles. It produces a graph marking the deviation from 180° of the peptide ω angle. The deviation is assigned to the residue that contains the C and O atoms of the peptide link, thus peptide ω angles of 90° are very poor. 
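The quantity plotted can be illustrated with a short Python sketch that computes ω from the Cα(i), C(i), N(i+1) and Cα(i+1) positions and returns its deviation from 180°. The function names are hypothetical and the torsion is evaluated with the standard atan2 formula; this is not taken from the Coot source.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def torsion_deg(p1, p2, p3, p4):
    """Signed torsion angle p1-p2-p3-p4 in degrees (trans = 180, cis = 0)."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def omega_deviation(ca_i, c_i, n_next, ca_next):
    """Absolute deviation of the peptide omega angle from 180 degrees."""
    omega = torsion_deg(ca_i, c_i, n_next, ca_next)
    return abs((omega % 360.0) - 180.0)
```

With this definition a perfectly trans peptide gives 0°, a twisted peptide with ω = ±90° gives 90° and a cis peptide gives 180°.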
Optionally, ω angles of 0° can be considered ideal (for the case of intentional cis-peptide bonds). 5.10. Temperature-factor variance analysis The variance of the temperature factors for the atoms of each residue is plotted. This is occasionally useful to highlight misbuilt regions. In a badly fitting residue, reciprocal-space refinement will tend to expand the temperature factors of atoms in low or negative density, resulting in a high variance. However, residues with long side chains (e.g. Arg or Lys) often naturally have substantial variance, even though the atoms are correctly placed, which causes ‘noise’ in this graph. This shortcoming will be addressed in future developments. H atoms are ignored in temperature-factor variance analysis. 5.11. Gln and Asn B-factor outliers This is another tool that analyses the results of reciprocal-space refinement. A measure z is computed that is half of the difference of the temperature factor between the NE2 and OE1 atoms (in the case of Gln) divided by the standard deviation of the temperature factors of the remaining atoms in the residue. Our analysis of high-resolution structures has shown that when z is greater than +2.25 there is a more than 90% chance that OE1 and NE2 need to be flipped (P. Emsley, unpublished results). 5.12. Rotamer analysis The rotamer statistics are generated from an analysis of the nearest conformation in the MolProbity rotamer probability distribution (Lovell et al., 2000 ▶) and displayed as a bar chart. The height of the bar in the graph is inversely proportional to the rotamer probability. 5.13. Density-fit analysis The bars in the density-fit graphs are inversely proportional to the average Z-weighted electron density at the atom centres and to the grid sampling of the map (i.e. maps with coarser grid sampling will have lower bars than a more finely gridded map, all other things being equal). 
Accounting for the grid sampling allows lower resolution maps to have an informative density-fit graph without many or most residues being marked as worrisome owing to their atoms being in generally low levels of density. 5.14. Probe clashes ‘Probe Clashes’ is a graphical representation of the output of the MolProbity tools Reduce (Word, Lovell, Richardson et al., 1999 ▶), which adds H atoms to a model (and thereby provides a means of analyzing potential side-chain flips), and Probe (Word, Lovell, LaBean et al., 1999 ▶), which analyses atomic packing. ‘Contact dots’ are generated by Probe and these are displayed in Coot and coloured by the type of interaction. 5.15. NCS differences The graph of noncrystallographic symmetry differences shows the r.m.s. deviation of atoms in residues after the transformation of the selected chain to the reference chain has been applied. This is useful to highlight residues that have unusually large differences in atom positions (the largest differences are typically found in the side-chain atoms). 6. Model analysis 6.1. Geometric measurements Geometric measurements can be performed on the model and displayed in a three-dimensional view using options from the ‘Measures’ menu. These measurements include bond lengths, bond angles and torsion angles, which may be selected by clicking successively on the atoms concerned. It is also possible to measure the distance of an atom to a least-squares plane defined by a set of three or more other atoms. The ‘Environment Distances’ option allows all neighbours within a certain distance of any atom of a chosen residue to be displayed. Distances between polar neighbours are coloured differently to all others. This is particularly useful in the initial analysis of hydrogen bonding. 6.2. Superpositions It is often useful to compare several related molecules which are similar in terms of sequence or fold. 
In order to do this the molecules must be placed in the same position and orientation in space so that the differences may be clearly seen. Two tools are provided for this purpose. (i) SSM superposition (Krissinel & Henrick, 2004 ▶). Secondary Structure Matching (SSM) is a tool for superposing proteins whose fold is related by fitting the secondary-structure elements of one protein to those of the other. This approach is automatic and does not rely on any sequence identity between the two proteins. The superposition may include a complete structure or just a single chain. (ii) LSQ superposition. Least-squares (LSQ) superposition involves finding the rotation and translation which minimizes the distances between corresponding atoms in the two models and therefore depends on having a predefined correspondence between the atoms of the two structures. This approach is very fast but requires that a residue range from one structure be specified and matched to a corresponding residue range in the other structure. 7. Interaction with other programs In addition to the built-in tools, e.g. for refinement and validation, Coot provides interfaces to external programs. For refinement, interfaces to REFMAC and SHELXL are provided. Validation can be accomplished by interaction with the programs Probe and Reduce from the MolProbity suite. Furthermore, interfaces for the production of publication-quality figures are provided by communication with the (molecular) graphics programs CCP4mg, POV-Ray and Raster3D. 7.1. REFMAC Coot provides a dialogue similar to that used in CCP4i for running REFMAC (Murshudov et al., 1997 ▶). REFMAC is a program from the CCP4 suite for maximum-likelihood-based macromolecular refinement. Once a round of interactive model building has finished, the user can choose to use REFMAC to refine the current model. 
Reflections for the refinement are either used from the MTZ file from which the currently displayed map was calculated or can be acquired from a selected MTZ file. Most REFMAC parameters are set as defaults; however, some can be specified in the GUI, such as the number of refinement cycles, twin refinement and the use of NCS. Once REFMAC has terminated, the newly generated (refined) model and MTZ file from which maps are generated are automatically read in (and displayed). If REFMAC detected geometrical outliers at the end of the refinement, an interactive dialogue will be presented with two buttons for each residue containing an outlier: one to centre the view on the residue and the other to carry out real-space refinement. 7.2. SHELXL For high-resolution refinement, SHELXL can be used directly from Coot. A new SHELXL.ins file can be generated from a SHELXL.res file including any manipulations or additions to the model. Additional parameters may be added to the file or it can be edited in a GUI. Once refinement in SHELXL is finished, the refined coordinate file is read in and displayed. The resulting reflections file (.fcf) is converted into an mmCIF file, after which it is read in and the electron density is displayed. An interactive dialogue of geometric outliers (disagreeable restraints and other problems discovered by SHELXL) can be displayed by parsing the .lst output file from SHELXL. 7.3. MolProbity Coot interacts with programs and data from the MolProbity suite in a number of ways, some of which have already been described. In addition, MolProbity can provide Coot with a list of possible structural problems that need to be addressed in the form of a ‘to-do chart’ in either Python or Scheme format; this can be read into Coot (‘Calculate’→‘Scripting…’). 7.4. CCP4mg Coot can write CCP4mg picture-definition files (Potterton et al., 2004 ▶). These files are human-readable and editable and define the scene displayed by CCP4mg. 
Currently, the view and all displayed coordinate models and maps are described in the Coot-generated definition file. Hence, the displayed scene in Coot when saving the file is identical to that in CCP4mg after reading the picture-definition file. For convenience, a button is provided which will automatically produce the picture-definition file and open it in CCP4mg. 7.5. Raster3D/POV-Ray Raster3D (Merritt & Bacon, 1997 ▶) and POV-Ray (Persistence of Vision Pty Ltd, 2004 ▶) are commonly used programs for the production of publication-quality figures in macromolecular crystallography. Coot writes input files for both of these programs to display the current view. These can then be rendered and ray-traced by the external programs either externally or directly within Coot using ‘default’ parameters. The resulting images display molecular models in ball-and-stick representation and electron densities as wire frames. 8. Scripting Most internal functions in Coot are accessible via a SWIG (Simplified Wrapper and Interface Generator) interface to the scripting languages Python (http://www.python.org) and Guile (a Scheme interpreter; Kelsey et al., 1998 ▶; http://www.gnu.org/software/guile/guile.html). Via the same interface, some of Coot’s graphics widgets are available to the scripting layer (e.g. the main menu bar and the main toolbar). The availability of two scripting interfaces allows greater flexibility for the user as well as facilitating the interaction of Coot with other applications. In addition to the availability of Coot’s internal functions, the scripting interface is enriched by a number of provided scripts (usually available in both scripting languages). Some of these scripts use GUIs, either through use of the Coot graphics widgets or via the GTK+2 extensions of the scripting languages. A number of available scripts and functions are made available in an extra ‘Extensions’ menu. 
Scripting not only provides the user with the possibility of running internal Coot functions and scripts but also that of reading and writing their own scripts and customizing the menus. 9. Building and testing When Coot was made available to the public, three initial considerations were that it should be cross-platform, robust and easy to install. These considerations continue to be a challenge. To assist in meeting them, an automated scheduled build-and-test system has been developed, thus enabling almost constant deployment of the pre-release software. The subversion version-control system (http://svnbook.red-bean.com/) is used to manage source-code revisions. An ‘integration machine’ checks out the latest source code several times per hour, compiles the software and makes a source-code tar file. Less frequently, a heterogeneous array of build machines copies the source tar file and compiles it for the host architecture. After a successful build, the software is run against a test suite and only if the tests are passed is the software bundled and made available for download from the web site. All the build and test logs are made available on the Coot web site. Fortunately, users of the pre-release code seem to report problems without undue exasperation. It is the aim of the developers to respond rapidly to such reports. 9.1. Computer operating-system compatibility Coot is released under the GNU General Public License (GPL) and depends upon many other GPL and open-source software components. Coot’s GUI and graphical display are based on rather standard infrastructure, including the X11 windowing system, OpenGL and associated software such as the cross-platform GTK+2 stack derived from the GIMP project. In addition, Coot depends upon open-source crystallographic software components including the Clipper libraries (Cowtan, 2003 ▶), the MMDB library (Krissinel et al., 2004 ▶), the SSM library (Krissinel & Henrick, 2004 ▶) and the CCP4 libraries. 
In principle, Coot and its dependencies can be installed on any modern GNU/Linux or Unix platform without fanfare. A Windows-based version of Coot is also available. 9.2. Coot on GNU/Linux Compiling and installing Coot on the GNU/Linux operating system is probably the most straightforward option. GNU/Linux is in essence a free software/open-source collaborative implementation of the Unix operating system that is compatible with most computer hardware. Coot’s infrastructural dependencies, such as GTK+2 and other GNU libraries, as well as all of its crystallographic software dependencies, were selected with portability in mind. Most of the required dependencies are either installed with the GNOME desktop or are readily available for installation via the package-management systems specific to each distribution. It is possible that in future Coot (along with all its dependencies) will be made available via the official package-distribution systems for several of the major GNU/Linux distributions. When an end-user chooses to install the Coot package, all of Coot’s required dependencies will be installed along with it in a simple and painless procedure. An official Coot package currently exists in the Gentoo distribution (maintained by Donnie Berkholz), a Fedora package (maintained by Tim Fenn) is under development at the time of writing and unofficial Debian and rpm Coot packages are also available. Binary Coot releases for the most popular GNU/Linux platforms are available from the Coot website: http://www.ysbl.york.ac.uk/~emsley/software/binaries/. Additional information on installing Coot on GNU/Linux, either as a pre-compiled binary or from source code, is available on the Coot wiki: http://strucbio.biologie.uni-konstanz.de/ccp4wiki/index.php/COOT. 9.3. Coot on Apple’s Mac OS X With the release of Apple’s Mac OS X, a Unix-based operating system, it became possible to use most if not all of the standard crystallographic software on Apple computers. 
OS X does not natively use the X11 windowing system, but rather a proprietary windowing technology called Quartz. This system has some benefits over X11, but does not support X11-based Unix software. However, the X11 windowing system can be run within OS X (in rootless mode) and as of OS X version 10.5 this has become a default option and operates in a reasonably seamless manner. Unlike GNU/Linux, Apple does not provide the X11-based dependencies (GTK+2, GNOME libraries) and many of the other open-source components required to install and run Coot. However, third-party package-management systems have appeared to fill this gap, having made it their mission to port essentially all of the most important software that is freely available to users of other Unix-based systems to OS X. The two most popular package-management systems are Fink and MacPorts. Of these, Fink makes available a larger collection of software that is of use to scientists, including a substantial collection of crystallographic software. For that reason, Fink has been adopted as the preferred option for installing Coot on Mac OS X. Fink uses many of the same software tools as the Debian GNU/Linux package-management system and provides a convenient front-end. In practice, this requires the end user to do three things in preparation for installing Coot under OS X. (i) Install Apple’s X-code Developer tools. This is a free gigabyte-sized download available from Apple. (ii) Install the very latest version of X11. This is crucial, as many bug fixes are required to run Coot. (iii) Install the third-party package-management system Fink and enable the ‘unstable’ software tree to obtain access to the latest software. Coot may then be installed through Fink with the command fink install coot. 9.4. Coot on Microsoft Windows Since Microsoft Windows operating systems are the most widely used computer platform, a Coot version which runs on Microsoft Windows has been made available (WinCoot). 
All of Coot’s dependencies compile readily on Windows systems (although some require small adjustments) or are available as GPL/open-source binary downloads. The availability of GTK+2 (dynamically linked) libraries (DLLs) for Windows makes it possible to compile Coot without the requirement of the X11 windowing system, which would depend on an emulation layer (e.g. Cygwin). Some minor adjustments to Coot itself were necessary owing to differences in operating-system architecture, e.g. the filesystem (Lohkamp et al., 2005 ▶). Currently WinCoot, by default, only uses Python as a scripting language since the Guile GTK+2 extension module is not seen as robust enough on Windows. WinCoot binaries are, as for GNU/Linux systems, automatically built and tested on a regular basis. The program is executed using a batch script and has been shown to work on Windows 98, NT, 2000, XP and Vista. WinCoot binaries (stable as well as pre-releases) are available as a self-extracting file from http://www.ysbl.york.ac.uk/~lohkamp/coot/. 10. Discussion Coot tries to combine modern methods in macromolecular model building and validation with concerns about a modern GUI application such as ease of use, productivity, aesthetics and forgiveness. This is an ongoing process and although improvements can still be made, we believe that Coot has an easy-to-learn intuitive GUI combined with a high level of crystallographic awareness, providing useful tools for the novice and experienced alike. However, Coot has a number of limitations: NCS-averaged maps are poorly implemented, being meaningful only over a limited part of the unit cell (or crystal). There is also a mismatch in symmetry when using maps from cryo-EM data (Coot incorrectly applies crystal symmetry to EM maps). Coot is not at all easy to compile, having many dependencies: this is a problem for developers and advanced users. 10.1. Future Coot is under constant development. New features and bug fixes are added on an almost daily basis. 
It is anticipated that further tools will be added for validation, nucleotide and carbohydrate model building, as well as for refinement. Interactive model building will be enhanced by communication with the CCP4 database, use of annotations and an interactive notebook and by adding annotation representation into the validation graphs. The embedded scripting languages provide the potential for sophisticated communication with model-building tools such as Buccaneer, ARP/wARP and PHENIX; in future this may be extended to include density modification as well. In the longer term tools to handle EM maps are planned, including the possibility of building and refining models. The appropriate data structures are already implemented in the Clipper libraries but are not yet available in Coot. The integration of validation tools will be expanded, especially with respect to MolProbity, and an interface to the WHAT_CHECK validation program (Hooft et al., 1996 ▶) will be added. WHAT_CHECK provides machine-readable output and this can be read by Coot to provide both an interactive description and navigation as well as (requiring more work) a mode to automatically fix up problematic geometry. Note added in proof: Ian Tickle has noted a potential problem with the calculation of χ² values resulting from real-space refinement. Coot will be reworked to instead represent the r.m.s. deviation from ideality of each of the geometrical terms.


Phaser crystallographic software

Airlie J. McCoy, Ralf W. Grosse-Kunstleve, Paul D. Adams … (2007)

1. Introduction Improved crystallographic methods rely on both improved automation and improved algorithms. The software handling one part of structure solution must be automatically linked to software handling parts upstream and downstream of it in the structure solution pathway with (ideally) no user input, and the algorithms implemented in the software must be of high quality, so that the branching or termination of the structure solution pathway is minimized or eliminated. Automation allows all the choices in structure solution to be explored where the patience and job-tracking abilities of users would be exhausted, while good algorithms give solutions for poorer models, poorer data or unfavourable crystal symmetry. Both forms of improvement are essential for the success of high-throughput structural genomics (Burley et al., 1999 ▶). Macromolecular phasing by either of the two main methods, molecular replacement (MR) and experimental phasing, which includes the technique of single-wavelength anomalous dispersion (SAD), are key parts of the structure solution pathway that have potential for improvement in both automation and the underlying algorithms. MR and SAD are good phasing methods for the development of structure solution pipelines because they only involve the collection of a single data set from a single crystal and have the advantage of minimizing the effects of radiation damage. Phaser aims to facilitate automation of these methods through ease of scripting, and to facilitate the development of improved algorithms for these methods through the use of maximum likelihood and multivariate statistics. Other software shares some of these features. For molecular replacement, AMoRe (Navaza, 1994 ▶) and MOLREP (Vagin & Teplyakov, 1997 ▶) both implement automation strategies, though they lack likelihood-based scoring functions. Likelihood-based experimental phasing can be carried out using Sharp (La Fortelle & Bricogne, 1997 ▶). 2. 
Algorithms The novel algorithms in Phaser are based on maximum likelihood probability theory and multivariate statistics rather than the traditional least-squares and Patterson methods. Phaser has novel maximum likelihood phasing algorithms for the rotation functions and translation functions in MR and the SAD function in experimental phasing, but also implements other non-likelihood algorithms that are critical to success in certain cases. Summaries of the algorithms implemented in Phaser are given below. For completeness and for consistency of notation, some equations given elsewhere are repeated here. 2.1. Maximum likelihood Maximum likelihood is a branch of statistical inference that asserts that the best model on the evidence of the data is the one that explains what has in fact been observed with the highest probability (Fisher, 1922 ▶). The model is a set of parameters, including the variances describing the error estimates for the parameters. The introduction of maximum likelihood estimators into the methods of refinement, experimental phasing and, with Phaser, MR has substantially increased success rates for structure solution over the methods that they replaced. A set of thought experiments with dice (McCoy, 2004 ▶) demonstrates that likelihood agrees with our intuition and illustrates the key concepts required for understanding likelihood as it is applied to crystallography. The likelihood of the model given the data is defined as the probability of the data given the model. Where the data have independent probability distributions, the joint probability of the data given the model is the product of the individual distributions. In crystallography, the data are the individual reflection intensities. These are not strictly independent, and indeed the statistical relationships resulting from positivity and atomicity underlie direct methods for small-molecule structures (reviewed by Giacovazzo, 1998 ▶). 
For macromolecular structures, these direct-methods relationships are weaker than effects exploited by density modification methods (reviewed by Kleywegt & Read, 1997 ▶); the presence of solvent means that the molecular transform is over-sampled, and if there is noncrystallographic symmetry then other correlations are also present. However, the assumption of independence is necessary to make the problem tractable and works well in practice. To avoid the numerical problems of working with the product of potentially hundreds of thousands of small probabilities (one for each reflection), the log of the likelihood is used. This has a maximum at the same set of parameters as the original function. Maximum likelihood also has the property that if the data are mathematically transformed to another function of the parameters, then the likelihood optimum will occur at the same set of parameters as the untransformed data. Hence, it is possible to work with either the structure-factor intensities or the structure-factor amplitudes. In the maximum likelihood functions in Phaser, the structure-factor amplitudes (Fs), or normalized structure-factor amplitudes (Es, which are Fs normalized so that the mean-square values are 1) are used. The crystallographic phase problem means that the phase of the structure factor is not measured in the experiment. However, it is easiest to derive the probability distributions in terms of the phased structure factors and then to eliminate the unknown phase by integration, a process known as integrating out a nuisance variable (the nuisance variable being the introduced phase of the observed structure factor, or equivalently the phase difference between the observed structure factor and its expected value). 
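The numerical reason for working with the log-likelihood can be illustrated with a minimal sketch. The per-reflection probabilities below are invented purely for illustration (they are not taken from Phaser), and the function names are hypothetical.

```python
import math


def likelihood_product(probs):
    """Direct product of per-reflection probabilities (underflows for many terms)."""
    result = 1.0
    for p in probs:
        result *= p
    return result


def log_likelihood(probs):
    """Sum of log-probabilities; it has its maximum at the same parameters."""
    return sum(math.log(p) for p in probs)


# Hypothetical data set: 100 000 reflections, each with probability 0.01.
probs = [0.01] * 100_000

product = likelihood_product(probs)   # underflows to exactly 0.0 in double precision
log_sum = log_likelihood(probs)       # remains a finite, usable number

print(product)
print(log_sum)
```

Because the logarithm is monotonic, ranking models by the summed log-probabilities gives the same ordering as ranking by the product, without the underflow.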
The central limit theorem applies to structure factors, which are sums of many small atomic contributions, so the probability distribution for an acentric reflection, $\mathbf{F}_O$, given its expected value $\langle\mathbf{F}_O\rangle$, is a two-dimensional Gaussian with variance Σ centred on $\langle\mathbf{F}_O\rangle$. (Note that here and in the following, bold font is used to represent complex or signed structure factors, and italics to represent their amplitudes.) In applications to molecular replacement and structure refinement, $\langle\mathbf{F}_O\rangle$ is the structure factor calculated from the model ($\mathbf{F}_C$) multiplied by a fraction D (where 0 [...] R, H = 0. The atoms are taken to be of equal mass. The eigenvalues λ and eigenvectors U of H can then be calculated. The eigenvalues are directly proportional to the squares of the vibrational frequencies of the normal modes, the lowest eigenvalues thus giving the lowest-frequency normal modes. Six of the eigenvalues will be zero, corresponding to the six degrees of freedom for a rotation and translation of the entire structure. For all but the smallest proteins, eigenvalue decomposition of the all-atom Hessian is not computationally feasible with current computer technology. Various methods have been developed to reduce the size of the eigenvalue problem. Bahar et al. (1997 ▶) and Hinsen (1998 ▶) have shown that it is possible to find the lowest frequency normal modes of proteins in the elastic network model by considering amino acid Cα atoms only. However, this merely postpones the computational problem until the proteins are an order of magnitude larger. The problem is solved for any size protein with the rotation–translation block (RTB) approach (Durand et al., 1994 ▶; Tama et al., 2000 ▶), where the protein is divided into blocks of atoms and the rotation and translation modes for each block are used to project the full Hessian into a lower dimension. The projection matrix is a block-diagonal matrix of dimensions 3N × 3N. 
Each of the NB block matrices $P_{nb}$ has dimensions $3N_{nb} \times 6$, where $N_{nb}$ is the number of atoms in the block $nb$. For atom $j$ in block $nb$ displaced by $(x, y, z)$ from the centre of mass of the block, the 3 × 6 matrix $P_{nb,j}$ is, up to normalization and sign convention,

$$P_{nb,j} = \begin{pmatrix} 1 & 0 & 0 & 0 & z & -y \\ 0 & 1 & 0 & -z & 0 & x \\ 0 & 0 & 1 & y & -x & 0 \end{pmatrix}.$$

The first three columns of the matrix contain the infinitesimal translation eigenvectors of the block and the last three columns contain the infinitesimal rotation eigenvectors of the block. The orthogonal basis Q of $P_{nb}$ is then found by QR decomposition:

$$P_{nb} = Q_{nb} R_{nb},$$

where $Q_{nb}$ is a $3N_{nb} \times 6$ orthogonal matrix and $R_{nb}$ is a 6 × 6 upper triangular matrix. H can then be projected into the subspace spanned by the translation/rotation basis vectors of the blocks:

$$H_P = Q^{T} H Q,$$

where $Q$ is the block-diagonal matrix assembled from the $Q_{nb}$. The eigenvalues $\lambda_P$ and eigenvectors $U_P$ of the projected Hessian are then found. The RTB method is able to restrict the size of the eigenvalue problem for any size of protein with the inclusion of an appropriately large $N_{nb}$ for each block. In the implementation of the RTB method in Phaser, $N_{nb}$ for each block is set for each protein such that the total size of the eigenvalue problem is restricted to a matrix $H_P$ of maximum dimensions 750 × 750. This enables the eigenvalue problem to be solved in a matter of minutes with current computing technology. The eigenvectors of the translation/rotation subspace can then be expanded back to the atomic space (dimensions of U are N × N):

$$U = Q U_P.$$

As for the decomposition of the full Hessian H, the eigenvalues are directly proportional to the squares of the vibrational frequencies of the normal modes, the lowest eigenvalues thus giving the lowest-frequency normal modes. Although the eigenvalues and eigenvectors generated from decomposition of the full Hessian and using the RTB approach will diverge with increasing frequency, the RTB approach is able to model with good accuracy the lowest frequency normal modes, which are the modes of interest for looking at conformational difference in proteins. The all-atom, Cα only and RTB normal-mode analysis methods are implemented in Phaser. 
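The RTB projection can be sketched in a few lines of pure Python. This is an illustrative sketch under stated assumptions, not Phaser's implementation: atoms have equal mass, the rotation columns use the cross-product sign convention, Gram-Schmidt orthonormalisation stands in for the QR step, and an identity matrix stands in for a real Hessian. All function names and coordinates are hypothetical.

```python
import math


def block_projection(coords):
    """3*N_nb x 6 matrix for one block: 3 translation + 3 rotation columns."""
    n = len(coords)
    centre = [sum(c[k] for c in coords) / n for k in range(3)]
    rows = []
    for c in coords:
        x, y, z = (c[k] - centre[k] for k in range(3))
        rows.append([1.0, 0.0, 0.0, 0.0, z, -y])
        rows.append([0.0, 1.0, 0.0, -z, 0.0, x])
        rows.append([0.0, 0.0, 1.0, y, -x, 0.0])
    return rows


def orthonormal_basis(p):
    """Orthonormalise the columns of p (modified Gram-Schmidt, i.e. the Q of QR)."""
    n_rows, n_cols = len(p), len(p[0])
    q_cols = []
    for j in range(n_cols):
        v = [p[i][j] for i in range(n_rows)]
        for q in q_cols:
            dot = sum(a * b for a, b in zip(q, v))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        q_cols.append([a / norm for a in v])
    return [[q_cols[j][i] for j in range(n_cols)] for i in range(n_rows)]


def assemble_q(blocks):
    """Block-diagonal Q with six columns per block."""
    q_blocks = [orthonormal_basis(block_projection(b)) for b in blocks]
    n_rows = sum(len(q) for q in q_blocks)
    big = [[0.0] * (6 * len(q_blocks)) for _ in range(n_rows)]
    row0 = 0
    for b, q in enumerate(q_blocks):
        for i, row in enumerate(q):
            big[row0 + i][6 * b:6 * b + 6] = row
        row0 += len(q)
    return big


def project(h, q):
    """Projected Hessian H_P = Q^T H Q."""
    n, m = len(q), len(q[0])
    hq = [[sum(h[i][k] * q[k][j] for k in range(n)) for j in range(m)]
          for i in range(n)]
    return [[sum(q[k][i] * hq[k][j] for k in range(n)) for j in range(m)]
            for i in range(m)]


# Two hypothetical blocks of four non-coplanar atoms each.
blocks = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)],
    [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0), (5.0, 6.5, 5.0), (5.0, 5.0, 7.0)],
]
q = assemble_q(blocks)
n_coords = len(q)  # 3 coordinates per atom
# Identity matrix as a stand-in Hessian, so H_P should equal Q^T Q = I.
identity = [[1.0 if i == j else 0.0 for j in range(n_coords)]
            for i in range(n_coords)]
h_p = project(identity, q)
print(len(h_p), len(h_p[0]))
```

With a real elastic-network Hessian in place of the identity, the eigenvectors of `h_p` would be expanded back to atomic space by multiplying with `q`. The point of the sketch is that the eigenvalue problem shrinks from 3N to six dimensions per block.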
After normal-mode analysis, n normal modes can be used to generate $2^n - 1$ (nonzero) combinations of normal modes. Phaser allows the user to specify the r.m.s. deviation between model and target desired by the perturbation, and the fraction dq of the displacement vector for each mode combination corresponding to each model combination is then used to generate the models. Large r.m.s. deviations will cause the geometry of the model to become distorted. Phaser reports when the model becomes so distorted that there are Cα clashes in the structure. 2.4. Packing function The packing of potential solutions in the asymmetric unit is not inherently part of the translation function. It is therefore possible that an arrangement of models has a high log-likelihood gain, although the models may overlap and therefore be physically unreasonable. The packing of the solutions is checked using a clash test using a subset of the atoms in the structure: the ‘trace’ atoms. For proteins, the trace atoms are the Cα positions, spaced at 3.8 Å. For nucleic acid, the phosphate and C atoms in the ribose-phosphate backbone and the N atoms of the bases are selected as trace atoms. These atoms are also spaced at about 3.8 Å, so that the density of trace atoms in nucleic acid is similar to that of proteins, which makes the number of protein–protein, protein–nucleic acid and nucleic acid–nucleic acid clashes comparable where there is a mixed protein–nucleic acid structure. For the clash test, the number of trace atoms from another model within a given distance (default 3 Å) is counted. The clash test includes symmetry-related copies of the model under consideration, other components in the asymmetric unit and their symmetry-related copies. If the search model has a low sequence identity with the target, or has large flexible loops that could adopt an alternative conformation, the number of clashes may be expected to be nonzero. 
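The core of the clash test can be sketched as a simple distance count. This is a simplified illustration that ignores symmetry-related copies and uses a brute-force loop; the function name and coordinates are hypothetical, and only the 3 Å default cut-off comes from the text.

```python
import math


def count_clashes(placed_trace, other_trace, cutoff=3.0):
    """Count trace atoms of other_trace lying within cutoff (in Å) of any placed trace atom."""
    clashes = 0
    for atom in other_trace:
        if any(math.dist(atom, ref) < cutoff for ref in placed_trace):
            clashes += 1
    return clashes


# Hypothetical Cα traces (Å); the first model is a straight 3.8 Å-spaced chain.
model_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model_b = [(0.0, 2.0, 0.0), (3.8, 5.0, 0.0), (20.0, 0.0, 0.0)]

n_clashes = count_clashes(model_a, model_b)
print(n_clashes)
```

A full implementation would also generate the symmetry-related copies of each model before counting and would normally use a spatial grid rather than comparing all pairs.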
By default the best packing solutions are carried forward, although a specific number of allowed clashes may also be given as the cut-off for acceptance. However, it is better to edit models before use so that structurally nonconserved surface loops are excluded, as they will only contribute noise to the rotation and translation functions. Where an ensemble of structures is used as the model, the highest homology model is taken as the template for the packing search. Before this model is used, the trace atom positions are edited to take account of large conformational differences between the models in the ensemble. Equivalent trace atom positions are compared and if the coordinates deviate by more than 3 Å then the template trace atom is deleted. Thus, use of an ensemble not only improves signal to noise in the maximum likelihood search functions, it also improves the discrimination of possible solutions by the packing function. 2.5. Minimizer Minimization is used in Phaser to optimize the parameters against the appropriate log-likelihood function in the anisotropy correction, in MR (refines the position and orientation of a rigid-body model) and in SAD phasing. The same minimizer code is used for all three applications and has been designed to be easily extensible to other applications. The minimizer for the anisotropy correction uses Newton’s method, while MR and SAD use the standard Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm. Both minimization methods in Phaser include a line search. The line search algorithm is a basic iterative method for finding the local minimum of a target function f. Starting at parameters $\mathbf{x}$, the algorithm finds the minimum (within a convergence tolerance) of $f(\mathbf{x} + \gamma\mathbf{d})$ by varying γ, where γ is the step distance along a descent direction $\mathbf{d}$. Newton’s method and the BFGS algorithm differ in the determination of the descent direction $\mathbf{d}$ that is passed to the line search, and thus the speed of convergence. 
Within one cycle of the line search (where there is no change in $\mathbf{d}$) the trial step distances γ are chosen using the golden section method. The golden ratio ($5^{1/2}/2 + 1/2$) divides a line so that the ratio of the larger part to the total is the same as the ratio of the smaller to larger. The method makes no assumptions about the function’s behaviour; in particular, it does not assume that the function is quadratic within the bracketed section. If this assumption were made, the line search could proceed via parabolic interpolation. Newton’s method uses the Hessian matrix H of second derivatives and the gradient $\mathbf{g}$ at the initial set of parameters $\mathbf{x}_0$ to find the values of the parameters at the minimum, $\mathbf{x}_{\min} = \mathbf{x}_0 - H^{-1}\mathbf{g}$. If the function is quadratic in $\mathbf{x}$ then Newton’s method will find the minimum in one step, but if not, iteration is required. The method requires the inversion of the Hessian matrix, which, for large matrices, consumes a large amount of computational time and memory resources. The eigenvalues of the Hessian need to be positive for the function to be at a minimum, rather than a maximum or saddle point, since the method converges to any point where the gradient vector is zero. When used with the anisotropy correction, the full Hessian matrix is calculated analytically. The BFGS algorithm is one of the most powerful minimization methods when calculation of the full Hessian using analytic or finite difference methods is very computationally intensive. At every step, the gradient search vector is analysed to build up an approximate Hessian matrix H, in order to make the resulting search vector direction $\mathbf{d}$ better than the original gradient vector direction. In the ‘pure’ form of the BFGS algorithm, the method is started with matrix H equal to the identity matrix. The off-diagonal elements of the Hessian, the mixed second derivatives (i.e. $\partial^2 LL/\partial p_i\,\partial p_j$) are thus initially zero. 
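The golden-section line search described above can be sketched as follows. This is a minimal sketch rather than Phaser's code: it assumes that the minimum has already been bracketed in the interval [0, `gamma_max`] and that the function is unimodal there, and the names `golden_section`, `line_search` and `gamma_max` are illustrative.

```python
import math

INV_PHI = (math.sqrt(5.0) - 1.0) / 2.0  # reciprocal of the golden ratio, about 0.618


def golden_section(f, lo, hi, tol=1e-6):
    """Minimise a unimodal one-dimensional function f on the bracket [lo, hi]."""
    c = hi - INV_PHI * (hi - lo)
    d = lo + INV_PHI * (hi - lo)
    fc, fd = f(c), f(d)
    while hi - lo > tol:
        if fc < fd:
            # Minimum lies in [lo, d]; the old c becomes the new upper trial point.
            hi, d, fd = d, c, fc
            c = hi - INV_PHI * (hi - lo)
            fc = f(c)
        else:
            # Minimum lies in [c, hi]; the old d becomes the new lower trial point.
            lo, c, fc = c, d, fd
            d = lo + INV_PHI * (hi - lo)
            fd = f(d)
    return 0.5 * (lo + hi)


def line_search(f, x, d, gamma_max=2.0):
    """Find the step gamma in [0, gamma_max] that minimises f(x + gamma * d)."""
    def along_direction(gamma):
        return f([xi + gamma * di for xi, di in zip(x, d)])
    return golden_section(along_direction, 0.0, gamma_max)


# Hypothetical target with its minimum at (1, -2); search from the origin along d.
def target(p):
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2


gamma = line_search(target, [0.0, 0.0], [1.0, -2.0])
print(gamma)
```

Each iteration reuses one of the two previous function values, so only one new evaluation of the target is needed per step. No assumption about the shape of the function inside the bracket is made beyond unimodality.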
As the BFGS cycle proceeds, the off-diagonal elements become nonzero using information derived from the gradient. However, in Phaser, the matrix H is not the identity but rather is seeded with diagonal elements equal to the second derivatives of the log-likelihood target function (LL) with respect to the parameters ($p_i$) (i.e. $\partial^2 LL/\partial p_i^2$, or curvatures), the values found in the ‘true’ Hessian. For the SAD refinement the diagonal elements are calculated analytically, but for the MR refinement the diagonal elements are calculated by finite difference methods. Seeding the Hessian with the diagonal elements dramatically accelerates convergence when the parameters are on different scales; when an identity matrix is used, the parameters on a larger scale can fail to shift significantly because their gradients tend to be smaller, even though the necessary shifts tend to be larger. In the inverse Hessian, small curvatures for parameters on a large scale translate into large scale factors applied to the corresponding gradient terms. If any of these curvature terms are negative (as may happen when the parameters are far from their optimal values), the matrix is not positive definite. Such a situation is corrected by using problem-specific information on the expected relative scale of the parameters from the ‘large-shift’ variable, as discussed below in §2.5.1. In addition to the basic minimization algorithms, the minimizer incorporates the ability to bound, constrain, restrain and reparameterize variables, as discussed in detail below. Bounds must be applied to prevent parameters becoming nonphysical, constraints effectively reduce the number of parameters, restraints are applied to include prior probability information, and reparameterization of variables makes the parameter space more quadratic and improves the performance of the minimizer. 2.5.1. 
Problem-specific parameter scaling information When a function is defined for minimization in Phaser, information must be provided on the relative scales of the parameters of that function, through a ‘large-shifts’ variable. As its name implies, the variable defines the size of a parameter shift that would be considered ‘large’ for each parameter. The ratios of these large-shift values thus specify prior knowledge about the relative scales of the different parameters for each problem. Suitable large-shift values are found by a combination of physical insight (e.g. the size of a coordinate shift considered to be large will be proportional to d min for the data set) and numerical simulations, studying the behaviour of the likelihood function as parameters are varied systematically in a variety of test cases. The large-shifts information is used in two ways. Firstly, it is used to prevent the line search from taking an excessively large step, which can happen if the estimated curvature for a parameter happens to be too small and can lead to the refinement becoming numerically unstable. If the initial step for a line search would change any parameter by more than its large-shift value, the initial step is scaled down. Secondly, it is used to provide relative scale information to correct negative curvature values. Parameters with positive curvatures are used to define the average relationship between the large-shift values and the curvatures, which can then be used to compute appropriate curvature values for the parameters with negative curvatures. This stabilizes the refinement until it is sufficiently close to the minimum that all curvatures become positive. 2.5.2. Reparameterization Second-order minimization algorithms in effect assume that, at least in the region around the minimum, the function can be approximated as a quadratic. Where this assumption holds, the minimizer will converge faster. 
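The two uses of the large-shift values described in §2.5.1 might be sketched as below. The text does not give the functional form of the "average relationship" between large-shift values and curvatures; this sketch assumes that curvature scales as the inverse square of the large-shift value, which is an assumption made here for illustration only. The function names are hypothetical.

```python
def limit_step(step, large_shifts):
    """Scale the whole step down so no parameter moves by more than its large-shift value."""
    worst = max(abs(s) / ls for s, ls in zip(step, large_shifts))
    scale = 1.0 / worst if worst > 1.0 else 1.0
    return [s * scale for s in step]


def repair_curvatures(curvatures, large_shifts):
    """Replace non-positive curvatures using the parameters with positive curvature.

    Assumption (not from the source): curvature_i * large_shift_i**2 is roughly
    the same constant for all parameters, estimated as an average over the
    parameters whose curvature is positive.
    """
    products = [c * ls ** 2 for c, ls in zip(curvatures, large_shifts) if c > 0.0]
    average = sum(products) / len(products) if products else 1.0
    return [c if c > 0.0 else average / ls ** 2
            for c, ls in zip(curvatures, large_shifts)]


# Hypothetical values: the second parameter has a negative curvature.
limited = limit_step([2.0, 0.1], [1.0, 1.0])
repaired = repair_curvatures([4.0, -1.0, 1.0], [0.5, 1.0, 1.0])
print(limited)
print(repaired)
```

Scaling the whole step by a single factor keeps the search direction unchanged while capping its length. The repaired curvatures keep the seeded matrix positive definite until the refinement is close enough to the minimum for all true curvatures to be positive.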
It is therefore advantageous to use functions of the parameters being minimized so that the target function is more quadratic in the new parameter space than in the original parameter space (Edwards, 1992 ▶). For example, atomic B factors tend to converge slowly to their refined values because the B factor appears in the exponential term in the structure-factor equation. Although any function of the parameters can be used for this purpose, we have found that taking the logarithm of a parameter is often the most effective reparameterization operation (not only for the B factors): $x' = \ln(x + x_{\mathrm{offset}})$. The offset $x_{\mathrm{offset}}$ is chosen so that the value of x′ does not become undefined for allowed values of x, and to optimize the quadratic nature of the function in x′. For instance, atomic B factors are reparameterized using an offset of 5 Å², which allows the B factors to approach zero and also has the physical interpretation of accounting roughly for the width of the distribution of electrons for a stationary atom. 2.5.3. Bounds Bounds on the minimization are applied by setting upper and/or lower limits for each variable where required (e.g. occupancy minimum set to zero). If a parameter reaches a limit during a line search, that line search is terminated. In subsequent line searches, the gradient of that parameter is set to zero whenever the search direction would otherwise move the parameter outside of its bounds. Multiplying the gradient by the step size thus does not alter the value of the parameter at its limit. The parameter will remain at its limit unless calculation of the gradient in subsequent cycles of minimization indicates that the parameter should move away from the boundary and into the allowed range of values. 2.5.4. Constraints Space-group-dependent constraints apply to the anisotropic tensor applied to ΣN in the anisotropic diffraction correction. Atoms on special positions also have constraints on the values of their anisotropic tensor. 
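The logarithmic reparameterization of §2.5.2 can be sketched for the B-factor case. The 5 Å² offset is the value quoted in the text; the function names are hypothetical, and the gradient conversion is simply the chain rule applied to the assumed form x′ = ln(x + offset).

```python
import math

B_OFFSET = 5.0  # Å^2, the offset quoted in the text for atomic B factors


def to_log_space(b, offset=B_OFFSET):
    """Reparameterize: x' = ln(x + offset); defined even when b is zero."""
    return math.log(b + offset)


def from_log_space(b_prime, offset=B_OFFSET):
    """Invert the reparameterization: x = exp(x') - offset."""
    return math.exp(b_prime) - offset


def gradient_in_log_space(df_db, b, offset=B_OFFSET):
    """Chain rule: df/dx' = df/dx * dx/dx' = df/dx * (x + offset)."""
    return df_db * (b + offset)


b_prime = to_log_space(20.0)
round_trip = from_log_space(b_prime)
grad_prime = gradient_in_log_space(2.0, 20.0)
print(b_prime, round_trip, grad_prime)
```

The minimizer then works entirely in x′. Only the target function and its derivatives need to be converted, which is why the technique can be applied to any parameter without changing the minimization algorithm itself.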
The anisotropic displacement ellipsoid must remain invariant under the application of each symmetry operator of the space group or site-symmetry group, respectively (Giacovazzo, 1992 ▶; Grosse-Kunstleve & Adams, 2002 ▶). These constraints reduce the number of parameters by either fixing some values of the anisotropic B factors to zero or setting some sets of B factors to be equal. The derivatives in the gradient and Hessian must also be constrained to reflect the constraints in the parameters. 2.5.5. Restraints Bayes’ theorem describes how the probability of the model given the data is related to the likelihood and gives a justification for the use of restraints on the parameters of the model:

$$P(\text{model} \mid \text{data}) = \frac{P(\text{data} \mid \text{model})\,P(\text{model})}{P(\text{data})}.$$

If the probability of the data is taken as a constant, then

$$P(\text{model} \mid \text{data}) \propto P(\text{data} \mid \text{model})\,P(\text{model}),$$

where P(model) is called the prior probability. When the logarithm of the above equation is taken,

$$\ln P(\text{model} \mid \text{data}) = \ln P(\text{data} \mid \text{model}) + \ln P(\text{model}) + \text{constant}.$$

Prior probability is therefore introduced into the log-likelihood target function by the addition of terms. If parameters of the model are assumed to have independent Gaussian probability distributions, then the Bayesian view of likelihood will lead to the addition of least-squares terms and hence least-squares restraints on those parameters, such as the least-squares restraints applied to bond lengths and bond angles in typical macromolecular structure refinement programs. In Phaser, least-squares terms are added to restrain the B factors of atoms to the Wilson B factor in SAD refinement, and to restrain the anisotropic B factors to being more isotropic (the ‘sphericity’ restraint). A similar sphericity restraint is used in SHELXL (Sheldrick, 1995 ▶) and in REFMAC5 (Murshudov et al., 1999 ▶). 3. Automation Phaser is designed as a large set of library routines grouped together and made available to users as a series of applications, called modes. The routine-groupings in the modes have been selected mainly on historical grounds; they represent traditional steps in the structure solution pipeline. 
There are 13 such modes in total: ‘anisotropy correction’, ‘cell content analysis’, ‘normal-mode analysis’, ‘ensembling’, ‘fast rotation function’, ‘brute rotation function’, ‘fast translation function’, ‘brute translation function’, ‘log-likelihood gain’, ‘rigid-body refinement’, ‘single-wavelength anomalous dispersion’, ‘automated molecular replacement’ and ‘automated experimental phasing’. The ‘automated molecular replacement’ and ‘automated experimental phasing’ modes are particularly powerful and aim to automate fully structure solution by MR and SAD, respectively. Aspects of the decision making within the modes are under user input control. For example, the ‘fast rotation function’ mode performs the ensembling calculation, then a fast rotation function calculation and then rescores the top solutions from the fast search with a brute rotation function. There are three possible fast rotation function algorithms and two possible brute rotation functions to choose from. There are four possible criteria for selecting the peaks in the fast rotation function for rescoring with the brute rotation function, and for selecting the results from the rescoring for output. Alternatively, the rescoring of the fast rotation function with the brute rotation function can be turned off to produce results from the fast rotation function only. Other modes generally have fewer routines but are designed along the same principles (details are given in the documentation). 3.1. Automated molecular replacement Most structures that can be solved by MR with Phaser can be solved using the ‘automated molecular replacement’ mode. The flow diagram for this mode is shown in Fig. 1 ▶. 
The search strategy automates four search processes: those for multiple components in the asymmetric unit, for ambiguity in the hand of the space group and/or other space groups in the same point group, for permutations in the search order for components (when there are multiple components), and for finding the best model when there is more than one possible model for a component. 3.1.1. Multiple components of asymmetric unit Where there are many models to be placed in the asymmetric unit, the signal from the placement of the first model may be buried in noise and the correct placement of this first model only found in the context of all models being placed in the asymmetric unit. One way of tackling this problem has been to use stochastic methods to search the multi-dimensional space (Chang & Lewis, 1997 ▶; Kissinger et al., 1999 ▶; Glykos & Kokkinidis, 2000 ▶). However, we have chosen to use a tree-search-with-pruning approach, where a list of possible placements of the first (and subsequent) models is kept until the placement of the final model. This tree-search-with-pruning search strategy can generate very branched searches that would be challenging for users to negotiate by running separate jobs, but becomes trivial with suitable automation. The search strategy exploits the strength of the maximum likelihood target functions in using prior information in the search for subsequent components in the asymmetric unit. The tree-search-with-pruning strategy is heavily dependent on the criteria used for selecting the peaks that survive to the next round. Four selection criteria are available in Phaser: selection by percentage difference between the top and mean log-likelihood of the search, selection by Z score, selection by number of peaks, and selection of all peaks. The default is selection by percentage, with the default percentage set at 75%. 
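Selection by percentage can be sketched as follows. The sketch assumes that a peak survives when its score exceeds the mean by at least the given percentage of the difference between the top score and the mean; this reading of the criterion and the function name are assumptions made for illustration, and the scores are invented.

```python
def select_by_percentage(scores, percent=75.0):
    """Keep peaks scoring at least mean + percent% of (top - mean)."""
    top = max(scores)
    mean = sum(scores) / len(scores)
    cutoff = mean + (percent / 100.0) * (top - mean)
    return [s for s in scores if s >= cutoff]


# Hypothetical log-likelihood gains: one clear peak versus no clear signal.
clear_signal = select_by_percentage([100.0, 20.0, 15.0, 10.0, 5.0])
no_signal = select_by_percentage([50.0, 49.0, 48.0, 10.0])
relaxed = select_by_percentage([100.0, 80.0, 15.0, 10.0, 5.0], percent=50.0)
print(clear_signal)
print(no_signal)
print(relaxed)
```

Because the cut-off is relative to the spread of the scores rather than a fixed number of peaks, the length of the list passed to the next round adapts to the strength of the signal.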
This selection method has the advantage that, if there is one clear peak standing well above the noise, it alone will be passed to the next round, while if there is no clear signal, all peaks high in the list will be passed as potential solutions to the next round. If structure solution fails, it may be possible to rescue the solution by reducing the percentage cut-off used for selection from 75% to, for example, 65%, so that if the correct peak was just missing the default cut-off, it is now included in the list passed to the next round. The tree-search-with-pruning search strategy is sub-optimal where there are multiple copies of the same search model in the asymmetric unit. In this case the search generates many branches, each of which has a subset of the complete solution, and so there is a combinatorial explosion in the search. The tree search would only converge onto one branch (solution) with the placement of the last component on each of the branches, but in practice the run time often becomes excessive and the job is terminated before this point can be reached. When searching for multiple copies of the same component in the asymmetric unit, several copies should be added at each search step (rather than branching at each search step), but this search strategy must currently be performed semi-manually as described elsewhere (McCoy, 2007 ▶). 3.1.2. Alternative space groups The space group of a structure can often be ambiguous after data collection. Ambiguities of space group within the one point group may arise from theoretical considerations (if the space group has an enantiomorph) or on experimental grounds (the data along one or more axes were not collected and the systematic absences along these axes cannot be determined). Changing the space group of a structure to another in the same point group can be performed without re-indexing, merging or scaling the data. 
Determination of the space group within a point group is therefore an integral part of structure solution by MR. The translation function will yield the highest log-likelihood gain for a correctly packed solution in the correct space group. Phaser allows the user to make a selection of space groups within the same point group for the first translation function calculation in a search for multiple components in the asymmetric unit. If the signal from the placement of the first component is not significantly above noise, the correct space group may not be chosen by this protocol, and the search for all components in the asymmetric unit should be completed separately in all alternative space groups. 3.1.3. Alternative models As the database of known structures expands, the number of potential MR models is also rapidly increasing. Each available model can be used as a separate search model, or combined with other aligned structures in an ‘ensemble’ model. There are also various ways of editing structures before use as MR models (Schwarzenbacher et al., 2004 ▶). The number of MR trials that can be performed thus increases combinatorially with the number of potential models, which makes job tracking difficult for the user. In addition, most users stop performing MR trials as soon as any solution is found, rather than continuing the search until the MR solution with the greatest log-likelihood gain is found, and so they fail to optimize the starting point for subsequent steps in the structure solution pipeline. The use of alternative models to represent a structure component is also useful where there are multiple copies of one type of component in the asymmetric unit and the different copies have different conformations due to packing differences. The best solution will then have the different copies modelled by different search models; if the conformation change is severe enough, it may not be possible to solve the structure without modelling the differences. 
A set of alternative search models may be generated using previously observed conformational differences among similar structures, or, for example, by normal-mode analysis (see §2.3). Phaser automates searches over multiple models for a component, where each potential model is tested in turn before the one with the greatest log-likelihood gain is found. The loop over alternative models for a component is only implemented in the rotation functions, as the solutions passed from the rotation function to the translation function step explicitly specify which model to use as well as the orientation for the translation function in question. 3.1.4. Search order permutation When searching for multiple components in the asymmetric unit, the order of the search can be a factor in success. The models with the biggest component of the total structure factor will be the easiest to find: when weaker scattering components are the subject of the initial search, the solution may be buried in noise and not significant enough to survive the selection criteria in the tree-search-with-pruning search strategy. Once the strongest scattering components are located, then the search for weaker scattering components (in the background of the strong scattering components) is more likely to be a success. Having a high component of the total structure factor correlates with the model representing a high fraction of the total contents of the asymmetric unit, low r.m.s. deviation between model and target atoms, and low B factors for the target to which the model corresponds. Although the first of these (high completeness) can be determined in advance from the fraction of the total molecular weight represented by the model, the second can only be estimated from the Chothia & Lesk (1986 ▶) formula and the third is unknown in advance. If structure solution fails with the search performed in the order of the molecular weights, then other permutations of search order should be tried. 
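A minimal sketch of enumerating search orders follows. It assumes each component is described only by a label and a molecular weight, starts from the order of decreasing molecular weight, and treats repeated copies of the same component as interchangeable, so that only unique orders are produced. The name `search_orders` and the example components are hypothetical.

```python
from itertools import permutations


def search_orders(components):
    """Yield unique search orders, heaviest-first order first.

    components: list of (label, molecular_weight) tuples; copies of the
    same component share a label. Illustrative only.
    """
    ranked = sorted(components, key=lambda c: c[1], reverse=True)
    labels = [label for label, _ in ranked]
    seen = set()
    for order in permutations(labels):
        if order not in seen:  # skip orders that differ only by swapping copies
            seen.add(order)
            yield order


# Two copies of a 30 kDa component "A" and one 10 kDa component "B".
orders = list(search_orders([("B", 10.0), ("A", 30.0), ("A", 30.0)]))
```

Because the two copies of "A" are indistinguishable, the six raw permutations collapse to three unique orders, and the first one tried is the molecular-weight order.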
In Phaser, this possibility is automated on request: the entire search strategy (except for the initial anisotropic data correction) is performed for all unique permutations of search orders. 3.2. Automated experimental phasing SAD is the simplest type of experimental phasing method to automate, as it involves only one crystal and one data set. SAD is now becoming the experimental phasing method of choice, overtaking multiple-wavelength anomalous dispersion because only a single data set needs to be collected. This can help minimize radiation damage to the crystal, which has a major adverse effect on the success of multi-wavelength experiments. The ‘automated experimental phasing’ mode in Phaser takes an atomic substructure determined by Patterson, direct or dual-space methods (Karle & Hauptman, 1956; Rossmann, 1961; Mukherjee et al., 1989; Miller et al., 1994; Sheldrick & Gould, 1995; Sheldrick et al., 2001; Grosse-Kunstleve & Adams, 2003) and refines the positions, occupancies, B factors and f″ values of the atoms to optimize the SAD function, then uses log-likelihood gradient maps to complete the atomic substructure. The flow diagram for this mode is shown in Fig. 2. The search strategy automates two search processes: those for ambiguity in the hand of the space group and for completing the atomic substructure from log-likelihood gradient maps. A feature of using the SAD function for phasing is that the substructure need not consist only of anomalous scatterers; indeed it can consist of only real scatterers, since the real scattering of the partial structure is used as part of the phasing function. This allows structures to be completed from initial real scattering models. 3.2.1. Enantiomorphic space groups Since the SAD phasing mode of Phaser takes as input an atomic substructure model, the space group of the solution has already been determined to within the enantiomorph of the correct space group. 
Changing the enantiomorph of a SAD refinement involves changing the enantiomorph of the heavy atoms, or in some cases the space group (e.g. the enantiomorphic space group of P41 is P43). In some rare cases (Fdd2, I41, I4122, I41md, I41cd, I-42d, F4132; Koch & Fischer, 1989) the origin of the heavy-atom sites is changed [e.g. the enantiomorphic space group of I41 is I41 with the origin shifted to (1/2, 0, 0)]. If there is only one type of anomalous scatterer, the refinement need not be repeated in both hands: only the phasing needs to be carried out in the second hand to be considered. However, if there is more than one type of anomalous scatterer, then the refinement and substructure completion need to be repeated, as they will not be enantiomorphically symmetric in the other hand. To facilitate this, Phaser runs the refinement and substructure completion in both hands [as does other experimental phasing software, e.g. SOLVE (Terwilliger & Berendzen, 1999) and autoSHARP (Vonrhein et al., 2006)]. The correct space group can then be found by inspection of the electron density maps; the density will only be interpretable in the correct space group. In cases with significant contributions from at least two types of anomalous scatterer in the substructure, the correct space group can also be identified by the log-likelihood gain. 3.2.2. Completing the substructure Peaks in log-likelihood gradient maps indicate the coordinates at which new atoms should be added to improve the log-likelihood gain. In the initial maps, the peaks are likely to indicate the positions of the strongest anomalous scatterers that are missing from the model. As the phasing improves, weaker anomalous scatterers, such as intrinsic sulfurs, will appear in the log-likelihood gradient maps, and finally, if the phasing is exceptional and the resolution high, non-anomalous scatterers will appear, since the SAD function includes a contribution from the real scattering. 
After refinement, atoms are excluded from the substructure if their occupancy drops below a tenth of the highest occupancy amongst those atoms of the same atom type (and therefore f″). Excluded sites are flagged rather than permanently deleted, so that if a peak later appears in the log-likelihood gradient map at this position, the atom can be reinstated and protected from being deleted again. This prevents oscillations in the addition of new sites between cycles, which would otherwise stop the substructure completion algorithm from converging. New atoms are added automatically after a peak and hole search of the log-likelihood gradient maps. The cut-off for the consideration of a peak as a potential new atom is that its Z score be higher than 6 (by default) and also higher than the depth of the largest hole in the map, i.e. the largest hole is taken as an additional indication of the noise level of the map. The proximity of each potential new site to previous atoms is then calculated. If a peak is more than a cut-off distance (κ Å) from every previous site, the peak is added as a new atom with the average occupancy and B factor from the current set of sites. If the peak is within κ Å of an isotropic atom already present, the old atom is made anisotropic. Holes in the log-likelihood gradient map within κ Å of an isotropic atom also cause the atom’s B factor to be switched to anisotropic. However, if the peak or hole is within κ Å of an anisotropic atom already present, the peak or hole is ignored. If a peak is within κ Å of a previously excluded site, the excluded site is reinstated and flagged as not for deletion in order to prevent oscillations, as described above. 
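The peak-handling rules above can be sketched as follows. This is an illustrative outline and not Phaser's code: the `Site` class, the function names, the use of plain Cartesian distances in Å (ignoring symmetry and unit-cell translations) and the choice to test for an excluded site before testing for anisotropy are all assumptions. The default κ follows the rule stated in this section (0.715 times d_min, limited to between 1 and 10 Å).

```python
import math
from dataclasses import dataclass


@dataclass
class Site:
    xyz: tuple                  # Cartesian coordinates in Å (assumption)
    occupancy: float = 1.0
    b_iso: float = 20.0
    anisotropic: bool = False
    excluded: bool = False      # flagged rather than deleted
    protected: bool = False     # reinstated once; not to be deleted again


def default_kappa(d_min):
    """Optical resolution 0.715 * d_min, clamped to the range 1-10 Å."""
    return min(10.0, max(1.0, 0.715 * d_min))


def is_candidate(z, deepest_hole, z_cutoff=6.0):
    """A peak qualifies if it beats both the Z cut-off and the deepest hole."""
    return z > z_cutoff and z > deepest_hole


def handle_peak(peak_xyz, sites, kappa):
    """Apply the proximity rules to one qualifying peak; return the action."""
    near = [s for s in sites if math.dist(s.xyz, peak_xyz) <= kappa]
    if not near:
        active = [s for s in sites if not s.excluded]
        n = len(active)
        occ = sum(s.occupancy for s in active) / n if n else 1.0
        b = sum(s.b_iso for s in active) / n if n else 20.0
        sites.append(Site(tuple(peak_xyz), occ, b))
        return "added"
    site = min(near, key=lambda s: math.dist(s.xyz, peak_xyz))
    if site.excluded:
        site.excluded = False
        site.protected = True
        return "reinstated"
    if site.anisotropic:
        return "ignored"
    site.anisotropic = True
    return "made_anisotropic"


sites = [
    Site((0.0, 0.0, 0.0)),
    Site((10.0, 0.0, 0.0), anisotropic=True),
    Site((20.0, 0.0, 0.0), excluded=True),
]
kappa = default_kappa(2.8)  # about 2.0 Å for 2.8 Å data
actions = [
    handle_peak(p, sites, kappa)
    for p in [(0.5, 0, 0), (10.5, 0, 0), (20.5, 0, 0), (50.0, 0, 0)]
]
```

Hole handling and the end-of-cycle removal of new sites near newly anisotropic atoms are left out to keep the sketch short.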
At the end of the cycle of atom addition and isotropic to anisotropic atomic B-factor switching, new sites within 2κ Å of an old atom that is now anisotropic are removed, since the peak may be absorbed by refining the anisotropic B factor; if not, it will be accepted as a new site in the next cycle of log-likelihood gradient completion. The distance κ may be input directly by the user; by default it is the ‘optical resolution’ of the structure (κ = 0.715 d_min), limited to be no less than 1 Å and no more than 10 Å. If the structure contains more than one significant anomalous scatterer, then log-likelihood gradient maps are calculated from each atom type, the maps are compared and the atom type associated with each significant peak is assigned from the map with the most significant peak at that location. 3.2.3. Initial real scattering model One of the reasons for including MR and SAD phasing within one software package is the ability to use MR solutions with the SAD phasing target to improve the phases. Since the SAD phasing target contains a contribution from the real scatterers, it is possible to use a partial MR model with no anomalous scattering as the initial atomic substructure used for SAD phasing. This approach is useful where there is a poor MR solution combined with a poor anomalous signal in the data. If the poor MR solution means that the structure cannot be phased from this model alone, and the poor anomalous signal means that the anomalous scatterers cannot be located in the data alone, then using the MR solution as the starting model for SAD phasing may provide enough phase information to locate the anomalous scatterers. The combined phase information will be stronger than from either source alone. 
To facilitate this method of structure solution, Phaser allows the user to input a partial structure model that will be interpreted in terms of its real scattering only and, following phasing with this substructure, to complete the anomalous scattering model from log-likelihood gradient maps as described above. 3.3. Input and output The fastest and most efficient way, in terms of development time, to link software together is using a scripting language, while using a compiled language is most efficient for intensive computation. Following the lead of the PHENIX project (Adams et al., 2002 ▶, 2004 ▶), Phaser uses Python (http://python.org) as the scripting language, C++ as the compiled language, and the Boost.Python library (http://boost.org/libs/python/) for linking C++ and Python. Other packages, notably X-PLOR (Brünger, 1993 ▶) and CNS (Brünger et al., 1998 ▶), have defined their own scripting languages, but the choice of Python ensures that the scripting language is maintained by an active community. Phaser functionality has mostly been made available to Python at the ‘mode’ level. However, some low-level SAD refinement routines in Phaser have been made available to Python directly, so that they can be easily incorporated into phenix.refine. A long tradition of CCP4 keyword-style input in established macromolecular crystallography software (almost exclusively written in Fortran) means that, for many users, this has been the familiar method of calling crystallographic software and is preferred to a Python interface. The challenge for the development of Phaser was to find a way of satisfying both keyword-style input and Python scripting with minimal increase in development time. Taking advantage of the C++ class structure allowed both to be implemented with very little additional code. Each keyword is managed by its own class. The input to each mode of Phaser is controlled by Input objects, which are derived from the set of keyword classes appropriate to the mode. 
The keyword classes are in turn derived from a CCP4base class containing the functionality for the keyword-style input. Each keyword class has a parse routine that calls the CCP4base class functions to parse the keyword input, stores the input parameters as local variables and then passes these parameters to a keyword class set function. The keyword class set functions check the validity and consistency of the input, throw errors where appropriate and finally set the keyword class’s member parameters. Alternatively, the keyword class set functions can be called directly from Python. These keyword classes are a standalone part of the Phaser code and have already been used in other software developments (Pointless; Evans, 2006 ▶). An Output object controls all text output from Phaser sent to standard output and to text files. Switches on the Output object give different output styles: CCP4-style for compatibility with CCP4 distribution, PHENIX-style for compatibility with the PHENIX interface, CIMR-style for development, XML-style output for developers of automation scripts and a ‘silent running’ option to be used when running Phaser from Python. In addition to the text output, where possible Phaser writes results to files in standard format; coordinates to ‘pdb’ files and reflection data (e.g. map coefficients) to ‘mtz’ files. Switches on the Output object control the writing of these files. 3.3.1. CCP4-style output CCP4-style output is a text log file sent to standard output. While this form of output is easily comprehensible to users, it is far from ideal as an output style for automation scripts. However, it is the only output style available from much of the established software that developers wish to use in their automation scripts, and it is common to use Unix tools such as ‘grep’ to extract key information. For this reason, the log files of Phaser have been designed to help developers who prefer to use this style of output. 
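The keyword-class design described earlier in this section (a base class for keyword-style parsing, one class per keyword with a parse routine and a validating set function, and an Input object derived from the relevant keyword classes) can be sketched in Python. Phaser implements this in C++; every class, method and keyword name below is a hypothetical stand-in chosen for illustration and is not Phaser's actual interface.

```python
class KeywordError(ValueError):
    """Raised when keyword input is invalid or inconsistent."""


class CCP4Base:
    """Hypothetical stand-in for the shared keyword-parsing functionality."""

    def tokenize(self, line):
        return line.split()


class ResolutionKeyword(CCP4Base):
    def __init__(self):
        super().__init__()
        self.high = None
        self.low = None

    def parse_resolution(self, line):
        tokens = self.tokenize(line)
        if len(tokens) != 3:
            raise KeywordError("expected: RESOLUTION <high> <low>")
        self.set_resolution(float(tokens[1]), float(tokens[2]))

    def set_resolution(self, high, low):
        # The set function validates, then stores the member parameters.
        if not 0 < high < low:
            raise KeywordError("need 0 < high < low (in Å)")
        self.high, self.low = high, low


class PeaksKeyword(CCP4Base):
    def __init__(self):
        super().__init__()
        self.percent = 75.0

    def parse_peaks(self, line):
        tokens = self.tokenize(line)
        if len(tokens) != 2:
            raise KeywordError("expected: PEAKS <percent>")
        self.set_peaks_percent(float(tokens[1]))

    def set_peaks_percent(self, percent):
        if not 0 < percent <= 100:
            raise KeywordError("percent must be in (0, 100]")
        self.percent = percent


class InputRotation(ResolutionKeyword, PeaksKeyword):
    """Input object for one mode, derived from its keyword classes."""

    def parse(self, text):
        handlers = {"RESOLUTION": self.parse_resolution,
                    "PEAKS": self.parse_peaks}
        for line in text.splitlines():
            if not line.strip():
                continue
            key = line.split()[0].upper()
            if key not in handlers:
                raise KeywordError(f"unknown keyword: {key}")
            handlers[key](line)


# Keyword-style input and direct calls share the same validation path.
inp = InputRotation()
inp.parse("RESOLUTION 2.5 50\nPEAKS 65")
inp.set_peaks_percent(80.0)
```

Because the parse routines delegate to the set functions, a scripting caller that bypasses the keyword text still gets the same validity checks, which is the property the design is meant to provide.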
Phaser prints four levels of log file, summary, log, verbose and debug, as specified by user input. The important output information is in all four levels of file, but it is most efficient to work with the summary output. Phaser prints ‘SUCCESS’ and ‘FAILURE’ at the end of the log file to demarcate the exit state of the program, and also prints the names of any of the other output files produced by the program to the summary output, amongst other features. 3.3.2. XML output XML is becoming commonly used as a way of communicating between steps in an automation pipeline, because XML output can be added very simply by the program author and relatively simply by others with access to the source code. For this reason, Phaser also outputs an XML file when requested. The XML file encapsulates the mark-up within 〈phaser〉 tags. As there is no standard set of XML tags for crystallographic results, Phaser’s XML tags are mostly specific to Phaser but were arrived at after consultation with other developers of XML output for crystallographic software. 3.3.3. Python interface The most elegant and efficient way to run Phaser as part of an automation script is to call the functionality directly from Python. Using Phaser through the Python interface is similar to using Phaser through the keyword interface. Each mode of operation of Phaser described above is controlled by an Input object and its parameter set functions, which have been made available to Python with the Boost.Python library. Phaser is then run with a call to the ‘run-job’ function, which takes the Input object as a parameter. The ‘run-job’ function returns a Result object on completion, which can then be queried using its get functions. The Python Result object can be stored as a ‘pickled’ class structure directly to disk. Text is not sent to standard out in the CCP4 logfile way but may be redirected to another output stream. All Input and Result objects are fully documented. 4. 
Future developments Phaser will continue to be developed as a platform for implementing novel phasing algorithms and bringing the most effective approaches to the crystallographic community. Much work remains to be done formulating maximum likelihood functions with respect to noncrystallographic symmetry, to account for correlations in the data and to consider non-isomorphism, all with the aim of achieving the best possible initial electron density map. After a generation in which Fortran dominated crystallographic software code, C++ and Python have become the new standard. Several developments, including Phaser, PHENIX (Adams et al., 2002 ▶, 2004 ▶), Clipper (Cowtan, 2002 ▶) and mmdb (Krissinel et al., 2004 ▶), simultaneously chose C++ as the compiled language at their inception at the turn of the millennium. At about the same time, Python was chosen as a scripting language by PHENIX, ccp4mg (Potterton et al., 2002 ▶, 2004 ▶) and PyMol (DeLano, 2002 ▶), amongst others. Since then, other major software developments have also started or converted to C++ and Python, for example PyWarp (Cohen et al., 2004 ▶), MrBump (Keegan & Winn, 2007 ▶) and Pointless (Evans, 2006 ▶). The choice of C++ for software development was driven by the availability of free compilers, an ISO standard (International Standardization Organization et al., 1998 ▶), sophisticated dynamic memory management and the inherent strengths of using an object-oriented language. Python was equally attractive because of the strong community support, its object-oriented design, and the ability to link C++ and Python through the Boost.Python library or the SWIG library (http://www.swig.org/). Now that a ‘critical mass’ of developers has taken to using the new languages, C++ and Python are likely to remain the standard for crystallographic software for the current generation of crystallographic software developers. 
Phaser source code has been distributed directly by the authors (see http://www-structmed.cimr.cam.ac.uk/phaser for details) and through the PHENIX and CCP4 (Collaborative Computational Project, Number 4, 1994) software suites. The source code is released for several reasons, including that we believe source code is the most complete form of publication for the algorithms in Phaser. It is hoped that generous licensing conditions and source distribution will encourage the use of Phaser by other developers of crystallographic software and those writing crystallographic automation scripts. There are no licensing restrictions on the use of Phaser in macromolecular crystallography pipelines by other developers, and the license conditions even allow developers to alter the source code (although not to redistribute it). We welcome suggestions for improvements to be incorporated into new versions. Compilation of Phaser requires the computational crystallography toolbox (cctbx; Grosse-Kunstleve & Adams, 2003), which includes a distribution of the cmtz library (Winn et al., 2002). The Boost libraries (http://boost.org/) are required for access to the functionality from Python. Phaser runs under a wide range of operating systems including Linux, Irix, OSF1/Tru64, MacOS-X and Windows, and precompiled executables are available for these platforms when only keyword-style access (and not Python access) is required. Graphical user interfaces to Phaser are available for both the PHENIX and the CCP4 suites. User support is available through PHENIX, CCP4 and from the authors (email cimr-phaser@lists.cam.ac.uk).


REFMAC5 for the refinement of macromolecular crystal structures

Garib Murshudov, Pavol Skubák, Andrey Lebedev … (2011)

1. Introduction As a final step in the process of solving a macromolecular crystal (MX) structure, refinement is carried out to maximize the agreement between the model and the X-ray data. Model parameters that are optimized in the refinement process include atomic coordinates, atomic displacement parameters (ADPs), scale factors and, in the presence of twinning, twin fraction(s). Although refinement procedures are typically designed for the final stages of MX analysis, they are also often used to improve partial models and to calculate the ‘best’ electron-density maps for further model (re)building. Refinement protocols are therefore an essential component of model-building pipelines [ARP/wARP (Perrakis et al., 1999 ▶), SOLVE/RESOLVE (Terwilliger, 2003 ▶) and Buccaneer (Cowtan, 2006 ▶)] and are of paramount importance in guiding manual model updates using molecular-graphics software [Coot (Emsley & Cowtan, 2004 ▶), O (Jones et al., 1991 ▶) and XtalView (McRee & Israel, 2008 ▶)]. The first software tools for MX refinement appeared in the 1970s. Real-space refinement using torsion-angle parameterization was introduced by Diamond (1971 ▶). This was followed a few years later by reciprocal-space algorithms for the refinement of individual atomic parameters with added energy (Jack & Levitt, 1978 ▶) and restraints (Konnert, 1976 ▶) in order to deliver chemically reasonable models. The energy and restraints approaches differ only in terminology as they use similar information and both can be unified using a Bayesian formalism (Murshudov et al., 1997 ▶). Early programs used the well established statistical technique of least-squares residuals with equal weights on all reflections (Press et al., 1992 ▶), with gradients and second derivatives (if needed) calculated directly. 
This changed when Fourier methods, which were developed for small-molecule structure refinement (Booth, 1946 ▶; Cochran, 1948 ▶; Cruickshank, 1952 ▶, 1956 ▶), were formalized for macromolecules (Ten Eyck, 1977 ▶; Agarwal, 1978 ▶). The use of the FFT for structure-factor and gradient evaluation (Agarwal, 1978 ▶) sped up calculations dramatically and the refinement of large molecules using relatively modest computers became realistic. Later, the introduction of molecular dynamics (Brünger, 1991 ▶), the generalization of the FFT approach for all space groups (Brünger, 1989 ▶) and the development of a modular approach to refinement programs (Tronrud et al., 1987 ▶) dramatically changed MX solution procedures. Also, the introduction of the very robust and popular small-molecular refinement program SHELXL (Sheldrick, 2008 ▶) to the macromolecular community allowed routine analysis of high-resolution MX data, including the refinement of merohedral and non-merohedral twins. More sophisticated statistical approaches to MX structure refinement started to emerge in the 1990s. Although the basic formulations and most of the necessary probability distributions used in crystallography were developed in the 1950s and 1960s (Luzzati, 1951 ▶; Ramachandran et al., 1963 ▶; Srinivasan & Ramachandran, 1965 ▶; see also Srinivasan & Parthasarathy, 1976 ▶, and references therein), their implementation for MX refinement started in the middle of the 1990s (Pannu & Read, 1996 ▶; Bricogne & Irwin, 1996 ▶; Murshudov et al., 1997 ▶). It should be emphasized that prior to the application of maximum-likelihood (ML) techniques in MX refinement, the importance of advanced statistical approaches to all stages of MX analysis had been advocated by Bricogne (1997 ▶) for two decades. Nowadays, most MX refinement programs offer likelihood targets as an option. 
Although ML can be very well approximated using the weighted least-squares approach in the very simple case of refinement against structure-factor amplitudes (Murshudov et al., 1997 ▶), ML has the attractive advantage that it is relatively easy (at least theoretically) to generalize for the joint utilization of a variety of sources of observations. For example, it was immediately extended to use experimental phase information (Bricogne, 1997 ▶; Murshudov et al., 1997 ▶; Pannu et al., 1998 ▶). In the last two decades, there have been many developments of likelihood functions towards the exploitation of all available experimental data for refinement, thus increasing the reliability of the refined model in the final stages of refinement and improving the electron density used in model building in the early stages of MX analysis (Bricogne, 1997 ▶; Skubák et al., 2004 ▶, 2009 ▶). MX crystallography can now take advantage of highly optimized software packages dealing with all of the various stages of structure solution, including refinement. There are several programs available that either are designed to perform refinement or offer refinement as an option. These include BUSTER/TNT (Blanc et al., 2004 ▶), CNS (Brünger et al., 1998 ▶), MAIN (Turk, 2008 ▶), MOPRO (Guillot et al., 2001 ▶), phenix.refine (Adams et al., 2010 ▶), REFMAC5 (Murshudov et al., 1997 ▶), SHELXL (Sheldrick, 2008 ▶) and TNT (Tronrud et al., 1987 ▶). While MOPRO was specifically designed for niche ultrahigh-resolution refinement and is able to model deformation density, all of the other programs can deal with a multitude of MX refinement problems and produce high-quality electron-density maps, although with different emphases and strengths. This contribution describes the various components of the macromolecular crystallographic refinement program REFMAC5, which is distributed as part of the CCP4 suite (Collaborative Computational Project, Number 4, 1994 ▶). 
REFMAC5 is a flexible and highly optimized refinement package that is ideally suited for refinement across the entire resolution spectrum that is encountered in macromolecular crystallography. 2. Target functions in REFMAC5 As in all other refinement programs, the target function minimized in REFMAC5 has two components: a component utilizing geometry (or prior knowledge) and a component utilizing experimental X-ray knowledge, $f_{\mathrm{total}} = f_{\mathrm{geom}} + w f_{\mathrm{xray}}$ (1), where $f_{\mathrm{total}}$ is the total target function to be minimized, consisting of functions controlling the geometry of the model and the fit of the model parameters to the experimental data, and $w$ is a weight between the relative contributions of these two components. In macromolecular crystallography, the weight is traditionally selected by trial and error. REFMAC5 offers automatic weighting, which is based on the fact that both components are the natural logarithm of a probability distribution. However, this ‘automatic’ weight may lead to unreasonable deviations from ideal geometry (either too tight or too relaxed) in some cases, as the ideal geometry is difficult to describe statistically. For these cases, the weight parameter may need to be selected manually to produce more reasonable geometry, e.g. such that the root-mean-square deviation of the bond lengths from the ideal values is 0.02 Å and at resolutions lower than 3 Å perhaps even smaller. From a Bayesian viewpoint (O’Hagan, 1994), these functions have the following probabilistic interpretation (ignoring constants which are irrelevant for minimization purposes): $f_{\mathrm{total}} = -\log[P_{\mathrm{posterior}}(\mathrm{model}; \mathrm{obs})]$, $f_{\mathrm{geom}} = -\log[P_{\mathrm{prior}}(\mathrm{model})]$ and $f_{\mathrm{xray}} = -\log[P_{\mathrm{likelihood}}(\mathrm{obs}; \mathrm{model})]$ (2). From this point of view, MX refinement is similar to a well known technique in statistical analysis: maximum a posteriori (MAP) estimation. The model parameters are linked with the experimental data via $f_{\mathrm{xray}}$, i.e. likelihood is a mechanism that controls information flow from the experimental data to the derived model. 
Consequently, it is important to design a likelihood function that allows optimal information transfer from the data to the derived model. f geom ensures that the derived model is consistent with the presumed chemical and structural knowledge. This function plays the role of regularization, reduction of the effective number of parameters and transfer of known information to the new model. If care is not taken, then wrong information may be transferred to the model; removing the effect of such errors may be difficult if possible at all. The design of such functions should be performed using verifiable invariant information and it should be testable and revisable during the refinement and model-building procedures. Functions dealing with geometry usually depend only on atomic parameters. We are not aware of any function used in crystallography that deals with the prior geometry probability distributions of overall parameters. A possible reason for the lack of interest in (and necessity of) this type of function may be that, despite popular belief, the statistical problem in crystallography is sufficiently well defined and that the main problems are those of model parameterization and completion. The existing refinement programs differ in the target functions and optimization techniques used to derive model parameters. Most MX programs use likelihood target functions. However, their form, implementations and parameterizations are different. Therefore, it should not come as a surprise if different programs give (slightly) different results in terms of model parameters, electron-density maps and reliability factors (such as R and R free). 2.1. X-ray component The X-ray likelihood target functions used in REFMAC5 are based on a general multivariate probability distribution of E observations given M model structure factors. 
This function is derived from a multivariate complex Gaussian distribution of N = E + M structure factors for acentric reflections and from a multivariate real Gaussian distribution for centric reflections and has the following form: where P = P(|F 1|, …, |FE |; F E+1, …, FN ), Fi = |Fi |exp(ια i ), |F 1|, …, |FE | denote the observed amplitudes, F E+1, …, FN are the model structure factors, C N is the covariance matrix with the elements of its inverse denoted by aij , C M is the bottom right square submatrix of C N of dimension M with the elements of its inverse denoted by cij . We define cij = 0 for i ≤ 0 or j ≤ 0. |C N | and |C M | are the determinants of matrices C N and C M , α = (α1, …, α E ) is the vector of the unknown phases of the observations that need to be integrated and P pr(α) is a probability distribution expressing any prior knowledge about the phases. In the simplest case of one observation, one model and no prior knowledge about phases, the integral in (3) can be evaluated analytically. In this case, the function follows a Rice distribution (Bricogne & Irwin, 1996 ▶), which is a non-central χ2 distribution of |F o|2/Σ and |F o|2/2Σ with non-centrality parameters D 2|F c|2/Σ and D 2|F c|2/2Σ with one and two degrees of freedom for centric and acentric reflections, respectively (Stuart & Ord, 2009 ▶), where D in its simplest interpretation is 〈cos(Δxs)〉, a Luzzati error parameter (Luzzati, 1952 ▶) expressing errors in the positional parameters of the model, F c is the model structure factor, |F o| is the observed amplitude of the structure factor and Σ is the uncertainty or the second central moment of the distribution. Both Σ and D enter the equation as part of the covariance matrices C N and C M from (3). Σ is a function of the multiplicity of the Miller indices (∊ factor), experimental uncertainties (σo), model completeness and model errors. 
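As a minimal illustration of the single-observation case described above, the Rice likelihood can be evaluated numerically. The sketch below assumes the standard textbook forms of the acentric and centric Rice densities, with the ∊ factor and the experimental uncertainties folded into Σ. The function names and all numerical values are hypothetical and are not taken from REFMAC5.

```python
import math

def bessel_i0(x, max_terms=500):
    """Modified Bessel function I0 from its power series (adequate for moderate x)."""
    term = 1.0
    total = 1.0
    q = 0.25 * x * x
    for k in range(1, max_terms):
        term *= q / (k * k)
        total += term
        if term < 1e-16 * total:
            break
    return total

def rice_acentric(f_obs, f_calc, d, sigma):
    """Density of |F_o| for an acentric reflection, given |F_c|, D and Sigma."""
    nu = d * f_calc
    return ((2.0 * f_obs / sigma)
            * math.exp(-(f_obs ** 2 + nu ** 2) / sigma)
            * bessel_i0(2.0 * f_obs * nu / sigma))

def rice_centric(f_obs, f_calc, d, sigma):
    """Density of |F_o| for a centric reflection, given |F_c|, D and Sigma."""
    nu = d * f_calc
    return (math.sqrt(2.0 / (math.pi * sigma))
            * math.exp(-(f_obs ** 2 + nu ** 2) / (2.0 * sigma))
            * math.cosh(f_obs * nu / sigma))

def minus_log_likelihood(f_obs, f_calc, d, sigma, centric=False):
    """Contribution of one reflection to the X-ray target (minus log-likelihood)."""
    pdf = rice_centric if centric else rice_acentric
    return -math.log(pdf(f_obs, f_calc, d, sigma))

# Hypothetical values: |F_o| = 2.1, |F_c| = 2.0, D = 0.9, Sigma = 1.5
print(minus_log_likelihood(2.1, 2.0, 0.9, 1.5))
print(minus_log_likelihood(2.1, 2.0, 0.9, 1.5, centric=True))
```

The power series for I0 is only suitable for moderate arguments; a production implementation would more likely work with a scaled Bessel function or directly with the logarithm to avoid overflow.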
For simplicity, the following parameterization is used: The current version of REFMAC5 estimates D and Σmod in resolution bins. Working reflections are used for estimation of D and free reflections are used for Σmod estimation. Although this simple parameterization works in many cases, it may give misleading results for data from crystals with pseudo translation, OD disorder or modulated crystals in general. Currently, there is no satisfactory implementation of the error model to account for these cases. 2.2. Incorporation of experimental phase information in model refinement 2.2.1. MLHL likelihood MLHL likelihood (Bricogne, 1997 ▶; Murshudov et al., 1997 ▶; Pannu et al., 1998 ▶) is based on a special case of the probability distribution (3) where we have one observation, one model and phase information derived from an experiment available as a prior distribution P pr(α), where F o = |F o|exp(ια), F c = |F c|exp(ιαc), α is the unknown phase of the structure factor and α1 and α2 are its possible values for a centric reflection. The prior phase probability distribution P pr(α) is usually represented as a generalized von Mises distribution (Mardia & Jupp, 1999 ▶) and is better known in crystallography as a Hendrickson–Lattman distribution (Hendrickson & Lattman, 1970 ▶), where A, B, C and D are coefficients of the Fourier transformation of the logarithm of the phase probability distribution and N is the normalization coefficient. The distribution is unimodal when C and D are zero; otherwise, it is a bimodal distribution that reflects the possible phase uncertainty in experimental phasing. For centric reflections C and D are zero. 2.2.2. SAD/SIRAS likelihood The MLHL likelihood is dependent on the reliability and accuracy of the prior distribution P pr(α). However, the phase distributions after density modification (or even after phasing), which are usually used as P pr(α), often suffer from inaccurate estimation of the phase errors. 
Furthermore, MLHL [as well as any other special case of (3) with a non-uniform P pr(α)] assumes independence of the prior phases from the model phases. These shortcomings can be addressed by using experimental information directly from the experimental data, instead of from the P pr(α) distributions obtained in previous steps of the structure-solution process. Currently, SAD and SIRAS likelihood functions are implemented in REFMAC5. The SAD probability distribution (Skubák et al., 2004 ▶) is obtained from (3) by setting E = 2, M = 2, P pr(α) = constant and |F 1| = |F o +|, |F 2| = |(F o −)*|, F 3 = F c +, F 4 = (F c −)*, where F + and F − are the structure factors of the Friedel pairs. The model structure factors are constructed using the current parameters of the protein, the heavy-atom substructure and the inputted anomalous scattering parameters. Similarly, the SIRAS function (Skubák et al., 2009 ▶) is a special case of (3) with E = 3, M = 3, P pr(α) = constant and |F 1| = |F o N |, |F 2| = |F o +|, |F 3| = |(F o −)*|, F 4 = F c N, F 5 = F c +, F 6 = (F c −)*, where |F 1| and F 4 correspond to the observation and the model of the native crystal, respectively, and |F 2|, |F 3|, F 5 and F 6 refer to the Friedel pair observations and models of the derivative crystal. If any of the E observations are symmetrically equivalent, for instance centric Friedel pair intensities, the equation is reduced appropriately so as to only include non-equivalent observations and models. The incorporation of prior phase information by the refinement function is especially useful in the early and middle stages of model building and at all stages of structure solution at lower resolutions, owing to the improvement in the observation-to-parameter ratio. The refinement of a well resolved high-resolution structure is often best achieved using the simple Rice function. Fig. 
1 ▶ shows the effect of various likelihood functions on automatic model building using ARP/wARP (Perrakis et al., 1999 ▶). 2.3. Twin refinement The function used for twin refinement is a generalization of the Rice distribution in the presence of a linear relationship between the observed intensities. This function has the form where N o and N model are normalization coefficients. In the first equation, the first term inside the integral, P(I o; F), represents the probability distribution of observations if ‘ideal’ structure factors are known. Here, all reflections that are twinned and that can be grouped together are included. Models representing the data-collection instrument, if available, could be added to this term. The second term, P(F; model), represents a probability distribution of the ‘ideal’ structure factors should an atomic model be known for a single crystal. Here, all reflections from the asymmetric unit that contribute to the observed ‘twinned’ intensities are included. If the data were to come from more than one crystal or if, for example, SAD should be used simultaneously with twinning, then this term would need to be modified appropriately. F c is a function of atomic and overall parameter D. Overall parameters also include Σ and twin-fraction parameters. f represents the way structure factors from the asymmetric unit contribute to the particular ‘twinned’ intensity. The above formula is more symbolic rather than precise; further details of twin refinement will be published elsewhere. REFMAC5 performs the following preparations before starting refinement against twinned data. (i) Identify potential (pseudo)merohedral twin operators by analyses of cell/space-group combination using the algorithm developed by Lebedev et al. (2006 ▶). (ii) Calculate R merge for each potential twin operator and filter out twin operators for which R merge is greater than 0.5 or a user-defined value. 
(iii) Estimate twin fractions for the remaining twin domains and filter out those with small twin fractions (the default value is 0.05). (iv) Make sure that the point group and twin operators form a group. Strictly speaking this stage is not necessary, but it makes bookkeeping easy. (v) Perform twin refinement using the remaining twin operators. Twin fractions are refined at every cycle. All integrals necessary for evaluation of the minus log-likelihood function and its derivatives with respect to the structure factors are evaluated using the Laplace approximation (McKay, 2003 ▶). 2.4. Modelling bulk-solvent contribution Typically, a significant part of a macromolecular crystal is occupied by disordered solvent. Accurate modelling of this part of the crystal is still an unsolved problem of MX. The contribution of bulk solvent to structure factors is strongest at low resolution, although its effect at high resolution is still non-negligible. The absence of good models for disordered solvent may be one of the reasons why R factors in MX are significantly higher than those in small-molecular crystallography. For small molecules R factors can be around 1%, whereas for MX they are rarely less than 10% and more often around 20% or even higher. REFMAC5 uses two types of bulk (disordered) solvent models. One of them is the so-called Babinet’s bulk-solvent model, which is based on the assumption that the only difference between solvent and protein at low resolution is their scale factor (Tronrud, 1997 ▶). Here, we use a slight modification of the formulation described by Tronrud (1997 ▶) and assume that if protein electron density is convoluted using the Gaussian kernel and multiplied by an appropriate scale factor, then protein and solvent electron densities are equal, where * denotes convolution, denotes the Fourier transform and k babinet = k babinet0exp(−B babinet|s|2/4). 
Here, we used the convolution theorem, which states that the Fourier transform of the convolution of two functions is the product of their Fourier transforms. The second bulk-solvent model is derived similarly to that described by Jiang & Brünger (1994 ▶). The basic assumption is that disordered solvent atoms are uniformly distributed over the region of the asymmetric unit that is not occupied by the atoms of the modelled part of the crystal structure. The region of the asymmetric unit occupied by the atomic model is masked out. Any holes inside this mask are removed using a cavity-detection algorithm. A constant value is assigned outside this region and the structure factors F mask are calculated using an FFT algorithm. These structure factors, multiplied by appropriate scale factors (estimated during the scaling procedure), are added to those calculated from the atomic model. Additionally, various mask parameters may optionally be optimized. One should be careful with bulk-solvent corrections, especially when the atomic model is incomplete. This type of bulk-solvent model may result in smeared-out electron density that may reduce the height of electron density in less-ordered and unmodelled parts of the crystal. The final total structure factors with scale and solvent contributions included take the following form: where the ks are scale factors, s is the reciprocal-space vector, |s| is the length of this vector, U aniso is the crystallographic anisotropic tensor that obeys crystal symmetry, F mask is the contribution from the mask bulk solvent and F protein is the contribution from the protein part of the crystal. Usually, either mask or Babinet bulk-solvent correction is used. However, sometimes their combination may provide better statistics (lower R factors) than either individually. 
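The resolution dependence of the Babinet scale factor defined above, k babinet = k babinet0exp(−B babinet|s|2/4), can be made concrete with a short sketch. It assumes the usual convention |s| = 1/d for a reflection at resolution d; the function names and the parameter values are hypothetical and chosen only to show that the correction is confined to low resolution.

```python
import math

def s_length(resolution):
    """Length of the reciprocal-space vector, assuming |s| = 1/d (d in Angstrom)."""
    return 1.0 / resolution

def babinet_scale(s_len, k0, b):
    """k_babinet = k0 * exp(-B * |s|^2 / 4), as given in the text."""
    return k0 * math.exp(-b * s_len * s_len / 4.0)

# Hypothetical parameters: k0 = 0.8, B = 200 A^2
for d in (20.0, 8.0, 4.0, 2.0):
    print(d, babinet_scale(s_length(d), k0=0.8, b=200.0))
```

With these illustrative values the factor is close to k0 at 20 Å and essentially vanishes by 2 Å, in line with the statement that the bulk-solvent contribution is strongest at low resolution.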
The overall parameters of the solvent models, the overall anisotropy and the scale factors are estimated using a least-squares fit of the amplitude of the total structure factors to the observed amplitudes, $\sum (|F_{\mathrm{o}}| - |F_{\mathrm{total}}|)^2 \rightarrow \min$ (11). In the case of twin refinement, the following function is used to estimate overall parameters including twin fractions (details of twin refinement will be published elsewhere), where f(α, F) is as defined in (8). Both (11) and (12) are minimized using the Gauss–Newton method with eigenvalue filtering to solve linear equations, which ensures that even very highly correlated parameters can be estimated simultaneously. However, one should be careful in interpreting these parameters as the system is highly correlated. Once overall parameters such as the scale factors and twin fractions have been estimated, REFMAC5 estimates the overall parameters of one of the abovementioned likelihood functions and evaluates the function and its derivatives with respect to the atomic parameters. A general description of this procedure can be found in Steiner et al. (2003 ▶). 2.5. Geometry component The function controlling the geometry has several components. (i) Chemical information about the constituent blocks (e.g. amino acids, nucleic acids, ligands) of macromolecules and the covalent links between them. (ii) Internal consistency of macromolecules (e.g. NCS). (iii) Structural knowledge (known structures, restraints on current interatomic distances, secondary structures). The first component is used by all programs and has been tabulated in an mmCIF dictionary (Vagin et al., 2004 ▶) now used by several programs, including REFMAC5, phenix.refine (Adams et al., 2010 ▶) and Coot (Emsley & Cowtan, 2004 ▶). The current version of the dictionary contains around 9000 entries and several hundred covalent-link descriptions. 
Any new entries may be added using one of several programs, including Sketcher (Vagin et al., 2004 ▶) from CCP4 (Collaborative Computational Project, Number 4, 1994 ▶), JLigand (unpublished work), PRODRG (Schüttelkopf & van Aalten, 2004 ▶) and phenix.elbow (Adams et al., 2010 ▶). Standard restraints on the covalent structure have the general form where bm represents a geometric parameter (e.g. bonds, angles, chiralities) calculated from the model and bi is the ideal value of this particular geometric parameter as tabulated in the dictionary. Apart from ω (the angle of the peptide bond) and χ (the angles of amino-acid side chains), torsion angles in general are not restrained by default. However, the user can request to restrain a particular torsion angle defined in the dictionary or can define general torsion angles and use them as restraints. In general, it is not clear how to handle the restraint on torsion angles automatically, as these angles may depend on the covalent structure as well as the chemical environment of a particular ligand. 2.6. Noncrystallographic symmetry restraints 2.6.1. Automatic NCS definition Automatic NCS identification in REFMAC5 is performed using the following procedure. (i) Align the sequences of all chains with all chains using the dynamic alignment algorithm (Needleman & Wunsch, 1970 ▶). (ii) Accept the alignment if the number of aligned residues is more than k (default 15) residues and the sequence identity for aligned residues is more than α% (default 80%). (iii) Calculate the global root-mean-square deviation (r.m.s.d.) using all aligned residues. (iv) Calculate the average local r.m.s.d. 
using the formula where N is the number of aligned residues, j indexes the aligned residues, Nj is the number of corresponding atoms in residue j, n j is the number of atoms in the ith group, rl is the vector of differences between corresponding atomic positions and Rj and tj are the rotation and translation that give the best superposition between atoms in group i. To calculate the r.m.s.d., it is not necessary to calculate the rotation and translation operators explicitly or to apply these transformations to atoms. Rather, it is achieved implicitly using Procrustes analysis, as described, for example, in Mardia & Bibby (1979 ▶). When k = N, the local and global r.m.s.d. coincide. (v) If the r.m.s.d. is less than β Å (default 2.5 Å), then we consider the chains to be aligned. (vi) Prepare the list of aligned atoms. If after applying the transformation matrix (calculated using aligned atoms) the neighbours (waters, ligands) of aligned atoms are superimposed, then they are also added to the list of aligned atoms. (vii) If local NCS is requested, then prepare pairs of corresponding interatomic distances. Steps (i)–(v) are performed once during each session of refinement. Step (vi) is performed during every cycle of refinement in order to allow conformational changes to occur. 2.6.2. Global NCS For global NCS restraints, transformation operators (Rij and tij ) that optimally superpose all NCS-related molecules are estimated and the following residual is added to the total target function, where the weight w is a user-controllable parameter. Note that the transformation matrices are estimated using xi and xj and thus they are dependent on these parameters. Therefore, in principle the gradient and second-derivative calculations should take this dependence into account, although this dependence is ignored in the current version of REFMAC5. Ignoring the contribution of these terms may reduce the rate of convergence, although in practice it does not seem to pose a problem. 
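Steps (i) and (ii) of the automatic NCS definition in Section 2.6.1 can be sketched as follows. This is a minimal illustration, assuming a simple match/mismatch/gap scoring scheme (the scoring actually used by REFMAC5 is not stated in the text); only the default thresholds of 15 aligned residues and 80% identity come from the description above, and the function names are hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming; returns aligned index pairs (i, j)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner, collecting residue pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs

def accept_alignment(a, b, min_aligned=15, min_identity=80.0):
    """Step (ii): more than min_aligned aligned residues and more than min_identity % identity."""
    pairs = needleman_wunsch(a, b)
    if not pairs:
        return False
    identical = sum(1 for i, j in pairs if a[i] == b[j])
    identity = 100.0 * identical / len(pairs)
    return len(pairs) > min_aligned and identity > min_identity

chain_a = "ACDEFGHIKLMNPQRSTVWY"
chain_b = chain_a.replace("K", "")   # same chain with one residue missing
print(accept_alignment(chain_a, chain_b))
```

Chain pairs that pass this test would then go on to the r.m.s.d. checks of steps (iii)–(v), which are not covered by this sketch.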
2.6.3. Local NCS The following function (similar to the implementation in BUSTER) is used for local NCS restraints, where GM is the Geman–McClure robust estimator function (Geman & McClure, 1987 ▶), which can be written Fig. 2 ▶ shows that for small values of r this function is similar to the usual least-squares function. However, it behaves differently for large r: least-square residuals do not allow conformational changes to occur, whereas this type of function is more tolerant to such changes. 2.6.4. External structure restraints The interatomic distances within the structure being analysed may be similar to a known (deposited) structure, particularly in localized regions. In cases where it makes sense, this information can be exploited in order to aid the refinement of the target structure. In doing so, the target structure is pulled towards the conformation adopted by the known structure. The mechanism for generic external restraints described by Mooij et al. (2009 ▶) is used for external structure restraints. In our implementation, structural information from external known structures is utilized by applying restraints to the distances between atom pairs based on a presumed atomic correspondence between the two structures. The following function is used for external structure restraints, where the atoms ai belong to the set A of atoms for which a correspondence is known, dij is the distance between the positions of atoms ai and aj , is the corresponding distance in the known structure, σ ij is the estimated standard deviation of dij about and d max ensures that atom pairs are only restrained within localized regions, allowing insensitivity to global conformational changes. External structure restraints should be weighted differently to the other geometry components in order to allow the restraint strength to be separately specified. 
Consequently, a weight w ext is applied, which should be appropriately chosen depending on the data quality and resolution, the structural similarity between the external known structure and the target, and the choice of d max. The Geman–McClure function with sensitivity parameter σGM is used to increase robustness to outliers, as with the local NCS restraints. Prior information from the external known structure(s) is generated using the software tool PROSMART. Specifically, this includes the atomic correspondence A, distances , standard deviations σ ij and the distance cutoff d max. Potential sources of prior structural information include different conformations of the target chain (such as those that may result from using different crystallization conditions or in a different binding state) as well as those from homologous or structurally similar proteins. It is possible to use multiple known structures as prior information. The combination of this information results in modified values of and σ ij as appropriate. This allows a structure to be refined utilizing information from a whole class of similar structures, rather than just a single source. Furthermore, it opens up the future possibility for multi-crystal co-refinement. The employed formalism also allows the application of atomic distance restraints to secondary-structure elements (and, in principle, other motifs). Consequently, external restraints may be applied without requiring the prior identification of known structures similar to the target. This is intended to help to refine such motifs towards the expected/presumed local conformation. This technique has been found to be particularly useful for low-resolution crystals and in cases where the target structure is unable to be refined to a satisfactory level. When used appropriately, external structure restraints should increase refinement reliability. 
Consequently, the difference between the R and R free values is expected to decrease in successful cases. Fig. 3 ▶ shows the refinement statistics resulting from using external restraints to refine a low-resolution bluetongue virus VP4 enzyme (Sutton et al., 2007 ▶). A sequence-identical structure solved at a higher resolution is used as prior information. Refinement statistics are compared after ten refinement cycles with and without using external restraints. Using the external restraints results in a 2.8% improvement in R free. Furthermore, the difference between the R and R free values is reduced from 11.5 to 4.3%, suggesting greatly increased refinement reliability. 2.6.5. ‘Jelly-body’ restraints The ratio of the number of observations to the number of adjustable parameters is very small at low resolution. Even after accounting for chemical restraints, this ratio stays very small and refinement in such cases is usually unstable. The danger of overfitting is very high; this is reflected in large differences between the R and R free values. External structure restraints and the use of experimental phase information (described above) provide ways of dealing with this problem. Unfortunately, it is not always possible to find similar structures refined at high resolution (or at least ones that result in a sufficiently successful improvement in refinement statistics) and experimental phase information is not always available or sufficient. Fortunately, statistical techniques exist to deal with this type of problem. Such techniques include ridge regression (Stuart et al., 2009 ▶), the lasso estimation procedure (Tibshirani, 1997 ▶) and Bayesian estimation with prior knowledge of parameters (O’Hagan, 1994 ▶). REFMAC5 has a regularization function in interatomic distance space that has the form for pairs of atoms i, j from the same chain, with maximum radius d max, which can be controlled (default 4.25 Å). 
Note that this term does not contribute to the value of the function or its gradient; it only changes the second derivative, thus changing the search direction. It should be noted that a similar technique has been implemented in CNS (Schröder et al., 2010 ▶). Note that if all interatomic distances were constrained, then individual atomic refinement would become rigid-body refinement. The effect of ‘jelly-body’ restraints is the implicit parameterization between the rigid body and individual atoms. This technique has strong similarity to elastic network model calculations (Tirion, 1996 ▶). This simple formula has been found to work surprisingly well. 2.6.6. Atomic displacement parameter restraints Unlike positional parameters, where prior knowledge can be designed using basic knowledge of the chemistry of the building blocks of macromolecules and analysis of high-resolution structures, it is not obvious how to design restraints for atomic displacement parameters (ADPs). Ideally, restraints should reflect the geometry of the molecules as well as their overall mobility. Various programs use various restraints (Sheldrick, 2008 ▶; Adams et al., 2010 ▶; Konnert & Hendrickson, 1980 ▶; Murshudov et al., 1997 ▶). In the new version of REFMAC5, restraints on ADPs are based on the distances between distributions. If we assume that atoms are represented as Gaussian distributions, then we are able to design restraints based on the distance between such distributions. For two given distributions in three-dimensional space P(x) and Q(x), the symmetrized Kullback–Leibler (KL) divergence (McKay, 2003 ▶) is defined as follows: It can be verified that the symmetrized KL divergence satisfies the conditions of a metric distance in the space of distributions. 
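For zero-mean Gaussian atoms the symmetrized KL divergence has a closed form that is easy to evaluate. The sketch below assumes that the symmetrized divergence is the sum KL(P‖Q) + KL(Q‖P) (some texts use half of this sum), in which case the logarithmic determinant terms cancel and only traces remain. The function names and the example tensors are hypothetical and do not reproduce the REFMAC5 implementation or its weights.

```python
def inverse_3x3(m):
    """Inverse of a 3x3 matrix via the adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if det == 0.0:
        raise ValueError("singular ADP tensor")
    return [[(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
            [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
            [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det]]

def trace_of_product(a, b):
    """tr(A B) for 3x3 matrices."""
    return sum(a[i][j] * b[j][i] for i in range(3) for j in range(3))

def symmetrized_kl_aniso(u1, u2):
    """KL(P||Q) + KL(Q||P) for zero-mean 3D Gaussians with covariances U1 and U2."""
    return 0.5 * (trace_of_product(u1, inverse_3x3(u2))
                  + trace_of_product(u2, inverse_3x3(u1)) - 6.0)

def symmetrized_kl_iso(u1, u2):
    """Isotropic special case: (3/2) * (u1 - u2)^2 / (u1 * u2)."""
    return 1.5 * (u1 - u2) ** 2 / (u1 * u2)

u_a = [[2.0, 0.3, 0.0], [0.3, 1.0, 0.0], [0.0, 0.0, 1.5]]
u_b = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(symmetrized_kl_aniso(u_a, u_b))
print(symmetrized_kl_iso(1.0, 2.0))
```

Because the isotropic expression depends only on the ratio of the two displacement parameters, the same value is obtained whether it is written in terms of U or of B values.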
The KL divergence can also be represented as follows: This distance changes more smoothly than the L 2 distance between functions and seems to be a useful criterion for the design of approximate probability distributions (McKay, 2003 ▶; O’Hagan, 1994 ▶). When both distributions are Gaussian with mean zero, this distance has an elegant form. Assume that both atoms have Gaussian distribution: In this case, the KL divergence becomes In the case of isotropic ADPs, KL has an even simpler form: REFMAC5 uses restraints based on the KL divergence: The summation is over all atom pairs with distance less than r max. The weights depend on the nature of the bonds as well as on the distance between the atoms. If atoms are bonded or angle-related then the weight is larger. However, the weight is smaller if the atoms are not related by covalent bonds. Moreover, if the distance between the atoms is more than 3 Å then the weight decreases as follows: where w 0,ij is the weight for nonbonded atoms that are closer than 3 Å to each other. 2.6.7. Rigid-bond restraints For anisotropic atoms there are so-called rigid-bond restraints, based on the idea of rigid-bond tests of anisotropic atoms (Hirshfeld, 1976 ▶). The idea is that projections of U values on the bond vector joining two atoms should be similar. In other words, if two atoms are bonded then an oscillation across the bond is more likely than an independent oscillation along the bond. Atoms oscillate along the bond in a concerted fashion. Rigid-bond restraints are designed as follows. Let us assume that two atoms have positions x 1 and x 2 and their corresponding ADPs are U 1 and U 2; the unit vector joining these atoms is then calculated, The projections of corresponding U values on this vector are then calculated as Now, using these projections, the KL divergence is formed for all pairs and added to the target function: Again, the weights depend on the nature of the bonds between the atoms and the distances between them. 
Note that if the ADPs of both bonded atoms are isotropic then the rigid-bond restraint is equivalent to the above-described KL restraint. 2.6.8. Sphericity restraints To avoid atoms exploding and becoming too elliptical or, even worse, non-elliptical, REFMAC5 uses restraints on sphericity. It is a simple restraint: an isotropic equivalent of the anisotropic tensor, where k indexes the anisotropic atoms, i, j are components of the anisotropic tensor and wk are weights for this particular type of restraint. The weights depend on the number of other restraints (KL, rigid bond) on this atom. Atoms that have fewer restraints have stronger weights on sphericity, since these atoms are more likely to be unstable. It should be noted that similar restraints on ADPs are used in several other refinement programs (Sheldrick, 2008 ▶; Adams et al., 2010 ▶). 3. Parameterization 3.1. General parameters REFMAC5 uses the standard parameterization of molecules in terms of atomic coordinates and isotropic/anisotropic atomic displacement parameters. The refinement of these parameters is performed using an FFT formulation for gradients and approximations for second derivatives. Details of these formulations have been published elsewhere (Murshudov et al., 1997 ▶, 1999 ▶; Steiner et al., 2003 ▶). Once the gradients and approximate second derivatives have been calculated for these parameters, they are used to calculate the derivatives of derived parameters. Derived parameters include those for rigid-body and TLS refinement. 3.2. Rigid body Rigid-body parameterization is achieved as follows. For each rigid group, transformation operators are defined and new positions are calculated from the starting positions using the formula where Rj is the rotation matrix, t origin is the centre of mass of the rigid group and tj is the translational component of the transformation. 
The x old are the starting coordinates of the atoms and x new are their positions after application of the transformation operators. There are six parameters per rigid group, defining the rotation matrix and the translational component. At each cycle of refinement, an eigenvalue-filtering technique is used to avoid potential singularities arising from the shape of the rigid groups. It should be noted that no terms between rigid groups are calculated for the approximate second-derivative matrix. For large rigid groups this does not pose much of a problem. However, for many small rigid groups it may slow down convergence substantially. In any case, it is not recommended to divide molecules into very small rigid groups. For these cases, ‘jelly-body’ refinement should produce better results. Once derivatives with respect to the positional parameters have been calculated, those for rigid-body parameters are calculated using the chain rule. The current version of REFMAC5 uses an Euler angle parameterization. 3.3. TLS Atomic displacement parameters describe the spread of atomic positions and can be derived from the Fourier transform of a Gaussian probability distribution function for the atomic centre. The atomic displacement parameters are an important part of the model. Traditionally, a single parameter describing isotropic displacements has been used, namely the B factor. However, it is well known that atomic displacements are likely to be anisotropic owing to directional bonding and at high resolutions the six parameters per atom of a fully anisotropic model can be refined. TLS refinement is a way of modelling anisotropic displacements using only a few parameters, so that the method can be used at medium and low resolutions. The TLS model was originally proposed for small-molecule crystallography (Schomaker & Trueblood, 1968 ▶) and was incorporated into REFMAC5 almost ten years ago (Winn et al., 2001 ▶). 
The idea behind TLS is to suppose that groups of atoms move as rigid bodies and to constrain the anisotropic displacement parameters of these atoms accordingly. The rigid-body motion is described by translation (T), libration (L) and screw (S) tensors, using a total of 20 parameters for each rigid body. Given values for these 20 parameters, anisotropic displacement parameters can be derived for each atom in the group (and this relationship also allows one to calculate derivatives via the chain rule). Usually, an extra isotropic displacement parameter (the residual B factor) is refined for each atom in addition to the TLS contribution. The sum of these two contributions can be output using the supplementary program TLSANL (Howlin et al., 1993 ▶) or optionally directly from REFMAC5. TLS groups need to be chosen before refinement and constitute part of the definition of the model for the macromolecule. Groups of atoms should conform to the idea that they move as a quasi-rigid body. Often the choice of one group per chain suffices (or at least serves as a reference calculation) and this is the default in REFMAC5. More detailed choices can be made using methods such as TLSMD (Painter & Merritt, 2006 ▶). By default, REFMAC5 also includes waters in the first hydration shell, which it seems reasonable to assume move in concert with the protein chain. Fig. 4 ▶ shows the effect of TLS refinement and orientation of libration tensors. In this case, TLS refinement improves R/R free and the derived libration tensors make biological sense. 4. Optimization REFMAC5 uses the Gauss–Newton method for optimization. For an elegant and comprehensive review on optimization techniques, see Nocedal & Wright (1999 ▶). In this method, the exact second derivative is not calculated, but rather approximated to make sure it is always non-negative. 
Once derivatives or approximations have been calculated, the following linear equation is built,

Hp = −G, (32)

where H is the approximate second derivative and G is the gradient vector. The contribution of most of the geometrical terms is calculated using algorithms designed for quadratic optimization or least-squares fitting (Press et al., 1992). To calculate the contribution from the Geman–McClure terms, an approximation is used (Huber & Ronchetti, 2009). This approximation ensures that H stays non-negative and consequently directions calculated as a result of the solution of (32) point towards a reduction of the total function. The contribution of the X-ray term to the gradient is calculated using FFT algorithms (Murshudov et al., 1997). The Fisher information matrix, as described by Steiner et al. (2003), is used to calculate the contribution of the likelihood functions to the matrix H. Tests have demonstrated that using the diagonal elements of the Fisher information matrix and both diagonal and nondiagonal elements of the geometry terms results in a more stable refinement.

Once all of the terms contributing to H and G have been calculated, the linear equation (32) is solved using preconditioned conjugate-gradient methods (Nocedal & Wright, 1999; Tronrud, 1992). A diagonal matrix formed by the diagonal elements of H is used as a preconditioner. This brings parameters with different overall scales (positional and B values) onto the same scale, which makes convergence easier to control. If the conjugate-gradient procedure does not converge in N_maxiter cycles (the default is 1000), then the diagonal terms of the H matrix are increased. Thus, if the matrix is not positive definite then ridge regression is activated. In the presence of a potential (near-) singularity, REFMAC5 uses the following procedure to solve the linear equation.

(i) Define and use preconditioner. At this stage, H and G are modified. Denote the new matrix by H_1 and the new vector by G_1.
(ii) Set γ = 0.
(iii) Define a new matrix: H_2 = H_1 + γI, where I is the identity matrix.
(iv) Solve the equation H_2 p = −G_1 using the conjugate-gradient method for linear equations for sparse and positive-definite matrices (Press et al., 1992). If convergence was achieved in fewer than N_maxiter iterations, then proceed to the next step. Otherwise, increase γ and go to step (iii).
(v) Decondition the matrix, gradient and shift vectors.
(vi) Apply shifts to the atomic parameters, making sure that the ADPs are positive.
(vii) Calculate the value of the total function.
(viii) If the value of the total function is less than the previous value, then proceed to the next step. Otherwise, reduce the shifts and repeat steps (vi)–(viii).
(ix) Finish the refinement cycle.

After application of the shifts, the next cycle of refinement starts.

5. Conclusions

Refinement is an important step in macromolecular crystal structure elucidation. It is used as a final step in structure solution, as well as an intermediate step to improve models and obtain improved electron density to facilitate further model rebuilding. REFMAC5 is one of the refinement programs that incorporate various tools to deal with some crystal peculiarities, low-resolution MX structure refinement and high-resolution refinement. There are also tabulated dictionaries of the constituent blocks of macromolecules, cofactors and ligands. The number of dictionary elements now exceeds 9000. There are also tools to deal with new ligands and covalent modifications of ligands and/or proteins. Low-resolution MX structure analysis is still a challenging task. There are several outstanding problems that need to be dealt with before we can claim that low-resolution MX analysis is complete. Statistics, image processing and computer science provide general methods for these and related problems.
Unfortunately, these techniques cannot be directly applied to MX structure analysis, either because of the huge computer resources needed or because the assumptions used are not applicable to MX. In our opinion, the problems of state-of-the-art MX analysis that need urgent attention include the following.

(i) Reparameterization depending on the quality and the amount of experimental data. Some tools implemented in REFMAC5 partially address this problem. These tools include (a) restraining against known homologous structures, (b) ‘jelly-body’ restraints or refinement along implicit normal modes, (c) long-range ADP restraints based on KL divergence, (d) automatic local and global NCS restraints and (e) experimental phase-information restraints. However, low-resolution refinement and model (re)building is still not as automatic as for high-resolution structures.

(ii) Statistical methods for peculiar crystals with low signal-to-noise ratio. Some of the implemented tools, such as likelihood-based twin refinement and SAD/SIRAS refinement, help in the analysis of some of the data produced by such crystals. The analysis of data from such peculiar crystals as OD disorder with or without twinning, multiple cells, translocational disorder or modulated crystals in general remains problematic.

(iii) Another important problem is that of limited and noisy data. As a result of resolution cutoff (owing to the survival time of the crystal under X-ray irradiation or otherwise), the resultant electron density usually exhibits noise owing to series termination. If the resolution that the crystal actually diffracts to is the same as the resolution of the data, then series termination is not very serious as the signal dies out towards the limit of the resolution. However, in these cases the electron density becomes blurred, reflecting high mobility of the molecules or crystal disorder.
When map sharpening is used, the signal is amplified and series termination becomes a serious problem. To reduce noise, it is necessary to work with the full Fourier transformation. In other words, resolution extension and the prediction of missing reflections become an important problem. The dramatic effect of such an approach for density modification at high resolution has been demonstrated by Altomare et al. (2008) and Sheldrick (2008). The direct replacement of missing reflections by calculated ones necessarily introduces bias towards model errors and may mask real signal. To avoid this, it is necessary to integrate over the errors in the model parameters (coordinates, B values, scale values and twin fractions). However, since the number of parameters is very large (sometimes exceeding 1 000 000), integration using available numerical techniques is not feasible.

(iv) Error estimation. Despite the advances in MX, there have been few attempts to evaluate errors in the estimated parameters (Sheldrick, 2008). To complete MX structure analysis, it is necessary to develop and implement techniques for error estimation. If this is achieved, then incorrect structures could be eliminated while analysing the MX data and building the model. One of the promising approaches to this problem is the Gauss–Markov random field sampling technique (Rue & Held, 2005) using the (approximate) second derivative as a field-defining matrix.

(v) Multicrystal refinement with the simultaneous multicrystal averaging of isomorphous or non-isomorphous crystals is one of the important directions for low-resolution refinement. If it is dealt with properly then the number of structures analysed at low resolution should increase substantially. Further improvement may consist of a combination of various experimental techniques.
For example, the simultaneous treatment of electron-microscopy (EM) and MX data could increase the reliability of EM models and put MX models in the context of larger biological systems.

The direct use of unmerged data is another direction in which refinement procedures could be developed. If this were achieved, then several long-standing problems could become easier to deal with. Two such problems are the following.

(i) In general, the space group of a crystal should be considered as an adjustable parameter. If unmerged data are used, then space-group assumptions could be tested after every few sessions of refinement and model building.
(ii) Dealing with the processes in the crystal during data collection requires unmerged data. One of the best-known such problems is radiation damage.
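As an illustration of the solution procedure of section 4, steps (i)–(ix), the loop can be sketched in a few lines of Python. This is a sketch under stated assumptions, not REFMAC5 code: the function names, the symmetric form of the diagonal preconditioner, the rule for increasing γ (start at 0.1, then double) and the shift-halving rule are all hypothetical choices, dense lists stand in for the sparse matrix, and the check that ADPs stay positive in step (vi) is omitted.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def matvec(A, x):
    return [dot(row, x) for row in A]

def conjugate_gradient(A, b, max_iter, tol=1e-10):
    """Plain CG for A x = b. Returns (x, converged)."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x for x = 0
    d = list(r)                      # first search direction
    rr = dot(r, r)
    b_norm = math.sqrt(rr)
    if b_norm == 0.0:
        return x, True
    for _ in range(max_iter):
        Ad = matvec(A, d)
        curvature = dot(d, Ad)
        if curvature <= 0.0:         # A is not positive definite along d
            return x, False
        alpha = rr / curvature
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * adi for ri, adi in zip(r, Ad)]
        rr_new = dot(r, r)
        if math.sqrt(rr_new) <= tol * b_norm:
            return x, True
        d = [ri + (rr_new / rr) * di for ri, di in zip(r, d)]
        rr = rr_new
    return x, False

def solve_shifts(H, G, n_maxiter=1000, gamma_start=0.1, max_ridge_steps=50):
    """Steps (i)-(v): return (shift vector p, final ridge term gamma)."""
    n = len(G)
    # (i) Diagonal preconditioner (assumed symmetric scaling; needs H[i][i] > 0).
    s = [1.0 / math.sqrt(H[i][i]) for i in range(n)]
    H1 = [[s[i] * H[i][j] * s[j] for j in range(n)] for i in range(n)]
    G1 = [s[i] * G[i] for i in range(n)]
    gamma = 0.0                      # (ii)
    for _ in range(max_ridge_steps):
        # (iii) H2 = H1 + gamma * I
        H2 = [[H1[i][j] + (gamma if i == j else 0.0) for j in range(n)]
              for i in range(n)]
        # (iv) Solve H2 p = -G1; on failure increase gamma and retry.
        p1, converged = conjugate_gradient(H2, [-g for g in G1], n_maxiter)
        if converged:
            break
        gamma = gamma_start if gamma == 0.0 else 2.0 * gamma
    else:
        raise RuntimeError("no convergence even with a ridge term")
    # (v) Decondition the shift vector.
    return [s[i] * p1[i] for i in range(n)], gamma

def apply_shifts(f, x, p, max_halvings=20):
    """Steps (vi)-(ix): accept shifts only if the total function decreases."""
    f_old = f(x)
    scale = 1.0
    for _ in range(max_halvings):
        x_new = [xi + scale * pi for xi, pi in zip(x, p)]
        if f(x_new) < f_old:
            return x_new             # (ix) the cycle is finished
        scale *= 0.5                 # reduce the shifts and try again
    return x

# Toy usage: two parameters on very different scales.
H = [[4.0, 1.0], [1.0, 300.0]]
G = [1.0, -2.0]
p, gamma = solve_shifts(H, G)
print(p, gamma)
```

For a well-conditioned positive-definite H the ridge term stays at zero; for an indefinite matrix the conjugate-gradient solver reports failure and γ grows until the shifted matrix can be solved.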


Author and article information

Journal

Title: Nature

Abbreviated Title: Nature

Publisher: Springer Science and Business Media LLC

ISSN (Print): 0028-0836

ISSN (Electronic): 1476-4687

Publication date Created: October 2017

Publication date (Electronic): October 18 2017

Publication date (Print): October 2017

Volume: 550

Issue: 7677

Pages: 481-486

Article

DOI: 10.1038/nature24451

PMC ID: 6029662

PubMed ID: 29045389

SO-VID: 9b76cc1e-4599-4b98-8f07-b19e1bffbaee

License:

http://www.springer.com/tdm

