research-article

Author(s):
Gordon J. Lockbaum ,
Mina Henes ,
Jeong Min Lee ,
Jennifer Timm ,
Ellen A. Nalivaika ,
Paul R. Thompson ,
Nese Kurt Yilmaz ,
Celia A. Schiffer

Publication date (Electronic):
10 September 2021

Journal:
Biochemistry

Publisher:
American Chemical Society

Rupintrivir targets the 3C cysteine proteases of the Picornaviridae family, which includes rhinoviruses and enteroviruses that cause a range of human diseases. Despite being a pan-3C protease inhibitor, rupintrivir activity is extremely weak against the homologous 3C-like protease of SARS-CoV-2. In this study, the crystal structures of rupintrivir were determined bound to enterovirus 68 (EV68) 3C protease and the 3C-like main protease (M^{pro}) from SARS-CoV-2. While the EV68 3C protease–rupintrivir structure was similar to previously determined complexes with other picornavirus 3C proteases, rupintrivir bound in a unique conformation to the active site of SARS-CoV-2 M^{pro}, splitting the catalytic cysteine and histidine residues. This bifurcation of the catalytic dyad may provide a novel approach for inhibiting cysteine proteases.

Paul D. Adams, Pavel Afonine, Gábor Bunkóczi … (2010)

1. Foundations
1.1. PHENIX architecture
The PHENIX (Adams et al., 2002) architecture is designed from the ground up as a hybrid system of tightly integrated interpreted (‘scripted’) and compiled software modules. A mix of scripted and compiled components is invariably found in all major successful crystallographic packages, but often the scripting is added as an afterthought in an ad hoc fashion using tools that predate the object-oriented programming era. While such ad hoc systems are quickly established, they tend to become a severe maintenance burden as they grow. In addition, users are often forced into many time-consuming routine tasks such as manually converting file formats.
In PHENIX, the scripting layer is the heart of the system. With only a few exceptions, all major functionality is implemented as modules that are exclusively accessed via the scripting interfaces. The object-oriented Python scripting language (Lutz & Ascher, 1999) is used for this purpose. In about two decades, a large developer/user community has produced millions of lines of highly uniform, interoperable, mature and openly available sources covering all aspects of programming ranging from simple file handling to highly sophisticated network communication and fully featured cross-platform graphical interfaces. Embedding crystallographic methods into this environment enables an unprecedented degree of automation, stability and portability. By design, the object-oriented programming model fosters shared collaborative development by multiple groups. It is routine practice to hierarchically recombine modules written by different groups into ever more complex procedures that appear uniform from the outside. A more detailed overview of the key software technology leading to all these advances, presented in the context of crystallography, can be found in Grosse-Kunstleve et al. (2002).
In addition to the advantages outlined in the previous paragraph, the scripting language is generally most efficient for the rapid development of new algorithms. However, runtime performance considerations often dictate that numerically intensive calculations are eventually implemented in a compiled language. The first choice of a compiled language is of course to reuse the same language environment as used for the scripting language itself, which is a C/C++ environment. Not only is this the mainstream software environment on all major platforms used today, but with probably hundreds of millions of lines of C/C++ sources in existence it is an environment that is virtually guaranteed to thrive in the long term. An in-depth discussion of the combined use of Python and C++ can be found in Grosse-Kunstleve et al. (2002) and Abrahams & Grosse-Kunstleve (2003). This model is used throughout the PHENIX system.
1.2. Graphical user interface
A new graphical user interface (GUI) for PHENIX was introduced in version 1.4. It uses the open-source wxPython toolkit, which provides a ‘native’ look on each operating system. Development has focused on providing interfaces around the existing command-line programs with minimal modification, using the same underlying configuration system (libtbx.phil) as used by most PHENIX programs as a template to automatically generate controls. Because these programs are implemented primarily as Python modules, complex data including models, reflections and other viewable data may be exchanged with the GUI without resorting to parsing log files.
The current PHENIX release (version 1.5) includes GUIs for phenix.refine (Afonine et al., 2005), phenix.xtriage (Zwart et al., 2005), the AutoSol (Terwilliger et al., 2009), AutoBuild (Terwilliger, Grosse-Kunstleve, Afonine, Moriarty, Adams et al., 2008) and LigandFit (Terwilliger et al., 2006) wizards, the restraints editor REEL, all of the validation tools and several utilities for creating and manipulating maps and reflection files. More recent builds of PHENIX contain a new GUI for the AutoMR wizard and future releases will include a new interface for Phaser (McCoy et al., 2007).
Intrinsically graphical data is visualized with embedded graphs (using the free matplotlib Python library) or a simple OpenGL viewer. This simplifies the most complex parameters, such as atom selections in phenix.refine, which can be visualized or picked interactively with the built-in viewer.
The GUI also serves as a platform for additional automation and user customization. Similarly to the CCP4 interface (CCP4i; Potterton et al., 2003), PHENIX manages data and task history for separate user-defined projects. Default parameters and input files can be specified for each project; for instance, the generation of ligand restraints from the phenix.refine GUI gives the user the option of automatically loading these restraints in future runs.
The popularity of Python as a scientific programming language has led to its use in many other structural-biology applications, especially molecular-graphics software. The PHENIX GUI includes extension modules for the modeling programs Coot (Emsley & Cowtan, 2004) and PyMOL (DeLano, 2002), both of which are controlled remotely from PHENIX using the XML-RPC protocol. This allows the interfaces to integrate seamlessly; any model or map in PHENIX can be automatically opened in Coot with a single click.
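The kind of XML-RPC remote control described here can be illustrated with Python's standard library alone. The following is a minimal sketch, not the actual Coot or PyMOL extension code: the remote method name `load_model`, the file name and the in-process server standing in for a graphics program are all hypothetical.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

loaded_files = []  # stands in for the viewer's list of open models


def load_model(file_name):
    """Hypothetical remote method: pretend to open a model in the viewer."""
    loaded_files.append(file_name)
    return True


def start_viewer_server():
    # Port 0 asks the operating system for any free port.
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(load_model)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# The "viewer" side exposes its methods over XML-RPC ...
server = start_viewer_server()
host, port = server.server_address

# ... and the controlling program calls them as if they were local functions.
viewer = ServerProxy(f"http://{host}:{port}")
ok = viewer.load_model("refined_001.pdb")  # hypothetical file name

server.shutdown()
server.server_close()
```

Because the call is an ordinary method invocation on a proxy object, a refinement loop could push each new model to the viewer as soon as it is written, which is the behaviour the text describes for iterative programs.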
In programs that iteratively rebuild or refine structures, such as AutoBuild and phenix.refine, the current model and maps will be continually updated in Coot and/or PyMOL as soon as they are available. In the validation utilities, clicking on any atom or residue flagged for poor statistics will recentre the graphics windows on that atom. Remote control of the PHENIX GUI is also simple using the same protocol and simple extensions to the Coot interface provide direct launching of phenix.refine with a model pre-loaded.
2. Analysis of experimental data
PHENIX has a range of tools for the analysis, validation and manipulation of X-ray diffraction data. A comprehensive tool for analyzing X-ray diffraction data is phenix.xtriage (Zwart et al., 2005), which carries out tests ranging from space-group determination and detection of twinning to detection of anomalous signal. These tests provide the user and the various wizards with a set of statistics that characterize a data set. For analysis of twinning, phenix.xtriage consolidates a number of statistics to provide a balanced verdict of possible symmetry and twin-related issues with the data.
Phenix.xtriage provides the user with feedback on the overall characteristics of the data. Routine usage of phenix.xtriage during or immediately after data collection has resulted in the timely discovery of twinning or other issues (Flynn et al., 2007; Kostelecky et al., 2009). Detection of these idiosyncrasies in the data typically reduces the overall effort in a successful structure determination.
A likelihood-based estimation of the overall anisotropic scale factor is performed using the likelihood formalism described by Popov & Bourenkov (2003). Database-derived standard Wilson plots for proteins and nucleic acids are used to detect anomalies in the mean intensity. These anomalies may arise from ice rings or other issues (Morris et al., 2004). Data strength and low-resolution completeness are also analysed.
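As an illustration of the kind of intensity statistic that can feed a twinning verdict, the sketch below simulates acentric intensities and computes the mean of |E^{2} − 1|, where E^{2} is the intensity divided by the mean intensity. It is a toy calculation under stated assumptions (ideal exponentially distributed intensities, a single overall normalization rather than resolution shells, a perfect hemihedral twin) and is not the phenix.xtriage implementation.

```python
import random


def mean_abs_e2_minus_1(intensities):
    """Return <|E^2 - 1|>, normalizing by the overall mean intensity."""
    mean_i = sum(intensities) / len(intensities)
    return sum(abs(i / mean_i - 1.0) for i in intensities) / len(intensities)


rng = random.Random(0)  # fixed seed so the simulation is reproducible
n = 100_000

# Acentric, untwinned intensities follow an exponential distribution.
untwinned = [rng.expovariate(1.0) for _ in range(n)]

# A perfect hemihedral twin averages two independent twin-related intensities.
twinned = [0.5 * (rng.expovariate(1.0) + rng.expovariate(1.0)) for _ in range(n)]

stat_untwinned = mean_abs_e2_minus_1(untwinned)
stat_twinned = mean_abs_e2_minus_1(twinned)
```

For these idealized distributions the statistic has the analytical values 2/e (about 0.736) for untwinned data and 4/e^{2} (about 0.541) for a perfect twin, so a value drifting towards the lower number is one hint that twinning should be investigated.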
The presence of anomalous signal is detected by analysis of the measurability, a quantity expressing the fraction of statistically significant Bijvoet differences in a data set (Zwart, 2005). The native Patterson function is used to detect the presence of pseudo-translational symmetry. A database-derived empirical distribution of maximum peak heights is used to assign significance to detected peaks in the Patterson function.
A comprehensive automated twinning analysis is performed. Twin laws are derived from first principles to facilitate the identification of pseudo-merohedral cases. Amplitude and intensity ratios, 〈|E^{2} − 1|〉 values, the L-statistic (Padilla & Yeates, 2003) and N(Z) plots are derived from data cut to the resolution limit suggested by the data-strength analysis. The removal of shells of data with relatively high noise content greatly improves the automated interpretation of these statistics. A Britton plot, H-test and a likelihood-derived approach are used to estimate twin fractions when twin laws are present. If a model has been supplied, an R versus R (Lebedev et al., 2006) analysis is carried out. This type of analysis is of particular use when dealing with pseudo-symmetry, space-group problems and twinning (Zwart et al., 2008).
To test for inconsistent indexing between different data sets, a set of reindexing laws is derived from first principles given the unit cells and space groups of the sample and reference data sets. A correlation analysis suggests the most likely choice of reindexing of the data.
Analysis of the metric symmetry of the unit cell provides a number of likely point groups. A likelihood-inspired method is used to suggest the most likely point group of the data. Subsequent analysis of systematic absences in a likelihood framework ranks subsequent space-group possibilities (details to be published).
3. Substructure determination, phasing and molecular replacement
After ensuring that the diffraction data are sound and understood, the next critical necessity for solving a structure is the determination of phases using one of several strategies (Adams, Afonine et al., 2009).
3.1. Substructure determination
The substructure-determination procedure implemented as phenix.hyss (Hybrid Substructure Search; Grosse-Kunstleve & Adams, 2003) combines the multi-trial dual-space recycling approaches pioneered by Shake-and-Bake (Miller et al., 1994) and later SHELXD (Sheldrick, 2008) with the use of the fast translation function (Navaza & Vernoslova, 1995; Grosse-Kunstleve & Brunger, 1999). The fast translation function is the basis for a systematic search in the Patterson function (performed in reciprocal space), in contrast to the stochastic alternative of SHELXD (performed in direct space). Phenix.hyss is the only substructure-determination program to fully integrate automatic comparison of the substructures found in multiple trials via a Euclidean Model Matching procedure (part of the cctbx open-source libraries). This allows phenix.hyss to detect if the same solution was found multiple times and to terminate automatically if this is the case. Extensive tests with a variety of SAD data sets (Grosse-Kunstleve & Adams, 2003) have led to a parameterization of the procedure that balances runtime considerations and the likelihood that repeated solutions present the correct substructure. In many cases the procedure finishes in seconds if the substructure is detectable from the input data.
3.2. Phasing
Phaser, available in PHENIX as phenix.phaser, applies the principle of maximum likelihood to solving crystal structures by molecular replacement, by single-wavelength anomalous diffraction (SAD) or by a combination of both.
The likelihood targets take proper account of the effects of different sources of error (and, in the case of SAD phasing, their correlations) and allow different sources of information to be combined. In solving a molecular-replacement problem with a number of different components, the information gained from a partial solution increases the signal in the search for subsequent components. Because the likelihood scores for different models can be directly compared, decisions among models can readily be made as part of automation strategies (discussed below).
3.3. Noncrystallographic symmetry (NCS)
Noncrystallographic symmetry is an important feature of many macromolecular crystals that can be used to greatly improve electron-density maps. PHENIX has tools for the identification of NCS and for using NCS and multiple crystal forms of a macromolecule in phase improvement.
Phenix.find_ncs and phenix.simple_ncs_from_pdb are tools for the identification of noncrystallographic symmetry in a structure using information from a heavy-atom substructure or an atomic model. Phenix.simple_ncs_from_pdb will identify NCS and generate transformations from the chains in a model in a PDB file. Phenix.find_ncs will identify NCS from either a heavy-atom substructure (Terwilliger, 2002a) or the chains in a PDB file and will then compare this NCS with the density in a map to verify that the NCS is actually present.
Phenix.multi_crystal_average is a method for combining information from several crystal forms of a structure. It is especially well suited to cases where each crystal form has its own NCS, adjusting phases for each crystal form so that all the NCS copies in all crystals are as similar as possible.
NCS restraints should normally be applied in density modification and model building in all cases except where there is clear evidence that NCS is not present.
In density modification within PHENIX the presence of NCS is identified from the heavy-atom sites or from an atomic model if available. The local correlation of density in NCS-related locations is then used automatically to set variable restraints on NCS symmetry in the map. In refinement, NCS symmetry is applied through coordinate restraints, targeting the positions of each NCS copy relative to those of the other NCS-related chains. The default NCS restraints in PHENIX are very tight, with targets of 0.05 Å r.m.s. At resolutions lower than about 2.5 Å these tight restraints on NCS should usually be applied. At higher resolutions it may be appropriate to use looser restraints or to remove them altogether. Additionally, if there are segments of the chains that clearly do not obey the NCS relationships they should be excluded from the NCS restraints. Normally this is performed automatically, but it can also be specified explicitly.
4. Model building, ligand fitting and nucleic acids
Key steps in the analysis of a macromolecular crystal structure are building an initial core model, identification and fitting of ligands into the electron-density map and building an atomic model for loop regions that are less well defined than the majority of the structure. PHENIX has tools for rapid model building of secondary structure and main-chain tracing (phenix.find_helices_strands) and for the fitting of flexible ligands (phenix.ligandfit) as well as for fitting a set of ligands to a map (phenix.find_all_ligands) and for the identification of ligands in a map (phenix.ligand_identification). PHENIX additionally has a tool for the fitting of missing loops (phenix.fit_loops). Validation tools are provided so that the models produced can be validated at each step along the way.
4.1. Model building
Phenix.find_helices_strands will rapidly build a secondary-structure-only model into a map or very rapidly trace the polypeptide backbone of a model into a map.
To build secondary structure in a map, phenix.find_helices_strands identifies α-helical regions and β-strand segments, models idealized helices and strands into the corresponding density, allowing for bending of the helices and strands, and assembles these into a composite model. To very rapidly trace the main chain in a map, phenix.find_helices_strands finds points along ridgelines of high density where Cα atoms might be located, identifies pairs and then triplets of these Cα atoms that have density between the atoms and plausible geometry, constructs all possible connections of these Cα atoms into nonamers and then identifies all the longest possible chains that can be made by joining the nonamers. This process can build a Cα model at a rate of about 20 residues per second, yielding a backbone model that can readily be interpreted visually or automatically to evaluate the quality of the map that it is based on.
Phenix.fit_loops will fit missing loops in an atomic model. It uses RESOLVE model building (Terwilliger, 2003a,b,c) to extend the chain from either end where a loop is missing and to connect the chains into a loop with the expected number of residues.
4.2. Ligand fitting
Phenix.ligandfit is a tool for fitting a flexible ligand into an electron-density map (Terwilliger et al., 2006). The key approaches used are breaking the ligand into its component rigid-body parts, finding where each of these can be placed into density, tracing the remainder of the ligand based on the positions of these core rigid-body parts and recombining the best parts of multiple fits while scoring based on the fit to the density.
Phenix.find_all_ligands is a tool for finding all the instances of each of several ligands in an electron-density map. Phenix.find_all_ligands finds the largest contiguous region of unused density in a map and uses phenix.ligandfit to fit each supplied ligand into that density.
It then chooses the ligand that has the highest real-space correlation to the density (Terwilliger, Adams et al., 2007). It then repeats this process until no ligands can be satisfactorily fitted into any remaining density in the map.
Phenix.ligand_identification is a tool for identifying which ligands are compatible with unknown electron density in a map (Terwilliger, Adams et al., 2007). It can search using the 200 most common ligands from the PDB or from a user-supplied list of ligands. Phenix.ligand_identification uses phenix.ligandfit to fit each ligand to the map and identifies the best-fitting ligand using the real-space correlation and surface complementarity of the ligand and the atoms in the structure surrounding the ligand-binding site.
4.3. RNA and DNA
In common with most macromolecular crystallographic tools, PHENIX was originally developed with protein structures primarily in mind. Now that nucleic acids, and especially RNA, are increasingly important in large biological structures, the system is being modified in places where subtle differences in procedure are needed rather than just the relevant libraries. Model building in phenix.autobuild now has a preliminary set of nucleic acid procedures that take advantage of the relatively well determined phosphate and base positions, as well as the preponderance of double helix, and that make use of the RNA backbone conformers recently defined by the RNA Ontology Consortium (Richardson et al., 2008). Nucleic acid structures benefit significantly from torsion-angle refinement, which has recently been added to the options in phenix.refine.
A principal problem in RNA models is getting the ribose pucker correct, although it is known to consist almost entirely of either C3′-endo (which is more common and is the pucker found in the A-form helix) or C2′-endo (Altona & Sundaralingam, 1972).
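Telling the two puckers apart geometrically reduces to a point-to-line distance calculation. The sketch below computes the perpendicular distance from a phosphorus position to the line through the two atoms of a glycosidic bond and classifies the pucker with a cutoff. The function names, the toy coordinates and the 2.9 Å cutoff (with the longer distance taken to indicate C3′-endo) are assumptions made for illustration and are not stated in the text.

```python
import math


def point_to_line_distance(point, line_a, line_b):
    """Perpendicular distance from `point` to the infinite line through a and b."""
    ab = [b - a for a, b in zip(line_a, line_b)]
    ap = [p - a for a, p in zip(line_a, point)]
    # |ab x ap| is the parallelogram area; dividing by |ab| gives the height.
    cross = (
        ab[1] * ap[2] - ab[2] * ap[1],
        ab[2] * ap[0] - ab[0] * ap[2],
        ab[0] * ap[1] - ab[1] * ap[0],
    )
    return math.sqrt(sum(c * c for c in cross)) / math.sqrt(sum(c * c for c in ab))


def classify_pucker(p3_prime, c1_prime, n_base, cutoff=2.9):
    """Classify pucker from the 3' phosphate and the C1'-N1/9 bond atoms.

    The cutoff (in angstroms) is an assumed illustrative value.
    """
    distance = point_to_line_distance(p3_prime, c1_prime, n_base)
    label = "C3'-endo" if distance > cutoff else "C2'-endo"
    return label, distance


# Toy coordinates (angstroms), chosen only to exercise both branches.
long_case = classify_pucker((0.0, 4.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
short_case = classify_pucker((3.0, 2.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
```

A single scalar per residue makes the test cheap enough to run on every nucleotide before choosing pucker-specific restraint targets.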
MolProbity uses the perpendicular distance from the 3′ phosphate to the line of the C1′-N1/9 glycosidic bond as a reliable diagnostic of ribose pucker (Davis et al., 2007; Chen et al., 2010). This same test has now been built into phenix.refine to allow the use of pucker-specific target parameters for bond lengths, angles and torsions (Gelbin et al., 1996) rather than the uneasy compromise values (Parkinson et al., 1996) used in most pucker-agnostic refinement. Currently, if an incorrect pucker is diagnosed it must usually be fixed by user rebuilding, for instance in Coot (Emsley & Cowtan, 2004) or in RNABC (Wang et al., 2008). A rebuilding functionality will probably be incorporated into PHENIX soon, but in the meantime the refinement will now correctly maintain the geometry of a C2′-endo pucker once it has been built and identified using conformation-specific residue names.
4.4. Maps, models and avoiding bias
Phenix.refine (and the graphical tool phenix.create_maps) can produce various types of maps, including anomalous difference maps, maximum-likelihood weighted maps with coefficients (p*mF_{obs} − q*DF_{model})exp(iα_{model}), regular maps with coefficients (p*F_{obs} − q*F_{model})exp(iα_{model}), where p and q are any user-defined numbers, and filled and kick maps. The coefficients m and D of likelihood-weighted maps (Read, 1986) are computed using test-set reflections as described in Lunin & Skovoroda (1995) and Urzhumtsev et al. (1996).
Data incompleteness, especially systematic incompleteness, can cause map distortions (Lunin, 1988; Tronrud, 1997). An approach to remedying this problem is to replace (‘fill’) missing observations with nonzero values. One can use DF_{model} (similarly to REFMAC; Murshudov et al., 1997) to replace the missing F_{obs} or use 〈F_{obs}〉, where the F_{obs} are averaged across a resolution bin around the missing F_{obs} value. Based on a limited number of tests, both ‘filling’ schemes produce similar results, reiterating the importance of phases.
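The 〈F_{obs}〉 filling scheme can be sketched in a few lines. This is a simplified illustration, assuming reflections are given as (d-spacing, F_{obs}) pairs with `None` marking a missing observation and resolution shells defined by a list of d-spacing boundaries; the function name and data layout are hypothetical and do not reflect how phenix.refine stores reflections.

```python
from bisect import bisect_left


def fill_missing_f_obs(reflections, shell_edges):
    """Replace missing F_obs by the mean observed F_obs of the same shell.

    reflections: list of (d_spacing, f_obs or None)
    shell_edges: ascending d-spacing boundaries (angstroms) between shells
    Returns a list of (d_spacing, f_value, was_filled).
    """
    sums, counts = {}, {}
    for d, f in reflections:
        if f is not None:
            shell = bisect_left(shell_edges, d)
            sums[shell] = sums.get(shell, 0.0) + f
            counts[shell] = counts.get(shell, 0) + 1

    filled = []
    for d, f in reflections:
        shell = bisect_left(shell_edges, d)
        if f is None and shell in counts:
            filled.append((d, sums[shell] / counts[shell], True))
        else:
            # Observed values pass through; a shell with no data stays unfilled.
            filled.append((d, f, False))
    return filled


# Toy data: two shells split at 3.0 angstroms, one missing value in each.
data = [(4.0, 100.0), (3.5, None), (3.2, 80.0), (2.1, 30.0), (2.05, None), (2.0, 20.0)]
result = fill_missing_f_obs(data, shell_edges=[3.0])
```

Keeping the `was_filled` flag alongside each value makes it straightforward to compute the filled and unfilled maps side by side.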
However, it is important to keep in mind that by replacing missing F_{obs} there is a risk of introducing bias, and the more incomplete the data are, the larger the risk. At present it is advisable to use both maps simultaneously: filled and not filled.
An average kick map (AK map; Gunčar et al., 2000; Turk, 2007; Pražnikar et al., 2009) is the result of the following procedure. A large ensemble of structures is created where the coordinates of each structure from the ensemble are all randomly shaken. A map is then computed for each structure. Finally, all maps are averaged to generate one AK map. An AK map is expected to have less bias and less noise and to enhance the existing signal and can potentially clarify some initially bad densities.
A computationally intensive but powerful method of creating a very low-bias map is to carry out iterative model building and refinement while omitting one region of the map from all calculations of structure factors (Terwilliger, Grosse-Kunstleve, Afonine, Moriarty, Adams et al., 2008). The phenix.autobuild iterative-build OMIT map procedure carries this out automatically for either a single OMIT region or for overlapping OMIT regions to create a composite iterative-build OMIT map.
5. Model, and model-to-data, validation
The result of crystallographic structure determination is the atomic model. There are three principal components in assessing model quality: the covalent model geometry, the model stereochemistry and the quality of fit between the model and experimental data in both real space and in reciprocal space. All three provide overall measures, and the first two plus the real-space aspect of the third also provide checks for local outliers, which give the best leverage for user intervention to actively improve model accuracy (Arendall et al., 2005). (Validation of the experimental data was described in §2 above.)
PHENIX includes many individual tools for specific aspects of validation, plus several systems that combine those results into overall summaries. Validation is provided both for user evaluation of the progress and results of a structure solution and also to help inform the automated choices made by other parts of the system.
Most aspects of the MolProbity model-validation tools (Davis et al., 2007; Chen et al., 2010) have been adapted or rewritten for integrated use within PHENIX and are presented to the user by the new GUI (§1.2). H atoms are added by phenix.reduce, with optimization of entire local hydrogen-bond networks, consideration of the first layer of crystallographic waters and optional correction of side-chain amide or histidine 180° ‘flips’ (Word, Lovell, Richardson et al., 1999). All-atom contacts (Word, Lovell, LaBean et al., 1999) are calculated by phenix.probe, which provides the atomic overlap information needed for the validation of serious all-atom steric clashes and can also be visualized in Coot. For the PHENIX GUI, the set of MolProbity-based tools provides both overall model statistics, such as clashscore and percentage of outliers, and detailed lists of the Ramachandran (Lovell et al., 2003), rotamer (Lovell et al., 2000), Cβ deviation (Lovell et al., 2003) and clash outliers. Command-line tools are available for these validation methods: phenix.rotalyze, phenix.ramalyze, phenix.cbetadev, phenix.clashscore, phenix.reduce and phenix.probe.
Additionally, phenix.validate_model, which analyzes the deviations of bond lengths, bond angles, planarity etc. from ideal library values, complements the MolProbity torsional and atomic clash tools. Phenix.real_space_correlation assesses the local model-to-data correspondence by providing a quantitative measure of how the atomic model fits the electron-density map at the residue or atom level (depending on the resolution).
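A common way to quantify such a local fit is a linear (Pearson) correlation coefficient between observed and model-calculated density sampled at the same grid points around a residue or atom. The sketch below shows that calculation on toy values. It is an assumption that this matches the measure used by phenix.real_space_correlation; how grid points are selected around each residue is not described in the text and is left out.

```python
import math


def real_space_correlation(rho_obs, rho_calc):
    """Pearson correlation between two equally sampled density value lists."""
    if len(rho_obs) != len(rho_calc) or not rho_obs:
        raise ValueError("density samples must be non-empty and equal in length")
    n = len(rho_obs)
    mean_o = sum(rho_obs) / n
    mean_c = sum(rho_calc) / n
    cov = sum((o - mean_o) * (c - mean_c) for o, c in zip(rho_obs, rho_calc))
    var_o = sum((o - mean_o) ** 2 for o in rho_obs)
    var_c = sum((c - mean_c) ** 2 for c in rho_calc)
    if var_o == 0.0 or var_c == 0.0:
        return 0.0  # flat density carries no shape information
    return cov / math.sqrt(var_o * var_c)


# Toy samples: the second list is the first scaled by 2, so the shapes agree.
cc_good = real_space_correlation([0.1, 0.5, 0.9, 0.4], [0.2, 1.0, 1.8, 0.8])
# Reversed ordering gives density peaks where the model has troughs.
cc_bad = real_space_correlation([0.1, 0.5, 0.9], [0.9, 0.5, 0.1])
```

Because the coefficient is insensitive to overall scale and offset, it compares the shape of the density rather than its absolute level, which is why it is convenient for flagging poorly fitting residues.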
Rapidly obtaining a snapshot of global figures of merit for a crystallographic model and associated experimental data is a frequent task that is performed at all stages of structure solution. This task can be complicated for several reasons: the presence of novel ligands or nonstandard residues in the PDB-format (Berman et al., 2000) coordinate file, data collected from twinned crystals, various reflection datafile formats, different representation of atomic displacement parameters in the presence of TLS (Schomaker & Trueblood, 1968), experimental data type (X-ray and/or neutron), files with multiple models and various formatting issues. Phenix.model_vs_data is designed to automatically handle all these complications with minimal user input (a PDB file and a reflection data file) and provide a concise summary output.
Phenix.polygon (Urzhumtseva et al., 2009) is a graphical tool that is designed to indicate the similarity of validation parameters, such as free R value, for a particular structure compared with those deposited in the PDB. This comparison is performed for all other structures solved at similar resolution limits. The result is presented graphically.
Phenix.validation combines all of the tools described above in one GUI, providing a single place for assessing the results of structure determination.
5.1. Model and structure-factor manipulation and analysis
PHENIX has a range of tools for displaying, analyzing and manipulating structure-factor and model information. Phenix.mtz.dump and phenix.cif_as_mtz display and convert structure-factor data. Phenix.print_sequence, phenix.pdb_atom_selection and phenix.pdbtools display and manipulate coordinate files.
Phenix.tls is a tool for the extraction and manipulation of TLS information. Using this tool, TLS matrices and selections can be extracted from REFMAC- or PHENIX-formatted PDB file headers and the total or residual atomic B factors can be computed and output.
Future functionality will include the complete analysis of TLS matrices and their graphical visualization.
Phenix.get_cc_mtz_mtz and phenix.get_cc_mtz_pdb are tools for analyzing the agreement between maps based on a pair of MTZ files or between maps calculated from an MTZ file and a PDB file. The key attributes of these tools are that they automatically search all allowed origin shifts that might relate the two maps and that they write out a modified version of one of the MTZ files or of the PDB file, shifted to match the other.
6. Structure refinement
Phenix.refine is the state-of-the-art crystallographic structure-refinement engine of PHENIX. The foundational refinement machinery is a combination of highly efficient programming tools and new or rethought crystallographic algorithms. Phenix.refine possesses an extensive set of tools that cover the majority of refinement scenarios at any data resolution from low to ultrahigh.
Various reflection-data formats (for example, CNS, MTZ and SHELX) are recognized automatically. The input experimental data are checked for outliers (Read, 1999; Zwart et al., 2005) and any reflections identified as such are excluded from the refinement calculations. Twinning can also be taken into account by providing a twin-law operator, which can be obtained using phenix.xtriage. X-ray and/or neutron diffraction data can be used and an option for joint XN refinement is available (simultaneous refinement against X-ray and neutron data; Adams, Mustyakimov et al., 2009). Each refinement run begins with robust mask-based bulk-solvent correction and anisotropic scaling (Afonine et al., 2005).
Tools such as efficient rigid-body refinement (multiple-zones algorithm; Afonine et al., 2009), simulated-annealing refinement (Brünger et al., 1987) in Cartesian or torsion-angle space (Grosse-Kunstleve et al., 2009), automatic NCS detection and its use as restraints in refinement are important at low resolution and in the initial stages of refinement.
A broad range of atomic displacement parameterizations are available, including grouped isotropic, constrained anisotropic (TLS) and individual atomic isotropic or anisotropic, allowing efficient modelling of atomic displacement parameters at any resolution. Occupancy refinement (grouped, individual, group constrained for alternative conformations or any mixture) can be performed for any user-defined atoms. Atoms in alternative conformations are recognized automatically based on altLoc identifiers in the input PDB file and their occupancies are refined by default. Ordered solvent (water) model updating is integrated into the refinement process.
The availability of ultrahigh-resolution data makes it possible to visualize the residual density arising from bonding effects; phenix.refine employs a novel interatomic scatterers model (Afonine et al., 2007) to adequately account for these features. A flexible parameterization of H atoms allows their use at any resolution from subatomic (where their parameters can be refined individually) to low resolution (where a riding model is used).
Refinement can be performed using a variety of refinement target functions, including maximum likelihood, maximum likelihood with experimental phase information and amplitude least squares. The refinement of coordinates can be performed in real or reciprocal space (allowing dual-space refinement). Novel ligands can easily be included in refinement by providing a corresponding CIF file as input (the CIF file can be automatically created using phenix.ready_set).
Manual fixing of amino-acid side-chain rotamers can be time-consuming, especially for large structures. Although the use of simulated-annealing refinement increases the convergence radius, it can still fail to fit incorrectly modelled side chains into the correct density. Phenix.refine has an option for automatic selection of the best rotamer based on a rotamer library (Lovell et al., 2000 ▶) and optimal fit into the density (details to be published elsewhere). Furthermore, coupling real-space refinement with the built-in rotamer library and available MolProbity tools allows the automated identification and robust correction of common systematic errors involving backward-fit conformations for Leu, Thr, Val, Ile and Arg side chains, as developed and tested in the Autofix method (Headd et al., 2009 ▶). Phenix.refine allows multi-step complex refinement protocols in which most of the available refinement strategies can be combined with each other and applied to any selected part of the model. For example, a run of phenix.refine may perform rigid-body refinement, simulated annealing, individual and grouped B factors combined with TLS refinement, constrained occupancy refinement and automatic water picking. The output of phenix.refine includes various maps (maximum-likelihood weighted, kicked, incompleteness corrected, anomalous difference and those with any user-defined coefficients), complete model and data statistics and PDB file with a formatted REMARK 3 header ready for PDB deposition. The phenix.refine GUI is integrated with Coot and PyMOL, allowing seamless visual analysis of the refined model and associated maps. Phenix.refine is tightly integrated with other PHENIX components, making structure solution, building and refinement a one-step process (for example, in the AutoMR and AutoBuild wizards). It is routinely tested by automatic re-refinement of all models in the PDB for which the experimental data are available. 6.1. 
Ligand-coordinate and restraint-geometry generation The electronic Ligand Builder and Optimization Workbench (eLBOW; Moriarty et al., 2009 ▶) is a suite of tools designed for the reliable generation of Cartesian coordinates and geometry restraints for both novel and known ligands. In line with the rest of the PHENIX package, the eLBOW modules are written in Python, with the numerically intensive portions of the code written in C++. eLBOW is a flexible platform for converting a majority of common chemical inputs to optimized three-dimensional coordinates and geometry restraints for refinement. Ligand geometries can be minimized using the semi-empirical AM1 quantum-chemical method (Stewart, 2004 ▶), a numerically efficient and chemically accurate technique for the class of molecules commonly complexed with or bound to proteins. In addition, a graphical user interface for editing geometry restraints and simple geometry manipulation of ligands has been developed. The Restraints Editor, Especially Ligands (REEL) removes the tedium of manually editing a restraints file by providing a number of commonly performed actions via pull-down menus and other interactive features. The effect of changes in the restraints can be immediately reflected in the molecule view to provide user feedback. A tool that uses many of the features of eLBOW to quickly and easily prepare a protein model for refinement is known as ReadySet! The flexibility of the Python interface is exemplified by the use of Reduce, eLBOW and several smaller portions of the cctbx toolkit to add H and/or D atoms to the model, ligands and water and to generate metal-coordination files and geometry restraints for unknown ligands. The files required for covalently bound ligands are also generated. 7. Integrated structure determination 7.1. Why automation? 
Automation has dramatically changed macromolecular crystallography over the past decade, both by greatly speeding up the process of structure solution, model building and refinement and by bringing the tools for structure determination to a much wider group of scientists. As automation becomes increasingly comprehensive, it will allow users to test many more possibilities for structure determination, will allow improved estimation of uncertainties in the final structures and will allow the determination of ever more complex and difficult structures. The PHENIX environment has been developed with automation as a key and defining feature. Each tool within PHENIX can seamlessly and nearly effortlessly be incorporated as part of any other tool or process in PHENIX. This means that very complex tasks can be built up from well tested and characterized tools and that tools and higher-level methods can be re-used in many different contexts. With a full automatic regression testing system as an integral part of the PHENIX environment, all these tasks and high-level methods are tested daily to ensure the integrity of the entire PHENIX system. 7.2. Automated structure solution PHENIX has fully integrated structure-solution capability for both experimental phasing (MAD, SAD, MIR and combinations of these), carried out by phenix.autosol, and for molecular replacement, performed by phenix.automr. Each of these automated procedures feeds directly into the iterative model building, density modification and refinement of phenix.autobuild. Phenix.autosol is designed to allow complete automation of experimental phasing while allowing a high degree of flexibility for advanced users. 
Beginning with structure-factor amplitudes and the sequence of the macromolecule, phenix.autosol uses phenix.solve (Terwilliger & Berendzen, 1999 ▶) to scale all data sets, phenix.xtriage (Zwart et al., 2005 ▶) to analyze the data for twinning and to correct any anisotropy in the data and phenix.hyss (Grosse-Kunstleve & Adams, 2003 ▶) to find potential heavy-atom or anomalously scattering atoms. Phenix.autosol carries out experimental phasing with phenix.phaser (McCoy et al., 2004 ▶, 2007 ▶) or phenix.solve (Terwilliger & Berendzen, 1999 ▶), density modification with phenix.resolve (Terwilliger, 1999 ▶) and preliminary model building using the methods in phenix.autobuild (Terwilliger, Grosse-Kunstleve, Afonine, Moriarty, Zwart et al., 2008 ▶). A key step in automated structure solution is the identification of which of several possible space-group and heavy-atom or anomalously scattering-atom substructures is correct. Phenix.autosol uses a Bayesian scoring algorithm based on analysis of the experimental electron-density maps to identify which substructures lead to the best maps (Terwilliger et al., 2009 ▶). The main features of the maps that are used in this evaluation are the skewness of the electron density (non-Gaussian histogram of density with more density in the positive tail than the negative tail) and the correlation of local r.m.s. density (large contiguous regions of high variation where the molecule is located and separate large contiguous regions of low variation where the solvent is located). Phenix.autosol is highly flexible, allowing any combination of experimental data, such as MAD + SIRAS or several SAD data sets. Although it is fully automated, the user can control nearly all aspects of the operation of the procedure, including the scoring criteria and decisions about how certain phenix.autosol should be that the correct solution is contained in the current lists of solutions. 
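The skewness criterion mentioned above can be illustrated with a minimal sketch. It assumes the map is available as a flat list of density values; the function name `skewness` is hypothetical and is not part of phenix.autosol, which combines several map features in a Bayesian score.

```python
def skewness(values):
    """Third standardized moment of a list of density values."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    if variance == 0.0:
        return 0.0  # a perfectly flat map has no skew
    third_moment = sum((v - mean) ** 3 for v in values) / n
    return third_moment / variance ** 1.5


# Hypothetical toy "maps": a symmetric histogram and one with a positive tail.
symmetric_map = [-1.0, 0.0, 1.0]
peaked_map = [0.0, 0.0, 0.0, 0.0, 10.0]
print(skewness(symmetric_map), skewness(peaked_map))
```

A map with more density in the positive tail than the negative tail gives a positive value, which is the behaviour the scoring relies on.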
Phenix.autosol can carry out phasing using a combination of experimental SAD data and molecular-replacement information. If a molecular-replacement model is available, phenix.autosol will use phenix.phaser (McCoy et al., 2004 ▶, 2007 ▶) to complete the anomalous substructure iteratively by constructing log-likelihood gradient maps for the anomalous scatterers based on the model of the non-anomalous structure and any anomalous scatterers that have already been found. The anomalous substructure is then used along with the model to calculate phases with phenix.phaser. Phenix.automr carries out automated likelihood-based molecular replacement using phenix.phaser (Read, 2001 ▶; McCoy et al., 2005 ▶, 2007 ▶; McCoy, 2007 ▶). The procedure is highly automated, allowing several copies of each of several components to be placed in a single run, which can also test different possible choices of space group. If there are alternative choices of model for a component, the molecular-replacement calculation can try each of them in turn or combine them as a statistically weighted ensemble. Although the evaluation of the likelihood targets is slow (Read, 2001 ▶), the use of fast approximations for the rotation search (Storoni et al., 2004 ▶) and the translation search (McCoy et al., 2005 ▶) gives run times that are competitive with traditional Patterson-based methods. Likelihood has been demonstrated to be more sensitive to the correct solution, particularly in difficult cases (Read, 2001 ▶). When there are several copies or several components to place, the ability of the likelihood functions to take advantage of preliminary partial solutions can provide a crucial increase in the signal. 7.3. Iterative model building, density modification and refinement Phenix.autobuild is a highly integrated and automated procedure for model building and model improvement through iterative model building, density modification and refinement. 
Phenix.autobuild uses phenix.resolve (Terwilliger, 2003a ▶,b ▶) to carry out model building, model extension, model assembly, loop fitting and building outside existing models. It further uses phenix.resolve to improve electron-density maps with statistical density modification, including information from the newly built models as well as that obtained from experiment (e.g. phenix.autosol), from NCS (Terwilliger, 2002b ▶) and from other expected features of electron-density maps such as a flat solvent (Wang, 1985 ▶), the presence of secondary-structural features (Terwilliger, 2001 ▶) and the presence of local patterns of density characteristic of macromolecules (Terwilliger, 2003c ▶). To reduce model bias in the procedure, prime-and-switch phasing can also be used (Terwilliger, 2004 ▶). Phenix.autobuild uses phenix.refine (Afonine et al., 2005 ▶) throughout this process to improve the quality of the models that are built. Phenix.autobuild provides two complementary approaches to model building. For cases in which no model or only a preliminary model has been built, phenix.autobuild will construct a new model considering the main chain of any supplied models as potential coordinates. In cases where a nearly final model is available, phenix.autobuild can apply a rebuild-in-place approach in which the polypeptide chain is rebuilt a few residues at a time without changing the register or the overall features of the model. The rebuild-in-place approach in phenix.autobuild provides a powerful method for the assessment of uncertainties in an atomic model by repetitive rebuilding of the model using different random seeds for each iteration (Terwilliger, Grosse-Kunstleve et al., 2007 ▶). The variability in the coordinates of each atom in the ensemble that is created is a lower bound on the uncertainty of the position of that atom. 8. Conclusions Advances in computational methods and algorithms have made it possible to automate the solution of many structures with PHENIX. 
However, many challenges still exist. In particular, the development of automated methods that can be applied at low resolution (worse than 3.0 Å) remains a priority. In this resolution range there are typically too few experimental data to uniquely define the macromolecular structure for automated ab initio model building. Thus, methods are required that rely on prior knowledge from existing macromolecular structures to permit productive automated data interpretation. These methods will need to be developed and applied for all stages of structure solution and tightly integrated to maximize the information extracted from the experimental data.
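The ensemble-based uncertainty estimate described in §7.3 (repeated rebuild-in-place runs with different random seeds) can be sketched as a per-atom spread calculation. This is a minimal illustration, assuming every model in the ensemble lists the same atoms in the same order; `per_atom_rms_spread` is a hypothetical helper, not a PHENIX function.

```python
import math


def per_atom_rms_spread(ensemble):
    """R.m.s. deviation of each atom from its mean position over the ensemble.

    ensemble: list of models, each a list of (x, y, z) tuples in the same atom order.
    """
    n_models = len(ensemble)
    n_atoms = len(ensemble[0])
    spreads = []
    for i in range(n_atoms):
        coords = [model[i] for model in ensemble]
        mean = [sum(c[k] for c in coords) / n_models for k in range(3)]
        msd = sum(
            sum((c[k] - mean[k]) ** 2 for k in range(3)) for c in coords
        ) / n_models
        spreads.append(math.sqrt(msd))
    return spreads


# Two hypothetical rebuilt models of a two-atom fragment.
ensemble = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
            [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]]
print(per_atom_rms_spread(ensemble))
```

As the text notes, the spread obtained this way should be read as a lower bound on the positional uncertainty, not as the uncertainty itself.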

- Record: found
- Abstract: found
- Article: not found

H M Berman, J Westbrook, Z. Feng … (2000)

The Protein Data Bank (PDB; http://www.rcsb.org/pdb/) is the single worldwide archive of structural data of biological macromolecules. This paper describes the goals of the PDB, the systems in place for data deposition and access, how to obtain further information, and near-term plans for the future development of the resource.

- Record: found
- Abstract: found
- Article: not found

1. Introduction Improved crystallographic methods rely on both improved automation and improved algorithms. The software handling one part of structure solution must be automatically linked to software handling parts upstream and downstream of it in the structure solution pathway with (ideally) no user input, and the algorithms implemented in the software must be of high quality, so that the branching or termination of the structure solution pathway is minimized or eliminated. Automation allows all the choices in structure solution to be explored where the patience and job-tracking abilities of users would be exhausted, while good algorithms give solutions for poorer models, poorer data or unfavourable crystal symmetry. Both forms of improvement are essential for the success of high-throughput structural genomics (Burley et al., 1999 ▶). Macromolecular phasing by either of the two main methods, molecular replacement (MR) and experimental phasing, which includes the technique of single-wavelength anomalous dispersion (SAD), are key parts of the structure solution pathway that have potential for improvement in both automation and the underlying algorithms. MR and SAD are good phasing methods for the development of structure solution pipelines because they only involve the collection of a single data set from a single crystal and have the advantage of minimizing the effects of radiation damage. Phaser aims to facilitate automation of these methods through ease of scripting, and to facilitate the development of improved algorithms for these methods through the use of maximum likelihood and multivariate statistics. Other software shares some of these features. For molecular replacement, AMoRe (Navaza, 1994 ▶) and MOLREP (Vagin & Teplyakov, 1997 ▶) both implement automation strategies, though they lack likelihood-based scoring functions. Likelihood-based experimental phasing can be carried out using Sharp (La Fortelle & Bricogne, 1997 ▶). 2. 
Algorithms The novel algorithms in Phaser are based on maximum likelihood probability theory and multivariate statistics rather than the traditional least-squares and Patterson methods. Phaser has novel maximum likelihood phasing algorithms for the rotation functions and translation functions in MR and the SAD function in experimental phasing, but also implements other non-likelihood algorithms that are critical to success in certain cases. Summaries of the algorithms implemented in Phaser are given below. For completeness and for consistency of notation, some equations given elsewhere are repeated here. 2.1. Maximum likelihood Maximum likelihood is a branch of statistical inference that asserts that the best model on the evidence of the data is the one that explains what has in fact been observed with the highest probability (Fisher, 1922 ▶). The model is a set of parameters, including the variances describing the error estimates for the parameters. The introduction of maximum likelihood estimators into the methods of refinement, experimental phasing and, with Phaser, MR has substantially increased success rates for structure solution over the methods that they replaced. A set of thought experiments with dice (McCoy, 2004 ▶) demonstrates that likelihood agrees with our intuition and illustrates the key concepts required for understanding likelihood as it is applied to crystallography. The likelihood of the model given the data is defined as the probability of the data given the model. Where the data have independent probability distributions, the joint probability of the data given the model is the product of the individual distributions. In crystallography, the data are the individual reflection intensities. These are not strictly independent, and indeed the statistical relationships resulting from positivity and atomicity underlie direct methods for small-molecule structures (reviewed by Giacovazzo, 1998 ▶). 
For macromolecular structures, these direct-methods relationships are weaker than effects exploited by density modification methods (reviewed by Kleywegt & Read, 1997 ▶); the presence of solvent means that the molecular transform is over-sampled, and if there is noncrystallographic symmetry then other correlations are also present. However, the assumption of independence is necessary to make the problem tractable and works well in practice. To avoid the numerical problems of working with the product of potentially hundreds of thousands of small probabilities (one for each reflection), the log of the likelihood is used. This has a maximum at the same set of parameters as the original function. Maximum likelihood also has the property that if the data are mathematically transformed to another function of the parameters, then the likelihood optimum will occur at the same set of parameters as the untransformed data. Hence, it is possible to work with either the structure-factor intensities or the structure-factor amplitudes. In the maximum likelihood functions in Phaser, the structure-factor amplitudes (Fs), or normalized structure-factor amplitudes (Es, which are Fs normalized so that the mean-square values are 1) are used. The crystallographic phase problem means that the phase of the structure factor is not measured in the experiment. However, it is easiest to derive the probability distributions in terms of the phased structure factors and then to eliminate the unknown phase by integration, a process known as integrating out a nuisance variable (the nuisance variable being the introduced phase of the observed structure factor, or equivalently the phase difference between the observed structure factor and its expected value). 
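The numerical reason for working with the log of the likelihood, noted above, can be shown with a short sketch. The per-reflection probabilities here are made-up constants chosen only to trigger underflow; real values would come from the likelihood function.

```python
import math

# Hypothetical per-reflection probabilities (one for each of 200 000 reflections).
probabilities = [1e-3] * 200_000

# The naive product underflows to exactly 0.0 in double precision.
product = 1.0
for p in probabilities:
    product *= p

# The sum of logs stays finite and peaks at the same parameters as the product.
log_likelihood = sum(math.log(p) for p in probabilities)

print(product, log_likelihood)
```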
The central limit theorem applies to structure factors, which are sums of many small atomic contributions, so the probability distribution for an acentric reflection, F O, given the expected value of F O (〈F O〉) is a two-dimensional Gaussian with variance Σ centred on 〈F O〉. (Note that here and in the following, bold font is used to represent complex or signed structure factors, and italics to represent their amplitudes.) In applications to molecular replacement and structure refinement, 〈F O〉 is the structure factor calculated from the model (F C) multiplied by a fraction D (where 0 R, H = 0. The atoms are taken to be of equal mass. The eigenvalues λ and eigenvectors U of H can then be calculated. The eigenvalues are directly proportional to the squares of the vibrational frequencies of the normal modes, the lowest eigenvalues thus giving the lowest-frequency normal modes. Six of the eigenvalues will be zero, corresponding to the six degrees of freedom for a rotation and translation of the entire structure. For all but the smallest proteins, eigenvalue decomposition of the all-atom Hessian is not computationally feasible with current computer technology. Various methods have been developed to reduce the size of the eigenvalue problem. Bahar et al. (1997 ▶) and Hinsen (1998 ▶) have shown that it is possible to find the lowest frequency normal modes of proteins in the elastic network model by considering amino acid Cα atoms only. However, this merely postpones the computational problem until the proteins are an order of magnitude larger. The problem is solved for any size protein with the rotation–translation block (RTB) approach (Durand et al., 1994 ▶; Tama et al., 2000 ▶), where the protein is divided into blocks of atoms and the rotation and translation modes for each block are used to project the full Hessian into a lower dimension. The projection matrix is a block-diagonal matrix of dimensions 3N × 3N. 
Each of the NB block matrices P nb has dimensions 3N nb × 6, where N nb is the number of atoms in the block nb, For atom j in block nb displaced from the centre of mass, of the block, the 3 × 6 matrix P nb,j is The first three columns of the matrix contain the infinitesimal translation eigenvectors of the block and last three columns contain the infinitesimal rotation eigenvectors of the block. The orthogonal basis Q of P nb is then found by QR decomposition: where Q nb is a 3N nb × 6 orthogonal matrix and R nb is a 6 × 6 upper triangle matrix. H can then be projected into the subspace spanned by the translation/rotation basis vectors of the blocks: where The eigenvalues λP and eigenvectors U P of the projected Hessian are then found. The RTB method is able to restrict the size of the eigenvalue problem for any size of protein with the inclusion of an appropriately large N nb for each block. In the implementation of the RTB method in Phaser, N nb for each block is set for each protein such that the total size of the eigenvalue problem is restricted to a matrix H P of maximum dimensions 750 × 750. This enables the eigenvalue problem to be solved in a matter of minutes with current computing technology. The eigenvectors of the translation/rotation subspace can then be expanded back to the atomic space (dimensions of U are N × N): As for the decomposition of the full Hessian H, the eigenvalues are directly proportional to the squares of the vibrational frequencies of the normal modes, the lowest eigenvalues thus giving the lowest normal modes. Although the eigenvalues and eigenvectors generated from decomposition of the full Hessian and using the RTB approach will diverge with increasing frequency, the RTB approach is able to model with good accuracy the lowest frequency normal modes, which are the modes of interest for looking at conformational difference in proteins. The all-atom, Cα only and RTB normal-mode analysis methods are implemented in Phaser. 
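The steps of the RTB projection described above can be written out as a sketch. This is a reconstruction under stated assumptions, not the paper's own equations: atoms are taken to have equal mass, (x_j, y_j, z_j) denotes the displacement of atom j from the centre of mass of its block, and the sign and normalization conventions of the rotation columns are assumed (any constant factor is absorbed by the QR decomposition). Q denotes the block-diagonal matrix assembled from the Q_nb, so that H_P has dimensions 6N_B × 6N_B.

```latex
% Assumed conventions: equal masses; (x_j, y_j, z_j) is the displacement of
% atom j from the block centre of mass; rotation columns are e_x x r, e_y x r, e_z x r.
\[
P_{nb,j} =
\begin{pmatrix}
1 & 0 & 0 & 0    & z_j  & -y_j \\
0 & 1 & 0 & -z_j & 0    & x_j  \\
0 & 0 & 1 & y_j  & -x_j & 0
\end{pmatrix},
\qquad
P_{nb} = Q_{nb} R_{nb},
\]
\[
H_P = Q^{\mathsf{T}} H \, Q,
\qquad
H_P U_P = U_P \Lambda_P,
\qquad
U = Q \, U_P .
\]
```

The first three columns of P_{nb,j} are the unit translations and the last three are the infinitesimal rotations about the x, y and z axes, matching the column description in the text.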
After normal-mode analysis, n normal modes can be used to generate 2^n − 1 (nonzero) combinations of normal modes. Phaser allows the user to specify the r.m.s. deviation between model and target desired by the perturbation, and the fraction dq of the displacement vector for each mode combination corresponding to each model combination is then used to generate the models. Large r.m.s. deviations will cause the geometry of the model to become distorted. Phaser reports when the model becomes so distorted that there are Cα clashes in the structure. 2.4. Packing function The packing of potential solutions in the asymmetric unit is not inherently part of the translation function. It is therefore possible that an arrangement of models has a high log-likelihood gain, although the models may overlap and therefore be physically unreasonable. The packing of the solutions is checked using a clash test using a subset of the atoms in the structure: the ‘trace’ atoms. For proteins, the trace atoms are the Cα positions, spaced at 3.8 Å. For nucleic acid, the phosphate and C atoms in the ribose-phosphate backbone and the N atoms of the bases are selected as trace atoms. These atoms are also spaced at about 3.8 Å, so that the density of trace atoms in nucleic acid is similar to that of proteins, which makes the number of protein–protein, protein–nucleic acid and nucleic acid–nucleic acid clashes comparable where there is a mixed protein–nucleic acid structure. For the clash test, the number of trace atoms from another model within a given distance (default 3 Å) is counted. The clash test includes symmetry-related copies of the model under consideration, other components in the asymmetric unit and their symmetry-related copies. If the search model has a low sequence identity with the target, or has large flexible loops that could adopt an alternative conformation, the number of clashes may be expected to be nonzero. 
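The clash test on trace atoms can be sketched as a simple distance count. This is a minimal illustration, not Phaser's implementation: `count_clashes` is a hypothetical name, the counting convention (atoms of one trace that lie within the cut-off of any atom of the other) is an assumption, and symmetry-related copies are not generated here.

```python
import math


def count_clashes(trace_a, trace_b, cutoff=3.0):
    """Count atoms of trace_a lying within `cutoff` Å of any atom in trace_b."""
    clashes = 0
    for a in trace_a:
        if any(math.dist(a, b) < cutoff for b in trace_b):
            clashes += 1
    return clashes


# Hypothetical Cα traces (coordinates in Å); consecutive atoms are 3.8 Å apart.
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
neighbour = [(0.0, 2.0, 0.0), (20.0, 0.0, 0.0)]
print(count_clashes(model, neighbour))
```

In a full packing check the second trace would include symmetry-related copies of the model and the other components of the asymmetric unit, as the text describes.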
By default the best packing solutions are carried forward, although a specific number of allowed clashes may also be given as the cut-off for acceptance. However, it is better to edit models before use so that structurally nonconserved surface loops are excluded, as they will only contribute noise to the rotation and translation functions. Where an ensemble of structures is used as the model, the highest homology model is taken as the template for the packing search. Before this model is used, the trace atom positions are edited to take account of large conformational differences between the models in the ensemble. Equivalent trace atom positions are compared and if the coordinates deviate by more than 3 Å then the template trace atom is deleted. Thus, use of an ensemble not only improves signal to noise in the maximum likelihood search functions, it also improves the discrimination of possible solutions by the packing function. 2.5. Minimizer Minimization is used in Phaser to optimize the parameters against the appropriate log-likelihood function in the anisotropy correction, in MR (refines the position and orientation of a rigid-body model) and in SAD phasing. The same minimizer code is used for all three applications and has been designed to be easily extensible to other applications. The minimizer for the anisotropy correction uses Newton’s method, while MR and SAD use the standard Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm. Both minimization methods in Phaser include a line search. The line search algorithm is a basic iterative method for finding the local minimum of a target function f. Starting at parameters x, the algorithm finds the minimum (within a convergence tolerance) of f(x + γd) by varying γ, where γ is the step distance along a descent direction d. Newton’s method and the BFGS algorithm differ in the determination of the descent direction d that is passed to the line search, and thus the speed of convergence. 
Within one cycle of the line search (where there is no change in d) the trial step distances γ are chosen using the golden section method. The golden ratio (√5/2 + 1/2 ≈ 1.618) divides a line so that the ratio of the larger part to the total is the same as the ratio of the smaller to larger. The method makes no assumptions about the function’s behaviour; in particular, it does not assume that the function is quadratic within the bracketed section. If this assumption were made, the line search could proceed via parabolic interpolation. Newton’s method uses the Hessian matrix H of second derivatives and the gradient g at the initial set of parameters x_0 to find the values of the parameters at the minimum x_min. If the function is quadratic in x then Newton’s method will find the minimum in one step, but if not, iteration is required. The method requires the inversion of the Hessian matrix, which, for large matrices, consumes a large amount of computational time and memory resources. The eigenvalues of the Hessian need to be positive for the function to be at a minimum, rather than a maximum or saddle point, since the method converges to any point where the gradient vector is zero. When used with the anisotropy correction, the full Hessian matrix is calculated analytically. The BFGS algorithm is one of the most powerful minimization methods when calculation of the full Hessian using analytic or finite difference methods is very computationally intensive. At every step, the gradient search vector is analysed to build up an approximate Hessian matrix H, in order to make the resulting search vector direction d better than the original gradient vector direction. In the ‘pure’ form of the BFGS algorithm, the method is started with matrix H equal to the identity matrix. The off-diagonal elements of the Hessian, the mixed second derivatives (i.e. ∂²LL/∂p_i∂p_j) are thus initially zero. 
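The golden-section choice of trial step distances in the line search can be sketched as follows. This is a generic textbook version, assuming the minimum along the search direction is already bracketed by `lo` and `hi` and that the function is unimodal there; the function names and the toy target are hypothetical and are not taken from Phaser.

```python
import math

INV_PHI = (math.sqrt(5) - 1) / 2  # reciprocal of the golden ratio, about 0.618


def golden_section_minimum(f, lo, hi, tol=1e-6):
    """Return gamma in [lo, hi] that approximately minimizes a unimodal f."""
    c = hi - INV_PHI * (hi - lo)
    d = lo + INV_PHI * (hi - lo)
    fc, fd = f(c), f(d)
    while hi - lo > tol:
        if fc < fd:
            # Minimum lies in [lo, d]; the old c becomes the new upper probe.
            hi, d, fd = d, c, fc
            c = hi - INV_PHI * (hi - lo)
            fc = f(c)
        else:
            # Minimum lies in [c, hi]; the old d becomes the new lower probe.
            lo, c, fc = c, d, fd
            d = lo + INV_PHI * (hi - lo)
            fd = f(d)
    return (lo + hi) / 2


def target(x, y):
    """Hypothetical two-parameter target with its minimum at (1, -2)."""
    return (x - 1.0) ** 2 + 4.0 * (y + 2.0) ** 2


# Line search from x0 along a fixed direction: minimize f(x0 + gamma * direction).
x0 = (0.0, 0.0)
direction = (1.0, -2.0)
gamma = golden_section_minimum(
    lambda g: target(x0[0] + g * direction[0], x0[1] + g * direction[1]),
    0.0, 2.0,
)
print(gamma)
```

Only one new function evaluation is needed per iteration, and no assumption is made that the function is quadratic inside the bracket.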
As the BFGS cycle proceeds, the off-diagonal elements become nonzero using information derived from the gradient. However, in Phaser, the matrix H is not the identity but rather is seeded with diagonal elements equal to the second derivatives of the log-likelihood target function (LL) with respect to the parameters (p_i) (i.e. ∂²LL/∂p_i², or curvatures), the values found in the ‘true’ Hessian. For the SAD refinement the diagonal elements are calculated analytically, but for the MR refinement the diagonal elements are calculated by finite difference methods. Seeding the Hessian with the diagonal elements dramatically accelerates convergence when the parameters are on different scales; when an identity matrix is used, the parameters on a larger scale can fail to shift significantly because their gradients tend to be smaller, even though the necessary shifts tend to be larger. In the inverse Hessian, small curvatures for parameters on a large scale translate into large scale factors applied to the corresponding gradient terms. If any of these curvature terms are negative (as may happen when the parameters are far from their optimal values), the matrix is not positive definite. Such a situation is corrected by using problem-specific information on the expected relative scale of the parameters from the ‘large-shift’ variable, as discussed below in §2.5.1. In addition to the basic minimization algorithms, the minimizer incorporates the ability to bound, constrain, restrain and reparameterize variables, as discussed in detail below. Bounds must be applied to prevent parameters becoming nonphysical, constraints effectively reduce the number of parameters, restraints are applied to include prior probability information, and reparameterization of variables makes the parameter space more quadratic and improves the performance of the minimizer. 2.5.1. 
Problem-specific parameter scaling information When a function is defined for minimization in Phaser, information must be provided on the relative scales of the parameters of that function, through a ‘large-shifts’ variable. As its name implies, the variable defines the size of a parameter shift that would be considered ‘large’ for each parameter. The ratios of these large-shift values thus specify prior knowledge about the relative scales of the different parameters for each problem. Suitable large-shift values are found by a combination of physical insight (e.g. the size of a coordinate shift considered to be large will be proportional to d min for the data set) and numerical simulations, studying the behaviour of the likelihood function as parameters are varied systematically in a variety of test cases. The large-shifts information is used in two ways. Firstly, it is used to prevent the line search from taking an excessively large step, which can happen if the estimated curvature for a parameter happens to be too small and can lead to the refinement becoming numerically unstable. If the initial step for a line search would change any parameter by more than its large-shift value, the initial step is scaled down. Secondly, it is used to provide relative scale information to correct negative curvature values. Parameters with positive curvatures are used to define the average relationship between the large-shift values and the curvatures, which can then be used to compute appropriate curvature values for the parameters with negative curvatures. This stabilizes the refinement until it is sufficiently close to the minimum that all curvatures become positive. 2.5.2. Reparameterization Second-order minimization algorithms in effect assume that, at least in the region around the minimum, the function can be approximated as a quadratic. Where this assumption holds, the minimizer will converge faster. 
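The two uses of the large-shift values described in §2.5.1 can be sketched together. The text does not give the functional form of the 'average relationship' between large-shift values and curvatures, so this sketch assumes that curvature × (large shift)² is roughly constant across parameters; that assumption, and the names `repair_curvatures` and `initial_step`, are illustrative only and are not Phaser's code.

```python
def repair_curvatures(curvatures, large_shifts):
    """Replace non-positive curvatures using the scale implied by the positive ones."""
    # Assumption (not from the source): curvature * large_shift**2 is roughly constant.
    products = [c * s * s for c, s in zip(curvatures, large_shifts) if c > 0]
    if not products:
        raise ValueError("need at least one positive curvature")
    mean_product = sum(products) / len(products)
    return [c if c > 0 else mean_product / (s * s)
            for c, s in zip(curvatures, large_shifts)]


def initial_step(gradient, curvatures, large_shifts):
    """Curvature-scaled descent step, scaled down so no shift exceeds its large-shift value."""
    safe = repair_curvatures(curvatures, large_shifts)
    step = [-g / c for g, c in zip(gradient, safe)]
    worst = max(abs(d) / s for d, s in zip(step, large_shifts))
    scale = 1.0 / worst if worst > 1.0 else 1.0
    return [scale * d for d in step]


# Hypothetical two-parameter problem: the second parameter lives on a larger
# scale and currently has a negative curvature estimate.
print(initial_step([8.0, 0.01], [4.0, -1.0], [1.0, 10.0]))
```

Dividing each gradient term by its curvature is what the diagonal seeding of the Hessian amounts to for the first step, and the final rescaling keeps that step from exceeding any large-shift value.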
It is therefore advantageous to use functions of the parameters being minimized so that the target function is more quadratic in the new parameter space than in the original parameter space (Edwards, 1992 ▶). For example, atomic B factors tend to converge slowly to their refined values because the B factor appears in the exponential term in the structure-factor equation. Although any function of the parameters can be used for this purpose, we have found that taking the logarithm of a parameter is often the most effective reparameterization operation (not only for the B factors): x′ = ln(x + x_offset). The offset x_offset is chosen so that the value of x′ does not become undefined for allowed values of x, and to optimize the quadratic nature of the function in x′. For instance, atomic B factors are reparameterized using an offset of 5 Å², which allows the B factors to approach zero and also has the physical interpretation of accounting roughly for the width of the distribution of electrons for a stationary atom. 2.5.3. Bounds Bounds on the minimization are applied by setting upper and/or lower limits for each variable where required (e.g. occupancy minimum set to zero). If a parameter reaches a limit during a line search, that line search is terminated. In subsequent line searches, the gradient of that parameter is set to zero whenever the search direction would otherwise move the parameter outside of its bounds. Multiplying the gradient by the step size thus does not alter the value of the parameter at its limit. The parameter will remain at its limit unless calculation of the gradient in subsequent cycles of minimization indicates that the parameter should move away from the boundary and into the allowed range of values. 2.5.4. Constraints Space-group-dependent constraints apply to the anisotropic tensor applied to ΣN in the anisotropic diffraction correction. Atoms on special positions also have constraints on the values of their anisotropic tensor. 
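The logarithmic reparameterization of §2.5.2 can be sketched for the B-factor case. The 5 Å² offset is the value quoted in the text; the function names are hypothetical, and the gradient conversion is simply the chain rule applied to the log transform.

```python
import math

B_OFFSET = 5.0  # Å^2, the offset quoted in the text for atomic B factors


def to_log_space(b, offset=B_OFFSET):
    """Map a B factor to the reparameterized variable b' = ln(b + offset)."""
    return math.log(b + offset)


def from_log_space(b_prime, offset=B_OFFSET):
    """Invert the reparameterization: b = exp(b') - offset."""
    return math.exp(b_prime) - offset


def gradient_in_log_space(dLL_db, b, offset=B_OFFSET):
    """Chain rule: dLL/db' = dLL/db * db/db', where db/db' = exp(b') = b + offset."""
    return dLL_db * (b + offset)


print(to_log_space(0.0), from_log_space(to_log_space(20.0)))
```

With the offset in place, b′ remains defined as B approaches zero, which is the property the text highlights.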
The anisotropic displacement ellipsoid must remain invariant under the application of each symmetry operator of the space group or site-symmetry group, respectively (Giacovazzo, 1992 ▶; Grosse-Kunstleve & Adams, 2002 ▶). These constraints reduce the number of parameters by either fixing some values of the anisotropic B factors to zero or setting some sets of B factors to be equal. The derivatives in the gradient and Hessian must also be constrained to reflect the constraints in the parameters. 2.5.5. Restraints Bayes’ theorem describes how the probability of the model given the data is related to the likelihood and gives a justification for the use of restraints on the parameters of the model: P(model | data) = P(data | model) P(model) / P(data). If the probability of the data is taken as a constant, then P(model | data) ∝ P(data | model) P(model), where P(model) is called the prior probability. When the logarithm of the above equation is taken, log P(model | data) = log P(data | model) + log P(model) + constant. Prior probability is therefore introduced into the log-likelihood target function by the addition of terms. If parameters of the model are assumed to have independent Gaussian probability distributions, then the Bayesian view of likelihood will lead to the addition of least-squares terms and hence least-squares restraints on those parameters, such as the least-squares restraints applied to bond lengths and bond angles in typical macromolecular structure refinement programs. In Phaser, least-squares terms are added to restrain the B factors of atoms to the Wilson B factor in SAD refinement, and to restrain the anisotropic B factors to being more isotropic (the ‘sphericity’ restraint). A similar sphericity restraint is used in SHELXL (Sheldrick, 1995 ▶) and in REFMAC5 (Murshudov et al., 1999 ▶). 3. Automation Phaser is designed as a large set of library routines grouped together and made available to users as a series of applications, called modes. The routine-groupings in the modes have been selected mainly on historical grounds; they represent traditional steps in the structure solution pipeline. 
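The step from independent Gaussian priors to least-squares restraints in §2.5.5 can be made explicit with a short derivation. The symbols p_i (a restrained parameter), p_{0,i} (its target value, e.g. the Wilson B factor) and σ_i (the width of its prior) are introduced here for illustration and do not appear in the source.

```latex
% Assumed notation: p_i restrained parameter, p_{0,i} target value, sigma_i prior width.
\[
P(p_i) = \frac{1}{\sigma_i \sqrt{2\pi}}
         \exp\!\left[-\frac{(p_i - p_{0,i})^2}{2\sigma_i^2}\right]
\quad\Longrightarrow\quad
\log P(p_i) = -\frac{(p_i - p_{0,i})^2}{2\sigma_i^2} - \log\!\left(\sigma_i \sqrt{2\pi}\right),
\]
\[
\log P(\text{model} \mid \text{data})
  = \mathrm{LL} \;-\; \sum_i \frac{(p_i - p_{0,i})^2}{2\sigma_i^2} \;+\; \text{constant}.
\]
```

Maximizing the posterior is therefore equivalent to maximizing the log-likelihood minus a weighted sum of squared deviations, with weights 1/(2σ_i²) set by the assumed prior widths.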
There are 13 such modes in total: ‘anisotropy correction’, ‘cell content analysis’, ‘normal-mode analysis’, ‘ensembling’, ‘fast rotation function’, ‘brute rotation function’, ‘fast translation function’, ‘brute translation function’, ‘log-likelihood gain’, ‘rigid-body refinement’, ‘single-wavelength anomalous dispersion’, ‘automated molecular replacement’ and ‘automated experimental phasing’. The ‘automated molecular replacement’ and ‘automated experimental phasing’ modes are particularly powerful and aim to automate fully structure solution by MR and SAD, respectively. Aspects of the decision making within the modes are under user input control. For example, the ‘fast rotation function’ mode performs the ensembling calculation, then a fast rotation function calculation and then rescores the top solutions from the fast search with a brute rotation function. There are three possible fast rotation function algorithms and two possible brute rotation functions to choose from. There are four possible criteria for selecting the peaks in the fast rotation function for rescoring with the brute rotation function, and for selecting the results from the rescoring for output. Alternatively, the rescoring of the fast rotation function with the brute rotation function can be turned off to produce results from the fast rotation function only. Other modes generally have fewer routines but are designed along the same principles (details are given in the documentation). 3.1. Automated molecular replacement Most structures that can be solved by MR with Phaser can be solved using the ‘automated molecular replacement’ mode. The flow diagram for this mode is shown in Fig. 1 ▶. 
The search strategy automates four search processes: those for multiple components in the asymmetric unit, for ambiguity in the hand of the space group and/or other space groups in the same point group, for permutations in the search order for components (when there are multiple components), and for finding the best model when there is more than one possible model for a component. 3.1.1. Multiple components of asymmetric unit Where there are many models to be placed in the asymmetric unit, the signal from the placement of the first model may be buried in noise and the correct placement of this first model only found in the context of all models being placed in the asymmetric unit. One way of tackling this problem has been to use stochastic methods to search the multi-dimensional space (Chang & Lewis, 1997 ▶; Kissinger et al., 1999 ▶; Glykos & Kokkinidis, 2000 ▶). However, we have chosen to use a tree-search-with-pruning approach, where a list of possible placements of the first (and subsequent) models is kept until the placement of the final model. This tree-search-with-pruning search strategy can generate very branched searches that would be challenging for users to negotiate by running separate jobs, but becomes trivial with suitable automation. The search strategy exploits the strength of the maximum likelihood target functions in using prior information in the search for subsequent components in the asymmetric unit. The tree-search-with-pruning strategy is heavily dependent on the criteria used for selecting the peaks that survive to the next round. Four selection criteria are available in Phaser: selection by percentage difference between the top and mean log-likelihood of the search, selection by Z score, selection by number of peaks, and selection of all peaks. The default is selection by percentage, with the default percentage set at 75%. 
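The default percentage criterion can be sketched as below. The sketch assumes that the percentage is measured on a scale where the mean log-likelihood of the search is 0% and the top value is 100%; the function name is hypothetical.

```python
def select_by_percentage(llg_values, percent=75.0):
    """Keep the peaks whose log-likelihood lies at or above
    mean + percent% of (top - mean)."""
    if not llg_values:
        return []
    top = max(llg_values)
    mean = sum(llg_values) / len(llg_values)
    cutoff = mean + (percent / 100.0) * (top - mean)
    return [value for value in llg_values if value >= cutoff]


# Example: the second peak just misses the 75% cut-off but survives at 65%.
peaks = [50.0, 44.0, 30.0, 20.0, 6.0]
kept_default = select_by_percentage(peaks)        # [50.0]
kept_relaxed = select_by_percentage(peaks, 65.0)  # [50.0, 44.0]
```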
This selection method has the advantage that, if there is one clear peak standing well above the noise, it alone will be passed to the next round, while if there is no clear signal, all peaks high in the list will be passed as potential solutions to the next round. If structure solution fails, it may be possible to rescue the solution by reducing the percentage cut-off used for selection from 75% to, for example, 65%, so that if the correct peak was just missing the default cut-off, it is now included in the list passed to the next round. The tree-search-with-pruning search strategy is sub-optimal where there are multiple copies of the same search model in the asymmetric unit. In this case the search generates many branches, each of which has a subset of the complete solution, and so there is a combinatorial explosion in the search. The tree search would only converge onto one branch (solution) with the placement of the last component on each of the branches, but in practice the run time often becomes excessive and the job is terminated before this point can be reached. When searching for multiple copies of the same component in the asymmetric unit, several copies should be added at each search step (rather than branching at each search step), but this search strategy must currently be performed semi-manually as described elsewhere (McCoy, 2007 ▶). 3.1.2. Alternative space groups The space group of a structure can often be ambiguous after data collection. Ambiguities of space group within the one point group may arise from theoretical considerations (if the space group has an enantiomorph) or on experimental grounds (the data along one or more axes were not collected and the systematic absences along these axes cannot be determined). Changing the space group of a structure to another in the same point group can be performed without re-indexing, merging or scaling the data. 
Determination of the space group within a point group is therefore an integral part of structure solution by MR. The translation function will yield the highest log-likelihood gain for a correctly packed solution in the correct space group. Phaser allows the user to make a selection of space groups within the same point group for the first translation function calculation in a search for multiple components in the asymmetric unit. If the signal from the placement of the first component is not significantly above noise, the correct space group may not be chosen by this protocol, and the search for all components in the asymmetric unit should be completed separately in all alternative space groups. 3.1.3. Alternative models As the database of known structures expands, the number of potential MR models is also rapidly increasing. Each available model can be used as a separate search model, or combined with other aligned structures in an ‘ensemble’ model. There are also various ways of editing structures before use as MR models (Schwarzenbacher et al., 2004 ▶). The number of MR trials that can be performed thus increases combinatorially with the number of potential models, which makes job tracking difficult for the user. In addition, most users stop performing MR trials as soon as any solution is found, rather than continuing the search until the MR solution with the greatest log-likelihood gain is found, and so they fail to optimize the starting point for subsequent steps in the structure solution pipeline. The use of alternative models to represent a structure component is also useful where there are multiple copies of one type of component in the asymmetric unit and the different copies have different conformations due to packing differences. The best solution will then have the different copies modelled by different search models; if the conformation change is severe enough, it may not be possible to solve the structure without modelling the differences. 
A set of alternative search models may be generated using previously observed conformational differences among similar structures, or, for example, by normal-mode analysis (see §2.3). Phaser automates searches over multiple models for a component, where each potential model is tested in turn before the one with the greatest log-likelihood gain is found. The loop over alternative models for a component is only implemented in the rotation functions, as the solutions passed from the rotation function to the translation function step explicitly specify which model to use as well as the orientation for the translation function in question. 3.1.4. Search order permutation When searching for multiple components in the asymmetric unit, the order of the search can be a factor in success. The models with the biggest component of the total structure factor will be the easiest to find: when weaker scattering components are the subject of the initial search, the solution may be buried in noise and not significant enough to survive the selection criteria in the tree-search-with-pruning search strategy. Once the strongest scattering components are located, then the search for weaker scattering components (in the background of the strong scattering components) is more likely to be a success. Having a high component of the total structure factor correlates with the model representing a high fraction of the total contents of the asymmetric unit, low r.m.s. deviation between model and target atoms, and low B factors for the target to which the model corresponds. Although the first of these (high completeness) can be determined in advance from the fraction of the total molecular weight represented by the model, the second can only be estimated from the Chothia & Lesk (1986 ▶) formula and the third is unknown in advance. If structure solution fails with the search performed in the order of the molecular weights, then other permutations of search order should be tried. 
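The search-order heuristic described above can be illustrated with a small sketch that lists the heaviest-first order followed by every other distinct ordering. The function name and the use of molecular weight as the only ranking key are assumptions for illustration.

```python
from itertools import permutations


def unique_search_orders(names, weights):
    """Return all distinct search orders, heaviest-first order first.

    names   -- component names, repeated once per copy being searched for
    weights -- mapping from component name to molecular weight
    """
    default = tuple(sorted(names, key=lambda name: -weights[name]))
    seen = {default}
    orders = [default]
    for order in permutations(default):
        if order not in seen:  # repeated copies give duplicate orderings
            seen.add(order)
            orders.append(order)
    return orders


# Two copies of A (30 kDa) and one of B (50 kDa) give three unique orders.
orders = unique_search_orders(["A", "B", "A"], {"A": 30.0, "B": 50.0})
```

Because identical copies are interchangeable, the number of unique orders can be far smaller than the factorial of the number of components.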
In Phaser, this possibility is automated on request: the entire search strategy (except for the initial anisotropic data correction) is performed for all unique permutations of search orders. 3.2. Automated experimental phasing SAD is the simplest type of experimental phasing method to automate, as it involves only one crystal and one data set. SAD is now becoming the experimental phasing method of choice, overtaking multiple-wavelength anomalous dispersion because only a single data set needs to be collected. This can help minimize radiation damage to the crystal, which has a major adverse effect on the success of multi-wavelength experiments. The ‘automated experimental phasing’ mode in Phaser takes an atomic substructure determined by Patterson, direct or dual-space methods (Karle & Hauptman, 1956 ▶; Rossmann, 1961 ▶; Mukherjee et al., 1989 ▶; Miller et al., 1994 ▶; Sheldrick & Gould, 1995 ▶; Sheldrick et al., 2001 ▶; Grosse-Kunstleve & Adams, 2003 ▶) and refines the positions, occupancies, B factors and f″ values of the atoms to optimize the SAD function, then uses log-likelihood gradient maps to complete the atomic substructure. The flow diagram for this mode is shown in Fig. 2 ▶. The search strategy automates two search processes: those for ambiguity in the hand of the space group and for completing atomic substructure from log-likelihood gradient maps. A feature of using the SAD function for phasing is that the substructure need not only consist of anomalous scatterers; indeed it can consist of only real scatterers, since the real scattering of the partial structure is used as part of the phasing function. This allows structures to be completed from initial real scattering models. 3.2.1. Enantiomorphic space groups Since the SAD phasing mode of Phaser takes as input an atomic substructure model, the space group of the solution has already been determined to within the enantiomorph of the correct space group. 
Changing the enantiomorph of a SAD refinement involves changing the enantiomorph of the heavy atoms, or in some cases the space group (e.g. the enantiomorphic space group of P4₁ is P4₃). In some rare cases (Fdd2, I4₁, I4₁22, I4₁md, I4₁cd, I4̄2d, F4₁32; Koch & Fischer, 1989 ▶) the origin of the heavy-atom sites is changed [e.g. the enantiomorphic space group of I4₁ is I4₁ with the origin shifted to ( , 0, 0)]. If there is only one type of anomalous scatterer, the refinement need not be repeated in both hands: only the phasing needs to be carried out in the second hand to be considered. However, if there is more than one type of anomalous scatterer, then the refinement and substructure completion need to be repeated, as they will not be enantiomorphically symmetric in the other hand. To facilitate this, Phaser runs the refinement and substructure completion in both hands [as does other experimental phasing software, e.g. Solve (Terwilliger & Berendzen, 1999 ▶) and autoSHARP (Vonrhein et al., 2006 ▶)]. The correct space group can then be found by inspection of the electron density maps; the density will only be interpretable in the correct space group. In cases with significant contributions from at least two types of anomalous scatterer in the substructure, the correct space group can also be identified by the log-likelihood gain. 3.2.2. Completing the substructure Peaks in log-likelihood gradient maps indicate the coordinates at which new atoms should be added to improve the log-likelihood gain. In the initial maps, the peaks are likely to indicate the positions of the strongest anomalous scatterers that are missing from the model. As the phasing improves, weaker anomalous scatterers, such as intrinsic sulfurs, will appear in the log-likelihood gradient maps, and finally, if the phasing is exceptional and the resolution high, non-anomalous scatterers will appear, since the SAD function includes a contribution from the real scattering. 
After refinement, atoms are excluded from the substructure if their occupancy drops below a tenth of the highest occupancy amongst those atoms of the same atom type (and therefore f″). Excluded sites are flagged rather than permanently deleted, so that if a peak later appears in the log-likelihood gradient map at this position, the atom can be reinstated and prevented from being deleted again, in order to prevent oscillations in the addition of new sites between cycles and therefore lack of convergence of the substructure completion algorithm. New atoms are added automatically after a peak and hole search of the log-likelihood gradient maps. The cut-off for the consideration of a peak as a potential new atom is that its Z score be higher than 6 (by default) and also higher than the depth of the largest hole in the map, i.e. the largest hole is taken as an additional indication of the noise level of the map. The proximity of each potential new site to previous atoms is then calculated. If a peak is more than a cut-off distance (κ Å) from every previous site, the peak is added as a new atom with the average occupancy and B factor from the current set of sites. If the peak is within κ Å of an isotropic atom already present, the old atom is made anisotropic. Holes in the log-likelihood gradient map within κ Å of an isotropic atom also cause the atom’s B factor to be switched to anisotropic. However, if the peak or hole is within κ Å of an anisotropic atom already present, the peak or hole is ignored. If a peak is within κ Å of a previously excluded site, the excluded site is reinstated and flagged as not for deletion in order to prevent oscillations, as described above. 
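The decision rules for a single peak can be summarized in a short sketch. It is a simplification under stated assumptions: coordinates are Cartesian in Å, symmetry-equivalent positions are ignored, excluded sites are checked before current atoms, and only the nearest atom within κ is considered. The function name and return labels are hypothetical.

```python
import math


def classify_peak(peak_xyz, peak_z, largest_hole_depth,
                  atoms, excluded_sites, kappa, z_cutoff=6.0):
    """Decide what to do with one log-likelihood gradient peak.

    atoms          -- list of (xyz, is_anisotropic) for current sites
    excluded_sites -- list of xyz for sites flagged as excluded
    kappa          -- proximity cut-off in Å
    """
    # A peak must beat both the Z-score cut-off and the deepest hole.
    if peak_z <= z_cutoff or peak_z <= largest_hole_depth:
        return "ignore"
    # A peak at a previously excluded site brings that site back.
    if any(math.dist(peak_xyz, xyz) < kappa for xyz in excluded_sites):
        return "reinstate"
    # Otherwise look at the nearest current atom within kappa, if any.
    nearby = [(math.dist(peak_xyz, xyz), aniso) for xyz, aniso in atoms]
    nearby = [item for item in nearby if item[0] < kappa]
    if nearby:
        _, is_anisotropic = min(nearby)
        return "ignore" if is_anisotropic else "make_anisotropic"
    return "add_atom"
```

Holes near isotropic atoms would follow the same proximity test, switching the atom to anisotropic rather than adding a site.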
At the end of the cycle of atom addition and isotropic to anisotropic atomic B-factor switching, new sites within 2κ Å of an old atom that is now anisotropic are then removed, since the peak may be absorbed by refining the anisotropic B factor; if not, it will be accepted as a new site in the next cycle of log-likelihood gradient completion. The distance κ may be input directly by the user; by default it is the ‘optical resolution’ of the structure (κ = 0.715 d_min), limited to no less than 1 Å and no more than 10 Å. If the structure contains more than one significant anomalous scatterer, then log-likelihood gradient maps are calculated from each atom type, the maps compared and the atom type associated with each significant peak assigned from the map with the most significant peak at that location. 3.2.3. Initial real scattering model One of the reasons for including MR and SAD phasing within one software package is the ability to use MR solutions with the SAD phasing target to improve the phases. Since the SAD phasing target contains a contribution from the real scatterers, it is possible to use a partial MR model with no anomalous scattering as the initial atomic substructure used for SAD phasing. This approach is useful where there is a poor MR solution combined with a poor anomalous signal in the data. If the poor MR solution means that the structure cannot be phased from this model alone, and the poor anomalous signal means that the anomalous scatterers cannot be located in the data alone, then using the MR solution as the starting model for SAD phasing may provide enough phase information to locate the anomalous scatterers. The combined phase information will be stronger than from either source alone. 
To facilitate this method of structure solution, Phaser allows the user to input a partial structure model that will be interpreted in terms of its real scattering only and, following phasing with this substructure, to complete the anomalous scattering model from log-likelihood gradient maps as described above. 3.3. Input and output The fastest and most efficient way, in terms of development time, to link software together is using a scripting language, while using a compiled language is most efficient for intensive computation. Following the lead of the PHENIX project (Adams et al., 2002 ▶, 2004 ▶), Phaser uses Python (http://python.org) as the scripting language, C++ as the compiled language, and the Boost.Python library (http://boost.org/libs/python/) for linking C++ and Python. Other packages, notably X-PLOR (Brünger, 1993 ▶) and CNS (Brünger et al., 1998 ▶), have defined their own scripting languages, but the choice of Python ensures that the scripting language is maintained by an active community. Phaser functionality has mostly been made available to Python at the ‘mode’ level. However, some low-level SAD refinement routines in Phaser have been made available to Python directly, so that they can be easily incorporated into phenix.refine. A long tradition of CCP4 keyword-style input in established macromolecular crystallography software (almost exclusively written in Fortran) means that, for many users, this has been the familiar method of calling crystallographic software and is preferred to a Python interface. The challenge for the development of Phaser was to find a way of satisfying both keyword-style input and Python scripting with minimal increase in development time. Taking advantage of the C++ class structure allowed both to be implemented with very little additional code. Each keyword is managed by its own class. The input to each mode of Phaser is controlled by Input objects, which are derived from the set of keyword classes appropriate to the mode. 
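The one-class-per-keyword arrangement can be mirrored in a few lines of Python. This is only an illustrative analogue of the C++ design: the class names, the method names and the `RESOLUTION <high> <low>` syntax are hypothetical and are not Phaser's actual keywords or API. The point is that keyword-style text and direct calls from a script both end in the same validating set function.

```python
class KeywordBase:
    """Hypothetical base class that tokenises keyword-style input."""

    @staticmethod
    def split_keyword_line(line):
        words = line.split()
        if not words:
            raise ValueError("empty keyword line")
        return words[0].upper(), words[1:]


class ResolutionKeyword(KeywordBase):
    """One keyword, one class (hypothetical 'RESOLUTION high low' syntax)."""

    def __init__(self):
        super().__init__()
        self.high = None
        self.low = None

    def parse_resolution(self, line):
        # Keyword route: parse the text, then hand over to the set function.
        name, args = self.split_keyword_line(line)
        if name != "RESOLUTION" or len(args) != 2:
            raise ValueError("expected: RESOLUTION <high> <low>")
        self.set_resolution(float(args[0]), float(args[1]))

    def set_resolution(self, high, low):
        # Scripting route: called directly; all validation lives here.
        if high <= 0.0 or low <= 0.0:
            raise ValueError("resolution limits must be positive")
        if high > low:
            raise ValueError("high-resolution limit must be the smaller number")
        self.high, self.low = high, low


class DemoInput(ResolutionKeyword):
    """Input object for a hypothetical mode, derived from its keyword classes."""


from_keywords = DemoInput()
from_keywords.parse_resolution("resolution 2.5 20")
from_script = DemoInput()
from_script.set_resolution(2.5, 20.0)
```

Keeping validation in the set function means that the two input routes cannot drift apart, which is one way to read the remark that both were implemented with very little additional code.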
The keyword classes are in turn derived from a CCP4base class containing the functionality for the keyword-style input. Each keyword class has a parse routine that calls the CCP4base class functions to parse the keyword input, stores the input parameters as local variables and then passes these parameters to a keyword class set function. The keyword class set functions check the validity and consistency of the input, throw errors where appropriate and finally set the keyword class’s member parameters. Alternatively, the keyword class set functions can be called directly from Python. These keyword classes are a standalone part of the Phaser code and have already been used in other software developments (Pointless; Evans, 2006 ▶). An Output object controls all text output from Phaser sent to standard output and to text files. Switches on the Output object give different output styles: CCP4-style for compatibility with CCP4 distribution, PHENIX-style for compatibility with the PHENIX interface, CIMR-style for development, XML-style output for developers of automation scripts and a ‘silent running’ option to be used when running Phaser from Python. In addition to the text output, where possible Phaser writes results to files in standard format; coordinates to ‘pdb’ files and reflection data (e.g. map coefficients) to ‘mtz’ files. Switches on the Output object control the writing of these files. 3.3.1. CCP4-style output CCP4-style output is a text log file sent to standard output. While this form of output is easily comprehensible to users, it is far from ideal as an output style for automation scripts. However, it is the only output style available from much of the established software that developers wish to use in their automation scripts, and it is common to use Unix tools such as ‘grep’ to extract key information. For this reason, the log files of Phaser have been designed to help developers who prefer to use this style of output. 
Phaser prints four levels of log file, summary, log, verbose and debug, as specified by user input. The important output information is in all four levels of file, but it is most efficient to work with the summary output. Phaser prints ‘SUCCESS’ and ‘FAILURE’ at the end of the log file to demarcate the exit state of the program, and also prints the names of any of the other output files produced by the program to the summary output, amongst other features. 3.3.2. XML output XML is becoming commonly used as a way of communicating between steps in an automation pipeline, because XML output can be added very simply by the program author and relatively simply by others with access to the source code. For this reason, Phaser also outputs an XML file when requested. The XML file encapsulates the mark-up within 〈phaser〉 tags. As there is no standard set of XML tags for crystallographic results, Phaser’s XML tags are mostly specific to Phaser but were arrived at after consultation with other developers of XML output for crystallographic software. 3.3.3. Python interface The most elegant and efficient way to run Phaser as part of an automation script is to call the functionality directly from Python. Using Phaser through the Python interface is similar to using Phaser through the keyword interface. Each mode of operation of Phaser described above is controlled by an Input object and its parameter set functions, which have been made available to Python with the Boost.Python library. Phaser is then run with a call to the ‘run-job’ function, which takes the Input object as a parameter. The ‘run-job’ function returns a Result object on completion, which can then be queried using its get functions. The Python Result object can be stored as a ‘pickled’ class structure directly to disk. Text is not sent to standard out in the CCP4 logfile way but may be redirected to another output stream. All Input and Result objects are fully documented. 
4. Future developments Phaser will continue to be developed as a platform for implementing novel phasing algorithms and bringing the most effective approaches to the crystallographic community. Much work remains to be done formulating maximum likelihood functions with respect to noncrystallographic symmetry, to account for correlations in the data and to consider non-isomorphism, all with the aim of achieving the best possible initial electron density map. After a generation in which Fortran dominated crystallographic software code, C++ and Python have become the new standard. Several developments, including Phaser, PHENIX (Adams et al., 2002 ▶, 2004 ▶), Clipper (Cowtan, 2002 ▶) and mmdb (Krissinel et al., 2004 ▶), simultaneously chose C++ as the compiled language at their inception at the turn of the millennium. At about the same time, Python was chosen as a scripting language by PHENIX, ccp4mg (Potterton et al., 2002 ▶, 2004 ▶) and PyMol (DeLano, 2002 ▶), amongst others. Since then, other major software developments have also started or converted to C++ and Python, for example PyWarp (Cohen et al., 2004 ▶), MrBump (Keegan & Winn, 2007 ▶) and Pointless (Evans, 2006 ▶). The choice of C++ for software development was driven by the availability of free compilers, an ISO standard (International Standardization Organization et al., 1998 ▶), sophisticated dynamic memory management and the inherent strengths of using an object-oriented language. Python was equally attractive because of the strong community support, its object-oriented design, and the ability to link C++ and Python through the Boost.Python library or the SWIG library (http://www.swig.org/). Now that a ‘critical mass’ of developers has taken to using the new languages, C++ and Python are likely to remain the standard for crystallographic software for the current generation of crystallographic software developers. 
Phaser source code has been distributed directly by the authors (see http://www-structmed.cimr.cam.ac.uk/phaser for details) and through the PHENIX and CCP4 (Collaborative Computing Project, Number 4, 1994 ▶) software suites. The source code is released for several reasons, including that we believe source code is the most complete form of publication for the algorithms in Phaser. It is hoped that generous licensing conditions and source distribution will encourage the use of Phaser by other developers of crystallographic software and those writing crystallographic automation scripts. There are no licensing restrictions on the use of Phaser in macromolecular crystallography pipelines by other developers, and the license conditions even allow developers to alter the source code (although not to redistribute it). We welcome suggestions for improvements to be incorporated into new versions. Compilation of Phaser requires the computational crystallography toolbox (cctbx; Grosse-Kunstleve & Adams, 2003 ▶), which includes a distribution of the cmtz library (Winn et al., 2002 ▶). The Boost libraries (http://boost.org/) are required for access to the functionality from Python. Phaser runs under a wide range of operating systems including Linux, Irix, OSF1/Tru64, MacOS-X and Windows, and precompiled executables are available for these platforms when only keyword-style access (and not Python access) is required. Graphical user interfaces to Phaser are available for both the PHENIX and the CCP4 suites. User support is available through PHENIX, CCP4 and from the authors (email cimr-phaser@lists.cam.ac.uk).

Journal ID (nlm-ta): Biochemistry

Journal ID (iso-abbrev): Biochemistry

Journal ID (publisher-id): bi

Journal ID (coden): bichaw

Title: Biochemistry

Publisher: American Chemical Society

ISSN (Print): 0006-2960

ISSN (Electronic): 1520-4995

Publication date (Electronic): 10 September 2021

DOI: 10.1021/acs.biochem.1c00414

PMC ID: 8457326

PubMed ID: 34506130

Copyright statement: © 2021 American Chemical Society

License:

This article is made available via the PMC Open Access Subset for unrestricted RESEARCH re-use and analyses in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.

Funded by: National Institute of General Medical Sciences, doi 10.13039/100000057;

Award ID: R01 GM135919

Funded by: National Institute of Allergy and Infectious Diseases, doi 10.13039/100000060;

Award ID: R21 AI149716

Funded by: National Institute of General Medical Sciences, doi 10.13039/100000057;

Award ID: R35 GM118112

Subject:
Article


ScienceOpen disciplines: Biochemistry
