Unified access to up-to-date residue-level annotations from UniProtKB and other biological databases for PDB data

Choudhary, Preeti; Anyango, Stephen; Berrisford, John; Tolchard, James; Varadi, Mihaly; Velankar, Sameer

doi:10.1038/s41597-023-02101-6


Article
Open access
Published: 12 April 2023

Volume 10, article number 204, (2023)

Scientific Data


Abstract

More than 61,000 proteins have up-to-date correspondence between their amino acid sequence (UniProtKB) and their 3D structures (PDB), enabled by the Structure Integration with Function, Taxonomy and Sequences (SIFTS) resource. SIFTS incorporates residue-level annotations from many other biological resources. SIFTS data is available in XML, CSV and TSV formats and via the PDBe REST API, but it has always been maintained separately from the structure data (PDBx/mmCIF files) in the PDB archive. Here, we extended the wwPDB PDBx/mmCIF data dictionary with additional categories to accommodate SIFTS data and added the UniProtKB, Pfam, SCOP2, and CATH residue-level annotations directly into the PDBx/mmCIF files from the PDB archive. With the integrated UniProtKB annotations, these files now provide consistent numbering of residues in different PDB entries, allowing easy comparison of structure models. The extended dictionary yields a more consistent, standardised metadata description without altering the core PDB information. This development enables up-to-date cross-reference information at the residue level, resulting in better data interoperability and supporting improved data analysis and visualisation.


Introduction

As of March 2023, the Protein Data Bank (PDB)¹ contains over 200,000 entries representing over 61,000 unique entries in the Universal Protein Resource Knowledgebase (UniProtKB)². Often, the PDB archive has the same protein in multiple entries under different experimental conditions or interacting with different macromolecules (proteins, DNA, RNA) or ligand molecules^3,4,5. Multiple sets of 3-dimensional coordinates for the same protein are invaluable for comparative structure-function studies^3,6,7. Linking structure data to annotations available in other data resources, such as UniProtKB², and to structural and functional annotations is critical in order to understand biological function and processes at a molecular level. However, one of the barriers to comparative analysis or data integration is the independent, depositor-provided residue numbering in the coordinate files, which may not be the same as the protein sequence numbering⁸. When solving a protein 3D structure, experiments are often carried out on only part of the complete protein molecule (e.g. a domain) to make the sample amenable to experimental methods, especially in cases where there are highly flexible linker regions or intrinsically disordered regions^9,10. Around 58% of the structures in the PDB contain smaller fragments (e.g. a domain) corresponding to different regions of a protein sequence. To determine where these fragments are located on the full-length protein sequence, they need to be mapped to a common reference, e.g. protein sequence numbering from a relevant entry in the UniProtKB database. The situation is further complicated because flexible regions in protein molecules are often not modelled, leading to unobserved residues, i.e. residues without atomic coordinates in protein structures. The occurrence of missing residues makes structure-to-sequence mapping even more challenging.
To address this fundamental problem of standardising residue numbering to make protein structure data more accessible to the broader scientific community, the PDBe¹¹ and UniProtKB² teams collaborated to establish the Structure Integration with Function, Taxonomy and Sequences (SIFTS) resource in 2002^12,13. SIFTS provides up-to-date residue level mapping, with each weekly PDB release, between UniProtKB protein sequences and PDB protein structures allowing better integration of annotations based on protein sequence and structure.

In addition to mapping PDB structures to UniProtKB sequences, SIFTS also maps to other biological resources such as Pfam¹⁴, InterPro¹⁵, SCOP¹⁶, CATH¹⁷, IntEnz¹⁸, GO^19,20, Ensembl²¹, NCBI taxonomy database²² and Homologene²³.

In the past 20 years, SIFTS has become an essential resource, and its data provides the foundation of many data services and web pages. SIFTS is fundamental to the PDBe and PDBe-KB data resources²⁴, and other databases, such as UniProtKB², Pfam¹⁴, RCSB PDB²⁵, PDBj²⁶, SCOP2²⁷, InterPro¹⁵ and MobiDB²⁸, rely on SIFTS to fetch cross-references between PDB structures and other biological databases. SIFTS data is distributed as summary flat files in CSV/TSV formats and also as detailed per-entry XML files with residue-level information, available from the EMBL-EBI FTP area (ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/). SIFTS data is also accessible via the PDBe API²⁹.

While SIFTS data has significantly improved the interoperability of PDB structure data with other key data resources, it still has to be accessed separately from the 3D coordinate data in the PDB. The SIFTS output format is incompatible with 3D visualisation software that uses the PDBx/mmCIF standard³⁰ and requires an additional step of parsing the data to display SIFTS annotations on protein 3D structure. To boost the FAIRness³¹ (Findable, Accessible, Interoperable and Reusable) of PDB structures by further improving their findability and interoperability, the next logical step is to integrate SIFTS annotations alongside the 3D coordinates in the PDBx/mmCIF files. Moreover, with the availability of numerous high-quality, predicted protein structure models from resources like SWISS-MODEL³² and AlphaFold DB^33,34, which generally follow the protein sequence numbering scheme, it was timely and essential to augment the protein sequence numbering for the experimentally determined 3D coordinates in the PDB. Using data from the SIFTS resource, the PDBrenum⁸ web server replaces author sequence numbering with UniProtKB numbering in PDB or PDBx/mmCIF format files, but it has certain limitations when handling special cases. For instance, if this web server does not find any mapping data in SIFTS while renumbering, it simply adds a large number to the residue’s sequence position number. These residues can be expression tags or insertions and need to be represented appropriately without losing the experimental context of the sample. Similarly, for chimeric proteins which are mapped to more than one protein sequence (UniProtKB accession), PDBrenum only renumbers according to the one protein sequence which has maximum coverage, losing information about the remaining proteins in the chimeric construct. Nor does it integrate SIFTS annotations from other data resources such as Pfam, SCOP2 and CATH.
Thus, there is a need to find a more consistent, sustainable and up-to-date solution while incorporating UniProtKB numbering and annotations from various other data resources in the 3D coordinate files.

Here, we describe incorporating SIFTS annotations in extended PDBx/mmCIF files to directly incorporate UniProtKB residue numbering next to the atomic coordinates. The PDBx/mmCIF is an extensible format that also provides a mechanism to maintain data integrity and is the master format for macromolecular structure data in the PDB³⁵. We describe how this current work extends the PDBx/mmCIF dictionary by leveraging the extensibility of its structured framework, thereby providing a mechanism to enrich the biological context of a PDB structure.

Results

Extension to the core SIFTS pipeline

The core SIFTS pipeline¹³ includes (1) a semi-automated process to retrieve the manually curated UniProtKB cross-reference (or canonical UniProtKB accession) for each protein chain in the PDB and (2) an automated process that generates residue-level correspondences between structure (PDB) and the corresponding sequence (UniProtKB). Initial mapping of UniProtKB sequence to the PDB structure is manually curated during the wwPDB annotation process³⁶. During the semi-automated process, these manually curated mappings are checked for obsoleted or secondary UniProtKB accessions and are updated accordingly. In the automatic process, the manually curated canonical accession is then expanded to include all its isoforms, and a sequence alignment is computed for each PDB-UniProtKB pair. Taking only the PDB-UniProtKB pairs with the same source organism, or at least having a common ancestor within one or two levels up to species level in the taxonomy tree, and having at least 90% sequence identity, the pair with the highest sequence identity is annotated as the best mapping. Once we have established the mapping between UniProtKB and PDB protein residues, the cross-references from other resources such as Pfam¹⁴, InterPro¹⁵, SCOP¹⁶, CATH¹⁷, IntEnz¹⁸, GO^19,20, Ensembl²¹ and Homologene²³ are added. The SIFTS annotations are stored in the SIFTS database, which is used to make the data accessible via the PDBe REST API. Individual XML files for each PDB entry with residue-level information are exported and the summary files are generated in CSV/TSV formats. An additional process was designed that reads the data from the SIFTS database and augments the PDB structure files with UniProtKB numbering and structure (SCOP2 and CATH resources) and sequence (Pfam resource) domain annotations. This update yields more consistent, standardised metadata. It is important to note that none of the core PDB information, such as atomic coordinates and experimental data, is altered in any way.
Figure 1 shows the schematic overview of the data flow of the SIFTS process and highlights the additional process that was developed to export these data into PDBx/mmCIF files. The process helps researchers and data services access SIFTS data directly from the PDBx/mmCIF³⁷ files. To facilitate this update, additional “SIFTS-specific” mmCIF data categories were designed and integrated into the core PDBx/mmCIF data dictionary. These format specifications are discussed in detail below.
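
The selection rule described above (same or closely related source organism, at least 90% sequence identity, highest identity wins) can be sketched in a few lines of Python. This is only an illustration and not the SIFTS implementation: the `Candidate` record, its field names, the pre-computed `taxonomy_ok` flag and the placeholder accessions are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical record for one aligned PDB chain / UniProtKB accession pair.
@dataclass
class Candidate:
    unp_acc: str        # UniProtKB accession (canonical or isoform), placeholder values
    identity: float     # sequence identity of the alignment, 0.0 to 1.0
    taxonomy_ok: bool   # assumed flag: same organism or a close common ancestor

MIN_IDENTITY = 0.90  # threshold stated in the text

def best_mapping(candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the eligible candidate with the highest identity, or None."""
    eligible = [c for c in candidates
                if c.taxonomy_ok and c.identity >= MIN_IDENTITY]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c.identity)

candidates = [
    Candidate("X00001", 0.97, True),     # canonical sequence
    Candidate("X00001-2", 0.99, True),   # isoform with a better alignment
    Candidate("X00002", 1.00, False),    # rejected: unrelated source organism
    Candidate("X00003", 0.85, True),     # rejected: below 90% identity
]
best = best_mapping(candidates)
print(best.unp_acc if best else "no mapping")
```

In this toy input the isoform is selected even though another sequence aligns perfectly, because the taxonomy filter is applied before identities are compared.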

Extensions to the PDBx/mmCIF framework

The PDBx/mmCIF framework organises information in categories containing related data items³⁷. The updated PDBx/mmCIF files contain the residue mappings between UniProtKB and PDB, and annotations from Pfam, SCOP2, and CATH. The SIFTS annotations are integrated in two ways: per-segment and per-residue. The per-segment annotations refer to a continuous segment in the protein sequence, where only the start and end positions for the annotations are provided. The per-residue annotations, on the other hand, expand the segment boundaries to provide annotations for every residue within that region. Both types of annotation are provided because expanding segment annotations to the residue level can be complex due to factors such as missing residues, insertions, expression tags, and linker regions in the protein sequence. Moreover, PDB residue numbers are not always unique on their own and can carry insertion codes, which together with the PDB residue number uniquely identify a particular residue. These factors can lead to gaps in the numbering between residues, which can make it challenging to expand segment annotations to the residue level. Therefore, providing both per-segment and per-residue annotations affords the flexibility to visualise and analyse these data in the way that best suits the user's needs. New data categories were added to represent these additional per-segment and per-residue mappings (Fig. 2). Two new categories, “_pdbx_sifts_unp_segments” and “_pdbx_sifts_xref_db_segments”, were added to represent per-segment mapping to UniProtKB and to other data resources (Pfam, SCOP2, CATH). A third category, “_pdbx_sifts_xref_db”, was added to provide per-residue mapping from all the external resources. The “_atom_site” category, which represents the coordinate information, was extended with additional data items to integrate UniProtKB residue numbering from the best mapping adjacent to the atomic coordinates.
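
To make the relationship between the two representations concrete, the sketch below expands per-segment rows into a per-residue table. It assumes that numbering inside a single segment is gap-free, which is exactly the simplification that breaks down for the special cases listed above. The keys "seq_id_start" and "seq_id_end" follow item names mentioned in this article; "unp_start", the accession and all values are assumptions for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical per-segment rows for one chain of ten residues.
segments = [
    {"unp_acc": "X00001", "seq_id_start": 3, "seq_id_end": 6, "unp_start": 101},
    {"unp_acc": "X00001", "seq_id_start": 9, "seq_id_end": 10, "unp_start": 120},
]

def expand_segments(segments: List[dict], chain_length: int) -> Dict[int, Optional[int]]:
    """Map every label_seq_id (1..chain_length) to a UniProtKB number or None."""
    per_residue: Dict[int, Optional[int]] = {
        seq_id: None for seq_id in range(1, chain_length + 1)
    }
    for seg in segments:
        for seq_id in range(seg["seq_id_start"], seg["seq_id_end"] + 1):
            # Assumption: residues inside one segment are numbered contiguously.
            offset = seq_id - seg["seq_id_start"]
            per_residue[seq_id] = seg["unp_start"] + offset
    return per_residue

mapping = expand_segments(segments, chain_length=10)
for seq_id, unp_num in mapping.items():
    print(seq_id, unp_num if unp_num is not None else "unmapped")
```

Residues 1, 2, 7 and 8 stay unmapped here, standing in for an expression tag and a linker that have no UniProtKB counterpart.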

A summary of the new and modified data categories necessary to encode the SIFTS annotations data is provided below:

1. _pdbx_sifts_unp_segments

This new category describes residue range-based cross-references specific to the UniProtKB database. It shows segments/regions of PDB residues mapped to the canonical UniProtKB accession and all its isoforms. The residue mapping is established by aligning the PDB sequence to each UniProtKB accession (canonical and all the isoforms) and the sequence identity between the aligned PDB-UniProtKB pair is provided. This category also indicates the best mapped UniProtKB accession.

2. _pdbx_sifts_xref_db_segments

This new category describes residue range-based cross-references to additional databases such as Pfam, SCOP2, and CATH.

3. _pdbx_sifts_xref_db

PDB structures often have missing residues, expression tags or linker regions, making the expansion of mappings from segments (residue range) to individual residues cumbersome. An essential category, “_pdbx_sifts_xref_db”, therefore describes residue-level cross-references to external databases. This category provides annotations specific to the best mapped UniProtKB accession and can be used to identify all the mappings for each residue to external databases (Fig. 3).

4. _atom_site

New data items were added to the “_atom_site” category to represent the best mapped UniProtKB accession, residue type and number. The new data item “_atom_site.pdbx_label_index” along with “_atom_site.label_asym_id” provides a unique identifier for all the polymer residues and individual non-polymer and solvent components.

Fig. 3
Single placeholder in PDBx/mmCIF files to find all the annotations associated with any residue from external databases. This figure shows the “_pdbx_sifts_xref_db” category for PDB 4daj. This critical new data category can describe residue-level cross-references to external databases. The items specific to the UniProtKB database and other cross-reference databases are marked in beige and green coloured boxes respectively.
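
As a rough illustration of how a per-residue table of this kind can be used, the sketch below indexes rows by chain and residue so that all cross-references for one residue are retrieved in a single lookup. Only "seq_id" is an item name taken from this article; the other keys and every value (including the accessions) are invented for the example and are not copied from PDB 4daj.

```python
from collections import defaultdict

# Made-up rows imitating a per-residue cross-reference table.
rows = [
    {"asym_id": "A", "seq_id": 12, "xref_db_name": "Pfam", "xref_db_acc": "PF00000"},
    {"asym_id": "A", "seq_id": 12, "xref_db_name": "CATH", "xref_db_acc": "0.0.0.0"},
    {"asym_id": "A", "seq_id": 13, "xref_db_name": "Pfam", "xref_db_acc": "PF00000"},
]

def index_by_residue(rows):
    """Group (database, accession) pairs under a (chain, residue) key."""
    index = defaultdict(list)
    for row in rows:
        key = (row["asym_id"], row["seq_id"])
        index[key].append((row["xref_db_name"], row["xref_db_acc"]))
    return dict(index)

annotations = index_by_residue(rows)
print(annotations[("A", 12)])
```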

Two different numbering schemes are used to identify each residue (amino acid or nucleotide) in the PDBx/mmCIF file. The first is “auth_seq_id”, the numbering provided by the author. An author can assign its value in any desired way, and the values may be used to relate the given structure to a numbering scheme in a homologous structure, including sequence gaps or insertion codes, which are not necessarily numbers. The second is “label_seq_id”, the wwPDB-assigned numbering, which starts from 1 and increments sequentially for polymer residues only. All the SIFTS-specific categories refer consistently to the wwPDB-assigned numbering scheme defined by the “label_seq_id” data item in the “_atom_site” category. The reference to label_seq_id is provided by the data items “.seq_id”, “.seq_id_start” and “.seq_id_end” in the relevant categories. Data on the author-provided or PDB numbering scheme can be retrieved using the appropriate relationships defined in the PDBx/mmCIF categories (Fig. 4).
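
A minimal sketch of moving from the wwPDB-assigned numbering back to the author numbering is given below. It uses standard “_atom_site” item names (label_asym_id, label_seq_id, auth_seq_id, pdbx_PDB_ins_code), but the atom records themselves are invented, and real files would be read with an mmCIF parser rather than typed in as dictionaries.

```python
# A few atom records; in mmCIF, "?" and "." mark unknown or inapplicable values.
atoms = [
    {"label_asym_id": "A", "label_seq_id": 1, "auth_seq_id": "27", "pdbx_PDB_ins_code": "?"},
    {"label_asym_id": "A", "label_seq_id": 1, "auth_seq_id": "27", "pdbx_PDB_ins_code": "?"},
    {"label_asym_id": "A", "label_seq_id": 2, "auth_seq_id": "27", "pdbx_PDB_ins_code": "A"},
    {"label_asym_id": "A", "label_seq_id": 3, "auth_seq_id": "28", "pdbx_PDB_ins_code": "?"},
]

def label_to_author(atoms):
    """Build {(chain, label_seq_id): author residue label incl. insertion code}."""
    table = {}
    for atom in atoms:
        key = (atom["label_asym_id"], atom["label_seq_id"])
        ins = atom["pdbx_PDB_ins_code"]
        ins = "" if ins in ("?", ".") else ins
        # Many atoms share one residue; keep the first label seen for each key.
        table.setdefault(key, atom["auth_seq_id"] + ins)
    return table

author_numbering = label_to_author(atoms)
print(author_numbering[("A", 2)])
```

The second and third residues show why the author number alone is not a safe key: residue 27 appears twice and is only distinguished by its insertion code.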

In many proteins, several domains are tandemly repeated³⁸. Additionally, researchers sometimes engineer constructs in which even the entire protein is repeated for specific research purposes^39,40. Previously, there was no automated way to find corresponding UniProtKB mappings for multiple domains in a protein structure in the PDB. The data item “.instance_id” is designed to help identify multiple instances of the same protein segment. For example, in the single-chain dimeric Streptavidin structure (PDB 6s50), the two copies of Streptavidin⁴¹ are easily identified by instance ids “1” and “2” for the UniProtKB accession P22629 (Fig. 5).

Similarly, users can rely on this data item to easily identify multiple copies of the same domains in a protein structure.
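
Counting repeated copies from such rows is straightforward, as the sketch below suggests. The accession P22629 and the instance ids 1 and 2 come from the example above; the residue ranges are invented and are not the actual values stored for PDB 6s50.

```python
from collections import defaultdict

# Hypothetical per-segment rows for a single chain containing two copies.
segments = [
    {"unp_acc": "P22629", "instance_id": 1, "seq_id_start": 1, "seq_id_end": 120},
    {"unp_acc": "P22629", "instance_id": 2, "seq_id_start": 140, "seq_id_end": 260},
]

def copies_per_accession(segments):
    """Count distinct instance ids for every UniProtKB accession."""
    instances = defaultdict(set)
    for seg in segments:
        instances[seg["unp_acc"]].add(seg["instance_id"])
    return {acc: len(ids) for acc, ids in instances.items()}

copies = copies_per_accession(segments)
print(copies)
```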

During evolution, a protein structure may acquire an additional domain whose insertion splits the original structural domain into a discontinuous range of residues in the sequence⁴². For example, the E. coli enzyme RNA 3′-terminal phosphate cyclase (PDB 1qmh) consists of two structural domains, where a smaller insert domain (residues 186–276) splits the larger domain (residues 5–182 and 277–337)⁴³. The split domain (residues 5–182, 277–337) can be identified from the “.segment_id” data item (Fig. 6).
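
The sketch below shows one way segment ids could be used to reassemble a discontinuous domain. The residue ranges are the ones quoted above for PDB 1qmh; the domain labels "large" and "insert" and the dictionary keys are hypothetical, and no claim is made that these ranges equal the label_seq_id values stored in the file.

```python
from collections import defaultdict

# One row per contiguous segment; a split domain has more than one segment_id.
rows = [
    {"domain": "large", "segment_id": 1, "start": 5, "end": 182},
    {"domain": "insert", "segment_id": 1, "start": 186, "end": 276},
    {"domain": "large", "segment_id": 2, "start": 277, "end": 337},
]

def domain_ranges(rows):
    """Collect the ordered residue ranges belonging to each domain."""
    ranges = defaultdict(list)
    for row in sorted(rows, key=lambda r: (r["domain"], r["segment_id"])):
        ranges[row["domain"]].append((row["start"], row["end"]))
    return dict(ranges)

def domain_length(ranges):
    """Number of residues covered by a list of inclusive ranges."""
    return sum(end - start + 1 for start, end in ranges)

domains = domain_ranges(rows)
for name, ranges in domains.items():
    kind = "discontinuous" if len(ranges) > 1 else "continuous"
    print(name, ranges, domain_length(ranges), kind)
```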

Complete documentation for all the new and updated data categories and items is available at https://mmcif.wwpdb.org/dictionaries/ascii/mmcif_pdbx_v50.dic.

Applications

The SIFTS resource has been widely used in various research studies to retrieve residue correspondence between PDB structures and UniProtKB sequences^44,45,46,47,48. However, in many cases, researchers have had to manually renumber the coordinate files to reflect UniProtKB numbering for subsequent comparative analysis across multiple PDB structures^49,50,51. While SIFTS has been used in several functional studies, including mapping somatic mutations to protein structures to identify 3D clusters of mutations with functional significance⁵² and mapping GPCR structures to their respective G protein structures to investigate the allosteric mechanism of GPCR activation⁵³, authors still had to manually validate missing positions in PDB structures to verify genuine cases of chimeric proteins, peptide tags, or point mutations. This process was both time-consuming and error-prone. With the incorporation of SIFTS residue-level mapping to the best mapped UniProtKB sequence in the PDBx/mmCIF files, this manual verification is no longer necessary, which saves time and facilitates the analysis and interpretation of data.

Integration of UniProtKB sequence annotations and 3D structures can furnish the biological and functional context for the structural data. For instance, mapping variant annotations onto 3D structures can provide insights into the genetic basis of complex traits and diseases. The SIFTS resource has also been used to fetch annotations like sequence domains and structural domains for various PDB structures^51,54. Using the domain annotations mapped to a protein sequence in these PDBx/mmCIF files, researchers can easily identify the location, multiple copies and boundaries of different domains within a protein, which can help in understanding the overall structure and function of the protein. This also facilitates comparing proteins with similar domain structures and identifying potential functional relationships.

SIFTS is not only widely used in scientific research but also by several data resources¹². For example, UniProtKB exploits SIFTS information to provide structure mapping in the UniProtKB database. SCOP⁵⁵ and Pfam^14,56 also use SIFTS to map protein domains and connect sequence domains with their corresponding structures. The web resource Kincore relies on SIFTS to map protein kinases to their respective structures, extract relevant information such as domain boundaries and ligand binding sites, and provide a structural classification of protein kinases and their inhibitors⁵⁰. The PDBx/mmCIF files with SIFTS annotations address the fundamental need by combining data from various resources and providing coordinate files with a common reference frame, improving interoperability and reuse of these data. The availability of these files will streamline data extraction and promote consistent and efficient data sharing.

Adding UniProtKB, Pfam, SCOP2, and CATH annotations to PDB coordinate files can be very helpful for resources like Gene Integration with Function, Taxonomy, and Sequence (GIFTS, https://www.ebi.ac.uk/gifts/), Venus⁵⁷ or PhyreRisk⁵⁸. These annotations provide valuable information to gain a deeper understanding of the relationships between protein structure and function⁵⁹, which can be used to link structural and functional data on a genome-wide scale⁶⁰. By integrating these annotations in PDBx/mmCIF files, it becomes easier to map genetic variants to protein structures, which can greatly facilitate genome-wide studies. The use of SIFTS annotations in the COSMIC data resource is an excellent example of how this approach can be used to efficiently and accurately analyse the impact of genetic variants on protein function and stability⁶¹. This can be further expanded to support a wide range of computational approaches for analysing protein structure and function⁶², including functional annotation⁴⁸, structural comparison⁵⁹, ligand binding analysis⁶³, identifying new protein-protein interactions⁶⁴, functional pathways, and potential drug targets⁶⁵ on a large scale.

Various data visualisation tools can directly use these PDBx/mmCIF files, making the mapping of 1D sequence data onto the 3D structure views straightforward. With our improvements, researchers from various scientific fields can easily map sequence feature data onto PDB structures. Users can directly retrieve all the SIFTS annotations like structural domains, sequence domains and conflicts between sequences and structures from the PDBx/mmCIF files.

These files also provide a basis for improved comparisons between experimentally determined and predicted protein models. UniProtKB numbering in the coordinate files allows direct residue correspondence making structural comparison and superposition easier. It also makes it easier to compare PDB structures with the predicted model structures from AlphaFold DB^33,34, SWISS-MODEL³², RoseTTAFold⁶⁶, and many other resources, as these models follow a natural sequence numbering. These files are already being used by Mol*⁶⁷ (https://molstar.org/viewer/) to perform extremely fast superpositions using the SIFTS UniProtKB mapping. This superposition functionality in Mol*⁶⁷ is very powerful as it gives users the means to directly superimpose protein structures in their web browser without downloading any data or software. Mol* uses the SIFTS specific new data items added in the “_atom_site” category to establish the residue equivalence (UniProtKB residue number) from different PDB structures. Mol* superimposes the structures by calculating the optimal rotation and translation that align the corresponding atoms in each equivalent protein residue. Figure 7 shows the superposition of the unbound and bound forms of human Protein Tyrosine Phosphatase 1B protein (PTP1B, UniProtKB accession: P18031) performed using the “UniProt” button (highlighted in red box) in the Mol* Superposition panel⁶⁷. This protein is known to be a signalling molecule regulating a variety of cellular processes including cell growth, differentiation and oncogenic transformation and is a potential therapeutic target for the treatment of type 2 diabetes and cancer⁶⁸. Upon substrate/inhibitor binding, the WPD loop transitions from an open to a closed conformation^69,70,71,72 as shown in Fig. 7.
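
The residue-equivalence step that precedes such a superposition can be sketched as follows. This is not the Mol* implementation: it only pairs CA atoms that share a UniProtKB residue number and computes the centroid shift between the paired sets, while the optimal rotation (for example via the Kabsch algorithm) is deliberately left out. All coordinates and residue numbers are invented.

```python
# CA coordinates keyed by UniProtKB residue number for two hypothetical models.
model_a = {10: (0.0, 0.0, 0.0), 11: (3.8, 0.0, 0.0), 12: (7.6, 0.0, 0.0)}
model_b = {11: (4.8, 1.0, 0.0), 12: (8.6, 1.0, 0.0), 13: (12.4, 1.0, 0.0)}

def paired_coordinates(a, b):
    """Return shared residue numbers and the matching coordinates from each model."""
    shared = sorted(set(a) & set(b))
    return shared, [a[n] for n in shared], [b[n] for n in shared]

def centroid(points):
    """Arithmetic mean of a non-empty list of 3D points."""
    count = len(points)
    return tuple(sum(p[i] for p in points) / count for i in range(3))

shared, coords_a, coords_b = paired_coordinates(model_a, model_b)
centroid_a, centroid_b = centroid(coords_a), centroid(coords_b)
# Translation that moves the paired atoms of model B onto the centroid of model A.
translation = tuple(x - y for x, y in zip(centroid_a, centroid_b))
print(shared, translation)
```

Because both models are keyed by the same reference numbering, no sequence alignment is needed to decide which atoms correspond, which is the practical benefit described above.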

The new PDBx/mmCIF files also provide a basis for developing interactive visualisations. For example, the PDBe entry pages show the ProtVista component⁷³, a 2D visualisation for displaying the primary sequence features of proteins. ProtVista was developed in collaboration with UniProtKB and InterPro at EMBL-EBI. The PDBx/mmCIF files with PDB-UniProtKB residue mapping enable interactivity between the 3D viewer (Fig. 8C), the ProtVista sequence viewer (Fig. 8A) and the 2D topology component (Fig. 8B). Consequently, Mol* can easily display all the annotations available in ProtVista and the 2D topology component on the 3D structure. As shown in Fig. 8, for Mannose-1-phosphate guanyltransferase, PDB 7d72 (https://www.ebi.ac.uk/pdbe/entry/pdb/7d72/protein/1), if users click on any residue annotation in the 2D viewer ProtVista, the residue or the residue segment is automatically highlighted in 3D in the Mol* viewer. Similarly, users can highlight various structural or sequence domains, or other annotations, in the 2D topology component, the 2D ProtVista component or the Mol* viewer, and the three visualisations stay synchronised with each other, making visualisation and interpretation of data much easier. Mol* already uses these PDBx/mmCIF files to display various annotations on PDBe and PDBe-KB webpages. With SIFTS annotations directly available in the coordinate file, the 3D visualisation on PDBe and PDBe-KB webpages is more efficient.

It is important to note that adding data to a PDBx/mmCIF file, such as the best mapped UniProtKB residue mapping in the “_atom_site” category, comes with the trade-off of increased file size. While this may not be an issue for smaller PDB entries, it can become problematic for larger entries whose files are already large. To address this issue, wwPDB provides binaryCIF⁷⁴ (bcif) files as an alternative to traditional PDBx/mmCIF files. The bcif format is a compressed binary version of the PDBx/mmCIF format that significantly reduces the file size, making it easier to handle and share large amounts of structural data. Mol*, an open-source tool for 3D molecular visualisation and analysis, also supports the bcif file format, allowing users to easily access and analyse structural data in this format.

Discussion

Interoperability challenges between the protein structure data in the PDB and protein sequences in UniProtKB present a significant barrier to accessibility and reusability. The seemingly trivial task of mapping residue-level information proved to be formidable and necessitated the development of the SIFTS resource. While SIFTS has successfully provided up-to-date mappings between the PDB and other data resources for the past 20 years, using these mappings still required some level of technical expertise.

To remove a tedious but previously mandatory step in many structural data analyses, we worked on adding the SIFTS mapping data directly into the PDBx/mmCIF files, the master format for the PDB archive. We designed new data categories and extended existing ones to provide flexible support for residue-level annotations. This development will allow easy linking of structural and functional annotations derived using structure and sequence data. It will also streamline the vast majority of high-throughput bioinformatics analysis pipelines by allowing developers to remove an error-prone step from their processes. Including the SIFTS data in the PDBx/mmCIF files will also improve the efficiency of data visualisation tools, both those that specialise in 3D molecular graphics and those that focus on the interactive mapping of annotations onto protein structure representations, e.g. sequence or topology.

By extending the PDBx/mmCIF data format, this work has laid the foundation for the future integration of additional annotations, allowing the files to be more comprehensive and to provide the biological context for PDB structures.

Methods

PDBx/mmCIF file format and PDBx/mmCIF dictionary

PDBx/mmCIF (Protein Data Bank exchange/macromolecular Crystallographic Information File) is a well-established data format used for storing and sharing information related to the three-dimensional structure of macromolecules, including proteins and nucleic acids. It is the master format for the PDB archive and is extensively used for representing structural data. It is a text-based format that encodes data and metadata using data items grouped into categories. The PDBx/mmCIF dictionary³⁰ defines a standardised set of categories and data items, along with controlled vocabularies and explicit relationships between different categories and data items. This format is extensible, allowing the incorporation of new data items and categories, as demonstrated by the IHM⁷⁵ and ModelCIF⁷⁶ extensions. The IHM extension enables the archiving of structural models of macromolecular assemblies obtained through integrative/hybrid methods, while the ModelCIF extension enables the consistent representation of molecular models obtained through computational methods. By facilitating such inclusion of new information and accommodating scientific advancements, the PDBx/mmCIF dictionary continues to remain relevant and valuable to the scientific community. The PDBx/mmCIF dictionary is maintained by the wwPDB consortium and is regularly updated with new data items to reflect changes in the field of structural biology. The mmCIF dictionary can be accessed and downloaded freely from https://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v50.dic/Index/.

SIFTS-specific data categories and items in PDBx/mmCIF Dictionary

The PDBx/mmCIF dictionary was extended with three new data categories to provide the necessary semantic organisation to represent SIFTS annotations: “_pdbx_sifts_unp_segments”, “_pdbx_sifts_xref_db_segments”, and “_pdbx_sifts_xref_db”.

The “_pdbx_sifts_unp_segments” category displays the UniProtKB sequence segments that correspond to the PDB structure. The “_pdbx_sifts_xref_db_segments” category provides information about the cross-references between the PDB structure and other databases, such as Pfam, CATH, and SCOP2. Finally, the “_pdbx_sifts_xref_db” category displays per-residue annotations between the PDB structure, UniProtKB, and other data resources.

Additionally, the “_atom_site” category was modified to integrate residue-level cross-reference data to the best mapped UniProtKB sequence. The updated PDBx/mmCIF dictionary, including all the new and updated data categories and items, is publicly available at https://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v50.dic/Index/.

Augmenting the core SIFTS process

SIFTS (Structure Integration with Function, Taxonomy and Sequences) is a collaborative resource between the PDBe (Protein Data Bank in Europe) and UniProtKB teams at EMBL-EBI. It is designed to map the protein structures available in PDB to the protein sequences in UniProtKB at the individual residue level. The SIFTS mapping can facilitate the transfer of annotations from a variety of biological resources including the NCBI taxonomy database, IntEnz, GO, Pfam, InterPro, SCOP, CATH, PubMed, Ensembl, Homologene, and automatic Pfam domain assignments based on HMM profiles. The pipeline is run weekly by PDBe as part of the PDB release process.

The mapping between PDB protein structures and UniProtKB protein sequences is manually curated by PDB and UniProtKB annotators. SIFTS performs automatic sequence alignment and generates a residue-level mapping between aligned protein structures and sequences. The pipeline downloads and parses data from various biological resources, which is then loaded into the SIFTS database (Fig. 1). The SIFTS database is queried to derive residue-level annotations for all these biological resources. The SIFTS process generates per-entry XML files and summary CSV and TSV files to distribute all the SIFTS annotations. The SIFTS database also powers all the SIFTS-related endpoints of the PDBe API²⁹.

To update PDBx/mmCIF files with residue-level annotations from SIFTS resources, a new process was added to the existing SIFTS pipeline. For a given PDB entry, the new process reads all the relevant data from the SIFTS database and integrates it into the PDBx/mmCIF file. The integration of SIFTS data uses the extended PDBx/mmCIF dictionary discussed earlier. The new process is implemented in Python and uses gemmi⁷⁷ to parse the PDBx/mmCIF file and write the SIFTS annotations in the corresponding categories. The process is executed as part of the PDBe weekly release pipeline, ensuring up-to-date SIFTS data in the PDBx/mmCIF files every Wednesday to coincide with the weekly PDB release. Currently, residue-level SIFTS annotations for UniProtKB, Pfam, SCOP2, and CATH databases are integrated in the PDBx/mmCIF files.
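
The process described above uses gemmi to write the new categories. As a dependency-free illustration of what such a step produces, the sketch below serialises rows into an mmCIF loop_ using plain string formatting instead of gemmi. The function name `format_loop` is hypothetical, the item names "asym_id" and "unp_num" are assumptions, and the quoting rule is deliberately simplistic compared with the full mmCIF syntax.

```python
def format_loop(category, rows):
    """Render a list of dicts as a minimal mmCIF loop_ block for one category."""
    if not rows:
        return ""
    items = list(rows[0].keys())
    lines = ["loop_"]
    lines += [f"{category}.{item}" for item in items]
    for row in rows:
        values = []
        for item in items:
            value = row[item]
            text = "?" if value is None else str(value)  # "?" marks a missing value
            if " " in text:
                text = f"'{text}'"  # simplistic quoting, enough for this sketch
            values.append(text)
        lines.append(" ".join(values))
    lines.append("#")
    return "\n".join(lines) + "\n"

rows = [
    {"asym_id": "A", "seq_id": 1, "unp_num": None},  # e.g. an unmapped tag residue
    {"asym_id": "A", "seq_id": 2, "unp_num": 101},
]
loop_text = format_loop("_pdbx_sifts_xref_db", rows)
print(loop_text)
```

A production writer would rely on an mmCIF library to handle quoting, multi-line values and dictionary validation, which is presumably one reason the pipeline uses gemmi.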

Data availability

We expanded the PDBe release pipeline with a process that adds SIFTS annotations to the PDBx/mmCIF files for individual structures in the PDB archive. The scientific community can download these files in four ways: from the PDBe entry pages (e.g. https://pdbe.org/7dr0), through direct URLs (e.g. https://www.ebi.ac.uk/pdbe/static/entry/7o9f_updated.cif), via the PDBe download service (https://www.ebi.ac.uk/pdbe/download/api), or from the EMBL-EBI FTP area (https://ftp.ebi.ac.uk/pub/databases/msd/updated_mmcif/).
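For scripted access, the direct URL can be built from a PDB identifier. The sketch below generalises the single example URL given above, so the URL pattern is an assumption. The function names and the four-character identifier check are likewise choices made for this illustration.

```python
import re
import urllib.request

# Pattern inferred from the example URL for entry 7o9f (an assumption).
UPDATED_CIF_URL = "https://www.ebi.ac.uk/pdbe/static/entry/{pdb_id}_updated.cif"


def updated_cif_url(pdb_id: str) -> str:
    """Build the direct URL of the SIFTS-annotated PDBx/mmCIF file."""
    pdb_id = pdb_id.strip().lower()
    # Assumes classic four-character PDB identifiers such as "7o9f".
    if not re.fullmatch(r"[0-9][a-z0-9]{3}", pdb_id):
        raise ValueError(f"not a four-character PDB identifier: {pdb_id!r}")
    return UPDATED_CIF_URL.format(pdb_id=pdb_id)


def download_updated_cif(pdb_id: str, destination: str) -> None:
    """Download the updated file to a local path (needs network access)."""
    urllib.request.urlretrieve(updated_cif_url(pdb_id), destination)


print(updated_cif_url("7O9F"))
```

Calling `download_updated_cif("7o9f", "7o9f_updated.cif")` would then fetch the file, provided the pattern holds for that entry.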

Code availability

To assist users in utilising the updated PDBx/mmCIF files and SIFTS annotations, a Google Colab notebook is available at https://colab.research.google.com/github/PDBe-KB/sifts_data_analysis/blob/main/sifts.ipynb or via GitHub at https://github.com/PDBe-KB/sifts_data_analysis. This notebook provides information on how to parse, extract and filter SIFTS annotations from the updated PDBx/mmCIF files. Additionally, the notebook demonstrates how users can compare various numbering schemes of a given residue across different PDB structures of the same protein.
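The numbering comparison can be sketched independently of the notebook. The example below is not taken from the notebook: the record layout, the function name, and the entry identifiers and residue numbers are hypothetical. It assumes that per-residue records have already been extracted from the updated files.

```python
from typing import Dict, List, Tuple

# (PDB ID, chain, author residue number, UniProtKB residue number)
Record = Tuple[str, str, str, int]


def numbering_by_uniprot(
    records: List[Record],
) -> Dict[int, Dict[Tuple[str, str], str]]:
    """Group author residue numbers by the UniProtKB residue they map to."""
    table: Dict[int, Dict[Tuple[str, str], str]] = {}
    for pdb_id, chain, author_number, unp_number in records:
        table.setdefault(unp_number, {})[(pdb_id, chain)] = author_number
    return table


# Hypothetical records for two structures of the same protein.
records = [
    ("1abc", "A", "15", 215),
    ("1abc", "A", "16", 216),
    ("2xyz", "B", "215", 215),
    ("2xyz", "B", "216", 216),
]
table = numbering_by_uniprot(records)
for (pdb_id, chain), author_number in sorted(table[215].items()):
    print(f"UniProtKB residue 215 is residue {author_number} in {pdb_id} chain {chain}")
```

Author residue numbers are kept as strings here because they can carry insertion codes, whereas the UniProtKB position is a plain integer and serves as the shared key.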

References

wwPDB consortium. Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Res. 47, D520–D528 (2019).
The UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 49, D480–D489 (2021).
Brylinski, M. & Skolnick, J. What is the relationship between the global structures of apo and holo proteins? Proteins 70, 363–377 (2008).
Burra, P. V., Zhang, Y., Godzik, A. & Stec, B. Global distribution of conformational states derived from redundant models in the PDB points to non-uniqueness of the protein structure. Proc. Natl. Acad. Sci. 106, 10505 (2009).
Lobanov, M. Y. et al. ComSin: database of protein structures in bound (complex) and unbound (single) states in relation to their intrinsic disorder. Nucleic Acids Res. 38, D283–D287 (2010).
Gutteridge, A. & Thornton, J. Conformational changes observed in enzyme crystal structures upon substrate binding. J. Mol. Biol. 346, 21–28 (2005).
Vishwanath, S., de Brevern, A. G. & Srinivasan, N. Same but not alike: Structure, flexibility and energetics of domains in multi-domain proteins are influenced by the presence of other domains. PLOS Comput. Biol. 14, e1006008 (2018).
Faezov, B. & Dunbrack, R. L. Jr. PDBrenum: A webserver and program providing Protein Data Bank files renumbered according to their UniProt sequences. PLOS ONE 16, e0253411 (2021).
Oldfield, C. J. et al. Utilization of protein intrinsic disorder knowledge in structural proteomics. Biochim. Biophys. Acta 1834, 487–498 (2013).
Seffernick, J. T. & Lindert, S. Hybrid methods for combined experimental and computational determination of protein structure. J. Chem. Phys. 153, 240901 (2020).
Armstrong, D. R. et al. PDBe: improved findability of macromolecular structure data in the PDB. Nucleic Acids Res. 48, D335–D343 (2020).
Dana, J. M. et al. SIFTS: updated Structure Integration with Function, Taxonomy and Sequences resource allows 40-fold increase in coverage of structure-based annotations for proteins. Nucleic Acids Res. 47, D482–D489 (2019).
Velankar, S. et al. SIFTS: Structure Integration with Function, Taxonomy and Sequences resource. Nucleic Acids Res. 41, D483–D489 (2013).
Mistry, J. et al. Pfam: The protein families database in 2021. Nucleic Acids Res. 49, D412–D419 (2021).
Blum, M. et al. The InterPro protein families and domains database: 20 years on. Nucleic Acids Res. 49, D344–D354 (2021).
Andreeva, A. et al. Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res. 36, D419–D425 (2008).
Sillitoe, I. et al. CATH: increased structural coverage of functional space. Nucleic Acids Res. 49, D266–D273 (2021).
Fleischmann, A. et al. IntEnz, the integrated relational enzyme database. Nucleic Acids Res. 32, D434–437 (2004).
Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 25, 25–29 (2000).
The Gene Ontology Consortium. The Gene Ontology resource: enriching a GOld mine. Nucleic Acids Res. 49, D325–D334 (2021).
Howe, K. L. et al. Ensembl 2021. Nucleic Acids Res. 49, D884–D891 (2021).
Schoch, C. L. et al. NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database J. Biol. Databases Curation 2020, (2020).
Sayers, E. W. et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 49, D10–D17 (2021).
PDBe-KB consortium. PDBe-KB: collaboratively defining the biological context of structural data. Nucleic Acids Res. 50, D534–D542 (2022).
Bittrich, S. et al. RCSB Protein Data Bank: improved annotation, search and visualization of membrane protein structures archived in the PDB. Bioinformatics 38, 1452–1454 (2022).
Bekker, G.-J. et al. Protein Data Bank Japan: Celebrating our 20th anniversary during a global pandemic as the Asian hub of three dimensional macromolecular structural data. Protein Sci. 31, 173–186 (2022).
Andreeva, A., Howorth, D., Chothia, C., Kulesha, E. & Murzin, A. G. Investigating Protein Structure and Evolution with SCOP2. Curr. Protoc. Bioinforma. 49, 1.26.1–1.26.21 (2015).
Piovesan, D. et al. MobiDB: intrinsically disordered proteins in 2021. Nucleic Acids Res. 49, D361–D367 (2021).
Nair, S. et al. PDBe aggregated API: programmatic access to an integrative knowledge graph of molecular structure data. Bioinformatics 37, 3950–3952 (2021).
Westbrook, J. D. et al. PDBx/mmCIF Ecosystem: Foundational Semantic Tools for Structural Biology. J. Mol. Biol. 434, 167599 (2022).
FAIR principles for data stewardship. Nat. Genet. 48, 343–343 (2016).
Waterhouse, A. et al. SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Res. 46, W296–W303 (2018).
Varadi, M. et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50, D439–D444 (2022).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Bourne, P. E. et al. [30] Macromolecular crystallographic information file. in Methods in Enzymology vol. 277 571–590 (Academic Press, 1997).
Young, J. Y. et al. Worldwide Protein Data Bank biocuration supporting open access to high-quality 3D structural biology data. Database 2018, bay002 (2018).
Bourne, P. et al. The Macromolecular Crystallographic Information File (mmCIF). (2001).
Björklund, A. K., Ekman, D. & Elofsson, A. Expansion of protein domain repeats. PLoS Comput. Biol. 2, e114 (2006).
Aslan, F. M., Yu, Y., Mohr, S. C. & Cantor, C. R. Engineered single-chain dimeric streptavidins with an unexpected strong preference for biotin-4-fluorescein. Proc. Natl. Acad. Sci. 102, 8507–8512 (2005).
Mikel, P., Vasickova, P. & Kralik, P. One-plasmid double-expression His-tag system for rapid production and easy purification of MS2 phage-like particles. Sci. Rep. 7, 17501 (2017).
Wu, S. et al. Breaking Symmetry: Engineering Single-Chain Dimeric Streptavidin as Host for Artificial Metalloenzymes. J. Am. Chem. Soc. 141, 15869–15878 (2019).
Aroul-Selvam, R., Hubbard, T. & Sasidharan, R. Domain insertions in protein structures. J. Mol. Biol. 338, 633–641 (2004).
Palm, G. J., Billy, E., Filipowicz, W. & Wlodawer, A. Crystal structure of RNA 3′-terminal phosphate cyclase, a ubiquitous enzyme with unusual topology. Structure 8, 13–23 (2000).
MacGowan, S. A. & Barton, G. J. Missense variants in ACE2 are predicted to encourage and inhibit interaction with SARS-CoV-2 Spike and contribute to genetic risk in COVID-19. bioRxiv 2020.05.03.074781, https://doi.org/10.1101/2020.05.03.074781 (2020).
Hall, M. W. J., Shorthouse, D., Jones, P. H. & Hall, B. A. Investigating structure function relationships in the NOTCH family through large-scale somatic DNA sequencing studies. bioRxiv 2020.03.31.018325, https://doi.org/10.1101/2020.03.31.018325 (2020).
Utgés, J. S., Tsenkov, M. I., Dietrich, N. J. M., MacGowan, S. A. & Barton, G. J. Ankyrin repeats in context with human population variation. PLoS Comput. Biol. 17, e1009335 (2021).
Betts, M. J. et al. Systematic identification of phosphorylation-mediated protein interaction switches. PLoS Comput. Biol. 13, e1005462 (2017).
Li, B., Roden, D. M. & Capra, J. A. The 3D mutational constraint on amino acid sites in the human proteome. Nat. Commun. 13, 3273 (2022).
Xu, Q. et al. Identifying three-dimensional structures of autophosphorylation complexes in crystals of protein kinases. Sci. Signal. 8, rs13 (2015).
Modi, V. & Dunbrack, R. L. Jr. Kincore: a web resource for structural classification of protein kinases and their inhibitors. Nucleic Acids Res. 50, D654–D664 (2022).
Frappier, V., Duran, M. & Keating, A. E. PixelDB: Protein–peptide complexes annotated with structural conservation of the peptide binding mode. Protein Sci. 27, 276–285 (2018).
Gao, J. et al. 3D clusters of somatic mutations in cancer reveal numerous rare mutations as functional targets. Genome Med. 9, 4 (2017).
Flock, T. et al. Universal allosteric mechanism for Gα activation by GPCRs. Nature 524, 173–179 (2015).
Hashemi, S., Nowzari Dalini, A., Jalali, A., Banaei-Moghaddam, A. M. & Razaghi-Moghadam, Z. Cancerouspdomains: comprehensive analysis of cancer type-specific recurrent somatic mutations in proteins and domains. BMC Bioinformatics 18, 370 (2017).
Andreeva, A., Kulesha, E., Gough, J. & Murzin, A. G. The SCOP database in 2020: expanded classification of representative family and superfamily domains of known protein structures. Nucleic Acids Res. 48, D376–D382 (2020).
Finn, R. D. et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 44, D279–285 (2016).
Ferla, M. P., Pagnamenta, A. T., Koukouflis, L., Taylor, J. C. & Marsden, B. D. Venus: Elucidating the Impact of Amino Acid Variants on Protein Function Beyond Structure Destabilisation. Comput. Resour. Mol. Biol. 434, 167567 (2022).
Ofoegbu, T. C. et al. PhyreRisk: A Dynamic Web Application to Bridge Genomics, Proteomics and 3D Structural Data to Guide Interpretation of Human Genetic Variants. Comput. Resour. Mol. Biol. 431, 2460–2466 (2019).
Slodkowicz, G. & Goldman, N. Integrated structural and evolutionary analysis reveals common mechanisms underlying adaptive evolution in mammals. Proc. Natl. Acad. Sci. 117, 5977–5986 (2020).
Zerbino, D. R., Frankish, A. & Flicek, P. Progress, Challenges, and Surprises in Annotating the Human Genome. Annu. Rev. Genomics Hum. Genet. 21, 55–79 (2020).
Tate, J. G. et al. COSMIC: the Catalogue Of Somatic Mutations In Cancer. Nucleic Acids Res. 47, D941–D947 (2019).
Gligorijević, V. et al. Structure-based protein function prediction using graph convolutional networks. Nat. Commun. 12, 3168 (2021).
Coudert, E. et al. Annotation of biologically relevant ligands in UniProtKB using ChEBI. Bioinformatics 39, btac793 (2023).
Huttlin, E. L. et al. Dual proteome-scale networks reveal cell-specific remodeling of the human interactome. Cell 184, 3022–3040.e28 (2021).
Sargsyan, K., Mazmanian, K. & Lim, C. A strategy for evaluating potential antiviral resistance to small molecule drugs and application to SARS-CoV-2. Sci. Rep. 13, 502 (2023).
Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).
Sehnal, D. et al. Mol* Viewer: modern web app for 3D visualization and analysis of large biomolecular structures. Nucleic Acids Res. 49, W431–W437 (2021).
Combs, A. P. Recent Advances in the Discovery of Competitive Protein Tyrosine Phosphatase 1B Inhibitors for the Treatment of Diabetes, Obesity, and Cancer. J. Med. Chem. 53, 2333–2344 (2010).
Han, Y. et al. Discovery of [(3-bromo-7-cyano-2-naphthyl)(difluoro)methyl]phosphonic acid, a potent and orally active small molecule PTP1B inhibitor. Bioorg. Med. Chem. Lett. 18, 3200–3205 (2008).
Scapin, G. et al. The Structural Basis for the Selectivity of Benzotriazole Inhibitors of PTP1B. Biochemistry 42, 11451–11459 (2003).
Barford, D., Flint, A. J. & Tonks, N. K. Crystal Structure of Human Protein Tyrosine Phosphatase 1B. Science 263, 1397–1404 (1994).
Puius, Y. A. et al. Identification of a second aryl phosphate-binding site in protein-tyrosine phosphatase 1B: A paradigm for inhibitor design. Proc. Natl. Acad. Sci. 94, 13420–13425 (1997).
Deshpande, M. et al. PDB ProtVista: A reusable and open-source sequence feature viewer https://doi.org/10.1101/2022.07.22.500790 (2022).
Sehnal, D. et al. BinaryCIF and CIFTools—Lightweight, efficient and extensible macromolecular data management. PLOS Comput. Biol. 16, e1008247 (2020).
Vallat, B. et al. New system for archiving integrative structures. Acta Crystallogr. Sect. D 77, 1486–1496 (2021).
Vallat, B. et al. ModelCIF: An extension of PDBx/mmCIF data representation for computed structure models. J. Mol. Biol. 168021, https://doi.org/10.1016/j.jmb.2023.168021 (2023).
Wojdyr, M. GEMMI: A library for structural biology. J. Open Source Softw. 7, 4200 (2022).

Acknowledgements

This paper is dedicated to the fond memory of our dear collaborator and wwPDB member John Westbrook. John critically reviewed the new “SIFTS-specific” categories in the PDBx/mmCIF data dictionary and provided valuable feedback. We also thank Ezra Peisach for his valuable comments while updating the PDBx/mmCIF data dictionary. Both John Westbrook and Ezra Peisach are members of the RCSB Protein Data Bank, a co-founder of the wwPDB along with PDBj and PDBe. We also thank EMBL and EMBL-EBI for providing the necessary infrastructure to run this process weekly. This work was supported by funding from EMBL, by funding awarded to PDBe by the UK Biotechnology and Biological Sciences Research Council (BB/V004247/1, PI: Sameer Velankar) and by funding awarded to RCSB PDB by the NSF (DBI-2019297, PI: S.K. Burley), supporting development of a Next Generation PDB archive.

Funding

Open Access funding enabled and organized by Projekt DEAL.

Author information

Authors and Affiliations

Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
Preeti Choudhary, Stephen Anyango, John Berrisford, James Tolchard, Mihaly Varadi & Sameer Velankar
AstraZeneca, Biomedical Campus, 1 Francis Crick Ave, Trumpington, Cambridge, CB2 0AA, UK
John Berrisford
Claude Bernard University, Villeurbanne, Lyon, 69100, France
James Tolchard

Contributions

Conceptualization: P.C., J.B., S.V.; Methodology: P.C., J.T., J.B.; Software: P.C., S.A.; Investigation: P.C., J.B., S.V.; Funding: S.V.; Writing - Original Draft: P.C.; Writing - Review and Editing: P.C., M.V., S.V.; All authors read and approved the final version of the manuscript.

Corresponding author

Correspondence to Preeti Choudhary.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Choudhary, P., Anyango, S., Berrisford, J. et al. Unified access to up-to-date residue-level annotations from UniProtKB and other biological databases for PDB data. Sci Data 10, 204 (2023). https://doi.org/10.1038/s41597-023-02101-6

Received: 06 December 2022
Accepted: 23 March 2023
Published: 12 April 2023
DOI: https://doi.org/10.1038/s41597-023-02101-6
Springer Nature Limited
