Background

Deciphering the structure and dynamics of protein-protein interaction (PPI) networks is among the major objectives of systems biology research in the quest for the mechanisms of life. For the human protein interactome in particular, its reconstruction and further exploration of its topology and dynamics are expected to have a significant impact on biomedical research and applications [1, 2]. The number of experimentally supported PPIs has drastically increased for model organisms since 2000 [3–7] and for the human interactome since 2005 [8, 9], mainly due to the gradually increasing number of high-throughput methodologies for PPI detection. The experimentally identified PPIs are mined from the literature and stored in bulk in PPI databases, most of which are repositories for many species. For the human interactome, the various source PPI databases report the protein identifiers at different molecular levels of biological information, and include protein interaction sets of limited overlap due to their own literature mining criteria, differences in PPI incorporation rates from small-scale experiments, as well as differences in methods for PPI selection, curation and updating [10–14]. Therefore, several PPI meta-databases also exist, combining information from multiple source databases [15–23]. However, as each meta-database has distinct curation objectives and methods for data normalization and integration, the use of its combined PPI dataset may not be directly comparable to a direct query of the source databases [11, 12]. In addition, the set of protein nodes of a meta-database network varies depending on the PPIs of the employed source datasets, and it may change upon updating or incorporation of new datasets. This creates heterogeneity between the various PPI meta-databases and hinders the direct comparison among their networks [11].
Because of this inherent heterogeneity, although there have been many studies comparing a variety of PPI datasets [10–14], the way in which the human protein interactome expands via the integration of multiple datasets has not been comprehensively explored; therefore, a global perspective of the biology emerging from the network structure remains elusive.

The objective of the present study is to reconstruct the current experimentally supported network of direct human protein interactions in a global structured way, explore it to obtain information about the fraction of the human proteome that it currently involves, discuss the biological role of proteins within the topology of the network, and identify the proteins presently absent from the network (“orphan” proteins). To this end, we started by defining the UniProtKB manually reviewed human “complete” proteome [24] as the reference set of nodes that the human PPI network can have. Then, we mined five major source PPI databases, i.e. HPRD [25], IntAct [26], MINT [27], DIP [28] and BioGRID [29], for direct interactions exclusively between members of the defined reference protein set. After appropriate updating of the old and filtering of the obsolete protein identifiers, the acquired PPI data were normalized to and combined at the UniProt protein identifier level. We analyzed the reconstructed network to discuss whether the role of proteins revealed by their position in the interactome topology is supported by the currently available knowledge about their function. In addition, based on the verified scale-free structure of the human PPI network [1, 30], we predict the number of connections of the unrepresented proteins and provide a novel perspective on the presently “missing” part of the interactome.

Methods

Protein and PPI datasets

The UniProtKB/Swiss-Prot manually reviewed human “complete” proteome

From UniProtKB, the knowledgebase of the Universal Protein (UniProt) resource [24], we downloaded the tab-delimited files of: (a) the entire set of human UniProt identifiers, and (b) the manually reviewed human “complete” proteome. The latter contained 20,242 UniProt identifiers in the Dec 14 2011 release of UniProtKB downloaded on Jan 23 2012. The two tab-delimited files included all default columns augmented by the cross-references with the EMBL nucleotide, the NCBI nucleotide and the Entrez Gene databases. The text file indicating the correspondence of the secondary to the respective primary UniProt identifier(s) was downloaded too.

The Human Protein Reference Database (HPRD)

HPRD is a manually curated reference database for human protein information [25]. In this study, we used only its binary PPI dataset, which is provided in the form of interactions between HPRD identifiers. From the total 19651 HPRD identifiers in the HPRD version 9, downloaded on Jan 23 2012, 9673 were involved in at least one of the 39204 PPIs reported as binary interactions. Only the primary one-to-one correspondence of the HPRD identifiers to nucleotide sequence identifiers was considered. Any necessary updating or conversion of the nucleotide sequence identifiers to other molecular levels of biological information (i.e. gene or protein level) was carried out through cross-reference with current versions of the relevant databases.

IntAct

IntAct, a main partner of the International Molecular Exchange (IMEx) Consortium [10], is a repository of molecular interaction data for multiple organisms [26]. In the single file supplied by IntAct for external use, including interaction information from all species, PPIs are provided mainly at the UniProt protein identifier level. From the Jan 3, 2012 release downloaded on Jan 30, 2012, only the non-“spoke” PPIs between two human protein identifiers were retained, as the label “spoke” characterizes PPIs originating from protein complex expansion.

The Molecular INTeraction database (MINT)

Similarly to IntAct, MINT [27] is a repository of literature-curated PPIs from multiple organisms and an IMEx consortium partner with PPI information provided mainly at the UniProt protein identifier level. The binary PPI file for human used in the present study was downloaded on Jan 30, 2012 (release date: Dec 8, 2011).

Database of Interacting Proteins (DIP)

DIP [28] is also a collection of experimentally supported protein interactions from multiple organisms and among the first partners of the IMEx consortium. In the PPI file for human downloaded on Jan 30, 2012 (release date: Oct 27 2011), PPIs are provided as interactions between DIP identifiers. Most DIP identifiers are mapped to UniProt protein identifier(s), and most are also mapped to NCBI nucleotide RefSeq identifier(s).

The Biological General Repository for Interaction Datasets (BioGRID)

BioGRID [29] is the most recently initiated among the five source PPI databases used in this study, currently participating in the IMEx consortium as an affiliate member. The PPI file for human was downloaded from the BioGRID web site on Jan 30, 2012 (release 3.1.84 tab2 file). PPIs are provided as interactions between BioGRID identifiers, which are in one-to-one correspondence with Entrez Gene identifiers (GeneID). BioGRID provides extensive information about the experimental method and the nature, i.e. low- or high-throughput, of the experimental set-up used for any PPI detection; however, it neither distinguishes between binary interaction and protein complex data nor provides a relevant filtering criterion. To avoid including PPI data expanded from protein complexes, we opted to keep (a) all physical associations identified in low-throughput setups and (b) from the physical associations detected only in high-throughput experiments, those derived from any of the “protein complementation assay (PCA)”, “reconstituted complex”, “protein-peptide”, “FRET”, “two-hybrid” or “co-crystal structure” methods. Genetic interactions provided in BioGRID were filtered out by default.
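The selection rule described above can be illustrated with a minimal Python sketch. It is not the SSIS implementation used in this study. The record keys (`gene_a`, `gene_b`, `kind`, `throughput`, `method`) and the function name are hypothetical, the method labels are copied from the text and may be spelled differently in an actual BioGRID file, and the rule is applied here per protein pair, which is one possible reading of criteria (a) and (b).

```python
from collections import defaultdict

# Method labels as quoted in the text; spelling in a real file may differ.
BINARY_METHODS = {
    "protein complementation assay (pca)",
    "reconstituted complex",
    "protein-peptide",
    "fret",
    "two-hybrid",
    "co-crystal structure",
}


def select_direct_pairs(records):
    """Return the set of gene pairs kept under rules (a) and (b).

    Each record is a dict with hypothetical keys:
    gene_a, gene_b, kind ("physical" or "genetic"),
    throughput ("low" or "high") and method.
    """
    evidence = defaultdict(list)
    for rec in records:
        if rec["kind"] != "physical":
            continue  # genetic interactions are dropped
        pair = tuple(sorted((rec["gene_a"], rec["gene_b"])))
        evidence[pair].append(rec)

    kept = set()
    for pair, recs in evidence.items():
        # (a) at least one low-throughput observation
        has_low = any(r["throughput"] == "low" for r in recs)
        # (b) otherwise, at least one observation by a listed method
        has_binary = any(r["method"].lower() in BINARY_METHODS for r in recs)
        if has_low or has_binary:
            kept.add(pair)
    return kept
```

Grouping the evidence by pair before applying the rule means that a pair seen only in high-throughput co-complex experiments is dropped, while the same pair is kept as soon as one low-throughput record exists.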

PPI data mining

Direct PPIs with both interactors belonging to the set of the 20,242 primary UniProt identifiers included in the manually reviewed human “complete” proteome were mined from: (a) the binary PPI dataset of HPRD, (b) all PPIs of IntAct not characterized with the term “spoke” in the “expansion” field, (c) the binary PPI dataset of MINT, (d) the DIP dataset, which is provided as containing only binary manually reviewed PPIs, and (e) all physical associations in BioGRID detected in at least one low-throughput experiment or by any of the detection methods mentioned above, if identified only in high-throughput setups.

Protein identifier normalization

Normalization of the protein identifiers to the UniProt identifier level was required for: (a) HPRD, since it reports the interactors at the nucleotide sequence level, (b) BioGRID, which reports the interactors at the gene level and (c) few cases of IntAct, MINT and DIP, for which other than the default UniProt identifier has been used.

Source PPI dataset uploading

To upload, store and handle the five PPI datasets and integrate them into the final reconstructed PPI network, the Microsoft SQL Server (MSSQL) 2008 Developer Edition platform equipped with SQL Server Integration Services (SSIS) was used under the University of Patras academic license. The source PPI dataset uploading was organized in a set of SSIS modules executed at the server side. Each module involves a series of subtasks for the filtering and updating of certain data from the source PPI dataset, along with a large number of checks to monitor and handle exceptions, avoiding thus the contamination of the final database with erroneous or ill-formatted data. Additional file 1 shows the workflow for the IntAct uploading sub-module.

The first subtask of the filtering and updating algorithm involves the extraction of the interactions between human protein identifiers. In sequence, the main interactor identifiers are retained for each PPI. For IntAct, MINT and DIP, the interactors are expected to be represented by a UniProtKB accession number. If the relevant format is not recovered from the algorithm for any of the two interactors, then the non-UniProt interactor identifier is compared against a maintained interactor identifier dictionary. If matched to a dictionary entry and identified as active, the non-UniProt interactor identifier is replaced by the corresponding primary UniProt identifier. If it has become obsolete or cannot be assigned to a UniProtKB accession number, it is removed from the finally uploaded dataset along with all associated PPIs. If active, all isoform UniProt protein identifiers are replaced by their primary UniProt identifier(s). Any remaining non-UniProt interactor identifiers are stored in a separate table, for the curator to appropriately update the interactor identifier dictionary, so that the “patching” process is completed in a second iteration. In HPRD, the interactor identifier dictionary is used to update the nucleotide sequence identifiers to their currently active entries. Notably, among the 9673 HPRD identifiers involved in PPIs, 119 were identified to correspond to obsolete nucleotide sequence identifiers, 4 corresponded to non protein-coding RNAs, while 16 were replaced by new nucleotide sequence identifiers; due to this updating, in three cases, two HPRD identifiers were assigned to the same nucleotide sequence identifier. In BioGRID, all interactors were identified by an active Entrez GeneID, thus no updating was necessary. For the PPIs remaining after the interactor identifier patching step, the algorithm inspects the identifier of the supporting publication(s). If no publication is provided, the PPI is removed from the uploaded dataset. 
If a non-PubMed publication identifier is provided, this is patched based on an in-memory maintained dictionary as described for the interactor identifiers in the previous step. The utilized interactor identifier dictionary was created based on information recovered from the online UniProt converter and the online versions of all relevant databases on February 2, 2012. The Digital Object Identifier (DOI) numbers and IMEx reference identifiers were assigned to their PubMed publication identifiers based on an online converter and the online version of MINT, respectively. After uploading IntAct, MINT and DIP, their PPI data were further processed based on information from UniProtKB to include only interactions between two active primary UniProt identifiers in the human manually reviewed “complete” proteome.
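The interactor patching logic described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the SSIS workflow itself: the function names, the status labels, and the representation of the dictionary (a mapping whose value is `None` for obsolete identifiers) are hypothetical, each secondary accession is assumed to map to a single primary identifier, and isoforms are assumed to carry a numeric suffix after a hyphen.

```python
import re

# Assumed isoform pattern: accession, hyphen, isoform number.
ISOFORM = re.compile(r"^([A-Z0-9]+)-\d+$")


def patch_interactor(identifier, reference_set, secondary_to_primary, dictionary):
    """Map one interactor identifier to a primary UniProt identifier.

    Returns (primary_id, status) with status "ok", "dropped" or "unresolved".
    """
    match = ISOFORM.match(identifier)
    base = match.group(1) if match else identifier  # isoform -> parent entry
    base = secondary_to_primary.get(base, base)     # secondary -> primary
    if base in reference_set:
        return base, "ok"
    if base in dictionary:
        target = dictionary[base]                   # None marks an obsolete entry
        if target is not None and target in reference_set:
            return target, "ok"
        return None, "dropped"
    return None, "unresolved"                       # left for the curator


def patch_ppis(ppis, reference_set, secondary_to_primary, dictionary):
    """Patch (id_a, id_b, pubmed_id) records; return (kept, pending)."""
    kept, pending = [], set()
    for id_a, id_b, pubmed_id in ppis:
        if not pubmed_id:
            continue  # PPIs without a supporting publication are removed
        new_a, status_a = patch_interactor(id_a, reference_set, secondary_to_primary, dictionary)
        new_b, status_b = patch_interactor(id_b, reference_set, secondary_to_primary, dictionary)
        for ident, status in ((id_a, status_a), (id_b, status_b)):
            if status == "unresolved":
                pending.add(ident)
        if new_a is not None and new_b is not None:
            kept.append((new_a, new_b, pubmed_id))
    return kept, pending
```

The `pending` set plays the role of the separate table of unmatched identifiers: after the curator extends the dictionary, running the same function again corresponds to the second “patching” iteration.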

Gene functional classification analysis

Gene functional classification analysis was carried out using the DAVID Bioinformatics Resources version 6.7 [31, 32] by combining all available gene annotation categorizations.

Identification of network characteristics

The identification of the reconstructed PPI network characteristics was carried out using the relevant “Network Analysis” tool of the open source network visualization and analysis software Cytoscape - version 2.8 [33].

Results and discussion

Reconstructing the human protein interactome based on a well-defined set of protein nodes

The novelty of our approach to PPI data integration from major literature-curated source PPI datasets, compared to existing meta-databases, was the a priori definition of the set of nodes of the human protein interactome, considering the UniProtKB manually reviewed human “complete” proteome as a robust, well-defined reference set. Thus, instead of merging PPI information for any protein identifier stored in the source databases, the latter were selectively mined for PPIs exclusively between members of the reference human protein set defined above.

For proper normalization of the source PPI datasets to the UniProt identifier level, it was also important to consider the continuous updating of biological information, since it can lead to changes in the annotation of protein identifiers and in their associations at other molecular levels. Thus, we proceeded to a careful updating of the old and filtering of the obsolete protein identifiers in the source datasets based on the current knowledge about gene annotation. UniProtKB and its cross-references with major resources at the nucleotide sequence and gene levels of molecular information (i.e. NCBI, Entrez Gene and EMBL databases) provided a valuable reference for the appropriate normalization of HPRD and BioGRID identifiers to the UniProt level, and of a small fraction of IntAct, MINT and DIP protein entries that were not provided at the default UniProt level. It is noted that during this conversion to the UniProt level, 1920 BioGRID identifiers reported as human were found to correspond to non-human UniProt identifiers (data not shown), leading thus to the exclusion of their PPIs from the final integrated PPI network.

In the normalized HPRD, IntAct, MINT, DIP and BioGRID files, only the PPIs between two active primary UniProt identifiers in the manually reviewed human “complete” proteome were retained. These datasets were combined keeping one record for each included PPI. A last source of PPI redundancy in the normalized datasets that was eliminated concerns the double reporting of an interaction with the two interactors in opposite order. In some cases, such duplications may have been intentionally included by the curator of a source PPI dataset to report the experimentally supported order of the interactors; this type of duplication was encountered in IntAct and MINT. In most cases, however, they were just a product of the protein identifier conversions at the various stages of the PPI dataset uploading and formatting and had to be eliminated at the integration stage.
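As a minimal sketch of this integration step, the following Python function merges normalized datasets while treating each PPI as an unordered pair, so that A-B and B-A collapse into one record. The function name and the input layout (a mapping from a source name to `(id_a, id_b, pubmed_id)` tuples) are assumptions made for illustration.

```python
def merge_sources(datasets):
    """Combine normalized PPI datasets, keeping one record per unordered pair.

    datasets: dict mapping a source name to an iterable of
    (id_a, id_b, pubmed_id) tuples (hypothetical layout).
    Returns a dict keyed by the ordered pair, with the contributing
    sources and supporting publications of each PPI.
    """
    merged = {}
    for source, ppis in datasets.items():
        for id_a, id_b, pubmed_id in ppis:
            # Canonical order makes (A, B) and (B, A) the same key.
            key = (id_a, id_b) if id_a <= id_b else (id_b, id_a)
            entry = merged.setdefault(key, {"sources": set(), "pubmed": set()})
            entry["sources"].add(source)
            entry["pubmed"].add(pubmed_id)
    return merged
```

Keeping the set of sources and publications per pair also yields the per-dataset contributions and the number of supporting references per PPI without a second pass over the data.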

The final integrated PPI dataset will be referred to as the PICKLE (Protein InteraCtion KnowLedge BasE) dataset. Table 1 shows the number of (a) the direct PPIs in the PICKLE and the normalized source PPI datasets, (b) the UniProt identifiers in the manually reviewed human “complete” proteome covered by each of them, and (c) the publications providing experimental evidence for the PPIs. As expected, the integrated PICKLE dataset is much larger than any of the individual source datasets with respect to the number of PPIs, of the protein interactors and of the supporting publications, verifying the value of PPI resource integration.

Table 1 The size of the reconstructed direct PPI network for the manually reviewed human “complete” proteome

Reconstructing the PPI network in this global structured way:

  • we resolve the issue of potential protein identifier and consequently PPI redundancy in the network originating from the combination of records of multiple databases reporting at different levels of biological information;

  • we determine which protein nodes of the manually reviewed human “complete” proteome remain with no direct PPIs (“orphan” proteins) and discuss this fact in the context of the current information about these proteins;

  • we comment on the proteins represented in the interactome with a high number of PPIs with respect to the importance of their function within the entire network;

  • we consider the human interactome in its entirety, commenting on its future expansion to the maximum potential format in the context of the expected scale-free structure, a fundamental feature of PPI networks [30, 34]. Consequently, the interactome reconstructed in the presented way can only grow in edges (PPIs) between the defined set of protein nodes, while keeping its scale-free form. In this global context, we can argue for the expected number of interactions for the “orphan” protein nodes and for the type of their interactors, suggesting a novel perspective for the currently “missing” part of the network, as it is discussed in the following sections.

The reconstructed interactome covers nearly 60% of the manually reviewed human “complete” proteome

Out of the 20,242 UniProt identifiers in the manually reviewed human “complete” proteome, 11827 (58.4%) were found to have a total number of 75965 direct interactions (Table 1). Gene functional classification analysis (see Methods section) of the proteins currently included in the reconstructed interactome compared to the “orphan” ones did not indicate any functional annotations that could differentiate the one group from the other. Thus, the presently “orphan” proteins are not associated with any apparent functional or subcellular location characteristics that could “hinder” them from binding with other proteins.
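The coverage figure and the size of the “orphan” set follow directly from the reference set and the node set of the network. A small Python sketch, with a hypothetical function name and input layout, makes the bookkeeping explicit:

```python
def coverage(reference_ids, ppis):
    """Return (covered, orphans, percent_covered) for a reference protein set.

    reference_ids: iterable of reference identifiers.
    ppis: iterable of (id_a, id_b) pairs (hypothetical layout).
    """
    reference = set(reference_ids)
    in_network = {protein for pair in ppis for protein in pair} & reference
    orphans = reference - in_network
    percent = 100.0 * len(in_network) / len(reference)
    return in_network, orphans, percent
```

With the numbers reported here, 11827 of 20,242 identifiers give 58.4% coverage, leaving 20,242 − 11827 = 8415 “orphan” identifiers.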

Dataset integration augments the overall network mainly with additional interactions for largely overlapping sets of proteins

HPRD and BioGRID are the main contributors to the overall human PPI network, comprising, respectively, 78.7% and 78.3% of its UniProt identifiers, and 48.9% and 56.1% of its PPIs (Table 1 and Figures 1, 2 and 3). Moreover, exclusion of the information from HPRD and BioGRID would, respectively, decrease the overall network by 20.4% and 18.9% in proteins and 33.2% and 39.1% in PPIs. These characteristics can be partially explained by the number of references used by each of these two databases, constituting 72.2% (HPRD) and 51.8% (BioGRID) of the total number of supporting references. In addition, HPRD is one of the first literature-curated databases, although its rate of reference (and thus PPI) incorporation has declined since 2005 (Figure 3B). BioGRID is currently the fastest growing, having also incorporated a significant part of the HPRD PPI network at the time of its creation [11, 29]. This information complements the observed much higher curation overlap between HPRD and BioGRID compared to the other pairs of source PPI datasets discussed by Turinsky et al. [12]. On the other hand, IntAct has the largest ratio of PPIs per reference, i.e. 12.1, followed by MINT, i.e. 6.7 (Table 1), indicating that a major fraction of their datasets originates from references of high-throughput PPI experiments. Notably, the reconstructed human protein interactome is mainly supported by small-scale studies (Figure 4A); 91% of the references supporting the PICKLE PPI dataset report at most five PPIs, and only 51 publications report more than 100 PPIs. In this respect, PICKLE follows the characteristics of HPRD, currently the main contributor of references to the overall dataset. It is worth mentioning that 84% of the 75965 PPIs in the human interactome are supported by only one reference (Figure 4B) and just 42 PPIs by more than 20 (Additional file 2).
Considering that the degree of confidence of a given PPI increases with the number of independent supporting references [35], it is evident that, apart from exploiting existing models for PPI assessment [36], further targeted experimentation is required for validating the majority of the PPI data.

Figure 1

Source of data in the integrated PICKLE PPI dataset.

Figure 2

The fractions of PICKLE UniProt identifiers (A) and PPIs (B) contributed from combinations of source datasets. The common contributions for the nodes and the edges of the integrated PPI network from all five source datasets constitute 8.7% and 0.2% of the total, respectively. Only the values of the larger than 3% fractions are shown with the exception of the unique contributions from each individual source dataset, for which all fractions are indicated.

Figure 3

The PICKLE reference fraction contributed from source datasets (A) and the reference incorporation rate in the datasets (B). Only 8 common references between the five datasets were identified, confirming that they incorporate knowledge from different studies. In (A), only the values of the larger than 3% fractions are shown with the exception of the unique contributions from each individual source dataset, for which all fractions are indicated.

Figure 4

Distribution of PPIs per reference (A) and supporting references per PPI (B) in PICKLE.

A noteworthy observation of our work, revealing an interesting aspect of the literature-supported PPI data collections, is that the fraction of protein nodes that each source dataset uniquely contributes to the integrated network is much smaller than the corresponding fraction for the PPIs, even for the largest HPRD and BioGRID datasets (Figures 1 and 2). The PPI diversity between the source datasets has been discussed earlier [e.g. 10, 12] and mainly attributed to the fact that the various databases incorporate knowledge from different publications. This was recently presented for the IMEx Consortium member databases [10] and validated in the present study from the substantially small number, i.e. eight, of common references between the five employed datasets (Figure 1). Furthermore, Turinsky et al. [12] showed that the source databases exploit different curation criteria even for the shared publications. Thus, it is striking that, despite the heterogeneous text mining and data curation methods used by the various databases, the integration of multiple source PPI datasets augments mainly the interactome with different PPIs for essentially the same part of the manually reviewed human “complete” proteome.

This observation suggests that the knowledge about direct PPIs that is available in the literature and can be readily identified through existing text mining algorithms refers mainly to the fraction, i.e. approximately 60%, of the manually reviewed human “complete” proteome already incorporated in the interactome, while evidence for PPIs for the remaining 40% cannot be easily spotted. In this context, as PPI information from all high-throughput experiments has been included in at least one of the source datasets, there are two possibilities for the “orphan” proteins: either there is currently no available PPI information in the literature, or, if it exists, it should concern reports of targeted small-scale functional experiments. From such reports, protein interactions can only be indirectly deduced, which requires advanced directed text mining algorithms. Furthermore, such experiments are more likely to refer to PPIs occurring under specialized and/or highly transient or rare physiological conditions, and this type of interaction cannot be easily identified in high-throughput experiments. These implied direct interaction characteristics for the “orphan” proteins support a peripheral role for most of them within the topology of the PPI network. In this context, the actual determination of the “orphan” proteins may assist in directed literature mining to extract potentially existing relevant PPI information from currently unexploited reports or promote further experimentation to verify the argument.

The proteins with a high number of interactions are involved in essential biological processes

Analysis of the integrated human PPI network characteristics indicated that 11577 out of the 11827 UniProt identifiers are connected in one component. The remaining 250 proteins are currently in separate components of up to four nodes, among which are 114 homodimers and 46 heterodimers (Table 2). The largest functional categories for these proteins, as indicated by gene functional classification analysis, were 107 glycoproteins, 64 of which are homodimers, and 89 signal peptides, among which are 65 glycoproteins; 68 of the signal peptides, including 39 glycoproteins, are associated with the extracellular matrix. While the network diameter, i.e. the greatest distance between two protein nodes, was determined to be 12, the characteristic path length is 3.69. This feature, along with the radius of 1 and the high value of the shortest paths metric (i.e. 95%), indicates a well-connected network, despite its low density (i.e. 0.001) (Table 2). The distribution of PPIs per protein, i.e. protein degree, indicated that 53% of the proteins have up to five interactions (Figure 5), while 16 UniProt identifiers had more than 300 PPIs each (Table 3). This pattern is consistent with the “network biology” theory supported by Barabasi [30, 37], according to which the human PPI network is expected to follow a scale-free structure with few protein hubs and the majority of the protein nodes having a small number of interactions. Indeed, even though it is currently incomplete and many interactions are still in need of verification, the degree distribution of the reconstructed human protein interactome correlates well with a power law (Figure 5), implying that the current PPI network already suggests the role of most proteins as high-, middle- or low-degree nodes.
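The degree distribution and its agreement with a power law can be illustrated with a short Python sketch. The network characteristics in this study were obtained with Cytoscape; the code below is only an assumed, simplified equivalent that fits a straight line to the log-log degree distribution by ordinary least squares, counts a self-interaction once towards the degree, and uses the standard density formula for a simple undirected graph. All function names are hypothetical.

```python
import math
from collections import Counter


def degree_counts(edges):
    """Map each degree k to the number of proteins with k interactions."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        if b != a:          # a self-interaction is counted once (assumption)
            degree[b] += 1
    return Counter(degree.values())


def fit_power_law(counts):
    """Fit n_k ~ c * k**(-gamma) by least squares on log10 values.

    counts: mapping degree k -> number of nodes n_k (at least two degrees).
    Returns (gamma, c, r_squared).
    """
    xs = [math.log10(k) for k in counts]
    ys = [math.log10(n) for n in counts.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy else 1.0
    return -slope, 10 ** intercept, r_squared


def density(n_nodes, n_edges):
    """Density of a simple undirected graph: 2E / (N (N - 1))."""
    return 2.0 * n_edges / (n_nodes * (n_nodes - 1))
```

As a rough check, `density(11827, 75965)` evaluates to about 0.0011, in line with the reported density of 0.001; the small difference from an exact value may depend on how self-interactions are treated.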

Table 2 The characteristics of the integrated PPI network
Figure 5

The distribution of direct interactions for the UniProt identifiers in the PICKLE network. The red line indicates the power-law fit (logarithmic scale); the related equation and R² correlation are also shown. UniProt identifier P0CG48 (Polyubiquitin, UBC) is identified with the largest number of interactions, i.e. 1112; note the deviation of this UniProt identifier degree from the power-law fit.

Table 3 The 16 UniProt identifiers with more than 300 interactions in the integrated PPI network

The sixteen proteins determined with more than 300 PPIs (Table 3) are mainly implicated in the regulation of apoptosis (10 proteins), the MAP kinase signalling pathway (6 proteins) and the cell cycle (7 proteins). A full list of the most significant protein ontology clusters for these high-degree proteins is shown in Additional file 3. Notably, eight of them have been associated with pathways in cancer, while subsets of nine are involved in transcription regulation, covalent chromatin modification or the ubiquitin-like modifier (ubl) conjugation pathway. This information indicates that the observed central role of these proteins within the topology of the PPI network is not a mere result of them being extensively studied, i.e. “study bias”, but correlates well with the current knowledge about their function, as has also been suggested earlier for the cancer-associated proteins [38, 39]. An additional fact that argues against “study bias” for these proteins is that, apart from various targeted small-scale experiments, many of their direct interactions have also been detected in independent high-throughput setups. For example, at least 54 interactions of the cellular tumor antigen p53 [8, 40], 257 interactions of the 14-3-3 protein zeta/delta [41], 212 interactions of the Myc proto-oncogene protein [42] and 48 interactions of the TNF receptor-associated factor 6 [43] have been identified in high-throughput studies.

Polyubiquitin: a hub to be discussed

Polyubiquitin (UniProt identifier: P0CG48, UBC) was the protein identified with the largest number of interactions in the reconstructed network. It interacts with more than a thousand, i.e.: 1112, members of the manually reviewed human “complete” proteome, while the second ranked high-degree node, i.e.: TP53 (UniProt identifier: P04637), has 476 interactions. Notably, this much larger number of interactions for polyubiquitin compared to the other protein hubs deviates from the scale-free network structure, assigning a centralized role to a single protein (Figure 5). Querying the PICKLE dataset, we identified HPRD, IntAct, MINT, DIP and BioGRID reporting, respectively, 19 (2 unique), 5 (0 unique), 143 (48 unique), 53 (15 unique) and 1423 (909 unique) polyubiquitin PPIs. Without exhausting our search regarding polyubiquitin PPI supporting publications, we detected that our integrated dataset contains interactions from studies investigating polyubiquitin function in the context of protein degradation (e.g. [44]). Polyubiquitin can be covalently linked to a protein through an isopeptide bond and mark it for degradation at the proteasome. However, it is questionable whether this one-sided polyubiquitin action on a protein should be included in the interactome or should be considered in the post-translational modification (PTM) network [45, 46]. The latter could explain why, apart from BioGRID, the other source databases used in this work have considered a limited number of polyubiquitin PPIs. In the context of the non-directional PPI network, the existence of an interaction link from one protein to another directly implies a link in the opposite direction, too. Consequently, the absence of a protein and thus its interactions will affect its neighbours and add a certain stress to the network, the extent of which depends on the network structure and dynamics. 
In the case of unidirectional polyubiquitination of a protein that leads it to degradation, the absence of the protein will neither affect polyubiquitin nor exert a stress on the rest of the polyubiquitin substrates. Thus, this type of action of one protein on another should be considered a category separate from the non-directional protein-protein interactions and modelled differently for its role in cell physiology dynamics. On the other hand, the monoubiquitination of proteins for regulatory purposes (e.g. [47]) fits into the notion of the non-directional PPI network. However, even in this case, it is questionable whether ubiquitin itself or rather the ubiquitinated proteins should be included as nodes of the network. In this context, the incorporation of ubiquitin PPIs in the interactome should be cautiously curated. This argument is also relevant to other proteins involved in interactions of a similar type, like the small ubiquitin-related modifiers (SUMO1-4) and neddylin (NEDD8), engaged in the sumoylation and neddylation reactions, respectively.

The bulk of the proteins currently absent from the network should have up to four interactions

As shown, the reconstructed human protein interactome follows the scale-free structure with a very good correlation (Figure 5). The part of the network that lowers the correlation coefficient comprises the proteins with up to four interactions. The difference between the data and the power-law curve for a nearly perfect fit is calculated to be about 8300 UniProt identifiers, with more than 6500 of them corresponding to a degree equal to 1. Consequently, with 8415 UniProt identifiers not currently included in the interactome (“orphan” proteins), it could be speculated that the vast majority of them should have up to four interactions with nodes in the same degree group. This anticipated network structure implies that the core of the human protein interactome has essentially been revealed and could provide a reasonable explanation for the current lack of PPI information for about 40% of the human proteome, agreeing with a specialized “peripheral” role for most of these “orphan” proteins. Indeed, with most of them expected to have a single PPI, and in general no more than four, with similarly poorly connected proteins, the probability of them being involved in specialized physiological conditions is high. This speculation is further corroborated by the fact that interactions for these proteins cannot be easily confirmed in PPI identification experiments, as discussed above.

Conclusions

We have obtained an integrated set of direct PPIs, normalized and free of outdated protein identifier annotations, referring to the well-defined UniProtKB manually reviewed human “complete” proteome. We suggest that this PPI network, involving approximately 60% of the “complete” proteome, represents the core of the human protein interactome. Based on a global view of the way in which the current network will have to expand to its maximum potential in accordance with the scale-free theory, we provide a novel perspective on its currently “missing” part. We envisage that the proteins not yet identified in direct PPI assays may participate in specialized biological functions, interacting with a limited number of other poorly connected proteins. Now determined, this set of “orphan” proteins may trigger targeted text mining efforts or appropriately designed functional experiments for the identification of any relevant PPIs. In effect, we suggest that this reconstructed human interactome already provides a useful tool for generating valuable working hypotheses for the investigation of important biological processes and molecular functions in the context of biomedical research and applications.