Background

The main objective of functional proteomics analysis is often to estimate changes in the amount of proteins found in complex biological systems in response to physiological and clinical factors such as cell development, disease progression, or drug treatment. In particular, one of the key issues in proteomics research based on tandem mass spectrometry (MS/MS) is the identification of protein species and the characterization of their expression changes in normal and disease samples. Three analysis techniques are often required in an MS/MS study: expressed peptide identification, target protein characterization, and quantification [1]. With hundreds to tens of thousands of fragment ion spectra generated per experiment, assigning the spectra to peptide sequences, identifying the proteins represented by each peptide, and estimating their abundances in the analyzed sample require complex computations and remain statistically challenging [2].

Quantification of protein expression using mass spectrometry (MS) is often required for the discovery of protein biomarkers associated with cancer, their response to stimuli, cell signalling cascades and the function of cell cycle-promoting proteins, and various biomedical investigations [3]. Two categories of quantification methods for MS data have been used: stable isotope labelling quantification and label-free quantification [2].

Several stable isotope-based quantification methods have been introduced based on different labelling reagents that can be chemically bound to peptides [4]. It is, however, difficult to simultaneously quantify the amount of proteins/peptides in multiple samples because of the limited number of labelling reagents available [5]. Moreover, current practical applications can typically quantify, at most, a few hundred peptides, measuring relative expression values for each pair of contrasting samples. Furthermore, the high cost of labelling reagents makes these quantification methods difficult to apply routinely for the characterization of the global proteome.

On the other hand, label-free quantification, which does not require stable isotope labelling, has the advantages of low cost and simplicity. Currently, two label-free methods are available to measure expression levels of peptides: spectral counting and spectral feature analysis. The spectral counting method estimates peptide expression levels by counting spectra (from MS/MS data) or by estimating the integrated ion intensities [6, 7]. The spectral feature analysis method quantitatively determines peptide expression levels by comparing three-dimensional patterns (retention time, m/z and intensity) between different samples [8-13].

However, these label-free quantitative methods have two main shortcomings. The first limitation is the large number of false-positive discriminative peptides that result from chromatographic variability between LC-MS experiments. In spectral feature analysis, after two candidates with the same MS1 retention time and m/z are found, the difference in their MS1 intensities is used to define the peptide levels. Therefore, spectral feature analysis requires stringent reproducibility [3, 8] and additional pre-processing such as LC normalization or retention time alignment [14, 15].

The second limitation is that spectra counting cannot be performed without peptide identification because the relative peptide levels can be quantified only after peptide identification. In peptide identification, MS/MS spectra are verified using a database searching algorithm or spectral library searching algorithm such as SEQUEST, MASCOT, or SpectraST. Specifically, database search algorithms calculate score functions to compare the experimental MS/MS spectra with theoretical MS/MS spectra of peptides derived from protein sequence databases. The pool of theoretical MS/MS spectra is restricted by user-specified criteria such as mass tolerance, proteolytic enzymes, and the types of post-translational modification [2, 16]. A number of spectra may not be assigned to the correct peptides for diverse reasons, including deficiencies of the scoring scheme implemented in the database search tools, sequence variations (e.g., single nucleotide polymorphisms, SNPs), omissions in the database searched, post-translational or chemical modifications of the peptide analyzed, and the observation of genomic sequences that are not anticipated (e.g., splice forms, somatic rearrangement, and processed proteins) [17]. For all these reasons, a large number of important peptides may be lost during the database search.

Instead of matching acquired MS/MS spectra against theoretically predicted spectra, MS/MS spectra can also be assigned to peptides by matching them against a spectral library. The spectral library is compiled from a large collection of experimentally observed MS/MS spectra identified in previous experiments [18]. Generally, a set of spectra of known peptide sequences is collected into a library and used as a reference. An experimental spectrum may then be identified by a similar match in the library. However, a spectrum can only be identified in this way if a spectrum of the same peptide was observed previously and entered into the library. Library searching methods are therefore well suited to targeted proteomics, in which one seeks not to discover previously unseen peptides but rather to find and quantify expected peptides of interest in the sample [19].

To overcome these limitations of label-free quantification methods, we propose a novel spectral counting method that estimates a peptide's abundance by counting MS/MS spectra after comparing and clustering all experimentally observed spectra. The approach rests on the following observation. Because the same peptide may be fragmented multiple times or repeatedly observed at different time points in an MS/MS run, multiple spectra may be acquired for the same peptide. In other words, duplicated spectra are ubiquitous in large-scale proteomics data [20]. Our method thus attempts to identify and group all the duplicate spectra, which allows us to quantify the amount of peptide found in complex biological systems without searching through databases or using LC normalization.

For the given spectra, our method, referred to as the Quantification method derived by Finding the Identical Spectra set for a Homogenous peptide (Q-FISH), employs a two-stage clustering algorithm to determine whether they are from the same peptide with homogeneous spectral patterns. The Q-FISH algorithm employs two similarity measures: the difference between two precursor ions and the correlation coefficient of moving window averages. Subsequently, the algorithm clusters spectra from the same peptide through all plausible pair-wise comparisons. By counting the spectra in each cluster, we can estimate the amount of each peptide. Figure 1 summarizes the workflow of the proposed Q-FISH algorithm.

Figure 1

Workflow chart. This figure shows a schematic of the analysis process performed by the Q-FISH algorithm.

Our proposed algorithm was applied to identify differentially expressed peptides from real data obtained during a Nano-LC-MS/MS experiment performed on human HCC and normal liver tissue samples.

Results & Discussion

We introduced and tested the Q-FISH algorithm to identify and quantify all expressed peptides from an MS/MS dataset by clustering and counting spectra with homogeneous spectral patterns. To test our algorithm, we performed a Nano-LC MS/MS experiment with triplicate human hepatocellular carcinoma (HCC) and normal liver tissue samples. For a total of 44,318 MS/MS spectra obtained through three MS/MS analyses of each of the two samples, Q-FISH yielded 14,748 clusters. More specifically, 5,777 clusters were identified only in the HCC sample, 6,648 clusters only in the normal sample, and 2,323 clusters in both HCC and normal samples. For the purpose of comparison, we also ran SEQUEST and SpectraST to identify peptides. However, only 4,824 of the 44,318 spectra were identified using SEQUEST, corresponding to a total of 1,326 peptides. Generally, most database search algorithms, including SEQUEST, assign experimental spectra to peptides by comparing the experimental data with theoretical spectra generated from peptide sequences. It should be noted that neither the best match nor a high search score is guaranteed to be a true match, especially for novel protein targets. Therefore, many peptides could be misidentified, or not identified at all, unless their sequences are present in the searched sequence database. In our experiments, a large number of experimental spectra (89.12%, namely 39,494 of a total of 44,318 spectra) could not be used for peptide identification using SEQUEST. SpectraST, in turn, identified 5,549 spectra and 3,295 peptides. That is, a large number of spectra still could not be used for peptide identification by SpectraST (87.48%, namely 38,769 of a total of 44,318 spectra). In contrast, our proposed method directly compares all observed experimental spectra to discover differentially expressed peptides without a loss of observed spectra.

In Figure 2, the standardized intensities of the experimental spectra are plotted as positive values (upper part) and those of the reference spectrum as negative values (lower part). Specifically, Figure 2(a), which illustrates an example of one cluster with nine similar spectra, shows the spectral patterns of the MS/MS spectra as well as the reference spectrum for the clustered spectral set. The overall patterns look quite similar, and all nine spectra seem to have almost identical patterns. Table 1 shows the search results returned by SEQUEST and SpectraST. For spectral set S366006, all nine spectra were identified as the same peptide sequence, "SIFSAVLDELK", by both SEQUEST and SpectraST, with XCorr above 1.97. In addition, the reference spectrum for the clustered spectral set was identified as the peptide sequence "SIFSAVLDELK" with a SEQUEST score of XCorr = 2.96. This analysis indicates that these spectra can be regarded as the spectra of a homogenous peptide. In other words, each cluster can be expected to be composed of spectra from the same peptide.

Figure 2

Pattern-plots of reference spectrum and experimental MS/MS spectra in clustered spectral sets. This figure shows pattern-plots of the experimental MS/MS spectra, plotted using positive intensities (upper part), and the reference spectrum, plotted using negative intensities (lower part). In (a), all nine spectra were identified as the same peptide, while in (b) two of the eleven spectra were not identified by SEQUEST and in (c) only four of the seven spectra were identified by SpectraST, although the pattern-plots are very similar.

Table 1 Results of SEQUEST & SpectraST for spectra in clustered spectral sets

Similarly, Figures 2(b) and 2(c) each show the spectral patterns of the reference spectrum and the experimental spectra of a single cluster. It should be noted that the overall patterns look quite similar and all spectra pairs are characterized by high correlation coefficients. However, while all spectra in S1157004 could be identified by SpectraST, two of the eleven spectra could not be identified by SEQUEST, as shown in Table 1. Conversely, all spectra in S65002 were identified by SEQUEST with high scores, while three spectra could not be identified by SpectraST. In other words, if we relied only on conventional peptide identification such as SEQUEST or SpectraST, these spectra would have been excluded despite their similar peak patterns. In contrast, our Q-FISH algorithm was able to include these spectra without a loss of information.

In this study, we were interested in identifying proteins and characterizing their differential expressions in normal and HCC samples. Hence, we first focused on the 2,323 clusters, which were observed in both samples. Figure 3 and Table 2 show a scatter plot and a correlation matrix with the number of spectra in the same cluster, which were obtained through the replicated experiments on HCC and normal tissue samples, respectively. It is worth noting that the number of spectra in the same cluster showed high correlations (0.7178~0.8315), while the number of spectra for different samples showed weak correlations (0.0654~0.1549). For a given spectral set, the reference spectrum was estimated by averaging the relative intensities of the spectra. Consequently, the reference spectrum corresponds to the number of expressed spectra in the normal and HCC samples. We computed the false clustering rate (FCR) on the 2,323 clusters shared by the HCC and normal samples. Among these clusters, 1,571 clusters had FCRs smaller than 0.05. Our next step was to perform a beta-binomial test to isolate differentially expressed peptides (DEPs) [21]. The result showed that only 84 out of the 1,571 reference spectra were characterized by different spectral counts between the HCC and normal tissue samples. Also, 5,777 clusters were observed only in the HCC sample and 6,648 clusters only in the normal sample by Q-FISH. Among these clusters, 1,571 and 1,556 clusters, respectively, had FCRs smaller than 0.05.

Figure 3

Scatter plot between different samples and within replicated samples. This figure shows scatter plots of the number of spectra in clustered sets obtained through the replicated experiments on HCC and normal tissue samples. The two black boxes show the relationships between the numbers of spectra within the replicated HCC samples and within the replicated normal samples, while the gray box shows the relationships between the numbers of spectra in clustered sets of HCC and normal samples.

Table 2 Correlation matrix and the number of shared spectral clusters between different samples and within replicated samples

To compare the performance of Q-FISH with the spectral counting method based on SEQUEST, we used the human liver data and validated the results through a literature search. For the human liver data, Q-FISH provided 1,571 differentially expressed clusters for the HCC sample and 1,556 for the normal sample, among which 57 and 99 clusters were identified by SEQUEST in the HCC and normal samples, respectively. SEQUEST, in turn, provided 93 and 145 peptides for the HCC and normal tissue samples, respectively. Among the 57 identified clusters in the HCC sample, 37 clusters were found to be over-expressed only by Q-FISH, while 20 peptides/clusters were found by both Q-FISH and SEQUEST. A further 73 peptides were identified only by SEQUEST. In the normal sample, 49 peptides/clusters were identified as over-expressed by both Q-FISH and SEQUEST, while 50 and 96 peptides/clusters were identified as over-expressed only by Q-FISH and only by SEQUEST, respectively.

We compared the two sets of results through a literature search. We assumed a true match if a peptide had been reported in the previous cancer literature. While there is a certain degree of uncertainty about reported protein biomarkers, this assumption is not biased towards either of the two methods and allowed us to compare their performance statistically. For example, alpha-2-macroglobulin (A2M), annotated by "VSVQLEASPAFLAVPVEK", was reported to be over-expressed in HCC samples [22]. This peptide was found to be over-expressed by Q-FISH, but under-expressed by spectral counting analysis with SEQUEST. The full list of peptides is given in Additional file 1. Based on this literature search, the 2 × 2 confusion tables can be constructed as shown in Table 3.

Table 3 2 × 2 tables for literature search results of Q-FISH and SEQUEST

For the Q-FISH result, 65 peptides were found in the literature: 31 for the HCC sample and 34 for the normal sample. Among the 31 peptides for the HCC sample, 25 are reported as over-expressed in the literature, and are assumed to be correctly identified. Among the 34 peptides for the normal sample, 17 are reported as under-expressed in the literature, and thus are assumed to be correctly identified. The remaining 6 and 17 peptides, respectively, are assumed to be incorrectly identified.

For the SEQUEST result, 93 peptides were reported in the literature: 43 for the HCC sample and 50 for the normal sample. Among them, 34 and 24 peptides were correctly identified, while 9 and 26 peptides were incorrectly identified, respectively. Based on these numbers, the accuracy measure showed that Q-FISH (accuracy = 64.62%) has slightly higher accuracy than SEQUEST (accuracy = 62.37%). This comparison showed that Q-FISH performed as reliably as SEQUEST, despite the comparison giving SEQUEST a natural advantage.
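The two accuracy values can be reproduced directly from the counts reported above. The following is a minimal sketch; the function name `accuracy` is introduced here for illustration only, and accuracy is taken to be the number of correctly identified peptides divided by the number of peptides found in the literature.

```python
def accuracy(correct_hcc, correct_normal, total):
    """Fraction of literature-supported peptides with the reported direction of change."""
    return (correct_hcc + correct_normal) / total


# Counts taken from the text: correct in HCC, correct in normal, total in literature.
qfish = accuracy(25, 17, 65)
sequest = accuracy(34, 24, 93)

print(f"Q-FISH:  {qfish:.2%}")    # 64.62%
print(f"SEQUEST: {sequest:.2%}")  # 62.37%
```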

Table 4 provides a list of potential protein biomarkers. Q scores were calculated by averaging the correlation coefficients between the moving averages of the reference spectrum and those of the experimental spectra in the clustered spectral set. A relatively high Q score indicates that the reference spectrum represents the clustered spectral set well.
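One possible reading of the Q score is sketched below. It assumes that each spectrum is already represented by its moving-average profile, that the reference profile is the element-wise mean of the member profiles, and that the Q score is the mean Pearson correlation between the reference and each member. The function names are hypothetical and are not taken from the Q-FISH software.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equally long profiles (0.0 if either is flat)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(ss_a * ss_b) if ss_a > 0 and ss_b > 0 else 0.0


def reference_profile(profiles):
    """Element-wise mean of the member profiles of one cluster."""
    n = len(profiles)
    return [sum(values) / n for values in zip(*profiles)]


def q_score(profiles):
    """Mean correlation between the reference profile and each member profile."""
    ref = reference_profile(profiles)
    return sum(pearson(ref, p) for p in profiles) / len(profiles)


cluster = [            # hypothetical moving-average profiles of three spectra
    [0.1, 0.9, 0.3, 0.0],
    [0.2, 1.0, 0.3, 0.1],
    [0.1, 0.8, 0.4, 0.0],
]
print(round(q_score(cluster), 3))
```

Because the moving window average is a linear operation, averaging the profiles gives the same reference as averaging the standardized spectra first and smoothing afterwards.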

Table 4 Lists of differentially expressed peptides in HCC and normal sample.

To find potential biomarkers in each sample, we searched the reference spectra of the clusters using SEQUEST. We found 50 and 95 peptides as candidate biomarkers from the HCC sample and the normal sample, respectively, as shown in Table 4. Among them, 24 peptides in the HCC sample and 56 peptides in the normal sample are known biomarkers for human liver cancer. In addition, 22 reference spectra among the 84 DEPs were identified by SEQUEST, of which 13 peptides are also known markers for human liver cancer.

As shown in Table 4, carbamoyl-phosphate synthetase 1 (CPS1) is annotated by various sequences such as "MEYDGILIAGGPGNPALAEPLIQNVR", "SIFSAVLDELK", "TAVDSGIPLLTNFQVTK" and "GLNSESMTEETLK". These sequences are underexpressed in the HCC sample. Kinoshita et al. [23] performed differential gene display analysis (DGDA) to compare the intensities of polymerase chain reaction (PCR) products and evaluated the degrees of mRNA expression in HCC tissue samples and noncancerous hepatitis tissues. They confirmed that CPS1 is underexpressed. Specifically, CPS1 synthesizes carbamyl phosphate from bicarbonate, adenosine triphosphate (ATP) and ammonia. A genetic mutation of CPS1 was identified as the source of hyperammonemia. Underexpression of the CPS1 gene in HCC tissue had previously been reported in rats, but their study was the first to report this finding in humans [23]. Heterogeneous nuclear ribonucleoprotein C (HNRNPC), annotated as "MIAGQVLDINLAAEPK", and actin, cytoplasmic 1 (ACTB), annotated as "DLYANTVLSGGTTMYPGIADR", were found to be over-expressed in the HCC sample [24, 25]. In contrast, glutathione S-transferase (GSTA1), annotated as "NDGYLMFQQVPMVEIDGMK", has been reported as down-regulated in human HCC samples [26]. Moreover, fatty acid-binding protein (FABP1), annotated as "SVTELNGDIITNTMTLGDIVFK", and Isoform 1 of Liver carboxylesterase 1 (CES1), annotated as "EGYLQIGANTQAAQK", are both characteristic of the HCC sample [27, 28].

As shown in Table 4, many peptides are also known to be associated with cancer. Specifically, EMILIN-1 (EMILIN1), elongation factor 1-delta (EEF1D), galectin-7/p53-induced gene 1 protein (LGALS7), hemoglobin subunit beta (HBB) and malate dehydrogenase 2 (MDH2) are differentially expressed in breast cancer cells [29-31]. In particular, the LGALS7 gene is known to be over-expressed relative to control cells, and it was also over-expressed in our results. Table 4 provides a list of the different types of cancer associated with specific genes [28-34]. Figure 4 shows a scatter plot of the spectral counts of the normal and HCC samples. The x axis and y axis represent the number of expressed spectra in the HCC and normal samples. Specifically, the symbol "▲" indicates DEPs identified with the use of SEQUEST, whereas the symbol "●" indicates unidentified DEPs. Notably, 62 DEPs were not identified by SEQUEST despite showing significant differences in the beta-binomial test.

Figure 4

Scatter plot of spectral counts between normal and HCC samples. This figure plots the number of spectra in clustered sets in the HCC and normal samples. The x axis and y axis represent the number of expressed spectra in the HCC and normal samples. Specifically, the grey triangle indicates DEPs identified with the use of SEQUEST, whereas the black circle indicates unidentified DEPs.

We believe there are several reasons why 62 DEPs were not identified by SEQUEST. First, the "one-size-fits-all" search parameter values of SEQUEST may not have been appropriate for these protein targets. Second, these unidentified DEPs may carry other post-translational modifications or sequence variations (e.g., alternative splicing), or may provide insufficient peptide ion information.

We re-ran SEQUEST with many different parameter options, allowing phosphorylation modification and two missed cleavages and using other sequence databases (NCBI nr and EST human). However, even with these parameter options, SEQUEST did not identify the remaining 62 DEPs. Next, we tried to identify the 62 reference spectra using other search engines such as MASCOT and SpectraST. MASCOT identified 2 DEPs, Alcohol dehydrogenase 1A (ADH1A) and Isoform 2 of Myosin-9 (MYH9), but SpectraST did not identify any DEPs. The remaining 60 DEPs could not be identified by these search engines. To identify these DEPs, further experiments may be needed. For example, additional MS/MS experiments such as MRM (Multiple Reaction Monitoring) or SRM (Selected Reaction Monitoring) can be carried out within the range of the corresponding retention times for all the unidentified spectra in order to collect more detailed peptide information.

Conclusions

In this paper, we proposed a novel method to estimate a peptide's abundance by counting MS/MS spectra clustered through the direct comparison of all experimentally observed spectra. For a given pair of spectra, our method can be used to answer the question of whether they are from the same peptide without computationally searching for them in a theoretical library of protein spectra. By examining all possible pair-wise comparisons, our method produces a set of spectra for each peptide and enables us to estimate the amount of peptides found in biological samples of interest by counting the spectra in each cluster. Since our proposed method compares all possible pairs of experimental spectra, it can discover even modified and unknown peptides, which may not be searchable in a theoretical spectral library. For practical MS/MS experimental data, a large proportion of spectra are often misidentified or completely lost during a computational database search. In contrast, Q-FISH can retain these spectra without any loss of information. As demonstrated in our practical examples, the majority of DEPs derived by Q-FISH were found to be highly related to various cancers, and these were not discovered by other methods.

We thus believe our Q-FISH algorithm will be highly useful in the identification of novel peptides [19]. Also, Q-FISH has the potential to find applications in many other practical proteomic studies. For example, it can be used to discover unknown biomarkers or drug targets through the comparison of proteins with statistically significant difference and by quantifying sets of identical peptides in multiple samples. Unknown spectral clusters can often come from non-peptide contaminants as revealed by a recent publication [35]. Q-FISH can evaluate the significance of such unknown clusters, some of which can be novel biomarkers, requiring further experimental confirmation by de novo sequencing, unrestricted sequence database search (using e.g. InsPect [36]) or spectral library search (using e.g. pMatch [37]).

Methods

Sample Preparation, Nano-LC-ESI-MS/MS

Hepatocellular carcinoma (HCC) tumour tissue and adjacent healthy liver tissue samples were collected under the guidelines of the Institutional Review Board (IRB) established at Yonsei Medical Center (Seoul, Korea). All tissues were prepared, and in-solution tryptic digestion was subsequently performed as previously described [20]. Nano-LC-MS/MS analysis was performed on an Agilent Nano HPLC 1100 system using a linear trap quadrupole (LTQ) mass spectrometer (Thermo Electron, San Jose, US). LC-MS/MS was performed as previously described [38]. The peptide fractionation was performed by means of strong cation exchange (SCX) chromatography at a flow rate of 0.5 mL/min, where absorbance of the column effluent was maintained stable at 280 nm for 40 min. Fractions were automatically transferred every 0.5 min into a 96-well microplate.

Nano-LC MS/MS experiments were carried out three times on two different samples (human liver cancer and normal tissues), and 44,318 MS/MS spectra were generated. These tandem mass spectrometry data were first analyzed by means of the database search software SEQUEST (Bioworks 3.2, ThermoFinnigan, San Jose, US). The sequence database, downloaded from the European Bioinformatics Institute (EBI), was the International Protein Index (IPI) human version 3.61. The protein sequence database was then combined with its reversed sequences. The maximum number of missed cleavage sites was set to 1, and only tryptic cleavage after arginine and lysine was allowed. The mass tolerance of the precursor peptide ion was set to 3.0 Da, while the fragment ion tolerance was set to 0.5 Da. These tolerance values were chosen to minimize the FDR when XCorr > 1.5 [39]. Carboxyamidomethylation of cysteine and oxidation of methionine were allowed as modifications [40]. All peptides assigned to reversed sequences were removed before proceeding to peptide identification to limit false-positive identifications. We chose XCorr thresholds of 1.44 (+1), 1.97 (+2) and 3.13 (+3) for the respective charge states, each of which yielded an FDR close to 0.05, and required DeltaCn to be equal to or greater than 0.1. These score criteria were considered to ensure high confidence in the results of protein identification [41]. The spectra were also analyzed by means of the spectral library search software SpectraST, which was initially developed by the Institute for Systems Biology (ISB) and the National Institute of Standards and Technology (NIST). SpectraST is integrated with the Trans-Proteomic Pipeline (TPP) software suite, which provides the supporting functionalities necessary in a full proteomics data analysis pipeline. The SpectraST results were then validated against the NIST Human IT Library with SpectraST scores > 0.9 [18, 38, 42]. The precursor tolerance was set to 1.5 Da/z (Thomson).

Q-FISH algorithm for direct comparison of experimental spectra

We assumed that MS/MS spectra from the same peptide would present similar patterns. Under this assumption, the proposed Q-FISH algorithm can be applied to find DEPs both in normal and disease samples. As shown in Figure 1, to evaluate the similarities between two spectra, we use a correlation coefficient of the moving window averages. The analytical process is summarized as follows:

1. Scale Standardization

Perform scale standardization by dividing the intensity values of each spectrum by their maximum value.

2. Moving average

Compute the moving window average over the spectra using a window of fixed size.

3. Correlation index for moving average-based peak patterns

Calculate a summary statistic based on the correlation coefficient of the moving averages between two spectra.

4. Spectral count-based quantification using two-stage clustering

Cluster duplicated peptides with similar peak patterns and retention time using a two-stage clustering method.

5. Identification of differentially expressed peptides

Employ the beta-binomial test to identify DEPs among the experimental groups.

Similarity measure between pairs of MS/MS spectra

Scale standardization

Because the intensities of the spectra obtained may be different for various physical and chemical reasons such as inconsistencies in the total ion currents, we cannot use the raw data for the intensity of m/z peaks. In light of this, we used a scale-standardization method, which involves the division of the m/z peak values for all ions by their maximum value. Let x[i] be the intensity of the ith m/z peak. Then, the scale standardized intensity, y[i], is defined by

$$ y[i] = \frac{x[i]}{\max(x[i])}. $$
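As a minimal sketch of this step, the standardization divides every peak intensity of one spectrum by the largest intensity in that spectrum, so that the base peak becomes 1. The function name `scale_standardize` and the toy intensities are illustrative assumptions.

```python
def scale_standardize(intensities):
    """Divide every peak intensity by the largest intensity in the spectrum."""
    peak_max = max(intensities)
    if peak_max <= 0:
        raise ValueError("spectrum has no positive intensity")
    return [x / peak_max for x in intensities]


raw = [120.0, 30.0, 480.0, 0.0, 240.0]   # hypothetical raw peak intensities x[i]
scaled = scale_standardize(raw)
print(scaled)  # [0.25, 0.0625, 1.0, 0.0, 0.5]
```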

Moving window average

To reduce the background noise of the peak intensities, the moving window average (MWA) is used. The simplest moving average is the unweighted (or uniformly weighted) average of n data points within a given window, while the weighted moving window average (WMWA) multiplies each data point by a weight factor so that data points contribute differently. Among the various options for the weights of the WMWA, we selected the "Gaussian" kernel, which uses the probability density function (pdf) of the standard Gaussian distribution with mean 0 and variance 1 as a weight function.

For a given spectrum, the MWA is calculated by averaging the peak intensities within the sliding window sequentially for all m/z peaks. In other words, the MWA is not a single value, but a set of averages. The next step is to calculate correlation between the MWAs of two spectra and determine whether there are identical spectra from the same peptide.

We assume that there are N moving windows of fixed size K along the entire m/z range. Subsequently, the WMWA for the ith moving window (i = 1, 2,..., N) is defined by

$$ m[i] = \sum_{j=0}^{K-1} w_j \, y[i+j], $$

where y[i + j] is the jth scale standardized intensity in the ith moving window and w_j are the weights. For the uniform kernel, w_j = 1/K; for the Gaussian kernel, w_j = Φ(z_j), where Φ denotes the pdf of the standard Gaussian distribution and z_j represents the value of y[i+j] standardized by using the mean and variance of the m/z's in the ith window. The total number of windows, N, is determined by the fixed window size K and the entire m/z range (200-2000 Da). To determine the optimal window size, we randomly selected pairs of spectra from the same and from different peptides using a target-decoy sequence database and performed receiver operating characteristic (ROC) analysis. Based on the ROC analysis, we chose a window size of K = 30 (3.0 Da) and accordingly N = 19,771 (20-2000 Da at intervals of 0.1 Da). However, the areas under the curve (AUC) did not differ much across window sizes; that is, the results were not very sensitive to the window size.
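The weighted moving window average can be sketched as follows. Two points are assumptions rather than details confirmed by the text: the Gaussian weights are computed here from the window offsets j standardized by their own mean and standard deviation, and the intensities are assumed to be binned on a regular grid. With a grid of 19,800 bins and K = 30, the number of windows is 19,800 - 30 + 1 = 19,771.

```python
import math


def uniform_weights(k):
    """Uniform kernel: w_j = 1/K for every position in the window."""
    return [1.0 / k] * k


def gaussian_weights(k):
    """Standard normal pdf at standardized window offsets (assumed reading of z_j)."""
    offsets = list(range(k))
    mean = sum(offsets) / k
    var = sum((j - mean) ** 2 for j in offsets) / k
    sd = math.sqrt(var) if var > 0 else 1.0
    return [math.exp(-0.5 * ((j - mean) / sd) ** 2) / math.sqrt(2 * math.pi)
            for j in offsets]


def moving_window_average(y, weights):
    """m[i] = sum_j w_j * y[i + j] for each of the N = len(y) - K + 1 windows."""
    k = len(weights)
    n_windows = len(y) - k + 1
    return [sum(w * y[i + j] for j, w in enumerate(weights))
            for i in range(n_windows)]


y = [0.0, 0.5, 1.0, 0.5, 0.0, 0.25]      # hypothetical standardized intensities
print(moving_window_average(y, uniform_weights(3)))
print(moving_window_average(y, gaussian_weights(3)))

# Number of windows for an assumed grid of 19,800 bins and K = 30.
print(len(moving_window_average([0.0] * 19800, uniform_weights(30))))  # 19771
```

The Gaussian weights are left unnormalized, as in the formula above; rescaling them would not change the correlation computed in the next step.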

Correlation index for moving average-based peak patterns

For peptides p and q, the correlation coefficient is computed as follows:

$$ r_{pq} = \frac{\sum_{i=1}^{N} \left( m_p[i] - \bar{m}_p \right) \left( m_q[i] - \bar{m}_q \right)}{\sqrt{\sum_{i=1}^{N} \left( m_p[i] - \bar{m}_p \right)^2 \sum_{i=1}^{N} \left( m_q[i] - \bar{m}_q \right)^2}}, $$

where \bar{m}_p and \bar{m}_q are the means of the moving window averages for peptides p and q. The closer the correlation coefficient is to 1, the stronger is the correlation between spectra from the same peptides.
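The correlation index is the ordinary Pearson correlation applied to two moving-average profiles of equal length. A minimal sketch follows; the function name `mwa_correlation` is hypothetical, and returning 0.0 for a completely flat profile is a choice made here, not something stated in the text.

```python
import math


def mwa_correlation(m_p, m_q):
    """Pearson correlation between two equally long moving-average profiles."""
    if len(m_p) != len(m_q) or not m_p:
        raise ValueError("profiles must be non-empty and of equal length")
    n = len(m_p)
    mean_p = sum(m_p) / n
    mean_q = sum(m_q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(m_p, m_q))
    ss_p = sum((a - mean_p) ** 2 for a in m_p)
    ss_q = sum((b - mean_q) ** 2 for b in m_q)
    if ss_p == 0 or ss_q == 0:
        return 0.0   # a flat profile carries no pattern; treated as uncorrelated
    return cov / math.sqrt(ss_p * ss_q)


m_a = [0.1, 0.4, 0.9, 0.3]       # hypothetical profile
m_b = [0.2, 0.8, 1.8, 0.6]       # same shape at twice the scale
m_c = [0.9, 0.3, 0.1, 0.4]       # different shape
print(mwa_correlation(m_a, m_b))  # approximately 1: identical pattern
print(mwa_correlation(m_a, m_c))  # negative: dissimilar pattern
```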

Quantification by counting spectra in clustered spectra set from a homogenous peptide

Two-stage cluster analysis is used to form clusters of spectra with similar patterns. As previously assumed, if spectra have approximately the same shape, then they are taken to come from the same peptide. That is, each cluster can be expected to be composed of the spectra obtained from a homogenous peptide. Two-stage clustering analysis employs two similarity measures: the first is the difference between precursor ions and the second is the correlation coefficient between two MWAs. MS/MS spectra obtained from the same peptide are theoretically expected to have similar precursor ions. First, clusters are defined in terms of pair-wise differences between the precursor ions: for any pair of precursor ions in the same cluster, their difference is smaller than a threshold value. In our analysis, we set ± 1 Da as the threshold value. The next step is to perform a hierarchical clustering analysis within each of the clusters defined. Specifically, we employ "single linkage," also known as the nearest neighbour technique. Here, the correlation coefficient of the MWAs is used as the similarity measure.

Because this two-stage clustering analysis yields clustered spectra sets consisting of MS/MS spectra from the same peptide, the amount of peptides can be quantified by counting the spectra included in each clustered set. Lastly, representative spectra called "reference spectra" can be defined based on the basic patterns of precursor ions as the average spectra for a given spectral set.
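The two-stage procedure and the subsequent spectral counting can be sketched as follows. This is a minimal illustration under stated assumptions, not the Q-FISH code: the stage-1 rule (start a new group once a precursor mass exceeds the smallest mass of the current group by more than the tolerance) is one possible reading of the ± 1 Da criterion, the `similarity` callable is a hypothetical stand-in for the MWA correlation, and stage 2 uses the fact that cutting a single-linkage dendrogram at a similarity cut-off is equivalent to taking connected components of the graph that links every pair at or above the cut-off.

```python
from collections import defaultdict

def group_by_precursor(masses, tol=1.0):
    """Stage 1: group spectra so that all precursor masses in a group
    lie within tol of the smallest mass in that group (assumed rule)."""
    order = sorted(range(len(masses)), key=lambda i: masses[i])
    groups, current = [], []
    for i in order:
        if current and masses[i] - masses[current[0]] > tol:
            groups.append(current)
            current = []
        current.append(i)
    if current:
        groups.append(current)
    return groups

def single_linkage(indices, similarity, rho):
    """Stage 2: single-linkage clusters at cut-off rho, computed as
    connected components with a small union-find structure."""
    parent = {i: i for i in indices}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for pos, a in enumerate(indices):
        for b in indices[pos + 1:]:
            if similarity(a, b) >= rho:
                parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for i in indices:
        clusters[find(i)].append(i)
    return list(clusters.values())

def two_stage_clusters(masses, similarity, tol=1.0, rho=0.6):
    """Run both stages and return clusters as lists of spectrum indices."""
    clusters = []
    for group in group_by_precursor(masses, tol):
        clusters.extend(single_linkage(group, similarity, rho))
    return clusters

def spectral_counts(clusters):
    """Spectral count of each cluster = number of spectra it contains."""
    return [len(c) for c in clusters]
```

Because single linkage joins clusters through their nearest members, two spectra with a low direct correlation can still end up in the same cluster when a third spectrum correlates well with both.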

Validation of the clustering results using retention times

It is well known that the same peptides tend to elute continuously within a limited liquid chromatography (LC) interval. Thus, the clustering results can be validated using the retention time (RT) information.

In order to validate the clustering results, we propose a new measure that estimates the clustering error rate using the spectral RT information. Note that the Q-FISH results provide the list of clusters. If a cluster contains only spectra from the same peptide, the RTs of those spectra will have similar values; if a cluster contains spectra from different peptides, the RTs will differ. To quantify this, we consider measures of the variability of RTs within a cluster, such as the coefficient of variation (CV) and the standard deviation (SD). Since the RT varies widely across spectra, the CV is a better measure than the SD. Using the CV, we propose a new measure called the false clustering rate (FCR), which is similar in spirit to the false discovery rate (FDR). It measures how often a cluster is composed of spectra from different peptides. We use a threshold value of the CV, Δ, to determine whether a cluster is well clustered: if the CV of a given cluster is smaller than Δ, we call it a good cluster. For a given value of Δ, the FCR can be computed as follows:

  1. Calculate the coefficient of variation (CV) of spectral RT in the same clusters from the Q-FISH results.

  2. Permute the spectra while maintaining the number of spectra in each cluster fixed.

  3. Calculate $CV_p$ for each permuted cluster for the pth permuted sample.

  4. Compute FCR as follows:

    $$FCR = \frac{1}{P} \sum_{p=1}^{P} \frac{\#\{\, i \mid CV_p(i) < \Delta \,\}}{\#\{\, i \mid CV(i) < \Delta \,\}}, \quad i = 1, 2, \ldots, C,$$

where P is the number of permutations, Δ the threshold value, and C the total number of clusters.
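The permutation procedure can be sketched in Python as below. This is a hedged illustration rather than the authors' code: the function names are hypothetical, the CV is computed as the plain ratio SD/mean (so Δ must be supplied on the same scale), and clusters with a single spectrum are skipped because their CV is undefined; the source does not state how such clusters were handled.

```python
import random
import statistics

def cluster_cvs(rts, sizes):
    """CV (SD / mean) of retention times for consecutive blocks of `rts`
    with the given cluster sizes; singleton clusters are skipped."""
    cvs, start = [], 0
    for size in sizes:
        block = rts[start:start + size]
        start += size
        if size >= 2:
            cvs.append(statistics.stdev(block) / statistics.mean(block))
    return cvs

def false_clustering_rate(clusters, delta, n_perm=1000, seed=0):
    """Estimate FCR for threshold `delta`.

    clusters: list of lists of retention times, one list per cluster.
    Returns NaN when no observed cluster has CV below `delta`.
    """
    rng = random.Random(seed)
    sizes = [len(c) for c in clusters]
    pooled = [rt for c in clusters for rt in c]

    # Step 1: number of observed clusters with CV < delta.
    observed = sum(c < delta for c in cluster_cvs(pooled, sizes))
    if observed == 0:
        return float("nan")

    # Steps 2-3: permute spectra across clusters, keeping cluster sizes fixed.
    total = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        total += sum(c < delta for c in cluster_cvs(pooled, sizes))

    # Step 4: average permuted count divided by the observed count.
    return total / (n_perm * observed)

# Toy example with three tight clusters (hypothetical retention times).
toy = [[10.0, 10.1, 10.2], [50.0, 50.2, 50.1], [90.0, 90.3, 90.1]]
fcr = false_clustering_rate(toy, delta=0.05)
```

Scanning a grid of Δ values with such a function and picking the one whose FCR is closest to a target level mirrors how a threshold can be chosen from a table of FCR values.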

For our HCC data, we computed the FCR for various values of Δ, as summarized in Table 5. From this analysis, we chose Δ = 4.4, which yielded an FCR close to 0.05.

Table 5 Validation for clustering result using the false clustering rate (FCR)

We also used the FCR to determine the cut-off value of the correlation coefficient, ρ, for spectral clustering. For a given threshold value of ρ, the FCR can be computed in the same manner as for Δ. We computed the FCR for various values of ρ, as summarized in Table 5, and chose ρ = 0.6, which yielded an FCR close to 0.05.

Differentially expressed peptides (DEPs)

To estimate peptide abundance in different samples, such as control and disease tissue samples, a spectral counting method like Q-FISH can be employed. Pham et al. [21] proposed the use of the beta-binomial distribution to test the significance of differences in spectral counts in label-free mass spectrometry-based proteomics. Their results showed that the beta-binomial test can be applied to experiments with one or more replicates, as well as to the comparison of multiple conditions. We applied the beta-binomial model to test for DEPs in the clustered spectral sets from three replicated MS/MS experiments.

Let x denote the number of spectral counts in the clustered spectral set and n the total number of spectral counts of all spectra in each sample. Assume that x follows a binomial distribution with true proportion π, 0 ≤ π ≤ 1:

$$x \mid \pi \sim \mathrm{Binomial}(n, \pi)$$

Rather than being treated as fixed, π is modelled as a random variable following the beta distribution with real parameters α > 0 and β > 0:

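$$\pi \sim \mathrm{Beta}(\alpha, \beta), \qquad E(\pi) = \frac{\alpha}{\alpha + \beta} = \theta,$$
$$\mathrm{Var}(\pi) = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}.$$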

Subsequently, the marginal distribution of x is the beta-binomial distribution [21],

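$$p(x \mid \alpha, \beta, n) = \int_0^1 p(x \mid \pi, n)\, p(\pi \mid \alpha, \beta)\, d\pi = \int_0^1 \binom{n}{x} \frac{\pi^{x + \alpha - 1} (1 - \pi)^{n - x + \beta - 1}}{B(\alpha, \beta)}\, d\pi = \binom{n}{x} \frac{B(\alpha + x,\, n + \beta - x)}{B(\alpha, \beta)},$$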

where B(·,·) is the beta function.
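For illustration, the closed form of the beta-binomial probability can be evaluated numerically as in the following sketch. The function names are hypothetical, and working on the log scale with `math.lgamma` is a choice made here for numerical stability with large spectral counts, not something specified in the source.

```python
import math

def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(x, n, alpha, beta):
    """p(x | alpha, beta, n) = C(n, x) * B(alpha + x, n + beta - x) / B(alpha, beta)."""
    log_choose = (math.lgamma(n + 1) - math.lgamma(x + 1)
                  - math.lgamma(n - x + 1))
    return math.exp(log_choose
                    + log_beta(alpha + x, n + beta - x)
                    - log_beta(alpha, beta))

# Hypothetical parameter values, used only to exercise the function.
n, alpha, beta = 20, 2.0, 5.0
pmf = [beta_binomial_pmf(x, n, alpha, beta) for x in range(n + 1)]
mean_x = sum(x * p for x, p in enumerate(pmf))
```

As a sanity check, the probabilities over x = 0, ..., n sum to 1, and the mean of x equals n·α/(α + β), in line with E(π) = α/(α + β).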

The following parameterization is used

$$\pi = \frac{\alpha}{\alpha + \beta} = h(Xb) = h(\eta) \quad \text{and} \quad \phi = \frac{1}{\alpha + \beta + 1},$$

where h is the inverse of the link function (logit or complementary log-log), X is a design matrix, b is a vector of fixed effects, η = Xb is the linear predictor, and ϕ is the overdispersion parameter. Based on this parameterization, the marginal mean and variance are:

$$E(x) = n\pi,$$
$$\mathrm{Var}(x) = n\pi(1 - \pi)\left[ 1 + (n - 1)\phi \right].$$
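As a brief check of the marginal variance, it can be recovered from the standard beta-binomial variance written in terms of α and β. The step below is a worked derivation added for clarity; it uses only the definitions π = α/(α + β) and ϕ = 1/(α + β + 1) given above.

$$\begin{aligned}
\mathrm{Var}(x) &= \frac{n \alpha \beta (\alpha + \beta + n)}{(\alpha + \beta)^2 (\alpha + \beta + 1)}
= n \cdot \frac{\alpha}{\alpha + \beta} \cdot \frac{\beta}{\alpha + \beta} \cdot \frac{\alpha + \beta + n}{\alpha + \beta + 1} \\
&= n\pi(1 - \pi)\left[ 1 + \frac{n - 1}{\alpha + \beta + 1} \right]
= n\pi(1 - \pi)\left[ 1 + (n - 1)\phi \right].
\end{aligned}$$

When ϕ approaches 0 the bracket tends to 1 and the binomial variance nπ(1 − π) is recovered, so ϕ measures the extra variability between replicates.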

The parameters b and ϕ are estimated by maximizing the log-likelihood of the marginal model. Given the estimated coefficients, testing for differential expression reduces to testing whether the corresponding coefficient in b is 0 [43]. We also used Benjamini and Hochberg's method to correct for multiple testing of DEPs [44].
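The multiple-testing correction can be illustrated with a short Python sketch of the Benjamini-Hochberg step-up adjustment. The function name and the example p-values are hypothetical; the input is assumed to be one p-value per clustered spectral set, obtained from the test described above.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    For the p-value of rank k (1 = smallest) among m tests, the adjusted
    value is the minimum of p * m / j over all ranks j >= k, capped at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the running minimum
    # enforces monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values for four clustered spectral sets.
example_p = [0.01, 0.04, 0.03, 0.005]
example_q = benjamini_hochberg(example_p)
```

A clustered spectral set would then be called differentially expressed when its adjusted p-value falls below the chosen FDR level.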