Background

Of the entire 3GB human genome, only about 2% codes for proteins. The identification of biological functions of the entire genome remains a major challenge [1, 2]. One powerful venue to gain functional insights is to identify the proteins that bind to each genomic region. Recent development of chromatin immunoprecipitation followed by microarray or sequencing (ChIP- chip or ChIPseq) technologies has made it feasible to map genome-wide protein-DNA interaction profiles [35]. The data generated by these experiments have not only greatly facilitated the genome-wide characterization of regulatory elements such as enhancers [6, 7] but also been integrated with other data sources to build gene regulatory networks [811].

An important question is to what extent a specific protein-DNA interaction is mediated at the level of genomic sequences. While it is well known that specific sequence motifs are crucial for transcription factors (TF) mediated cis-regulation, there are many other proteins, such as chromatin modifiers, whose target sequences cannot simply be characterized by a handful of distinct motifs [12]. Such sequences are often regarded as nonspecific and not studied further. However, recent studies in nucleosome positioning have provided new insights by going beyond this motif-centric view [13]. Here various sequence features have been associated with nucleosome positioning, including poly dA:dT track [14, 15], abundance of G/C content [16, 17], and certain periodic patterns [18, 19]. Such patterns cannot be captured by traditional motif analysis methods. Similar results have been obtained by analyzing histone modification [20, 21] and DNA methylation data [22, 23].

Despite the success of these recent sequence-based prediction models, it remains difficult to determine which sequences lack intrinsic specificity because a poor prediction outcome might imply than more sophisticated models. A guide is needed for developing sequence-based prediction models. To this end, here we present a simple approach to quantify sequence specificity based on the frequency distribution of k-mers. We will also systematically investigate the relative merit of various distance or similarity functions for capturing specific sequence information. While k-mers have been extensively to detect splice sites [24], to study functional genomic regions[25], to identify protein coding genes[26] and used in motif analysis (reviewed by [27]), to our knowledge, they have not been used to quantify sequence specificity.

We evaluated the performance of our approach by analyzing one simulated datasets and two real experimental datasets, corresponding to a TF (STAT1) and a histone modification (H3K4me1) respectively. Our results have provided new insights into the role of DNA sequences in modulating protein-DNA interactions regardless of motif presence.

Results

A simple measure of sequence specificity

While specific sequence information has been identified in the absence of distinct motifs, to our knowledge, it is always associated with enrichment of certain k-mers (where k is a small number, such as 4). Its main difference with motifs is that, when k is small, a single k-mer may occur many times in the genome and therefore would not be useful for any practical purpose. On the other hand, we reasoned that more specific information can be obtained by combinations of multiple k-mers. Therefore, it seems appropriate to quantify sequence specificity by aggregating enrichment information for all k-mers. For the rest of the paper, we fix k = 4, although the method presented below is equally applicable to any choice of k. Treating complementary sequences as identical, there are 136 non-redundant 4-mers. By counting the frequency of each 4-mer, each input sequence is then mapped to a 136 dimensional numerical vector containing the frequency of each k-mer. The distributions corresponding to sequences containing specific information should be distinct from those for random sequences, which are generated to match the number and length of the input sequences. We use the symmetric Kullback-Leibler (KL) divergence [28] for comparing frequency distributions and average over the entire set of input sequences. We term the resulting value as the M otif I ndependent M etric (MIM). To evaluate statistical significance, we estimate the null distribution by computing MIM values for sets of random sequences. The detailed procedure is described in the Methods section.

Model Validation

Simulated data

As an initial evaluation, we synthetically generated 8 sequence sets each containing 2000 sequences, mimicking TF ChIPseq experiments for which the corresponding TF recognizes a single motif: TTGACA. The difference between these sequence sets is the motif strength, which is parameterized by a real number ε (see Methods). In particular, a perfect motif corresponds to ε = 0, whereas a random sequence corresponds to ε = 0.25. In a typical ChIPseq experiment, only a subset of target sequences contains the motif. To simulate this fact, we randomly selected 1000 sequences from each set and inserted the motif at a randomly selected location. As control, we also synthesized 1000 sets of 2000 random sequences each.

We calculated the MIM values for each sequence set and evaluated the statistical significance of the resulting values. We found that the MIM values are statistically significant (p-value < 0.001) for ε up to 0.1 (Figure 1a and 1b). The information content for the corresponding motif is 5.35 bit, which is still lower than 98% of the motifs in the JASPAR core database [29]. In the following we will show that our method indeed performs well for real data. We ranked each k-mer according to its relative contribution to the MIM. The most informative k-mers are shown in Table 1. The methodology used to select such motif is outlined in the methods section. We noticed that the top k-mers are substrings of the inserted motif (highlighted in bold in Table 1), suggesting that these k-mers may be used as a seed for motif detection, in a similar way as the dictionary approach [30]. In additional to the KL divergence considered here, there are a number of other metrics to compare frequency distributions. We selected a few commonly used metrics and repeated the above analysis (Methods). We found that the results are quite similar (Table 2).

Figure 1
figure 1

MIM values for simulated sequences. (a) The MIM values and corresponding p-values (above the bars) for the simulated data. Note that the MIM values change in the same direction as motif strength; (b) comparison of the MIM values with respect to the null distribution, which is obtained by using 1000 sets of random sequences.

Table 1 Top 20 k-mers ranked by different distances on Cell1 of Synthetic dataset
Table 2 Distances values on Synthetic dataset

Real ChIPseq data

To validate our method using real experimental data, we analyzed a publicly available ChIPseq dataset for STAT1 [31], a member of the signal transducer and activator of transcription (STAT) family TFs, in the HeLa S3 cell line. The dataset contains 39,000 target sequences, 35% of which contains the consensus motif TTCCNGGAA (JASPAR database [29]). As control, we sampled random sequences from genomic background matching the number and length of the target sequences.

We evaluated the level of sequence specificity of the whole set of target sequences by using the MIM measure. The sequences are indeed highly specific (see Figure 2a and 2b). Again, among the top ranked k-mers, several are substrings of the "classic" STAT1 motif (highlighted in bold in Table 3), suggesting it may provide useful information for identifying discriminative sequence signatures without the knowledge of TF motifs. Furthermore, the results are not sensitive to the specific choice of distances as in the simulated data experiment (Table 4).

Figure 2
figure 2

MIM values for STAT1 target sequences. (a) The MIM values and corresponding p-values (above the bars) for different subsets of STAT1 target sequences: all targets, STAT1 motif containing ones, and STAT1 motif absent ones; (b) comparison of the MIM values with respect to the null distribution, which is estimated by using 1000 sets of random sequences.

Table 3 Top 20 k-mers ranked by different distances on motif sequences on STAT1 dataset
Table 4 Distances values on STAT1 dataset

Detecting sequence specificity in absence of a dominant motif

STAT1

As mentioned above, while the presence of STAT1 motif can explain the sequence specificity for 35% of the target sequences, it is unclear how TF is recruited to the other 65% of the targets. In order to evaluate the role of DNA sequence specificity for these motif-absent targets, we compared the MIM values between the motif-present and motif-absent subsets of targets. Surprisingly, we found that the MIM value for motif-absent targets is almost indistinguishable from motif-present targets (see Figure 2a and 2b). This high level of specificity cannot be simply explained by promoter-related biases, because only 11% of target sequences are located in promoters.

To gain mechanistic insights, we searched for enrichment of other TF motifs in the JASPAR database [29], using the FIMO software [32]. We found two motifs that are significantly enriched (threshold p-value < 10-6): SP1 and ESR1, both have previously been shown to interact with STAT1 [33, 34]. Therefore, STAT1 might be recruited to the motif-absent targets through interaction with these other TFs. We further compared the associated gene ontology terms between the motif-present and motif-absent sets to see if there are any functional differences. We found that these two sets share many similar biological functions, such as hydrolase and ATPase activities (p < 10-17). On the other hand, while the motif-present targets are highly enriched for the voltage-gated calcium channel complex (p < 10-12), the motif-absent targets are highly enriched for cytoplasmic components instead (p < 10-12).

H3K4me1

Unlike TFs, histone (de)modifying enzymes usually do not directly interact with DNA. The role of DNA sequences in the regulation of histone modification patterns remains poorly understood. As an example, the histone modification H3K4me1 plays an important role in gene regulation by demarcating cell-type specific enhancers [6]; yet how it is recruited to enhancer regions is poorly understood. We hypothesized that the role of DNA sequence may play a cell-type specific role and aimed to detect such differences by using our MIM measure. To this end, we assembled an H3K4me1 ChIPseq dataset in seven human cell-lines, including H1 (a human embryonic stem cell line), K562 (a myelogenous leukemia cell line), Huvec (human umbilical vein endothelial cells), Nhek (normal human epidermal keratinocytes), and three T cell-lines (CD4+, CD36+, and CD133+) from the public domain [1, 4, 35]. For each cell line, we identified the peak locations by using cisGenome [36] then calculated the MIM value for DNA sequences at the peaks (in Table 5 the top 20 k-mers ranking by different distances on H1 cell line). The MIM values are highly cell-type specific (see Figure 3a and 3b and Table 6). Interestingly, the value for H1 cells is much higher than any other cell line, suggesting that the DNA sequence plays a unique role in H3K4me1 recruitment in ES cells. To eliminate the possibility that this difference may be simply due to a GC content related bias, we repeated the analysis by using a different null model, obtained by random shuffling the original sequences within each dataset. While the MIM values slightly change, they are ordered in nearly the same way as before (Additional File 1). Importantly, the MIM values are distinctively higher in the H1 cell line compared to the other cell lines, suggesting that such differences are unlikely due to a GC- content related bias.

Table 5 Top 20 k-mers ranking by different distances on H1 cell line on H3k4me1 dataset
Figure 3
figure 3

MIM values for H3K4me1 target sequences. (a) The MIM values and corresponding p-values (above the bars) for H3k4me1 target sequences in different cell lines. Note that the MIM value for H1 is much higher than for other cell lines; (b) comparison of the MIM values with respect to the null distribution, which is estimated from 1000 sets of random sequences.

Table 6 Distances values on H3k4me1 dataset

Since the H3K4me1 marks cell-type specific enhancers, one possible explanation for the high sequence specificity in ES cells is that the targets might be associated with a few ES-specific TFs. To test this possibility, we searched for enrichment of TF motifs in the JASPAR database using FIMO. Surprisingly, we were unable to find any significantly-enriched motif, suggesting that the specificity is contributed to a different mechanism.

We then investigated whether the H3K4me1 targets in ES cells are indeed highly predictable. In previous work, we developed a sequence-based model, called the N-score model, to predict epigenetic targets [19, 21]. This model integrates information from three classes of sequence features (sequence periodicity, word counts, and DNA structural parameters) by using stepwise logistic regression model (see methods for details). Here we applied the N-score model to predict H3K4me1 target sequences. As negative control, we selected the same number of sequences from the genome at random. We evaluated the model performance by using a 3-fold cross-validation. We found that prediction accuracy is indeed high for ES cells (AUC = 0.967) (Figure 4), whereas the accuracy for other cell types is much lower.

Figure 4
figure 4

N-score prediction of H3K4me1 target sequences. Receiver operating characteristic (ROC) curves for different cell lines using the N-score. Note as the AUC for H1 is much higher than for other cell lines.

Discussion

Recently it has been shown that a large number of proteins may weakly bind to DNA [37]. It remains unclear to what extent such events are mediated by specific sequence information. This question cannot be answered by using traditional motif analysis, since the target sequences do not contain distinct motifs. As an alternative approach, we define a simple measure, called MIM, to quantify sequence specificity by aggregating information from all k-mers. Our approach does not make any assumptions regarding motif presence, providing a more versatile tool for sequence analysis. We validated this method by analyzing both simulated and experimental data and found that it is indeed effective for detecting sequence specificity in both cases.

We also showed that the MIM measure can provide new biological insights. Specifically, we found that the motif-absent targets of a TF may also contain specific sequence information due to interaction with other TFs. We also found that the sequence specificity for H3K4me1 targets is higher in ES cells than in differentiated cell-types, suggesting a unique role of DNA sequence in the recruitment of H3K4me1 in ES cells. Interestingly, this high specificity cannot be explained by enrichment of known TF motifs, suggesting a yet uncharacterized recruitment mechanism in ES cells. The MIM algorithm is implemented in Python and can be freely accessed at : https://github.com/lucapinello/mim.

Conclusion

The role of DNA sequence in gene regulation remains incompletely understood. Our MIM method has extended previous work by further accounting for sequence specificity due to accumulation of weak sequence features. The information can be used as a guide to systematically investigate the regulatory mechanisms for a wide variety of biological processes.

Methods

Synthetic data generation

We simulated ChIPseq data for a TF whose motif sequence is TTGACA. In order to simulate the variation of motif sites among different target sequences, we modeled the position weight matrix (PWM) as illustrated in Table 7, where ε measures the mutation rate of the motif and can change between 0 (perfect motif) and 0.25 (totally random). We sampled ε at 8 different values: 0, 0.001, 0.005, 0.01, 0.05, 0.1, 0.1667, and 0.25. For each choice of ε, we generated 2000 sequences of 500 bp each. The sequences were initially generated by randomly sampling from the background distribution with the probabilities of A,C,G,T equal to 0.15, 0.35, 0.35, 0.15, respectively. In addition, we randomly selected a subset of 1000 sequences and inserted the motif at a random location.

Table 7 PWM for synthetic motif generation

ChIPseq data source

Genome-wide STAT1 peak locations in HeLa S3 cell lines were obtained from the http://archive.gersteinlab.org/proj/PeakSeq/Scoring_ChIPSeq/Results/STAT1[31]. ChIPseq data for H3K4me1 in seven human cell lines were obtained from literature: CD4+ T cell [4], CD36+ and CD133+ T cells [35], H1, Huvec, K562, and Nhek [1]. The raw data were processed by cisGenome to identify peak locations [36]. The DNA sequences at the peak locations were analyzed subsequently.

Motif analysis

Motif analysis was done by using several tools in the MEME suite (http://meme.nbcr.net/meme/) as follows. Scanning DNA sequences for matches of a known motif was done by using the FIMO [32]. Motif comparison was done by using TOMTOM.

Functional annotation

Functional annotation was done by using the GOrilla software [38] (http://cbl-gorilla.cs.technion.ac.il/).

Details of the MIM measure

Each DNA sequence is mapped to numerical values by enumerating the frequency of each k-mer treating complementary k-mers as the same. There are m = 136 non-redundant k-mers for k = 4. MIM is essentially a metric between two distributions of k-mer frequencies. Specifically, let P = ( P ij ) be the k-mer frequency distributions corresponding to a set of n target sequences S = ( S i ), where S i represents a sequence in the set S. We generate a set of n random sequences R = ( R i ) matching the sequence lengths (analogously R i represents a sequence in the set R). Let Q = (Q ij ) be the k-mer frequency distributions corresponding to R. Finally let P j = Σ i P i j Σ i j P i j and Q j = Σ i Q i j Σ i j Q i j (P j in particular represents the probability of the j-th k-mer in S, analogously, Q j represents the probability of the j-th k-mer in R) then the difference between P and Q is quantified by the symmetrical Kullback-Leibler (KL) divergence [28], as follows:

d k l ( S , R ) = j = 1 m P j l o g 2 P j Q j + j = 1 m Q j l o g 2 Q j P j 2

The MIM value corresponding to S is defined as the expected value d kl (S, R), which is estimated by averaging over 1000 sets of random sequences. The MIM value, using the symmetrical KL divergence, can be interpreted as the number of the expected number of extra bits required to code samples from S when using a code based on the background distribution. Note that there exist several alternatives to measure the similarity of two probability distributions [39]. To evaluate whether the results are sensitive to the specific choice of distances, we also computed MIM values based on two other well-known distances between probability distributions:

  1. 1)

    The Hellinger distance [39]

    d h l ( S , R ) = 1 2 j = 1 m P j - Q j 2

whose main differences from d kl are 1) d hl naturally satisfies the triangle inequality; and 2) the range of d hl is the interval [0,1].

  1. 2)

    The Bhattacharyya distance[40]

    d b h ( S , R ) = j = 1 m μ P j - μ Q j 2 4 σ P j + σ Q j + 1 2 log σ P j + σ Q j 2 σ P j + σ Q j

which has been widely used for pattern recognition in computer science [41];

where μ P j = 1 n i = 1 n P i j and σ P j = 1 n - 1 i = 1 n P i j - μ P j 2 are the mean and standard deviation, respectively, of P j ( μ Q j and σ Q j are defined similarly for Q j ).

In order to estimate the null distribution, we generated 1000 sets of random sequences and then calculated MIM values for each random sequence set. The probability density function (pdf) was estimated by using a kernel method [42]. This pdf was used to infer not only the mean and standard deviation of the null distribution but also the statistical significance for any MIM value. Recognizing the limited resolution of the estimated pdf, we did not distinguish p-values that are smaller than 0.001.

N-score model

The N-score model was described previously [19, 21]. In brief, the model integrates three types of sequence features, including sequence periodicities [19], word counts [16], and structural parameters [43], a total of 2920 candidate features. Model selection was done by stepwise logistic regression. The final model was used for target prediction.

Most informative k-mers selection

Giving P j and Q j associated to S and R respectively, it is possible to calculate their Kullback-Leibler (KL) divergence for each j, where j indicates the j-th k-mer component. This results in a list of 136 distance values, whose ranking can be used as a guide to identify the most informative k-mers.