Background

In the context of genomics research, the functional approach is based on the ability to analyze genome-wide patterns of gene expression and the mechanisms by which gene expression is coordinated. Microarray technology and other high-throughput methods are used to measure expression values of thousands of genes over different samples/experimental conditions. In recent years the microarray technology has been used to measure in a single experiment expression values of thousands of genes under a huge variety of experimental conditions across different time points. This kind of datasets can be referred to as time series microarray datasets. Because of the large data volume, computational methods are used to analyze such datasets. Clustering is one of the most common methods for identifying coexpressed genes [1]. This kind of analysis is facilitative for constructing gene regulatory networks in which single or groups of genes interact with other genes. Besides this, coexpression analysis also reveals information about some unknown genes that form a cluster with some known genes.

A clustering algorithm is used to group genes that are coexpressed over all conditions/samples or to group experimental conditions over all genes based on some similarity/dissimilarity metric. However clustering may fail to find the group of genes that are similarly expressed over a subset of samples/experimental conditions i.e. clustering algorithms are unable to find such local patterns in the gene expression dataset. To deal with that problem, biclustering algorithms are used. A bicluster can be defined as a subset of genes that are coexpressed over a subset of samples/experimental conditions. The first biclustering algorithm that was used to analyse gene expression datasets was proposed by Cheng and Church and they used a greedy search heuristic approach to retrieve largest possible bicluster having mean squared residue (MSR) under a predefined threshold value δ (δ-bicluster) [2]. But nowadays, biologists are eager to analyze 3D microarray dataset to answer the question: “Which genes are coexpressed under which subset of experimental conditions/samples across which subset of time points?” Biclustering is not able to deal with such 3D datasets. So, in this case we need some other clustering technique that can mine 3D datasets. Hence the term Triclustering has been defined and a tricluster can be delineated as a subset of genes that are similarly expressed across a subset of experimental conditions/samples over a subset of time points. Zhao and Zaki proposed a triclustering algorithm TRICLUSTER that is based on graph-based approach. They defined coherence of a tricluster as max ( e ib / e ia , e jb / e ja ) min ( e ib / e ia , e jb / e ja ) -1, where e i a ,e i b denote the expression values of two columns a and b respectively for a row i. A tricluster is valid if it has a ratio below a maximum ratio threshold ϵ[3].

Here we introduce an efficient triclustering algorithm δ-TRIMAX[4] that aims to cope with noisy 3D gene expression dataset and is less sensitive to input parameters. The normalization method does not influence the performance of our algorithm, as it produces the same results for both normalized and raw datasets. Here we propose a novel extension of MSR [2] for 3D gene expression data and use a greedy search heuristic approach to retrieve triclusters, having MSR values below a threshold δ. Hence the triclusters can be defined as δ-tricluster.

In this work we have applied our proposed δ-TRIMAX algorithm on a time-series gene expression data in estrogen induced breast cancer cell. Estrogen, a chemical messenger plays an instrumental role in normal sexual development, regulating woman’s menstrual cycles and normal development of the breast. Estrogen is also needed for heart and healthy bones. As estrogen plays vital role in stimulating breast cell division, has an effect on other hormones implicated in breast cell division and provides support to the growth of estrogen-responsive tumors, it may be involved in risk for breast cancer [5]. Though since last decade, some research has been done to decipher some unknown questions on breast cancer risk, still some questions such as involvement of genes in breast cancer risk etc. remain unanswered. Here our coexpression analysis reveals some genes that have already been found to be associated with estrogen induced breast cancer and some other genes that might play an important role in this context. Additionally, our coregulation analysis brings out some important information such as which transcription factor binds the promoter regions of genes and play an important role in this context.

In section 2, we have described our proposed triclustering algorithm in detail. Section 3 shows results of our algorithm using one artificial dataset and one real-life dataset. In section 4, we conclude our work.

Methods

Definitions

Definition 1

Time Series Microarray Gene Expression Dataset). We can model a time series microarray gene expression dataset (D) as a G × C × T matrix and each element of D (d i j k ) corresponds to the expression value of gene i over jth sample/experimental condition across time point k, where i ∈ (g1,g2,...,g G ), j ∈ (c1,c2,...,c C ) and k ∈ (t1,t2,...,t T ).

Definition 2

Tricluster). A tricluster is defined as a submatrix M(I,J,K) = [m i j k ], where i ∈ I, j ∈ J and k ∈ K. The submatrix M represents a subset of genes (I) that are coexpressed over a subset of conditions (J) across a subset of time points (K).

Definition 3

(Perfect Shifting Tricluster). A Tricluster M(I,J,K) = m i j k , where i ∈ I, j ∈ J and k ∈ K, is called a perfect shifting tricluster if each element of the submatrix M is represented as: m i j k = Γ + α i + β j + η k , where Γ is a constant value for the tricluster, α i , β j and η k are shifting factors of ith gene, jth samples/experimental condition and kth time point, respectively. As the noise is present in microarray datasets, the deviation from actual value and expected value of each element in the dataset also exists. For this deviation, every tricluster is not a perfect one.

Cheng and Church proposed an algorithm for retrieving large and maximal biclusters that have mean squared residue score (MSR) below a threshold δ in 2D microarray gene expression dataset. They also showed that MSR of a perfect δ-bicluster and perfect shifting bicluster is zero (S = δ = 0) [2, 6]. Now extending this idea, here we present a novel definition of Mean Squared Residue score for 3D microarray gene expression datasets. The MSR (S) of a perfect shifting tricluster becomes also zero, where each element m i j k = Γ + α i + β j + η k . For delineating new MSR score (S), at first we need to define the residue score:

Let the mean of i th gene (m i j k ): m iJK = 1 | J | | K | j J , k K m ijk , the mean of j th sample/experimental condition (m i j k ): m IjK = 1 | I | | K | i I , k K m ijk , the mean of k th time point (m i j k ): m IJk = 1 | I | | J | i I , j J m ijk , and the mean of tricluster (m i j k ): m IJK = 1 | I | | J | | K | i I , j J , k K m ijk . Now the mean of the tricluster can be considered as the value of constant i.e. Γ=m I J K . We can define the shifting factor for the ith gene (α i ) as the difference between m i j k and m i j k i.e. α i =m i J K -m I J K . Similarly, we can define shifting factor for the jth condition (β j ) as β j =m I j K -m I J K and shifting factor for the kth time point (η k ) can be defined as η k =m I J k -m I J K . Hence we can define each element of a perfect shifting tricluster as m i j k =Γ+α i +β j +η k =m I J K +(m i J K -m I J K )+(m I j K -m I J K )+(m I J k -m I J K )=(m i J K +m I j K +m I J k -2m I J K ). But usually noise is evident in microarray gene expression dataset. Therefore to evaluate the difference between the actual value of an element (m i j k ) and its expected value, obtained from above equation, the term “residue” can be used [6]. Thus the residue of a tricluster (r i j k ) can be defined as follows: r i j k =m i j k -(m i J K +m I j K +m I J k -2m I J K )=(m i j k -m i J K -m I j K -m I J k +2m I J K ).

Definition

4 (Mean Squared Residue). We define the term Mean Squared Residue MSR(I,J,K) or S of a tricluster M(I,J,K) to estimate the quality of a tricluster i.e. the level of coherence among the elements of a tricluster as follows:

S = 1 | I | | J | | K | i I , j J , k K r ijk 2 = 1 | I | | J | | K | i I , j J , k K ( m ijk - m iJK - m IjK - m IJk + 2 m IJK ) 2 .
(1)

Lower residue score represents larger coherence and better quality of a tricluster.

Proposed method

δ-TRIMAX aims to find largest and maximal triclusters in a 3D microarray gene expression dataset. It is an extension of Cheng and Church biclustering algorithm [2] that deals with 2-D microarray datasets. In contrast, our algorithm is capable to mine 3D gene expression dataset. There is always a submatrix in an expression dataset that has a perfect MSR(I,J,K) or S score i.e. S = 0 and this submatrix is each element of the dataset. But as mentioned above, our algorithm finds maximal triclusters having S score under a threshold δ, hence we have used a greedy heuristic approach to find triclusters. Our algorithm therefore starts with the entire dataset containing all genes, all samples/experimental conditions and all time points.

Algorithm 1 (δ-TRIMAX)

Input. D, a matrix that represents 3D microarray gene expression dataset, λ> 1, an input parameter for multiple node deletion algorithm, δ≥ 0, maximum allowable MSR score.

Output. All possible δ-triclusters.

Initialization. Missing elements in D ← random numbers, D’ ← D

Repeat

a. D’1 ← Results of Algorithm 2 on D’ using delta and λ. If the no. of genes (conditions/samples and/or no. of time points) is 50 (This value can be choosen experimentally. Large value increases the execution time of the algorithm as it then executes more number of iterations.), then do not apply Algorithm 2 on genes (conditions/samples and/or time points).

b. D’2 ← Results of Algorithm 3 on D’1 usingδ.

c. D’3 ← Results of Algorithm 4 on D’2.

d. Return D’3 and replace the elements that exist in D’ and D’3 with random numbers.

Until(No gene is found for δ-tricluster)

Initially, our algorithm removes genes or conditions or time points from the dataset to accomplish largest diminishing of score S; this step is described in the following section in which a node corresponds to a gene or experimental condition or time point in the 3D microarray gene expression dataset.

Algorithm 2 (Multiple node deletion)

Input. D, a matrix of real numbers that represents 3D microarray gene expression dataset; δ≥ 0, maximum allowable MSR threshold, λ> 1, threshold for multiple node deletion. The value of λ has been set experimentally to optimize the speed and performance (to avoid falling into local optimum) of the algorithm.

Output. M i j k , a δ-tricluster, consisting of a subset(I) of genes, a subset(J) of samples/experimental conditions and a subset of time points, having MSR score (S) less than or equal to δ.

Initialization. I ←{set of all genes }, J ←{set of all experimental conditions/ samples } and K ←{set of all time points } and to M(I,J,K) ← D

Repeat

Calculate m i j k , ∀ i ∈ I; m i j k , ∀ j ∈ J; m i j k , ∀ k ∈ K; m i j k and S.

If Sδ return M(I,J,K)

Else

Delete genes i ∈ I that satisfy the following inequality

1 | J | | K | Σ j J , k K ( m ijk - m iJK - m IjK - m IJk + 2 m IJK ) 2 >λS

Recalculate m i j k , ∀ i ∈ I; m i j k , ∀ j ∈ J; m i j k , ∀ k ∈ K; m i j k and S

Delete samples/experimental conditions j ∈ J that satisfy the following inequality

1 | I | | K | Σ i I , k K ( m ijk - m iJK - m IjK - m IJk + 2 m IJK ) 2 >λS

Recalculate m i j k , ∀ i ∈ I; m i j k , ∀ j ∈ J; m i j k , ∀ k ∈ K; m i j k and S

Delete time points k ∈ K that satisfy the following inequality

1 | I | | J | Σ i I , j J ( m ijk - m iJK - m IjK - m IJk + 2 m IJK ) 2 >λS

End if

Until(There is no change in I, J and/or K)

The complexity of this algorithm is O(max(m,n,p)) where m, n and p are the number of genes, samples and time points in the 3D microarray dataset.

In the second step, we delete one node at each iteration from the resultant submatrix, produced by Algorithm 2, until the score S of the resultant submatrix is less than or equal to δ. This step results in a δ-tricluster.

Algorithm 3 (Single node deletion)

Input. D, a matrix of real numbers that represents 3D microarray gene expression dataset; δ≥ 0, maximum allowable MSR threshold.

Output. M i j k , a δ-tricluster, consisting of a subset(I) of genes, a subset(J) of samples/experimental conditions and a subset of time points, having MSR score (S) less than or equal to δ.

Initialization. I ←{set of all genes in D }, J ←{set of experimental conditions/samples in D } and K ←{set of time points in D } and to M(I,J,K) ← D

Calculate m i j k , ∀ i ∈ I; m i j k , ∀ j ∈ J; m i j k , ∀ k ∈ K; m i j k and S.

While S >δ

Detect gene i ∈ I that has the highest score

μ(i)= 1 | J | | K | Σ j J , k K ( m ijk - m iJK - m IjK - m IJk + 2 m IJK ) 2

Detect sample/experimental condition j ∈ J that has the highest score

μ(j)= 1 | I | | K | Σ i I , k K ( m ijk - m iJK - m IjK - m IJk + 2 m IJK ) 2

Detect time point k ∈ K that has the highest score

μ(k)= 1 | I | | J | Σ i I , j J ( m ijk - m iJK - m IjK - m IJk + 2 m IJK ) 2

Delete gene or sample/experimental condition or time point that has highest μ score and modify I or J or K. Recalculate m i j k , ∀ i ∈ I; m i j k , ∀ j ∈ J; m i j k , ∀ k ∈ K; m i j k and S.

End while

Return M(I,J,K)

The complexity of first and second steps is O(mnp) as those will iterate (m+n+p) times. The complexity of selection of best genes, samples and time points is O(log m + log n + log p). So it is suggested to use algorithm II before algorithm 3.

As the goal of our algorithm is to find maximal triclusters, having MSR score (S) below the threshold δ, the resultant tricluster M(I,J,K) may not be the largest one. That means some genes and/or experimental conditions/samples and/or time points may be added to the resultant tricluster T produced by node deletion algorithm, so that the MSR score of new tricluster T’ produced after node addition does not exceed the MSR score of T. Now the third step of our algorithm is described below.

Algorithm 4 (Node addition)

Input. D, a matrix of real numbers that represents δ-tricluster, having a subset of genes (I), a subset of experimental conditions/samples (J) and a subset of time points (K).

Output. M I J K , a δ-tricluster, consisting of a subset of genes (I’), a subset of samples/experimental conditions (J’) and a subset of time points (K’), such that I ⊂ I , J ⊂ J , K ⊂ K’ and MSR(I ,J ,K’) ≤ MSR of D.

Initialization. M(I,J,K) ← D

Repeat

Calculate m i j k , ∀ i; m i j k , ∀ j; m i j k , ∀ k; m i j k and S. Add genes i ∉ I that satisfy the following inequality

1 | J | | K | Σ j J , k K ( m ijk - m iJK - m IjK - m IJk + 2 m IJK ) 2 S

Recalculate m i j k , ∀ j; m i j k , ∀; m i j k and S

Add samples/experimental conditions j ∉ J that satisfy the following inequality

1 | I | | K | Σ i I , k K ( m ijk - m iJK - m IjK - m IJk + 2 m IJK ) 2 S

Recalculate m i j k , ∀ i; m i j k , ∀ k; m i j k and S

Add time points k ∉ K that satisfy the following inequality

1 | I | | J | Σ i I , j J ( m ijk - m iJK - m IjK - m IJk + 2 m IJK ) 2 S

Until(There is no change in I, J and/or K)

I’ ← I, J’ ← J, K’ ← K Return I’, J’, K’

The complexity of this algorithm is O(mnp) as each step iterates (m+n+p) times.

Tricluster eigengene

To find tricluster eigengene we applied singular value decomposition method (SVD) on the expression data of each tricluster [7]. For instance, X g × ( c t ) i represents the expression matrix of ith tricluster, where g, c and t represent the number of genes, samples and time points of ith tricluster. Now we apply SVD on the data matrix (normalized to mean=0 and variance=1). Now, the SVD of ith tricluster can be represented as,

X i =UD V T ,
(2)

where U and V are the orthogonal matrices. Ui is a g ∗ (c ∗ t) matrix with orthonormal columns, Vi is a (c ∗ t) × (c ∗ t) orthogonal matrix and Di is (c ∗ t) × (c ∗ t) diagonal matrix of singular values.

Assuming that singular values in matrix Di are arranged in non-decreasing order, we can represent eigengene of ith tricluster by the first column of matrix Vi, i.e.

E i = V 1 i ,
(3)

Results and discussion

Results on simulated dataset

We have produced one simulated dataset SMD of size 2000 × 30 × 30. At first we have implanted three perfect shifting triclusters of size 100 × 6 × 6, 80 × 6 × 6 and 60 × 5 × 5 into the dataset SMD and then implanted three noisy shifting triclusters of the same size mentioned before into it. To estimate the degree of similarity between the implanted and obtained triclusters, we define affirmation score in the same way as Prelic et. al. defined for two sets of biclusters [6, 8]. So, overall average affirmation score of T1 with respect to T2 is as follows, where (SM G (T1, T2)) is the average gene affirmation score, (SM C (T1, T2)) is the average sample affirmation score and (SM K (T1, T2)) is the average time point affirmation score of T1 with respect to T2:

S M ( T 1 , T 2 ) = ( S M G ( T 1 , T 2 ) × S M C ( T 1 , T 2 ) × S M T ( T 1 , T 2 ) )
(4)

Suppose, we have two sets of triclusters T i m and T r e s where T i m represents the set of implanted triclusters and T r e s corresponds to the set of triclusters retrieved by any triclustering algorithm. Hence SM (T i m ,T r e s ) denotes how well the triclustering algorithm finds the true triclusters that have been implanted into the dataset. This score varies from 0 to 1 (if T i m = T r e s ).

For the dataset containing perfect shifing triclusters, we have assigned 0.35 and 1.0005 to the parameters δ and λ, respectively. The value of δ varies from one dataset to another dataset. Then we have added noisy triclusters having different levels i.e. standard deviations (σ = 0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 1.7). To have an idea about the δ value, we have first clustered the genes over all time points and then the time points over the subset of genes for each gene cluster in each sample plane using the K-means algorithm. Then we have computed the MSR value(S) of the submatrix, considering a randomly selected sample plane, gene and time-pont cluster for 100 times. Then we have taken the lowest value as the value of δ. For these noisy datasets, we have assigned 3.75 and 1.004 to the parameters δ and λ, respectively. In Figure 1 we have compared the performance of our algorithm with that of the TRICLUSTER algorithm [3] in terms of affirmation score using the artificial dataset. Our δ-TRIMAX algorithm performs better than TRICLUSTER algorithm for the noisy dataset. For perfect additive triclusters, performances of both these algorithms are comparable with each other.

Figure 1
figure 1

Comparison in terms of Affirmation Scores. a. Comparison of Affirmation scores produced by -TRIMAX and TRICLUSTER algorithm. b. Comparison of running time of -TRIMAX and TRICLUSTER algorithm on the synthetic dataset.

Results on real-life dataset

Datasets for genome-wide analysis of estrogen receptor binding sites

This dataset contains 54675 affymetrix probe-set ids, 3 biological replicates and 4 time points. In this experiment MCF7 cells are stimulated with 100 nm estrogen for 0, 3, 6 and 12 hours and the experiments are performed in triplicate. This dataset is publicly available at Gene Expression Omnibus (GEO) (dataset id- GSE 11324). It was used for discovering of cis-regulatory sites in previously uninvestigated regions and cooperating transcription factors underlying estrogen signaling in breast cancer [9]. We assign 0.012382 and 1.2 to δ and λ respectively. In this case our algorithm results in 115 triclusters. From Figure 2, we observe that the genes in tricluster 4 have similar expression profiles over all three samples across 0, 6 and 12 hours but not at 3 hour.

Figure 2
figure 2

Expression Profiles. Figures in first row show the expression profiles of genes ESR1, HOXA11, FAM71A, SPEF2, IFIH1, FPR2, SPAG9, NCF4, ADAM3A, CCNYL1 respectively of tricluster 4 over all samples; The red-colored time point (3 hrs.) is not a member of this tricluster. The figures in second row show the expression profiles of the same genes across 0, 6 and 12 hours.

To compare the performance of our proposed algorithm with TRICLUSER algorithm on real-life dataset, we have used three validation indexes.

Coverage

Coverage for any triclustering algorithm can be delineated as

Coverage= g alg × c alg × t alg G × C × T ×100,
(5)

where g a l g , c a l g and t a l g denote total number of genes, experimental samples and time points retrieved by the triclustering algorithm. G, C and T represent number of all genes, experimental samples and time points in the dataset.

Triclustering Quality Index (TQI)

We can elucidate Triclusering Quality Index of a tricluster by equation 4.

TQI= MS R i Volum e i ,
(6)

where M S R i and V o l u m e i represent mean-squared residue and volume of ith tricluster. Lower TQI score represents better quality of tricluster.

Statistical Difference from Background (SDB)

Here we have introduced another quality measurement, termed as Statistical Differences from Background (SDB) [10] as

SDB= 1 n i = 1 n MS R i 1 r j = 1 r RMS R j - MSRi ,
(7)

where n is the total number of triclusters extracted by the algorithm. M S R i represents mean squared residue of ith tricluster retrieved by the algorithm and R M S R j represents mean squared residue of jth random tricluster having the same number of genes, experimental samples and time points as that of ith resultant tricluster. Here higher value of the denominator denotes better quality of the resultant tricluster. Hence, lower SDB score signifies better performance of the algorithm. Table 1 shows the comparison between proposed δ-TRIMAX algorithm and TRICLUSTER algorithm in terms of coverage, SDB and TQI score.

Table 1 Comparison between δ -TRIMAX and TRICLUSTER algorithm using coverage, Statistical Difference of from Background (SDB) and Triclustering Quality Index (TQI)

Biological significance

We have established the biological significance of genes belonging to each resultant tricluster by performing (a) Gene Ontology (GO) and KEGG pathway enrichment analysis, (b) cogitating each tricluster with different estrogen-responsive stages (early (3 hour), middle (6 hour) and late (12 hour)), (c) identifying hub genes of each tricluster and (d) Transcription Factor Binding Site (TFBS) enrichment analysis.

GO and KEGG pathway enrichment analysis

We have used GOStats package [11] in R to perform GO and KEGG pathway enrichment analysis for establishing biological significance of genes belonging to each tricluster. We have adjusted the p-values using Benjamini-Hochberg FDR method [12] and considered those terms as significant ones that have a p-value below a threshold of 0.05. The smaller p-value represents higher significance level. We have found statistically enriched GO terms for genes belonging to each tricluster. We have compared the performance of our proposed δ-TRIMAX algorithm with that of TRICLUSTER algorithm on real-life dataset. For comparison of the performances we have considered GO Biological Processes (GOBP) and KEGG pathway terms that have already been reported to play an important role in estrogen induced breast cancer cell. Table 2 shows the comparison between δ-TRIMAX and TRICLUSTER algorithm in terms corrected p-values of GOBP and KEGG pathway terms cell adhesion and Wnt signaling pathway that are observed to be associated with estrogen induced breast cancer [13, 14], respectively.

Table 2 Comparison between δ -TRIMAX and TRICLUSTER algorithm in terms of p-values of GO and KEGG pathway term enrichment analysis

Association of triclusters with different stages of response to estrogen stimulus

To cogitate each tricluster with different estrogen responsive stages of the experiment, we represent each tricluster by eigengene. Then we have examined whether the eigengene of each tricluster is differentially expressed at early, middle and late estrogen responsive stages using Limma package in R [15] (FDR-BH corrected p-value cutoff 0.05). If eigengene of one tricluster is found to be differentially expressed at any possible responsive stages, then the genes having highly correlated expression profiles with that of eigengene can also be considered to be significantly expressed at the same stages. In total our algorithm results in 115 triclusters. Eigengene of tricluster 7 has been found to be differentially expressed between 0 hour - 6 hours, 0 hour - 12 hours, 3 hours - 12 hours and 6 hours - 12 hours. 429 genes among 505 genes are found to be differentially expressed in this tricluster. KEGG pathway term mTOR signaling pathway is observed to be meliorated in this tricluster and has been reported to be associated with estrogen induced breast cancer cell [16]. Genes PIK3CA, PRKAA1, RPS6, ULK2 participate in that pathway. The genes belonging to tricluster 50 are coexpressed over all samples across 0, 6 and 12 hours. The eigengene of tricluster 50 has been observed to be differentially expressed between 0 hour - 12 hours and 6 hours - 12 hours. 96% of the genes belonging to this tricluster are found to be differentially expressed. The genes in this tricluster are meliorated with the KEGG pathway term ubiquitin mediated proteolysis (UBE2K, CUL4B, PIAS1, CDC23). It has been reported in a previous study that there is crosstalk between ER α and targets of ER α for ubiquitin mediated proteolysis [17]. In tricluster 71 time points 3, 6 and 12 hours are present in that tricluster and the eigengene is significantly expressed between 3 hours and 12 hours. 44 genes out of 52 genes in this tricluster are significantly expressed between 3 and 12 hours. Genes belonging to this tricluster are also enriched with the KEGG pathway term ubiquitin mediated proteolysis (UBA6, BIRC6, ANAPC1, CUL5). Genes belonging to tricluster 48 are coexpressed across 0, 3 and 12 hours. Eigengene of tricluster 48 is significantly expressed between 0 and 12 hours, 3 and 12 hours. The KEGG pathway term TGF-beta signaling pathway is meliorated in this tricluster and the crosstalk between TGF-beta signaling pathway and ER α has been reported in a previous study [18]. Genes SKP1, BMPR2 are found to play a role in the enriched pathway. Eigengene of tricluster 95 is significantly upregulated between 0 and 12 hours. 60% of all genes belonging to this tricluster are differentially up regulated at late responsive stage. The genes in this tricluster has been found to be coexpressed across 0 hour, 12 hours over all samples. The KEGG pathway term apoptosis (XIAP, IRAK4, CASP6) is observed to be meliorated in this tricluster and it has been found in a recent study that apoptosis can be induced by estrogen in estrogen deprivation-resistant breast cancer cell [19]. The genes of that tricluster 75 have been observed to be coexpressed over all samples across 3 and 12 hours. The eigengene of tricluster 75 is differentially expressed between 3 hours and 12 hours. 39 genes among 64 genes are significantly expressed between 3 and 12 hours. In this case we have observed enrichment for KEGG pathway terms ErbB signaling pathway that is found to be associated with estrogen induced breast cancer cell [20]. The coexpressed genes NCK1, SOS2 in this tricluster participate in that pathway.

Identification and roles of hub genes

To identify hub genes of each tricluster, we have computed tricluster membership of each gene by calculating Pearson correlation coefficient between each gene and the eigengene of that tricluster. We have considered the top fifteen genes as hub genes having highest correlation coefficient with the eigengene of that tricluster. For tricluster 1, we have identified NPC1L1, TMEM161B-AS1, POU5F1P3, POU5F1P4, POU5F1B, CCL2 as hub-genes that are coexpressed over all-time points. It has been observed in a previous study that high doses of estrogen augment intestinal cholesterol absorption attributable in part to an up-regulated expression of NPC1L1 which is known as intestinal sterol influx transporter [21]. CCL2 is found to play an important role in mediating cross-talk between cancer cells and stromal fibroblasts in breast cancer cells [22]. DNAJC3-AS1, ITSN2, TRPC1, CD47, ZNF286A, TSC22D2, PHF17, ZNF286B, TMEM67, NFIB, JKAMP, DENND4A, HPGD are identified as hub-genes that are coexpressed over all samples and 0, 3, 6 and 12 hours in tricluster 7. NFIB has been reported as a potential target of ER negative breast cancers [23]. Transient receptor potential cation channel (TRPC1) is known to play an important role in breast cancer [24]. CD47 has been found to intervene killing of breast cancer cells [25]. HPGD plays important role in epithelial-mesenchymal transition and migration in breast cancer cells [26]. In tricluster 4, the genes are coexpressed over 0, 6 and 12 hours and IGKV1-13, FAM69C, SGCD, CSNK1E, TRMU, CRYBA2, IGKV1D-13, IGSF11, PACS1, IQCK are identified as hub-genes. CSNK1E has been observed to play an important role proliferation of breast cancer cells and act as a regulator of activated -catenin driven transcription [27]. For tricluster 13 we have identified ESYT3, SERINC2, LRRC14, ALDH4A1, RPL10, BRD4, DEC1, ZFP30, TCP11L2, ALDOA as hub-genes. The gene for Aldolase A (ALDOA) plays an instrumental role in hypoxia which is a feature of solid tumors in breast cancer [28]. Besides this BRD4 known as Bromodomain 4 is found to be associated with breast cancer progression [29]. Genes PFKFB1, TAF1, PIKFYVE, MEMO1P1, KIF1B, PHF20L1, ARHGAP24, TSC22D1, AK7, DPY30, MEMO1, PTEN, ADAM9, PTPN2, MTSS1L are found as hubgenes of tricluster 95. PTEN is known to be a tumor suppressor gene in breast cancer [30, 31]. PTPN2, ADAM9 have been reported to be associated with breast cancer in previous studies [32, 33]. PIKFYVE has been found to intervene epidermal growth factor receptor that is associated with human breast cancers [34]. In case of tricluster 42, ANTXR2, RHBDL2, GSTCD, DENND1B, KLC3, PREP, NOS1, STOML3, CDK5R1, CLEC7A, HGD, FOXC1, MSRB3, TEX34, SLC36A1 are appeared as hub-genes that are coexpressed over all samples and across 0, 12 hours. In a recent work, the activity of RHBDL2 has been identified in many tumour cells including breast cancer [35]. The role of FOXC1 as a regulator of human breast cancer cells by activating NF κ B signaling has been discovered in a recent work [36].

TFBS enrichment analysis

To analyse the potential coregulation of coexpressed genes, we have done transcription factor binding site (TFBS) enrichment analysis using the TRANSFAC library (version 2009.4) [37] that contains eukaryotic transcription factors, their experimentally proven binding sites, and regulated genes. Here we used 42,544,964 TFBS predictions that have high affinity scores and are conserved between human, mouse, dog and cow [38]. Out of these 42 million conserved TFBSs we have selected the best 1% for each TRANSFAC matrix individually to identify the most specific regulator (transcription factor) - target interactions. We have used hypergeometric test [39] and Benjamini Yekutieli-FDR method [40] for p-value correction to find over-represented binding sites (p-value ≤ 0.05) in the upstream regions of genes belonging to each tricluster. Table 3 shows the list of triclusters where we have found statistically meliorated TFBSs. From Table 3, we can observe that the genes in tricluster 26 are enriched with helix-turn-helix, zinc-coordinating DNA-binding and basic domain transcription factors. The helix-turn-helix domain transcription factor E2F1, to which TRANSFAC matrix V$E2F_Q2 is associated acts as a regulator of cell proliferation in estrogen-induced breast cancer cell [41]. The zinc finger transcription factors Sp1 and Sp4, asociated with matrix V$SP1_Q6_01 have already been reported to play an important role in estrogen-induced MCF-7 breast cancer cell line [42, 43]. In tricluster 17, the basic domain transcrption factor CREB (matrix V$CREB_01) is important for malignancy in breast cancer cell. ATF1, ATF2, ATF3, ATF4, ATF5 (matrix V$CREBATF_Q6) likewise play an important role in breast cancer cell [44]. We have observed enrichment for matrix V$NFAT1_Q6. The corresponding transcription factor (NFATC1) has been found to be associated with clinical characteristics in breast cancer cell [45]. In tricluster 4 POU2F1, the TF associated with matrix V$OCT1_03 is a helix-turn-helix domain transcription factor (Oct-1) and has been reported to be estrogen-responsive in a previous study [46]. Table 4 shows some statistically enriched KEGG pathway terms for coexpressed and differentially expressed (using adjusted p-value ≤ 0.05) genes the promoters of which are bound by aforementioned transcription factors.

Table 3 TRANSFAC Matrices for Triclusters, having statistically enriched TFBS for real-life dataset
Table 4 Statistically enriched KEGG pathway terms for differentially expressed and coexpressed targets of TRANSFAC matrices V$NFAT1_Q6, V$OCT1_03, V$CREB_01, V$CREBATF_Q6, V$E2F_Q2 and V$SP1_Q6_01

Conclusion

In this work we have proposed δ-TRIMAX triclustering algorithm that aims to retrieve large and coherent groups of genes, having an MSR score below a threshold δ. Genes belonging to each tricluster are coexpressed over a subset of samples/ experimental conditions and across subset of time points. The results of GO and KEGG pathway enrichment analysis show that our proposed algorithm is able to extract group of coexpressed genes that are biologically significant. We have performed TFBS enrichment analysis to establish the fact that the promoter regions of the genes having similar expression profile are bound by the same transcription factors. We have compared the performance of our algorithm with that of existing algorithm using one artificial dataset in terms of affirmation score and one real-life dataset in terms of coverage, statistical difference from background and triclustering quality index score. In case of these two datasets our proposed algorithm outperformed the existing one. Additionally, here we have represented the expression profiles of genes belonging to each tricluster by eigengene and then identified hub genes using the profile of eigengene. We have observed that most of the identified hub-genes are previously reported to be associated with breast cancer and estrogen responsive elements. The other identified hub genes might be associated with breast cancer and need to be verified experimentally. Hence our integrative approach and findings might provide new insights into breast cancer prognosis.