Background

Hydrophobins play a very important role in fungal growth and development. They are a class of small fungal proteins, which are responsible for the formation of aerial hyphae and attachment of hyphae to hydrophobic surfaces [1]. The remarkable ability of self-assembly at hydrophilic/hydrophobic surfaces makes hydrophobins interesting and unique, not only in biology but also in industrial applications. There are two main classes of hydrophobins: class I and class II. These two classes have been distinguished based on the differences in their hydropathy patterns and biophysical properties [2]. The membrane formed by self-assembly of Class I hydrophobins is more stable than the one formed by Class II. It is also characterized by a rodlet structure, which does not exist in Class II hydrophobins. More interestingly, this structure is observed in amyloid proteins, which suggests similarities in structure and self-assembly mechanism.

SC3, a glycosylated hydrophobin, is the most extensively studied class I hydrophobin to date [3]. It contains the only known motif in hydrophobins, the eight-cysteine residues [4]. The cysteines form four disulfide bridges, which keep SC3 from self-assembling and account for controlled assembly at hydrophilic/hydrophobic interfaces [5]. They also divide the entire molecule into four loops with N-terminal sequence of nearly 30 amino acids. On the other end of the sequence, the C-terminal region only contains 6 amino acids [6].

It is assumed that self-assembly of class I hydrophobins is accompanied by conformational changes as it is the case for SC3. The intermediate α-helix state changes into the stable ultimate β-sheet form at water-air interface. However, study of class I hydrophobin EAS from Neurospora crassa by NMR found no obvious structure other than a small core region composed of three antiparallel β-strands, which is probably stabilized by the four disulfide bridges [7]. This discrepancy in both structure and sequence is a major hurdle in studying hydrophobins and their unique self-assembly properties. In this paper, we use computational tools to extract information from hydrophobin sequences and identify conserved motifs to find new members of the hydrophobin family.

Results

Batch BLAST search for all hydrophobin sequences and identification of new conserved motifs

A total of 183 sequences containing the key word hydrophobin were found including 22 sequences in UniProtKB/Swiss-Prot and 161 sequences in UniProtKB/TrEMBL, most of which were fragments only. No filtering was performed only because we did not want to lose any potential signals. All 183 sequences, including fragments, were used to conduct batch BLAST search against nr database on NCBI server. A perl program was used to perform this task. By doing so, all potential hydrophobins should have been found in one or more search results. 6715 sequences were returned by the batch BLAST. All results were automatically filtered by the presence of the word hydrophobin in their description. After elimination of identical sequences, judged by GI number, and the sequences that don't have the eight-cysteine residues pattern (C-CC-C-C-CC-C) by the perl program, 128 hydrophobin non-fragment sequences were dug out from the nr database and put into an Excel file. Also, the program retrieved all relative information on each individual sequence of these 128 hydrophobins to prepare for the conserved motif search.

Highly similar sequences were removed to reduce bias on conserved motif search. Only 110 sequences were used as input to MEME to conduct conserved motif search. Considering the high diversity among the hydrophobin sequences, the maximum motif number was set to 10. The first 6 motifs were reasonable and therefore used as input to MAST program. They are in PSSM format; however the consensus sequences are shown in the Table 1.

Table 1 List of Consensus Sequences of the 6 motifs.

Identification of new hydrophobins

Using motifs obtained from MEME, we conducted a MAST search and analyzed the results with a perl program. It removed not only all the identical sequences but also the ones used as input from the search results. The e-value for qualified sequence is 10.

As we have mentioned before, in order not to miss any signal, we have set the maximum motif number to 10, which leads to the increase in the number of false positives. Pattern matching and domain analysis have been conducted to filter the results from the last step. First, any known hydrophobin sequences in the returned sequences were removed (111 sequences removed). Then, all the obtained sequences have been scanned with the eight-cysteine-residue pattern (19 sequences removed). All sequences containing more or less cysteine residues have been filtered out. The sequence gi:71020699 hypothetical protein was also filtered out because its length is far larger than any other known hydrophobin sequence. At last, domain analysis with SMART on each individual sequences was conducted. The existence of hydrophobin domain has to be positive to keep the sequence in the candidate list (6 sequences removed).

In total, 9 strong candidates as potential hydrophobins were found among five species, Aspergillus nidulans, Agrocybe aegerita, Gibberella zeae, Metarhizim anisopliae and Ustilago maydis. The details are listed in Table 2. These nine hydrophobins have not been named as hydrophobins and not reported previously, and are found for the first time in the study.

Table 2 All the identified new hydrophobins and their source organisms.

Pattern and domain analysis

As we described, conserved motifs were used to search for potential novel hydrophobins, pattern and domain analysis were further used to filter the results. Consequently, all these 9 new hydrophobin sequences have at least one of the 6 conserved motifs we first found in this study. In Figure 1, we aligned the second motif in these 9 sequences and other known hydrophobin sequences (5 sequences), and this motif is highly conserved among all the sequences including new and known hydrophobin sequences, indicating the motifs we identified in the study are useful for finding potential new hydrophobins. We also did a multiple sequence alignment (figure 2) which clearly shows the hydrophobin pattern in all the newly identified hydrophobin sequences. In figure 3, we can clearly see the detail location of the hydrophobin domain in each newly identified hydrophobin sequence.

Figure 1
figure 1

Motif Alignment among known and newly identified hydrophobins. The second motif we found exists in almost all the hydrophobin sequences. The first 9 sequences are the new identified hydrophobins and the other 5 sequences are known hydrophobins.

Figure 2
figure 2

The alignment of the regions including the eight-cysteine-residues in the 9 sequences.

Figure 3
figure 3

Hydrophobin Domain in the newly identified sequences. All the 9 newly identified hydrophobins contain the hydrophobin domain stored in the Pfam database. This is another indication that they are possibly hydrophobins.

Phylogenetic tree

All the potential hydrophobins found in our analysis were used to build a phylogenetic tree that also included selected known hydrophobins. The purpose of phylogenetic analysis is to study the evolutionary relationship between the hydrophobins identified here and the other known hydrophobins and also to perform a final check on the reliability of the results obtained. We selected 15 sequences from the input sequence pool along with 9 new identified sequences to construct a phylogenetic tree. The results are shown in Figure 4. We can see that most new identified hydrophobins are grouped together with other reference sequences, which supports our results.

Figure 4
figure 4

Phylogenetic analysis among newly identified hydrophobins and other known hydrophobins. In this figure we can see that newly identified hydrophobins are grouped together with other known hydrophobins, which strongly supports the research.

Discussion

The unique property of hydrophobins to self-assemble at hydrophobic-hydrophilic interfaces makes them interesting candidates for use in medical and industrial applications [8]. The high stability of the Class I hydrophobins have made them ideal candidates for industrial application. For example, hydrophobins could be used as a protection layer of many instruments against damage from other chemical reagents. It could also be used to change a hydrophobic surface into a hydrophilic one and vice versa. In addition, the size and properties of rodlets formed by class I hydrophobins make them useful for application in nanotechnology. Another very important application is the use of hydrophobins as an intermediate to attach cells or molecules to surfaces with different properties. Each hydrophobin is slightly different from the others, which makes it possible to find a suitable hydrophobin for a specific application. This underlines the need to identify new hydrophobins.

The best way of making use of hydrophobins is structure determination. In order to facilitate research in hydrophobin structure, a database is currently under development. This database will store not only all common protein properties but also properties specific to hydrophobins, such as self-assembled layer stability. Users can make various queries against the database, such as all the hydrophobin sequences in a specific species. With more and more hydrophobins discovered, more information will be answered by sequence analysis.

Conclusion

All of 9 newly identified potential hydrophobins possess the common pattern and hydrophobin domain. The result of the multiple sequences alignment has shown that some of the newly identified potential hydrophobin sequences are closely grouped to other known ones, which strongly supported the result of this research. The phylogenetic relationships between them are very close. Besides, we have also identified a pattern which exists in almost all hydrophobin sequences. It could become a very accurate criterion in new hydrophobin identification and it could also facilitate the research on the unique property of hydrophobin family.

Methods

The entire workflow is given as following (Figure 5). It gives an overview of the methods used in this paper.

Figure 5
figure 5

Workflow of the entire project.

Prepare sequences

Key word hydrophobin was used to search the UniProt Knowledgebase (Swiss-Prot and TrEMBL http://www.expasy.org/cgi-bin/sprot-search-ful) to find existing hydrophobin protein sequences. Both whole length sequences and fragments are included to insure the maximum coverage of hydrophobin sequence information.

Batch BLAST

A perl program was used to conduct batch BLAST against NCBI database http://www.ncbi.nih.gov/. In order to keep noise at the lowest level, all qualified hydrophobin sequences contained the eight-cysteine residue pattern (C-CC-C-C-CC-C) along with word hydrophobin in their sequence description. The perl program also conducted filtering based on the criteria.

Multiple Sequence alignment and highly identical sequences elimination

Multiple sequence alignment was performed using ClustalW program http://www.ebi.ac.uk/clustalw/ available at the European Bioinformatics Institute web site [9]. Based on the result, highly identical sequences were removed to reduce the bias on the conserved motif search.

Conserved motif search

MEME http://meme.sdsc.edu/meme/intro.html[10] and PRATT http://www.ebi.ac.uk/pratt/[11] were used to dig out conserved motif information. The output of the program is completed studied and then used as input to the next step of the pipeline.

Database search for new hydrophobins

MAST http://meme.sdsc.edu/meme/mast.html[12], bundled with MEME, was used to search for new hydrophobin sequences against nr database (non-redundant protein database). Huge amount of sequences are retrieved. A perl program is written to eliminate replications based on genbank ID.

Domain analysis of identified hydrophobins

SMART http://smart.embl-heidelberg.de/[13] domain analysis tool was used not only to predict the domain architecture of identified sequences but also to filter the results. A domain alignment between the newly identified potential hydrophobins and the known hydrophobins is conducted based on the result of this research.

Phylogenetic analysis

Phylip http://evolution.genetics.washington.edu/phylip.html was used to build a phylogenetic tree. A phylogenetic tree was constructed using the neighbor-joining algorithm as described by Saitouand Nei [14]. To investigate the evolutionary relationship between the putative hydrophobins identified and other hydrophobins, a phylogenetic tree was built.