Abstract
As a subject closely tied to everyday life and to our understanding of the world, biomedicine has kept drawing attention from researchers in recent years. To improve researchers' efficiency and accelerate progress in the subject, AI techniques, especially NLP methods, are widely adopted in biomedical research. In this chapter, with biomedical knowledge as the core, we discuss knowledge representation and acquisition as well as biomedical knowledge-guided NLP tasks and explain them in detail with practical scenarios. We also discuss current research progress and several future directions.
12.1 Introduction
There is a widely adopted perspective that the twenty-first century is the age of biology [30]. Indeed, biomedicine has always occupied an important position and maintained relatively rapid development. Researchers devote themselves to exploring how living systems (e.g., cells, organisms, individuals, and populations) work, what the mechanisms of genetics (e.g., DNA and RNA) are, how the external environment (e.g., chemicals and drugs) affects these systems, and many other important topics [42]. The field's recent flourishing has been driven by the development of emerging interdisciplinary domains [92], among which biomedical NLP draws much attention as a representative topic in AI for science, which aims to apply modern AI tools to various areas of science to achieve efficient scientific knowledge acquisition and application.
12.1.1 Perspectives for Biomedical NLP
The prospect of biomedical NLP is to improve human experts' efficiency by mining useful information and finding potential implicit laws automatically, and it is closely related to two branches of biology: computational biology and bioinformatics. Computational biology emphasizes solving biological problems with the aid of computer science: researchers use computer languages and mathematical logic to describe and simulate the biological world. Bioinformatics studies the collection, processing, storage, dissemination, analysis, and interpretation of biological information, focusing mainly on genomics and proteomics. The two terms, computational biology and bioinformatics, are now generally used interchangeably.
According to the format of the processed data, we can analyze biomedical NLP from two perspectives. The first perspective is NLP tasks on biomedical domain text, in which we regard biomedicine as a specific domain of natural language documents; the basic tasks are therefore shared with general-domain NLP, while the corpus has its own features. Typical tasks [17] include named entity recognition, term linking, relation extraction, information retrieval, document classification, question answering, etc.
The other perspective is NLP methods for biomedical materials, in which NLP techniques are adopted and transferred for modeling non-natural-language data and solving biomedical problems, such as the mining of genetic and protein sequences [38]. As shown in Fig. 12.1, biomedical materials include natural language documents and other materials. The latter can be expressed as sequences, graphs, and other forms, so the representation learning techniques introduced in previous chapters can be employed to model biomedical materials. To ensure the effectiveness of general NLP techniques in this new scenario, adjustments are required to better fit the data characteristics (e.g., the much smaller vocabulary of genetic sequences compared with natural language).
Overall, biomedical natural language documents contain linguistic and commonsense knowledge and also provide explicit and flexible descriptions of biomedical knowledge. Meanwhile, the special materials in the biomedical domain contain even more subject knowledge in implicit forms. We believe that the two perspectives are gradually fusing to achieve more universal biomedical material processing, and we will go into more detail about this trend later.
12.1.2 Role of Knowledge in Biomedical NLP
A characteristic of biomedical NLP is that expert knowledge is of key importance for a deep comprehension of the processed materials. This even restricts the scale of gold-standard datasets due to the high cost and difficulty of manual annotation. Therefore, we emphasize knowledge representation, knowledge acquisition, and knowledge-guided NLP methods for the biomedical domain.
First, biomedical materials have to be expressed properly to suit automatic computing, which benefits from the development of knowledge representation methods such as distributed representations. Next, echoing the basic goals of AI for science, we expect biomedical NLP systems to assist us in extracting and summarizing useful information or rules from a mass of unstructured materials, which is an important part of the knowledge acquisition process. Nevertheless, as mentioned above, large-scale biomedical NLP datasets are hard to obtain, which is one reason that the performance of data-driven deep learning systems in the biomedical domain is not always satisfactory. To improve the performance of these intelligent systems under such limited conditions, knowledge-guided NLP methods become especially important. With the help of biomedical knowledge, NLP models trained on the general domain can be transferred to biomedical tasks with minimal supervision. For instance, the definitions and synonyms of terms in biomedical ontologies can guide models toward a deeper comprehension of biomedical terms occurring in the processed texts.
In Sect. 12.2, we first introduce the representation and acquisition of biomedical knowledge, which comes from two types of materials: natural language text and other biomedical data. Later, in Sect. 12.3, we focus on the knowledge-guided biomedical NLP methods which are divided into four groups according to the discussion in Chap. 9. After learning about the basic situation of biomedical knowledge representation learning, we will then explore several typical application scenarios in Sect. 12.4 and discuss some advanced topics that are worth researching in Sect. 12.5.
12.2 Biomedical Knowledge Representation and Acquisition
Going back decades, AI systems for biomedical decision support had already shown the importance of knowledge representation and acquisition: the former is the basis of practical usage, and the latter ensures the sustainability of expert systems with growing knowledge. In that period, biomedical knowledge was represented in a structured manner. For instance, DENDRAL [10] is an expert system providing advice for chemical synthesis; its production rules first recognize the situation and then generate corresponding actions. This two-stage process is similar to human reasoning and has strong explanation capability. Other systems represent knowledge in the form of frames, relations, and so on [37]. Correspondingly, the acquisition of biomedical knowledge mainly relied on manual collection, and assistant information extraction systems were built mainly on manual feature engineering.
With the development of machine learning, knowledge representation and acquisition have been raised to new heights. The following discussion is divided by the two sources of knowledge, natural language text materials and other materials, corresponding to the two perspectives mentioned in the last section.
12.2.1 Biomedical Knowledge from Natural Language Text
Biomedical textual knowledge is scattered across natural language documents, patents, clinical records, etc. Various general-domain knowledge representation learning methods are applied to these natural language text materials. What is special about biomedical texts is that we have to achieve a deep comprehension of the key biomedical terms. Therefore, we first discuss various term-oriented biomedical tasks that researchers explore. We then turn to pre-trained models (PTMs) to achieve an overall understanding of language descriptions (including the sentences, paragraphs, and even whole documents around these terms).
Term-Oriented Biomedical Knowledge
Biomedical terms, including the professional concepts and entities in the biomedical domain, are important carriers of domain knowledge. Common biomedical terms include chemical/genetic/protein entities, disease/drug/examination/treatment items, cell/tissue/organ parts, and others. To process biomedical natural language materials well, a deep comprehension of these terms is necessary. Dictionary-based and rule-based methods are labor-intensive, and it is difficult for them to stay up to date and handle complicated scenarios [46]. To grasp and analyze data features automatically, machine learning and statistical learning are adopted to obtain more generalized term representations and better acquisition performance [87], though the results are still far from satisfactory. Deep learning has since developed rapidly and proven its effectiveness in the biomedical domain; we therefore mainly introduce biomedical term processing methods in the deep learning approach, which is currently the mainstream solution for biomedical knowledge representation and acquisition.
Biomedical Term Representations
Mainstream term representation methods work in a self-supervised manner, predicting missing parts from the given context, in the hope of obtaining general feature representations for various downstream tasks.
Many general-domain techniques such as word embeddings are directly used in biomedical scenarios without adaptation. The skip-gram version of word2vec [16], for example, has been shown to perform well on the biomedical term semantic relatedness task [65]. Besides, researchers also train distributed representations specifically for biomedical terms [22, 70, 102], introducing extra information such as the UMLS [9] ontology and the medical subject headings (MeSH) [54]. On top of shallow term embeddings, we can go a step further and adopt deep neural networks such as CNNs and BiLSTMs to obtain deep distributed representations for biomedical terms [48, 74].
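To make the term relatedness evaluation concrete, the sketch below compares toy term embeddings with cosine similarity, the standard metric for such tasks. The vectors, terms, and dimensionality here are purely illustrative assumptions; real skip-gram embeddings are learned from large biomedical corpora and have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 4-dimensional embeddings (hypothetical values for illustration only).
embeddings = {
    "aspirin":   [0.9, 0.1, 0.2, 0.0],
    "ibuprofen": [0.8, 0.2, 0.3, 0.1],
    "femur":     [0.1, 0.9, 0.0, 0.4],
}

# Related drug terms should score higher than an unrelated drug/anatomy pair.
drug_pair = cosine_similarity(embeddings["aspirin"], embeddings["ibuprofen"])
mixed_pair = cosine_similarity(embeddings["aspirin"], embeddings["femur"])
```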
In recent years, PTMs have been the most popular choice for generating distributed representations as the basis of various downstream tasks. Inspired by PTMs such as BERT, which have achieved impressive performance in the general domain, researchers quickly transferred the self-supervised approach to the biomedical domain. SciBERT [7] is one of the earliest PTMs specially adapted for the scientific corpus, followed by BioBERT [47], which further narrows the target corpus to biomedicine. The adaptation is simple: replacing the general-domain pre-training corpus of BERT (e.g., Wikipedia, books, and news) with biomedical texts (e.g., literature and medical records). Some other biomedical PTMs also follow this strategy, such as SciFive [73], which is adapted from T5.
To sum up, the knowledge representation methods in the general domain are adapted to biomedical terms quite well. Special hints including information from the subject ontologies can provide extra help to generate better representations.
Biomedical Term Knowledge Acquisition
The identification of terms involves many subtasks: recognition (NER), classification (typing), mapping (linking), and so on [46]. We introduce several mainstream solutions in chronological order.
Some simple supervised learning methods have been applied for term recognition, such as the hidden Markov model (HMM) for term recognition and classification [19, 24] and the support vector machine for biomedical NER [67]. These systems mainly rely on pre-defined domain-specific word features and do not perform well on data-scarce classes. Neural networks including LSTMs are also widely adopted for biomedical term knowledge acquisition [31, 33]. Unsupervised approaches have been explored as well and proven effective [101].
With the help of biomedical PTMs, we can better acquire and organize knowledge from the mass of unstructured text. At the term level, PTMs encode the long text into dense representations, which can then be fed into classifiers or softmax layers to perform NER, entity typing, and linking precisely. The tuning methods are sometimes specially designed, such as conducting self-alignment training on pairs of entity names and categorical labels from several ontologies [40, 55]. Though the methodology of PTMs has been successfully adapted to the biomedical domain, domain-specific problems remain. For example, compared with the general domain, biomedical texts have more nested entities because of terminology naming conventions: the DNA entity IL-2 promoter also contains a shorter protein entity IL-2, and G1123S/D refers to two separate entities G1123S and G1123D, as shown in Fig. 12.2. We can address the nested entity problem by separating different types of entities into different output layers [26] or by detecting the boundaries and assembling all the possible combinations [15, 89].
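The boundary-and-assembly strategy above can be illustrated with a minimal sketch: enumerating candidate spans (so that both IL-2 and IL-2 promoter survive as candidates for a span-scoring model) and expanding slash shorthand such as G1123S/D into separate entities. The function names and the span-length limit are our own illustrative choices, not taken from any specific system.

```python
def enumerate_spans(tokens, max_len=4):
    """Enumerate all candidate spans up to max_len tokens. A span-based
    nested-NER model scores each candidate independently, so overlapping
    mentions such as "IL-2" and "IL-2 promoter" can both be kept."""
    spans = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            spans.append(" ".join(tokens[start:end]))
    return spans

def expand_mutation_shorthand(mention):
    """Expand shorthand like "G1123S/D" into the two separate entities
    "G1123S" and "G1123D" mentioned in the chapter."""
    if "/" not in mention:
        return [mention]
    head, *alternatives = mention.split("/")
    stem = head[:-1]  # shared prefix, e.g., "G1123"
    return [head] + [stem + alt for alt in alternatives]
```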
Language-Described Biomedical Knowledge
As we can see, the research discussed so far focuses mainly on the special terms carrying professional biomedical knowledge. Nevertheless, other words and phrases in the language materials also contain rich information such as commonsense knowledge and can express much more flexible biomedical facts and attributes than isolated terms. It is necessary to represent whole language descriptions instead of only biomedical terms, and this can be achieved quite well by domain PTMs. Based on the representations of language materials, the biomedical knowledge scattered in unstructured text can be acquired and organized into a structured form.
We now introduce the overall development of language-described biomedical knowledge extraction. The popular datasets are mostly small-scale and focus on specific types of relations, like the BC5CDR chemical-disease relation detection dataset and the ChemProt chemical-protein interaction dataset [49]. These simple tasks can sometimes be finished quite well with the help of distributed representations, even ones generated by simple neural networks without pre-training [85]. However, practical scenarios require more sophisticated information extraction (e.g., N-ary relations, overlapping relations, and events). Since scientific facts usually involve strict conditions, few of them can be expressed clearly with only a triplet. For example, the effect of a drug on a disease is related to the characteristics of the sample, the course of the disease, etc. As shown in Fig. 12.2, text mentioning N-ary relations is usually quite long and may cross several paragraphs. PTMs show their effectiveness here due to their capability of capturing the long-distance dependencies of sophisticated relations in long documents, encoding the mentions, and then obtaining distributed entity representations for the final prediction [41].
Summary
Overall, researchers handle the simple biomedical text processing scenarios quite well by transferring knowledge representation and acquisition methods from general-domain NLP, while challenges remain from practical perspectives. Knowledge storage structures with stronger expressive ability, plentiful annotated data, and purpose-designed architectures are urgently needed for sophisticated biomedical knowledge representation and acquisition.
12.2.2 Biomedical Knowledge from Biomedical Language Materials
Biomedical materials contain not only textual materials scattered in natural language but also some materials unique to the biomedical field. These materials have their own special structures in which rich knowledge exists, and we collectively refer to them here as biomedical language (e.g., genetic language) materials. Compared with natural language materials, biomedical language materials like genetic sequences are not easy to comprehend and require extensive experience and background information for analysis. Fortunately, modern neural networks can process not only natural language documents but also most of the sequential data including some representations of chemical and genetic substances. Besides, deep learning methods can also be applied to represent and acquire biomedical knowledge in other forms such as graphs. In this section, we consider genetic language, protein language, and chemical language for discussion, and substances expressed by these languages are linked by the genetic central dogma [21]. As shown in Fig. 12.3, genetic sequences are expressed to get proteins, which react with various chemicals to execute their functions.
Genetic Language
There are altogether only five types of nucleotide bases, among which A, G, C, and T appear in DNA sequences and A, G, C, and U appear in RNA sequences. Since the coding region of the unwound DNA is transcribed to generate an mRNA sequence with a fixed correspondence, i.e., A-T(U) and G-C, the processing methods for DNA and RNA sequences are often similar; we mainly discuss DNA sequences in this section. We first introduce basic tasks for DNA sequence processing and then discuss the similarities and differences between genetic language and natural language. For genetic language specifically, we present related work on tokenization and encoding methods.
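The fixed base correspondence mentioned above can be written down directly. The sketch below maps a DNA template strand to its mRNA, ignoring strand orientation (5'/3' direction) for simplicity.

```python
# Fixed base pairing rule from the central dogma: A-T(U) and G-C.
COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_strand):
    """Return the mRNA complementary to a DNA template strand.
    Strand orientation (5'/3') is ignored in this simplified sketch."""
    return "".join(COMPLEMENT[base] for base in template_strand)

# The two alphabets illustrate the five base types in total.
DNA_ALPHABET = set("ACGT")
RNA_ALPHabet = sorted(set(COMPLEMENT.values()) | {"A"})
```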
Basic Tasks for Genetic Sequence Processing
First, let’s take a look at various downstream property prediction tasks for genetic sequences. Some of them emphasize the high-level semantic understanding of DNA sequences [5, 81] (long-distance dependency capturing and gene expression prediction), such as transcriptional activity, histone modifications, TF binding, and DNA accessibility in various cell types and tissues for held-out chromatin. Other tasks evaluate the low-level semantic understanding of DNA sequences [68] (precise recognition of basic regulatory elements), such as the prediction of promoters, transcription factor binding sites (TFBSs), and splice sites.
Features of Genetic Language
Although both DNA/RNA language and natural language are textual sequences, differences exist between them. First, genetic sequences are quite long and monotonous, and thus not as reader-friendly for human beings as natural language sequences; NLP models, however, are good at reading and learning from masses of data and finding patterns. Second, compared with natural language, genetic language has a much smaller vocabulary (only five base types, as mentioned above); therefore, low-level semantic modeling is important for overall sequence comprehension, which researchers have explored extensively, as introduced below.
Genetic Language Tokenization
Early works express the sequences via one-hot encoding [82]: nucleotide-level features can be captured by converting the sequences into 2D binary matrices. Based on the tokenized results, convolutional layers and sequence learning modules such as LSTMs are applied to obtain the final distributed representations [34]. More researchers use the k-mer (substrings of k monomers contained within a biological sequence) tokenizer to take co-occurrence information into account. In other words, the encoding of each position in the gene sequence is considered together with the preceding and following positions (a sliding window with a total length of k) [63]. Other methods such as byte pair encoding [80] have also proven useful.
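A minimal sketch of the two tokenization schemes described above: position-wise one-hot encoding, which turns a length-n DNA sequence into an n x 4 binary matrix, and overlapping k-mer tokenization with a sliding window.

```python
def one_hot(sequence, alphabet="ACGT"):
    """Early-style encoding: each position becomes a binary vector,
    so a length-n sequence yields an n x 4 binary matrix."""
    index = {base: i for i, base in enumerate(alphabet)}
    matrix = []
    for base in sequence:
        row = [0] * len(alphabet)
        row[index[base]] = 1
        matrix.append(row)
    return matrix

def kmer_tokenize(sequence, k=3):
    """Slide a window of length k over the sequence so each token
    carries local co-occurrence context (overlapping k-mers)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
```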
Genetic Sequence Representation
Shallow models can hardly process long sequences, which may have thousands of base pairs, while Transformers [88] capture long-distance dependencies quite well thanks to their attention modules. Further, self-supervised pre-training of Transformers has proven effective on genetic language as well [39]. Besides, improved versions of Transformers have been implemented and achieve good performance on DNA tasks. For instance, Enformer [5] is designed to enlarge the receptive field: borrowing ideas from computer vision, deep convolutional layers expand the region that each neuron can process. Enformer replaces the base Transformer layer with seven convolutional layers to capture low-level semantic information. The captured features are fed into 11 Transformer layers and processed by separately trained organism-specific heads. Experimental results show that Enformer improves gene expression prediction, variant effect prediction, and mutation effect prediction.
Protein Language
Protein sequence processing has a lot in common with genetic sequence processing. There are altogether 20 types of amino acids in the human body, so protein language is also a special language with low readability and a small vocabulary. We again discuss basic tasks and methods first and then introduce a representative work in protein sequence processing.
Basic Tasks for Protein Sequence Processing
Predicting the sequence specificity of DNA- and RNA-binding proteins [2] is a basic task of concern, because RNA sequences are translated to obtain amino acid sequences and the two types of sequences are highly related. Moreover, spatial structure analysis is another unique and important task for protein sequences, since a protein's quaternary structure determines its properties and functions.
We have introduced the similarity of genetic and protein language, which allows most genetic sequence processing methods to be adapted to proteins. However, there are also some special methods for protein sequence processing. A significant fact is that structural and functional similarities exist between homologous protein sequences, which can help supervise protein representation learning. By contact prediction and pairwise comparison, we can conduct multi-task training of protein sequence distributed representations [8] and conversely assist spatial structure prediction.
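As a toy illustration of pairwise comparison between homologous sequences, the sketch below computes the identity of two equal-length protein sequences. Real pipelines align the sequences first (e.g., by building an MSA), and the sequences here are made up for illustration.

```python
def sequence_identity(a, b):
    """Fraction of aligned positions sharing the same amino acid, a crude
    proxy for homology. Assumes the sequences are already aligned and
    of equal length (real pipelines handle gaps and insertions)."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Two hypothetical homologous fragments differing at one position.
identity = sequence_identity("MKTAY", "MKSAY")
```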
Landmark Work for Protein Spatial Structure Analysis
AlphaFold [43], proposed by DeepMind, has achieved a breakthrough in highly accurate protein structure prediction and became the champion of the Critical Assessment of protein Structure Prediction (CASP) challenge. The system incorporates multiple sequence alignment (MSA) [11] templates and pairwise information for the protein sequence representation. It is built on a variant of Transformers named Evoformer: the column and row attention of MSA sequences and pair representations are fed into Evoformer blocks, and peptide bond angles and distances are then predicted by subsequent modules. The interfaces and tools for interacting with AlphaFold are well developed, and it is easy for users without an AI background to master. This reflects the essence of interdisciplinary research: division of labor and cooperation to improve efficiency.
Besides, it is worth mentioning that the initial results generated by AlphaFold can be further improved with the help of molecular dynamics knowledge. Incorporating domain knowledge also shows its effectiveness in some other scenarios, such as using chemical reaction templates for retrosynthesis learning [28]. Overall, the combination of professional knowledge and data-driven deep learning is getting better results, which is an important development trend for biomedical NLP.
Chemical Language
Apart from biological sequences, chemical substances (especially small molecules) can also be encoded into molecular representations, which help with property prediction and filtering. These representations play a role similar to molecular fingerprints (a commonly used abstract molecular representation that converts the molecular structure into a binary sequence by checking whether specific substructures exist).
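As a rough illustration of the fingerprint idea, the sketch below sets one bit per pattern by naive substring matching over a SMILES string. This is only a caricature: real fingerprints such as Morgan/ECFP match subgraphs of the molecular graph (typically via toolkits like RDKit), and the pattern list here is a hypothetical choice of ours.

```python
def toy_fingerprint(smiles, patterns):
    """Very naive sketch: one bit per pattern, set to 1 if the pattern
    string occurs in the SMILES. Real fingerprints match substructures
    of the molecular graph, not substrings of the linear text."""
    return [1 if p in smiles else 0 for p in patterns]

# Hypothetical pattern list: carboxyl-like fragment, benzene ring, nitrogen.
patterns = ["C(=O)O", "c1ccccc1", "N"]

# Aspirin's SMILES contains the first two fragments but no nitrogen.
bits = toy_fingerprint("CC(=O)Oc1ccccc1C(=O)O", patterns)
```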
Early Fashions for Chemical Substance Representation
In the early days of applying machine learning to assist the prediction of molecular properties, molecule descriptors such as nuclear charges and atomic positions are provided for nonlinear statistical regression [77]. Essentially, people still need to manually select features for the molecule descriptors. To alleviate the labor of manual feature engineering, data-driven deep learning systems have gradually become the main approach for the analysis of molecules.
We classify current deep learning systems for chemical substance representation according to how the chemical substances are expressed, for which there are several common methods, as shown in Fig. 12.4.
Graph Representations
One of the clearest expressions is the 2D or 3D topology diagram [23, 45] describing the inner chemical structure of molecules, which naturally corresponds to the essential elements of graphs. In molecular graphs, nodes represent atoms, and edges represent connections (chemical bonds, hydrogen bonds, van der Waals forces, etc.). Graph representation learning bridges chemical expression and machine learning [95]; we have introduced graph representation learning in detail in Chap. 6. Graph Transformer [98], for example, is currently one of the most popular approaches in molecular graph representation learning [76]. With graph representation learning methods, we can tackle the two main tasks of molecular processing: molecular graph understanding, which captures the topological information of molecular structures to predict properties [45], and molecular graph generation, which assists drug discovery and refinement [59]. Overall, graph representation learning has proven to be an effective approach to chemical analysis.
Linear Text and Other Representations
There are also other solutions for expressing chemical substances. For example, linear text such as the structural formula, structural abbreviation, and simplified molecular input line entry specification (SMILES) [79] can be adopted for chemical expression. The straightforward advantage of linear text expressions is that they can be fed directly into any NLP model. Although different from natural language text, the SMILES text expressing molecules and chemical reactions can also be processed by Transformer-based models, provided that specially designed tokenizers [50] and pre-training tasks [90] are used. Nevertheless, linear text loses some structural information, and 2D topological and 3D spatial hints remain important: atom coordinates computed from SMILES help improve the performance of SMILES processing models [93], which suggests that domain knowledge (e.g., molecular 3D structure) can enhance NLP models when processing biomedical materials.
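A common way to build such a specially designed tokenizer is a regular expression that keeps bracket atoms and two-letter elements (Cl, Br) intact instead of splitting SMILES character by character. The simplified sketch below follows that idea; the exact pattern is an illustrative assumption, not the one used in any particular paper.

```python
import re

# Simplified regex-based SMILES tokenizer: bracket atoms and two-letter
# elements (Br, Cl) must be matched before single letters.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOSPFI]|[bcnops]|[()=#+\-./\\:@?>~*$]|\d"
)

def tokenize_smiles(smiles):
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Sanity check: tokenization must be lossless.
    assert "".join(tokens) == smiles, "untokenizable characters present"
    return tokens
```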
Summary
Apart from substances related to the central dogma, there exist other types of special materials in the biomedical domain, such as image data and numeric data. The former, including molecule images and medical magnetic resonance images [58], can be automatically processed by AI systems to some extent. The latter, such as continuously monitored health data, is also processed with NLP methods adapted to the biomedical domain [94]. In summary, the materials to be processed come in versatile forms, and deep learning methods have already achieved satisfying performance on many of them. Further, to achieve deep comprehension and precise capture of biomedical knowledge, we believe that adaptive and universal processing of various materials will gradually become the trend in biomedical NLP research.
12.3 Knowledge-Guided Biomedical NLP
We have already discussed the development and basic characteristics of biomedical knowledge representation and acquisition. Conversely, domain knowledge can guide and enhance biomedical NLP systems to better finish those knowledge-intensive tasks. Though the commonsense and facts in the general domain can be learned in a self-supervised manner, the biomedical knowledge we use to guide the systems is more professional and has to be additionally introduced. The guidance from domain knowledge bases can even assist human experts and help improve their performances, let alone the biomedical NLP systems. In this section, we introduce the basic ideas and representative work for knowledge-guided biomedical NLP, according to the four types of methods mentioned in Chap. 9: input augmentation, architecture reformulation, objective regularization, and parameter transfer.
12.3.1 Input Augmentation
To guide neural networks with biomedical knowledge, one simple solution is to directly provide the knowledge as input augmentation for the systems. Different sources of knowledge can augment the input, as we introduce below. One mainstream source is the biomedical knowledge graph (KG), which contains human knowledge and facts organized in a structured form. Besides, knowledge may also come from linguistic rules, experimental results, and other unstructured records. The key problem for input augmentation is to select helpful information, encode it, and fuse it with the original input.
Encoding Knowledge Graph
Information from professional KGs is of high quality and suitable for guiding models in downstream tasks. Usually, we rely on basic entity recognition and linking tools to select the subgraphs or triplets from KGs that are related to the current context and further finish more sophisticated tasks such as reading comprehension and information extraction. We now give three instances: (1) Improving word embeddings with the help of KGs. Graph representation learning approaches like GCN-based methods can get better-initialized embeddings for the link prediction task based on biomedical KGs [3]. (2) Augmenting the inputs with knowledge. Models such as the hybrid Transformers can encode token sequences and triplet sequences at the same time and incorporate the knowledge into the raw text [6]. (3) Mounting the knowledge by extra modules. Extra modules are designed to encode the knowledge, such as a graph-based network encoding KG subgraphs to assist biomedical event extraction [36]. As shown in Fig. 12.5, the related terms in the UMLS ontology are parsed and form a subgraph, which is encoded and concatenated into the hidden layer of the SciBERT text encoder to assist event trigger and type classification. There also exist other examples, such as the separate KG encoder providing entity embeddings for the lexical layers in the original Transformer [27] and the KG representations trained by TransE being attached to the attention layers [14].
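The second strategy above, augmenting the input with knowledge, can be sketched in its simplest text-only form: retrieved triplets are linearized into text and appended to the raw input so that an ordinary text encoder sees both. The separator tokens and linearization format are our own illustrative choices, not a fixed standard.

```python
def augment_with_triplets(text, triplets, max_triplets=3):
    """Serialize KG triplets as text and append them to the raw input.
    The [KG] and [SEP] markers are illustrative separator tokens; a real
    system would align them with the encoder's special-token vocabulary."""
    serialized = " [SEP] ".join(
        f"{head} {relation} {tail}" for head, relation, tail in triplets[:max_triplets]
    )
    return f"{text} [KG] {serialized}" if serialized else text

# Hypothetical example: a sentence plus one retrieved triplet.
augmented = augment_with_triplets(
    "Aspirin relieves pain.",
    [("aspirin", "treats", "headache")],
)
```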
Encoding Other Information
Apart from KG information, other types of knowledge have proven helpful. Syntactic information, for example, is a significant part of linguistic knowledge. Though not part of biomedical expert knowledge, syntactic information can also be provided as augmented input to better analyze sentences, recognize entities, and so on [86]. For non-textual material processing tasks, such as discovering the relationship between basal gene expression and drug response, researchers believe that experimentally verified prior knowledge, including protein and genetic interactions, is important; this information can be concatenated with the original input substances to obtain representations, demonstrating the effectiveness of input augmentation [25]. Overall, introducing extra knowledge usually does no harm to performance at worst, while we still need to decide whether the knowledge is related and helpful to the specific task, through human experience or automatic filtering.
12.3.2 Architecture Reformulation
Human prior knowledge is sometimes reflected in the design of model architectures, as we have mentioned in the representation learning of biomedical data. This is especially significant when we process domain-specific materials, such as the substances introduced in the last section. After all, the backbone models are designed for general materials (e.g., natural language documents, natural images), which may differ remarkably from biomedical substances. Here we analyze two examples in detail: Enformer [5] and MSA Transformer [75].
Enformer is an adaptation of the Transformer architecture for DNA sequences, and we provide the model architecture in Fig. 12.6. The general idea of this model has already been introduced in our discussion of genetic sequences. Here we take a look at two designs in Enformer that help the model better capture low-level semantic information in extremely long genetic sequences, which is of key importance for high-level sequence analysis. First, Enformer emphasizes relative position information, carefully selects the relative positional encoding basis functions, and uses a concatenation of exponential, gamma, and central mask encodings. Second, convolutional layers are applied to capture low-level features, enlarging the receptive field and greatly expanding the number of relevant enhancers seen by the model.
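The basis-function idea behind this relative positional encoding can be illustrated with a minimal sketch; the concrete scales and parameterizations below are illustrative choices, not the values used in Enformer itself:

```python
import math

def exponential_basis(r, half_life):
    # decays smoothly with distance; half_life controls the scale
    return math.exp(-math.log(2.0) * abs(r) / half_life)

def central_mask_basis(r, width):
    # indicator feature: is the position within a local window?
    return 1.0 if abs(r) <= width else 0.0

def gamma_basis(r, k, theta):
    # gamma-shaped bump that peaks at an intermediate distance
    x = abs(r)
    return (x ** (k - 1)) * math.exp(-x / theta) / (math.gamma(k) * theta ** k)

def relative_position_features(r):
    # concatenate several basis functions at different scales, mirroring
    # Enformer's mix of exponential, gamma, and central-mask encodings
    feats = [exponential_basis(r, h) for h in (8, 64, 512)]
    feats += [central_mask_basis(r, w) for w in (4, 32)]
    feats += [gamma_basis(r, k, 16.0) for k in (2.0, 4.0)]
    return feats
```

The resulting feature vector for each relative offset is projected and added to the attention logits, letting the model reason about both nearby and distant regulatory elements.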
When discussing AlphaFold, we have mentioned the significance of MSA information. Inspired by this idea, MSA Transformer is proposed to process multiple protein sequences simultaneously. The model architecture is shown in Fig. 12.7. A standard Transformer conducts attention calculations separately for each sequence. However, different sequences in the same protein family share information, including the co-evolution signal. MSA Transformer therefore introduces column attention, complementing the row attention over each sequence, and is trained with a variant of masked language modeling across different protein families. Experimental results show that MSA Transformer performs markedly better than models processing only single sequences, and this has become a basic paradigm for processing protein sequences.
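The row-then-column attention pattern can be sketched as follows; this toy version omits learned projections, multiple heads, and the row-wise weight tying used by the real MSA Transformer:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(seq):
    """Plain dot-product self-attention over a list of vectors
    (queries = keys = values, no learned projections, for illustration)."""
    d = len(seq[0])
    out = []
    for q in seq:
        scores = softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                          for k in seq])
        out.append([sum(w * v[i] for w, v in zip(scores, seq))
                    for i in range(d)])
    return out

def msa_attention_block(msa):
    """One simplified row-then-column attention pass over an MSA:
    msa[i][j] is the embedding of residue j in aligned sequence i."""
    # row attention: positions within each sequence exchange information
    msa = [attend(row) for row in msa]
    # column attention: the same alignment column across sequences
    # exchanges information (this is where the co-evolution signal flows)
    cols = [attend([msa[i][j] for i in range(len(msa))])
            for j in range(len(msa[0]))]
    return [[cols[j][i] for j in range(len(cols))] for i in range(len(msa))]
```

Stacking such blocks lets every residue embedding depend on both its sequence context and its alignment column across the family.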
12.3.3 Objective Regularization
Formalizing new tasks from extra knowledge can change the optimization target and guide models to perform better on the target task. In the biomedical domain, there are plenty of ready-made tasks that, once chosen carefully, can be adopted for objective regularization, so we do not need to specially formalize new tasks. Usually, we conduct multi-task training in the downstream adaptation period. Some researchers also explore objective regularization in the pre-training period, where PTMs learn the knowledge contained in multiple pre-training tasks. We will give examples of these two modes and conduct a comparative analysis.
Multi-task Adaptation
The introduced tasks can be the same as or slightly different from the target task. For the former, we usually collect several datasets (possibly differently distributed or in various language styles) for the same task. For instance, a biomedical NER model can share parameters across datasets while keeping separate output layers to deal with the style gap [12, 20]. When ready-made datasets are unavailable, KGs can help generate more silver data for training, such as utilizing the KG shortest dependency path for relation extraction augmentation [84]. Further, different tasks can also benefit each other, as with the several language understanding tasks (biomedical NER, sentence similarity, and relation extraction) in the BLUE benchmark [71]. Similarly, when dealing with non-textual biomedical materials, we can conduct multi-task adaptation to require the models to understand different properties of the same substances. For example, the molecular encoder reads SMILES strings and learns comprehensive capability on five different molecule property classification tasks [53], where the knowledge in these tasks helps improve the performance of each other.
Multi-task Pre-training
Pre-training itself is a knowledge-guided method, which we will discuss in the next subsection. When it comes to multi-task pre-training, with the knowledge of KGs or expert ontologies, we can create extra data and conduct knowledgeable pre-training tasks. The domain-specific PTMs we have mentioned, such as SciBERT and BioBERT, simply keep the masked language modeling training strategy. To introduce more knowledge, biomedical PTMs with specially designed pre-training tasks are proposed. One instance is masked entity prediction: e.g., MC-BERT [100] is trained with Chinese medical entities and phrases masked, instead of randomly picked characters, with the assistance of biomedical KGs. Another instance is entity detection and linking: e.g., KeBioLM [97] annotates a large-scale corpus with the SciSpacy [66] tool and introduces entity-oriented tasks during pre-training, which essentially integrates the entity understanding capabilities of the annotation tool. PTMs enhanced by extra pre-training tasks usually show much better performance on the corresponding downstream tasks. In short, multi-task pre-training implicitly injects knowledge from the KGs/ontologies or the ready-made annotation tools, which helps improve the capability of the PTMs in the related aspects.
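Whole-entity masking can be sketched in a few lines, assuming entity spans have already been located by linking against a KG; MC-BERT's actual sampling procedure differs in detail, so the masking ratio and span sampling below are illustrative:

```python
import random

def mask_entities(tokens, entity_spans, mask_token="[MASK]", ratio=0.15):
    """Whole-entity masking: instead of masking random tokens,
    mask every token of a sampled subset of KG-linked entity spans."""
    masked = list(tokens)
    targets = {}
    n_to_mask = max(1, int(len(entity_spans) * ratio))
    for start, end in random.sample(entity_spans, n_to_mask):
        for i in range(start, end):
            targets[i] = masked[i]   # remember the original token as the label
            masked[i] = mask_token
    return masked, targets
```

The model is then trained to recover the original tokens at the masked positions, forcing it to predict complete biomedical entities rather than isolated characters.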
Comparing the two approaches above, we can find that multi-task adaptation is a more direct way to change the optimization target for the target task, and therefore the introduced datasets have to be of high quality and highly related to the target data. In contrast, the requirement for multi-task pre-training is less stringent, since pre-training is conducted on a sufficiently large corpus that is insensitive to small disturbances; however, the benefit of the extra pre-training tasks is also less explicit and pronounced than that of multi-task adaptation.
12.3.4 Parameter Transfer
One of the most common paradigms of transfer learning is pre-training followed by fine-tuning. In this way, data-driven deep learning systems can be applied to specific domains that may lack annotated data, since the knowledge learned from the source domain corpus/tasks can help improve performance on the target domain tasks. Taking PTMs as an example, they transfer commonsense, linguistic knowledge, and other useful information from the large-scale pre-training corpus to the downstream tasks. We now discuss two types of parameter transfer: between different data domains and between tasks.
Cross-Domain Transfer
Models pre-trained in general domains are frequently transferred to the biomedical domain, and two of the most common scenarios are the processing of natural language documents and images. For example, a model pre-trained on ImageNet can better understand medical images and perform melanoma screening [61]. Compared with randomly initialized models, PTMs such as BERT can also achieve satisfactory performance when fine-tuned on biomedical text processing datasets.
Nevertheless, now that more biomedical corpora have been collected, we no longer have to rely on general-domain pre-training. Experimental results have shown that domain-specific pre-training yields more obvious improvements than general-domain pre-training [64]. In fact, each domain may have its own characteristics; for example, some empirical results in the biomedical domain show that pre-training from scratch outperforms continual pre-training of general-domain PTMs [32], which is contrary to popular belief and awaits further exploration.
Cross-Task Transfer
Models can be tuned on other tasks or styles of data before being transferred to the target task data, so that knowledge learned from other tasks is contained in the initialized parameters. In the biomedical domain, the high cost of data annotation limits the scale of golden samples labeled by human experts. Some methods can generate large-scale silver datasets automatically, such as distant supervision, which assumes that a piece of text/image expresses an already-known relation as long as both the head and tail entities appear in it. Directly changing the optimization target with such noisy data is sometimes too rigid; instead, cross-task transfer utilizes the knowledge of the introduced task more softly. Pre-training on silver-standard corpora and then tuning on golden-standard datasets has proved effective [29]. Another example is cross-species transfer learning: since the underlying biological laws of different species are similar, biological data from other species can be used for pre-training before fine-tuning with data from the target species, achieving higher accuracy for DNA sequence site prediction [52, 56].
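The distant supervision heuristic can be sketched in a few lines; the matching here is simple substring lookup, whereas practical systems rely on entity recognition and linking to align mentions with KG entities:

```python
def distant_supervision(sentences, kg_triples):
    """Label a sentence with relation r whenever both the head and tail
    entities of a known triple (h, r, t) appear in it (a noisy heuristic)."""
    silver = []
    for sent in sentences:
        lowered = sent.lower()
        for head, rel, tail in kg_triples:
            if head.lower() in lowered and tail.lower() in lowered:
                silver.append((sent, head, rel, tail))
    return silver

kg = [("metformin", "treats", "type 2 diabetes")]
sents = ["Metformin is widely prescribed for type 2 diabetes.",
         "Metformin was first described in 1922."]
print(distant_supervision(sents, kg))
```

Only the first sentence is labeled, since the second mentions the head entity but not the tail; the resulting silver corpus is noisy, which is exactly why it is better used for pre-training than as the final optimization target.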
Summary
To sum up, knowledge-guided NLP methods are widely used in biomedical tasks. Parameter transfer, for example, can be easily conducted, has proven useful in various scenarios, and has become an essential paradigm. For textual material processing, the structured biomedical expert knowledge in KGs is suitable for providing augmented input and designing better objective functions. For non-textual material processing, architecture reformulation is usually necessary due to the differences in data characteristics between the various forms of raw materials. Some special materials naturally provide clues for objective regularization, such as the multiple properties of a given molecule. The satisfying performance achieved by the above methods inspires us to emphasize the significance of knowledge-guided biomedical NLP.
12.4 Typical Applications
In this section, we explain the practical significance of biomedical knowledge representation learning through three specific application scenarios. Literature processing is a typical scenario for biomedical natural language material processing, while retrosynthetic prediction focuses more on biomedical language (chemical language) material processing. Both applications belong to AI for science, attempting to search a large space and collect useful information to improve the efficiency of human researchers. We then discuss diagnosis assistance, which is of high practical value in daily life.
12.4.1 Literature Processing
The size of the biomedical literature is expanding rapidly, and it is hardly possible for researchers to keep pace with every aspect of biomedical knowledge development. We provide an example of a literature processing pipeline in Fig. 12.8 to show how biomedical NLP helps improve our efficiency. We divide the pipeline into four stages: literature screening, information extraction, question answering, and result analysis.
Literature Screening
In a typical academic search process, we first screen the mass of literature returned by the search engine. We require the information retrieval model to return a relevance ranking according to the query conditions, which may describe the type and age limit of the documents, the entities or relation pairs we are concerned about, and other details. Echoing the importance of biomedical terms mentioned earlier, the document representations in biomedical information retrieval models sometimes emphasize the key biomedical terms in the documents and queries for better matching [1].
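The idea of emphasizing key biomedical terms can be illustrated with a toy lexical scoring function; real biomedical retrieval models such as [1] learn dense representations rather than counting overlaps, so the vocabulary set and boost factor below are purely illustrative:

```python
def relevance_score(query_terms, doc_terms, biomedical_vocab, boost=2.0):
    """Toy lexical matching score: overlapping terms count once,
    but terms from a biomedical vocabulary (e.g., UMLS concepts)
    receive a higher weight."""
    score = 0.0
    doc = set(t.lower() for t in doc_terms)
    for term in query_terms:
        t = term.lower()
        if t in doc:
            score += boost if t in biomedical_vocab else 1.0
    return score
```

Documents are then ranked by this score, so a match on a domain concept like a drug name contributes more than a match on a common function word.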
Information Extraction
We have already introduced some significant tasks for biomedical information extraction, such as term recognition, linking, and relation extraction. After obtaining the targeted literature by screening, we mine the text, extract the useful information, and convert it into a structured form, just as in those extraction tasks. This stage usually relies on knowledge-transferred PTMs to read and understand the long documents.
Result Analysis and Question Answering
We may also care about advanced meta-relations between the extracted structured knowledge items or facts. An example is performing a meta-analysis of clinical randomized controlled trials [4], which provides some of the most convincing evidence in evidence-based medicine. The process of inductively analyzing the results of different trials does not necessarily need to be fully automated; rather, we expect the AI system to help with quality assessment and conclusion highlighting, thereby largely improving our efficiency. Based on the analysis results, we may even get assistance from conversation systems that generate reasonable responses to medical questions and provide effective suggestions for further research.
12.4.2 Retrosynthetic Prediction
Organic synthesis is an essential part of modern organic chemistry and plays an important role in drug discovery, material science, and other fields. To design synthetic routes for target molecules more efficiently, AI systems are applied to read chemical reactions, as in the reaction classification task. Further, we expect the systems to achieve a deep comprehension of the reactions and therefore generate single-step reaction predictions. Eventually, the multi-step retrosynthesis task, i.e., reasoning about synthetic routes for a given target product, can also be finished automatically with the help of extra information from knowledge bases or ontologies.
Chemical Reaction Classification
Machine learning methods can help researchers analyze large-scale reaction records and summarize useful reaction templates, which are a significant form of chemical knowledge [18]. These templates can further guide human researchers or AI systems in designing synthetic routes.
Single-Step Reaction Prediction
In recent years, models such as Transformers have been pre-trained on large-scale reaction corpora and proven effective at predicting single-step reactions without the guidance of templates [91].
Multi-step Reaction Prediction
For predicting multi-step reactions, most current methods search for reasonable routes based on the already-known reaction knowledge in knowledge bases [13]. With the development of biomedical deep learning models, we may also explore end-to-end generation for multi-step retrosynthesis in the future, as shown in Fig. 12.9. Specifically, the heuristic route-search algorithm, the knowledge base queries, and other operations may all be performed by unified models guided by chemical knowledge.
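The route-search step can be sketched as a naive breadth-first backward search over known reactions; real systems such as Retro* [13] instead use neural-guided best-first search, and the molecule and reaction entries here are illustrative:

```python
from collections import deque

def plan_route(target, known_reactions, building_blocks, max_depth=5):
    """Breadth-first backward search: expand the target with known
    single-step reactions (product -> precursors) until every leaf is a
    purchasable building block. Returns the list of applied steps, or
    None if no route is found within max_depth steps."""
    queue = deque([(frozenset([target]), [])])
    seen = {frozenset([target])}
    while queue:
        frontier, steps = queue.popleft()
        open_mols = [m for m in frontier if m not in building_blocks]
        if not open_mols:
            return steps                      # all leaves are purchasable
        if len(steps) >= max_depth:
            continue
        mol = open_mols[0]
        for precursors in known_reactions.get(mol, []):
            new_frontier = frozenset((frontier - {mol}) | set(precursors))
            if new_frontier not in seen:
                seen.add(new_frontier)
                queue.append((new_frontier, steps + [(mol, precursors)]))
    return None
```

The exponential growth of this search space is precisely why learned policies and value functions are used in practice to prioritize promising expansions.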
12.4.3 Diagnosis Assistance
There exists a huge demand for diagnosis assistance. The scarce medical resources in some areas call for AI systems to provide patients with auxiliary knowledge for simple daily situations. This can reduce the pressure on medical resources and improve the work efficiency of hospital systems.
We first take a look at several basic tasks in diagnosis assistance. The most practical application is automatic triage: the system is fed with symptom descriptions from patients and predicts the suitable clinic, which is essentially a disease classification problem. A similar task is medicine prescription, which requires processing more complex diagnostic information (including the text of complaints, quantified findings, and even images) and providing advice with the aid of medical knowledge. Further, the doctor-patient conversation is a challenging task due to the gap between the colloquial style of patients and the standard terms and structured items in KGs. The system must first recognize the key information, link it to standard terms, and then provide correct and helpful knowledge with good interpretability and readability.
Since safety is critical in medical care, assistance systems have to be supported by abundant knowledge and provide explainable suggestions. Incorporating knowledge representations with text representations achieves significantly better performance on the diagnosis assistance task [51].
12.5 Advanced Topics
We have introduced the current developments in biomedical knowledge representation learning. Several consensuses have formed in biomedical NLP, from which we can draw inspiration about future trends. We have discussed the significance of high-quality training data for current deep learning biomedical systems; data scarcity motivates research in two directions: guiding models with knowledge so that they can adapt with little data, and incorporating different data forms from multiple sources. Besides, the black-box property of deep learning systems brings challenges for domain research, since biomedical applications are closely related to human life and emphasize safety and ethical justification. Next, we elaborate on these two solution paths and this main concern.
Knowledgeable Warm Start
There is a term in the field of recommendation algorithms called the cold-start [78] problem, which describes impaired performance when user history is lacking. Extended to other deep learning applications such as biomedical NLP, we also face the cold-start challenge under scarce-data scenarios and often alleviate the problem with the help of transfer learning or other methods. For biomedical NLP tasks, data annotation is difficult, and we often have few supervision signals for model training. Therefore, it becomes all the more important to achieve a warm start for training biomedical NLP systems.
As mentioned above, knowledge can guide deep learning systems in several different ways even when data are comparatively plentiful, such as biomedical PTMs transferring linguistic and commonsense knowledge to help achieve a warm start. For low-resource scenarios, there have been a few explorations. Knowledge-aware modules, such as a self-attention layer introducing external KGs, are designed for biomedical few-shot learning [96]. Special tuning strategies such as entity-aware masking have also been applied and proved effective in low-resource settings [72]. Still, the knowledgeable warm start problem is rarely discussed in a targeted manner, or even clearly raised, although it is prevalent in biomedical NLP tasks. We believe that it deserves more attention and research.
Cross-Modal Knowledge Processing
Though the annotated datasets are small-scale, we have various forms of biomedical data that are linked to each other by biomedical knowledge. Apart from regular cross-modal tasks (about which we can learn more details in Chap. 7) such as medical image captioning, other types of materials can also be processed in versatile ways. For example, natural language and chemical language can describe the same chemical entities and may provide complementary information from different perspectives. KV-PLM [99] has shown that the connections between natural language descriptions and molecular structures can be modeled in an unsupervised manner through pre-training (Fig. 12.10). It can even surpass human professionals on the molecular property comprehension task, revealing its potential in drug discovery. Follow-up works further incorporate other materials, such as molecular graphs, with the text [83].
Different expressions for biomedical terms have diverse emphases. Bridging them together and capturing the mapping relations between various data forms through a large number of observations, just as humans do, is a form of meta-knowledge learning that enables a deeper understanding of terms while alleviating data scarcity. As long as we can design tokenizers to handle different structures uniformly, the advantages of data-driven deep learning systems can be carried forward.
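For chemical language, such a tokenizer can be as simple as a regular expression over SMILES strings; the pattern below is an illustrative sketch adapted from commonly used SMILES tokenization regexes and does not cover every SMILES feature:

```python
import re

# split SMILES into chemically meaningful tokens: bracketed atoms,
# two-letter elements, single atoms, bonds, branches, and ring indices
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFIbcnosp]|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize_smiles(smiles):
    tokens = SMILES_PATTERN.findall(smiles)
    # a lossless tokenization must reconstruct the input exactly
    assert "".join(tokens) == smiles, "unrecognized characters in SMILES"
    return tokens

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))   # aspirin
```

Note the ordering inside the alternation: `Br` and `Cl` must precede the single-atom class so that chlorine is not split into a carbon token followed by an invalid `l`.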
Interpretability, Privacy, and Ease of Use
There exist some other concerns about biomedical NLP. The first is the interpretability problem, which we have discussed in Chap. 8. Most deep learning systems are black boxes with poor interpretability, and this leads to distrust of automated decision-making, especially in medical scenarios closely related to human lives. Directly predicting a prescription without providing symptom analysis and disease diagnosis makes it hard for users to assess the credibility of the recommendations. This concerns not only safety but also ethical problems, including accident liability determination. Some researchers already focus on the interpretability of biomedical NLP due to its importance [60].
The second is the privacy problem. The ethical controversy over privacy always arises when we talk about AI development. For example, the genetic sequence training data of deep learning models may be leaked through privacy attacks, and the genetic traits and disease information of system users may be illegally sold. Methods such as private aggregation of teacher ensembles can alleviate the privacy leakage problem [69], but more effort is still needed to solve it.
Thirdly, as assistance tools for domain research, biomedical NLP systems should be designed to be as easy to use as possible. Some toolkits and online demos have been developed [103], but most of them still place quite high demands on users' devices and programming skills. There is a huge market for user-friendly platforms, and we hope the AI community will implement useful aids as soon as possible.
12.6 Summary and Further Readings
In this chapter, we discuss representation learning for biomedical NLP. As an emerging interdisciplinary field, biomedical NLP has undergone rapid development in recent years, especially since deep learning methods such as PTMs appeared. We first introduce knowledge representation and acquisition for biomedical materials, including natural language text and other materials, where the latter adapts advanced NLP algorithms and models to biomedical scenarios. Further, we explain the knowledge-guided methods in the biomedical domain from four aspects: input augmentation, architecture reformulation, objective regularization, and parameter transfer. Future directions in this field have also been discussed.
For further understanding of biomedical knowledge representation learning, we recommend reading surveys of the early works [62] and the comprehensive analysis of PTMs [32], which represents recent results.
References
Maristella Agosti, Stefano Marchesin, and Gianmaria Silvello. Learning unsupervised knowledge-enhanced representations to reduce the semantic gap in information retrieval. ACM Transactions on Information Systems (TOIS), 38(4):1–48, 2020.
Babak Alipanahi, Andrew Delong, Matthew T Weirauch, and Brendan J Frey. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology, 33(8):831–838, 2015.
Mona Alshahrani, Maha A Thafar, and Magbubah Essack. Application and evaluation of knowledge graph embeddings in biomedical data. PeerJ Computer Science, 7:e341, 2021.
Ashwin Karthik Ambalavanan and Murthy V Devarakonda. Using the contextual language model BERT for multi-criteria classification of scientific articles. Journal of Biomedical Informatics, 112:103578, 2020.
Žiga Avsec, Vikram Agarwal, Daniel Visentin, Joseph R Ledsam, Agnieszka Grabska-Barwinska, Kyle R Taylor, Yannis Assael, John Jumper, Pushmeet Kohli, and David R Kelley. Effective gene expression prediction from sequence by integrating long-range interactions. Nature Methods, 18(10):1196–1203, 2021.
Helena Balabin, Charles Tapley Hoyt, Colin Birkenbihl, Benjamin M Gyori, John Bachman, Alpha Tom Kodamullil, Paul G Plöger, Martin Hofmann-Apitius, and Daniel Domingo-Fernández. STonKGs: a sophisticated transformer trained on biomedical text and knowledge graphs. Bioinformatics, 38(6):1648–1656, 2022.
Iz Beltagy, Kyle Lo, and Arman Cohan. SciBERT: A pretrained language model for scientific text. In Proceedings of EMNLP-IJCNLP, 2019.
Tristan Bepler and Bonnie Berger. Learning protein sequence embeddings using information from structure. In Proceedings of ICLR, 2018.
Olivier Bodenreider. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(suppl_1):D267–D270, 2004.
Bruce Buchanan, Georgia Sutherland, and Edward A Feigenbaum. Heuristic DENDRAL: A program for generating explanatory hypotheses. Organic Chemistry, 1969.
Humberto Carrillo and David Lipman. The multiple sequence alignment problem in biology. SIAM Journal on Applied Mathematics, 48(5):1073–1082, 1988.
Zhaoying Chai, Han Jin, Shenghui Shi, Siyan Zhan, Lin Zhuo, and Yu Yang. Hierarchical shared transfer learning for biomedical named entity recognition. BMC Bioinformatics, 23(1):1–14, 2022.
Binghong Chen, Chengtao Li, Hanjun Dai, and Le Song. Retro*: learning retrosynthetic planning with neural guided A* search. In Proceedings of ICML, 2020.
Jing Chen, Baotian Hu, Weihua Peng, Qingcai Chen, and Buzhou Tang. Biomedical relation extraction via knowledge-enhanced reading comprehension. BMC Bioinformatics, 23(1):1–19, 2022.
Yanping Chen, Ying Hu, Yijing Li, Ruizhang Huang, Yongbin Qin, Yuefei Wu, Qinghua Zheng, and Ping Chen. A boundary assembling method for nested biomedical named entity recognition. IEEE Access, 8:214141–214152, 2020.
Kenneth Ward Church. Word2vec. Natural Language Engineering, 23(1):155–162, 2017.
Kevin Bretonnel Cohen and Dina Demner-Fushman. Biomedical natural language processing, volume 11. John Benjamins Publishing Company, 2014.
Connor W Coley, William H Green, and Klavs F Jensen. Machine learning in computer-aided synthesis planning. Accounts of Chemical Research, 51(5):1281–1289, 2018.
Nigel Collier, Chikashi Nobata, and Jun'ichi Tsujii. Extracting the names of genes and gene products with a hidden Markov model. In Proceedings of COLING, 2000.
Gamal Crichton, Sampo Pyysalo, Billy Chiu, and Anna Korhonen. A neural network multi-task learning approach to biomedical named entity recognition. BMC Bioinformatics, 18(1):1–14, 2017.
Francis Crick. Central dogma of molecular biology. Nature, 227(5258):561–563, 1970.
Lance De Vine, Guido Zuccon, Bevan Koopman, Laurianne Sitbon, and Peter Bruza. Medical semantic similarity with a neural language model. In Proceedings of CIKM, 2014.
David K Duvenaud, Dougal Maclaurin, Jorge Aguileraiparraguirre, Rafael Gomezbombarelli, Timothy D Hirzel, Alan Aspuruguzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Proceedings of NeurIPS, 2015.
Sean R Eddy. What is a hidden Markov model? Nature Biotechnology, 22(10):1315–1316, 2004.
Amin Emad, Junmei Cairns, Krishna R Kalari, Liewei Wang, and Saurabh Sinha. Knowledge-guided gene prioritization reveals new insights into the mechanisms of chemoresistance. Genome Biology, 18(1):1–21, 2017.
Hao Fei, Yafeng Ren, and Donghong Ji. Recognizing nested named entity in biomedical texts: A neural network model with multi-task learning. In Proceedings of BIBM, 2019.
Hao Fei, Yafeng Ren, Yue Zhang, Donghong Ji, and Xiaohui Liang. Enriching contextualized language model from knowledge graph for biomedical information extraction. Briefings in Bioinformatics, 22(3):bbaa110, 2021.
Michael E Fortunato, Connor W Coley, Brian C Barnes, and Klavs F Jensen. Data augmentation and pretraining for template-based retrosynthetic prediction in computer-aided synthesis planning. Journal of Chemical Information and Modeling, 60(7):3398–3407, 2020.
John M Giorgi and Gary D Bader. Transfer learning for biomedical named entity recognition with neural networks. Bioinformatics, 34(23):4087–4094, 2018.
Anne Glover. The 21st century: the age of biology. In OECD Forum on Global Biotechnology, Paris, 2012.
Mourad Gridach. Character-level neural network for biomedical named entity recognition. Journal of Biomedical Informatics, 70:85–91, 2017.
Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH), 3(1):1–23, 2021.
Maryam Habibi, Leon Weber, Mariana Neves, David Luis Wiegandt, and Ulf Leser. Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics, 33(14):i37–i48, 2017.
Hamid Reza Hassanzadeh and May D Wang. DeeperBind: Enhancing prediction of sequence specificities of DNA binding proteins. In Proceedings of BIBM, 2016.
Johannes M Heuckmann, Michael Hölzel, Martin L Sos, Stefanie Heynck, Hyatt Balke-Want, Mirjam Koker, Martin Peifer, Jonathan Weiss, Christine M Lovly, Christian Grütter, et al. ALK mutations conferring differential resistance to structurally diverse ALK inhibitors. Clinical Cancer Research, 17(23):7394–7401, 2011.
Kung-Hsiang Huang, Mu Yang, and Nanyun Peng. Biomedical event extraction with hierarchical knowledge graphs. In Findings of EMNLP, 2020.
Donna L Hudson and Maurice E Cohen. Neural networks and artificial intelligence for biomedical engineering. Wiley Online Library, 2000.
Hitoshi Iuchi, Taro Matsutani, Keisuke Yamada, Natsuki Iwano, Shunsuke Sumi, Shion Hosoda, Shitao Zhao, Tsukasa Fukunaga, and Michiaki Hamada. Representation learning applications in biological sequence analysis. Computational and Structural Biotechnology Journal, 19:3198–3208, 2021.
Yanrong Ji, Zhihan Zhou, Han Liu, and Ramana V Davuluri. DNABERT: pre-trained bidirectional encoder representations from Transformers model for DNA-language in genome. Bioinformatics, 37(15):2112–2120, 2021.
Zongcheng Ji, Qiang Wei, and Hua Xu. BERT-based ranking for biomedical entity normalization. AMIA Summits on Translational Science Proceedings, 2020:269, 2020.
Robin Jia, Cliff Wong, and Hoifung Poon. Document-level n-ary relation extraction with multiscale representation learning. In Proceedings of NAACL-HLT, 2019.
George Brooks Johnson and Peter H Raven. Biology: Principles & Explorations. Recording for the Blind & Dyslexic, 2007.
John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, 2021.
Hye Seung Jung, Byung-Soo Youn, Young Min Cho, Kang-Yeol Yu, Hong Je Park, Chan Soo Shin, Seong Yeon Kim, Hong Kyu Lee, and Kyong Soo Park. The effects of rosiglitazone and metformin on the plasma concentrations of resistin in patients with type 2 diabetes mellitus. Metabolism, 54(3):314–320, 2005.
Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of Computer-aided Molecular Design, 30(8):595–608, 2016.
Michael Krauthammer and Goran Nenadic. Term identification in the biomedical literature. Journal of Biomedical Informatics, 37(6):512–526, 2004.
Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240, 2020.
Haodi Li, Qingcai Chen, Buzhou Tang, Xiaolong Wang, Hua Xu, Baohua Wang, and Dong Huang. CNN-based ranking for biomedical entity normalization. BMC Bioinformatics, 18(11):79–86, 2017.
Jiao Li, Yueping Sun, Robin J. Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J. Mattingly, Thomas C. Wiegers, and Zhiyong Lu. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database, 2016, 05 2016.
Xinhao Li and Denis Fourches. SMILES pair encoding: a data-driven substructure tokenization algorithm for deep learning. Journal of Chemical Information and Modeling, 61(4):1560–1569, 2021.
Xuedong Li, Yue Wang, Dongwu Wang, Walter Yuan, Dezhong Peng, and Qiaozhu Mei. Improving rare disease classification using imperfect knowledge graph. BMC Medical Informatics and Decision Making, 19(5):1–10, 2019.
Zutan Li, Hangjin Jiang, Lingpeng Kong, Yuanyuan Chen, Kun Lang, Xiaodan Fan, Liangyun Zhang, and Cong Pian. Deep6mA: a deep learning framework for exploring similar patterns in DNA N6-methyladenine sites across different species. PLoS Computational Biology, 17(2):e1008767, 2021.
Sangrak Lim and Yong Oh Lee. Predicting chemical properties using self-attention multi-task learning based on SMILES representation. In Proceedings of ICPR, 2021.
Carolyn E Lipscomb. Medical subject headings (MeSH). Bulletin of the Medical Library Association, 88(3):265, 2000.
Fangyu Liu, Ehsan Shareghi, Zaiqiao Meng, Marco Basaldella, and Nigel Collier. Self-alignment pretraining for biomedical entity representations. In Proceedings of NAACL-HLT, 2021.
Quanzhong Liu, Jinxiang Chen, Yanze Wang, Shuqin Li, Cangzhi Jia, Jiangning Song, and Fuyi Li. DeepTorrent: a deep learning-based approach for predicting DNA N4-methylcytosine sites. Briefings in Bioinformatics, 22(3):bbaa124, 2021.
Zhiyuan Liu, Yankai Lin, and Maosong Sun. Representation Learning for Natural Language Processing. Springer, 2020.
Alexander Selvikvåg Lundervold and Arvid Lundervold. An overview of deep learning in medical imaging focusing on MRI. Zeitschrift für Medizinische Physik, 29(2):102–127, 2019.
Omar Mahmood, Elman Mansimov, Richard Bonneau, and Kyunghyun Cho. Masked graph modeling for molecule generation. Nature Communications, 12(1):1–12, 2021.
Sherin Mary Mathews. Explainable artificial intelligence applications in NLP, biomedical, and malware classification: a literature review. In Intelligent Computing - Proceedings of the Computing Conference, pages 1269–1292. Springer, 2019.
Afonso Menegola, Michel Fornaciali, Ramon Pires, Flávia Vasques Bittencourt, Sandra Avila, and Eduardo Valle. Knowledge transfer for melanoma screening with deep learning. In Proceedings of ISBI, 2017.
Seonwoo Min, Byunghan Lee, and Sungroh Yoon. Deep learning in bioinformatics. Briefings in Bioinformatics, 18(5):851–869, 2017.
Xu Min, Wanwen Zeng, Ning Chen, Ting Chen, and Rui Jiang. Chromatin accessibility prediction via convolutional long short-term memory networks with k-mer embedding. Bioinformatics, 33(14):i92–i101, 2017.
Milad Moradi, Kathrin Blagec, Florian Haberl, and Matthias Samwald. GPT-3 models are poor few-shot learners in the biomedical domain. arXiv preprint arXiv:2109.02555, 2021.
TH Muneeb, Sunil Sahu, and Ashish Anand. Evaluating distributed word representations for capturing semantics of biomedical concepts. In Proceedings of BioNLP, 2015.
Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. ScispaCy: Fast and robust models for biomedical natural language processing. In Proceedings of the 18th BioNLP Workshop and Shared Task, pages 319–327, 2019.
William S Noble. What is a support vector machine? Nature Biotechnology, 24(12):1565–1567, 2006.
Mhaned Oubounyt, Zakaria Louadi, Hilal Tayara, and Kil To Chong. DeePromoter: robust promoter predictor using deep learning. Frontiers in Genetics, 10:286, 2019.
Nicolas Papernot, Martín Abadi, Úlfar Erlingsson, Ian Goodfellow, and Kunal Talwar. Semi-supervised knowledge transfer for deep learning from private training data. In Proceedings of ICLR, 2017.
Shengwen Peng, Ronghui You, Hongning Wang, Chengxiang Zhai, Hiroshi Mamitsuka, and Shanfeng Zhu. DeepMeSH: deep semantic representation for improving large-scale MeSH indexing. Bioinformatics, 32(12):i70–i79, 2016.
Yifan Peng, Qingyu Chen, and Zhiyong Lu. An empirical study of multi-task learning on BERT for biomedical text mining. In Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, pages 205–214, 2020.
Gabriele Pergola, Elena Kochkina, Lin Gui, Maria Liakata, and Yulan He. Boosting low-resource biomedical QA via entity-aware masking strategies. In Proceedings of EACL, 2021.
Long N Phan, James T Anibal, Hieu Tran, Shaurya Chanana, Erol Bahadroglu, Alec Peltekian, and Grégoire Altan-Bonnet. SciFive: a text-to-text transformer model for biomedical literature. arXiv preprint arXiv:2106.03598, 2021.
Minh C Phan, Aixin Sun, and Yi Tay. Robust representation learning of biomedical names. In Proceedings of ACL, 2019.
Roshan M Rao, Jason Liu, Robert Verkuil, Joshua Meier, John Canny, Pieter Abbeel, Tom Sercu, and Alexander Rives. MSA transformer. In Proceedings of ICML, pages 8844–8856. PMLR, 2021.
Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, and Junzhou Huang. Self-supervised graph transformer on large-scale molecular data. In Proceedings of NeurIPS, 2020.
Matthias Rupp, Alexandre Tkatchenko, Klaus-Robert Müller, and O Anatole Von Lilienfeld. Fast and accurate modeling of molecular atomization energies with machine learning. Physical Review Letters, 108(5):058301, 2012.
Andrew I Schein, Alexandrin Popescul, Lyle H Ungar, and David M Pennock. Methods and metrics for cold-start recommendations. In Proceedings of ACM SIGIR, 2002.
Philippe Schwaller, Teodoro Laino, Théophile Gaudin, Peter Bolgar, Christopher A Hunter, Costas Bekas, and Alpha A Lee. Molecular Transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS Central Science, 5(9):1572–1583, 2019.
Yusuxke Shibata, Takuya Kida, Shuichi Fukamachi, Masayuki Takeda, Ayumi Shinohara, Takeshi Shinohara, and Setsuo Arikawa. Byte Pair encoding: A text compression scheme that accelerates pattern matching. Technical Report DOI-TR-161, Department of Informatics, Kyushu University, 1999.
Toshiyuki Shiraki, Shinji Kondo, Shintaro Katayama, Kazunori Waki, Takeya Kasukawa, Hideya Kawaji, Rimantas Kodzius, Akira Watahiki, Mari Nakamura, Takahiro Arakawa, et al. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proceedings of the National Academy of Sciences, 100(26):15776–15781, 2003.
Maria Stepanova, Feng Lin, and Valerie C-L Lin. A hopfield neural classifier and its FPGA implementation for identification of symmetrically structured DNA motifs. The Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, 48(3):239–254, 2007.
Bing Su, Dazhao Du, Zhao Yang, Yujie Zhou, Jiangmeng Li, Anyi Rao, Hao Sun, Zhiwu Lu, and Ji-Rong Wen. A molecular multimodal foundation model associating molecule graphs with natural language. arXiv preprint arXiv:2209.05481, 2022.
Peng Su, Yifan Peng, and K Vijay-Shanker. Improving BERT model using contrastive learning for biomedical relation extraction. In Proceedings of the 20th Workshop on Biomedical Language Processing, pages 1–10, 2021.
Cong Sun, Zhihao Yang, Leilei Su, Lei Wang, Yin Zhang, Hongfei Lin, and Jian Wang. Chemical–protein interaction extraction via gaussian probability distribution and external biomedical knowledge. Bioinformatics, 36(15):4323–4330, 2020.
Yuanhe Tian, Wang Shen, Yan Song, Fei Xia, Min He, and Kenli Li. Improving biomedical named entity recognition with syntactic information. BMC Bioinformatics, 21(1):1–17, 2020.
Yoshimasa Tsuruoka and Jun’ichi Tsujii. Improving the performance of dictionary-based approaches in protein name recognition. Journal of Biomedical Informatics, 37(6):461–470, 2004.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of NeurIPS, 2017.
Jue Wang, Lidan Shou, Ke Chen, and Gang Chen. Pyramid: A layered model for nested named entity recognition. In Proceedings of ACL, 2020.
Sheng Wang, Yuzhi Guo, Yuhong Wang, Hongmao Sun, and Junzhou Huang. SMILES-BERT: large scale unsupervised pre-training for molecular property prediction. In Proceedings of ACM-BCB, 2019.
Xiaorui Wang, Yuquan Li, Jiezhong Qiu, Guangyong Chen, Huanxiang Liu, Benben Liao, Chang-Yu Hsieh, and Xiaojun Yao. Retroprime: A diverse, plausible and transformer-based method for single-step retrosynthesis predictions. Chemical Engineering Journal, 420:129845, 2021.
Ying Wei, Jun Zhou, Yin Wang, Yinggang Liu, Qingsong Liu, Jiansheng Luo, Chao Wang, Fengbo Ren, and Li Huang. A review of algorithm & hardware design for AI-based biomedical applications. IEEE Transactions on Biomedical Circuits and Systems, 14(2):145–163, 2020.
Fang Wu, Qiang Zhang, Dragomir Radev, Jiyu Cui, Wen Zhang, Huabin Xing, Ningyu Zhang, and Huajun Chen. Molformer: Motif-based Transformer on 3D heterogeneous molecular graphs. arXiv preprint arXiv:2110.01191, 2021.
Cao Xiao, Edward Choi, and Jimeng Sun. Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review. Journal of the American Medical Informatics Association, 25(10):1419–1428, 2018.
Hai-Cheng Yi, Zhu-Hong You, De-Shuang Huang, and Chee Keong Kwoh. Graph representation learning in bioinformatics: trends, methods and applications. Briefings in Bioinformatics, 23(1):bbab340, 2022.
Shujuan Yin, Weizhong Zhao, Xingpeng Jiang, and Tingting He. Knowledge-aware few-shot learning framework for biomedical event trigger identification. In Proceedings of BIBM, 2020.
Zheng Yuan, Yijia Liu, Chuanqi Tan, Songfang Huang, and Fei Huang. Improving biomedical pretrained language models with knowledge. In Proceedings of the 20th Workshop on Biomedical Language Processing, pages 180–190, 2021.
Seongjun Yun, Minbyul Jeong, Raehyun Kim, Jaewoo Kang, and Hyunwoo J Kim. Graph transformer networks. In Proceedings of NeurIPS, 2019.
Zheni Zeng, Yuan Yao, Zhiyuan Liu, and Maosong Sun. A deep-learning system bridging molecule structure and biomedical text with comprehension comparable to human professionals. Nature Communications, 13(1):1–11, 2022.
Ningyu Zhang, Qianghuai Jia, Kangping Yin, Liang Dong, Feng Gao, and Nengwei Hua. Conceptualized representation learning for Chinese biomedical text mining. arXiv preprint arXiv:2008.10813, 2020.
Shaodian Zhang and Noémie Elhadad. Unsupervised biomedical named entity recognition: Experiments with clinical and biological texts. Journal of Biomedical Informatics, 46(6):1088–1098, 2013.
Yijia Zhang, Qingyu Chen, Zhihao Yang, Hongfei Lin, and Zhiyong Lu. BioWordVec, improving biomedical word embeddings with subword information and MeSH. Scientific Data, 6(1):1–9, 2019.
Zhaocheng Zhu, Chence Shi, Zuobai Zhang, Shengchao Liu, Minghao Xu, Xinyu Yuan, Yangtian Zhang, Junkun Chen, Huiyu Cai, Jiarui Lu, et al. TorchDrug: A powerful and flexible machine learning platform for drug discovery. arXiv preprint arXiv:2202.08320, 2022.
Acknowledgements
The contributions of all authors are as follows: Zhiyuan Liu, Yankai Lin, and Maosong Sun designed the overall architecture of this chapter; Zheni Zeng drafted this chapter. Zhiyuan Liu and Yankai Lin proofread and revised this chapter.
We thank Ganqu Cui, Yankai Lin, Yuan Yao, Xu Han, Chenyang Song, Zeyu Pan, Kunlun Zhu, and Ruiyi Fang for proofreading the chapter and proposing valuable revisions.
This chapter on biomedical knowledge representation learning is newly added in the second edition of the book Representation Learning for Natural Language Processing. The first edition of the book was published in 2020 [57].
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2023 The Author(s)
Cite this chapter
Zeng, Z., Liu, Z., Lin, Y., Sun, M. (2023). Biomedical Knowledge Representation Learning. In: Liu, Z., Lin, Y., Sun, M. (eds) Representation Learning for Natural Language Processing. Springer, Singapore. https://doi.org/10.1007/978-981-99-1600-9_12
DOI: https://doi.org/10.1007/978-981-99-1600-9_12
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-1599-6
Online ISBN: 978-981-99-1600-9