Abstract
Due to the complexity of the biomedical domain, the ability to capture semantically meaningful representations of terms in context is a long-standing challenge. Despite important progress in the past years, no evaluation benchmark has been developed to evaluate how well language models represent biomedical concepts according to their corresponding context. Inspired by the Word-in-Context (WiC) benchmark, in which word sense disambiguation is reformulated as a binary classification task, we propose a novel dataset, BioWiC, to evaluate the ability of language models to encode biomedical terms in context. BioWiC comprises 20’156 instances, covering over 7’400 unique biomedical terms, making it the largest WiC dataset in the biomedical domain. We evaluate BioWiC both intrinsically and extrinsically and show that it could be used as a reliable benchmark for evaluating context-dependent embeddings in biomedical corpora. In addition, we conduct several experiments using a variety of discriminative and generative large language models to establish robust baselines that can serve as a foundation for future research.
Background & Summary
Biomedical corpora, such as scientific articles and patient reports, contain a wealth of knowledge and information that can be used to enable high-quality research. However, extracting knowledge from these free-text sources is challenging: it requires understanding both the meaning of natural language and the idiosyncrasies of the biomedical domain, and it must scale to the sheer volume of the data1. Biomedical natural language processing (NLP) techniques have been used to analyze information from free-text sources at scale, enabling the extraction and synthesis of biomedical information, and transforming unstructured data into a structured format2,3.
Compared to general corpora, NLP models face three main challenges for semantic representation of biomedical data4,5,6,7. First, the number of biomedical entities is extremely high. For example, the SNOMED-CT ontology8 defines more than 300’000 medical concepts while the UniProt Knowledgebase (UniProtKB)9 contains more than 550’000 curated proteins. Combined, the number of concepts described in these two knowledge organization systems is higher than the number of terms defined in dictionaries for many natural languages. Second, a single biomedical concept can have many synonyms and alternative expressions. For example, in Fig. 1 the concept “C0007134” defined in the Unified Medical Language System (UMLS) thesaurus can be represented with at least four terms: “Renal Cell Carcinoma”, “RCC”, “Nephroid Carcinoma”, and “Adenocarcinoma”. Third, biomedical corpora are notorious for their overabundance of abbreviations and acronyms10. These abbreviations and acronyms are often polysemous, e.g., the acronym “RCC” in Fig. 1 belongs to two concepts – “C2826323” and “C0007134” – making their semantic representation even more challenging.
Entity linking11 and word sense disambiguation (WSD)12 are two NLP tasks that address the issue of semantic representation in the biomedical field. Entity linking systems aim to connect terms mentioned in a text with corresponding concepts in a knowledge organization system13,14. For instance, the abbreviation “CA” in biomedical contexts can stand for either “calcium”, an essential mineral in the human body, or “cancer”, a group of diseases characterized by abnormal cell growth. An ideal entity linking system would employ contextual cues to correctly map “CA” to its standardized form in a chosen knowledge base, e.g., UMLS. This proper alignment helps reduce ambiguity, enhancing the understanding of biomedical corpora15,16. In the biomedical domain, a wide array of datasets exists for entity linking, each employing distinct text corpora as their primary contextual resource. For instance, MedMentions17 and BC5CDR18 focus on biomedical abstracts, N2C2 201919 on clinical notes, and COMETA20 on social media content. These datasets are also differentiated by their target ontologies: MedMentions17 aligns with UMLS, BC5CDR18 connects to MeSH, and SMM4H21 links with the MedDRA ontology. Each dataset serves a unique purpose within the biomedical entity linking landscape.
Given a word in context, the objective of WSD is to associate the word with its correct meaning in a sense inventory22,23. For example, in the sentence “The patient has been suffering from a cold.”, the word cold should be associated with its medical meaning as opposed to its temperature or literary (e.g., the James Bond novel by John Gardner) meanings. Two of the most prominent biomedical WSD datasets are MSH WSD24 and NLM WSD25. The MSH WSD dataset, created by the National Library of Medicine, comprises 37’888 instances across 203 ambiguous terms and abbreviations from the Medical Literature Analysis and Retrieval System Online (MEDLINE) 2010 baseline, each linked to the MeSH ontology. Similarly, the NLM WSD dataset, also developed by the National Library of Medicine, includes 5’000 instances for 50 ambiguous biomedical terms, with each instance linked to UMLS. Despite the steps forward in this promising research direction, the main limitation of the current formulation of the WSD task lies in its restriction to the words and senses defined by predefined sense inventories26,27.
To bridge this gap, the Word-in-Context (WiC) benchmark26 presented a novel perspective on WSD, dropping the traditional requirement of a fixed sense inventory27. WiC formulates WSD as a binary classification task, where a polysemous word appears in two different sentences, and the task is to infer whether the word holds the same meaning in both or not. WiC has been integrated as a component of SuperGLUE28, a comprehensive evaluation framework designed to assess the performance of natural language understanding systems. XL-WiC27 and TempoWiC29 are two recent extensions of WiC, adapting it to 12 different languages and targeting the detection of meaning shifts on Twitter, respectively. The WiC-TSV (Target Sense Verification of Words in Context) dataset30 is closely related to WiC and focuses on a binary disambiguation task: determining whether the contextually intended sense of a word aligns with a pre-defined target sense. Its training and development sets comprise general domain instances, while its test set is composed of general domain instances as well as three domain-specific subsets: cocktails, medicine, and computer science. For all instances, the primary context source is the Wikilinks dataset31, and for the biomedical instances specifically, the target sense definitions are sourced from the MeSH ontology. The main limitation of this dataset is the small number of biomedical instances it offers — 205 instances representing 8 unique biomedical terms. Moreover, its scope is limited as it only includes target terms and definitions from the MeSH ontology. These constraints could potentially limit the effectiveness of the dataset in the development and evaluation of comprehensive WSD systems in the biomedical domain.
Despite significant progress both in WSD and entity linking tasks in the biomedical domain15,31,32,33,34, there exists no benchmark that specifically targets the semantic representation of biomedical terms as a WiC-style task. To bridge this gap, we present the BioWiC35 benchmark, a novel dataset that provides high-quality annotations for the evaluation of contextualized term representations in the biomedical domain. Inspired by WiC26, we formulate BioWiC as a binary classification task, whose aim is to identify whether two target terms in their respective contexts have the same meaning. In addition to its focus on biomedical concepts, BioWiC differs from WiC in several ways. First, in contrast to WiC, which focuses on single-token words as targets, BioWiC allows terms that can be single words, phrases, or multiword expressions. Second, BioWiC terms may be represented not only by the same term in different contexts but also by different term forms referring to the same concept (or not). The dataset is named “BioWiC”, reflecting its design for the biomedical domain while showcasing its relation to the WiC task.
A key attribute of BioWiC35 is its flexibility and scalability. Unlike WSD and entity linking, which are restricted to concepts covered by existing knowledge graphs, BioWiC can be expanded independently of such resources. This is because expanding the dataset with a novel concept only requires annotating instances whose two sentences contain the target concept, regardless of whether it is included in any existing knowledge organization resource. This flexibility allows for continual evolution and improvement, independent of updates to standardized resources, providing a more comprehensive and up-to-date resource for research in the biomedical field.
Methods
In this section, we present BioWiC35 – a novel benchmark dataset for evaluating in-context biomedical concept representations. First, we explain the resources we used to create the corpus and the pre-processing steps. We then provide an overview of the methodology used to create the dataset and discuss the processes for instance generation, dataset splitting, and quality assessment.
BioWiC resources
As shown in Table 1, BioWiC35 instances were built using annotations from the following manually curated biomedical entity linking datasets:
MedMentions17: this is the largest entity linking dataset in the biomedical domain. It includes 4’392 PubMed abstracts and over 350’000 mentions linked to UMLS. The full MedMentions version covers 128 UMLS semantic types. However, as stated by17, the concepts can be either too expansive (e.g., “Group, South Asia”) or cover peripheral and supplementary topics (e.g., “Rural Area, No difference”). Thus, we follow36,37 and focus on the officially released subset of MedMentions called ST21pv (21 Semantic Types from Preferred Vocabularies), which contains 203’282 biomedical mentions from 21 UMLS semantic types.
BC5CDR18: introduced in the BioCreative challenge, this dataset comprises 1’500 PubMed abstracts and 13’343 mentions linked to Medical Subject Headings (MeSH) concepts. The dataset covers a wide range of biomedical entities, including 4’409 chemicals, 5’818 diseases, and 3’116 instances of chemical-disease interactions.
NCBI Disease38: developed by the National Center for Biotechnology Information (NCBI), this dataset includes biomedical information derived from 793 PubMed abstracts. It comprises 6’892 disease mentions, each associated with their relevant standardized forms in the MeSH or Online Mendelian Inheritance in Man (OMIM) terminologies.
Data pre-processing
To have homogeneous word-in-context instances from different resources, we unified their format using the following steps:
-
Sentence segmentation: Each BioWiC35 instance is composed of a pair of target terms together with their respective sentences. We use the PySBD library39, version 0.3.4, to determine sentence boundaries in the initial source texts (i.e., abstracts of publications). We parse documents and keep only sentences that contain mapped mentions.
-
Label unification: The source datasets of BioWiC35 map mentions (i.e., terms) to different target knowledge organization resources, i.e., MeSH, OMIM, and UMLS. This results in concept codes, i.e., unique identifiers in the target ontology, that are not directly comparable. To address this issue, we used UMLS as the main reference and transferred the concept identifiers from MeSH and OMIM to UMLS using the available ontology mappings in UMLS 2021AB. To avoid ambiguity, MeSH or OMIM concepts with multiple mappings in UMLS 2021AB were removed.
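This label-unification step can be sketched as follows. The sketch assumes a hypothetical lookup table `source_to_umls` from MeSH/OMIM codes to their UMLS CUIs (in practice built from the UMLS 2021AB mapping tables); the function, variable names, and example codes are illustrative, not part of the released pipeline.

```python
# A minimal sketch of the label-unification step, assuming a hypothetical
# lookup `source_to_umls` from MeSH/OMIM codes to UMLS CUIs.

def unify_labels(mentions, source_to_umls):
    """Map (term, source_code) pairs to (term, CUI) pairs.

    Codes with zero or multiple UMLS mappings are discarded to avoid
    ambiguity, as described above.
    """
    unified = []
    for term, code in mentions:
        cuis = source_to_umls.get(code, [])
        if len(cuis) == 1:  # keep only unambiguous one-to-one mappings
            unified.append((term, cuis[0]))
    return unified

# Hypothetical example mapping (codes and CUIs for illustration only).
mapping = {
    "D002292": ["C0007134"],              # one-to-one: kept
    "D020274": ["C5671289", "C0751871"],  # one-to-many: removed
}
```

Applying `unify_labels` to mentions annotated with these codes keeps only the unambiguously mapped concepts.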
BioWiC construction
BioWiC35 instances follow a similar format to WiC, where each instance involves a pair of biomedical terms (w1 and w2) and their corresponding sentences (s1 and s2). The task is to classify each instance as True if the target terms carry the same meaning across both sentences or False if they do not. We represent each instance as a tuple pair t = [(s1,w1),(s2,w2)]: y where w1 and w2 are the target terms, s1 and s2 are the corresponding sentences, and y is the associated binary label. Table 2 presents some examples of BioWiC instances. In contrast to WiC, where both target terms of each instance always share the same lemma, BioWiC allows for variations such as abbreviations, synonyms, identical terms, and terms with similar surface forms.
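The instance format and its labeling rule can be sketched in a few lines of Python. This is an illustrative sketch only: the field names, example sentences, and the helper name `label_instance` are assumptions, not the released schema.

```python
# Illustrative sketch of a BioWiC instance t = [(s1, w1), (s2, w2)]: y
# and its labeling rule; field names and example values are hypothetical.

def label_instance(cui1, cui2):
    """Binary label: 1 (True) iff both target terms map to the same UMLS CUI."""
    return 1 if cui1 == cui2 else 0

instance = {
    "term1": "RCC",
    "sentence1": "The patient was diagnosed with metastatic RCC.",
    "term2": "renal cell carcinoma",
    "sentence2": "Renal cell carcinoma accounts for most kidney cancers.",
    "label": label_instance("C0007134", "C0007134"),  # same concept -> 1
}
```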
To evaluate challenging scenarios for semantic representation, such as synonymy, polysemy, and abbreviations, BioWiC35 is divided into four main groups of instances. Group A (term identity) contains instances where the target terms w1 and w2 are identical. In group B (abbreviations), either w1 or w2 could be the abbreviation of the other. Group C (synonyms) consists of instances where w1 and w2 could be synonyms (according to UMLS). Lastly, group D (label similarity) includes instances where w1 and w2 share similar surface forms. We employed the following five steps to generate the BioWiC instances:
-
(i)
Sentence collection: We first gathered all the sentences from the source datasets manually annotated with terms M(W,C) = {(w1, c1), (w2, c2), …, (wn, cn)}, where w ∈ W is a term and c ∈ C is a concept defined in UMLS. Then, we created a set S = {s1, …, sn}, where each sentence s ∈ S has at least one mention w ∈ W linked to c ∈ C.
-
(ii)
Tuple creation: For each sentence s ∈ S, we randomly chose one of the annotated mentions w and created a set of sentence-term tuples P = {(s1, w1), (s2, w2),…,(sn, wn)}, where for each (si, wi) ∈ P, si includes wi. We then paired the tuples of P and created a collection of tuple pairs:
$$T=\left\{\left[({s}_{1},{w}_{1}),({s}_{2},{w}_{2})\right],\left[({s}_{1},{w}_{1}),({s}_{3},{w}_{3})\right],\ldots ,\left[({s}_{m},{w}_{m}),({s}_{n},{w}_{n})\right]\right\}.$$ -
(iii)
Instance definition and labeling: We considered each pair t = [(si,wi), (sj,wj)] ∈ T as a potential BioWiC35 instance, where wi and wj serve as target terms and si and sj are their corresponding sentences, respectively. Each instance is labeled as y = True if the target terms wi and wj were linked to the same or synonym UMLS concept, and as y = False if they were not. We then added the label y to each tuple pair to create the dataset of possible BioWiC instances t = [(si,wi),(sj,wj)]: y.
-
(iv)
Tuple selection: We categorized each instance t: y to one of the main groups of BioWiC35. Group A included instances for which wi and wj are identical. Group B included instances where wi is the abbreviated form of wj or vice-versa. Group C included instances where wi and wj could be synonyms. Group D included instances where wi and wj are not identical but share similar surface characteristics.
-
(v)
Dataset splitting: We divided the instances into three parts: training set, development set, and test set, providing a consistent and reliable framework for model training and evaluation.
For clarity, in Fig. 2 we provide an example of building BioWiC35 instances for the target term “delivery”. Initially, we preprocess the resource data and extract all sentences in which “delivery” is linked to UMLS. We transform each sentence into the sentence-term tuple format (si,w), where si represents a sentence containing the term w = “delivery”. Subsequently, we enumerate all possible pairs of the tuples (si,w) identified in the preceding step to generate BioWiC instances t = [(si,w),(sj,w)], where “delivery” serves as the target term in both sentences. Finally, we label each instance as True when “delivery” is mapped to the same CUI code in both sentences and as False when it is not.
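The pairing and labeling steps above can be sketched with `itertools.combinations`. In this sketch, `cui_of` (a lookup from each sentence-term tuple to its annotated CUI) is a hypothetical stand-in for the source-dataset annotations; all names are illustrative.

```python
from itertools import combinations

def build_instances(tuples, cui_of):
    """Pair all (sentence, term) tuples for a target term and label each pair.

    An instance is labeled True when both mentions are linked to the same
    CUI, and False otherwise.
    """
    instances = []
    for (s_i, w_i), (s_j, w_j) in combinations(tuples, 2):
        same_cui = cui_of[(s_i, w_i)] == cui_of[(s_j, w_j)]
        instances.append(([(s_i, w_i), (s_j, w_j)], same_cui))
    return instances
```

For three sentences containing “delivery” (two annotated with one CUI and one with another), this yields three candidate instances, exactly one of which is positive.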
Instance generation
To build the BioWiC35 instances, we considered two main challenges of biomedical texts: semantic and lexical ambiguity. The presence of semantically ambiguous terms, that is, terms that can have multiple meanings in different contexts, is one of the most difficult aspects of biomedical text processing3. For instance, the term staph can denote either a type of disease (usually followed by infection) or a bacterium in other contexts. In addition, the same term can be used in different domains with different meanings. To assess the capability of language models to provide context-sensitive representations for a term across different contexts, we included a group of instances (group A) in BioWiC in which a target biomedical term appears in two different contexts. Another key challenge in the biomedical domain is that terms can be expressed in various forms or lexical formats, even when they refer to the same biomedical concept. To account for this challenge, we developed three other groups of BioWiC instances to measure language models’ ability to use context and produce similar representations for synonym terms with different surface strings. We categorize these into three groups: i) abbreviations, ii) synonyms, and iii) concepts with similar surface characteristics. Each instance in these groups contains two target terms with different surface forms, each occurring in a different context, and the models should identify whether these terms refer to the same biomedical concept or not.
Instance groups
In what follows, we discuss how we created the instances for each group:
-
(A)
Term identity: To create these instances, we use the tuple pair list built in step (iii) of the construction pipeline and consider every pair t = [(si,wi),(sj,wj)] ∈ T as an instance of group A if wi and wj are identical. We classified each t as True if both terms were linked to the same UMLS CUI and False otherwise. Two instances of this type are shown in Table 2 (examples one and two). In the first example, both target terms refer to the same concept and have the same meaning (i.e., toxicity that impairs or damages the heart, UMLS CUI C0876994), so the instance label is True. In the second instance, however, the target terms are mapped to different CUIs (C0032914 and C0034065), and thus the instance label is False.
-
(B)
Abbreviations: In this group, one of the target terms is the abbreviated form of the other, e.g., heart rate and hr. From the tuple pair list, we pick all the pairs t = [(si, wi),(sj,wj)] ∈ T in which wi is the abbreviated form of wj or vice versa. To verify this, we generate the abbreviated form of wi by combining the initial letters of the parts obtained after splitting it (e.g., “FEO” is considered the abbreviation of “familial expansile osteolysis”). Next, we check whether wj is identical to the abbreviation of wi. We perform the same procedure for wj as well. If either wi or wj is the abbreviation of the other, we categorize the tuple pair into this group. Each tuple pair is then assigned the label True if wi and wj are mapped to the same UMLS CUI and False otherwise. As shown in example 3 of Table 2, “FEO” in sentence 1 is used as the abbreviation of “familial expansile osteolysis”, so the instance is labeled as True. In example 4, however, the target term PD does not have the same meaning as “Periodontal disease” and thus the instance is labeled as False.
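The abbreviation heuristic described above can be sketched as follows. The function names are illustrative, and the splitting rule (first letter of each whitespace- or hyphen-separated part) is an assumption matching the description in the text rather than the exact released implementation.

```python
import re

def initialism(term):
    """Heuristic abbreviation: first letter of each whitespace/hyphen-split part."""
    parts = re.split(r"[\s\-]+", term.strip())
    return "".join(p[0] for p in parts if p).upper()

def is_abbreviation_pair(w1, w2):
    """True if either term equals the generated initialism of the other."""
    return w1.upper() == initialism(w2) or w2.upper() == initialism(w1)
```

For example, `initialism("familial expansile osteolysis")` yields `"FEO"`, so the pair (“FEO”, “familial expansile osteolysis”) falls into this group.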
-
(C)
Synonyms: This group refers to instances in which the target terms w1 and w2 belong to the same UMLS synonym set. Each UMLS synonym set consists of a group of biomedical terms that express the same meaning. As shown in Fig. 1, due to semantic ambiguity, a biomedical term with several distinct meanings can appear in several distinct synonym sets. For instance, “Adenocarcinoma” could have the same meaning as either “Renal Cell Carcinoma” (CUI C0007134) or “Carcinoma in adenoma” (CUI C0001418). Consequently, we consider such terms as potential synonyms, which may or may not hold the same meaning depending on their context. To create the instances, we collect all the tuple pairs t = [(si,wi),(sj,wj)] from T in which wi and wj are both present in a UMLS synonym set. We then assigned the label True to each instance if wi and wj are linked to the same UMLS CUI code, and False if they are not. Two examples of this group of instances are shown in Table 2.
-
(D)
Label similarity: Despite their broad coverage of synonyms and semantic types, UMLS synonym sets still lack many reformulated variants of concepts that appear in biomedical contexts. For instance, the concept “chronic pseudomonas aeruginosa infection” can be reformulated as “chronic PA infection”, which is not covered by UMLS. To deal with this and to cover a wide range of target concepts with different formats in the dataset, we developed a fourth group of instances in which the corresponding terms have a high Levenshtein similarity ratio (see examples 7 and 8 in Table 2). To create such instances, we retrieve all tuple pairs t = [(si,wi),(sj,wj)] ∈ T in which the Levenshtein similarity ratio between wi and wj surpasses the threshold of 0.75. Each tuple t is labeled True when wi and wj correspond to the identical UMLS entry, and False otherwise. The main idea behind this strategy was to include instances where target terms have similar surface forms but may refer to different medical concepts. Two instances of this group are shown in Table 2. In example 7, both “piebald” and “piebaldism” refer to the same concept, whereas in example 8, “anemic” and “anaemia” refer to two different concepts.
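A sketch of the group-D selection rule is shown below. It uses Python's standard-library `difflib.SequenceMatcher` as a stand-in for the Levenshtein similarity ratio mentioned in the text (the two measures agree on the examples here but are not identical); the function names and the lowercasing step are assumptions.

```python
from difflib import SequenceMatcher

def similarity_ratio(a, b):
    """Surface-form similarity in [0, 1]; 1.0 means identical strings.

    difflib's ratio is used here as an approximation of the Levenshtein
    similarity ratio described in the text.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def in_label_similarity_group(w1, w2, threshold=0.75):
    """Candidate for group D: distinct terms with similar surface forms."""
    return w1 != w2 and similarity_ratio(w1, w2) > threshold
```

Under this rule, (“piebald”, “piebaldism”) and (“anemic”, “anaemia”) both qualify as candidates, while unrelated terms or identical terms do not.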
Data Records
The BioWiC35 dataset is available on Figshare (https://doi.org/10.6084/m9.figshare.25611591.v2), HuggingFace (https://huggingface.co/datasets/hrouhizadeh/BioWiC), and GitHub (https://github.com/hrouhizadeh/BioWiC). It comprises three distinct JSON files: training set, development set, and test set. Each instance within a JSON file includes ten fields. The first two items, term1 and term2, followed by sentence1 and sentence2, correspond respectively to the two target terms and the two sentences of each instance. The character-level positions of the target terms are given by start1 and start2, indicating the starting positions, and end1 and end2, marking the end positions within their respective sentences. Furthermore, the cat attribute classifies each instance into one of the BioWiC groups, i.e., term_identity, abbreviations, synonyms, or label_similarity. Lastly, a binary label is attached to each instance, taking the value of either 1 (True) or 0 (False).
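Assuming each split is stored as a JSON array of instance objects with the fields listed above, a split can be loaded and sanity-checked as follows; the helper name `load_split` is illustrative.

```python
import json

def load_split(path):
    """Load one BioWiC split and verify the character offsets of each term."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    for inst in data:
        # Each target term should be recoverable from its start/end offsets.
        assert inst["sentence1"][inst["start1"]:inst["end1"]] == inst["term1"]
        assert inst["sentence2"][inst["start2"]:inst["end2"]] == inst["term2"]
    return data
```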
Technical Validation
Dataset splits
We divided the BioWiC35 instances into three parts, i.e., training, development, and test sets, thereby establishing a structured and robust framework for model development and evaluation. To do so, we first built the test set of 2’000 instances under three constraints: 1) only one instance for each unique pair of target terms, 2) no sentence repetition between instances, and 3) no overlap between the sentences and term pairs of the test set and those of the training or development sets. The primary objective of rules 1 and 2 was to ensure a diverse range of term pairs and sentences in the test set. Rule 3 was introduced to assess the generalization power of the language models, i.e., the model’s ability to adapt to new, previously unseen data. Taking these constraints into account, we randomly sampled a set of 2’000 term pair instances from the four groups defined above (800, 200, 800, and 200 samples for the term identity, abbreviations, synonyms, and label similarity groups, respectively) to build the test set. Finally, we used the remaining instances to create the training and development sets. General statistics of the different splits of BioWiC are reported in Table 3. In addition, following WiC, we balanced all the data splits in terms of labels, i.e., 50% True and 50% False.
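Constraints 1 and 2 amount to a greedy de-duplication over candidate instances, which can be sketched as follows (all names are illustrative; constraint 3 is then satisfied by building the other splits only from the untouched remainder):

```python
def select_test_instances(candidates, n_target):
    """Greedily pick test instances with unique term pairs (constraint 1)
    and no repeated sentences (constraint 2)."""
    seen_pairs, seen_sents, test = set(), set(), []
    for (s1, w1), (s2, w2), y in candidates:
        pair = frozenset((w1.lower(), w2.lower()))
        if pair in seen_pairs or s1 in seen_sents or s2 in seen_sents:
            continue  # would violate constraint 1 or 2
        seen_pairs.add(pair)
        seen_sents.update((s1, s2))
        test.append(((s1, w1), (s2, w2), y))
        if len(test) == n_target:
            break
    return test
```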
During the compilation of the training set, we adopted a simple filtering approach in which we only kept instances whose sentences did not recur more than a certain frequency threshold. We built the training set with various thresholds, ranging from 1 to 200, to determine the most appropriate limit. As illustrated in Fig. 3, the size of the training set, the number of unique concepts, and the number of semantic types in the training set varied with these thresholds. Once sentence recurrence surpassed 100, the incremental growth of the training set size as well as of the number of unique concepts was marginal, registering below 2%. Furthermore, with a higher threshold, the number of unique semantic types included in the training set does not exceed 98. As a result, we chose 100 as our cut-off point.
Quality control
UMLS is a broadly used resource in the biomedical domain, covering a wide range of biomedical concepts. A key feature of UMLS is its capability to connect concepts from different biomedical terminologies, such as SNOMED CT, LOINC, MeSH, and RxNorm. Through this mapping, a single code from a source terminology can be mapped to several UMLS CUI codes. For instance, MeSH code D020274, which represents “Autoimmune Diseases of the Nervous System”, is mapped to three distinct UMLS CUIs, C5671289, C0751871, and C0751872, for “Autoimmune Encephalitis”, “Autoimmune Diseases of the Nervous System”, and “Immune Disorders, Nervous System”, respectively. In our dataset, there are instances where different CUI codes are assigned to the target terms, resulting in a False label, even though those CUIs correspond to the same code in an alternative ontology and the underlying concepts are therefore equivalent. To prevent any confusion and to ensure the dataset’s reliability, we employed a pruning strategy and removed the instances in which the target terms are mapped to multiple UMLS codes while those UMLS codes correspond to the same code in another ontology. The process also involved eliminating any pairs whose CUIs are considered synonyms as per the MRREL.RRF file from UMLS. We also followed WiC26 and XL-WiC27 and filtered out all the pairs where one CUI is directly related to the other as a broader concept in the UMLS hierarchy.
Human validation
To assess the quality of BioWiC35, we extracted two random subsets of 100 instances (with 50 shared instances) from the test set and asked two domain experts to label them. Both annotators were medical doctors with extensive experience in semantic annotation. They were provided with a set of instructions including a short description of the task as well as a few examples of labeled instances. During the annotation process, no external information from UMLS or any other resource was provided to the experts. The annotators achieved a Cohen’s Kappa of 0.84, indicating strong inter-annotator agreement and the high quality of the dataset. An average human accuracy of 0.80 (0.80 and 0.81 for annotator 1 and annotator 2, respectively) was obtained through the annotation process, which can be viewed as an upper bound for model performance.
Dataset coverage
In this section, we focus on the scope of the dataset by studying the unique CUI codes and comparing them to the total number of CUIs present in UMLS. Additionally, we investigate the semantic types within the dataset, examining both their number and the proportions among them. Table 3 shows that BioWiC35 covers over 5,000 unique CUI codes from UMLS. Additionally, BioWiC includes almost 80% of UMLS semantic types, i.e., 99 out of 127, across different splits. This wide coverage is indicative of the dataset’s comprehensiveness and its potential as a valuable resource for biomedical research. In Fig. 4, we present the ratio of the top 10 semantic types and semantic groups included in BioWiC. Additionally, Table 4 shows the frequency and proportion of target terms across different BioWiC splits, categorized by their token counts.
Compared to WSD datasets in the biomedical domain, BioWiC35 stands out as the most comprehensive in terms of the variety of unique biomedical terms it includes, covering a total of 7’413 distinct terms. This range far surpasses that of other datasets, such as MSH WSD24 with 203 terms, NLM WSD25 with 50 terms, and WiC-TSV30, which includes only 8 terms. Moreover, the extensive scope of BioWiC is emphasized by its incorporation of 99 different semantic types from UMLS, in contrast to the narrower range covered by other datasets, i.e., MSH WSD24, NLM WSD25, and WiC-TSV30, which include 81, 46, and 8 UMLS semantic types respectively.
Baseline experiments
We implemented several baseline models, following the baseline suite of the SuperGLUE28 benchmark. Considering that all splits of BioWiC35 are balanced in terms of positive and negative instances, we take the same approach as WiC26 and use accuracy to measure the performance of the different models, i.e., the percentage of correctly predicted cases (whether true positives or true negatives) over the total number of samples. The baselines include:
Random: We provide a lower bound for the performance by randomly assigning a class to each instance.
GloVe: In this baseline, we used GloVe-840B40 pre-trained embeddings. We averaged token embeddings to represent each sentence and fed the resulting feature vector to an MLP classifier (with 128 neurons in the hidden layer and one neuron in the output layer).
Bi-LSTM: We also trained a BiLSTM model (with 128 hidden units) to capture both the forward and backward context information of the sentence. The BiLSTM model output was fed into a fully connected layer with one output neuron for binary classification.
BERT: We explored the performance of several BERT-based models to provide stronger baselines for the BioWiC35 task. To evaluate how well general-domain language models generalize to biomedical concepts, our baselines include three general transformer-based language models – BERT41, RoBERTa42, and ELECTRA43. In addition, to assess the effect of the prior knowledge of language models on biomedical concept representation, we evaluated the performance of three language models pre-trained on biomedical and clinical data – BioBERT44, Bio_ClinicalBERT45, and SciBERT46, trained on PubMed abstracts and PubMed Central, the MIMIC-III database47, and papers from Semantic Scholar (mostly in the biomedical domain), respectively. To fine-tune each model, we used the Sentence-BERT48 framework, which incorporates siamese and triplet network architectures to generate semantically meaningful embeddings. We pre-processed each input sentence by enclosing the target terms within double quotes, emphasizing their significance, and fed the modified sentences into the BERT architecture for further processing. We also experimented with a different pre-processing technique for the input sentences; Supplemental Table 1 in the Supplementary Information section compares the results of both strategies.
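The quote-marking pre-processing step can be sketched using the character offsets provided with each instance; the function name `mark_target` is illustrative.

```python
def mark_target(sentence, start, end):
    """Enclose the target term in double quotes before encoding,
    as in the pre-processing step described above."""
    return sentence[:start] + '"' + sentence[start:end] + '"' + sentence[end:]
```

Applied to the sentence "The patient was diagnosed with RCC last year." with offsets (31, 34), this yields the same sentence with "RCC" enclosed in double quotes.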
Llama-2: We also conducted experiments using three different versions of the Llama-2 language model, i.e., Llama-2-7b, Llama-2-13b, and Llama-2-70b49. Our experiments involve a few-shot approach, in which the language model receives a small number of examples before making predictions, and a fine-tuning approach, in which we used the BioWiC35 instances to fine-tune the language models.
BERT/Llama-2++: We conducted additional experiments in which we incorporated the general domain data from the WiC dataset26 as additional training data for fine-tuning the transformer-based models. By expanding our training data with extra instances from the general domain, we aim to explore the potential benefits of leveraging diverse sources of information for the BioWiC35 task.
Results
The performance of the baseline models on the BioWiC35 benchmark is presented in Fig. 5. The results indicate that state-of-the-art language models fine-tuned on the BioWiC training set surpass the random baseline by a margin of 18% to 26% (p-value < 0.001). Both the GloVe and BiLSTM baselines are unable to compete with the fine-tuned large language models. Overall, Llama-2-70b outperforms all competing methods, achieving the highest accuracy. The closest to Llama-2-70b in terms of accuracy are BioBERT, BioBERT++, and SciBERT++, which Llama-2-70b outperforms by 2% (p-value = 0.04). It is worth noting that, in contrast to the different variants of the Llama-2 language model, which are pre-trained on general domain corpora, BioBERT is pre-trained on large biomedical data, allowing it to understand complex biomedical texts effectively44. Nevertheless, Llama-2-70b achieves state-of-the-art performance, illustrating its high capability for adapting to the task of representing biomedical terms in context.
In our analysis of different Llama-2 models, we observe a significant difference in performance depending on the method used in our evaluation, i.e., few-shot learning or fine-tuning. As shown in Fig. 5, Llama-2-7b surpassed the random baseline by a slight margin in the few-shot setting; however, its performance increased by 17% upon fine-tuning (p-value < 0.001). This pattern of performance boost was consistent with the other Llama-2 variants. Specifically, after the fine-tuning process, the accuracy of Llama-2-13b improved from 0.61 to 0.73 (p-value < 0.001), while Llama-2-70b experienced an increase from 0.68 to 0.78 (p-value < 0.001). These observations emphasize the crucial role of the fine-tuning phase in enhancing the contextualized representation of biomedical terms. Additionally, our results are consistent with a prior study50, which demonstrated that the GPT-3 language model failed to surpass random baseline performance on the WiC dataset under a few-shot evaluation.
Comparing the performance of different BERT-based models shows that BioBERT and SciBERT achieve the highest performance across the different groups of the test set. Overall, BioBERT outperforms SciBERT by a slight margin of 1% accuracy, i.e., 0.76 and 0.75 (p-value = 0.04), respectively. The potential reason for the superior performance of BioBERT and SciBERT can be attributed to their pre-training on large biomedical corpora. This provides them with in-depth knowledge of biomedical terminologies and concepts, leading to more accurate representations of terms and expressions compared to BERT-based models pre-trained on general-domain corpora44. Surprisingly, Bio_ClinicalBERT's performance is similar to that of the general-domain BERT models and does not align with the other, superior biomedical BERT variants.
Further analysis of the results for different groups indicates that the “term identity” and “synonyms” groups present a greater challenge than the other groups for all models. Regarding model performance on the “label similarity” group, it is plausible that minor changes in term structure carry meaningful distinctions in biomedical contexts. Models might exploit structural alterations, such as the addition of suffixes or prefixes, that influence the meanings of terms. This understanding of term structure can be particularly relevant and beneficial for performance in the “label similarity” group. As for the “abbreviations” group, it is important to note that abbreviations are commonly used in the biomedical domain. The models may have encountered these abbreviations (along with their full forms) in various contexts during both the pre-training and fine-tuning phases. This exposure to abbreviations in diverse settings helps the models effectively learn and capture their meanings. The group of “synonym” instances appears to be more difficult for models to handle. This might be because, in the biomedical field, a single term can have multiple synonyms with varied forms, and each synonym can have multiple meanings (as shown in Fig. 1), which makes it hard for models to recognize synonymous terms with different surface forms across different contexts. For the “term identity” group, since these instances present no difference between the target terms, the models cannot rely on lexical cues and must prioritize comprehension of the information in the surrounding context, which makes the task more challenging.
In our study, we also conducted experiments in which we incorporated general-domain training data from WiC26 into our dataset (denoted by appending ++ to the name of the language model). We observe slight fluctuations in the performance of the models when merging the general and biomedical domain datasets. This can possibly be explained by the fact that the model faces potential distribution shifts due to the distinct nature of each domain. Despite the increased volume of training data, this misalignment in data distributions can offset the advantages of the added samples. Thus, while the combined dataset is larger, it does not necessarily lead to improved model performance in the biomedical context.
Alternative evaluation scenarios
To gain a deeper understanding of how models perform on the BioWiC35 benchmark, we analyzed their performance in two alternative scenarios. First, we assessed how the data distribution impacts their results. Here, we considered seen and unseen data distributions. Second, we assessed the influence of the training corpus on performance. In this scenario, we are interested in whether learning from general-domain examples enables models to generalize to the biomedical domain.
Seen vs unseen: In this analysis, the aim is to evaluate the variation in performance based on whether the target terms in the instances have been previously seen during training or not. For this purpose, we used the models fine-tuned on the BioWiC35 training set and divided the test set into two categories: “seen” and “unseen”. The first category includes instances where the model has been exposed to at least one of the target terms during training, while the second category involves instances where both target terms are new to the model. Table 5 reports the number and proportion of seen and unseen data across different groups within the BioWiC test set. Note that term pairs (the two target terms of each instance) and the sentences in the test set are unique and were not presented to the model during its training phase.
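The seen/unseen partition described above can be sketched as follows; the instance field names and the lowercase surface-form matching are illustrative assumptions of this example:

```python
def split_seen_unseen(train_instances, test_instances):
    """Partition test instances by whether either target term occurred
    in the training set.

    Each instance is a dict with keys "term1" and "term2" (field names
    are illustrative). "Seen" means at least one target term appeared
    in training; "unseen" means both terms are new.
    """
    train_terms = set()
    for inst in train_instances:
        train_terms.add(inst["term1"].lower())
        train_terms.add(inst["term2"].lower())
    seen, unseen = [], []
    for inst in test_instances:
        if inst["term1"].lower() in train_terms or inst["term2"].lower() in train_terms:
            seen.append(inst)
        else:
            unseen.append(inst)
    return seen, unseen
```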
Table 6 shows the accuracy of the different models, fine-tuned on the BioWiC35 training set, when tested on the seen and unseen subsets. As we can see, the models exhibit a significant decline in performance, i.e., between 5% and 13%, when classifying unseen instances. Interestingly, models demonstrate improved performance on the unseen data in the “abbreviation” group, aligning with the notion that abbreviations are prevalent across contexts and models may possess prior knowledge in this respect. Overall, the findings suggest that there is substantial room for improvement in this field, particularly as the performance of models decreases when encountering novel data.
Cross-domain analysis: We conducted additional experiments to assess the performance of language models when fine-tuned exclusively on data from the general domain, specifically WiC. The results indicate that all models experience a substantial decrease in performance when fine-tuned only with WiC data (Table 6). This highlights the importance of the training data provided by BioWiC35 in enhancing the ability of language models in the representation of different forms of concepts within the biomedical field. Furthermore, this suggests that the differences in terminology and linguistic patterns between the biomedical and general domains might be a reason why models fine-tuned on BioWiC exhibit superior performance.
Evaluating models’ upper bound: To assess whether state-of-the-art models have reached an upper bound on the BioWiC dataset, we leveraged two subsets of 100 instances from the BioWiC test set that were manually annotated by subject matter experts (see the cross-mapping validation section for more details). On the 50 instances annotated by both experts, we observed strong inter-annotator agreement (Cohen’s Kappa score = 0.84), confirming the quality of the dataset annotations. However, the best-performing model (Llama-2-70b) exhibited low agreement with the human annotators on this mutually annotated subset (Cohen’s Kappa scores of 0.35 and 0.36). The pattern of discrepancies between human and model annotations persisted across the two subsets of 100 instances (Cohen’s Kappa scores of 0.33 and 0.47 for annotators 1 and 2, respectively). These results highlight the substantial room for improvement of language models to represent contextualized biomedical terms.
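Cohen's Kappa, used above to quantify annotator and model agreement, corrects raw agreement for chance. A minimal sketch:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two label sequences over the same instances.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e the agreement expected by chance given each
    annotator's label distribution.
    """
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1:
        return 1.0  # both annotators constant and identical
    return (observed - expected) / (1 - expected)
```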
Usage Notes
The primary objective of this study is to develop a novel biomedical dataset, BioWiC35, introducing unique challenges for biomedical concept representation. The complexity of biomedical language, with its abundance of polysemous terms, abbreviations, and acronyms, highlights the need for models to accurately disambiguate the intended meanings of terms based on the context in which they appear. We propose that BioWiC can serve as a robust benchmark dataset, enabling NLP models to better understand the intended meaning of biomedical terms within their given textual context and to generate representations that precisely capture those intended meanings across different contexts. This enhanced contextual understanding is critical for several downstream NLP tasks in the biomedical domain, such as information retrieval, question-answering, and machine translation, where accurately interpreting the meaning of terms within their specific context is essential for optimal model performance51.
The proposed benchmark has certain limitations that should be taken into consideration. The breadth of coverage of concepts is rather limited as BioWiC35 only deals with a small subset of the concepts present in the biomedical domain, i.e., 5’000 CUIs out of 4.5 M CUI codes available in UMLS. Moreover, it may not be adequate for certain use cases that require a specific coverage of concepts, e.g., genomics and proteomics. Additionally, our benchmark is currently designed to work with medical documents written in English only. Lastly, it is a static benchmark, in the sense that it does not currently provide a seamless platform (i.e., web service) for users to contribute to it through crowd-sourcing. This limits the ability to keep the benchmark up-to-date and reflective of the latest developments in the biomedical domain. These limitations can be addressed in future versions of the benchmark.
Code availability
The entire process, including the development of the dataset35 and the execution of the experiments, was implemented using the Python programming language. The complete code and dataset are hosted on GitHub at: https://github.com/hrouhizadeh/BioWiC.
References
Detroja, K., Bhensdadia, C. & Bhatt, B. S. A survey on relation extraction. Intell. Syst. with Appl. 200244 (2023).
Shi, J. et al. Knowledge-graph-enabled biomedical entity linking: a survey. World Wide Web 1–30 (2023).
French, E. & McInnes, B. T. An overview of biomedical entity linking throughout the years. J. Biomed. Inform. 104252 (2022).
Yazdani, A., Proios, D., Rouhizadeh, H. & Teodoro, D. Efficient joint learning for clinical named entity recognition and relation extraction using Fourier networks:a use case in adverse drug events. In Akhtar, M. S. & Chakraborty, T. (eds.) Proceedings of the 19th International Conference on Natural Language Processing (ICON), 212–223 (Association for Computational Linguistics, New Delhi, India, 2022)
Naderi, N., Knafou, J., Copara, J., Ruch, P. & Teodoro, D. Ensemble of deep masked language models for effective named entity recognition in health and life science corpora. Front. research metrics analytics 6, 689803 (2021).
Copara, J. et al. Contextualized french language models for biomedical named entity recognition. In Actes de la 6e conférence conjointe Journées d’Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Atelier DÉfi Fouille de Textes, 36–48 (2020).
He, J. et al. An extended overview of the clef 2020 chemu lab: information extraction of chemical reactions from patents. In Proceedings of the CLEF 2020 conference (22-25 September 2020, 2020).
Donnelly, K. et al. SNOMED-CT: the advanced terminology and coding system for eHealth. Stud. Health Technol. Inform. 121, 279 (2006).
Consortium, U. et al. Uniprot: the universal protein knowledgebase in 2021. Nucleic acids research 49, D480–D489 (2021).
Erhardt, R. A., Schneider, R. & Blaschke, C. Status of text-mining techniques applied to biomedical text. Drug Discov. Today 11, 315–325 (2006).
Sung, M., Jeon, H., Lee, J. & Kang, J. Biomedical entity representations with synonym marginalization. In Jurafsky, D., Chai, J., Schluter, N. & Tetreault, J. (eds.) Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 3641–3650, https://doi.org/10.18653/v1/2020.acl-main.335 (Association for Computational Linguistics, Online, 2020).
Alexopoulou, D. et al. Biomedical word sense disambiguation with ontologies and metadata: automation meets accuracy. BMC Bioinformatics 10, 1–15 (2009).
Miftahutdinov, Z., Kadurin, A., Kudrin, R. & Tutubalina, E. Medical concept normalization in clinical trials with drug and disease representation learning. Bioinformatics 37, 3856–3864 (2021).
Tutubalina, E., Miftahutdinov, Z., Nikolenko, S. & Malykh, V. Medical concept normalization in social media posts with recurrent neural networks. J. biomedical informatics 84, 93–102 (2018).
Niu, J., Yang, Y., Zhang, S., Sun, Z. & Zhang, W. Multi-task character-level attentional networks for medical concept normalization. Neural Process. Lett. 49, 1239–1256 (2019).
Limsopatham, N. & Collier, N. Normalising medical concepts in social media texts by learning semantic representation. In Proceedings of the 54th annual meeting of the association for computational linguistics (volume 1: long papers), 1014–1023 (2016).
Mohan, S. & Li, D. Medmentions: A large biomedical corpus annotated with umls concepts. In Automated Knowledge Base Construction (AKBC) (2019).
Li, J. et al. Biocreative v cdr task corpus: a resource for chemical disease relation extraction. Database 2016 (2016).
Luo, Y.-F., Sun, W. & Rumshisky, A. MCN: a comprehensive corpus for medical concept normalization. J. biomedical informatics 92, 103132 (2019).
Basaldella, M., Liu, F., Shareghi, E. & Collier, N. COMETA: A corpus for medical entity linking in the social media. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 3122–3137 (Association for Computational Linguistics, Online, 2020).
Yazdani, A., Rouhizadeh, H., Alvarez, D. V. & Teodoro, D. DS4DH at #SMM4H 2023: zero-shot adverse drug events normalization using sentence transformers and reciprocal-rank fusion. arXiv preprint arXiv:2308.12877 (2023).
Navigli, R. Word sense disambiguation: A survey. ACM computing surveys (CSUR) 41, 1–69 (2009).
Moro, A., Raganato, A. & Navigli, R. Entity linking meets word sense disambiguation: a unified approach. Transactions Assoc. for Comput. Linguist. 2, 231–244 (2014).
Jimeno-Yepes, A. J., McInnes, B. T. & Aronson, A. R. Exploiting MeSH indexing in MEDLINE to generate a data set for word sense disambiguation. BMC Bioinformatics 12, 223 (2011).
Weeber, M., Mork, J. G. & Aronson, A. R. Developing a test collection for biomedical word sense disambiguation. In Proceedings of the AMIA Symposium, 746 (American Medical Informatics Association, 2001).
Pilehvar, M. T. & Camacho-Collados, J. WiC: the word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 1267–1273, https://doi.org/10.18653/v1/N19-1128 (Association for Computational Linguistics, Minneapolis, Minnesota, 2019).
Raganato, A., Pasini, T., Camacho-Collados, J. & Pilehvar, M. T. XL-WiC: A multilingual benchmark for evaluating semantic contextualization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 7193–7206, https://doi.org/10.18653/v1/2020.emnlp-main.584 (Association for Computational Linguistics, Online, 2020).
Wang, A. et al. Superglue: A stickier benchmark for general-purpose language understanding systems. Adv. neural information processing systems 32 (2019).
Loureiro, D. et al. TempoWiC: An evaluation benchmark for detecting meaning shift in social media. In Calzolari, N. et al. (eds.) Proceedings of the 29th International Conference on Computational Linguistics, 3353–3359 (International Committee on Computational Linguistics, Gyeongju, Republic of Korea, 2022).
Breit, A., Revenko, A., Rezaee, K., Pilehvar, M. T. & Camacho-Collados, J. WiC-TSV: An evaluation benchmark for target sense verification of words in context. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 1635–1645, https://doi.org/10.18653/v1/2021.eacl-main.140 (Association for Computational Linguistics, Online, 2021).
Singh, S., Subramanya, A., Pereira, F. & McCallum, A. Wikilinks: A large-scale cross-document coreference corpus labeled via links to Wikipedia. Technical Report UM-CS-2012-015 (University of Massachusetts, Amherst, 2012).
Miftahutdinov, Z. & Tutubalina, E. Deep neural models for medical concept normalization in user-generated texts. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 393–399, https://doi.org/10.18653/v1/P19-2055 (Association for Computational Linguistics, Florence, Italy, 2019).
Liu, F., Shareghi, E., Meng, Z., Basaldella, M. & Collier, N. Self-alignment pretraining for biomedical entity representations. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4228–4238, https://doi.org/10.18653/v1/2021.naacl-main.334 (Association for Computational Linguistics, Online, 2021).
Angell, R., Monath, N., Mohan, S., Yadav, N. & McCallum, A. Clustering-based inference for biomedical entity linking. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2598–2608, https://doi.org/10.18653/v1/2021.naacl-main.205 (Association for Computational Linguistics, Online, 2021).
Rouhizadeh, H. et al. A Dataset for Evaluating Contextualized Representation of Biomedical Concepts in Language Models. figshare https://doi.org/10.6084/m9.figshare.25611591.v2 (2024).
Loureiro, D. & Jorge, A. M. Medlinker: Medical entity linking with neural representations and dictionary matching. In European Conference on Information Retrieval, 230–237 (Springer, 2020).
Mohan, S., Angell, R., Monath, N. & McCallum, A. Low resource recognition and linking of biomedical concepts from a large ontology. In Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, 1–10 (2021).
Dogan, R. I., Leaman, R. & Lu, Z. NCBI disease corpus: a resource for disease name recognition and concept normalization. J. biomedical informatics 47, 1–10 (2014).
Sadvilkar, N. & Neumann, M. PySBD: Pragmatic sentence boundary disambiguation. In Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS), 110–114, https://doi.org/10.18653/v1/2020.nlposs-1.15 (Association for Computational Linguistics, Online, 2020).
Pennington, J., Socher, R. & Manning, C. D. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 1532–1543 (2014).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186, https://doi.org/10.18653/v1/N19-1423 (Association for Computational Linguistics, Minneapolis, Minnesota, 2019).
Liu, Y. et al. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
Clark, K., Luong, M.-T., Le, Q. V. & Manning, C. D. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555 (2020).
Lee, J. et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234–1240 (2020).
Alsentzer, E. et al. Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, 72–78, https://doi.org/10.18653/v1/W19-1909 (Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019).
Beltagy, I., Lo, K. & Cohan, A. SciBERT: A pretrained language model for scientific text. In Inui, K., Jiang, J., Ng, V. & Wan, X. (eds.) Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3615–3620, https://doi.org/10.18653/v1/D19-1371 (Association for Computational Linguistics, Hong Kong, China, 2019).
Johnson, A. E. et al. MIMIC-III, a freely accessible critical care database. Sci. Data 3, 1–9 (2016).
Reimers, N. & Gurevych, I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Inui, K., Jiang, J., Ng, V. & Wan, X. (eds.) Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3982–3992, https://doi.org/10.18653/v1/D19-1410 (Association for Computational Linguistics, Hong Kong, China, 2019).
Touvron, H. et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).
Hristea, F. & Colhon, M. The long road from performing word sense disambiguation to successfully using it in information retrieval: An overview of the unsupervised approach. Comput. Intell. 36, 1026–1062 (2020).
Frénal, K., Kemp, L. E. & Soldati-Favre, D. Emerging roles for protein s-palmitoylation in toxoplasma biology. Int. J. Parasitol. 44, 121–131 (2014).
Author information
Contributions
H.R. and D.T. conceptualized the study, and H.R., A.Y. and B.Z. implemented the codes for the creation and evaluation of the dataset. J.E. and C.G. performed human annotation. The manuscript was drafted by H.R., D.T. and I.N. and edited by A.B., A.Y. and N.N. All authors reviewed the final version.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Rouhizadeh, H., Nikishina, I., Yazdani, A. et al. A Dataset for Evaluating Contextualized Representation of Biomedical Concepts in Language Models. Sci Data 11, 455 (2024). https://doi.org/10.1038/s41597-024-03317-w