Introduction

In the modern era of big data and advancing health technologies, accurate classification and documentation of medical diagnoses and procedures are critical for effective healthcare management and data-driven decision-making. The International Classification of Diseases (ICD) coding system serves as a standardized framework for precisely categorising the wide range of medical conditions and interventions encountered in healthcare settings1. With the widespread adoption of electronic health records (EHRs), detailed patient information is now captured in unstructured clinical notes. These notes, such as discharge summaries, contain valuable patient data but are often lengthy and written in specialised medical language. Manual clinical coding in EHRs is thus time-consuming, labor-intensive, and prone to inconsistencies between healthcare providers2. Consequently, there is a growing need for automated approaches that can reliably extract information from clinical notes and accurately assign standardized ICD codes.

Natural language processing (NLP) methods hold promise for developing high-throughput systems to automate ICD coding from unstructured EHR text3. Recent studies have shown that pre-trained language models (PLMs), such as BERT4, Alpaca5, and Llama6, achieve great success in capturing domain-specific knowledge of medical language, leveraging the contextual information of sentences and adapting to different medical tasks7,8. These models have been widely employed in the research community for supervised learning tasks involving EHR data. For example, BERT-based PLMs like ClinicalBERT9 and BlueBERT10 have been fine-tuned and evaluated on medical diagnosis prediction from patient notes, analyzing unstructured clinical text to identify and classify reported symptoms, tests, and assessments into diagnostic codes. More recently, PLMs have also been adopted for automating ICD code assignment from EHRs11. Such models can be trained on notes labelled with relevant codes to learn associations between clinical language and standardized billing codes. Fine-tuned transformer models have achieved state-of-the-art results for multi-label ICD coding across various datasets, and their contextual representations capture ambiguities in medical language better than traditional NLP approaches.

While PLMs have shown their capabilities for learning generalisable representations of medical language, automated ICD coding remains highly challenging for several reasons12,13. Firstly, the label space of thousands of ICD codes presents modelling difficulties: predicting from such a large set of codes in a multi-label setting is considerably more complex than single-label categorization. Secondly, the label distribution in EHRs is extremely imbalanced, also known as the long-tail problem, where the majority of ICD codes have only a few positive examples; this makes it difficult to learn robust patterns for rare codes. Finally, the fragmented nature of EHR notes makes extracting signals for each code label difficult. Relevant text for a particular code may be scattered throughout a lengthy note, and identifying these scattered semantic signals in a unified manner is non-trivial.

Given the complex challenges posed by multi-label classification at scale, it is imperative to incorporate external knowledge, such as the taxonomic hierarchy and textual definitions of codes, into the automated ICD coding task. Specifically, the taxonomic hierarchy between ICD codes provides useful semantic signals, such as parent-child constraints and code similarity clusters, that can effectively reduce spurious predictions. Textual definitions of each ICD code offer important contextual details, and incorporating these as supplementary descriptors can help models better characterize the intended meaning of codes. In this work, we present an effective ICD coding model called ICDXML, which utilizes a label tree to integrate both structural and textual knowledge for the task of extreme multi-label classification. By jointly leveraging these complementary knowledge types, our framework aims to improve context-aware prediction of relevant codes from sparse and imbalanced EHR data. Experiments on the MIMIC-III datasets14,15 demonstrate the ability of our integrated approach to advance the state-of-the-art in automated ICD coding through the fusion of external domain knowledge into deep neural architectures.

Related work

Automated ICD coding, using NLP techniques to assign standardized codes to discharge summaries in EHRs, has been extensively studied and has gained significant research attention. Earlier approaches relied heavily on manually curated rules such as predefined inclusion and exclusion criteria16,17. However, these rule-based systems required extensive expert effort to engineer effective rules; they also lacked the flexibility to handle the variability of medical language and required constant updating.

Machine learning methods, such as logistic regression18 and support vector machines19, have been utilized to learn statistical associations between text and codes. However, these methods struggled to scale to the high dimensionality of thousands of ICD codes and the multi-label nature of assigning multiple codes per note.

Recent breakthroughs in deep learning have shown significant potential for automated ICD coding. Previous studies20,21 have utilized architectures such as Long Short-Term Memory networks (LSTMs), Convolutional Neural Networks (CNNs), and Graph Neural Networks (GNNs). These architectures have demonstrated an ability to identify higher-level abstract features within medical text and capture semantic context effectively, particularly across lengthy clinical notes. GNNs, in particular, can better capture the structural information and dependencies in medical graphs, allowing the model to learn relationships among symptoms, diseases, and related concepts22,23. Building on these architectures, researchers have also incorporated attention mechanisms15,24, enabling models to selectively focus on pertinent segments of the input text and enhance the precision of code predictions. Furthermore, the integration of label graphs allows for explicit modelling of the coding hierarchy and co-occurrence relationships between codes, contributing to more accurate code assignments25. Additionally, multitask learning has been explored as a beneficial approach for ICD coding26,27. By jointly training on multiple related tasks, such as ICD and Clinical Classification Software (CCS) code prediction, multitask learning schemes can effectively transfer ICD code association knowledge from the auxiliary task branch, leading to improved performance in ICD code prediction.

Another significant development is the adoption of transfer learning techniques, where transformer-based language models are pre-trained on large-scale medical corpora before being fine-tuned for specific ICD coding tasks9,28,29. This approach has gained popularity due to its capacity to leverage the knowledge contained within extensive medical datasets, ultimately leading to improved performance in automated ICD coding systems. Moreover, several studies have explored using pre-trained generative models instead of discriminative models by reframing text classification as a text generation problem30. The generated text can then be mapped to the corresponding ICD code.

ICDXML

Input representation

To represent the discharge summary, we begin by encoding the textual data with the PLM and integrating this representation with our proposed one-dimensional deformable convolutional structure. We then concatenate the features from the PLM with knowledge graph embeddings to enhance the quality of our representations. An overview of the input representation generation for discharge summaries is illustrated in Fig. 1.

Pre-trained language model encoding

One typical paradigm for text representation is to use PLMs to extract vector representations of text. Our model follows this design principle and further considers that the length of EHR texts often exceeds the maximum input length limit of models like BERT. Therefore, we use ClinicalLongformer31, a PLM for long texts pre-trained on medical corpora, to encode the discharge summaries and generate corresponding semantic representations. In previous research, the vector at the CLS position from the last layer of the PLM is commonly used as the representation of the entire input. However, recent research has shown that averaging the hidden states of the last and first layers can achieve better results. We conjecture that this is because the bottom hidden state retains more abundant textual feature information, while the top hidden state contains more advanced semantic abstractions; combining both therefore provides a more comprehensive text representation and enhances the performance of downstream tasks. Specifically, given an input token sequence \(t=\{\text {[CLS]}, t_1, t_2,..., t_n, \text {[SEP]}\}\), we obtain the contextualized hidden state representation at the l-th layer, \(\textbf{h}^l=[h_1^l, h_2^l,...,h_m^l]\), rather than only the CLS token, where m denotes the dimensionality of the hidden states. We average the last hidden state \(h_i^L\) and the first hidden state \(h_i^1\) to obtain the i-th element of the final encoding vector for the text:

$$\begin{aligned} h_i = \frac{1}{2}(h_i^L + h_i^1), \quad i = 1,2,\ldots ,m \end{aligned}$$

In this way, the representation integrates bottom-layer semantic features and high-level semantic abstractions, which will provide a high-quality text representation for downstream tasks.
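The following minimal sketch illustrates this first-last layer averaging with a Hugging Face-style encoder. The checkpoint identifier and surrounding details are illustrative assumptions rather than the exact training code.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative checkpoint; assumed to be the public Clinical-Longformer release.
tokenizer = AutoTokenizer.from_pretrained("yikuan8/Clinical-Longformer")
model = AutoModel.from_pretrained("yikuan8/Clinical-Longformer")

inputs = tokenizer("Patient admitted with chest pain; troponin elevated.",
                   return_tensors="pt", truncation=True)
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding layer; [1] and [-1] are the first and
# last Transformer layers, matching h^1 and h^L above.
h = 0.5 * (out.hidden_states[1] + out.hidden_states[-1])  # (1, seq_len, dim)
```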

Deformable convolutions for hidden states fusion

To enhance the representation capability of our model, we introduce the integration of deformable convolutional structures, which dynamically fuse hidden states from the PLM. This concept is rooted in the computer vision domain, where two-dimensional deformable convolutional networks were first introduced.

Figure 1 Input representation generation for the discharge summary. In the 1D deformable convolution block, we illustrate the sampling process using regular (lower) and deformable (upper) convolution, respectively. The dark grey points represent the sampling locations, while the red arrows indicate the offsets in the deformable convolution.

CNNs have achieved tremendous success in image processing by extracting local features through sliding convolution kernels over the input image. The stride, which determines how far the kernel moves at each step, is an important parameter. A fixed sampling grid ensures consistency between features extracted at different positions but makes it difficult to extract high-level semantic features. Deformable convolutions, proposed by Dai et al.32, were initially designed for computer vision to handle irregular objects by adjusting the kernel's sampling locations to fit the receptive field.

In natural language processing, CNNs have also been widely applied to tasks like text classification and named entity recognition, usually with a fixed sampling grid, which struggles to capture long-distance dependencies. To address this limitation, we introduce one-dimensional deformable convolutions to fuse the hidden states dynamically. Like their two-dimensional counterparts, one-dimensional deformable convolutions adaptively adjust kernel sampling, resulting in richer combinations and semantic representations. These dynamic convolutions are expected to enhance the discriminative capacity of the representations, which is likely to have a positive impact on downstream performance. A typical operation in a convolutional neural network can be expressed as:

$$\begin{aligned} o(p_i) = \sum _{p_j \in \mathcal {R}} w(p_j) * x(p_i + p_j) \end{aligned}$$

where \(o(p_i)\) represents the value of the output feature map at position \(p_i\), \(w(p_j)\) denotes the weight of the convolution kernel at relative position \(p_j\), \(x(p_i + p_j)\) denotes the value of the input feature map at position \(p_i + p_j\), and \(\mathcal {R}\) denotes the receptive field of the convolution kernel; e.g., for a 1D convolution with a kernel size of 3, \(\mathcal {R} = \{-1, 0, 1\}\). For deformable convolution, an offset \(\Delta p_j\) is added to each sampling location of the standard convolution operation:

$$\begin{aligned} o^\prime (p_i) = \sum _{p_j \in \mathcal {R}} w(p_j) * x(p_i + p_j + \Delta p_j) \end{aligned}$$

where \(\Delta p_j\) is learned from the input feature map through an additional convolutional layer and typically takes fractional values; hence, linear interpolation is required to compute \(x(p_i + p_j + \Delta p_j)\). In this way, the sampling locations of the convolutional kernel are no longer restricted to a fixed grid but can be deformed according to the content of the input feature map to obtain a more flexible representation. We use this one-dimensional deformable convolution to dynamically learn the fusion weights of the different layer states along the first dimension, i.e.

$$\begin{aligned} \textbf{h}^{\prime } = \textrm{DCN1d}(\textbf{H}; \textbf{w}) \end{aligned}$$

where \(\mathrm {DCN1d(\cdot )}\) is the one-dimensional deformable convolution operation, \(\textbf{H}\) is the set of hidden states from all PLM layers at the CLS position, i.e. \(\textbf{H}=\{\textbf{h}^1, \textbf{h}^2,..., \textbf{h}^L\}\), \(\textbf{h}^{\prime }\) is the new state after fusion with the deformable convolution, and \(\textbf{w}\) denotes the learnable offset parameters that control the dynamic range of the fusion.
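A self-contained sketch of the 1D deformable convolution is given below, assuming PyTorch. The module and tensor shapes are simplified illustrations of the operation defined above, not the exact production layer: a side branch predicts a fractional offset for each kernel tap, and sampling uses linear interpolation between neighbouring positions.

```python
import torch
import torch.nn as nn

class DeformConv1d(nn.Module):
    """Sketch of a 1D deformable convolution. Offsets are predicted per
    position and per kernel tap, then applied via linear interpolation."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.kernel_size = kernel_size
        # Offsets Δp_j are learned from the input itself (one per tap).
        self.offset_conv = nn.Conv1d(channels, kernel_size, kernel_size,
                                     padding=kernel_size // 2)
        # Combines the K sampled taps, playing the role of w(p_j).
        self.weight_conv = nn.Conv1d(channels * kernel_size, channels, 1)

    def forward(self, x):                          # x: (B, C, L)
        B, C, L = x.shape
        offsets = self.offset_conv(x)              # (B, K, L)
        base = torch.arange(L, device=x.device).float()                      # p_i
        rel = (torch.arange(self.kernel_size, device=x.device).float()
               - self.kernel_size // 2)                                      # p_j
        # Sampling positions p_i + p_j + Δp_j for every tap and position.
        pos = (base.view(1, 1, L) + rel.view(1, -1, 1) + offsets).clamp(0, L - 1)
        lo = pos.floor().long()
        hi = (lo + 1).clamp(max=L - 1)
        frac = pos - lo.float()
        # Gather neighbours and linearly interpolate along the length axis.
        lo_idx = lo.unsqueeze(1).expand(B, C, self.kernel_size, L).reshape(B, C, -1)
        hi_idx = hi.unsqueeze(1).expand(B, C, self.kernel_size, L).reshape(B, C, -1)
        x_lo = x.gather(2, lo_idx).view(B, C, self.kernel_size, L)
        x_hi = x.gather(2, hi_idx).view(B, C, self.kernel_size, L)
        sampled = x_lo + frac.unsqueeze(1) * (x_hi - x_lo)  # (B, C, K, L)
        return self.weight_conv(sampled.reshape(B, C * self.kernel_size, L))

# Example: fuse L=12 layer states of dimension 768 (layers as the length axis).
states = torch.randn(1, 768, 12)
fused = DeformConv1d(768)(states)                  # (1, 768, 12)
```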

Biomedical knowledge integration

Existing research has shown that injecting external knowledge into language models can greatly improve performance on lengthy text classification by introducing external cues while enhancing model sensitivity to keywords. A key challenge in this task is maintaining responsiveness to keywords without overly mixing them into abstract semantic information. To address this challenge, in addition to representations from the PLM, we also encode the discharge notes using pre-trained biomedical knowledge graph embeddings33 from Medical Subject Headings (MeSH)34 for knowledge injection. When merging the PLM representations with external biomedical knowledge representations, we note that the two, generated from a semantic space and a knowledge graph space respectively, form a heterogeneous combination.

Recent studies, such as BERT-flow35 and BERT-whitening36, have revealed the anisotropy of sentence embeddings generated by PLMs. Specifically, in the PLM vector space, low-frequency words are distributed sparsely, while high-frequency words are densely clustered. This contrasts with knowledge graph embeddings, which are typically isotropic, with uniform coverage of the vector space. The anisotropy arises from the PLM objective of learning predictive relationships between frequently co-occurring words. Therefore, simply concatenating the sentence representations from the PLM with biomedical knowledge graph embeddings combines non-overlapping spaces with very different geometric shapes. To integrate these heterogeneous representations effectively, we borrow techniques from multimodal factorized fusion2 to project the vectors into a shared space before combining information, i.e.

$$\begin{aligned} \textrm{MFB}(\textbf{h}^\prime , \textbf{v}) = \textbf{U} * (\textbf{P}_{h^\prime } * \textbf{h}^\prime ) \otimes (\textbf{P}_v * \textbf{v}) \end{aligned}$$

where \(\mathrm {MFB(\cdot )}\) is the Multi-modal Factorized Bilinear (MFB) operation, \(\textbf{v}\) is the feature from the biomedical knowledge graph space, \(\textbf{P}_{h^\prime }\) and \(\textbf{P}_v\) are projection matrices for \(\textbf{h}^\prime\) and \(\textbf{v}\), respectively, \(\otimes\) represents the outer product and \(\textbf{U}\) is the learnable weight matrix.
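The sketch below shows the canonical MFB computation that this equation denotes: both vectors are projected into a shared \(k \cdot o\)-dimensional space, combined multiplicatively, and sum-pooled over each factor of size k. The dimensions and names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MFBFusion(nn.Module):
    """Sketch of Multi-modal Factorized Bilinear fusion: project, multiply
    element-wise, then sum-pool over each factor of size k."""

    def __init__(self, dim_h, dim_v, factor_k=5, out_dim=512):
        super().__init__()
        self.k, self.o = factor_k, out_dim
        self.proj_h = nn.Linear(dim_h, factor_k * out_dim)  # P_h'
        self.proj_v = nn.Linear(dim_v, factor_k * out_dim)  # P_v

    def forward(self, h, v):                # h: (B, dim_h), v: (B, dim_v)
        joint = self.proj_h(h) * self.proj_v(v)             # element-wise product
        return joint.view(-1, self.o, self.k).sum(dim=2)    # (B, out_dim)

# Example: fuse a 768-d PLM vector with a 200-d BioWordVec-style vector.
fused = MFBFusion(768, 200)(torch.randn(2, 768), torch.randn(2, 200))
```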

Output prediction

Our approach performs multi-label classification by encoding the input discharge summary s into a representative vector \(\textbf{s}\) and leveraging a tree-based classifier for large output spaces. For small label sets, independent classifiers are trained on a per-label basis. However, when the number of labels \(|\mathcal {Y}|\) exceeds the extreme classification threshold, we construct a hierarchical label tree T over the output space \(\mathcal {Y}\) to exploit correlations and dependencies between ICD codes. The threshold was chosen from a group of candidate values, determined by taking into account the number of potential labels per text; the candidate that achieved the best performance on the validation set was used for the test set. A specialized model \(f_\theta\) with tree-structured classifiers is trained to capture these relationships. During prediction, \(f_\theta\) takes the input representation \(\textbf{s}\) and label hierarchy T, traversing the tree in a top-down manner to output the predicted codes \(\hat{y}\). This allows the model to scale effectively to large label spaces while retaining the output space structure. The overall process is formalized in Algorithm 1.

Algorithm 1 Multi-Label Text Classification

We will introduce the bottom-up label tree construction and top-down hierarchical label prediction in the following subsections.

Bottom-up probabilistic label tree construction

Currently, the best solutions for multi-label text classification are mostly label-wise and tractable only on small label sets: dedicated classification heads for each label are infeasible with extremely large label sets. Therefore, inspired by the study of You et al.37, we adopt an extreme multi-label classification approach, computing text representations against a probabilistic label tree to predict probabilities for each label.

Unlike existing extreme classification datasets, ICD codes have a definite underlying tree structure, with certain diseases belonging to the same entry. However, the top-level ICD entries lack direct relationships with one another, so the hierarchy forms a forest rather than a single tree. Hence, we propose a bottom-up approach (illustrated in Fig. 2) to construct a probabilistic label tree that incorporates the ICD forest into one tree.

Figure 2 Overall workflow for the probabilistic label tree construction.

In the initial step, we retrieve the text descriptions of each ICD code from the MIMIC-III database and obtain vector representations of these descriptions using sentence transformers. We treat ICD codes as the terminal, or leaf, nodes of our taxonomy. Parent nodes are introduced to encapsulate ICD codes that share a common parent description; notably, individual diseases without parent nodes are treated as self-contained parents in this hierarchical structure. In the following step, we initiate hierarchical clustering over all parent nodes. The clustering technique employed is the Iterative Self-Organizing Data Analysis Technique (ISODATA)38, using cosine similarity as the distance metric. Hyperparameters are controlled to strike a balance between the number of nodes beneath a parent and the overall breadth and depth of the resulting tree. Finally, the hierarchical structure is refined iteratively until the predetermined target tree depth is reached; at each iteration, parent-child relationships and the overall hierarchical arrangement are adjusted to attain the desired depth. The predetermined depth follows the common design logic for classification trees of being shallow and wide37,39; in this case, the tree depth was set to 4.
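The following simplified sketch illustrates one bottom-up clustering pass under stated assumptions: ISODATA clustering is replaced by k-means over L2-normalized embeddings (an approximation to cosine-similarity clustering), and random vectors stand in for the sentence-transformer outputs; the layer widths are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for sentence-transformer embeddings of parent-node descriptions.
rng = np.random.default_rng(0)
parent_emb = rng.standard_normal((500, 768))
parent_emb /= np.linalg.norm(parent_emb, axis=1, keepdims=True)  # unit norm

def cluster_layer(emb, n_clusters):
    """Group the current layer's nodes; centroids become the next layer."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(emb)
    return km.labels_, km.cluster_centers_

# Repeat on centroids until the target depth (4 in the paper) is reached.
layers, emb = [], parent_emb
for width in (64, 16):               # example widths for two internal layers
    labels, emb = cluster_layer(emb, width)
    layers.append(labels)
```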

Top-down hierarchical label prediction

In the prediction stage, improving computational efficiency is critical since label tree-based inference entails large-scale matrix operations. The hierarchical structure of the labels can, however, be leveraged to enhance predictive efficiency. Our model implements a dedicated fully connected classifier head for each layer of the label tree, which further fuses the knowledge inputs and mixed text representations from the PLM into a consolidated vector space adapted to that layer. Fixing the label tree weights during prediction improves adaptability to the distinct levels of the hierarchy.

We first utilize a classifier head to generate text representations for the discharge summary and obtain probabilities for each possible node at the top layer. To limit the search space when moving to the next layer, we only consider the top-C highest probability nodes from the current layer as candidates for expanding subtrees from the overall hierarchy. This recursive selection of the most likely child nodes prunes unlikely branches. We execute this node prediction and candidate selection recursively, traversing down the tree until reaching the deepest layer of leaf nodes. Final code predictions are generated by thresholding the predicted probabilities at the leaves.

This greedy top-down traversal enables structured prediction over the output space by recursively focusing on the most probable child subtrees. Pruning subtrees with low predicted probability at higher layers narrows the search space, allowing scalability over large output hierarchies. The pseudo-code in Algorithm 2 summarizes prediction with the constructed label tree.

Algorithm 2 Prediction with Label Tree
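The sketch below captures this traversal under assumed node and classifier-head interfaces (a `children` list and `code` attribute per node, one callable head per layer returning child probabilities); it illustrates the beam-pruned search, not the exact implementation.

```python
def predict_with_tree(s, root, layer_heads, C=20, threshold=0.5):
    """Greedy top-down traversal with top-C pruning at every layer.
    layer_heads[l](s, nodes) is assumed to return one probability per
    candidate child node given the text representation s."""
    frontier = [(root, 1.0)]
    for head in layer_heads:                       # one head per tree layer
        candidates = []
        for node, p in frontier:
            for child, q in zip(node.children, head(s, node.children)):
                candidates.append((child, p * q))  # chain path probability
        # Prune: keep only the top-C highest probability nodes.
        frontier = sorted(candidates, key=lambda t: -t[1])[:C]
    # Leaves reached; threshold their probabilities to emit ICD codes.
    return [node.code for node, p in frontier if p >= threshold]
```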

Results and evaluation

Experimental setup

Dataset

Our study utilizes the MIMIC-III clinical text corpus containing de-identified discharge summaries from real hospitalizations, annotated with ICD-9 codes by medical experts. We preprocess the text in the following steps: first, de-identification tokens inserted to protect patient privacy are removed; next, all characters other than punctuation and alphanumeric content are replaced with white space to standardize the format and improve model processability; finally, extraneous white space is stripped to streamline the text sequences, enhancing computational and memory efficiency for downstream models.
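A sketch of these cleaning steps is shown below; the `[** ... **]` pattern is MIMIC-III's de-identification convention, while the exact character whitelist is an assumption.

```python
import re

def preprocess(note: str) -> str:
    """Sketch of the described cleaning pipeline."""
    note = re.sub(r"\[\*\*.*?\*\*\]", " ", note)    # drop de-id tokens
    note = re.sub(r"[^A-Za-z0-9\s\.,;:!\?'\-\(\)]", " ", note)  # keep alnum + punct
    note = re.sub(r"\s+", " ", note).strip()        # collapse whitespace
    return note

print(preprocess("Pt is a [**Age over 90**] y/o male w/ CHF."))
```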

Given computational requirements, earlier approaches truncated summaries to 4,000 words. We employ ClinicalLongformer31, which uses token-level inputs, enabling longer sequence modelling. Summaries are truncated at 8,192 tokens, which we empirically observed to approximate the 4,000-word budget at a roughly 2:1 token-to-word ratio. Since ICD-9 codes relate to subjective note sections, we retain relevant segments of overlong summaries, discarding non-salient discharge details. This comprehensive dataset is termed MIMIC-III-FULL. For predicting frequent codes, the data is filtered to samples with at least one of the 50 most common labels, using standard splits for comparison. This subset is denoted MIMIC-III-50. Descriptive statistics are tabulated in Table 1.

Table 1 Statistics of datasets

Implementation details

To inject external knowledge into the model, we utilize word-level embeddings from BioWordVec33 for keyword representation. These biomedical knowledge graph embeddings are concatenated with the representations from the pre-trained language model to provide additional external cues. The BioWordVec embeddings contain complementary knowledge about term associations beyond what is captured solely from the discharge summary text. Moreover, to construct the label tree for prediction, we encode the textual ICD code descriptions using BioSimCSE-BioLinkBERT-BASE40. During the top-down search on the label tree, the number of highest probability nodes (C) retained from each layer is set to 20. To encode ICD codes and descriptions, sample text is constructed as follows: “The ICD number is [ICD Id]. The content is [ICD Text Description]”, where [ICD Id] represents the numeric disease code and [ICD Text Description] denotes the textual definition and description of that particular code. This encoding scheme concatenates the unique identifier with the associated descriptive content for each ICD entity.
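The sketch below instantiates this template with sentence-transformers; the Hugging Face model id for BioSimCSE-BioLinkBERT-BASE is an assumption, and the code/description pair is an illustrative example.

```python
from sentence_transformers import SentenceTransformer

# Assumed public checkpoint id for BioSimCSE-BioLinkBERT-BASE.
encoder = SentenceTransformer("kamalkraj/BioSimCSE-BioLinkBERT-BASE")

def encode_icd(icd_id: str, description: str):
    """Build the template text for one ICD code and embed it."""
    text = f"The ICD number is {icd_id}. The content is {description}"
    return encoder.encode(text, normalize_embeddings=True)

vec = encode_icd("428.0", "Congestive heart failure, unspecified")
```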

We trained our models using PyTorch 1.9.0 and the Transformers library 4.11.3. All experiments were conducted on a cluster of 3 NVIDIA V100 GPUs, each with 32GB of memory. We employed Distributed Data Parallel (DDP) training to optimize resource utilization and accelerate the training process. We demonstrate the hyperparameters and training settings for fine-tuning in Table 2.

Table 2 Hyperparameters and training settings

We employed dynamic padding to the longest sequence in each batch to optimize computational efficiency. Layer Normalization was applied after each Transformer block to stabilize training. For newly introduced parameters, we used a truncated normal distribution (mean=0, std=0.02) for initialization, while retaining pre-trained weights for the base model. To ensure the robustness of our results, all experiments were repeated 10 times with different random seeds. The reported scores represent the mean of these runs, with standard deviations provided in parentheses.
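As a brief sketch of the initialization scheme, assuming PyTorch modules for the newly added layers (the pre-trained backbone is left untouched by applying this only to new modules):

```python
import torch.nn as nn

def init_new_params(module):
    """Truncated normal init (mean=0, std=0.02) for newly added layers."""
    if isinstance(module, (nn.Linear, nn.Conv1d)):
        nn.init.trunc_normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

classifier_head = nn.Linear(768, 50)   # example newly introduced layer
init_new_params(classifier_head)
```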

Baselines

  • CAML15 is a model that uses a convolutional neural network to encode the document text and a label-wise attention mechanism to generate label-specific representations. It also leverages the textual descriptions of ICD codes to improve the performance of rare labels.

  • MultiResCNN20 is a model that encodes the free text with a multi-filter residual convolutional neural network and applies a label code attention mechanism to enable each ICD code to attend to different parts of the document. It also uses a multi-label focal loss to handle the class imbalance problem.

  • JointLAAT41 is a model that employs a label attention mechanism (LAAT) to learn label-specific representations of the clinical text for each ICD code. Its joint variant adds a hierarchical joint learning mechanism that also predicts the coarse-grained first three characters of each ICD code, helping to deal with the imbalanced class issue.

  • KEPTLongformer42 is a model that enhances the Longformer model with medical knowledge from the Unified Medical Language System (UMLS) knowledge graph. It uses a hierarchical self-alignment pre-training method to learn from the hierarchy, synonym, and abbreviation relations in the knowledge graph. It also uses focal loss to improve the performance of rare labels.

  • EffectiveCAN13 is a model that uses a deep convolution-based encoder with squeeze-and-excitation networks and residual networks to aggregate the information across the document and learn meaningful document representations. It also uses multi-layer and sum-pooling attention to extract the most informative features from these representations. It combines binary cross-entropy loss and focal loss to improve performance for rare labels.

  • ICDBigBird43 is a model that employs BigBird44, a transformer-based encoder, to deal with the lengthy discharge summary. It leverages Graph Convolutional Network (GCN) to extract the relations between ICD codes and construct contextual ICD embedding.

  • MARN27 is a model that employs multitask learning to facilitate knowledge sharing between different coding branches, enabling the capture of code associations. Furthermore, it includes a recalibrated aggregation module composed of sequential convolutional blocks to extract high-level semantic features, which helps mitigate noise and process lengthy documents effectively.

  • Generation with Prompt (GP)30 is a generative model that generates free-text diagnoses and procedures based on the Subjective, Objective, Assessment and Plan (SOAP) structure, a common medical documentation framework used by physicians. Rather than directly predicting ICD codes, the model generates lower-dimensional text descriptions, which are then used to infer the ICD codes.

Table 3 Results on the MIMIC-III-50 and MIMIC-III-FULL datasets compared with baselines. We conducted the experiments on our model 10 times, each time using different random seeds for initialization, and report the mean and standard deviation of each metric.

Overall performance

We present the results for the common disease code assignment (MIMIC-III-50) task and the full disease code assignment (MIMIC-III-FULL) task in Table 3. Our model is comparable to the state-of-the-art model, KEPTLongformer, with only a minor performance gap. However, it is important to note that KEPTLongformer is restricted to datasets with precisely 50 labels, limiting its real-world applicability; its reported results on MIMIC-III-FULL are obtained by first using a separate classification model, the Multiple Synonyms Matching Network (MSMN)45, to reduce the label space to 50 codes before re-ranking with KEPTLongformer. A similar approach is employed in the GP method, which re-ranks the generated ICD codes using a discriminative model on MIMIC-III-FULL. In contrast, our proposed model is designed for generalized application without constraints on the size of the output space, so a direct comparison with KEPTLongformer and GP may be unfair given the difference in task setting and intended use cases. Compared to EffectiveCAN and JointLAAT, which share our goal of broad applicability, our approach demonstrates a competitive advantage on the full code assignment task. Moreover, the comparison with ICDBigBird indicates that the proposed pre-trained label tree captures the structural relationships of ICD codes better than the GCN used in ICDBigBird. Additionally, our method delivers results similar to those of MARN, the multi-task learning framework, but is more efficient as it does not rely on auxiliary information derived from other tasks during training. Overall, while slightly trailing the heavily constrained KEPTLongformer in accuracy, our method achieves state-of-the-art results among models capable of scalable, real-world clinical coding.

Ablation study

Table 4 Ablation study with Deformable Convolution Hidden State Fusion (DCF), Knowledge Graph Embedding (KGE) and Pre-trained Label Tree (PLT).

Our proposed model can be characterized as a combination of three architectural components: (1) Deformable Convolution Hidden State Fusion, the fusion of hidden states with 1D deformable convolutional layers; (2) Biomedical Knowledge Integration, the injection of external knowledge; and (3) Probabilistic Label Tree, prediction based on a probabilistic label tree. The ablation experiments presented in Table 4 quantify the contribution of each component.

The deformable convolutional layers enable more flexible vector representations compared to standard approaches by adaptively aggregating hidden states. Incorporating external knowledge, such as annotations and ontologies, grounds the model in real-world facts and relationships. Fusing information from multiple modalities with these supplementary cues guides the model to improve predictive performance. Organizing the output space into a probabilistic label tree introduces a structure that captures hierarchies and dependencies among codes. Compared to direct prediction, the tree decoder is better equipped to handle multi-label prediction tasks by modelling inter-code relationships. Without deformable fusion, performance decreases due to less effective integration of the hidden state inputs from multiple encoders. Ablating external knowledge similarly degrades results, as a key inductive bias guiding predictive coherence is removed. Using a flat classifier instead of the tree decoder also impacts accuracy by disregarding output space structure. The combination of all three components—deformable convolutional fusion, external knowledge injection, and the probabilistic label tree decoder—yields the optimal model architecture with improved results.

Deformable convolutions for hidden state fusion

A key focus of this work is obtaining effective text representations from pre-trained models. As evidenced in the experimental results, utilizing more dynamic representation techniques can yield more flexible embeddings compared to prevalent approaches like last-first-avg pooling. By emphasizing critical token-level responses during classification, these enhanced text encodings improve performance. Rather than succinctly summarizing the entire sequence, the proposed method generates contextualized representations tailored to each target. Masked pooling selectively focuses on pertinent words, while the deformable convolution adaptively aggregates important fine-grained signals spread throughout the text. In contrast, conventional strategies like last-first-avg produce static condensed embeddings, discarding substantial semantic information. The resulting generic summaries lack the selectivity to highlight relevant local interactions critical for accurate prediction. By extracting representations specialized to each target’s context, our approach better models fine-grained linguistic phenomena while retaining global semantics. The representations distil the most salient signals, providing improved cues for inference compared to fixed summarizations. The gains over established baselines demonstrate the advantages of dynamic context-aware encodings.

Injecting external knowledge through multimodal concatenation

This work has two main aims: (1) to emphasize critical keywords by incorporating low-level word vectors; and (2) given the proven vector space differences between knowledge graph embeddings and PLM embeddings, to adopt a multimodal concatenation approach. The results show that this fusion method improves model performance. Prior work has shown that PLMs largely discard surface word identity and instead encode meaning holistically based on context; knowledge graph embeddings can therefore retain additional lexical knowledge. The gains over single-source models demonstrate the complementary nature of multimodal encoding. The approach is straightforward yet effective at aggregating diverse feature perspectives, and further work could investigate more sophisticated fusion techniques beyond simple concatenation. Overall, the results highlight the utility of fusing multiple representation types, rather than relying solely on pre-trained features, for enriched encoding. The enhanced embeddings improve downstream performance by consolidating signals from both granular lexical and holistic semantic sources.

Probabilistic label tree construction

The probabilistic label tree provides several advantages that improve model performance on multi-label classification tasks. By encoding output dependencies, the tree decoder better handles label co-occurrence and hierarchical relationships compared to flat per-class classifiers. Specifically, the tree incorporates intrinsic structure by grouping related codes into branches, enabling the expression of refined conditional dependencies between labels that independent classifiers miss. This modelling of correlations facilitates the sharing of statistical strengths between related classes and provides useful inductive biases that guide coherent joint predictions based on observed co-occurrence patterns. Additionally, the hierarchical organization propagates signals between coarse-grained parent codes and fine-grained children codes, allowing smoothing of predictions along the directed acyclic graph structure. Computationally, efficiency is improved by pruning unlikely subtrees and beam search, focusing on promising label subsets guided by output space patterns. In summary, by probabilistically modelling output relationships, the tree decoder produces medically coherent predictions aligned with expert knowledge. The explicit encoding of inter-dependencies within the label space allows robust multi-label classification.

Limitations

While the proposed model achieves strong performance on medical code assignment, certain limitations should be noted: (1) the deformable convolution fusion technique, while more flexible than standard approaches, may still lack the representational power to fully capture complex semantic interactions. More advanced fusion methods could be explored; (2) the external knowledge injection is limited to the basic concatenation of supplementary features. Investigating learned integration or attention mechanisms may further enhance multimodal synergy; (3) the label tree decoder imposes an intrinsic output structure that may not fully align with the optimal data-driven dependencies. Allowing end-to-end learning of output relationships could potentially improve results. Inference is also slowed by requiring sequential tree decoding.

Conclusion and future work

Overall, this work presents a novel approach for automated ICD coding of clinical text, combining deformable convolutional fusion, external knowledge injection, and probabilistic label tree decoding. The results demonstrate state-of-the-art performance on the MIMIC-III benchmark, validating the effectiveness of the proposed techniques. However, limitations exist that could be addressed in future work: (1) The representation fusion could be enhanced with more sophisticated deep learning approaches; (2) Multimodal integration could also utilize mechanisms that are learned beyond basic concatenation; (3) Finally, directly learning the output space dependencies in an end-to-end manner could potentially improve upon the predefined tree structure.