Introduction

Complete and transparent reporting in biomedical publications is critical for assessing the validity of research findings, promoting scientific integrity, and facilitating evidence-based decision-making in patient care and health policy1,2,3. However, poor reporting has been a persistent issue, leading to potential biases and difficulties incorporating results into meta-analyses, complicating replication efforts, and ultimately undermining the trustworthiness of biomedical research2,4,5. For example, studies of clinical trial publications indicated that 40–89% of them lacked adequate descriptions of interventions, making replication in subsequent trials infeasible2. Other surveys suggested that most studies had at least one primary outcome that was changed, introduced, or omitted during the course of the research2. Reporting guidelines have been developed to provide researchers with a minimum list of essential information that they need to report for readers to clearly understand study methods and findings5,6,7,8,9. They also aim to facilitate the replication of study procedures and improve the reliability of published research. While reporting guidelines have been endorsed by many high-impact journals10, adherence to them remains inadequate2,11,12.

Randomized controlled trials (RCTs), when designed and conducted rigorously, remain the most robust method to determine the effectiveness of an intervention13. The CONSORT 2010 Statement6,13 is a reporting guideline for RCT results publications and consists of a checklist and a participant flowchart. The checklist includes 25 items considered the minimum information needed to understand RCTs (e.g., outcomes, randomization, masking, harms). While CONSORT has been endorsed by many journals, publishers, and editorial organizations, systematic studies of current practices show poor reporting even in well-conducted RCTs12,13,14. Some studies have shown more complete reporting of CONSORT items over time15.

Adherence to CONSORT, and to reporting guidelines more broadly, remains low, partly because journal endorsement generally does not entail enforcement or verification. CONSORT implementation, where RCT submissions are scrutinized by journal editors for compliance before peer review, has been shown to improve reporting quality16,17. However, manual screening is labor-intensive and time-consuming for journal editors and staff. Automated screening, based on natural language processing (NLP) and machine learning approaches, could reduce the burden of manual checking, streamline the peer review process, and contribute to better reporting quality18,19,20.

In prior work, we developed a corpus of RCT result publications annotated with CONSORT checklist items (CONSORT-TM)21 and reported NLP models for recognizing the CONSORT items related to methodology (17 items)15,21,22. In this study, we extend our work by training and validating NLP models for the full CONSORT checklist at fine granularity (37 items). Our main contributions are as follows:

  1. We develop and evaluate the first NLP models targeting automatic recognition of all CONSORT checklist items at fine granularity.

  2. We compare different input representations and features for the task (context size, section information, sentence position).

  3. We fine-tune a GPT-based model (BioGPT23) to study whether it confers any benefits over models based on the BERT architecture24.

  4. We assess in-context learning using GPT-4 for the task.

  5. To address the limited data size and label imbalance, we assess the ability of GPT-4 to generate useful training instances for this task and compare it to other data augmentation methods (Easy Data Augmentation (EDA)25, UMLS-EDA26).

Related work

Text classification in RCT articles

Most NLP research on RCT publications has focused on classifying sentences using the PICO framework (Population, Intervention, Comparator, Outcome) to aid the systematic review process and evidence-based medicine27,28,29,30,31,32,33,34. Other research has focused on automating risk of bias assessment using text classification35,36,37 and rhetorical classification of medical abstracts (e.g., Objective, Methods)38,39,40. Research on other key characteristics is less common, although methods have been reported for identifying study design29,30,41, sample size28,37,41, statistical methods42, and limitations43. Approaches range from rule-based methods in early work27,42 to (semi-)supervised machine learning in later work28,31,35,43, including deep learning approaches in recent years32,33,34,38,39,40,41.

We presented CONSORT-TM, to our knowledge the most comprehensive corpus of RCT publications annotated with trial characteristics21. We also trained NLP models for recognizing methodology items. A BioBERT-based model44 outperformed rule-based and traditional machine learning approaches. We used our best model to study reporting trends in more than 176,000 RCT publications published between 1966 and 2018, which showed an improvement in methodology reporting over time while also highlighting shortcomings in the reporting of most items15.

Generative large language models for biomedical literature mining

Generative large language models (LLMs) based on the Transformer architecture45 (e.g., the GPT family46,47, PaLM48, LLaMA49) have shown remarkable language generation capabilities and are increasingly applied to NLP tasks in the general domain50 and in the biomedical domain51,52. Domain-specific LLMs for the biomedical domain have also been trained (e.g., BioGPT23, Med-PaLM51). Prompt engineering for specific tasks has become an effective strategy for leveraging the in-context learning abilities of LLMs52. In the biomedical domain, fine-tuned BioGPT23 has shown superior performance to BERT-based models in document classification, while the performance of GPT models in zero- or one-shot settings has been found to trail that of fine-tuned BERT models53. Most relevant to this work, a recent study used GPT-3.5 to check RCT reports in sports medicine and exercise science for adherence to 9 CONSORT checklist items, reporting accuracy in the range of 70–100%54. That study is narrower in scope than ours, focusing on article-level binary decisions for the 9 items rather than sentence classification.

Materials and methods

In this section, we first briefly describe the dataset used in this study (CONSORT-TM21). Next, we provide the details of the NLP models, including data augmentation and experimental settings. Lastly, we discuss the evaluation of NLP models. A high-level overview of our study is illustrated in Fig. 1.

Fig. 1

A high-level overview of our study, including NLP model training and evaluation.

Dataset

The CONSORT-TM corpus21 consists of 50 RCT publications annotated at the sentence level with 37 CONSORT checklist items. It contains a total of 10,709 sentences, 4,845 (45%) of which are annotated with 5,246 labels (i.e., it is a multi-label corpus). Each article contains an average of 27.5 of the 37 fine-grained checklist items. In this work, we exclude the checklist item Background (2a), because it is a broad category that virtually all papers report and checking its reporting was deemed unnecessary; we therefore focus on the remaining 36 items. We provide the CONSORT checklist items and their descriptions in Supplementary File Table S1.

PubMedBERT fine-tuning

In prior work on CONSORT methodology reporting21,22, we fine-tuned the BioBERT model44. In more recent work15, we substituted the BioBERT model with PubMedBERT55, which has shown better performance in many biomedical NLP tasks. We continue to use PubMedBERT for multi-label sentence classification in this study. We excluded items 1a (whether the study is indicated as randomized in the title) and 1b (whether the abstract is structured) from the sentence classification model, because these items are article-level, unlike the rest of the items, and we use simple rules to recognize them (see below).

In previous work, we represented each input sentence as the concatenation of its enclosing section header and the sentence text. Here, we incorporate more contextual information into classification by also taking into account the preceding and the following sentence, based on the observation that many CONSORT items are reported over several contiguous sentences (i.e., zones), and additional information from neighboring sentences could help classification. The input representation is illustrated in Fig. 2. It consists of three sentences (preceding, target, trailing) delimited by special [SEP] tokens and prepended with the [CLS] token, following earlier work56,57. Each sentence is prepended with the list of (nested) section headers associated with it (e.g., Methods Patients [SENTENCE]). The [CLS] token representation generated by the PubMedBERT encoder is fed into a fully connected layer, and the sigmoid function is used for multi-label classification.
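As a concrete illustration, the sketch below shows one way to assemble this input and the sigmoid multi-label head. It is our simplified reconstruction, not the authors' released code; the helper names build_input and ConsortClassifier are ours, and the checkpoint path assumes the microsoft/ namespace on the HuggingFace Hub.

```python
# Illustrative sketch (not the authors' released code) of the input representation and
# multi-label head described above.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def build_input(prev_sent, target_sent, next_sent, headers):
    # Prepend the nested section headers (e.g., "Methods Patients") to each sentence
    # and delimit the three sentences with [SEP]; the tokenizer adds [CLS] and the
    # final [SEP] automatically.
    prefix = " ".join(headers)
    text = f" {tokenizer.sep_token} ".join(
        f"{prefix} {s}" for s in (prev_sent, target_sent, next_sent)
    )
    return tokenizer(text, truncation=True, max_length=512, return_tensors="pt")

class ConsortClassifier(nn.Module):
    def __init__(self, num_labels=34):  # 36 items minus the article-level 1a and 1b
        super().__init__()
        self.encoder = AutoModel.from_pretrained(MODEL_NAME)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, **inputs):
        cls = self.encoder(**inputs).last_hidden_state[:, 0]  # [CLS] representation
        return torch.sigmoid(self.classifier(cls))            # multi-label scores
```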

Fig. 2

Experimental flow including PubMedBERT fine-tuning, data augmentation strategies, few-shot prompting, and BioGPT fine-tuning. The PubMedBERT model takes sentences surrounding the target sentence into consideration. The prediction is made on the [CLS] token representation.

We also leveraged sentence position in the article as an additional feature, based on the observation that checklist items tend to be discussed in a predictable order in an article. For example, the first few sentences of the Methods section often discuss item 3a (Trial Design). We concatenated a sentence position embedding with the sentence representation to incorporate this information. We experimented with absolute and relative positions. We used an embedding layer to encode absolute position. We converted relative position (a continuous value) into a categorical value by first creating 10 bins (0–0.1, 0.1–0.2, etc.) and then used an embedding layer to encode it.
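A minimal sketch of the relative-position feature follows, assuming the position embedding is concatenated with the [CLS] vector just before the classification layer; the embedding dimension of 16 is our assumption.

```python
# Sketch of the relative-position feature: the position is binned into 10 intervals
# (0-0.1, 0.1-0.2, ...) and its embedding is concatenated with the sentence ([CLS])
# representation before classification. pos_dim is an assumed value.
import torch
import torch.nn as nn

class PositionAwareHead(nn.Module):
    def __init__(self, hidden_size, num_labels=34, num_bins=10, pos_dim=16):
        super().__init__()
        self.pos_embedding = nn.Embedding(num_bins, pos_dim)
        self.classifier = nn.Linear(hidden_size + pos_dim, num_labels)

    def forward(self, cls_vector, sent_index, num_sentences):
        rel = sent_index.float() / num_sentences.float()  # relative position in [0, 1]
        bins = torch.clamp((rel * 10).long(), max=9)      # map to bins 0..9
        combined = torch.cat([cls_vector, self.pos_embedding(bins)], dim=-1)
        return torch.sigmoid(self.classifier(combined))
```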

The CONSORT checklist is organized by the sections in which the items are expected to be reported. We examined whether models trained on specific sections and the items specific to those sections could lead to better performance than a single model trained on full articles and all labels. For these experiments, we created models specific to the Methods, Results, and Discussion sections. We did not create an Introduction-specific model, because we consider only one relevant label in that section (2b (Objectives)).

Data augmentation

CONSORT-TM is relatively small, and some checklist items are infrequently reported (e.g., 6b (Changes to Outcomes)). This has led to poor performance for rare items in previous work21,22. In this study, we leveraged the text generation capabilities of LLMs to improve the quality of data augmentation and compared it to simpler approaches we explored in previous work (EDA25, UMLS-EDA26). Our goal was to assess whether this type of data augmentation could improve model performance for rare labels and enhance generalizability.

Prompt-based augmentation using GPT-4

GPT-4 can generate fluent, human-like text, even about complex topics like medicine58. Previous work has demonstrated that GPT-based models can be used for data augmentation to improve a model’s performance; for example, AugGPT reports a framework that uses ChatGPT to rephrase existing text instances59.

We adapted AugGPT59 to paraphrase the instances of rare labels in our corpus. In addition, we used GPT-4 to generate completely new sentences based on label descriptions, referred to here as generative instances. For paraphrased instances, we used the labels of the original instance. For generative instances, we applied the label of the criterion description used within the prompt.

We augmented data for categories that have fewer than 100 samples in the entire corpus. These categories are: 3b (Changes to Trial Design), 6b (Changes to Outcomes), 7b (Interim Analyses and Stopping Guidelines), 9 (Allocation Concealment Mechanism), 11b (Similarity of Interventions), 12b (Statistical Methods for Other Analyses), 14b (Trial Stopping) and 21 (Generalizability). To better isolate the effect of the different augmentation strategies, we evaluated the performance on a version of the model architecture that only uses the target sentence for classification, as well as on our best-performing architecture, which uses section headings and surrounding sentences. We used the prompts shown in Table 1 to generate sentences using the OpenAI GPT-4 API (generation performed on Sept. 11, 2023). In both prompts, N indicates the number of instances to generate. Each instance consists of a preceding sentence, a positively labeled sentence, and a trailing sentence to be used by the model.

Table 1 GPT-4 prompts used for data augmentation.

For rephrased instances, N was set to 6, following AugGPT. For generative instances, we iteratively prompted GPT-4 setting N to 6 or 8 until we accumulated 100 total instances (13 times). Item descriptions used for the generative prompt were adapted from Moher et al.13. For all descriptions, the phrase “with reasons” was changed to “with specifics”, and irrelevant text (e.g., “when applicable” or “if relevant”) and examples (e.g., “such as eligibility criteria”) were removed to improve the overall quality and diversity of GPT-4 responses.
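The following sketch illustrates how such an augmentation call could look with the OpenAI Python client; the prompt string is a placeholder paraphrase of the workflow, not the actual prompt from Table 1, and the function name is ours.

```python
# Illustrative augmentation call with the OpenAI Python client (openai>=1.0); the prompt
# below is a placeholder paraphrase, not the actual prompt from Table 1.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rephrase_instance(prev_sent, target_sent, next_sent, n=6, model="gpt-4"):
    prompt = (
        f"Rephrase the labeled sentence below {n} times. For each rephrasing, also "
        f"produce a plausible preceding and trailing sentence.\n"
        f"Preceding: {prev_sent}\nSentence: {target_sent}\nTrailing: {next_sent}"
    )
    response = client.chat.completions.create(
        model=model,
        temperature=1,  # temperature 1 for augmentation (see Experimental settings)
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```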

Easy data augmentation (EDA)

As a simpler alternative to prompt-based data augmentation, we also generated examples using EDA25 and its variant, UMLS-EDA26. EDA25 is a rule-based method that synthesizes samples via simple modifications to the original sentence, including random deletion, random insertion, random swap, and synonym replacement based on WordNet. UMLS-EDA26 is an adaptation of EDA that additionally uses synonym replacement based on the UMLS60. For both methods, we generated six variations of each original sample, for consistency with the rephrasing approach. We used only instances with a single label in the corpus for data augmentation.
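For illustration, a minimal sketch of two of the EDA operations (random swap and WordNet-based synonym replacement) is shown below; the published EDA and UMLS-EDA implementations are more complete.

```python
# Minimal sketch of two EDA operations (random swap and WordNet synonym replacement);
# the published EDA/UMLS-EDA implementations also cover random insertion and deletion,
# and UMLS-EDA additionally replaces terms with UMLS synonyms.
import random
from nltk.corpus import wordnet  # requires nltk.download("wordnet")

def random_swap(tokens, n_swaps=1):
    tokens = tokens[:]
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def synonym_replace(tokens, n_repl=1):
    tokens = tokens[:]
    candidates = [i for i, t in enumerate(tokens) if wordnet.synsets(t)]
    for i in random.sample(candidates, min(n_repl, len(candidates))):
        synonyms = {l.name().replace("_", " ")
                    for s in wordnet.synsets(tokens[i]) for l in s.lemmas()}
        synonyms.discard(tokens[i])
        if synonyms:
            tokens[i] = random.choice(sorted(synonyms))
    return tokens
```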

In-context learning with GPT-4

To assess the few-shot learning ability of GPT-4 for our task, we prompted GPT-4 to directly infer whether a sentence in the article reports a specific CONSORT checklist item. The prompt consists of the task description, checklist item descriptions, examples (in the one-shot and five-shot settings), and the entire article. We provide the full prompt in the Supplementary File.
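A hedged sketch of how such a prompt could be assembled and sent to the model is shown below; the wording is a placeholder, as the actual prompt is provided in the Supplementary File, and the function name is ours.

```python
# Hedged sketch of the in-context learning setup; the actual prompt wording is given in
# the Supplementary File, so the strings here are placeholders.
from openai import OpenAI

client = OpenAI()

def classify_article(article_text, item_descriptions, examples=None, model="gpt-4"):
    parts = [
        "Task: for each sentence of the article, list the CONSORT checklist items it reports.",
        "Checklist item descriptions:", item_descriptions,
    ]
    if examples:  # one-shot or five-shot demonstrations
        parts += ["Examples:"] + list(examples)
    parts += ["Article:", article_text]
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic inference (see Experimental settings)
        messages=[{"role": "user", "content": "\n\n".join(parts)}],
    )
    return response.choices[0].message.content
```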

BioGPT fine-tuning

We also fine-tuned BioGPT23 for the task, formulating it as text generation. BioGPT is based on GPT-246, was trained on biomedical literature, and has shown improved performance over BERT-based architectures on text classification tasks23. We used a fine-tuning formulation similar to that proposed by Luo et al.23 for document classification. We generate the sequence using the format "[SENTENCE]. This sentence describes [LABEL]", where [SENTENCE] includes the section header information. We also learned one virtual token to better steer the language model toward our task. At inference time, we let BioGPT complete this template to perform sentence classification.
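The sketch below illustrates this generative formulation with the public BioGPT checkpoint; the template string follows the text above, while the virtual-token tuning and training loop are omitted and the helper names are ours.

```python
# Sketch of the generative formulation: BioGPT is trained to complete
# "<headers> <sentence>. This sentence describes <label>" and, at inference time,
# the generated continuation is read off as the predicted label.
from transformers import BioGptTokenizer, BioGptForCausalLM

tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")

def training_sequence(headers, sentence, label):
    return f"{' '.join(headers)} {sentence}. This sentence describes {label}"

def predict_label(headers, sentence, max_new_tokens=10):
    prompt = f"{' '.join(headers)} {sentence}. This sentence describes"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True).strip()
```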

Article-level classification for items 1a and 1b

CONSORT checklist items related to the title and abstract (1a and 1b) are document-level. We developed a simple rule-based approach for these items. For 1a (is the study described as "randomized" in the title?), we check the title for the presence of the stems random, randomis, and randomiz. For 1b (is the abstract structured?), we check whether the abstract starts with a structured abstract header included in the list developed by the National Library of Medicine61.
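A minimal sketch of these rules follows; STRUCTURED_HEADERS is a placeholder subset standing in for the NLM list of structured abstract section labels, and the function names are ours.

```python
# Minimal sketch of the rules for items 1a and 1b; STRUCTURED_HEADERS is a placeholder
# subset standing in for the NLM list of structured abstract section labels.
import re

STRUCTURED_HEADERS = ("background", "objective", "objectives", "methods", "results")

def reports_item_1a(title: str) -> bool:
    """Item 1a: the title identifies the study as randomized."""
    return bool(re.search(r"\brandom(is|iz)?", title.lower()))

def reports_item_1b(abstract: str) -> bool:
    """Item 1b: the abstract is structured (starts with a recognized section header)."""
    return abstract.strip().lower().startswith(STRUCTURED_HEADERS)
```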

Experimental settings

We used the HuggingFace implementation of the PubMedBERT (BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext) model. We used the following hyperparameters: a batch size of 4, a learning rate of 1e-5, and a dropout rate of 0.1; the learning rate of the fully connected layer was 1e-3. In each experimental run, we trained the model for 20 epochs.
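For illustration, an optimizer setup consistent with these hyperparameters is sketched below; the choice of AdamW is our assumption, as the optimizer is not specified, and ConsortClassifier refers to the earlier sketch.

```python
# Optimizer setup consistent with the reported hyperparameters; the use of AdamW is an
# assumption, and ConsortClassifier refers to the earlier sketch.
import torch

model = ConsortClassifier()
optimizer = torch.optim.AdamW([
    {"params": model.encoder.parameters(), "lr": 1e-5},     # encoder learning rate
    {"params": model.classifier.parameters(), "lr": 1e-3},  # fully connected layer
])
criterion = torch.nn.BCELoss()  # paired with the sigmoid outputs for multi-label training
BATCH_SIZE, NUM_EPOCHS = 4, 20  # dropout of 0.1 is the encoder's default setting
```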

We fine-tuned BioGPT, which was pre-trained from scratch on PubMed abstracts, for 20 K steps with a peak learning rate of 1e-5 and 1,000 warm-up steps. For GPT-4, we set the temperature to 1 when performing data augmentation, to increase the creativity of the responses, and to 0 when performing direct inference, to ensure that the responses are consistent.

Evaluation

To evaluate the fine-tuned PubMedBERT and BioGPT models, we used group 5-fold cross-validation, ensuring that the sentences of a given article appear only in the training set or only in the test set in each cross-validation run. To evaluate in-context learning with GPT-4, we randomly sampled 10 articles as the test set, because the model yielded only modest performance in preliminary experiments (see Results) and using the OpenAI API for GPT-4 incurs significant cost.
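Group 5-fold cross-validation with articles as groups can be set up as in the following illustrative sketch using scikit-learn.

```python
# Illustrative group 5-fold cross-validation with articles as groups, so that all
# sentences from an article fall entirely in either the training or the test split.
from sklearn.model_selection import GroupKFold

def cross_validation_splits(sentences, labels, article_ids, n_splits=5):
    gkf = GroupKFold(n_splits=n_splits)
    for train_idx, test_idx in gkf.split(sentences, labels, groups=article_ids):
        yield train_idx, test_idx
```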

Following previous work, we used precision, recall, and their harmonic mean, the F1 score, as the main evaluation criteria. We report micro- and macro-averaged results and the area under the ROC curve (AUC). To assess whether different input representations and data augmentation approaches led to statistically significant differences in model performance compared to the baseline, we used McNemar's test62, adopting the approach outlined by Gillick and Cox63.
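A sketch of how these metrics and the McNemar comparison could be computed is given below; aggregating per-decision correctness into a 2x2 table follows our reading of the cited approach and is an assumption.

```python
# Sketch of the sentence-level metrics and a McNemar comparison of two systems'
# per-decision correctness (boolean arrays); the aggregation is our assumption.
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score
from statsmodels.stats.contingency_tables import mcnemar

def evaluate(y_true, y_pred, y_score):
    return {
        "micro_f1": f1_score(y_true, y_pred, average="micro"),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "auc": roc_auc_score(y_true, y_score, average="macro"),
    }

def mcnemar_pvalue(correct_a, correct_b):
    table = np.array([
        [np.sum(correct_a & correct_b), np.sum(correct_a & ~correct_b)],
        [np.sum(~correct_a & correct_b), np.sum(~correct_a & ~correct_b)],
    ])
    return mcnemar(table, exact=False, correction=True).pvalue
```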

Sentence-level evaluation provides a strict perspective on model performance. In practical use cases, such as guideline adherence checks and large-scale reporting analyses15, the user of such a model is likely to be interested in whether the model identifies the reported and missing checklist items for a given article and provides justification for the reported items. To accommodate such use cases, we also considered two article-level evaluation schemes: article (ANY) and article (1+). The former evaluates whether the model correctly predicts that the article includes at least one sentence relevant to the checklist item. The latter is similar to article (ANY) but also requires at least one sentence of overlap between the predictions for an item and the ground truth sentences for that item. Article (ANY) is the most lenient evaluation, whereas article (1+) is likely the most useful for practical purposes, because it also ensures that at least one correct supporting sentence is identified for the checklist item.
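The two article-level schemes can be expressed, for a single checklist item and article, as in the following sketch (function names are ours).

```python
# The two article-level schemes for a single checklist item, given the sets of predicted
# and gold sentence identifiers for one article (function names are ours).
def article_any(pred_sentences: set, gold_sentences: set) -> bool:
    """Correct if the model agrees on whether the item is reported at all."""
    return bool(pred_sentences) == bool(gold_sentences)

def article_one_plus(pred_sentences: set, gold_sentences: set) -> bool:
    """Like ANY, but a reported item also requires >= 1 correctly identified sentence."""
    if not gold_sentences:
        return not pred_sentences
    return bool(pred_sentences & gold_sentences)
```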

Results

High-level comparison of PubMedBERT and GPT-based models

Table 2 shows the performance of the sentence classification models. The results show that section headers contribute significantly to PubMedBERT performance. Prepending all relevant section headers outperforms prepending only the innermost or outermost section header. Incorporating positional information does not improve results. On the other hand, incorporating context from surrounding sentences yields the best performance, specifically by improving recall. We consider this model (PubMedBERT using surrounding context, prepending section headers to sentences, and using [CLS] token representation) as our main model. This model yields 0.71 micro-F1 and 0.67 macro-F1 with balanced precision and recall (0.72 and 0.71, respectively).

Table 2 Overall model performance for CONSORT sentence classification over 5-fold cross-validation. The evaluation is at the sentence level.

BioGPT fine-tuning yields a modest improvement over the baseline PubMedBERT model; however, it is outperformed by the PubMedBERT models that use richer input representations. Zero-shot in-context learning with GPT-4 yields poorer performance than the fine-tuned models. Surprisingly, providing example sentences (one-shot and five-shot) degraded GPT-4 performance further. GPT-4 showed improved recognition of some rare items, such as 7b (Interim Analysis/Stopping Guidelines), 9 (Allocation Concealment), and 11b (Similarity of Interventions), while its performance on common items, such as 6a (Outcomes) and 12a (Statistical Methods for Outcomes), was notably lower. Item-level results for the models are shown in Supplementary File Tables S2–S4.

Item-level results for the best-performing PubMedBERT model

The best-performing PubMedBERT model yields an F1 score over 0.8 for 8 of the 34 sentence-level items, all of which have more than 100 instances in the dataset (Supplementary File Table S2). The model performance remains relatively low in classifying infrequently reported items. The F1 score remains under 0.5 for another 9 items, most of which have fewer than 100 instances in the dataset (3b, 6b, 7b, 9, 12b, and 21). Some CONSORT items are multi-part (indicated by a and b in the item numbers), and in some cases the model struggles to distinguish these closely related items. For example, item 12b (Statistical Methods for Other Analyses) is often confused with 12a (Statistical Methods for Outcomes). The performance is highest for items related to the Introduction (0.89 F1 for 2b (Objectives)), followed by those in the Methods section (0.75 F1). It is lowest for Results-related items (0.62 F1). The performance on items not associated with specific sections (items 23–25) is over 0.8 F1.

Article-level evaluation for the best-performing PubMedBERT model

Article (ANY) and article (1+) evaluation results for the best-performing PubMedBERT model are provided in Supplementary File Table S2. The model reaches a high micro-F1 score of 0.92 and macro-F1 of 0.87 in article (ANY) evaluation, with 29 out of 36 items recognized with an F1 score of 0.8 or higher. In the more stringent article (1+) evaluation, we obtain 0.90 micro-F1 and 0.84 macro-F1, with 27 items recognized with an F1 score of 0.8 or higher. In both evaluation schemes, verification of infrequent items remains challenging. Both items 1a and 1b are recognized accurately, showing that the simple rules are sufficient for these items.

Data augmentation

We present samples generated via data augmentation in Supplementary File Table S5. GPT-4 is able to generate coherent sentences from the item descriptions in the "generative" setting. Using GPT-4 for rephrasing also seems to largely preserve the semantic content of the original sentence. On the other hand, the EDA and UMLS-EDA methods do not preserve meaning.

To assess the contribution of data augmentation, we used both the baseline PubMedBERT model and the best-performing model. The results are presented in Table 3. We observe that data augmentation does not improve the results of the best-performing model. For the baseline model (sentence text only), UMLS-EDA improves the results most (by 2 percentage points). A closer analysis reveals that the different methods improve performance on infrequent items (reflected by the increase in macro-F1), but this improvement is often offset by a performance reduction on more common items.

Table 3 Performance of CONSORT sentence classification with different data augmentation methods.

Comparison with section-specific models

The comparison of the model trained on full articles and the full label set with the models trained on specific sections and their related labels is shown in Table 4. Training a Methods-specific model on the Methods sentences yielded a better micro-F1 score than the single model trained on the full article. This finding held for the Results and Discussion sections, albeit to a smaller degree. The effect of section-specific training seems to be to improve precision at the cost of some recall. Macro-F1 scores were higher with the single model, suggesting that section-specific models primarily improve the performance of the common items.

Table 4 Performance of sentence classification models trained on specific sections or on the entire article.

Discussion

This study is the first to present an automated approach for recognizing all CONSORT checklist items in RCT results publications. The overall performance of the best model is reasonable (0.71 micro-F1 with balanced precision and recall at the sentence level). For common items such as Eligibility Criteria (4a), Outcomes (6a), Sample Size Determination (7a), and Registration (23), its performance is over 0.8 F1 score, indicating that the model could be used to recognize such items in practice. The performance is lowest on rare items, such as Changes to Trial Design (3b), Changes to Outcomes (6b), and Allocation Concealment (9). Recognition of CONSORT items at the article level is high (0.92 micro-F1 with article (ANY) and 0.90 micro-F1 with article (1+)). This suggests that the article-level predictions of the model can be used to indicate whether or not a publication reports a specific item and to provide at least one sentence supporting the prediction, which facilitates automatic screening of RCT publications by journals. The best-performing classifier is a fine-tuned PubMedBERT model that takes as input the target sentence as well as the surrounding sentences, each prepended with their section headers. This indicates the utility of longer context and document structure for the task, which is not surprising, given that some CONSORT items are reported over passages and section headers are sometimes directly related to a CONSORT item (e.g., Primary outcomes). The impact of incorporating longer context versus document structure is similar, and they act synergistically to further improve performance, although this additional improvement is small. We leave the investigation of whether even longer contexts could lead to further performance improvement to future work. An analysis of the errors made by the best-performing PubMedBERT model is presented below.

Generative models

The generative models for sentence classification (BioGPT fine-tuning and zero- or few-shot in-context learning with GPT-4) underperformed the best PubMedBERT model by significant margins. Similar to the PubMedBERT models, BioGPT did well on some common items (e.g., 7a (Sample Size Determination), 0.87 F1), while its performance was poor on rare items and some multi-part items (Supplementary File Table S3). BioGPT fine-tuning involved only the target sentence, and adding the surrounding sentences could possibly improve performance; however, fine-tuning BioGPT is much more computationally intensive than fine-tuning PubMedBERT. BioGPT is based on GPT-246, and using more recent domain-specific models such as PMC-LLaMA64 could be a more promising avenue.

In-context learning with GPT-4 failed to achieve satisfactory results, even for common items. Surprisingly, providing examples (one- or few-shot) did not improve upon the zero-shot setting. Existing studies point out that GPT models are sensitive to the prompts and even to the order of elements within them; therefore, it may be possible to design better prompts to enhance in-context learning. We randomly sampled demonstration examples for the one- and few-shot settings; selecting examples similar to the target sentence could improve results. At the same time, our results with GPT-4 are consistent with other comparisons of GPT-4 in-context learning with fine-tuned models for text classification53. A more comprehensive study of prompting strategies for the task is needed in the future.

Data augmentation

The effect of data augmentation on PubMedBERT fine-tuning was minimal, which is consistent with our previous findings22. To our surprise, the GPT-4-based approaches underperformed the EDA-based approaches, UMLS-EDA in particular, even though they produced more meaningful, generally semantically coherent sentences. Our findings with GPT-4 contrast with other studies, which found that synthetic data generation with LLMs led to improved performance on downstream tasks65. In GPT-4-based augmentation, we only provide the target sentence for rephrasing and let GPT-4 generate the corresponding preceding and trailing sentences. This may have led to inconsistencies between the generated sentences and reduced the effectiveness of this approach. Data augmentation had a more pronounced effect when used to enhance the baseline PubMedBERT model, in contrast to the best-performing model that uses longer contexts. This suggests that longer contexts could, to some extent, compensate for data scarcity. The limited effectiveness of data augmentation might also be due to the fact that three of the methods (EDA, UMLS-EDA, and GPT-4 rephrasing) rely on existing training examples and may not introduce enough diversity into the training set. The other method (GPT-4 generative) relies on label descriptions, which could also limit diversity among the generated sentences. Leveraging distant supervision approaches, such as active learning, could improve the diversity of the dataset and the generalizability of the models.

Section-specific training

Our comparison of section-specific model training with training a single CONSORT model was inconclusive. The Methods-specific model worked better on methodology items than the more comprehensive model, whereas the results for the Results and Discussion sections were mostly similar. Precision with section-specific training was notably higher, which may be desirable in some cases. However, because the differences are minor, it seems more efficient to train and perform predictions with a single full model.

Error analysis

We analyzed the errors made by the best-performing PubMedBERT model. Most error cases involved sentences predicted to report a CONSORT item different from the original label (37.2% of errors). In some of these cases, the true label was among the predictions, but additional labels were also predicted (8.3%). Similarly, for 6.3% of the errors, at least one true label was correctly predicted, but some other labels were missed. In 1.1% of the cases, there was at least one label overlap, while some labels were missed and others were incorrectly predicted. These types of partial overlaps, accounting for 15.7% of the total number of errors, could be considered less critical. About 29.3% of the errors involved negative sentences that were predicted to report a CONSORT item, and the rest (33.5%) were sentences labeled with a CONSORT item for which no CONSORT label was predicted. We provide samples of these error types in Table 5.

Table 5 Samples for different error types. The labels shown are Outcomes (6a), Outcome Results (17a), Binary Outcome Results (17b), and Ancillary Analyses (18).

Most confusion occurred between Outcome Results (17a) and the related items Binary Outcome Results (17b) and Ancillary Analyses (18), followed by Statistical Methods for Outcomes (12a) and Statistical Methods for Other Analyses (12b). We note that inter-annotator agreement for these labels was lower, as annotators often confused them as well21. For practical use, it might be sensible to merge these labels. The label that was most often completely missed or incorrectly predicted was Interpretation (22), which is a broad and diffuse category, similar to Background (2a), and its utility might be considered debatable.

To better understand model behavior, we further examined some false positives (cases where the model predicted a label but the sentence was not labeled in the ground truth) and false negatives (labeled sentences completely missed by the model). We found that in some false positive cases, the predicted sentences relate to the checklist item but do not contain enough information relevant to it. For example, for the sentence "Sample size was calculated from the study of Lee et al.", the model predicted the label Sample Size Determination (7a), even though the sentence does not discuss the specifics of the calculation. Similarly, the sentence "The remaining 12 adverse events in the cryotherapy group were either unrelated to the trial treatment or unlikely to be related." is predicted as Harms (19), although specific adverse events are not mentioned in the sentence. Apart from sentences reporting Interpretation (22), false negatives involved sentences that describe how outcomes were assessed or interventions were administered, as opposed to what the outcomes and interventions were. In these examples, the model correctly labeled other sentences in the same article that discuss the outcomes and interventions, so these errors may be less problematic. An example is "Nominated study staff at each site (general practitioner or practice nurse) verified the blood pressure measurements by an independent audit of the clinical details in the case records, outputs from the blood pressure monitor, and the computer decision tool.", which describes how an outcome was measured.

Limitations

Our study has several limitations. The dataset consists of a small number of articles in XML format from PubMed Central, which may not be fully representative of the RCT literature, and contains limited data for the infrequently reported CONSORT checklist items. We attempted to address this issue using data augmentation; however, the effect was minimal. Given the scarcity of data, it might be reasonable to resort to rule-based methods for some rare items, such as 3b (Changes to Trial Design). At the same time, it is necessary to develop larger datasets, which is challenging, as it requires significant domain expertise. Distant supervision approaches leveraging unlabeled data from the literature could be a promising avenue. Our exploration of generative models was limited. A more systematic exploration of prompting strategies is needed in future work.

We have focused on recognizing sentences reporting CONSORT checklist items, a first step toward assessing adherence (e.g., whether the statistical methods used are appropriate), which we have not attempted in this study. We leave this much more challenging task for future work.

Conclusions

In this study, we extended our earlier work to recognize all CONSORT checklist items in RCT publications. A fine-tuned PubMedBERT model using surrounding context and article structure yielded the best performance. We did not observe significant benefits from using LLMs for data augmentation or in-context learning, or from fine-tuning them. We also did not observe an advantage of training section-specific models.

In future work, we aim to improve the models further for practical use. We plan to achieve this by extending the annotated corpus and the models, and by making the models more efficient through techniques such as distillation. While active implementation of CONSORT in the peer review process has been shown to improve reporting quality, additional time requirements for editorial staff and a longer peer review process have also been noted17. With further enhancements, our models could speed up this process and assist journals in checking CONSORT compliance in a human-in-the-loop setting. They could also help authors improve the completeness and transparency of their manuscripts prior to peer review. We also plan to extend our models to assess the extent to which articles are CONSORT-compliant, potentially increasing their practical utility.