Introduction

Complete and transparent reporting in biomedical publications is critical for assessing the validity of research findings, promoting scientific integrity, and facilitating evidence-based decision-making in patient care and health policy1,2,3. However, poor reporting has been a persistent issue, leading to potential biases and difficulties incorporating results into meta-analyses, complicating replication efforts, and ultimately undermining the trustworthiness of biomedical research2,4,5. For example, studies of clinical trial publications indicated that 40–89% of them lacked adequate descriptions of interventions, making replication in subsequent trials infeasible2. Other surveys suggested that most studies had at least one primary outcome that was changed, introduced, or omitted during the course of the research2. Reporting guidelines have been developed to provide researchers with a minimum list of essential information that they need to report for readers to clearly understand study methods and findings5,6,7,8,9. They also aim to facilitate the replication of study procedures and improve the reliability of published research. While reporting guidelines have been endorsed by many high-impact journals10, adherence to them remains inadequate2,11,12.

Randomized controlled trials (RCTs), when designed and conducted rigorously, remain the most robust method to determine the effectiveness of an intervention13. The CONSORT 2010 Statement6,13 is a reporting guideline for RCT results publications and consists of a checklist and a participant flowchart. The checklist includes 25 items considered the minimum information needed to understand RCTs (e.g., outcomes, randomization, masking, harms). While CONSORT has been endorsed by many journals, publishers, and editorial organizations, systematic studies of current practices show poor reporting even in well-conducted RCTs12,13,14. Some studies have shown more complete reporting of CONSORT items over time15.

Adherence to CONSORT, and to reporting guidelines more broadly, remains low, partly because journal endorsement generally does not entail enforcement or verification. CONSORT implementation, where RCT submissions are scrutinized by journal editors for compliance before peer review, has been shown to improve reporting quality16,17. However, manual screening is labor-intensive and time-consuming for journal editors and staff. Automated screening, based on natural language processing (NLP) and machine learning approaches, could reduce the burden of manual checking, streamline the peer review process, and contribute to better reporting quality18,19,20.

In prior work, we developed a corpus of RCT result publications annotated with CONSORT checklist items (CONSORT-TM)21 and reported NLP models for recognizing the CONSORT items related to methodology (17 items)15,21,22. In this study, we extend our work by training and validating NLP models for the full CONSORT checklist at fine granularity (37 items). Our main contributions are as follows:

  1. We develop and evaluate the first NLP models targeting automatic recognition of all CONSORT checklist items at fine granularity.

  2. We compare different input representations and features for the task (context size, section information, sentence position).

  3. We fine-tune a GPT-based model (BioGPT23) to study whether it confers any benefits over models based on the BERT architecture24.

  4. We assess in-context learning using GPT-4 for the task.

  5. To address the limited data size and label imbalance, we assess the ability of GPT-4 to generate useful training instances for this task and compare it to other data augmentation methods (Easy Data Augmentation (EDA)25, UMLS-EDA26).

Related work

Text classification in RCT articles

Most NLP research on RCT publications has focused on classifying sentences using the PICO framework (Population, Intervention, Comparator, Outcome) to aid the systematic review process and evidence-based medicine27,28,29,30,31,32,33,34. Other research has focused on automating risk of bias assessment using text classification35,36,37 and rhetorical classification of medical abstracts (e.g., Objective, Methods)38,39,40. Research on other key characteristics is less common, although methods have been reported for identifying study design29,30,41, sample size28,37,41, statistical methods42, and limitations43. Approaches range from rule-based methods in early work27,42 to (semi-)supervised machine learning in later work28,31,35,43, including deep learning approaches in recent years32,33,34,38,39,40,41.

We presented CONSORT-TM, to our knowledge the most comprehensive corpus of RCT publications annotated with trial characteristics21. We also trained NLP models for recognizing methodology items. A BioBERT-based model44 outperformed rule-based and traditional machine learning approaches. We used our best model to study reporting trends in more than 176,000 RCT publications published between 1966 and 2018, which showed an improvement in methodology reporting over time while also highlighting shortcomings in the reporting of most items15.

Generative large language models for biomedical literature mining

Generative large language models (LLMs) based on the Transformer architecture45 (e.g., the GPT family46,47, PaLM48, LLaMA49) have shown remarkable language generation capabilities and are increasingly applied to NLP tasks in the general domain50 and in the biomedical domain51,52. Domain-specific LLMs for the biomedical domain have also been trained (e.g., BioGPT23, Med-PaLM51). Prompt engineering for specific tasks has become an effective strategy for leveraging the in-context learning abilities of LLMs52. In the biomedical domain, fine-tuned BioGPT23 has shown superior performance to BERT-based models in document classification, while the performance of GPT models in zero- or one-shot settings has been found to trail that of fine-tuned BERT models53. Most relevant to this work, a recent study used GPT-3.5 to check RCT reports in sports medicine and exercise science for adherence to 9 CONSORT checklist items, reporting accuracy in the range of 70–100%54. That study is narrower in scope than ours, focusing on article-level binary decisions for the 9 items rather than sentence classification.

Materials and methods

In this section, we first briefly describe the dataset used in this study (CONSORT-TM21). Next, we provide the details of the NLP models, including data augmentation and experimental settings. Lastly, we discuss the evaluation of NLP models. A high-level overview of our study is illustrated in Fig. 1.

Fig. 1

A high-level overview of our study, including NLP model training and evaluation.

Dataset

The CONSORT-TM corpus21 consists of 50 RCT publications annotated at the sentence level with 37 CONSORT checklist items. It contains a total of 10,709 sentences, 4,845 (45%) of which are annotated with 5,246 labels (i.e., it is a multi-label corpus). Each article contains an average of 27.5 of the 37 fine-grained checklist items. In this work, we exclude the checklist item Background (2a), because it is a broad category that virtually all papers report and checking its reporting was deemed unnecessary; we therefore focus on the remaining 36 items. We provide the CONSORT checklist items and their descriptions in Supplementary File Table S1.

PubMedBERT fine-tuning

In prior work on CONSORT methodology reporting21,22, we fine-tuned the BioBERT model44. In more recent work15, we substituted the BioBERT model with PubMedBERT55, which has shown better performance in many biomedical NLP tasks. We continue to use PubMedBERT for multi-label sentence classification in this study. We excluded items 1a (whether the study is indicated as randomized in the title) and 1b (whether the abstract is structured) from the sentence classification model, because these items are article-level, unlike the rest of the items, and we use simple rules to recognize them (see below).

In previous work, we represented each input sentence as the concatenation of its enclosing section header and the sentence text. Here, we incorporate more contextual information into classification by also taking into account the preceding and the following sentence, based on the observation that many CONSORT items are reported over several contiguous sentences (i.e., zones), and additional information from neighboring sentences could help classification. The input representation is illustrated in Fig. 2. It consists of three sentences (preceding, target, trailing) delimited by special [SEP] tokens and prepended with the [CLS] token, following earlier work56,57. Each sentence is prepended with the list of (nested) section headers associated with it (e.g., Methods Patients [SENTENCE]). The [CLS] token representation generated by the PubMedBERT encoder is fed into a fully connected layer, and the sigmoid function is used for multi-label classification.
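As a concrete illustration, the sketch below shows one way to assemble this input and the sigmoid multi-label head. It is our simplified reconstruction, not the authors' released code; the helper names build_input and ConsortClassifier are ours, and the checkpoint path assumes the microsoft/ namespace on the HuggingFace Hub.

```python
# Illustrative sketch (not the authors' released code) of the input representation and
# multi-label head described above.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def build_input(prev_sent, target_sent, next_sent, headers):
    # Prepend the nested section headers (e.g., "Methods Patients") to each sentence
    # and delimit the three sentences with [SEP]; the tokenizer adds [CLS] and the
    # final [SEP] automatically.
    prefix = " ".join(headers)
    text = f" {tokenizer.sep_token} ".join(
        f"{prefix} {s}" for s in (prev_sent, target_sent, next_sent)
    )
    return tokenizer(text, truncation=True, max_length=512, return_tensors="pt")

class ConsortClassifier(nn.Module):
    def __init__(self, num_labels=34):  # 36 items minus the article-level 1a and 1b
        super().__init__()
        self.encoder = AutoModel.from_pretrained(MODEL_NAME)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, **inputs):
        cls = self.encoder(**inputs).last_hidden_state[:, 0]  # [CLS] representation
        return torch.sigmoid(self.classifier(cls))            # multi-label scores
```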

Fig. 2

Experimental flow including PubMedBERT fine-tuning, data augmentation strategies, few-shot prompting, and BioGPT fine-tuning. The PubMedBERT model takes sentences surrounding the target sentence into consideration. The prediction is made on the [CLS] token representation.

We also leveraged sentence position in the article as an additional feature, based on the observation that checklist items tend to be discussed in a predictable order in an article. For example, the first few sentences of the Methods section often discuss item 3a (Trial Design). We concatenated a sentence position embedding with the sentence representation to incorporate this information. We experimented with absolute and relative positions. We used an embedding layer to encode absolute position. We converted relative position (a continuous value) into a categorical value by first creating 10 bins (0–0.1, 0.1–0.2, etc.) and then used an embedding layer to encode it.
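A minimal sketch of the relative-position feature follows, assuming the position embedding is concatenated with the [CLS] vector just before the classification layer; the embedding dimension of 16 is our assumption.

```python
# Sketch of the relative-position feature: the position is binned into 10 intervals
# (0-0.1, 0.1-0.2, ...) and its embedding is concatenated with the sentence ([CLS])
# representation before classification. pos_dim is an assumed value.
import torch
import torch.nn as nn

class PositionAwareHead(nn.Module):
    def __init__(self, hidden_size, num_labels=34, num_bins=10, pos_dim=16):
        super().__init__()
        self.pos_embedding = nn.Embedding(num_bins, pos_dim)
        self.classifier = nn.Linear(hidden_size + pos_dim, num_labels)

    def forward(self, cls_vector, sent_index, num_sentences):
        rel = sent_index.float() / num_sentences.float()  # relative position in [0, 1]
        bins = torch.clamp((rel * 10).long(), max=9)      # map to bins 0..9
        combined = torch.cat([cls_vector, self.pos_embedding(bins)], dim=-1)
        return torch.sigmoid(self.classifier(combined))
```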

The CONSORT checklist is organized by the sections in which the items are expected to be reported. We examined whether models trained on specific sections and the items specific to those sections could lead to better performance than a single model trained on full articles and all labels. For these experiments, we created models specific to the Methods, Results, and Discussion sections. We did not create an Introduction-specific model, because we consider only one relevant label in that section (2b (Objectives)).

Data augmentation

CONSORT-TM is relatively small, and some checklist items are infrequently reported (e.g., 6b (Changes to Outcomes)). This has led to poor performance for rare items in previous work21,22. In this study, we leveraged the text generation capabilities of LLMs to improve the quality of data augmentation and compared it to simpler approaches we explored in previous work (EDA25, UMLS-EDA26). Our goal was to assess whether this type of data augmentation could improve model performance for rare labels and enhance generalizability.

Prompt-based augmentation using GPT-4

GPT-4 can generate fluent, human-like text, even about complex topics like medicine58. Previous work has demonstrated that GPT-based models can be used for data augmentation to improve a model’s performance; for example, AugGPT reports a framework that uses ChatGPT to rephrase existing text instances59.

We adapted AugGPT59 to paraphrase the instances of rare labels in our corpus. In addition, we used GPT-4 to generate completely new sentences based on label descriptions, referred to here as generative instances. For paraphrased instances, we used the labels of the original instance. For generative instances, we applied the label of the criterion description used within the prompt.

We augmented data for categories that have fewer than 100 samples in the entire corpus. These categories are: 3b (Changes to Trial Design), 6b (Changes to Outcomes), 7b (Interim Analyses and Stopping Guidelines), 9 (Allocation Concealment Mechanism), 11b (Similarity of Interventions), 12b (Statistical Methods for Other Analyses), 14b (Trial Stopping) and 21 (Generalizability). To better isolate the effect of the different augmentation strategies, we evaluated the performance on a version of the model architecture that only uses the target sentence for classification, as well as on our best-performing architecture, which uses section headings and surrounding sentences. We used the prompts shown in Table 1 to generate sentences using the OpenAI GPT-4 API (generation performed on Sept. 11, 2023). In both prompts, N indicates the number of instances to generate. Each instance consists of a preceding sentence, a positively labeled sentence, and a trailing sentence to be used by the model.

Table 1 GPT-4 prompts used for data augmentation.

For rephrased instances, N was set to 6, following AugGPT. For generative instances, we iteratively prompted GPT-4 setting N to 6 or 8 until we accumulated 100 total instances (13 times). Item descriptions used for the generative prompt were adapted from Moher et al.13. For all descriptions, the phrase “with reasons” was changed to “with specifics”, and irrelevant text (e.g., “when applicable” or “if relevant”) and examples (e.g., “such as eligibility criteria”) were removed to improve the overall quality and diversity of GPT-4 responses.
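The following sketch illustrates how such an augmentation call could look with the OpenAI Python client; the prompt string is a placeholder paraphrase of the workflow, not the actual prompt from Table 1, and the function name is ours.

```python
# Illustrative augmentation call with the OpenAI Python client (openai>=1.0); the prompt
# below is a placeholder paraphrase, not the actual prompt from Table 1.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rephrase_instance(prev_sent, target_sent, next_sent, n=6, model="gpt-4"):
    prompt = (
        f"Rephrase the labeled sentence below {n} times. For each rephrasing, also "
        f"produce a plausible preceding and trailing sentence.\n"
        f"Preceding: {prev_sent}\nSentence: {target_sent}\nTrailing: {next_sent}"
    )
    response = client.chat.completions.create(
        model=model,
        temperature=1,  # temperature 1 for augmentation (see Experimental settings)
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```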

Easy data augmentation (EDA)

As a simpler alternative to prompt-based data augmentation, we also generated examples using EDA25 and its variant, UMLS-EDA26. EDA25 is a rule-based method that synthesizes samples via simple modifications to the original sentence, including random deletion, random insertion, random swap, and synonym replacement based on WordNet. UMLS-EDA26 is an adaptation of EDA that additionally uses synonym replacement based on the UMLS60. For both methods, we generated six variations of each original sample, for consistency with the rephrasing approach. We used only instances with a single label in the corpus for data augmentation.
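For illustration, a minimal sketch of two of the EDA operations (random swap and WordNet-based synonym replacement) is shown below; the published EDA and UMLS-EDA implementations are more complete.

```python
# Minimal sketch of two EDA operations (random swap and WordNet synonym replacement);
# the published EDA/UMLS-EDA implementations also cover random insertion and deletion,
# and UMLS-EDA additionally replaces terms with UMLS synonyms.
import random
from nltk.corpus import wordnet  # requires nltk.download("wordnet")

def random_swap(tokens, n_swaps=1):
    tokens = tokens[:]
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def synonym_replace(tokens, n_repl=1):
    tokens = tokens[:]
    candidates = [i for i, t in enumerate(tokens) if wordnet.synsets(t)]
    for i in random.sample(candidates, min(n_repl, len(candidates))):
        synonyms = {l.name().replace("_", " ")
                    for s in wordnet.synsets(tokens[i]) for l in s.lemmas()}
        synonyms.discard(tokens[i])
        if synonyms:
            tokens[i] = random.choice(sorted(synonyms))
    return tokens
```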

In-context learning with GPT-4

To assess the few-shot learning ability of GPT-4 for our task, we prompted GPT-4 to directly infer whether a sentence in the article reports a specific CONSORT checklist item. The prompt consists of the task description, checklist item descriptions, examples (in the one-shot and five-shot settings), and the entire article. We provide the full prompt in the Supplementary File.
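A hedged sketch of how such a prompt could be assembled and sent to the model is shown below; the wording is a placeholder, as the actual prompt is provided in the Supplementary File, and the function name is ours.

```python
# Hedged sketch of the in-context learning setup; the actual prompt wording is given in
# the Supplementary File, so the strings here are placeholders.
from openai import OpenAI

client = OpenAI()

def classify_article(article_text, item_descriptions, examples=None, model="gpt-4"):
    parts = [
        "Task: for each sentence of the article, list the CONSORT checklist items it reports.",
        "Checklist item descriptions:", item_descriptions,
    ]
    if examples:  # one-shot or five-shot demonstrations
        parts += ["Examples:"] + list(examples)
    parts += ["Article:", article_text]
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic inference (see Experimental settings)
        messages=[{"role": "user", "content": "\n\n".join(parts)}],
    )
    return response.choices[0].message.content
```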

BioGPT fine-tuning

We also fine-tuned BioGPT23 for the task, formulating it as text generation. BioGPT is based on GPT-246, was trained on biomedical literature, and has shown improved performance over BERT-based architectures on text classification tasks23. We used a fine-tuning formulation similar to that proposed by Luo et al.23 for document classification. We generate the sequence using the format "[SENTENCE]. This sentence describes [LABEL]", where [SENTENCE] includes the section header information. We also learned one virtual token to better steer the language model toward our task. At inference time, we let BioGPT complete this template to perform sentence classification.
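The sketch below illustrates this generative formulation with the public BioGPT checkpoint; the template string follows the text above, while the virtual-token tuning and training loop are omitted and the helper names are ours.

```python
# Sketch of the generative formulation: BioGPT is trained to complete
# "<headers> <sentence>. This sentence describes <label>" and, at inference time,
# the generated continuation is read off as the predicted label.
from transformers import BioGptTokenizer, BioGptForCausalLM

tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")

def training_sequence(headers, sentence, label):
    return f"{' '.join(headers)} {sentence}. This sentence describes {label}"

def predict_label(headers, sentence, max_new_tokens=10):
    prompt = f"{' '.join(headers)} {sentence}. This sentence describes"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True).strip()
```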

Article-level classification for items 1a and 1b

CONSORT checklist items related to the title and abstract (1a and 1b) are document-level. We developed a simple rule-based approach for these items. For 1a (is the study described as "randomized" in the title?), we check the title for the presence of the stems random, randomis, and randomiz. For 1b (is the abstract structured?), we check whether the abstract starts with a structured abstract header included in the list developed by the National Library of Medicine61.
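A minimal sketch of these rules follows; STRUCTURED_HEADERS is a placeholder subset standing in for the NLM list of structured abstract section labels, and the function names are ours.

```python
# Minimal sketch of the rules for items 1a and 1b; STRUCTURED_HEADERS is a placeholder
# subset standing in for the NLM list of structured abstract section labels.
import re

STRUCTURED_HEADERS = ("background", "objective", "objectives", "methods", "results")

def reports_item_1a(title: str) -> bool:
    """Item 1a: the title identifies the study as randomized."""
    return bool(re.search(r"\brandom(is|iz)?", title.lower()))

def reports_item_1b(abstract: str) -> bool:
    """Item 1b: the abstract is structured (starts with a recognized section header)."""
    return abstract.strip().lower().startswith(STRUCTURED_HEADERS)
```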

Experimental settings

We used the HuggingFace implementation of the PubMedBERT (BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext) model. We used the following hyperparameters: a batch size of 4, a learning rate of 1e-5, and a dropout rate of 0.1; the learning rate of the fully connected layer was 1e-3. In each experimental run, we trained the model for 20 epochs.
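For illustration, an optimizer setup consistent with these hyperparameters is sketched below; the choice of AdamW is our assumption, as the optimizer is not specified, and ConsortClassifier refers to the earlier sketch.

```python
# Optimizer setup consistent with the reported hyperparameters; the use of AdamW is an
# assumption, and ConsortClassifier refers to the earlier sketch.
import torch

model = ConsortClassifier()
optimizer = torch.optim.AdamW([
    {"params": model.encoder.parameters(), "lr": 1e-5},     # encoder learning rate
    {"params": model.classifier.parameters(), "lr": 1e-3},  # fully connected layer
])
criterion = torch.nn.BCELoss()  # paired with the sigmoid outputs for multi-label training
BATCH_SIZE, NUM_EPOCHS = 4, 20  # dropout of 0.1 is the encoder's default setting
```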

We fine-tuned BioGPT, which was pre-trained from scratch on PubMed abstracts, for 20 K steps with a peak learning rate of 1e-5 and 1,000 warm-up steps. For GPT-4, we set the temperature to 1 when performing data augmentation, to increase the creativity of the responses, and to 0 when performing direct inference, to ensure that the responses are consistent.

Evaluation

To evaluate the fine-tuned PubMedBERT and BioGPT models, we used group 5-fold cross-validation, ensuring that the sentences of a given article appear only in the training set or only in the test set in each cross-validation run. To evaluate in-context learning with GPT-4, we randomly sampled 10 articles as the test set, because the model yielded only modest performance in preliminary experiments (see Results) and using the OpenAI API for GPT-4 incurs significant cost.
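Group 5-fold cross-validation with articles as groups can be set up as in the following illustrative sketch using scikit-learn.

```python
# Illustrative group 5-fold cross-validation with articles as groups, so that all
# sentences from an article fall entirely in either the training or the test split.
from sklearn.model_selection import GroupKFold

def cross_validation_splits(sentences, labels, article_ids, n_splits=5):
    gkf = GroupKFold(n_splits=n_splits)
    for train_idx, test_idx in gkf.split(sentences, labels, groups=article_ids):
        yield train_idx, test_idx
```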

Following previous work, we used precision, recall, and their harmonic mean, the F1 score, as the main evaluation criteria. We report micro- and macro-averaged results and the area under the ROC curve (AUC). To assess whether different input representations and data augmentation approaches led to statistically significant differences in model performance compared to the baseline, we used McNemar's test62, adopting the approach outlined by Gillick and Cox63.
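A sketch of how these metrics and the McNemar comparison could be computed is given below; aggregating per-decision correctness into a 2x2 table follows our reading of the cited approach and is an assumption.

```python
# Sketch of the sentence-level metrics and a McNemar comparison of two systems'
# per-decision correctness (boolean arrays); the aggregation is our assumption.
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score
from statsmodels.stats.contingency_tables import mcnemar

def evaluate(y_true, y_pred, y_score):
    return {
        "micro_f1": f1_score(y_true, y_pred, average="micro"),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "auc": roc_auc_score(y_true, y_score, average="macro"),
    }

def mcnemar_pvalue(correct_a, correct_b):
    table = np.array([
        [np.sum(correct_a & correct_b), np.sum(correct_a & ~correct_b)],
        [np.sum(~correct_a & correct_b), np.sum(~correct_a & ~correct_b)],
    ])
    return mcnemar(table, exact=False, correction=True).pvalue
```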

Sentence-level evaluation provides a strict perspective on model performance. In practical use cases, such as guideline adherence checks and large-scale reporting analyses15, the user of such a model is likely to be interested in whether the model identifies the reported and missing checklist items for a given article and provides justification for the reported items. To accommodate such use cases, we also considered two article-level evaluation schemes: article (ANY) and article (1+). The former evaluates whether the model correctly predicts that the article includes at least one sentence relevant to the checklist item. The latter is similar to article (ANY) but also requires at least one sentence of overlap between the predictions for an item and the ground truth sentences for that item. Article (ANY) is the most lenient evaluation, whereas article (1+) is likely the most useful for practical purposes, because it also ensures that at least one correct supporting sentence is identified for the checklist item.
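The two article-level schemes can be expressed, for a single checklist item and article, as in the following sketch (function names are ours).

```python
# The two article-level schemes for a single checklist item, given the sets of predicted
# and gold sentence identifiers for one article (function names are ours).
def article_any(pred_sentences: set, gold_sentences: set) -> bool:
    """Correct if the model agrees on whether the item is reported at all."""
    return bool(pred_sentences) == bool(gold_sentences)

def article_one_plus(pred_sentences: set, gold_sentences: set) -> bool:
    """Like ANY, but a reported item also requires >= 1 correctly identified sentence."""
    if not gold_sentences:
        return not pred_sentences
    return bool(pred_sentences & gold_sentences)
```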

Results

High-level comparison of PubMedBERT and GPT-based models

Table 2 shows the performance of the sentence classification models. The results show that section headers contribute significantly to PubMedBERT performance. Prepending all relevant section headers outperforms prepending only the innermost or outermost section header. Incorporating positional information does not improve results. On the other hand, incorporating context from surrounding sentences yields the best performance, specifically by improving recall. We consider this model (PubMedBERT using surrounding context, prepending section headers to sentences, and using [CLS] token representation) as our main model. This model yields 0.71 micro-F1 and 0.67 macro-F1 with balanced precision and recall (0.72 and 0.71, respectively).

Table 2 Overall model performance for CONSORT sentence classification over 5-fold cross-validation. The evaluation is at the sentence level.

BioGPT fine-tuning yields a modest improvement over the baseline PubMedBERT model; however, it is outperformed by the PubMedBERT models that use richer input representations. Zero-shot in-context learning with GPT-4 yields poorer performance than the fine-tuned models. Surprisingly, providing example sentences (one-shot and five-shot) degraded GPT-4 performance further. GPT-4 showed improved recognition of some rare items, such as 7b (Interim Analysis/Stopping Guidelines), 9 (Allocation Concealment), and 11b (Similarity of Interventions), while its performance on common items, such as 6a (Outcomes) and 12a (Statistical Methods for Outcomes), was notably lower. Item-level results for the models are shown in Supplementary File Tables S2–S4.

Item-level results for the best-performing PubMedBERT model

The best-performing PubMedBERT model yields an F1 score over 0.8 for 8 of the 34 sentence-level items, all of which have more than 100 instances in the dataset (Supplementary File Table S2). The model performance remains relatively low in classifying infrequently reported items. The F1 score remains under 0.5 for another 9 items, most of which have fewer than 100 instances in the dataset (3b, 6b, 7b, 9, 12b, and 21). Some CONSORT items are multi-part (indicated by a and b in the item numbers), and in some cases the model struggles to distinguish these closely related items. For example, item 12b (Statistical Methods for Other Analyses) is often confused with 12a (Statistical Methods for Outcomes). The performance is highest for items related to the Introduction (0.89 F1 for 2b (Objectives)), followed by those in the Methods section (0.75 F1). It is lowest for Results-related items (0.62 F1). The performance on items not associated with specific sections (items 23–25) is over 0.8 F1.

Article-level evaluation for the best-performing PubMedBERT model

Article (ANY) and article (1+) evaluation results for the best-performing PubMedBERT model are provided in Supplementary File Table S2. The model reaches a high micro-F1 score of 0.92 and macro-F1 of 0.87 in article (ANY) evaluation, with 29 out of 36 items recognized with an F1 score of 0.8 or higher. In the more stringent article (1+) evaluation, we obtain 0.90 micro-F1 and 0.84 macro-F1, with 27 items recognized with an F1 score of 0.8 or higher. In both evaluation schemes, verification of infrequent items remains challenging. Both items 1a and 1b are recognized accurately, showing that the simple rules are sufficient for these items.

Data augmentation

We present samples generated via data augmentation in Supplementary File Table S5. GPT-4 is able to generate coherent sentences from the item descriptions in the "generative" setting. Using GPT-4 for rephrasing also seems to largely preserve the semantic content of the original sentence. On the other hand, the EDA and UMLS-EDA methods do not preserve meaning.

To assess the contribution of data augmentation, we used both the baseline PubMedBERT model and the best-performing model. The results are presented in Table 3. We observe that data augmentation does not improve the results of the best-performing model. For the baseline model (sentence text only), UMLS-EDA improves the results most (by 2 percentage points). A closer analysis reveals that the different methods improve performance on infrequent items (reflected by the increase in macro-F1), but this improvement is often offset by a performance reduction on more common items.

Table 3 Performance of CONSORT sentence classification with different data augmentation methods.

Comparison with section-specific models

The comparison of the model trained on full articles and the full label set with the models trained on specific sections and their related labels is shown in Table 4. Training a Methods-specific model on the Methods sentences yielded a better micro-F1 score than the single model trained on the full article. This finding held for the Results and Discussion sections, albeit to a smaller degree. The effect of section-specific training seems to be to improve precision at the cost of some recall. Macro-F1 scores were higher with the single model, suggesting that section-specific models primarily improve the performance of the common items.

Table 4 Performance of sentence classification models trained on specific sections or on the entire article.

Discussion

This study is the first to present an automated approach for recognizing all CONSORT checklist items in RCT results publications. The overall performance of the best model is reasonable (0.71 micro-F1 with balanced precision and recall at the sentence level). For common items such as Eligibility Criteria (4a), Outcomes (6a), Sample Size Determination (7a), and Registration (23), its performance is over 0.8 F1 score, indicating that the model could be used to recognize such items in practice. The performance is lowest on rare items, such as Changes to Trial Design (3b), Changes to Outcomes (6b), and Allocation Concealment (9). Recognition of CONSORT items at the article level is high (0.92 micro-F1 with article (ANY) and 0.90 micro-F1 with article (1+)). This suggests that the article-level predictions of the model can be used to indicate whether or not a publication reports a specific item and to provide at least one sentence supporting the prediction, which facilitates automatic screening of RCT publications by journals. The best-performing classifier is a fine-tuned PubMedBERT model that takes as input the target sentence as well as the surrounding sentences, each prepended with their section headers. This indicates the utility of longer context and document structure for the task, which is not surprising, given that some CONSORT items are reported over passages and section headers are sometimes directly related to a CONSORT item (e.g., Primary outcomes). The impact of incorporating longer context versus document structure is similar, and they act synergistically to further improve performance, although this additional improvement is small. We leave the investigation of whether even longer contexts could lead to further performance improvement to future work. An analysis of the errors made by the best-performing PubMedBERT model is presented below.

Generative models

The generative models for sentence classification (BioGPT fine-tuning and zero- or few-shot in-context learning with GPT-4) underperformed the best PubMedBERT model by significant margins. Similar to the PubMedBERT models, BioGPT did well on some common items (e.g., 7a (Sample Size Determination), 0.87 F1), while its performance was poor on rare items and some multi-part items (Supplementary File Table S3). BioGPT fine-tuning involved only the target sentence, and adding the surrounding sentences could possibly improve performance; however, fine-tuning BioGPT is much more computationally intensive than fine-tuning PubMedBERT. BioGPT is based on GPT-246, and using more recent domain-specific models such as PMC-LLaMA64 could be a more promising avenue.

In-context learning with GPT-4 failed to achieve satisfactory results, even for common items. Surprisingly, providing examples (one- or few-shot) did not improve upon the zero-shot setting. Existing studies point out that GPT models are sensitive to the prompts and even to the order of elements within them; therefore, it may be possible to design better prompts to enhance in-context learning. We randomly sampled demonstration examples for the one- and few-shot settings; selecting examples similar to the target sentence could improve results. At the same time, our results with GPT-4 are consistent with other comparisons of GPT-4 in-context learning with fine-tuned models for text classification53. A more comprehensive study of prompting strategies for the task is needed in the future.

Data augmentation

The effect of data augmentation on PubMedBERT fine-tuning was minimal, which is consistent with our previous findings22. To our surprise, the GPT-4-based approaches underperformed the EDA-based approaches, UMLS-EDA in particular, even though they produced more meaningful, generally semantically coherent sentences. Our findings with GPT-4 contrast with other studies, which found that synthetic data generation with LLMs led to improved performance on downstream tasks65. In GPT-4-based augmentation, we only provide the target sentence for rephrasing and let GPT-4 generate the corresponding preceding and trailing sentences. This may have led to inconsistencies between the generated sentences and reduced the effectiveness of this approach. Data augmentation had a more pronounced effect when used to enhance the baseline PubMedBERT model, in contrast to the best-performing model that uses longer contexts. This suggests that longer contexts could, to some extent, compensate for data scarcity. The limited effectiveness of data augmentation might also be due to the fact that three of the methods (EDA, UMLS-EDA, and GPT-4 rephrasing) rely on existing training examples and may not introduce enough diversity into the training set. The other method (GPT-4 generative) relies on label descriptions, which could also limit diversity among the generated sentences. Leveraging distant supervision approaches, such as active learning, could improve the diversity of the dataset and the generalizability of the models.

Section-specific training

Our comparison of section-specific model training with training a single CONSORT model was inconclusive. The Methods-specific model worked better on methodology items than the more comprehensive model, whereas the results for the Results and Discussion sections were mostly similar. Precision with section-specific training was notably higher, which may be desirable in some cases. However, because the differences are minor, it seems more efficient to train and perform predictions with a single full model.

Error analysis

We analyzed the errors made by the best-performing PubMedBERT model. Most error cases involved sentences predicted to report a CONSORT item different from the original label (37.2% of errors). In some of these cases, the true label was among the predictions, but additional labels were also predicted (8.3%). Similarly, for 6.3% of the errors, at least one true label was correctly predicted, but some other labels were missed. In 1.1% of the cases, there was at least one label overlap, while some labels were missed and others were incorrectly predicted. These types of partial overlaps, accounting for 15.7% of the total number of errors, could be considered less critical. About 29.3% of the errors involved negative sentences that were predicted to report a CONSORT item, and the rest (33.5%) were sentences labeled with a CONSORT item for which no CONSORT label was predicted. We provide samples of these error types in Table 5.

Table 5 Samples for different error types. The labels shown are Outcomes (6a), Outcome Results (17a), Binary Outcome Results (17b), and Ancillary Analyses (18).

Most confusion occurred between Outcome Results (17a) and the related items Binary Outcome Results (17b) and Ancillary Analyses (18), followed by Statistical Methods for Outcomes (12a) and Statistical Methods for Other Analyses (12b). We note that inter-annotator agreement for these labels was lower, as annotators often confused them as well21. For practical use, it might be sensible to merge these labels. The label that was most often completely missed or incorrectly predicted was Interpretation (22), which is a broad and diffuse category, similar to Background (2a), and its utility might be considered debatable.

To better understand model behavior, we further examined some false positives (cases where the model predicted a label but the sentence was not labeled in the ground truth) and false negatives (labeled sentences completely missed by the model). We found that in some false positive cases, the predicted sentences relate to the checklist item but do not contain enough information relevant to it. For example, for the sentence "Sample size was calculated from the study of Lee et al.", the model predicted the label Sample Size Determination (7a), even though the sentence does not discuss the specifics of the calculation. Similarly, the sentence "The remaining 12 adverse events in the cryotherapy group were either unrelated to the trial treatment or unlikely to be related." is predicted as Harms (19), although specific adverse events are not mentioned in the sentence. Apart from sentences reporting Interpretation (22), false negatives involved sentences that describe how outcomes were assessed or interventions were administered, as opposed to what the outcomes and interventions were. In these examples, the model correctly labeled other sentences in the same article that discuss the outcomes and interventions, so these errors may be less problematic. An example is "Nominated study staff at each site (general practitioner or practice nurse) verified the blood pressure measurements by an independent audit of the clinical details in the case records, outputs from the blood pressure monitor, and the computer decision tool.", which describes how an outcome was measured.

Limitations

Our study has several limitations. The dataset consists of a small number of articles in XML format from PubMed Central, which may not be fully representative of the RCT literature, and contains limited data for the infrequently reported CONSORT checklist items. We attempted to address this issue using data augmentation; however, the effect was minimal. Given the scarcity of data, it might be reasonable to resort to rule-based methods for some rare items, such as 3b (Changes to Trial Design). At the same time, it is necessary to develop larger datasets, which is challenging, as it requires significant domain expertise. Distant supervision approaches leveraging unlabeled data from the literature could be a promising avenue. Our exploration of generative models was limited. A more systematic exploration of prompting strategies is needed in future work.

We have focused on recognizing sentences reporting CONSORT checklist items, a first step toward assessing adherence (e.g., whether the statistical methods used are appropriate), which we have not attempted in this study. We leave this much more challenging task for future work.

Conclusions

In this study, we extended our earlier work to recognize all CONSORT checklist items in RCT publications. A fine-tuned PubMedBERT model using surrounding context and article structure yielded the best performance. We did not observe significant benefits from using LLMs for data augmentation or in-context learning, or from fine-tuning them. We also did not observe an advantage of training section-specific models.

In future work, we aim to improve the models further for practical use. We plan to achieve this by extending the annotated corpus and the models, and by making the models more efficient through techniques such as distillation. While active implementation of CONSORT in the peer review process has been shown to improve reporting quality, additional time requirements for editorial staff and a longer peer review process have also been noted17. With further enhancements, our models could speed up this process and assist journals in checking CONSORT compliance in a human-in-the-loop setting. They could also help authors improve the completeness and transparency of their manuscripts prior to peer review. We also plan to extend our models to assess the extent to which articles are CONSORT-compliant, potentially increasing their practical utility.