Introduction

Researchers performing systematic reviews (SRs) face bias at two potential levels: first, at the level of the SR methods themselves, and second, at the level of the included primary studies [1]. To safeguard correct interpretation of the review’s results, transparency is required at both levels. For bias at the level of the SR methods, this is ensured by transparent reporting of the full SR methods, at least to the level of detail required by the PRISMA statement [2]. For bias at the level of the included studies, study reporting quality (RQ) and/or risk of bias (RoB) are evaluated for each individual included study. Specific tools are available to evaluate RoB in different study types [3]. For the reporting of primary studies, multiple guidelines and checklists are likewise available to prevent important experimental details from being omitted, and more become available for different types of studies over time [4, 5]. Journal endorsement of these types of guidelines has been shown to improve study reporting quality [6].

While undisputedly important, evaluation of the RoB and/or RQ of the included studies is one of the most time-consuming parts of an SR. Experienced reviewers need 10 min to an hour to complete an individual RoB assessment [7], and every included study needs to be evaluated by two reviewers. Besides spending substantial amounts of time on RoB or RQ assessments, reviewers tend to become frustrated because the scores are frequently unclear or not reported (personal experience of the authors, colleagues, and students). While automation of RoB assessment seems to be possible without loss of accuracy [8, 9], so far this automation has not had a significant impact on speed; in a noninferiority randomised controlled trial of the effect of automation on person-time spent on RoB assessment, the confidence interval for the time saved ranged from −5.20 to +2.41 min [8].

In any scientific endeavour, there is a balance between reliability and speed; to guarantee reliability of a study, time investments are necessary. RoB or RQ assessment is generally considered to be an essential part of the systematic review process to warrant correct interpretation of the findings, but with so many studies scoring “unclear” or “not reported”, we wondered if all this time spent on RoB assessments is resulting in increased reliability of reviews.

Multiple reviews conclude that the overall risk of bias in the included primary studies is unclear, and these assessments are useful for pinpointing problems in reporting, thereby potentially improving the quality of future publications of primary studies. However, the direct goal of most SRs is to answer a specific review question, and in that respect, unclear RoB and not-reported RQ scores contribute little to the validity of the review’s results. If all included studies score “unclear” or “high” RoB on at least one of the analysed elements, the overall effect should be interpreted as inconclusive.

While it is challenging to properly evaluate the added validity value of a methodological step, we had data available allowing for an explorative case study to assess the informative value of various RoB and RQ elements in different types of studies. We previously performed an SR of the nasal potential difference (nPD) for cystic fibrosis (CF) in animals and humans, aiming to quantify the predictive value of animal models for people with CF [10, 11]. That review comprised between-subject comparisons of both baseline versus disease-control and treatment versus treatment control. For that review, we performed full RoB and RQ analyses. This resulted in data allowing for comparisons of RoB and RQ between animal and human studies, but also between baseline and treatment studies, which are both presented in this manuscript. RoB evaluations were based on the Cochrane collaboration’s tool [12] for human studies and SYRCLE’s tool [13] for animal studies. RQ was tested based on the ARRIVE guidelines [14] for animal studies and the 2010 CONSORT guidelines [15] for human studies. Brief descriptions of these tools are provided in Table 1.

Table 1 A brief description of the relevant reporting guidelines and risk-of-bias tools

All these tools are focussed on interventional studies. Lacking more specific tools for baseline disease-control comparisons, we applied them as far as relevant for the baseline comparisons. We performed additional analyses on our RQ and RoB assessments to assess the amount of distinctive information gained from them.

Methods

The analyses described in this manuscript are based on a case study SR of the nPD related to cystic fibrosis (CF). That review was preregistered on PROSPERO (CRD42021236047) on 5 March 2021 [16]. Part of the results were published previously [10]. The main review questions are answered in a manuscript that has more recently been published [11]. Both publications show a simple RoB plot corresponding to the publication-specific results.

For the ease of the reader, we provide a brief summary of the overall review methods. The full methods have been described in our posted protocol [16] and the earlier publications [10, 11]. Comprehensive searches were performed in PubMed and Embase, unrestricted for publication date or language, on 23 March 2021. Title-abstract screening and full-text screening were performed by two independent reviewers blinded to the other’s decision (FS and CL) using Rayyan [17]. We included animal and/or human studies describing nPD in CF patients and/or CF animal models. We restricted to between-subject comparisons, either CF versus healthy controls or experimental CF treatments versus CF controls. Reference lists of relevant reviews and included studies were screened (single level) for snowballing. Discrepancies were all resolved by discussions between the reviewers.

Data were extracted by two independent reviewers per reference in several distinct phases. Relevant to this manuscript, FS and CL extracted RoB and RQ data in Covidence [18], in two separate projects using the same list of 48 questions for studies assessing treatment effects and studies assessing CF-control differences. The k = 11 studies that were included in both parts of the overarching SR were included twice in the current data set, as RoB was separately scored for each comparison. Discrepancies were all resolved by discussions between the reviewers. In violation of the protocol, no third reviewer was involved.

RoB and RQ data extraction followed our review protocol, which states the following: “For human studies, risk of bias will be assessed with the Cochrane Collaboration’s tool for assessing risk of bias. For animal studies, risk of bias will be assessed with SYRCLE’s RoB tool. Besides, we will check compliance with the ARRIVE and CONSORT guidelines for reporting quality”. The four tools contain overlapping questions. To prevent unnecessary repetition of our own work, we created a single list of 48 items, ordered by topic for ease of extraction. For RoB, this list contains the same elements as the original tools, with the same response options (high/unclear/low RoB). For RQ, we created checklists with all elements listed in the original tools, with the response options reported yes/no. For (RQ and RoB) elements that applied to only some of the included studies, the response option “irrelevant” was added. We combined these lists, only changing the order and merging duplicate elements. We do not intend this list to replace the individual tools; it was created for this specific study only.

In our list, each question was preceded by a short code indicating the tool it was derived from (A for ARRIVE, C for CONSORT, and S for SYRCLE’s) to aid later analyses. When setting up the list, we started with the animal-specific tools, with which the authors are more familiar. After preparing data extraction for those, we observed that all elements from the Cochrane tool had already been addressed; therefore, the Cochrane tool was not explicitly coded in our extractions. The extraction form always allowed free text to support the response. Our extraction list is provided with our supplementary data.

For RoB, the tools provide relatively clear suggestions for which level to score and when, with signalling questions and examples [12, 13]. However, this still leaves some room for interpretation, and while the signalling questions are very educative, there are situations where the response would in our opinion not correspond to the actual bias. The RQ tools have been developed as guidelines on what to report when writing a manuscript, and not as a tool to assess RQ [14, 15]. This means we had to operationalise upfront which level we would find sufficient to score “reported”. Our operationalisations and corrections of the tools are detailed in Table 2.

Table 2 Operationalisation of the analysed tools

Analysis

Data were exported from Covidence into Microsoft Excel, where the two projects were merged and spelling and capitalisation were harmonised. Subsequent analyses were performed in R [21] version 4.3.1 (“Beagle Scouts”) via RStudio [22], using the following packages: readxl [23], dplyr [24], tidyr [25], ggplot2 [26], and crosstable [27].
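
As a rough sketch of this step (not the exact code we used), the two harmonised Covidence exports could be read and combined into one long table as follows; the file and column names are hypothetical.

```r
# Minimal sketch (hypothetical file and column names): read the two
# harmonised Covidence exports and combine them into one long table with
# one row per study-element combination.
library(readxl)
library(dplyr)

baseline  <- read_excel("covidence_baseline_export.xlsx")
treatment <- read_excel("covidence_treatment_export.xlsx")

scores <- bind_rows(
  baseline  %>% mutate(comparison = "baseline"),
  treatment %>% mutate(comparison = "treatment")
)
```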

Separate analyses were performed for RQ (with two levels per element) and RoB (with three levels per element). For both RoB and RQ, we first counted the numbers of irrelevant scores overall and per item. Next, irrelevant scores were deleted from further analyses. We then ranked the items by percentages for reported/not reported, or for high/unclear/low scores, and reported the top and bottom 3 (RoB) or 5 (RQ) elements.
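
A minimal sketch of this counting and ranking step, continuing from the hypothetical `scores` table above and assuming columns named `assessment` (RQ/RoB), `item`, and `score`, with the RQ response options coded as "yes"/"no"/"irrelevant":

```r
# Count irrelevant scores per RQ item, drop them, and rank the items by the
# percentage of studies reporting the element.
library(dplyr)

rq_ranked <- scores %>%
  filter(assessment == "RQ") %>%
  group_by(item) %>%
  summarise(
    n_irrelevant = sum(score == "irrelevant"),
    pct_reported = mean(score[score != "irrelevant"] == "yes") * 100,
    .groups = "drop"
  ) %>%
  arrange(desc(pct_reported))

head(rq_ranked, 5)  # five most-reported elements
tail(rq_ranked, 5)  # five least-reported elements
```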

While 100% reporting is most informative for understanding what actually happened in the included studies, if all authors consistently report a specific element, scoring that element in an SR is not the most informative exercise for meta-researchers. If an element is not reported at all, this is bad news for the overall level of confidence in an SR, but evaluating it per included study is also not very efficient, except for highlighting problems in reporting, which may help to improve the quality of future (publications of) primary studies. For meta-researchers, elements with variation in reporting may be considered the most interesting, because these elements highlight differences between the included studies. Subgroup analyses based on specific RQ/RoB scores can help to estimate the effects of specific types of bias on the overall effect size observed in meta-analyses, as has been done, for example, for randomisation and blinding [28]. However, these types of subgroup analyses are only possible if there is some variation in the reporting. Based on this idea, we defined a “distinctive informative value” (DIV) for RQ elements, taking 50% reported as the optimal variation and either 0% or 100% reporting as minimally informative. This “DIV” was calculated as follows:

$$\mathrm{DIV} = 50 - \left|\,\%\mathrm{Y} - 50\,\right|$$
$$\text{with } \%\mathrm{Y} = \%\ \text{reported}$$

Thus, the DIV could range from 0 (no informative value) to 50 (maximally informative), visualised in Fig. 1.
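
For concreteness, the calculation reduces to a one-line function; the example percentages below are taken from the “Results” section.

```r
# DIV is 0 when an element is reported by none or all of the studies and 50
# when exactly half of the studies report it.
div <- function(pct_reported) 50 - abs(pct_reported - 50)

div(c(0, 26.2, 50, 64.6, 100))
#> [1]  0.0 26.2 50.0 35.4  0.0
```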

Fig. 1 Visual explanation of the DIV value

The DIV value was only used for ranking. The results were visualised in a heatmap, in which the intermediate shades correspond to high DIV values.
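
A minimal ggplot2 sketch of such a heatmap, building on the hypothetical `scores` table above (column names and colour scale are illustrative, not the exact code behind Fig. 2):

```r
# One tile per element and study category, filled by the percentage of
# studies reporting the element; mid-range shades mark the high-DIV items.
library(dplyr)
library(ggplot2)

plot_data <- scores %>%
  filter(assessment == "RQ", score != "irrelevant") %>%
  group_by(item, study_type) %>%   # study_type: animal/human x baseline/treatment
  summarise(pct_reported = mean(score == "yes") * 100, .groups = "drop")

ggplot(plot_data, aes(x = study_type, y = item, fill = pct_reported)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "darkblue", limits = c(0, 100)) +
  labs(x = NULL, y = NULL, fill = "% reported")
```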

For RoB, no comparable measure was calculated. With only 10 elements but at 3 distinct levels, we thought a comparable measure would sooner hinder interpretation of informative value than help it. Instead, we show the results in an RoB plot split by population and study design type.

Because we are interested in quantifying the predictive value of animal models for human patients, we commonly perform SRs including both animal and human data (e.g. [29, 30]). The dataset described in the current manuscript contained baseline and intervention studies in animals and humans. Because animal studies are often held responsible for the reproducibility crisis, and to increase the external validity of this work, explorative chi-square tests (the standard statistical test for comparing percentages of binary variables) were performed to compare RQ and RoB between animal and human studies and between studies comparing baselines and studies assessing treatment effects. They were performed with the base R “chisq.test” function. No power calculations were performed, as these analyses were not planned.
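
As an illustration of one such test (the counts below are made up for the example and are not our actual data), comparing the reporting of a single RQ element between animal and human studies reduces to a chi-square test on a 2 × 2 contingency table:

```r
# Hypothetical 2 x 2 table: reporting of one RQ element by population.
reporting <- matrix(c(20, 58,    # animal studies: reported / not reported
                      55, 31),   # human studies:  reported / not reported
                    nrow = 2, byrow = TRUE,
                    dimnames = list(population = c("animal", "human"),
                                    reported   = c("yes", "no")))
chisq.test(reporting)
```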

Results

Literature sample

We extracted RoB and RQ data from 164 studies that were described in 151 manuscripts. These manuscripts were published from 1981 through 2020. The 164 studies comprised 78 animal studies and 86 human studies; 130 studies compared CF with non-CF controls and 34 assessed experimental treatments. These numbers are detailed in a cross-tabulation (Table 3).

Table 3 Cross-tabulation of included comparisons

The 48 elements in our template were completed for these 164 studies, resulting in 7872 assessed elements (48 × 164). In total, 954 elements (12.1%) were scored as irrelevant for various reasons (mainly in noninterventional studies and in human studies). The 7872 individual scores per study are available from the data file on OSF.

Of the 48 questions in our extraction template, 38 addressed RQ, and 10 addressed RoB.

Overall reporting quality

Of the 6232 elements related to RQ, 611 (9.8%) were deemed irrelevant. Of the remainder, 1493 (26.6% of 5621) were reported. The most reported elements were background of the research question (100% reported), objectives (98.8% reported), interpretation of the results (98.2% reported), generalisability (86.0% reported), and the experimental groups (83.5% reported). The least-reported elements were protocol violations, interim analyses + stopping rules and when the experiments were performed (all 0% reported), where the experiments were performed (0.6% reported), and all assessed outcome measures (1.2% reported).

The elements with most distinctive variation in reporting (highest DIV, refer to the “Methods” section for further information) were as follows: ethics evaluation (64.6% reported), conflicts of interest (34.8% reported), study limitations (29.3% reported), baseline characteristics (26.2% reported), and the unit of analysis (26.2% reported). RQ elements with DIV values over 10 are shown in Table 4.

Table 4 Distinctive informative values of at least 10 within the current sample

Overall risk of bias

Of the 1640 elements related to RoB, 343 (20.9%) were deemed irrelevant. Of the remainder, 219 (16.9%) scored high RoB, and 68 (5.2%) scored low RoB. The overall RoB scores were highest for selective outcome reporting (97.6% high), baseline group differences (19.5% high), and other biases (9.8% high); lowest for blinding of participants, caregivers, and investigators (13.4% low), blinding of outcome assessors (11.6% low), and baseline group differences (8.5% low); and most unclear for bias due to animal housing (100% unclear), detection bias due to the order of outcome measurements (99.4% unclear), and selection bias in sequence generation (97.1% unclear). Baseline group differences appear among both the highest and the lowest RoB scores because baseline values were reported better than the other measures, resulting in fewer unclear scores.

Variation in reporting is relatively high for most of the elements scoring high or low. Overall distinctive value of the RoB elements is low, with most scores being unclear (or, for selective outcome reporting, most scores being high).

Animal versus human studies

For RQ, the explorative chi-square tests indicated differences in reporting between animal and human studies for baseline values (χ²(1) = 50.3, p < 0.001), ethical review (χ²(1) = 5.1, p = 0.02), type of study (χ²(1) = 11.2, p < 0.001), experimental groups (χ²(1) = 3.9, p = 0.050), inclusion criteria (χ²(1) = 24.6, p < 0.001), the exact n value per group and in total (χ²(1) = 26.0, p < 0.001), (absence of) excluded datapoints (χ²(1) = 4.5, p = 0.03), adverse events (χ²(1) = 5.5, p = 0.02), and study limitations (χ²(1) = 8.2, p = 0.004). These explorative findings are visualised in a heatmap (Fig. 2).

Fig. 2 Heatmap of reporting by type of study. Refer to Table 3 for absolute numbers of studies per category

For RoB, the explorative chi-square tests indicated differences in risk of bias between animal and human studies for baseline differences between the groups (χ²(2) = 34.6, p < 0.001) and incomplete outcome data (χ²(2) = 7.6, p = 0.02). These explorative findings are visualised in Fig. 3.

Fig. 3 Risk of bias by type of study. Refer to Table 3 for absolute numbers of studies per category. Note that the data shown in these plots overlap with those in the two preceding publications [10, 11]

Studies assessing treatment effects versus studies assessing baseline differences

For RQ, the explorative chi-square tests indicated differences in reporting between comparisons of disease with control versus comparisons of treatment effects for the title listing the type of study (χ²(1) = 5.0, p = 0.03), the full paper explicitly mentioning the type of study (χ²(1) = 14.0, p < 0.001), explicit reporting of the primary outcome (χ²(1) = 11.7, p < 0.001), and reporting of adverse events (χ²(1) = 25.4, p < 0.001). These explorative findings are visualised in Fig. 2.

For RoB, the explorative chi-square tests indicated differences in risk of bias between comparisons of disease with control versus comparisons of treatment effects for baseline differences between the groups (χ²(2) = 11.4, p = 0.003), blinding of investigators and caretakers (χ²(2) = 29.1, p < 0.001), blinding of outcome assessors (χ²(2) = 6.2, p = 0.046), and selective outcome reporting (χ²(2) = 8.9, p = 0.01). These explorative findings are visualised in Fig. 3.

Overall, our results suggest lower RoB and higher RQ for human treatment studies compared to the other study types.

Discussion

This literature study shows that reporting of experimental details is low, frequently resulting in unclear risk-of-bias assessments. We observed this both for animal and for human studies, with two main study designs: disease-control comparisons and, in a smaller sample, investigations of experimental treatments. Overall reporting is slightly better for elements that contribute to the “story” of a publication, such as the background of the research question, interpretation of the results and generalisability, and worst for experimental details that relate to differences between what was planned and what was actually done, such as protocol violations, interim analyses, and assessed outcome measures. The latter also results in overall high RoB scores for selective outcome reporting.

Of note, we scored this more stringently than SYRCLE’s RoB tool [13] suggests and always scored a high RoB if no protocol was posted, because only comparing the “Methods” and “Results” sections within a publication would, in our opinion, result in an overly optimistic view. Within this sample, only human treatment studies reported posting protocols upfront [31, 32]. In contrast to selective outcome reporting, we would have scored selection, performance, and detection bias due to sequence generation more liberally for counterbalanced designs (Table 2), because randomisation is not the only appropriate method for preventing these types of bias. Particularly when blinding is not possible, counterbalancing [33, 34] and Latin-square like designs [35] can decrease these biases, while randomisation would risk imbalance between groups due to “randomisation failure” [36, 37]. We would have scored high risk of bias for blinding for these types of designs, because of increased sequence predictability. However, in practice, we did not include any studies reporting Latin-square-like or other counterbalancing designs.

One of the “non-story” elements that is reported relatively well, particularly for human treatment studies, is the blinding of participants, investigators, and caretakers. This might relate to scientists being more aware of potential bias of participants; they may consider themselves to be more objective than the general population, while the risk of influencing patients could be considered more relevant.

The main strength of this work is that it is a full formal analysis of RoB and RQ in different study types: animal and human, baseline comparisons, and treatment studies. The main limitation is that it is a single case study from a specific topic: the nPD test in CF. The results shown in this paper are not necessarily valid for other fields, particularly as we hypothesise that differences in scientific practice between medical fields relate to differences in translational success [38]. Thus, it is worthwhile to investigate field-specific informative values before selecting which elements to score and analyse in detail.

Our comparisons of different study and population types show lower RoB and higher RQ for human treatment studies compared to the other study types for certain elements. Concerning RQ, the effects were most pronounced for the type of experimental design being explicitly mentioned and for the reporting of adverse events. Concerning RoB, the effects were most pronounced for baseline differences between the groups, blinding of investigators and caretakers, and selective outcome reporting. Note, however, that the number of included treatment studies is much lower than the number of included baseline studies, and that these comparisons were based on only k = 12 human treatment studies. Refer to Table 3 for absolute numbers of studies per category. In addition, our comparisons may be confounded to some extent by publication date. The nPD was originally developed for human diagnostics [39, 40], and animal studies only started to be reported later [41]. Likewise, the use of the nPD as an outcome in (pre)clinical trials of investigational treatments originated at a later date [42, 43].

Because we did not collect our data to assess time effects, we did not formally analyse them. However, we had an informal look at the publication dates by RoB score for blinding of the investigators and caretakers, and by RQ score for ethics evaluation (in box plots with dot overlay), showing more reported and fewer unclear scores in the more recent publications (data not shown). While we thus cannot rule out confounding of our results by publication date, the results are suggestive of mildly improved reporting of experimental details over time.

This study is a formal comparison of RoB and RQ scoring for two main study types (baseline comparisons and investigational treatment studies), for both animals and humans. Performing these comparisons within the context of a single SR [16] resulted in a small, but relatively homogeneous sample of primary studies about the nPD in relation to CF. At conferences and from colleagues in the animal SR field, we had heard that reporting would be worse for animal than for human studies. Our comparisons allowed us to show that, particularly for baseline comparisons of the nPD in CF versus control, this is not the case.

The analysed tools [12, 13, 15] were developed for experimental interventional studies. While some of the elements are less appropriate for other types of studies, such as animal model comparisons, our results show that many of the elements can be used and could still be useful, particularly if the reporting quality of the included studies were better.

Implications

To correctly interpret the findings of a meta-analysis, awareness of the RoB in the included studies is more relevant than the RQ on its own. However, it is impossible to evaluate the RoB if the experimental details have not been reported, resulting in many unclear scores. With at least one unclear or high RoB score per included study, the overall conclusions of the review become inconclusive. For SRs of overall treatment effects that are performed to inform evidence-based treatment guidelines, RoB analyses remain crucial, even though the scores will often be unclear. Ideally, especially for SRs that will be used to plan future experiments/develop treatment guidelines, analyses should only include those studies consistently showing low risk of bias (i.e. low risk on all elements). However, in practice, consistently low RoB studies in our included literature samples (> 20 SRs to date) are too scarce for meaningful analyses. For other types of reviews, we think it is time to consider if complete RoB assessment is the most efficient use of limited resources. While these assessments regularly show problems in reporting, which may help to improve the quality of future primary studies, the unclear scores do not contribute much to understanding the effects observed in meta-analyses.

With PubMed already indexing nearly 300,000 records mentioning the term “systematic review” in the title, abstract, or keywords, we can assume that many scientists are spending substantial amounts of time and resources on RoB and RQ assessments. Particularly for larger reviews, it could be worthwhile to restrict RoB assessment to either a random subset of the included publications or a subset of relatively informative elements. Even a combination of these two strategies may be sufficiently informative if the results of the review are not directly used to guide treatment decisions. The subset could give a reasonable indication of the overall level of evidence of the SR while saving resources. Different suggested procedures are provided in Table 5. The authors of this work would probably have switched to such a strategy early in the data extraction phase, had the funder not stipulated full RoB assessment in its funding conditions.

Table 5 Examples of potential SR procedures to evaluate the included studies and when to use them

We previously created a brief and simple taxonomy of systematised review types [44], in which we advocate RoB assessments to be a mandatory part of any SR. We would still urge anyone calling their review “systematic” to stick to this definition and perform some kind of RoB and/or RQ assessment, but two independent scientists following a lengthy and complex tool for all included publications, resulting in 74.6% of the assessed elements not being reported, or 77.9% unclear RoB, can, in our opinion, in most cases be considered inefficient and unnecessary.

Conclusion

Our results show that there is plenty of room for improvement in the reporting of experimental details in medical scientific literature, both for animal and for human studies. With the current status of the primary literature as it is, full RoB assessment may not be the most efficient use of limited resources, particularly for SRs that are not directly used as the basis for treatment guidelines or future experiments.