Introduction

The estimated prevalence of chronic hepatitis B virus (HBV) infection worldwide in 2016 was 3.5%, with approximately 257 million people living with chronic hepatitis B (CHB)1, in whom up to 40% of HBV infections may be progress to decompensated cirrhosis and hepatocellular carcinoma (HCC)2. Timely and appropriate antiviral therapy can inhibit viral replication and prevent disease progression, but not all HBV-infected patients require antiviral therapy. The American association for the study of liver diseases (AASLD) recommend that patients with early liver fibrosis (F < 2) need only follow-up observation, while those with significant liver fibrosis (F ≥ 2) definitely require antiviral therapy3,4. Therefore, there is a pressing, to search for easily available, relatively inexpensive and reliable biomarkers to assess the stages of liver fibrosis with non-invasive techniques are an urgent.

At present, liver biopsy is still the gold standard when it comes to determine the degree of liver fibrosis. However, biopsy by sampling error, the difference between the observer and the limitation of various potential complications, cause for stable and asymptomatic patients with early liver fibrosis5,6,7. In view of this, non-invasive methods are replacing liver biopsy to evaluate the degree of liver fibrosis. In recent years, the serological detection gets more and more attention, and profit from serum detect is easy to obtain, simple to use, and lower cost, especially in the application of routine screening8. Most of these serological detections were developed in viral hepatitis or nonalcoholic fatty liver disease (NAFLD). In general, these tests are assessed by combining markers such as liver injury, complications of poral hypertension, or markers that measure fibrosis generation9. However, it should be noted that the sensitivity and specificity of individual serological biomarkers are significantly different, and making them insufficient to accurately evaluate the stages of liver fibrosis in CHB patients. Therefore, there is an urgent and unmet clinical need to find biomarkers that are readily available, relatively inexpensive, and reliable and that can be assessed with less invasive or even noninvasive techniques for assessing the stages of liver fibrosis.

At this stage, the liver fibrosis assessment model combined with multiple serological biomarkers, including Fibrosis 4 score (FIB-4), aspartate aminotransferase to platelet ratio index (APRI) and other classic assessment indicators, is an effective method. However, these methods have reduced the sensitivity and specificity of the test results due to the influence of CHB complications, which can’t accurately reflect the results of liver biopsy10,11. Therefore, highly accurate and reliable models are urgently needed in clinical practice to improve decision support for liver fibrosis assessment in CHB patients. In recent years, the application and development of large data and artificial intelligence technology has pioneered the new medical decision-making system, especially the use of ML to improve the reliability and accuracy of diagnostic systems for specific diseases. A preliminary study applying ML to nonalcoholic steatohepatitis (NASH) histology demonstrated the feasibility of this method in assessing liver tissue12,13. In this study, we assumed that using new combination of serological biomarkers to construct a predictive model through ML can allows for accurate diagnosis of liver fibrosis in CHB patients. The potential utility of the ML-based risk stratification approach may be demonstrated in comparison with the results of the traditional classic serological model and pathological evaluation.

In conclusion, the purpose of this study was to identify novel predictors of liver fibrosis staging from clinical and serological datasets. Furthermore, a high-flux ML-based predictive model was developed and validated in two independent cohorts based on the identified predictors for the accurate identification of liver fibrosis staging in CHB patients to aid in clinical decision-making.

Methods

This retrospective study was approved by the Ethics Committee of Zhejiang Provincial People’s Hospital. We confirm that all methods were performed in accordance with relevant guidelines and regulations. We retrospectively analyzed the clinical data of 1189 CHB patients from two independent centers, including general information and laboratory data obtained within 1 week prior to liver biopsy. HBV-infected patients over 18 years old with hepatitis B surface antigen (HBsAg) positivity for at least 6 months were included14. The exclusion criteria were as follows: ① antiviral therapy; ② incomplete clinical data; ③ other causes of liver disease, such as hepatitis C virus (HCV), alcoholic liver disease, NASH, or autoimmune liver disease. The detailed process of liver biopsy and pathological diagnosis can be found in the Supplementary Material. In this study, we used the dataset from Zhejiang Provincial People’s Hospital as a training cohort to build the model, and we also included 571 HBV-infected patients from the First Affiliated Hospital of Zhejiang University as an independent external validation cohort for the model to further determine the generalizability of the model. We confirm that informed consent was obtained from all subjects and their legal guardian. The flowchart of patient enrollment is shown in Fig. 1.

Figure 1
figure 1

Flowchart of patient recruitment and study design.

Statistical analyses of the data were performed with SPSS software (version 25.0, SPSS Inc, IBM,) R 3.5.1 and Python 3.5.6. Quantitative data was tested for normality using the Kolmogorov–Smirnov test and met the normal distribution using the independent sample t-test. Non-normally distributed data was analyzed using the Mann–Whitney U test. Counting data was expressed as frequency numbers, and between groups were compared using χ2 test. The calculated AUC and 95% confidence interval (CI) were used to evaluate the performance of the ML model for the prediction of grade liver fibrosis. A two-tailed P value < 0.05 indicated statistical significance.

Results

There were significant differences in Alkaline phosphatase, TBIL, DB, AST/ALT and ALT-square among the training cohort and external validation cohort (P < 0.05), but there were no significant differences among other characteristics (P > 0.05), and detailed feature analysis can be found in Table 1.

Table 1 Characteristics of patients in the two cohorts.

This study first performed maximum relevance minimum redundancy (mRMR) to exclude redundancy features, and a total of 10 features remained, includes gender, age, HBV-DNA, International NORMALIZED Ratio (INR), thrombin time (TT), Mean erythrocyte volume, albumin, Gamma Glutamyl Transferase (GGT) and Alkaline phosphatase. Second, after dimensionality reduction based on GBDT, 5 features remained. Among them, the weight of HBV-DNA was the highest at 0.261, followed by PLT (0.214), TT (0.183), albumin (0.179) and INR (0.162) (See Figure S2). The variance inflation factor (VIF) is 1.128, 1.01, 1.101, 1.017 and 1.099, which shows no collinearity between the features. The detailed process and results of dimensionality reduction can be found in the Supplementary Materials.

The model was constructed using the 5 selected features based on the training dataset (Fig. 2). The model constructed by the SVM ML method showed the best stability, with a mean AUC and RSD of 0.7848 (SD 0.01223) and 1.5583, respectively. The model constructed by the DT ML method showed the best diagnostic performance, with a mean AUC and RSD of 0.8617 (SD 0.02377) and 2.7585, respectively (Fig. 3 and Table S2). To select the best-performing model, the SVM and DT models were applied to the training cohort and the external validation cohort. The RSDs calculated based on the AUC values of the two groups were 1.347 and 0.116, respectively (Table S3). Therefore, the DT model showed the best generalized performance among different independent datasets, and its mode visualization can be seen in Fig. 4.

Figure 2
figure 2

Flowchart of model construction and validation.

Figure 3
figure 3

Density chart of AUC values obtained by each machine learning model after 100 resamplings. The SVM model showed the highest density accumulation, and the DT model showed the highest AUC value. SVM support vector machine, KNN K-nearest neighbor, DT decision tree, RF random forest, ML machine learning.

Figure 4
figure 4

DT model visualization. The square boxes represent clinical outcomes. Darker shades of purple indicate a greater predicted probability value. In this study, the probability value 0 represents the pathological result of F0-1, the probability value 1 represents the pathological result of F2, the probability value 2 represents the pathological result of F3, and the probability value 3 represents a pathological result of F4. GGT glutamyl transpeptidase, INR international normalized ratio, PLT platelet.

The AUC values of the DT model for assessment of liver fibrosis stages (F0-1, F2, F3 and F4) in the training cohort were 0.898, 0.891, 0.907 and 0.944, respectively. The AUC values of the DT model for assessment of liver fibrosis stages (F0-1, F2, F3 and F4) in the external validation cohort were 0.906, 0.876, 0.931 and 0.933, respectively. The calibration curve of the model showed good agreement with the ideal curve. The Hosmer–Lemeshow test also showed no significant difference between the predicted calibration curve and the ideal curve (P > 0.05) (Fig. 5). In addition, Delong’s test showed that the diagnostic performance of the DT model was significantly different from those of the mixed features in the training cohort and the external validation cohort, highlighting the improvement in the prediction performance of the DT model, as shown in Table 2.

Figure 5
figure 5

ROC curves of the tree model in the training cohort (A) and external validation cohort (B). The calibration curves showed the agreement between the predictive performance and the actual results in the training cohort (C) and external validation cohort (D). The diagonal line represents perfect predictive performance, and the color line represents the model classification performance.

Table 2 Diagnostic performance of the ML model and traditional serological model for assessment of hepatic fibrosis stages in the two cohorts.

The DT model calculated the liver fibrosis probability of all patients, and identify early stage of fibrosis(F0-1), significance fibrosis(F2), advanced fibrosis(F3) and Cirrhosis(F4) according to the optimal diagnostic threshold (F0-1 cut-off value: > 0.5, F2 cut-off value: > 0.25714, F3 cut-off value: > 0.05714 and F4 cut-off value: > 0.03846). We observed that there were no significant differences in the number of cases diagnosed by pathology between the four groups, whether in the training or external validation cohorts or even in the entire study cohort (P > 0.05, Table 3), which indicated that the model risk classification effect of this study was good. For all 1189 patients, the Delong’s test showed that the performances of ML model did not significant differences among F0-1, F2, F3 and F4 regarding to different inflammation subgroups (P > 0.05, Table 4 and Fig. 6).

Table 3 Clinical classification performance of machine learning model.
Table 4 Diagnostic performance comparison of ML model for patients with different inflammation levels.
Figure 6
figure 6

The diagnostic performance of ML model recognizes liver fibrosis grade in mild inflammation (A), moderate inflammation (B) and severe inflammation (C) groups, respectively. The calibration curves showed the agreement between the predictive performance and the actual results in the mild inflammation (D), moderate inflammation (E) and severe inflammation (F) groups. The diagonal line represents perfect predictive performance, and the color line represents the model classification performance.

Discussion

In this multicenter study, we designed a prediction model based on ML to accurately assessment liver fibrosis stages of CHB patients. Compared with traditional statistical models such as APRI or FIB-4, and ML model demonstrated significant improvements and was easy to process, which also suggested the great potential of ML in the field of noninvasive liver fibrosis evaluation. In addition, our study results indicated that ML model provided similar diagnostic efficacy with the reference standard liver biopsy, which may provide a reliable theoretical basis for the further development of simple, easy-to-use and accurate tools for the evaluation of liver fibrosis.

In this study, we used ML methods with the hope of more accurately assessing the staging of liver fibrosis, thereby improving the accuracy of the model. The final results revealed that our model showed superior accuracy compared to traditional serological models such as APRI or FIB-4. It is also significantly higher than the diagnostic efficacy of seventeen noninvasive liver fibrosis models in Chinese patients with hepatitis B mentioned in the study of Li et al.19. In addition, stratification analysis in inflammation subgroups was performed, and the results did show no significant impact on the performance of ML model. These findings suggest that ML model may overcome the influence of inflammation for cirrhosis evaluation, which is likely to be a potential breakthrough in non-invasive diagnosis. This was helped by a new approach to model building that had the following main advantages. First, we compared the performance of models constructed by several ML methods, and then we focused on and validated the DT model because of its better performance and ease of use. In fact, the DT model has been applied to evaluate hepatitis C liver fibrosis and has shown significant performance20. In addition, previous studies mainly used a classification method (logistic regression analysis)21, and features were selected through univariate tests (t tests, Welch tests, etc.) in many patients22,23. However, this method is often overly optimistic, prone to overfitting, and difficult to reproduce. To overcome these problems, we used integration algorithms, including mRMR and GBDT, to remove redundant features to prevent multicollinearity, and we used only high-scoring variables to construct prediction models to avoid overfitting. Second, our model allows patients to be assessed by a single blood draw without the need for additional modalities. This concept is particularly attractive for routine screening of people at high risk of disease development, such as those with advanced or severe liver fibrosis, in primary care settings. These cases which clinically suspected severe liver fibrosis previously required puncture pathology to be confirmed. However, now only need to routine serological examination to judge the probability of severe liver fibrosis, so invasive puncture examination can be avoided. Therefore, it has obvious advantages in terms of cost and prognosis. In addition, our method can be used to construct a similar model visualization to distinguish early liver fibrosis from significant liver fibrosis, and does not require specially trained clinicians, which is more convenient for clinicians in practice and of great value for clinical promotion.

In this study, we also hoped to improve the diagnostic performance of the model by identifying more specific markers and constructing the model based on the combination of known serologically relevant features. We integrated some of the most routine serological markers, in contrast to Zeng et al., who used laboratory markers such as B2-macroglobulin, haptoglobin and apolipoprotein A1, which are not commonly used in most hospitals24. Although these laboratory markers may show higher accuracy than routine serological markers, they are not suitable for practical clinical application. Our results showed of the five conventional serological markers used to construct the ML model, HBV-DNA had the greatest contribution to the model, which is consistent with the recommendation of some guidelines that patients with high HBV-DNA levels should be evaluated for noninvasive liver fibrosis4,25. HBV DNA is the marker for viral replication. For chronic HBV infection, the development of the disease is a dynamic process, and the infection status also exists for a long time. For patients with chronic HBV infection in the indeterminate phase, the results of examination alone may not be able to accurately assess the natural history stage, so dynamic follow-up observation is needed. Studies have shown that HBV DNA levels correlated with significant fibrosis in HBeAg(−) CHB patients. HBV DNA level could predict liver fibrosis in HBeAg(−) CHB patients with biopsy indication26,27.

In addition, two coagulation factors including INR and TT were integrated into the model, although the two coagulation factors are closely related in clinical practice28,29, which was may lead to over fitting of the model and overestimate the role of coagulation factors. However, we calculated the VIF value of relevant factors and did not show collinearity. Therefore, we speculate that the contribution of coagulation factors to the model should not be overestimated.

It is well known that distinguishing F0-1 from F2-4 is more challenging in many studies30,31, which is because the heterogeneity of liver fibrosis in patients with F ≥ 2 liver fibrosis is more serious than that in those with F ≥ 3 and 4 liver fibrosis, which generally reduces the accuracy of all classification strategies. In fact, our research results confirm that DT model has the lowest accuracy (AUC of 0.891 in training cohort and AUC of 0.876 in Validation cohort) in identifying patients with liver fibrosis grade F2. However, DT model shows high accuracy and excellent stability for each fibrosis grade in two cohorts, especially in identifying liver cirrhosis (F4), which was shows this model could be used to refine phenotypes in large research studies. Our study result also showed that the highest overall recognition rate for patients with liver cirrhosis (F4) was higher than that for patients with other stages of liver fibrosis when the model was used to classify risk prediction in the two cohorts or the whole cohort. These results suggested that our ML model may be part of a more accurate preclinical detection pathway to assess liver cirrhosis and may be used for the screening and treatment of liver cirrhosis in HBV-infected patients in routine clinical environments, although this needs to be validated in prospective studies.

This study has some limitations. First, this study was a retrospective study, which may lead to the simulation of retrospective statistics depending on too many assumptions. Future research should focus on the development of prediction and classification models based on prospective research, which will allow time evolution information to be used to evaluate, modify and reevaluate prediction models. Second, the model itself needs to be further optimized through better engineering and further development through more comprehensive integration of other clinical data to improve the overall performance of the model and achieve a more accurate noninvasive diagnosis of liver fibrosis staging. Finally, our study did not investigate the performance of ML model for classifying patients with CHB of different ethnic populations, which are also worthy of further studies in the future. Of course, in this study, we still emphasize that as conceptual research, it can still provide a certain basis for the real clinical practice in the future, although this future still needs a long way to go.

In conclusion, this study demonstrated that ML model was more accurate than traditional serological mixed biomarkers in assessing all four liver fibrosis stages in patients with CHB. In addition, the results of this study promote the goal of assessing liver fibrosis in CHB patients and improving the existing prognostic models, thereby facilitating a future prospective study design and evaluation and clinical disease surveillance and treatment. We also hope to further refine and expand this work to clarify the application of this model to a wider range of liver fibrotic diseases.