Background

Pulmonary tuberculosis (PTB) is a highly infectious disease caused by Mycobacterium tuberculosis (MTB) infection that poses a significant threat to human life and health. It has been referred to as “phthisis” in ancient Chinese folk medicine. This disease has been historically associated with large-scale epidemics and high mortality rates, resulting in the name “white plague” [1]. Global data from the World Health Organization (WHO) reveal that approximately 1.7 billion people worldwide have latent PTB infection, accounting for 23% of the global population. China is among the 30 countries with a high prevalence of PTB, with 780,000 new cases reported in 2021, representing 7.4% of the total global new cases and ranking third worldwide [2]. In Kashgar, epidemiological surveys have shown a high incidence of PTB, reaching 315.50 per 100,000 people, which is 5.21 times the national average [3].

PTB is characterized by a long course and long treatment duration, often lasting approximately six months or longer. This imposes a significant economic burden on patients and their families, impacting employment opportunities and worsening financial hardships. According to WHO data, tuberculosis is the seventh leading cause of death in low- and middle-income countries and ranks seventh in terms of the overall economic burden of disease worldwide [4]. Among more than thirty countries with a high burden of tuberculosis worldwide, China ranks second in terms of new tuberculosis cases and cases of multidrug-resistant tuberculosis. While China has made progress in tuberculosis control through the implementation of the DOTS strategy, the country still faces challenges in tuberculosis prevention and control, with the incidence of PTB decreasing by approximately 61% from 2000 to 2010. However, this reduction has not significantly changed the situation in high-burden tuberculosis countries, highlighting the ongoing severity of the disease. The healthcare costs associated with tuberculosis in China have been continuously increasing. From 2006 to 2018, the total healthcare costs and average cost per patient for tuberculosis increased, with a budget of 7.19 million US dollars allocated for tuberculosis healthcare costs in 2019. Despite China’s implementation of a free service policy for tuberculosis, patients still need to allocate a significant portion of their annual income to treatment costs. This has earned tuberculosis, the reputation of being a “disease of poverty,” pushing many families into or back into financial hardship.

With the development of machine learning algorithms, predictive models have been widely applied in fields such as education, finance, and healthcare. Recently, there has been a growing interest in predictive models targeting high-cost patients to reduce healthcare expenditures compared to interventions targeting the entire population [5]. Existing “medical cost prediction models” can be broadly classified into two categories [6]: rule-based prediction models and supervised machine learning models. Rule-based models utilize traditional algorithms based on predefined cost rules. However, these models require substantial domain knowledge and may struggle to adapt to complex data in practical scenarios. On the other hand, supervised machine learning models, such as random forests and support vector machines, can capture relationships between data. However, these models require extensive feature engineering for different characteristics and may not adapt well to high-dimensional data. Despite these drawbacks, both types of models have shown good prediction performance in various domains, including house price prediction and the prediction of diagnosis and treatment costs for chronic diseases such as hepatitis B and coronary heart disease. For example, Taloba et al. [7] utilized linear regression analysis, the naive bayes classifier, and random forest models to predict medical costs related to spinal fusion, highlighting the good prediction performance of linear regression models based on the MAPE and R2. Similarly, Gowd et al. [8] employed logistic regression, K nearest neighbor, random forest, naive Bayes, decision tree, gradient elevation tree, and other models to predict total medical costs after total shoulder arthroplasty, revealing that the random forest gradient elevation tree performed the best by comparing accuracy and the area under the receiver operating characteristic curve.

However, the abovementioned models are often considered incapable of adequately detecting complex patterns in large-scale population data [9]. Fortunately, rapidly developing deep learning techniques, including the latest deep neural network structures and quantitative methods, have shown promise in overcoming these challenges. Recent studies have demonstrated the success of deep learning in various medical applications, such as helping dermatologists examine skin cancer [10], predicting patient outcomes using medical text data [11], and as a clinical diagnostic to streamline the triaging of patients and facilitate the clinical decision-making process [12]. The early perceptron model was limited to simple binary classification problems and failed to solve complex nonlinear problems. However, the introduction of hidden layers and activation functions allowed the multilayer perceptron (MLP), also known as multilayer neural network, to be widely used in data mining, pattern recognition, machine learning, and other fields [13,14,15]. Introducing skip-layer connections in feedforward neural networks allows for the full utilization of input information to supplement the parts of the original feature information lost during the network training compression process. The predictive performance has been validated and proven effective, MA Morid et al. [16] attempted to improve the performance of healthcare cost prediction methods by leveraging the feature learning power of convolutional neural networks for temporal pattern detection. P Drewe-Boss et al. [17] have shown that neural networks compare favorably to several baseline methods and that tools such as integrated gradients can be used to explain predictions of population health costs. A Al Bataineh et al. [18] proposed an MLP neural network trained with PSO for heart disease detection. The findings demonstrated that the MLP-PSO model can assist healthcare providers in more accurately diagnosing patients and recommending better treatments. Therefore, given the limited complexity of multiple regression and random forest regression (RFR), we selected the MLP model in our manuscript to validate the results and compare the three models.

In China, there have been few studies on healthcare cost prediction due to the public welfare and nonprofit nature of healthcare services. Therefore, the innovative analysis of the introduction of the MLP model provides insights into the future application of artificial intelligence in the medical field. By utilizing the MLP model to predict hospitalization costs, we can potentially provide more intelligent economic management tools for medical institutions and offer more personalized and economically reasonable medical services for patients. This has significant social and economic implications for promoting the coordinated development of the “artificial intelligence + medicine” field and improving the quality and efficiency of medical services.

Method

Data

This study aimed to investigate patients with PTB (coded as A15-A16 according to the ICD-10) who were admitted to Kashgar Pulmonary Hospital between 2020 and 2022 and subsequently discharged. The data are derived from the hospital information system of a pulmonary hospital in Kashgar. Medical professionals record and organize patient medical records within 24 h and promptly upload them to the hospital information system. Additionally, data within the system are independently collected and organized by two individuals, followed by cross-checking to ensure the accuracy and completeness of the date within the system. Patients with missing data and inability to verify cost were excluded from the study, a total of 9,570 eligible patients with PTB were included as subjects for this study (Fig. 1).

Fig. 1
figure 1

This figure shows the patient inclusion and exclusion process

Total hospitalization cost

The costs were categorized according to the cost items listed on the first page of the latest version of the medical records in 2017. Similar cost items were combined, resulting in six categories, namely, diagnosis cost (including pathological diagnosis, laboratory diagnosis, imaging diagnosis, clinical diagnosis items, etc.), medical service cost (including general medical services, general treatment operations, nursing cost, etc.), material cost (including examination, treatment, surgical disposable medical materials, etc.), treatment cost (including nonsurgical treatment items, surgical treatment cost, etc.), drug cost (including western medicine, Chinese patent medicine, Chinese herbal medicine, etc.), and other cost (including rehabilitation cost, traditional Chinese medicine cost, blood and blood product cost, etc.).

Sociological data

Sociological data such as gender, age, marital status, and payment method were collected. Disease Characteristics Data: Information related to the admission condition, length of hospital stay, initial treatment, presence of other diseases, transfer, allergy, drug resistance, and admission department were also collected.

It is important to note that this study obtained approval from the hospital ethics committee, and all patient information used in the research was completely anonymous.

Statistical methods

A comprehensive database was created using SPSS 26.0, and the data were collated and cleaned. Descriptive analysis was conducted on the hospitalization cost data, sociological data, and disease characteristic data of all patients with PTB. Categorical data was described using frequencies and constituent ratios, while medians and interquartile ranges were used to describe the central tendency and dispersion tendency of continuous data with a skewed distribution.

In univariate analysis, Mann–Whitney U test was used for pairwise comparisons, including 6 factors such as gender, initial treatment, presence of other diseases, transfer, allergy, and drug resistance; Kruskal–Wallis H test was used for multiple-group comparisons, including 6 factors such as age, marital status, payment method, admission condition, length of hospital stay, and admission department. After the results of the univariate analysis, multiple stepwise regression analysis was performed to analyze the main factors influencing the total hospitalization cost. P < 0.05 indicated a statistically significant difference.

Based on the collected data, multiple regression, RFR and MLP prediction models were established to predict the diagnosis cost, medical service cost, material cost, treatment cost, drug cost, other cost, and total hospitalization cost of patients with PTB. The prediction efficacy of the three models was compared using R-square (R2), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE). R2 reflects the proportion of variation explained by the model and is usually used to evaluate the predictive ability of the model. A higher R2 value indicates a better fit of the model to the data. The RMSE is considered to measure the average error that a model produces when predicting an observation. The lower the RMSE is, the better the predictive ability of the model. Like the RMSE, the MAE measures prediction errors but is less sensitive to outliers; the lower the MAE is, the better the model. An essential use of predictive models is identifying key data features and their impact on predictions. We leveraged Permutation Importance to assess the best model’s feature importance.

Predictive model build

All the data were divided into a training set (2020–2021) and a test set (2022) according to the year. In terms of feature processing, multicategory variables were transformed into dummy variables; for binary variables, they were converted into 0 and 1; and the predictive costs were transformed using the logarithm. The numerical variables such as age and length of stay, were divided into ordinal categories by interval. The variable assignments were detailed in Appendix Table.

Seven multiple regression models were separately established for diagnosis cost, medical service cost, material cost, treatment cost, drug cost, other cost and total hospitalization cost. The forward method was employed to select the optimal variables.

With sklearn package in Python 3.7, seven RFR models also were separately established. In order to select the optimal hyper-parameters, a fivefold cross-validation method with grid search was used to tune below parameters, n_estimators (5 ~ 100), max_features ([‘auto’, ‘sqrt’, ‘log2’]), max_depth (2 ~ 12), min_samples_split (5 ~ 150), and min_samples_leaf (5 ~ 50). By systematically exploring the parameter space through this comprehensive tuning strategy, we aimed to identify the optimal hyper-parameter configuration for each RFR model, ensuring optimal performance and generalization ability.

The model architecture adopted in this study was different from the traditional multilayer perceptron. MLP encompasses an input layer, one or multiple hidden layers, and an output layer. Neurons in a layer are fully connected to those in the subsequent layer, enabling information transmission through weighted connections and biases. In our model’s design,the original input was connected to the hidden vector in the last hidden layer and fed to the output layer. This setup permitted the network to harness both simple input–output relationships and residuals from complex deep learning architectures. Then, the final seven target costs including diagnosis cost, medical service cost, material cost, treatment cost, drug cost, other cost, and total hospitalization cost were predicted. The model was built and optimized by the Keras package in Python 3.7. To select the optimal hyperparameters, a fivefold cross-validation method with grid search was used to tune the number of hidden layers (3 ~ 10), the number of neurons per layer (10 ~ 50), the dropout (0.25, 0.5) and the learning rate (0.001, 0.01). Besides, Rectified Linear Units (ReLU) was applied as the activation function, the Adam optimizer for parameter optimization.

In the fivefold cross-validation, 300 epochs were used for each training, and to prevent overfitting, the early stopping mechanism was used, indicating that training would be halted if the loss in the validation set did not decrease for 10 consecutive epochs. We used the Adam optimizer and the mean squared error as the loss function. The optimal model was evaluated with an average of 5 loss results. An overview of inference and analysis framework was illustrated in Fig. 2.

Fig. 2
figure 2

This is an overview of Inference and Analysis Framework

Results

Description of general feature

Among the 9,570 patients with PTB, the proportion of female patients (53.77%) was higher. The median and quartile of patient age were 67.00 (55.00, 74.00) years. The majority of patients were married (91.41%). Most patients (81.04%) used urban and rural residents’ medical insurance as the payment method. The majority of patients were admitted as general patients (53.11%). The median and quartile length of hospital stay for patients were 14.00 (11.00, 21.00) days, and 79.51% of patients were newly diagnosed with PTB. A total of 61.78% of patients had comorbidities. A total of 95.17% of patients were not transferred to another department. A total of 98.51% of patients had no allergies. A total of 97.77% of patients did not have resistance to medications. The proportion of PTB patients admitted to the respiratory department was the largest (17.46%).

Univariate analysis revealed significant differences in the total hospitalization cost among the 11 factors except allergy (P = 0.358), including gender, age, marital status, payment method, admission condition, length of hospital stay, initial treatment, presence of other diseases, transfer, drug resistance, and admission department (P < 0.05) (Table 1).

Table 1 Basic information of patients with PTB

The hospitalization costs of various categories of PTB patients exhibited a skewed distribution, the median total hospitalization cost was 13,150.45 yuan, with an interquartile range between 9,891.34 yuan and 19,648.48 yuan. According to the analysis of hospitalization costs, material costs accounted for the greatest proportion of patients with PTB, followed by drug costs (Table 2).

Table 2 Composition of hospitalization costs for patients with PTB

Multiple regression analysis revealed that the F value was 1377.916, indicating a highly significant relationship (P < 0.001). The R2 was 0.760, suggesting that the regression model explained 76.0% of the total variation. The RMSE was 0.130, indicating a relatively small average difference between the predicted and actual values. Additionally, the coefficient of variance expansion (VIF) for all independent variables was less than 5, indicating the absence of multicollinearity among the variables included in the analysis.

The analysis revealed nine statistically significant factors (age, marital status, admission condition, length of hospital stay, initial treatment, presence of other diseases, transfer, drug resistance and admission department) that were the main factors influencing the hospitalization costs of patients with PTB (P < 0.05) (Table 3).

Table 3 Multiple linear regression analysis results of influencing factors of hospitalization costs in patients with PTB

Modeling Results

The seven multiple regression models were developed using the forward method, with varying characteristics included for cost prediction. Age and length of hospital stay were used in all models; Admission condition, initial treatment, transfer and admission department were effective variables for most models; Less used in the model were marital status, presence of other diseases and drug resistance (Table 4).

Table 4 Feature selection in multiple regression models for seven hospitalization costs

As to RFR, the best parameters were different in 7 cost prediction models. In total hospitalization cost prediction model, the optimal parameter were ‘max_depth’ = 8, ‘max_features’ = ‘auto’, ‘min_samples_leaf’ = 10, ‘min_samples_split’ = 10, ‘n_estimators’ = 10.

Figure 3 showed the relationship between the number of trees and RMSE when using the RFR algorithm to predict total hospitalization cost, with the RMSE being minimal near 10 trees. While figure (b) showed that the optimal max_depth was 8.

Fig. 3
figure 3

This diagram shows training process of the RFR

Based on the average loss results from fivefold cross-validation, the optimal parameter combination for the MLP model was determined(number of hidden layers = 3, number of neurons per layer = 50, dropout rate = 0.25, and learning rate = 0.001). With the optimal parameter combination, the validation set loss was minimized at the 18th epoch and did not decrease further after 10 epochs (Fig. 4). Figure 5 shows the true and predicted values of the MLP in the training and test sets, the scatter plot highlights the great predictive performance of the MLP. Figure 6 showed the ranking of feature importance for the model, with admission department and length of hospital stay ranking highly.

Fig. 4
figure 4

This graph shows the loss curve of train and validation set of the MLP under the optimal parameter combination

Fig. 5
figure 5

This picture shows the scatter plot of the true and predicted values of seven hospitalization costs in MLP model. Shown in (a, c, e, g, i, k, m) are the training while shown in (b, d, f, h, j, l, n) are testing sets

Fig. 6
figure 6

This picture shows the ranking of feature importance for the model

Table 5 showed the evaluation results of the three modeling methods. Overall, MLP demonstrated the superior performance in most cost predictions, outperforming RFR and multiple regression. For instance, as to the prediction of total hospitalization cost, both in the training set and the test set, MLP achieved the highest R2 values (0.817 and 0.832, respectively), surpassing RFR’s R2 (0.758 and 0.809, respectively) and multiple regression’s R2 (0.707 and 0.777, respectively); Additionally, MLP had the lowest RMSE (0.103 and 0.114, respectively) and MAE (0.078 and 0.086, respectively), lower than the RMSE (0.118 and 0.122, respectively) and MAE (0.086 and 0.091, respectively) of RFR and the RMSE (0.130 and 0.132, respectively) and MAE (0.098 and 0.099, respectively) of multiple regression. However, in the prediction of diagnostic cost, RFR exhibited the best performance in the test set, with an R2 of 0.629, RMSE of 0.231, and MAE of 0.167, better than MLP (0.609, 0.237 and 0.163, respectively) and multiple regression (0.518, 0.264 and 0.175, respectively). Despite multiple regression showed low performance across all models, in the prediction of other cost, multiple regression revealed the best results in the test set, with an R2 of 0.352, RMSE of 0.792, and MAE of 0.627, better than MLP (0.313, 0.816 and 0.611, respectively) and RFR (0.342, 0.799 and 0.619, respectively).

Table 5 The comparison results of multiple regression, RFR and MLP

Discuss

The combined results of univariate analysis and multivariate linear regression analysis revealed that the main factors affecting the hospitalization costs of PTB patients were, age, marital status, admission condition, length of hospital stay, initial treatment, presence of other diseases, transfer, drug resistance and admission department. Length of hospital stay, initial treatment and age were important factors affecting the hospitalization costs of PTB patients, with standard regression coefficients of 0.657, 0.200 and 0.107, respectively. Notably, younger patients tend to demonstrate better physical fitness and fewer underlying health conditions, which contributes to more favorable treatment outcomes and, consequently, lower hospitalization costs. With increasing age, the patient’s own immunity will decrease, which will have a certain impact on the therapeutic effect, resulting in increased hospitalization costs; widowed patients need to bear the economic productivity of the entire family alone, and various external pressures may also cause psychological and physical double pressures, so the patient’s physical health will be adversely affected, indirectly leading to an increase in the total hospitalization cost; critically ill patients face a large number of treatment costs and examination costs, which may require shortening the length of hospital stay as much as possible during the diagnosis and treatment period, strengthening the intensity of treatment, and avoiding unnecessary treatment items, examination items and drugs as much as possible, leading to lower per capita hospitalization costs; the longer the length of hospital stay is, the greater the number of medical and health resources, such as examination costs, medical costs and bed costs, and thus, it is bound to increase the total hospitalization costs [19,20,21]. Multiple treatment, combined with other diseases, transfer, drug resistance, and PTB patients in ICU departments may be more complex and correspondingly more difficult to treat, usually accompanied by more health resource consumption, which invisibly leads to an increase in the total hospitalization cost [22,23,24]. The above influencing factors suggest that relevant medical insurance policies should be improved, standardized diagnosis, treatment and management models should be improved, and popular science awareness of PTB prevention and control should be promoted to optimize the structure of hospitalization costs and reasonably control the increase in medical costs.

The results of our study showed that the MLP model had better predictive performance for hospitalization costs than did the multiple regression and RFR, which was similar to the findings of previous studies. [25,26,27]. Traditional models such as multiple regression and RFR might not be able to identify complex relationships among variables, leading to suboptimal fitting performance. However, the MLP model has the characteristics of large-scale parallel processing, high fault tolerance, self-organization, self-adaptive ability and an association function [28]. This approach necessitated a less restrictive specification of independent variables and could effectively solve highly complex classification problems. It was suitable for processing information that involves considering numerous factors and conditions simultaneously, especially in situations that are imprecise and fuzzy [29]; high nonlinear global action, automatic extraction of “reasonable” solution rules, and certain promotion and generalization ability were also the advantages of the MLP model. Chen Y et al. [30] proposed a new LPR-MLP hybrid model, which uses LBP, PCA, and Relief F to process image data and meteorological mechanics data, respectively, and then uses MLP to predict its health level, thus solving the challenge of predicting the health status of transmission lines under high-dimension, multimode, nonlinear, and heterogeneous data. The experimental results have shown that the LPR-MLP model has high prediction accuracy and performance.

Recently, the success of deep learning has led to a resurgence of interest in MLP. MLP, which is regarded as a standard supervised learning algorithm in the field of pattern recognition and continues to become a subject of research in the field of computational neurology and parallel distributed processing, is often used as a back propagation algorithm for learning [31]. At present, the MLP has been proven to be a general functional approximation method that can be used to fit complex functions or solve classification problems. Compared with other prediction models, MLP models have shown unique advantages, and their popularity and application are wider [32]. The utilization of the MLP model for predicting hospitalization costs positively impacts medical cost management, aiding in the rational allocation of medical resources by diverse healthcare institutions and balancing various hospitalization costs through adjustments to high-cost items. This ultimately achieves the goal of controlling patients’ medical costs and alleviating their economic burden.

At present, there is limited research on predicting medical costs. The introduction of the MLP model for hospitalization cost prediction holds significant social and economic importance in enhancing the quality and efficiency of medical services. The MLP model is currently capable of effectively addressing more complex problems, but it also exhibits notable shortcomings, such as the challenging task of determining the appropriate number of hidden nodes in the network, potential inadequacies in learning, and lengthy training times. Consequently, future research on MLP models should further expand the research sample and include more influencing factors and comparative models.

This study was constrained by the limitations of its data source, as patient data ware derived from a single center, and its generalizability may be restricted. Furthermore, as a retrospective study, the cases were sourced from the hospital information systems, limiting access to additional variables that may impact hospitalization costs. Therefore, a multicenter study with a greater number of variables is required.

Conclusion

In general, the MLP model has demonstrated significant advantages over the traditional multiple regression models in terms of prediction efficacy. It enabled the utilization of comprehensive patient information and effectively predicted hospitalization costs, thereby facilitating the rationalization of cost structures and reducing the economic burden on patients. Furthermore, the insights gained from the MLP model hold considerable reference value for research on other diseases, highlighting its broader applicability in the field of healthcare economics.