Introduction

Obesity is a silent global pandemic estimated to affect about 4 billion people by 2035, compared with over 2.6 billion in 2020 (Ref. 1). In particular, excess body fat mass increases the risk of chronic disorders including type 2 diabetes mellitus (T2D), fatty liver disease, hypertension, and cardiovascular disease2,3,4. This is mainly because obesity is associated with hyperglycemia5,6, dyslipidemia7 and liver disease8. Although the risk of developing these obesity-related complications is proportional to the degree of adiposity, and more specifically to abdominal (or android) fat accumulation9,10, the prevalence of these comorbidities is not uniform across people living with obesity11,12,13,14. Previous studies have underscored the importance of fat depot distribution for correctly assessing the risk of obesity-related health issues15,16,17, giving special importance to the visceral adipose tissue (VAT)18,19,20,21. Therefore, the correct assessment of VAT may play a relevant role in identifying and stratifying those individuals at higher risk of suffering obesity-associated comorbidities.

Although obesity is defined as an excessive accumulation of body fat, the most commonly used measure to classify obesity is the body mass index (BMI), which is calculated by dividing a person’s weight in kilograms by the square of their height in meters. Multiple cut-points for BMI and other anthropometric measures have been employed in clinical practice for their non-invasive nature and simplicity of use22. Nevertheless, they do not allow us to distinguish the two main compartments of abdominal fat: VAT and subcutaneous adipose tissue23,24,25. Imaging techniques, such as dual-energy X-ray absorptiometry (DXA), have gained popularity since they provide an accurate measurement of body composition26,27,28. However, due to their cost and the need for trained personnel, they are not always available. Conversely, blood analyses are easy to obtain and could provide an alternative worth investigating, since multiple parameters are known to be altered in the context of obesity.
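As a minimal illustration of the BMI definition given above, the calculation can be sketched in Python. The function name and the example values are illustrative assumptions and are not taken from the study data.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in meters squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


# Hypothetical subject: 110 kg and 1.65 m.
example_bmi = bmi(110.0, 1.65)
print(f"BMI = {example_bmi:.1f} kg/m^2")
```

Because BMI depends only on weight and height, two subjects with the same BMI can have very different amounts of VAT, which is the limitation discussed in this paragraph.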

Over the past decades, considerable efforts have been made in constructing classification and prediction models especially for men living with obesity due to their predisposition to accumulate VAT29,30. Consequently, women have been largely underrepresented in these models, which draws attention to the need for a better characterisation of this population. In this sense, our study aims to develop supervised classification techniques in the female population based on blood chemical concentrations, and a few clinical parameters, and compare the results to the same classification models obtained from DXA-based data. The goal is to determine the ability of risk stratification of both blood-derived and DXA-derived data.

Materials

Study population

This study included a total of 149 bariatric surgery candidate women living with obesity. This cohort was already presented in Osorio-Conles et al.31 and in Pané et al.32. Inclusion criteria were female patients, aged between 18 and 70 years, BMI \(\ge 40\) kg/\(\hbox {m}^{2}\) or \(\ge 35\) kg/\(\hbox {m}^{2}\) in the presence of obesity-related comorbidities. The Institutional Ethics Committee approved the study protocol in both cohorts (ID. Reg. HCB/2017/0984 and Reg. HCB/2019/0137) and written consent was obtained from all participants. For each patient, DXA-estimated data (DXAdata) and chemical concentrations from blood samples (BLDdata) were collected. Total body fat and lean mass were measured by DXA using a Lunar iDXA scan (GE Healthcare, Madison, WI, USA). The estimated VAT content was computed by the validated CoreScan software (EnCore version 17.0). All the procedures and all the methods were performed following the Helsinki Declaration33.

Variables included in the study

DXAdata included values for: fat mass arms (Arms_FM), fat mass trunk (Trunk_FM), fat mass android (Android_FM), total fat mass (Total_FM), muscular mass trunk (Trunk_MM), muscular mass android (Android_MM), tissue mass trunk (Trunk_TM), tissue mass android (Android_TM), total tissue mass (Total_TM), fat-free mass trunk (Trunk_FFM), fat-free mass android (Android_FFM), total mass trunk (Trunk_TotalM), total mass android (Android_TotalM), and overall mass (OverallMass). BLDdata included: ultrasensitive C-reactive protein (usCRP), fasting plasma glucose (FPG), glycosylated hemoglobin (HbA1c), cholesterol (COLT), triglycerides (TG), low-density lipoproteins (LDL), aspartate aminotransferase (AST), alanine aminotransferase (ALT), and gamma-glutamyltransferase (GGT). A summary list of the variables included in this study, with the corresponding unit of measurement and abbreviation adopted in the text, can be found in Table 2.

Methods

Data preprocessing and analysis, model construction, and performance evaluation were performed using customized algorithms based on the Scikit-learn Python library34.

Preprocessing

Patients were grouped according to their VAT weight into three classes: class 0 (VAT < 1854.60 g); class 1 (1854.60 g \(\le \) VAT \(\le \) 2495.88 g) and class 2 (VAT > 2495.88 g), corresponding to the first, second and third tertile, respectively, of the VAT distribution in our population, as shown in Fig. 1. Within each class, missing values were imputed with the k-nearest neighbors method34,35,36, using the default values for (1) the number of neighboring samples used for imputation, (2) the weight function used in prediction, and (3) the distance metric used to search for neighbors; all of these defaults are listed in the help page37. Next, the Kruskal-Wallis test was applied to each variable in both DXAdata and BLDdata to select only those showing statistically significant variation across the three classes. Finally, a panel of trained doctors validated the clinical significance of the selected DXAdata and BLDdata variables.
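The preprocessing steps described above can be sketched as follows. This is only an illustration on synthetic data: the variable names, the random data, the 0.05 significance level, and the use of `scipy.stats.kruskal` are assumptions, while `KNNImputer` is the Scikit-learn k-nearest neighbors imputer used here with its default settings.

```python
import numpy as np
from scipy.stats import kruskal
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Synthetic stand-in for the cohort: VAT weight (g) and two candidate variables.
n = 149
vat = rng.normal(2200.0, 600.0, size=n)
informative = 0.01 * vat + rng.normal(0.0, 3.0, size=n)  # tracks VAT
unrelated = rng.normal(0.0, 1.0, size=n)                 # independent of VAT
X = np.column_stack([informative, unrelated])
X[rng.choice(n, size=10, replace=False), 0] = np.nan     # inject missing values

# Tertile cut-points define classes 0, 1 and 2.
low, high = np.quantile(vat, [1 / 3, 2 / 3])
classes = np.where(vat < low, 0, np.where(vat <= high, 1, 2))

# Impute missing values separately within each class (default KNNImputer settings).
X_imputed = X.copy()
for c in (0, 1, 2):
    mask = classes == c
    X_imputed[mask] = KNNImputer().fit_transform(X[mask])

# Kruskal-Wallis test per variable across the three classes.
p_values = [
    kruskal(*(X_imputed[classes == c, j] for c in (0, 1, 2))).pvalue
    for j in range(X_imputed.shape[1])
]
selected = [j for j, p in enumerate(p_values) if p < 0.05]
print("p-values:", p_values, "selected columns:", selected)
```

In this toy example the VAT-related column is expected to be retained, whereas the unrelated column is retained only by chance.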

Figure 1

Swarm plot showing VAT weight distribution. Blue, orange, and green colours were used for classes 0, 1, and 2, respectively.

Regression models

Multilinear regression analysis (MLR) was initially performed to directly estimate the VAT weight from each of DXAdata, BLDdata, and ALLdata (DXAdata and BLDdata combined) using the built-in function provided in the Scikit-learn Python library34. To that end, we first divided each dataset into training (90%) and test (10%) sets using the same random state for each dataset. This ensured that the models were trained and then tested on the same subgroups of subjects, thus allowing a proper comparison of the results.

The performance of each regression model was assessed by computing the widely used evaluation metrics: R-squared (\(\hbox {R}^{2}\)) and adjusted \(\hbox {R}^{2}\), Mean Absolute Error (MAE), Mean Square Error (MSE) and Root Mean Square Error (RMSE)34.
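A minimal sketch of the regression workflow and metrics described above is given below. The data are synthetic, the random state is arbitrary, and computing the adjusted \(\hbox {R}^{2}\) on the test set with the usual formula \(1-(1-\hbox {R}^{2})(n-1)/(n-p-1)\) is an assumption, since the text does not state how it was obtained.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic predictors and a VAT-like target (g); not the study data.
n, p = 149, 3
X = rng.normal(size=(n, p))
y = 2200.0 + X @ np.array([300.0, -150.0, 80.0]) + rng.normal(0.0, 200.0, size=n)

# 90% / 10% split; reusing the same random_state keeps the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
n_test = len(y_test)
adj_r2 = 1.0 - (1.0 - r2) * (n_test - 1) / (n_test - p - 1)  # assumed formula
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = float(np.sqrt(mse))  # RMSE is the square root of MSE

print(f"R2={r2:.2f} adjR2={adj_r2:.2f} MAE={mae:.1f} MSE={mse:.1f} RMSE={rmse:.1f}")
```

With only about 15 test subjects, each additional predictor lowers the adjusted \(\hbox {R}^{2}\) noticeably, so the gap between the two values grows with the number of variables.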

Classification models

Logistic regression (LR), support vector machine (SVM), decision tree, random forest, k-nearest neighbors, and XGBoost classifiers were evaluated on DXAdata and BLDdata alone, and on the dataset obtained by joining them (ALLdata). In every case, routine clinical information was also included: age, T2D, weight, BMI, waist and hip circumferences. VAT weight values were not included in any of DXAdata, BLDdata, or ALLdata. Each dataset was normalized with the MinMaxScaler function, as it preserves the relative relationships and distributions of the variables34,35,36, and a grid search was performed to find the best hyper-parameter (HP) configuration for each model with a 10-fold cross-validation strategy, which has already been used in several previous medical studies38. During every cross-validation, each dataset was partitioned into training and test sets (80% and 20% of the data, respectively), and the model was trained and evaluated with each set of partitions. The HPs (Table 1) were either already employed in previous studies on related topics39,40,41 or selected according to their suitability for the analyzed dataset34,35,36.

Table 1 List of HP systematically evaluated during the 10-fold cross-validation grid search analysis to find the best-performing configuration for each model.

For each of ALLdata, DXAdata, and BLDdata, the model (with its optimal HP configuration) showing the highest test accuracy in the 10-fold cross-validation grid search analysis was selected. The HPs control the learning process of a classification model in order to optimally map the input variables (i.e. those in the ALLdata, DXAdata, and BLDdata) to the output labels (i.e. class 0, class 1, and class 2)36,42. The test accuracy, on the other hand, provides us with information, in terms of percentage, on how well each model classifies each patient by comparing the model-predicted class with the true class in the test group36,42. Then, we assessed the performance of each selected model on its corresponding dataset by evaluating the confusion matrix. For that purpose, we used 80% of each dataset for training the model and the remaining 20% for testing.
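One possible way to combine min-max scaling, a 10-fold cross-validated grid search, and an 80%/20% hold-out evaluation with Scikit-learn is sketched below. The synthetic data, the stratified split, the pipeline layout, and the small hyper-parameter grid are illustrative assumptions; they do not reproduce the HP list in Table 1.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic three-class problem standing in for one of the datasets.
X, y = make_classification(
    n_samples=149, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# 80% training / 20% test split (stratification is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# Scaling sits inside the pipeline so it is refitted on each training fold.
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("clf", LogisticRegression(max_iter=5000)),
])

# Hypothetical grid; the grid actually explored is the one in Table 1.
param_grid = {"clf__C": [0.1, 1.0, 10.0], "clf__solver": ["lbfgs", "saga"]}

search = GridSearchCV(pipe, param_grid, cv=10, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print("best HPs:", search.best_params_, "test accuracy:", round(test_accuracy, 2))
```

Keeping the scaler inside the pipeline avoids leaking information from the validation folds into the scaling step, which is a design choice of this sketch rather than something stated in the text.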

Next, the one-vs-rest analysis was executed36,42. This test involves splitting the multi-class dataset (i.e. class 0, class 1, and class 2) into multiple binary classification problems and then training the selected model on each problem: (1) class 0 vs [1 and 2]; (2) class 1 vs [0 and 2]; (3) class 2 vs [0 and 1]. Thus, the one-vs-rest analysis provides information on the ability of a classifier to distinguish one class from the other two classes considered together.
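The relabelling behind the one-vs-rest analysis can be sketched in a few lines of plain Python. The helper name and the toy labels are illustrative assumptions.

```python
def one_vs_rest_labels(labels, positive_class):
    """Return binary labels: 1 for the chosen class, 0 for all other classes."""
    return [1 if label == positive_class else 0 for label in labels]


# Toy class labels for six hypothetical patients.
y = [0, 1, 2, 2, 1, 0]

# One binary problem per class: 0 vs [1 and 2], 1 vs [0 and 2], 2 vs [0 and 1].
problems = {c: one_vs_rest_labels(y, c) for c in sorted(set(y))}
for c, binary in problems.items():
    print(f"class {c} vs rest:", binary)
```

The selected classifier is then trained once on each of the three binary label vectors.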

Finally, for each of ALLdata, DXAdata, and BLDdata, a confusion matrix was obtained for the selected model. A confusion matrix is a simple performance analysis tool used to represent the test result of a classification model: each column represents the instances in a predicted class, while each row represents the instances in an actual class.
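The row and column convention described above can be made concrete with a small pure-Python sketch. The labels are invented for illustration, and the function is a hypothetical helper rather than the routine used in the study.

```python
def confusion_matrix(actual, predicted, n_classes):
    """Rows index the actual class, columns index the predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


# Toy labels for seven hypothetical patients.
actual = [0, 0, 1, 1, 1, 2, 2]
predicted = [0, 1, 2, 2, 1, 2, 0]

cm = confusion_matrix(actual, predicted, 3)
accuracy = sum(cm[i][i] for i in range(3)) / len(actual)  # diagonal = correct

for row in cm:
    print(row)
print("accuracy:", round(accuracy, 3))
```

Reading across the second row, for instance, shows how the actual class 1 patients are spread over the predicted classes.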

Classification models interpretability and performance assessment

SHapley Additive exPlanations (SHAP)43 analysis was conducted to understand the contribution of each variable in ALLdata, DXAdata, and BLDdata to the classification made by the selected model. SHAP was calculated by comparing the classification made with and without including a given variable in the selected model and considering the difference between these two classifications as the contribution of that variable43,44. Moreover, SHAP considered all possible combinations of variables when calculating their contribution to the classification44.
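The idea of comparing the model output with and without a variable, averaged over all combinations of the remaining variables, corresponds to the exact Shapley value. The sketch below computes it by brute force for a toy value function; the function names, the three variables chosen, and the numeric contributions are illustrative assumptions, and the SHAP library itself relies on more efficient estimators.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley values by enumerating every subset of the other features."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                # Marginal contribution of f when added to subset s.
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def toy_value(subset):
    """Hypothetical model output when only the variables in `subset` are used."""
    score = 0.0
    if "Trunk_FM" in subset:
        score += 0.30
    if "FPG" in subset:
        score += 0.10
    if "Trunk_FM" in subset and "FPG" in subset:
        score += 0.04  # interaction term shared equally between the two
    return score


phi = shapley_values(["Trunk_FM", "FPG", "Age"], toy_value)
print(phi)
```

The contributions add up to the difference between the output with all variables and the output with none, which is the additivity property that gives SHAP its name.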

To assess the performance from the one-vs-rest analysis, both receiver operating characteristic (ROC) and precision-recall (PR) graphs were computed. ROC graphs are useful tools for investigating a classification model according to its performance with respect to the false positive rate and true positive rate45. The diagonal of a ROC graph can be interpreted as random guessing, and classification models that fall below the diagonal are considered worse than random guessing. A perfect classifier would fall into the top-left corner of the graph with a true positive rate of 1 and a false positive rate of 0. Based on the ROC curve, the so-called ROC area under the curve (AUC) can be computed to characterize the performance of a classification model. A PR graph reports precision, defined as the number of correct positive classifications (true positives) divided by the total number of predicted positive classifications (true positives + false positives), against recall, defined as the number of correct positive classifications (true positives) divided by the total number of truly positive instances (true positives + false negatives). The PR graph can be considered equivalent to the false discovery rate curve46,47.
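A minimal sketch of how ROC and PR curves and their AUCs can be obtained for one binary (one-vs-rest) problem with Scikit-learn follows. The labels and scores are invented, and integrating the PR curve with the trapezoidal rule through `sklearn.metrics.auc` is an assumption about how the PR AUC was computed.

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve, roc_auc_score, roc_curve

# Toy one-vs-rest labels (1 = class of interest) and predicted probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.50, 0.30])

# ROC curve: true positive rate against false positive rate over all thresholds.
fpr, tpr, _ = roc_curve(y_true, scores)
roc_auc = auc(fpr, tpr)

# PR curve: precision against recall over all thresholds.
precision, recall, _ = precision_recall_curve(y_true, scores)
pr_auc = auc(recall, precision)

print(f"ROC AUC = {roc_auc:.3f}, PR AUC = {pr_auc:.3f}")
```

An ROC AUC of 0.5 corresponds to the diagonal (random guessing), so values below 0.5 indicate a ranking that is worse than chance for that class.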

Results

Analysed variables

Out of the 166 variables initially collected in Ref.31, only 29, including 14 from DXA and 9 from blood sample analysis, passed the Kruskal-Wallis test and were thus included in the study. These variables are shown in Table 2 grouped according to their dataset (i.e. DXAdata and BLDdata) and classes 0, 1, and 2, and the values are expressed as median [25-th–75-th percentile]. Despite a slight overlap of the percentiles, an increase in the medians of the DXAdata parameters across classes can be observed. Analogous considerations apply to most of the variables in the BLDdata, including FPG, TG, COLT, and GGT.

Table 2 Distributions of age, number of patients with T2D, weight, BMI, waist and hip dimensions, DXA-estimated data, and blood chemicals concentrations for each class. Data are presented as median [25-th–75-th percentile] or number of patients.

Table 3 shows the P-value obtained from the Wilcoxon signed-rank test, a non-parametric statistical test used to check statistical differences in the median values of two populations. Here the test was applied to each variable and for every pair of classes: class 0 vs 1, class 0 vs 2, and class 1 vs 2. In addition, Bonferroni’s correction for multiple testing was also performed and the results are reported in the same table. When testing class 0 vs 1, only 10 of the proposed parameters showed statistically significant differences, but all of them failed Bonferroni’s correction. A similar observation can be made when comparing class 1 vs 2. However, when comparing class 0 vs 2, all but two parameters were found to be statistically different, and 13 of them did pass Bonferroni’s correction. Nevertheless, only FPG in the BLDdata, when comparing class 0 vs 2, passed the multiclass Bonferroni’s correction.
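Bonferroni’s correction multiplies each raw P-value by the number of tests (capped at 1) before comparing it with the significance level. The sketch below uses invented P-values and an assumed significance level of 0.05; it shows how a comparison can be significant on its own yet fail after correction.

```python
def bonferroni(p_values, alpha=0.05):
    """Return Bonferroni-adjusted P-values and whether each stays below alpha."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    return adjusted, [p_adj < alpha for p_adj in adjusted]


# Hypothetical raw P-values for one variable across the three class pairs.
raw = {"class 0 vs 1": 0.030, "class 0 vs 2": 0.0004, "class 1 vs 2": 0.020}

adjusted, significant = bonferroni(list(raw.values()))
for (pair, p), p_adj, keep in zip(raw.items(), adjusted, significant):
    print(f"{pair}: raw={p:.4f} adjusted={p_adj:.4f} significant={keep}")
```

In this toy case all three raw P-values are below 0.05, but only the class 0 vs 2 comparison remains significant after correction.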

Table 3 P-value for the Wilcoxon signed-rank test for each variable and pair of classes, along with the P-value adjusted for multiple testing by Bonferroni’s correction. ns denotes no statistically significant values.

Regression models

MLR model coefficients obtained during the training for each variable and dataset are reported in Supplementary Table 1. Notably, some of the DXA-derived variables (e.g. Trunk_FM, Trunk_MM, and Trunk_TM) showed a relevant effect on the MLR model in both ALLdata and DXAdata, since the absolute value of their coefficients is considerably greater than the other variables, with few exceptions for usCRP and HbA1c in the case of ALLdata and T2D for DXAdata. Analogous consideration can be made for usCRP and HbA1c in the BLDdata.

Supplementary Table 2 shows the actual and estimated VAT weights for the same group of patients. In most of the cases, DXAdata-based MLR resulted in VAT weight values closer to the actual ones, the opposite of the BLDdata-derived MLR.

Table 4 presents the evaluation metrics computed for each MLR model. DXAdata showed the best results in terms of errors (MAE = 401.40, MSE = 305505.12 and RMSE = 552.73) with respect to ALLdata and BLDdata, as well as \(\hbox {R}^{2}\) (0.58) and adjusted \(\hbox {R}^{2}\) (0.52) comparable to those of ALLdata (\(\hbox {R}^{2}\) = 0.62 and adjusted \(\hbox {R}^{2}\) = 0.53), which matches what is observed in Supplementary Table 2.

Table 4 MLR models evaluation metrics for each ALLdata, DXAdata and BLDdata.

Classification models

The optimal HP configuration for each classification model obtained from the 10-fold cross-validation grid search analysis, along with the corresponding test accuracy values, is reported in Table 5. LR and SVM models trained on the DXAdata showed the highest test accuracy (0.63 and 0.60, respectively), followed by their equivalents evaluated on the ALLdata (0.57 and 0.53) and BLDdata (0.53 and 0.47); this value was always \(\le 0.47\) for every other combination of model and dataset. In addition, similar HPs were found for LR (i.e. solver = saga and penalty = l1) and SVM (i.e. kernel = linear) when comparing ALLdata and DXAdata, but not for BLDdata, where solver = liblinear and penalty = l2 for LR and kernel = rbf for SVM.

Table 5 Optimal HP configuration and corresponding test accuracy for each classification model and dataset obtained from 10-fold cross-validation grid search analysis. \(^*\) denotes the accuracy of the selected classification model for each dataset.

Confusion matrices, presenting the relationship between actual and predicted classes, are shown in Fig. 2 for each of ALLdata, DXAdata, and BLDdata. The results in Fig. 2a,b are similar, showing that the majority of class 1 patients were misclassified as class 2. In contrast, 6 out of 10 class 1 patients from BLDdata (Fig. 2c) were classified as class 0.

Figure 2

Confusion matrices for LR classification for ALLdata, DXAdata, and BLDdata. This figure presents confusion matrices showing actual subgroup membership along the rows and predicted subgroup membership by the LR classifier along the columns. In each cell, the numbers refer to counts of the number of individuals.

Classification models interpretability and performance assessment

The SHAP values, which quantify the magnitude of each variable’s contribution to the correct classification of patients, are presented for ALLdata, DXAdata, and BLDdata in the diagrams in Fig. 3a–c, respectively. The figure uses three legend entries: class 0 (blue), class 1 (orange), and class 2 (green). The length of every colour-coded bar is proportional to the mean absolute SHAP value, and variables are ranked according to their importance, with those most influential to the classification model in the upper part of the plot. According to these diagrams, Trunk_FM, OverallMass, Trunk_TM, Total_TM, and Trunk_TotalM were the top five variables in the ALLdata (Fig. 3a). An analogous distribution was found for the variables in the DXAdata (Fig. 3b), where Trunk_FM, OverallMass, Trunk_TM, Total_TM, and Trunk_TotalM were the most relevant ones. As for the BLDdata (Fig. 3c), the three most relevant variables were Weight, Age, and Hip, followed by T2D and AST.

Figure 3

SHAP diagrams showing the average impact of each variable on the LR model for ALLdata, DXAdata, and BLDdata. Blue, orange, and green colours were used for classes 0, 1, and 2, respectively.

The ROC curves for each dataset and class are shown in Fig. 4a–c and illustrate the trade-off between specificity and sensitivity in the one-vs-rest analysis. PR curves are shown in Fig. 4d–f and report the precision values for the corresponding sensitivity (recall). The same three-colour legend as in the SHAP diagrams is adopted here. Similar performances were obtained when using the LR model in the one-vs-rest analysis for both ALLdata and DXAdata, with class 1 being the most difficult to distinguish (ROC AUC = 0.31 and PR AUC = 0.24), indicating that both models could distinguish the ensemble of class 0 + class 2 as not class 1 but they were specific for class 1 only. However, better results can be observed for the BLDdata, with ROC AUC \(\ge \) 0.61 and PR AUC \(\ge \) 0.42.

Figure 4

Panels (a–c) show the ROC curves, while panels (d–f) show the PR curves, along with their corresponding AUC from the one-vs-rest analysis for the selected LR model in each of the ALLdata, DXAdata, and BLDdata datasets. Blue, orange, and green colours were used for classes 0, 1, and 2, respectively.

Discussion

The main aim of this study was to investigate the possibility of combining the results from routine blood tests and basic clinical information, for the classification of women living with obesity based on their VAT weight by MLR models and machine-learning-based classification techniques. Both MLR and classification hold significance from a clinical standpoint for risk stratification of those subjects, given the well-established associations of VAT accumulation with metabolic48,49 and clinical outcomes18.

Analysed variables

On the one hand, all but one of the variables in the DXAdata were estimated from the core region of the body, which is in line with previously reported studies in different populations, such as Refs.50,51,52, to mention a few. On the other hand, the variables included in the BLDdata were blood chemicals whose concentration can be altered by obesity and/or its comorbidities. Interestingly, the adoption of these blood biochemical concentrations for the classification of overweight and obesity, considering both males and females together, was already proposed in previous investigations53,54, thus corroborating their usage in the context of this research.

Although the Wilcoxon signed-rank test showed that most of the variables were statistically different across the three groups, only a few variables passed Bonferroni’s correction for multiple testing, and the majority of them were for the class 0 vs 2 comparison (see Table 3). In other words, single variables, either from DXA or blood samples, taken on their own might not be sufficient to robustly classify people living with obesity, probably due to their physiological interconnection. Nevertheless, machine-learning-based classification models can capture these complex relationships between variables more effectively than other statistical models, as confirmed by previous investigations including Ferenci et al.51 and Mitu et al.55, thus justifying their application in this type of analysis.

Regression models

In this study, we chose MLR since (1) it can model the complex relationships existing among all the DXA-based and blood chemical variables, (2) it can determine individual variable effects on the estimated VAT weight values, and (3) it can control for confounding variables. Indeed, by simultaneously testing the effects of multiple variables, MLR enhances predictive accuracy and strengthens the robustness of the model, while also accommodating non-linear relationships34.

Due to the high estimation error and the poor \(\hbox {R}^{2}\), none of the MLR models yielded estimated VAT weight values sufficiently accurate to be reliably used in the medical decision-making process. Specifically, the estimated VAT weight values can differ by up to twice the actual VAT weight, as demonstrated by the case of patient 12 in Supplementary Table 2. This discrepancy may be attributed to the limited sample size (N = 149), which potentially compromised the accuracy of the MLR models, making them unreliable for VAT prediction and risk stratification within our study population.

Selection of the number of classes

As a preliminary study, four different threshold settings were investigated to define the most suitable number of classes. The thresholds were based on (1) the mean, (2) the tertiles, (3) the quartiles, and (4) the quintiles, so as to obtain equally sized classes and hence avoid any potential bias due to unbalanced data. The statistical power of the resulting groups was tested using the “PWR” package in R56, obtaining 23% for mean-based, 22% for tertile-based, 18% for quartile-based, and 13% for quintile-based grouping. Thus, a more granular subdivision of the population would diminish the statistical power of the results.

On the one hand, a two-class approach generates an increment of the intra-class heterogeneity hence increasing the complexity of correctly classifying patients due to the remarkable overlap of their VAT weight distribution (see Supplementary Fig. 1a). This fact could have potentially lowered the clinical significance of the outcomes.

On the other hand, a three-class approach allows better discrimination of the two classes of major interest—class 0, including subjects with lower VAT weight and so lower risk, and class 2, containing subjects with higher VAT and so higher risk— as the amount of overlap is drastically reduced (see Supplementary Fig. 1b). Moreover, it adds an intermediate class that can contribute to a more refined classification (e.g. low, medium, and high-risk level) without losing statistical power.

Classification models

In general, LR works well on a wide variety of datasets, and its performance is better than that of decision trees (and the derived random forest and XGBoost) and k-nearest neighbors57, as is also noticeable from the results in Table 5. This can be explained by the fact that LR is a classification technique in which the target variable (the classes in this study) is assumed to be categorical (i.e. class 0, class 1, and class 2). Specifically, LR is best suited to binomial problems, as in this study, where the one-vs-rest analysis was employed34,42. The performance of k-nearest neighbors is generally worse on high-dimensional data, especially in the presence of outliers, as may be the case in this study, since they may negatively influence the calculation of the distance function42. SVM, on the other hand, has shown performance comparable to LR in different studies on medical datasets58,59, being particularly useful in scenarios where the number of variables is much greater than the number of samples. Nevertheless, LR remains the most familiar to clinicians given the relatively straightforward relationship between the inputs and output60.

We found remarkable similarities in the resulting HP configuration for LR between ALLdata and DXAdata, as well as in the confusion matrices, ROC, and PR graphs. This is probably because DXA-based variables could display a better correlation with VAT weight than blood chemicals, thus reducing the contribution of the latter to the final classification (see Fig. 3a). Nevertheless, the LR model derived from BLDdata showed comparable test accuracy and similar classification performance (see Fig. 2) to ALLdata and DXAdata, at least for classes 0 and 2. In any case, class 1 appears to be the most difficult to assess correctly. This could be a consequence of the VAT distribution within class 1 where, as can be noticed in Fig. 1, most of the values lie near the boundary with either class 0 or class 2.

Classification models interpretability and performance assessment

SHAP was employed to better understand the underlying mechanisms of the LR classification and to analyze the influence of each variable on the classification, sorting them from the most to the least relevant. Moreover, the length of each coloured bar relative to the others provides information on the importance of a given variable to that particular class. By visual inspection of the SHAP diagrams, it is clear that both LR classification models derived from ALLdata (Fig. 3a) and DXAdata (Fig. 3b) are more biased toward classes 0 and 2, as the blue and green bars overshadow the orange one across the whole dataset. On the contrary, classes in BLDdata (Fig. 3c) appear to be more balanced across variables, although those not related to blood (i.e. Weight, Age and Hip) still lean toward classes 0 and 2.

The one-vs-rest analysis allows us to extend any binary classifier to multi-class problems and has been extensively employed in multi-class classification problems61,62,63. Here it was used to evaluate the ability of the selected LR model to correctly distinguish patients in a specific class from all the others. Similar results for ALLdata and DXAdata can be observed when comparing the AUCs of the ROC and PR curves (see Fig. 4), while the classification based on the BLDdata was, on average, more accurate (i.e. higher AUCs). Interestingly, BLDdata-derived LR showed better performance in classifying class 1 than the same model obtained from either of the other two datasets.

One could argue that when considering the ALLdata, blood chemicals appear to be less relevant than DXA-based variables (see Fig. 3a). However, despite the small reduction in the AUC for class 2, both the ROC and PR graphs point to better general outcomes for BLDdata-derived LR especially when it comes to class 1 patients which were almost completely classified thanks to the contribution to the model of blood chemicals such as AST, COLT, usCRP, LDL, and FPG, (see Fig. 3c) thus confirming their crucial role in the classification task in the BLDdata.

Clinical significance

This study demonstrated the usefulness of machine learning algorithms in building risk stratification models based not only on variables directly connected to VAT weight (i.e. DXAdata) but also on those derived from non-imaging techniques, such as blood chemical concentrations, that can be altered by an excess of VAT. Indeed, thanks to machine learning, healthcare professionals can take advantage of subtle blood variations, which may remain unnoticed at first glance, and stratify women living with obesity according to their VAT weight using only common laboratory metrics and routine clinical information. In clinical practice, those non-imaging-based models could be employed in primary health care centers for early-stage risk stratification, enabling timely interventions to prevent severe clinical outcomes.

Furthermore, our results may represent an advancement in assessing obesity-related comorbidities, particularly cardiovascular disease because of its clear association with VAT. This gains significant relevance in the context of women, as data from the Framingham Heart Study64 unveils a heightened likelihood of developing cardiovascular disease due to obesity in women (64%) compared to men (46%). Obesity, alongside other contributing factors, markedly influences the prevalence and mortality of cardiovascular diseases in women, establishing it as a crucial focal point for health interventions65,66.

Limitations

First, it is important to note that usCRP may not consistently be included in routine blood sample tests, particularly in primary health care centres. Second, the relatively small number of subjects included might have hampered our ability to find significant changes in some of the parameters proposed in the original study and thus limited the total number of variables that could be included in the final model. Third, the study population included people living with a severe degree of obesity, making it impossible to test the model performance on subjects with lower degrees of obesity. Fourth, the threshold values used to define the three proposed categories may not be suitable for other cohorts, limiting the extrapolation of our results to other populations. Additionally, only women were considered to avoid the confounding effect of gender. Finally, the reduced size of the study population prevents us from applying more sophisticated techniques, such as those based on deep learning analysis, as proposed in similar studies like Agrawal et al.67 or Klarqvist et al.68.

Conclusions and future research

This study aimed to assess the feasibility of data-driven machine learning models to directly estimate VAT weight as well as classify women with obesity based on their VAT content, using the concentration of specific blood parameters. Additionally, we sought to compare the performance of these models with variables derived from DXA.

MLR models failed to robustly predict VAT weight in our cohort independently of the dataset thus preventing their usage in clinical practice. Out of the six classification models LR was found to be the most accurate. Furthermore, the classification obtained through blood chemicals appeared more robust than DXA variables displaying higher accuracy, recall, and precision. Accordingly, the usage of machine learning together with non-imaging techniques can enhance early risk stratification of women living with obesity. This represents a significant advancement in the context of preventive and personalized medicine, offering an easier and more effective approach to managing a life-threatening condition like excess VAT content.

Future investigations, incorporating a larger participant pool and a control group, would strengthen the statistical power of our findings. Moreover, this expansion would facilitate the exploration of more advanced classification and, possibly, direct VAT estimation techniques, including those based on neural networks.