Abstract
Symptoms of acute respiratory infections (ARIs) among under-five children are a global health challenge. We aimed to train and evaluate ten machine learning (ML) classification approaches for predicting mother-reported symptoms of ARIs among children younger than 5 years in sub-Saharan African (sSA) countries. We used the most recent (2012–2022) nationally representative Demographic and Health Surveys (DHS) data from 33 sSA countries. Air pollution covariates, namely global annual surface particulate matter (PM2.5) and nitrogen dioxide (NO2), available as raster images, were obtained from the National Aeronautics and Space Administration (NASA). Machine learning algorithms (MLAs) were used to predict the symptoms of ARIs among under-five children. We randomly split the dataset into two parts: 80% was used to train the models, and the remaining 20% was used to test the trained models. Model performance was evaluated using sensitivity, specificity, accuracy, and the area under the receiver operating characteristic curve (AUC). A total of 327,507 under-five children were included in the study. About 7.10%, 4.19%, 20.61%, and 21.02% of children had symptoms of ARI, severe ARI, cough, and fever, respectively, in the 2 weeks preceding the survey. The prevalence of ARI was highest in Mozambique (15.3%), Uganda (15.05%), Togo (14.27%), and Namibia (13.65%), whereas Uganda (40.10%), Burundi (38.13%), Zimbabwe (36.95%), and Namibia (31.2%) had the highest prevalence of cough. The random forest variable importance plot showed that spatial location (longitude, latitude), particulate matter, land surface temperature, nitrogen dioxide, and the number of cattle in the household were the most important features for predicting symptoms of ARIs among under-five children in sSA. The random forest (RF) algorithm was selected as the best ML model (AUC = 0.77, accuracy = 0.72) for predicting the symptoms of ARIs among children under five. 
The MLAs performed well in predicting the symptoms of ARIs and identifying associated predictors among under-five children across the sSA countries. Random forest was identified as the best classifier for predicting the symptoms of ARI among under-five children.
Introduction
Acute respiratory infections (ARIs) are among the most common childhood illnesses and account for more than 6% of the global disease burden. ARIs are the leading cause of death among children under the age of five1,2. Worldwide, ARIs caused 16% of all deaths in 2015 and killed nearly one million children under the age of five, which is greater than the combined burden of diarrheal illness and malaria2,3,4. According to the World Health Organization (WHO), in 2019 the under-five death rate due to ARIs was 73 per 1000 live births in the African region and 9 per 1000 live births in the European region1,5; that is, the rate in the African region was about eight times that in the European region. Several studies have reported that symptoms of ARIs in under-five children are directly related to the population's environmental, socioeconomic, and cultural characteristics2,6,7,8,9,10. Moreover, air pollution disproportionately affects under-five children residing in low- and middle-income countries (LMICs), including sSA. More than 89% of deaths due to air pollution occur in LMICs, mainly in Africa and Asia11. Africa accounts for the highest excess mortality from ambient air pollution among under-five children, to which ARIs have been suggested as a potential contributor11,12. An estimated 92% of the world's population lives in areas where the air quality index (AQI) limit is exceeded (AQI > 100; values up to about 100 are usually considered safe)13, and about 4.2 million people die every year from diseases attributable to air pollution. Under-five children are at greater risk than other population groups from many of the adverse health effects of air pollution, mainly because of a combination of physiological, environmental, and behavioral factors. 
In addition, children spend much of their time outdoors playing and engaging in physical activity, they breathe air closer to the ground, where some air pollutants are more concentrated, and they have a higher breathing rate than adults, all of which increase their exposure14,15,16.
Previous studies have attempted to identify the determinants of ARIs among under-five children2,6,7,8,9,10,11,12 using linear and non-linear regression models. To the best of our knowledge, only a few previous studies17,18,19,20 have applied machine learning algorithms to predict ARIs among under-five children using air pollution factors. So far, these algorithms have not been extensively applied to the available cross-sectional datasets in low- and middle-income countries (LMICs). Hence, we applied machine learning (ML) algorithms to investigate the effects of air pollutants (such as particulate matter (PM2.5) and nitrogen dioxide (NO2)), climate factors (temperature, land surface temperature, wet days), health-related information, and socio-demographic factors. Furthermore, a generic prediction framework for reliably assessing the symptoms of respiratory infections among children under 5 years using large-scale data and ML algorithms (MLAs) is lacking. To the best of our knowledge, this is the first study to employ different ML techniques to select and identify the risk factors associated with symptoms of ARIs in sSA countries. The MLA approach ranks the features by importance, considers the selected risk factors (features) simultaneously in an unbiased manner, and identifies patterns in the data that are crucial for prediction. The objective of this study was twofold: first, to reveal the possible features for determining ARIs among children, and second, to explore machine learning algorithms that use the best possible features to predict ARIs among children in sub-Saharan African countries.
Materials and methods
Data sources and variables
The data for this study came from two sources. The first is the Demographic and Health Survey (DHS), which is described in detail at https://dhsprogram.com. Data from 33 sSA countries (Fig. 1), including the global positioning system (GPS) coordinates (latitude and longitude) of household clusters, were available (Table 1). The DHS uses multistage sampling to select the sample for each survey in the countries included in this analysis. In the first stage, clusters (enumeration areas (EAs)) are selected from the list of EAs created in the most recent population census of each country; in the second stage, households are selected by systematic sampling within each selected EA. From the selected households, women aged 15–49 years are selected for an in-depth interview21. The geographical covariates were extracted from the DHS site and linked to the original individual DHS datasets through the cluster identification number (ID). The key contextual climate factors in the study include temperature, the aridity index, defined as the ratio of annual precipitation to annual potential evapotranspiration (0, most arid, to 300, most wet), daytime land surface temperature (LST), and the enhanced vegetation index (EVI). The second data source is the National Aeronautics and Space Administration (NASA). From this source, we obtained the air pollution covariates, namely global annual surface particulate matter (PM2.5) concentration and nitrogen dioxide (NO2) for 1998–2019 (v4.03), as estimated by the Atmospheric Composition Analysis Group. These data are available as raster images (GeoTIFF), from which values were extracted in R at the GPS locations (longitude and latitude) of the clusters. The data are publicly available22 at https://sedac.ciesin.columbia.edu/data/set/sdei-global-annual-gwr-pm2-5-modis-misr-seawifs-aod-v4-gl-03. 
This dataset was combined with the original individual DHS datasets based on the community (enumeration areas) and the date of the survey. Air pollution covariates such as NO2 and PM2.5 for each of the EAs from 2012 to 2020 were obtained.
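The linkage described above can be illustrated with a small sketch. It is not the authors' R workflow; the field names (`cluster_id`, `survey_year`, `pm25`, `no2`) and all values are hypothetical, and the join key (cluster and survey year) is an assumption based on the description of matching by enumeration area and survey date.

```python
from typing import Dict, List, Tuple

# Hypothetical key: (cluster ID, survey year) identifies one enumeration area in one survey.
ClusterKey = Tuple[int, int]


def link_covariates(children: List[dict],
                    covariates: Dict[ClusterKey, dict]) -> List[dict]:
    """Attach cluster-level covariates to each child record (a left join)."""
    linked = []
    for child in children:
        key = (child["cluster_id"], child["survey_year"])
        # Children in clusters without covariates keep explicit None placeholders.
        extra = covariates.get(key, {"pm25": None, "no2": None})
        linked.append({**child, **extra})
    return linked


# Made-up records, for illustration only.
children = [
    {"child_id": 1, "cluster_id": 10, "survey_year": 2016},
    {"child_id": 2, "cluster_id": 10, "survey_year": 2016},
    {"child_id": 3, "cluster_id": 11, "survey_year": 2016},
]
covariates = {(10, 2016): {"pm25": 31.4, "no2": 0.8}}

linked = link_covariates(children, covariates)
```

Every child in the same cluster receives the same covariate values, which is why cluster-level exposure measures repeat across individual records after linkage.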
Variables
Outcome variables
To measure the symptoms of respiratory infections, mothers or caregivers were asked whether each of their under-five children had experienced symptoms of ARI (cough, short rapid breaths, or difficulty breathing) and fever within the 2 weeks before the DHS survey; each symptom was recorded as a binary outcome (yes, no). ARI was defined as a history, in the 2 weeks preceding the survey, of cough accompanied by short, rapid breaths (breathing faster than usual) or difficulty breathing23, and severe ARI (SARI) was defined as ARI accompanied by fever24.
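As a minimal sketch, the outcome definitions above can be written as Boolean rules. The function and argument names are hypothetical and do not correspond to DHS variable codes.

```python
def classify_symptoms(cough: bool, rapid_breathing: bool,
                      difficulty_breathing: bool, fever: bool) -> dict:
    """Derive the four binary outcomes from mother-reported symptoms."""
    # ARI: cough together with short, rapid breaths or difficulty breathing.
    ari = cough and (rapid_breathing or difficulty_breathing)
    # SARI: ARI accompanied by fever.
    sari = ari and fever
    return {"ari": ari, "sari": sari, "cough": cough, "fever": fever}


example = classify_symptoms(cough=True, rapid_breathing=True,
                            difficulty_breathing=False, fever=False)
```

Under these rules every SARI case is also an ARI case, and a child with cough alone counts toward cough prevalence but not toward ARI.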
Features (independent variables)
The independent variables extracted were based on a review of the literature3,5,6,7,9,25,26. The variables included in the analysis are summarized in the following framework (Fig. 2).
Model building
Machine learning algorithms, namely Logistic Regression (LR)27, Ridge regression28, Least Absolute Shrinkage and Selection Operator (LASSO) regression29, Elastic Net30,31, Decision trees32, K-Nearest Neighbors (KNN)33, Naïve Bayes32,34,35, Random Forest (RF)31,36, Bagged tree37, Boosting37, and Artificial Neural Network (ANN)38,39, were included in the analysis. All statistical analyses were performed in R version 4.3.1 for Windows (R Development Core Team). The dataset was split with the createDataPartition function of the R caret package, which uses stratified random sampling so that the training and test sets preserve the distribution of the outcome classes, reducing bias from the split.
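The idea behind a stratified 80/20 split can be sketched as follows. This is a plain Python illustration of the principle, not the caret implementation; the function name, the seed, and the toy labels are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.8, seed=42):
    """Split indices so that each outcome class keeps the same train/test ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # random sampling within the class
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Toy outcome with 10% positives (made-up data).
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels)
```

Because sampling happens within each class, the 10% share of positives is reproduced in both parts. The split keeps the original class proportions; it does not rebalance a rare outcome.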
Logistic regression (LR)
LR is a widely applied statistical model for binary classification problems. Let \({y}_{i}\) be the response variable for the ith child, assumed to follow a Bernoulli distribution, taking the value 1 with probability \({\pi }_{i}=P({y}_{i}=1|{{\varvec{x}}}_{i})\), where \({{\varvec{x}}}_{i}={({x}_{1i}, . . . , {x}_{pi})}^{T}\) is the ith child's covariate vector, and the value 0 with probability \(1-{\pi }_{i}\). Then the logistic regression model with the logit link function can be given as:
\[\log \left(\frac{{\pi }_{i}}{1-{\pi }_{i}}\right)={\beta }_{0}+{{\varvec{x}}}_{i}^{T}{\varvec{\beta}}\]
where \({\beta }_{0}\) is the intercept term, and \({\varvec{\beta}}={({\beta }_{1}, . . . , {\beta }_{p})}^{T}\) is a p × 1 vector of regression parameters on the logit scale. With many features (high dimensionality), the traditional LR model has several limitations: over-fitting, multicollinearity, and computational difficulties. To address these problems, we used regularized GLMs, which impose a penalty on the parameters to shrink them toward zero27,28,29,30,31,40.
The ridge regression (\({L}_{2}\) regularization, which shrinks the coefficients of correlated covariates towards each other) is obtained by maximizing the likelihood function with a penalty, controlled by the parameter \(\lambda\), applied to all the parameters except the constant (intercept)27,28. The penalized likelihood formulation for ridge regression is given by Eq. (2).
When \(\lambda\) is very large (\(\lambda \to \infty\)), all the coefficients shrink toward zero, whereas when \(\lambda = 0\), ridge regression reduces to the traditional (unpenalized) approach. The goal is to search for an optimal value between these two extremes.
The LASSO regression uses the \({L}_{1}\) penalty for variable selection and shrinkage. If \(\lambda\) is large enough, it forces some coefficients to be exactly zero, which yields a smaller number of predictors29. The function for the lasso regression is given by Eq. (3).
The optimal regularization parameter (\(\lambda\)) was determined using n-fold cross-validation. The larger the \(\lambda\) value, the stronger the effect of regularization on the number of covariates (features) retained in the model and on their respective coefficients31,41,42. Variables with non-zero estimates are therefore considered important covariates for the outcome variable of interest.
The elastic net regularization combines the penalties of Eqs. (2) and (3)30,31. This method can effectively control for correlated features and also shrink the coefficients of non-informative features to zero30,31,40,43. The elastic net regression is given by Eq. (4).
All the regularized GLMs were fitted in R using the glmnet package44. In this paper, we trained the generalized linear model (GLM) estimators with \(\alpha\) values from the set {0, 0.5, 1}, where \(\alpha\) = 0, 0.5, and 1 correspond to the ridge, elastic net, and lasso penalties, respectively30,31,40.
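For reference, the glmnet documentation states the penalized objective for the binomial family in the form below, which covers the three penalties through the mixing parameter \(\alpha\). The symbol N (the number of children) is introduced here for illustration. The scaling of the penalty follows glmnet's convention and may differ from the scaling used in Eqs. (2) to (4).

```latex
\min_{\beta_0,\,\boldsymbol{\beta}}\;
  -\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i\big(\beta_0+\boldsymbol{x}_i^{T}\boldsymbol{\beta}\big)
  -\log\!\big(1+e^{\beta_0+\boldsymbol{x}_i^{T}\boldsymbol{\beta}}\big)\Big]
  +\lambda\Big[\frac{1-\alpha}{2}\sum_{j=1}^{p}\beta_j^{2}
  +\alpha\sum_{j=1}^{p}\lvert\beta_j\rvert\Big]
```

Minimizing this negative penalized log-likelihood is equivalent to maximizing the penalized log-likelihood. Setting \(\alpha = 0\) leaves only the squared (\(L_2\), ridge) term, \(\alpha = 1\) leaves only the absolute-value (\(L_1\), lasso) term, and \(\alpha = 0.5\) weights the two terms equally (elastic net). The intercept \(\beta_0\) is not penalized.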
Random forest (RF)
RF is a popular supervised ML approach in applied statistics because it is applicable to both classification and regression45,46,47. It is also used for variable screening and dimension reduction48,49,50. It is a tree-based technique in which several decision trees are constructed from random sets of covariates and used to predict an outcome label for subsets of the samples. It builds multiple trees (the forest), and the decision is based on the majority vote over all the trees in the forest. This model is also used to select important features45,46,47,51. A Gini importance analysis was conducted with the random forest to identify the features that have the most impact on the likelihood of developing symptoms of respiratory infections among under-five children in sSA countries.
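Two building blocks of this description can be sketched in a few lines: the Gini impurity decrease of a single split, which Gini importance accumulates over the splits made on a feature across the forest, and the majority vote over trees. This is an illustrative sketch with made-up labels, not the randomForest implementation.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Reduction in Gini impurity achieved by splitting parent into left and right."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted_children


def majority_vote(tree_predictions):
    """Forest prediction: the label predicted by most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


parent = [1, 1, 0, 0]                                        # made-up outcome labels
perfect_split = impurity_decrease(parent, [1, 1], [0, 0])    # separates the classes
useless_split = impurity_decrease(parent, [1, 0], [1, 0])    # leaves them mixed
forest_label = majority_vote([1, 0, 1])
```

A feature whose splits produce large impurity decreases across many trees receives a high importance score, whereas a feature whose splits leave the classes mixed contributes little.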
Naïve Bayesian (NB)
NB is a family of ML classification algorithms built on Bayes' theorem. These algorithms rest on two basic assumptions: first, that the features are independent of one another given the class (hence “naïve”), and second, that each feature makes an independent and equal contribution to the outcome32,34,35. For a binary outcome variable, a Bernoulli naïve Bayes algorithm is appropriate and is given as
\[P(y|X)=\frac{P(X|y)\,P(y)}{P(X)}\]
where X denotes the covariates (predictors), P(X) is the prior probability of the predictors (the evidence), P(y) is the prior, that is, the probability of the outcome before the evidence is seen, and P(X|y) is the likelihood.
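The Bayes rule above can be turned into a small Bernoulli naïve Bayes sketch for binary features. The Laplace smoothing constant, the function names, and the toy data are assumptions made for illustration; this is not the implementation used in the study.

```python
import math


def fit_bernoulli_nb(X, y, smoothing=1.0):
    """Estimate P(y) and P(x_j = 1 | y) from binary features."""
    n_features = len(X[0])
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, target in zip(X, y) if target == label]
        prior = len(rows) / len(y)
        # Laplace-smoothed probability that feature j equals 1 within this class.
        probs = [(sum(row[j] for row in rows) + smoothing) / (len(rows) + 2 * smoothing)
                 for j in range(n_features)]
        model[label] = (prior, probs)
    return model


def predict_proba(model, x):
    """Posterior P(y | x) = P(x | y) P(y) / P(x), computed in log space."""
    log_joint = {}
    for label, (prior, probs) in model.items():
        total = math.log(prior)
        for value, p in zip(x, probs):
            total += math.log(p if value == 1 else 1.0 - p)
        log_joint[label] = total
    peak = max(log_joint.values())
    weights = {label: math.exp(v - peak) for label, v in log_joint.items()}
    evidence = sum(weights.values())   # proportional to P(x)
    return {label: w / evidence for label, w in weights.items()}


# Made-up binary features (e.g. two yes/no exposures) and outcomes.
X = [[1, 1], [1, 0], [0, 0], [0, 1], [0, 0], [1, 1]]
y = [1, 1, 0, 0, 0, 1]
posterior = predict_proba(fit_bernoulli_nb(X, y), [1, 1])
```

The per-feature likelihoods are multiplied (added in log space) because of the conditional independence assumption, and dividing by the evidence makes the class posteriors sum to one.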
Decision trees (DT)
A DT repeatedly splits the given dataset into increasingly homogeneous groups, at each step using the variable that maximizes the similarity within the resulting groups32. The nodes of a DT are usually arranged in multiple levels, and the topmost (first) node is known as the root node. Predictions and classifications are made by evaluating a new individual against the established splitting criteria. The DT classifier was constructed using the R package rpart, applying the classification and regression tree (CART) algorithm to build binary trees.
Figure 3 shows the research workflow. Before any statistical analysis, the data were pre-processed, which was followed by feature selection. Data management included checks for missing values, outliers, and illogical values. Missing values were handled as follows: observations with missing values were excluded from the analysis if the missingness of the variable was less than 10%, whereas mean imputation (for continuous variables) and mode imputation (for categorical variables) were used if missingness was greater than 10%. The process was carried out iteratively until all variables were 100% complete. The three-step approach consisted of feature selection, model comparison, and selection and interpretation of the best ML models. Random forest, one of the common approaches to identifying important features46,47,50,51,52, was used for feature selection. It generated 1000 trees and used the Gini criterion to compute the importance of each feature; the second quartile (median) of the importance scores was used as the cut-off point for selecting important features. Only the symptoms of ARI were used as the outcome (dependent, or target) variable in the machine learning analyses. To assess the performance of the ML classifiers, we randomly split the dataset into training (80%) and testing (20%) datasets. The performance of the ML models was evaluated using sensitivity, specificity, the area under the curve, and accuracy31,41,42,53,54,55,56, calculated using the observed data as the gold standard.
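Two of the pre-processing steps above, mean or mode imputation and the median cut-off on importance scores, can be sketched as follows. The feature names and scores are placeholders, not results from the study, and the strict "greater than the median" rule is an assumption about how ties at the cut-off are handled.

```python
from statistics import mean, median, mode


def impute(values, categorical=False):
    """Fill missing entries (None) with the mode (categorical) or mean (continuous)."""
    observed = [v for v in values if v is not None]
    fill = mode(observed) if categorical else mean(observed)
    return [fill if v is None else v for v in values]


def select_features(importance):
    """Keep features whose importance score exceeds the median (second quartile)."""
    cutoff = median(importance.values())
    kept = [name for name, score in importance.items() if score > cutoff]
    return cutoff, kept


# Placeholder values, for illustration only.
filled_continuous = impute([1.0, None, 3.0])
filled_categorical = impute(["a", None, "a", "b"], categorical=True)

importance = {"feature_a": 95.0, "feature_b": 88.0, "feature_c": 60.0,
              "feature_d": 12.0, "feature_e": 8.0}
cutoff, kept = select_features(importance)
```

With a median cut-off, roughly half of the candidate features are retained regardless of how the scores are distributed, which makes it a simple rule of thumb for dimension reduction rather than a data-driven threshold.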
After constructing the ML models, sensitivity, specificity, accuracy, and the area under the curve (AUC) were calculated to assess performance. The AUC is an aggregate measure that can be interpreted as the probability that a classifier ranks a randomly chosen child with symptoms above a randomly chosen child without symptoms54,57. The AUC of the receiver operating characteristic (ROC) curve was averaged over 10 cross-validation folds (ten repeats)54; this procedure partitions the original sample into ten disjoint subsets, uses nine of those subsets for training, and then makes predictions for the remaining subset. Classifiers whose ROC curves lie closer to the top-left corner perform more reliably; on this criterion, the RF model was the most accurate in distinguishing symptoms of respiratory infections among children under 5 years. The ROC curve is a graphical representation of the diagnostic capability of a binary classifier, plotting the false positive rate (FPR, that is, 1 - specificity) on the horizontal axis against sensitivity, that is, the true positive rate (TPR), on the vertical axis. The best-fitting model was then used to predict respiratory symptoms in a separate dataset, the test dataset31,41,42,53,54,55.
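The four performance measures can be computed directly from observed labels and predicted scores, as in the sketch below. The 0.5 classification threshold and the toy data are assumptions; the AUC is computed by comparing every positive with every negative, which is simple to read but slow for large samples.

```python
def confusion_counts(y_true, y_pred):
    """Return true positives, false positives, true negatives, false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def evaluate(y_true, scores, threshold=0.5):
    """Sensitivity, specificity, and accuracy at a given score threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "sensitivity": tp / (tp + fn),          # true positive rate
        "specificity": tn / (tn + fp),          # 1 - false positive rate
        "accuracy": (tp + tn) / len(y_true),
    }


def auc(y_true, scores):
    """Probability that a random positive scores higher than a random negative."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Made-up observed outcomes and predicted probabilities.
y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.1]
metrics = evaluate(y_true, scores)
area = auc(y_true, scores)
```

Sensitivity, specificity, and accuracy depend on the chosen threshold, whereas the AUC summarizes ranking quality across all thresholds, which is why it is commonly used to compare classifiers.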
Compliance with ethics guidelines
The protocol for the sub-Saharan DHS was approved by the Humanities and Social Sciences Research Ethics Committee (HSSREC/00005776/2023) of the University of KwaZulu-Natal. The authors obtained permission from the demographic and health survey (DHS) program to download and use the data for this analysis and the need for informed consent was waived.
Results
Table 1 presents the prevalence of symptoms of respiratory infections among under-five children from 33 sSA countries. A total of 327,507 under-five children were included in the study. The overall prevalence of symptoms of ARI, SARI, cough, and fever for all countries was 7.10, 4.19, 20.61, and 21.02% respectively. However, there are inequalities in the symptoms of respiratory infections among under-five children across sSA countries (Table 1, Fig. 4).
Table 1 The number of under-five children across the DHS waves for each country and the prevalence of symptoms of respiratory infections among under-five children in sSA

Survey countries | Survey year | Weighted sample | Percent | Children with ARI n (%) | Children with SARI n (%) | Children with cough n (%) | Children with fever n (%) |
---|---|---|---|---|---|---|---|
Angola | 2015 | 13,439 | 4.10 | 606 (4.51) | 317 (2.36) | 1416 (10.54) | 1934 (14.39) |
Benin | 2017 | 12,529 | 3.83 | 702 (5.60) | 395 (3.15) | 2016 (16.09) | 2427 (19.37) |
Burkina Faso | 2021 | 11,763 | 3.59 | 377 (3.20) | 230 (1.96) | 1308 (11.12) | 2622 (22.29) |
Burundi | 2016 | 12,432 | 3.80 | 1549 (12.46) | 1063 (8.55) | 4740 (38.13) | 4639 (37.31) |
Cameroon | 2017 | 8986 | 2.74 | 373 (4.15) | 167 (1.86) | 1687 (18.77) | 1387 (15.44) |
Chad | 2015 | 16,644 | 5.08 | 1794 (10.78) | 1053 (6.33) | 3092 (18.58) | 3531 (21.21) |
Comoros | 2011 | 2916 | 0.89 | 200 (6.86) | 130 (4.46) | 516 (17.70) | 622 (21.33) |
Congo democratic | 2013 | 16,960 | 5.18 | 2098 (12.37) | 1244 (7.33) | 5306 (31.29) | 5229 (30.83) |
Ivory Coast | 2017 | 9888 | 3.02 | 188 (1.90) | 111 (1.12) | 1187 (12.00) | 1724 (17.44) |
Ethiopia | 2016 | 9911 | 3.03 | 795 (8.02) | 493 (4.97) | 1583 (15.97) | 1354 (13.66) |
Gabon | 2019 | 5882 | 1.80 | 233 (3.96) | 150 (2.55) | 1426 (24.24) | 1311 (22.29) |
Gambia | 2019 | 7764 | 2.37 | 578 (7.44) | 288 (3.71) | 1463 (18.84) | 1324 (17.05) |
Ghana | 2014 | 5544 | 1.69 | 364 (6.57) | 178 (3.21) | 744 (13.42) | 821 (14.81) |
Guinea | 2018 | 6633 | 2.03 | 287 (4.33) | 157 (2.37) | 744 (11.22) | 1123 (16.93) |
Kenya | 2022 | 18,705 | 5.71 | 582 (3.11) | 340 (1.82) | 4328 (23.14) | 3143 (16.80) |
Lesotho | 2014 | 2818 | 0.86 | 259 (9.19) | 167 (5.93) | 789 (28.00) | 405 (14.37) |
Liberia | 2019 | 5083 | 1.55 | 518 (10.19) | 325 (6.39) | 1379 (27.13) | 1471 (28.94) |
Madagascar | 2021 | 11,647 | 3.56 | 651 (5.59) | 323 (2.77) | 2217 (19.03) | 1438 (12.35) |
Malawi | 2015 | 16,209 | 4.95 | 1648 (10.17) | 1044 (6.44) | 3889 (23.99) | 4687 (28.92) |
Mali | 2018 | 9175 | 2.80 | 311 (3.39) | 189 (2.06) | 866 (9.44) | 1497 (16.32) |
Mauritania | 2019 | 10,956 | 3.35 | 672 (6.13) | 495 (4.52) | 1372 (12.52) | 1874 (17.10) |
Mozambique | 2015 | 4954 | 1.51 | 758 (15.30) | 295 (5.95) | 1415 (28.56) | 1300 (26.24) |
Namibia | 2013 | 4426 | 1.35 | 604 (13.65) | 380 (8.59) | 1381 (31.20) | 1128 (25.49) |
Nigeria | 2018 | 30,597 | 9.34 | 1603 (5.24) | 940 (3.07) | 4816 (15.74) | 7535 (24.63) |
Rwanda | 2019 | 7758 | 2.37 | 587 (7.57) | 351 (4.52) | 2208 (28.46) | 1468 (18.92) |
Senegal | 2019 | 5726 | 1.75 | 430 (7.51) | 270 (4.72) | 848 (14.81) | 920 (16.07) |
Sierra Leone | 2019 | 8878 | 2.71 | 354 (3.99) | 233 (2.62) | 1231 (13.87) | 1473 (16.59) |
South Africa | 2016 | 3250 | 0.99 | 150 (4.62) | 108 (3.32) | 820 (25.23) | 647 (19.91) |
Tanzania | 2022 | 10,197 | 3.11 | 221 (2.17) | 145 (1.42) | 1197 (11.74) | 1011 (9.91) |
Togo | 2013 | 6460 | 1.97 | 922 (14.27) | 498 (7.71) | 1698 (26.28) | 1413 (21.87) |
Uganda | 2016 | 14,378 | 4.39 | 2164 (15.05) | 1349 (9.38) | 5766 (40.10) | 5027 (34.96) |
Zambia | 2019 | 9308 | 2.84 | 241 (2.59) | 142 (1.53) | 1948 (20.93) | 1549 (16.64) |
Zimbabwe | 2015 | 5691 | 1.74 | 445 (7.82) | 166 (2.92) | 2103 (36.95) | 796 (13.99) |
Total | | 327,507 | 100 | 23,264 (7.10) | 13,736 (4.19) | 67,499 (20.61) | 68,830 (21.02) |
Table 2 summarizes the preliminary analysis of symptoms of ARI using a generalized linear model (logistic regression), with the features and their relative importance values reported separately for socio-demographic, geospatial, health and nutrition, and environmental covariates. Among the socio-demographic variables, age of the mother, place of residence, and media exposure were positive predictors of the symptoms of ARIs. The same was true of breast-feeding, nutritional status (stunting, wasting, and underweight), and dietary diversity among the health- and nutrition-related features, and of enhanced vegetation index, aridity, wet days, and minimum temperature among the geospatial covariates. Additionally, the environmental features (source of drinking water and toilet facility), the air pollution features (fuel type, cooking place, and PM2.5), and spatial location (longitude, latitude) were significantly associated with the symptoms of ARI among under-five children in sSA countries (Table 2).
A relative importance score larger than the second quartile (20.3) was used as the cut-off point for selecting the important features used in the subsequent machine learning models. As a result, 21 features were retained for the subsequent analysis. As shown in Fig. 5, the top features with strong influence on the symptoms of ARI among under-five children in sSA countries were air pollution and climatic factors: household air pollution factors (cooking indoors or outdoors and type of fuel) and air pollutants such as particulate matter (PM2.5) and nitrogen dioxide. Among the geospatial and climate variables, spatial location (longitude, latitude), LST, EVI, number of cattle, maximum and minimum temperature, aridity, and wet days had relative importance scores greater than the second quartile (20.3). In contrast, only the mother's age and the sex of the child among the socio-demographic features, and diarrhea status and vitamin A supplementation among the health-related features, were selected for the ML models predicting the symptoms of ARIs among under-five children across sSA countries. Finally, the proposed ML models, namely GLM (logistic regression), ridge, LASSO, elastic net, ANN, KNN, boosting, naïve Bayes, DT, RF, and bagged trees, were fitted with the selected features to classify the symptoms of ARIs of under-five children in sSA countries (Fig. 5).
The supervised machine learning models were evaluated on a randomly sampled 20% of the dataset held out as a test sample (Table 3). Table 3 shows no substantial difference in accuracy among the MLAs for predicting the symptoms of ARI among under-five children in sSA countries. The best performance was obtained by random forest, boosting, ANN, and bagged trees, with AUCs of 0.77, 0.76, 0.74, and 0.74, respectively. The lowest performance was observed for DT and NB, with AUCs of 0.68 and 0.70, respectively (Table 3, Supplementary Fig. S1).
Discussion
This study presents a comprehensive statistical analysis of covariates associated with ARIs among under-five children in sub-Saharan African countries, employing both descriptive data exploration and advanced machine learning algorithms. It highlights a large variation in the country-level prevalence of symptoms of ARIs among under-five children. Previous literature has shown that the prevalence of ARIs varies from country to country6,7,8,58 and from district to district within the same country7,58,59,60.
One of the aims of this study was to apply ML algorithms to identify the key determinants (features) of ARIs among under-five children using a large dataset across sub-Saharan African countries. This is the first study to demonstrate the implementation of ML algorithms for predicting acute respiratory infection rates in sSA countries. The results showcase the superior predictive capability of the MLAs compared with conventional statistical techniques in identifying features linked to ARIs. This is not surprising, since MLAs have been shown to outperform traditional statistical models in several fields61,62,63,64. We employed several ML techniques and assessed their predictive capabilities. All the techniques employed in this study achieved AUC values above the chance level (0.5). Using machine learning algorithms, our analysis of the multi-country DHS datasets strongly indicated an association of air pollution and environmental variables with the symptoms of ARI among children in sSA countries. In our study, PM2.5 was the most influential variable increasing the risk of ARI, together with NO2. Both PM2.5 and NO2 have been associated with the occurrence of respiratory infections11,12,16,65. Specifically, studies using the support vector machine algorithm66,67 have previously shown that ARI is associated with NO2. Other previous researchers applied parametric linear models and semi-parametric and generalized additive models68,69,70,71 to determine the effects of air pollutants on symptoms of respiratory infections. To the best of our knowledge, few studies have used machine learning models to determine the association between air pollutants and human health72,73,74,75, and none have used ML models to determine the effects of air pollutants on children's symptoms of respiratory infections across the sub-Saharan region. 
In this study, climate factors, such as temperature and wet days, and spatial location (longitude, latitude) were among the top features associated with the symptoms of respiratory infections. This is consistent with previous studies76,77,78,79 showing that temperature affects the occurrence of the symptoms of ARIs.
Nowadays, with the availability of large health-related data repositories (such as electronic medical records) and advances in computing power, classical statistical analysis is being combined with advanced machine learning algorithms to predict and classify target variables (outcomes)80,81,82. Feature selection and feature relevance become prominent, especially in datasets with many features (independent variables)37,52,81,82,83. The RF approach has also been used for feature selection in previous studies46,47,52,74. Using this approach, we found that the most important features were particulate matter, age of the mother, spatial location (longitude, latitude), land surface temperature, enhanced vegetation index, nitrogen dioxide, aridity, wet days, and temperature, among others; similar results were obtained in previous studies6,7,8,58,84,85,86. In this study, all the ML classification approaches achieved greater accuracy in predicting symptoms of ARI than traditional models such as the GLM, in line with studies on other target variables46,47,52,74,75,87. The study used large nationally representative datasets from 33 sSA countries to examine and select the important features for predicting the symptoms of ARIs. This large dataset made it possible to apply advanced ML approaches, which supports the accuracy of the findings. However, this study has some limitations. First, we considered only the most recent DHS dataset for each country, and hence we did not model the variables over time. Second, the data are cross-sectional, so we can draw conclusions only about statistical association (not causality). Third, the surveys were conducted in different years, so comparisons of prevalence between countries should be interpreted with caution. Lastly, even though the random forest method is commonly used for feature selection, other methods may prioritize features differently. 
Therefore, our future work will focus on including temporal effects to draw inferences over time and, possibly, about causality.
Conclusion
The present study assessed the performance of various supervised machine learning algorithms for the prediction of symptoms of respiratory infections using data from DHS and NASA sources. Before feature selection, our dataset contained a total of 51 features and 327,507 under-five children. Feature selection is essential for the classification and prediction of target variables. Using the random forest approach, the contributions of the features were ranked by average Gini importance, and only 21 features were retained for the ML models. Particulate matter (PM2.5), age of the mother, spatial location (longitude, latitude), land surface temperature, enhanced vegetation index, nitrogen dioxide, aridity, wet days, and temperature were found to be the most important predictors of symptoms of ARI among children in sSA countries. The selected features had scores greater than the second quartile (median), which was used as a rule of thumb for dimension reduction. The present study attempted to identify the best ML algorithms for the prediction of symptoms of ARI using nationwide cross-sectional data from 33 sSA countries. The performances of these ML models were compared using sensitivity, specificity, accuracy, and AUC. Air pollution is a leading cause of symptoms of respiratory infections (fever, cough, ARI, and SARI) among children and adults. In addition, the ML algorithms were more accurate in predicting the symptoms, and this result may apply to other target variables in large datasets. The findings of this study established the potential of ML techniques for predicting the presence of ARI among under-five children across sSA countries. 
This opens up opportunities for the development of automated screening tools and decision support systems that may assist the concerned bodies in diagnosing and managing ARIs among under-five children in the region. Moreover, spatial location (longitude, latitude) is one of the influential features in predicting symptoms of ARIs; hence, if a spatial model is integrated with the ML models, it is possible to identify and flag under-five children who are most at risk, so that data-driven interventions can be targeted to the communities where those children live.
Data availability
The datasets generated and analyzed during the current study are available in the DHS repository (https://dhsprogram.com/data), subject to permission from the DHS Program.
Acknowledgements
The datasets used in this study were obtained from the DHS Program and NASA, and we are grateful for the authorization to download them from the website. This research is supported by the Fogarty International Center of the National Institutes of Health under Award Number U2RTW012140. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Funding
National Institutes of Health (NIH) and the Fogarty International Center (FIC).
Author information
Contributions
H.M.F. carried out data management and data analysis and wrote the first draft of the manuscript. T.Z., S.N., R.N., and H.M. contributed to the conceptualization, editing, and review of the manuscript. All authors contributed to the article and approved the submitted version.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Fenta, H.M., Zewotir, T.T., Naidoo, S. et al. Factors of acute respiratory infection among under-five children across sub-Saharan African countries using machine learning approaches. Sci Rep 14, 15801 (2024). https://doi.org/10.1038/s41598-024-65620-1