Exploring Machine Learning Algorithms to Predict Diarrhea Disease and Identify its Determinants among Under-Five Years Children in East Africa

Yehuala, Tirualem Zeleke; Derseh, Nebiyu Mekonnen; Tewelgne, Makda Fekadie; Wubante, Sisay Maru

doi:10.1007/s44197-024-00259-9

Research Article
Open access
Published: 29 July 2024

Journal of Epidemiology and Global Health

Tirualem Zeleke Yehuala¹,
Nebiyu Mekonnen Derseh²,
Makda Fekadie Tewelgne¹ &
…
Sisay Maru Wubante¹

Abstract

Background

Diarrhea is the second most common cause of death among children under five. Predicting diarrheal disease early and identifying its determinants (factors) with advanced machine learning models is an effective way to help save children's lives. Hence, this study aimed to predict diarrheal disease, identify its determinants, and generate rules using machine learning models.

Methods

The study analyzed secondary data from the Demographic and Health Surveys (DHS) of 12 East African countries using Python. Machine learning techniques, namely Random Forest, Decision Tree (DT), K-Nearest Neighbor, and Logistic Regression (LR), were used for prediction, and wrapper feature selection and SHAP values were used to identify determinants.

Result

The final experimental results indicated that the random forest model performed best in predicting diarrhea disease, with an accuracy of 86.5%, precision of 89%, F-measure of 86%, AUC of 92%, and recall of 82%. The important predictors identified were child's age, country, wealth status, mother's educational status, mother's age, source of drinking water, number of under-five children, immunization status, media exposure, timing of breastfeeding, mother's working status, type of toilet, and twin status, all of which were associated with a higher predicted probability of diarrhea disease.

Conclusion

The findings suggest that making child caregivers fully aware of sanitation and child feeding practices, and educating mothers, can reduce child mortality from diarrhea in East Africa. This leads to a recommendation for policy direction to reduce infant mortality in East Africa.

1 Introduction

The major symptom of diarrhea is watery, loose, and more frequent bowel motions. It can be caused by viruses, bacteria, parasites, medications, and lactose intolerance, among other things [1]. According to a World Health Organization report, diarrhea is the second most common cause of death for children under five. Globally, there are nearly 1.7 billion cases of childhood diarrheal disease every year, and diarrhea kills about 525,000 children under five per year. Diarrhea is the cause of almost 1.6 million child fatalities annually, or nearly one in every five child deaths [2]. In 2021, more than 90% of deaths from diarrhea among children under five occurred in low- and lower-middle-income countries, with South Asia and sub-Saharan Africa (SSA) accounting for 88% [3]. Diarrhea poses numerous challenges for children, such as decreased appetite, vomiting, low electrolyte levels, abdominal pain, undernourishment, heightened vulnerability to other contagious illnesses, and delayed physical and mental development, which may lead to impairment [3, 4].

Several studies have shown that East African nations have a significant prevalence of diarrheal illness in children under the age of five. The reported prevalence is 8.5–30.5% in Ethiopia [5], 25.6% in Nairobi [6], 7.1% in Uganda [7], 20%, 40%, and 19% in North Sudan, South Kordofan, and Blue Nile, respectively [8], and 12.7% in Rwanda [9]. Other research in Malawi, Rwanda, and Uganda found that the prevalence of diarrheal illness was 20% [10], 26.7% [11], and 32% [12], respectively. Diarrheal illness can be prevented by using clean, safe drinking water, practicing good hygiene, and washing hands correctly, all of which help stop its spread [13]. However, this is challenging in the world's poorest nations, such as Ethiopia, Sudan, and Somalia [14].

Several studies have examined diarrhea disease among children under five years old using classical statistical analysis techniques, which could limit the potential to discover hidden knowledge. These studies were conducted in Ethiopia, Senegal, North Tanzania, Kenya, Uganda, and sub-Saharan Africa [15,16,17,18,19,20]. No review exists that focuses on the early prediction of diarrhea disease using machine learning.

Existing evidence comes from cross-sectional studies, unmatched case-control studies, retrospective cross-sectional studies, multivariate logistic regression, and community-based longitudinal analyses, so little is known about how well diarrhea can be predicted. To the best of our knowledge, no prior researcher has attempted to use machine learning approaches to predict diarrhea in children under five. Therefore, the goal of this research is to use machine learning algorithms to determine the factors or determinants that influence diarrhea in children under five and to create prediction models based on those factors. The two primary questions this study seeks to address are as follows:

RQ1

Which determinants are the most significant for diarrhea disease?

RQ2

Which machine learning models help to effectively predict a diarrhea disease?

2 Method and Materials

This section describes the DHS data sets from 12 East African countries that were used in the study. It then presents the prediction models for diarrhea disease, which are based on the supervised machine learning algorithms random forest, logistic regression, decision tree, K-nearest neighbor, and gradient boosting. Lastly, the evaluation methods for the supervised machine learning models are discussed.

2.1 Study Setting

This study was conducted in East African countries using the DHS data set. Geographically, East Africa is a sub-region of Africa that includes 20 internationally recognized countries. Of these, Burundi, Ethiopia, Comoros, Uganda, Rwanda, Tanzania, Mozambique, Madagascar, Zimbabwe, Kenya, Zambia, and Malawi were included in this study.

2.2 Data Source

The MEASURE DHS program served as the study's data source. Access is requested at http://www.dhsprogram.com by submitting the project title and study justification. The DHS is a nationally representative household survey that is conducted on a regular basis. We used the Kids Record dataset (KR file) for this investigation. The most recent population health surveys from 12 countries in Eastern Africa were used in this study. The other East African countries were excluded because they had not conducted a DHS or because too much time had passed since their last one.

2.3 Population and Eligibility Criteria

The study population comprised all under-five children who were in the designated enumeration areas at the time of DHS data collection, whereas all children aged 0 to 59 months in East African countries were regarded as the source population.

2.4 Study Variables and Measurements

In this investigation, the outcome variable was binary. Mothers were asked whether their child had had diarrhea in the previous two weeks; a response of yes was coded as 1 and a response of no was coded as 0. The features (independent variables) were chosen using feature selection methods.

2.5 Sample Size Determination and Sampling Technique

We used a weighted sample of 89,875 children aged 0 to 59 months from the most recent DHS datasets of 12 East African countries, drawn from the Kids Record dataset (KR file). Study participants were selected using a two-stage stratified cluster sampling technique. In the first stage, Enumeration Areas (EAs), which serve as the sampling clusters, were chosen at random. In the second stage, households were chosen. In each selected household, mothers were interviewed with an individual questionnaire.

2.6 Data Analysis Procedure

A demographic and health survey data set from 12 East African countries was used in this study to predict diarrhea disease. Machine learning algorithms were used to produce objective predictions of diarrhea disease and to identify the determinants that lead children to have it. We developed the predictive model using machine learning algorithms and training data sets. Before building the predictive model, we performed data processing, a step that transforms raw data into an understandable format [21]. Data processing and analysis were performed using Python (version 3.9) and packages such as pandas, scikit-learn, imbalanced-learn (imblearn), Theano, NumPy, and Seaborn or Matplotlib, which were used for preparing, discretizing, and transforming data and for selecting, visualizing, exploring, training, and evaluating models. Finally, we developed a predictive model that predicts diarrheal status and identifies its associated determinants (Fig. 1).

2.7 Data Preprocessing

2.7.1 Data Cleaning

In this study, data processing and analysis were performed using Python. The DHS data set is by nature neither clean nor aggregated; it requires tasks such as data cleaning, data transformation, exploratory data analysis, data integration, normalization, dimensionality reduction, and data discretization before model selection, model training, and model evaluation. Preprocessing has several benefits, including raising the model's accuracy and improving its performance [22]. We used data cleaning to address noise, missing values, and outliers. Raw data cannot be passed directly through model training and testing because it usually contains missing values, which produce biased or misleading results [23]. We examined the data for outliers. Missing values were found for some features, ranging from 0.5% to 3% of the dataset. To address them, we imputed the mean for continuous variables and the most frequent value (mode) for categorical variables.
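The imputation step described above can be sketched in a few lines. This is a minimal illustration in plain Python rather than the pandas or scikit-learn code the authors may have used; the `impute` helper and the column names are hypothetical and are not DHS variable codes.

```python
from statistics import mean, mode

def impute(rows, continuous, categorical):
    """Fill None values: column mean for continuous, column mode for categorical."""
    fill = {}
    for col in continuous:
        fill[col] = mean(r[col] for r in rows if r[col] is not None)
    for col in categorical:
        fill[col] = mode(r[col] for r in rows if r[col] is not None)
    # Build new records so the input rows are left unchanged.
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
        for r in rows
    ]

# Hypothetical toy records, for illustration only.
records = [
    {"mother_age": 24, "toilet": "improved"},
    {"mother_age": None, "toilet": "unimproved"},
    {"mother_age": 30, "toilet": None},
    {"mother_age": 36, "toilet": "unimproved"},
]
cleaned = impute(records, continuous=["mother_age"], categorical=["toilet"])
```

Here the missing age is replaced by the mean of the observed ages and the missing toilet type by the most frequent observed category.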

2.7.2 Feature Selection

The dataset in this study contains over a thousand features. Not all of them are relevant, however, so feature selection is essential: unnecessary features used during training degrade the model's overall accuracy, increase its complexity, limit its capacity to generalize, and bias the model [24]. We used wrapper-based feature selection with machine learning algorithms and SHAP values to select relevant features for model building.

2.7.3 Data Transformation

Because the DHS dataset is not aggregated, the raw data must be reorganized or restructured and converted to consistent data types [25]. This allows machine learning algorithms to retrieve information efficiently and easily. In this study, we used one-hot encoding in Python to convert categorical variables into dummy variables, with each category represented as a separate variable coded 0 or 1. The outcome was coded 0 for children who had no diarrhea in the previous two weeks and 1 for children who had diarrhea in the previous two weeks.
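As an illustration of one-hot encoding, the sketch below expands one categorical column into 0/1 indicator columns. It is written in plain Python for self-containment; the `one_hot` helper, the `wealth` column, and its categories are assumptions made for the example, not the study's actual code or variable names.

```python
def one_hot(rows, column):
    """Replace `column` with one 0/1 indicator per observed category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        # Copy every field except the one being encoded.
        new = {k: v for k, v in r.items() if k != column}
        for cat in categories:
            new[f"{column}_{cat}"] = 1 if r[column] == cat else 0
        encoded.append(new)
    return encoded

# Hypothetical toy records; "diarrhea" is the 0/1 outcome.
rows = [
    {"wealth": "poor", "diarrhea": 1},
    {"wealth": "middle", "diarrhea": 0},
    {"wealth": "rich", "diarrhea": 0},
]
encoded = one_hot(rows, "wealth")
```

Each record ends up with exactly one indicator set to 1 among the new `wealth_*` columns.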

2.7.4 Data Discretization and Integration

Discretization converts continuous variables into a discrete form [26]. To make the data easier to understand and analyze, we converted continuous variables into discrete features according to DHS guidelines, which also minimizes the influence of outliers and reduces noise. For example, children's age was categorized into ranges of 0–6 months, 7–12 months, 13–23 months, and 24–59 months for easy analysis and interpretation. Additionally, we integrated the 12 country datasets into a single dataset.
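The age bands listed above can be expressed as a small lookup. This is a sketch only: the `age_band` function and the label strings are hypothetical, and the cut points are taken directly from the ranges given in the text.

```python
from bisect import bisect_left

# Upper bound (in months, inclusive) of each band named in the text.
UPPER_BOUNDS = [6, 12, 23, 59]
LABELS = ["0-6 months", "7-12 months", "13-23 months", "24-59 months"]

def age_band(age_months):
    """Map a child's age in completed months to its discrete band."""
    if not 0 <= age_months <= 59:
        raise ValueError("age must be between 0 and 59 months")
    return LABELS[bisect_left(UPPER_BOUNDS, age_months)]

bands = [age_band(a) for a in (0, 6, 7, 12, 13, 23, 24, 59)]
```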

2.7.5 Class Balancing

When presented with an imbalanced data set, machine learning algorithms are prone to bias toward the majority class [27]. We therefore balanced the data before training the prediction model, using SMOTE oversampling to improve classification performance. Of the 89,875 records extracted for this study, 76,672 (85%) were children who did not have diarrhea disease and 13,203 (15%) were children who had it.

2.8 Machine Learning Classifiers

The study used logistic regression, gradient boosting, random forest, K-nearest neighbor, and decision tree classifiers to predict diarrhea in children under five. Logistic regression estimates the probability that an event occurs and is used when the target variable is binary (yes/no) [28]. It is efficient to train and simple to apply and interpret. The decision tree is another widely used method for representing predictions. Decision trees are resistant to outliers, are clearly interpretable, and can handle huge, complex datasets efficiently without imposing a complex parametric structure; they can also be fairly strong when employed in ensemble algorithms [29]. The random forest algorithm is a supervised classification method in which several decision trees work together: every tree in the forest predicts a class, and the class that receives the most votes becomes the model's prediction. This voting offsets the drawbacks of a single decision tree [30]. As a result, the model is less prone to overfitting and more accurate. When applied to huge datasets, the random forest approach may also maintain its performance even if a sizable percentage of record values are missing.
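The voting step of a random forest can be illustrated with a toy example. The three "trees" below are hand-written one-question rules invented for this sketch; in a real random forest each tree is learned from a bootstrap sample of the training data, and the field names here are hypothetical.

```python
from collections import Counter

def forest_predict(trees, record):
    """Return the class that receives the most votes across all trees."""
    votes = Counter(tree(record) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for trained decision trees (1 = diarrhea, 0 = none).
trees = [
    lambda r: 1 if r["age_band"] == "7-12 months" else 0,
    lambda r: 1 if r["toilet"] == "unimproved" else 0,
    lambda r: 1 if r["media_exposure"] == "no" else 0,
]

child = {"age_band": "7-12 months", "toilet": "improved", "media_exposure": "no"}
prediction = forest_predict(trees, child)  # two of three trees vote 1
```

Using an odd number of trees, as here, avoids tied votes in a binary problem.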

2.8.1 Evaluation Criteria

A confusion matrix was used to study model performance, and a number of common evaluation metrics, namely accuracy, ROC curve, precision (P), recall (R), and F-measure, were used to assess our prediction models. A confusion matrix is a table that shows how well a supervised learning algorithm is performing. True positives (TP) are children who have diarrhea and are correctly predicted to have it. True negatives (TN) are children whom the model correctly predicts not to have diarrhea. False positives (FP) are children whom the model incorrectly predicts to have diarrhea. False negatives (FN) are children with diarrhea who are mistakenly classified as being free of it.
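The metrics named above follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions; the helper names and the ten toy labels are made up for illustration and are not study data.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels where 1 = diarrhea."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F-measure from the four counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
    }

# Hypothetical labels for ten children.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
scores = metrics(*counts)
```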

Furthermore, the Receiver Operating Characteristic (ROC) curve provides a comprehensive assessment of a model's performance across the range of decision threshold values [31]. In a test dataset, the binary classifier predicts each data instance as either positive or negative. The resulting 2 × 2 matrix of positive and negative classes, the confusion matrix, is used to measure the model's effectiveness (Table 1).

Table 1 Confusion matrix

3 Results

3.1 Description of Diarrhea Disease in East Africa

This study investigated a sample of 89,875 children under the age of five from the demographic and health surveys of 12 East African countries. Overall, 15% of children had diarrhea and 85% did not. Most of the records, 50,262 (56%), were for children aged 25–59 months. In the 7–12 month age group, 27.3% of children had diarrhea, compared with 10% in the 25–59 month group. Among children whose mothers had no education, 16% had diarrhea, compared with 13% among children whose mothers had primary education. Approximately 15% of children who used unimproved toilets had diarrhea, compared with 14% of those who used improved toilets. Among children whose mothers had media exposure, 86% did not have diarrhea (Table 2).

Table 2 Description of diarrhea disease in East Africa

3.2 Class Balancing

To balance the data for this study, we used the Synthetic Minority Oversampling Technique (SMOTE). This technique generates additional synthetic observations from the minority category in order to balance the unequal distribution of the outcome variable. Before SMOTE balancing, 76,672 (85%) children had no diarrhea disease and 13,203 (15%) had diarrhea. After balancing, we obtained a sample with 76,672 records in each class (Fig. 2).
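The core idea of SMOTE is to place each synthetic record on the line segment between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified plain-Python version for numeric features, not the imbalanced-learn implementation; the `smote` function, its parameters, and the toy points are assumptions made for illustration. It also shows how many synthetic records the reported class counts imply.

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Class counts reported in the study.
n_majority, n_minority = 76_672, 13_203
n_needed = n_majority - n_minority  # synthetic records needed for a 1:1 ratio

# Hypothetical 2-D minority samples, for illustration only.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6)
```

Because every synthetic point is an interpolation, it always lies within the range spanned by the existing minority samples.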

3.3 Determinant Selection / Feature Selection

Feature selection identifies the important subsets of features; it improves learning performance, helps isolate influential determinants by eliminating extraneous or redundant features, and cuts down on training time [30]. In this study, we used recursive feature elimination (RFE), which infers each feature's relevance from the importance estimated by a random forest model, and all features were selected. We then used a random forest with SHAP values to narrow down the set of potential features, as shown in Fig. 3.

3.4 ML Classifier Results

To identify determinants and predict diarrhea disease among under-five children, we considered the supervised machine learning algorithms Decision Tree (DT), Gradient Boosting (GB), K-Nearest Neighbor, Logistic Regression (LR), and Random Forest (RF). The data set was divided into 80% for training and 20% for testing. Python version 3.9 was used for the analysis.
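An 80/20 split of this kind can be sketched as a seeded shuffle followed by a cut. The `split_80_20` helper below is a hand-written illustration, not the library routine the authors used, and the seed value is an arbitrary assumption.

```python
import random

def split_80_20(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold out test_fraction of them."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical record identifiers standing in for the real rows.
train_rows, test_rows = split_80_20(list(range(100)))
```

Fixing the seed makes the partition reproducible, so that every classifier is evaluated on the same held-out records.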

The study compared five supervised machine learning classification methods to identify the best-performing one. The Random Forest, Decision Tree (DT), Gradient Boosting (GB), K-Nearest Neighbor, and Logistic Regression experiments were run with the same testing parameters. Model performance was evaluated using accuracy, AUC, precision, recall, and F-measure, and because RF performed best overall, it was chosen as the top machine learning algorithm. The outcomes are displayed in Fig. 4, Fig. 5, and Table 3. With an accuracy of 86.5%, precision of 89%, F-measure of 86%, AUC of 92.7%, and recall of 82%, random forest is the best classifier in this study.

Furthermore, the random forest had a high true positive rate of 84%, a false positive rate of 10.6%, a true negative rate of 89.3%, and a false negative rate of 16%. The overall error rate was 13.5%, and the AUC was high, at 92%, as shown in Fig. 6.
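The AUC can be read as the probability that a randomly chosen child with diarrhea receives a higher predicted score than a randomly chosen child without it. The sketch below computes it that way by comparing every positive-negative pair; the `auc` helper and the six toy scores are invented for illustration and are not the study's predictions.

```python
def auc(y_true, scores):
    """AUC as the share of positive-negative pairs ranked correctly;
    tied scores count as half a correct pair."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities for six children.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
area = auc(y_true, scores)  # 8 of the 9 pairs are ordered correctly
```

The pairwise loop is quadratic in the number of records, so it suits small examples; rank-based formulas give the same value more efficiently on large data.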

Table 3 Accuracy, Precision, Recall and F-measure for the machine learning algorithms

3.5 Important Rules that can be Generated from the Predictive Model

In this study, we generated important rules from the selected model (gradient boosting with all selected determinants) for the status of diarrhea disease among under-five children in East Africa. As shown in Fig. 6, the model used max-depth = 2 and random state = 0. The rules listed below have been verified by experts in the field employed by University of Gondar referral hospitals. We suggest that these rules are essential for formulating strategies and policies to prevent or manage diarrheal illness among children under five in East Africa.

Rule 1

IF children's age == '7–12 months' AND country == 'Kenya' AND wealth status == 'poor' AND mother's educational status == 'no education' AND mother's age == '15–24' AND source of drinking water == 'improved' AND number of under-five children == 'greater than two' AND immunization status == 'partially' AND media exposure == 'no' AND timing of breastfeeding == '0 to 1 h' AND mother's working status == 'not working' AND type of toilet == 'unimproved' AND twin status == 'no' THEN diarrhea disease status == 'Yes'.

Rule 2

IF children's age == '25–59 months' AND country == 'Mozambique' AND wealth status == 'middle' AND mother's educational status == 'primary education' AND mother's age == '35–49' AND source of drinking water == 'improved' AND number of under-five children == 'one' AND immunization status == 'fully' AND media exposure == 'yes' AND timing of breastfeeding == '0 to 1 h' AND mother's working status == 'working' AND type of toilet == 'improved' AND twin status == 'no' THEN diarrhea disease status == 'No'.
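Rules of this form can be checked mechanically against a child's record. The sketch below encodes the two rules as dictionaries and tests whether every condition matches; the key names and the `matches` and `apply_rules` helpers are hypothetical, while the condition values are taken from the rules above.

```python
RULE_1 = {  # predicts diarrhea disease status 'Yes'
    "child_age": "7-12 months", "country": "Kenya", "wealth": "poor",
    "mother_education": "no education", "mother_age": "15-24",
    "water_source": "improved", "under_five_children": "greater than two",
    "immunization": "partially", "media_exposure": "no",
    "breastfeeding_timing": "0 to 1 h", "mother_working": "not working",
    "toilet": "unimproved", "twin": "no",
}
RULE_2 = {  # predicts diarrhea disease status 'No'
    "child_age": "25-59 months", "country": "Mozambique", "wealth": "middle",
    "mother_education": "primary education", "mother_age": "35-49",
    "water_source": "improved", "under_five_children": "one",
    "immunization": "fully", "media_exposure": "yes",
    "breastfeeding_timing": "0 to 1 h", "mother_working": "working",
    "toilet": "improved", "twin": "no",
}

def matches(rule, record):
    """True when the record satisfies every condition of the rule."""
    return all(record.get(key) == value for key, value in rule.items())

def apply_rules(record):
    """Return 'Yes' or 'No' if a rule fires, otherwise None."""
    if matches(RULE_1, record):
        return "Yes"
    if matches(RULE_2, record):
        return "No"
    return None

outcome = apply_rules(dict(RULE_1))
```

A record that differs from a rule in even one condition matches neither rule, so two rules of this length cover only a narrow slice of possible records.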

4 Discussion

This study used classification machine learning methods to compare models and to identify specific risk factors related to diarrhea disease among under-five children in East African countries that can be used as intervention targets. Among the classifiers compared (RF, DT, GB, KNN, and logistic regression), random forest was the best, with an accuracy of 86.5%, precision of 89%, F-measure of 86%, AUC of 92%, and recall of 82%. Our results compare favorably with those of a study in Uganda, which indicated that gradient boosting was highly effective for predicting diarrhea disease, with an accuracy of 70% [32]. A possible reason for the difference is that the Uganda study used only one DHS dataset, whereas this study used 12 DHS datasets, and larger datasets can improve the performance of a machine learning model [33]. A study conducted in Zimbabwe also predicted diarrhea disease and found that logistic regression was the best model, with a prediction accuracy of 85% [28], which is comparable to our result. To our knowledge, no previous studies have demonstrated the benefits of machine learning for predicting diarrhea disease in East Africa. According to the feature importance ranking of the Gradient Boosting classification model, child's age, country, wealth status, mother's educational status, mother's age, source of drinking water, number of under-five children, immunization status, media exposure, timing of breastfeeding, mother's working status, type of toilet, and twin status were the critical predictors of diarrhea disease in East Africa.

Some of these variables had already proven to be predictors of diarrhea disease in previously published studies. The most important predictor of diarrhea disease in East Africa was the child's age.

The prevalence of diarrheal disease was higher in children aged 7–12 and 13–23 months than in children aged 0–6 and 24–59 months. This finding is in agreement with other studies [2, 34]. One possible reason is that at this age the child's immune system is weak and complementary feeding, such as bottle feeding, begins; because such feeding can involve hygiene problems, children are more easily affected by viruses, bacteria, and parasites.

We found that children who used unimproved toilets were more likely to have diarrhea disease than those who used improved toilets. This could be due to poor hygiene or the lack of hand washing with soap and water before eating. This finding is in agreement with what was found in Ethiopia [35, 36] and Senegal [16].

Inadequate feeding may also make children more vulnerable to diarrhea, which remains a major health problem in low- and middle-income countries. The findings are supported by similar findings [37,38,39]. Diarrhea disease was significantly more common in children whose mothers or caregivers had no media exposure than in those whose caregivers had media exposure. Children with no media exposure had a 15% higher chance of having diarrhea disease. These findings are supported by similar findings in Bangladesh [40]. A possible explanation is that mothers who have been exposed to mass media are more aware of diarrheal illness and its simple treatment.

5 Conclusion and Recommendation

Machine learning approaches can be used to uncover hidden information that is indiscernible using conventional statistical tools. The findings of the final experiment showed that the random forest model was the most accurate at predicting diarrhea disease among children under five. The important determinants selected by Gradient Boosting were child's age, country, wealth status, mother's educational status, mother's age, source of drinking water, number of under-five children, immunization status, media exposure, timing of breastfeeding, mother's working status, type of toilet, and twin status. Policymakers should consider these findings and develop a plan for decreasing child mortality from diarrhea disease in East African nations based on the variables found to be significant. Despite the promising outcome, more work is needed using different approaches and different parameters. It is also advised that mothers be made fully aware of sanitation and of how to feed their children. Furthermore, mothers who lack education should be made aware of the nature of diarrhea in children.

5.1 Strength and Limitations

The main strengths of this study are its use of the most recent DHS datasets, integrated from the DHS databases of 12 countries that share the same data collection tools, and its very large sample size. Nevertheless, because DHS data are self-reported, some information bias may have been introduced, which is a limitation of this study.

Data Availability

The datasets used in this analysis are publicly available in the DHS Program repository. After we submitted the research question, we were granted permission to access the data via the measure DHS program online request form on the website (https://dhsprogram.com/).

Abbreviations

DHS: Demographic and Health Surveys
DT: Decision Trees
RF: Random Forests
ML: Machine Learning
GB: Gradient Boosting
LR: Logistic Regression

References

Kefale B, Bedada D, Negash Y, Gobebo G. Determinants of diarrhea among children under age five using generalized linear model with Bayesian approach: the case of Kuyu General Hospital, Oromia Region, Ethiopia. Clinics Mother Child Health S. 2021;11.
Fenta SM, Nigussie TZ. Factors associated with childhood diarrheal in Ethiopia; a multilevel analysis. Archives Public Health. 2021;79(1):1–12.
Demissie GD, Yeshaw Y, Aleminew W, Akalu Y. Diarrhea and associated factors among under five children in sub-saharan Africa: evidence from demographic and health surveys of 34 sub-saharan countries. PLoS ONE. 2021;16(9):e0257522.
Radlović N, Leković Z, Vuletić B, Radlović V, Simić D. Acute diarrhea in children. Srp Arh Celok Lek. 2015;143(11–12):755–62.
Feleke DG, Chanie ES, Admasu FT, Bahir S, Amare AT, Abate HK. Two-week prevalence of acute diarrhea and associated factors among under five years’ children in Simada Woreda, South Gondar Zone, Northwest Ethiopia, 2021: a multi-central community based cross-sectional study. Pan Afr Med J. 2022;42.
Guillaume DA, Justus OO, Ephantus KW. Factors influencing diarrheal prevalence among children under five years in Mathare Informal Settlement, Nairobi, Kenya. J Public Health Afr. 2020;11(1).
Nantege R, Kajoba D, Ddamulira C, Ndoboli F, Ndungutse D. Prevalence and factors associated with diarrheal diseases among children below five years in selected slum settlements in Entebbe municipality, Wakiso district, Uganda. BMC Pediatr. 2022;22(1):1–8.
Siziya S, Muula AS, Rudatsikira E. Correlates of diarrhoea among children below the age of 5 years in Sudan. Afr Health Sci. 2013;13(2):376–83.
Claudine U, Kim JY, Kim E-M, Yong T-S. Association between sociodemographic factors and diarrhea in children under 5 years in Rwanda. Korean J Parasitol. 2021;59(1):61.
Moon J, Choi JW, Oh J, Kim K. Risk factors of diarrhea of children under five in Malawi: based on Malawi demographic and Health Survey 2015–2016. J Global Health Sci. 2019;1(2).
Habtu M, Nsabimana J, Mureithi C. Factors contributing to diarrheal diseases among children less than five years in Nyarugenge District, Rwanda. J Trop Dis. 2017;5(2):238.
Bbaale E. Determinants of diarrhoea and acute respiratory infection among under-fives in Uganda. Australasian Med J. 2011;4(7):400.
Kimani HM. Assessement of diarrhoeal disease attributable to water, sanitation and hygiene among under five in Kasarani, Nairobi County. Department of Community Health, School of Public Health, Kenyatta University; 2013.
Toole MJ, Waldman RJ. Prevention of excess mortality in refugee and displaced populations in developing countries. JAMA. 1990;263(24):3296–302.
Ssenyonga R, Muwonge R, Twebaze F, Mutyabule R. Determinants of acute diarrhoea in children aged 0–5 years in Uganda. East Afr Med J. 2009;86(11):513–9.
Thiam S, Diène AN, Fuhrimann S, Winkler MS, Sy I, Ndione JA, et al. Prevalence of diarrhoea and risk factors among children under five years old in Mbour, Senegal: a cross-sectional study. Infect Dis Poverty. 2017;6(04):43–54.
Solomon ET, Gari SR, Kloos H, Mengistie B. Diarrheal morbidity and predisposing factors among children under 5 years of age in rural East Ethiopia. Trop Med Health. 2020;48(1):1–10.
Anteneh ZA, Andargie K, Tarekegn M. Prevalence and determinants of acute diarrhea among children younger than five years old in Jabithennan District, Northwest Ethiopia, 2014. BMC Public Health. 2017;17(1):1–8.
Deogratias A-P, Mushi MF, Paterno L, Tappe D, Seni J, Kabymera R, et al. Prevalence and determinants of Campylobacter infection among under five children with acute watery diarrhea in Mwanza, North Tanzania. Archives Public Health. 2014;72:1–6.
Mutama R, Mokaya D, Wakibia J. Risk factors associated with diarrhea disease among children under-five years of age in Kawangware slum in Nairobi County, Kenya. Food Public Health. 2019;9(1):1–6.
Brownlee J. Data preparation for machine learning: data cleaning, feature selection, and data transforms in Python. Machine Learning Mastery; 2020.
Crone SF, Lessmann S, Stahlbock R. The impact of preprocessing on data mining: an evaluation of classifier sensitivity in direct marketing. Eur J Oper Res. 2006;173(3):781–800.
Emmanuel T, Maupong T, Mpoeleng D, Semong T, Mphago B, Tabona O. A survey on missing data in machine learning. J Big Data. 2021;8(1):1–37.
Dhal P, Azad C. A comprehensive survey on feature selection in the various fields of machine learning. Appl Intell. 2022;52(4):4543–81.
Tesfaye SH, Seboka BT, Sisay D. Application of machine learning methods for predicting childhood anaemia: Analysis of Ethiopian Demographic Health Survey of 2016. PLoS ONE. 2024;19(4):e0300172.
Liu H, Hussain F, Tan CL, Dash M. Discretization: an enabling technique. Data Min Knowl Disc. 2002;6:393–423.
Nguyen GH, Bouzerdoum A, Phung SL. Learning pattern classification tasks with imbalanced data sets. Pattern recognition. 2009(10).
Mbunge E, Chemhaka G, Batani J, Gurajena C, Dzinamarira T, Musuka G, et al. Predicting Diarrhoea Among Children Under Five Years Using Machine Learning Techniques. Computer Science On-line Conference; 2022: Springer.
John GH. Robust decision trees: removing outliers from databases. KDD; 1995.
Shaikhina T, Lowe D, Daga S, Briggs D, Higgins R, Khovanova N. Decision tree and random forest models for outcome prediction in antibody incompatible kidney transplantation. Biomed Signal Process Control. 2019;52:456–62.
Bradley AP. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recogn. 1997;30(7):1145–59.
Kananura RM. Machine learning predictive modelling for identification of predictors of acute respiratory infection and diarrhoea in Uganda’s rural and urban settings. PLOS Global Public Health. 2022;2(5):e0000430.
Al-Jarrah OY, Yoo PD, Muhaidat S, Karagiannidis GK, Taha K. Efficient machine learning for big data: a review. Big Data Res. 2015;2(3):87–93.
Alemayehu K, Oljira L, Demena M, Birhanu A, Workineh D. Prevalence and determinants of diarrheal diseases among under-five children in Horo Guduru Wollega Zone, Oromia Region, Western Ethiopia: a community-based cross-sectional study. Can J Infect Dis Med Microbiol. 2021;2021:1–9.
Workie GY, Akalu TY, Baraki AG. Environmental factors affecting childhood diarrheal disease among under-five children in Jamma district, South Wello Zone, Northeast Ethiopia. BMC Infect Dis. 2019;19:1–7.
Melese B, Paulos W, Astawesegn FH, Gelgelu TB. Prevalence of diarrheal diseases and associated factors among under-five children in Dale District, Sidama Zone, Southern Ethiopia: a cross-sectional study. BMC Public Health. 2019;19(1):1–10.
Paul P. Socio-demographic and environmental factors associated with diarrhoeal disease among children under five in India. BMC Public Health. 2020;20(1):1–11.
Dagnew AB, Tewabe T, Miskir Y, Eshetu T, Kefelegn W, Zerihun K, et al. Prevalence of diarrhea and associated factors among under-five children in Bahir Dar City, Northwest Ethiopia, 2016: a cross-sectional study. BMC Infect Dis. 2019;19:1–7.
Shine S, Muhamud S, Adanew S, Demelash A, Abate M. Prevalence and associated factors of diarrhea among under-five children in Debre Berhan town, Ethiopia 2018: a cross sectional study. BMC Infect Dis. 2020;20:1–6.
Alam Z, Higuchi M, Sarker MAB, Hamajima N. Mass media exposure and childhood diarrhea: a secondary analysis of the 2011 Bangladesh demographic and health survey. Nagoya J Med Sci. 2019;81(1):31.


Acknowledgements

The authors express their gratitude to MEASURE DHS for granting access to the datasets used in this study.

Funding

The authors declare that no funding was received for this study.

Author information

Authors and Affiliations

Department of Health Informatics, Institute of Public Health, College of Medicine and Health Sciences, University of Gondar, Gondar, Ethiopia
Tirualem Zeleke Yehuala, Makda Fekadie Tewelgne & Sisay Maru Wubante
Department of Epidemiology and Biostatistics, Institute of Public Health, College of Medicine and Health Sciences, University of Gondar, Gondar, Ethiopia
Nebiyu Mekonnen Derseh

Authors

Tirualem Zeleke Yehuala
Nebiyu Mekonnen Derseh
Makda Fekadie Tewelgne
Sisay Maru Wubante

Contributions

TZ and MF conceptualized the study and wrote the first draft of the manuscript. NM and SM made significant contributions to the design, data collection, supervision, curation, analysis, and interpretation. All authors contributed equally to the concept of the proposal and to the validation and revision of the article. TZ created Figs. 1, 2, 3, 4, 5 and 6 and carried out the data analysis, visualization, and interpretation. All authors read, assessed, and approved the final manuscript.

Corresponding author

Correspondence to Tirualem Zeleke Yehuala.

Ethics declarations

Ethical Approval

Ethics clearance was not necessary because this study was based on publicly available secondary data. We obtained the dataset from the DHS website (https://dhsprogram.com/) through an online request.

Consent for publication

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Yehuala, T.Z., Derseh, N.M., Tewelgne, M.F. et al. Exploring Machine Learning Algorithms to Predict Diarrhea Disease and Identify its Determinants among Under-Five Years Children in East Africa. J Epidemiol Glob Health (2024). https://doi.org/10.1007/s44197-024-00259-9


Received: 14 February 2024
Accepted: 01 June 2024
Published: 29 July 2024
DOI: https://doi.org/10.1007/s44197-024-00259-9
