Performance of machine learning methods in predicting water quality index based on irregular data set: application on Illizi region (Algerian southeast)

Kouadri, Saber; Elbeltagi, Ahmed; Islam, Abu Reza Md. Towfiqul; Kateb, Samir

doi:10.1007/s13201-021-01528-9


Original Article
Open access
Published: 06 November 2021

Volume 11, article number 190, (2021)

Applied Water Science


Saber Kouadri ORCID: orcid.org/0000-0002-7079-8404¹,
Ahmed Elbeltagi²,
Abu Reza Md. Towfiqul Islam³ &
…
Samir Kateb⁴


Abstract

Groundwater quality appraisal is one of the most crucial tasks for ensuring safe drinking water sources. Computing a water quality index (WQI) requires several water quality parameters. Conventionally, WQI computation is time-consuming and prone to errors during subindex calculation. To address this, eight artificial intelligence algorithms, namely multilinear regression (MLR), random forest (RF), M5P tree (M5P), random subspace (RSS), additive regression (AR), artificial neural network (ANN), support vector regression (SVR), and locally weighted linear regression (LWLR), were employed to predict WQI in the Illizi region, southeast Algeria. Using best subset regression, 12 different input combinations were developed, and the work was organized around two scenarios. The first scenario aims to reduce the time consumed in WQI computation, with all parameters used as inputs. The second scenario intends to show the water quality variation in critical cases when the necessary analyses are unavailable, with the inputs reduced based on sensitivity analysis. The models were appraised using several statistical metrics, including correlation coefficient (R), mean absolute error (MAE), root mean square error (RMSE), relative absolute error (RAE), and root relative square error (RRSE). The results reveal that TDS and TH are the key drivers influencing WQI in the study area. The comparison of performance metrics shows that the MLR model has the highest accuracy in the first scenario, with 1, 1.4572 × 10⁻⁸, 2.1418 × 10⁻⁸, 1.2573 × 10⁻¹⁰%, and 3.1708 × 10⁻⁸% for R, MAE, RMSE, RAE, and RRSE, respectively. In the second scenario, the RF model achieved the lowest error, with 0.9984, 1.9942, 3.2488, 4.693, and 5.9642 for R, MAE, RMSE, RAE, and RRSE, respectively.
The outcomes of this paper would be of interest to water planners in terms of WQI for improving sustainable management plans of groundwater resources.


Introduction

Groundwater quality assessment and monitoring is a crucial task for the sustainable and optimal management of groundwater resources (Egbueri 2020; Kawo and Karuppannan 2018; Li et al. 2018; Islam et al. 2020a). The continuous growth of the population is directly associated with growing demand for clean water (Dos Santos et al. 2017; Islam et al. 2017; Rahman et al. 2020). This demand encourages researchers to develop new models for the prediction of water quality (Uddin et al. 2021). As a key element of the water cycle and a drinking water resource, groundwater is under huge pressure worldwide (Ahmed et al. 2019; Saha et al. 2020). Thus, appraising water quality has become an urgent concern in recent times. Horton (1965) developed the first water quality index (WQI) in order to transform the several parameters describing water into a single number that expresses the overall water quality. Since then, several indices have been developed (Hossain and Patra 2020; Mukate et al. 2019; Islam et al. 2020b). The parameters involved in the calculation of the WQI have to be chosen carefully in order to get meaningful results (Abbasi & Abbasi 2012). Various WQIs have been adopted by many researchers to assess the drinking suitability of groundwater and the quality of river water (Islam et al. 2017; 2019; Kabir et al. 2021). However, the deterioration of water quality can be caused by many factors, e.g., inadequate sanitation, pollutants derived from industries, excessive use of fertilizer in agricultural practices, climate change, and poor groundwater management plans (Loecke et al. 2017; Alam et al. 2007; Trevett et al. 2005; Islam et al. 2018). On the other hand, water quality appraisal involves issues such as sample collection at an enormous scale, laboratory testing, and data manipulation, which are mostly time-consuming and expensive in terms of equipment, chemicals, reagents, and human capital (Tiyasha et al. 2020).
Besides, the subindex calculation is a time-consuming process. Ongley (2000) found that water quality appraisal using traditional methods causes economic losses, which affect policy-making for groundwater quality management plans. In addition, the recent COVID-19 pandemic left laboratories short of the chemical reagents used for water analysis after a marked reduction in the quantities of imported goods in several countries. Thus, to overcome these circumstances, a promising and cost-effective tool for rapid and precise water quality appraisal is needed. In such a case, artificial intelligence (AI) is an alternative option for generating models during the pandemic period that help predict the overall quality of groundwater from the results of analyses that do not need expensive reagents or highly developed measurement instruments.

The AI technique is a potential and robust multifunctioning tool in water-science-related fields (Babbar and Babbar 2017; Kisi et al. 2018; Kim et al., 2019; Bui et al. 2020; Abba et al. 2020; Hayder et al. 2021; Singha et al. 2021; Bilali et al. 2021). Several research scholars have employed AI techniques worldwide including random forest (RF), support vector machine (SVM), and artificial neural network (ANN) in different water-related studies. The RF model was applied for the groundwater quality prediction (Singha et al. 2021), flood susceptibility study (Towfiqul Islam et al. 2021), river water quality prediction (Asadollah et al. 2021), and so on. Likewise, the SVM model was adopted for predicting marine water quality (Deng et al. 2021) and wastewater treatment plant monitoring (Nourani et al. 2018), with different precision levels. ANN-based prediction models have been extensively used in different fields including heavy metal pollution prediction (Singha et al., 2020), wetland vulnerability (Islam et al. 2021), and water level forecasting (Zhu et al. 2020).

Apart from these cited works, many studies have been performed for the prediction of WQI by appraising the performance of various AI models. For example, Gazzaz et al. (2012) adopted the ANN method to forecast river water quality and got a precision level of more than 90% (R2). Wang et al. (2017) applied a swarm optimization-based support vector regression model to predict WQI. A study performed by Ahmed et al. (2019) implemented 15 AI algorithms for the prediction of WQI, where the regression model and classification model outperformed the other models. Bui et al. (2020) found the better predictive performance of hybrid AI models over the conventional models for predicting WQI with 4 conventional and 12 hybrid AI techniques. Recently, Singha et al. (2021) applied deep learning for predicting WQI with 3 traditional models and found that the deep learning model is a more robust and accurate tool than the traditional model in the prediction of groundwater quality. Valentini et al. (2021) introduced a new WQI equation for Mirim Lagoon and evaluated its suitability based on 154 samples collected over three years at seven sampling points in Mirim Lagoon. For forecasting monthly WQI values at the Lam Tsuen River in Hong Kong, Asadollah et al. (2021) proposed a new ensemble machine learning algorithm called extra tree regression (ETR). The efficiency of the ETR model is comparable to that of traditional standalone models such as support vector regression (SVR) and decision tree regression (DTR) (Asadollah et al. 2021). Based on parameters such as pH, dissolved oxygen, conductivity, turbidity, fecal coliform, and temperature, Hu et al. (2021) investigated the classification of water quality using machine learning algorithms such as decision tree (DT), k-nearest neighbor (KNN), logistic regression (LogR), multilayer perceptron (MLP), and Naive Bayes (NB) and found that the DT algorithm outperformed other models with a classification accuracy of 99%.

From the aforementioned literature review, it is clear that different AI models have been applied under various hydrogeological conditions with different accuracy levels. In this context, additive regression (AR), M5P tree (M5P), random subspace (RSS), multilinear regression (MLR), and locally weighted linear regression (LWLR) were applied in our research to improve the reliability of water quality appraisal, since these AI models are scarcely used in hydrology for the prediction of groundwater quality.

Besides, after thoroughly reviewing earlier literature, to the best of the authors' knowledge, no previous studies have tested and verified the performance of the above-mentioned AI models for the prediction of groundwater quality. Thus, to close this gap, the current study used 8 ML-based WQI prediction models in the Illizi region of southeast Algeria. Groundwater is a vital source for human use and consumption in the study area, and groundwater quality is mainly affected by human-induced pollution; hence, a thorough systematic appraisal of groundwater quality is necessary for this region. Additionally, no such scientific investigation has been done in the current study region. WQI prediction using 8 ML techniques is a more robust approach than appraising it with any standalone tool. To achieve this aim, this study developed two scenarios. The first scenario uses the 8 models to predict the WQI with all the analyzed parameters as input variables, in order to reduce the time consumed by the calculations. The second scenario reduces the number of inputs based on sensitivity analysis and selects the main parameters controlling water quality, in order to predict the WQI in the critical case.

Materials and methods

Study area

General setting

With an area of 284,618 km², Illizi county is the third largest wilayah by area. It is located in the extreme southeast of Algeria and shares a 1,233 km border with three countries: Tunisia and Libya to the east and Niger to the south. Ouargla county and Tamanrasset county border it to the north and the west, respectively (Kouadri and Samir 2021). Although the study area is very large, the climate is homogeneous, with a very long, hot summer and a very short, warm winter. The rains are extremely irregular. June is the hottest month of the year, while January is the coldest. Winds are generally light to moderate. Figure 1 presents the study area location.

Hydrogeological settings

According to the authority of agricultural production in Saharian regions (CDARS), the hydrogeology of the Illizi area is distinguished by many aquifer deposits. The region has a large surface area, from which we can discern many aquifer horizons, such as Tassili's Cambro-Ordovician. Sandstone formations are traversed by a pattern of cracking and faults in addition to having a very low porosity. Tassili sandstones have a strong permeability due to these characteristics, which promote water circulation. The Devonian aquifer is located in Illizi and its surroundings, especially in the north, by exploitation from 250 to 1450 m in the Illizi and El Adeb Larach regions, respectively. The static level in regard to the land differs from one place to the next. In the high regions, it ranges from a few centimeters to a few meters; however, water is springing north and east of Illizi. The Carboniferous: This aquifer is extracted in the In Aménas area at various depths from 800 to 1100 m. The water drained by "lifting" is only used to keep the oil slicks under pressure and for irrigation; the static amount ranges between 200 and 300 m (Peterson 1985; Boudjema 1987; van de Weerd and Ware 1994; Kouadri and Kateb 2021).

The Continental Intercalaire (CI) aquifer system: It is found in the stratigraphic interval between the Triassic and the Albian summit. The Barremian and Albian, which are sandstone and sandy–clayey Lower Cretaceous continental deposits, form the majority of the aquifer layers. It drains the Triassic and Jurassic sandstone and clay–sandstone deposits in the Stah and In Aménas regions (where the CI is known as the Zaraitine and Taouratin Series), from Barremian and Albien to Deb Deb and Albien to BOD and Rhourd Nouss (Boudjema, 1987; Kouadri and Kateb 2021).

Medium-depth (400–500 m) boreholes at T.F.T, Ohanet, and B.O.D tap the aquifer. At Rhourd Nouss and the north of Deb Deb, the boreholes are comparatively deep (800–1200 m). The aquifer's waters are artesian (gushing) at Rhourd Nouss, Bordj Omar Driss, Tabankort, Maouar, Zemelet Mederba, and the north of Deb Deb; they are exploited by pumping at differing depths (from a few meters to 300 m) at Tinfouyé, Ohanet, the south of Deb Deb, and Stah; the useful reservoir thickness exceeds 250 m. Static pressure readings show that pressures can exceed 18 bars (e.g., Rhourd Nouss, gushing water). The Mio-Pliocene aquifer is made up of reworked sands and clays and stretches from the far northwest of the wilayah to the far northeast. It is exploited by boreholes 160–300 m deep in the Rhourd Nouss and El Hamra areas. The water is pumped out at depths ranging from 80 to 100 m. Oued Djanet's Infero-Flux (Alluvial): The alluvial aquifer of Wadi Djanet was the region's largest and only water supply until the Cambro-Ordovician aquifer was discovered. It is a shallow aquifer spanning 17 km² of heterogeneous alluvium ranging from silty sand to small pebbles, resting on a twenty-kilometer stretch. Currently, 24 boreholes (including 1 well) tap it, of which 9 boreholes and 1 well are in operation. The water in this aquifer is of good quality, with dry residue ranging from 146 to 340 mg/l (Boudjema 1987; Montgomery 1993; Kouadri and Kateb 2021).

Geological settings

According to the National Organization of Hydrographic Network (A.N.R.H), the city of Illizi is situated on a plateau consisting of Lower Devonian and Emsian clay-sandstone deposits, as well as Quaternary deposits. To the north, the undifferentiated Middle to Upper Devonian layers outcrop for around 12 km, before being overlain much further north by Upper Devonian to Carboniferous layers formed primarily by the Khenig sandstone (upper Famennian to Tournaisian), with average elevations of 550–650 m and peaks exceeding 700 m. This difference in elevation creates a landscape of canyons that favors river drainage and flow acceleration (Kouadri and Kateb 2021).

A plain landscape stretches from the northeast to the side of Tin-Tourha, east to the field of Halloufa, and south to the side of Gara Souf Mellene, passing through Adjnadjane to the Gara Tan Harab. This plain, which has an 8-km radius, is mostly made up of post-Mesozoic (Quaternary) formations with an altimetry of 560–570 m. The lower Devonian formations, known as the Oued Samène formations, are located in the south and beyond 8 km (Siegenien). Their elevations are in excess of 700 m. With frank deformations and large fractures, these formations form a tectonic domain. Less significant faults run east–west as well.

In a strict sense, the geology of the state of Illizi is divided into two broad units: the crystalline basement and the sedimentary cover, which are lithostratigraphically distinct.

Data collection

This work relies on the water analysis results provided by the Directorate of Water Resources (DRE) of the State of Illizi. The data set consists of the analysis results for 114 samples taken from 57 exploited wells in 6 different layers between 1999 and 2020. Each sample was analyzed for physical parameters (TDS, EC, and temperature in °C), chemical parameters (pH, the cations Ca²⁺, Mg²⁺, Na⁺, and K⁺, and the anions Cl⁻, HCO₃⁻, and SO₄²⁻), and NO₃⁻ as a pollution indicator. Applying the models to this type of data is a new challenge, since their efficiency and performance are tested on an irregular data set.

Calculation of water quality index (WQI)

WQI is one of the most widely used tools for determining the quality of water and its suitability for human use (El Baba et al. 2020; Reyes-Toscano et al. 2020; Zhang et al. 2020; Maskooni et al. 2020; Bahir et al. 2020). The WQI is estimated in the following steps. First, a weight (wi) ranging from 1 to 5 is assigned to each parameter, based on its significance and impact on drinking water and human health. Mineralization, SO₄²⁻, Cl⁻, and NO₃⁻ are given the highest weight of “5” due to their direct impact on water quality and human health (Seifi et al. 2020). The bicarbonates HCO₃⁻, on the other hand, are given the minimum weight of “1”. The assigned weights, relative weights, and the limits required by WHO are shown in Table 1. The relative weight of each parameter is then computed using Eq. (1).

$$\mathrm{Wi}=\frac{\mathrm{wi}}{\sum_{\mathrm{i}=1}^{\mathrm{n}}\mathrm{wi}}$$

(1)

where “Wi” is the relative weight, “wi” is the weight assigned to each parameter, and “n” is the number of parameters.

Table 1 Assigned weights and relative weights of physicochemical parameters


Then, a quality rating scale (qi) for each parameter is calculated based on Eq. (2).

$$\mathrm{qi}=(\mathrm{Ci}/\mathrm{Si})\times 100$$

(2)

where “qi” is the quality rating, “Ci” is the concentration of the parameter in the water sample (mg/L), and “Si” is the WHO drinking water quality standard (mg/L).

Furthermore, a subindex of the ith parameter is calculated using Eq. (3).

$$\mathrm{SIi}=\mathrm{qi}\times \mathrm{Wi}$$

(3)

where

“SIi” is the subindex rating.

“qi” is the quality rating.

“Wi” is the relative weight.

Finally, the water quality index is calculated as follows:

$$\mathrm{WQI}=\sum \mathrm{SIi}$$

(4)
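Equations (1) to (4) can be traced in a few lines of code. The sketch below is a minimal illustration, not the authors' implementation: the function name `compute_wqi` is hypothetical, and the three-parameter example uses assumed weights and limits chosen for readability rather than the contents of Table 1.

```python
def compute_wqi(concentrations, weights, standards):
    """Weighted arithmetic WQI following Eqs. (1)-(4).

    All three arguments are dicts keyed by parameter name.
    """
    total_weight = sum(weights.values())
    wqi = 0.0
    for name, c_i in concentrations.items():
        relative_weight = weights[name] / total_weight      # Eq. (1): Wi
        quality_rating = (c_i / standards[name]) * 100.0    # Eq. (2): qi
        sub_index = relative_weight * quality_rating        # Eq. (3): SIi
        wqi += sub_index                                    # Eq. (4): sum of SIi
    return wqi


# Hypothetical sample: all values are illustrative, not taken from Table 1.
weights = {"TDS": 5, "Cl": 5, "HCO3": 1}
standards = {"TDS": 1000.0, "Cl": 250.0, "HCO3": 500.0}   # assumed limits, mg/L
sample = {"TDS": 500.0, "Cl": 125.0, "HCO3": 250.0}       # mg/L

wqi_value = compute_wqi(sample, weights, standards)
```

Because every concentration in this example sits at half of its assumed limit, each quality rating is 50, so the index is 50 regardless of the weights (up to floating-point rounding).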

Artificial intelligence models

In this study, ANN, MLR, SVR, M5P tree, RF, LWLR, RSS, and AR models were proposed for estimating the WQI of Illizi groundwater. The data set was partitioned into two parts: 70% of the data were used for the calibration phase and the remaining 30% for verification. Selecting the dominant input parameters is one of the important steps in any AI-based modeling. MATLAB (R2018b) was used for the ANN and MLR analyses, while the remaining models were developed using the Waikato Environment for Knowledge Analysis (WEKA, version 3.8.4).

Artificial neural network (ANN)

An artificial neural network (ANN) is a system whose dynamic functionality is inspired by the human nervous system. It was first introduced by McCulloch and Pitts (1943); the method builds a relationship between inputs and outputs through assigned weights, which play the role of a mathematical memory (Elbeltagi et al. 2020c).

As seen in Fig. 2, the ANN is made up of three groups of layers: the input layer receives the independent variables, the hidden layers are intermediate layers between the input and output layers where all the computations are performed, and the output layer returns the result for the given inputs (Babaee et al. 2021). The input layer's circles are denoted by the vector "i." The middle circles represent the hidden neuron layers; these circles are the "activation" nodes, and the connections between them carry the weights (Ws). The final circle represents the output layer, which gives the predicted value of the water quality index (Elbeltagi et al. 2020a, b, c, d).

To optimize the performance of the network, a training algorithm is needed, such as the feed-forward back-propagation algorithm. This algorithm minimizes the error rate by calculating the difference between the calculated and predicted values. Based on the size of the error, new weights are assigned in order to improve the predictions. The main factors affecting the performance of an ANN are the number of hidden neurons and the activation function (Kouadri et al. 2021; Elbeltagi et al. 2021a, b). To select the optimal number of hidden neurons, an iterative algorithm was used to plot the performance of the ANN model, in terms of MSE in the training and validation phases, against the number of hidden neurons. The optimal number of hidden neurons is the one that gives the lowest error in both the training and validation phases.

Multilinear regression (MLR)

Multilinear regression analysis is considered one of the simplest mathematical models. It is based on the linear relationships between inputs and outputs: it extracts the linear relationship between the dependent and independent variables by including a regression constant in the formula (Sihag et al. 2020). MLR is based on the equation below:

$$y={b}_{0}+{b}_{1}{x}_{1}+{b}_{2}{x}_{2}+\dots +{b}_{i}{x}_{i}$$

where $y$ is the dependent variable, ${b}_{0}$ is the regression constant, ${b}_{i}$ is the regression coefficient of the ith predictor, and ${x}_{i}$ is the ith predictor.
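To make the MLR equation concrete, the following sketch estimates the regression constant and coefficients by ordinary least squares, solving the normal equations with a small Gaussian elimination routine. It assumes ordinary least squares as the fitting criterion; the helper names `solve_linear_system` and `fit_mlr` and the synthetic data are hypothetical and are not taken from the study, which used MATLAB.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so that row operations act on both sides at once.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_mlr(rows, targets):
    """Return [b0, b1, ..., bp] that minimize the sum of squared errors."""
    design = [[1.0] + list(r) for r in rows]   # leading 1 carries b0
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic data generated from y = 2 + 3*x1 - 1*x2 (illustrative only).
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (5.0, 8.0)]
targets = [2 + 3 * x1 - x2 for x1, x2 in rows]
coefficients = fit_mlr(rows, targets)
```

Since the synthetic targets are exactly linear in the two predictors, the fitted coefficients recover the generating values 2, 3, and -1 up to rounding. A target that is an exact linear function of its inputs, as the WQI is when all parameters are supplied, is the situation in which a linear model can reach near-zero error.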

Support vector regression (SVR)

Initially, the support vector machine (SVM) was developed to identify the distribution pattern of data samples in order to classify them into categories and support decision making. The main idea of this method is to use a set of sample points as supports for drawing vectors that separate the various classes in the data. Whereas SVM was used to solve discrete (classification) problems, support vector regression (SVR) was created to deal with continuous ones. This method has many features that make it a frequent choice for solving linear and nonlinear regression problems (Elbeltagi et al. 2021a, b). Among its advantages is its reliance on the structural risk minimization (SRM) principle, which has shown greater effectiveness than traditional empirical risk minimization (ERM). SRM is characterized by its ability to reduce error, unlike some other methods, such as artificial neural networks, that reduce the error only on the training data; this gives SVR greater effectiveness in prediction problems. The SVR problem formulation is often best derived from a geometrical perspective, using the one-dimensional example in Fig. 3. The equation below represents the continuous-valued function that is being approximated (Awad & Khanna 2015).

$$y=f\left(x\right)=\langle w,x\rangle +b=\sum_{j=1}^{M}{w}_{j}{x}_{j}+b,\quad y,b\in {\mathbb{R}},\quad x,w\in {\mathbb{R}}^{M}$$

To simplify the mathematical notation for multidimensional data, augment x with a 1 and include b in the w vector to obtain the multivariate regression in the equation below:

$$f\left(x\right)={\left[\begin{array}{c}w\\ b\end{array}\right]}^{T}\left[\begin{array}{c}x\\ 1\end{array}\right]={w}^{T}x+b,\quad x,w\in {\mathbb{R}}^{M+1}$$

M5P tree

The M5P tree model was presented by Quinlan (1992). It is a tree-based learner for regression problems. The algorithm divides the overall problem into smaller problems by splitting the data, builds a multivariate linear model for each subproblem, and assigns linear regression functions to the terminal nodes. This method is characterized by its ability to deal with complex problems with many variables, provided that the target is continuous rather than made up of discrete classes (Adnan et al. 2021; Sihag et al. 2020; Singh et al. 2017).

Figure 4 presents an M5P tree architecture. The M5P tree chooses the splitting criterion for each node from the error computed at that node. The error is measured by the standard deviation of the values reaching the node, and the attribute that most reduces this error is selected by testing every attribute at the node. The standard deviation reduction (SDR) is calculated by the following equation:

$$SDR=sd\left(K\right)-\sum_{i}\frac{\left|{K}_{i}\right|}{\left|K\right|}sd\left({K}_{i}\right)$$

where $K$ is the set of instances that reach the node, ${K}_{i}$ is the subset of instances that have the ith outcome of the potential split, and $sd$ is the standard deviation.
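The SDR criterion can be checked numerically with a short sketch. It assumes the population standard deviation (`statistics.pstdev`); whether a given M5P implementation uses the population or the sample form is not stated in the text. The function name `sdr` and the node values are hypothetical.

```python
from statistics import pstdev


def sdr(parent, subsets):
    """Standard deviation reduction for one candidate split of a node."""
    weighted = sum(len(s) / len(parent) * pstdev(s) for s in subsets)
    return pstdev(parent) - weighted


# Hypothetical WQI values reaching a node (illustrative only).
node_values = [40.0, 42.0, 44.0, 90.0, 92.0, 94.0]

# A split that separates low from high values removes most of the spread.
good_split = sdr(node_values, [node_values[:3], node_values[3:]])

# A split that mixes low and high values in both branches removes little.
poor_split = sdr(node_values, [node_values[::2], node_values[1::2]])
```

The tree-growing step would evaluate every candidate split in this way and keep the one with the largest reduction, here the split that separates the two clusters.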

Random forest (RF)

The random forest method was first introduced by Breiman (2001). It is a machine learning technique that relies on an ensemble of decision trees, combining the outputs of the individual trees into a single prediction. Random forest regression depends on two parameters: first, the number of trees to be grown in the forest, denoted (k), and second, the number of variables considered at each node as the tree grows, denoted (m) (Bournas et al. 2003; Pham et al. 2017; Sihag et al. 2019). The architecture of the random forest model is presented in Fig. 5.

Locally weighted linear regression (LWLR)

LWLR is a multivariate smoothing technique for fitting a regression surface to data. The dependent variable is smoothed as a function of the independent variables in a moving fashion, similar to how a moving average is computed for a time series. The fundamental structure is as follows. Let x_i = (x_i1, …, x_ip), i = 1, …, n, be n measurements of p independent variables, and let y_i, i = 1, …, n, be measurements of the dependent variable. Assume that the data are generated by y_i = g(x_i) + ξ_i, where, as in the most commonly used regression framework, the ξ_i are independent normal variables with mean 0 and variance σ². In the ordinary setting, g is assumed to be a member of a parametric class of functions, such as polynomials; here, we assume only that g is a smooth function of the independent variables. With local fitting, we can approximate a wide class of smooth functions, much wider than we could reasonably expect from any one parametric class of functions (Cleveland and Devlin 1988; Kisi and Ozkan 2017).
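The idea of local fitting can be illustrated with a single predictor. The sketch below fits a weighted straight line around each query point; it assumes a Gaussian weighting kernel with a `bandwidth` parameter, which is a common choice but not necessarily the kernel used in the study. The function name `lwlr_predict` and the data are hypothetical.

```python
import math


def lwlr_predict(x_query, xs, ys, bandwidth=1.0):
    """Predict y at x_query from a weighted straight-line fit around it."""
    # Gaussian weights: points near x_query dominate the local fit.
    w = [math.exp(-((x - x_query) ** 2) / (2.0 * bandwidth ** 2)) for x in xs]
    sw = sum(w)
    mean_x = sum(wi * x for wi, x in zip(w, xs)) / sw
    mean_y = sum(wi * y for wi, y in zip(w, ys)) / sw
    # Weighted least-squares slope of the local line.
    sxx = sum(wi * (x - mean_x) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mean_x) * (y - mean_y) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx
    return mean_y + slope * (x_query - mean_x)


# Synthetic nonlinear data (illustrative only): y = x**2 on a regular grid.
xs = [i * 0.5 for i in range(11)]          # 0.0, 0.5, ..., 5.0
ys = [x ** 2 for x in xs]

local_estimate = lwlr_predict(2.5, xs, ys, bandwidth=0.5)
global_estimate = lwlr_predict(2.5, xs, ys, bandwidth=1e6)
```

With a narrow bandwidth the estimate at x = 2.5 stays close to the true value 6.25, whereas a very large bandwidth gives all points nearly equal weight and reproduces an ordinary global line, which misses the curvature. The bandwidth therefore controls how local the fit is.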

Random subspace (RSS)

Ho (1998) was the first to implement the RSS model, as a novel combined algorithm for solving real-world problems based on artificial intelligence. This model combines and trains multiple classifiers on modified feature spaces. It is trained on multiple training subsets generated for the classifiers (Ho 1998). The training set (x), the base classifier (w), and the number of subspaces (L) are the RSS inputs (Kuncheva and Plumpton 2010; Luo et al. 2019; García-Pedrajas and Ortiz-Boyer 2008; Lai et al. 2006; Wang et al. 2018, 2015). This technique is strongly recommended by Pham et al. (2017) to avoid over-fitting and to deal with highly redundant data sets. Figure 6 presents the architecture of an RSS model.

Additive regression (AR)

Hastie and Tibshirani (1986) introduced the generalized additive model (GAM). The GAM, an extension of the generalized linear model (GLM) (McCullagh and Nelder 1989), has several benefits over the latter. Unlike the GLM, which rests on an explicit assumption of linearity in the parameters, the GAM assumes no particular form of dependence, so the relationship need not be linear. Its theory is based on using a sum of nonlinear functions to model the response, which allows the effect of each explanatory variable to be modeled more specifically. This precision makes it a common technique for modeling the effects of environmental variables, since these effects are often nonlinear and difficult to specify parametrically (Peng and Dominici 2008; Bruneau and Grégoire 2011). The literature review by Jbilou and El Adlouni (2012) described the GAM as a powerful technique in environmental health studies for detecting nonlinear associations between an environmental explanatory variable and a health-related dependent variable. The equation used for this algorithm is written as:

$$g\left(E\left(y\right)\right)={\beta }_{0}+{f}_{1}\left({x}_{1}\right)+{f}_{2}\left({x}_{2}\right)+\dots +{f}_{p}\left({x}_{p}\right)+\varepsilon$$

(5)

The model is estimated with nonlinear smooth functions ${f}_{i}\left({x}_{i}\right)$, i = 1, …, p, one for each explanatory variable ${x}_{i}$.

Several candidate split attributes are evaluated using the standard deviation reduction (SDR) as the criterion for choosing the best attribute on which to partition the data set at each node. The selected attribute is the one expected to reduce the error the most.

$$SDR=SD\left(Tree\right)-\sum_{i}\frac{\left|{Tree}_{i}\right|}{\left|Tree\right|}\times SD\left({Tree}_{i}\right)$$

(6)

where ${Tree}_{i}$ denotes the subset of examples that have the ith outcome of the potential split, and SD() denotes the standard deviation of its argument. Splitting stops when the number of instances reaching a node falls below a given number or when the values change only slightly. The parameters of all models used for modeling the WQI are listed in Table 2.

Table 2 Parameters of machine learning algorithms used for WQI modeling


Sensitivity analysis

When there are several input variables, feature selection is one of the most important steps in developing a soft computing model to forecast and simulate engineering phenomena. There are many methods for determining the best possible input combinations, including best subset regression, mutual information, forward stepwise selection, and so on. Best subset regression analysis was used in this research to find the best input combinations for the WQI model. Six statistical parameters were computed for this purpose, namely MSE, the coefficient of determination (R²), adjusted R², Mallows' Cp (Gilmour 1996), Akaike's AIC, and Amemiya's PC (Claeskens and Hjort 2008).

Model’s performance criteria

Throughout the analysis, actual and modeled WQI values were compared. The following statistical metrics were chosen to assess the accuracy of the models: root mean square error, coefficient of determination, mean absolute error, relative absolute error, and root relative squared error (Malone et al. 2017; Elbeltagi et al. 2020a, b, d).

All parameters are defined as follows:

${WQI}_{A}^{i}$ is the calculated or actual value.

${WQI}_{P}^{i}$ is the predicted value.

$\overline{WQI}$ is the mean of the actual (reference) values, and N is the total number of data points.

Root mean square error

The RMSE is the sample standard deviation of the differences between predicted and actual values. It is given by:

$$RMSE=\sqrt{\frac{1}{\mathrm{N}}{\sum }_{i=1}^{N}{{(WQI}_{A}^{i}-{WQI}_{P}^{i})}^{2}}$$

(7)

Mean absolute error

The mean absolute error assesses the magnitude of the errors in a series of predictions without taking their sign into account. It is the average of the absolute differences between predicted and actual values over the test sample. It is defined as follows:

$$MAE=\frac{1}{N}\sum_{i=1}^{N}\left|{WQI}_{P}^{i}-{WQI}_{A}^{i}\right|$$

(8)

Relative absolute error

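The relative absolute error normalizes the total absolute error by dividing it by the total absolute error of a simple predictor, taken here as the mean of the actual values. It is expressed as a percentage:

$$RAE=\frac{{\sum }_{i=1}^{N}\left|{WQI}_{P}^{i}-{WQI}_{A}^{i}\right|}{{\sum }_{i=1}^{N}\left|{WQI}_{A}^{i}-{WQI}^{-}\right|}\times 100$$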

(9)

Root relative squared error

The relative squared error normalizes the total squared error by dividing it by the total squared error of a simple predictor (the mean of the actual values). Taking the square root of the relative squared error brings the error back to the same dimensions as the quantity being predicted.

$$RRSE=\frac{\sqrt{{\sum }_{i=1}^{N}{{(WQI}_{P}^{i}-{WQI}_{A}^{i})}^{2}}}{\sqrt{{\sum }_{i=1}^{N}{{(WQI}_{A}^{i}-{WQI}^{-})}^{2}}}$$

(10)
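
The four error metrics can be computed together, as in the minimal sketch below. It assumes that RAE follows the prose definition (total absolute error divided by that of a mean-only baseline) and that both RAE and RRSE are scaled to percent, as the reported results suggest. The function name `wqi_metrics` and the sample values are hypothetical.

```python
import math


def wqi_metrics(actual, predicted):
    """Return RMSE, MAE, RAE (%) and RRSE (%) for paired actual/predicted WQI values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = [abs(p - a) for a, p in zip(actual, predicted)]
    sq_err = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    # Errors of the baseline that always predicts the mean of the actual values
    base_abs = sum(abs(a - mean_actual) for a in actual)
    base_sq = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "RMSE": math.sqrt(sum(sq_err) / n),
        "MAE": sum(abs_err) / n,
        "RAE": 100.0 * sum(abs_err) / base_abs,            # percent (assumed scaling)
        "RRSE": 100.0 * math.sqrt(sum(sq_err) / base_sq),  # percent (assumed scaling)
    }


# Hypothetical WQI values, for illustration only
metrics = wqi_metrics([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 39.0])
print(metrics)
```

RAE and RRSE values below 100% indicate that a model does better than simply predicting the mean WQI for every sample.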

Results and discussion

Statistical analysis

Table 3 presents the descriptive statistics for 114 groundwater samples. The correlation matrix is useful since it illustrates the importance of each parameter independently and its effect on the hydrochemistry mechanism (Helena et al. 2000; Khan 2011; Patil et al. 2020; Islam et al. 2017; 2020b). In the Pearson correlation matrix (Table 4), values of r equal to +1 or −1 signify total correlation, i.e., functional dependency, between two variables. Values close to zero mean that there is no meaningful relationship between two variables at the p < 0.05 level (Singh et al. 2011; Patil et al. 2020). If r is greater than 0.7, the parameters are highly correlated, and if r is between 0.4 and 0.7, the parameters are moderately correlated. A correlation matrix is used to examine the correlation between chemical parameters and WQI values in this study. The WQI, which is the focus of this study, has very weak correlations with pH and HCO₃⁻, moderate correlations with EC, TH, K⁺, Cl⁻, and NO₃⁻, and strong correlations with TDS, Ca²⁺, Mg²⁺, Na⁺, and SO₄²⁻.

Table 3 Descriptive statistics for all input and output variables

Table 4 The Pearson correlation coefficient between input and output variables

Electrical conductivity (EC) has a negative correlation with pH and positive correlations with the other parameters: r < 0.4 with HCO₃⁻, Ca²⁺, and Cl⁻; 0.4 < r < 0.7 with TH, TDS, Mg²⁺, Na⁺, K⁺, SO₄²⁻, and WQI; and r > 0.7 with NO₃⁻, a strong correlation. Total hardness (TH) is moderately correlated with HCO₃⁻, Mg²⁺, and WQI, and shows no correlation with the remaining parameters. pH shows no correlation with the other parameters, with r ranging between −0.189 and 0.128. The correlations of TDS with HCO₃⁻ and NO₃⁻ are weak and moderate, respectively, whereas Ca²⁺, Mg²⁺, Na⁺, K⁺, SO₄²⁻, Cl⁻, and WQI are all strongly correlated with it. HCO₃⁻ has no relationship with Ca²⁺, Mg²⁺, Na⁺, K⁺, SO₄²⁻, Cl⁻, NO₃⁻, and WQI; on the other hand, Ca²⁺, Mg²⁺, Na⁺, K⁺, SO₄²⁻, and Cl⁻ show strong to moderate correlations with each other.
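
The thresholds used in this discussion (0.4 and 0.7) can be applied mechanically once Pearson's r is computed. The sketch below is illustrative only: the helper names are hypothetical, the sample values are invented, and applying the thresholds to the absolute value of r is an assumption, since the text states them for r directly.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_strength(r):
    """Label a correlation using the 0.4 and 0.7 thresholds, applied to |r| (assumption)."""
    magnitude = abs(r)
    if magnitude > 0.7:
        return "strong"
    if magnitude >= 0.4:
        return "moderate"
    return "weak"


# Invented values standing in for one parameter and the WQI
tds = [450.0, 900.0, 1300.0, 2100.0, 2600.0]
wqi = [38.0, 61.0, 95.0, 140.0, 171.0]
r = pearson_r(tds, wqi)
print(round(r, 3), correlation_strength(r))
```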

Sensitivity analysis

In this section, a sensitivity analysis is performed to determine the most sensitive parameters for predicting WQI within the considered combination set. The selection of the two best input combinations is mainly based on the nonlinear subset regression and sensitivity analysis. The advantage of using the nonlinear sensitivity input variable selection approach to carefully determine the most relevant factors has been reported in several studies (Bui et al. 2020; Kisi et al. 2018; Liu et al. 2019). The best subset regression analysis for determining the best input combinations is presented in Table 5. The best combination was TH / pH / TDS / Ca / Mg / Na / K / SO4 / Cl / NO3, which achieved a high correlation and low statistical errors. Besides, all the combinations found generated good results.

Table 5 The best subset regression analysis for determining the best input combinations to model WQI


Figure 7 presents the standardized coefficients of the input variables for the sensitivity analysis. TH is identified as the most sensitive parameter, with the highest standardized coefficient (0.453) among the considered parameters. TDS takes second place among the most sensitive variables, with a standardized coefficient of 0.243. SO₄²⁻, Cl⁻, and NO₃⁻ have standardized coefficients of 0.152, 0.176, and 0.135, respectively, whereas the remaining parameters are considered non-influential in predicting the WQI (Table 6).

Table 6 The regression analysis for identifying the most effective parameters


Based on the results in Tables 5 and 6 and Fig. 7, and in order to achieve the objective of this paper, two input combinations were chosen: the first includes all the parameters, whereas the second contains only the two most influential inputs for predicting WQI, namely TH and TDS.
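
One common way to obtain standardized coefficients is to rescale each ordinary least squares slope by the ratio of the input's standard deviation to that of the output. The sketch below assumes this definition, which may differ from the one used by the software behind Fig. 7. The function name and the synthetic data are hypothetical.

```python
import numpy as np


def standardized_coefficients(X, y):
    """OLS slopes rescaled by sd(x_j) / sd(y), making inputs on different scales comparable."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    slopes = coef[1:]  # drop the intercept
    return slopes * X.std(axis=0, ddof=1) / y.std(ddof=1)


# Synthetic inputs on very different scales (assumption for illustration)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2)) * np.array([1.0, 10.0])
y = 3.0 * X[:, 0] + 0.1 * X[:, 1]

beta = standardized_coefficients(X, y)
print(beta)
```

Ranking inputs by the magnitude of these coefficients gives an ordering of the kind reported above, where TH and TDS come first.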

Evaluation of several ML models in WQI prediction

This study compared eight methods for predicting the water quality index (WQI): MLR, ANN, M5P tree, SVM, RF, AR, RSS, and LWLR. Two combinations of input variables were used. The first contained all the chemical parameters used in the calculation of the WQI, while the second was limited to only two, namely total dissolved solids (TDS) and total hardness (TH). These two parameters were identified as the main controls on the WQI based on the sensitivity analysis results. It is worth mentioning that the Continental Intercalaire (CI) aquifer system receives non-point sewage from different industries and agricultural inputs, which contributes strongly to the deterioration of the WQI. Generally, in groundwater studies, several factors affect the predictive precision of the models. In this work, one possible factor is the low correlation of pH with WQI, TDS, and TH. Another is the enhanced pollution triggered by industrial inputs, which can drastically decrease the precision of the models. This result is in good agreement with the study by Zhu and Heddam (2019).

Five statistical parameters were selected to determine and compare the performance of the different models. Table 7 presents the results of the models for the first combination of inputs in the training and testing phases. As shown in Table 7, the MLR model performed perfectly in the training phase, with a correlation coefficient of R = 1 and the smallest error indicators: MAE = 1.4 × 10⁻⁸, RMSE = 2.14 × 10⁻⁸, RAE = 1.25 × 10⁻¹⁰%, and RRSE = 3.17 × 10⁻¹⁰. It was followed directly by the ANN model, with R = 0.9996, MAE = 0.925, RMSE = 1.4013, RAE = 1.89%, and RRSE = 0.024, whereas the lowest performing model in the training phase was the LWLR model, with R = 0.9423, MAE = 15.52, RMSE = 18.39, RAE = 36%, and RRSE = 33.76. Overall, the performance indicators show generally acceptable performance for the eight models. Yaseen et al. (2019) reported that RMSE is the most significant predictive numerical index for measuring the performance of the model in any data-mining modeling and time series forecasting. Our finding is in line with that of Yaseen et al. (2018), where the performance accuracy increases as the number of input variables is increased for the prediction of WQI.

Table 7 First input combination model result


For the test phase, the MLR model had the highest correlation value (R = 1) and error indicators that closely approximated zero: MAE = 4.8 × 10⁻⁹, RMSE = 7.7 × 10⁻⁹, RAE = 7.7 × 10⁻¹¹%, and RRSE = 2.5 × 10⁻¹⁰. It was followed by the ANN model, with R = 0.9987, MAE = 1.4, RMSE = 2.7, RAE = 1.68%, and RRSE = 0.044, whereas the weakest test performance was recorded for the SVM model, with R = 0.9412, MAE = 5.16, RMSE = 11.386, RAE = 22.6%, and RRSE = 37.265. The predictive capability of the MLR model with the full input set is not surprising, because the WQI calculation procedure is itself linear in its inputs; MLR has also shown good predictive ability in other studies (Abba et al. 2020; El Bilali et al. 2021).
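
The near-zero errors of MLR with the full input set are what one would expect if, as the conclusion states, the WQI calculation is linear in its inputs. The sketch below illustrates this on synthetic data. The weighted-sum form of the index, the weights, and the number of parameters are assumptions made for illustration, not the study's actual WQI formula.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_params = 114, 5  # 114 mirrors the sample count; 5 parameters is arbitrary
X = rng.uniform(1.0, 100.0, size=(n_samples, n_params))  # synthetic concentrations
weights = np.array([0.30, 0.25, 0.20, 0.15, 0.10])       # hypothetical index weights

wqi = X @ weights  # a purely linear index with no noise


def fit_rmse(inputs, target):
    """Fit OLS with an intercept and return (coefficients, RMSE of the fit)."""
    A = np.column_stack([np.ones(inputs.shape[0]), inputs])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    rmse = float(np.sqrt(np.mean((target - A @ coef) ** 2)))
    return coef, rmse


# All inputs: the linear model can reproduce the index up to rounding error
coef_all, rmse_all = fit_rmse(X, wqi)

# Only two inputs: the remaining terms act as unexplained variation
coef_two, rmse_two = fit_rmse(X[:, :2], wqi)

print(rmse_all, rmse_two)
```

The same reasoning explains why MLR loses its advantage once inputs are removed: the omitted terms can no longer be represented by the linear fit.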

Table 8 presents the performance of the eight models for the second input combination, which includes only TH and TDS. During the training phase, the best results were recorded for the RF model, with R = 0.9984, MAE = 1.99, RMSE = 3.248, RAE = 4.6%, and RRSE = 5.96%. The ANN model came second, with R = 0.9969, MAE = 2.46, RMSE = 3.88, RAE = 3.3%, and RRSE = 7.01%. The MLR model, which performed best with the first combination of inputs, dropped to fifth place with the second combination, with R = 0.9958, MAE = 3.48, RMSE = 4.98, RAE = 4.23%, and RRSE = 7.37. The weakest performance was recorded for the LWLR model, with R = 0.9406, MAE = 15.33, RMSE = 18.74, RAE = 36.08%, and RRSE = 34.42%.

For the test phase, the ANN model outperformed the rest of the models, with R = 0.9957, MAE = 3.85, RMSE = 6.19, RAE = 3.96%, and RRSE = 9.35%. It was followed by the RF model, with R = 0.9926, MAE = 2.15, RMSE = 3.82, RAE = 9.45%, and RRSE = 12.51. The weakest performance was recorded for the MLR model, with R = 0.9325, MAE = 7.94, RMSE = 11.04, RAE = 12.51%, and RRSE = 36.15%. The main reason for the poor performance of the other models in both input combinations may be related to the inverse association identified by the negative correlation between the observed pH and the NO₃⁻ and HCO₃⁻ parameters, except for the TH and TDS values. This observation is analogous to the results reported by Zhu and Heddam (2019).

Table 8 Second input combination model result


It is noted that an ensemble tree-based model such as RF outperformed all the other models with considerable accuracy for the second input combination, owing to its robustness in dealing with complicated relationships and its ability to make predictions without requiring large, regular datasets. Our results showed that the RF model is superior to other models in terms of precision. The key reason is that the RF model can accommodate high-dimensional factors to improve water quality prediction accuracy, e.g., the inclusion of a monthly physicochemical variable in this study. Besides, Castrillo and García (2020) reported a higher prediction precision for the RF model than for the MLR model. This is also in line with earlier published work on classification problems (Salam and Islam 2020; Chen et al. 2020).

Figure 8 shows scatter plots of the calculated WQI values against the predicted WQI values for each model using the first set of inputs. From this figure, the MLR model appears to be the most suitable for predicting the water quality index, because its points coincide with the 1:1 line. Figure 9 shows the corresponding scatter plots for the second combination of inputs. It shows a large dispersion of the MLR model points, while the RF model points lie closer to the 1:1 line than those of the other models. The largest dispersion of points occurred for the LWLR and RSS models, which indicates the poor performance of these two models with the second set of inputs.

The best model in each scenario is presented in Fig. 10 using scatter plots with smooth lines, blue for calculated WQI and purple for predicted WQI values, with markers representing samples. Part (a) presents the results of the MLR model from the first scenario, where an optimal fit is shown between calculated and predicted WQI values. Part (b) presents the RF model from the second scenario. The fit in the second scenario is not as close as in the first, because the number of inputs was reduced; this is why some predicted points do not coincide with their calculated counterparts.

In addition, the Wilcoxon rank-sum test was used to confirm the results reported in the previous paragraphs. This is a nonparametric statistical test used to compare two groups. The test calculates the difference between the pairs, and the results are used to determine whether the two groups are statistically different from each other. In this work, the test was used to evaluate the null hypothesis that the two groups being compared have the same continuous distribution. Some conditions must be met to apply this test: the data should come from the same population and be paired, with random and independent data selection. Table 9 presents the P values for each model based on the first and second input configurations. With the first input combination, the highest probability was recorded for the MLR and AR models, with P = 0.9951 for both, whereas the lowest probability was recorded for the RSS model, with P = 0.4730.

Table 9 Wilcoxon rank-sum test results (P values) of the optimal models in predicting WQI using first and second combination inputs


The use of the second group of inputs witnessed noticeable changes in the performance of the models. The highest probability of match was recorded when using the RF model with a value of P = 0.9951. Both the MLR and AR models reported significant decreases in performance with values of P = 0.8588 and P = 0.7585, respectively. The weakest performance was recorded again when using the RSS model with a value of P = 0.5519.
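
A rank-sum comparison of calculated and predicted WQI values can be sketched as follows. This is a minimal illustration using the large-sample normal approximation without tie or continuity correction, not necessarily the exact procedure behind Table 9. The function names and sample values are hypothetical; library routines such as `scipy.stats.ranksums` provide the same test.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_test(a, b):
    """Two-sided Wilcoxon rank-sum test (normal approximation). Returns (z, p)."""
    n1, n2 = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    w = sum(ranks[:n1])                      # rank sum of the first group
    mean_w = n1 * (n1 + n2 + 1) / 2.0        # expected rank sum under the null
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided p value
    return z, p


# Invented calculated vs. predicted WQI values, for illustration only
calculated = [42.0, 55.5, 61.0, 78.2, 90.4, 103.7]
predicted = [41.6, 56.1, 60.2, 79.0, 89.9, 104.5]
z, p = rank_sum_test(calculated, predicted)
print(round(z, 3), round(p, 4))
```

A P value close to 1, as reported for the best models, means the test finds no evidence that the predicted and calculated values come from different distributions.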

The physicochemical parameters chosen in the current study may also pose a drawback due to possible inadequate sampling. In addition, the uncertainty problem of physically based models in water quality modeling is inevitable and has been discussed in many studies (Bui et al. 2020; Kisi et al. 2018; Singha et al. 2021). Future research may use different input physicochemical parameters to predict the WQI based on WHO guidelines, for comparison with other standard indexes. The model presented here should also be appraised for other similar climatic and hydrological settings. However, given the noisy characteristics of this dataset, there is still a risk that the models did not fit the data well, which might undermine the outcomes of the scenario forecasting. Besides, adding more influential physicochemical factors could also improve model fitting. For example, there may be other factors affecting TDS concentration besides climate and hydrogeological features (Islam et al. 2017). With the continued development of machine learning models, further work predicting contaminant concentrations under future pollution scenarios is promising, provided the machine learning algorithm fits the data well.

As mentioned in previous studies, a key gap in water quality studies has been a lack of consideration of cross effects between explanatory variables, such as the cross-correlation between land covers and the cross-correlation between land cover and climate in influencing stream water quality (Islam et al. 2021). Machine learning models can exploit such interactions between input variables to improve predictive accuracy, which is an advantage over conventional statistical models. For example, physicochemical factors are likely to interact with environmental variables and groundwater pollution in their effect on groundwater quality, and the predictive accuracy can therefore be improved.

Conclusion

In this work, the effectiveness of a group of artificial intelligence methods in predicting the water quality index in a dry desert environment was examined based on 114 samples collected from six aquifers at different time periods in Illizi state, southeast Algeria. Eight artificial intelligence models, namely MLR, ANN, SVM, M5P tree, RSS, RF, AR, and LWLR, were used, and their predictive ability was tested based on two scenarios with two different input combinations. The two proposed scenarios aim to solve two main problems. First, the classical computational method is replaced with a modeling approach. Second, when data are lacking or unavailable in critical cases, this study provides an alternative solution. The first set of inputs included all the chemical elements present in the water and used in calculating the WQI, while the second combination contained the parameters controlling water quality changes, which were determined using the sensitivity analysis.

The sensitivity analysis shows that all the subsets performed well as predictors in modeling WQI. The selection of only two parameters as inputs in the second scenario was made in order to propose an alternative solution for monitoring the WQI in the study area in critical cases. In the second scenario, the modeling procedure showed that TDS and TH concentrations were the most vital determinants of WQI. The MLR model performed perfectly in the first scenario because the calculation procedure of the WQI is linear, which allows the task to be executed perfectly by an MLR model with all the parameters as inputs. Reducing the number of inputs directly affects the performance of the models, and the aim of the second scenario was to determine which model performs well under such conditions. The RF model was observed to be the best model for predicting WQI based on TH and TDS in the study area.

It is worth noting that the MLR and RF algorithms generate robust results using a dataset covering longer periods under the two scenarios. Thus, these algorithms might be useful for developing regions that have very few wellbores. Our results suggest that the RF algorithm could be a robust and cost-effective model to enhance groundwater quality management plans in an arid region of southeast Algeria. This model may be more applicable in developing countries, where the costs of estimating several water quality variables are high and often restrictive. These outcomes may not be generalizable to other regions or other hydrogeological datasets, and these algorithms might not be optimal (i.e., most reliable) in all areas and under all conditions.

Data Availability

The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.

References

Abba SI, Hadi SJ, Sammen SS, Salih SQ, Abdulkadir RA, Pham QB, Yaseen ZM (2020) Evolutionary computational intelligence algorithm coupled with self-tuning predictive model for water quality index determination. J Hydrol 587:124974
Abbasi T, Abbasi SA (2012) Water quality indices. Elsevier
Adnan RM, Khosravinia P, Karimi B, Kisi O (2021) Prediction of hydraulics performance in drain envelopes using Kmeans based multivariate adaptive regression spline. Appl Soft Comput 100:107008. https://doi.org/10.1016/j.asoc.2020.107008
Ahmed U, Mumtaz R, Anwar H, Shah AA, Irfan R, García-Nieto J (2019) Efficient water quality prediction using supervised Machine Learning. Water 11(11):2210. https://doi.org/10.3390/w11112210
Alam MJ, Islam MR, Muyen Z, Mamun M, Islam S (2007) Water quality parameters along rivers. Int J Environ Sci Technol 4(1):159–167
Asadollah SBHS, Sharafati A, Motta D, Yaseen ZM (2021) River water quality index prediction and uncertainty analysis: a comparative study of machine learning models. J Environ Chem Eng 9:104599. https://doi.org/10.1016/j.jece.2020.104599
Babaee M, Maroufpoor S, Jalali M, Zarei M, Elbeltagi A (2021) Artificial intelligence approach to estimating rice yield. Irrig Drain. https://doi.org/10.1002/ird.2566
Babbar R, Babbar S (2017) Predicting river water quality index using data mining techniques. Environ Earth Sci. https://doi.org/10.1007/s12665-017-6845-9
Bilali AE, Taleb A, Brouziyne Y (2021) Groundwater quality forecasting using machine learning algorithms for irrigation purposes. Agricul Water Manage 245:106625
Boudjema A (1987) Evolution structurale du bassin petrolier «Triasique» du Sahara Nord Oriental (Algerie). Thèse a l’Université de Paris-Sud, Centre d’Orsay, 290 p
Bournas N, Galdeano A, Hamoudi M, Baker H (2003) Interpretation of the aeromagnetic map of Eastern Hoggar (Algeria) using the Euler deconvolution, analytic signal and local wavenumber methods. J African Earth Sci 37:191–205. https://doi.org/10.1016/j.jafrearsci.2002.12.001
Bruneau B, Grégoire F (2011) Étude de la distribution spatiale des données d’abondance de maquereau bleu (Scomber scombrus) et de capelan (Mallotus villosus) des relevés d’hiver aux poissons de fond des Divisions 4VW de l’OPANO à l’aide de modèles additifs généralisés. Rapport technique canadien des sciences halieutiques et aquatiques 2930, vi + 22
Bui DT, Khosravi K, Tiefenbacher J et al (2020) Improving prediction of water quality indices using novel hybrid machine-learning algorithms. Sci Total Environ. https://doi.org/10.1016/j.scitotenv.2020.137612
Castrillo M, García AL (2020) Estimation of high frequency nutrient concentrations from water quality surrogates using machine learning methods. Water Res 172:115490. https://doi.org/10.1016/j.watres.2020.115490
Chen K, Chen H, Zhou C, Huang Y, Qi X, Shen R, Liu F, Zuo M, Zou X, Wang J, Zhang Y (2020) Comparative analysis of surface water quality prediction performance and identification of key water parameters using different machine learning models based on big data. Water Res 171:115454
Claeskens G, Hjort N (2008) Model selection and model averaging. Cambridge University Press
Cleveland WS, Devlin SJ (1988) Locally weighted regression: an approach to regression analysis by local fitting. J Am Stat Assoc 83:596–610. https://doi.org/10.1080/01621459.1988.10478639
Deng T, Chau KW, Duan HF (2021) Machine learning based marine water quality prediction for coastal hydro-environment management. J Environ Manage 284:112051
Dos Santos S, Adams EA, Neville G, Wada Y, de Sherbinin A, Mullin Bernhardt E, Adamo SB (2017) Urban growth and water access in sub-Saharan Africa: progress, challenges, and emerging research directions. Sci Total Environ 607–608:497–508. https://doi.org/10.1016/j.scitotenv.2017.06.157
Egbueri JC (2020) Groundwater quality assessment using pollution index of groundwater (PIG), ecological risk index (ERI) and hierarchical cluster analysis (HCA): a case study. Groundw Sustain Dev 10:100292. https://doi.org/10.1016/j.gsd.2019.100292
Elbeltagi A, Deng J, Wang K, Hong Y (2020a) Crop water footprint estimation and modeling using an artificial neural network approach in the Nile Delta, Egypt. Agric Water Manag 235:106080. https://doi.org/10.1016/j.agwat.2020.106080
Elbeltagi A, Deng J, Wang K, Malik A, Maroufpoor S (2020b) Modeling long-term dynamics of crop evapotranspiration using deep learning in a semi-arid environment. Agric Water Manag 241:106334. https://doi.org/10.1016/j.agwat.2020.106334
Elbeltagi A, Rizwan M, Malik A, Mehdinejadiani B, Srivastava A, Singh A, Deng J (2020c) The impact of climate changes on the water footprint of wheat and maize production in the Nile Delta, Egypt. Sci Total Environ 743:140770. https://doi.org/10.1016/j.scitotenv.2020.140770
Elbeltagi A, Zhang L, Deng J, Juma A, Wang K (2020d) Modeling monthly crop coefficients of maize based on limited meteorological data: a case study in Nile Delta, Egypt. Comput Electron Agric 173:105368. https://doi.org/10.1016/j.compag.2020.105368
Elbeltagi A, Kumari N, Dharpure JK, Mokhtar A, Alsafadi K, Kumar M, Mehdinejadiani B, Ramezani Etedali H, Brouziyne Y, Towfiqul Islam ARM, Kuriqi A (2021a) Prediction of combined terrestrial evapotranspiration index (CTEI) over large river basin based on machine learning approaches. Water (Switzerland) 13:1–18. https://doi.org/10.3390/w13040547
Elbeltagi A, Pande CB, Kouadri S, Islam ARM (2021) Applications of various data-driven models for the prediction of groundwater quality index in the Akot basin, Maharashtra, India. Environ Sci Pollut Res, pp 1–15
García-Pedrajas N, Ortiz-Boyer D (2008) Boosting random subspace method. Neural Netw 21(9):1344–1362
Gazzaz NM, Yusoff MK, Aris AZ, Juahir H, Ramli MF (2012) Artificial neural network modeling of the water quality index for Kinta River (Malaysia) using water quality variables as predictors. Mar Pollut Bull 64:2409–2420
Gilmour SG (1996) The interpretation of Mallows’s Cp-statistic. J Royal Statist Soc: D (The Statistician) 45(1):49–56
Hayder G, Kurniawan I, Mustafa HM (2021) Implementation of machine learning methods for monitoring and predicting water quality parameters. Biointerf Res Appl Chem 11(2):9285–9295
Helena B, Pardo R, Vega M, Barrado E, Fernandez JM, Fernandez L (2000) Temporal evolution of groundwater composition in an alluvial aquifer (Pisuerga River, Spain) by principal component analysis. Water Res 34(3):807–816
Ho TK (1998) The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 20(8):832–844
Hossain M, Patra PK (2020) Water pollution index – A new integrated approach to rank water quality. Ecol Indic 117:106668. https://doi.org/10.1016/j.ecolind.2020.106668
Islam ARMT, Ahmed N, Bodrud-Doza M, Chu R (2017) Characterizing groundwater quality ranks for drinking purposes in Sylhet district Bangladesh, Using Entropy Method, Spatial Autocorrelation Index and Geostatistics. Environ Sci Pollut Res 24(34):26350–26374. https://doi.org/10.1007/s11356-017-0254-1
Islam ARMT, Shen S, Haque MA et al (2018) Assessing groundwater quality and its sustainability in Joypurhat district of Bangladesh using GIS and multivariate statistical approaches. Environ Dev Sustain 20(5):1935–1959. https://doi.org/10.1007/s10668-017-9971-3
Islam ARMT, Bodrud-doza M, Rahman MS, Amin SB, Chu R, Mamun HA (2019) Sources of trace elements identification in drinking water of Rangpur district, Bangladesh and their potential health risk following multivariate techniques and Monte-Carlo simulation. Groundw Sustain Dev 9:100275. https://doi.org/10.1016/j.gsd.2019.100275
Islam ARMT, Mamun AA, Rahman MM, Zahid A (2020a) Simultaneous comparison of modified-integrated water quality and entropy weighted indices: Implication for safe drinking water in the coastal region of Bangladesh. Ecol Ind 113:106229. https://doi.org/10.1016/j.ecolind.2020.106229
Islam ARMT, Siddiqua MT, Zahid A, Tasnim SS, Rahman MM (2020b) Drinking appraisal of coastal groundwater in Bangladesh: An approach of multi-hazards towards water security and health safety. Chemosphere 255:126933. https://doi.org/10.1016/j.chemosphere.2020.126933
Islam ARMT, Talukdar S, Mahato S et al (2021) Machine learning algorithm-based risk assessment of riparian wetlands in Padma River Basin of Northwest Bangladesh. Environ Sci Pollut Res. https://doi.org/10.1007/s11356-021-12806-z
Kabir MM, Akter S, Ahmed FT, Mohinuzzaman M, Didar-ul-Alam M, Mostofa KMG, Islam ARMT, Niloy NM (2021) Salinity-induced fluorescent dissolved organic matter influence co- contamination, quality and risk to human health of tube well water, southeast coastal Bangladesh. Chemosphere 275:130053. https://doi.org/10.1016/j.chemosphere.2020.130053
Kawo NS, Karuppannan S (2018) Groundwater quality assessment using water quality index and GIS technique in Modjo River Basin, central Ethiopia. J African Earth Sci 147:300–311. https://doi.org/10.1016/j.jafrearsci.2018.06.034
Khan N (2011) Eruption time of permanent teeth in Pakistani children. Iran J Public Health 40(4):63
Kim J, Han H, Johnson LE, Lim S, Cifelli R (2019) Hybrid machine learning framework for hydrological assessment. J Hydrol. https://doi.org/10.1016/j.jhydrol.2019.123913
Kisi O, Ozkan C (2017) A new approach for modeling sediment-discharge relationship: local weighted linear regression. Water Resour Manag 31:1–23. https://doi.org/10.1007/s11269-016-1481-9
Kisi O, Azad A, Kashi H, Saeedian A, Ali S, Hashemi A, Ghorbani S (2018) Modeling groundwater quality parameters using hybrid neuro-fuzzy methods. Water Resour Manage. https://doi.org/10.1007/s11269-018-2147-6
Kouadri S, Samir K (2021) Hydro-chemical study with geospatial analysis of groundwater Quality Illizi Region, South-East of Algeria. Iran J Chem Chemical Eng (IJCCE) 40(4):1315–1333. https://doi.org/10.30492/ijcce.2020.39800
Kouadri S, Kateb S, Zegait R (2021) Spatial and temporal model for WQI prediction based on back-propagation neural network, application on EL MERK region (Algerian southeast). J Saudi Soci Agricul Sci 20(5):324–336
Kuncheva LI, Plumpton CO (2010) Choosing parameters for random subspace ensembles for fMRI classification. In International Workshop on Multiple Classifier Systems (pp. 54–63). Springer, Berlin, Heidelberg.
Lai C, Reinders MJ, Wessels L (2006) Random subspace method for multivariate feature selection. Pattern Recogn Lett 27(10):1067–1076
Li P, He S, Yang N, Xiang G (2018) Groundwater quality assessment for domestic and agricultural purposes in Yan’an City, northwest China: implications to sustainable groundwater quality management on the Loess Plateau. Environ Earth Sci 77:1–16. https://doi.org/10.1007/s12665-018-7968-3
Liu P, Wang J, Sangaiah AK, Xie Y, Yin X (2019) Analysis and prediction of water quality using LSTM deep neural networks in IoT environment. Sustainability. https://doi.org/10.3390/su11072058
Loecke TD, Burgin AJ, Riveros-Iregui DA, Ward AS, Thomas SA, Davis CA, Clair MAS (2017) Weather whiplash in agricultural regions drives deterioration of water quality. Biogeochemistry 133(1):7–15
Luo X, Lin F, Chen Y, Zhu S, Xu Z, Huo Z, Peng J (2019) Coupling logistic model tree and random subspace to predict the landslide susceptibility areas with considering the uncertainty of environmental features. Sci Rep 9(1):1–13
Malone BP, Styc Q, Minasny B, McBratney AB (2017) Digital soil mapping of soil carbon at the farm scale: a spatial downscaling approach in consideration of measured and uncertain data. Geoderma 290:91–99. https://doi.org/10.1016/j.geoderma.2016.12.008
McCullagh P, Nelder JA (1989) Generalized linear models. CRC Press, London
McCulloch WS, Pitts W (1943) A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics 5:115–133. https://doi.org/10.1007/BF02478259
Montgomery S (1993) Ghadames Basin of north central Africa: stratigraphy, geologic history, and drilling summary. Petroleum Frontiers 10(3):51
Mukate S, Wagh V, Panaskar D, Jacobs JA, Sawant A (2019) Development of new integrated water quality index (IWQI) model to evaluate the drinking suitability of water. Ecol Indic 101:348–354. https://doi.org/10.1016/j.ecolind.2019.01.034
Nourani V, Elkiran G, Abba SI (2018) Wastewater treatment plant performance analysis using artificial intelligence–an ensemble approach. Water Sci Technol 78(10):2064–2076
Ongley ED (2000) Water quality management: design, financing and sustainability considerations-II. In: Invited Presentation at the World Bank’s Water Week Conference: towards a Strategy for Managing Water Quality Management, pp 1–16
Patil VBB, Pinto SM, Govindaraju T, Hebbalu VS, Bhat V, Kannanur LN (2020) Multivariate statistics and water quality index (WQI) approach for geochemical assessment of groundwater quality—a case study of Kanavi Halla Sub-Basin, Belagavi, India. Environ Geochem Health 42(9):2667–2684
Peng RD, Dominici F (2008) Statistical methods for environmental epidemiology with R: a case study in air pollution and health. Springer. https://doi.org/10.1007/978-0-387-78167-9
Peterson JA (1985) Geology and petroleum resources of north-central and northeastern Africa. U.S. Geological Survey Open-File Report 85–709, 54 p
Pham BT, Tien Bui D, Prakash I, Dholakia MB (2017) Hybrid integration of multilayer perceptron neural networks and machine learning ensembles for landslide susceptibility assessment at Himalayan area (India) using GIS. CATENA 149:52–63. https://doi.org/10.1016/j.catena.2016.09.007
Rahman MM, Bodrud-Doza M, Siddique T, Zahid A, Islam ARMT (2020) Spatiotemporal distribution of fluoride in drinking water and associated probabilistic human health risk appraisal in the coastal region, Bangladesh. Sci Total Environ 724:138316. https://doi.org/10.1016/j.scitotenv.2020.138316
Saha N, Bodrud-doza M, Islam ARMT et al (2020) Hydrogeochemical evolution of shallow and deeper aquifers in central Bangladesh: arsenic mobilization process and health risk implications from the potable use of groundwater. Environ Earth Sci 79(20):477. https://doi.org/10.1007/s12665-020-09228-4
Salam R, Islam ARMT (2020) Potential of RT, Bagging and RS ensemble learning algorithms for reference evapotranspiration prediction using climatic data-limited humid region in Bangladesh. J Hydrol 590:125241. https://doi.org/10.1016/j.jhydrol.2020.125241
Sihag P, Mohsenzadeh Karimi S, Angelaki A (2019) Random forest, M5P and regression analysis to estimate the field unsaturated hydraulic conductivity. Appl Water Sci 9:1–9. https://doi.org/10.1007/s13201-019-1007-8
Sihag P, Angelaki A, Chaplot B (2020) Estimation of the recharging rate of groundwater using random forest technique. Appl Water Sci 10:1–11. https://doi.org/10.1007/s13201-020-01267-3
Singh KP, Basant N, Gupta S (2011) Support vector machines in water quality management. Anal Chim Acta 703(2):152–162
Singh B, Sihag P, Singh K (2017) Modelling of impact of water quality on infiltration rate of soil by random forest regression. Model Earth Syst Environ 3:999–1004. https://doi.org/10.1007/s40808-017-0347-3
Singha S, Pasupuleti S, Singha SS, Singh R, Kumar S (2021) Prediction of groundwater quality using efficient machine learning technique. Chemosphere 276:130265
Tiyasha TM, Yaseen ZM (2020) A survey on river water quality modelling using artificial intelligence models: 2000–2020. J Hydrol. https://doi.org/10.1016/j.jhydrol.2020.124670
Towfiqul Islam ARM, Talukdar S, Mahato S, Kundu S, Eibek KU, Pham QB, Kuriqi A, Linh NTT (2021) Flood susceptibility modelling using advanced ensemble machine learning models. Geosci Front. https://doi.org/10.1016/j.gsf.2020.09.006
Trevett AF, Carter RC, Tyrrel SF (2005) Mechanisms leading to post-supply water quality deterioration in rural Honduran communities. Int J Hyg Environ Health 208(3):153–161
Uddin MG, Nash S, Olbert AI (2021) A review of water quality index models and their use for assessing surface water quality. Ecol Indic 122:107218. https://doi.org/10.1016/j.ecolind.2020.107218
Valentini M, dos Santos GB, Muller Vieira B (2021) Multiple linear regression analysis (MLR) applied for modeling a new WQI equation for monitoring the water quality of Mirim Lagoon, in the state of Rio Grande do Sul—Brazil. SN Appl Sci 3:1–11. https://doi.org/10.1007/s42452-020-04005-1
Wang G, Zhang Z, Sun J, Yang S, Larson CA (2015) POS-RS: A Random Subspace method for sentiment classification based on part-of-speech analysis. Inf Process Manage 51(4):458–479
Wang X, Zhang F, Ding J (2017) Evaluation of water quality based on a machine learning algorithm and water quality index for the Ebinur Lake Watershed, China. Sci Rep. https://doi.org/10.1038/s41598-017-12853-y
Wang Q, Xu W, Zheng H (2018) Combining the wisdom of crowds and technical analysis for financial market prediction using deep random subspace ensembles. Neurocomputing 299:51–61
van de Weerd AA, Ware PLG (1994) A review of the East Algerian Sahara oil and gas province (Triassic, Ghadames and Illizi Basins). First Break 12(7):363–373
Yaseen Z, Ehteram M, Sharafati A, Shahid S, Al-Ansari N, El-Shafie A (2018) The integration of nature-inspired algorithms with least square support vector regression models: Application to modeling river dissolved oxygen concentration. Water 10(9):1124
Yaseen ZM, Sulaiman SO, Deo RC, Chau K-W (2019) An enhanced extreme learning machine model for river flow forecasting: state-of-the-art, practical applications in water resource engineering area and future research direction. J Hydrol 569:387–408
Zhu S, Heddam S (2019) Prediction of dissolved oxygen in urban rivers at the Three Gorges Reservoir, China: extreme learning machines (ELM) versus artificial neural network (ANN). Water Qual Res J 55(1):1–13
Zhu S, Hrnjica B, Ptak M, Choinski A, Sivakumar B (2020) Forecasting of water level in multiple temperate lakes using machine learning models. J Hydrol 585:124819

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information

Authors and Affiliations

Laboratory of Water and Environment Engineering in Sahara Milieu (GEEMS), Department of Civil Engineering and Hydraulics, Faculty of Applied Sciences, Kasdi Merbah University Ouargla, Ouargla, Algeria
Saber Kouadri
Agricultural Engineering Department, Faculty of Agriculture, Mansoura University, Mansoura, 35516, Egypt
Ahmed Elbeltagi
Department of Disaster Management, Begum Rokeya University, Rangpur, 5400, Bangladesh
Abu Reza Md. Towfiqul Islam
Research Laboratory in Exploitation and Development of Natural Resources in Arid Zones, University of Kasdi Merbah-Ouargla, PB 147 RP, 30000, Ouargla, Algeria
Samir Kateb

Contributions

SK, SK, and AE had the idea of this study and contributed to conceptualization and formal analysis. AE and SK implemented the modeling process; SK wrote the original draft; ARTI wrote the discussion and improved other sections; all co-authors performed review and editing and accepted the final draft.

Corresponding author

Correspondence to Saber Kouadri.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Kouadri, S., Elbeltagi, A., Islam, A.R.M.T. et al. Performance of machine learning methods in predicting water quality index based on irregular data set: application on Illizi region (Algerian southeast). Appl Water Sci 11, 190 (2021). https://doi.org/10.1007/s13201-021-01528-9

Received: 03 July 2021
Accepted: 21 October 2021
Published: 06 November 2021
DOI: https://doi.org/10.1007/s13201-021-01528-9
