Abstract
The spatial display of clustered data using machine learning (ML) as regions (bordered areas) is currently unfeasible. This problem is commonly encountered in various research fields that utilize clustering algorithms in their workflow. We present in this study an approach utilizing ML algorithm models that can be trained to any specific dataset to produce decision boundaries. These boundaries are overlaid onto the geographic coordinate system (GCS) to generate geographic clustering regions. The proposed approach is implemented in the Python Package Index (PyPI) as a geovisualization library called geographic decision zones (GeoZ). The efficiency of GeoZ was tested using a dataset of groundwater wells in the State of California. We experimented with 13 different ML models to determine the best model that predicts the existing regional distribution (subbasins). The support vector machine (SVM) algorithm produced a relatively high accuracy score and fulfilled the required criteria better than the other models. Consequently, the tested SVM model with optimized parameters was implemented in the GeoZ open-source library. However, it is important to note that limitations in the application of GeoZ may arise from the nature of the SVM algorithm, as well as the volume, discontinuity, and distribution of the data. We have attempted to address these limitations through various suggestions and solutions.
Introduction
The output of clustering algorithms is conventionally visualized as scatter plots, which cannot show bordered areas, zones, or regions (Fig. 1). We developed our approach because of the lack of methods for creating boundary-based regions. Liu et al. (2015) partially addressed this problem by developing a clustering method based on the Delaunay triangulation network. Their algorithm clusters the data samples according to their spatial proximity to each other. The spatial boundary derived from the spatial density can be considered part of the algorithm workflow, as it is used to distinguish between the different clusters. However, their mapping approach is applicable only to their own clustering method and cannot be applied to other clustering algorithms. Moreover, their clustering method operates only on latitude and longitude; thus, it is not useful for clustering parameters or features other than these coordinates.
Many fields that use clustering algorithms adopt an identical approach to represent the data, and in many cases, it is sufficient to show scatter plots with class-based colors (Fig. 1). However, there are instances in which displaying regions rather than points would be more informative and clearer. This is particularly useful in fields such as meteorology (Ohba et al. 2016; Singh et al. 2017), climatology (Köplin et al. 2013), oceanography (Sun et al. 2021; Wichmann et al. 2020), crime analysis (Lombardo and Falcone 2011; Mburu and Mutua 2023), and others (Li et al. 2020; Subba Rao and Chaudhary 2019). Displaying clusters as regions can help to identify important boundaries that can impact the output or reduce the risk of misclassification. GeoZ is an acronym that refers to geographic decision zones, an implementation of the above concept, and as its name indicates, its basic purpose is to project the decision zones of trained ML models into the geographic coordinate system (GCS).
To demonstrate the methodology and application of the boundary-based regions, we selected a groundwater (GW) well dataset that contains the longitude and latitude of the wells along with the subbasins to which they were allocated, based on geomorphological and administrative boundaries (California Department of Water Resources (DWR) 2021). We tested conventional cluster mapping using scatter plots (based on the manually introduced boundaries, Fig. 2) and basic statistical boundaries using the Voronoi tessellation method to form a picture of the data distribution and expected boundaries (Fig. 3). We created the Voronoi tessellation by forming a centroid inside each cluster. The tessellation diagram divided the study area into regions whose boundaries are equidistant from neighboring centroids. Although this approach provided an initial idea of the study area’s clustering, it had limitations due to the irregular distribution of the data. Consequently, we adopted advanced machine learning (ML) models, such as support vector machines (SVM), to delineate stochastic boundaries around the data.
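The centroid-based tessellation described above can be sketched with NumPy alone, because a location belongs to the Voronoi cell of whichever centroid is nearest to it. This is a minimal illustration under stated assumptions, not the procedure used to produce Fig. 3: the coordinates, labels, and function names are hypothetical.

```python
import numpy as np

def cluster_centroids(coords, labels):
    """Return the sorted unique labels and the mean coordinate of each cluster."""
    classes = np.unique(labels)
    centroids = np.array([coords[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def voronoi_assign(points, classes, centroids):
    """Assign each point to the Voronoi cell of its nearest centroid."""
    # Pairwise squared distances, shape (n_points, n_centroids).
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[d2.argmin(axis=1)]

# Toy (longitude, latitude) pairs; the values are illustrative only.
coords = np.array([[-121.0, 38.0], [-121.2, 38.1], [-119.0, 36.0], [-119.1, 36.2]])
labels = np.array([0, 0, 1, 1])

classes, centroids = cluster_centroids(coords, labels)
query = np.array([[-121.1, 38.0], [-119.0, 36.1]])
cells = voronoi_assign(query, classes, centroids)
print(cells)
```

Because each cell depends only on the centroid positions, the shape and spread of the points inside a cluster have no influence on the boundary, which is consistent with the limitation noted above.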
Despite the large amount of data usually available in clustering problems, the data point density can be low over large parts of the study area, creating regions of uncertainty that are often discarded from consideration. To overcome this, we used ML algorithms to train a model on the available data and predict the areas of sparse data. We fed the model sequential points that fill a grid covering the entire study area and drew its decision boundaries in the GCS to see how the boundaries would be reflected in the real world. Although the proposed approach has the disadvantage of depending on the availability of data for accuracy, it provides a new tool for mapping clustering datasets without well-defined boundaries, which can be rapidly and accurately represented using ML models.
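The grid-filling step can be sketched as follows. This is a minimal example assuming scikit-learn's `SVC` as the trained model and synthetic well locations; the helper name `predict_on_grid` and its parameters are hypothetical and are not part of GeoZ.

```python
import numpy as np
from sklearn.svm import SVC

def predict_on_grid(model, coords, resolution=100, margin=0.1):
    """Predict a class for every node of a regular grid covering the data extent."""
    lon_min, lat_min = coords.min(axis=0) - margin
    lon_max, lat_max = coords.max(axis=0) + margin
    lon = np.linspace(lon_min, lon_max, resolution)
    lat = np.linspace(lat_min, lat_max, resolution)
    lon_grid, lat_grid = np.meshgrid(lon, lat)
    # Flatten the grid into (longitude, latitude) rows, the same layout as the training data.
    grid_points = np.column_stack([lon_grid.ravel(), lat_grid.ravel()])
    zones = model.predict(grid_points).reshape(lon_grid.shape)
    return lon_grid, lat_grid, zones

# Synthetic stand-in for well coordinates and their cluster labels.
rng = np.random.default_rng(0)
coords = np.vstack([
    rng.normal([-121.0, 38.0], 0.05, size=(40, 2)),
    rng.normal([-120.0, 37.0], 0.05, size=(40, 2)),
])
labels = np.repeat([0, 1], 40)

model = SVC(kernel="rbf").fit(coords, labels)
lon_grid, lat_grid, zones = predict_on_grid(model, coords, resolution=50)
```

The `zones` array can then be drawn with any raster or contour plotting routine, because its axes are already longitude and latitude.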
Study Area and Data
Study Area
The State of California was selected for the case study to demonstrate the capabilities of our novel algorithm. The state ranks third in the USA in terms of size, with an area of almost half a million square kilometers (Prothero 2017). This makes California home to many unique natural systems, including several major water bodies and diverse natural terrains. The vast area of the state allows for various hydrological conditions and exhibits a distinctive and complex hydrogeological canvas (Carle 2015; Prothero 2017). Unlike GW basins, the GW subbasins in California are divided according to a combination of natural boundaries and approved administrative boundaries (California Department of Water Resources (DWR) 2021). This approach creates a complex and unique map that serves as an excellent illustration of the need for region-based representation (Fig. 2) and could serve as a potential utilization of modeling capabilities of the GeoZ library.
Datasets
The dataset employed in this study was provided by the California DWR and is publicly available on the DWR website (California Natural Resources Agency 2021). It contains 17 columns and 45,923 rows comprising the geographic locations of the wells drilled in the State of California and many other characteristics, such as the well type, uses, and depth, as well as the program monitoring the well, because most wells fall under the jurisdiction of different agencies. For the purpose of our study, we removed all columns except three: the geographical coordinates of the GW wells (latitude and longitude) and the subbasin classification of each well.
The dataset was cleaned to remove the duplicates. In addition, any well that did not have a subbasin name classification was removed. This data filtration ensured that no null values were encountered by the model during training, thus avoiding any interruption during code execution. After dataset cleaning, the data of 42,868 wells distributed across 287 unique subbasins remained. The subbasin names were then encoded into numbers instead of their original names as strings. This was done to simulate the output of the clustering algorithms, given that GeoZ was specifically designed to address this issue. Another reason for encoding the names was to allow the bazel_cluster function to work without facing any issues (details of the bazel_cluster function will be discussed later in this paper).
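The cleaning and encoding steps can be sketched with pandas as follows. The column names and subbasin names are placeholders chosen for illustration; the actual DWR file may use different headers.

```python
import pandas as pd

# Hypothetical column names and placeholder subbasin names.
wells = pd.DataFrame({
    "latitude": [38.0, 38.0, 36.5, 37.2, 35.9],
    "longitude": [-121.0, -121.0, -119.5, -120.3, -118.7],
    "subbasin": ["Subbasin C", "Subbasin C", "Subbasin B", None, "Subbasin A"],
})

cleaned = (
    wells.drop_duplicates()              # remove repeated records
         .dropna(subset=["subbasin"])    # drop wells without a subbasin name
         .reset_index(drop=True)
)

# Replace the subbasin names with integer codes, mimicking clustering output.
codes, names = pd.factorize(cleaned["subbasin"], sort=True)
cleaned["label"] = codes
print(cleaned)
```

Keeping `names` allows the integer codes to be mapped back to the original subbasin names when the final map is annotated.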
Methodology
Theoretical Concept
To construct a practical library, we identified our requirements and the limitations associated with such demands. The premise of the library began with the simple goal of delineating the boundaries around the available data points. Similar to any data-driven model, the accuracy of the model is directly related to the amount of data fed into it. The model input comprised two features and one label column. The two features were latitude and longitude in their actual values, without scaling or normalization. The labeled data were the groups or classes containing the data points. Using these settings allowed us to train the model to identify the regions surrounding each point cluster while exhibiting a limited amount of uncertainty regarding the extent of each cluster’s boundary, a feature that is intended to be controlled by the prospective model’s hyperparameters.
Training the ML model on raw geographical data restricted the feature space of the model to the GCS, so that the two overlap. Thus, we can directly project the model predictions onto any geographical map. However, introducing geographical data without scaling or normalizing them could reduce the model accuracy under certain circumstances wherein the variance between the points becomes significant (Ozsahin et al. 2022). Because such cases are rare in the spatial domain, these effects can be alleviated by keeping this risk in mind when using the model. As the model feature space and GCS overlap, the model output can be visualized on the surface representation of the Earth.
The next step is to find a model that is capable of predicting the data point classes with high precision based only on the provided latitude and longitude. Training the model on a reasonable amount of data allows it to create decision boundaries or zones (DZ) around each class. Thus, when a point’s latitude and longitude fall within the stated zone, the model classifies it as a class that covers that geographic boundary. Constricting the model input to latitude and longitude would project these DZ onto the real world. Thus, each cluster DZ would be drawn as the spatial boundary surrounding the cluster on the geographic map when visualized using a decision boundary plotting tool. Three decision boundary plotting tools with different properties were utilized in GeoZ and discussed in detail through the “Implementation: GeoZ Library” section. A workflow of the process is illustrated in Fig. 4.
We demonstrate the theoretical concept in Fig. 5, which shows the three layers. The top layer corresponds to the GW subbasins in California, which are color coded to facilitate differentiation. This layer represents our ground truth and the reference for accuracy verification. The second layer contains all the available data points (in blue) assigned to the GW subbasin location, which represent sample points used to train the ML model. The third layer shows the prediction model based on the data provided in the second layer. In this instance, the prediction is from K-means, one of the many ML models tested in this study, as described in the next section. The subbasin boundaries are shown in red, and the correspondence of the model input features to the latitude and longitude in the GCS is also indicated.
Because most of the models used in this study originated from the scikit-learn library (Pedregosa et al. 2011), we used its score function to measure the prediction (or drawing) accuracy of each classifier. The score function is part of the library's general API; therefore, it is inherited by most classifiers and, if required, modified to accommodate their nature. For classifiers, the function counts the correctly predicted values, divides the count by the total number of predictions, and returns the resulting fraction as the score.
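The accuracy computation can be checked by hand. The sketch below compares a manual calculation with the `score` method of a scikit-learn classifier; the k-nearest neighbors model and the toy coordinates are chosen only for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Toy coordinates with two well-separated classes.
coords = np.array([[0.0, 0.0], [0.1, 0.0], [0.05, 0.05],
                   [1.0, 1.0], [1.1, 1.0], [1.05, 1.05]])
labels = np.array([0, 0, 0, 1, 1, 1])

clf = KNeighborsClassifier(n_neighbors=1).fit(coords, labels)

manual = accuracy(labels, clf.predict(coords))
builtin = clf.score(coords, labels)  # mean accuracy, a value between 0 and 1
print(manual, builtin)
```

Both values agree because `score` for scikit-learn classifiers reports the mean accuracy on the data passed to it.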
In ML, datasets are usually divided into training and testing sets to evaluate the performance of the trained model. However, in our case, we fed the entire dataset into the model without holding out a test set. The reasoning behind our approach can be explained by the purpose of the trained model. In our use-case, the main goal was to delineate a boundary around the given points, similar to a convex hull (Barber et al. 1996), and not to predict the values of future points. Therefore, although we can use clustering algorithms to achieve comparable results, testing their accuracy will be difficult. Moreover, akin to clustering algorithms, the model will not be used to make any future predictions or extrapolations from the training set, as it only needs to follow the data and draw the boundaries.
Based on all the aforementioned conditions, we can even argue that overfitting the model to a certain degree is acceptable. However, including a regularization hyperparameter in the model would be advantageous, as it would allow us to control the model fitness without sacrificing any data to measure or increase its generalizability. Therefore, to measure the model performance, we fed the same dataset that was used to train them into the score function, and any misclassified points would decrease the accuracy score and damage the shape of the drawn decision boundary.
ML Models
By establishing the basics of our model requirements, we can iteratively elaborate on any extra additions to the requirement list as we compare the model outputs with the ground truth to address any flaws or shortcomings in its performance. Because we have two features and a label, the problem can be considered as a simple classification problem. Based on the “No Free Lunch” theorem (Wolpert and Macready 1997), we decided to try most of the ML classification models available in scikit-learn library, as well as some of their clustering algorithms.
As discussed in the “Introduction” section, the purpose of this library was to delineate GW subbasins based on the available well data. Based on the complexity of the GW subsurface structures, we knew beforehand that linear models would fail to follow their shape and would make predictions with low accuracy. However, we still included linear models in our experiment to observe how they would behave while trying to address the complex shapes of the subsurface, as this could provide some insights into properly tuning more advanced models while minimizing costs (time and resources). Another reason for the linear model implementation was their code simplicity, as the scikit-learn general API enabled the testing of various classification models with just a few extra lines of code; hence, exploring the problem from different angles would be worth the time it takes to write these codes.
The models tested in this research included three clustering algorithms: (1) K-means, (2) Gaussian mixture model (GMM), and (3) Bayesian GMM. Ten classification algorithms were used: (4) linear regression, (5) Bayesian ridge regression, (6) logistic regression, (7) artificial neural network (ANN), (8) k-nearest neighbors, (9) linear discriminant analysis, (10) histogram-based gradient boosting, (11) AdaBoost, (12) Gaussian Naive Bayes (NB), and (13) SVM. To optimize the models’ performance, we primarily worked with the default hyperparameters and only devoted a few hours to manual tuning when suboptimal performance was observed based on the given dataset. We set a maximum time limit of one working day for parameter optimization to identify which tools could effectively function with the data without requiring extensive hyperparameter tuning. This approach allowed us to implement the classifier with default settings within the mapping function, minimizing user interference or modification. The time limit also provided an opportunity to assess the ease of tuning the model hyperparameters and the level of knowledge and experience required to achieve satisfactory results.
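A comparison of this kind can be sketched with the common scikit-learn estimator interface. The example below is a reduced illustration on synthetic data with five of the classifiers named above at their default settings; it does not reproduce the experiment or the figures reported in Table 1.

```python
import time

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for well coordinates grouped into three clusters.
rng = np.random.default_rng(42)
centers = np.array([[-121.0, 38.0], [-120.0, 37.0], [-119.0, 36.0]])
coords = np.vstack([rng.normal(c, 0.05, size=(50, 2)) for c in centers])
labels = np.repeat(np.arange(3), 50)

candidates = {
    "logistic regression": LogisticRegression(),
    "k-nearest neighbors": KNeighborsClassifier(),
    "linear discriminant analysis": LinearDiscriminantAnalysis(),
    "Gaussian NB": GaussianNB(),
    "SVM": SVC(),
}

results = {}
for name, model in candidates.items():
    start = time.perf_counter()
    model.fit(coords, labels)
    # Scored on the training data itself, as no test set is held out.
    results[name] = {
        "accuracy": model.score(coords, labels),
        "seconds": time.perf_counter() - start,
    }

for name, row in results.items():
    print(f"{name}: accuracy={row['accuracy']:.3f}, time={row['seconds']:.3f}s")
```

Because every estimator exposes the same `fit` and `score` methods, adding another candidate requires only one more dictionary entry.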
Requirements
Most of the models we tested, except for the pure linear models, produced acceptable prediction accuracies; however, the resulting geographical maps did not accurately represent reality. This was mainly because of interpolation, extrapolation, or delineation errors based on the model functionality and working mechanism. As a result, we introduced more requirements in our list to address the shortcomings of the model, eventually reaching the following 6-clause list:
1. The model must avoid extrapolation as much as possible.
2. If extrapolation is required, it must be performed such that it does not affect the delineation of the subsurface boundaries or distort them.
3. The model must produce arbitrary shapes instead of only geometric ones; thus, it must achieve a complexity level that would allow it to delineate the subbasin boundaries as accurately as possible (this requirement disqualifies most linear models owing to their linear nature).
4. The model must be as simple as possible to save computational resources and time.
5. It should be stable and robust against outliers with low to moderate sensitivity.
6. It should be flexible (hyperparameter-wise) to allow for some control over any useful elements (including the uncertainty) in the produced maps.
This list is not exhaustive and can be amended in the future; hence, it should be considered as a guideline rather than a list of restrictive rules. This list was formulated based on our experiments with GW data, and it can require fewer or even more rules when attempting to model data from any other field. The assessment of item five was not feasible in the present research; therefore, its inclusion is based on the acknowledged characteristics of each method (e.g., Gaussian NB is acknowledged to be vulnerable to outliers, whereas tree-based methods are known for their robustness against them).
The experiments documented in this study were performed on a computation node in the high-performance computing (HPC) system at the UAE University. The node comprised a 36-core processor, 377 GB of RAM, and a Linux operating system. These were the initial conditions used for the experiments; however, after the creation of the GeoZ library and its publication in the Python Package Index (PyPI), it was used and tested on different systems, including a Windows system comprising a Core-i9 processor and 64 GB of RAM and a Linux system comprising a Core-i7 processor and 24 GB of RAM. We therefore expect that the initial high hardware requirements are not necessary to run the library, and that normal computer systems with an adequate amount of RAM can run it with no issues; however, the RAM requirement grows with the size of the dataset to be drawn, which must be considered.
Results and Discussion
In addition to the visual inspection of the produced maps to identify their flaws and inaccuracies, we also recorded their accuracy scores, which can indicate minor errors that might be difficult to detect through visual inspection. Based on the aforementioned tuning restrictions, most models achieved relatively accurate scores. The ANN implementation on scikit-learn was an exception. It performed poorly despite being one of the most complex models available (Ismailov 2023). Our efforts to tune its hyperparameters within the self-imposed time limits did not yield any noteworthy increase in accuracy. This made it reasonable to exclude ANN algorithms from the prospective classifiers as most of them require certain settings and hyperparameter tuning to adjust them according to the provided dataset (Gupta et al. 2021). Unlike normal scenarios, wherein the trained model is a part of the middle process and its inference ability is the final product, in our use-case, the trained model is the final product. Therefore, it is imperative to find a model that requires the least amount of user interference for training.
Table 1 shows the results of running each model on the dataset and the number of rules from the requirement list it was able to satisfy. We also noted the time required for each run, which included the times required to train the model and draw the final map. The only model that achieved a high accuracy score, passed the visual inspection, and satisfied all the requirements was the SVM, albeit with some limitations, which will be discussed later in detail along with our attempts to address them. However, because it met the minimum requirements, we were able to use it as the base model for the GeoZ library drawing mechanism; therefore, it can be considered for any kind of similar research or even in a production environment while bearing in mind the limitations imposed by its nature.
The results of the modeling experiments are shown in Fig. 6, and evidently, most linear models achieved low accuracies and showed linear structures in the map, which is far from reality. The accuracies of the clustering algorithms could not be measured using the score function; however, they showed a clear separation between the clusters and an excellent ability to follow the shape of the GW subbasins. Their disadvantage was that they use something akin to a linear extrapolation for any location outside the data conglomeration. This is particularly evident in the maps generated using K-means and k-nearest neighbor algorithms. The tree-based classifiers achieved very high accuracy; however, their dependence on geometrical shapes owing to their tree-based decision-making nature created unnatural maps that did not reflect the shapes of the GW subbasins, even if they were accurately predicted.
The GMM and Bayesian GMM are clustering algorithms that assume normally distributed samples; however, because labeled data cannot be provided to these algorithms owing to their cluster-based nature, several wells were inaccurately clustered. Nevertheless, the GMM showed good ability to follow the shape of the GW subbasins. Finally, the SVM (depicted individually in Fig. 7a) and the Gaussian NB achieved high accuracies, showed the best ability to follow the shapes of the GW subbasins, and restricted generalization to a small area around the data instead of extrapolating to infinity.
SVM
The key differences between the two best methods in this study, SVM and Gaussian NB, are that SVM achieved higher accuracy and showed good mapping flexibility owing to its hyperparameter options, which allowed us to control how far the model extrapolates the boundary around the data as well as how strongly the distant points of each cluster are interconnected. Moreover, the SVM algorithm differed from all the other tested algorithms in the manner in which it created the decision boundary for each cluster. SVM adopts an approach similar to the convex hull approach, but follows the data distribution more flexibly and closely. This creates clusters as closed-form boundaries surrounding the data, while leaving the remainder of the map empty. Most other models instead attempt to generalize each cluster in a certain manner based on their mathematical nature to prevent any emptiness in the feature space. This means that if the model is provided with features, it will predict a label, even if that label is wrong or does not have a logical connection to the provided data, as evident from the Gaussian NB generalization of the edges.
In contrast, SVM creates a simple boundary around each cluster and then generalizes all the empty space and background as one of the newly created clusters. This allows for accurate representations of all clusters except one, which we call the “Generalized Cluster” for the sake of practicality. Generally, the densest or most dispersed cluster is selected as the Generalized Cluster. The SVM classifier generalizes this cluster to the entire feature space outside the other clusters. When the decision boundary is drawn, the Generalized Cluster appears as the background, whereas the remaining clusters are surrounded by it, appearing as if they are floating on top of it. This can be clearly observed in Fig. 7a. Despite being one of the weaknesses of the SVM, it is also one of its strengths. By restricting the generalization issue to one cluster, we can create solutions to resolve the issue in that one cluster instead of trying to address each cluster generalization individually, as is the case with other classification methods. We attempted to address this issue by creating a function called “Bazel Cluster.”
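The background behavior described above can be observed on synthetic data. The sketch below assumes scikit-learn's `SVC` with an RBF kernel and probes a few locations far from every training point. With this kernel, the influence of each training point decays with distance, so far-away predictions no longer depend on location and all fall into a single class; which class that is depends on the fitted model.

```python
import numpy as np
from sklearn.svm import SVC

# Three compact synthetic clusters.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
coords = np.vstack([rng.normal(c, 0.05, size=(60, 2)) for c in centers])
labels = np.repeat(np.arange(3), 60)

clf = SVC(kernel="rbf", C=100, gamma=30).fit(coords, labels)

# Probe locations far away from every training point.
far_points = np.array([[10.0, 10.0], [-10.0, 5.0], [5.0, -10.0], [-8.0, -8.0]])
background = clf.predict(far_points)
print("class assigned to the empty space:", background)
```

All probe locations receive the same label, which corresponds to the single cluster that fills the map background.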
The SVM algorithm has two hyperparameters that can be modified to increase the model accuracy and control its behavior. The first is the C parameter, which controls the degree of influence each point has on the decision boundary of the classification as well as the importance of each point. Increasing C increases the importance of each point, which consequently increases the influence of each point on the location of the decision boundary, and decreasing it would result in the opposite. The gamma hyperparameter is the second parameter that controls the SVM behavior. It acts as a regularization parameter that prevents the model from overfitting. These two parameters have different effects on map creation.
Based on our experiments, the gamma hyperparameter functions as a controller for the uncertainty regions of the boundary. As a regularization parameter, it is inversely related to the buffer zone around the points of the cluster as well as the interconnectedness of the distant points within the same cluster/class. Decreasing the value of this parameter increases the buffer zone around the points, thereby increasing the uncertainty of the cluster boundaries compared to reality. Conversely, increasing its value decreases the buffer zone around the points and consequently decreases the uncertainty of the cluster’s boundaries compared to reality. The C parameter controls the outlier effect on the boundary location; increasing its value forces the model to consider each point, so that the boundary must follow the point locations. In contrast, decreasing its value smooths the boundary and allows the model to ignore some points as outliers or misclassification errors. To obtain an accurate representation of the data, it is imperative to substantially increase the C parameter and control the gamma hyperparameter to a certain degree, based on the dataset.
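The effect of gamma on the mapped area can be explored with a small sweep. The sketch below uses scikit-learn's `SVC` on synthetic clusters and reports, for each gamma value, the training accuracy and the share of a fixed map window assigned to each class. The gamma values, window, and data are assumptions made for illustration, and no particular outcome is implied for real datasets.

```python
import numpy as np
from sklearn.svm import SVC

# Three compact synthetic clusters.
rng = np.random.default_rng(7)
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
coords = np.vstack([rng.normal(c, 0.05, size=(60, 2)) for c in centers])
labels = np.repeat(np.arange(3), 60)

# Regular grid over a window wider than the data.
axis = np.linspace(-2.0, 3.0, 120)
xx, yy = np.meshgrid(axis, axis)
grid = np.column_stack([xx.ravel(), yy.ravel()])

summary = {}
for gamma in (1, 30, 1000):
    clf = SVC(kernel="rbf", C=100, gamma=gamma).fit(coords, labels)
    zones = clf.predict(grid)
    # Share of the window claimed by each class.
    shares = np.bincount(zones, minlength=3) / zones.size
    summary[gamma] = {
        "train_accuracy": clf.score(coords, labels),
        "area_share": shares,
    }

for gamma, row in summary.items():
    print(gamma, round(row["train_accuracy"], 3), np.round(row["area_share"], 3))
```

Comparing the area shares across gamma values gives a quick numerical view of how tightly the zones hug the points before any map is drawn.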
Bazel Cluster
The Bazel function is a Python function that was named after the English word “Bezel” to indicate its purpose, which involves acting as a bezel surrounding the cluster map. The “e” in “bezel” was replaced with an “a” to make calls to the function easier to identify inside the Jupyter environment, which significantly helped with bug fixing during development. The function was created solely to address the weakness of the SVM algorithm; hence, it would be unwise to utilize it elsewhere. The Bazel Cluster function receives the dataset intended for visualization and, based on the data coordinates, determines the edges of the data distribution and creates a bezel or a frame that envelops the data. The bezel is formed by generating new data points. By default, their number is equal to the number of points in the largest cluster in the dataset plus one extra point; however, the user can adjust the number of samples in the frame. The frame is located 1 standard deviation (SD) away from the map edges; its width is also 1 SD but can be adjusted according to the user input.
This function works by adding an extra cluster of points that is dispersed around the map and has more points than the largest cluster in the dataset. Consequently, it forces the model to consider it as a Generalized Cluster. Once the model considers it as a Generalized Cluster, the produced map contains all the actual clusters, whereas the Bazel Cluster disappears and acts as the map background (Fig. 7b). Unfortunately, this method is not always effective because the model sometimes selects one of the actual clusters as the background. However, in such cases, it is easy to detect the failure as the Bazel Cluster would appear as a clear frame surrounding the map; in this case, the user can adjust the Bazel Cluster parameters to increase the frame width or the number of samples to force the SVM classifier to consider the Bazel Cluster as a Generalized Cluster. This process can be iterated until the expected result is achieved. The Bazel Cluster function is disabled by default in GeoZ but can be enabled through the “bazel” parameter.
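The frame-generation idea can be sketched as follows. This is not the GeoZ implementation: the function name `make_frame_cluster`, its parameters, and the use of uniform rejection sampling are assumptions made for illustration, and the SD is interpreted per axis.

```python
import numpy as np

def make_frame_cluster(coords, labels, n_samples=None, offset_sd=1.0, width_sd=1.0, seed=0):
    """Generate points in a rectangular frame around the data, with a new label.

    Assumes non-negative integer labels and a nonzero spread along both axes.
    """
    rng = np.random.default_rng(seed)
    sd = coords.std(axis=0)
    # Inner edge of the frame: data extent pushed outward by offset_sd SDs.
    inner_min = coords.min(axis=0) - offset_sd * sd
    inner_max = coords.max(axis=0) + offset_sd * sd
    # Outer edge: a further width_sd SDs beyond the inner edge.
    outer_min = inner_min - width_sd * sd
    outer_max = inner_max + width_sd * sd
    if n_samples is None:
        # Default: one more point than the largest existing cluster.
        n_samples = int(np.bincount(labels).max()) + 1
    frame = np.empty((0, 2))
    while len(frame) < n_samples:
        candidates = rng.uniform(outer_min, outer_max, size=(4 * n_samples, 2))
        # Keep only candidates that fall outside the inner rectangle.
        inside_inner = np.all((candidates >= inner_min) & (candidates <= inner_max), axis=1)
        frame = np.vstack([frame, candidates[~inside_inner]])
    frame = frame[:n_samples]
    frame_labels = np.full(n_samples, labels.max() + 1)
    return frame, frame_labels

# Synthetic clusters of 40 and 25 points.
rng = np.random.default_rng(3)
coords = np.vstack([
    rng.normal([-121.0, 38.0], 0.1, size=(40, 2)),
    rng.normal([-120.0, 37.0], 0.1, size=(25, 2)),
])
labels = np.concatenate([np.zeros(40, dtype=int), np.ones(25, dtype=int)])

frame, frame_labels = make_frame_cluster(coords, labels)
augmented_coords = np.vstack([coords, frame])
augmented_labels = np.concatenate([labels, frame_labels])
```

The augmented arrays would then be passed to the classifier in place of the original data, so that the frame competes with the real clusters for the role of background.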
Limitations
Despite addressing the generalization issue of the SVM classifier, this implementation has several limitations. Some of the major ones are as follows:
- Hyperparameter tuning: in our experiments, the best visual and accurate results were obtained by setting the kernel to the radial basis function (RBF) and the gamma hyperparameter to 30; however, depending on the application or even the study area, the gamma value might have to be changed.
- To overlap the feature space and GCS, we must maintain the original latitude and longitude values; however, a significant difference in the latitude and longitude between points in the dataset would considerably weaken the model’s ability to correctly predict and draw the boundaries. Therefore, the map area must be considered while viewing the dataset.
- Another aspect of map size is related to the nature of GCS. Many ML algorithms require continuous data; however, the GCS is discontinuous, and this discontinuity would most probably not be considered by the ML models. Thus, a model will produce incorrect results and even non-existent coordinates when it infers the cluster boundaries. An SVM using the RBF kernel can predict nonlinear relations and is therefore expected to predict discontinuous spaces; however, its performance highly depends on the quality and quantity of the provided dataset (De Marchi et al. 2020). Hence, it is preferable that the user works with map areas that do not cross the discontinuities of the GCS boundaries at longitudes (180°, −180°) and latitudes (90°, −90°).
- The computation requirements (primarily the RAM) for running the library are directly related to the size of the input dataset, as the SVM classifier tries to load the entire dataset into RAM before processing it; thus, if the dataset is larger than the system RAM size, the algorithm will fail. Therefore, it would be wise to consider the specifications of the device running the library if the dataset is very large (> 100,000 records).
- The Bazel Cluster function sometimes requires several modifications to successfully force the classifier to consider it as the background. However, the method is not guaranteed to work; hence, it only represents a good prospect for enhancing the fundamental concept or operational mechanism of the function.
- The SVM classifier is robust against outliers; however, in our case, this ability can be considered a risk to the model delineation process. This is mainly because of our certainty that all data used in training are actual data and not outliers; hence, the removal of any data is detrimental to model accuracy. We have yet to find a solution to this limitation other than trying to manipulate the SVM hyperparameters.
- Finally, aside from the kernel, the SVM classifier has two hyperparameters (C and gamma), both of which affect the model results and the final map shape. Therefore, deciding the appropriate values for the SVM hyperparameters can be an issue, especially because they can differ depending on the field. In our experiments, the C hyperparameter did not have as significant an effect as gamma on the model classification or map shape; however, because we know that all points matter and are accurate, we elected to assign a value of 100 to the C parameter to force the model to consider all the points, which produced optimum results for our dataset. Regarding gamma, we discovered that the optimum value to represent GW subbasins is “30”; however, depending on the user preference in dealing with uncertainty, it can be increased up to “1000.”
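The longitude discontinuity listed among the limitations above can be illustrated with two points on either side of the 180° meridian. The helper names below are hypothetical. A model that treats longitude as an ordinary number sees the naive gap, whereas the true angular separation is the wrapped one.

```python
def naive_lon_gap(lon_a, lon_b):
    """Difference in longitude as seen by a model using raw coordinate values."""
    return abs(lon_a - lon_b)

def wrapped_lon_gap(lon_a, lon_b):
    """Smallest angular separation in degrees, accounting for the wrap at +/-180."""
    diff = abs(lon_a - lon_b) % 360.0
    return min(diff, 360.0 - diff)

# Two locations 0.2 degrees apart on the globe, on opposite sides of the antimeridian.
east, west = 179.9, -179.9
print(naive_lon_gap(east, west))    # close to 359.8
print(wrapped_lon_gap(east, west))  # close to 0.2
```

In raw feature space these neighboring points lie at opposite ends of the longitude axis, which is why map areas that cross the discontinuity are best avoided.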
Implementation: GeoZ Library
GeoZ is a Python library that was developed to implement the proposed theoretical approach. The library integrates several ML algorithms to create geographic maps from the output of unsupervised ML techniques, primarily clustering algorithms. It is written entirely in Python, is open source under a BSD 3-Clause license, and is published on PyPI. GeoZ contains three modules that perform the same task of creating geographic maps from the output of clustering algorithms, but with different approaches and different libraries. Note that Matplotlib is the backend drawing library used both in GeoZ and in most Python plotting libraries (Hunter 2007). Brief descriptions of the modules included in GeoZ and their purpose are provided in the following sections. The parameters of each module are detailed in the documentation of its functions; therefore, to avoid redundancy, we do not elaborate on them in this study.
Convex Hull Module
This module creates a convex hull for each set of points that belong to a distinct cluster using Shapely’s “convex_hull” operation (Gillies et al. 2022), which is iterated over the clusters to eventually draw a map that contains all the clustered data. The main advantage of this method is that it can reveal all evident overlaps in the clustering output; the other methods cannot derive overlapping regions owing to their underlying modeling algorithm (SVM). However, because a convex hull is a restricted geometric shape, it cannot accurately delineate the cluster regions and should not be used for that purpose. Because this method does not involve any ML algorithms, it executes quickly and is suitable, to a certain degree, for initial testing and prototyping of the clustering algorithm’s parameters.
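The per-cluster hull construction described above can be sketched with Shapely as follows. This is an illustrative sketch and not the GeoZ implementation: the points are synthetic, the two clusters are deliberately drawn from overlapping squares, and the variable names are hypothetical.

```python
import numpy as np
from shapely.geometry import MultiPoint

# Synthetic clustered points; two clusters are drawn from overlapping squares
# so that their hulls can overlap (illustrative data only).
rng = np.random.default_rng(1)
cluster_a = rng.random((30, 2))                          # unit square
cluster_b = rng.random((30, 2)) + np.array([0.5, 0.0])   # shifted right by 0.5
points = np.vstack([cluster_a, cluster_b])
labels = np.repeat([0, 1], 30)

# One convex hull per cluster, built by iterating over the cluster labels.
hulls = {}
for cluster_id in np.unique(labels):
    cluster_points = points[labels == cluster_id]
    hulls[int(cluster_id)] = MultiPoint(cluster_points.tolist()).convex_hull

# Overlap between clusters shows up as a non-empty intersection of hulls.
overlap_area = hulls[0].intersection(hulls[1]).area
print(f"hull areas: {hulls[0].area:.3f}, {hulls[1].area:.3f}")
print(f"overlap area: {overlap_area:.3f}")
```

Because no classifier is trained, the cost is dominated by the hull computation itself, which is why this route suits quick checks of clustering parameters.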
Decision Boundary Display Module
This module utilizes scikit-learn’s “DecisionBoundaryDisplay” class (Pedregosa et al. 2011) to derive a geographic map. It also utilizes the Geopandas library to draw points on a map (Jordahl et al. 2022). This method is advantageous in that it gives users considerable flexibility to modify and adjust the map according to their preference, more so than the other methods included in the GeoZ library. It is also well suited for prototyping and quick drafts because one of the options available to the user is to reduce the resolution, which makes it possible to produce more maps and variations in a short time.
Decision Region Module
This module uses MLxtend’s “plot_decision_regions” function (Raschka 2018) to draw the map. The advantage of this method is that it produces a high-resolution, detailed map in which the decision regions have different colors and the data points of different clusters have different symbols. This is a considerable advantage over the default color schemes used in scikit-learn: as the number of clusters increases, the plotting routine runs out of distinct colors and is forced to cycle through the same set, which confuses end users. Adding symbols to differentiate between clusters, in addition to the colors of the different regions, is therefore a significant advantage. However, the high resolution of the output limits the usage of this method because it requires a significant amount of time to draw the maps, which is a disadvantage during prototyping. This plotting method should therefore be reserved for creating the final exported map. The end result and the effect of the library are demonstrated in a side-by-side comparison (Fig. 8) of the actual subbasins drawn with the classical method for cluster results and the map produced by the proposed method using MLxtend’s “plot_decision_regions” function.
Conclusions and Future Work
Our approach achieved 99.1% accuracy in delineating the GW subbasins of California using a trained ML classification model and data from the GWDB. SVM was the only one of the 13 tested ML models that fulfilled all the requirements for serving as the base model of the mapping library. We also highlighted the limitations that restrict the use of the model. Furthermore, we attempted to address the “Generalized Cluster” issue by creating a function called “bazel_cluster,” which has a high success rate in addressing the SVM limitation and provides clear signs when it fails. We implemented three mapping modules inside GeoZ to address the various expected use cases of the library. GeoZ is available on PyPI, and its source code is available on GitHub (ElHaj 2023).
The library is being used in most of our ongoing work to delineate GW subbasins and aquifers. However, there is still room for improvement. One improvement would be the inclusion of a third dimension to illustrate depth along with the length and width of our areas of interest. Matplotlib, the backend drawing library used in GeoZ, includes 3D capabilities; therefore, this is theoretically possible. However, the decision zone drawing libraries can only draw the model inference in two dimensions. As a result, creating a 3D representation of our clusters would require either building a decision boundary/zone mapping library from the ground up or extending one of the established libraries used in GeoZ to accommodate 3D capabilities.
In addition to the geosciences, GeoZ can be used in other fields that use unsupervised clustering to create decision zones. A good example is crime analysis, in which substantial research employs clustering algorithms to determine the degree of risk for each geographic region. However, the resulting maps can only color the data points based on their classification and display them as scatter points on the map to indicate the regions, without establishing their boundaries or zones. Even when boundaries are determined, they are mostly predefined regions based on administrative or natural boundaries, unlike in GeoZ, wherein the boundaries are dynamic, clearly defined, and determined by the distribution of each cluster’s samples. To the best of our knowledge, there are currently no other methods or libraries that offer the capabilities of GeoZ. Therefore, we hope that our study will make a significant contribution to the fields of GIS and ML.
Data Availability
The GW dataset used throughout this study is publicly available at https://data.cnra.ca.gov/dataset/periodic-groundwater-level-measurements/resource/af157380-fb42-4abf-b72a-6f9f98868077.
Code Availability
The code created during this work is open source and can be accessed on GitHub at https://github.com/Ne-oL/geoz.
References
Barber CB, Dobkin DP, Huhdanpaa H (1996) The quickhull algorithm for convex hulls. ACM Trans Math Softw 22(4):469–483. https://doi.org/10.1145/235815.235821
California Department of Water Resources (DWR) (2021) “California’s groundwater update 2020 (bulletin 118).” The California Department of Water Resources 485. Retrieved from https://data.cnra.ca.gov/dataset/calgw_update2020. Accessed 11 Jan 2023
California Natural Resources Agency (2021) “Periodic groundwater level measurements - datasets - California Natural Resources Agency Open Data.” Retrieved from https://data.cnra.ca.gov/dataset/periodic-groundwater-level-measurements/resource/af157380-fb42-4abf-b72a-6f9f98868077. Accessed 1 Mar 2022
Carle D (2015) Introduction to water in California. University of California Press, Berkeley. https://doi.org/10.1525/9780520962897
De Marchi S, Marchetti F, Perracchione E (2020) Jumping with variably scaled discontinuous kernels (VSDKs). BIT Numer Math 60(2):441–463. https://doi.org/10.1007/s10543-019-00786-z
ElHaj K (2023) GeoZ: geographic decision zones. GitHub Repository. Retrieved from https://zenodo.org/record/7524946. Accessed 11 Jan 2023
ESRI (2013) Map services - world topographic map. Retrieved from http://www.esri.com/software/arcgis/arcgisonline/services/map-services. Accessed 30 Jan 2023
Gillies S, van der Wel C, Van den Bossche J, Taves MW, Arnott J, Ward BC et al (2022). Shapely. https://doi.org/10.5281/zenodo.7583915
Gupta M, Rajnish K, Bhattacharjee V (2021) “Impact of parameter tuning for optimizing deep neural network models for predicting software faults” edited by J Gou. Sci Program 2021:1–17. https://doi.org/10.1155/2021/6662932
Hunter JD (2007) Matplotlib: a 2D graphics environment. Comput Sci Eng 9(3):90–95. https://doi.org/10.1109/MCSE.2007.55
Ismailov VE (2023) A three layer neural network can represent any multivariate function. J Math Anal Appl 523(1):127096. https://doi.org/10.1016/j.jmaa.2023.127096
Jordahl K, Van den Bossche J, Fleischmann M, McBride J, Wasserman J, Richards M, Badaracco AG et al (2022) Geopandas/Geopandas: V0.12.2. Zenodo. https://doi.org/10.5281/zenodo.7422493
Köplin N, Schädler B, Viviroli D, Weingartner R (2013) The importance of glacier and forest change in hydrological climate-impact studies. Hydrol Earth Syst Sci 17(2):619–635. https://doi.org/10.5194/hess-17-619-2013
Li Y, Sun Q, Ji X, Li Xu, Chuanwei Lu, Zhao Y (2020) Defining the boundaries of urban built-up area based on taxi trajectories: a case study of Beijing. J Geovisualization Spat Anal 4(1):8. https://doi.org/10.1007/s41651-020-00047-6
Liu Q, Tang J, Deng M, Shi Y (2015) An iterative detection and removal method for detecting spatial clusters of different densities. Trans GIS 19(1):82–106. https://doi.org/10.1111/tgis.12083
Lombardo R, Falcone M (2011) Crime and economic performance. A cluster analysis of panel data on Italy’s nuts 3 regions, pp 0–33. https://econpapers.repec.org/RePEc:clb:wpaper:201112
Mburu E, Mutua F (2023) Investigating the influence of land use and alcohol outlet density on crime in Juja sub-county, Kenya. J Geovisualization Spat Anal 7(1):10. https://doi.org/10.1007/s41651-023-00141-5
Ohba M, Kadokura S, Nohara D (2016) Impacts of synoptic circulation patterns on wind power ramp events in East Japan. Renew Energy 96:591–602. https://doi.org/10.1016/j.renene.2016.05.032
Ozsahin DU, Mustapha MT, Mubarak AS, Said Ameen Z, Uzun B (2022) Impact of feature scaling on machine learning models for the diagnosis of diabetes. In: 2022 International Conference on Artificial Intelligence in Everything (AIE), Lefkosa, Cyprus, 87–94. https://doi.org/10.1109/AIE57029.2022.00024
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer PA, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
Prothero DR (2017) California’s amazing geology. CRC Press
Raschka S (2018) MLxtend: providing machine learning and data science utilities and extensions to Python’s scientific computing stack. J Open Source Softw 3(24):638. https://doi.org/10.21105/joss.00638
Singh SK, Lo E-M, Qin X (2017) Cluster analysis of monthly precipitation over the western maritime continent under climate change. Climate 5(4):84. https://doi.org/10.3390/cli5040084
Subba Rao N, Chaudhary M (2019) Hydrogeochemical processes regulating the spatial distribution of groundwater contamination, using pollution index of groundwater (PIG) and hierarchical cluster analysis (HCA): a case study. Groundw Sustain Dev 9:100238. https://doi.org/10.1016/j.gsd.2019.100238
Sun Q, Little CM, Barthel AM, Padman L (2021) A clustering-based approach to ocean model–data comparison around Antarctica. Ocean Sci 17(1):131–145. https://doi.org/10.5194/os-17-131-2021
Wichmann D, Kehl C, Dijkstra HA, van Sebille E (2020) Detecting flow features in scarce trajectory data using networks derived from symbolic itineraries: an application to surface drifters in the North Atlantic. Nonlinear Process Geophys 27(4):501–518. https://doi.org/10.5194/npg-27-501-2020
Wolpert DH, Macready WG (1997) No free lunch theorems for optimization. IEEE Trans Evol Comput 1(1):67–82. https://doi.org/10.1109/4235.585893
Funding
This study was funded by the Research Affairs Office, UAE University (Fund No. 31S445).
Ethics declarations
Ethics Approval
The authors have obtained all ethical approvals required for this paper and declare that they abide by all academic ethical standards.
Informed Consent
All the authors who made contributions to this paper are included and aware of the content of this paper. They also agree to submit this paper to the Journal of Geovisualization and Spatial Analysis.
Conflict of Interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
ElHaj, K., Alshamsi, D. & Aldahan, A. GeoZ: a Region-Based Visualization of Clustering Algorithms. J geovis spat anal 7, 15 (2023). https://doi.org/10.1007/s41651-023-00146-0
DOI: https://doi.org/10.1007/s41651-023-00146-0