Abstract
The growth of databases in the healthcare domain opens multiple doors for machine learning and artificial intelligence technology. Although many medical devices are available, medical errors remain a severe challenge. Various algorithms have been developed to identify and address medical errors, for example by detecting anomalous readings or anomalous health conditions of a patient. However, they fail to answer why those entries are considered anomalies. This research gap leads to the outlying aspect mining problem, which aims to discover the set of features (a.k.a. subspace) in which a given data point is markedly different from the others. In this paper, we present a framework that detects anomalies in healthcare data and then explains those anomalies: it aims to effectively and efficiently detect anomalies and explain why they are considered anomalies by identifying their outlying aspects. First, we revisit four anomaly detection techniques and four outlying aspect mining algorithms. We then evaluate the anomaly detection techniques and choose the best-performing one. Next, we use each of the top k detected anomalies as a query and identify its outlying aspects. Lastly, we evaluate performance on 16 real-world healthcare datasets. The experimental results show that the latest isolation-based outlying aspect mining measure, SiNNE, delivers outstanding and promising performance on this task.
Introduction
Despite improvements in healthcare instruments, medical errors remain a severe challenge [1]. Applying machine learning (ML) and artificial intelligence (AI) algorithms in the healthcare industry helps improve patients’ health more efficiently. According to [2], around 86% of healthcare companies use machine learning and artificial intelligence algorithms. These algorithms help in many ways, such as medical image diagnosis [3, 4], disease detection/classification [5,6,7], medical data analysis [8], medical data classification [9, 10], drug discovery [8], robotic surgery [8], anomalous reading detection [11], etc. Recently, researchers have become interested in detecting abnormal activity in the healthcare industry. An anomaly or outlierFootnote 1 is defined as a data instance that does not conform with the remainder of the set of data instances. In the healthcare domain, an anomaly refers to an unusual health condition or activity of a patient [12, 13]. A vast number of applications have been developed to detect anomalies in medical data [14,15,16,17]. However, to the best of our knowledge, no study has been conducted to find out why these points are considered anomalies, i.e., on which set of features a data point is markedly different from the others. The problem of finding such an explanation is known as outlying aspect mining (a.k.a. outlier explanation, outlier interpretation, or outlying subspace detection). Outlying aspect mining aims to identify the set of features in which a given point (or a given anomaly) is most inconsistent with the rest of the data.
In many healthcare applications, a medical officer wants to know the most outlying aspects of a specific patient compared to other patients. For example, suppose you are a doctor treating patients with diabetes, and while treating a particular patient, you want to know in which aspects this patient differs from the others. Consider the Pima Indian diabetes data set.Footnote 2 For ‘Patient A’, the most outlying aspect is having the highest number of pregnancies together with a low diabetes pedigree function (see Fig. 1), compared to other subspaces.
Another example is a medical insurance analyst who wants to know in which aspects a given insurance claim is most unusual. These applications differ from anomaly detection: instead of searching the whole data set for anomalies, in outlying aspect mining we are specifically interested in a given data instance and aim to find the aspects in which it stands out. Such a data instance is called a query \(\textbf{q}\).
These interesting applications of outlying aspect mining in the medical domain motivated this paper. We first introduce four anomaly detection techniques and four outlying aspect mining methods, and then evaluate their performance on 16 healthcare datasets. To the best of our knowledge, this is the first time these algorithms have been applied to healthcare data. Our results verify their performance on anomaly detection and outlying aspect mining tasks and show that isolation-based algorithms are promising: iForest performs well for anomaly detection and SiNNE performs well for outlying aspect mining.
The rest of the paper is organized as follows. Section 2 summarizes the principle and working mechanism of four outlying aspect mining algorithms and anomaly detection algorithms. Next, the experimental setup and results are summarized in Sects. 3 and 4, respectively. Finally, we conclude the paper in Sect. 5.
Existing methods
Before describing different outlying aspect mining algorithms, we first provide the problem formulation.
Basic notations and definitions
Definition 1
(Problem definition) Given a set of n instances \({\mathcal {X}}\) (\(\Vert {\mathcal {X}}\Vert\) = n) in d dimensional space, a data point \(\textbf{q} \in {\mathcal {X}}\) is called an anomaly iff,
-
\({{\textbf {q}}}\) dramatically differs from others in full feature space.
and a subspace S is called outlying aspect of \(\textbf{q}\) iff,
-
the outlyingness of \(\textbf{q}\) in subspace S is higher than in other subspaces, and no other subspace has the same or higher outlyingness.
Outlying aspect mining algorithms require (i) a scoring measure to compute the outlyingness of the query in a subspace and (ii) a search method to find the most outlying subspace. In the rest of this section, we review scoring measures only. For the search part, we use the Beam [18] search method, as it is the latest search method and has been used in several studies [18,19,20,21,22,23]. The flowchart of the complete process is presented in Fig. 2.
Existing anomaly detection scoring measures
LOF
The core idea of density-based anomaly detection is that the density of an anomalous object differs significantly from that of normal instances. LOF (Local Outlier Factor), introduced by [24], was the first local density-based approach and remains the most widely used one. For any data object, the LOF score is the ratio of the average local density of its k-nearest neighbours to its own local density [25]. The LOF score of data object \({{\textbf {q}}}\) is defined as follows:
where \(lrd({{\textbf {q}}}) = \frac{\Vert N^k ({{\textbf {q}}})\Vert }{\sum \limits _{x \in N^k({{\textbf {q}}})} max(dist^k(x,{\mathcal {X}}),dist({{\textbf {q}}},x))}\), \(N^k({{\textbf {q}}})\) is the set of k-nearest neighbours of \({{\textbf {q}}}\), \(dist({{\textbf {q}}},x)\) is the distance between \({{\textbf {q}}}\) and x, and \(dist^k({{\textbf {q}}},{\mathcal {X}})\) is the distance between \({{\textbf {q}}}\) and its k-th nearest neighbour in \({\mathcal {X}}\). The LOF score reflects the sparseness of the data object’s neighbourhood; data objects with higher LOF values are considered anomalies.
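For illustration, the LOF computation above can be sketched in a few lines of plain Python. This is our own minimal sketch (function names and the brute-force neighbour search are ours), not the PyOD implementation used in our experiments:

```python
import math

def knn(data, q, k):
    """k nearest neighbours of q among data (excluding q itself)."""
    return sorted((x for x in data if x != q), key=lambda x: math.dist(q, x))[:k]

def lrd(data, p, k):
    """Local reachability density: |N_k(p)| / sum of reachability distances."""
    nbrs = knn(data, p, k)
    reach = sum(max(math.dist(x, knn(data, x, k)[-1]),  # k-distance of x
                    math.dist(p, x)) for x in nbrs)
    return len(nbrs) / reach

def lof(data, q, k=10):
    """Ratio of the average lrd of q's neighbours to q's own lrd."""
    nbrs = knn(data, q, k)
    return sum(lrd(data, x, k) for x in nbrs) / (len(nbrs) * lrd(data, q, k))
```

A point inside a dense cluster gets a score near 1, while an isolated point scores well above 1.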
iForest
Liu et al. [26] presented a framework called Isolation Forest or iForest, which isolates each data point by axis-parallel partitioning of the attribute space. To the best of our knowledge, iForest is the first technique that uses an isolation mechanism to detect anomalies.
iForest builds an ensemble of trees called isolation trees (iTrees). Each iTree is built from a sub-sample randomly selected without replacement from the data set. At each node, a random attribute is selected and a random split value is drawn from that attribute’s range. Partitioning terminates once every node contains only one data object or the tree reaches its height limit. The anomaly score for \({{\textbf {q}}} \in {\mathcal {R}}^d\) based on iForest is defined as:
where \(l_i({{\textbf {q}}})\) is the path length of \({{\textbf {q}}}\) in tree \(T_i\).
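A compact sketch of this procedure in plain Python is given below. It is illustrative only: the tree representation and parameter choices are ours, and we use the standard normalised score \(2^{-\bar{l}({\textbf {q}})/c(\psi )}\) and the adjustment constant c(·) from the original iForest paper [26]:

```python
import math, random

EULER_GAMMA = 0.5772156649

def c(n):
    """Average path length of an unsuccessful BST search; normalises path lengths."""
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_itree(sample, depth, max_depth):
    """Recursively split the sample with random axis-parallel cuts."""
    if depth >= max_depth or len(sample) <= 1:
        return len(sample)                      # leaf: remember how many points remain
    attr = random.randrange(len(sample[0]))     # random attribute
    lo, hi = min(x[attr] for x in sample), max(x[attr] for x in sample)
    if lo == hi:
        return len(sample)
    split = random.uniform(lo, hi)              # random split value in the range
    left = [x for x in sample if x[attr] < split]
    right = [x for x in sample if x[attr] >= split]
    return (attr, split,
            build_itree(left, depth + 1, max_depth),
            build_itree(right, depth + 1, max_depth))

def path_length(node, q, depth=0):
    if not isinstance(node, tuple):
        return depth + c(node)                  # adjust unsplit leaves by expected depth
    attr, split, left, right = node
    return path_length(left if q[attr] < split else right, q, depth + 1)

def iforest_score(data, q, t=100, psi=256):
    """Normalised iForest anomaly score in (0, 1]; higher = more anomalous."""
    total = 0.0
    for _ in range(t):
        sample = random.sample(data, min(psi, len(data)))
        tree = build_itree(sample, 0, math.ceil(math.log2(len(sample))))
        total += path_length(tree, q)
    return 2 ** (-(total / t) / c(psi))
```

Anomalies sit in sparse regions and are separated by few random cuts, so their average path length is short and their score is higher than that of cluster points.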
Sp
Rather than searching for the k-nearest neighbours in the whole data set, Sp [27] scores a point by its distance to its nearest neighbour (k = 1) in a small random sub-sample (\({\mathcal {S}} \subset {\mathcal {X}}\)). The Sp score of data object \({{\textbf {q}}}\) is defined as follows:
where \(dist({{\textbf {q}}},x)\) is a distance between \({{\textbf {q}}}\) and x.
In [27], the authors showed that Sp performs better than the state-of-the-art anomaly detector LOF while also running faster.
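The simplicity of Sp is evident from a sketch (our own illustrative version; \(\psi\) follows the default used later in our experiments):

```python
import math, random

def sp_score(data, q, psi=20):
    """Sp: distance from q to its nearest neighbour in one random sub-sample."""
    sample = random.sample(data, min(psi, len(data)))
    return min(math.dist(q, x) for x in sample)
```

A single pass over \(\psi\) points suffices, which explains why Sp runs so much faster than LOF's full k-NN search.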
iNNE
Bandaragoda et al. [28] proposed iNNE, which stands for isolation using Nearest Neighbour Ensembles. The core idea behind iNNE is that an anomaly is far from its nearest neighbour, whereas the inverse is true for a regular object. iNNE’s design is influenced by iForest and LOF. The critical difference between iNNE and iForest is that iForest builds trees on subspaces, while iNNE builds hyperspheres using all dimensions. The isolation score of \({{\textbf {q}}}\) is defined as follows:
where \(cnn({{\textbf {q}}}) = \displaystyle \mathop {\mathrm {arg\,min}}\limits _{c \in S} \{ \tau (c): {{\textbf {q}}} \in {\mathcal {B}}(c) \}\), \({\mathcal {S}}\) is set of randomly selected sub-samples, \(\Vert {\mathcal {S}}\Vert = \psi\), \({\mathcal {B}}(c)\) is a hypersphere centered at c with radius \(\tau (c) = || c - \eta _c ||\), where \(\eta _c\) is nearest neighbour of c. The anomaly score for data object \({{\textbf {q}}}\) is defined as:
where \(I_i({{\textbf {q}}})\) is the isolation score based on the sub-sample in the \(i^{th}\) set.
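One reading of this score can be sketched in plain Python as follows. This is a simplified illustration under our own assumptions: radii are recomputed naively per model, and the per-model score \(1 - \tau (\eta _{cnn})/\tau (cnn)\) follows the iNNE paper [28], with a score of 1 when \({{\textbf {q}}}\) falls outside every hypersphere:

```python
import math, random

def inne_score(data, q, t=100, psi=8):
    """Average isolation score of q over t sets of hyperspheres (a sketch of iNNE)."""
    scores = []
    for _ in range(t):
        sample = random.sample(data, min(psi, len(data)))
        # radius of each hypersphere = distance from its centre to its nearest neighbour
        radius = {c: min(math.dist(c, x) for x in sample if x != c) for c in sample}
        covering = [c for c in sample if math.dist(q, c) <= radius[c]]
        if not covering:
            scores.append(1.0)          # q falls outside every ball: fully isolated
        else:
            c = min(covering, key=lambda b: radius[b])   # cnn(q): smallest covering ball
            eta = min((x for x in sample if x != c),
                      key=lambda x: math.dist(c, x))     # nearest neighbour of cnn(q)
            scores.append(1.0 - radius[eta] / radius[c])
    return sum(scores) / t
```

A far-away anomaly is never covered by any ball and scores 1, while normal points fall inside small hyperspheres and receive lower scores.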
Outlying aspect mining algorithms
OAMiner
Duan et al. [29] introduced Outlying Aspect Miner (OAMiner for short), which uses a kernel density estimation (KDE) [30] based scoring measure to compute the outlyingness of query \(\textbf{q}\) in subspace S:
where \(f_S(\textbf{q})\) is a kernel density estimation of \(\textbf{q}\) in subspace S, m is the dimensionality of subspace S (\(|S|=m\)), \(h_{i}\) is the kernel bandwidth in dimension i.
Duan et al. [29] observed that density is biased towards high-dimensional subspaces—density tends to decrease as dimensionality increases. Thus, to remove this dimensionality bias, they proposed using the query’s density rank as the measure of outlyingness. To find the most outlying subspace of the query, the density of every data point must be computed in each subspace, and the subspace in which the query has the best rank is selected as its outlying aspect.
OAMiner systematically enumerates all possible subspaces using the set enumeration tree approach [31], which is widely used in the data mining community, traversing it in a depth-first manner [32]. OAMiner uses an anti-monotonicity property to prune subspaces: given a data set \({\mathcal {O}}\), a query object \(\textbf{q}\) and a subspace S, if \(rank(f_{S}(\textbf{q})) = 1\), then no super-set of S can be a minimal outlying subspace, and all super-sets of S can therefore be pruned.
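The density rank idea can be sketched in plain Python as follows. This is an illustrative sketch only (the bandwidth, the brute-force ranking, and the product Gaussian kernel form are our assumptions; OAMiner's actual pruning and enumeration are omitted):

```python
import math

def kde_density(data, q, subspace, h=1.0):
    """Product Gaussian kernel density estimate of q in the given subspace."""
    n, m = len(data), len(subspace)
    norm = n * (h * math.sqrt(2 * math.pi)) ** m
    return sum(
        math.exp(-sum(((q[i] - x[i]) / h) ** 2 for i in subspace) / 2)
        for x in data
    ) / norm

def density_rank(data, q, subspace, h=1.0):
    """Rank of q's density among all points; rank 1 = most outlying (OAMiner)."""
    dq = kde_density(data, q, subspace, h)
    return 1 + sum(1 for x in data if kde_density(data, x, subspace, h) < dq)
```

Because the rank, unlike the raw density, does not shrink systematically with dimensionality, it can be compared across subspaces of different sizes.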
Beam
Vinh et al. [18] formalized the concept of dimensionality unbiasedness and investigated dimensionally unbiased scoring functions. Dimensionality unbiasedness is an essential property for outlyingness measures because the query object is compared across subspaces with different numbers of dimensions. They proposed two novel outlying scoring metrics: (1) the density Z-score and (2) the isolation path score (iPath for short), and showed that both are dimensionally unbiased.
Therein, the density Z-score is defined as follows:
where \(\mu _{f_S}\) and \(\sigma _{f_S}\) are the mean and standard deviation of the density of all data instances in subspace S, respectively.
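A one-dimensional sketch of the density Z-score in plain Python is shown below (our own illustrative version: the naive Gaussian KDE, the fixed bandwidth, and the 1-D restriction are assumptions made for brevity):

```python
import math, statistics

def kde(data, q, h=0.5):
    """Naive 1-D Gaussian kernel density estimate at q."""
    n = len(data)
    return sum(math.exp(-((q - x) / h) ** 2 / 2) for x in data) \
        / (n * h * math.sqrt(2 * math.pi))

def density_zscore(data, q, h=0.5):
    """(f(q) - mean density) / std of densities; very negative = outlying."""
    dens = [kde(data, x, h) for x in data]
    mu, sigma = statistics.mean(dens), statistics.pstdev(dens)
    return (kde(data, q, h) - mu) / sigma
```

Standardising the density by the per-subspace mean and standard deviation is what makes the score comparable across subspaces of different dimensionality.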
The iPath score is motivated by the isolation forest (iForest) anomaly detection approach [26], whose intuition is that anomalies are few and susceptible to isolation. iForest constructs t trees, each built from a randomly selected sub-sample of size \(\psi\) (\(\psi \ll n\)) that is recursively divided using axis-parallel random splits. Since in the outlying aspect mining context the main focus is the path length of the query, the authors ignore the other parts of the tree. The intuition behind the iPath score is that, in the most outlying subspace, a given query is easier to isolate than the rest of the data.
The process of calculating the iPath of query \(\textbf{q}\) w.r.t. sub-samples \(\psi\) of the data is
where \(l_S^i(\textbf{q})\) is path length of \(\textbf{q}\) in \(i^{th}\) tree and subspace S.
Vinh et al. [18] were the first to coin the term dimensionality unbiasedness.
Definition 2
(Dimensionality unbiased [18]) A dimensionality unbiased outlyingness measure (OM) is a measure of which the baseline value, i.e., average value for any data sample \({\mathcal {O}} = \{o_1, o_2, \cdots , o_n \}\) drawn from a uniform distribution, is a quantity independent of the dimension of the subspace S, i.e.,
In [18, Theorem 3], it is proven that rank transformation and Z-score normalization result in a constant average value for any data distribution. Furthermore, it is worth noting that for the Z-score not only the mean but also the variance of the normalized measure is constant with respect to dimensionality.
The overall beam search process is divided into three stages. In the first stage, all 1-D subspaces are inspected to identify trivially outlying features. In the second stage, an exhaustive search is performed over all possible two-dimensional subspaces. In the third stage, beam search is applied from level 3 onwards, keeping only the top W subspaces (the beam width) at each level. The total number of subspaces considered by the beam algorithm is in the order of \(O(d^2 + W \ d_{max})\), where \(d_{max}\) is the maximum subspace dimension and W is the beam width.
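The three stages can be sketched as follows (an illustrative sketch under our own assumptions: `score` stands for any outlyingness measure of the fixed query in a subspace, with higher meaning more outlying):

```python
from itertools import combinations

def beam_search(dims, score, width=100, max_dim=3):
    """Three-stage beam search over subspaces; returns the best-scoring subspace."""
    best = max(((d,) for d in dims), key=score)          # stage 1: all 1-D subspaces
    pairs = sorted(combinations(dims, 2), key=score, reverse=True)  # stage 2: all 2-D
    if score(pairs[0]) > score(best):
        best = pairs[0]
    beam = pairs[:width]                                 # keep only top-W subspaces
    for level in range(3, max_dim + 1):                  # stage 3: beam from level 3 on
        candidates = {tuple(sorted(set(s) | {d}))
                      for s in beam for d in dims if d not in s}
        beam = sorted(candidates, key=score, reverse=True)[:width]
        if beam and score(beam[0]) > score(best):
            best = beam[0]
    return best
```

Restricting each level to the W best candidates is what keeps the number of evaluated subspaces in \(O(d^2 + W \, d_{max})\) rather than exponential in d.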
sGrid
Wells and Ting [23] introduced a simple grid-based density estimator called sGrid, a smoothed variant of the classic grid-based density estimator [30]. Let \({\mathcal {O}}\) be a collection of n data objects in D-dimensional space and x.S be the projection of a data object \(x \in {\mathcal {O}}\) onto subspace S. The sGrid density of a point \(\textbf{q}\) is computed from the points that fall in the bin covering \(\textbf{q}\) and its surrounding bins.
By replacing the kernel density estimator with sGrid, OAMiner [29] and Beam [18] run two orders of magnitude faster than their original implementations. However, sGrid is not a dimensionally unbiased measure and therefore requires Z-score normalization, which reduces its computational advantage.
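The smoothing idea can be sketched in one dimension as follows (an illustrative sketch only: the real estimator works on multi-dimensional bins with fast bit-set operations, and the bin count here is our own choice):

```python
def sgrid_density(data, q, bins=8):
    """1-D sketch of sGrid: mass of q's bin plus its immediate neighbour bins."""
    lo, hi = min(data), max(data)
    w = (hi - lo) / bins or 1.0                 # bin width (guard against hi == lo)
    def bin_of(v):
        return min(int((v - lo) / w), bins - 1)
    b = bin_of(q)
    mass = sum(1 for x in data if abs(bin_of(x) - b) <= 1)
    return mass / (len(data) * w)
```

Counting the neighbouring bins as well as the query's own bin is the "smoothing" that distinguishes sGrid from a plain histogram estimator.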
SiNNE
Very recently, [21] proposed the Simple Isolation score using Nearest Neighbour Ensemble (SiNNE for short), a measure derived from the Isolation using Nearest Neighbour Ensembles (iNNE) anomaly detection method [28]. SiNNE constructs an ensemble of t models (\({\mathcal {M}}_1, {\mathcal {M}}_2, \cdots , {\mathcal {M}}_t\)). Each model \({\mathcal {M}}_i\) is constructed from a randomly chosen sub-sample (\({\mathcal {D}}_i \subset {\mathcal {O}}, \Vert {\mathcal {D}}_i\Vert = \psi < n\)) and consists of \(\psi\) hyperspheres, where the radius of the hypersphere centred at \(a \in {\mathcal {D}}_i\) is the Euclidean distance from a to its nearest neighbour in \({\mathcal {D}}_i\).
The outlying score of \(\textbf{q}\) in model \({\mathcal {M}}_i\), \(I(q\Vert {\mathcal {M}}_i)\), is 0 if \(\textbf{q}\) falls in any of the balls and 1 otherwise. The final outlying score of \(\textbf{q}\) using the t models is:
In their work, the authors argue that Z-score normalization is biased towards subspaces with high density variance and that the definition of dimensionality unbiasedness needs to be revised. Furthermore, SiNNE is computationally faster than density- and distance-based measures.
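Since the per-model score is binary, SiNNE reduces to the fraction of models in which the query falls outside every hypersphere. A plain-Python sketch (illustrative only; radii are recomputed naively per model, and the defaults match the parameters listed in our experimental setup):

```python
import math, random

def sinne_score(data, q, t=100, psi=8):
    """SiNNE sketch: fraction of models in which q is outside every hypersphere."""
    outside = 0
    for _ in range(t):
        sample = random.sample(data, min(psi, len(data)))
        covered = any(
            math.dist(q, c) <= min(math.dist(c, x) for x in sample if x != c)
            for c in sample          # ball at c with radius = c's NN distance
        )
        outside += 0 if covered else 1
    return outside / t
```

The binary per-model score is what makes SiNNE dimensionally unbiased without any rank or Z-score normalization, and hence cheap to compute.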
Experimental setup
Datasets
In this study, we used 16 publicly available benchmark medical datasets for anomaly detection: BreastW and Pima are from [33];Footnote 3 Annthyroid, Cardiotocography, Heart disease, Hepatitis, WDBC and WPBC are from [34];Footnote 4 and Arrhythmia, Lympho, Mammography, Musk, Thyroid, Vertebral, WBC, and Yeast are from [35].Footnote 5 A summary of each data set is provided in Table 1.
Algorithm implementation and parameters
We used the PyOD [36] Python library implementations of the anomaly detection algorithms. For the OAM algorithms, we used the Java implementations of sGrid and SiNNE made available by the authors of [23] and [21], respectively, and implemented RBeam and Beam in Java using WEKA [37].
We used the default parameters of each algorithm as suggested in respective papers unless specified otherwise.
Anomaly detection algorithm:
-
LOF: the size of nearest neighbor (k) = 10;
-
iForest: number of sets t=100, and sub-sample size \(\psi\)=256;
-
Sp: sub-sample size \(\psi\)=20; and
-
iNNE: number of sets t=100, and sub-sample size \(\psi\)=8.
Outlying aspect mining algorithms:
-
Density rank and Density Z-score: KDE use Gaussian kernel with default bandwidth as suggested by [38];
-
sGrid: block size parameter w = 64;
-
SiNNE: sub-sample size \(\psi\) = 8, and ensemble size t = 100; and
-
Beam search: beam width W = 100, and maximum dimensionality of subspace \(\ell\) = 3.
Evaluation measure
We used the area under the ROC curve (AUC) [39] and precision at n (P@n)Footnote 6 [40] to measure the effectiveness of the anomaly ranking produced by an anomaly detector; a higher AUC indicates better detection accuracy.
Samariya and Ma [20] proposed a new kernel mean embedding-based evaluation measure in the outlying aspect mining domain. The intuition behind the evaluation measure is that in most outlying aspects, a query \(\textbf{q}\) is far from the distribution of data in those aspects.
Definition 3
The quality of discovered aspects (or subspace(s)) \({\mathcal {S}}\) for a query \(\textbf{q}\) is computed as
where \(K_{\mathcal {S}} (\textbf{q}, x)\) is a kernel similarity between \(\textbf{q}\) and x in subspace \({\mathcal {S}}\).
Therein, the authors used the chi-square kernel [41], computed as follows.
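A plain-Python sketch of this quality measure is given below. It is illustrative and makes two assumptions of ours: the exponential form of the chi-square kernel (features assumed non-negative, \(\gamma\) arbitrary), and that a lower mean similarity between the query and the data indicates a better (more outlying) subspace:

```python
import math

def chi2_kernel(a, b, gamma=1.0):
    """Exponential chi-square kernel between non-negative feature vectors."""
    s = sum((ai - bi) ** 2 / (ai + bi) for ai, bi in zip(a, b) if ai + bi > 0)
    return math.exp(-gamma * s)

def subspace_quality(data, q, subspace):
    """Mean kernel similarity between q and all points in the given subspace;
    lower similarity means q is further from the data distribution there."""
    proj = lambda x: [x[i] for i in subspace]
    return sum(chi2_kernel(proj(q), proj(x)) for x in data) / len(data)
```

For a query that is unusual only in some dimensions, the mean similarity drops sharply in exactly those subspaces, which is how discovered aspects can be compared.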
All experiments were conducted on a machine with an Intel 8-core i9 CPU and 16 GB main memory, running macOS Big Sur version 11.1. Each job was run on a single CPU thread, with jobs parallelized using GNU parallel [42].
Empirical evaluation
In this section, we present the results of four anomaly detection methods (LOF, iForest, Sp, and iNNE) and four outlying scoring measures—Kernel Density Rank (RBeam), Density Z-score (Beam), sGrid Z-score (sGBeam), and SiNNE (SiBeam)—each combined with beam search, on the medical datasets. Each experiment was given a 1-h limit; unfinished tasks were terminated and are marked ‘\(\ddagger\)’.
Experiment-1: Performance of anomaly detection algorithms
In this sub-section, we present the results of the four anomaly detection techniques, LOF, iForest, Sp, and iNNE, in terms of AUC.
The AUC comparison of LOF, iForest, Sp, and iNNE is presented in Table 2 (c.f. columns 2 to 5). Interestingly, no single anomaly detection algorithm performs best on every dataset. However, iForest is the best-performing method overall, achieving the best AUC on 10 datasets. The last row of Table 2 shows that iForest produced the best average AUC, Sp a significantly lower one, and LOF and iNNE comparable results.
The total runtime, which includes pre-processing, model building, ranking n instances, and computing AUC, is presented in Table 2 (c.f. columns 6 to 9). Overall, Sp is the fastest measure, while iForest and iNNE take similar time.
Experiment-2: Performance of outlying aspect mining algorithms
For each data set, we first use the iForest anomaly detection method to detect the top k = 10 anomalies, which are then used as queries. Each scoring measure identifies the outlying aspects of each query, and we measure the quality of the discovered subspaces using Eq. 1.
Tables 3, 4, 5 and 6 show the subspaces found by the four scoring measures and the quality of the discovered subspaces on 16 real-world medical datasets. RBeam and Beam could not finish on annthyroid and musk within an hour; these entries are marked ‘\(\ddagger\)’.
Out of 160 queries, SiBeam detects a better subspace for 116 queries and sGBeam for only 23. RBeam detects better subspaces for 40 out of 140 queries and Beam for only 6. Overall, SiBeam is the best-performing measure; RBeam is slow, but it still performs better than the Z-score-based measures. As noted in [20, 21], Z-score-based measures are biased towards subspaces with high variance, which is why both perform worst in this comparison.
Next, we visually present the discovered subspaces by different scoring measures of three queries from each data set. Note that each one-dimensional subspace is plotted using a histogram with 10 equal-width bins.
Tables 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, and 22 provide the visualization of the subspaces discovered by RBeam, Beam, sGBeam, and SiBeam on annthyroid, arrhythmia, breastw, cardiotocography, diabetes, heart disease, hepatitis, lympho, mammography, musk, pima, thyroid, vertebral, wbc, wdbc and wpbc, respectively. The query point is highlighted in dark blue-green (teal) and marked with a golden arrow; visually better subspaces are highlighted with a green box.
Comparing the discovered subspaces visually, out of 48 queries (3 from each data set), SiBeam and sGBeam detect better subspaces for 39 and 18 queries, respectively, while RBeam and Beam detect better subspaces for 29 and 11 out of 42 queries. Overall, SiBeam performs best or comparably to RBeam, Beam, and sGBeam.
Conclusion
This paper shows an interesting application of OAM in the healthcare domain. We first introduced four anomaly detection and four outlying aspect mining algorithms. We then presented a framework that not only detects anomalies but also explains why a given query is an anomaly, by providing the set of features in which it is most outlying. Our evaluation on 16 medical datasets shows that iForest is the best-performing anomaly detector. Furthermore, our experiments on the anomaly explanation (outlying aspect mining) task show that the recently developed isolation-based scoring measure SiNNE outperforms the other state-of-the-art outlying aspect mining scoring measures. Fast algorithms are essential in the medical domain; kernel density and Z-score-based scoring measures are thus unsuitable when the data set is huge.
Notes
Anomaly and outlier are the most commonly used terms in the literature; hereafter, we use the term anomaly only.
The description of the data set is provided in Table 1.
Available at https://www.ipd.kit.edu/~muellere/HiCS/
Available at http://odds.cs.stonybrook.edu
Note that, hereafter, we denote precision at n by P@n.
References
Hauskrecht M, Valko M, Batal I, et al. Conditional outlier detection for clinical alerting. In: AMIA annual symposium proceedings. American Medical Informatics Association; 2010, p. 286.
Siwicki B. 86% of healthcare companies use some form of AI. 2017. https://www.healthcareitnews.com/news/86-healthcare-companies-use-some-form-ai.
Sarki R, Ahmed K, Wang H, et al. Image preprocessing in classification and identification of diabetic eye diseases. Data Sci Eng. 2021;6(4):455–71.
Tachmazidis I, Chen T, Adamou M, et al. A hybrid AI approach for supporting clinical diagnosis of attention deficit hyperactivity disorder (adhd) in adults. Health Inf Sci Syst. 2021;9(1):1–8.
He J, Rong J, Sun L, et al. A framework for cardiac arrhythmia detection from iot-based ecgs. World Wide Web. 2020;23(5):2835–50.
Pham TD. Classification of covid-19 chest x-rays with deep learning: new models or fine tuning? Health Inf Sci Syst. 2021;9(1):1–11.
Wang J, Liang S, Wang Y, et al. A weighted overlook graph representation of eeg data for absence epilepsy detection. In: 2020 IEEE International Conference on Data Mining (ICDM); 2020, pp 581–590. https://doi.org/10.1109/ICDM50108.2020.00067.
Smiti A. When machine learning meets medical world: Current status and future challenges. Comput Sci Rev. 2020;37(100):280.
Ma J, Sun L, Wang H, et al. Supervised anomaly detection in uncertain pseudoperiodic data streams. ACM Trans Internet Technol. 2016;16(1):1–20. https://doi.org/10.1145/2806890.
Meng L, Tan W, Ma J, et al. Enhancing dynamic ecg heartbeat classification with lightweight transformer model. Artif Intell Med. 2022;124(102):236.
Pachauri G, Sharma S. Anomaly detection in medical wireless sensor networks using machine learning algorithms. Procedia Comput Sci. 2015;70:325–33.
Samariya D, Ma J, et al. Anomaly detection on health data. In: Traina A, Wang H, Zhang Y, et al., editors. Health information science. Cham: Springer Nature Switzerland; 2022. p. 34–41.
Samariya D, Thakkar A. A comprehensive survey of anomaly detection algorithms. Ann Data Sci. 2021;2021:1–22.
Konijn RM, Kowalczyk W. Finding fraud in health insurance data with two-layer outlier detection approach. In: International Conference on Data Warehousing and Knowledge Discovery, Springer; 2011, pp. 394–405.
Laurikkala J, Juhola M, Kentala E, et al. Informal identification of outliers in medical data. In: Fifth international workshop on intelligent data analysis in medicine and pharmacology; 2000, pp. 20–24.
Prastawa M, Bullitt E, Ho S, et al. A brain tumor segmentation framework based on outlier detection. Med Image Anal. 2004;8(3):275–83. https://doi.org/10.1016/j.media.2004.06.007.
van Capelleveen G, Poel M, Mueller RM, et al. Outlier detection in healthcare fraud: a case study in the medicaid dental domain. Int J Account Inf Syst. 2016;21:18–31. https://doi.org/10.1016/j.accinf.2016.04.001.
Vinh NX, Chan J, Romano S, et al. Discovering outlying aspects in large datasets. Data Min Knowl Discov. 2016;30(6):1520–55. https://doi.org/10.1007/s10618-016-0453-2.
Samariya D, Ma J. Mining outlying aspects on healthcare data. In: International Conference on Health Information Science, Springer; 2021, pp. 160–170.
Samariya D, Ma J. A new dimensionality-unbiased score for efficient and effective outlying aspect mining. Data Science and Engineering; 2022b, pp. 1–16.
Samariya D, Aryal S, Ting KM, et al. A new effective and efficient measure for outlying aspect mining. In: Huang Z, Beek W, Wang H, et al., editors. Web information systems engineering - WISE 2020. Cham: Springer; 2020. p. 463–74.
Samariya D, Ma J, Aryal S, et al. sgrid++: revising simple grid based density estimator for mining outlying aspect. In: Chbeir R, Huang H, Silvestri F, et al., editors. Web information systems engineering - WISE 2022. Cham: Springer; 2022. p. 194–208.
Wells JR, Ting KM. A new simple and efficient density estimator that enables fast systematic search. Pattern Recogn Lett. 2019;122:92–8.
Breunig MM, Kriegel HP, Ng RT, et al. Lof: identifying density-based local outliers. In: Proceedings of the 2000 ACM SIGMOD international conference on management of data. association for computing machinery, New York, NY, USA, SIGMOD’, pp. 93–104. 2000. https://doi.org/10.1145/342009.335388.
Chandola V, Banerjee A, Kumar V. Anomaly detection: a survey. ACM Comput Surv. 2009;41(3):1–58. https://doi.org/10.1145/1541880.1541882.
Liu FT, Ting KM, Zhou ZH. Isolation forest. In: 2008 Eighth IEEE International Conference on Data Mining; 2008, pp. 413–422. https://doi.org/10.1109/ICDM.2008.17.
Sugiyama M, Borgwardt K, et al. Rapid distance-based outlier detection via sampling. In: Burges CJC, Bottou L, Welling M, et al., editors. Advances in neural information processing systems, vol. 26. New York: Curran Associates Inc; 2013. p. 467–75.
Bandaragoda TR, Ting KM, Albrecht D, et al. Isolation-based anomaly detection using nearest-neighbor ensembles. Comput Intell. 2017;34:1–31. https://doi.org/10.1111/coin.12156.
Duan L, Tang G, Pei J, et al. Mining outlying aspects on numeric data. Data Min Knowl Discov. 2015;29(5):1116–51. https://doi.org/10.1007/s10618-014-0398-2.
Silverman BW. Density estimation for statistics and data analysis. London: Chapman & Hall; 1986.
Rymon R. Search through systematic set enumeration. In: Proceedings of the Third International Conference on Principles of Knowledge Representation and Reasoning. Morgan Kaufmann Publishers Inc., Cambridge, MA, KR’92, pp. 539–550. 1992. http://dl.acm.org/citation.cfm?id=3087223.3087278.
Russell S, Norvig P. Artificial intelligence: a modern approach. 3rd ed. Upper Saddle River: Prentice Hall Press; 2009.
Keller F, Muller E, Bohm K. Hics: High contrast subspaces for density-based outlier ranking. In: 2012 IEEE 28th International Conference on Data Engineering; 2012, pp. 1037–1048. https://doi.org/10.1109/ICDE.2012.88.
Campos GO, Zimek A, Sander J, et al. On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Min Knowl Discov. 2016;30(4):891–927.
Rayana S. ODDS library. 2016. http://odds.cs.stonybrook.edu.
Zhao Y, Nasrullah Z, Li Z. Pyod: a python toolbox for scalable outlier detection. J Mach Learn Res. 2019;20(96):1–7.
Hall M, Frank E, Holmes G, et al. The weka data mining software: an update. SIGKDD Explor Newsl. 2009;11(1):10–8. https://doi.org/10.1145/1656274.1656278.
Härdle W. Smoothing techniques: with implementation in S. New York: Springer; 2012.
Hand DJ, Till RJ. A simple generalisation of the area under the roc curve for multiple class classification problems. Mach Learn. 2001;45(2):171–86.
Craswell N. Precision at n. Boston: Springer US; 2009. p. 2127–8.
Zhang J, Marszałek M, Lazebnik S, et al. Local features and kernels for classification of texture and object categories: a comprehensive study. Int J Comput Vis. 2007;73(2):213–38.
Tange O. Gnu parallel 20201022 (‘samuelpaty’). Zenodo. 2020. https://doi.org/10.5281/zenodo.4118697.
Acknowledgements
The preliminary version of this paper is published in Proceedings of the 10th International Conference on Health Information Science (HIS) 2021 [19]. This work is supported by Federation University Research Priority Area (RPA) scholarship, awarded to Durgesh Samariya. Dr. Sunil Aryal is supported by the Air Force Office of Scientific Research award under number FA2386-20-1-4005.
Funding
Open Access funding enabled and organized by CAUL and its Member Institutions.
Ethics declarations
Conflict of interest
Not applicable.
Ethical approval
This article does not contain any studies with human participants by any of the authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Samariya, D., Ma, J., Aryal, S. et al. Detection and explanation of anomalies in healthcare data. Health Inf Sci Syst 11, 20 (2023). https://doi.org/10.1007/s13755-023-00221-2