Abstract
k-means clustering is a classic method of unsupervised learning with the aim of partitioning a given number of measurements into k clusters. In many modern applications, however, this approach suffers from unstructured measurement errors because the k-means clustering result then represents a clustering of the erroneous measurements instead of retrieving the true underlying clustering structure. We resolve this issue by applying techniques from robust optimization to hedge the clustering result against unstructured errors in the observed data. To this end, we derive the strictly and \(\Gamma \)-robust counterparts of the k-means clustering problem. Since the nominal problem is already NP-hard, global approaches are often not feasible in practice. As a remedy, we develop tailored alternating direction methods by decomposing the search space of the nominal as well as of the robustified problems to quickly obtain feasible points of good quality. Our numerical results reveal an interesting feature: the less conservative \(\Gamma \)-approach is clearly outperformed by the strictly robust clustering method. In particular, the strictly robustified clustering method is able to recover clusterings of the original data even if only erroneous measurements are observed.
1 Introduction
In statistical analyses, a typical assumption is that the data used is measured without error. However, this assumption is often violated. One possible way to approach this problem is to assume measurement errors in the observed variables. However, these measurement errors are often not specifiable with distributional assumptions because nothing is known about them. Measurement errors with unknown structure appear in many different application areas and will be referred to as unstructured errors hereafter. For example, Rocke et al. (2009) state four sources of such errors in the case of micro-array analysis. When using register data, a typical source of errors is over- and undercounts, which are typically not detectable from the register itself; see Burgard and Münnich (2012). Another example is the analysis of the potential use of big data in public health research in Khoury and Ioannidis (2014), where the authors strikingly state that “‘Big Error’ can plague Big Data”.
The field of robust optimization treats this problem of unstructured, uncertain data by robustifying optimization problems with uncertain parameters. In this setting, no additional information about these parameters is required besides that they belong to a prescribed uncertainty set. See Soyster (1973) for the first paper on robust optimization and the textbook by Ben-Tal et al. (2009) or the survey article by Bertsimas et al. (2011) for a general overview. For a recent application of these ideas to linear mixed models, see Burgard et al. (2020). However, the most basic variant of robust optimization—usually called strict robustness—has often been criticized since it may lead to overly conservative solutions, as they are explicitly hedged against the worst case. As a remedy, less conservative concepts of robustness have been proposed in the last decades; see, e.g., recoverable robustness (Liebchen et al. 2009), light robustness (Fischetti and Monaci 2009), adjustable robustness (Ben-Tal et al. 2004), or \(\Gamma \)-robustness (Bertsimas and Sim 2004).
As motivated above, there are several optimization problems in statistics that should be re-engineered to account for the unstructured error setting as it arises in big data sources. Developing new robust optimization methodology for optimization problems from statistics and classification promises a major advancement in the analysis and usability of large data sources. For example, a very important and widely used method is k-means clustering (MacQueen 1967; Lloyd 1982). It is used to find clusters in an unsupervised learning setting and is widely applied in the context of big data analysis; see, e.g., Grira et al. (2004) and Celebi and Aydin (2016). However, it does not compensate for possible errors in the data set. Therefore, the clusters that are found are perturbed by the unstructured errors and do not necessarily correspond to the true underlying structure of interest.
The marriage of problems from classification or statistics with modern techniques from mathematical optimization has been a field of very active research in recent years, and a comprehensive overview of this literature is beyond the scope of this article. Thus, we mainly focus on reviewing the literature at the interface of robust optimization, k-means clustering, and classification problems. One of the most prominent methods for classification is the support vector machine (SVM), which has also been considered from the point of view of robust optimization; see, e.g., Bhattacharyya et al. (2005), Trafalis and Gilbert (2007), Pant et al. (2011), and Bertsimas et al. (2019). The results obtained in these papers often illustrate that robustified solutions can give higher accuracy than nominal ones, leading to improvements in the SVM method. The most recent paper from the above list, Bertsimas et al. (2019), also applies robust optimization to two other major and widely used classification methods (besides SVMs): logistic regression and CART. The extensive computational results presented there show that the robust formulations outperform nominal and regularized models on most of the instances—especially high-dimensional instances and instances that are hard to separate. Moreover, to the best of our knowledge, Bertsimas et al. (2019) is the only paper that also uses the concept of \(\Gamma \)-robustness for classification. Interestingly, the authors report that \(\Gamma \)-robustness is beneficial for robustifying SVMs. In contrast, in our results the \(\Gamma \)-robustified k-means clusterings are clearly outperformed by those of the strictly robustified models.
Our contribution is the following. We follow the path paved by Bertsimas et al. (2019) but consider strictly and \(\Gamma \)-robust optimization for the k-means clustering problem. The robustification of the underlying optimization problem allows us to handle erroneous input data. As a consequence, it becomes possible to compute clusterings that are closer to the clusterings of the original, i.e., error-free, data although only erroneous input data is observed. To the best of our knowledge, this issue has not been studied before in the literature. Moreover, we develop tailored alternating direction methods (ADMs) to quickly obtain solutions of good quality. Here, one of the main insights is that an ADM for the nominal case can be generalized so that it becomes applicable to the robust counterparts while keeping its convergence properties.
Classic ADMs are extensions of augmented Lagrangian methods and were originally proposed in Gabay and Mercier (1976) and Glowinski and Marroco (1975). A general analysis of the convergence theory of ADMs is presented in Bertsekas and Tsitsiklis (1989). The particular case of ADMs applied to convex objective functions f(u, v) over disjoint constraint sets \(u \in U\), \(v \in V\), is discussed in Wendell and Hurter (1976). An extension to the case of a biconvex f is given in Gorski et al. (2007). More recently, in Boyd et al. (2011), the authors discuss applications of the alternating direction method of multipliers (ADMM), which is closely related to ADMs, in the broad area of machine learning and highlight its potential for parallelization and distributed computing. ADMs are applied very broadly; for an application of ADMs to solve nonconvex MINLPs in the context of gas transport, we refer to Geißler et al. (2015) and Geißler et al. (2018). For an application in supply chain management, we refer to Schewe et al. (2020). Furthermore, in Geißler et al. (2017) it is shown that idealized feasibility pumps can be seen as an ADM, and a penalty-based ADM for MIPs and MINLPs is proposed. A similar method based on block coordinate descent and coordinate descent has recently been applied in the context of optimal imputation in Bertsimas et al. (2017). However, the authors do not consider clustering problems.
In general, these methods rely on a problem-specific decomposition of the variable space of the robust counterparts. Using these decompositions, solutions of good quality can be obtained by alternatingly solving two sub-problems that are significantly easier to solve than the robust counterpart. For the nominal case, the applied ADM exactly mimics the classic k-means clustering algorithm. However, our more abstract view on the decomposition of the problem later allows us to carry the idea of alternatingly solving easier subproblems over to the robustified settings. Strictly robust counterparts for the k-means clustering problem are also considered in Li et al. (2016) and Vo et al. (2016). However, these works do not investigate structural properties of data sets that indicate whether a robustification of k-means might be beneficial or not. Additionally, by explicitly framing our method as an ADM, we are also able to provide a convergence result (both for the nominal and the two robustified clustering methods), which is not done in Li et al. (2016) and Vo et al. (2016). We also extend our ADM setting so that partial minima (delivered by the ADM) of bad quality can be detected and improved, allowing the alternating algorithm to be restarted with a significantly improved starting point. We show in our numerical results that this restart heuristic leads to significantly better clustering results. In Li et al. (2016), the authors focus on clustering of incomplete data sets. Thus, in their computational study, the robust clustering results are evaluated by comparing them to other existing k-means algorithms for incomplete data sets. In contrast to this, the goal of this paper is to evaluate the quality of the robustified solutions obtained and to develop an understanding of structural properties of data sets for which robust clustering is beneficial.
It turns out that a robustification of the clustering approach is not required if the underlying data points are already well separated. However, if this is not the case, the robustified method is able to recover the true clustering although only erroneous data is observed. Our analysis especially leads to an interesting result regarding the comparison of strictly and \(\Gamma \)-robust clustering. The \(\Gamma \)-robust approach was developed to overcome the criticism of strictly robust solutions that are often too conservative in practice. For k-means clustering we observe the contrary: The conservatism of strict robustness leads to significantly better clustering results when compared to the \(\Gamma \)-robust solutions.
The remainder of the paper is structured as follows. In Sect. 2, we formally introduce the k-means clustering problem and state it as a mixed-integer nonlinear optimization problem (MINLP). Afterward, Sect. 3 contains the strictly robust counterpart, followed by the \(\Gamma \)-robust counterpart in Sect. 4. We state a generic ADM in Sect. 5 and discuss its basic convergence properties, which carry over when a tailored version is applied to the nominal or robustified clustering problems. By doing so, we can explicitly characterize the obtained clustering solutions as so-called partial minima of the underlying MINLP. Moreover, we state the specific sub-problems that need to be solved in every iteration of the ADM (if applied to the nominal or one of the two robust clustering methods) and show that they can always be solved efficiently. In Sect. 6, we describe the already mentioned restart heuristic and present our numerical results in Sect. 7. The paper closes with some concluding remarks and notes on future research topics in Sect. 8.
2 The k-means clustering problem and an MINLP formulation
Let \(X \in \mathbb {R}^{p \times n}\) be the matrix containing the data set for clustering, which consists of n data points in \(\mathbb {R}^p\). Here, the data point \(x^i \in \mathbb {R}^p\), \(i=1, \ldots , n\), corresponds to the ith column of X. For a given k, the goal of k-means clustering is to find mean vectors \(\mu ^j \in \mathbb {R}^p\), \(j=1, \ldots , k\), of k clusters that satisfy
where the loss function h is a sum of distances such as the squared Euclidean distance
with \(S_j \subset \mathbb {R}^p\) being the set of data points that are assigned to cluster j.
By introducing binary variables \(b_{i,j} \in \{0,1\}\) for \(i = 1, \ldots , n\) and \(j=1, \ldots , k\), we can reformulate the function h as
where the binary variables have the meaning
As \(x^i \in \mathbb {R}^p\) should belong to exactly one cluster, we include the constraint
To simplify the presentation in what follows, we introduce the notations
Thus, a reformulation of the k-means clustering problem as a mathematical optimization problem is given by
This is a mixed-integer nonlinear programming problem (MINLP); all constraints are linear but the objective function is cubic.
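The reformulated objective, a sum of squared distances weighted by the binary assignment variables, can be sketched in code as follows. The function name and matrix conventions are ours, purely for illustration:

```python
import numpy as np

def kmeans_loss(X, B, M):
    """Loss of the k-means MINLP for a fixed assignment:
    sum_i sum_j b_ij * ||x^i - mu^j||^2.

    X: (p, n) data matrix, one data point per column.
    B: (n, k) binary assignment matrix, each row sums to one.
    M: (p, k) matrix of cluster centers, one center per column.
    """
    # Squared Euclidean distances of every point to every center: (n, k).
    D = ((X[:, :, None] - M[:, None, :]) ** 2).sum(axis=0)
    return float((B * D).sum())
```

For fixed binary variables, evaluating the loss is linear in n and k, which is what makes the alternating scheme of Sect. 5 cheap per iteration.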
3 The strict robust counterpart of the k-means clustering MINLP
Up to now, we have assumed that the data points in X are known exactly. However, in practice, often only erroneous measurements \({\tilde{X}} = X + E\) can be observed, where \(E \in \mathbb {R}^{p \times n}\) describes the respective perturbation of the original data. We assume that there is neither information on the expected value nor on any higher-order moment of the additive perturbation E. The error could even be deterministic rather than stochastic.
In statistics, measurement error correction methods typically assume that the errors are random variables that follow a known distribution; see Carroll et al. (2006). For a recent proposal on dealing with clustering in these classic measurement error settings, see Su et al. (2018). By accounting for the error distribution in the estimation process, the influence of the errors is reduced. However, this assumption is often very restrictive. Especially when using big data sources as in Davalos (2017) and Yamada et al. (2018), no meaningful error distribution can be assumed. As shown in White (2011) for biomarker measures, errors can be due to specimen collection and storage. Last but not least, in social science and econometric analyses, the indicators used are typically estimates resulting from a long statistical production process that is prone to different kinds of errors; see, e.g., Alfons et al. (2013). All these different types of errors can even be systematic and are thus far from being described by a zero-mean distributed error.
In this paper we tackle the following situation: Although we can only observe erroneous measures \({\tilde{X}}\) in practice, we want to compute a clustering that is as close as possible to the clustering of the original data \(X\in \mathbb {R}^{p \times n}\). Obviously, especially if E has a mean vector different from \(0\in \mathbb {R}^p\) over all observations in every cluster for every variable, the optimal solution
will differ from the optimal solution of Problem (1). Therefore, the optimization problem has to be reformulated to account for the unobservable and unstructured error E. This can be done by using the min-max problem
with \({\mathcal {U}}\subseteq \mathbb {R}^{p \times n}\) being the uncertainty set, which will be specified below. In particular, we assume that \({\mathcal {U}}\) can be chosen so that \(E\in {\mathcal {U}}\) holds. Thus, given an uncertainty set \({\mathcal {U}}\), we minimize the function h considering the worst-case scenario w.r.t. all possible \(D \in {\mathcal {U}}\).
As noted above, we assume that the error E is unstructured. Thus, we particularly refrain from using distributional information. Instead, we consider the box-uncertainty set \({\mathcal {U}}\subseteq \mathbb {R}^{p\times n}\) to be given as
With this at hand, we can now state the strictly robust counterpart of Problem (3):
As stated in (5), this robust counterpart is an optimization problem that is not directly tractable. The main reason for its hardness is the min-max structure of the objective function. Fortunately, we can reformulate this problem without the inner maximization problem, as shown in the following theorem.
Theorem 3.1
The robust counterpart (5) is equivalent to
Proof
First, we move the uncertainty from the objective function to the constraints by introducing an extra variable \(t\in \mathbb {R}\) and by rewriting the robust counterpart (5) equivalently as
Now, the uncertainties only appear in Constraint (7b), which in turn is equivalent to
Using the specific box-structure of the uncertainty set (4), we obtain
Thus, we get the constraint
instead of (7b), which, in turn, can be re-written as
Thus, Problem (5) is equivalent to
and the result follows by eliminating t again. \(\square \)
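As a hedged illustration of the resulting worst-case objective, assume a uniform box radius \(\delta \ge 0\) on every entry (the paper's set (4) may instead use entry-wise bounds): for each coordinate, \(\max_{|d| \le \delta} (\tilde{x} + d - \mu)^2 = (|\tilde{x} - \mu| + \delta)^2\), which matches the absolute-value terms appearing in the proof above. A sketch, with illustrative names:

```python
import numpy as np

def strict_robust_loss(Xt, B, M, delta):
    """Worst-case k-means loss over a box uncertainty set.

    Assumes (illustratively) a uniform radius delta for every entry:
    for each coordinate, the max over |d| <= delta of (xt + d - mu)^2
    is attained at d = delta * sign(xt - mu) and equals
    (|xt - mu| + delta)^2.
    """
    A = np.abs(Xt[:, :, None] - M[:, None, :]) + delta  # (p, n, k)
    D = (A ** 2).sum(axis=0)                            # (n, k)
    return float((B * D).sum())
```

For \(\delta = 0\) this reduces to the nominal loss, so the strictly robust counterpart is a continuous deformation of the nominal problem.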
A standard criticism regarding strictly robust solutions (like those of Problem (6)) is that they tend to be too conservative as they are hedged against the worst-case error that can appear in all entries of the matrix \({\tilde{X}}\). To obtain robust solutions that are practically meaningful but less conservative, Bertsimas and Sim (2004) propose the so-called \(\Gamma \)-robust approach, which allows one to control the degree of conservatism. Thus, in the next section, we apply this approach to the k-means clustering problem as well.
4 The \(\Gamma \)-robust counterpart of the k-means clustering MINLP
The \(\Gamma \)-approach proposed in Bertsimas and Sim (2004) does not assume that all parameters realize in a worst-case way but restricts the number of such parameters by \(\Gamma \in {\mathbb {N}}\), which gives the approach its name. We now apply the technique proposed in Bertsimas and Sim (2004) to the k-means clustering problem. This means that some entries \({\tilde{x}}^i_l\) are not perturbed, in which case \({\tilde{x}}^i_l = x^i_l\) and \(d^i_l = 0\) hold. Thus, the matrices \({\tilde{X}}\) and X differ in at most \(\Gamma \) many entries.
To state the \(\Gamma \)-robust counterpart of Problem (3), we consider the uncertainty set
Using this uncertainty set, we limit the number of entries that may realize in a worst-case way; however, we do not state in advance which ones will do so. Hence, the \(\Gamma \)-robust counterpart of Problem (3) is given by
where we already replaced the absolute values with additional variables and linear constraints as in Theorem 3.1. This model can be reformulated as an equivalent problem without the inner maximization problem as we show next.
Theorem 4.1
The \(\Gamma \)-robust counterpart (8) is equivalent to
Proof
First, we rewrite Problem (8) as
Note that the inner maximization problem that appears in Constraint (10b) can be re-written by re-arranging the terms in the sums. By doing so, we obtain the following equivalent subset selection problem
We now reformulate the latter problem as a linear optimization problem. For details, we refer to Proposition 1 in Bertsimas and Sim (2004). From the cited result, it follows that an equivalent formulation of Problem (11) is given by
This is a linear optimization problem in z and its dual problem reads
Here, \(\uplambda \) is the dual variable corresponding to Constraint (12b) and \(\beta ^i_l\) are the dual variables corresponding to the constraints in (12c). Since Problem (12) is feasible and bounded, we can apply strong duality, which states that the dual problem (13) is also feasible and bounded and that the primal and dual optimal objective function values coincide. We can thus replace the inner maximization problem in (10b) by its dual minimization problem. In addition, notice that we do not need the minimum here because if
is true, it also holds for the minimum value of
over the dual feasible set. Hence, we obtain
and the result follows by eliminating t. \(\square \)
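Under the same illustrative assumptions as before (a uniform box radius \(\delta\) and a fixed assignment of points to centers), the inner maximization is a subset selection problem as in (11) and can also be solved directly: perturbing entry \((i,l)\) to its worst case increases the loss by \((|\tilde{x}^i_l - \mu_l| + \delta)^2 - (\tilde{x}^i_l - \mu_l)^2\), and the worst case keeps the \(\Gamma \) largest such gains. A sketch, useful for sanity-checking the dual reformulation numerically:

```python
import numpy as np

def gamma_robust_loss(Xt, assign, M, delta, Gamma):
    """Gamma-robust loss with the inner maximization solved directly.

    At most Gamma entries (i, l) of the data realize their worst case
    within a box of (illustrative) uniform radius delta; the worst
    case simply takes the Gamma largest per-entry gains.
    """
    R = Xt - M[:, assign]                   # residuals, shape (p, n)
    nominal = float((R ** 2).sum())
    # Gain of perturbing entry (i, l) to its worst case.
    gains = (np.abs(R) + delta) ** 2 - R ** 2
    worst = np.sort(gains.ravel())[::-1][:Gamma]
    return nominal + float(worst.sum())
```

For \(\Gamma = 0\) this is the nominal loss, and for \(\Gamma \ge pn\) it coincides with the strictly robust loss, illustrating how \(\Gamma \) interpolates between the two extremes.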
5 Alternating direction methods for nominal and robust k-means clustering
In this section, we propose tailored alternating direction methods (ADMs) to compute approximate solutions to Problems (3), (6), and (4.1). Note that all these problems are MINLPs, which are in general NP-hard to solve to global optimality. Actually, the nominal k-means problem itself is an NP-hard problem even in the particular case of two dimensions; see, e.g., Dasgupta (2007), Aloise et al. (2009) and Mahajan et al. (2012). For such hard problems it is usually considered appropriate to study primal heuristics to quickly compute feasible points of good quality. The presented ADMs can be seen as such primal heuristics with the additional advantage that they come with a formal convergence analysis. The key idea of the ADMs is to decompose the set of variables into two blocks: the set of binary variables and the set of continuous variables. Afterward, we alternatingly solve the problem in one direction while keeping the other one fixed, which is a popular approach for clustering problems; see, e.g., Li et al. (2016) or Ames (2014), Pirinen and Ames (2019), where an alternating direction method of multipliers is used to solve semidefinite programming relaxations for computing a clustering of weighted graphs.
In the following, we first briefly review classic ADMs and afterward propose tailored versions of ADMs to solve Problems (3), (6), and (4.1).
5.1 General ADM framework and convergence properties
We now consider the general problem
for which we make the following assumption.
Assumption 1
The objective function \(f:\mathbb {R}^{n_u + n_v} \rightarrow \mathbb {R}\) and the constraint functions \(g:\mathbb {R}^{n_u + n_v} \rightarrow \mathbb {R}^m\), \(c:\mathbb {R}^{n_u + n_v} \rightarrow \mathbb {R}^q\) are continuous and the sets U and V are non-empty and compact.
The feasible set of Problem (15) is denoted by \(\Omega \), i.e.,
Alternating direction methods are iterative procedures that solve Problem (15) by alternatingly solving two simpler problems. Given an iterate \((u^t,v^t)\), they solve Problem (15) with v fixed to \(v^t\) into the direction of u, yielding a new u-iterate \(u^{t+1}\). Afterward, u is fixed to \(u^{t+1}\) and Problem (15) is solved into the direction of v, yielding a new v-iterate \(v^{t+1}\). The algorithm is formally stated in Algorithm 1. The for-loop is repeated until a termination criterion is reached. In order to state convergence results for Algorithm 1, we need the following definition.
Definition 5.1
Let \((u^*,v^*) \in \Omega \) be a feasible point of Problem (15). Then, \((u^*,v^*)\) is called a partial minimum of Problem (15) if
holds.
Consider \(\Theta ({\bar{u}},{\bar{v}})\) being the set of possible iterates starting from point \(({\bar{u}},{\bar{v}})\), i.e.,
The following general convergence result is taken from Gorski et al. (2007).
Theorem 5.1
Let \(\{ (u^t,v^t)\}_{t=0}^{\infty }\) be a sequence generated by Algorithm 1 with \((u^{t+1},v^{t+1}) \in \Theta (u^t,v^t)\). Suppose that the solution of the first optimization problem (in Line 3) is always unique. Then, every convergent subsequence of \(\{ (u^t,v^t)\}_{t=0}^{\infty }\) converges to a partial minimum. In addition, if w and \(w'\) are two limit points of such subsequences, then \(f(w) = f(w')\) holds.
The sequence \(\{ (u^t,v^t)\}_{t=0}^{\infty }\) generated by Algorithm 1 may not converge but may instead have several convergent subsequences. In particular, if \(\{ (u^t,v^t)\}_{t=0}^{\infty }\) is contained in a compact set, then there exists at least one convergent subsequence \(\{ (u^t,v^t)\}_{t\in {\mathcal {T}}}\). A partial minimum need not be a global minimum. However, as the problems that we need to solve are NP-hard, we are willing to accept sub-optimal solutions. For more details on the convergence theory of classic ADMs, see, e.g., Gorski et al. (2007) and Wendell and Hurter (1976).
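The generic scheme of Algorithm 1 is short enough to state as a code sketch (all names here are ours): given oracles for the two directional subproblems, it alternates until the objective stops improving, which yields a candidate partial minimum in the sense of Definition 5.1.

```python
def adm(solve_u, solve_v, u0, v0, f, max_iter=100, tol=1e-9):
    """Generic alternating direction method (sketch of Algorithm 1).

    solve_u(v) returns argmin_u f(u, v) with v fixed; solve_v(u)
    returns argmin_v f(u, v) with u fixed. Stops once the objective
    no longer improves by more than tol.
    """
    u, v = u0, v0
    best = f(u, v)
    for _ in range(max_iter):
        u = solve_u(v)
        v = solve_v(u)
        cur = f(u, v)
        if best - cur <= tol:
            break
        best = cur
    return u, v
```

For a biconvex f, each subproblem is convex even though the joint problem is not, which is exactly the situation of the tailored clustering ADMs in the following subsections.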
5.2 An ADM for the nominal k-means clustering problem
In this section, we propose a tailored version of Algorithm 1 to compute partial minima of Problem (3). First, we define the sets U and V. For each attribute’s index \(l\in P\), let \(\pi _{l}^{-} \mathrel {{\mathop :}{=}}\min \{x_l^i:i\in N\}\) and \(\pi _{l}^{+} \mathrel {{\mathop :}{=}}\max \{x_l^i:i\in N\}\) be the minimum and maximum value for this attribute, respectively, and let \(\Pi (X) \mathrel {{\mathop :}{=}}[\pi _{1}^{-}, \pi _{1}^{+}] \times \cdots \times [\pi _{p}^{-}, \pi _{p}^{+}]\) be the bounding box of all data points. Thus, \(x^{i} \in \Pi (X)\) holds for all \(i\in N\).
Now, within the context of Algorithm 1, we consider the following non-empty and compact sets
which gives the variable splitting required for the application of the ADM. Note that, in Problem (3), we do not have any coupling constraint such as g or c in the general Problem (15). In what follows, we describe how we compute a partial minimum of Problem (3).
Suppose that initial mean vectors \(\mu ^0\in V\) are given. In each iteration t, the problem in the direction of U reads
The optimal solution to this binary problem can be obtained as follows. Consider a fixed \(i\in N\). Since \(b_{i,j} \in \{0,1\}\) for \(j\in K\) and only one \(b_{i,j}\) is equal to 1, we set \(b_{i,j} = 1\) for j that minimizes the norm. This is exactly the j with \(\mu ^j\) being the center that is the closest to the point \(x^i\). Thus, for each \(i\in N\), we compute
Note that there may exist more than one element in \(J_i\). In this case, we break the tie by choosing the smallest one. With \(b^{t+1}\) at hand, we update the mean vectors by solving the problem in the direction of V, which is given by
This problem can be solved very effectively by using the formula
with
In other words, in each iteration, we obtain the new centers \(\mu ^{t+1}\in V\) by computing the centers for the given assignment \(b^{t+1}\). Now, we can summarize the main steps of our algorithm to compute a partial minimum of Problem (3) in Algorithm 2. For details on how we choose the initial centers we refer to Sect. 7.3.
Since the method in Algorithm 2 is a special case of the general ADM in Algorithm 1, the general convergence result as stated in Theorem 5.1 can be applied, yielding that Algorithm 2 leads to partial minima. Let us finally mention that the presented ADM exactly mimics the classic k-means clustering algorithm. However, this more abstract view on decomposing the problem can be nicely transferred to the robustified settings that we discuss in the following sections.
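For concreteness, the two alternating steps of Algorithm 2, nearest-center assignment and the mean-update formula, can be sketched as a plain re-implementation of the classic k-means iteration (the handling of empty clusters is our own simplification):

```python
import numpy as np

def kmeans_adm(X, mu0, max_iter=100):
    """ADM for the nominal problem: alternate nearest-center
    assignment (U-step; argmin breaks ties by the smallest index)
    and recomputation of each center as the mean of its assigned
    points (V-step). Empty clusters keep their previous center."""
    M = mu0.astype(float).copy()                # centers, (p, k)
    assign = np.zeros(X.shape[1], dtype=int)
    for _ in range(max_iter):
        # U-step: squared distance of every point to every center.
        D = ((X[:, :, None] - M[:, None, :]) ** 2).sum(axis=0)  # (n, k)
        assign = D.argmin(axis=1)
        # V-step: mean of the assigned points.
        new_M = M.copy()
        for j in range(M.shape[1]):
            pts = X[:, assign == j]
            if pts.shape[1] > 0:
                new_M[:, j] = pts.mean(axis=1)
        if np.allclose(new_M, M):
            break
        M = new_M
    return assign, M
```

The stopping test mirrors the ADM view: once neither direction changes the iterate, a partial minimum in the sense of Definition 5.1 has been reached.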
5.3 An ADM for the strictly robust k-means clustering problem
In this section, we propose a tailored ADM to compute partial minima of Problem (6). First, we describe how we choose the set V. Consider the bounding box of the erroneous measures \(\Pi ({\tilde{X}})\) as in Sect. 5.2. Thus, \({\tilde{x}}^i\), \(\mu ^j\in \Pi ({\tilde{X}})\) holds for all \(i\in N\), \(j\in K\). With this, for all \(i\in N, \ j\in K, \ l\in P\), we can set
In any solution of Problem (6), \(\alpha \) is as small as possible while satisfying (6b). Thus, we can conclude that M as defined in (21) is an upper bound for \(\alpha \), i.e., we can set
This leads to the compact sets
Finally, let \(V\mathrel {{\mathop :}{=}}V_{\mu }\times V_{\alpha }\) be the Cartesian product of the compact sets \(V_\mu \) and \(V_\alpha \). Thus, the variable splitting is given by the non-empty and compact set (16) and V as just defined. Note that, with this variable splitting, we again do not have any coupling constraints. Now, we can apply Algorithm 1 to Problem (6).
Suppose that the continuous variables \(\mu ^0\) and \(\alpha ^0\) are given. In each iteration t, the first subproblem to be solved is the binary optimization problem
in which we have already omitted all constant terms in the objective function. This problem can be solved in analogy to Problem (18). Suppose that we have computed \(b^{t+1}\). The new iterates \(\mu ^{t+1}\) and \(\alpha ^{t+1}\) are computed by solving
This is a convex quadratic optimization problem and can thus be efficiently solved by standard state-of-the-art solvers; see, e.g., Ben-Tal and Nemirovski (2001) and Boyd and Vandenberghe (2004) for an overview of algorithms and complexity results for convex problems. The overall method to compute a partial minimum of Problem (6) is formally given in Algorithm 3. As in the last section, Theorem 5.1 can be applied again, yielding that Algorithm 3 leads to partial minima of the strictly robust counterpart.
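While the V-step is solved above as a convex QP with a standard solver, its structure can be illustrated in one dimension: under the illustrative assumption of a uniform box radius \(\delta\), the worst-case objective separates across clusters and coordinates into scalar convex functions \(\sum_i (|\tilde{x}_i - \mu| + \delta)^2\), each of which can be minimized, e.g., by ternary search:

```python
def robust_center_1d(xs, delta, lo, hi, iters=200):
    """Minimize the scalar convex function
    f(mu) = sum_i (|x_i - mu| + delta)^2 over [lo, hi] by ternary
    search. This is one coordinate of one cluster's V-step under the
    uniform-radius assumption; the paper instead solves the full
    convex QP with a standard solver."""
    def f(mu):
        return sum((abs(x - mu) + delta) ** 2 for x in xs)
    for _ in range(iters):
        # Shrink the bracket while keeping the minimizer inside it.
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) <= f(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2.0
```

For \(\delta = 0\) this recovers the nominal mean update; for \(\delta > 0\) the extra term \(2\delta \sum_i |\tilde{x}_i - \mu|\) pulls the robust center from the mean toward the median, which gives some intuition for why the strictly robust centers are less sensitive to perturbed points.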
5.4 An ADM for the \(\Gamma \)-robust k-means clustering problem
As in Sect. 5.3, we now first discuss how to obtain the set V for applying Algorithm 1 to Problem (4.1).
Note that in Problem (4.1), the continuous variables \(\uplambda \) and \(\beta \) are always bounded in the direction of optimization since they are non-negative and minimized. Thus, there exist constants \(M_{\uplambda }\) and \(M_{\beta }\) with
Now, we consider the following non-empty and compact sets
With these sets and the sets in (22) and (23) at hand, we can define the non-empty and compact set V as
Thus, the variable splitting is represented by the sets (16) and (25). In this case now, we have coupling constraints (9b). In the notation of (15), we define these constraints formally as \(c:\mathbb {R}^{n_u + n_v} \rightarrow \mathbb {R}^{p\times n}\) with
In the following, we describe how we solve the ADM subproblems. To solve Problem (4.1) into the direction of U, we first need to choose \(\mu ^0\), \(\alpha ^0\), \(\uplambda ^0\), and \(\beta ^0\). Suppose that \(\mu ^0\) is given; we then obtain \(b^0\) as in Steps 3–6 of Algorithm 2 w.r.t. \({\tilde{x}}^i\). Afterward, we can compute \(\alpha ^0\), \(\uplambda ^0\), and \(\beta ^0\) by solving the linear optimization problem
With the initial values for \(\mu ^0\), \(\alpha ^0\), \(\uplambda ^0\), and \(\beta ^0\) at hand, the problem in the direction of U in iteration t reads
where we have already omitted all constant terms in the objective function.
Note that for \(t=0\), solving Problem (27) yields \(b^1=b^0\). This is because \(b^0\) is feasible for (27) and attains the optimal objective value, since it has been obtained as in Algorithm 2 w.r.t. \({\tilde{x}}^i\). Thus, there is no need to compute \(\alpha ^0\), \(\uplambda ^0\), and \(\beta ^0\), and we can directly proceed with the problem into the direction of V using \(b^0\), as described below.
Suppose that \(\mu ^0\) is given and \(b^0\) is obtained as in Algorithm 2. Thus, in each iteration t, we compute \(\mu ^{t+1}\), \(\alpha ^{t+1}\), \(\uplambda ^{t+1}\), and \(\beta ^{t+1}\) by solving
which is a convex quadratic optimization problem. Having this problem solved, we obtain the next b-iterate by solving Problem (27).
Algorithm 4 summarizes the main steps to compute a partial minimum of Problem (4.1). Again, Theorem 5.1 ensures that Algorithm 4 computes partial minima of the \(\Gamma \)-robust counterpart.
6 Restart heuristic
There are specific clustering instances for which the ADMs described so far obtain visibly bad results; see, e.g., the clustering obtained for the instance Unbalance in Fig. 1. For details on the benchmark data sets, we refer to Sect. 7.2. Observe that Algorithm 2 identified two clusters, \(S_1\) and \(S_2\), where there should be only one, and identified one cluster, \(S_3\), where there should be two. This leads to the situation that the loss within cluster \(S_1\) and within cluster \(S_2\) is much smaller than the one within cluster \(S_3\). Here and in what follows, we compute the weighted value of the loss function h in (2) restricted to cluster \(S_j\) by
Suppose now that we join clusters \(S_1\) and \(S_2\) to form a new cluster \(S_{12}\). If the total loss within cluster \(S_{12}\) is still smaller than the one within cluster \(S_3\), then our algorithm is actually not minimizing the sum of total losses: a better point is obtained by splitting cluster \(S_3\) into two clusters and joining clusters \(S_1\) and \(S_2\).
Based on these observations, we propose a heuristic to prevent the ADMs in Algorithms 2–4 from getting stuck in a partial minimum of bad quality as shown in Fig. 1; see Fraiman et al. (2013) for a similar heuristic.
Suppose that we have a partial minimum and the \(h_j\) values at hand. For each pair of clusters \((S_{j_1}, S_{j_2})\), we also compute their joint center and the corresponding total loss via
and
Now, consider the set
which is the set of all possible combinations of three clusters such that the total loss within two joined clusters is smaller than the total loss within a third cluster. Note that the set \(\Psi \) can be empty. If so, this means that we cannot obtain a better partial minimum by joining two clusters and splitting another one. Otherwise, i.e., if there exists \((S_{j_1}, S_{j_2}, S_{j_3}) \in \Psi \), the total loss of the joined clusters \(S_{j_1}\) and \(S_{j_2}\) is smaller than the total loss within cluster \(S_{j_3}\). Thus, we obtain a better partial minimum by joining \(S_{j_1}\) and \(S_{j_2}\) and by splitting cluster \(S_{j_3}\) into two smaller clusters. To this end, we update the centers in such a way that the clusters \(S_{j_1}\) and \(S_{j_2}\) now form one cluster with center \(\mu ^{j_1j_2}\), cluster \(S_{j_3}\) receives two centers, namely the two furthest points in \(S_{j_3}\), and the other centers remain the same, i.e.,
with
Finally, if the set \(\Psi \) has more than one element, then we repeat the process starting with the element \((S_{j_1}, S_{j_2}, S_{j_3})\) that gives the minimum ratio \(h_{j_1j_2} / h_{j_3}\). Each time an element \((S_{j_1}, S_{j_2}, S_{j_3})\) is used, we exclude all the elements that contain \(S_{j_1}\), \(S_{j_2}\), or \(S_{j_3}\), because these clusters have already been modified.
With \({\hat{\mu }}\) at hand, we compute a new partial minimum using \({\hat{\mu }}\) as the initial centers. Given the new partial minimum, we repeat the process until a termination criterion is satisfied. In Algorithm 5, we formally state the proposed heuristic as an additional part of Algorithms 2–4. The termination criterion in our implementation is to stop if the restart heuristic does not provide modified centers anymore and, as a consequence, if two consecutive partial minima are the same (up to numerical tolerances).
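To make the join/split step concrete, the following is a minimal Python sketch of one round of the restart heuristic, assuming the squared-Euclidean loss of k-means; the function names and tie-breaking details are illustrative and not the authors' exact implementation:

```python
import numpy as np

def cluster_loss(points, center):
    # Sum of squared Euclidean distances of the points to their center.
    return float(((points - center) ** 2).sum())

def restart_centers(points, labels, centers):
    """One join/split step of the restart heuristic (a sketch): if joining
    two clusters yields a smaller total loss than that of some third
    cluster, merge the pair with minimum ratio and split the third cluster
    at its two mutually furthest points."""
    k = len(centers)
    clusters = [points[labels == j] for j in range(k)]
    h = [cluster_loss(clusters[j], centers[j]) for j in range(k)]
    best = None  # (ratio, j1, j2, j3, joint center)
    for j1 in range(k):
        for j2 in range(j1 + 1, k):
            joined = np.vstack([clusters[j1], clusters[j2]])
            mu_joint = joined.mean(axis=0)
            h_joint = cluster_loss(joined, mu_joint)
            for j3 in range(k):
                if j3 in (j1, j2) or h_joint >= h[j3]:
                    continue  # not an element of the set Psi
                ratio = h_joint / h[j3]
                if best is None or ratio < best[0]:
                    best = (ratio, j1, j2, j3, mu_joint)
    if best is None:  # Psi is empty: no improving join/split exists
        return centers
    _, j1, j2, j3, mu_joint = best
    # Split cluster j3 at its two mutually furthest points.
    S3 = clusters[j3]
    d = ((S3[:, None, :] - S3[None, :, :]) ** 2).sum(axis=2)
    a, b = np.unravel_index(np.argmax(d), d.shape)
    new_centers = [c for j, c in enumerate(centers) if j not in (j1, j2, j3)]
    new_centers += [mu_joint, S3[a], S3[b]]
    return np.array(new_centers)
```

In the full heuristic, this step would be repeated over the remaining elements of \(\Psi \) before restarting the ADM with the modified centers.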
7 Numerical results
In this section, we present our numerical studies, which have been carried out using the setup described in Sect. 7.1. The goal here is to evaluate the performance of the proposed methods in Algorithms 2–5. To this end, we first describe in Sect. 7.2 the clustering problem instances and the evaluation metrics that we use. For analyzing the performance of the different methods, a ground truth labeling has to be defined. These labels are provided with the test data sets used in the numerical evaluation. For the optimization, these ground truth labels are not used as k-means clustering is an unsupervised learning method, i.e., the input data is not labeled. Since all algorithms assume that initial centers are given, we briefly describe how they are obtained in Sect. 7.3. In Sect. 7.4, we show that the proposed ADM performs very well for typical nominal clustering problems. Moreover, we also evaluate the effectiveness of the heuristic described in Sect. 6. In Sects. 7.5 and 7.6, we discuss the results that we obtain by robustifying these k-means clustering problems. Finally, in Sects. 7.7 and 7.8, we go further and evaluate the performance of the proposed algorithms on real-world clustering instances.
As we will see, the strictly robust method clearly outperforms the nominal and \(\Gamma \)-robust methods. In the appendix, we also present a comparison of the strictly robust method with the spectral clustering method by Von Luxburg (2007), which is another state-of-the-art method for clustering. These results further highlight the competitiveness of the strictly robust method.
7.1 Software and hardware setup
We implemented the algorithms in Python 3.8.5 and solved the binary linear as well as the convex quadratic optimization problems with Gurobi 9.1.0. We use the special Python modules sklearn.preprocessing.MinMaxScaler, sklearn.metrics.adjusted_rand_score, sklearn.metrics.silhouette_score, and scipy.stats.wilcoxon to scale the data to the range [0, 1], to compute the ARI (see below), to compute the Silhouette score (see below), and to calculate the p values for the Wilcoxon signed-rank test (see below as well), respectively. All the computations were performed on a computer with a 3.60 GHz Intel(R) Core(TM) i3-8100 processor and 16 GB RAM. For Algorithms 2–4, the termination criterion is \(\Vert \mu ^{t+1} - \mu ^{t}\Vert _\infty < 10^{-4}\).
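For concreteness, the calls to these modules can be sketched as follows; the toy data set and the paired score vectors are hypothetical stand-ins for the actual experiments:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import adjusted_rand_score, silhouette_score
from scipy.stats import wilcoxon

# Toy data and two hypothetical assignment vectors (stand-ins for the
# nominal and robust clustering results).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
labels_a = np.repeat([0, 1, 2], 20)
labels_b = labels_a.copy()
labels_b[:5] = 1  # a few disagreements between the two assignments

X_scaled = MinMaxScaler().fit_transform(X)      # scale data to [0, 1]^p
ari = adjusted_rand_score(labels_a, labels_b)   # similarity of assignments
sil = silhouette_score(X_scaled, labels_a)      # cohesion and separation

# One-sided Wilcoxon signed-rank test on paired scores from repeated runs
# (hypothetical numbers): small p supports the robust model being better.
scores_robust = np.array([0.82, 0.79, 0.85, 0.80, 0.83])
scores_nominal = np.array([0.78, 0.77, 0.80, 0.79, 0.81])
stat, p = wilcoxon(scores_robust, scores_nominal, alternative="greater")
```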
7.2 Data sets and validation metrics
To comprehensively compare the proposed algorithms on a variety of clustering problems, we use (i) synthetically generated data for which we include some uncertainty by perturbing the data points as well as (ii) real-world data sets, which naturally contain measurement errors.
Since one of the goals of this work is to identify structural properties of clustering problems for which a robustification is beneficial, we use six synthetic clustering benchmark data sets (a, s, Unbalance, dim, g2, birch) proposed by Fränti and Sieranoja (2018). These data sets provide a perfect benchmark since they cover a wide range of typical k-means clustering problems with different degrees of overlap, density, and sparsity. Besides that, most of the data sets are two-dimensional, so we can also visualize and compare the recovered nominal and robust clustering results. For reasons of comparability, we use these data sets as a reference but exclude the data set birch with 100 clusters that leads to extremely large and challenging problems.
It is also standard in the literature to test clustering algorithms on data sets from the UCI Machine Learning Repository (Dua and Graff 2017); see, e.g., Li et al. (2016), Vo et al. (2016) and Aloise et al. (2012). Thus, to also evaluate the quality of robust solutions on high-dimensional and real-world data sets, we select 52 instances from this repository, forming a sample of clustering problems with diverse sizes and difficulties.
For the synthetic data, the ground truth partition of all data sets is publicly available in Fränti and Sieranoja (2018) and for the real-world instances, the true assignments (labels) of each data point are given as well. Thus, we choose as external validity metrics the adjusted Rand index (ARI) and the loss function value. The ARI (Hubert and Arabie 1985; Steinley 2004) measures the similarity between two assignments, ignoring permutations, within the range \([-1, 1]\). Here, 1 means perfect agreement between the two assignments.
However, in some cases, the ground truth partition corresponds to lower-quality solutions, which may lead to a subjective validation if only external metrics are used as reference. Therefore, we also use the Silhouette score as an internal validation metric (Rousseeuw 1987). In an extensive study over a variety of validation metrics, Arbelaitz et al. (2013) show that this score performs well for general clustering problems. The Silhouette score measures the quality of the clustering in terms of cohesion and separation, taking values within the interval \([-1, 1]\), where a value close to 1 means that the data set is well clustered, whereas a result close to \(-1\) means that most of the data points have been misclassified or that there is no natural clustering structure in the data set.
Further, to assess the significance of the obtained results, we apply the Wilcoxon signed-rank test to calculate p values (Wilcoxon 1945). With this, we are able to quantify the observed improvements in cluster quality when comparing the Silhouette score as well as the ARI of the nominal and robust models. If the resulting p value is smaller than the \(\alpha \)-error of 10 %, we reject the null hypothesis that the robust model performs at most as well as the nominal one in terms of the Silhouette score or the ARI, respectively. Hence, a rejection supports the hypothesis that we obtain improved clustering results for erroneous data when using robustified models.
7.3 Starting point heuristic
It is well known that a poor initialization of the centers can cause a bad clustering result. In Fränti and Sieranoja (2019), the authors study some popular initialization heuristics and test them on a benchmark library, which is the same synthetic data that we use to test our algorithms. They conclude that, on average, the "furthest point heuristic", also known as "Maxmin", reduces the clustering error of k-means. Based on their results, we compute the initial mean points \(\mu ^0\) with the "Maxmin" heuristic. The idea is to select the first center randomly within the respective bounding box and then to obtain new centers one by one. In each iteration, the next center is the point that is furthest (max) from its nearest (min) existing center.
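A minimal sketch of the "Maxmin" initialization, assuming squared Euclidean distances; this illustrates the heuristic described above and is not the authors' exact code:

```python
import numpy as np

def maxmin_init(X, k, rng=None):
    """'Maxmin' (furthest point) initialization: the first center is drawn
    uniformly at random from the bounding box of the data; each further
    center is the data point furthest from its nearest existing center."""
    rng = np.random.default_rng(rng)
    lo, hi = X.min(axis=0), X.max(axis=0)
    centers = [rng.uniform(lo, hi)]  # random point in the bounding box
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen center.
        d = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2)
                   .sum(axis=2), axis=1)
        centers.append(X[np.argmax(d)])  # furthest-from-nearest point
    return np.array(centers)
```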
7.4 Evaluation of the ADM applied to the synthetic nominal clustering problems
In this section, we discuss the performance of Algorithm 2 and its extension in Algorithm 5 applied to the synthetic data sets. To simplify the presentation, we refer to Algorithm 2 as "ADM2" and to Algorithm 5 as "ADM5". As already mentioned at the end of Sect. 5.2, the ADM in Algorithm 2 exactly mimics the classic k-means clustering algorithm. Nevertheless, we present the numerical results for the nominal cases here as well to assess the effects of the restart heuristic and to have a proper baseline for a computational comparison of the robustified versions later on. First, we evaluate the quality of the partial minima computed by ADM2 and ADM5 for 12 instances from the synthetic data sets. The results are shown in Table 1. The objective function value is denoted by \(h(X,\mu )\). We state both the value of h corresponding to the ground truth ("GT") and the value of h in a partial minimum computed with ADM2 and ADM5. The Silhouette score is computed for the ground truth assignment as well as for the ADM2 and ADM5 assignment results. The last two columns show the corresponding ARI values. In Table 2, we state the size of the clustering problem, i.e., the number of clusters k, the number of points n, and the number of attributes p. There we also present the runtime ("Time"; in seconds) and the number of iterations (denoted by "It") that ADM2 and ADM5 require to compute a partial minimum. Since the initialization heuristic for the centers is a random procedure, we apply the algorithm five times with different initializations and report the average over all five runs.
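Since Algorithm 2 mimics the classic k-means iteration, its nominal behavior can be sketched as the standard alternation between assignment and center steps, here with the \(\Vert \mu ^{t+1} - \mu ^{t}\Vert _\infty < 10^{-4}\) termination criterion from Sect. 7.1; this is an illustrative sketch, not the authors' implementation:

```python
import numpy as np

def kmeans_adm(X, mu0, tol=1e-4, max_iter=1000):
    """Classic k-means (Lloyd) alternation, which the nominal ADM mimics:
    assign each point to its nearest center, then recompute each center as
    its cluster mean. Empty clusters keep their previous center."""
    mu = np.array(mu0, dtype=float)
    for _ in range(max_iter):
        # Assignment step: nearest center in squared Euclidean distance.
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Center step: mean of each (nonempty) cluster.
        mu_new = mu.copy()
        for j in range(len(mu)):
            pts = X[labels == j]
            if len(pts) > 0:
                mu_new[j] = pts.mean(axis=0)
        if np.abs(mu_new - mu).max() < tol:  # ||mu^{t+1} - mu^t||_inf < tol
            return mu_new, labels
        mu = mu_new
    return mu, labels
```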
Looking at the results of ADM2 first, we note that for the instances a1, a2, a3, s1, s2, dim, and g2-2-30, on average, the value of h and the Silhouette score in the partial minimum are close to the ground truth and the ARI is also close to 1. This shows that the clusterings of the partial minima are similar to the ground truth clusterings for these instances. One can also see that the quality of the partial minima measured in terms of the objective function h and in terms of the Silhouette score can both be worse (e.g., instances s1 or s2) or better (e.g., instance g2-2-30) compared to the ground truth.
On the other hand, for the instances s3, s4, Unbalance, g2-2-50, and g2-2-70, on average, the ARI reveals that the corresponding clusterings have low similarity and also the value of h in the partial minima is rather different from the ground truth. However, the Silhouette score is better and the value of h is smaller for all these cases except for the instance Unbalance, for which we obtain a partial minimum of very bad quality; see also Fig. 1. Due to this, in Sect. 6, we proposed a heuristic to improve the quality of the partial minima that are computed by ADM2; see Algorithm 5.
Regarding the runtime and number of iterations, the ADM2 is very fast in computing partial minima. The median average runtime is 1.9 s (with a maximum of less than 15 s) and the ADM2 never requires more than 45 iterations on average.
Now, looking at the results of ADM5, we note that the extended algorithm is able to find the ground truth clustering of the instance Unbalance. Furthermore, all the results improved in terms of the objective function h and in terms of the Silhouette score. We also observe that the results for all g2 instances (all having \(k = 2\)) remain the same because the heuristic proposed in Sect. 6 is only applicable for \(k>2\).
Let us briefly discuss the instance s4 for which we get the lowest ARI (0.608). In this case, the objective function value is significantly smaller in the ADM5 result (compared to the ground truth; 20.125 vs. 34.544), and also the Silhouette score is better (compared to the ground truth; 0.468 vs. 0.321). Both results are shown in Fig. 2, where it is easy to see that the clusterings also qualitatively differ from each other. Nevertheless, the low values of the Silhouette score reveal that this is a difficult instance for clustering.
Regarding the runtime and number of iterations that ADM5, on average, requires to find a partial minimum, both increase due to the restart heuristic. This is expected because, in each "outer" iteration, we may have new initial centers and, if this is the case, we apply ADM2 again, leading to additional inner ADM iterations. Thus, the computational cost of obtaining better partial minima is an increase in the median average runtime from 1.9 to 5.7 s (with a maximum of 32 s for the instance a3).
We conclude that the proposed Algorithm 2 combined with the restart heuristic proposed in Algorithm 5, i.e., the combination that we denote here by ADM5, performs very well on the synthetic data sets. The algorithm is able to find partial minima of very good quality with objective function values close to or even better than the ground truth and Silhouette scores always better than or equal to the ground truth.
7.5 Strictly robust clustering results for the synthetic data
Let us start the discussion of the strictly robust clustering results with an informal note. To this end, consider the clustering instance Unbalance in Fig. 1 again. For this instance, it is obvious that the eight clusters are very well separated and not too close to each other. Only for a few points between the two clusters below and the middle cluster in the right part of the figure may it be unclear to which cluster they belong. However, the specific assignment decision for these data points will not qualitatively change the overall clustering result. In such a setting, it cannot be expected that a slight perturbation of a certain percentage of data points by a certain amount will lead to a qualitatively different result. Even if all data points are perturbed by, e.g., 1 % (Footnote 1), the clustering will stay the same although the clusters may be blurred a bit. This general observation is also supported by our preliminary numerical experiments: Clustering of well-separated sets of data points does not benefit from robustification. Moreover, in such a situation, there is also no need for robustified clustering if compared with the nominal results.
Due to this and to the fact that most of the twelve synthetic instances provided in Fränti and Sieranoja (2018) are practically well separated, we induce some uncertainty in the data points by perturbing them. More precisely, we consider an amount of perturbation of 10 % and different quantities of perturbed data points: 5 %, 30 %, and 50 %. Assuming now that the perturbed data set is the "new nominal" or given one, we apply the nominal and the two robust models. Each data set is perturbed 10 times, and every time the three approaches are applied to an instance, they start with the same given centers. The average results for all data sets over the 10 runs each are presented in Table 3, where we show the Silhouette score as well as the ARI of the clustering results on the perturbed data compared to the nominal ADM solution on the unperturbed set (notice that we do not compare to the ground truth assignment here). For each instance and each validation metric, the best average result is printed in bold. Moreover, since in some cases the differences in terms of Silhouette and ARI among the methods seem to be modest, we underline the best robust average results for which the corresponding p values (calculated for each metric and each pair of methods) are smaller than the \(\alpha \)-error of 10 % w.r.t. the nominal results. Hence, the underlined average results are statistically significantly better on a 10 % level. All the calculated p values are shown in the appendix in Table 10.
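The perturbation procedure can be sketched as follows; since the exact perturbation model is not spelled out here, the coordinate-wise uniform offsets below are an illustrative assumption:

```python
import numpy as np

def perturb(X, fraction=0.3, amount=0.1, rng=None):
    """Perturb a random `fraction` of the points of a data set scaled to
    [0, 1]^p. Each selected point is shifted by a uniformly random offset
    of up to `amount` (10 % corresponds to 0.1) in every coordinate; this
    offset model is an assumption for illustration. The result is clipped
    to stay inside [0, 1]^p."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    idx = rng.choice(n, size=int(round(fraction * n)), replace=False)
    X_pert = X.copy()
    X_pert[idx] += rng.uniform(-amount, amount, size=(len(idx), p))
    return np.clip(X_pert, 0.0, 1.0)
```

The perturbed data set returned here would then be treated as the "new nominal" input for the nominal and robustified clustering methods.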
It can be directly seen that the advantage of the strictly robust model over the nominal and \(\Gamma \)-robust models increases with the amount of perturbed data points. This is the case, e.g., for the instance Unbalance. As we induce a stronger perturbation, the clusters significantly mix with each other and the performance of the strictly robust model becomes the best in both evaluation metrics.
Let us now discuss the quality of the robustified clusterings for the instances s3, s4, a2, and a3, for which the data points are not well separated or for which the clusters are close to each other; see Fig. 3 (all other data sets are given in Fig. 8 in the appendix). To this end, we consider Fig. 4, in which we display box plots to analyze the quality of the robustified clustering results. On the y-axes, we plot the ratio of two ARI values: the ARI of the strictly robustified clustering on the perturbed data compared to the nominal ADM solution on the unperturbed set, divided by the ARI of the nominal ADM clustering on the perturbed data compared to the same reference. All ADMs have always been applied using the restart heuristic described in Sect. 6.
For all four instances we see that the quality of the strictly robustified clustering improves as the errors (in terms of the amount of perturbed points) get larger. The detailed behavior also depends on the specific instance. Thus, we discuss two exemplary ones. For a3, we see that the robustified clustering is advantageous only for the case of 50 % perturbed data points. Taking a look at Fig. 3 again, we see that the data points in instance a3 are rather well separated but the clusters are close to each other. Since all points are perturbed by 10 % if they are perturbed at all, a robustification becomes beneficial once enough clusters mix with each other. This does not seem to be the case for 30 % of perturbed points but it is for 50 %. We observed a similar behavior in our computations if the number of perturbed points stays the same and the points are perturbed by a varying amount such as 10, 20, or 30 %.
As a second example let us consider instance s4; see also Fig. 3. Here, all clusters are not well separated. In such a situation, already a smaller amount of perturbed points leads to a beneficial robustification but for larger amounts of perturbed points no significant further improvements can be seen. If we take a look at Table 3 again, we see that this is also the case for the Silhouette score. The reason most likely is that for 30 % of perturbed points the clusters significantly mix with each other and that this mix is not qualitatively increased for more perturbed points.
Now we discuss the computational performance of the robust clustering methods for the exemplary instances discussed above; see Table 4 (all other results are presented in Table 9 in the appendix), where we report the average (over ten runs with different random perturbations) runtimes and iteration counts for the ADM with restart heuristic applied to the nominal and the two robustified clustering models. (The \(\Gamma \) columns will be discussed in the next section.) It is clearly visible that the solution of the strictly robust clustering problems takes more time: the average factor is 5.9, with a minimum of 2.0 and a maximum of 12.1. Moreover, the percentage of perturbed points does not have a clear impact on the runtime. In contrast to the runtimes, the number of required iterations does not change significantly, and there are almost as many instances with decreased average iteration counts as with increased ones. Consequently, the main additional computational work is due to harder subproblems that need to be solved in every iteration of the ADM.
Thus, the bottom line is as follows. The strictly robustified clustering method is harder to compute, but the additional computational effort is reasonable since we roughly stay in the same order of magnitude compared to the runtime of the nominal method. Regarding the outcome of the robustified clustering, the results always have to be analyzed with the structure of the measured data points in mind, since different structures in the observed data may lead to different robustified results, as the two instances discussed above show.
Let us close this section on the strictly robustified clustering method with a final numerical result for the instance s3. In Fig. 5, the green triangles are the same in the left and the right plot of the figure and represent the clustering obtained by the ADM including the restart heuristic when applied to the unperturbed data set. Then, we perturb 50 % of all points by 10 % and again apply the ADM with restart heuristic considering these perturbed data points as the given ones. This results in the black (crosses) centers in the left plot of Fig. 5. It is obvious that by considering the perturbed points as "new nominal" ones, one is not able to detect the original centers of the clusters. In contrast, the strictly robustified clustering result (right plot in Fig. 5) almost perfectly recovers the original centers.
7.6 \(\Gamma \)-robust clustering results for the synthetic data
We start with the analysis of the computational performance of the \(\Gamma \)-approach; see Table 4 again. First, one can observe that the number of iterations is again in the same order of magnitude as the number of iterations required to solve the nominal as well as the strictly robust clustering problems. Interestingly, the \(\Gamma \)-approach requires fewer iterations for 11 out of 15 instances. However, the computation times are, on average, a factor of 9.4 longer than for the nominal problems, with a minimum factor of 2.7 and a maximum factor of 12.8. This means that the computation times for the \(\Gamma \)-robustified models are roughly two times longer than the ones for the strictly robust versions.
The motivation for \(\Gamma \)-robustness in Bertsimas and Sim (2004) was to introduce a concept that leads to a less conservative notion of robustness, especially compared to strict robustness. This less conservative concept has found many applications and was also shown to be useful for classification in Bertsimas et al. (2019), which is, to the best of our knowledge, the first and only application of \(\Gamma \)-robustness to classification so far.
The question now is whether a \(\Gamma \)-robustification is also beneficial in the context of clustering problems. Unfortunately, an analysis comparable to the one given in Fig. 4 does not show a benefit in general; see Table 3 again. Our analysis of the results revealed two major explanations for this. First, the aim of the \(\Gamma \)-approach is to hedge against worst-case realizations in \(\Gamma \) many parameters with the specific parameters being unknown beforehand. In particular, these \(\Gamma \) many parameters do not need to correspond to those, e.g., 30 % of the parameters that are randomly chosen and perturbed in our simulations. Thus, it is likely that the randomly perturbed parameters are not the ones the \(\Gamma \)-approach hedges against. Let us discuss this using the example of instance g2-2-50, which is shown in Fig. 6. Here, the two clusters are not well separated. The interesting data points for the \(\Gamma \)-approach most likely are the ones near the diagonal of the plot. Thus, only a small percentage of points can be expected to be the points for which a perturbation might lead to a beneficial behavior of the \(\Gamma \)-approach. In Fig. 7, we analyze the performance of the \(\Gamma \)-approach. It can be seen that the \(\Gamma \)-approach can lead to an improvement compared to the nominal and to the strictly robust approach—but only if the parameter \(\Gamma \) is chosen appropriately; see, e.g., the case in which \(\Gamma \) corresponds to 0.1 % of the data points. Unfortunately, it is not clear how to choose this parameter appropriately in advance.
Let us also briefly comment on the results reported on beneficial \(\Gamma \)-robustifications in case of SVMs in Bertsimas et al. (2019). There, the authors apply this approach in the case in which the labels are subject to uncertainty, which is not the situation here since no labels are considered. In the case that Bertsimas et al. (2019) call “feature-robust”, which corresponds to our setting here, they also use strictly robustified models.
Figure 7 also makes another issue visible, namely that the \(\Gamma \)-robustified results do not tend to the strictly robust solutions as \(\Gamma \) grows. In theory, however, this should be the case. Our experiments revealed that this is due to non-optimal partial minima computed by the ADM in Algorithm 4—although we present here the results for which the restart heuristic is used as well. This problem is not visible for the strictly robustified problems that we also solve using a tailored ADM, which might simply be explained by the overall increased complexity of the \(\Gamma \)-robustified models compared to the nominal or strictly robust models.
To sum up, we found an interesting case in which less conservatism does not lead to better results, contrary to what is often expected in robust optimization in general. Moreover, the advantageous clustering results based on the strictly robust models are even more pronounced because these models are also significantly easier to solve; see Table 4 again. Thus, if no specific value of \(\Gamma \) is available, we strongly suggest using the strictly robustified clustering method for data sets with unstructured errors. Therefore, in Sect. 7.8, we only consider the strictly robustified model for the real-world data sets.
7.7 Evaluation of the ADM applied to nominal real-world clustering problems
In this section, we evaluate the performance of Algorithm 2 with the restart heuristic applied to a sample of 52 real-world data sets taken from Dua and Graff (2017). Using the starting point procedure described in Sect. 7.3, the ADM is applied five times with different initial centers. The average (over the five runs) runtimes and iteration counts are presented in Table 5, where we also give information on the size of the instances. Again, as for the synthetic data, the ADM is very fast in computing partial minima. The median average runtime is 0.1 s and the maximum number of required iterations is, on average, less than 57. Interestingly, the ADM is significantly faster for these real-world instances than for the synthetic data sets. Thus, the proposed method seems to scale very well.
Let us now discuss the quality of the partial minima. To this end, we consider again the ARI as external validation metric since the true labels are available, and we also report the Silhouette score to evaluate how good the clustering is in terms of separation and cohesion. The average results over the five runs are shown in Table 6, where we additionally compute the Silhouette score for the true labels. First, we can see that for some instances the clustering result is very bad. Negative values for the ARI mean that the ADM assignment result has little similarity with the true assignment. However, taking a closer look at the Silhouette score for the true labels, the negative (or very small) values of the ARI occur exactly when this score is also very small. The most likely explanation for this is that the data set simply does not have an underlying clustering structure and, consequently, does not fit the k-means clustering idea well.
In contrast to this, when the Silhouette score in the true labels is larger than, e.g., 0.2, then the ARI is also closer to 1 in general. Thus, the ADM assignment result is more similar to the true labeling. In other words, if the real-world data set possesses an underlying clustering structure, then the proposed ADM combined with the restart heuristic performs very well and is able to find partial minima of good quality.
7.8 Strictly robust clustering results on real-world data sets
As seen in the last section, several of the real-world data sets among the 52 instances turned out to be instances that are not suitable for the k-means clustering method because they do not possess an underlying clustering structure. Moreover, our experiments reveal that applying the strictly robustified k-means model to these instances does not significantly improve an already bad nominal clustering result. Therefore, from now on, we only focus on those instances for which the Silhouette score for the true labels is larger than 0.2 and we also focus on the strictly robustified model here due to our previous results on synthetic data sets.
In order to compare the nominal method against its strictly robust counterpart, we apply each method five times with different seeds for the initialization procedure, and both methods start with the same given initial centers. The average results over the five runs are presented in Table 7, where the best ARI result for each data set is highlighted in bold. As for the synthetic data sets, we calculated p values to assess the significance of the obtained results. Therefore, we underline the average results for which the difference in the ARI of the nominal vs. the strictly robust method is significant (p value smaller than an \(\alpha \)-error of 10 %). All calculated p values are shown in Table 11 in the appendix.
Note that for the Silhouette score there is no big difference between the results. On the other hand, the ARI is more telling here since we compare how similar the two assignment results are to the true labeling. We can see that only for the instance mammographic-mass the nominal method outperforms the strictly robust method. For 10 out of the 15 instances, the strictly robust model performs the best, yielding clustering results that have more similarity to the true classes than the nominal clustering results do. Moreover, the underlined average results reveal that the improvement in clustering quality attained with the strictly robust method is significant. Thus, in the presence of potential measurement errors, the strictly robustified k-means clustering method is able to give more accurate results than its nominal counterpart.
The runtimes and iteration counts are listed in Table 8. Clearly, on average, the strictly robust method needs more time to find a partial minimum. However, it is still rather fast, never requiring more than 39 s on average. In contrast to the runtimes, the number of iterations does not change significantly.
We conclude that the strictly robustified k-means clustering model is also beneficial for real-world clustering instances, which likely contain uncertainties due to measurement errors. Specifically, for almost all of the instances, the strictly robust model leads to improvements regarding the recovery of the true assignment over the nominal model.
8 Conclusion
In this paper, we derived the strictly and \(\Gamma \)-robust counterparts of the k-means clustering problem to hedge clustering results against errors without known structure in the observed data. The robustification makes it possible to obtain clusterings that are closer to clusterings of the original, i.e., error-free, data although only erroneous input data can be considered. Since the nominal problem is already NP-hard, we developed tailored alternating direction methods to quickly compute (robust) feasible solutions of good quality. Since these methods only converge to so-called partial minima that may correspond to clusterings of bad quality, we developed a restart heuristic for which we experimentally show that it significantly improves the quality of the computed clusterings. Our comparison of the strictly and \(\Gamma \)-robust clustering methods reveals the interesting aspect that the less conservative \(\Gamma \)-approach is outperformed by the classic concept of strict robustness. Most importantly, the strictly robustified clustering method is able to recover clusterings of the original data even if only erroneous measurements are observed. Thus, we see that the studied robustifications can lead to more reliable models.
Let us finally sketch some open problems that we do not resolve in this paper. One issue with the \(\Gamma \)-approach is that the obtained partial minima do not correspond to globally optimal clusterings. Consequently, the development of methods to compute global optima of robustified clustering problems may be a worthwhile topic for future work. Moreover, we do not answer the question of how to derive practically meaningful uncertainty sets for the robustification; in practical applications, this may be done by means of cross-validation. Further, other uncertainty sets, such as polyhedral and ellipsoidal sets, should be considered in future work as well.
Finally, one could study whether other clustering approaches, such as spectral or hierarchical clustering, can also be robustified so that they can cope with uncertain data. However, these approaches are not based on a clear-cut optimization model, so new robustness concepts would need to be developed that are possibly not directly based on robust optimization.
Notes
Note that, before the computations, the data is scaled to \([0,1]^p\); for better readability, the amount of perturbation is stated in percent.
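The scaling to \([0,1]^p\) is a standard feature-wise min–max transformation. As a purely illustrative sketch (the paper does not specify its implementation):

```python
def scale_to_unit_cube(data):
    """Min-max scale each of the p features of the data to [0, 1]."""
    cols = list(zip(*data))  # transpose: one tuple per feature
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        tuple((x - l) / (h - l) if h > l else 0.0
              for x, l, h in zip(row, lo, hi))
        for row in data
    ]
```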
References
Alfons, A., Templ, M., & Filzmoser, P. (2013). Robust estimation of economic indicators from survey samples based on Pareto tail modelling. Journal of the Royal Statistical Society. Series C (Applied Statistics), 62(2), 271–286. https://doi.org/10.1111/j.1467-9876.2012.01063.x.
Aloise, D., Deshpande, A., Hansen, P., & Popat, P. (2009). NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75, 245–248. https://doi.org/10.1007/s10994-009-5103-0
Aloise, D., Hansen, P., & Liberti, L. (2012). An improved column generation algorithm for minimum sum-of-squares clustering. Mathematical Programming, 131, 195–220. https://doi.org/10.1007/s10107-010-0349-7
Ames, B. P. W. (2014). Guaranteed clustering and biclustering via semidefinite programming. Mathematical Programming, 147(1), 429–465. https://doi.org/10.1007/s10107-013-0729-x
Arbelaitz, O., Gurrutxaga, I., Muguerza, J., Pérez, J. M., & Perona, I. (2013). An extensive comparative study of cluster validity indices. Pattern Recognition, 46(1), 243–256. https://doi.org/10.1016/j.patcog.2012.07.021
Ben-Tal, A., El Ghaoui, L., & Nemirovski, A. (2009). Robust optimization (Vol. 28). Princeton University Press.
Ben-Tal, A., Goryashko, A., Guslitzer, E., & Nemirovski, A. (2004). Adjustable robust solutions of uncertain linear programs. Mathematical Programming, 99(2), 351–376. https://doi.org/10.1007/s10107-003-0454-y
Ben-Tal, A., & Nemirovski, A. (2001). Lectures on modern convex optimization. Society for Industrial and Applied Mathematics. https://doi.org/10.1137/1.9780898718829
Bertsekas, D. P., & Tsitsiklis, J. N. (1989). Parallel and distributed computation: Numerical methods. Prentice-Hall Inc.
Bertsimas, D., Brown, D. B., & Caramanis, C. (2011). Theory and applications of robust optimization. SIAM Review, 53(3), 464–501. https://doi.org/10.1137/080734510
Bertsimas, D., Dunn, J., Pawlowski, C., & Zhuo, Y. D. (2019). Robust classification. INFORMS Journal on Optimization, 1(1), 2–34. https://doi.org/10.1287/ijoo.2018.0001
Bertsimas, D., Pawlowski, C., & Zhuo, Y. D. (2017). From predictive methods to missing data imputation: An optimization approach. The Journal of Machine Learning Research, 18(1), 7133–7171.
Bertsimas, D., & Sim, M. (2004). The price of robustness. Operations Research, 52(1), 35–53. https://doi.org/10.1287/opre.1030.0065
Bhattacharyya, C., Pannagadatta, K. S., & Smola, A. J. (2005) A second order cone programming formulation for classifying missing data. In Proceedings of the 17th international conference on neural information processing systems (pp. 153–160). MIT Press.
Boyd, S., Parikh, N., Chu, E., Peleato, B., & Eckstein, J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1), 1–122. https://doi.org/10.1561/2200000016
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.
Burgard, J. P., Krause, J., Kreber, D., & Morales, D. (2020). The generalized equivalence of regularization and min–max robustification in linear mixed models. Statistical Papers. https://doi.org/10.1007/s00362-020-01214-z
Burgard, J. P., & Münnich, R. T. (2012). Modelling over and undercounts for design-based Monte Carlo studies in small area estimation: An application to the German register-assisted census. Computational Statistics & Data Analysis, 56(10), 2856–2863. https://doi.org/10.1016/j.csda.2010.11.002
Carroll, R. J., Ruppert, D., Stefanski, L. A., & Crainiceanu, C. M. (2006). Measurement error in nonlinear models: A modern perspective. CRC Press.
Celebi, M. E., & Aydin, K. (2016). Unsupervised learning algorithms. Springer. https://doi.org/10.1007/978-3-319-24211-8
Dasgupta, S. (2007) The hardness of k-means clustering. Technical Report CS2008-0916. University of California, Department of Computer Science and Engineering. http://cseweb.ucsd.edu/~dasgupta/papers/kmeans.pdf
Davalos, S. (2017). Big data has a big role in biostatistics with big challenges and big expectations. Biostatistics and Biometrics Open Access Journal, 1(3), 1–2. https://doi.org/10.19080/BBOAJ.2017.01.555563
Dua, D., & Graff, C. (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml
Fischetti, M., & Monaci, M. (2009) Light robustness. In Ahuja, R. K., Möhring, R. H., & Zaroliagis, C. D. (Eds.) Robust and online large-scale optimization: Models and techniques for transportation systems (pp. 61–84). Springer. https://doi.org/10.1007/978-3-642-05465-5_3
Fraiman, R., Ghattas, B., & Svarc, M. (2013). Interpretable clustering using unsupervised binary trees. https://doi.org/10.1007/s11634-013-0129-3
Fränti, P., & Sieranoja, S. (2018). k-means properties on six clustering benchmark datasets. Applied Intelligence, 48(12), 4743–4759. https://doi.org/10.1007/s10489-018-1238-7
Fränti, P., & Sieranoja, S. (2019). How much can k-means be improved by using better initialization and repeats? Pattern Recognition, 93, 95–112. https://doi.org/10.1016/j.patcog.2019.04.014
Gabay, D., & Mercier, B. (1976). A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1), 17–40. https://doi.org/10.1016/0898-1221(76)90003-1
Geißler, B., Morsi, A., Schewe, L., & Schmidt, M. (2015). Solving power-constrained gas transportation problems using an MIP-based alternating direction method. Computers & Chemical Engineering, 82, 303–317. https://doi.org/10.1016/j.compchemeng.2015.07.005
Geißler, B., Morsi, A., Schewe, L., & Schmidt, M. (2017). Penalty alternating direction methods for mixed-integer optimization: A new view on feasibility pumps. SIAM Journal on Optimization. https://doi.org/10.1137/16M1069687
Geißler, B., Morsi, A., Schewe, L., & Schmidt, M. (2018). Solving highly detailed gas transport MINLPs: Block separability and penalty alternating direction methods. INFORMS Journal on Computing, 30(2), 309–323. https://doi.org/10.1287/ijoc.2017.0780
Glowinski, R., & Marroco, A. (1975) Sur l’approximation, par éléments finis d’ordre un, et la résolution, par pénalisation-dualité d’une classe de problèmes de Dirichlet non linéaires. In ESAIM: Mathematical Modelling and Numerical Analysis - Modélisation Mathématique et Analyse Numérique 9.R2 (pp. 41–76). http://eudml.org/doc/193269
Gorski, J., Pfeuffer, F., & Klamroth, K. (2007). Biconvex sets and optimization with biconvex functions: A survey and extensions. Mathematical Methods of Operations Research, 66(3), 373–407. https://doi.org/10.1007/s00186-007-0161-1
Grira, N., Crucianu, M., & Boujemaa, N. (2004). Unsupervised and semi-supervised clustering: A brief survey. A Review of Machine Learning Techniques for Processing Multimedia Content, 1, 9–16.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2, 193–218. https://doi.org/10.1007/BF01908075
Khoury, M. J., & Ioannidis, J. P. A. (2014). Big data meets public health. Science, 346(6213), 1054–1055. https://doi.org/10.1126/science.aaa2709
Li, J., Song, S., Zhang, Y., & Zhou, Z. (2016). Robust k-median and k-means clustering algorithms for incomplete data. Mathematical Problems in Engineering. https://doi.org/10.1155/2016/4321928
Liebchen, C., Lübbecke, M., Möhring, R., & Stiller, S. (2009). The concept of recoverable robustness, linear programming recovery, and railway applications. In Ahuja, R. K., Möhring, R. H., & Zaroliagis, C. D. (Eds.), Robust and online large-scale optimization: Models and techniques for transportation systems (pp. 1–27). Springer. https://doi.org/10.1007/978-3-642-05465-5_1
Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137. https://doi.org/10.1109/TIT.1982.1056489
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Volume 1: Statistics (pp. 281–297). University of California Press. https://projecteuclid.org/euclid.bsmsp/1200512992
Mahajan, M., Nimbhorkar, P., & Varadarajan, K. (2012) The planar k-means problem is NP-hard. In Theoretical computer science 442. Special issue on the workshop on algorithms and computation (WALCOM 2009) (pp. 13–21). https://doi.org/10.1016/j.tcs.2010.05.034
Pant, R., Trafalis, T. B., & Barker, K. (2011) Support vector machine classification of uncertain and imbalanced data using robust optimization. In Proceedings of the 15th WSEAS international conference on computers (pp. 369–374). World Scientific, Engineering Academy, and Society (WSEAS).
Pirinen, A., & Ames, B. (2019). Exact clustering of weighted graphs via semidefinite programming. The Journal of Machine Learning Research, 20(1), 1007–1040.
Rocke, D. M., Ideker, T., Troyanskaya, O., Quackenbush, J., & Dopazo, J. (2009). Papers on normalization, variable selection, classification or clustering of microarray data. Bioinformatics, 25(6), 701–702. https://doi.org/10.1093/bioinformatics/btp038
Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65. https://doi.org/10.1016/0377-0427(87)90125-7
Schewe, L., Schmidt, M., & Weninger, D. (2020). A decomposition heuristic for mixed-integer supply chain problems. Operations Research Letters, 48(3), 225–232. https://doi.org/10.1016/j.orl.2020.02.006
Soyster, A. L. (1973). Technical note-convex programming with set-inclusive constraints and applications to inexact linear programming. Operations Research, 21(5), 1154–1157. https://doi.org/10.1287/opre.21.5.1154
Steinley, D. (2004). Properties of the Hubert–Arabie adjusted Rand index. Psychological Methods, 9(3), 386–396. https://doi.org/10.1037/1082-989X.9.3.386
Su, Y., Reedy, J., & Carroll, R. J. (2018). Clustering in general measurement error models. Statistica Sinica, 28(4), 2337.
Trafalis, T. B., & Gilbert, R. C. (2007). Robust support vector machines for classification and computational issues. Optimization Methods and Software, 22(1), 187–198. https://doi.org/10.1080/10556780600883791
Vo, X. T., Le Thi, H. A., & Pham Dinh, T. (2016) Robust optimization for clustering. In Intelligent information and database systems (pp. 671–680). Springer. https://doi.org/10.1007/978-3-662-49390-8_65
Von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4), 395–416. https://doi.org/10.1007/s11222-007-9033-z
Wendell, R. E., & Hurter, A. P. (1976). Minimization of a non-separable objective function subject to disjoint constraints. Operations Research, 24(4), 643–657. https://doi.org/10.1287/opre.24.4.643
White, E. (2011). Measurement error in biomarkers: Sources, assessment, and impact on studies. IARC Scientific Publications, 163, 143–161.
Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1(6), 80–83. https://doi.org/10.2307/3001968
Yamada, K., Takayasu, H., & Takayasu, M. (2018). Estimation of economic indicator announced by government from social big data. Entropy, 20(11), 852–864. https://doi.org/10.3390/e20110852
Acknowledgements
The authors thank the DFG for their support within RTG 2126 “Algorithmic Optimization”.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Appendix
In this section, we present additional information that complements the results given in Sect. 7.5 and Sect. 7.8. Specifically, in Fig. 8, we present the plots of the remaining instances; in Table 9, we present the computational performance for the instances not discussed in Sect. 7.5; and in Tables 10 and 11, we present the calculated p values for each validation metric and each pair of methods.
Additionally, we present a comparison of our nominal as well as strictly robust clustering method with a spectral clustering method, which we apply to the same test sets considered in this paper. The setup of the experiment is as follows. We apply each method to the same instance and assess the quality of the clustering by computing the Silhouette index and the ARI. Here, we always compare the clustering result to the ground-truth labels instead of comparing to the ADM results. To assess the significance of the obtained results, we apply the Wilcoxon signed-rank test to calculate p values. If the resulting p value is smaller than the \(\alpha \)-error of 10%, we reject the null hypothesis that the strictly robust result is at most as good as the spectral one in terms of the Silhouette score or the ARI, respectively. To compute the spectral clustering, we use the Python module sklearn.cluster.SpectralClustering. The results for the synthetic instances are presented in Table 12 and the results for the real-world instances are shown in Table 13; the respective computed p values are presented in Tables 14 and 15.
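The one-sided Wilcoxon signed-rank test used here can be sketched as follows. This illustrative version uses the normal approximation to the null distribution and omits the tie correction to the variance; the actual experiments may rely on a library implementation with exact small-sample behavior:

```python
from math import erf, sqrt

def wilcoxon_signed_rank_p(x, y):
    """One-sided Wilcoxon signed-rank test (normal approximation).

    Tests whether the paired scores in x tend to exceed those in y;
    returns the upper-tail p value.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1..j+1
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    # W+ statistic: sum of ranks of the positive differences.
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mean = n * (n + 1) / 4
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)  # tie correction omitted
    z = (w_plus - mean) / sd
    return 0.5 * (1 - erf(z / sqrt(2)))  # P(W+ >= w_plus), upper tail
```

If the returned p value falls below the chosen \(\alpha \)-error, the null hypothesis that the first method is at most as good as the second is rejected.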
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Burgard, J.P., Moreira Costa, C. & Schmidt, M. Robustification of the k-means clustering problem and tailored decomposition methods: when more conservative means more accurate. Ann Oper Res 339, 1525–1568 (2024). https://doi.org/10.1007/s10479-022-04818-w
Keywords
- k-means clustering
- Alternating direction methods
- Robust optimization
- Strict robustness
- \(\Gamma \)-robustness