Abstract
Imbalanced class distributions are common in real-world scenarios, including datasets with multiple labels. One widely acknowledged approach to addressing imbalanced distributions is through oversampling, a technique that both balances the class distribution and improves the effectiveness of classification models. However, when generating synthetic data for multi-label datasets, complexities arise due to the presence of multiple-label sets, which require careful placement and labeling. We propose MLCSMOTE-FRST, an algorithm for synthetic data generation based on label-specific clustering and fuzzy rough set theory. Generation ratios and dependency samples are provided by clusters specific to each label, with a focus on the overall label distribution and the distribution within each cluster. The labels are supported by intra-cluster positive samples, determined using fuzzy rough set theory, which helps to capture the consensus label set. Experimental results on multi-label datasets using four classifiers demonstrate the effectiveness of the proposed method in terms of macro-F1 and micro-F1 scores.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
Introduction
Supervised learning involves multi-label [18], multi-criteria [14, 32], and multi-class components. The goal of multi-label classification is to predict a set of labels based on features. Taking news classification in natural language processing as an example, a message may contain a set of labels such as politics, economics, and taxation. The number of labels assigned to a sample is typically much smaller than the total number of possible labels, and the number of labels often varies. In some applications of multi-label learning, such as compound fault detection [19, 57], the accurate detection of samples belonging to minority classes is crucial. Detecting samples corresponding to different labels, especially minority classes, has become a major challenge in imbalanced multi-label learning [45].
Two main approaches have been proposed to address the issue of class imbalance and effectively identify the different classes: algorithm-level [17] and data-level [28, 41]. By refining elements such as the loss function, the algorithm-level approach helps classifiers prioritize minority classes. This method is more challenging to assess across a variety of classifiers because it focuses on only some specific classifiers. The data-level approach uses methods such as downsampling and oversampling to balance the label distribution and works well for various classifiers. Creating synthetic data for multi-label datasets presents several challenges.
We expand upon the above issues.
Generation ratio. The effectiveness of adjusting the sample generation fraction to strengthen small disjunct sections was highlighted in earlier studies on oversampling [16]. The cluster-based method, kmeans-SMOTE [10], has been successfully tested in imbalanced binary problems. It determines the generation ratio of samples based on clustering information and performs interpolation operations within each cluster. When directly applied to multi-label learning, this method often results in an uneven distribution of minority-label clusters, which makes it more challenging to define the intra-cluster generation ratio.
Relative samples. The position of the synthetic data in SMOTE is determined by the selection of relative samples. The classifier exhibits sensitivity to generated noise samples [20], which raises the question of whether the synthetic data are connected to the spatial distribution of minorities.
Label-set-specific. While MLSOL [26] and MLSMOTE [6] use local data for label assignment, they rarely take into account the complexities of the label set. To expand the distribution of the created minority labels, the label set should be applied based on the minority-relevant samples. Label revisions for synthetic examples should increase the proportion of borderline samples.
To address the above challenges, we designed a synthetic data method, \(\textbf{M}\)ulti-\(\textbf{L}\)abel \(\textbf{C}\)luster \(\textbf{SMOTE}\)-\(\textbf{F}\)uzzy \(\textbf{R}\)ough \(\textbf{S}\)et \(\textbf{T}\)heory (MLCSMOTE-FRST). At the attribute level, we apply the principle of multi-label learning with label-specific features (LIFT) [54] to ensemble the distance and distribution of clusters in imbalanced multi-label oversampling. The generation ratio is determined by the average distance between the clusters that are specific to each label and the imbalance ratio. This prioritizes the positioning of samples within clusters with a higher imbalance ratio for generation. At the label level, we utilize seed samples and intra-cluster samples to develop a label revision method based on fuzzy rough set theory. The positive samples within the cluster are used to assign labels to the synthetic data to strengthen the minority labels. Empirical evaluations across 12 multi-label datasets employing four classifiers underscore the effectiveness of our approach.
This paper makes the following contributions:
-
We propose a new approach to optimizing attribute-level and label-level synthetic data generation;
-
At the attribute level, we initially use MLCSMOTE to determine the proportion and position of generation utilizing label-specific clusters;
-
At the label level, we propose using fuzzy rough set theory and intra-cluster samples to determine the label set to generate simulated data, short for FRST. MLCSMOTE-FRST shows an improvement over other oversampling methods on 12 datasets.
The remainder of this paper is organized as follows. Section “Related work” discusses related work, including imbalanced multi-label learning, cluster SMOTE methods, and fuzzy rough set theory. Section “Proposed method” introduces our proposed method. Section “Experiments” discusses our experiments and analyzes the results. Section “Conclusion” relates our conclusions.
Related work
Imbalanced multi-label learning
An imbalanced class distribution is often encountered in multi-label datasets. There are both algorithm-level and data-level methods to effectively extract information from minority classes.
Algorithmic-level methods focus on optimizing the weights of minority samples by improving multi-label classifiers. DB-Focal [49] optimizes long-tail image recognition by adjusting loss weights and reducing the suppression of negative-class labels during model training. It has been validated on publicly available image datasets such as COCO-MLT and VOC-MLT. Chen et al. [8] introduced the AdaBoost architecture to the application of street scene recognition, with the assumption that minority samples will progress to the subsequent stages of AdaBoost and have class weight loss added in those stages. Duan et al. [11] exploited the mining ability of an ensemble of classifier chains [38] in high-order label correlations and combined it with binary-class classifiers in a general method for dealing with imbalanced data. MLHC [12] used hierarchical clustering to transform the multi-label into multi-class to optimize imbalance, improving upon previous algorithms that only considered random correlations. Data-level interventions foster class balance through either downsampling or oversampling methods and are valued for their flexibility in improving the performance of most classifiers.
MLeNN [4] and MLTL [35] use the Tomek link method to remove the majority samples that overlap with others. Roseberry et al. [39] incorporated downsampling into the identification of data streams, enabling quick adaptation to their changes by removing data. The oversampling algorithm extends minority samples by generating synthetic samples, through either repetitive or SMOTE-based approaches. LP-ROS and ML-ROS [5] perform repetitive oversampling by redefining the minority labels, while REMEDIAL [7] clones samples and decouples the labels. MLSMOTE [6] and MLSOL [26] differ in that MLSOL utilizes local label inconsistencies to design the data-generation weights and reduce noise interference with SMOTE. EMLSOL [27] combines MLSOL with downsampling. LCOS [53] considers label correlations and assigns weights to synthetic instances based on their distance information. The introduction of label correlation in multi-label imbalance problems is a current research trend, while the rational use of fuzzy rough sets has not yet been proven.
Cluster-based SMOTE methods
SMOTE can effectively balance the number of classes by interpolating between samples. It has been successfully applied such as in binary classification [52], multi-classification [33], and regression [2].
The basic steps of SMOTE include finding the relative instances around the seed instance using kNN and generating synthetic samples between them. After obtaining a predetermined number of synthetic instances, the classifier is trained using the newly balanced dataset. The SMOTE operation interpolates \(x_{related}\) randomly selected from the neighborhood samples of \(x_{seed}\),
where \(\lambda \) is a random value ranging from 0 to 1. Effective acquisition or generation of features is crucial for pattern recognition or deep learning tasks, such as camera position estimation [30], genomic data analysis [21], and satellite reflectance image processing [50]. Some studies have investigated factors like the position of SMOTE-generated features. Fernández et al. [16] divided the traditional SMOTE optimization direction into seven objectives, where the focus optimization directions include assigning sample generation weights and selecting relative instances. These directions are directly related to the generation results, as they reduce the impact of noisy samples and strengthen the weights of boundary samples.
k-meansSMOTE [10] specifies the proportion of different clusters generated in a binary classification problem using the results of k-means clustering. It restricts relative instances to samples that are within a cluster. This assumption is based on the idea that the minority may exhibit a large proportion of similar features across different samples. Dense clusters can be obtained by k-means, assigning a high generation proportion and using within-cluster relative instances for improvement.
Zhang et al. [59] applied cluster-based SMOTE for data augmentation in a fault diagnosis dataset. We note that k-means clustering can be significantly impacted in fault diagnosis applications due to the existence of outliers, which are replaced with k-medoids. OEGFCM-SMOTE [15] recognizes that k-means is sensitive to outlier and boundary instances and that manually selecting hyperparameters has a greater impact on the results. Applying fuzzy c-means discovers non-convex shapes or clusters of different sizes and utilizes entropy to evaluate the quality of synthetic samples and determine optimal parameters via a nonlinear mixed optimization problem.
LR-SMOTE [24] combines data cleaning with cluster-SMOTE. It removes misclassified minority samples from SVM if there are no other minority samples around them to build k-meansSMOTE. Besides using a classifier to filter samples, SMOTE-COF [31] uses the local information center offset factor as a method to remove noisy instances. For the processed dataset, the samples are generated using the intra-cluster instances as the reference instances for SMOTE. This method has the advantage of generating samples that are free from noise.
Fuzzy rough set application
In a fuzzy set [51], an element can belong to a set to a degree; in a rough set [34], imprecision is expressed by a boundary region. Algorithms related to fuzzy sets are widely used in practical applications including dynamic event-triggered mechanisms [42], dual cyber attacks [58], and composite adaptive control [43]. Fuzzy rough set theory [13] deals with indiscernibility in data. The measure calculates similarity between instances based on a fuzzy relation. A positive region (POS) describes the degree of membership between samples \(x_{1}\) and \(x_{2}\),
where \(q_{x_{1}}(x_{2})\) measures the quality of \(x_{1}\) to \(x_{2}\). In SMOTE application, [47] proposed FRIPS-SMOTE-FRBPS, which is based on SMOTE and fuzzy rough theory. To minimize the impact of noise samples on the classifier, majority and synthetic instances that are identified as noise by fuzzy rough set theory are systematically removed. SMOTE-FRST-2T [36] assumes that some synthetic data may not be similar to the minority class. POS is defined as the similarity between the majority class and synthetic samples. SMOTE-based synthetic samples get the POS with the center of the class and filter simulated data accordingly. In previous studies, the application of fuzzy rough set theory in SMOTE has mainly been used in binary problems to remove synthetic samples. In a multi-label problem, fuzzy rough set theory determines a membership degree C based on the similarity between attributes and labels. The upper and lower approximated membership degrees of an element \(x_{1}\) are, respectively, computed as
where \(\mathcal {I}\) is a fuzzy implicator mapping \([0,1]^2 \rightarrow [0,1]\) to decrease in the first and increase in the second argument. \(\mathcal {T}\) is a fuzzy t-norm, which is a commutative and associative mapping \([0,1]^2 \rightarrow [0,1]\). To examine the relationship between samples with labels in local information, FRONEC [48] predicts labels by identifying the training instance with the highest quality in relation,
To establish the association between \(x_{1}\) and the label set, Eq. (5) establishes \(R(x_{1}, x_{2})\) based on the association of the attributes. It determines the potential label set using \(x_{2}\) and \(R_{d}(x_{2}, x_{3})\).
In the application research of fuzzy set theory and classifiers, [22] applied fuzzy membership to calculate the output weights of the extreme learning machine. This allows for the weighting of corresponding classes of training samples. Enlve et al. [29] extended fuzzy set theory to antecedent learning and online active learning based on the classical Takagi Sugeno model architecture. Further research is needed on the application of fuzzy rough set theory to synthetic samples in multi-label datasets.
Proposed method
MLCSMOTE-FRST has three steps: constructing the generation ratio related to label-specific clusters; determining the attributes of the synthetic samples using intra-cluster samples and SMOTE; and revising the label set using intra-cluster samples and fuzzy rough set theory.
We use the following notation. \(D=\{ \{x_{i},y_{i} \},1<i<n\}\) is a multi-label dataset. The labels L have dimension q, i.e., \(L = \{ l_{1},\ldots ,l_{j},\ldots ,l_{q} \}\). \(Y_{j}\) is the j-th label dimension, and \(y_{ij}\) is the value of the i-th sample in the j-th label dimension.
Generation ratio via label-specific cluster
In contrast to clustering for binary classes, the long-tail label set is sparser in the global distribution. There are two problems with using k-means clustering. Computing the imbalance ratio in the cluster using possible label sets requires the enumeration of all label sets, and to achieve the condition of \((N_{min} > N_{max})\) in k-means SMOTE is challenging due to the small size of the minority-label set within a cluster.
Inspired by LIFT, MLCSMOTE generates label-specific clusters, to create distinct clusters for each label. For \(Y_{j}\) belonging to Y, assume that samples are constructed as \(P_{j}\), and the others are \(N_{j}\),
A minority label in \(Y_{min}\) is expressed as \(IR > 1\), where \(IR = \vert N_{j}| / |P_{j} \vert \). MLCSMOTE defines the number of label-specific clusters, using \( \vert P_{j} \vert \) to fit the distribution of data under each label,
where \(r \in [0,1] \) is a parameter to control the number of clusters. Some methods use k-medoids [59] to mitigate the sensitivity of k-means to noisy samples. We apply k-means to better capture samples corresponding to outlier instances. When the outlier is far from the distribution of \(P_{j}\), k-means tends to assign these samples to separate clusters based on the change in mean value while maximizing expectation.
By performing clustering on \(P_{j}\), we can obtain two properties for a cluster: the number of samples in \(P_{j}\) and the average distance \(d_{j}\) of positive instances to the center. In multi-label imbalance learning, our focus is on the information of the clusters corresponding to \(N_{j}\). It can determine to which cluster \(N_{j}\) belongs using the trained k-means model \(f_{j}\), where
represents the i-th sample to which \(x_{i}\) belongs in the j-th label. Based on this result, MLCSMOTE obtains the number of \(N_{j}\) within the cluster \(CL_{i,j}\), further get the score \(s_{j}^{m}\) for the m-th cluster in the j-th label,
where \(d_{j}^{m}\) is the average distance in the j-th label and m-th cluster. There are two issues with Eq. (9). First is to describe the cluster density. A larger value of \(d_{j}^{m}\) indicates a lower concentration of the minority and a sparser distribution of clusters. Compared with dense clusters, which are better suited for neighborhood-based classifiers [55], increasing the weight for sparse clusters will help the classifier focus on this type of sample. Second is to calculate the imbalance ratio within a cluster. MLSCMOTE uses cluster prediction for \(N_{j}\) to describe IR for each cluster. The larger the IR, the clusters tend to belong to a safe or boundary area. When IR is extremely low, the weight of a generation in a cluster will decrease. In some clusters, where there is only one positive instance, the weight will be zero.
It is necessary to normalize \(s_{j}^{m}\). While \(x_{i}\) belongs to the m-th cluster in label j, it will assign \(s_{j}^{m}\) to \(x_{i}\). When label j does not belong to \(Y_{min}\), it is not included in Eq. (9) and the assignment of the corresponding x. In SMOTE operations, weights are required to fine-tune the generation weights for each minority instance. MLCSMOTE aims to further emphasize the generation of weights for extreme minority classes. An initial straightforward way to obtain the weight of instances \(x_{i}\) is using IR for different labels,
MLCSMOTE uses all the minority labels to build \(w_{i}\), reducing the interference of IR from the majority labels. Obviously, there is no need to assign generation weights for instances where only majority classes are available. Compared with oversampling methods like MLSOL or EMSOL, MLCSMOTE does not require specific parameters to define outlier, boundary, and safe instances.
Attribute-level generation via smote
In the previous stage, we obtained the weights for the generated samples. In binary classification, cluster-based SMOTE selects samples in the same cluster as relative instances to increase the number of synthetic samples. This also helps prevent the generation of noisy instances. For multi-label data, relative instance selection is an important step in SMOTE operations. The majority of instances will disturb the synthetic instances. We propose using samples in label-specific clusters in the SMOTE operation,
where \(x_{sys}\) is the synthetic features, \(x_{seed}\) is the seed data sampling by Eq. (10), and \(x_{relative}\) in Eq. (11) is the neighbor and intra-cluster positive samples from \(x_{seed}\). We use kNN to obtain the set of neighbor instances \(x_{relative}\) for the minority sample \(x_{seed}\). If the sample set \(x_{relative}\) is in the label-specific cluster to \(x_{seed}\), it will be prioritized for SMOTE operations. Instances with the same minority-specific clusters are close to each other, and synthetic samples can be generated for samples without unrelated labels. We illustrate this in Fig. 1.
Label revision via fuzzy rough set theory
We develop our label revision algorithm for synthetic samples using fuzzy rough set theory (FRST). FRST is applied to determine the relationship between synthetic instances and labels, and labels of synthetic instances with quality measures are revised.
R and \(R_{d}\) are calculated in Eq. (5). R establishes the relationship between synthetic samples and minority instances based on their attributes. \(R_{d}\) utilizes label correlation to establish the corresponding information between the minority and synthetic samples. We use the distribution \(p_{min}\) obtained from \(D_{min}\) to find \(R_{d}\), where \(D_{min}\) represents samples in minority classes,
where \(L_{1}\) and \(L_{2}\) are labels in two different samples, and j traverses all labels. \(R_{d}\) is the similarity relation for the label set. The \(\bigtriangleup \) operator constructs the symmetric difference between its two arguments, while \( \vert \cdot \vert \) measures the cardinality of the resulting set. \(p_{j}\) is the prior probability of the j-th label in the entire dataset, and \(p_{min,j}\) is calculated based on \(D_{min}\).
The SMOTE operation in binary imbalance generates results that address the imbalanced distribution. FRST takes into account the minority label set to minimize the influence of the majority class on label assignments. Equation (12) describes the distribution of label j in \(D_{min}\). If the label is common in \(D_{min}\) and D, then minority labels are present, and the penalty for the label is reduced by \(\alpha \). Labels with more occurrences in \(D_{min}\) are rewarded by \(-(\alpha * p_{min,j})\).
It has been noted that the noise phenomenon is common in imbalance learning [23], especially affecting the local information of samples belonging to extreme minority labels. We apply a noise-tolerant computing method in fuzzy rough set theory, with R and \(R_{d}\). Overall Weighted Average (OWA) [9] aggregation is used to calculate the upper and lower approximations of the membership degree between instance \(x_{1}\) with labels. We leverage positive samples within clusters to aid in computing OWA and determine the consensus label set, as illustrated in Eq. (16). Querying the consensus label set corresponding to the samples within the cluster enhances the relationship between synthetic data and the minority and intra-cluster samples. The upper bound of Eq. (17) represents the acceptance probability, while the lower bound of Eq. (18) represents the rejection probability of a fuzzy set.
where \(N_{min}(x)\) represents the k neighbors around synthetic sample \(x_{sys}\), which includes minority labels. \(W_{U}\) and \(W_{L}\) are the respective minimum and maximum operators. As \(W_{L}\), we use the exponential weight in the comparative experiment, and \(W_{U}\) is the inverse of \(W_{L}\). In FRST, quality measure Q has calculated scores that are larger than the average value by \(\overline{C}\) and \(\underline{C}\) is considered to be the label. Synthetic samples retain the label of their interpolated seed instances to strengthen the capability of the borderline. The process of MLCSMOTE-FRST is shown in Fig. 2.
Computational complexity
The algorithm can be divided into the stages of calculating the distribution of samples for each label, performing k-means clustering on minority classes, conducting SMOTE operations to generate synthetic instances, and revising labels. The computational complexity of labels can be expressed in terms of O(qn) to determine whether the label represents a minority and the corresponding IR. The k-means algorithm is known to have a complexity of O(knT), where T is the number of iterations and k is the number of clusters. With q labels, the complexity is O(qknT). In the SMOTE algorithm, to find the nearest neighbor samples of minority instances has computational complexity \(O(n^{2}d+kn^{2})\). After obtaining the data of kNN, the generation process has computational complexity \(O(pn(q+d))\), where p is the predetermined generation quantity.
The computational complexity of attribute generation is \(O(qn + qknT + n^{2}d+kn^{2} + pn(q+d))\). Usually, n is larger than q and d, so the main complexity of MLCSMOTE lies in kNN.
In the process of label revision, kNN is computed in the previous step, and OWA has computational complexity \(O(k\log (k))\) due to the sorting step in its definition. With OWA computed, FRST must construct the revised label using \(N_{min}\) and label set m. The computational complexity for FRST is \(O(N_{min}dmk\log (k))\).
In terms of algorithmic complexity, the oversampling algorithm can proceed by first defining the minority samples, generating samples using SMOTE, and implementing the proposed label correction method. LP-ROS, ML-ROS, and REMEDIAL focus on defining minority samples. LP-ROS and REMEDIAL utilize mean size and SCUMBLE to define the minority class and perform sample cloning, respectively. Following these two steps, the algorithmic complexity can be approximated as \(O(qn+n)\). ML-ROS requires the generation of a proportion p for control, with algorithmic complexity O(qnp). MLSMOTE and MLSOL require the additional use of kNN to find SMOTE samples. This, in combination with the step of defining the minority samples, has computational complexity \(O(qn+n^{2}d+kn^{2})\) for the former step. The latter additionally requires the computation of weights, with complexity \(O(qn+n^{2}d+kn^{2}+nkq)\). In contrast to MLSMOTE, MLCSMOTE-FRST adds a computational process of fuzzy rough sets.
The pseudocode is shown in Algorithm 1.
Experiments
Experimental framework
Twelve datasets provided by scikit-mutilearn [44] were used to verify the effectiveness of MLCSMOTE-FRST. The split ratio between the training and testing sets ranges from 0.51 to 9. These include a high feature dimension of 47236, and a high label dimension of 174. There are significant differences in the distributions and the dimensions of labels observed in our validation dataset, which can effectively validate our algorithms.
We also use MeanIR, CVIR [3] and MeanIMR, CVImR [25] for the training set to analyze the statistical changes that occur after applying the data generation methods. MeanIR and MeanIMR quantify the degree of imbalance. MeanIR focuses on the ratio of the maximum label’s number to each label, while MeanIMR is the mean of positive and negative ratios under each label. MeanIMR is more relevant to our study, as it applies to both positive and negative samples. CVIR and CVImR represent the variances of individual labels for MeanIR and MeanIMR, respectively, which describe the differences in label distributions. Smaller values of MeanIR and MeanIMR indicate a lower degree of imbalance, and smaller values of CVIR and CVImR indicate less variation in the number of labels. The dataset information is shown in Table 1.
Due to MLCSMOTE-FRST focus on validating metrics related to minority groups and multi-label classification, it can be used macro-F1, and micro-F1 and check the minority-related performance. Macro-F1 is more sensitive to minorities because it considers all labels as equal and calculates the average F1-Score. Micro-F1 is used to assess the impact of synthetic samples by calculating the accuracy and recall for each sample. We also use example-based accuracy (ACC) and hamming loss to measure the overall accuracy and percentage of incorrectly predicted labels:
where TP, FP, and FN are the sums of the corresponding values of each class, y and \(\hat{y}\) are the true and predict values.
We demonstrate the impact of MLCSMOTE-FRST on various datasets and classifiers by comparison with other multi-label oversampling algorithms divided into two groups: repetitive oversampling (LP-ROS, ML-ROS, REMEDIAL) and SMOTE-based oversampling (MLSMOTE, MLSOL), which have been described in section “Imbalanced multi-label learning”. We used four classifiers to validate the generated data metrics. These include the problem transformation methods of binary relevance (BR) [56], random k-labelsets (RAkEL) [46], and classifier chains (CC) [37]; and an algorithm adaptation method, multi-label k-nearest neighbor (MLkNN) [55]. Random forest was used as the underlying classifier where needed. The parameters for classifiers were the default settings supported by scikit-mutilearn. All experiments were run five times, and metrics were averaged to minimize the impact of random numbers.
The above oversampling methods all used an oversampling ratio of 0.25. MLSMOTE, MLSOL, and MLCSMOTE are SMOTE-based, with the number of neighbors set to 10. To validate the accuracy of the attribute-level method, we employed a similar approach to MLSMOTE, referred to as MLSCMOTE, to assign labels to the generated data in section “Attribute-level generation via smote”. The cluster ratio r in MLCSMOTE-FRST was set to 0.1.
Results of sampling methods
Due to space limitations, the full experimental results are presented in the supplementary material,Footnote 1 including the classification performance of MLCSMOTE-FRST against eight baselines in 12 benchmarks on four classifiers and three evaluation metrics. We list the metrics of the relevant algorithms under the RAkEL classifier in Table 2. RAkEL achieves the optimal average macro-F1 and micro-F1 compared with other classifiers without the need for synthetic samples. Similar results are found in [40]. Results of the Friedman test is list in Table 3. Nemenyi post hoc analysis with mean rankings are listed in Tables 4 and 5.
Performance of macro-F1 and micro-F1. RAkEL, as the optimal base classifier, refers to the in-depth analysis. MLCSMOTE-FRST achieved the top macro-F1 and micro-F1 scores in the nine datasets, demonstrating strong competitiveness. Only in Scene and Medical, there exist decreases in metrics relative to MLCSMOTE, reflecting the effectiveness of using fuzzy rough set theory. MLCSMOTE is generally considered inferior to MLCSMOTE-FRST but superior to other methods, particularly in terms of macro-F1. It has been shown that the use of label-specific clusters to locate the generated positions and proportions is effective. REMEDIAL, however, has the worst performance for metrics related to minorities. The direct cloning of samples and assignment of different labels has a negative impact on metrics that are sensitive to sample labels.
Performance of ACC and hamming loss. MLCSMOTE-FRST significantly enhances ACC, with the highest average ranking in all four classifier experiments. After applying label revision, improvements were observed in all seven datasets in Table 2. This demonstrates that our algorithm prioritizes the performance of the minority class and reduces false-positives of the majorities. Hamming loss, as a global metric for multi-label problems, is typically minimally affected by oversampling algorithms. MLCSMOTE-FRST shows limited improvements. By modifying labels using fuzzy rough set theory, MLCSMOTE-FRST improves the hamming loss compared with MLCSMOTE on 10 datasets.
To further verify the impact of base classifiers on different metrics, we present the p-value of the Friedman test in Table 3. There is a significant difference in the results between algorithms when the p-value is less than 0.05.
The p-value indicates significant differences across all classifiers. The mean ranks and post hoc test results are presented in Tables 4 and 5.
MLCSMOTE-FRST achieves optimal average rankings in both macro-F1, micro-F1, and ACC, reflecting its effectiveness under different base classifiers. Macro-F1 performs better than micro-F1, showing a greater improvement in the minority-sensitive metric. In Table 5, MLCSMOTE-FRST demonstrates superior accuracy in example-based scenarios. MLCSMOTE demonstrates superiority in the three base classifiers.
Parameter analysis
An additional study was conducted to explore the effect of parameters in MLCSMOTE-FRST. The parameters of the proposed method include the oversampling ratio p and the number of neighbors k. We analyzed the Yeast dataset. The problem of imbalance in datasets has been extensively studied, for multi-label problems as well as binary problems, such as the keel dataset [1]. We used RAkEL as the classifier for the analysis because it outperformed other classification algorithms on average across the datasets (Fig. 3).
As p increases, the performance of MLCSMOTE-FRST improves, with a greater enhancement observed in micro-F1. By contrast, the change for MLSMOTE-FRST is more stable. In multiple rounds of the experiment, the corresponding indicator has a small standard deviation. Both achieve the optimal macro-F1 and micro-F1 at \(p=0.2\). Similar results are reported for ML-ROS and MLSOL. We also found that the performance of ML-ROS at a significance level of \(p=0.05\) was higher than at other tested values.
In contrast to the change in p, the average change of k is more consistent. When k is greater than 15, the indicator shows a decreasing trend. As the neighborhood expands, the range of generated synthetic samples increases. This may result in the generation of additional noisy data, due to the lack of correlation among the neighbors with the seed samples. The performance of MLCSMOTE-FRST at \(k=15\) is higher than at \(k=10\) and 20. The SMOTE-based oversampling method can set a smaller value of k, such as 5 or 10, to obtain the optimal metric. The standard deviation of MLCSMOTE-FRST is smaller than that of the other algorithms for all values except \(k=10\). Combined with the average values, our proposed algorithm at \(k=15\) outperforms the other models and has a more consistent impact on the classification model.
Influence for imbalanced statistical indicators
Although IR is not directly associated with the improvement of classifiers, it can describe the overall multi-label distribution. We calculated the percentage for IR-related metrics of differences between the original metrics in Table 6.
Due to REDEMIAL cloning of the sample and the decoupling process, the overall number of labels does not change, and the statistical analysis of MeanIR and CVIR remains consistent. A positive change was observed for MLSMOTE, indicating that it does not guarantee a reduction in the imbalance ratio for the oversampling method. Our proposed MLCSMOTE and MLCSMOTE-FRST algorithms can significantly reduce both MeanIR and MeanIMR compared with MLSMOTE and MLSOL, while being slightly outperformed by ML-ROS. The direct assignment samples for minority labels by ML-ROS result in a faster decrease in MeanIR and MeanIMR, which are strongly related to the minority labels. It can be seen from Tables 4 and 5 that directly cloning samples is of limited assistance in improving classification performance. The values of CVIR and CVImR for MLCSMOTE and MLCSMOTE-FRST were also reduced, indicating a decrease in the dispersion of individual labels. We also note that the IR-related metrics of MLCSMOTE and MLCSMOTE-FRST were similar, both showing a decreasing trend.
Ablation experiments
Similar to section “Parameter analysis”, we used the same classifier as a benchmark for the ablation experiment. We present the metrics for incorporating FRST into MLCSMOTE in Table 2. In Table 7, the three SMOTE-based methods with the addition of FRST as a postprocessing method are presented for the minority class-sensitive classification metrics macro-F1 and micro-F1.
In Table 7, MLSMOTE and MLCSMOTE show 11 datasets where the metrics are not degraded. For MLCSMOTE, there are 10 datasets where macro-F1 is enhanced, indicating the effectiveness of our postprocessing approach. MLCSMOTE-FRST achieves better results than MLSMOTE-FRST and MLSOL-FRST in half of the datasets. Table 8 presents the variations of the labeling statistics MeanIR and MeanIMR.
It can be observed that the addition of FRST results in a decreasing trend in MeanIR and MeanIMR across most datasets. This demonstrates the effectiveness of the postprocessing approach in achieving a balanced distribution of labels. Although there is no obvious correlation between the label distribution and classification performance, Tables 7 and 8 demonstrate that our proposed postprocessing approach has clear advantages for both classification performance and label distribution.
Discussion
The proposed algorithm significantly improves performance related to minority classes, such as measured by macro-F1 and micro-F1, across several classifiers in the experiment. From the results of ablation experiments, this is attributed to the combined determination of both feature and label assignments. Even if we disregard the process of feature assignment and include FRST in traditional methods like MLSMOTE and MLSOL, the statistical values related to both classification and labeling distributions demonstrate significant improvement. Again, by combining the changes in the statistical values, our proposed method performs well at optimizing the distributions in terms of the statistical metrics of labels in MeanIR.
From the perspective of model implementation, although FRST can be added to most SMOTE-based generative methods, MLSMTOE, MLSOL, and MLCSMOTE have high algorithmic complexity in feature assignment, mainly for SMOTE, according to the analysis in section “Computational complexity”. From the perspective of algorithmic complexity, MLCSMOTE-FRST is the recommended generation algorithm. Combined with the classification results, MLkNN typically exhibits weaker performance compared with other classifiers. In our experiments, Rakel performed better. Increasing both p and k will elevate the complexity of the algorithm. This must be tailored to the dataset in practical scenarios to align with training time requirements.
Conclusion
We presented a method for oversampling multi-label data, which utilizes label-specific clustering and fuzzy rough set theory. Our objective was to develop a more efficient approach for generating synthetic data to establish the correspondence between the attribute and label levels for minority-label samples. Compared with other methods of weight assignment, MLCSMOTE-FRST can define the samples using a more specific clustering approach for minority labels, introducing fewer manual parameters. MLCSMOTE-FRST clusters samples belonging to minority labels using k-means and predicts the remaining samples to obtain clustering information specific to the labels. With this clustering information, IR of each minority label is assigned to the corresponding sample to be generated. The clustering results are used to selectively choose the intra-cluster samples for the SMOTE operation. The labels are supported by the intra-cluster samples and optimized using fuzzy rough set theory to capture the consensus label set.
In experiments on 12 datasets with four classifiers, MLCSMOTE-FRST achieved the highest average macro-F1 and micro-F1 metrics when compared with five other generative algorithms. To specify classifiers, RAkEL had better imbalance-related metrics than other classifiers. After applying MLCSMOTE-FRST, these showed a significant improvement.
In future work, while highlighting the advantages of label-specific cluster-based SMOTE, we can introduce label correlation information to further explore the influence between labels on the oversampling method.
Data availability
Dataset is available at locations cited in the reference section.
Notes
Full experimental results: Huang, Kai (2024), “MLCSMOTE”, Mendeley Data, V3, https://doi.org/10.17632/dj4crhrp4w.3.
References
Alcalá-Fdez J, Sanchez L, Garcia S et al (2009) Keel: a software tool to assess evolutionary algorithms for data mining problems. Soft Comput 13:307–318
Camacho L, Douzas G, Bacao F (2022) Geometric smote for regression. Expert Syst Appl 193:116387
Charte F, Rivera A, del Jesus MJ et al (2013) A first approach to deal with imbalance in multi-label datasets. In: Hybrid artificial intelligent systems: 8th international conference, HAIS 2013, Salamanca, Spain, September 11–13, 2013. Proceedings 8, Springer, pp 150–160
Charte F, Rivera AJ, del Jesus MJ et al (2014) Mlenn: a first approach to heuristic multilabel undersampling. In: Intelligent data engineering and automated learning—IDEAL 2014: 15th international conference, Salamanca, Spain, September 10–12, 2014. Proceedings 15, Springer, pp 1–9
Charte F, Rivera AJ, del Jesus MJ et al (2015) Addressing imbalance in multilabel classification: measures and random resampling algorithms. Neurocomputing 163:3–16
Charte F, Rivera AJ, del Jesus MJ et al (2015) Mlsmote: approaching imbalanced multilabel learning through synthetic instance generation. Knowl Based Syst 89:385–397
Charte F, Rivera AJ, del Jesus MJ et al (2019) Dealing with difficult minority labels in imbalanced mutilabel data sets. Neurocomputing 326:39–53
Chen L, Zhan W, Tian W et al (2019) Deep integration: a multi-label architecture for road scene recognition. IEEE Trans Image Process 28(10):4883–4898
Cornelis C, Verbiest N, Jensen R (2010) Ordered weighted average based fuzzy rough sets. In: International conference on rough sets and knowledge technology. Springer, pp 78–85
Douzas G, Bacao F, Last F (2018) Improving imbalanced learning through a heuristic oversampling method based on k-means and smote. Inf Sci 465:1–20
Duan J, Gu Y, Yu H et al (2024) Ecc++: an algorithm family based on ensemble of classifier chains for classifying imbalanced multi-label data. Expert Syst Appl 236(121):366
Duan J, Yang X, Gao S et al (2024) A partition-based problem transformation algorithm for classifying imbalanced multi-label data. Eng Appl Artif Intell 128(107):506
Dubois D, Prade H (1990) Rough fuzzy sets and fuzzy rough sets. Int J Gen Syst 17(2–3):191–209
El-Douh A, Lu S, Abdelhafeez A et al (2023) A neutrosophic multi-criteria model for evaluating sustainable soil enhancement methods and their cost 2 implications in construction. SMIJ 5(2):11
El Moutaouakil K, Roudani M, El Ouissari A (2023) Optimal entropy genetic fuzzy-c-means smote (oegfcm-smote). Knowl Based Syst 262(110):235
Fernández A, Garcia S, Herrera F et al (2018) Smote for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. J Artif Intell Res 61:863–905
Gupta N, Jindal V, Bedi P (2022) Cse-ids: using cost-sensitive deep learning and ensemble algorithms to handle class imbalance in network-based intrusion detection systems. Comput Secur 112(102):499
Han M, Wu H, Chen Z et al (2022) A survey of multi-label classification based on supervised and semi-supervised learning. Int J Mach Learn Cybern 14:697–724
He Z, Chu P, Li C et al (2023) Compound fault diagnosis for photovoltaic arrays based on multi-label learning considering multiple faults coupling. Energy Convers Manag 279(116):742
Huang K, Wang X (2022) Ccr-gsvm: a boundary data generation algorithm for support vector machine in imbalanced majority noise problem. Appl Intell 53:1192–1204
Kaur A, Chauhan APS, Aggarwal AK (2019) Machine learning based comparative analysis of methods for enhancer prediction in genomic data. In: 2019 2nd International conference on intelligent communication and computational techniques (ICCT), IEEE, pp 142–145
Kongsorot Y, Horata P, Musikawan P et al (2019) Kernel extreme learning machine based on fuzzy set theory for multi-label classification. Int J Mach Learn Cybern 10(5):979–989
Koziarski M, Woźniak M, Krawczyk B (2020) Combined cleaning and resampling algorithm for multi-class imbalanced data with label noise. Knowl Based Syst 204(106):223
Liang X, Jiang A, Li T et al (2020) Lr-smote-an improved unbalanced data set oversampling based on k-means and svm. Knowl Based Syst 196(105):845
Liu B, Tsoumakas G (2018) Making classifier chains resilient to class imbalance. In: Asian conference on machine learning. PMLR, pp 280–295
Liu B, Tsoumakas G (2020) Synthetic oversampling of multi-label data based on local label distribution. In: Machine learning and knowledge discovery in databases: European conference, ECML PKDD 2019, Würzburg, Germany, September 16–20, 2019, Proceedings, Part II. Springer, pp 180–193
Liu B, Blekas K, Tsoumakas G (2022) Multi-label sampling based on local label imbalance. Pattern Recognit 122(108):294
Liu D, Zhong S, Lin L et al (2022) Highly imbalanced fault diagnosis of gas turbines via clustering-based downsampling and deep Siamese self-attention network. Adv Eng Inform 54(101):725
Lughofer E (2022) Evolving multi-label fuzzy classifier. Inf Sci 597:1–23
Maini D, Aggarwal AK (2018) Camera position estimation using 2d image dataset. Int J Innov Eng Technol 10:199–203
Meng D, Li Y (2022) An imbalanced learning method by combining smote with center offset factor. Appl Soft Comput 120(108):618
Mohamed Z, Ismail M, Abd El-Gawad A (2023) Sustainable supplier selection using neutrosophic multi-criteria decision making methodology. Sustain Mach Intell J. https://doi.org/10.61185/SMIJ.2023.33102
Özdemir A, Polat K, Alhudhaif A (2021) Classification of imbalanced hyperspectral images using smote-based deep learning methods. Expert Syst Appl 178(114):986
Pawlak Z (1982) Rough sets. Int J Comput Inf Sci 11(5):341–356
Pereira RM, Costa YM, Silla CN Jr (2020) Mltl: a multi-label approach for the Tomek link undersampling algorithm. Neurocomputing 383:95–105
Ramentol E, Gondres I, Lajes S et al (2016) Fuzzy-rough imbalanced learning for the diagnosis of high voltage circuit breaker maintenance: the smote-frst-2t algorithm. Eng Appl Artif Intell 48:134–139
Read J, Pfahringer B, Holmes G et al (2011) Classifier chains for multi-label classification. Mach Learn 85:333–359
Read J, Pfahringer B, Holmes G et al (2021) Classifier chains: a review and perspectives. J Artif Intell Res 70:683–718
Roseberry M, Krawczyk B, Cano A (2019) Multi-label punitive knn with self-adjusting memory for drifting data streams. ACM Trans Knowl Discov Data (TKDD) 13(6):1–31
Shan J, Hou C, Tao H et al (2020) Randomized multi-label subproblems concatenation via error correcting output codes. Neurocomputing 410:317–327
Sharma S, Gosain A, Jain S (2022) A review of the oversampling techniques in class imbalance problem. In: International conference on innovative computing and communications: proceedings of ICICC 2021, vol 1. Springer, pp 459–472
Song X, Song Y, Stojanovic V et al (2023) Improved dynamic event-triggered security control for t-s fuzzy lpv-pde systems via pointwise measurements and point control. Int J Fuzzy Syst 25(8):3177–3192
Sun P, Song X, Song S et al (2023) Composite adaptive finite-time fuzzy control for switched nonlinear systems with preassigned performance. Int J Adapt Control Signal Process 37(3):771–789
SzymaĹ P, Kajdanowicz T et al (2019) scikit-multilearn: a python library for multi-label classification. J Mach Learn Res 20(6):1–22
Tarekegn AN, Giacobini M, Michalak K (2021) A review of methods for imbalanced multi-label classification. Pattern Recognit 118(107):965
Tsoumakas G, Katakis I, Vlahavas I (2010) Random k-labelsets for multilabel classification. IEEE Trans Knowl Data Eng 23(7):1079–1089
Verbiest N, Ramentol E, Cornelis C et al (2014) Preprocessing noisy imbalanced datasets using smote enhanced with fuzzy rough prototype selection. Appl Soft Comput 22:511–517
Vluymans S, Cornelis C, Herrera F et al (2018) Multi-label classification using a fuzzy rough neighborhood consensus. Inf Sci 433:96–114
Wu T, Huang Q, Liu Z et al (2020) Distribution-balanced loss for multi-label classification in long-tailed datasets. In: Computer vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16. Springer, pp 162–178
Xiao J, Aggarwal AK, Rage UK et al (2023) Deep learning-based spatiotemporal fusion of unmanned aerial vehicle and satellite reflectance images for crop monitoring. IEEE Access
Zadeh LA (1965) Fuzzy sets. Inf Control 8(3):338–353
Zhang A, Yu H, Huan Z et al (2022) Smote-rknn: a hybrid re-sampling method based on smote and reverse k-nearest neighbors. Inf Sci 595:70–88
Zhang K, Mao Z, Cao P et al (2023) Label correlation guided borderline oversampling for imbalanced multi-label data learning. Knowl Based Syst 279(110):938
Zhang ML, Wu L (2014) Lift: multi-label learning with label-specific features. IEEE Trans Pattern Anal Mach Intell 37(1):107–120
Zhang ML, Zhou ZH (2007) Ml-knn: a lazy learning approach to multi-label learning. Pattern Recognit 40(7):2038–2048
Zhang ML, Li YK, Liu XY et al (2018) Binary relevance for multi-label learning: an overview. Front Comput Sci 12:191–202
Zhang S, Liu Z, He S et al (2022) Improved double tqwt sparse representation using the mqga algorithm and new norm for aviation bearing compound fault detection. Eng Appl Artif Intell 110(104):741
Zhang Z, Song X, Sun X et al (2023) Hybrid-driven-based fuzzy secure filtering for nonlinear parabolic partial differential equation systems with cyber attacks. Int J Adapt Control Signal Process. https://doi.org/10.1002/acs.3529
Zhu QX, Wang XW, Zhang N et al (2022) Novel k-medoids based smote integrated with locality preserving projections for fault diagnosis. IEEE Trans Instrum Meas 71:1–8
Funding
This study was funded by Natural Science Foundation of Fujian Province (No. 2023J01800), Natural Science Foundation of Xiamen, China (No. 3502Z202372018), Department of Education, Fujian Province (No. JAT232012) and Xiamen Science and Technology Subsidy Project (No. 2023CXY0318).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors have no conflict of interest to declare that are relevant to the content of this article.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Liu, J., Huang, K., Chen, C. et al. An oversampling algorithm of multi-label data based on cluster-specific samples and fuzzy rough set theory. Complex Intell. Syst. 10, 6267–6282 (2024). https://doi.org/10.1007/s40747-024-01498-w
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s40747-024-01498-w