
1 Introduction

Classification is a well-known problem in the field of artificial intelligence. In recent years, machine learning approaches, in particular different forms of neural networks, have made substantial progress in solving classification tasks across a diverse range of domains, such as computer vision [15], text processing [10], or graph theory [19]. However, although current machine learning methods for classification may yield remarkably accurate results, they are still not guaranteed to be correct, and they are not inherently explainable, i. e., no form of justification or rationale is provided. At the same time, the need for explainable methods is becoming increasingly relevant [4].

To address the lack of explainability in machine learning-based classification approaches, Thimm and Kersting [16] propose an approach that combines machine learning with computational models of argumentation [5]. To be precise, the authors suggest a two-step procedure: first, a rule learning algorithm is applied to extract rules from a given dataset; second, the learned rules are used as input for a structured argumentation system, which then yields a justification status for each class, given a new observation. Thus, this approach not only delivers classifications but also explanations thereof. Further, expert knowledge (in the form of additional arguments) can easily be incorporated into the reasoning process.

Thimm and Kersting [16] presented some preliminary experimental results in their study: on the Animals with Attributes (AwA) dataset, about \(30\,\%\) of the instances were classified correctly, while the remaining \(70\,\%\) were deemed “undecided”. In this paper, we build explicitly on these results and present an extended approach that likewise comprises a rule mining step and an argumentation-based classification step, but additionally introduces a clustering technique for more targeted rule mining. More specifically, the clustering step reduces the number of mined rules, making them more purposeful, and additionally counteracts the extraction of contradictory rules. In an experimental analysis, we show that our method achieves a significantly higher accuracy of \(71\,\%\) on the AwA dataset. To corroborate our observations, we consider additional datasets. Furthermore, we demonstrate that the procedure introduced in this work is potentially much more resource-efficient than the approach proposed by Thimm and Kersting.

2 Background

The three main ingredients of the approach presented in this paper are (1) a clustering algorithm, (2) a rule mining algorithm, and (3) a structured argumentation method. Although the choice of each component is generally flexible, we select concrete instantiations by way of example. For the clustering part, we use a simple k-modes algorithm [12], which is designed for clustering categorical data. As the rule mining algorithm we use association rule mining, and as the structured argumentation approach, following [16], we use defeasible logic programming [9]. The latter two formalisms are outlined below.

Association Rule Mining. Data mining generally encompasses methods for extracting non-trivial patterns from a given dataset. Association rule mining [3] aims to uncover interesting relationships among items within large databases. Consider \(I = \{I_1, I_2, \ldots , I_m\}\) as a set comprising \(m\) distinct attributes. Let \(T\) be a transaction containing a set of items such that \(T \subseteq I\), and let \(D\) be a database of transaction records. An association rule takes the form \(X \Rightarrow Y\), where \(X, Y \subseteq I\) are sets of items known as itemsets, and \(X \cap Y = \emptyset \). In this context, \(X\) is referred to as the antecedent, and \(Y\) is termed the consequent. The rule \(X \Rightarrow Y\) signifies that the presence of \(X\) implies the presence of \(Y\). Association rules rely on two fundamental criteria of interestingness: support and confidence. These criteria help identify relationships and rules by revealing frequently occurring if/then patterns. To be considered, association rules typically must meet both a user-specified minimum support and a user-specified minimum confidence simultaneously. The support of an association rule is the fraction of records that contain \(X \cup Y\) relative to the total number of records in the database. The confidence of an association rule is the fraction of transactions that contain \(X \cup Y\) relative to the number of records that contain \(X\). For the approach presented here, we use FP-Growth [11] as a rule miner.
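To make the support and confidence measures defined above concrete, the following minimal Python sketch computes both for a candidate rule over a toy transaction database (purely illustrative; not part of the approach itself):

```python
def support(D, itemset):
    """Fraction of transactions in D that contain all items of itemset."""
    return sum(1 for T in D if itemset <= T) / len(D)

def confidence(D, X, Y):
    """Fraction of the transactions containing X that also contain Y."""
    return support(D, X | Y) / support(D, X)

# Toy database: each transaction is a set of items.
D = [{"flippers", "ocean"}, {"flippers", "ocean", "black"},
     {"flippers", "ocean"}, {"black"}]

X, Y = {"flippers"}, {"ocean"}
print(support(D, X | Y))    # 0.75 -> meets a minimum support of 0.7
print(confidence(D, X, Y))  # 1.0  -> a rule of maximal confidence
```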

Defeasible Logic Programming. The core idea behind defeasible logic programming (DeLP) [9] is to combine concepts from logic programming and defeasible argumentation to allow for dealing with incomplete or contradictory data. A defeasible logic program (de.l.p.) consists of facts and rules, which are divided into strict rules of the form \(l \leftarrow B\) and defeasible rules of the form \(l \mathrel{-\!\!\prec} B\), with l being a single literal and B a set of literals. Moreover, a fact is a single literal (i. e., an atom a or a negated atom \(\lnot a\)). Thus, formally, a de.l.p. \(\mathcal {P} = (\varPi , \varDelta )\) consists of a set \(\varPi \) of facts and strict rules, and a set \(\varDelta \) of defeasible rules. Further, a literal l is derivable by some set of rules R (i. e., \(R \vdash l\)) if it is derivable following the classical rule-based understanding. If both \(R \vdash l\) and \(R \vdash \lnot l\), then R is contradictory. Conventionally, \(\varPi \) is non-contradictory. Further, if \(R \vdash l\) and R is non-contradictory, we call the literal l consistently derivable (denoted as \(R \mathrel{|\!\sim} l\)).

For a de.l.p. \({\mathcal {P}} =(\varPi ,\varDelta )\) and a literal l, a tuple \(\langle \mathcal {A}, l \rangle \) (with \(\mathcal {A} \subseteq \varDelta \)) is an argument for l iff \(\varPi \cup \mathcal {A} \mathrel{|\!\sim} l\) and \(\mathcal {A} \) is minimal wrt. set inclusion. Further, \(\langle \mathcal {B}, q \rangle \) is a subargument of \(\langle \mathcal {A}, l \rangle \) iff \(\mathcal {B} \subseteq \mathcal {A} \). We refer to \(\langle \mathcal {A} _1, l_1 \rangle \) as a counterargument to \(\langle \mathcal {A} _2, l_2 \rangle \) at literal l iff there is a subargument \(\langle \mathcal {A}, l \rangle \) of \(\langle \mathcal {A} _2, l_2 \rangle \) such that \(\varPi \cup \{l,l_1\}\) is contradictory. To deal with counterarguments, we use the generalized specificity relation \(\succ \) as a formal comparison criterion among arguments. According to this criterion, an argument is preferred over another if (1) it has a greater information content and is thus more precise, or (2) it uses fewer rules and is thus more concise (see García and Simari [9] for a formal definition and further discussion). We call \(\langle \mathcal {A} _{1}, l_{1} \rangle \) a defeater of \(\langle \mathcal {A} _{2}, l_{2} \rangle \) iff there is a subargument \(\langle \mathcal {A}, l \rangle \) of \(\langle \mathcal {A} _{2}, l_{2} \rangle \) such that \(\langle \mathcal {A} _{1}, l_{1} \rangle \) is a counterargument of \(\langle \mathcal {A} _{2}, l_{2} \rangle \) at literal l and either \(\langle \mathcal {A} _{1}, l_{1} \rangle \succ \langle \mathcal {A}, l \rangle \) (proper defeat) or \(\langle \mathcal {A} _{1}, l_{1} \rangle \not \succ \langle \mathcal {A}, l \rangle \) and \(\langle \mathcal {A}, l \rangle \not \succ \langle \mathcal {A} _{1}, l_{1} \rangle \) (blocking defeat).

A finite sequence of arguments \(\varLambda = [\langle \mathcal {A} _{1}, l_{1} \rangle ,\dots , \langle \mathcal {A} _{m}, l_{m} \rangle ]\) is an acceptable argumentation line iff (1) every \(\langle \mathcal {A} _{i}, l_{i} \rangle \) with \(i>1\) is a defeater of \(\langle \mathcal {A} _{i-1}, l_{i-1} \rangle \), and if \(\langle \mathcal {A} _{i}, l_{i} \rangle \) is a blocking defeater of \(\langle \mathcal {A} _{i-1}, l_{i-1} \rangle \) and \(\langle \mathcal {A} _{i+1}, l_{i+1} \rangle \) exists, then \(\langle \mathcal {A} _{i+1}, l_{i+1} \rangle \) is a proper defeater of \(\langle \mathcal {A} _{i}, l_{i} \rangle \), (2) the sets \(\varPi \cup \mathcal {A} _1\cup \mathcal {A} _3\cup \dots \) and \(\varPi \cup \mathcal {A} _2\cup \mathcal {A} _4\cup \dots \) are non-contradictory, and (3) there exists no \(\langle \mathcal {A} _{k}, l_{k} \rangle \) that is a subargument of \(\langle \mathcal {A} _{i}, l_{i} \rangle \) with \(i<k\). Thus, intuitively, an argumentation line forms a sequence of arguments in which each \(\langle \mathcal {A} _{i}, l_{i} \rangle \) defeats its predecessor \(\langle \mathcal {A} _{i-1}, l_{i-1} \rangle \). Moreover, since an argument \(\langle \mathcal {A} _{i}, l_{i} \rangle \) defeats \(\langle \mathcal {A} _{i-1}, l_{i-1} \rangle \), and thereby reinstates \(\langle \mathcal {A} _{i-2}, l_{i-2} \rangle \), the sets \(\varPi \cup \mathcal {A} _1\cup \mathcal {A} _3\cup \dots \) and \(\varPi \cup \mathcal {A} _2\cup \mathcal {A} _4\cup \dots \) must be non-contradictory for the argumentation line to be acceptable. To avoid circular argumentation, we also need to ensure that no subarguments are reintroduced in the same argumentation line.

Finally, a literal l is warranted if there is an argument \(\langle \mathcal {A}, l \rangle \) which is non-defeated in the end. To decide whether \(\langle \mathcal {A}, l \rangle \) is defeated or not, every acceptable argumentation line starting with \(\langle \mathcal {A}, l \rangle \) has to be considered. The answer to a DeLP query l is Yes if l is warranted, and No if \(\lnot l\) is warranted. Otherwise, the answer is Undecided.
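To illustrate these notions, consider a small example in the style of García and Simari [9] (paraphrased and slightly abridged): let \(\varPi = \{\textit{chicken}(\textit{tina}),\ \textit{scared}(\textit{tina}),\ \textit{bird}(X) \leftarrow \textit{chicken}(X)\}\) and \(\varDelta = \{\textit{flies}(X) \mathrel{-\!\!\prec} \textit{bird}(X),\ \lnot \textit{flies}(X) \mathrel{-\!\!\prec} \textit{chicken}(X),\ \textit{flies}(X) \mathrel{-\!\!\prec} \textit{chicken}(X), \textit{scared}(X)\}\). For the query \(\textit{flies}(\textit{tina})\), the argument built from the first defeasible rule is defeated by the more specific counterargument for \(\lnot \textit{flies}(\textit{tina})\) built from the second rule, which is in turn defeated by the even more specific argument using the third rule. Hence \(\textit{flies}(\textit{tina})\) is warranted, and the answer to the query is Yes.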

3 Cluster-Specific Rule Mining

The approach proposed in this work is an extension of the argumentation-based classification approach (AbC) described by Thimm and Kersting [16]. The AbC approach consists of two steps: (1) mining association rules from a given dataset and (2) performing classification using the generated rules as input to a structured argumentation approach. During the initial phase, rule mining algorithms are employed to identify frequent patterns and rules in a specified dataset. This step typically yields a substantial number of rules [17]. However, these rules cannot be directly applied to classification, since they often exhibit inconsistencies. Hence, in the subsequent phase, these rules are used as input to structured argumentation methods such as DeLP. Employing the argumentative inference procedures inherent in these approaches, a new observation is classified by constructing arguments based on these rules and evaluating their justification status. Using argumentation techniques enables the creation of classifiers explicitly designed to explain their decisions, thus meeting the contemporary demand for explainable AI. These classifiers are able to explain the reasons for favoring arguments supporting the conclusion over counterarguments.

We extend the original two-step argumentation-based classification approach AbC to a multi-step classification method that combines traditional machine learning methods with structured argumentation. To be precise, we introduce two additional steps. First, we perform a clustering of the input data, resulting in groups of instances with similar properties. Second, a feature selection is carried out for each cluster to identify the most informative features for predicting the target variable. Subsequently, these features are used to generate cluster-specific association rules for each cluster. Since the number of generated rules significantly influences the classification time, this approach leads to considerably shorter runtimes and is more resource-efficient. In addition, grouping instances with similar properties helps to discover relationships that are difficult to detect when looking at the entire dataset. This improves the capability to classify datasets for which a naive approach may not extract enough rules. Moreover, the generated rule set is more consistent due to the similarity of the instances within a cluster and the emphasis on meaningful features, thus reducing the number of undecidable instances. In general, the presented approach consists of four steps: (1) clustering the input data, (2) cluster-specific feature importance analysis to select the most informative features, (3) cluster-specific association rule mining based on these features, and (4) classification of new observations by assigning them to a cluster and applying the cluster-specific rules. Each step is outlined in more detail below.

Clustering. First, the input data is divided into k groups based on all features (including the target feature), using the k-modes algorithm [12]. The k-modes clustering algorithm adapts the well-known k-means method for partitioning a dataset into distinct groups or clusters to categorical data. This step aims to divide the input data into smaller, more manageable groups with similar properties in order to reduce the running time of the rule mining algorithm, reduce the number of generated rules, improve the detection of otherwise hard-to-find relations, and improve the quality of the rules.
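As a minimal sketch of this step (assuming the third-party kmodes package; the data and parameter values are illustrative only):

```python
import numpy as np
from kmodes.kmodes import KModes  # third-party 'kmodes' package

# Toy categorical data; in our setting, the target feature is included
# among the columns during clustering, as described above.
X = np.array([["yes", "black", "arctic"],
              ["yes", "white", "arctic"],
              ["no",  "black", "desert"],
              ["no",  "brown", "desert"]])

km = KModes(n_clusters=2, init="Huang", n_init=5, random_state=0)
labels = km.fit_predict(X)    # cluster index for each instance
print(labels)                 # e.g., [0 0 1 1]
print(km.cluster_centroids_)  # the mode (most frequent value) per feature
```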

Feature Selection. This step conducts a feature importance analysis to find the most informative features for classifying the target variable within a cluster, using the mutual information score. Mutual information quantifies the dependency between two random variables with a non-negative value. This value is zero exactly when the two variables are independent, with larger values indicating a greater dependency. The score calculation is based on entropy estimation using distances to k-nearest neighbors, as outlined in [13, 14]. After calculating the scores for each feature, the highest-scoring features are selected (the top four in our experiments, cf. Sect. 4). The selected cluster-specific features are used as input for the rule miner in the next step. Reducing the number of features and discarding features with little expressiveness massively accelerates the association rule mining step. Furthermore, only the most relevant features are used for rule mining, leading to fewer, more meaningful rules.
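A possible realization of this step with scikit-learn's mutual information estimator, shown here on synthetic binary data (an illustrative sketch, not the exact estimator configuration used in our experiments):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6))  # binary features of one cluster
y = X[:, 0] & X[:, 3]                  # target depends on features 0 and 3 only

mi = mutual_info_classif(X, y, discrete_features=True, random_state=0)
top = np.argsort(mi)[::-1][:4]  # indices of the 4 highest-scoring features
print(top)                      # features 0 and 3 should rank highest
```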

Association Rule Mining. This step generates cluster-specific association rules for each previously generated cluster, based on the most important selected features. In this work, we use the FP-Growth algorithm [11]; in principle, however, any rule mining algorithm can be used. To generate rules, the truth values of the features of an instance are represented as a set of ground literals. For example, for a dataset of animals with the attributes swims, black, and arctic, the attributes of a dolphin would be represented as \(\textit{swims}(\textit{dolphin})\), \(\lnot \textit{black}(\textit{dolphin})\), and \(\lnot \textit{arctic}(\textit{dolphin})\). The output of the rule mining algorithm is a set of association rules such as \(\text {\textit{flippers}}(X) \rightarrow \text {\textit{ocean}}(X)\), which can be interpreted as “animals with flippers live in the ocean”. Subsequently, the created rules are filtered according to the method of Thimm and Kersting [16]: rules with more than one element in the conclusion or more than three elements in the body are discarded. All rules with a confidence value of 1 are interpreted as strict; the remaining rules are interpreted as defeasible. Occasionally, no or too few cluster-specific rules are generated for the target variable, so that instances assigned to this cluster cannot be classified. To prevent this, we implemented an adaptive rule mining process, which iteratively adjusts the confidence and support thresholds until at least one rule for the target variable is generated.
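A compact sketch of the mining and filtering procedure, assuming the mlxtend implementation of FP-Growth (toy data; the adaptive threshold adjustment is indicated only by a comment):

```python
import pandas as pd
from mlxtend.frequent_patterns import fpgrowth, association_rules

# One-hot transaction table of a toy cluster (True = attribute holds).
df = pd.DataFrame({"flippers": [True, True, True, False],
                   "ocean":    [True, True, True, False],
                   "black":    [True, False, True, True]})

# The adaptive process would iteratively lower these thresholds until
# at least one rule for the target variable is generated.
freq = fpgrowth(df, min_support=0.7, use_colnames=True)
rules = association_rules(freq, metric="confidence", min_threshold=0.9)

# Filtering as in [16]: one literal in the head, at most three in the body.
rules = rules[(rules["consequents"].apply(len) == 1)
              & (rules["antecedents"].apply(len) <= 3)]
strict = rules[rules["confidence"] == 1.0]     # interpreted as strict rules
defeasible = rules[rules["confidence"] < 1.0]  # interpreted as defeasible rules
```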

Classification. In the classification step, the cluster-specific rules are used as input to the structured argumentation approach DeLP (see Sect. 2). To classify a new instance, it is first assigned to a cluster using the previously trained clustering algorithm in order to determine the classification rule set. Since the value of the target variable is unknown, the assignment is performed twice, whereby (1) the target variable is assumed to be positive and (2) it is assumed to be negative. If, for example, one aims to classify the edibility of a mushroom with the classes edible and poisonous, an unseen mushroom is assumed once to be edible and once to be poisonous. Two cases can occur: the mushroom is either assigned to the same cluster in both cases or to different clusters. In the first case, the rules of the corresponding cluster are applied, and the classification is carried out. In the second case, two classifications are performed with the different rules of the respective clusters. Since two different sets of rules from different clusters are used, and different assumptions are made about the class of the target variable, conflicting classifications may occur. For example, a mushroom m is assigned to cluster \(C_0\) under the negative assumption (poisonous) and to cluster \(C_1\) under the positive assumption (edible). The query \(\textit{poisonous}(m)\) returns the answer Undecided for \(C_0\), while for \(C_1\) the answer is Yes. Since the results do not match, one of the two answers must be selected. In general, two types of conflicts can arise: (1) the rules of one cluster return Undecided, and the rules of the other cluster return a concrete answer Yes/No, or (2) one cluster returns Yes and the other No. The first conflict is resolved by choosing the concrete answer (Yes/No) as the final result. In the second case, the answer of the rule set with the higher average confidence is used. If the average confidences match, the average support is used as a tiebreaker.
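The conflict resolution just described can be summarized by the following self-contained sketch (hypothetical helper names; the answer strings correspond to the DeLP answers from Sect. 2):

```python
def resolve(ans_pos, ans_neg, conf_pos, conf_neg, sup_pos, sup_neg):
    """Combine the two DeLP answers obtained under the positive and the
    negative assumption for the target variable; conf_*/sup_* denote the
    average confidence/support of the respective cluster's rule set."""
    if ans_pos == ans_neg:
        return ans_pos
    # Conflict type (1): one side is Undecided -> take the concrete answer.
    if ans_pos == "Undecided":
        return ans_neg
    if ans_neg == "Undecided":
        return ans_pos
    # Conflict type (2): Yes vs. No -> higher average rule confidence wins;
    # the average support breaks remaining ties.
    if conf_pos != conf_neg:
        return ans_pos if conf_pos > conf_neg else ans_neg
    return ans_pos if sup_pos > sup_neg else ans_neg

# Example from the text: C_0 (negative assumption) answers Undecided,
# C_1 (positive assumption) answers Yes -> the concrete answer is chosen.
print(resolve("Yes", "Undecided", 0.95, 0.92, 0.8, 0.75))  # Yes
```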

4 Experimental Analysis

In this section, we present the results of an experimental analysis in which we compare our approach to AbC [16] in terms of classification performance on five different datasets. Below, we describe the experimental setup and subsequently discuss our findings.

Datasets and Setup. We use five well-known categorical datasets for binary classification as training and test data: Animals with Attributes, Zoo, Mushrooms, Car Evaluation, and Congressional Voting Records.

All categorical features that were not already in binary form were one-hot encoded by converting each feature into as many 0/1 features as it has distinct values. For example, for a feature safety with the two values low and high, two new dummy features, safety_low and safety_high, were introduced. For each instance, the feature’s value was then replaced by the corresponding one-hot encoding. Records with missing values were excluded.
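In pandas, this preprocessing corresponds to the following minimal sketch, mirroring the safety example:

```python
import pandas as pd

df = pd.DataFrame({"safety": ["low", "high", None, "low"]})
df = df.dropna()                             # exclude records with missing values
df = pd.get_dummies(df, columns=["safety"])  # adds safety_high and safety_low
print(df.astype(int))                        # dummy columns as 0/1 values
```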

We make use of the following five datasets.

Animals with Attributes. The Animals with Attributes (AwA) dataset consists of 50 different animals with 85 boolean-valued attributes. The dataset was randomly split into \(90\%\) training data and \(10\%\) test data.

Zoo. The Zoo [8] dataset is similar to the AwA dataset. It contains 101 instances of animals with 16 attributes. The dataset was randomly split into \(80\%\) training data and \(20\%\) test data.

Mushrooms. The Mushrooms [2] dataset comprises descriptions of hypothetical samples representing 23 species of mushrooms. Each species is categorized as either edible, poisonous, or of uncertain edibility; the latter category was merged with the poisonous one. The dataset initially consists of 8124 instances with 22 categorical features. After data cleaning and feature encoding, the dataset contains 5644 instances with 99 features and was randomly split into \(90\%\) training data and \(10\%\) test data.

Car Evaluation. The Car Evaluation (Car) [6] dataset was derived from a simple hierarchical decision model. The original dataset contains 1728 instances with 6 categorical features; one-hot encoding resulted in a total of 22 features. No instance was excluded. The classification target is to determine whether a car exhibits a low safety standard. The dataset was randomly split into \(80\%\) training data and \(20\%\) test data.

Congressional Voting Records. The Congressional Voting Records (Congress) dataset [1] contains the 1984 United States congressional voting records of the members of the U.S. House of Representatives. Initially, it comprises 435 instances and 16 features. After removing the records with missing values and encoding the features, 232 instances with 33 features remain. The classification target is to determine which party (Democrat or Republican) a congressman belongs to. The dataset was randomly split into \(80\%\) training and \(20\%\) test data.

We repeated the classification five times for each dataset according to the procedure described in Sect. 3. The number of randomly initialized clusters was set to seven, and for each cluster, the top four features were selected. We set the minimum support of the rule mining algorithm to 0.7 and the minimum confidence to 0.9. To evaluate the performance of the proposed approach, we use the accuracy, the percentage of undecided instances, and the percentage of decided but falsely classified instances; the mean result over the five runs is reported. Note that for the AwA dataset, we randomly selected ten attributes as target variables and report the average of the metrics across all ten. For the Zoo dataset, we randomly selected three target variables and report the results for each selected attribute individually.
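For clarity, the three reported metrics can be computed as follows (our reading of the setup, with Undecided instances counting as incorrect for the accuracy, which is consistent with the AwA figures of \(30\,\%\) correct and \(70\,\%\) undecided cited in Sect. 1):

```python
def evaluate(preds, truth):
    """Accuracy, fraction of undecided instances, and fraction of decided
    but falsely classified instances (hypothetical helper)."""
    n = len(preds)
    undec = sum(p == "Undecided" for p in preds)
    false = sum(p != "Undecided" and p != t for p, t in zip(preds, truth))
    return (n - undec - false) / n, undec / n, false / n

# Example: one correct, one undecided, one correct, one wrong prediction.
print(evaluate(["Yes", "Undecided", "No", "No"],
               ["Yes", "Yes", "No", "Yes"]))  # (0.5, 0.25, 0.25)
```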

Table 1. Overview of accuracy (ACC), undecidable instances (UNDEC), and decided but falsely classified instances for our approach and the AbC approach. The results of AbC for the AwA dataset are those reported in [16].

Results. The results in Table 1 show that AbC could not classify even a single instance in five of the seven scenarios. To be precise, for the classification of Zoo_eggs, Zoo_milk, Car, and Congress, all test instances were answered as Undecided, leading to an accuracy of 0. Our approach, on the other hand, consistently achieves high accuracies, ranging from 0.82 (Car) to 1.0 (Zoo_eggs). For Zoo_fins, both approaches show \(4.76\,\%\) of falsely classified instances. However, our approach exhibits a significantly lower proportion of Undecided instances, reflected in a higher overall accuracy of 0.92 compared to AbC (0.86). In addition, in our experiments, AbC created a very large number of rules for the AwA dataset, which precluded classification in a reasonable time; we therefore rely on the results reported in [16]. Although the results can only be compared to a limited extent, our cluster-specific approach shows significantly higher accuracy (0.71 vs. 0.3) and a significantly smaller proportion of undecidable instances (\(17.2\,\%\) vs. \(70\,\%\)) than AbC. The most extensive dataset, Mushrooms, could not be classified by AbC because it ran out of memory in the rule mining step. Our method achieves an accuracy of 0.88, with \(10.8\,\%\) of instances remaining undecided and a low \(1.13\,\%\) false classification rate.

5 Limitations

The approach presented in this paper achieves promising results in terms of accuracy and the reduction of undecidable instances. However, there is still room for improvement. In the following, we discuss the main limitations of the proposed approach.

Rule Generation Control. The classification performance heavily depends on the generated rules. However, direct control over the rule-generation process is limited to setting support and confidence thresholds. Another way to influence rule generation is through clustering and feature selection. Finding the best parameters for the clustering and feature selection steps is a non-trivial task that ultimately comes down to trial and error, as it is highly dataset-dependent.

Computational Overhead. Compared to traditional machine learning methods, our approach can result in longer classification times due to its multi-step nature. Each step adds computational overhead, and the runtimes of the clustering algorithm, the rule mining algorithm, and the DeLP implementation significantly influence the overall runtime performance of our approach.

Classification Tasks and Data Types. Our method primarily targets binary classification tasks and focuses on handling categorical variables. In its current configuration, multi-class classification requires multiple invocations of the classification pipeline, one for each class, which can significantly increase the computational demands. Moreover, since the approach is specialized to categorical variables, numeric features must undergo a binning process to be transformed into categorical equivalents. This transformation can substantially increase the number of features and thus expand the feature space considerably.

6 Conclusion

In this work, we presented a new approach to argumentation-based classification. Building on the preliminary results of Thimm and Kersting [16], we developed a multi-step classification approach that combines classical machine learning methods with approaches to (structured) argumentation. In an experimental analysis, we examined the classification performance on five different datasets and showed that our cluster-specific rule mining approach achieves significantly better accuracies and lower numbers of undecidable instances than the original AbC approach. In future work, we aim to explore the influence of different configurations for the clustering, feature selection, and rule mining steps and their impact on classification performance. Furthermore, broadening the scope of evaluation to encompass datasets of increased complexity and diversity, as well as a comparative analysis with other argumentation-based methods such as ABALearn [18] and AA-CBR [7], other symbolic learners, and traditional machine learning approaches, would be of great interest. Moreover, efforts to improve scalability and computational efficiency are paramount: optimizing the approach to handle larger datasets efficiently without sacrificing explainability or classification accuracy is critical for practical use. Finally, extending our methodology to efficiently tackle multi-class classification tasks and accommodate diverse data types, including continuous and multi-modal datasets, represents a significant frontier for exploration.