1 Introduction

Over the last decade, the field of Argument Mining (AM) has grown into a fruitful area of study that comprises a set of challenging sub-tasks [16, 32]. In our work, we make use of the automatic identification and extraction of argument components, i.e., claims [8, 25] and premises [23]. This has been studied for different text domains including news editorials [2], Wikipedia articles [23], social media data like tweets [27], and student essays [31]; the latter are the domain that we address here.

One application of analyzing argumentation in student essays is to support the assessment of essay quality. To this end, a variety of argument-related features have been studied and found useful in the past (see Sect. 2). In this paper, we add features based on “flows” (sequences of occurrence in the text) of the types of claims and premises. We compare the impact of coarse types (major claim, claim, premise) to that of fine-grained semantic types of those components (e.g., fact, value, and policy claims; see Sect. 3.1). We achieve this by utilizing the Argument Annotated Essays (AAE) corpus [31] for training argumentative discourse unit (ADU) identification and semantic type classification models. These models are used to automatically label our two target essay corpora, Feedback [7] and ICLE [11], which have previously been annotated with essay quality ratings, with ADUs and their types. We then extract semantic type flows and use them as features in linear classification models for essay quality prediction.

Our two contributions are (i) the finding that for some dimensions of essay quality, flows of fine-grained features are more powerful predictors than flows of the coarse features; and (ii) a qualitative analysis that leads to some observations on correlations between flow patterns and essay quality.

The next section provides an overview of related work. Section 3 introduces the three corpora we work with and the features we use for semantic types. In Sect. 4, we describe our experiments, which involve some “within-domain transfer”: we train on an essay corpus that is annotated for the component features but lacks quality scores [31], and then run those models on two corpora that offer scores but no (compatible) type annotations [7, 11]. We discuss the findings in Sect. 5 and conclude in Sect. 6.

2 Related Work

Argument Mining in Essays. The AAE corpus, consisting of 402 essays with claims, premises and relations among them [31], is a widely-used resource for developing AM techniques. We mention a few, viz. component detection [30], semantic type annotation and identification [4, 26], essay quality assessment [4, 33], and end-to-end AM [21, 24]. It was also applied in research on unsupervised AM [22], the analysis of argumentation strategies [26], and multi-scale AM [34]. The latter utilizes the text units essay, paragraph, and word for major claim, claim and premise identification, respectively. Another essay corpus that received attention in AM is ICLE [12]. For example, [5] used its rich annotations to compare aspects of argumentation strategies across different cultural groups among English learners.

Argument Component Types. Specific types of argument components have been used to label claims and premises in a variety of text genres. In Wikipedia [23], editorials [2], and persuasive essays [4, 26] premises have been annotated as, e.g., study/statistics, expert/testimony, anecdote or common knowledge/common ground. Other annotated premise types include study, factual, opinion, and reasoning in idebate.org data [15]. For claims, fact, value and policy have been annotated in persuasive essays [4, 26], in addition to logos, pathos, and ethos [4], i.e. Aristotle’s modes of persuasion [14]. Claims in Amazon reviews have been labeled with the types fact, testimony, policy, and value [6].

Social media text has been a popular target, too. Annotated types include evidence types typical for social media, e.g. news media accounts, blog posts, or pictures [1], factual vs opinionated [9], and more recently un/verifiability, reason and external/internal evidence [27]. Furthermore, discussions collected from the subreddit Change My View were annotated for the claim types interpretation, evaluation-rational, evaluation-emotional, and agreement/disagreement, while premises were labeled with logos, pathos, and ethos [13].

In our work, we apply the set of claim and premise types that we described in our recent work on argument strategy analysis [26]. It was derived and extended from previous studies [2, 4].

Argument Analysis for Essay Scoring. In early work, [18] found correlations between distributions of argument component types and holistic essay scores. In contrast, [29] evaluated the contents of the arguments in relation to the argument scheme present in the essay prompt. Building on their data, [3] turned to structure and found a moderate positive correlation between holistic essay scores and distributions of argument components and relations. Similarly, [10] showed that scoring TOEFL essays benefits from features like the number of claims and premises, the number of supported claims, and aspects of tree topology. [20] worked with a broad set of linguistic features and distributions of argument components to predict scores in the ICLE corpus. Closely related to our work is the study by [33] who proposed to use linear “flows” of (coarse) premise and claim units for essay scoring and examined their contribution. We extend this by attending to the more fine-grained features of units.

3 Data

3.1 Argument-Annotated Essays Corpus

We use the AAE corpus [31] as a starting point. The corpus contains 402 student essays annotated with the argumentative discourse unit (ADU) types major claim, claim, and premise and with the relations support and attack. Major claims and claims are linked via stance annotations. Importantly, the component types follow from the argumentation structure: claims always relate to the essay’s major claim, while premises support or attack claims (or other premises). Also, while claims and premises can occur in all essay paragraphs, major claims are, by design, restricted to the first and last paragraphs.

In previous work [26], we annotated the AAE corpus with semantic claim and premise types that can be used for the extraction of argumentative flow patterns, and we provided evidence that these flow patterns are suitable for analyzing argumentation strategies in essays. Here, we briefly describe the semantic types; for more detailed definitions and examples, we refer the reader to [26]. The following claim types were annotated: policy, value, and fact (see Table 2 below for proportions). Policy refers to claims arguing in favor of some action being taken or not being taken. Value claims evaluate a target, e.g. they may argue towards it being good/bad or important/unimportant. Fact claims, on the other hand, state that some target is true or false. In addition to the claim types, we annotated the following premise types: testimony, statistics, hypothetical-instance, real-example, and common-ground. Testimony gives evidence by referring to some expert. Statistics uses, e.g., the results of quantitative research as evidence. Hypothetical-instance and real-example are both example categories: the former refers to situations created by the author, i.e. hypothetical situations, while the latter describes actual historical events or specific statements about the world. Finally, common-ground includes common knowledge, self-evident facts, and the like.

In this work, we use the AAE corpus for training ADU identification and semantic type classification models, which are then used to automatically label the Feedback and ICLE corpora with ADUs and their types. Note that we do not use the original relation and stance annotations.

3.2 Feedback Corpus

The Feedback corpus (n = 3,405) is a subset of the PERSUADE corpus [7], which consists of 25,996 essays written by students from grades 6 through 12. In total, 15 prompts were used to elicit the essays. The corpus has been annotated for different ADU types: lead, position, claim, counterclaim, rebuttal, evidence, concluding statement. The corpus was additionally annotated for different quality dimensions, such as cohesion.

Comparing the argumentative components of the PERSUADE corpus with those of the AAE corpus reveals an apparent overlap in categories. Both corpora are annotated for claim and premise/evidence. Position and major claim are defined similarly. However, recall that the ADU types in the AAE corpus are derived from the overall argumentation structure (via the relations between components), while in the PERSUADE corpus, ADUs are defined semantically.

Semantic type classification builds on top of previously classified ADU types. A direct mapping of the ADU types from PERSUADE to AAE would allow us to learn ADU classification on a much larger corpus, with greater confidence in the predictions for out-of-domain data. To test whether the annotations of the AAE corpus are compatible with those of the PERSUADE corpus, we compare the predictions of our ADU classifier (trained on the AAE data) for the PERSUADE corpus with the original component labels. Mapping the output of our model to the annotations reveals mixed results (see Fig. 1). While evidence and premise overlap to a good extent, differences in claim conceptualization appear problematic: both claim and counterclaim are mapped in similar proportions to claim and premise by our model. Rebuttal, which is defined as “a claim that refutes a counterclaim” [7], is mostly classified as premise, while concluding statement corresponds to the whole variety of AAE components. Thus, the conceptualizations of argument components are on the whole different in the two corpora, and we therefore decided not to use the component annotations of the Feedback corpus but to work with our predicted labels instead.

Fig. 1. Confusion matrix for original PERSUADE corpus labels (y-axis) and the predictions of our AAE model (x-axis).

For our quality prediction experiments, we use the dimensions cohesion and conventions. A text with high cohesion is defined as containing a variety of effective linguistic features such as reference and connectives to link ideas across sentences and paragraphs. Conventions is defined as the use of common rules, including spelling, capitalization, and punctuation.

3.3 International Corpus of Learner English

Our second target corpus is derived from the ICLE corpus [11], which contains more than 6,000 student essays, of which 91% are argumentative. While no argument component annotations are available, the corpus has been annotated for different scoring dimensions. In this work, we utilize the subset of the corpus that has been annotated for organization [19] and argument strength [20] (n = 896). Previously, a high organization score was defined as providing a position with respect to an introduced topic and supporting that position [28]. As this definition roughly describes the core aspects of argumentation, we assume this scoring dimension to be a good candidate for our study. An essay with high argument strength, in turn, “presents a strong argument for its thesis and would convince most readers” [20]. Argument strength is thus tied to persuasiveness, again one of the core aspects of successful argumentation.

4 Experiments

Our experiments consist of two steps: Labeling the two target corpora with ADUs and their semantic types (Sect. 4.1), and testing the contribution of type change flows for the task of essay score prediction (Sect. 4.2). In Sect. 4.3, we undertake a qualitative inspection of flows associated with essays of different quality.

4.1 ADU and Semantic Type Classification

We first classify the coarse type of the argumentative components as major claim, claim, or premise. Afterward, we classify the fine-grained semantic types conditioned on the previously identified coarse type. For the semantic type classification, however, we do not distinguish between major claims and claims but regard both of them as claims. As both classification tasks, ADU and semantic type, have been studied previously [26, 31, 33], we do not conduct extensive comparative experiments here but report the performance of our ensembles to give an estimate of the quality of the projected labels.
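The two-step procedure can be summarized as follows. This is a minimal sketch only; the model handles and label names are illustrative stand-ins (e.g. wrappers around fine-tuned classifiers), not our actual implementation:

```python
# Minimal sketch of the two-step classification; coarse_model,
# claim_model, and premise_model are hypothetical callables.
COARSE_TYPES = ["major_claim", "claim", "premise"]
CLAIM_TYPES = ["policy", "value", "fact"]
PREMISE_TYPES = ["testimony", "statistics", "hypothetical_instance",
                 "real_example", "common_ground"]

def classify_sentence(sentence, coarse_model, claim_model, premise_model):
    """Predict the coarse ADU type first, then the semantic type
    conditioned on it; major claims and claims share one type model."""
    coarse = coarse_model(sentence)          # one of COARSE_TYPES
    if coarse in ("major_claim", "claim"):
        fine = claim_model(sentence)         # one of CLAIM_TYPES
    else:
        fine = premise_model(sentence)       # one of PREMISE_TYPES
    return coarse, fine
```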

For each of the two steps, we train an ensemble of three models. We use 10% of the AAE corpus for development; the remaining data is used for training. Per run, the data is split randomly (with the random number seed set to 1, 2, or 3, respectively).

As a classifier, we use a pre-trained language model, roberta-base [17], for both the coarse and the fine-grained step. Following previous work by [33], we identify ADUs solely on the sentence level, disregarding smaller units. Our input to the model is the target sentence plus one additional sentence on the left and the right, to provide context. The context is separated from the target sentence by the model’s special tokens. We found that adding this context improves results compared to processing single sentences. Also, it works better than giving the model more context information (additional sentences or structural information such as paragraph breaks).
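One possible realization of this input construction with the Hugging Face tokenizer is sketched below; using the tokenizer’s sep_token to delimit the context sentences is an illustrative assumption, not necessarily the exact format of our implementation:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def build_input(left, target, right):
    """Encode the target sentence with one context sentence on each
    side, delimited by the model's separator token (</s> for RoBERTa)."""
    sep = tokenizer.sep_token
    text = f"{left} {sep} {target} {sep} {right}"
    return tokenizer(text, truncation=True, max_length=512,
                     return_tensors="pt")

encoded = build_input("Left context sentence.",
                      "The target sentence to classify.",
                      "Right context sentence.")
```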

The ensembles are evaluated on the full AAE corpus. The final classification result is an averaged softmax, from which the label with the maximum probability is chosen. See Table 1 for the results on the annotated corpus. We further assessed our approach manually on a smaller sample: we sampled 15 instances per semantic type and obtained satisfactory macro F1 results (claims: 95.55, premises: 91.64). However, during our review, we noticed some problems with the underlying processed data, e.g. grammatical inconsistencies within sentences and resulting problems in understanding the author’s intentions; addressing these is beyond our project’s scope.
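The ensemble decision rule is simple; a sketch with PyTorch, assuming the three members are loaded as Hugging Face sequence classification models and encoded is a batch as built above:

```python
import torch

@torch.no_grad()
def ensemble_predict(models, encoded):
    """Average the softmax distributions of all ensemble members and
    return the index of the label with the highest mean probability."""
    probs = [torch.softmax(m(**encoded).logits, dim=-1) for m in models]
    return torch.stack(probs).mean(dim=0).argmax(dim=-1)
```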

Table 1. Macro-averaged classification results for the AAE corpus.

We then use the trained classification models to predict argument components and semantic types in our two target corpora, Feedback and ICLE. Table 2 shows the distribution of semantic types both for the manually annotated AAE corpus and for the automatic predictions in the Feedback and ICLE corpora. While some types are equally distributed, e.g. policy and statistics, there are notable differences in others. For instance, fact claims occur more frequently in the AAE essays, while our models labeled claims in Feedback and ICLE more often as value. For premises, Feedback contains substantially more hypothetical-instances, while the majority class in ICLE is common-ground.

Table 2. Proportions of semantic types by corpus.
Table 3. Most common change flows of semantic types for different argument components. The first letter refers to the type of the argument component (M = major claim, C = claim, and P = premise), the following letters denote the semantic type (e.g. CV = claim-value; PCG = premise-common-ground). Levels are first and last paragraph of the essay, and everything in-between (body).

4.2 Predicting Essay Quality with Flows of Semantic Types

In this section, we investigate whether essay quality prediction can be improved by using flows of our fine-grained semantic types, in comparison to flows of coarse ADU types, as they had been used by [33]. By “flow”, we mean the linear sequence of type labels that occur in a text unit (paragraph or full text). Importantly, we work with change flows, which result from collapsing sequences of identical types into a single label. This way, we ignore the information on the “length” of a stretch with the same type and focus only on the changes from one type to another.
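Change flows can thus be computed by collapsing consecutive duplicate labels; a minimal sketch (label abbreviations as in Table 3):

```python
from itertools import groupby

def change_flow(labels):
    """Collapse runs of identical type labels into a single occurrence,
    keeping only the points where the type changes."""
    return [label for label, _ in groupby(labels)]

# PCG = premise-common-ground, PHI = premise-hypothetical-instance
sequence = ["PCG", "PCG", "PHI", "PHI", "PHI", "PCG"]
assert change_flow(sequence) == ["PCG", "PHI", "PCG"]
```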

To simplify the prediction problem, we group all essays into two classes good and bad. We normalize all quality scores to the range [0 .. 1], and then label essays with a score above 0.7 as good and others as bad.
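A small sketch of this binarization follows; min-max scaling is our assumption for the normalization step:

```python
def binarize_scores(scores, threshold=0.7):
    """Normalize raw quality scores to [0, 1] (min-max scaling assumed)
    and label essays above the threshold as good, the rest as bad."""
    lo, hi = min(scores), max(scores)
    normalized = [(s - lo) / (hi - lo) for s in scores]
    return ["good" if s > threshold else "bad" for s in normalized]

print(binarize_scores([1.0, 2.5, 4.0, 5.0, 6.0]))
# -> ['bad', 'bad', 'bad', 'good', 'good']
```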

Given the annotations of coarse ADU types and semantic types in the two target corpora, we extract change flow features, both on the global essay level and on that of paragraphs, and for ADU and semantic types, respectively. In Table 3, we show the most common change flows of semantic types in the corpora, divided into first paragraph, body, and last paragraph.

For predicting the quality class, we train linear models on all extracted change flow features; in particular, we use classifiers trained with stochastic gradient descent. We set the maximum number of iterations to 1500, use balanced class weights, and use grid search with cross-validation to decide on the remaining parameters.
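In scikit-learn terms, this setup corresponds roughly to the sketch below; only max_iter=1500 and the balanced class weight are fixed by the description above, while the featurization, the searched grid, and the toy data are illustrative assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Each essay is represented by its space-separated change flows, so
# that every complete flow becomes one count feature (illustrative).
pipeline = Pipeline([
    ("features", CountVectorizer(token_pattern=r"\S+")),
    ("clf", SGDClassifier(max_iter=1500, class_weight="balanced")),
])
param_grid = {                      # example grid; the actual one differs
    "clf__loss": ["hinge", "log_loss"],
    "clf__alpha": [1e-4, 1e-3, 1e-2],
}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1_macro")

essays = ["MV-CV PCG-PHI-PCG PST-PCG", "PHI PRE-PHI",
          "CV-PCG PCG-PRE-PCG", "PHI CV-PHI",
          "MF-CV PCG-PHI-PCG", "PRE PHI-PRE"]
labels = ["good", "bad", "good", "bad", "good", "bad"]
search.fit(essays, labels)
```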

We run the comparison as a 10-fold cross-validation with the optimal parameters. Table 4 shows our macro scores (precision, recall, and F1) summarized as mean and standard deviation over the 10 runs. We present results for all four essay scoring dimensions: cohesion, conventions, organization, and argument strength. Baseline refers to a stratified classifier, which predicts classes in proportion to their observed frequencies and outperforms a simple majority-vote baseline.

Table 4. Essay Scoring Results. Means and standard deviations of 10-fold cross-validation measured as precision, recall, and F1 scores. As the macro average takes into account the imbalance of the labels, this can result in the F1 values not being between the respective macro values for precision and recall.

Both ADU and semantic type models outperform the baselines. We achieve higher F1 scores for the dimensions conventions and organization with models trained on semantic type change flows instead of coarse ADU type change flows (conventions: 0.559 vs 0.528; organization: 0.603 vs 0.580). For cohesion and argument strength, the two types of flows obtain similar results.

4.3 Analysis of Feature Impact

We use the trained linear models to extract semantic change flow features that are prevalent in good vs bad essays and are thus good predictors for the respective class. We normalize the coefficients to center them around zero, so that features with positive coefficients are associated with the good class and features with negative coefficients with the bad class. We investigated the most important change flows in good vs bad essays both on the full essay level and on the paragraph level. We will only present the results from analyses of the body paragraphs (see Tables 5 and 6), as, presumably, this is where the main argumentation unfolds.
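Reading off the most indicative flows from a fitted linear model can be sketched as follows, assuming the pipeline from the previous section; centering by subtracting the mean is one plausible realization of the normalization described above:

```python
import numpy as np

def top_flows(vectorizer, clf, k=5):
    """Return the k change flows with the largest positive and the k
    with the largest negative (centered) coefficients."""
    names = vectorizer.get_feature_names_out()
    coefs = clf.coef_.ravel()
    coefs = coefs - coefs.mean()        # center coefficients around zero
    order = np.argsort(coefs)
    good = [(names[i], coefs[i]) for i in order[::-1][:k]]
    bad = [(names[i], coefs[i]) for i in order[:k]]
    return good, bad

# Usage with the fitted grid search from above (hypothetical):
# vec = search.best_estimator_.named_steps["features"]
# clf = search.best_estimator_.named_steps["clf"]
# good_flows, bad_flows = top_flows(vec, clf)
```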

With respect to claim-premise change flows in paragraphs, bad essays are most notably characterized by a lack of claims, i.e. by paragraphs that consist of premises only. This is especially the case for the quality dimensions cohesion, organization, and argument strength. Furthermore, paragraphs of good essays appear to show more type variety. This is observable for all quality dimensions, but most clearly for the ICLE corpus, i.e. for organization and argument strength.

More patterns emerge in the premise change flows. For instance, both Feedback dimensions (cohesion and conventions) show the same most dominant flows in good essays, i.e. PCG-PHI-PCG-PHI and PST-PCG. Also, paragraphs in essays with a high conventions score tend to begin with common-ground, while their flows exhibit fewer changes than those of bad essays. Recall that this does not necessarily imply a less complex argumentation structure, as change flows collapse sequences of identical semantic types.

ICLE essays with a high organization score show complex premise change flows, which often include several common-ground units framing hypothetical-instance, real-example, or combinations thereof. Bad essays, on the other hand, are characterized by example types that are rarely combined with common-ground. Similar observations can be made for the argument strength dimension. As in the Feedback corpus, both ICLE dimensions share the same dominant flows, i.e. PCG-PRE-PCG and PCG-PHI-PCG.

Table 5. Change Flows on Paragraph Level (Body): Feedback Cohesion & Conventions. The first letter refers to the type of the argument component (M = major claim, C = claim, and P = premise), the following letters denote the semantic type (e.g. CV = claim-value; PCG = premise-common-ground).
Table 6. Change Flows on Paragraph Level (Body): ICLE Organization & Argument Strength. The first letter refers to the type of the argument component (M = major claim, C = claim, and P = premise), and the following letters denote the semantic type (e.g. CV = claim-value; PCG = premise-common-ground).

5 Discussion

Transfer across corpora is a complex task. Even corpora that belong to the same general domain of texts, e.g. persuasive essays, may exhibit notable differences in argumentation structure and strategies. This is reflected in the distribution of semantic types across our essay corpora. For instance, the AAE corpus contains a substantially larger proportion of fact claims compared to both Feedback and ICLE. The Feedback corpus shows an especially large proportion of hypothetical-instance, while premises in the ICLE corpus have been predominantly labeled with common-ground. These differences in semantic types have an impact on the observable change flows, and thus on argumentation strategies.

To begin with, bad essays with respect to cohesion, organization and argument strength tend to contain paragraphs without a claim more often than good essays. This is intuitively plausible, as a full argument typically consists of a claim and at least one premise. However, important change flows for the prediction of bad essays with respect to the conventions dimension still contain claims. This may be due to the quality dimension at hand, as conventions is less clearly linked to argumentation quality than the other dimensions.

Second, the suitability of premise change flow complexity as a predictor for essay quality depends on the corpus and quality dimension. While ICLE essays with high organization and argument strength scores tend to show more variety in premise change flow patterns, Feedback essays with high conventions scores show less variety.

Third, good essays with respect to conventions, organization, and argument strength show change flows that begin with common-ground or use it as a framing type, typically in combination with an example type. This is in line with the argumentation strategy found in the AAE corpus of beginning (and ending) an argument with a general observation while inserting more concrete premises, e.g. examples, in between [26]. Overall, we can summarize that semantic change flows can be indicative of argument strategies applied to produce a persuasive essay of high quality.

6 Conclusion

In this work, we studied to what extent argument arrangement, in the sense of change flows of semantic types, can support the prediction of student essay quality.

To this end, we trained models for ADU and semantic type classification on the AAE corpus, which has been annotated accordingly in previous work [26, 31]. We used these models to label essays in two target corpora, Feedback and ICLE, extracted change flows of ADUs and semantic types, and used them for essay quality prediction. Importantly, we showed that some dimensions of essay quality, i.e. conventions and organization, can be predicted better using flows of semantic types than using coarse ADU types. This result expands on the earlier work of [33]. Finally, we identified change flow features that are important predictors for good vs bad essays.

We find that 1) the distribution of semantic types depends on the corpus at hand and 2) bad essays tend to lack claims, i.e. contain incomplete arguments. Further, we observe that 3) the mere complexity of change flows is not a sufficient predictor for quality and 4) certain change flows of semantic types indicate the use of argumentation strategies.

In the future, we are interested in investigating more thoroughly the relationship between argumentation strategies and essay quality. Here, we considered this topic only briefly in Sect. 5. Also, we plan to extend our analysis to other out-of-domain corpora (e.g., news editorials and the subreddit Change My View).

Limitations

Due to the small number of annotated essays (402 instances), we can only estimate to a limited extent how well the projection of the annotations by our neural models onto corpora outside the essay domain works. The questions of how well these models work on out-of-domain data and how well the semantic type scheme applies to other domains deserve greater attention in future work.

For our study, we decided to follow previous research that simplifies argument component classification to the sentence level. Although this is considered legitimate for the AAE corpus due to its consistently strict essay structure, it is in general a simplification that leads to inexactness in the extracted components.

Our work is the first attempt to use abstract semantic patterns to measure the quality of student writing. However, due to the relatively small gains in performance, we assume that the selected quality dimensions may not ideally capture the meaning of our semantic types.