
1 Mining Interactions in Debates

In recent years, there have been many refinements to argument mining (AM), a natural language processing (NLP) sub-task that deals with the detection and classification of argumentative structures in text [12]. With the advent of modern transformer-based language models such as BERT [7] and its numerous successors, text classification tasks for increasingly abstract categories such as frames [9] or aspects [18] have been introduced. While these new methods are improving our ability to capture argument semantics, they still operate on isolated text units that only approximate the kind of argumentation that occurs in the wild. Among other things, this can be attributed to how language models are fine-tuned for text classification: the process operates with limited context lengths and works best with an abundance of individually labeled data points. In real life, however, argumentation regularly takes place as an exchange of arguments. An abundance of such exchanges can nowadays be observed on online social media platforms, ready to be analyzed. This leads to large datasets of structured conversations, rich in potential arguments. While interesting findings can already be obtained by classifying isolated text units, the information from dialog structures proves valuable for analyzing argumentative discourses more closely.

Fig. 1. Example of a tree structure of a Twitter conversation about nuclear energy. Vertices are tweets, directed edges indicate replies. For SPM, argumentative tweets are converted to abstract representations encoding stance (pro or contra) and aspects (one or more labels from a set of 17 aspects describing the German nuclear energy debate, cf. Table 3). For each tree, a set of transactions of certain lengths for pattern mining can be derived.

In this paper, we explore a novel approach that combines state-of-the-art argument mining with sequential pattern mining (SPM), a data mining approach that is more prominently used in market research. We use it to answer the following research question: How can categorical predictions from text classification be evaluated together with dialogical structural information to find characteristic argumentative patterns that describe the dynamics of a debate?

To create abstract representations of arguments, we fine-tune language models on two common classification tasks for arguments: stance, and aspect. We then apply both classifiers to a large corpus of tweets related to the nuclear energy debate (cf. Figure 1 for an example graph of an original tweet and its replies). After mining this dataset for reply chains of various lengths, we can describe interactions between users as sequences of tuples in the form (aspect, stance), and look for common patterns in this database. We can then further examine the conversations that contain the most frequent patterns qualitatively and see if they allow us to draw conclusions about how people react to different arguments in online debates. With our method, we aim to support social science research to conduct discourse analysis of large diachronic datasets that utilize the technological advances in NLP constructively.

In the upcoming Sect. 2, we give an overview of related work to our approach. In Sect. 3, we describe the dataset that we have used for conducting our experiments as well as the details regarding the fine-tuned language models that were used. We also introduce our approach to finding patterns in conversations. In Sect. 4, we compare the patterns found in our dataset across the different time slices and conclude in Sect. 5 with a discussion of the potentials as well as the limitations of our approach as a method for argument mining in the social sciences.

2 Related Work

Argument mining has advanced to the extraction of finer-grained, more qualitative features from argumentative text. Examples include argument mining with a novel focus on key points [8], frames [2] or aspects [18, 24], which aim to extend argument mining from its original focus on linguistic structures to more semantic units that are of interest for (computational) social science research. Analyzing semantic aspects of arguments is still not widespread in argument mining because of the challenge of covering the broad range of controversial topics [3], but there is already a solid foundation of preliminary work. [15] stressed the importance of context when mining for argument relations, albeit prior to the advent of powerful contextual word embeddings. [22] established the task of mining for argumentation structures as an important link to discourse analysis. Newer approaches also include larger contexts to better match argumentation patterns in empirical data, which often use implicit premises, lack argument markers, or are elaborated beyond single sentences [17]. Widening the context also proved helpful for other text classification tasks such as hate speech detection [28].

While most work on argument mining focuses on learning from isolated textual units, some research tries to mine argumentation from dialogue structures such as online discussion threads [6]. [20] identify distinguishable conversation types from Twitter conversations that can potentially be exploited for mining argument relations. They similarly mined for conversations on Twitter, but have built a smaller dataset by only considering the longest possible thread from an initial root tweet to one leaf. We build upon this work by utilizing the structural information that is available to a greater extent, and include dialogue structures from the many incomplete conversations on Twitter, too.

Moreover, there has been increased attention on the importance of interdisciplinary approaches to argument mining [25]. The field of computational social science (CSS) strives to analyze large amounts of digital trace data with computational methods for social science research questions. Argument mining bears a high potential for CSS due to its ability to give insights into the use of argumentation in political, or otherwise socially impactful debates. In recent years, more cooperation between researchers with a strong foundation in both argument mining and CSS was established. These works often produce data annotated in a way that is in line with the existing standards from the social sciences and make supervised machine learning applicable, e.g. [11], and [18]. Such datasets are important for bringing argument mining closer to CSS researchers as they enable thorough quantitative research opportunities and give a new dimension to qualitative research on big data. We build on the work of [18], by using the methodology of annotating data in tandem with experts from social science to create a dataset with high utility. [10] describe a methodology of using Discourse Network Analysis, a network representation obtained from news corpora, where actors (e.g. politicians) and their claims form two types of nodes in a bipartite graph. By this, discourse networks combine state-of-the-art AM technology for claim and stance detection with a social science goal. Our approach differs from this method by relying on explicit dialog structures from empirical conversation data instead of modeling abstract discourse representations from large amounts of news. [14] describe a novel method for predicting argument persuasiveness from patterns of types of argumentative discourse units mined from individual posts in online debates which are then clustered with other patterns from the same discourse. 
The features used are more structural and context-independent and patterns are clustered in order to get insights into discussion. While their approach has a similar goal to ours, namely finding patterns in discussions, it does not employ data mining on patterns but uses clustering of similar sequences on the level of single posts.

Sequential pattern mining is not widely used today in either CSS or NLP applications. [26] used SPM for retrieving questions from text in the absence of common cues like question marks, a situation that is common for online utterances that may lack the usual grammatical structure. [21] applied SPM to analyze argument structures for two scientific domains for which they hand-coded argumentative structures. They annotated argumentative sections in scientific articles and used SPM to identify typical argument structure models based on the patterns they found. To our knowledge, our study is the first that utilizes semantic features from argument mining as input for SPM.

3 Predicting a Conversational Dataset

Since SPM operates on ordered sets of items, we need to convert the information of individual utterances of a conversation into elements of sets, creating transactions that represent the conversation. We first created a structured conversation corpus that contains conversation trees. A conversation tree is a directed tree graph with tweets as its nodes, and their reply relationship to a previously posted tweet as edges. An example of a conversation tree is shown in Fig. 1. We can then mine the tree structures for conversation chains of various length n, which are sub-graphs of the conversation tree. By classifying each node in a chain with the two properties of stance and aspect, we encode arguments in tweets as transactions. We perform pattern mining on the ordered sets of these transactions.

3.1 Corpus Creation

In order to create a dataset of conversations that are held on social media, we mined entire conversations from Twitter (now re-branded as X). For our study, we focus on the nuclear energy debate in Germany. We first used a key term query to the Twitter API to retrieve individual German-language tweets related to the nuclear energy debate from three different years: 2017, 2019, and 2021 (Footnote 1). The resulting tweets were used to retrieve thematically matching conversations by two strategies. First, we filtered for root tweets only, i.e. keyword-matching tweets that were posted on Twitter initially, in contrast to replies as reactions to earlier posted tweets, and requested their entire set of replies via the API. Second, for reply tweets that matched our query within a conversation, we included these tweets along with their directly connected replies from the conversation tree. While this procedure reduced the size of our dataset significantly, it was necessary to ensure that the dataset remained consistent with our target topic.

Table 1. Dataset statistics of the tweet dataset.

Table 1 shows basic statistics of the final dataset, which is heavily skewed toward the more recent conversations from 2021. This is likely due to an increase in public attention to the topic of nuclear energy as well as the growing popularity of Twitter as a public debate forum. Further, the older a conversation is, the more likely it is that parts of it, or the entire conversation, were deleted from the platform and, thus, are no longer available via the API (Footnote 2).

3.2 Mining Conversation Chains from Incomplete Graphs

Since many of the conversation trees in our dataset referenced tweets that could not be retrieved by the API anymore, we opted for mining chains for each tweet individually as an alternative to the traversal of complete conversation trees. This ensures that all tweets that are included in any chain also have immediate neighbors included in the chain, making the mining of relations between utterances and their responses possible. Table 2 shows the distribution of the reconstructed maximum chain lengths for each year. Around 40% of all tweets in the corpus that are predecessors in a dialogical conversation triggered one single reply only. The longest reply chains we found contain up to 70 messages. We decided to limit chain lengths in our experiments for several reasons. First, computational complexity increases significantly for longer chains. Second, from the low ratio of extremely long chains, it is already evident that the likelihood of finding common argumentative patterns that include a larger number of items will be very low.
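The per-tweet chain mining described above can be sketched as follows. This is a minimal illustration, not the implementation used in this study: the function name `reply_chain`, the `parent_of` mapping, and the default length limit `max_len` are hypothetical. The sketch assumes that every retrieved tweet is a key of `parent_of` and that a parent id missing from the keys stands for a tweet that the API could no longer return.

```python
from typing import Dict, List, Optional


def reply_chain(
    tweet_id: str,
    parent_of: Dict[str, Optional[str]],
    max_len: int = 5,
) -> List[str]:
    """Return the reply chain that ends in `tweet_id`, oldest tweet first.

    The walk goes from the tweet towards the root and stops when the
    parent is unknown (root tweet), was not retrieved (missing from
    `parent_of`), or the chain has reached `max_len` tweets.
    """
    chain = [tweet_id]
    current = tweet_id
    while len(chain) < max_len:
        parent = parent_of.get(current)
        if parent is None or parent not in parent_of:
            break  # root reached, or the parent tweet is unavailable
        chain.append(parent)
        current = parent
    chain.reverse()  # oldest tweet first, so replies follow their targets
    return chain


# Toy conversation: a <- b <- c, plus a reply e whose parent d is missing.
example_parents = {"a": None, "b": "a", "c": "b", "e": "d"}
example_chains = {t: reply_chain(t, example_parents) for t in example_parents}
```

Because each chain is built from one tweet upwards, every tweet in a chain has its immediate neighbor in the same chain, even when other branches of the tree are incomplete.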

Table 2. Chain length distribution by year

3.3 Argument Abstraction by Stance and Aspect Prediction

We aim to use established AM methods to derive tuples of information that represent an abstract version of an argumentative text. Two major semantic pieces of information of an argument are stance and aspect, which can be classified with satisfactory performance by transformer language models fine-tuned on labeled examples. For this, we annotated a dataset of 642 German tweets with stance and aspect labels following the method described in [18]. Table 3 shows the aspects that were coded to cover the most prominent aspects of the German debate. Intercoder agreement measured by Krippendorff’s \(\alpha \) is very good for most of the categories. Two aspect categories, temporal dimension and reliability, achieved only substantial agreement of around 0.6 (Footnote 3). In addition, we used large publicly available English-language datasets on both tasks for transfer learning in a multitask learning (MTL) setting. As a language model, we used xlm-roberta-large, the multilingual version of RoBERTa [13], in all our experiments. Further, for all experiments, five models were trained to minimize random effects in the results. We report the mean and standard deviation of the performance in Table 4. For aspect classification, we used all available data from the Argument Aspect Corpus (AAC) [19] for transfer learning, which contains aspect labels for sentences from four topics, and our additionally coded German-language dataset in a two-task MTL sequence tagging setup. On the test set of 10% of the annotated German tweets, our classifier achieved an overall micro F1-score of 77% (Footnote 4). For stance classification, we used the Sentential Argument Mining Corpus (UKP-SAM) [23], which provides stance information on a large number of sentences across eight topics, as a transfer learning task. We modeled both tasks, the UKP-SAM dataset and the additional German tweet dataset, as text classification tasks. The classifier reaches a micro F1-score of around 80% on the German test data.

Table 3. Number of occurrences and intercoder agreement (Krippendorff’s \(\alpha \)) for each aspect in the tweet dataset. In the paper, we refer to aspects using the corresponding English short labels.
Table 4. Overall performance metrics for sentence-level aspect classification and stance classification on the test dataset of coded tweets concerning the German nuclear energy debate.

3.4 Sequential Pattern Mining on Predicted Data

SPM aims to find recurring patterns in databases containing sequentially ordered transactions [1]. The method is typically employed to identify patterns for market basket analysis such as ‘customers who bought a PC, and later that month a digital camera, will likely buy a printer next month’. For our analysis, we conceptualize dialogical argumentation threads, in analogy to shopping cart analysis, as compilations of abstract arguments from the ‘market’ of publicly debated ideas. We build transactions by representing each tweet in a retrieved chain as a tuple containing the predicted aspect and stance information. We use the PrefixSpan algorithm [16] (Footnote 5), which efficiently finds patterns by recursively growing them from their prefixes, starting with all prefixes of length 1. In each step, for each prefix \(\alpha \), the projected database \(S|_\alpha \) is created, which contains the postfixes with respect to \(\alpha \) of all sequences that contain \(\alpha \), i.e. the remainders of those sequences after the first occurrence of \(\alpha \) (Footnote 6). The most important metric for evaluating the significance of mined patterns is the support, which is defined as the proportion of sequences in the database in which a pattern occurs. As a parameter, PrefixSpan considers in each step only postfixes with a minimum desired support. After some experimental testing on our empirical data, we set the minimum support for patterns considered relevant for our analysis to 1% (Footnote 7).
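To make the prefix-projection idea concrete, the following is a simplified sketch of PrefixSpan for sequences whose elements are single (aspect, stance) items. It is not the library implementation used in this study, and the function name `prefixspan`, its signature, and the toy database are illustrative assumptions. The projected database is represented by pointers (sequence index, start offset of the postfix) instead of copied postfixes.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Item = Tuple[str, str]  # (aspect, stance)
Pattern = Tuple[Item, ...]


def prefixspan(
    sequences: Sequence[Sequence[Item]], min_support: float
) -> Dict[Pattern, float]:
    """Return all patterns whose relative support is at least `min_support`."""
    min_count = min_support * len(sequences)
    results: Dict[Pattern, float] = {}

    def mine(prefix: Pattern, projected: List[Tuple[int, int]]) -> None:
        # For every item, collect the postfix that follows its first
        # occurrence in each projected sequence (one entry per sequence).
        extensions: Dict[Item, List[Tuple[int, int]]] = defaultdict(list)
        for seq_idx, start in projected:
            seen = set()
            seq = sequences[seq_idx]
            for pos in range(start, len(seq)):
                if seq[pos] not in seen:
                    seen.add(seq[pos])
                    extensions[seq[pos]].append((seq_idx, pos + 1))
        for item, occurrences in extensions.items():
            if len(occurrences) >= min_count:
                pattern = prefix + (item,)
                results[pattern] = len(occurrences) / len(sequences)
                mine(pattern, occurrences)  # grow the pattern recursively

    mine((), [(i, 0) for i in range(len(sequences))])
    return results


# Toy database of four reply chains (made-up labels, for illustration only).
toy_db = [
    [("costs", "con"), ("costs", "pro")],
    [("costs", "con"), ("climate", "pro"), ("costs", "pro")],
    [("climate", "pro"), ("climate", "pro")],
    [("costs", "con"), ("renewables", "pro")],
]
toy_patterns = prefixspan(toy_db, min_support=0.5)
```

In the toy database, the pattern ((costs, con), (costs, pro)) is counted for the second chain even though another tweet lies in between, which illustrates that the mined patterns need not be contiguous.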

Figure 2 shows the stance distribution for the predicted dataset. A significant proportion of the tweets in the dataset were predicted as having no stance. This is plausible since not all posts on a topic are actually argumentative and take a stance. For the pattern mining experiments, chains that contained tweets without a stance were excluded. Including these posts resulted in a majority of patterns revolving around tweets without a stance, which were not argumentative and thus revealed no argumentative patterns. It is also noticeable that a majority of tweets with a stance were predicted as having a pro stance. While in 2017 there are 2.05 times more pro tweets than con tweets, this factor increases by about 43% to 2.94 in 2019 and slightly decreases to 2.79 in 2021. This implies that the discussion on Twitter is generally more in favor of nuclear energy.

Fig. 2. Stance distribution in the predicted dataset, by year

Since we tagged aspects as token spans, one tweet can potentially contain multiple aspects. We investigated two possibilities to resolve multi-aspect tweets when creating transactions. The first is the concatenation of aspects, e.g. (costs_reliability, pro) for a tweet with a pro stance that contains costs and reliability as aspects. The second is a flat representation, which creates a separate transaction for each aspect. We found that concatenating aspects resulted in fewer significant patterns, as a result of the combinatorial explosion of possible transactions (see Fig. 5 in the Appendix for a discussion of this processing step). Due to these two findings, we limit the mining for patterns to flat chains that contain only tweets for which a pro or con stance was predicted.
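The two encodings of a multi-aspect tweet can be sketched as follows. The function names and the alphabetical ordering of concatenated aspects are assumptions made for illustration; only the number of aspects (17) and the example label are taken from the text.

```python
from typing import Iterable, List, Tuple

N_ASPECTS = 17            # size of the aspect set for this debate
STANCES = ("pro", "con")  # chains with no-stance tweets are excluded


def concatenated_item(aspects: Iterable[str], stance: str) -> Tuple[str, str]:
    """Encode all aspects of a tweet as one joined label (order is assumed)."""
    return ("_".join(sorted(aspects)), stance)


def flat_items(aspects: Iterable[str], stance: str) -> List[Tuple[str, str]]:
    """Encode each aspect of a tweet as its own (aspect, stance) item."""
    return [(aspect, stance) for aspect in sorted(aspects)]


# Upper bounds on the number of distinct items under each encoding:
# every non-empty subset of aspects times two stances, versus
# every single aspect times two stances.
n_concatenated = (2 ** N_ASPECTS - 1) * len(STANCES)
n_flat = N_ASPECTS * len(STANCES)
```

Under these assumptions the concatenated encoding allows up to 262,142 distinct items, compared to 34 for the flat encoding, so the same number of tweets is spread over far more item types and fewer patterns can reach the minimum support.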

Fig. 3. Aspect distribution for pro arguments, by year

Fig. 4. Aspect distribution for contra arguments, by year

4 Results

Figures 3 and 4 show the proportions of aspects for pro- and con-stanced tweets, i.e. patterns of length 1, by year. For pro arguments, three aspects have a share of more than 10% of all pro arguments throughout the three years: renewables, fossil fuels, and climate. Two other aspects, safety and reliability, fall below a 10% share, and the remaining aspects generally make up five percent or less of all pro arguments. The most significant increase is seen in the share of arguments addressing renewables, which make up nearly 20% in 2021. For con arguments, renewables, costs and safety are strongly represented throughout the years, but a greater number of aspects have shares between five and ten percent throughout the years. While climate is steadily rising from six to ten percent, reliability is falling to 8.3%. An important difference between the two distributions is the prevalence of nuclear waste as a well-represented con argument, while it stays below a five percent proportion throughout the years among pro arguments.

Table 5. Top 5 support and attack patterns with the most support for each aspect combination, by year. Green arrows indicate a rise in the pattern rank, red arrows indicate a fall, dashes indicate no change in the position.

4.1 Attack and Support Patterns

Table 5 shows the top five patterns for the four possible combinations of pro- and con-stanced tweets over the three time slices. Alternation between pro and con stances in subsequent tweets of a chain can be interpreted as an attack relation between arguments, while repeated stances indicate a support relation. The most significant patterns all have a length of two. In total, 327 patterns with a minimum support of 1% were mined, yet only 24 patterns had more than two items. Due to the chain length distribution in the dataset (cf. Table 2), longer chains are rare. The support of the top patterns of 2017 is significantly higher compared to later years. A possible explanation is that the smaller overall discourse by number of tweets was more uniform and expanded over time to more diverse aspects. We further observe the highest support for pro \(\leftarrow \) pro patterns, which originates from the high prevalence of pro-labeled tweets in the dataset. Many prevalent patterns address the same aspect in a row. A possible explanation is that people prefer to reinforce statements they agree with by repeating them (with variations). Another factor may be self-replies to construct a longer thread of tweets for making an argument. In the following, we investigate the results for each combination of pro- and con-stanced tweets.
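The interpretation of stance sequences as attack or support relations can be sketched as a simple rule over pairs of adjacent tweets. The function name `relation` and the label format are illustrative assumptions; the sketch assumes that the first stance belongs to the earlier tweet and the second to its reply, matching the arrow notation used in the following paragraphs.

```python
from collections import Counter
from typing import List, Tuple

Item = Tuple[str, str]  # (aspect, stance)


def relation(parent_stance: str, reply_stance: str) -> str:
    """Label a pair of consecutive stances as a support or attack relation."""
    for stance in (parent_stance, reply_stance):
        if stance not in ("pro", "con"):
            raise ValueError(f"unexpected stance: {stance!r}")
    kind = "support" if parent_stance == reply_stance else "attack"
    return f"{kind} {parent_stance} <- {reply_stance}"


def relation_counts(patterns: List[Tuple[Item, Item]]) -> Counter:
    """Count relation types over length-2 patterns of (aspect, stance) items."""
    return Counter(relation(first[1], second[1]) for first, second in patterns)


# Made-up length-2 patterns, for illustration only.
toy_patterns = [
    (("climate", "pro"), ("climate", "pro")),
    (("costs", "con"), ("renewables", "pro")),
    (("costs", "con"), ("costs", "con")),
]
toy_counts = relation_counts(toy_patterns)
```

Such a rule only looks at predicted stances, so a pair labeled as support may in fact be a self-reply that continues a thread, as noted above.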

Support Pro \(\leftarrow \) Pro. There are significant changes in the top patterns among supporting pro arguments. For instance, arguments mentioning renewable energy are supported by arguments about reliability, but with declining relative support over the years. Further, climate takes the spot as the most important aspect in 2019 and 2021, answered with, again, climate, and with renewable energies. Interestingly, pro-nuclear energy arguments referring to renewables are less likely to be supported by climate-related replies. A possible explanation is that people supportive of nuclear energy shifted their framing to nuclear energy being necessary to combat climate change, yet avoided the expression of support for renewables. The pattern (costs, costs) steadily climbs up to the top ranks, indicating an increasingly important economic framing of the debate in addition to climate aspects.

Support Con \(\leftarrow \) Con. Chains of con–con arguments are seldom found compared to other combinations. In 2021, the only con–con pattern that has a support of more than 1% is (costs, costs). Similarly to pro–pro patterns, the common patterns tend to reinforce the same aspect.

Attack Con \(\leftarrow \) Pro. Reliability occurs more frequently in the top patterns in 2017, and only as the pro argument. In 2017, waste was part of the most supported patterns, which is the only time this happens for any combination of pro- and con-stanced tweets. Costs is the most frequent con part of the con–pro patterns, but over the years, it is countered with different pro arguments. While in 2017 renewables and reliability were used the most for addressing con arguments regarding costs, this shifted away slightly from reliability to pro arguments regarding costs.

Attack Pro \(\leftarrow \) Con. For arguments that are predicted with a pro stance, it can be seen that the overall most common aspect of con-predicted responses is costs. Responding with the same aspect is also a prevalent pattern for renewables, reliability, and climate. Costs seems to be an aspect that can be addressed regardless of the pro aspect that is put forth in favor of nuclear energy. Interestingly, in pro–con chains costs only appears in the pro part of the chain starting in 2019 and increases in support to being part of the number one pattern in 2021. This suggests that debates about whether or not nuclear energy is a cost-efficient form of energy production in modern societies intensified significantly.

Longer Patterns. As mentioned earlier, only a small number of patterns longer than two arguments were found in our dataset. Table 6 in the Appendix displays the top five patterns of length \(n=3\) for each year. The table contains exclusively chains of arguments in favor of nuclear energy that mostly reinforce the previously argued aspect. This again suggests that supporters of nuclear energy have a more engaged audience on Twitter compared to opponents of nuclear energy.

4.2 Pattern Mining Vs. Analyzing Distributions

When comparing the aspects of the most common patterns with their overall distribution, it is evident that the most frequent aspects also occur most often in the top patterns. The important distinction between the two analyses becomes visible in the differences: in 2017, safety and reliability occurred with nearly the same frequency. Safety was, however, not discussed in a pro–pro context. Nuclear waste appeared in the top five con–pro patterns only in 2017, although its proportion among con arguments rose steadily. Costs was the most popular aspect addressed by con arguments in 2019 and in 2021, surpassing safety. While they had similar proportions in 2019, safety is only prevalent in one con–con top pattern, while pro–con and con–pro chains were increasingly dominated by the discussion revolving around costs. This shows that our method can add a benefit to analyzing social media debates by leveraging the structured information of their conversation trees.

5 Conclusion

In this paper, we have introduced sequential pattern mining on abstract argument representations generated by recent argument mining methods. By mining patterns from a large corpus of German Twitter conversations on nuclear energy, we demonstrated its usefulness for analyzing structured online debates compared to simpler approaches that look at frequencies of isolated events in the data. To construct a transaction dataset for SPM from Twitter conversations, we suggest employing a set of argument mining approaches, in our case argument stance and aspect classification with fine-tuned language models. Combining structural and abstract semantic information in a set of all possible transactions, we found distinctive patterns of argumentation that were not evident from analyzing tweet information in isolation.

5.1 Limitations

While this first application of our method already shows well-interpretable initial results, more validation is indispensable. A first limitation is the validation of prediction results, which we have conducted, but have not evaluated in a structured manner. Since the method relies on the accuracy of the predictions on the dataset, poor classification will distort the results of the SPM. We have seen cases where the stance classifier was unable to accurately predict the stance of arguments in their relationship to nuclear energy and also struggled with sarcasm and jokes. However, we expect to receive mostly valid results from pattern mining given the large corpus size. Another problem might stem from the fact that PrefixSpan can find non-contiguous patterns. This might lead to patterns that do not actually indicate attack or support relationships, especially for cases with longer sequences. However, we are quite confident that these issues play a negligible role for our dataset, given the very large volumes of data that are analyzed using the method and the fact that a majority of the mined conversations contain not more than one reply.
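The difference between non-contiguous and contiguous pattern matches can be illustrated with two small checks. Both function names and the example chain are hypothetical and serve only to show the distinction; they are not part of the mining pipeline.

```python
from typing import Sequence, Tuple

Item = Tuple[str, str]  # (aspect, stance)


def occurs_with_gaps(pattern: Sequence[Item], chain: Sequence[Item]) -> bool:
    """True if `pattern` is a subsequence of `chain` (gaps are allowed)."""
    remaining = iter(chain)
    # `item in remaining` advances the iterator until the item is found,
    # so later pattern items are only searched after earlier matches.
    return all(item in remaining for item in pattern)


def occurs_contiguously(pattern: Sequence[Item], chain: Sequence[Item]) -> bool:
    """True if `pattern` matches directly adjacent tweets in `chain`."""
    n = len(pattern)
    return any(
        list(chain[i:i + n]) == list(pattern)
        for i in range(len(chain) - n + 1)
    )


# Made-up chain of three tweets: the con tweet is answered by a climate
# argument, and only that reply is answered by a costs argument.
chain = [("costs", "con"), ("climate", "pro"), ("costs", "pro")]
pattern = [("costs", "con"), ("costs", "pro")]
gap_match = occurs_with_gaps(pattern, chain)
direct_match = occurs_contiguously(pattern, chain)
```

Here the pattern is found when gaps are allowed but not as a direct reply pair, which is the kind of match that may not reflect an actual attack relation between the two tweets.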

5.2 Future Work

Future work will concentrate on the interpretation and validation of the mined patterns. Since there are many patterns with lower support, a careful analysis of all attack and support patterns could reveal more insights into the debate. A thorough qualitative analysis is therefore the next step for establishing the method and testing its potential for the computational social science community. This could also be used to verify classification quality and detect potential issues with classification results. While we assume that training a classifier with labeled data is still preferable to using commercial large language models (LLMs) such as ChatGPT for highly specific classification tasks, using such LLMs may increase the uptake of the method among CSS scholars, as extensive labeling and fine-tuning are not necessary. Further research into alternative, potentially better-suited sequence mining algorithms should be conducted, too. Analyzing patterns from constraint-based SPM approaches that only allow contiguous patterns is an interesting next step for attack and support pattern mining, as is quantifying the chance-corrected statistical significance of the patterns found. For representing and further analyzing attacking and supporting arguments in formalisms that deal explicitly with arguments and argumentation, so-called Abstract Dialectical Frameworks (ADFs) [5] seem to be suitable as they provide sufficient expressive power. Such an approach was suggested in our FAME project [4] and will be one future line of research.