Abstract
This chapter focuses on topic modelling, i.e. the automatic extraction of topics or themes from a corpus. Topic modelling goes a step further than keywords in the automatic identification of the contents of a corpus. Two types of approaches are considered, discussed, and contrasted: on the one hand, those that I dub “traditional”, as illustrated by the LDA and NMF algorithms, and, on the other, embeddings-based approaches, which largely surpass the former in the quality of their results. The weakest aspect of topic modelling tools in general is the lack of actual labels for the extracted topics, since all they return is a set of loosely related keywords that collectively identify the topic. In the last experiment I describe an approach that uses the power of Large Language Models to effectively derive high-quality labels for the extracted topics.
Keywords
- Topic modelling
- Topics
- Themes
- LDA
- NMF
- Word embeddings
- Topic visualization
- Topic labelling
- Large language models
When discussing the concept of keywords in the previous chapter we saw how some authors, such as Nomoto (2023), regard them as pointers to the “topics” of a text, and we have shown how the extracted lists of keywords do contain many words that could be taken as “labels” for the topics that make up the contents of a corpus. Therefore, it seems pertinent to start this section by discussing what, precisely, is meant by “topics”, and how they are different from keywords.
If we take a bird’s-eye view of any of the keyword lists that we have seen in the previous chapter, it would look as if the sum of all of them somehow embody the topics contained in the corpus. Sometimes there may be one or two keywords or keyphrases that seem to best encapsulate those topics, i.e. they are good labels for the topics, but it is the aggregated set of keywords that provides a more accurate representation of those topics. For example, in Table 4.8, the keyphrases “covid vaccine”, “covid jab”, “covid passport”, and “covid pass” are obviously related and point to the theme of “vaccines and their legal certification”, but none of them could be said to be a perfect label for the topic. Thus, topics have a broader scope than keywords, as they encompass sets of words and phrases which, together, make up a topic. This is in accord with Watson Todd’s (2011) definition: “a clustering of concepts which are associated or related from the perspective of the interlocutors in such a way as to create relevance and coherence” (p. 252), as these clusterings of concepts are inevitably embodied by notionally relevant words (i.e. keywords).
I will not delve any further into a theoretical definition of the concept of topic, as a “standard” dictionary definition—“a matter dealt with in a text, discourse, or conversation: a subject”Footnote 1—suffices to define what is generally understood by this term in the computational treatment of topics, i.e. topic modelling.
Topic modelling can be defined as a number of methods that aim to identify a set of semantically related words, which together form a topic, from a group of documents. These words are assumed to capture the main themes in those documents. It can also be seen as a type of text mining technique used to identify word patterns that occur frequently in written texts, as well as an effective technique to find useful hidden structures in a collection of documents (Zhu et al. 2016).
In the context of topic modelling, topics are sometimes referred to as “latent semantic structures”, due to the original proposal by Dumais et al. (1988) to retrieve semantically relevant information from text collections based on user input, in order to overcome the limitations of document search based on simple string or lexical matching. Furnas et al. (1987) showed that the same keyword was likely to be used repeatedly to refer to the same topic only 20 per cent of the time, and therefore text-based searches are bound to retrieve only a fraction of the relevant documents in a collection—those that actually include the search words. Latent Semantic Analysis (LSA) attempts to overcome this issue by creating semantic spaces from the documents themselves by representing them as term-document matrices of vectors. The adjective “latent” is used because the authors assume that “(…) there is some underlying ‘latent’ semantic structure in word usage data that is partially obscured by the variability of word choice. We use statistical techniques to estimate this latent structure and get rid of the obscuring ‘noise’” (Dumais et al. 1988, 288).
LSA induces the semantics of documents through singular value decomposition to effectively reduce the high number of dimensions of the original texts, a technique that was improved on by probabilistic generative models, specifically Latent Dirichlet Allocation (LDA), matrix factorization techniques, such as Non-negative Matrix Factorization (NMF), and Probabilistic Latent Semantic Analysis (PLSA). LDA, in fact, became the de facto standard method of topic modelling, but, as with many other tasks in NLP, the advent of word embeddings and large language models has brought about newer, more powerful techniques.
In the following two sections I explore the possibilities of some of these methods to identify the “latent” topics or themes in large social media corpora. Although many topic extraction methods and variations on these methods have been proposed over the years, a broad distinction can be made between “traditional” methods, where LDA and NMF stand out, and the latest generation of systems that leverage the semantic power of word embeddings and large language models, such as TopEx, Top2Vec, and BERTopic. Section 5.1 deals with the former, while 5.2 discusses and tests the latter. Finally, dynamic topic modelling is discussed and applied.
The three experiments described in the following sections aim to extract topics using different techniques from tweets generated in the top three countries of the CCTC by volume—the United States, the United Kingdom, and India.Footnote 2 The same dataset was used for all three experiments: three subcorpora from the geotagged section of the CCTC. An attempt was made to have a large number of tweets (over 600,000 for each country) and have approximately the same size for each of the countries. Since different countries have different numbers of tweets, the percentage of tweets to be included in the corpus was adapted to use comparable corpus sizes. The dataset is quantitatively described in Table 5.1, where the time taken to process each of the datasets is also included.Footnote 3
5.1 “Traditional” Topic Modelling Methods
Latent Dirichlet Allocation (LDA) was first proposed by Blei et al. (2003), who kept the adjective “latent” despite the fact that they used a very different technique from that in Latent Semantic Analysis. Unlike LSA, LDA was originally intended not as a document retrieval system, but as a topic extraction (or modelling) tool.
LDA assumes that each document contains several topics and each topic is a distribution over words in the corpus. However, the number of topics must be decided before applying the algorithm to the corpus, and supplied as one of its parameters. This means that we have to guess how many relevant topics there are in a corpus, which is obviously not optimal, as the algorithm will build a model with that number of topics regardless of what it finds in the corpus, and then fit all documents in the corpus into one of those topics. Furthermore, this fundamental assumption that a document contains several topics is probably not true of many of the documents in our corpus. In fact, a tweet is conceived as a short message about a single topic, and therefore it is a type of document that challenges the basic assumptions of LDA, as tweets do not provide enough context to effectively discern topics. In this regard, scientific abstracts, for example, are the perfect type of documents for LDA and NMF, since they are a good summary of the topics that are dealt with in the text, and usually include all relevant keywords and references to those topics. In fact, abstracts have been recurrently used in the literature to show the effectiveness of these methods, e.g. Anupriya and Karpagavalli (2015), Ikegawa (2022), Cao et al. (2023).
Another important consideration is that, since these methods are based on word co-occurrence across the documents in the corpus, they are very sensitive to the precise modifications made during the pre-processing stage. For example, if stop-word removal is not performed, it is very likely that many such words (articles, prepositions, etc.) will be present in the list of keywords that define a topic. Similarly, it is important to group word forms either as stems or lemmas, as otherwise they will be taken as entirely different words (since the algorithms do not consider the semantics of words and phrases). Thus, extensive pre-processing is required to improve results, and this is especially true of social media text, as these texts usually contain a plethora of “noisy” elements: user mentions, hashtags, URLs, misspellings, typographical decorations, etc. (see Sect. 3.2.1).
The final important limitation is that, as mentioned in the previous section, topic modelling tools do not provide semantic labels that can be used to refer to topics. The output is a numbered list of topics identified by a set of keywords (which may or may not be semantically related). The interpretation of what that particular topic semantically refers to and the actual labelling of the topic is left to the user. Thus, evaluating the performance is, as was the case with keywords, rather subjective (Shi et al. 2019). Human interpretation of topics remains an important factor, which is why there have been efforts to implement “human-in-the-loop” topic modelling systems (Smith et al. 2018).
Finally, the discussion of LDA and NMF in this section has so far highlighted the similarities—under the umbrella of “traditional” methods—but these two methods of topic modelling have important differences. LDA is a probabilistic model that follows a generative process: first, a distribution over words is chosen for each topic, then a distribution over topics is chosen for each document; then, for each word in the document a topic is chosen from the document’s distribution over topics and a word is chosen from the topic’s distribution over words. This has the form of a term-document matrix (the “corpus”, in LDA terms). In order to compute the distribution of topics over words, LDA uses Bayesian statistics (specifically, Dirichlet priors, hence the name).
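The generative process just described can be illustrated with a minimal, purely didactic sketch (the vocabulary, topic-word distributions, and document length are invented for illustration; real LDA draws both distributions from Dirichlet priors and then infers them from the data):

```python
import random

random.seed(0)
vocab = ["covid", "vaccine", "lockdown", "school", "mask"]
n_topics, doc_len = 2, 6

# Each topic is a distribution over words (fixed here for illustration;
# LDA would draw these from a Dirichlet prior).
topic_word = [
    [0.4, 0.4, 0.1, 0.05, 0.05],   # topic 0: vaccination
    [0.05, 0.05, 0.4, 0.4, 0.1],   # topic 1: lockdowns and schools
]

def generate_document():
    # Each document gets its own distribution over topics (a Dirichlet
    # draw in real LDA; a simple random split here).
    p = random.random()
    doc_topics = [p, 1 - p]
    words = []
    for _ in range(doc_len):
        # 1. choose a topic from the document's distribution over topics
        topic = random.choices(range(n_topics), weights=doc_topics)[0]
        # 2. choose a word from that topic's distribution over words
        words.append(random.choices(vocab, weights=topic_word[topic])[0])
    return words

doc = generate_document()
print(doc)
```

Inference then runs this story in reverse: given only the documents, LDA estimates the hidden topic-word and document-topic distributions.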
Non-negative Matrix Factorization (NMF) (Lee and Seung 1999) was originally introduced as a method for parts-based representation of data, especially in the context of image processing. It also uses a term-document matrix (the “corpus”), but it decomposes it into two lower-dimensional non-negative matrices and, unlike LDA, which assumes a probabilistic mixture, NMF represents documents and terms as linear combinations of topics and vice versa.
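The factorization itself can be illustrated with scikit-learn's NMF on a toy term-document matrix (the counts below are invented; this is a sketch of the technique, not the experiment's actual code):

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy term-document matrix: 4 documents x 5 terms (word counts).
V = np.array([
    [3, 2, 0, 0, 1],   # document about vaccines
    [0, 0, 4, 3, 0],   # document about lockdowns
    [2, 3, 0, 1, 0],
    [0, 1, 3, 4, 1],
], dtype=float)

# Decompose V ~ W @ H with k=2 topics: W is documents x topics,
# H is topics x terms; both matrices are constrained to be non-negative.
model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(V)   # document-topic weights
H = model.components_        # topic-term weights

print(W.shape, H.shape)
```

The top-weighted terms in each row of `H` are the keywords that describe that topic, while each row of `W` gives a document's association with the topics.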
5.1.1 Experiment: LDA vs NMF for Topic Modelling
A single script was used to extract topics using these two methods, as both require the same pre-processing steps, and the only difference lies in the model generation. As mentioned in the previous section, heavy pre-processing of tweets is necessary, specifically:
-
URL and user mention removal: this was not necessary as both these elements were removed during the corpus extraction process.
-
Stop-word removal: spaCy was used for this task. A set of 25 custom stopwords was added (“2020”, “2021”, “news”, “people”, “day”, “week”, “thing”, “think”, “tell”, “read”, etc.).
-
All text was turned into lower case.
-
Tokenization and lemmatization were also carried out using spaCy, whose word tokenizer keeps hyphenated words together (e.g. COVID-19), unlike most others.
-
Hashtags were kept, but the hash symbol was removed.
-
Accented foreign characters (e.g. “café”) were normalized.
-
Word filtering: only words with a minimum length of three characters were kept. Words starting with a number were removed.
-
Document filtering: only tweets with a minimum length of two words (after pre-processing) were kept.
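Most of the steps listed above can be approximated with a stdlib-only sketch (the stop-word list is a tiny invented subset, and lemmatization, done with spaCy in the actual script, is omitted here):

```python
import re
import unicodedata

# Tiny illustrative stop-word list, not the actual set used in the script.
STOPWORDS = {"the", "a", "and", "to", "of", "is", "news", "people", "day"}

def preprocess(tweet: str) -> list[str]:
    text = tweet.lower()                          # lower-casing
    text = text.replace("#", "")                  # keep hashtags, drop the hash symbol
    # normalize accented characters (e.g. "café" -> "cafe")
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    # rough tokenization that keeps hyphenated words together
    tokens = re.findall(r"[a-z0-9][\w-]*", text)
    return [t for t in tokens
            if t not in STOPWORDS
            and len(t) >= 3                       # minimum word length of three
            and not t[0].isdigit()]               # drop words starting with a number

tokens = preprocess("The #COVID-19 café news: 2021 is over, people!")
print(tokens)
```

The document filter (discarding tweets with fewer than two remaining tokens) would then be applied on the output of this function.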
To illustrate the result of this pre-processing step, Table 5.2 contains some examples of the original tweet and the list of tokens after applying the pre-processing function. The dictionary used by the models is based on those lists of tokens.
Input files are plain text files where each line is a document (tweet). Although the input this time is a set of 24 (monthly) files, all tweets are read into one single dataset and processed globally. Thus, for each country set of files, the script generates the following:
-
A CSV file of LDA topics where each topic is a column containing the topic ID and its associated keywords.
-
The equivalent CSV file for NMF.
-
A CSV file where each line contains the original tweet, the dominant LDA topic, and the topic’s keywords.
-
The equivalent CSV file for NMF.
-
An HTML file with an interactive topic visualization for LDA topics.
In the following discussion of results I will use a small, random sample of the generated data. All the datasets are available in the book’s repository.Footnote 4
One key aspect that needs to be addressed is that these methods require that a number of topics be specified prior to extracting the topics. Some methods have been proposed to provide an estimate based on the actual data in the corpus, among which the coherence score stands out. First introduced by Newman et al. (2010), it is based on word co-occurrence statistics (based on point-wise mutual information) and, as the name suggests, it attempts to evaluate the internal coherence of topics. Thus, running the coherence test on several ranges of topic numbers can provide an idea of the optimal number of topics to be extracted from a corpus. However, using this method does not guarantee semantic coherence, just statistical coherence (based on co-occurrence).
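The intuition behind PMI-based coherence can be sketched as follows, using a toy corpus and a simplified, smoothed average of pairwise PMI scores (this is illustrative only, not the implementation used in the experiment):

```python
from itertools import combinations
from math import log

# Toy corpus: each document is reduced to its set of content words.
docs = [{"vaccine", "dose", "jab"}, {"vaccine", "jab"}, {"lockdown", "school"},
        {"school", "lockdown", "vaccine"}, {"dose", "jab"}]
N = len(docs)

def pmi_coherence(topic_words):
    """Average pairwise PMI of a topic's top words, estimated from
    document co-occurrence counts (with +1 smoothing on joint counts)."""
    score, pairs = 0.0, 0
    for w1, w2 in combinations(topic_words, 2):
        p1 = sum(w1 in d for d in docs) / N
        p2 = sum(w2 in d for d in docs) / N
        p12 = (sum(w1 in d and w2 in d for d in docs) + 1) / N
        score += log(p12 / (p1 * p2))
        pairs += 1
    return score / pairs

# A topic whose words actually co-occur scores higher than a mixed one.
print(pmi_coherence(["vaccine", "jab", "dose"]))
print(pmi_coherence(["vaccine", "lockdown", "dose"]))
```

Running such a score over models built with different topic numbers, and keeping the number that maximizes average coherence, is the usual selection procedure.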
The script that extracts topics in this experiment has the functionality to optionally calculate coherence for any given corpus. The results obtained pointed to ranges between 45 and 65 topics per monthly set. Eventually, a decision was made to extract 30 topics per month, as using the coherence-suggested number of topics returned a large number of “small” topics—i.e. topics with very few documents assigned to them. Furthermore, calculating the number of topics using coherence adds considerable overhead in terms of computing time—one extra hour per country set on top of the approximately 43 minutes it takes to pre-process the corpus and run the LDA algorithm (see Table 5.1).
The PyLDAvis visualizations may be helpful to decide whether the identified topics overlap: the less overlap between topics, the more distinguishable they should be. In these visualizations, each circle represents a topic and the size of the circle is proportional to the number of tweets where that topic is found. The distance between the circles is also meaningful: the farther apart they are, the less they have in common. Thus, a good topic model will generate big circles with little overlap between them.
Figure 5.1 shows the visualization for the U.S. topics.Footnote 5 This graph tells us that topics are well defined, as there is almost no overlap between them and they are scattered more or less evenly over the plot area. Except for the three smaller topics (28–30), the only overlap is between topics 4 and 7. If we look at Table 5.3, which contains the list of topics and their assigned keywords, we can in fact see how these two topics are related (LDA Topic #4: ‘new’, ‘death’, ‘sick’, ‘report’, ‘follow’; LDA Topic #7: ‘cdc’, ‘care’, ‘point’, ‘doctor’, ‘contagious’).
It is important to understand that, since these methods of topic modelling assume that each document is a mix of topics, no topic-document assignment is produced by default, although this can be done indirectly by using the generated model to classify the tweets. Examples (23) and (24) are cases where the dominant topic is #4, whereas (25) and (26) are examples of Topic #7.
-
23.
8 days after returning from Wuhan. 49 M. DM. Fever, cough, fatigue. Minimal sputum. PE: respiratory distress, hypoxemia. CXR: diffuse infiltrates. WBC 4.2 ALC 450. AST 60. In addition to standard precautions, what do you recommend? #MayoIDQ
-
24.
Devastating outbreak in China—Authorities say the newly identified virus originating in central China is spreading between people primarily through coughing, kissing or contact with saliva.
-
25.
The CDC estimated that last year flu season killed as many as 56,000 people in the United States.
-
26.
Y’all better not have New Years parties or I’m calling the cdc on y’all.
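The indirect topic-document assignment mentioned above amounts to picking the highest-probability topic from a document's inferred topic distribution; a minimal sketch (assuming a gensim-style list of `(topic_id, probability)` pairs, an illustrative interface):

```python
def dominant_topic(doc_topics):
    """Return the topic with the highest probability for a document.
    `doc_topics` is a list of (topic_id, probability) pairs, such as
    gensim's `lda_model[bow]` would return (assumed interface)."""
    return max(doc_topics, key=lambda t: t[1])[0]

# Toy distribution: topic 4 dominates this document.
print(dominant_topic([(4, 0.61), (7, 0.25), (12, 0.14)]))  # -> 4
```

This is how the per-tweet CSV files described earlier pair each tweet with a single "dominant" topic despite the mixed-topic assumption.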
The differences between countries are accounted for by both models even in the top 15 topics shown in the tables. For example, in Table 5.3 (U.S. topics) there are references to the Centers for Disease Control and Prevention (CDC) (LDA #7) and presidents Joe Biden and Donald Trump (NMF #14). Likewise, Table 5.4 (United Kingdom) contains references to London (NMF #2 and #7), England (LDA #6), and the BBC (NMF #13). Table 5.5 (India) includes references to Prime Minister Narendra Modi (NMF #7, LDA #14), spiritual leader Sant Shri Asharamji Bapu (NMF #14), and locations like India, Delhi, Mumbai, and Maharashtra.
The NMF method does not provide a straightforward way to produce data visualizations, but just by looking at the defining keywords for each topic, it is quite apparent that it produces better results, as these keywords are semantically more related. Another important difference is that this NMF implementation uses n-grams as keywords, which further improves the output. This is apparent in the results for all countries, as it is invariably easier to come up with a title/label for NMF topics than for LDA ones, that is, they are more readily interpretable.
The main problem with both these methods of topic modelling is that it is hard to interpret the results. The list of words associated with each topic can only provide a vague idea of what they are about. In order to produce labels that provide a good description of the topics we need to manually examine the topics together with the documents that they are assigned to. That is, we need to establish a relationship between the set of words for each topic and the contents of the documents, which is a time-consuming task. This kind of task, however, can be performed quickly and accurately by Large Language Models. This is demonstrated in the following section.
5.2 Embeddings-Based Topic Modelling
The last decade has witnessed the advent of word embeddings, which have quickly revolutionized the “traditional” methods used in NLP. Word embeddings were first proposed by the ground-breaking word2vec algorithm (Mikolov et al. 2013). Unlike previous ways to represent words in a document mathematically, such as one-hot encoding, which represents words in isolation, embeddings attempt to capture the semantic relationships between words by looking at the contexts in which they appear, effectively putting into practice the Firthian maxim “You shall know a word by the company it keeps” (Firth 1957, 11).Footnote 6 The mathematical constructs that represent words (vectors) use context to locate them in a vector space where distances between them can be precisely measured, effectively creating a “semantic space” such that words with similar meanings are positioned close to each other in that space.
Embeddings-based topic modelling techniques capitalize on these semantic spaces to generate more coherent and semantically dense topics. Unlike the conventional methods discussed in the previous section, which rely solely on word co-occurrence statistics, these approaches take into account the semantic similarity between words—usually measured using the cosine similarity in the vector space—resulting in topics that are more interpretable and contextually relevant.
The ability to capture language nuance and subtlety is one of the most important benefits of using word embeddings in topic modelling. In an embeddings-based model, for instance, synonyms, which in conventional models may be treated as distinct terms, can be identified as semantically similar and categorized under the same topic. Therefore, these models can infer semantic relationships between words even if they do not frequently co-occur in the corpus, making them more resistant to issues such as data sparsity.
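The cosine similarity mentioned above can be illustrated with a toy example (the three-dimensional "embeddings" below are invented for clarity; real embeddings have hundreds of dimensions):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Invented toy vectors: "jab" and "vaccine" are near-synonyms that
# rarely co-occur in the same tweet, yet land close in the space;
# "lockdown" is semantically distant from both.
emb = {
    "vaccine":  [0.9, 0.1, 0.2],
    "jab":      [0.8, 0.2, 0.25],
    "lockdown": [0.1, 0.9, 0.3],
}

print(cosine(emb["vaccine"], emb["jab"]))       # high similarity
print(cosine(emb["vaccine"], emb["lockdown"]))  # low similarity
```

This is precisely why embeddings-based models can group synonyms under one topic even when co-occurrence counts alone would keep them apart.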
Researchers have developed several tools that employ word embeddings as the base for topic modelling, of which three stand out: TopEx, Top2Vec, and BERTopic. Not only do these methods use word embeddings, but also incorporate advanced neural architectures and techniques to further refine topic extraction.
TopEx (Olex et al. 2022) is a user-friendly online software application that enables non-technical researchers to access Natural Language Processing techniques with ease. It permits users to upload data in multiple supported formats, including CSV and MS Excel files, modify parameters, and cluster, visualize, and export results. The tool can be used with any type of text, but was specifically designed to extract and visualize medical-related topics. In particular, the authors refer to PubMed abstracts, grant summaries, publications, interview transcripts, and survey or blog responses. The publication describing the tool (Olex et al. 2022) provides an example use case that employs TopEx to investigate the evolution of topics in a subset of COVID-related tweets over the year 2020. Regarding input size, the tool imposes certain restrictions, and the system may freeze or crash while uploading large datasets. Currently, it is recommended that users with large datasets utilize the TopEx Python library,Footnote 7 as the public server has limited space to construct the necessary matrices for analysis. For the web version of TopEx,Footnote 8 it is advised to limit the analysis to fewer than 2,000 documents with an average paragraph length of four sentences.
Top2Vec (Angelov 2020) and BERTopic (Grootendorst 2022) are more advanced in many ways, both being more flexible and modular, although neither offers a graphical user interface, and therefore they require some knowledge of Python. BERTopic has several advantages over Top2Vec, such as custom labels—a crucial aspect, as we saw in the previous section—and data visualization. As the name suggests, BERTopic is based on the BERT (Bidirectional Encoder Representations from Transformers) model (Devlin et al. 2019), and it represents a significant departure from traditional topic modelling techniques because it is capable of generating highly coherent and interpretable topics, even in the presence of noisy and heterogeneous data. In a nutshell, BERTopic works by clustering semantically related documents using a BERT-like pre-trained model, the assumption being that the sum of all those documents represents a topic, and then extracting keywords that represent the topic from those documents.
Transformers-based models have the ability to produce natural language text representations of the highest quality, which can be applied to various NLP tasks. BERT has been widely used in a variety of NLP projects and applications since its release in 2018. For instance, BERT-based models attained cutting-edge performance in sentiment classification and aspect-based sentiment analysis (Sun et al. 2019). BERT-based models have also achieved state-of-the-art performance in the Stanford Question Answering Dataset (SQuAD) question answering challenge (Gupta and Hulburd 2019). Other successful applications include Named Entity Recognition (Devlin et al. 2018), chatbots (Zhou et al. 2018), and, of course, machine translation (Wang et al. 2018).
BERTopic is not limited to using BERT to create the embeddings of the corpus to be analysed. In fact, it can use any transformers-based model, such as the successful Sentence Transformers (SBERT) (Reimers and Gurevych 2019), which will be used in the following experiment. Although Sentence Transformers is closely related to BERT, the main difference is that whereas BERT produces embeddings of words, SBERT generates embeddings for entire sentences or paragraphs.
The three main stages of the BERTopic algorithm are document embedding, dimensionality reduction, and clustering. In the first step, the semantic content of the documents is captured using word embeddings; the resulting document embeddings make it possible to distinguish between documents that are similar in content. In the second step, the dimensionality of the document embeddings is reduced using a dimensionality reduction algorithm. By default the non-linear algorithm known as UMAP (Uniform Manifold Approximation and Projection) is employed, though others may be used. UMAP reduces the dimensionality of the data while maintaining its overall structure, which is crucial for clustering algorithms. In the last step the HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) method is used to cluster the reduced-dimensional document embeddings. Again, HDBSCAN can be swapped for alternative clustering methods.
This final step (clustering) is critical, because unlike LDA or NMF, BERTopic does not assume any number of topics to be present in the corpus. This is positive because users do not need to guess how many topics there are in advance, but the parameters used for the clustering algorithm (specifically, min_cluster_size and min_samples) largely determine how many topics will be extracted, and there is no easy way to approximate these parameters except by trial and error.
5.2.1 Experiment: Extracting COVID-19 Topics Using BERTopic
In this experiment, the same corpus as in the previous section is used (see Table 5.1) to compare relevant topics in the top three countries by volume of English tweets (the United States, the United Kingdom, and India). Unlike LDA and NMF, no manual pre-processing whatsoever is performed on the corpus. For tokenization, the scriptFootnote 9 uses the tokenizer in the HuggingFace Transformers library (Wolf et al. 2020). As mentioned before, the embeddings are created using Sentence Transformers from the FlagEmbedding (Xiao et al. 2023) “bge-small-en-v1.5”Footnote 10 base model, a state-of-the-art model that can map any text to a low-dimensional dense vector for use in many NLP applications.
As can be seen in Table 5.1, running BERTopic on each of the three country subcorpora (approximately 15 million tokens each) took about half an hour, but it must be noted that this processing time will increase dramatically if a GPU is not available. If this is the case, it is advisable to consider using a smaller corpus sample.
The key parameters to select are the abovementioned min_cluster_size and min_samples. The former determines the minimum number of documents that can form a topic (the higher the number, the fewer topics returned); the latter determines the minimum number of neighbours to a core point in the cluster (the higher the number, the more documents will be discarded as outliers). After some experimentation, these two values were set at min_cluster_size = 600 (i.e. approximately 0.1% of the total tweets in each subcorpus) and min_samples = 15. This returned a reasonable number of topics (below 100) given the size of the corpus and the variety of topics, while avoiding repeated or very similar topics and still providing good coverage of the thematic range.
Table 5.6 summarizes the results quantitatively. A total of 93 topics were obtained for the U.S.A. subcorpus, 83 for the U.K., and 97 for India. In all cases a large proportion of tweets were not included in any of the topics, i.e. they were considered outliers. This might be considered a problem, but the alternative is having a very large number of very small topics, i.e. topics with a reduced number of tweets.
BERTopic numbers topics sequentially starting with Topic #0 and then adds a “-1” set which groups unclustered documents, i.e. documents that were not assigned to any of the topics. In fact, even with this high number of unclustered documents, the top ten per cent of topics accounted for almost half of all the tweets in every case (47.58% on average). Figure 5.2 plots the number of documents on the y axis and the topic ID they were assigned to on the x axis for the U.S.A. subcorpus (unlike LDA or NMF, which assume that one document is a mix of topics, BERTopic always assigns each document to a single topic). Very similar plots were obtained for the other two subcorpora.
The great modularity that BERTopic offers includes customization of the representation model, that is, the way in which topics are summarized and labelled. This is extremely useful because it allows us to plug in any method that can summarize collections of documents, including keyword extraction methods, such as KeyBERT (see Sect. 4.4). More importantly, we can use a Large Language Model to generate high-quality topic titles (labels). In 2023 Meta made their Llama 2 family of LLMs available for commercial and research applications. Based on the transformers architecture, Llama 2 (Touvron et al. 2023) is extremely powerful and versatile, and can be queried programmatically through an API. The script used in this experiment uses prompt engineering to elicit custom descriptions of topics using Meta’s Llama 2 13-billion parameter LLM. Prompts are generated in natural language by inserting the set of keywords originally returned by BERTopic as the descriptor of each topic and the set of documents clustered under that topic. The actual prompt used by the script is the following:
[INST]
I have a topic that contains the following documents: [DOCUMENTS].
The topic is described by the following keywords: [KEYWORDS].
Based on the information about the topic above, please create a short label for this topic.
[/INST]
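Filling this template for each topic can be sketched as follows (the helper function, its parameters, and the example documents are illustrative, not the actual script):

```python
def build_prompt(documents, keywords, n_docs=4):
    """Fill the [INST] prompt template with a topic's representative
    documents and its BERTopic keywords (names here are illustrative)."""
    doc_block = "\n".join(f"- {d}" for d in documents[:n_docs])
    return (
        "[INST]\n"
        f"I have a topic that contains the following documents:\n{doc_block}\n"
        f"The topic is described by the following keywords: {', '.join(keywords)}.\n"
        "Based on the information about the topic above, "
        "please create a short label for this topic.\n"
        "[/INST]"
    )

prompt = build_prompt(
    ["Booked my second jab for Friday!", "Vaccine passports announced for travel."],
    ["vaccine", "jab", "passport", "dose"],
)
print(prompt)
```

One such prompt is sent to the model per topic, and the generated completion is taken as the topic's label.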
The descriptions of topics obtained using this method sound natural, and are concise and descriptive, proving how advanced Llama 2 is, to the point that it does not seem worthwhile to evaluate them against human-generated labels, because humans would probably do a worse job trying to summarize hundreds or thousands of tweets under a single label. Table 5.7 lists all the topics obtained for each of the subcorpora.
The full list of topics for each of the countries is comprehensive and seems to include most if not all of the themes that have been mentioned in previous studies on the discourse of the pandemic on social media (see Sect. 1.2), including medical issues (virus variants, PPE, vaccines, treatments etc.), education, safety, politics, economics and finance, religion, lifestyle, humour, sports, emotional response, and more.
The topic labels in square brackets, e.g. [DEROGATORY LANGUAGE], are a short version of the actual responses returned by Llama 2. Examples of these are given in (27), in reference to the “Chinese virus” topic, and (28), where Llama 2 exhibits the same kind of behaviour as other popular LLMs, such as GPT, when facing socially controversial topics or profanity.
-
27.
“I apologize, but I cannot provide a label for this topic as it is not appropriate to promote or perpetuate negative stereotypes or misinformation about any group of people, including those of Chinese descent. The content of the documents provided contains harmful language and assumptions, and it is important to address these issues with respect and accuracy.\n\nIf you have any other questions or topics, I would be happy to assist you in a responsible and respectful manner”.
-
28.
“I apologize, but as a responsible AI language model, I cannot provide a label for a topic that contains derogatory language or personal attacks. It is important to treat others with respect and dignity, even if we disagree with them or are frustrated with their actions. Instead, I suggest rephrasing the topic in a more constructive and respectful manner. Please let me know if there is anything else I can assist you with”.
To give an idea of what these masked topics are about, examples (29), (30), and (31) are the three representative tweets returned by BERTopic for Topic #16 in the U.K. subcorpus:
29. “What’s wrong with people absolutely disgusting”

30. “So was he lying about this the other day….? Course he fucking was”.

31. “Because they’re fucking stupid, that’s why”
However, Llama 2 may be applying these measures too strictly. For example, U.S. Topic #6, which received the (non-)label in example (27), contains most of the alarming tweets from the beginning of the pandemic regarding the outbreak in Wuhan. After manual examination, only a small proportion actually contain content that could be said to promote negative stereotypes or include “harmful language”, such as the following:
32. Fear and Loathing in Xian, China

33. Cut the ties with China President Trump!!!!!

34. Bruh called it the “Chinese Virus”

35. Blocking you sleazebag Krupalie. Move to China you Dimwit!
Many are simply describing the outbreak or calling out Trump’s infamous racist slur:
36. What’s your take on how the Xi’an outbreak is being handled? frankly, I’m impressed and jealous

37. Chinese Virus? What are you talking about?! Even in the middle of a Pandemic, you’re still embarrassing the US to the World!

38. Please tell your father to stop calling it the Chinese virus. Racist implications. Thank you.
For all three countries, the list of topics covers all the general aspects mentioned above that affected the population during the pandemic, and also a number of country-specific themes that somehow identify and describe the different societies, such as the following:
- United States: 2020 Presidential Election, Covid Relief Bill, CDC guidelines, funny comments about Florida, Anthony Fauci, Andrew Cuomo, New York nursing homes, Border Crisis, Donald Trump’s handling of the pandemic, racial disparities and protests, political opinions about America, death penalty controversy.
- United Kingdom: Brexit, NHS work, political discussion in Scotland, NHS bed availability and funding, lockdown in the UK, NHS app, Matt Hancock, Boris Johnson, UK Protests and police brutality, UK tier system.
- India: Outbreak in Maharashtra, IPL 2021 controversy, Sant Shri Asharamji Bapu, Ayurvedic treatments, Indian Railways Service delays, 2023 Bengal elections, President Narendra Modi, Former PM Manmohan Singh, Death of Milkha Singh, Kerala Model, National Pride of India, Assam Flood, water scarcity and sanitation issues, GST and income tax return extensions, Delhi’s COVID-19 management, Chandigarh, Diwali, Eid al-Adha, electricity bill error, black fungus outbreak, farmer protests, Telangana government, Odisha, Tamil Nadu, Saint Kabir.
The list of topics also provides some surprising results, such as the fact that in the U.S. the most recurrent topic, by far (see Fig. 5.2), is the impact of the COVID-19 measures on the National Football League. In the U.K., a sports-related topic also ranks very high (in third position): “Football game cancellations due to COVID-19”. It is also surprising that humour-related topics rank so high in the U.K. list (Topics #1 and #29), whereas in the U.S. we only find two such topics towards the end of the list, and none at all in the case of India; this probably reflects each country’s attitude towards difficult situations. In the case of the U.K., other topics are also humour-related, such as #66 and #62, which have the following representative examples:
39. Spread legs not Covid

40. Open The Pubs

41. You really couldn’t make it up could you
Overall, the topics do seem to reflect each country’s idiosyncrasies. Another example of this is the fact that the U.S. list contains five topics related to wearing masks, while only one is found in each of the U.K. and India lists, which probably reflects how controversial this issue became in America. Similarly, four topics in India’s list are related to religion (#14, #46, #79, #89), but only two are found in the U.S. list (#34, #40) and two in the U.K. list (#41, #71).
BERTopic also offers a number of data visualizations that can help us analyse the results. Figure 5.3, where each coloured dot represents a tweet, is a “map” of topics and their assigned documents.
This graph helps us assess how cohesive the identified topics are, as well as the relationships between topics. Each colour represents a topic, and the interpretation is that more cohesive topics should appear more compact than less cohesive ones. This visualization is in fact a Plotly object contained in an HTML page,Footnote 11 and therefore offers certain useful interactive features: it is possible to zoom in and out, pan, select an area, or display any subset of topics by clicking on the legend’s titles; hovering over a dot displays the actual text of that particular document. For example, in Fig. 5.4, only two UK education-related topics are displayed (“School lockdown policies” in orange and “Safe learning environments during the pandemic” in green); the former appears to be more compact, as most of the orange dots are in one contiguous area, whereas the green dots are more scattered. Also, the close semantic relationship between the two topics is apparent in that they are adjacent. When interpreting this map it is worth remembering that this two-dimensional visualization has been generated using a dimensionality reduction algorithm that “flattens” the original n-dimensional vector space provided by the embeddings model (in this particular case, 384 dimensions).
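The “flattening” of 384-dimensional embeddings into a two-dimensional map can be illustrated with a simple PCA projection. Note that BERTopic uses UMAP by default; the numpy sketch below, using mock data, only demonstrates the general idea of dimensionality reduction, not the actual algorithm.

```python
import numpy as np

# Reduce high-dimensional "embeddings" to two plotting coordinates.
# Plain PCA via SVD is shown here for illustration; BERTopic itself
# relies on UMAP, a non-linear technique better at preserving clusters.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 384))  # 500 mock "tweets", 384 dims

centered = embeddings - embeddings.mean(axis=0)
# Right singular vectors give the principal axes; keep the first two.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords_2d = centered @ vt[:2].T

print(coords_2d.shape)  # (500, 2)
```

Each row of `coords_2d` would become one coloured dot in a map like Fig. 5.3, with colour assigned by the document's topic.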
BERTopic provides several other visualizations, of which the most useful is no doubt topics over time, which is shown in the next section.
5.3 Dynamic Topic Modelling
Dynamic Topic Modelling (DTM) aims to capture the evolution of topics over time by extending traditional topic modelling techniques, which, as we have just seen, identify topics in a static corpus without considering temporal changes. However, the evolution of topics over time can bring to light many relevant aspects that a static description disregards. DTM is designed to handle corpora where each document or set of documents is assigned a timestamp that is used to compute the relative weight of each topic at specific time periods, thus revealing how topics change across different time periods.
DTM assumes a temporal structure of the corpus, that is, the corpus must be divided into discrete time slices of equal length (weeks, months, years). Each time slice contains a set of documents, and the objective is to model the evolution of topics from one time slice to the next, thus capturing the significance and relevance of a topic at different points in time.
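As a minimal illustration of this temporal structure, timestamped documents can be grouped into weekly slices using nothing but the standard library; the function and field names below are illustrative, not part of any topic modelling package.

```python
from datetime import date
from collections import defaultdict

# Group timestamped documents into weekly time slices, keyed by
# ISO calendar (year, week). Names are illustrative.
def slice_by_week(docs):
    """docs: iterable of (date, text) pairs -> {(year, week): [texts]}"""
    slices = defaultdict(list)
    for day, text in docs:
        iso = day.isocalendar()  # (year, week, weekday)
        slices[(iso[0], iso[1])].append(text)
    return dict(slices)

tweets = [
    (date(2020, 3, 2), "lockdown announced"),
    (date(2020, 3, 4), "schools closing"),
    (date(2020, 3, 10), "first mask mandate"),
]
weekly = slice_by_week(tweets)
print(sorted(weekly))  # [(2020, 10), (2020, 11)]
```

Each resulting slice then serves as the unit over which topic weights are computed for that week.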
The most obvious application of DTM is historical analysis, to understand how certain themes or discourses have evolved over centuries. For example, Blei and Lafferty (2006), one of the first publications on the topic, analysed over 100 years of articles from the journal Science, which was founded in the year 1880 by Thomas Edison (approximately 7.5 million words). The study shows how this technique can help understand the evolution and progression of research topics in science.
The granularity of the time slices is very important in dynamic topic modelling, and it is crucial to choose one that aligns with the expected rate of topic evolution. For this study of the evolution of topics during the COVID-19 pandemic, where new events and announcements were taking place at a rapid pace—e.g. facemask mandates, lockdowns, new treatments, virus variants, vaccines—weekly time slices were chosen. Furthermore, other variables must be taken into account when choosing the granularity of time slices: the size of the corpus, the number of documents per time slice, and the total length of the time period covered by the corpus.
Calculating topics over time can be computationally challenging, as it adds considerable complexity compared to static topic modelling. BERTopic, however, handles that complexity with a few simple functions: once the static topic model is created, it simply takes the list of documents and a list of the corresponding timestamps to calculate the dynamic models. Behind this apparent simplicity, BERTopic computes the topic representation at each time slice using c-TF-IDF,Footnote 12 without needing to run the entire model several times.
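The c-TF-IDF idea can be reimplemented in a few lines: each topic's documents are concatenated into one "class document" and terms are weighted by their class frequency scaled against their overall frequency, following the formula in Grootendorst (2022). This is a sketch for illustration, not BERTopic's actual code.

```python
import math
from collections import Counter

# Sketch of c-TF-IDF (Grootendorst 2022): merge each topic's documents
# into one class, then weight each term by
#   tf(word, class) * log(1 + avg_words_per_class / freq(word overall))
def c_tf_idf(classes):
    """classes: {topic: [token lists]} -> {topic: {word: weight}}"""
    class_counts = {
        topic: Counter(tok for doc in docs for tok in doc)
        for topic, docs in classes.items()
    }
    total = Counter()
    for counts in class_counts.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(class_counts)
    return {
        topic: {w: tf * math.log(1 + avg_words / total[w])
                for w, tf in counts.items()}
        for topic, counts in class_counts.items()
    }

weights = c_tf_idf({
    "vaccines": [["vaccine", "jab"], ["vaccine", "passport"]],
    "lockdown": [["lockdown", "pubs"], ["lockdown", "schools"]],
})
top = max(weights["vaccines"], key=weights["vaccines"].get)
print(top)  # vaccine
```

For dynamic modelling, the same computation is simply repeated with the classes restricted to the documents of each time slice, which is why the full model does not need to be re-run.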
The one outstanding feature of this software package as compared to others is the extremely useful interactive visualizations it generates from the data, which provide a convenient way to examine and interpret results. Figure 5.5 displays the topics over time chart for all topics in the U.S. subcorpus. As with topic maps, BERTopic uses Plotly graphs embedded in HTML pages to produce interactive graphics, which makes it hard to view in static display environments.
These graphs can be zoomed in on, and specific topics can be selected, allowing us to focus on them. For example, Fig. 5.6 displays the spectator sports topic in the U.K., where the spikes correlate with important events; the spike highlighted in the graph corresponds to November–December 2020, when it was announced that a limited number of spectators would be allowed to return to British stadiums in low-risk areas at the end of the second national lockdown (2 December, 2020), which obviously generated considerable controversy among fans.
The words displayed at each data point differ from one point to the next, being the ones that stand out according to the c-TF-IDF model for that specific time slice. Figure 5.7 shows a section of the timeline corresponding to the vaccination topic in the U.S., where most of the data points contain the same words, reflecting Twitter users announcing their first vaccine shot. The line clearly correlates with the events in this country, where the first vaccines were made available in December 2020 and were increasingly administered to people over the following months, spiking during March and April.
The timeline is very different for India, shown in Fig. 5.8, where vaccines were made available a few months later: from January until March 2021 only health workers received them, after which they were increasingly administered in three phases (older people with co-morbidities, everyone above 45 years, the rest of the population), peaking in June 2021. The timeline in this graph clearly correlates with the events in the real world.Footnote 13
Another example of how the Twitter topic timelines correlate with actual events is given in Fig. 5.9, which plots the timeline of the Brexit topic in relation to the pandemic. The first peaks in this period correspond to the “empty shelves” event in supermarkets resulting from the cancellation of 40,000 heavy goods vehicle (HGV) licences under the terms of the Brexit deal, which caused the number of professional drivers to plummet, leading to disruptions across supply chains that eventually affected British citizens in very obvious ways.
The examples shown in this section, which represent only a small fraction of the potential analyses that could be performed, clearly show that dynamic topic modelling, coupled with embeddings-based topic modelling and the power of Large Language Models, is a powerful tool to study not only the language but also the themes that underlie a corpus, a function that is only partially fulfilled by keyword extraction. The main advantage of advanced topic modelling is that it provides a more accurate overview of what is being discussed in a corpus, an abstraction that needs to be made “by proxy” if we are limited to keywords. In other words, topic modelling eliminates the need to semantically classify and generalize over a possibly very large number of words that vaguely point towards certain topics. If keywords are signposts to the contents of a corpus, topics are neon signs.
Notes
- 1.
Oxford Dictionary of English, 3rd edition.
- 2.
The corpus files used in this chapter and the next (which includes the Canada, Australia, and South Africa subcorpora) are included in the book’s repository (https://osf.io/h5q4j/).
- 3.
All times are given in hh:mm:ss format. All tasks were run on an Intel Core i7 10700F 2.9 GHz CPU (8 cores) on Ubuntu Linux 22.04 Server 64-bit. BERTopic makes use of the system’s NVIDIA GeForce RTX 3080Ti GPU through the CUDA library. Exponentially longer processing times should be expected on a non-GPU system. The critical step in terms of processing is generating the sentence embeddings; as an example, this step alone took about 3 minutes for the US subcorpus with GPU acceleration and nearly 8 hours without it.
- 4.
- 5.
This figure is a screenshot of an interactive Plotly object on an HTML document generated by the script. All HTML pages are included in the book’s repository, along with the rest of the data at https://osf.io/h5q4j/.
- 6.
The relevance of Firth’s work to the development of distributional semantics and modern word embeddings is acknowledged by many outstanding authors in the field of NLP, such as Russell and Norvig (2010, 985).
- 7.
We were unable to install this library on Python 3.10, so it is probably no longer maintained.
- 8.
http://topex.cctr.vcu.edu [Accessed 18 July 2023].
- 9.
Although a custom Python script was written for this experiment, a Jupyter Notebook provided by BERTopic’s author was used as the main code source. This notebook can be found at https://colab.research.google.com/drive/1QCERSMUjqGetGGujdrvv_6_EeoIcd_9M [Accessed 5 August 2023].
- 10.
https://huggingface.co/BAAI/bge-small-en-v1.5 [Accessed 7 August 2023].
- 11.
All visualizations are provided in the book’s repository as HTML files, along with all datasets and document-topic assignments, at https://osf.io/h5q4j/.
- 12.
C-TF-IDF is an adaptation of the TF-IDF algorithm described in Sect. 4.2.2 used by BERTopic where each class of documents is converted to a single document.
- 13.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9069978/figure/fig0006/ [Accessed 17 August 2023].
References
Angelov, Dimo. 2020. Top2Vec: Distributed Representations of Topics. arXiv preprint arXiv:2007.09236.
Anupriya, P., and S. Karpagavalli. 2015. LDA Based Topic Modeling of Journal Abstracts. In 2015 International Conference on Advanced Computing and Communication Systems: 1–5. https://doi.org/10.1109/ICACCS.2015.7324058.
Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. The Journal of Machine Learning Research 3: 993–1022.
Blei, David M., and John D. Lafferty. 2006. Dynamic Topic Models. In Proceedings of the 23rd International Conference on Machine Learning, 113–120. ICML ’06. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/1143844.1143859.
Cao, Qiang, Xian Cheng, and Shaoyi Liao. 2023. A Comparison Study of Topic Modeling Based Literature Analysis by Using Full Texts and Abstracts of Scientific Articles: A Case of COVID-19 Research. Library Hi Tech 41: 543–569. https://doi.org/10.1108/lht-03-2022-0144.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. Minneapolis, Minnesota: Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423.
Dumais, S. T., G. W. Furnas, T. K. Landauer, S. Deerwester, and R. Harshman. 1988. Using Latent Semantic Analysis to Improve Access to Textual Information. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 281–285. CHI ’88. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/57167.57214.
Firth, J. R. 1957. A Synopsis of Linguistic Theory 1930–55. In Studies in Linguistic Analysis (special volume of the Philological Society), 1–32. Oxford, UK: Basil Blackwell.
Furnas, G.W., T.K. Landauer, L.M. Gomez, and S.T. Dumais. 1987. The Vocabulary Problem in Human-System Communication. Communications of the ACM 30: 964–971. https://doi.org/10.1145/32206.32212.
Grootendorst, Maarten. 2022. BERTopic: Neural Topic Modeling with a Class-Based TF-IDF Procedure. arXiv preprint arXiv:2203.05794.
Gupta, Suhas, and Eric Hulburd. 2019. Exploring Neural Net Augmentation to BERT for Question Answering on SQUAD 2.0. ArXiv abs/1908.01767.
Ikegawa, Takashi. 2022. Micro Science and Technology Fields Requiring Mathematically Trained Contributors: Topic Modeling Using Journal Paper Abstracts. In IEEE Frontiers in Education Conference, FIE 2022, Uppsala, Sweden, October 8–11, 2022, 1–5. IEEE. https://doi.org/10.1109/FIE56618.2022.9962550.
Lee, Daniel D., and H. Sebastian Seung. 1999. Learning the Parts of Objects by Non-negative Matrix Factorization. Nature 401. Nature Publishing Group: 788–791. https://doi.org/10.1038/44565.
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.
Newman, David, Jey Han Lau, Karl Grieser, and Timothy Baldwin. 2010. Automatic Evaluation of Topic Coherence. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 100–108. HLT ’10. USA: Association for Computational Linguistics.
Nomoto, Tadashi. 2023. Keyword Extraction: A Modern Perspective. Sn Computer Science 4: 92. https://doi.org/10.1007/s42979-022-01481-7.
Olex, Amy L., Evan French, Peter Burdette, Srilakshmi Sagiraju, Thomas Neumann, Tamas S. Gal, and Bridget T. McInnes. 2022. TopEx: Topic Exploration of COVID-19 Corpora—Results from the BioCreative VII Challenge Track 4. Database: The Journal of Biological Databases and Curation 2022: baac063. https://doi.org/10.1093/database/baac063.
Reimers, Nils, and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
Russell, Stuart J., Peter Norvig, and Ernest Davis. 2010. Artificial Intelligence: A Modern Approach. 3rd ed. Prentice Hall Series in Artificial Intelligence. Upper Saddle River, NJ: Prentice Hall.
Shi, Hanyu, Martin Gerlach, Isabel Diersen, Doug Downey, and Luis Amaral. 2019. A New Evaluation Framework for Topic Modeling Algorithms Based on Synthetic Corpora. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, 816–826. PMLR.
Smith, Alison, Varun Kumar, Jordan L. Boyd-Graber, Kevin D. Seppi, and Leah Findlater. 2018. Closing the Loop: User-Centered Design and Evaluation of a Human-in-the-Loop Topic Modeling System. In Proceedings of the 23rd International Conference on Intelligent User Interfaces, IUI 2018, Tokyo, Japan, March 07–11, 2018, ed. Shlomo Berkovsky, Yoshinori Hijikata, Jun Rekimoto, Margaret M. Burnett, Mark Billinghurst, and Aaron Quigley, 293–304. ACM. https://doi.org/10.1145/3172944.3172965.
Sun, Chi, Luyao Huang, and Xipeng Qiu. 2019. Utilizing BERT for Aspect-Based Sentiment Analysis Via Constructing Auxiliary Sentence. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, volume 1 (long and short papers), 380–385. Minneapolis, Minnesota: Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1035.
Touvron, Hugo, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, et al. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv. https://doi.org/10.48550/arXiv.2307.09288.
Wang, Alex, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. GLUE: A Multi-task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP.
Watson Todd, Richard. 2011. Analyzing discourse topics and topic keywords. Semiotica 184. De Gruyter Mouton: 251–270. https://doi.org/10.1515/semi.2011.029.
Wolf, Thomas, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, et al. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 38–45. Online: Association for Computational Linguistics.
Xiao, Shitao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. 2023. C-pack: Packaged Resources to Advance General Chinese Embedding.
Zhou, Xiangyang, Lu Li, Daxiang Dong, Yi Liu, Ying Chen, Wayne Xin Zhao, Dianhai Yu, and Hua Wu. 2018. Multi-turn Response Selection for Chatbots with Deep Attention Matching Network. In Proceedings of the 56th Annual Meeting of the Association for computational linguistics (volume 1: Long papers), 1118–1127. Melbourne, Australia: Association for Computational Linguistics. https://doi.org/10.18653/v1/P18-1103.
Zhu, Jiaqi, Kaijun Wang, Yunkun Wu, Zhongyi Hu, and Hongan Wang. 2016. Mining User-Aware Rare Sequential Topic Patterns in Document Streams. IEEE Transactions on Knowledge and Data Engineering 28: 1790–1804. IEEE.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2024 The Author(s)
About this chapter
Cite this chapter
Moreno-Ortiz, A. (2024). Topics. In: Making Sense of Large Social Media Corpora. Palgrave Macmillan, Cham. https://doi.org/10.1007/978-3-031-52719-7_5
DOI: https://doi.org/10.1007/978-3-031-52719-7_5
Publisher Name: Palgrave Macmillan, Cham
Print ISBN: 978-3-031-52718-0
Online ISBN: 978-3-031-52719-7