Abstract
Sentiment analysis tools are powerful instruments for describing the emotional content of a corpus. This chapter describes the methods and tools available and illustrates what can be achieved with them. Both machine learning and lexicon-based approaches are described and used, as they provide different advantages. Whereas machine/deep learning approaches are the state of the art in sentiment classification tasks, lexicon-based tools can provide further insights, as they are able to retrieve the actual sentiment words and expressions used in the corpus. Finally, the role of emojis is discussed and illustrated with a frequency analysis of the most prominent emojis used in the CCTC.
Keywords
- Sentiment analysis
- Opinion mining
- Emoji analysis
- Sentiment classification
- Sentiment lexicon
- Deterministic vs. probabilistic approaches
Sentiment analysis, also referred to as opinion mining, is a branch of Natural Language Processing that aims to identify either the polarity or the emotions expressed in a text (B. Liu 2012), although the term emotion recognition is sometimes used for this specific task, and usually appears linked to the field of affective computing. The main objective of sentiment analysis is to recognize subjective data, such as judgments, opinions, and feelings towards people, things, and their characteristics (Pang and Lee 2008).
Sentiment analysis has many uses in many different industries. It is used for brand monitoring and product analytics in business, and for tracking public opinion and social media analysis in politics. It also has a big impact on customer service, where it aids in comprehending client feedback and enhancing offerings (Cambria et al. 2017). The range of applications is as varied as the range of texts that sentiment analysis can be applied to: from movie and book reviews, e.g. Kennedy and Inkpen (2006), Carretero and Taboada (2014), to hotel reviews, e.g. Moreno-Ortiz et al. (2011), online news, e.g. Soo-Guan Khoo et al. (2012), and political debate on social media, e.g. Wang et al. (2012).
The methods used in sentiment analysis are also varied, ranging from lexicon-based to machine learning and hybrid techniques. Many different machine learning techniques have been developed, including Support Vector Machines (SVM), Naive Bayes, and deep learning models like Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN) (Medhat et al. 2014). Lexicon-based approaches, on the other hand, rely on a sentiment lexicon, i.e. a list of lexical features that are labelled as either positive or negative according to their semantic orientation (Taboada et al. 2011).
6.1 Sentiment Analysis Methods
The field has advanced over time to take on more challenging tasks like aspect-based sentiment analysis and emotion detection (Zhang et al. 2018). These authors also define sentiment analysis as the task whose goal is to identify “people’s opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes”. Thus, sentiment analysis is often reduced to a text classification task, which is in fact one of the most basic NLP tasks, whereby a document is classified as belonging to one of two or more classes. This is accomplished by using a classifier, i.e. a predictive model that reads the input text and outputs a certain class, sometimes with a confidence score (i.e. how confident the classifier is that the document belongs in that class).
The classification techniques, like other processes that attempt to emulate intelligent behaviour, can be implemented in many ways. The traditional approach is a series of if–then statements (or production rules), which together form a rule-based system. Rule-based systems have been employed since the early days of computing, as they form the basis of most programming languages. They are sometimes referred to as “the simplest form of artificial intelligence” (Grosan and Abraham 2011, 149). A rule-based system contains a set of production rules, a set of facts, and an interpreter that controls the application of the rules given the facts. Thus, these systems require expert knowledge of the domain at hand, as well as engineering skills to encode this knowledge as a set of facts and rules. In the case of sentiment analysis, this type of system corresponds to lexicon-based approaches, where the facts specify which words and expressions are positive and negative, and the rules define how the proportion of positive versus negative words is to be measured to come up with a global sentiment score for the document. Context can also be accounted for by a set of such rules (e.g. “if a negative particle precedes a sentiment adjective, then its polarity is inverted”). Lexicon-based sentiment analysis systems are, for the most part, rule-based systems, where the required static facts (e.g. sentiment lexicon) and procedural knowledge (e.g. context rules) have been obtained from certain knowledge sources—corpora, dictionaries—and encoded by a knowledge engineer.
In contrast to these deterministic systems, machine learning simulates intelligent behaviour using probabilistic (or stochastic) techniques. In lieu of relying on expertly encoded and distilled knowledge, the learning algorithms can acquire this knowledge from vast quantities of data, in this case text. Corpus-based (i.e. machine learning) approaches are prevalent in both industry and research, as they have demonstrated superior classification performance.
The current state of the art in sentiment classification consists of machine learning approaches in the form of neural networks that employ transformers, i.e. deep learning models that aim to solve certain text-related tasks (bi-directional attention, word and sentence prediction, sequence-to-sequence tasks) while easily handling long-range dependencies. Language models based on the transformer architecture include two of the most successful ones to date: Google’s BERT (Devlin et al. 2019) and OpenAI’s GPT (Brown et al. 2020), which have been shown to improve on previous top benchmark scores across numerous NLP tasks, both in natural language understanding and generation, including sentiment analysis (Wolf et al. 2020).
6.1.1 Deterministic Methods
Lexicon-based methods of sentiment analysis can be referred to as deterministic because they employ deterministic data, i.e. a set of words that comprise a lexicon in which sentiment information about those words is stored and, in some cases, a set of rules that can contextualize the semantic orientation of those words in actual usage. Examples of sentiment dictionaries include The Harvard General Inquirer (Stone and Hunt 1963), Bing Liu’s Opinion Lexicon (Hu and Liu 2004), MPQA (Wilson et al. 2005), SentiWordNet (Baccianella et al. 2010), SO-CAL (Taboada et al. 2011), EmoLex (Mohammad and Turney 2010), VADER (Hutto and Gilbert 2014), Lingmotif-Lex (Moreno-Ortiz and Pérez-Hernández 2018), and SenticNet (Cambria et al. 2020). These resources generally consist of word lists with varying degrees of sentiment information, from simple polarity to emotion classification.
However, the context in which individual words and phrases appear can alter their semantics (including polarity), sometimes to the point where they mean the exact opposite of what they initially denote; this is especially true of sentiment words. A negative adverb, such as “not” or “never”, can invert the polarity of the adjective “happy”, for instance. It is therefore difficult for a lexicon-based sentiment analysis system to account for all such context-shifting words. For example, we can implement a rule that inverts the sentiment of “happy” when it is preceded by “never” within a span of three words. This rule would correctly classify as negative expressions such as “I was never truly happy there”, but would incorrectly classify cases such as “I've never been so happy before”. In the field of sentiment analysis, numerous contextual shifter systems have been developed, e.g. Kennedy and Inkpen (2006), Moreno-Ortiz and Pérez-Hernández (2018), Polanyi and Zaenen (2006), Taboada et al. (2011). Nonetheless, the level of difficulty that sentence-level context handling poses pales in comparison to higher-order linguistic levels of analysis; discourse-related phenomena, such as the metaphorical use of words, irony, sarcasm, understatements, or humblebragging—all of which are pervasive in social media—are a serious problem for which there are no immediate solutions.
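A rule of this kind can be sketched in a few lines of code. The toy lexicon, the tokenization, and the three-word negation window below are illustrative assumptions, not the actual resources or rules of any of the systems cited; the sketch simply shows how such a rule succeeds on the first example above and fails on the second.

```python
# Toy lexicon and negation rule; the entries and the three-word window
# are illustrative assumptions, not the resources of any cited system.
LEXICON = {"happy": 1, "great": 1, "sad": -1, "crisis": -1}
NEGATORS = {"not", "never", "no"}
WINDOW = 3  # a negator within the 3 preceding words inverts polarity

def score(text):
    tokens = text.lower().replace("'", " ").split()
    total = 0
    for i, tok in enumerate(tokens):
        if tok in LEXICON:
            polarity = LEXICON[tok]
            if NEGATORS & set(tokens[max(0, i - WINDOW):i]):
                polarity = -polarity  # context rule fires
            total += polarity
    return total

# The rule handles the first example correctly but fails on the second,
# which is actually positive:
print(score("I was never truly happy there"))    # -1 (correct)
print(score("I've never been so happy before"))  # -1 (wrong)
```

The failure on the second sentence illustrates why purely deterministic context handling is brittle: the same surface pattern (‘never’ near ‘happy’) can carry opposite meanings.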
These knowledge sources are also deterministic because they have been compiled and curated by humans and are therefore known to be true, or at least assumed to be true; consequently, the performance of these systems is entirely dependent on the data upon which they are based. Their underlying model is also deterministic: if a text contains more positive words than negative words, it is predicted to be positive. When analysis errors occur, they are attributed to faulty or insufficient data: a particular sentiment word is missing, a valence shifter was incorrectly applied, pragmatic features were not taken into account, or additional world or common-sense knowledge is required. The underlying assumption is that it is possible to collect all of the facts and rules required for optimal model performance. This is applicable not only to lexicon-based sentiment analysis systems, but also to all formal grammars and computational implementations of linguistic theories. However, it has been repeatedly demonstrated that the facts and rules of language are far too elusive and organic to be constrained by the deterministic straitjacket. Otherwise, after seven decades of implementations of linguistic theories, at least one would have emerged as a viable framework for developing real-world language applications, which, arguably, has not occurred.
6.1.2 Probabilistic Methods
Since the 1960s, machine learning (ML) algorithms have been used in a variety of research fields. However, it has only been in the last two decades that we have witnessed their widespread use in real-world applications. In conventional programming, we tell the computer exactly what steps to take in order to solve a problem, which works well for many situations, such as solving an equation; however, other tasks do not lend themselves to this approach: how can we break down the process of identifying a specific object in a picture, or of understanding a text, into minute, step-by-step instructions? The analysis process I described in the previous section, which is utilized by lexicon-based SA tools, is merely an extreme procedural simplification of much more complex cognitive processes that our brains are able to handle effortlessly.
The goal of machine learning is to teach computers to solve these complex problems by providing them with examples of the problem and allowing them to figure out how to solve it on their own. Despite the fact that “classical” ML algorithms (Naïve Bayes, decision trees, Support Vector Machines, etc.) have been (and continue to be) successfully used to solve practical NLP problems, including sentiment analysis, deep learning and neural networks have revolutionized the field. As mentioned in the previous section, the current state-of-the-art performance in all language-related tasks is offered by the transformer architecture (Vaswani et al. 2017), and therefore it has rapidly become the dominant architecture for NLP (Wolf et al. 2020). It is based on the concept and practice of “pretraining”, i.e. creating a language model from a very large corpus in an unsupervised manner that can then be repurposed for different specific applications by “tuning” it on smaller, labelled (i.e. annotated) corpora.
Probabilistic methods based on the transformer architecture have been repeatedly shown in the literature to outperform all other approaches to sentiment classification, which obviously includes lexicon-based systems. However, lexicon-based systems do provide very useful capabilities that pure classifiers do not possess: the ability to point out which words and expressions justify their classification results. Conversely, ML systems, especially neural networks, exhibit the well-known “explainability” issue. Indeed, these algorithms excel at discovering correlations in massive datasets, but offer little to nothing in the way of causation. Ultimately, the researcher is left to come up with likely interpretations of the results. Important steps are being taken towards explainable AI (Barredo Arrieta et al. 2020), but current technology simply cannot offer “explanations” of its own predictions; these models act as black boxes that take an input and produce an output based on their probabilistic model.
6.2 Experiment: Sentiment Analysis of the CCTC by Country
This experiment is intended to showcase the capabilities of both state-of-the-art, transformer-based sentiment classification systems and an advanced lexicon-based sentiment analysis system. Thus, it consists of two parts; in the first one I use a script that employs the HuggingFace Transformers library (Wolf et al. 2020) together with TweetNLP (Camacho-Collados et al. 2022), a state-of-the-art model for Twitter sentiment classification trained on 124 million tweets and based on RoBERTa (Y. Liu et al. 2019).
In the second part I use Lingmotif (Moreno-Ortiz 2017, 2023), an advanced lexicon-based sentiment analysis system, to analyse the same corpus and obtain frequency lists of sentiment-related lexical items that can help us understand not just the overall semantic orientation of the corpus, but also the nature and type of that sentiment by exposing the actual words and phrases that materialize it.
For this study, I will be using the top six subcorpora by volume in the geotagged section of the CCTC.Footnote 1 Table 6.1 describes the subcorpora quantitatively. As in the experiments in the previous chapters, I use a proportional part of each subcorpus where this is possible (United States, United Kingdom, and India). For the other three countries, the full subcorpus was used.
6.2.1 Tweet Classification and Sentiment Over Time
The HuggingFace library makes classification very simple, as it takes care of every stage of the process by means of an integrated pipeline, thus hiding the complexity involved in working with transformer-based models. Every file in the corpus, where each line is a tweet, is read line by line, and the full list of documents is passed to the “sentiment-analysis” pipeline together with the tokenizer and language model. The pipeline returns a list of results where each document is classified as belonging to one of three classes—“positive”, “neutral”, or “negative”—with a confidence score in the range 0–1. Table 6.2 shows the global results of the analysis.
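A minimal sketch of this workflow is shown below. The helper aggregates the pipeline’s per-tweet labels into the class percentages reported in Table 6.2; the model identifier is an assumption based on the TweetNLP reference, and the classification function is only defined, not run, since it requires the transformers library and a large model download.

```python
from collections import Counter

def class_proportions(results):
    """Aggregate classifier outputs (dicts with a 'label' key, as
    returned by the transformers pipeline) into class percentages."""
    counts = Counter(r["label"] for r in results)
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}

def classify_corpus(path):
    """Classify a corpus file (one tweet per line).

    Requires the transformers library; the model name below is an
    assumption based on the TweetNLP reference in the text."""
    from transformers import pipeline
    classify = pipeline("sentiment-analysis",
                        model="cardiffnlp/twitter-roberta-base-sentiment-latest")
    with open(path, encoding="utf-8") as f:
        tweets = [line.strip() for line in f if line.strip()]
    return class_proportions(classify(tweets))
```

Run over each weekly file, `class_proportions` yields the per-class percentages that the tables and timelines in this section are built from.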
The most obvious fact that the data tell us is that the general sentiment is negative, as the proportion of negative tweets is the largest across all countries. However, there are important differences among them: the United States dataset has the most negative results, with over 53% of the tweets being negative, which is significantly higher than the average (47.02% including USA, 45.81% excluding it). This is surprising considering that it is the country with the highest GDP per capita of the group and perhaps a reflection of the poor pandemic management of the Trump administration. Conversely, India, which has the lowest GDP per capita, has the lowest percentage of negative tweets.
These global results, however, are aggregated (averaged) data, as the actual classification task was performed on weekly samples. This organization allows us to look at the evolution of sentiment over the two years that the samples span. Figure 6.1 is a visualization of the sentiment timeline corresponding to the United States using the raw data returned by the classifier.
The timeline reflects some of the most relevant events during the pandemic. After the initial alarm caused by the cases reported in China, the positive sentiment increases during the early spring of 2020 and then negativity increases as lockdowns are ordered in some states. Similarly, the significant surge in negative sentiment during the summer of 2021 correlates with the beginning of a third wave of infections as a result of the Delta variant of the virus.
In order to more easily compare the sentiment timeline of different countries, we can merge these polarity proportions into a single sentiment score using the following equation:
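A formulation consistent with this description, offered here as a plausible reconstruction rather than the chapter’s verbatim formula, is the normalized difference of the positive and negative proportions:

```latex
S = \frac{n_{pos} - n_{neg}}{n_{pos} + n_{neg} + n_{neu}},
\qquad
S_{0\text{--}100} = 50\,(S + 1)
```

where $n_{pos}$, $n_{neg}$, and $n_{neu}$ are the numbers of tweets classified as positive, negative, and neutral in a given week; the second expression rescales the score from the −1 to 1 range to the more readable 0–100 range.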
This will give us a score in the range −1 to 1, which can easily be converted to a more readable 0–100 range. Figures 6.2 and 6.3 use this unified sentiment score to visually compare sentiment evolution in the six countries. Three countries are shown in each graph to facilitate the interpretation of the data.
These data visualizations make it apparent that some countries follow a more similar evolution of sentiment than others. Even a quick look at the graphs suggests that India’s sentiment evolution is the one that deviates the most from the rest of the countries. However, in order to properly quantify the correlation between the different time series, we can use the Pearson correlation coefficient between country pairs. Table 6.3 shows the list of correlations between country pairs in descending order.
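The country-pair computation can be sketched as follows, assuming each country’s weekly sentiment scores are stored as aligned, equal-length lists; the sample series below are made-up numbers, not CCTC data.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up weekly sentiment scores (0-100) for three countries
timelines = {
    "US": [48.0, 45.2, 43.1, 47.6, 50.3],
    "UK": [49.1, 46.0, 44.2, 48.0, 51.0],
    "IN": [52.0, 55.1, 50.2, 46.3, 58.9],
}

# Correlations for every country pair, in descending order (cf. Table 6.3)
pairs = {(a, b): pearson(timelines[a], timelines[b])
         for a, b in combinations(timelines, 2)}
for (a, b), r in sorted(pairs.items(), key=lambda kv: -kv[1]):
    print(f"{a}-{b}: {r:.3f}")
```

With real data, sorting the pairs this way directly reproduces the ranking shown in Table 6.3.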
This list of correlations tells us that countries that share more in terms of geographical proximity, culture, or economy tend to correlate more highly. We can now say with greater certainty that India displays the most deviation from the rest, followed by South Africa.
The reasons why India’s sentiment evolution is so different may be due to many factors, but it probably has to do with a different vaccination process and with the timing of its two major waves of COVID-19 cases, which differed from most other countries. India started its vaccination programme in January 2021 and initially managed to control the number of new cases; however, a major second wave started in April 2021, making daily new cases spike from 9,000 to over 400,000, with 3,500 deaths per day, by the end of April.Footnote 2 The reason for this massive increase was the incipient Delta variant, which emerged in India during this time and would later spread to the rest of the world. This clearly correlates with the sentiment timeline during this period, when negative sentiment clearly increases.
Obviously, looking at the changes in sentiment as represented by the peaks and troughs in the graphs and correlating them with real-world events is not an easy task, as there are multiple factors that may cause those changes. However, with sentiment classification of tweets, that is all we have: we can only browse through the—very large—set of classified tweets and attempt to see what causes the sentiment. Examples (42) through (47) are tweets from this period. In them, people complain about the bleak situation and the poor management of the pandemic by the government. Examples (43) and (46) are interesting because they illustrate the trouble that state-of-the-art sentiment classifiers run into when faced with sarcasm: both are clearly negative but are classified as positive and neutral, respectively.
42. ‘A person cannot live peacefully in Delhi, a person cannot even die peacefully in Delhi’. India overwhelmed by world's worst Covid crisis—BBC News. [negative, 0.896]
43. Half of the world's total covid cases are now from India!! What an achievement.. #IndiaFightsCOVID19. [positive, 0.811]
44. What to do brother, our government is not listening to us right now. There no use of these types of requests [negative, 0.926]
45. #Karnatakagovernment Please consider the necessary requirements/decision towards raising COVID-19 death's before it gets out of control. We can afford the raising cases not the raising death's. [negative, 0.632]
46. When coronavirus cases went down, Govt declared victory, PM took all credit as always; Now they're blaming states: Rahul Gandhi [neutral, 0.614]
47. Mismanagement and lack of planning in production and distribution has killed more than the #virus. [negative, 0.898]
6.2.2 The Sentiment Lexicon of the Pandemic on Twitter
Sentiment classification of tweets is obviously useful, but it falls short of telling us about the nature of the sentiment. All we have is the classification data, either as individual or aggregated results by time span, and the classified tweets themselves, which is too much data to manually make sense of. For instance, examples (42) to (47) above were selected from the set of tweets in the week of April 26 to May 2, 2021, but that week alone contains 27,902 tweets, so it is very hard to draw any conclusions regarding the content, and the examples are nothing more than anecdotal evidence.
Lexicon-based sentiment analysis systems can be very useful when it comes to obtaining more clues as to the nature of the sentiment, as they can provide frequency lists of the words and expressions that motivate the sentiment. For example, Table 6.4 shows the list of the most frequent negative words during this time period in India.
From this set of negative words and expressions, we can see that many refer to the disease itself (‘pandemic’, ‘epidemic’, ‘virus’, ‘disease’, ‘infect’, ‘test positive’, ‘fever’, ‘risk’), others to the deaths caused by the disease (‘death’, ‘dying’, ‘dead’, ‘rest in peace’, ‘rip’, ‘deadly’, ‘kill’, ‘condolence’, ‘loss’), others to the social and economic difficulties (‘crisis’, ‘emergency’, ‘shortage’, ‘poor’, ‘needy’, ‘lack’, ‘struggle’), and finally some of them refer to the management of the pandemic by the government (‘fail’, ‘blame’, ‘shame’, ‘impose’, ‘failure’, ‘wrong’, ‘fake’, ‘unable’, ‘quarantine’, ‘complete lockdown’, ‘lack’).
These words provide a more complete picture of the particular reasons that motivate the negativity at this particular point in time. Looking at positive and negative words can also help us identify what causes the unexpected positive peak in India during the week of October 18, 2021, which, with an all-time high sentiment score of 58.87—from 45.53 the previous week and 40.49 the next—is also an anomaly compared to the rest of the countries. But it is also interesting to contrast these results with the topics that we saw in the previous chapter. Figure 6.4 shows the topics over time for India, where a surge of the vaccines topic is quite apparent.
Finally, looking at the tweets in this week, there is a very large number of tweets celebrating the advancement of the vaccination process. Examples (48) to (52) illustrate these.
48. World Bank Prez Congratulates India on Successful Covid-19 Vaccination Campaign. NaMo App. [positive, 0.922]
49. PM congratulates people of Devbhoomi for 100% first dose of Covid vaccination. [positive, 0.871]
50. 98 crores done. India is quickly making its way to #COVID19 vaccine century! Just two more steps to go . ji. [positive, 0.914]
51. 2nd Dose Done . Fully vaccinated #corona #vaccinationdone #vaccine #sainisurinder Anandpur Sahib. [positive, 0.823]
52. #India crosses 98 crore vaccine doses. And the roses are increasing fast. Seems 20 Oct is going to be the day when India will cross #100crore doses. Salute to all health care workers, Salute to spirit of India. #TogetherWeWin #COVID19 [positive, 0.946]
Comparing the negative words across different countries can also shed light on the particular circumstances and contexts. Table 6.5 contains the top 25 negative words for each of the countries in this study.
The top few words are mostly the same across all countries (‘pandemic’, ‘lockdown’, ‘virus’, ‘death’). Upon investigation, the third position of the word ‘stigma’ in Canada’s list is due to a specific and very active Twitter account “Fighting Stigma”, which preceded its many tweets with these two words. It is interesting, however, how the word ‘lockdown’ is either in first or second position in all countries except the United States, where it ranks fifth; this is most probably due to the fact that lockdown measures were fewer and more relaxed than in other countries and therefore had less impact on the population. The lemma ‘lockdown’ has 12,111 occurrences in the US subcorpus, whereas in the U.K. corpus (which has a similar number of words) there are 146,388 occurrences.
The lists also offer insights into the particular problems that the countries had to face. For example, in South Africa’s list the words ‘HIV’, ‘arrest’, and ‘corruption’ refer to issues that are not present in other countries. The lemma ‘poor’ is also present, as it is in India’s list, which is the only one to contain the word ‘struggle’. These differences do suggest a more difficult economic situation for the people of these countries, which was made worse by the hardships brought about by the pandemic.
On the other hand, all word lists except India’s contain insults and profanity (U.S.: ‘fuck’, ‘shit’, ‘stupid’, ‘idiot’; U.K.: ‘fuck’, ‘shit’, ‘idiot’; Canada: ‘fuck’, ‘shit’; Australia: ‘fuck’, ‘shit’, ‘idiot’; South Africa: ‘shit’), which is telling of the different cultures. The phrase ‘please help’ is also only found in the India list; in fact, the lemma ‘help’ is far more frequent in the India subcorpus.
Finally, the United States list is the only one that contains the word ‘hate’ (in 17th position), which is probably a reflection of the political atmosphere at the time, as examples (53) to (58) illustrate.
53. On coronavirus, Trump needs the ones he hates: Experts and journalists—The Washington Post
54. I fucking hate it here
55. I hate the healthcare process in this country
56. CNN loves China and hates America
57. Pence has his beliefs that many disagree w/ and hate him for it. We need to come together as patriots against those who openly or secretly hate us. #Corona #Coronavirus #MikePence
58. Republicans hate government until an enormous problem made by the private sector (2008 crash) or not solvable by the private sector (Coronavirus) emerges.
As for the positive words, they are very similar across all countries, although of course the frequencies change. Table 6.6 shows the top 50 positive sentiment words and expressions for each of the subcorpora. Lingmotif treats emojis as regular lexical items, which is why they are listed and ranked along with the rest of the words.
As with the negative words, most of the words in this list are positive in general, but some are specific to the pandemic subject domain, such as ‘protect’, ‘recovery’, ‘immunity’, ‘volunteer’ or ‘save lives’. There are not many differences between the countries. The primary themes that the lexical items refer to are good wishes, positive advice, and congratulations (on fighting the pandemic). The only country that shows a different theme is, again, India, with the word ‘donate’, in consonance with the recurrent “request-for-help” topic identified before.
We can also track the frequency of positive and negative words and phrases over time. To do this, we need to calculate the frequencies of all positive and negative lexical items over the whole period for each country, which produces a ranked list of the most frequent sentiment items; we can then track these over time by looking at their frequency at each time period (weeks in this case). To account for the different sizes of the subcorpora, relative frequencies were calculated per 1,000 words for each of the lexical items.
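The per-week normalization just described can be sketched as follows; the week labels and counts below are invented for illustration and do not come from the CCTC.

```python
from collections import Counter

def rel_freq_timeline(weekly_data, item):
    """Frequency of `item` per 1,000 words for each week.

    weekly_data maps a week label to a (token_count, Counter of
    lexical items) pair."""
    return {week: 1000 * counts[item] / n_tokens
            for week, (n_tokens, counts) in weekly_data.items()}

# Invented figures for two weeks of one subcorpus
weekly_data = {
    "2021-W16": (200_000, Counter({"help": 340, "crisis": 120})),
    "2021-W17": (180_000, Counter({"help": 612, "crisis": 95})),
}
print(rel_freq_timeline(weekly_data, "help"))
# {'2021-W16': 1.7, '2021-W17': 3.4}
```

Computing such a timeline for one item across all subcorpora produces exactly the kind of cross-country comparison plotted in Figs. 6.5 and 6.6.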
There are two ways in which we can track sentiment words over time. We can either look at the evolution of the top n words for one specific country, or we can track one specific word in several countries. The latter offers more interesting results, as focusing on certain specific words and comparing their frequency among different countries can provide useful insights. For example, Fig. 6.5 displays the frequency of the word ‘help’ over time, where India is clearly the most prominent and the peaks correspond with the particularly hard periods mentioned before.
Plotting the frequency of specific sentiment words can provide evidence of real-world events. Figure 6.6 plots the timeline of the word ‘protest’ for all countries in the corpus. It evidences the periods where protests became an issue. June 2020 witnessed demonstrations in most countries after months of lockdowns and stay-at-home orders.
Australia is the country that shows the most spikes on this word, surpassing all other countries in June 2020, but also showing many other peaks due to different events. For example, during September 2020, several anti-lockdown protests were organized in this country, as they were during July through November 2021. India, again, is the country that deviates the most from the rest, with a rather flat line except for two obvious spikes during December 2020 and December 2021. These demonstrations, however, were most probably related not to the pandemic but to the farmers’ protests against the laws passed by the Indian government in September 2020.
6.3 The Role of Emojis in the Expression of Sentiment
There is little doubt that emojis contribute significantly to the emotional content of social media messages. They rank high in the lists of negative (Table 6.5) and positive (Table 6.6) terms for every country, and they are present in a large number of the examples provided in this chapter, which demonstrates the significant role they play. Emojis facilitate the communication of subtle emotional cues by condensing ideas and emotions into a single icon or pictograph (Bai et al. 2019) and have a ubiquitous presence in social media all around the world (Ljubešić and Fišer 2016).
Emojis, like their less sophisticated text-based counterparts, emoticons, are said to fill the role of human facial expressions and gestures, absent in text-based communication; indeed, emojis that display facial expressions have been shown to elicit a similar neural time-course to actual human faces, although with a lower attentional orientation response (Gantiva et al. 2020). In fact, research has shown that reaction times are slower when humans are confronted with messages where the text and the accompanying emoticon or emoji express conflicting valences, and such messages also tend to be interpreted as negative more often (Aldunate et al. 2018).
From a cultural perspective, however, the use of emojis is not homogeneous across nations and languages. Kejriwal et al. (2021), in a large-scale study covering 30 countries and as many languages, based on tens of millions of tweets, concluded that emoji usage is not only strongly dependent on cultural and geographical variables, but also that its diversity is much more constrained in some languages and countries than in others.
These conclusions are unquestionably supported by the data in this study. From a purely quantitative perspective, we can see in Table 6.5 that four negative emojis are included in the top 25 negative items for South Africa, whereas only one is present in the rest of the countries, which suggests that South African Twitter users tend to use a higher proportion of emojis in their tweets. However, an actual count of emojis in each of the corpora is necessary to confirm this hypothesis. The script used for this task detects emojis with the emoji Python library and produces frequency counts per 1,000 words to account for the differences in corpus size across countries.
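The chapter’s script relies on the third-party emoji library; the sketch below approximates it with a standard-library regular expression over the main emoji code-point blocks. This is a simplification (it ignores modifier sequences and some symbol blocks) but suffices to illustrate the per-1,000-words normalization.

```python
import re

# Approximate emoji matcher covering the main emoji code-point blocks.
# (The chapter's actual script uses the third-party `emoji` library.)
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF"   # pictographs, emoticons, symbols
    "\U00002600-\U000027BF"    # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF]"   # regional indicators (flags)
)

def emoji_rate(text):
    """Emoji occurrences per 1,000 whitespace-separated tokens."""
    n_tokens = len(text.split())
    n_emojis = len(EMOJI_RE.findall(text))
    return 1000 * n_emojis / n_tokens
```

Applied to each subcorpus, a counter like this yields per-country rates in the style of Fig. 6.7.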
Figure 6.7 shows a visualization of the results, which clearly show that the use of emojis is much more frequent in the South African subcorpus, where 35.84 emojis are found per 1,000 words, versus 18.33 on average (including South Africa; 14.83 excluding it).
This huge difference is also apparent when we attempt to plot the presence of specific emojis over time to make comparisons across countries. Figure 6.8 visualizes the relative frequency of the loudly crying emoji ( ), which can be taken as a relatively unambiguous marker of sadness, unlike other emojis whose interpretation is more ambiguous and more dependent on cultural factors (Godard and Holtzman 2022). The overall frequency of this emoji in the South African corpus is so much higher than in the rest of the subcorpora that it dwarfs all other timelines.
Specifically, the frequency of the loudly crying emoji in the South African subcorpus was 12.28 times greater than the average of other countries. Comparative analysis revealed that the United States had the nearest frequency to South Africa, albeit still 5.80 times lower. The most substantial discrepancy was observed in the Indian subcorpus, where the emoji’s frequency was a striking 19.25 times less than that of South Africa.
In order to offer a more complete overview of the use of emojis, Table 6.7 shows ranked lists of emojis, including their frequency (per 1,000 words) in each of the subcorpora.
The data in this table show some interesting differences among countries. For example, the virus emoji ( ) is present in all of the subcorpora's top lists except, precisely, South Africa's, where it ranks very low (58th position). The same is true of the syringe emoji ( ), which ranks 48th. To allow all differences in emoji use across countries to be checked easily, Table 6.8 summarizes them.
We can now see clearly that South Africa (8 differences with the rest) and India (5 differences) are the two countries that deviate the most, both from the rest and from each other. The United States (3 differences), the United Kingdom (2 differences), and Canada (0 differences), on the other hand, share the most emojis with the rest, especially among themselves.
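The "differences with the rest" counts in Table 6.8 can be derived with simple set operations. The sketch below uses small hypothetical top-emoji lists (illustrative only, not the actual Table 6.7 rankings) and one plausible reading of "differences": emojis in a country's list that appear in no other country's list.

```python
# Hypothetical top-emoji sets per country (illustration only).
top_emojis = {
    "US": {"😂", "🙏", "😭", "🦠"},
    "UK": {"😂", "🙏", "🦠", "💉"},
    "SA": {"😭", "🙏", "😂", "💔"},
}

def differences_with_rest(country: str) -> int:
    """Count emojis in `country`'s list absent from all other lists."""
    others = set().union(*(v for k, v in top_emojis.items() if k != country))
    return len(top_emojis[country] - others)

for c in top_emojis:
    print(c, differences_with_rest(c))
```

With these toy sets, the UK's syringe emoji and South Africa's broken-heart emoji are unique to their respective lists, yielding one difference each, while every US emoji also appears elsewhere.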
The order in which the emojis rank in each country is also very telling. For example, the praying hands emoji is present in all lists, but it ranks at the very top in the case of India, a more spiritual country than the rest. Again, we find evidence that social media language, in terms of both linguistic and paralinguistic elements, is a good reflection of the cultural, economic, and social differences that exist between societies when it comes to expressing their emotions in written text.
Notes
- 1.
The corpus, along with all datasets resulting from the analysis, are available in the book’s repository at https://osf.io/h5q4j/.
- 2.
https://en.wikipedia.org/wiki/COVID-19_pandemic_in_India [Accessed 5 September 2023].
References
Aldunate, Nerea, Mario Villena-González, Felipe Rojas-Thomas, Vladimir López, and Conrado A. Bosman. 2018. Mood Detection in Ambiguous Messages: The Interaction Between Text and Emoticons. Frontiers in Psychology 9.
Baccianella, Stefano, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining. In Proceedings of the International Conference on Language Resources and Evaluation, 2200–2204. Valletta, Malta.
Bai, Qiyu, Qi Dan, Zhe Mu, and Maokun Yang. 2019. A Systematic Review of Emoji: Current Research and Future Perspectives. Frontiers in Psychology 10: 2221. https://doi.org/10.3389/fpsyg.2019.02221.
Barredo Arrieta, Alejandro, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador Garcia, et al. 2020. Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges Toward Responsible AI. Information Fusion 58: 82–115. https://doi.org/10.1016/j.inffus.2019.12.012.
Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. Language Models Are Few-Shot Learners. In Advances in Neural Information Processing Systems, ed. H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, 33:1877–1901. Curran Associates, Inc.
Camacho-Collados, Jose, Kiamehr Rezaee, Talayeh Riahi, Asahi Ushio, Daniel Loureiro, Dimosthenis Antypas, Joanne Boisson, Luis Espinosa Anke, Fangyu Liu, and Eugenio Martínez Cámara. 2022. TweetNLP: Cutting-Edge Natural Language Processing for Social Media. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 38–49. Abu Dhabi, UAE: Association for Computational Linguistics.
Cambria, Erik, Soujanya Poria, Alexander Gelbukh, and Mike Thelwall. 2017. Sentiment Analysis Is a Big Suitcase. IEEE Intelligent Systems. IEEE.
Cambria, Erik, Yang Li, Frank Z. Xing, Soujanya Poria, and Kenneth Kwok. 2020. SenticNet 6: Ensemble Application of Symbolic and Subsymbolic AI for Sentiment Analysis. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 105–114. CIKM ’20. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3340531.3412003.
Carretero, Marta, and Maite Taboada. 2014. Graduation within the scope of Attitude in English and Spanish Consumer Reviews of Books and Movies. In Evaluation in Context, 221–240. John Benjamins.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. Minneapolis, Minnesota: Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423.
Godard, Rebecca, and Susan Holtzman. 2022. The Multidimensional Lexicon of Emojis: A New Tool to Assess the Emotional Content of Emojis. Frontiers in Psychology 13: 921388. https://doi.org/10.3389/fpsyg.2022.921388.
Gantiva, Carlos, Miguel Sotaquirá, Andrés Araujo, and Paula Cuervo. 2020. Cortical Processing of Human and Emoji Faces: An ERP Analysis. Behaviour & Information Technology 39: 935–943. United Kingdom: Taylor & Francis. https://doi.org/10.1080/0144929X.2019.1632933.
Grosan, Crina, and Ajith Abraham. 2011. Intelligent Systems. Intelligent Systems Reference Library. Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-21004-4.
Hu, Minqing, and Bing Liu. 2004. Mining and Summarizing Customer Reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 168–177. Seattle, WA, USA: ACM. https://doi.org/10.1145/1014052.1014073.
Hutto, C., and E. Gilbert. 2014. VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text. In Proceedings of the International AAAI Conference on Web and Social Media, 216–225.
Kennedy, Alistair, and Diana Inkpen. 2006. Sentiment Classification of Movie Reviews Using Contextual Valence Shifters. Computational Intelligence 22: 110–125. https://doi.org/10.1111/j.1467-8640.2006.00277.x.
Kejriwal, Mayank, Qile Wang, Hongyu Li, and Lu Wang. 2021. An Empirical Study of Emoji Usage on Twitter in Linguistic and National Contexts. Online Social Networks and Media 24: 100149. https://doi.org/10.1016/j.osnem.2021.100149.
Liu, Bing. 2012. Sentiment Analysis and Opinion Mining. Synthesis Lectures on Human Language Technologies 5: 1–167. Morgan & Claypool Publishers.
Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv. https://doi.org/10.48550/arXiv.1907.11692.
Ljubešić, Nikola, and Darja Fišer. 2016. A Global Analysis of Emoji Usage. In Proceedings of the 10th Web as Corpus Workshop, 82–89. Berlin: Association for Computational Linguistics. https://doi.org/10.18653/v1/W16-2610.
Medhat, Walaa, Ahmed Hassan, and Hoda Korashy. 2014. Sentiment Analysis Algorithms and Applications: A Survey. Ain Shams Engineering Journal 5: 1093–1113. Elsevier.
Mohammad, Saif M., and Peter D. Turney. 2010. Emotions evoked by Common Words and Phrases: Using Mechanical Turk to Create an Emotion Lexicon. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, 26–34. Association for Computational Linguistics.
Moreno-Ortiz, Antonio. 2017. Lingmotif: Sentiment Analysis for the Digital Humanities. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, 73–76. Valencia, Spain: Association for Computational Linguistics.
Moreno-Ortiz, Antonio. 2023. Lingmotif (version 2.0). Málaga. Spain: Universidad de Málaga.
Moreno-Ortiz, Antonio, and Chantal Pérez-Hernández. 2018. Lingmotif-lex: A Wide-Coverage, State-of-the-art Lexicon for Sentiment Analysis. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2653–2659. Miyazaki, Japan: European Language Resources Association (ELRA).
Moreno-Ortiz, Antonio, Chantal Pérez-Hernández, and Rodrigo Hidalgo-García. 2011. Domain-Neutral, Linguistically-Motivated Sentiment Analysis: A Performance Evaluation. In Actas del 3° Congreso Internacional de Lingüística de Corpus. Tecnologías de la Información y las Comunicaciones: Presente y Futuro en el Análisis de Corpus, 847–856.
Pang, Bo, and Lillian Lee. 2008. Opinion Mining and Sentiment Analysis. In Foundations and Trends in Information Retrieval 2: 1–135. Now Publishers Inc.
Polanyi, Livia, and Annie Zaenen. 2006. Contextual Valence Shifters. In Computing Attitude and Affect in Text: Theory and Applications, 1–10. Dordrecht, The Netherlands: Springer.
Soo-Guan Khoo, Christopher, Armineh Nourbakhsh, and Jin-Cheon Na. 2012. Sentiment Analysis of Online News Text: A Case Study of Appraisal Theory. Online Information Review 36: 858–878. Emerald Group Publishing Limited. https://doi.org/10.1108/14684521211287936.
Stone, Philip J, and Earl B Hunt. 1963. A Computer Approach to Content Analysis: Studies Using the General Inquirer System. In Proceedings of the May 21–23, 1963, Spring Joint Computer Conference, 241–256. ACM.
Taboada, Maite, Julian Brooks, Milan Tofiloski, Kimberly Voll, and Manfred Stede. 2011. Lexicon-Based Methods for Sentiment Analysis. Computational Linguistics 37: 267–307.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, 6000–6010. NIPS’17. Long Beach, California, USA: Curran Associates Inc.
Wang, Hao, Dogan Can, Abe Kazemzadeh, François Bar, and Shrikanth Narayanan. 2012. A System for Real-time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle. In Proceedings of the ACL 2012 System Demonstrations, 115–120. Jeju Island, Korea: Association for Computational Linguistics.
Wilson, Theresa, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing Contextual Polarity in Phrase-level Sentiment Analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, 347–354. HLT ’05. Stroudsburg, PA, USA: Association for Computational Linguistics. https://doi.org/10.3115/1220575.1220619.
Wolf, Thomas, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, et al. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 38–45. Online: Association for Computational Linguistics.
Zhang, Lei, Shuai Wang, and Bing Liu. 2018. Deep Learning for Sentiment Analysis: A Survey. Wires Data Mining and Knowledge Discovery 8: e1253. https://doi.org/10.1002/widm.1253.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2024 The Author(s)
About this chapter
Cite this chapter
Moreno-Ortiz, A. (2024). Sentiment. In: Making Sense of Large Social Media Corpora. Palgrave Macmillan, Cham. https://doi.org/10.1007/978-3-031-52719-7_6
Publisher Name: Palgrave Macmillan, Cham
Print ISBN: 978-3-031-52718-0
Online ISBN: 978-3-031-52719-7