1 Introduction

Twitter data have been widely used by researchers across various social and computer science disciplines (King et al. 2017). One of the key challenges in working with Twitter data is obtaining a random sample of users from a country (Kim et al. 2018). The goal is usually to get a platform- or population-representative sample of users (Wang et al. 2019). The sample is then used for public opinion research, for experimental research, or for training machine learning algorithms. For example, random samples of Twitter users have been used for estimating public opinion (Barberá et al. 2019; Alizadeh et al. 2019; Alizadeh and Cioffi-Revilla 2014), studying the diffusion of misinformation (Shao et al. 2018), studying conspiracy theories (Batzdorfer et al. 2022), evaluating the performance of large language models for text annotation tasks (Alizadeh et al. 2023), measuring the influence of information operations (Barrie and Siegel 2021), and developing supervised models for detecting inauthentic activities (Alizadeh et al. 2020). However, there are at least two significant challenges in obtaining a random sample of users from a country: (1) while several methods have been proposed in the literature, it is not clear which one is best, and (2) the extent to which these random samples are actually representative of the population is questionable.

There are at least four popular methods to construct a random sample of Twitter users for a specific country. First, at the time of writing, Twitter provides 1% of all tweets worldwide in real-time through its free Streaming API. One can collect this stream for a specific period of time, even filtering for a language or country of interest, obtain a list of users who posted tweets, and then sample from them. Second, it is possible to use Twitter's Search API to query for a specific language and, after ingesting tweets for a period of time, filter for the country of interest; languages match different countries to different degrees. Third, one can directly query for a country of interest using the Search API. Fourth, the Twitter API allows queries based on the bounding-box coordinates enclosing a specific country, and researchers have used it to get a random sample of a country's Twitter users (e.g. Barrie and Siegel 2021).

The extent to which these four methods produce similar results, and which one produces a more representative sample of a population, is mostly unexplored. Although Twitter shut off free access to the API in February 2023, many researchers have archived a wealth of data and are still publishing novel research using Twitter data (e.g. Truong et al. 2024; Mosleh and Rand 2024). Moreover, in compliance with the EU's Digital Services Act, Twitter is now accepting researcher API access applications. In this paper, we compare these methods with respect to fourteen metrics at the level of tweets (e.g. distribution of the number of tweets per day), users (e.g. distribution of age and gender), and population. For the population-level metrics, our goal is to investigate which of the four Twitter sampling methods provides the best data for creating a nationally representative sample of users. To this end, we follow the approach proposed in Wang et al. (2019), create representative samples from each of the four Twitter sampling methods, and use them separately to estimate the population of the US from Twitter data.

In the following text, we review the theoretical background and related work to highlight the need for and importance of comparing the various existing Twitter sampling methods. Next, we discuss our methodology to collect Twitter users; exclude non-active, non-individual, and bot-like accounts; infer age, gender, and location; create nationally representative samples from them; and compare them according to various evaluation metrics, which we devised from a careful review of the literature.

The results show that the 1% Stream Twitter sampling method produces the best population-representative sample and exhibits different characteristics compared to the other three sampling methods. The results further underscore the Bounding Box sampling method as the best replacement for situations in which the 1% Stream method is not feasible or suitable. Our results illuminate the positive and negative characteristics of each sampling method and help researchers choose the one that best suits their research goals and designs. By identifying the best sampling methods, our results also pave the way for conducting more accurate social listening studies and building more accurate machine learning models.

2 Related work

Generally speaking, there are two types of methodological approaches to collecting social media data, including Twitter data: (1) the keyword-based approach, and (2) the sample-based approach. In the keyword-based approach, researchers create a list of hashtags or keywords and collect all matched tweets over a period of time. Although this approach is popular due to its ease of automation, it suffers from some shortcomings (see Kim et al. 2018 for a discussion). The most crucial drawback of keyword-based data collection is that researchers often pick keywords in ad hoc ways that are far from optimal and usually biased (see King et al. 2017; Munger et al. 2022 for potential solutions). Other researchers take the sample-based approach, which is the focus of this study, and try to sample desired tweets or users.

From the tweet sampling perspective, since Twitter has yet to be transparent about how its data sampling is performed, early research on getting data from Twitter focused on understanding its underlying mechanisms. Comparing the limited Streaming API with the unlimited but costly Firehose API, Morstatter et al. (2013) tried to answer whether data obtained through Twitter's sampled Streaming API is a sufficient representation of activity on Twitter as a whole. They found that for larger numbers of matched tweets, the Streaming API's coverage is reduced, but its ability to estimate the top hashtags is comparable to the Firehose. Interestingly, the results showed that the Streaming API returns almost the complete set of geotagged tweets despite sampling.

One of the critical questions about the Twitter Streaming API is whether it provides results similar to those of multiple simultaneous API requests from different connections. Joseph et al. (2014) compared samples of tweets collected using the Streaming API that tracked the same set of keywords at the same time. Their results showed that, on average, over 96% of the tweets in the various samples are the same. In practice, this means that effectively an infinite number of Streaming API samples would be required to collect most of the tweets containing a particular popular keyword (Joseph et al. 2014). In another work, Tufekci (2014) proposed a framework for addressing potential biases in tweet collection and how to mitigate them.

More recently, Kim et al. (2018) compared simple daily random sampling with constructed weekly sampling. Their results underscored simple daily random sampling as the more efficient way to obtain a representative sample of tweets for a specific period of time. In another important study, Pfeffer et al. (2018) showed that the Streaming API can be deliberately manipulated by adversarial actors due to the nature of Twitter's sampling mechanism. Their results further showed that technical artifacts of the Streaming API can skew tweet samples, and therefore those samples should not be regarded as random.

In a separate investigation, researchers reverse-engineered the sampling mechanism used by Twitter's Sample API. The sampling was based on the timestamp at which tweets arrived at Twitter's servers: any tweet arriving between 657 and 666 ms of each second was included in the 1% Sample API (Pfeffer et al. 2018).

In De Choudhury et al. (2010), researchers explored the impact of attribute and topology-based sampling strategies on the discovery of information diffusion in Twitter. The study analyzed several widely-adopted sampling methods that selected nodes based on attributes and topology, and developed metrics based on user activity, topology, and temporal characteristics to evaluate the sample’s quality. The results showed that incorporating both network topology and user-contextual attributes significantly improved the estimation of information diffusion by 15–20%.

Another study highlighted the lack of common standards for data collection and sampling in the emerging field of digital media and social interactions. The paper focused on Twitter and compared the networks of communication reconstructed using different sampling strategies. It concluded that a more careful account of data quality and bias, and the creation of standards that can facilitate the comparability of findings, would benefit this emerging area of research (González-Bailón et al. 2014).

In another work, Wang et al. (2015b) compared the representativeness of two Twitter data samples obtained from the Twitter Streaming API, Spritzer and Gardenhose, with a more complete Twitter dataset. The study found that both sample datasets capture the daily and hourly activity patterns of Twitter users and provide representative samples of public tweets, but tend to overestimate the proportion of low-frequency users.

A detailed analysis of the effects of Twitter data sampling on measurement and modeling studies across different timescales and subjects was presented in Wu et al. (2020). The study validated the accuracy of Twitter rate limit messages in approximating the volume of missing tweets and identified significant temporal and structural variations in the sampling rates across different scales and entities. The authors also suggested the use of a Bernoulli process with a uniform rate for counting statistics and provided effective methods for estimating ground-truth statistics.

Hino and Fahey (2019) addressed the challenges researchers face in accessing representative and high-quality data from social media platforms like Twitter. The authors proposed a methodology for creating a cost-effective and accessible archive of Twitter data through population sampling, resulting in a highly representative database. The study demonstrated the high degree of representativeness achieved by comparing the sample data with the ground truth of Twitter's full data feed, making it suitable for post-hoc analyses and enabling researchers to refine their keyword searches and collection strategies. Overall, this approach provided an alternative way for researchers with limited resources to access social media data.

In terms of comparing expert sampling and random sampling, Zafar et al. (2015) explored the advantages and disadvantages of the two methods. The study found that expert sampling offers a number of advantages over random sampling, including richer information content, trustworthiness, and timely capture of important news and events. However, random sampling preserves the statistical properties of the entire dataset and automatically adapts to the growth and changes of the network, while expert sampling does not. The authors suggested that both random and expert sampling techniques will be needed in the future, and called for equal focus on expert sampling of social network data.

From the Twitter user sampling perspective, the most important issue that researchers have tried to tackle is the problem of sampling bias. Indeed, much of the extant literature on sampling users from Twitter is concerned with pointing out sampling biases in election prediction studies and the necessity of controlling for them (see Jungherr et al. 2012 for a discussion and Gayo-Avello 2013 for an early review). More recently, Yang et al. (2022) focused on the issue of inauthentic accounts that could skew the behavior of voters. They proposed a method to identify potential voters on Twitter and compared their behavior with various samples of American Twitter users. The results showed that users sampled from the Streaming API are more active and more conservative compared to the potential voters and randomly selected users. They further showed that the users in the Streaming API sample tend to exhibit more inauthentic behaviors, engage in more bot-like activities, and share more links to low-credibility sources.

Although the problem of sampling bias on Twitter has been recognized in many works, none of the papers discussed above addressed the issue, due to the lack of a valid methodology for doing so. In a seminal work, Wang et al. (2019) proposed a method that combines demographic inference with post-stratification to make social media samples more representative of a population. First, they created a multimodal deep neural network classifier for the joint identification of age, gender, and non-individual accounts. Second, they proposed a multilevel logistic regression approach to correct for sampling biases. The proposed debiasing approach estimates the inclusion probabilities of users from various demographic groups from inferred joint populations and ground-truth population histograms. They further showed that their fully debiased sample outperforms a baseline and marginally debiased samples in the task of predicting European regions' populations from Twitter data.

Creating a random sample of Twitter users from a country or geographical region has come up in some of the research discussed above (e.g. Pfeffer et al. 2018; Yang et al. 2022). However, no study has addressed the fact that there are several methods for creating such a sample of Twitter users. For example, one can use the Streaming API to collect tweets published in a certain country or language and then randomly sample from the list of the tweets' authors, or simply use the Search API and query for a country, a language, or both. We identified four such Twitter sampling methods in the literature (plus a fifth one that is no longer feasible). The extent to which these random (or near-random) sampling methods produce similar results or are more representative of a population is unexplored. In this paper, we attempt to provide answers to the following research questions:

  • RQ1: Do different methods for creating a random sample of Twitter users from a country produce similar results in terms of the tweet- and user-level metrics?

  • RQ2: Do different methods for creating a random sample of Twitter users from a country produce a representative sample of the population? If not, which method provides a more representative sample of the population?

To answer these research questions, we collect US Twitter data using four widely used Twitter user sampling methods for one month and compare the results according to multiple tweet-level (e.g. distribution of tweets), user-level (e.g. distribution of age and gender), and population-level evaluation metrics. For our population-level metrics, following Wang et al. (2019), we use five different prediction errors for estimating the population of the US from Twitter data. The goal of the population-level metrics is to explore which sampling and debiasing methods provide the minimum prediction error and are thus more representative of the population.

3 Methodology

3.1 Sampling methods

We use four different Twitter sampling methods (Table 1) that are widely used in the literature to create a random sample of Twitter users in a country. They include (1) 1% Stream, in which we use the Twitter Streaming API to get the 1% stream of tweets and filter for the languages or country of interest; (2) Country Query, in which we query the Twitter Search API for the country and get all tweets; (3) Language Query, in which we query for languages that are related to the country of interest and use the Twitter Search API in the same way as Country Query; and (4) Bounding Box, in which we divide a country into multiple small bounding-box coordinates and get all tweets within them via the Twitter Search API. There is also a fifth sampling method, in which one could generate random Twitter user IDs, check whether each exists on Twitter, and, if so, filter for the country or language of interest (Barberá et al. 2019). However, this method is no longer feasible due to technical changes on Twitter.

Table 1 Four different methods of getting a random sample of users for a country

3.2 Data

We used Twitter’s V2 API to collect tweets over a month (i.e. from 2022-09-07 to 2022-10-08). As a first method, we used the sampled Streaming API. This is the simplest method, returning 1% of all tweets during the listening period, but it also generates the most noise. For the second method, we used the Twitter Search API and collected tweets posted in the United States by setting the filter argument to place_country:US. For the third method, we used the same endpoint as the second but filtered tweets for the English language and the US by setting the filter argument to lang:en and place_country:US. For the fourth method, we used the bounding boxes available in the Twitter Search API, which take the coordinates of a specific area. The bounding boxes are limited in size (25 miles in height and width) and form (rectangular). To mitigate these limitations, we implemented a grid of small boxes plotted over the US, with points defined by a longitudinal and latitudinal distance of 0.3\(^{\circ }\). We also included points along the country's borders by shifting the points diagonally up and down by one unit. This resulted in a comprehensive approximation of the United States, represented by 9,541 bounding boxes. In summary, our research scope focuses on Twitter users in the United States, and we applied certain query parameters in all four methods to narrow down our user selection within this context. It is important to note that we did not use any specific keywords or hashtags that could introduce bias towards any particular topic.

3.3 User pre-processing

Following the data collection, the next step involves the pre-processing of accounts. Typically, when we generate a random sample of Twitter users from a specific country, our goal is to obtain a sample consisting of authentic individuals rather than organizations or malicious accounts such as bots, as the inclusion of such accounts could introduce bias into our sample (Yang et al. 2022). To achieve this, we selected a random sample of 30K users from the dataset associated with each of the four Twitter sampling methods. This equalizes the dataset sizes and facilitates a more accurate comparison of changes in data volume after applying each pre-processing filter. From the initial sample of 30K randomly selected users, we applied several filters to exclude specific categories: bots (identified using Botometer (Yang et al. 2022)), verified accounts, protected accounts, low-activity accounts (those with fewer than 100 tweets), recently created accounts (less than 9 months old), and suspended accounts (as detailed in Table 2). Additionally, we eliminated accounts whose bios contain keywords such as "journalist," "magazine," "member," "organization," "mayor," and "actress," as outlined in Table 2. This step ensures that our sample comprises regular individuals rather than celebrities or organizational accounts. Lastly, we excluded users whose tweet language was not English and those whose tweet coordinates did not correspond to locations in the United States.

Table 2 Pre-processing steps for Twitter users

3.4 Inferring users’ demographics

We utilized the M3 model (Wang et al. 2019), which is a multimodal deep learning model, to predict the gender and age of Twitter users. This model functions across 32 languages and relies solely on users’ profile details, including their screen_name, user_name, bio, and profile picture. The M3 model offers two modes, and we employed the ’full’ model, which is more accurate and incorporates the profile image. In addition to gender and age, the model also discerns whether a Twitter account belongs to an organization. Consequently, we excluded organizational accounts from our dataset. Regarding location, we followed established research practices by utilizing self-reported user location or information from their close connections to estimate the users’ locations at the state level (Barberá et al. 2019).

3.5 Creating representative population estimates

In this step, we have inferred all the necessary features, enabling us to create our sample. We selected all the valid users that passed the previous filters, resulting in a final sample of 10K users for each method. From this point forward, we use these 10K-user datasets for our analysis. Previous research has demonstrated that when demographic information is available and proper statistical adjustments like re-weighting and post-stratification are applied, non-representative polls can still yield accurate population estimates (Wang et al. 2015a). The primary demographic characteristics that survey analysts focus on to address non-representativeness are age, gender, and location (Wang et al. 2019). We follow the approach introduced in Wang et al. (2019) to learn inclusion probabilities based on users’ demographics. This method utilizes multilevel regression techniques to estimate the likelihood that an individual with specific demographics is present on a particular platform, considering inferred joint population counts and ground-truth population data. To implement this approach, we require gender, age, and geographic location information for each Twitter user, along with ground-truth data about the population. In the case of the United States, we rely on census data as our ground truth.

3.6 Evaluation metrics

We compare the results of each method for creating a nationally random sample of Twitter users according to three categories of metrics: (1) tweet-level (Table 3), (2) account-level (Table 3), and (3) population-level metrics (Table 4). Tweet-level metrics include the total number of tweets generated by each sampling method, the average number of tweets collected from each account, and the share of English tweets. User-level metrics encompass the following:

  • Total number of unique users;

  • Distribution of the number of tweets per account;

  • Distribution of the average number of tweets per day, calculated by dividing an account's tweet count by its age in days (since the raw tweet count correlates with account age);

  • Distribution of the number of likes received by each user's last tweet;

  • Distribution of account creation dates;

  • Distributions of the numbers of followers and friends per user;

  • Distribution of age and gender among users.

Table 3 List of evaluation criteria for comparing various methods of creating a national sample of Twitter users from the tweet and account aspects
Table 4 List of evaluation criteria for comparing various methods of creating a national sample of Twitter users from the population aspect

For the population-level metric, our objective is to determine which of the four Twitter sampling methods is most effective for generating a nationally representative sample of users. To achieve this, we adopt the test outlined in Wang et al. (2019) and employ the representative samples detailed in Sect. Creating representative population estimates to estimate the overall population of the United States based on Twitter data. In essence, we conduct a regression analysis that relates the actual population sizes of various areas within the United States (such as states, divisions, or regions) to the number of American Twitter users from different age and gender groups in those locations. This analysis helps us assess the representativeness of the Twitter data for estimating the US population.

In a more detailed breakdown, we compare five distinct models that rely on different data sources and operate under different assumptions. The first model (N \(\sim \) M) serves as the baseline and utilizes solely the total population count obtained from the census data along with Twitter user data, without applying any debiasing coefficients. The subsequent three models are grounded in the assumption of homogeneous inclusion probabilities: the second model (N \(\sim \sum _g\) M(g)) uses only gender-specific marginal counts; the third model (N \(\sim \sum _a\) M(a)) uses only age-specific marginal counts; and the fourth model (N \(\sim \sum _{a,g}\) M(a, g)) employs the joint distribution inferred from Twitter data alongside the total population counts from the census data. Finally, the fifth model (log N(a, g) \(\sim \) log M(a, g) + a + g) leverages the joint distribution inferred from both Twitter data and census data. For each of these five prediction tasks, we assess performance using the mean absolute percentage error (MAPE), calculated as specified in Eq. 1. In this equation, \({\hat{N}}_i\) represents the predicted population size, \(N_i\) is the actual population size, and the summation is performed over all \(n\) geographical units of interest, such as states, regions, or divisions in the United States.

$$\begin{aligned} \mathrm {MAPE}(N) = \frac{100\%}{n} \sum _{i=1}^{n} \frac{|{\hat{N}}_i - N_i|}{|N_i|} \end{aligned}$$
(1)

4 Results

The following subsections provide a detailed presentation of our results by comparing the four Twitter sampling methods. First, we present essential statistics regarding the outcomes of each sampling method. Second, in accordance with the methodology outlined in Sect. User pre-processing, we randomly sample 30K accounts from the output of each sampling method. Subsequently, we apply pre-processing filters and select a random sample of 10K users from the remaining pool. We then report the metrics at both tweet- and user-levels for this subset. Lastly, we generate debiased samples from each 10K random sample by computing inclusion probabilities. We compare their mean absolute percentage errors (MAPE) for the task of estimating the United States population using Twitter data.

4.1 Tweet-level and user-level metrics

In Table 5, we provide a comparison of various tweet-level metrics across the four Twitter sampling methods employed to create random user samples from a country. These metrics include the number of tweets, the count of unique users, the average number of tweets per account, and the percentage of tweets in English. The results show that the bounding box (BB) and location query (Loc) sampling methods produce a significantly higher number of tweets compared to the language query (Lang) and 1% Stream methods. Among the four Twitter sampling methods, the BB and Loc methods produce more than 18 million tweets, whereas the Lang and 1% Stream methods generate 4.5 million and 174,000 tweets, respectively, within the same timeframe. The same pattern is observed when examining the number of unique users and the average tweets per account metrics, except that the difference between the BB and Loc methods and the Lang method is notably smaller than their difference in the number of tweets. This suggests that the BB and Loc methods have a higher rate of account duplication compared to the other two sampling methods. Lastly, among the three methods that do not explicitly filter for language, the BB method has a higher proportion of English tweets.

Table 5 Statistics of the number of tweets and users collected by each sampling method

Due to the computational expense of bot detection and of age, gender, and location inference, our aim is to compare a random sample of 10K users from each Twitter sampling method. To ensure that at least 10K users from each method remain after the pre-processing steps, we initially select a random sample of 30K users from each Twitter sampling method. We then execute all pre-processing steps on this larger sample and subsequently select a random sample of 10K users from the remaining pool. Table 6 provides information about the number of accounts removed by each filter applied to the 30K sample.

Table 6 Number of accounts removed by each filtering method in the US 30K samples

The comparison results of tweet- and user-level metrics are reported in Fig. 1 and Tables 7 and 8, and the corresponding t-test results are illustrated in Fig. 2. More specifically, the distributions of the total number of tweets are presented in Fig. 1a. Notably, the 1% Stream method generates more tweets (M = 19,873.9, p = 0.00) compared to the other methods (see Table 7 and Fig. 2). Additionally, since the number of generated tweets depends on account age, we also depict the distributions of the number of tweets per day in Fig. 1b. This metric is calculated by dividing a user's total number of tweets by the number of days since account creation. Once again, we observe that users from the 1% Stream method tend to tweet more frequently (M = 5.81, p = 0.00) than those from the other methods (see Table 7). We have also compared the number of likes across the four Twitter sampling methods and depict the corresponding distributions in Fig. 1c. It is worth noting that the BB, Loc, and Lang methods exhibit nearly identical distributions. However, users sampled from the 1% Stream method tend to have significantly fewer likes (M = 0.00, p = 0.00). This discrepancy is primarily attributable to the fact that the 1% Stream method collects data in real-time, often when engagement with posts has just commenced, in contrast to the other methods, which may include tweets posted up to seven days earlier, allowing more time for engagement to accumulate. Additionally, we present the distributions of account creation times in Fig. 1d. Across all methods, we notice a peak around the year 2009, likely reflecting the rapid growth of Twitter during that period (Yang et al. 2022). There is also an increase in the number of created accounts in 2011, coinciding with another period of significant user growth on Twitter. In comparison to the other methods, the 1% Stream method appears to generate more young accounts (p = 0.00), especially those created in 2022. However, for the remaining time period, the distributions appear similar across all four methods.

Figures 1e and f display the distributions of the numbers of followers and friends, respectively. These figures illustrate that users in the 1% Stream method tend to have slightly more followers (M = 911.3, p = 0.00) and friends (M = 1059.6, p = 0.00) compared to users in the other sampling methods (see Table 8 and Fig. 2). Interestingly, Fig. 1f shows that nearly half of the users in the 1% Stream sample have over 1000 friends. Moreover, the 1% Stream method exhibits a higher number of accounts with approximately 5000 friends or more. The peak around 5000 friends in Fig. 1f is attributable to a Twitter anti-abuse limitation, which stipulates that an account cannot follow more than 5000 friends unless it has more than 5000 followers. This policy leads to the observed distribution pattern.

Fig. 1 Distributions of a number of tweets; b average number of tweets per day; c number of likes; d account creation date; e number of followers; and f number of friends for different groups. Distribution of users with respect to g gender and h age across the four Twitter sampling methods

Fig. 2 Heatmap of p values for pairwise t-tests of a number of tweets; b average number of tweets per day; c number of likes; d account creation date; e number of followers; and f number of friends across the four Twitter sampling methods

As outlined in Sect. Inferring users' demographics, age, gender, and location are critical demographics for constructing a nationally representative sample of individuals. Consequently, it is essential to compare the user samples generated by each sampling method in terms of these three metrics. Figure 1 displays the distributions of age and gender across the four Twitter sampling methods. When examining the gender distributions (Fig. 1g), two notable observations emerge: (1) all four methods yield unbalanced gender data, with a majority of users (over 64%) being male; and (2) the 1% Stream method produces slightly fewer women than the other three methods, which generate a nearly equal fraction of women.

Regarding age (Fig. 1h), the percentage of users who are under 18 or over 40 is nearly identical across all methods. However, there is a significant difference in the 19–29 and 30–39 age cohorts for the 1% Stream method. While the other three methods place approximately 27% of users in each of the 19–29 and 30–39 age groups, the 1% Stream method has only 18% of users in the 19–29 cohort and 35% in the 30–39 cohort, a notable deviation.

Table 7 Average and standard deviation of the number of tweets and likes across the four Twitter sampling methods
Table 8 Average and standard deviation of the number of followers and friends across the four Twitter sampling methods

Finally, in Fig. 3, we present the distribution of the number of users located in each state within the United States. As discussed in Sect. Inferring users’ demographics, we adopted the methodology proposed in Barberá et al. (2019) to estimate the location of Twitter users at the US state level. In general, the distribution pattern is highly similar across all four Twitter sampling methods, with no significant variations between them. As expected, states with larger populations such as California, New York, Texas, and Florida have a higher number of Twitter users. Additionally, within our 10K-user sample, all four sampling methods managed to provide at least one user from each state, ensuring a diverse geographical representation.

Fig. 3 Map of the number of users located in US states. All four sampling methods produced at least one user in all 50 US states

4.2 Population-level metrics

The objective of this section is to assess the accuracy of the four debiased samples created from the four Twitter sampling methods in predicting the population of the United States using Twitter data. This prediction task is conducted at the state level, which necessitates having a sufficient number of users for all age and gender groups in all 50 US states. However, despite all four sampling methods providing at least one user in each state (as shown in Fig. 3), none of them yielded enough data to cover all combinations of age and gender across all US states. Table 9 provides information on the number of US states where there are insufficient users to represent all age and gender groups, across different sample sizes (i.e., 5K, 8K, or 10K users) and Twitter sampling methods. As indicated in Table 9, even with a sample size of 10K users, there are still between 7 and 11 states lacking at least one demographic group (e.g., women aged 30–39 in the State of New York).

Table 9 Number of US states in which the Twitter sampling methods did not generate enough users for all demographic groups

In our regression model for estimating the US population from Twitter data, each row corresponds to a specific demographic group within a particular state (e.g., men above 40 in New Jersey). If any demographic group is absent from the Twitter data for a specific state, all values in the corresponding row of the regression model would be zero. However, it is important to note that no state has all demographic groups missing. Consequently, we decided to remove rows with all-zero values from our regression model. This led to the removal of 16 rows (3.92%) from the Bounding Box sample, 12 rows (2.94%) from the Country Query sample, 16 rows (3.92%) from the Language Query sample, and 21 rows (5.15%) from the 1% Stream sample. All of these samples consist of 10K users each.

Following the approach outlined in Wang et al. (2019), we assess the accuracy of the four representative samples using a leave-one-state-out cross-validation framework. In this evaluation, we calculate the mean absolute percentage error (MAPE) of the population estimate for the state that is left out of the analysis, computed using the formula specified in Eq. 1. As listed in Table 4, we calculate MAPE under five different scenarios: a baseline model that relies solely on the total population without any debiasing coefficients; three models based on the homogeneity assumption, which use gender-specific marginal counts, age-specific marginal counts, or the inferred joint distribution; and a full model that accounts for heterogeneity by including both age and gender as factors in the analysis.

In Fig. 4, the results of the leave-one-state-out evaluation show a clear benefit to using the 1% Stream Twitter sampling method. Across all five debiasing models, the 1% Stream method achieves the minimum prediction error and the Language Query method the maximum. For the N \(\sim \) M model, a baseline that does not use any debiasing, the 1% Stream method achieves a MAPE of 27%, the lowest compared to the BB, Loc, and Lang methods, which achieved MAPEs of 33%, 39%, and 45%, respectively. The inclusion of the inferred age in the debiasing model N \(\sim \sum _a\) M(a) decreases the MAPE of the 1% Stream method to 21%, again the minimum among the BB (26%), Loc (25%), and Lang (31%) methods. The inclusion of the inferred gender in the debiasing model N \(\sim \sum _g\) M(g) also decreases the MAPE of the 1% Stream method, though not as much as the inclusion of age. Nonetheless, the 1% Stream sample shows the minimum prediction error with a MAPE of 25% (the BB, Loc, and Lang methods obtained MAPEs of 27%, 30%, and 41%, respectively). The same pattern holds for the N \(\sim \sum _{a,g}\) M(a, g) and log N(a, g) \(\sim \) log M(a, g) + a + g models. Moreover, the results show that even the baseline model of the 1% Stream sample outperforms or equals the debiased forms of the other three sampling methods in nearly all modelling scenarios (the sole exception being the Location Query method debiased with marginal age counts).

Fig. 4 Performance on leave-one-state-out population inference across different debiasing models, where rows with all-zero values were removed from the regression. Bars show MAPE(N) with robust standard errors clustered on states

4.3 Robustness tests

To ensure that the observed results are not an artifact of our regression settings, in which we removed the rows with all-zero values, or of our pre-processing decisions, we replicate the results of Fig. 4 under the following scenarios: (1) adding the District of Columbia to the 50 states and removing rows with zero values in the regression model (see Fig. S1 in Appendix); (2) removing all states whose Twitter data are missing at least one demographic group from the Twitter and census data and measuring MAPE for the remaining states (see Fig. S2 in Appendix); (3) aggregating data at the level of the nine US divisions (the U.S. Census Bureau groups the 50 states and the District of Columbia into four geographic regions and nine divisions based on geographic proximity), which results in no missing demographic group in any of the nine divisions (see Fig. S3 in Appendix); (4) removing users with fewer than 200 tweets in the pre-processing step instead of the original 100-tweet threshold (note that, due to the recent restrictions imposed on the Twitter API, we are not able to test a lower threshold because it would introduce new users for whom we cannot obtain bot scores, age, gender, and location) (see Figs. S4 and S5 in Appendix); and (5) removing accounts younger than one year instead of the nine-month threshold (see Figs. S6 and S7 in Appendix).

Across all five robustness tests, the overall qualitative patterns of the comparison between the four Twitter sampling methods remain the same, with the 1% Stream sampling method obtaining the minimum prediction error on the US population inference task. Interestingly, for the leave-one-division-out cross-validation setting, the error rates drop by nearly half across the four sampling methods and five population inference tasks (Fig. S3). This error reduction suggests that knowledge of division-specific platform biases is more important than knowledge of state-specific ones for accurate estimates of the country's population. In other words, to achieve good performance, a model needs to be exposed to at least some of the US divisions during its training period.

5 Discussion

Twitter has become the most studied social media platform on which people express their everyday opinions, especially about politics. This has encouraged a wealth of social and computer science scholars to use Twitter data for measuring national-level statistics for political outcomes, health metrics, or public opinion research. However, Twitter users are not a representative sample of the population, due to demographic imbalances in usage and penetration rates. To address this, researchers often try to create a random sample of Twitter users from a country. However, at least four widely used sampling methods exist in the literature, and the extent to which their outputs are similar or different has not been explored systematically so far.

In this paper, we tackled this issue by comparing the performance of four different Twitter sampling methods on a set of carefully devised evaluation metrics. More specifically, the four methods are (1) 1% Stream, in which one uses the Twitter Streaming API to get 1% of tweets in real-time and then samples from the authors of the tweets; (2) Location Query, in which one uses the Twitter Search API and queries for a country of interest; (3) Language Query, in which one uses the Twitter Search API, queries for language(s) representing the country of interest, and filters for the country; and (4) Bounding Box, in which one uses the ‘bounding-box’ field in the Search API and queries for the coordinates enclosing the country of interest. After carefully reviewing the literature, we devised three tweet-level, eight user-level, and five population-level evaluation metrics to compare these four Twitter sampling methods.

Our results highlight the 1% Stream Twitter sampling method, which exhibits different characteristics compared to the other three sampling methods and is the top candidate in most use cases. More specifically, Twitter users collected by the 1% Stream method tend to have more tweets, more tweets per day, more followers and friends, and fewer likes. In addition, the 1% Stream sampling method provides slightly younger accounts (i.e. accounts created around 2022), slightly more male users, significantly fewer users in the 19–29 age stratum, and significantly more users in the 30–39 age stratum, compared to the other three sampling methods.

The 1% Stream method achieves the minimum error, compared to the other three methods, in the task of predicting the population of the US from Twitter users. This is true across five different debiasing models, each attempting to make the sample representative of the US population. A baseline model using the 1% Stream, in which we do not implement any debiasing technique, outperforms or matches all debiased forms of the other sampling methods in terms of prediction error (except for the Location Query method when using marginal age counts).

However, the 1% Stream Twitter sampling method has some practical and theoretical disadvantages. Practically, it is time-consuming because the Twitter Streaming API provides tweets in real-time. This means that, for example, if one needs to collect a month of stream data, one cannot get it immediately and has to wait for a whole month. This is not true for the other sampling methods, in which one can get the same period of data in a few hours or days, depending on the size of the country. In addition, researchers cannot access historical stream data using the 1% Stream method unless they began collecting it beforehand.

Theoretically, although the location and language filters used in all Twitter sampling methods are consistent, the 1% Stream method may produce a sample biased towards more active users. This is because the more frequently a user tweets, the higher the likelihood they will be included in the 1% sample. Conversely, the other three methods offer equal chances for all users to be sampled. Additionally, the 1% Stream method is not ideal for studies focused on user engagement metrics (such as the number of likes, retweets, comments, and views). This is because it gathers tweets in real-time, often capturing very recent tweets that have not yet been widely viewed or engaged with, as demonstrated in Fig. 1.

These drawbacks of the 1% Stream Twitter sampling method underscore the importance of the second-best sampling method in our study, the Bounding Box method. While its results are nearly identical to those of the Location Query and Language Query methods in terms of tweet- and user-level metrics (see Fig. 1), it clearly outperforms them with respect to the prediction error in the US population estimation task.

While Twitter has been one of the most studied social media platforms, the advantages and disadvantages of different sampling strategies remain unclear. Our results illuminate the positive and negative characteristics of the four main sampling methods used in the literature and help researchers choose the one that best suits their research goals and designs. By identifying the best sampling methods, our results also pave the way for conducting more accurate social listening studies and building more accurate machine learning models. Moreover, our approach and results may be adapted to conduct similar studies for other social media platforms.

The 1% Stream and Bounding Box Twitter sampling methods perform better than the other two sampling methods when used to estimate the US population. Future research should explore the role of large language models in improving or automating the models used to compute the inclusion probabilities (Cerina and Duch 2023), determine the significant parameters in the calculation of inclusion probabilities (Alizadeh et al. 2011), and test the temporal and regional validity of the results. Finally, due to the shutdown of the Twitter API, future work should compare the results of this study to those of data donation (Boeschoten et al. 2022) and crawling methods.