Abstract
One of the issues that can potentially affect the internal validity of interactive online experiments that recruit participants using crowdsourcing platforms is collusion: participants could act upon information shared through channels that are external to the experimental design. Using two experiments, I measure how prevalent collusion is among MTurk workers and whether collusion depends on experimental design choices. Despite having incentives to collude, I find no evidence that MTurk workers collude in the treatments that resembled the design of most other interactive online experiments. This suggests collusion is not a concern for data quality in typical interactive online experiments that recruit participants using crowdsourcing platforms. However, I find that approximately 3% of MTurk workers collude when the payoff of collusion is unusually high. Therefore, collusion should not be overlooked as a possible danger to data validity in interactive experiments that recruit participants using crowdsourcing platforms when participants have strong incentives to engage in such behavior.
Introduction
Interactive online experiments that recruit participants using crowdsourcing platforms such as Amazon Mechanical Turk (MTurk) are growing in popularity (Amir, Rand, & Gal, 2012). Despite their benefits (Mason & Suri, 2012), interactive online experiments that recruit participants using crowdsourcing platforms impose a higher loss of experimental control compared to interactive laboratory experiments. One of the issues that might arise due to this loss of experimental control is collusion, i.e., participants could act upon information shared through channels that are external to the experimental design. If such collusion between participants is common, it could distort the interpretation of experimental results because most experimental designs assume no external communication between participants. For example, two MTurk workers (MTurkers) who are part of the same social network and perform similar types of tasks (Gray, Suri, Ali, & Kulkarni, 2016; Yin, Gray, Suri, & Vaughan, 2016) could communicate through a private messaging app while playing a coordination or public goods game (Arechar, Gachter, & Molleman, 2018; Guarin & Babin, 2021). Were such communication to exist, it should increase the level of cooperation (Cooper, DeJong, Forsythe, & Ross, 1992), which, in turn, might lead researchers to erroneously conclude that MTurkers have special characteristics that cause them to cooperate more than other groups of participants (Almaatouq, Krafft, Dunham, Rand, & Pentland, 2020). Additionally, collusion would introduce noise in experiments that involve information markets (Teschner & Gimpel, 2018) because it would allow participants to directly share (or lie about) their private information.
Although researchers have acknowledged collusion as a potential concern for data quality (Arechar, Gachter, & Molleman, 2018; Hawkins, 2015; Horton, Rand, & Zeckhauser, 2011), to my knowledge, no empirical evidence exists that quantifies how prevalent collusion is in interactive online experiments that recruit participants using crowdsourcing platforms. This study uses two experiments in which participants are incentivized to collude to provide such evidence about online participants recruited from MTurk. Furthermore, I investigate if collusion is affected by several design choices: the recruitment method, the instructions given to participants about communicating with others, the payoff of collusion, and the amount of time participants expect to wait between periods.
When will MTurkers collude?
MTurkers will collude only if they have both the ability and the motivation to do so. On the one hand, evidence suggests that MTurkers have the ability to collude in interactive online experiments. First, MTurkers communicate on dedicated forums (Irani & Silberman, 2013). Although popular forums have policies against sharing specific information about MTurk tasks (human intelligence tasks, HITs), there could be lesser-known forums dedicated to sharing information about how to maximize payoffs in certain HITs (e.g., positioning and answers to attention checks, completion codes, and strategies for interactive online experiments). Second, some MTurkers are part of the same social network (Gray, Suri, Ali, & Kulkarni, 2016; Yin, Gray, Suri, & Vaughan, 2016). Moreover, some MTurkers likely live in the same household (Chandler, Mueller, & Paolacci, 2014). These MTurkers could directly communicate with each other either offline or using an instant messaging app. Indeed, Yin, Gray, Suri, and Vaughan (2016) find that 13.8% of MTurkers who are connected communicate with each other using such one-on-one channels.
On the other hand, practical considerations suggest that MTurkers may lack the ability to collude. First, given that most HITs are not interactive online experiments (Hitlin, 2016), MTurkers might not have considered how to optimize their payoff in this relatively rare type of HIT. Therefore, MTurkers may not realize that collusion is profitable. Second, collusion partners may not have time to coordinate and join the same HITs because available spots in HITs are quickly filled by other MTurkers.
Ability is not sufficient for collusion to occur; MTurkers must also be motivated to collude. Incentives are likely a critical factor in the motivation to collude. For collusion to occur, the expected payoff of collusion must exceed its expected cost. Therefore, the likelihood of collusion will depend on the experimental design, specifically, on how the design affects the probability of successful collusion and the payoff of collusion. For example, collusion should be more likely to occur in experiments in which the expected payoff of collusion is high such as experiments with high stakes (e.g., Raihani, Mace, & Lamba, 2013). Similarly, collusion should be more likely to occur when the experimental design involves large groups of participants (e.g., Suri & Watts, 2011) because potential colluders have a higher probability of being in the same group as their collusion partner.
Many MTurkers will likely understand that experimenters do not intend for them to share information with others during interactive online experiments and, as a result, will consider collusion to be cheating. Therefore, beyond the rational cost-benefit considerations of attempting collusion (Becker, 1968), behavioral factors will likely reduce MTurkers’ motivation to collude. Specifically, MTurkers’ motivation to attempt collusion might be decreased because they prefer to maintain a positive self-image as non-cheaters (Stets & Carter, 2012). Additionally, participants’ collusion motivation might be lowered by social controls such as internalized norms against cheating (Opp, 2013) and the fear that a potential collusion partner will judge them as a cheater (Nettle, Harper, Kidson, Stone, Penton-Voak, & Bateson, 2013). However, it is unclear whether these social controls will have a strong impact on MTurkers’ behavior because many MTurkers work from home and thus operate in a socially empty environment without direct observation or judgment from others (Kroher & Wolbring, 2015; Lindenberg, 2018).
The above discussion suggests that it is ex-ante unclear whether MTurkers will collude in interactive online experiments. It also suggests that collusion likely depends on experimental design choices. I therefore investigate whether four design choices affect collusion. First, the recruitment method might affect collusion. Researchers have suggested two strategies for recruiting participants from crowdsourcing platforms to participate in online interactive experiments (Arechar, Gachter, & Molleman, 2018; Mason & Suri, 2012). When using the standing panel method (Mason & Suri, 2012; Palan & Schitter, 2018), researchers first create a standing panel of participants by posting a short, lower-paying HIT in which MTurkers are only asked whether they would like to take part in an interactive experiment in the future. Afterward, researchers post the higher-payoff HIT for the experimental session and only allow participants who joined the standing panel to participate. When using the instant method (Arechar, Gachter, & Molleman, 2018), researchers recruit participants from all MTurkers who are available when the higher-payoff HIT for the experimental session is posted. I expect that recruiting participants through the standing panel method will result in more collusion compared to the instant method because potential colluders have a longer time window to coordinate and self-select into standing pools. Available spots are likely filled more slowly for the lower-payoff HITs through which participants sign up to the standing panel than for the higher-payoff HITs for the experimental session (Buhrmester, Talaifar, & Gosling, 2018).
Second, the instructions given to participants about communicating with each other might also impact collusion. Informing participants in the instructions that communication with others is not allowed might reduce collusion, as it clarifies that such behavior is considered cheating, thus removing any moral ambiguity (Irlenbusch & Villeval, 2015). However, results from previous studies of such interventions have been mixed. Consistent with the prediction that asking people not to communicate will decrease collusion, Goodman, Cryder, and Cheema (2013) find that asking MTurkers not to look up factual information decreases the percentage of correct answers given to a factual question, even when participants are incentivized to answer accurately. Inconsistent with this prediction, Bryan, Adams, and Monin (2013) find that asking participants not to cheat has no effect on cheating. Moreover, the intervention might even backfire and increase collusion because it could inform MTurkers who would not otherwise have considered collusion that communication with others is possible (Fosgaard, Hansen, & Piovesan, 2013). Overall, it is ex-ante unclear whether explicitly mentioning that communication with other participants is prohibited will affect collusion. I also examine if informing participants that communication is allowed affects information sharing. I expect that announcing that communication is allowed will increase information sharing because all participants will know that communication with other participants is possible and is not considered cheating (Abeler, Nosenzo, & Raymond, 2019).
Third, collusion likely depends on the payoff of successful collusion such that experiments with higher stakes (e.g., Raihani, Mace, & Lamba, 2013) will be more vulnerable to collusion. Even if MTurkers view collusion as cheating, the higher payoff of collusion will likely motivate some of them to ignore their moral concerns and to collude (Kajackaite & Gneezy, 2017). Moreover, a higher payoff will likely attract a different, more experienced sample of MTurkers (Hitlin, 2016) who may be more capable of colluding.
Fourth, collusion might depend on the amount of time MTurkers expect to wait between periods. There is likely variation in how quickly MTurkers expect to advance through experiments. For instance, in synchronous interactive experiments, larger groups may require more waiting between periods as participants need to wait for the slowest group member to finish a given period. A shorter expected waiting time between periods would likely increase the opportunity cost of engaging in collusion, as MTurkers could complete the study more quickly and move on to other HITs if they do not attempt collusion. In contrast, a longer expected waiting time between periods would lower the opportunity cost of collusion as it would not be possible to decrease the waiting time regardless of how quickly MTurkers finish a period. As a result, I hypothesize that a longer expected waiting time between periods will result in more collusion.
To examine these predictions, I conduct two experiments on MTurk. In the first experiment, I manipulate the recruitment method and the instructions given to participants about communication. In this first experiment, participants had the opportunity to earn $2.00 by colluding. To reduce the risk that MTurkers do not collude in Experiment 1 even though such behavior occurs in typical interactive experiments, I aimed to set a payoff for successful collusion slightly higher than what is typically used in interactive experiments. In the second experiment, I manipulate the time MTurkers expect to wait between periods. Moreover, I increased the incentives to collude such that, in the second experiment, participants could earn $25.00 by colluding. In the second experiment, I aimed to set a payoff for successful collusion representative of experiments where stakes are unusually high (e.g., Larney, Rotella, & Barclay, 2019).
Experiment 1
Method
Participants were recruited from MTurk and the experimental task was programmed in oTree (Chen, Schonger, & Wickens, 2016).Footnote 1 To participate, MTurkers needed to reside in the United States, have completed at least 100 HITs, and have an approval rate of at least 90% on their previous HITs. The experiment could not be opened from a mobile device. I used the unique ID assigned to each MTurker to restrict participants from completing the study more than once. I collected data over three days and seventeen sessions. Each session collected data for only one of the six treatments. Participants had twenty minutes to complete the task after accepting the HIT.
The task description informed respondents that they would receive a base pay of $0.90 and a bonus of up to $2.00 for completing the task. After agreeing to the task, participants were redirected to a webpage where they could complete the task. Participants read the instructions and needed to correctly answer eight comprehension questions within two attempts to participate in the experiment and receive the participation fee. The instructions were still available to participants while answering the comprehension questions. Participants who did not correctly answer all the comprehension questions within two attempts were not allowed to participate in the experiment. Participants then completed the main task of the study. Participants did not receive any feedback about their performance while completing the task. On average, participants took 6.4 min to complete the task and received $1.01 for their work.
Number choosing task
For ten periods and in groups of six, participants chose a number from a list that contained ten numbers. Participants learned that, out of the ten numbers, nine numbers were unique to each participant and one number was common across all participants in their group. They earned a bonus of 20 cents for each period in which they and at least one other participant in their group chose the common number. If participants did not share information, they had a 10% chance of choosing the common number each period. If participants were to share information and compare their lists of numbers, they could choose the common number in each period. Therefore, participants had incentives to share information. The expected bonus for participants who did not share information was 8 centsFootnote 2 while the expected payoff of successful information sharing was $2.00.
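The expected payoffs can be reproduced with a short calculation (a minimal sketch; it assumes, consistent with the bonus rule above, that without information sharing a participant earns in a period only when they and at least one of the five other group members independently happen to pick the common number):

```python
# Hypothetical reconstruction of the expected-bonus arithmetic.
# Each of the 6 group members independently picks 1 of 10 numbers,
# so each has a 10% chance of hitting the common number per period.
p_self = 0.1                         # participant picks the common number
p_other = 1 - (1 - 0.1) ** 5         # at least one of the 5 others also does
p_earn = p_self * p_other            # joint probability of earning in a period
expected_bonus = 10 * p_earn * 0.20  # 10 periods, $0.20 per successful period
print(round(expected_bonus, 2))      # ~0.08, i.e., roughly 8 cents

# Under full information sharing, every period pays off with certainty:
collusion_payoff = 10 * 0.20         # $2.00
```

This illustrates why the incentive gap is large: successful information sharing multiplies the expected bonus roughly 25-fold.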
Recruitment method
Participants self-selected into one of two treatments: Standing and Instant. In the Standing treatment, I recruited participants from a pool of MTurkers who, before the experimental session, had indicated their willingness to participate in an interactive experiment. I posted a one-question HIT where participants indicated whether they wanted to be notified about an upcoming HIT in which they would interact with other MTurkers. The payoff for completing this HIT was 1 cent and 830 participants signed up to receive email notifications about the upcoming sessions. The notification emails were sent approximately one and a half hours before a session and informed participants about the time at which the session would begin. Following the recommendation of Mason and Suri (2012), I invited approximately three times the number of participants that I intended to recruit for a session. In the Instant treatment, I recruited participants from the pool of available MTurkers at the time of posting a session. MTurkers who joined the pool for the Standing treatment could not participate in the Instant sessions. As expected, participants had a longer time window to accept the lower-payoff HIT for signing up for the standing panel than they did for the higher-payoff HIT for the experimental session in the Instant treatment. The available spots for the standing panel HIT were never filled while the available spots for the experimental session HIT were filled immediately after being posted.
Instructions about communication
Participants were randomly allocated to three treatments: NoMentionC, ProhibitedC, and AllowedC. In the NoMentionC treatment, the instructions contained no explicit statements about communication with other participants. The instructions in this treatment described the task as a "game of chance". This wording implied that information sharing was prohibited because the element of chance would disappear if participants could always choose the common number by comparing their lists of numbers. In the ProhibitedC treatment, the task was also described as a "game of chance" and the instructions indicated that "communication with other participants is strictly prohibited". In the AllowedC treatment, the task was only described as a "game" and the instructions indicated that "communication with other participants is allowed! This study does not provide a chat function. However, you are allowed to communicate with your group members through other methods such as forums". Data from this treatment were collected in the last sessions to avoid informing participants about the purpose of the study and creating a demand effect.
Measure
I define collusion as participants acting upon information shared through channels that are external to the experimental design. This definition does not include the situation in which an individual gains additional information by simultaneously performing a study multiple times because they have access to multiple MTurk accounts (Chandler, Mueller, & Paolacci, 2014). Information sharing, whether it occurs between distinct participants or between multiple instances of the experiment controlled by a single participant, results in respondents having information that the experimenter does not intend them to possess. However, it is important to distinguish between these two cases because they have different underlying causes, and, as a result, different mitigation strategies.
To measure collusion, I used a design inspired by the coin-flipping task (e.g., Abeler, Becker, & Falk, 2014). The variable of interest was the number of periods in which a participant chose the common number. Participants had a 10% chance of choosing the common number each period if they did not share information with each other. Participants could choose the common number in each period if they shared information and compared their lists of numbers. This design allows me to detect whether participants are sharing information with each other by comparing the distribution of outcomes to the theoretical distribution absent information sharing. If participants did not share information, the outcomes should follow a binomial distribution. For each participant, the probability of choosing the common number exactly k times out of ten periods is given by the formula \(\Pr(k) = \binom{10}{k} \times p^k \times (1 - p)^{10 - k}\), with \(p = 0.1\). In all treatments except for the AllowedC treatment, information sharing was prohibited. Consequently, I classify any information sharing between distinct participants in these treatments as collusion. In contrast, because participants were allowed to share information in the AllowedC treatment, information sharing became part of the experimental design. Therefore, in the AllowedC treatment, I measured information sharing during interactive experiments but not collusion.
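As an illustration, the theoretical benchmark distribution can be computed directly (a minimal sketch in Python, not the paper's analysis code):

```python
from math import comb

def pr_common(k: int, n: int = 10, p: float = 0.1) -> float:
    """Binomial probability of choosing the common number exactly k times in n periods."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Benchmark distribution absent information sharing: zero hits is the
# modal outcome (~34.9%), while ten hits out of ten has probability 1e-10.
benchmark = [pr_common(k) for k in range(11)]
```

Against this benchmark, even a handful of participants hitting the common number in every period is essentially impossible by chance, which is what makes the design sensitive to information sharing.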
Within each period, the common number was identical across all groups of each session.Footnote 3 This design choice enabled participants to benefit from sharing information across different groups, unlike in most experiments where participants can only gain advantages by sharing information within their own group. By having a common number for all groups in a session, I can measure information sharing between participants who are in different groups. It is important to measure information sharing between participants who are in different groups even if, in most studies, sharing information with partners who are not in the same group is unprofitable. This is because participants who are not in the same group will likely add noise by experimenting with their choices to determine if they are in the same group as their partner.Footnote 4
To gain a deeper understanding of participants’ motivations and strategies, I included at least one open-ended question in the post-experimental questionnaire in every treatment. In all treatments except AllowedC, participants were asked a single open-ended question about the strategy they used during the task. To prevent revealing the study’s purpose and creating a demand effect, I only asked specific questions about communication in the final sessions, i.e., in the AllowedC treatment where communication was explicitly allowed. In these sessions, participants were also asked to "explain what method you used to communicate with other participants if you tried to do so" and to "explain why you did not try to communicate with other participants if you chose not to do so".
Data
In total, 650 MTurkers completed the experiment in January 2022. Participants were 52.46% male, 39.95% reported a bachelor’s degree as the highest education level obtained, and 51.08% reported having participated in at least 30 academic studies in the past month. The mean age was 40. Across all characteristics, participants in the ProhibitedC, NoMentionC, and AllowedC treatments were similar. Participants also did not differ in the measured characteristics between the three days of data collection. Participants who self-selected in the Standing and Instant treatments had different characteristics. Consistent with the idea that males self-select into higher-paying HITs (Litman, Robinson, Rosen, Rosenzweig, Waxman, & Bates, 2020; Litman & Robinson, 2020; Manzi, Rosen, Rosenzweig, Jaffe, Robinson, & Litman, 2021), the Standing treatment, which required MTurkers to first accept a 1-cent HIT, attracted fewer male participants (mean = 0.46, sd = 0.50) compared to the Instant treatment (mean = 0.58, sd = 0.49, difference = 0.12, t = 2.99, two-tailed p < 0.01). In addition, on the measure of risk aversion developed by Dohmen, Falk, Huffman, Sunde, Schupp, and Wagner (2011), participants reported being more risk-averse in the Standing treatment (mean = 4.95, sd = 2.58) than in the Instant treatment (mean = 5.88, sd = 2.65, difference = 0.93, t = 4.50, two-tailed p < 0.01).
Results
I first examine whether participants collude in the treatments that resemble most other experimental designs, i.e., all treatments in which participants are not explicitly told they are allowed to communicate. Table 1 and Fig. 1 show that the empirical distribution of the total number of periods in which participants chose the common number does not deviate significantly from the theoretical distribution absent information sharing (p = 0.80, \(\chi ^2\) goodness of fit test; average probability of choosing the common number in a period is 10.15%, one-tailed p = 0.36, binomial test checking if the probability is greater than 10%) when pooling all the data except the observations from the AllowedC treatment. This suggests that, when conditions resemble most other experimental designs, MTurkers do not collude.
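The goodness-of-fit comparison can be sketched as follows (illustrative only: the observed counts and the pooling of the upper tail into one bin are hypothetical, since the exact binning used in the analysis is not reproduced here):

```python
from math import comb

def pmf(k: int, n: int = 10, p: float = 0.1) -> float:
    # Binomial probability of k hits in n periods absent information sharing
    return comb(n, k) * p**k * (1 - p)**(n - k)

N = 100  # hypothetical sample size
# Pool k >= 3 into a single bin so expected counts stay reasonably large
expected = [N * pmf(0), N * pmf(1), N * pmf(2), N * sum(pmf(k) for k in range(3, 11))]
observed = [36, 37, 20, 7]  # hypothetical observed counts

chi2_stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# A statistic near zero (3 degrees of freedom here) indicates no
# detectable deviation from the no-information-sharing benchmark
```

With real data, `scipy.stats.chisquare(observed, expected)` would also return the corresponding p-value.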
I find no evidence that the design choices studied affect the level of collusion when communication is not explicitly allowed. The distribution of the total number of periods in which participants chose the common number in the ProhibitedC treatment is not significantly different from the one in the NoMentionC treatment (p = 0.13, two-tailed Kolmogorov-Smirnov (KS) test). Similarly, the distribution in the Instant treatment is not significantly different from that for the Standing treatment (p = 0.99, two-tailed KS test). Participants also did not collude more on the third day of data collection as compared to the first day (p = 0.99, one-tailed KS test) and second day (p = 0.82, one-tailed KS test).
Table 1 and Fig. 1 suggest that participants share information in the AllowedC treatment in which they are informed that communication with other participants is allowed. The empirical distribution of the total number of periods in which a participant chose the common number is significantly different from the theoretical distribution absent information sharing (p < 0.01, \(\chi ^2\) goodness of fit test, probability of choosing the common number in a period is 13.75%, one-tailed p < 0.01, binomial test checking if the probability is greater than 10%). The effect is driven by eight observations in which the common number was chosen in all ten periods.Footnote 5 It is unlikely that this pattern is generated by chance, given that the probability of choosing the common number in all ten periods by chance is \(10^{-10}\).Footnote 6,Footnote 7
When further examining the eight responses in which the common number was chosen in all ten periods, I find that these responses are fraudulent and likely provided by the same individual who had access to multiple MTurk accounts. This is because all eight responses have nearly identical answers to the three open-ended questions. To the question that asked how participants chose the numbers, all eight responses are "I just guessed the number". Similarly, to the question that asked what method participants used to communicate, the responses are "I didn’t communicate with other participants". Finally, to the question that asked why they did not communicate with other participants, the responses are "I don’t have option". The only difference between the answers is that some participants include an apostrophe in "don’t" while others do not. Because standard research protocol does not allow multiple responses from the same individual within the same study, I do not classify these responses as instances of information sharing between distinct participants. Instead, similar to previous studies (Bentley, Bloomfield, Bloomfield, & Lambert, 2023; Goodrich, Fenton, Penn, Bovay, & Mountain, 2023; Griffin, Martino, LoSchiavo, Comer-Carruthers, Krause, Stults... & Halkitis, 2021), I classify these responses as coming from fraudulent accounts based on the nearly identical answers to the open-ended questions. I therefore removed these responses from the dataset used to make inferences about information sharing between distinct participants and collusion.
When excluding these eight responses, I find that the distribution of the total number of periods in which a participant chose the common number is not significantly different from the theoretical distribution absent information sharing in the AllowedC treatment (p = 0.62, \(\chi ^2\) goodness of fit test). Overall, the analyses suggest that collusion and information sharing between distinct individuals do not occur in any treatment.
To provide recommendations for identifying fraudulent accounts, I further analyze the eight responses that are most likely coming from such fraudulent accounts. None of these responses are from the Standing treatment, which suggests that this recruitment method may help reduce fraudulent responses.Footnote 8 The eight responses come from different IP addresses. Data from https://iphub.info indicates that five (62.50%) out of the eight responses came from a VPN as compared to 98 (15.26%) out of the 642 participants in the rest of the sample.
Finally, I use the answers to the open-ended question in the AllowedC treatment that asked participants why they did not try to communicate with other participants to investigate why most participants did not attempt to share information. I classify these answers into three categories: no ability to communicate with other participants, insufficient incentives, and insufficient time. I can classify 85 out of 128 answers (66.41%) into at least one of these three categories. Most participants, 63 out of 128 (49.22%), indicate that they did not have the ability to communicate with other participants. Another group of participants, 18 out of 128 (14.06%), indicate that the incentives were not high enough for them to attempt to communicate. Finally, 16 out of 128 (12.50%) participants indicate that the twenty minutes they had available to complete the HIT were insufficient to successfully coordinate with other participants. Furthermore, 13 out of 128 (10.16%) participants indicate they attempted to share information through forums but could not find any posts about the task. I also checked all the MTurk forums that I am aware of (Reddit, MTurkcrowd, Turkerview, MTurkforum) and could not find specific information about the task. This indicates that information sharing was not attempted between strangers who communicated through forums.
Discussion
In the treatments that resemble the design of most other interactive online experiments that recruit participants using crowdsourcing platforms, I find no evidence of collusion. This suggests that MTurkers will not collude in the majority of interactive online experiments. However, my results suggest that eight fraudulent accounts, likely controlled by a single individual, shared information between different instances of the experiment. Thus, another cost of not filtering out fraudulent respondents from studies is that these participants might be more likely to collude. Previous research provides extensive recommendations for how to filter out such fraudulent responses (e.g., Aguinis, Villamor, & Ramani, 2021; Burnette, Luzier, Bennett, Weisenmuller, Kerr, Martin... & Calderwood, 2022; Chandler, Sisso, & Shapiro, 2020; Griffin, Martino, LoSchiavo, Comer-Carruthers, Krause, Stults... & Halkitis, 2021; Moss & Litman, 2018; Yarrish, Groshon, Mitchell, Appelbaum, Klock, Winternitz, & Friedman-Wheeler, 2019). In my sample, the eight fraudulent responses came mostly from accounts that used a VPN and these responses were exclusively observed in the Instant treatment. Thus, it is possible that blocking VPN usage (see also Chandler, Sisso, and Shapiro, 2020; Dennis, Goodson, & Pearson, 2020; Kennedy, Clifford, Burleigh, Waggoner, Jewell, & Winter, 2020) and recruiting respondents via standing panels can partially mitigate the risk posed by MTurkers who hold multiple accounts. It should be noted, however, that such mitigation measures also have drawbacks. Specifically, both methods can block legitimate respondents from participating and thus limit the available sample. Moreover, recruiting MTurkers using the standing panel method can affect the demographics of the available pool and slow data collection speed. Finally, the analysis of the open-ended responses suggests that it is logistically difficult for MTurkers to communicate with each other during studies.
Indeed, around half of the participants indicated that they did not have the ability to communicate with others. For the MTurkers who reported the ability to communicate, the practical challenges associated with communication made it unprofitable for them to make such attempts. This suggests that, even though channels for communication exist, the logistical difficulty of communicating with others prevents most MTurkers from colluding. Some MTurkers indicated that the maximum possible bonus of $2.00 was too low to warrant the effort to communicate. Thus, it is possible that MTurkers might collude in experiments in which the payoff of collusion is higher. To test this possibility, I increase the bonus that participants can earn from colluding in Experiment 2.
Experiment 2
Method
Experiment 2 employed the same design and measures as Experiment 1, with four exceptions. First, the incentives to collude were higher: participants could earn a maximum bonus of $25.00 (Footnote 9) in Experiment 2 (a bonus of $2.50 for each period in which they and a member of their group chose the common number), compared to $2.00 in Experiment 1. Second, Experiment 2 manipulated the time MTurkers expected to wait between periods. Participants were assigned either to the Wait treatment, which imposed a 60-second pause between periods, or to the NoWait treatment, in which they could progress through the experiment without any imposed waiting time between periods (Footnote 10). Third, the base pay increased to $3.00 in Experiment 2, compared to $0.90 in Experiment 1. Fourth, participants had 50 minutes to complete the task after accepting the HIT, compared to 20 minutes in Experiment 1. The latter two changes aimed to give participants in the Wait treatment fair compensation and enough time to complete the study. Experiment 2 held the instructions about communication constant by not mentioning communication (similar to the NoMentionC treatment from Experiment 1) and held the recruitment method constant by recruiting all participants using the instant method. As in the NoMentionC treatment from Experiment 1, I included only one open-ended question in the post-experimental questionnaire, which asked participants to explain their strategy in the number-choosing task. I did not explicitly ask participants about communication to avoid revealing the purpose of the study and creating a demand effect.
On average, participants took 8.0 min (19.9 min) to complete the task and received $4.30 ($5.08) for their work in the NoWait (Wait) treatment. Participants who opened the link that allowed them to start Experiment 1 could not participate in Experiment 2.
Data
In total, 320 MTurkers completed the experiment in November 2022. Participants were 49.06% male, 88.75% reported a bachelor’s degree as the highest education level obtained, and 23.13% reported having participated in at least 30 academic studies in the past month. The mean age was 35 (Footnote 11). Across all characteristics, participants in the Wait and NoWait treatments were similar.
Results
Table 2 and Fig. 2 show that the empirical distribution of the total number of periods in which participants chose the common number deviates significantly from the theoretical distribution absent information sharing, both when pooling all the data (p < 0.01, \(\chi ^2\) goodness-of-fit test) and when examining each treatment individually (p < 0.01, \(\chi ^2\) goodness-of-fit tests in the Wait and NoWait treatments). The effects are driven by nine participants who chose the common number in nine periods and by two participants who chose the common number in all ten periods (p = 0.97, \(\chi ^2\) goodness-of-fit test when these participants are excluded). All eleven of these participants likely colluded, given that the probability of choosing the common number in at least nine periods by chance is lower than \(10^{-8}\). These results suggest that in experiments with unusually high payoffs from collusion, a small subset of MTurkers (3.43% in this case) collude.
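The chance benchmark behind this classification can be verified directly. The sketch below is an illustrative calculation (not the paper’s analysis code; Python, matching the oTree stack the experiment used): with ten periods and a 1-in-10 chance of picking the common number per period, the binomial tail gives the probability of nine or more matches occurring by luck.

```python
from math import comb

def binom_tail(n: int, k: int, p: float) -> float:
    """P(at least k successes in n independent trials with success probability p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Ten periods; one chance in ten of hitting the common number in each period.
p_at_least_9 = binom_tail(10, 9, 0.1)
print(f"{p_at_least_9:.2e}")  # prints 9.10e-09, i.e., below the 10^-8 threshold
```

Because this tail probability is about \(9.1 \times 10^{-9}\) per participant, observing eleven participants at nine or more matches is implausible under chance alone.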
Although more participants collude in the Wait treatment (seven participants) than in the NoWait treatment (four participants), the distribution of the total number of periods in which participants chose the common number in the Wait treatment is not significantly different from that in the NoWait treatment (p = 0.23, one-tailed KS test). The probability that participants chose the common number in a given period is higher in the Wait treatment (13.71%) than in the NoWait treatment (10.43%; logit coeff. 0.16, z = 2.84, one-tailed p < 0.01). However, this result is partially driven by the fact that participants in the Wait treatment are less likely to choose the common number in zero periods than participants in the NoWait treatment. This difference in the probability of choosing the common number in exactly zero periods is likely caused by noise rather than collusion. The difference in the probability of choosing the common number between the Wait and NoWait treatments becomes only marginally significant (logit coeff. 0.15, z = 1.34, one-tailed p = 0.09) when dropping the participants who chose the common number in zero periods. Thus, the results provide weak evidence that a higher expected wait time between periods increases collusion. Given that, as explained in Footnote 10, my manipulation of the expected wait time between periods is unusually strong, I do not interpret these results as suggesting that researchers should be more concerned about collusion when their experiment requires a longer waiting time between periods.
As in the first experiment, the eleven participants who colluded have different IP addresses from one another, and none of them mention information sharing as a strategy in the post-experimental questionnaire. Unlike in the first experiment, the colluding participants do not give nearly identical answers to the open-ended question about the strategy they used, and most of them do not use a VPN. Data from https://iphub.info indicate that one (9.09%) of the eleven colluding participants was using a VPN, compared to 18 (5.83%) of the 309 participants in the rest of the sample. This suggests that restricting data collection to participants who do not use a VPN has little to no impact on the likelihood of collusion in high-stakes experiments.
Conclusion
I investigate whether participants collude in interactive online experiments that recruit participants using crowdsourcing platforms. I collect data from 970 MTurkers through two incentivized experiments designed to detect collusion if it occurs. Additionally, I examine how four design choices influence collusion: the recruitment method, the instructions given to participants about communicating with others, the payoff of collusion, and the amount of time participants expect to wait between periods.
This study is subject to limitations that provide opportunities for future research. First, the conclusion drawn from Experiment 1 that MTurkers do not collude in typical interactive online experiments hinges on the assumption that the payoff for collusion in Experiment 1 is higher than in typical experimental designs. I aimed to set the payoff for successful collusion in Experiment 1 above that of typical designs to reduce the risk that MTurkers would not collude in Experiment 1 even though such behavior occurs in typical interactive experiments. Nonetheless, estimating the typical payoff for successful collusion is challenging due to the variation in MTurkers’ earnings per hour (Hitlin, 2016; Moss, Rosenzweig, Robinson, Jaffe, & Litman, 2020) and per academic study (Brodeur, Cook, & Heyes, 2022). As a result, it is possible that many experimental designs provide a payoff for collusion that is higher than the one I set in Experiment 1. In such designs, researchers should consider the strength of participants’ incentives for collusion before dismissing concerns about it. Second, beyond excluding fraudulent accounts from participating in studies, which most researchers likely already aim to do, I did not find any way to mitigate collusion. This leaves open questions about which mitigation methods should be used in studies that are particularly vulnerable to collusion. Third, my experimental design cannot definitively identify whether information sharing occurred between distinct people or whether a single individual used multiple MTurk accounts to share information across different instances of the experiment. Although information sharing reduces data quality regardless of how it is implemented, it would be more worrying to know that one person has access to multiple MTurk accounts.
People with multiple MTurk accounts would likely also decrease the data quality of non-interactive experiments because they would observe multiple treatments. Therefore, future research could investigate how common it is for people to have access to multiple MTurk accounts.
Notwithstanding these limitations, my study contributes to the growing literature that investigates the advantages and disadvantages of using online labor markets as a method to recruit participants in interactive experiments (Arechar, Gächter, & Molleman, 2018; Hawkins, 2015; Horton, Rand, & Zeckhauser, 2011; Mason & Suri, 2012; Raihani, Mace, & Lamba, 2013; Thomas & Clifford, 2017). My findings suggest that in the vast majority of interactive online experiments that recruit participants through crowdsourcing platforms, collusion is not a significant concern. However, I also found that fraudulent accounts, likely controlled by one individual, shared information across multiple instances of the experiment. This suggests that information sharing is an additional risk to data quality that arises when researchers do not filter out fraudulent responses from their sample (Chandler, Sisso, & Shapiro, 2020). In addition to the extensive recommendations already provided by previous studies (e.g., Aguinis, Villamor, & Ramani, 2021), my results tentatively suggest that researchers should also consider recruiting participants using a standing panel (Mason & Suri, 2012; Palan & Schitter, 2018) and not allowing MTurkers who use a VPN to join the study (Kennedy, Clifford, Burleigh, Waggoner, Jewell, & Winter, 2020) to reduce participation from fraudulent accounts.
I also find that approximately 3% of MTurkers collude when the payoff for collusion is unusually high. Therefore, collusion should not be overlooked as a possible danger to data validity in experiments where participants have strong incentives to collude, for example, experiments in which the stakes are unusually high (Exley, 2018; Farjam, Nikolaychuk, & Bravo, 2019; Keuschnigg, Bader, & Bracher, 2016; Larney, Rotella, & Barclay, 2019; Raihani, Mace, & Lamba, 2013; Wu, Balliet, & Van Lange, 2016) or in which participants form large groups (Balietti & Riedl, 2021; Faravelli, Kalayci, & Pimienta, 2020; Suri & Watts, 2011). Despite the unusually high payoff in Experiment 2, I observed relatively low levels of collusion. This, coupled with participant responses to open-ended questions from Experiment 1, suggests that logistical challenges impede most MTurkers from colluding during interactive online experiments, even though channels for communication are available.
Notes
1. The oTree code can be downloaded at https://doi.org/10.6084/m9.figshare.23641722. The instructions can be found at https://doi.org/10.6084/m9.figshare.19940987.
2. The probability of choosing the common number was 0.1 in each period. To win the bonus in a period in which a participant chose the common number, another participant from the group needed to also choose the common number. With five other participants in the group, the probability of all of them not choosing the common number was 0.59 (\(0.9^5\)). Thus, the probability of at least one group member choosing the common number was 0.41 (\(1 - 0.59\)) each period. Therefore, the probability of winning the bonus in a period was 0.04 (\(0.41 \times 0.1\)). As a result, over all ten periods, a participant was expected to win 0.4 (\(0.04 \times 10\)) times, corresponding to an expected payoff of 8 cents (\(0.4 \times 20\) cents).
3. Participants were informed that the number was common across their group but were not explicitly informed that the number was also common across all groups in a session.
4. To decrease waiting time, participants did not need to wait for all other participants in their group to reach the same completion level to advance in the experiment. Therefore, some groups contained fewer than six participants because participants did not wait for groups of six to form before beginning the task and because some MTurkers accepted the HIT but did not participate in the study. As a result, the number of participants in some treatments is not a multiple of six. I simulated additional participants after the experiment so that all participants had incentives as if they were part of groups of six. Participants who were part of smaller groups were therefore not disadvantaged because participants received no feedback during the experiment and because the common number was identical across all groups of each session.
5. Out of the eight responses, three responses were in the same group. Therefore, if respondents could only share information with their group members, then three out of these eight respondents would have increased their payoff by sharing information.
6. In the other treatments, colluding participants could have intentionally not chosen the common number in all periods to avoid detection. If colluding participants used this strategy, I should observe more participants than expected choosing the common number for a number of periods that is profitable but that makes it difficult to know which participant colluded (e.g., three to five times). I find no evidence of participants using such a strategy. Two participants chose the common number five times although the probability of doing so was 0.01%. However, when examining the amount of time spent by these participants on the task, I find no evidence of collusion. On average, these two participants spent less time on each period of the task (4.7 s) than participants who chose the common number in all ten periods (38.1 s) and they spent less time in periods in which they chose the common number (3.9 s) than in periods in which they did not choose the common number (5.4 s).
7. The observed pattern cannot be explained by these participants using a semi-automated script (Buchanan & Scofield, 2018) in the number-choosing task because each participant saw a different list of nine numbers unique to them. Therefore, any rule used by a script, such as choosing the lowest/highest number, cannot explain participants choosing the common number in all ten periods.
8. An analysis of the entire sample suggests a similar conclusion. I find that 20 out of 650 respondents gave an answer that was unusual and similar to at least one other respondent (e.g., "based on calculation" and "just follow the instructions") to the question that asked how they chose the numbers. Only 2 out of these 20 unusual responses are from the Standing treatment.
9. Van Zant, Kennedy, and Kray (2022) ask MTurkers what they consider to be a large maximum bonus and find that, on average, MTurkers consider a large bonus to be $12.31. Therefore, most MTurkers will consider $25.00 to be a large maximum bonus.
10. The manipulation of expected waiting time is unusually strong because there are likely few interactive online experiments in which MTurkers expect to wait such a long time, 60 s, between periods. An alternative, more realistic manipulation would have imposed that participants wait for the slowest member of the group before proceeding. However, this more realistic manipulation would have carried a higher risk that my experiment could not detect that a higher expected wait time increases collusion even though this relationship exists in reality. I opted for a strong manipulation to reduce this risk.
11. Interestingly, participants who self-selected into this experiment with higher stakes were more educated, younger, and reported participating in fewer academic studies than participants who self-selected into the first, lower-stakes experiment.
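The expected-payoff arithmetic in the note on the common-number bonus above can be checked in a few lines of Python (an illustrative sketch of that calculation, not code from the experiment):

```python
p_common = 0.1                  # chance of picking the common number in a period
p_none_of_5 = 0.9 ** 5          # none of the five other group members picks it: ~0.59
p_at_least_one = 1 - p_none_of_5            # ~0.41
p_win_period = p_common * p_at_least_one    # ~0.04 chance of winning the bonus per period
expected_wins = 10 * p_win_period           # ~0.41 expected bonus wins over ten periods
expected_payoff_cents = 20 * expected_wins  # ~8.2 cents, i.e., about 8 cents after rounding
print(round(p_none_of_5, 2), round(expected_payoff_cents, 1))  # prints: 0.59 8.2
```

The exact expected bonus absent collusion is about 8.19 cents, which the note rounds to 8 cents.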
References
Abeler, J., Becker, A., & Falk, A. (2014). Representative evidence on lying costs. Journal of Public Economics, 113, 96–104.
Abeler, J., Nosenzo, D., & Raymond, C. (2019). Preferences for truth-telling. Econometrica, 87(4), 1115–1153.
Aguinis, H., Villamor, I., & Ramani, R. S. (2021). MTurk research: Review and recommendations. Journal of Management, 47(4), 823–837.
Almaatouq, A., Krafft, P., Dunham, Y., Rand, D. G., & Pentland, A. (2020). Turkers of the world unite: Multilevel in-group bias among crowdworkers on Amazon Mechanical Turk. Social Psychological and Personality Science, 11(2), 151–159.
Amir, O., Rand, D. G., & Gal, Y. K. (2012). Economic games on the internet: The effect of $1 stakes. PLoS ONE, 7(2), e31461.
Arechar, A. A., Gächter, S., & Molleman, L. (2018). Conducting interactive experiments online. Experimental Economics, 21(1), 99–131.
Balietti, S., & Riedl, C. (2021). Incentives, competition, and inequality in markets for creative production. Research Policy, 50(4), 104212.
Becker, G. S. (1968). Crime and punishment: An economic approach. Journal of Political Economy, 76(2), 169–217.
Bentley, J. W., Bloomfield, M. J., Bloomfield, R. J., & Lambert, T. A. (2023). What drives public opinion on the acceptability of distorting performance measures? Perceptions of deception, rule-breaking, and harm.
Brodeur, A., Cook, N., & Heyes, A. (2022). We need to talk about Mechanical Turk: What 22,989 hypothesis tests tell us about publication bias and p-hacking in online experiments.
Bryan, C. J., Adams, G. S., & Monin, B. (2013). When cheating would make you a cheater: Implicating the self prevents unethical behavior. Journal of Experimental Psychology: General, 142(4), 1001–1005.
Buchanan, E. M., & Scofield, J. E. (2018). Methods to detect low quality data and its implication for psychological research. Behavior Research Methods, 50(6), 2586–2596.
Buhrmester, M. D., Talaifar, S., & Gosling, S. D. (2018). An evaluation of Amazon’s Mechanical Turk, its rapid rise, and its effective use. Perspectives on Psychological Science, 13(2), 149–154.
Burnette, C. B., Luzier, J. L., Bennett, B. L., Weisenmuller, C. M., Kerr, P., Martin, S., & Calderwood, L. (2022). Concerns and recommendations for using Amazon MTurk for eating disorder research. International Journal of Eating Disorders, 55(2), 263–272.
Chandler, J., Mueller, P., & Paolacci, G. (2014). Nonnaïveté among amazon mechanical turk workers: Consequences and solutions for behavioral researchers. Behavior Research Methods, 46(1), 112–130.
Chandler, J., Sisso, I., & Shapiro, D. (2020). Participant carelessness and fraud: Consequences for clinical research and potential solutions. Journal of Abnormal Psychology, 129(1), 49.
Chen, D. L., Schonger, M., & Wickens, C. (2016). oTree–An open-source platform for laboratory, online, and field experiments. Journal of Behavioral and Experimental Finance, 9, 88–97.
Cooper, R., DeJong, D. V., Forsythe, R., & Ross, T. W. (1992). Communication in coordination games. The Quarterly Journal of Economics, 107(2), 739–771.
Dennis, S. A., Goodson, B. M., & Pearson, C. A. (2020). Online worker fraud and evolving threats to the integrity of mturk data: A discussion of virtual private servers and the limitations of ip-based screening procedures. Behavioral Research in Accounting, 32(1), 119–134.
Dohmen, T., Falk, A., Huffman, D., Sunde, U., Schupp, J., & Wagner, G. G. (2011). Individual risk attitudes: Measurement, determinants, and behavioral consequences. Journal of the European Economic Association, 9(3), 522–550.
Exley, C. (2018). Incentives for prosocial behavior: The role of reputations. Management Science, 64(5), 1975–2471.
Faravelli, M., Kalayci, K., & Pimienta, C. (2020). Costly voting: A large-scale real effort experiment. Experimental Economics, 23(2), 468–492.
Farjam, M., Nikolaychuk, O., & Bravo, G. (2019). Experimental evidence of an environmental attitude-behavior gap in high-cost situations. Ecological Economics, 166, 106434.
Fosgaard, T. R., Hansen, L. G., & Piovesan, M. (2013). Separating will from grace: An experiment on conformity and awareness in cheating. Journal of Economic Behavior & Organization, 93, 279–284.
Goodman, J. K., Cryder, C. E., & Cheema, A. (2013). Data collection in a flat world: The strengths and weaknesses of mechanical turk samples. Journal of Behavioral Decision Making, 26(3), 213–224.
Goodrich, B., Fenton, M., Penn, J., Bovay, J., & Mountain, T. (2023). Battling bots: Experiences and strategies to mitigate fraudulent responses in online surveys. Applied Economic Perspectives and Policy, 45(2), 762–784.
Gray, M.L., Suri, S., Ali, S.S., Kulkarni, D. (2016). The crowd is a collaborative network. Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing (pp. 134–147)
Griffin, M., Martino, R. J., LoSchiavo, C., Comer-Carruthers, C., Krause, K. D., Stults, C. B., & Halkitis, P. N. (2021). Ensuring survey research data integrity in the era of internet bots. Quality & Quantity, 1–12.
Guarin, G., & Babin, J. J. (2021). Collaboration and gender focality in stag hunt bargaining. Games, 12(2), 39.
Hawkins, R. X. (2015). Conducting real-time multiplayer experiments on the web. Behavior Research Methods, 47(4), 966–976.
Hitlin, P. (2016). Research in the crowdsourcing age: A case study. Pew Research Center. Retrieved from https://www.pewresearch.org/internet/2016/07/11/research-in-the-crowdsourcing-age-a-case-study
Horton, J. J., Rand, D. G., & Zeckhauser, R. J. (2011). The online laboratory: Conducting experiments in a real labor market. Experimental Economics, 14, 399–425.
Irani, L.C., & Silberman, M.S. (2013). Turkopticon: Interrupting worker invisibility in Amazon Mechanical Turk. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 611–620)
Irlenbusch, B., & Villeval, M. C. (2015). Behavioral ethics: How psychology influenced economics and how economics might inform psychology? Current Opinion in Psychology, 6, 87–92.
Kajackaite, A., & Gneezy, U. (2017). Incentives and cheating. Games and Economic Behavior, 102, 433–444.
Kennedy, R., Clifford, S., Burleigh, T., Waggoner, P. D., Jewell, R., & Winter, N. J. (2020). The shape of and solutions to the MTurk quality crisis. Political Science Research and Methods, 8(4), 614–629.
Keuschnigg, M., Bader, F., & Bracher, J. (2016). Using crowdsourced online experiments to study context-dependency of behavior. Social Science Research, 59, 68–82.
Kroher, M., & Wolbring, T. (2015). Social control, social learning, and cheating: Evidence from lab and online experiments on dishonesty. Social Science Research, 53, 311–324.
Larney, A., Rotella, A., & Barclay, P. (2019). Stake size effects in ultimatum game and dictator game offers: A meta-analysis. Organizational Behavior and Human Decision Processes, 151, 61–72.
Lindenberg, S. (2018). How cues in the environment affect normative behaviour. Environmental psychology: An introduction (pp. 119–128). Wiley, New York
Litman, L., & Robinson, J. (2020). Conducting online research on Amazon Mechanical Turk and beyond. Sage Publications
Litman, L., Robinson, J., Rosen, Z., Rosenzweig, C., Waxman, J., & Bates, L. M. (2020). The persistence of pay inequality: The gender pay gap in an anonymous online labor market. PloS one, 15(2), e0229383.
Manzi, F., Rosen, Z., Rosenzweig, C., Jaffe, S. N., Robinson, J., & Litman, L. (2021). New job economies and old pay gaps: Pay expectations explain the gender pay gap in gender-blind workplaces.
Mason, W., & Suri, S. (2012). Conducting behavioral research on Amazon’s Mechanical Turk. Behavior Research Methods, 44(1), 1–23.
Moss, A., & Litman, L. (2018). After the bot scare: Understanding what’s been happening with data collection on MTurk and how to stop it. Retrieved February 4, 2019.
Moss, A., Rosenzweig, C., Robinson, J., Jaffe, S. N., & Litman, L. (2020). Is it ethical to use Mechanical Turk for behavioral research? Relevant data from a representative survey of MTurk participants and wages.
Nettle, D., Harper, Z., Kidson, A., Stone, R., Penton-Voak, I. S., & Bateson, M. (2013). The watching eyes effect in the dictator game: It’s not how much you give, it’s being seen to give something. Evolution and Human Behavior, 34(1), 35–40.
Opp, K.D. (2013). Norms and rationality. Is moral behavior a form of rational action? Theory and Decision, 74(3), 383–409
Palan, S., & Schitter, C. (2018). Prolific.ac–a subject pool for online experiments. Journal of Behavioral and Experimental Finance, 17, 22–27.
Raihani, N. J., Mace, R., & Lamba, S. (2013). The effect of $1, $5 and $10 stakes in an online dictator game. PLoS ONE, 8(8), e73131.
Stets, J. E., & Carter, M. J. (2012). A theory of the self for the sociology of morality. American Sociological Review, 77(1), 120–140.
Suri, S., & Watts, D. J. (2011). Cooperation and contagion in web-based, networked public goods experiments. PLoS ONE, 6(3), e16836.
Teschner, F., & Gimpel, H. (2018). Crowd labor markets as platform for group decision and negotiation research: A comparison to laboratory experiments. Group Decision and Negotiation, 27, 197–214.
Thomas, K. A., & Clifford, S. (2017). Validity and Mechanical Turk: An assessment of exclusion methods and interactive experiments. Computers in Human Behavior, 77, 184–197.
Van Zant, A. B., Kennedy, J. A., & Kray, L. J. (2022). Does hoodwinking others pay? The psychological and relational consequences of undetected negotiator deception. Journal of Personality and Social Psychology.
Wu, J., Balliet, D., & Van Lange, P. A. (2016). Reputation management: Why and how gossip enhances generosity. Evolution and Human Behavior, 37(3), 193–201.
Yarrish, C., Groshon, L., Mitchell, J., Appelbaum, A., Klock, S., Winternitz, T., & Friedman-Wheeler, D. G. (2019). Finding the signal in the noise: Minimizing responses from bots and inattentive humans in online research. The Behavior Therapist, 42(7), 235–242.
Yin, M., Gray, M.L., Suri, S., Vaughan, J.W. (2016). The communication network within the crowd. Proceedings of the 25th International Conference on World Wide Web (pp. 1293–1303)
Acknowledgements
I thank participants at the University of Amsterdam CREED lunch seminar for their useful comments and suggestions. I am grateful to the anonymous reviewers, Farah Arshad, Ryan Guggenmos, Victor Maas, Andreas Ostermaier, Sebastian Stirnkorb, Jochen Theis, and Jacob Zureich for their helpful comments. I gratefully acknowledge funding from the Fynske Købstæders Fond.
Funding
Open access funding provided by University Library of Southern Denmark. This study has benefited from funding from the Fynske Købstæders Fond. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The author has no relevant financial or non-financial interests to disclose. The Ethics Committee for Economics and Business from the University of Amsterdam approved this project. Participants gave permission to use their answers knowing that the information obtained would be kept anonymous.
Ethics declarations
Open Practices Statement
The datasets generated and/or analysed during the current study are available in the Figshare repository (https://doi.org/10.6084/m9.figshare.19936070). The oTree code can be downloaded at https://doi.org/10.6084/m9.figshare.23641722. The experimental designs were preregistered (https://aspredicted.org/a89wc.pdf for Experiment 1 and https://aspredicted.org/rq2qg.pdf for Experiment 2). Only the main analyses were preregistered; all other analyses were exploratory.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Ghita, R. S. Do MTurkers collude in interactive online experiments? Behav Res 56, 4823–4835 (2024). https://doi.org/10.3758/s13428-023-02220-3