Code reviews are an important part of the software development process for many reasons [3], not least because well-reviewed code increases software quality [8]. But code reviews can sometimes contain toxic, disrespectful, or otherwise contentious discussions that can be stressful for contributors [9, 14]. Such negative interpersonal interactions and rejection disproportionately affect developers from historically marginalized groups. For instance, open source developers who are women have their code reviews rejected more often when they are outsiders to a project and their gender is visible [16]; open source developers who are perceptibly non-White have their reviews rejected more often [11]; and industry developers who are young, White, and male face less pushback than their colleagues [10]. In this chapter, we describe a simple intervention, deployed for two years in the Gerrit code review system, designed to reduce negative experiences, as well as an evaluation of whether the intervention reduced code review comment toxicity. While the intervention did not appear to reduce toxicity in the communities we studied, we recommend that future researchers and practitioners study more reliable ways to measure toxicity as well as more sophisticated interventions.

Figure 18-1. An example of a respectful code review reminder in the Gerrit UI (the reminder text box appears alongside the code under review, with Cancel and Save buttons)

The Design of Respectful Review Reminders

In an effort to reduce the likelihood that negative interpersonal interactions occur, in February 2020, we introducedFootnote 1 respectful code review reminders in the Gerrit code review system. Our assumption was not that reviewers are intentionally malicious toward code authors, but instead that they have good intentions and simply need to be reminded occasionally to communicate with care.

We designed the reminders to show a text message when a reviewer opens a comment box to provide feedback on a piece of code. Figure 18-1 shows an example. Our design approach for the reminders was based on a blog post about practical advice to engineers for writing respectful code review comments,Footnote 2 and the blog post itself was based in part on our qualitative research on pushback in code review [5]. Using the blog post as a starting point, we (a group of designers, engineers, and researchers) created a mockup and decided how often we thought reminders should be sent. We designed the reminders with the following intent:

  • To improve the likelihood that the text was actually read by the developer, we tried to keep it short, with a link to more information. The text of each reminder was one of the following:

    • DO: Assume competence.

    • DO: Provide rationale or context.

    • DO: Consider how comments may be interpreted.

    • DON’T: Criticize the person.

    • DON’T: Use harsh language.

    • DO: Provide specific and actionable feedback.

    • DO: Clearly mark nitpicks and optional comments.

    • DO: Clarify code or reply to the reviewer’s comment.

    • DO: When disagreeing with feedback, explain the advantage of your approach.

  • So that reviewers would be less likely to become desensitized, whenever a reviewer opened the user interface to leave a comment on the code, the software would randomly decide whether or not to show a message, with a probability of 0.3. When a message was shown, the system randomly selected one from the preceding list of messages. (A sketch of this display logic appears after this list.)

  • To increase the likelihood that the message was seen, we placed the text directly below where the reviewer would write a comment.

  • To reduce the likelihood that the messages would become annoying, reminders initially appeared for each user at most every three days. Some users nonetheless complained that the reminders were becoming annoying; for example, one wrote, “I do not think it’s respectful of my time to have to dismiss this.”Footnote 3 We thus increased the delay to 14 days.

  • To simplify our implementation, messages were shown before the reviewer typed text into the comment box and remained there until the comment was saved. Neither the code being reviewed nor the content of reviewers’ comments affected when messages were shown.
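As a rough sketch of the display logic described in this list, the following Python snippet combines the 0.3 display probability, the per-user rate limit, and the random choice of message. The function and constant names are ours for illustration; they are not Gerrit’s actual implementation, which lives in its web frontend.

```python
import random
import time

# Hypothetical constants approximating the behavior described above.
SHOW_PROBABILITY = 0.3      # chance of showing a reminder when a comment box opens
MIN_DAYS_BETWEEN = 14       # initially 3 days; later raised to 14
SECONDS_PER_DAY = 24 * 60 * 60

REMINDERS = [
    "DO: Assume competence.",
    "DO: Provide rationale or context.",
    "DO: Consider how comments may be interpreted.",
    "DON'T: Criticize the person.",
    "DON'T: Use harsh language.",
    "DO: Provide specific and actionable feedback.",
    "DO: Clearly mark nitpicks and optional comments.",
    "DO: Clarify code or reply to the reviewer's comment.",
    "DO: When disagreeing with feedback, explain the advantage of your approach.",
]

def maybe_pick_reminder(last_shown_at: float | None, now: float | None = None) -> str | None:
    """Return a reminder to display, or None if no reminder should be shown."""
    now = time.time() if now is None else now
    # Rate-limit: at most one reminder per user every MIN_DAYS_BETWEEN days.
    if last_shown_at is not None and now - last_shown_at < MIN_DAYS_BETWEEN * SECONDS_PER_DAY:
        return None
    # Show with probability 0.3 so reviewers do not become desensitized.
    if random.random() >= SHOW_PROBABILITY:
        return None
    # Pick one of the messages uniformly at random.
    return random.choice(REMINDERS)
```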

In February 2022, the feature was removed from Gerrit, as the analysis described below showed a lack of evidence of effectiveness.

Evaluation Approach

We wanted to evaluate whether the reminders had an observable effect on discussions in Gerrit. While there are many potential effects that such reminders might have – such as improvements in sentiment [1, 12], constructiveness [6], or pushback [5] – we chose to measure toxicity using Perspective, a public API that returns the output of machine learning models pre-trained on a variety of data, including comments from Wikipedia and the New York Times.Footnote 4 We chose this construct and this API because

  • Toxicity is a construct widely used in a variety of online contexts, has been used in prior work analyzing code reviews [13], and has mature, scalable analysis tools.

  • The Perspective API has relatively good performance. When we applied several natural language processing techniques in prior work, we found that the Perspective API’s toxicity scores were the best predictor of human-labeled toxicity in open code review comments [13]. Sarker and colleagues also found that the Perspective API produced the highest F1 score and inter-rater agreement when predicting human-rated toxicity in open source code reviews, including for the Android and Chromium Gerrit projects [15].

  • Applying the API is a simple approach, commensurate with the simplicity of the implementation of respectful review reminders.
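For reference, the following is a minimal sketch of scoring a single comment against the Perspective API’s public commentanalyzer endpoint, assuming an API key is available; it is illustrative only, not the bulk pipeline we used.

```python
import requests

PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

def toxicity_score(text: str, api_key: str) -> float:
    """Return Perspective's TOXICITY summary score (0..1) for a comment."""
    body = {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = requests.post(PERSPECTIVE_URL, params={"key": api_key}, json=body)
    response.raise_for_status()
    return response.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

# Example: toxicity_score("nit: please rename this variable", API_KEY)
```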

Our research question can be stated as: Did Gerrit’s respectful code review reminders reduce comment toxicity?

To answer the research question, we applied the Perspective API to score toxicity in Gerrit comments. After a project privacy review within Google, designed to confirm that we were following best practices for protecting the privacy of Gerrit’s users, we were granted bulk access to the comments in two large projects that use Gerrit for their code review: ChromiumFootnote 5 as well as both the internal and open sourceFootnote 6 repositories of Android. After analyzing comments on these projects with the Perspective API, we examined how the scores changed from before the reminders were deployed in Gerrit to after they were deployed. Since we had no way of knowing whether a reminder was shown for any particular comment, we instead assumed that if the deployment of the reminders had an effect on toxicity, that effect would be widely observable rather than observable only locally for certain comments.

After running the Perspective API, we randomly sampled comments for manual inspection to understand and improve the API’s labeling accuracy. The API returns a toxicity score between 0 and 1, representing “how likely it is that a reader would perceive the comment”Footnote 7 as “rude, disrespectful, or unreasonable [and] likely to make people leave a discussion.”Footnote 8 However, because the Perspective API model is periodically updated, the precise toxicity scores reported here can change.Footnote 9 Using toxicity scores as strata, one author first inspected 100 comments with scores greater than 0.9, then 100 with scores between 0.8 and 0.9, and so on. The goal was to identify recurring patterns that represented false positives, so that we could make adjustments and then rerun the toxicity API to better reflect true positives. Such adjustments are typical when natural language processing classifiers trained elsewhere are applied in a software engineering context (e.g., [14]).
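A minimal sketch of this stratified sampling, assuming the scored comments are in a pandas DataFrame with a toxicity column (the column name is our assumption, not our actual schema):

```python
import pandas as pd

def sample_by_score_band(df: pd.DataFrame, n: int = 100) -> pd.DataFrame:
    """Sample up to n comments from each 0.1-wide toxicity band for manual review."""
    bands = pd.cut(df["toxicity"], bins=[i / 10 for i in range(11)], include_lowest=True)
    return (
        df.assign(band=bands)
          .groupby("band", observed=True, group_keys=False)
          .apply(lambda g: g.sample(min(n, len(g)), random_state=0))
    )
```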

In comments categorized as having toxicity scores greater than 0.5, we noticed and addressed the following frequent false positives:

  • The word nit often was rated as highly toxic, yet using the word is considered best practice in code review.Footnote 10 We replaced this token with suggestion in any comment that contained it, before scoring the comment with the Perspective API.

  • The word remove often was rated as highly toxic, but in context it was not. We replaced it with change before scoring.

  • The string ASSERT often was rated as highly toxic, but it instead referred to a snippet of code in context. We replaced assert strings with abc before scoring.

  • The string CHECK often was rated as highly toxic, but it instead referred to a snippet of code in context. We replaced it with ABC before scoring.

After performing these replacements, we ran the toxicity API and used the resulting toxicity scores in our analysis.
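A sketch of the kind of substitutions described above; the patterns are our approximation of the rules, and the actual pipeline may have handled casing and word boundaries differently.

```python
import re

# Substitutions applied before scoring, to reduce false positives from
# code-review jargon and quoted code.
REPLACEMENTS = [
    (re.compile(r"\bnit\b", re.IGNORECASE), "suggestion"),  # "nit" is best practice, not an insult
    (re.compile(r"\bremove\b", re.IGNORECASE), "change"),   # "remove" reads as harsh out of context
    (re.compile(r"\bASSERT\b"), "abc"),                     # ASSERT refers to code, not shouting
    (re.compile(r"\bCHECK\b"), "ABC"),                      # CHECK refers to code, not a command
]

def preprocess(comment: str) -> str:
    """Rewrite tokens that the Perspective API tends to misread as toxic."""
    for pattern, replacement in REPLACEMENTS:
        comment = pattern.sub(replacement, comment)
    return comment
```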

Using the 12 months before the deployment of the reminders as a baseline, we then examined the period after deployment using a sharp regression discontinuity design. Our primary hypothesis was that the reminders had an immediate effect, with a secondary hypothesis that they had a longer-term effect. We also examined whether the effect of the reminders, if any, disappeared after the feature was removed. Given these hypotheses, we dummy-coded the time period into one of the following:

  • Before (the 12 months before the deployment)

  • Short-term (the one week after the deployment)

  • Medium-term (between one week and one month after deployment, about three weeks in duration)

  • Longer-term (between one month after deployment and the time it was removed, about two years)

  • Short-term removal (the one week after removal)

  • Medium-term removal (between one week after removal and about two months after removal, about nine weeks in duration)

Given uncertainties in the exact timing of the feature being available to users, we excluded analysis of toxicity scores for two separate one-week periods, one surrounding the estimated rollout date and one surrounding the estimated removal date.

Inspecting percentiles of toxicity scores over time, we observed a noticeable jump in toxicity on July 3, 2020, during the longer-term period. Around this time, we found a sharp increase in reviewers mentioning the acronym LGTM. Because the Perspective API marked it as moderately toxic, we replaced it with the neutral-scoring looks good to me in the same way we did with tokens like ASSERT, as mentioned previously. Even after replacement, we noticed an increase in average comment toxicity. After speaking with two people familiar with contemporary community events and inspecting common word frequencies before and after July 3, we were unable to ascertain the cause of the average toxicity change. Nonetheless, we decided to model the shift by dividing the longer-term period into longer-term (A) and (B). Longer-term (A) refers to the period from one month after deployment to July 3, 2020 (about 15 weeks in duration). Longer-term (B) refers to the period from July 3, 2020, to about two years after deployment (about 84 weeks).
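To make the period coding concrete, the following sketch buckets a comment’s date into the dummy-coded periods. The deployment and removal boundary dates are placeholders (the feature rolled out in February 2020 and was removed in February 2022, but the exact days are not given above), and the exclusion windows around rollout and removal are dropped.

```python
from datetime import date, timedelta

# Placeholder boundary dates; only the months are known precisely above.
DEPLOYED = date(2020, 2, 1)
REMOVED = date(2022, 2, 1)
SPLIT = date(2020, 7, 3)        # unexplained toxicity shift discussed above
EXCLUSION = timedelta(days=3)   # ~1-week windows around rollout/removal are excluded

def code_period(day: date) -> str | None:
    """Map a comment's date to a dummy-coded study period (None = excluded)."""
    # Drop comments in the exclusion windows around rollout and removal.
    if abs(day - DEPLOYED) <= EXCLUSION or abs(day - REMOVED) <= EXCLUSION:
        return None
    if day < DEPLOYED - timedelta(days=365):
        return None                              # outside the 12-month baseline
    if day < DEPLOYED:
        return "before"
    if day < DEPLOYED + timedelta(weeks=1):
        return "short-term"
    if day < DEPLOYED + timedelta(days=30):
        return "medium-term"
    if day < SPLIT:
        return "longer-term (A)"
    if day < REMOVED:
        return "longer-term (B)"
    if day < REMOVED + timedelta(weeks=1):
        return "short-term removal"
    if day < REMOVED + timedelta(weeks=10):
        return "medium-term removal"
    return None
```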

Figure 18-2. Toxicity scores during the various study periods (bar chart with error bars; the longer-term (B), short-term removal, and medium-term removal periods have the highest estimates, at approximately 0.07)

The regression predicted the overall toxicity score for each comment, since this appeared to us to be the simplest way to model toxicity. We included a variety of control variables in the regression. First, we included a random effect for the identity of the person who made the comment; the objective was to control for the typical toxicity that each person exhibits. We also included a variety of fixed effects: the log of the number of revisions in a changelist (using the log to normalize long-tailed, skewed distributions); the log of the number of reviewers; the log of the number of inserted lines of code; the log of the number of deleted lines of code; the log of the number of files changed; and whether the change is a reversion of a prior change.
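A sketch of this kind of model using statsmodels’ mixed-effects regression; the input file and column names are assumptions for illustration, not our actual schema.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical input: one row per scored comment, with its study period,
# commenter identity, and changelist-level covariates.
df = pd.read_csv("comments_scored.csv")

model = smf.mixedlm(
    "toxicity ~ C(period, Treatment(reference='before'))"
    " + np.log1p(revisions) + np.log1p(reviewers)"
    " + np.log1p(lines_inserted) + np.log1p(lines_deleted)"
    " + np.log1p(files_changed) + is_revert",
    data=df,
    groups=df["commenter_id"],   # random intercept per comment author
)
result = model.fit()
print(result.summary())
```

The period coefficients in such a model can then be read as shifts in expected toxicity relative to the before period, after accounting for the controls.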

Results

The results of running our regression model are shown in Figure 18-2. The figure shows the toxicity scores estimated by the model, which can be read as mean toxicity scores after accounting for the controls. The whiskers on each bar indicate 95% confidence intervals.

From the figure, we observe the following:

  • Given that toxicity scores can range between 0 and 1, toxicity score estimates across all periods remained low – less than 0.1. An example of a common string with a toxicity score of 0.1 is See comment below.

  • In the short-term, medium-term, and longer-term (A) after the feature was added to Gerrit, toxicity scores increased slightly and statistically significantly.

  • In the longer-term (B) period, toxicity scores increased even further, likely due to the July 3, 2020, change.

  • No decrease in toxicity was apparent in the removal periods.

Discussion

The preceding evidence suggests that the respectful code review reminders did not reduce the overall level of toxicity in the Gerrit repositories studied. This supports the decision to remove the feature from Gerrit. However, since negative interpersonal interactions are an observed problem – especially for marginalized developers – other interventions should be considered and evaluated. Such interventions might include

  • Context-sensitive suggestions, which would activate when a comment author drafts text that may be interpreted negatively and would provide actionable examples for making the comment more respectful. A domain-specific text analysis tool would be required, however.

  • Feedback on feedback: Readers could provide their own feedback on comments, perhaps as lightweight as an emoji.Footnote 11

  • Feedback dashboard: Users could have a private dashboard showing statistics about their comments – for example, overall positive or negative sentiment – and allowing them to compare those statistics to the larger community. In addition, each user involved in a code review could rate the respectfulness of the discussion after it is merged; these ratings would appear in the dashboard, helping users self-assess and act on their level of respectfulness.

The evidence here also suggests that the amount of toxicity actually increased after the feature was introduced. If the feature caused the increase, a plausible explanation is reactance [4], where people react in opposition to messaging when they view it as a restriction on their choices and behavior. Notably, the regression discontinuity method used here assumes that such defiers [2] – here, developers exhibiting reactance – do not exist in the dataset.

Limitations

There are several limitations to our research. First, based on our manual inspection of comments, it’s clear that a significant proportion of comments scored as highly toxic were, in reality, not toxic. While we tried to manually address this by replacing common keywords (e.g., nit) before scoring, other more subtle patterns were not easy to mitigate. For instance, we noticed that self-directed negativity (e.g., I made a stupid mistake) was scored as toxic, as were negative comments directed at the code. On a surface level, these can be interpreted as false positives, but that’s not necessarily the case; toxicity directed at the code can also be interpreted as toxicity that reflects on the code’s author. General inaccuracy in applying the Perspective API to code review is somewhat expected; it was trained on toxic comments on the open Internet, rather than on code review or software engineering data specifically. A toxicity model trained on code review data specifically would likely yield more accurate results.

Second, the overall effect sizes are very small. While toxicity shifted statistically significantly, the overall shift may not be practically significant.

Third, other changes may confound our results. As we already pointed out, we observed an unexplained change in July 2020 that appears to have had an effect on toxicity scores. As another example, longer-term and later toxicity estimates may be influenced by pandemic-related effects, if any. Other unknown changes may have occurred at various points in time, making it difficult to isolate the causal effect of the reminders. Rather than a regression discontinuity design, an A/B testing design could have enabled more confident causal inferences [7], but we did not have access to an A/B testing framework in Gerrit at the time. Moreover, an A/B test in this context may suffer from cross-contamination effects, since toxic comments may beget more toxic comments and vice versa.

Finally, we modeled toxicity by predicting the toxicity of each comment, but other modeling strategies may yield different results. For example, comments could be broken up into sentences; then the toxicity of each sentence could be predicted. As another example, comments could be aggregated by reviewer, and then the average toxicity of the reviewer could be predicted.
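For instance, a sentence-level or reviewer-level analysis might look like the following sketch, with the scoring function and column names assumed from the earlier examples.

```python
import re

def sentence_level_scores(comment: str, score) -> list[float]:
    """Score each sentence of a comment separately (naive punctuation-based split)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", comment.strip()) if s]
    return [score(s) for s in sentences]

# Reviewer-level alternative (assuming the scored-comment DataFrame from earlier):
#   per_reviewer = df.groupby("commenter_id")["toxicity"].mean()
```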

Conclusion

The analysis presented in this chapter provides evidence that regularly reminding developers to be respectful in their code review does not measurably reduce comment toxicity. While these results suggest that our specific approach did not have an observable effect, even minor changes – such as decreasing message frequency or using different message text – may yield different results. More sophisticated approaches may also be successful, such as targeted messages to people whose comments may be perceived negatively by their peers. Regardless of the approach taken in the future, researchers should evaluate and report on the effects of such interventions – whether or not an effect is observed. With developers from marginalized groups facing more negative experiences during code review, both in open source [11, 16] and in industry [10], we encourage toolsmiths to design interventions to try to increase equity in code review experiences and outcomes.

Acknowledgments

Thanks to Adam Brown, Alison Chang, Ciera Jaspan, Claire Taylor, Liz Kammer, Rayven Plaza, the Gerrit team, Google’s Core Developer team, and our anonymous reviewers for their feedback and support.