Multiple studies of gender in software engineering require identifying the gender of the individuals involved, either by asking them (when conducting interviews and surveys) or by “guessing” it from archival data recorded in software repositories. In this chapter we discuss ways to ask about gender in surveys and interviews as well as three groups of automated genderization approaches proposed in the literature: name-to-gender, face-to-gender, and artifact-to-gender. For each group of approaches, we discuss how the approaches work, the associated ethical concerns, the concerns about reliability and accuracy, and the assumptions they make.

Introduction

When it comes to studies of diversity in software engineering, gender is by far the most studied diversity dimension: 61% of the scientific studies recently surveyed by Rodriguez-Perez et al. have considered gender [34]. Indeed, previous studies have shown that women participating in open source projects disengage faster than men [32], that women concentrate their work across fewer projects and organizations while men contribute to a higher number of projects and organizations [14], that men and women follow different comprehension strategies when reading source code [40], and that men tend to switch more frequently between debugging strategies [8]. Several studies in this volume also consider gender as a variable of interest: for example, Hashmati and Penzenstadler in Chapter 5, “How Users Perceive the Representation of Non-binary Gender in Software Systems: An Interview Study,” report on an interview study of the representation of gender in software; Gama et al. in Chapter 16, “Toward More Gender-Inclusive Game Jams and Hackathons,” focus on the experiences of transgender (binary and non-binary) and gender-nonconforming people related to jams and hackathons; Kohl and Prikladnicki in Chapter 11, “Gender Diversity on Software Development Teams: A Qualitative Study,” conduct a survey of gender diversity in software development teams; Simmonds et al. in Chapter 23, “Rethinking Gender Diversity and Inclusion Initiatives for CS and SE in a University Setting,” discuss the findings of a focus group of women and non-binary students; and Happe in Chapter 25, “Effective Interventions to Promote Diversity in CS Classroom,” studies the frustrations that steer women away from computer science.

All these studies require the researchers to obtain information about the gender identity of the study participants (for controlled experiments, interviews, and surveys) or of the individuals who have contributed to the dataset analyzed (for data-driven archival studies such as repository mining). As we will see in the following, obtaining such information is fraught with challenges, and inappropriate ways of doing so might both alienate study participants and threaten the validity of the scientific results. The challenges are not limited to researchers: everyone conducting internal surveys, performing marketing analyses, adding “gender” questions to the user interface of a software product, or aiming to understand user satisfaction has to find a way of recording information about gender identity.

To support both researchers and practitioners, in this chapter we take a look at the techniques used to obtain information about gender and the associated advantages and challenges.

Before we even start discussing how information about gender can be obtained, one has to remember that gender is a complex social construct of norms, behaviors, and roles that varies between societies and over time. Hence, any study of gender in general, of gender identities, or of the gender expression of individuals should be conducted with the utmost care. Whatever technique we use, we should keep in mind that gender is privacy sensitive and should be treated as such, even if regulations such as the General Data Protection Regulation (GDPR) do not classify this information as sensitive. In particular, open source contributors might be hiding their gender on purpose: for example, many women developers prefer not to disclose their gender due to safety concerns. Moreover, some open source projects do not necessarily want us to know the genders of their members (but some do!), and companies might be sensitive to this topic as well.

Talking to People

One of the most popular ways of obtaining information about gender is asking the individuals themselves, as part of a survey or an interview. We should keep in mind, however, that the reliability of this method strongly depends on the ability of the respondents to understand the question and to find an answer corresponding to the way they see themselves.

In Asking Questions: The Definitive Guide to Questionnaire Design, Bradburn et al. [6] suggest recording the respondent’s gender by asking, “What is your/NAME’s sex?” and offering two answer options, male and female. This question conflates biological sex and socially constructed gender and reduces the spectrum of options to merely two. However, it is by now well known that both the biological reality of sex and the social reality of gender are much more complex [17]. For example, a recent Stack Overflow survey indicates that 1.42% of software developers identify as non-binary, genderqueer, or gender nonconforming and 0.92% prefer to self-describe.Footnote 1 In a survey of the Linux Foundation, 4% of the respondents indicated their gender as “non-binary/third gender.”Footnote 2 Surprisingly, in December 2018, the popular survey platform SurveyMonkey was still offering “female” and “male” as the only options for the “What is your gender?” question [43].

Hence, at the very least, the phrasing of the question about gender should reflect the existence of genders other than women and men. One of the simplest ways of phrasing such a question would be “Are you…male, female, something else? Specify ____.” In December 2018, a similar phrasing was the default in Google Forms [43], and a similar question was used by Roberts et al. in a 2022 study of Australian adolescents’ eating pathology [33]. Such a phrasing is profoundly problematic, as it expresses a preference toward “male” and “female,” pushing all other gender identities outside of the norm. This process is known as othering [10, 47], “differentiating discourses that lead to moral and political judgments of superiority and inferiority between ’us’ and ’them’” [12]. Moreover, by allowing the respondents to select only one answer option, this phrasing excludes people who are, for example, both women and non-binary. Finally, in the empirical evaluation performed by Bauer et al. [3], cisgender participants had no problems answering this question, but transgender participants tried to understand what exactly the researchers were asking and reached different conclusions: both transfeminine (assigned male at birth and identifying as women/non-binary) and transmasculine (assigned female at birth and identifying as men/non-binary) respondents gave all three possible answers (male, female, other), rendering this question useless. When used in interviews, this item was cognitively taxing for transgender participants [3].

The previous discussion suggests that (a) one should avoid referring to certain gender identities as “other”; (b) if answer options are provided, respondents should be able to select several options; and (c) the phrasing should explicitly refer to gender identity. Several proposals satisfying these requirements have been made in the literature. For example, Spiel et al. [43] recommend asking, “What is your gender?” with the following five checkboxes: “woman,” “man,” “non-binary,” “prefer not to disclose,” and “prefer to self-describe.” If the last option is checked, a free-form field opens up. Nikki Stevens, author of the Open Demographics project,Footnote 3 suggests phrasing this question as “Where do you identify on the gender spectrum?” followed by a list of 30 gender identities taken from The ABC’s of LGBT+ by Ashley Mardell [24], as well as “prefer not to answer” and “self-identify: ____.” One should be aware, however, that a lengthy list of gender identities might be experienced as confusing and take too much time if used as part of a larger survey.
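Because respondents may select several of the offered options, analysis scripts should not assume a single gender value per respondent. The following sketch shows one way to handle this; it is a minimal example, assuming pandas and a survey export in which multi-select answers arrive as comma-separated strings in a hypothetical “gender” column (the column name and answers are made up for illustration):

    import pandas as pd

    # Hypothetical survey export: multi-select answers as comma-separated strings.
    responses = pd.DataFrame({
        "participant": ["P1", "P2", "P3"],
        "gender": ["woman", "woman, non-binary", "prefer to self-describe"],
    })

    # Split each answer into its selected options and build one boolean
    # indicator column per option instead of a single "gender" value.
    options = responses["gender"].str.split(",").apply(lambda xs: [x.strip() for x in xs])
    indicators = pd.get_dummies(options.explode()).groupby(level=0).max()
    print(responses[["participant"]].join(indicators))

Keeping one indicator column per gender identity also avoids silently collapsing multi-select answers into a single “other” category during analysis.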

Instead of offering answer options, one might also ask an open question, as recommended by Scheuerman et al. in “HCI Guidelines for Gender Equity and Inclusivity.”Footnote 4 In this case, researchers will be required to manually code the responses, so the expected number of participants should not be too large, which is often the case for software engineering surveys. Moreover, open questions might elicit absurd reactions such as “bagel” or aggressive reactions such as “attack helicopter,” the latter originating from a meme ridiculing non-binary gender identification [16].

Summary

When conducting interviews or small surveys where the risk of aggressive or absurd responses is deemed small, prefer an open question such as “Where do you identify on the gender spectrum?” For larger surveys, or for surveys of populations that are more likely to provide absurd or aggressive responses, consider offering the following five checkboxes: “woman,” “man,” “non-binary,” “prefer not to disclose,” and “prefer to self-describe ____.”

Mining Software Repository Data

Repository mining studies analyze contributions from tens of thousands [32] to tens of millions of individuals [35]. This wealth of data allows one to carefully distinguish fine-grained statistical effects or to perform longitudinal studies spanning over 50 years. However, with such amounts of data, it is no longer feasible to contact every single individual and ask them about their gender identity. In the case of longitudinal studies, individuals might have retired or passed away; in the case of large-scale studies of contemporary software development practices, contacting tens or hundreds of software developers might be technically possible, but it would likely lead to community disengagement, threatening the already low response rates [41]. To address this challenge, multiple tools have been proposed to automatically infer gender information from the way developers present themselves, for example, from their usernames or avatars, or from the artifacts they produce, such as source code or comments. These tools can be broadly classified as name-to-gender, face-to-gender, and artifact-to-gender. Many of them have not been designed with software engineering data in mind, yet such data has its own peculiarities, which we discuss in the following.

Name-to-Gender

As Bradburn et al. [6] have put it, “[s]ometimes a person’s gender is obvious from his or her name.” Phrasing this more carefully, we can say that many cultures tend to associate specific names with specific genders: for example, Božidar is a Bulgarian name commonly given to men, while Nijolė is a Lithuanian name commonly given to women. At the same time, טל (Tal) is a Hebrew name that can be given to a child of any gender. In their most basic form, name-to-gender tools merely look up a given name in lists of names typically associated with women and men and return “woman,” “man,” or “unknown” depending on the relative prevalence of the name within a specific gender. This prevalence is sometimes used to express the degree of confidence of the tool in the inferred gender. For example, genderize.io states “male” with 0.99 confidence for “bozidar,” “female” with 0.99 confidence for “nijole,” and “male” with 0.68 confidence for “tal.”
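To make the basic lookup concrete, the sketch below implements a toy dictionary-based genderizer in Python. The name list and prevalence counts are invented for illustration; production tools rely on large curated name lists, but the logic is similar: look up the given name and report the majority gender together with its relative prevalence as a confidence score.

    # Toy name-to-gender lookup; NAME_COUNTS and its numbers are hypothetical.
    NAME_COUNTS = {
        "bozidar": {"man": 990, "woman": 10},
        "nijole":  {"man": 5,   "woman": 995},
        "tal":     {"man": 680, "woman": 320},
    }

    def genderize(name, threshold=0.6):
        counts = NAME_COUNTS.get(name.lower())
        if counts is None:
            return "unknown", 0.0
        total = sum(counts.values())
        gender, majority = max(counts.items(), key=lambda kv: kv[1])
        confidence = majority / total
        # Below the threshold the tool is essentially guessing.
        return (gender if confidence >= threshold else "unknown"), confidence

    print(genderize("Tal"))        # ('man', 0.68)
    print(genderize("Govindjee"))  # ('unknown', 0.0)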

However, this kind of basic approach fails to take into account differences between cultures: for example, Andrea is more commonly associated with men in Italy and with women in Germany, while Karen is mostly associated with women in the United States and with men in Armenia. International collaboration means that the same software engineering project or dataset might involve contributors from different cultures, which requires more advanced name-to-gender tools to take the cultural background into account. genderComputer, which has been designed to analyze Stack Overflow data, uses location as a proxy for cultural background [45]. However, less than 20% of the Stack Overflow users in the sample analyzed by Vasilescu et al. have indicated their location, and the location indicated by users does not necessarily correspond to an actual geographic location (e.g., The Matrix) [45]. Moreover, using location as a proxy for national culture fails to take into account immigration-related effects.
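Some lookup services expose this cultural context explicitly. The following hedged sketch queries genderize.io over HTTP and passes a country code; it assumes the public endpoint and the country_id localization parameter behave as described in the service’s documentation at the time of writing, and it uses the requests library:

    import requests

    def genderize_io(name, country=None):
        # Query genderize.io; the JSON response contains the inferred gender,
        # a probability, and the number of samples the estimate is based on.
        params = {"name": name}
        if country:
            params["country_id"] = country  # ISO 3166-1 alpha-2 code
        return requests.get("https://api.genderize.io/", params=params, timeout=10).json()

    # The same given name can lead to different inferences per country.
    print(genderize_io("andrea", "IT"))  # commonly associated with men in Italy
    print(genderize_io("andrea", "DE"))  # commonly associated with women in Germany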

This is why NamsorFootnote 5 uses the individual’s surname as a proxy for national culture. This allows Namsor to infer that Andrea Rossini is more likely to be a man, while Andrea Parker is more likely to be a woman.Footnote 6 This also makes Namsor one of the most accurate name-to-gender tools [36, 38]. Closer inspection reveals a different story, however. Santamaria and Mihaljević [36] reported that the confidence of Namsor is almost perfect for European names, but the median confidence drops to 70% for Asian names. In particular, Eastern and Southeastern Asian names are difficult to genderize: half of the East Asian names have a confidence score of 0, indicating that Namsor is essentially guessing randomly. This Eurocentric bias is problematic when trying to apply automatic gender inference techniques to software developers: the recent Stack Overflow survey shows that almost one out of four software developers have indicated various Asian regions as their ethnic background; in particular, 4.2% of the respondents are East Asian and 4.39% Southeast Asian.Footnote 7 Recognizing this limitation of Namsor, Qiu et al. combined it with genderComputer and designed a classifier trained on public name lists and celebrity name lists [32]. The features of this classifier included the last character (e.g., in Spanish, names ending in a are usually associated with women), the last two characters (e.g., in Japan, names ending in ko are usually associated with women), and tri-grams and 4-grams to capture romanized Chinese, Japanese, and Korean names. The combined name-to-gender tool outperformed both genderComputer and Namsor: for example, accuracy on Chinese names was 60% as opposed to 7% for Namsor and 18% for genderComputer [32]. A similar, character-based approach has been combined with deep learning models by Hu et al. [13]. The work of Qiu et al. has also inspired the Namsor developers to develop dedicated techniques for ChineseFootnote 8 and Japanese names.Footnote 9 Still, a 2022 study by Sebo shows that even for the current version of Namsor, the overall proportion of errors (misclassifications and non-classifications) is 53% [39]. What is even more problematic is that Namsor tends to perform worse for names associated with women than for names associated with men (19.2% of the former were categorized correctly, as opposed to 66.5% of the latter) [39].
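The character-based features described above can be approximated with a standard text-classification pipeline. The sketch below is an illustrative approximation and not the implementation of Qiu et al.: the tiny training list is invented, and scikit-learn’s character n-gram vectorizer stands in for their hand-crafted suffix and n-gram features.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy training data (invented); real classifiers are trained on large
    # public name lists and celebrity name lists.
    names  = ["maria", "akiko", "hiroshi", "mei", "carlos", "yumiko", "pedro", "wei"]
    labels = ["woman", "woman", "man", "woman", "man", "woman", "man", "man"]

    # Character n-grams (lengths 1-4) capture suffixes such as "-a" or "-ko"
    # as well as longer fragments of romanized Chinese, Japanese, and Korean names.
    model = make_pipeline(
        CountVectorizer(analyzer="char_wb", ngram_range=(1, 4)),
        LogisticRegression(max_iter=1000),
    )
    model.fit(names, labels)
    print(model.predict(["keiko", "marco"]))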

However, with all these improvements, Namsor cannot be applied if the individual is known by a mononym: for example, the Indian-American scientist and educator Govindjee is known by a single name only. This also limits the applicability of Namsor to datasets such as Stack Overflow: 43% of the Stack Overflow usernames do not contain spaces and hence cannot be analyzed using Namsor. Since genderComputer has been designed for Stack Overflow, it implements several heuristics targeting software developers. In particular, if the name cannot be easily split into first name(s) and last name(s), genderComputer assumes it is formatted according to common naming conventions for usernames (e.g., “johns” for “John Smith”) [4] and restarts the genderization process (e.g., with “john” derived from “johns”) [45].
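A hedged sketch of such a heuristic is given below; it is not the actual genderComputer implementation, merely an illustration of how a username without spaces might be reduced to candidate given names that can then be fed back into a name-to-gender lookup such as the ones sketched earlier:

    import re

    def candidate_given_names(username):
        # Illustrative heuristics, not the genderComputer implementation.
        # Split on camelCase boundaries, digits, and common separators:
        # "JohnSmith82" -> ["John", "Smith"], "john.smith" -> ["john", "smith"].
        parts = [p for p in re.split(r"[._\-\d]+|(?<=[a-z])(?=[A-Z])", username) if p]
        if len(parts) > 1:
            yield parts[0].lower()
        else:
            # "johns" may follow the "given name + surname initial" convention,
            # so also try the username with its last character stripped.
            yield username.lower()
            if len(username) > 2:
                yield username[:-1].lower()

    print(list(candidate_given_names("johns")))        # ['johns', 'john']
    print(list(candidate_given_names("JohnSmith82")))  # ['john']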

Another limitation of all the aforementioned approaches is their inability to take age into account. Indeed, in Pennsylvania, for example, names such as Morgan and Robin that were predominantly associated with men in the past have evolved to being associated with people of any gender and, more recently, to being more commonly associated with women [2].

Summarizing the discussion of name-to-gender tools, we can say that multiple such tools have been developed by open source practitioners, academic researchers, and company-based software engineers. These tools tend to approximate the cultural background by analyzing the location or the surname of the individual. While age might have affected the popularity of names among individuals of different genders, to the best of our knowledge, no currently available name-to-gender tool takes age into account.

To conclude this discussion, we list several examples of name-to-gender tools. As providing a complete overview of these tools would not be feasible, Table 28-1 only lists examples of name-to-gender tools that (a) are available at the moment of writing and (b) have been empirically evaluated in scientific publications other than the papers that introduced them.

Table 28-1 Examples of name-to-gender tools

Face-to-Gender

Another way developers present themselves on social platforms such as Stack Overflow and GitHub is by using avatars. This means that face recognition techniques such as Facelytics,Footnote 10 Face Analysis by Visage Technologies,Footnote 11 and PicPurifyFootnote 12 can be applied to these avatars to identify the gender of the individuals they depict. Indeed, on the task of identifying the gender of Stack Overflow users based on their avatars, the face-to-gender tool Face++ has been shown to perform comparably to genderComputer [22], while on other avatar datasets, the tools of Face++, Amazon, and Microsoft achieve more than 90% accuracy when identifying gender based on automatically detected faces [18]. However, not all faces can be correctly detected: in the study of Jung et al. [18], the very best tool correctly identified faces in merely 76% of the analyzed images. Moreover, not everybody has a profile picture representing a human face. For instance, approximately 30% of the Stack Overflow users only have a default profile picture automatically generated based on the MD5 hash of the user’s email address, rendering approximately 70% of the Stack Overflow users possibly amenable to face-to-gender inference. However, not all Stack Overflow profile images represent faces (rather than, say, logos or cat pictures). This is why Lin and Serebrenik carefully selected 900 non-generated profile images of users of different ages and reputations and classified them manually: reputation classes were selected according to the different privileges Stack Overflow users might have, and age intervals according to the general distribution of ages on Stack Overflow. Among the 900 profile images, only 53% represent faces [22], suggesting that, overall, face-to-gender tools might be applicable to approximately 37% (70% × 53%) of the users.
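When contributor email addresses are available, as is the case for git commit metadata, one can at least check whether a contributor has a custom avatar before attempting face-to-gender inference. The sketch below relies on Gravatar’s documented behavior of returning HTTP 404 when the d=404 parameter is passed and no user-uploaded image exists for the hashed email address; the email address shown is a placeholder.

    import hashlib
    import requests

    def has_custom_gravatar(email):
        # Gravatar identifies images by the MD5 hash of the normalized email;
        # with d=404 the service answers 404 when no custom image exists.
        digest = hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()
        url = f"https://www.gravatar.com/avatar/{digest}?d=404"
        return requests.get(url, timeout=10).status_code == 200

    # Profiles without a custom avatar cannot be analyzed at all; profiles
    # with one may still show a logo or a cat rather than a face.
    print(has_custom_gravatar("someone@example.com"))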

Artifact-to-Gender

Artifact-to-gender tools are based on the assumption that people of different genders express themselves differently in writing. Not surprisingly, the lion’s share of the research in this area has been based on personal writing on social media such as tweets and Facebook posts [21]. For example, the work of Company and Wanner [42] was designed in the first place for attributing the authorship of blog posts and novels to one of the authors within a predefined set; the same technique was then retrained to predict the gender of the author. Authorship attribution techniques have been designed for source code as well [11, 15]; similarly to Company and Wanner [42], they aim at associating code fragments with one of the authors from a predefined set of approximately 100–160 candidates. This shows that deanonymization of source code is possible despite a much more constrained use of language compared to social media texts. This is why Naz and Rice have applied similar techniques to predict the gender of the authors: on a dataset of 100 student assignments, their approach achieved an accuracy of 72% [29]. It remains to be seen whether these techniques can scale to the tens of thousands of contributors common in repository mining studies.
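To make the underlying idea concrete, the sketch below computes a handful of toy stylometric features of a code fragment. It is only an illustration of the kind of signal such techniques exploit, not a reproduction of the cited approaches, which use much richer lexical, layout, and syntactic features and train a classifier on them.

    import re

    def stylometric_features(code):
        # Toy stylometric features of a Python fragment (illustration only).
        lines = code.splitlines() or [""]
        identifiers = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code)
        n_ids = len(identifiers) or 1
        return {
            "avg_line_length": sum(len(l) for l in lines) / len(lines),
            "comment_ratio": sum(l.strip().startswith("#") for l in lines) / len(lines),
            "avg_identifier_length": sum(map(len, identifiers)) / n_ids,
            "snake_case_ratio": sum("_" in i for i in identifiers) / n_ids,
        }

    print(stylometric_features("def total_sum(values):\n    # add them up\n    return sum(values)"))

A classifier (for authorship or, as in the studies above, for gender) would then be trained on such feature vectors extracted from fragments with known labels.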

Limitations and Concerns

The automated techniques discussed in the previous sections show that gender-related information can be obtained from names, avatars, and the code or text an individual has written. However, we need to remember that these methods are far from perfect, and one has to be very careful when applying them.

The first group of concerns is ethical. Such concerns are mostly raised in relation to face-to-gender techniques, but they apply to any automated genderization method, as they all assign categories to human beings without their explicit consent. While as humans we might be assigning categories to other people continuously, for example, when we describe them, automating this process can be dangerous: what if, for example, a tool recognizes a woman driving a car in a country where women are not allowed to drive cars? In fact, Nature has surveyed approximately 500 researchers in facial recognition, computer science, and AI, and about two-thirds believe that applying facial recognition methods to recognize or predict personal characteristics (such as gender, sexual identity, age, or ethnicity) from appearance should be done only with the informed consent of those whose faces are used or after discussion with representatives of the groups that might be affected [31]. Getting informed consent from all GitHub or Stack Overflow developers is, of course, not realistic. Furthermore, individuals do not necessarily want to disclose their gender and sometimes take steps to hide it: one of the developers surveyed by Vasilescu et al. stated that they “have used a fake GitHub handle (my normal GitHub handle is my first name, which is a distinctly female name) so that people would assume I was male” [46]. In this case, “correct” genderization would explicitly contradict the individual’s intention, which can hardly be seen as ethical.

The second concern is related to the gender binary assumption permeating the automatic techniques discussed previously. Two meta-studies quantify how widespread this assumption is. Keyes has shown that 92.9% of the papers that introduce automatic face-to-gender tools assume a gender binary, as do 96.7% of the papers that use automatic gender recognition [19]. For the artifact-to-gender literature surveyed by Krüger and Hermann, this percentage goes up to 100% [21]. Finally, name-to-gender tools are doing a bit better: while they are still ignorant of non-binary genders, they at least tend to provide confidence scores, that is, they at least recognize their own lack of confidence [36]. Due to the gender binary assumption, automatic genderization tools can harm non-binary individuals as well as individuals with a limited ability to appear and be treated as their preferred gender [37].

Third, both the applicability and the accuracy of the techniques are imperfect. Restricted applicability might bias the conclusions of a study, since they are based only on the data that the tools could analyze. Moreover, applicability and accuracy can be even lower for some subcommunities, for example, for Chinese names, where some of the gender-specific information is lost during romanization.

All these reasons can lead to a tool assigning an individual a gender that they do not agree with (e.g., because they do not want to disclose it, because their gender cannot be identified by the tool, or because the tool is imprecise), a problem known as misgendering, which can be seen as a form of verbal violence [26]. This is why we believe that (a) automated techniques should never be applied at the level of an individual subject but only at the level of large groups, (b) the techniques used should not show unequal performance on specific groups (e.g., if we know that name-to-gender techniques underperform on Asian names, conclusions based on applying these techniques to Asian names might be wrong), and (c) one has to continuously reflect on the potential risks of applying these techniques.

Summary

Automated tools are necessary when analyzing large-scale data. When using the tools, one should never apply them at the level of an individual subject, but only at the level of large groups, and either ensure that performance of the tools is equal across different subpopulations or recognize unequal performance as a threat to validity of the conclusions derived.

Beyond Software Engineering

Several of the insights discussed previously can also be applied outside the realm of software engineering. As the guidelines related to interviews and surveys are borrowed from the field of Human-Computer Interaction, they can be expected to be applicable to any interview or survey looking to collect information about gender. Similarly, the techniques discussed in the context of mining software repositories are applicable to the analysis of any large-scale archival data, ranging from social media sites such as Twitter and Facebook [20] to corpora of scientific publications [23], from a movie-related knowledge-sharing platform [25] to museum catalogs [44], and from Wikipedia [1] to collections of crowd-sourced recommendations [9]. Applying these techniques beyond software engineering might, however, require rethinking the aforementioned limitations and concerns, as their relevance and importance might depend on the application domain.

Summary

The aforementioned techniques can be applied beyond software engineering, but doing so might require carefully rethinking their limitations and the associated concerns.

Conclusions

Gender and gender diversity are popular topics in contemporary software engineering research. To conduct this research, one has to identify the gender of the individuals involved. To this end, we have discussed two large groups of techniques for identifying a contributor’s gender: asking questions and applying algorithmic tools. Neither group of techniques is perfect: questionnaires do not scale, and algorithmic tools guessing gender from GitHub information assume a gender binary. The choice of technique should, of course, be made in function of the research questions one is trying to answer. However, it might be equally important to discuss the limitations and problems of the chosen techniques (and not only the advantages that made us choose them in the first place).

Summary

  • For interview studies and small surveys, ask an open question: “Where do you identify on the gender spectrum?”

  • For larger surveys, use the same phrasing and the following five checkboxes: “woman,” “man,” “non-binary,” “prefer not to disclose,” and “prefer to self-describe ____.”

  • When mining repositories, evaluate name-to-gender and face-to-gender tools, and either ensure that the performance of the tools is equal across different subpopulations or recognize unequal performance as a threat to the validity of the conclusions derived.