Surveys and analyses of repository data have repeatedly reported that women are underrepresented among contributors to open source. In this chapter we take a fresh, comprehensive look at the representation of women in open source, focusing on historical trends among infrastructure projects – the libraries and packages indexed by popular package managers that so much of the world relies on. We start by compiling and synthesizing existing empirical data from the literature and then use an automatic name-based gender inference technique to capture population-level trends across 20 open source package manager ecosystems. Our results reveal a promising upward trend in the percentage of women among both highly active (“core”) and general repository contributors over time, but also high variation in the percentage of women contributors across ecosystems. The chapter is based on a short paper we presented at ICSE SEIS 2023 [44].

Introduction

The economic value and importance of open source software (OSS) to the economy and society as a whole are, by now, well recognized. Companies big and small, nonprofits, government entities, scientists, students, and hobbyists all use OSS libraries and packages [18]. To maintain all this digital infrastructure, a constant supply of effort is needed, often by volunteers, to fix bugs, patch vulnerabilities, and implement new features. Prior research has repeatedly shown that the availability of this effort should not be taken for granted – open source contributors can choose to disengage at any time for a variety of reasons [37], and even widely used, popular projects can end up being maintained by no one at all [4, 12].

Among the challenges to open source software sustainability, low gender diversity is particularly problematic because it deprives teams of benefits they would otherwise gain. The problem goes beyond social justice: there is plenty of evidence demonstrating the benefits of a gender-diverse team. For example, evidence shows that gender-diverse teams in public code collaboration can be more productive and exhibit fewer community smells [11, 46]. One reason behind the better performance is that men and women tend to display different personalities [49], and leveraging personality traits that are associated with better team performance can lead to more successful teams [59]. At the same time, a diverse team can better understand the needs of its users, who are often diverse themselves [38].

Practitioners and researchers have been working on solving the problem of low gender diversity in OSS. Many studies and reports in the past two decades showed low representation of women in OSS (see the section “Related Work” for a review). Active research areas include identifying roles women play in OSS development [54], detecting barriers that women face when entering OSS [17], and quantifying biases women face when making contributions [53]. In practice, there are initiatives to remove barriers for women and to create more inclusive communities, such as Open Source Diversity,Footnote 1 Outreachy,Footnote 2 and Rails Girls Summer of Code.Footnote 3

There have been many attempts to assess gender representation in the open source software community. Although prior studies agree that the overall fraction of women is low, the reported percentages vary widely, possibly because of unrepresentative samples, different subpopulations, different time periods, or different methods. In this chapter, we add one large-scale study to the literature that holds the method fixed while looking over time and across ecosystems. It is a descriptive study that reports the representation of women in OSS sliced along three dimensions. We first slice the data over time to show how the gender distribution evolves. Then we slice the data by ecosystem, since each ecosystem has different management practices [6]. Finally, we segment the population vertically to analyze women’s representation among core contributors, those who are more experienced and responsible for the majority of the contributions [28].

When investigating gender distribution, we followed many previous studies [48, 57] and used automated gender inference tools to infer genders based on the information disclosed by contributors, oftentimes names. These methods have certain known limitations and biases, including the imperfect accuracy and the assumption of binary gender, which does not reflect the current perception of gender [50]. We are aware that the use of the inference on individuals can be harmful [26, 29]. Therefore, our study only uses name-based gender inference on the population level and treats the results as only an approximation of the real situation [35].

Related Work

Automatic Gender Inference Tools

Researchers have explored various techniques to automatically infer the gender of individuals. This section discusses the approaches applicable to our GitHub source data. Note that all classifiers discussed here assume binary gender, and their benchmarks likewise contain only binary gender data.

Appearance-based gender inference has been extensively studied in the field of computer vision, where many classifiers can achieve an accuracy higher than 90%, even 99% [2] or nearly 100% [62]. However, a large number of GitHub users are using default profile pictures, and there is no guarantee that a contributor’s profile picture is a picture of themselves. Hence, we did not use appearance-based inference because the results would be very unreliable.

Researchers have also explored text-based gender inference, which relies on vocabulary and frequency of words [34] and even style markers and structural characteristics [13]. However, the text available to us on GitHub, such as commit messages, is usually short, and the accuracy of this technique on such short texts is low.

To the best of our knowledge, name-based gender inference is the most commonly used approach in the software research community. Certain tools perform the inference based on only an individual’s first name. For example, Gender-guesser is a Python package that uses the first name to assign “unknown,” “andy” (androgynous), “male,” “female,” “mostly_male,” or “mostly_female” to an individual. In comparison, several tools incorporate one’s geolocation or cultural origin into their inference. For example, both Namsor and NameAPI are paid services that infer one’s cultural origin based on their last name. Based on benchmark evaluations by Santamaría and Mihaljević [50] and Sebo [51], Gender API and Namsor are the most accurate tools with accuracy higher than 90%. Thus, we pick Namsor as our gender inference tool.

Researchers have started reflecting on the negative impact of automatic binary gender inference tools. Hamidi et al. [26] criticized the tools’ assumption of binary gender as “gender reductionism.” We acknowledge and agree that the limitation also exists in name-based gender inference, including ours, and caution against using such technology to make individual-level inferences. As we argued previously, we only make population-level inferences to get a general sense of global trends and differences among ecosystems.

Gender Distribution from Prior Studies

With rising awareness of the low gender diversity problem, many studies have attempted to estimate the gender composition of the OSS community. Although all studies report a low percentage of women contributors, the numbers vary widely, ranging from 1% to 12%. Building on the overview of women ratios across years by Trinkenreich [55], we provide an overview of the results reported by prior studies, grouped by method.

Surveys: The first section of Table 14-1 lists the studies that rely on survey data to measure gender distribution. Surveys can capture people’s self-identified gender and arguably increase the precision of gender identification [36]. However, survey data, albeit more reliable and accurate, are prone to selection bias [5]. Moreover, survey samples are usually small, making it hard to generalize.

Table 14-1 Women ratios in prior works grouped by data sources and methods

Mining software repositories: The second section of Table 14-1 lists the studies that rely on data mining to report gender distribution. In these quantitative studies, researchers often need to infer gender because not all platforms collect users’ gender and not all users disclose their gender online. Thus, using automatic gender inference tools has become common practice. Despite its limitations, gender inference based on mined user information provides a more representative, larger-scale sample than the survey approach. It also removes the burden on survey respondents and the effort needed to collect survey results.

Ecosystems: The last section of Table 14-1 lists studies that report gender ratios in specific software ecosystems. The percentages of women range from 0% (Whamcloud) to 10% (OmapZoom) [7]. However, to the best of our knowledge, there is not a study that covers all major ecosystems, and many of the previous studies focus on a selection of projects rather than the entire ecosystem.

Methods

To conduct an ecosystem-level census, we retrieved the list of projects in the 20 largest package managers on libraries.io,Footnote 6 a service that collects data on open source packages. We selected only the 20 largest of the 38 package managers in total: because our automatic gender inference is not perfect and can be used only as a population-level approximation, results in smaller ecosystems can fluctuate and become unreliable. For contribution data, we used GHTorrent [25], which provides trace data from GitHub between January 2008 and March 2021. However, we note the limitation that the data between June and December 2019 are missing.

Data Processing Pipeline

Extracting the list of open source infrastructural projects: We consider a GitHub project that is registered at libraries.io to be an OSS project. Using the January 12, 2020, version of the dataset from libraries.io, which consists of entries of open source projects registered by that date, we parsed out 1,550,273 unique, valid projects that can be found on GHTorrent.

Collecting contributions: For reasons of data traceability, we consider only commits, of both code and documentation, as contributions. We acknowledge that this simplification neglects contributions such as management, advocacy, and mentorship [54, 55]. However, many of these non-code activities are either untraceable or hard to quantify. Therefore, for now, we focus only on tractable contributions.

De-aliasing user entries: Because developers sometimes use different accounts when authoring commits in a project, we perform identity merging through a set of heuristic rules to ensure that we do not over-count users. Our de-aliasing method relies on user-level information, for example, emails and names [19, 61].Footnote 7 For example, the commits of two accounts are most likely credited to one author if the accounts use the same email and similar names (some or all parts of the names are the same, possibly in a different order) or if they use the same name and similar emails (the emails contain part of the name).
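The two example rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the rule set actually used in the study: the helper names (`name_tokens`, `same_person`, `merge_identities`), the reading of "similar names" as "sharing at least one name part", and the three-character minimum for matching a name part inside an email are all hypothetical choices made for the example.

```python
import re
from itertools import combinations


def name_tokens(name):
    """Lowercased alphabetic parts of a display name, order ignored."""
    return frozenset(re.findall(r"[a-z]+", name.lower()))


def similar_names(a, b):
    """Assumption: names are 'similar' when they share at least one part."""
    return bool(name_tokens(a) & name_tokens(b))


def email_contains_name(email, name):
    """True when the email's local part contains some part of the name."""
    local = email.lower().split("@", 1)[0]
    return any(len(tok) >= 3 and tok in local for tok in name_tokens(name))


def same_person(acc_a, acc_b):
    """Apply the two heuristics to a pair of (name, email) accounts."""
    name_a, email_a = acc_a
    name_b, email_b = acc_b
    # Rule 1: same email and similar names.
    if email_a.lower() == email_b.lower() and similar_names(name_a, name_b):
        return True
    # Rule 2: same name (parts in any order) and both emails contain part of it.
    if name_tokens(name_a) and name_tokens(name_a) == name_tokens(name_b):
        return (email_contains_name(email_a, name_a)
                and email_contains_name(email_b, name_b))
    return False


def merge_identities(accounts):
    """Group accounts into identities with a small union-find."""
    parent = list(range(len(accounts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(accounts)), 2):
        if same_person(accounts[i], accounts[j]):
            parent[find(i)] = find(j)

    groups = {}
    for i, acc in enumerate(accounts):
        groups.setdefault(find(i), []).append(acc)
    return list(groups.values())


# Made-up accounts: the first three should collapse into one identity.
accounts = [
    ("Ada Lovelace", "ada@example.org"),
    ("Lovelace Ada", "ada@example.org"),
    ("Ada Lovelace", "a.lovelace@example.com"),
    ("Grace Hopper", "grace@example.org"),
]
identities = merge_identities(accounts)
```

The pairwise comparison is quadratic and is used here only to keep the sketch short; at the scale of the dataset described in this section, candidate pairs would presumably first be grouped by email or by name.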

Removing bots: To reduce the impact of bot contribution, we manually evaluate the activity of all users who made at least 1,000 commits in each ecosystem [16]. We found 511 unique bot accounts, which made 5,828,940 commits in total.
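The bot check itself is manual, but the shortlist of accounts to inspect can be produced mechanically. The sketch below assumes commits are available as `(ecosystem, user)` pairs; the function name `review_candidates` and the sample data are hypothetical, and only the 1,000-commit threshold comes from the text.

```python
from collections import Counter

BOT_REVIEW_THRESHOLD = 1000  # commits per ecosystem, as stated in the text


def review_candidates(commits, threshold=BOT_REVIEW_THRESHOLD):
    """Return {ecosystem: [users]} for users with at least `threshold` commits there.

    `commits` is an iterable of (ecosystem, user) pairs, one per commit.
    """
    counts = Counter(commits)
    shortlist = {}
    for (ecosystem, user), n in counts.items():
        if n >= threshold:
            shortlist.setdefault(ecosystem, []).append(user)
    return {eco: sorted(users) for eco, users in shortlist.items()}


# Made-up data: only the high-volume npm account is shortlisted for manual review.
commits = ([("npm", "release-bot")] * 1200
           + [("npm", "alice")] * 40
           + [("PyPI", "release-bot")] * 300)
candidates = review_candidates(commits)
```

Accounts on the shortlist are candidates only; whether an account is a bot is still decided by manually inspecting its activity.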

Aggregation granularity: To study how women’s participation changes over time, we aggregate data into three-month windows, which ensures sufficient interactions among contributors, since activity on GitHub is sparser than in companies. We treat windows with fewer than 30 contributors whose gender can be inferred as having no activity, because with so few contributors the percentage of women can spike and become an outlier in the data.
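As a rough illustration of this aggregation step, the sketch below buckets commit authors into three-month windows and suppresses windows with fewer than 30 gender-inferred contributors. It assumes that windows align with calendar quarters (consistent with the January–March 2021 window reported later) and that gender labels are available as a plain dictionary; the function names and the toy data are hypothetical.

```python
from collections import defaultdict
from datetime import date

MIN_INFERRED = 30  # minimum contributors with inferred gender per window


def quarter_of(day):
    """Map a date to its three-month window, e.g. (2021, 1) for Jan-Mar 2021."""
    return (day.year, (day.month - 1) // 3 + 1)


def women_share_by_window(commits, gender_of, min_inferred=MIN_INFERRED):
    """Compute the share of women per window.

    commits:   iterable of (author, date) pairs, one per commit
    gender_of: dict mapping author -> "woman", "man", or missing for Unknown
    """
    authors = defaultdict(set)
    for author, day in commits:
        authors[quarter_of(day)].add(author)  # count each author once per window

    shares = {}
    for window, people in sorted(authors.items()):
        women = sum(1 for p in people if gender_of.get(p) == "woman")
        men = sum(1 for p in people if gender_of.get(p) == "man")
        if women + men < min_inferred:
            shares[window] = None  # treated as a window with no activity
        else:
            shares[window] = women / (women + men)
    return shares


# Toy data: 38 men and 2 women active in Q1 2021, only two people in Q4 2020.
gender_of = {f"m{i}": "man" for i in range(38)}
gender_of.update({f"w{i}": "woman" for i in range(2)})
commits = [(author, date(2021, 2, 1)) for author in gender_of]
commits += [("m0", date(2020, 11, 5)), ("w0", date(2020, 12, 24))]
shares = women_share_by_window(commits, gender_of)
```

In this toy example the Q4 2020 window is suppressed: with only one man and one woman it would otherwise report a 50% share of women, which is exactly the kind of outlier the 30-contributor minimum is meant to avoid.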

Identifying core contributors: Adapting the validated count-based methods of Joblin et al. [28] and Bosu et al. [7], we identified core contributors in the following way. For each ecosystem, within each three-month window, we first identified the projects whose number of commits ranked in the top 10% of the ecosystem. Then, within each of these top projects, we identified the project’s core developers as those who made more than 10% of the commits within that three-month window. In summary, in our analysis, a core contributor makes more than 10% of the commits in a project whose number of commits ranks in the top 10% of its ecosystem. We are specifically interested in core developers because they typically take on more responsibility for a project’s code contributions.
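The two-step rule can be illustrated with a short Python sketch for a single ecosystem and a single three-month window. It is a simplified reading of the description above rather than the study's implementation: the function name, the rounding of "top 10%" up to a whole number of projects, and the alphabetical tie-breaking are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def core_contributors(commits, top_share=0.10, min_commit_share=0.10):
    """Identify core contributors for ONE ecosystem within ONE window.

    commits: iterable of (project, author) pairs, one per commit.
    Returns {project: set of core authors} for the top projects only.
    """
    per_project = defaultdict(Counter)
    for project, author in commits:
        per_project[project][author] += 1

    totals = {p: sum(c.values()) for p, c in per_project.items()}

    # Step 1: projects whose commit count ranks in the top 10%.
    # Assumption: round up to a whole number of projects; ties broken by name.
    n_top = math.ceil(top_share * len(totals))
    ranked = sorted(totals, key=lambda p: (-totals[p], p))

    # Step 2: authors with MORE than 10% of a top project's commits.
    core = {}
    for project in ranked[:n_top]:
        cutoff = min_commit_share * totals[project]
        core[project] = {a for a, n in per_project[project].items() if n > cutoff}
    return core


# Toy ecosystem: one large project and nine small ones.
commits = ([("big", "alice")] * 60 + [("big", "bob")] * 30
           + [("big", "carol")] * 5 + [("big", "dave")] * 5)
for i in range(9):
    commits += [(f"small{i}", "erin")] * 3
core = core_contributors(commits)
```

With ten projects, only the single most active one counts as a top project, and only alice and bob exceed 10% of its 100 commits. Note that erin authors every commit in the small projects but is not a core contributor under this definition, because those projects fall outside the top 10%.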

Gender Inference

Of the 45,838,860 GitHub users in GHTorrent, 53.65% do not provide a name, and 3.84% are organizational accounts. We label these users’ gender as Unknown. We also label users whose names have more than four parts (71,367, or 0.16%) as Unknown, since a manual check showed that most of them are names of organizations. We preprocess the remaining users’ names by removing punctuation, common titles or prefixes, emails, and URLs.
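A minimal sketch of this filtering and cleaning step might look as follows. The regular expressions, the short `TITLES` list, and the function name `clean_name` are illustrative assumptions; the chapter does not specify the exact patterns or the list of titles that were removed.

```python
import re

TITLES = {"dr", "mr", "mrs", "ms", "prof", "phd"}  # illustrative, not exhaustive
MAX_NAME_PARTS = 4  # names with more parts are labeled Unknown


def clean_name(raw):
    """Return a cleaned name, or None when the user should be labeled Unknown."""
    if not raw or not raw.strip():
        return None  # no name provided
    if len(raw.split()) > MAX_NAME_PARTS:
        return None  # most likely an organization name
    text = re.sub(r"https?://\S+|www\.\S+", " ", raw)  # strip URLs
    text = re.sub(r"\S+@\S+", " ", text)               # strip email addresses
    text = re.sub(r"[^\w\s]", " ", text)               # strip punctuation
    parts = [p for p in text.split() if p.lower() not in TITLES]
    return " ".join(parts) if parts else None


examples = {
    "Dr. Ada Lovelace": clean_name("Dr. Ada Lovelace"),
    "Grace Hopper <grace@example.org>": clean_name("Grace Hopper <grace@example.org>"),
    "The Example Open Source Software Foundation":
        clean_name("The Example Open Source Software Foundation"),
}
```

One side effect of this simple punctuation rule is that hyphenated names are split into separate parts (for example, "Jean-Luc" becomes "Jean Luc"); whether that is desirable depends on how the downstream inference tool handles compound first names.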

Then, we infer the gender of each user with Namsor [9], one of the name-based gender inference tools with the highest accuracy [50, 51]. The tool makes inferences based on the first name and the cultural origin of the last name.

Namsor also provides a confidence level that a user’s gender is correctly identified. We label users whose gender inference confidence is lower than 0.7 as Unknown. Removing low-confidence inferences increases the overall accuracy of our gender classification, yet setting a high confidence threshold reduces the size of our data. Thus, we choose 0.7 as the threshold, which retains 83.81% of the gender data. Of the 1,823,414 users who have contributed to OSS projects, 911,990 (50.02%) are labeled as men and 54,859 (3.01%) as women. To reduce the effect of users with Unknown gender on our results, we calculate the fraction of women as

$$ \frac{\text{Number of women contributors}}{\text{Number of women contributors} + \text{Number of men contributors}} $$
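The thresholding and the fraction above can be expressed compactly in Python. The sketch assumes that each inference is available as a plain `(gender, confidence)` pair; it does not call Namsor, and the function names and sample values are hypothetical. The last line applies the same formula to the aggregate counts reported above.

```python
CONFIDENCE_THRESHOLD = 0.7  # inferences below this are labeled Unknown


def label(inferred_gender, confidence, threshold=CONFIDENCE_THRESHOLD):
    """Keep an inferred gender only when its confidence reaches the threshold."""
    return inferred_gender if confidence >= threshold else "unknown"


def women_fraction(labels):
    """Women / (women + men); users labeled Unknown are left out entirely."""
    women = sum(1 for g in labels if g == "woman")
    men = sum(1 for g in labels if g == "man")
    return women / (women + men) if women + men else None


# Made-up inferences: the fourth one falls below the threshold.
inferences = [("woman", 0.95), ("man", 0.88), ("man", 0.71),
              ("woman", 0.55), ("man", 0.99)]
labels = [label(gender, conf) for gender, conf in inferences]
sample_fraction = women_fraction(labels)

# The same formula applied to the aggregate counts reported in the text.
overall = 54_859 / (54_859 + 911_990)
```

Because the aggregate figure counts every contributor once over the whole observation period, it need not match the per-window percentages shown in the figures, which only count contributors active in a given three-month window.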

Results

Gender Distributions in OSS and Different Ecosystems

Figure 14-1 shows the overall gender distribution in OSS libraries and its evolution over time. Overall, the percentage of women has been consistently low – no higher than 5.0%. Moreover, the percentage of women among all contributors in OSS projects is lower than that among core contributors.

Looking at the gender distributions in the 20 most popular OSS ecosystems and their evolution, we observed different patterns in different ecosystems. Due to space limits, we display plots from only four representative ecosystems in Figure 14-2: npm, CRAN, PlatformIO, and CPAN.

Figure 14-1 Gender representation in OSS contribution overall. The gray bar covers the period where GHTorrent has missing data. (Stacked bars show the number of contributors over time from 2008 to 2021, split into unknown, men, and women; two lines show the percentage of women among all contributors and among core contributors. Both bars and lines rise until 2019 and then decline.)

For more figures, please visit our GitHub page.Footnote 8

Figure 14-2a shows the trend in the percentage of women in the npm ecosystem. This pattern is representative of many ecosystems, such as PyPI, Bower, and Go. Although the overall percentage of women has remained low throughout (lower than 6%), it has increased steadily over time.

While most ecosystems exhibit increasing women percentage, the numbers are all lower than 10%, with the exception of CRAN, which reached 10.02% in 2021 (Figure 14-2b). CRAN is the package manager for the R programming language, which is widely used among academic researchers. The higher women percentage in CRAN may be due to the fact that the population of R users is more diverse because they come from various disciplines other than computer science [6].

Moreover, as shown in Figures 14-2c and 14-2d, PlatformIO and CPAN display a puzzling periodicity and minimal growth over the years. This pattern may be due to the fact that PlatformIO is one of the smaller ecosystems in our dataset, so a small change in team composition can result in a large fluctuation. This also explains why we chose to present results only for the 20 largest ecosystems: the smaller the ecosystem, the more likely it is to be influenced by small changes.

Figure 14-2 Women distributions overall and in selected ecosystems. Gray bars cover the period with missing data on GHTorrent. (Four panels plot the women ratio over time, among all contributors and among core contributors: (a) npm (JavaScript), (b) CRAN (R), (c) PlatformIO (C), and (d) CPAN (Perl). The lines in panels a and b follow an increasing trend; the lines in panels c and d fluctuate.)

For most ecosystems, the percentage of women exhibited an uphill pattern and reached its peak between 2018 and 2021. However, some languages commonly used for system programming – Perl, Rust, and C++ – reached their maximum percentage before 2014. Table 14-2 shows the percentages of women at the end of our data (January–March 2021) and the window during which the maximum percentage of women contributors occurred.

Gender Distributions Among Core Contributors

The percentage of women started at 2.13% among core contributors and 2.25% among all contributors in 2008 and grew steadily through 2021. While the percentage among all contributors was higher than that among core contributors in 2008, the difference between them had shrunk to less than 0.01% by 2014. Between 2014 and 2021, the percentage of women among core contributors surpassed that among all contributors, leaving a slight but approximately stable margin of 0.3%.

Lastly, comparing the percentage of women among core contributors with that among all contributors in 2021 in Table 14-2, we noticed that in most ecosystems the percentage among core contributors is higher than that among all contributors, with a few exceptions such as Meteor, Pub, Cargo, and Hex, which have a very small number of women contributors overall.

Table 14-2 Women’s participation by package managers (sorted by the number of projects)

Main Takeaways

Gender diversity is improving. We observed a slow but steady increase in women’s participation in open source infrastructural projects. Our observation agrees with prior findings [41, 48]. The increasing trend is also observed in most of the ecosystems. While the reasons behind this change over time are beyond the scope of our study, we speculate that some of the past efforts to encourage and support marginalized groups in OSS have taken effect.

Gender distributions vary across ecosystems. Specifically, many ecosystems related to web development, especially front end, for example, Meteor and RubyGems, have higher women percentages. In comparison, several ecosystems related to system programming, for example, CPAN and PlatformIO, have lower gender diversity. Our finding agrees with Vasarhelyi et al.’s finding [56] that contributors in front-end programming languages are more likely to be women.

There are more women among core contributors in big open source projects. When computing the percentage of women among core contributors, we focused only on the biggest projects, those whose commit counts rank in the top 10% of their ecosystem. Among these projects, the percentage of women among core contributors is higher than that among all contributors.

Open Research Questions

Reasons behind the increase: While our analysis and several recent studies [41, 48] reported a similar trend of increasing percentage of women among open source contributors, we do not yet understand how this has happened. Is it by chance or because some prior diversity efforts have been effective? Are hackathons [39], coding camps [43], or conferences effective in attracting and retaining women contributors? Future research can analyze the reasons behind the increased women’s percentage and reflect on the outcome of prior efforts to improve diversity. Such studies can inform the design and deployment of future diversity and inclusion activities.

Ecosystem difference: Our study provides another piece of evidence that the differences in gender representation could be due to the functions of the programming languages. However, more in-depth and targeted studies are needed to test the speculation or provide a reasonable explanation. Is the disparity due to the nature of the programming languages or some community practices?

A fine-grained examination of women’s representation across open source: Although our analysis found differences in gender representation across ecosystems and levels of contribution, there are more ways to slice the data and pinpoint the places with skewed gender distributions. For example, we examined the percentage of women among core contributors in big projects and found that it is higher than that among all contributors. This differs from a prior result in which the percentage of women among core contributors was much lower than that among all contributors [8]. Future studies can further investigate the relationship between gender distribution and project size. Our study also did not investigate non-code contributions. Future researchers can consider adding contributors who only contributed to issue discussions. There are also non-code contributions that are not visible on social coding platforms. Quantifying the gender distribution among these hidden contributors is an open research question.