How Much Do Women Build Open Source Infrastructure?

Qiu, Huilian Sophie; Zhao, Zihe H; Yu, Tielin Katy; Dabbish, Laura; Vasilescu, Bogdan

doi:10.1007/978-1-4842-9651-6_14



Abstract


There have been many past reports of women being underrepresented among contributors to open source, from surveys and analyses of repository data. In this chapter we take a fresh, comprehensive look at the representation of women in open source, focusing on historical trends among infrastructure projects – the libraries and packages indexed by popular package managers that so much of the world relies on. We start by compiling and synthesizing existing empirical data from the literature and then use an automatic name-based gender inference technique to estimate population-level gender representation across 20 open source package manager ecosystems. Our results reveal a promising upward trend in the percentage of women among both highly active (“core”) and general repository contributors over time, but also high variation in the percentage of women contributors across ecosystems. The chapter is based on a short paper we presented at ICSE SEIS 2023 [44].

Introduction

The economic value and importance of open source software (OSS) to the economy and society as a whole are, by now, well recognized. Companies big and small, nonprofits, government entities, scientists, students, and hobbyists all use OSS libraries and packages [18]. To maintain all this digital infrastructure, a constant supply of effort is needed, often by volunteers, to fix bugs, patch vulnerabilities, and implement new features. Prior research has repeatedly shown that the availability of this effort should not be taken for granted – open source contributors can choose to disengage at any time for a variety of reasons [37], and even widely used, popular projects can end up being maintained by no one at all [4, 12].

Among the challenges to open source software sustainability, low gender diversity is particularly problematic because it deprives teams of benefits they could otherwise have. It is more than a problem of social justice: there is plenty of evidence demonstrating the benefits of having a gender-diverse team. For example, evidence shows that having a gender-diverse team in public code collaboration could enhance productivity and lower community smells [11, 46]. One reason behind the better performance is that men and women tend to display different personalities [49]. Leveraging positive personality traits that are associated with better team performance can lead to more successful teams [59]. At the same time, a diverse team can better understand the needs of their users, which are often diverse [38].

Practitioners and researchers have been working on solving the problem of low gender diversity in OSS. Many studies and reports in the past two decades showed low representation of women in OSS (see the section “Related Work” for a review). Active research areas include identifying roles women play in OSS development [54], detecting barriers that women face when entering OSS [17], and quantifying biases women face when making contributions [53]. In practice, there are initiatives to remove barriers for women and to create more inclusive communities, such as Open Source Diversity,^{Footnote 1} Outreachy,^{Footnote 2} and Rails Girls Summer of Code.^{Footnote 3}

There have been many attempts to assess the gender representation in the open source software community. Although prior studies reached a general agreement on the overall low fraction of women in the population, the reported percentages have a high variance, the possible causes of which could be unrepresentative samples, different subpopulations, different time periods, or different methods. In this chapter, we add one large-scale study to the literature while holding the method fixed and looking over time and across ecosystems. This chapter is a descriptive study that reports the representation of women in OSS, sliced along three dimensions. We first slice data over time to show how the gender distribution evolves. Then we slice data by ecosystem, since each of them has different management practices [6]. We also segment the population vertically to analyze women’s distribution among core contributors, those who are more experienced and responsible for the majority of the contributions [28].

When investigating gender distribution, we followed many previous studies [48, 57] and used automated gender inference tools to infer genders based on the information disclosed by contributors, oftentimes names. These methods have certain known limitations and biases, including the imperfect accuracy and the assumption of binary gender, which does not reflect the current perception of gender [50]. We are aware that the use of the inference on individuals can be harmful [26, 29]. Therefore, our study only uses name-based gender inference on the population level and treats the results as only an approximation of the real situation [35].

Related Work

Automatic Gender Inference Tools

Researchers have explored various techniques to automatically infer gender of individuals. This section discusses the approaches available to our GitHub source data. Note that all classifiers here assume binary gender, and their benchmarks also consist of only data of binary gender.

Appearance-based gender inference has been extensively studied in the field of computer vision, where many classifiers can achieve an accuracy higher than 90%, even 99% [2] or nearly 100% [62]. However, a large number of GitHub users are using default profile pictures, and there is no guarantee that a contributor’s profile picture is a picture of themselves. Hence, we did not use appearance-based inference because the results would be very unreliable.

Researchers have also explored text-based gender inference, which relies on vocabulary and frequency of words [34] and even style markers and structural characteristics [13]. However, our text pieces on GitHub, such as commit messages, are usually short, and the accuracy of this technique is low.

To the best of our knowledge, name-based gender inference is the most commonly used approach in the software research community. Certain tools perform the inference based on only an individual’s first name. For example, Gender-guesser is a Python package that uses the first name to assign “unknown,” “andy” (androgynous), “male,” “female,” “mostly_male,” or “mostly_female” to an individual. In comparison, several tools incorporate one’s geolocation or cultural origin into their inference. For example, both Namsor and NameAPI are paid services that infer one’s cultural origin based on their last name. Based on benchmark evaluations by Santamaría and Mihaljević [50] and Sebo [51], Gender API and Namsor are the most accurate tools with accuracy higher than 90%. Thus, we pick Namsor as our gender inference tool.

Researchers have started reflecting on the negative impact of automatic binary gender inference tools. Hamidi et al. [26] criticized the tools’ assumption of binary gender as “gender reductionism.” We acknowledge and agree that the limitation also exists in name-based gender inference, including ours, and caution against using such technology to make individual-level inferences. As we argued previously, we only make population-level inferences to get a general sense of global trends and differences among ecosystems.

Gender Distribution from Prior Studies

With rising awareness of the low gender diversity problem, many studies have attempted to estimate the gender composition in the OSS community. Although all studies report a low percentage of women contributors, these numbers have wide variation ranging from 1% to 12%. Building on the overview of women ratios across years by Trinkenreich [55], we provide an overview of the results reported by prior studies grouped by methods.

Surveys: The first section of Table 14-1 lists the studies that rely on survey data to measure gender distribution. Surveys can capture people’s self-identified gender and arguably increase the precision of gender identification [36]. However, survey data, albeit more reliable and accurate, are prone to selection bias [5]. Moreover, survey samples are usually small, making it hard to generalize.

Table 14-1 Women ratios in prior works grouped by data sources and methods


Mining software repositories: The second section of Table 14-1 lists the studies that rely on data mining to report gender distribution. In these quantitative studies, researchers often need to infer gender because not all platforms collect users’ gender and not all users disclose their genders online. Thus, using automatic gender inference tools has become common practice. Despite the limitations, gender inference based on mined user information provides a more representative, larger-scale sample than the survey approach. It also eliminates the burden on the survey respondents and the efforts taken to collect survey results.

Ecosystems: The last section of Table 14-1 lists studies that report gender ratios in specific software ecosystems. The percentages of women range from 0% (Whamcloud) to 10% (OmapZoom) [7]. However, to the best of our knowledge, no study covers all major ecosystems, and many of the previous studies focus on a selection of projects rather than the entire ecosystem.

Methods

To conduct an ecosystem-level census, we retrieved the list of projects in the 20 largest package managers on libraries.io,^{Footnote 6} a service collecting data of open source packages. We selected only the 20 biggest package managers out of the total 38: because our automatic gender inference is not perfect and can be used only as a population-level approximation, results in smaller ecosystems can fluctuate and become unreliable. We used data from GHTorrent [25], which provides trace data from GitHub between January 2008 and March 2021. However, we note the limitation that the data between June and December 2019 are missing.

Data Processing Pipeline

Extracting the list of open source infrastructural projects: We consider a GitHub project that is registered at libraries.io as an OSS project. Using the January 12, 2020, version of the dataset from libraries.io, which consists of entries of open source projects registered by that date, we parsed out 1,550,273 unique, valid projects that can be found on GHTorrent.

Collecting contributions: Due to data traceability, we consider only commits, both code and documentation, as contributions. We acknowledge that this simplification neglects contributions such as management, advocacy, and mentorship [54, 55]. However, many of these non-code activities are either untraceable or hard to quantify. Therefore, at this moment, we focus on only tractable contributions.

De-aliasing user entries: Because developers sometimes use different accounts when authoring commits in a project, we perform identity merging through a set of heuristic rules to ensure that we do not over-count users. Our de-aliasing method relies on user-level information, for example, emails and names [19, 61].^{Footnote 7} For example, the commits of two accounts are most likely credited to one author if the accounts use the same email and similar names (some or all name parts are the same, possibly in a different order) or the same name with similar emails (the emails contain part of the name).
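The two example rules can be illustrated with a minimal Python sketch. It is not the study's actual de-aliasing code: the account fields (`login`, `name`, `email`), the function names, and the exact matching rules are assumptions made for illustration, and the pairwise comparison is only practical for small inputs.

```python
import re
from collections import defaultdict


def name_tokens(name: str) -> frozenset:
    """Lower-cased name parts, ignoring order and punctuation."""
    return frozenset(re.findall(r"[a-z]+", name.lower()))


def same_person(a: dict, b: dict) -> bool:
    """Hypothetical merge rule loosely following the two cases in the text."""
    tokens_a, tokens_b = name_tokens(a["name"]), name_tokens(b["name"])
    # Case 1: same email and similar names (at least one name part shared).
    if a["email"].lower() == b["email"].lower() and tokens_a & tokens_b:
        return True
    # Case 2: same name parts in any order, and each email's local part
    # contains at least one of those name parts.
    if tokens_a and tokens_a == tokens_b:
        local_parts = [acc["email"].lower().split("@")[0] for acc in (a, b)]
        return all(any(t in local for t in tokens_a) for local in local_parts)
    return False


def merge_identities(accounts: list) -> list:
    """Group account logins into identities with a small union-find."""
    parent = list(range(len(accounts)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(accounts)):
        for j in range(i + 1, len(accounts)):
            if same_person(accounts[i], accounts[j]):
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i, acc in enumerate(accounts):
        groups[find(i)].append(acc["login"])
    return sorted(sorted(group) for group in groups.values())
```

Merging is transitive here: if account A matches B and A matches C, all three end up in one identity even when B and C do not match each other directly.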

Removing bots: To reduce the impact of bot contribution, we manually evaluate the activity of all users who made at least 1,000 commits in each ecosystem [16]. We found 511 unique bot accounts, which made 5,828,940 commits in total.
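The selection of accounts for manual review could look roughly like the following Python sketch. Only the 1,000-commit threshold comes from the text; the input format (one `(ecosystem, author)` pair per commit) and the function names are hypothetical, and the decision whether a flagged account is a bot remains a manual step.

```python
from collections import Counter

REVIEW_THRESHOLD = 1_000  # from the text: at least 1,000 commits in an ecosystem


def accounts_to_review(commits, threshold=REVIEW_THRESHOLD):
    """Return {ecosystem: [authors]} for authors whose commit count reaches the threshold.

    `commits` is an iterable of (ecosystem, author) pairs, one per commit.
    """
    counts = Counter(commits)
    flagged = {}
    for (ecosystem, author), n in counts.items():
        if n >= threshold:
            flagged.setdefault(ecosystem, []).append(author)
    return {eco: sorted(authors) for eco, authors in flagged.items()}


def drop_bot_commits(commits, bots):
    """Keep only commits whose author is not in the manually confirmed bot set."""
    return [(eco, author) for eco, author in commits if author not in bots]
```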

Aggregation granularity: To study how women’s participation changes over time, we aggregate data into three-month windows, which ensures sufficient interactions among contributors since activities on GitHub are more sparse than those in companies. For windows that have fewer than 30 contributors whose genders can be inferred, we treat those windows as having no activity, as the percentage of women might surge and become an outlier in the data.
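A minimal Python sketch of this aggregation step follows. It assumes that the three-month windows align with calendar quarters and that each commit record carries an author, an inferred gender label, and a date; both are assumptions for illustration, as are the function names.

```python
from collections import defaultdict
from datetime import date

MIN_INFERRED = 30  # from the text: windows with fewer are treated as no activity


def window_of(day: date) -> tuple:
    """Map a date to a (year, quarter) three-month window."""
    return (day.year, (day.month - 1) // 3 + 1)


def women_share_by_window(commits):
    """Compute the women fraction per window.

    `commits` is an iterable of (author, gender, day) with gender in
    {'woman', 'man', 'unknown'}. Windows with too few contributors of
    inferred gender map to None.
    """
    people_by_window = defaultdict(dict)  # window -> {author: gender}
    for author, gender, day in commits:
        if gender in ("woman", "man"):
            people_by_window[window_of(day)][author] = gender

    shares = {}
    for window, people in sorted(people_by_window.items()):
        if len(people) < MIN_INFERRED:
            shares[window] = None  # dropped: too few inferred contributors
        else:
            women = sum(1 for g in people.values() if g == "woman")
            shares[window] = women / len(people)
    return shares
```

Contributors are counted once per window regardless of how many commits they made, so a single prolific author does not inflate the percentage.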

Identifying core contributors: Adapting from the validated count-based methods by Joblin et al. [28] and Bosu et al. [7], we identified core contributors in the following way. For each ecosystem, within each three-month window, we first identified projects whose number of commits ranked top 10% in the ecosystem. Then, within each of the top projects, we identified each project’s core developers as those who made more than 10% of the commits within that three-month window. In summary, in our analysis, a core contributor makes more than 10% of the commits in a project whose number of commits ranks top 10% in that ecosystem. We are specifically interested in core developers because cores typically take more responsibility for public project code contribution.
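The two-level rule can be sketched in Python as below, for the commits of a single ecosystem within a single three-month window. The input format and function name are hypothetical, and the text does not say how ties or rounding at the 10% project cut-off are handled; rounding up and keeping at least one project is an assumption of this sketch.

```python
import math
from collections import Counter


def core_contributors(commits, top_share=0.10, commit_share=0.10):
    """Identify core contributors in the top projects.

    `commits` is an iterable of (project, author) pairs, one per commit,
    for one ecosystem and one window. Returns {project: set of core authors}
    for the top projects only.
    """
    commits = list(commits)
    per_project = Counter(project for project, _ in commits)

    # Top 10% of projects by commit count (assumption: round up, at least one).
    n_top = max(1, math.ceil(len(per_project) * top_share))
    top_projects = {project for project, _ in per_project.most_common(n_top)}

    per_author = Counter(c for c in commits if c[0] in top_projects)
    core = {project: set() for project in top_projects}
    for (project, author), n in per_author.items():
        # "More than 10% of the commits" is read as a strict inequality.
        if n > commit_share * per_project[project]:
            core[project].add(author)
    return core
```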

Gender Inference

Of the 45,838,860 GitHub users in GHTorrent, 53.65% do not provide a name, and 3.84% are organizational accounts. We label these users’ gender as Unknown. We also label users whose names have more than four parts (71,367, or 0.16%) as Unknown since a manual check showed that most of them are names of organizations. We preprocess the remaining users’ names by removing punctuation, common titles or prefixes, emails, and URLs.
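As a rough illustration of this preprocessing, the Python sketch below cleans a raw profile name and returns `None` when the user would be labeled Unknown. The title list and the regular expressions are illustrative assumptions, not the rules used in the study; only the four-part limit comes from the text.

```python
import re

TITLES = {"dr", "prof", "mr", "mrs", "ms", "phd"}  # illustrative, not the study's list
MAX_PARTS = 4  # from the text: names with more than four parts are labeled Unknown


def clean_name(raw):
    """Return a cleaned name, or None when the user should be labeled Unknown."""
    if not raw or not raw.strip():
        return None  # no name provided
    if len(raw.split()) > MAX_PARTS:
        return None  # most such names belong to organizations
    text = re.sub(r"https?://\S+|www\.\S+", " ", raw)  # strip URLs
    text = re.sub(r"\S+@\S+", " ", text)                # strip emails
    text = re.sub(r"[^\w\s'-]", " ", text)              # strip punctuation
    parts = [p for p in text.split() if p.lower() not in TITLES]
    return " ".join(parts) if parts else None
```

Hyphens and apostrophes are kept so that names such as Jean-Luc or O'Neil survive cleaning.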

Then, we infer the gender of each user with Namsor [9], one of the name-based gender inference tools with the highest accuracy [50, 51]. The tool makes inferences based on the first name and the cultural origin of the last name.

Namsor also provides a confidence level that a user’s gender is correctly identified. We denote users whose gender inference confidence is lower than 0.7 as Unknown gender. Removing inferences with low confidence can increase the overall accuracy of our gender classification, yet setting a high confidence threshold cuts down our data size. Thus, we choose 0.7 as the threshold to retain 83.81% of the gender data. Of 1,823,414 users who have contributed to OSS projects, 911,990 (50.02%) are labeled as men and 54,859 (3.01%) as women. To reduce the effect of Unknown gender on our result, we calculate women fraction by

$$ \frac{\text{Number of Women Contributors}}{\text{Number of Women + Men Contributors}} $$
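The thresholding and the fraction above can be expressed in a few lines of Python. This is a sketch under assumptions: the `(inferred_gender, confidence)` pairs are a hypothetical input format, not Namsor's actual response format, and the example data are made up.

```python
CONFIDENCE_THRESHOLD = 0.7  # from the text


def label(inferred_gender: str, confidence: float) -> str:
    """Keep an inference only when its confidence is not lower than the threshold."""
    return inferred_gender if confidence >= CONFIDENCE_THRESHOLD else "unknown"


def women_fraction(labels) -> float:
    """Women / (women + men); Unknown labels are left out of the denominator."""
    labels = list(labels)
    women = sum(1 for g in labels if g == "woman")
    men = sum(1 for g in labels if g == "man")
    if women + men == 0:
        raise ValueError("no contributors with an inferred gender")
    return women / (women + men)


# Toy data for illustration only.
inferences = [("woman", 0.95), ("man", 0.90), ("man", 0.71), ("woman", 0.60), ("man", 0.70)]
labels = [label(gender, conf) for gender, conf in inferences]
fraction = women_fraction(labels)
```

In the toy data, one low-confidence inference becomes Unknown, which leaves one woman and three men and a fraction of 0.25.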

Results

Gender Distributions in OSS and Different Ecosystems

Figure 14-1 shows the overall gender distribution in OSS libraries and its evolution over time. Overall, the percentage of women has been constantly low – no higher than 5.0%. Moreover, the percentage of women among all contributors in OSS projects is lower than that among core contributors.

For the gender distributions in the top 20 most popular OSS ecosystems and their evolution, we observed different patterns in different ecosystems. Due to the space limit, we display only plots from four more representative ecosystems in Figure 14-2: npm, CRAN, PlatformIO, and CPAN.

A stacked bar graph and dual-line graph of the number of contributors and women percentage versus time from 2008 to 2021. The bars are all unknown, all men, and all women. The lines are among all and among core. Both bars and lines follow an increasing trend till 2019 and descend. — **Figure 14-1**

For more figures, please visit our GitHub page.^{Footnote 8}

Figure 14-2a shows the trend of women percentage in the npm ecosystem. The pattern of npm’s women percentage change is representative of many ecosystems, such as PyPI, Bower, and Go. Although the overall women percentage has been low all the time (lower than 6%), there is a steady increase over time.

While most ecosystems exhibit increasing women percentage, the numbers are all lower than 10%, with the exception of CRAN, which reached 10.02% in 2021 (Figure 14-2b). CRAN is the package manager for the R programming language, which is widely used among academic researchers. The higher women percentage in CRAN may be due to the fact that the population of R users is more diverse because they come from various disciplines other than computer science [6].

Moreover, as shown in Figure 14-2c and 14-2d, PlatformIO and CPAN display a puzzling periodicity and minimal growth over the years. This pattern can be due to the fact that PlatformIO is a smaller ecosystem in our dataset. As a result, a small change in team composition can result in a large fluctuation. This also explains why we chose to only present results for the 20 larger ecosystems: the smaller the ecosystem is, the more likely it would be influenced by small changes.

A set of four dual-line graphs of the women ratio versus time. The lines are among all and among core. The panels are (a) npm (JavaScript), (b) CRAN (R), (c) PlatformIO (C), and (d) CPAN (Perl). Lines in panels a and b follow an increasing trend. Lines in panels c and d fluctuate. — **Figure 14-2**

For most ecosystems, the percentage of women exhibited an uphill pattern and reached its peak between 2018 and 2021. However, some languages commonly used for system programming – Perl, Rust, and C++ – reached their maximum percentage before 2014. Table 14-2 shows the percentages of women at the end of our data (January–March 2021) and the window during which the maximum percentage of women contributors occurred.

Gender Distributions Among Core Contributors

Starting with women percentage of 2.13% among core contributors and 2.25% among all contributors, the number has been steadily growing between 2008 and 2021. We observed that, while the women percentage among all contributors was higher than among cores in 2008, the difference between them was less than 0.01% in 2014. Between 2014 and 2021, we found that the women percentage among cores has surpassed that among all, leaving a slight but approximately stable margin of 0.3%.

Lastly, comparing the percentage of women among core contributors and among all contributors in 2021 in Table 14-2, we noticed that, in most ecosystems, the percentage among core contributors is higher than that among all contributors, with a few exceptions such as Meteor, Pub, Cargo, and Hex, which have a very small number of women contributors overall.

Table 14-2 Women’s participation by package managers (sorted by the number of projects)


Main Takeaways

The gender diversity is improving. We observed a slow but steadily increasing trend of women’s participation in open source infrastructural projects. Our observation agrees with prior findings [41, 48]. The increasing trend is also observed in most of the ecosystems. While the reasons behind this change over time are beyond the scope of our study, we speculate that some of the past efforts to encourage and support marginalized groups in OSS have taken effect.

Gender distributions vary across ecosystems. Specifically, many ecosystems related to web development, especially front end, for example, Meteor and RubyGems, have higher women percentages. In comparison, several ecosystems related to system programming, for example, CPAN and PlatformIO, have lower gender diversity. Our finding agrees with Vasarhelyi et al.’s finding [56] that contributors in front-end programming languages are more likely to be women.

There are more core women contributors among big open source projects. When computing women’s percentage among core contributors, we focused on only the biggest projects, those whose commit counts rank in the top 10% of their ecosystem. Among these projects, the percentage of women among core contributors is higher than that among all contributors.

Open Research Questions

Reasons behind the increase: While our analysis and several recent studies [41, 48] reported a similar trend of increasing percentage of women among open source contributors, we do not yet understand how this has happened. Is it by chance or because some prior diversity efforts have been effective? Are hackathons [39], coding camps [43], or conferences effective in attracting and retaining women contributors? Future research can analyze the reasons behind the increased women’s percentage and reflect on the outcome of prior efforts to improve diversity. Such studies can inform the design and deployment of future diversity and inclusion activities.

Ecosystem difference: Our study provides another piece of evidence that the differences in gender representation could be due to the functions of the programming languages. However, more in-depth and targeted studies are needed to test the speculation or provide a reasonable explanation. Is the disparity due to the nature of the programming languages or some community practices?

A fine-grained examination of women’s representation across open source: Although our analysis found differences in gender representation across ecosystems and the level of contributions, there are more ways to slice the data and pinpoint the places with more skewed gender distributions. For example, we examined the percentage of women core contributors among big projects and found that the percentage is higher than among all contributors. This is different from a prior result where the percentage of women among core contributors is much lower than that among all contributors [8]. Future studies can further investigate the relationship between gender distributions and project sizes. Our study also did not investigate the non-code contributions. Future researchers can consider adding contributors who only contributed to issue discussions. There are also non-code contributions that are not visible on social coding platforms. Quantifying the gender distribution among these hidden contributors is an open research question.

Notes

Bibliography

Shaosong Ou and Alexander Hars. Working for free? Motivations for participating in open-source projects. International Journal of Electronic Commerce, 6(3):25–39, 2002.
Luís A. Alexandre. Gender recognition: A multiscale decision fusion approach. Pattern Recognition Letters, 31(11):1422–1427, 2010.
Ikram El Asri and Noureddine Kerzazi. Where are females in OSS projects? Socio technical interactions. In Working Conference on Virtual Enterprises, 308–319, Springer, 2019.
Guilherme Avelino, Eleni Constantinou, Marco Tulio Valente, and Alexander Serebrenik. On the abandonment and survival of open source projects: An empirical investigation. In International Symposium on Empirical Software Engineering and Measurement (ESEM), 1–12, IEEE, 2019.
Jelke Bethlehem. Selection bias in web surveys. International Statistical Review, 78(2):161–188, 2010.
Christopher Bogart, Christian Kästner, James Herbsleb, and Ferdian Thung. How to break an API: cost negotiation and community values in three software ecosystems. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, 109–120, 2016.
Amiangshu Bosu and Kazi Zakia Sultana. Diversity and inclusion in open source software (OSS) projects: Where do we stand? In 2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), 1–11, IEEE, 2019.
Edna Dias Canedo, Rodrigo Bonifácio, Márcio Vinicius Okimoto, Alexander Serebrenik, Gustavo Pinto, and Eduardo Monteiro. Work practices and perceptions from women core developers in OSS communities. In Proceedings of the 14th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), 1–11, 2020.
Elian Carsenat. Inferring gender from names in any region, language, or alphabet. Unpublished, 10, 2019.
Hilary Carter and Jessica Groopman. The Linux Foundation report on diversity, equity, and inclusion in open source. https://www.linuxfoundation.org/tools/the-2021-linux-foundation-report-on-diversity-equity-and-inclusion-in-open-source/, 2021. Accessed on March 10, 2022.
Gemma Catolino, Fabio Palomba, Damian A. Tamburri, Alexander Serebrenik, and Filomena Ferrucci. Gender diversity and women in software teams: How do they affect community smells? In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Society (ICSE-SEIS), 11–20, IEEE, 2019.
Jailton Coelho and Marco Tulio Valente. Why modern open source projects fail. In Proceedings of the Joint Meeting on Foundations of Software Engineering (ESEC/FSE), 186–196, ACM, 2017.
Malcolm Corney, Olivier De Vel, Alison Anderson, and George Mohay. Gender-preferential text mining of e-mail discourse. In 18th Annual Computer Security Applications Conference, 2002, Proceedings., 282–289, IEEE, 2002.
Daniel Izquierdo Cortázar. Gender-diversity analysis of the Linux kernel technical contributions. https://speakerdeck.com/bitergia/gender-diversity-analysis-of-the-linux-kernel-technical-contributions?slide=48, 2016. Accessed on January 20, 2022.
Paul A. David, Andrew Waterman, and Seema Arora. Floss-us the free/libre/open source software survey for 2003. Stanford Institute for Economic Policy Research, Stanford University, Stanford, CA (www.stanford.edu/group/floss-us/report/FLOSS-US-Report.pdf), 2003.
Tapajit Dey, Sara Mousavi, Eduardo Ponce, Tanner Fry, Bogdan Vasilescu, Anna Filippova, and Audris Mockus. Detecting and Characterizing Bots That Commit Code, 209–219. ACM, New York, NY, USA, 2020.
Edna Dias Canedo, Heloise Acco Tives, Madianita Bogo Marioti, Fabiano Fagundes, and José Antonio Siqueira de Cerqueira. Barriers faced by women in software development projects. Information, 10(10):309, 2019.
Nadia Eghbal. Roads and Bridges: The Unseen Labor Behind Our Digital Infrastructure. Ford Foundation, 2016.
Hongbo Fang, Daniel Klug, Hemank Lamba, James Herbsleb, and Bogdan Vasilescu. Need for tweet: How open source developers talk about their GitHub work on Twitter. In Proceedings of the 17th International Conference on Mining Software Repositories, 322–326, 2020.
Sharan Foga. ASF committer diversity survey. https://cwiki.apache.org/confluence/display/COMDEV/ASF+Committer+Diversity+Survey+-+2016, 2016. Accessed on January 20, 2022.
Denae Ford, Alisse Harkins, and Chris Parnin. Someone like me: How does peer parity influence participation of women on stack overflow? In 2017 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), 239–243, IEEE, 2017.
Marco Gerosa, Igor Wiese, Bianca Trinkenreich, Georg Link, Gregorio Robles, Christoph Treude, Igor Steinmacher, and Anita Sarma. The shifting sands of motivation: Revisiting what drives contributors in open source. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), 1046–1058, IEEE, 2021.
GitHub. Open source survey. https://opensourcesurvey.org/2017/, 2017. Accessed on March 10, 2022.
Rishab A. Ghosh, Ruediger Glott, Bernhard Krieger, and Gregorio Robles. Free/libre and open source software: Survey and study, 2002.
Georgios Gousios and Diomidis Spinellis. GHTorrent: GitHub’s data from a firehose. In 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), 12–21, IEEE, 2012.
Foad Hamidi, Morgan Klaus Scheuerman, and Stacy M. Branham. Gender recognition or gender reductionism? The social implications of embedded gender recognition systems. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, 1–13, 2018.
Daniel Izquierdo, Nicole Huesman, Alexander Serebrenik, and Gregorio Robles. OpenStack gender diversity report. IEEE Software, 36(1):28–33, 2018.
Mitchell Joblin, Sven Apel, Claus Hunsen, and Wolfgang Mauerer. Classifying developers into core and peripheral: An empirical study on count and network metrics. In 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), 164–174, IEEE, 2017.
Os Keyes. The misgendering machines: Trans/HCI implications of automatic gender recognition. Proceedings of the ACM on Human-Computer Interaction, 2(CSCW):1–22, 2018.
Andrew Kofink. Contributions of the under-appreciated: Gender bias in an open-source ecology. In Companion Proceedings of the 2015 ACM SIGPLAN International Conference on Systems, Programming, Languages and Applications: Software for Humanity, 83–84, 2015.
Victor Kuechler, Claire Gilbertson, and Carlos Jensen. Gender differences in early free and open source software joining process. In IFIP International Conference on Open Source Systems, 78–93, Springer, 2012.
Karim R. Lakhani and Robert G. Wolf. Why hackers do what they do: Understanding motivation and effort in free/open source software projects. Open Source Software Projects (September 2003), 2003.
Amanda Lee and Jeffrey C Carver. Floss participants’ perceptions about gender and inclusiveness: a survey. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), 677–687, IEEE, 2019.
Feng Lin, Yingxiao Wu, Yan Zhuang, Xi Long, and Wenyao Xu. Human gender classification: a review. Int. J. Biom., 8(3/4):275–300, 2016.
Jeffrey W. Lockhart, Molly M King, and Christin Munsch. What’s in a name? Name-based demographic inference and the unequal distribution of misrecognition. 2022.
Mike Medeiros, Benjamin Forest, and Patrik Öhberg. The case for non-binary gender questions in surveys. PS: Political Science & Politics, 53(1):128–135, 2020.
Courtney Miller, David Widder, Christian Kästner, and Bogdan Vasilescu. Why do people give up FLOSSing? A study of contributor disengagement in open source. In International Conference on Open Source Systems, OSS, 116–129, Springer, 2019.
Dawn Nafus. “Patches don’t have gender”: What is not open in open source software. New Media & Society, 14(4):669–683, 2012.
Lavinia Paganini and Kiev Gama. Engaging women’s participation in hackathons: A qualitative study with participants of a female-focused hackathon. In International Conference on Game Jams, Hackathons and Game Creation Events 2020, 8–15, 2020.
Antoine Pietri, Diomidis Spinellis, and Stefano Zacchiroli. The software heritage graph dataset: public software development under one roof. In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), 138–142, IEEE, 2019.
Gede Artha Azriadi Prana, Denae Ford, Ayushi Rastogi, David Lo, Rahul Purandare, and Nachiappan Nagappan. Including everyone, everywhere: Understanding opportunities and challenges of geographic gender-inclusion in OSS. IEEE Transactions on Software Engineering, 2021.
Huilian Sophie Qiu, Alexander Nolte, Anita Brown, Alexander Serebrenik, and Bogdan Vasilescu. Going farther together: The impact of social capital on sustained participation in open source. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), 688–699, IEEE, 2019.
Huilian Sophie Qiu, Yang Wen, and Alexander Nolte. Approaches to diversifying the programmer community – the case of the girls coding day. In 2021 IEEE/ACM 13th International Workshop on Cooperative and Human Aspects of Software Engineering (CHASE), 91–100, IEEE, 2021.
Huilian Sophie Qiu, Zihe H. Zhao (co-first author), Tielin Katy Yu, Justin Wang, Alexander Ma, Hongbo Fang, Laura Dabbish, and Bogdan Vasilescu. Gender representation among contributors to open-source infrastructure – an analysis of 20 package manager ecosystems. In International Conference on Software Engineering – Software Engineering in Society, ICSE SEIS, IEEE, 2023.
Mahin Raissi, Molly de Blanc, and Stefano Zacchiroli. Preliminary report on the influence of capital in an ethical-modular project: Quantitative data from the 2016 Debian survey. Journal of Peer Production, (10):1–25, 2017.
Gregorio Robles, Laura Arjona Reina, Jesús M González-Barahona, and Santiago Dueñas Domínguez. Women in free/libre/open source software: The situation in the 2010s. In IFIP International Conference on Open Source Systems, 163–173, Springer, 2016.
Gregorio Robles, Hendrik Scheider, Ingo Tretkowski, and Niels Weber. Who is doing it. A Research on Libre Software Developers, 2001.
Davide Rossi and Stefano Zacchiroli. Worldwide gender differences in public code contributions: and how they have been affected by the COVID-19 pandemic. Proceedings of the 44th International Conference on Software Engineering (ICSE 2022) – Software Engineering in Society (SEIS) Track, 2022.
Daniel Russo and Klaas-Jan Stol. Gender differences in personality traits of software engineers. IEEE Transactions on Software Engineering, 2020.
Lucía Santamaría and Helena Mihaljević. Comparison and benchmark of name-to-gender inference services. PeerJ Computer Science, 4:e156, 2018.
Paul Sebo. Performance of gender detection tools: a comparative study of name-to-gender inference services. Journal of the Medical Library Association: JMLA, 109(3):414, 2021.
Stack Overflow. Developer survey results. https://insights.stackoverflow.com/survey/2017, 2017. Accessed on May 1, 2022.
Josh Terrell, Andrew Kofink, Justin Middleton, Clarissa Rainear, Emerson Murphy-Hill, Chris Parnin, and Jon Stallings. Gender differences and bias in open source: Pull request acceptance of women versus men. PeerJ Computer Science, 3:e111, 2017.
Bianca Trinkenreich, Mariam Guizani, Igor Wiese, Anita Sarma, and Igor Steinmacher. Hidden figures: Roles and pathways of successful OSS contributors. Proceedings of the ACM on Human-Computer Interaction, 4(CSCW2):1–22, 2020.
Bianca Trinkenreich, Igor Wiese, Anita Sarma, Marco Gerosa, and Igor Steinmacher. Women’s participation in open source software: A survey of the literature. Preprint at arXiv:2105.08777, 2021.
Orsolya Vasarhelyi and Balazs Vedres. Gender typicality of behavior predicts success on creative platforms. Preprint at arXiv:2103.01093, 2021.
Bogdan Vasilescu, Andrea Capiluppi, and Alexander Serebrenik. Gender, representation and online participation: A quantitative study of Stack Overflow. In 2012 International Conference on Social Informatics, 332–338, IEEE, 2012.
Bogdan Vasilescu, Andrea Capiluppi, and Alexander Serebrenik. Gender, representation and online participation: A quantitative study. Interacting with Computers, 26(5):488–511, 2014.
Bogdan Vasilescu, Daryl Posnett, Baishakhi Ray, Mark GJ van den Brand, Alexander Serebrenik, Premkumar Devanbu, and Vladimir Filkov. Gender and tenure diversity in GitHub teams. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 3789–3798, 2015.
Bogdan Vasilescu, Alexander Serebrenik, and Vladimir Filkov. A data set for social diversity studies of GitHub teams. In 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories, 514–517, IEEE, 2015.
Bogdan Vasilescu, Alexander Serebrenik, Mathieu Goeminne, and Tom Mens. On the variation and specialisation of workload – a case study of the GNOME ecosystem community. Empirical Software Engineering, 19(4):955–1008, 2014.
Ji Zheng and Bao-Liang Lu. A support vector machine classifier with automatic confidence and its application to gender classification. Neurocomputing, 74(11):1926–1935, 2011.

Author information

Authors and Affiliations

Northwestern University, Evanston, USA
Huilian Sophie Qiu
Rice University, Houston, USA
Zihe H Zhao
Carnegie Mellon University, Pennsylvania, USA
Tielin Katy Yu
Carnegie Mellon University, Pennsylvania, USA
Laura Dabbish
Carnegie Mellon University, Pennsylvania, USA
Bogdan Vasilescu

Editor information

Editors and Affiliations

University of Victoria, Victoria, BC, Canada
Daniela Damian
University of Auckland, Auckland, New Zealand
Kelly Blincoe
Microsoft, Redmond, WA, USA
Denae Ford
Eindhoven University of Technology, Eindhoven, The Netherlands
Alexander Serebrenik
Riyadh, Saudi Arabia
Zainab Masood

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

About this chapter

Cite this chapter

Qiu, H.S., Zhao, Z.H., Yu, T.K., Dabbish, L., Vasilescu, B. (2024). How Much Do Women Build Open Source Infrastructure?. In: Damian, D., Blincoe, K., Ford, D., Serebrenik, A., Masood, Z. (eds) Equity, Diversity, and Inclusion in Software Engineering. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4842-9651-6_14

DOI: https://doi.org/10.1007/978-1-4842-9651-6_14
Published: 21 September 2024
Publisher Name: Apress, Berkeley, CA
Print ISBN: 978-1-4842-9650-9
Online ISBN: 978-1-4842-9651-6
eBook Packages: Professional and Applied Computing, Apress Access Books, Professional and Applied Computing (R0)
