Introduction

The growing availability of public datasets recording various variables of human behavior has triggered the development of new computational tools to structure and ultimately examine public data, providing reliable empirical causal inferences that explain human behavior (Baden et al., 2022). Scientometric research on gender bias, gender differences, or gender inequalities in science (Charlesworth & Banaji, 2019; Huang et al., 2020; Kwiek & Roszka, 2022; Kyvik & Teigen, 1996; Llorens et al., 2021; Nosek et al., 2009; Ribarovska et al., 2021; Ross et al., 2022; Sebo et al., 2021), for instance, has largely focused on examining the role of gender in explaining differences in a myriad of scientific metrics, such as citations, productivity, h-index, scientific awards, academic promotion, and so on (Astegiano et al., 2019; Beaudry & Larivière, 2016; Chatterjee & Werner, 2021; Donald et al., 2011; Habicht et al., 2021; Holman et al., 2018; Krukowski et al., 2021; Meho, 2021). To address these gender-related research questions, scholars need to code the gender of observations from names, as this information is typically not directly available. Against this backdrop, research has typically relied on two main empirically tested procedures to "genderize" observations from names: manual (Rajkó et al., 2023) and computational (Sebo, 2021) coding.

The manual coding of gender from names is a traditional procedure that in communication research is typically conducted through content analysis (Lacy et al., 2015; Lombard et al., 2002). It essentially involves trained human coders analyzing the names of observations, either using their knowledge/heuristics or, in case of doubt, through online searches (see Rajkó et al., 2023 for an example). To be considered reliable, this analysis needs to follow the standard procedures of all content analyses (Matthes & Kohring, 2008; Riffe et al., 2019) and provide intercoder agreement coefficients, such as Krippendorff's alpha (Hayes & Krippendorff, 2007; Krippendorff, 2004), Cohen's kappa (Cohen, 1960), or similar interrater agreement measures (Hsu & Field, 2003). This manual procedure, which has been tremendously popular and successful in examining media, public, or institutional content (Ernst et al., 2019; Lawson & Lugo-Ocando, 2022; Mackert et al., 2014; Manganello & Blake, 2010; Sullivan et al., 2019), is, however, highly inefficient for coding large datasets or big data, mainly due to the significant time and human resources needed to successfully code gender from names.

As a potential solution, computational sciences have provided robust tools to substantially address this challenge (Das & Paik, 2021; Fourkioti et al., 2019; Goyanes et al., 2022). Specifically, one of the most important procedures is the automatic computation of gender from names through gender detection application programming interfaces (APIs) (Bérubé et al., 2020; Sebo, 2021, 2022a). These algorithmic classifiers enable the computational coding of gender from names in big datasets (Bérubé et al., 2020), significantly reducing the time and human resources required for manual coding, while maintaining or even improving the quality and reliability compared to human coders. Studies of computational coding of gender using tools like Namsor or Gender-API report a high degree of accuracy, particularly for Western names (Santamaría & Mihaljević, 2018; Sebo, 2021, 2022b).

In both computational and manual coding, genderizing efforts most typically assume a binary dichotomy, assigning male/female categories to potentially non-binary individuals. This is because non-binary categories cannot be assigned, automatically or manually, from data consisting only of names. A more nuanced approach to gender coding would be to survey or interview the individuals behind each entry, but this may introduce new limitations in terms of sample size. What most studies do, not without facing fair criticism for this approach, is leverage the power of numbers in terms of sample size at the expense of non-binary gender categorizations, which require self-reported information.

Although the internal implementation of commercial gender APIs is not public, their developers advertise that they rely on big data. Gender is usually inferred by comparing the name, and other information if available (such as country or ethnicity), against a large database that is continuously updated. For each request, gender detection tools return additional information that can help researchers assess the accuracy of the prediction. For example, Namsor returns the inferred gender and a probability. When the probability is below .55, the name can be interpreted as not being specific to any particular binary gender (i.e., it is a unisex or gender-neutral name). Gender-API also returns the number of samples used to make the prediction.

However, neither Namsor nor Gender-API provides explicit guidelines on how to handle unisex, gender-neutral names, or non-binary categorizations. This lack of guidance can pose challenges for researchers and analysts who aim to conduct gender-inclusive studies. For instance, when such names are encountered or when gender categorizations assume a non-binary approach, it may be beneficial to include a separate category. This can help to acknowledge and respect the identities of non-binary and gender-diverse individuals. Alternatively, researchers could consider reaching out to participants for self-identification, if possible and appropriate. This approach respects individual autonomy and can provide more accurate and inclusive data.

Beyond the computational tools specifically developed for addressing the task of name genderization, the recent advent of chatbots powered by generative artificial intelligence, accessible to the public, such as ChatGPT, has equipped the scientific community with novel tools to confront this challenge. ChatGPT boasts formidable natural language processing capabilities grounded in language models trained on vast amounts of text content. This training enables it to consistently engage with humans, responding to messages and resolving issues presented through the chat (Ray, 2023). Consequently, it is conceivable to present the task of inferring gender from a given name to ChatGPT, allowing it to leverage the knowledge acquired during its training for resolution. While few studies have specifically addressed ChatGPT's performance in this task, there are studies investigating its capabilities in other classification tasks, such as fake news detection, spam detection, or sentiment analysis in tweets, among others (Kocoń et al., 2023). It has been observed that ChatGPT adeptly handles many of these tasks, although it may not attain the performance levels of state-of-the-art models tailored for them specifically.

All things considered, the automatic coding of gender from names represents a significant and efficient methodological contribution that, used properly and consistently, may help to shed light on the most intriguing research questions on the role of gender in science. Accordingly, this study seeks to describe an easy-to-execute methodological procedure, implementable by scholars without advanced computer skills, that enables the computational codification of gender from names. We present examples using the Namsor and Gender-API detection tools, as well as ChatGPT.

Step 1: Having/generating a dataset with names

As with all empirical research, scholars interested in shedding light on the role of gender in different dependent variables need a reliable dataset, obtained either by designing one or by downloading it from public/private repositories. In our case, scientometric scholars typically rely on data from platforms that compute and record different variables of scientific research, such as Scopus, the Web of Science, Google Scholar, Altmetrics, etc. For the sake of simplicity, in this research guide we focus on inferring gender from names with data coming from SciVal, a platform that computes data taken from Scopus on different metrics of journals, institutions, countries, and scholars (Goyanes, 2023). Although not publicly available (it requires a subscription), SciVal provides reliable scientometric data that, used consistently and applying reliable scientific methods and procedures, could serve to address a myriad of gender-related research questions, such as the association between gender and productivity, citations, h-index, etc.

Specifically, in this research, we have downloaded the names of the most productive scholars in the ten most productive countries (United States, United Kingdom, China, Spain, Germany, India, Australia, Canada, Italy, and the Netherlands) in the field of communication between the years 2019 and 2022. By default, SciVal provides the names of the top 500 most productive scholars for each country, and this is the ranking we use to computationally infer their gender. In Table 1, we provide as an example the names of the top ten scholars for each country according to SciVal.

Table 1 List of the names of the top 10 most productive scholars for the most productive countries in communication between the years 2019 and 2022, according to SciVal

Step 2: Preprocessing the dataset with names

Gender detection tools return better results if the first and last names are clearly identifiable and if the country of residence is also provided. In this step, we focus on preprocessing the dataset to meet these requirements when possible. If the data source identifies the first and last name, we can split the data into different columns using a spreadsheet or any software that facilitates data manipulation, such as statistical packages. The specific option or function to use for this purpose depends on the data and the software used. For the example, we can see that SciVal uses a comma to separate last and first name, so we can use the following Excel functions to split the two names into different columns:

  • Last name: ‘=LEFT(A2;SEARCH(",";A2)-1)’

  • First name: ‘=RIGHT(A2;LEN(A2)-SEARCH(",";A2)-1)’

The first function gets all the characters from the beginning of the string (function LEFT) up to the position of the first comma (function SEARCH), subtracting one so that the comma itself is not included. The second function returns the substring containing the last n characters (function RIGHT), where n is computed as the length of the unsplit name minus the position of the comma (‘LEN(A2)-SEARCH(",";A2)’), again subtracting one so that the space that follows the comma in SciVal's format is not included in the final string.

Since SciVal provides the country of residence of scholars, for the given example we can also feed this information to our gender detection method to further improve the accuracy of its predictions. Gender detection tools typically use the ISO 3166 two-letter country code to represent the country. To get the country code, we can check online sources such as Wikipedia or the gender detection tools' examples and documentation. Once we have the country code for each country in the dataset, we can either create a new column with the country code or substitute the original country name with the code, for example using "find & replace" on the given column in Excel. In our example, the preprocessed dataset looks like Table 3. The complete preprocessed dataset, including the Excel functions used, can be found in the Supplementary Material (SM1_TestGenderProtocol.xlsx, sheet "PreprocessedData").
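For readers who prefer a scripted alternative to the spreadsheet route, the same preprocessing can be expressed in a few lines of Python with pandas. The following is a minimal sketch, assuming hypothetical file and column names ("Scholar", "Country") that should be adapted to the actual dataset:

```python
import pandas as pd

# Load the raw dataset (hypothetical file and column names).
df = pd.read_excel("raw_scholars.xlsx")

# Split "Last, First" into two columns on the first comma, then strip
# the leading space left after the comma.
df[["last_name", "first_name"]] = df["Scholar"].str.split(",", n=1, expand=True)
df["first_name"] = df["first_name"].str.strip()

# Map country names to ISO 3166 alpha-2 codes for the ten countries
# in the example; extend the dictionary for other datasets.
iso_codes = {
    "United States": "US", "United Kingdom": "GB", "China": "CN",
    "Spain": "ES", "Germany": "DE", "India": "IN", "Australia": "AU",
    "Canada": "CA", "Italy": "IT", "Netherlands": "NL",
}
df["country_code"] = df["Country"].map(iso_codes)

df.to_csv("preprocessed_names.csv", index=False)
```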

Table 3 Preprocessed dataset (excerpt) of the top 10 most productive scholars for the most productive countries in communication between the years 2019 and 2022

However, on many occasions it is not possible to easily separate the first and last name. Gender detection tools (GDTs) provide functionalities to infer gender from both full and split names, and in such cases we should select the appropriate feature from the GDT. In Namsor, we can choose "Gender Full Name," while Gender-API automatically splits the first name and family name when given a column with a full name. In the case of ChatGPT, instead of using a prompt that specifies first name and family name separately, we can simply use the full name.

Step 3.1: Inferring gender (with ChatGPT)

ChatGPT can infer the gender of a name if we use an appropriate prompt. For the given example, we start by asking for the gender of a first instance using the following prompt: "Andrew C. Billings from the US. Is it a male or female name?" (Fig. 1). Following the initial ChatGPT answer, we can then copy and paste the list of names into the prompt and ask ChatGPT to provide a gender for all these instances using the following prompt: "Based on first name, last name and/or country, give me a numerical answer only: 0 for female; 1 for male. Provide a table style answer including first and last name, country (if any), 0 if female; 1 if male." (Fig. 2). For this study, we pose the question as a binary decision without qualifying it further.

Fig. 1 ChatGPT prompt and answer to infer gender of the first instance of the dataset

Fig. 2 Final part of the ChatGPT prompt and initial part of the answer to infer the gender of a dataset of names

Results can then be pasted into a spreadsheet. Inferring gender with ChatGPT through the web interface presents the following limitations at present: (1) after 75 instances, ChatGPT asks for confirmation to continue processing; and (2) ChatGPT can process up to 4,000 tokens (roughly 3,000 words) through its web prompt. Considering these limitations, for large datasets we suggest using batches of around 100 instances in the web interface, which is close to the limit. For big datasets, we recommend using not the ChatGPT prompt but the ChatGPT API through the Chat Completion object and its Create method. A sample Python script to do this is included in Appendix 1. Automating gender detection using the ChatGPT API enables processing datasets of any size but comes with two disadvantages: (1) it requires programming skills; and (2) it has a cost, since OpenAI charges for using the API based on the number of tokens. Processing a dataset of 5,000 instances using the ChatGPT API costs approximately USD 0.0013 per instance at the time of writing with the latest version of ChatGPT (4.0). The cost of gender detection tools varies depending on the tool and package offerings: for 10,000 instances, the cost is also around USD 0.0013 per instance in Namsor and USD 0.0017 in Gender-API.
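The following is a minimal Python sketch of the kind of script included in Appendix 1, calling the Chat Completion object's Create method through the pre-1.0 openai package. The model name, prompt wording, and example entries are illustrative assumptions, not prescriptions:

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder


def infer_genders(rows, model="gpt-4"):
    """rows: one batch of 'First Last, CountryCode' strings."""
    prompt = (
        "Based on first name, last name and/or country, give me a "
        "numerical answer only: 0 for female; 1 for male.\n" + "\n".join(rows)
    )
    response = openai.ChatCompletion.create(
        model=model,
        temperature=0,  # reduces, but does not eliminate, non-determinism
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]


# Illustrative batch; in practice, iterate over the dataset in batches
# sized to stay within the token limit.
print(infer_genders(["Andrew C. Billings, US", "Jane Doe, GB"]))
```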

Step 3.2: Inferring gender (with Namsor)

To determine the gender using Namsor, we upload the file (Excel or comma-separated values (CSV)) and select the feature under "Gender from names" that matches our data (Fig. 3). For the example, we select "Genderize Name Geo" because we have preprocessed the data (Step 2) to separate the first name and last name and coded the country using the ISO 3166 country code. Namsor provides alternatives for all the cases in which only the full name is available or the country is not coded. We then configure the settings: selecting whether to keep the columns that originally exist in the dataset, indicating whether the dataset has a header (thus identifying the first row of data), and determining which columns correspond to the first name, the last name, and the country code used to infer the gender. Namsor validates the selected columns. Finally, Namsor presents a review screen that shows the columns configured for prediction and the new output columns (Fig. 4); after displaying a short summary, the file is processed and can be downloaded. For the whole process, Namsor offers an easy-to-use web interface.

Fig. 3 Namsor features for inferring gender from names

Fig. 4 Review screen from Namsor that shows the columns used to infer gender (in blue) and the new columns added with the prediction (in green). Columns not changed are shown in grey

Step 3.3: Inferring gender (with Gender-API)

Similarly, to infer gender using Gender-API, we upload the file by selecting CSV Upload or Excel Upload. Then we choose the fields for the first or full name, and the country code if available. After clicking on "review your data," Gender-API shows the output of processing the first ten records (Fig. 5). We can change the previous settings or click on the button to process the dataset; the processed file can then be downloaded. Since Gender-API only offers the feature to infer gender, its interface is even simpler than Namsor's.

Fig. 5 Review screen from Gender-API that shows the original columns and the new columns added with prediction (in light grey)

Step 4: Interpreting gender results from gender-APIs and ChatGPT: classifications and nonclassifications

Namsor returns a spreadsheet that includes the data used for the prediction and the results (Table 4). Data used for the prediction include the name (first/last name or full name), the version of the API, and the alphabet of the name (under "script"). The output includes the following: (1) "likelyGender," which is the inferred gender coded as "female" or "male"; (2) "genderScale," a statistic representing the probability of the gender on a scale from −1 (male) to 1 (female); (3) "score," a quantitative, non-normalized score of the reliability of the result; and (4) "probabilityCalibrated," which represents the accuracy of the prediction on a scale from 0 to 1, where a higher value means a more reliable result. If the probability of the predicted gender is below .55, the name can be interpreted as a unisex name for which Namsor is not able to determine the gender (i.e., a nonclassification).

Table 4 Results (excerpt) of predicted gender for the top 10 most productive scholars for the most productive countries in communication between the years 2019 and 2022 using Namsor

This approach, while effective for binary gender identification, may not fully address the needs of non-binary individuals. By classifying unisex names as nonclassifications, Namsor's model may inadvertently erase or ignore the identities of those who do not conform to the traditional male/female binary. This could potentially lead to misgendering or exclusion in contexts where accurate gender identification is important, such as in social research, marketing, or user experience personalization. Furthermore, the reliance on a binary gender scale (−1 to 1) may oversimplify the complex nature of gender, which is increasingly recognized as a spectrum rather than a binary construct. This could limit the model's applicability in diverse and inclusive settings. While Namsor's gender prediction model provides valuable insights, it is important to consider its limitations and the potential implications for non-binary and gender-diverse individuals. Future iterations of such models could benefit from incorporating more nuanced gender categories and considering the potential impacts on all users.

Researchers can use the additional output data provided by Namsor to further control and tailor the results. For instance, instances with a probability below .55 are nonclassified according to the Namsor documentation. Researchers have three options for handling them: code them manually, treat them as missing, or use the Namsor classification despite its low accuracy. This decision can rest on additional factors, such as the type of study, the sample and population sizes, or the method of analysis. Namsor returns a result for all names; other gender detection tools, like Gender-API, return empty (nonclassified) results for instances with not enough, or no, matching names in the tool's database. Researchers can also raise the threshold probability for higher accuracy if justified by the research questions or methods. Previous studies on gender detection tools, however, suggest that while increasing the probability threshold results in fewer misclassifications, overall performance decreases due to a higher number of nonclassifications (Sebo, 2022b).
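A minimal pandas sketch of the second option, treating low-probability predictions as missing, could look as follows (the file name is hypothetical; the column names match the Namsor output described above):

```python
import numpy as np
import pandas as pd

results = pd.read_csv("namsor_output.csv")  # hypothetical file name
threshold = 0.55  # raise the threshold for stricter classification

# Keep the predicted gender only when the calibrated probability
# reaches the threshold; otherwise code the instance as missing.
results["gender_coded"] = np.where(
    results["probabilityCalibrated"] >= threshold,
    results["likelyGender"],
    np.nan,
)
print(results["gender_coded"].isna().sum(), "nonclassified instances")
```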

Similarly, Gender-API returns a spreadsheet that includes the original data and the inferred gender (Table 5). The output includes three new columns: ‘ga_gender’ is the predicted gender, coded as ‘f’ for female, ‘m’ for male, and ‘u’ for instances in which the gender could not be determined; ‘ga_accuracy’ is a measure of the accuracy of the prediction between 0 and 100; and ‘ga_samples’ is the number of samples used for the prediction of the instance. Gender-API returns a nonclassification, coded as ‘u,’ when the accuracy is 50, meaning that there is an equal number of male and female instances of the name in its database; in all other circumstances, Gender-API returns the gender with more instances. The gender is also undetermined when Gender-API cannot find the name in its database. Like Namsor, Gender-API presents limitations when it comes to handling gender diversity.
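An analogous sketch for the Gender-API output simply recodes ‘u’ (undetermined) instances as missing (file name hypothetical):

```python
import pandas as pd

results = pd.read_csv("gender_api_output.csv")  # hypothetical file name

# 'u' marks instances Gender-API could not classify; recode as missing.
results.loc[results["ga_gender"] == "u", "ga_gender"] = pd.NA
print(results["ga_gender"].isna().sum(), "nonclassified instances")
```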

Table 5 Results (excerpt) of predicted gender for the top 10 most productive scholars for the most productive countries in communication between the years 2019 and 2022 using Gender-API

ChatGPT does not provide additional information on the genders inferred with the prompt used in this study. Upon specific prompting to include a numeric column representing the accuracy of the inferred gender, it returned 0.99 for 85 instances and 0.75 for 15 instances of the example. Curiously, ChatGPT and Namsor disagreed on six instances of the example. Five of these were rated with an accuracy of 0.75 by ChatGPT, while Namsor returned an average calibrated probability of 0.84 (range 0.56–0.99) across the six disagreements. It is also worth noting that ChatGPT can produce slightly different results when using different, but similar, prompts. For instance, if we use "boy's or girl's name" instead of "male or female name" in our prompt, results change slightly; two records differed in the example. Since ChatGPT is a pretrained AI model that relies on the data sources the model was trained with, we conjecture that in the first case ChatGPT queries the first name in the given country (similar to what gender detection tools do), while in the second it directly searches for the person in its internal pretrained model. Our conjecture is based on the interaction that the chat shows while the response is produced. For the same reason, results in future prompts can also vary even if we use the exact same prompt. All in all, in their current form, generative AI chatbots like ChatGPT are nondeterministic and can produce different results for the same input, although setting certain parameters, like the temperature, can reduce the non-determinism. Scientists must ponder this factor when selecting a tool for gender prediction, particularly if there is a stringent requirement for replication of the research.

Results of genderization for the example using ChatGPT, Namsor, and Gender-API, along with the reported accuracy measures, are presented in Supplementary Material 1 (SM1_TestGenderProtocol.xlsx, sheet "OutputAll").

Step 5: Reporting data

Although, to the best of our knowledge, there is no standard way to report results of automatic gender detection, such standardization would provide a way to evaluate those results, helping to contextualize and assess the quality of the research in which it is used. We suggest reporting, at least, the tool or method used to infer gender, the measure and threshold for nonclassification, the number of nonclassifications, and how nonclassifications were treated. Descriptive statistics of the accuracy measure returned by the gender detection tool could also be reported. For the example used in this paper, reporting could read as follows: "Gender was inferred using the Namsor gender detection tool (feature split name and country, API version 2.0.29) with a 0.65 threshold for the calibrated probability. Five instances (5%) were not classified and were coded as missing. The average calibrated probability of classified instances (N=95) was 0.989 (SD=0.037)." When additional information is used to determine gender (e.g., country) or other classifications represent a dimension of analysis for the study, we can break down the results and present them as shown in Table 6. Results can be reported analogously when using Gender-API, using the accuracy returned by that tool.
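The figures used in this reporting template can be computed with a few lines of pandas; the sketch below assumes the Namsor output columns described in Step 4 and a hypothetical file name:

```python
import pandas as pd

results = pd.read_csv("namsor_output.csv")  # hypothetical file name
threshold = 0.65  # calibrated-probability threshold used in the report

classified = results[results["probabilityCalibrated"] >= threshold]
n_nonclassified = len(results) - len(classified)

print(f"Nonclassified: {n_nonclassified} "
      f"({n_nonclassified / len(results):.0%}), coded as missing")
print(f"Classified (N={len(classified)}): "
      f"M={classified['probabilityCalibrated'].mean():.3f}, "
      f"SD={classified['probabilityCalibrated'].std():.3f}")
```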

Table 6 Results reporting automatic genderization by country of the top 10 most productive scholars for the most productive countries in communication between the years 2019 and 2022 using Namsor

Chatbots, in their present form, do not return reliable measures of accuracy, so reporting is also limited when using them to automatically infer gender. From this perspective, gender detection tools like Namsor represent a better methodological option.

Comparing performance

In certain circumstances, researchers may wish to compare the relative performance of different gender detection tools. This comparison is feasible if we have access to the ground truth. Common metrics used to compare performance in gender detection include Accuracy, Error Coded, Error Coded without NA, NA Coded, and Gender Bias (Santamaría & Mihaljević, 2018; Wais, 2016). To compute the metrics, we first need to produce the classification matrix, which includes the following counts: $m_m$ is the number of males correctly classified as males, $f_f$ is the number of females correctly classified as females, $m_f$ is the number of males incorrectly classified as females, and $f_m$ is the number of females incorrectly classified as males. Additionally, for nonclassifications, we have two further counts: $m_u$ is the number of males classified as unknown and $f_u$ is the number of females classified as unknown. Accuracy is a common metric in gender detection, defined as the proportion of correct classifications (males correctly classified as males and females correctly classified as females) in the population. It is calculated as follows:

$$\frac{m_m + f_f}{m_m + f_m + m_f + f_f + m_u + f_u}$$

Error Coded is defined as the proportion of incorrect classifications (both males classified as females and females classified as males) in the population. It is calculated as follows:

$$\frac{f_m + m_f + m_u + f_u}{m_m + f_m + m_f + f_f + m_u + f_u}$$

Error Coded without NA is similar to Error Coded, but it does not consider nonclassifications, so we can compute it by removing $m_u$ and $f_u$ from the previous formula. NA Coded is the proportion of nonclassifications in the population. It is calculated as follows:

$$\frac{m_u + f_u}{m_m + f_m + m_f + f_f + m_u + f_u}$$

Gender Bias is a measure of the tendency of a classifier to prefer one class over the other. It is calculated as the difference between the proportions of males and females incorrectly classified. A positive value indicates a bias towards classifying names as female, while a negative value indicates a bias towards classifying names as male.

$$\frac{m_f - f_m}{m_m + f_m + m_f + f_f}$$
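For convenience, all five metrics can be computed from the cells of the classification matrix with a short Python function; the counts used in the example call are illustrative, not the study's data:

```python
def gender_metrics(mm, ff, mf, fm, mu, fu):
    """Compute the five metrics from the classification matrix counts:
    mm/ff correct, mf/fm misclassified, mu/fu nonclassified."""
    total = mm + ff + mf + fm + mu + fu
    classified = mm + ff + mf + fm
    return {
        "accuracy": (mm + ff) / total,
        "error_coded": (fm + mf + mu + fu) / total,
        "error_coded_without_na": (fm + mf) / classified,
        "na_coded": (mu + fu) / total,
        "gender_bias": (mf - fm) / classified,
    }

# Illustrative counts for a sample of 100 names.
print(gender_metrics(mm=48, ff=45, mf=2, fm=3, mu=1, fu=1))
```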

Results for the example used in this study are presented in Table 7, providing a comparative analysis of three gender detection tools: Namsor, Gender-API, and ChatGPT. According to the metrics calculated for split names, ChatGPT demonstrated the highest accuracy (0.99) and the lowest error rate (0.01). Namsor and Gender-API showed slightly lower performance, with accuracies of 0.93 and 0.95, respectively. In terms of error rates, Namsor had the highest at 0.07, while Gender-API had a slightly lower error rate of 0.05. All tools had negligible nonclassifications, except for Gender-API, which had a small proportion of 0.01. In terms of gender bias, Namsor showed a slight bias towards classifying names as female, while Gender-API and ChatGPT showed a slight bias towards classifying names as male.

When considering full names without splitting into first and last names, the performance of the tools varied slightly. The accuracy of Namsor improved to 0.94, while Gender-API's accuracy increased to 0.97. ChatGPT maintained its high accuracy of 0.99. The error rates for all tools decreased when using full names, with Namsor and Gender-API showing error rates of 0.06 and 0.03, respectively, and ChatGPT maintaining its low error rate of 0.01. Nonclassifications remained negligible for all tools. The gender bias for all tools remained relatively consistent when using full names, with Namsor showing a slight bias towards classifying names as female and Gender-API and ChatGPT a slight bias towards classifying names as male.

Table 7 Performance metrics for gender detection using Namsor, Gender-API and ChatGPT of the top 10 most productive scholars for the most productive countries in communication between the years 2019 and 2022

Conclusion

The objective of this study was twofold: 1) to provide an easy step-by-step research protocol to infer gender from names; and 2) to compare the gender detection capabilities of two types of technology: ChatGPT (a chatbot) and Namsor and Gender-API (two gender detection tools). Based on the performed analysis, this study contributes to the literature on scientometrics (Charlesworth & Banaji, 2019; Huang et al., 2020; Kwiek & Roszka, 2022; Kyvik & Teigen, 1996; Llorens et al., 2021; Nosek et al., 2009; Ribarovska et al., 2021; Ross et al., 2022; Sebo et al., 2021) by providing suggestions and recommendations for future research interested in inferring gender from names, which may have significant implications for several methodological decisions during automatic gender classification (Rajkó et al., 2023; Santamaría & Mihaljević, 2018; Sebo, 2021; Sebo et al., 2021). In addition, this study is one of the first to evaluate the potential of ChatGPT in performing gender classifications from names. Testing this technology is important because it may serve as common ground for many researchers without advanced computer skills to shed light on different issues revolving around gender representation and gender impact in the sciences. Sebo (2024) evaluated the performance of ChatGPT 3.5 and ChatGPT 4.0 in determining the gender of physician names. The study concluded that both versions of ChatGPT demonstrated high accuracy and a low rate of misclassifications, showing that their performance was slightly superior to that of Gender-API and Namsor. Our results are similar, showing an Error Coded of around 1% for ChatGPT 4.0, as compared to 3% for Gender-API and 6% for Namsor, although our sample size is smaller. Sebo also concluded that both versions of ChatGPT offer similar performance, which is particularly interesting considering the significantly lower cost of ChatGPT 3.5. Both Sebo's results and ours point to the potential of ChatGPT for gender detection. Our study also presents a systematic research protocol for inferring and reporting name-to-gender results, which can be applied to all existing tools and their different versions, as well as extended to upcoming ones, such as new chatbots.

Despite the differences between the two approaches (chatbots and gender detection tools) in the automatic process of gender classification, both converge in a surprising lack of transparency in the process of gender inference. This lack of transparency represents a major shortcoming of gender detection APIs, as there is no ground truth available against which to compare the output. Nevertheless, Namsor and Gender-API provide additional information to evaluate the accuracy of the inferences, as suggested in the results section, enabling researchers to make informed decisions as to whether to include, exclude, or manually code problematic observations. In the case of ChatGPT, the platform only provides raw data: just the inferred genders are provided in the chatbot's output. Considering this information, and given the concerns discussed above, gender APIs could be considered a more appropriate tool to automatically classify gender from names.

Relatedly, the costs associated with the automatic classification of gender are very similar, despite the differences in the nature of the technologies. Namsor is a partially free API: researchers can freely classify 5,000 observations per month, at least at the time of writing. ChatGPT is also freely available to automatically codify genders from names via chat but, as suggested in the results section, the system has several limitations regarding prompt handling and the word limit that can be fed to the chatbot. As a result, for relatively small samples, we recommend feeding the chatbot batches of around 100 observations so it can work operationally. For relatively big samples, our suggestion is to run the gender inference process through the ChatGPT API rather than the web interface. This procedure, however, entails two main limitations: 1) it requires moderate computer skills; and 2) it has a cost. Processing a dataset of 5,000 instances using the ChatGPT API with the method described costs approximately USD 0.0013 per instance, while the cost of gender detection tools varies depending on the tool and package offerings. For 10,000 instances, the cost is also around USD 0.0013 per instance in Namsor and USD 0.0017 in Gender-API. According to these estimated rates, both types of platform (AI chatbot vs. API) have a similar cost for large datasets.

Regarding usability, both technologies are relatively well equipped to reliably infer gender from names. However, a gender detection tool is a technology whose very objective is the empirical classification of gender from names, while ChatGPT is an artificial intelligence technology with multiple uses, one of which, as we have demonstrated in this study, may be gender classification. In this regard, Namsor and Gender-API present a web interface that facilitates the codification of gender for scholars without advanced computer skills, while ChatGPT, despite its ease of use, requires interacting with the chatbot and finding the appropriate prompts to correctly perform the classification. In our test, on some occasions ChatGPT refused to answer the query when performing the automatic gender classification from names, alleging privacy and moral issues, which are indeed relevant. In conclusion, based on our present comprehension of the various gender classification methodologies, we continue to advocate for the use of API detection tools for dependable gender identification from names, primarily due to their ease of use. The example used in this study suggests comparable accuracy between the tools; however, it is imperative to substantiate these findings with additional analyses that evaluate the performance of ChatGPT under varying conditions and juxtapose the results with those obtained from existing gender APIs.

We acknowledge a number of limitations in our study. Firstly, the robustness of prompting for ChatGPT and the influence of prompt engineering on the generated output were not extensively explored, due to the combinatorial explosion of tests and subsequent analyses that would result from varying prompts, parameters, and configurations. We recognize this as an area for future research. Secondly, the stability of a non-deterministic classifier like ChatGPT should be taken into account. As a machine learning model, ChatGPT's predictions can vary between runs, which could impact the consistency of the results. This variability is inherent to non-deterministic classifiers and is a factor researchers should be aware of when using such tools. Thirdly, the results presented in this study, and their interpretation, are based on a reduced sample size and are primarily used to demonstrate how the comparison between different gender detection tools can be conducted. Therefore, the results might not be representative of the performance of these tools on a larger, more diverse dataset. Future studies with larger sample sizes are needed to validate these findings and provide a more comprehensive comparison of the performance of these tools. Fourthly, according to recent research (Sebo, 2022a, 2022b), gender detection tools are much less accurate with Chinese names than with Western names, and therefore the vast majority of inconsistencies are reported when genderizing non-Western names. Lastly, it is crucial to acknowledge the non-binary nature of gender. The gender detection tools evaluated in this study classify names into binary categories of male and female, which oversimplifies the complex nature of gender. This binary classification does not account for individuals who identify as non-binary, genderqueer, or other gender identities outside the male-female binary. This is a significant limitation of these tools and a challenge for gender-related research. Future work in this area could explore the development and evaluation of tools that can accurately classify a broader range of gender identities.

As only one short comparative study (Sebo, 2024) has been published comparing chatbots and gender detection tools, suggesting similar performance, we conclude, while further evidence becomes available, that specific APIs like Namsor and Gender-API are the better technology, at least from a methodological perspective, since they provide accuracy measures that can be used to assess the quality of the genderization results and their validity. In this regard, we also suggest the systematic reporting of accuracy measures in studies that use automatic gender detection. Considering the comparative test performed in this study, we recommend that future research focus on designing empirical studies with large datasets and ground truth benchmarks to compare the accuracy of detection tools.