1 Introduction

Human gene editing and gene therapy are controversial topics among geneticists. Gene therapy has the potential to cure genetic diseases such as cystic fibrosis, to confer resistance to diseases such as bird flu (Tsanni 2023), and, in the future, to address Huntington’s disease or even cancer. With these benefits, why would experts be tentative about its use? Because gene editing carries the risk of making permanent changes to the human genome that could be passed on to future generations and could potentially lead to eugenics. These risks raise significant ethical concerns about when it is appropriate to use gene therapy.

The development and public release of Large Language Models (LLMs) have increased in recent years. While these models can be used to build helpful tools, e.g., ChatGPT, Bard, NeevaAI, and Bing AI, LLMs can produce inaccurate or misleading information because the content they generate is based on word associations and patterns rather than on a deep understanding of the subject (Ray 2023). These responses can sometimes be dangerous and lead to tragedies; for example, a young Belgian man took his own life after conversing for six weeks about his fears of climate change with a chatbot named ELIZA [“a chatbot that uses GPT-J, an open-source artificial intelligence language model developed by EleutherAI” (Walker 2013)]. Chatbots can also be sensitive to input phrasing, producing inconsistent answers and varied amounts of detail (Ray 2023). Finally, chatbots can produce polarizing views on both sides of a topic, which poses another risk to society: users may turn to chatbots to confirm their predetermined beliefs. This confirmation bias may foster unwarranted attitudes toward different groups of people, ideas, or topics. Hence, in this research we highlight some of the ways chatbots can produce potentially biased responses when prompted about a controversial social issue such as gene editing and supplied with the region or political affiliation of the prompter/inquirer.

Our overarching research question is: Do Large Language Models (LLMs) return potentially biased information when prompted about a controversial issue? In this research, we compared both linguistic and semantic differences between the responses returned by OpenAI ChatGPT and Google Bard when prompted about the issue of gene editing, and we analyzed the responses to determine whether the information supplied in the prompt leads the chatbot to give biased information or varying amounts of detail (i.e., how sensitive it is to input phrasing) and what such biases look like. More specifically, this research examines the “cultural and linguistic bias”, “positive/negative sentiment bias”, and “ideological bias” of ChatGPT and Bard, which could stem from the models’ training on human-generated internet content (Ray 2023). This paper is prepared according to the rules of the “SBP-BRiMS 2023 Grand Interdisciplinary Data-Science and Modeling Challenge,” which can be found here: https://sbp-brims.org/2023/challenge/. The rest of the paper is organized as follows: Sect. 2 provides a brief literature review of related work, Sect. 3 describes our methodology, and Sect. 4 lists and discusses our findings. Section 5 concludes the paper with possible future research directions.

2 Literature review

In this section, we provide a brief literature review of the main topics of this paper, i.e., Large Language Models (LLMs) and the issue of Gene Editing.

The use of LLMs in various medical fields, including robotics, dictation, medical chart summarization, personalized patient recommendations, breast cancer detection, lymph node metastases, and therapeutics, has experienced a significant surge. Nazario-Johnson et al. (2023) set out to compare two large language models, ChatGPT (version 3.5, trained on general data) and Glass AI (version 1.0, trained only on medical text), against a senior neuroradiologist in their ability to recommend radiologic imaging modalities. All three (ChatGPT, Glass AI, and the neuroradiologist) were tasked with labeling the most suitable neuroimaging modality for each case, assigning scores on a scale of 0 to 3, where 3 denotes the optimal choice. The average scores for ChatGPT, Glass AI, and the neuroradiologist were 1.75, 1.83, and 2.20, respectively. Although Glass AI outperformed ChatGPT, the observed difference was not statistically significant; however, the contrast between the neuroradiologist and the two chatbots was. These findings underscore that while LLMs such as ChatGPT and Glass AI may not be poised to replace neuroradiologists outright, they hold promise as valuable tools for providing recommendations in medical contexts. The authors also speculated that fine-tuning ChatGPT with medical texts might enhance its performance relative to Glass AI (Nazario-Johnson et al. 2023).

In addition to using LLMs in the medical field, Kasneci et al. (Apr 2023) looked into how the growing popularity of LLMs could impact education. The authors discussed the opportunities and challenges associated with using LLMs in education. Some highlighted opportunities include using LLMs in elementary schools to enhance students’ writing skills and employing LLMs in professional fields, e.g., programming, project management, and decision-making. LLMs could also be utilized in teaching tasks such as creating lesson plans, assessments, and grading assignments. However, these benefits come with potential problems, including copyright issues; bias and fairness; learners relying too heavily on the model; teachers becoming too reliant on the model; lack of understanding and expertise in using the models; difficulty distinguishing model-generated from student-generated answers; cost of training and maintenance; data privacy and security; sustainable usage; cost of verifying information and maintaining integrity; difficulty distinguishing between real knowledge and convincingly written but unverified model output; and lack of adaptability. The authors conclude that to maximize the benefits of LLMs in education, these models must be used cautiously and undergo a close evaluation to identify limitations and biases (Kasneci et al. Apr 2023).

In many professional fields, programming is a widely used skill. When written code contains an error, it is referred to as a software bug, and the process of fixing the software by eliminating bugs is known as debugging. This process consumes a significant amount of time and poses challenges for programmers. Bugs are not only troublesome but also risky, as they can cause software crashes, lead to security vulnerabilities, or result in data loss. To address this problem, numerous products have been created to expedite the debugging process; more recently, however, there has been an increasing trend of using LLMs such as ChatGPT for code debugging. In their research, Surameery and Shakor (2023) explained the characteristics of ChatGPT that enable it to perform effectively as a debugger, including its knowledge representation, pattern recognition, error correction, and ability to generalize from examples it has not seen before. Although the authors discussed the promise that ChatGPT has shown, they emphasized that its effectiveness in solving bugs depends on the quality of the training data and the type of bugs being targeted (Surameery and Shakor 2023).

Another interesting and challenging topic that LLMs might help address is global warming. This topic is difficult to tackle due to the involvement of various scientific disciplines such as atmospheric science, oceanography, and ecology. Biswas (Jun 2023) suggests that LLM applications such as ChatGPT could benefit scientists studying this issue. The author proposes that ChatGPT could be used to analyze and interpret large amounts of data, disseminate information about climate change to a broader audience, offer recommendations on addressing climate change problems, and generate potential climate scenarios. However, the author also acknowledges the limitations of ChatGPT. The model, trained on general data, may lack a fully comprehensive understanding of complex topics related to the issue. It could also lack contextual awareness of multi-faceted issues, be at risk of being trained on inaccurate or biased data, raise ethical concerns, and have a limited scope due to the potential lack of training on up-to-date data (Biswas Jun 2023).

Various scientific studies have discussed the issue of gene editing; here, we highlight those most relevant to our work. Lander et al. (Mar 2019) report on 18 signatories, citizens of 7 different countries, calling for a global moratorium on human germline editing. Human germline editing involves changing heritable DNA, such as that in sperm, eggs, or embryos. Germline editing serves two primary purposes. The first is genetic correction, which involves editing a rare mutation that, without intervention, would have a high probability of causing a severe single-gene disease. The goal of genetic correction is to convert the mutated gene into the DNA sequence carried by the majority of the population. The authors believe that this type of germline editing is much safer than the second type, genetic enhancement, but they caution against proceeding with genetic correction until the right protocol is established. Genetic enhancement involves modifying DNA with the intention of “improving” individuals, such as making them stronger or enhancing their memory. This type of gene editing is significantly riskier than genetic correction because modifying genes could have unknown outcomes; for example, improving an individual’s immunity to a specific disease may simultaneously increase their risk for a different disease. The authors recommend a framework that begins with a 5-year pause on germline editing, followed by a two-year public notice before any trials. During this period, international discussions about the pros and cons of the study could take place. Subsequently, the ethical implications of the study would be determined, and only if there is broad societal consensus should the study proceed (Lander et al. Mar 2019).

The idea of being able to eradicate genetic diseases from society creates a significant push to promote germline genetic modification. However, germline genetic modification comes with numerous potential safety issues. These risks were discussed in a study by Ishii (Dec 2017). The author discussed the potential misuse of germline genetic modification for nonmedical human enhancement, which could widen the socioeconomic gap: the wealthy might gain access to gene editing before the rest of society, or, more seriously, it could cause permanent damage to the human genome. Genome editing also carries the risk of unintentional cuts to non-targeted DNA sequences, leading to large-scale genomic alterations. Scientists argue that “it would be irresponsible to proceed with any clinical use of germline editing unless and until a reasonable and ethical follow-up protocol is established” (Ishii Dec 2017). Currently, there is no agreed-upon follow-up protocol specifying how long children born via germline genome editing should be monitored. Given the potential impact on future generations, scientists must consider the feasibility of trans-generational follow-up (Ishii Dec 2017).

Doxzen and Halpern (2020) discussed two issues in their study. The first issue with germline genome editing (GGE) is that the line between treatment and enhancement is so thin that people from different cultures, geographical settings, or time periods might draw it differently. An example is editing the genome to increase expression of the Klotho protein to prevent degenerative neurological conditions (treatment), whereas increased levels of Klotho have also been shown to enhance cognition in mice (enhancement). The second issue they addressed concerns social justice and eugenics. They believe that the use of GGE will lead to a stigma around people with disabilities and possibly to eugenics. Currently, there is an option for genetic selection called pre-implantation genetic diagnosis (PGD), in which parents can pick an embryo based on genetic traits, allowing parents at risk of having children with genetic disorders to test whether the embryos carry those disorders. PGD still has a relatively low success rate and is very expensive. Like PGD, GGE would also be very expensive, likely making it accessible only to the wealthy, which would increase health inequality. The authors believe that when it comes to the ethics of GGE, it is most important to focus on how actual human lives and rights will be impacted, specifically addressing the problems of social exclusion and the risk of unfair access that advantages a select population (Doxzen and Halpern 2020).

Finally, Shaw (2020) discussed the case of the Chinese researchers who gene-edited embryos, resulting in the live birth of twins, which raised many questions about the ethical procedures followed. The author’s primary concern was why HIV was chosen for the first CRISPR embryo implantation instead of a serious heritable genetic disease. The article examines the four consent documents to determine the ethical nature of that research. In the first document, titled “Informed Consent,” the research team falsely claimed to be launching an AIDS vaccine development project, while in reality it was a gene editing study. Unrealistic claims were made, stating that the treatment could produce babies with natural immunity to AIDS, while also suggesting that the babies could be infected with HIV by their mothers. In the second document, titled “Supplementary Explanation of Informed Consent (Long-term Health Follow-up Plan),” also referred to as “The Second Consent Form,” the researchers added more information about what would happen in the years following the study, including 10 follow-up physical examinations and an 18-year follow-up plan. In Document 3, referred to as the “Safety Document,” the authors did not evaluate the safety of gene editing; instead, they provided details on inclusion and exclusion criteria for the study, with no rationale given for choosing the criteria, and the document was relatively short. In the fourth document, referred to as the “Ethics Application Form,” supposedly submitted and approved before the study, there were concerning inconsistencies. One role of the ethics committee is to confirm that the application forms provide information consistent with what is presented to the participants. However, the “Ethics Application Form” states, “In this study, we plan to use the CRISPR-Cas9 to edit embryos” (Shaw 2020), with no mention of developing an AIDS vaccine as described to the participants. The article concludes by highlighting how this research violates three of the five gene editing principles of the principal investigator’s university, which has since disowned the research (Shaw 2020).

Since “gene editing” is a controversial social and medical issue, and since LLMs are widely used in everyday tasks by medical professionals, teachers, and programmers (e.g., for disease identification, teaching, or software debugging), as demonstrated in the literature review, understanding how biased these tools can be and what makes them provide polarized responses to controversial issues is a critical problem worth investigating. This research is one step in that direction.

Fig. 1 The overall research methodology

3 Methodology

Figure 1 shows the overall research methodology we followed to answer the research question stated in the Introduction. We generated our textual data by prompting two chatbots, OpenAI ChatGPT (version 24/05/2023) and Google Bard (version 07/06/2023), with two prompts: one asking the chatbot to generate a response based on a specific geographic region, and the other based on a specific political affiliation. Our goal was to prime the LLMs to provide polarized responses. The two prompts are shown below:

  1. Prompt 1: write 1000 words on what someone from {Europe, China, United States} would think about gene editing.

  2. Prompt 2: write 1000 words on what a {Communist, Democrat, Republican} would think about gene editing.

Each prompt consists of the text shown above with one of the words in the curly brackets substituted in. This resulted in 12 different text files with a total of 7591 words; see Table 1 for the word count of each file. We refer to the text resulting from Prompt 1 as Article 1 and the text resulting from Prompt 2 as Article 2. Figure 2 displays the word clouds for the responses generated by ChatGPT and Bard. The 10 most frequent words in the ChatGPT responses were Ethics, Potential, Perspective, Scientific, Technology, Human, Individual, Considerable, Concern, and Public, while the 10 most frequent words in the Bard responses were Use, Potential, Technology, Disease, Concern, Create, People, Human, Ethics, and Genetic.
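As an illustrative sketch in Python (not the exact scripts we used; the file layout, tokenization, and the simple length-based word filter are assumptions), the prompt variants and a basic word-frequency count over the saved responses could be produced as follows:

```python
import re
from collections import Counter
from pathlib import Path

# Build the prompt variants: 2 templates x 3 fill-ins, each sent to 2 chatbots (12 files).
templates = {
    "Prompt 1": "write 1000 words on what someone from {} would think about gene editing.",
    "Prompt 2": "write 1000 words on what a {} would think about gene editing.",
}
fillers = {
    "Prompt 1": ["Europe", "China", "United States"],
    "Prompt 2": ["Communist", "Democrat", "Republican"],
}
prompts = {
    f"{name} - {filler}": template.format(filler)
    for name, template in templates.items()
    for filler in fillers[name]
}

def top_words(folder, n=10, min_len=4):
    """Return the n most frequent words across all .txt responses in a folder."""
    counts = Counter()
    for path in Path(folder).glob("*.txt"):  # one saved response per file (assumed layout)
        words = re.findall(r"[a-z]+", path.read_text(encoding="utf-8").lower())
        counts.update(w for w in words if len(w) >= min_len)  # crude short-word filter
    return counts.most_common(n)

# Hypothetical folder names for the saved responses:
# print(top_words("responses/chatgpt"))
# print(top_words("responses/bard"))
```

Note that a simple frequency count of this kind only approximates the lists above, since the reported top-10 words reflect additional normalization (e.g., grouping word forms).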

Table 1 Word counts of the 12 textual data points obtained from the two chatbots and both prompts

To analyze the generated textual data and show the semantic differences between the various responses (one of the requirements of the data challenge), we used three dictionary-based tools: Linguistic Inquiry and Word Count (LIWC) (LIWC 2022); the extended Moral Foundations Dictionary [eMFD (Hopp et al. 2021)], which is based on Moral Foundations Theory and Biblical ethics (the twelve returned scores are split by vice and virtue); and the Perspective API (Google 2024). LIWC returns the percentage (from 0 to 100%) of all words in the text that fit a specific linguistic category; e.g., a pos_emo score of 4.5 means that 4.5% of the words in the document convey positive emotion. The eMFD score is also a proportion of all words in a text, but it is reported on a 0 to 1 scale; for example, a care.virtue score of 0.24 means that 24% of the words in the document belong to the care virtue category. Finally, Google’s Perspective API returns a toxicity score between 0 and 1 for any text passed to it, representing the estimated likelihood that a reader would perceive the text as toxic; a response with a score closer to 1 is therefore considered more toxic than one with a score closer to 0. After calculating these scores, we ran the Mann–Whitney U Test (McKnight and Najab 2010), a non-parametric test suitable for small samples that assumes no specific distribution, to test the following hypotheses:

Null hypothesis (\(H_{o}\)): there is no difference between the two independent samples.

Alternative hypothesis (\(H_{1}\)): there is a difference between the two independent samples.
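A minimal sketch of this test in Python, using SciPy with hypothetical placeholder scores for the two groups, looks as follows:

```python
from scipy.stats import mannwhitneyu

# Hypothetical per-response scores for two independent groups
# (placeholder values, not the scores reported in Sect. 4).
group_a = [0.031, 0.029, 0.042, 0.035, 0.030, 0.038]  # e.g., Prompt 1 responses
group_b = [0.048, 0.052, 0.085, 0.046, 0.051, 0.050]  # e.g., Prompt 2 responses

# Two-sided Mann-Whitney U test; reject H0 at the 0.05 significance level.
u_stat, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the two samples differ significantly.")
```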

The results of our analysis, which aimed to identify the semantic differences that emerged when the language generator was prompted to provide polarized information based on the specified political affiliation or geographical location of the prompter, are reported in the next section.
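For reference, a toxicity score can be requested from the Perspective API with the Google API Python client roughly as follows (a sketch based on Google's published client example and assuming a valid API key; quota handling and error handling are omitted):

```python
from googleapiclient import discovery

API_KEY = "YOUR_PERSPECTIVE_API_KEY"  # placeholder; requires Perspective API access

client = discovery.build(
    "commentanalyzer",
    "v1alpha1",
    developerKey=API_KEY,
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

def toxicity_score(text):
    """Return the TOXICITY summary score (0-1) for a piece of text."""
    request = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = client.comments().analyze(body=request).execute()
    return response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

# Example usage on a short hypothetical passage:
# print(toxicity_score("Gene editing raises difficult ethical questions."))
```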

Fig. 2 a Word cloud of the articles generated by OpenAI ChatGPT, b word cloud of the articles generated by Google Bard

4 Results

In this section, we present the findings of our research, detailing the results obtained after calculating the various scores mentioned in the Methodology section and conducting the Mann–Whitney U Test. Calculating the LIWC, eMFD, and Toxicity scores for the collected text yielded 114 scores for each of the 12 generated responses. We provide a breakdown of the results by prompt and by chatbot. Note that some scores were higher for one prompt vs. the other, or for one chatbot vs. the other; however, we report only those with statistically significant differences.

Fig. 3 a The breakdown of the emotion scores for Prompt 1. b The Toxicity scores breakdown per Prompt 1 (orange bars) and Prompt 2 (blue bars) for both ChatGPT and Google Bard. (Color figure online)

Findings per prompt: we initially combined the text returned by both chatbots for each prompt and analyzed the resulting scores. Additionally, we examined the scores of individual files. In this analysis, we discovered:

  1. For Article 1 (i.e., responses returned from Prompt 1), the overall emotion scores were highest in the responses returned by ChatGPT from the USA point of view (2.03) and by Bard from the China point of view (2.05), shown as the tallest blue bars in Fig. 3a. Upon closer examination of the breakdown of the emotion scores, the responses returned by ChatGPT from the USA point of view appear more positive, whereas the responses returned by Bard from the China point of view appear more negative (in Fig. 3a, observe the tallest orange bar and the tallest gray bar, respectively).

  2. Article 1 and Article 2 showed varying levels of toxicity: the average toxicity of Article 1 was 0.0342, whereas that of Article 2 was 0.0515. This could mean that “gene editing” is a more heated topic across political affiliations than across geographical regions (in Fig. 3b, notice how the blue bars are taller than the orange bars). It might also suggest that these chatbots have a “sensitivity to input phrasing” (Ray 2023) when political affiliation is used instead of geographical location. To test whether the difference is statistically significant, we ran the Mann–Whitney U Test and found \(p =.004\) (which is \(< 0.05\)), so we can reject the null hypothesis and accept the alternative hypothesis stated in the Methodology section.

  3. For Article 1, Bard from the Europe point of view had the highest toxicity score (0.0424) (see the tallest orange bar in Fig. 3b). For Article 2, however, the responses from the Communist point of view had the two highest toxicity scores across both chatbots, with 0.0483 for ChatGPT and 0.0848 for Bard (see the tallest blue bars in Fig. 3b).

  4. Article 1 had higher average ethnicity scores (average = 0.99) than Article 2 (average = 0), while Article 2 had higher average politic scores (average = 3.72) than Article 1 (average = 1.155). The differences are statistically significant with \(p =.009\) and \(p =.016\), respectively (see Fig. 4).

Findings per chatbot: here, we compare the results obtained from analyzing the responses returned by each chatbot for both prompts. See Fig. 4 for a list of all scores mentioned below. In this analysis, we discovered:

  1. The responses returned by ChatGPT had higher fairness_virtue scores (average = 0.071) than those returned by Bard (average = 0.050), while the responses returned by Bard had higher care_vice scores (average = 0.074) than those returned by ChatGPT (average = 0.06). Both differences are statistically significant with \(p =.008\).

  2. The responses returned by ChatGPT also had a higher average Social score (average = 9.652) than Bard’s (average = 6.502), with \(p =.008\), and a higher average Social Behaviour score (average = 5.355) than Bard’s (average = 2.61), with \(p =.004\).

  3. Bard had a higher average illness score (average = 1.442) than ChatGPT (average = 0.133) with a \(p =.004\).

  4. Even though we asked the chatbots to return 1000-word articles, the responses ranged from 493 to 686 words. Additionally, ChatGPT returned longer responses than Bard, with averages of 646.8 and 618.3 words, respectively. See Table 1 for each response’s word count.

Finally, despite our selecting a controversial issue and employing two chatbots trained by different companies, the majority of the scores calculated for the twelve responses were low. This is encouraging because it means the chatbots are not returning very toxic or negative responses about the issue of gene editing. The five highest scores in most of the responses were Analytic (highest 96.22), Dictionary (highest 90.31), Linguistic (68.28), Tone (87.11), and Clout (54.72), indicating that the returned responses are of high quality (see the red scores in Fig. 4).

Fig. 4 Some of the scores we obtained and discussed in our findings

5 Conclusion and future work

The use of LLMs is now widespread, and this may be only the tip of the iceberg. Given that these tools are used by the public, it is essential to pay closer attention to the impact of their usage, because they might pose harm to individuals or society at large; for example, if they provide biased, polarized, or incorrect information, they could convince people of false ideas, beliefs, or ideologies. Hence, in this research we analyzed the responses returned by two LLM-based chatbots that give human-like answers when the prompter asks about a controversial issue (i.e., “gene editing”) while providing their political affiliation or geographical region. When we analyzed the responses and compared the differences (tested via the Mann–Whitney U test) in the results returned per chatbot and per prompt, we found that 4 scores yielded results that could reject the null hypothesis between the different prompts (i.e., there is a significant difference between the articles’ average scores with \(p <.05\)). In contrast, when comparing ChatGPT and Bard, we found that 33 scores yielded results that could reject the null hypothesis. This leads us to believe that although political affiliations and geographical locations can lead chatbots to produce statistically different results, the chatbot used to produce the response has a significantly larger impact. This might be because these tools are trained on different internet content that is mostly generated by humans. Therefore, examining the source of the training dataset is crucial for understanding any potential chatbot bias or polarization. Finally, despite our selecting a controversial issue, the majority of the scores calculated for the twelve responses were low, meaning the chatbots are not returning toxic or negative responses [not much “sentiment bias” (Ray 2023)] about the issue of gene editing, while the five highest scores in most of the responses indicate that the returned responses are of high quality.

This type of research should be conducted on various conversational chatbots (perhaps starting with the most popular ones, as this research does) that use a variety of LLMs trained on different text corpora, to determine whether these chatbots provide polarized or biased responses and thus require solutions [e.g., developing a “Nice Conversations” dataset (Beattie et al. 2022)] to reduce polarization or bias. In the future, we plan to apply the same methodology to other controversial issues to measure the “inconsistency in quality” (Ray 2023) of the ChatGPT and Bard responses related to the discussed issue, or to use other chatbots on the gene editing issue to better understand how these tools work and to confirm our findings.

Some limitations of this work might have been introduced by the tools we used to calculate the scores of the textual data, which are mostly dictionary-based. However, these tools have been used and tested in many academic publications, so we assume the score calculations are highly accurate.