1 Introduction

With the introduction of sophisticated large language models that demonstrate impressive text-production skills, there has been a paradigm shift in the field of artificial intelligence and, in particular, in natural language processing (NLP) (Fui-Hoon Nah et al. 2023). Among these models, the ChatGPT architecture created by OpenAI has drawn a lot of interest because of its capacity to carry on meaningful and pertinent dialogues with users. This study attempts to thoroughly analyze the successive ChatGPT architectural developments while exploring their distinctive characteristics, advancements, limitations, and improvements. By examining the distinctive features added in each release, we show how the improvements lead to more natural interactions and better text generation (Sheikh et al. 2023).

The ChatGPT architecture has several notable strengths, but it has certain drawbacks as well. Producing believable but false or biased information remains a problem (Hanna and Levic 2023). Additionally, the model's sensitivity to input wording and the sporadic creation of obscene or socially inappropriate content have sparked ethical questions. These restrictions highlight the intricate relationship between model complexity, training data, and the possibility of unexpected results. Earlier studies, such as work on wireless sensor networks (Vadgaonkar et al. 2018), evaluation metrics for machine translation (Banik et al. 2018), and further machine-translation studies on phrase table readjustment (Banik 2021), assembling translation (Banik et al. 2019), and automated language translation models (Banik et al. 2019), gave us a detailed picture of translation models. Similarly, this research gives an in-depth view of this evolving AI tool.

The Generative Pre-trained Transformer (GPT) framework, upon which the ChatGPT architecture is based, has undergone a number of revisions (Haleem et al. 2022), each of which brought notable advancements in model size, training data, and fine-tuning methods. The ChatGPT series' iterative design, which includes iterations like GPT-3 and later, reflects the ongoing work to improve the model's language understanding and generation skills (Ozdemir 2023). These variations are distinguished by an expanding set of capabilities that allow them to recognize complex linguistic subtleties and provide comprehensible, situation-appropriate replies.

The primary objective of this research study is to provide a comprehensive comparative analysis of different architectural iterations of ChatGPT (Meyer et al. 2023), emphasizing their unique features, progressions, and limitations. By exploring the architecture's evolution and tackling its problems, it enhances understanding of the potential applications and ethical issues surrounding advanced language models. The vision of this research is to open the door for more responsible and efficient use of ChatGPT and related models in practical applications.

1.1 Motivation

The inspiration for this paper derives from the growing importance of understanding the capabilities and limitations of AI models such as ChatGPT, as well as the necessity of providing researchers and practitioners with a comprehensive comparison of different ChatGPT versions. Alongside the success of ChatGPT, there are certain limitations. In order to find solutions to those limitations, this paper aims to understand the detailed architecture of ChatGPT while also performing a comparative analysis of its different versions. ChatGPT's shortcomings, such as providing plausible but inaccurate or illogical replies, sensitivity to input phrasing, and potential biases in generated content, highlight the significance of a comprehensive knowledge of its capabilities. Keeping these constraints in mind, this paper aims to improve the model's reliability and usefulness, opening the way for more effective and responsible deployment in real-world circumstances. We provide a comprehensive view of ChatGPT's history through a detailed evaluation of its performance and a comparison of its architectural developments, enabling informed decision-making and responsible usage within the AI community.

1.2 Novelty

  1.

    The primary objective of this research paper is to comprehensively examine the ChatGPT architecture. This involves a meticulous analysis of its capabilities, limitations, and evolution through multiple iterations. Furthermore, the paper underscores the importance of tracing the evolution of ChatGPT to highlight the improvements made in subsequent versions, contributing to a well-rounded assessment of this AI language model.

  2.

    By exploring the practical ramifications of ChatGPT's operation, this research study goes beyond a cursory assessment of its capabilities. It attempts to learn what ChatGPT is capable of and how those capabilities apply to actual situations. The study also clarifies the significance of ChatGPT's architecture for psychological analysis in the field of big data analytics.

  3.

    The uniqueness of this paper lies in its holistic approach to evaluate ChatGPT’s strengths and limitations. Beyond a short list of its advantages, our work thoroughly analyzes ChatGPT’s text production capabilities. It explores the underpinning principles that allow ChatGPT to produce logical and contextually appropriate content.

  4.

    In parallel with its analysis of ChatGPT’s abilities, we also identify and discuss the model’s shortcomings and limitations. These limitations could encompass areas where the generated text lacks accuracy, coherence, or appropriate context. Then we critically examine instances where ChatGPT produces biased or inappropriate content and where it might struggle with understanding and responding to certain prompts.

  5.

    Finally, the distinctive contribution of this research paper lies in its novel approach to analyzing and comparing the performance of two iterations of the ChatGPT model, specifically ChatGPT 3.5 and ChatGPT 4. While previous studies have explored individual iterations of AI language models, this paper innovatively synthesizes a head-to-head comparison, shedding light on the advancements and progress achieved in the development of ChatGPT.

2 Related work

Large language models (LLMs) have been the subject of much investigation in recent years, with OpenAI's ChatGPT serving as an outstanding example of the potential, consequences, and ethical issues associated with their application. In this section, we examine relevant research that has helped us comprehend the development, uses, difficulties, and social effects of ChatGPT and related models. Mhlanga's investigation (Mhlanga 2023) underscores the imperative of ethical and responsible deployment of ChatGPT in educational settings, particularly emphasizing its potential for facilitating lifelong learning journeys. Shazeer (Shazeer 2020) contributes to model enhancement with the introduction of GLU variants, offering avenues for potentially augmenting ChatGPT's performance across various tasks. Alkaissi and McFarlane's study (Alkaissi and McFarlane 2023) raises critical concerns regarding artificial hallucinations within ChatGPT, especially within the context of scientific discourse, prompting reflections on the model's reliability and trustworthiness. Radford et al.'s seminal work (Radford et al. 2019) lays foundational concepts for models like ChatGPT, presenting language models as versatile unsupervised multitask learners, pivotal for shaping subsequent advancements. Haque's comprehensive exploration (2023) navigates the practical applications and inherent limitations of ChatGPT within the realm of natural language processing, providing invaluable insights for researchers and practitioners alike. Raman et al.'s review (Raman et al. 2023) of early research trends offers a panoramic view of ChatGPT's developmental trajectory, charting its evolution and projecting potential research directions. Floridi and Chiriatti's seminal contribution (2020) broadens the discussion by contemplating the wider societal implications of models like ChatGPT, enriching our understanding of their ethical and philosophical dimensions. Hariri's exploration (2023) delves into the diverse applications and constraints of ChatGPT across multiple domains, offering a nuanced perspective on its utility and limitations. Rivas and Zhao's ethical discourse (2023) underscores the necessity of ethical considerations in utilizing ChatGPT for marketing, navigating the complex ethical terrain inherent in deploying AI-driven chatbots. Peters et al.'s critical response (Peters et al. 2023) to the philosophical and educational ramifications of ChatGPT-4 stimulates thoughtful reflection on its implications for the future of humanity. These works collectively constitute a rich body of research, providing comprehensive insights into ChatGPT's architectural evolution, comparative performance evaluations, ethical nuances, and avenues for future exploration and development.

3 ChatGPT architecture

The ChatGPT architecture is based on the Transformer architecture, which has been widely used since it was initially developed for machine translation tasks. The Transformer design is based on the concept of self-attention mechanisms [21], which enable the model to process each element while focusing on various portions of the input sequence (Haque 2022); it does not need sequential processing and effectively captures global dependencies. The Transformer has an encoder and a decoder stack, but since ChatGPT is a language generation model, it uses only the decoder stack (Briganti 2024). For better understanding, we can look at the diagrammatic representation of the Transformer architecture of ChatGPT in Fig. 1. This figure gives a high-level overview of the ChatGPT Transformer architecture, showing the essential elements involved in processing the incoming text and producing answers (Aydın and Karaarslan 2023).

Fig. 1 A visual guide to the Transformer architecture: by dissecting its components (attention mechanisms, multi-head self-attention, encoder-decoder stacks), this visualization unravels the underlying mechanics that empower models like GPT

A key element of the Transformer design of ChatGPT is multi-head self-attention (Bansal et al. 2024). This mechanism lets the model attend to several positions in the input sequence simultaneously [26]: for each token, attention scores against all other tokens are computed, and the output is a weighted sum of those tokens' values according to their relevance to the current token. By capturing different kinds of connections between tokens, the model comprehends context more effectively (Villena Toro and Tarkian 2023). Tokens are the individual units or chunks into which a sequence of text is divided. In multi-head attention, each input embedding is converted into three distinct linear projections called queries (Q), keys (K), and values (V). The attention scores between tokens are calculated from the queries and keys, and the values are combined into a weighted sum to form the output. The multi-head attention mechanism may be stated mathematically as follows. Given an input sequence of tokens

$$\begin{aligned} A={a_1,a_2,a_3,.....,a_n}\end{aligned}$$
(1)

the subscripts attached to each a in Eq. 1 indicate the index of the token within the sequence A, running from 1 to n, the total number of tokens in the sequence. Each token ai is embedded into a vector representation bi using an embedding matrix B:

$$\begin{aligned} B(A) = {b_1, b_2,..., b_n} \end{aligned}$$
(2)

Here B in Eq. 2 is the embedding matrix applied to input sequence A: b1 corresponds to the vector representation of the first token a1, b2 to that of the second token a2, and so on, up to bn for the last token an. Then, the queries Q, keys K, and values V are obtained by transforming each embedding using learnable matrices ZQ, ZK, and ZV, respectively, as shown in Eqs. 3, 4 and 5:

$$\begin{aligned} Q = B(A) * Z_Q \end{aligned}$$
(3)
$$\begin{aligned} K = B(A) * Z_K \end{aligned}$$
(4)
$$\begin{aligned} V = B(A) * Z_V \end{aligned}$$
(5)

The multiplication of matrices is shown here by the symbol *. Here B(A) in Equations (3), (4) and (5) signifies the transformation of the input sequence A into a sequence of vector embeddings using the matrix B.

The dot product between the queries (Q) and keys (K) is used to generate the attention scores (Attention(Q, K)h) for the h-th head [28]. Each score reflects the affinity or resemblance between two tokens: higher scores reveal stronger dependencies or connections, while lower scores show weaker links. The softmax in Eq. 6 highlights tokens with higher relevance while downplaying those with lower relevance, amplifying the differences between scores; the scores are first scaled by the square root of the query dimension (denoted jm).

$$\begin{aligned} Attention(Q, K)_h = softmax((QK^T) / sqrt(j_m)) \end{aligned}$$
(6)

Using the attention scores from Eq. 6, the weighted sum of the values (V) in Eq. 7 is calculated to obtain the output of the h-th attention head (Attention(Q, K, V)h):

$$\begin{aligned} Attention(Q, K, V)_h = Attention(Q, K)_h * V \end{aligned}$$
(7)

The outputs from all attention heads are concatenated and linearly projected using another learnable weight matrix Co in Eq. 8 to obtain the final multi-head attention output. The concatenation, represented as Concat in Eq. 8, ensures that the model can identify a variety of correlations and patterns in the input sequence (Bello et al. 2019). The linear projection in the multi-head attention mechanism transforms and reshapes the concatenated outputs.

$$\begin{aligned} \begin{aligned} MultiHeadAttention(Q, K, V) = \\ Concat(Attention(Q, K, V)_1, \\ Attention(Q, K, V)_2,...,\\ Attention(Q, K, V)_h) * C_o \end{aligned} \end{aligned}$$
(8)
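
To make the mechanism above concrete, the following minimal NumPy sketch implements Eqs. 3 through 8 for a toy input. The dimensions, the number of heads, and the randomly initialized projection matrices are illustrative assumptions, not the configuration of any released GPT model.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(E, Z_Q, Z_K, Z_V, C_o, n_heads):
    """E: (n, d) token embeddings B(A); Z_Q, Z_K, Z_V, C_o: (d, d) learnable matrices."""
    n, d = E.shape
    j_m = d // n_heads                      # per-head query/key dimension
    Q, K, V = E @ Z_Q, E @ Z_K, E @ Z_V     # Eqs. 3-5: linear projections
    heads = []
    for h in range(n_heads):
        s = slice(h * j_m, (h + 1) * j_m)
        scores = softmax(Q[:, s] @ K[:, s].T / np.sqrt(j_m))  # Eq. 6: scaled dot-product
        heads.append(scores @ V[:, s])                        # Eq. 7: weighted sum of values
    return np.concatenate(heads, axis=-1) @ C_o               # Eq. 8: concat and project

rng = np.random.default_rng(0)
n, d = 5, 16                                # 5 tokens, 16-dimensional embeddings (toy sizes)
E = rng.normal(size=(n, d))                 # stands in for the embeddings B(A)
mats = [rng.normal(size=(d, d)) for _ in range(4)]
out = multi_head_attention(E, *mats, n_heads=4)
print(out.shape)                            # (5, 16): one context-aware vector per token
```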

Following the multi-head self-attention mechanism, the next step is the feed-forward neural network. The final MultiHeadAttention(Q, K, V) output, which contains data from the various attention patterns learned by each head, is passed on for further processing within the Transformer architecture. The feed-forward neural network is a crucial element of each encoder and decoder layer of the Transformer (Orrù et al. 2023). It follows the self-attention mechanism, processing the context-aware token representations acquired there and modeling intricate non-linear relationships in the data. Once the self-attention mechanism has processed the input sequence and produced these representations, the feed-forward network is applied to each one separately. Typically, the feed-forward neural network is a two-layer, fully connected network with a Rectified Linear Unit (ReLU) activation in the middle (Wu et al. 2023). The mathematical representation of the feed-forward neural network can be computed as detailed below.

Each input token representation is initially transformed linearly using a bias vector v1 and a learnable weight matrix k1 as presented in Eq. 9. The output of the linear transformation is denoted as Linear(L).

$$\begin{aligned} Linear(L) = L * k_1 + v_1 \end{aligned}$$
(9)

Here, * represents matrix multiplication. This transformation is a crucial stage in the process for the feed-forward neural network to learn and capture complicated relationships within the token representations.

Linear correlations between input and output are introduced via the linear transformation. Many real-world interactions in data, however, are by their very nature non-linear. Without ReLU, the model would only be able to learn linear transformations, which could not adequately represent the complexity and richness of the input. A non-linear activation function (Koubaa et al. 2023), typically the Rectified Linear Unit (ReLU), is applied to the output of the linear transformation in Eq. 10:

$$\begin{aligned} ReLU(L) = max(0, Linear(L)) \end{aligned}$$
(10)

ReLU adds non-linearity to the model, enabling it to recognize intricate data patterns. Linear(L) in Eq. 10 is the output of the linear transformation, and max selects the larger of that output and zero, so negative values are clamped to zero.

The output of the ReLU activation procedure is then subjected to a second linear transformation using a bias vector v2 and a learnable weight matrix k2 in Eq. 11:

$$\begin{aligned} FFN(L) = ReLU(L) * k_2 + v_2 \end{aligned}$$
(11)

It is a parallelizable procedure since the Feed Forward Neural Network is applied element-by-element to each token representation separately. FFN(L) in Eq. 11 represents the output of the second linear transformation of the feed-forward neural network applied to the token representations that have undergone the ReLU activation. The Feed Forward Neural network gives the model more complexity and aids in capturing the relationships between the tokens in the sequence.

A residual connection and layer normalization are usually used to stabilize training and enhance information flow. The output of the FFN is normalised in Eq. 12 after element-wise addition to the original token representations:

$$\begin{aligned} FFN_{Output} = LayerNorm(L + FFN(L)) \end{aligned}$$
(12)

Layer normalization makes training more robust and enhances model performance by ensuring that the token representations have a consistent scale and distribution across layers (Ray 2023). In conclusion, the Transformer architecture's feed-forward neural network is an essential part that helps the model capture complicated patterns and correlations in the context-aware token representations acquired through self-attention. This combination of self-attention and feed-forward networks has proven particularly successful in several natural language processing tasks, including language generation tasks like ChatGPT.
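
As a rough illustration of Eqs. 9 through 12, the sketch below applies the two linear maps, the ReLU, the residual connection, and layer normalization to toy token representations; the sizes and random weights are assumptions for demonstration only.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)    # normalize each token representation

def ffn_block(L, k1, v1, k2, v2):
    """L: (n, d) token representations; k1: (d, d_ff); k2: (d_ff, d)."""
    hidden = np.maximum(0.0, L @ k1 + v1)   # Eqs. 9-10: Linear(L) followed by ReLU
    ffn = hidden @ k2 + v2                  # Eq. 11: second linear transformation
    return layer_norm(L + ffn)              # Eq. 12: residual connection + LayerNorm

rng = np.random.default_rng(1)
n, d, d_ff = 5, 16, 64                      # toy sizes
L = rng.normal(size=(n, d))
out = ffn_block(L, rng.normal(size=(d, d_ff)), np.zeros(d_ff),
                rng.normal(size=(d_ff, d)), np.zeros(d))
print(out.shape)                            # (5, 16): applied token by token
```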

In addition to the multi-head self-attention and the feed-forward neural network, the model employs positional encoding (Rothman and Gulli 2022). Since the model lacks intrinsic knowledge of the relative or absolute locations of tokens in the input sequence, positional encoding is a method employed in the Transformer architecture to include sequential information in the input embeddings. It enables the model to comprehend token order and recognise sequential links. Before the input embeddings are fed into the Transformer, the positional encoding vectors are added to them. These vectors are intended to encode positional information without interfering with the learned token embeddings. The mathematical explanation of positional encoding is as follows.

The ensemble C of positional encoding vectors

$$\begin{aligned} C={c_1,c_2,c_3,.....,c_n} \end{aligned}$$
(13)

is calculated as follows in Eqs. 14 and 15, where c1, c2, ..., cn are the corresponding positional vectors for the input tokens a1, a2, ..., an:

$$\begin{aligned} c_{p, 2j}= & {} sin(p / k^{(2j/f_{model})}) \end{aligned}$$
(14)
$$\begin{aligned} c_{p, 2j+1}= & {} cos(p / k^{(2j/f_{model})}) \end{aligned}$$
(15)

p represents the token’s position in the sequence (p= 1, 2,..., n). i represents the position index in the positional encoding vector and j in Equations (14) and (15) represents the dimension index. fmodel is a scaling factor that adjusts the frequency of the positional encoding, and k is the constant base used for calculation, here assumed equal to 10000. The above equations generate two values (sin and cos) for each dimension of the positional encoding vector, leading to a unique encoding for each token position. Finally, the ensemble of positional encoding vectors C is added to the corresponding token base embeddings B(L) in Eq. 16:

$$\begin{aligned} L (with Positional Encoding) = B(A) + C \end{aligned}$$
(16)

This addition allows the model to distinguish tokens based on their positions in the sequence. The resulting L(with Positional Encoding) is then used as input to the Transformer encoder.

In conclusion, Positional Encoding gives the Transformer model knowledge about the location of each token in the input sequence, allowing the model to recognize the sequential order of tokens and successfully capture long-range relationships. The Transformer architecture is improved for a variety of natural language processing applications, including language production tasks like ChatGPT, by integrating positional information.
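
The following short sketch, again with illustrative sizes only, computes the sinusoidal encodings of Eqs. 14 and 15 with base k = 10000, as stated above, and adds them to toy token embeddings as in Eq. 16.

```python
import numpy as np

def positional_encoding(n, d, k=10000.0):
    """Return C: (n, d), one positional encoding vector per token position p."""
    C = np.zeros((n, d))
    p = np.arange(n)[:, None]               # token positions (0-indexed here)
    j = np.arange(0, d, 2)[None, :]         # even dimension indices 2j
    angle = p / k ** (j / d)                # frequency scaled by the model dimension
    C[:, 0::2] = np.sin(angle)              # Eq. 14: sine on even dimensions
    C[:, 1::2] = np.cos(angle)              # Eq. 15: cosine on odd dimensions
    return C

rng = np.random.default_rng(2)
n, d = 5, 16
B_A = rng.normal(size=(n, d))               # stands in for the token embeddings B(A)
L_pe = B_A + positional_encoding(n, d)      # Eq. 16: add positional information
print(L_pe.shape)                           # (5, 16)
```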

The training phase follows the above mechanisms. The model's parameters are adjusted during training to minimize a specific loss function, as in masked language modeling (Hadi et al. 2023). The goal is to enhance the model's language generation capabilities. The training phase of ChatGPT often targets the Masked Language Modelling (MLM) task. In this task, some of the tokens in the input sequence are randomly masked, and the model must accurately predict these masked tokens from the surrounding context. The mathematical explanation of training is as follows.

A certain percentage of the input tokens are masked, creating a masked sequence. The model computes scores for all potential tokens and applies a softmax function to obtain the probability distribution over the vocabulary for each masked token. Let us write the model's output scores as

$$\begin{aligned} O={o_1,o_2,o_3,....,o_n} \end{aligned}$$
(17)

where oi represents the output of the model for the i-th token. In Eq. 17, o1 corresponds to the output score of the first token in the sequence, o2 to that of the second token, o3 to that of the third, and on to that of the n-th token, where n is the total number of tokens in the sequence. The entire set O represents the collection of output scores for all tokens in the sequence. The probability distribution over the vocabulary for the masked token ai is obtained using the softmax function:

$$\begin{aligned} Pd(a_i | A_{masked}) = softmax(o_i) \end{aligned}$$
(18)

Pd in Eq. 18 represents the probability distribution over the vocabulary for the masked token ai given the masked context Amasked. This distribution tells us the likelihood of each possible word in the vocabulary being the correct substitution for the masked token. Finally, softmax(oi) refers to the softmax function, which converts the vector oi into a probability distribution. oi is the input vector to the softmax function and usually contains the scores, or logits, associated with each candidate word (Campesato 2023). These scores indicate how likely each word is to be the correct substitution for the masked token ai. Softmax is defined in Eq. 19:

$$\begin{aligned} softmax(o_i) = exp(o_i) / sum(exp(o_j)) \quad for \quad all \quad j \end{aligned}$$
(19)

In Eq. 19, oi is the raw score or logit associated with the i-th output neuron, and exp(oi) is its exponential. Exponentiating transforms the raw score into a positive value; the exponential function also amplifies differences between the logits, making larger logits much larger and smaller logits much smaller, which emphasizes the model's confidence in its predictions. Then, sum(exp(oj)) is a summation over all classes in the classification problem; it sits in the denominator of the softmax equation and acts as a normalization factor. For each masked position, the model is trained to maximize the log-likelihood of the right token. The loss function, commonly known as the cross-entropy loss, is the negative log-likelihood in Eq. 20:

$$\begin{aligned} Loss = -log(Pd(a_i | A_{masked})) \end{aligned}$$
(20)

The total loss for the whole masked sequence, represented as Loss in Eq. 20, is the sum of the negative log-likelihoods over all masked positions, where Pd is the probability distribution over the vocabulary for the masked token ai given the masked context Amasked. During training, an optimisation technique like stochastic gradient descent is used to update the model's parameters (weights and biases). The goal is to reduce the average loss across all training samples so that the model learns meaningful representations and associations between the input sequence's tokens. In conclusion, masked language modeling is used during training in the Transformer architecture (and ChatGPT) to predict masked tokens accurately and to update the model's parameters to reduce the negative log-likelihood of the right tokens. Through this training process, the model learns to produce coherent and contextually relevant replies in language generation tasks.
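
A tiny numerical example makes Eqs. 17 through 20 concrete. The four-word vocabulary and the logits below are invented for illustration; a real model produces scores over tens of thousands of tokens.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())                 # Eq. 19, stabilized by subtracting the max
    return e / e.sum()

vocab = ["the", "cat", "sat", "mat"]        # toy vocabulary
o_i = np.array([1.2, 3.1, 0.3, -0.5])      # output scores (logits) for one masked token
Pd = softmax(o_i)                           # Eq. 18: distribution over the vocabulary
true_idx = vocab.index("cat")               # suppose "cat" is the masked token
loss = -np.log(Pd[true_idx])                # Eq. 20: negative log-likelihood
print(Pd.round(3), round(float(loss), 4))
```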

The model has learned to generalize correlations and patterns in the training data after the training phase. The trained model produces replies for fresh, unforeseen input cues during the inference phase. At this point, the model's internal parameters are fixed and no longer modified. Based on the input, it uses the learned patterns to provide coherent and contextually appropriate answers. During inference, the model generates the next token oi at each step based on the context of the previously generated tokens o1, ..., oi-1, as shown in Eq. 21:

$$\begin{aligned} o_i = argmax_{o}(Pd(o | o_{1},...,o_{i-1})) \end{aligned}$$
(21)

Pd in Eq. 21 is the probability distribution over the vocabulary for the output token o given the context o1, ..., oi-1, which comprises the tokens that have already been produced. To maximize the likelihood of the subsequent token given the context, the argmax function chooses the token with the highest probability. Until a predetermined stopping condition is satisfied, such as generating a specific end-of-sentence token or reaching a limit on sequence length, the generation process is recursive and uses the previous tokens to predict the next token. This procedure allows ChatGPT to construct replies token by token, producing coherent and contextually appropriate text based on the input prompt or conversation history. The generation can be further improved by employing methods like beam search, which investigates several possible tokens at each stage and selects the token sequence that is most likely overall according to a scoring mechanism. In conclusion, the Transformer architecture in ChatGPT uses the output probabilities at inference time to predict the next token at each step based on the context of previously created tokens, allowing the model to provide meaningful and contextually relevant replies in natural language; the procedure iterates until a halting condition is satisfied, producing a full answer to the input prompt or discussion. This section has given a detailed overview of the ChatGPT Transformer architecture, emphasizing the key components involved in processing the input text and generating responses.
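
The greedy decoding loop of Eq. 21 can be sketched as below. The next_token_distribution function is a stand-in for a trained model's forward pass; here it is a hypothetical bigram table used purely to make the loop runnable.

```python
import numpy as np

vocab = ["<bos>", "hello", "world", "<eos>"]
bigram = np.array([                          # toy Pd(o | previous token), row-normalized
    [0.0, 0.9, 0.1, 0.0],                    # after <bos>, "hello" is most likely
    [0.0, 0.0, 0.8, 0.2],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
])

def next_token_distribution(context):
    return bigram[context[-1]]               # stand-in for the model's output distribution

context = [vocab.index("<bos>")]
while True:
    o_i = int(np.argmax(next_token_distribution(context)))  # Eq. 21: pick the argmax token
    context.append(o_i)
    if vocab[o_i] == "<eos>" or len(context) > 10:          # stopping condition
        break
print(" ".join(vocab[t] for t in context))   # <bos> hello world <eos>
```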

ChatGPT, with its versatile capabilities (Bahrini et al. 2023), finds applications (Abdullah et al. 2022) across numerous domains, such as customer support, education (Adeshola and Adepoju 2023), healthcare (Biswas 2023), content creation, language translation, research (Rahman and Watanobe 2023), virtual assistants, and legal and compliance work. As case studies, we discuss two different applications below.

3.1 ChatGPT architecture’s significance in psychological analysis

The state-of-the-art language model demonstrates astounding skill in natural language processing and creation (Rathje et al. 2023). The architecture is a perfect tool for understanding the psychological elements of gaming experiences since it can interpret player interactions (Huang et al. 2023), in-game chats, and user attitudes. The model’s adaptability enables examination across a range of gaming platforms, demographics, and game genres. Psychological analysis requires a fundamental understanding of human language and emotions (Loconte et al. 2023). With its transformer-based design, the ChatGPT architecture has become a potent tool for natural language processing (Orrù et al. 2023).

  • The self-attention mechanism in ChatGPT provides excellent sentiment analysis, allowing researchers to evaluate emotional responses from text data properly. Understanding the emotional states of people and groups requires the capacity to analyze sentiment on a scale.

  • ChatGPT’s transformer-based design enables it to capture context and dependencies in textual data. This contextual awareness is critical in psychological research, where subtleties in language and emotional signals play an essential part in analyzing human behavior and reactions.

  • The ability of ChatGPT to generate natural language facilitates interactive evaluations and treatments in psychological research and therapy. Researchers can use conversational evaluations to gain a better grasp of participants’ viewpoints and experiences.

  • Processing and analyzing vast amounts of textual data in psychological research can be time-consuming. ChatGPT’s design allows for efficient data processing, which aids in literature reviews, summarising research articles, and data analysis.

  • Marketing content such as commercials, promotional materials, and social media campaigns may be analyzed by ChatGPT. Its natural language processing skills allow it to recognise the persuasion techniques used in marketing material (Rivas and Zhao 2023). Game designers and marketers may get useful input on the efficacy of their campaigns by anticipating how players will respond to particular marketing strategies.

  • ChatGPT can analyze customer feedback from a variety of sources, including surveys, reviews, and social media. By doing sentiment analysis, it may detect the general attitude towards marketing efforts and find areas that resonate positively or adversely with the target audience.

  • ChatGPT can analyze and optimize marketing material such as ad copy, email campaigns, and social media postings. It can detect language patterns and components influencing engagement and conversions, resulting in optimized content development.

  • ChatGPT can help marketers understand customer segments based on their linguistic patterns, preferences, and responses to marketing initiatives. Such analysis assists marketers in effectively tailoring their tactics to specific audience segments.

  • ChatGPT may be implemented as an interactive chatbot on websites and social media to engage consumers, collect feedback, and make personalised suggestions based on their interests.

3.2 Ethical concerns

When it comes to ethical concerns, ChatGPT itself is a language model with no inherent capability to ensure ethical considerations in digital gaming or any other domain. The responsibility for ensuring ethical practices lies with the developers, organizations, and stakeholders who deploy ChatGPT or any other AI model in real-world applications, including digital gaming (Zhou et al. 2023). It is critical to address ethical issues when applying the ChatGPT architecture to big data analytics or digital gaming. Here are some enhancements to the ChatGPT architecture that may be implemented to meet ethical concerns (Wang et al. 2023):

  1.

    Improve ChatGPT’s capacity to recognise and reduce biases in its answers. Implement techniques to detect possible biases in produced material linked to sensitive themes, ensuring that the language model does not accidentally disseminate discriminatory or hurtful information.

  2.

    Allow users to customise and fine-tune ChatGPT’s behaviour while remaining ethical. Allow users to establish rules, preferences, or values to verify that the model’s responses are consistent with their ethical concerns (Guleria et al. 2023).

  3.

    Implement ways that will allow users to regulate the created material more effectively. Allow users to define limitations or recommendations on the kind of replies ChatGPT creates, lowering the danger of producing unsuitable or harmful content.

  4.

    Create real-time monitoring mechanisms to examine the outputs of ChatGPT for ethical problems. Use filters and alerts to discover and highlight potentially problematic replies, allowing for quick action.

  5.

    Establish external supervision and auditing methods to examine and evaluate the model’s ethical performance. Involve outside experts to evaluate the potential biases and ethical implications of ChatGPT outcomes.

  6.

    ChatGPT should be updated and improved on a regular basis, depending on feedback and developing ethical issues. Train the model on varied and inclusive datasets on a regular basis to reduce biases and increase its comprehension of other views.

4 Periodic modification of ChatGPT

ChatGPT is an improved version of GPT-3.5, which was itself a modified version of GPT-3, rather than a model trained entirely from scratch. Both supervised learning and reinforcement learning with human feedback (RLHF) were used in its training procedure (Liang et al. 2024).

Supervised learning is a fundamental method for teaching ChatGPT to create coherent and contextually relevant text answers. In supervised learning, the model learns to anticipate target outputs (responses) from labeled input sequences (prompts) (Roumeliotis and Tselikas 2023). The purpose of supervised learning is to reduce the gap between the projected probability distribution and the actual target distribution (Kenney 2023). This difference is quantified by the loss function. Consider a basic case in which the intended output is a single token and the cross-entropy loss is used.

Given an input prompt I and its corresponding target response O, let Pd represent the predicted probability of token oi being generated given the input I. Also, let Otrue be the one-hot encoding of the true target token. The cross-entropy loss X for a single token can be calculated using Eq. 22:

$$\begin{aligned} X(o_{i},I)=-\sum _{i}O_{true}[i]\cdot log(Pd(o_{i}|I)) \end{aligned}$$
(22)

Here i is the index that iterates over all possible tokens in the vocabulary, and the summation adds the loss over all of them; since Otrue is one-hot, the loss reduces to the negative log of the probability the model assigns to the true token (Davis et al. 2024). The aim is to minimize this loss over the whole dataset, so that the model learns to create replies corresponding to the intended target sequences. In ChatGPT, supervised learning thus entails training the model to predict target answers given input prompts by minimizing the cross-entropy loss, iteratively adjusting its parameters through optimization until it produces coherent and contextually appropriate content.
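
As a small worked example of Eq. 22, assuming a four-token vocabulary and a toy predicted distribution, the one-hot target ensures that only the true token's probability contributes to the loss.

```python
import numpy as np

Pd = np.array([0.05, 0.70, 0.15, 0.10])     # Pd(o_i | I): toy predicted distribution
O_true = np.array([0, 1, 0, 0])             # one-hot encoding of the true target token
X = -np.sum(O_true * np.log(Pd))            # Eq. 22: only the true token contributes
print(round(float(X), 4))                   # -log(0.70) ~ 0.3567
```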

Reinforcement learning (Hou et al. 2024) is a learning paradigm in which an agent learns how to interact with its surroundings in order to maximise a reward signal (Jahan et al. 2023); it is frequently formalized within the framework of Markov Decision Processes. In reinforcement learning with human feedback, the model is improved using a combination of supervised fine-tuning, reward modeling, and reinforcement learning techniques, as shown in Fig. 2. A prompt or query is sent to ChatGPT by the user, as shown in Fig. 2. The model analyses the prompt using its language processing abilities and its previously learned grasp of linguistic relationships and patterns. The first step in Fig. 2 is a supervised fine-tuning (SFT) model that is trained on collected demonstration data. Mathematically, it can be explained as follows:

Fig. 2 RLHF training method. The model leverages human-provided feedback to improve its performance; arrows indicate the flow of information through the network, highlighting the iterative learning process from feedback (Höglund and Khedri 2023)

Demonstration data: (pi, di), where pi is the input prompt and di is the desired model response.

SFT model:

$$\begin{aligned} SFT\ Model: \sigma _{SFT} \end{aligned}$$
(23)

The SFT model in Eq. 23 refers to the neural network that is fine-tuned on the demonstration data via supervised learning; the symbol σSFT denotes its parameters. This model aims to produce responses similar to the desired responses in the demonstration data (Yuan et al. 2024).

$$\begin{aligned} SFT\ Loss: G_{SFT}(\sigma _{SFT})=\sum _{i}G(R^{\sigma _{SFT}}(p_{i}), d_{i}) \end{aligned}$$
(24)

The SFT loss in Eq. 24 represents the discrepancy between the output generated by the SFT model and the desired response di for a specific input pi. R(pi), computed with the SFT parameters σSFT, is the response generated by the SFT model for input pi, that is, the model's prediction for that input. The loss function G quantifies the difference between the generated response and the desired response for each demonstration data point, and the summation aggregates these losses over all demonstration data points.

Then, in the subsequent step, the reward model assigns scores to the SFT model's outputs based on how desirable they are to users. This process involves iteratively updating the SFT model's parameters to improve its behavior over time. The reward model training loss is shown in Eq. 25:

$$\begin{aligned} G_{RM}(RM)=\sum _{j}G(X_{RM}(R(p_{j})),d_{j}) \end{aligned}$$
(25)

GRM(RM) in Eq. 25 represents the overall training loss of the reward model. This is the loss value that we aim to minimize during the training process of the reward model; it reflects how accurate the reward model is in predicting the quality of the SFT model's outputs. The summation symbol indicates that we are summing over all data points in the reward model training dataset. XRM(R(pj)) represents the predicted reward score given by the reward model for the response R(pj) generated by the SFT model. dj stands for the actual human-assigned score or desirability rating for the response to prompt pj; this is the real quality assessment given by human evaluators.

The reward model training’s purpose is to minimize this loss. The goal of the reward model is to anticipate how desirable or high-quality a machine-generated answer is in the eyes of human assessors. In the final step, Reinforcement learning is used to fine-tune SFT Policy by allowing it to optimize the RM. Proximal Policy Optimization (PPO) is an abbreviation for a fine-tuned model of proximal policy optimization. Better answers are generated by the enhanced SFT model, which is assessed and utilized to further update the reward model and modify the machine’s behavior.

Before GPT-3, there were two earlier language models in the series, GPT-1 and GPT-2. Although the architecture of all three models was the same, each was designed to understand and generate human-like text based on learned patterns. The original GPT-1 model, which has 117 million parameters, was released in 2018. With 1.5 billion parameters, GPT-2, launched in 2019, was substantially bigger and more powerful. With 175 billion parameters, GPT-3, published in 2020, is the largest and most sophisticated version; because of its massive size, it can capture even more intricate linguistic patterns and interactions (Ghojogh and Ghodsi 2020). The number of decoder layers and hidden layers present in GPT-1, GPT-2, and GPT-3 is also given in Table 1. The maximum number of context tokens that the model can consider at once is 512 for GPT-1, 1024 for GPT-2, and 2048 for GPT-3. The batch size in Table 1 refers to the number of input examples (data points) processed together in a single forward and backward pass through the model during training.

Table 1 Comparison of GPT-1, GPT-2, and GPT-3 based on the number of parameters, the context token size of the model, the number of decoder layers and hidden layers, and the batch size of the model
Fig. 3 Comparison of Transformer blocks in different GPT versions. The evolution of GPT models is demonstrated through their respective Transformer block structures

GPT-1 was trained using a combination of licensed data, data produced by human trainers, and freely available internet content (Hadi et al. 2023). GPT-2 was trained on a significantly larger and more diversified dataset that included a wide variety of online material. The training dataset for GPT-3 is significantly larger and more diversified, containing a wide range of text sources from the internet.

GPT-1 displayed outstanding language creation skills, although its replies were frequently less cohesive and contextually correct than later incarnations. GPT-2 outperformed GPT-1 in terms of producing more cohesive and contextually appropriate text (Guo et al. 2023). It demonstrated the capacity to create extended stretches of high-quality prose and quickly gained popularity for its human-like text production. GPT-3 provides a significant advancement in capability. Because of its increased size and improved training data, it can interpret and create more sophisticated and contextually correct language. GPT-3 can have complicated conversations, answer questions, translate languages, and execute a variety of natural language processing tasks.

Fine-tuning on specific tasks was conceivable, although it was not thoroughly investigated in the original GPT-1 release. OpenAI initially withheld GPT-2 due to concerns about its misuse; OpenAI later published the model and illustrated how it could be fine-tuned for specific applications while emphasizing safe use. GPT-3 allows for fine-tuning, although precise control over its behavior is difficult because of its size and complexity.

GPT-1 demonstrated the power of large-scale language models and helped shape succeeding generations. GPT-2 drew a lot of attention and sparked debate about ethical concerns and the potential exploitation of AI-generated material. GPT-3 has been widely used in a variety of applications, including chatbots and virtual assistants, as well as content generation and creative writing. Its capabilities have generated both enthusiasm and anxiety about the influence it will have on numerous businesses and on society as a whole.

The Transformer architecture is the foundation of the GPT-1, GPT-2, and GPT-3 versions. While the underlying structure stays intact across various iterations, each iteration has seen modifications and upgrades (Chawla et al. 2021). Let’s look at how the primary components of the Transformer block vary in GPT-1, GPT-2, and GPT-3 in Fig. 3: text and location embeddings, multi-head self-attention, layer normalization, feed-forward layers, and text prediction.

Text embeddings are employed in all three versions to turn input tokens (words or subword parts) into continuous vector representations. The semantic meaning of the tokens is captured by these embeddings. Transformers have no built-in notion of the order of tokens in a sequence since they process tokens in parallel. Position embeddings are added to text embeddings to provide information to the model about the location of tokens in the sequence.

GPT-1 uses multi-head self-attention with a comparatively small number of heads (12), attending to different points in the input sequence with a limited set of attention patterns. GPT-2 scales this multi-head self-attention mechanism up, enabling the model to attend to multiple sections of the input sequence at the same time and learn more varied attention patterns. GPT-3, like GPT-2, employs multi-head self-attention but with a substantially greater number of attention heads (96) in its bigger variants.

Layer normalisation is a method used to normalize the activations of neurons within a layer in neural network architectures such as the GPT series. It is a type of normalization that helps stabilize training, improve convergence, and facilitate the learning process in deep neural networks. In each Transformer block, GPT-1 applies layer normalization and feed-forward layers following the self-attention mechanism. GPT-2 keeps this structure but scales the normalization and feed-forward layers to fit the increased model size. GPT-3 keeps the layer normalization and feed-forward layers while optimizing their configurations for greater scalability and performance.

GPT-1 focuses on autoregressive text generation, predicting the next word in a sequence given the context of preceding words. GPT-2 expands on autoregressive generation, demonstrating the capability to generate coherent and contextually relevant text, such as paragraphs and articles. GPT-3 significantly advances text prediction, showcasing the ability to generate entire essays, stories, and translations, and perform a wide range of natural language processing tasks.

There are two different versions of ChatGPT: ChatGPT 3.5 and ChatGPT 4. Both models are based on the Transformer architecture: its parallel processing capacity aids in handling extended discussions and keeping a consistent grasp of the dialogue history, which is critical for providing relevant and engaging replies in a conversational AI like ChatGPT. The Transformer's ability to gather contextual information, handle long-term dependencies, and parallelize computations makes it an excellent basis for developing complex conversational AI models like ChatGPT. There are certain differences between the two models: the dataset used for ChatGPT 4 (the HumanEval dataset) is of size 1 petabyte, whereas the dataset used for ChatGPT 3.5 (the Common Crawl dataset) is of size 570 GB. One notable distinction is that GPT-4 is larger than GPT-3.5, with more parameters and processing capability, allowing it to tackle more complicated jobs and linguistic patterns. ChatGPT-4, which is based on GPT-4 and can accommodate up to one trillion parameters, is more powerful than its predecessor and capable of handling more diverse and difficult natural language scenarios. There are also differences in the number of parameters considered in the dataset, the input context length, and the output word limit (Massey et al. 2022). ChatGPT-3.5 has 96 self-attention heads, while ChatGPT-4 has 120. Similarly, differences in memory capacity, size of the training data, the questions asked, and input formats between ChatGPT 3.5 and ChatGPT 4 are represented in the table. GPT-4 is fully multilingual and capable of handling many languages, whereas GPT-3.5 already had a good English proficiency score of 70.1.

5 Comparative analysis and result

A comparative performance analysis was performed between ChatGPT 3.5 and ChatGPT 4 based on reasoning (the quality of judgment shown by each chatbot), speed in generating a response, and conciseness. A rating out of 5 was given for each factor considered in the comparative analysis, represented in Fig. 4. We can see from Fig. 4 that when it comes to reasoning and conciseness, ChatGPT 4 outperforms ChatGPT 3.5, but in terms of speed ChatGPT 3.5 generates a faster response than ChatGPT 4.

Fig. 4 Graphical comparison based on reasoning, conciseness, and speed: a visual representation showcasing the trade-off between reasoning ability, response conciseness, and processing speed

There are, furthermore, key differences between ChatGPT 3.5 and ChatGPT 4 on which a comparative analysis can be based:

ChatGPT 4 is multimodal while ChatGPT 3.5 is unimodal. The fact that ChatGPT 3.5 was unimodal was a severe drawback (Mitra 2023): the chatbot merely understands and interprets text input. ChatGPT 4 is far more sophisticated since it is powered by the GPT-4 engine, which is multimodal; ChatGPT 4 can understand and process images, and it takes both text and image cues. A unimodal model processes or understands just one form of input or data; in language models such as GPT, unimodal means that the model exclusively deals with text-based input and produces text-based output. A multimodal model, by contrast, can employ a combination of text and image inputs and offer replies informed by both the textual context and the image's visual content. This opens new avenues for more interactive and contextually rich dialogues. ChatGPT's interactions become more versatile by embracing multimodal capabilities: it can aid users with activities that need awareness of both written descriptions and visual clues, such as creating captions for photographs, offering thorough explanations based on diagrams, or assisting with creative endeavours that combine text and visual features.

ChatGPT-4 has much more processing power compared to ChatGPT-3.5. ChatGPT-4 outperforms ChatGPT 3.5 in terms of raw processing power (Cobb 2023) and the capacity to tackle complicated scientific and mathematical issues. ChatGPT 4 can answer problems and equations in areas ranging from calculus to geometry to algebra. ChatGPT 3.5, on the other hand, can point users in the right direction rather than delivering a complete answer. ChatGPT 4's ability to deliver complete answers can be immensely valuable for users seeking immediate solutions to complex problems, making it a more comprehensive resource for those looking to learn or obtain accurate results in scientific and mathematical contexts.

ChatGPT 4's replies are far more nuanced (Shihab et al. 2023). One significant flaw in ChatGPT 3.5 was its inability to recognise subtle nuances in actual human discourse; ChatGPT 3.5 did not get jokes or sarcasm. When it comes to creativity, ChatGPT 4 has well outpaced its predecessor: it can generate better poetry or articles with greater coherence and originality. The enhanced context window in ChatGPT 4 is the main cause of this increased functionality. ChatGPT 4 can now keep up to 25,000 words of conversation for context, whereas ChatGPT 3.5 only had 3000 words, as shown in Fig. 5.

ChatGPT 4 is more precise and less susceptible to hallucinations. Precision is the ability to be exact, accurate, and specific. Being more precise in the context of a language model like ChatGPT implies that the model creates replies that closely fit the context of the input and give correct information (Hanna and Levic 2023). The fact that ChatGPT 4 is more precise implies that its replies are more likely to be factually true, relevant, and acceptable in light of the information it receives. In the context of language models, hallucinations refer to instances in which the model creates information that is neither correct nor verifiable; in other words, the model may generate data that seems convincing but is, in fact, false or contrived. Being less susceptible to hallucinations suggests that ChatGPT 4 has been modified to lessen the occurrence of creating misleading or erroneous information, making its replies more dependable and trustworthy. These improvements could be due to ChatGPT 4 having undergone more fine-tuning and careful data filtering to remove undesirable or unwanted information. Techniques like dropout, weight decay, or other regularization methods applied to ChatGPT 4 also contribute to this improvement.

The word limit in ChatGPT-4 is more than eight times higher than in its predecessor, as shown in Fig. 5. This is essential because GPT-4 can handle far more sophisticated and nuanced inputs, combined with the quantity of data this model is trained on (Rahaman et al. 2023), enabling it to deliver extremely detailed and thorough outputs. ChatGPT 4 can handle a wider range of activities with improved precision and efficiency attributable to an enhanced word count limit for both input and output.

Fig. 5 Word limit comparison between ChatGPT 3.5 and ChatGPT 4: unleashing the power of language generation and exploring the expanded horizons

A comparative performance evaluation was done for ChatGPT 3.5 and ChatGPT 4 on academic benchmarks, graphically represented in Fig. 6. For each benchmark, the first bar shows the performance of ChatGPT 3.5 and the second that of ChatGPT 4. The tests are: MMLU, a multiple-choice exam covering 57 subjects; HellaSWAG, a dataset for researching grounded commonsense inference, containing 70k multiple-choice questions about real-world scenarios, each with four possible answers, drawn from ActivityNet or wikihow.com; the AI2 Reasoning Challenge (ARC), which seeks to stimulate research in advanced question-answering, including problems requiring reasoning, the application of commonsense knowledge, and other strategies for deeper text understanding; WinoGrande, a 44k-problem dataset inspired by the original WSC design but modified to increase the dataset's scale and difficulty; a human evaluation, in which human judges offer qualitative and subjective feedback on the model's output; and the GSM-8K test, a dataset of grade-school math word problems that require multi-step reasoning. In all the tests represented in the bar graph, ChatGPT 4 outperforms ChatGPT 3.5 in terms of reasoning, logic, accuracy, and appropriateness of answers.

Fig. 6 Enhanced performance of ChatGPT 4 over ChatGPT 3.5 in diverse evaluations: MMLU, HellaSWAG, AI2 Reasoning Challenge, WinoGrande, human evaluation, and GSM-8K test

A comparison of ChatGPT 3.5 and ChatGPT 4 demonstrates substantial improvements in the latter version. ChatGPT 4 improves language comprehension, coherence, and context memory in conversations, which can be attributed to the model's larger architecture and training data set. ChatGPT 3.5, on the other hand, shows some inconsistencies in context, resulting in less logical replies. With its improved performance, ChatGPT 4 can manage lengthier discussions with more accurate replies, making it a more capable tool for creating human-like prose. ChatGPT 4, the most recent generation, has a bigger model size and a more diversified and extensive training dataset. This results in more sophisticated query understanding and the generation of more contextually relevant replies. ChatGPT 4 also includes updated fine-tuning procedures that increase response quality: even in complicated and nuanced conversational circumstances, the model displays an improved capacity to provide logical and meaningful replies.

The limitations of ChatGPT 3.5 include issues such as context inconsistency, occasional incoherence, and the possibility of biased or incorrect replies (Azaria 2022). Furthermore, the model's more limited design may restrict its ability to handle sophisticated or nuanced questions and generate extensive answers. While ChatGPT 4 represents a significant improvement, it is not without limits. Common issues persist, such as sensitivity to input wording and the occasional creation of overly lengthy answers (AlZu'bi et al. 2022). Furthermore, while ChatGPT 4 has improved context preservation, it still struggles to keep meaningful discussion threads in excessively long interactions. It can also still produce plausible-sounding but erroneous or biased information, emphasizing the importance of user awareness and fact-checking (Zhou et al. 2023).

6 Limitations of ChatGPT

  • ChatGPT has some difficulties in recognizing context or creating replies that need a thorough comprehension of the actual world, particularly if the information is not explicitly supplied in the input. In order to overcome this limitation, we can include more context and specifics (Koubaa et al. 2023). If an answer appears to be incorrect or illogical, we can modify our input to provide more context or directly indicate the needed information (Rice et al. 2024).

  • ChatGPT can occasionally generate false or biased information since it learns from data that contains errors or biases (Sallam 2023). To avoid such errors, fact-checking the information supplied by ChatGPT and utilising different sources to confirm its veracity could be helpful. Be mindful of potential biases and assess replies critically.

  • Hallucination is a significant challenge in AI systems, particularly in language models like ChatGPT. These models generate text based on patterns they learn from vast amounts of training data, and sometimes, they might produce responses that sound plausible but are not grounded in reality. To reduce hallucination, fine-tuning and domain specification could be helpful, along with fact-checking, verification, and uncertainty estimation.

  • ChatGPT can generate sensitive, offensive, or inappropriate content. A solution is to restrict the generation of explicit or objectionable content using OpenAI's content filter; if we encounter improper replies, we can submit them to OpenAI for further development.

  • ChatGPT may struggle to develop truly innovative or original material, frequently producing replies derived from its training data. If we seek creative input, we can set certain rules or constraints that stimulate new thinking, and experiment with various prompts to elicit distinct reactions.

  • In lengthier chats, ChatGPT may struggle to maintain a consistent context or grasp references. A solution is, when asking follow-up questions, to periodically summarise the background of the dialogue or make clear and explicit references. If the conversation grows too complicated, consider beginning with a brief overview of the situation (Alshami et al. 2023).

7 Conclusion and future work

In conclusion, this study examined the ChatGPT architecture's shortcomings and identified possible solutions. To establish the ChatGPT model's advantages and disadvantages (Liu et al. 2023), we looked at numerous iterations and conducted a thorough comparison. Through rigorous analysis, it became evident that while ChatGPT has made remarkable strides in generating human-like text and facilitating natural language interactions, it is not devoid of limitations.

The limitations explored in this paper encompassed issues related to biased and unsafe outputs, sensitivity to input phrasing, generation of incorrect or nonsensical information, and struggles with handling context over longer conversations. To tackle these limitations, the proposed solutions incorporated techniques such as fine-tuning on narrower datasets, reinforcement learning from human feedback, and integrating external knowledge sources. Additionally, examining various versions of ChatGPT, ranging from its initial release to subsequent iterations, showcased the iterative improvements made by the development team in response to user feedback and research insights.

Future work in this field has a lot of potential. Further study might concentrate on improving and extending the suggested solutions’ efficiency and investigating fresh methods to improve model performance. A crucial area of research is expanding the model’s training data diversity and enhancing its ethical concerns, such as minimizing bias and avoiding dangerous outputs. The system’s practical utility might also be considerably improved by tackling the difficulties of long-context dialogues and enhancing the system’s capacity to store context.

As the field of AI-driven natural language processing continues to evolve, researchers and developers are encouraged to work collaboratively toward more robust and reliable conversational AI systems. By addressing the limitations highlighted in this paper and building upon the insights gained from comparing different versions of the ChatGPT architecture, we can collectively contribute to the creation of AI systems that better understand and assist humans in their communication needs while respecting ethical and contextual boundaries.