Keywords

1 Introduction

The introduction of ChatGPT into the landscape of technology has generated a notable amount of disruption, especially within the field of software engineering. However, despite this growing interest, empirical knowledge about the actual capabilities of ChatGPT remains limited. This lack of comprehensive understanding is particularly evident when considering the potential of ChatGPT to design and implement holistic systems as opposed to merely generating discrete fragments of code. There exists a significant difference between crafting isolated code snippets and deploying a fully realized software solution, a distinction that is yet to be thoroughly explored in the context of ChatGPT.

This article describes a student software project that was created to explore the use of artificial intelligence (AI) in software development. The main goal of the project was to investigate the effectiveness of ChatGPT in practice.

Overall, this project contributes to the field of AI-assisted software development by providing valuable experience of using ChatGPT as tool in software development. The remainder of this article provides related research in Sect. 2, a detailed description of the project and the research design in Sect. 3, results in Sect. 4, and conclusions.

2 AI Assisted Software Development

Artificial Intelligence (AI) is a branch of computer science that focuses on creating intelligent machines that can perform tasks that typically require human intelligence, such as understanding natural language, recognizing patterns, making decisions, and solving problems.

One type of AI tool that has gained significant attention in recent years is the large language model (LLM) like OpenAI’s GPT-4 [7]. LLM is AI model that is trained on massive amounts of data to generate human-like text output. GPT-4 uses transformer-style model to predict the content and structure of text based on an input, usually text as well. This can be used in a wide range of natural language processing tasks, such as language translation, text summarization, and answering questions.

ChatGPT has provided a chatbot interface for interacting with OpenAI’s GPT-models [2]. This has made the capabilities of LLMs more widely known, which in turn has sparked the research around use cases for this technology. Current research include studies related to prompt patters [13], human-bot collaborative architecting [1] and using ChatGPT for programming numerical methods [6]. Treude [11] has developed a prototype to compare different GPT model solutions, and Dong and others [4] developed a self-collaboration code generation framework. Surameery and Shakor [10] have applied ChatGPT to solve programming bugs.

3 Research Design

In this section we introduce the project background and project implementation details including the implemented features.

3.1 Project Background

Solita Ltd. [8] is a large software consultancy company in the Nordic countries. Solita collaborates with universities by inventing exercise topics and supervising student exercises. They challenged the student team to undertake an AI-assisted, large-scale project.

Project topic was chosen from well-defined public procurement requests on the Hilma portal [5], a website for procurement in the Finnish public sector. It was agreed in the project that the specifications would not be directly used as input material for AI, but the prompts given to AI were written mostly by the team themselves. However, the number of fields in the user interface was kept the same as in the original request, etc.

The selected project, Valvontatyöpöytä (VTP) is a platform for financial supervision, designed to support the operations of an organization. The intended user group for the VTP is financial professionals, including supervisors, managers, and analysts.

3.2 Project Implementation

The VTP project [12] was proposed in the end of December 2022. The project was then accepted by a seven member team. The project started in the end of January 2023. The team consists of three master’s level students and four Bachelor’s level students of Computer science or Information technology. None of the team members had earlier experience on AI assisted software development. General overview of the course’s practices and schedule are described by Sten and others [9]. First was sprint 0 (one week) to plan the project and set up the development environment, then five two-week implementation sprints, and then one week quality assurance sprint. The total duration of the project is 15 weeks. Sprint 5 and the QA sprint are not covered in this report as the project is ongoing. Project phases and implemented features per sprint are show in Fig. 1. The team utilized a Kanban board to manage tasks and issues.

Fig. 1.
figure 1

Project phases and implemented features per sprint.

3.3 Development Environment

The development environment was built piecemeal by consulting ChatGPT on suitable technologies for the projects needs. The team fed ChatGPT prompts explaining what they were currently trying to accomplish and ChatGPT replied with multiple recommendations. The team then cherry-picked from the recommendations based on their own preferences and previous experience. The effect of picking technologies with such a process is twofold. Firstly, ChatGPT is more likely to recommend technologies it has knowledge of. Secondly, it makes the teams’ efforts in reviewing the generated code easier.

Table 1. Technical implementation environment and technology selection criteria.

Table 1 lists technologies used in the project with the corresponding selection criteria. As seen in Table 1 there are exceptions on which ChatGPT was not consulted at all. These include using ChatGPT as the sole AI assistant, and containerization of the produced application, both of which were requirements laid out by the customer. The team decided to use GitHub as their version control system to enable easy collaboration. The only exception in languages used in the project comes from a Bash script used by the backend container to wait for the database to fully initialize before trying to establish a connection.

3.4 Development Process

The team’s process of working with ChatGPT is illustrated in Fig. 2. When a task was related to existing code, the assigned team member would provide the relevant code to ChatGPT and request it to generate a solution. If the task was not related to existing code, the team member would first ask ChatGPT for recommendations before requesting it to produce the code. Once ChatGPT generated the code, the team member would review it for correctness. If the code was deemed satisfactory, the team member would add it to the code base, save the chat session as a markdown file, and create a pull request with the chat as an attachment. If the code was not acceptable, the team member would provide the problematic code back to ChatGPT and repeat the process, iterating until a satisfactory solution was achieved or asking for a new recommendation for another approach.

Fig. 2.
figure 2

Process of working with ChatGPT.

The project team logged their weekly workings hours according to different categories (Documentation, Requirements, Design, Implementation, Testing, Meetings, Studying, Other, Lectures). After sprint 4 the project had a total 607 h logged. Throughout the project, the project team met with the customer twice a week to ensure quality assurance and planning were on track. This is a reason why meetings (198 h, 33%) have such a considerable part in the logged hours. So far the implementation has taken 131 h (22%) and studying 116 h (19%). The logged time also included university course related subjects, such as lectures and other studying, which did not relate directly to the implementation of the project.

3.5 Implemented Features

Based on ChatGPT’s suggestion, we began by setting up the foundation for our React JavaScript project. Implemented features per sprints are shown in Fig. 1. Following the wireframes in the original procurement requests, we commenced implementing the Inspection Information page shown in Fig. 3. We adopted a component-by-component approach to building the application, and once the first frontend components were completed, we proceeded to develop the backend and database infrastructure.

Fig. 3.
figure 3

Inspection information in VTP.

To better comprehend the application’s functionality, we utilized the wireframe images to create a prototype. In our development process, we proceeded to implement a new view for the application, the Inspection Plan page. Additionally, we created localization possibility into the application, allowing for text to be displayed in both Finnish and English. With the assistance of ChatGPT, we created a style guide for the UI and began implementing it into the application.

We proceeded to build the correct routing and to integrate the frontend and backend together. Recognizing the need for adjustments to the database schema, we initiated revisions while concurrently addressing application styling and adding several popup forms.

4 Results

In this section we observe ChatGPT discussions, code and lessons learned.

4.1 Discussions with ChatGPT

We stored ChatGPT conversations that were relevant to the project in Markdown format. The resulting Markdown files were included into the description part of pull requests made to the project GitHub repository.

The conversation length varies a lot depending on how big changes were requested and if any problems occurred. From sprint 0 to sprint 4, 39 pull requests were merged to the codebase. In these conversations the number of prompts made ranged from 1 to 129. On average a pull request had 19 prompts. A common way to start conversation with ChatGPT was to describe the problem or wanted feature or to paste existing code and ask for needed changes. If the AI model couldn’t give a solution directly based on the first prompt, the developer would give more information or clarifications in an iterative manner where the solution was found by fine-tuning the earlier answers.

4.2 Code

The application implements a basic three-tier architecture, where the front-end communicates with the back-end through a REST API. Each tier runs in their own Docker container and the whole system is managed with Docker Compose.

Table 2 provides a breakdown by type of file and line for the project’s code base. File and blank, comment and code line counts were calculated from the project’s repository using cloc [3]. The vast majority of lines shown are a result of bootstrapping a React project to be used as the front-end client. This includes most of the JSON lines that node package manager uses for managing dependencies.

However, the team managed to use ChatGPT to generate all of the code directly related to running the project. This includes containerization (both Dockerfiles, a YAML file), CI pipeline (other YAML file), database initialization clauses (SQL), and the entire RESTful API that makes up the project’s back-end, including unit tests for its routes. Furthermore, all React components, their layout and styling (CSS) were created by ChatGPT based on the teams instructions. The HTML files contain a style guide. In total, 4000 lines of code were generated by ChatGPT.

Table 2. Output of the cloc [3] package based on the project’s repository as of 2.4.2023.

In general, ChatGPT produces code that appears to be relatively high quality. The code mostly works, it is laid out in a logical manner, and has well named variables that make it easy to understand. Its two biggest pitfalls are consistency and attention to detail. The former shows as stylistic differences in blocks of code produced in separate replies by ChatGPT, which were sometimes incompatible to the point of not functioning. The latter problem was mainly encountered when attempting to fit pieces of the project together as it grew more complex which meant conversations with ChatGPT needed more context to be provided. The team noticed marked improvement in both areas when using GPT-4 over GPT-3.5.

4.3 Lessons Learned

ChatGPT has both strengths and limitations in assisting with software development tasks. While it can be helpful in creating a base for new components, fine-tuning and getting the final result may take longer. It may struggle with updating code consistently, remembering to make changes, and handling errors. Additionally, it can misunderstand questions, have issues with non-determinism, and suffer from token limitations in answers. However, breaking requests into smaller pieces, providing clear descriptions, and using various techniques to avoid length restrictions can improve its effectiveness.

One team member described experience of working with ChatGPT like a rubber ducking, but the duck actually responds. In similar vein to rubber ducking, the output from the language model depends on how well the developer is able to express the problem at hand. A Clear well defined prompt would return better code than a prompt by someone with only a vague idea of what they were trying to achieve. Good development practises still apply, even if the code is written by AI.

5 Conclusions

This paper presents first-hand experiences of using ChatGPT to develop a fullstack software application. Overall, this study contributes to the growing body of literature on the application of language models in software engineering. This study also provides insights for researchers and practitioners interested in exploring the use of ChatGPT for developing real-world software systems.

Based on early results from this exploratory study, the main conclusions drawn were as follows: 1) these findings demonstrate the potential for ChatGPT to be integrated into the software engineering workflow, 2) it can be used for creating a base for new components and for dividing coding tasks into smaller pieces, and 3) noticeable enhancements in ChatGPT-4, compared to ChatGPT-3.5, indicate superior working memory and the ability to continue incomplete responses, thereby leading to more coherent and less repetitive dialogues.

Next steps in the research will include reporting experiences from the remaining sprints, test the application systematically, analyse code quality, and compare ChatGPT generated code with the code written by the developers.