1 Introduction

Alzheimer’s disease (ad) is a neurodegenerative disorder and the leading cause of chronic, progressive dementia, which negatively impacts cognitive functioning, causing comprehension, speech, and thinking problems as well as memory loss [1]. More specifically, the typical stages of cognitive decline can be categorized as pre-clinical ad, mild cognitive impairment (mci) due to ad, and finally ad dementia [2]. Cognitively impaired users generally find it difficult to perform daily tasks, with a consequent detrimental impact on their quality of life [3]. Accordingly, cognitive decline is a leading cause of dependency and disability among older adults [4].

According to the Alzheimer’s Association report on the impact of this disease in the USA [5], it is the sixth-leading cause of death, having increased by more than 145% in recent years. Moreover, it affects 6.7 million people aged 65 or older, a number predicted to grow to 13.8 million by 2060. Medical expenses for people aged 65 or older affected by dementia are three times greater than those of people without this condition, reaching 345 billion dollars in 2023. Overall, the World Health Organization estimates that 50 million people worldwide are affected by dementia, with 10 million new patients yearly.Footnote 1

Clinical prognostication and early intervention, the most promising ways to address mental deterioration, rely on effective progression detection [2]. The benefits of early identification include care planning assistance, reduced medical expenses, and the opportunity to receive the latest treatments, including non-invasive therapies, given the rapid advancement of biologic therapeutics [6, 7]. Social stigma and socioeconomic status must also be considered when accessing mental health services [8]. However, early diagnosis is challenging since the symptoms can be confused with normal age-related decline [9]. To address this challenge, computational linguistics can be exploited [10]. Natural language analysis is particularly relevant since it constitutes a significant proportion of healthcare data [11]. In particular, impairment in language production mainly affects lexical (e.g., little use of nouns and verbs), semantic (e.g., the use of empty words like thing/stuff), and pragmatic (e.g., discourse disorganization) aspects [12].

Digital and technological advances such as artificial intelligence (ai)-based systems represent promising approaches toward individuals’ needs for personalized assessment, monitoring, and treatment [13]. Accordingly, these systems can complement traditional methodologies such as the Alzheimer’s Disease Assessment Scale-Cognition (adascog), the Mini-Mental State Examination (mmse), and the Montreal Cognitive Assessment (moca), which generally involve expensive, invasive equipment and lengthy evaluations [14]. In fact, paper-and-pencil cognitive tests remain the most common approach, even though the latest advances in natural language processing (nlp) enable easy screening from speech data while avoiding burdening patients and physicians [15]. In sum, language analysis can translate into an effective, inexpensive, non-invasive, and simpler way of monitoring cognitive decline [14, 16], given that the spontaneous speech of cognitively impaired people is characterized by the aforementioned semantic comprehension problems and memory loss episodes [17].

Consequently, clinical decision support systems (cdsss), diagnostic decision support systems (ddsss), and intelligent diagnosis systems (idss), which apply ai techniques (e.g., machine learning (ml), nlp, etc.) to analyze patient medical data (i.e., clinical records, imaging data, lab results, etc.) and discover relevant patterns effectively and efficiently, have significantly attracted the attention of the medical and research communities [18]. However, one of the main disadvantages of traditional approaches is their lack of semantic knowledge management and explainability capabilities [17]. The latter can be especially problematic in the medical domain regarding accountability of the decision process when physicians recommend personalized treatments [14].

Integrating ai-based systems into conversational assistants to provide economical, flexible, immediate, and personalized health support is particularly relevant [19]. Their use has been greatly enhanced by the now-popular large language models (llms), which enable more dynamic dialogues than previous developments [20]. These llms have been powered by the latest advancements in deep learning techniques and the availability of vast amounts of cross-disciplinary data [21]. They represent the most innovative application of ai to healthcare, expediting medical interventions and providing new markers and therapeutic approaches for neurological diagnosis from patient narrative processing [22]. Note that the patient experience can also be improved with the help of llms in terms of information and support seeking [23]. In sum, conversational assistants that leverage llms have the potential to monitor high-risk populations and provide personalized advice, apart from offering companionship [19, 24], and are regarded in the literature as the future of therapy [25].

Given the still-poor accuracy of cdsss [26, 27], we leverage an llm using the latest nlp techniques in a chatbot solution to provide interpretable ml prediction of cognitive decline in real time. Linguistic-conceptual features are exploited for appropriate natural language analysis. The main limitation of llms is that their outcomes may be misleading; thus, we apply prompt engineering to mitigate the “hallucination” effect. Through explainability, we aim to counter potential biases of the models and improve their potential to help clinical workers in their diagnostic decisions. In sum, we contribute an affordable, non-invasive diagnostic system in this work.

The rest of this paper is organized as follows. Section 2 reviews the relevant competing works on cognitive decline detection involving llms and interpretable ml predictions of mental deterioration. The contribution of this work is summarized in Sect. 2.1. Section 3 explains the proposed solution, while Sect. 4 describes the experimental data set, our implementations, and the results obtained. Finally, Sect. 5 concludes the paper and proposes future research.

  • Problem The World Health Organization predicts a yearly increase of 10 million people affected with dementia.

  • What is already known Paper-and-pencil cognitive tests continue to be the most common approach. This is impractical given the disease’s growth rate. Moreover, one of the main disadvantages of intelligent approaches is their lack of semantic knowledge management and explainability capabilities.

  • What this paper adds We leverage an llm using the latest nlp techniques in a chatbot solution to provide interpretable ml prediction of cognitive decline in real time. In sum, we contribute an affordable, flexible, non-invasive, personalized diagnostic system in this work.

2 Related Work

As previously mentioned, the main focus of dementia treatment is to delay the cognitive deterioration of patients [17]. Consequently, early diagnosis, which simultaneously contributes to reducing medical expenses in our aging society and avoiding invasive treatments with subsequent side effects on the users, is desirable [6]. To this end, ai has been successfully applied to idss in order to recommend treatments based on their diagnosis prediction [28, 29].

While ml models perform well and fast in diagnosis tasks, they require extensive training data previously analyzed by experts, which is labor-intensive and time-consuming [17]. In contrast, advanced nlp-based solutions exploit transformer-based models already trained on large corpora, including domain-related data, which results in very sensitive text analysis capabilities [30]. Consequently, transformer-based pre-trained language models (plms) (e.g., bert [31], gpt-3 [32]), which preceded the popular llms (e.g., gpt-4Footnote 2), have disruptively transformed nlp research. These models exhibit great contextual latent feature extraction abilities from textual input [30]. They are trained to predict the next token from massive training data, generating their output word by word [33]. Nowadays, they are used for various tasks, including problem-solving, question-answering, sentiment analysis, text classification, and generation [34].

There exist plm versions trained on biomedical and clinical data, such as Biobert [35], Biogpt [36], Bluebert [37], Clinicalbert,Footnote 3 and tcm-bert [38]. Open-domain conversational assistants, whose dialogue capabilities are not restricted to a given conversation topic, exploit llms [19]. However, the use of llms for cognitive decline diagnosis is still scarce, even though these models represent the most advanced way of supporting clinician-patient communication with intelligent systems [39]. More in detail, they overcome the limitation of traditional approaches that lack the semantic reasoning especially relevant to clinical language [40]. Unfortunately, despite the significant advancement they represent, llms still exhibit certain limitations in open-domain task-oriented dialogues (e.g., medical use cases) [41]. To address the latter, reinforcement learning from human feedback (rlhf) and prompt engineering techniques are applied to enhance their performance based on end users’ instructions and preferences [42].

Regarding the application of plms to the medical field, Syed et al. [3] performed two tasks: (i) dementia prediction and (ii) mmse score estimation from speech recordings, combining acoustic features and text embeddings obtained with the bert model from their transcriptions. The input data correspond to cognitive tests (cts). Yuan et al. [12] analyzed disfluencies (i.e., uh/um word frequency and speech pauses) with the bert and ernie models based on data from the Cookie Theft picture from the Boston Diagnostic Aphasia Exam. Close to the work by Syed et al. [3], Chen et al. [15] analyzed the performance of the bert model for embedding extraction in cognitive impairment detection from speech gathered during cts. Santander-Cruz et al. [17] combined Siamese bert networks (sberts) with ml classifiers to first extract sentence embeddings and then predict Alzheimer’s disease from ct data. In contrast, Vats et al. [1] performed dementia detection combining ml, the bert model, and acoustic features to achieve improved performance. Moreover, Li et al. [16] compared gpt-2 with its artificially degraded version (gpt-d), created by inducing a dementia-related linguistic anomalies layer, based on data from a picture description task, while Agbavor and Liang [14] predicted dementia and cognitive scores from ct data using gpt-3, exploiting both word embeddings and acoustic knowledge. Finally, Mao et al. [2] pre-trained the bert model with unstructured clinical notes from Electronic Health Records (ehrs) to detect mci-to-ad progression.

Table 1 Comparison of diagnostic llm-based solutions taking into account the field of application, the model used, the input data, and explainability (Ex.) capability

More closely related to our research is the work by Bertacchini et al. [13]. The authors designed Pepper, a social robot with real-time conversational capabilities exploiting the Chatgpt gpt-3.5 model; however, the use case of that system is autism spectrum disorder detection. Furthermore, Caruccio et al. [18] compared the diagnostic performance of different Chatgpt models (i.e., ada, babbage, curie, davinci, and gpt-3.5) with Google Bard and traditional ml approaches based on symptomatic data. The authors exploited prompt engineering to ensure appropriate performance when submitting clinical questions to the llm. Moreover, Hirosawa et al. [39] analyzed the diagnostic ability of the Chatgpt gpt-3.5 model using clinical vignettes, comparing the llm’s output with physicians’ diagnoses. However, these authors again focus not on cognitive decline prediction but on ten common chief complaints. Consideration should be given to the work by Koga et al. [30], who used Chatgpt (i.e., the gpt-3.5 and gpt-4 models) and Google Bard to predict several neurodegenerative disorders based on clinical summaries from clinicopathological conferences, without being a specific solution tailored to ad prediction. Finally, regarding conversational assistants that integrate llms, Zaman et al. [43] developed a chatbot based on the Chatgpt gpt-3.5 model to provide emotional support to caregivers (i.e., practical tips and shared experiences).

2.1 Contributions

As previously described, a vast amount of work in the state of the art exploits plms, even in the clinical field [44]. However, scant research has addressed llms. Table 1 summarizes the reviewed diagnostic solutions that exploit llms in the literature. Note that explainability represents a differential characteristic of the proposed solution, given the relevance of promoting transparency in ai-based systems [45].

From the comparison with competing works, the following contributions stand out:

  • Our system is the first that jointly considers the application of an llm over spontaneous speech and provides interpretable ml results for the use case of mental decline prediction.

  • Our solution implements ml models in streaming to provide real-time functioning, hence avoiding the re-training cost of batch systems.

  • In this work, we leverage the potential of llms by applying prompt engineering guided by human feedback in a chatbot solution. Note that the natural language analysis is performed with linguistic-conceptual features. Consequently, we contribute an affordable, non-invasive diagnostic system.

  • Our system democratizes access to the latest advances in nlp for researchers and end users within the public health field.

3 Methodology

Figure 1 depicts the system scheme proposed for real-time prediction of mental decline combining llms and ml algorithms with explainability capabilities. More in detail, it is composed of (i) data extraction employing nlp-based prompt engineering (Sect. 3.1); (ii) stream-based data processing including feature engineering, analysis and selection (Sect. 3.2); (iii) real-time classification (Sect. 3.3); and (iv) the explainability dashboard to provide visual and natural language descriptions of the prediction outcome (Sect. 3.4). Algorithm 1 describes the complete process.
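The flow above can be sketched as a simple pipeline. Every function below is an illustrative stub under assumed names and logic, not the actual implementation:

```python
# High-level sketch of the four-stage pipeline in Fig. 1 (all stubs are
# hypothetical placeholders standing in for Sects. 3.1-3.4).

def extract_features(transcript: str) -> dict:
    # Stage (i): llm-based data extraction (Sect. 3.1), stubbed here.
    return {"words": len(transcript.split())}

def process_features(features: dict) -> dict:
    # Stage (ii): stream-based feature engineering and selection (Sect. 3.2).
    return {k: v for k, v in features.items() if v is not None}

def classify(features: dict) -> int:
    # Stage (iii): real-time classification (Sect. 3.3); trivial stub that
    # flags very short responses.
    return int(features.get("words", 0) < 5)

def explain(features: dict, label: int) -> str:
    # Stage (iv): natural language description for the dashboard (Sect. 3.4).
    status = "present" if label else "absent"
    return f"Mental deterioration {status} (features: {sorted(features)})"

def run_pipeline(transcript: str) -> str:
    features = process_features(extract_features(transcript))
    return explain(features, classify(features))
```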

Fig. 1
figure 1

System scheme

figure a

3.1 Data Extraction

The Chatgpt gpt-3.5 model serves two purposes: (i) it enables a natural, free dialogue with the end users, and (ii) it extracts data thanks to its semantic knowledge management capabilities. The latter information is gathered once the conversation is concluded (either after more than 3 min of inactivity or once a farewell is detected) and used to compute the features used for classification (see Sect. 3.2.1). Prompt engineering is exploited for this extraction. The complete data extraction process is described in Algorithm 2.
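As an illustration of how such an extraction prompt can be assembled, the sketch below builds a hypothetical instruction asking the model to score a few features as JSON. The wording and feature list are assumptions; the actual prompt appears in Listing 1.

```python
# Sketch of llm-based feature extraction via prompt engineering.
# FEATURES and the prompt wording are illustrative assumptions.

FEATURES = ["fluency", "repetitiveness", "fatigue", "polarity"]

def build_extraction_prompt(transcript: str) -> str:
    """Ask the model to return one value per feature as a JSON object."""
    return (
        "You are a clinical language analyst. For the conversation below, "
        "return a JSON object with the keys "
        + ", ".join(FEATURES)
        + ", each scored in [0, 1].\n\nConversation:\n"
        + transcript
    )

# The prompt would then be sent to the chat completions endpoint, e.g.:
# response = client.chat.completions.create(
#     model="gpt-3.5-turbo",
#     messages=[{"role": "user", "content": build_extraction_prompt(t)}],
# )
```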

figure b
Table 2 Features engineered for mental deterioration prediction

3.2 Stream-Based Data Processing

Stream-based data processing encompasses feature engineering, analysis, and selection tasks to ensure the optimal performance of the ml classifiers.

3.2.1 Feature Engineering

Table 2 details the features used to predict mental decline. Note that conversational, emotional, and linguistic-conceptual features are computed. The conversational featuresFootnote 4 (1–10) represent relevant semantic and pragmatic information related to the free dialogue (e.g., fluency, repetitiveness, etc.), while the emotional features focus on the mental and physical state of the users. Finally, the linguistic features represent lexical and semantic knowledge (e.g., disfluencies, placeholder words, etc.).
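As a toy illustration of the linguistic features, the sketch below counts disfluencies and placeholder words in an utterance. The word lists and the short-response cut-off are illustrative assumptions, not the feature definitions in Table 2.

```python
# Minimal sketch of two linguistic features: disfluency and placeholder-word
# counts. Word lists and the short-response threshold are assumptions.

DISFLUENCIES = {"uh", "um", "er", "eh"}
PLACEHOLDERS = {"thing", "stuff", "something"}

def linguistic_features(utterance: str) -> dict:
    tokens = utterance.lower().split()
    return {
        "disfluencies": sum(t.strip(".,!?") in DISFLUENCIES for t in tokens),
        "placeholders": sum(t.strip(".,!?") in PLACEHOLDERS for t in tokens),
        "short_response": len(tokens) <= 3,  # crude short-answer flag
    }
```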

Furthermore, the system maintains a history of each user’s data (i.e., past and current feature values), which enables the computation of four new characteristics for each feature in Table 2: the average, q1, q2, and q3, as indicated in Eq. (1), where n is the user conversation counter and X[n] represents a particular feature with historical data.

\(\overline{X}[n] = \frac{1}{n}\sum_{i=1}^{n} X[i], \qquad q_k[n] = Q_k\big(X[1], \ldots, X[n]\big), \quad k \in \{1, 2, 3\}\)  (1)
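A minimal sketch of these historical statistics using the Python standard library; the quartile interpolation method is an assumption, since the paper does not specify one.

```python
# Average and quartiles (q1, q2, q3) over a feature's history, per Eq. (1).
# method="inclusive" is an assumed interpolation choice.
import statistics

def historical_stats(history: list[float]) -> dict:
    q1, q2, q3 = statistics.quantiles(history, n=4, method="inclusive")
    return {"avg": statistics.fmean(history), "q1": q1, "q2": q2, "q3": q3}
```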

3.2.2 Feature Analysis & Selection

Feature analysis and selection tasks are necessary to optimize the performance of the ml classifiers. These tasks are even more important in the streaming scenario where samples arrive at a real-time pace. The latter means that the classification problem layout (e.g., the most relevant features) may vary over time.

The proposed system follows two thresholding strategies for feature analysis and selection, based on cut-off points for correlation and variance values, to remove irrelevant features. The former, correlation analysis, limits the number of features in order to extract the most relevant characteristics. For the latter, variance analysis, the number of features selected is dynamically established at each iteration of the stream-based model, selecting those that meet the threshold criterion.

Algorithm 3 details the data processing stage, including feature engineering, analysis, and selection.

figure c

3.3 Stream-Based Classification

Two classification scenarios are considered:

  • Scenario 1 analyzes the behavior of the classifiers in a streaming setting. Under this consideration, sequential and continual testing and training over time is assumed.

  • Scenario 2 analyzes the models’ performance under more realistic conditions. Thus, testing is continuous (i.e., in streaming), while training is performed asynchronously in blocks of 100 samples.
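The two regimes can be sketched with a toy majority-class model, an illustrative stand-in rather than any of the classifiers actually evaluated:

```python
# Sketch of the two evaluation regimes.
# Scenario 1: prequential (test-then-train on every sample).
# Scenario 2: continuous testing, training deferred in fixed-size blocks.
from collections import Counter

class MajorityClass:
    """Toy incremental model: always predicts the most frequent label."""
    def __init__(self):
        self.counts = Counter()
    def predict_one(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else 0
    def learn_one(self, x, y):
        self.counts[y] += 1

def evaluate(stream, block_size=None):
    """block_size=None -> scenario 1; block_size=100 -> scenario 2."""
    model, correct, buffer = MajorityClass(), 0, []
    for x, y in stream:
        correct += model.predict_one(x) == y  # always test first
        if block_size is None:
            model.learn_one(x, y)             # train immediately
        else:
            buffer.append((x, y))             # train in deferred blocks
            if len(buffer) == block_size:
                for bx, by in buffer:
                    model.learn_one(bx, by)
                buffer.clear()
    return correct / len(stream)
```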

The following ml models are selected based on their good performance in similar classification problems [46,47,48]:

  • Gaussian Naive Bayes (gnb) [49] exploits the Gaussian probability distribution in a stream-based ml model. It is used as a reference for performance analysis.

  • Approximate Large Margin Algorithm (alma) [50] is a fast incremental learning algorithm, comparable to a support vector machine, that approximates the maximal-margin hyperplane with respect to a p-norm (with \(p \ge 2\)) for a set of linearly separable data.

  • Hoeffding Adaptive Tree Classifier (hatc) [51] computes single-tree branch performance and is designed for stream-based prediction.

  • Adaptive Random Forest Classifier (arfc) [52] is an ensemble extension of hatc in which branch performance is computed by majority voting across the trees.

Algorithm 4 describes the stream-based prediction process.

figure d

3.4 Explainability Dashboard

Prediction transparency is promoted through explainability data provided to the end users regarding the features most relevant to the prediction outcome. Those relevant features are included in the natural language description of the decision path. The five features with the highest absolute value or highest variance, and whose values are the most distant from the average, are selected. In the case of the counters (features 9–10), this average is computed over all users in the system.
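One plausible reading of this selection rule, ranking features by their distance from the average, is sketched below; the function name and tie-handling are assumptions:

```python
# Sketch of top-5 feature selection for the explainability description:
# rank features by how far their value lies from the reference average.
def top_deviating_features(values: dict, averages: dict, k: int = 5) -> list:
    """Return the k feature names most distant from their averages."""
    return sorted(values,
                  key=lambda f: abs(values[f] - averages.get(f, 0.0)),
                  reverse=True)[:k]
```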

4 Evaluation and Discussion

This section discusses the experimental data set used, the implementation decisions, and the results obtained. The evaluations were conducted on a computer with the following specifications:

  • Operating system Ubuntu 18.04.2 LTS 64 bits

  • Processor Intel Core i9-10900K 2.80GHz

  • RAM 96GB DDR4

  • Disk 480GB NVME + 500GB SSD

4.1 Experimental Data Set

The experimental data setFootnote 5 consists of an average of \(6.92\pm 3.08\) utterances with \(62.73\pm 57.20\) words, involving 44 users with \(13.66\pm 7.86\) conversations per user. The distribution of mental deterioration in the experimental data set is 238 samples in which mental deterioration is present and 363 in which it is absent. Figure 2 depicts the histogram distribution of words and interactions for absent and present mental deterioration, respectively. While the distributions of the number of interactions in the absence or presence of cognitive impairment follow a normal function, the number of words can be approximated by a half-normal distribution centered at 0. Most relevantly, and as expected, users with mental deterioration present a lower number of interactions and a significant decrease in the number of words used in their responses.

Fig. 2
figure 2

Distribution of interactions and number of words

4.2 Data Extraction

Data to engineer the conversational (1–8), emotional, and linguistic features in Table 2 were obtained with the gpt-3.5-turboFootnote 6 model. The prompt used is shown in Listing 1.

figure e

4.3 Stream-Based Data Processing

This section reports the algorithms used for feature engineering, analysis, and selection and their evaluation results.

4.3.1 Feature Engineering

A total of 88 features were generatedFootnote 7 in addition to the 22 features generated in each conversation (see Table 2), resulting in 110 features. Figure 3 shows the distribution of conversations per user, which approaches a uniform density function; notably, the large majority concentrates between 15 and 20 conversations.

Fig. 3
figure 3

Distribution of conversations by user

4.3.2 Feature Analysis & Selection

Correlation and variance thresholding decisions were based on experimental tests. For correlation thresholding, SelectKBestFootnote 8 was applied using the Pearson correlation coefficient [53]. The K value corresponds to the most relevant features of 80% of the experimental data. Table 3 shows the features with a correlation value greater than 0.2 with the mental deterioration target when the last sample entered the stream-based classification model.

Table 3 Correlation and variance results

Regarding variance thresholding, the implementation used was VarianceThresholdFootnote 9 from the River library.Footnote 10 The cut-off point, 0.001, is computed as the 10th percentile of the variance values of the features contained in 20% of the experimental data set, which acts as the cold start for this method. Consequently, only those features that exceed the abovementioned cut-off are selected as relevant for classification purposes. Table 3 also details the features with a variance greater than 0.5.Footnote 11
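The idea behind stream-based variance thresholding can be sketched with Welford’s incremental variance. This is a plain-Python illustration in the spirit of River’s VarianceThreshold, not the library implementation itself:

```python
# Minimal sketch of stream-based variance thresholding: variances are
# maintained incrementally with Welford's algorithm, and features whose
# variance does not exceed the cut-off are dropped at transform time.
class StreamVarianceThreshold:
    def __init__(self, threshold=0.001):
        self.threshold = threshold
        self.stats = {}  # feature -> (count, mean, m2)

    def learn_one(self, x: dict):
        for name, value in x.items():
            n, mean, m2 = self.stats.get(name, (0, 0.0, 0.0))
            n += 1
            delta = value - mean
            mean += delta / n
            m2 += delta * (value - mean)
            self.stats[name] = (n, mean, m2)

    def transform_one(self, x: dict) -> dict:
        def var(name):
            n, _, m2 = self.stats.get(name, (0, 0.0, 0.0))
            return m2 / (n - 1) if n > 1 else 0.0
        return {k: v for k, v in x.items() if var(k) > self.threshold}
```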

Table 3 shows that, among the conversational features, user initiative (feature 6 in Table 2) plays an important role. The same applies to the number of interactions within a dialogue (feature 9). Regarding emotional features, consideration should be given to fatigue (feature 12) and polarity (feature 14). Finally, the use of a colloquial/formal register (features 16/19), disfluencies (feature 18), and short responses (feature 22) stand out among the linguistic characteristics. Considering the correlation and variance analyses jointly, initiative and polarity are the most relevant data for prediction purposes.

4.4 Stream-Based Classification

The River implementations of the selected ml models are gnb,Footnote 12 alma,Footnote 13 hatc,Footnote 14 and arfc.Footnote 15 Listings 2, 3 and 4 detail the hyper-parameter optimization ranges used (excluding the baseline model), from which the following values were selected as optimal:

  • Correlation thresholding

    • ALMA: alpha=0.5, B=1.0, C=1.0.

    • HATC: depth=None, tiethreshold=0.5, maxsize=50.

    • ARFC: models=10, features=5, lambda=50.

  • Variance thresholding

    • ALMA: alpha=0.5, B=1.0, C=1.0.

    • HATC: depth=None, tiethreshold=0.5, maxsize=50.

    • ARFC: models=100, features=sqrt, lambda=50.

figure f
figure g
figure h

Table 4 presents the results for evaluation scenarios 1 and 2. In both scenarios, feature selection based on correlation thresholding returns lower classification metric values than the variance method. Thus, once variance-based feature selection is applied, arfc is the algorithm with the most promising performance regardless of the evaluation scenario.

Table 4 Classification results (Sce.: scenario, time in seconds)
Table 5 Classification results in batch for the rf model (time in seconds)
Table 6 Classification results for the arfc model using the experimental data from [54] (time in seconds)

Consideration should be given to the fact that, even in scenario 2, in which training is performed asynchronously and in blocks, the robustness of arfc stands out, with classification results exceeding 80% and a recall for the mental deterioration class of about 85%.

Given that our system operates in streaming, and to enable a direct comparison with batch ml solutions, additional evaluation measures from tenfold cross-validation are provided, particularly for Random Forest (rfFootnote 16), the batch equivalent of the best stream-based model, arfc. The results are displayed in Table 5, most surpassing the 90% threshold. Note that the performance increase compared to streaming operation (e.g., +8.37 percentage points in accuracy) derives from the fact that in batch classification the model has access to 90% of the experimental data for training. In contrast, stream-based classification relies on the ordered incoming samples, which is more demanding. Consequently, achieving comparable performance in batch and stream-based classification is noteworthy.
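The tenfold protocol can be sketched as follows (index-splitting only; the rf training itself is omitted). Each fold serves once as the roughly 10% test split while the model trains on the remaining 90%:

```python
# Sketch of tenfold cross-validation index splitting.
def kfold_indices(n_samples: int, k: int = 10):
    """Yield (train, test) index lists for k folds covering all samples."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size
```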

Fig. 4
figure 4

Explainability dashboard

To verify the system’s operation in a more challenging scenario, we experimented with a data set from a previous study [54] with fewer interactions per session. Even when the system is fed less information, the evaluation metrics are promising, as shown in Table 6, with all values above 70% and the precision and recall of the mental deterioration category above 80%. Comparing the batch rf model of our past research [54] with the proposed arfc algorithm, which operates in streaming, the improvement reaches 10 and 4 percentage points in the recall of the mental deterioration and absence of mental deterioration categories, respectively.

4.5 Explainability Dashboard

Figure 4 shows the explainability dashboard. In this example, the variation in predicting cognitive impairment is visualized considering two weeks of past data. This variation is represented with the predict_proba function of the arfc algorithm. At the bottom, the most relevant features are displayed: each figure card contains the feature identifier and statistic, colored following this scheme: values in 1–0.5 in green, 0.5–0.25 in yellow, and 0.25–0 in red, with the assignment inverted for negative values. A brief description in natural language is also provided. The average accumulated predict_proba value and the prediction confidence for the current sample are displayed on the right.
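One plausible reading of the color scheme, including the inverted assignment for negative values, is sketched below; the open/closed interval boundaries are assumptions:

```python
# Sketch of the dashboard color coding. Positive values: 1-0.5 green,
# 0.5-0.25 yellow, 0.25-0 red; the assignment is inverted for negatives
# (read here as: strongly negative -> red). Boundaries are assumptions.
def card_color(value: float) -> str:
    if value >= 0:
        if value > 0.5:
            return "green"
        return "yellow" if value > 0.25 else "red"
    if value < -0.5:
        return "red"
    return "yellow" if value < -0.25 else "green"
```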

5 Conclusions

Cognitively impaired users find it difficult to perform daily tasks, with a consequent detrimental impact on their quality of life. Thus, progression detection and early intervention are essential to address mental deterioration effectively and in a timely manner, delaying its progress. In this work, we focused on impairment in language production (i.e., lexical, semantic, and pragmatic aspects) to engineer linguistic-conceptual features for spontaneous speech analysis (e.g., semantic comprehension problems, memory loss episodes, etc.). Compared to traditional diagnostic approaches, the proposed solution has semantic knowledge management and explainability capabilities thanks to the integration of an llm into a conversational assistant.

Consideration should be given to the limitations of using llms, which cut across the healthcare field beyond mental deterioration detection. Potential biases and the lack of inherent transparency stand out among the risks of applying these models for medical purposes. The latter black-box problem, also present in traditional opaque ml models, is particularly critical in the healthcare field: it negatively impacts the decision process of physicians, due to their limited corrective capabilities, and even the end users, by limiting their trust in medical applications. Moreover, these systems’ currently limited memory management capability is worth mentioning, as it prevents longitudinal clinical analysis. The same applies to the associated complexity of managing context information. Finally, the difficulty of collecting data, due to the sensitivity and confidentiality of information in the medical field, should also be mentioned.

More in detail, the solution provides interpretable ml prediction of cognitive decline in real time. Prompt engineering and explainability are exploited to mitigate the “hallucination” effect of llms and to counter potential biases by providing natural language and visual descriptions of the diagnostic decisions. Note that our system implements ml models in streaming to provide real-time functioning, hence avoiding the re-training cost of batch systems.

In sum, we contribute an affordable, flexible, non-invasive, personalized diagnostic system that enables the monitoring of high-risk populations and offers companionship. Ultimately, our solution democratizes access to the latest advances in nlp for researchers and end users within the public health field.

Among the challenges and potential ethical concerns raised by the application of ai to the healthcare field, the principle of double effect must be considered: few can deny its promising potential to provide innovative treatments, while at the same time it presents safety-critical concerns, notably regarding interpretability. Apart from the algorithmic transparency already mentioned, the main considerations are the privacy and safety of medical data, fairness, and autonomous decision-making without human intervention.

In future work, we plan to test the performance of new approaches, such as reinforcement learning, to further enhance the system’s personalization capabilities. Moreover, we will explore co-design practices with end users, and we seek to move our solution into clinical practice within an ongoing project with daycare facilities. Note that reinforcement learning with human feedback will also allow us to mitigate some of the limitations discussed, such as physicians’ lack of interpretability and corrective capabilities; this will also have a positive ethical impact on the deployment of llm-based medical applications by promoting fairness. The societal impact derived from reduced costs compared to traditional approaches may result in broader accessibility to clinical diagnosis and treatment on demand, and equity will be fostered by the capability of these systems to provide unlimited personalized support. In future research, we will work on mitigating health inequities by performing longitudinal studies to measure bias in our ai solution, particularly related to algorithm design, bias in the training data, and the ground truth. Underperformance in certain social groups will also be considered; for that purpose, we will gather social context data (e.g., gender, race, socioeconomic status, etc.) that will allow us to measure equity. Finally, to ensure patient data protection while increasing the data available for research, federated learning approaches will be explored.