Abstract
Based on official estimates, 50 million people worldwide are affected by dementia, and this number increases by 10 million new patients every year. Without a cure, clinical prognostication and early intervention represent the most effective ways to delay its progression. To this end, artificial intelligence and computational linguistics can be exploited for natural language analysis, personalized assessment, monitoring, and treatment. However, traditional approaches lack semantic knowledge management and explainability capabilities. Moreover, research on large language models (llms) for cognitive decline diagnosis remains scarce, even though these models represent the most advanced means of clinician–patient communication using intelligent systems. Consequently, we leverage an llm using the latest natural language processing (nlp) techniques in a chatbot solution to provide interpretable machine learning prediction of cognitive decline in real-time. Linguistic-conceptual features are exploited for appropriate natural language analysis. Through explainability, we aim to fight potential biases of the models and improve their potential to help clinical workers in their diagnosis decisions. More in detail, the proposed pipeline is composed of (i) data extraction employing nlp-based prompt engineering; (ii) stream-based data processing including feature engineering, analysis, and selection; (iii) real-time classification; and (iv) an explainability dashboard that provides visual and natural language descriptions of the prediction outcome. Classification results exceed 80% in all evaluation metrics, with a recall of about 85% for the mental deterioration class. To sum up, with this work we contribute an affordable, flexible, non-invasive, personalized diagnostic system.
1 Introduction
Neurodegenerative Alzheimer’s disorder (ad) is the leading cause of chronic or progressive dementia, which negatively impacts cognitive functioning, including comprehension, speech, and thinking problems, memory loss, etc. [1]. More in detail, the typical stages of cognitive decline can be categorized as pre-clinical ad, mild cognitive impairment (mci) caused by ad, and finally ad dementia [2]. Generally, cognitively impaired users find it difficult to perform daily tasks, with a consequent detrimental impact on their quality of life [3]. Accordingly, cognitive decline is a leading cause of dependency and disability among our elders [4].
According to the Alzheimer’s Association report on the impact of this disease in the USA [5], it is the sixth-leading cause of death, having increased by more than 145% in recent years. Moreover, it affects 6.7 million people aged 65 or older. Dreadfully, this number is predicted to grow to 13.8 million by 2060. The medical expenses of people aged 65 or older affected by dementia are three times greater than those of people without this condition, reaching 345 billion dollars so far in 2023. Overall, the World Health Organization estimates that 50 million people worldwide are affected by dementia, with 10 million new patients yearly.Footnote 1
Clinical prognostication and early intervention, the most promising ways to address mental deterioration, rely on effective progression detection [2]. Among the benefits of early identification, care planning assistance, medical expense reduction, and the opportunity to receive the latest treatments, including non-invasive therapy, given the rapid advancements in biologic therapeutics, stand out [6, 7]. Social stigma and socioeconomic status must also be considered when accessing mental health services [8]. However, early diagnosis is challenging since the symptoms can be confused with normal aging decline [9]. To address this, computational linguistics can be exploited [10]. Natural language analysis is particularly relevant, as it constitutes a significant proportion of healthcare data [11]. Particularly, impairment in language production mainly affects lexical (e.g., little use of nouns and verbs), semantic (e.g., the use of empty words like thing/stuff), and pragmatic (e.g., discourse disorganization) aspects [12].
Digital and technological advances such as artificial intelligence (ai)-based systems represent promising approaches toward individuals’ needs for personalized assessment, monitoring, and treatment [13]. Accordingly, these systems have the capabilities to complement traditional methodologies such as the Alzheimer’s Disease Assessment Scale-Cognition (adascog), the Mini-Mental State Examination (mmse), and the Montreal Cognitive Assessment (moca), which generally involve expensive, invasive equipment and lengthy evaluations [14]. In fact, paper-and-pencil cognitive tests remain the most common approach, even though the latest advances in the natural language processing (nlp) field enable easy screening from speech data while avoiding patient/physician burden [15]. Summing up, language analysis can translate into an effective, inexpensive, non-invasive, and simpler way of monitoring cognitive decline [14, 16], given that the spontaneous speech of cognitively impaired people is characterized by the aforementioned semantic comprehension problems and memory loss episodes [17].
Consequently, clinical decision support systems (cdsss), diagnostic decision support systems (ddsss), and intelligent diagnosis systems (idss), which apply ai techniques (e.g., machine learning ml, nlp, etc.) to analyze patient medical data (i.e., clinical records, imaging data, lab results, etc.) and discover relevant patterns effectively and efficiently, have significantly attracted the attention of the medical and research community [18]. However, one of the main disadvantages of traditional approaches is their lack of semantic knowledge management and explainability capabilities [17]. The latter is especially problematic in the medical domain regarding the accountability of the decision process when physicians recommend personalized treatments [14].
Integrating ai-based systems into conversational assistants to provide economical, flexible, immediate, and personalized health support is particularly relevant [19]. Their use has been greatly enhanced by the now-popular large language models (llms), which enable dynamic dialogues compared to previous developments [20]. In turn, llms have been powered by the latest advancements in deep learning techniques and the availability of vast amounts of cross-disciplinary data [21]. These models represent the most innovative application of ai in healthcare by expediting medical interventions and providing new markers and therapeutic approaches to neurological diagnosis from patient narrative processing [22]. Note that patient experience can also be improved with the help of llms in terms of information and support seeking [23]. Summing up, conversational assistants that leverage llms have the potential to monitor high-risk populations, provide personalized advice, and offer companionship [19, 24], and are regarded in the literature as the future of therapy [25].
Given the still poor accuracy of cdsss [26, 27], we leverage an llm using the latest nlp techniques in a chatbot solution to provide interpretable ml prediction of cognitive decline in real-time. Linguistic-conceptual features are exploited for appropriate natural language analysis. The main limitation of llms is that their outcomes may be misleading; thus, we apply prompt engineering to avoid the “hallucination” effect. Through explainability, we aim to fight potential biases of the models and improve their potential to help clinical workers in their diagnosis decisions. Summing up, with this work we contribute an affordable, non-invasive diagnostic system.
The rest of this paper is organized as follows. Section 2 reviews the relevant competing works on cognitive decline detection involving llms and interpretable ml predictions of mental deterioration. The contribution of this work is summarized in Sect. 2.1. Section 3 explains the proposed solution, while Sect. 4 describes the experimental data set, our implementations, and the results obtained. Finally, Sect. 5 concludes the paper and proposes future research.
-
Problem The World Health Organization predicts a yearly increase of 10 million people affected with dementia.
-
What is already known Paper-and-pencil cognitive tests continue to be the most common approach. The latter is impractical given the disease growth rate. Moreover, one of the main disadvantages of intelligent approaches is their lack of semantic knowledge management and explainability capabilities.
-
What this paper adds We leverage an llm using the latest nlp techniques in a chatbot solution to provide interpretable ml prediction of cognitive decline in real-time. To sum up, with this work we contribute an affordable, flexible, non-invasive, personalized diagnostic system.
2 Related Work
As previously mentioned, the main focus of dementia treatment is to delay the cognitive deterioration of patients [17]. Consequently, early diagnosis, which simultaneously contributes to reducing medical expenses in our aging society and avoiding invasive treatments with subsequent side effects on the users, is desirable [6]. To this end, ai has been successfully applied to idss in order to recommend treatments based on their diagnosis prediction [28, 29].
While ml models perform well and fast in diagnosis tasks, they require extensive training data previously analyzed by experts, which is labor-intensive and time-consuming [17]. In contrast, advanced nlp-based solutions exploit transformer-based models already trained with large corpora, including domain-related data, which results in very sensitive text analysis capabilities [30]. Consequently, transformer-based pre-trained language models (plms) (e.g., bert [31], gpt-3 [32]) which preceded the popular llms (e.g., gpt-4Footnote 2) have disruptively transformed the nlp research. These models exhibit great contextual latent feature extraction abilities from textual input [30]. The latter models are implemented to predict the next token based on massive training data, resulting in a word-by-word outcome [33]. Nowadays, they are used for various tasks, including problem-solving, question-answering, sentiment analysis, text classification, and generation [34].
There exist plm versions over biomedical and clinical data such as Biobert [35], Biogpt [36], Bluebert [37], ClinicalbertFootnote 3, and tcm-bert [38]. Open-domain conversational assistants, whose dialogue capabilities are not restricted to the conversation topic, exploit llms [19]. However, research on llms for cognitive decline diagnosis is still scarce, even though these models represent the most advanced means of clinician-patient communication using intelligent systems [39]. More in detail, they overcome the limitation of traditional approaches that lack semantic reasoning, which is especially relevant in clinical language [40]. Unfortunately, despite the significant advancement they represent, llms still exhibit certain limitations in open-domain task-oriented dialogues (e.g., medical use cases) [41]. For the latter, techniques such as reinforcement learning from human feedback (rlhf) and prompt engineering are applied to enhance their performance based on end users’ instructions and preferences [42].
Regarding the application of plms to the medical field, Syed et al. [3] performed two tasks: (i) dementia prediction and (ii) mmse score estimation from speech recordings, combining acoustic features and text embeddings obtained with the bert model from their transcription. The input data correspond to cognitive tests (cts). Yuan et al. [12] analyzed disfluencies (i.e., uh/um word frequency and speech pauses) with bert and ernie models based on data from the Cookie Theft picture from the Boston Diagnostic Aphasia Exam. Close to the work by Syed et al. [3], Chen et al. [15] analyzed the performance of the bert model for extracting embeddings in cognitive impairment detection from speech gathered during cts. Santander-Cruz et al. [17] combined Siamese bert networks (sberts) with ml classifiers to first extract sentence embeddings and then predict Alzheimer’s disease from ct data. In contrast, Vats et al. [1] performed dementia detection combining ml, the bert model, and acoustic features to achieve improved performance. Moreover, Li et al. [16] compared gpt-2 with an artificially degraded version (gpt-d), created by inducing a layer of dementia-related linguistic anomalies, based on data from a picture description task, while Agbavor and Liang [14] predicted dementia and cognitive scores from ct data using gpt-3, exploiting both word embeddings and acoustic knowledge. Finally, Mao et al. [2] pre-trained the bert model with unstructured clinical notes from Electronic Health Records (ehrs) to detect mci-to-ad progression.
More closely related to our research is the work by Bertacchini et al. [13]. The authors designed Pepper, a social robot with real-time conversational capabilities exploiting the Chatgpt gpt-3.5 model. However, the use case of the system is autism spectrum disorder detection. Furthermore, Caruccio et al. [18] compared the diagnostic performance of different Chatgpt models (i.e., ada, babbage, curie, davinci, and gpt-3.5) with Google Bard and traditional ml approaches based on symptomatic data. The authors exploited prompt engineering to ensure appropriate performance when submitting clinical-related questions to the llm. Moreover, Hirosawa et al. [39] analyzed the diagnostic ability of the Chatgpt gpt-3.5 model using clinical vignettes; the llm’s output was then compared with physicians’ diagnoses. However, the authors again focus not on cognitive decline prediction but on ten common chief complaints. Consideration should be given to the work by Koga et al. [30], who used Chatgpt (i.e., the gpt-3.5 and gpt-4 models) and Google Bard to predict several neurodegenerative disorders based on clinical summaries in clinicopathological conferences, without being a specific solution tailored to ad prediction. Finally, regarding conversational assistants that integrate llms, Zaman et al. [43] developed a chatbot based on the Chatgpt gpt-3.5 model to provide emotional support to caregivers (i.e., practical tips and shared experiences).
2.1 Contributions
As previously described, a vast amount of work in the state of the art exploits plms, even in the clinical field [44]. However, scant research has been performed with llms. Table 1 summarizes the reviewed diagnostic solutions that exploit llms in the literature. Note that explainability represents a differential characteristic of the proposed solution, given the relevance of promoting transparency in ai-based systems [45].
Given the comparison with competing works:
-
Our system is the first that jointly considers the application of an llm over spontaneous speech and provides interpretable ml results for the use case of mental decline prediction.
-
Our solution implements ml models in streaming to provide real-time functioning, hence avoiding the re-training cost of batch systems.
-
In this work, we leverage the potential of llms by applying rlhf and prompt engineering techniques in a chatbot solution. Note that the natural language analysis is performed with linguistic-conceptual features. Consequently, we contribute an affordable, non-invasive diagnostic system.
-
Our system democratizes access to the latest advances in nlp for researchers and end users within the public health field.
3 Methodology
Figure 1 depicts the system scheme proposed for real-time prediction of mental decline combining llms and ml algorithms with explainability capabilities. More in detail, it is composed of (i) data extraction employing nlp-based prompt engineering (Sect. 3.1); (ii) stream-based data processing including feature engineering, analysis and selection (Sect. 3.2); (iii) real-time classification (Sect. 3.3); and (iv) the explainability dashboard to provide visual and natural language descriptions of the prediction outcome (Sect. 3.4). Algorithm 1 describes the complete process.
3.1 Data Extraction
The Chatgpt gpt-3.5 model used serves two purposes: (i) it enables a natural, free dialogue with the end users, and (ii) data are extracted thanks to its semantic knowledge management capabilities. The latter information is gathered once the conversation is concluded (either after more than 3 min of inactivity or once a farewell is detected) and used to compute the features used for classification (see Sect. 3.2.1). For this extraction, prompt engineering is exploited. The complete data extraction process is described in Algorithm 2.
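As an illustration, the extraction step can be sketched as a prompt template plus a JSON parse of the model's reply; the prompt wording, feature names, and canned reply below are hypothetical stand-ins (the actual prompt appears in Listing 1), and the live chat-API call is omitted.

```python
import json

# Hypothetical extraction sketch: prompt template and reply parsing only;
# feature names are illustrative, not the paper's exact set.
EXTRACTION_PROMPT = (
    "Given the conversation transcript below, return a JSON object with "
    "numeric scores for: fluency, repetitiveness, fatigue, polarity.\n\n"
    "Transcript:\n{transcript}"
)

def build_prompt(transcript: str) -> str:
    """Fill the extraction prompt with the finished conversation."""
    return EXTRACTION_PROMPT.format(transcript=transcript)

def parse_llm_reply(reply: str) -> dict:
    """Parse the JSON feature object out of the reply, tolerating
    surrounding prose before and after the braces."""
    start, end = reply.find("{"), reply.rfind("}") + 1
    return json.loads(reply[start:end])

# Canned reply instead of a live API call:
features = parse_llm_reply('Scores: {"fluency": 0.4, "polarity": -0.2}')
```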
3.2 Stream-Based Data Processing
Stream-based data processing encompasses feature engineering, analysis, and selection tasks to ensure the optimal performance of the ml classifiers.
3.2.1 Feature Engineering
Table 2 details the features used to predict mental decline. Note that conversational, emotional, and linguistic-conceptual features are computed. The conversational featuresFootnote 4 (1–10) represent relevant semantic and pragmatic information related to the free dialogue (e.g., fluency, repetitiveness, etc.), while emotional features focus on the mental and physical state of the users. Finally, linguistic features represent lexical and semantic knowledge (e.g., disfluencies, placeholder words, etc.).
Furthermore, the system maintains a history of each user’s data (i.e., past and current feature values) that enables the computation of four new characteristics for each feature in Table 2: average, q1, q2, and q3, as indicated in Eq. (1), where n is the user conversation counter and X[n] represents a particular feature with historical data.
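A minimal sketch of these derived characteristics, assuming Eq. (1) amounts to computing the average and quartiles over a feature's historical values (the exact quantile method is our assumption):

```python
import statistics

def history_features(values):
    """Derive the four extra characteristics per feature (average and
    quartiles q1-q3) from a user's history of values for that feature."""
    avg = statistics.fmean(values)
    q1, q2, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {"avg": avg, "q1": q1, "q2": q2, "q3": q3}

# Toy history of one feature across four conversations:
stats = history_features([3, 5, 7, 9])
```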
3.2.2 Feature Analysis & Selection
Feature analysis and selection tasks are necessary to optimize the performance of the ml classifiers. These tasks are even more important in the streaming scenario where samples arrive at a real-time pace. The latter means that the classification problem layout (e.g., the most relevant features) may vary over time.
The proposed system follows two thresholding strategies for feature analysis and selection, based on cut-off points for correlation and variance values, to remove irrelevant features. The former, correlation analysis, limits the number of features to the most relevant characteristics. For the latter, variance analysis, the number of features selected is dynamically established in each iteration of the stream-based model, selecting those that meet the threshold criterion.
Algorithm 3 details the data processing stage, including feature engineering, analysis, and selection.
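The two thresholding strategies can be sketched as follows; the helper names and toy data are illustrative, not the paper's implementation (which relies on the River library):

```python
import math
import statistics

def pearson(x, y):
    """Plain Pearson correlation; 0 is returned for constant inputs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def select_features(X, y, corr_k=None, var_cutoff=None):
    """Keep the top-k features by |correlation| with the target, or
    those whose variance exceeds the cut-off. X maps feature name to
    its list of observed values."""
    if corr_k is not None:
        ranked = sorted(X, key=lambda f: abs(pearson(X[f], y)),
                        reverse=True)
        return ranked[:corr_k]
    return [f for f, v in X.items()
            if statistics.pvariance(v) > var_cutoff]

# Feature "b" is constant, hence removed by the variance criterion:
kept = select_features({"a": [1, 2, 3, 4], "b": [1, 1, 1, 1]},
                       [1, 2, 3, 4], var_cutoff=0.5)
```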
3.3 Stream-Based Classification
Two classification scenarios are considered:
-
Scenario 1 analyzes the behavior of the classifiers in a streaming setting. Under this consideration, sequential and continual testing and training over time is assumed.
-
Scenario 2 analyzes the models’ performance under more realistic conditions. Thus, testing is continuous (i.e., in streaming), while training is performed asynchronously in blocks of 100 samples.
The following ml models are selected based on their good performance in similar classification problems [46,47,48]:
-
Gaussian Naive Bayes (gnb) [49] exploits the Gaussian probability distribution in a stream-based ml model. It is used as a reference for performance analysis.
-
Approximate Large Margin Algorithm (alma) [50] is a fast incremental learning algorithm, comparable to a support vector machine, that approximates the maximal-margin hyperplane with respect to a p-norm (with \(p \ge 2\)) for a set of linearly separable data.
-
Hoeffding Adaptive Tree Classifier (hatc) [51] computes single-tree branch performance and is designed for stream-based prediction.
-
Adaptive Random Forest Classifier (arfc) [52] constitutes an advanced, ensemble version of hatc in which branch performance is computed by majority voting across an ensemble of trees.
Algorithm 4 describes the stream-based prediction process.
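The test-then-train (prequential) protocol of scenario 1 can be sketched as follows, with a trivial majority-class learner standing in for the River classifiers listed above; only the protocol itself, not the model, is the point of the sketch:

```python
from collections import Counter

class MajorityClassLearner:
    """Trivial online learner mimicking River's predict_one/learn_one
    interface; a stand-in, not one of the paper's models."""
    def __init__(self):
        self.counts = Counter()

    def predict_one(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else 0

    def learn_one(self, x, y):
        self.counts[y] += 1

def prequential_accuracy(stream, model):
    """Scenario 1: each incoming sample is first used for testing and
    only then for training, so every prediction is on unseen data."""
    hits, total = 0, 0
    for x, y in stream:
        hits += int(model.predict_one(x) == y)
        model.learn_one(x, y)
        total += 1
    return hits / total

# Ten identical samples: only the very first prediction can miss.
acc = prequential_accuracy([({}, 1)] * 10, MajorityClassLearner())
```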
3.4 Explainability Dashboard
Prediction transparency is promoted through explainability data provided to the end users regarding the features most relevant to the prediction outcome. Those relevant features are included in the natural language description of the decision path. The five features with the highest absolute value or variance, i.e., whose values are the most distant from the average, are selected. In the case of the counters (features 9–10), this average is computed over all users in the system.
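A possible reading of this selection rule, sketched with hypothetical feature names; the combined module/variance criterion is simplified here to distance from the average:

```python
def top_explanatory_features(values, averages, k=5):
    """Select the k features whose current values are most distant from
    their averages. Both dictionaries map feature name to a number;
    names and the single-criterion ranking are illustrative."""
    ranked = sorted(values, key=lambda f: abs(values[f] - averages[f]),
                    reverse=True)
    return ranked[:k]

# Toy example: f2 deviates by 5.0, f3 by 0.5, f1 not at all.
top = top_explanatory_features({"f1": 1.0, "f2": 5.0, "f3": 2.0},
                               {"f1": 1.0, "f2": 0.0, "f3": 1.5}, k=2)
```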
4 Evaluation and Discussion
This section discusses the experimental data set used, the implementation decisions, and the results obtained. The evaluations were conducted on a computer with the following specifications:
-
Operating system Ubuntu 18.04.2 LTS 64 bits
-
Processor Intel Core i9-10900K 2.80GHz
-
RAM 96GB DDR4
-
Disk 480GB NVME + 500GB SSD
4.1 Experimental Data Set
The experimental data setFootnote 5 consists of an average of \(6.92\pm 3.08\) utterances with \(62.73\pm 57.20\) words, involving 44 users with \(13.66\pm 7.86\) conversations per user. The distribution of mental deterioration in the experimental data set is 238 samples in which mental deterioration is present and 363 in which it is absent. Figure 2 depicts the histogram distribution of words and interactions in the absence and presence of mental deterioration, respectively. While the distributions of the number of interactions in the absence or presence of cognitive impairment follow a normal function, the number of words can be approximated by a half-normal distribution centered at 0. Most relevantly, and as expected, users with mental deterioration present a lower number of interactions and a significant decrease in the number of words used in their responses.
4.2 Data Extraction
Data to engineer conversational (1–8), emotional, and linguistic features in Table 2 were obtained with gpt-3.5-turboFootnote 6 model. The prompt used is shown in Listing 1.
4.3 Stream-Based Data Processing
This section reports the algorithms used for feature engineering, analysis, and selection and their evaluation results.
4.3.1 Feature Engineering
A total of 88 features were generatedFootnote 7 in addition to the 22 features computed in each conversation (see Table 2), resulting in 110 features. Figure 3 shows the distribution of conversations per user, which approaches a uniform density function, with the large majority concentrated between 15 and 20 conversations.
4.3.2 Feature Analysis & Selection
Correlation and variance thresholding decisions were based on experimental tests. For the correlation thresholding, SelectKBestFootnote 8 was applied using the Pearson correlation coefficient [53]. The K value corresponds to the most relevant features within 80% of the experimental data. Table 3 shows the features with a correlation value greater than 0.2 with the mental deterioration target when the last sample entered the stream-based classification model.
Regarding the variance thresholding, the implementation used was VarianceThresholdFootnote 9 from the River library.Footnote 10 Moreover, the cut-off point, 0.001, is computed with the 10th percentile variance value of the features contained in the 20% of the experimental data set, which acts as the cold start of this method. Consequently, only those features that exceed the abovementioned cut-off are selected as relevant for classification purposes. Table 3 also details the features with a variance greater than 0.5.Footnote 11
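The percentile-based cut-off described above can be sketched as follows; the toy features and helper name are illustrative, not the River-based implementation:

```python
import statistics

def variance_cutoff(cold_start_features, percentile=10):
    """Derive the cut-off as the given percentile of the feature
    variances observed in the cold-start portion of the data.
    cold_start_features maps feature name to its observed values."""
    variances = sorted(statistics.pvariance(v)
                       for v in cold_start_features.values())
    # quantiles with n=100 yields the 99 percentile cut points
    return statistics.quantiles(variances, n=100)[percentile - 1]

# Toy features f1..f11 with variances 1, 4, 9, ..., 121:
cutoff = variance_cutoff({f"f{i}": [-i, i] for i in range(1, 12)})
```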
Table 3 shows that among the conversational features, user initiative (feature 6 in Table 2) plays an important role. The same applies to the number of interactions within a dialogue (feature 9). Regarding emotional features, consideration should be given to fatigue (feature 12) and polarity (feature 14). Finally, the use of a colloquial/formal register (features 16/19), disfluency (feature 18), and short responses (feature 22) stand out among the linguistic characteristics. Considering correlation and variance analysis jointly, initiative and polarity are the most relevant data for prediction purposes.
4.4 Stream-Based Classification
The River implementations of the ml models selected are: gnb,Footnote 12alma,Footnote 13hatcFootnote 14 and arfc.Footnote 15 Listings 2, 3 and 4 detail the hyper-parameter optimization ranges used, excluding the baseline model, from which the following values were selected as optimal:
-
Correlation thresholding
-
ALMA: alpha=0.5, B=1.0, C=1.0.
-
HATC: max_depth=None, tie_threshold=0.5, max_size=50.
-
ARFC: n_models=10, max_features=5, lambda_value=50.
-
-
Variance thresholding
-
ALMA: alpha=0.5, B=1.0, C=1.0.
-
HATC: max_depth=None, tie_threshold=0.5, max_size=50.
-
ARFC: n_models=100, max_features=sqrt, lambda_value=50.
-
Table 4 presents the results for evaluation scenarios 1 and 2. In both scenarios, the feature selection methodology based on correlation thresholding returns lower classification metric values than the variance method. Thus, once the variance feature selection method is applied, arfc is the best-performing algorithm regardless of the evaluation scenario.
Consideration should be given to the fact that even in scenario 2, in which training is performed asynchronously and in batches, the robustness of arfc stands out, with classification results exceeding 80% and a recall of about 85% for the mental deterioration class.
Given that our system operates in streaming, and to enable direct comparison with batch ml solutions, additional evaluation measures from tenfold cross-validation are provided, particularly for Random Forest (rfFootnote 16), the batch equivalent of the best stream-based model, arfc. The results are displayed in Table 5, most of them surpassing the 90% threshold. Note that the increase in performance compared to streaming operation (e.g., +8.37 percentage points in accuracy) derives from the fact that in batch classification, the model has access to 90% of the experimental data for training. In contrast, stream-based classification relies on the ordered arrival of new samples, which is more demanding. Consequently, achieving comparable performance in batch and stream-based classification is noteworthy.
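A comparable batch baseline can be reproduced along these lines, with synthetic data standing in for the experimental data set (which is available on request) and illustrative hyper-parameters:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: ~600 conversations, 22 engineered features each.
X, y = make_classification(n_samples=600, n_features=22, random_state=0)

# Tenfold cross-validation of the batch Random Forest baseline.
scores = cross_val_score(RandomForestClassifier(random_state=0),
                         X, y, cv=10, scoring="accuracy")
mean_acc = scores.mean()
```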
To verify the system’s operation in a more challenging scenario, we experimented with a data set from a previous study [54] with fewer interactions per session. Even when the system is fed with less information, the evaluation metrics are promising, as shown in Table 6, with all values above 70% and the precision and recall of the mental deterioration category above 80%. Comparing the rf batch model of our past research [54] with the proposed arfc algorithm, which operates in streaming, the improvement reaches 10 and 4 percentage points in the recall metric for the mental deterioration and absence of mental deterioration categories, respectively.
4.5 Explainability Dashboard
Figure 4 shows the explainability dashboard. In this example, the variation in predicting cognitive impairment is visualized, considering two weeks of past data. This variation is represented with the predict_proba function of the arfc algorithm. At the bottom, the most relevant features are displayed. Each feature card contains the identifier and statistic, colored following this scheme: 1–0.5 in green, 0.5–0.25 in yellow, and 0.25–0 in red; this assignment is inverted for negative values. Underneath, a brief description in natural language is provided. The average accumulated predict_proba value and the prediction confidence for the current sample are displayed on the right.
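The color scheme can be sketched as a simple banding function; how exactly the inversion is applied to negative values is our assumption:

```python
def card_color(value: float) -> str:
    """Map a feature-card statistic to the dashboard scheme:
    1-0.5 green, 0.5-0.25 yellow, 0.25-0 red, with the assignment
    inverted for negative values (inversion details assumed)."""
    v = abs(value)
    band = 0 if v >= 0.5 else (1 if v >= 0.25 else 2)
    colors = ["green", "yellow", "red"]
    if value < 0:
        colors.reverse()  # inverted scheme for negative statistics
    return colors[band]
```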
5 Conclusions
Cognitively impaired users find it difficult to perform daily tasks, with a consequent detrimental impact on their quality of life. Thus, progression detection and early intervention are essential to address mental deterioration effectively and in time to delay its progress. In this work, we focused on impairment in language production (i.e., lexical, semantic, and pragmatic aspects) to engineer linguistic-conceptual features toward spontaneous speech analysis (e.g., semantic comprehension problems, memory loss episodes, etc.). Compared to traditional diagnostic approaches, the proposed solution has semantic knowledge management and explainability capabilities thanks to the integration of an llm in a conversational assistant.
Consideration should be given to the limitations of using llms, which cut across the healthcare field beyond mental deterioration detection. The potential biases and lack of inherent transparency stand out among the risks of applying these models for medical purposes. The latter black-box problem, also present in traditional opaque ml models, is particularly critical in the healthcare field: it negatively impacts the decision process of physicians, whose corrective capabilities are limited, and even the end users, limiting their trust in medical applications. Moreover, the currently limited memory management capability of these systems is worth mentioning, as it prevents longitudinal clinical analysis. The same applies to the associated complexity of context information management. Ultimately, the difficulty in collecting data due to the sensitivity and confidentiality of information in the medical field should also be mentioned.
More in detail, the solution provides interpretable ml prediction of cognitive decline in real-time. rlhf and prompt engineering, together with explainability, are exploited to avoid the “hallucination” effect of llms and counter potential biases by providing natural language and visual descriptions of the diagnosis decisions. Note that our system implements ml models in streaming to provide real-time functioning, hence avoiding the re-training cost of batch systems.
Summing up, we contribute with an affordable, flexible, non-invasive, personalized diagnostic system that enables the monitoring of high-risk populations and offers companionship. Ultimately, our solution democratizes access to researchers and end users within the public health field to the latest advances in nlp.
Among the challenges and potential ethical concerns raised by the application of ai in the healthcare field, the double-effect principle must be considered. In this sense, few can deny its promising potential to provide innovative treatments while at the same time presenting safety-critical concerns, notably regarding interpretability. Apart from the algorithmic transparency mentioned, the main considerations are the privacy and safety of medical data, fairness, and autonomous decision-making without human intervention. In future work, we plan to test the performance of new approaches, such as reinforcement learning, to further enhance the system’s personalization capabilities. Moreover, we will explore co-design practices with end users, and we seek to move our solution to clinical practice within an ongoing project with daycare facilities. Note that reinforcement learning with human feedback will also allow us to mitigate some of the limitations discussed, such as physicians’ lack of interpretability and corrective capabilities. The latter will also have a positive ethical impact on the deployment of llm-based medical applications by ensuring fairness. The societal impact derived from reduced costs compared to traditional approaches may result in broader accessibility to clinical diagnosis and treatment on demand. Equity will be furthered by the capability of these systems to provide unlimited personalized support. In future research, we will work on mitigating health inequities by performing longitudinal studies to measure bias in our ai solution, particularly related to the algorithm design, bias in the training data, and the ground truth. Underperformance in certain social groups will also be considered. For that purpose, we will gather social context data, which will allow us to measure equity (e.g., gender, race, socioeconomic status, etc.).
To ensure patient data protection while at the same time increasing data available for research, federated learning approaches will be explored.
Notes
Available at https://www.who.int/news-room/fact-sheets/detail/dementia, May 2024.
Available at https://platform.openai.com/docs/models/gpt-4, May 2024.
Available at https://github.com/EmilyAlsentzer/clinicalBERT, May 2024.
Features 9–10 are not computed using the llm.
Data are available on request from the authors.
Available at https://platform.openai.com/docs/models/gpt-3-5, May 2024.
Four new characteristics (average, q1, q2, and q3) for each of the 22 features in Table 2.
Available at https://riverml.xyz/0.11.1/api/feature-selection/SelectKBest, May 2024.
Available at https://riverml.xyz/0.11.1/api/feature-selection/VarianceThreshold, May 2024.
Available at https://riverml.xyz/0.11.1, May 2024.
Note that features 9 and 10 in Table 2 have been discarded from this example since they represent counters and their variance is always greater than 1.
Available at https://riverml.xyz/dev/api/naive-bayes/GaussianNB, May 2024.
Available at https://riverml.xyz/0.11.1/api/linear-model/ALMAClassifier, May 2024.
Available at https://riverml.xyz/0.11.1/api/tree/HoeffdingAdaptiveTreeClassifier, May 2024.
Available at https://riverml.xyz/0.11.1/api/ensemble/AdaptiveRandomForestClassifier, May 2024.
Available at https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html, May 2024.
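The variance-based feature filtering mentioned in the notes above (river’s VarianceThreshold) can be illustrated with a short, dependency-free sketch. It maintains a running variance per feature with Welford’s online algorithm and drops near-constant features from each incoming sample; the feature names and the threshold value below are illustrative only, not the exact configuration used in this work.

```python
class StreamingVarianceFilter:
    """Keep only features whose running variance exceeds `threshold`
    (a minimal sketch of the idea behind river's VarianceThreshold)."""

    def __init__(self, threshold: float = 1.0):
        self.threshold = threshold
        self._n = {}      # samples seen per feature
        self._mean = {}   # running mean per feature
        self._m2 = {}     # sum of squared deviations (Welford's algorithm)

    def learn_one(self, x: dict) -> None:
        for name, value in x.items():
            n = self._n.get(name, 0) + 1
            mean = self._mean.get(name, 0.0)
            delta = value - mean
            mean += delta / n
            self._n[name] = n
            self._mean[name] = mean
            self._m2[name] = self._m2.get(name, 0.0) + delta * (value - mean)

    def variance(self, name: str) -> float:
        n = self._n.get(name, 0)
        return self._m2[name] / (n - 1) if n > 1 else 0.0

    def transform_one(self, x: dict) -> dict:
        return {k: v for k, v in x.items() if self.variance(k) > self.threshold}


# A counter-like feature (cf. features 9-10 in Table 2) varies widely and
# survives the filter; a near-constant feature is dropped.
filt = StreamingVarianceFilter(threshold=1.0)
for sample in [{"word_count": 120, "flag": 0.1},
               {"word_count": 45, "flag": 0.1},
               {"word_count": 210, "flag": 0.2}]:
    filt.learn_one(sample)

kept = filt.transform_one({"word_count": 80, "flag": 0.1})
# → {"word_count": 80}
```

This explains why counter features always pass a variance threshold of 1 in the streaming setting: their sample-to-sample spread dominates the threshold, whereas bounded ratio-like features may be filtered out.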
References
Vats, N.A.; Yadavalli, A.; Gurugubelli, K.; et al.: Acoustic features, BERT model and their complementary nature for Alzheimer’s dementia detection. In: Proceedings of the International Conference on Contemporary Computing. Association for Computing Machinery, pp. 267–272 (2021). https://doi.org/10.1145/3474124.3474162
Mao, C.; Xu, J.; Rasmussen, L.; et al.: AD-BERT: using pre-trained language model to predict the progression from mild cognitive impairment to Alzheimer’s disease. J. Biomed. Inform. 144, 104442–104449 (2023). https://doi.org/10.1016/j.jbi.2023.104442
Syed, M.S.S.; Syed, Z.S.; Lech, M.; et al.: Automated screening for Alzheimer’s dementia through spontaneous speech. In: Proceedings of the Interspeech Conference. International Speech Communication Association, pp. 2222–2226 (2020). https://doi.org/10.21437/Interspeech.2020-3158
Nadira, C.S.; Rahayu, M.S.: The relationship of cognitive function and independence activities of daily living (ADL) in elderly at Panti Darussa’adah and An-Nur Lhokseumawe. J. Kedokt. dan Kesehat. Publ. Ilm. Fak. Kedokt. Univ. Sriwij. 7, 55–60 (2020). https://doi.org/10.32539/JKK.V7I3.10690
Alzheimer’s Association; Thies, W.; Bleiler, L.: 2023 Alzheimer’s disease facts and figures. Alzheimer’s Dement. 19, 1598–1695 (2023). https://doi.org/10.1002/alz.13016
Rasmussen, J.; Langerman, H.: Alzheimer’s disease—Why we need early diagnosis. Degener. Neurol. Neuromuscul. Dis. 9, 123–130 (2019). https://doi.org/10.2147/DNND.S228939
Manly, J.J.; Glymour, M.M.: What the aducanumab approval reveals about Alzheimer disease research. JAMA Neurol. 78, 1305–1306 (2021). https://doi.org/10.1001/jamaneurol.2021.3404
Kandratsenia, K.: Social stigma towards people with mental disorders among the psychiatrists, general practitioners and young doctors. Eur. Neuropsychopharmacol. 29, 401–402 (2019). https://doi.org/10.1016/j.euroneuro.2018.11.608
Tucker-Drob, E.M.: Cognitive aging and dementia: a life-span perspective. Annu. Rev. Dev. Psychol. 1, 177–196 (2019). https://doi.org/10.1146/annurev-devpsych-121318-085204
Pl, R.; Ks, G.: Cognitive decline assessment using semantic linguistic content and transformer deep learning architecture. Int. J. Lang. Commun. Disord. 59, 1110–1127 (2024). https://doi.org/10.1111/1460-6984.12973
Velupillai, S.; Suominen, H.; Liakata, M.; et al.: Using clinical Natural Language Processing for health outcomes research: overview and actionable suggestions for future advances. J. Biomed. Inform. 88, 11–19 (2018). https://doi.org/10.1016/j.jbi.2018.10.005
Yuan, J.; Bian, Y.; Cai, X.; et al.: Disfluencies and fine-tuning pre-trained language models for detection of Alzheimer’s disease. In: Proceedings of the Interspeech Conference. International Speech Communication Association, pp. 2162–2166 (2020). https://doi.org/10.21437/Interspeech.2020-2516
Bertacchini, F.; Demarco, F.; Scuro, C.; et al.: A social robot connected with chatGPT to improve cognitive functioning in ASD subjects. Front. Psychol. 14, 1–22 (2023). https://doi.org/10.3389/fpsyg.2023.1232177
Agbavor, F.; Liang, H.: Predicting dementia from spontaneous speech using large language models. PLOS Digit. Health 1(12), 1–14 (2022). https://doi.org/10.1371/journal.pdig.0000168
Chen, J.; Ye, J.; Tang, F.; et al.: Automatic detection of Alzheimer’s disease using spontaneous speech only. In: Proceedings of the Interspeech Conference, vol 6. International Speech Communication Association, pp. 3830–3834. (2021). https://doi.org/10.21437/Interspeech.2021-2002
Li, C.; Knopman, D.; Xu, W.; et al.: GPT-D: Inducing dementia-related linguistic anomalies by deliberate degradation of artificial neural language models. In: Proceedings of the Annual Meeting of the Association for Computational Linguistic, vol 1. Association for Computational Linguistics, pp. 1866–1877 (2022). https://doi.org/10.18653/v1/2022.acl-long.131
Santander-Cruz, Y.; Salazar-Colores, S.; Paredes-García, W.J.; et al.: Semantic feature extraction using SBERT for dementia detection. Brain Sci. 12, 270–287 (2022). https://doi.org/10.3390/brainsci12020270
Caruccio, L.; Cirillo, S.; Polese, G.; et al.: Can ChatGPT provide intelligent diagnoses? A comparative study between predictive models and ChatGPT to define a new medical diagnostic bot. Expert Syst. Appl. 235, 121186–121199 (2023). https://doi.org/10.1016/j.eswa.2023.121186
KS, N.P.; Sudhanva, S.; Tarun, T.N.; Yuvraaj, Y.; Vishal, D.A.; et al.: Conversational chatbot builder - smarter virtual assistance with domain specific AI. In: Proceedings of the International Conference for Emerging Technology, pp. 1–4. IEEE (2023). https://doi.org/10.1109/INCET57972.2023.10170114
Palanica, A.; Flaschner, P.; Thommandram, A.; et al.: Physicians’ perceptions of Chatbots in health care: cross-sectional web-based survey. J. Med. Internet Res. 21, 1–10 (2019). https://doi.org/10.2196/12887
Idris, M.D.; Feng, X.; Dyo, V.: Revolutionizing higher education: unleashing the potential of large language models for strategic transformation. IEEE Access 12, 67738–67757 (2024). https://doi.org/10.1109/ACCESS.2024.3400164
Romano, M.F.; Shih, L.C.; Paschalidis, I.C.; et al.: Large language models in neurology research and future practice. Neurology 1–29 (2023). https://doi.org/10.1212/WNL.0000000000207967
Fear, K.; Gleber, C.: Shaping the future of older adult care: ChatGPT, advanced AI, and the transformation of clinical practice. JMIR Aging 6, 1–3 (2023). https://doi.org/10.2196/51776
Alessa, A.; Al-Khalifa, H.: Towards designing a ChatGPT conversational companion for elderly people. In: Proceedings of the International Conference on Pervasive Technologies Related to Assistive Environments. Association for Computing Machinery, pp. 667–674 (2023). https://doi.org/10.1145/3594806.3596572
Vaidyam, A.N.; Wisniewski, H.; Halamka, J.D.; et al.: Chatbots and conversational agents in mental health: a review of the psychiatric landscape. Can. J. Psychiatry 64, 456–464 (2019). https://doi.org/10.1177/0706743719828977
Ceney, A.; Tolond, S.; Glowinski, A.; et al.: Accuracy of online symptom checkers and the potential impact on service utilisation. PLOS ONE 16, 1–16 (2021). https://doi.org/10.1371/journal.pone.0254088
Schmieding, M.L.; Kopka, M.; Schmidt, K.; et al.: Triage accuracy of symptom checker apps: 5-year follow-up evaluation. J. Med. Internet Res. 24, 1–13 (2022). https://doi.org/10.2196/31810
Kiliçarslan, S.; Közkurt, C.; Baş, S.; et al.: Detection and classification of pneumonia using novel Superior Exponential (SupEx) activation function in convolutional neural networks. Expert Syst. Appl. 217, 119503–119514 (2023). https://doi.org/10.1016/j.eswa.2023.119503
Yu, B.; Chen, H.; Jia, C.; et al.: Multi-modality multi-scale cardiovascular disease subtypes classification using Raman image and medical history. Expert Syst. Appl. 224, 119965–119976 (2023). https://doi.org/10.1016/j.eswa.2023.119965
Koga, S.; Martin, N.B.; Dickson, D.W.: Evaluating the performance of large language models: ChatGPT and Google Bard in generating differential diagnoses in clinicopathological conferences of neurodegenerative disorders. Brain Pathol. 1–4 (2023). https://doi.org/10.1111/bpa.13207
Kenton, J.D.M.W.C.; Toutanova, L.K.: Bert: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of Annual Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, vol 1. Association for Computational Linguistics, pp. 4171–4186 (2019)
Brown, T.B.; Mann, B.; Ryder, N.; et al.: Language models are few-shot learners. In: Proceedings of the Advances in Neural Information Processing Systems Conference, pp. 1–25. MIT Press (2020)
Deriu, J.; Rodrigo, A.; Otegi, A.; et al.: Survey on evaluation methods for dialogue systems. Artif. Intell. Rev. 54, 755–810 (2021). https://doi.org/10.1007/s10462-020-09866-x
Brown, T.; Mann, B.; Ryder, N.; et al.: Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020)
Lee, J.; Yoon, W.; Kim, S.; et al.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234–1240 (2020). https://doi.org/10.1093/bioinformatics/btz682
Luo, R.; Sun, L.; Xia, Y.; et al.: BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief. Bioinform. 23, 1–11 (2022). https://doi.org/10.1093/bib/bbac409
Peng, Y.; Yan, S.; Lu, Z.: Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets. In: Proceedings of the BioNLP Workshop and Shared Task. Association for Computational Linguistics, pp. 58–65 (2019). https://doi.org/10.18653/v1/W19-5006
Yao, L.; Jin, Z.; Mao, C.; et al.: Traditional Chinese medicine clinical records classification with BERT and domain specific corpora. J. Am. Med. Inform. Assoc. 26, 1632–1636 (2019). https://doi.org/10.1093/jamia/ocz164
Hirosawa, T.; Harada, Y.; Yokose, M.; et al.: Diagnostic accuracy of differential-diagnosis lists generated by generative pretrained transformer 3 Chatbot for clinical vignettes with common chief complaints: a pilot study. Int. J. Environ. Res. Public Health 20, 3378–3387 (2023). https://doi.org/10.3390/ijerph20043378
Gillioz, A.; Casas, J.; Mugellini, E.; et al.: Overview of the transformer-based models for NLP tasks. In: Proceedings of the Federated Conference on Computer Science and Information Systems. Polish Information Processing Society, pp. 179–183 (2020). https://doi.org/10.15439/2020F20
Ji, Z.; Lee, N.; Frieske, R.; et al.: Survey of hallucination in natural language generation. ACM Comput. Surv. 55, 248–285 (2023). https://doi.org/10.1145/3571730
Chen, H.; Yuan, K.; Huang, Y.; et al.: Feedback is all you need: from ChatGPT to autonomous driving. Sci. China Inf. Sci. 66, 166201–166203 (2023). https://doi.org/10.1007/s11432-023-3740-x
Zaman, K.T.; Hasan, W.U.; Li, J.; et al.: Empowering caregivers of Alzheimer’s disease and related dementias (ADRD) with a GPT-powered voice assistant: leveraging peer insights from social media. In: Proceedings of the IEEE Symposium on Computers and Communications, pp. 1–7. IEEE (2023). https://doi.org/10.1109/ISCC58397.2023.10218142
Alomari, A.; Idris, N.; Sabri, A.Q.M.; et al.: Deep reinforcement and transfer learning for abstractive text summarization: a review. Comput. Speech Lang. 71, 101276–101318 (2022). https://doi.org/10.1016/j.csl.2021.101276
Wischmeyer, T.: Artificial Intelligence and Transparency: Opening the Black Box, pp. 75–101. Springer International Publishing (2020). https://doi.org/10.1007/978-3-030-32361-5_4
Mathkunti, N.M.; Rangaswamy, S.: Machine learning techniques to identify dementia. SN Comput. Sci. 1, 118–124 (2020). https://doi.org/10.1007/s42979-020-0099-4
Ilias, L.; Askounis, D.: Context-aware attention layers coupled with optimal transport domain adaptation and multimodal fusion methods for recognizing dementia from spontaneous speech. Knowledge-based Syst. 277, 110834–110851 (2023). https://doi.org/10.1016/j.knosys.2023.110834
Kumar, Y.; Koul, A.; Singla, R.; et al.: Artificial intelligence in disease diagnosis: a systematic literature review, synthesizing framework and future research agenda. J. Ambient. Intell. Humaniz. Comput. 14, 8459–8486 (2023). https://doi.org/10.1007/s12652-021-03612-z
Xu, S.: Bayesian Naïve Bayes classifiers to text classification. J. Inf. Sci. 44, 48–59 (2018). https://doi.org/10.1177/0165551516677946
Kang, S.; Kim, D.; Cho, S.: Approximate training of one-class support vector machines using expected margin. Comput. Ind. Eng. 130, 772–778 (2019). https://doi.org/10.1016/j.cie.2019.03.029
Weinberg, A.I.; Last, M.: EnHAT - Synergy of a tree-based Ensemble with Hoeffding Adaptive Tree for dynamic data streams mining. Inf. Fusion 89, 397–404 (2023). https://doi.org/10.1016/j.inffus.2022.08.026
Zhang, W.; Bifet, A.; Zhang, X.; et al.: FARF: A Fair and Adaptive Random Forests Classifier, vol. 12713 LNAI, pp. 245–256. Springer (2021). https://doi.org/10.1007/978-3-030-75765-6_20
Benesty, J.; Chen, J.; Huang, Y.; et al.: Pearson correlation coefficient. In: Springer Topics in Signal Processing, vol 2. Springer, pp. 37–40 (2009). https://doi.org/10.1007/978-3-642-00296-0_5
de Arriba-Pérez, F.; García-Méndez, S.; González-Castaño, F.J.; et al.: Automatic detection of cognitive impairment in elderly people using an entertainment chatbot with Natural Language Processing capabilities. J. Ambient Intell. Humaniz. Comput. 14, 16283–16298 (2023). https://doi.org/10.1007/s12652-022-03849-2
Funding
Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This work was partially supported by (i) Xunta de Galicia grants ED481B-2022-093 and ED481D 2024/014, Spain; and (ii) University of Vigo/CISUG for open access charge.
Author information
Contributions
Francisco de Arriba-Pérez contributed to conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing—original draft, writing—review and editing, visualization, supervision, project administration, and funding acquisition. Silvia García-Méndez contributed to conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing—original draft, writing—review & editing, visualization, supervision, project administration, and funding acquisition.
Ethics declarations
Conflict of interest
The authors have no conflict of interest to declare relevant to this article’s content.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
de Arriba-Pérez, F., García-Méndez, S. Leveraging large language models through natural language processing to provide interpretable machine learning predictions of mental deterioration in real time. Arab J Sci Eng (2024). https://doi.org/10.1007/s13369-024-09508-2