Introduction

The prevalence of dementia, including Alzheimer’s disease (AD), is rapidly increasing due to global population aging [1, 2]. AD accounts for 60–70% of all dementia cases and is a leading cause of dementia in individuals aged 65 and older. Dementia is a global health concern, affecting over 55 million people, with nearly 10 million new cases annually. According to the World Health Organization (WHO), by 2050, the number of people living with dementia is projected to triple, reaching over 150 million globally [3]. Dementia significantly contributes to disability and dependency in older populations [4]. It is also among the leading causes of death worldwide and imposes an enormous burden on the medical system, as well as on health and caregiving economies. Key epidemiological trends show that the fastest growth in the occurrence of dementia is in low- and middle-income regions, which account for about two-thirds of dementia cases [5, 6]. Overall, the increase in prevalence and mortality rates associated with AD likely also reflects underdiagnosis, which applies across North America, Europe, and Asia [7]. Still, in some high-income countries, the increase in AD has been attenuated, attributed to improved education and healthcare, and, to some extent, also to lifestyle changes [8]. Many factors, such as age and genetic predisposition, have been reported to increase the risk of developing AD [9, 10]. The Dementia Prevention, Intervention, and Care Committee (2020) identified 12 risk factors, including low education, hypertension, hearing loss, smoking, obesity, depression, physical inactivity, diabetes, and low social contact [11]. These factors account for about 40% of all cases. With the added roles of alcohol consumption, traumatic brain injury (TBI), air pollution, untreated vision loss, and high LDL cholesterol, the contribution of modifiable factors is even greater [12, 13].

AD is best described as a multiplex disorder involving multiple pathological pathways, including neuroinflammation [14], endocytosis, cholesterol transport, APP and tau processing, protein folding, and ubiquitination [15]. These processes lead to progressive, irreversible changes in brain structure and function, particularly affecting vascular systems and neuronal connectivity. This disrupts the interplay among neurons, microglia, astrocytes, and oligodendrocytes [16]. Clinically, AD features a prolonged preclinical phase during which anatomical and molecular alterations—such as β-amyloid plaques, immunoreactive senile plaques, and tau neurofibrillary tangles—develop silently [17, 18], long before cognitive symptoms like memory loss or disorientation emerge.

Genetically, early-onset familial AD (EOAD) involves pathogenic variants in APP, PSEN1, and PSEN2 [19, 20]. In late-onset cases, APOE variants, especially APOE4 (R112, R158), are key modulators of risk and pathology [21]. Genome-wide association studies (GWAS) have identified over 50 susceptibility loci, reinforcing the disease’s polygenic nature and the central role of APOE [22]. Mixed dementias, such as AD coexisting with vascular dementia (VaD), become more prevalent with age [23], with cerebral amyloid angiopathy (CAA) as a major contributor to vascular-related pathology [24]. Other dementias, including Lewy body disease and TDP-43 encephalopathy (LATE), complicate differential diagnosis and care.

Machine learning (ML) has enabled early prediction of AD using large-scale clinical datasets. Models integrating electronic health records (EHRs) and genetic profiles can forecast disease onset years in advance [25,26,27]. Enhancements using polygenic risk scores (PRS) [28], natural language processing (NLP) [29], and unsupervised deep learning [30] have improved performance in detecting mild cognitive impairment and predicting progression to dementia [31, 32].

Based on the elevated risk of dementia with age and the difficulty in defining the type of dementia and its severity, having a population-based, generalizable risk assessment protocol is crucial for the healthcare system and medical professionals. In this study, we integrate clinical data, leading genetic variants, and lifestyle information to estimate the relative importance of risk factors in predicting dementia, particularly Alzheimer’s disease. As age is clearly one of the most influential risk factors in all AD models, we neutralize its contribution, allowing us to reveal more subtle factors. We also provide alternative models for females and males and analyze the contribution of a wide variety of known and novel risk factors to improve diagnosis. This study highlights the understudied importance of modifiable risk factors, life experiences, and the relatively limited contribution of genetics beyond APOE to the risk of dementia and its various subtypes.

Materials and methods

UK biobank database, extraction, and processing

The UK Biobank (UKB) includes over 500,000 participants, aged 40–69 at recruitment (54.4% female), who were enrolled during 2006–2010 at 23 medical centers across the UK [33]. All analyses were based on the 2021 UKB release. Disease classification is based on clinical information encoded by ICD-10 codes F00 (dementia in Alzheimer’s disease; AD) and G30 (AD). Beyond AD, we also defined a broader dementia phenotype that covers vascular dementia (VaD; code F01) together with all other dementia diagnosis codes (F00, F02, and F03).

To avoid biases due to overrepresentation of genetic factors, we removed genetic relatives by keeping only one representative of each kinship group. This resulted in a dataset of 469,637 participants. To mitigate the age factor, the strongest factor in AD onset, we applied an age-dependent matching protocol that stochastically matches the control group to the AD group. The objective is to retain the majority of the samples while matching the age-at-recruitment distributions. In practice, we randomly chose samples from the control group (a total of 72,366 individuals) so that their age distribution matched that of the samples with an Alzheimer’s diagnosis (AD group, 2878 individuals). The rest of the analysis was performed on the matched sets. We performed ten iterations for each model training with different random seeds to generate different random control sets.
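The matching step can be sketched as follows; the function and its parameter names are illustrative (the actual UKB fields and case-to-control ratio follow the text above), and it simply samples controls within each age stratum in proportion to the AD cases:

```python
import random
from collections import Counter

def age_match_controls(case_ages, control_ages, controls_per_case=25, seed=0):
    """Sample controls so that their age distribution mirrors the cases'.

    case_ages / control_ages: lists of (participant_id, age) tuples.
    Illustrative sketch of the stochastic age-matching protocol.
    """
    rng = random.Random(seed)
    # How many cases were diagnosed at each age
    case_counts = Counter(age for _, age in case_ages)
    # Index the available controls by age
    by_age = {}
    for pid, age in control_ages:
        by_age.setdefault(age, []).append(pid)
    matched = []
    for age, n_cases in case_counts.items():
        pool = by_age.get(age, [])
        # Draw up to controls_per_case controls per case at this age
        k = min(len(pool), n_cases * controls_per_case)
        matched.extend((pid, age) for pid in rng.sample(pool, k))
    return matched
```

Repeating the call with different seeds yields the different random control sets used across the ten training iterations.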

Machine learning methodology

Algorithm selection

We used a state-of-the-art machine learning (ML) gradient boosting tree model, CatBoost (Categorical Boosting) [34, 35]. CatBoost is a gradient boosting algorithm that is specifically optimized for datasets with categorical features. It handles categorical variables natively, without requiring extensive preprocessing such as one-hot encoding, and it employs ordered boosting and efficient use of permutations to reduce overfitting and prediction shift bias. At each step of the algorithm, a decision tree base learner is fitted to the residuals (gradients) of the previous iterations, minimizing the current ensemble’s loss function.

Data partitioning and cross-validation strategy

The age-matched dataset (n = 75,244) was partitioned using stratified random sampling to maintain class balance across splits: training set (68%, n = 51,157), validation set (12%, n = 9028), and test set (20%, n = 15,047). To assess model stability and robustness, we performed ten independent training iterations, each using different random seeds (seeds 0–9) that generated unique data partitions while maintaining the same split proportions. This approach allowed us to evaluate both model performance variability and feature importance consistency across different data configurations. When training the model, the majority of other parameters remained at CatBoost default values, which have demonstrated superior performance in comparative studies and reduce the risk of overfitting through automated parameter selection [36].
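A sketch of the stratified 68/12/20 partition, assuming simple participant-id lists; the function name and interface are illustrative:

```python
import random

def stratified_split(ids, labels, fracs=(0.68, 0.12, 0.20), seed=0):
    """Split ids into train/validation/test, preserving the class ratio in each part."""
    rng = random.Random(seed)
    parts = ([], [], [])
    for cls in set(labels):
        members = [i for i, l in zip(ids, labels) if l == cls]
        rng.shuffle(members)
        n = len(members)
        n_train = round(fracs[0] * n)
        n_val = round(fracs[1] * n)
        parts[0].extend(members[:n_train])           # training set
        parts[1].extend(members[n_train:n_train + n_val])  # validation set
        parts[2].extend(members[n_train + n_val:])   # held-out test set
    return parts
```

Varying `seed` from 0 to 9 reproduces the ten independent partitions used for the robustness assessment.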

Hyperparameter configuration

CatBoost models were trained with the following key hyperparameters: (i) maximum tree depth, 6; (ii) maximum iterations, 1000; (iii) early stopping rounds, 15 (training terminated if validation AUC showed no improvement for 15 consecutive iterations); (iv) evaluation metric, area under the ROC curve (ROC-AUC); (v) class balancing, positive class weighted by the ratio of negative to positive samples; (vi) learning rate, default (automatically determined by CatBoost); (vii) categorical feature handling, Native CatBoost encoding; (viii) random seed, varied across iterations (0–9) for robustness assessment.
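These settings can be collected into a parameter dictionary; the snippet below is a hypothetical sketch using standard CatBoost parameter names, with the positive-class weight computed from the cohort counts reported above:

```python
# Hypothetical CatBoost parameter dictionary mirroring settings (i)-(viii);
# the class counts are the AD cases and matched controls described above.
n_pos, n_neg = 2878, 72366

params = {
    "depth": 6,                          # (i) maximum tree depth
    "iterations": 1000,                  # (ii) maximum boosting rounds
    "early_stopping_rounds": 15,         # (iii) stop if validation AUC stalls
    "eval_metric": "AUC",                # (iv) ROC-AUC on the validation set
    "scale_pos_weight": n_neg / n_pos,   # (v) weight positive class by neg/pos ratio
    "random_seed": 0,                    # (viii) varied 0-9 across iterations
}
```

The learning rate is deliberately omitted so that CatBoost determines it automatically, as in setting (vi).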

Model training process

Training followed CatBoost’s ordered boosting approach, where each iteration builds a decision tree using gradient information from previous iterations’ residuals. For categorical features, CatBoost computes target statistics using ordered boosting with random permutations of the training data, effectively mapping categorical variables to continuous space while avoiding prediction shift. The validation set was used exclusively for early stopping decisions and hyperparameter evaluation, while the test set remained completely held out until final model assessment. As the number of AD cases in the UKB is < 4% of the dataset, we addressed class imbalance by weighting the positive class by the ratio of negative to positive samples. To prevent overfitting, we used CatBoost’s early stopping feature [36] based on a separate validation set: training stopped when no improvement was observed for a set number of iterations after the best metric value. Logistic regression models were also trained for comparison.

Feature set variants

To investigate the relative contribution of different feature types, we trained multiple model variants labelled as follows: (i) all features—the complete feature set including clinical, genetic, demographic, and lifestyle variables; this comprises the 45 expert-based features combined with genetics and diagnoses. (ii) Selected ICD-10—statistically significant ICD-10 codes (n = 66) combined with genetic and demographic features. (iii) Reduced—a minimal feature set including sex, education, genetic variants, and ICD-10 diagnoses. (iv) Genetics only—ten selected single nucleotide polymorphisms (SNPs), based on GWAS associations, that were detected in 100% of the participants. (v) ICD-10 only—clinical diagnosis codes without genetic or demographic information. (vi) PWAS variants—gene-based functional scores from a proteome-wide association study (PWAS) for top-ranked AD-associated genes. Each model variant was trained using identical data partitions and hyperparameter settings to ensure fair comparison of predictive performance across feature combinations. We trained additional models on subsets of features (e.g., genetics, ICD-10 diagnoses, sex, and education). Moreover, we trained models on the UKB population partitioned by (i) sex, for male and female groups; (ii) education, preprocessed into poor, high school, and academic degree categories; (iii) APOE variants, with partition into APOE_e4_e4 (homozygotes with e4-e4 alleles), no_APOE_e4_e4 (samples without e4-e4 alleles), APOE_e4_hetro (heterozygotes with a single e4 allele), and no_APOE_e4_hetro (samples without the e4 allele).

Feature selection and engineering

We included all listed ICD-10 diagnoses as features, retaining only diagnoses dated ≥ 5 years prior to the AD diagnosis (for cases) and ≥ 5 years before the mean diagnosis age of the relevant affected age group (for controls). For the selected ICD-10 model, we ranked diagnoses by performing Fisher’s exact test on the contingency tables of the affected vs. control groups in the training set for each ICD-10 term. We kept only the significant ICD-10 codes (FDR < 0.05), reducing the list to 66 significant ICD-10 features. In addition to the data fields from the UKB, we engineered features that are not explicitly defined by the UKB (e.g., the mean of systolic and diastolic blood pressure calculated from the list of blood pressure measurements). Extracted education values were simplified to three categories: poor, high school, and academic degree.
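The selection step can be sketched with a pure-Python two-sided Fisher’s exact test and Benjamini–Hochberg FDR adjustment (an illustrative reimplementation, not the exact code used):

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    denom = comb(n, col1)
    def pmf(x):
        # Hypergeometric probability of observing x in the top-left cell
        return comb(row1, x) * comb(n - row1, col1 - x) / denom
    p_obs = pmf(a)
    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    # Two-sided: sum all tables at least as extreme as the observed one
    return sum(pmf(x) for x in range(lo, hi + 1) if pmf(x) <= p_obs + 1e-12)

def bh_fdr(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adj = [0.0] * m
    prev = 1.0
    for rank in range(m - 1, -1, -1):   # enforce monotonicity from the largest p
        i = order[rank]
        prev = min(prev, pvals[i] * m / (rank + 1))
        adj[i] = prev
    return adj
```

Each ICD-10 code contributes one 2 × 2 table (diagnosis present/absent × case/control); codes with adjusted p < 0.05 are retained as features.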

Genetics analysis

Variant genetic analysis

The UKB genotyping scheme is based on 805,426 preselected genetic variants, which are expanded by an imputation protocol to approximately 9 million variants that passed quality control [37]. We used the Open Targets (OT) platform [38] to select the most up-to-date genetic associations for AD. The OT platform compiles the summary statistics of the top-scored variants from the GWAS catalog [39]. For the top-listed candidate genes from Open Targets Genetics (OTG), we included functional damage scores based on FIRM [40], as implemented by PWAS [41, 42]. We generated a list of 97 unfiltered SNPs and chose the ten most frequent SNPs in the UKB dataset. The PWAS gene scoring method is described in Supplementary Text S3.

In addition, we used an in-house PRS and the UKB-provided PRS [43]. We used LDpred2 to derive the AD PRS from the UKB. In addition, we used the previously published PRS (v.2) from the UKB, based on GWAS summary statistics and the imputed genotype data [44, 45]. Note that the optimal practice for an AD PRS is sensitive to the SNP selection thresholds [46].

Dementia models and subtypes

One of our goals was to assess how risk factors vary across the different clinical terms associated with dementia. Several models were trained on individuals with dementia according to their labels. For the dementia group, samples with an ICD-10 diagnosis of F00, F01, F02, or F03 were labeled as positive. For vascular dementia, samples with ICD-10 F01 were labeled as positive. For unique vascular dementia, samples with the F01 ICD-10 code but none of the other dementia diagnoses (F00, F02, or F03) were labeled as positive. For non-vascular dementia, samples with a dementia ICD-10 diagnosis other than vascular dementia (i.e., F00, F02, or F03 with no occurrence of F01) were labeled as positive.
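These labeling rules can be expressed compactly; the helper below is an illustrative sketch that applies the definitions above to a participant’s set of ICD-10 codes (AD additionally uses G30, as described in Materials and Methods; the dotted code format is assumed for illustration):

```python
def dementia_labels(icd10_codes):
    """Assign subtype labels from a participant's collection of ICD-10 codes."""
    # Keep the 3-character stems, e.g. "F01.9" -> "F01"
    codes = {c.split(".")[0] for c in icd10_codes}
    vad = "F01" in codes
    non_vascular = bool(codes & {"F00", "F02", "F03"})
    return {
        "AD": bool(codes & {"F00", "G30"}),          # dementia in AD, or AD
        "dementia": vad or non_vascular,             # any of F00-F03
        "VaD": vad,                                  # vascular dementia
        "unique_VaD": vad and not non_vascular,      # F01 without F00/F02/F03
        "non_vascular": non_vascular,                # F00/F02/F03 regardless of F01
    }
```

Note that under these definitions, non-vascular dementia as trained in the models additionally excludes F01 carriers; the flag above marks any F00/F02/F03 occurrence, and the exclusion is applied at labeling time.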

Model evaluation and statistics

Model performance was primarily assessed using the area under the receiver operating characteristic curve (ROC-AUC), which measures the model’s ability to distinguish between classes across thresholds. We used SHAP (SHapley Additive exPlanations) to estimate feature importance [47]. By assigning a consistent importance value to each variable, SHAP presents the results as a ranked list of the most influential features with respect to the original values of each feature. We also estimated model performance with other metrics, including accuracy, precision, recall, and F1 score.

In comparing two groups (e.g., cases and controls) and testing for a statistically significant difference in a variable, we applied both the t-test and the Mann–Whitney U test. To compare distributions between groups, we applied the Kolmogorov–Smirnov (KS) test, which estimates whether two samples come from the same distribution. We applied Fisher’s exact test, which yields an exact p-value, to determine whether there is a non-random association between two categorical variables (i.e., a 2 × 2 contingency table).
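For reference, the two-sample KS statistic is simply the maximum distance between the two empirical cumulative distribution functions; a minimal sketch (statistic only, without the p-value computation):

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max |ECDF_a(x) - ECDF_b(x)|."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in sorted(set(a) | set(b)):
        fa = bisect.bisect_right(a, x) / len(a)   # empirical CDF of sample_a at x
        fb = bisect.bisect_right(b, x) / len(b)   # empirical CDF of sample_b at x
        d = max(d, abs(fa - fb))
    return d
```

Identical distributions give a statistic near 0 (as for the age-matched groups), while fully separated samples give 1.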

Results

Balancing the age effect from AD group by a matching protocol

The primary goal of this study was to review current risk factor knowledge and evaluate its contribution to the prediction of AD and dementia in general. As a population-based resource, the UKB is based on standardized data collection protocols (see Materials and Methods). The average age of the participants in the UKB is 57.1 years (standard deviation, Std 7.66). We retrospectively analyzed personalized clinical information on diagnoses, medical procedures, lifestyle, personal genetics, self-reports, and nurse interview reports. As the strongest risk factor for AD is age, we applied an age-matching protocol to remove the impact of this feature by matching the age distributions of the AD cases and the control group.

Figure 1A shows the age distribution of the controls and the AD-diagnosed participants (AD group). The Kolmogorov–Smirnov (KS) test confirmed the significant difference in the age distribution between the AD group and controls (p-value < e-300). Repeating the statistical tests after the matching protocol resulted in an insignificant difference between the two groups (KS, p-value 1.0; Fig. 1B). Additional statistical tests, including the t-test and the Mann–Whitney U test, confirmed that the age feature was cancelled out for the AD group (Supplementary Table S1). The rest of the analysis was performed on the age-matched data. Note that this matching protocol (see Supplementary Text S1) is used to avoid the many factors that are implicitly age-dependent, including the occurrence of age-dependent comorbidities.

Fig. 1
figure 1

Age matching protocol. A The distribution of the control and AD groups by age. B Following the age-matching scheme, a major confounding bias was removed; at each age, a matched proportion of the control and AD groups remains stable throughout. For the matching algorithm, see Supplementary Text S1

Clinical view using a time stamp for AD comorbidity

We performed Fisher’s exact test on the ICD-10 diagnoses of the AD vs. control groups. ICD-10 diagnoses were filtered to include only diagnoses recorded at least 5 years prior to the AD diagnosis for affected individuals, and at least 5 years prior to the mean diagnosis age for the control group (see Materials and Methods). The relatively extended time window (5 years) ensures that any administrative delay in AD diagnosis has a minimal effect and thus allows us to properly separate comorbidities that tend to increase with age. Top diagnoses were used as features for the “selected_icd10” models.

Supplementary Fig. S1 shows that while most ICD-10 codes are insignificant and do not meet the statistical significance threshold of FDR ≤ 0.05, a few significant protective ICD-10 codes appear on the negative side of the log2(OR) axis. In contrast, many more ICD-10 diagnoses tend to increase AD risk (the graph is shifted toward positive values of log2(OR)). The most significant ICD-10 codes that overlap with AD are listed in Table 1. Notably, the ICD-10 I25.9 diagnosis of chronic ischemic heart disease (unspecified) shows an OR of 3.47 and a very strong z-value (1.23).

Table 1 Top 15 significant ICD-10 (FDR < 0.05) from Fisher’s Exact test of AD and control groups

AD and dementia model performance

The relatively limited size of the AD group (filtered to avoid duplications based on multiple dementia annotations, insufficient support for variant calling, etc.) encouraged us to carefully select a controlled number of features for each model. To minimize the number of selected features, we restricted the analysis to 45 features, called “all”, including BMI, smoking habits, cholesterol levels, blood pressure, computer gaming, and other features selected in consultation with the CNP lab at Hadassah Medical Center (Jerusalem, Israel). In addition, we included the ten selected SNPs and all ICD-10 diagnoses as one text feature. “Selected ICD10” is a collection of the top 66 ICD-10 codes combined with genetic features. “Reduced” refers to sex and education together with the genetic features and ICD-10 diagnoses. “Only genetics” includes the ten selected SNPs, specified as the common variants with the highest coverage among all individuals in the UKB. “Only selected” comprises the 66 significant ICD-10 codes that were selected as features. “PWAS-global” includes PWAS-based scores (40 in total, covering the recessive and dominant inheritance modes) for the top 20 AD-associated genes by the OT global score. “PWAS-genetics” includes the 18 genes with the top OTG score, scored by PWAS (36 in total, including the recessive and dominant modes). We performed model training for the different population subsets (see Materials and Methods). A summary of the models’ performance metrics, including the robustness assessment of each model (i.e., Std), is shown in Fig. 2A.

Fig. 2
figure 2

Performance of the risk factor predictive models for AD from UKB. A Comparison of selected models’ performance by the mean ROC-AUC over ten independent training iterations. Error bars present the standard errors of the model replications. Bars are colored by category: general, female, male, APOE, and education. B Comparison of the performance of selected AD prediction models. The diagonal line marks the random, no-discrimination line (AUC 0.5). Details on the performance and statistics of related models that included the OT gene list and the PWAS scoring are available in Supplementary Table S2

The model with the best ROC-AUC score is “female_all”, which is even higher than the model that includes all features for the unified samples of both sexes. The male models’ ROC-AUC scores are significantly lower, suggesting different manifestations of AD in males and females. Visually, there are two main performance classes: the lower-performance group (ROC-AUC ~ 0.6) suffers from relatively small group sizes and, consequently, low statistical power (e.g., “APOE_e4_e4_selected_icd10_reduced”). Models limited to only ICD-10 codes or only genetic features (“only_icd10”, “only_genetics”) resulted in lower performance, with ROC-AUC of ~ 0.55 and ~ 0.58, respectively. We conclude that while genetic and ICD-10 data contribute to model performance, they may not be sufficient without the contribution of rich demographic and clinical features. The other group of models performs substantially better (ROC-AUC ≥ 0.67). The population sizes and performance metrics, including precision, recall, and accuracy, are available in Supplementary Table S2.

Considering only genetic features limits the model to a ROC-AUC of ~ 0.7, while removing any genetic-origin features reduced the performance to a ROC-AUC of ~ 0.67. The reduced models are less predictive than the “all” models across all groups. For the “reduced” model, using logistic regression (LR) instead of CatBoost reduced the performance, emphasizing the contribution of non-linear relations and feature interactions (Fig. 2B). Using the 66 selected ICD-10 codes as features instead of all of the ICD-10 data does not substantially affect performance; the ROC-AUC of the “all” and “selected_icd10_all” models was quite stable. Replacing the expert-based features with the reduced feature set led to a drop in ROC-AUC from 0.761 (“all”) to 0.728 (“all_reduced”). The drop in ROC-AUC is consistent across all model groups (see Supplementary Table S2).

Interestingly, among the education models, the high-school models showed better performance than the poor and academic degree models. This may be explained by the number of positively labelled samples out of the total (1227 vs. 554). To further test the stability of the results, we performed down-sampling and bootstrapping to neutralize the direct effect of group size and to test robustness. Based on this manipulation of the data size, we conclude that there is no statistically significant difference between the means of the education-partitioned groups.
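The bootstrapping check can be sketched as follows, assuming per-iteration ROC-AUC scores for two groups; if the bootstrap 95% interval of the mean difference covers zero, the group means are not distinguishable (an illustrative sketch, not the exact protocol):

```python
import random

def bootstrap_mean_diff(scores_a, scores_b, n_boot=2000, seed=0):
    """Bootstrap 95% interval for the difference in mean scores between groups."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement and record the mean difference
        ra = [rng.choice(scores_a) for _ in scores_a]
        rb = [rng.choice(scores_b) for _ in scores_b]
        diffs.append(sum(ra) / len(ra) - sum(rb) / len(rb))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]
```

Down-sampling the larger group to the size of the smaller one before bootstrapping removes the direct effect of group size, as described above.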

Gene-based PWAS score failed to boost AD risk predictive models

As seen in the ROC-AUC curves (Fig. 2B), the “all” model outperforms the other selected models across all ranges, while “only_genetics” and “only_ICD10” performed relatively poorly. We conclude that the interaction of genetic, clinical, and other features is critical for AD risk prediction. For the ICD-10 diagnosis features, we applied a cutoff of 5 years prior to the age of diagnosis. Interestingly, when repeating the same protocol with the cutoff reduced to only 1 year prior to the diagnosis age, the model metrics improved significantly. For example, for the “all” features model, the ROC-AUC improved from 0.761 (5-year cutoff) to 0.81 (1-year cutoff).

We tested the possible contribution of PWAS as a gene-based association method [48]. PWAS is a complementary approach to routine GWAS, based on FIRM scores that assess the effect of genetic variants on gene function per individual (see Supplementary Text S3). We selected two groups of genes as features for model training. The first model group, called “PWAS-global”, was composed of the top 20 genes from the OT list with the best global score (global OT score > 0.55). The second model group, named “PWAS-genetics”, was composed of genes that OTG ranked solely through evidence from genetic association (GA); we analyzed 18 such genes with a GA score > 0.6. The dominant and recessive PWAS scores for each of these genes were calculated for each individual in the UKB to create the functional-based PWAS features. Surprisingly, the PWAS features, which are based on gene-relevance knowledge from OT, did not improve the AUC of any model (Supplementary Table S2). Using only the PWAS data as features still carries substantial predictive power (AUC ~ 0.69), comparable to using “only_genetic” features (AUC ~ 0.7). Based on these results, we conclude that, from a genetic perspective, it is mostly the APOE features that boost model quality across all training results, and these genes and alleles are already incorporated in all of the genetic feature sets used (GWAS, PWAS, or both).

Feature importance and interpretability

We analyzed the SHAP feature importance of the “all” model across ten iterations (Supplementary Table S3). The most significant feature identified was rs429358, a known variant of APOE. This was followed by “Filtered_ICD10,” “Age_last_episode_depression,” “Age_stop_smoking,” and “Medication,” which are relatively stable across the trained models (see standard deviation (Std) in Supplementary Table S3). As expected, the less important features are also generally less stable. Note that some of the top-listed features display strong dependencies among them (e.g., “Meds_cholesterol_hypertension_diabetes” and “Medication”). We further evaluated the contribution of each feature on a selected iteration of the “all” model using SHAP as an explainable AI tool.
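The aggregation across iterations can be sketched as follows, assuming each run has been summarized as a {feature: mean |SHAP|} mapping (a hypothetical data layout, not the actual SHAP output format):

```python
from statistics import mean, stdev

def aggregate_importance(per_iteration):
    """Rank features by mean importance across runs, with run-to-run Std.

    per_iteration: list of dicts mapping feature name -> mean |SHAP| in that run.
    Returns (feature, mean_importance, std) tuples, most important first.
    """
    table = []
    for f in per_iteration[0].keys():
        vals = [run[f] for run in per_iteration]
        table.append((f, mean(vals), stdev(vals)))
    # Sort by mean importance; the Std column serves as the stability measure
    table.sort(key=lambda row: row[1], reverse=True)
    return table
```

A small run-to-run Std indicates a feature whose ranking is stable across the ten training iterations.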

Figure 3A shows the top 20 features ranked by SHAP. We analyzed the genetic feature importance across the various models trained on the dataset. The heatmap (Fig. 3B) visually presents the importance of different features for each model, providing a comprehensive overview of how each feature contributes to the predictive power of the models. The strongest feature is rs429358, followed by rs7412; both are APOE variants. Interestingly, rs429358 is more important for the female models (mean SHAP importance > 0.5), while for the male models the SHAP value is < 0.4. SNP rs117618017 demonstrates only moderate importance across the different models, with peaks in the “only_genetics” (0.029) and “selected_icd10_all” (0.017) models.

Fig. 3
figure 3

Feature importance of the AD predictive risk model. A Top 20 features from a selected iteration of the “all” model, using SHAP. The most informative feature is at the top. Each dot in the plot represents a participant’s feature value for that variable (vertical axis). The values reported show the contribution of each feature to the model outcome (i.e., AD), where the color reflects the scale of the feature’s value: red indicates a high value and blue a low value for that observation. Gray depicts missing data or categorical, textual features. Note that the genetic features (e.g., rs429358, rs7412) can only take discrete values (0, 1, and 2). B Heatmap of the mean SHAP importance of the five most influential genetic features used in model training. Importance values are rounded to 2-digit precision

The importance of the other SNPs, such as rs143332484 (Fig. 3B) and rs1859788, is lower but still noteworthy in some models. The dominant trend is that the remaining SNPs, including rs769452 and rs146723120, show minimal importance across most models. Overall, the heatmap highlights the significant role of certain SNPs, particularly rs429358 and rs7412, in influencing model outcomes. We conclude that APOE variants are of utmost importance for the AD prediction outcome, while the contribution of other genetic variants to improving any of the predictive models is negligible. The results emphasize the need for further investigation into these SNPs to better understand their biological implications and their potential use in predictive modeling.

Health records boost the AD model and expose sex differences

The analysis of ICD-10 feature importance across the various models reveals insights into the predictive power of the diagnostic codes. The heatmap in Fig. 4A shows that the most significant feature identified across several models is E11.9 (diabetes without complications). Another notable feature is I10 (essential (primary) hypertension), which shows substantial importance across various models, especially in “selected_icd10_all.” Note that ICD-10 I10 provided additional value for the models of both sexes. R07.4 (chest pain, unspecified), I20.9 (angina pectoris, unspecified), and E78.0 (pure hypercholesterolemia) carry considerable importance in some but not all models. Overall, the heatmap illustrates the variability in ICD-10 feature importance across the different models, with some features consistently showing high importance, while the contribution of other diseases is negligible. These findings emphasize the need for a combined and weighted set of features to enhance model accuracy and provide valuable insights into the health determinants of AD within the tested population (for an extensive analysis of models and ICD-10 importance, see Supplementary Fig. S2).

Fig. 4
figure 4

Interpretation of the AD predictive risk model by feature properties. A Heatmap of the mean SHAP importances of the selected ICD-10 features used in model training. ICD-10 features are sorted by the mean feature importance of the only_selected_icd10 model. Importance values are rounded to 2-digit precision. B Histogram of the features that differ most between the “male_all” and “female_all” models. The values of the most divergent feature importances are shown. Features colored red and blue are more important in the female and male models, respectively

We compared the mean feature importance of the “male_all” and “female_all” models to highlight the features that carry the largest sex-dependent differences (Fig. 4B). Interestingly, the rs429358 APOE variant is relatively more important for the female model than for the male model, followed by the filtered ICD-10 diagnoses, the age of the last episode of depression, medication, and the rs7412 APOE variant. Age of stopping smoking has a greater mean SHAP importance for males. The remaining features differ by less than 0.01 in mean SHAP importance and are not discussed further.

Models for vascular dementia outperform AD and dementia models

AD is usually diagnosed through MRI and brain imaging, with the majority of dementia cases unspecified, and it may occur as mixed dementia [49]. Figure 5A shows the partition of dementia diagnoses by ICD-10 indexing. The dementia models consist of all patients with at least one positive diagnosis for the ICD-10 codes F00, F01, F02, or F03; together there are 6227 positive samples (the sum of 2216, 33, 275, 411, 807, and 2485 samples). The non-vascular dementia models consist of patients with F00, F02, or F03 and no F01 diagnosis (2485 + 2216 = 4701 positive samples). Vascular dementia (VaD) models consist of patients with F01 (807 + 33 + 275 + 411 = 1526 positive samples), and the unique VaD models consist of patients with only F01 and no F00, F02, or F03 diagnosis (807 + 33 = 840 positive samples).
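The subgroup definitions can be cross-checked arithmetically from the Venn-region counts reported in the text (region groupings follow Fig. 5A; the labels in the comments are informal shorthand):

```python
# Venn-region counts of the dementia diagnoses, as reported in the text
regions = [2216, 33, 275, 411, 807, 2485]

assert sum(regions) == 6227                  # any dementia (F00-F03)
assert 2485 + 2216 == 4701                   # non-vascular dementia (no F01)
assert 807 + 33 + 275 + 411 == 1526          # vascular dementia (VaD, F01)
assert 807 + 33 == 840                       # unique VaD (F01 only)
assert 355 + 2216 + 33 + 274 == 2878         # AD (F00 or G30), per Fig. 5A legend
```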

Fig. 5
figure 5

Partition of dementia cohort in UKB to subgroups by clinical ICD-10. A Venn diagram of different dementia diagnoses. The AD model consists of patients with F00 or G30 diagnosis (355 + 2216 + 33 + 274 = 2878 positive samples). B Comparison of models’ performance by ROC-AUC mean of 10 different training iterations. Error bars present the standard errors. Bars are colored by their categories: AD, unique vascular dementia, non-vascular dementia, vascular dementia (VaD) and dementia

To improve prediction by dementia subtyping, we trained models for the different diagnoses, creating male, female, and both-sex (i.e., “all”) models. The results are summarized in Fig. 5B. The model with the best ROC-AUC metric is the VaD model. The female models are stronger than the male models for AD, non-vascular dementia, and dementia in general, but the male models perform better for VaD and unique VaD. Notably, the unique VaD model has a relatively high ROC-AUC score (0.764), but this measure is lower for the sex-specific models. We suspected that the difference is due to the relatively low number of positive samples for these models (< 500 samples) and the relatively high standard errors. Testing this explanation failed to support genuine sex dependency: the limited cohort sizes dominated the results, as shown by down-sampling and bootstrapping protocols. We conclude that there is no statistically significant difference between the means of the two groups, and that there is insufficient evidence to suggest that the samples come from different populations. The performance of the different models is compiled in Table 2.
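The bootstrapping check described above can be sketched as follows: resample each group's per-iteration ROC-AUC scores, form the bootstrap distribution of the difference in means, and ask whether its 95% confidence interval spans zero. The score arrays below are placeholders, not the study's measurements:

```python
import numpy as np

rng = np.random.default_rng(42)

# Placeholder ROC-AUC scores from 10 training iterations per model (illustrative)
auc_male = np.array([0.71, 0.73, 0.70, 0.74, 0.72, 0.71, 0.73, 0.72, 0.70, 0.74])
auc_female = np.array([0.72, 0.70, 0.73, 0.71, 0.74, 0.72, 0.71, 0.73, 0.72, 0.70])

def bootstrap_mean_diff(a, b, n_boot=10_000):
    """Bootstrap distribution of the difference in means between two score sets."""
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        diffs[i] = (rng.choice(a, size=a.size, replace=True).mean()
                    - rng.choice(b, size=b.size, replace=True).mean())
    return diffs

diffs = bootstrap_mean_diff(auc_male, auc_female)
ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])
# If the 95% CI spans zero, there is no evidence the two groups differ
print(ci_low <= 0.0 <= ci_high)
```

With only 10 scores per group and wide resampled intervals, small cohorts tend to produce intervals that include zero, which is the pattern reported for the sex-specific unique VaD models.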

Table 2 Models metrics summary of AD and dementia models

Difference in feature importance for VaD, AD, and dementia models

The features that are more important in the AD models include the rs7412 APOE variant, illness_of_father, qualifications, meds_cholesterol_hypertension_diabetes, and diabetes diagnosed. The features that are more important in the dementia models include BMI, life_quality (datafield 26417), duration_moderate_activity, walking_activity (number of days per week), rs117618017, sex, and smoking_packs_years. This list strongly suggests the importance of lifestyle, physical activity, and family history.

We observed that the unique VaD models have some features that are substantially more important relative to the other models. These include LDL_cholesterol (average rank of 5.30 vs 34.35 and 41.05 in AD and dementia models, respectively), cardiovascular_diagnosis (average rank of 14.40 vs 43.45 and 24.50 in AD and dementia models, respectively), and seen_shrink_for nerves, anxiety, tension, depression (average rank of 21.10 vs 37.15 and 35.00 in AD and dementia models, respectively). Features that are substantially less important in unique VaD models include Medication (average rank of 31.75 vs 5.40 and 3.60 in AD and dementia models, respectively) and dental_problems (average rank of 31.55 vs 11.80 and 11.80 in AD and dementia models, respectively). Note that APOE rs429358 is ranked first or second in all models (with a standard error of 0).
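The average-rank statistic used throughout this comparison can be reproduced by ranking features within each training iteration and averaging across iterations. A small sketch with an illustrative importance matrix (three hypothetical features, three iterations):

```python
import numpy as np

# Illustrative importance matrix: rows = training iterations, cols = features
features = ["LDL_cholesterol", "Medication", "APOE_rs429358"]
importances = np.array([
    [0.02, 0.05, 0.30],
    [0.03, 0.04, 0.28],
    [0.01, 0.06, 0.31],
])

# Rank features within each iteration (rank 1 = most important), then average
ranks = (-importances).argsort(axis=1).argsort(axis=1) + 1
mean_rank = ranks.mean(axis=0)
std_err = ranks.std(axis=0, ddof=1) / np.sqrt(ranks.shape[0])
for f, r, s in zip(features, mean_rank, std_err):
    print(f, r, round(s, 2))
```

A feature that wins the same rank in every iteration, as APOE rs429358 does in the real models, ends up with a standard error of 0.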

Discussion

AD is a progressive neurodegenerative disorder that causes memory loss and cognitive decline, accounting for ~ 70% of dementia cases. Its hallmarks include amyloid plaques and tau tangles, which are typically confirmed in postmortem brains. In living patients, diagnosis relies on imaging methods that reveal brain atrophy, reduced hippocampal volume, and signs of neuronal loss [50]. Dementia itself is a broader term for conditions marked by cognitive deterioration. Fewer than half of AD patients are marked as isolated AD; most are diagnosed with mixed pathologies, underscoring the diagnostic complexity [49]. Currently, no reliable blood or routine laboratory tests exist to differentiate dementia subtypes.

This study presents a statistical ML-based framework for predicting AD and dementia subtypes. Using UKB clinical data, we address key methodological challenges, particularly the lack of detailed longitudinal follow-up information. Our models leverage isolated aspects of lifestyle and experiential data from thousands of datafields, without including imaging or cerebrospinal fluid (CSF)–based measures, which are expected to enhance prediction [51, 52]. Integrating multimodal data such as imaging, proteomics, and genetics [28, 53, 54], along with detailed longitudinal information [55], has been shown to improve AD prediction. By focusing on EHR-derived features, our approach aims to support real-world healthcare systems lacking sophisticated brain imaging data. Overall, the robustness and reproducibility of our models are essential for any future clinical application. Our predictive models revealed sex-specific contributing variables, offering insights that may inform personalized medical approaches. A key strength of this study is the detailed analysis of feature importance for the performance of alternative models. Feature importance correlations across AD, VaD, and non-vascular dementia ranged from 0.64 (AD with unique VaD) to 0.75 (AD and dementia; Supplementary Table S5). The VaD predictive model outperformed all others, including the “female_all” model (Fig. 5B). Despite the overlap in the clinical manifestations of AD and VaD, the cause of VaD is reduced blood flow to the brain due to strokes or other conditions that damage blood vessels. This form of dementia can develop suddenly (e.g., following a stroke) or gradually from existing chronic vascular disease. While neurodegeneration is a progressive and long-lasting condition, in VaD, symptoms may worsen abruptly. We showed that this key difference in the etiology of AD versus VaD is reflected in the analysis of the main features of each model.

An important aspect of our research is the utility of the age-matching protocol. This step is critical because age is a powerful confounder that can mask or exaggerate the importance of other risk factors in model training. Removing the influence of age allows the model to better capture the contributions of genetic, clinical, and lifestyle features that are independently associated with disease risk. Although we used the age of diagnosis as an alignment metric, this definition is somewhat arbitrary, as AD diagnoses are often influenced by health insurance policy, available support, and other factors rather than strict clinical determinants. Most models identify age as the primary factor, followed by sex and ethnicity (e.g., [55]). In contrast, we explicitly modeled each sex as a separate group to fully address the partition of key factors by sex (Fig. 4B). To minimize statistical bias and ensure robust interpretations, we routinely applied down-sampling techniques paired with bootstrapping to avoid misinterpretation due to potentially small sample sizes and imbalanced groups. We further leveraged AI technology to model complex interactions, which are often nonlinear and challenging to test directly due to the rapid growth of possible combinations.
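An exact age-matching step of the kind described above can be sketched as drawing, for each case, an unused control of the same age at diagnosis. The records below are synthetic placeholders; the real protocol operates on UKB participant data:

```python
import random
from collections import defaultdict

random.seed(1)

# Illustrative records (participant_id, age_at_diagnosis); real data come from the UKB
cases = [(i, 65 + i % 15) for i in range(50)]
controls = [(1000 + i, 55 + i % 30) for i in range(500)]

def age_match(cases, controls, ratio=1):
    """For each case, draw `ratio` unused controls with the same age (exact matching)."""
    pool = defaultdict(list)
    for rec in controls:
        pool[rec[1]].append(rec)
    matched = []
    for _, age in cases:
        bucket = pool[age]
        for _ in range(min(ratio, len(bucket))):
            matched.append(bucket.pop(random.randrange(len(bucket))))
    return matched

matched = age_match(cases, controls)
# The matched control group now has the same age distribution as the cases,
# so age can no longer dominate the learned feature importances
print(len(matched))
```

With a sufficiently deep control pool, the matched set reproduces the case age distribution exactly, which is what removes age as a confounder before training.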

We found that the strongest (and almost exclusive) genetic contribution to the models is based on the combination of the APOE alleles. The dominant effect of the two APOE variants is consistent with current knowledge, where the ε4 allele (i.e., rs429358_C, rs7412_C) is associated with a significantly increased risk of AD. The small number of individuals carrying two ε4 alleles (homozygotes) can have an 8- to 12-fold increased risk compared to people without ε4. In contrast, the ε2 allele (i.e., rs429358_T, rs7412_T) appears to be protective. These two extreme combinations are captured by our models. The contribution of the few additional selected SNPs was negligible [56]. An unresolved issue concerns the limited contribution of results from genetic association studies (excluding APOE). It is worth noting that no consensus has been established regarding specific genes or variants associated with AD and dementia. The pre-calculated polygenic risk scores (PRS) from the UKB contributed negligibly to the genetic models. Nevertheless, a benchmark for selecting optimal PRS for modeling AD emphasized the sensitivity of genetic risk stratification to careful methodological choices [57]. While only the APOE gene was unequivocally associated with AD risk, recent approaches to AD genetics based on rare variants from whole exomes and whole genomes were not included in this study.
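The ε-allele combinations discussed above follow directly from the two SNPs: each haplotype of (rs429358, rs7412) maps to one ε allele. A minimal sketch, assuming phased haplotypes (unphased genotype calls require a phase-inference step, which is unambiguous except for the very rare C/T haplotype):

```python
# Haplotype-to-allele mapping for the two APOE SNPs, ordered (rs429358, rs7412):
# (T, T) -> e2 (protective), (T, C) -> e3 (neutral), (C, C) -> e4 (risk)
APOE_HAPLOTYPE = {("T", "T"): "e2", ("T", "C"): "e3", ("C", "C"): "e4"}

def apoe_genotype(hap1, hap2):
    """Two phased haplotypes -> APOE genotype label, e.g. 'e3/e4'."""
    return "/".join(sorted(APOE_HAPLOTYPE[h] for h in (hap1, hap2)))

print(apoe_genotype(("C", "C"), ("C", "C")))  # e4/e4 homozygote: highest AD risk
print(apoe_genotype(("T", "T"), ("T", "T")))  # e2/e2 homozygote: protective
print(apoe_genotype(("T", "C"), ("C", "C")))  # e3/e4 carrier
```

The two extreme genotypes, e4/e4 and e2/e2, are exactly the combinations the models capture through the rs429358 and rs7412 features.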

This research offers significant clinical utility by providing a scalable, non-invasive approach to dementia risk prediction that can be integrated into real-world healthcare systems. The models enable earlier, personalized interventions for AD and VaD, particularly valuable in primary care and resource-limited settings where access to neuroimaging or CSF-based diagnostics is limited. By combining genetic, health, and lifestyle data from a large cohort and standard medical records, the study demonstrates how ML can identify high-risk individuals before clinical diagnosis, creating a critical window for preventive action. The inclusion of modifiable risk factors (e.g., comorbidities, physical activity, education) offers clinicians practical targets for risk reduction. Furthermore, the model’s superior performance in VaD and detection of sex-specific patterns support the need for refined, subtype- and sex-specific care strategies. Overall, these predictive tools advance precision medicine by improving early risk stratification, optimizing resource use, and guiding tailored prevention and counseling strategies.

There are several limitations in this study that need to be addressed. Primarily, the relatively low prevalence of AD patients suggests a selection bias during recruitment. The underrepresentation of diseases such as cancer in the UKB relative to the general population has been confirmed [58]. To account for such bias, further validation with independent datasets is necessary. Testing alternative biobanks (e.g., All of Us) will be pursued in the future to assess our models' transferability. Another obvious limitation concerns the cohort used. We trained the models on UKB data from 2019; by 2023, ~ 1500 additional patients had been diagnosed with AD in the window since recruitment. We therefore anticipate a relatively large number of false negatives that potentially reduced the quality of our models. Our findings underscore the significant role of education in lowering the risk of AD, likely through its association with healthier lifestyles and improved health management. Higher educational attainment has been linked to delayed onset of symptoms and protective changes in brain structure [59]. However, this protective effect was notably reduced in VaD, particularly in cases classified as uniquely VaD, suggesting distinct underlying etiologies. For example, strong lipidomic signatures observed in VaD were absent in AD, reinforcing the specificity of certain risk factors.