Introduction

School dropout is a complex economic, social, and individual problem that has long-term consequences including difficulty engaging in the labor market, income and health deprivation, and increased risk of engagement in criminal activity (CBS, 2020; De Ridder et al., 2012). While substantial success in reducing dropout has been achieved in recent years, early identification and prevention of school dropout remains high on the agenda in most high-income countries (OECD, 2020). Freudenberg and Ruglis (2007) argued that school dropout should be considered a major public health problem, given the convincing evidence of its negative consequences on health. They argued that school dropout received insufficient attention at youth health care services and they emphasized the need for collaborative prevention efforts by schools and health care professionals (Freudenberg & Ruglis, 2007).

Overview of Known Predictors of Dropout

A growing body of research in the USA and Europe suggests that dropping out is the final stage in a dynamic and cumulative process of disengagement from school (Bowers et al., 2012; Hauser & Koenig, 2011). Known predictors of dropout can be arbitrarily divided into individual, family, and neighborhood environment factors as well as school-related factors.

Mental health issues constitute a strong predictor of school dropout at the individual level, particularly in males (Hjorth et al., 2016). Exhibiting aggressive behavior and associating with aggressive peers have been identified as strong predictors of leaving school without a diploma (Fortin et al., 2010; Fortin, 2006). Male students have been shown to drop out more frequently than females, with aggressive behaviors being suggested as one of the main reasons (Fortin et al., 2010; Fortin, 2006). Mental health issues in adolescence often go underdiagnosed. This emphasizes the need for accessible screening tools that signal emerging problems and guide interventions to improve educational and social outcomes (Bowman et al., 2020).

Children from single-parent or divorced families have, on average, less academic success (Astone, 1991; Heard, 2007; Theunissen et al., 2015) and a higher risk of poverty (Amato & Keith, 1991; Barton, 2006), both of which are, in turn, risk factors for dropout (Kane Salvador, 2012). Economic hardship assessed through different measures (free school meal status in the USA, low family income, or living in a low-income neighborhood) has been repeatedly and consistently associated with school dropout (Brooks-Gunn & Duncan, 1997; De Witte et al., 2013). However, household or family poverty cannot be considered the same as living in a low-income or impoverished community. Family poverty is associated with a number of adverse conditions such as transient living arrangements and homelessness, food insecurity, parents who are in jail or absent, violence, drug abuse and other problems, known as “toxic stressors” because they are severe, sustained and not buffered by supportive relationships (Shonkoff et al., 2012). Community poverty refers to neighborhoods with high levels of joblessness, family instability, substance abuse, poverty, welfare dependency and crime (Sampson, 2002). Other family circumstances, such as the birth of a sibling in infancy, have also been associated with school dropout (Theunissen et al., 2015).

School-related factors have been extensively studied in relation to school dropout. School absenteeism is one of the most prominent correlates of dropping out of school (Kane Salvador, 2012; Theunissen et al., 2015). School absenteeism can be a manifestation of diverse underlying factors including coming from a poor family, an unhealthy school environment, exposure to violence, or difficulties coping with the learning workload. Other school-related factors predictive of future dropout include failure of a year level or of one or more courses within the year, unsatisfactory behavior scores, and poor average academic scores (Bowers et al., 2012; Kane Salvador, 2012). Furthermore, perceptions of school as boring, a lack of interest in studies, not being able to get along with teachers, failing one or more courses, and an overall lack of intention to complete school were among the most frequent correlates of school dropout in several survey studies in the USA (Archambault et al., 2009; Bridgeland et al., 2006).

The complex dynamics of these personal attributes, behavior and contextual factors have been captured by the Social Cognitive Theory (Bandura, 1989). This theory maintains that a person is a product of their environment and, in turn, forms their environment through their interactions with it. This theory provides a useful lens through which school dropout may be better understood. When recognized early, some factors can be amended (e.g. additional support with educational load or addressing reasons for absenteeism) or the negative impact of non-modifiable factors can be reduced (e.g. engaging the child with issues at home into structured social and sporting activities) in order to diminish their detrimental impact on individual behavior and to support better choices. This underlines the value of efforts for early prediction of school dropout.

The need for prediction models, and for technological advances to develop these models, is acknowledged in recent literature in this field (Chung & Leeb, 2019; Robison et al., 2017). Recent examples using machine learning techniques have demonstrated the power of artificial intelligence to predict dropout from school (Chung & Leeb, 2019; Márquez-Vera et al., 2015) or undergraduate studies (Del Bonifro et al., 2020). The majority of published models adopted a school perspective, in the sense that they mainly used academic achievement, attendance and behavior data collected by schools (Bowers et al., 2012; Chung & Leeb, 2019; Márquez-Vera et al., 2015).

Research Setting: Education System, Youth Health Care Services and School Dropout in the Netherlands

The Dutch education system comprises pre-school education (grades 1 and 2), primary school (grades 3–8) and secondary school, with each section taking between 4 and 6 years to complete depending on the level. A peculiarity of the Dutch education system is an obligatory test at the end of primary school, called the CITO test. The CITO score, along with the advice of the school, contributes heavily to the level of secondary education the child is admitted to. Children with the highest scores will continue at secondary scientific education (in Dutch: VWO), children with medium scores will be offered secondary professional education (in Dutch: HAVO), and all other children will be streamed into secondary vocational education (in Dutch: VMBO). In this way, secondary school levels can be seen as an indirect proxy for primary school grades and overall performance. When it is not clear which secondary education level best suits a student, or if the parents insist their child can handle a higher level of education than what was recommended to them, there is an orientation year (referred to as a bridging year) for both VMBO/HAVO and HAVO/VWO to determine this. Under certain strict conditions, the education system allows moving upwards between the levels of secondary school as a safety net to diminish the negative effects of a child’s immaturity or lack of self-knowledge at the time when a decision is first made. Aside from moving up, there is also a system in place whereby students can be moved down to a lower level of education when they experience problems attaining the level. The risk of not graduating with a diploma is highest for the lowest level of secondary school (VMBO) (Allen & Meng, 2010). After a significant decline in school dropout in the Netherlands in the years preceding 2015 (from 2.8 to 1.7%), the trend reversed between 2015 and 2019, with dropout rising again (towards 2%), indicating that the problem of school dropout requires sustained attention (www.onderwijsincijfers.nl).

An important question is which organizations, besides schools, should play a role in the prevention of school dropout, and at what life stage they should intervene. Nowadays, Dutch schools can identify the risk of school dropout with the family (where possible) and/or contact social services supported by the municipalities. The student may then be provided with supervision and support tailored to their needs. However, the weakness of this approach is that help is being sought at a time when problems already exist, which may be too late. Moreover, early signs of risk of dropout may be missed with this approach. An earlier risk stratification of the student population, for example at the transition between primary school and secondary school, could potentially allow more effective efforts directed towards strengthening engagement with the school. The Dutch Youth Health Care (YHC), a municipal service that monitors health and social development of children and adolescents in the Dutch population, could play a role in this process. YHC has regular contact with children from birth through to the age of 18, with this contact being most frequent in the first 4 years, reducing to several planned and on-demand appointments in the primary and secondary school years. In addition, YHC collaborates with schools and families on topics related to health and upbringing. YHC already successfully collaborates with schools in the case of longer-term absence related to illness, in a program that focuses on understanding and addressing the underlying reasons for the absence in order to facilitate and promote a return to school (Vanneste et al., 2016). YHC builds a long-term relationship with the family from as early as pre-birth (ante-natal visits) and has the potential to contribute important knowledge and context to support schools and families in preventing early school dropout.
There is therefore exciting potential to expand the YHC role in school success if we can provide sufficient tools and knowledge on which children are potentially at risk, so that efforts can be targeted to those most in need.

Strengths and Difficulties Questionnaire (SDQ) and School Dropout

In the Netherlands, every child has a digital YHC dossier in which health-related information is collected by YHC professionals during the first 18 years of life. It collects data on growth, health problems and referrals to other professionals, but also incorporates several standard tools to monitor physical and socio-emotional development relevant to each age. One of these instruments is the Strengths and Difficulties Questionnaire (SDQ) (Goodman, 1997). The SDQ is a widely used screening instrument for the initial assessment of psychosocial problems (Goodman, 2001; Goodman et al., 2002). The SDQ has several versions appropriate for different ages, and these have previously been validated in the Dutch population (Mieloo et al., 2014; Stone et al., 2015; Vogels et al., 2009). Using the SDQ, an increased risk of emotional problems, behavioral problems, hyperactivity, peer problems and lack of pro-social behavior can be predicted. Only a handful of recent studies have explored the relationship between SDQ and the likelihood of not completing compulsory secondary school (Lindhardt et al., 2022; Sagatun et al., 2014). Lindhardt et al. (2022) looked at the relationship between SDQ of secondary school students and school dropout at 2.5 years follow-up. They reported that SDQ is predictive of dropout in students with and without mental health disorders, indicating that markers such as the SDQ might contribute to the identification of multiple vulnerable adolescent groups. Sagatun et al. (2014) divided the SDQ scales into external problems (conduct problems and hyperactivity-inattention) and internal problems (emotional symptoms and peer problems) and found that external problems were predictive of school dropout, with approximately one third of this relationship being mediated by school grades. Neither of these studies assessed the overall predictive performance of their model.
In the present study, we hypothesized that (1) a high SDQ total score and/or a high score on the individual subscales at ages 10 and 14 are predictive of school dropout by age 17, and (2) a prediction model with at least moderate performance can be computed using SDQ and individual and family background characteristics known to YHC. In the Dutch region of South Limburg, the SDQ has been consistently used as a screening instrument for nearly two decades. If a reasonable prediction model of school dropout can be developed based on the data available to YHC, it might help to steer efforts to improve a child’s chances to graduate and enter the labor market with a qualification. Such a model would empower YHC to take a proactive role in the prevention of school dropout in partnership with schools and families. Therefore, the aim of this study was to compute and assess the performance of a prediction model using data on socio-emotional development and socio-economic background routinely available to Youth Health Care services during consultations at 10 and 14 years old.

Methods

Study Design

This study was a longitudinal retrospective cohort study conducted in the South Limburg region of the Netherlands. Data on SDQ scores were collected during regular assessments by the YHC service of children aged 10 and 14. Information on each child and family’s socio-economic status (at age 10 and 14), secondary education level (at age 14) and school dropout status (age 17) was obtained from national population registries through Statistics Netherlands. The data linkage was executed using the individual number of municipal administration (GBA) or, where this number was not available, a combination of birth date, gender, and postal code of the child. The data linkage was performed by a trusted third party (Statistics Netherlands), and the data were pseudonymized before they were made accessible to the researchers. The study was approved by the medical ethical committee of Maastricht University (METC azM/UM 2020-1573).

Participants

Children born between 1996 and 2001 who lived and attended school (excluding ‘special education’ for children with severe physical and intellectual disabilities) in South Limburg at age ten and/or fourteen were included.

Study Outcome

School dropout was defined as having left school at least once without a diploma by age 17. These data are collected annually by a Dutch governmental organization responsible for the organization and funding of the education system (DUO). Each student enrolled in the education system in the preceding year was categorized as (1) enrolled in study, (2) graduated, or (3) left without diploma on October 1st of the next academic year. A binary variable (‘enrolled or graduated’ vs. ‘dropped out’) was created. A cut-off age of 17 was chosen to ensure that data on the outcome were available for children from each of the 6 birth cohorts (1996–2001; the most recent data available on school enrolment at the time of analysis was 2018). While it is possible that someone classified as ‘dropped out’ in this way can still enroll in another study the next year, having dropped out at least once can be seen as a proxy for social vulnerability and is an indicator of potential difficulties in the future (Rumberger & Lamb, 2003).

Data on socio-emotional Development (Strengths and Difficulties Questionnaire)

The SDQ is a questionnaire consisting of 25 Likert scale questions. Each question has the following response options: “Not true”, “Somewhat true” and “Certainly true”. The 25 questions are split into five subscales (each 0–10): emotional problems, behavioral problems, hyperactivity, peer problems and pro-social behaviors. The total SDQ score (range 0–40, higher scores indicate greater difficulty) is the sum of the subscales, excluding the pro-social behaviors subscale (Goodman, 1997, 2001). SDQ total score and subscales were modelled as continuous scores, as well as using the developer-proposed cut-points to categorize results into ‘normal’, ‘borderline’ and ‘high’ SDQ (Goodman, 1997, 2001; Theunissen et al., 2016). The questionnaire was completed by the parents of children aged 10, and by the children themselves at age 14.
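The scoring rule described above can be illustrated with a minimal sketch. The function names, the dictionary keys and, in particular, the cut-points passed to `sdq_category` are placeholders chosen for illustration; they are not the cut-points used in this study, which follow Goodman (1997, 2001) and Theunissen et al. (2016).

```python
from typing import Dict, Tuple

# The four difficulty subscales that enter the total score (each scored 0-10).
# The pro-social subscale is deliberately left out of the total.
DIFFICULTY_SUBSCALES = ("emotional", "behavioral", "hyperactivity", "peer")


def sdq_total(subscales: Dict[str, int]) -> int:
    """Sum the four difficulty subscales, giving a total in the range 0-40."""
    for name in DIFFICULTY_SUBSCALES:
        score = subscales[name]
        if not 0 <= score <= 10:
            raise ValueError(f"{name} score {score} is outside the 0-10 range")
    return sum(subscales[name] for name in DIFFICULTY_SUBSCALES)


def sdq_category(total: int, cut_points: Tuple[int, int]) -> str:
    """Map a total score to 'normal', 'borderline' or 'high'.

    cut_points = (lowest borderline score, lowest high score). The values are
    supplied by the caller; the ones used below are hypothetical placeholders.
    """
    borderline_from, high_from = cut_points
    if total >= high_from:
        return "high"
    if total >= borderline_from:
        return "borderline"
    return "normal"


# Hypothetical respondent; 'prosocial' is present but ignored in the total.
example = {"emotional": 3, "behavioral": 2, "hyperactivity": 6, "peer": 1, "prosocial": 9}
total = sdq_total(example)
category = sdq_category(total, cut_points=(11, 15))  # placeholder cut-points
```

Keeping the cut-points as an argument reflects that different versions of the questionnaire (parent-reported versus self-reported) may use different thresholds.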

Socio-economic and School Variables

Each child’s gender and immigration background (Dutch, first- and second-generation immigrant) were obtained from the municipal population registry. Parental education level was defined as the highest education level (categorized as low, medium or high) achieved by either parent. A binary variable ‘at least one parent receiving a state welfare subsidy’ was created using data from the socio-economic registry, and was coded positive if at least one of the parents had an unemployment or long-term welfare subsidy as their main source of income when the child was aged 10 or 14. ‘Parents registered at the same address’ vs. ‘not registered at the same address’ when their child was aged 10 and 14 was used as a proxy to capture single-parent households.

Secondary education level at age 14 was obtained from the national education registry. Secondary education level is assigned at age 12 based on the results of the standardized test (CITO) and primary school advice. The secondary education level was categorized as (1) bridging year (for a small number of children that could not yet be allocated to one of the levels), (2) secondary vocational education (VMBO), (3) secondary professional education (HAVO), or (4) secondary scientific education (VWO).

Statistical Analyses

Statistical Model

Supervised machine learning techniques were used to classify students as dropped out versus not dropped out. Dropout was modelled as a function of independent variables in a generalized linear mixed model with logit link. Random intercepts for schools were tested to account for possible within-school correlation and an Intra-Class Correlation (ICC) coefficient was computed. Interactions between SDQ and secondary school level were tested in the model with predictors at age 14. Additionally, interactions with gender were explored in both models given that SDQ total scores or subscores might have a differential impact on the outcome in boys and girls (Fortin et al., 2010). A backwards selection procedure was used for variable selection (p < 0.10).
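As a rough illustration of how a random-intercept logit model turns covariates into a predicted probability, the sketch below applies the inverse logit to a linear predictor plus a school-specific intercept. It also shows one common way to express the ICC for a logistic mixed model, the latent-variable formulation in which the individual-level residual variance is fixed at π²/3. Whether this is the ICC definition used in the present analysis is an assumption, and all function names are illustrative.

```python
import math


def inverse_logit(eta: float) -> float:
    """Convert a value on the log-odds scale to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def predicted_probability(fixed_part: float, school_intercept: float = 0.0) -> float:
    """Predicted dropout probability for one student.

    fixed_part       -- intercept plus the sum of coefficient * covariate value
    school_intercept -- random intercept of the student's school (0 = average school)
    """
    return inverse_logit(fixed_part + school_intercept)


def latent_icc(school_variance: float) -> float:
    """ICC under the latent-variable formulation (an assumption, see lead-in).

    The logistic distribution has variance pi^2 / 3, which plays the role of
    the individual-level residual variance.
    """
    residual_variance = math.pi ** 2 / 3
    return school_variance / (school_variance + residual_variance)
```

An ICC close to zero indicates that little of the variation in dropout risk sits between schools, in which case the random intercept adds little to the model.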

Assessment of Predictive Performance of Models

A number of performance metrics were used to assess the model fit, namely Hosmer and Lemeshow test for goodness of fit, area under the curve (AUC) based on Receiver Operating Characteristics (ROC) curve, sensitivity and specificity.

Hosmer and Lemeshow Test was used to assess goodness of fit (Hosmer & Lemeshow, 1980). This test assesses whether or not the observed event rates match expected event rates in subgroups of the model population. Models for which expected and observed event rates in subgroups are similar are considered well calibrated. Calibration plots were used to graphically explore the model fit (Nattino, 2018).
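The grouping logic behind the Hosmer and Lemeshow test can be sketched as follows. This is a simplified illustration that computes only the test statistic from observed outcomes and predicted probabilities; it is not the Stata routine used in the analysis, and the function name and the equal-size grouping rule are assumptions. The p-value would be obtained by comparing the statistic with a chi-square distribution, conventionally with (number of groups − 2) degrees of freedom.

```python
from typing import Sequence, Tuple


def hosmer_lemeshow(y: Sequence[int], p: Sequence[float], groups: int = 10) -> Tuple[float, int]:
    """Return (statistic, number of non-empty groups).

    y -- observed outcomes (1 = dropped out, 0 = did not drop out)
    p -- predicted probabilities from the fitted model
    """
    # Sort individuals by predicted risk and cut into roughly equal-size groups.
    order = sorted(range(len(y)), key=lambda i: p[i])
    n = len(order)
    statistic = 0.0
    used = 0
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        if not idx:
            continue
        observed = sum(y[i] for i in idx)   # observed events in the group
        expected = sum(p[i] for i in idx)   # expected events in the group
        size = len(idx)
        mean_p = expected / size
        denominator = size * mean_p * (1.0 - mean_p)
        if denominator > 0:
            statistic += (observed - expected) ** 2 / denominator
        used += 1
    return statistic, used
```

A statistic near zero means that observed and expected event counts agree closely in every risk group, which corresponds to good calibration.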

A ROC curve was plotted using the predicted probability of school dropout and the observed dropout. The area under the ROC curve quantified the discriminative ability of the model (i.e. how well the model can differentiate between those with high risk of the outcome and those with low risk of the outcome) (Hanley & McNeil, 1982; Melo, 2013). In the ROC space, the point (0, 1) (the top left corner) represents the model with the best performance, because such a model detects all positive cases (i.e., a true positive rate of 1) and never misclassifies negative cases as positive cases (i.e., a false positive rate of 0). Each predictive model creates one curve in ROC space, and each point on the curve represents the performance of the model at a specific classification threshold. Therefore, a model whose ROC curve lies closer to the point (0, 1) is a better model. To quantify the closeness of the ROC curve of a model to the point (0, 1), the area under the ROC curve (AUC) is calculated. Because the ROC graph ranges from 0 to 1 on both the X and Y axes, the perfect model has an AUC of 1, and the model with an AUC closest to 1 is considered the best model. What counts as an adequate AUC depends on the context (e.g., severity of the outcome, nature and costs of available preventive actions, and likelihood of preventive action itself). Mandrekar (2010) proposed that AUC values of 0.7–0.8, 0.8–0.9, and 0.9–1.0 be regarded as acceptable, excellent, and outstanding, respectively (Hosmer & Lemeshow, 2000; Mandrekar, 2010). Random k-fold cross-validation of the AUC was also performed (k = 10) (Luque-Fernandez et al., 2017). In k-fold cross-validation the dataset is divided into k sub datasets. The model is fitted on k-1 sub datasets, and the remaining sub dataset, called the validation dataset, is used to test the performance of the model. The process is repeated k times by alternating the validation dataset, and afterwards the mean performance measures are calculated.
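To make the two ideas in this paragraph concrete, the sketch below computes the AUC directly from its rank interpretation (the proportion of dropout/non-dropout pairs in which the dropout case receives the higher predicted risk, with ties counted as one half) and generates random k-fold splits. It is an illustrative stand-in for the Stata procedures used in the analysis; the function names and the seed are assumptions.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def auc(y: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of cases and non-cases."""
    positives = [s for s, label in zip(scores, y) if label == 1]
    negatives = [s for s, label in zip(scores, y) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one case and one non-case")
    wins = 0.0
    for sp in positives:
        for sn in negatives:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5  # ties count as half a concordant pair
    return wins / (len(positives) * len(negatives))


def k_fold_splits(n: int, k: int = 10, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training indices, validation indices) for each of the k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)          # random assignment to folds
    folds = [indices[i::k] for i in range(k)]     # k roughly equal-size folds
    for held_out in folds:
        held = set(held_out)
        training = [i for i in indices if i not in held]
        yield training, held_out
```

In a cross-validated analysis the model would be refitted on each training set, `auc` would be evaluated on the corresponding validation set, and the k values would be averaged. With a rare outcome, a fold may contain no dropout cases, in which case `auc` raises an error and the fold assignment needs to be stratified.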

Average adjusted predictive margins (or marginal probabilities) were calculated to assist the interpretation of the results. Marginal probabilities were calculated from fitted models to provide an intuitive explanation to observed odds ratios. Marginal probabilities represent the probability that a student with certain values on independent variables will drop out of school. For this study, we calculated marginal probabilities of dropout for different levels of SDQ for a student who is male, whose parents have low education and are not living together and are receiving welfare subsidy. This calculation illustrated the amount of information added by SDQ score for otherwise disadvantaged students.

In addition to sensitivity and specificity, positive and negative predictive values (PPV and NPV, respectively) were calculated. PPV was computed as true positives / positive calls by the model and NPV as true negatives / negative calls by the model. PPV and NPV represent the probability that subjects with a positive (negative) screening result truly have (do not have) the outcome, here school dropout.
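The relationship between these four measures can be shown with a short sketch that dichotomizes predicted probabilities at a threshold and tabulates the result. The function name, the example data and the threshold are hypothetical and serve only to illustrate the definitions given above.

```python
import math
from typing import Dict, Sequence


def screening_metrics(y: Sequence[int], p: Sequence[float], threshold: float) -> Dict[str, float]:
    """Sensitivity, specificity, PPV and NPV at a given probability threshold."""
    tp = fp = tn = fn = 0
    for outcome, prob in zip(y, p):
        positive_call = prob >= threshold
        if positive_call and outcome == 1:
            tp += 1
        elif positive_call and outcome == 0:
            fp += 1
        elif not positive_call and outcome == 0:
            tn += 1
        else:
            fn += 1

    def ratio(numerator: int, denominator: int) -> float:
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else math.nan

    return {
        "sensitivity": ratio(tp, tp + fn),  # true positives / all who dropped out
        "specificity": ratio(tn, tn + fp),  # true negatives / all who did not
        "ppv": ratio(tp, tp + fp),          # true positives / positive calls
        "npv": ratio(tn, tn + fn),          # true negatives / negative calls
    }


# Hypothetical data: five students, threshold chosen for illustration only.
metrics = screening_metrics([1, 1, 0, 0, 0], [0.30, 0.05, 0.25, 0.10, 0.02], threshold=0.20)
```

Unlike sensitivity and specificity, PPV and NPV depend on how common the outcome is, so for a rare outcome such as dropout a high NPV can be reached even by a weakly informative model.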

A series of sensitivity analyses were performed with a number of categorizations of SDQ scores (Theunissen et al., 2016), as well as quintiles and deciles of the distribution. Further, age at the moment of completing the questionnaire (some children were younger or older than 10 or 14 during the measurement) was factored into the model and model sensitivity to this factor was explored.

Analyses were performed in Stata 16 (StataCorp, 2019). Statistical significance was assumed at α = 5%.

Results

Descriptive Statistics

In total, 11,589 and 18,955 children born between 1996 and 2001 completed the SDQ questionnaire at age 10 and 14, respectively (Table 1). In both groups, approximately 51% were male, 9% came from households where at least one of the parents received a welfare subsidy as their main source of income, and 11% of children came from families where one or both parents had low education. In 20% of children the education of parents could not be obtained due to missing data in the national registry. Among children measured at age 10, 22% of parents were not living together, and this proportion was slightly higher at age 14 (25%). Among secondary school children (age 14), 934 (5%) were in the bridging year, 7,772 (40%) were enrolled in secondary vocational education (VMBO), 4,705 (24%) in secondary professional education (HAVO), and 5,731 (30%) in secondary scientific education (VWO). Of note, 74% of ‘bridging year’ students were registered in secondary vocational education (VMBO) 2 years after measurement. Mean (SD) SDQ total score was 5.81 (5.11) at age 10 and 9.34 (5.01) at age 14.

Table 1 Characteristics of the study population

To explore the coverage achieved by YHC (in terms of response to the questionnaire that is part of the routine assessment), the total number of children born between 1999 and 2001 and living and studying in South Limburg was compared to the number of children seen by YHC services (data before 1999 were not available). On average, the questionnaire was completed for 63% of 10-year-old children. A similar analysis was performed for age 14 (for birth cohorts 2000–2001), resulting in 58% and 26% who were seen by YHC. A comparison of dropout rates at age 17 among children that completed and did not complete the SDQ questionnaire revealed that non-completers had higher average dropout rates (4.2 vs. 6.6% and 3.8 vs. 5.7% for ages 10 and 14 respectively).

Predictors of School Dropout at age 10

In the model with factors measured at age 10, SDQ total score was a significant predictor of future school dropout with OR 1.07 [1.05;1.09] for each additional point of SDQ score. Boys were more likely to drop out than girls (OR 1.37 [1.11;1.69]). Children of parents with low education were more than twice as likely (OR 2.37 [1.72;3.28]) to drop out compared to children of highly educated parents (Table 2). Children of parents whose education could not be obtained from the registry (approx. 20% of all children) were not statistically significantly different from children whose parents had high or middle education, indicating no specific pattern in the missing data in relation to the outcome. Immigration background did not show a significant association with the outcome, nor did it confound the relationship between other factors and the outcome. The interaction between SDQ and gender was statistically significant but not relevant after stratification.
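Because the odds ratio for SDQ is expressed per point, its implication for larger score differences follows from compounding on the odds scale, assuming the log-linear effect implied by the logistic model and holding the other covariates constant. The sketch below illustrates this arithmetic; the function names are illustrative, and any baseline probability passed to `apply_odds_ratio` is a hypothetical input rather than a study result.

```python
def odds_ratio_for_difference(or_per_point: float, points: float) -> float:
    """Odds ratio implied by a difference of `points` on the SDQ total score."""
    return or_per_point ** points


def apply_odds_ratio(baseline_probability: float, odds_ratio: float) -> float:
    """Shift a baseline probability by an odds ratio and return a probability."""
    baseline_odds = baseline_probability / (1.0 - baseline_probability)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1.0 + new_odds)


# A 10-point difference at OR 1.07 per point corresponds to roughly a
# doubling of the odds of dropout (1.07 ** 10 is about 1.97).
or_ten_points = odds_ratio_for_difference(1.07, 10)
```

Odds ratios approximate risk ratios only when the outcome is rare, which is why converting back to probabilities with `apply_odds_ratio` gives a more intuitive picture for higher-risk groups.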

Table 2 Mixed effects logistic regression model

Predictive Performance of the Model with Predictors at age 10

The Hosmer and Lemeshow test indicated suboptimal fit (p = 0.03), with groups with high observed counts being poorly approximated by the predicted model counts, and the calibration plot revealed borderline fit (Figures S1 and S2, left). The AUC value of this model was 0.697 (Fig. 1), meaning that in approximately 70% of pairs consisting of one child who dropped out and one who did not, the model assigned the higher predicted risk to the child who dropped out. The sensitivity and specificity trade-off is presented in Figure S3. Positive and negative predictive values for the two potential cut-offs (0.07 and 0.20, with range being 0.01–0.44) are presented in Table S5. SDQ led to a slight improvement in the predictive properties of the model (AUC without SDQ = 0.66). Further calculations of marginal probabilities of dropout for children with a number of risk factors (male, parents with low education, parents not living together, at least one parent on welfare subsidy) revealed that accounting for SDQ score resulted in substantially more differentiated predictions, which ranged from 6 to 16% for children with the lowest and highest SDQ scores respectively (Table S4). Models with SDQ subscales as predictors had inferior performance compared to models with total score (results not shown).

Fig. 1 ROC curves for models with predictors at age 10 (all children, left) and age 14 (children attending secondary vocational education, right)

Predictors of School Dropout at age 14

A significant interaction was detected between SDQ and secondary school level (p < 0.01); therefore, analyses were stratified by secondary school level. The patterns differed markedly by school level. SDQ remained a predictor of dropout among the secondary vocational education (VMBO) and secondary professional education (HAVO) students and was not relevant for the bridging year or secondary scientific education (VWO). A higher risk of dropout for male versus female students was only observed for students at the lowest secondary level (secondary vocational education, VMBO), and parental disadvantage (low education and single-parent households) was only predictive of school dropout among secondary vocational education (VMBO) students and in the bridging year. Parents’ reliance on welfare subsidies remained an important factor across all levels except the small group of children attending the bridging year (Table 3). Immigration background did not show a significant association with the outcome, nor did it confound the relationship between other factors and the outcome. The interaction between SDQ and gender was not statistically significant.

Table 3 Mixed effects logistic regression model stratified by secondary education level

Predictive Performance of the Model with Predictors at age 14

AUC values for models in the bridging year, secondary professional and secondary scientific education were substantially below the relevant cut-off level (< 0.60), indicating poor predictive performance. Only the model in the subsample of students in secondary vocational education (VMBO) showed moderate predictive performance. The ROC curve of the model among secondary vocational education (VMBO) students is presented in Fig. 1 (AUC value 0.69). The Hosmer and Lemeshow test demonstrated good fit (p-value = 0.65), and the calibration plot raised no concerns (Figures S1 and S2, left). AUC values of models in the other strata were between 0.54 and 0.57, indicating very poor predictive performance. The sensitivity and specificity trade-off is presented in Figure S3. Positive and negative predictive values for the two potential cut-offs (0.07 and 0.20, with range being 0.02–0.35) are presented in Table S5. Adding SDQ to the prediction model yielded a slight improvement over the model with only gender and socio-economic family risk factors (AUC 0.66). Marginal probabilities for a socially disadvantaged VMBO student could be refined to a range of 10–16% (as opposed to an average of 13%) when accounting for the lowest and highest decile of SDQ, respectively (Table S4). Similar to models with predictors at age 10, models with SDQ subscales as predictors resulted in inferior performance compared to models using total score (results not shown).

Discussion

The objective of this study was to develop and assess a prediction model using child and family data routinely collected by YHC. As hypothesized, higher total SDQ score at age 10 was predictive of school dropout, with an OR of 1.07 for each additional point of the SDQ. At age 14, the relationship between borderline or high SDQ and dropout was only observed among the students attending secondary vocational education (VMBO) and not at higher academic secondary school levels. Selection for academic level occurs during the transition between primary and secondary school, with students who report stronger psychosocial scores and whose parents are in higher social and economic positions clustered in higher levels and vice versa. It is possible that unmeasured factors like a favorable peer environment, availability of social support or opportunities for intellectual challenges might work as protective factors to compensate for some of the psychosocial problems detected by SDQ; however, these data were not available to be included in the analysis.

Combining total SDQ score with family socio-economic factors, a prediction model with moderate performance (AUC of 0.7) was developed for children attending the lowest level of secondary school at age 14 (VMBO) as well as for all children at age 10. Prediction performance of models using SDQ subscales was always inferior compared to models that were trained using SDQ total score. The strength of these prediction models is that they only include a few health and socio-economic factors that are routinely collected by YHC. It is important to acknowledge that a large number of false positives makes the current models unsuitable for use in implementing and trialing any intensive and/or costly interventions. However, the simplicity of the models and the time window of the prediction suggest that they might be useful as a first step in risk stratification and might justify adding school performance to the topics discussed during a regular consultation between family and YHC. In this risk stratification, ‘high risk’ children can be detected, e.g. children with up to four times higher risk of school dropout than the baseline risk. Importantly, the negative predictive value of the model is not higher than the prior probability, which implies that individual scores below the cut-off value cannot be used to rule out early school dropout in the future.

While both models showed comparable predictive value, the model with predictors at age 10 may be preferred. In general, earlier prevention is desirable, and assessing risk of dropout at age 10 offers a larger window of opportunity to engage with school and family and, where possible, to anticipate the choice of secondary school level. In contrast, age 14 falls within puberty, when interventions to re-engage with school may be more challenging. The majority of students in secondary vocational education will stream into senior vocational programs (MBO), and these students are known to be at higher risk of dropout at and after this transition (Allen & Meng, 2010). Nonetheless, the proposed model allows further differentiation within this already vulnerable group.

Two important known predictors of school dropout were not included in the current models, namely diagnosed mental health problems and school performance. While mental health problems (e.g. attention deficit) are partially captured by the SDQ scores available to YHC professionals and used in the model, a more accurate anamnesis of mental health diagnoses might improve prediction (Fried et al., 2016). School achievement, including grades, is another known strong predictor of future early school leaving (Balkis, 2018; Bowers et al., 2012). It has been hypothesized to mediate the relationship between high SDQ scores and school dropout (Sagatun et al., 2014). At the moment, these data are not routinely available to YHC professionals, and including them in the model is therefore not possible. On the other hand, SDQ and family background appear to capture early signals of a potential future problem reasonably well and will allow early intervention for families with children at risk. Future research should investigate whether SDQ remains an independent predictor of school dropout when school-related factors such as grades or absenteeism are accounted for, or whether it acts only as a proxy for school performance and/or behavior. Future research and interventions aimed at early prevention of school dropout should also include attempts to make reliable data on children’s mental health and school achievement available to YHC at the moment of consultation. This is particularly important for children who have been classified as ‘at risk’ by the initial prediction model.

Interestingly, students from a migrant background were not at higher risk of school dropout when parental socio-economic status was accounted for. This is in agreement with previous research, which indicated that the higher dropout rate among migrant groups was primarily due to socio-economic family disadvantage (Traag, 2012). However, this finding should be interpreted with caution given the rapidly changing migration landscape in recent years. Other factors that were predictive of school dropout in our model, namely male gender, low parental education, reliance on state welfare subsidies as the main source of family income, and living in a single-parent household, have also been identified as predictors of dropout by other studies (Bowers et al., 2012; Gubbels et al., 2019; Traag, 2012).

The findings of this research should be interpreted in view of a few limitations. Firstly, given the young age of the cohort at the time of assessment (the most recent data on school dropout were available for ages 17 through 22), school dropout status was defined at age 17. This is not in line with formal definitions, which use an older age (European Commission, 2020; www.onderwijsincijfers.nl, 2020), so it is possible that some children left school in later years, or vice versa, and this will not have been captured. Nonetheless, our definition may be seen as a proxy for early identification of potential problems with obtaining a secondary/high school level certificate in the future (Rumberger & Lamb, 2003). Secondly, children who did not complete the questionnaire sent by YHC (and thus for whom SDQ was not available) had higher school dropout rates, which implies an underestimation of dropout rates in the study population.

Further, SDQ is completed by teenagers themselves at age 14 (as opposed to parents completing the questionnaire for children aged 10), which may introduce additional response bias in this group if those facing the most behavioral and social challenges choose not to participate. Of note, organizational changes in YHC in 2014–2015 (transition to another digital dossier software) contributed to (temporarily) lower questionnaire completion rates in those years. Finally, the current models did not account for ongoing dropout policies at school level or for the intensity of support already provided at different schools. Nonetheless, the within-school correlation (intra-class correlation) was 7%, suggesting no strong clustering in the data.

Future research involving all relevant stakeholders including schools, families and YHC should evaluate potential barriers and facilitators for the model’s use in practice as a first step in risk stratification. Caution is needed around communication about the goals and procedures of such prevention strategies, to avoid stigmatization around SDQ scores that may otherwise compromise the validity of the data and discount potential benefits.

In conclusion, we have demonstrated that school dropout can be anticipated by the YHC service as early as age 10, based on a few factors including SDQ score and adverse family circumstances. The current models might signal potential problems early and provide an opportunity for timely interventions that put school performance on the agenda in consultations between YHC, parents, children and schools.