1 Introduction

Education stakeholders are currently working within an environment where vast quantities of data can be leveraged to gain a deeper understanding of the educational attainment of learners. A growing pool of data is generated through software with which students, teachers and administrators interact (Kassab et al., 2020), through apps, social networking and the collection of user behaviour on aggregators such as YouTube and Google (De Wit & Broucker, 2017). Moreover, thanks to the Internet of Everything phenomenon, stakeholders in the education domain have access to data in which people, processes, data and things connect to the internet and to each other (Langedijk et al., 2019). These data take on non-traditional formats, capturing language, location, movement, network, image and video information (Lazer et al., 2020). Such non-traditional data sets require cutting-edge analytical techniques in order to be effectively used for learning purposes and to be translated into succinct policy recommendations.

Learning analytics, as an interdisciplinary domain borrowing from statistics, computer science and education (Leitner et al., 2017), exploits this new data-rich landscape to improve the learning process and outcomes of current and future citizens (De Wit & Broucker, 2017). In education, learning analytics is set squarely within the new computational social sciences, which consist of the “development and application of computational methods to complex, typically large-scale human behavioral data” (Lazer et al., 2009). Learning analytics directs these advances towards the creation of actionable information in education. It applies data analytics to the field of education, and it attempts to propose ways to explore, analyse and visualize data from any relevant data source (Vanthienen & De Witte, 2017). An important role of learning analytics is the exploitation of the traces left by students on electronic learning platforms (Greller & Drachsler, 2012). As such, learning analytics allows teachers to maximize the cognitive and non-cognitive education outcomes of students (Long & Siemens, 2011). In an optimal learning environment, one would maximally leverage the potential of students to increase their welfare and performance not only during schooling but also afterwards, across civil society.

As the COVID-19 pandemic induced shifts towards online and home education, the opportunities for data analytics have increased, both in general and, in particular, for mitigating the crisis’ effects on learning outcomes (Maldonado & De Witte, 2021) and on the well-being of students (Iterbeke & De Witte, 2020). The online traces that students leave on electronic learning platforms allow teachers, schools and policy-makers to better target remedial teaching interventions towards the neediest students. The school closures also showed how unequally digital devices are spread among students, with significant groups of disadvantaged students lacking access to basic digital instruments such as stable broadband access and a computer. Similarly, the school closures revealed significant differences between countries in their readiness for online teaching and in the availability of high-quality digital instruction. Still, prompted by the unprecedented crisis, multiple countries made significant investments in educational ICT infrastructure (De Witte & Smet, 2021). If this coincides with improved training of teachers and school managers; an improved integration of educational, administrative and online data sources; and improved accessibility of hands-on software, we expect the domain of learning analytics to flourish further in the coming decades.

The following chapter aims to contribute to this accelerated use of learning analytics by illustrating its potential in multiple educational domains. We first discuss the increasing emergence of large and accessible data sets in education and the associated growth in expertise in educational data collection and analysis. This is sustained by real-time streamed data and increasingly autonomous administrative data sets. Section 16.2 compares the cost-effectiveness of learning analytics to that of costly and unreliable retrospective studies and surveys. Learning analytics may also contribute to improving the quality of currently dispensed education, for example, through fraud detection and student performance prediction. In Sect. 16.3, three tools of growing popularity and potential for learning analytics are presented: Bayesian Additive Regression Trees (BART), Social Network Analysis (SNA) and Natural Language Processing (NLP). These tools permit savvy users to make insightful predictions about student types, performance and the potential of reforms. The brief description of these techniques aims to familiarize practitioners and decision-makers with their potential. Finally, alongside recommendations, technical and non-technical challenges to the implementation and growth of learning analytics, and of empirically based education in general, are discussed. As the growing possibilities of learning analytics raise sensitive questions regarding data usage and linkages, we discuss the related ethical and legal concerns in the concluding section.

2 Potential for Educators and Citizens

2.1 Growing Opportunities for Data-Driven Policies in Education

“Students and teachers are leaving large amounts of digital footprints and traces in various educational apps and learning management platforms, and education administrators register various processes and outcomes in digital administrative systems” (Nouri et al., 2019). In this section, we discuss three trends that create growing opportunities for fostering creative data-driven policies in education: (1) the development of online teaching platforms, (2) software-oriented administrative data collection with links between heterogeneous data sets (Langedijk et al., 2019) and (3) the Internet of Things (Langedijk et al., 2019).

First, consider the online teaching platforms. A prime example is the massive open online course (i.e. MOOC, De Smedt et al., 2017). Institutional MOOC initiatives have been contributing to making high-quality educational material accessible to a wide range of students and to maintaining the prestige of the participating institutions (Dalipi et al., 2018). For adults, MOOC completion has also been associated with increased resilience to unemployment (Castaño-Muñoz & Rodrigues, 2021). From a learning analytics perspective, it is interesting to observe that all student activities can be tracked within the MOOC. This information has been studied to give empirical grounding to suggestions to reduce course dropout by fostering peer engagement on online forums, team assignments and peer evaluations (Dalipi et al., 2018). From a methodological perspective, some of the innovative methodologies exploiting MOOCs’ large data sets include K-means clustering, support vector machines and hidden Markov models.
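To make the clustering idea concrete, the following minimal sketch groups hypothetical MOOC learners by their activity profiles with K-means; all feature names and values are invented for illustration.

```python
# Minimal sketch: grouping hypothetical MOOC learners by engagement profiles
# with K-means clustering. All features and values are invented.
import numpy as np
from sklearn.cluster import KMeans

# Each row: [videos watched, forum posts, quiz attempts] for one learner
activity = np.array([
    [40, 12, 10],   # highly engaged learners
    [38, 15,  9],
    [42, 10, 11],
    [ 3,  0,  1],   # learners at risk of dropping out
    [ 5,  1,  0],
    [ 2,  0,  2],
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(activity)
print(labels)  # the first three learners share one cluster, the last three the other
```

A teacher could use such cluster labels to target early outreach at the low-engagement group before dropout occurs.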

A second trend in data-driven policies in education arises from software-oriented administrative data collections. These refer to the digital warehousing of administrative data such that the data can be relatively easily linked with other data sets and easily transformed through, for example, the inclusion of a large quantity of new observations (e.g. student files) and the ad hoc addition of new variables of interest (Agasisti et al., 2017). Administrative data sets are built around procedures whose aims are not primarily to foster data-driven policies (Barra et al., 2017; Rettore & Trivellato, 2019). In that sense, they can provide rich information about students and other educational stakeholders while being quicker to gather and significantly cheaper than retrospective surveys (Figlio et al., 2016).

As a major advantage, software-oriented administrative data collections can be easily linked to other data sources, such as the wide array of information surveyed by local governments in their interactions with citizens. Through software integration, data regarding such diverse domains as public health and agriculture may be seamlessly captured. To conceptualize the diversity of potential data sources, Langedijk et al. (2019) describe those data as divided into thematic silos. Each silo represents an important civil concern, such as health or education, and within each silo, stakeholders can define sub-themes to which interesting data sets are attached. For example, in the case of education, some proposed sub-themes are standardized test results, textbook quality and teacher quality (Langedijk et al., 2019). Through the development of electronic networks, links can be established not only within silos, where policy-makers may, for instance, be interested in the relation between teacher quality and test scores, but also across silos, where improvements in learning outcomes can be associated with changes in the health of citizens (Langedijk et al., 2019). The analyses required to measure such associations can take advantage of the typically long-run collection of administrative data (Figlio et al., 2016). As an additional advantage of electronic networks, whereas data has traditionally been transmitted in batches, for example, to produce descriptive reports at set time intervals, electronic networks now permit event registration in real time (De Wit & Broucker, 2017; Mukala et al., 2015). The real-time extraction of data benefits teachers and students, who can rely, for example, on automated assignments and online dashboards to improve their learning experience and their learning outcomes (De Smedt et al., 2017).

A good example of data set linkage in education is provided by studies with population data that aim to explore education outcomes in specific subgroups. A recent study by Mazrekaj et al. (2020) made use of the rich micro-data sets made available to researchers by the Dutch Central Bureau of Statistics (CBS). These micro-data cover many themes of social life (e.g. financial, educational, health, environmental and professional silos) and, though access is limited for privacy reasons, are easy to link together with standard analytics software.

Third, consider the Internet of Things. The Internet of Things denotes the numerous physical devices with integrated internet connectivity (DeNardis, 2020). In educational settings, these devices are the computers, social networking services, mobile devices, cameras, sensors and software with which students, teachers and administrators interact (Kassab et al., 2020). They are used to monitor students’ attendance, classroom behaviour and interactions with online teaching services and laboratories. On online platforms, but also through mobile apps and logging platforms (e.g. library access, blogs, electronic learning environments), students’ and tutors’ behaviours and opinions can be monitored in real time and passed through automatic analytics platforms or saved to solve future policy issues (De Smedt et al., 2017; De Wit & Broucker, 2017). Similarly, RFID (radio-frequency identification) sensors track the locations and availability of educational appliances such as laboratory equipment and projectors. Students and tutors can communicate with each other regardless of location, and assessment feedback can be delivered instantaneously, resulting in higher-quality education.

2.2 Learning Analytics as a Toolset

The toolset of learning analytics can be used for several purposes. We first provide some examples of how it can contribute to improving the cost-effectiveness of education and next how it can foster education outcomes on cognitive and non-cognitive scales. Finally, we provide examples of how learning analytics can assist in educational quality management.

2.2.1 Improving Cost-Effectiveness of Education

The increasing public scrutiny and tighter budgets, which are an ever-present reality of the educational landscape, motivate a double goal for data-driven solutions. These must improve efficiency and performance with regard to learning outcomes while also proposing solutions that are competitive in terms of cost (Barra et al., 2017). There are two poles through which cost-effective learning analytics solutions can be proposed.

The first pole stands at the level of data collection. Administrative data sets suffer from high costs of data cleaning and collection. Indeed, although data extraction is usually native to recent administrative software (King, 2016), administrative data sets typically require ad hoc linkages and research designs (Agasisti et al., 2017). In the sense that their inclusion in data-driven decision-making is not their primary purpose, they constitute an opportunistic data source and thus may occasionally demand more resource investments than deliberate data collection procedures. Meanwhile, the omnipresent network of computing devices and the associated online educational platforms permit data extraction at every step of the learning process (De Smedt et al., 2017). As previously indicated, this type of unstructured data can be saved, but the real-time data stream can also be designed in such a way as to permit automatic analyses. This deliberate pipeline associating the collected data with useful analyses can ensure cost-effectiveness through economies of scale. It can also serve as a baseline for future improvements in summarizing data for students, teachers and stakeholders in general. In short, rich data sets and insightful analyses can be produced without requiring recurrent ad hoc organizational involvement. In that sense, the environment in which learning analytics is embedded permits professionals and stakeholders to benefit from opportunistic analyses and from insights that are delivered efficiently (Barra et al., 2017). For example, during the COVID-19 crisis, learning analytics was used to monitor how students were reached by online teaching.

The second pole through which cost-effectiveness can be achieved in the establishment of data-driven policy-making for education is that of data analytics. Up until now, technologically able and creative teams have been keeping pace with the expanding volume, variety and velocity of data by developing and applying advanced analytical methods (De Wit & Broucker, 2017; King, 2016). One such method is Data Envelopment Analysis (DEA). It permits the employment of administrative and learning data in order to directly fulfil goals related to cost minimization (Barra et al., 2017; De Witte & López-Torres, 2017; Mergoni & De Witte, 2021). The results of such analyses may be useful in promoting efficient investments in educational resources (see, e.g. the report by the European Commission Expert Group on Quality Investment in Education and Training). This brings to the forefront the seemingly paradoxical effect of additional spending: it can increase cost-effectiveness in the long run. Advances in the social sciences have already demonstrated the consequences of poor learning outcomes, chief among which are “lower incomes and economic growth, lower tax revenues, and higher costs of such public services as health, criminal justice, and public assistance” (Groot & van den Brink, 2017). Hence, learning outcomes deserve an important place in discussions around the cost-effectiveness of education (De Witte & Smet, 2021).
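To illustrate the intuition behind DEA, the sketch below implements a minimal input-oriented CCR model (constant returns to scale) with a standard linear programming solver; the three schools and their spending and score figures are hypothetical.

```python
# Minimal input-oriented DEA sketch (CCR model, constant returns to scale)
# solved as a linear program. The three schools and their figures are
# hypothetical; an efficiency of 1 marks the best practice frontier.
import numpy as np
from scipy.optimize import linprog

inputs  = np.array([[1.0], [2.0], [2.0]])   # spending for schools A, B, C
outputs = np.array([[1.0], [2.0], [1.0]])   # average test score

def dea_efficiency(o, X, Y):
    n, m = X.shape                          # n schools, m inputs
    s = Y.shape[1]                          # s outputs
    c = np.r_[1.0, np.zeros(n)]             # minimize theta over [theta, lambdas]
    # Input constraints: sum_j lambda_j * x_ij <= theta * x_io
    A_in = np.c_[-X[o].reshape(m, 1), X.T]
    # Output constraints: sum_j lambda_j * y_rj >= y_ro
    A_out = np.c_[np.zeros((s, 1)), -Y.T]
    res = linprog(c, A_ub=np.r_[A_in, A_out], b_ub=np.r_[np.zeros(m), -Y[o]],
                  bounds=[(0, None)] * (1 + n))
    return round(res.x[0], 3)

scores = [dea_efficiency(o, inputs, outputs) for o in range(3)]
print(scores)  # [1.0, 1.0, 0.5]: A and B are efficient; C could halve spending
```

The efficiency score of 0.5 for school C reads as follows: given its output, a best-practice peer would need only half of C's input, which is exactly the kind of actionable summary efficiency research aims at.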

2.2.2 Improving Learning Outcomes

In terms of directly improving educational quality, three ambitions can be distinguished for learning analytics: improving (non-)cognitive learning outcomes, reducing learning support frictions and achieving wide deployment and long-term maintenance of teaching tools (Viberg et al., 2018). These ambitions are now discussed.

First, learning outcomes can be interpreted as the academic performance of students, as measured by quizzes and examinations (Viberg et al., 2018). Learning outcomes can also be defined more broadly than such testable outcomes, for example, in relation to interpersonal skills and civic qualities. However widely defined, it is important that the set of criteria identifying educational success is well defined by stakeholders and that it is clearly communicated to, and open to the contributions of, citizens. In that way, educational policy discussions can be centred around transparent and recognized aims.

Although there is a rich literature evaluating learning analytics in higher education, the contributions of learning analytics tools to improving the (non-)cognitive learning outcomes of secondary school students have received relatively little attention in the empirical literature (Bruno et al., 2021). Nevertheless, clear improvements in writing and argumentative quality have been associated with the use of automatic text evaluation software (Lee et al., 2019; Palermo & Wilson, 2020). Such software uses Natural Language Processing (NLP) to analyse data extracted from online learning platforms. Automatic text evaluation has also shown promising results at higher education levels and with non-traditional adult students (Whitelock et al., 2015b). There is thus flexibility in terms of the type of students or teachers to whom learning analytics approaches apply.

Another interesting contribution of learning analytics to the outcomes of secondary school students has been in improving their computer programming abilities. This has been accomplished through another advanced data analysis technique, process mining, which helped teachers pair students based on behavioural traces captured during programming exercises (Berland et al., 2015).

Second, with respect to learning support frictions, there is often a lag between the assumptions behind the design of learning platforms and the observed behaviours of students (Nguyen et al., 2018). An example of this lag is that students tend to spend less time studying than recommended by their instructors. Less involved students also tend to spend less time preparing assignments (Nguyen et al., 2018). By delaying feedback, such a lag can negatively affect both students’ and teachers’ involvement in the learning process. Thanks to learning analytics tools, students can receive tailored feedback, rehearse exercises that are particularly difficult for them and receive stimulating examples that fit their interests (Iterbeke et al., 2020). This reduces the learning support frictions and consequently improves learning outcomes.

Yet, the lag between the desired learning outcomes and student behaviour cannot be corrected simply through the implementation of electronic platforms or through a gamification of the learning process. It is critical that the digital tools being implemented, and those implementing them, take students’ feedback into account. Many students are now used to accessing information without encountering many physical or social barriers. For those students, the interactivity and the practicality of digital learning tools are particularly important (Pardo et al., 2018; Selwyn, 2019). Other students may not have the same familiarity with online computing devices. For these, accessibility must be built into the tools.

Many authors warn of a transfer from magisterial education to learning platforms in which feedback and exercises may be too numerous, superficial or ill-adapted to students’ capabilities or learning ambitions (Lonn et al., 2015; Pardo et al., 2018; Topolovec, 2018). Hence, a hybrid approach to learning support is suggested wherein technologies, such as the automatic text analyses and process mining just touched upon, are combined with personalized feedback from teachers and tutors. Indeed, classroom teaching is often characterized by a lack of personalization and by biases in the dispensation of feedback and exercises. For example, low-performing students are over-represented among the receivers of teacher feedback. Additionally, given the same learning objectives, feedback may be administered differently to students of different genders and origins. Teachers may find learning analytics tools useful in helping their students attain the desired learning outcomes while fostering their personal learning ambitions and their self-confidence (Evans, 2013; Hattie & Timperley, 2007).

Third, learning analytics can provide additional value to students and teachers. In that sense, we observe several clear advantageous applications of learning analytics.

  • Learning analytics could contribute to non-cognitive skills; collaboration, for example, is an area where such skills play an important role. Identifying collaboration and the factors that incite it can improve learning outcomes and even help in preventing fraud. The implementation of analytics methods such as Social Network Analysis (SNA) in learning platforms may allow teachers to prevent or foster such collaborations (De Smedt et al., 2017). Simple indicators like the time of assignment submission can be treated as proxies for collaboration. We discuss SNA in more depth in Sect. 16.3.

  • Another computational approach, process mining, can exploit real-time behavioural data to summarize the interactions of students with a given course’s material. Students can then be distinguished based on their mode of involvement in the course (Mukala et al., 2015). Process mining allows teachers to learn how the teaching method translates into behavioural actions. These insights can be incorporated into the course design and used to detect inefficient behaviour, allowing fast and personalized interventions (De Smedt et al., 2017).

  • A related way to generate value from learning analytics is by implementing text analyses directly on the learning platforms. Natural Language Processing (NLP) is a text analysis method that has been shown to greatly improve the performance of students on assignments such as essay writing (Whitelock et al., 2015a). Generally, text analysis can provide automated feedback shared with the students and their teachers (De Smedt et al., 2017). Providing automated feedback offers another argument for the cost-effectiveness of learning analytics. By giving course providers the ability to score large student bodies, it allows teachers to put more focus on providing adapted support to their students (De Smedt et al., 2017). We discuss NLP in more depth in Sect. 16.3.

  • Not the least advantage of online learning is that it allows asynchronous and synchronous interactions and communications between the participants in a course (Broadbent & Poon, 2015). These interactions can be logged as unstructured data and incorporated into useful text, process and social network analyses.
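As a toy illustration of the automated-feedback idea mentioned above, the sketch below derives feedback messages from a few surface features of a text; production NLP systems rely on far richer linguistic models, and the thresholds and messages here are invented.

```python
# Toy sketch of automated writing feedback. Production NLP systems rely on far
# richer linguistic models; here a few surface features drive invented messages.
import re

def essay_feedback(text, min_words=50, max_sentence_len=25):
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    diversity = len(set(words)) / len(words)      # type-token ratio
    avg_len = len(words) / len(sentences)         # words per sentence
    feedback = []
    if len(words) < min_words:
        feedback.append("Develop your argument further: the essay is short.")
    if avg_len > max_sentence_len:
        feedback.append("Consider splitting long sentences.")
    if diversity < 0.5:
        feedback.append("Try to vary your vocabulary.")
    return feedback or ["No surface-level issues detected."]

print(essay_feedback("The test was hard. The test was hard. The test was hard."))
```

Even such simple checks can be delivered instantly to every student, freeing teacher time for the deeper, personalized feedback that the hybrid approach above calls for.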

2.2.3 Educational Quality Management

A key component of quality improvement in education is the creation of quality and performance indicators related to teachers and schools (Vanthienen & De Witte, 2017). Learning analytics’ contribution to educational quality improvement lies in providing data sources and computational methods and combining them in order to produce actionable summaries of teaching and schooling quality (Barra et al., 2017). Whereas, traditionally, data analyses have required ad hoc involvement and costly time investments from stakeholders, learning analytics can rely on computational power and dense networks of computational devices to automatically propose real-time reports to policy-makers. Below, contributions in terms of quality measurement and predictions are introduced.

2.2.4 Underlying Data for Quality Measurement

Through the exploitation of unstructured, streamed, behavioural data and pre-existing administrative data sets, analytical reports can be updated in real time to reflect the state of education at any desired level, from the individual student and classroom to the country as a whole. That information is commonly ordered in online dashboards (De Smedt et al., 2017). Analysts and programmers can even allow the user to customize the presented summary in real time, by applying filters on maps and subgroups of students, for example.

2.2.5 Efficiency Measurement

An aspect of the quality measurements provided by learning analytics is efficiency research, in which inputs and outputs are compared against a best practice frontier (see the earlier discussed Data Envelopment Analysis model). In this branch of the literature, schools are, for instance, compared based on their ability to maximize learning outcomes given a set of educational inputs (De Witte & López-Torres, 2017; e Silva & Camanho, 2017; Mergoni & De Witte, 2021). The outcome of such an analysis might be used for quality assessment purposes.

2.2.6 Predictions

When discussing the potential of learning analytics for educators and stakeholders, the ability to make predictions about learning outcomes is an unavoidable point of interest. In quantitative analyses, predictions are generated by translating latent patterns in historical data, structured or unstructured, into likely future outcomes (De Witte & Vanthienen, 2017).

Predictions can be produced using, for example, the Bayesian Additive Regression Trees (BART) model (see Sect. 16.3), as applied in Stoffi et al. (2021). There, linked administrative and PISA data available only in Flanders are used to distinguish a group of strongly under-performing Walloon students and to explain their situation. Typically, such a technique uses administrative data that is available for both groups in order to make a sensible generalization from one to the other.

Alternatively, process mining can be used to identify clusters of students and distinguish successful interaction patterns with a course’s material (Mukala et al., 2015). Similar applications can be imagined for Social Network Analysis (De Smedt et al., 2017), through the evaluation of collaborative behaviour, and Natural Language Processing. These techniques are usually perceived as descriptive, but their output may very well be included in a predictive framework by education professionals and researchers.
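The kind of event-log summary that process mining builds on can be sketched minimally as a directly-follows count over student clickstreams; the traces and activity names below are hypothetical.

```python
# Minimal process-mining-style sketch: build a directly-follows count from
# hypothetical clickstream logs to summarize how students move through a course.
from collections import Counter

# One event trace per student (activity names are invented)
traces = [
    ["video", "quiz", "forum", "quiz"],
    ["video", "quiz", "quiz"],
    ["forum", "video", "quiz"],
]

follows = Counter()
for trace in traces:
    for a, b in zip(trace, trace[1:]):
        follows[(a, b)] += 1

# The most frequent transition hints at the dominant study pattern
print(follows.most_common(1))  # [(('video', 'quiz'), 3)]
```

Full process mining tools go further, discovering whole process models and conformance statistics, but the directly-follows relation above is the basic building block they share.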

Learning analytics has initiated a shift from using purely predictive analytics as a means to identify student retention probabilities and grades towards the application of a wider set of methods (Viberg et al., 2018). In return, cutting-edge exploratory and descriptive methods can improve traditional predictive pipelines.

3 An Array of Policy-Driving Tools

It is one thing to comb over the numerous contributions and potential of learning analytics to data-informed decision-making; it is yet another to actually take the plunge and settle on tools for problem-solving in education. In what follows, a brief introduction to distinct methods from the field of computational social sciences is provided. In that way, the reader can get acquainted with the intuition of the methods and how they can be used to improve learning outcomes and quality measurement in education. To set the scene, we also illustrate how the approaches open up the range of innovative educational questions that can be answered through learning analytics.

3.1 Bayesian Additive Regression Trees

The Bayesian Additive Regression Trees (BART) algorithm stems from machine learning and probabilistic programming. It is a predictive and classification algorithm that simplifies complex prediction problems by relying on a set of sensible default parameter configurations. Earlier comparable algorithms such as the Gradient Boosting Machine (GBM) and the Random Forest (RF) require repeated tuning, which makes the quality of their predictions hinge on the analyst’s programming ability and on limited computational resources. By contrast, BART incorporates prior knowledge about educational science problems in order to produce competitive predictions and measures of uncertainty after a single estimation run (Dorie et al., 2019). This contributes to the accessibility of knowledge discovery and the credibility of policy statements in education.

As with the GBM and the RF, the essential building block of the BART algorithm is the decision or prediction tree. The prediction tree is a classic predictive method that, unlike traditional regression methods, does not assume linear associations between variables. It is robust to outlying variable values, such as those due to measurement error, and can accommodate large quantities of data and high-dimensional data sets.

Their accuracy and relative simplicity have made regression trees popular diagnostic and prediction tools in medicine and public health (Lemon et al., 2003; Podgorelec et al., 2002). In education, a recent application of regression trees has been to explore dropout motivations and predictors in tertiary education (Alfermann et al., 2021). The regression tree algorithm (i.e. CART or classification and regression trees, Breiman et al., 2017) performs variable selection automatically, so researchers are able to distinguish a few salient motivations, such as the perceived usefulness of the work, from a vast pool of possible predictors.

To predict quantities such as test scores or dropout risk, regression trees separate the observations into boxes associating a set of characteristics with an outcome. The trees are created in multiple steps. In each of these steps, all observations comprised in a box of characteristics are split into two new boxes. Each split is selected by the algorithm to maximize the accuracy of the desired predictions. The end result of this division of observations into smaller and smaller boxes is a set of branches through which each individual observation descends into a leaf. That leaf is the final box that assigns a single prediction value (e.g. a student’s well-being score) to the set of observations sharing its branch. Graphically, the end result is a binary decision tree where each split is illustrated by a programmatic if statement leading onto either the next binary split or a leaf.
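The box-splitting logic described above can be made tangible with a single shallow regression tree; the study-hours data below are invented, and the printed splits show the if/else structure of the fitted tree.

```python
# Sketch of a single regression tree predicting a (hypothetical) test score
# from weekly study hours, printed as nested if/else splits.
from sklearn.tree import DecisionTreeRegressor, export_text

hours  = [[1], [2], [3], [10], [11], [12]]   # weekly study hours per student
scores = [40, 42, 41, 80, 82, 81]            # corresponding test scores (toy)

tree = DecisionTreeRegressor(max_depth=1, random_state=0).fit(hours, scores)
print(export_text(tree, feature_names=["study_hours"]))  # the learned splits
print(tree.predict([[5]]))  # [41.] - 5 h of study lands in the low-score leaf
```

Each leaf returns the mean outcome of the observations that descended into it, which is exactly the "single prediction value per box" described above.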

The Bayesian Additive Regression Trees (BART) algorithm is the combination of many such small regression trees (Kapelner & Bleich, 2016). Each regression tree adds to the predictive performance of the algorithm by picking up on the mistakes and leftover information from the previously estimated trees. After hundreds or possibly thousands of such trees are estimated, complex and subtle associations can be detected in the data. This makes the BART algorithm particularly competitive in areas of learning analytics where a large quantity of data are collected and there is little existing theory as to how interesting variables may be related to the outcome of interest, be it some aspect of the well-being of students or their learning outcomes.

The specific characteristic of the BART algorithm is its underlying Bayesian probability model (Kapelner & Bleich, 2016). By using prior probabilistic knowledge to restrict estimation possibilities to realistic prediction scenarios, the algorithm can avoid detecting spurious associations between variables. Each data set, unless it constitutes a perfect survey of the entire population of interest, contains variable associations that are present purely due to chance. Such coincidental associations reduce the ability to predict true outcomes when they are included in predictive models. Thus, each regression tree estimated by the BART algorithm is kept relatively small. Because each tree tends to assign predictions to larger sets of observations (i.e. large boxes), the predictive ability of individual trees is poor. This is why analysts call them weak learners. However, by combining many such weak learners, a flexible, precise and accurate prediction function can be generated (Hill et al., 2020).
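The sum-of-weak-learners mechanics can be illustrated with a plain residual-fitting loop of shallow trees. Note that this is a non-Bayesian analogue on toy data: BART additionally places priors on tree structure and leaf values, which this sketch omits.

```python
# Residual-fitting loop of shallow trees, illustrating the sum-of-weak-learners
# idea behind BART on toy data. (BART itself additionally places Bayesian
# priors on tree structure and leaf values, which this sketch omits.)
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

pred = np.full_like(y, y.mean())              # start from the overall mean
initial_error = np.mean((y - pred) ** 2)
for _ in range(100):
    weak = DecisionTreeRegressor(max_depth=2, random_state=0)
    weak.fit(X, y - pred)                     # fit what is left unexplained
    pred += 0.1 * weak.predict(X)             # add a small correction

final_error = np.mean((y - pred) ** 2)
print(final_error < initial_error)  # True: many weak trees beat the mean
```

Each depth-2 tree on its own is a poor predictor of the sine curve, yet the sum of a hundred small corrections recovers it closely, which is the weak-learner logic the paragraph describes.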

The BART algorithm has already been presented earlier in this chapter as a flexible technique to detect and explain learning outcome inequalities (Stoffi et al., 2021). A refinement of the algorithm also permits the detection of heterogeneous policy effects on the learning outcomes of students. This is shown in Bargagli-Stoffi et al. (2019), where it is found that Flemish schools with a young and less experienced school director benefit most from a certain public funding policy. The large administrative data sets provided by educational institutions and governments are well suited to the application of rewarding but computationally demanding techniques such as BART (Bargagli-Stoffi et al., 2019).

3.2 Social Network Analysis

The aim of Social Network Analysis (SNA) is to study the relations between individuals or organizations belonging to the same social networks (Wasserman, Faust, et al., 1994). Relations between these actors are defined by nodes and ties. The nodes are points of observation, which can be students, schools, administrations and more. The ties indicate a relationship between nodes and can contain additional information about the intensity of various components of that relationship (e.g. the time spent collaborating, the type of communication; Grunspan et al., 2014). Specifically for education, SNA aims to describe the networks of students and staff and make that information actionable to stakeholders. Applications of SNA include the optimization of learning design, the reorganization of student groups and the identification of at-risk clusters of students (Cela et al., 2015). Through text analysis and other advanced analytics methods, SNA can handle unstructured data from school blogs, wikis, forums, etc. (Cela et al., 2015). We discuss five examples in more detail next and refer the interested reader to the review by Cela et al. (2015), which provides many other concrete applications of SNA in education.

As a first example, the recognized importance of peer effects, both within and outside the classroom, makes SNA a particularly useful tool in education (Agasisti et al., 2017; Cela et al., 2015; Iterbeke et al., 2020). Applications of SNA model peer effects indirectly as a component of unobserved school or classroom effects that influence the cognitive and non-cognitive skills of students (Cooc & Kim, 2017). As a second example, SNA has been applied to describe and explain a multiplicity of phenomena in schools. In a study of second- and third-grade primary school pupils from 41 schools in North Carolina, Cooc and Kim (2017) found that pupils with a low reading ability who associated with higher-ability peers for guidance significantly improved their reading scores over a summer. Third, other relevant applications of SNA have been in assessing the role of peers in the well-being, be it mental or physical, of students. Surveying 1458 Belgian teenagers, Wegge et al. (2014) showed that the authors of cyber-bullying were often also responsible for physically bullying a student. Additionally, it was observed that a majority of bullies were in the same class as the bullied students. Moreover, a map of bullying networks isolated some students as being perpetrators of the bullying of multiple students. In cases of intimidation and bullying, a clear advantage of SNA over the usual approaches is that the data does not depend on isolated denunciations from victims and peers. The analysis of Wegge et al. (2014) simultaneously identifies culprits and victims, suggesting a course of action that does not focus attention on an isolated victim of bullying. A fourth example application of SNA is in improving the managerial efficacy and the performance of employees within educational organizations. One way to do this is by identifying bottlenecks in the transmission of information through the mapping of social networks.
This can take two forms in the language of SNA: brokerage and structural holes (Everton, 2012). In a brokerage situation, a single agent or node controls the passing of information from one organizational sub-unit to the other. Meanwhile, structural holes identify absent ties between sub-units in the network. In a school, an important broker may be the principal’s secretary, whereas structural holes may be present if teachers or staff do not communicate well with one another (Hawe & Ghali, 2008). As a fifth illustration, the SNA method has been used to propose a typology of teachers based on the nature of their ties with students and to identify clusters of students more likely to plagiarise with one another (Chang et al., 2010; Merlo et al., 2010; Ryymin et al., 2008). The ability to cluster students based on the intensity of their collaborations in a course has also been distinguished as a way to prevent fraud. Detecting cooperation between students is one of the key applications of SNA in learning analytics (De Smedt et al., 2017).
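The brokerage notion can be made concrete with a few lines of networkx code. The school, the node names and the tie structure below are invented for illustration; betweenness centrality is one standard SNA metric for locating brokers, since a broker lies on most of the shortest paths between sub-units.

```python
# Toy school network: two internally well-connected sub-units (teachers
# and administrators) that only communicate through the secretary.
import networkx as nx

G = nx.Graph()
teachers = ["t1", "t2", "t3"]
admin = ["a1", "a2", "a3"]
G.add_edges_from([(u, v) for u in teachers for v in teachers if u < v])
G.add_edges_from([(u, v) for u in admin for v in admin if u < v])
G.add_edges_from([("secretary", "t1"), ("secretary", "a1")])

# The broker has the highest betweenness centrality: every shortest path
# between the two sub-units passes through them.
centrality = nx.betweenness_centrality(G)
broker = max(centrality, key=centrality.get)
print(broker)  # 'secretary'
```

Removing the broker node would disconnect the two sub-units entirely, which is exactly the kind of information-flow bottleneck the mapping is meant to expose.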

3.3 Natural Language Processing

Natural Language Processing (NLP) refers to the ability of computing machines to communicate in human languages (Smith et al., 2020). NLP applications can be achieved with relatively simple sets of rules or heuristics (e.g. word counts, word matching), that is, without applying cutting-edge machine learning techniques (Smith et al., 2020). When NLP does rely on machine learning techniques, it is better able to understand context and reveal hidden meanings in communications (e.g. irony) (Smith et al., 2020).

In education, the use of NLP has been shown to improve students’ learning outcomes (Whitelock et al., 2015a) and to promote student engagement. Moreover, NLP systems have the potential to provide one-on-one tutoring and personalized study material (Litman, 2016). The automatic grading of complex assignments is a valuable feature of NLP models in education. These may eventually become a cost-effective solution that facilitates the evaluation of deeper learning skills than those evaluated through answers to multiple-choice questions (Smith et al., 2020). By efficiently adjusting the evaluation of knowledge to the learning outcomes desired by stakeholders, NLP can contribute to educational performance. External and open data sets have allowed NLP solutions to achieve better accuracy in tasks such as grading. Such data sets can situate words within commonly invoked themes or contexts, for example, allowing the NLP model to make a more nuanced analysis of language data (Smith et al., 2020). Access to rich language data sets and algorithmic improvements may even allow NLP solutions to produce course assessment material automatically (Litman, 2016). However, an open issue with machine learning implementations of NLP is that the features used in grading by the computer may not provide useful feedback to the student or the teacher (e.g. by basing the grade on word counts) (Litman, 2016). Reasonable feedback may still require human input.
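To make the heuristic end of this spectrum concrete, the sketch below implements keyword-matching grading in plain Python. The rubric and the student answer are invented; real rule-based graders also handle synonyms, stemming and word order, and the feedback problem noted above applies in full: a bare keyword score explains very little to the learner.

```python
# Minimal rule-based grading heuristic: score an answer by the fraction
# of rubric keywords it contains.
def keyword_score(answer: str, rubric_keywords: set) -> float:
    """Return the fraction of rubric keywords present in the answer."""
    words = {w.strip(".,;:!?").lower() for w in answer.split()}
    return len(words & rubric_keywords) / len(rubric_keywords)

rubric = {"photosynthesis", "sunlight", "chlorophyll", "glucose"}
answer = "Plants use sunlight and chlorophyll to make glucose."
print(keyword_score(answer, rubric))  # 0.75
```

The heuristic is transparent but easily gamed: an answer that simply lists the four keywords scores 1.0, which is one reason the text warns that students may tailor solutions to the letter rather than the spirit of the assessment.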

4 Issues and Recommendations

Despite the outlined benefits and contributions of learning analytics, there are still some issues and limitations. A clear distinction can be made between the technical and non-technical issues of learning analytics (De Wit & Broucker, 2017). On the technical side, there are issues related to platform and analytics implementations, data warehousing, device networking, etc. With regard to the non-technical issues, there are concerns over public acceptance of and involvement in learning analytics, private and public regulations, human resources acquisition and the enthusiasm of stakeholders as to the technical potential of learning analytics. We summarize these challenges and propose a nuanced policy pathway to learning analytics implementation and promotion.

4.1 Non-technical Issues

Few learning analytics papers mention the ethical and legal issues linked to the applications of their recommendations (Viberg et al., 2018). Clearly, developments in learning analytics contribute to and benefit from the expansion of behavioural data collection. The spread and depth of data collection are generating new controversies around data privacy and security. These have an important place in public discourse and, if mishandled by stakeholders, could contribute to further limiting the potential of data availability and computational power in learning analytics and similar disciplines (Langedijk et al., 2019). Scientists are currently complaining about the restrictions put upon their research by rules and accountability procedures. Such rules curtail data-driven enterprises and may be detrimental to improvements in learning outcomes (Groot & van den Brink, 2017). To facilitate collaboration with decision-makers, it is important that the administrative procedures related to learning analytics be seen by researchers as contributing to a healthy professional environment (Groot & van den Brink, 2017).

Additionally, public accountability and policies promoting organizational transparency may be a proper counter-balance to privacy concerns among citizens (e Silva & Camanho, 2017). The transparency and accessibility of information, by making relevant educational data sets public, for example, can involve citizens in the knowledge discovery related to education and foster enthusiasm for data-driven inference in that domain (De Smedt et al., 2017). It is also important that the concerned parties, including civil society, are interested in applying data-driven decision-making (Agasisti et al., 2017). It can be difficult to convince leaders in education to shift to data-driven policies since, for them, “experience and gut-instinct have a stronger pull” (Long & Siemens, 2011).

Just as necessary as political commitment, the acquisition of a skilled workforce is another sizeable non-technical issue (Agasisti et al., 2017). The growth of data-driven decision-making has yielded an increase in the demand for highly educated workers while reducing the employment of unskilled workers (Groot & van den Brink, 2017). In other words, there is a gap between the growing availability of large, complex data sets and the pool of human resources necessary to clean and analyse those data (De Smedt et al., 2017). This invokes a problem shared across the computational social sciences: the double requirement of technical and analytical skills. Often, domain-specific knowledge is also an unavoidable component of useful policy insights (De Smedt et al., 2017). That multiplicity of professional requirements has led some authors to describe the desirable modern data analyst as a scholar-practitioner (Streitwieser & Ogden, 2016).

4.2 Technical Issues

Many technical problems must be tackled before data-driven educational policies become a gold standard. Generally, there is a need for additional research regarding the effects of online educational software and of digital data collection pipelines on student and teacher outcomes. Additionally, inequalities in access to online education and its usage are an ever-present challenge (Jacob et al., 2016; Robinson et al., 2015).

There is still relatively little evidence indicating that learning analytics improves the learning outcomes of students (Alpert et al., 2016; Bettinger et al., 2017; Jacob et al., 2016; Viberg et al., 2018). For example, less sophisticated correction algorithms may be exploited by students, who will tailor their solutions to obtain maximal scores without acquiring the desired knowledge (De Wit & Broucker, 2017). This is a question of adjustment between the spirit and the letter of the learning process.

Additionally, although the combination of administrative and streamed data is in many ways advantageous compared to survey data (Langedijk et al., 2019), the fast collection and analysis of data create issues of data accuracy. With real-time data analyses and reorientations of the learning process, accessible computing power becomes an issue.

Meanwhile, unequal access to online resources and devices plainly excludes a section of the student and teacher population from the reach of the digital tools of education. In part, this creates issues of under-representation in educational studies that increasingly rely on data obtained online (Robinson et al., 2015). It also creates a divide between those stakeholders who can make an informed choice between using and developing digital tools and face-to-face education, and those who cannot access digital tools or for whom digital education has a prohibitive cost (Bettinger et al., 2017; Di Pietro et al., 2020; Robinson et al., 2015).

Lack of access to digital or hybrid learning tools (i.e. a mix of face-to-face and digital education) may directly impede the learning and well-being of students. Indeed, students with access to online and hybrid education can access resources independently to enhance their educational pathway (Di Pietro et al., 2020). In a sense, a larger range of choices makes better educational outcomes attainable. For example, students at a school within a neighbourhood of low socio-economic standing may access a diverse network of students and teachers on electronic platforms (Jacob et al., 2016). In times of crisis, such as during the COVID-19 school lockdowns, ready access to online educational platforms also reduces the opportunity cost of education (Chakraborty et al., 2021; Di Pietro et al., 2020).

However, access is not a purely technical challenge. There are also noted gaps between populations in the usage made of educational platforms and internet resources more generally (Di Pietro et al., 2020; Jacob et al., 2016). Students participating in MOOCs, for example, are overwhelmingly highly educated professionals (Zafras et al., 2020). Online education may also leave more discretion to students. This discretion has proven to be a disadvantage to those who perform less well and are less motivated in face-to-face classes (Di Pietro et al., 2020).

4.3 Recommendations

Data-driven policies will require vast investments in information technology systems towards both data centres and highly skilled human resources. Therefore, additional data warehouses need to be built and maintained. Those require strong engineering capabilities (De Smedt et al., 2017). The integration of teaching and peer collaborations within computer systems promises to accelerate innovations in education. One can imagine that, in the future, administrative and real-time learning data will be updated and analysed in real time. The analyses will also benefit from combining data from other areas of interest such as health or finance. Additionally, the reach of analytics programs could be international, allowing for the shared integration and advancement of knowledge systems across countries (Langedijk et al., 2019).

Although data-driven policies and educational tools have large practical potential, it is important that an educational data strategy not be developed for its own sake. Contrary to what some big data enthusiasts have claimed, the data does not “speak for itself” in education (Anderson, 2008). The teachers, administrators and policy-makers who are working to better educate our children will still face complicated dilemmas appealing to their professional expertise, regardless of the level of integration of data analytics in education.

Furthermore, to ensure political willingness, it is critical that work teams and stakeholders profit from the collected and analysed data (De Smedt et al., 2017). This contributes to the transparency of data use. Finally, although the evidence is still quite thin regarding the benefits of learning analytics, it must be noted that only a small number of validated instruments are actually being used to measure the quality and transmission of knowledge through learning platforms (Jivet et al., 2018).

Despite this scarcity of evidence pertaining to education, the exploitation of data through learning analytics can be linked to the recognized advantages of big data in driving public policy. Namely, it can facilitate a differentiation of services, increased decisional transparency, needs identification and organizational efficiency (Broucker, 2016). Generally, the lack of available data backing a decision is an indication of a lack of information and, thus, sub-optimal decision-making (Broucker, 2016).

Policies can be better implemented through quick and vast access to information about students and other educational stakeholders. In other words, the needs of students and other educational stakeholders can be more efficiently satisfied with evidence obtained from data collection (e.g. lower cost, higher speed of implementation). Such evidence-based education is a rational response to the so-called fetishization of change that has been plaguing educational reforms (Furedi, 2010; Groot & van den Brink, 2017).

It follows that data analytics should not become a new object for the fetishization of change in educational reforms. Indeed, quantitative goals (e.g. the number of sensors in a classroom) should not be conflated with educational attainment (Long & Siemens, 2011; Mandl et al., 2008). Rather, data analytics should be developed and motivated as an approach that ensures opportunities to use data in order to sustain mutually agreeable educational objectives.

These objectives may pertain to the lifetime health, job satisfaction, time allocation and creativity of current students (Oreopoulos & Salvanes, 2011). In other words, learning analytics pipelines must be carefully implemented in order to ensure that they are a rational response to contemporary challenges in education.