Introduction

Assessing outcomes of educational efforts in terms of competence has a long tradition in general education and gained particular recognition through international large-scale assessments such as the Programme for International Student Assessment (PISA). However, corresponding efforts to measure domain-specific vocational and professional competences are still rare. In 2011, the German Federal Ministry of Education and Research (BMBF) launched the research initiative ‘Technology-Based Assessment of Skills and Competencies in Vocational Education and Training (ASCOT)’. The initiative aimed at the development of computer-based instruments for the assessment of domain-specific competences in selected vocations in Germany on the basis of authentic work and business processes. The work presented in this paper is part of a research project entitled ‘Modelling and measuring domain-specific problem-solving competence of industrial clerks (DomPL-IK)’. In the following, we highlight two innovative features of our competence measurement in the business domain: (1) With problem-solving competence, we address higher-order competences instead of mere knowledge reproduction. Therefore, we not only developed complex problem scenarios within an authentic office simulation but also provided an open-ended and authentic problem space for working on these problems, rather than applying highly structured items (e.g., multiple-choice items). The analysis of the participants’ complex behavior patterns was based on a theoretically driven competence model and on item response theory (IRT). (2) As an alternative to relying on detached self-report questionnaires, we implemented an integrated measurement of non-cognitive facets of competence (i.e., facets of self-concept and interest) that we refer to as ‘Embedded Experience Sampling’ (EES): Test-takers in a complex problem-solving task are requested to stop at certain times during the test and spontaneously answer short prompts regarding their actual experience of the problem situation (e.g., ‘Your colleague Julian visits your office: Hi, how are you? I heard you have to deal with a rather large task. Well, I just wanted to ask how you are doing.’; answers were to be given on a four-point Likert scale, e.g. from ‘At the moment, I feel not at all confident’ = 1 to ‘… very confident’ = 4). The project is located in commercial vocational education and training (VET). Nevertheless, the approach is applicable in other domains as well.

A study with nearly 800 VET students was conducted in 2014. This paper provides an overview of the theoretical modeling of domain-specific problem-solving competence, the development of problem scenarios in the field of controlling, the computer-based test environment, and the implementation of EES. Particular attention is given to the analysis of reliability and validity of the developed competence assessment based on the empirical study. Finally, we discuss limitations, possible applications and advancements of the assessment.

Background

Domain-specific problem-solving competence

According to common definitions, a person is confronted with a problem when he or she has a goal but—in contrast to facing a simple or routine task—does not immediately know what is needed to reach the desired goal (Duncker 1945; Newell and Simon 1972; OECD 2013). Thus, whether a situation is perceived as a task or a problem depends on an individual’s prior experience, knowledge and skills (Dörner 1987; Mayer 1994; Funke et al. in print). However, even for routine tasks one may not always immediately recognize all necessary operations; it may take time or additional information to work out what to do without the situation ever being considered a problem. This blurs the definition of the term ‘problem’. Hence, in addition to an initial ‘state of not knowing’, we suggest that problems are also characterized by the affective response to this initial ‘state of not knowing’. A tendency towards negative emotional responses then indicates a problem situation (i.e., a significant discrepancy between an actual and a desired state), whereas the absence of such an initial negative emotional state would indicate that goal achievement is either not significant enough (e.g., the goal can easily be abandoned) or not considered too challenging (e.g., the goal can easily be achieved). This perspective is also found in the problem definition by Jonassen and Hung (2012), who suggest two critical attributes of a problem, namely the existence of an unknown and the need to determine the unknown. Thus, experiencing negative emotions indicates that an individual really cares about solving the problem (Op’t Eynde et al. 2006) or about determining the unknown. Furthermore, the problem solver might (and should) try to actively down-regulate such tendencies towards negative emotional responses (Dörner and Wearing 1995; Funke 2012; Funke et al. in print). The effects of emotions on achievement behavior are ambiguous (Carver and Scheier 2014). Positive moods at a medium level of activation were found to facilitate adequate, planned, and reflective problem-solving behaviors in a study by Reither and Stäudel (1985), whereas negative emotions increased the tendency to avoid a problem by shifting attention to easier tasks (Schwarz and Bless 1991). However, Spering et al. (2005) and Barth and Funke (2010) showed that negative feedback from the problem environment triggered negative affect, which in turn might enhance problem solving. Still, from the perspective of emotion regulation (Gross 1998), it is important to regulate these negative emotions even if they represent valuable feedback on the progress of problem solving (Hannula 2015).

In line with Weinert (2001), the attribution of competence should be based on dealing with complex situations. The complexity of a problem situation is defined by the number and interconnectedness of variables, the number of conflicting goals, lack of transparency, self-reinforcing tendencies and time pressure (Dörner 1996; Funke 2003). With regard to dynamics, Leutner et al. (2005) distinguish dynamic problem solving from analytic problem solving. Dynamic problems require exploration by means of manipulating variables, observing effects, and drawing conclusions. The MicroDYN approach is the most common psychometric instrument for dynamic problem solving and was also applied in PISA. The participants explore linear systems, usually consisting of three independent and three dependent variables, by manipulating the independent variables and recording their insights in a causal diagram (i.e., knowledge acquisition). Afterwards, they have to manipulate the independent variables to achieve a given array of target values (i.e., knowledge application). The participants are usually confronted with seven to nine tasks, each lasting a maximum of about 5 min (Greiff et al. 2013a; Schoppek and Fischer 2015). In contrast, our own approach builds on analytic problem solving, in which relevant information is presented or can be derived by deductive reasoning (Leutner et al. 2005), and which also resembles information problem solving as, for instance, referred to by Brand-Gruwel et al. (2009). We do not follow Leutner et al.’s (2005) view that only dynamic problems are complex problems. Likewise, Schoppek and Fischer (2015) argue that problems within the MicroDYN approach lack many of the above characteristics of complex problems. Conversely, analytic problems can possess all of the further features of complex problems, too. In addition, we argue that the degree of complexity of a problem is to some extent subjectively perceived and may also vary frequently while working on the problem. Any attempt to objectively predefine the complexity of problems has to be based on the anticipation of a target group’s problem-solving competence.
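For illustration, the dynamic systems underlying the MicroDYN approach are typically specified as linear structural equation systems of the following form (the notation is ours, added for clarity, and not taken from the cited studies):

$$\mathbf{y}_{t+1} = \mathbf{A}\,\mathbf{y}_{t} + \mathbf{B}\,\mathbf{x}_{t}$$

where $\mathbf{x}_t$ denotes the independent variables manipulated by the participant, $\mathbf{y}_t$ the dependent variables, $\mathbf{B}$ the direct effects of the independent on the dependent variables, and $\mathbf{A}$ autoregressive and side effects among the dependent variables.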

Following Fischer and Neubert (2015), we consider problem-solving competence as a combination of knowledge, skills, abilities, and other components (‘KSAO approach’) rather than a single ability as within the MicroDYN approach (Greiff et al. 2013a). In the context of problem solving, domain-specific knowledge refers to declarative, procedural, conditional and other types of knowledge that are relevant in problem situations within a particular domain (Ackerman 2000; Nokes et al. 2011; Woolfolk 2005). Thus, by including knowledge in the definition of problem solving, the construct becomes domain-specific. Although domain-specific knowledge plays an important role, problem solving is also enhanced by non-cognitive factors such as self-confidence, perseverance, motivation, interest, frustration tolerance and the like (Frensch and Funke 1995; Schoppek and Fischer 2015; Sugrue 1995). Similarly, Kanfer and Ackerman (2005) consider knowledge, skills and abilities, motivation, personality, and self-concept as components of work competence. In summary, we follow Herl et al. (1999, p. 2), who state that in order ‘… to be a successful problem solver, one must know something (content knowledge), possess intellectual tricks (problem-solving strategies), be able to plan and monitor one’s progress towards solving the problem (metacognition), and be motivated to perform (effort and self-efficacy)’.

Based on an extensive literature review, we developed a model of domain-specific problem-solving competence (for more information concerning the development of the competence model, see Rausch and Wuttke 2016) that comprises 13 facets of competence, which are assigned to four components—(1) knowledge application, (2) metacognition, (3) self-concept, and (4) interest—and aligned along an ideal problem-solving process, whilst recognizing that complex problem solving is rarely a linear process (Fig. 1). Furthermore, we refer to the facets of the first two components (A and B in Fig. 1) as cognitive facets and to the facets of the last two components (C and D) as non-cognitive facets. While cognition usually refers to ‘cold’ information processing (Collins and Smith 1994), the term non-cognitive is quite often a ‘residual category’ (Funke et al. in print, p. 8) that ‘comes by default to describe everything else’ (Duckworth and Yeager 2015, p. 238). We follow this distinction between cognitive and non-cognitive facets whilst recognizing that many constructs, such as self-concept, involve both cognitive and non-cognitive processes.

Fig. 1 Model of domain-specific problem-solving competence (Rausch and Wuttke 2016, p. 177)

In contrast to generic dispositions (e.g., intelligence), competence is considered to be domain-specific. People are usually more competent in one domain while being less competent in others (e.g., accounting, baseball, chess). Following Weinert (2001), the underlying constructs of competence in different domains are comparable, although performance differs substantially between domains. Although the performances of preparing a tender letter and setting up a CNC machine are very different, a high self-concept in the respective domain usually enhances one’s performance. Hence, the proposed competence model is not restricted to one domain but can easily be adapted to other domains; still, it is domain-specific as opposed to domain-general approaches to problem-solving competence. However, some of the facets in Fig. 1 might be more domain-specific while others might be more general, since different components of problem-solving competence vary in their degree of generalizability (Fischer and Neubert 2015; Funke et al. in print). The 13 competence facets facilitate the development of problem scenarios to measure domain-specific problem-solving competence.

Development of authentic problem scenarios in the domain of controlling

A valid measurement of domain-specific competence builds on the requirements of a particular domain, that is, the bundle of tasks that one is expected to solve. Our research focuses on the problem-solving competence of industrial clerks, which is the fifth most common of 328 state-recognized vocational training programs in the German dual system of VET. Certified industrial clerks usually work in back-office departments of industrial or service companies. The qualification is roughly comparable to a Bachelor’s degree in business administration. Further education and professional development can lead to lower or middle management positions. Although routine tasks are still an important part of office work, many of those repetitive processes have been automated or outsourced in recent decades (Autor et al. 2003; Frey and Osborne 2013). Thus, employees in back offices of industrial and service companies are increasingly confronted with the remaining non-recurrent problem cases.

VET programs claim to prepare individuals for a broad range of workplace requirements. Consequently, vocational curricula comprise several domains. With regard to the vocational competences of industrial clerks, we focused on ‘operative controlling’, which is an important part of the curriculum followed by apprentice industrial clerks, as well as a relevant domain of business administration in general. Further insight was derived from content analyses of vocational training regulations and textbooks, a survey of workplace demands on employees in controlling departments (in cooperation with the European Competence Center for Applied Research on Medium-Sized Enterprises at the University of Bamberg, Germany; Becker et al. 2012), a diary study on problem solving in office work (Rausch et al. 2015), and an interview study on typical tasks and requirements in the domain of controlling with teachers, workplace trainers, VET students, and employees in controlling departments (Eigenmann et al. 2015). The findings from these domain analyses formed the basis for the development of authentic problem scenarios. The Appendix (Table 6) gives an overview of the studies and main findings of the domain analysis phase.

To ensure authenticity, all problem scenarios are embedded in a model company, which is based on a real-life medium-sized bicycle manufacturer. We developed three complex and authentic problem scenarios, each of which demands various steps of researching, evaluating and processing information, decision making, and communicating a proposed solution within 30 min. The built-in complexity of the scenarios was designed with regard to typical characteristics of complex problems (see above) and in anticipation of the target group’s professional knowledge and problem-solving competence (based on our domain analysis). Scenario 1 requires a deviation analysis of budgeted and actual costs. The participants have to calculate budgeted costs as well as absolute and relative deviations in a spreadsheet application, identify relevant deviations, investigate the various reasons for these deviations in a large number of business documents, and propose adjustments for future budgeting in an email to their supervisor. In scenario 2, the participants must carry out a supplier selection by calculating acquisition prices and applying a value analysis, and scenario 3 concerns a make-or-buy decision. Besides a variety of scenario-specific business documents of various types (invoices, letters, bids, notes, etc.), a comprehensive archive containing short explanations of relevant and irrelevant technical terms is available, which constitutes a form of ‘open-book testing’. As with real-life problem solving, the participants can look up information that they do not know by heart, but—of course—none of the documents within the test environment provides a complete solution to the problem scenario. Furthermore—just as in real life—many documents provide irrelevant, conflicting and misleading information. In addition, in two of the three scenarios the participants receive an email with distracting information (e.g., a listing of the incomes of industrial clerks in different regions of Germany), which is also irrelevant to the problem but may be tempting to read. The participants cannot consult information outside of the software environment.
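As a minimal illustration of the kind of spreadsheet calculation required in scenario 1, the following sketch computes absolute and relative cost deviations; the cost figures, variable names and the 10 % threshold are hypothetical and do not stem from the actual test material:

```r
# Hypothetical illustration of the deviation analysis in scenario 1;
# all figures are invented and not taken from the test material.
budget <- c(materials = 12000, energy = 3000, maintenance = 1500)  # budgeted costs in EUR
actual <- c(materials = 13800, energy = 2950, maintenance = 2100)  # actual costs in EUR

abs_dev <- actual - budget            # absolute deviation in EUR
rel_dev <- abs_dev / budget * 100     # relative deviation in percent

# flag deviations above a hypothetical materiality threshold of 10 %
data.frame(budget, actual, abs_dev,
           rel_dev  = round(rel_dev, 1),
           relevant = abs(rel_dev) > 10)
```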

The problem scenarios allow for an ecologically valid assessment of domain-specific problem-solving competence with respect to curricular requirements, workplace requirements and authentic problem presentation. The scenarios are specified as a set of XML files, which can be implemented into the computer-based office simulation with a minimum of programming expertise.
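The following fragment sketches what such a scenario specification might look like; the element and attribute names are purely illustrative and do not reproduce the project’s actual XML schema:

```xml
<!-- Hypothetical sketch of a scenario definition; all names are invented. -->
<scenario id="S1" title="Deviation analysis" durationMinutes="30">
  <documents>
    <document file="budget_2014.pdf" relevant="true"/>
    <document file="salary_overview.pdf" relevant="false"/> <!-- distractor -->
  </documents>
  <emails>
    <email from="supervisor" trigger="start" subject="Budget deviations"/>
  </emails>
  <eesEvents>
    <eesEvent minute="10" items="C1,C2,D1,D2"/>
  </eesEvents>
</scenario>
```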

Computer-based office simulation

The participants register with the software using a predefined password, choose a last name from a given list and enter a first name, by which they are addressed during the following scenarios. The model company is then introduced via a slideshow with short subtitles. The slideshow is followed by a tutorial introducing the participants to the features of our custom-built office simulation Technology-Based Domain-Specific Learning Assessment (TeBaDoSLA). The tutorial is highly structured and ensures that all participants master the relevant features of the software. The software provides the typical features of an office environment, such as a file system with a hierarchical folder structure, a file viewer, an email client, a calculator, a notepad and a clock that shows the remaining time for 3 s when clicked on. The core of the office simulation is a spreadsheet application, which provides most of the common functions of standard software such as Microsoft® Excel®. Altogether, the result is an authentic task environment for the holistic processing of the problem scenarios without any artificial fragmentation. Thus, not only the problem scenarios but also the open problem space (i.e., the entirety of possible system states and available operators; Newell and Simon 1972; see also ‘outcome space’, Wilson et al. 2012) were developed with regard to ecological validity. Figure 2 shows a screenshot of the office simulation software.

Fig. 2 Screenshot of the office simulation software (translated from German by the authors)

The test environment records each valid mouse click and keystroke with time stamps. The resulting log-file data enable detailed process analyses, which are intended to reveal the metacognitive strategies of the participants. However, log-file analyses are not part of the current paper; instead, we focus on components A, C and D of our competence model (see Fig. 1).

Implementation of EES

Although non-cognitive facets of problem-solving competence are prevalent in contemporary theoretical modeling, they are often neglected in measurement approaches. Focusing only on cognitive variables is often legitimized with reference to Weinert, who suggested analyzing cognitive and non-cognitive facets separately (Klieme and Leutner 2006, p. 880; Klieme et al. 2008, p. 9). From our perspective, disregarding non-cognitive facets does not do justice to Weinert’s approach, since he claimed that ‘… it would not be useful to restrict attention to cognitive and metacognitive competencies if one is concerned with success in broad fields of action across a variety of tasks (e.g., in school, in social institutions, or in a profession)’ (Weinert 2001, p. 61). However, if non-cognitive facets are measured at all, the method of choice is usually self-report questionnaires. This poses methodological problems: while the tasks to be solved are highly concrete and embedded in a certain context, questionnaires on non-cognitive facets such as domain-specific self-concept or interests are usually phrased in very universal terms. The use of different methods—task-specific performance vs. universal self-reports—leads to weak empirical relationships between cognitive and non-cognitive facets, which are often misinterpreted as a low impact of non-cognitive variables (Dermitzaki et al. 2009; Sembill et al. 2013). In a pilot study of our project with 100 VET students, no significant correlations were found between the cognitive component of domain-specific problem solving and either work-related self-efficacy (r = .188, n.s.) or vocational interest (r = −.026, n.s.). While the cognitive component of domain-specific problem solving was measured on the basis of three complex scenarios (similar to the approach presented in this paper), the non-cognitive components were measured by universal self-report questionnaires (Rausch under revision). Wittmann and Süß (1999) refer to ‘Brunswik symmetry’ (named after Brunswik 1952) as an explanation for such phenomena. Brunswik symmetry suggests that every level of generality on the predictor side has its symmetrical level of generality on the criterion side; maximum predictability can only be obtained when predictor and criterion are symmetrical (see also Ackerman and Beier 2006). This is apparently not the case when specific task performance is predicted by very broad self-evaluated personality traits.

We developed an approach to measure non-cognitive facets of competence during problem solving, which we refer to as EES. ‘Embedded Experience Sampling’ (EES) builds on the ‘Experience Sampling Method’ (ESM) introduced by Csikszentmihalyi and colleagues (Hektner et al. 2007) and similar methods of collecting data ‘in situ’, such as the ‘Continuous State Sampling Method’ (CSSM) introduced by Sembill and colleagues (2002). Test-takers are requested to stop at certain times during the test and spontaneously answer short prompts (EES items) regarding their actual experience of the problem situation. These EES events are embedded into the problem situation in a way that resembles common social interaction in the workplace. In doing so, we aim to reduce the artificiality of otherwise isolated questions that are usually administered in supplemental questionnaires. Closed-ended questions were used in order to spare the test-takers the time and effort they would need to write down their answers. Furthermore, closed-ended prompts improve the comparability of the answers and facilitate the use of EES in large-scale assessments. Thus, a participant’s answer is largely pre-specified (e.g., ‘Hi Julian, that’s very nice of you. At the moment, I feel …’). The EES items are rated on a Likert scale (e.g., from 1 = not nervous at all to 4 = very nervous). EES focuses on non-cognitive constructs such as interest, attitudes, commitment, self-concept and so on, which cannot be observed or inferred otherwise. In our research, EES serves to measure the non-cognitive facets of competence presented in Fig. 1 (components C and D). The EES events are integrated into the office simulation and pop up at predefined times during the problem scenarios. Figure 3 shows a screenshot of an EES event within the office simulation. Participants need to answer four closed EES items before they can return to the problem scenario.

Fig. 3 Example of an EES event with four EES items (translated from German by the authors)

Measuring non-cognitive facets within the problem-solving process is assumed to yield better ecological validity than administering unspecific, retrospective self-report questionnaires that are detached from the context. In addition, bias due to social desirability (Harley 2016) might decrease in EES compared to retrospective self-reports, owing to the concurrent cognitive load and time pressure during the problem-solving process (Stodel 2015). In group discussions and one-to-one interviews, the participants of a pilot study reported that they liked the idea of EES. They experienced the depicted situations as quite realistic, since such occurrences were part of their everyday working environment. Interestingly, they reported that they did not elaborate on what would be ‘good answers’ but instead answered spontaneously, as requested.

In PISA 2006, for instance, interest was measured ‘in situ’ by requesting short ratings of interest in scientific domains directly after particular test items in the field of science (Drechsel et al. 2011). However, these items were not embedded into the ‘storyline’. An approach similar to ours is the ‘affect self-report device’ applied in the game-based learning environment ‘Crystal Island’. During their interaction with the learning environment, participants received an in-game prompt asking them to report on their cognitive and emotional states; these status updates were framed as part of an in-game social network (Sabourin and Lester 2014). Another example is the ‘Belief Meter’ within the computer-based learning environment ‘BioWorld’, on which medical students report their confidence in their final diagnosis as a percentage (0–100 %) during problem solving (Jarrell et al. 2016). However, these in-game self-reports were not designed to assess facets of competence. Aside from these recent and inspiring works, we did not find further comparable approaches.

Research questions

In the empirical section the focus lies on the reliability of the assessment. First, we analyze whether the above approach allows for a reliable measurement of the cognitive facets in the competence component knowledge application. Furthermore, we analyze whether the above approach allows for a reliable measurement of the non-cognitive facets in the competence components self-concept and interest.

While the scenarios were developed with respect to industrial clerks (IC), they were also administered to IT-systems management assistants (ITMA) and merchants in wholesale and foreign trade (MWFT). Their apprenticeship programs are similar to that of industrial clerks. However, the domain addressed in the problem scenarios (‘controlling’, see above) is of less significance in the curricula of ITMA and MWFT apprentices. Given a valid assessment of domain-specific competence, IC apprentices are expected to outperform the comparison groups. This was also confirmed in a previous pilot study (Wuttke et al. 2015).

Methods

Sample

The main study took place between April and September 2014. The sample was approached via vocational schools, but participation was voluntary both at the school level and at the individual level of each student. A total of 786 VET students from various German federal states participated in the study, of whom six were excluded from the analyses due to missing data (caused by either a lack of willingness or technical malfunctions of the test software). All of the remaining 780 participants (50.1 % female) were in the second or third year of a 3-year commercial apprenticeship program and showed a typical right-skewed age distribution (M = 21.3 years; SD = 2.69; min = 17; max = 44). Of the total sample, 537 were enrolled in an apprenticeship program to become industrial clerks (IC), 106 were apprentice IT-systems management assistants (ITMA), and another 137 were apprentice merchants in wholesale and foreign trade (MWFT).

Procedure

All data were collected in computer-equipped classrooms in vocational schools. At the beginning of the data collection sessions, the researchers introduced the project and the agenda. They also provided information about anonymity, data protection, and ethical considerations and emphasized that participation was voluntary. All participants provided written informed consent before completing any of the assessments. Before and after the problem scenarios, the participants completed several self-report questionnaires, including scales on vocational interest, work-related self-concept, and several antecedents of apprenticeship success (Baethge-Kinsky et al. 2016), as well as further tests of general cognitive ability (German version of Cattell’s Culture Fair Test developed by Weiss 2006), domain-specific content knowledge (based on test items from final exams), and literacy and numeracy (Ziegler et al. 2016). However, these instruments are not the focus of this paper.

When participants registered in the computer-based office simulation, they were introduced to the underlying model company and the features of the software, before working on the three problem scenarios. Each problem scenario was followed by a short questionnaire intended to assess test motivation, self-assessed quality of the problem-solving process, self-assessed quality of the proposed solution and so forth. Altogether, the procedure lasted 5 h. In the following, we focus on the internal consistency and internal validity of the assessment of domain-specific problem-solving competence.

Results

Reliability of the cognitive facets measured by content analyses

By providing a very open problem space we aimed at ecological validity, as the given problems were designed like real-life scenarios without clear instructions. In the end, the estimation of competence scores for each facet is based on only three stimuli. However, scoring such complex and open-ended responses is laborious—especially in large-scale assessment—and may also impair the reliability of the assessment (Wilson 2008). The scoring process was carried out in three steps as, for instance, suggested by Bennett et al. (2003) in the context of assessing problem solving in technology-rich environments (TRE) within the National Assessment of Educational Progress (NAEP) in the United States. The three steps comprise two levels of coding followed by an IRT analysis.

  1.

    On the first level of coding, the participants’ solutions were analyzed on the basis of fine-grained category systems according to the qualitative content analysis approach by Mayring (2014). Graduate students were trained to rate the categories. They used additional software (‘Rating Suite’) to display the participants’ solutions and rated them according to the coding guide. The coding guide provided definitions, coding rules and examples for the coding of each category in each scenario. The categories were designed against the background of domain-specific quality standards that were identified during the domain analysis (see Appendix Table 6). Some categories were identical for all three scenarios (e.g., all categories in facet 4 ‘communicating the decision appropriately’), while most of them were scenario-specific (e.g., coding which of the relevant documents were used). Altogether, the category systems for the three scenarios comprised 97 categories (22 for the first, 34 for the second and 41 for the third scenario), each of which corresponded to an item (which we denote as level-one items) and was assigned to one of the four facets of knowledge application (see Fig. 1). Table 1 shows the hierarchical decomposition (top down).

    Table 1 Hierarchical decomposition of the facets of knowledge application

    Human raters assessed, for instance, the quality of arguments (category 3.1; Table 1), but many level-one items were scored automatically on the basis of log-files, for instance whether relevant documents were found (category 1.2) and many of the calculations in the spreadsheet (category 2.1). Altogether, automated and human rating resulted in 97 mostly dichotomous level-one items across the three scenarios. For each item, a higher value indicates a higher quality of the solution. Double coding made it possible to refine the coding guide and the rater training on the basis of the inter-rater reliability for each item.

  2.

    On the second level of our two-level coding, we aggregated the 97 level-one items from the fine-grained coding process into one partial credit item for each competence facet and each scenario (4 × 3 = 12 partial credit items, which we denote as level-two items). For this purpose, the response patterns in the level-one items of one competence facet and one scenario were extracted and ordered by the sum score of the items. Thus, a low sum score is a first indicator of a low quality of the solution. Subsequently, experts rated each response pattern with regard to the quality of the solution as compared to other response patterns. The experts not only decided on cutoff values between lower and higher partial credits but also defined weightings or necessary preconditions with regard to the content of the problem scenario. Assigning credit points to each response pattern resulted in one (level-two) partial credit item per facet and per scenario, each of which had four to seven categories. Thus, the estimation of competence scores for each competence facet is based on only three items. Nevertheless, these partial credit items provide rich information (e.g., 3 partial credit items with 5 categories each equal 12 dichotomous items). Besides the strong qualitative verification of the dimensionality that comes along with the assignment of the partial credits, the main reason to include just one item per scenario in the IRT analysis is to avoid local item dependence (LID). A major problem in evaluating complex scenarios is the strong local dependence of the items that refer to the same scenario, and the corresponding LID is known to bias reliability, item difficulty estimates, and variance and covariance estimates, as has been shown by many authors (see, e.g., Brandt 2012; Sireci et al. 1991; Wainer et al. 2007; Yen 1993). A further option to account for the scenario-based LID might have been to model the observed dependencies, for example via a hierarchical model such as the Rasch testlet model (Wang and Wilson 2005). The covariance structure of the testlet-specific factors, however, is typically not as the model assumes (namely, that they are uncorrelated); furthermore, the covariances can change depending on the (sub-)sample considered. Such changes then affect the calculation of the general factor, making the latter sample-dependent. This is also the reason why these models are not used in the known large-scale assessments, such as the Programme for International Student Assessment (PISA) or the National Assessment of Educational Progress (NAEP). We therefore preferred an approach that avoids LID by the design of the underlying items rather than an approach based on modeling.

  3.

    In a third step, the four cognitive facets of the competence component knowledge application were at first analyzed separately in order to investigate the fit of the constructed partial credit items. For all item and step parameters, the calculated infit values ranged between .95 and 1.05; that is, the items show good fit.

Thereafter, the test was analyzed using a four-dimensional partial credit model (Masters 1982) including background information such as gender, age, vocation, intelligence, the answer data from the non-cognitive facets, and other relevant variables. All calculations were conducted using the R package TAM (Kiefer et al. 2015). Table 2 shows the EAP/PV reliabilities (on the diagonal) and latent correlations between the competence facets.
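A minimal sketch of such a calibration with the R package TAM is shown below; the object names (responses, background) and the construction of the Q-matrix are illustrative assumptions, not the project’s actual analysis script:

```r
# Sketch of a four-dimensional partial credit model with latent regression in TAM;
# 'responses' (12 polytomous level-two items) and 'background' (covariates) are hypothetical objects.
library(TAM)

# Q-matrix: each of the 12 partial credit items loads on one of the four facets A1-A4
Q <- matrix(0, nrow = 12, ncol = 4)
Q[cbind(1:12, rep(1:4, each = 3))] <- 1     # three scenarios per facet

mod <- tam.mml(resp     = responses,        # item responses (0, 1, 2, ...)
               irtmodel = "PCM",            # partial credit model (Masters 1982)
               Q        = Q,                # item-to-dimension assignment
               Y        = background)       # covariates for the latent regression

tam.fit(mod)            # infit/outfit statistics for item and step parameters
mod$EAP.rel             # EAP reliabilities per dimension
cov2cor(mod$variance)   # latent correlations between the four facets
pv <- tam.pv(mod)       # plausible values for subsequent analyses
```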

Table 2 EAP/PV reliabilities and latent correlations of the facets of knowledge application

The EAP/PV reliabilities of the four cognitive facets are satisfactory; compared to the pilot study (Wuttke et al. 2015), they increased considerably. The latent correlations between the facets are of medium size on average and reflect the multidimensionality of the competence component ‘knowledge application’. The multidimensionality of the construct is further supported by the comparison of the likelihoods of the unidimensional and the multidimensional model. While the unidimensional model shows a deviance (i.e., −2 log-likelihood) of 16,178.8, the multidimensional model shows a deviance of 16,058.6, which results in a significant chi-square test (df = 9) as well as AIC and BIC values (16,283 vs. 16,181 and 16,504 vs. 16,440, respectively) in favor of the multidimensional model.
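The reported deviances imply the following likelihood-ratio statistic (our arithmetic, based on the values above):

$$\chi^2 = 16{,}178.8 - 16{,}058.6 = 120.2, \qquad df = 9, \qquad p < .001$$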

Reliability of the non-cognitive facets measured by EES

For an integrated measurement of non-cognitive facets, embedded experience sampling (EES; see above) was used. Appendix (Table 7) provides an overview of the EES events, the respective competence facets (see Fig. 1), and the EES items, which were the same for all three problem scenarios.

Initially, a six-dimensional partial credit model (Masters 1982) including all non-cognitive facets was calibrated. Facet D3 (Interest in the progress of/in learning from the problem), however, showed insufficient reliability (EAP/PV reliability = .30) and was excluded. The final model therefore included only five dimensions and was estimated using various variables as background information (cf. the estimation of the cognitive facets above). Table 3 shows the EAP/PV reliabilities (on the diagonal) and the latent correlations between the five remaining non-cognitive competence facets.

Table 3 EAP/PV reliabilities and latent correlations of the non-cognitive facets

The EAP/PV reliabilities of the five non-cognitive facets are satisfactory. The latent correlations between the non-cognitive facets are slightly higher than those between the cognitive facets; they can still be considered moderate, though, with only one correlation being larger than .70 (between facets C1 and C3). For a conference paper focused on the EES approach, we also calculated the correlations between the non-cognitive facets as measured by EES and similar constructs measured by universal questionnaires (Rausch et al. 2016). We found only small correlations, both between the facets of component C and work-related self-efficacy (.18 < r < .27) and between the facets of component D and vocational interest (.10 < r < .25).

Correlations between cognitive and non-cognitive facets

Table 4 shows the latent correlations between the cognitive and the non-cognitive facets based on the plausible values (the response data from the cognitive facets were included as background information in the estimation of the model for the non-cognitive facets and vice versa; the correlations calculated via plausible values are therefore also latent correlations).

Table 4 Correlations between cognitive and non-cognitive facets of competence

In general, the correlations are all positive and of small to medium size. However, the correlations also show certain tendencies with regard to the relationships between the dimensions. The correlation between facet A4 (‘communicating the decision appropriately’) and facet C1 is significantly smaller (according to the Fisher r-to-z transformation) than the correlations of A1, A2, and A3 with C1; the same holds for facet C3. For the remaining three non-cognitive facets, the differences in the correlations are not statistically significant; however, all values show the same tendency. Averaged across the non-cognitive facets, the correlation with A4 is also significantly smaller than the correlations with A1, A2, and A3. A similar tendency can be observed for the correlations between the facets of domain-specific self-concept (C1, C2, and C3) and the facets of knowledge application (A1 through A4). While not all correlations between self-concept and knowledge application are significantly larger than the correlations between domain-specific interest (D1 and D2) and knowledge application, averaging across the corresponding correlations again results in significantly smaller relationships between the facets of interest and knowledge application than between the facets of self-concept and knowledge application. When checking the correlations in the subgroups of the different vocations, we found that in the subgroup of merchants in wholesale and foreign trade the correlations between A4 (communicating the decision appropriately) and the non-cognitive facets were smaller (some of them zero). Possible explanations are discussed below.
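For reference, the comparison of two correlation coefficients via the Fisher r-to-z transformation relies on the standard formulae (given here in their basic form for two independent correlations; the exact variant used for the reported tests is not detailed above):

$$z(r) = \tfrac{1}{2}\ln\frac{1+r}{1-r}, \qquad z_{\text{diff}} = \frac{z(r_1) - z(r_2)}{\sqrt{\tfrac{1}{n_1-3} + \tfrac{1}{n_2-3}}}$$

where $z_{\text{diff}}$ is referred to the standard normal distribution.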

Differences between VET students of different vocations

In a first step, a differential item functioning (DIF) analysis was conducted in order to investigate whether the test included items that were particularly unfair towards one of the vocations. Using the R package TAM, facet models were again calibrated, which yielded the differences in the item difficulties for each of the three groups. The size of DIF effects is typically classified into three categories (Zieky 1993, 2003):

  • Negligible effect: <.43 logits

  • Slight to moderate effect: ≥.43 and <.63 logits

  • Moderate to large effect: ≥.63 logits

All of the DIF effects for the items of facets A1, A2, A4, C2, C3, D1 and D2 were negligible; only facet A3 had two items with slight to moderate effects, and facet C1 had one item with a slight to moderate effect. Due to the small effect sizes, we nevertheless decided to include these items in the comparison of the groups. In the second step, the competences of the three training vocations were compared. Figure 4 graphically displays these results, and Table 5 gives more details, particularly concerning the significance of the differences. All calculations were based on plausible values.
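The classification step can be expressed compactly; the following helper merely applies the thresholds listed above and is not code from the project:

```r
# Illustrative helper that classifies absolute DIF effects (in logits)
# according to the thresholds given above (Zieky 1993, 2003).
classify_dif <- function(dif_logits) {
  cut(abs(dif_logits),
      breaks = c(0, 0.43, 0.63, Inf),
      labels = c("negligible", "slight to moderate", "moderate to large"),
      right  = FALSE)   # intervals [0, .43), [.43, .63), [.63, Inf)
}

classify_dif(c(0.12, 0.45, -0.70))   # hypothetical example values
```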

Table 5 Group differences for competence facets between IC and ITMA and between IC and MWFT
Fig. 4 Comparison of the mean scores of industrial clerks (IC), IT-systems management assistants (ITMA), and merchants in wholesale and foreign trade (MWFT) across the nine facets

As hypothesized, the VET students in an apprenticeship program to become industrial clerks outperform the comparison groups. However, only small to medium effect sizes were found. The largest effects were found for the cognitive facets A1 ‘Identifying needs for action and information gaps’ and A2 ‘Processing information’. We regard these differences as an indicator of the curricular validity of our assessment, which was developed primarily to meet the curricular requirements of industrial clerks (IC). The domain of controlling, and thus the content of our problem scenarios, is also part of the curricula of IT-systems management assistants (ITMA) and merchants in wholesale and foreign trade (MWFT), but it plays a minor role there.

Conclusions

In this paper, a computer-based assessment of domain-specific problem-solving competence in the field of commercial vocational education and training was presented. Based on a multi-faceted model of problem-solving competence (Rausch and Wuttke 2016), the development of the assessment focused on ecological validity, which refers to the congruence between behaviors observed in test environments and in real life, and on content validity with regard to the competence that is actually required in practice. Therefore, authentic problem scenarios were developed on the basis of extensive domain analyses (curricula analysis, textbook analysis, interview and diary studies, etc.; Eigenmann et al. 2015). The hypothesized differences in performance between apprentice industrial clerks and the comparison groups support the assumption of curricular validity of the three problem scenarios in the field of controlling.

We not only developed authentic problem scenarios but also provided an open-ended problem space for working on these problems within an authentic office environment, instead of applying highly structured items (e.g., multiple-choice items). Expanding the problem space for the test-takers (i.e., reducing experimental control) resulted in very heterogeneous behavior patterns and solutions. Nevertheless, statistical tests and indices based on item response theory demonstrate the reliability of the measurement of the cognitive competence facets. We applied a three-step method (similar to Bennett et al. 2003): (1) fine-grained results from a highly structured content analysis were condensed into (2) partial credit items on the basis of consensual expert judgments, and (3) these partial credit items were subjected to psychometric scaling using a multidimensional Rasch model (a publication with a more detailed description of the procedure is in preparation).

Besides the cognitive facets of problem-solving competence, we also consider non-cognitive facets of competence (e.g., self-concept, interest) to play a role in problem solving in the workplace. Therefore, content validity also calls for the measurement of these non-cognitive facets of problem-solving competence. However, we argue against the use of the prevalent self-report questionnaires. Instead, we developed a method—EES—to measure non-cognitive facets of problem solving ‘in situ’. Test-takers are requested to stop at certain times and spontaneously answer short prompts (EES items) regarding their actual experience of the problem situation. Again aiming at ecological validity, these EES events are embedded into the problem situation in a way that resembles common social interaction in the workplace. Statistical tests and indices based on item response theory demonstrate the reliability of the measurement of the non-cognitive competence facets across the three problem scenarios. However, only five of the six non-cognitive facets could be measured reliably. Facet D3 (Interest in the progress of/in learning from the problem) showed a very low EAP/PV reliability and had to be excluded from the analysis. In our view, this is due to our approach of asking for several competing activated motives (see Appendix Table 7), which did not work out as anticipated.

The correlations between the four cognitive and the five remaining non-cognitive facets were all positive and of moderate size. In a pilot study, we had assessed the cognitive facets of problem-solving competence in a similar way to the present study and found smaller (even zero) correlations with non-cognitive facets, which were then measured by universal self-report questionnaires (Rausch under revision). In the present study, the correlations between the cognitive facet A4 (communicating the decision appropriately) and several of the non-cognitive facets were markedly smaller in the subgroup of merchants in wholesale and foreign trade (MWFT) than in the other subgroups. Although the MWFT apprentices performed more poorly in the cognitive facets A1, A2 and A3, they still managed to produce an appropriate email reply with regard to domain-specific language, communication standards, structure and formal standards. Apparently, non-cognitive facets such as self-concept are more closely linked to the ‘core processes’ of problem solving. Table 5 shows further interesting differences between the three training programmes that, due to lack of space, cannot be discussed in detail.

We want to emphasize that we did not model the non-cognitive facets as mere explanatory or even confounding factors of the ‘true cognitive competence’ but as competence facets in their own right. Decomposing domain-specific problem-solving competence into various facets and, at the same time, providing an integrated measurement offers opportunities for a differentiated assessment of competence profiles and individualized interventions (Herl et al. 1999; Sugrue 1995). We also postulated metacognitive facets of problem solving, which have not yet been addressed. We plan to identify metacognitive patterns on the basis of the log-files that are already available from the present study. Inspiring research on pattern recognition in log-files is available, for instance, for the game-based learning environments ‘Crystal Island’ (Sabourin et al. 2013) and ‘Betty’s Brain’ (Biswas et al. 2014). A further limitation of our current approach is the absence of a social component of problem solving, since cooperation and collaboration are a major way of solving work-related problems in real life (Rausch et al. 2015). It would be an exciting challenge to integrate cooperative and collaborative features into authentic problem scenarios and hence into an authentic office simulation. Furthermore, the degree of automated coding could be increased in order to reduce the effort of human coding. Finally, this would also increase the opportunities for dissemination into practice.