
1 Introduction

Tess is a school leader who must decide how to spend the school's resources on professional development. Today she has an assessment interview with John, whose lesson she visited yesterday. Her rubric scores signal clear directions for improvement in the clarity of John's front-of-class explanations, yet the student questionnaire administered one month ago gives few signs of poor explanation skills. Instead, the student results signal that John could improve on the interactivity of his instruction. Tess wants to use the interview to plan and guide John's further professionalization. However, how can she use the available information to provide John with actionable feedback that is likely to add to John's teaching? Also, Tess knows that John wants to discuss opportunities to participate in further training. Yet, the evidence on John's instructional skill is inconclusive. Hence, on what grounds should she accept or refuse the request?

Imaginary situation based on conversations with school leaders and teachers.

Over the last decade, teacher evaluation has held a central position in policies aiming to improve educational quality in many countries (e.g., Doherty & Jacobs, 2013; Isoré, 2009; Nusche et al., 2014). In the Dutch context, the Ministry of Education published the "teacher agenda", which documented several challenges, objectives, and policy measures meant to increase the quality of the Dutch teacher workforce. One objective was to increase the frequency of performance evaluations in schools (Nusche et al., 2014; OECD, 2016). In the eyes of the policy-makers, performance evaluations were a means to turn schools into "learning organizations" by functioning as a yearly update and a reminder of teachers' and school leaders' commitment to increasing educational quality. To realize this objective, the councils for primary (PO-raad) and secondary (VO-raad) education and the teacher labor unions agreed to install a new differentiated payment system and to assign every teacher a personal professionalization budget (Nusche et al., 2014; OECD, 2016). This created an incentive for teachers to request a performance evaluation interview to discuss evidence of instructional effectiveness. If the evidence of effectiveness was insufficient to qualify for a salary raise, the teacher was to be informed about the steps and/or skills required to qualify. Teachers could use their personal professionalization budget to train these skills. In practice, these policies implied that school leaders, like Tess, were confronted with the task of distinguishing between "average", "good", and "excellent" teachers, of giving teachers feedback about what they needed to learn, and of organizing the conditions under which teachers can start to learn.

The research project described in this chapter took place within this context and examined statistical methods and models that could assist school leaders in distinguishing between teachers in terms of their level of instructional effectiveness. Furthermore, the new methods needed to result in feedback that clearly indicated a specific direction for improvement. The instruments used to collect data were student questionnaires and classroom observations. These two instruments were chosen because they share the strength of collecting direct observations of teachers' classroom behavior (Darling-Hammond, 2013; Goe et al., 2008; Peterson, 2000). However, as the situation of Tess and John shows, the feedback resulting from the student questionnaire and the feedback resulting from the classroom observation instrument do not always agree.

The chapter starts with an introduction of the central concepts and how these were operationalized. The applied methods are introduced at a conceptual level, and it is discussed how the models relate to other commonly applied statistical methods. After the background section, the chapter focuses on the problem of agreement between feedback sampled with student questionnaire and classroom observation instruments.

2 Background Theory and Definitions of Central Concepts

In the sketch at the beginning of this chapter, Tess is wondering how she may use student questionnaires and classroom observation instruments as two complementary sources of information to provide John with actionable feedback. This highlights some central concepts of this chapter, namely instructional effectiveness, improvement, and actionable feedback.

2.1 Instructional Effectiveness

In this chapter, instructional effectiveness is viewed as an estimation of the degree to which teachers' classroom behavior is expected to give students the opportunity to maximize their learning potential. By stating that instructional effectiveness provides students with opportunities to learn, the definition clarifies that instructional effectiveness is associated with, but not identical to, student achievement and school success, which are realizations of these opportunities. Furthermore, the definition clarifies that instructional effectiveness is estimated, meaning that any claim about it carries some level of uncertainty. The research described in this chapter operationalized instructional effectiveness using two instruments, namely the International Comparative Assessment of Learning and Teaching (ICALT), which is a classroom observation instrument, and the My Teacher questionnaire, which is a student questionnaire. The ICALT and My Teacher questionnaire conceptualize an effective instructor as a teacher who scores high on six domains of instruction. The six domains are labeled "safe and stimulating learning climate", "efficiency of classroom management", "clear and structured explanations", "intensive and interactive instructions", "teaching students learning strategies", and "adaptation of instructions to individual student needs". Table 1 details the conceptualization of each domain. Several studies from different countries provide evidence suggesting that the items included in the ICALT and My Teacher questionnaire cluster according to these six domains (André et al., 2020; Maulana & Helms-Lorenz, 2016; van de Grift et al., 2011).

Table 1 An overview of the six domains and their conceptualizations

2.2 Improvement

Another central term in this chapter is improvement, which suggests that teachers can learn or be trained to become more effective instructors. Analogous to Berliner's (2004) novice-expert continuum, the research project and evidence discussed here conceptualize the improvement of instructional effectiveness as unfolding along a continuum ranging from completely ineffective instruction to completely effective instruction. To illustrate how we operationalized this continuum, we start with a one-item example, "This teacher uses time efficiently". Research suggests that more effective instructors in general use time more efficiently than less effective instructors (Muijs et al., 2014). The x-axis in Fig. 1 visualizes the continuum of effective instruction. The y-axis represents the probability of a positive item score. In line with the above statement, Fig. 1 predicts that highly effective instructors have a near 100% probability of a positive score and that instructors low in effectiveness have a near 0% probability. Furthermore, only when teachers have acquired a certain level of instructional effectiveness are they predicted to start to learn how to use time efficiently. This can be inferred from Fig. 1 by observing that the probability of a positive score starts to rise at a certain location on the continuum. Key to the perspective taken in this chapter is that training of teachers' instructional effectiveness is optimal when it is focused on items that match the teacher's location on the continuum.

Fig. 1
A graph plots P(x = 1) against the continuum of instructional effectiveness; the curve is S-shaped.

Increase in the probability of a positive response to the classroom observation item "This teacher uses time efficiently"
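
The S-curve in Fig. 1 corresponds to an item characteristic curve of the kind used in item response (Rasch-type) models: the probability of a positive score increases with the teacher's position on the continuum. A minimal Python sketch of such a curve is given below; the item location of 0.5 logits is a hypothetical value chosen for illustration, not an estimate from the ICALT data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Continuum of instructional effectiveness (in logits)
theta = np.linspace(-4, 4, 200)

# Hypothetical location of the item "This teacher uses time efficiently":
# the point on the continuum where P(x = 1) = 0.5
item_location = 0.5

# Logistic (Rasch-type) item characteristic curve
p_positive = 1.0 / (1.0 + np.exp(-(theta - item_location)))

plt.plot(theta, p_positive)
plt.xlabel("Continuum of instructional effectiveness")
plt.ylabel('P(x = 1), "This teacher uses time efficiently"')
plt.ylim(0, 1)
plt.show()
```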

2.2.1 Relation of the Applied Relatively Novel Statistical Model to Other Statistical Models

The proposition that all items measuring instructional effectiveness are associated with a single continuum seems to conflict with research that groups items of instructional effectiveness according to dimensions of teaching quality which load on separate (statistical) factors (e.g., studies applying factor analysis). Figure 2 is used to discuss and visualize the relationship between the continuum discussed in this chapter and factor analysis results. Figure 2 again visualizes the continuum of instructional effectiveness, but now includes multiple items. Solid, dashed, and dotted lines indicate clusters of items that have high(er) inter-item correlations (i.e., load on separate factors). The reader can derive this logically by (mentally) moving the icon displayed directly below the figure along the continuum. For example, teachers positioned at the icon have an approximately 50% probability of positive scores on the dotted items, but a near 0% probability of positive scores on the dashed and solid ones. When the teacher moves up the continuum, the probability of positive scores on the dashed items increases first, while the probability on the dotted items remains high and the probability on the solid items remains low. Thus, some teachers likely have high scores on the dotted items and low scores on all other items, other teachers likely have high scores on the dotted and dashed items but low scores on the solid items, and yet others likely have high scores on all items. However, it is unlikely that teachers score high on the solid items but low on the other items. This scoring pattern, which in the factor analytic literature is referred to as the simplex pattern, is detected by factor analysis as a sign that the dotted items and dashed items are in distinct clusters (interested readers may consult Browne [1992] or Jöreskog [1978] for further details). It is acknowledged that the figure presents an oversimplification of the relationship between factor analysis and the continuum described in this chapter. For example, item slopes, i.e., the steepness with which the S-curved lines increase, are rarely exactly parallel. Such differences in item slope also affect the assignment of items to factors. However, Fig. 2 illustrates the basic rationale of how different factors can arise from items located at different positions on a single continuum.

Fig. 2
A graph plots P(x = 1) against the continuum of instructional effectiveness; it shows three S-curves.

The continuum of instructional effectiveness in which indicators are grouped into three factors
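
A rough numerical illustration of this rationale is sketched below. The three clusters of item locations are hypothetical (they are not the actual ICALT or My Teacher estimates). Under a single continuum, items located close together produce higher inter-item correlations than items located far apart, which is the simplex-like pattern that factor analysis tends to pick up as separate clusters.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical item locations in three clusters along the continuum
# (corresponding to the dotted, dashed, and solid curves in Fig. 2)
locations = {
    "dotted": [-1.5, -1.2, -1.0],
    "dashed": [0.0, 0.2, 0.4],
    "solid":  [1.5, 1.8, 2.0],
}
b = np.concatenate(list(locations.values()))

# Simulate teachers spread along the continuum and their dichotomous item scores
theta = rng.normal(0.0, 1.5, size=5000)
p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
X = rng.binomial(1, p)

# Inter-item correlations are highest within a cluster and decrease with the
# distance between item locations: a simplex-like correlation structure
R = np.corrcoef(X, rowvar=False).round(2)
labels = [name for name, locs in locations.items() for _ in locs]
print(labels)
print(R)
```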

2.2.2 The Sequence of Clusters Along the Continuum

The research group at the University of Groningen has given much empirical attention to the ordering of these factors along the continuum of instructional effectiveness (e.g., Maulana et al., 2015a, 2015b; van de Grift et al., 2011, 2014; van der Lans et al., 2015, 2018, 2021). The results indicated an ordering of the factors in the sequence in which they are presented in Table 1, thus: (1) safe and stimulating learning climate, (2) efficient classroom management, (3) clear and structured explanations, (4) intensive and interactive instructions, (5) teaching students learning strategies, and (6) adaptation of instructions to individual students' learning needs. The validity of this ordering was further corroborated by research in Cyprus that applied the same statistical models to its own questionnaire and observation instruments and reported broadly similar results (e.g., Kyriakides et al., 2009, 2018).

Based on these results, the author developed a feedback report to provide teachers with our best estimate of their current position on the continuum of instructional effectiveness and our best estimate of what they could improve on next. Figure 3 presents the two reports that the author used to give teachers feedback. The left report concerns the classroom observation instrument and the right report concerns the student questionnaire instrument. Each report shows a table with three columns. The column "level" lists the six identified levels (or domains) of instructional effectiveness. The column "item" lists the items included in the instruments. Finally, the column "teacher score" indicates what probably went well (darkest grey, top area), what probably can be learnt next (lightest grey, middle area), and what is probably still beyond the teacher's competency to learn (grey, lowest area). The asterisk indicates the exact teacher position on the continuum of instructional effectiveness.

Fig. 3
A screenshot of a page with two tables, each with three columns. The column headers are level, item, and teaching skill score.

(Note These two feedback reports are slightly adapted versions of the ones that were originally provided to teachers)

ICALT (left) and My Teacher (right) feedback reports. Darkest grey (top area) = probably skilled; lightest grey (middle area) = try to learn now; grey (lowest area) = probably not yet skilled
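
How the shaded areas in Fig. 3 can be derived from the continuum is sketched below. The exact decision rules behind the original reports are not spelled out in this chapter, so the sketch assumes a simple band around the estimated teacher position; the item labels and locations are hypothetical.

```python
# Hypothetical items with their (calibrated) locations on the continuum, in logits
ITEMS = [
    ("creates a safe learning climate",        -2.0),
    ("uses time efficiently",                  -0.8),
    ("explains subject matter clearly",         0.1),
    ("involves all students in the lesson",     0.9),
    ("teaches students learning strategies",    1.8),
    ("adapts instruction to individual needs",  2.6),
]

def feedback_report(teacher_position: float, band: float = 0.7):
    """Classify items relative to the teacher's estimated position on the continuum."""
    report = []
    for label, location in ITEMS:
        if location < teacher_position - band:
            category = "probably skilled"          # darkest grey area in Fig. 3
        elif location <= teacher_position + band:
            category = "try to learn now"          # lightest grey area
        else:
            category = "probably not yet skilled"  # grey lowest area
        report.append((label, category))
    return report

for label, category in feedback_report(teacher_position=0.4):
    print(f"{label:42s} {category}")
```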

2.3 Actionable Feedback

The third central concept in this chapter is actionable feedback. Cannon and Witherspoon (2005) describe actionable feedback as feedback that leads to learning and increased performance. Evidence indicates that feedback is actionable when (1) it is directed at the task, not the person, (2) it is unambiguous and specific, and (3) it has clear implications for action (Cannon & Witherspoon, 2005; Kluger & DeNisi, 1996). The aim was to design feedback reports that would assist the feedback giver, presumably a school leader or coach, in making the feedback actionable. Therefore, the reports emphasize what the teacher does (e.g., the teacher involves students, explains clearly) and attempt to communicate as specifically as possible what went well and what can be improved. In addition, the reports accompanied this information with specific implications for action. Participating teachers found this approach informative and often recognized themselves in the feedback. Nonetheless, the actionability of the feedback was hindered by various organizational and psychometric factors. Organizationally, schools lacked an infrastructure to support training activities at this level of precision. Though this was not part of the research project discussed here, it is nonetheless important to mention. Psychometrically, the disagreement between students and observers created uncertainty about the reliability of the estimates. Take, for example, the two feedback reports in Fig. 3. The students on average positioned the teacher at the item "my teacher involves me in the lesson" and signal that the teacher should focus on improving the clarity and structure of explanations. The classroom observer, however, positioned the teacher at the item "encourages students to apply what they have learnt" and signals that the teacher should improve on teaching students learning strategies. Moreover, the observer's report suggests no problems with the clarity and structure of the teacher's explanations. Hence, in case of disagreement the feedback reports no longer unambiguously communicate what went well and which domains of instruction need improvement. Also, the implications for action are no longer clear.

3 Prior Research on the Disagreement Between Classroom Observation and Student Questionnaires

Prior research suggests that the disagreement between students and observers may be frequent and/or substantial. Studies documenting correlations between observation and survey measures mostly report modest correlations in the range of 0.15–0.30 (e.g., De Jong & Westerhof, 2001; Ferguson & Danielson, 2014; Howard et al., 1985; Martínez et al., 2016; Maulana & Helms-Lorenz, 2016). Designs varied considerably between studies, however. For example, De Jong and Westerhof (2001) report on the correlation between a classroom observation instrument and a student questionnaire that had considerably different factor structures, whereas Maulana and Helms-Lorenz report on the correlation between the ICALT and the My Teacher questionnaire, which have an overlapping factor structure. Because the study by Maulana and Helms-Lorenz (2016) applied the same instruments, their results are most relevant to the discussion in this chapter. They report a correlation of 0.26. This correlation was replicated in the data used in the research project reported on in this chapter. This modest correlation suggests that feedback reports of students and observers will more often disagree than agree. An exception to the above list of studies reporting modest correlations is the study by Murray (1983), who reports a correlation of 0.76. We will return to Murray's study later in this chapter.

4 Studying Evidence of Agreement and Disagreement Between Questionnaires and Classroom Observation Instruments

Two perspectives can be taken to compare the feedback reports presented in Fig. 3. The first perspective focuses on the teacher's position and, as we have seen, this leads to the conclusion that the students and observers disagree. The alternative perspective focuses on the ordering of the items and domains on the continuum of instructional effectiveness. From this perspective the students and observers mostly agree. Both the classroom observation and the student questionnaire feedback report start with items related to a safe and stimulating learning climate and end with items related to teaching students learning strategies and the adaptation of instruction to individual students' learning needs. The only two domains that are ordered differently by the two methods are these final two domains.

Van der Lans et al. (2019) went one step further and showed that the My Teacher student questionnaire items and the ICALT classroom observation items can be concurrently calibrated on the same continuum of instructional effectiveness. Table 2 lists the joint item ordering mixing observation and questionnaire items. Items denoted with an "s" are student questionnaire items and items denoted with an "o" are classroom observation items.

Table 2 Item ordering that resulted from the concurrent calibration of ICALT observation and My Teacher questionnaire items. This table was originally published in van der Lans et al. (2019) (O = ICALT observation item; S = My Teacher questionnaire item)

Studying Table 2 teaches us that similarly phrased questionnaire and observation items occasionally have similar positions on the continuum. Examples are S39 "my teacher involves me in the lesson" (student) and O11 "this teacher involves all students in the lesson" (classroom observation), and S17 "my teacher encourages me to think for myself" (student) and O19 "this teacher asks questions that encourage students to think" (classroom observation). However, the questionnaire and the classroom observation instrument also contained several items that were instrument-unique, but which nevertheless could be calibrated on the same continuum. The considerable overlap between the questionnaire and the observation instrument has clear practical implications. Suppose that the two reports in Fig. 3 located a teacher at items related to the same domain (e.g., both at the domain of clear and structured explanations); then the observation and student feedback reports would give identical suggestions for improvement and, thus, would be more actionable. That is, there is no conceivable scenario in which the feedback reports are actionable while the ordering of teaching behaviors (items) on the continuum varies between the methods.
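
Conceptually, concurrent calibration means stacking the responses to both instruments for the same teachers into one data set and fitting a single model, so that all item locations end up on one shared scale and a joint ordering such as Table 2 can be read off. The sketch below illustrates this with simulated data and a simple joint estimation routine; the numbers of items, the item locations, and the estimation details are illustrative assumptions and do not reproduce the procedure reported by van der Lans et al. (2019).

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical data: 200 teachers, 10 observation items and 15 questionnaire items
# assumed to tap the same continuum (this is not the real ICALT / My Teacher data)
n_teachers = 200
theta_true = rng.normal(0.0, 1.0, n_teachers)           # teacher positions
b_true = np.concatenate([np.linspace(-2.0, 2.0, 10),    # observation item locations
                         np.linspace(-1.5, 2.5, 15)])   # questionnaire item locations
X = rng.binomial(1, logistic(theta_true[:, None] - b_true[None, :]))

# Concurrent calibration: one Rasch-type model for the stacked response matrix.
# Simple alternating Newton updates; a weak prior on theta keeps estimates finite.
theta = np.zeros(n_teachers)
b = np.zeros(X.shape[1])
prior_var = 4.0
for _ in range(200):
    p = logistic(theta[:, None] - b[None, :])
    theta += (((X - p).sum(axis=1) - theta / prior_var)
              / ((p * (1 - p)).sum(axis=1) + 1.0 / prior_var))
    p = logistic(theta[:, None] - b[None, :])
    b -= (X - p).sum(axis=0) / (p * (1 - p)).sum(axis=0)
    b -= b.mean()                                        # fix the scale of the continuum

# Both instruments now share one continuum, so a joint item ordering can be compared
labels = [f"O{i + 1}" for i in range(10)] + [f"S{i + 1}" for i in range(15)]
print("Estimated joint ordering:", [labels[i] for i in np.argsort(b)])
```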

The combination of agreement in item ordering and disagreement in teacher location also has theoretical implications, because these findings do not fit well with most prior beliefs, theories, and hypotheses concerning the disagreement between questionnaire and observation instruments. For example, a long-standing tradition in educational psychology studies biases in instrument scores. Bias is generally examined by regressing the teacher scores on variables other than instructional effectiveness. Studies in the MET project, like Martínez et al. (2016), regressed the "teacher scores" on variables that were hypothesized to bias measurement. This resulted in a set of 'bias-corrected' teacher scores related to the student questionnaire and a set of 'bias-corrected' teacher scores related to the classroom observation instrument. These corrected scores were then correlated. However, the resulting correlations were similar to the correlations reported in studies that do not correct for bias (cf. Maulana & Helms-Lorenz, 2016; Martínez et al., 2016). More generally, hypotheses reflecting the belief that inferences based on scores need to be corrected for bias do not fit well with the evidence discussed so far. If scores obtained with the instruments were biased and not indicative of instructional effectiveness, then how could we explain the high similarity in the item ordering along the continuum?
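
The bias-correction procedure described above amounts to regressing each set of teacher scores on the hypothesized bias variables, keeping the residuals as 'bias-corrected' scores, and correlating the residuals across instruments. The sketch below illustrates that logic with simulated data; the variable names, the single bias variable (class size), and the effect sizes are illustrative assumptions, not the MET data or models.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300  # hypothetical number of teachers

# Simulated teacher-level scores from two instruments plus one hypothesized bias variable
class_size = rng.normal(25, 5, n)
effectiveness = rng.normal(0, 1, n)
survey_score = 0.5 * effectiveness + 0.1 * (class_size - 25) / 5 + rng.normal(0, 1, n)
observation_score = 0.5 * effectiveness - 0.1 * (class_size - 25) / 5 + rng.normal(0, 1, n)

def residualize(y, covariates):
    """Return y with the linear effect of the covariates regressed out (OLS residuals)."""
    X = np.column_stack([np.ones_like(y)] + covariates)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

# 'Bias-corrected' scores: residuals after regressing out the hypothesized bias variable
survey_corrected = residualize(survey_score, [class_size])
observation_corrected = residualize(observation_score, [class_size])

print("raw correlation:      ",
      round(float(np.corrcoef(survey_score, observation_score)[0, 1]), 2))
print("corrected correlation:",
      round(float(np.corrcoef(survey_corrected, observation_corrected)[0, 1]), 2))
```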

The evidence is also difficult to align with another prominent hypothesis, namely the perspective-specific validity hypothesis stated by Kunter and Baumert (2006). Kunter and Baumert proposed that, despite disagreement, scores obtained with distinct instruments can be used to make valid inferences about teachers' instructional effectiveness, given that the instruments are well designed and administered. Kunter and Baumert do not clearly define what they mean by "perspectives", which makes it complex to empirically assess their hypothesis (see also Fauth et al., 2020). However, many seem to understand different perspectives as meaning that some instruments might be more sensitive in tapping certain aspects of instructional effectiveness. This difference in sensitivity would explain the modest correlation. Also, using multiple instruments could help to offset blind spots, thereby allowing for a fuller and richer picture of instructional effectiveness. The current evidence is insufficient to completely verify this idea, but an analysis of the unique items in Table 2 provides surprisingly limited support for it. Take, for example, item S3 "my teacher makes clear what I need to study for a test". Although this item mentions unique content ("what students need to study for a test" is not part of the ICALT observation list because observers usually cannot know this), item S3 does not have a unique position on the continuum. We could leave out item S3 without losing much information about teachers' instructional effectiveness. As another example, take item S34 "my teacher checks whether I understood the subject matter". The phrasing "whether I understood" focuses on the individual student, and such a focus is not included in any of the classroom observation items. Nonetheless, item S34 is located very close to item O12 "this teacher checks during instruction whether students have understood the subject matter", which measures the same content but has observers focus on the "average" student in the class. In sum, the evidence provides no strong indications that differences between instruments in terms of item content and item focus can explain the disagreement in how students and observers position teachers on the continuum of teaching effectiveness.

The disagreement in teachers' positions on the continuum was examined in another study, which hypothesized that this disagreement changes as a function of measurement reliability (van der Lans, 2018). Two claims were central to that study, and their correctness was empirically assessed. First, it was claimed that the scores assigned by observers reflect the average student in the class. Therefore, agreement was expected to increase when classroom observation scores were correlated with the class-average questionnaire score instead of with scores assigned by a single student. Second, it was claimed that student responses to questionnaire items reflect the teacher's typical teaching across many lessons. Therefore, agreement was expected to increase when classroom observation scores sampled from different lessons were averaged. The study applied generalizability theory to test these predictions and found support for both of them. The predicted correlation is lowest when the questionnaire scores of a single student are correlated with the observation score of one single lesson, and the more student questionnaires are sampled, the higher the predicted correlation with the observation score of a single lesson becomes. Also, the predicted correlation increases when the observation scores are aggregated over multiple lessons, and, again, the more lessons are sampled, the higher the predicted correlation. The study results suggest that the correlation between the questionnaire and the classroom observation instrument increases to 0.76 when the classroom observation scores concern an aggregate of seven different lesson visits performed by seven different observers and the student questionnaire is administered in the same class and spans the scores of 25 different students. This correlation of 0.76 was interesting because of its correspondence to the correlation reported by Murray (1983), which was also 0.76. Murray estimated that correlation based on the aggregate classroom observation score of six to eight lesson visits by three different observers and a student questionnaire administered in the same class whose scores were aggregated over all students in the class. The increase in the expected correlation between the questionnaire and the classroom observation instrument is graphically presented in Fig. 4. The y-axis in Fig. 4 gives the predicted correlation. The x-axis indicates the number of students in the class. The separate lines indicate how the predictions differ when the number of classroom observers and lesson moments sampled with the classroom observation instrument varies.

Fig. 4
A graph plots ρ against the number of students. The data read approximately as follows: 1 observer, (1, 0.21) to (30, 0.37); 3 observers, (1, 0.29) to (30, 0.60); 5 observers, (1, 0.30) to (30, 0.70); 7 observers, (1, 0.30) to (30, 0.75).

Predicted increase in correlations (ρ) between the MTQ student questionnaire and ICALT classroom observation instrument for an increasing number of administered classroom observations. The correlations apply when questionnaires and classroom observations are performed within the same class and span no more than one school year. Van der Lans (2018) reports predictions related to other situations (e.g., questionnaires and observations spanning different classes)
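
The qualitative pattern in Fig. 4, a predicted correlation that rises with the number of students and with the number of observed lessons and observers, can be sketched with a Spearman-Brown-style attenuation formula: the correlation between two aggregates equals the correlation between the underlying 'true' scores multiplied by the square roots of the reliabilities of both aggregates. Note that van der Lans (2018) derived the predictions from estimated generalizability-theory variance components; the single-score reliabilities and the upper-bound correlation used below are hypothetical values chosen only to illustrate the aggregation logic, not to reproduce Fig. 4 exactly.

```python
import numpy as np

def spearman_brown(r1: float, k: int) -> float:
    """Reliability of the average of k parallel measurements, each with reliability r1."""
    return k * r1 / (1 + (k - 1) * r1)

def predicted_corr(n_students: int, n_observations: int,
                   r_student: float = 0.27,      # hypothetical single-student reliability
                   r_observation: float = 0.25,  # hypothetical single-observation reliability
                   rho_true: float = 0.80) -> float:
    """Predicted correlation between the class-aggregated questionnaire score and
    the lesson-aggregated observation score under the attenuation formula."""
    rel_s = spearman_brown(r_student, n_students)
    rel_o = spearman_brown(r_observation, n_observations)
    return rho_true * np.sqrt(rel_s) * np.sqrt(rel_o)

for n_obs in (1, 3, 5, 7):
    row = [round(predicted_corr(n_stu, n_obs), 2) for n_stu in (1, 5, 15, 30)]
    print(f"{n_obs} observation(s): {row}")
```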

The implications of the results in Fig. 4 are not yet well understood. There are several possible interpretations. One interpretation is that more valid inferences are made with the questionnaire and classroom observation instruments when scores are aggregated over many students and over many lessons and observers, respectively. This interpretation aligns well with studies suggesting that single student questionnaire scores are unreliable (Marsh, 2007) and that classroom observations of one single lesson are unreliable snapshots (Hill et al., 2012; Praetorius et al., 2014; van der Lans et al., 2016). This interpretation has considerable implications for the number of classroom observations and student questionnaires that need to be administered at schools. However, this interpretation does not align well with the rationale behind the hypotheses of van der Lans (2018). That is, the only reason why the correlation between the classroom observation instrument and the student questionnaire is predicted to increase as a function of the number of lesson visits included in the aggregate is that the students are also expected to aggregate their experiences across many lessons when scoring the questionnaire. If questionnaires were able to tap students' experiences of one particular lesson, then the correlation would be predicted to be highest when the questionnaire results are correlated with classroom observations of that same particular lesson. The study was unable to examine this empirically, however. Similarly, the finding that questionnaire scores obtained from single students have a low correlation with the classroom observation scores is, in the study by van der Lans (2018), explained by the assumption that classroom observers score instructional effectiveness with respect to the "average" student. If observers had been instructed to score the classroom observation items in relation to one particular student, the correlation would be predicted to be highest for that particular student-observer dyad (compared to other dyads). Again, the study was unable to examine this claim empirically.

5 Discussion and Conclusion

5.1 Potential Implications for Teacher Evaluation in Schools

Based on the above discussions, what can we advise Tess and John? What we can say is that the evidence so far suggests that single classroom visits will rarely show agreement with a single administration of the student questionnaire. Also, the evidence generally indicates that this has little to do with the interpretation of item content by students and observers. The item ordering estimated across all students is very similar to the item ordering estimated across all observers. This does not imply that the quality of the item phrasing is unimportant, however. The evidence indicates that when items are well formulated, students can score items related to the same domains of instructional effectiveness very similarly. We might advise Tess to postpone the performance evaluation interview and administer some more classroom observations. The evidence presented in this chapter indicates that this increases the chance of agreement. However, it is not always possible to schedule additional classroom observations. Alternatively, we might advise Tess and John to focus on one result. Perhaps John wants to improve his instructional effectiveness when teaching certain subject matter, and because the classroom observation visit took place when John was teaching this particular subject matter, the classroom observation results are favored over the student questionnaire results. However, while this last advice might be intuitive to some, we must acknowledge that it is full of untested claims and hypotheses.

5.2 What to Do Next?

One direction for future research concerns the construction of student questionnaires that can help us to make valid inferences about the instructional effectiveness of single lessons. The Impact! tool might be a potential example of such an instrument (Bijlsma et al., 2019). The alternative item phrasing of the Impact! questionnaire provides opportunities to assess the hypothesis that correlations between student questionnaires and classroom observation instruments are attenuated because students aggregate their experiences over many lessons when scoring regular questionnaire items.

Another direction for future research concerns the commonly shared understanding among researchers that some instruments are more sensitive for measuring certain aspects/behaviors of instructional effectiveness and that using multiple instruments could help to offset blind spots. When instruments have blind spots concerning the measurement of instructional effectiveness, we would expect "gaps" in the continuum of instructional effectiveness. Such "gaps" can only become visible when multiple instruments are concurrently calibrated to the continuum. The resulting ordering of item positions may reveal that items of one instrument have unique locations on the continuum. The current evidence assessing this idea is very limited. Hopefully, the counterintuitive result (little evidence supporting the idea) motivates future researchers to improve on the study designs, instrument content, and psychometric methods applied by van der Lans et al. (2019) in order to study this idea more thoroughly.