Introduction

In July of 2023, the American Board of Surgery (ABS) officially announced the transition to a competency-based resident education (CBRE) model, introducing 18 core General Surgery Entrustable Professional Activities (EPAs) [1]. Competency-based surgical education is an educational paradigm that emphasizes residents demonstrating the competencies needed for effective performance in their roles before advancing to the next phase of training [2]. A central challenge of CBRE is operationalizing competency assessment in everyday clinical activities and on performance assessments.

To address this challenge, the ABS has encouraged the use of EPAs as the focus for clinical assessment tools. EPAs were first proposed by Dr. ten Cate in 2005 and are defined as “a unit of professional practice that can be fully entrusted to a resident, once he or she has demonstrated the necessary competence to execute this activity unsupervised” [3, 4]. Given the transition to CBRE with EPAs, general surgery programs across the country have started implementing EPAs as the focus of their clinical assessment tools.

Most assessment forms contain a narrative feedback section inviting individualized feedback for the resident. Narrative feedback offers an opportunity to describe performance, provide individualized recommendations, or compare a resident’s performance to a set of standards [5]. Narrative feedback can also identify struggling residents and thus provides an opportunity for early intervention and support [6, 7]. Prior studies analyzing narrative feedback, not specific to EPAs, advocate for feedback that incorporates reinforcement or correction, is highly specific, includes examples, and offers actionable suggestions [8, 9].

When analyzing narrative feedback on EPA assessments, prior studies have utilized the Quality of Assessment for Learning (QuAL) score as the sole system for analyzing the content and quality of feedback [10, 11]. The QuAL score focuses on whether feedback contains evidence, suggestions, or a connection between the two. Given the limited number of feedback principles previously analyzed on EPA assessments, there is an opportunity for further characterization of narrative feedback [12]. Better understanding the nuances of narrative comments on EPA assessments may help guide faculty development efforts toward more effective feedback that facilitates resident growth. Therefore, we aimed to identify narrative feedback characteristics on an operative EPA assessment in general surgery and to determine whether resident performance, entrustment, or resident/faculty characteristics affect the type of narrative feedback received.

Methods

Study setting and design

We performed a retrospective directed content analysis of narrative feedback on operative EPA assessments at a single large academic teaching hospital system, the University of California, San Francisco (UCSF) Medical Center. We included assessments completed between June 2022 and July 2023. Faculty were instructed to complete the operative EPA assessment following each operation with a surgical resident. The participants were surgical residents rotating on a general surgery service.

EPA assessment tool

In May 2022, the general surgery program leadership developed an assessment tool for use with intraoperative EPAs. The tool collects demographic information such as resident name, faculty name, and resident postgraduate training (PGY) level, as well as case-specific data such as procedure type and case difficulty (straightforward, moderate, complex). The tool includes an overall entrustment score on the entrustment-supervision (ES) scale recommended by the ABS, ranging from limited participation (1) and direct supervision (2) to indirect supervision (3) and practice ready (4) [1, 13]. We also included four domains to assess resident performance: Knowledge of Anatomy (1–3), Surgical Technique (1–4), Recognition of Potential Errors (1–4), and Steps of the Operation (1–4). The four domains were developed from the narratives of the five EPAs in the ABS pilot study, as well as the expertise of the general surgery program leadership team [13]. The four sub-scores were summed into a composite score, a choice supported by strong intercorrelations (r = 0.45–0.69) and internal consistency (α = 0.84) across the domains. The final section of the EPA assessment tool included a required field for narrative feedback.
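As a concrete illustration, summing the domain sub-scores into a composite and checking internal consistency with Cronbach’s α can be sketched as follows. The domain scores below are hypothetical, and the study’s analyses were performed in Stata rather than Python; this is only a minimal NumPy sketch of the computation.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_assessments x n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of per-domain variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of the summed score
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical domain scores for six assessments; columns are
# Anatomy (1-3), Technique (1-4), Errors (1-4), Steps (1-4)
scores = np.array([
    [1, 2, 2, 2],
    [2, 3, 3, 3],
    [3, 4, 4, 4],
    [2, 2, 3, 3],
    [1, 1, 2, 1],
    [3, 4, 3, 4],
])

composite = scores.sum(axis=1)   # composite score = sum of the four sub-scores
alpha = cronbach_alpha(scores)   # internal consistency across the domains
```

A high α (the study reports 0.84) indicates the domains covary enough that a single summed composite is a defensible summary.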

Additional data collection

Resident and faculty characteristics such as faculty years in practice, faculty/resident self-reported gender, and resident Underrepresented in Medicine (URiM) status were collected from the records in the surgical education office. To identify faculty members who most frequently operate with residents, we utilized ACGME case log data. Finally, we estimated each faculty member’s skill as an educator by averaging all resident ratings of that faculty member on the item “providing formative feedback” from the faculty teaching evaluations (1 = ineffective skill, 4 = exemplary skill).

Feedback characteristics selection

Feedback characteristics included in the study were valence, specificity, appreciation, coaching, and evaluation, selected based on prior literature and the expertise of the surgical education leadership team [9, 14–18]. High-valence feedback was reinforcing, whereas low-valence feedback was corrective. Specificity distinguished specific from general feedback. Appreciation referred to feedback that recognized or rewarded the resident. Coaching offered a better way to perform a task. Finally, evaluation assessed residents against a set of standards.

Data analysis

After data collection, all EPA assessments were de-identified to ensure anonymity of the residents. Two researchers (AM, AG) performed directed content analysis of the narrative feedback from the EPA assessments. Valence and specificity were coded dichotomously: valence as reinforcing or corrective (0 = corrective) and specificity as specific or general (0 = general). Appreciation, coaching, and evaluation were coded on a 3-point scale (1 = Not Present, 2 = Somewhat, 3 = Yes). We calculated inter-rater reliability using Cohen’s kappa. In the event of disagreement between the two coders, a third researcher (RB) analyzed the feedback and made the final decision.
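The inter-rater reliability computation can be sketched for a dichotomous code such as valence. The codes below are hypothetical, and the helper is plain Python written only to illustrate how Cohen’s kappa corrects raw agreement for chance agreement; the study’s actual statistic was computed in Stata.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical codes on the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed proportion of agreement
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal code frequencies
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical valence codes from two coders (0 = corrective, 1 = reinforcing)
coder_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
coder_2 = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
kappa = cohens_kappa(coder_1, coder_2)
```

Kappa is lower than raw percent agreement whenever one code dominates, which is why the study reports both (91% agreement, kappa = 0.84).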

Multivariable logistic regression assessed for associations between feedback characteristics and entrustment score, composite score, PGY level (1–5), case difficulty (1–3), gender of resident and faculty (0 = male), gender matching (1 = alignment), faculty years in practice, resident URiM status (0 = non-URiM), faculty evaluation scores, and number of operative cases with a resident. For appreciation, coaching, and evaluation, we converted the 3-point scale to a dichotomous item (0 = not present, 1 = present) to perform logistic regression. All independent variables were standardized to allow direct comparison across predictors. We performed all analyses in Stata 16.1 for Mac (StataCorp, College Station, TX).
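The modeling approach, standardizing predictors and fitting a logistic regression for a dichotomous feedback characteristic, can be sketched as follows. Everything here is simulated and illustrative (the study used Stata’s regression routines on real data); the toy outcome is generated so that only the entrustment predictor matters, and the fit uses plain gradient descent on the log-loss rather than a library solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical predictors for n assessments: entrustment score (1-4),
# composite score (4-18), PGY level (1-5) -- stand-ins for the study's variables
X = np.column_stack([
    rng.integers(1, 5, n),
    rng.integers(4, 19, n),
    rng.integers(1, 6, n),
]).astype(float)

# Simulated dichotomous outcome (e.g. coaching present): in this toy model it
# depends only on entrustment, with lower entrustment -> more coaching
true_logit = 2.0 * (2.5 - X[:, 0])
y = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(float)

# Standardize predictors so coefficients are directly comparable in magnitude
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
Xd = np.column_stack([np.ones(n), Xs])  # prepend an intercept column

# Fit logistic regression by gradient descent on the mean log-loss
beta = np.zeros(Xd.shape[1])
for _ in range(5000):
    p = 1 / (1 + np.exp(-Xd @ beta))    # predicted probabilities
    beta -= 0.1 * Xd.T @ (p - y) / n    # gradient step
```

After standardization, a coefficient reflects the change in log-odds per standard deviation of its predictor, which is what makes the β values in Table 2 comparable across predictors on different scales. Here the entrustment coefficient (`beta[1]`) recovers a strong negative effect, while the irrelevant predictors stay near zero.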

Ethical approval

The University of California, San Francisco institutional review board exempted this study from review (IRB 23-39766).

Results

Descriptive results

From June 2022 to July 2023, faculty members completed 325 assessments for 44 residents. Of these residents, 57% were female (n = 25) and 30% were URiM (n = 13). The study included residents from all PGY levels, with a nearly equal distribution across each class (10 PGY1, 9 PGY2, 8 PGY3, 8 PGY4, 9 PGY5). Forty-two faculty from 10 surgical subspecialties completed evaluations. Faculty had an average of 14 years in practice (range: 1–37) and 45% were female (n = 20). On average, faculty operated with a resident in 133 cases per year (range: 13–263). The average teaching evaluation score of the faculty, based on the resident-completed performance evaluations, was 3.81 ± 0.17. Of the 325 cases included in the study, faculty rated 132 cases as straightforward, 117 as moderate, and 76 as complex. The average overall entrustment score was 2.51 ± 0.80, and the average composite score was 12.67 (range: 4–18).

In terms of the feedback characteristics, 82% of narrative feedback was reinforcing (18% corrective) and 80% was specific (20% general). Appreciation was present in 89% of feedback, coaching in 51%, and evaluation in only 38% (Table 1). Inter-rater agreement was 91% (kappa = 0.84).

Table 1 Feedback characteristics with definitions and examples from the narrative feedback on the EPA assessment tool

Multivariable regression

We standardized the coefficients and performed logistic regression between each of the feedback characteristics and the predictors (Table 2). We found all feedback characteristics contained statistically significant predictors.

Table 2 Logistic regression for predictors of feedback characteristics, statistically significant if p < 0.05

Reinforcing feedback, or feedback with high valence, was more prevalent after complex cases (β = 0.50, 95% CI [0.04, 0.95], p = 0.03). Residents who scored higher on the composite resident performance score were also more likely to receive reinforcing feedback (β = 0.11, 95% CI [0.25, 0.68], p < 0.01). Feedback was more likely to be corrective for senior-level residents (β = −0.44, 95% CI [−0.79, −0.09], p = 0.02) or when the assessment was completed by a male faculty member (β = −1.03, 95% CI [−1.88, −0.18], p = 0.02).

Both faculty and resident characteristics were statistically significant predictors of specific feedback. Faculty characteristics that predicted more specific feedback included fewer years in practice, with junior faculty providing more specific feedback (β = −0.06, 95% CI [−0.11, −0.02], p = 0.01), and higher teaching evaluation scores (β = 1.29, 95% CI [1.97, 7.03], p < 0.01). Resident characteristics that predicted specific feedback included increasing PGY level (β = 0.18, 95% CI [0.25, 0.97], p < 0.01) and lower ES scores (β = −0.67, 95% CI [−1.24, −0.10], p = 0.02).

Feedback was more likely to contain appreciation when residents scored higher on the composite score of the resident performance domains (β = 0.47, 95% CI [0.22, 0.73], p < 0.01). Faculty gender was also a significant predictor of appreciative feedback, with male faculty more likely to provide appreciation than their female counterparts (β = −1.46, 95% CI [−2.48, −0.43], p = 0.01).

Feedback was more likely to contain coaching when residents received lower ES scores (β = −1.01, 95% CI [−1.52, −0.49], p < 0.01) and lower composite scores (β = −0.20, 95% CI [−0.37, −0.03], p = 0.02). Senior-level residents received more coaching than junior-level residents (β = 0.32, 95% CI [0.04, 0.60], p = 0.03). Both resident and faculty gender played a role in the delivery of coaching: female residents were more likely than male residents to receive coaching (β = 0.65, 95% CI [0.04, 1.25], p = 0.04), and coaching was more prevalent when faculty and resident gender matched (β = 0.62, 95% CI [0.03, 1.21], p = 0.04). Faculty characteristics that increased the prevalence of coaching included fewer years in practice, with junior faculty providing more coaching (β = −0.05, 95% CI [−0.09, −0.01], p = 0.01), and higher teaching evaluation scores (β = 3.32, 95% CI [1.33, 5.32], p < 0.01).

Evaluation was the least prevalent type of feedback, and its only predictors were faculty characteristics. Male faculty were more likely to provide evaluation than their female counterparts (β = −1.00, 95% CI [−1.68, −0.32], p < 0.01). Faculty with lower teaching evaluation scores were also more likely to include evaluation of residents against standards in their narrative feedback (β = −3.84, 95% CI [−5.76, −1.93], p < 0.01).

Discussion

We performed a retrospective directed content analysis to determine the prevalence and predictors of feedback characteristics in the narrative feedback provided on an operative EPA assessment tool. Most of the feedback was reinforcing, specific, and contained appreciation. Coaching, defined as providing a recommendation, was less common, appearing in about half of comments. Less than half of the narrative feedback contained evaluation, or comparison of residents against a set of standards, making it the rarest type of feedback provided. It is reassuring that faculty frequently included reinforcing feedback, as positive feedback has been shown to improve resident performance, confidence, and motivation [14, 19]. The relative lack of coaching and evaluation is consistent with prior studies demonstrating a paucity of recommendations or suggestions in narrative feedback [10, 11, 20]. Both coaching and evaluation are essential for resident education and progression; there is therefore an opportunity to improve future feedback quality by highlighting this deficiency and providing recommendations for faculty development.

Few studies have examined narrative feedback on EPA assessments [10, 11]. Therefore, it remains unknown how feedback characteristics vary in response to questions regarding entrustment. Ideally, the narrative feedback section would provide the opportunity to explain or defend the overall entrustment and resident performance scores. Our study demonstrated an average ES score of 2.51 out of 4 and an average composite score of 12.67 out of 18; however, less than 50% of feedback contained coaching or evaluation against standards. There is thus a missed opportunity to explain suboptimal ES and composite scores within the narrative feedback, as most feedback was reinforcing and appreciative. Incorporating actionable recommendations or benchmarking residents against established standards would allow residents to better understand their scores and areas for improvement. With ongoing implementation of EPA assessments, faculty development might provide guidance on delivering narrative feedback that justifies score selections.

The multivariable regression revealed multiple interesting predictors of feedback characteristics, including resident entrustment, performance, and case factors as well as resident and faculty characteristics. Residents with lower overall ES scores received more specific feedback and more coaching, which is consistent with what these residents need and desire. The composite score, comprising resident performance domains such as surgical technique, recognition of potential errors, steps of the operation, and knowledge of anatomy, was the most common predictor of feedback characteristics. As expected, residents with higher composite scores received more appreciation and reinforcing feedback, while residents with lower composite scores received more coaching. These findings suggest that the type of feedback provided corresponded to resident performance, reinforcing strong performance and correcting areas of weakness.

Another resident factor that predicted the type of feedback delivered was PGY level. Junior-level residents received more reinforcing feedback, while senior-level residents received more specific feedback and coaching. This finding may reflect a desire to provide a positive and supportive environment for junior residents while giving senior residents more directed feedback with recommendations for improvement prior to graduation. In addition, senior residents may have established relationships with faculty that allow for more directed feedback. Finally, greater case difficulty also predicted more reinforcing feedback, potentially signifying faculty offering support to residents through complex and difficult cases.

We found both resident and faculty gender to be significant predictors of feedback characteristics. An almost equal number of male and female faculty completed the EPA assessments and provided narrative feedback. However, male faculty were more likely to deliver reinforcing, appreciative, or evaluative feedback than their female counterparts. Resident gender also affected feedback characteristics, with female residents more likely than male residents to receive coaching. Gender match between the resident and faculty also resulted in more coaching feedback. These results suggest a potential effect of gender on the characteristics of feedback delivered, consistent with prior studies demonstrating an impact of gender on decisions about entrustment and autonomy [21–23]. Future studies should investigate the causes of these gender differences, given their persistence in the surgical education and EPA assessment literature.

Faculty characteristics, such as years in practice and evaluation score, were also significant predictors of feedback characteristics. Junior faculty and faculty with higher evaluation scores were more likely to provide specific feedback and coaching. Interestingly, faculty with lower evaluation scores were more likely to provide evaluative feedback, comparing residents against a set of standards. The volume of operations with a resident, by contrast, was not a significant predictor. We selected this variable as a potential surrogate for the faculty–resident relationship, assuming that higher case volumes indicate greater familiarity with residents. The sample size was insufficient to examine direct faculty–resident pairings to understand the impact of familiarity, which represents a potential focus for future studies.

There are several limitations that temper our findings. We selected five feedback characteristics based on the literature and the expertise of our leadership team; however, this is not an exhaustive list, and other feedback characteristics and scoring systems could provide additional insight into the types of narrative feedback provided. Another limitation is that this study was conducted during the first year of EPA implementation at our institution, so completion of the assessment forms varied across surgical subspecialties and individual faculty; the results are therefore not equally representative across surgical divisions and faculty. Additionally, the narrative feedback section of this EPA assessment tool was required, while on other assessment tools it is optional. This could influence the type of narrative feedback delivered, as optional feedback might be provided only when faculty have strong opinions. Finally, we conducted this study at a single large academic medical center, so our findings most likely apply to similar institutions.

Overall, we identified distinct feedback characteristics and their predictors, including resident entrustment, performance, case factors, and resident/faculty characteristics. Our findings highlight potential areas for faculty development to improve the quality of narrative feedback provided on EPA assessment tools. We also identified areas for future study, such as the role of gender and the faculty–resident relationship in feedback characteristics.