1 Introduction

Almost every organization with a strong digital capability is on an agile transformation journey [9]. However, whether this transformation benefits the organization, and whether goals are reached, are frequently heard concerns [33].

Measurement is fundamental to justifying change efforts and provides objective reference material from which to learn and improve (cf. [25, 31]). While previous work (cf. [28]) demonstrated the feasibility of using data for individual organizations and metrics, change across organizational layers over time has remained largely unexplored to date. Can we find objective data to confirm whether these transformations were actually quantitatively measured and whether they improved organizational performance [33]? In this paper we report on a case study with multiple units, exploring for the first time the application of backlog data to measure and guide a large-scale agile transformation, based on eight Agile Release Trains in a large international financial services company.

2 Related Work

2.1 Large-Scale Agile Frameworks and Impact of Transformations

While agile techniques vary in practice, they share common characteristics, such as iterative development and the focus on people and their interactions, captured in the 2001 Agile Manifesto and its principles [2]. Current figures and surveys on scaled agile transformations [9, 33] indicate that SAFe [19] is the most widely applied framework (35%), followed by Scrum of Scrums (16%) and others such as Disciplined Agile Delivery (DAD), Large-Scale Scrum (LeSS) [22], Enterprise Scrum, and Lean Management (4%).

Current literature documents multiple attempts to measure the impact of agile transformations [21, 28, 33]. Consolidating prior evidence, Stettina et al. [33] report that the impact of agile transformations is significant along the dimensions of Productivity, Responsiveness, Quality, Workflow health, and Employee satisfaction & engagement. From a practitioner perspective, the Scaled Agile Framework (SAFe) proposes three dimensions of metrics: Outcome, Flow, and Competency [30]. Outcome metrics focus on whether solutions meet the needs of customers and the business, Flow metrics on organizational efficiency, and Competency metrics on how proficient the organization is in the practices that enable business agility [30].

2.2 Research on Performance Measurement Frameworks

In general management literature, multiple performance measurement frameworks and models have been developed and applied, amongst others the (1) Balanced Scorecard (BSC); (2) Performance Pyramid [39]; and (3) Performance Prism [26]. A comparative overview is provided by Öztayşi and Uçal [42], based on the seven purposes formulated by Meyer [24] (i.e., (1) look back; (2) look forward; (3) roll up; (4) cascade down; (5) compare; (6) compensate; and (7) motivate), combined with two additional views: (8) alignment with company strategies; and (9) flexibility (dynamism) of the measurement model in response to change. Öztayşi and Uçal [42] show that only the BSC satisfies all of these purposes. The latter two purposes seem especially relevant in the context of agile transformations.

The BSC approach [16, 17] was introduced to capture strategic intent while linking it to the performance of an organization, and views strategy management as an integrated end-to-end process [16, 27]. The BSC is widely applied across different industries and describes four perspectives: (1) Learning & Growth (can we continue to improve?); (2) Customer (doing the right things); (3) Internal process (doing things right); and (4) Financial. In the context of agile strategy, Wiraeus and Creelman [40] provide an elaborate description using (Dynamic) Balanced Scorecards, observing an absence of robust objective statements and a failure to use tools such as driver-based models and so-called Key Performance Questions to bridge the gap between objectives and KPIs ([40], p. 15). We argue that the same challenge applies to the Objectives and Key Results (OKR) approach, and we see a similar ambition in the Goal Question Metric (GQM) approach, which has been proposed in the software quality domain to define the right measures [1]: goals need to be traced back to the data intended to define them operationally, and a framework needs to be provided for interpreting the data with respect to the stated goals.

2.3 Research on (Backlog) Data in Agile Software Development

Backlog tooling to support the application of agile frameworks is perceived by agile teams as highly important within their development toolchain [34]. Further, a combination of tool-driven quantitative reporting (e.g., based on backlog tooling) supplemented by cadence-driven qualitative insights (e.g., iteration reviews, demos, as well as employee and customer surveys) is applied among more mature agile teams and organizations [35]. A literature study by Biesialska et al. [3] describes a multitude of tooling data sources available in agile software development and provides an overview of the use of backlog tools for monitoring the status and progress of projects, backlogs, and corporate initiatives. A substantial part focuses on estimation and predictability models [7, 29] at diverse levels, ranging from team-level user stories [5, 6], requirements [8], and epics [4] to sprints, projects, and releases [23]. Using data from these sources raises reliability challenges, such as (1) the need for automation (unobtrusiveness, cf. [23]); (2) transforming the data; and (3) assessing the (maturity of) data quality [5]. Based on SWEBOK [15] knowledge areas, Biesialska et al. classify no research under Software Engineering Economics. From this we may conclude that areas such as efficiency, effectiveness, productivity, time-value, and business case are to date not covered in the context of big data analytics, whereas these are crucial topics in the context of agile transformations. However, the case of Fannie Mae [31] describes the use of analytics to facilitate guidance during an Agile-DevOps transformation, using automated function points for productivity and defects for quality measurements.

2.4 Summary of Literature and Research Question

Based on the current state of the literature, we make the following observations: (1) there is no generally accepted view on the success of agile transformations or on their impact on organizational performance; (2) although some measurement frameworks are available for understanding the impact of agile transformations (cf. [21, 28, 33]), none of them has been used as common ground for reference or for the guidance of agile transformations; and (3) the same applies to the use of backlog tooling data. These observations lead us to pose the following research question: How can we measure and guide the impact of agile transformations on organizational performance using backlog tooling data?

3 Methodology

In order to address our research question, we conducted exploratory analyses (cf. Tukey [37]) on backlog data in an embedded multiple-unit single case study (Yin [41], Type 2). By analyzing a single organization and multiple units, we were able to compare results and observe the impact of interventions, maturity, and trends within the same transformation context. Units (i.e., value streams and shared services) have a 1:1 relation to Agile Release Trains (ARTs) and consist of multiple teams at FinOrg.

3.1 Our Case Study Subject: FinOrg

The subject of our case study is the agile transformation of a large Dutch financial services organization: 11 release trains and approximately 70 teams, including development teams, DevOps teams, supporting staff departments (e.g., architecture, security, HR, procurement, marketing), and back-office business (non-IT) operations teams. All units are individually profit-and-loss responsible, have their own product-market propositions, and are autonomous in implementing the new agile way of working, which is driven by the following objectives:

  1. Improve productivity (PROD)
  2. Faster time to market (TTM)
  3. Higher quality (QBD)
  4. Higher customer satisfaction (CUST)
  5. More engaged employees (EMPL)

No targets for these objectives have been communicated at FinOrg. FinOrg uses Jira as its backlog system, with the Easy Business Intelligence plugin for dashboards and the Structure plugin to aggregate data across units and teams. For statistical analysis we used JASP, and Jamovi for plots.

3.2 Mapping Literature to Transformation Objectives at FinOrg

In order to map the transformation objectives to categories in the literature, we use the dimensions introduced by Stettina et al. [33].

  • (a) Responsiveness: maps to FinOrg’s faster time to market driver (TTM). This translates to the throughput time of epics and features at portfolio/program level and of team issues, measured from creation until the moment the item is resolved. Note that the notion of delivered or to-market is interpreted in different ways in the literature and in tooling. To prevent confusion, we used the resolve time (i.e., the time until the item reached its final state), denoted TTR (Time to Resolve).

  • (b) Productivity: maps to the productivity objective (PROD), which translates to delivering more value to the customer. The notions of WSJF and Cost of Delay are aligned with this objective. Items at program level and above are WSJF-estimated. WSJF (Weighted Shortest Job First) relates the Cost of Delay (value) attributes proposed by SAFe [19] to the Job Size (estimated effort); both TTR and WSJF are illustrated in the sketch following this list.

  • (c) Workflow health: for this dimension, one must note that the measurements proposed in the literature overlap with responsiveness and time to market [33], albeit with a different objective. An illustration: an increase in functionality per unit of time (cf. [28]) can indicate higher productivity, but might equally be categorized as faster time to market. To resolve this ambiguity, we assigned its measurements, (1) Job Size and (2) the number of items resolved, to both objectives: PROD and TTM.

  • (d) Employee satisfaction & engagement: maps to the EMPL objective. Focusing on backlog data, we were not able to address employee engagement directly. Instead, we used other employee-related measurements, namely the number of people assigned during the flow and a custom FinOrg indicator of complexity, as proxies for collaboration, and classified these as part of (c) Workflow health.

  • (e) Quality: maps to the quality by design objective (QBD). In order to keep track of compliance and quality aspects, FinOrg introduced a quality-by-design process, implying that all initiatives need to be checked against relevant compliance and quality policies (e.g., security, privacy, legal, ITSM) and are thus explicitly linked to quality backlog items. QBD items have to be resolved alongside the respective initiative (similar to acceptance criteria), assigned to and executed by the appropriate roles and colleagues. The number of items resolved and the TTR for QBD items are used as measures.
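To make the derivation of these measures concrete, the following minimal sketch shows how TTR and WSJF can be computed from a backlog export. The field names (created, resolved, cost_of_delay, job_size) are hypothetical stand-ins; FinOrg’s actual Jira schema is not disclosed here.

```python
# Minimal sketch: deriving TTR and WSJF from a backlog export.
# All field names are hypothetical; real Jira installations use
# custom fields that differ per organization.
import pandas as pd

issues = pd.DataFrame({
    "key": ["EPIC-1", "FEAT-7", "FEAT-9"],
    "created": pd.to_datetime(["2020-01-06", "2020-02-03", "2020-02-17"]),
    "resolved": pd.to_datetime(["2020-04-20", "2020-03-16", "2020-04-06"]),
    "cost_of_delay": [21, 13, 8],   # relative CoD estimate (SAFe-style scale)
    "job_size": [13, 5, 3],         # relative effort estimate
})

# (a) Time to Resolve: from creation until the item reached its final state.
issues["ttr_days"] = (issues["resolved"] - issues["created"]).dt.days

# (b) Weighted Shortest Job First: Cost of Delay relative to Job Size.
issues["wsjf"] = issues["cost_of_delay"] / issues["job_size"]

print(issues[["key", "ttr_days", "wsjf"]])
```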

4 Results

4.1 Case Background: FinOrg’s Agile Transformation Journey

The framework implemented at FinOrg was based on SAFe [30] with a few additions, the most important being the introduction of the aforementioned quality-by-design process. Another addition was the integration of business operations, including non-IT teams, into the units. FinOrg implemented a workflow on program and portfolio level with funnel, review, analyze, backlog, and implementation stages, mandatory initiative statement registration, and multiple WSJF-estimation and Quality by Design sessions within a quarterly cadence.

With respect to the transformation timeline, we distinguish three phases in the transformation at FinOrg. Wave 0: agile at team level, using the backlog system at team level with mixed maturity levels and agile models (e.g., Kanban and Scrum variants) (months 0–12). Wave 1: introduction of a new way of working at program and portfolio level, based on the SAFe framework (months 13–24). Wave 2: maturing at program and portfolio level (months 25–36, most recent).

The lead author helped guide the digital transformation at team level during Wave 0 and helped design and implement the operating model at portfolio and program level. During Wave 1, the lead author was responsible for creating and introducing the solution on top of the existing Jira backlogs. This functionality was created using the plugins and custom scripting to facilitate guidance on the program/portfolio and quality-by-design aspirations. An extra layer was introduced using two additional backlogs containing: (1) functional items (i.e., Epics, Features); and (2) non-functional, also known as quality-by-design, items. Release trains and teams are responsible for documenting initiatives and quality aspects and for linking activities to the overarching items. This functionality was iteratively developed and introduced with a minimum viable product at corporate level at the start of Wave 1.

Table 1. Descriptive information on the Jira tooling data of FinOrg

Table 1 presents our case study data. We performed data cleansing, resulting in dropping three units and multiple Jira team projects, based on our assessment that their activities were not substantial enough to serve as a basis for detecting empirical trends and differences. In addition, we harmonized workflows, differing uses of statuses, issue types, and custom fields by adding an abstraction model, exposing backlogs in only three basic layers (i.e., Epics, Features, Team issues) and a simplified workflow (i.e., only create/open and resolve statuses). This improved clarity of presentation while keeping the backlog system intact (refer to the additional notes in Table 1 for details).
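The following minimal sketch illustrates the kind of projection such an abstraction model performs. The mapping tables and field names are illustrative assumptions, not FinOrg’s actual configuration.

```python
# Minimal sketch of the harmonization step: mapping heterogeneous issue
# types and statuses onto three layers (Epic, Feature, Team issue) and a
# simplified create/resolve workflow. Mapping tables are illustrative only.
LAYER_MAP = {
    "Epic": "Epic", "Initiative": "Epic",
    "Feature": "Feature", "Capability": "Feature",
    "Story": "Team issue", "Bug": "Team issue", "Task": "Team issue",
}
RESOLVED_STATUSES = {"Done", "Closed", "Resolved", "Cancelled"}

def harmonize(issue: dict) -> dict:
    """Project a raw Jira issue onto the three-layer abstraction model."""
    return {
        "key": issue["key"],
        "layer": LAYER_MAP.get(issue["issuetype"], "Team issue"),
        "state": "resolved" if issue["status"] in RESOLVED_STATUSES else "open",
    }

print(harmonize({"key": "U02-42", "issuetype": "Story", "status": "Done"}))
```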

4.2 Uncovering Trends in Backlog Data

Our exploratory ambition is to determine whether the desired trends are noticeable, in order to guide the transformation. We first illustrate productivity (PROD). Figure 1 plots resolved Cost of Delay, our proxy for value delivery, relative to its mean, making comparison of results over time possible and uncovering potential trends. We share two observations based on this AVP plot. Observation 1: the start of the portfolio/program-level wave in month 12 is visible through the cadence of resolved items/dots starting just before month 14, two months after the Wave 1 kick-off. As envisioned at program/portfolio level, we observe a positive trend. Observation 2: in month 25 a global cost-saving program was introduced within all units, which is a plausible explanation for the flattening and subsequent decrease of Cost of Delay, since the organization was not able to focus on value delivery. WSJF measurements, the next identified measure of productivity, show exactly the same trend.

Fig. 1. Added Variable Plot (AVP) of Cost of Delay for units, baselined per issue type and unit over time (months); the value of 1 represents the baseline. Outliers >3 have been discarded from the plot to improve visualization quality. Dots represent resolved issues. Confidence bands and fitted line are based on Loess.
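A minimal sketch of the baselining behind Fig. 1 follows, assuming a flat export with unit, issuetype, month, and cost_of_delay columns; the actual pipeline used Jira plugins and custom scripting, and the published plots were produced with JASP/Jamovi.

```python
# Minimal sketch of the Fig. 1 baselining: resolved Cost of Delay divided
# by its mean per (unit, issue type), so 1.0 is the baseline, followed by
# a Loess trend over time. Column names are assumptions.
import pandas as pd
from statsmodels.nonparametric.smoothers_lowess import lowess

df = pd.read_csv("resolved_issues.csv")  # hypothetical export

# Baseline: each resolved item's CoD relative to its unit/type mean.
df["cod_baselined"] = df["cost_of_delay"] / df.groupby(
    ["unit", "issuetype"])["cost_of_delay"].transform("mean")

df = df[df["cod_baselined"] <= 3]  # discard outliers >3, as in Fig. 1

# Loess-fitted trend of baselined CoD over transformation months.
trend = lowess(df["cod_baselined"], df["month"], frac=0.4)
```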

4.3 Trends Across Organizational Layers, Focus on Responsiveness

A second demonstration examines the trends in responsiveness (TTM), drilling down into layers and units (Fig. 2). This proved helpful in deepening insights into the dynamics of flow. The impact of local interventions to improve refinement processes, creating better-sized and better-defined chunks of work, is visible over time, and it reveals significant differences. One illustration: all trains started with the mandatory use of program/portfolio epics at Wave 1 (month 12), meaning that all initiatives had to be registered and estimated. One ART (U02) already used features and greatly reduced the TTR for these items over the three years, mainly by defining smaller chunks of work. This downsizing of items at U02 did not lead to worse TTR results at team level; rather, the opposite seems true: more items were delivered, with better TTR rates at this level as well. Overall, we see decreasing TTR values, in line with the envisioned improvement on the TTM objective.

Fig. 2. Baselined Time-to-Resolve (TTR) measurements for ARTs U01–U12 across the three organizational layers and transformation Waves 0–2.
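A minimal sketch of the layered TTR view in Fig. 2 follows, under the same assumptions about the harmonized export as before.

```python
# Minimal sketch of the layered view behind Fig. 2: TTR is baselined per
# (unit, layer), making trends comparable across ARTs and organizational
# layers. Column names are assumptions about the harmonized export.
import pandas as pd

df = pd.read_csv("resolved_issues.csv")  # hypothetical harmonized export
df["ttr_baselined"] = df["ttr_days"] / df.groupby(
    ["unit", "layer"])["ttr_days"].transform("mean")

# Median baselined TTR per layer and month; a downward trend over the
# waves corresponds to the envisioned TTM improvement.
trend = df.pivot_table(index="month", columns="layer",
                       values="ttr_baselined", aggfunc="median")
print(trend.tail())
```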

4.4 Understanding Transformation Success

In this section we report on a way to provide evidence regarding the overall success of this transformation. For this purpose, we compare data sets at the program & portfolio level of Wave 1 to those of Wave 2 in Table 2; a computational sketch of this comparison follows the summary below. A summary of our findings:

Table 2. Impact (%) of Wave 1 versus Wave 2 of the transformation at program level across ARTs, per objective
  •   (a) (TTM): The transformation improves (a) responsiveness. At all issue layers a substantial improvement (i.e., a decrease in resolve time) was observed, ranging from 34% at epic level to over 47% at team level.

  •   (b) (PROD): The transformation improves (b) productivity. More value has been delivered (Cost of Delay: 30% at epic level, 32% at feature level) and better priorities (WSJF) have been set (73% feature level, 114% epic level). The (large) improvements within some units can be explained by interventions improving the WSJF estimation and prioritization events and by redefining epics and features. Note that some units (U01, U08) did not report any resolved epic items.

  •   (c) (PROD, TTM): The transformation improves (c) workflow health. Data from our case study display an ambivalent picture. Averages of resolved Job Sizes decreased at both levels (22% features, 24% epics), which should be evaluated in the context of the number of items. From Fig. 2 one can deduce that the numbers of epics and features are indeed decreasing, since fewer data points are visible in more recent months. A reasonable explanation is the creation of fewer overarching items such as epics and a shift to smaller (right-sized) items facilitating better flow. It is therefore interesting to observe the dynamics between priority setting (WSJF) and TTR in this dimension. Furthermore, note the differences in unit results: U04, for example, shows workflow decreases at epic level but improvements at other levels, indicating that focus shifted to the delivery of smaller-sized items. Note: the lagging performance of U03 can be explained by unit-specific challenges.

  •   (d) (QBD): The transformation improves quality. Data from our case study display interesting results: the number of QbD items decreased, while the resolve time improved. A plausible explanation is that, since the number of resolved epics (initiatives) decreased, a decrease in associated QbD items is to be expected as well. The overall QbD numbers are positive: the ratio of quality aspects is congruent with the initiatives, and the handling of QbD aspects improved over time (TTR 41%).

  •   (e) (EMPL): The transformation improves employee engagement. Without subjective survey data on engagement, we cannot report on this measurement directly. However, we are able to report on autonomy and a (custom) complexity measurement as part of the (c) workflow health category, observing an increase in autonomy (21%) and a decrease in complexity (7%).
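A minimal sketch of the wave comparison behind Table 2 follows, assuming a harmonized program-level export; the wave boundaries follow Sect. 4.1, while the aggregation per measure is our assumption.

```python
# Minimal sketch of the Wave 1 vs. Wave 2 comparison in Table 2:
# percentage change per measure, computed per ART. Wave boundaries
# (months 13-24 and 25-36) follow Sect. 4.1; column names are assumed.
import pandas as pd

df = pd.read_csv("program_level_issues.csv")  # hypothetical export
df["wave"] = pd.cut(df["month"], bins=[12, 24, 36],
                    labels=["wave1", "wave2"])

# Mean TTR per ART and wave; a negative change means faster resolution,
# i.e., an improvement on the TTM objective.
ttr = df.pivot_table(index="unit", columns="wave", values="ttr_days",
                     aggfunc="mean", observed=True)
ttr["impact_pct"] = 100 * (ttr["wave2"] - ttr["wave1"]) / ttr["wave1"]
print(ttr)
```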

5 Discussion

5.1 Using Backlog Data to Guide Transformations Based on Trends

We now discuss how Jira data contributes to the understanding of transformation impact and trends, in relation to the five dimensions of impact established in the agile literature and subsequently to the Balanced Scorecard (BSC). Figure 3 provides insight into how measures, objectives, and perspectives are linked by establishing a connection between the Balanced Scorecard, the impact dimensions, and the measurements conducted during the transformation at FinOrg. The perspectives of the BSC as presented by Kaplan and Norton [16, 17] offer a holistic view on the dimensions of organizational performance, in contrast to the empirical, bottom-up understanding of the impact of agile transformations as presented by Stettina et al. [33]. Plotting Jira backlog data over time and projecting the data onto multiple layers, as discussed in this paper, allows zooming into organizational layers, and the resulting trend analyses provide valuable augmentation.

Fig. 3. Overview of case study results and objectives (blue, 1st block) and literature (2nd block); the last column connects to the BSC perspectives. Shaded gray results: converted Likert scales of qualitative survey results. (Color figure online)

Firstly, one can observe that the Time-to-Resolve and the number of items resolved over time at Epic, Feature, and Team level augment the Responsiveness dimension. This dimension contributes to Learning & Growth through the opportunity of faster feedback enabled by faster delivery. Based on the baselined Time-to-Resolve plots in Fig. 2, one can confirm the envisioned trend of decreasing resolve time. In Sect. 4.3 we discuss how smaller slices of Features contribute to lowering TTR, using the example of U02. A further general observation from Fig. 2 is that the impact differs significantly per organizational layer, as previously suggested by Stettina et al. [33].

Secondly, one can observe how the measures of Cost of Delay and WSJF contribute to the dimension of Productivity, as they represent how implemented Epics, Features, and Stories link to the prioritization given by the customer. Here one assumes that better adherence to previously defined customer issue priorities leads to better performance, as previously described in the literature [10, 11]. Figure 1 plots aggregated Cost of Delay values for the issues delivered to the customer across all units. Based on the plot, one can recognize positive as well as negative trends. Specifically, the implementation of the program & portfolio layer transformation of Wave 1 indicates a positive impact on Cost of Delay values. The negative effect on performance of a cost-saving program, due to loss of focus on value delivery, can be visually identified as starting in month 25.

Thirdly, the measures of Autonomy, represented by the number of dependencies linked in Jira across the implemented issues, and Complexity, represented by communication complexity (refer to the notes in Table 2), provide an indication of Employee Satisfaction & Engagement.

Fourthly, one can observe how the Quality by Design issues serve as an indicator of improved Quality. The perspective taken here is that the quality of design requirements and lower TTR values lead to better product quality. We point out that quality aspects are executed with improved speed and with fewer items, indicating an improvement in quality by design, especially in the context of firmly enforced protocols and a rigorous (internal and external) audit process. In that respect we can largely exclude manipulation of the measurements.

In line with previous findings of Lin et al. [23], we argue that unobtrusiveness and transparency are key success factors in using backlog data. To address this, the measurements at FinOrg have been automated and made available in real time. The system is an integral part of the way of working; in other words, no extra effort is needed, and since the system provides relevant insights for users, they are motivated to maintain (1) high data quality, while (2) the inherent openness reduces the risk of gaming (cf. [18]). In addition, understanding how measures are interconnected and using more than one measure per objective strengthens (3) the reliability of the results. As an example, over- or underestimating Job Size will show up in relation to the Time-to-Resolve and number-of-items measurements, denoted by the connecting lines (cf. Fig. 3).
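As an illustration of such a cross-check, the following minimal sketch flags units whose Job Size estimates do not track actual resolve times; column names are again assumptions about the harmonized export.

```python
# Minimal sketch of a reliability cross-check: if Job Sizes are
# systematically over- or underestimated, this shows up in the relation
# between Job Size and Time-to-Resolve. Column names are assumptions.
import pandas as pd

df = pd.read_csv("resolved_issues.csv")  # hypothetical export

# Spearman rank correlation per unit: a weak or negative correlation
# flags units whose Job Size estimates do not track resolve times.
check = (df.groupby("unit")[["job_size", "ttr_days"]]
           .apply(lambda u: u["job_size"].corr(u["ttr_days"],
                                               method="spearman")))
print(check.sort_values())
```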

5.2 Transformation Success at FinOrg Compared to Prior Evidence

We now elaborate on the main question: can we declare transformation success based on FinOrg’s objectives, and how do these results compare to prior findings? Figure 3 presents the results of our case study (Sect. 2.3) and connects them to the (most) conservative findings from the literature by Stettina et al. [33], both categorized into seven levels (refer to the legend). Based on the backlog data, we were able to identify improvements on three of FinOrg’s five transformation objectives:

  1. Improved Productivity (PROD) by >30%. Note: existing literature reports effectiveness values of >60%. However, we cannot confirm the significantly higher results reported in existing literature for the workflow health dimension, linked to both productivity and time to market (PROD-TTM) as described in Sect. 3.2: functionality/time (483%), business value (400%), days between commits (38%). Moreover, the number of delivered items and velocity decreased; as an explanation we postulated the shift from epics and features to better-defined and smaller-sized (team-level) items.

  2. Faster Time To Market (TTM) by >27%. Note: existing literature reports higher numbers: time-to-market survey results (67%), request journal interval measurements (24%), and lead time (64%).

  3. Higher Quality (QBD) by >41%. We used FinOrg’s (leading) indicator: the quality-by-design measurement. The prior literature focuses on defects and incident/problem data, i.e., on lagging indicators, which makes comparison problematic.

With the use of backlog data we were not able to address (4) Customer Satisfaction and (5) Employee Satisfaction & Engagement, nor the Financial Balanced Scorecard perspective (not part of the FinOrg transformation objectives). Lacking measurements on customer feedback (i.e., customer satisfaction, CUST) and employee satisfaction, we argue that these perspectives can be covered using additional surveys or direct user-experience data.

5.3 The Need for a Performance Management Framework

Our challenges with regard to the comparison and interpretation of measures and results in the literature indicate a need for more research on performance measurement, a topic often discussed but rarely defined (cf. [27]). It is important to learn how measurement (systems) can support, facilitate, and impact the change process and performance of an organization, especially in the context of agile transformations. There is sufficient motivation to suggest that the use of performance management systems can lead to improved capabilities, which in turn impact performance (cf. [13, 20]). Advantages reported in the literature are a higher results orientation, better strategic clarity, higher employee engagement, and quality; reasons for use are an improved focus on control and strategy [38]. An interesting avenue would be to verify these findings in the context of agile transformations. A way forward is to improve our understanding of measures (e.g., performance, productivity, effectiveness, and efficiency, cf. [12, 14, 32, 36]) and to enhance the exploratory mapping to Balanced Scorecard perspectives that we introduced in the context of agile transformations. Combining multiple sources of quantitative backlog measurement with qualitative data such as surveys, customer-experience data, and (inter)subjective estimation data (e.g., Job Size and Cost of Delay estimations) needs to be researched further.

5.4 Limitations and Threats to Validity

This report describes an exploratory data analysis of a case study, demonstrating a proof of concept of using backlog data to measure agile transformations. An exploratory analysis imposes requirements on the traceability of how data has been collected and used. We documented and automated all steps in gathering and transforming the data, alongside our decisions not to use specific data (e.g., excluding dormant backlogs, excluding units, and documenting outliers). In addition, since the data was transparently available, presented, and used throughout the whole organization, potential errors, deficiencies, or lack of quality in registering and maintaining data are largely eliminated. Finally, we were able to use an extensive data set ranging over a long period of time (36 months), which mitigates data-maturity issues. We therefore claim high reliability.

With respect to construct validity, we used Jira software as a single data source. As noted, we paid considerable attention to the care, depth, and quality of the data. In addition, we reviewed data and findings with relevant stakeholders at FinOrg. Finally, for all categories of measurements we used multiple measures in order to substantiate the outcomes. Construct validity can be further improved by extending the research to other data sources and tool providers, thereby providing additional insights on how to combine different data sources.

Using substantial time-series data, validating results and trends over multiple units, and providing plausible explanations for differences between units all strengthen the internal validity of our research. We suggest that further research on objective measurement attributes is a productive avenue to pursue, e.g., financial measures, experience and usage data on services, and problem and incident data.

With respect to external validity, we used a case study with release trains as embedded units. These units are clearly defined, act within the same transformation context, and are therefore suitable for comparison in an exploratory case study. Finally, we projected our findings in the context of the current literature. These efforts strengthen the external validity of our research. However, we recognize that broadening the scope to other organizations and branches, repeating our analysis, will improve the evidence for generalization.

6 Conclusions

The objective of this report was to discuss if, and how, backlog data can be used to help guide agile transformation journeys towards improved organizational performance. We conducted an exploratory embedded multiple-unit case study to identify trends and measure their development against FinOrg’s five transformation objectives. We used Jira backlog data from eight Agile Release Trains and their teams over a period of three years, totaling over 57,000 issues, supplemented by the first author’s engagement in the transformation.

Our contribution is threefold. Firstly, we provide a proof of concept of how backlog data can be used to identify trends and provide guidance, by mapping Jira data sources to the impact dimensions proposed by Stettina et al. [33] as well as to the Balanced Scorecard. Secondly, we provide empirical evidence on the assessment of transformation objectives over time at FinOrg. Thirdly, we compare our measurements to the previously available literature.

We find evidence pointing towards improvements on three of FinOrg’s five transformation objectives: (1) improved productivity, (2) faster time to market, and (3) higher quality. Backlog data did not enable us to report on customer satisfaction and engaged employees. We observe that results are in line with the current literature, although in trends rather than in absolute numbers. It is important to consider the point of departure of the transformation as context for the measurement of success or comparison.

We may conclude that backlog data can help guide agile transformations. By mapping Jira data to the impact dimensions discussed in the available literature, this report describes how backlog data provides a viable source of information to recognize trends, guide agile transformations, and allow organizations to act upon them. We suggest complementing such measurements with other data sources and applying a measurement framework as proposed here.