
3.1 Roadmap and Introduction

After this section’s background preliminaries, we briefly examine the consequences of treating Automation/AI as an overarching rubric under which to frame discussions of algorithmic management, machine learning and big data. We then move, in the bulk of the chapter, to identifying and discussing the three key topics and their associated trade-offs within the sociotechnical context of hardware and software developers working on highly distributed systems such as the technical infrastructures of Google, Netflix, Facebook and Amazon. We conclude by discussing how our case- and topic-specific perspective helps reframe discussions of algorithmic management, machine learning and big data, with special emphasis on the implications for system safety management.

Algorithmic management, machine learning and big data are fairly well-defined concepts. In contrast, the popularised term “AI” is, in some respects, more a hype-driven marketing term than a meaningful concept for discussing the real-world digital issues this chapter focuses on. We will not discuss AI further, nor, for brevity’s sake, will we discuss other key concepts the reader might expect to find alongside algorithmic management, machine learning and big data.

In particular, this chapter does not discuss “expert systems” (expert-rule systems). This omission is important to note because expert systems can be thought of as, in some respects, the opposite of big data/machine learning. In the medical context, an expert system may be developed by having medical professionals define rules that can identify a tumour. A big data/machine learning approach to the same problem might start with the medical professionals marking numerous images of potential tumours as either malignant or benign. The machine learning algorithm would then apply statistical techniques to create its own classification rules for identifying whether unseen tumours are malignant or benign. In case it needs saying, expert systems are also found in other automated fields, e.g., in different autonomous marine systems [10].
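
To make the contrast concrete, the following is a minimal sketch in Python of the two approaches; the feature names, thresholds, training data and use of scikit-learn are purely illustrative assumptions, not taken from any actual medical system.

  # Hypothetical illustration: the same tumour-classification task approached
  # two ways. Feature names, thresholds and data are invented for the sketch.

  # Expert-rule system: medical professionals hand-write the decision rules.
  def expert_rule_classifier(tumour):
      # e.g. "large, irregular tumours are malignant" encoded directly
      if tumour["diameter_mm"] > 20 and tumour["irregular_border"]:
          return "malignant"
      return "benign"

  # Big data / machine learning: professionals only label examples; the
  # algorithm derives its own statistical decision rules from the labels.
  from sklearn.linear_model import LogisticRegression

  labelled_images = [  # (features, label) pairs marked by professionals
      ([25.0, 1.0], "malignant"),
      ([8.0, 0.0], "benign"),
      ([30.0, 1.0], "malignant"),
      ([12.0, 0.0], "benign"),
  ]
  X = [features for features, _ in labelled_images]
  y = [label for _, label in labelled_images]

  model = LogisticRegression().fit(X, y)    # rules are learned, not written
  print(model.predict([[22.0, 1.0]]))       # classify an unseen tumour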

These differences matter because sociotechnical systems differ. In autonomous marine systems (among others), there are components (including processes and connections) that must never fail and events that must never happen (e.g., irreversible damage to the rig being repaired by the remotely operated vehicle) if the autonomous system is to be reliable. Redundant components may not be readily available in rough seas. In highly distributed systems, by contrast, each component should be able to fail in order for the system to be reliable; here redundancy or fallbacks are essential.
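
A minimal Python sketch of this second, distributed-systems posture, with entirely hypothetical service objects and timeouts, might look like the following: any single dependency is allowed to fail, and the caller degrades gracefully rather than assuming the component never fails.

  # Hypothetical sketch: in a highly distributed system each dependency is
  # allowed to fail; the caller degrades gracefully instead of assuming the
  # component never fails. Service names and timeouts are invented.

  import logging

  def fetch_recommendations(user_id, primary_service, fallback_service):
      """Try the personalised service; fall back to a simpler, cached one."""
      try:
          return primary_service.recommendations(user_id, timeout=0.2)
      except Exception as exc:                 # any component may fail
          logging.warning("primary failed (%s); using fallback", exc)
      try:
          return fallback_service.popular_items(timeout=0.2)
      except Exception:
          return []                            # degrade to an empty result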

3.2 Limitations of “Automation” as a Covering Concept

To get to what needs to be said about algorithmic management, machine learning and big data, we must usher one elephant out of this chapter (albeit one very much still found elsewhere in this volume), that of capital-A “Automation”. This very large topic is the subject of broad political, social and economic debates (for much more detailed discussions of the interrelated debates over “automation” writ large, see Benanav [2, 3]; McCarraher [5, 6]):

  • Economic: It is said that Automation threatens widespread joblessness for people now employed or seeking employment in the future.

  • Social: It is said that Automation poses huge, new challenges to society, not least of which is answering the existential question, “What is a human being and the good life?”

  • Political: It is said that Automation poses new challenges to the Right and Left political divide, e.g., some Right free-market visionaries are just as much in favour of capital-A Automation as some elements on the Left, e.g., “Fully Automated Luxury Communism” (for more on the possible political, social and economic benefits, see Prabhakar [7]).

This chapter has nothing to add to, or clarify for, these controversies. We do not, however, see why these concerns must be an obstacle to thinking more clearly about the three topic areas.

The sociotechnical context, this chapter seeks to demonstrate, is just as important. Large-scale sociotechnical systems, not least of which are society’s critical infrastructures (water, energy, telecommunications…), are not technical or physical systems only; they must be managed and operated reliably beyond inevitably baked-in limitations of design and technology [8, 9]. The sociotechnical context becomes especially important when the real-time operational focus centres on the three subject areas of algorithmic management, machine learning and big data in what are very different large sociotechnical systems that are, nevertheless, typically conflated together as “highly automated”.

If we are correct—the wider economic, political and social contexts cannot on their own resolve key concerns of the sociotechnical context—then the time is ripe for addressing the subjects of concern from perspectives typically not seen in the political, social and economic discussions. The section that follows is offered in that spirit.

3.3 Developers’ Perspective on a New Software Application

We know software application developers make trade-offs across different evaluative dimensions. The virtue of the dimensions is that each category can be usefully defined from the developer’s perspective and that each fits into a recognisable trade-off faced by software developers in evaluating different options (henceforth, “developers” being a single engineer, team or company).

This section focuses on a set of interrelated system trade-offs commonly understood by software developers, including their definitions and some examples. Many factors will be familiar to readers, albeit perhaps not as organised below. No claim is made that the set is an exhaustive list. These well-understood dimensions are abstracted for illustrative purposes in Fig. 3.1.

Fig. 3.1 Four key interrelated trade-offs for software developers, shown as bidirectional arrows: comprehensibility/features, human operated/automated, stability/improvements, and redundancy/efficiency

  1. Comprehensibility/Features Dimension

    • Comprehensibility (Left Side): The developer’s ability to understand the system, bounded by human cognitive limits. Highly distributed systems are often beyond the ability of one team, let alone one individual, to fully know and understand as a system.

    • Features (Right Side): Capabilities of the system. Additional features provide value to users but increase the system’s sociotechnical complexity,Footnote 1 thereby reducing comprehensibility.

  2. Human Operated/Automated Dimension

    • Human Operated (Left Side): Changes to the system configuration are carried out by human operators. For example, in capacity planning, servers may be manually ordered and provisioned to address forecasted demand.

    • Automated (Right Side): The system may dynamically change many aspects of its operation without human intervention. For instance, it may automatically provision or decommission servers with no operator involved (see the sketch following this list).

  3. Stability/Improvements Dimension

    • Stability (Left Side): The system operates at full functionality without failure. Beyond strict technical availability, stability may also include the system remaining usable by operators trained on an earlier version without requiring retraining.

    • Improvements (Right Side): Changes to the system are made to provide new and enhanced features, or other enhancements such as decreased latency (response time).

  4. Redundancy/Efficiency Dimension

    • Redundancy (Left Side): The ability of the system to experience the failure of one or more components (including processes and connections) and still have the capacity to support its load. An example is a system provisioned with a secondary database ready to take over should the primary fail.

    • Efficiency (Right Side): The ability of the system to provide service with minimal cost or resource usage; that is, paying only for what is actually used.
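
As an illustration of the Human Operated/Automated dimension above, the following is a minimal Python sketch of an automated scaling heuristic; the thresholds and the provision/decommission calls are hypothetical stand-ins for a cloud provider’s API rather than any real interface.

  # Hypothetical autoscaling heuristic for the Human Operated/Automated
  # dimension: the system provisions or decommissions servers without a
  # human operator. Thresholds and the provision/decommission callables are
  # invented stand-ins for a real cloud provider API.

  def autoscale(current_servers, cpu_utilisation, provision, decommission,
                min_servers=2, max_servers=100):
      """Scale the fleet up or down based on average CPU utilisation."""
      if cpu_utilisation > 0.75 and current_servers < max_servers:
          provision(count=1)              # automated: no human in the loop
          return current_servers + 1
      if cpu_utilisation < 0.25 and current_servers > min_servers:
          decommission(count=1)           # keep a redundancy floor (min_servers)
          return current_servers - 1
      return current_servers              # within the comfort band: do nothing

A human-operated equivalent of the same decision would be an operator watching the same utilisation figures and manually ordering or retiring servers.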

These four dimensions are relied upon by software builders, and the trade-offs can be explicitly codified as part of the software development and application process. Consider Google’s Site Reliability Engineering (SRE) “error budget”, where applications are given a budget of allowed downtime or errors within a quarter. If the budget is exceeded (the application is down for longer than budgeted), additional feature work on the product is halted until the application is brought back within budget.Footnote 2 This is an explicit example along the Stability/Improvements dimension.
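
As a rough sketch of how such a budget can be made explicit, the following Python fragment computes an allowed-downtime budget from an availability objective; the 99.9% target and the figures are illustrative assumptions, not Google’s actual numbers.

  # Hypothetical error-budget check in the spirit of the SRE practice
  # described above. The availability target and downtime are illustrative.

  QUARTER_MINUTES = 91 * 24 * 60      # roughly one quarter, in minutes
  SLO_TARGET = 0.999                   # e.g. a 99.9% availability objective

  error_budget = QUARTER_MINUTES * (1 - SLO_TARGET)   # approx. 131 minutes

  def feature_work_allowed(observed_downtime_minutes):
      """Stability/Improvements trade-off made explicit: once the budget is
      spent, improvements pause until the application is back within budget."""
      return observed_downtime_minutes <= error_budget

  print(error_budget)                  # approx. 131.04 minutes per quarter
  print(feature_work_allowed(90))      # True: budget remains, ship features
  print(feature_work_allowed(200))     # False: halt feature work, stabilise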

For each of the four dimensions, current technology and organisational processes occupy one or more segments along the dimension. These respective segments expand/intensify as new technology and processes are developed.

By way of illustration, consider the Human Operated/Automated dimension. Technology and new services have provided additional opportunities to automate the management of increasingly complex sociotechnical systems:

  • In the 2000s, the advent of Cloud providers such as Amazon Web Services (AWS) and Google Cloud Platform (GCP, initially with App Engine) provided significant opportunities to provision hardware via application programming interfaces (APIs) or other technical interfaces, making it relatively simple to spin hardware instances up or down based on automated heuristics.

  • More recently, big data and machine learning have provided additional opportunities to manage systems using opaque ML algorithms. DeepMind has, for example, deployed a model that uses machine learning to manage the cooling of Google’s data centres, leading to a 40% reduction in energy use.Footnote 3

  • Processes, such as Netflix’s Chaos Monkey, enable the organisation to validate the behaviour of their highly complex systems under different failure modes. By way of example, network connectivity may be deliberately broken between two nodes to confirm the system adapts around the failure (see the sketch following this list), enabling the organisation to operate increasingly complex and heavily automated architectures.
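
A minimal Python sketch of this kind of fault-injection check, with hypothetical node names and callables standing in for whatever tooling actually breaks and restores connectivity, might look like the following.

  # Hypothetical fault-injection sketch in the spirit of Chaos-Monkey-style
  # processes: deliberately break one connection, then verify that the system
  # adapts around the failure. Node names and the callables are invented.

  import random

  def run_partition_experiment(nodes, break_link, restore_link, health_check):
      """Break connectivity between two random nodes and confirm the system
      still serves requests, then restore the link."""
      a, b = random.sample(nodes, 2)
      break_link(a, b)                  # inject the failure on purpose
      try:
          healthy = health_check()      # does the system route around it?
      finally:
          restore_link(a, b)            # always clean up the experiment
      return healthy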

The expansion of a dimension’s segments is dominated by an asymmetrical expansion of activities and investments on the right side. The importance of Cloud providers, big data and machine learning in driving the expansion has already been mentioned. Other factors include sociotechnical shifts such as agile methodologies, the rise of open source, the development of new statistical and machine learning approaches, and the creation of more recent hardware such as GPUs and smartphones.

3.4 What’s the Upshot for System Safety? Obsolescence as a Long-Term Sociotechnical Concern

System safety is typically taken to be on the left side of the developer’s trade-offs, located in and constituted by stability, redundancy, comprehensibility and recourse to human (manual) operations. Since the left side is also expanding (due in part to advances not reported here outside the three topic areas), we can assume the left-side expansion contributes to advances in safety as well.

If the left side is associated with “system safety”, then the right side can be taken as the maximum potential to generate “value” for the developer/company. Clearly, increases in both right-side features and right-side efficiencies can increase the ability of the system to provide value for its operators or users, other things being equal.

Now, look at that left side more closely, this time from the perspective of the designer’s long term versus short term.

Software applications are littered with examples of stable and capable systems that were rendered obsolete by systems that better met users’ needs in newer, more effective ways. If the current electrical grid rarely goes down, that is one form of safety. But do we want a system that is stable until it catastrophically fails or is no longer fit for new purposes? Or would we prefer systems that fail through frequent small defects which, while fixable in real time, nonetheless produce a steady stream of negative headlines?

More generally and from a software designer’s perspective, we must acknowledge that even the most reliable system becomes, at least in part and after a point, outdated for its users by virtue of not taking advantage of subsequent improvements, some of which may well have been tested and secured initially on the right side of the trade-offs.

In this way, obsolescence is very much a longer-term system safety issue and should be given as much attention, we believe, as the social, political and economic concerns mentioned at the outset. Cyber-security, for example, is clearly a very pressing right-side issue at the time of writing but would still be pressing over the longer term because even stable defences become obsolete (and for reasons different from the current short-term ones).Footnote 4

3.5 A Concluding Speculation on When System Safety Is Breached

Since critical infrastructures are increasingly digitised around the three areas, it is fair to ask: Can or do the software developer skills involved in making the four trade-offs assist in immediate response and longer-term recovery after a digitally dependent infrastructure has failed in a disaster? This is unanswerable in the absence of specific cases and events and, even then, answering would require close observation over a long period. Even so, the question has major implications for theories of large-system safety.

The crux is the notion of trade-offs. According to high reliability theory, system safety during normal operations is non-fungible after a point, that is, it cannot be traded off against other attributes like cost. Nuclear reactors must not blow up, urban water supplies must not be contaminated with cryptosporidium (or worse), electric grids must not island, jumbo jets must not drop from the sky, irreplaceable dams must not breach or overtop, and autonomous underwater vessels must not damage the very oil rigs they are repairing. That disasters can and do happen reinforces the dread and commitment of the public and system operators to this precluded-event standard.

What happens, though, when even these systems, let alone other digitised ones, fail outright, as in, say, a massive earthquake or a geomagnetic storm and blackout? Such emergencies are the furthest critical infrastructures get from high reliability management of their normal operations. In disasters, safety still matters, but trade-offs surface all over the place, and skills in thinking on the fly, riding uncertainty and improvising are at a premium.

If so, we must speculate further. Do skills developed through making the specific software trade-offs add value to the immediate response and recovery efforts of highly digitised infrastructures? Or, from the other direction: Is the capacity to achieve reliable normal operations in digital platforms (not by precluding or avoiding certain events but by adapting to electronic component failure almost anywhere and almost all of the time) a key skill set of software professionals and their wraparound support during emergency management for critical infrastructures? Answers are a pressing matter, as when an experienced emergency manager in the US Pacific Northwest itemised for one of us (Roe) just how many different software systems critical to the emergency management infrastructure depend on one platform provider that is a major presence in the region (and globally, for that matter).