In four decades, genomics has transformed the biological sciences and has penetrated well beyond them. The marriage of DNA sequencing techniques and computational infrastructures built to handle, store and analyse ever-increasing quantities of data has contributed to significant developments in:

  • Our understanding of human history through our relationship to Neanderthals, Denisovans and other hominids (Pääbo, 2014);

  • Our appreciation of the extent and diversity of life previously undetected by biological methods (Riesenfeld et al., 2004; Venter et al., 2004);

  • Forensic science, food tracing and nature conservation (Arenas et al., 2017);

  • Our picture of the Tree of Life and the evolutionary relationships within it (O’Malley et al., 2010);

  • The reclassification of diseases resulting in improved diagnosis, prognosis and treatment options (Keating et al., 2016);

  • Enhancements in the efficacy of selective breeding in agriculture (Lowe & Bruce, 2019);

  • The reshaping of the fundamental models and metaphors with which we think about how living things develop and function (Keller, 2000).

DNA sequencing has gone from being a highly specialised practice, requiring considerable labour and skill, to being routinely applied in ordinary laboratory work while also being conducted at great scale, speed and accuracy in factory-style genome centres. In the late-1970s, manually sequencing the tiny genome of a bacteriophage (a virus that infects bacteria) was a monumental task, one that earned Frederick Sanger, who led the group undertaking it, a Nobel Prize (Brownlee, 2014; Hutchison, 2007). The determination of the whole human DNA sequence (commonly referred to as the Human Genome Project) took more than a decade, at a cost initially estimated at $3 billion. It started in 1990 and concluded in 2003, expanding in speed and scale throughout.

Progress since then has been so dramatic that, more recently, well over fourteen million coronavirus genomes have been sequenced and shared via the Global Initiative on Sharing Avian Influenza Data.Footnote 1 Another example that illustrates how far genomics has come is that the cost of sequencing a whole human genome was estimated to be about £7000 in 2020, multiple orders of magnitude below the original budget of the Human Genome Project (Schwarze et al., 2020).Footnote 2

In 1999, four years before the Human Genome Project was officially concluded, the National Center for Biotechnology Information of the USA created a new database called RefSeq. The purpose of this database was to serve as a centralised repository gathering the ongoing reference sequence of the human genome and those of other species, whether completed or in progress. Those reference sequences were and still are curated and freely released to the research community. They serve as canonical representations of their designated species and are graded according to their level of comprehensiveness, representativeness and quality (Ostell, 2013, pp. 72–74; Tatusova et al., 2014, p. 135).Footnote 3

The number of entries in RefSeq has grown exponentially, from complete sequences representing just over two thousand different species in 2003 to 125,116 in November 2022.Footnote 4 On top of this, RefSeq also curates and stores a larger number of partial sequences, as well as variants and other versions of complete reference genomes. Life scientists from every discipline all around the world can access the sequences and curatorial metadata. In processing each existing and upcoming entry, RefSeq curators attempt to strike a balance between respecting the differences across the stored sequences and avoiding a Tower of Babel of different communities producing separate datasets that would require considerable effort to integrate, use and compare outside their contexts of creation. Yet in fostering this universal—or at least commensurate—language, some of the distinctions between the individual reference genomes are flattened, and indeed lost.

In what follows, we make some of these distinctions visible again by looking at the history of the production of three reference genomes: those of the baker’s and brewer’s yeast Saccharomyces cerevisiae released in 1996 and published in 1997; Homo sapiens, published in 2001 as a working draft and in more definitive form in 2004; and the pig Sus scrofa, initially released in 2009 and published in 2012. Taken together, these three genomes embody overlapping trajectories of change and differentiation in the practices, goals, organisation and status of genomics research. While yeast is both a model organism in basic biomedical science and a tool for the brewing and biotechnology industries, pigs were mainly sequenced for agricultural purposes, but also to serve objectives of human medicine—for instance, helping organ transplantation. Sequencing H. sapiens became the most prominent area of genomics, one believed to have potentially invaluable clinical payoffs.

By examining the substantially different ways in which these endeavours were conducted across the three organisms, this book argues that producing a whole-genome reference sequence was not always the main—nor the universally accepted—objective of genomics, as the growing number of entries in RefSeq may suggest. What these now centrally curated reference sequences represented, and the uses to which they were put, also varied substantially across the communities that produced them, in spite of the commensuration work of RefSeq and cognate institutions and repositories.

The rest of this introductory chapter summarises the main features of genomics and how it historically emerged from the practices that have subsequently accompanied it and conferred its identity upon it: mapping and sequencing DNA, and processing the resulting data with information technologies, including databases.Footnote 5 We then present the key concepts and analytical tools that we use throughout the book and outline how we develop them in the remaining seven chapters. We argue that popular and scholarly accounts have tended to place excessive emphasis on the Human Genome Project in the history of genomics, due to the perceived impact and high profile of this initiative. We refer to this Human Genome Project-centred history as the canonical, master narrative of genomics, and relate its structure to the hourglass model that prior historiography has applied to the study of heredity throughout the nineteenth and twentieth centuries. As in the case of the study of heredity (Barahona et al., 2010), the hourglass model aids the comprehension of the institutional and infrastructural landscape of genomics, while falling short in capturing its broader history. We escape the boundaries of the hourglass model by looking at non-human genomic endeavours and documenting the deep entanglement between the creation of reference genomes and the communities involved in their production. We propose the term genomicist to capture the crucial role of communities in the construction of genomic data and materials, and highlight both inclusive and exclusive mechanisms in the formation and operation of those communities.Footnote 6

1.1 Genomics, DNA Mapping and Sequencing

The sequencing of DNA is the determination of the order of the four ‘bases’—adenine, thymine, cytosine and guanine, known by their initials A, T, C and G—along each of the two complementary strands of nucleotides that wind around each other to produce the molecule’s double-helical structure. Sequencing is central to genomics. However, genomics involves far more than just this, and sequencing can be conducted outside of genomics research and for other biological molecules, such as RNA and proteins. Indeed, while the history of sequencing—of proteins, RNA and then DNA—can be traced back to the 1950s, 1960s and 1970s respectively, genomics proper is recognised to have arisen only in the 1980s (García-Sancho, 2010). Its antecedents were not only sequencing practices, but also the mapping of chromosomes (bodies containing DNA in the cell), and the development of information technologies to process the resulting map and sequence data.

Chromosome mapping dates back to the early twentieth century and is conducted in order to locate certain landmarks, such as genes, on the chromosomes (de Chadarevian, 2020; Hogan, 2016; Rheinberger & Gaudillière, 2004).Footnote 7 It had been known since the early days of mapping that genes constitute only a small portion of chromosomes; after the discovery of the structure of DNA in 1953, genes were increasingly identified with partial, specific segments of the nucleotide sequence within the chromosome. The third central practice of genomics, the processing of the resulting map and sequence information using databases and computational methods, started to be applied to DNA in the 1970s. Similar practices involving other biological and medical data, such as the elucidation of protein sequences or the three-dimensional structures of proteins, can be traced back to the decades following World War II (Strasser, 2019, Ch. 3; de Chadarevian, 2002, Ch. 4).

What makes genomics distinct from sequencing and these other practices, when they are considered separately? While it is important to avoid the error of being too inclusive, there is also the risk that a strict and exclusive definition of genomics can project the way that genomics developed—or at least a particular trajectory of it—back on to the past. To put it bluntly, there is a danger of a winner’s narrative: that those who succeeded in making their vision of genomics a reality—or who are currently in charge of the institutional manifestation of it—dictate the boundaries of the field and project them retrospectively (Suárez-Díaz, 2010).

Areas of scientific endeavour, particularly ones with disciplinary names and associated journals, databases, brick-and-mortar facilities and well-funded institutions, are social and sociological phenomena. This means that the demarcation and boundary work performed by influential social groups and networks shapes the reality of the field. But scientific fields, disciplines and other phenomena are not only social creations and objects in this top-down political sense. They are also composed of configurations of methods, techniques, technologies, theories, models, research programmes and commitments, norms and the careers, interests and activities of less-prominent scientists. These are no less infused with the social, cultural and political, but they deny elite political, cultural and social mechanisms the exclusive power to define what scientific endeavours like genomics are.

It is not our job to provide an exhaustive and authoritative definition of genomics that takes account of these considerations. We can note, however, and show throughout this book, that the historical configuration of genomics involved a multi-directional, often dialectic, interaction between elite actors, less influential bench biologists and computer experts, all of whom mobilised differing visions, methods and forms of organisation. Genomics necessarily involves some form of sequencing and/or mapping of the genome, wherein the products—in the form of data—are stored and analysed using computational (informatics) infrastructures. To constitute genomics, this must be associated with a more general effort to construct a systematic representation of the genome, either in whole or in part.

The term ‘genome’ long antedates the idea of ‘genomics’, being coined by the German botanist Hans Winkler in 1920 to denote “the haploid chromosome set” (as translated in Lederberg, 2001). The haploid set comprises one of each pair of chromosomes; so for humans, who have a total of 46 chromosomes arranged in 23 pairs,Footnote 8 the haploid set consists of 23 chromosomes. Scholars have noted that the term genome, and genomics itself, aims to capture something comprehensive, a totality (Rheinberger & Müller-Wille, 2017; Stevens, 2013). Does this mean that something can only be genomic if it aims at the complete mapping or sequencing of a genome? Not necessarily. On the basis of achieving total completeness or comprehensiveness, barely anything could constitute genomics. Additionally, what constitutes completeness or comprehensiveness is not fixed; as we see later in the book, and particularly in Chap. 7, the goal posts are always moving. One may say that, as long as there is a concerted effort being made towards that end, it is genomics. However, the indeterminacy of what constitutes the end-point means that there is no strict criterion for ruling any given endeavour either in or out. The idea of a process or journey towards a goal means that the line between ‘true’ genomics and mere sequencing and mapping is somewhat blurry. How close does one need to be to the ever-receding end-point to be doing genomics?

Instead, we prefer to recognise genomics through its systematicity and its treatment of the genome as the substrate of its efforts. By systematicity, we mean that there is some concerted—and often collective—effort to identify and establish relations between multiple objects in and across the genome. By substrate, we mean that the genome is the field of operations for this activity: that which is to be mapped and the map itself. This does not mean that the whole genome needs to be mapped—or sequenced—for an effort to be deemed genomic. We distinguish systematicity from comprehensiveness and argue that in the history of genomics—especially during the early days—there were a substantial number of systematic but not comprehensive efforts, in the form of concerted operations that only addressed certain regions of target genomes.

Our criteria do not imply that all research that tries to identify genes in the genome can be classed as genomics. If a molecular geneticist were to identify a gene that they had good reason to believe was implicated in some process in the cell, sequence that gene and then study the way it is expressed—how it results in the production of a specific protein—this would fall well short of being genomics in both aspects of our guideline. It only considers a single object in the genome. Even in cases where two or more genes were involved in the process of interest, if the research does not consider the relations between them as objects in the genome, it would still not fulfil our second, ‘genome-as-a-substrate’ criterion. If, instead, the researcher used known products of genes relating to a biological process of interest in order to identify and map multiple DNA sequences across the genome—ideally in collaboration with other laboratories—they would have shifted towards a more genomic way of working. This is because the focus is now on the genome as a territory to be mapped, rather than just on individual genes. Indeed, as we show in the next chapters, this kind of activity and the communities that converged around it became key drivers of genomics research from the 1990s onwards.

The invention of DNA sequencing methods in the 1970s was crucial to the forging of genomics. One of the main pioneers was Frederick Sanger, who had previously worked to discern the sequence of amino acids—the fundamental building blocks of proteins—in insulin, for which he won the Nobel Prize in 1958. He then moved on to RNA, the intermediary molecules in the process by which stretches of DNA form the basis for the synthesis of proteins with specific amino acid compositions. While other researchers in the mid-1970s such as Allan Maxam and Walter Gilbert also developed DNA sequencing methods, the technique that Sanger and his team devised at the Medical Research Council’s Laboratory of Molecular Biology in Cambridge (UK) became the dominant approach before the creation of newer methods in the twenty-first century (García-Sancho, 2012, Chs. 1–2).

Sanger’s technique required extremely time-consuming and labour-intensive bench work, as well as considerable technical and interpretive skills. The refinement of manual methods alongside the increasing automation of parts of the process—including the invention and ongoing improvement of automated sequencing machines from the mid-1980s—enabled ever more DNA to be sequenced in less time (García-Sancho, 2012, Chs. 5–6).Footnote 9 As the 1980s proceeded, therefore, the quantities of DNA sequence data expanded rapidly year on year.

Alongside this were developments in mapping genes and other markers on the chromosomes. Genetic mapping had been pioneered by Thomas Hunt Morgan and his colleagues in the 1910s, working with the fruit fly Drosophila melanogaster. As in most animals, Drosophila’s chromosomes come in pairs within the cell nucleus. Morgan’s team observed, tracked and recorded different variant traits—such as eye colour or wing shape—in many thousands of these flies, which were systematically bred and assessed (Kohler, 1994). The traits were presumed to result from different mutant versions of genes occurring across the chromosomes.

Morgan and his team exploited two facets of genetics: linkage and recombination. Linkage means that certain genes are commonly inherited together, which in the fly experiments meant that the associated traits were linked across generations. Recombination, discovered by the Morgan laboratory in their explorations of genetic linkage, happens during the creation of the sex cells (gametes), in a process called meiosis in which the pairs of chromosomes separate. During meiosis, parts of one chromosome can swap places with the corresponding parts of the other member of its pair. This means that the linkage between genes can be broken.

Morgan’s laboratory realised that they could use this to find out the relative positions of genes on the fly’s chromosomes: the further apart two genes were, the more likely it was that a recombination event would occur between them, breaking their linkage. The frequencies of co-occurrence of versions of particular genes could thus be used to ascertain their relative proximity and order on the chromosomes. An array of relatively simple traits inherited from parent to offspring fly—such as the aforementioned eye colour and wing shape—enabled the group to map the Drosophila chromosomes and to further discern chromosomal dynamics in doing so. These maps of estimated chromosomal positions started to be called genetic linkage maps (see the upper part of Fig. 1.1).Footnote 10
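The arithmetic behind such maps can be illustrated with a short computational sketch. The code below is our own illustration, not a reconstruction of the Morgan laboratory’s calculations: the three loci (A, B, C) and their recombination counts are invented, and we assume the textbook convention that the recombination fraction between two loci approximates their distance in map units.

```python
# Minimal sketch of linkage mapping with invented data. Offspring are
# scored as recombinant or parental for each pair of loci; the fraction
# of recombinants approximates the relative distance between the loci.
from itertools import combinations

# Hypothetical recombinant counts out of 1000 offspring per locus pair.
recombinants = {
    ("A", "B"): 90,   # 9.0 map units
    ("B", "C"): 30,   # 3.0 map units
    ("A", "C"): 120,  # 12.0 map units
}
TOTAL = 1000

def map_distance(pair):
    """Recombination fraction, expressed in map units (centiMorgans)."""
    return 100 * recombinants[pair] / TOTAL

# The most distant pair sits at the ends; the remaining locus is in between.
ends = max(recombinants, key=map_distance)
middle = ({"A", "B", "C"} - set(ends)).pop()
print(f"Inferred order: {ends[0]} - {middle} - {ends[1]}")

for pair in combinations("ABC", 2):
    print(f"{pair[0]}-{pair[1]}: {map_distance(pair):.1f} map units")
```

Because the invented distances are additive (9 + 3 = 12), locus B must lie between A and C; this is, in schematic form, the inference Morgan’s group drew from the co-occurrence frequencies of fly traits.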

It took several decades for this approach to be applied to humans. When it was, inter-generational studies of families experiencing disproportionate numbers of cases of particular medical conditions could be used to identify the kind of genetic basis underlying those conditions and to assess the linkage relationships (Comfort, 2012; Lindee, 2005). This practice received a considerable boost when, in the 1960s, molecular biologists began detecting polymorphic (many-variant) genetic markers that could be positioned on the chromosomal structures. These markers provided many more landmarks for the identification and analysis of variation, beyond the small number of individuals suffering from medical conditions or showing morphological traits that could be observed with the naked eye and therefore mapped using the principles of genetic linkage. As we show in subsequent chapters, from 1973, human and medical geneticists periodically gathered in chromosome mapping workshops, the first of which was held at Yale University. These workshops enabled attendees to systematically pool their mapping results—some of them obtained through molecular methods—and achieve increasingly higher resolution in the location of genes and other markers of mainly medical interest.Footnote 11

The first genetic linkage map encompassing the whole human genome obtained through molecular markers—Restriction Fragment Length Polymorphisms or RFLPs—was published in 1980. It deployed proteins known as restriction enzymes, which cleave the DNA molecule at specific sequence sites. When the enzymes were applied to DNA samples from multiple individuals whose sequences diverged, the cleavage would produce different patterns of fragments. These different fragment patterns could be detected and used to map the sequence-specific genome regions where the restriction enzymes acted (Botstein et al., 1980). The same enzymes had been used from the mid-to-late 1970s as part of recombinant DNA technologies, a suite of methods that enabled researchers to cleave and isolate specific fragments of the genome of one organism and transfer them into another. As a result, for instance, human genes encoding insulin—a protein used for the treatment of diabetes—could be expressed in a controlled way in bacteria (Rasmussen, 2014; Yi, 2015).
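The RFLP principle lends itself to a similarly brief sketch. In the code below, only the recognition site of the enzyme EcoRI (GAATTC) is real; the two short ‘individual’ sequences are invented, and real genomic fragments would be orders of magnitude longer. The point is simply that a single-base difference can destroy a cleavage site and thereby change the observable pattern of fragment lengths.

```python
# Minimal sketch of the RFLP principle, using invented sequences.
import re

SITE = "GAATTC"  # recognition site of the restriction enzyme EcoRI

def fragment_lengths(dna: str) -> list[int]:
    """Fragment lengths after cutting at each occurrence of SITE.
    For simplicity, we cut at the start of the site, ignoring the
    enzyme's actual staggered cut within it."""
    return [len(f) for f in re.split(f"(?={SITE})", dna) if f]

# Two hypothetical individuals; the second carries a G-to-A change
# that destroys the middle recognition site.
individual_1 = "ATGCCGAATTCGGTTACCAGAATTCTTGA"
individual_2 = "ATGCCGAATTCGGTTACCAAAATTCTTGA"

print(fragment_lengths(individual_1))  # [5, 14, 10]: three fragments
print(fragment_lengths(individual_2))  # [5, 24]: a detectable polymorphism
```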

These molecular methods propelled the creation of a different type of map in the 1980s. Rather than representing the approximate location of genes and markers on the chromosomes—as the genetic linkage maps did—this new map visualised a set of ‘physical’ DNA fragments ordered as overlapping lines across the genome (see the lower part of Fig. 1.1). In organisms with larger genomes, the construction of these physical maps required the prior generation of libraries to store and manage the thousands of fragments into which the DNA contained in the different chromosomes would be broken.

Producing a ‘DNA library’ or ‘genome library’ involves using restriction enzymes and other recombinant techniques to insert DNA from the organism to be mapped into the genome of another organism (Hutchison, 2007; Loenen et al., 2014). As well as functioning as warehouses of the DNA inserts, the host organisms can also be used to amplify the fragments to be mapped, multiplying their number. This is achieved through the reproductive cycle of the host organism, which results in the production of cloned copies of the original inserted DNA. The libraries can be screened as well, for instance by hybridisation: using the property of chemical complementarity by which, in a double-stranded DNA molecule, adenines always bond with thymines and cytosines with guanines. Building on this, a probe containing a specific sequence can be designed to detect and locate particular fragments to which it will hybridise: chemically bond, due to the complementarity of its bases.Footnote 12
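Screening by hybridisation can also be rendered as a toy sketch. All sequences below are invented; the sketch treats each library insert as a single written strand and asks whether a probe would base-pair with it, that is, whether the insert contains the probe’s reverse complement. Real hybridisation involves both strands and tolerates partial matches, which we ignore here.

```python
# Toy sketch of screening a DNA library by hybridisation, with invented
# sequences. A pairs with T and C pairs with G on antiparallel strands,
# so a probe binds wherever a fragment contains its reverse complement.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq: str) -> str:
    """The complementary strand of seq, read in the conventional direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def hybridises(probe: str, fragment: str) -> bool:
    """True if the probe would base-pair somewhere along this fragment."""
    return reverse_complement(probe) in fragment

# A hypothetical library of cloned inserts and an invented probe.
library = {
    "clone_1": "TTGACCGTAGGCAT",
    "clone_2": "GGCTTAAGCCATGC",
    "clone_3": "ACGTACCTAGGTTA",
}
probe = "GCTTAA"

for name, insert in library.items():
    if hybridises(probe, insert):
        print(f"{name} lights up: the probe binds this fragment")
```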

In the early days of sequencing, viruses or circular chromosomes called plasmids—present in bacteria such as Escherichia coli—were used as vectors for libraries, but these were limited in storage capacity. In 1987, however, Yeast Artificial Chromosomes (YACs) were developed, offering considerably larger storage capacity. Later, in 1992, Bacterial Artificial Chromosome (BAC) libraries were created, with several quality-related advantages over YACs to compensate for their smaller capacity.

Ordering the inserted DNA fragments of these libraries in physical maps enabled researchers to isolate and access those fragments, which could be used for sequencing purposes or any other sort of genetic experiment. The overlaps detected between the fragments also allowed their assembly into a reference sequence, as was done with the human and other genomes (see lower part of Fig. 1.1). A central argument of this book is that the way in which libraries were constructed, and mapping was combined with sequencing, crucially distinguished the production of the yeast, human and pig reference genomes, thus embodying different forms of organising genomics, and affecting the potentialities and limitations of the resulting sequence data.

Fig. 1.1

Above, a genetic linkage map of the six chromosomes of the nematode worm Caenorhabditis elegans, elaborated by molecular biologists Sydney Brenner, Robert Horvitz and Jonathan Hodgkin in the 1970s, the decade in which the chromosome workshops started. Below, a diagrammatic representation of how a physical map is produced from a BAC library and assembled into a sequence—in this case, the reference sequence of the human genome. The physical map is the third illustration starting from the top (“Organized mapped large clone contigs”) and the sequence is the bottom illustration (“Assembly”). Above image: Reproduced from Hodgkin, J, Horvitz, R, Brenner, S, Nondisjunction mutants of the nematode Caenorhabditis elegans. Genetics, 1979, 91(1), 67–94: Fig. 1 on p. 70, by permission of Oxford University Press. Below image: Reprinted by permission from Springer Nature Customer Service Centre GmbH: Springer Nature, Nature (https://www.nature.com/), Initial sequencing and analysis of the human genome, International Human Genome Sequencing Consortium, 2001: Fig. 2 on p. 863

The growing ability to map and sequence DNA presented a problem: what to do with the resulting data. In 1980, the first global database to gather DNA sequences was launched. This was the Nucleotide Sequence Data Library, sponsored by the European Molecular Biology Laboratory as a shared repository to which the life sciences community could both submit their sequencing results and access the data contributed by others (García-Sancho, 2012). In 1982, the US National Institutes of Health (NIH) created an equivalent repository—GenBank, on which RefSeq would later be built—and, in 1986, the DNA Data Bank of Japan started its operation. During their early years, these repositories struggled to keep up with processing the increasing quantities of sequence data, while simultaneously having to confront the problem that much of what was being produced was kept by the laboratories that performed the work and not shared with the wider community. In 1987, the three databases reached an agreement by which their entries would be mirrored and users would be able to access the same information regardless of the repository they queried. Their curators also started persuading journal editors to make submission to one of the databases compulsory ahead of the publication of new DNA sequences, something that became increasingly customary in the 1990s (Strasser, 2019, Chs. 5–6; Stevens, 2018).

That same year, 1987, the journal Genomics was founded. It was co-edited by prominent medical geneticists Victor McKusick and Frank Ruddle, who in the previous decade had played a leading role in organising the first chromosome mapping workshop at Yale University. The first editorial of Genomics, entitled “A new discipline, a new name, a new journal”, stated that mapping and sequencing DNA should go “hand in hand” since both practices had the “same objective”. McKusick and Ruddle regarded mapping and sequencing genes as “the way to go” and the resulting sequence data as the “ultimate map” or the “Rosetta Stone” from which “the complexities of gene expression in development” could be discerned and the “genetic mechanisms of disease interpreted”. For the “newly developing discipline” of mapping and sequencing DNA, the co-editors “adopted the term GENOMICS” (McKusick & Ruddle, 1987, p. 1, capitals in the original; see also Kuska, 1998). In the late-1980s and especially the 1990s, Genomics established itself as a platform for the dissemination of mapping and sequencing results, along with other journals that reported on the progress of ongoing genomic research.

At this time, scientists and administrators began to consider the full mapping and sequencing of the genomes of different species. Already in the late-1970s, the tiny genomes of viruses had been sequenced, but the scale-up to even bacteria was daunting given the skills and time that the existing techniques required. From the mid-1980s onwards, however, serious proposals to map and sequence the human genome were presented and a number of national programmes began. As we show later in the book (Chap. 3), the most ambitious of these was the Human Genome Project (HGP), which started as a joint endeavour of the NIH and laboratories of the Department of Energy of the USA.

By 1990, an array of human and non-human genome projects were underway. Some, like those for the nematode worm Caenorhabditis elegans and the American side of yeast genome sequencing, were conceived as pilots for human genome sequencing, allowing methods and approaches to be tried and evaluated, then adapted and improved for the bigger task of tackling a larger genome. Others, like the European side of yeast genome sequencing (Chap. 2) and the mapping of the pig genome (Chap. 5), were driven by the research aims of particular communities of scientists working on the biology of those organisms. As we argue, it was in the specificities of the interactions between these communities and their target genomes that differences between the genome projects arose and distinct ways of practising and organising genomics were configured (for a timeline illustrating milestones in the history of genomics across these species and some select others, see Fig. 1.2).

Fig. 1.2

Timeline representing historical milestones in DNA mapping and sequencing, as well as genomic research. White arrows refer to human genomics, light grey to yeast genomics and dark grey to pig genomics. Black arrows refer to technical or infrastructural developments. Elaborated by Jarmo de Vries from information compiled by James Lowe. For a larger version of this figure that can be zoomed in and out, see https://www.pure.ed.ac.uk/ws/portalfiles/portal/290406301/Fig_1_2_zoomable_final.pdf (last accessed 29th November 2022)

Genomics came into the public spotlight with the ambitious plans to sequence the entire DNA of humans. These plans—and particularly their materialisation in the HGP—have, quite naturally, attracted considerable attention in both scholarly and non-scholarly literature. In the late 1990s, the US programme coalesced with other initiatives into a transnational effort to determine a reference sequence of the whole human genome. The label HGP was kept, but its meaning, both in the popular imagination and for the scientists and administrators involved, shifted from the national US project to a broader, multi-national endeavour (Fortun, 1999). The reference sequence was published between 2001 and 2004 by an International Human Genome Sequencing Consortium (IHGSC) formed by institutions from different countries, mainly the USA, UK, France, Germany, Japan and China (Chap. 4).Footnote 13 This was heralded as the entry of biology into the world of big science (Collins et al., 2003; Glasner, 2002; Hilgartner, 2013), a term characterising large-scale, coordinated scientific projects usually in the physical or engineering sciences, such as the World War II Manhattan Project, the Apollo space programme, or the creation and operation of CERN, the European centre for nuclear research (Barnes & Dupré, 2008, p. 43; Lenoir & Hayes, 2000).Footnote 14

A central thesis of this book is that the excessive emphasis on the determination of the human reference sequence has led the history of genomics to be presented in a somewhat narrow fashion. By focusing on genomic work concerning non-human species—namely yeast and pig—and outside the HGP framework, we aim to capture a more richly-textured trajectory in which genomics forked, diversified and permeated in different ways across many areas of the life sciences and the world beyond them. We do this, in part, by unpacking the history of certain aspects of genomics that have come to be conceived of in a teleological manner: that they were created or happened in a certain way because that is how genomics would inevitably develop. These include the multiple possible ways in which genomes can be sequenced—with the HGP representing one strategy among many—and the diverse nature and utility of the reference sequences that are available today in the RefSeq database.

Based on the idea that the human reference sequence is often conceived of in a totemic manner, we now draw analogies between an HGP-centred history of genomics and the hourglass metaphor that some scholars have used to model and interrogate the history of heredity (Barahona et al., 2010). In this hourglass representation, there are two periods featuring heterogeneous activities conducted by a wide array of actors, one before and one after a bottleneck which is narrower in both content and participation. In the case of genomics, the neck of that hourglass corresponds to the later stages of the HGP (1996–2003), an initiative that has shaped the institutional landscape and infrastructures for mapping and sequencing endeavours well beyond itself. In what follows, we look beyond that narrow neck, and past an hourglass-based view of genomics more generally. We do this by paying attention to the needs and objectives of some often overlooked communities of researchers and the interactions they have with their target genomes, of both human and non-human species.

1.2 Moving Away from a Human Genome Project-centred History of Genomics

Since its inception, genomics has been an area with a significant concentration of humanities and social science scholarship. In 1988, a programme to examine the ‘Ethical, Legal and Social Implications’ (ELSI) of genomics was announced by James Watson, co-discoverer of the double helical structure of DNA and then head of the NIH Office for Human Genome Research. ELSI was formally launched in 1990 and awarded no less than 5% of the budget that the NIH would devote to human genomics. Other programmes encompassing ‘Ethical, Legal and Social Aspects’ were also launched in the early years of genomics. The one sponsored by the European Commission began as a small element of the second Framework Programme for Research and Innovation, running from 1987 to 1991. Projects and collaborations aiming to analyse the socio-ethical dimensions of genomics were particularly strong in the USA, UK, Netherlands, Germany and Canada.Footnote 15

Sociological and ethical studies of human genomics have been particularly prominent, reflecting the societal concerns about the implications of the new technologies and the use of sequence data (e.g. see Gannett, 2019). These investigations have taken advantage of the possibility of pursuing ethnographic approaches, examining the decision-making, organisation and re-configuration of this new science as it happened (Hilgartner, 2017; Stevens, 2013). Histories have also been published, initially by people close to those involved, for example, Robert Cook-Deegan’s The Gene Wars (1994; see also Gaudillière & Rheinberger, 2004). Philosophical accounts have explored the re-interpretations of the role of genes and genetics in the development of organisms in the light of the findings of genome projects (Keller, 2000; Moss, 2003). These include aspects such as the smaller than expected number of human genes, the definition and identification of ‘functional elements’ (for example in the ENCODE—Encyclopedia of DNA Elements—project) and the so-called ‘missing heritability’ problem (e.g. Griffiths & Stotz, 2013; Guttinger & Dupré, 2016).

The existing historiography of genomics has been dominated by a particular phase of the HGP: that between the internationalisation, radical scaling-up and acceleration of the project in the mid-to-late 1990s, and the ‘completion’ of the reference sequence in the early 2000s. This was indeed the phase in which the vast majority of the data was produced. It was made especially salient by the story of a ‘race’ between the IHGSC, funded by an array of public bodies and charities, and the competing corporate effort led by Celera Genomics and its charismatic and controversial head, Craig Venter (Davies, 2001).Footnote 16

This phase was one in which sequencing capacity was concentrated to an extraordinary degree in a small number of institutions, with large and increasing numbers of sequencing machines and ever-developing pipelines to produce, assemble and assess sequence data. Pipelines are series of successive software tools and algorithms configured to refine and validate inputs from sequencing, enabling the resulting data to undergo further processing and be integrated into data infrastructures. In those pipelines, the sequences are assembled, with the parts growing smaller in number, larger in size and more connected to each other (Fig. 1.1, bottom illustration). Many smaller laboratories and centres that had been involved in the earlier stages of human genome mapping were progressively sidelined from the effort. The advent of the reference genome heralded an era that became commonly known as ‘post-genomic’, reinforcing the equation of genomics with the HGP. ‘Post-genomics’ constituted an emergence from the narrow tunnel of the human reference sequence.
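To convey the assembly step schematically, the following toy sketch greedily merges the pair of fragments with the largest suffix-prefix overlap until a single sequence remains. The reads are invented and the algorithm is deliberately naive; it illustrates fragments growing fewer in number and larger in size, not the method of any production pipeline.

```python
# Toy greedy assembly sketch with invented reads: repeatedly merge the
# two fragments that overlap most, until one assembled sequence remains.

def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for size in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:size]):
            return size
    return 0

def assemble(reads: list[str]) -> str:
    reads = reads[:]  # work on a copy
    while len(reads) > 1:
        # Pick the ordered pair of distinct reads with the best overlap.
        a, b, size = max(
            ((x, y, overlap(x, y)) for x in reads for y in reads if x != y),
            key=lambda triple: triple[2],
        )
        reads.remove(a)
        reads.remove(b)
        reads.append(a + b[size:])  # merge: fewer, longer fragments
    return reads[0]

# Invented reads covering the made-up target ATGGCGTGCAATCC.
reads = ["ATGGCGTG", "GCGTGCAA", "TGCAATCC"]
print(assemble(reads))  # prints ATGGCGTGCAATCC
```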

The canonical history of genomics—with its emphasis on the HGP—can be portrayed as an hourglass. In its upper part, there were a number of collective efforts to map the human genome and sequence those of other ‘pilot’ organisms such as yeast and the worm C. elegans. These efforts involved heterogeneous collections of institutions, some specialising in genomics, and others concerned with particular aspects of biology, such as anthropology, evolution, cell biochemistry or medical genetics. The later stages of the HGP, from 1996 to 2003, constitute the narrow neck, tapered because of the smaller number of institutions involved, the singularity of the programme’s aims, and the radical abstraction involved in capturing potential genomic variation in a single, consensus reference sequence. Then, in the lower part of the hourglass, there is an opening out to the world of post-genomics (Fig. 1.3, left).

Fig. 1.3

An illustration of the two hourglass models we describe. The hourglass on the left represents the canonical history of genomics, as centred on the Human Genome Project. The hourglass on the right depicts the history of the scientific treatment of heredity over the nineteenth and twentieth centuries. In both cases, the hourglass models portray a change over time from a variety of practices, approaches and organisational forms (the upper part of the hourglass) to a narrower development (the neck of the hourglass) and then a return to a more diverse configuration (the lower part of the hourglass). Figure elaborated by both authors. For a larger version that can be zoomed in and out, see https://www.pure.ed.ac.uk/ws/portalfiles/portal/290406890/Fig_1_3_increased_final.pdf (last accessed 29th November 2022)

This hourglass model refers to both the scope of genomics and the historical trajectory that the HGP-centred narrative conveys. According to this narrative, the pre- and post-genomic stages were wider in their range of activities and institutional variety, with the HGP resembling the hourglass neck through its focus on the production of a reference sequence at specialist genome centres. This narrative projects a winner’s history in which the HGP is an obligatory passage point through which the sand in the hourglass flows: it is both the triumphant culmination of the pre-genomics stage and the opening to the post-genomic world.

The metaphor of an hourglass has also been used to productive effect when considering the history of the scientific study of heredity. In the second half of the nineteenth century, this research deployed a broad conception of heredity. In this, the roles of environment and inter-generational processes operating at different levels were explored and used to explain observed hereditary phenomena across a range of contexts. The advent of genetics as a discipline narrowed this sense of heredity, and also restricted the range of potential causal factors investigated and appealed to from the early 1900s onwards. This funnel effect, which was strengthened with the establishment of DNA as the genetic material, is what historians identify with the neck in the hourglass representing the study of heredity (Fig. 1.3, right). Then, later in the twentieth century and into the twenty-first, the concept of heredity has once again been opened up and linked with examinations of organismal development, epigenetics, evolution and interactions with the environment, to produce new configurations such as evolutionary developmental biology. These remove the partitions between a version of heredity understood in terms of the inter-generational transmission of genetic material and other objects of biological research. We are now very much in the wider, lower part of the hourglass (Barahona et al., 2010).

While recognising the general utility of this metaphor, its proponents, in making it explicit, have specifically interrogated the potential value and limitations of the hourglass model in the historiography of heredity. Could the hourglass be a “historiographical artifact” resulting from “historical research centered on a few actors and fields, most of them located in the American and British scenarios” (Barahona et al., 2010, p. 7)? Indeed, heredity was implicated in a wide range of endeavours beyond the mainstream genetics research that has traditionally been the focus of historical (and social scientific and philosophical) inquiry: medicine, agriculture, anthropology, genealogy, natural history and taxonomy, physiology, embryology and evolution. However, a cautious and critical use of the hourglass model has enabled the historians who proposed it to advance knowledge of these endeavours without neglecting the role and influence of the narrow neck representing genetics research.Footnote 17

It is in this heuristic way that we intend to approach the hourglass model in the history of genomics. As we show later in the book, the effects of the HGP on the history of genomics are visible and self-evident. Key current institutions and infrastructures, such as RefSeq, were the products of its momentous impetus. The infrastructures, processes and materials produced through the HGP also shaped contemporary and subsequent genome initiatives, such as the sequencing of the yeast and pig genomes, respectively. In the USA, the NIH made the yeast initiative part of its national human genome programme: it was a pilot project through which technologies were developed and tested during the early-to-mid 1990s, thus preceding the intensive sequencing phase of H. sapiens (Chap. 2). Later on, in 2003, the Swine Genome Sequencing Consortium was formed. It made use of the infrastructures and processes developed at the Sanger Institute, a leading member of the IHGSC (Chap. 5). It was leading members of the IHGSC who advocated for the subsequent transition to a ‘post-genomic’ era. When depicting this transition, its advocates often implicitly deployed an hourglass metaphor, with the HGP featuring in the narrow neck (Fig. 1.4).

Yet, however influential, the organisational model of the HGP, with its emphasis on concentration and maximised rates of production, was just one among other forms of genomics that historically emerged throughout the 1980s and 1990s: we argue that it was an unusual and rather exceptional one (Chap. 3). The other configurations demonstrate that the history of genomics is more complex and richly textured than the master narrative of the HGP and its representation in hourglass form may suggest. In order to appreciate this multifaceted history and its multiple genealogies, we need to look beyond the HGP and examine genome projects in human and non-human species that occurred before, during and after it. Another crucial way of moving beyond the restrictions of the hourglass model is placing the communities that produced the genomes—rather than the sequence end products—at the centre of our history.

Fig. 1.4

A depiction of events preceding and succeeding the Human Genome Project (HGP) that illustrated an article co-authored by Eric Green and Marc Guyer in 2011. Green and Guyer were key scientific and administrative figures during the development of the HGP. After its conclusion, they were appointed director and deputy director, respectively, of the National Human Genome Research Institute of the NIH and tasked with planning what was by then called the ‘post-genomic’ era. In the illustration, the HGP is portrayed as a bulb powered by prior scientific achievements and illuminating subsequent milestones. The structure of events resulting from this past ‘powering’ the future places the HGP in a position that is analogous to the pinch-point of an hourglass. Reprinted by permission from Springer Nature Customer Service Centre GmbH: Springer Nature, Nature (https://www.nature.com/), Charting a course for genomic medicine from base pairs to bedside, Green and Guyer, 2011: Figure 1 on p. 205. A high-resolution version of this image is available in the open-access version of the article, which can be found online at: https://www.nature.com/articles/nature09764 (last accessed 29th November 2022). We thank Catherine Heeney for drawing our attention to the image

1.3 ‘Thick Sequencing’, Communities and ‘Genomicists’

This book and the long-term historical narrative it encompasses enable us to probe, expand and develop a number of conceptual tools. While we use some of them for the first time here, we originally proposed others elsewhere. Among the latter, we extend our distinction between thin and thick sequencing from its original context of making sense of pig genomics (Lowe, 2018) to the history of genomics more generally. Thin sequencing is the compilation of the string of DNA nucleotides in order, while thick sequencing comprises all the processes, materials and organisational configurations that make the products of genomics—including the ‘thin’ sequence, but not limited to it—usable by a variety of potential actors. Thin sequencing is a feature of the narrowest point of the neck of the hourglass: it is the determination of the order of bases, whether manually or in a more automated way. This is not necessarily a simple task, as it requires the interpretation of recorded signals that are not always unequivocal. To understand the nature of genomics, however, and how its resulting outcomes can be taken up by different users in distinct ways, examining this part of the process alone is insufficient.

Capturing the thickness of sequencing means examining the obtaining and selection of DNA, its storage in DNA libraries, its mapping, the choice to sequence DNA fragments (clones) in YAC, BAC or other types of library, the extent of the coverage of the genome, and the selection of particular areas for more or less rigorous sequencing.Footnote 18 The sequences so generated then need to be assembled and annotated. All of these steps require decisions about what is to be abstracted from the variation that the different individual genomes exhibit in nature and what variation is to be represented in the final result. There are more stable aspects of this process, such as common pieces of software, sequencing and informatics pipelines, quality and validation standards, but the products also depend on the decisions and choices made in the whole thick sequencing process (Lowe, 2018). It is the thickening of our historical approach to sequencing—by focusing on practices such as library construction, mapping and annotation—that enables us to probe the hourglass representation and examine processes, trajectories and lineages beyond the narrow (thin) neck.

Through a thick sequencing framework, the differences between sequencing endeavours across species—and how these differences affect the outcomes of genomics research, including reference genomes—become more manifest. One of the ways in which we capture these differences is by exploring the participation—or lack thereof—of particular communities of scientists in the production of reference genomes. These communities can be identified by coalescence around a particular object, such as a species, and/or a biological unit of it such as a cell. Additionally, or alternatively, they can be oriented around one or several biological processes such as heredity in the case of genetics, evolution or particular molecular mechanisms. These alliances are usually cemented and reinforced by common disciplinary membership and training, and participation in modes of scholarly communication and interaction such as a particular set of journals and conferences. These communities typically share “epistemic cultures” (Knorr-Cetina, 1999), and collaborative relations will be denser among members of a given community than between members of different communities.

There is no hard-and-fast rule for drawing the boundaries of particular communities, and weaker supra-communities or more specific sub-communities can also be identified. The notion of a community has long interested historians of science and scholars working in Science and Technology Studies (e.g. Shapin & Thackray, 1974). From the early days of both fields, a considerable amount of literature has explored the factors that lead scientists to group into communities and the dynamics of those groupings, from growth to stability, amalgamation, fragmentation or disappearance. Various mechanisms that glue communities together have been highlighted, among them common styles of thought or ways of knowing (Harwood, 1993; Pickstone, 2000), shared moral economies or working worlds (Agar, 2020; Kohler, 1994; Strasser, 2011) and particularly intense collaborative relationships (Vermeulen et al., 2013).

When we deploy the notion of community in this book, we refer to particular sets of individuals, laboratories and associated research practices converging around the description of a genome. Many of these consciously self-identify with communities, acting in concert to launch programmes and initiatives, and sharing specific conferences and venues of publication. Yet these communities are not homogeneous, and they may not exhibit the same characteristics or level of resolution. For instance, the community of yeast researchers we discuss (Chap. 2) is more heterogeneous than the medical geneticists we also survey (Chaps. 3 and 4). The pig genome community that we introduce (Chap. 5) is and was much smaller than both of these, but is in many respects broader, featuring different kinds of disciplinary backgrounds and researchers who have worked on other species in addition to the pig. But, as we show, it was no less coherent a community for all that, and it acted as a community in shaping the genomics of its chosen species in a decisive and consequential manner. Genomics, and the object of a genome, can only be understood in relation to the particular communities that it shapes and is shaped by, and to the wider social and technical configurations that it also affects.Footnote 19

Our notion of communities builds on scholarship that considers the genome a rhetorical and practical space, as much as a material object (Szymanski et al., 2019). In this space, pre-existing scientific groupings can converge or fragment. Those, like the yeast biologists, who are more successful in defining and shaping the genome in their own terms, are in turn further unified by their orientation around the object of the genome. Human and medical geneticists, by contrast, formed a genome community that differed from the one assembled by the participants in the HGP.Footnote 20 This rhetorical and pragmatic definition enabled us elsewhere to highlight different characteristics of genomics research depending on the communities involved with a given genome: a strict separation of producers and users in the case of the human reference genome (García-Sancho, Leng, et al., 2022), different degrees of proximity and distance between yeast sequencing and particular research goals (García-Sancho, Lowe et al., 2022) and processes of bricolage or reuse of tools and resources that were deployed in the generation of the pig reference genome (Lowe, Leng, et al., 2022).

One conclusion arising from this community framework is that genomics can be regarded as a set of tools that enable groups of scientists to do different things and achieve different objectives with their target genomes (Lowe, García-Sancho, et al., 2022). Throughout the remaining seven chapters of this book, we propose the notion of yeast, human and pig genomicists as (often collective) subjects that make the history of genomics. In this process of construction, the genomicists mould their target genomes according to their needs. They thus shape what these genomes represent and what they can do with them, sometimes quite consciously and deliberately.

This focus on communities of genomicists allows us to discern greater diversity and complexity in the history of genomics. In what follows, we show that yeast, human and pig genomicists have exhibited different mechanisms of inclusion and exclusion of particular sets of scientists and institutions. These have shaped each community differently and changed their compositions—and sometimes their roles—over time. The genomicists working on S. cerevisiae were relatively stable before, during and after the production of their reference genome, while in H. sapiens the leading genomicists of the early days were replaced by a different community based at specialist genome centres. For S. scrofa, the range of genomicists expanded, due to the convergence of a longstanding community of pig geneticists with practitioners from one of these specialist genome centres. These different trajectories further show that the history of genomics cannot be reduced to a single framework or periodisation.

Previous historiography has narrowly focused on a few, homogeneous genomicists: the participants in the HGP, recipients of the grants to determine the human reference genome and heads of the new institutions of genomics research, the genome sequencing centres. By looking at other, less visible genomicists—those working on non-human organisms and beyond the HGP framework—we emphasise their agency as historical subjects and their capacity to pursue their own goals rather than following a teleological, pre-defined pathway. It is in the specificity of those goals, and in the agency exercised in pursuing them, that the interactions between genomes and their communities occur, and there that we identify trajectories and lineages diverging from the canonical history of genomics. In other words, when a heterogeneous and inclusive array of genomicists is considered, genomics becomes something other than a static, retrospectively constructed field: it becomes a science (and history) in the making.

1.4 Outline of Chapters and Structure of Our Argument

The book is divided into three parts, comprising two chapters each. Taken collectively, these three parts de-centre the historiography of genomics: from a focus on H. sapiens; from an emphasis on the HGP; and, finally, from excessive attention to the determination of DNA sequences themselves (what we defined above as ‘thin sequencing’).Footnote 21 We achieve this by exploring genomic endeavours around yeast, human and pig—including their reference genome projects—that started in the mid-1980s and concluded towards the late-2010s.Footnote 22 The sources that have enabled us to reconstruct these endeavours are oral histories, published literature—including scientific, administrative and policy reports—and archival materials. For the oral histories, we approached individuals ranging from Nobel Prize-winning scientists to administrators, lower-profile researchers and those devising and running the infrastructures of genomics. Our archival sources include catalogued and uncatalogued collections, as well as grey literature (see Appendix A and Appendix B at the end of the book for a complete list). We have also found extant and archived web pages to be useful in reconstructing parts of the history of genomics that had a lower public profile and lack an extensive secondary literature concerning them.

Part I of the book addresses what we call the distributed model of genomics. It starts with an account of the determination of the reference sequence of yeast: a non-human genome project that ended in 1996, just before the scaling-up of the HGP. The yeast effort enables us to show a greater variety of institutions and ways of organising mapping and sequencing practices than those behind the production of the human reference genome. Chapter 2 documents how institutional and organisational diversity was especially manifest in the European Commission-funded Yeast Genome Sequencing Project, which, unlike the NIH S. cerevisiae genome programme, was not intended to serve as a pilot for the HGP.

Similarly, a focus on the collective and systematic mapping work that preceded the large-scale sequencing characteristic of the latter stages of the HGP reveals a variety of heterogeneous human genome programmes. As we argue in Chap. 3, the HGP was but one among those many programmes: its focus on the rapid, industrial production of a reference sequence of the whole human genome was a singular characteristic that distinguished it from the others. The other, non-HGP programmes were more collective and more inclusive of existing communities of medical geneticists. In order to accelerate the production of the reference sequence, the IHGSC that conducted the later stages of the HGP sidelined a large proportion of human and medical genetics institutions from its operation, starting in 1996. Yet these human and medical genetics communities continued their genome efforts, thus forming trajectories that the canonical winner’s history of genomics overlooks.

Part II compares the production of the human reference genome with those of other species, especially the pig S. scrofa. Chapter 4 introduces a leading participant in the production of the human reference sequence: the Sanger Institute. Chapter 5 shows how this institution also played a major role in the subsequent sequencing of the pig genome, which started in 2006, three years after the HGP was deemed concluded. At first glance, the pig genome effort thus seems to be strongly modelled on the HGP. Yet the broader history of pig genomics allows us to qualify that impression. If we take into account the early pig genome mapping work, which started in the 1990s, at the same time as the HGP, we see that the scientific communities working on the agricultural genetics and immunogenetics of S. scrofa were intensely involved then and, unlike human and medical geneticists, continued to be. Indeed, institutions working on the genetics of pig immune response and on traits relevant to selective breeding were important drivers of, and participants in, the Swine Genome Sequencing Consortium that organised, managed and coordinated the reference genome work.

Taken together, Chaps. 4 and 5 continue the de-centring exercise that we started in Part I. In this case, the de-centring is due not only to our consideration of non-human species (pigs, as well as yeast) but also to our attention to longer-term trajectories, encompassing genome mapping as well as sequencing. We look at the sources of the DNA libraries from which the reference sequences were obtained and show that, in both cases, they were derived from a narrow pool of a few humans and pigs. Yet in the case of S. scrofa, the engagement of the early mapping communities in the sequencing operation eased the connection of the resulting reference genome with broader immunogenetic goals and with the development of data and tools to aid the improvement of agriculturally-relevant breeds. These were the problems that motivated the mapping activity of pig genomicists before their involvement in whole-genome sequencing.

Part III comprises Chaps. 6 and 7. In it, we address a number of features that have commonly been attributed to post-genomics, such as the connection of genomic data to other forms of biological data and an attention to variation and diversity. We examine the annotation of reference genomes and other functional and systematic studies of sequence data. By the former, we mean the elucidation of the effects of particular genes and other genetic elements on the organism. By the latter, we mean the determination of patterns of variation within a given species or between species to inform, among other endeavours, evolutionary biology. We argue that our ‘thick sequencing’ approach—addressing the long-term processes by which DNA data become reference genomes—enables us to show that these practices have been deeply entangled throughout the whole history of genomics, rather than necessarily following the completion of the HGP or any other reference sequence project.

Furthermore, in the case of the pig, the close involvement of the communities of immunogeneticists and agriculturally-oriented geneticists from the early days of genome mapping transformed annotation practices at the Sanger Institute into more collective and distributed endeavours. This paved the way for collaboration between two different communities of genomicists: one centred around the Sanger Institute, and the other derived from the wider pig genetics community involved in mapping practices.

In our concluding Chap. 8, we explore the implications of our study beyond the realms of the history, philosophy and sociology of science. One of the preoccupations of science policymakers and funders in the wake of the HGP has been the notion of a ‘translational gap’ between the availability of masses of genome data and their exploitation, for example in effective new treatments or diagnostic tests in the clinic: ‘from bench to bedside’, as the slogan goes. We argue that this translational gap is an artefact of the particular configuration and history of the HGP: its model of concentrated production and the rigid division it implied between the producers of the reference sequence and the communities that would later use it in biomedical and clinical research. Other genomic endeavours that deployed more inclusive strategies exhibit greater immediacy and a closer connection between the compilation of data and their mobilisation towards particular goals. Our historical investigation thus illuminates ways of reducing the temporal, cognitive and conceptual distance between genomic data and user communities.

Dissatisfaction with reference genomes has given rise to new initiatives to represent genomic variation and to connect genomes to other forms of biological data and processes. As we show throughout, these qualms arise from attempts to attribute particular functions to reference genomes and to make them carry a weight they were not designed or conceived for. Our book highlights that many of these problems stem from the contingent and historically-driven processes of reference genome construction. Without a historical reconstruction, these processes and their consequences for the resulting reference genomes are flattened and rendered invisible.