Abstract
In this chapter, we show that many of the features of ‘post-genomics’ were present in pre-reference genome research and in the reference genomics of yeast and pig. Due to the problems we identify with the notion of ‘post-genomics’, we instead propose the term ‘post-reference genomics’, which encompasses all the forms of genomic-related research opened up by the existence of a reference sequence. To identify what is distinct about post-reference genomics, we detail the relationship between two modes of research: functional and systematic. We observe how the evolving relationship between these two modes of research differs across species, and attribute this to distinct relationships between scientific communities and the pre-reference genomics or reference genomics work they were involved in. We close by considering the role of reference genomes and other genomic resources in seeding ‘webs of reference’ that enable researchers and other practitioners to explore the possible variation exhibited by a given species.
Throughout this book, we have mapped the participation of different scientific communities in genomic endeavours across three species—yeast, human and pig—and the distinct processes, epistemic goals and domains of application that informed the creation of annotated reference genomes. In this chapter, we examine how the existence of reference genomes enabled the creation of increasing amounts of additional genomic data, as well as other kinds of biological data. This involved the generation of new reference resources intended to represent forms of variation within a species that were either missing or insufficiently incorporated into its reference genome. Such reference resources could include new maps or sequences, novel ways to relate or align freshly identified variation to the reference genome, or tools to capture and document variants.
We examine two main currents of post-reference genome data production, collection and analysis: functional and systematic studies. By functional analysis, we mean the investigation of the effects of variation in genes and genomes, in terms of alterations to biological processes and therefore differences in phenotypes: the measurable presentation of traits in organisms.Footnote 1 By systematics, we mean the exploration of the patterns and specific details of genomic variation both within a species and between it and related species. We conclude by considering the implications of the increasing interrelationship of the functional and systematic modes in research pertaining to each of the three species.
Both before a reference genome exists and after its creation, communities tend to focus their interest on intra- and inter-species variability. Constructing a reference genome involves abstracting away—to a greater or lesser extent—variation to create a single canonical reference standard. Although the nature of reference genomes varies across species, this abstraction is often raised as a source of concern by the many different communities that may use them but who were not involved in their construction. This concern, we contend, arises from a tension between the presumed representativeness of reference genomes and their role as standards, as stipulated bases of reference and comparison.Footnote 2 This tension may seem to place—and in some cases, conflate—conflicting demands on reference genomes. Yet, the two demands are linked. Reference genomes, through their role as standards, enable researchers to gain a greater appreciation of the range of biologically-meaningful variation present across a species, and so help seed critiques of their representativeness. Furthermore, through their very role as a scaffold on which other representations of variation can be constructed and connected, reference genomes have enabled the development of data, tools and representations to facilitate more bespoke functional and systematic explorations of the biology of the species. Such developments have also encouraged visions of the refinement or even replacement of the reference genome as a central object in genomic and allied research.
In this chapter, we once again consider genomics research on all three of the species we have concentrated on in this book, in the context of understanding the nature of genomics after the reference genome. This is often referred to as ‘postgenomics’ (Richardson & Stevens, 2015). In Sect. 7.1, we contend that this label, and some of the meanings that have been attributed to it, reflects and reinforces a misleading picture of the history of genomics that has arisen through a disproportionate focus on the elucidation of the human reference genome. In this view, a concern with relating multiple kinds of genomic and non-genomic data (what we refer to as multi-dimensionality) and the biological contextualisation of that data is postgenomic. As we have seen in the preceding chapters, however, these facets were evident in pre-reference genome research and even featured alongside the generation of the reference genome for yeast and especially the pig. Even within Homo sapiens, the compilation of genomic data went hand-in-hand with—rather than preceded—an aspiration to capture variation and connect this with other biological and medical problems outside the concerted effort of the International Human Genome Sequencing Consortium (IHGSC).
To begin to make that case, in Sect. 7.2 we consider the extent to which reference genomics is an ongoing project, both for species that already have a reference genome, and for species still lacking one. This continued reference genomics does not just constitute a tidying up exercise or involve incremental improvement. To accept that would presume that there is some final standard of completion for reference genomes. Further, it would imply that reference genomes are something to be discovered, rather than constituting creative products. As we have shown in previous chapters, reference genomes are abstractions from the variation found in nature, in which decisions made about data infrastructures, mapping, library construction and use, sequencing method, assembly and annotation are pertinent to shaping the final product. Who was involved in these processes, and when, are therefore matters of deep significance.
In Sect. 7.3, we develop our argument by examining two modes of genomic research—functional and systematic studies—and the relationships between them, by first inspecting an example of pre-reference genome work: pig genetic diversity projects that ran from the mid-1990s to the mid-2000s. These were efforts that explicitly aimed at apprehending the diverse genetic resources that might be tapped for breeding programmes. They also neatly aligned with interests in the domestication, evolution, phylogeny and natural history of pigs that were held by many researchers who primarily worked on pig genetics for agricultural purposes. They therefore instantiate an early entanglement between functional and systematic work and show how some of the supposedly ‘postgenomic’ concerns with variation and multi-dimensionality operated before the creation of reference genomes.
In Sect. 7.4, we compare the manifestation of these functional and systematic modes of genomic research across yeast, human and pig. We show that genomics research across these three species exhibits different forms of entanglement (or lack thereof, at times) between systematic and functional approaches. This crucially affects the ways in which reference genomes are used and how new reference resources are developed and connected to each other. We document this through particular examples of work conducted after the release of the reference genomes of each species:
- Yeast. EUROFAN, Génolevures and allied projects that followed the completion of the reference genome of Saccharomyces cerevisiae. EUROFAN sought to systematically produce mutants for particular genes and combinations of genes in S. cerevisiae, and so generate mutant stock collections as a standard reference intended for wider circulation. Génolevures was a network that sequenced and comparatively analysed the genomes of multiple yeast species, and in so doing explored their evolutionary dynamics and generated the comparative means for further developing functional analyses.
- Human. ENCODE, the post-reference sequence project aiming to catalogue the functional elements of the human genome, and GENCODE as a sub-project of this. We also examine attempts to map and make sense of human genomic diversity, the establishment of reference sequences for particular populations, as well as the creation of ClinVar: a database of genomic variants associated with clinical interpretations of their possible implication in disease.
- Pig. The Functional Annotation of Animal Genomes network (FAANG), which has grouped the pig genomics community with genomicists working on other farm animals. We also examine research on pig genomic diversity across breeds, particularly that related to tracking and understanding patterns of evolution, domestication and dispersal. Finally, we recount the creation of a SNP chip or microarray to test for the presence or absence of particular Single Nucleotide Polymorphisms (SNPs). The advent of this chip enabled further functional and systematic studies, as well as the development of novel resources and the inferential means by which researchers could connect new and existing resources. This eased the connection of genomic resources to particular modes of research and domains of application or ‘working worlds’ (Agar, 2020).
In yeast, we see first a pursuit of functional analysis to build on and enrich the reference genome as a resource, followed by systematic studies. Leading yeast genomicists, pursuing their own lines of molecular biological and biochemical research, increasingly realised the synergies between these two modes. The relationship between functional and systematic research in yeast reflected the ‘do-it-yourself’ approach of the yeast biologists who made up the genomics community, and the nature of yeast as a model organism.
In human genomics, we see continuity from the way that the reference genome effort was organised, with grand concerted efforts led by many of the same institutions encompassing the IHGSC. We focus on ENCODE and GENCODE and compare these with some contemporary systematic studies that examined human genomic diversity, such as the production of reference sequences for particular populations. We conclude the discussion of human post-reference genomics by looking at a relatively new initiative, ClinVar, which aims to connect the infrastructures, norms and practices of large-scale genomics with those of the medical geneticists who were peripheral to the IHGSC, and who had instead developed their own separate and parallel data infrastructures.
For the pig, however, after the discussion of the functional and systematic motivations and consequences of pig genetic diversity research in Sect. 7.3, it will not come as a surprise that the post-reference genome distinction between functional and systematic modes has been far fuzzier than for yeast and human; there have been multiple crossovers between the modes and an early appreciation of their synergies by the community. After examining how the pig genomics community immediately set about functionally and systematically exploiting the reference genome they had helped to create, we examine their collaboration with the sequencing technology company Illumina to produce a SNP chip. The SNP chip is an excellent illustration of the significance of the involvement of a particular community in the creation of genomic resources, especially in shaping the generation of new reference resources. This, however, introduces constraints into these resources as much as it engenders capabilities or affordances.
We conclude (in Sect. 7.5) by observing that the coming together of functional and systematic modes of genomic research instantiates a particular stage in the development of what we term a web of reference. Over time, webs of reference feature ever-denser webs of connectedness between distinct representations of the variation of, for example, a particular species. Such representations include reference sequences (e.g. of the species or sub-species populations), genome maps and resources such as SNP chips. Connections between such representations are progressively forged by data linkages and the process of identifying and validating inferential and comparative relationships between them. This is enabled by the creation of reference resources that seed the web, with new nodes representing new forms of data scaffolded on and linking to existing ones. These webs of reference are especially dense within particular species, but they can—and indeed have and often must—be connected to genomic reference resources beyond them. The way in which these webs develop depends on prior genomic research and reference resource creation (including that for other species) and the involvement of specific communities of genomicists in those efforts.
1 Postgenomics or Post-Reference Genomics?
This chapter explores the surplus of data concerning genomic variation that has been generated in the wake of the elucidation and publication of reference genomes. We refer to this as ‘post-reference genomics’, to indicate the differences between our treatment of this work and what is usually connoted by the term ‘postgenomics’. ‘Postgenomics’ has often implicitly referred to genome-related research that followed the determination of the human reference sequence. It therefore ignores the ongoing ‘reference genomics’ of the human—beyond the conclusion of the IHGSC endeavour—as well as that being conducted for other species. Many species, of course, still do not have a reference genome, while others—as we have shown throughout the book—had their reference sequences produced in a substantially different manner to the human one.
There has been some debate on what postgenomics means, beyond the chronology of simply following the initial publication of the human reference genome. Some scholars have suggested that it was conceived by IHGSC scientists to market their post-reference sequence research agenda, with parallels drawn between this agenda and the contemporary rise of the notion of translation: an imperative to transform research results and data into medical outcomes (Stevens & Richardson, 2015). While obtaining further grant funding may have been a significant driver of the framing of postgenomics as something distinct and new, other accounts have sought to characterise postgenomics as a more substantial endeavour. A common theme in this latter school of thought is that postgenomics constitutes research that aims, and aimed, to relate other forms of biological data to DNA sequence data—to integrate additional dimensions—and thereby begin to properly capture the complexity of biological processes. Here, the technologies, methods, infrastructures and data generated through genomics have been used as a platform for further biological research. In this version, postgenomics comprises a recognition of complexity and a non-deterministic, non-reductionist, interactionist and holistic vision of the organism.Footnote 3
This perspective on organismal complexity was first outlined at a 1998 conference held at the Max Planck Institute for the History of Science. Since, by any definition, this conference was held before ‘postgenomics’ had come into being in any form, postgenomics was envisaged there in conjectural and promissory ways. At the conference, the biologist Richard Strohman outlined four phases of genomics: the first two being “monogenetic and polygenetic determinism”, then “a shift in emphasis from DNA to proteins” and then “functional genomics”. Following this was a burgeoning fifth stage, presumably postgenomics though not labelled as such, “concerned with non-linear, adaptive, properties of complex dynamic systems”. This was an early statement of the idea that genomics pertains to the linear and deterministic while postgenomics opens out to nonlinear and nondeterministic facets of biology, but it differed from some of the accounts of later scholars by including aspects of this extra- and multi-dimensionality in genomics itself, rather than treating them as characteristic of postgenomics alone (Thieffry & Sarkar, 1999, p. 226).
Adrian Mackenzie has also evaluated genomics in terms of dimensionality. He identifies the period roughly between 1990 and 2015 as “the ‘primitive accumulation’ phase of genomics” that “has yielded not only a highly accessible stock of sequence data but sequences that can be mapped onto, annotated, tagged, and generally augmented by many other forms of data”. The single dimensionality of sequence data produced in this phase of genomics is something to be augmented with—and related to—other forms of data; it is the challenge of dealing with dimensionality that characterises “post-HGP [Human Genome Project]” biology (Mackenzie, 2015, pp. 79 and 91). In this conception, postgenomics is defined in terms of both the use of existing sequence data and associated infrastructures, and the establishment of connections between genomic data and other forms of ‘omic’ data (Stevens, 2015). Here, postgenomics involves the results and modes of research of genomics being brought together with other types of biological traditions and outputs, a process that is characterised by the advent of new forms of labour, for example the figure of the curator (Ankeny & Leonelli, 2015).
The designation of something called postgenomics as an endeavour to contextualise sequence data indicates that genomics became conceptualised in terms of sequence production, rather than involving both sequence production and use, and featuring a range of different ways in which production and use were related and combined. This production-centred interpretation tallies with an approach to genomics that foregrounded an increase in the efficiency and speed of data production, with the pressure of this drive helping to manifest and reify a strict division between producers (submitters) and users (downloaders). However, as we have shown elsewhere (García-Sancho & Lowe, 2022) and further illustrate throughout the book, a sharp division only existed within the IHGSC effort; other approaches to genomics featured different configurations and entanglements between sequence production and use. Additionally, contextualisation of sequence data has been pursued both before and during the production of a reference genome, as much as afterwards. We can, therefore, conclude that contextualisation is not a defining attribute of postgenomics: rather, following the advent of a reference genome, existing forms of contextualisation are altered and new ones are established.
What do multi-dimensionality, augmentation, integration and contextualisation mean in the post-reference genome world? They relate to ideas of completeness, comprehensiveness and the capturing of a whole or a totality. In that 1998 conference previously mentioned, biologist and scientific administrator Ernst-Ludwig Winnacker, the founder of major yeast and human sequencing centre Genzentrum (Chap. 2), said that postgenomics should be about “an understanding of the whole” (Thieffry & Sarkar, 1999, p. 223).
Historian Hallam Stevens has articulated how genomics itself seeks wholeness and comprehensiveness. Drawing on his detailed study of the Broad Institute, and mostly informed by human genomics, he presents genomics as a special form of data-driven bioscience. The nature of data in genomics makes it amenable to the adoption and development of bioinformatics and information technology-based approaches more generally.Footnote 4 Stevens’ interpretation of genomics is that the investigation of the particular is replaced by a sensibility that aims to characterise the totality. Totality and generality are key. For example, he points to the “Added value” generated by having completely sequenced genomes (Stevens, 2013, p. 161, quoting Bork et al., 1998).
Stevens stresses the dialectic of sequence data production and the development and incorporation of informatics infrastructures and approaches. Successive different structures of databases are indicative of shifts from pre-genomics to genomics to postgenomics. Genomics is about producing databases; the reference genome and a particular way of storing and presenting data—relational databases—are mutually constitutive. Distinctions between pre-genomics, genomics and postgenomics are therefore made on the basis of the structure of databases and the place and role of DNA sequence data within them. When researchers increasingly wanted to relate DNA sequence data to other forms of data (e.g. various omics data), this necessitated a shift from one kind of database structure to another. The relational databases that were able to capture well the single dimension of DNA sequence data catalogued in strings of As, Ts, Cs and Gs, therefore gave way to more complex networked databases (Stevens, 2013).
These interpretations of genomics are usually based on specific institutions and infrastructures, often in the orbit of the IHGSC. In such expositions, the effort of producing a reference genome is detached from prior genomic research, parallel genomic research (for example, by medical geneticists), and work following it. Accompanying this separation is the projection of distinct and exclusive attributes to pre-genomic, genomic and postgenomic research.
As an alternative, we propose the designations of pre-reference genomics, reference genomics and post-reference genomics. This periodisation scheme is based on the availability (or otherwise) of an object—the reference genome—and the relationship of particular communities to it. It does not presume that each stage will exhibit specific essential characteristics. Our approach emphasises the historicity and specificity of reference genomes and helps us to discern a more fluid interconnectedness between stages. In the rest of the chapter, we illustrate this by comparing post-reference genome research on yeast, human and pig.
2 Improving Genomes
Reference genomes are not static: they are amended over time, with updated versions evaluated and validated using metrics that enable direct comparisons to be drawn between the new and the old. Even when a reference genome is considered to be ‘complete’—as the human reference genome was famously deemed in 2004—it still subsequently undergoes revisions that are intended to improve it according to existing and novel benchmarks. In what follows, we examine revisions of the human, yeast and pig reference genomes and how metrics and judgements of quality changed according to evolving and distinct objectives for the three species.
We have seen that Celera Genomics saw their full human sequence as provisional and in need of constant improvement and enrichment. This was so that their corporate effort would be seen to offer sufficiently greater value than the publicly-available data to justify paying a subscription to access it. Indeed, as we indicated in Chap. 6, Celera kept developing its whole-genome sequence: new additions that were incorporated after the initial public release in 2001 were only accessible with a paid subscription.
The working draft of the IHGSC sequence (release name: hg3) was published on the University of California Santa Cruz’s (UCSC) website on 7th July 2000. At this stage, though, it was just the sequence data that could be downloaded, with a UCSC browser to visualise it still being in the works. This version had significant gaps and ambiguous positioning of sequenced fragments. The major draft published in February 2001 could also be downloaded from the UCSC website. In the Nature paper accompanying its release, it was estimated that the draft encompassed 96% of the euchromatic regions, the parts of DNA open to transcription.Footnote 5 Much as with the addition of the pig Y chromosome sequence to the new Sus scrofa reference genome assembly in 2017 (Chap. 6), future reference assemblies of the human genome would incorporate data from several sources.
The quality of the human and other reference genomes has been assessed in a number of ways: in terms of coverage, contiguity and accuracy.
Coverage is a metric we have already encountered; it is a function of the depth of sequencing: roughly, how many times each nucleotide position in the genome is represented in the determined ‘reads’, on average. It is expressed in terms of number-X (e.g. 10X), with the number designating the average number of reads covering each position across the genome. However, there may be heterogeneity in the coverage of different regions of the genome. This outcome can be inadvertent, due to the clones captured in library production not evenly representing all areas of the genome, or arise from the exigencies of assembling regions with different genomic properties. Or it may be deliberate, due to the kind of targeting we saw in swine genome sequencing at the Sanger Institute.
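The arithmetic behind the coverage metric can be sketched briefly. The function name and the read and genome figures below are hypothetical illustrations, not drawn from any of the sequencing projects discussed in this book:

```python
# A minimal sketch of average coverage: total sequenced bases divided by
# genome length. The numbers are invented for illustration.

def average_coverage(read_lengths, genome_length):
    """Average number of reads covering each position in the genome."""
    return sum(read_lengths) / genome_length

# Hypothetical example: three million reads of 500 bases each,
# sequenced from a 150-megabase genome.
reads = [500] * 3_000_000
cov = average_coverage(reads, 150_000_000)
print(f"{cov:.0f}X")  # prints "10X": each position is read ~10 times on average
```

As the text notes, this is only an average: particular regions may be covered far more or less densely than the genome-wide figure suggests.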
Contiguity is the extent to which the building blocks of an assembly, such as contigs or scaffolds, are connected together. A contig is a continuous sequence in which the statistical confidence level in the order of the nucleotides exceeds a stipulated threshold, while a scaffold is a section of sequence that incorporates more than one contig, together with gaps of unknown sequence. The measured level of contiguity affects the classification of the level of a sequence assembly in the GenBank database. The designation of being a “complete genome” requires that all chromosomes should have been sequenced without gaps within them. Then there is a “chromosome” level of assembly: to qualify for this level, a sequence must encompass at least one chromosome, ideally with a complete contiguous sequence; if gaps remain, there need to be multiple scaffolds assigned to different locations across the chromosome. The other two levels are “scaffold” and “contig”, pertaining to the definitions of those objects.Footnote 6 Note that these are ways of assessing genome assemblies. They do not necessarily determine whether an assembly is designated as a ‘reference genome’ or the lesser category of ‘representative genome’ by the RefSeq database (Chap. 1, note 3), both of which are incorporated in the notion of reference genome we deploy across this book. As with improvements to mapping procedures or the evaluation of new genome libraries (Chap. 5), completeness can also be ascertained by searching for known genes or markers in the assembly, and enumerating those found and not found.
As well as these designations, there are metrics that are used to assess the contiguity of assemblies in a more fine-grained way. The most significant are the enumeration of the gaps (and the different kinds of gaps) and the estimated sequence length they represent, and also the calculation of N50 and L50 figures. The L50 figure is the smallest number of contigs whose total sequence lengths add up to at least 50% of the total length of the assembly. The N50 figure is the length of the shortest contig that constitutes part of the smallest set of contigs that together add up to at least 50% of the total length of the assembly. The L50 figure will therefore be expressed as a simple integer, while the N50 figure will be expressed in terms of numbers of nucleotides. These figures pertain to the length of the assemblies, rather than the presumed length of the actual chromosomes or whole genomes that are being assessed. For assemblies of the same length, the quality is presumed to be higher if the N50 figure is larger and the L50 figure smaller. The original draft human reference sequence, published in 2001, contained N50 figures for individual chromosomes and the genome as a whole. Gaps were counted across the assembly. These metrics enabled areas for improvement to be identified and analysed, but also provided a benchmark against which further improvements could be assessed.
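The N50 and L50 definitions above can be made concrete with a short sketch; the contig lengths and function name here are invented for illustration:

```python
# A minimal sketch of the N50/L50 calculation as defined in the text:
# sort contigs longest-first and accumulate their lengths until at least
# half the total assembly length is reached. L50 is how many contigs that
# took (an integer); N50 is the length of the last, shortest contig added
# (in nucleotides).

def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            return length, count

# Hypothetical assembly of seven contigs, 300 nucleotides in total.
contigs = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = n50_l50(contigs)
print(n50, l50)  # prints "70 2": the two largest contigs (80 + 70) reach half of 300
```

Note how, for a fixed assembly length, fewer and longer contigs push N50 up and L50 down, which is why the text treats a larger N50 and a smaller L50 as indicating a higher-quality assembly.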
Finally, there are measures of accuracy, which is the extent to which an assembly—and the parts thereof—is ‘correct’. This can relate to different aspects, such as the order and orientation of sequenced clones in the assembly, or pertain to the ‘base calls’—the assignment of the identity of individual nucleotides in each position in a DNA molecule—at the sequence level. This is, of course, trickier to execute than the other measures of quality, as it requires not just the measurement of the properties of the assembly and the construction of comparable metrics, but also necessitates assessment against a recognised standard. In the 2001 human reference sequence paper authored by the IHGSC, the accuracy of the assembly was evaluated by comparing it against an ordering of parts of the genome as dictated by sequence data derived from the ends of the cloned fragments in the DNA libraries used in the sequencing. This resulted in the identification of clones that did not overlap with others. These non-overlapping clones had been sought, as their presence could indicate misplacement of fragments in the assembly; they were subjected to closer investigation, resulting in “about 150” of the 421 “singletons” being attributed to misassembly.
Sequence quality at the level of nucleotides was evaluated in terms of the ‘PHRAP score’ for each one. The IHGSC used PHRAP and PHRED, software packages developed at the University of Washington in Seattle by Phil Green, who wrote both, and Brent Ewing, who co-developed PHRED. Together, they were—and are still—used for base calling. The software analyses the fluorescent peaks in the sequence read-out. It estimates error probabilities for each base call based on figures obtained from the read-out data and generates consensus sequences with error-probability estimates (Ewing et al., 1998; Ewing & Green, 1998).Footnote 7 The resulting PHRAP scores indicate the probability that an individual base call is incorrect, and therefore the overall accuracy of the sequencing. A score of 10 denotes an accuracy of 90%, meaning there is a 10% chance that any given base is wrong. A score of 20 means an accuracy of 99% (a 1% chance of a given base being wrong), 30 means 99.9% (a 0.1% chance), and so on. The 1998 Second International Strategy Meeting on Human Genome Sequencing held in Bermuda promulgated sequence quality standards that included an error rate of less than 1 in 10,000 (i.e. 99.99% accuracy, a PHRAP score of at least 40) and a directive that the error rates derived from PHRAP and PHRED be included in sequence annotations.Footnote 8
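The quality-score arithmetic described above follows the standard logarithmic relation Q = −10 · log10(p), where p is the estimated probability that a base call is wrong. A minimal sketch, with function names of our own invention:

```python
# A minimal sketch of the quality-score arithmetic described in the text:
# Q = -10 * log10(p), where p is the per-base error probability.

import math

def error_probability(q):
    """Estimated probability that a base call with quality score q is wrong."""
    return 10 ** (-q / 10)

def quality_score(p):
    """Quality score corresponding to a per-base error probability p."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    p = error_probability(q)
    print(f"Q{q}: {100 * (1 - p):.2f}% accurate, error rate {p:g}")
# Q10: 90.00% accurate, error rate 0.1
# Q20: 99.00% accurate, error rate 0.01
# Q30: 99.90% accurate, error rate 0.001
# Q40: 99.99% accurate, error rate 0.0001
```

The last line corresponds to the Bermuda standard mentioned above: an error rate below 1 in 10,000 equates to a score of at least 40.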
Following the initial 2001 publication and online availability of a draft sequence, further assemblies were made available on the internet through GenBank, the DNA Data Bank of Japan and the European Nucleotide Archive (ENA), the last of which encompassed the databases housed at the European Bioinformatics Institute (EBI) from 2007 onwards. From December 2001, these assemblies were released using the name of the National Center for Biotechnology Information (NCBI), the institution into which GenBank was incorporated. NCBI Build 28 was the first release labelled in this way.Footnote 9 Then, in April 2003, the first assembly that constituted a human reference sequence was published, known as NCBI Build 33.Footnote 10
The 2004 IHGSC paper on the ‘finished’ euchromatic sequence was working from a subsequent assembly, NCBI Build 35 (International Human Genome Sequencing Consortium, 2004). In their analysis of this build, the authors compared it against the 2001 version using some of the measures indicated above, but also pursued some deeper analysis of the quality of the new sequence. The new assembly had 341 gaps, compared to 147,821 in 2001. The N50 for the 2004 sequence was 38,500 kilobases, a dramatic improvement from the 81 kilobases determined for the 2001 version. To further examine the completeness of the 2004 assembly, the consortium looked for 17,458 known human cDNA sequences in it and found that the “vast majority (99.74%) could be confidently aligned to the current genome sequence over virtually their complete length with high sequence identity”.
The 2004 paper assessed the accuracy of sequencing by inspecting discrepancies between nucleotides in the overlapping regions of 4356 clones from the same Bacterial Artificial Chromosome (BAC) library. This required some appreciation of the rate of polymorphism (genetic variation) across humans, as a difference in a single nucleotide could be due to this inter-individual or inter-group variation rather than constituting an error. Later in this chapter, we see how an appreciation of genomic variation and diversity was vital to making functional use of genomic data; here, we see how such an understanding, however tentative, played a part in fundamental analyses of the quality of a reference sequence itself.
Alongside these assessments, the IHGSC members evaluated whether junctions “between consecutive finished large-insert clones” that they had used “to construct the genome sequence” were spanned by another set of fosmid clones derived from a library that they created for this purpose (International Human Genome Sequencing Consortium, 2004, p. 936). With approximately 99% of the euchromatic sequence deemed to be of the requisite finished quality, the attention of the sequencers turned to the recalcitrant 1% and the heterochromatic regions, which would require new methods and materials to resolve, rather than merely a continuing scale-up of sequence production. Next-generation sequencing methods, including long-read technologies that sequence larger stretches of DNA and therefore reduce the number of problematic gaps or misassemblies, have assisted in this (e.g. Nurk et al., 2022). Furthermore, fundamental research pertaining to particular problematic regions has generated data and information that has enabled the creators of successive assemblies to amend and improve these refractory areas.
In addition to improvements to a single canonical reference sequence, attempts were increasingly made to ensure that the reference genome better reflected the variation manifested by the target organism. For instance, databases and visualisations were given the capacity to depict alternate loci, contigs and scaffolds that differ from the reference sequence. An example of a new presentational mode that conveys different kinds of variants alongside the reference sequence is the pangenome graph, which shows where these variants diverge from the standard and how common their departures from the reference version are (Khamsi, 2022).
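A pangenome graph can be pictured as a sequence graph in which reference and variant alleles form alternative branches ('bubbles') between shared flanking sequence, with named paths threading through the common structure. The following is a minimal illustrative sketch; the class and names are hypothetical and do not reproduce any particular pangenome tool's data model.

```python
# Minimal sketch of a variation graph: nodes hold sequence fragments,
# edges connect them, and named paths (reference, variants) thread
# through the shared structure. All names here are hypothetical.
class VariationGraph:
    def __init__(self):
        self.nodes: dict[int, str] = {}        # node id -> sequence fragment
        self.edges: set[tuple[int, int]] = set()
        self.paths: dict[str, list[int]] = {}  # path name -> ordered node ids

    def add_node(self, node_id: int, seq: str) -> None:
        self.nodes[node_id] = seq

    def add_edge(self, src: int, dst: int) -> None:
        self.edges.add((src, dst))

    def add_path(self, name: str, node_ids: list[int]) -> None:
        self.paths[name] = node_ids

    def sequence(self, name: str) -> str:
        """Spell out the full sequence along a named path."""
        return "".join(self.nodes[i] for i in self.paths[name])

g = VariationGraph()
# A single-nucleotide variant forms a 'bubble': two alternative
# one-base nodes (A vs G) between shared flanking sequence.
g.add_node(1, "ACGT"); g.add_node(2, "A"); g.add_node(3, "G"); g.add_node(4, "TTCA")
for src, dst in [(1, 2), (1, 3), (2, 4), (3, 4)]:
    g.add_edge(src, dst)
g.add_path("reference", [1, 2, 4])
g.add_path("variant", [1, 3, 4])
print(g.sequence("reference"))  # ACGTATTCA
print(g.sequence("variant"))    # ACGTGTTCA
```

The design point is that shared sequence is stored once, while each sampled genome is just a path: adding a newly observed variant extends the graph rather than requiring a new linear reference.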
In order to move towards a model of reference assemblies that incorporated variation, and to manage and conduct this ongoing work, the Genome Reference Consortium was established in 2007 by the Sanger Institute, the McDonnell Genome Institute (the new name of the genome centre at Washington University), the EBI, and NCBI. They initially focused on three species: human, mouse and zebrafish, the latter two because of their role as model organisms and due to existing investments in creating gene knock-out collections for these species (Church et al., 2011). Since then, rat and chicken—also model organisms—have been added, and The Zebrafish Model Organism Database and the Rat Genome Database have joined the consortium.Footnote 11
Pig and yeast are notably absent from the Genome Reference Consortium. In the case of yeast, ongoing improvements to the sequence and annotation of the reference genome—first released in 1996—are performed by the Saccharomyces Genome Database at Stanford University, with both the sequence and annotation treated as “a working hypothesis” subject to continual revision (Fisk et al., 2006). A major revision of the yeast reference genome was completed in 2011, using a colony derived from the AB972 sub-strain of S288C, the sub-strain that Linda Riles had used to construct the genome libraries for the original sequencing of the yeast genome. The new sequence reads were aligned to the existing reference genome; low-quality mismatches were discarded, and the genome was manually assembled and edited, with checks of the literature for particular sequences and annotations.
While the comparison with the older reference affirmed the quality of that earlier standard, the new assembly made numerous corrections to it. The authors of the paper announcing the new assembly judged the reference sequence to be sufficiently comprehensive and accurate that future revisions could give greater weight to incorporating variation than to fixing errors. They also suggested that, having worked towards and largely achieved a highly veridical representation of a single strain, the focus of yeast reference genomics should shift towards creating the most useful representation of the organism. One of the stated implications of this was the need to develop a pangenome including annotated sequences representing different S. cerevisiae laboratory strains and wild specimens, using some of the copious data being generated on these, as well as on related species (see Sect. 7.4; Engel et al., 2014).
In pig genomics, the first major revision after the completion of the reference genome (represented by the 2011 Sscrofa10.2 assembly) was released in 2017. The impetus for producing a new reference genome was provided by a team led by Tim Smith at the US Department of Agriculture’s Meat Animal Research Center (USDA MARC). They sequenced a boar from a population whose breed ancestry was estimated to be half Landrace, quarter Duroc and quarter Yorkshire. Smith was using Pacific Biosciences long-read sequencing technology, which held the promise of greater contiguity of sequence and fewer potential issues with assembly. However, when others in the pig genome community found out about Smith’s endeavour, the error rate for this technology made them sceptical of its worth.
They did, however, work with Smith and his team to produce a new reference genome. Together, they hit upon the strategy of using Pacific Biosciences long-read technology in conjunction with more reliable Illumina short-read technology. This, combined with the improved chemistry of the newer versions of the Pacific Biosciences technology, helped them to produce a high-quality assembly that formed the basis for Sscrofa11, which became the designated reference genome Sscrofa11.1 when the Y-chromosome data from the X+Y project (Chap. 6) was incorporated. Alan Archibald at the Roslin Institute used money acquired from the UK Biotechnology and Biological Sciences Research Council to fund a large part of this effort, paying Pacific Biosciences for an initial assembly that the community could then work on further. He was fortunate that the contractor Pacific Biosciences had engaged to do this had fallen behind schedule, meaning that Pacific Biosciences took it in-house and conducted the work themselves, ensuring that the project benefitted from the latest chemistry and the best expertise on deploying their technology.Footnote 12
The USDA assembly—resulting from Smith’s original work—was submitted separately, though it was compared with the new reference sequence in the eventual paper reporting its completion. Multiple metrics—such as the number of gaps between scaffolds, the coverage and the N50—demonstrated the superiority of Sscrofa11.1 to Sscrofa10.2, and this higher quality ensured a better automated annotation through the Ensembl pipeline, including a doubling of the number of gene transcripts identified (Warr et al., 2020).
It is worth observing here, though, that interpretations of the quality of assemblies are not straightforward. For example, the 2011 assembly Sscrofa10 has a higher number of scaffolds, more gaps between scaffolds, and ‘worse’ N50 and L50 figures for scaffolds and contigs than the 2010 assembly Sscrofa9.2. This does not mean that Sscrofa10 is of lower quality, but rather that additional chromosomes (such as the Y chromosome) and extra-nuclear DNA had been included in the assembly. The Y chromosome notoriously contains many repetitive sequences that are consequently difficult to assemble.
This example shows that reference assemblies can constitute—and therefore represent—different objects, even within the same species. Furthermore, for the pig, in addition to the reference assemblies of the Swine Genome Sequencing Consortium, there is the USDA MARC assembly. There have also been other assemblies published for different breeds of pig (including Chinese breeds by the company Novogene) and the minipig used for biomedical research (sequenced by GlaxoSmithKline and BGI-Shenzhen, formerly the Beijing Genomics Institute), as well as other sequences concerning a variety of breeds and populations of pigs. These more specific references, with some recognised in formal designations and database entries and others not, are examined later in the chapter.
The discussion above shows that reference genomes are not monolithic, static objects. They are continually improved, impelled towards an ever-receding horizon of completeness. But parallel to this continual improvement of the standard reference sequence, genome assemblies have also ramified, as we see with the compiling of genomes for distinct breeds of pig. Additionally, for human and yeast, new aims that guide the evaluation of reference genomes in ways that go beyond the quality metrics of old (e.g. N50) have emerged, especially concerning the variation that the reference sequences instantiate. However, this concern with variation and variants is not something that arises after the reference genome, as the story of the IHGSC and the supposed emergence of a postgenomic era may suggest: it was already present beforehand.
3 Functional and Systematic Genomics Before Reference Genomes
Pre-reference genomics occurred in different eras for each of the species: up to the mid-1990s in the case of yeast, until the late-2000s for pig and preceding the turn of the millennium for the human. These distinct timeframes are pertinent because none of the developments in genomics for these species or any others have occurred in a vacuum: particularities of each were mediated by the adaptation and adoption of tools, methods and data produced for other species, and the comparative inferential apparatus that was constructed to enable such translations.
For yeast (Chap. 2), we saw that comprehensive genetic linkage maps were produced well before the initiative to sequence the genome started. Extensive physical maps were produced by Maynard Olson in the 1980s, building on Robert Mortimer’s earlier genetic linkage maps, and then later physical mapping was conducted by the groups in charge of the sequencing to aid this undertaking for each chromosome. In the case of yeast, the dominant focus of the community was on one laboratory strain that had already had much of its variation abstracted from it in the process of its construction as a model organism.
For human, as discussed in Chap. 3, a great deal of data on variation was generated by the medical genetics community, which extensively catalogued variants of particular genes and associated these with clinical cases of specific diseases, such as cystic fibrosis. Significant hospital-based human DNA sequencing took place, such as at the John Radcliffe Hospital in Oxford, Guy’s Hospital in London and the Hospital for Sick Children in Toronto. Yet, because of the notable absence of these medical genetics groups from the IHGSC membership, these maps and sequences were only marginally accounted for in the production of the reference sequence.
For the pig, mapping projects generated considerable amounts of data concerning the variation of particular genetic markers, which were discerned through crosses of different breeds suspected to be genetically distinct owing to the geographical disparity of their origins and their morphological differences. The familiarity of these geneticists with the kinds of markers used in these studies enabled a subset of them to pursue the European Commission (EC) funded projects PigBioDiv 1 and 2 (1998–2000 and 2003–2006, respectively) to characterise the genetic diversity of pig breeds first within Europe, and then across Europe and China (Ollivier, 2009). These projects, as well as prior studies of pig genetic diversity that had been conducted from the mid-1990s, represented an integration of functional and systematic approaches and concerns.
Many researchers in the pig breeding community have had research interests connected to the variation and diversity of both domesticated pigs and their wild cousins. As a result, these topics were even included in early genome mapping initiatives. A pilot study of genetic diversity across twelve rare and commercial breeds of pig formed part of the EC’s PiGMaP II programme (1994–1996).Footnote 13 PiGMaP II’s organisation reflected the collaborative division of labour approach of the PiGMaP projects more broadly, with various groups supplying DNA from, and pedigree information concerning, animals from specific breeds they had access to. Meanwhile, researchers from Wageningen University and INRA Castanet-Tolosan (a station near Toulouse) selected a panel of 27 microsatellite markers—repetitive sequences of variable length—on the basis of their level of polymorphism, distribution across the genome, and practical ability to use in genomic studies. This panel of 27 microsatellites was subsequently adopted by the Food and Agriculture Organization of the United Nations (FAO) for studying pig genetic diversity. Max Rothschild, in his capacity as the pig genome coordinator for the USDA’s Cooperative State Research, Education, and Extension Service, ensured that the appropriate PCR primers for these markers were produced and distributed among the community. In addition to the use of the microsatellites that were themselves a key product of the PiGMaP collaboration, minisatellites and DNA fingerprinting for detecting genetic variation and diversity were also trialled in PiGMaP II.Footnote 14
Beyond PiGMaP, in addition to some of the other projects discussed in Chap. 5, the community sought to further develop their work on pig biodiversity. An initial follow-up was the ‘European gene banking project for pig genetic resources’ that ran from 1996 to 1998, which assessed nineteen breeds of pig using eighteen of the standard set of 27 microsatellites together with the blood group variants and biochemical polymorphisms that had been traditionally employed in studies of variation (Ollivier, 2009).
A major development in the elucidation of pig genetic diversity was the advent of the EC-funded demonstration project, ‘Characterization of genetic variation in the European pig to facilitate the maintenance and exploitation of biodiversity’, which officially ran from October 1998 to September 2000 and was retrospectively referred to as PigBioDiv1.Footnote 15 It was led from the Jouy-en-Josas station of the French Institut National de la Recherche Agronomique (INRA) with quantitative geneticist Louis Ollivier as the coordinator. The participation of Graham Plastow of the Pig Improvement Company (PIC) reflected interest in the project by the breeding sector. On the FAO side, the involvement of Ricardo Cardellino and Pal Hajas showed that those with a longer-term and strategic view of the future of livestock also held this work to be important.Footnote 16
The aim of PigBioDiv1 was to create a means to maintain and track genetic variation. This was motivated by the breeding sector’s assumption that additional sources of genetic variation were needed in order to enable the further improvement of their commercial breeding lines,Footnote 17 to ensure the sustainability of livestock agriculture, and to respond to changing consumer and regulatory demands that might entail new breeding goals. This approach was stimulated by, and aimed to address, a growing policy concern with the conservation of “animal genetic resources” to safeguard global food security. The FAO were central to this drive and published “The Global Strategy for the Management of Farm Animal Genetic Resources” in 1999 to that end (Food and Agriculture Organization, 1999). The concept of “genetic resources”, which has been traced back to the 1970s, was adopted by the FAO in 1983 and formed part of the framework of the UN Convention on Biological Diversity in 1992. It has been criticised for foregrounding an instrumental value of biodiversity (Deplazes-Zemp, 2018), and this is certainly true in the case of the PigBioDiv projects.
Following the widespread adoption of microsatellites in pig genome mapping and the pilot diversity project, and in the light of FAO recommendations for using them in examining genetic diversity, these highly polymorphic markers formed the basis of both the PigBioDiv1 and PigBioDiv2 (February 2003 to January 2006) projects (see Table 7.1).
Sharing a view expressed by other participants, Chris Haley—a quantitative geneticist involved in the PigBioDiv projects—has observed that this work was based on the assumption that genetic diversity reflected functional diversity. Yet, microsatellites were known to be non-functional parts of the genome.Footnote 18 It was this property, however, that enabled them to be so polymorphic, and therefore useful in mapping and tracking diversity. Furthermore, despite being non-functional parts of the genome, microsatellites still had applications in functional research. Indeed, markers such as these can be and have been used in animal breeding, where it is not strictly necessary to find a causative gene, but merely something—like a microsatellite—that is statistically associated with one or many genes that may themselves be implicated in phenotypic variation for traits of interest (Lowe & Bruce, 2019). As we shall see later in this chapter, SNPs generated by the pig genomics community and compiled into a SNP chip were used in this way, but were also applied in more systematic studies of pig genetics concerned with variation and diversity.
The importance of the particular historicity of the pig genomics community, and its involvement in multiple different projects of data collection and resource generation, cannot be overestimated here. The creation of the means to identify and map markers, and exploit the data and mapping relations so generated, relied on a coming together of molecular and quantitative geneticists. In some cases, this occurred within institutions (such as at the Roslin Institute with Chris Haley and Alan Archibald, partly driven by the immediate history of that institution, see Myelnikov, 2017; Lowe, 2021) or within the overall cooperative division of labour that had been forged. This community has been able to work with populations of livestock with well-recorded pedigrees, manipulate breeding in those populations, and produce data, tools and techniques intended to aid the improvement of selective breeding practices.Footnote 19 For this community, associated as they have been with the pragmatic and instrumental concerns of breeding, genetic variation has constituted a potential resource that breeders could exploit to improve populations in the ways they desired. The pig geneticists therefore developed a different disposition to the one that prevailed in medical genetics, a discipline that has been chiefly concerned with deleterious variants, or the one in yeast biology wherein the use of a standardised model strain with variation abstracted away has been a crucial basis of research. This helps to explain why systematic and functional studies were less entangled early on in yeast and human genomics compared to research on the pig.
As well as aiding this functionally-oriented research, the instrumental discernment of pig genetic diversity has also contributed to the identification of Quantitative Trait Loci (QTL), sites of genomic variation associated with phenotypic variation. This is unsurprising, given that the mapping of the pig genome from the early-1990s onwards involved the crossing of breeds that were assumed to be genetically distinct, and that this work was itself directed towards developing the methods to home in on QTL. The generation and exploitation of diversity was implicated in this more direct form of functionally-oriented research from the beginning of pig genomics. In the words of the summary of PigBioDiv2 on the European Union’s CORDIS website, through this research, “the discerning customer can not only demand tasty meat but can help to power the academic drive for conservation”.Footnote 20
This research also added to the data and knowledge concerning other systematic aspects of the pig: phylogenetic relationships, evolutionary history, processes of domestication, and more recent histories of genetic exchange and relationships between breeds. One of the key challenges and contributions of the project was in measuring diversity. They adapted an approach to measure diversity devised by the economist Martin Weitzman, which involved measuring the genetic distance between pairs of populations using the marker data.Footnote 21 The genetic distances were then used to cluster the populations and infer phylogenetic trees and relationships between them. It therefore provided insights into the relationships between different populations, including between European and Chinese ones, and between the patterns of variation prevailing in those two regions.Footnote 22 They attributed these patterns to historical flows of genes that resulted from different modes of domestication and ways of organising livestock farming and breeding (Megens et al., 2008; SanCristobal et al., 2006).
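To illustrate the kind of calculation involved in turning marker data into distances between populations, one widely used measure in such allele-frequency studies is Nei's standard genetic distance. This is offered only as a representative, hedged sketch: the PigBioDiv analyses combined several distance measures within the Weitzman framework, and the data below are invented for illustration.

```python
import math

def nei_distance(pop_x, pop_y):
    """Nei's standard genetic distance between two populations,
    given per-locus allele frequencies.
    pop_x, pop_y: lists of dicts, one per locus, mapping allele -> frequency."""
    jx = jy = jxy = 0.0
    for fx, fy in zip(pop_x, pop_y):
        alleles = set(fx) | set(fy)
        jx += sum(f * f for f in fx.values())       # homozygosity in x
        jy += sum(f * f for f in fy.values())       # homozygosity in y
        jxy += sum(fx.get(a, 0.0) * fy.get(a, 0.0)  # shared identity
                   for a in alleles)
    n = len(pop_x)  # number of loci
    return -math.log((jxy / n) / math.sqrt((jx / n) * (jy / n)))

# Two hypothetical populations typed at one invented microsatellite locus,
# with alleles named by fragment length.
identical = [{"171bp": 0.5, "175bp": 0.5}]
diverged = [{"171bp": 0.1, "175bp": 0.9}]
print(nei_distance(identical, identical))  # 0.0: identical frequencies
print(nei_distance(identical, diverged))   # positive: frequencies differ
```

A matrix of such pairwise distances is the input that clustering and tree-building methods then use to infer relationships between breeds, as in the European and Chinese comparisons described above.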
In the next section, we show that the close relationship between these two modes of functional and systematic research—and the continuity of researchers and institutions—persisted through the production of the pig reference genome and into the aftermath of its completion. In yeast and human, with few exceptions, these modes of research were considerably less entangled in the immediate aftermath of the production of the reference genome.Footnote 23
4 After the Reference Genome
4.1 Yeast: Successive Endeavours
EUROFAN—the European Functional Analysis Network—was always considered to be the next step after the Yeast Genome Sequencing Project (YGSP) by the community of S. cerevisiae genomicists. Although individual laboratories functionally interpreted and made use of some of the data from the sequencing project in their research, more concerted large-scale functional analysis was postponed until after the completion of the reference sequence. A high-quality reference would be needed in order to effect the targeted gene deletions that formed the centrepiece of EUROFAN.Footnote 24 Like the pig biodiversity projects examined above, EUROFAN benefitted from initial pilot programmes. In the case of yeast, researchers used these pilots to develop modes of gene disruption and methods of phenotypic assay for functional analysis.
The EUROFAN participants used the same yeast strain as the sequencing project: S288C. As a laboratory strain, S288C has had invariance within and between its colonies strictly enforced. For functional analysis, it was therefore necessary to create variation, so that researchers could uncover the functions of genes in the reference genome. This was done by producing a new resource: a library of mutants. EUROFAN was conceived as a continuation of the annotation of the well-established and comprehensive reference genome, and indeed recapitulated the hierarchical but dispersed nature of the prior effort to sequence the yeast reference genome, especially the EC-funded portion of it. A division of labour was instituted, between:
- The overall coordination of the project;
- Liaison with the Yeast Industrial Platform (Chap. 2);
- An informatics strand—based at the Martinsried Institute for Protein Sequences (MIPS)—to manage and assess the quality of submitted data and develop a database and computational tools for data analysis;
- The creation of the mutants;
- The storage, curation and distribution of the mutant collection;
- Various kinds and stages of functional analysis occurring at the bench.
Like the YGSP, EUROFAN therefore involved a wide variety of institutions. There was considerable continuity between the participants in the YGSP and EUROFAN, and consequently it involved a set of laboratories working on the cell biology, molecular biology and biochemistry of yeast. This approach reflected a continued perception of the value of these large-scale networked projects for the research endeavours of these laboratories, and the advantages of coordinating such laboratories in a network for further genomic analysis.Footnote 25 In this way, the model of functional analysis was conceived as a means of contributing material towards the further investigation of genes, rather than it being intended to transform the basis of “normal” yeast biology.Footnote 26 Indeed, all but two of the 21 participating laboratories in EUROFAN had also been involved in the YGSP; roughly a quarter of the members of that prior effort took part in EUROFAN.Footnote 27 The creation of a curated resource in the form of a mutant collection as well as the ongoing annotation of the reference genome was attractive to the EC, but also meshed explicitly with the imperative to add more resources to the toolkit of yeast as a eukaryotic model organism.
The project was labelled as systematic (in the adjectival sense) rather than comprehensive, because only some of the sequences thought potentially to contain protein-coding genes were investigated. The work included an assessment of Open Reading Frames (ORFs) identified in the reference genome sequencing. ORFs are DNA sequences between a start codon and a stop codon, the signals that begin and terminate the translation of messenger RNA into protein. A workflow determined which of these ORFs would undergo successive forms of “increasingly specific” functional analysis. As a result, only a portion of the ORFs and genes identified through the initial sequencing and structural annotation of the reference genome were fully functionally characterised.
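As a rough illustration of what identifying ORFs involves, the following sketch scans one strand of a DNA sequence in all three reading frames for stretches running from a start codon (ATG) to an in-frame stop codon. This is a simplified toy for exposition, not the gene-prediction pipeline actually used in yeast annotation.

```python
def find_orfs(seq: str, min_codons: int = 2) -> list[str]:
    """Toy ORF scan: one strand, all three reading frames, reporting
    stretches from a start codon (ATG) to an in-frame stop codon.
    min_codons is the minimum number of codons (including ATG) before the stop."""
    stops = {"TAA", "TAG", "TGA"}
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == "ATG":
                for j in range(i + 3, len(seq) - 2, 3):
                    if seq[j:j + 3] in stops:
                        if (j - i) // 3 >= min_codons:
                            orfs.append(seq[i:j + 3])
                        i = j  # resume scanning after this ORF's stop codon
                        break
            i += 3
    return orfs

# ATG AAA TTT TAA: a start codon, two further codons, then a stop,
# so the whole string is reported as a single ORF.
print(find_orfs("ATGAAATTTTAA"))
```

Real annotation also scans the complementary strand and applies further evidence (length thresholds, sequence similarity, expression data) to decide which ORFs are genuine genes, which is precisely the filtering that the EUROFAN workflow then took further.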
The functional analysis commenced with deletions of specific ORFs through the design of constructs—known as ‘gene replacement cassettes’—and their insertion into yeast DNA. These gene replacement cassettes contained a gene (kanMX) that conferred resistance to the antifungal agent geneticin. Applying geneticin thus left only the yeast that had integrated the cassette into their DNA and therefore carried the deletion of the ORF. This method was developed in the midst of the YGSP by Peter Philippsen—the coordinator of the sequencing of chromosome XIV of S. cerevisiae—and Achim Wach at the Biozentrum in Basel (Wach et al., 1994). By observing and measuring the impact of the successful deletion of a specific ORF on the organism, researchers could infer the functional role that it played in yeast, for instance whether the deleted ORF was part of a protein-coding gene.
By 1996, researchers at the European Molecular Biology Laboratory (EMBL) had finished comparing ORF sequence data to protein sequences held in public databases. On the basis of sequence similarities, they made functional predictions for over half of all identified yeast genes (Bassett Jr et al., 1996). EUROFAN concentrated on the genes for which functional predictions of this kind were not possible. These were the so-called ‘orphans’: “novel genes discovered from systematic sequencing whose predicted products fail to show significant similarity when compared to other organisms, or only show similarity to proteins of unknown functions” (Dujon, 1998, p. 617). Functionally characterising these kinds of genes in EUROFAN would be particularly useful, considering yeast’s role as a model organism and in biotechnology. As a model organism, it would constitute a richer platform for inferring the functional implications of homologous sequences found in the less well-characterised genomes of other species. For biotechnology, the genes with novel functions that were identified could be expressed within yeast itself to yield potentially valuable products or be inserted into other organisms by transgenic techniques.
EUROFAN, as well as filling in the orphan gaps left after the EMBL’s analysis, aimed to observe gene effects and functions in ways that were missed by what leading yeast biologist Stephen Oliver described as the “function-first” approach of “classical genetics”, which relied on the detection of some observable heritable variation or change to infer the presence and function of a gene. Instead, in EUROFAN they deleted known genes to produce mutants, and then measured the quantitative effects of this, for instance on growth rates of the cells through competitive growth experiments, or the biochemical effects as assessed through measurement of metabolite concentrations (Oliver, 1997).Footnote 28
EUROFAN created mutants based on the deletion of 758 ORFs and then proceeded towards analysis of the deletants, which was led first by Peter Philippsen and then by Steve Oliver. In addition to this, parallel projects led by YGSP participants created mutant strains of smaller numbers of ORFs and Bernard Dujon’s laboratory committed “mass murder” by deleting multiple ORFs at a time and then characterising the mutant phenotypes arising from these (Goffeau, 2000).
Nevertheless, the desire to identify all of the genes in S. cerevisiae and characterise all ORF deletants remained. Funds to realise this came through a collaboration between two of the leading US figures in the original sequencing project: Mark Johnston at Washington University and Ron Davis of Stanford University. Johnston obtained a grant from the National Institutes of Health (NIH) for the period 1997 to 2000 for ‘Generation of the Complete Set of Yeast Gene Disruptions’, an initiative to create a comprehensive catalogue of S288C deletion strains, affecting all its genes. Davis also obtained a grant from the NIH to provide the tens of thousands of oligonucleotides—synthetic DNA sequences—that were needed for the production of the deletion cassettes (Giaever & Nislow, 2014).
This work, running from 1998 to 2002 and hosted at Stanford, became the Saccharomyces Genome Deletion Project, now Yeast Deletion Project, a consortium that involved many of the leading actors in European yeast genomics as well as the North Americans, including Howard Bussey at McGill in Canada (Giaever et al., 2002; Winzeler et al., 1999).Footnote 29 It was complementary to, and in many respects a development of, EUROFAN. The consortium analysed the deletion strains thus produced under several growth conditionsFootnote 30 and sent the strains—containing DNA barcodes to enable linkage of material and data resources—to be preserved and distributed by repositories such as ATCC (the American Type Culture Collection) and EUROSCARF (the European Saccharomyces Cerevisiae Archive for Functional Analysis).Footnote 31
All this functional annotation was captured by databases set up specifically to allow yeast biologists to exploit the data deluge being generated by these projects. The Saccharomyces Genome Database (SGD) was founded in 1993 and first made available through the internet in 1994. It is primarily funded by the NIH—through the National Human Genome Research Institute (NHGRI)—and is hosted at Stanford University. SGD curators compile and integrate data on S. cerevisiae with the aim of presenting functionally annotated genomic data to yeast biologists in a usable form, providing them with a variety of tools that allow them to interrogate functional relationships and interactions (Dwight et al., 2004).Footnote 32 The Comprehensive Yeast Genome Database (CYGD) was established at MIPS and intended to be a development of the prior work conducted at MIPS and by the European sequencing and functional annotation consortia. Expert curators manually annotated the yeast genome, using data from EUROFAN and other allied projects. Its main objectives were two-fold: to develop an informatics infrastructure to analyse and annotate complex interactions in the yeast cell, and, later, to link data generated on other species of yeast to S. cerevisiae, using comparative genomic approaches to improve the latter’s annotation (Güldener et al., 2005).Footnote 33
The functional efforts that populated these databases involved the creation of variation in a compendious fashion, using a single well-characterised strain of yeast on the basis of a high-quality reference genome. This was a key difference from pig genomics, in which there was a long tradition of investigating variation before the reference genome was produced. In human genomics, a similar tradition of investigating variation existed in medical genetics, but it became disconnected from the IHGSC effort to produce a reference sequence of the whole human genome.
In yeast, the functional analysis of this variation was intended to improve the value of the reference sequence by producing data to help annotate it. More broadly, it was pursued to generate and provide data and physical resources (the mutant strains), which could be used by the wider yeast research community for their own purposes, thereby improving the value of the species as a model organism. As with the YGSP, the creation of reference resources, both bioinformatic and material, was accompanied by the generation of implementable knowledge about the genome of the species that could inform the further study of wider aspects of its biology.
The creation of these reference resources also enabled the production of reference sequences for other strains of S. cerevisiae and related species. This led to a flowering of comparative and evolutionarily-focused studies on S. cerevisiae and other types of yeast. One leading example is a network in which six French laboratories associated with the Centre National de la Recherche Scientifique (CNRS)Footnote 34 worked with the French national sequencing centre Genoscope. This network, Génolevures, was a programme of comparative genomics research concerning the ‘Hemiascomycetous’ budding yeasts, a group that includes S. cerevisiae.Footnote 35 In the first round of this initiative, Genoscope sequenced the genomes of thirteen species in this group at a low coverage of between 0.2 and 0.4X. The participating laboratories then analysed this sequence data with reference to S. cerevisiae, which served as a comparator, an “internal standard” according to Horst Feldmann’s description (Feldmann, 2000). This comparative approach facilitated the manual annotation of the thirteen new genomes and, in turn, enabled the identification of 50 new genes that improved the annotation of the S. cerevisiae (S288C) reference genome. From 2000, all sequence and comparative data were stored in the Génolevures database, which has since been succeeded by three more specialised databases that hold the results produced by the consortium.Footnote 36
In 2002, Genoscope agreed to sequence the reference genomes of four species at a much higher 10X coverage: Kluyveromyces lactis, Debaryomyces hansenii, Yarrowia lipolytica and Candida glabrata, the first three of which were analysed in the initial Génolevures project, the last of which is a human pathogen closely related to S. cerevisiae (Souciet, 2011). The comparison between the genomes involved a study of evolutionary conservation and divergence, which allowed researchers to identify and then investigate a variety of evolutionary changes that occurred in and between each of the phylogenetic branches—the lineages—that the species represented. This formed the basis for further investigations in the systematic mode, including the sequencing of additional species. Intriguingly, the comparative genomics that constituted—and was enabled by—Génolevures also allowed researchers to unveil manifold differences in gene content between the related species. These data were useful for further investigation into the physiological differences between them, and therefore advanced functional analysis as well (Bolotin-Fukuhara et al., 2005; Souciet, 2011).
This connection between the functional and systematic modes of yeast genomics was recognised by leading members of the community. For example, the next major grant that Mark Johnston secured following the 1997 to 2000 creation of deletion strains was another from the NIH: ‘Comparative DNA sequence analysis of the yeast genome’ running from 2001 to 2005. Using BLAST programmes (see Chap. 6) to compare nucleotide and protein sequence data between S. cerevisiae and other members of the Saccharomyces genus, Johnston and collaborators at the Washington University Genome Sequencing Center were able to estimate genetic distances between the species. This information, they supposed, would indicate which pairings would produce the most valuable comparative data. From these comparisons, they were able to identify various genomic elements, such as potential protein-coding genes and functional non-coding sequences (Cliften et al., 2001).
Most of the collaborators on that work then pursued a comparative study of the genomes of Saccharomyces species: S. cerevisiae itself, three others with genetic distances indicative of enough evolutionary distance to ensure divergence of non-functional sequences, and two more distantly related species. The objective of this was to identify signals of conserved “phylogenetic footprints” in the sequence that would indicate the presence of functional parts of the genome, including those that had been previously difficult to find, such as non-coding regulatory elements. The results enabled the further improvement of the annotation of the S. cerevisiae reference genome, and also included predictions of functional sequences that could be experimentally tested (Cliften et al., 2003).Footnote 37
Throughout, this work was accompanied by Johnston’s ongoing molecular biological research programme on glucose sensing and signalling in the yeast cell. He became involved in Génolevures in the late-2000s (The Génolevures Consortium, 2009), contributing further to de novo and improved sequencing and annotation of the members of the Saccharomyces genus. The increasingly dense comparative relations and data so established helped forge synergies between reference genomics, functional analysis of the genome, molecular biological research and systematic studies. Indeed, this had developed to the extent that the status of “model genus” was claimed for the Saccharomyces sensu stricto genus encompassing S. cerevisiae and close relatives, due to the magnitude of data and experimental resources available across and within it (Scannell et al., 2011).
This dynamic was explicitly articulated in the yeast genomics community. They were aware of the limitations of relying solely on a reference sequence of a highly-standardised laboratory strain that was phenotypically atypical. They believed that more reference sequences were required, within the S. cerevisiae species itself and for related species. They appreciated that the data and knowledge of genome variation and evolution that they wrought from these could be used for functional analyses and inform the improvement of the reference resources that they were based on. Ed Louis, whom we encountered providing advice on telomeres and chromosomal evolution during the YGSP (Chap. 2), conveyed this in terms of a virtuous cycle (Fig. 7.1). In this cycle, additional data on genomic variation allows researchers to increase their knowledge concerning conservation across genomes. This helps them to improve annotations. Better annotations allow a refinement of the localisation of features such as synteny breakpoints: regions in between two stretches of conserved sequence of a particular kind. And these, in turn, allow fresh appreciation of structural variation (Louis, 2011).Footnote 38
Ian Roberts of the National Collection of Yeast Cultures at the Institute of Food Research (Norwich, UK) and Stephen Oliver characterised research on the vast genomic and physiological diversity of yeasts and (functionally-oriented) systems biology as the “yin and yang” of biotechnological innovation involving these creatures, thereby emphasising the complementary and co-constitutive nature of these modes (Roberts & Oliver, 2011). As well as aiding manual improvements to annotations, the data and resources concerning diversity across yeast strains and species and their comparative relationships have also been harnessed to power automated annotation pipelines (Dunne & Kelly, 2017; Proux-Wéra et al., 2012).
In yeast, then, there was a passage from creating the reference genome, to pursuing functional analysis of that resource, to then producing data on other strains and related species, and using this to seed comparative and systematic research. The particular interpretation of comprehensiveness for these researchers was not restricted to a ‘complete’ reference genome but was far richer and more heterogeneous. It involved the establishment of relations between a variety of different forms of data and the creation of tools to make use of them. This reflected the desire of the yeast genomicists themselves to make use of the resources; they therefore had knowledge of what was needed for research purposes, and of how the data, resources and tools could be deployed and contextualised. All this also reflected the disposition of people who were aware of what their stewardship of a model organism entailed.
Major drivers of the yeast genome research agenda, such as Stephen Oliver and Mark Johnston, were able to appreciate and leverage the synergies that could be created between the functional and systematic modes of research, because they were engaged in both. Thus, the continuity of participants across these successive phases of yeast genome research eased and motivated their ultimate integration. This was something of a different tale from pig genomics, where, as we showed above, systematic and functional forms of analysis had been entwined since the pre-reference genome stage. In human genomics, our next object of analysis, the functional and systematic modes were more like twin tracks than successive or permanently-entwined endeavours.
4.2 Human: Twin Tracks
As we have seen, in the sequencing of the human reference genome, the intended user communities were progressively detached from involvement in the production and annotation processes. However, in Chaps. 2 and 3 we showed how laboratories based in hospitals or medical schools had been conducting their own sequencing and making novel contributions by identifying genes and gene variants associated with particular pathological manifestations since before the start of whole-genome sequencing efforts. This programme of variant-focused and medically-oriented sequencing continued throughout the 1990s and beyond, with more and more mutations of particular genes catalogued and analysed, and more genes and key pathological variants associated with particular diseases or conditions. In some cases, research collaborations combined this approach with the sequencing of larger genomic regions: in the early-2000s, researchers at the Toronto Hospital for Sick Children (SickKids) joined forces with other medical genetics groups and Celera to sequence, analyse and extensively annotate human chromosome 7 (García-Sancho, Leng, et al., 2022; Scherer et al., 2003).
Several databases have been established to manage and present data on gene variants concerning human pathogenicity. These include Online Mendelian Inheritance in Man (OMIM) and the subscription-access Human Gene Mutation Database (HGMD), while other databases have been created by particular communities focused on specific diseases or genes. The HGMD was founded in 1996 at Cardiff University in Wales. Its model is to scan biomedical literature and curate entries on ‘disease-causing mutations’, ‘possible-disease-associated polymorphisms’ and ‘functional polymorphisms’, according to the judgement of the curators assessing multiple lines of genomic, clinical and experimental evidence. Since 2000, HGMD has collaborated with commercial actors: the up-to-date version with enriched annotations and features is available on subscription from them, while a more basic public version is made freely available, containing data that is at least three years old. Celera was the first commercial collaborator and included the extensive HGMD data in its Discovery System™ until 2005. From 2006 to 2015, the German bioinformatics company BIOBASE then developed HGMD Professional, a web application accessible upon purchase of a licence, to hold this premium data. In 2014, BIOBASE was purchased by the German biotechnology company QIAGEN, which had participated in the sequencing of the yeast genome.Footnote 39
Specialist disease-centred databases, such as the Toronto-based cystic fibrosis mutation database and network (Chap. 3), constitute resources and tools that are curated by the community of medical genetics clinicians themselves, rather than being provided top-down by the NCBI or any other specialist genomics organisation. In this respect, these specialist databases are similar to some of the ones that arose out of yeast and pig genomics initiatives. They are, however, more long-lasting than many of the pig ones, more fragmented than the yeast ones, and more specialised than both. The more concentrated and global databases of yeast genomics, and the more ephemeral ones of pig genomics, result from different funding and support regimes, but also reflect the role of genomic resources in each community. Yeast, as a model organism, requires comprehensiveness and the inclusion of a multitude of different forms of data in one or a few repositories that exhibit some form of persistence and longevity. The pig community, however, corrals certain kinds of genomic data that are appropriate to the research and translational problems that need to be solved at a certain point in time, with such prioritisation trumping completeness (and permanence). For medical genetics, on the other hand, the community is much larger and divided by disease categories. The pig genomics community is not as partitioned by a focus on particular traits (even if some pig researchers have investigated some traits more than others), nor is the yeast one divided into silos investigating specific kinds of molecular mechanisms or processes.Footnote 40
We return to the medical genetics track shortly. For now, we observe that it constituted a particular form of entanglement between functional and systematic research, which looked at both variation within genes and variation between individuals, with these data linked to functional information drawn from a variety of sources. These sources even included evolutionary ones, insofar as they provided informative evidence used by curators, such as those at HGMD. Now, though, we consider a separate track that followed the publication of the human reference sequence by the IHGSC in 2004. In this track, distinct annotation efforts were conducted, on the analogy of EUROFAN, but in a quite different form. As during the determination of the human reference sequence, the medical genetics and IHGSC-based tracks remained largely separate throughout the 2000s until recent attempts at rapprochement, including the establishment of a centralised repository of clinically-relevant genomic data in 2013. This is why we refer to them as twin tracks: they developed simultaneously but maintained separate trajectories for a significant period of time.
We have already encountered the comparative genome sequencing effort across the tree of life sponsored by the NHGRI in Chap. 6, in which two working groups provided recommendations to a Coordinating Committee that then amended and submitted them to the NHGRI Advisory Council. The aim of this was to generate data on non-human primates, mammals and selected other species to inform human genome annotation. Here, it is instructive to note two significant changes to the recommendations of the Working Group on Annotating the Human Genome made by the Coordinating Committee. One was to propose even lower coverage sequencing for non-primate mammals, effectively downgrading this component to a pilot project. The rationale for this was that there was insufficient knowledge of mammalian genome evolution at that point to be able to definitively identify particular species as ideal candidates for the deeper shotgun sequencing originally recommended. Instead, they argued that a shallower study should provide sufficient grounds for identifying candidates for deeper sequencing or de novo sequencing. Thus, systematic knowledge needed further development before it could begin to yield data from which a comparative inferential apparatus could generate homologies and hypotheses for searching the human genome for functional elements.
The other change was to postpone working on a survey of human genome variation. In spite of the Committee identifying this element as a “high priority”, it baulked at committing significant resources to what amounted to a “resequencing project”, and recommended instead waiting to see whether resequencing costs declined sufficiently over the coming years.Footnote 41 A ‘Workshop on Characterizing Human Genetic Variation’ was held in August 2004 to discuss possible ways forward, with further proposals for studying human genomic variation developed within the NHGRI in 2005, alongside collaboration with the ongoing HapMap project.Footnote 42 These and other initiatives surveying human genomic variation are discussed later in this section. For now, it is worth noting that this systematic exploration of human genomic variation became decoupled from the effort to develop resources for human genome annotation.
Operating in parallel with the ongoing efforts to develop a comparative approach to human annotation was ENCODE, the Encyclopedia of DNA Elements, an ongoing project that was conceived as a follow-up to the IHGSC effort. ENCODE was launched in September 2003 by the NHGRI, five months after the ‘completion’ of the euchromatic human genome sequence in April 2003. It has passed through successive phases and associated consortia since then (see Table 7.2 for the main participants in the Pilot Phase) but has continued to work towards the overarching goal of building “a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active”.Footnote 43 The rationale behind this effort was that a “comprehensive encyclopedia of all of these features is needed to fully utilize the sequence to better understand human biology, to predict potential disease risks, and to stimulate the development of new therapies to prevent and treat these diseases”. It was therefore conceived as a bridge from the structural dataset ‘completed’ in 2003, to the ability to make use of it.Footnote 44
ENCODE therefore aimed at, and presumed the possibility of achieving, completeness. Despite constituting an essay in functional genomics, it involved considerable structural annotation, since it required identifying and annotating genes and other key functional elements such as regulatory regions. Its methods, though, have extended beyond those applied in automated and manual annotation pipelines. The search for regulatory elements that affect the expression of genes entailed the development of a panoply of other approaches, including a return to ‘wet lab’ experimentation analogous to the functional analysis activities in the laboratories participating in EUROFAN. This involved the use of techniques that aimed to identify signs of activity in the genome, for example biochemical signatures of particular chromatin structures (the way DNA is packed) that enable access to the DNA so it can be transcribed (Kellis et al., 2014).
One of the main outcomes of ENCODE has been the increasing realisation that what constitutes a functional element is relational and context-dependent. The move towards once again conducting genomics research in biological laboratories reflects this shift, as capturing what elements of the genome become functional in particular circumstances “requires a diverse experimental landscape” (Guttinger, 2019). While the establishment of the ENCODE project came out of the IHGSC effort, its investigation of the biology of the human genome has triggered the involvement of a broader range of experimental laboratories, due to ENCODE being concerned with living biological function and not merely constituting a data gathering exercise.
During the pilot phase of ENCODE, the GENCODE consortium was created to produce reference annotations of the human genome. From its inception, it was led by the Sanger Institute, and involved participants from several institutions including the EBI. GENCODE incorporated data from a variety of automated prediction pipelines and experimental data into the Ensembl pipeline and HAVANA manual curation (Chap. 6). The desire to demarcate truly functional regions from non-functional ones meant that genes had to be distinguished from pseudogenes, and that the significance of non-coding regions needed to be assessed. In both cases, inspired by research indicating the salience of regulatory regions in complex developmental processes, the identification and annotation of transcripts assumed great significance in the project and formed the basis for the manual annotation. The consortium used transcriptomic data from expressed sequence tag (EST) and messenger RNA sequences, as well as protein sequences, obtained from GenBank and UniProt, using BLAST to align these against the sequences of the original BAC clones used in human reference genome sequencing. The data arising from these efforts led to a mounting appreciation of the prevalence of alternative splicing across the genome, wherein there may be multiple products of a single gene.Footnote 45 Reflecting on their findings, the GENCODE team emphasised that the way in which a reference annotation is constructed “is extremely important for any downstream analysis such as conservation, variation, and assessing functionality of a sequence” (Harrow et al., 2012, p. 1760; see also Kokocinski et al., 2010).
Alongside this, efforts to catalogue the extent and diversity of human genomic variation were already underway. More so than in pig genomics, and far more so than in yeast genomics, this research has concentrated on variation within the target species. Even before the production of the human reference genome, there was a concerted project to map human genetic diversity. Although it received some support through the Human Genome Organisation (HUGO) and the NIH, the Human Genome Diversity Project founded in 1991 was unconnected with the IHGSC effort, and indeed with medical genetics, being largely an initiative of researchers interested in human evolution and anthropology (M’Charek, 2005; Reardon, 2004). This concern with intra-specific human variation and diversity, and the connection of this with the study of the inheritance of traits—particularly disease traits—pre-dated the determination of the reference sequence. This work relied heavily on the use of genetic markers such as the Restriction Fragment Length Polymorphisms developed in the 1980s (Chap. 1). The advent of microarray technology—SNP chips—made a qualitative difference to this line of inquiry and the relationship between research on human genetic diversity and the identification of particular genes with functional and pathological roles. Rather than recapitulate the research that has examined this (e.g. Rajagopalan & Fujimura, 2018), we instead explore the creation and impact of a SNP chip in pig genomics in the next section and relate this to the work on microarrays in human and medical genetics where appropriate.
In the 2000s, arising from the centralised and top-down world of the IHGSC, were the International HapMap Project (2002–2010) and the 1000 Genomes Project (2008–2015), which drew samples from populations around the world to identify common variants: those with at least 1% prevalence in any given population. For HapMap, SNPs were generated and selections of them were made according to the project criteria. Ten centres were used to genotype—genetically assess—the samples that were collected, with over 60% of this genotyping done at either RIKEN (Rikagaku Kenkyūjo, the Institute of Physical and Chemical Research) in Japan or the G5 institutions: Sanger Institute, Whitehead Institute/Broad Institute, Baylor College of Medicine, Washington University in St Louis and the US Department of Energy’s Joint Genome Institute. The resulting haplotype map identified sets of human genome variants that tended to be inherited together. The mapping was conceived as a “short-cut” to identifying candidate genes and aiding association studies to ascertain the genomic variants implicated in disease (The International HapMap Consortium, 2003). Like the follow-up 1000 Genomes Project, which sequenced whole genomes to capture genomic variation rather than just sequencing parts of them, it was a top-down initiative that sought to provide a dataset to be picked up and exploited by a presumed external user community.
The efforts to cultivate and inform a user community, while directed towards helping researchers realise the value of the resource, demonstrated how separate producers and users were during the conception and realisation of such projects.Footnote 46 Furthermore, though the projects intended the data produced to be useful for what we describe as systematic studies, the data were not conceived or generated for those purposes, but for their anticipated potential biomedical use. To the extent that the data was analysed by the project for systematic purposes (e.g. The 1000 Genomes Project Consortium, 2010), it was presented as a separate application of the results. This systematic information was not articulated as being informative or indicative for functional studies, in the synergistic manner understood by the yeast genomics community by this time.
As with the reference genome produced by the IHGSC, however, we can interpret the fruits of these top-down projects in terms of the ways that they have been used as a means to create genomic resources more tailored to particular research needs. Consider the effort to produce “a regional reference genome” by a consortium of Danish researchers, “to improve interpretation of clinical genetics” in that country, enhance the power of association studies (examining the relationship between genomic and phenotypic variation) and aid precision medicine research (Maretty et al., 2017, pp. 87 and 91). This team produced 150 high-quality de novo assemblies, which they validated by aligning them against the then-current human reference genome assembly. They identified multiple forms of variants, aided by the reference panel produced by the 1000 Genomes Project and data from the NCBI’s Single Nucleotide Polymorphism database (dbSNP), which had itself been considerably enriched through the efforts of the International HapMap Project and the 1000 Genomes Project. The Danish team were therefore able to use the infrastructure and resources developed from these top-down projects, not directly to produce research that could be translated into clinical outcomes, but to construct their own local, targeted resources in the form of a local reference genome and a catalogue of variation pertinent to the populations they work with (Maretty et al., 2017).Footnote 47
The relationship between large-scale data infrastructures and more local and specific ones focusing on concrete objects, communities or research areas has recently strengthened. One manifestation of this shift has been the establishment of ClinVar and ClinGen. These represent an attempt at liaison between the separate tracks of human genome research and medical genetics research. ClinVar and ClinGen capture forms of variation and processes of evidential evaluation of their functional and pathological significance that are found in medical genetics and clinical research. It therefore promises a form of synergy involving the alignment of different modes of data practices, methods, analytical approaches and community norms.
ClinGen and ClinVar were established by the NIH in 2013, with the aim of providing open-access data on variants, tied to clinical interpretations of them. ClinGen is the overall programme that works in partnership with ClinVar, the database that is run by NCBI. Both continue to be funded by the NIH. Their founding was based on the concern that such clinically-relevant genetic data was being kept locally by individual researchers and laboratories, held in disease-specific databases available only to members of a particular community, or hidden behind a paywall, like the most recent and rich data contained in HGMD. Furthermore, the different architectures of such databases and treatments of data were thought to stymie clinical interpretation.
The answer was a centralised repository, with uniform data standards and clear processes of curation and attribution of labels to individual variants indicating their potential clinical (or otherwise functional) significance. However, to make this work, it would be necessary for the submission of data to ClinVar to be contextualised with its putative medical significance. Rather than being stripped of all but a few items of contextualisation in the form of metadata—as sequence data submitted to GenBank and other similar databases is—this data on sequence variants needs to travel with the clinical interpretations made by the submitters and the various kinds of evidence used in them. ClinVar serves as a repository for this information, with agreements or disagreements in interpretation assessed by the user researchers, rather than being resolved by the database itself.Footnote 48
Where ClinVar takes a more active role is in the convening of expert panels to curate interpretations for particular genes. Applications can be made to the ClinGen Steering Committee for approval of the formation of an expert panel. The interpretations of these bodies then outrank virtually all other levels of “review status” (Landrum & Kattman, 2018).Footnote 49 One of these expert panels was called CFTR2, a group that worked—and still works—on the CFTR cystic fibrosis gene. Most of its members belong to either Johns Hopkins University or the Toronto Hospital for Sick Children, reflecting a parallel route of research stretching back to the 1980s (Chap. 3) that was long separated from the established mainstream of human genomics research and infrastructure.Footnote 50
ClinGen also takes an active role in aggregating and curating genomic and health data from various sources and feeding this into ClinVar (Rehm et al., 2018). ClinGen and ClinVar constitute the platform for a convergence between the once wholly distinct tracks associated with the IHGSC enterprise and medical genetics. While these remain separate in day-to-day practice, the creation of a data infrastructure to draw upon the findings and expertise of clinicians and researchers—including those working in medical genetics—enables them to participate in a more concerted and unified whole-genome effort. This also provides human genomicists outside the medical genetics community—including at specialist genome centres—with access to information about variation and its clinical effects that is essential for the medical translation of sequence data.
4.3 Pigs: A Fuzzier Distinction
As with the production of a reference genome, by the time the pig genomics community was in a position to develop their own concerted functional annotation effort, they were able to benefit from the protocols, methods, data and experience of human functional annotation. This legacy enabled them to devise a pared-down approach more appropriate to the levels of funding they enjoyed. There were, however, aspects of functional annotation that drew on the particular history of this community, the uses that they envisaged for the data and the affordances provided by their particular subject organisms.
An initial call for the concerted annotation of non-model organism animals was made by Alan Archibald, Ewan Birney of the EBI and Paul Flicek (who had primarily worked on mouse genomics), at the International Society for Animal Genetics conference in Cairns (Australia) in July 2012.Footnote 51 This alliance reflected the ongoing connections between the pig genome community and the EBI. However, it was at the annual Plant and Animal Genome (PAG) conference, in San Diego in January 2014, that genomicists working on a variety of farm animals started developing the basis for an international multi-species collaboration to advance functional annotation following the initial sequencing of several reference genomes (The FAANG Consortium, 2015).Footnote 52 At that PAG conference, the Animal Biotechnology Working Group of the EU-US Biotechnology Research Task Force convened an “AgENCODE” workshop.Footnote 53 As the name suggests, the aim was to emulate ENCODE, and to that end, several speakers from that project contributed to the session and to subsequent workshops and conferences held by what became the Functional Annotation of Animal Genomes (FAANG) Consortium (Tuggle et al., 2016).
Presenting the outcomes of the AgENCODE workshop in a PAG conference session the following day were key figures from the genome mapping and sequencing of chicken, cattle and pig from the previous two decades: Gary Rohrer of USDA MARC, Alan Archibald of the Roslin Institute, Christine Elsik of the University of Missouri, Elisabetta Giuffra of the Animal Genetics and Integrative Biology unit (Génétique Animale et Biologie Intégrative, GABI) at the Jouy-en-Josas station of INRA and Martien Groenen of Wageningen University.Footnote 54 Reflecting the practices and careers of many livestock geneticists, these researchers worked on the genomes of multiple species. All this demonstrates the agriculturally-inclined origins of FAANG, which have shaped the aims and outputs of the project ever since. Although other potential applications such as the use of animals as biomedical models and understanding domestication and evolution have also been cited as motivations, these have not formed a substantial part of the published output or attention of the consortium.Footnote 55
The aim of the FAANG Consortium (and its constituent steering committee and working groups) has been to “produce comprehensive maps of functional elements in the genomes of domesticated animal species based on common standardized protocols and procedures” (The FAANG Consortium, 2015, pp. 2-3). The Consortium (Table 7.3 lists pig genomicists who were founding members) narrowed its focus to the animals for which there were reference assemblies most amenable to functional annotation (chicken, cattle, pig and sheep), identified a small set of core assays and defined experimental protocols based on the experiences of ENCODE, established a Data Collection Centre based at the EBI to aid and validate submissions to the data portal hosted on the FAANG website, and defined a core set of tissues to be used. The collection and sharing of a limited set of tissues derived from populations of low genetic diversity was intended to aid the replicability and comparability of the data produced using them across the community and to ensure that associations between functional genomic annotations and quantitative phenotypic data could be made even in the early stages of the project (The FAANG Consortium, 2015).
A key feature of FAANG has been its focus on defining and decomposing the phenotype, or the phenome. Phenome is a term that denotes the phenotypic equivalent of the genome, with phenomics constituting concerted phenotyping on the model of the major genomic sequencing projects. In the farm animal world, researchers have access to extensive gross phenotypic data (such as on coat colour, slaughter weight, number of eggs laid per day) on animals with well-defined pedigrees. This is due to the role that measuring phenotypes has played in the breeding industry, with which researchers have enjoyed close ties since at least the 1960s. The means by which to measure, analyse and interpret phenotypic data are long-established and have continually evolved as animal geneticists adopted more molecular approaches in the 1980s and then pursued genome mapping, sequencing and analysis from the 1990s onwards. Both molecular and genomic approaches have intersected with quantitative genetics research and methods.
This focus on phenotypic data far exceeds anything seen for the other two species we have examined throughout this book. Yeast biologists have paid close attention to the phenotypes of their organism, but these are phenotypes of far less complexity than those of farm animals. Concerning the human, concerted efforts to characterise large groups of humans in phenotypic terms, for example in the history of physical anthropology (Müller-Wille & Rheinberger, 2012, pp. 106-107) or more recent initiatives such as the UK Biobank project (Bycroft et al., 2018), constitute exceptions to the general trend in which phenotypic data collection—and the development of infrastructures and practices to enable this—has been far less extensive than for at least some farm animal species. One cannot control the breeding or environmental conditions experienced by humans or track multiple phenotypic measurements in such a continuous and intrusive way as can be done for an experimental herd or flock (or for plants; see: Müller-Wille, 2018).
One of the key aims of FAANG has been to decompose the gross phenotypes they and breeders had previously been working with into more proximate molecular phenotypes (biomarkers), and then to causally link variation in these proximate molecular phenotypes to variation in gross phenotypes. Alongside other intended outputs of the FAANG collaboration, the identification of molecular phenotypes and associated specific genomic variants has been intended to better model the relationship between genotype and phenotype, to advance their agenda of improving genomic prediction from a known genotype to an expected phenotype. This emphasis on genotype–phenotype relationships and being able to more accurately predict the phenotype from a given genotype is not unique to pig or wider farm animal genomics, but it does attain a distinct salience and inflection in this area.
Within five years of FAANG swinging into action, the participants were looking beyond the initial in-depth studies of a limited range of tissues from animals of low genetic diversity. While this had helped the Consortium to identify and map functional elements and regions, it became clear that data derived from a more genetically-diverse range of animals, and more tissues, would be necessary to further analyse the relationship between genomic variation and phenotypic variation. Genes specific to particular populations could be identified through this, and then visualised in pangenome graphs depicting variation aligned to the reference sequence. This, in turn, could aid the identification of candidate variants to be implemented in programmes of genome editing of livestock species, and in the tracing of genetic diversity in and across populations to inform conservation efforts. Beyond individual species, the functional genomic and phenotypic data that FAANG compiled enabled them to identify evolutionary conservation across species. On this foundation, they could develop comparative analyses and approaches to inform cross-species inferences as to the functional genomic basis of phenotypic traits (Clark et al., 2020).
This transition from a narrow focus to a broader outlook was eased by the design of FAANG and the long-standing entanglement of systematic and functional modes of research in pig genomics. Among pig genomicists, that entwinement had fostered both versatility and an acute appreciation of the wide array of possibilities and potential applications presented by the rich and connected data generated by the FAANG Consortium.
Beyond FAANG, there have been two other ways in which functional and systematic modes have been entangled in pig post-reference genomics. A 2013 paper reporting studies of the genetic diversity of rare-breed Chato Murciano pigs kept on eight farms in Spain instantiates one of these. This research inspected the extent of variation that existed in these pigs to assess their (functional) viability in the light of inbreeding and crossbreeding (Herrero-Medrano et al., 2013). The second way is the kind of cycle (as identified by Ed Louis for yeast) between further functional annotation of the genome and an appreciation of conservation and syntenic breakpoints: either across pig breeds, between related species or drawing on a multi-species comparative approach to enrich knowledge of genome evolution more broadly (e.g. Anthon et al., 2014). This, again, often depended on the construction of new sequences based on older ones, in order to establish new connections between genomes, to identify relationships, changes over evolutionary time and examples of different forms of variation. As Martien Groenen of Wageningen University observed in a review of pig genome research in the systematic mode, however, though advances in this direction were enabled by the existence of an annotated reference genome, they were also inhibited by its limitations (Groenen, 2016). A new reference sequence, and the improved annotation built upon it through FAANG, have therefore proved a considerable boon to both systematic and functional studies.
We close this discussion of the relationship between functional and systematic modes of post-reference genome research concerning the pig by exploring a tool that represents a powerful platform to enable both: the Illumina PorcineSNP60 SNP chip or microarray (see Fig. 7.2).
A SNP chip is a tool that enables the detection of the presence or absence of a particular set of DNA polymorphisms in a sample. In constructing them, DNA—of complementary sequence to the polymorphisms to be detected—is attached to the surface of the chip. The samples to be assayed are then labelled, typically with a fluorescent dye, and added to the chip. Any sequences complementary to the probes should attach to the chip’s surface and, when stimulated, produce a detectable signal which is recorded and can then be processed to give the results of the assay. There are numerous technical details and options that go into the construction and use of a particular chip. We focus here on the choice of the DNA to be attached to the chip surface: the probes that are used to detect particular genomic variation at the single-nucleotide allele level.
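The hybridisation logic described above can be sketched in code. This is an illustrative simplification of our own, not drawn from the chapter: the probe names and sequences are hypothetical, and real chips read out fluorescence intensities rather than performing exact string matching.

```python
# Minimal model of SNP chip genotyping logic: each probe carries sequence
# complementary to one allele at a SNP site; a labelled sample fragment that
# hybridises to (i.e. is complementary to) a probe yields a signal, which is
# read out as presence/absence of that allele. Probe names and sequences
# below are invented for illustration.

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def assay(probes: dict, sample_fragments: list) -> dict:
    """For each named probe, report whether any sample fragment is
    complementary to it (a highly simplified stand-in for hybridisation
    followed by fluorescent signal detection)."""
    results = {}
    for name, probe_seq in probes.items():
        target = reverse_complement(probe_seq)
        results[name] = any(target in frag for frag in sample_fragments)
    return results

# Two probes differing only at the SNP position (hypothetical sequences):
probes = {"rs_example_A": "GGCATA", "rs_example_G": "GGCATG"}
sample = ["TTTATGCCAA"]  # this fragment complements the 'A'-allele probe
print(assay(probes, sample))  # → {'rs_example_A': True, 'rs_example_G': False}
```

The pair of probes differing at a single position illustrates how allele-specific detection works: only the probe whose complement appears in the sample produces a signal.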
When it became possible to identify and generate data on SNPs, the value of doing so was quickly recognised by the community of pig genomicists. They had long valued the creation and mapping of genetic markers of various kinds (including those with no putative functional or mechanistic role), for the identification and mapping of QTL. SNPs are polymorphic—albeit less so than microsatellites—and abundant across the genome, including in regions poorly-represented by markers such as microsatellites. They therefore represented an opportunity to identify markers at a higher resolution and more broadly across the genome.
This is particularly significant given the translational domain most members of the pig genome community were working towards: animal breeding. While there had been efforts to identify particular genes and variants thereof from the 1980s, in many cases actual functional genes were not necessarily needed for the purposes of breeding. In the 1990s, for instance, an approach called ‘Marker-Assisted Selection’ (MAS) was developed that only required a genetic marker to be identified, provided that it was closely associated with a gene of interest that a breeder might want to select for or against (e.g. Rothschild & Plastow, 2002). While identifying a gene would be imperative for transgenic improvement of livestock, or for medical genetics research, it is not for animal breeding. Because the aim is to improve a population in measurable ways, finding and using markers that are good-enough indicators is a viable strategy. If a marker proves misleading in individual cases, this is not a problem, as the animals concerned can simply be removed from the breeding pool. By the turn of the millennium, quantitative geneticists were proposing new ways to develop MAS. One of these was ‘genomic selection’, in which many more markers would need to be genotyped across the genome to ensure that at least some of them were closely linked to any (probably unknown) loci with an actual causative effect on the eventual phenotype (Haley & Visscher, 1998; Meuwissen et al., 2001). This, therefore, created the demand for SNPs to be generated and incorporated into a chip to enable the genotyping of multiple sets of them (Lowe & Bruce, 2019).
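The logic of genomic selection can be illustrated with a small simulation. This sketch is ours rather than the chapter's or the cited papers': the population sizes, causal loci and effect sizes are invented, and single-marker regression stands in for the mixed-model methods (e.g. GBLUP) used in practice. The point it demonstrates is the one made above: with enough markers genotyped genome-wide, genetic merit can be predicted from genotype alone, without knowing which loci are causal.

```python
# Toy genomic selection: estimate per-marker effects on a training population
# with known phenotypes, then predict the genetic merit of other animals from
# their genotypes alone. All numbers here are hypothetical.
import random

random.seed(42)
N_ANIMALS, N_MARKERS = 200, 50
CAUSAL = {3: 1.0, 11: -0.8, 27: 1.2}  # hypothetical causal loci and effects

# Simulate genotypes (0/1/2 copies of an allele) and noisy phenotypes.
genotypes = [[random.choice([0, 1, 2]) for _ in range(N_MARKERS)]
             for _ in range(N_ANIMALS)]

def true_merit(g):
    """The animal's actual genetic merit (unknown to the breeder)."""
    return sum(effect * g[j] for j, effect in CAUSAL.items())

phenotypes = [true_merit(g) + random.gauss(0, 1.0) for g in genotypes]

# Estimate each marker's effect on the training half by simple regression.
train_g, train_y = genotypes[:100], phenotypes[:100]

def marker_effect(j):
    xs = [g[j] for g in train_g]
    mx, my = sum(xs) / len(xs), sum(train_y) / len(train_y)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, train_y))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var if var else 0.0

effects = [marker_effect(j) for j in range(N_MARKERS)]

# Predict breeding values for the held-out animals from genotype alone.
def predict(g):
    return sum(e * x for e, x in zip(effects, g))

preds = [predict(g) for g in genotypes[100:]]
truth = [true_merit(g) for g in genotypes[100:]]

def correlation(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

print(f"prediction accuracy (correlation): {correlation(preds, truth):.2f}")
```

Even though most of the 50 markers have no causal role, the predictions track true merit, because some markers are linked to (here, identical with) the causal loci. This is the rationale for genotyping dense genome-wide SNP panels rather than hunting for causal genes one at a time.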
Alongside this, industry was pursuing SNPs with a view to identifying candidate genes. Sygen (as PIC had been renamed) secured EC funds for PORKSNP, a project running from 2002 to 2006 to identify SNPs in genes expressed in pig muscle and then run association studies to search for loci involved in meat quality traits. Sygen provided the samples for subcontracted biotechnology companies to sequence.Footnote 56 Monsanto, who had entered the pig breeding market having bought into DeKalb in 1996 (completing the purchase in 1998), were also deeply interested in SNPs for performing genome-wide association studies. In November 2001, Monsanto’s Swine Genomics Technical Lead John Byatt spoke with Jane Peterson from the NHGRI’s Extramural Program (Chap. 3) about potential support for a pig genome sequencing project. In Peterson’s notes on the event, she observed that “Really what they need are SNPs—denser needed”.Footnote 57 However, as pig genome sequencing did not proceed at the NHGRI, Monsanto looked elsewhere: to the IHGSC’s competitor, Celera. In addition to its primary biomedical focus, Celera had acquired an agriculturally-oriented biotechnology company from its parent company Perkin-Elmer, in what was effectively an internal transfer. The head of this company, Celera AgGen, was Stephen Bates. Bates persuaded Craig Venter to shotgun sequence pigs, cattle and chickens and create livestock databases using the data so generated. In February 2002, this unit was sold to MetaMorphix Inc., a biotechnology company founded in 1994 by researcher Se-Jin Lee of the Johns Hopkins University School of Medicine, who was the discoverer of the protein myostatin. As part of the deal, MetaMorphix licensed Celera’s databases for pigs, cattle and chickens.
In June 2004, they licensed what they called ‘GENIUS—Whole Genome System™’ for pigs to Monsanto for one million dollars and a share of royalties in the new breeding lines (and their hybrid offspring) developed by Monsanto using their data, which encompassed approximately 600,000 mapped SNPs and related intellectual property.Footnote 58 Despite the apparent fruitfulness of this association, MetaMorphix filed for bankruptcy in 2010, and Monsanto abandoned the pig breeding sector in 2007, selling Monsanto Choice Genetics to Newsham Genetics.Footnote 59
Meanwhile, the pig genome community was also pursuing SNPs and the creation of a SNP chip. In addition to their potential utility in animal breeding, the geneticists believed that the generation of SNPs would enable the exploitation of mouse and human data for homing in on candidate genes, as well as aiding the refinement of genetic linkage maps (Rohrer et al., 2002; Schook et al., 2005). Creating the basis for the production of SNPs was to be an outcome of the project to sequence the reference genome. Martien Groenen obtained funding to perform next-generation sequencing on additional pigs to identify SNPs, brought in other members of a consortium—which became the International Porcine SNP Chip Consortium—to pursue this, and led the analysis group to identify putative SNPs (Archibald et al., 2010).
Alongside this, a commercial partner was needed to produce and distribute the chip. The consortium held what Alan Archibald has described as a “beauty contest” at the PAG conference in January 2008, between genomic services and tool manufacturers Illumina and their main competitor, specialist microarray producer Affymetrix. Both had previously produced chips for cattle, and the judges were swayed by Illumina’s articulation of the lessons it had learned from that work.Footnote 60 Illumina’s cattle chip had been produced at the behest of the USDA in 2007, with its 54,001 SNPs used in genomic evaluations of American dairy cattle. It was quickly deployed in genomic selection, a process that has produced considerable results on a short timescale and demonstrated the value of the approach (Wiggans et al., 2017). In addition to Illumina’s lessons, a group at the USDA facility in Beltsville (Maryland) offered advice based on their own involvement in creating and using the cattle chip, with Curt Van Tassell in particular contributing valuable insights.
Martien Groenen had been involved in the development of a 20K chip (containing 20,000 SNPs) for the chicken, in collaboration with the breeding industry for that species.Footnote 61 It therefore made sense for him to play a leading role in the effort to create a pig SNP chip. For this, he leveraged existing relationships, such as with the Dutch pig breeding company Topigs, which provided genotype and sequencing data derived from their breeding lines.Footnote 62 As with other pig genomics projects, each participant brought their own funding to enable them to make their contributions, which included the provision of samples, the sequencing and identification of SNPs, conducting the selection and validation of SNPs, bioinformatics work and networking with other organisations (such as the EBI) to assist in developing and publishing the data produced through the project.Footnote 63
The commercial exigencies of the pig chip structured its contents. So too did the interests of the members of the pig genome community and the kinds of pigs—and therefore DNA samples and SNPs—that were available (see Table 7.4 for members of the ‘International Porcine SNP Chip Consortium’). Marylinn Munson from Illumina participated in the weekly working group meetings of the Consortium conducted over Skype, which made the crucial decisions shaping the chip, for instance, how many SNPs were included, with roughly 60,000 chosen. Options of up to a million SNPs were floated, but this was deemed to be excessive when the trade-off between the number of SNPs and the cost of the chip was considered. For the chips designed to genotype humans, which needed to be able to identify rare alleles (possibly involved in rare diseases) and to sample a variety of different populations, a chip with as many SNPs as technically feasible was required. For the pig, however, to ensure the competitive pricing and commercial viability of the chip, advance orders of $5 million would have to be obtained. Breeders therefore had to be interested in the chip, and this meant including alleles of at least 5% prevalence that were present in a range of breeds that mainly reflected commercial populations used by the major breeders. Where possible, SNPs known to be of relevance to livestock traits were included. Proprietary SNPs were excluded.Footnote 64 The team narrowed down the approximately half-a-million SNPs to the selection of tens of thousands to be included on the chip.Footnote 65 The DNA samples used on the chip were obtained from the Duroc, Piétrain, Landrace and Large White commercial breeds from Europe and North America and wild boar from Japan and Europe.Footnote 66
SNPs were identified through a series of procedures, some of which used the latest versions of the reference assembly. The SNPs that passed validation were then put through a selection process which included assessment across a variety of parameters. The resulting PorcineSNP60 Genotyping BeadChip was released by the end of 2008 (Ramos et al., 2009). The advent of SNP chips made genomic selection in pigs feasible, and it was adopted in the pig breeding industry as it had been in cattle (Knol et al., 2016; Samorè & Fontanesi, 2016).Footnote 67 A second version of the Illumina chip has since been developed, as well as other chips created with different selections of SNPs (Samorè & Fontanesi, 2016).
In addition to its direct use in genomic selection, the chip has also been extensively used in systematic studies, for instance concerning the diversity and patterns of domestication and geographic distributions of pigs. As with yeast, such research can reveal differences between populations and signatures of selection that enable candidate genes to be identified for further functional exploration (e.g. Diao et al., 2019; Yang et al., 2017).
A plethora of more direct functional analyses have been enabled by the chip, aiding researchers in finding and investigating genetic loci related to livestock production and welfare traits, for example through association studies (e.g. Maroilley et al., 2017). It has also helped researchers developing pigs as animal models of particular diseases (e.g. for muscular dystrophy: Selsby et al., 2015).Footnote 68 And finally, SNP chips can be used to produce and/or validate new reference resources, for instance in constructing a new high-density genetic linkage map (Tortereau et al., 2012) or assessing the completeness of the new reference sequence (Warr et al., 2020).
SNP chips, much like reference genomes and other reference resources, constitute platform tools that can be deployed for a variety of purposes. They enable new characterisations of variation and the creation of fresh resources based on them. In this, the variation imprinted in a given chip conditions its affordances as a platform tool. And in the case of the pig, the heavy involvement of the pig genomics community in the generation and selection of the SNPs to be included, and the commercial demands driving this process, affect what the SNP chip can do, and what new resources it can help seed. For example, the lack of representation of samples of DNA from African breeds and populations of pigs in the Illumina 60K chip makes it of limited usefulness for breeding applications in that continent. As a result, there has been a call for the creation of more Africa-specific livestock SNP chips, as well as breed or region-specific reference genomes (Ibeagha-Awemu et al., 2019).Footnote 69
The development of genomic resources and the exploitation of them are therefore strongly conditioned by the historical paths taken. In the case of pig genomics, we have observed a close integration of functional and systematic modes of research from pre-reference genomics onwards, continuing even during the narrower and more concentrated endeavour to sequence the reference genome. The heavy involvement of the community of pig genomicists in the creation of genomic resources from the early-1990s onwards fostered versatility in the wide use and application of these resources once the pig reference genome was released. As we have seen, though, this does not mean that the data and materials they have helped to generate lend themselves to an unlimited array of uses. It does mean, however, that they have a keen awareness of what these resources represent, how they can be built on and what they can be used for. The pig community has also benefited greatly from knowledge concerning the genomes and genomic research of other species. They have identified practices in human and cattle genomics, for example, and adapted them to their own ends and ways of working. They have also developed a comparative framework for making use of genomic data and other resources on mammals such as humans. As we have seen, the development of pig post-reference genomics differs considerably from that of human and yeast. We close the chapter by assessing the consequences of this, introducing the concept of webs of reference to help us further characterise post-reference genomics and compare the historical trajectories of genomics across different species.
5 Seeding Webs of Reference
This chapter, together with elements of preceding ones, challenges existing views of postgenomics. By looking beyond human genomics and especially beyond the determination of the human reference sequence, we have shown that an emphasis on variation, multi-dimensionality and the contextualisation of sequence (and mapping) data has pre-existed reference genomics, and can be part of reference genomics itself, rather than simply succeeding and complementing reference genome sequences once they are produced.
Across the three species we have examined, the relationships between pre-reference genome research, reference genomics and post-reference genomics are affected by the differential involvement of particular communities in these efforts. In yeast and pig, there is a high level of continuity across these phases, with the respective communities involved in constitutive aspects of the process of reference genome sequencing, and in enriching and improving the products. They have done this through engagement with large-scale sequencing centres (e.g. the Sanger Institute) and other centralised actors (e.g. MIPS), though in different ways. For example, the relationship of the pig community to the Sanger Institute was more like Mark Johnston’s relationship to the Genome Sequencing Center at Washington University than it was equivalent to the role of the Sanger Institute as a contributor to the YGSP.
The yeast and pig communities also differed in their overall goals, the nature of their target organisms and the variation exhibited by these organisms. The yeast community were self-consciously curating a model organism with a panoply of linked datasets and experimental resources, with an eye towards comprehensiveness, permanence and accumulation. They worked with a highly-constructed laboratory strain of S. cerevisiae specifically designed to minimise variation within and between colonies. The pig community, on the other hand, often worked with a mixture of primarily commercial breeds of pig, reflecting the mainly agricultural aims of their research but also the ready availability of these creatures. But they also used wild boar, as well as crosses between breeds presumed to be genetically distinct due to their geographical distance. They created genetic markers, maps, mapping tools, QTL detection methods, families and pedigrees of pigs, reference assemblies, annotations of these, as well as masses of SNPs and the chips to genotype selections of them. They worked in a satisficing mode, with researchers, groups and institutions contributing to consortia and collaborations with their own pots of money from various funding sources, building on and using existing sets of resources they had produced for a prior purpose. In both species, we see a convergence between functional and systematic modes of practising genomics, involving considerable overlaps between actors pursuing both modes. Both communities realised that an investigation of diversity could aid functional analyses either directly through the identification and analysis of key physiological and genetic differences, or more indirectly by using the insights gained from systematic analysis to improve the functional annotation and characterisation of reference genomes and other reference resources associated with the species.
In human genomics, there has been more than one community at play. There is the IHGSC community, which through the mid-to-late 1990s and into the 2000s became increasingly narrow and concentrated. They emphasised the technical refinement of sequencing in large-scale centres and the development, advancement and integration of informatics pipelines. Then there has been the medical genetics community, focused on variation between individuals (and across populations more broadly) and in the sequences of particular genes. This latter community, as we have seen, became increasingly divorced from the IHGSC effort. Instead, they established connections with Celera and their activities, for instance through the annotation jamboree, the sequencing and analysis of chromosome 7 (Chap. 6), and in further developing the HGMD. This interaction constitutes a rapprochement between the medical genetics community and an institution that specialised in the sequence determination and informatics aspects of genomics to an exquisite degree, mediated by its own commercial strategies and responses to the actions of the IHGSC. A newer rapprochement between medical genetics and the mode of genomics characterised by centralised infrastructures and data repositories has come through ClinGen and ClinVar. These constitute an attempt to compile and interpret more richly-contextualised data on genetic variants of potential clinical import, and in so doing incorporate medical genetics practices and practitioners more fully into the centralised NCBI framework.
The community dynamics we have identified, in tandem with the way that pre-reference genomics and the creation of a reference genome proceeded, have affected how post-reference genome functional and systematic research related to each other. Throughout our examination of functional and systematic research, we have found that separately assessing the limitations of individual reference resources or tools fails to capture the inter-relations between them. Inter-relatedness has, indeed, been a feature across the history of genomics, as existing resources are used for the construction of new ones, often through the deployment of comparative practices. Additionally, reference resources can relate to each other contemporaneously, through overlapping repertoires and data infrastructures, and by the ways in which one resource can inform the interpretation or validation of another.
Through interpreting the products of genomic research as part of webs of reference that exhibit a range of connections (Fig. 7.3), we can better assess the infrastructural roles and consequences of reference resources. In the three species, post-reference genome work involved the creation of reference resources that identified and characterised more genomic variation. The reference resources refer to the reference genome, are explicitly intended to connect different manifestations of variation, and contain a surplus of possibilities for the further identification and characterisation of genomic variation and the translation of such data into a multitude of different working worlds.
Based on our examination of the different confluences of systematic and functional research, we can observe that post-reference genomics does not merely consist of increasing dimensionality: the recording and linking of additional genomic variation and other forms of biological variation in data infrastructures. It also involves the generation of these dimensions and the establishment of relations between them, in different concrete ways. Additional dimensions close to the level of the DNA sequence such as RNA sequences and protein sequences do not just exist in nature to be the next logical source of data to link to the reference sequence after its production. These forms of data are produced and catalogued for particular purposes and from particular sources: recall, in Chap. 6, the use of cDNA from the cloned offspring of TJ Tabasco in pig genome annotation. Other forms of data may derive from different origins, and be chosen for their practical utility rather than their representativeness of the species or particular biological processes. Here, we might consider the narrow range of genetically homogeneous tissue samples and assays used in the initial phases of FAANG. Furthermore, as FAANG shows, additional dimensions of data being arrayed on top of reference sequences may not only represent distinct kinds of macromolecules, but phenotypes as well.
Systematic studies entail and power comparative genomic approaches that generate dense sets of data and knowledge concerning the relationships between the genomes of different strains, populations or species. This helps researchers to characterise the extent and nature of genomic variation across populations, species and sets of related species. The extent of the potential variation (including different types of genomic and other biological variation) that can be apprehended and compared is limitless. Therefore, a selection of what is actually identified and represented from that limitless array of the potentially comparable is made either a priori or during the process of analysis. What dimensionality is added to the web of reference depends on the history and interests of the community producing a resource and how this community relates to the processes involved in producing and improving the reference genome. In other words, we cannot characterise this expansion of dimensionality as being a mere consequence of a simple transition from genomics to post-genomics (or even to post-reference genomics): there are different temporalities and models across (and within) yeast, human and pig genomics.
Across both functional and systematic studies separately, and even more acutely in their intersection, the variation that is measured, analysed and integrated into data infrastructures constitutes only some of the potential range that could be pursued and exploited. The dimensions that are explored, even if they are apparently of the same kind, may be directed towards distinct goals, use different materials and be related to other dimensions differently. We refer to this as a variational surplus, in analogy to the surplus possibilities open to researchers working on particular experimental systems, as characterised by Hans-Jörg Rheinberger (1997, p. 161). So, does all this just result in a blooming, buzzing confusion of different approaches to variation among distinct projects and communities? The construction of infrastructures to establish links and relationships between different forms of data and material objects, and efforts towards integration (e.g. Leonelli, 2013), suggest not (see Note 70).
The history of post-reference genomics, elements of which we have examined in this chapter, suggests that there has been a shift in the kind of research on and using genomes. In the next chapter, we explore this in terms of “epistemic iteration”, a term coined by philosopher Hasok Chang (2004). For now, we note that in the absence of direct access to the ‘truth’, the improvement of standards such as reference genomes is evaluated using epistemic virtues, values and goals as guides. This occurs through the correction and enrichment of these resources, building on and superseding prior standards. The past serves as a constraint or a condition but is not wholly determinative of the future course of the standard. Reference genomes and other reference resources can be seen as products of their history: the choices made by particular communities amongst those available to them, including objects, methods, and modes of validation and enrichment. These activities use and devise standards such as designated reference genomes and up-to-date maps. Each standard undergoes its own process of improvement, in which new versions succeed old ones. Linkages are made between different kinds of standards or reference resources, and such linkages are used in the construction and evaluation of one resource in terms of another. The significance of the shift to post-reference genomics rests on two related phenomena. One is the increase in the number of linkages that contributes to the improvement of individual standards/resources and their use in the improvement of other standards/resources. The other is the amplifying and ramifying effect of such improvements at the more global level of webs of reference.
Before we discuss this shift further, however, we should acknowledge that for the purposes of organising the narrative and our analysis, we have assessed the production and nature of reference genomes, their annotation, and post-reference genomics in separate chapters. This should not be taken to imply that these are discrete aspects of genomics or that they occur in a regular and linear sequence. Rather, as we have attempted to demonstrate throughout, the boundaries between any one particular set of practices that depend on the outcomes of another set are rarely sharply drawn. Conceptually later processes such as annotation may inform revisions of assemblies or even details of the sequence of reference genomes, for example, and the distinctions between structural and functional annotation, and manual and automated means of conducting it, are rarely clear-cut.
With that in mind, we consider how the aims and shape of genomic research changed following the release of reference sequences. These reference genomes were not themselves static, but were continually modified and improved according to widely held epistemic criteria. These improvement efforts were often informed by the results of post-reference genomic projects that themselves relied on and used an existing reference sequence.
Alongside the enrichment of the reference genome, a panoply of reference resources have been created for distinct populations and individuals, and the means to make comparisons within and between species have been further developed. These have fed functional analysis, but have also enabled the increasing exploration and mapping of the terrain of variation within species and the establishment of connections between different species. While this has led to concerns about the extent to which the reference sequence represents the increasingly mapped terrain, the new locales established throughout this land were still seeded from the reference genome, and related to it. The terrain is not three-dimensional like a geographical landscape, but more like a hyperdimensional state space. In this way, webs of reference have been constructed, exploring the variational space for a given type (the species, a sub-species, or a higher-level grouping or taxon) as new reference standards are created to capture specified types or sub-types. These webs of reference, in which each node is related to others, have developed iteratively and recursively. The more linked data there is concerning the variational space of the type, the more further exploration can be conceived, and the more existing reference resources can be improved using the new linked data. This is where the development of population-specific resources, and ways of representing genomic variation such as pangenome graphs, have taken post-reference genomics: seeding the web.
The reference genome is useful to the extent that it is a viable origin of radiation that enables functional and systematic lines of investigation to bloom and produce linkages between different kinds of data and material. Genomics involves the creation of standards that improve relative to the epistemic aims of their creation and use, becoming more stable over time, though never achieving completion due to shifts in epistemic goals and the non-existence of even a theoretical absolute standard. But this is just a part of the picture, particularly for post-reference genomics, in which developments include the progressive exploration of the indefinitely dimensional variation space for particular species (or other types) and the establishment of connections between these spaces across different species (or other types). The more the space is explored, the more connections can be made and the basis for further exploration—extensively across the space and intensively in particular regions of it—is created (Fig. 7.3).
The way this process unfolds, and the webs of reference that are constructed through it, are unlikely to be generic. The greater degrees of freedom offered compared with reference genomics indicate that the involvement of particular communities in the generation of genomic resources will be at least as salient to how these webs develop as it was to how reference genomes were produced. However, the historicity and contingency underlying these webs of reference should not distract from the potentially new emergent dynamics generated through them. The existence of a web of reference at a certain level of development lowers the threshold for adding—and connecting—new reference resources. New groups and communities can draw upon and link to existing resources to generate their own, and therefore to contribute towards and help shape the web. The wider context of reference resources should therefore be considered as a factor in enabling fresh participation and the connection of genomic data and resources to more specific research goals, in addition to the more widespread and distributed capacity to conduct sequencing that has emerged in the last 20 years.
Notes
- 1.
The term ‘functional genomics’ itself has been traced back to the mid-1990s, when large-scale sequencing projects—especially the determination of the human reference genome—started to accelerate (Guttinger, 2019). ‘Functional analysis’ was used earlier than this, for example in connection with yeast genome sequencing (Grivell & Planta, 1990).
- 2.
This echoes the debate concerning the use of model organisms in the biological sciences, e.g., Jessica Bolker (2012) critiquing them on the basis of their unrepresentativeness in multiple respects, and Ankeny and Leonelli (2011; Leonelli & Ankeny, 2013) arguing that this should not eclipse their key infrastructural and comparative role across biology as a whole.
- 3.
Rheinberger and Müller-Wille (2017) would say, rather, that if this was the case, postgenomics constitutes a rediscovery of this holistic vision.
- 4.
The Broad Institute was formed out of a partnership of the Whitehead Institute’s Center for Genome Research and several Harvard University-affiliated institutions. Stevens (2013) focuses on this major sequence producer, as well as AceDB as an example of database technology. As we discuss in Chap. 6, this database was designed to present data in a user-friendly manner, with the assumption that once produced and released in AceDB, it would be the user who would add dimensionality to the sequence data.
- 5.
92% of human DNA is euchromatic. Together with data from other public databases, the draft was thought to encompass 94% of the entire human genome. On the UCSC Genome Browser, see: https://genome.ucsc.edu/goldenPath/history.html (last accessed 19th December 2022).
- 6.
https://www.ncbi.nlm.nih.gov/assembly/help/ (last accessed 19th December 2022).
- 7.
The error probabilities were validated in the same issue in which these papers were published (Richterich, 1998). See also: https://www.codoncode.com/productsservices/phrap.htm (last accessed 19th December 2022).
- 8.
https://web.ornl.gov/sci/techresources/Human_Genome/research/bermuda.shtml#2 (last accessed 19th December 2022). See also Felsenfeld et al. (1999).
- 9.
UCSC retained their own naming system for subsequent human genome assembly releases. https://genome.ucsc.edu/FAQ/FAQreleases.html (last accessed 19th December 2022).
- 10.
https://www.ncbi.nlm.nih.gov/Web/Newsltr/Spring03/human.html (last accessed 19th December 2022).
- 11.
The publication of subsequent assemblies constituting reference genomes has increasingly reflected the nomenclature of software releases, with periodic new versions (e.g., 2.0 and 3.0) being corrected and augmented with more regular patches (e.g., 2.3 and 3.1). Once published, these new reference assemblies are picked up and developed by the major genome browsers—such as those of UCSC, the NCBI and Ensembl—and form the basis for further annotation. The advent of a new major release requires the commensuration of the new assembly to the old, by the mapping of coordinates—and therefore features—between them.
- 12.
Interview with Alan Archibald, conducted by James Lowe, Roslin Institute, November 2016.
- 13.
The breeds were: Basque, Gascon, German Landrace, Great Yorkshire, Limousin, Piétrain, Porc Blanc de l’Ouest, Schwäbisch-Hällisches Schwein, Sortbroget, Dansk Landrace, Swedish Landrace and Wild Boar; “The pig gene mapping project (PiGMaP)—identifying trait genes” final report, March 1997; in “EC PiGMaPII—Final Report” folder, personal papers of Alan Archibald, obtained 15th May 2017.
- 14.
“The pig gene mapping project (PiGMaP)—identifying trait genes” final report, March 1997; in “EC PiGMaPII—Final Report” folder, personal papers of Alan Archibald, obtained 15th May 2017.
- 15.
https://cordis.europa.eu/project/id/BIO4980188 (last accessed 19th December 2022).
- 16.
https://web.archive.org/web/20070817113534/http://www.projects.roslin.ac.uk/pigbiodiv/contact.html (last accessed 19th December 2022).
- 17.
A contention that was challenged by some quantitative geneticists working close to animal breeding, such as William G. Hill at the University of Edinburgh, who argued that breeding populations were not in fact short of variation, and that identifying and accessing potentially beneficial genetic variation in non-commercial populations presented significant problems that would make it less preferable to other approaches to breed improvement (Hill, 1999).
- 18.
Interview with Chris Haley, conducted by James Lowe and Ann Bruce in Edinburgh, December 2017.
- 19.
Note that these tools and techniques are not merely molecular biological or biochemical in nature, or even deemed part of classical genetics. Statistical, quantitative and computational approaches and methods have been just as central to innovation in this area: Lowe and Bruce (2019).
- 20.
https://cordis.europa.eu/article/id/85133-diversity-database-helps-conserve-rare-pig-breeds (last accessed 19th December 2022).
- 21.
This magpie-like approach to methods, techniques and resources outside of their own field reflects the bricolaged nature of pig genomics, as explored in Lowe et al. (2022).
- 22.
For example, they found that drift rather than mutations was primarily responsible for divergences within European populations, but that mutations were more salient in differences between European populations and Meishan pigs (SanCristobal et al., 2006).
- 23.
Though evolutionary analysis of genomic variation was performed for the compilation of the yeast and human reference sequences themselves (see Chap. 6, Sects. 6.2.1 and 6.2.2).
- 24.
There was some overlap between the projects, however, with EUROFAN beginning in January 1996 and the sequencing of the yeast genome being completed in April 1996. Preparations for a follow-up project to EUROFAN—EUROFAN 2—advanced in 1996 as well, with the application submitted in October of that year and the project beginning the following year: Peter Philippsen, personal communication with James Lowe and Miguel García-Sancho, February 2022.
- 25.
https://cordis.europa.eu/project/id/BIO4950080 (last accessed 19th December 2022). The project to produce a reference genome helped to unite various yeast communities (biochemists, geneticists, cell biologists, molecular biologists), which constituted a key difference with the human genome, but was something that it had in common with pig genome research (Chaps. 2, 3, 4, 5). On the discursive use of the notion of “the yeast genome” to establish and maintain a community and link it to a deeper history of yeast genetics research, see Szymanski et al. (2019).
- 26.
The benefits to ‘normal’ basic biology laboratories were emphasised in several overviews of EUROFAN (Dujon, 1998; Oliver, 1996). However, these laboratories did not feel the need to establish a domain of yeast genomics separate from their day-to-day yeast biology. In human genomics, by contrast, the promoters of large-scale genome centres sought to differentiate their endeavour from laboratory biology (Hilgartner, 2017).
- 27.
82 institutions have been listed as participating in the European Yeast Genome Network (Parolini, 2018), based on the affiliations listed in 1997’s The yeast genome directory.
- 28.
In this way, the EUROFAN approach differed from that of the medical geneticists, for whom the ‘function-first’ approach was integral.
- 29.
https://web.archive.org/web/20210427053303/www-sequence.stanford.edu/group/yeast_deletion_project/deletions3.html (last accessed 19th December 2022).
- 30.
As noted in Bassett Jr et al. (1996, p. 764), “As there are an infinite number of possible growth conditions, it would be impossible for any systematic effort to analyze any given deletion strain comprehensively”, which negates the possibility of true comprehensiveness or completeness, though “the public availability of these strains for future in-depth analyses by yeast labs specializing in the study of a particular class or family of genes would represent a powerful resource”. Therefore, while the territory could never be completely explored, the means now existed for any laboratory to ‘visit’ any part of it they desired to.
- 31.
DNA barcodes use sequences of genes or parts of genes that are known to be specific to particular species to determine and label species membership for a given organism. On DNA barcoding and its multiple uses, see: Hollingsworth et al. (2016) and https://transgene.sps.ed.ac.uk/blog/investigating-barcoding-life (last accessed 19th December 2022). EUROSCARF is a service run by Scientific Research and Development GmbH, a company based in Oberursel near Frankfurt-am-Main.
- 32.
We may observe here the utility of local, specialist databases, though there is significant overlap in the data contained in them and in more global ones such as GenBank. The main reason for having data in a local or specialised database is to adapt its presentation and analytic tools to the requirements and preferences of a specific group of users. We speculate that this may be why the annotation of yeast sequences in the yeast databases is more community-based and less automated than in more general-purpose databases such as GenBank or the ENA. On the importance of similar local databases in biomedicine, see Cambrosio et al. (2020).
- 33.
CYGD received most of the funds for its establishment from the EC from 2000 to 2004, and also received support from the German federal government, the German Research Foundation and the government of the Brussels Region of Belgium: https://cordis.europa.eu/project/id/QLRI-CT-1999-01333 (last accessed 19th December 2022).
- 34.
- 35.
This is another example of a large-scale sequencing centre working with the yeast genomics community, on the community’s terms, in a similar manner to that seen in pig genomics. In this respect, it is quite distinct from human post-reference genome projects, in which either large-scale sequencing centres dominated, with their work augmented by smaller institutions and laboratories on the terms of the project defined by the IHGSC, rather than those smaller-scale actors setting the agenda. This may be due to the fact that, whereas the direction of yeast and pig genomics was shaped by people working with these organisms (e.g., André Goffeau and Alan Archibald), the direction of human genomics was shaped by James Watson and John Sulston, outsiders to the human and medical genetics communities.
- 36.
http://gryc.inra.fr/ (GRYC: Genome Resources for Yeast Chromosomes), http://fungipath.i2bc.paris-saclay.fr/ (FUNGIpath) and http://phylomedb.org/ (PhylomeDB)—all last accessed 19th December 2022.
- 37.
Such research was not restricted to the yeast genomics community; there was a similar effort conducted by the Whitehead Institute and MIT, motivated by improving the basis of cross-species comparative genomics (Kellis et al., 2003).
- 38.
- 39.
http://www.hgmd.cf.ac.uk/ac/index.php (last accessed 19th December 2022). See also García-Sancho, Lowe, et al. (2022); Stenson et al. (2020).
- 40.
We would hypothesise that, in part, this is because disease states are more independent of each other than molecular mechanisms and processes in a cell are, and therefore the study of a particular disease can be more detached from research concerning other diseases. There is not necessarily a hard-and-fast distinction between the genetics and physiology of different traits in livestock animals, with immune response genes being implicated in physiological processes involved in other traits, for example. Selection for lean meat content has given rise to Porcine Stress Syndrome, simultaneously a disease, welfare and meat quality problem. Members of the pig genomics community tend to have to diversify their activities as well, to take advantage of different pots of money available to them as much as possible.
- 41.
“New Sequencing Targets for Genomic Sequencing: Recommendations by the Coordinating Committee”, part of the documents for the Meeting of the NHGRI Research Network for Large-scale Sequencing and the NHGRI Sequencing Advisory Panel, May 16, 2004 (NHGRI History Archive 7036-021).
- 42.
“Report of the Annotation of the Human Genome Working Group”, dated January 3, 2005 (NHGRI History Archive 7039-005); https://www.genome.gov/13514604/executive-summary-workshop-on-characterizing-human-genetic-variation (last accessed 19th December 2022).
- 43.
https://www.encodeproject.org/help/project-overview/ (last accessed 19th December 2022).
- 44.
https://www.genome.gov/Funded-Programs-Projects/ENCODE-Project-ENCyclopedia-Of-DNA-Elements/pilot (last accessed 19th December 2022).
- 45.
Demarcating functional and non-functional elements in the genome was a task that became thornier as GENCODE—and ENCODE—went on. It has been the source of much of the controversy arising around ENCODE. Guttinger & Dupré (2016) provide a summary of the contestation around ENCODE that refers to much of the key literature on the topic.
- 46.
https://web.archive.org/web/20210728170223/http://1000gconference.sph.umich.edu/ (last accessed 19th December 2022).
- 47.
However, such population-specific references may not be generated for all geographical areas and populations that may want them or need to make use of them. For instance, it has been observed that, as of 2022, “[l]ess than 2% of [human] genomes analysed in the two decades” after the conclusion of the Human Genome Project “are from African individuals, even though Africa harbours more human genetic diversity than any other continent.” This is not merely a problem of representation, but also a comparative lack of local sequencing and informatics capacity, as well as reliable infrastructures in these underrepresented regions (Ebenezer et al., 2022).
- 48.
ClinVar limits itself to identifying one type of possible conflict between interpretations.
- 49.
https://www.ncbi.nlm.nih.gov/clinvar/docs/review_status/ (last accessed 19th December 2022).
- 50.
https://cftr2.org/about_cftr (last accessed 19th December 2022).
- 51.
https://www.isag.us/2012/docs/ISAG_2012_Abstracts.pdf (last accessed 19th December 2022).
- 52.
The reference genome completion dates were 2004 for the red jungle fowl (Gallus gallus), the wild progenitor of the domesticated chicken; 2009 for cattle (Bos taurus); 2011 for pig (S. scrofa); and 2014 for sheep (Ovis aries). Reference assemblies of other species (such as goat and salmon) were not deemed to be of sufficient quality or were not completed at this point.
- 53.
This working group was initially established in 2008. It aimed “to encourage an integrated program of US-EC collaboration combining research, training and dissemination activities” concerning animal genomics, animal health, and bioinformatics, with the added purpose of fostering “[i]nteractions among the agricultural science, life science and medical science communities” to enable the elucidation of phenotypes from genotypes, a key theme that we return to in this discussion. https://web.archive.org/web/20170918061400/https://ec.europa.eu/research/biotechnology/eu-us-task-force/pdf/20th-meeting/working_group_on_animal_biotechnology_en.pdf (last accessed 19th December 2022). The overall task force was founded in 1990 by the EC and the White House Office of Science and Technology: http://archive.euussciencetechnology.eu/bilat-usa/news/id/231 (last accessed 19th December 2022).
- 54.
https://pag.confex.com/pag/xxii/webprogram/Paper9366.html (last accessed 19th December 2022).
- 55.
- 56.
https://cordis.europa.eu/project/id/HPMI-CT-2002-00205 (last accessed 19th December 2022).
- 57.
Notes from Jane Peterson’s meeting with John Byatt, 20th November 2001. NIHGR archives, Box031-014, obtained 7th December 2016.
- 58.
https://www.sec.gov/Archives/edgar/data/1289370/000093041306007147/c44432_10sb12g-a.htm (last accessed 19th December 2022).
- 59.
https://www.thepigsite.com/news/2007/09/newsham-genetics-acquires-monsanto-choice-1 (last accessed 19th December 2022); https://www.sec.gov/Archives/edgar/data/1289370/000093041311002078/c64859_ex99-1.htm (last accessed 19th December 2022).
- 60.
Interview with Alan Archibald, conducted by James Lowe, Roslin Institute, November 2016.
- 61.
Interview with Martien Groenen, conducted by James Lowe over Skype, September 2017.
- 62.
Interview with Barbara Harlizius, conducted by Ann Bruce and James Lowe over Skype, December 2018; personal communication from Barbara Harlizius to James Lowe, January 2022.
- 63.
“Pig SNP Working Group” folder, Lawrence Schook’s personal papers, obtained 6th April 2018.
- 64.
Interview with Martien Groenen, conducted by James Lowe over Skype, September 2017; interview with Lawrence Schook conducted by James Lowe over Skype, August 2017.
- 65.
“Pig SNP Working Group” folder, Lawrence Schook’s personal papers, obtained 6th April 2018.
- 66.
There was a wider sampling that included other domesticated breeds, among them Asian ones, as well as species related to the pig, with the data from this sequencing and SNP discovery published in the publicly available dbSNP database.
- 67.
Additional source: interview with Michael Goddard, conducted by James Lowe and Ann Bruce in Edinburgh, October 2018.
- 68.
For a list of all papers citing Ramos et al. (2009), which describes the creation and validation of the first-generation 60K Illumina SNP chip for pigs, see: https://pubmed.ncbi.nlm.nih.gov/?size=200&linkname=pubmed_pubmed_citedin&from_uid=19654876 (last accessed 19th December 2022).
- 69.
As well as the representation of particular alleles, this is also due to the differential genetic structure of livestock populations, which results from different breeding and herd/flock management practices. There have been initiatives to sequence particular breeds and populations that were not included in the reference genome, combined with new methods of incorporating and displaying variation in reference assemblies (e.g., for one involving two African cattle breeds, see Talenti et al., 2022). However, many breeds—and species—of social and economic importance in the Global South remain uncharacterised (Ebenezer et al., 2022).
- 70.
The notion of connection or linkage between resources that we use in this chapter, to describe the way that reference resources are related to each other in a web of reference, is more generic than the concept of data linkage. Data linkage entails the implementation of specific methods and infrastructures to allow data from different sources to be brought together on a common platform (e.g., Tempini, 2020). The kind of interoperability and data mobility that data linkage in this sense enables may play a role in establishing and exploiting connections between reference resources, such as the alignment of new sequence data to an existing reference genome, or being able to move from a representation of one kind of map to another in a browser (as discussed by de Chadarevian, 2004). However, connections need not require this kind of data linkage. For example, maps and reference genomes can be used interactively as visual sources by researchers who use established inferences in the production and evaluation of new reference resources (Chaps. 4, 5 and 6; Lowe et al., 2022).
References
Agar, J. (2020). What is science for? The Lighthill report on artificial intelligence reinterpreted. The British Journal for the History of Science, 53(3), 289–310.
Ankeny, R. A., & Leonelli, S. (2011). What’s so special about model organisms? Studies in History and Philosophy of Science Part A, 42(2), 313–323.
Ankeny, R. A., & Leonelli, S. (2015). Valuing data in postgenomic biology. In S. S. Richardson & H. Stevens (Eds.), Postgenomics: Perspectives on biology after the genome (pp. 126–149). Duke University Press.
Anthon, C., Tafer, H., Havgaard, J. H., Thomsen, B., Hedegaard, J., Seemann, S. E., et al. (2014). Structured RNAs and synteny regions in the pig genome. BMC Genomics, 15, 459.
Archibald, A. L., Bolund, L., Churcher, C., Fredholm, M., Groenen, M. A., Harlizius, B., et al. (2010). Pig genome sequence—Analysis and publication strategy. BMC Genomics, 11, 1.
Bassett Jr, D. E., Basrai, M. A., Connelly, C., Hyland, K. M., Kitagawa, K., Mayer, M. L., et al. (1996). Exploiting the complete yeast genome sequence. Current Opinion in Genetics & Development, 6(6), 763–766.
Bolker, J. (2012). There’s more to life than rats and flies. Nature, 491, 31–33.
Bolotin-Fukuhara, M., Casaregola, S., & Aigle, M. (2005). Genome evolution: Lessons from Genolevures. In P. Sunnerhagen & J. Piškur (Eds.), Topics in current genetics, Vol. 15: Comparative genomics (pp. 165–196). Springer-Verlag.
Bork, P., Dandekar, T., Diaz-Lazcoz, Y., Eisenhaber, F., Huynen, M., & Yuan, Y. (1998). Predicting function: From genes to genomes and back. Journal of Molecular Biology, 283(4), 707–725.
Bycroft, C., Freeman, C., Petkova, D., Band, G., Elliott, L. T., Sharp, K., et al. (2018). The UK Biobank resource with deep phenotyping and genomic data. Nature, 562, 203–209.
Cambrosio, A., Campbell, J., Vignola-Gagné, E., Keating, P., Jordan, B. R., & Bourret, P. (2020). ‘Overcoming the Bottleneck’: Knowledge architectures for genomic data interpretation in oncology. In S. Leonelli & N. Tempini (Eds.), Data Journeys in the Sciences (pp. 305–327). Springer Nature.
Chang, H. (2004). Inventing temperature: Measurement and scientific progress. Oxford University Press.
Church, D. M., Schneider, V. A., Graves, T., Auger, K., Cunningham, F., Bouk, N., et al. (2011). Modernizing reference genome assemblies. PLoS Biology, 9(7), e1001091.
Clark, E. L., Archibald, A. L., Daetwyler, H. D., Groenen, M. A. M., Harrison, P. W., Houston, R. D., et al. (2020). From FAANG to fork: Application of highly annotated genomes to improve farmed animal production. Genome Biology, 21(1), 285.
Cliften, P. F., Hillier, L. W., Fulton, L., Graves, T., Miner, T., Gish, W. R., et al. (2001). Surveying Saccharomyces genomes to identify functional elements by comparative DNA sequence analysis. Genome Research, 11, 1175–1186.
Cliften, P., Sudarsanam, P., Desikan, A., Fulton, L., Fulton, B., Majors, J., et al. (2003). Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science, 301, 71–76.
de Chadarevian, S. (2004). Mapping the worm’s genome. Tools, networks, patronage. In J.-P. Gaudillière & H.-J. Rheinberger (Eds.), From molecular genetics to genomics: The mapping cultures of twentieth-century genetics (pp. 95–110). Routledge.
Deplazes-Zemp, A. (2018). ‘Genetic resources’, an analysis of a multifaceted concept. Biological Conservation, 222, 86–94.
Diao, S., Huang, S., Xu, Z., Ye, S., Yuan, X., Chen, Z., et al. (2019). Genetic diversity of indigenous pigs from South China Area revealed by SNP array. Animals, 9, 361.
Dujon, B. (1998). European Functional Analysis Network (EUROFAN) and the functional analysis of the Saccharomyces cerevisiae genome. Electrophoresis, 19, 617–624.
Dunne, M. P., & Kelly, S. (2017). OrthoFiller: Utilising data from multiple species to improve the completeness of genome annotations. BMC Genomics, 18, 390.
Dwight, S. S., Balakrishnan, R., Christie, K. R., Costanzo, M. C., Dolinski, K., Engel, S. R., et al. (2004). Saccharomyces genome database: Underlying principles and organisation. Briefings in Bioinformatics, 5(1), 9–22.
Ebenezer, T. E., Muigai, A. W. T., Nouala, S., Badaoui, B., Blaxter, M., Buddie, A. G., et al. (2022). Africa: Sequence 100,000 species to safeguard biodiversity. Nature, 603, 388–392.
Engel, S. R., Dietrich, F. S., Fisk, D. G., Binkley, G., Balakrishnan, R., Costanzo, M. C., et al. (2014). The reference genome sequence of Saccharomyces cerevisiae: Then and now. G3: Genes|Genomes|Genetics, 4(3), 389–398.
Ewing, B., & Green, P. (1998). Base-calling of automated sequencer traces using Phred. II. Error probabilities. Genome Research, 8(3), 186–194.
Ewing, B., Hillier, L., Wendl, M. C., & Green, P. (1998). Base-calling of automated sequencer traces using Phred. I. Accuracy assessment. Genome Research, 8(3), 175–185.
Feldmann, H. (2000). Editorial: Génolevures—A novel approach to evolutionary genomics. FEBS Letters, 487, 1–2.
Felsenfeld, A., Peterson, J., Schloss, J., & Guyer, M. (1999). Assessing the quality of the DNA sequence from the Human Genome Project. Genome Research, 9, 1–4.
Fisk, D. G., Ball, C. A., Dolinski, K., Engel, S. R., Hong, E. L., Issel-Tarver, L., et al. (2006). Saccharomyces cerevisiae S288C genome annotation: A working hypothesis. Yeast, 23(12), 857–865.
Food and Agriculture Organization of the United Nations. (1999). The global strategy for the management of farm animal genetic resources: Executive brief. FAO.
García-Sancho, M., Leng, R., Viry, G., Wong, M., Vermeulen, N., & Lowe, J. W. E. (2022). The Human Genome Project as a singular episode in the history of genomics. Historical Studies in the Natural Sciences, 52(3), 320–360.
García-Sancho, M., Lowe, J. W. E., Viry, G., Leng, R., Wong, M., & Vermeulen, N. (2022). Yeast sequencing: ‘Network’ genomics and institutional bridges. Historical Studies in the Natural Sciences, 52(3), 361–400.
García-Sancho, M., & Lowe, J. W. E. (Eds.). (2022). The sequences and the sequencers: A new approach to investigating the emergence of yeast, human, and pig genomics. Special issue of Historical Studies in the Natural Sciences, 52(3).
Giaever, G., Chu, A. M., Ni, L., Connelly, C., Riles, L., Véronneau, S., et al. (2002). Functional profiling of the Saccharomyces cerevisiae genome. Nature, 418, 387–391.
Giaever, G., & Nislow, C. (2014). The yeast deletion collection: A decade of functional genomics. Genetics, 197(2), 451–465.
Goffeau, A. (2000). Four years of post-genomic life with 6000 yeast genes. FEBS Letters, 480, 37–41.
Goffeau, A., Aert, R., Agostini-Carbone, M., Ahmed, A., Aigle, M., Alberghina, L., et al. (1997). The yeast genome directory. Nature, 387(6632).
Grivell, L. A., & Planta, R. J. (1990). Yeast: The model ‘eurokaryote’? Trends in Biotechnology, 8, 241–243.
Groenen, M. A. M. (2016). A decade of pig genome sequencing: A window on pig domestication and evolution. Genetics Selection Evolution, 48, 23.
Güldener, U., Münsterkötter, M., Kastenmüller, G., Strack, N., van Helden, J., Lemer, C., et al. (2005). CYGD: The Comprehensive Yeast Genome Database. Nucleic Acids Research, 33, D364–D368.
Guttinger, S. (2019, August 2). Beyond the genome: The transformative power of functional genomics. Genomics in Context (blog, edited by James Lowe). Retrieved December 19, 2022, from https://genomicsincontext.wordpress.com/beyond-the-genome-the-transformative-power-of-functional-genomics/
Guttinger, S., & Dupré, J. (2016). The ENCODE project and the ENCODE controversy. In E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy (Winter 2016 Edition). Retrieved December 19, 2022, from https://plato.stanford.edu/entries/genomics/encode-project.html
Haley, C., & Visscher, P. M. (1998). Strategies to utilize marker-quantitative trait loci associations. Journal of Dairy Science, 81(2), 85–97.
Harrow, J., Frankish, A., Gonzalez, J. M., Tapanari, E., Diekhans, M., Kokocinski, F., et al. (2012). GENCODE: The reference human genome annotation for The ENCODE project. Genome Research, 22(9), 1760–1774.
Herrero-Medrano, J. M., Megens, H. J., Crooijmans, R. P., Abellaneda, J. M., & Ramis, G. (2013). Farm-by-farm analysis of microsatellite, mtDNA and SNP genotype data reveals inbreeding and crossbreeding as threats to the survival of a native Spanish pig breed. Animal Genetics, 44(3), 259–266.
Hilgartner, S. (2017). Reordering life: Knowledge and control in the genomics revolution. The MIT Press.
Hill, W. G. (1999). Advances in quantitative genetics theory. In J. C. M. Dekkers, S. J. Lamont, & M. F. Rothschild (Eds.), From Jay Lush to genomics: Visions for animal breeding and genetics (pp. 35–46). Iowa State University.
Hollingsworth, P. M., Li, D.-Z., Van der Bank, M., & Twyford, A. D. (2016). Telling plant species apart with DNA: From barcodes to genomes. Proceedings of the Royal Society of London B, 371, 20150338.
Ibeagha-Awemu, E. M., Peters, S. O., Bemji, M. N., Adeleke, M. A., & Do, D. N. (2019). Leveraging available resources and stakeholder involvement for improved productivity of African livestock in the era of genomic breeding. Frontiers in Genetics, 10, 357.
International Human Genome Sequencing Consortium. (2004). Finishing the euchromatic sequence of the human genome. Nature, 431, 931–945.
Kellis, M., Patterson, N., Endrizzi, M., Birren, B., & Lander, E. S. (2003). Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature, 423(6937), 241–254.
Kellis, M., Wold, B., Snyder, M. P., Bernstein, B. E., Kundaje, A., Marinov, G. K., et al. (2014). Defining functional DNA elements in the human genome. Proceedings of the National Academy of Sciences of the United States of America, 111(17), 6131–6138.
Khamsi, R. (2022). The quest for an all-inclusive human genome. Nature, 603, 378–381.
Knol, E. F., Nielsen, B., & Knap, P. W. (2016). Genomic selection in commercial pig breeding. Animal Frontiers, 6(1), 15–22.
Kokocinski, F., Harrow, J., & Hubbard, T. (2010). AnnoTrack–A tracking system for genome annotation. BMC Genomics, 11, 538.
Landrum, M. J., & Kattman, B. L. (2018). ClinVar at five years: Delivering on the promise. Human Mutation, 39, 1623–1630.
Leonelli, S. (2013). Integrating data to acquire new knowledge: Three modes of integration in plant science. Studies in History and Philosophy of Biological and Biomedical Sciences, 44(4), 503–514.
Leonelli, S., & Ankeny, R. A. (2013). What makes a model organism? Endeavour, 37(4), 209–212.
Louis, E. (2011). Saccharomyces cerevisiae: Gene annotation and genome variability, state of the art through comparative genomics. In J. I. Castrillo & S. G. Oliver (Eds.), Yeast systems biology, methods in molecular biology (Vol. 759, pp. 31–40). Springer Science+Business Media.
Lowe, J. W. E. (2021). Adjusting to precarity: How and why the Roslin Institute forged a leading role for itself in international networks of pig genomics research. The British Journal for the History of Science, 54(4), 507–530.
Lowe, J. W. E., & Bruce, A. (2019). Genetics without genes? The centrality of genetic markers in livestock genetics and genomics. History and Philosophy of the Life Sciences, 41, 50.
Lowe, J. W. E., Leng, R., Viry, G., Wong, M., Vermeulen, N., & García-Sancho, M. (2022). The bricolage of pig genomics. Historical Studies in the Natural Sciences, 52(3), 401–442.
Mackenzie, A. (2015). Machine learning and genomic dimensionality: From features to landscapes. In S. S. Richardson & H. Stevens (Eds.), Postgenomics: Perspectives on biology after the genome (pp. 73–102). Duke University Press.
Maretty, L., Jensen, J. M., Petersen, B., Sibbesen, J. A., Liu, S., Villesen, P., et al. (2017). Sequencing and de novo assembly of 150 genomes from Denmark as a population reference. Nature, 548(7665), 87–91.
Maroilley, T., Lemonnier, G., Lecardonnel, J., Esquerré, D., Ramayo-Caldas, Y., Mercat, M. J., et al. (2017). Deciphering the genetic regulation of peripheral blood transcriptome in pigs through expression Genome-Wide Association Study and allele-specific expression analysis. BMC Genomics, 18, 967.
M’Charek, A. (2005). The Human Genome Diversity Project: An ethnography of scientific practice. Cambridge University Press.
Megens, H.-J., Crooijmans, R. P. M. A., San Cristobal, M., Hui, X., Li, N., & Groenen, M. A. M. (2008). Biodiversity of pig breeds from China and Europe estimated from pooled DNA samples: Differences in microsatellite variation between two areas of domestication. Genetics Selection Evolution, 40(1), 103–128.
Meuwissen, T. H. E., Hayes, B. J., & Goddard, M. E. (2001). Prediction of total genetic value using genome-wide dense marker maps. Genetics, 157, 1819–1829.
Müller-Wille, S. (2018). Making and unmaking populations. Historical Studies in the Natural Sciences, 48(5), 604–615.
Müller-Wille, S., & Rheinberger, H.-J. (2012). A cultural history of heredity. The University of Chicago Press.
Myelnikov, D. (2017). Cuts and the cutting edge: British science funding and the making of animal biotechnology in 1980s Edinburgh. The British Journal for the History of Science, 50(4), 701–728.
Nurk, S., Koren, S., Rhie, A., Rautiainen, M., Bzikadze, A. V., & Mikheenko, A. (2022). The complete sequence of a human genome. Science, 376, 44–53.
Oliver, S. (1996). A network approach to the systematic analysis of yeast gene function. Trends in Genetics, 12(7), 241–242.
Oliver, S. G. (1997). Yeast as a navigational aid in genome analysis. Microbiology, 143, 1483–1487.
Ollivier, L. (2009). European pig genetic diversity: A minireview. Animal, 3(7), 915–924.
Parolini, G. (2018). Building human and industrial capacity in European biotechnology: The Yeast Genome Sequencing Project (1989–1996). Technische Universität Berlin. Retrieved December 19, 2022, from https://depositonce.tu-berlin.de/bitstream/11303/7470/4/parolini_guiditta.pdf
Proux-Wéra, E., Armisén, D., Byrne, K. P., & Wolfe, K. H. (2012). A pipeline for automated annotation of yeast genome sequences by a conserved-synteny approach. BMC Bioinformatics, 13, 237.
Rajagopalan, R. M., & Fujimura, J. H. (2018). Variations on a Chip: Technologies of difference in human genetics research. Journal of the History of Biology, 51, 841–873.
Ramos, A. M., Crooijmans, R. P. M. A., Affara, N. A., Amaral, A. J., Archibald, A. L., Beever, J. E., et al. (2009). Design of a high density SNP genotyping assay in the pig using SNPs identified and characterized by next generation sequencing technology. PLoS ONE, 4(8), e6524.
Reardon, J. (2004). Race to the finish: Identity and governance in an age of genomics. Princeton University Press.
Rehm, H. L., Berg, J. S., & Plon, S. E. (2018). ClinGen and ClinVar—Enabling genomics in precision medicine. Human Mutation, 39, 1473–1475.
Rheinberger, H.-J. (1997). Toward a history of epistemic things: Synthesizing proteins in the test tube. Stanford University Press.
Rheinberger, H.-J., & Müller-Wille, S. (2017). The gene: From genetics to postgenomics (A. Bostanci, Trans.). The University of Chicago Press.
Richardson, S. S., & Stevens, H. (Eds.). (2015). Postgenomics: Perspectives on biology after the genome. Duke University Press.
Richterich, P. (1998). Estimation of errors in “Raw” DNA sequences: A validation study. Genome Research, 8(3), 251–259.
Roberts, I. N., & Oliver, S. G. (2011). The yin and yang of yeast: Biodiversity research and systems biology as complementary forces driving innovation in biotechnology. Biotechnology Letters, 33, 477–487.
Rohrer, G., Beever, J. E., Rothschild, M. F., Schook, L., Gibbs, R., & Weinstock, G. (2002). Porcine sequencing white paper: Porcine Genomic Sequencing Initiative. Retrieved December 19, 2022, from https://www.animalgenome.org/pig/community/WhitePaper/2002.html
Rothschild, M. F., & Plastow, G. S. (2002). Development of a genetic marker for litter size in the pig: A case study. In M. F. Rothschild & S. Newman (Eds.), Intellectual property rights in animal breeding and genetics (pp. 179–196). CABI Publishing.
Samorè, A. B., & Fontanesi, L. (2016). Genomic selection in pigs: State of the art and perspectives. Italian Journal of Animal Science, 15(2), 211–232.
SanCristobal, M., Chevalet, C., Haley, C. S., Joosten, R., Rattink, A. P., Harlizius, B., et al. (2006). Genetic diversity within and between European pig breeds using microsatellite markers. Animal Genetics, 37, 189–198.
Scannell, D. R., Zill, O. A., Rokas, A., Payen, C., Dunham, M. J., Eisen, M. B., et al. (2011). The awesome power of yeast evolutionary genetics: New genome sequences and strain resources for the Saccharomyces sensu stricto genus. G3: Genes|Genomes|Genetics, 1, 11–25.
Scherer, S. W., Cheung, J., MacDonald, J. R., Osborne, L. R., Nakabayashi, K., Herbrick, J. A., et al. (2003). Human chromosome 7: DNA sequence and biology. Science, 300(5620), 767–772.
Schook, L. B., Beever, J. E., Rogers, J., Humphray, S., Archibald, A., Chardon, P., et al. (2005). Swine Genome Sequencing Consortium (SGSC): A strategic roadmap for sequencing the pig genome. Comparative Functional Genomics, 6, 251–255.
Selsby, J. T., Ross, J. W., Nonneman, D., & Hollinger, K. (2015). Porcine models of muscular dystrophy. ILAR Journal, 56(1), 116–126.
Souciet, J.-L., for the Génolevures Consortium (GDR CNRS 2354). (2011). Ten years of the Génolevures Consortium: A brief history (Les dix ans du consortium Génolevures: un bref historique). Comptes Rendus Biologies, 334, 580–584.
Stenson, P. D., Mort, M., Ball, E. V., Chapman, M., Evans, K., Azevedo, L., et al. (2020). The Human Gene Mutation Database (HGMD®): Optimizing its use in a clinical diagnostic or research setting. Human Genetics, 139, 1197–1207.
Stevens, H. (2013). Life out of sequence: A data-driven history of bioinformatics. The University of Chicago Press.
Stevens, H. (2015). Networks: Representations and tools in postgenomics. In S. S. Richardson & H. Stevens (Eds.), Postgenomics: Perspectives on biology after the genome (pp. 103–125). Duke University Press.
Stevens, H., & Richardson, S. S. (2015). Beyond the genome. In S. S. Richardson & H. Stevens (Eds.), Postgenomics: Perspectives on biology after the genome (pp. 1–8). Duke University Press.
Szymanski, E., Vermeulen, N., & Wong, M. (2019). Yeast: One cell, one reference sequence, many genomes? New Genetics and Society, 38(4), 430–450.
Talenti, A., Powell, J., Hemmink, J. D., Cook, E. A. J., Wragg, D., Jayaraman, S., et al. (2022). A cattle graph genome incorporating global breed diversity. Nature Communications, 13, 910.
Tempini, N. (2020). The reuse of digital computer data: Transformation, recombination and generation of data mixes in big data science. In S. Leonelli & N. Tempini (Eds.), Data journeys in the sciences (pp. 239–263). Springer Open. Retrieved December 19, 2022, from https://link.springer.com/book/10.1007/978-3-030-37177-7
The 1000 Genomes Project Consortium. (2010). A map of human genome variation from population-scale sequencing. Nature, 467, 1061–1073.
The FAANG Consortium, Andersson, L., Archibald, A. L., Bottema, C. D., Brauning, R., Burgess, S. C., et al. (2015). Coordinated international action to accelerate genome-to-phenome with FAANG, the Functional Annotation of Animal Genomes project. Genome Biology, 16, 57.
The Génolevures Consortium. (2009). Comparative genomics of protoploid Saccharomycetaceae. Genome Research, 19, 1696–1709.
The International HapMap Consortium. (2003). The International HapMap Project. Nature, 426, 789–796.
Thieffry, D., & Sarkar, S. (1999). Postgenomics? A conference at the Max Planck Institute for the History of Science in Berlin. Bioscience, 49(3), 223–227.
Tortereau, F., Servin, B., Frantz, L., Megens, H.-J., Milan, D., Rohrer, G., et al. (2012). A high density recombination map of the pig reveals a correlation between sex-specific recombination and GC content. BMC Genomics, 13, 586.
Tuggle, C. K., Giuffra, E., White, S. N., Clarke, L., Zhou, H., Ross, P. J., et al. (2016). GO-FAANG meeting: A Gathering On Functional Annotation of Animal Genomes. Animal Genetics, 47, 528–533.
Wach, A., Brachat, A., Pöhlmann, R., & Philippsen, P. (1994). New heterologous modules for classical or PCR-based gene disruptions in Saccharomyces cerevisiae. Yeast, 10(13), 1793–1808.
Warr, A., Affara, N., Aken, B., Beiki, H., Bickhart, D. M., Billis, K., et al. (2020). An improved pig reference genome sequence to enable pig genetics and genomics research. GigaScience, 9(6), giaa051.
Wiggans, G. R., Cole, J. B., Hubbard, S. M., & Sonstegard, T. S. (2017). Genomic selection in dairy cattle: The USDA experience. Annual Review of Animal Biosciences, 5, 309–327.
Winzeler, E. A., Shoemaker, D. D., Astromoff, A., Liang, H., Anderson, K., Andre, B., et al. (1999). Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science, 285, 901–906.
Yang, B., Cui, L., Perez-Enciso, M., Traspov, A., Crooijmans, R. P. M. A., Zinovieva, N., et al. (2017). Genome-wide SNP data unveils the globalization of domesticated pigs. Genetics Selection Evolution, 49, 71.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2023 The Author(s)
García-Sancho, M., Lowe, J. (2023). Improving and Going Beyond Reference Genomes. In: A History of Genomics across Species, Communities and Projects. Medicine and Biomedical Sciences in Modern History. Palgrave Macmillan, Cham. https://doi.org/10.1007/978-3-031-06130-1_7
DOI: https://doi.org/10.1007/978-3-031-06130-1_7
Publisher Name: Palgrave Macmillan, Cham
Print ISBN: 978-3-031-06129-5
Online ISBN: 978-3-031-06130-1