Introduction

Although breeding for resistant lines is a cost-effective method for Striga control, the resistance or tolerance traits gained through mutagenesis often appear together with unfavorable traits such as decreased yields. To avoid this problem, it is necessary to cross resistant mutants with elite cultivars and to screen the subsequent progeny for plants that retain the Striga resistance without the unfavorable traits. If the gene carrying the causal mutation can be identified, an allele-specific marker can be developed to track the Striga resistance gained through mutagenesis. Development of molecular markers co-segregating with the resistance trait will dramatically accelerate selection during introgression of the gained Striga resistance into elite cultivars. Genome sequencing of bulked Striga resistant F2 progeny of the mutant line backcrossed with the wildtype will provide candidate loci that are actually responsible for the mutant phenotype, or information on closely linked markers (Abe et al. 2012). Whole genome sequences of many cereal species including sorghum, rice and maize are publicly available (Matsumoto et al. 2005; Paterson et al. 2009; Jiao et al. 2017), and these can be used as the mapping reference. Combined with information gained from characterizing the resistance mechanism with the screening protocols described in Chapters “An Agar-Based Method for Determining Mechanisms of Striga Resistance in Sorghum” and “Histological Analysis of Striga Infected Plants”, key details about the cereal/Striga association can be elucidated, leading to possible further gene targets for improving Striga resistance.

One caveat about this approach is that gamma rays can cause other types of mutations besides the single nucleotide changes that result in the SNPs detected through the bioinformatics methods described in this chapter. As noted in Chapter “Physical Mutagenesis in Cereal Crops”, gamma rays are known to cause larger structural variations from chromosome breaks and activation of transposable elements in addition to single nucleotide variations and indels (Nielen et al. 2018; Yang et al. 2019). These larger genome structural variations may be missed by the alignment of next generation sequence reads and the particular analysis tools described in this chapter.

Protocols

Plant materials. An advanced (M4 or later), verified Striga resistant mutant line should be backcrossed with the original (unmutagenized) parental wildtype line. The resulting F1 (BC1F1) should be Striga susceptible if the resistance gained through mutagenesis is inherited as a recessive allele. Confirm the phenotype of the F1 plants by co-culture with Striga using the appropriate protocols described in Chapters “An Agar-Based Method for Determining Mechanisms of Striga Resistance in Sorghum” or “Histological Analysis of Striga Infected Plants”. Self-pollinate the phenotyped F1 to obtain a large F2 population. Phenotype around 200 individual F2 plants with the protocol used to determine the Striga reaction of the original mutant line and the F1. Proper Mendelian segregation for a single recessive mutation (3 susceptible : 1 resistant) is key to successful SNP identification. Aim to collect tissue from 30–50 mutant-phenotype (Striga resistant) F2 plants to bulk for sequencing. As controls, also collect tissue from the wildtype parental line as a separate sample for DNA extraction, and from at least one (preferably more) Striga susceptible mutant line isolated independently from the same mutagenesis experiment as the Striga resistant mutant. These susceptible siblings serve as useful controls in addition to the parent plants and appear to greatly facilitate the identification of the SNPs responsible for the Striga resistance trait. The end goal is to determine the causal mutation resulting in the Striga resistance gained from mutagenesis, or at least a SNP that reliably co-segregates with the resistant phenotype. The protocols for the actual whole genome sequencing are not covered, as these vary with the platform used. It is assumed here that a next generation sequencing platform (e.g. Illumina) is used on bulked DNA for the subsequent analysis.
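Whether the F2 counts fit the 3:1 ratio expected for a single recessive mutation can be checked with a chi-square goodness-of-fit test. The following is a minimal sketch only; the function name and the example counts are hypothetical and are not taken from the protocol.

```python
import math


def chi_square_3_to_1(n_susceptible, n_resistant):
    """Test F2 counts against a 3:1 (susceptible:resistant) expectation."""
    total = n_susceptible + n_resistant
    expected_s = total * 0.75
    expected_r = total * 0.25
    chi2 = ((n_susceptible - expected_s) ** 2 / expected_s
            + (n_resistant - expected_r) ** 2 / expected_r)
    # With one degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(chi2 / 2)), so no external library is needed.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts from 200 phenotyped F2 plants.
chi2, p = chi_square_3_to_1(n_susceptible=152, n_resistant=48)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")
if p < 0.05:
    print("Counts deviate from 3:1; single recessive inheritance is doubtful.")
else:
    print("Counts are consistent with a single recessive mutation.")
```

A significant deviation suggests that the resistance is not controlled by a single recessive locus, in which case the bulk sequencing approach is less likely to point to one causal SNP.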

High quality DNA extraction using CTAB. Good whole genome sequence starts with high quality DNA. The reagents and equipment needed for DNA extraction include:

  • 2× CTAB buffer

    • 2% CTAB (Hexadecyltrimethylammonium bromide, H6269, Sigma)

    • 100 mM Tris-Cl (pH8.0)

    • 20 mM EDTA (Ethylenediamine-N,N,N′,N′-tetraacetic acid disodium salt dihydrate, 345-01865, Wako-chemical; prepare a 0.5 M stock at pH 8.0)

    • 1.4 M NaCl

    • 2% PVP (polyvinylpyrrolidone mw 360,000)

    • 0.1% beta-mercaptoethanol (Prepare on the day of experiment.)

  • 10% CTAB

    • 10% CTAB

    • 0.7 M NaCl

  • Chloroform

  • Isopropanol

  • TE

    • 10 mM Tris-Cl (pH8.0)

    • 1 mM EDTA

  • Qiagen Genomic-tip 100/G (cat No. 10243)

  • Qiagen Genomic DNA Buffer Set (cat. No. 19060) (or QBT, QC and QF buffer bottles)

Add the appropriate amount of beta-mercaptoethanol to the 2× CTAB buffer before starting the experiment. Preheat the 2× CTAB buffer to 70 °C. Prepare a shaking incubator at 60 °C and a hot stirrer for mixing samples.

Harvest a few young leaves from each F2 individual exhibiting the mutant (Striga resistant) phenotype and pool them to make 1–5 g samples. Grind the tissue under liquid nitrogen. Add three volumes (v/w) of 2× CTAB buffer to the sample and immediately mix by stirring with a magnetic stirrer on a hot stirring plate. Do not allow the frozen samples to thaw before the buffer is added. Transfer the samples to a 50 mL Falcon tube. Shake gently at 60 °C for 40 min. Add 1× volume (v/v) of chloroform. Gently mix the sample with the chloroform on a rotator at room temperature (RT) for 10 min. Centrifuge at 3500 rpm for 20 min at RT. Transfer the aqueous phase to a new tube. Add 0.1× volume (v/v) of 10% CTAB. Add an equal volume of chloroform. Gently rotate the tube for 10 min. Centrifuge at 3500 rpm for 20 min. Transfer the aqueous phase to a new tube using a glass Pasteur pipet. Add an equal volume of isopropanol. Mix very gently. Centrifuge at 10,000 rpm for 30 min. Discard the supernatant and keep the pellet. Wash the pellet with 2 mL of 70% ethanol. Centrifuge at 10,000 rpm for 10 min. Discard the supernatant and keep the pellet. Dissolve the pellet in 500 µL TE. Measure the concentration with a spectrophotometer (e.g. Nanodrop) and a Qubit fluorometer (ThermoFisher).

Add RNaseA (Qiagen) to a final concentration of 0.1 mg/mL. Incubate at 37 °C on a heating block. Add Proteinase K (Qiagen) to a final concentration of 1 mg/mL and incubate at 56 °C for 30 min. Equilibrate the Qiagen Genomic-tip 100/G with 4 mL QBT buffer. Add 10× volume of QBT buffer to the sample (5 mL to a 500 µL sample) and mix thoroughly by vortexing for 20 s at maximum speed. Centrifuge the sample at 10,000 × g for 10 min. Load the supernatant onto the Qiagen Genomic-tip 100/G. Wash the column twice with 7.5 mL QC. Set the DNA collection tube. Elute the DNA with 5 mL QF and precipitate the DNA by adding 3.5 mL isopropanol. Centrifuge at 10,000 × g for 30 min at 4 °C. Discard the supernatant, keep the pellet, and rinse it with 2 mL of 70% ethanol. Centrifuge at 10,000 × g for 10 min, discard the supernatant and air dry the pellet.

Dissolve the pellet in TE buffer to a DNA concentration of 100–1000 ng/µL. Check the DNA quantity and quality with a spectrophotometer (e.g. Nanodrop) and a Qubit fluorometer (ThermoFisher). The Abs260/280 ratio needs to be 1.8–2.0. The concentrations measured by the spectrophotometer and the Qubit should be similar. Finally, check the DNA quality by electrophoresis in a 0.5% agarose gel.
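The two numerical quality checks above can be recorded consistently across samples with a small helper. This is an illustrative sketch only: the function name is hypothetical, and the 1.5-fold tolerance used to decide that the two concentration readings are "similar" is an assumption, not a value given in the protocol.

```python
def dna_passes_qc(a260_280, nanodrop_conc, qubit_conc, max_fold_difference=1.5):
    """Return True if a DNA sample meets the purity and agreement checks.

    nanodrop_conc and qubit_conc must be in the same units.
    max_fold_difference is an assumed tolerance, not a protocol value.
    """
    ratio_ok = 1.8 <= a260_280 <= 2.0
    high = max(nanodrop_conc, qubit_conc)
    low = min(nanodrop_conc, qubit_conc)
    conc_ok = low > 0 and high / low <= max_fold_difference
    return ratio_ok and conc_ok


# Hypothetical readings for three samples: (A260/280, Nanodrop, Qubit).
samples = {
    "F2_bulk": (1.85, 420.0, 390.0),
    "wild_type": (1.60, 510.0, 480.0),   # low purity ratio
    "sibling": (1.90, 900.0, 300.0),     # readings disagree
}
for name, readings in samples.items():
    print(name, "pass" if dna_passes_qc(*readings) else "re-purify")
```

A large gap between the spectrophotometer and Qubit values usually indicates RNA or other contaminants that absorb at 260 nm, so such samples are worth re-purifying before sequencing.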

SNP identification using CLC Genomics Workbench. The genomic DNA of the bulked F2 plants with the mutant phenotype is isolated and sequenced on an Illumina or equivalent sequencer in paired-end mode with a read length of 150 bp. The total number of reads should correspond to 30–50× coverage of the genome. The resulting files are in fastq format. To remove non-specific SNPs, genomic DNA from either the wild type parent or an unrelated mutant should be sequenced under the same conditions as the targeted mutant F2 bulk. Although optional, sequencing genomic DNA from the susceptible siblings of each mutant is also recommended. As a reference, the most recent annotated genome sequence should be downloaded from a relevant website. The sequence file is usually in fasta format, and the annotation file in gff or gtf format. The following is an example using CLC Genomics Workbench software (ver. 12.0.3):

Import the reference genome fasta file using “Import” → “Fasta High-Throughput Sequencing Import”. Import the GFF annotation track using “Import” → “Tracks”. On the next page, choose the correct file type and select the annotation file (gff or gtf).

Import the F2 bulk genome sequence data using “Import” → “Illumina High-Throughput Sequencing Import”. Check “paired reads” in General options and “Paired-end (forward-reverse)” in Paired-end information. Check “Remove failed reads” in Illumina options. Choose the correct quality score version (Fig. 1). Import the non-mutated wild type or unrelated mutant sequence data in the same way.

Fig. 1
A screenshot. The highlighted items on the window are the import icon on the menu bar, checkboxes of paired reads and remove failed reads, that are checked, and version filled on the quality scores field.

Import function for the Illumina sequence reads in CLC Genomics Workbench. The quality score setting can differ depending on the sequencing protocol

For quality trimming of the sequence reads, first check the raw data quality using “Toolbox” → “Prepare Sequencing Reads” → “QC for Sequencing Reads”. Trim low-quality reads using the “Toolbox” → “Trim Reads” function with a quality score threshold of 0.05 and a minimum read length of 50, trimming 5 nucleotides from the 5′ end. Recheck the quality after trimming.

To map the quality-trimmed reads, choose “Toolbox” → “Resequencing Analysis” → “Map Reads to Reference” (Fig. 2). Choose the imported Illumina data and the reference sequence for mapping. On the next page, set the mapping parameters. In most cases, Match score = 1, Mismatch cost = 2, Insertion cost = 3, Deletion cost = 3, Length fraction = 0.6 and Similarity fraction = 0.9 can be applied (Fig. 2). Run the same analysis for the parental wild type (or unrelated mutant and susceptible sibling) sequences.

Fig. 2
2 screenshots. 1, from the toolbox option on the menu bar, the nested dropdown options selected are resequencing analysis and map reads to reference. 2, the mapping options tab presents 6 enabled fields with values, 2 selected radio buttons, and a checkbox of auto-detect paired distances checked.

Mapping of the sequence reads on the imported reference genome

Variant calling. To identify non-specific SNPs, perform variant calling with a low stringency filter on the control samples, including the wild type parent, unrelated mutants and/or susceptible siblings. Choose “Toolbox” → “Resequencing Analysis” → “Variant Detection” → “Basic Variant Detection” and set Minimum Coverage, Minimum Counts and Minimum Frequency. We use Minimum Coverage 2, Minimum Counts 2 and Minimum Frequency 5%, but these thresholds should be adjusted according to your samples (Fig. 3). The low stringency SNPs called in this step are used as background (noise) SNPs for the variant filtering described below.

Fig. 3
2 screenshots. 1, from the toolbox option on the menu bar, the nested dropdown options selected are resequencing analysis and variance detection. 2, the general filters tab in the variant detection window has 5 enabled fields with values and a checkbox of ignore broken pairs, checked.

Example of variant calling with a low stringency filter

To identify high stringency SNPs, perform variant calling with a high stringency filter. Choose “Toolbox” → “Resequencing Analysis” → “Variant Detection” → “Fixed Ploidy Variant Detection”. Choose the F2 bulk sequence mapping results and set the filtering parameters. We use Minimum Coverage 6, Minimum Counts 6 and Minimum Frequency 90%, but these settings should be adjusted according to the samples (Fig. 4).

Fig. 4
2 screenshots of fixed ploidy variant detection window. 1, under fixed ploidy variant parameters, 2 fields with values are indicated. 2, the general filters tab has 5 enabled fields with values and a checkbox of ignore broken pairs, checked.

Example of variant calling with a high stringency filter

Variant filtering. Perform variant filtering to identify the SNPs specific to the targeted F2 mutant bulk. Choose “Toolbox” → “Resequencing Analysis” → “Variant Filtering” → “Filter against Known Variants”. Choose the high stringency variant detection track as input, and choose the low stringency variant tracks as the Known Variant Track. Select the filter option “Keep variant no exact match found in tracks of known variants” (Fig. 5).

Fig. 5
A screenshot of a cropped window. From the toolbox option on the menu bar, the following nested dropdown options are selected. Resequencing analysis, variant filtering, and filter against known variants.

Variant filtering for subtraction of non-specific variants to identify unique SNPs
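Conceptually, this filtering step is a set subtraction: every high stringency variant from the mutant bulk that also appears in any low stringency control track is discarded. The sketch below illustrates that logic on plain VCF-style text lines; it is not how CLC Genomics Workbench implements the filter, and the function names and example variants are hypothetical.

```python
def load_variant_keys(vcf_lines):
    """Collect (chromosome, position, ref, alt) keys from VCF-style lines."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and empty lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys


def subtract_background(candidate_keys, *background_sets):
    """Keep only candidates with no exact match in any background set."""
    background = set().union(*background_sets)
    return candidate_keys - background


# Hypothetical variants: high stringency bulk calls and low stringency controls.
mutant_bulk = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "Chr01\t1500\t.\tG\tA\t60\tPASS\t.",
    "Chr01\t9200\t.\tC\tT\t55\tPASS\t.",
    "Chr02\t310\t.\tT\tC\t58\tPASS\t.",
]
wild_type = ["Chr01\t9200\t.\tC\tT\t12\tPASS\t."]
susceptible_sibling = [
    "Chr02\t310\t.\tT\tC\t9\tPASS\t.",
    "Chr02\t311\t.\tA\tG\t9\tPASS\t.",
]

unique = subtract_background(
    load_variant_keys(mutant_bulk),
    load_variant_keys(wild_type),
    load_variant_keys(susceptible_sibling),
)
for key in sorted(unique):
    print(key)
```

Calling the controls at low stringency makes the background set as large as possible, so that even weakly supported shared variants are removed from the candidate list.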

Annotation of the unique SNPs. Annotate the identified unique SNPs against the gene annotation of the genome. Select “Toolbox” → “Resequencing Analysis” → “Functional consequences” → “Amino Acid Changes”. Select the CDS and mRNA tracks imported together with the reference genome annotation. At the “Flanking” option, include upstream and downstream flanking positions of 5000 and 2000, respectively, to capture promoter or 3′ UTR mutations. This produces the annotation table. Manually check the sequence alignments around the detected non-synonymous SNPs. Misalignment sometimes produces false SNP calls at incorrect positions; carefully remove such false-positive SNPs.

Example of application and validation of the utility of the protocol. An example of mutated gene identification using the method described above is available in Cui et al. (2020), using the parasitic plant Phtheirospermum japonicum. Sequencing at approximately 40× coverage of the F2 plants as well as the parental mutant lines identified approximately 450,000 SNPs in each. Filtering out the non-specific SNPs that were found with the low stringency filter in unrelated mutant lines reduced the number of SNPs to fewer than 2000 in the F2 progeny. Selection of the non-synonymous mutations among the unique SNPs identified two candidate genes for the causal mutation.
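The coverage targets used in this section (30–50×, or about 40× in the example above) translate into a number of read pairs to request from the sequencing provider. The sketch below shows the arithmetic for 150 bp paired-end reads. The function name is hypothetical, and the genome size of about 730 Mb for sorghum is an approximate figure assumed here for illustration; substitute the size of your own reference.

```python
import math


def read_pairs_needed(genome_size_bp, coverage, read_length_bp=150):
    """Number of read pairs giving the requested average coverage.

    coverage = (read pairs x 2 x read length) / genome size
    """
    bases_needed = genome_size_bp * coverage
    return math.ceil(bases_needed / (2 * read_length_bp))


SORGHUM_GENOME_BP = 730_000_000  # approximate size, an assumption for this example

for coverage in (30, 40, 50):
    pairs = read_pairs_needed(SORGHUM_GENOME_BP, coverage)
    print(f"{coverage}x coverage: about {pairs / 1e6:.0f} million read pairs")
```

Because trimming and the removal of non-uniquely mapped reads reduce the usable data, it is prudent to aim toward the upper end of the range.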

SNP detection using open-source scripts. As an alternative approach, SNP detection can be performed with open-source software. This section explains the procedure, mainly on a Linux computer.

The quality of the sequence reads can be visualized with FastQC, which is available at https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ under the GPL v3 license. The program runs on Linux, Windows and Mac. Executables can be downloaded directly from the above site. FastQC is a Java application and requires a suitable Java Runtime Environment (JRE) to be installed (Fig. 6).

Fig. 6
2 screenshots present the output of the fast q c tool. The top presents a decreasing line trend with minor fluctuation and 11 vertical lines on the top right. In the bottom output, 4 lines follow a fluctuating trend and then remain constant.

Examples of fastqc output. Quality scores across all bases (top) and sequence contents across all bases (bottom)

Trimmomatic (Bolger et al. 2014) is useful for removing adaptor sequences and for quality trimming with flexible settings. It is also written in Java and works on Linux, Windows and Mac. The binary files and manual can be downloaded from http://www.usadellab.org/cms/?page=trimmomatic. An example command for Trimmomatic quality filtering follows; the adaptor sequence file and the parameter settings should be modified according to the quality requirements.

>java -jar trimmomatic-0.39.jar PE -threads 4 -phred33 -trimlog trim.log -basein YOUR_INPUT_1.fq.gz -baseout OUTPUT_name ILLUMINACLIP:TruSeq2-PE.fa:2:30:10:2:keepBothReads HEADCROP:15 LEADING:30 TRAILING:30 SLIDINGWINDOW:5:20 MINLEN:50 > test.log 2>&1 &

This command outputs 4 files with the prefix baseout followed by _1P, _2P for paired sequences and _1U and _2U for unpaired sequences. Standard out contains percentage of surviving sequences.
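To give an intuition for the SLIDINGWINDOW:5:20 and MINLEN:50 settings in the command above, the following sketch decodes a Phred+33 quality string and cuts the read at the first 5-base window whose mean quality drops below 20. It is a deliberately simplified illustration, not Trimmomatic's actual implementation (which differs in how it treats the bases at the cut point), and the function names and the example quality string are hypothetical.

```python
def phred33_scores(quality_string):
    """Convert a FASTQ quality string (Phred+33 encoding) to integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def sliding_window_keep_length(scores, window=5, min_mean=20):
    """Return how many leading bases survive a simplified sliding-window trim.

    The read is cut at the start of the first window whose mean quality
    falls below min_mean; reads shorter than the window are left untouched.
    """
    for start in range(0, len(scores) - window + 1):
        window_scores = scores[start:start + window]
        if sum(window_scores) / window < min_mean:
            return start
    return len(scores)


# Hypothetical read: 10 high-quality bases ("I" = Q40) then 5 poor ones ("#" = Q2).
quality = "IIIIIIIIII#####"
kept = sliding_window_keep_length(phred33_scores(quality))
print("bases kept:", kept)
print("survives MINLEN:50:", kept >= 50)
```

Reads that fall below the minimum length after trimming are discarded, which is why the percentage of surviving sequences reported by Trimmomatic is worth checking before mapping.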

Read mapping using bowtie2. A number of mapping programs are available, with differing performance. Here we explain the mapping process using bowtie2 (Langmead and Salzberg 2012).

Bowtie2 is available at http://bowtie-bio.sourceforge.net/bowtie2/index.shtml.

Download and install according to the instructions. Make an index from your reference genome (fasta format) with the following command.

>bowtie2-build reference_genome.fasta INDEX_name

Run mapping with bowtie2.

>bowtie2 -x INDEX_name -1 forward.fastq -2 reverse.fastq -p number_of_cpu -S OUTPUT.sam >bowtie.log 2>&1 &

Extract uniquely mapped reads from sam file.

>grep -v "XS:" OUTPUT.sam > OUTPUT_unique.sam
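The grep command above relies on bowtie2 adding an XS:i tag only to reads for which it found a second alignment. The sketch below applies the same rule but inspects only the optional tag columns of each record and always keeps header lines, which guards against an accidental match elsewhere in the line. The function names and example records are hypothetical.

```python
def keep_unique_alignments(sam_lines):
    """Yield header lines and alignments that carry no XS:i tag."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # header lines are always kept
            continue
        # Columns 1-11 are mandatory; optional tags start at column 12.
        optional_fields = line.rstrip("\n").split("\t")[11:]
        if not any(field.startswith("XS:") for field in optional_fields):
            yield line


def sam_record(name, *tags):
    """Build a minimal, hypothetical SAM record for demonstration."""
    mandatory = [name, "0", "Chr01", "100", "42", "10M", "*", "0", "0",
                 "ACGTACGTAC", "IIIIIIIIII"]
    return "\t".join(mandatory + list(tags)) + "\n"


example = [
    "@SQ\tSN:Chr01\tLN:1000\n",
    sam_record("read_unique", "AS:i:0", "XN:i:0"),
    sam_record("read_multi", "AS:i:0", "XS:i:-5"),
]
kept = list(keep_unique_alignments(example))
print(len(kept), "lines kept")
```

For a real file, the same function can be applied to an open file handle of OUTPUT.sam and the yielded lines written to OUTPUT_unique.sam.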

Sort and convert the sam file to binary bam file using samtools (available from http://samtools.sourceforge.net).

>samtools sort -@ thread_number -O bam -o OUTPUT_unique.bam OUTPUT_unique.sam

Index the bam file.

>samtools index OUTPUT_unique.bam

Calculate coverage of the mapping.

>samtools depth -aa OUTPUT_unique.bam >OUTPUT_unique.depth

This produces a table with the depth at each nucleotide position. Calculate the average coverage over the covered positions (depth greater than 0) from the depth file using awk.

>awk '$3>0 {sum+=$3; n++} END {if (n) print sum/n}' OUTPUT_unique.depth
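The same calculation can be cross-checked in Python. This sketch assumes the three-column output of samtools depth (chromosome, position, depth) and, like the awk command, averages only over positions with a depth greater than 0; the function name and example rows are hypothetical.

```python
def mean_covered_depth(depth_lines):
    """Mean depth over positions with depth > 0 in samtools depth output."""
    total = 0
    covered = 0
    for line in depth_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip empty or malformed lines
        depth = int(fields[2])
        if depth > 0:
            total += depth
            covered += 1
    return total / covered if covered else 0.0


# Hypothetical rows in the format produced by samtools depth.
example = [
    "Chr01\t1\t0\n",
    "Chr01\t2\t10\n",
    "Chr01\t3\t20\n",
]
print(mean_covered_depth(example))

# For a real file, pass an open file handle instead of a list, e.g.
# with open("OUTPUT_unique.depth") as handle:
#     print(mean_covered_depth(handle))
```

Reading the file line by line keeps memory use low, which matters because the depth table contains one row per nucleotide of the genome.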

Alternatively, the average coverage can be calculated using Excel or R. Plot the coverage using R to check for large deletions in the sequence. Because the data set is too large to handle at once, split the depth file by chromosome. In the following command, replace 1 with the chromosome name used in the reference (quoted if it is not a number, e.g. "Chr01").

>awk '$1 == 1 {print $0}' OUTPUT_unique.depth > Chr1.depth

Start R program. Use library reshape and ggplot2.

>R

In R console,

>library(reshape)

>library(ggplot2)

>OUTPUT.Chr1<-read.table("Chr1.depth",header=FALSE,sep="\t",na.strings="NA")

Rename the header.

>OUTPUT.Chr1<-rename(OUTPUT.Chr1,c(V1="Chr", V2="locus", V3="depth"))

If you want to visualize each genotype together in a single plot, add a genotype column.

>OUTPUT.Chr1<-transform(OUTPUT.Chr1, genotype="WT")

Change the genotype to the suitable setting (WT, M4, F2, etc.). Repeat the same for the different genotypes. Because an entire chromosome is too large to visualize in one plot, extract a subset of the positions. The subset can be one nucleotide in every 50 bp, or a certain region of the chromosome.

> Chr1_sub <- OUTPUT.Chr1[seq(1,nrow(OUTPUT.Chr1),by = 50),] #extract every 50th position

> nrow(Chr1_sub) #count the number of rows

Repeat the same procedure for different genotypes. Row bind each genotype result.

>Chr1_bind_sub<-rbind(Chr1_sub_WT,Chr1_sub_F2)

Plotting.

> g<-ggplot(Chr1_bind_sub,aes(x=locus,y=depth,colour=genotype))

> g<-g+geom_point()

> g<-g+facet_grid(genotype ~ .)

Set the maximum coverage to plot because some nucleotides have a huge coverage value that interfere with proper visualization of the plot. The following limits the maximum coverage up to ×50.

> g<-g+ylim(c(0,50))

Add title of the chart.

>g<-g+ggtitle("Chr01")

Adjust width and height of the chart by changing width and height option below (Fig. 7).

Fig. 7
A screenshot of a plot. The plot has 3 fluctuating trends in 3 different colors. The fluctuations are high at the beginning and towards the end of the plot.

Example of coverage plotting of Chr9 in Sorghum bicolor

> ggsave("Chr1.png",g,width=20, height=6,dpi=150)

Extract SNPs from the mapping data using bcftools (available from http://www.htslib.org/doc/bcftools.html). Call SNPs by bcftools.

> bcftools mpileup -Ou -f reference.fasta --max-depth 250 -a FORMAT/AD OUTPUT_unique_wt.bam | bcftools call -mv -Ob |bcftools view > OUTPUT_unique_wt.vcf

Do the same for other genotypes. Filter the SNPs using quality score and allele frequency. For low stringency filter apply allele frequency threshold 0.05.

> bcftools filter -i 'QUAL>20 && DP>2 && (DP4[2]+DP4[3])/(DP4[0]+DP4[1]+DP4[2]+DP4[3])>0.05' OUTPUT_unique_wt.vcf > OUTPUT_unique_wt_0.05.vcf

For the high stringency filter, apply an allele frequency threshold of 0.8.

> bcftools filter -i 'QUAL>20 && DP>10 && (DP4[2]+DP4[3])/(DP4[0]+DP4[1]+DP4[2]+DP4[3])>0.8' OUTPUT_unique_mt.vcf > OUTPUT_unique_mt_0.8.vcf
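The allele frequency term in these filter expressions is computed from the DP4 tag, which holds the counts of high-quality reference-forward, reference-reverse, alternate-forward and alternate-reverse bases. In a bulk of resistant F2 plants that are homozygous for a recessive causal mutation, SNPs at or near the causal locus are expected to approach a frequency of 1, whereas unlinked SNPs stay around 0.5. The sketch below reproduces only this frequency calculation (not the QUAL and DP conditions); the function name and the INFO strings are hypothetical.

```python
import re


def alt_allele_fraction(info_field):
    """Alternate allele fraction computed from the DP4 tag of a VCF INFO field."""
    match = re.search(r"(?:^|;)DP4=(\d+),(\d+),(\d+),(\d+)", info_field)
    if match is None:
        return None  # no DP4 tag present
    ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in match.groups())
    total = ref_fwd + ref_rev + alt_fwd + alt_rev
    return (alt_fwd + alt_rev) / total if total else None


LOW_STRINGENCY = 0.05   # threshold used for the control samples
HIGH_STRINGENCY = 0.8   # threshold used for the mutant F2 bulk

# Hypothetical INFO fields.
examples = {
    "site linked to the causal locus": "DP=30;DP4=1,0,14,15;MQ=42",
    "unlinked segregating site": "DP=20;DP4=6,5,5,4;MQ=40",
}
for label, info in examples.items():
    fraction = alt_allele_fraction(info)
    print(f"{label}: {fraction:.2f}, "
          f"passes low filter: {fraction > LOW_STRINGENCY}, "
          f"passes high filter: {fraction > HIGH_STRINGENCY}")
```

The low threshold is applied to the controls so that as many background variants as possible are captured, while the high threshold restricts the bulk to near-fixed variants.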

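Compress the vcf file. Do the same for the low stringency control file (OUTPUT_unique_wt_0.05.vcf).

> bgzip OUTPUT_unique_mt_0.8.vcf

Index the compressed vcf file. Again, do the same for the control file.

> bcftools index OUTPUT_unique_mt_0.8.vcf.gz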

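Subtract the SNPs that passed the low stringency filter in the control from those that passed the high stringency filter in the mutant bulk. The -C option keeps only the positions present in the first file and absent from the others.

>bcftools isec -C -p output OUTPUT_unique_mt_0.8.vcf.gz OUTPUT_unique_wt_0.05.vcf.gz

The output file 0000.vcf is generated in the folder specified by -p. Check the number of SNPs that passed, excluding the header lines.

> grep -v "^#" 0000.vcf | grep -c "PASS"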

Depending on the number of SNPs extracted, adjust the filter settings for quality and allele frequency. Annotate the unique SNPs using SnpEff, which is available from http://snpeff.sourceforge.net. Install SnpEff according to the instructions. Download the database for the reference genome from the SnpEff database collection; first search the collection for the reference genome. The following is an example for sorghum.

> java -jar snpEff.jar databases| grep -i "sorghum"

Download the reference from the indicated database.

>wget http://downloads.sourceforge.net/project/snpeff/databases/v4_3/snpEff_v4_3_ENSEMBL_BFMPP_32_375.zip

If the reference genome is not available in the SnpEff database, build a reference database. Please note that the annotation for the same species can sometimes differ between Ensembl and Phytozome. The following example builds a sorghum_v3.1.1 database from the Phytozome sorghum genome. SnpEff expects the sequence and annotation files in the genome folder to be named sequences.fa and genes.gff.

>mkdir -p snpEff/data/sorghum_v3.1.1

>mv sorghum_reference.fasta snpEff/data/sorghum_v3.1.1/sequences.fa

>mv sorghum_reference.gff snpEff/data/sorghum_v3.1.1/genes.gff

Edit snpEff.config to register the new genome under the same name as the data folder (an entry of the form sorghum_v3.1.1.genome : Sorghum bicolor). Build the new database.

> java -jar snpEff.jar build -gff3 -v sorghum_v3.1.1

Check the database.

> java -jar snpEff.jar databases | grep -i sorghum

Annotate the SNP location.

> java -Xmx4g -jar snpEff.jar -c snpEff/snpEff.config sorghum_v3.1.1 0000.vcf > annotated.vcf

Effects of mutation on the protein sequence are labelled according to their predicted severity (e.g., “MODERATE” or “HIGH”). The mapping bam file and vcf files can be visualized by IGV (available from http://software.broadinstitute.org/software/igv/) (Fig. 8).

Fig. 8
A screenshot of the S N P visualization in the I G V window. The effects of mutations on the protein sequence are labeled using colored symbols for sorghum.

SNP visualization by IGV
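To shortlist candidates from the annotated vcf file, the SnpEff annotations can be screened for the predicted impact. SnpEff writes its results to the ANN tag of the INFO column as comma-separated annotations, each with pipe-separated subfields in which the second is the effect, the third the impact and the fourth the gene name. The sketch below extracts the HIGH and MODERATE entries; the function name, gene names and variant lines are hypothetical.

```python
def high_or_moderate_effects(vcf_line, impacts=("HIGH", "MODERATE")):
    """Return (chrom, pos, effect, impact, gene) tuples for selected impacts."""
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, info = fields[0], int(fields[1]), fields[7]
    hits = []
    for entry in info.split(";"):
        if not entry.startswith("ANN="):
            continue
        # One variant can carry several annotations, separated by commas.
        for annotation in entry[4:].split(","):
            parts = annotation.split("|")
            if len(parts) > 3 and parts[2] in impacts:
                hits.append((chrom, pos, parts[1], parts[2], parts[3]))
    return hits


# Hypothetical annotated variant lines.
missense = ("Chr01\t12345\t.\tG\tA\t60\tPASS\t"
            "DP=30;ANN=A|missense_variant|MODERATE|GeneX|GeneX|transcript|"
            "GeneX.1|protein_coding|2/5|c.100G>A|p.Gly34Ser|100/900|100/800|34/266||")
synonymous = ("Chr02\t777\t.\tC\tT\t55\tPASS\t"
              "DP=28;ANN=T|synonymous_variant|LOW|GeneY|GeneY|transcript|"
              "GeneY.1|protein_coding|1/3|c.30C>T|p.Ala10Ala|30/600|30/500|10/166||")

for line in (missense, synonymous):
    for hit in high_or_moderate_effects(line):
        print(hit)
```

Variants shortlisted in this way should still be inspected in IGV, because alignment errors can produce convincing but false high-impact calls.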

Conclusion

Next generation sequencing is a powerful tool to identify mutant causal genes or closely linked markers. Both commercial software and free open-source tools are available to perform the SNP detection. The identified SNPs can be converted into PCR markers such as CAPS or dCAPS to analyze mutant populations. SNP identification should be followed by functional validation. In the case of larger genome structural changes caused by physical mutagens, alternative sequencing and bioinformatics methods may be needed to identify the causal mutations underlying the gained Striga resistance.