Introduction

Influenza viruses cause acute upper respiratory tract infections and have been responsible for several global pandemics. According to their antigenic characteristics, influenza viruses are divided into four types: A, B, C and D. The genes and antigens of influenza A virus are prone to mutation, which often leads to epidemics or pandemics, whereas influenza B virus shows relatively little genetic variation and usually causes local outbreaks1,2. In recent years, influenza B has attracted global attention: according to a report summarizing 47 studies, the hospital stay was longer for type B (6.7 d) than for type A (6.5 d), and the case fatality rate (CFR) of people aged 50 years could reach 2.5% (95% CI 0.7%–7.6%, P < 0.001)3.

Influenza viruses are characterized by two surface glycoproteins, hemagglutinin (HA) and neuraminidase (NA). Amino acid mutations in HA and NA are the basis of antigenic shift and drift4. Because the viral polymerase lacks proofreading activity, influenza virus genes mutate frequently without correction, resulting in 1–2% annual divergence among influenza strains5. Genetic evolution is reflected in genome structure mutations and nucleotide substitutions: structural mutations include insertions/deletions (indels) and inversions, while nucleotide (Nt) substitutions, i.e. single nucleotide variants (SNVs), comprise transitions (Ts) and transversions (Tv)6. Both transitions (purine ↔ purine or pyrimidine ↔ pyrimidine) and transversions (pyrimidine ↔ purine) in SNVs have been studied for their impact on the evolution of nucleotide composition in humans and animals7. A longstanding approach for analyzing amino acid (AA) and protein evolution is to compare the rates of nonsynonymous (dN) and synonymous (dS) substitutions at each site; these dN/dS ratios can be calculated by counting mutations or by using phylogenetic substitution models8.

Influenza epidemics arise from genetic and antigenic mutations, in which epitopes containing key amino acids play a crucial role in antigenicity9. For example, the D614G substitution in SARS-CoV-2, which replaces an acidic residue (aspartic acid) with glycine, enhances infectivity10. Furthermore, the conserved protective epitopes of hemagglutinin (HA) are essential to the design of a universal influenza vaccine and of new targeted therapeutic agents, especially small proteins and peptides11. Thus, the amino acids that constitute an antigenic domain might contribute differently to antigenicity depending on their molecular structure, hydrophilicity and charge.

Globally, three seasonal influenza strains (H3N2, H1N1 and B) were cocirculating across continents, but after non-pharmaceutical interventions (NPIs) were introduced in the spring of 2020, only the influenza B Victoria lineage (Bv) dominated from March 2020 to March 2021 in China12. NPIs have been implemented worldwide, including travel restrictions, face masks, social distancing, public education on prevention measures, and school closures13. Under the influence of NPIs, influenza outbreaks and epidemics in southern China became a model for investigating infectious disease transmission in an ideal closed-loop environment. Here, based on Bv epidemic information and selected gene sequences, we analyzed the data using multiple statistical approaches based on spatio-temporal connectivity to evaluate Ts/Tv substitutions, charged amino acid effects and outbreak-related variables, and to further explore the internal connections between genetic evolution and outbreaks.

Methods

Surveillance and gene sequence

This study was based on the National Influenza Surveillance Network (NISN, https://10.249.6.18:8881/cdc/)12. Influenza surveillance was performed according to the National Influenza Surveillance Program (2017 edition)14. An influenza-like illness (ILI) case was defined as a patient with a body temperature ≥ 38 °C accompanied by either cough or sore throat, without molecular confirmation; an influenza case was an ILI case that tested positive for influenza virus nucleic acid. An influenza outbreak was defined as the occurrence of 10 or more ILI cases in the same school, childcare institution, or other collective unit within a week.
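The case and outbreak definitions above can be expressed as a small check. The following is a minimal sketch, assuming that "within a week" means any sliding window of seven consecutive days (it could equally mean a calendar week); the function names and the example onset dates are hypothetical and not part of the surveillance program.

```python
from datetime import date, timedelta

def is_ili(temp_c, cough, sore_throat):
    """ILI as defined in the text: fever >= 38 C plus cough or sore throat."""
    return temp_c >= 38.0 and (cough or sore_throat)

def has_outbreak(onset_dates, threshold=10, window_days=7):
    """True if any sliding window of `window_days` days holds >= `threshold` ILI onsets.

    Assumption: the one-week criterion is read as a 7-day sliding window.
    """
    days = sorted(onset_dates)
    start = 0
    for end in range(len(days)):
        # Advance the left edge until the window spans fewer than `window_days` days.
        while days[end] - days[start] >= timedelta(days=window_days):
            start += 1
        if end - start + 1 >= threshold:
            return True
    return False

# Hypothetical example: 10 ILI onsets in one school spread over 5 days.
onsets = [date(2021, 11, 1) + timedelta(days=i % 5) for i in range(10)]
# Hypothetical counter-example: 10 onsets spaced two days apart.
sparse = [date(2021, 11, 1) + timedelta(days=2 * i) for i in range(10)]
```

Here `has_outbreak(onsets)` flags the unit, whereas the same number of cases spread over almost three weeks (`sparse`) does not meet the criterion.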

A total of 72 Guangdong (GD) strains were selected by spatio-temporal sampling from March 2019 to April 2022, comprising 16, 9, 34 and 13 strains in 2019, 2020, 2021 and 2022, respectively (no strains were isolated during April–December 2020). Global strains were downloaded from GenBank (https://www.ncbi.nlm.nih.gov/genbank/) or GISAID (https://platform.epicov.org/epi3/frontend), including four vaccine strains (VSs) recommended by the World Health Organization (WHO) and five characteristic regional strains. By region, the 72 strains (42 from outbreaks and 30 from hospital sentinels) came from Chaozhou (5), Dongguan (2), Foshan (1), Guangzhou (5), Heyuan (4), Huizhou (5), Jiangmen (3), Maoming (6), Meizhou (3), Qingyuan (8), Shantou (2), Shanwei (2), Shaoguan (10), Shenzhen (6), Yunfu (2), Zhanjiang (4), Zhongshan (2) and Zhuhai (2) (Supplementary Table S1a,b). A set of primers for Bv strains was designed and synthesized based on the Bv strains isolated during 2016–2018 (Supplementary Table S2). The 72 strains in this study were extracted, amplified, and sequenced15, and the gene fragments were then merged [GenBank accession PP989545-PP989616 (HA) and PQ037033-PQ037104 (NA)].

Genetic data processing

Nucleotide

The nucleotide (Nt) sequences of the open reading frames (ORFs) were aligned with Clustal W, and phylogenetic trees of the HA and NA genes (Supplementary Fig. S1) were constructed with the Neighbor-Joining (NJ) method in MEGA 11.0.13 software16. Nucleotide variation rates were calculated across stages and years against a single VS sequence as the benchmark, and the differences were compared using One-Way ANOVA. Grounded in the neutral theory of molecular evolution, the binary coalescent tree is the backward-in-time dual of the continuous forward-time diffusion model of genetic drift. In species phylogeny and epidemiology, the tree structure is often used to compare different models of evolution or to fit model parameters17. Analysis of Molecular Variance (AMOVA) is a widely used method that employs variance to study the hierarchical genetic structure of populations; here the nucleotide diversity (D) was calculated as the mean pairwise genetic distance17, and the statistic for genetic variability was D = [π − S/(0.5 − log(fmin))]/S. Pairwise dN/dS estimates were calculated for the coding regions19.
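To make the two distance-based quantities in this paragraph concrete (percent identity against a benchmark sequence, and nucleotide diversity as the mean pairwise distance), here is a minimal sketch. It assumes pre-aligned sequences of equal length and uses the simple p-distance with gap columns ignored; MEGA offers other distance models, so this is an illustration rather than the exact computation used in the study. The toy sequences are hypothetical.

```python
from itertools import combinations

def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences (gap columns skipped)."""
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)

def percent_identity(seq, ref):
    """Percent identity of `seq` against a benchmark sequence `ref`."""
    return 100.0 * (1.0 - p_distance(seq, ref))

def nucleotide_diversity(seqs):
    """Mean pairwise p-distance over all pairs of sequences."""
    dists = [p_distance(a, b) for a, b in combinations(seqs, 2)]
    return sum(dists) / len(dists)

# Hypothetical 10-nt aligned fragments; the first one plays the role of the benchmark.
toy = ["ATGAAGGCAA", "ATGAAGGCAG", "ATGAGGGCAA"]
```

With these toy fragments, the second sequence differs from the benchmark at one of ten sites, and the three pairwise distances (0.1, 0.1 and 0.2) average to the diversity estimate.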

Based on SNVs, the sequence data were grouped into 12 directed substitution types (4 × 3), pooled as transitions (A ↔ G, C ↔ T) or transversions (A ↔ C, A ↔ T, G ↔ C, G ↔ T). The substitution ratio was calculated as Ri = Mi/Ni, where R is the ratio, the subscript i denotes the specific nucleotide, and M and N are the Nt counts for that specific purine or pyrimidine20.
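The grouping into 12 directed substitutions and the ratio Ri = Mi/Ni can be sketched as follows. This is an illustration only: it assumes that Mi is the number of observed substitutions away from nucleotide i and Ni is the number of occurrences of nucleotide i in the reference, which is one reading of the definition above. The reference and query fragments are hypothetical.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify(ref_nt, alt_nt):
    """Return 'Ts' (purine<->purine or pyrimidine<->pyrimidine) or 'Tv' (purine<->pyrimidine)."""
    pair = {ref_nt, alt_nt}
    return "Ts" if pair <= PURINES or pair <= PYRIMIDINES else "Tv"

def substitution_counts(ref, seq):
    """Count directed substitutions such as 'G>A' in `seq` relative to aligned `ref`."""
    counts = Counter()
    for r, s in zip(ref.upper(), seq.upper()):
        if r != s and r in "ACGT" and s in "ACGT":  # skip gaps and ambiguity codes
            counts[f"{r}>{s}"] += 1
    return counts

def substitution_ratios(ref, seq):
    """R_i = M_i / N_i, with N_i assumed to be the count of nucleotide i in the reference."""
    totals = Counter(ref.upper())
    return {sub: m / totals[sub[0]] for sub, m in substitution_counts(ref, seq).items()}

# Hypothetical aligned fragments: one G>A transition and one C>A transversion.
ref = "ATGGCAAGTC"
seq = "ATGACAAGTA"
counts = substitution_counts(ref, seq)
ratios = substitution_ratios(ref, seq)
ts = sum(n for sub, n in counts.items() if classify(sub[0], sub[2]) == "Ts")
tv = sum(n for sub, n in counts.items() if classify(sub[0], sub[2]) == "Tv")
```

Summing the per-type counts by class gives the Ts and Tv totals from which a Tv/Ts ratio per strain can be derived.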

Strains were sampled from both outbreaks (Ob) and hospital sentinels (HS). Sampling dates were divided into Stage 1 (S1, 2019 to February 2020) and Stage 2 (S2, March 2020 to April 2022), a division derived from the subsequent statistical calculation (Cluster).

Amino acid

The amino acid (AA) sequences were also aligned with Clustal W. The AA mutations were analyzed and then classified into four groups, hydrophobic (H), polar (P), acidic (A) and basic (B) amino acids, according to their charge properties (Supplementary Table S3). The Shannon entropy value of each AA site was then calculated18. The Shannon entropy is formulated as follows, where L is the list of all possible amino acids in all the sequences and Pk(i) is the probability of finding the kth amino acid at position i19.

$$ H(i) = -\sum_{k \in L} P_k(i) \log_2 P_k(i) $$
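The entropy formula can be evaluated per alignment column as in the sketch below. It assumes that Pk(i) is estimated as the observed frequency of residue k in column i and that every symbol in the column, including a gap character, is counted as a state; how gaps were handled in the study is not stated. The toy alignment is hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(column):
    """H(i) = -sum_k P_k(i) * log2 P_k(i) for one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

def site_entropies(aligned_seqs):
    """Entropy of every column of a set of equal-length aligned sequences."""
    return [shannon_entropy(col) for col in zip(*aligned_seqs)]

# Hypothetical 3-residue alignment: site 1 conserved, sites 2 and 3 split evenly.
toy = ["MKN", "MKK", "MRN", "MRK"]
entropies = site_entropies(toy)
```

A fully conserved site gives H = 0, and a site split evenly between two residues gives H = 1 bit, so higher values mark more variable positions.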

Evolutionary and cluster analysis

The rate of molecular evolution was measured as the number of Nt sequence mutations per unit time. Cumulative mutations across the whole gene region can be used to estimate positive selection (or selective pressure) by calculating the substitution ratio dN/dS (ω)21. Nt and AA sequence data were analyzed using the χ2 test for categorical data and One-Way ANOVA for comparison of means. TwoStep and Hierarchical Cluster analyses were used to examine the relationships among the variables related to influenza genetics and outbreaks22. The cluster model was selected using Schwarz's Bayesian Criterion (BIC), BIC = −2Lm + k × ln n, where Lm is the maximized log-likelihood, k is the number of free parameters and n is the sample size; the model with the minimum BIC was retained.
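As an illustration of model selection by BIC, the sketch below uses the standard form of Schwarz's criterion, −2 × log-likelihood + k × ln n, and picks the candidate with the smallest value. The candidate log-likelihoods and parameter counts are invented for illustration and are not results from this study; SPSS computes these quantities internally for the TwoStep procedure.

```python
import math

def bic(log_likelihood, n_params, n_samples):
    """Schwarz's Bayesian Criterion: -2 * Lm + k * ln(n)."""
    return -2.0 * log_likelihood + n_params * math.log(n_samples)

# Hypothetical candidates: (number of clusters, maximized log-likelihood, free parameters).
candidates = [(2, -410.0, 8), (3, -395.0, 12), (4, -392.0, 16)]
n = 72  # number of GD strains in the study

scores = {k: bic(ll, p, n) for k, ll, p in candidates}
best_k = min(scores, key=scores.get)  # the model with the minimum BIC is retained
```

In this made-up example the four-cluster model fits slightly better than the three-cluster one, but the penalty term k × ln n outweighs the gain, so three clusters are selected.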

Statistical evaluation

The data were processed using WPS Spreadsheets and SPSS 23.0 (SPSS Inc., Chicago, IL.). Descriptive statistics are reported as mean ± SD (normally distributed data) or median (P25, P75) (skewed data)17.

Comparisons of means and nonparametric tests were considered statistically significant at a two-sided P value < 0.05, whereas evolutionary selection was assessed at a P value < 0.10. Correlations were considered significant at an R value > 0.50. The statistical logic and procedures of the other approaches are explained where needed.
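For readers less familiar with the comparison of means used throughout, the One-Way ANOVA F statistic can be sketched as below. This only computes F and its degrees of freedom; the P value compared against the 0.05 threshold additionally requires the F distribution, which statistical packages such as SPSS provide. The three groups are hypothetical values, not data from this study.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical measurements for three groups (e.g. three sampling years).
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova_f(groups)
```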

Results

Nucleotide homology and evolution

Most HA and NA ORFs were 1749 bp and 1401 bp long, respectively. Sequence alignment showed that, relative to the 1758-bp HA ORF of Bri/60/08 (2008 VS), Nt529-537 (aaaaacgac) were deleted in the HA genes of all 2019–2022 strains in this study except GD/1557/19 (1758 bp).

The homologies of the HA and NA genes of the GD strains were compared with those of the four VSs (Table 1). The results were as follows: (1) the highest HA gene identities occurred in 2020 (99.80 ± 0.07, against the 2019 VS) and in 2022 (99.23 ± 0.22, against the 2021 VS) (P < 0.001); (2) against the HA gene of the 2021 VS, the HA genes of the GD strains from 2021–2022 were classified into subset 2, which differed from the pattern seen with the other three VSs; (3) identity of the NA genes with the 2021 VS was higher in 2021 (98.57 ± 0.15) than in 2022 (98.43 ± 0.13) (P = 0.012), which indicated that both the HA and NA genes of the 2022 strains had evolved further away from the 2021 VS.

Table 1 Homologies of HA/NA genes (ORF) between GD strains and vaccine strains (VSs).

The trees of the HA and NA genes of the 72 influenza Bv strains isolated in Guangdong (Supplementary Fig. S1) had the following characteristics: (1) the HA genes from 2019 to 2020 were closer to that of the 2019 VS (Was/2/19), and those from 2021–2022 grouped with that of the 2021 VS (Aus/1359417/21); (2) the NA gene of the 2019 VS (Was/2/19) was closely related to those of the 2019–2020 GD strains, and more broadly to those of all 2019–2022 strains, whereas the NA genes of the other two VSs (Col/6/17 and Aus/1359417/21) were genetically distinct from those of the 2019–2022 GD strains. This suggested that the WHO-recommended vaccine strains were selected mainly on the homology of HA genes rather than NA genes.

Transition and transversion in SNV

The numbers and ratios of purine and pyrimidine mutations in each HA and NA gene were calculated against the reference Bri/60/08 (Table 2). The results showed that (1) Ts mutations were mostly more numerous than Tv mutations, with three types increasing (A → G↑, C → T↑ and G → A↑) and one decreasing (T → C↓) across the two stages in HA genes, and one increasing (G → A↑) in NA genes; (2) changes in Tv mutations were usually more significant than those in Ts mutations, with two types increasing (C → A↑ and T → A↑) in HA genes across the two stages, while in NA genes one increased and one decreased (A → T↑ and T → A↓); (3) Tv mutations in HA genes might have contributed to the influenza outbreaks after NPI (C → A↑↑ and T → A↑↑); (4) the nucleotide variation ratios of both HA and NA genes ranked in the following descending order: G → A, A → G, C → T, T → C, A → T, T → A, A → C, G → T, T → G and C → G (no G → C substitution was observed).

Table 2 Mutations of Ts/Tv on HA and NA genes during two stages.

Evolutionary selection

Based on the dN and dS substitutions in the codons, evolutionary selection in the HA and NA genes was analyzed with two approaches, FUBAR and MEME (Table 3). In HA genes, FUBAR identified positively selected sites 199, 214 and 563, whereas MEME identified sites 174, 214 and 563 (P < 0.10). Statistical support was assessed using the Bayes Factor and MEME LogL. In NA genes, the positively selected sites were 73 and 384 with FUBAR and only site 73 with MEME (P < 0.10) (Table 3, Supplementary Table S4). Positive selection suggested that these amino acid sites were under strong external pressure. FUBAR also identified many negatively selected sites (Fig. 1).
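The distinction between synonymous and nonsynonymous codon changes that underlies these analyses can be illustrated with a simple counter. This sketch is not FUBAR or MEME, which fit codon substitution models on a phylogeny; it only tallies raw codon differences between two aligned, in-frame ORFs using the standard genetic code and does not normalize by the number of synonymous and nonsynonymous sites, so the counts are not dN/dS estimates. The example sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def classify_codon_changes(ref_orf, seq_orf):
    """Count (synonymous, nonsynonymous) codon differences between two aligned ORFs."""
    syn = nonsyn = 0
    for i in range(0, min(len(ref_orf), len(seq_orf)) - 2, 3):
        a, b = ref_orf[i:i + 3].upper(), seq_orf[i:i + 3].upper()
        if a == b or a not in CODON_TABLE or b not in CODON_TABLE:
            continue  # identical, gapped or ambiguous codons are skipped
        if CODON_TABLE[a] == CODON_TABLE[b]:
            syn += 1      # same amino acid: synonymous change
        else:
            nonsyn += 1   # different amino acid: nonsynonymous change
    return syn, nonsyn

# Hypothetical fragments: AAA>AAG keeps lysine, GGT>GAA changes glycine to glutamate.
ref = "ATGAAAGGTCTG"
seq = "ATGAAGGAACTG"
syn, nonsyn = classify_codon_changes(ref, seq)
```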

Table 3 Evolutionary selection on genes of influenza viruses.
Fig. 1

Positive selection of HA and NA genes analyzed by FUBAR and MEME. Negatively selected sites identified by FUBAR are marked with small blue dots.

Evolution comparison

The two evolutionary indicators, the evolutionary rate (ER) and the Shannon entropy value (SV), were significantly correlated in HA genes (RPearson = 0.690) and in NA genes (RPearson = 0.711), which indicated that ERs and SVs were consistent within the same gene (Fig. 2). The difference between the two indicators may be that the ER reflects all nucleotide mutations (dS and dN), whereas the SV mainly reflects amino acid sites affected by dN mutations.
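The site-wise comparison of the two indicators amounts to a Pearson correlation between two equally long vectors, one value per site. A minimal sketch follows; the per-site ER and SV values are hypothetical and serve only to show the calculation and the R > 0.50 threshold used in this study.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical per-site evolutionary rates (ER) and Shannon entropy values (SV).
er = [0.2, 0.5, 1.8, 0.1, 2.4]
sv = [0.05, 0.10, 0.60, 0.00, 0.45]
r = pearson_r(er, sv)
correlated = r > 0.50  # threshold stated in the Methods
```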

Fig. 2

The evolutionary rates and entropy values of the sites on HA and NA [(a) ER; (b) SV].

Tajima's neutrality test indicated that evolutionary selection was present (Supplementary Table S5). Tajima's D test is based on neutrality: a positive value indicates an excess of common alleles, and a negative value an excess of rare alleles. The results were as follows: (1) SHA (195) and πHA (0.0109) were larger than SNA (176) and πNA (0.0095), respectively, whereas psNA (0.1257) and ΘNA (0.0253) were larger than psHA (0.1116) and ΘHA (0.0225), respectively, which suggested that dN was larger in HA than in NA and dS was larger in NA than in HA; (2) both HA and NA showed a higher rate of synonymous than nonsynonymous evolution, while DHA (−1.7668) was larger than DNA (−2.1355), because HA is mainly an antigenic domain and functional region.
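For orientation, Tajima's D can be computed from the number of sequences n, the number of segregating sites S and the mean number of pairwise differences, using the standard constants of Tajima (1989). The sketch below plugs in the HA values quoted above under two assumptions that are not stated in the text: n = 81 sequences (the 72 GD strains plus the nine global strains) and an alignment length of 1749 sites to convert π per site into pairwise differences.

```python
import math

def tajima_d(n, s, k_hat):
    """Tajima's D from sample size n, segregating sites s and mean pairwise differences k_hat."""
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Numerator: difference between the two estimators of theta (pairwise vs. Watterson).
    return (k_hat - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))

n_seq = 81      # assumption: 72 GD strains + 9 global strains
ha_len = 1749   # assumption: HA ORF length used as the number of sites
d_ha = tajima_d(n_seq, 195, 0.0109 * ha_len)
```

Under these assumptions the sketch yields a clearly negative D for HA, the same sign as the reported value, which corresponds to an excess of rare variants.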

Genetic factors related to outbreak

For all HA gene mutations and outbreak-influencing factors, TwoStep cluster analysis was used to explore the relationships among the variables described above, especially the association of influenza outbreak events with genetic mutations. The findings were as follows: (1) three clusters were formed mainly on the basis of sampling dates, Ts and Tv, with Cluster 1 completely separated from Clusters 2 and 3 and ending in February 2020; (2) the separation of Cluster 1 from Clusters 2 and 3 was driven by AA sites 148 (B → P), 165 (P → B), 199 (P → A), 212 (P → A) and 563 (B → P), while the separation of Cluster 2 from Clusters 1 and 3 was driven by site 256 (P ↔ H); (3) the separation of Cluster 3 from Clusters 1 and 2 was driven by AA sites 137 (B → P) and 142 (B → P), and Cluster 3 started in the second half of 2021. From this analysis, the role of genetic evolution and mutation (amino acid mutations with a charge/pH preference) in the development of the epidemic and outbreaks could be preliminarily identified (Supplementary Tables S4 and S6).

For all continuous and categorical variables of the HA genes and disease outbreaks, K-means cluster analysis was used to further analyze the relationships among the variables (Table 4). The One-Way ANOVA of the K-means clusters (cases labeled by strain name) showed the following (only significant results are shown): (1) there were two clusters, Cluster 1 in 2019–2020 and Cluster 2 in 2021–2022 (P < 0.001); (2) three variables (Tv, Ts and Tv/Ts) differed significantly (P < 0.001); (3) AA sites P137, P148, P199, P212, P214 and P563 differed very significantly (P < 0.001), as did P256 (P = 0.002), whereas P188 (P = 0.302) and Outbreak (P = 0.089) did not differ significantly between the two clusters. Overall, the factors that determined the 2021–2022 influenza outbreak included epidemic time, Tv, Ts, Tv/Ts, P137 (B → P), P148 (B → P), P199 (P → A), P212 (P → A), P214 (H → P) and P563 (B → P) (Table 4, Supplementary Table S6).

Table 4 Genetic variables related to the outbreaks analyzed by One-Way ANOVA.

Discussion

The B/Victoria lineage dates from the 1988–1989 season, when two distinct antigenic variants of influenza B virus co-circulated: the B/Victoria and B/Yamagata lineages (Bv/By), with the reference strains B/Victoria/2/87 and B/Yamagata/16/88, respectively. The evolutionary dynamics of influenza B virus are complex and have been characterized by nucleotide insertions and deletions (indels) in the hemagglutinin (HA) gene and extensive reassortment events within and between the Bv and By lineages23. In this study, only strain GD/1557/2019 carried the 529AAAAACGAC537 insertion in the HA gene, as in the vaccine strain B/Bri/60/08, which distinguished it from the other strains. Relative to the vaccine strain Aus/1359417/21, the HA genes of the Bv strains circulating in 2022 had the highest homology and differed from those in other years (99.23 ± 0.22, F2022/Others = 74.78, P < 0.001). Some of the influenza strains in the present study were isolated at the beginning of NPI (2020) and were in fact a continuation of the 2019 epidemic and outbreaks. Moreover, from the start of NPI (2020) to the end of April 2022, only influenza Bv outbreaks (no H1N1 and no H3N2) occurred in southern China.

This study analyzed nucleotides (molecular cluster, transition/transversion, evolutionary rate), amino acids (AA substitution, entropy, evolutionary selection, epitope), genes (HA/NA) and prevalence (epidemic/outbreak, different dates) of Bv outbreaks, as well as the relationships among them. Although SNVs occur at random, the results are significant for the direction of biological evolution. In the present study, the mutations were highly biased toward specific nucleotide substitutions: for example, the probability of a GC transversion was one in 200,000 (it occurred only once) since 2008 (reference strain), whereas the probability of an AG transition was 1–2 per thousand, far higher than that of the GC transversion. The evolutionary rates in this study ranked, in descending order, G → A, A → G, C → T, T → C, A → T, T → A, A → C, G → T, T → G and C → G, with the highest rate 10,000 times faster than the lowest. By comparison, a study of the first months of the SARS-CoV-2 pandemic found that the frequency of both G → U and C → U substitutions increased, which suggested that the substitution spectrum of SARS-CoV-2 was determined by an interplay of factors, including intrinsic biases of the replication process, avoidance of CpG dinucleotides and other constraints exerted by the new host24.

In this study, mutations in the epitope domains, including epitope A (120 loop, 137/142/144/199), B (150 loop, 165) and D (190 helix, 212/214), had high evolutionary rates, partially in agreement with previous research23. The epidemic and outbreaks in southern China resulted from mutations in HA genes, which evolved 1.59 times (2.15/1.36) faster than NA genes. More specifically, the outbreaks here were associated with mutations in HA gene epitopes A, B and D. By comparison, in a German study25 of the 2016–2020 epidemic, a total of 13 substitutions were fixed over time (numbering in HA1 of Bri/60/08), including five in the 120-loop (R116H, I117V, N121T, K129N/D, K136E), two in the domain surrounding the 120-loop (K48E, N75K), one in the 150-loop (V146I), two in the 160-loop (E164D, N165K) and one in the 190-helix (S197N).

Amino acids have been extensively studied as components of epitopes, and epitopes in infectious diseases are relevant to epidemics, treatment, vaccines and so on26. The ionizing properties of amino acids are associated with their charge capacity and, in turn, with pathogen adhesion and entry and with the molecular interaction between antigen and antibody, which involves the multiply charged ion signals of amino acids27. Within the epitope domains in this study, mutations at three polar amino acid sites (P137B/P199A/P212A) occurred between 2019–2020 and 2021–2022 (P < 0.01), which affected the antigenicity of the epitope regions.

The evaluation of selected sites based on dN and dS substitutions in the codons is subject to some error. Sites 214 and 563 in the HA genes and site 73 in the NA genes were positively selected according to both approaches (FUBAR/MEME; P < 0.10)28. This suggested that site 214 in the HA genes, an AA in epitope D (H214P), both triggered Bv outbreaks and was a positively selected site under strong external pressure during evolution.

Entropy is also commonly used to evaluate evolution29; here evolution was evaluated by both ER and SV, which were significantly correlated. The estimated rates of nucleotide substitution of HA and NA in the Bv lineage were 2.05 × 10^−3 and 2.01 × 10^−3 substitutions/site/year (s/s/y), respectively23, while RHA in this study was lower than RNA (RHA = 0.690/RNA = 0.711), which suggested that, compared with NA, the amino acid variations in HA were more active than the nucleotide variations. At the same time, both DHA (−1.7668) and DNA (−2.1355) in this study showed that HA genes (especially the five epitopes in the HA1 region) were prone to variation; in other words, NA genes were more likely to undergo synonymous than nonsynonymous evolution.

The key role of charged amino acids has been widely studied in infectious diseases30. The present setting was a good model of the evolution of infectious disease pathogens (NPI, Bv outbreaks only, less long-distance transmission). In this study, the transition from the first stage to the second stage of the Bv outbreak involved three polar AAs (N165K, P → B; G199E/K, P → A/B; N212E, P → A), which were substituted by basic, acidic/basic and acidic AAs, respectively; the second half of 2021 in the second stage (Cluster 3, Supplementary Table S6) involved two substitutions into polar AAs (H137Q/K/N, B → P; A142T, H → P), from a basic and a hydrophobic AA, respectively. This suggested that the charge/pH preference of amino acid mutations was closely related to (consistent with) the development trend of the outbreak. There are some similar reports, but with different research perspectives31. SNV adaptation is thus likely to have been associated with influenza virus diversification across external environments and to have promoted viral survival in extreme environments32. A genetic approach combined with potential epidemiological linkage enabled us to match data with previous reports on outbreaks or transmission chains, which may benefit public health actions33.
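The class transitions discussed in this paragraph can be derived mechanically from a mutation label once each amino acid is assigned to one of the four groups. The sketch below assumes a common textbook grouping (hydrophobic A, V, L, I, P, F, W, M; polar G, S, T, C, Y, N, Q; acidic D, E; basic K, R, H), which is consistent with the classes quoted in the text but may differ in detail from Supplementary Table S3; the helper names are hypothetical.

```python
# Assumed grouping: H = hydrophobic, P = polar, A = acidic, B = basic.
AA_CLASS = {}
for residues, label in (("AVLIPFWM", "H"), ("GSTCYNQ", "P"), ("DE", "A"), ("KRH", "B")):
    for aa in residues:
        AA_CLASS[aa] = label

def class_change(mutation):
    """Map a label such as 'N165K' to its class transition, e.g. 'P->B'."""
    src, dst = mutation[0], mutation[-1]
    return f"{AA_CLASS[src]}->{AA_CLASS[dst]}"

def is_charge_altering(mutation):
    """True if the substitution gains, loses or switches a charged (acidic/basic) class."""
    src, dst = AA_CLASS[mutation[0]], AA_CLASS[mutation[-1]]
    return src != dst and (src in "AB" or dst in "AB")

mutations = ["N165K", "G199E", "N212E", "H137Q", "A142T"]
changes = {m: class_change(m) for m in mutations}
```

Applied to the substitutions named above, this separates changes that add or remove a charged side chain (for example N165K or G199E) from those that do not (A142T).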

Conclusions

With the advent of COVID-19, the influenza epidemic affected by NPI showed a closed, time-limited pattern with only Bv outbreaks. The HA genes of Bv strains isolated in 2022 had evolved further than the vaccine strain isolated in 2021 (Bv/Aus/1359417/21). The G → A nucleotide transition had the highest ratio, but the C → A and T → A transversions made the most significant contribution to the outbreaks. Epitope domain mutations occurred in epitope A (AA 137/142/144/199), B (AA 165) and D (AA 212/214). Amino acid mutations with polar, acidic and basic features were key factors in the 2021 Bv epidemic, as these mutational features alter the molecular structure, charge properties and molecular affinity of the epitope region. Amino acid sites 174, 199, 214 and 563 in HA genes and sites 73 and 384 in NA genes were identified as positively selected sites, which were under evolutionary pressure. The prevalent factors related to the 2021–2022 influenza outbreak included epidemic timing, Tv, Ts, Tv/Ts, P137 (B → P), P148 (B → P), P199 (P → A), P212 (P → A), P214 (H → P) and P563 (B → P). The preference of amino acid mutations for charge/pH could influence the trend of the epidemic/outbreak. Further exploratory studies employing mathematical and bioinformatics approaches based on clinical, public health, vaccine research and genetic information may improve understanding of the deeper interaction mechanisms of infectious disease transmission.