Introduction

Like other countries, Japan has been severely affected by the coronavirus disease 2019 (COVID-19) pandemic. As of May 10, 2023, the Japanese National Epidemiological Surveillance of Infectious Diseases (NESID) reported 33,793,429 cases of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection, with a low number of related deaths at 74,6541. Various factors, including the progress of the vaccination program, have maintained low case and death numbers in Japan since the spring of 20232. On May 8, 2023, the Japanese government eased stringent COVID-19 rules and guidelines, including recommendations for mask-wearing, avoiding crowded places, setting quarantine periods, and hospitalizing positive individuals in designated medical facilities2,3, bringing the management of COVID-19 in line with that of seasonal influenza. Under the current measure, COVID-19 surveillance has transitioned from daily notifiable disease surveillance to weekly sentinel surveillance in some hospitals.

In Japan, population-based notifiable surveillance for COVID-19 is based on confirmatory diagnosis using real-time PCR testing of symptomatic cases at local public health sites and some clinical laboratories. Next-generation sequencing (NGS) is performed on some samples with sufficient viral load. The National Institute of Infectious Diseases compiles NGS results after quality control and matches them with the individual’s background in the NESID4. These results are made publicly available by uploading them to the GISAID EpiCoV database5,6,7. Such viral genome data offer useful information for public health interventions, including monitoring emerging variants, detecting outbreaks, and describing the background and risk factors of transmission clusters8.

To apply these data to public health measures, it is necessary to ensure that they reflect the viral genome distribution and individual characteristics of the infected population of interest9. During the daily notifiable surveillance period in Japan, passive COVID-19 testing was performed on symptomatic individuals at these sites. Therefore, individuals who were asymptomatic, had mild illness, or were tested late were less likely to be reported10,11, and few cases with low viral loads yielded sequence data. This suggests that delayed diagnosis and sampling bias are concerns for the viral genome sequences deposited in GISAID from Japan during this period. These concerns might have been particularly substantial during the so-called “seventh wave” from June to October 2022, when repeated mutations of the Omicron variant caused a massive epidemic in Japan, with a prevalence approximately 2.5 times higher than in previous epidemic waves2 and a major impact on public health measures12, causing some reporting failures in the surveillance system.

Biases associated with notifiable surveillance may have serious consequences for genome-based surveillance of SARS-CoV-2. Surveillance sites tend to prioritize cases of interest, such as outbreaks and/or cluster infections, rather than sequencing all collected cases. Limited capacity for NGS, including maintenance costs and technical difficulties, can result in regional disparities. These biases may lead to overestimation or underestimation of the representation of viral transmission lineages in phylogenetic tree analysis13,14. A potential solution to these challenges is testing-based sentinel screening of symptomatic and asymptomatic individuals. This approach involves active testing, including genome sequencing, for all positive cases. Some countries have established testing-based sentinel surveillance programs for rapid diagnosis and assessment of viral epidemics15,16,17, including COVID-1918,19. Commercial COVID-19 testing is also available in Japan. One diagnostic company, the SB Coronavirus Inspection Center Corp. (SBCVIC), continuously sequences viral genomes from specimens with informed consent from the individual. It provides COVID-19 testing for workplace screening upon company requests and voluntary testing as requested by local governments. Therefore, the majority of positive cases are from asymptomatic individuals20. In this study, we analyzed viral gene variants during the seventh wave, comparing data from GISAID, which were deposited through the NESID, with data from the SBCVIC to evaluate the utility of monitoring viral gene variants by combining PCR testing of asymptomatic individuals with viral genome sequencing.

Results

Genetic diversity of SARS-CoV-2 in the SBCVIC was similar to sequences in GISAID during the same period

According to CoVariants.org21, Nextstrain22 clade 21K, corresponding to BA.1 in the PANGO lineage nomenclature23, was initially prevalent in the sixth wave of COVID-19 infection in Japan in early 2022. Subsequently, 21L/BA.2 dominated in the late phase. In June 2022, the beginning of the study period, the sixth wave concluded (Fig. 1A). Starting from this period, BA.2 was gradually replaced by BA.5 (Fig. 1B). The seventh wave peaked in July–August 2022. After September, the prevalence of BA.5 decreased. Simultaneously, more evolved lineages were being detected (Fig. 1B). The genetic diversity of the full-length viral sequences from SARS-CoV-2 positive cases diagnosed at the SBCVIC remained constant throughout the study period. It resembled the diversity of sequences registered in GISAID from Japan between June and September, except for GISAID-registered sequences in October, which were approximately twice as diverse as the other monthly groups (Fig. 2). The lineages contributing to this genetic diversity initially included BA.2.3, BA.2.24, and other BA.2s in June. However, they were gradually replaced by BA.5.2, BF.5, and other BA.5s starting from July (Fig. 3A). This composition was similar between the SBCVIC and GISAID. However, in October, 46% of the viral sequences collected from GISAID belonged to BA.2.3.20 and BA.2.75, which were absent in the SBCVIC data.

Fig. 1

The waves of COVID-19 and the dynamic evolution of SARS-CoV-2 variants in Japan. (A) Trends in the number of COVID-19 cases reported in Japan from May 2020 to April 2023, based on the NESID reports. (B) Trends in the frequency of VOC and PANGO lineages in Japan from May 2020 to April 2023, based on CoVariants.org (https://covariants.org) data. The prevalent PANGO lineage for Omicron is shown, with each month of the study period represented by blue brackets between panels A and B. VOC, variants of concern; NESID, The National Epidemiological Surveillance of Infectious Diseases.

Fig. 2

Diverse genetic profiles of SARS-CoV-2 genome sequences collected using various routes across different months. The line graph depicts the average number of base substitutions per site, calculated by averaging over all sequence pairs within each sampling route per month group. Bars show the standard errors, which were estimated using the bootstrap method with 500 replications. Open and closed circles indicate diversity within the SB Coronavirus Inspection Center Corp. and GISAID samples from each month, respectively.
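The within-group diversity and its bootstrap standard error described in this caption can be illustrated with a minimal sketch. It makes two simplifying assumptions that are not taken from the study: it uses the uncorrected p-distance instead of the Tamura-Nei 93 model with gamma-distributed sites, and it resamples alignment columns with replacement for the bootstrap. The function names and the toy sequences are hypothetical.

```python
import random
from itertools import combinations


def p_distance(a, b, sites):
    """Proportion of differing positions among the given alignment columns.

    Columns where either sequence has an ambiguous base are skipped.
    """
    valid = 0
    diff = 0
    for i in sites:
        x, y = a[i], b[i]
        if x in "ACGT" and y in "ACGT":
            valid += 1
            diff += x != y
    return diff / valid if valid else 0.0


def mean_diversity(seqs, sites=None):
    """Average pairwise distance over all sequence pairs within one group."""
    if sites is None:
        sites = range(len(seqs[0]))
    pairs = list(combinations(seqs, 2))
    return sum(p_distance(a, b, sites) for a, b in pairs) / len(pairs)


def bootstrap_se(seqs, replicates=500, seed=0):
    """Standard error estimated by resampling alignment columns with replacement."""
    rng = random.Random(seed)
    n = len(seqs[0])
    estimates = []
    for _ in range(replicates):
        sites = [rng.randrange(n) for _ in range(n)]
        estimates.append(mean_diversity(seqs, sites))
    mean = sum(estimates) / replicates
    variance = sum((e - mean) ** 2 for e in estimates) / (replicates - 1)
    return variance ** 0.5


# Hypothetical toy alignment standing in for one route-by-month group.
toy_group = ["ACGTACGT", "ACGTACGA", "ACGAACGT"]
diversity = mean_diversity(toy_group)
standard_error = bootstrap_se(toy_group)
```

In practice the same two functions would be applied once per sampling route and month, giving one point and one error bar per group.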

Fig. 3

Viral lineage composition at the SBCVIC and GISAID entries during the seventh wave, and the concurrent COVID-19 prevalence in Japan. (A) A stacked bar graph illustrates the frequency distribution of viral lineages at the SBCVIC (left) and GISAID entries (right), whereas a line graph represents the number of cases reported in NESID during the corresponding period. (B) The numbers of cases for each viral lineage at the SBCVIC (left) and GISAID entries (right) are shown as stacked bar graphs, and a line graph indicates the number of cases reported in NESID during the same period. NESID, The National Epidemiological Surveillance of Infectious Diseases.

Number of Omicron variants diagnosed in the SBCVIC differed from that of GISAID and was consistent with the number of reported cases in Japan

Demographic information for all positive cases in the SBCVIC for which viral genome sequencing was attempted during the study period (June to October 2022) and which were recruited for our analysis is shown in Supplementary Table S1. Although the number of viral full-genome sequences collected in Japan and registered in GISAID was as high as 887 in June, the number of sequences from the SBCVIC was much lower. The number of sequences from the SBCVIC increased after July, peaked at 762 in August, and subsequently decreased to 69 in October (Fig. 3B). The change in the number of sequences at the SBCVIC was consistent with the number of COVID-19 cases reported in Japan during the same period (Fig. 3B). Conversely, the number of GISAID entries from Japan decreased after June, with only 98 entries in August, the peak of the seventh wave. BA.5 strains began to be observed in early July; however, the growth in the cumulative number of GISAID cases slowed thereafter. The dates on which SBCVIC could capture > 50% and > 90% of the total number of cases in November were 12 and 12 days earlier than in GISAID in BA.5.2, 17 and 10 days earlier in BA.5.2.1, and 15 and 36 days earlier in the other BA.5, respectively (Supplementary Fig. S1). Bayesian skyline plot analysis of the sequence data revealed that the relative population size trends inferred from SBCVIC-derived sequences were consistent with COVID-19 prevalence since the outbreak of Omicron variants in January. In contrast, the inference using GISAID-derived sequences showed only a single wave, corresponding to the sixth wave, and no seventh wave (Supplementary Fig. S2).
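The comparison of capture dates between routes can be sketched as follows. This is only an illustration of the idea of finding the date by which more than a given fraction of a lineage's cases had been collected; the function names and all dates are hypothetical and do not reproduce the study data.

```python
from datetime import date


def date_reaching(collection_dates, fraction):
    """Earliest date by which more than `fraction` of all cases were collected."""
    ordered = sorted(collection_dates)
    needed = fraction * len(ordered)
    for count, day in enumerate(ordered, start=1):
        if count > needed:
            return day
    return ordered[-1]


def lead_days(early_route, late_route, fraction):
    """Days by which the first route reaches the fraction before the second."""
    return (date_reaching(late_route, fraction)
            - date_reaching(early_route, fraction)).days


# Hypothetical collection dates for one lineage in two collection routes.
route_a = [date(2022, 7, 1), date(2022, 7, 3), date(2022, 7, 5), date(2022, 7, 7)]
route_b = [date(2022, 7, 10), date(2022, 7, 12), date(2022, 7, 20), date(2022, 7, 25)]

lead_50 = lead_days(route_a, route_b, 0.5)
lead_90 = lead_days(route_a, route_b, 0.9)
```

A positive value means that the first route captured the stated share of cases earlier than the second.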

Genetic differences among sequences from the SBCVIC and GISAID were greater between collection months than between collection routes

The gross genetic distance between groups of viral sequences collected by the SBCVIC and GISAID in their respective months was small, at most 0.096% substitutions per site (Supplementary Table S2). The maximum net genetic distance between the groups was 0.041% between SB-Sep and GISAID-Jun, and the ratio of this net value to the corresponding gross value (0.086%) was 47.5%, indicating that up to about 48% of the divergence was due to differences in months and collection routes. The net genetic distances between collection routes in the same month were often smaller than the differences between sampling months along the same route (Supplementary Table S2). A dendrogram illustrating the similarity between the groups showed that the groups were divided into two clusters (Fig. 4). One cluster consisted of populations primarily infected with BA.2s and included the June SBCVIC, June GISAID, and July GISAID. The second cluster was primarily composed of populations infected with BA.5s, including the majority of the seventh wave. The October GISAID group was at the top of this cluster, reflecting the high rate of BA.2.75 registrations during this period.
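The relationship between gross and net distances can be made concrete with a short sketch. It assumes the usual definition of the net between-group distance, in which the mean of the two within-group diversities is subtracted from the gross between-group mean. The input values are illustrative placeholders, not entries from Supplementary Table S2.

```python
def net_distance(gross_between, within_a, within_b):
    """Net between-group distance: gross mean minus the average within-group diversity."""
    return gross_between - (within_a + within_b) / 2


def net_to_gross_ratio(gross_between, within_a, within_b):
    """Share of the gross divergence that is not explained by within-group diversity."""
    return net_distance(gross_between, within_a, within_b) / gross_between


# Placeholder values in substitutions per site (hypothetical).
gross = 0.0008
within_group_a = 0.0004
within_group_b = 0.0006

net = net_distance(gross, within_group_a, within_group_b)
ratio = net_to_gross_ratio(gross, within_group_a, within_group_b)
```

Under this definition, a ratio close to zero means that two groups differ little beyond the variation already present within each of them.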

Fig. 4

Cluster analysis of mean genetic distance between groups based on collection route and month. The relationship between viral genome differences among the collection routes in each month is shown in Supplementary Table S2 and illustrated in a dendrogram using the Ward method.

Viral variants from GISAID and the SBCVIC were mixed in the transmission network for BA.2, but formed separate subclusters for BA.5

Transmission cluster and network analyses identified at least 60 components and 157 singleton cases of BA.2- (33/64 components/singletons) and BA.5-related (27/93) lineages detected during the study period in Japan. Among these components, 12 pairs were cases with BA.2 lineages, and 14 pairs were cases with BA.5 lineages (Supplementary Fig. S3A). The BA.2-related components were primarily composed of GISAID entries (Supplementary Fig. S3B), and the SBCVIC cases were located at different locations in both the phylogenetic clusters (Supplementary Fig. S4) and the components (Fig. 5A). The network graph of BA.2 exhibited lower densities (median = 0.202) than that of BA.5 (0.603); however, the difference was not significant by the Mann–Whitney U test (p = 0.157). Among the 12 BA.2 components containing > 5 cases, the largest one with BA.2.3–2.3.20 consisted of several smaller clusters and two main clusters. One of these clusters included GISAID-derived BA.2.3 cases from August onwards, BA.2.3.20 cases in October, and earlier cases from the SBCVIC. In another large component, BA.2.3.13, two clusters of GISAID entries from different months were observed, with individual divergence of SBCVIC-derived cases. The BA.5-related components of the seventh wave featured numerous cases from the SBCVIC (Supplementary Fig. S3B), with GISAID entries often forming a subcluster separate from the SBCVIC in certain network components (Fig. 5B). Of the 11 BA.5 components containing > 5 cases, most appeared to form separate clusters of SBCVIC- and GISAID-derived cases, except for BA.5.2.1. BA.5.2 primarily consisted of three clusters of SBCVIC-derived cases, divided by central GISAID entries from August to September. BA.5.1 had more GISAID entries than the other components, and the GISAID cluster was divided into two areas: one collected in July and the other in September. In BF.5, SBCVIC cases and GISAID entries were distinctly separated, without clear temporal characteristics.
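The density values compared above can be illustrated with a small sketch. It assumes the standard definition of density for an undirected graph without self-loops, that is, the number of observed links divided by the number of possible links; whether the software used in the study applies exactly this convention is an assumption. The example components are hypothetical.

```python
from statistics import median


def graph_density(n_nodes, edges):
    """Observed links divided by possible links in an undirected simple graph."""
    if n_nodes < 2:
        return 0.0
    # Treat (a, b) and (b, a) as the same link and ignore self-loops.
    unique_links = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    return 2 * len(unique_links) / (n_nodes * (n_nodes - 1))


# Hypothetical components: a four-node chain and a fully linked triangle.
chain = graph_density(4, [("a", "b"), ("b", "c"), ("c", "d")])
triangle = graph_density(3, [("x", "y"), ("y", "z"), ("x", "z")])

# Median density across the components of one lineage.
median_density = median([chain, triangle])
```

A density near 1 indicates that almost every pair of cases in a component is linked, which is expected when many near-identical genomes are sampled over a short period.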

Fig. 5

Transmission network of viruses collected during the seventh wave in Japan. (A) Estimated transmission links for cases belonging to BA.2 with genetic distances < 0.015% are shown. Each node in the network graph indicates a case infected with the BA.2 virus. Node colors in the graph denote collection routes and months. Blue and red indicate cases derived from the SBCVIC and GISAID, respectively. The darkness of the color reflects the recency of the collection month. (B) Estimated transmission links for cases belonging to BA.5 with genetic distances < 0.015% are shown. Each node in the network graph represents a case infected with the BA.5 virus. The color of the node is the same as described in (A). d indicates the graph density of each network.

Discussion

The considerable proportion of asymptomatic infections10,11,24 and the presymptomatic transmissible period24,25 hinder early detection of SARS-CoV-2 using traditional syndromic-based surveillance, which is widely implemented in several countries. Although this type of surveillance has the advantage of providing generalized data on disease incidence and mortality26, reporting all symptomatic cases to a surveillance agency is a burden on health workers and facilities26. The requirement for viral genomic information has notably increased this burden. Relying on symptomatic cases is essentially ineffective in detecting early COVID-19 outbreaks, which are suggested to have a cryptic transmission phase preceding infection waves27,28. An alternative approach to detecting cryptically transmissible viruses is testing-based sentinel surveillance with genome sequencing, including asymptomatic individuals. This approach could monitor epidemic indices efficiently and be more cost-effective than other approaches26. Although the screening test for SARS-CoV-2 conducted by the SBCVIC was not designed to monitor the incidence of COVID-19, it provides information for active sentinel viral genome surveillance in Japan.

To implement this approach, it is necessary to ensure that viruses in the sample population reflect the genetic diversity of the original population. Sentinel surveillance of SARS-CoV-2 genomic data has provided more epidemiologically representative data18, earlier reports of variant shifts29, and identification of novel imported variants30 compared with conventional systems. The available data from the SBCVIC were more representative of the seventh COVID-19 epidemic wave in Japan than the official surveillance during the same period. Notably, the SBCVIC managed to collect cases reflecting the trend in the number of positive individuals reported to NESID, and the viral sequences collected for four months until September 2022 exhibited nucleotide diversity within the tested population comparable to that of GISAID entries. These results indicate the adequacy of the data for monitoring SARS-CoV-2 genetic variation. Despite small sample sizes and regional coverage limitations in Japan, the data analyzed at the SBCVIC may still be appropriate for monitoring viral lineages.

Genetic differences among monthly collection groups in each collection route showed that SBCVIC-based surveillance could capture lineage shifts earlier than GISAID sequences, especially during the phase of the wave when the major variants transitioned from BA.2 to BA.5. The fact that the SBCVIC captured half and 90% of BA.5 variants on average 15 and 19 days earlier than GISAID, respectively, confirms that the shift could be observed about half a month earlier in the SBCVIC. The viral variant shift was also observed in previous studies involving on-site genome sequencing in healthcare workers29,31. This supports our findings, suggesting the benefit of combining testing-based screening and genomic analysis for SARS-CoV-2 variant surveillance.
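The averages quoted here are consistent with the lineage-specific lead times reported in the Results, assuming that they are unweighted means across the three BA.5 lineage groups (this averaging rule is an assumption, as it is not stated explicitly). A short check, with hypothetical variable names:

```python
# Lead times in days (SBCVIC earlier than GISAID) from the Results:
# (days for > 50% of cases, days for > 90% of cases).
lead_times = {
    "BA.5.2": (12, 12),
    "BA.5.2.1": (17, 10),
    "other BA.5": (15, 36),
}

# Unweighted mean across lineage groups (assumed averaging rule).
mean_lead_50 = sum(days[0] for days in lead_times.values()) / len(lead_times)
mean_lead_90 = sum(days[1] for days in lead_times.values()) / len(lead_times)

rounded_50 = round(mean_lead_50)
rounded_90 = round(mean_lead_90)
```

The unrounded means are roughly 14.7 and 19.3 days, which round to the 15 and 19 days given in the text.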

In our analysis, the viral population of the seventh wave in Japan consisted of at least 60 components in transmission networks, further subdivided into subpopulations. SARS-CoV-2 spread globally following a social network with a scale-free property32,33. Several components were singletons or pairs, and few had many cases, aligning with most viral epidemics34,35,36,37. As GISAID entries of BA.2 reflected various infections in Japan and the BA.2 cases from the SBCVIC were scattered in large BA.2 network components dominated by GISAID entries, the SBCVIC was capable of capturing a substantial range of BA.2 lineage diversity. In contrast, in BA.5 lineages, GISAID entries often formed subclusters separated from SBCVIC cases. The BA.5 epidemic gained momentum from August onwards. The declining number of BA.5 entries in GISAID from Japan since August could be attributed to the surveillance burden on official testing laboratories at local public health sections. Given this substantial burden, it would be reasonable for GISAID entries to cluster if analyses were performed solely on specific infected populations of interest. This indicates that the authorities may have failed to understand the current status of the epidemic. The distribution of GISAID entries throughout the network of SBCVIC cases in BA.5.2.1, and the observation of a small component consisting only of GISAID entries, highlight the need for nationwide implementation of viral genomic surveillance.

Genome-testing-based sentinel surveillance, targeting corporate employees without relying on symptoms as an indicator, can detect newly introduced variants earlier than retrospective GISAID analysis. GISAID entries result from symptom-reporting surveillance, and our study suggests that genome testing in specific corporate groups is a more proactive approach to identifying new variants.

Limited by data availability, our comparison focuses on GISAID genomic information as of April 2023, omitting some national surveillance data. A potential drop in GISAID registrations after August 2022 could be attributed to fewer Japanese registrations rather than a decline in sequence analyses, affecting the representativeness of our results compared to national government data. In addition, the SBCVIC linked positive individuals to a health care provider for a confirmatory diagnosis and subsequent treatment; however, it did not follow up on their disease course, and therefore some may have been presymptomatic cases. Although the asymptomatic rate is presumably higher than in NESID, where all cases are symptomatic, the exact difference in asymptomatic rates is not known, which is another limitation of this study. A further limitation is that the SBCVIC did not necessarily diagnose and collect data from all geographically transmitted populations in Japan, and we could only analyze samples from 24 of the 47 prefectures in Japan. Implementing a nationwide surveillance system could have facilitated the early detection of the BA.5 outbreak in the seventh wave and the subsequent epidemic. A well-designed genome-based sentinel surveillance is more effective at detecting genomic data than a comprehensive reporting system18. In such a context, the viral genome information based on the SBCVIC could reveal the variant shift earlier, although it includes presymptomatic individuals. Sentinel screening tests in asymptomatic populations are effective in preventing infection19,20,38. Their implementation, especially in the era of highly cryptic infections like Omicron-related variants, should be considered, owing to their low facility burden.

Conclusions

  • Viral genome surveillance using the SBCVIC testing system effectively observed genetic diversity in SARS-CoV-2 variants during the seventh wave in Japan.

  • The number of sequences analyzed by the SBCVIC was consistent with the reported number of positive individuals in the national surveillance system.

  • The SBCVIC cases detected the lineage shift from BA.2 to BA.5 earlier than GISAID entries in the retrospective analysis.

  • In the BA.2 network, GISAID entries dominated, whereas SBCVIC sequences were scattered across different locations in the network.

  • The BA.5 network components, primarily from the SBCVIC, displayed high density, whereas GISAID entries often formed a distinct subcluster separate from SBCVIC cases.

  • Testing-based sentinel surveillance of COVID-19 in asymptomatic individuals may be more efficient and less burdensome than notifiable surveillance using sequence analysis.

  • Further studies should aim to develop rapid and efficient genome-based surveillance systems for the next emerging infectious diseases.

Methods

Design of commercially based COVID-19 testing and screening

Since July 27, 2020, the SBCVIC has conducted workplace screening and local government-based voluntary COVID-19 testing using RT-PCR for > 2.8 million individuals without self-reported symptoms, following a protocol outlined in a previous report20. Briefly, self-collected and mailed saliva samples were tested using the SARS-CoV-2 Direct Detection RT-qPCR Kit (RC30JW; Takara Bio, Shiga, Japan) on the day of arrival at the SBCVIC. Samples with cycle threshold (Ct) values ≤ 40 were considered positive for COVID-19 infection, and almost all of them underwent viral genome sequencing using NGS with the ARTIC primer set and Illumina COVIDseq Test (Illumina Inc., CA, USA). NGS results were analyzed using the Illumina DRAGEN COVIDseq Test (RUO), and a consensus sequence was used for subsequent analyses.

Ethics approval

All participants provided written informed consent, allowing for the collection of saliva samples, viral sequencing, and demographic information for the study. Participants had the option to decline participation in the study via the website. This study was approved by the Institutional Review Board of the National Center for Global Health and Medicine (NCGM-G-003678-00). The study was conducted in accordance with the Declaration of Helsinki (revised in 2013).

Data collection

We studied Omicron-infected cases in Japan during the seventh wave of the COVID-19 epidemic from June to October 2022, including a month before and after the epidemic (Fig. 1A). First, from 7,004 samples in the SBCVIC database, we selected 2,518 whole-genome sequences of Omicron variants of concern derived from samples with Ct ≤ 30. Additionally, we retrieved 4,958 complete sequences from GISAID that were collected in Japan from January to October 2022 and registered as of April 11, 2023, via gisaid.org/EPI_SET_230609fz. We also retrieved Wuhan/Hu-1/2019, the first reported SARS-CoV-2 complete sequence, from the National Center for Biotechnology Information GenBank (accession number: MN908947). From this dataset, we extracted sequences with a read length ≥ 29,000 bases of complete nucleotides that were collected since June: 1,248 from the SBCVIC and 1,764 from GISAID entries. These sequences, along with Wuhan/Hu-1, were aligned using MAFFT version 7.539 to create the seventh wave dataset, which was used to estimate nucleotide diversity and genetic distance. Meanwhile, 300 sequences were randomly extracted for each sampling route from both the comprehensive alignment data containing all cases since January (Omicron wave dataset) and the seventh wave dataset. Bayesian Markov Chain Monte Carlo analysis was performed for each set. The accession numbers of the sequences from the SBCVIC are BS009006-BS010295.
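The selection and subsampling steps can be sketched as follows. The record layout (dictionaries with `id`, `seq`, and an optional `ct` field), the function names, and the random seed are hypothetical. The sketch also assumes that "bases of complete nucleotides" means the count of unambiguous A, C, G, and T bases, which is one possible reading of the criterion.

```python
import random


def unambiguous_length(seq):
    """Number of unambiguous bases (A, C, G, T) in a sequence."""
    return sum(1 for base in seq.upper() if base in "ACGT")


def select_sequences(records, min_bases=29_000, max_ct=None):
    """Keep records with enough determined bases and, if given, a low enough Ct."""
    kept = []
    for record in records:
        if unambiguous_length(record["seq"]) < min_bases:
            continue
        ct = record.get("ct")
        if max_ct is not None and ct is not None and ct > max_ct:
            continue
        kept.append(record)
    return kept


def subsample(records, k=300, seed=42):
    """Randomly draw k records without replacement (all records if fewer than k)."""
    if len(records) <= k:
        return list(records)
    return random.Random(seed).sample(records, k)


# Hypothetical records illustrating the two filters.
records = [
    {"id": "s1", "seq": "A" * 29_000, "ct": 25.0},
    {"id": "s2", "seq": "A" * 28_999 + "N", "ct": 25.0},  # too few determined bases
    {"id": "s3", "seq": "A" * 29_500, "ct": 31.0},        # Ct above the cutoff
]
selected = select_sequences(records, max_ct=30)
drawn = subsample(selected, k=300)
```

Fixing the seed makes the random draw reproducible, which is convenient when the same subsample must be reused across analyses.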

Molecular evolutionary analyses

Prior to the analyses, we masked seven sites of concern for artificial mutations40 by replacing the corresponding nucleotide in Wuhan/Hu-1 with an ‘N.’ Additionally, we excluded the first 100 bases and the last 200 bases of the alignment, where sequencing accuracy is poor. First, we identified the PANGO lineage of the infected virus using Pangolin version 4.223. Thereafter, in the seventh wave alignment, we estimated the number of base substitutions per site among all pairs of sequences, the gross and net averages between the monthly or collection route groups of sequences, and the averages within the groups. The Tamura-Nei 93 model with gamma-distributed sites was used for these calculations using MEGA1141. Hierarchical clustering was computed using a distance matrix of the net average substitutions and plotted as a Ward method dendrogram using R 4.0.242. To infer the transmission network of each lineage prevalent in Japan during the study period, we analyzed the pairwise substitution matrix according to previous studies on HIV-143,44. We linked two cases (nodes) in the network when the distance between their viral sequences in the substitution matrix was < 0.015%, equivalent to approximately 30% of the mean diversity in the entire population in the seventh wave dataset (d = 0.00056). The resulting adjacency matrices were transformed into network graphs and their components using the sna package in R. Bayesian MCMC clade credibility phylogenies for the Omicron and seventh wave datasets were inferred using BEAST 1.945 with GTR + G + I substitution and Bayesian skyline plot population models. The relative population dynamics of the Omicron epidemics in Japan from January 2022 were estimated using the Coalescent Bayesian Skyline Plot46.
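The network inference step can be illustrated with a minimal sketch, assuming a precomputed pairwise distance table. It links two cases when their distance is below the 0.015% threshold (0.00015 substitutions per site) and then extracts connected components. It does not reproduce the R workflow used in the study; the function names, case identifiers, and distances are hypothetical.

```python
def link_cases(ids, dist, threshold=0.00015):
    """Build an adjacency list linking cases whose pairwise distance is below the threshold.

    `dist` maps (a, b) tuples, with a listed before b in `ids`, to a distance.
    """
    adjacency = {case: set() for case in ids}
    for index, a in enumerate(ids):
        for b in ids[index + 1:]:
            if dist[(a, b)] < threshold:
                adjacency[a].add(b)
                adjacency[b].add(a)
    return adjacency


def connected_components(adjacency):
    """Return the connected components of the network as a list of sets."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical cases and pairwise distances (substitutions per site).
ids = ["s1", "s2", "s3", "s4"]
dist = {
    ("s1", "s2"): 0.0001, ("s1", "s3"): 0.0003, ("s1", "s4"): 0.0010,
    ("s2", "s3"): 0.0001, ("s2", "s4"): 0.0010, ("s3", "s4"): 0.0010,
}
network = link_cases(ids, dist)
parts = connected_components(network)
clusters = [part for part in parts if len(part) > 1]
singletons = [part for part in parts if len(part) == 1]
```

In this toy example, s1 and s3 are not directly linked but fall into the same component through s2, while s4 remains a singleton.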

Statistical analysis

Data regarding the month of collection, collection route, and PANGO lineage were compiled using Microsoft Excel. All epidemiological data analyses, including the Mann–Whitney U test and standard error of means, were performed using R version 4.0.2.