Introduction

Parkinson’s disease (PD) is a common neurodegenerative disease. Several studies have shown that gut microbiota have an impact on human health1,2, as changes in the gut microbiota composition affect the brain–gut communication and pathology2. Accordingly, different neurodegenerative diseases, especially PD, are associated with gut microbiota composition changes3. Although several studies have shown which gut microbiota mechanisms are potentially relevant3,4, more comprehensive knowledge of the gut microbiota on the genome level could help to understand the functional impact of gut microbiota in PD.

With respect to gut microbiota, the Helsinki cohort is one of the most thoroughly studied cohorts in the field of PD. We have originally described differences between PD and control groups from stool samples3, and followed the temporal stability of the microbiota4. In addition, the bowel symptoms and their linkage to the gut microbiota have been studied5. Following the Braak hypothesis, we have also investigated the microbiome and PD connections in samples from nasal and oral locations6. These original microbiome analyses have been followed by studies on short-chain fatty acids (SCFA) and immune responses7, and recently on the links between metabolomics, metabolic enzymes, and epigenetic regulation of the host8,9,10.

In this study, we used shotgun metagenomics data from the same cohort3,4 and analysed the metagenome-assembled genomes (MAGs), their distribution and functionalities when contrasted between PD and control groups with the aim of gaining a deeper understanding of the functional potential of gut microbiota in Parkinson’s disease and potentially informing future research directions.

Materials and methods

A summary visualisation of the methods can be seen in Supplementary Fig. S1, and Supplementary Fig. S2.

Subjects

This study used the same patient cohort as previously described3,4. The stool samples and the DNA are the same as the baseline samples of our previous study4. This ensures direct comparability between the current findings and our previous results. We included 68 control and 68 PD patients in this case–control study (Table 1). To ensure ethical conduct, the ethics committee of the Hospital District of Helsinki and Uusimaa approved the research in accordance with the guidelines of the Declaration of Helsinki. All participants provided written informed consent to participate.

Table 1 Characteristics of the 136 subjects included in this case–control study.

Extraction of total DNA and shotgun DNA sequencing

Total DNA was extracted from frozen fecal samples using the STRATEC Molecular (Stratec, Birkenfeld, Germany) PSP Spin Stool DNA Plus Kit according to the manufacturer’s instructions. Nextera Library Preparation Biochemistry (Illumina, San Diego, CA, USA) was used to prepare sequencing libraries. We performed shotgun paired-end DNA sequencing on Illumina Nextseq 500 (170 bp + 140 bp) and Illumina NovaSeq 6000 (150 bp + 150 bp) platforms.

Read pre-processing and metagenomic assembly

Quality and adapter trimming was performed using Cutadapt v1.811 with “-m50” and truseq adapter parameters. Reads were mapped to the Human reference genome using BWA v0.7.12-r10312 to filter out human reads. The total of filtered human reads showed no significant difference between groups (Supplementary Table S1). The independent assembly of metagenomic reads of each biological sample (i.e., 136 separate assemblies, one per study subject) was done using SPAdes v3.11.113 with the “-meta” option.

Binning and pangenomic analysis

We first binned the contigs using five different binning tools: Metabat v2.12.114, MaxBin v2.2.515, Abawaca v1.00 (https://github.com/CK7/abawaca), Concoct v0.4.016, and MyCC17 with default options. All results from the five binning tools were used as input to DASTool v1.1.018 with the option “--search_engine diamond” to create the final MAGs (Supplementary Table S2). The quality of MAGs was checked using CheckM v1.0.1219. The pangenomic analysis of selected genera was done using Anvi’o v7.020.

Phylogenetic analyses of MAGs

GTDB-Tk v2.1.0 (database release GTDB R07-RS207) tool21 was used with the “‘classify_wf’” function to perform taxonomic annotation of each MAGs. A maximum-likelihood tree was calculated using IQ-TREE v1.6.1222 using the protein sequence alignments produced by the GTDB-Tk tool. The tree was visualised using the iTOL web tool23.

Dereplication and microdiversity profiling

We used dRep v3.2.224 with the “dereplicate --S_algorithm fastANI --multiround_primary_clustering -ms 10000 -pa 0.9 -sa 0.95 -nc 0.30 -cm larger” options to dereplicate the constructed MAGs from all samples. To perform microdiversity analysis, the sequencing reads from each sample were mapped back to the dereplicated MAGs using Bowtie v2-2.4.425. Then, the inStrain v1.3.126 tool was used with the mapping files, which also provided coverage depth and nucleotide diversity of dereplicated MAGs in each sample. The coverage depth data were normalised by dividing the total sample’s number of reads. For statistical analysis of nucleotide diversity, we only include samples with ≥ 1× coverage depth.

Growth rate index (GRiD) calculation

We used the GRiD tool27 to predict the bacterial growth rate of the dereplicated MAG set. First, a database was created using the “grid update_database” function with input of all dereplicated MAGs. Then, the “grid multiplex” function was used with the “-c 0.2” option using all the metagenomic reads per sample. We further filtered out growth estimates with species heterogeneity values ≥ 0.3 and coverage < 1X to ensure reliability, following GRiD guidelines.

Gene prediction, annotation, and clustering

Genes from the assemblies were predicted using Prodigal28 with the “-p meta” option. COG annotation of the predicted genes was done with the reCOGnizer tool29. In addition, to annotate cas genes we used CRISPRCasTyper30. The genes were clustered based on their nucleotide sequences with MMseqs231 using “easy-cluster --cov-mode 1 -c 0.8 --min-seq-id 0.9” options. For amino acid-level clustering, “easy-cluster --cov-mode 1 -c 0.8 --min-seq-id 0.9” settings were used. Based on the clustering, a presence-absence table was created and Scoary v1.6.1632 was used to compare PD and control groups. For the statistically significant gene clusters, an additional detailed functional annotation was performed using PANNZER233. We also specifically predicted antiviral defence systems within all the MAGs using PADLOC34 and PADLOC database v1.4.034.

Identification of viral sequences

We identified viral contigs from assemblies using VIBRANT v1.2.135 with default settings. Viral contigs were searched within the MAGs, and if a viral contig was detected, we assumed the MAG is the host for the phage. The taxonomy of the identified contigs were predicted using Kaiju v1.8.236 with Kaiju “viruses” database (downloaded on March 2022).

Differential abundance analysis using human gut archaeome database

We used the human gut archaeome database37 for archeal differential abundance analysis. Human gut archaeome database includes a few subsets based on the ANI similarity and genome completeness37. To work with a non-redundant dataset, we selected the subset with 98 archeal genomes (subset “MAGs_98” of the human gut archaeome database37). We mapped pre-processed sequencing reads to the database using Bowtie v2-2.4.425, and the mapped reads were counted using BAMtk v0.1.1 (https://github.com/meb-team/BAM-Tk). Statistical significance between groups was calculated using DESeq2 v1.30.0 R package38.

Secondary metabolites BGCs prediction and comparison

Biosynthetic gene clusters (BGCs) putatively involved in the synthesis of secondary metabolites were predicted using antiSMASH v.5.0.039 using full feature run as described by the developers (https://docs.antismash.secondarymetabolites.org/command_line/). Predicted BGCs were compared using automatic mode in BiG-SCAPE/CORASON40 at different cut-offs (0.30, 0.40 and 0.60), and networks of similar BGCs were obtained in addition to their evolutionary histories. Selected core genes from BGCs that were not grouped with those deposited to the Minimum Information about a Biosynthetic Gene Cluster (MIBiG) repository were used to perform a BLASTp search.

Results

Shotgun metagenomic reads

Previously, we have observed clear differences between the PD group and the control group at the bacterial community level using the 16S rRNA gene amplicon sequencing approach3,4. In the present study, we used the same DNA samples for the shotgun metagenomics sequencing approach. Our aim was to use the metagenome data to unravel possible genes and pathways that might contribute to the PD phenotype. We focused on the metagenome-assembled genomes (MAGs), their distribution, as well as the genetic arsenal of the microbes. In total we analyzed 136 biological samples, one per study subject. Sample specific read and base counts are listed in Supplementary Table S1.

Metagenomic assembly and MAG reconstruction

We observed that the total length of the assembly is larger in the control group compared to the PD group (Supplementary Fig. S3). Specifically, the mean total length of the assembly was 277,969,072 bp in the control group versus 224,912,227 bp in the PD group. From the metagenomic assemblies, a total of 6,736 MAGs were retrieved, with 3663 from the control group and 3073 from the PD group (Supplementary Table S2). On average, 53 MAGs were created per sample in the control group and 45 per sample in the PD group (Supplementary Fig. S3). The average completeness of these MAGs was 86.9%, contamination was 3.7%, and the average genome length was 2,481,247 bp (Supplementary Table S3a).

Taxonomic annotation of MAGs

The majority of the MAGs (6692) were annotated as bacteria, with only 44 identified as archaea. At the phylum level, 86.5% of the bacterial MAGs were annotated as Firmicutes and Bacteroidota (synonym Bacteroidetes). The most common genus across all samples was Alistipes, with 299 MAGs (Supplementary Fig. S4).

Dereplication and MAG coverage

Since we performed an independent assembly for each subject’s sample, we encountered highly similar MAGs across subjects. To obtain non-redundant MAGs, we chose the best representative MAG using dRep24. All 6,736 MAGs were aggregated and dereplicated at 95% ANI to choose representative MAGs. In total, the dereplicated genome set consisted of 952 MAGs. Most of the MAGs were classified as bacteria (943 MAGS), while only 9 MAGs were archaea. The proportion of reads successfully mapped back to the dereplicated MAG set ranged from 68.97 to 84.72% across the 136 samples, with a median of 78.71% (Supplementary Table S3b). The annotation of MAGs indicated that Firmicutes is the dominant phylum (Supplementary Fig. S5) (Fig. 1). On the genus level, even with the dereplication set, we recovered 22 different genomes of Collinsella and Prevotella, which indicates high variability of those genera in the human gut microbiota.

Figure 1
figure 1

Maximum-likelihood phylogenetic tree including 943 dereplicated bacterial MAGs. Each tree branch represents a MAG. Bars on the outer layer represent how many MAGs were clustered within the dereplicated MAG, outer strip colours represent phylum of the MAGs. Background colour of the clade represents the annotated Cas type within the MAG.

The coverage depth (mean depth) of the dereplicated MAGs was calculated for each sample using the inStrain tool26 by mapping back the reads to the dereplicated MAGs (Supplementary Table S4). The coverage depth data was used to understand MAG abundances across all samples. After normalising the coverage depth by sequencing depth, a Wilcoxon Rank Sum Test was applied to identify the difference in coverage depth between the control and PD groups. The results indicated that the coverage depth of 10 dereplicated MAGs was statistically significantly different between control and PD groups (please note that the significance threshold was p-adjusted to 0.1) (Supplementary Table S4, Supplementary Fig. S6). For only one MAG, coverage depth was statistically significantly higher (p < 0.1) in the PD samples, which was taxonomically annotated as Alistipes onderdonkii. This suggests that the abundance of Alistipes onderdonkii was higher in PD samples compared to the control group. MAGs with significantly higher coverage depth (p < 0.1) in the control group belonged to genera Agathobacter, Prevotella, Dysosmobacter, Clostridium, Choladocola, and Blautia (Supplementary Fig. S6).

Microdiversity analysis

Microbial communities typically have different variants of bacterial genomes, presenting microdiversity that could have functional effects. Thus, reads from each sample were mapped back to the dereplicated genome set and the microdiversity profile was calculated with the inStrain tool26 (Supplementary Table S4). Based on the Wilcoxon rank-sum statistics, nucleotide diversity of a MAG, annotated as Ruminococcus bromii B by GTDB-Tk, was statistically significantly (p = 0.002) different between the control and PD groups (Fig. 2). Specifically, the nucleotide diversity was higher in the control group. This indicates that Ruminococcus bromii is more diverse on the strain level in the control group compared to PD.

Figure 2
figure 2

Box-plot showing nucleotide diversity of Ruminococcus bromii in each sample. Each dot represents one sample. Blue box represents the control group, and the orange represents PD. The statistical significance was observed based on Wilcoxon rank-sum statistics and Benjamini/Hochberg correction. Samples with greater than 1× coverage depth were used for the statistical analysis.

Growth rate index (GRiD)

A previous study indicated the possibility to estimate the growth of individual bacteria from shotgun metagenome data based on DNA copy number variation along the genomes41. Recently, the GRiD tool was published for making these analyses feasible for metagenomic samples27. We used this to predict the growth rate of dereplicated MAGs in each sample (Supplementary Table S5a).

All samples were pooled together to compare growth within different taxonomic levels. Phylum-level analysis using pooled samples was done to gain understanding about the growth of a bacteria belonging to specific phylum in the gut environment in general. The data indicated that at the Firmicutes level, the GRiD scores were statistically significant between control and Parkinson’s samples. Additionally, statistical significance was observed at the Desulfobacterota level; however, the number of samples with GRiD scores for Desulfobacterota was low (Fig. 3, Supplementary Table S5b). The lowest taxonomic level at which we observed a significant difference was the family level. Within this level, MAGs belonging to the Oscillospiraceae family had statistically significantly higher GRiD scores in the control group (Fig. 3, Supplementary Table S5b). The distribution of the GRiD scores also indicated that the variation is high between MAGs. For example, the minimum GRiD score within the Bacteroidota phylum was 1.00, while the maximum was 9.99 (Fig. 3, Supplementary Table S5b).

Figure 3
figure 3

Violinplot shows the Growth Rate Index (GRiD) score of each MAG grouped by (a) Phylum and (b) Family. Blue distribution diagram represents the control group, and the orange represents PD. White dot indicates the median. For simplicity, we only plotted the top eight family taxa with the highest number of GRiD record observations.

The comparison between control and PD groups showed that none of the MAGs has a statistically significant difference in growth rate. A slight but non-significant difference was observed for the growth rate (p < 0.20) for two MAGs, which were annotated as Bacteroides eggerthii, and UBA11774 (belongs to Lachnospiraceae family). For these two MAGs, the GRiD score was slightly higher in the control group (Supplementary Fig. S7).

Pangenomes of selected genera

Previous studies showed that the abundance of Prevotella is related to PD3. Therefore, we focused on Prevotella in detail for our pangenomic analysis. Pangenome analysis of 64 Prevotella MAGs showed that MAGs did not cluster based on the condition (control vs. PD) using average nucleotide identity (ANI) (Fig. 4). We also annotated genes of all Prevotella MAGs using the Clusters of Orthologous Genes (COG) database. We observed that the three most common COG categories in Prevotella MAGs were “Cell wall/membrane/envelope biogenesis”, “Translation, ribosomal structure and biogenesis”, and “Carbohydrate transport and metabolism” (Supplementary Fig. S8). To observe if any COG group was enriched in either or both PD and control groups, we used the “compute-functional-enrichment-in-pan” functionality in the Anvi’o tool20. However, no significant difference was observed between groups.

Figure 4
figure 4

The pangenome of Prevotella MAGs. The dendrogram in the centre organises the 13,257 gene clusters that occur in all 64 Prevotella MAGs. The 64 inner circular layers correspond to the 64 Prevotella MAG. MAGs that were created from the control group and PD group are shown in blue and yellow, respectively. MAGs are ordered according to ANI scores which are shown at the top-right corner with red gradient. The length of the MAGs is shown with grey bars.

We additionally performed pangenomic analysis of other selected genera that were shown to be related to PD based on literature and our own studies3,4,42. However, we did not observe condition-specific differences (PD and control). COG functional enrichment was performed between the genes of control and PD MAGs for selected genera. However, we did not see any statistically significant difference in COG functions in the selected genera (Supplementary Fig. S9).

Prediction of antiviral defence systems in MAGs

Based on PADLOC34 results, 5990 of 6,736 (89%) MAGs had at least one gene related to antiviral defence systems. For the rest of the 746 MAGs, antiviral defence system genes were not observed (Supplementary Table S6). The most common defence systems in gut microbiota were “Restriction modification type II

(RM_type_II)”, “Bacterial abortive infection type E (AbiE)”, “Restriction modification type I (RM_type_I)”, and “CRISPR-Cas system type I-C (cas_type_I-C)” (Supplementary Fig. S10).

Cas gene identification from assemblies

We also specifically identified cas genes from each assembly. The most common Cas operon type in gut microbiota is I-C type operon (Supplementary Fig. S11). In addition, III-A, I-E, II-A, I-B, II-C and V-A type operons were commonly detected (Supplementary Table S7) in both PD and control groups (no difference between groups for the number of Cas operon types; Supplementary Fig. S11. Similarly, we did not observe any phylum-specific cas type (Fig. 1).

Genes from metagenome assemblies

From all 136 assemblies, we predicted 93,075,923 genes, which results in an average of 684,381 predicted genes per assembly. On the MAG level, there were 15,889,713 predicted genes from the 6736 MAGs, or an average of 2359 genes per MAG. Since the number of predicted genes differed between assembly and MAG level, we made additional comparisons using assembly-level predictions. All 93 million genes from metagenome assemblies were annotated with the reCOGnizer (https://github.com/iquasere/reCOGnizer) tool. Approximately 51 million genes, which represent about 55% of all genes, were successfully annotated with COG terms. “Transcription” was the most common COG category, and the OmpR family DNA DNA-binding response regulator gene was the most common gene based on COG annotation (Supplementary Fig. S12, Supplementary Table S8, Supplementary Table S9).

The predicted genes were clustered on both nucleotide and amino acid sequence-level, and the presence-absence table was constructed based on the clustering. Next, the Scoary tool32 (binomial test with Benjamini–Hochberg adjustment) was used to study the association between the clustered genes’ presence or absence and the groups (control vs. PD). Such analysis indicates the link between presence or absence of particular genes and observed traits (control vs. PD). Statistical significance (p < 0.05) was observed for nine gene clusters on the nucleotide level, and for seven gene clusters on the amino acid level. Three of these were the same gene (Supplementary Table S10). Most of the gene clusters were seen in Ruminococcus and Blautia, and the occurrence was higher in the control group. For all significant gene clusters, the occurrence was higher in the control group compared to the PD group.

The genes with very low occurrence in PD samples can be interesting. For example, the speF (ornithine decarboxylase SpeF) gene from Veillonella (which occurred in 24 control samples, but was absent in the PD samples), “Uncharacterized protein” gene from Clostridiaceae (occurs in 5 PD vs. 34 control samples), “GIY-YIG nuclease family protein” gene from Faecalibacterium sp. (occurs in 1 PD sample vs. 23 control samples), and “Fe–S oxidoreductase” gene from Eubacterium sp. (occurs in 11 PD vs. 40 control samples). (Supplementary Table S10).

Secondary metabolites biosynthetic gene clusters (BGCs)

BGCs putatively involved in the synthesis of secondary metabolites were predicted in the dereplicated MAGs using antiSMASH and compared using BiGSCAPE-CORASON (Supplementary Table S11). Some BGCs present in the MAGs were closely related to BGCs that have been previously annotated and deposited in the MIBiG database of known secondary metabolites BGCs (Supplementary Table S12 and Supplementary Fig. S13). The presence of biosynthetic genes involved in the synthesis of multiple molecules could be detected in MAGs dereplicated from control and PD patients, such as yersiniabactin, N-myristoyl-D-asparagine (precursor of colibactin43) and aerobactin. The BGCs present in the replicated MAGs were diverse, from which putative similar pathways were detected (Supplementary Table S11 and Supplementary Fig. S14). Core genes from the most prevalent BGCs were compared to the NCBI database (BLASTp); these genes could indicate their potential involvement in the synthesis of ranthipeptides, squalene/phytoene synthesis and arylpolyenes among biosynthetic pathways that still need to be further clarified (Supplementary Fig. S14).

Identifying phage contigs and locating them within MAGs

From the metagenomic assemblies, we identified phages using Vibrant35, which were then searched from the MAGs. If a phage (prophage) contig is located within the MAG, we assume that the MAG is the host for the phage. In total 111,099 contigs were identified as phages in the whole dataset; 31,127 of them (28%) were binned in a MAG (Supplementary Table S13). The most common phage order was Caudovirales (Supplementary Table S13). On the family level, Siphoviridae, Myoviridae, and Podoviridae were the most commonly seen phage families within the MAGs (Supplementary Table S13).

Archeal MAGs and differential abundance analysis using human gut archaeome

Of the 6736 MAGs analyzed, 44 were identified as archaea, in which 26 were found in the control and 18 were detected from the PD group (Supplementary Table S14). Most of the archeal MAGs (35 of the 44) were annotated as Methanobrevibacter smithii. With the dereplication, we identified nine non-redundant high-quality archaeal MAGs in our complete dataset. Our coverage depth analysis showed that none of the MAGs were differentially abundant between the control and PD group.

In addition to MAGs, we also used the human gut archaeome database37 for differential abundance analysis, which indicated 11 archeal genomes that differed in abundance (log2foldchange >|0.8| and p-adj < 0.05) between the control and PD group (Supplementary Table S14). Seven were statistically significantly more abundant in the PD group, six of which were annotated as Methanobrevibacter smithii (family level taxa is Methanobacteriaceae). Four genomes were found to be statistically significantly more abundant in the control group, all of which were annotated as Methanomethylophilaceae at the family level. Hence, our data indicated that Methanobacteriaceae was more abundant in the PD group, while Methanomethylophilaceae was more abundant in the control group (Fig. 5).

Figure 5
figure 5

Box-plot illustrating the abundance of selected archaeal species across different samples. Each dot corresponds to a single sample. The blue box represents the control group, while the orange box represents the PD group. Panel (a) shows GUT_GENOME109037, and panel (b) shows GUT_GENOME247230, both of which are selected archaeal genomes from the human gut archaeome database.

Discussion

Investigating the gut microbiota using shotgun sequencing and metagenome-assembled genomes (MAGs) provides more detailed genomic information compared to using only 16S RNA gene amplicon sequencing. For example, a human gut microbiota study that investigated gut microbiota MAGs during Helicobacter pylori eradication therapy reported that the spread of antibiotic resistance genes between MAGs increased the relative abundance of certain MAGs44. Here we studied detailed genomic features of the MAGs that were obtained from control and PD groups. We retrieved 6736 MAGs from 136 human gut microbiota samples, corresponding to an average of 49.5 MAGs per sample. Another human gut microbiota study also indicated that on average, 53 MAGs could be created per individual sample44. However, these numbers are highly dependent on the depth of sequencing, the techniques used to reconstruct the MAGs, and the thresholds applied to consider a MAG as valid, including factors such as completeness, contamination, and the number of contigs in the MAG. With the following dereplication process, we reported 952 high-quality (min. completeness 75%, max. contamination 25%) non-redundant representative genomes from gut microbiota. Only less than 1% of the representative genomes were archaeal genomes, which is in line with a previous human gut microbiome study45. Moreover, similar to previous gut MAG studies45,46,47, Firmicutes and Bacteroidota were the most common phyla detected in the gut microbiota of both the control and PD groups. On the species level, 11% of the dereplicated MAGs could not be assigned to an existing species within the GTDB database, which could indicate that those MAGs might be novel species. A previous study reported that 60% of gut metagenome MAGs could not be assigned to a species45. However, it is also worth mentioning that genome taxonomy databases grow parabolically. When we used Release 202 of the GTDB database (GTDB release date: April 27, 2021), 50% of MAGs were not assigned to existing species. With Release 207 of the GTDB database (GTDB release date: April 8, 2022), only 11% of MAGs were unassigned. Here we studied quantitative differences between control and PD groups. However, it is relevant to note that part of the quantitative differences can be hampered by the MAGs building process. Reads can be disproportionately mapped to MAGs, distorting the quantitation of features like genes between the PD and control groups. Nevertheless, MAGs give a more accurate depiction of the gene and genomes compared to analysis of only reads and contigs. In our study, working with MAGs allowed us to achieve a higher resolution for this specific cohort, enabling us to capture unique microbial signatures that may not be present in larger databases. However, we acknowledge that read-based analyses using extensive databases can also be powerful. In fact, such an analysis was conducted using the same data in our other study (manuscript in preparation).

Microdiversity analysis using a minimum coverage threshold of 1× indicated that the MAGs annotated as Ruminococcus bromii show significant differences in strain diversity between PD and control groups. While this approach allows for the identification of strain-level variations, higher coverage depths could potentially reveal a more comprehensive picture of the genomic diversity within Ruminococcus bromii. Specifically, the control group has statistically significantly (p < 0.002) more diverse Ruminococcus bromii genomes compared to PD. The difference in Ruminococcus bromii genome diversity between groups might indicate that this species adapts to the different gut microbiota environments via genomic changes. It can also be speculated that the genomic difference in Ruminococcus might contribute to the changes in gut microbiota environment in PD and control subjects. We observed that Ruminococcus is a highly abundant genus in gut microbiota, hence genomic variation in Ruminococcus might influence the whole environment. Moreover, gene clustering and the presence-absence association analysis indicated that several Ruminococcus genes had significantly different occurrences between control and PD samples. Those genes were mostly related to metabolism, which might indicate the adaptation process.

We performed pan-genome analysis for selected genera using the generated MAGs. However, we did not see a statistically significant difference for the selected genera between control and PD groups based on COG enrichment. We then extended our gene comparison analysis to the assembly level, using all the contigs from each assembly rather than just focusing on the MAGs. While about 16 million genes were predicted in the MAGs, more than 90 million genes were predicted from all the contigs of the independent assemblies of 136 individuals. Hence, by using all the contigs from all assemblies, we can analyse most of the gene information in the gut environment. All predicted genes from assemblies were used for clustering, and based on the cluster table, a presence or absence table was created. The statistical analysis on the presence-absence table allowed us to compare gene occurrence levels between PD and control samples. We identified some genes that were significantly more frequent in the control group, for example speF (encoding ornithine decarboxylase) from Veillonella sp. This gene cluster was absent from all PD samples when searched for within their assemblies, while 24 control samples (35.2%) possessed the cluster within their assemblies. It has been reported that the ornithine level in PD patients is higher compared to controls48. It may be speculated that ornithine decarboxylase might reduce the ornithine level by converting ornithine to putrescine. The Scoary analysis indicated that several gene clusters from Ruminococcus and Blautia were significantly more frequent in the control group. As most were genes related to metabolism, one could speculate that those genes can play a role in PD protection. For example, reduced short-chain fatty acid levels have been observed in the PD group compared to the control group7,49,50. Accordingly, some fatty acid-related gene clusters were more frequent in the control group, and those genes might help to keep short-chain fatty acid levels high in the control group.

Growth rate of the gut microbes can be a useful metric to understand the growth dynamics of the gut microbial community. With the help of new tools, the replication rate of MAGs can be estimated using metagenomics data, and replication is an indication of microbial growth rate27. Our data indicated that some of the MAGs belonging to Bacteroidota have a high growth rate (up to 9.9 GRiD score), which has not been seen for other phyla MAGs. One possible explanation is the high abundance of Bacteroidota in gut environments45. However, it is also possible that the low quality of some MAGs, such as those fragmented into multiple contigs, may result in false positive GRiD scores. Interestingly, we observed a higher GRiD score for the Desulfobacterota phylum in the control group compared to the PD group. This finding appears to contradict previous research suggesting a potential role for Desulfovibrio bacteria (belongs to the Desulfobacterota) in PD pathogenesis51. However, it’s important to note that we did not observe any significant difference in the coverage depth of Desulfobacterota itself between the control and PD groups. The low number of samples with available GRiD scores for Desulfobacterota might limit the generalizability of this observation (Fig. 3, Supplementary Table S5b). Moreover, we observed significant differences in GRiD score at the family level, particularly within the Oscillospiraceae, but we did not detect differences at more specific taxonomic levels like genus or species. The differential growth exhibited by MAGs belonging to this family is noteworthy. However, due to the broad taxonomic scope of this finding, drawing precise conclusions about individual members’ contributions to the observed patterns requires further research.

There were 10 MAGs that have a statistically significant (p < 0.1) difference in coverage depth between the control and PD group. Here, coverage depth—the average number of reads mapping to a MAG—was calculated using in Strain26 by mapping the sequencing reads back to dereplicated MAGs. Coverage depth helps to understand the abundance of a microorganism, and comparing abundance between PD and control group might indicate association of a microorganism to the disease. A MAG that was annotated as Alistipes onderdonkii had significantly (p < 0.1) more coverage depth in the PD group, which was similarly reported in other studies with high abundance of Alistipes in the PD group52,53,54.

In our dereplicated MAG set, there were 22 different Collinsella and Prevotella MAGs. Those had the highest number of different MAGs on the genus level. This indicates that Collinsella and Prevotella genera were highly diverse within all samples. The genus Prevotella is a very large group, with 717 species and 99 strains (based on NCBI Taxonomy Browser, January 2023). Previous studies showed high abundance of Prevotella in the control group gut microbiota compared to the PD group3,5,55,56. Similarly, we observed two Prevotella MAGs that had statistically significantly (p < 0.1) more coverage depth in the control group compared to the PD group, indicating higher abundance of Prevotella in the former (Supplementary Fig. S6). Since the MAGs were directly constructed from the patient’s gut microbiota, the abundant Prevotella MAGs in the control group might help future studies to focus on the Prevotella species. However, given that our pangenome analysis did not reveal significant differences between control and PD Prevotella MAGs in terms of gene content and functional categories, it is possible that the abundance of Prevotella, rather than its genomic characteristics, plays a crucial role in its association with Parkinson’s disease. This notion is supported by several previous studies that have consistently reported higher Prevotella abundance in healthy controls compared to PD patients3,5,55,56. It should be noted that the coverage depth analysis (abundance analysis) was performed on the dereplicated MAG set, and further analysis using larger databases can provide more accurate abundance calculation. Such analysis was done using the same data in our other study (manuscript in preparation), which also indicated higher abundance of Prevotella in the control group.

The dereplicated MAG set was also used to predict biosynthetic gene clusters (BGCs) that could be putatively involved in the synthesis of secondary metabolites. A previous study identified 43 BGCs that were found to be enriched in PD patients, although they were not correlated to metabolites known to be related to the disease57. Some secondary metabolite BGCs predicted from dereplicated MAGs could be involved in the synthesis of some secondary metabolites (Supplementary Table S12 and Supplementary Fig. S13). A MAG obtained from the PD sample P28 contained BGCs putatively involved in the synthesis of yersiniabactin and colibactin, which can commonly co-occur in enterobacteria58. However, the most common BGCs were present in few MAGs (up to 11 BGCs), and the produced metabolite is still unknown. Some core genes of these BGCs had sequences similar to ranthipeptides, squalene/phytoene synthesis, enterobactins, and arylpolyenes (Supplementary Fig. S14). Previous analyses based on multiomics approaches of the same cohort (but using samples taken two years later than the ones used in this study) indicated that metabolites involved in the squalene and cholesterol pathways could be used as predictive molecules8, and the BGCs detected in this study could be potentially involved in the synthesis of these metabolites.

With our non-redundant MAGs, we could not detect significant differences for archeal MAGs abundance between the control and PD group. However, when we used the human gut archaeome as a database37, our data indicated Methanobrevibacter smithii was statistically significantly (p < 0.05) more abundant in the PD group (Supplementary Table S14). This is in line with previous PD gut studies, which have previously reported that the genus Methanobrevibacter is more abundant59 in PD patients. Recently, it has been reported that the 2-hydroxypyridine (2-HP) molecule proposed to be synthesised by the genus Methanobrevibacter smithii is more abundant in PD patients60. Moreover, our data indicated Methanomethylophilaceae UBA71 genus was statistically significantly (p < 0.05) more abundant in the control group. However, more studies are needed to understand the effect of Methanomethylophilaceae UBA71 in the control group.

Phages within the MAGs were analysed in this study, while further phage analyses from all metagenomic contigs are detailed in another study (Lecomte et al., submitted). From all the 111,099 phage contigs that were identified using the 136 metagenome assemblies, 31,127 (28%) were found to be part of the predicted MAGs. Similarly, another study61 reported that a host was able to be predicted for only 28.6% of all human gut phages61. We observed that the MAGs that were annotated as Bilophila had the highest number of phage contigs (22) within the MAGs. It should be noted that most of the predicted phage contigs are “low quality draft”, therefore the fragmented short phage contigs can be the reason for the high number of phages in Bilophila MAGs.

One of our interests was antiviral defence genes in the MAGs; antiviral defence systems in prokaryotes are crucial for their survival, as they are constantly infected by viruses62. Hence, we hypothesized that a high proportion of gut microbiota MAGs would contain genes encoding antiviral defence systems. As expected, prediction of antiviral defence systems in the MAGs indicated that a very high percentage of the constructed MAGs contain at least one antiviral defence systems gene. Such a high percentage of presence of antiviral defence systems in gut bacteria might indicate that a high number of viruses can be found in gut microbiota, and gut bacteria adapt well to defend themselves from viruses. To our knowledge, the complete antiviral defence system search has not been performed for other metagenomic environmental samples, therefore it is not known if the high presence of antiviral defence systems is specific to gut microbiota or more widespread. Understanding the prevalence and diversity of antiviral defence systems in gut microbiota is crucial because these systems play a vital role in shaping the bacterial-viral interplay, potentially impacting gut health and disease. To complement the findings presented here, we conducted an extensive phage-centric analysis based on the dataset, which has been submitted for publication (Lecomte et al., submitted).

Conclusion

We report 952 non-redundant near-completed MAGs from 136 human gut metagenome assemblies to better understand the gut-associated microbiome genetics and diversity. We found that the Ruminococcus bromii genome diversity is statistically significantly higher in the control group compared to the PD group. Moreover, statistically significantly more frequently occurring genes were identified in the control group. Our results expand the knowledge of gut-associated microbiome genomes, and underscore the importance of potential consequences of strain-level differences between groups.