Background

Small RNAs like microRNAs (miRNAs) have key roles in gene regulation [1]. miRNAs are ~ 22 nucleotides (nt) in length and generally act by a post-transcriptional mechanism involving base-pairing interactions with their target mRNA. miRNA target sites are typically located in the 3’-UTRs of protein-coding mRNAs [2, 3] and referred to as miRNA response elements (MREs). miRNAs are abundant across eukaryotic genomes ranging from plants to animals. miRNA-like species are also encoded by fungal [4] and viral genomes [5,6,7]. The activities of other processed small RNAs such as tRNA fragments have also been associated with essential cellular activities including the repression of endogenous retroviruses [8] and metabolic regulation [9].

Canonical miRNA biogenesis is regulated via serial processing by two distinct RNases. miRNAs are initially produced as larger transcripts, termed primary miRNAs (pri-miRNAs), transcribed by either RNA Pol II or RNA Pol III. Many pri-miRNA-encoding loci are termed host genes. pri-miRNAs are recognized and processed in the nucleus by the microprocessor complex consisting of the RNase III endonuclease Drosha and its cofactor, DGCR8 [10,11,12,13]. Drosha cleavage of the pri-miRNA produces the precursor miRNA (pre-miRNA), an ~80 base pair long RNA hairpin species with a two-nucleotide 3’-overhang structure. Pre-miRNAs are exported to the cytoplasm by exportin 5 [14], where they are cleaved by the RNase DICER to generate miRNA duplexes 18–22 nt in length [15, 16]. Subsequently, one strand of the miRNA duplex, referred to as the guide strand, is loaded into an Argonaute (Ago) protein. The primary Argonaute protein in humans is Ago2. Ago2 loaded with a miRNA is defined as the RNA-induced silencing complex (RISC) [17, 18]. RISC is the effector miRNA complex that targets genes for silencing via translation inhibition or mRNA destabilization.

miRNA genes reside in both intergenic and intragenic regions of the genome [19, 20]. Intergenic miRNAs are transcribed by RNA polymerase II or III as independent transcription units. Intragenic miRNAs on the sense strand are frequently transcribed as part of the host gene. This bicistronic gene structure implies co-regulation of the miRNA and the host gene. A subclass of miRNAs that reside in introns on the sense strand, known as mirtrons, is produced via splicing independent of Drosha processing [21, 22]. There are also instances of pre-miRNAs overlapping exon–intron junctions that are produced by either Drosha or pre-mRNA splicing [23].

To date, the majority of known exonic miRNAs are encoded by exons of long non-coding RNAs [19, 20, 24]. Early miRNA studies [19, 20], which predated the explosion of next-generation RNA-seq studies and consortia like GTEx [25], identified a dearth of exonic miRNAs in protein-coding genes. A subsequent analysis in 2013 identified four human and eleven mouse miRNAs (fifteen total) encoded by exons of protein-coding genes [24] (Table 1). Furthermore, experimental studies have identified and characterized some miRNAs encoded by exons of protein-coding genes (Fig. 1). Examples include miR-198 encoded by the 3’-UTR of FSTL1 [26], miR-3618 encoded by the 5’-UTR of DGCR8 [27], miR-1306 in the first coding exon of DGCR8, and miR-147b encoded by the 3’-UTR of C15orf48, also known as MISTRAV [28, 29]. In general, small RNAs processed from protein-coding transcripts are especially interesting because their processing is predicted to destabilize the host mRNA, thereby inactivating any activities encoded by the intact transcript [26,27,28]. Collectively, the gene structure of these potential bicistronic transcripts and the experimental studies noted (Table 1) suggest post-transcriptional co-regulation of exonic miRNAs and their host genes.

Table 1 Human and mouse small RNAs encoded by exons of protein-coding genes known prior to this study
Fig. 1

Examples of miRNAs that reside in exons of protein-coding genes. A FSTL1 is a known miRNA protein-coding host gene that encodes primate-specific mir-198 in its 3’-UTR. Both gene products are associated with wound healing [26]. B The miRNA microprocessor cofactor DGCR8 is a known miRNA host gene that encodes a miRNA in its 5’-UTR (mir-3618) and first coding exon (mir-1306) [27]. Processing of these miRNAs is linked to the regulation of DGCR8 activity. C The C15orf48 locus is a known miRNA host gene, which encodes a protein that shapes antiviral responses termed MISTRAV, and encodes mir-147b in its 3’-UTR [28]

Given the exponential growth of transcriptome data, we sought to identify small RNAs that reside in exons of protein-coding genes in the greatly expanded and curated transcript datasets. In human and mouse genomes, we find 201 small RNAs embedded within exons of protein-coding genes, of which 23% (46 small RNAs) display characteristics common to miRNAs based on MirGeneDB. The majority (96%) of the host gene-small RNA relationships have not been previously documented. We identify nearly fifty human and mouse small RNAs that reside in coding exons of protein-coding host genes. Interestingly, several of these candidate protein-coding host genes have established roles in immunity, such as the major histocompatibility cofactor B2M [30, 31] and the antiviral factor ZAP [32, 33]. Our analysis generates a resource of putative host genes, including protein-coding transcripts, for human and mouse small RNAs.

Materials and methods

Bioinformatic analysis of exonic small RNAs

Genomic coordinate data were obtained from the UCSC Genome Browser Table Browser (https://genome.ucsc.edu/cgi-bin/hgTables; last accessed Aug. 07, 2024). The NCBI RefSeq [34] track was selected. GRCh38/hg38 and GRCm38/mm10 were selected for human and mouse, respectively. Small RNA genomic coordinates were obtained from miRBase 22 (https://mirbase.org/download/; last accessed Aug. 07, 2024). gff3 files were downloaded for each species (human: hsa.gff3; mouse: mmu.gff3). RefSeq mRNA data were used to map miRBase-annotated pre-miRNAs to protein-coding genes. All data were processed using pandas (Python library) and bash commands. BEDTools [35] was used to classify miRBase-annotated pre-miRNAs as exonic, intronic, or intergenic and to determine their strandedness relative to the host gene. Any miRBase-annotated pre-miRNA sequence completely overlapping with an exon was assigned to exon-derived small RNAs, whereas intron-derived small RNA assignment required all coordinates to reside within an annotated intron. Non-coding transcripts were defined as RefSeq transcripts having the same coordinates for CDS start and CDS end (Fig. 2).
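The containment rules described above can be illustrated with a minimal sketch. It is not the deposited pipeline: the coordinates are invented, the function names (`is_noncoding`, `introns_from_exons`, `classify`) are hypothetical, intervals are assumed to be 0-based and half-open as in UCSC tables, and strand matching is omitted for brevity.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end); 0-based, half-open (assumed)

def is_noncoding(cds_start: int, cds_end: int) -> bool:
    """A transcript is treated as non-coding when CDS start equals CDS end."""
    return cds_start == cds_end

def introns_from_exons(exons: List[Interval]) -> List[Interval]:
    """Introns are the gaps between consecutive exons of one transcript."""
    ordered = sorted(exons)
    return [(left[1], right[0]) for left, right in zip(ordered, ordered[1:])]

def contained(inner: Interval, outer: Interval) -> bool:
    """True when `inner` lies entirely within `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def classify(pre_mirna: Interval, exons: List[Interval]) -> str:
    """Label a pre-miRNA relative to a single transcript's exon structure."""
    if any(contained(pre_mirna, exon) for exon in exons):
        return "exonic"
    if any(contained(pre_mirna, intron) for intron in introns_from_exons(exons)):
        return "intronic"
    return "other"  # e.g. spans an exon-intron junction or lies outside the transcript

# Hypothetical two-exon transcript
example_exons = [(100, 200), (300, 400)]
labels = [classify(iv, example_exons) for iv in [(120, 180), (220, 280), (190, 210)]]
```

In this toy case the three intervals are labelled exonic, intronic, and other, respectively; in practice the same test would be applied per transcript and per strand.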

Fig. 2

Pipeline for assignment of small RNA and host-gene relationships. A Schematic of analysis integrating genomic annotations and small RNAs present in miRBase and MirGeneDB. Final output files are colored in blue, and files that are further processed are colored in red. B-D Schematics illustrating the inclusive classification logic used in the analysis pipeline to identify as many instances as possible of small RNAs residing in exons of protein-coding mRNAs. See Supplementary file 1 for specific examples. B If the location of the pre-miR overlapped with exonic and intronic transcriptional units in the same orientation, the small RNA was assigned as exonic. C If the small RNA location overlapped with a locus that generated protein-coding and non-coding transcripts, the small RNA was assigned protein-coding. D If the small RNA overlapped with an alternative upstream or downstream exon of a transcribed unit, the small RNA was assigned exonic instead of intergenic

Each small RNA was assigned to only one class; however, class assignment was inclusive rather than exclusive with respect to identifying instances of small RNAs residing in exons of protein-coding mRNAs. As needed, additional curation was carried out to obtain small RNA class estimates when the host gene is produced from a locus that generates both protein-coding and non-coding transcripts, which may also include upstream or downstream exons, as well as for additional unique cases. Some scenarios and specific examples are noted here (Fig. 2B, C, D, Supplementary file 1). Instance 1: For a sense-strand small RNA in an intron of a protein-coding locus that produces coding and non-coding transcripts, the small RNA was assigned intronic and coding (example: hsa-mir-26b/CTDSP1). Instance 2: For a sense-strand small RNA in a downstream alternative exon of a non-coding transcript that shares some exonic overlap with protein-coding transcripts that lack the alternative exon, the small RNA was assigned non-coding and exonic (hsa-mir-6129/ZNF652). Instance 3: For a protein-coding locus that generates non-coding transcripts with additional upstream or downstream exons, where the sense-strand small RNA resides in the upstream/downstream intron, the small RNA was assigned intronic and non-coding (hsa-mir-4786/NDUFA10/NR_136158.2). Instance 4: For a non-coding mRNA derived from a fusion transcript of two protein-coding genes, where the small RNA is in an intron, the assignment was protein-coding and intronic (hsa-mir-628/DNAAF4-CCPG1/NR_037923.1).
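One way to picture the single-class, inclusive assignment is as a ranking over all same-strand transcripts that a small RNA overlaps. The sketch below is an interpretation, not the deposited code: the priority order is an assumption inferred from Instances 1 to 3, the names `PRIORITY` and `assign_class` are hypothetical, and manually curated cases such as Instance 4 are not captured by it.

```python
# Rank of (location, coding status); the lowest rank wins.
# This ordering is an ASSUMPTION inferred from Instances 1-3 in the text.
PRIORITY = {
    ("exonic", "coding"): 0,
    ("exonic", "non-coding"): 1,
    ("intronic", "coding"): 2,
    ("intronic", "non-coding"): 3,
}

def assign_class(hits):
    """Pick one class from per-transcript hits.

    hits: list of (location, coding_status) tuples, one for each
    same-strand transcript that overlaps the small RNA.
    """
    if not hits:
        return ("intergenic", None)
    return min(hits, key=lambda hit: PRIORITY[hit])

# Instance 1: intron shared by coding and non-coding transcripts
instance_1 = assign_class([("intronic", "non-coding"), ("intronic", "coding")])
# Instance 2: alternative exon present only in a non-coding transcript
instance_2 = assign_class([("exonic", "non-coding")])
# Instance 3: intron present only in non-coding transcripts
instance_3 = assign_class([("intronic", "non-coding")])
```

Ranking exonic hits ahead of intronic ones, and coding ahead of non-coding within each location, is what makes the assignment inclusive toward exons of protein-coding mRNAs while still returning exactly one class per small RNA.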

To gain additional insights, the small RNA classes assigned using the miRBase entries were further analyzed using entries in MirGeneDB (mirgenedb.org). MirGeneDB [36,37,38,39] contains a more extensively curated set of miRNAs than miRBase, which has limitations for miRNA classification because it includes instances of tRNA, rRNA, RNA derived from other classes, and potentially other by-products originating from transcriptional noise or RNA quality control issues [36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51]. All code is deposited on GitHub (https://github.com/tyronchang1/Exonic-microRNA-analysis).

Sequence analysis

Sequences were retrieved from NCBI and aligned using MUSCLE in Geneious Prime (2025.0.3; www.geneious.com). Sequence accession numbers from NCBI (last accessed Jan. 1, 2025) are included in supplemental files and figures. Sequence logos were generated using WebLogo (https://weblogo.berkeley.edu/logo.cgi). Gene diagrams and conservation tracks were downloaded from the UCSC Genome Browser (https://genome.ucsc.edu/index.html) and edited in Adobe Illustrator.

Retrieval of known disease variants for candidate protein-coding host genes

Variants for candidate protein-coding host genes were downloaded from ClinVar (https://www.ncbi.nlm.nih.gov/clinvar/; last accessed Dec. 16, 2024) using the following filters: review status (multiple submitters) and molecular consequence (frameshift, missense, nonsense, splice site, and UTR). Our analysis and reporting only involved variants annotated as either pathogenic or likely pathogenic (Table 2, Supplementary File 6).
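The variant filter described above amounts to two set-membership tests. The following sketch assumes the variants have already been downloaded with the review-status filter applied; the records, gene names, and field names (`significance`, `consequence`) are hypothetical and do not reflect ClinVar's actual export schema.

```python
# Hypothetical records; field names are illustrative only.
variants = [
    {"gene": "GENE_A", "significance": "Pathogenic", "consequence": "missense"},
    {"gene": "GENE_A", "significance": "Benign", "consequence": "missense"},
    {"gene": "GENE_B", "significance": "Likely pathogenic", "consequence": "UTR"},
    {"gene": "GENE_C", "significance": "Uncertain significance", "consequence": "frameshift"},
    {"gene": "GENE_C", "significance": "Pathogenic", "consequence": "synonymous"},
]

KEEP_SIGNIFICANCE = {"pathogenic", "likely pathogenic"}
KEEP_CONSEQUENCE = {"frameshift", "missense", "nonsense", "splice site", "utr"}

def is_reported(variant: dict) -> bool:
    """Keep pathogenic or likely pathogenic variants with a selected consequence."""
    return (
        variant["significance"].lower() in KEEP_SIGNIFICANCE
        and variant["consequence"].lower() in KEEP_CONSEQUENCE
    )

reported = [v for v in variants if is_reported(v)]
genes_with_variants = sorted({v["gene"] for v in reported})
```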

Table 2 Candidate protein-coding host genes identified here with disease variants in ClinVar

GTEx analysis of exon-derived human small RNAs present in MirGeneDB and their candidate protein-coding host genes

Expression data for both exonic small RNAs and their host genes were retrieved from the GTEx v10 bulk RNA-seq dataset via the GTEx Portal (https://gtexportal.org/home/downloads/adult-gtex/bulk_tissue_expression). All datasets were last accessed on 05/26/25. miRNA expression: miRNA_TPM_matrix_PORTAL_2025_03_17.txt.gz. Protein-coding host gene expression: GTEx_Analysis_v10_RNASeQCv2.4.2_gene_tpm.gct. For downstream analysis, average TPM values for exonic small RNAs identified by our analysis present in MirGeneDB [39] and their candidate protein-coding host genes were computed across tissues using NumPy [52], followed by log₂ transformation. Data wrangling was performed with pandas (https://pandas.pydata.org/), and visualizations were generated with Matplotlib [53] and Seaborn [54].
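The averaging and transformation step can be sketched as follows. The TPM values and tissue labels are invented, the function name `mean_log2_tpm` is hypothetical, and the pseudocount of 1 is an assumption added here to avoid taking the logarithm of zero; the text does not state whether one was used.

```python
import numpy as np

# Hypothetical TPM values for one small RNA: one entry per sample.
tpm = np.array([10.0, 14.0, 0.0, 2.0, 30.0, 34.0])
tissue = np.array(["Liver", "Liver", "Spleen", "Spleen", "Lung", "Lung"])

def mean_log2_tpm(tpm, tissue, pseudocount=1.0):
    """Average TPM within each tissue, then log2-transform the means.

    The pseudocount is an ASSUMPTION made for this sketch.
    """
    result = {}
    for name in np.unique(tissue):
        mean_tpm = tpm[tissue == name].mean()
        result[str(name)] = float(np.log2(mean_tpm + pseudocount))
    return result

per_tissue = mean_log2_tpm(tpm, tissue)
```

Averaging before the log transform, as described in the text, gives a different value than averaging log-transformed samples, so the order of the two steps matters when reproducing the figures.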

Linear regression analysis

Mean log₂TPM values for human small RNAs and candidate protein-coding host genes were calculated using NumPy [52]. Small RNAs analyzed were selected based on their expression profile in the GTEx data (Fig. 8), the identity of the host gene, and their presence in MirGeneDB (Supplementary File 4). Linear regression analyses for each small RNA–host gene pair were performed with the SciPy package [55]. p-values were obtained from two-sided t-tests implemented in SciPy. All plots were generated using Matplotlib and Seaborn.
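A per-pair regression of this kind can be sketched with `scipy.stats.linregress`, which returns the slope, intercept, correlation coefficient, and a two-sided p-value for the null hypothesis of zero slope. The values below are invented for illustration, and whether the deposited code uses this exact function is an assumption.

```python
import numpy as np
from scipy import stats

# Hypothetical mean log2 TPM values, one entry per tissue.
host_gene = np.array([2.1, 3.4, 4.0, 5.2, 6.1, 7.3])
small_rna = np.array([0.9, 1.8, 2.1, 2.9, 3.2, 4.1])

# Ordinary least-squares fit of small RNA expression on host gene expression.
fit = stats.linregress(host_gene, small_rna)
r_squared = fit.rvalue ** 2

print(f"slope={fit.slope:.3f} intercept={fit.intercept:.3f} "
      f"R2={r_squared:.3f} p={fit.pvalue:.2e}")
```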

Results

Several small RNAs reside in exons of protein-coding transcripts

While there are some known instances of miRNAs encoded by exons of protein-coding genes (Fig. 1A–C) [26,27,28], the overall number identified to date is limited to a few cases (Table 1). As similar transcripts with a bicistronic structure potentially serve as a resource to provide insights into specific host genes, small RNAs, and gene regulation, we set out to identify additional potential instances of protein-coding host genes encoding exonic small RNAs (Fig. 2). We started our analysis with miRBase [56], which contains both well-studied and poorly characterized small RNAs. While miRBase is also known to contain entries that are derived from sources other than miRNAs, such as tRNA, rRNA, and transcriptional noise [36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51], we reasoned that subsequent filtering using MirGeneDB [39] could aid in discriminating miRNAs. MirGeneDB contains the most up-to-date curation of miRNAs. In addition, small RNAs processed from a larger protein-coding transcript could still be of general interest independent of the small RNA fate or activity post-processing, as they may point to an ill-defined regulatory element. To this end, we mapped all annotated human (hsa.gff3) and mouse (mmu.gff3) pre-miRNAs from miRBase to the corresponding genomes (Fig. 3A). For the purpose of identifying host genes, small RNAs present in the miRBase dataset may be referred to herein using an annotation that includes a miR prefix.

Fig. 3

Informatics pipeline to identify small RNAs encoded by exons of protein-coding transcripts. A Human and mouse small RNAs were downloaded from miRBase 22.1 (mirbase.org) [56] and mapped to the corresponding genome using BEDTools [35]. For additional analysis, miRNAs were retrieved from MirGeneDB [36,37,38,39]. B Intragenic and intergenic miRNAs were identified based on their overlap with RefSeq mRNAs. C Intragenic miRNAs were screened for their orientation relative to the host mRNA and retained for further processing. D Same-sense miRNAs were classified into coding and non-coding host genes followed by subclassification into exonic versus intronic. E Identified exonic miRNAs mapping to protein-coding mRNAs were further analyzed in this study. Classifications including candidate host genes and transcript accession numbers are in Supplementary File 2 for human small RNAs from miRBase, Supplementary File 3 for mouse small RNAs from miRBase, Supplementary File 4 for human miRNAs from MirGeneDB, and Supplementary File 5 for mouse miRNAs from MirGeneDB. Host loci with multiple transcripts were filtered to only one transcript to obtain estimated host gene-miRNA class counts. Due to alternative splicing and multiple transcript architectures, classifications should be considered inclusive and not exclusive (see Methods). Hs – Homo sapiens. Mm – Mus musculus

To map pre-miRNAs to protein-coding transcripts, we used RefSeq mRNA data [34]. RefSeq consists of ~200,000 human and ~200,000 mouse protein-coding and non-coding transcripts, many of which are alternatively spliced mRNAs. To identify small RNAs potentially residing in protein-coding exons, we first classified human and mouse small RNAs from miRBase and MirGeneDB into successive groups based on their genomic location and strandedness relative to host genes: Group 1, intragenic (Fig. 3B); Group 2, intragenic and in sense orientation relative to the host gene (Fig. 3C); Group 3, sense orientation in protein-coding mRNAs (Fig. 3D); and Group 4, sense orientation in exons of protein-coding mRNAs with no overlap with host gene introns (Fig. 3D, E, Fig. 4), hereafter referred to as exonic small RNAs.
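Because each group is a subset of the previous one, the grouping can be viewed as a chain of filters. The sketch below uses invented records and hypothetical field names (`intragenic`, `same_strand`, `coding_host`, `exonic`); it illustrates the nesting only and is not the deposited pipeline.

```python
# Hypothetical annotation records; names and flags are illustrative only.
records = [
    {"name": "sRNA_1", "intragenic": True,  "same_strand": True,  "coding_host": True,  "exonic": True},
    {"name": "sRNA_2", "intragenic": True,  "same_strand": True,  "coding_host": True,  "exonic": False},
    {"name": "sRNA_3", "intragenic": True,  "same_strand": False, "coding_host": True,  "exonic": True},
    {"name": "sRNA_4", "intragenic": False, "same_strand": False, "coding_host": False, "exonic": False},
    {"name": "sRNA_5", "intragenic": True,  "same_strand": True,  "coding_host": False, "exonic": True},
]

def successive_groups(records):
    """Apply the four filters in order; each group is a subset of the previous one."""
    group1 = [r for r in records if r["intragenic"]]
    group2 = [r for r in group1 if r["same_strand"]]
    group3 = [r for r in group2 if r["coding_host"]]
    group4 = [r for r in group3 if r["exonic"]]
    return group1, group2, group3, group4

groups = successive_groups(records)
counts = [len(g) for g in groups]
exonic_small_rnas = [r["name"] for r in groups[3]]
```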

Fig. 4

Numerous small RNAs, with a subset being Drosha-dependent, have overlapping coordinates with coordinates of exons for protein-coding mRNAs. A The distribution of 118 human small RNAs from miRBase (29 miRNAs from MirGeneDB, light blue) mapping to UTR and CDS exons of protein-coding mRNAs. B The distribution of 83 mouse small RNAs from miRBase (17 miRNAs from MirGeneDB, light blue) mapping to UTR and CDS exons of protein-coding mRNAs. The percent represents the number of small RNAs in that type of exon relative to the total number of small RNAs in exons of protein-coding genes. The number of miRNAs in each exon class is in parentheses. C Human small RNAs from miRBase, with those also present in MirGeneDB highlighted in light blue, identified by our analysis that map to exons of protein-coding mRNAs and display published evidence for Drosha-dependent or -independent processing [58, 59]. Pre-miRNAs are listed in alphabetical order relative to their host gene listed in parentheses. CDS – coding DNA sequence; UTR – untranslated region; hsa – Homo sapiens

The small RNA-host gene assignments that follow are presented as the percent and number of small RNA entries from miRBase, followed by the number of entries overlapping with MirGeneDB (Fig. 3). The percentages are relative to the total number of starting human pre-miRNAs in miRBase (1913). Of these 1913 human pre-miRNAs, 79.4% (1519 small RNAs; 414 in MirGeneDB) reside in intragenic regions. In agreement with previous observations and suggestive of co-transcriptional regulation, more than half of all miRBase human pre-miRNAs (64.6%, 1235; 355 in MirGeneDB) are in the same orientation as the resident gene [19, 20] (Supplementary File 2, Supplementary File 4). Of these 1235 same-orientation intragenic small RNAs, 904 (47.2% of all miRBase human pre-miRNAs; 205 in MirGeneDB) are located in annotated protein-coding transcripts, counting exonic plus intronic. Interestingly, ~6% (118) of all miRBase human small RNAs analyzed (29 in MirGeneDB) reside in exons of protein-coding transcripts in the same orientation. Consistently, our analysis using miRBase identified all five previously reported miRNAs residing in exons of protein-coding transcripts (hsa-miR-21, hsa-miR-147b, hsa-miR-198, hsa-mir-1306, hsa-mir-3618), with 96% of the host gene-small RNA relationships not previously well appreciated. Note that hsa-miR-198 and hsa-mir-3618 are not present in MirGeneDB.

The other human pre-miRNA/host gene subclass numbers identified are as follows (Fig. 3): intronic and coding transcripts, 41.1% (786 miRBase pre-miRNAs; 176 in MirGeneDB); exonic and non-coding mRNAs, 4.3% (83 miRBase pre-miRNAs; 54 in MirGeneDB); and intronic and non-coding mRNAs, 9.6% (183 miRBase pre-miRNAs; 90 in MirGeneDB). Our analysis of the mouse genome identified comparable distributions of these small RNA-host gene relationship classes (Fig. 3B-D). Specifically, of the 1226 mouse pre-miRNAs in miRBase, 79% (968 pre-miRNAs; 274 in MirGeneDB) are intragenic, with 67.0% (822 miRBase pre-miRNAs; 230 in MirGeneDB) of all mouse small RNAs in the same orientation as the host gene (Supplementary File 3, Supplementary File 5). Notably, ~7% (83 miRBase pre-miRNAs; 17 in MirGeneDB) reside in exons of protein-coding genes. Thus, we have identified several new cases where the genomic coordinates of small RNAs overlap with coordinates for exons of protein-coding mRNAs.
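The percentages reported for each class are simple ratios against the species totals and can be recomputed from the counts given in the text. The sketch below does this with a hypothetical helper `pct`; values are rounded to one decimal place, so a figure may differ in the last digit from the text where a different rounding convention was used.

```python
HUMAN_TOTAL = 1913  # human pre-miRNAs in miRBase (from the text)
MOUSE_TOTAL = 1226  # mouse pre-miRNAs in miRBase (from the text)

human_counts = {
    "intragenic": 1519,
    "same orientation": 1235,
    "coding host (exonic + intronic)": 904,
    "exonic, coding": 118,
    "intronic, coding": 786,
    "exonic, non-coding": 83,
    "intronic, non-coding": 183,
}
mouse_counts = {
    "intragenic": 968,
    "same orientation": 822,
    "exonic, coding": 83,
}

def pct(count: int, total: int) -> float:
    """Percentage of the species total, rounded to one decimal place."""
    return round(100 * count / total, 1)

human_pct = {label: pct(n, HUMAN_TOTAL) for label, n in human_counts.items()}
mouse_pct = {label: pct(n, MOUSE_TOTAL) for label, n in mouse_counts.items()}

# Exonic and intronic small RNAs in coding hosts sum to the coding-host total.
coding_host_check = human_counts["exonic, coding"] + human_counts["intronic, coding"]
```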

Small RNAs map to both untranslated and coding exons of protein-coding genes

We next assessed where in protein-coding transcripts exonic human and mouse small RNAs are located: 5’-UTR, CDS, or 3’-UTR. We found that 30 (25.4%) human miRBase pre-miRNAs reside in 5’-UTR sequences (5 in MirGeneDB), 25 (21.2%) in CDS exons (4 in MirGeneDB), and 58 (49.2%) in 3’-UTR sequences (20 in MirGeneDB) (Fig. 4A, Supplementary File 2, Supplementary File 4). During this analysis, we also identified four pre-miRNAs overlapping the coding sequence and 3’-UTR (FAM89A/hsa-mir-1182, TBC1D17/hsa-mir-4750, ATF5/hsa-mir-4751, RPL28/hsa-mir-6805), and one instance of a pre-miRNA overlapping the 5’-UTR and coding sequence (HSP90B1/hsa-mir-3652); none of these instances were present in MirGeneDB. Host genes with pre-miRNAs residing in CDS exons include factors associated with development, like HOXD1 (hsa-mir-7704), and with stress adaptation, such as the HSP90 co-chaperone CDC37 (hsa-mir-1181) [57] (Fig. 5, Supplementary File 2, Supplementary File 4). Of these human exonic small RNAs in protein-coding host genes, 27% (32; 15 in MirGeneDB) display published evidence of Drosha-dependent processing [58, 59], and eight exonic miRNAs display evidence for processing independent of Drosha (0 in MirGeneDB) (Fig. 4C).

Fig. 5

Examples of human and mouse small RNAs mapping to different classes of exons of protein-coding transcripts. A Examples of miRBase pre-miRNAs that map to 5’-UTR exons of protein-coding mRNAs. B miRBase pre-miRNAs that map to UTR and CDS sequences. C Examples of miRBase pre-miRNAs that map to 3’-UTR exons. D All identified miRBase human pre-miRNAs that map entirely within a coding (CDS) exon. E All identified miRBase mouse pre-miRNAs that map entirely within a coding (CDS) exon. Pre-miRNAs are listed in alphabetical order relative to their host gene in parentheses with those also present in MirGeneDB highlighted in light blue. Classifications including candidate protein-coding host genes and transcript accession numbers are in Supplementary File 2 and Supplementary File 4 for human small RNAs and Supplementary File 3 and Supplementary File 5 for mouse small RNAs. hsa – Homo sapiens; mmu – Mus musculus

For the 83 mouse miRBase pre-miRNAs in exons of protein-coding genes in the same orientation, we identified 26 (31.3%) pre-miRNAs in annotated 5’-UTR sequence (4 in MirGeneDB), 21 (25.3%) in CDS exons (4 in MirGeneDB), and 32 (38.6%) in 3’-UTR sequence (9 in MirGeneDB) (Figs. 4B and 5, Supplementary File 3, Supplementary File 5). Similar to human, we identified three mouse pre-miRNAs overlapping 3’-UTR and coding sequences (Clcn7/mmu-mir-12188, Scd2/mmu-mir-5114, Rps6ka4/mmu-mir-5046) and one miRNA overlapping the 5’-UTR and coding sequence (Tlcd1/mmu-mir-7653); none of these instances were present in MirGeneDB. Eight miRBase pre-miRNAs in exons across seven host genes are common to human and mouse [mir-147b (C15orf48/AA467197), mir-24-1 (AOPEP), mir-21 (VMP1), mir-3618 (DGCR8) with four in CDS: mir-1306 (DGCR8), mir-935 (CACNG8), mir-671 (CHPF2), and mir-1199 (MISP3)]. Five out of seven of the aforementioned small RNAs are present in MirGeneDB with mir-1199 (MISP3) and mir-3618 (DGCR8) not present.

Next, we examined evolutionary conservation for select exonic small RNAs based on their candidate host gene using sequences available in the database and identified by BLAST. Here, we were interested in sequence conservation not only across the predicted 22-mer but also the predicted seed sequence (nucleotides: 2–8), which is known to mediate base pair interactions with target mRNAs at MREs [1, 2, 60, 61]. This analysis revealed varying levels of sequence conservation including turnover reflecting appreciated evolutionary histories for small RNAs like miRNAs [24, 62, 63]. First, we examined hsa-mir-10393 which is located in the 3’-UTR of B2M, a protein that is a key component of the MHC complex [30, 31, 64] (Fig. 6A, B). Our analysis showed that hsa-miR-10393-3p displays sequence conservation, particularly in the seed sequence, in mammals but has seemingly degenerated in mice (Fig. 6A). miR-21-5p, which has been well-studied [65, 66], resides in the 3’-UTR of the autophagy protein VMP1. In this instance, the entire miR-21-5p sequence seems well-conserved in vertebrates (Fig. 6C, D). Another interesting example is miR-10399 (Fig. 6E and F), which is encoded by the antiviral factor ZAP [32, 33]. hsa-mir-10399 is conserved in mammals but the predicted seed sequence has also diverged in mice but not in cat or armadillo.
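The seed comparison described above reduces to slicing nucleotides 2–8 from each aligned mature sequence and scoring identity. The sketch below uses invented 22-nt sequences (not real miRNA sequences) and hypothetical helper names `seed` and `identity`; it assumes the sequences are already aligned and gap-free.

```python
def seed(mature: str) -> str:
    """Return nucleotides 2-8 (1-based) of a mature small RNA sequence."""
    return mature[1:8]

def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Invented 22-nt sequences for illustration only.
species_a = "UACGGAUUCCGAUAGCUAGCAA"
species_b = "UACGGAUUCCGAUAGCUUGCAA"  # one substitution outside the seed

seed_identity = identity(seed(species_a), seed(species_b))
full_identity = identity(species_a, species_b)
```

Scoring the seed separately from the full 22-mer makes it possible to distinguish cases where the whole sequence is conserved from cases where the seed has diverged in one lineage.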

Fig. 6

Small RNAs identified that map to exons of protein-coding transcripts display evolutionary conservation. A mir-10393 maps to the 3’-UTR of B2M, a component of MHC. B Predicted mir-10393 hairpin. C mir-21 maps to the 3’-UTR of VMP1, an autophagy factor [65, 66]. D Predicted mir-21 hairpin structure. E mir-10399 maps to the 3’-UTR of ZAP, an antiviral factor [32, 33]. F Predicted mir-10399 hairpin. The most abundant small RNA strand in miRBase is colored red in the predicted hairpin. The 100 vertebrate conservation track (cons.) is derived from the UCSC genome browser. The sequence logo was generated using the aligned sequences shown. In the alignment, the small RNA sequence is in black and the predicted seed sequence (nucleotides 2–8) in red

Other small RNA-host gene relationships potentially of interest included host protein-coding genes that have known disease variants, as these genes are often well-studied and biomedically relevant (Table 2, Supplementary File 6). One candidate host gene we identified is PTEN-induced kinase 1 (PINK1), which is a master regulator of mitochondrial quality control and is associated with Parkinson’s disease [67, 68]. The PINK1 gene encodes the ill-defined hsa-mir-6084 in its first coding exon (Fig. 7A, B). miR-6084-3p, which is the most highly expressed strand according to miRBase, is highly conserved, with 100% sequence identity in the predicted seed sequence extending to marsupials. Another interesting candidate host gene is TMEM94, which encodes hsa-mir-6785 (Fig. 7C, D). TMEM94, also known as ERMA, is an ER-resident protein that is a P-type ATPase transporter important in Mg2+ uptake [69]. hsa-miR-6785-5p is highly conserved to Old World monkeys. Mutations in TMEM94 are associated with neurodevelopmental delay in multiple unrelated individuals [70, 71]. Another noteworthy example is hsa-mir-4709, which is encoded in the 3’-UTR of the Niemann-Pick disease, type C2 (NPC2) gene (Fig. 7E, F). NPC disease is a lysosomal storage disorder [72]. miR-4709-3p is highly conserved to New World monkeys. While gene ontology analysis of candidate protein-coding host genes did not reveal any enrichment for biological process, molecular function, or cellular component, the noted examples do indicate that several of the host genes have been implicated in immune functions and stress responses in published work. Altogether, we have identified numerous predicted pre-miRNAs, many of which are also in MirGeneDB, that map to exons of protein-coding transcripts, with a subset displaying evidence of Drosha-dependent processing and evolutionary conservation patterns consistent with a functional seed sequence.

Fig. 7

Examples of small RNAs that map to exons of protein-coding genes implicated in human disease. A mir-6084 maps to the first coding exon of PINK1, a mitochondrial quality control factor mutated in Parkinson’s disease [67, 68, 74]. B Predicted mir-6084 hairpin. C mir-6785 maps to a coding exon of TMEM94, an ER protein that acts as a Mg2+ transporter [69] and is mutated in a type of rare syndromic intellectual disability [70, 71]. D Predicted mir-6785 hairpin. E mir-4709 maps to the 3’-UTR of NPC2, which encodes a protein that regulates cholesterol transport. Mutations in NPC2 are associated with a lysosomal storage disorder [72]. F Predicted mir-4709 hairpin. The most abundant mature small RNA strand in miRBase is colored red in the predicted hairpin panels. The 100 vertebrate conservation track (cons.) is derived from the UCSC genome browser. The sequence logo was generated using the sequences shown. In the alignment, the small RNA sequence is in black and the predicted seed sequence (nucleotides 2–8) in red. snubM – snub-nosed monkey

Expression of several exonic small RNAs across human tissues correlates with expression of their corresponding candidate protein-coding host genes

To further characterize the small RNAs of interest here, we analyzed expression of the exonic small RNAs for which data were available (25 human pre-miRNAs; 45 total miRNAs), restricted to instances present in both the miRBase and MirGeneDB subsets (Fig. 8, Supplementary File 4, Supplementary File 7). Specifically, we leveraged data from the Genotype-Tissue Expression (GTEx) resource, consisting of expression data across fifty-seven human tissues including bladder, pancreas, liver, and spleen, among others [25]. The analysis included both 5p and 3p strands if the data were present. This analysis showed that several of these exonic small RNAs are constitutively expressed in many tissues, such as hsa-mir-197-3p (GNAI3), hsa-mir-652-3p (TMEM164), and hsa-mir-423-5p (NSRP1), whereas others like hsa-mir-935 (CACNG8) displayed more variable expression across tissues.

Fig. 8

RNA expression levels of mature exonic small RNAs across human tissues. Mean expression levels (log₂ TPM) of 45 mature exonic small RNAs present in both miRBase and MirGeneDB across 57 human tissues. Each tissue expression value represents the mean expression across multiple individuals. 5p/3p labels denote the small RNA arm of origin. Small RNAs with similar expression patterns are grouped by clustering. TPM: transcripts per million. Data downloaded for analysis from GTEx [25]

To examine whether exonic small RNA expression correlated with expression of their candidate host gene, we performed regression analysis using expression data from GTEx (Fig. 9, Supplementary File 8). A correlation would be consistent with the protein-coding gene serving as a precursor transcript for the small RNA. Specifically, we analyzed expression patterns for ten exonic small RNA-putative host gene pairs selected based on their tissue expression profile (Fig. 8, Supplementary File 7). For small RNAs where data were available for both 5p and 3p strands (8 out of 10), one strand was often expressed at noticeably higher levels than the other, a known hallmark of miRNAs. Six of the ten small RNA-host gene pairs examined displayed statistically significant correlation with host gene expression (p < 0.01: hsa-mir-197-3p (GNAI3); p < 0.0001: hsa-mir-24-1-5p (AOPEP), hsa-mir-935 (CACNG8), hsa-mir-21-3p (VMP1), hsa-mir-147b-3p (C15orf48), and hsa-mir-149-5p (GPC1)). This analysis suggests that a subset of the protein-coding genes identified here display an expression pattern consistent with the gene serving as a small RNA host gene.

Fig. 9

Correlation between exonic small RNA and candidate protein-coding host gene expression across human tissues. Linear regression models depict the relationship between selected exonic small RNAs, which are present in both miRBase and MirGeneDB, and their host genes across 54 human tissues. Each point represents the mean log₂TPM of both the small RNA and putative host gene in a given tissue. Small RNA strands are color-coded by arm of origin: blue for 5p, red for 3p, and black when unspecified. p-values and R2 values for each regression are shown within each panel. TPM: transcripts per million

Discussion

Small RNAs like miRNAs are key players in gene regulation and their dysregulation is linked to a range of human diseases [73]. The biogenesis of miRNAs is associated with the nature of their transcriptional unit [20] and their location in it, whether exonic or intronic [21, 22]. The coding potential of the host gene has implications for exonic small RNA processing, as exemplified by FSTL1/miR-198 [26], DGCR8/miR-1306 and miR-3618 [27], and MISTRAV (C15orf48)/miR-147b [28]. Previously, only a limited number of small RNAs that reside in exons of protein-coding genes [20, 24] have been reported (Table 1). Here, we have uncovered a total of 201 human and mouse small RNAs, including 46 instances that are in the stringently curated miRNA database MirGeneDB [36,37,38,39], whose genomic coordinates overlap with coordinates of exons of protein-coding genes (Fig. 3D, Supplementary File 2–5). Relatedly, 27% (32 of 118) of the human exonic small RNAs show evidence for processing by Drosha (Fig. 4C). Eight of these exonic small RNAs are common to both human and mouse genomes. Our findings markedly increase the number of exonic small RNA-protein-coding host gene relationships, with 96% of the cases not being previously appreciated.

While many of the exonic small RNAs here display evidence supporting that they may function as miRNAs, many do not. For those exonic small RNAs identified that do not display all of the hallmarks of canonical miRNAs, it is possible that the small RNA displays atypical miRNA features such that it is not classified as a miRNA. For instance, miR-198 (FSTL1), which has been shown experimentally to behave as a miRNA and is one of the initial examples of exonic miRNAs in protein-coding host genes [26], is not in MirGeneDB. Another possibility is that, if the small RNA is active, it functions by a non-miRNA mechanism. Alternatively, a non-active small RNA may be a by-product of processing of the host gene, perhaps for regulatory purposes to destabilize the encoding mRNA, similar to DGCR8 [27]. Reasonably, more detailed studies may reveal that some of the small RNAs represent false positives. Nevertheless, the expression patterns of many of the small RNAs (Fig. 8, Supplementary File 7) and candidate host genes (Fig. 9, Supplementary File 8) do suggest a relationship between a subset of the small RNAs and the identified genes.

Potential implications

Experimental studies of candidate bicistronic mRNAs could aid in uncovering novel post-transcriptional switches. In some cases, poor expression of certain exonic small RNAs under homeostasis may suggest that the production of the small RNA is regulated by additional signals. In those potential instances, inducible expression is congruent with a scenario where processing of the small RNA compromises the stability of the host protein-coding mRNA (Fig. 10). For example, although hsa-miR-147b-3p is generally expressed at low levels across human tissues (Figs. 8 and 9), this small RNA is well appreciated to be a functional miRNA that is induced by stress cues [28] such as lipopolysaccharide (LPS) [29]. Data suggest that hsa-miR-147b-3p processing can compromise the expression of the encoding protein-coding host gene (MISTRAV/C15orf48). In particular, C15orf48/hsa-miR-147b highlights that both the miRNA and host mRNA can be co-expressed when induced but also that treatment with volatile triggers shifts the RNA levels largely to the miRNA [28]. In cases where these small RNAs act via base-pairing interactions, future work investigating the exonic small RNA targets may provide new insight into the function of the protein encoded by the host gene [28]. Finally, the identity of the candidate protein-coding host genes here may have implications for the interpretation of any observed phenotypes in loss-of-function experiments [28]. Altogether, our studies serve as a resource that may shed new light on the regulation of small RNAs and their relationships with host genes, and that may be relevant to small RNA researchers and investigators studying specific host genes.

Fig. 10

Model for processing of exonic small RNAs from protein-coding host genes. Based on the gene structure and published studies of small RNAs encoded by exons of protein coding transcripts [26,27,28], processing of the small RNA may compromise the host gene mRNA and affect expression of the encoded protein