Rooting Gene Trees via Phylogenetic Networks

Tiuryn, Jerzy; Rutecka, Natalia; Górecki, Paweł

doi:10.1007/978-3-031-22105-7_37

Jerzy Tiuryn¹⁰,
Natalia Rutecka¹⁰ &
Paweł Górecki¹⁰

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13595))

Included in the following conference series:

International Computing and Combinatorics Conference

1164 Accesses
1 Citations

The original version of this chapter was revised: this chapter was previously published non-open access. The correction to this chapter is available at https://doi.org/10.1007/978-3-031-22105-7_52

Abstract

Gene trees inferred from alignments of molecular sequences are usually unrooted. Since the root of a gene tree is often the desired property, one of the most classical problems in computational biology is gene tree rooting, where the goal is to infer the most credible rooting edge in an unrooted gene tree. One way to solve it is to apply unrooted reconciliation, where the rooting edge is postulated based on a given split of a rooted species tree. Here, we address a novel variant of the rooting problem, where the gene tree root is inferred using a given phylogenetic network of the species present in the gene tree. One can apply unrooted reconciliation to obtain the best rooting, where the unrooted gene tree is jointly reconciled with a set of splits inferred from the given network. Natural candidates are splits induced by display trees of the network. However, such an approach is computationally prohibiting due to the exponential size of the set. Therefore, we propose a broader and easier-to-control set of splits based on the structural properties of the network. Next, we derive exact mathematical formulas for the rooting problem with the algorithm that runs in square time and space. We verify the algorithm’s quality based on simulated gene trees and networks.

The support was provided by National Science Centre grant #2019/33/B/ST6/00737.

You have full access to this open access chapter, Download conference paper PDF

Phylogenetic network-assisted rooting of unrooted gene trees

Article Open access 01 June 2024

Rooting Species Trees Using Gene Tree-Species Tree Reconciliation

Statistically Consistent Rooting of Species Trees Under the Multispecies Coalescent Model

Keywords

1 Introduction

In computational biology, gene trees inferred from multiple alignments of molecular sequences are usually unrooted. However, many practical applications require phylogenetic trees to be correctly rooted. Therefore, several methods have been developed for the inference of the root location in an unrooted gene tree [2, 7, 11]. There are several categories of tree rooting methods [22]. Perhaps, the most popular is outgroup rooting [13]. Another widely used type is based on branch lengths of the gene tree, which includes the mid-point rooting [4], where the root is placed at the mid-point of the longest path in the tree, minimal ancestor deviation rooting [21], and minimum variance rooting [14]. The next category represents methods of rooted phylogenetic inference; for instance, the maximum likelihood with non-reversible or non-stationary evolutionary models of evolution and models with molecular clock assumption [12, 25]. The last category is based on reconciliation methods, where the unrooted gene tree is reconciled with its known rooted species tree using various cost functions, e.g., duplication-loss, duplication-loss-transfer, or Robinson-Foulds distance [3, 6, 7]. For a broader comparison of tree rooting methods, please refer to [17, 22].

A natural extension of a species tree is a phylogenetic network, which is a mathematical model of evolutionary processes with reticulate events such as hybridization of species, recombination, or horizontal gene transfers [1, 10]. According to our knowledge, the gene tree rooting problem has never been studied under the assumption of a known phylogenetic network. Therefore, the rooting problem becomes challenging when horizontal events shaped the history of genes. Here, we propose a novel approach where the gene tree is rooted using a known phylogenetic network.

Our Contribution: We address a novel variant of the rooting problem, where the gene tree root is inferred using a given phylogenetic network of the species present in the gene tree. To obtain the best rooting edge, we apply unrooted reconciliation multiple times, where the unrooted gene tree is jointly reconciled with a set of splits inferred from the given network. We derive an exact and computationally efficient mathematical formula for the rooting problem with the algorithm that runs in square time and space. Finally, we verify the algorithm’s quality is based on simulated gene trees and networks.

2 Basic Definitions

A network on a set of species X is a directed acyclic graph $N=(V(N),E(N))$ with a single root such that: (1) the set of leaves of N is X, i.e., nodes of indegree 1 and outdegree 0, and (2) there is a directed path from the root to any other vertex. Nodes of a binary network N are divided into three groups: leaves, tree nodes (in-degree at most 1 and out-degree 2), and reticulation nodes (in-degree 2, out-degree 1). A network N is said to be fixed-root, if both children of the root node are tree nodes. N is said to be tree-child, if every tree node of N has at least one of its children being a tree node too. In this article, the class of fixed-root tree-child networks is denoted as $\rho $TC. A species tree on X is any network on X without reticulations nodes. See Fig. 1 for examples of networks.

A node v of indegree one and outdegree one in a rooted directed graph can be contracted as follows: (1) remove v, (2) remove both edges incident with v, and (3) insert a new directed edge connecting the unique parent of v with the only child of v. Similarly, if v is a root and has outdegree one we remove v, and the child of v becomes a new root. If a directed graph $G'$ is obtained from a graph G by a sequence of contract operations, then G is called a subdivision of $G'$. Given a network N on X, we say that a tree T on X is displayed by N, if N contains a subgraph $T'$ that is a subdivision of T [20].

For a network N and a node v of N, the set of all species reachable in N from v is denoted $L^{N}_{v}$. Below, we present some properties of $\rho $TC-networks.

Lemma 1

Let N be a $\rho $TC-network. Let v be a child of the root of N. Let X be the set of all species/leaves of N. Then, $X-L^{N}_{v}\ne \emptyset $.

Observe that the assumption that N is $\rho $TC is essential for Lemma 1. For the network $N'$ from Fig. 1 we have $\{a,b\}-L^{N'}_{v}=\emptyset $. Similarly, for the network $N''$ from Fig. 1, we have $\{a,b\}-L^{N''}_{v}=\{a,b\}-L^{N''}_{u}=\emptyset $. The next result immediately follows from the proof of Lemma 1.

Given a species tree S on X, we call split of S the unordered pair $L^S_{r'}|L^S_{r''}$, where $r'$ and $r''$ are the children of the root. Note that, any display tree of a $\rho $TC-network N on X is a species tree on X.

Lemma 2

Let N be a $\rho $TC-network. Let v be a child of the root of N. For any display tree T of N, if A|B is the split of T, then $X-L^{N}_{v}$ is contained in A or in B.

3 Supports of a Network

In this Section, we introduce the notion of network support.

We say that two species a and b are glued together in a tree T on X, if a and b are descendants of the same child of the root of T. Consider the following two decision problems.

$\Delta $:: Given a network N on a set of species X and two species $a,b\in X$. Is it the case that all display trees of N glue together a and b?
$\Gamma $:: Given a network N on a set of species X and two species $a,b\in X$. Is there a display tree of N that glues together a and b?

Theorem 1

Problems $\Delta $ and $\Gamma $ are decidable in polynomial time.

Proof

Let $a, b \in X$. For nodes x and y, let d(x, y) be a predicate that is true if and only if there are node-disjoint paths from x to a and from y to b. Then, d(x, y) can be computed in polynomial time using the dynamic programming:

$$ d(x,y)={\left\{ \begin{array}{ll} False &{} \text {if } x\,=\,y,\\ True &{} x=a \text { and } y=b,\\ d(x',y) &{} x \text { is a reticulation with a child } x',\\ d(x,y') &{} y \text { is a reticulation with a child } y',\\ d(x',y) \vee d(x'',y) &{} x \text { is a tree node with children } x' \text { and } x'',\\ d(x,y') \vee d(x,y'') &{} y \text { is a tree node with children } y' \text { and } y'',\\ False &{} \text {otherwise.} \end{array}\right. } $$

Let $\delta (x,y)=d(x,y) \vee d(y,x)$.

Problem $\Delta $ can be solved by computing the value of $\lnot \delta (u,v)$, where u and v are the children of the root of the network. Problem $\Gamma $ can be solved similarly by checking if there is a tree node t with children x and y located strictly below u or v such that $\delta (x,y)$ is True. In such case a and b can be glued together. $\square $

We conclude from the proof that both problems can be solved in $O(n^2)$ time and space, where n is the number of leaves in N.

Let N be a network on X. A pair of subsets $(A_1,A_2)$ of X is called a support of N, if for every display T of N, if $C_1|C_2$ is the split of T, then $A_1\subseteq C_1$ and $A_2\subseteq C_2$.

Theorem 2

Let N be any network.

(a)
N has the largest support $(A_1^*,A_2^*)$.
(b)
If N is a $\rho $TC-network, then
1. (b1)
  The sets $(A_1^*,A_2^*)$ of the largest support are non-empty and they are (different) equivalence classes of the equivalence relation $\Delta $.
2. (b2)
  Finding equivalence classes that form the largest support of N can be done in polynomial time.

Note that finding the largest support of a $\rho $TC-network can be done by a single traversal of a network in linear time without computation of $\delta $-classes, by traversing through non-reticulation edges only. We omit easy details.

4 Unrooted Tree Reconciliation

Here, we summarize the main results on unrooted reconciliation, that we will use to model rooting of a gene tree based on a network. We start with the basic terms. A rooted gene tree over X is a rooted binary tree, where each non-leaf node has exactly two children, whose leaves are labeled by the elements of X (not necessarily uniquely). Similarly, an unrooted gene tree G over X is a binary unrooted tree, whose leaves are labeled by elements of X and such that any internal node of G has degree 3. An unrooted gene tree G can be rooted by choosing an edge e where the root will be placed. Such a rooted gene tree is denoted $G_e$. By L(G) we denote the set of all species present in a gene tree G. Let $X'|X''$ be the split of a fixed species tree S on X. Without loss of generality, we assume $|L(G)| \ge 3$, $L(G) \cap X' \ne \emptyset $ and $L(G) \cap X'' \ne \emptyset $, for any gene tree G. Any internal node g of an unrooted gene tree G over X determines a star being an unordered triple A|B|C, where A, B and C are the sets of all leaf-labels of three subtrees obtained from G by removing g. Note that $A \cup B \cup C = L(G) \subseteq X$. Let $\bar{A}=B \cup C$, $\bar{B}=C \cup A$ and $\bar{C}=A \cup B$. Let $\zeta (Z)$ be a predicate that is true iff $Z \subseteq X'$ or $Z \subseteq X''$. Then, it follows from [8], that we have five disjoint types of stars (see Fig. 2). Note that reordering of A, B and C may be required: (S1) if $\lnot \zeta (A) \wedge \zeta (\bar{A})$, (S2) if $\zeta (A) \wedge \zeta (\bar{A})$, (S3) if $\lnot \zeta (A) \wedge \lnot \zeta (\bar{A}) \wedge \zeta (B) \wedge \zeta (C)$, (S4) if $\lnot \zeta (A) \wedge \lnot \zeta (B) \wedge \lnot \zeta (C)$, and (S5) if $\zeta (A) \wedge \lnot \zeta (B) \wedge \lnot \zeta (C)$. Next, $\zeta $ induces arrows on the edges of G as follows. If $e=(v,w)$, then we say that e is no-arrow if $\zeta (L^{G_e}_v) \wedge \zeta (L^{G_e}_w)$, e is double-arrow if $\lnot \zeta (L^{G_e}_v) \wedge \lnot \zeta (L^{G_e}_w)$, and e is one-arrow (from v to w) if $\lnot \zeta (L^{G_e}_v) \wedge \zeta (L^{G_e}_w)$. Double-arrow and no-arrow edges are called symmetric.

Given a rooted gene tree and a species tree, one can determine the lowest number of gene duplication and loss events required to reconcile both trees, called duplication-loss cost (${{\,\mathrm{\text {DL}}\,}}$) [7, 18]. Due to space constraints, we omit here the formal definition of the cost. The above measure is often used to determine the rooting of an unrooted gene tree by choosing the edge e that minimizes the ${{\,\mathrm{\text {DL}}\,}}$ cost between $G_e$ and the species tree. Such edges can be identified using the following theorem.

Theorem 3

(Adopted from [7, 8]). Given a species tree S, an unrooted gene tree G, and an edge e from G. The ${{\,\mathrm{\text {DL}}\,}}$ cost of $G_e$ and S is minimal among all rootings of G if and only if e is one-arrow in a star S5 or symmetric in any star.

The set of edges satisfying the conditions from Theorem 3 induces a connected subtree called a plateau of G determined by S and denoted P(G, S). All plateau edges can be identified in linear time and space [7]. Examples are depicted in Fig. 2.

Lemma 3 (Rooting Lemma)

For any species trees S and $S'$ on X with the same split, $P(G,S)=P(G,S')$.

Thus, we will use $P(G,A|B):=P(G,S)$, if A|B is a split of some species tree S.

5 Rooting a Gene Tree

Here, given a network N on a set X of species and an unrooted gene tree G over X, we propose an algorithm to find an optimal place for the root of G. One natural approach is to choose the best rooting by identifying the edge that belongs to the maximal number of plateaus P(G, T), where T ranges over all display trees of N. While the complexity of the problem remains open, we know only a naïve way to solve it by enumerating all possible display trees of N, and, then by applying the unrooted reconciliation algorithm. Such an approach is exponential in the number of reticulations in N. To overcome this complexity, based on Rooting Lemma 3, we propose an algorithm that determines rooting candidates by joint analysis of all possible set-theoretic splits from a given network under some structural assumptions based on display tree properties. The algorithm will be done in three steps. See also an example in Fig. 3

Step 1: Collapsing on the largest support of $\boldsymbol{N}$. Let $(A_1^*, A_2^*)$ be the largest support of N as described by Theorem 2. Let $X^*= X-(A_1^*\cup A_2^*)$. Let a, b be new names of species not occurring in X and let $Y=\{a,b\}\cup X^*$ be a new set of species names. We transform G into a gene tree $G^*$ over Y by replacing every species name of G that belongs to $A_1^*$ by a, and similarly replacing every species name of G that belongs to $A_2^*$ by b. Thus we arrive at a pointed gene tree $(G^*,a,b)$ over Y, with two distinguished species: a and b.

Step 2: Computing weights of edges in a pointed gene tree. For every $Z \subseteq X^*$, we will consider splits $Z \cup \{a\} | (X^*-Z) \cup \{b\}$ of Y induced by Z. Then, the weight of an edge e of $G^*$ is the number of sets Z under which e belongs to the plateau determined by the split induced by Z. Note, that each edge has weight at most $2^{|X^*|}$. Each edge with the maximum weight belongs to the intersection of all plateaus, as Z ranges over all subsets of $X^*$, and therefore it also belongs to the intersection of all plateaus P(G, T) with T ranging over all display trees of N. See Sect. 5.1 for efficient methods to compute weights.

Step 3. Final step. Report all edges of G with the largest weight. These are the potential places for the root to be placed.

5.1 Computing Weights of Edges of Unrooted Gene Trees

In this Section we show how to compute the weight of an edge of an unrooted gene tree with two distinguished species.

Preliminary Assumptions: Here, we assume that an unrooted gene tree G over a set Y of species has two distinguished species a, b and the rest of the species we will denote here by X. So $Y=X\cup \{a,b\}$ and a, b do not belong to X. Since, the top-split is required to determine optimal rootings, we need to assume that

$$\begin{aligned} a,b \in L(G). \end{aligned}$$

(1)

Definition 1 (The weight of an edge)

Under the preliminary assumptions, the weight of an edge e of G, denoted W(e), is the number of sets $Z\subseteq X$ such that e belongs to $P(G, Z_a|Z_b)$, where $Z_a = Z \cup \{a\}$ and $Z_b=(X \setminus Z)\cup \{b\}$.

We also need to distinguish two cases depending on the type of edge. If an edge is incident to a leaf, we call it external; otherwise, the edge is internal. Figure 4 establishes additional notation.

The next Lemma shows formulas for the weight of edges depending on their type.

Lemma 4 (Internal edge weight)

Under the notation of Fig. 4, the number of sets $Z\subseteq X$ such that e belongs to $P(G, Z_a|Z_b)$ is given:

If e is no-arrow,
$$ \varphi _1(C,D)= {\left\{ \begin{array}{ll} 2^{|X \setminus (C\cup D)|} &{} \text { if [(}D \not \ni a \in C \text { and }C \not \ni b \in D\text {) or (}D \not \ni b \in C \text { and } \\ {} &{} C \not \ni a \in D\text {)] and } C\cap D=\emptyset \text {, }\\ 0,&{} \text {otherwise}. \end{array}\right. } $$
If e is one-arrow: $\varphi _2(C,D_1,D_2)= \varphi ^a_2(C,D_1,D_2)+\varphi ^b_2(C,D_1,D_2)$, where^{Footnote 1}
$$ \begin{aligned} \varphi _2^a(C,D_1,D_2)&= I(b\not \in C)\cdot (2^{|X-C|}-f_a(C,D_1,D_2)),\\ \varphi _2^b(C,D_1,D_2)&= I(a\not \in C)\cdot (2^{|X-C|}-f_b(C,D_1,D_2)),\\ f_a(C,D_1,D_2)&= \sum _{i=1}^2 2^{|X \setminus (C\cup D_i)|} \cdot \big ( I(a\not \in D_i)\cdot I((C \setminus \{a\})\subseteq X-D_i) +I(b\not \in D_i) \big )\\&+ 2^{|X-(D_1\cup D_2\cup C)|} \cdot \Big ( I(a\not \in D_1\cup D_2)\cdot I(X\cap C\cap (D_1\cup D_2)=\emptyset ) + \\&+ \sum _{i=1}^2 I(a\not \in D_i)\cdot I(b\not \in D_{3-i})\cdot I(X\cap (C\cup D_{3-i})\cap D_i=\emptyset ) \Big ), \end{aligned} $$

$$ \begin{aligned} f_b(C,D_1,D_2)&=\sum _{i=1}^2 2^{|X \setminus (C\cup D_i)|} \cdot \big ( I(a\not \in D_i) + I(b\not \in D_i)\cdot I(D_i\cap X\cap C=\emptyset ) \big )+\\&+ 2^{|X-(D_1\cup D_2\cup C)|} \cdot \Big ( I(b\not \in D_1\cup D_2)\cdot I(D_1\cap X\cap C=\emptyset )\cdot I(D_2\cap X\cap C=\emptyset ) + \\&+ \sum _{i=1}^2 I(a\not \in D_i)\cdot I(b\not \in D_{3-i})\cdot I(D_{3-i}\cap X\cap (C\cup D_i)=\emptyset ) \Big ). \end{aligned} $$
If e is double-arrow $\varphi _3(C,D)= \varphi ^b_2(\emptyset ,C,D)$.

Below, we present the main formula for computing weights.

Lemma 5 (Computing weights of edges)

Under the notation from Lemma 4, the weight of an internal edge e is $W(e)=\varphi _1(C_1\cup C_2, D_1\cup D_2)+\varphi _2(C_1\cup C_2, D_1,D_2)+ \varphi _2(D_1\cup D_2, C_1,C_2)+\varphi _3(C_1\cup C_2,D_1\cup D_2)$, while the weight of an external edge e is $W(e)=\varphi _1(\{c\},D_1\cup D_2)+\varphi _2(\{c\},D_1,D_2)$.

5.2 Time and Space Complexity of the Rooting Algorithm

Let m be the number of leaves in G, and n be the number of leaves in N. As already shown, computing the largest support of N can be completed in O(m) time. Also, collapsing supports of the gene tree can be done in $O(\max (m,n))$ time. Since a and b are fixed, all indicators of the form $I(a \in D_i)$ or $I(a \notin D_i)$ (and the same for b), can be computed for each edge after one traversal of the gene tree without computing sets $D_i$’s. The same holds for the sizes of power-sets, e.g., $2^{|X \setminus C|}$, and so on. Therefore, computing these values requires O(m) steps in total. For the remaining components of $\varphi $ formulas from Lemma 4, we need to compute: $I(C\setminus \{a\} \subseteq X \setminus D_i)$, $I(X \cap C \cap D_i = \emptyset )$ and $I(X \cap D_1 \cap D_2)$ for $i=1,2$. It is an open question, whether these values can be computed without direct computation of all needed subsets of Y; therefore, we assume the classic bit-vector implementation with O(m) time and space complexity of set operations. We have 4 subsets of Y for every internal edge and 2 subsets for every external edge. This gives the whole algorithm’s total time and space complexity of O(nm).

6 Evaluation

In our evaluation, we first simulate a collection of networks with unrooted gene trees with additional information on the rooting edge. Then, we use the simulated data to verify the performance of our rooting algorithm.

6.1 Simulating Networks and Unrooted Gene Trees

Simulating Species Networks We start by simulating a collection of random species trees with $n = \{20, 40, 60\}$ leaves using TreeSim 2.4 [9] with parameters from [16]. Then, we randomly choose $r=\{4, 8\}$ pairs of edges from each tree. For each pair, we then subdivide its two edges and add a new reticulation edge between the vertices introduced by the subdivision. Following that procedure, we obtain a new species network with r reticulations from each of the generated trees. Additionally, we constrain the network to be tree-child and fixed-root. Finally, we add an extra leaf to each network N, called an outgroup, and another extra node $\rho $. We add an edge from $\rho $ to the root of N and set its length to 1/2 of the height of N and another edge from $\rho $ to the outgroup, setting its length to 3/2 of the height of N, so that $\rho $ becomes the root of the new network.

Simulating Gene Trees. For each network, we randomly generate 16 of its displayed trees. Following [23], we then simulate one gene tree per displayed tree with SimPhy 1.0.2 [15] using 6 sets of biological parameters [16, 19]. The sets of biological parameters contain three values of the duplication-loss rate $DL=\{10^{-10}, 2\cdot 10^{-10}, 5\cdot 10^{-10}\}$ and two values of the effective population size $ILS=\{10^{7}, 5\cdot 10^{7}\}$. Next, we simulate a multiple sequence alignment for each gene tree using INDELible v1.03 [5] and parameters from [16]. We infer a gene tree from each multiple sequence alignment using neighbour-joining method with Kimura 2-parameter correction implemented in NINJA 1.2.2 [24]. We delete the samples for which the method infers trees with negative branch lengths. Finally, we reroot each of the inferred trees, placing the new root on the branch connecting the single leaf labelled by the outgroup with the rest of the tree. After rerooting, we delete the root node and the outgroup node, obtaining the final rooted gene tree. The edge where the root is attached is stored with the unrooted variant of the gene tree as the original root position. Original rooting edges were used to verify the correctness of our rooting algorithm.

Finally, for each reticulation number, leaf-set size and set of biological parameters, we simulate 100 networks and 1600 corresponding gene trees. Thus, we obtained $57600 = 3 \cdot 2 \cdot 3 \cdot 2 \cdot 16 \cdot 100$ triplets of the form $(N,G,o_G)$, where N is a $\rho $TC-network, G is a gene tree simulated using N, and $o_G \in E(G)$ is the original rooting edge of G.

6.2 Rooting Algorithm Results

We tested our algorithm on the set of simulated triplets. The edges with the largest weight we call here optimal. Among 57600 triplets, our algorithm indicated

8624 gene trees in which the original rooting edge is the unique optimal edge,
10286 gene trees in which the original rooting edge is optimal, but the algorithm indicated more than one optimal edge,
and 26325 gene trees where the original root is not optimal.

Since the algorithm requires gene trees with species present in both elements of the largest support, we rejected the remaining 12365 triplets with insufficient species representation in gene trees. We will address this problem in future approaches by contracting networks.

We used the classical path distance as a measure, where the distance between two edges is the number of nodes on the shortest path connecting these edges: the smaller distance, the better performance of the rooting algorithm. For example, the distance between two different adjacent edges is 1. Since the set of optimal edges may contain more than one element, we define a minimal error of $(N,G,o_G)$ as the minimal distance between $o_G$ and an optimal edge. Similarly, we define an average error as the average distance.

The results for all false positive triplets are depicted in Fig. 5. The results for various ILS parameters were similar; therefore, we merged them in the diagrams. The main observation is that the minimum error is usually 1 in most cases, meaning that the original rooting edge is typically adjacent to some optimal edge. We observe a similar pattern for the average error, which is partially a consequence of the property that the number of optimal edges in false-positive cases is generally low, as indicated in Fig. 6. However, generally, the histograms of average error are flatter.

Software Package. The software package with data is freely available from https://bitbucket.org/pgor17/netroot.

7 Conclusions and Future Outlook

We addressed a novel variant of the gene tree rooting problem, where the gene tree root is inferred using a known phylogenetic network of the species present in the gene tree. To obtain the best rooting edge, we apply unrooted reconciliation multiple times, where the unrooted gene tree is jointly reconciled with a set of splits inferred from the given network. We derived an exact and computationally efficient mathematical formula for such a formulation of the rooting problem. Our experimental study based on simulated gene trees and networks showed that the rooting algorithm infers the rooting correctly or with a small error in most cases.

In the future, we plan to compare the novel method with known gene tree rooting approaches. However, conclusions from such a comparative study may be limited since no method assumes a known phylogenetic network. Also, we plan to address the problem of species representation in gene trees, where our method cannot be applied when the largest support sets are disjoint with the collection of species present in the gene tree.

Change history

01 January 2023
A correction has been published.

Notes

1.
Here, I is the indicator function: I(p) equals 1 if the predicate p is satisfied, and 0 otherwise.

References

Bapteste, E., Bapteste, E., et al.: Networks: expanding evolutionary thinking. Trends Genet. 29(8), 439–441 (2013)
Article Google Scholar
Boykin, L.M., Kubatko, L.S., Lowrey, T.K.: Comparison of methods for rooting phylogenetic trees: a case study using Orcuttieae (Poaceae: Chloridoideae). Mol. Phylogenet. Evol. 54(3), 687–700 (2010)
Article Google Scholar
Chen, K., Durand, D., Farach-Colton, M.: NOTUNG: a program for dating gene duplications and optimizing gene family trees. J. Comput. Biol. 7(3–4), 429–447 (2000)
Article Google Scholar
Farris, J.S.: Estimating phylogenetic trees from distance matrices. Am. Nat. 106(951), 645–668 (1972)
Article Google Scholar
Fletcher, W., Yang, Z.: Indelible: a flexible simulator of biological sequence evolution. Molecular Biology and Evolution 26(8), 1879–1888 (2009)
Article Google Scholar
Górecki, P., Eulenstein, O.: Deep coalescence reconciliation with unrooted gene trees: linear time algorithms. In: Gudmundsson, J., Mestre, J., Viglas, T. (eds.) COCOON 2012. LNCS, vol. 7434, pp. 531–542. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-32241-9_45
Chapter Google Scholar
Górecki, P., Eulenstein, O., Tiuryn, J.: Unrooted tree reconciliation: a unified approach. IEEE/ACM Trans. Comput. Biol. Bioinform. 10(2), 522–536 (2013)
Article Google Scholar
Górecki, P., Tiuryn, J.: Inferring phylogeny from whole genomes. Bioinformatics 23(2), e116–e122 (2007)
Article Google Scholar
Hartmann, K., Wong, D., Stadler, T.: Sampling trees from evolutionary models. Syst. Biol. 52(4), 465–476 (2010)
Article Google Scholar
Huson, D.H., Rupp, R., Scornavacca, C.: Phylogenetic Networks: Concepts Algorithms and Applications. Cambridge University Press, New York (2010)
Book Google Scholar
Kinene, T., Wainaina, J., Maina, S., Boykin, L.: Rooting trees, methods for. In: Encyclopedia of Evolutionary Biology, pp. 489–493. Elsevier (2016)
Google Scholar
Lepage, T., Bryant, D., Philippe, H., Lartillot, N.: A general comparison of relaxed molecular clock models. Mol. Biol. Evol. 24(12), 2669–2680 (2007)
Article Google Scholar
Maddison, W.P., Donoghue, M.J., Maddison, D.R.: Outgroup analysis and parsimony. Syst. Biol. 33(1), 83–103 (1984)
Article Google Scholar
Mai, U., Sayyari, E., Mirarab, S.: Minimum variance rooting of phylogenetic trees and implications for species tree reconstruction. PLoS ONE 12(8), e0182238 (2017)
Article Google Scholar
Mallo, D., De Oliveira Martins, L., Posada, D.: SimPhy: phylogenomic simulation of gene, locus, and species trees. Syst. Biol. 65(2), 334–344 (2015)
Article Google Scholar
Molloy, E.K., Warnow, T.: FastMulRFS: fast and accurate species tree estimation under generic gene duplication and loss models. Bioinformatics 36(Supplement_1), i57–i65 (2020)
Google Scholar
Mykowiecka, A., Górecki, P.: Credibility of evolutionary events in gene trees. IEEE/ACM Trans. Comput. Biol. Bioinform. 16(3), 713–726 (2019)
Article Google Scholar
Page, R.D.: GeneTree: comparing gene and species phylogenies using reconciled trees. Bioinformatics 14(9), 819–820 (1998)
Article Google Scholar
Rasmussen, M.D., Kellis, M.: Unified modeling of gene duplication, loss, and coalescence using a locus tree. Genome Res. 22(4), 755–765 (2012)
Article Google Scholar
Steel, M.: Phylogeny. Society for Industrial and Applied Mathematics (2016)
Google Scholar
Tria, F.D.K., Landan, G., Dagan, T.: Phylogenetic rooting using minimal ancestor deviation. Nat. Ecol. Evol. 1(1), 1–7 (2017)
Google Scholar
Wade, T., Rangel, L.T., Kundu, S., Fournier, G.P., Bansal, M.S.: Assessing the accuracy of phylogenetic rooting methods on prokaryotic gene families. PLoS ONE 15(5), e0232950 (2020)
Article Google Scholar
Wawerka, M., Dabkowski, D., Rutecka, N., Mykowiecka, A., Górecki, P.: Embedding gene trees into phylogenetic networks by conflict resolution algorithms. Algorithms Mol. Biol. 17(1), 11 (2022)
Article Google Scholar
Wheeler, T.J.: Large-scale neighbor-joining with NINJA. In: Salzberg, S.L., Warnow, T. (eds.) WABI 2009. LNCS, vol. 5724, pp. 375–389. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-04241-6_31
Chapter Google Scholar
Williams, T.A., Heaps, S.E., Cherlin, S., Nye, T.M., Boys, R.J., Embley, T.M.: New substitution models for rooting phylogenetic trees. Philos. Trans. R. Soc. B: Biol. Sci. 370(1678), 20140336 (2015)
Article Google Scholar

Download references

Author information

Authors and Affiliations

University of Warsaw, Institute of Informatics, Warsaw, Poland
Jerzy Tiuryn, Natalia Rutecka & Paweł Górecki

Authors

Jerzy Tiuryn
View author publications
You can also search for this author in PubMed Google Scholar
Natalia Rutecka
View author publications
You can also search for this author in PubMed Google Scholar
Paweł Górecki
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Paweł Górecki .

Editor information

Editors and Affiliations

Shenzhen Institute of Advanced Technology, Shenzhen, China
Yong Zhang
Harbin Institute of Technology, Harbin, China
Dongjing Miao
Technical University of Berlin, Berlin, Germany
Rolf Möhring

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Reprints and permissions

Copyright information

About this paper

Cite this paper

Tiuryn, J., Rutecka, N., Górecki, P. (2022). Rooting Gene Trees via Phylogenetic Networks. In: Zhang, Y., Miao, D., Möhring, R. (eds) Computing and Combinatorics. COCOON 2022. Lecture Notes in Computer Science, vol 13595. Springer, Cham. https://doi.org/10.1007/978-3-031-22105-7_37

Download citation

DOI: https://doi.org/10.1007/978-3-031-22105-7_37
Published: 01 January 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-22104-0
Online ISBN: 978-3-031-22105-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Rooting Gene Trees via Phylogenetic Networks