Phylogenetic trees of genomes and genes
======

Phylogenetic reconstruction of the genomes was performed using a gene tree summary strategy optimized to handle large datasets (both taxon sampling and gene sampling) with conflicting gene evolutionary histories (mostly due to horizontal gene transfer) while ensuring accuracy. Central to this pipeline is uDance, a computational workflow using a divide-and-conquer strategy. The manuscript is currently under review.

- https://github.com/balabanmetin/uDance

Database files:

- tree.nwk: The "formal" phylogenomic tree of the 15,953 genomes.
  - This tree is based on the uDance-reconstructed species tree, rooted at the midpoint connecting the Bacteria and Archaea clades, with nodes in decreasing order, with branch lengths re-estimated to be the numbers of amino acid substitutions per site, with numeric IDs assigned to nodes, and with non-confident branches (EN <= 5 or LPP <= 0.5) contracted into polytomies. This tree is identical to: variants/cons.ci.id.nwk.
- support.tsv: Support statistics of the internal branches (derived from the ASTRAL output):
  - EN (effective number of genes): Number of gene trees that contain some quartets around this branch.
  - QT (quartet score): Proportion of quartets in the gene trees that support this branch.
  - LPP (local posterior probability): The probability this branch is true given the set of gene trees.
- variants/: The same species tree in various formats (with node IDs or branch support values, non-confident branches contracted or not, and alternative branch length units).
- marker.tsv: Properties of the 380 marker genes. Adopted from the WoL1 paper.
- align/: Multiple sequence alignments of amino acid sequences of the 380 marker genes.
- udance/: Raw phylogenetic trees generated using the uDance workflow.
- genes/: Phylogenies (i.e., gene trees) of the 380 marker genes.
  - They are not the ones inferred by uDance, which partitions the taxon set (see above). Instead, they were inferred separately from the alignments. Each tree represents the entire taxon set where the gene was found.

Procedures:

Step 1: 380 global marker genes were extracted per genome using PhyloPhlAn.

- These genes are identical to those used in WoLr1, except that p0127 was removed from the analysis.

Step 2: Multiple sequence alignment of each marker gene was performed using UPP. Gappy sites were removed using TrimAL. Potential errors were masked using TAPER.

Step 3: Phylogenetic reconstruction was performed using the uDance workflow. This partitioned the taxon set, inferred individual gene trees and one species tree per partition, then combined the species trees into one master tree. See udance/README for a brief explanation of its procedures.

Step 4: The branch lengths of the species tree were re-estimated using RAxML based on 100 conserved amino acid sites per gene.

Step 5: For each marker gene, a full gene tree was inferred separately using FastTree.
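
To illustrate what "gappy sites were removed" in Step 2 means, here is a minimal sketch of column filtering by gap fraction. It is not the TrimAL implementation and does not reproduce the settings used for this database. The function name `drop_gappy_columns`, the `max_gap_fraction` cutoff, and the toy alignment are all hypothetical.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.9, gap_chars="-"):
    """Remove alignment columns whose gap fraction exceeds a threshold.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    max_gap_fraction: hypothetical cutoff; the README does not state one.
    """
    names = list(alignment)
    if not names:
        return {}
    length = len(alignment[names[0]])
    if any(len(alignment[n]) != length for n in names):
        raise ValueError("all aligned sequences must have the same length")

    # Keep a column only if its share of gap characters is within the cutoff.
    keep = []
    for i in range(length):
        gaps = sum(alignment[n][i] in gap_chars for n in names)
        if gaps / len(names) <= max_gap_fraction:
            keep.append(i)

    return {n: "".join(alignment[n][i] for i in keep) for n in names}


# Toy alignment, made up for illustration; not taken from align/.
toy = {"g1": "MK-L", "g2": "M--L", "g3": "MR-L"}
trimmed = drop_gappy_columns(toy, max_gap_fraction=0.5)
print(trimmed)
```

In the toy alignment the third column is all gaps and is dropped, while the second column (one gap out of three) is kept under the 0.5 cutoff.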
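
The confidence rule stated for tree.nwk (branches with EN <= 5 or LPP <= 0.5 are contracted into polytomies) can be applied to a table of branch support values as sketched below. The thresholds come from the text above. The column names (`node`, `EN`, `LPP`) are assumptions about the layout of support.tsv, and the example rows are made up, so adjust both to the actual file.

```python
import csv
import io

# Thresholds quoted from the README: a branch is non-confident
# if EN <= 5 or LPP <= 0.5.
EN_MAX = 5.0
LPP_MAX = 0.5


def low_support_nodes(handle, id_col="node", en_col="EN", lpp_col="LPP"):
    """Return IDs of branches that fail the confidence rule.

    The column names are assumptions, not the documented layout
    of support.tsv.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    flagged = []
    for row in reader:
        en = float(row[en_col])
        lpp = float(row[lpp_col])
        if en <= EN_MAX or lpp <= LPP_MAX:
            flagged.append(row[id_col])
    return flagged


# Made-up rows for illustration only; not taken from the database.
EXAMPLE_TSV = (
    "node\tEN\tQT\tLPP\n"
    "N1\t380\t0.95\t1.0\n"
    "N2\t4\t0.60\t0.9\n"
    "N3\t120\t0.35\t0.4\n"
    "N4\t6\t0.50\t0.51\n"
)

flagged_example = low_support_nodes(io.StringIO(EXAMPLE_TSV))
print(flagged_example)
```

To run this on the real file, pass an open file handle instead of the `io.StringIO` object, for example `open("support.tsv", newline="")`, after checking the header names.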