Taxonomic classification of genomes
======

Default taxonomy:

The default taxonomy (this directory level) is based on GTDB R207, curated to
match the WoL2 phylogeny. Both Greengenes-style lineage strings and NCBI-style
taxdump (with dummy TaxIDs) are provided.

 - Website: https://gtdb.ecogenomic.org/

Statistics:

Numbers of taxonomic units:

 - Domains:      2
 - Phyla:      124
 - Classes:    321
 - Orders:     914
 - Families: 2,057
 - Genera:   6,811
 - Species: 12,258
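
Counts like these can be recomputed from a lineage file by collecting the distinct non-empty taxa at each rank. The following is a minimal sketch, not the procedure used to produce the numbers above. It assumes each line of lineages.txt is a genome ID, a tab, and a Greengenes-style lineage whose ranks are separated by semicolons; that layout is inferred from the examples in this README and is not a confirmed specification. The name `count_taxa` is hypothetical.

```python
from collections import defaultdict

# Rank codes in Greengenes-style lineage strings, domain to species.
RANKS = "dpcofgs"


def count_taxa(lines):
    """Count distinct non-empty taxon names at each rank.

    Assumes "<genome ID><TAB><lineage>" per line (an assumption, see above).
    Taxa are distinguished by name only, so the same name under two
    different parents is counted once.
    """
    seen = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        _, _, lineage = line.partition("\t")
        for taxon in lineage.split(";"):
            code, sep, name = taxon.strip().partition("__")
            if sep and name:  # skip unassigned ranks such as "g__"
                seen[code].add(name)
    return {code: len(seen[code]) for code in RANKS}
```

Usage: `with open("lineages.txt") as f: print(count_taxa(f))`.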

Database files:

 - taxid.map: Genome ID to TaxID mapping.
 - nodes.dmp: NCBI taxdump-style node mapping.
 - names.dmp: NCBI taxdump-style name mapping.
 - lineages.txt: Lineage strings with taxon names.
 - linetids.txt: Lineage strings with TaxIDs.
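
As an illustration of how the taxdump-style files fit together, the sketch below loads nodes.dmp and names.dmp into dictionaries and walks from a TaxID up to the root. It assumes the standard NCBI taxdump layout (fields separated by `|`, with TaxID, parent TaxID and rank as the first three columns of nodes.dmp, and TaxID and name as the first two columns of names.dmp); the files here may carry fewer columns than a full NCBI dump. The function names are hypothetical.

```python
def read_dmp(lines):
    """Yield the fields of each row in a taxdump-style .dmp file."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.endswith("|"):  # drop the row terminator
            line = line[:-1]
        yield [field.strip() for field in line.split("|")]


def load_taxdump(nodes_lines, names_lines):
    """Return (parents, ranks, names), each keyed by TaxID string."""
    parents, ranks, names = {}, {}, {}
    for fields in read_dmp(nodes_lines):
        parents[fields[0]] = fields[1]
        ranks[fields[0]] = fields[2]
    for fields in read_dmp(names_lines):
        # If a name-class column is present, keep scientific names only.
        if len(fields) >= 4 and fields[3] != "scientific name":
            continue
        names[fields[0]] = fields[1]
    return parents, ranks, names


def lineage_names(taxid, parents, names):
    """List names from the topmost ancestor down to the given TaxID."""
    path = []
    while taxid in parents:
        path.append(names.get(taxid, taxid))
        parent = parents[taxid]
        if parent == taxid:  # a root node is its own parent
            break
        taxid = parent
    return path[::-1]
```

A TaxID for a given genome can be looked up first in taxid.map (assumed here to be a two-column, tab-separated file) and then passed to `lineage_names`.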


Taxonomy systems:

Taxonomic annotation of the WoL2 genomes was performed based on:

 - Greengenes2 release 2022.10

   - Based on WoLr2 phylogeny and GTDB R207 taxonomy.
   - Source: http://ftp.microbio.me/greengenes_release/2022.10/

 - GTDB RS207 (2022-04-08)

   - Source: https://data.gtdb.ecogenomic.org/releases/release207/207.0/

 - NCBI taxdump 2022-01-01

   - Source: https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump_archive/taxdmp_2022-01-01.zip

See: gg2/, gtdb/ and ncbi/, respectively.


Curation of taxonomy:

The original GTDB and NCBI taxonomies were automatically curated according to
the WoL2 phylogeny using Tax2Tree.

 - Website: https://github.com/biocore/tax2tree
 - Citation: McDonald D, Price MN, Goodrich J, Nawrocki EP, DeSantis TZ,
   Probst A, Andersen GL, Knight R, Hugenholtz P. An improved Greengenes
   taxonomy with explicit ranks for ecological and evolutionary analyses of
   bacteria and archaea. The ISME journal. 2012 Mar;6(3):610-8.

See: gtdb/tax2tree and ncbi/tax2tree, respectively.

Tax2Tree 1.0 (commit 36856f0, updated on 2022-05-31) was used with Python 3.10.4.

Installation:

```
conda create -n tax2tree python=3
conda activate tax2tree
pip install tax2tree
```

Analysis:

```
t2t decorate -m linetids.txt -t tree.nwk -o output \
  --no-suffix --min-count 1 --add-nameholder
```

Note: For NCBI, pre- and post-processing steps were necessary. See ncbi/README
for details.


Further curation:

The Tax2Tree-curated GTDB taxonomy was further curated so that the
hierarchical relationships among taxa remain consistent with the original GTDB
taxonomy (species to domain), while staying consistent with the WoL2
phylogeny (genome tree topology). The outcome is the default taxonomy (in
the current directory).

Specifically, the following three genomes were edited:

    Genome: G000441555
  Original: d__3; p__24; c__246; o__813; f__4072; g__13316; s__44083
  Tax2Tree: d__3; p__24; c__246; o__813; f__2735; g__13316; s__44083
     Final: d__3; p__24; c__246; o__813; f__2735; g__; s__

    Genome: G000817735
  Original: d__3; p__33; c__257; o__857; f__3208; g__8970; s__73328
  Tax2Tree: d__3; p__33; c__257; o__857; f__2817; g__8970; s__73328
     Final: d__3; p__33; c__257; o__857; f__2817; g__; s__

    Genome: G003265155
  Original: d__3; p__26; c__248; o__820; f__2787; g__8227; s__40950
  Tax2Tree: d__3; p__25; c__247; o__829; f__2757; g__12453; s__40950
     Final: d__3; p__25; c__247; o__829; f__2757; g__12453; s__

The corresponding lineage string files are:

  Original: gtdb/linetids.txt
  Tax2Tree: gtdb/tax2tree/linetids.txt
     Final: linetids.txt
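
To see which genomes change between two of these files, the lineages can be compared rank by rank. The sketch below reports, for each genome whose lineage differs, the first pair of taxa that disagree. It assumes each line of a linetids.txt file is a genome ID, a tab, and a lineage in the form shown above; the file layout is an assumption and the function names are hypothetical.

```python
def read_lineages(lines):
    """Map genome ID to its list of rank-prefixed taxa (e.g. "f__4072")."""
    lineages = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        genome, _, lineage = line.partition("\t")
        lineages[genome] = [taxon.strip() for taxon in lineage.split(";")]
    return lineages


def first_differences(old, new):
    """For genomes present in both mappings, return the first differing taxa.

    The result maps genome ID to an (old taxon, new taxon) pair. Genomes
    with identical lineages, or missing from `new`, are omitted.
    """
    diffs = {}
    for genome, old_taxa in old.items():
        new_taxa = new.get(genome)
        if new_taxa is None:
            continue
        for a, b in zip(old_taxa, new_taxa):
            if a != b:
                diffs[genome] = (a, b)
                break
    return diffs
```

For example, comparing gtdb/linetids.txt with gtdb/tax2tree/linetids.txt in this way should flag G000441555 at the family rank, matching the Original and Tax2Tree rows listed above.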