Sequences and metadata of genomes
======

Included in this genome pool are 15,953 non-redundant bacterial and archaeal genomes sampled from NCBI RefSeq and GenBank as an even representation of microbial diversity.

Statistics:

- Total number of genomes: 15,953
- Total length of genomes: 48,773,860,666 bp
  - After adding linkers: 48,809,171,826 bp

Sequence files:

- all.fna: Concatenated genome sequences in multi-FASTA format.
  - Specifically: Nucleotide sequences of each genome were concatenated with a linker of 20 "N"s into one sequence, and named following the genome ID (e.g., G000123456). Sequences of all genomes were then merged into one FASTA file. This file is the input for building genome databases (see databases/).
- raw/: Original genome sequences (.fna) retrieved from NCBI.

Metadata and mappings:

- metadata.tsv: Metadata of the 15,953 genomes.
- assembly.tsv: Original genome assembly report obtained from NCBI.
  - Taxonomic assignments of genomes in this report reflect the original NCBI records, whereas the curated taxonomy is hosted in taxonomy/.
- length.map: Mapping of genomes to total lengths (bp).
  - This file is useful for normalizing genome frequencies by genome length.
- nucl2g.map: Mapping of NCBI nucleotide accessions to genome IDs.
- genome.lst: A sorted list of genome IDs.

Additional metrics:

- quast.tsv: Assembly quality statistics calculated using QUAST v4.5.
  - Citation: Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013 Apr 15;29(8):1072-5.

  ```
  quast -t 8 -m 0 -o output input.fna
  ```

- checkm.tsv: Genome quality assessment results using CheckM v1.0.7 with its database release 2015-01-16.
  - Citation: Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research. 2015 Jul 1;25(7):1043-55.

  ```
  checkm tree -t 8 -x fna input/ output/
  checkm tree_qa output/
  checkm lineage_set output/ output/lineage.ms
  checkm analyze -t 8 -x fna output/lineage.ms input/ output/
  checkm qa -t 8 output/lineage.ms output/ -o 2 -f output/summary.txt
  ```

- gunc.tsv: Genome chimerism and contamination detection results using GUNC v1.2 with the proGenomes v2.1 database.
  - Citation: Orakov A, Fullam A, Coelho LP, Khedkar S, Szklarczyk D, Mende DR, Schmidt TS, Bork P. GUNC: detection of chimerism and contamination in prokaryotic genomes. Genome Biology. 2021 Dec;22(1):1-9.

  ```
  gunc run -t 8 -r gunc_db_progenomes2.1.dmnd -d input/ -e .fna -o output/
  ```
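
Illustration: linker concatenation and length normalization

The following Python sketch illustrates two points made above: how joining the sequences of one genome with a linker of 20 "N"s changes its length (as described for all.fna), and how a genome-to-length mapping can be used to normalize per-genome counts (as described for length.map). It is a minimal sketch, not the script used to build this data release. All function names are hypothetical, and the two-column, tab-separated layout assumed for length.map is an assumption rather than something stated here. The sketch also assumes that linkers sit only between adjacent sequences of the same genome.

```python
from io import StringIO

# Linker of 20 "N"s, as stated in this README.
LINKER = "N" * 20


def concatenate_contigs(contigs, linker=LINKER):
    """Join the sequences of one genome into a single sequence,
    placing one linker between each pair of adjacent sequences."""
    return linker.join(contigs)


def linked_length(contig_lengths, linker_len=len(LINKER)):
    """Length of a genome after concatenation: all bases plus one
    linker per junction (number of sequences minus one)."""
    n = len(contig_lengths)
    return sum(contig_lengths) + linker_len * max(n - 1, 0)


def read_length_map(handle):
    """Parse 'genome_id<TAB>length' lines.
    ASSUMPTION: this layout of length.map is not confirmed by the README."""
    lengths = {}
    for line in handle:
        line = line.strip()
        if line:
            genome_id, length = line.split("\t")
            lengths[genome_id] = int(length)
    return lengths


def normalize_by_length(counts, lengths):
    """Divide each per-genome count by that genome's length (count per bp)."""
    return {g: c / lengths[g] for g, c in counts.items()}


if __name__ == "__main__":
    # Toy genome with three sequences: 10 bases plus 2 linkers of 20 N.
    seq = concatenate_contigs(["ACGT", "GGCC", "TT"])
    print(len(seq))  # 50

    # Totals reported under "Statistics": the difference is the total
    # length contributed by linkers.
    total_bp, total_linked_bp = 48_773_860_666, 48_809_171_826
    print(total_linked_bp - total_bp)  # 35311160

    # Toy length map and toy counts (made-up values for illustration only).
    toy_map = StringIO("G000000001\t2000000\nG000000002\t1000000\n")
    lengths = read_length_map(toy_map)
    print(normalize_by_length({"G000000001": 200, "G000000002": 50}, lengths))
```

Dividing by length makes counts comparable between genomes of different sizes, since a longer genome would otherwise tend to collect more hits simply because it offers more sequence to match.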