This file contains the details about the generation of the SILVA 123 QIIME compatible database. Most of the text are directly copied, with slight modifications of the same notes used for the 119 release.

Raw data from SILVA (along with important notes about the usage and licensing for SILVA data) came from the SILVA exports page, here: http://www.arb-silva.de/no_cache/download/archive/release_123/Exports/

This file contains the notes about generating the QIIME compatible Silva 123 release. QIIME 1.9.0 release, Primer Prospector 1.0.1, and custom scripts were used to generate these data (commands and links to custom scripts are listed below). Please contact William Walters (william.a.walters@gmail.com) about any bugs/errors in this release.

The full aligned SSU sequence from Silva with taxonomy strings in the fasta comments was downloaded from: http://www.arb-silva.de/fileadmin/silva_databases/release_123/Exports/SILVA_123_SSURef_tax_silva_full_align_trunc.fasta.gz
This file contains 1756783 sequences.

Mike Robeson's code (https://github.com/mikerobeson/Misc_Code/tree/master/SILVA_to_RDP) requires PyCogent (1.5.3 was used in this case).

The data were processed on the compy compute cluster of Rob Knight’s at UCSD.

=====================
Important Usage Notes
=====================

The rep_set, rep_set_aligned, and taxonomy folders are used as the earlier versions of the QIIME-compatible SILVA releases. A core alignment file (80% identity clustered sequences, with positions removed that were 100% gaps in the full, unclustered SILVA data) is present in the core_alignment/ folder.

It may be necessary in some cases, if sequences are failing to align, to lower the default 0.75 value for —min_percent_id (e.g. 0.60) with QIIME’s align_seqs.py script.

The suggested alignment filtering settings for filter_alignment.py (1.9.0 QIIME and later) are recommended to be -e 0.10 -g 0.80. See “Alignment Filtering” section for older versions of QIIME.

Taxonomy strings are available in the raw format (strings pulled directly from the SILVA fasta labels), in an expanded RDP compatible format, and a seven-levels RDP format. The expanded RDP format contains expanded levels for every level present in any of the taxonomy strings. This has a consequence that the first 7 levels match domain through species for most Archaea, Bacteria, and many eukaryotes, but due to the extra levels present in many eukaryotes, one will have to look at deeper levels to get the species in many cases. When viewing taxonomy plots generated with these taxa strings, one will need to be aware that the expanded format may result in unmatched taxa levels (e.g. a species level for a bacterial taxon may be family level for a fungi taxon). The 7 level taxonomy uses 7 levels if they are present. If more than 7 levels are present, the first 3 and last 4 levels of taxonomy are used. If less than 7 levels are present, all levels present are used, and empty fields (e.g. d6__;d7__) are padded out to get 7 levels, with the text string of the last defined level replicated in the empty levels. The differences between these taxonomy strings (in the taxonomy/ folder) and those in the majority and consensus taxonomy folders are described at the end of this document.

In the taxonomy/taxonomy_all/97 folder, there are these files, with a brief description of what the file is to use as a guide when choosing which file to use:

raw_taxonomy.txt - these are the sequence IDs followed by the raw taxonomy strings directly pulled from the SILVA NR fasta file (will work with the -m blast assignment method, but not uclust/RDP)

taxonomy_7_levels.txt - This is the raw taxa, forced into exactly 7 levels as described in the preceding paragraph. This will work with all assignment methods

taxonomy_all_levels.txt - This is the raw taxa, expanded out to all levels present in any of the taxonomy strings (14 total levels). Will work with all assignment methods, but will use more memory than the 7 level taxonomy. Deeper levels of taxonomy, which will mostly come from Eukaryotes will require expansion of levels used with QIIME scripts, such as summarize_taxa.py.

consensus_taxonomy_7_levels.txt - This file is the same as the 7 levels, but uses the 100% consensus taxonomy (this is described in the “Consensus and Majority Taxonomies” section).

consensus_taxonomy_all_levels.txt - This file is the same as the all levels taxonomy, but uses the 100% consensus taxonomy (this is described in the “Consensus and Majority Taxonomies” section).
  
majority_taxonomy_7_levels.txt - This file is the same as the 7 levels, but uses the 90% majority taxonomy (this is described in the “Consensus and Majority Taxonomies” section).
     
majority_taxonomy_all_levels.txt - This file is the same as the all levels taxonomy, but uses the 90% majority taxonomy (this is described in the “Consensus and Majority Taxonomies” section).

Memory usage:
97% OTUs
OTU picking
8 gb of memory required for closed-reference UCLUST OTU picking. 16 gigs when -z (enable_reverse_strand_match) option used.

Taxonomic Assignment
16 gb required for UCLUST for assigning taxonomy with the 97% reference sequences/taxonomy mapping files. The 64 bit version of UCLUST is required for allocation of memory > 4 gigs.
For RDP, >24 gigs (--rdp_max_memory 24000) was required for taxonomic assignments with the 97% OTUs/taxonomy mapping files.


===================================================================
Filtering raw fasta file, creation of representative sequence files
===================================================================

clean_fasta.py (Primer Prospector) was called on the input file, with default parameters, to convert U characters to T characters, and remove gaps.

This degapped and filtered file was then used as input for clustering at 99%, 97%, 94%, and 90% identities using QIIME's pick_otus.py script with default settings (apart from sequence similarity) and the default uclust (v1.2.22q) software.

The representative sequence set for each clustered identity was generated with the pick_rep_set.py script using the default settings.

To create fasta files with sequence labels matching the representative sequence (rather than OTU ID), the custom script fix_fasta_labels.py was used: https://gist.github.com/walterst/f5c619799e6dc1f575a0

To create the 80% identity sequence (due to the prolonged clustering time), the 90% identity representative sequences that already had cleaned labels with the fix_fasta_labels.py script, were clustered at 80% identity with QIIME's pick_otus.py script. 

==============================
Taxonomy mapping file creation
==============================

To create the raw (unclustered) taxonomy mapping and reference fasta file, Mike Robeson's script prep_silva_data.py was used with the filtered file created from the full Silva 119 fasta alignment (see Filtering raw fasta file, creation of representative sequence files section).

prep_silva_data.py (Mike Robeson's script) was run on the filtered fasta above to create the raw taxonomy file. The RDP compatible (with expanded levels to match every semicolon-separated level possible in the taxonomy strings) was generated using the taxonomy file created by the call to prep_silva_data.py as input for the prep_silva_taxonomy_file.py script. In this case, there at 16 total levels to match the maximum number of levels present in the Silva 123 release.

Non-ASCII characters, as well as asterisk characters "*" can interfere with RDP, so the taxonomy files created were filtered to remove these with the parse_nonstandard_chars.py script, located here: https://gist.github.com/walterst/0a4d36dbb20c54eeb952

The 7 level RDP-compatible file was created by parsing the full 16 level taxonomy script created in the prior step with the custom script parse_to_7_taxa_levels.py, located here: https://gist.github.com/walterst/9ddb926fece4b7c0e12c

=============================
Generation of alignment files
=============================

After this filtering, the resulting aligned fasta file was filtered to remove the taxonomy strings from the labels, using the truncate_fasta_labels.py script located here:
https://gist.github.com/walterst/5b1db2a11a4490c57169

Finally, the U characters in this alignment were converted to T characters using Primer Prospector's clean_fasta.py script (with the --gap_chars option added to suppress gap removal).

The output filtered full alignment was parsed out using QIIME's filter_fasta.py (the representative sequence files, with labels fixed with the fix_fasta_labels.py script, used as input with the --subject_fasta_fp parameter) to create the reference alignment files for the 99%, 97%, 94%, 90%, and 80% files.

For creation of the core alignment, which has 100% gapped positions filtered, the full alignment was used to detect positions that are 100% gaps to remove in the core alignment file.

The full alignment file from Silva 123 (http://www.arb-silva.de/fileadmin/silva_databases/release_123/Exports/SILVA_123_SSURef_tax_silva_full_align_trunc.fasta.gz) was used with the find_all_gap_positions.py script, which is located here:
https://gist.github.com/walterst/db491ba0fd3916af6f5e

The total number of positions that are not 100% gapped is 32329 from the full alignment.

This mask of 100% gapped positions was used with QIIME's filter_alignment.py script. The SILVA123_gap_positions.txt file was passed with the --lane_mask_fp parameter, and --allowed_gap_frac was set to 1.0. The aligned 80% OTUs file was passed through this filter, and is the core alignment file in this release.

Note that while testing the core alignment (i.e. by running QIIME’s align_seqs.py script), I had to add the -e 50 parameter, due to variability in the sequence length when aligning all of the representative sequences against the core alignment. This should not be needed during general usage of the core alignment, as most reads are going to be of similar length.

===============================
Splitting fasta files by domain
===============================

Reference fasta files (aligned and unaligned) were created using the split_sequences_by_domain.py file https://gist.github.com/walterst/643848d1947f9d95f08b

============================================
Parsing and splitting taxonomy mapping files
============================================

The raw taxonomy mapping files were parsed out to match the representative sequence sets for each clustered subset (90, 94, 97, and 99). This was done using the parse_taxonomy_for_clustered_subset.py script located here: https://gist.github.com/walterst/4e78517e7130b445c620

The mapping files for each clustered subset were split by domain (to create the eukaryote-only files) using the script split_taxonomy_by_domain.py, located here: https://gist.github.com/walterst/6a7c8159dc816c234943

===================
Alignment Filtering
===================

The suggested alignment filtering settings for filter_alignment.py (1.9.0 QIIME and later) are recommended to be -e 0.10 -g 0.80. For 1.8.0 QIIME, the entropy setting needs to be much lower, e.g. -e 0.0005, as this filtering is performed first, rather than after removal of gap characters, in 1.8.0.

=================
Tree construction
=================

Individual alignments were filtered using the entropy and gap setting described above with QIIME’s filter_alignment.py script. The subsequent tree was built with FastTree (2.1.3) using these parameters: -spr 4 -gamma -fastest -no2nd -nt

These trees were visualized with Dendroscope after using a custom script to rename tips to taxonomy strings (https://gist.github.com/walterst/a46252cf1b6000490bb6). Generally correct Eukaryotic tree (SAR taxa grouped together, most Opisthokonts together, etc) although some taxa are split into deeply separated clades.

Users wishing to build trees with their own data, or data clustered at other identities than the 97% OTUs, can use the above approach.

======================
Other Processing Notes
======================

The full Silva alignment file is very large, ~89 gigabytes when uncompressed. If one is working with a comparably sized data set, the amount of memory and hard drive space could be significant for performing the steps above when creating a new database.

Total number of sequences (all domains) for each clustering identity:
99% 537233
97% 251764
94% 113083
90% 44401
80% 5038

Aligned representative sequence files are available in a separate archive, 
rep_set_aligned_gap_pos.tar.gz. These files contain the SILVA alignment file (with U characters converted to T characters using Primer Prospector’s clean_fasta.py script and the -g option to retain gaps) the mask of 100% gapped positions (SILVA123_gap_positions.txt) text file is included in this archive.

=================================
Consensus and Majority Taxonomies
=================================

Custom scripts (linked when described) and code from Mike Robeson (https://github.com/mikerobeson/Misc_Code/tree/master/SILVA_to_RDP) was used to generate taxonomy strings from the full NR Silva 119 fasta file, which is used by the custom scripts below when generating consensus/majority taxonomy strings. The clustered OTU mapping files were generated with QIIME 1.9.0 as described earlier in this document under the “Filtering raw fasta file, creation of representative sequence files” section.

Reason for these alternative taxonomy string files:

A user of the Silva119 data pointed out that the taxonomy with the SILVA119 release is based only upon the taxonomy string of the representative sequence for the cluster of reads, which could lead to incorrect confidence in taxonomy assignments at the fine level (genus/species). To address this, I have endeavored to create taxonomy strings that are either consensus (all taxa strings must match for every read that fell into the cluster) or majority (greater than or equal to 90% of the taxonomy strings for a given cluster). If a taxonomy string fails to be consensus or majority, then it becomes ambiguous, moving up the levels of taxonomy until consensus/majority taxonomy strings are met.

For example, if a cluster had two reads, and one taxonomy string was:
D_0__Archaea;D_1__Euryarchaeota;D_2__Methanobacteria;D_3__Methanobacteriales;D_4__Methanobacteriaceae;D_5__Methanobrevibacter;D_6__Methanobrevibacter sp. HW3
and the second taxonomy string was:
D_0__Archaea;D_1__Euryarchaeota;D_2__Methanobacteria;D_3__Methanobacteriales;D_4__Methanobacteriaceae;D_5__Methanobrevibacter;D_6__Methanobrevibacter smithii

Then for either consensus or majority strings, the level 7 (0 is the first level, the domain) data would become ambiguous, as the species levels do not match. The above string for the representative sequence taxonomy mapping file becomes:
D_0__Archaea;D_1__Euryarchaeota;D_2__Methanobacteria;D_3__Methanobacteriales;D_4__Methanobacteriaceae;D_5__Methanobrevibacter;Ambiguous_taxa

Because the taxonomy strings are not perfectly matched in terms of names/depths across all  of the SILVA data, this can lead to some taxonomies being more ambiguous with my approach (exact string matches) than they actually are, particularly for the eukaryotes. There are over 1.5 million taxonomy strings in the non-redundant SILVA 119 release, so I can’t fault the maintainers of SILVA for these taxonomy strings being imperfect from a parsing/bioinformatics perspective.

The scripts used to create the consensus and 90% majority taxonomy strings are located here:
https://gist.github.com/walterst/bd69a19e75748f79efeb
https://gist.github.com/walterst/f6f08f6583bb320bb10d

==========
References
==========

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J Mol Biol 215(3):403-410.

Edgar RC. 2010. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26(19):2460-2461.

Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J, Gloeckner FO. 2013. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res 41: D590-596.

Wang Q, Garrity GM, Tiedje JM, Cole JR. 2007. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microb 73(16): 5261-5267.