Processing FASTQ Reads into an OTU Table
Before you begin
If you would like a file with all the commands listed, download the file below. If you run it as is, be aware that it does not assume you are in an HPC environment, so you must change the parameters to suit your own setup. It also assumes you want to run every step, including the removal of chimeras.
Download QIIME Processing Workflow (Right click + Save As)
Step 1. Open-Reference OTU Picking
http://qiime.org/scripts/pick_open_reference_otus.html
Description
Once the libraries are split, we are ready for the most important step: running the workflow that clusters sequences and assigns them to particular taxa. This is also one of the more computationally intensive steps, so it is recommended to run it in a cluster environment or on a very powerful computer. We will use the pick_open_reference_otus workflow because it combines two of the main approaches in QIIME (see: http://qiime.org/tutorials/otu_picking.html).
One last thing to do before picking OTUs is to prepare a parameters file that customizes steps along the workflow. This is optional; you can leave QIIME on its default settings and default programs. The parameters file used in this workflow changes the taxonomy assignment program to RDP and changes RDP's confidence threshold based on study data.
Note: these are parameters for one particular workflow. You should use the parameters best suited to your data and to your lab's workflow.
Download parameters file for more customization during the OTU-picking steps.
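As a rough sketch of what such a parameters file can look like, the snippet below writes a two-line file that switches the taxonomy assigner to RDP and sets its confidence threshold. The `script_name:option value` layout follows QIIME's parameters file format. The file name `example_param.txt` and the threshold of 0.8 are placeholders chosen for illustration, not the values in the downloadable file.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical location; any path works as long as it is passed to -p / --parameter_fp.
param_dir=$(mktemp -d)
param_file="$param_dir/example_param.txt"

# One "script_name:option value" pair per line.
# The 0.8 threshold is an illustrative placeholder, not a recommendation.
cat > "$param_file" <<'EOF'
assign_taxonomy:assignment_method rdp
assign_taxonomy:confidence 0.8
EOF

# Show what was written.
cat "$param_file"
```

Keeping these settings in a file rather than on the command line means the same choices can be reused across runs and shared with collaborators.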
Parameters
--input_fps | -i
The resulting file from the split libraries step. The file will always be called seqs.fna or seqs.fastq.
--output_dir | -o
A folder to store the workflow files. This command generates many folders, subfolders and files.
--parameter_fp | -p
The location of the parameters file downloaded above, or your own custom parameters file.
--parallel | -a
Enable parallel processing (drastically speeds up the process).
--jobs_to_start | -O
Specify the number of parallel jobs you want to start.
Command
pick_open_reference_otus.py \
-i split_libraries/seqs.fna \
-o pick_otus \
-p 16S_pickotu_param.txt \
--parallel \
-O 4
Output
Many files are created during the OTU-picking workflow, but the most important files to focus on are:
- The OTU Table (pick_otus/otu_table_mc2_w_tax_no_pynast_failures.biom)
- Phylogenetic Tree (pick_otus/rep_set.tre)
- Representative Sequences (pick_otus/rep_set.fna)
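Because the workflow writes many files, a quick check that the three key outputs exist can save time before moving on. The helper below is a minimal sketch: the function name `check_otu_outputs` is hypothetical, and the demo runs against a throwaway folder of stand-in files rather than real QIIME output.

```shell
#!/usr/bin/env bash
set -eu

# Report which of the three key workflow outputs exist and are non-empty
# in a given folder. Prints one "ok" or "MISSING" line per file and
# returns non-zero if any file is missing.
check_otu_outputs() {
    local out_dir=$1 status=0 f
    for f in otu_table_mc2_w_tax_no_pynast_failures.biom rep_set.tre rep_set.fna; do
        if [ -s "$out_dir/$f" ]; then
            echo "ok       $out_dir/$f"
        else
            echo "MISSING  $out_dir/$f"
            status=1
        fi
    done
    return "$status"
}

# Demo on a temporary folder with placeholder files (not real QIIME output).
demo_dir=$(mktemp -d)
for f in otu_table_mc2_w_tax_no_pynast_failures.biom rep_set.tre rep_set.fna; do
    echo "placeholder" > "$demo_dir/$f"
done
check_otu_outputs "$demo_dir"
```

After a real run, the same function would be called as `check_otu_outputs pick_otus`.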
Note:
If you do not want to remove chimeras from your data, you can skip the next section of this workflow and use the three files above to start filtering samples or taxa and performing analyses.
Step 2. Removing Chimeras
http://qiime.org/tutorials/chimera_checking.html
Description
Chimeras are sequences generated by abnormal amplification during the PCR stage. Tools have been developed to remove these sequences, but at the cost of some data. Although this topic is still up for debate, this workflow will walk through the steps. If you do not want to remove chimeras and are ready to analyze the data, you can skip this step and use the files generated by the pick_open_reference_otus.py command.
Step 2a. Identify Chimeric Sequences
http://qiime.org/scripts/parallel_identify_chimeric_seqs.html
Description
This command is relatively computationally intensive, so it should be run in parallel or in a cluster environment. It generates a text file that lists the chimeric sequences. Once they are identified, we will need to filter them out of the aligned representative sequences and then regenerate the phylogenetic tree and OTU table.
Parameters
--input_fasta_fp | -i
An aligned representative sequences FASTA file output by the pick_open_reference_otus.py command. (If you are following this tutorial, it will be located in pick_otus/pynast_aligned_seqs/rep_set_aligned.fasta.)
--output_fp | -o
A name for the text file that will list the chimeras detected. The file must have a .txt extension.
--chimera_detection_method | -m
Program used to detect chimeras. [Default = ChimeraSlayer]
--jobs_to_start | -O
Number of jobs to start, to shorten computation time.
parallel_identify_chimeric_seqs.py \
-i pick_otus/pynast_aligned_seqs/rep_set_aligned.fasta \
-o pick_otus/chimeric_seqs.txt \
-m ChimeraSlayer \
-O 4
Note:
The --aligned_reference_seqs_fp | -a
option refers to the aligned reference sequences from the Greengenes database used during OTU picking. By default, MacQIIME uses its default reference alignment. If you used a different reference file for the OTU-picking step, you must set this parameter to the file you used.
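Before filtering, it can help to see how many representative sequences were flagged. The sketch below counts FASTA records and flagged IDs with standard tools. It assumes the chimera list holds one flagged sequence per line, with the ID first and any parent sequences after it, and it runs on small stand-in files rather than real output.

```shell
#!/usr/bin/env bash
set -eu

# Stand-in files for the demo; a real run would point at
# pick_otus/pynast_aligned_seqs/rep_set_aligned.fasta and pick_otus/chimeric_seqs.txt.
demo_dir=$(mktemp -d)
printf '>otu1\nACGT\n>otu2\nAC-T\n>otu3\nA-GT\n>otu4\nACG-\n' > "$demo_dir/rep_set_aligned.fasta"
# Assumed layout: flagged ID first, optionally followed by tab-separated parents.
printf 'otu2\tref_a\tref_b\notu4\tref_c\tref_d\n' > "$demo_dir/chimeric_seqs.txt"

# Count FASTA records (header lines) and flagged IDs (non-empty lines).
n_total=$(grep -c '^>' "$demo_dir/rep_set_aligned.fasta")
n_chimeras=$(awk 'NF > 0 { n++ } END { print n + 0 }' "$demo_dir/chimeric_seqs.txt")

# Integer percentage of representative sequences flagged as chimeric.
pct=$(( 100 * n_chimeras / n_total ))
echo "$n_chimeras of $n_total representative sequences flagged (${pct}%)"
```

An unexpectedly high percentage may be a hint that the reference alignment passed with -a does not match the one used for OTU picking.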
Step 2b. Filtering Chimeras from Files
http://qiime.org/scripts/filter_fasta.html
http://qiime.org/scripts/filter_alignment.html
Description
Now that we have identified the chimeric sequences, we can remove them from the aligned representative sequences generated by the OTU-picking workflow. These files are used to generate our OTU table and phylogenetic tree, so we must filter them first and then regenerate the cleaned files.
Parameters
--input_fasta_fp | -f
An aligned representative sequences FASTA file output by pick_open_reference_otus.py. (If you are following this tutorial, it will be located in pick_otus/pynast_aligned_seqs/rep_set_aligned.fasta.)
--output_fasta_fp | -o
The name of the output aligned representative sequence fasta file without chimeras.
--seq_id_fp | -s
The location of the chimera sequence text file generated from the command above.
--negate | -n
Discard, rather than keep, the sequences listed in the chimera text file.
Command
filter_fasta.py \
-f pick_otus/pynast_aligned_seqs/rep_set_aligned.fasta \
-o pick_otus/pynast_aligned_seqs/rep_set_aligned_chimerafree.fasta \
-s pick_otus/chimeric_seqs.txt \
--negate
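To make the effect of --negate concrete, the sketch below reproduces the same idea with awk on a tiny stand-in alignment: every record whose ID appears in the chimera list is dropped and everything else is kept. It is an illustration only, not a replacement for filter_fasta.py, and it assumes the sequence ID is the first whitespace-delimited token of each FASTA header and of each line in a non-empty chimera list.

```shell
#!/usr/bin/env bash
set -eu

# Stand-in inputs for the demo (not real QIIME output).
demo_dir=$(mktemp -d)
printf '>otu1 sample_a\nACGT\n>otu2 sample_a\nAC-T\n>otu3 sample_b\nA-GT\n' > "$demo_dir/aligned.fasta"
printf 'otu2\tref_a\tref_b\n' > "$demo_dir/chimeric_seqs.txt"

# First file: remember every ID found in column 1 of the chimera list.
# Second file: on each header line decide whether to keep the record,
# then print every line that belongs to a kept record.
awk '
    NR == FNR { if (NF > 0) drop[$1] = 1; next }
    /^>/      { id = substr($1, 2); keep = !(id in drop) }
    keep
' "$demo_dir/chimeric_seqs.txt" "$demo_dir/aligned.fasta" > "$demo_dir/chimerafree.fasta"

cat "$demo_dir/chimerafree.fasta"
```

Without the negation, the same ID list would select the chimeric records instead of removing them, which is why the flag matters here.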
In addition, we need to filter the newly cleaned aligned representative sequences to remove any highly variable regions. The input of the command below is simply the output from the previous step, and the output location is the pick_otus folder. The command should generate only one file, whose name ends in _pfiltered.fasta.
filter_alignment.py \
-i pick_otus/pynast_aligned_seqs/rep_set_aligned_chimerafree.fasta \
-o pick_otus/
Step 2c. Making New Phylogenetic Tree without Chimeras
http://qiime.org/scripts/make_phylogeny.html
Description
Now that we have properly filtered out the chimeric sequences from the necessary files, we can generate a phylogenetic tree. This tree is used in various metrics and is used to determine the distances between samples based on composition.
Parameters
--input_fp | -i
An aligned representative sequences FASTA file without chimeras, generated by the command filter_alignment.py
--result_fp | -o
The name of the new phylogenetic tree to generate without chimeras.
Command
make_phylogeny.py \
-i pick_otus/rep_set_aligned_chimerafree_pfiltered.fasta \
-o pick_otus/rep_set_chimerafree.tre
Step 2d. Making New OTU Table without Chimeras
http://qiime.org/scripts/make_otu_table.html
Description
Now that we have properly filtered out the chimeric sequences from the necessary files, we can generate a new OTU table. This file will be similar to the one generated in the OTU-picking step, but it will have fewer sequences per sample because some OTUs are removed completely.
Parameters
--otu_map_fp | -i
The OTU map generated by pick_open_reference_otus.py. The file will be called pick_otus/final_otu_map_mc2.txt if the minimum count required to retain an OTU was not changed.
--output_biom_fp | -o
The name of the final output OTU table without chimeras.
--taxonomy | -t
The location of the taxonomic assignments. If you used RDP as your taxonomy assigner, it will be located in pick_otus/rdp_assigned_taxonomy/rep_set_tax_assignments.txt
--exclude_otus_fp | -e
The location of the chimera sequence text file generated by the command parallel_identify_chimeric_seqs.py.
make_otu_table.py \
-i pick_otus/final_otu_map_mc2.txt \
-o pick_otus/otu_table_rdp_nochimera.biom \
-t pick_otus/rdp_assigned_taxonomy/rep_set_tax_assignments.txt \
-e pick_otus/chimeric_seqs.txt
Step 3. Final Output
After removing chimeric sequences, you will have a new OTU table (.biom) and a new phylogenetic tree (.tre) to use as your starting files. In the end you will use the following files:
Without chimeras removed
- The OTU Table (pick_otus/otu_table_mc2_w_tax_no_pynast_failures.biom)
- Phylogenetic Tree (pick_otus/rep_set.tre)
- Representative Sequences (pick_otus/rep_set.fna)
With chimeras removed
- The OTU Table (pick_otus/otu_table_rdp_nochimera.biom)
- Phylogenetic Tree (pick_otus/rep_set_chimerafree.tre)
- Representative Sequences (pick_otus/pynast_aligned_seqs/rep_set_aligned_chimerafree.fasta)
Step 4. Summarizing Sample OTU Counts
http://biom-format.org/documentation/summarizing_biom_tables.html
Description
After processing the sequences and generating a BIOM file based on OTU counts per sample, we finally want to summarize our findings and describe the OTU counts per sample. Knowing the mean, median, standard deviation and range of the OTU counts per sample is crucial for judging the efficiency of the sequencing run and of OTU picking. If many samples have very low counts, you may have to change a few parameters or examine the samples in question before drawing a conclusion. To generate this report we will use a command from the biom package, which is installed along with QIIME.
Parameters
--input-fp | -i
The location of the BIOM file/OTU table that you would like to summarize.
--output-fp | -o
The output location for the newly generated text file with summary data about the input BIOM file/OTU table.
The input of this command is any BIOM file you wish to summarize. In this case, we will use the OTU table generated after removing chimeras. The output of this command is a text file with descriptive statistics.
biom summarize-table \
-i pick_otus/otu_table_rdp_nochimera.biom \
-o pick_otus/otu_table_rdp_nochimera_stats.txt
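Once the summary exists, low-count samples can be pulled out of it with a short script. The sketch below assumes the text file contains a "Counts/sample detail:" section followed by one "sample_id: count" line per sample, and that sample IDs contain no spaces. The stand-in summary and the cutoff of 1000 are hypothetical; pick a cutoff that suits your own data.

```shell
#!/usr/bin/env bash
set -eu

# Stand-in summary file for the demo (layout assumed, values invented).
demo_dir=$(mktemp -d)
cat > "$demo_dir/stats.txt" <<'EOF'
Num samples: 3
Num observations: 25

Counts/sample detail:
 S1: 120.0
 S2: 4500.0
 S3: 9800.0
EOF

min_count=1000   # hypothetical cutoff, not a recommendation

# Read only the lines after the "Counts/sample detail:" header and
# print the ID of every sample whose count is below the cutoff.
low_samples=$(awk -v min="$min_count" '
    /^Counts\/sample detail:/ { in_detail = 1; next }
    in_detail && NF == 2 {
        id = $1; sub(/:$/, "", id)
        if ($2 + 0 < min) print id
    }
' "$demo_dir/stats.txt")

echo "Samples below $min_count counts: ${low_samples:-none}"
```

For a real run, replace the stand-in file with pick_otus/otu_table_rdp_nochimera_stats.txt.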