Filtering Samples from OTU-Table
Introduction
The most common type of filtering is removing groups of samples from the table. This is the most important filter because it lets you remove a particular group or time point from the table, or remove samples below a particular sequencing depth.
There are a few different ways to filter out data. The command works differently depending on the type and number of samples in a particular group. Either way, the command takes an argument in the same format: the name of a column in your mapping file and a factor or level within that column, separated by a colon and surrounded by quotes.
column variable:name of the group ('Treatment:Control')
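Because the group name must match the mapping file exactly, it can help to list the values a column actually contains before building the argument. The following is only a sketch and not part of QIIME: it assumes a tab-separated mapping file whose first line is the header, and the helper name `column_values` and the sample data are hypothetical.

```shell
# Hypothetical mapping file for illustration (tab-separated, QIIME-style header).
mapping=$(mktemp)
printf '#SampleID\tTreatment\tDay\n'  > "$mapping"
printf 'S1\tControl\t0\n'            >> "$mapping"
printf 'S2\tFast\t0\n'               >> "$mapping"
printf 'S3\tControl\t28\n'           >> "$mapping"

# column_values FILE COLUMN: print each distinct value found in COLUMN.
# Not a QIIME command; this is plain awk over the mapping file.
column_values() {
  awk -F'\t' -v col="$2" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
    c && $0 !~ /^#/ { print $c }
  ' "$1" | LC_ALL=C sort -u
}

column_values "$mapping" Treatment
```

For this made-up file the helper prints Control and Fast, so 'Treatment:Control' and 'Treatment:Fast' would both be usable arguments.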
1a. Positive Filtering of Samples
http://qiime.org/scripts/filter_samples_from_otu_table.html
Description
The first way is positive filtering: you tell the script which groups you WANT to keep. For this example, suppose the column variable has a total of 3 different groups, 'Treatment:Group1,Group2,Group3'. To remove 'Group3', list only the groups to keep ('Treatment:Group1,Group2'). The command below applies the same pattern to a SampleType column, keeping the gut and tongue samples.
Parameters
--input_fp | -i
Input OTU table in .biom format.
--output_fp | -o
The name of the new filtered output biom file.
--mapping_fp | -m
The mapping file that corresponds to the input OTU table.
--output_mapping_fp
The location of the new mapping file which will match the newly created BIOM file.
--valid_states | -s
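The mapping file column and the names of the groups you want to keep, written as 'Column:Group' (or, with the *,! syntax described in 1b, the group to exclude). It MUST be surrounded by single or double quotes.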
Command
filter_samples_from_otu_table.py \
-i otu_table.biom \
-o otu_table_filtered.biom \
-m mapping_file.txt \
--output_mapping_fp mapping_file_filtered.txt \
-s 'SampleType:gut,tongue'
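To preview which samples a positive filter would keep before running the script, the mapping file can be inspected directly. This is a rough sketch under stated assumptions, not QIIME's own logic: it assumes a tab-separated mapping file with the SampleID in the first column, handles only the simple 'Column:v1,v2' form, and uses a hypothetical helper `kept_samples` with made-up sample names.

```shell
# Hypothetical mapping file (tab-separated); the sample names are made up.
mapping=$(mktemp)
printf '#SampleID\tSampleType\n'  > "$mapping"
printf 'S1\tgut\n'               >> "$mapping"
printf 'S2\ttongue\n'            >> "$mapping"
printf 'S3\tskin\n'              >> "$mapping"

# kept_samples FILE COLUMN VALUES: list SampleIDs whose COLUMN value appears
# in the comma-separated VALUES list (mimics 'COLUMN:v1,v2' only).
kept_samples() {
  awk -F'\t' -v col="$2" -v vals="$3" '
    BEGIN { n = split(vals, v, ","); for (i = 1; i <= n; i++) keep[v[i]] = 1 }
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
    c && $0 !~ /^#/ && ($c in keep) { print $1 }
  ' "$1"
}

kept_samples "$mapping" SampleType gut,tongue
```

With this example data the helper lists S1 and S2, the two samples that 'SampleType:gut,tongue' is expected to retain.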
1b. Negative Filtering of Samples
Description
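The second way is negative filtering: you tell the script which groups you DO NOT WANT to keep. Negative filtering requires a few special characters: put *,! before the name of the group to remove. The * matches every group and the ! excludes the group that follows it. Using the example found above, removing 'Group3' would be written 'Treatment:*,!Group3'; the command below removes the 'gut' group in the same way.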
Command
filter_samples_from_otu_table.py \
-i otu_table.biom \
-o otu_table_no_gut.biom \
-m mapping_file.txt \
--output_mapping_fp mapping_file_no_gut.txt \
-s 'Treatment:*,!gut'
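A quick tally of how many samples a negative filter would keep and remove can catch a misspelled group name, since a name that matches nothing removes zero samples. The sketch below is an approximation written for illustration, not QIIME code: the helper `negative_filter_counts` and the mapping file contents are hypothetical, and a tab-separated mapping file is assumed.

```shell
# Hypothetical mapping file (tab-separated); values are made up.
mapping=$(mktemp)
printf '#SampleID\tTreatment\n'  > "$mapping"
printf 'S1\tgut\n'              >> "$mapping"
printf 'S2\ttongue\n'           >> "$mapping"
printf 'S3\tgut\n'              >> "$mapping"
printf 'S4\tskin\n'             >> "$mapping"

# negative_filter_counts FILE COLUMN VALUE: report how many samples a
# 'COLUMN:*,!VALUE' filter would keep and how many it would remove.
negative_filter_counts() {
  awk -F'\t' -v col="$2" -v bad="$3" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
    c && $0 !~ /^#/ { if ($c == bad) removed++; else kept++ }
    END { printf "kept=%d removed=%d\n", kept, removed }
  ' "$1"
}

negative_filter_counts "$mapping" Treatment gut
```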
There are many more features within filter_samples_from_otu_table.py, such as the ability to remove high coverage samples or to keep only samples that match a particular list of SampleIDs. See the QIIME website link above for more examples.
1c. Advanced usage of Sample Filtering
Description
If you want to get a bit more advanced, you can specify multiple variables at the same time. If you want to filter on multiple groups as well as a particular study, put a semicolon (;) between statements. The command below keeps only the samples whose Treatment is gut and whose Day is 28.
Command
filter_samples_from_otu_table.py \
-i otu_table.biom \
-o otu_table_gut_d28.biom \
-m mapping_file.txt \
--output_mapping_fp mapping_file_gut_d28.txt \
-s 'Treatment:gut;Day:28'
2. Splitting The Table Based on Group Information
http://qiime.org/scripts/split_otu_table.html
Description
Another important script for managing the OTU table is the split function. This command is very similar to filter_samples_from_otu_table.py, but instead of outputting a single filtered OTU table, it generates a new OTU table for each factor/group of the chosen variable. If you have separate studies or many timepoints in one OTU table and do not want to filter out every group individually, you can split the table, which creates a new biom file for each unique value of the chosen category. For example, if your table covers several timepoints and you want a separate biom file and mapping file for each one, you can use this command.
The output of this command is a new folder that contains a new biom file for each factor in the chosen column variable. Below is an example file tree listing of the output files for a table with six timepoints.
per_timepoint_tables/
-> otu_table_timepoint1.biom
-> otu_table_timepoint2.biom
-> otu_table_timepoint3.biom
-> otu_table_timepoint4.biom
-> otu_table_timepoint5.biom
-> otu_table_timepoint6.biom
Parameters
--biom_table_fp | -i
Input OTU table in .biom format.
--output_dir | -o
The name and location of the folder to store all the output biom files.
--mapping_fp | -m
The mapping file that corresponds to the input OTU table.
--fields | -f
The name of the mapping file column (variable) to split the table by.
Command
split_otu_table.py \
-i otu_table.biom \
-o split_by_month \
-m mapping_file.txt \
-f Month
3. Removing Samples with Low Sampling Depth
http://qiime.org/scripts/filter_samples_from_otu_table.html
Description
Instead of filtering based on mapping file data, you may want to perform quality filtering on the samples regardless of group.
One way to filter the samples based on quality is to remove any sample with a total observation count (the number of sequences in the sample) below a certain threshold. Typically you want to retain as many samples as possible to maximize your analysis, but most analyses cannot be performed on samples that contain only 5 or 10 sequences. These samples are typically removed before proceeding with any further analysis, as their low counts will severely skew the data and results.
To determine the correct threshold, it is highly recommended to run biom summarize-table on the OTU table to generate a report of the per-sample observation count.
# Get a table of sampling depth of all samples in an OTU table
biom summarize-table \
-i otu_table.biom \
-o otu_table_stats.txt
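Once the report exists, the samples that would fall below a candidate threshold can be pulled out of it. The sketch below rests on an assumption about the report layout: that it lists one "SampleID: count" line per sample after a "Counts/sample detail:" heading. Check your own otu_table_stats.txt before relying on it. The helper `samples_below` and the excerpt are hypothetical.

```shell
# Hypothetical excerpt of a summarize-table report (values are made up;
# the layout is an assumption and may differ between biom versions).
stats=$(mktemp)
cat > "$stats" <<'EOF'
Num samples: 3
Num observations: 42
Total count: 3512

Counts/sample detail:
S1: 12.0
S2: 1500.0
S3: 2000.0
EOF

# samples_below FILE THRESHOLD: print "SampleID count" for every sample
# whose count in the detail section is below THRESHOLD.
samples_below() {
  awk -v min="$2" '
    /^Counts\/sample detail:/ { detail = 1; next }
    detail && NF == 2 {
      id = $1; sub(/:$/, "", id)
      if ($2 + 0 < min + 0) print id, $2
    }
  ' "$1"
}

samples_below "$stats" 1000
```

In this made-up report only S1 falls below 1000, so a --min_count of 1000 would be expected to drop that one sample.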
Parameters
--input_fp | -i
Input OTU table in .biom format.
--output_fp | -o
The name of the output filtered biom file.
--mapping_fp | -m
The mapping file that corresponds to the input OTU table.
--output_mapping_fp
The location of the new mapping file which will match the newly created biom file.
--min_count | -n
The minimum cutoff for the number of sequences per sample. Any samples with a total sequencing depth below this number will be removed from the table.
Command
filter_samples_from_otu_table.py \
-i otu_table.biom \
-o otu_table_m1000.biom \
-m mapping_file.txt \
--output_mapping_fp mapping_file_m1000.txt \
--min_count 1000
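After filtering, comparing the number of samples in the original and filtered mapping files gives a quick check of how many samples were dropped. This is a minimal sketch, not a QIIME feature: it assumes each non-comment line of a mapping file is one sample, and the helper `count_samples` and the two stand-in files are hypothetical.

```shell
# Hypothetical stand-ins for mapping_file.txt and mapping_file_m1000.txt.
before=$(mktemp)
after=$(mktemp)
printf '#SampleID\tTreatment\nS1\tControl\nS2\tFast\nS3\tControl\n' > "$before"
printf '#SampleID\tTreatment\nS2\tFast\nS3\tControl\n'              > "$after"

# count_samples FILE: number of non-comment, non-empty lines (one per sample).
count_samples() {
  grep -v '^#' "$1" | grep -c .
}

echo "removed $(( $(count_samples "$before") - $(count_samples "$after") )) sample(s)"
```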