Removing Human Contamination
This tutorial will walk through how to remove human contaminated sequences from your 16s rRNA reads. This may be necessary when data must be upload to a public database, but cannot have any human associated DNA reads. The following guide will walk through how to remove the reads from each sample FASTQ file, which can be used for submission to the SRA database.
If you are running the command ona cluster environment, you can download the work flow script, which contains the commands run below.
Note: This work flow assumes that your sequences are in the Illumina Forward/Reverse/Barcode FASTQ file format.
Step 1. Install KneadData
http://huttenhower.sph.harvard.edu/kneaddata
KneadData is a tool developed by the Huttenhower group for removing sequences. The first step is to download the package and install using pip. If you do not have pip, you can run the command sudo easy_install pip
first and enter your computers password. Additionally, if you are installing this program on a cluster environment, you must add the --user
parameters to the end of the KneadData install command.
pip install https://bitbucket.org/biobakery/kneaddata/downloads/kneaddata_v0.5.1.tar.gz
Step 2. Download human reference genome
To remove the human contaminated sequences, we must have a reference of the human genome to align the reads. Fortunately, KneadData contains a function to automatically download a human genome that has been put into Bowtie2 format (required for KneadData). The output is a folder which contains multiple files. You must have access to this folder in future commands, so it is important to keep it in an accessible location. Additionally, this will take up space (~4GB) and could take a long time to download (~10-20 min).
kneaddata_database \
--download human bowtie2 \
human_genome/
Step 3. Joined Reads and Split Libraries
Once KneadData has been installed and the human genome has been downloaded to a folder on your computer, we must process the raw FASTQ files into individual FASTQ files that correspond to each sample. To do this we must run the first steps of the QIIME work flow (See the section: Preparing Sequence Data). If you have already run split libraries with the --store_demultiplexed_fastq
, you can skip this section.
Step 4. Split samples into individual FASTQ files
Once split_libraries_fastq.py
is complete, verify that there exists a file called seqs.fastq
within the split_libraries
output folder. The next step is to further split the file into FASTQ files per sample. Using the seqs.fastq
as an input, run the command below to generate individual sequencing files per sample in an output folder.
split_sequence_file_on_sample_ids.py \
-i split_libraries/seqs.fastq \
-o per_sample_fastq \
--file_type fastq
Step 5. Run KneadData on each of the samples
After generating the FASTQ files for each sample, we can now run KneadData on each to remove any human contaminated reads from the file. We could have run KneadData on our starting forward and reverse files, but there would be no way to add the barcode reads, so we first splits and then clean.
The best way to run the command on each of the files is to use a for loop. See for more detailed UNIX information, but essentially it is a command to run the same script many times.
A. First make a new directory that will store all the new clean fastq files
mkdir -p per_sample_fastq_clean
B. Next run the for loop that will run KneadData on each of the FASTQ files. Ensure the names of the input and output folders are correct.
for fastq in per_sample_fastq/*.fastq # Change ONLY if different split FASTQ files folder name
do
kneaddata \
--input $fastq \
--output per_sample_fastq_clean/$(basename "$fastq" .fastq) \
--db human_genome/Homo_sapiens_Bowtie2_v0.1/Homo_sapiens \
--threads 2 \
--bypass-trim \
--output-prefix $(basename "$fastq" .fastq)
done
Output
In each of the output folders, there will be 3 files which correspond to the clean FASTQ file, the contaminated reads and a log of the removal process.