Forward/Reverse with an Index
The next steps assume that your data comes back from the sequencing facility with a forward, reverse and index read in fastq format. If your data has a FASTQ file for each sample and has already been de-multiplexed, please go to the appropriate section. Before we can generate any OTU's or microbiome composition counts, the sequencing reads must first be joined, filtered and de-multiplexed. The following sections begin to outline this plan.
Step 1. Decompress and Change Header Information
Our very first step is to decompress the sequences so that they are usable for analysis. Usually sequences are received compressed to save space and time during file transfers. The data we get back are typically in
fastq.gz format. The
gz part means that the data is zipped\/compressed to conserve space, so first we need to unzip to obtain a fastq file which we can use for the next steps. Running the command
gunzip will not create any new files and will remove the
.gz part of the file.
gunzip sequences/forward_R1_001.fastq.gz gunzip sequences/barcode_R2_001.fastq.gz gunzip sequences/reverse_R3_001.fastq.gz
Typically we have to perform one minor processing step to our barcode read (R2) before using it in any of the commands. Each sequence within a FASTQ has an header line which acts as an identifier for that particular read. When the files are generated, the header in the barcode read doesn't match the forward and reverse reads unless we switch one letter in the header line. This step can be solved in one quick command line code. To fix we use the
sed command which essentially switches letters in a any file.
We are going to change the line
1:N:0:. To make this minor change we run the command below:
# This command will assume you have changed directory into your sequences folder. sed 's/2:N:0:/1:N:0:/g' sequences/barcode_R2_001.fastq > sequences/barcode_R2_001_fixedheaders.fastq
Step 2. Joining Paired End Reads
Most of our sequence runs are performed using paired-end reads, so before analyzing the 16s rRNA gene, we must first join the forward and reverse reads together. This step is performed using the
join_paired_ends.py command. By default the command uses ea-utils which must be downloaded and installed into your path if you are NOT using MacQIIME. If you are using MacQIIME, all the necessary additionally utilities are installed by default.
--forward_reads_fp | -f
The forward illumina fastq read. Typically will have the characters
--reverse_reads_fp | -r
The reverse illumina fastq read. Typically will have the characters,
--output_dir | -o
The name of the folder to place output files.
--pe_join_method | -m
Which program to use for joining reads. Can be either fastq-join or SeqPrep
--min_overlap | -j
The minimum number of overlapping base pairs between the two reads. (This is an important parameter depending upon the length of your reads. In a 150x150 paired-end read run there should be a 50bp overlap when the two reads are merged. Increasing this number increases the stringency of your sequence filtering and you will lose low quality scores. This parameters should be consider based on your data!
--perc_max_diff | -p
Maximum allowed % differences within region of overlap.
join_paired_ends.py \ -f sequences/forward_R1_001.fastq \ -r sequences/reverse_R3_001.fastq \ -b sequences/barcode_R2_001_fixedheaders.fastq \ -o join_pe_reads \ -m fastq-join \ -j 40 -p 75
Step 3. De-multiplexing Reads
After joining the paired-end reads and creating a single-end read and barcode, we must split the library so that each read has a corresponding sampleID based on the unique barcode in the mapping file. This step also allows us to perform some base quality filtering. There depending upon how your data was recieved and what type of barcodes were used, this step may have different parameters.
--sequence_read_fps | -i
The joined forward Illumina fastq read. File will have the name,
fastqjoin.join.fastq, if join_paired_ends.py was used.
--output_dir | -o
The output folder name.
--barcode_read_fps | -b
The joined forward Illumina fastq read. Will have the name,
fastqjoin.join_barcodes.fastq, if join_paired_ends.py was used.
--mapping_fps | -m
Mapping file that associates barcodes with SampleID's
--phred_quality_threshold | -q
The minimum base quality score to retain a nucleotide. (Default = 3)
--min_per_read_length_fraction | -p
The minimum number of high quality nucleotides to include a read.
Used when using Golay 12-bp barcodes. This may different depending upon you lab's primer setup.
Store split libraries as a FASTQ file for other uses (e.g sequence submission or use with DAD2)
split_libraries_fastq.py \ -i join_pe_reads/fastqjoin.join.fastq \ -o split_libraries \ -b join_pe_reads/fastqjoin.join_barcodes.fastq \ -m mapping_file.txt \ --store_demultiplexed_fastq \ -q 20 -p 0.75 --rev_comp_mapping_barcodes
Once the library has been split, there should be a single
seqs.fna file in the resulting output folder form the command above. This sequences file includes all the sequences to be aligned and clusters, but each of the sequences now contain which sample their originated from based on barcode data.