4. QC Protocols
4.1. Overview
There are two over-arching QC scripts:
- illumina_qc.sh: runs the QC pipeline for a fastq (or fastq.gz) file generated by an Illumina instrument (fastq_screen and FASTQC). This is described in more detail in QC for Illumina sequencing data.
- solid_qc.sh: runs the QC pipeline for a csfasta and qual file pair (fragment mode) or a pair of pairs (paired-end mode) generated by a SOLiD instrument (solid2fastq, fastq_screen, solid_preprocess_filter and qc_boxplotter). This is described in more detail in QC for SOLiD sequencing data.
Both scripts automatically run a series of checks and data preparation steps on the data prior to any real analysis taking place. Some of the pipeline components can also be run independently (e.g. qc_boxplotter, fastq_screen.sh etc) - see the information in the QC Pipeline command reference.
Many of the QC scripts read their settings from the qc.setup file, which tells them where to find some of the underlying software and data files. See Create qc.setup for how to set up this file.
Note
Generally the QC scripts won’t overwrite outputs from a previous run; if you want to regenerate the outputs then you’ll need to remove the previous outputs first.
Each script only runs on the data for a single sample, so there are two additional helper scripts that are used to process multiple samples and check the results:
- run_qc_pipeline.py: used to run a specific script on multiple sets of files, also performing various job management operations (such as submitting jobs to Grid Engine).
- qcreporter.py: checks the outputs from the top-level SOLiD or Illumina QC scripts, and generates an HTML report.
A typical example of running illumina_qc.sh on all FASTQ files in a directory DIR might look like:
run_qc_pipeline.py --input=FORMAT illumina_qc.sh DIR
To verify that the QC has worked:
qcreporter.py --platform=illumina --format=FORMAT --verify DIR
and to generate the HTML QC report:
qcreporter.py --platform=illumina --format=FORMAT DIR
More specific examples are given in the following sections, and in the command reference for the utilities.
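As a rough illustration of the run, verify and report sequence above, the following sketch assembles the three command lines in Python. It only uses the options shown in this section; the function name qc_commands and its arguments are hypothetical helpers, not part of the QC utilities.

```python
import shlex


def qc_commands(data_dir, fmt="fastq", platform="illumina"):
    """Build the run/verify/report command lines as argument lists.

    Only options that appear in this section are used; the mapping from
    platform to QC script is taken from the overview above.
    """
    script = {"illumina": "illumina_qc.sh", "solid": "solid_qc.sh"}[platform]
    run = ["run_qc_pipeline.py", f"--input={fmt}", script, data_dir]
    verify = ["qcreporter.py", f"--platform={platform}",
              f"--format={fmt}", "--verify", data_dir]
    report = ["qcreporter.py", f"--platform={platform}",
              f"--format={fmt}", data_dir]
    return [run, verify, report]


if __name__ == "__main__":
    # Print the commands rather than executing them
    for cmd in qc_commands("DIR", fmt="fastqgz"):
        print(shlex.join(cmd))
```

Each list could be passed to subprocess.run once it is clear that the previous step has finished.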
4.2. QC for Illumina sequencing data
The full QC pipeline for Illumina data (e.g. GA2x, MiSeq, HiSeq and similar sequencer platforms) is encoded in the illumina_qc.sh script.
This can be run for a set of Illumina fastq or fastq.gz format files in a specific directory using the run_qc_pipeline.py utility. In its simplest form:
run_qc_pipeline.py --input=FORMAT illumina_qc.sh DIR
FORMAT can be either fastq or fastqgz; this will detect all matching files in the directory DIR and then use qsub to submit Grid Engine jobs to run the QC script on each file.
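To make the file detection step concrete, here is a minimal sketch of how matching files might be selected for each FORMAT value. The suffix mapping (fastq to .fastq, fastqgz to .fastq.gz) and the function name find_inputs are assumptions for illustration; the actual matching rules in run_qc_pipeline.py may differ.

```python
from pathlib import Path

# Assumed mapping from FORMAT values to file name suffixes
SUFFIXES = {"fastq": ".fastq", "fastqgz": ".fastq.gz"}


def find_inputs(data_dir, fmt):
    """Return the sorted files in data_dir whose names end with the suffix for fmt."""
    suffix = SUFFIXES[fmt]
    return sorted(
        p for p in Path(data_dir).iterdir()
        if p.is_file() and p.name.endswith(suffix)
    )
```

Note that a name ending in .fastq.gz does not end in .fastq, so the two formats select disjoint sets of files.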
For each sample, illumina_qc.sh generates fastq_screen plots for model organisms, other organisms and rRNAs, plus the report files from FASTQC. If the input files are fastq.gz then it can also produce uncompressed versions of the files (specify the --gunzip option to turn on this behaviour).
4.3. QC for SOLiD sequencing data
The full QC pipeline for SOLiD data is encoded in the solid_qc.sh script. It runs in either “fragment mode”, which takes a CSFASTA/QUAL file pair as input, or in “paired-end mode”, which takes two CSFASTA/QUAL file pairs as input (the first should be the F3 pair and the second the corresponding F5 pair).
This can be run in either mode for a set of SOLiD data files in a specific directory using the run_qc_pipeline.py command. In its simplest form, for “fragment” (F3) data:
run_qc_pipeline.py --input=solid solid_qc.sh DIR
or for paired-end (F3/F5) data:
run_qc_pipeline.py --input=solid_paired_end solid_qc.sh DIR
In each case this will detect all matching file groups in the directory DIR and then use qsub to submit Grid Engine jobs to run the QC script on each group.
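The grouping of files into CSFASTA/QUAL pairs, and of F3 pairs with their F5 partners, can be sketched as follows. This assumes the naming pattern used in the examples in this document (X.csfasta with X.qual, and X_F3 with X_F5); the function names are hypothetical and the real detection logic in run_qc_pipeline.py may handle other naming schemes.

```python
from pathlib import Path


def fragment_groups(data_dir):
    """Pair each X.csfasta with X.qual (assumed naming convention)."""
    groups = []
    for csfasta in sorted(Path(data_dir).glob("*.csfasta")):
        qual = csfasta.with_suffix(".qual")
        if qual.exists():
            groups.append((csfasta.name, qual.name))
    return groups


def paired_end_groups(data_dir):
    """Match each F3 pair with the F5 pair sharing its prefix, F3 first."""
    groups = []
    for f3_csfasta, f3_qual in fragment_groups(data_dir):
        if not f3_csfasta.endswith("_F3.csfasta"):
            continue
        prefix = f3_csfasta[: -len("_F3.csfasta")]
        f5 = (prefix + "_F5.csfasta", prefix + "_F5.qual")
        # Only report a group when both F5 files are present
        if all((Path(data_dir) / name).exists() for name in f5):
            groups.append((f3_csfasta, f3_qual) + f5)
    return groups
```

The F3 pair is listed first in each group, matching the order that paired-end mode expects.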
The pipeline consists of:
- solid2fastq: creates a FASTQ file from the input CSFASTA/QUAL file pair
- fastq_screen: checks the reads against 3 different screens (model organisms, “other” organisms and rRNA) to look for contaminants
- solid_preprocess_filter: runs the SOLiD_preprocess_filter_v2.pl program on the input CSFASTA/QUAL file pair to filter out “bad” reads, and reports the percentage filtered out (also produces a FASTQ and boxplot for the filtered data)
- qc_boxplotter: generates quality-score boxplots from the input QUAL file
The main outputs are the FASTQ file and a subdirectory qc which holds the screen and boxplot files.
(See the section above on “Illumina QC” for additional options available for run_qc_pipeline.py.)
To verify that the QC has worked, run the qcreporter.py command:
qcreporter.py --platform=solid --format=FORMAT --verify DIR
(where FORMAT is either solid or solid_paired_end), and to generate the HTML QC report:
qcreporter.py --platform=solid --format=FORMAT DIR
4.3.1. Outputs
SOLiD paired-end data
Say that the input files are PB_F3.csfasta, PB_F3.qual and PB_F5.csfasta, PB_F5.qual.

| Stage | Files | Description | Comments |
| Quality filtering | PB_F3_T_F3.csfasta, PB_F3_T_F3_QV.qual | F3 data after quality filter | |
| | PB_F5_T_F3.csfasta, PB_F5_T_F3_QV.qual | F5 data after quality filter | Only has F5 reads: ignore the F3 part of “T_F3” |
| Merge unfiltered | PB_paired.fastq | All unfiltered F3 and F5 data in one fastq file | Used for fastq_screen |
| Merge F3 filtered | PB_paired_F3_filt.fastq | Filtered F3 reads with the matching F5 partner | “Lenient” filtering: only the quality of the F3 reads is considered |
| Merge all filtered | PB_paired_F3_and_F5_filt.fastq | Filtered F3 reads and filtered F5 reads | “Strict” filtering: pairs of reads are rejected on the quality of either of the F3 or F5 components |
| Split FASTQs | PB_paired_F3_filt.F3.fastq | F3 reads only from PB_paired_F3_filt.fastq | Data to use for mapping |
| | PB_paired_F3_filt.F5.fastq | F5 reads only from PB_paired_F3_filt.fastq | |
| | PB_paired_F3_and_F5_filt.F3.fastq | F3 reads only from PB_paired_F3_and_F5_filt.fastq | Data to use for mapping |
| | PB_paired_F3_and_F5_filt.F5.fastq | F5 reads only from PB_paired_F3_and_F5_filt.fastq | |

For each sample the following output files will be produced by solid_qc.sh.
4.3.2. “Fragment” mode (default)
If the input SOLiD data file pair is PB.csfasta and PB.qual, then the following files are produced:
- PB.fastq: all reads
- PB_T_F3.csfasta and PB_T_F3_QV.qual: primary data after quality filtering
- PB_T_F3.fastq: reads after quality filtering
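The naming pattern in the list above can be expressed as a small helper that derives the expected fragment-mode output names from the input CSFASTA name. This is only a sketch based on the PB example; fragment_mode_outputs is a hypothetical function, not part of solid_qc.sh.

```python
def fragment_mode_outputs(csfasta_name):
    """Derive the fragment-mode output names listed above from X.csfasta."""
    suffix = ".csfasta"
    if not csfasta_name.endswith(suffix):
        raise ValueError(f"expected a {suffix} file, got {csfasta_name!r}")
    base = csfasta_name[: -len(suffix)]
    return {
        "all_reads_fastq": f"{base}.fastq",          # all reads
        "filtered_csfasta": f"{base}_T_F3.csfasta",  # after quality filtering
        "filtered_qual": f"{base}_T_F3_QV.qual",     # after quality filtering
        "filtered_fastq": f"{base}_T_F3.fastq",      # reads after quality filtering
    }
```

For example, fragment_mode_outputs("PB.csfasta")["filtered_qual"] gives "PB_T_F3_QV.qual".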
4.3.3. Paired-end mode
If the input SOLiD data file pairs are PB_F3.csfasta, PB_F3.qual and PB_F5.csfasta, PB_F5.qual, then the following files are produced:
4.3.4. Unfiltered data
Merging all the original unfiltered data into a single fastq gives:
- PB_paired.fastq: all unfiltered F3 and F5 data merged into a single fastq
- PB_paired.F3.fastq: unfiltered F3 data
- PB_paired.F5.fastq: unfiltered F5 data
4.3.5. Quality filtered data
Quality filtering on the primary data gives:
- PB_F3_T_F3.csfasta and PB_F3_T_F3_QV.qual: F3 data after quality filter
- PB_F5_T_F3.csfasta and PB_F5_T_F3_QV.qual: F5 data after quality filter
(Note that the files with F5 in the name only have F5 reads - ignore the F3 part of T_F3.)
“Lenient” filtering and merging the F3 filtered data with all F5 gives:
- PB_paired_F3_filt.fastq: filtered F3 reads with the matching F5 partner
- PB_paired_F3_filt.F3.fastq: just the F3 reads after filtering
- PB_paired_F3_filt.F5.fastq: just the matching F5 partners
(This is called “lenient” as only the quality of the F3 reads is considered.)
“Strict” filtering and merging gives:
- PB_paired_F3_and_F5_filt.fastq: filtered F3 reads and filtered F5 reads, with “unpartnered” reads removed
- PB_paired_F3_and_F5_filt.F3.fastq: just the F3 reads
- PB_paired_F3_and_F5_filt.F5.fastq: just the F5 reads
(This is called “strict” filtering as a pair of reads will be rejected on the quality of either of the F3 or F5 components.)
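The difference between “lenient” and “strict” filtering can be illustrated with a toy example that works on read identifiers only. The read names and function names here are hypothetical; the real pipeline operates on the CSFASTA/QUAL and FASTQ files rather than on in-memory sets.

```python
def lenient_pairs(read_ids, f3_passed):
    """Keep a pair when its F3 read passed the quality filter."""
    return [rid for rid in read_ids if rid in f3_passed]


def strict_pairs(read_ids, f3_passed, f5_passed):
    """Keep a pair only when both its F3 and F5 reads passed the filter."""
    return [rid for rid in read_ids if rid in f3_passed and rid in f5_passed]


# Hypothetical read identifiers and filter results
reads = ["r1", "r2", "r3", "r4"]
f3_ok = {"r1", "r2", "r3"}
f5_ok = {"r1", "r3", "r4"}

lenient = lenient_pairs(reads, f3_ok)       # r2 kept despite its F5 failing
strict = strict_pairs(reads, f3_ok, f5_ok)  # r2 and r4 both rejected
```

Strict filtering can never keep more pairs than lenient filtering, since it adds a second condition on top of the F3 check.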
4.3.6. Filtering statistics
The filtering statistics output file name depends on the mode that the pipeline was run using:
- SOLiD_preprocess_filter.stats: for fragment mode
- SOLiD_preprocess_filter_paired.stats: for paired end mode
In each case the file summarises the number of reads before and after filtering and merging, and indicates the percentage that have been filtered out (typical values are between 20% and 30%).
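The percentage reported is simply the fraction of reads removed by the filter. A minimal sketch of the arithmetic, using a hypothetical helper and made-up read counts (the layout of the stats files themselves is not reproduced here):

```python
def percent_filtered(reads_before, reads_after):
    """Percentage of reads removed by filtering."""
    if reads_before <= 0:
        raise ValueError("reads_before must be positive")
    return 100.0 * (reads_before - reads_after) / reads_before


# Made-up counts: 1000 reads in, 750 reads surviving the filter
pct = percent_filtered(1000, 750)
```

With these counts a quarter of the reads are removed, which falls inside the typical range quoted above.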
4.3.7. Contamination screens (fastq_screen.sh)
Contamination screen outputs are written to the qc directory:
- PB_model_organisms_screen.*: screen against a selection of commonly used genomes
- PB_other_organisms_screen.*: screen against a selection of less common genomes
- PB_rRNA_screen.*: screen against a selection of rRNAs
For each there are .txt and .png files.
4.3.8. Boxplots (qc_boxplotter.sh)
Boxplots are written to the qc subdirectory:
- PB.qual_seq-order_boxplot.*: plot using all reads (PDF, PNG and PS formats)
- PB_T_F3_QV.qual_seq-order_boxplot.*: plot using just the quality filtered reads
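As a quick manual check alongside qcreporter.py --verify, the screen and boxplot file names described above can be enumerated and compared against the contents of the qc subdirectory. This sketch covers the fragment-mode naming for a sample such as PB; the lower-case file extensions and the helper names are assumptions.

```python
from pathlib import Path

SCREENS = ("model_organisms", "other_organisms", "rRNA")
SCREEN_EXTS = (".txt", ".png")
BOXPLOT_EXTS = (".pdf", ".png", ".ps")  # assumed lower-case extensions


def expected_qc_files(base):
    """List the screen and boxplot file names expected for one sample."""
    names = [
        f"{base}_{screen}_screen{ext}"
        for screen in SCREENS
        for ext in SCREEN_EXTS
    ]
    # Boxplots for all reads and for the quality filtered reads
    for qual in (f"{base}.qual", f"{base}_T_F3_QV.qual"):
        names += [f"{qual}_seq-order_boxplot{ext}" for ext in BOXPLOT_EXTS]
    return names


def missing_qc_files(sample_dir, base):
    """Return the expected names that are absent from sample_dir/qc."""
    qc_dir = Path(sample_dir) / "qc"
    return [n for n in expected_qc_files(base) if not (qc_dir / n).is_file()]
```

An empty list from missing_qc_files(".", "PB") would suggest that the screens and boxplots for PB were all written.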