4. QC Protocols

4.1. Overview

There are two over-arching QC scripts:

  • illumina_qc.sh: runs the QC pipeline (fastq_screen and FASTQC) for a fastq (or fastq.gz) file generated by an Illumina instrument. This is described in more detail in QC for Illumina sequencing data.
  • solid_qc.sh: runs the QC pipeline (solid2fastq, fastq_screen, solid_preprocess_filter and qc_boxplotter) for a csfasta and qual file pair (fragment mode) or a pair of pairs (paired-end mode) generated by a SOLiD instrument. This is described in more detail in QC for SOLiD sequencing data.

Both scripts automatically run a series of checks and data preparation steps on the data prior to any real analysis taking place. Some of the pipeline components can also be run independently (e.g. qc_boxplotter and fastq_screen.sh); see the information in the QC Pipeline command reference.

Many of the QC scripts read their settings from the qc.setup file, which tells them where to find some of the underlying software and data files. See Create qc.setup for how to set up this file.

Note

Generally the QC scripts won’t overwrite outputs from a previous run; if you want to regenerate the outputs then you’ll need to remove the previous outputs first.

Each script only runs on the data for a single sample, so there are two additional helper scripts that are used to process multiple samples and check the results:

  • run_qc_pipeline.py: runs a specific script on multiple sets of files, and also performs various job management operations (such as submitting jobs to Grid Engine).
  • qcreporter.py: checks the outputs from the top-level SOLiD or Illumina QC scripts, and generates an HTML report.

A typical example of running illumina_qc.sh on all FASTQ files in a directory DIR might look like:

run_qc_pipeline.py --input=FORMAT illumina_qc.sh DIR

To verify that the QC has worked:

qcreporter.py --platform=illumina --format=FORMAT --verify DIR

and to generate the HTML QC report:

qcreporter.py --platform=illumina --format=FORMAT DIR

More specific examples are given in the following sections, and in the command reference for the utilities.
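The three steps above (run, verify, report) can be assembled from a small driver script. The following is a minimal sketch rather than part of the toolkit: the helper name build_illumina_qc_commands is hypothetical, and only the options shown in this section are used. It prints the commands instead of running them, because the verification step should only be run once the submitted Grid Engine jobs have finished.

```python
import shlex


def build_illumina_qc_commands(data_dir, fmt="fastq"):
    """Return the three command lines from the overview, in order.

    Hypothetical helper: it only uses options that appear in this section.
    """
    return [
        # 1. Run illumina_qc.sh on every matching file in data_dir
        ["run_qc_pipeline.py", f"--input={fmt}", "illumina_qc.sh", data_dir],
        # 2. Verify that the expected QC outputs exist
        ["qcreporter.py", "--platform=illumina", f"--format={fmt}",
         "--verify", data_dir],
        # 3. Generate the HTML QC report
        ["qcreporter.py", "--platform=illumina", f"--format={fmt}", data_dir],
    ]


if __name__ == "__main__":
    for cmd in build_illumina_qc_commands("DIR", "fastqgz"):
        print(shlex.join(cmd))
```

Each command is kept as a list of arguments so that it could later be passed to subprocess.run without any shell quoting concerns.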

4.2. QC for Illumina sequencing data

The full QC pipeline for Illumina data (e.g. from GA2x, MiSeq and HiSeq sequencer platforms) is encoded in the illumina_qc.sh script.

This can be run for a set of Illumina fastq or fastq.gz format files in a specific directory using the run_qc_pipeline.py utility. In its simplest form:

run_qc_pipeline.py --input=FORMAT illumina_qc.sh DIR

FORMAT can be either fastq or fastqgz. The command detects all matching files in the directory DIR and then uses qsub to submit Grid Engine jobs that run the QC script on each file.
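To illustrate what "detect all matching files" might mean in practice, here is a minimal sketch of selecting input files by FORMAT. The mapping from FORMAT values to file suffixes is an assumption made for illustration, and find_inputs is a hypothetical name; the real matching rules of run_qc_pipeline.py may differ.

```python
from pathlib import Path

# Assumed mapping from FORMAT value to file suffix (illustrative only,
# not taken from run_qc_pipeline.py itself)
SUFFIXES = {"fastq": ".fastq", "fastqgz": ".fastq.gz"}


def find_inputs(data_dir, fmt):
    """Return the sorted names of files in data_dir matching the format."""
    suffix = SUFFIXES[fmt]
    return sorted(
        p.name
        for p in Path(data_dir).iterdir()
        if p.is_file() and p.name.endswith(suffix)
    )
```

Note that a name ending in .fastq.gz does not end in .fastq, so the two formats select disjoint sets of files.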

For each sample, illumina_qc.sh generates fastq_screen plots for model organisms, other organisms and rRNAs, plus the report files from FASTQC.

If the input files are fastq.gz then it can also produce uncompressed versions of the files (specify the --gunzip option to turn on this behaviour).

4.3. QC for SOLiD sequencing data

The full QC pipeline for SOLiD data is encoded in the solid_qc.sh script. It runs in either “fragment mode”, which takes a CSFASTA/QUAL file pair as input, or in “paired-end mode”, which takes two CSFASTA/QUAL file pairs as input (the first should be the F3 pair and the second the corresponding F5 pair).

This can be run in either mode for a set of SOLiD data files in a specific directory using the run_qc_pipeline.py command. In its simplest form, for “fragment” (F3) data:

run_qc_pipeline.py --input=solid solid_qc.sh DIR

or for paired-end (F3/F5) data:

run_qc_pipeline.py --input=solid_paired_end solid_qc.sh DIR

In each case this will detect all matching file groups in the directory DIR and then use qsub to submit Grid Engine jobs to run the QC script on each group.
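The grouping of files into "file groups" can be sketched as follows. This is an illustration under assumptions, not the utility's actual logic: it assumes that a CSFASTA file and its QUAL file share a base name, and that F3 and F5 pairs are distinguished by _F3 and _F5 suffixes as in the PB example used later in this section. The function names are hypothetical.

```python
def fragment_groups(names):
    """Pair each NAME.csfasta with NAME.qual (fragment mode)."""
    present = set(names)
    groups = []
    for name in sorted(names):
        if name.endswith(".csfasta"):
            qual = name[: -len(".csfasta")] + ".qual"
            if qual in present:  # skip CSFASTA files with no QUAL partner
                groups.append((name, qual))
    return groups


def paired_end_groups(names):
    """Combine NAME_F3 and NAME_F5 file pairs (paired-end mode), F3 first."""
    pairs = dict(fragment_groups(names))
    groups = []
    for csfasta, qual in sorted(pairs.items()):
        if csfasta.endswith("_F3.csfasta"):
            f5 = csfasta[: -len("_F3.csfasta")] + "_F5.csfasta"
            if f5 in pairs:  # only keep F3 pairs with a matching F5 pair
                groups.append((csfasta, qual, f5, pairs[f5]))
    return groups
```

Each group is what a single QC job would receive: two files in fragment mode, or four files (F3 pair then F5 pair) in paired-end mode.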

The pipeline consists of:

  • solid2fastq: creates a FASTQ file from the input CSFASTA/QUAL file pair
  • fastq_screen: checks the reads against 3 different screens (model organisms, “other” organisms and rRNA) to look for contaminants
  • solid_preprocess_filter: runs the SOLiD_preprocess_filter_v2.pl program on the input CSFASTA/QUAL file pair to filter out “bad” reads, and reports the percentage filtered out (also produces a FASTQ and boxplot for the filtered data)
  • qc_boxplotter: generates quality-score boxplots from the input QUAL file

The main outputs are the FASTQ file and a subdirectory qc which holds the screen and boxplot files.

(See “QC for Illumina sequencing data” above for additional options available for run_qc_pipeline.py.)

To verify that the QC has worked, run the qcreporter.py command:

qcreporter.py --platform=solid --format=FORMAT --verify DIR

(where FORMAT is either solid or solid_paired_end), and to generate the HTML QC report:

qcreporter.py --platform=solid --format=FORMAT DIR

4.3.1. Outputs

SOLiD paired-end data

Say that the input files are PB_F3.csfasta, PB_F3.qual and PB_F5.csfasta, PB_F5.qual.

| Stage | Files | Description | Comments |
|---|---|---|---|
| Quality filtering | PB_F3_T_F3.csfasta, PB_F3_T_F3_QV.qual | F3 data after quality filter | |
| | PB_F5_T_F3.csfasta, PB_F5_T_F3_QV.qual | F5 data after quality filter | Only has F5 reads: ignore the F3 part of “T_F3” |
| Merge unfiltered | PB_paired.fastq | All unfiltered F3 and F5 data in one fastq file | Used for fastq_screen |
| Merge F3 filtered | PB_paired_F3_filt.fastq | Filtered F3 reads with the matching F5 partner | “Lenient” filtering: only the quality of the F3 reads is considered |
| Merge all filtered | PB_paired_F3_and_F5_filt.fastq | Filtered F3 reads and filtered F5 reads | “Strict” filtering: pairs of reads are rejected on the quality of either of the F3 or F5 components |
| Split FASTQs | PB_paired_F3_filt.F3.fastq | F3 reads only from PB_paired_F3_filt.fastq | Data to use for mapping |
| | PB_paired_F3_filt.F5.fastq | F5 reads only from PB_paired_F3_filt.fastq | |
| | PB_paired_F3_and_F5_filt.F3.fastq | F3 reads only from PB_paired_F3_and_F5_filt.fastq | Data to use for mapping |
| | PB_paired_F3_and_F5_filt.F5.fastq | F5 reads only from PB_paired_F3_and_F5_filt.fastq | |

For each sample the following output files will be produced by solid_qc.sh.

4.3.2. “Fragment” mode (default)

If the input SOLiD data file pair is PB.csfasta and PB.qual, then the following files are produced:

  • PB.fastq: all reads
  • PB_T_F3.csfasta and PB_T_F3_QV.qual: primary data after quality filtering
  • PB_T_F3.fastq: reads after quality filtering

4.3.3. Paired-end mode

If the input SOLiD data file pairs are PB_F3.csfasta, PB_F3.qual and PB_F5.csfasta, PB_F5.qual, then the files described in the following subsections are produced.

4.3.4. Unfiltered data

Merging all the original unfiltered data into a single fastq gives:

  • PB_paired.fastq: all unfiltered F3 and F5 data merged into a single fastq
  • PB_paired.F3.fastq: unfiltered F3 data
  • PB_paired.F5.fastq: unfiltered F5 data

4.3.5. Quality filtered data

Quality filtering on the primary data gives:

  • PB_F3_T_F3.csfasta and PB_F3_T_F3_QV.qual: F3 data after quality filter
  • PB_F5_T_F3.csfasta and PB_F5_T_F3_QV.qual: F5 data after quality filter

(Note that the files with F5 in the name only have F5 reads - ignore the F3 part of T_F3.)

“Lenient” filtering and merging the F3 filtered data with all F5 gives:

  • PB_paired_F3_filt.fastq: filtered F3 reads with the matching F5 partner
  • PB_paired_F3_filt.F3.fastq: just the F3 reads after filtering
  • PB_paired_F3_filt.F5.fastq: just the matching F5 partners

(This is called “lenient” as only the quality of the F3 reads is considered.)

“Strict” filtering and merging gives:

  • PB_paired_F3_and_F5_filt.fastq: filtered F3 reads and filtered F5 reads, with “unpartnered” reads removed
  • PB_paired_F3_and_F5_filt.F3.fastq: just the F3 reads
  • PB_paired_F3_and_F5_filt.F5.fastq: just the F5 reads

(This is called “strict” filtering as a pair of reads will be rejected on the quality of either of the F3 or F5 components.)
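The difference between the two filtering modes can be shown with a toy example. This sketch only models the selection rule described above; it does not reproduce how the pipeline itself stores or processes reads, and the read names and pass/fail flags are made up for illustration.

```python
def lenient_filter(pairs):
    """Keep pairs whose F3 read passed the quality filter."""
    return [name for name, f3_ok, f5_ok in pairs if f3_ok]


def strict_filter(pairs):
    """Keep pairs only when both the F3 and F5 reads passed."""
    return [name for name, f3_ok, f5_ok in pairs if f3_ok and f5_ok]


# Illustrative data: (read pair name, F3 passed?, F5 passed?)
pairs = [
    ("read1", True, True),
    ("read2", True, False),
    ("read3", False, True),
    ("read4", False, False),
]

print(lenient_filter(pairs))  # ['read1', 'read2']
print(strict_filter(pairs))   # ['read1']
```

Every pair kept by strict filtering is also kept by lenient filtering, so the strict output can never contain more pairs than the lenient output.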

4.3.6. Filtering statistics

The filtering statistics output file name depends on the mode that the pipeline was run using:

  • SOLiD_preprocess_filter.stats: for fragment mode
  • SOLiD_preprocess_filter_paired.stats: for paired-end mode

In each case the file summarises the number of reads before and after filtering and merging, and indicates the percentage that have been filtered out (typical values are between 20% and 30%).
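The percentage filtered out is the fraction of reads removed relative to the number of reads before filtering. As a worked sketch (the layout of the .stats file is not described here, so this works from plain read counts, and the counts in the example are invented for illustration):

```python
def percent_filtered(reads_before, reads_after):
    """Return the percentage of reads removed by filtering."""
    if reads_before <= 0:
        raise ValueError("reads_before must be positive")
    if not 0 <= reads_after <= reads_before:
        raise ValueError("reads_after must be between 0 and reads_before")
    return 100.0 * (reads_before - reads_after) / reads_before


# Illustrative counts only: 1,000,000 reads in, 750,000 reads kept
print(percent_filtered(1_000_000, 750_000))  # 25.0
```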

4.3.7. Contamination screens (fastq_screen.sh)

Contamination screen outputs are written to the qc directory:

  • PB_model_organisms_screen.*: screen against a selection of commonly used genomes
  • PB_other_organisms_screen.*: screen against a selection of less common genomes
  • PB_rRNA_screen.*: screen against a selection of rRNAs

Each screen produces both a .txt and a .png file.

4.3.8. Boxplots (qc_boxplotter.sh)

Boxplots are written to the qc subdirectory:

  • PB.qual_seq-order_boxplot.*: plot using all reads (PDF, PNG and PS formats)
  • PB_T_F3_QV.qual_seq-order_boxplot.*: plot using just the quality filtered reads
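As a rough illustration of how the qc subdirectory could be checked by hand for a fragment-mode sample, the sketch below lists the screen and boxplot files named above and reports any that are missing. It is not a substitute for qcreporter.py --verify. The function names are hypothetical, and it assumes lowercase file extensions and that the filtered boxplot is written in the same three formats as the unfiltered one.

```python
from pathlib import Path

SCREENS = ("model_organisms", "other_organisms", "rRNA")
SCREEN_EXTS = ("txt", "png")
BOXPLOT_EXTS = ("pdf", "png", "ps")  # assumed lowercase extensions


def expected_qc_files(base):
    """Return the expected qc file names for a fragment-mode sample."""
    files = [
        f"{base}_{screen}_screen.{ext}"
        for screen in SCREENS
        for ext in SCREEN_EXTS
    ]
    # Boxplots for all reads and for the quality filtered reads
    for qual in (f"{base}.qual", f"{base}_T_F3_QV.qual"):
        files.extend(f"{qual}_seq-order_boxplot.{ext}" for ext in BOXPLOT_EXTS)
    return files


def missing_qc_files(qc_dir, base):
    """Return the expected files that are not present in qc_dir."""
    qc_dir = Path(qc_dir)
    return [f for f in expected_qc_files(base) if not (qc_dir / f).exists()]
```

For example, missing_qc_files("qc", "PB") returns an empty list when all of the expected files are present.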