2. Preparing Illumina Data for analysis

2.1. Background

This section outlines the general structure of the data from Illumina-based sequencers (GA2x, HiSEQ and MiSEQ) and the procedures for converting these data into FASTQ format.

2.1.1. Primary sequencing data

The software on the various sequencers performs image analysis and base calling, producing primary data files either in .bcl (binary base call) format or, for newer instruments, in the compressed .bcl.gz version of that format.

Additional software is required to convert these data files to FASTQ format and, in the case of multiplexed runs, to demultiplex the data.

The directories produced by the runs have the format:

<date_stamp>_<instrument_name>_<run_id>_FC

(For example 120518_ILLUMINA-13AD3FA_00002_FC)

The components are:

  • <date_stamp>: a 6-digit date stamp in year-month-day format e.g. 120518 is 18th May 2012
  • <instrument_name>: name of the Illumina instrument e.g. ILLUMINA-13AD3FA
  • <run_id>: id number corresponding to the run e.g. 00002
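As an illustration of the naming scheme above, the following is a minimal Python sketch that splits a run directory name into its components. The function name parse_run_dir_name is hypothetical (it is not part of this package), the sketch assumes that two-digit years belong to the 2000s, and it only handles names that follow the <date_stamp>_<instrument_name>_<run_id>_FC form exactly.

```python
import re
from datetime import date

# Pattern for <date_stamp>_<instrument_name>_<run_id>_FC as described above.
RUN_DIR_PATTERN = re.compile(r"^(\d{6})_([^_]+)_(\d+)_FC$")


def parse_run_dir_name(name):
    """Split a run directory name into (run date, instrument name, run id)."""
    match = RUN_DIR_PATTERN.match(name)
    if match is None:
        raise ValueError("unrecognised run directory name: %s" % name)
    stamp, instrument, run_id = match.groups()
    # Assumption: two-digit years refer to 20xx
    run_date = date(2000 + int(stamp[0:2]), int(stamp[2:4]), int(stamp[4:6]))
    return run_date, instrument, run_id


print(parse_run_dir_name("120518_ILLUMINA-13AD3FA_00002_FC"))
# (datetime.date(2012, 5, 18), 'ILLUMINA-13AD3FA', '00002')
```

The run id is kept as a string so that leading zeros are preserved.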

A partial directory structure is shown below:

<YYMMDD>_<machinename>_<XXXXX>_FC/
         |
         +-- Data/
         |     |
         |     +------ Intensities/
         |                  |
         |                  +-- .pos files
         |                  |
         |                  +-- config.xml
         +-- RunInfo.xml    |
                            +-- L001(2,3...)/  (lanes)
                            |
                            +-- BaseCalls/
                                   |
                                   +-- config.xml
                                   |
                                   +-- SampleSheet.csv
                                   |
                                   +--L001(2,3...)/  (lanes)
                                         |
                                         +-- C1.1/   (lane and cycle)
                                               |
                                               +-- .bcl(.gz) files
                                               |
                                               +-- .stats files

Key points:

  • The .bcl or .bcl.gz files are located under the Data/Intensities/BaseCalls/ directory
  • The config.xml file under the BaseCalls directory is implicitly needed for demultiplexing and fastq conversion
  • The SampleSheet.csv file is only needed if demultiplexing is to be performed.

2.1.2. Fastq generation and demultiplexing

Multiplexed sequencing allows multiple samples to be run per lane. The samples are identified by index sequences (barcodes) that are attached to the template during sample preparation.

Originally the bcl-to-fastq programs in Illumina’s CASAVA software package could be used to perform both these steps, but they were unable to handle the compressed bcl files produced by newer instruments. Illumina now provide a bclToFastq software package which only includes the components of CASAVA required for FASTQ conversion and which can also deal with compressed bcl files.

Note

Both CASAVA and bclToFastq provide the same programs for the conversion, and use the same protocol and input files. Within this documentation bcl-to-fastq is therefore used to refer interchangeably to the programs from either package.

The configureBclToFastq.pl script from bcl-to-fastq can be used to set up the bcl to fastq conversion, e.g.:

configureBclToFastq.pl \
         --input-dir <path_to_BaseCalls_dir> \
         --output-dir <path_to_output_dir> \
         [ --sample-sheet <path_to>/SampleSheet.csv ]

This will create the named output directory containing a Makefile which performs the actual conversion; to run, ‘cd’ to the output directory and then run make.

If the --output-dir option is omitted then it defaults to <run_dir>/Unaligned/. The sample sheet is only required for demultiplexing.
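When the conversion is driven from a script, the command line above can be assembled programmatically. The following is a minimal Python sketch; the function name configure_command is hypothetical and only the three options shown above are handled.

```python
import shlex


def configure_command(basecalls_dir, output_dir=None, sample_sheet=None):
    """Build the configureBclToFastq.pl argument list from the options above."""
    cmd = ["configureBclToFastq.pl", "--input-dir", basecalls_dir]
    if output_dir is not None:
        # If omitted, the script defaults to <run_dir>/Unaligned/
        cmd += ["--output-dir", output_dir]
    if sample_sheet is not None:
        # Only required when demultiplexing
        cmd += ["--sample-sheet", sample_sheet]
    return cmd


cmd = configure_command(
    "/mnt/data/120919_SN7001250_0035_BC133VACXX/Data/Intensities/BaseCalls",
    output_dir="Unaligned",
    sample_sheet="custom_samplesheet.csv",
)
# Print a shell-safe version of the command for inspection
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True), followed by subprocess.run(["make"], cwd=output_dir, check=True) to perform the conversion itself.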

Other useful options:

  • --fastq-cluster-count <n>: sets the maximum “cluster size” for the output fastq; this can result in multiple fastq output files. Use -1 to force all reads to be put into a single fastq.
  • --mismatches <n>: number of mismatches allowed for each read; the default is zero (recommended for samples without multiplexing), 1 mismatch is recommended for multiplexed samples with tags of length 6 bases.
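To illustrate what allowing mismatches means for demultiplexing, the sketch below assigns an index read to a sample when exactly one barcode lies within the permitted number of mismatches. This is an illustration of the general idea only, not the algorithm used by CASAVA; the function names and the sample names S1 and S2 are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def assign_sample(index_read, barcodes, mismatches=0):
    """Return the sample whose barcode matches index_read, or None.

    barcodes maps sample name -> barcode sequence. The read is left
    unassigned if no barcode, or more than one barcode, is within the
    allowed number of mismatches.
    """
    hits = [sample for sample, barcode in barcodes.items()
            if hamming(index_read, barcode) <= mismatches]
    return hits[0] if len(hits) == 1 else None


barcodes = {"S1": "ATCACG", "S2": "CGATGT"}
print(assign_sample("ATCACG", barcodes))                # S1
print(assign_sample("ATCACC", barcodes))                # None
print(assign_sample("ATCACC", barcodes, mismatches=1))  # S1
```

With zero mismatches a single sequencing error in the index is enough to leave a read unassigned, which is why a small tolerance is suggested for multiplexed samples.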

According to the CASAVA 1.8.2 documentation: “FASTQ files contain only reads that passed filtering. If you want all reads in a FASTQ file, use the --with-failed-reads option.”

Note

Comprehensive notes on CASAVA options to use for bcl-to-fastq conversion for different demultiplexing scenarios can be found via https://gist.github.com/3125885

2.1.3. Sample sheets

Warning

Sample sheet files are generated by the software on the instrument. For older instruments these could be fed directly into the bcl-to-fastq conversion software; for newer instruments they are in “Experiment Manager” format, which needs to be converted to the older format. Use the prep_sample_sheet.py utility to do this.

The sample sheets accepted by the bcl-to-fastq software are comma-separated files with the following fields on each line:

Field          Description
FCID           Flow cell ID
Lane           Positive integer, indicating the lane number (1-8)
SampleID       ID of the sample
SampleRef      The reference used for alignment for the sample
Index          Index sequences. Multiple index reads are separated by a hyphen (for example, ACCAGTAA-GGACATGA).
Description    Description of the sample
Control        Y indicates this lane is a control lane, N means sample
Recipe         Recipe used during sequencing
Operator       Name or ID of the operator
SampleProject  The project the sample belongs to

The SampleID field forms the base of the output fastq name (see below); the SampleProject field indicates which project directory the fastq file will be placed into.

It is advised to set both these fields to something descriptive, e.g. SampleProject = “Control” and SampleID = “PhiX”.

To remove a lane from the analysis remove references to it from the sample sheet file.
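Removing a lane can also be scripted. The following minimal Python sketch filters out all lines for the given lanes; it assumes the sample sheet starts with a header line naming the fields listed above, and the function name drop_lanes and the flow cell ID FC0001 are hypothetical.

```python
import csv
import io


def drop_lanes(sample_sheet_text, lanes):
    """Return sample sheet text with all lines for the given lanes removed."""
    lanes = {str(lane) for lane in lanes}
    reader = csv.DictReader(io.StringIO(sample_sheet_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # The Lane field is compared as text, e.g. "1"
        if row["Lane"] not in lanes:
            writer.writerow(row)
    return out.getvalue()


header = ("FCID,Lane,SampleID,SampleRef,Index,Description,"
          "Control,Recipe,Operator,SampleProject")
sheet = "\n".join([
    header,
    "FC0001,1,PhiX,,,,Y,,,Control",   # hypothetical control lane
    "FC0001,2,AB1,,ATCACG,,N,,,AB",   # hypothetical sample lane
]) + "\n"

print(drop_lanes(sheet, [1]))
```

The function works on text rather than file names so that it can be applied to a sample sheet read from any source.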

The bcl-to-fastq software will automatically use the samplesheet files in the instrument output directories unless overridden by a user-supplied samplesheet file.

The samplesheet can be edited using Excel or a similar spreadsheet program, and manipulated using the prep_sample_sheet.py utility. The modified samplesheet file name can be supplied as an additional argument to the bclToFastq.sh script.

2.1.4. Output directory structure

Example output directory structure is:

Unaligned/
   |
   +-- Project_A/
   |         |
   |         +- Sample_A/
   |         |     |
   |         |   fastq.gz file(s)
   |         |
   |         +- Sample_B/
   |               |
   |             fastq.gz file(s)
   |
   +-- Project_B/
             |
             +- Sample_C/
                   |
                 fastq.gz file(s)

In the absence of a sample sheet, one sample is assumed per lane and all samples belong to the same project.

2.1.5. Output fastq files

The general naming scheme for fastq output files is:

<sample_name>_<barcode_sequence>_L<lane>_R<read_number>_<set_number>.fastq.gz

e.g. NA10931_ATCACG_L002_R1_001.fastq.gz

For non-multiplexed runs, the sample name is the lane (e.g. lane1 etc) and the barcode sequence is NoIndex

e.g. lane1_NoIndex_L001_R1_001.fastq.gz

The read number is either 1 or 2 (2’s only appear for paired-end sequencing).
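The naming scheme can be unpicked with a regular expression. The following Python sketch is one possible way to do this; the function name parse_fastq_name is hypothetical, and the pattern assumes that the barcode is either NoIndex or a sequence made up of the characters A, C, G, T, N and hyphen.

```python
import re

# <sample_name>_<barcode_sequence>_L<lane>_R<read_number>_<set_number>.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_(?P<barcode>NoIndex|[ACGTN-]+)"
    r"_L(?P<lane>\d+)_R(?P<read>[12])_(?P<set>\d+)\.fastq\.gz$"
)


def parse_fastq_name(name):
    """Return the components of a FASTQ file name as a dictionary."""
    match = FASTQ_NAME.match(name)
    if match is None:
        raise ValueError("unrecognised FASTQ name: %s" % name)
    fields = match.groupdict()
    # Lane, read number and set number are converted to integers
    for key in ("lane", "read", "set"):
        fields[key] = int(fields[key])
    return fields


print(parse_fastq_name("NA10931_ATCACG_L002_R1_001.fastq.gz"))
print(parse_fastq_name("lane1_NoIndex_L001_R1_001.fastq.gz"))
```

Because the pattern is anchored on the trailing fields, sample names that themselves contain underscores are still parsed correctly.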

The quality scores in the output fastq files are Phred+33 (see http://en.wikipedia.org/wiki/FASTQ_format#Quality under the “Encoding” section).
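In the Phred+33 encoding each quality character is the ASCII character whose code is the quality score plus 33. A minimal Python sketch of the decoding follows (the function names are hypothetical); it also uses the standard Phred relation between a score Q and the error probability, 10^(-Q/10).

```python
def phred33_to_scores(quality_string):
    """Convert a Phred+33 encoded quality string to a list of integer scores."""
    return [ord(char) - 33 for char in quality_string]


def error_probability(score):
    """Probability that a base call with the given Phred score is wrong."""
    return 10 ** (-score / 10)


print(phred33_to_scores("!5I"))  # [0, 20, 40]
print(error_probability(20))
```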

2.1.6. Undetermined reads

When demultiplexing it is likely that the software will be unable to assign some of the reads to a specific sample. In this case the read is assigned to “undetermined” instead, and there will be an additional Undetermined_indexes “project” produced under the Unaligned directory.

2.2. FASTQ generation and analysis directory setup

2.2.1. Overview

This section outlines the protocol for generating FASTQ files from the raw bcl data and setting up per-project analysis directories using the scripts and utilities included in this package.

The basic procedure is:

  1. Create top-level analysis directory
  2. Generate FASTQ files
  3. Populate analysis subdirectories for each project

Subsequently the QC pipeline should be run for each project.

2.2.2. Create top-level analysis directory

Create a top-level analysis directory where the FASTQs and per-project analysis directories will be created, for example:

mkdir /scratch/120919_SN7001250_0035_BC133VACXX_analysis

Note

Conventionally we name analysis directories by appending _analysis to the primary data directory name.

2.2.3. FASTQ generation

Within the top-level directory create a customised copy of the original SampleSheet.csv from the primary data directory. This is best done using the prep_sample_sheet.py utility, as it will automatically convert the original file to the correct format.

prep_sample_sheet.py can automatically address specific issues, for example:

--fix-spaces

replaces spaces in the SampleID and SampleProject fields with underscore characters

--fix-duplicates

appends indices to SampleIDs to make SampleID/SampleProject combinations unique

These two options together should automatically fix most problems with sample sheets, e.g.:

prep_sample_sheet.py \
    --fix-spaces --fix-duplicates \
    -o custom_samplesheet.csv \
    /mnt/data/120919_SN7001250_0035_BC133VACXX/SampleSheet.csv

It also has options to edit the sample sheet fields: for example, the --set-id=... and --set-project=... options allow the SampleID and SampleProject fields to be reset.
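The effect of the --fix-spaces and --fix-duplicates options described above can be illustrated with the following Python sketch. It is not the implementation used by prep_sample_sheet.py: the function names are hypothetical, and the _1, _2, ... suffix format is an assumption made for the purposes of the example.

```python
from collections import Counter


def fix_spaces(rows):
    """Replace spaces with underscores in SampleID and SampleProject."""
    for row in rows:
        for field in ("SampleID", "SampleProject"):
            row[field] = row[field].replace(" ", "_")


def fix_duplicates(rows):
    """Append an index to SampleIDs sharing a SampleID/SampleProject pair."""
    totals = Counter((r["SampleID"], r["SampleProject"]) for r in rows)
    seen = Counter()
    for row in rows:
        key = (row["SampleID"], row["SampleProject"])
        if totals[key] > 1:
            seen[key] += 1
            # Assumed suffix format: _1, _2, ...
            row["SampleID"] = "%s_%d" % (row["SampleID"], seen[key])


rows = [
    {"SampleID": "AB 1", "SampleProject": "My Project"},
    {"SampleID": "AB 1", "SampleProject": "My Project"},
    {"SampleID": "PhiX", "SampleProject": "Control"},
]
fix_spaces(rows)
fix_duplicates(rows)
print([(r["SampleID"], r["SampleProject"]) for r in rows])
# [('AB_1_1', 'My_Project'), ('AB_1_2', 'My_Project'), ('PhiX', 'Control')]
```

Spaces are fixed first so that names differing only in spaces versus underscores are treated as duplicates.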

Note

prep_sample_sheet.py will only write a new sample sheet file if it thinks that the problems have been addressed; to override this use the --ignore-warnings option.

To generate FASTQs, run the bclToFastq.sh script in the top-level analysis directory, e.g.:

qsub -b y -cwd -V bclToFastq.sh \
    /mnt/data/120919_SN7001250_0035_BC133VACXX \
    Unaligned custom_samplesheet.csv

This automatically runs the configureBclToFastq.pl and make steps (above) together and creates a new subdirectory called Unaligned containing the FASTQs.

The general syntax for this step is:

bclToFastq.sh /path/to/ILLUMINA_RUN_DIR output_dir [ samplesheet.csv ]

Note

If bcl-to-fastq fails to generate the FASTQ files due to some problem with the input data then the Troubleshooting bcl to FASTQ conversion section below may help.

2.2.4. Populate analysis subdirectories

Use the build_illumina_analysis_dir.py utility to create subdirectories for each project named in the input sample sheet file, and populate these with links to the FASTQ files generated in the previous step.

Use the --list option to see what projects and samples the program will use, e.g.:

build_illumina_analysis_dir.py --list \
   /scratch/120919_SN7001250_0035_BC133VACXX_analysis

which produces output of the form:

Project: AB (4 samples)
        AB1
                AB1_NoIndex_L002_R1_001.fastq.gz
        AB2
                AB2_NoIndex_L003_R1_001.fastq.gz
        AB3
                AB3_NoIndex_L004_R1_001.fastq.gz
        AB4
                AB4_NoIndex_L005_R1_001.fastq.gz
Project: Control (4 samples)
        PhiX1
                PhiX1_NoIndex_L001_R1_001.fastq.gz
        PhiX2
                PhiX2_NoIndex_L006_R1_001.fastq.gz
        PhiX3
                PhiX3_NoIndex_L007_R1_001.fastq.gz
        PhiX4
                PhiX4_NoIndex_L008_R1_001.fastq.gz

Use the --expt=EXPT_TYPE option to specify a library type for one or more projects, e.g.:

build_illumina_analysis_dir.py \
   --expt=AB:ChIP-seq \
   /mnt/analyses/120919_ILLUMINA-73D9FA_00008_FC_analysis

This creates new subdirectories for each project which contain symbolic links to the FASTQ files:

<YYMMDD>_<machinename>_<XXXXX>_FC_analysis/
        |
        +-- Unaligned/
        |     |
        |    ...
        |
        +-- <PI>_<library>/
        |     |
        |     +-- *.fastq.gz -> ../Unaligned/.../*.fastq.gz
        |
        |
        +-- <PI>_<library>/
        |     |
        |     +-- *.fastq.gz -> ../Unaligned/.../*.fastq.gz
        |
       ...

Unaligned is the output from the bclToFastq.sh run (see the previous section), and will contain the fastq files. The fastq.gz files in these directories are symbolic links to the files in the Unaligned directory.
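The linking step can be pictured with the following Python sketch, which creates relative symbolic links of the kind shown in the diagram above. It is an illustration only and not the code used by build_illumina_analysis_dir.py; the function name link_fastqs is hypothetical, and the sketch assumes that project and sample directories under Unaligned carry the prefixes Project_ and Sample_.

```python
import os
from pathlib import Path


def link_fastqs(analysis_dir, project, link_dir_name):
    """Link every FASTQ of one project into <analysis_dir>/<link_dir_name>."""
    analysis_dir = Path(analysis_dir)
    # Assumed layout: Unaligned/Project_<name>/Sample_<name>/*.fastq.gz
    source_dir = analysis_dir / "Unaligned" / ("Project_" + project)
    target_dir = analysis_dir / link_dir_name
    target_dir.mkdir(exist_ok=True)
    links = []
    for fastq in sorted(source_dir.glob("Sample_*/*.fastq.gz")):
        link = target_dir / fastq.name
        # Relative targets keep the links valid if the whole analysis
        # directory is moved
        rel_target = os.path.relpath(fastq, start=target_dir)
        if not os.path.lexists(link):
            link.symlink_to(rel_target)
        links.append(link)
    return links
```

For example, link_fastqs("/scratch/120919_SN7001250_0035_BC133VACXX_analysis", "AB", "AB_ChIP-seq") would populate a subdirectory called AB_ChIP-seq (a hypothetical name) with links of the form ../Unaligned/Project_AB/Sample_AB1/<name>.fastq.gz.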

By default the FASTQ names are simplified versions of the original FASTQ names; use the --keep-names option to preserve the full names of the FASTQ files.

2.2.5. Merging replicates

Multiplexed runs can produce large numbers of replicates of each sample, with each replicate producing a single FASTQ file - so if there are 20 samples each with 8 replicates then this will produce 160 FASTQ files.

In this situation it can be more helpful to concatenate the replicates into single FASTQ files; this can be done automatically when creating the analysis subdirectories by using the --merge-replicates option.

--merge-replicates doesn’t require any additional input; it produces concatenated FASTQ files (rather than symbolic links) when creating the analysis subdirectory for each project, e.g.:

build_illumina_analysis_dir.py \
    --expt=AB:RNA-seq \
    --merge-replicates \
    /mnt/analyses/120919_SN7001250_0035_BC133VACXX_analysis
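Concatenating gzipped FASTQ files does not require decompressing them, because a series of gzip streams written one after another is itself a valid gzip file. The following Python sketch shows the idea; it is not the code behind --merge-replicates, and the function names are hypothetical.

```python
import gzip
import shutil


def merge_fastq_gz(inputs, output):
    """Concatenate gzipped FASTQ files byte-for-byte into one gzipped file."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


def count_reads(fastq_gz):
    """Count reads in a gzipped FASTQ file, assuming four lines per read."""
    with gzip.open(fastq_gz, "rt") as handle:
        return sum(1 for _ in handle) // 4
```

Comparing count_reads() for the merged file against the sum over the inputs gives a quick sanity check. For paired-end data the R1 and R2 files must be merged in the same replicate order so that the reads stay in step.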

Note

Use the verify_paired.py utility to check that the order of reads in the merged files is correct.
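The kind of check involved can be sketched in Python as follows. This is not the implementation of verify_paired.py: the function names are hypothetical, and the sketch assumes well-formed four-line FASTQ records in which the read ID is the first whitespace-delimited token of the header line, optionally followed by an older-style /1 or /2 suffix.

```python
from itertools import zip_longest


def read_ids(fastq_lines):
    """Yield the read ID from each four-line FASTQ record."""
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0:
            read_id = line.split()[0]
            # Drop an older-style /1 or /2 suffix so that mates compare equal
            if read_id.endswith(("/1", "/2")):
                read_id = read_id[:-2]
            yield read_id


def pairs_in_order(r1_lines, r2_lines):
    """Return True if both inputs hold the same read IDs in the same order."""
    missing = object()
    pairs = zip_longest(read_ids(r1_lines), read_ids(r2_lines),
                        fillvalue=missing)
    # A length difference pairs an ID with the sentinel and so fails
    return all(id1 == id2 for id1, id2 in pairs)
```

Both arguments can be any iterables of lines, for example file handles returned by gzip.open(path, "rt").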

2.3. Troubleshooting bcl to FASTQ conversion

Failure with error “sample-dir not valid: number of directories must match the number of barcodes”

This might be due to the presence of spaces in the SampleID and SampleProject fields in the SampleSheet.csv file, which seems to confuse CASAVA.

The solution is to edit the sample sheet file to remove the spaces; this can be done automatically using the --fix-spaces option of the prep_sample_sheet.py program e.g.:

prep_sample_sheet.py --fix-spaces -o custom_SampleSheet.csv SampleSheet.csv

will create a copy of the original sample sheet file with any spaces replaced by underscores.

Failure with error “barcode XXXXXX for lane 1 has length Y: expected barcode lenth (including delimiters) is Z”

This can happen when attempting to demultiplex paired barcoded samples. The information that CASAVA needs should be read automatically from the RunInfo.xml file, but it appears that this doesn’t always happen (or perhaps the information is not consistent with the bcl files e.g. because the sequencing run didn’t complete properly).

To fix this use the --use-bases-mask option of configureBclToFastq.pl (or bclToFastq.sh) to tell CASAVA how to deal with each base. For example:

--use-bases-mask y101,I8,I8,y85

instructs the software to treat the first 101 bases as the first sequence, the next 8 as the first index (i.e. barcoded tag attached to the first sequence), the next 8 as the second index, and then the next 85 bases as the second sequence.
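The way a bases mask maps onto sequencing cycles can be made explicit with a short Python sketch. The function names are hypothetical, and only the simple form used in the example above (y or I followed by a number of bases) is handled; any other mask syntax is rejected.

```python
import re


def parse_bases_mask(mask):
    """Split a mask such as y101,I8,I8,y85 into (kind, length) pairs."""
    segments = []
    for part in mask.split(","):
        match = re.fullmatch(r"([yYiI])(\d+)", part)
        if match is None:
            raise ValueError("unsupported bases mask segment: %r" % part)
        segments.append((match.group(1), int(match.group(2))))
    return segments


def cycle_ranges(mask):
    """Return (kind, first cycle, last cycle) for each mask segment."""
    ranges = []
    start = 1
    for kind, length in parse_bases_mask(mask):
        ranges.append((kind, start, start + length - 1))
        start += length
    return ranges


print(cycle_ranges("y101,I8,I8,y85"))
# [('y', 1, 101), ('I', 102, 109), ('I', 110, 117), ('y', 118, 202)]
```

The last cycle of the final segment (202 here) gives the total number of cycles the mask accounts for, which can be compared against the number of cycles actually present in the run.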

Note

See also this BioStars question about dealing with the CASAVA error: “barcode CTTGTA for lane 1 has length X: expected barcode lenth is Y” http://www.biostars.org/post/show/49599/casava-error-barcode-cttgta-for-lane-1-has-length-6-expected-barcode-lenth-is-7/#55718