2. Preparing Illumina Data for analysis
2.1. Background
This section outlines the general structure of the data from Illumina-based sequencers (GA2x, HiSeq and MiSeq) and the procedures for converting these data into FASTQ format.
2.1.1. Primary sequencing data
The software on the various sequencers performs image analysis and base calling, producing primary data files in either .bcl (binary base call) format or, for newer instruments, a compressed version, .bcl.gz.
Additional software is required to convert these data files to FASTQ format and, in the case of multiplexed runs, to demultiplex the data.
The directories produced by the runs have the format:
<date_stamp>_<instrument_name>_<run_id>_FC
(for example, 120518_ILLUMINA-13AD3FA_00002_FC)
The components are:
- <date_stamp>: a 6-digit date stamp in year-month-day format, e.g. 120518 is 18th May 2012
- <instrument_name>: name of the Illumina instrument, e.g. ILLUMINA-13AD3FA
- <run_id>: id number corresponding to the run, e.g. 00002
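As an illustration of the naming scheme, the following minimal Python sketch splits a run directory name into its components. The names `RunName` and `parse_run_name` are hypothetical and are not part of this package; the sketch assumes the first three underscore-separated fields are always the date stamp, instrument name and run id.

```python
from datetime import date, datetime
from typing import NamedTuple


class RunName(NamedTuple):
    run_date: date
    instrument: str
    run_id: str


def parse_run_name(dirname: str) -> RunName:
    """Split a run directory name into its date, instrument and run id."""
    fields = dirname.rstrip("/").split("_")
    if len(fields) < 3:
        raise ValueError(f"unexpected run directory name: {dirname!r}")
    date_stamp, instrument, run_id = fields[:3]
    # The date stamp is six digits in year-month-day order (YYMMDD)
    run_date = datetime.strptime(date_stamp, "%y%m%d").date()
    return RunName(run_date, instrument, run_id)


run = parse_run_name("120518_ILLUMINA-13AD3FA_00002_FC")
print(run.run_date.isoformat(), run.instrument, run.run_id)
# prints: 2012-05-18 ILLUMINA-13AD3FA 00002
```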
A partial directory structure is shown below:
<YYMMDD>_<machinename>_<XXXXX>_FC/
  |
  +-- Data/
  |     |
  |     +-- Intensities/
  |           |
  |           +-- .pos files
  |           |
  |           +-- config.xml
  |           |
  |           +-- L001(2,3...)/  (lanes)
  |           |
  |           +-- BaseCalls/
  |                 |
  |                 +-- config.xml
  |                 |
  |                 +-- SampleSheet.csv
  |                 |
  |                 +-- L001(2,3...)/  (lanes)
  |                       |
  |                       +-- C1.1/  (lane and cycle)
  |                             |
  |                             +-- .bcl(.gz) files
  |                             |
  |                             +-- .stats files
  |
  +-- RunInfo.xml
Key points:
- The .bcl or .bcl.gz files are located under the Data/Intensities/BaseCalls/ directory
- The config.xml file under the BaseCalls directory is implicitly needed for demultiplexing and fastq conversion
- The SampleSheet file is only needed if demultiplexing needs to be performed
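The key points above can be turned into a quick pre-flight check. The sketch below is an illustration only: `check_run_dir` is a hypothetical helper, not part of this package or of the Illumina software, and it assumes the files sit at exactly the locations described in this section.

```python
from pathlib import Path


def check_run_dir(run_dir, demultiplex=False):
    """Return a list of problems found in an Illumina run directory."""
    basecalls = Path(run_dir) / "Data" / "Intensities" / "BaseCalls"
    if not basecalls.is_dir():
        return [f"missing directory: {basecalls}"]
    problems = []
    # config.xml is needed for both demultiplexing and FASTQ conversion
    if not (basecalls / "config.xml").is_file():
        problems.append("missing BaseCalls/config.xml")
    # The sample sheet only matters when demultiplexing
    if demultiplex and not (basecalls / "SampleSheet.csv").is_file():
        problems.append("missing BaseCalls/SampleSheet.csv")
    # Base call files live in per-lane, per-cycle subdirectories
    bcl_files = [
        p for p in basecalls.glob("L*/C*/*")
        if p.name.endswith((".bcl", ".bcl.gz"))
    ]
    if not bcl_files:
        problems.append("no .bcl or .bcl.gz files found")
    return problems
```

An empty list means that nothing obviously required is missing; it says nothing about whether the files themselves are complete.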
2.1.2. Fastq generation and demultiplexing
Multiplexed sequencing allows multiple samples to be run per lane. The samples are identified by index sequences (barcodes) that are attached to the template during sample preparation.
Originally the bcl-to-fastq programs in Illumina’s CASAVA software package could be used to perform both of these steps, but they were unable to handle the compressed bcl files produced by newer instruments. Illumina now provides a bclToFastq software package which only includes the components of CASAVA required for FASTQ conversion, and which can also deal with compressed bcl files.
Note
Both CASAVA and bclToFastq provide the same programs for the conversion, and use the same protocol and input files. Within this documentation the name bcl-to-fastq is therefore used to refer to these programs, whichever package provides them.
The configureBclToFastq.pl script from bcl-to-fastq can be used to set up the bcl to fastq conversion, e.g.:
configureBclToFastq.pl \
--input-dir <path_to_BaseCalls_dir> \
--output-dir <path_to_output_dir> \
[ --sample-sheet <path_to>/SampleSheet.csv ]
This will create the named output directory containing a Makefile which performs the actual conversion; to run it, cd to the output directory and then run make.
If the --output-dir option is omitted then it defaults to <run_dir>/Unaligned/. The sample sheet is only required for demultiplexing.
Other useful options:
- --fastq-cluster-count <n>: sets the maximum “cluster size” for the output fastq; this can result in multiple fastq output files. Use -1 to force all reads to be put into a single fastq.
- --mismatches <n>: number of mismatches allowed in the index sequence of each read; the default is zero (recommended for samples without multiplexing), while 1 mismatch is recommended for multiplexed samples with tags of length 6 bases.
According to the CASAVA 1.8.2 documentation: “FASTQ files contain only reads that passed filtering. If you want all reads in a FASTQ file, use the --with-failed-reads option.”
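When the conversion is driven from a script, it can help to assemble the configureBclToFastq.pl command line programmatically. The following is a minimal sketch that uses only the options mentioned in this section; `configure_command` is a hypothetical helper, not part of this package, and it only builds the argument list without running anything.

```python
import shlex


def configure_command(basecalls_dir, output_dir=None, sample_sheet=None,
                      fastq_cluster_count=None, mismatches=None,
                      with_failed_reads=False):
    """Build the argument list for configureBclToFastq.pl."""
    cmd = ["configureBclToFastq.pl", "--input-dir", str(basecalls_dir)]
    if output_dir is not None:
        cmd += ["--output-dir", str(output_dir)]
    # The sample sheet is only needed when demultiplexing
    if sample_sheet is not None:
        cmd += ["--sample-sheet", str(sample_sheet)]
    if fastq_cluster_count is not None:
        cmd += ["--fastq-cluster-count", str(fastq_cluster_count)]
    if mismatches is not None:
        cmd += ["--mismatches", str(mismatches)]
    if with_failed_reads:
        cmd.append("--with-failed-reads")
    return cmd


cmd = configure_command(
    "Data/Intensities/BaseCalls",
    output_dir="Unaligned",
    sample_sheet="SampleSheet.csv",
    fastq_cluster_count=-1,  # put all reads into a single fastq
    mismatches=1,
)
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)`, followed by running make in the output directory as described above.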
Note
Comprehensive notes on CASAVA options to use for bcl-to-fastq conversion for different demultiplexing scenarios can be found via https://gist.github.com/3125885
2.1.3. Sample sheets
Warning
Sample sheet files are generated by the software on the instrument. For older instruments these could be fed directly into the bcl-to-fastq conversion software; for newer instruments they are in “Experiment Manager” format, which needs to be converted to the older format. Use the prep_sample_sheet.py utility to do this.
The sample sheets accepted by the bcl-to-fastq software are comma-separated files with the following fields on each line:
Field | Description |
---|---|
FCID | Flow cell ID |
Lane | Positive integer, indicating the lane number (1-8) |
SampleID | ID of the sample |
SampleRef | The reference used for alignment for the sample |
Index | Index sequences. Multiple index reads are separated by a hyphen (for example, ACCAGTAA-GGACATGA). |
Description | Description of the sample |
Control | Y indicates this lane is a control lane, N means sample |
Recipe | Recipe used during sequencing |
Operator | Name or ID of the operator |
SampleProject | The project the sample belongs to |
The SampleID field forms the base of the output fastq name (see below); the SampleProject field indicates which project directory the fastq file will be placed into.
It is advised to set both these fields to something descriptive, e.g. SampleProject = “Control” and SampleID = “PhiX”.
To remove a lane from the analysis remove references to it from the sample sheet file.
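As a sketch of how a lane might be dropped without hand-editing, the snippet below filters a sample sheet using Python's csv module. It assumes the file starts with a header row naming the fields in the table above; `remove_lanes` is a hypothetical helper and the example values are made up for illustration.

```python
import csv
import io


def remove_lanes(sample_sheet_text, lanes):
    """Return sample sheet text with all lines for the given lanes dropped."""
    drop = {str(lane) for lane in lanes}
    reader = csv.DictReader(io.StringIO(sample_sheet_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Keep only the lines whose Lane value was not asked to be removed
        if row["Lane"].strip() not in drop:
            writer.writerow(row)
    return out.getvalue()


# Made-up two-lane sample sheet, for illustration only
example = (
    "FCID,Lane,SampleID,SampleRef,Index,Description,Control,Recipe,Operator,SampleProject\n"
    "FC1,1,PhiX,,,,Y,,,Control\n"
    "FC1,2,AB1,,ATCACG,,N,,,AB\n"
)
print(remove_lanes(example, [1]))
```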
The bcl-to-fastq software will automatically use the sample sheet files in the instrument output directories unless overridden by a user-supplied sample sheet file.
The sample sheet can be edited using Excel or a similar spreadsheet program, and manipulated using the prep_sample_sheet.py utility. The modified sample sheet file name can be supplied as an additional argument to the bclToFastq.sh script.
2.1.4. Output directory structure
An example output directory structure is:
Unaligned/
  |
  +-- Project_A/
  |     |
  |     +- Sample_A/
  |     |     |
  |     |     fastq.gz file(s)
  |     |
  |     +- Sample_B/
  |           |
  |           fastq.gz file(s)
  |
  +-- Project_B/
        |
        +- Sample_C/
              |
              fastq.gz file(s)
In the absence of a sample sheet, one sample is assumed per lane and all samples belong to the same project.
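The layout above can be walked with a few lines of Python. This is a minimal sketch assuming the Project_ and Sample_ directory prefixes shown in the example; `list_fastqs` is a hypothetical helper, and any directory that does not follow that pattern is simply skipped.

```python
from pathlib import Path


def list_fastqs(unaligned_dir):
    """Map project name -> sample name -> sorted list of fastq.gz file names."""
    projects = {}
    pattern = "Project_*/Sample_*/*.fastq.gz"
    for fastq in sorted(Path(unaligned_dir).glob(pattern)):
        sample_dir = fastq.parent
        # Strip the directory prefixes to recover the bare names
        project = sample_dir.parent.name[len("Project_"):]
        sample = sample_dir.name[len("Sample_"):]
        projects.setdefault(project, {}).setdefault(sample, []).append(fastq.name)
    return projects
```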
2.1.5. Output fastq files
The general naming scheme for fastq output files is:
<sample_name>_<barcode_sequence>_L<lane>_R<read_number>_<set_number>.fastq.gz
e.g. NA10931_ATCACG_L002_R1_001.fastq.gz
For non-multiplexed runs, the sample name is the lane (e.g. lane1) and the barcode sequence is NoIndex, e.g. lane1_NoIndex_L001_R1_001.fastq.gz
The read number is either 1 or 2 (2 only appears for paired-end sequencing).
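The naming scheme can be unpacked with a regular expression. The sketch below is illustrative only: `parse_fastq_name` and `FastqName` are hypothetical names, and the pattern assumes that barcodes consist of the letters A, C, G, T and N (with a hyphen between dual indexes) or the literal NoIndex.

```python
import re
from typing import NamedTuple


class FastqName(NamedTuple):
    sample_name: str
    barcode: str
    lane: int
    read_number: int
    set_number: int


FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_(?P<barcode>[ACGTN]+(?:-[ACGTN]+)*|NoIndex)"
    r"_L(?P<lane>\d+)_R(?P<read>[12])_(?P<set>\d+)\.fastq\.gz$"
)


def parse_fastq_name(filename):
    """Split a FASTQ file name into the fields of the naming scheme."""
    m = FASTQ_NAME.match(filename)
    if m is None:
        raise ValueError(f"not a recognised FASTQ name: {filename!r}")
    return FastqName(m["sample"], m["barcode"], int(m["lane"]),
                     int(m["read"]), int(m["set"]))


print(parse_fastq_name("NA10931_ATCACG_L002_R1_001.fastq.gz"))
print(parse_fastq_name("lane1_NoIndex_L001_R1_001.fastq.gz"))
```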
The quality scores in the output fastq files are Phred+33 (see http://en.wikipedia.org/wiki/FASTQ_format#Quality under the “Encoding” section).
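As a reminder of what Phred+33 means in practice: each quality character encodes a score equal to its ASCII code minus 33, and a score Q corresponds to an error probability of 10^(-Q/10). A minimal sketch follows (the function names are illustrative):

```python
def phred33_scores(quality_line):
    """Decode a Phred+33 quality string into a list of integer scores."""
    return [ord(ch) - 33 for ch in quality_line]


def error_probability(score):
    """Probability that a base call with the given Phred score is wrong."""
    return 10 ** (-score / 10)


print(phred33_scores("!5I"))  # prints: [0, 20, 40]
```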
2.1.6. Undetermined reads
When demultiplexing it is likely that the software will be unable to assign some of the reads to a specific sample. In this case the read is assigned to “undetermined” instead, and there will be an additional Undetermined_indices “project” produced under the Unaligned directory.
2.2. FASTQ generation and analysis directory setup
2.2.1. Overview
This section outlines the protocol for generating FASTQ files from the raw bcl data and setting up per-project analysis directories using the scripts and utilities included in this package.
The basic procedure is:
- Create top-level analysis directory
- Generate FASTQ files
- Populate analysis subdirectories for each project
Subsequently the QC pipeline should be run for each project.
2.2.2. Create top-level analysis directory
Create a top-level analysis directory where the FASTQs and per-project analysis directories will be created, for example:
mkdir /scratch/120919_SN7001250_0035_BC133VACXX_analysis
Note
Conventionally we name analysis directories by appending _analysis to the primary data directory name.
2.2.3. FASTQ generation
Within the top-level directory create a customised copy of the original SampleSheet.csv from the primary data directory. This is best done using the prep_sample_sheet.py utility, as it will automatically convert the original file to the correct format.
prep_sample_sheet.py can automatically address specific issues, for example:
- --fix-spaces: replaces spaces in SampleID and SampleProject fields with underscore characters
- --fix-duplicates: appends indices to SampleIDs to make SampleID/SampleProject combinations unique
These two options together should automatically fix most problems with sample sheets, e.g.:
prep_sample_sheet.py \
--fix-spaces --fix-duplicates \
-o custom_samplesheet.csv \
/mnt/data/120919_SN7001250_0035_BC133VACXX/SampleSheet.csv
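To make the effect of these two options concrete, here is a rough Python sketch of the transformations as they are described above. It is not the implementation used by prep_sample_sheet.py: the function names are hypothetical, the `_1`, `_2` suffix format is an assumption, and each sample sheet line is represented as a plain dictionary.

```python
from collections import Counter


def fix_spaces(rows):
    """Replace spaces in the SampleID and SampleProject fields with underscores."""
    for row in rows:
        for field in ("SampleID", "SampleProject"):
            row[field] = row[field].strip().replace(" ", "_")
    return rows


def fix_duplicates(rows):
    """Append an index to SampleIDs whose SampleID/SampleProject pair repeats."""
    totals = Counter((r["SampleID"], r["SampleProject"]) for r in rows)
    seen = Counter()
    for row in rows:
        key = (row["SampleID"], row["SampleProject"])
        if totals[key] > 1:
            seen[key] += 1
            # Assumed suffix format: _1, _2, ... in order of appearance
            row["SampleID"] = f"{row['SampleID']}_{seen[key]}"
    return rows


# Made-up rows, for illustration only
rows = [
    {"SampleID": "PhiX control", "SampleProject": "My Project"},
    {"SampleID": "PhiX control", "SampleProject": "My Project"},
    {"SampleID": "AB1", "SampleProject": "AB"},
]
fixed = fix_duplicates(fix_spaces(rows))
print([(r["SampleID"], r["SampleProject"]) for r in fixed])
```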
prep_sample_sheet.py also has options to edit the sample sheet fields: for example, the --set-id=... and --set-project=... options allow the SampleID and SampleProject fields to be reset.
Note
prep_sample_sheet.py will only write a new sample sheet file if it thinks that the problems have been addressed; to override this use the --ignore-warnings option.
To generate FASTQs, run the bclToFastq.sh script in the top-level analysis directory, e.g.:
qsub -b y -cwd -V bclToFastq.sh \
/mnt/data/120919_SN7001250_0035_BC133VACXX \
Unaligned custom_samplesheet.csv
This automatically runs the configureBclToFastq.pl and make steps (above) together and creates a new subdirectory called Unaligned with the FASTQs.
The general syntax for this step is:
bclToFastq.sh /path/to/ILLUMINA_RUN_DIR output_dir [ samplesheet.csv ]
Note
If bcl-to-fastq fails to generate the FASTQ files due to some problem with the input data then the Troubleshooting bcl to FASTQ conversion section below may help.
2.2.4. Populate analysis subdirectories
Use the build_illumina_analysis_dir.py utility to create subdirectories for each project named in the input sample sheet file, and populate these with links to the FASTQ files generated in the previous step.
Use the --list option to see what projects and samples the program will use, e.g.:
build_illumina_analysis_dir.py --list \
/scratch/120919_SN7001250_0035_BC133VACXX_analysis
which produces output of the form:
Project: AB (4 samples)
    AB1
        AB1_NoIndex_L002_R1_001.fastq.gz
    AB2
        AB2_NoIndex_L003_R1_001.fastq.gz
    AB3
        AB3_NoIndex_L004_R1_001.fastq.gz
    AB4
        AB4_NoIndex_L005_R1_001.fastq.gz
Project: Control (4 samples)
    PhiX1
        PhiX1_NoIndex_L001_R1_001.fastq.gz
    PhiX2
        PhiX2_NoIndex_L006_R1_001.fastq.gz
    PhiX3
        PhiX3_NoIndex_L007_R1_001.fastq.gz
    PhiX4
        PhiX4_NoIndex_L008_R1_001.fastq.gz
Use the --expt=EXPT_TYPE option to specify a library type for one or more projects, e.g.:
build_illumina_analysis_dir.py \
--expt=AB:ChIP-seq \
/mnt/analyses/120919_ILLUMINA-73D9FA_00008_FC_analysis
This creates new subdirectories for each project which contain symbolic links to the FASTQ files:
<YYMMDD>_<machinename>_<XXXXX>_FC_analysis/
  |
  +-- Unaligned/
  |     |
  |    ...
  |
  +-- <PI>_<library>/
  |     |
  |     +-- *.fastq.gz -> ../Unaligned/.../*.fastq.gz
  |
  +-- <PI>_<library>/
  |     |
  |     +-- *.fastq.gz -> ../Unaligned/.../*.fastq.gz
  |
 ...
Unaligned is the output from the bclToFastq.sh run (see the previous section), and contains the fastq files. The fastq.gz files in the project directories are symbolic links to the files in the Unaligned directory.
By default the FASTQ names are simplified versions of the original FASTQs; use the --keep-names option to preserve the full names of the FASTQ files.
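The linking step can be pictured with the following minimal Python sketch. It is not the code used by build_illumina_analysis_dir.py: `link_fastqs` is a hypothetical helper that creates relative symbolic links and keeps the full FASTQ names (it does not attempt the name simplification mentioned above).

```python
import os
from pathlib import Path


def link_fastqs(sample_dirs, project_dir):
    """Create relative symlinks in project_dir for each fastq.gz in sample_dirs."""
    project_dir = Path(project_dir)
    project_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for sample_dir in sample_dirs:
        for fastq in sorted(Path(sample_dir).glob("*.fastq.gz")):
            link = project_dir / fastq.name
            # A relative target keeps the links valid if the whole
            # analysis directory is moved
            target = os.path.relpath(fastq, start=project_dir)
            if not link.is_symlink():
                link.symlink_to(target)
            links.append(link)
    return links
```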
2.2.5. Merging replicates
Multiplexed runs can produce large numbers of replicates of each sample, with each replicate producing a single FASTQ file - so if there are 20 samples each with 8 replicates then this will produce 160 FASTQ files.
In this situation it can be more helpful to concatenate the replicates into single FASTQ files; this can be done automatically when creating the analysis subdirectories using the --merge-replicates option.
--merge-replicates doesn’t require any additional input; it produces concatenated FASTQ files (rather than symbolic links) when creating the analysis subdirectory for each project, e.g.:
build_illumina_analysis_dir.py \
--expt=AB:RNA-seq \
--merge-replicates \
/mnt/analyses/120919_SN7001250_0035_BC133VACXX_analysis
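For intuition, gzipped FASTQ files can be concatenated byte-for-byte, because a sequence of gzip streams is itself a valid gzip file. The sketch below shows that idea; it is not necessarily how --merge-replicates is implemented, and `concatenate_fastqs` is a hypothetical helper.

```python
import shutil


def concatenate_fastqs(inputs, output):
    """Concatenate gzipped FASTQ files into one file, in the order given."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                # Copy the compressed bytes unchanged; no recompression needed
                shutil.copyfileobj(src, out)
```

For paired-end data the replicates must be concatenated in the same order for the R1 and R2 files, otherwise the read pairing is lost.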
Note
Use the verify_paired.py utility to check that the order of reads in the merged files is correct.
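The kind of check involved can be sketched as follows: walk the R1 and R2 files in step and compare the read identifiers in the header lines. This is an illustration only and not the verify_paired.py implementation; it assumes well-formed four-line FASTQ records, with the mate marked either after the first space in the header or by a trailing /1 or /2.

```python
from itertools import zip_longest


def read_id(header_line):
    """Return the read id from a FASTQ header, without the mate-specific part."""
    name = header_line.strip()[1:].split(" ")[0]
    # Older-style headers mark the mate with a trailing /1 or /2
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def header_ids(fastq_lines):
    """Yield the read id of every four-line FASTQ record."""
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0:
            yield read_id(line)


def reads_are_paired(r1_lines, r2_lines):
    """True if both inputs hold the same read ids in the same order."""
    pairs = zip_longest(header_ids(r1_lines), header_ids(r2_lines))
    return all(id1 == id2 for id1, id2 in pairs)
```

The inputs can be any iterables of text lines, for example two files opened with `gzip.open(path, "rt")`.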
2.3. Troubleshooting bcl to FASTQ conversion
Failure with error “sample-dir not valid: number of directories must match the number of barcodes”
This might be due to the presence of spaces in the SampleID and SampleProject fields in the sampleSheet.csv file, which seem to confuse CASAVA.
The solution is to edit the sample sheet file to remove the spaces; this can be done automatically using the --fix-spaces option of the prep_sample_sheet.py program, e.g.:
prep_sample_sheet.py --fix-spaces -o custom_SampleSheet.csv sampleSheet.csv
will create a copy of the original sample sheet file with any spaces replaced by underscores.
Failure with error “barcode XXXXXX for lane 1 has length Y: expected barcode lenth (including delimiters) is Z”
This can happen when attempting to demultiplex paired barcoded samples. The information that CASAVA needs should be read automatically from the RunInfo.xml file, but it appears that this doesn’t always happen (or perhaps the information is not consistent with the bcl files, e.g. because the sequencing run didn’t complete properly).
To fix this use the --use-bases-mask option of configureBclToFastq.pl (or bclToFastq.sh) to tell CASAVA how to deal with each base. For example:
--use-bases-mask y101,I8,I8,y85
instructs the software to treat the first 101 bases as the first sequence, the next 8 as the first index (i.e. barcoded tag attached to the first sequence), the next 8 as the second index, and then the next 85 bases as the second sequence.
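To check that a mask accounts for the expected number of cycles, it can be broken down programmatically. The sketch below handles only the simple letter-plus-count form used in the example above (y for sequence, I for index, n for ignored bases); `parse_bases_mask` is a hypothetical helper and is not part of CASAVA.

```python
import re

SEGMENT = re.compile(r"^([yYiInN])(\d+)$")
KINDS = {"y": "sequence", "i": "index", "n": "ignored"}


def parse_bases_mask(mask):
    """Split a bases mask into (kind, number_of_bases) pairs, one per read."""
    segments = []
    for part in mask.split(","):
        m = SEGMENT.match(part.strip())
        if m is None:
            raise ValueError(f"unsupported mask segment: {part!r}")
        segments.append((KINDS[m.group(1).lower()], int(m.group(2))))
    return segments


segments = parse_bases_mask("y101,I8,I8,y85")
print(segments)
# prints: [('sequence', 101), ('index', 8), ('index', 8), ('sequence', 85)]
print(sum(n for _, n in segments))  # prints: 202
```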
Note
See also this BioStars question about dealing with the CASAVA error: “barcode CTTGTA for lane 1 has length X: expected barcode lenth is Y” http://www.biostars.org/post/show/49599/casava-error-barcode-cttgta-for-lane-1-has-length-6-expected-barcode-lenth-is-7/#55718