Illumina data handling utilities¶

Utilities for preparing data on the cluster from the Illumina instrument:

analyse_illumina_run.py: reporting and manipulations of Illumina run data
auto_process_illumina.sh: automatically process Illumina-based sequencing run
bclToFastq.sh: generate FASTQ from BCL files
build_illumina_analysis_dirs.py: create and populate per-project analysis dirs
demultiplex_undetermined_fastq.py: demultiplex undetermined Illumina reads
prep_sample_sheet.py: edit SampleSheet.csv before generating FASTQ
report_barcodes.py: analyse barcode sequences from FASTQ files
rsync_seq_data.py: copy sequencing data using rsync
verify_paired.py: utility to check FASTQs form R1/R2 pair

analyse_illumina_run.py¶

Utility for performing various checks and operations on Illumina data.

Usage:

analyse_illumina_run.py OPTIONS illumina_data_dir

illumina_data_dir is the top-level directory containing the Unaligned directory with the fastq.gz files produced by the BCL-to-FASTQ conversion step.

Options:

--report¶: report sample names and number of samples for each project

--summary¶: short report of samples (suitable for logging file)

-l, --list¶: list projects, samples and fastq files directories

--unaligned=UNALIGNED_DIR¶: specify an alternative name for the ‘Unaligned’ directory conatining the fastq.gz files

--copy=COPY_PATTERN¶: copy fastq.gz files matching COPY_PATTERN to current directory

--verify=SAMPLE_SHEET¶: check CASAVA outputs against those expected for SAMPLE_SHEET

--stats¶: Report statistics (read counts etc) for fastq files

auto_process_illumina.sh¶

Automatically process data from an Illumina-based sequencing platform

Usage:

auto_process_illumina.sh COMMAND [ PLATFORM DATA_DIR ]

COMMAND can be one of:

setup: prepares a new analysis directory. This step must be
       done first and requires that PLATFORM and DATA_DIR
       arguments are also supplied (these do not have to be
       specified for other commands).
       This creates an analysis directory in the current dir
       with a custom_SampleSheet.csv file; this should be
       examined and edited before running the subsequent
       steps.

make_fastqs: runs CASAVA to generate Fastq files from the
       raw bcls.

run_qc: runs the QC pipeline and generates reports.

The make_fastqs and run_qc commands must be executed from the analysis directory created by the setup command.

Standard protocol¶

The auto_process_illumina.sh script is intended to automate the major steps in generating FASTQ files from raw Illumina BCL data.

The standard protocol for using the automated script is:

Run the setup step to create a new analysis directory
Move into the analysis directory
Check and if necessary edit the generated sample sheet, based on the predicted output projects and samples
Check and if necessary edit the bases mask setting in the ``DEFINE_RUN`` line in the ``processing.info`` file
Run the make_fastqs step
Inspect the summary file which lists the generated FASTQ files along with their sizes and number of reads (and number of undetermined reads)
Run the run_qc step

The critical step is to check and edit the sample sheet, as this is used to determine which samples are assigned to which project. After editing the sample sheet it is a good idea to check the predicted outputs by running:

prep_sample_sheet.py SAMPLE_SHEET

and ensure that this is what was actually intended, before running the next steps.

To change the settings used by CASAVA’s BCL to FASTQ conversion, it is also necessary to edit the DEFINE_RUN line in the processing.info file. This line typically looks like:

DEFINE_RUN  custom_SampleSheet.csv:Unaligned:y68,I7

The colon-delimited values are:

Sample sheet name in the analysis directory (default: custom_SampleSheet.csv)
The output directory where CASAVA will write the output data file (default: Unaligned)
The bases mask that will be used by CASAVA (default will be determined automatically from the RunInfo.xml file in the source data directory)

Optionally a fourth colon-delimited value can be supplied:

The number of allowed mismatches when demultiplexing (default will be determined from the bases mask value)

Multiple samplesheets¶

In some cases it might be necessary to split the BCL to FASTQ processing across multiple sample sheets.

In this case the protocol would be:

Run the setup step
Move into the analysis directory
Create multiple sample sheets as required
Edit the `processing.info` file to add `DEFINE_RUN` for each sample sheet
Run the make_fastqs step, which will automatically run a separate BCL to FASTQ conversion for each DEFINE_RUN line
For each BCL to FASTQ conversion, inspect the summary file which lists the generated FASTQ files along with their sizes and number of reads (and number of undetermined reads)
Run the run_qc step, which will automatically run a separate QC on the outputs of each BCL to FASTQ conversion

The previous section has more detail on the format and content of the DEFINE_RUN line. In the case of multiple DEFINE_RUN lines, it is advised to specify distinct output directories, e.g.:

DEFINE_RUN  pjbriggs_SampleSheet.csv:Unaligned_pjbriggs:y68,I7

bclToFastq.sh¶

Bcl to Fastq conversion wrapper script

Usage:

bclToFastq.sh <illumina_run_dir> <output_dir>

<illumina_run_dir> is the top-level Illumina data directory; Bcl files are expected to be in the Data/Intensities/BaseCalls subdirectory. <output_dir> is the top-level target directory for the output from the conversion process (including the generated fastq files).

The script runs configureBclToFastq.pl from CASAVA to set up conversion scripts, then runs make to perform the actual conversion. It requires that CASAVA is available on the system.

Options:

--nmismatches N¶: set number of mismatches to allow; recommended values are 0 for samples without multiplexing, 1 for multiplexed samples with tags of length 6 or longer (see the CASAVA user guide for details of the --nmismatches option)

--use-bases-mask BASES_MASK¶: specify a bases-mask string tell CASAVA how to use each cycle. The supplied value is passed directly to configureBcltoFastq.pl (see the CASAVA user guide for details of how –use-bases-mask works)

--nprocessors N¶: set the number of processors to use (defaults to 1). This is passed to the -j option of the ‘make’ step after running configureBcltoFastq.pl (see the CASAVA user guide for details of how -j works)

build_illumina_analysis_dirs.py¶

Query/build per-project analysis directories for post-bcl-to-fastq data from Illumina GA2 sequencer.

Usage:

build_illumina_analysis_dir.py OPTIONS illumina_data_dir

Create per-project analysis directories for Illumina run. illumina_data_dir is the top-level directory containing the Unaligned directory with the fastq.gz files generated from the bcl files. For each Project_... directory build_illumina_analysis_dir.py makes a new subdirectory and populates with links to the fastq.gz files for each sample under that project.

Options:

--dry-run¶: report operations that would be performed if creating the analysis directories but don’t actually do them

--unaligned=UNALIGNED_DIR¶: specify an alternative name for the Unaligned directory conatining the fastq.gz files

--expt=EXPT_TYPE¶: specify experiment type (e.g. ChIP-seq) to append to the project name when creating analysis directories. The syntax for EXPT_TYPE is <project>:<type> e.g. --expt=NY:ChIP-seq will create directory NY_ChIP-seq. Use multiple --expt=... to set the types for different projects

--keep-names¶: preserve the full names of the source fastq files when creating links

--merge-replicates¶: create merged fastq files for each set of replicates detected

demultiplex_undetermined_fastq.py¶

Demultiplex undetermined Illumina reads output from CASAVA.

Usage:

demultiplex_undetermined_fastq.py OPTIONS DIR

Reassign reads with undetermined index sequences. (i.e. barcodes). DIR is the name (including any leading path) of the ‘Undetermined_indices’ directory produced by CASAVA, which contains the FASTQ files with the undetermined reads from each lane.

Options:

--barcode=BARCODE_INFO¶: specify barcode sequence and corresponding sample name as BARCODE_INFO. The syntax is <name>:<barcode>:<lane> e.g. --barcode=PB1:ATTAGA:3

--samplesheet=SAMPLE_SHEET¶: specify SampleSheet.csv file to read barcodes, sample names and lane assignments from (as an alternative to --barcode).

prep_sample_sheet.py¶

Prepare sample sheet files for Illumina sequencers for input into CASAVA.

Usage:

prep_sample_sheet.py [OPTIONS] SampleSheet.csv

Utility to prepare SampleSheet files from Illumina sequencers. Can be used to view, validate and update or fix information such as sample IDs and project names before running BCL to FASTQ conversion.

Options:

-o SAMPLESHEET_OUT¶: output new sample sheet to SAMPLESHEET_OUT

-f FMT, --format=FMT¶: specify the format of the output sample sheet written by the -o option; can be either CASAVA or IEM (defaults to the format of the original file)

-v, --view¶: view contents of sample sheet

--fix-spaces¶: replace spaces in sample ID and project fields with underscores

--fix-duplicates¶: append unique indices to Sample IDs where original ID and project name combination are duplicated

--fix-empty-projects¶: create sample project names where these are blank in the original sample sheet

--set-id=SAMPLE_ID¶: update/set the values in the Sample ID field; SAMPLE_ID should be of the form <lanes>:<name>, where <lanes> is a single integer (e.g. 1), a set of integers (e.g. 1,3,…), a range (e.g. 1-3), or a combination (e.g. 1,3-5,7)

--set-project=SAMPLE_PROJECT¶: update/set values in the sample project field; SAMPLE_PROJECT should be of the form [<lanes>:]<name>, where the optional <lanes> part can be a single integer (e.g. 1), a set of integers (e.g. 1,3,…), a range (e.g. 1-3), or a combination (e.g. 1,3-5,7). If no lanes are specified then all samples will have their project set to <name>

--ignore-warnings¶: ignore warnings about spaces and duplicated sampleID/sampleProject combinations when writing new samplesheet.csv file

--include-lanes=LANES¶: specify a subset of lanes to include in the output sample sheet; LANES should be single integer (e.g. 1), a list of integers (e.g. 1,3,…), a range (e.g. 1-3) or a combination (e.g. 1,3-5,7). Default is to include all lanes

Deprecated options:

--truncate-barcodes=BARCODE_LEN¶: trim barcode sequences in sample sheet to number of bases specified by BARCODE_LEN. Default is to leave barcode sequences unmodified (deprecated; only works for CASAVA-style sample sheets)

--miseq¶: convert MiSEQ input sample sheet to CASAVA-compatible format (deprecated; conversion is performed specify -f/–format CASAVA to convert IEM sample sheet to older format)

Examples:

Read in the sample sheet file SampleSheet.csv, update the SampleProject and SampleID for lanes 1 and 8, and write the updated sample sheet to the file SampleSheet2.csv:
```
prep_sample_sheet.py -o SampleSheet2.csv --set-project=1,8:Control \
     --set-id=1:PhiX_10pM --set-id=8:PhiX_12pM SampleSheet.csv
```
Automatically fix spaces and duplicated sampleID/sampleProject combinations and write out to SampleSheet3.csv:
```
prep_sample_sheet.py --fix-spaces --fix-duplicates \
     -o SampleSheet3.csv SampleSheet.csv
```

report_barcodes.py¶

Examine barcode sequences from one or more Fastq files and report the most prevalent. Sequences will be pooled from all specified Fastqs before being analysed.

Usage:

report_barcodes.py FASTQ [FASTQ...]

Options:

--cutoff=CUTOFF¶: Minimum number of times a barcode sequence must appear to be reported (default is 1000000)

rsync_seq_data.py¶

Rsync sequencing data to archive location, inserting the correct ‘year’ and ‘platform’ subdirectories.

Usage:

rsync_seq_data.py [OPTIONS] DIR BASE_DIR

Wrapper to rsync sequencing data: DIR will be rsync’ed to a subdirectory of BASE_DIR constructed from the year and platform i.e. BASE_DIR/YEAR/PLATFORM/. YEAR will be the current year (over-ride using the –year option), PLATFORM will be inferred from the DIR name (over-ride using the –platform option). The output from rsync is written to a file rsync.DIR.log.

Options:

--platform=PLATFORM¶: explicitly specify the sequencer type

--year=YEAR¶: explicitly specify the year (otherwise current year is assumed)

--dry-run¶: run rsync with --dry-run option

--chmod=CHMOD¶: change file permissions using --chmod option of rsync (e.g ‘u-w,g-w,o-w’)

--exclude=EXCLUDE_PATTERN¶: specify a pattern which will exclude any matching files or directories from the rsync

--mirror¶: mirror the source directory at the destination (update files that have changed and remove any that have been deleted i.e. rsync –delete-after)

--no-log¶: write rsync output directly stdout, don’t create a log file

verify_paired.py¶

Utility to verify that two fastq files form an R1/R2 pair.

Usage:

verify_paired.py OPTIONS R1.fastq R2.fastq

Check that read headers for R1 and R2 fastq files are in agreement, and that the files form an R1/2 pair.