bcftbx.IlluminaData

Provides classes for extracting data about runs of Illumina-based sequencers (e.g. GA2x or HiSeq) from directory structure, data files and naming conventions.

Core data and run handling classes

class bcftbx.IlluminaData.IlluminaData(illumina_analysis_dir, unaligned_dir='Unaligned')

Class for examining Illumina data post bcl-to-fastq conversion

Provides the following attributes:

  • analysis_dir: top-level directory holding the ‘Unaligned’ subdirectory with the primary fastq.gz files

  • projects: list of IlluminaProject objects (one for each project defined at the fastq creation stage)

  • undetermined: IlluminaProject object for the undetermined reads

  • unaligned_dir: full path to the ‘Unaligned’ directory holding the primary fastq.gz files

  • paired_end: True if at least one project is paired end, False otherwise

  • format: Format of the directory structure layout (either ‘casava’ or ‘bcl2fastq2’, or None if the format cannot be determined)

  • lanes: List of lane numbers present; if there are no lanes then this will be a list with ‘None’ as the only value

Provides the following methods:

  • get_project(): lookup and return an IlluminaProject object corresponding to the supplied project name

class bcftbx.IlluminaData.IlluminaProject(dirn)

Class for storing information on a ‘project’ within an Illumina run

A project is a subset of fastq files from a run of an Illumina sequencer; in the first instance projects are defined within the SampleSheet.csv file which is output by the sequencer.

Note that the “undetermined” fastqs (which hold the reads for each lane that couldn’t be assigned to a barcode during demultiplexing) are also considered to be a project, and can be processed using an IlluminaProject object.

Provides the following attributes:

  • name: name of the project

  • dirn: (full) path of the directory for the project

  • expt_type: the application type for the project e.g. RNA-seq, ChIP-seq (initially set to None; should be explicitly set by the calling subprogram)

  • samples: list of IlluminaSample objects for each sample within the project

  • paired_end: True if all samples are paired end, False otherwise

  • undetermined: True if ‘samples’ are actually undetermined reads

class bcftbx.IlluminaData.IlluminaRun(illumina_run_dir, platform=None)

Class for examining ‘raw’ Illumina data directory.

Provides the following properties:

  • run_dir: name and full path to the top-level data directory

  • basecalls_dir: name and full path to the subdirectory holding bcl files

  • sample_sheet_csv: full path of the SampleSheet.csv file

  • runinfo_xml: full path of the RunInfo.xml file

  • runparameters_xml: full path of the RunParameters.xml file

  • platform: platform e.g. ‘miseq’

  • bcl_extension: file extension for bcl files (either “bcl” or “bcl.gz”)

  • lanes: list of (integer) lane numbers in the run

  • sample_sheet: SampleSheet instance (if the run has an associated sample sheet file)

  • runinfo: IlluminaRunInfo instance (if the run has an associated RunInfo.xml file)

  • runparameters: IlluminaRunParameters instance (if the run has an associated RunParameters.xml file)

class bcftbx.IlluminaData.IlluminaRunInfo(runinfo_xml)

Class for examining Illumina RunInfo.xml file

Extracts basic information from a RunInfo.xml file:

  • run_id: the run id e.g. ‘130805_PJ600412T_0012_ABCDEZXDYY’

  • run_number: the run number e.g. ‘0012’

  • instrument: the instrument name e.g. ‘PJ600412T’

  • date: the run date e.g. ‘130805’

  • flowcell: the flowcell id e.g. ‘ABCDEZXDYY’

  • lane_count: the flowcell lane count e.g. 8

  • bases_mask: bases mask string derived from the read information e.g. ‘y101,I6,y101’

  • reads: a list of Python dictionaries (one per read)

Each dictionary in the ‘reads’ list has the following keys:

  • number: the read number (1,2,3,…)

  • num_cycles: the number of cycles in the read e.g. 101

  • is_indexed_read: whether the read is an index (i.e. barcode); either ‘Y’ or ‘N’

Parameters:

runinfo_xml (str) – path to the RunInfo.xml file
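The relationship between the ‘reads’ list and the ‘bases_mask’ string can be sketched as follows. This is a minimal standalone illustration and not the library's own code; the function name bases_mask_from_reads is hypothetical, and it assumes that ‘number’ and ‘num_cycles’ can be converted to integers.

```python
def bases_mask_from_reads(reads):
    """Build a bases mask string from a list of read dictionaries.

    Hypothetical helper: each dictionary is assumed to carry the keys
    'number', 'num_cycles' and 'is_indexed_read' described above.
    """
    parts = []
    for read in sorted(reads, key=lambda r: int(r["number"])):
        # Index (barcode) reads are written as 'I<n>', data reads as 'y<n>'
        prefix = "I" if read["is_indexed_read"] == "Y" else "y"
        parts.append("%s%d" % (prefix, int(read["num_cycles"])))
    return ",".join(parts)


reads = [
    {"number": 1, "num_cycles": 101, "is_indexed_read": "N"},
    {"number": 2, "num_cycles": 6, "is_indexed_read": "Y"},
    {"number": 3, "num_cycles": 101, "is_indexed_read": "N"},
]
print(bases_mask_from_reads(reads))  # y101,I6,y101
```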

class bcftbx.IlluminaData.IlluminaSample(dirn, fastqs=None, name=None, prefix='Sample_')

Class for storing information on a ‘sample’ within an Illumina project

A sample is a fastq file generated within an Illumina sequencer run.

Provides the following attributes:

  • name: sample name

  • dirn: (full) path of the directory for the sample

  • fastq: name of the fastq.gz file (without leading directory, join to ‘dirn’ to get full path)

  • paired_end: boolean; indicates whether sample is paired end

Samplesheet handling

class bcftbx.IlluminaData.SampleSheet(sample_sheet=None, fp=None)

Class for handling Illumina sample sheets

This is a general class which tries to handle and convert between older (i.e. ‘CASAVA’-style) and newer (IEM-style) sample sheet files for Illumina sequencers, in a transparent manner.

Experimental Manager (IEM) sample sheet format

The Experimental Manager (IEM) sample sheets are text files with data delimited by ‘[…]’ lines e.g. ‘[Header]’, ‘[Reads]’ etc.

The ‘Header’ section consists of comma-separated key-value pairs e.g. ‘Application,HiSeq FASTQ Only’.

The ‘Reads’ section consists of values (one per line), which appear to be the number of bases per read, e.g. ‘101’.

The ‘Settings’ section consists of comma-separated key-value pairs e.g. ‘Adapter,CTGTCTCTTATACACATCT’.

The ‘Manifests’ section consists of comma-separated key-filename pairs e.g. ‘A,TruSeqAmpliconManifest-1.txt’.

The ‘Data’ section contains the data about the lanes, samples and barcode indexes. It consists of lines of comma-separated values, with the first line being a ‘header’, and the remainder being values for each of those fields.
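The section layout described above can be illustrated with a small standalone parser. This is only a sketch of the file format, not how the SampleSheet class is implemented; the function name split_iem_sections and the example sheet contents are made up for illustration.

```python
import csv
import io


def split_iem_sections(text):
    """Group the rows of an IEM-style sample sheet by '[Section]' name."""
    sections = {}
    current = None
    for row in csv.reader(io.StringIO(text)):
        cells = [cell.strip() for cell in row]
        if not any(cells):
            continue  # skip blank lines and lines of bare commas
        first = cells[0]
        if first.startswith("[") and first.endswith("]"):
            # A '[...]' line starts a new section
            current = first[1:-1]
            sections[current] = []
        elif current is not None:
            sections[current].append(cells)
    return sections


example = """[Header]
IEMFileVersion,4
Application,HiSeq FASTQ Only
[Reads]
101
101
[Settings]
Adapter,CTGTCTCTTATACACATCT
[Data]
Lane,Sample_ID,Sample_Name,index,Sample_Project
1,S1,Sample1,CGATGT,ProjA
"""

sections = split_iem_sections(example)
# 'Header' and 'Settings' are key-value pairs
header = {row[0]: row[1] for row in sections["Header"]}
# 'Reads' holds one value per line
reads = [row[0] for row in sections["Reads"]]
# 'Data' is a header line followed by one line per sample
fields = sections["Data"][0]
data = [dict(zip(fields, row)) for row in sections["Data"][1:]]
```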

CASAVA-style sample sheet format

This older style of sample sheet is used by CASAVA and bcl2fastq v1.8.*. It consists of lines of comma-separated values, with the first line being a ‘header’ and the remainder being values for each of the fields:

  • FCID: flow cell ID

  • Lane: lane number (integer from 1 to 8)

  • SampleID: ID (name) for the sample

  • SampleRef: reference used for alignment for the sample

  • Index: index sequences (multiple index reads are separated by a hyphen e.g. ACCAGTAA-GGACATGA)

  • Description: Description of the sample

  • Control: Y indicates this lane is a control lane, N means sample

  • Recipe: Recipe used during sequencing

  • Operator: Name or ID of the operator

  • SampleProject: project the sample belongs to

Although the CASAVA-style sample sheet looks much like the IEM ‘Data’ section, note that it has different fields and field names.

Basic usage

To load data from an IEM-format file:

>>> iem = SampleSheet('SampleSheet.csv')

To access ‘header’ items:

>>> iem.header_items
['IEMFileVersion','Date',..]
>>> iem.header['IEMFileVersion']
'4'

To access ‘reads’ data:

>>> iem.reads
['101','101']

To access ‘settings’ items:

>>> iem.settings_items
['ReverseComplement',...]
>>> iem.settings['ReverseComplement']
'0'

To access ‘manifests’ items:

>>> iem.manifests_items
['A',...]
>>> iem.manifests['A']
'TruSeqAmpliconManifest-1.txt'

To access ‘data’ (the actual sample sheet information):

>>> iem.data.header()
['Lane','Sample_ID',...]
>>> iem.data[0]['Lane']
1

etc.

To load data from a CASAVA style sample sheet:

>>> casava = SampleSheet('SampleSheet.csv')

To access the data use the ‘data’ property:

>>> casava.data.header()
['Lane','SampleID',...]
>>> casava.data[0]['Lane']
1

Accessing data directly

The data in the ‘Data’ section can be accessed directly from the SampleSheet instance, e.g.

>>> iem[0]['Lane']

is equivalent to

>>> iem.data[0]['Lane']

It is also possible to set new values for data items using this notation.

The data lines can be iterated over using:

>>> for line in iem:
...     ...

To find the number of lines that are stored:

>>> len(iem)

To append a new line:

>>> new_line = iem.append(...)

Checking and clean-up methods

A number of methods are available to check and fix common problems, specifically:

  • detect and replace ‘illegal’ characters in sample and project names

  • detect and fix duplicated sample name, project and lane combinations

  • detect blank sample and project names
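As a rough illustration of what two of these checks involve, the sketch below works on plain dictionaries rather than on a SampleSheet instance. The function names are hypothetical and the CASAVA-style field names (‘Lane’, ‘SampleID’, ‘SampleProject’) are assumed; the methods provided by the class itself may behave differently.

```python
from collections import Counter


def find_duplicated_lines(lines):
    """Return (lane, project, sample) combinations that occur more than once."""
    counts = Counter(
        (line.get("Lane"), line.get("SampleProject"), line.get("SampleID"))
        for line in lines
    )
    return [key for key, count in counts.items() if count > 1]


def find_blank_names(lines):
    """Return the indices of lines with an empty sample or project name."""
    return [
        i for i, line in enumerate(lines)
        if not str(line.get("SampleID", "")).strip()
        or not str(line.get("SampleProject", "")).strip()
    ]


# Made-up example data
lines = [
    {"Lane": 1, "SampleID": "S1", "SampleProject": "ProjA"},
    {"Lane": 1, "SampleID": "S1", "SampleProject": "ProjA"},
    {"Lane": 2, "SampleID": "", "SampleProject": "ProjA"},
]
print(find_duplicated_lines(lines))  # [(1, 'ProjA', 'S1')]
print(find_blank_names(lines))       # [2]
```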

Sample sheet reconstruction

When data is loaded it is also subjected to some basic cleaning up, including stripping of unnecessary commas and white space. The ‘show’ method returns a reconstructed version of the original sample sheet after the cleaning operations were performed.

class bcftbx.IlluminaData.CasavaSampleSheet(samplesheet=None, fp=None)

Class for reading and manipulating sample sheet files for CASAVA

This class is a subclass of the SampleSheet class, and provides an additional method (‘casava_sample_sheet’) to convert to a CASAVA-style sample sheet, suitable for input into bcl2fastq version 1.8.*.

Raises IlluminaDataError exception if the input data doesn’t appear to be in the correct format.

class bcftbx.IlluminaData.IEMSampleSheet(sample_sheet=None, fp=None)

Class for handling Experimental Manager format sample sheet

This class is a subclass of the SampleSheet class, and provides an additional method (‘casava_sample_sheet’) to convert to a CASAVA-style sample sheet, suitable for input into bcl2fastq version 1.8.*.

bcftbx.IlluminaData.convert_miseq_samplesheet_to_casava(samplesheet=None, fp=None)

Convert a Miseq sample sheet file to CASAVA format

Reads the data in a Miseq-format sample sheet file and returns a CasavaSampleSheet object with the equivalent data.

Note: this is now just a wrapper for the more general conversion function ‘get_casava_sample_sheet’ (which can handle the conversion without knowing a priori what the SampleSheet format is).

Parameters:

samplesheet – name of the Miseq sample sheet file

Returns:

A populated CasavaSampleSheet object.

bcftbx.IlluminaData.get_casava_sample_sheet(samplesheet=None, fp=None, FCID_default='FC1')

Load data into a ‘standard’ CASAVA sample sheet CSV file

Reads the data from an Illumina platform sample sheet CSV file and populates and returns a CasavaSampleSheet object which can be used to generate a SampleSheet suitable for bcl-to-fastq conversion.

The source sample sheet may be in the format output by the Experimental Manager software (needed when running BaseSpace) or may already be in the “standard” format for bcl-to-fastq conversion.

For Experimental Manager format, the sample sheet consists of sections delimited by headers of the form “[Header]”, “[Reads]” etc. The information about the sample names and barcodes is in the “[Data]” section, which is essentially a list of CSV format lines with the following fields:

MiSeq:

Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,Sample_Project,Description

HiSeq:

Lane,Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,Sample_Project,Description

(Note that for dual-indexed runs the fields are e.g.:

Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description

i.e. there is an additional pair of fields describing the second index)

The conversion maps a subset of these onto fields in the Casava format:

  • Sample_ID -> SampleID

  • index -> Index

  • Sample_Project -> SampleProject

  • Description -> Description

If no lane information is present in the original file then the lane is set to 1. If the FCID is not present in the source file then it is set to the value of FCID_default.

For dual-indexed samples, the Index field is generated by putting together the index and index2 fields.

All other fields are left empty.
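The field mapping described above can be sketched for a single ‘[Data]’ line as below. This is an illustration on plain dictionaries, not the function's actual implementation; the name iem_line_to_casava is hypothetical, and joining ‘index’ and ‘index2’ with a hyphen is an assumption based on the CASAVA ‘Index’ format.

```python
CASAVA_FIELDS = ["FCID", "Lane", "SampleID", "SampleRef", "Index",
                 "Description", "Control", "Recipe", "Operator",
                 "SampleProject"]


def iem_line_to_casava(line, fcid_default="FC1"):
    """Map one IEM '[Data]' line (a dict) onto the CASAVA fields."""
    # Start with every CASAVA field empty
    casava = dict.fromkeys(CASAVA_FIELDS, "")
    casava["FCID"] = fcid_default
    # Lane defaults to 1 when the source has no lane information
    casava["Lane"] = int(line["Lane"]) if line.get("Lane") else 1
    casava["SampleID"] = line.get("Sample_ID", "")
    casava["SampleProject"] = line.get("Sample_Project", "")
    casava["Description"] = line.get("Description", "")
    # Dual index: combine 'index' and 'index2' (hyphen separator assumed)
    index = line.get("index", "")
    index2 = line.get("index2", "")
    casava["Index"] = "%s-%s" % (index, index2) if index2 else index
    return casava


# Made-up MiSeq-style dual-indexed line (no 'Lane' field)
miseq_line = {
    "Sample_ID": "S1",
    "Sample_Name": "Sample1",
    "index": "TAAGGCGA",
    "index2": "TAGATCGC",
    "Sample_Project": "ProjA",
    "Description": "",
}
converted = iem_line_to_casava(miseq_line)
print(converted["Lane"], converted["Index"])  # 1 TAAGGCGA-TAGATCGC
```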

Parameters:
  • samplesheet – name of the Miseq sample sheet file

  • FCID_default – name to use for flow cell ID if not present in the source file (optional)

Returns:

A populated CasavaSampleSheet object.

bcftbx.IlluminaData.verify_run_against_sample_sheet(illumina_data, sample_sheet, include_sample_dir=False)

Checks existence of predicted outputs from a sample sheet

Parameters:
  • illumina_data – a populated IlluminaData directory

  • sample_sheet – path and name of a CSV sample sheet

  • include_sample_dir – if True then always include a ‘sample_name’ directory level when checking for bcl2fastq2 outputs

Returns:

True if all the predicted outputs from the sample sheet are found, False otherwise.

bcftbx.IlluminaData.samplesheet_index_sequence(line)

Return the index sequence for a sample sheet line

Parameters:

line (TabDataLine) – line from a SampleSheet instance

Returns:

barcode sequence, or ‘None’ if not defined.

Return type:

String

bcftbx.IlluminaData.normalise_barcode(seq)

Return normalised version of barcode sequence

This standardises the sequence so that:

  • all bases are uppercase

  • dual index barcodes have ‘-’ and ‘+’ removed
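The two rules above amount to something like the following one-liner. This is a standalone sketch with a hypothetical name, not the library function itself.

```python
def normalise_barcode_sketch(seq):
    """Uppercase a barcode and drop the '-' and '+' dual-index separators."""
    return str(seq).upper().replace("-", "").replace("+", "")


print(normalise_barcode_sketch("taaggcga-TAGATCGC"))  # TAAGGCGATAGATCGC
print(normalise_barcode_sketch("TAAGGCGA+TAGATCGC"))  # TAAGGCGATAGATCGC
```

Normalising both forms to the same string makes it possible to compare barcodes written with different separators.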

Utility classes and functions

class bcftbx.IlluminaData.IlluminaFastq(fastq)

Class for extracting information about Fastq files

Given the name of a Fastq file from CASAVA/Illumina platform, extract data about the sample name, barcode sequence, lane number, read number and set number.

For Fastqs produced by CASAVA and bcl2fastq v1.8, the format of the names follows the general form:

<sample_name>_<barcode_sequence>_L<lane_number>_R<read_number>_<set_number>.fastq.gz

e.g. for

NA10831_ATCACG_L002_R1_001.fastq.gz

  • sample_name = ‘NA10831’

  • barcode_sequence = ‘ATCACG’

  • lane_number = 2

  • read_number = 1

  • set_number = 1

For Fastqs produced by bcl2fastq v2, the format looks like:

<sample_name>_S<sample_number>_L<lane_number>_R<read_number>_<set_number>.fastq.gz

e.g. for

NA10831_S4_L002_R1_001.fastq.gz

  • sample_name = ‘NA10831’

  • sample_number = 4

  • lane_number = 2

  • read_number = 1

  • set_number = 1

Provides the following attributes:

  • fastq: the original fastq file name

  • sample_name: name of the sample (leading part of the name)

  • sample_number: number of the sample (integer or None, bcl2fastq v2 only)

  • barcode_sequence: barcode sequence (string or None, CASAVA/bcl2fastq v1.8 only)

  • lane_number: integer

  • read_number: integer

  • set_number: integer
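The two naming schemes can be illustrated with a pair of regular expressions. This is a standalone sketch, not the class's implementation; the function name parse_fastq_name and the exact patterns (for example which characters are accepted in a barcode) are assumptions.

```python
import os
import re

# CASAVA / bcl2fastq v1.8: <sample>_<barcode>_L<lane>_R<read>_<set>
_CASAVA = re.compile(
    r"^(?P<sample_name>.+)"
    r"_(?P<barcode_sequence>[ACGTN]+(?:-[ACGTN]+)?|NoIndex)"
    r"_L(?P<lane_number>\d+)_R(?P<read_number>\d+)_(?P<set_number>\d+)$")

# bcl2fastq v2: <sample>_S<sample_number>_L<lane>_R<read>_<set>
_BCL2FASTQ2 = re.compile(
    r"^(?P<sample_name>.+)_S(?P<sample_number>\d+)"
    r"_L(?P<lane_number>\d+)_R(?P<read_number>\d+)_(?P<set_number>\d+)$")


def parse_fastq_name(fastq):
    """Extract the name components from an Illumina-style Fastq file name."""
    name = os.path.basename(fastq)
    # Strip the '.gz' and '.fastq' extensions if present
    for ext in (".gz", ".fastq"):
        if name.endswith(ext):
            name = name[:-len(ext)]
    fields = {"fastq": fastq, "sample_number": None, "barcode_sequence": None}
    match = _BCL2FASTQ2.match(name) or _CASAVA.match(name)
    if match is None:
        raise ValueError("Unrecognised Fastq name: %s" % fastq)
    for key, value in match.groupdict().items():
        # The '*_number' fields are stored as integers
        fields[key] = int(value) if key.endswith("_number") else value
    return fields


casava_fq = parse_fastq_name("NA10831_ATCACG_L002_R1_001.fastq.gz")
bcl2fastq2_fq = parse_fastq_name("NA10831_S4_L002_R1_001.fastq.gz")
print(casava_fq["barcode_sequence"], casava_fq["lane_number"])     # ATCACG 2
print(bcl2fastq2_fq["sample_number"], bcl2fastq2_fq["read_number"])  # 4 1
```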

bcftbx.IlluminaData.describe_project(illumina_project)

Generate description string for samples in a project

Description string gives the project name and a human-readable summary of the sample names, plus number of samples and whether the data is paired end.

Example output: “Project Control: PhiX_1-2 (2 samples)”

Arguments

illumina_project: IlluminaProject instance

Returns

Description string.

bcftbx.IlluminaData.fix_bases_mask(bases_mask, barcode_sequence)

Adjust input bases mask to match actual barcode sequence lengths

Updates the bases mask string extracted from RunInfo.xml so that the index read masks correspond to the index barcode sequence lengths given e.g. in the SampleSheet.csv file.

For example: if the bases mask is ‘y101,I7,y101’ (i.e. assigning 7 cycles to the index read) but the barcode sequence is ‘CGATGT’ (i.e. only 6 bases) then the adjusted bases mask should be ‘y101,I6n,y101’.

Parameters:
  • bases_mask – bases mask string e.g. ‘y101,I7,y101’, ‘y250,I8,I8,y250’

  • barcode_sequence – index barcode sequence e.g. ‘CGATGT’ (single index), ‘TAAGGCGA-TAGATCGC’ (dual index)

Returns:

Updated bases mask string.
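The adjustment can be sketched as below. This is an illustrative reimplementation under stated assumptions, not the library's code: it assumes that index reads in the input mask have the simple form ‘I<n>’, that surplus cycles are masked with one ‘n’ per cycle (which reproduces the ‘I6n’ example above), and that an index read with no matching barcode is masked out entirely.

```python
def fix_bases_mask_sketch(bases_mask, barcode_sequence):
    """Shorten index reads in a bases mask to match the barcode lengths."""
    # Dual index barcodes are assumed to be separated by a hyphen
    indexes = barcode_sequence.split("-") if barcode_sequence else []
    fixed = []
    for read in bases_mask.split(","):
        if not read.startswith("I"):
            # Data reads are passed through unchanged
            fixed.append(read)
            continue
        cycles = int(read[1:])
        # Number of cycles actually used by the corresponding barcode
        used = min(len(indexes.pop(0)), cycles) if indexes else 0
        new_read = "I%d" % used if used else ""
        # Mask out any remaining cycles (assumption: one 'n' per cycle)
        new_read += "n" * (cycles - used)
        fixed.append(new_read)
    return ",".join(fixed)


print(fix_bases_mask_sketch("y101,I7,y101", "CGATGT"))
# y101,I6n,y101
print(fix_bases_mask_sketch("y250,I8,I8,y250", "TAAGGCGA-TAGATCGC"))
# y250,I8,I8,y250
```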

bcftbx.IlluminaData.get_unique_fastq_names(fastqs)

Generate mapping of full fastq names to shorter unique names

Given an iterable list of Illumina file fastq names, return a dictionary mapping each name to its shortest unique form within the list.

Parameters:

fastqs – an iterable list of fastq names

Returns:

Dictionary mapping fastq names to shortest unique versions
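One possible reading of “shortest unique form” is sketched below: try progressively longer names built from the components of each Fastq name, and use the first form that is unique across the whole list. The name templates, the regular expression and the function name are all assumptions made for illustration, and the names produced by the library function may differ.

```python
import re

# Assumed pattern: <sample>_<barcode or S-number>_L<lane>_R<read>_<set>.fastq.gz
_NAME = re.compile(
    r"^(?P<sample>.+?)_(?P<tag>[^_]+)_L(?P<lane>\d+)"
    r"_R(?P<read>\d+)_(?P<set>\d+)\.fastq\.gz$")

# Candidate name forms, shortest first (hypothetical choice)
_TEMPLATES = ("{sample}_R{read}", "{sample}_L{lane}_R{read}")


def unique_names_sketch(fastqs):
    """Map each Fastq name to a shorter name that is unique in the list."""
    fastqs = list(fastqs)
    parsed = {}
    for fq in fastqs:
        match = _NAME.match(fq)
        if match is None:
            raise ValueError("Unrecognised Fastq name: %s" % fq)
        parsed[fq] = match.groupdict()
    for template in _TEMPLATES:
        names = {fq: template.format(**fields) + ".fastq.gz"
                 for fq, fields in parsed.items()}
        # Accept this form only if no two Fastqs share a short name
        if len(set(names.values())) == len(names):
            return names
    # Fall back to the full names
    return {fq: fq for fq in fastqs}


by_sample = unique_names_sketch([
    "A_ATCACG_L001_R1_001.fastq.gz",
    "B_CGATGT_L001_R1_001.fastq.gz",
])
by_lane = unique_names_sketch([
    "A_ATCACG_L001_R1_001.fastq.gz",
    "A_ATCACG_L002_R1_001.fastq.gz",
])
```

In this sketch the first call can use the sample name alone, while the second has to include the lane number to keep the names distinct.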

bcftbx.IlluminaData.split_run_name(dirname)

Split an Illumina directory run name into components

Given a directory for an Illumina run, e.g.

140210_M00879_0031_000000000-A69NA

split the name into components and return as a tuple:

(date_stamp,instrument_name,run_number)

e.g.

(‘140210’,‘M00879’,‘0031’)

Note that this function doesn’t return the flow cell ID; use the split_run_name_full function to also extract the flow cell information.
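The splitting can be sketched as below. This is a standalone illustration with a hypothetical function name; returning a tuple of None values for names that don't fit the pattern is an assumption, and the library function may handle that case differently.

```python
import os


def split_run_name_sketch(dirname):
    """Split a run directory name into (date_stamp, instrument, run_number)."""
    # Work on the final path component only
    name = os.path.basename(dirname.rstrip(os.sep))
    fields = name.split("_")
    if len(fields) < 3:
        return (None, None, None)
    date_stamp, instrument_name, run_number = fields[:3]
    # The date stamp and run number are expected to be all digits
    if not (date_stamp.isdigit() and run_number.isdigit()):
        return (None, None, None)
    return (date_stamp, instrument_name, run_number)


print(split_run_name_sketch("140210_M00879_0031_000000000-A69NA"))
# ('140210', 'M00879', '0031')
```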

bcftbx.IlluminaData.summarise_projects(illumina_data)

Short summary of projects, suitable for writing to a log file

The summary description is a one line summary of the project names along with the number of samples in each, and an indication of whether the run was paired end.

Parameters:

illumina_data – a populated IlluminaData directory

Returns:

Summary description.

Exception classes

class bcftbx.IlluminaData.IlluminaDataError

Base class for errors with Illumina-related code