bcftbx.SolidData

Provides classes for extracting data about SOLiD runs from directory structure, data files and naming conventions.

Typical usage is to create a new SolidRun instance by pointing it at the top-level output directory produced by the sequencer:

>>> solid_run = SolidRun('/path/to/solid0123_20141225_FRAG_BC')

This will automatically attempt to collect the data about the run, which can then be accessed via other objects linked through the SolidRun object’s properties.

The most useful are:

  • SolidRun.run_info: a SolidRunInfo object which holds data extracted from the run name (e.g. instrument, datestamp etc)

  • SolidRun.samples: a list of SolidSample objects which hold data about each of the samples in the run.

Each sample in turn holds a list of libraries within that sample (SolidLibrary objects in ‘SolidSample.libraries’) and a list of projects (SolidProject objects in ‘SolidSample.projects’). The ‘getLibrary’ and ‘getProject’ methods also provide ways to look up specific libraries or projects.

Projects are groupings of libraries (based on library names) which are assumed to form a single experiment. The libraries within a project can be obtained via the ‘SolidProject.libraries’ property, or looked up individually using the ‘getLibrary’ method.

Finally, SolidLibrary objects hold data about the location of the primary data files. The ‘SolidLibrary.csfasta’ and ‘SolidLibrary.qual’ properties hold the locations of the data for the F3 reads, while for paired-end runs the ‘SolidLibrary.csfasta_f5’ and ‘SolidLibrary.qual_f5’ properties point to the F5 reads.

(The ‘is_paired_end’ function can be used to test whether a SolidRun object holds data for a paired-end run.)
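As a rough illustration of how the objects described above link together, the sketch below walks from a run to its samples and libraries and collects the F3 file locations. It relies only on the ‘samples’, ‘libraries’, ‘name’, ‘csfasta’ and ‘qual’ properties documented here; the helper name summarise_run and the stand-in data are hypothetical, and the stand-ins exist only so the example runs without the bcftbx package installed.

```python
from types import SimpleNamespace


def summarise_run(solid_run):
    """Return (sample, library, csfasta, qual) tuples for every library.

    Works on any object exposing the 'samples', 'libraries', 'name',
    'csfasta' and 'qual' attributes described in this module.
    """
    rows = []
    for sample in solid_run.samples:
        for library in sample.libraries:
            rows.append((sample.name, library.name,
                         library.csfasta, library.qual))
    return rows


# Hypothetical stand-in objects; with the real package a SolidRun
# instance would be passed to summarise_run instead.
demo_library = SimpleNamespace(name="DR1",
                               csfasta="/data/DR1_F3.csfasta",
                               qual="/data/DR1_F3_QV.qual")
demo_sample = SimpleNamespace(name="AB_pool", libraries=[demo_library])
demo_run = SimpleNamespace(samples=[demo_sample])

for row in summarise_run(demo_run):
    print("\t".join(row))
```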

SolidRun

class bcftbx.SolidData.SolidRun(solid_run_dir)

Describe a SOLiD run.

The SolidRun class provides an interface to data about a SOLiD run. It analyses the SOLiD data directory to look for run definitions, statistics files and primary data files.

It uses the same terminology as the SETS interface and the data files produced by the SOLiD instrument, so a run contains ‘samples’ and each sample contains one or more ‘libraries’.

Once initialised, access the data about the run via the SolidRun object’s properties:

  • run_dir: directory with the run data

  • run_name: name of the run e.g. solid0123_20130426_FRAG_BC

  • run_info: a SolidRunInfo object with data derived from the run name

  • run_definition: a SolidRunDefinition object with data extracted from the run_definition.txt file

  • samples: a list of SolidSample objects representing the samples in the run

class bcftbx.SolidData.SolidRunInfo(run_name)

Extract data about a run from the run name

Run names are of the form ‘solid0123_20130426_FRAG_BC_2’

This class analyses the name and breaks it down into components that can be accessed as object properties, specifically:

  • name: the supplied run name

  • instrument: the instrument name e.g. solid0123

  • datestamp: e.g. 20130426

  • is_fragment_library: True or False

  • is_barcoded_sample: True or False

  • flow_cell: 1 or 2

  • date: datestamp reformatted as DD/MM/YY

  • id: the run name without any flow cell identifier
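To make the breakdown concrete, here is a minimal sketch of how a run name might be split into these components. It is not the SolidRunInfo implementation: the function name parse_run_name is hypothetical, and the rules used (a trailing all-digit field is the flow cell, and the ‘FRAG’ and ‘BC’ tags flag fragment and barcoded runs) are assumptions inferred from the example name.

```python
import re


def parse_run_name(run_name):
    """Split a SOLiD run name into its components (illustrative sketch)."""
    fields = run_name.split("_")
    instrument, datestamp = fields[0], fields[1]
    # Assumption: a trailing all-digit field (after instrument and
    # datestamp) is the flow cell identifier; otherwise flow cell 1
    flow_cell = 1
    if len(fields) > 2 and re.fullmatch(r"\d+", fields[-1]):
        flow_cell = int(fields[-1])
        fields = fields[:-1]
    # Assumption: 'FRAG' and 'BC' tags mark fragment/barcoded runs
    tags = fields[2:]
    return {
        "name": run_name,
        "instrument": instrument,
        "datestamp": datestamp,
        "is_fragment_library": "FRAG" in tags,
        "is_barcoded_sample": "BC" in tags,
        "flow_cell": flow_cell,
        # Reformat YYYYMMDD as DD/MM/YY
        "date": "%s/%s/%s" % (datestamp[6:8], datestamp[4:6], datestamp[2:4]),
        # Run name with any flow cell identifier removed
        "id": "_".join(fields),
    }


info = parse_run_name("solid0123_20130426_FRAG_BC_2")
print(info["instrument"], info["date"], info["flow_cell"], info["id"])
```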

class bcftbx.SolidData.SolidRunDefinition(run_definition_file)

Class to store data from a SOLiD run definition file

Once the SolidRunDefinition object is populated from a run definition file, use the ‘nSamples’ method to find out how many ‘samples’ (actually sample/library pairs) are defined, and the ‘fields’ method to get a list of column headings for each.

Data can be extracted for each sample using the ‘getDataItem’ method to look up the value for a particular field on a particular line, e.g.:

>>> library = run_defn.getDataItem('library',0)
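The single lookup above extends naturally to a loop over every line. The sketch below combines ‘nSamples’ and ‘getDataItem’ to list all the library names; it assumes only the method names and call order shown in this section. The FakeRunDefinition class and the ‘sampleName’ column are hypothetical stand-ins so that the example runs without a real run definition file.

```python
def library_names(run_defn):
    """Return the 'library' field from every sample/library line."""
    return [run_defn.getDataItem('library', i)
            for i in range(run_defn.nSamples())]


class FakeRunDefinition:
    """Hypothetical stand-in mimicking the documented methods."""

    def __init__(self, rows):
        self._rows = rows

    def nSamples(self):
        # Number of sample/library lines
        return len(self._rows)

    def fields(self):
        # Column headings, taken from the first line
        return list(self._rows[0].keys()) if self._rows else []

    def getDataItem(self, field, i):
        # Value of 'field' on line 'i'
        return self._rows[i][field]


run_defn = FakeRunDefinition([
    {"sampleName": "AB_pool", "library": "DR1"},
    {"sampleName": "AB_pool", "library": "DR2"},
])
print(library_names(run_defn))
```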

The SolidRunDefinition object also has a number of attributes populated from the header of the run definition file, specifically:

version, userId, runType, isMultiplexing, runName, runDesc, mask and protocol.

The attributes are strings and can be accessed directly from the object, e.g.:

>>> version = run_defn.version
>>> isMultiplexing = run_defn.isMultiplexing

class bcftbx.SolidData.SolidBarcodeStatistics(barcode_statistics_file)

Store data from a SOLiD BarcodeStatistics file

class bcftbx.SolidData.SolidProject(name, run=None, sample=None)

Class to hold information about a SOLiD ‘project’

A SolidProject object holds a collection of libraries which together constitute a ‘project’.

The definition of a ‘project’ is quite loose in this context: essentially it’s a grouping of libraries within a sample. Typically the grouping is by the initial letters of the library name e.g. DR for DR1, EP for EP_NCYC2669 - but this determination is made at the application level.

Libraries are added to the project via the addLibrary method. Data about the project can be accessed via the following properties:

  • name: the project name (supplied on object creation)

  • libraries: a list of libraries in the project

Also has the following methods:

  • getSample(): returns the parent SolidSample

  • getRun(): returns the parent SolidRun

  • isBarcoded(): returns boolean indicating whether the libraries in the sample are barcoded

SolidSample

class bcftbx.SolidData.SolidSample(name, parent_run=None)

Store information about a sample in a SOLiD run.

A sample has a name and contains a set of libraries. The information about the sample can be accessed via the following properties:

  • name: the sample name

  • libraries: a list of SolidLibrary objects representing the libraries within the sample

  • projects: a list of SolidProject objects representing groups of related libraries within the sample

  • unassigned: SolidProject object representing the ‘unassigned’ data

  • barcode_stats: a SolidBarcodeStatistics object with data extracted from the BarcodeStatistics file (or None, if no file was available)

  • parent_run: the parent SolidRun object, or None.

The class also provides the following methods:

  • addLibrary: to create and append a SolidLibrary object

  • getLibrary: fetch an existing SolidLibrary

  • getProject: fetch an existing SolidProject

Typically the calling subprogram calls the ‘addLibrary’ method to add a SolidLibrary object, which it then populates itself.

The SolidSample class automatically creates SolidProject objects based on the library names, to group libraries considered to belong to the same experiment.
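The grouping of libraries into projects can be pictured with a short sketch. It assumes the rule mentioned for SolidProject (grouping on the initial letters of the library name, e.g. DR for DR1 and EP for EP_NCYC2669); the function name group_by_initials is hypothetical and the real class may apply a different or more elaborate rule.

```python
import re


def group_by_initials(library_names):
    """Group library names by their leading letters (assumed rule)."""
    groups = {}
    for name in library_names:
        # Leading run of letters, e.g. 'DR' from 'DR1'
        m = re.match(r"[A-Za-z]+", name)
        key = m.group(0) if m else name
        groups.setdefault(key, []).append(name)
    return groups


print(group_by_initials(["DR1", "DR2", "EP_NCYC2669"]))
```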

SolidLibrary

class bcftbx.SolidData.SolidLibrary(name, parent_sample=None)

Store information about a SOLiD library.

The following properties hold data about the library:

  • name: the library name

  • initials: the experimenter’s initials

  • prefix: the library name prefix (i.e. name without the trailing numbers)

  • index_as_string: the trailing numbers from the name, as a string (preserves any leading zeroes)

  • index: the trailing numbers from the name as an integer

  • csfasta: full path to the csfasta file for the library (F3 reads)

  • qual: full path to qual file for the library (F3 reads)

  • csfasta_f5: full path to the csfasta file for the F5 reads (paired-end runs only, otherwise None)

  • qual_f5: full path to the qual file for the F5 reads (paired-end runs only, otherwise None)

  • primary_data: list of SolidPrimaryData objects for all possible primary data file pairs associated with the library

  • parent_sample: parent SolidSample object, or None.

The following methods are also available:

  • addPrimaryData: creates a new SolidPrimaryData object and appends to the list in the primary_data property
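The name-derived properties (prefix, index_as_string and index) amount to splitting the library name at its trailing digits. The following is a minimal sketch of that split, not the class’s own code; the function name split_library_name is hypothetical, and returning None as the index for a name with no trailing digits is an assumption.

```python
import re


def split_library_name(name):
    """Return (prefix, index_as_string, index) for a library name."""
    # Lazy prefix followed by the longest run of trailing digits
    m = re.match(r"^(.*?)(\d*)$", name)
    prefix, digits = m.group(1), m.group(2)
    # Assumption: no trailing digits means no index
    index = int(digits) if digits else None
    return prefix, digits, index


print(split_library_name("DR01"))
print(split_library_name("EP_NCYC2669"))
```

Keeping the digits as a string alongside the integer preserves leading zeroes, so the original name can be rebuilt as prefix + index_as_string.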

SolidPrimaryData

class bcftbx.SolidData.SolidPrimaryData

Class to store references to primary data files

This is a convenience class for storing references to csfasta/qual file pairs within a SolidLibrary instance.

The class provides the following attributes:

  • csfasta: full path to csfasta file

  • qual: full path to qual file

  • timestamp: timestamp associated with the file pair

  • type: string indicating ‘F3’ or ‘F5’, or None

The following methods are provided:

  • is_f3: indicates if data is F3

  • is_f5: indicates if data is F5

Functions

bcftbx.SolidData.extract_library_timestamp(path)

Extract the timestamp string from a path

Given a path of the form ‘/path/to/data/…/primary.1234567/…’, return the timestamp string attached to the ‘primary.XXXXXXX’ component of the name.

Parameters:

path – absolute or relative path to arbitrary directory or file in the SOLiD data structure

Returns:

Timestamp string, or None if no timestamp was identified.
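One way to picture the extraction is a regular expression search for the ‘primary.’ path component. This is a sketch rather than the library’s code: the name timestamp_from_path is hypothetical, and it assumes that the timestamp consists of digits only and that path components are separated by ‘/’.

```python
import re


def timestamp_from_path(path):
    """Return the digits following 'primary.' in a path, or None."""
    # Match 'primary.<digits>' as a complete path component
    m = re.search(r"(?:^|/)primary\.(\d+)(?:/|$)", path)
    return m.group(1) if m else None


print(timestamp_from_path("/path/to/data/primary.1234567/reads"))
print(timestamp_from_path("/path/with/no/timestamp"))
```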

bcftbx.SolidData.get_primary_data_file_pair(dirn)

Return csfasta/qual file pair from specified directory

Parameters:

dirn – directory to search for csfasta/qual pair

Returns:

Tuple (csfasta,qual) with full path for each file, or (None,None) if a pair wasn’t located.
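The documented return convention (a pair of full paths, or (None,None) when no complete pair exists) can be sketched as follows. The function name find_csfasta_qual_pair is hypothetical, and matching purely on the ‘.csfasta’ and ‘.qual’ extensions is an assumption; the real function may apply stricter naming checks.

```python
import os
import tempfile


def find_csfasta_qual_pair(dirn):
    """Return (csfasta, qual) full paths from 'dirn', or (None, None)."""
    csfasta = qual = None
    for name in sorted(os.listdir(dirn)):
        path = os.path.join(os.path.abspath(dirn), name)
        # Keep the first file found with each extension
        if name.endswith(".csfasta") and csfasta is None:
            csfasta = path
        elif name.endswith(".qual") and qual is None:
            qual = path
    # Only report a result when both halves of the pair exist
    if csfasta is None or qual is None:
        return (None, None)
    return (csfasta, qual)


# Demonstration on a temporary directory with made-up file names
with tempfile.TemporaryDirectory() as d:
    for fn in ("demo_F3.csfasta", "demo_F3_QV.qual"):
        open(os.path.join(d, fn), "w").close()
    print(find_csfasta_qual_pair(d))
```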

bcftbx.SolidData.is_paired_end(solid_run)

Determine if a SolidRun instance is a paired-end run

Parameters:

solid_run – a populated SolidRun instance

Returns:

True if this is a paired-end run, False otherwise.
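A plausible way to make this determination, given the properties documented for SolidLibrary, is to check whether any library has F5 data. The sketch below does that; it is an assumption about the logic rather than the function’s actual implementation, and the name looks_paired_end and the stand-in objects are hypothetical.

```python
from types import SimpleNamespace


def looks_paired_end(solid_run):
    """Return True if any library in the run has F5 csfasta data."""
    return any(library.csfasta_f5 is not None
               for sample in solid_run.samples
               for library in sample.libraries)


# Hypothetical stand-ins for a fragment run and a paired-end run
frag_lib = SimpleNamespace(csfasta_f5=None)
pe_lib = SimpleNamespace(csfasta_f5="/data/DR1_F5.csfasta")
frag_run = SimpleNamespace(samples=[SimpleNamespace(libraries=[frag_lib])])
pe_run = SimpleNamespace(samples=[SimpleNamespace(libraries=[pe_lib])])

print(looks_paired_end(frag_run), looks_paired_end(pe_run))
```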

bcftbx.SolidData.match(pattern, word)

Check if a word matches pattern

Implements a very simple pattern matching algorithm, which allows only exact matches or glob-like strings (i.e. using a trailing ‘*’ to indicate a wildcard).

For example: ‘ABC*’ matches ‘ABC’, ‘ABC1’, ‘ABCDEFG’ etc, while ‘ABC’ only matches itself.

Parameters:

  • pattern – simple glob-like pattern

  • word – string to test against ‘pattern’

Returns:

True if ‘word’ is a match to ‘pattern’, False otherwise.
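The matching rule described above is small enough to sketch in a few lines. The function name simple_match is hypothetical, chosen to avoid implying this is the library’s own code; it implements only the two documented cases (exact match, or prefix match when the pattern ends in ‘*’).

```python
def simple_match(pattern, word):
    """Return True if 'word' matches 'pattern' (exact or trailing '*')."""
    if pattern.endswith("*"):
        # Trailing wildcard: compare everything before the '*'
        return word.startswith(pattern[:-1])
    # No wildcard: only an exact match counts
    return pattern == word


for candidate in ("ABC", "ABC1", "ABCDEFG", "AB"):
    print(candidate, simple_match("ABC*", candidate))
```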

bcftbx.SolidData.slide_layout(nsamples)

Description of the slide layout based on number of samples

Parameters:

nsamples – number of samples in the run

Returns:

A string describing the slide layout for the run based on the number of samples in the run, e.g. “Whole slide”, “Quads”, “Octets” etc. Returns None if the number of samples doesn’t map to a recognised layout.
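The behaviour can be pictured as a simple lookup table. In the sketch below the layout names come from the description above, but the sample counts attached to them (1, 4 and 8) are assumptions, as is the name describe_slide_layout; the real function may recognise other layouts as well.

```python
# Assumed mapping from number of samples to layout description
ASSUMED_LAYOUTS = {
    1: "Whole slide",
    4: "Quads",
    8: "Octets",
}


def describe_slide_layout(nsamples):
    """Return a layout description, or None for unrecognised counts."""
    return ASSUMED_LAYOUTS.get(nsamples)


print(describe_slide_layout(4))
print(describe_slide_layout(3))
```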