bcftbx.SolidData
Provides classes for extracting data about SOLiD runs from directory structure, data files and naming conventions.
Typical usage is to create a new SolidRun instance by pointing it at the top-level output directory produced by the sequencer:
>>> solid_run = SolidRun('/path/to/solid0123_20141225_FRAG_BC')
This will automatically attempt to collect the data about the run, which can then be accessed via other objects linked through the SolidRun object’s properties.
The most useful are:
SolidRun.run_info: a SolidRunInfo object which holds data extracted from the run name (e.g. instrument, datestamp etc)
SolidRun.samples: a list of SolidSample objects which hold data about each of the samples in the run.
Each sample in turn holds a list of libraries within that sample (SolidLibrary objects in ‘SolidSample.libraries’) and a list of projects (SolidProject objects in ‘SolidSample.projects’). The ‘getLibrary’ and ‘getProject’ methods also provide ways to look up specific libraries or projects.
Projects are groupings of libraries (based on library names) which are assumed to form a single experiment. The libraries within a project can be obtained via the ‘SolidProject.libraries’ property, or using the ‘getLibrary’ method.
Finally, SolidLibrary objects hold data about the location of the primary data files. The ‘SolidLibrary.csfasta’ and ‘SolidLibrary.qual’ properties hold the locations of the data for the F3 reads, while for paired-end runs the ‘SolidLibrary.csfasta_f5’ and ‘SolidLibrary.qual_f5’ properties point to the F5 reads.
(The ‘is_paired_end’ function can be used to test whether a SolidRun object holds data for a paired-end run.)
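As a rough illustration of how these objects link together, the sketch below walks a run-like object and collects the primary data files for every library. It relies only on the samples, libraries, name, csfasta, qual, csfasta_f5 and qual_f5 properties described above. The helper name list_primary_data is hypothetical and is not part of bcftbx.

```python
def list_primary_data(solid_run):
    """Collect (sample, library, read type, csfasta, qual) tuples for a run.

    'solid_run' is expected to behave like a populated SolidRun object.
    This helper is illustrative only and is not part of bcftbx.
    """
    files = []
    for sample in solid_run.samples:
        for library in sample.libraries:
            # F3 reads are always reported
            files.append((sample.name, library.name, "F3",
                          library.csfasta, library.qual))
            # F5 properties are None unless the run is paired-end
            if library.csfasta_f5 is not None:
                files.append((sample.name, library.name, "F5",
                              library.csfasta_f5, library.qual_f5))
    return files
```

With bcftbx installed, this could be called as list_primary_data(SolidRun('/path/to/solid0123_20141225_FRAG_BC')).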
SolidRun
- class bcftbx.SolidData.SolidRun(solid_run_dir)
Describe a SOLiD run.
The SolidRun class provides an interface to data about a SOLiD run. It analyses the SOLiD data directory to look for run definitions, statistics files and primary data files.
It uses the same terminology as the SETS interface and the data files produced by the SOLiD instrument, so a run contains ‘samples’ and each sample contains one or more ‘libraries’.
Once initialised, access the data about the run via the SolidRun object’s properties:
run_dir: directory with the run data
run_name: name of the run e.g. solid0123_20130426_FRAG_BC
run_info: a SolidRunInfo object with data derived from the run name
run_definition: a SolidRunDefinition object with data extracted from the run_definition.txt file
samples: a list of SolidSample objects representing the samples in the run
- class bcftbx.SolidData.SolidRunInfo(run_name)
Extract data about a run from the run name
Run names are of the form ‘solid0123_20130426_FRAG_BC_2’
This class analyses the name and breaks it down into components that can be accessed as object properties, specifically:
name: the supplied run name
instrument: the instrument name e.g. solid0123
datestamp: e.g. 20130426
is_fragment_library: True or False
is_barcoded_sample: True or False
flow_cell: 1 or 2
date: datestamp reformatted as DD/MM/YY
id: the run name without any flow cell identifier
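The breakdown of a run name into these components can be sketched in plain Python. This is not the SolidRunInfo implementation. It assumes that fields are separated by underscores, that a trailing numeric field is the flow cell identifier, and that the flow cell defaults to 1 when no identifier is present. The function name parse_run_name is hypothetical.

```python
from datetime import datetime

def parse_run_name(run_name):
    """Split a SOLiD run name into its components (illustrative sketch)."""
    fields = run_name.split("_")
    instrument, datestamp = fields[0], fields[1]
    flags = fields[2:]
    # Assumption: a trailing all-digit field identifies the flow cell
    flow_cell = 1
    if flags and flags[-1].isdigit():
        flow_cell = int(flags.pop())
    return {
        "name": run_name,
        "instrument": instrument,
        "datestamp": datestamp,
        "is_fragment_library": "FRAG" in flags,
        "is_barcoded_sample": "BC" in flags,
        "flow_cell": flow_cell,
        # Reformat YYYYMMDD as DD/MM/YY
        "date": datetime.strptime(datestamp, "%Y%m%d").strftime("%d/%m/%y"),
        # Run name with the flow cell identifier removed
        "id": "_".join([instrument, datestamp] + flags),
    }
```

For the example name ‘solid0123_20130426_FRAG_BC_2’ this sketch gives a flow cell of 2, a date of 26/04/13 and an id of ‘solid0123_20130426_FRAG_BC’.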
- class bcftbx.SolidData.SolidRunDefinition(run_definition_file)
Class to store data from a SOLiD run definition file
Once the SolidRunDefinition object is populated from a run definition file, use the ‘nSamples’ method to find out how many ‘samples’ (actually sample/library pairs) are defined, and the ‘fields’ method to get a list of column headings for each.
Data can be extracted for each sample using the ‘getDataItem’ method to look up the value for a particular field on a particular line, e.g.:
>>> library = run_defn.getDataItem('library',0)
The SolidRunDefinition object also has a number of attributes populated from the header of the run definition file, specifically:
version, userId, runType, isMultiplexing, runName, runDesc, mask and protocol.
The attributes are strings and can be accessed directly from the object, e.g.:
>>> version = run_defn.version
>>> isMultiplexing = run_defn.isMultiplexing
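The ‘nSamples’ and ‘getDataItem’ methods can be combined to loop over every line of a run definition. The sketch below uses only those two methods and the ‘library’ field shown above. The helper name libraries_in_run_definition is hypothetical and is not part of bcftbx.

```python
def libraries_in_run_definition(run_defn):
    """Return the 'library' value for each sample/library pair.

    'run_defn' is expected to behave like a populated
    SolidRunDefinition object (illustrative sketch only).
    """
    return [run_defn.getDataItem('library', i)
            for i in range(run_defn.nSamples())]
```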
- class bcftbx.SolidData.SolidBarcodeStatistics(barcode_statistics_file)
Store data from a SOLiD BarcodeStatistics file
- class bcftbx.SolidData.SolidProject(name, run=None, sample=None)
Class to hold information about a SOLiD ‘project’
A SolidProject object holds a collection of libraries which together constitute a ‘project’.
The definition of a ‘project’ is quite loose in this context: essentially it’s a grouping of libraries within a sample. Typically the grouping is by the initial letters of the library name e.g. DR for DR1, EP for EP_NCYC2669 - but this determination is made at the application level.
Libraries are added to the project via the addLibrary method. Data about the project can be accessed via the following properties:
name: the project name (supplied on object creation)
libraries: a list of libraries in the project
Also has the following methods:
getSample(): returns the parent SolidSample
getRun(): returns the parent SolidRun
isBarcoded(): returns boolean indicating whether the libraries in the sample are barcoded
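Grouping libraries by the initial letters of their names, as in the DR1 and EP_NCYC2669 examples above, might look like the following sketch. It is one possible application-level rule, not the rule bcftbx itself applies. It assumes the project name is the leading run of ASCII letters in the library name, and the function name group_libraries_by_prefix is hypothetical.

```python
import re
from collections import defaultdict

def group_libraries_by_prefix(library_names):
    """Group library names by their leading letters (illustrative sketch)."""
    groups = defaultdict(list)
    for name in library_names:
        # Assumption: the leading letters identify the project
        m = re.match(r"[A-Za-z]+", name)
        key = m.group(0) if m else name
        groups[key].append(name)
    return dict(groups)
```

Under these assumptions, DR1 and DR2 would be grouped under ‘DR’ and EP_NCYC2669 under ‘EP’.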
SolidSample
- class bcftbx.SolidData.SolidSample(name, parent_run=None)
Store information about a sample in a SOLiD run.
A sample has a name and contains a set of libraries. The information about the sample can be accessed via the following properties:
name: the sample name
libraries: a list of SolidLibrary objects representing the libraries within the sample
projects: a list of SolidProject objects representing groups of related libraries within the sample
unassigned: SolidProject object representing the ‘unassigned’ data
barcode_stats: a SolidBarcodeStatistics object with data extracted from the BarcodeStatistics file (or None, if no file was available)
parent_run: the parent SolidRun object, or None.
The class also provides the following methods:
addLibrary: to create and append a SolidLibrary object
getLibrary: fetch an existing SolidLibrary
getProject: fetch an existing SolidProject
Typically the calling subprogram calls the ‘addLibrary’ method to add a SolidLibrary object, which it then populates itself.
The SolidSample class automatically creates SolidProject objects based on the library names, to group libraries considered to belong to the same experiment.
SolidLibrary
- class bcftbx.SolidData.SolidLibrary(name, parent_sample=None)
Store information about a SOLiD library.
The following properties hold data about the library:
name: the library name
initials: the experimenter’s initials
prefix: the library name prefix (i.e. name without the trailing numbers)
index_as_string: the trailing numbers from the name, as a string (preserves any leading zeroes)
index: the trailing numbers from the name as an integer
csfasta: full path to the csfasta file for the library (F3 reads)
qual: full path to qual file for the library (F3 reads)
csfasta_f5: full path to the csfasta file for the F5 reads (paired-end runs, otherwise will be None)
qual_f5: full path to the qual file for the F5 reads (paired-end runs, otherwise will be None)
primary_data: list of SolidPrimaryData objects for all possible primary data file pairs associated with the library
parent_sample: parent SolidSample object, or None.
The following methods are also available:
addPrimaryData: creates a new SolidPrimaryData object and appends to the list in the primary_data property
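The relationship between the name, prefix, index_as_string and index properties can be sketched with a regular expression. This is not the SolidLibrary implementation. The function name split_library_name is hypothetical, and the behaviour for names without trailing numbers (empty string and None) is an assumption.

```python
import re

def split_library_name(name):
    """Split a library name into (prefix, index_as_string, index).

    Illustrative sketch only; not part of bcftbx.
    """
    # Lazily match the prefix, then any trailing run of digits
    m = re.match(r"^(.*?)(\d*)$", name)
    prefix, index_as_string = m.group(1), m.group(2)
    # Assumption: no trailing digits means there is no index
    index = int(index_as_string) if index_as_string else None
    return prefix, index_as_string, index
```

For example, ‘DR01’ would give the prefix ‘DR’, the string ‘01’ (leading zero preserved) and the integer 1.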
SolidPrimaryData
- class bcftbx.SolidData.SolidPrimaryData
Class to store references to primary data files
This is a convenience class for storing references to csfasta/qual file pairs within a SolidLibrary instance.
The class provides the following attributes:
csfasta: full path to csfasta file
qual: full path to qual file
timestamp: timestamp associated with the file pair
type: string indicating ‘F3’ or ‘F5’, or None
The following methods are provided:
is_f3: indicates if data is F3
is_f5: indicates if data is F5
Functions
- bcftbx.SolidData.extract_library_timestamp(path)
Extract the timestamp string from a path
Given a path of the form ‘/path/to/data/…/primary.1234567/…’, return the timestamp string attached to the ‘primary.XXXXXXX’ component of the name.
- Parameters:
path – absolute or relative path to arbitrary directory or file in the SOLiD data structure
- Returns:
Timestamp string, or None if no timestamp was identified.
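The idea can be sketched as follows. This is not the bcftbx implementation. It assumes '/'-separated paths and takes the timestamp to be whatever follows ‘primary.’ in the first matching path component. The function name timestamp_from_path is hypothetical.

```python
def timestamp_from_path(path):
    """Return the timestamp from a 'primary.XXXXXXX' path component.

    Illustrative sketch only; returns None if no such component exists.
    """
    marker = "primary."
    for component in path.split("/"):
        if component.startswith(marker):
            # Everything after 'primary.' is treated as the timestamp
            return component[len(marker):] or None
    return None
```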
- bcftbx.SolidData.get_primary_data_file_pair(dirn)
Return csfasta/qual file pair from specified directory
- Parameters:
dirn – directory to search for csfasta/qual pair
- Returns:
Tuple (csfasta,qual) with full path for each file, or (None,None) if a pair wasn’t located.
- bcftbx.SolidData.is_paired_end(solid_run)
Determine if a SolidRun instance is a paired-end run
- Parameters:
solid_run – a populated SolidRun instance
- Returns:
True if this is a paired-end run, False otherwise.
- bcftbx.SolidData.match(pattern, word)
Check if a word matches a pattern
Implements a very simple pattern matching algorithm, which allows only exact matches or glob-like strings (i.e. using a trailing ‘*’ to indicate a wildcard).
For example: ‘ABC*’ matches ‘ABC’, ‘ABC1’, ‘ABCDEFG’ etc, while ‘ABC’ only matches itself.
- Parameters:
pattern – simple glob-like pattern
word – string to test against ‘pattern’
- Returns:
True if ‘word’ is a match to ‘pattern’, False otherwise.
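The matching rule described above is small enough to sketch directly. This is a minimal reimplementation for illustration, not the bcftbx source, and the name simple_match is hypothetical.

```python
def simple_match(pattern, word):
    """Return True if 'word' matches 'pattern' (illustrative sketch).

    A trailing '*' in the pattern acts as a wildcard; otherwise the
    match must be exact.
    """
    if pattern.endswith("*"):
        # Compare against everything before the trailing '*'
        return word.startswith(pattern[:-1])
    return pattern == word
```

With this sketch, simple_match('ABC*', 'ABC1') is True while simple_match('ABC', 'ABC1') is False.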
- bcftbx.SolidData.slide_layout(nsamples)
Description of the slide layout based on number of samples
- Parameters:
nsamples – number of samples in the run
- Returns:
A string describing the slide layout for the run based on the number of samples in the run, e.g. “Whole slide”, “Quads”, “Octets” etc. Returns None if the number of samples doesn’t map to a recognised layout.