bcftbx.Pipeline

Classes for running scripts iteratively over a collection of data files.

The essential classes are:

  • Job: wrapper for setting up, submitting and monitoring running scripts

  • PipelineRunner: queue and run script multiple times on standard set of inputs

  • SolidPipelineRunner: subclass of PipelineRunner specifically for running on SOLiD data (i.e. pairs of csfasta/qual files)

There are also some useful functions:

  • GetSolidDataFiles: collect csfasta/qual file pairs from a specific directory

  • GetSolidPairedEndFiles: collect csfasta/qual file pairs for paired end data

  • GetFastqFiles: collect fastq files from a specific directory

  • GetFastqGzFiles: collect gzipped fastq files

The PipelineRunner classes depend on JobRunner instances (created from classes in the JobRunner module) to interface with the job management system. Typical usage might look like:

>>> from bcftbx import JobRunner
>>> from bcftbx import Pipeline
>>> runner = JobRunner.GEJobRunner() # to use Grid Engine
>>> pipeline = Pipeline.PipelineRunner(runner)
>>> pipeline.queueJob(...)
>>> pipeline.run()

Classes

class bcftbx.Pipeline.Job(runner, name, dirn, script, args, label=None, group=None)

Wrapper class for setting up, submitting and monitoring running scripts

Set up a job by creating a Job instance specifying the name, working directory, script file to execute, and arguments to be supplied to the script.

The job is started by invoking the ‘start’ method. Its status can be checked with the ‘isRunning’ method, and it can be terminated and restarted using the ‘terminate’ and ‘restart’ methods respectively.

Information about the job can also be accessed via its properties. The following properties record the original parameters supplied on instantiation:

  • name

  • working_dir

  • script

  • args

  • label

  • group_label

Additional information is set once the job has started or stopped running:

  • job_id – The id number for the running job returned by the JobRunner

  • log – The log file for the job (relative to working_dir)

  • err – The error log file for the job

  • start_time – The start time (seconds since the epoch)

  • end_time – The end time (seconds since the epoch)

  • exit_status – The exit code from the command that was run (integer, or None)

The Job class uses a JobRunner instance (which supplies the necessary methods for starting, stopping and monitoring) for low-level job interactions.
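As a small illustration of the timing properties listed above, the sketch below computes how long a finished job ran from start_time and end_time. FinishedJob is a hypothetical stand-in so that the snippet runs without a job management system; it is not part of bcftbx, and only the attribute names are taken from the description above. Treating an exit_status of None as "unknown" is likewise an assumption.

```python
from collections import namedtuple

# Hypothetical stand-in carrying the same attribute names as a finished Job
FinishedJob = namedtuple("FinishedJob", "name start_time end_time exit_status")

def summarise(job):
    """Return a one-line summary of a finished job."""
    # start_time and end_time are both seconds since the epoch
    elapsed = job.end_time - job.start_time
    if job.exit_status is None:
        state = "unknown"  # assumption: None means no exit code was recorded
    elif job.exit_status == 0:
        state = "ok"
    else:
        state = "failed"
    return "%s: %s after %.1fs" % (job.name, state, elapsed)

job = FinishedJob("qc", start_time=1000.0, end_time=1042.5, exit_status=0)
print(summarise(job))
```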

class bcftbx.Pipeline.PipelineRunner(runner, max_concurrent_jobs=4, poll_interval=30, jobCompletionHandler=None, groupCompletionHandler=None)

Class to run and manage multiple concurrent jobs.

PipelineRunner enables multiple jobs to be queued via the ‘queueJob’ method. The pipeline is then started using the ‘run’ method; this starts jobs up to a specified maximum number of concurrent jobs, and then monitors their progress. As jobs finish, pending jobs are started until all jobs have completed.

Example usage:

>>> p = PipelineRunner(runner) # runner is a JobRunner instance
>>> p.queueJob('/home/foo','foo.sh','bar.in')
... Queue more jobs ...
>>> p.run()

By default the pipeline runs in ‘blocking’ mode, i.e. ‘run’ doesn’t return until all jobs have been submitted and have completed; see the ‘run’ method for details of how to operate the pipeline in non-blocking mode.

The invoking subprogram can also specify functions that will be called when a job completes (‘jobCompletionHandler’), and when a group completes (‘groupCompletionHandler’). These can perform any specific actions that are required such as sending notification email, setting file ownerships and permissions etc.
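The queue-and-poll behaviour and the job completion callback described above can be sketched as follows. This is a self-contained illustration of the general technique, not the bcftbx implementation: FakeJob, run_all and on_job_complete are hypothetical names, the callback signature (a single job argument) is an assumption, and a real loop would sleep for the poll interval between status checks.

```python
from collections import deque

class FakeJob:
    """Hypothetical stand-in for a job: 'runs' for a fixed number of polls."""
    def __init__(self, name, polls_needed):
        self.name = name
        self.polls_left = polls_needed
        self.started = False

    def start(self):
        self.started = True

    def isRunning(self):
        # Each status check counts as one poll of simulated progress
        self.polls_left -= 1
        return self.polls_left > 0

def run_all(jobs, max_concurrent_jobs=4, on_job_complete=None):
    """Start queued jobs up to the limit, then refill as jobs finish."""
    pending = deque(jobs)
    running = []
    finished = []
    peak = 0  # largest number of jobs running at once
    while pending or running:
        # Top up the running set from the queue
        while pending and len(running) < max_concurrent_jobs:
            job = pending.popleft()
            job.start()
            running.append(job)
        peak = max(peak, len(running))
        # Poll: keep jobs that are still running, collect the finished ones
        still_running = []
        for job in running:
            if job.isRunning():
                still_running.append(job)
            else:
                finished.append(job.name)
                if on_job_complete is not None:
                    on_job_complete(job)
        running = still_running
    return finished, peak

completed = []
jobs = [FakeJob("job%d" % i, polls_needed=i % 3 + 1) for i in range(6)]
finished, peak = run_all(jobs, max_concurrent_jobs=2,
                         on_job_complete=lambda job: completed.append(job.name))
print(finished, peak)
```

The limit is enforced only when topping up the running set, so a slot freed during one poll is refilled at the start of the next pass.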

class bcftbx.Pipeline.SolidPipelineRunner(runner, script, max_concurrent_jobs=4, poll_interval=30)

Class to run and manage multiple jobs for SOLiD data pipelines

Subclass of PipelineRunner specifically for dealing with scripts that take SOLiD data (i.e. csfasta/qual file pairs).

Defines the addDir method in addition to all methods already defined in the base class; use this method one or more times to specify directories with data to run the script on. The SOLiD data file pairs in each specified directory will be located automatically.

For example:

>>> solid_pipeline = SolidPipelineRunner(runner, 'qc.sh') # runner is a JobRunner instance
>>> solid_pipeline.addDir('/path/to/datadir')
>>> solid_pipeline.run()

Functions

bcftbx.Pipeline.GetSolidDataFiles(dirn, pattern=None, file_list=None)

Return list of csfasta/qual file pairs in target directory

Note that files with names ending in ‘_T_F3’ will be rejected as these are assumed to come from the preprocess filtering stage.

Optionally specify a regular expression pattern that file names must also match in order to be included.

Parameters:
  • dirn – name/path of directory to look for files in

  • pattern – optional, regular expression pattern to filter names with

  • file_list – optional, a list of file names to use instead of fetching a list of files from the specified directory

Returns:

List of tuples, each consisting of one csfasta-qual file pair.
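The selection rules described for GetSolidDataFiles can be sketched on a plain list of file names. This is an illustration under stated assumptions, not the library's code: pair_csfasta_qual is a hypothetical helper, the ‘_QV.qual’ suffix for the matching qual file is an assumed naming convention, and the pattern is applied with re.search (the actual function may match differently).

```python
import os
import re

def pair_csfasta_qual(file_names, pattern=None):
    """Pair each csfasta file with its qual file (hypothetical helper)."""
    names = set(file_names)
    regex = re.compile(pattern) if pattern else None
    pairs = []
    for name in sorted(names):
        root, ext = os.path.splitext(name)
        if ext != ".csfasta":
            continue
        if root.endswith("_T_F3"):
            continue  # assumed to be output from the preprocess filtering stage
        if regex is not None and not regex.search(name):
            continue
        qual = root + "_QV.qual"  # assumed naming convention for qual files
        if qual in names:
            pairs.append((name, qual))
    return pairs

files = ["a_F3.csfasta", "a_F3_QV.qual",
         "a_T_F3.csfasta", "a_T_F3_QV.qual",
         "b_F3.csfasta", "b_F3_QV.qual", "notes.txt"]
print(pair_csfasta_qual(files))
print(pair_csfasta_qual(files, pattern="^b"))
```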

bcftbx.Pipeline.GetSolidPairedEndFiles(dirn, pattern=None, file_list=None)

Return list of csfasta/qual file pairs for paired end data

Optionally specify a regular expression pattern that file names must also match in order to be included.

Parameters:
  • dirn – name/path of directory to look for files in

  • pattern – optional, regular expression pattern to filter names with

  • file_list – optional, a list of file names to use instead of fetching a list of files from the specified directory

Returns:

List of tuples, each consisting of two csfasta-qual file pairs (F3 and F5).

bcftbx.Pipeline.GetFastqFiles(dirn, pattern=None, file_list=None)

Return list of fastq files in target directory

Optionally specify a regular expression pattern that file names must also match in order to be included.

Parameters:
  • dirn – name/path of directory to look for files in

  • pattern – optional, regular expression pattern to filter names with

  • file_list – optional, a list of file names to use instead of fetching a list of files from the specified directory

Returns:

List of file-pair tuples.

bcftbx.Pipeline.GetFastqGzFiles(dirn, pattern=None, file_list=None)

Return list of fastq.gz files in target directory

Optionally specify a regular expression pattern that file names must also match in order to be included.

Parameters:
  • dirn – name/path of directory to look for files in

  • pattern – optional, regular expression pattern to filter names with

  • file_list – optional, a list of file names to use instead of fetching a list of files from the specified directory

Returns:

List of file-pair tuples.
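The extension and pattern filtering shared by GetFastqFiles and GetFastqGzFiles can be sketched on a plain list of names. filter_fastqs and its extensions parameter are hypothetical, the pattern is applied with re.search as an assumption, and for simplicity the sketch returns bare file names rather than the tuples the library functions return.

```python
import re

def filter_fastqs(file_names, extensions=(".fastq",), pattern=None):
    """Select names with a given extension, optionally matching a regex."""
    regex = re.compile(pattern) if pattern else None
    selected = []
    for name in sorted(file_names):
        if not name.endswith(tuple(extensions)):
            continue
        if regex is not None and not regex.search(name):
            continue
        selected.append(name)
    return selected

files = ["s1_R1.fastq", "s1_R2.fastq", "s2_R1.fastq.gz", "readme.txt"]
print(filter_fastqs(files))                              # plain fastq files
print(filter_fastqs(files, extensions=(".fastq.gz",)))   # gzipped fastq files
print(filter_fastqs(files, pattern="_R1"))               # fastq files matching a pattern
```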