bcftbx.FASTQFile

A set of classes for reading through FASTQ files and manipulating the data within them:

  • FastqIterator: enables looping through all read records in FASTQ file
  • FastqRead: provides access to a single FASTQ read record
  • SequenceIdentifier: provides access to sequence identifier info in a read
  • FastqAttributes: provides access to gross attributes of FASTQ file

Additionally there are a few utility functions:

  • get_fastq_file_handle: return a file handle opened for reading a FASTQ file
  • nreads: return the number of reads in a FASTQ file
  • fastqs_are_pair: check whether two FASTQs form an R1/R2 pair

Information on the FASTQ file format: http://en.wikipedia.org/wiki/FASTQ_format

class bcftbx.FASTQFile.FastqAttributes(fastq_file=None, fp=None)

Class to provide access to gross attributes of a FASTQ file

Given a FASTQ file (can be uncompressed or gzipped), enables various attributes to be queried via the following properties:

  • nreads: number of reads in the FASTQ file
  • fsize: size of the file (in bytes)

fsize

Return size of the FASTQ file (bytes)

nreads

Return number of reads in the FASTQ file

class bcftbx.FASTQFile.FastqIterator(fastq_file=None, fp=None, bufsize=102400)

Class to loop over all records in a FASTQ file, returning a FastqRead object for each record.

Example looping over all reads:

>>> for read in FastqIterator(fastq_file):
...    print(read)

Input FASTQ can be in gzipped format; FASTQ data can also be supplied as a file-like object opened for reading, for example:

>>> fp = open(fastq_file,'r')
>>> for read in FastqIterator(fp=fp):
...    print(read)
>>> fp.close()

next()

Return next record from FASTQ file as a FastqRead object
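The record-by-record looping described above can be illustrated without the library. The following is a minimal sketch, not the FastqIterator implementation: iter_fastq_records is a hypothetical helper that assumes each record occupies exactly four lines and that the input is an open file-like object.

```python
import io
from itertools import islice


def iter_fastq_records(fp):
    """Yield (seqid, sequence, optid, quality) tuples from an open FASTQ file.

    Hypothetical helper for illustration only; not part of bcftbx.
    Assumes every record is exactly four lines long.
    """
    while True:
        # Take the next four lines, which together make up one record
        lines = [line.rstrip("\n") for line in islice(fp, 4)]
        if not lines:
            break  # clean end of file
        if len(lines) < 4:
            raise ValueError("truncated FASTQ record: %r" % (lines,))
        yield tuple(lines)


# Two made-up records held in memory instead of a file on disk
example = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nTTGA\n+\n!!!!\n"
)
records = list(iter_fastq_records(example))
for seqid, sequence, optid, quality in records:
    print(seqid, sequence, quality)
```

Yielding records one at a time keeps memory use flat for large files, which is the main reason to iterate rather than read the whole file.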

class bcftbx.FASTQFile.FastqRead(seqid_line=None, seq_line=None, optid_line=None, quality_line=None)

Class to store a FASTQ record with information about a read

Provides the following properties for accessing the read data:

  • seqid: the “sequence identifier” information (first line of the read record) as a SequenceIdentifier object
  • sequence: the raw sequence (second line of the record)
  • optid: the optional sequence identifier line (third line of the record)
  • quality: the quality values (fourth line of the record)

Additional properties:

  • raw_seqid: the original sequence identifier string supplied when the object was created
  • seqlen: length of the sequence
  • maxquality: maximum quality value (in character representation)
  • minquality: minimum quality value (in character representation)

(Note that quality scores can only be obtained from character representations once the encoding scheme is known)

  • is_colorspace: returns True if the read looks like a colorspace read, False otherwise

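To make the note about character representations concrete, here is a small sketch of how the length and quality extremes could be derived from the sequence and quality lines of one record. summarize_read is a hypothetical helper, not a bcftbx function, and the default offset of 33 (Sanger / Illumina 1.8+ encoding) is an assumption; older Illumina data uses an offset of 64.

```python
def summarize_read(sequence, quality, phred_offset=33):
    """Return length and quality extremes for one read.

    Hypothetical helper for illustration only; not part of bcftbx.
    phred_offset=33 is an assumed encoding and must match the data.
    """
    # min() and max() on a string compare characters by code point,
    # which matches the ordering of the encoded quality values
    min_char = min(quality)
    max_char = max(quality)
    return {
        "seqlen": len(sequence),
        "minquality": min_char,
        "maxquality": max_char,
        # Numeric scores only make sense once the offset is known
        "min_score": ord(min_char) - phred_offset,
        "max_score": ord(max_char) - phred_offset,
    }


summary = summarize_read("ACGT", "!5I#")
print(summary)
```

The character-level minimum and maximum are valid whatever the encoding; only the last two entries depend on the assumed offset.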
class bcftbx.FASTQFile.SequenceIdentifier(seqid)

Class to store/manipulate sequence identifier information from a FASTQ record

Provides access to the data items in the sequence identifier line of a FASTQ record.

format

Identify the format of the sequence identifier

Returns:
String: ‘illumina18’, ‘illumina’ or None

is_pair_of(seqid)

Check if this forms a pair with another SequenceIdentifier
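One plausible way to test whether two identifiers form a pair is sketched below. It is an illustration under stated assumptions, not the SequenceIdentifier implementation: it handles only identifiers in the Illumina 1.8+ layout ("@instrument:run:flowcell:lane:tile:x:y read:filter:control:index"), and it assumes that a pair means identical location fields with read numbers 1 and 2. The helper names and the example identifiers are made up.

```python
def split_illumina18(seqid):
    """Split an Illumina 1.8+ identifier into (location, read_number).

    Hypothetical helper for illustration only; not part of bcftbx.
    """
    # Everything before the first space locates the cluster on the flow cell
    location, _, flags = seqid.lstrip("@").partition(" ")
    # The first colon-separated flag is the read number ("1" or "2")
    read_number = flags.split(":")[0]
    return location, read_number


def ids_are_pair(seqid1, seqid2):
    """Return True if the two identifiers look like an R1/R2 pair."""
    loc1, num1 = split_illumina18(seqid1)
    loc2, num2 = split_illumina18(seqid2)
    # Assumption: same cluster location, one read numbered 1 and one 2
    return loc1 == loc2 and {num1, num2} == {"1", "2"}


def id_lists_are_pair(ids1, ids2):
    """Check identifiers position by position; unequal lengths never pair."""
    if len(ids1) != len(ids2):
        return False
    return all(ids_are_pair(a, b) for a, b in zip(ids1, ids2))


r1 = "@M00123:45:000000000-A1B2C:1:1101:15589:1332 1:N:0:1"
r2 = "@M00123:45:000000000-A1B2C:1:1101:15589:1332 2:N:0:1"
print(ids_are_pair(r1, r2))
print(id_lists_are_pair([r1, r1], [r2]))
```

The position-by-position comparison in id_lists_are_pair mirrors the idea of checking two FASTQ files read by read.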

bcftbx.FASTQFile.fastqs_are_pair(fastq1=None, fastq2=None, verbose=True, fp1=None, fp2=None)

Check that two FASTQs form an R1/R2 pair

Arguments:
  • fastq1: first FASTQ
  • fastq2: second FASTQ
Returns:
True if each read in fastq1 forms an R1/R2 pair with the equivalent read (i.e. in the same position) in fastq2; False if any reads do not form an R1/R2 pair, or if one file contains more reads than the other.

bcftbx.FASTQFile.get_fastq_file_handle(fastq)

Return a file handle opened for reading for a FASTQ file

Deals with both compressed (gzipped) and uncompressed FASTQ files.
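Handling both kinds of file usually comes down to choosing an opener based on the file extension. The sketch below shows that idea using only the Python 3 standard library; open_fastq is a hypothetical name, and this is not the code behind get_fastq_file_handle.

```python
import gzip
import os
import tempfile


def open_fastq(path):
    """Open a plain or gzipped FASTQ file for reading text.

    Hypothetical helper for illustration only; not part of bcftbx.
    Compression is inferred purely from a '.gz' extension.
    """
    if path.endswith(".gz"):
        # "rt" makes gzip decode to text, matching the plain-file case
        return gzip.open(path, "rt")
    return open(path, "r")


# Write the same made-up record to a plain and a gzipped file
tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, "example.fastq")
gz_path = plain_path + ".gz"
record = "@read1\nACGT\n+\nIIII\n"
with open(plain_path, "w") as out:
    out.write(record)
with gzip.open(gz_path, "wt") as out:
    out.write(record)

# Both files read back identically through the same helper
contents = []
for path in (plain_path, gz_path):
    with open_fastq(path) as fp:
        contents.append(fp.read())
print(contents[0] == contents[1])
```

Because the choice depends on the name alone, a gzipped file without the '.gz' extension would be opened as plain text in this sketch.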

Arguments:
fastq: name (including path, if required) of FASTQ file.
The file can be gzipped (must have ‘.gz’ extension)
Returns:
File handle that can be used for read operations.

bcftbx.FASTQFile.nreads(fastq=None, fp=None)

Return number of reads in a FASTQ file

Performs a simple-minded read count, by counting the number of lines in the file and dividing by 4.

The FASTQ file can be specified either as a file name (using the ‘fastq’ argument) or as a file-like object opened for line reading (using the ‘fp’ argument).

This function can handle gzipped FASTQ files supplied via the ‘fastq’ argument.

Line counting uses a variant of the “buf count” method outlined here: http://stackoverflow.com/a/850962/579925

Arguments:
  • fastq: fastq(.gz) file
  • fp: open file descriptor for fastq file
Returns:
Number of reads
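The general shape of a buffered line count followed by division by 4 is sketched below. This illustrates the approach only and is not the nreads implementation: count_reads is a hypothetical helper, it expects a file-like object opened in binary mode, and counting a final line that lacks a trailing newline is an assumption of this sketch.

```python
import io


def count_reads(fp, bufsize=1024 * 1024):
    """Count FASTQ reads by counting newlines in fixed-size chunks.

    Hypothetical helper for illustration only; not part of bcftbx.
    fp must be opened in binary mode.
    """
    nlines = 0
    last_byte = b""
    while True:
        buf = fp.read(bufsize)
        if not buf:
            break
        # Counting newline bytes avoids building one string per line
        nlines += buf.count(b"\n")
        last_byte = buf[-1:]
    # Assumption: a final line without a trailing newline still counts
    if last_byte and last_byte != b"\n":
        nlines += 1
    # Each record is assumed to span exactly four lines
    return nlines // 4


# Two made-up records; the last line has no trailing newline
data = b"@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\n!!!!"
n = count_reads(io.BytesIO(data))
print(n)
```

Like any line-count approach, this gives the wrong answer for FASTQ files whose sequence or quality strings are wrapped over several lines.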