bcftbx.TabFile

Classes for working with generic tab-delimited data.

The TabFile module provides a TabFile class, which represents a tab-delimited data file, and a TabDataLine class, which represents a line of data.

Creating a TabFile

TabFile objects can be initialised from existing files:

>>> data = TabFile('data.txt')

or an ‘empty’ TabFile can be created if no file name is specified.

Lines starting with ‘#’ are ignored.

Accessing Data within a TabFile

Within a TabFile object each line of data is represented by a TabDataLine object. Lines of data are referenced using index notation, with the first line of data being index zero:

>>> line = data[0]
>>> line = data[i]

Note that the index is not the same as the line number in the source file (if one was specified); the line number can be obtained from the ‘lineno’ method of each line:

>>> line_number = line.lineno()

len() gives the total number of lines of data in the TabFile object:

>>> len(data)

It is possible to iterate over the data lines in the object:

>>> for line in data:
...     pass  # do something with line

By default columns of data in the file are referenced by index notation, with the first column being index zero:

>>> line = data[0]
>>> value = line[0]

If column headers are specified then these can also be used to reference columns of data:

>>> data = TabFile('data.txt',column_names=['ex','why','zed'])
>>> line = data[0]
>>> ex = line['ex']
>>> line['why'] = 3.454

Headers can also be read from the first line of an input file:

>>> data = TabFile('data.txt',first_line_is_header=True)

A list of the column names can be fetched using the ‘header’ method:

>>> print(data.header())

Use the ‘str’ built-in to get the line as a tab-delimited string:

>>> str(line)

Adding and Removing Data

New lines can be added to the TabFile object via the ‘append’ and ‘insert’ methods:

>>> data.append()  # No data i.e. empty line
>>> data.append(data=[1,2,3]) # Provide data values as a list
>>> data.append(tabdata='1\t2\t3') # Provide values as tab-delimited string
>>> data.insert(1,data=[5,6,7]) # Inserts line of data at index 1

Type conversion is automatically performed when data values are assigned:

>>> line = data.append(data=['1',2,'3.4','pjb'])
>>> line[0]
1
>>> line[2]
3.4
>>> line[3]
'pjb'
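
The conversion shown above can be pictured with a minimal sketch in plain Python. It assumes an int-then-float-then-unchanged fallback order; ‘coerce’ is a hypothetical helper written for illustration and is not the library's own conversion code.

```python
def coerce(value):
    """Hypothetical stand-in: try int, then float, else return unchanged."""
    for cast in (int, float):
        try:
            return cast(value)
        except (TypeError, ValueError):
            continue
    return value

# Same inputs as the example above
values = [coerce(v) for v in ['1', 2, '3.4', 'pjb']]
print(values)
```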

Lines can also be removed using the ‘del’ built-in:

>>> del(data[0]) # Deletes first data line

New columns can be added using the ‘appendColumn’ method e.g.:

>>> data.appendColumn('new_col') # Creates a new empty column

Filtering Data

The ‘lookup’ method returns a set of data lines where a key matches a specific value:

>>> data = TabFile('data.txt',column_names=['chr','start','end'])
>>> chrom = data.lookup('chr','chrX')
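
As a rough picture of what a lookup does, the sketch below filters plain dictionaries rather than TabDataLine objects. The ‘lookup’ function and the sample rows here are illustrative assumptions, not the library's implementation.

```python
# Sample rows standing in for data lines (values are made up)
rows = [
    {"chr": "chr1", "start": 100, "end": 200},
    {"chr": "chrX", "start": 150, "end": 250},
    {"chr": "chrX", "start": 300, "end": 400},
]

def lookup(rows, key, value):
    """Return the rows whose `key` column equals `value` (illustrative only)."""
    return [row for row in rows if row[key] == value]

chrom = lookup(rows, "chr", "chrX")
print(len(chrom))
```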

Within a single data line the ‘subset’ method returns a list of values for a set of column indices or column names:

>>> data = TabFile(column_names=['chr','start','end','strand'])
>>> data.append(data=['chr1',123456,234567,'+'])
>>> data[0].subset('chr','start')
['chr1',123456]

Sorting Data

The ‘sort’ method offers a simple way of sorting the data lines within a TabFile. The simplest example is sorting on a specific column:

>>> data.sort(lambda line: line['start'])

See the method documentation for more detail on using the ‘sort’ method.

Manipulating Data: whole column operations

The ‘transformColumn’ and ‘computeColumn’ methods provide a way to update all the values in a column with a single method call. In each case the calling code must supply a function object which is used to update the values in a specific column.

The function supplied to ‘transformColumn’ must take a single argument which is the current value of the column in that line. For example: define a function to increment a supplied value by 1:

>>> def addOne(x):
...     return x+1

Then use this to add one to all values in the column ‘start’:

>>> data.transformColumn('start',addOne)

Alternatively a lambda can be used to avoid defining a new function:

>>> data.transformColumn('start',lambda x: x+1)

The function supplied to ‘computeColumn’ must take a single argument which is the current line (i.e. a TabDataLine object) and return a new value for the specified column. For example:

>>> def calculateMidpoint(line):
...     return (line['start'] + line['stop'])/2.0
>>> data.computeColumn('midpoint',calculateMidpoint)

Again a lambda expression can be used instead:

>>> data.computeColumn('midpoint',lambda line: (line['start'] + line['stop'])/2.0)
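
The difference between the two methods is the argument passed to the supplied function: a single value in one case, the whole line in the other. The sketch below shows this with plain dictionaries; ‘transform_column’ and ‘compute_column’ are hypothetical helpers, not the TabFile methods themselves.

```python
rows = [{"start": 10, "stop": 20}, {"start": 30, "stop": 50}]

def transform_column(rows, name, func):
    """Replace each value in column `name` with func(value)."""
    for row in rows:
        row[name] = func(row[name])

def compute_column(rows, name, func):
    """Store func(row) in column `name` for every row."""
    for row in rows:
        row[name] = func(row)

# The function sees only the current value of 'start'
transform_column(rows, "start", lambda x: x + 1)
# The function sees the whole row and can combine columns
compute_column(rows, "midpoint", lambda row: (row["start"] + row["stop"]) / 2.0)
print(rows)
```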

Writing to File

Use the TabFile’s ‘write’ method to output the content to a file:

>>> data.write('newfile.txt') # Writes all the data to newfile.txt

It’s also possible to reorder the columns before writing out using the ‘reorderColumns’ method.

Specifying Delimiters

It’s possible to use a different field delimiter than tabs, by explicitly specifying the value of the ‘delimiter’ argument when creating a new TabFile object, for example for a comma-delimited file:

>>> data = TabFile('data.txt',delimiter=',')

TabFileIterator: iterating through a tab-delimited file

The TabFileIterator provides a light-weight alternative to TabFile in situations where it is only necessary to iterate through each line in a tab-delimited file:

>>> for line in TabFileIterator(filen='data.tsv'):
...   print(line)

Each line is returned as a TabDataLine instance, so the methods available in that class can be used on the data.

class bcftbx.TabFile.TabDataLine(line=None, column_names=None, delimiter='\t', lineno=None, convert=True, allow_underscores_in_numeric_literals=False)

Class to store a line of data from a tab-delimited file

Values can be accessed by integer index or by column names (if set), e.g.

line = TabDataLine("1\t2\t3",('first','second','third'))

allows the 2nd column of data to be accessed either via line[1] or line['second'].

Values can also be changed, e.g.

line['second'] = new_value

Values are automatically converted to integer or float types as appropriate.

Subsets of data can be created using the ‘subset’ method.

Line numbers can also be set by the code that creates the line, and queried via the ‘lineno’ method.

It is possible to use a different field delimiter than tabs, by explicitly specifying the value of the ‘delimiter’ argument, e.g. for a comma-delimited line:

line = TabDataLine("1,2,3",delimiter=',')

Check if a line is empty:

if not line: print("Blank line")

append(*values)

Append values to the data line

Should only be used when creating new data lines.

appendColumn(key, value)

Append keyed values to the data line

This adds a new value along with a header name (i.e. key)

convert_to_str(value)

Convert value to string

convert_to_type(value)

Internal: convert a value to the correct type

Used to coerce input values into integers or floats if appropriate before storage in the TabDataLine object.

convert_to_type_pep515(value)

Internal: convert a value to the correct type

Used to coerce input values into integers or floats if appropriate before storage in the TabDataLine object.

The conversion honors PEP 515 so numerical values can also contain underscore characters.
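
For context, Python's built-in int() and float() accept underscores between digits in strings (PEP 515, Python 3.6 and later). The sketch below shows how a converter might opt in or out of that behaviour; ‘to_number’ and its ‘allow_underscores’ flag are hypothetical names used for illustration and do not reproduce the library's code.

```python
def to_number(value, allow_underscores=False):
    """Hypothetical helper: coerce a string to int or float if possible."""
    text = str(value)
    if "_" in text and not allow_underscores:
        # Leave e.g. '1_000' as text unless PEP 515 literals are wanted
        return value
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return value

strict = to_number("1_000")                            # stays a string
relaxed = to_number("1_000", allow_underscores=True)   # becomes an int
label = to_number("sample_1", allow_underscores=True)  # not numeric, unchanged
print(strict, relaxed, label)
```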

delimiter(new_delimiter=None)

Set and get the delimiter for the line

If ‘new_delimiter’ is not None then the field delimiter for the line will be updated to the supplied value. This affects how lines are represented via the __repr__ built-in.

Returns the current value of the delimiter.

lineno()

Return the line number associated with the line

NB The line number is set by the class or function which created the TabDataLine object, it is not guaranteed by the TabDataLine class itself.

subset(*keys)

Return a subset of data items

This method creates a new TabDataLine instance with a subset of data specified by the ‘keys’ argument, e.g.

new_line = line.subset(2,1)

returns an instance containing only the 3rd and 2nd data values, in that order (indices are zero-based).

To access the items in a subset using index notation, use the same keys as those specified when the subset was created. For example, for

s = line.subset("two","nine")

use s["two"] and s["nine"] to access the data; while for

s = line.subset(2,9)

use s[2] and s[9].

Parameters:

keys – one or more keys specifying columns to include in the subset. Keys can be column indices, column names, or a mixture, and the same column can be referenced multiple times.
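
The key handling described above (names, indices, or a mixture, with repeats allowed) can be sketched as follows. This hypothetical ‘subset’ function returns a plain list rather than a TabDataLine instance, and is only meant to show how mixed keys might be resolved.

```python
names = ["chr", "start", "end", "strand"]
values = ["chr1", 123456, 234567, "+"]

def subset(names, values, *keys):
    """Illustrative only: pick values by column name or integer index."""
    picked = []
    for key in keys:
        # Integer keys are positions; anything else is treated as a column name
        index = key if isinstance(key, int) else names.index(key)
        picked.append(values[index])
    return picked

by_name = subset(names, values, "chr", "start")
by_index = subset(names, values, 2, 1)
mixed = subset(names, values, "chr", 0, "strand")
print(by_name, by_index, mixed)
```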

class bcftbx.TabFile.TabFile(filen=None, fp=None, column_names=None, skip_first_line=False, first_line_is_header=False, tab_data_line=<class 'bcftbx.TabFile.TabDataLine'>, delimiter='\t', convert=True, allow_underscores_in_numeric_literals=False, keep_commented_lines=False)

Class to get data from a tab-delimited file

Loads data from the specified file into a data structure that can then be queried on a per line and per item basis.

Data lines are represented by data line objects which must be TabDataLine-like.

Example usage:

data = TabFile(myfile) # load initial data

print('%s' % len(data)) # report number of lines of data

print('%s' % data.header()) # report header (i.e. column names)

for line in data:
    ... # loop over lines of data

myline = data[0] # fetch first line of data

append(data=None, tabdata=None, tabdataline=None)

Create and append a new data line

Creates a new data line object and appends it to the end of the list of lines.

Optionally the ‘data’ or ‘tabdata’ arguments can specify data items which will be used to populate the new line; alternatively ‘tabdataline’ can provide a TabDataLine-based object to be appended.

If none of these are specified then a default blank TabDataLine-based object is created, appended and returned.

Parameters:
  • data – (optional) a list of data items

  • tabdata – (optional) a string of tab-delimited data items

  • tabdataline – (optional) a TabDataLine-based object

Returns:

Appended data line object.

appendColumn(name, fill_value='')

Append a new (empty) column

Parameters:
  • name – name for the new column

  • fill_value – optional, value to insert into all rows in the new column

computeColumn(column_name, compute_func)

Compute and store values in a new column

For each line of data the computation function will be invoked with the line as the sole argument, and the result will be stored in a new column with the specified name.

Parameters:
  • column_name – name or index of column to write computation result to

  • compute_func – callable object that will be invoked to perform the computation

filename()

Return the file name associated with the TabFile

header()

Return list of column names

If no column names were set then this will be an empty list.

indexByLineNumber(n)

Return index of a data line given the file line number

Given the line number n for a line in the original file, returns the index required to access the data for that line in the TabFile object.

If no matching line is found then raises an IndexError.
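
The distinction between file line numbers and data indices arises because some lines in the file (for example, commented lines) are not stored as data. The sketch below illustrates the idea in plain Python; it assumes line numbers start at 1, and ‘index_by_line_number’ is a hypothetical stand-in rather than the method itself.

```python
import io

text = "#chr\tstart\nchr1\t100\n# a comment\nchr2\t200\n"

# Keep (line number, fields) for each non-comment line
data = []
for lineno, raw in enumerate(io.StringIO(text), start=1):
    if raw.startswith("#"):
        continue
    data.append((lineno, raw.rstrip("\n").split("\t")))

def index_by_line_number(data, n):
    """Return the index of the stored line that came from file line n."""
    for index, (lineno, _fields) in enumerate(data):
        if lineno == n:
            return index
    raise IndexError("No data line for line number %d" % n)

print(index_by_line_number(data, 4))
```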

insert(i, data=None, tabdata=None, tabdataline=None)

Create and insert a new data line at a specified index

Creates a new data line object and inserts it into the list of lines at the specified index position ‘i’ (nb NOT a line number).

Optionally the ‘data’ or ‘tabdata’ arguments can specify data items which will be used to populate the new line; alternatively ‘tabdataline’ can provide a TabDataLine-based object to be inserted.

Parameters:
  • i – index position to insert the line at

  • data – (optional) a list of data items

  • tabdata – (optional) a string of tab-delimited data items

  • tabdataline – (optional) a TabDataLine-based object

Returns:

New inserted data line object.

lookup(key, value)

Return lines where the key matches the specified value

nColumns()

Return the number of columns in the file

If the file had a header then this will be the number of header columns; otherwise it will be the number of columns found in the first line of data.

reorderColumns(new_columns)

Rearrange the columns in the file

Parameters:

new_columns – list of column names or indices in the new order

Returns:

New TabFile object
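
The effect of a column reordering can be pictured with plain lists. This is a sketch of the idea only, using made-up data and column names; it does not create a TabFile object.

```python
header = ["chr", "start", "end"]
rows = [["chr1", 100, 200], ["chr2", 300, 400]]

new_columns = ["end", "chr", "start"]

# Translate the requested names into positions in the current layout
order = [header.index(name) for name in new_columns]
reordered = [[row[i] for i in order] for row in rows]
print(reordered)
```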

sort(sort_func, reverse=False)

Sort data using arbitrary function

Performs an in-place sort based on the supplied sort_func.

sort_func should be a function object which takes a data line object as input and returns a single numerical value; the data lines will be sorted in ascending order of these values (or descending order if reverse is set to True).

To sort on the value of a specific column use e.g.

>>> tabfile.sort(lambda line: line['col'])

Parameters:
  • sort_func – function object taking a data line object as input and returning a single numerical value

  • reverse – (optional) Boolean, either False (default) to sort in ascending order, or True to sort in descending order
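
The behaviour described is analogous to sorting a Python list with a key function. The sketch below uses plain dictionaries and the built-in list.sort(); whether TabFile uses the same mechanism internally is an assumption.

```python
rows = [{"col": 3}, {"col": 1}, {"col": 2}]

# Ascending order of the value returned by the key function
rows.sort(key=lambda line: line["col"])
ascending = [row["col"] for row in rows]

# Descending order, as with reverse=True
rows.sort(key=lambda line: line["col"], reverse=True)
descending = [row["col"] for row in rows]

print(ascending, descending)
```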

transformColumn(column_name, transform_func)

Apply arbitrary function to a column

For each line of data the transformation function will be invoked with the value of the named column, with the result being written back to that column (overwriting the existing value).

Parameters:
  • column_name – name of column to write transformation result to

  • transform_func – callable object that will be invoked to perform the transformation

transpose()

Transpose the contents of the file

Returns:

New TabFile object
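
As a general illustration of transposition, rows of a table become columns and vice versa. The sketch below uses plain lists and zip(); how the library treats column names during a transpose is not covered here.

```python
rows = [["a", 1, 2], ["b", 3, 4]]

# zip(*rows) groups the i-th item of every row together
transposed = [list(column) for column in zip(*rows)]
print(transposed)
```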

write(filen=None, fp=None, include_header=False, no_hash=False, delimiter=None)

Write the TabFile data to an output file

One of either the ‘filen’ or ‘fp’ arguments must be given, specifying the file name or stream to write the TabFile data to.

Parameters:
  • filen – (optional) name of file to write to; ignored if fp is also specified

  • fp – (optional) a file-like object opened for writing; used in preference to filen if set to a non-null value. Note that the calling program must close the stream in these cases.

  • include_header – (optional) if set to True, the first line will be a ‘header’ line

  • no_hash – (optional) if set to True and include_header is also True then don’t put a hash character ‘#’ at the start of the header line in the output file.

  • delimiter – (optional) delimiter to use when writing data values to file (defaults to the delimiter specified on input)
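
The interaction between ‘include_header’ and ‘no_hash’ can be sketched with a small stand-alone writer. ‘write_rows’ is a hypothetical function, and the exact header layout (a ‘#’ placed directly before the first column name) is an assumption made for the example.

```python
import io

def write_rows(fp, header, rows, include_header=False, no_hash=False,
               delimiter="\t"):
    """Illustrative writer: optional header line, then one line per row."""
    if include_header:
        prefix = "" if no_hash else "#"
        fp.write(prefix + delimiter.join(header) + "\n")
    for row in rows:
        fp.write(delimiter.join(str(value) for value in row) + "\n")

with_hash = io.StringIO()
write_rows(with_hash, ["chr", "start"], [["chr1", 100]], include_header=True)

without_hash = io.StringIO()
write_rows(without_hash, ["chr", "start"], [["chr1", 100]],
           include_header=True, no_hash=True, delimiter=",")

print(repr(with_hash.getvalue()))
print(repr(without_hash.getvalue()))
```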

class bcftbx.TabFile.TabFileIterator(filen=None, fp=None, column_names=None)

Iterate through lines in a tab-delimited file

Class to loop over all lines in a TSV file, returning a TabDataLine object for each record.