bcftbx.TabFile
Classes for working with generic tab-delimited data.
The TabFile module provides a TabFile class, which represents a tab-delimited data file, and a TabDataLine class, which represents a line of data.
Creating a TabFile
TabFile objects can be initialised from existing files:
>>> data = TabFile('data.txt')
or an ‘empty’ TabFile can be created if no file name is specified.
Lines starting with ‘#’ are ignored.
Accessing Data within a TabFile
Within a TabFile object each line of data is represented by a TabDataLine object. Lines of data are referenced using index notation, with the first line of data being index zero:
>>> line = data[0]
>>> line = data[i]
Note that the index is not the same as the line number in the source file (if one was specified); the line number can be obtained from the ‘lineno’ method of each line:
>>> line_number = line.lineno()
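The difference between index and line number can be illustrated without TabFile itself. The following is a minimal plain-Python sketch, not bcftbx code: `read_data_lines` is a hypothetical helper, and it assumes that line numbers count from 1 and that comment lines are simply skipped.

```python
# Plain-Python stand-in (not bcftbx code): pair each data line with the
# line number it had in the source text, skipping '#' comment lines.
def read_data_lines(text, delimiter="\t"):
    records = []
    # Assumption: source line numbers start at 1
    for lineno, raw in enumerate(text.splitlines(), start=1):
        if raw.startswith("#"):
            continue  # comment lines are not stored as data
        records.append((lineno, raw.split(delimiter)))
    return records

sample = "#chr\tstart\nchr1\t100\n#a comment\nchr2\t200\n"
records = read_data_lines(sample)

# The second data line has index 1, but it came from line 4 of the source
index = 1
lineno, values = records[index]
```

Because skipped lines still count towards the source line number, the two values drift apart as soon as the file contains a comment or header line.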
len() gives the total number of lines of data in the TabFile object:
>>> len(data)
It is possible to iterate over the data lines in the object:
>>> for line in data:
...     print(line) # do something with line
By default columns of data in the file are referenced by index notation, with the first column being index zero:
>>> line = data[0]
>>> value = line[0]
If column headers are specified then these can also be used to reference columns of data:
>>> data = TabFile('data.txt',column_names=['ex','why','zed'])
>>> line = data[0]
>>> ex = line['ex']
>>> line['why'] = 3.454
Headers can also be read from the first line of an input file:
>>> data = TabFile('data.txt',first_line_is_header=True)
A list of the column names can be fetched using the ‘header’ method:
>>> print(data.header())
Use the ‘str’ built-in to get the line as a tab-delimited string:
>>> str(line)
Adding and Removing Data
New lines can be added to the TabFile object via the ‘append’ and ‘insert’ methods:
>>> data.append() # No data i.e. empty line
>>> data.append(data=[1,2,3]) # Provide data values as a list
>>> data.append(tabdata='1\t2\t3') # Provide values as tab-delimited string
>>> data.insert(1,data=[5,6,7]) # Inserts line of data at index 1
Type conversion is automatically performed when data values are assigned:
>>> line = data.append(data=['1',2,'3.4','pjb'])
>>> line[0]
1
>>> line[2]
3.4
>>> line[3]
'pjb'
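One way to picture this conversion is as a "try int, then float, else leave alone" rule. The sketch below is an assumption about the general approach, not the library's implementation; `coerce` is a hypothetical helper name.

```python
def coerce(value):
    """Hypothetical helper: return an int or float if the text parses as one."""
    if not isinstance(value, str):
        return value  # already a non-string value, leave untouched
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass  # not parseable with this type, try the next one
    return value  # fall back to the original string

converted = [coerce(v) for v in ['1', 2, '3.4', 'pjb']]
```

Python's `float()` also accepts strings such as 'nan' and 'inf'; this documentation does not say whether TabFile treats those as numbers.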
Lines can also be removed using the ‘del’ built-in:
>>> del(data[0]) # Deletes first data line
New columns can be added using the ‘appendColumn’ method e.g.:
>>> data.appendColumn('new_col') # Creates a new empty column
Filtering Data
The ‘lookup’ method returns a set of data lines where a key matches a specific value:
>>> data = TabFile('data.txt',column_names=['chr','start','end'])
>>> chrom = data.lookup('chr','chrX')
Within a single data line the ‘subset’ method returns a list of values for a set of column indices or column names:
>>> data = TabFile(column_names=['chr','start','end','strand'])
>>> data.append(data=['chr1',123456,234567,'+'])
>>> data[0].subset('chr','start')
['chr1',123456]
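The filtering operations above can be mimicked with plain dictionaries to make their semantics concrete. This is only an illustrative sketch: the `lookup` and `subset` functions here are hypothetical stand-ins that operate on a list of dicts, and values are returned as plain lists purely for illustration.

```python
# Stand-in data: each dict plays the role of one data line
rows = [
    {'chr': 'chr1', 'start': 123456, 'end': 234567, 'strand': '+'},
    {'chr': 'chrX', 'start': 1000, 'end': 2000, 'strand': '-'},
    {'chr': 'chrX', 'start': 5000, 'end': 6000, 'strand': '+'},
]

def lookup(rows, key, value):
    """Return every row whose 'key' column equals 'value'."""
    return [row for row in rows if row[key] == value]

def subset(row, *keys):
    """Return the values of the requested columns, in the order given."""
    return [row[k] for k in keys]

chrx = lookup(rows, 'chr', 'chrX')
first = subset(rows[0], 'chr', 'start')
```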
Sorting Data
The ‘sort’ method offers a simple way of sorting the data lines within a TabFile. The simplest example is sorting on a specific column:
>>> data.sort(lambda line: line['start'])
See the method documentation for more detail on using the ‘sort’ method.
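The key-function style of sorting mirrors Python's own `list.sort`. As a rough analogy only (a list of dicts stands in for the TabFile data, and nothing here calls bcftbx), ascending and descending sorts look like this:

```python
# Stand-in data: each dict plays the role of one data line
rows = [
    {'chr': 'chr1', 'start': 300},
    {'chr': 'chr2', 'start': 100},
    {'chr': 'chr3', 'start': 200},
]

# In-place sort, ascending order of the value returned by the key function
rows.sort(key=lambda line: line['start'])
ascending = [row['start'] for row in rows]

# Same key function, descending order
rows.sort(key=lambda line: line['start'], reverse=True)
descending = [row['start'] for row in rows]
```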
Manipulating Data: whole column operations
The ‘transformColumn’ and ‘computeColumn’ methods provide a way to update all the values in a column with a single method call. In each case the calling subprogram must supply a function object which is used to update the values in a specific column.
The function supplied to ‘transformColumn’ must take a single argument which is the current value of the column in that line. For example: define a function to increment a supplied value by 1:
>>> def addOne(x):
...     return x+1
Then use this to add one to all values in the column ‘start’:
>>> data.transformColumn('start',addOne)
Alternatively a lambda can be used to avoid defining a new function:
>>> data.transformColumn('start',lambda x: x+1)
The function supplied to ‘computeColumn’ must take a single argument which is the current line (i.e. a TabDataLine object) and return a new value for the specified column. For example:
>>> def calculateMidpoint(line):
...     return (line['start'] + line['stop'])/2.0
>>> data.computeColumn('midpoint',calculateMidpoint)
Again a lambda expression can be used instead:
>>> data.computeColumn('midpoint',lambda line: (line['start'] + line['stop'])/2.0)
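The practical difference between the two methods is what the supplied function receives: a single value in one case, the whole line in the other. The following plain-Python sketch illustrates that contract using a list of dicts; `transform_column` and `compute_column` are hypothetical stand-ins, not the library's code.

```python
def transform_column(rows, name, func):
    """Call func with the current value of one column and store the result."""
    for row in rows:
        row[name] = func(row[name])

def compute_column(rows, name, func):
    """Call func with the whole row and store the result under 'name'."""
    for row in rows:
        row[name] = func(row)

rows = [{'start': 10, 'stop': 20}, {'start': 100, 'stop': 250}]

# The function sees only the 'start' value
transform_column(rows, 'start', lambda x: x + 1)

# The function sees the whole row, so it can combine several columns
compute_column(rows, 'midpoint',
               lambda line: (line['start'] + line['stop']) / 2.0)
```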
Writing to File
Use the TabFile’s ‘write’ method to output the content to a file:
>>> data.write('newfile.txt') # Writes all the data to newfile.txt
It’s also possible to reorder the columns before writing out using the ‘reorderColumns’ method.
Specifying Delimiters
It’s possible to use a different field delimiter than tabs, by explicitly specifying the value of the ‘delimiter’ argument when creating a new TabFile object, for example for a comma-delimited file:
>>> data = TabFile('data.txt',delimiter=',')
TabFileIterator: iterating through a tab-delimited file
The TabFileIterator provides a light-weight alternative to TabFile in situations where it is only necessary to iterate through each line in a tab-delimited file:
>>> for line in TabFileIterator(filen='data.tsv'):
...     print(line)
Each line is returned as a TabDataLine instance, so the methods available in that class can be used on the data.
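The appeal of an iterator is that lines are processed one at a time rather than being loaded up front. A generator gives the same effect in plain Python; the sketch below is an illustration of that idea only, with `iter_tab_lines` as a hypothetical name and plain lists standing in for TabDataLine objects.

```python
import io

def iter_tab_lines(fp, delimiter="\t"):
    """Hypothetical generator: lazily yield one list of fields per line."""
    for raw in fp:
        yield raw.rstrip("\n").split(delimiter)

# An in-memory stream stands in for an open file
fp = io.StringIO("chr1\t100\t200\nchr2\t300\t400\n")
lines = iter_tab_lines(fp)

first = next(lines)  # only the first line has been read at this point
rest = list(lines)   # consume whatever remains
```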
- class bcftbx.TabFile.TabDataLine(line=None, column_names=None, delimiter='\t', lineno=None, convert=True, allow_underscores_in_numeric_literals=False)
Class to store a line of data from a tab-delimited file
Values can be accessed by integer index or by column names (if set), e.g.
line = TabDataLine("1\t2\t3",('first','second','third'))
allows the 2nd column of data to be accessed either via line[1] or line['second'].
Values can also be changed, e.g.
line['second'] = new_value
Values are automatically converted to integer or float types as appropriate.
Subsets of data can be created using the ‘subset’ method.
Line numbers can also be set by the creating subprogram, and queried via the ‘lineno’ method.
It is possible to use a different field delimiter than tabs, by explicitly specifying the value of the ‘delimiter’ argument, e.g. for a comma-delimited line:
line = TabDataLine("1,2,3",delimiter=',')
Check if a line is empty:
if not line: print("Blank line")
- append(*values)
Append values to the data line
Should only be used when creating new data lines.
- appendColumn(key, value)
Append keyed values to the data line
This adds a new value along with a header name (i.e. key)
- convert_to_str(value)
Convert value to string
- convert_to_type(value)
Internal: convert a value to the correct type
Used to coerce input values into integers or floats if appropriate before storage in the TabDataLine object.
- convert_to_type_pep515(value)
Internal: convert a value to the correct type
Used to coerce input values into integers or floats if appropriate before storage in the TabDataLine object.
The conversion honors PEP 515 so numerical values can also contain underscore characters.
- delimiter(new_delimiter=None)
Set and get the delimiter for the line
If ‘new_delimiter’ is not None then the field delimiter for the line will be updated to the supplied value. This affects how lines are represented via the __repr__ built-in.
Returns the current value of the delimiter.
- lineno()
Return the line number associated with the line
NB The line number is set by the class or function which created the TabDataLine object; it is not guaranteed by the TabDataLine class itself.
- subset(*keys)
Return a subset of data items
This method creates a new TabDataLine instance with a subset of data specified by the ‘keys’ argument, e.g.
new_line = line.subset(2,1)
returns an instance with only the 2nd and 3rd data values in reverse order.
To access the items in a subset using index notation, use the same keys as those specified when the subset was created. For example, for
s = line.subset("two","nine")
use s["two"] and s["nine"] to access the data; while for
s = line.subset(2,9)
use s[2] and s[9].
- Parameters:
keys – one or more keys specifying columns to include in the subset. Keys can be column indices, column names, or a mixture, and the same column can be referenced multiple times.
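The PEP 515 behaviour mentioned for convert_to_type_pep515 and the allow_underscores_in_numeric_literals argument relates to a property of Python itself: from Python 3.6 onwards, `int()` and `float()` accept underscores in numeric strings. The sketch below shows one way such a switch could work; `convert` and its `allow_underscores` flag are hypothetical, and the actual library logic may differ.

```python
def convert(value, allow_underscores=False):
    """Hypothetical: coerce text to int/float, optionally honouring PEP 515."""
    if not isinstance(value, str):
        return value
    if "_" in value and not allow_underscores:
        return value  # keep e.g. '1_000' as text
    for cast in (int, float):
        try:
            return cast(value)  # int()/float() accept '1_000' (Python 3.6+)
        except ValueError:
            pass
    return value

as_text = convert('1_000')
as_int = convert('1_000', allow_underscores=True)
as_float = convert('1_000.5', allow_underscores=True)
still_text = convert('sample_1', allow_underscores=True)
```

Strings that merely contain an underscore, such as identifiers, are not valid numeric literals and so remain strings either way.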
- class bcftbx.TabFile.TabFile(filen=None, fp=None, column_names=None, skip_first_line=False, first_line_is_header=False, tab_data_line=<class 'bcftbx.TabFile.TabDataLine'>, delimiter='\t', convert=True, allow_underscores_in_numeric_literals=False, keep_commented_lines=False)
Class to get data from a tab-delimited file
Loads data from the specified file into a data structure that can then be queried on a per line and per item basis.
Data lines are represented by data line objects which must be TabDataLine-like.
Example usage:
data = TabFile(myfile) # load initial data
print('%s' % len(data)) # report number of lines of data
print('%s' % data.header()) # report header (i.e. column names)
for line in data:
    ... # loop over lines of data
myline = data[0] # fetch first line of data
- append(data=None, tabdata=None, tabdataline=None)
Create and append a new data line
Creates a new data line object and appends it to the end of the list of lines.
Optionally the ‘data’ or ‘tabdata’ arguments can specify data items which will be used to populate the new line; alternatively ‘tabdataline’ can provide a TabDataLine-based object to be appended.
If none of these are specified then a default blank TabDataLine-based object is created, appended and returned.
- Parameters:
data – (optional) a list of data items
tabdata – (optional) a string of tab-delimited data items
tabdataline – (optional) a TabDataLine-based object
- Returns:
Appended data line object.
- appendColumn(name, fill_value='')
Append a new (empty) column
- Parameters:
name – name for the new column
fill_value – optional, value to insert into all rows in the new column
- computeColumn(column_name, compute_func)
Compute and store values in a new column
For each line of data the computation function will be invoked with the line as the sole argument, and the result will be stored in a new column with the specified name.
- Parameters:
column_name – name or index of column to write the computed result to
compute_func – callable object that will be invoked to perform the computation
- filename()
Return the file name associated with the TabFile
- header()
Return list of column names
If no column names were set then this will be an empty list.
- indexByLineNumber(n)
Return index of a data line given the file line number
Given the line number n for a line in the original file, returns the index required to access the data for that line in the TabFile object.
If no matching line is found then raises an IndexError.
- insert(i, data=None, tabdata=None, tabdataline=None)
Create and insert a new data line at a specified index
Creates a new data line object and inserts it into the list of lines at the specified index position ‘i’ (nb NOT a line number).
Optionally the ‘data’ or ‘tabdata’ arguments can specify data items which will be used to populate the new line; alternatively ‘tabdataline’ can provide a TabDataLine-based object to be inserted.
- Parameters:
i – index position to insert the line at
data – (optional) a list of data items
tabdata – (optional) a string of tab-delimited data items
tabdataline – (optional) a TabDataLine-based object
- Returns:
New inserted data line object.
- lookup(key, value)
Return lines where the key matches the specified value
- nColumns()
Return the number of columns in the file
If the file had a header then this will be the number of header columns; otherwise it will be the number of columns found in the first line of data
- reorderColumns(new_columns)
Rearrange the columns in the file
- Parameters:
new_columns – list of column names or indices in the new order
- Returns:
New TabFile object
- sort(sort_func, reverse=False)
Sort data using arbitrary function
Performs an in-place sort based on the supplied sort_func.
sort_func should be a function object which takes a data line object as input and returns a single numerical value; the data lines will be sorted in ascending order of these values (or descending order if reverse is set to True).
To sort on the value of a specific column use e.g.
>>> tabfile.sort(lambda line: line['col'])
- Parameters:
sort_func – function object taking a data line object as input and returning a single numerical value
reverse – (optional) Boolean, either False (default) to sort in ascending order, or True to sort in descending order
- transformColumn(column_name, transform_func)
Apply arbitrary function to a column
For each line of data the transformation function will be invoked with the value of the named column, with the result being written back to that column (overwriting the existing value).
- Parameters:
column_name – name of column to write transformation result to
transform_func – callable object that will be invoked to perform the transformation
- transpose()
Transpose the contents of the file
- Returns:
New TabFile object
- write(filen=None, fp=None, include_header=False, no_hash=False, delimiter=None)
Write the TabFile data to an output file
One of either the ‘filen’ or ‘fp’ arguments must be given, specifying the file name or stream to write the TabFile data to.
- Parameters:
filen – (optional) name of file to write to; ignored if fp is also specified
fp – (optional) a file-like object opened for writing; used in preference to filen if set to a non-null value. Note that the calling program must close the stream in this case.
include_header – (optional) if set to True, the first line will be a ‘header’ line
no_hash – (optional) if set to True and include_header is also True then don’t put a hash character ‘#’ at the start of the header line in the output file.
delimiter – (optional) delimiter to use when writing data values to file (defaults to the delimiter specified on input)
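How the include_header, no_hash and delimiter options of write interact can be sketched with a small stand-alone function. This is an illustration under assumptions, not the library's code: `write_rows` is a hypothetical name, and the exact header layout (a ‘#’ immediately followed by the first column name) is assumed.

```python
import io

def write_rows(fp, header, rows, include_header=False, no_hash=False,
               delimiter="\t"):
    """Hypothetical writer mirroring the options described above."""
    if include_header:
        # Assumption: the hash is prepended directly to the header line
        prefix = "" if no_hash else "#"
        fp.write(prefix + delimiter.join(header) + "\n")
    for row in rows:
        fp.write(delimiter.join(str(v) for v in row) + "\n")

header = ['chr', 'start']
rows = [['chr1', 100]]

# Header with leading hash, tab-delimited
buf = io.StringIO()
write_rows(buf, header, rows, include_header=True)
with_hash = buf.getvalue()

# Header without hash, comma-delimited
buf = io.StringIO()
write_rows(buf, header, rows, include_header=True, no_hash=True,
           delimiter=",")
without_hash = buf.getvalue()
```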
- class bcftbx.TabFile.TabFileIterator(filen=None, fp=None, column_names=None)
Iterate through lines in a tab-delimited file
Class to loop over all lines in a TSV file, returning a TabDataLine object for each record.