bcftbx.utils

utils

Utility classes and functions shared between BCF codes.

General utility classes:

AttributeDictionary, OrderedDictionary

File reading utilities:

getlines

File system wrappers and utilities:

PathInfo, mkdir, mkdirs, mklink, chmod, touch, format_file_size, convert_size_to_bytes, commonprefix, is_gzipped_file, rootname, find_program, get_current_user, get_user_from_uid, get_uid_from_user, get_group_from_gid, get_gid_from_group, get_hostname, walk, list_dirs, strip_ext

Symbolic link handling:

Symlink, links

Sample name utilities:

extract_initials, extract_prefix, extract_index_as_string, extract_index, pretty_print_names, name_matches

File manipulations:

concatenate_fastq_files

Text manipulations:

split_into_lines

Command line parsing utilities:

parse_named_lanes, parse_lanes

General utility classes

class bcftbx.utils.AttributeDictionary(**args)

Dictionary-like object with items accessible as attributes

AttributeDictionary provides a dictionary-like object where the value of items can also be accessed as attributes of the object.

For example:

>>> d = AttributeDictionary()
>>> d['salutation'] = "hello"
>>> d.salutation
'hello'

Attributes can only be assigned using dictionary item assignment notation, i.e. d['key'] = value; d.key = value doesn’t work.

If the attribute doesn’t match a stored item then an AttributeError exception is raised.

len(d) returns the number of stored items.
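
The behaviour described so far can be approximated in a few lines. The class below is a hypothetical stand-in (the name AttrDictSketch is not part of bcftbx, and the real implementation may differ); it only illustrates item assignment, attribute-style reads, and the AttributeError raised for unknown names.

```python
class AttrDictSketch:
    """Hypothetical stand-in for illustration; not the bcftbx class."""

    def __init__(self, **args):
        # Store items in a private dict, written via __dict__ so that
        # attribute lookup for '_items' itself never reaches __getattr__
        self.__dict__["_items"] = dict(args)

    def __getattr__(self, attr):
        # Only invoked when normal attribute lookup fails
        try:
            return self.__dict__["_items"][attr]
        except KeyError:
            raise AttributeError("no item called '%s'" % attr)

    def __setitem__(self, key, value):
        self._items[key] = value

    def __getitem__(self, key):
        return self._items[key]

    def __len__(self):
        return len(self._items)

    def __iter__(self):
        return iter(self._items)


d = AttrDictSketch()
d["salutation"] = "hello"
print(d.salutation)
print(len(d))
```

In this sketch, writing d.key = value would set an ordinary instance attribute rather than store an item, which is one way to see why item assignment is the supported route.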

The AttributeDictionary behaves like a dictionary for iterations, for example:

>>> for attr in d:
...     print("%s = %s" % (attr,d[attr]))

class bcftbx.utils.OrderedDictionary

Augmented dictionary which keeps keys in order

OrderedDictionary provides an augmented Python dictionary class which keeps the dictionary keys in the order they are added to the object.

Items are added, modified and removed as with a standard dictionary e.g.:

>>> d[key] = value
>>> value = d[key]
>>> del(d[key])

The ‘keys()’ method returns the OrderedDictionary’s keys in the order in which they were added.
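
For comparison, the standard library's collections.OrderedDict gives the same insertion-order guarantee. The snippet below uses that stdlib class as an analogue only; it is not the bcftbx OrderedDictionary, which may offer additional methods not shown here.

```python
from collections import OrderedDict

# Stdlib analogue used for illustration; not bcftbx.utils.OrderedDictionary
d = OrderedDict()
d["zebra"] = 1
d["apple"] = 2
d["mango"] = 3

d["zebra"] = 10   # modifying an existing key keeps its position
del d["apple"]    # removing a key leaves the others in place

keys = list(d.keys())
print(keys)  # ['zebra', 'mango']
```

Since Python 3.7 the built-in dict also preserves insertion order.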

File handling utilities

bcftbx.utils.getlines(filen)

Fetch lines from a file and return them one by one

This generator function tries to implement an efficient method of reading lines sequentially from a text file, by minimising the number of reads from the file and performing the line splitting in memory. It attempts to replicate the idiom:

>>> for line in io.open(filen):
...     ...

using:

>>> for line in getlines(filen):
...     ...

The file can be gzipped; this function should handle this invisibly provided that the file extension is ‘.gz’.

Parameters:

filen (str) – path of the file to read lines from

Yields:

String – next line of text from the file, with any newline character removed.
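
A minimal sketch of the chunked-read approach is shown below. The function name getlines_sketch and the chunk_size parameter are assumptions made for illustration; the actual bcftbx implementation and its buffer size may differ.

```python
import gzip
import io


def getlines_sketch(filen, chunk_size=102400):
    """Yield lines from a (possibly gzipped) text file, newline removed.

    Hypothetical illustration; not the bcftbx implementation.
    """
    # Choose the opener from the file extension only
    opener = gzip.open if filen.endswith(".gz") else io.open
    with opener(filen, "rt") as fp:
        leftover = ""
        while True:
            # Read a large block in one go, then split it in memory
            chunk = fp.read(chunk_size)
            if not chunk:
                break
            lines = (leftover + chunk).split("\n")
            # The last piece may be an incomplete line: carry it over
            leftover = lines.pop()
            for line in lines:
                yield line
        # Final line without a trailing newline
        if leftover:
            yield leftover
```

Carrying the trailing fragment over to the next chunk is what keeps lines intact when a read boundary falls in the middle of a line.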

File system wrappers and utilities

class bcftbx.utils.PathInfo(path, basedir=None)

Collect and report information on a file

The PathInfo class provides an interface to getting general information on a path, which may point to a file, directory, link or non-existent location.

The properties report whether the path is readable (i.e. accessible) by the current user, whether it is readable by members of the same group, who owns it, which group it belongs to, when it was last modified, and so on.

chown(user=None, group=None)

Change associated owner and group

‘user’ and ‘group’ must be supplied as UID/GID numbers (or None to leave the current values unchanged).

Note that chown will fail when attempting to change the owner if the current process is not owned by root.

This is actually a wrapper to the os.lchown function, so it doesn’t follow symbolic links.

property datetime

Return last modification time as datetime object

property deepest_accessible_parent

Return longest accessible directory that leads to path

Tries to find the longest parent directory above path which is accessible by the current user.

If it’s not possible to find a parent that is accessible then raise an exception.

property exists

Return True if the path refers to an existing location

Note that this is a wrapper to os.path.lexists so it reports the existence of symbolic links rather than their targets.
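
The difference between os.path.exists and os.path.lexists shows up with a dangling symbolic link. The snippet below uses only the standard library (the file names are arbitrary) and assumes a platform where symbolic links can be created.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    link = os.path.join(tmp, "dangling")
    # Create a link whose target does not exist
    os.symlink(os.path.join(tmp, "missing_target"), link)
    follows_link = os.path.exists(link)    # checks the target
    reports_link = os.path.lexists(link)   # checks the link itself

print(follows_link, reports_link)  # False True
```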

property gid

Return associated GID (group ID)

Attempts to return the GID (group ID) number associated with the path.

If the GID can’t be found then returns None.

property group

Return associated group name

Attempts to return the group name associated with the path. If the name can’t be found then tries to return the GID instead.

If neither piece of information can be found then returns None.

property is_dir

Return True if path refers to a directory

property is_executable

Return True if path refers to an executable file

property is_file

Return True if path refers to a file

property is_group_readable

Return True if the path exists and is group-readable

Paths may be reported as unreadable for various reasons, e.g. the target doesn’t exist, or doesn’t have permission for this user to read it, or if part of the path doesn’t allow the user to read the file.

property is_group_writable

Return True if the path exists and is group-writable

Paths may be reported as unwritable for various reasons, e.g. the target doesn’t exist, or doesn’t have permission for this user to write to it, or if part of the path doesn’t allow the user to read the file.

property is_link

Return True if path refers to a symbolic link

property is_readable

Return True if the path exists and is readable by the owner

Paths may be reported as unreadable for various reasons, e.g. the target doesn’t exist, or doesn’t have permission for this user to read it, or if part of the path doesn’t allow the user to read the file.

property mtime

Return last modification timestamp for path

property path

Return the filesystem path

relpath(dirn)

Return part of path relative to a directory

Wrapper for os.path.relpath(…).

resolve_link_via_parent()

If path or parent directory is a link then return actual path

Resolves and returns the ‘real’ path for a path where either it or one of its parent directories is a symbolic link.

It will resolve multiple levels of symlinks to generate a path that is free of links (nb it is possible that the resolved path will not be an existing file or directory).

If there are no links in the directory tree then returns the full path of the input.

property uid

Return associated UID (user ID)

Attempts to return the UID (user ID) number associated with the path.

If the UID can’t be found then returns None.

property user

Return associated user name

Attempts to return the user name associated with the path. If the name can’t be found then tries to return the UID instead.

If neither piece of information can be found then returns None.

bcftbx.utils.mkdir(dirn, mode=None, recursive=False)

Make a directory

Parameters:
  • dirn – the path of the directory to be created

  • mode – (optional) a mode specifier to be applied to the new directory once it has been created e.g. 0o775 or 0o664

  • recursive – (optional) if True then also create any intermediate parent directories if they don’t already exist

bcftbx.utils.mklink(target, link_name, relative=False)

Make a symbolic link

Parameters:
  • target – the file or directory to link to

  • link_name – name of the link

  • relative – if True then make a relative link (if possible); otherwise link to the target as given (default)
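
The ‘relative’ option can be pictured as rewriting the target relative to the directory that holds the link before calling os.symlink. The function below is a sketch under that assumption; mklink_sketch is a hypothetical name and the library's own logic may differ.

```python
import os


def mklink_sketch(target, link_name, relative=False):
    """Create a symbolic link, optionally with a relative target.

    Hypothetical illustration; not the bcftbx implementation.
    """
    if relative:
        # Express the target relative to the directory containing the link
        link_dir = os.path.dirname(os.path.abspath(link_name))
        target = os.path.relpath(os.path.abspath(target), link_dir)
    os.symlink(target, link_name)
```

A relative link keeps working if the whole directory tree is moved, whereas a link to an absolute target does not.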

bcftbx.utils.chmod(target, mode)

Change mode of file or directory

This is a wrapper for the os.chmod function, with the addition that it doesn’t follow symbolic links.

For symbolic links it attempts to use the os.lchmod function instead, as this operates on the link itself and not the link target. If os.lchmod is not available then links are ignored.

Parameters:
  • target – file or directory to apply new mode to

  • mode – a valid mode specifier e.g. 0o775 or 0o664
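
The link-handling rule described above can be sketched as follows. chmod_sketch is a hypothetical name used for illustration only; note that os.lchmod is absent on many platforms (including Linux), which is why the availability check is needed.

```python
import os


def chmod_sketch(target, mode):
    """Change mode without following symbolic links.

    Hypothetical illustration; not the bcftbx implementation.
    """
    if os.path.islink(target):
        # Operate on the link itself where the platform allows it
        if hasattr(os, "lchmod"):
            os.lchmod(target, mode)
        # Otherwise the link is ignored and its target is left untouched
    else:
        os.chmod(target, mode)
```

In Python 3 the mode is written as an octal literal with the 0o prefix, for example chmod_sketch(path, 0o664).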

bcftbx.utils.touch(filename)

Create new empty file, or update modification time if already exists

Parameters:

filename – name of the file to create (can include leading path)

bcftbx.utils.format_file_size(fsize, units=None)

Format a file size from bytes to human-readable form

Takes a file size in bytes and returns a human-readable string, e.g. 4.0K, 186M, 1.5G.

Alternatively specify the required units via the ‘units’ argument.

Parameters:
  • fsize – size in bytes

  • units – (optional) specify output in kb (‘K’), Mb (‘M’), Gb (‘G’) or Tb (‘T’)

Returns:

Human-readable version of file size.

bcftbx.utils.commonprefix(path1, path2)

Determine common prefix path for path1 and path2

Use this in preference to os.path.commonprefix as the version in os.path compares the two paths in a character-wise fashion and so can give counter-intuitive matches; this version compares path components which seems more sensible.

For example: for two paths /mnt/dir1/file and /mnt/dir2/file, os.path.commonprefix will return /mnt/dir, whereas this function will return /mnt.

Parameters:
  • path1 – first path in comparison

  • path2 – second path in comparison

Returns:

Leading part of path which is common to both input paths.
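
The contrast between character-wise and component-wise comparison can be reproduced with the standard library alone. The snippet uses posixpath.commonpath as a stand-in for component-wise behaviour; it is an analogue, not a statement about how bcftbx implements its own commonprefix.

```python
import posixpath

paths = ["/mnt/dir1/file", "/mnt/dir2/file"]

# Character-wise comparison stops in the middle of a path component
char_wise = posixpath.commonprefix(paths)

# Component-wise comparison (stdlib analogue of the behaviour described)
component_wise = posixpath.commonpath(paths)

print(char_wise)       # /mnt/dir
print(component_wise)  # /mnt
```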

bcftbx.utils.is_gzipped_file(filename)

Check if a file has a .gz extension

Parameters:

filename – name of the file to be tested (can include leading path)

Returns:

True if filename has trailing .gz extension, False if not.

bcftbx.utils.rootname(name)

Remove all extensions from name

Parameters:

name – name of a file

Returns:

Leading part of name up to first dot, i.e. name without any trailing extensions.

bcftbx.utils.find_program(name)

Find a program on the PATH

Search the current PATH for the specified program name and return the full path, or None if not found.
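
The standard library offers an equivalent lookup in shutil.which, which also returns None when nothing is found. The wrapper below is a sketch built on that stdlib call; find_program_sketch is a hypothetical name, and bcftbx may implement its search differently.

```python
import shutil


def find_program_sketch(name):
    """Return the full path of 'name' if found on PATH, else None.

    Hypothetical illustration using the stdlib; not the bcftbx code.
    """
    return shutil.which(name)


result = find_program_sketch("no-such-program-xyz-12345")
print(result)  # None (assuming no program of that name is installed)
```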

bcftbx.utils.get_current_user()

Return name of the current user

Looks up user name for the current user; returns None if no matching name can be found.

bcftbx.utils.get_user_from_uid(uid)

Return user name from UID

Looks up user name matching the supplied UID; returns None if no matching name can be found.

bcftbx.utils.get_uid_from_user(user)

Return UID from user name

Looks up UID matching the supplied user name; returns None if no matching name can be found.

NB returned UID will be an integer.

bcftbx.utils.get_group_from_gid(gid)

Return group name from GID

Looks up group name matching the supplied GID; returns None if no matching name can be found.

bcftbx.utils.get_gid_from_group(group)

Return GID from group name

Looks up GID matching the supplied group name; returns None if no matching name can be found.

NB returned GID will be an integer.
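
On Unix-like systems, lookups of this kind map naturally onto the standard pwd and grp modules. The four helpers below are a sketch under that assumption (the shortened function names are hypothetical); they return None rather than raising when no match exists, mirroring the behaviour documented above.

```python
import grp
import pwd


def user_from_uid(uid):
    """Hypothetical helper: user name for a UID, or None."""
    try:
        return pwd.getpwuid(int(uid)).pw_name
    except (KeyError, ValueError):
        return None


def uid_from_user(user):
    """Hypothetical helper: integer UID for a user name, or None."""
    try:
        return pwd.getpwnam(user).pw_uid
    except KeyError:
        return None


def group_from_gid(gid):
    """Hypothetical helper: group name for a GID, or None."""
    try:
        return grp.getgrgid(int(gid)).gr_name
    except (KeyError, ValueError):
        return None


def gid_from_group(group):
    """Hypothetical helper: integer GID for a group name, or None."""
    try:
        return grp.getgrnam(group).gr_gid
    except KeyError:
        return None
```

Converting the ID with int() lets callers pass either a number or a numeric string.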

bcftbx.utils.walk(dirn, include_dirs=True, pattern=None)

Traverse the directory, subdirectories and files

Essentially this ‘walk’ function is a convenience wrapper for the ‘os.walk’ function.

Parameters:
  • dirn – top-level directory to start traversal from

  • include_dirs – if True then yield directories as well as files (default)

  • pattern – if not None then specifies a regular expression pattern which restricts the set of yielded files and directories to a subset of those which match the pattern
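
A wrapper of this kind might look like the sketch below. Several details are assumptions rather than documented behaviour: that the top-level directory itself is yielded, that the pattern is applied to the full path, and that it is applied with re.match (anchored at the start). walk_sketch is a hypothetical name.

```python
import os
import re


def walk_sketch(dirn, include_dirs=True, pattern=None):
    """Yield paths under 'dirn', optionally filtered by a regex.

    Hypothetical illustration; not the bcftbx implementation.
    """
    matcher = re.compile(pattern) if pattern is not None else None

    def wanted(path):
        # With no pattern everything is yielded
        return matcher is None or matcher.match(path) is not None

    # Assumption: the starting directory counts as a yielded directory
    if include_dirs and wanted(dirn):
        yield dirn
    for dirpath, dirnames, filenames in os.walk(dirn):
        if include_dirs:
            for name in dirnames:
                path = os.path.join(dirpath, name)
                if wanted(path):
                    yield path
        for name in filenames:
            path = os.path.join(dirpath, name)
            if wanted(path):
                yield path
```

For example, list(walk_sketch(top, include_dirs=False, pattern=r".*\.txt$")) would collect only the .txt files beneath a directory top.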

bcftbx.utils.list_dirs(parent, matches=None, startswith=None)

Return list of subdirectories relative to ‘parent’

Parameters:
  • parent – directory to list subdirectories of

  • matches – if not None then only include subdirectories that exactly match the supplied string

  • startswith – if not None then return the subset of subdirectories that start with the supplied string

Returns:

List of subdirectories (relative to the parent dir).
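
The two filters can be sketched as below. list_dirs_sketch is a hypothetical name, and returning the names in sorted order is an assumption made here so that the result is deterministic.

```python
import os


def list_dirs_sketch(parent, matches=None, startswith=None):
    """List subdirectory names of 'parent', with optional filters.

    Hypothetical illustration; not the bcftbx implementation.
    """
    dirs = []
    for name in sorted(os.listdir(parent)):
        # Skip anything that is not a directory
        if not os.path.isdir(os.path.join(parent, name)):
            continue
        # 'matches' demands an exact name
        if matches is not None and name != matches:
            continue
        # 'startswith' demands a leading substring
        if startswith is not None and not name.startswith(startswith):
            continue
        dirs.append(name)
    return dirs
```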

bcftbx.utils.strip_ext(name, ext=None)

Strip extension from file name

Given a file name or path, remove the extension (including the dot) and return just the leading part of the name.

If an extension is explicitly specified then only remove the extension if it matches.

Extension can be multipart e.g. ‘fastq.gz’ and can include a leading dot e.g. ‘.gz’ or ‘gz’.

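Parameters:
  • name – name of a file

  • ext – (optional) extension to remove; only removed if the name ends with it

Returns:

Leading part of name excluding the specified extension or, if no extension was specified, excluding the final extension (i.e. everything up to the last dot).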
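
The rules for strip_ext can be sketched with os.path.splitext for the default case and a suffix test for an explicit extension. strip_ext_sketch is a hypothetical name and edge cases (such as names with no extension at all) may be handled differently by the library.

```python
import os


def strip_ext_sketch(name, ext=None):
    """Remove an extension from a file name.

    Hypothetical illustration; not the bcftbx implementation.
    """
    if ext is None:
        # Default: drop only the final extension
        return os.path.splitext(name)[0]
    # Accept the extension with or without a leading dot
    ext = "." + ext.lstrip(".")
    if name.endswith(ext):
        return name[:-len(ext)]
    # Extension doesn't match: return the name unchanged
    return name


print(strip_ext_sketch("sample.fastq.gz"))              # sample.fastq
print(strip_ext_sketch("sample.fastq.gz", "fastq.gz"))  # sample
print(strip_ext_sketch("sample.fastq", ".gz"))          # sample.fastq
```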

Sample name utilities

bcftbx.utils.extract_initials(name)

Return leading initials from the library or sample name

Conventionally the experimenter’s initials are the leading characters of the name, e.g. ‘DR’ for ‘DR1’, ‘EP’ for ‘EP_NCYC2669’, ‘CW’ for ‘CW_TI’ etc.

Parameters:

name – the name of a sample or library

Returns:

The leading initials from the name.

bcftbx.utils.extract_prefix(name)

Return the library or sample name prefix

Parameters:

name – the name of a sample or library

Returns:

The prefix consisting of the name with trailing numbers removed, e.g. ‘LD_C’ for ‘LD_C1’

bcftbx.utils.extract_index_as_string(name)

Return the library or sample name index as a string

Parameters:

name – the name of a sample or library

Returns:

The index, consisting of the trailing numbers from the name. It is returned as a string to preserve leading zeroes, e.g. ‘1’ for ‘LD_C1’, ‘07’ for ‘DR07’ etc

bcftbx.utils.extract_index(name)

Return the library or sample name index as an integer

Parameters:

name – the name of a sample or library

Returns:

The index as an integer, or None if the index cannot be converted to integer format.
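
The naming convention behind these four functions (leading letters, then an optional body, then trailing digits) can be sketched as follows. The function names are hypothetical and the real implementations may treat unusual names differently; the examples are the ones quoted in the descriptions above.

```python
def initials_sketch(name):
    """Leading alphabetic characters, e.g. 'EP' for 'EP_NCYC2669'."""
    initials = []
    for c in name:
        if not c.isalpha():
            break
        initials.append(c)
    return "".join(initials)


def prefix_sketch(name):
    """Name with trailing digits removed, e.g. 'LD_C' for 'LD_C1'."""
    return name.rstrip("0123456789")


def index_as_string_sketch(name):
    """Trailing digits as a string, keeping leading zeroes."""
    return name[len(prefix_sketch(name)):]


def index_sketch(name):
    """Trailing digits as an integer, or None if there are none."""
    try:
        return int(index_as_string_sketch(name))
    except ValueError:
        return None


print(initials_sketch("EP_NCYC2669"))    # EP
print(prefix_sketch("LD_C1"))            # LD_C
print(index_as_string_sketch("DR07"))    # 07
print(index_sketch("DR07"))              # 7
```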

bcftbx.utils.pretty_print_names(name_list)

Given a list of library or sample names, format for pretty printing.

Parameters:

name_list – a list or tuple of library or sample names

Returns:

String with a condensed description of the library names, for example:

[‘DR1’, ‘DR2’, ‘DR3’, ‘DR4’] -> ‘DR1-4’

bcftbx.utils.name_matches(name, pattern)

Simple wildcard matching of project and sample names

Matching options are:

  • exact match of a single name e.g. pattern ‘PJB’ matches ‘PJB’

  • match start of a name using trailing ‘*’ e.g. pattern ‘PJ*’ matches ‘PJB’, ‘PJBriggs’ etc.

  • match using multiple patterns by separating with comma e.g. pattern ‘PJB,IJD’ matches ‘PJB’ or ‘IJD’. Subpatterns can include trailing ‘*’ character to match more names.

Parameters:
  • name – text to match against pattern

  • pattern – simple ‘glob’-like pattern to match against

Returns:

True if name matches pattern; False otherwise.
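
The three matching options can be expressed in a few lines. name_matches_sketch is a hypothetical name, and the sketch covers only the rules listed above (exact match, trailing ‘*’, comma-separated alternatives).

```python
def name_matches_sketch(name, pattern):
    """Return True if 'name' matches a simple glob-like 'pattern'.

    Hypothetical illustration; not the bcftbx implementation.
    """
    # A comma separates alternative subpatterns
    for subpattern in pattern.split(","):
        if subpattern.endswith("*"):
            # Trailing '*': match on the leading part of the name
            if name.startswith(subpattern[:-1]):
                return True
        elif name == subpattern:
            # Otherwise require an exact match
            return True
    return False


print(name_matches_sketch("PJB", "PJB"))        # True
print(name_matches_sketch("PJBriggs", "PJ*"))   # True
print(name_matches_sketch("IJD", "PJB,IJD"))    # True
print(name_matches_sketch("ABC", "PJ*,IJD"))    # False
```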

File manipulations

bcftbx.utils.concatenate_fastq_files(merged_fastq, fastq_files, bufsize=10240, overwrite=False, verbose=True)

Create a single FASTQ file by concatenating one or more FASTQs

Given a list or tuple of FASTQ files (which can be compressed or uncompressed or a combination), creates a single output FASTQ by concatenating the contents.

Parameters:
  • merged_fastq – name of output FASTQ file (mustn’t exist beforehand unless ‘overwrite’ is True)

  • fastq_files – list of FASTQ files to concatenate

  • bufsize – (optional) size of buffer to use for copying data

  • overwrite – (optional) if True then overwrite the output file if it already exists (otherwise raise OSError); default is False

  • verbose – (optional) if True then report operations to stdout, otherwise operate quietly
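
Buffered concatenation of mixed compressed and uncompressed inputs can be sketched with gzip and shutil.copyfileobj. The function name concatenate_sketch is hypothetical, and two details are assumptions: that compression is detected from a ‘.gz’ extension, and that the output is itself compressed when its name ends in ‘.gz’.

```python
import gzip
import os
import shutil


def concatenate_sketch(merged_fastq, fastq_files, bufsize=10240,
                       overwrite=False):
    """Concatenate FASTQ files into a single output file.

    Hypothetical illustration; not the bcftbx implementation.
    """
    if os.path.exists(merged_fastq) and not overwrite:
        raise OSError("Target file '%s' already exists" % merged_fastq)
    # Assumption: compress the output if its name ends with .gz
    open_out = gzip.open if merged_fastq.endswith(".gz") else open
    with open_out(merged_fastq, "wb") as out:
        for fastq in fastq_files:
            # Decompress gzipped inputs on the fly
            open_in = gzip.open if fastq.endswith(".gz") else open
            with open_in(fastq, "rb") as src:
                # Copy in blocks of 'bufsize' bytes
                shutil.copyfileobj(src, out, bufsize)
```

Copying in fixed-size blocks keeps memory use bounded regardless of how large the input files are.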

Text manipulations

bcftbx.utils.split_into_lines(text, char_limit, delimiters=' \t\n', sympathetic=False)

Split a string into multiple lines with maximum length

Splits a string into multiple lines on one or more delimiters (defaults to the whitespace characters, i.e. space, tab and newline), such that each line is no longer than a specified length.

For example:

>>> split_into_lines("This is some text to split",10)
['This is','some text','to split']

If it’s not possible to split part of the text to a suitable length then the line is split “unsympathetically” at the line length, e.g.

>>> split_into_lines("This is supercalifragilicous text",10)
['This is','supercalif','ragilicous','text']

Set the ‘sympathetic’ flag to True to include a hyphen to indicate that a word has been broken, e.g.

>>> split_into_lines("This is supercalifragilicous text",10,
...                  sympathetic=True)
['This is','supercali-','fragilico-','us text']

To use an alternative set of delimiter characters, set the ‘delimiters’ argument, e.g.

>>> split_into_lines("This: is some text",10,delimiters=':')
['This',' is some t','ext']

Parameters:
  • text – string of text to be split into lines

  • char_limit – maximum length for any given line

  • delimiters – optional, specify a set of non-default delimiter characters (defaults to whitespace)

  • sympathetic – optional, if True then add hyphen to indicate when a word has been broken

Returns:

List of lines (i.e. strings).