`bcftbx.utils`

General utility classes

class bcftbx.utils.AttributeDictionary(**args)

Dictionary-like object with items accessible as attributes

‘AttributeDictionary’ provides a dictionary-like object whose items can also be accessed as attributes of the object.

For example:

>>> d = AttributeDictionary()
>>> d['salutation'] = "hello"
>>> d.salutation
'hello'

Attributes can only be assigned using dictionary item assignment notation, i.e. d['key'] = value; d.key = value doesn’t work.

If the attribute doesn’t match a stored item then an AttributeError exception is raised.

len(d) returns the number of stored items.

The AttributeDictionary behaves like a dictionary for iteration, for example:

>>> for attr in d:
...     print("%s = %s" % (attr,d[attr]))
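The behaviour described above can be illustrated with a small stand-in class. This is a minimal sketch and not the bcftbx implementation: the name `AttrDictSketch` is hypothetical, and it reproduces only the documented points (item assignment, attribute read access, AttributeError for unknown attributes, len() and iteration).

```python
class AttrDictSketch:
    """Hypothetical stand-in for the documented behaviour."""

    def __init__(self, **kwargs):
        # Items live in an ordinary dict held by the instance
        self._items = dict(kwargs)

    def __getitem__(self, key):
        return self._items[key]

    def __setitem__(self, key, value):
        self._items[key] = value

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails
        try:
            return self.__dict__["_items"][name]
        except KeyError:
            raise AttributeError("no stored item named %r" % name)

    def __len__(self):
        return len(self._items)

    def __iter__(self):
        return iter(self._items)


d = AttrDictSketch()
d["salutation"] = "hello"
print(d.salutation)
print(len(d))
for attr in d:
    print("%s = %s" % (attr, d[attr]))
```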

class bcftbx.utils.OrderedDictionary

Augmented dictionary which keeps keys in order

OrderedDictionary provides an augmented Python dictionary class which keeps the dictionary keys in the order they are added to the object.

Items are added, modified and removed as with a standard dictionary e.g.:

>>> d[key] = value
>>> value = d[key]
>>> del(d[key])

The ‘keys()’ method returns the OrderedDictionary’s keys in the order in which they were added.
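For comparison, the standard library's `collections.OrderedDict` gives the same ordering guarantee. The sketch below uses it purely to illustrate insertion-order behaviour; whether OrderedDictionary is built on it is not stated here.

```python
from collections import OrderedDict

d = OrderedDict()
d["zebra"] = 1
d["apple"] = 2
d["mango"] = 3
d["zebra"] = 10    # modifying an existing key keeps its position
del d["apple"]     # removing a key leaves the others in place

print(list(d.keys()))
```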

File handling utilities

bcftbx.utils.getlines(filen)

Fetch lines from a file and return them one by one

This generator function aims to read lines sequentially from a text file efficiently, by minimising the number of reads from the file and performing the line splitting in memory. It attempts to replicate the idiom:

>>> for line in open(filen):
...     ...

using:

>>> for line in getlines(filen):
...     ...

The file can be gzipped; this function should handle this invisibly provided that the file extension is ‘.gz’.

Parameters:

filen (str) – path of the file to read lines from

Yields:

String – next line of text from the file, with any newline character removed.
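The chunked-read approach described above might look like the following. This is a sketch under stated assumptions rather than the library's code: the name `getlines_sketch` and the `chunk_size` parameter are hypothetical, and gzip detection is done on the ‘.gz’ extension as the text describes.

```python
import gzip
import os
import tempfile


def getlines_sketch(filen, chunk_size=102400):
    """Yield lines from a (possibly gzipped) text file, newline removed."""
    # Pick the opener from the file extension
    opener = gzip.open if filen.endswith(".gz") else open
    with opener(filen, "rt") as fp:
        leftover = ""
        while True:
            chunk = fp.read(chunk_size)
            if not chunk:
                break
            # Split the buffered text in memory; the last piece may be an
            # incomplete line, so carry it over to the next chunk
            pieces = (leftover + chunk).split("\n")
            leftover = pieces.pop()
            for line in pieces:
                yield line
        if leftover:
            # Final line had no trailing newline
            yield leftover


with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "example.txt")
    zipped = os.path.join(tmp, "example.txt.gz")
    with open(plain, "wt") as fp:
        fp.write("line 1\nline 2\nline 3\n")
    with gzip.open(zipped, "wt") as fp:
        fp.write("line 1\nline 2\nline 3\n")
    # A tiny chunk size forces lines to straddle chunk boundaries
    plain_lines = list(getlines_sketch(plain, chunk_size=4))
    zipped_lines = list(getlines_sketch(zipped, chunk_size=4))

print(plain_lines)
print(zipped_lines)
```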

File system wrappers and utilities

class bcftbx.utils.PathInfo(path, basedir=None)

Collect and report information on a file

The PathInfo class provides an interface to getting general information on a path, which may point to a file, directory, link or non-existent location.

The properties provide information on whether the path is readable (i.e. accessible) by the current user, whether it is readable by members of the same group, who the owner is, which group it belongs to, when it was last modified, and so on.

Parameters:

path (str) – a filesystem path, which can be relative or absolute, or point to a non-existent location
basedir (str) – (optional) if supplied then it is prepended to the supplied path

chown(user=None, group=None)

Change associated owner and group

‘user’ and ‘group’ must be supplied as UID/GID numbers (or None to leave the current values unchanged).

* Note that chown will fail attempting to change the owner if the current process is not owned by root *

This is actually a wrapper to the os.lchown function, so it doesn’t follow symbolic links.

property datetime: Return last modification time as datetime object

property deepest_accessible_parent

Return longest accessible directory that leads to path

Tries to find the longest parent directory above path which is accessible by the current user.

If it’s not possible to find a parent that is accessible then raise an exception.

property exists

Return True if the path refers to an existing location

Note that this is a wrapper to os.path.lexists so it reports the existence of symbolic links rather than their targets.
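The difference between the two standard library checks is easiest to see with a broken link. The snippet below uses only `os.path` and a temporary directory; the file names are made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    broken = os.path.join(tmp, "broken_link")
    # Link to a target that does not exist
    os.symlink(os.path.join(tmp, "no_such_file"), broken)
    follows_target = os.path.exists(broken)   # checks the (missing) target
    checks_link = os.path.lexists(broken)     # checks the link itself

print(follows_target, checks_link)
```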

property gid

Return associated GID (group ID)

Attempts to return the GID (group ID) number associated with the path.

If the GID can’t be found then returns None.

property group

Return associated group name

Attempts to return the group name associated with the path. If the name can’t be found then tries to return the GID instead.

If neither piece of information can be found then returns None.

property is_dir: Return True if path refers to a directory

property is_executable: Return True if path refers to an executable file

property is_file: Return True if path refers to a file

property is_group_readable

Return True if the path exists and is group-readable

Paths may be reported as unreadable for various reasons, e.g. the target doesn’t exist, or doesn’t have permission for this user to read it, or if part of the path doesn’t allow the user to read the file.

property is_group_writable

Return True if the path exists and is group-writable

Paths may be reported as unwritable for various reasons, e.g. the target doesn’t exist, or doesn’t have permission for this user to write to it, or if part of the path doesn’t allow the user to read the file.

property is_link: Return True if path refers to a symbolic link

property is_readable

Return True if the path exists and is readable by the owner

Paths may be reported as unreadable for various reasons, e.g. the target doesn’t exist, or doesn’t have permission for this user to read it, or if part of the path doesn’t allow the user to read the file.

property mtime: Return last modification timestamp for path

property path: Return the filesystem path

relpath(dirn)

Return part of path relative to a directory

Wrapper for os.path.relpath(…).

Parameters: dirn (str) – directory to make the path relative to
Returns: relative path
Return type: String

property resolve_link_via_parent

If path or parent directory is a link then return actual path

Resolves and returns the ‘real’ path for a path where either it or one of its parent directories is a symbolic link.

It will resolve multiple levels of symlinks to generate a path that is free of links (nb it is possible that the resolved path will not be an existing file or directory).

If there are no links in the directory tree then returns the full path of the input.
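As an illustration of resolving a link in a parent directory, the sketch below uses `os.path.realpath` as a stand-in. This is an assumption made for the example; the property's actual implementation is not shown in this document. Note that the final file does not need to exist.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    tmp = os.path.realpath(tmp)   # make sure the base itself has no links
    real_dir = os.path.join(tmp, "real_dir")
    os.mkdir(real_dir)
    link_dir = os.path.join(tmp, "link_dir")
    os.symlink(real_dir, link_dir)
    # Path that goes through the symlinked parent to a missing file
    via_link = os.path.join(link_dir, "missing.txt")
    resolved = os.path.realpath(via_link)
    expected = os.path.join(real_dir, "missing.txt")

print(resolved == expected)
```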

property uid

Return associated UID (user ID)

Attempts to return the UID (user ID) number associated with the path.

If the UID can’t be found then returns None.

property user

Return associated user name

Attempts to return the user name associated with the path. If the name can’t be found then tries to return the UID instead.

If neither piece of information can be found then returns None.

bcftbx.utils.mkdir(path, mode=None, recursive=False)

Make a directory

Parameters:

path (str) – the path of the directory to be created
mode (str) – (optional) a mode specifier to be applied to the new directory once it has been created e.g. 0775 or 0664
recursive (bool) – (optional) if True then also create any intermediate parent directories if they don’t already exist

bcftbx.utils.mklink(target, link_name, relative=False)

Make a symbolic link

Parameters:

target (str) – path of the file or directory to link to
link_name (str) – path for the link
relative (bool) – if True then make a relative link (if possible); otherwise link to the target as given (default)
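One way a relative link could be constructed is sketched below. The helper name `mklink_sketch` is hypothetical and the approach (computing the target relative to the directory that holds the link) is an assumption, not necessarily what mklink does internally.

```python
import os
import tempfile


def mklink_sketch(target, link_name, relative=False):
    """Create a symlink, optionally storing the target as a relative path."""
    if relative:
        # Express the target relative to the directory holding the link
        link_dir = os.path.dirname(os.path.abspath(link_name))
        target = os.path.relpath(os.path.abspath(target), link_dir)
    os.symlink(target, link_name)


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "data"))
    os.makedirs(os.path.join(tmp, "links"))
    target = os.path.join(tmp, "data", "reads.fastq")
    open(target, "w").close()
    link = os.path.join(tmp, "links", "reads.fastq")
    mklink_sketch(target, link, relative=True)
    stored_target = os.readlink(link)      # what the link actually points at
    link_resolves = os.path.exists(link)   # follows the link to the file

print(stored_target)
```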

bcftbx.utils.chmod(path, mode)

Change mode of file or directory

This is a wrapper for the os.chmod function, with the addition that it doesn’t follow symbolic links.

For symbolic links it attempts to use the os.lchmod function instead, as this operates on the link itself and not the link target. If os.lchmod is not available then links are ignored.

Parameters:

path (str) – file or directory to apply new mode to
mode (str) – a valid mode specifier e.g. 0775 or 0664
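The link handling described above can be sketched as follows. The helper `chmod_sketch` is hypothetical, and it takes an integer mode such as 0o664, which is an assumption made for the example (the documented parameter type is str).

```python
import os
import stat
import tempfile


def chmod_sketch(path, mode):
    """Apply mode to path without following symbolic links."""
    if os.path.islink(path):
        # os.lchmod only exists on some platforms (e.g. not on Linux)
        if hasattr(os, "lchmod"):
            os.lchmod(path, mode)
        # Otherwise the link (and its target) are left untouched
    else:
        os.chmod(path, mode)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "file.txt")
    open(target, "w").close()
    os.chmod(target, 0o600)
    link = os.path.join(tmp, "link")
    os.symlink(target, link)
    chmod_sketch(link, 0o644)      # must not change the target's mode
    mode_after_link = stat.S_IMODE(os.stat(target).st_mode)
    chmod_sketch(target, 0o640)    # regular file: mode is applied
    mode_after_file = stat.S_IMODE(os.stat(target).st_mode)

print(oct(mode_after_link), oct(mode_after_file))
```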

bcftbx.utils.touch(path)

Create new empty file, or update modification time if already exists

Parameters: path (str) – path to the file to create or update

bcftbx.utils.format_file_size(fsize, units=None)

Format a file size from bytes to human-readable form

Takes a file size in bytes and returns a human-readable string, e.g. 4.0K, 186M, 1.5G.

Alternatively specify the required units via the ‘units’ argument.

Parameters:

fsize (int) – size in bytes
units (str) – (optional) specify output in kb (‘K’), Mb (‘M’), Gb (‘G’), Tb (‘T’) or Pb (‘P’)

Returns:

human-readable version of file size.

Return type:

String

bcftbx.utils.commonprefix(path1, path2)

Determine common prefix path for path1 and path2

Use this in preference to os.path.commonprefix as the version in os.path compares the two paths in a character-wise fashion and so can give counter-intuitive matches; this version compares path components which seems more sensible.

For example: for two paths /mnt/dir1/file and /mnt/dir2/file, os.path.commonprefix will return /mnt/dir, whereas this function will return /mnt.

Parameters:

path1 (str) – first path in comparison
path2 (str) – second path in comparison

Returns:

leading part of path which is common to both input paths.

Return type:

String
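The character-wise versus component-wise difference can be reproduced with the standard library alone. The snippet uses `posixpath` so that the result does not depend on the platform, and `commonpath` (Python 3.5+) as a component-wise comparison for illustration; it does not call bcftbx.

```python
import posixpath

paths = ["/mnt/dir1/file", "/mnt/dir2/file"]

# Character-wise comparison: stops in the middle of a path component
by_character = posixpath.commonprefix(paths)

# Component-wise comparison from the standard library
by_component = posixpath.commonpath(paths)

print(by_character)
print(by_component)
```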

bcftbx.utils.is_gzipped_file(filename)

Check if a file has a .gz extension

Parameters: filename (str) – name of the file to be tested (can include leading path)
Returns: True if filename has trailing .gz extension, False if not.
Return type: Boolean

bcftbx.utils.rootname(name)

Remove all extensions from name

Parameters:

name (str) – name of a file

Returns:

Leading part of name up to first dot, i.e. name without any trailing extensions.

Return type:

String

bcftbx.utils.find_program(name)

Find a program on the PATH

Search the current PATH for the specified program name and return the full path, or None if not found.

Parameters: name (str) – name of the program
Returns: the full path of the program, or None if not found.
Return type: String

bcftbx.utils.get_current_user()

Return name of the current user

Looks up user name for the current user.

Returns:

current user name or None if no matching name can be found.

Return type:

String

bcftbx.utils.get_user_from_uid(uid)

Return user name from UID

Looks up user name matching the supplied UID.

Parameters: uid (int) – user ID
Returns: user name or None.
Return type: String

bcftbx.utils.get_uid_from_user(user)

Return UID from user name

Looks up UID matching the supplied user name; returns None if no matching name can be found.

Parameters: user (str) – user name
Returns: matching UID or None.
Return type: Integer

bcftbx.utils.get_group_from_gid(gid)

Return group name from GID

Looks up group name matching the supplied GID; returns None if no matching name can be found.

Parameters: gid (int) – group ID (GID)
Returns: group name or None.
Return type: String

bcftbx.utils.get_gid_from_group(group)

Return GID from group name

Looks up GID matching the supplied group name; returns None if no matching name can be found.

Parameters: group (str) – group name to look up
Returns: GID or None.
Return type: Integer
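Lookups of this kind are typically built on the Unix-only `pwd` and `grp` modules. The helpers below are hypothetical sketches showing the "return None when nothing matches" pattern; they are not the bcftbx functions themselves.

```python
import grp
import os
import pwd


def user_from_uid_sketch(uid):
    """Return the user name for uid, or None if there is no such user."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return None


def gid_from_group_sketch(group):
    """Return the GID for a group name, or None if there is no such group."""
    try:
        return grp.getgrnam(group).gr_gid
    except KeyError:
        return None


current_name = user_from_uid_sketch(os.getuid())
unknown_gid = gid_from_group_sketch("no-such-group-xyz")
print(current_name, unknown_gid)
```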

bcftbx.utils.walk(dirn, include_dirs=True, pattern=None)

Traverse the directory, subdirectories and files

Essentially this ‘walk’ function is a convenience wrapper for the ‘os.walk’ function.

Parameters:

dirn (str) – top-level directory to start traversal from
include_dirs (bool) – if True then yield directories as well as files (default)
pattern (str) – if not None then specifies a regular expression pattern which restricts the set of yielded files and directories to a subset of those which match the pattern
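A wrapper of this shape could be written as below. This is a sketch with several assumptions that the text does not confirm: the helper name `walk_sketch` is hypothetical, the pattern is applied with `re.search` to the full path, and the top-level directory itself is not yielded.

```python
import os
import re
import tempfile


def walk_sketch(dirn, include_dirs=True, pattern=None):
    """Yield paths under dirn, optionally filtered by a regular expression."""
    matcher = re.compile(pattern) if pattern is not None else None
    for dirpath, dirnames, filenames in os.walk(dirn):
        names = list(filenames)
        if include_dirs:
            names = list(dirnames) + names
        for name in names:
            path = os.path.join(dirpath, name)
            if matcher is None or matcher.search(path):
                yield path


with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "sub"))
    for rel in ("b.log", os.path.join("sub", "a.txt")):
        open(os.path.join(tmp, rel), "w").close()
    everything = sorted(os.path.relpath(p, tmp) for p in walk_sketch(tmp))
    files_only = sorted(os.path.relpath(p, tmp)
                        for p in walk_sketch(tmp, include_dirs=False))
    txt_only = sorted(os.path.relpath(p, tmp)
                      for p in walk_sketch(tmp, pattern=r"\.txt$"))

print(everything)
print(files_only)
print(txt_only)
```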

bcftbx.utils.list_dirs(parent, matches=None, startswith=None)

Return list of subdirectories relative to ‘parent’

Parameters:

parent (str) – directory to list subdirectories of
matches (str) – if not None then only include subdirectories that exactly match the supplied string
startswith (str) – if not None then return the subset of subdirectories that start with the supplied string

Returns:

names of matching subdirectories relative to the parent directory.

Return type:

List

bcftbx.utils.strip_ext(name, ext=None)

Strip extension from file name

Given a file name or path, remove the extension (including the dot) and return just the leading part of the name.

If an extension is explicitly specified then only remove the extension if it matches.

Extension can be multipart e.g. ‘fastq.gz’ and can include a leading dot e.g. ‘.gz’ or ‘gz’.

Parameters: name (str) – name of a file
Returns: leading part of name excluding the specified extension or, if no extension is specified, the last extension (i.e. everything up to the last dot).
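The rules above can be sketched in a few lines. `strip_ext_sketch` is a hypothetical helper written from the description only; in particular, returning the name unchanged when the extension doesn't match is how the description is read here.

```python
import os.path


def strip_ext_sketch(name, ext=None):
    """Remove the last extension, or a specific (possibly multipart) one."""
    if ext is None:
        # No extension given: drop everything from the last dot
        root, _ = os.path.splitext(name)
        return root
    # Accept the extension with or without a leading dot
    suffix = "." + ext.lstrip(".")
    if name.endswith(suffix):
        return name[:-len(suffix)]
    # Extension doesn't match: leave the name alone
    return name


print(strip_ext_sketch("reads.fastq.gz"))
print(strip_ext_sketch("reads.fastq.gz", "fastq.gz"))
print(strip_ext_sketch("reads.fastq.gz", ".gz"))
print(strip_ext_sketch("reads.fastq.gz", "bam"))
```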

Symbolic link handling

class bcftbx.utils.Symlink(path)

Class for interrogating and modifying symbolic links

The Symlink class provides an interface for getting information about a symbolic link.

To create a new Symlink instance do e.g.:

>>> l = Symlink('my_link.lnk')

Information about the link can be obtained via the various properties:

target = returns the link target
is_absolute = reports if the target represents an absolute link
is_broken = reports if the target doesn’t exist

There are also methods:

resolve_target() = returns the normalised absolute path to the target
update_target() = updates the target to a new location

Parameters: path (str) – path to the link

property is_absolute: Return True if the link target is an absolute link

property is_broken: Return True if the link target doesn’t exist i.e. link is broken

resolve_target(): Return the normalised absolute path to the link target

property target: Return the target of the symlink

update_target(new_target)

Replace the current link target with new_target

Parameters: new_target – path to replace the existing target with

bcftbx.utils.links(dirn)

Traverse and return all symbolic links under a directory

Given a starting directory, traverses the structure underneath and yields the path for each symlink that is found.

Parameters: dirn (str) – name of the top-level directory
Yields: String – full path for each symbolic link under ‘dirn’.
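A traversal of this kind can be sketched with `os.walk` and `os.path.islink`. The generator name `links_sketch` is hypothetical and the code is an illustration of the idea, not the library's implementation. Broken links are reported too, because `islink` inspects the link rather than its target.

```python
import os
import tempfile


def links_sketch(dirn):
    """Yield the path of every symbolic link found under dirn."""
    # os.walk does not descend into symlinked directories by default,
    # but it still lists them in dirnames
    for dirpath, dirnames, filenames in os.walk(dirn):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                yield path


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "file.txt")
    open(target, "w").close()
    os.mkdir(os.path.join(tmp, "sub"))
    os.symlink(target, os.path.join(tmp, "good_link"))
    os.symlink("no_such_file", os.path.join(tmp, "sub", "broken_link"))
    found = sorted(os.path.relpath(p, tmp) for p in links_sketch(tmp))

print(found)
```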

Sample name utilities

bcftbx.utils.extract_initials(name)

Return leading initials from the sample name

Conventionally the experimenter’s initials are the leading characters of the name e.g. ‘DR’ for ‘DR1’, ‘EP’ for ‘EP_NCYC2669’, ‘CW’ for ‘CW_TI’ etc

Parameters: name (str) – the name of a sample
Returns: the leading initials from the name.
Return type: String

bcftbx.utils.extract_prefix(name)

Return the sample name prefix

Parameters:

name (str) – the name of a sample

Returns:

the prefix consisting of the name with trailing numbers removed, e.g. ‘LD_C’ for ‘LD_C1’

Return type:

String

bcftbx.utils.extract_index_as_string(name)

Return the sample name index as a string

Parameters:

name (str) – the name of a sample or library

Returns:

the extracted index, consisting of the trailing numbers from the name, returned as a string to preserve leading zeroes (e.g. ‘1’ for ‘LD_C1’, ‘07’ for ‘DR07’ etc)

Return type:

String

bcftbx.utils.extract_index(name)

Return the sample name index as an integer

Parameters:

name (str) – the name of a sample or library

Returns:

the index as an integer, or None if the index cannot be converted to integer format.

Return type:

Integer
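The prefix and index conventions described for these functions can be expressed with a single regular expression. The helpers below are hypothetical sketches based on the documented examples (‘LD_C1’, ‘DR07’); they do not call the bcftbx functions.

```python
import re


def split_name_sketch(name):
    """Split a sample name into (prefix, trailing digits as a string)."""
    # Lazy prefix followed by the longest run of digits at the end
    match = re.match(r"^(.*?)(\d*)$", name)
    return match.group(1), match.group(2)


def index_sketch(name):
    """Return the trailing digits as an integer, or None if there are none."""
    digits = split_name_sketch(name)[1]
    return int(digits) if digits else None


print(split_name_sketch("LD_C1"))
print(split_name_sketch("DR07"))
print(index_sketch("DR07"))
print(index_sketch("CW_TI"))
```

Keeping the digits as a string in the first helper is what preserves leading zeroes; the integer conversion discards them.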

bcftbx.utils.pretty_print_names(name_list)

Format a list of sample names for pretty printing.

Parameters:

name_list (list) – a list or tuple of sample names

Returns:

String with a condensed description of the library names, for example:

['DR1', 'DR2', 'DR3', 'DR4'] -> 'DR1-4'

bcftbx.utils.name_matches(name, pattern)

Simple wildcard matching of project and sample names

Matching options are:

exact match of a single name e.g. pattern ‘PJB’ matches ‘PJB’
match start of a name using trailing ‘*’ e.g. pattern ‘PJ*’ matches ‘PJB’,’PJBriggs’ etc
match using multiple patterns by separating with comma e.g. pattern ‘PJB,IJD’ matches ‘PJB’ or ‘IJD’. Subpatterns can include trailing ‘*’ character to match more names.

Parameters:

name (str) – text to match against pattern
pattern (str) – simple ‘glob’-like pattern to match against

Returns:

True if name matches pattern; False otherwise.

Return type:

Boolean
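The three matching options listed above can be sketched as follows. `name_matches_sketch` is a hypothetical reimplementation written from the description, so edge cases (for example whitespace around commas) may differ from the real function.

```python
def name_matches_sketch(name, pattern):
    """Return True if name matches any comma-separated subpattern."""
    for subpattern in pattern.split(","):
        if subpattern.endswith("*"):
            # Trailing wildcard: match on the leading part only
            if name.startswith(subpattern[:-1]):
                return True
        elif name == subpattern:
            # No wildcard: require an exact match
            return True
    return False


print(name_matches_sketch("PJB", "PJB"))
print(name_matches_sketch("PJBriggs", "PJ*"))
print(name_matches_sketch("IJD", "PJB,IJD"))
print(name_matches_sketch("XYZ", "PJB,IJ*"))
```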

File manipulations

bcftbx.utils.concatenate_fastq_files(merged_fastq, fastq_files, bufsize=10240, overwrite=False, verbose=True)

Create a single FASTQ file by concatenating one or more FASTQs

Given a list or tuple of FASTQ files (which can be compressed or uncompressed or a combination), creates a single output FASTQ by concatenating the contents.

Parameters:

merged_fastq (str) – name of output FASTQ file (mustn’t exist beforehand)
fastq_files (list) – list of FASTQ files to concatenate
bufsize (int) – (optional) size of buffer to use for copying data
overwrite (bool) – (optional) if True then overwrite the output file if it already exists (otherwise raise OSError); default is False
verbose (bool) – (optional) if True then report operations to stdout, otherwise operate quietly
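A buffered concatenation of mixed compressed and uncompressed inputs might look like the sketch below. The helper name `concatenate_sketch` is hypothetical, and two details are assumptions: inputs are treated as gzipped when their names end in ‘.gz’, and the output is compressed only if its own name ends in ‘.gz’.

```python
import gzip
import os
import shutil
import tempfile


def concatenate_sketch(merged, inputs, bufsize=10240, overwrite=False):
    """Concatenate (optionally gzipped) files into a single output file."""
    if os.path.exists(merged) and not overwrite:
        raise OSError("output file '%s' already exists" % merged)
    out_opener = gzip.open if merged.endswith(".gz") else open
    with out_opener(merged, "wb") as out:
        for filen in inputs:
            in_opener = gzip.open if filen.endswith(".gz") else open
            with in_opener(filen, "rb") as fp:
                # Copy in bufsize-byte pieces to limit memory use
                shutil.copyfileobj(fp, out, bufsize)


record1 = b"@read1\nACGT\n+\nIIII\n"
record2 = b"@read2\nTTGA\n+\nIIII\n"
with tempfile.TemporaryDirectory() as tmp:
    fq1 = os.path.join(tmp, "a.fastq")
    fq2 = os.path.join(tmp, "b.fastq.gz")
    with open(fq1, "wb") as fp:
        fp.write(record1)
    with gzip.open(fq2, "wb") as fp:
        fp.write(record2)
    merged = os.path.join(tmp, "merged.fastq")
    concatenate_sketch(merged, [fq1, fq2])
    with open(merged, "rb") as fp:
        merged_bytes = fp.read()
    # A second attempt without overwrite=True should be refused
    try:
        concatenate_sketch(merged, [fq1])
        refused = False
    except OSError:
        refused = True

print(merged_bytes.decode())
print(refused)
```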

Text manipulations

bcftbx.utils.split_into_lines(text, char_limit, delimiters=' \t\n', sympathetic=False)

Split a string into multiple lines with maximum length

Splits a string into multiple lines on one or more delimiters (defaults to the whitespace characters i.e. ‘ ‘,tab and newline), such that each line is no longer than a specified length.

For example:

>>> split_into_lines("This is some text to split",10)
['This is','some text','to split']

If it’s not possible to split part of the text to a suitable length then the line is split “unsympathetically” at the line length, e.g.

>>> split_into_lines("This is supercalifragilicous text",10)
['This is','supercalif','ragilicous','text']

Set the ‘sympathetic’ flag to True to include a hyphen to indicate that a word has been broken, e.g.

>>> split_into_lines("This is supercalifragilicous text",10,
...                  sympathetic=True)
['This is','supercali-','fragilico-','us text']

To use an alternative set of delimiter characters, set the ‘delimiters’ argument, e.g.

>>> split_into_lines("This: is some text",10,delimiters=':')
['This',' is some t','ext']

Parameters:

text (str) – string of text to be split into lines
char_limit (int) – maximum length for any given line
delimiters (str) – optional, specify a set of non-default delimiter characters (defaults to whitespace)
sympathetic (bool) – optional, if True then add hyphen to indicate when a word has been broken

Returns:

lines of split text as strings.

Return type:

List
