bcftbx.utils
utils
Utility classes and functions shared between BCF codes.
General utility classes:
AttributeDictionary OrderedDictionary
File reading utilities:
getlines
File system wrappers and utilities:
PathInfo mkdir mkdirs mklink chmod touch format_file_size convert_size_to_bytes commonprefix is_gzipped_file rootname find_program get_current_user get_user_from_uid get_uid_from_user get_group_from_gid get_gid_from_group get_hostname walk list_dirs strip_ext
Symbolic link handling:
Symlink links
Sample name utilities:
extract_initials extract_prefix extract_index_as_string extract_index pretty_print_names name_matches
File manipulations:
concatenate_fastq_files
Text manipulations:
split_into_lines
Command line parsing utilities:
parse_named_lanes parse_lanes
General utility classes
- class bcftbx.utils.AttributeDictionary(**args)
Dictionary-like object with items accessible as attributes
AttributeDictionary provides a dictionary-like object where the value of items can also be accessed as attributes of the object.
For example:
>>> d = AttributeDictionary()
>>> d['salutation'] = "hello"
>>> d.salutation
'hello'
Attributes can only be assigned by using dictionary item assignment notation, i.e. d['key'] = value; d.key = value doesn’t work.
If the attribute doesn’t match a stored item then an AttributeError exception is raised.
len(d) returns the number of stored items.
The AttributeDictionary behaves like a dictionary for iterations, for example:
>>> for attr in d:
...     print("%s = %s" % (attr,d[attr]))
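As an illustration of the behaviour described above, the following minimal sketch shows one way such a class could be put together. It is not the bcftbx implementation: the class name `AttrDictSketch` and its internals are assumptions made for this example.

```python
class AttrDictSketch:
    """Hypothetical stand-in: items are also readable as attributes."""

    def __init__(self, **kwargs):
        # Backing store for the items (hypothetical internal name)
        self._items = dict(kwargs)

    def __getitem__(self, key):
        return self._items[key]

    def __setitem__(self, key, value):
        self._items[key] = value

    def __getattr__(self, name):
        # Only invoked when normal attribute lookup has failed
        try:
            return self.__dict__["_items"][name]
        except KeyError:
            raise AttributeError("no stored item named '%s'" % name)

    def __len__(self):
        return len(self._items)

    def __iter__(self):
        return iter(self._items)


d = AttrDictSketch()
d["salutation"] = "hello"
print(d.salutation)
for attr in d:
    print("%s = %s" % (attr, d[attr]))
```

In this sketch `d.key = value` creates an ordinary instance attribute rather than a stored item, which is one possible reading of "doesn't work" in the description above.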
- class bcftbx.utils.OrderedDictionary
Augmented dictionary which keeps keys in order
OrderedDictionary provides an augmented Python dictionary class which keeps the dictionary keys in the order they are added to the object.
Items are added, modified and removed as with a standard dictionary e.g.:
>>> d[key] = value
>>> value = d[key]
>>> del(d[key])
The ‘keys()’ method returns the OrderedDictionary’s keys in the correct order.
File handling utilities
- bcftbx.utils.getlines(filen)
Fetch lines from a file and return them one by one
This generator function tries to implement an efficient method of reading lines sequentially from a text file, by minimising the number of reads from the file and performing the line splitting in memory. It attempts to replicate the idiom:
>>> for line in io.open(filen):
...     ...
using:
>>> for line in getlines(filen):
...     ...
The file can be gzipped; this function should handle this invisibly provided that the file extension is ‘.gz’.
- Parameters:
filen (str) – path of the file to read lines from
- Yields:
String – next line of text from the file, with any newline character removed.
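The chunked-read approach described for getlines can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the library's code: the name `getlines_sketch` and the `bufsize` parameter are hypothetical, and gzip detection is based purely on the ‘.gz’ extension.

```python
import gzip
import io
import os
import tempfile


def getlines_sketch(filen, bufsize=102400):
    """Yield lines from a (possibly gzipped) text file, read in large chunks."""
    opener = gzip.open if filen.endswith(".gz") else io.open
    with opener(filen, "rt") as fp:
        leftover = ""
        while True:
            chunk = fp.read(bufsize)
            if not chunk:
                break
            # Split in memory; the last piece may be an incomplete line
            lines = (leftover + chunk).split("\n")
            leftover = lines.pop()
            for line in lines:
                yield line
        if leftover:
            # Final line that had no trailing newline
            yield leftover


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt.gz")
    with gzip.open(path, "wt") as fp:
        fp.write("first\nsecond\nthird")
    # A deliberately tiny buffer forces lines to span several reads
    lines = list(getlines_sketch(path, bufsize=4))
print(lines)
```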
File system wrappers and utilities
- class bcftbx.utils.PathInfo(path, basedir=None)
Collect and report information on a file
The PathInfo class provides an interface to getting general information on a path, which may point to a file, directory, link or non-existent location.
The properties provide information on whether the path is readable (i.e. accessible) by the current user, whether it is readable by members of the same group, who the owner is, which group it belongs to, when it was last modified, etc.
- chown(user=None, group=None)
Change associated owner and group
‘user’ and ‘group’ must be supplied as UID/GID numbers (or None to leave the current values unchanged).
Note that chown will fail when attempting to change the owner if the current process is not owned by root.
This is actually a wrapper to the os.lchown function, so it doesn’t follow symbolic links.
- property datetime
Return last modification time as datetime object
- property deepest_accessible_parent
Return longest accessible directory that leads to path
Tries to find the longest parent directory above path which is accessible by the current user.
If it’s not possible to find a parent that is accessible then raise an exception.
- property exists
Return True if the path refers to an existing location
Note that this is a wrapper to os.path.lexists so it reports the existence of symbolic links rather than their targets.
- property gid
Return associated GID (group ID)
Attempts to return the GID (group ID) number associated with the path.
If the GID can’t be found then returns None.
- property group
Return associated group name
Attempts to return the group name associated with the path. If the name can’t be found then tries to return the GID instead.
If neither piece of information can be found then returns None.
- property is_dir
Return True if path refers to a directory
- property is_executable
Return True if path refers to an executable file
- property is_file
Return True if path refers to a file
- property is_group_readable
Return True if the path exists and is group-readable
Paths may be reported as unreadable for various reasons, e.g. the target doesn’t exist, or doesn’t have permission for this user to read it, or if part of the path doesn’t allow the user to read the file.
- property is_group_writable
Return True if the path exists and is group-writable
Paths may be reported as unwritable for various reasons, e.g. the target doesn’t exist, or doesn’t have permission for this user to write to it, or if part of the path doesn’t allow the user to read the file.
- property is_link
Return True if path refers to a symbolic link
- property is_readable
Return True if the path exists and is readable by the owner
Paths may be reported as unreadable for various reasons, e.g. the target doesn’t exist, or doesn’t have permission for this user to read it, or if part of the path doesn’t allow the user to read the file.
- property mtime
Return last modification timestamp for path
- property path
Return the filesystem path
- relpath(dirn)
Return part of path relative to a directory
Wrapper for os.path.relpath(…).
- property resolve_link_via_parent
If path or parent directory is a link then return actual path
Resolves and returns the ‘real’ path for a path where either it or one of its parent directories is a symbolic link.
It will resolve multiple levels of symlinks to generate a path that is free of links (nb it is possible that the resolved path will not be an existing file or directory).
If there are no links in the directory tree then returns the full path of the input.
- property uid
Return associated UID (user ID)
Attempts to return the UID (user ID) number associated with the path.
If the UID can’t be found then returns None.
- property user
Return associated user name
Attempts to return the user name associated with the path. If the name can’t be found then tries to return the UID instead.
If neither piece of information can be found then returns None.
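The distinction noted for the exists property (reporting on a symbolic link rather than on its target) can be demonstrated with the standard library alone. The sketch below does not use PathInfo; it assumes a POSIX-like system where symbolic links can be created, and all file names are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    link = os.path.join(tmp, "dangling")
    # Create a link whose target does not exist
    os.symlink(os.path.join(tmp, "missing"), link)
    link_exists = os.path.lexists(link)    # tests the link itself
    target_exists = os.path.exists(link)   # follows the link to its target
    # lstat likewise reports on the link, e.g. its owner and group IDs
    st = os.lstat(link)
    owner_uid, owner_gid = st.st_uid, st.st_gid
print(link_exists, target_exists)
```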
- bcftbx.utils.mkdir(dirn, mode=None, recursive=False)
Make a directory
- Parameters:
dirn – the path of the directory to be created
mode – (optional) a mode specifier to be applied to the new directory once it has been created e.g. 0o775 or 0o664
recursive – (optional) if True then also create any intermediate parent directories if they don’t already exist
- bcftbx.utils.mklink(target, link_name, relative=False)
Make a symbolic link
- Parameters:
target – the file or directory to link to
link_name – name of the link
relative – if True then make a relative link (if possible); otherwise link to the target as given (default)
- bcftbx.utils.chmod(target, mode)
Change mode of file or directory
This is a wrapper for the os.chmod function, with the addition that it doesn’t follow symbolic links.
For symbolic links it attempts to use the os.lchmod function instead, as this operates on the link itself and not the link target. If os.lchmod is not available then links are ignored.
- Parameters:
target – file or directory to apply new mode to
mode – a valid mode specifier e.g. 0o775 or 0o664
- bcftbx.utils.touch(filename)
Create new empty file, or update modification time if already exists
- Parameters:
filename – name of the file to create (can include leading path)
- bcftbx.utils.format_file_size(fsize, units=None)
Format a file size from bytes to human-readable form
Takes a file size in bytes and returns a human-readable string, e.g. 4.0K, 186M, 1.5G.
Alternatively specify the required units via the ‘units’ argument.
- Parameters:
fsize – size in bytes
units – (optional) specify output in kb (‘K’), Mb (‘M’), Gb (‘G’) or Tb (‘T’)
- Returns:
Human-readable version of file size.
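A conversion of this kind might look like the sketch below. It is a hypothetical illustration only: the function name `format_size_sketch`, the use of multiples of 1024 and the fixed single decimal place are all assumptions, so its output formatting will not necessarily match that of format_file_size.

```python
def format_size_sketch(fsize, units=None):
    """Hypothetical formatter turning a byte count into e.g. '1.5G'."""
    if units is not None:
        units = units.upper()
    size = float(fsize)
    for suffix in ("K", "M", "G", "T"):
        size /= 1024.0
        # Stop at the requested unit, or at the first unit where the
        # value drops below 1024 when no unit was requested
        if units == suffix or (units is None and size < 1024.0):
            break
    return "%.1f%s" % (size, suffix)


print(format_size_sketch(4096))
print(format_size_sketch(1610612736))
print(format_size_sketch(1610612736, units="M"))
```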
- bcftbx.utils.commonprefix(path1, path2)
Determine common prefix path for path1 and path2
Use this in preference to os.path.commonprefix as the version in os.path compares the two paths in a character-wise fashion and so can give counter-intuitive matches; this version compares path components which seems more sensible.
For example: for two paths /mnt/dir1/file and /mnt/dir2/file, os.path.commonprefix will return /mnt/dir, whereas this function will return /mnt.
- Parameters:
path1 – first path in comparison
path2 – second path in comparison
- Returns:
Leading part of path which is common to both input paths.
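The difference between character-wise and component-wise comparison can be reproduced with the standard library. The sketch below uses `posixpath` (so the result is the same on any platform) and its `commonpath` function as a stand-in for a component-wise comparison; it is offered as an analogy and does not imply that bcftbx uses `commonpath` internally.

```python
import posixpath

paths = ["/mnt/dir1/file", "/mnt/dir2/file"]
# Character-wise: stops at the first differing character
by_character = posixpath.commonprefix(paths)
# Component-wise: stops at the first differing path component
by_component = posixpath.commonpath(paths)
print(by_character)
print(by_component)
```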
- bcftbx.utils.is_gzipped_file(filename)
Check if a file has a .gz extension
- Parameters:
filename – name of the file to be tested (can include leading path)
- Returns:
True if filename has trailing .gz extension, False if not.
- bcftbx.utils.rootname(name)
Remove all extensions from name
- Parameters:
name – name of a file
- Returns:
Leading part of name up to first dot, i.e. name without any trailing extensions.
- bcftbx.utils.find_program(name)
Find a program on the PATH
Search the current PATH for the specified program name and return the full path, or None if not found.
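A PATH search of this kind can be sketched as shown below. The function name `find_program_sketch` and its optional `path` argument are hypothetical (the latter exists only so the example can run without touching the real environment); the standard library's `shutil.which` offers comparable functionality.

```python
import os
import tempfile


def find_program_sketch(name, path=None):
    """Return the full path of 'name' on a PATH-style string, else None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for dirn in path.split(os.pathsep):
        candidate = os.path.join(dirn, name)
        # Accept only regular files that the current user may execute
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return os.path.abspath(candidate)
    return None


with tempfile.TemporaryDirectory() as tmp:
    tool = os.path.join(tmp, "mytool")
    with open(tool, "w") as fp:
        fp.write("#!/bin/sh\n")
    os.chmod(tool, 0o755)
    found = find_program_sketch("mytool", path=tmp)
    missing = find_program_sketch("no-such-tool", path=tmp)
print(found is not None, missing)
```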
- bcftbx.utils.get_current_user()
Return name of the current user
Looks up user name for the current user; returns None if no matching name can be found.
- bcftbx.utils.get_user_from_uid(uid)
Return user name from UID
Looks up user name matching the supplied UID; returns None if no matching name can be found.
- bcftbx.utils.get_uid_from_user(user)
Return UID from user name
Looks up UID matching the supplied user name; returns None if no matching name can be found.
NB returned UID will be an integer.
- bcftbx.utils.get_group_from_gid(gid)
Return group name from GID
Looks up group name matching the supplied GID; returns None if no matching name can be found.
- bcftbx.utils.get_gid_from_group(group)
Return GID from group name
Looks up GID matching the supplied group name; returns None if no matching name can be found.
NB returned GID will be an integer.
- bcftbx.utils.walk(dirn, include_dirs=True, pattern=None)
Traverse the directory, subdirectories and files
Essentially this ‘walk’ function is a convenience wrapper for the ‘os.walk’ function.
- Parameters:
dirn – top-level directory to start traversal from
include_dirs – if True then yield directories as well as files (default)
pattern – if not None then specifies a regular expression pattern which restricts the set of yielded files and directories to a subset of those which match the pattern
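The following sketch shows how a wrapper around os.walk with these three arguments could behave. It rests on assumptions that the description above does not settle: the pattern is searched for anywhere in the full path, and the top-level directory itself is not yielded. The name `walk_sketch` is hypothetical.

```python
import os
import re
import tempfile


def walk_sketch(dirn, include_dirs=True, pattern=None):
    """Yield paths under 'dirn', optionally filtered by a regular expression."""
    matcher = re.compile(pattern) if pattern is not None else None
    for dirpath, dirnames, filenames in os.walk(dirn):
        names = list(filenames)
        if include_dirs:
            names.extend(dirnames)
        for name in names:
            path = os.path.join(dirpath, name)
            # Assumption: the pattern may match anywhere in the path
            if matcher is None or matcher.search(path):
                yield path


with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "sub"))
    for rel in ("b.log", os.path.join("sub", "a.txt")):
        with open(os.path.join(tmp, rel), "w"):
            pass
    everything = sorted(os.path.relpath(p, tmp) for p in walk_sketch(tmp))
    files_only = sorted(os.path.relpath(p, tmp)
                        for p in walk_sketch(tmp, include_dirs=False))
    text_files = sorted(os.path.relpath(p, tmp)
                        for p in walk_sketch(tmp, pattern=r"\.txt$"))
print(everything)
print(files_only)
print(text_files)
```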
- bcftbx.utils.list_dirs(parent, matches=None, startswith=None)
Return list of subdirectories relative to ‘parent’
- Parameters:
parent – directory to list subdirectories of
matches – if not None then only include subdirectories that exactly match the supplied string
startswith – if not None then return subset of subdirectories that start with the supplied string
- Returns:
List of subdirectories (relative to the parent dir).
- bcftbx.utils.strip_ext(name, ext=None)
Strip extension from file name
Given a file name or path, remove the extension (including the dot) and return just the leading part of the name.
If an extension is explicitly specified then only remove the extension if it matches.
Extension can be multipart e.g. ‘fastq.gz’ and can include a leading dot e.g. ‘.gz’ or ‘gz’.
- Parameters:
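name – name of a file
ext – (optional) extension to remove, with or without a leading dot
- Returns:
Leading part of name excluding the specified extension if it matches, or excluding the final extension (i.e. up to the last dot) if no extension is specified.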
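The contrast between rootname (remove every extension) and strip_ext (remove one, optionally named, extension) is illustrated by the sketch below. Both functions are hypothetical reimplementations written for this example; details such as the handling of leading directories are assumptions.

```python
import os.path


def rootname_sketch(name):
    """Keep everything before the first dot of the file name."""
    dirn, base = os.path.split(name)
    return os.path.join(dirn, base.split(".")[0])


def strip_ext_sketch(name, ext=None):
    """Remove the final extension, or 'ext' if given and it matches."""
    if ext is None:
        return os.path.splitext(name)[0]
    if not ext.startswith("."):
        # Accept 'gz' as well as '.gz'
        ext = "." + ext
    if name.endswith(ext):
        return name[:-len(ext)]
    return name


print(rootname_sketch("sample.fastq.gz"))
print(strip_ext_sketch("sample.fastq.gz"))
print(strip_ext_sketch("sample.fastq.gz", "fastq.gz"))
print(strip_ext_sketch("sample.fastq.gz", "bam"))
```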
Symbolic link handling
- class bcftbx.utils.Symlink(path)
Class for interrogating and modifying symbolic links
The Symlink class provides an interface for getting information about a symbolic link.
To create a new Symlink instance do e.g.:
>>> l = Symlink('my_link.lnk')
Information about the link can be obtained via the various properties:
target = returns the link target
is_absolute = reports if the target represents an absolute link
is_broken = reports if the target doesn’t exist
There are also methods:
resolve_target() = returns the normalised absolute path to the target
update_target() = updates the target to a new location
- property is_absolute
Return True if the link target is an absolute link
- property is_broken
Return True if the link target doesn’t exist i.e. link is broken
- resolve_target()
Return the normalised absolute path to the link target
- property target
Return the target of the symlink
- update_target(new_target)
Replace the current link target with new_target
- Parameters:
new_target – path to replace the existing target with
- bcftbx.utils.links(dirn)
Traverse and return all symbolic links under a directory
Given a starting directory, traverses the structure underneath and yields the path for each symlink that is found.
- Parameters:
dirn – name of the top-level directory
- Returns:
Yields the name and full path for each symbolic link under ‘dirn’.
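The kind of information exposed by Symlink, and the traversal performed by links, can both be approximated with standard library calls. The sketch below is illustrative only: `links_sketch` and `describe_link` are hypothetical names, the dictionary keys merely echo the property names listed above, and a POSIX-like system is assumed.

```python
import os
import tempfile


def links_sketch(dirn):
    """Yield the path of every symbolic link found under 'dirn'."""
    for dirpath, dirnames, filenames in os.walk(dirn):
        # os.walk lists links to directories among 'dirnames'
        # and all other links among 'filenames'
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                yield path


def describe_link(path):
    """Collect target, is_absolute, is_broken and resolved target."""
    target = os.readlink(path)
    # A relative target is interpreted relative to the link's directory
    resolved = os.path.abspath(os.path.join(os.path.dirname(path), target))
    return {
        "target": target,
        "is_absolute": os.path.isabs(target),
        "is_broken": not os.path.exists(resolved),
        "resolved": resolved,
    }


with tempfile.TemporaryDirectory() as tmp:
    real = os.path.join(tmp, "real.txt")
    with open(real, "w"):
        pass
    # One relative link to an existing file, one absolute dangling link
    os.symlink("real.txt", os.path.join(tmp, "good"))
    os.symlink(os.path.join(tmp, "missing"), os.path.join(tmp, "bad"))
    found = sorted(os.path.basename(p) for p in links_sketch(tmp))
    good = describe_link(os.path.join(tmp, "good"))
    bad = describe_link(os.path.join(tmp, "bad"))
    good_resolves_to_real = good["resolved"] == os.path.abspath(real)
print(found)
```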
Sample name utilities
- bcftbx.utils.extract_initials(name)
Return leading initials from the library or sample name
Conventionally the experimenter’s initials are the leading characters of the name e.g. ‘DR’ for ‘DR1’, ‘EP’ for ‘EP_NCYC2669’, ‘CW’ for ‘CW_TI’ etc.
- Parameters:
name – the name of a sample or library
- Returns:
The leading initials from the name.
- bcftbx.utils.extract_prefix(name)
Return the library or sample name prefix
- Parameters:
name – the name of a sample or library
- Returns:
The prefix consisting of the name with trailing numbers removed, e.g. ‘LD_C’ for ‘LD_C1’
- bcftbx.utils.extract_index_as_string(name)
Return the library or sample name index as a string
- Parameters:
name – the name of a sample or library
- Returns:
The index, consisting of the trailing numbers from the name. It is returned as a string to preserve leading zeroes, e.g. ‘1’ for ‘LD_C1’, ‘07’ for ‘DR07’ etc
- bcftbx.utils.extract_index(name)
Return the library or sample name index as an integer
- Parameters:
name – the name of a sample or library
- Returns:
The index as an integer, or None if the index cannot be converted to integer format.
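The decomposition of a name into initials, prefix and index can be sketched with regular expressions. The functions below are hypothetical illustrations that reproduce the examples given above; in particular, treating the initials as the leading run of letters is an assumption.

```python
import re


def extract_initials_sketch(name):
    # Assumption: the initials are the leading run of letters
    return re.match(r"[A-Za-z]*", name).group(0)


def split_name_sketch(name):
    """Split a name into (prefix, trailing digits as a string)."""
    match = re.match(r"^(.*?)(\d*)$", name)
    return match.group(1), match.group(2)


def extract_index_sketch(name):
    """Return the trailing digits as an integer, or None if there are none."""
    index = split_name_sketch(name)[1]
    return int(index) if index else None


for name in ("DR07", "LD_C1", "EP_NCYC2669", "CW_TI"):
    print(name, extract_initials_sketch(name),
          split_name_sketch(name), extract_index_sketch(name))
```

Keeping the index as a string in `split_name_sketch` preserves leading zeroes, which are lost once it is converted to an integer.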
- bcftbx.utils.pretty_print_names(name_list)
Given a list of library or sample names, format for pretty printing.
- Parameters:
name_list – a list or tuple of library or sample names
- Returns:
String with a condensed description of the library names, for example:
[‘DR1’, ‘DR2’, ‘DR3’, ‘DR4’] -> ‘DR1-4’
- bcftbx.utils.name_matches(name, pattern)
Simple wildcard matching of project and sample names
Matching options are:
exact match of a single name e.g. pattern ‘PJB’ matches ‘PJB’
match start of a name using trailing ‘*’ e.g. pattern ‘PJ*’ matches ‘PJB’, ‘PJBriggs’ etc
match using multiple patterns by separating with comma e.g. pattern ‘PJB,IJD’ matches ‘PJB’ or ‘IJD’. Subpatterns can include trailing ‘*’ character to match more names.
- Parameters:
name – text to match against pattern
pattern – simple ‘glob’-like pattern to match against
- Returns:
True if name matches pattern; False otherwise.
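The three matching options listed above can be expressed in a few lines. The following is a sketch of the described rules rather than the library's own code; the name `name_matches_sketch` is hypothetical.

```python
def name_matches_sketch(name, pattern):
    """Match 'name' against comma-separated subpatterns with optional trailing '*'."""
    for subpattern in pattern.split(","):
        if subpattern.endswith("*"):
            # Trailing wildcard: compare only the leading part
            if name.startswith(subpattern[:-1]):
                return True
        elif name == subpattern:
            # No wildcard: require an exact match
            return True
    return False


print(name_matches_sketch("PJB", "PJB"))
print(name_matches_sketch("PJBriggs", "PJ*"))
print(name_matches_sketch("IJD", "PJB,IJD"))
print(name_matches_sketch("PJBriggs", "PJB"))
```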
File manipulations
- bcftbx.utils.concatenate_fastq_files(merged_fastq, fastq_files, bufsize=10240, overwrite=False, verbose=True)
Create a single FASTQ file by concatenating one or more FASTQs
Given a list or tuple of FASTQ files (which can be compressed or uncompressed or a combination), creates a single output FASTQ by concatenating the contents.
- Parameters:
merged_fastq – name of output FASTQ file (must not already exist unless ‘overwrite’ is True)
fastq_files – list of FASTQ files to concatenate
bufsize – (optional) size of buffer to use for copying data
overwrite – (optional) if True then overwrite the output file if it already exists (otherwise raise OSError); default is False
verbose – (optional) if True then report operations to stdout, otherwise operate quietly
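A buffered concatenation of mixed compressed and uncompressed inputs could be sketched as below. This is a hypothetical illustration, not the bcftbx implementation: the name `concatenate_sketch` is made up, and it is assumed that compression of both inputs and output is inferred from a ‘.gz’ extension.

```python
import gzip
import os
import shutil
import tempfile


def concatenate_sketch(merged, inputs, bufsize=10240, overwrite=False):
    """Concatenate plain and/or gzipped files into a single output file."""
    if os.path.exists(merged) and not overwrite:
        raise OSError("Output file '%s' already exists" % merged)

    def opener(name):
        # Assumption: compression is inferred from a '.gz' extension
        return gzip.open if name.endswith(".gz") else open

    with opener(merged)(merged, "wb") as out:
        for name in inputs:
            with opener(name)(name, "rb") as src:
                # Copy the decompressed data in chunks of 'bufsize' bytes
                shutil.copyfileobj(src, out, bufsize)


with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "a.fastq")
    zipped = os.path.join(tmp, "b.fastq.gz")
    merged = os.path.join(tmp, "merged.fastq")
    with open(plain, "wb") as fp:
        fp.write(b"@r1\nACGT\n+\nIIII\n")
    with gzip.open(zipped, "wb") as fp:
        fp.write(b"@r2\nTTTT\n+\nIIII\n")
    concatenate_sketch(merged, [plain, zipped])
    with open(merged, "rb") as fp:
        content = fp.read()
    # A second attempt without overwrite=True should be refused
    try:
        concatenate_sketch(merged, [plain])
        refused = False
    except OSError:
        refused = True
print(content.decode())
```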
Text manipulations
- bcftbx.utils.split_into_lines(text, char_limit, delimiters=' \t\n', sympathetic=False)
Split a string into multiple lines with maximum length
Splits a string into multiple lines on one or more delimiters (defaults to the whitespace characters, i.e. space, tab and newline), such that each line is no longer than a specified length.
For example:
>>> split_into_lines("This is some text to split",10)
['This is','some text','to split']
If it’s not possible to split part of the text to a suitable length then the line is split “unsympathetically” at the line length, e.g.
>>> split_into_lines("This is supercalifragilicous text",10)
['This is','supercalif','ragilicous','text']
Set the ‘sympathetic’ flag to True to include a hyphen to indicate that a word has been broken, e.g.
>>> split_into_lines("This is supercalifragilicous text",10,
...                  sympathetic=True)
['This is','supercali-','fragilico-','us text']
To use an alternative set of delimiter characters, set the ‘delimiters’ argument, e.g.
>>> split_into_lines("This: is some text",10,delimiters=':')
['This',' is some t','ext']
- Parameters:
text – string of text to be split into lines
char_limit – maximum length for any given line
delimiters – optional, specify a set of non-default delimiter characters (defaults to whitespace)
sympathetic – optional, if True then add hyphen to indicate when a word has been broken
- Returns:
List of lines (i.e. strings).