bcftbx.utils
utils
Utility classes and functions shared between BCF codes.
General utility classes:
AttributeDictionary OrderedDictionary
File reading utilities:
getlines
File system wrappers and utilities:
PathInfo mkdir mkdirs mklink chmod touch format_file_size commonprefix is_gzipped_file rootname find_program get_current_user get_user_from_uid get_uid_from_user get_group_from_gid get_gid_from_group get_hostname walk list_dirs strip_ext
Symbolic link handling:
Symlink links
Sample name utilities:
extract_initials extract_prefix extract_index_as_string extract_index pretty_print_names name_matches
File manipulations:
concatenate_fastq_files
Text manipulations:
split_into_lines
Command line parsing utilities:
parse_named_lanes parse_lanes
General utility classes
-
class bcftbx.utils.AttributeDictionary(**args)
Dictionary-like object with items accessible as attributes
AttributeDictionary provides a dictionary-like object where the value of items can also be accessed as attributes of the object.
For example:
>>> d = AttributeDictionary()
>>> d['salutation'] = "hello"
>>> d.salutation
'hello'
Attributes can only be assigned using dictionary item assignment notation, i.e. d['key'] = value; d.key = value doesn’t work.
If the attribute doesn’t match a stored item then an AttributeError exception is raised.
len(d) returns the number of stored items.
The AttributeDictionary behaves like a dictionary for iterations, for example:
>>> for attr in d:
...     print("%s = %s" % (attr,d[attr]))
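The behaviour described above can be illustrated with a small stand-in class. This is only a sketch of the idea, not the bcftbx implementation; the name AttributeDictSketch and the private _items store are hypothetical.

```python
class AttributeDictSketch:
    """Hypothetical stand-in: dictionary items readable as attributes."""

    def __init__(self, **args):
        # Keep the items in a private dict stored on the instance
        self.__dict__["_items"] = dict(args)

    def __getitem__(self, key):
        return self._items[key]

    def __setitem__(self, key, value):
        self._items[key] = value

    def __getattr__(self, attr):
        # Only called when normal attribute lookup fails
        try:
            return self.__dict__["_items"][attr]
        except KeyError:
            raise AttributeError("No item '%s'" % attr)

    def __len__(self):
        return len(self._items)

    def __iter__(self):
        return iter(self._items)


d = AttributeDictSketch()
d["salutation"] = "hello"
print(d.salutation)
```

Routing the lookup through __getattr__ means a missing item surfaces as AttributeError, matching the behaviour documented for the real class.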
-
class bcftbx.utils.OrderedDictionary
Augmented dictionary which keeps keys in order
OrderedDictionary provides an augmented Python dictionary class which keeps the dictionary keys in the order they are added to the object.
Items are added, modified and removed as with a standard dictionary e.g.:
>>> d[key] = value
>>> value = d[key]
>>> del(d[key])
The ‘keys()’ method returns the OrderedDictionary’s keys in the correct order.
File handling utilities
-
bcftbx.utils.getlines(filen)
Fetch lines from a file and return them one by one
This generator function tries to implement an efficient method of reading lines sequentially from a text file, by minimising the number of reads from the file and performing the line splitting in memory. It attempts to replicate the idiom:
>>> for line in io.open(filen):
...     ...
using:
>>> for line in getlines(filen):
...     ...
The file can be gzipped; this function should handle this invisibly provided that the file extension is ‘.gz’.
Parameters: filen (str) – path of the file to read lines from
Yields: String – next line of text from the file, with any newline character removed.
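The approach described for getlines (few large reads, line splitting in memory) might look something like the following. It is a sketch under stated assumptions, not the library's code: the name getlines_sketch and the bufsize parameter are hypothetical, and gzip detection relies only on the '.gz' extension.

```python
import gzip
import io


def getlines_sketch(filen, bufsize=102400):
    """Yield lines from a text file (optionally gzipped) without newlines."""
    # Assumption: a '.gz' extension means the file is gzip-compressed
    if filen.endswith(".gz"):
        fp = gzip.open(filen, "rt")
    else:
        fp = io.open(filen, "rt")
    with fp:
        leftover = ""
        while True:
            chunk = fp.read(bufsize)
            if not chunk:
                break
            # Split the buffered text in memory; the last piece may be
            # an incomplete line, so carry it over to the next read
            pieces = (leftover + chunk).split("\n")
            leftover = pieces.pop()
            for line in pieces:
                yield line
        # Emit a final line that had no trailing newline
        if leftover:
            yield leftover
```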
File system wrappers and utilities
-
class bcftbx.utils.PathInfo(path, basedir=None)
Collect and report information on a file
The PathInfo class provides an interface to getting general information on a path, which may point to a file, directory, link or non-existent location.
The properties report whether the path is readable (i.e. accessible) by the current user, whether it is readable by members of the same group, who owns it, which group it belongs to, when it was last modified, and so on.
-
chown(user=None, group=None)
Change associated owner and group
‘user’ and ‘group’ must be supplied as UID/GID numbers (or None to leave the current values unchanged).
Note that chown will fail when attempting to change the owner if the current process is not owned by root.
This is actually a wrapper to the os.lchown function, so it doesn’t follow symbolic links.
-
datetime
Return last modification time as datetime object
-
deepest_accessible_parent
Return longest accessible directory that leads to path
Tries to find the longest parent directory above path which is accessible by the current user.
If it’s not possible to find a parent that is accessible then raise an exception.
-
exists
Return True if the path refers to an existing location
Note that this is a wrapper to os.path.lexists so it reports the existence of symbolic links rather than their targets.
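The difference between the two standard library calls can be seen with a broken symbolic link. This snippet uses only os.path and a temporary directory; it does not use PathInfo itself, and it assumes a platform that permits creating symbolic links.

```python
import os
import tempfile

tmpdir = tempfile.mkdtemp()
link = os.path.join(tmpdir, "broken_link")
# Create a symlink whose target does not exist
os.symlink(os.path.join(tmpdir, "no_such_file"), link)

print(os.path.exists(link))   # follows the link to its missing target
print(os.path.lexists(link))  # examines the link itself
```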
-
gid
Return associated GID (group ID)
Attempts to return the GID (group ID) number associated with the path.
If the GID can’t be found then returns None.
-
group
Return associated group name
Attempts to return the group name associated with the path. If the name can’t be found then tries to return the GID instead.
If neither piece of information can be found then returns None.
-
is_dir
Return True if path refers to a directory
-
is_executable
Return True if path refers to an executable file
-
is_file
Return True if path refers to a file
-
is_group_readable
Return True if the path exists and is group-readable
Paths may be reported as unreadable for various reasons, e.g. the target doesn’t exist, or doesn’t have permission for this user to read it, or if part of the path doesn’t allow the user to read the file.
-
is_group_writable
Return True if the path exists and is group-writable
Paths may be reported as unwritable for various reasons, e.g. the target doesn’t exist, or doesn’t have permission for this user to write to it, or if part of the path doesn’t allow the user to read the file.
-
is_link
Return True if path refers to a symbolic link
-
is_readable
Return True if the path exists and is readable by the owner
Paths may be reported as unreadable for various reasons, e.g. the target doesn’t exist, or doesn’t have permission for this user to read it, or if part of the path doesn’t allow the user to read the file.
-
mtime
Return last modification timestamp for path
-
path
Return the filesystem path
-
relpath(dirn)
Return part of path relative to a directory
Wrapper for os.path.relpath(…).
-
resolve_link_via_parent
If path or parent directory is a link then return actual path
Resolves and returns the ‘real’ path for a path where either it or one of its parent directories is a symbolic link.
It will resolve multiple levels of symlinks to generate a path that is free of links (nb it is possible that the resolved path will not be an existing file or directory).
If there are no links in the directory tree then returns the full path of the input.
-
uid
Return associated UID (user ID)
Attempts to return the UID (user ID) number associated with the path.
If the UID can’t be found then returns None.
-
user
Return associated user name
Attempts to return the user name associated with the path. If the name can’t be found then tries to return the UID instead.
If neither piece of information can be found then returns None.
-
-
bcftbx.utils.mkdir(dirn, mode=None, recursive=False)
Make a directory
Parameters: - dirn – the path of the directory to be created
- mode – (optional) a mode specifier to be applied to the new directory once it has been created, e.g. 0o775 or 0o664 (written 0775 or 0664 in Python 2)
- recursive – (optional) if True then also create any intermediate parent directories if they don’t already exist
-
bcftbx.utils.mklink(target, link_name, relative=False)
Make a symbolic link
Parameters: - target – the file or directory to link to
- link_name – name of the link
- relative – if True then make a relative link (if possible); otherwise link to the target as given (default)
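One plausible way to build a relative link with the standard library is sketched below. The function name mklink_sketch is hypothetical, and how the real mklink decides whether a relative link is "possible" is not shown here; this sketch always rewrites the target when relative is True.

```python
import os


def mklink_sketch(target, link_name, relative=False):
    """Create a symbolic link, optionally with a relative target."""
    if relative:
        # Express the target relative to the directory holding the link
        link_dir = os.path.dirname(os.path.abspath(link_name))
        target = os.path.relpath(target, link_dir)
    os.symlink(target, link_name)
```

A relative link keeps working if the whole directory tree containing both the link and its target is moved, which an absolute link does not.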
-
bcftbx.utils.chmod(target, mode)
Change mode of file or directory
This is a wrapper for the os.chmod function, with the addition that it doesn’t follow symbolic links.
For symbolic links it attempts to use the os.lchmod function instead, as this operates on the link itself and not the link target. If os.lchmod is not available then links are ignored.
Parameters: - target – file or directory to apply new mode to
- mode – a valid mode specifier, e.g. 0o775 or 0o664 (written 0775 or 0664 in Python 2)
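The link-handling rule described above could be expressed roughly as follows. This is a sketch rather than the library's code; chmod_sketch is a hypothetical name, and os.lchmod is only present on some platforms, hence the hasattr check.

```python
import os


def chmod_sketch(target, mode):
    """Change mode of a file or directory without following symlinks."""
    if os.path.islink(target):
        # Operate on the link itself where the platform allows it;
        # otherwise leave the link (and its target) untouched
        if hasattr(os, "lchmod"):
            os.lchmod(target, mode)
        return
    os.chmod(target, mode)
```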
-
bcftbx.utils.touch(filename)
Create new empty file, or update modification time if already exists
Parameters: filename – name of the file to create (can include leading path)
-
bcftbx.utils.format_file_size(fsize, units=None)
Format a file size from bytes to human-readable form
Takes a file size in bytes and returns a human-readable string, e.g. 4.0K, 186M, 1.5G.
Alternatively specify the required units via the ‘units’ argument.
Parameters: - fsize – size in bytes
- units – (optional) specify output in kb (‘K’), Mb (‘M’), Gb (‘G’) or Tb (‘T’)
Returns: Human-readable version of file size.
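A minimal sketch of this kind of conversion is given below. It assumes 1K = 1024 bytes and always prints one decimal place, so its output format may differ from the real function (compare the ‘186M’ example above); format_file_size_sketch is a hypothetical name.

```python
def format_file_size_sketch(fsize, units=None):
    """Format a size in bytes using K, M, G or T units (1K = 1024 bytes)."""
    names = "KMGT"
    size = float(fsize)
    if units is not None:
        # Convert directly to the requested unit
        power = names.index(units.upper()) + 1
        return "%.1f%s" % (size / 1024 ** power, names[power - 1])
    # Otherwise pick the smallest unit giving a value below 1024
    for unit in names:
        size /= 1024.0
        if size < 1024 or unit == "T":
            return "%.1f%s" % (size, unit)


print(format_file_size_sketch(4096))
print(format_file_size_sketch(1048576, units="K"))
```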
-
bcftbx.utils.commonprefix(path1, path2)
Determine common prefix path for path1 and path2
Use this in preference to os.path.commonprefix as the version in os.path compares the two paths in a character-wise fashion and so can give counter-intuitive matches; this version compares path components which seems more sensible.
For example: for two paths /mnt/dir1/file and /mnt/dir2/file, os.path.commonprefix will return /mnt/dir, whereas this function will return /mnt.
Parameters: - path1 – first path in comparison
- path2 – second path in comparison
Returns: Leading part of path which is common to both input paths.
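The contrast with os.path.commonprefix can be reproduced with a component-wise comparison. The function below is a sketch of that idea only, assuming POSIX-style paths separated by ‘/’; commonprefix_sketch is a hypothetical name.

```python
import os


def commonprefix_sketch(path1, path2):
    """Return the leading directory components shared by two paths."""
    common = []
    # Assumption: POSIX-style paths using '/' as the separator
    for part1, part2 in zip(path1.split("/"), path2.split("/")):
        if part1 != part2:
            break
        common.append(part1)
    return "/".join(common)


# Character-wise comparison from the standard library
print(os.path.commonprefix(["/mnt/dir1/file", "/mnt/dir2/file"]))
# Component-wise comparison
print(commonprefix_sketch("/mnt/dir1/file", "/mnt/dir2/file"))
```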
-
bcftbx.utils.is_gzipped_file(filename)
Check if a file has a .gz extension
Parameters: filename – name of the file to be tested (can include leading path)
Returns: True if filename has trailing .gz extension, False if not.
-
bcftbx.utils.rootname(name)
Remove all extensions from name
Parameters: name – name of a file
Returns: Leading part of name up to first dot, i.e. name without any trailing extensions.
-
bcftbx.utils.find_program(name)
Find a program on the PATH
Search the current PATH for the specified program name and return the full path, or None if not found.
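A PATH search of this kind can be sketched with the standard library as follows. This is an illustration, not the library's implementation; find_program_sketch is a hypothetical name, and the standard shutil.which function offers similar behaviour.

```python
import os


def find_program_sketch(name):
    """Return the full path of an executable on the PATH, or None."""
    for dirn in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(dirn, name)
        # Accept only regular files that the current user may execute
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return os.path.abspath(candidate)
    return None
```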
-
bcftbx.utils.get_current_user()
Return name of the current user
Looks up user name for the current user; returns None if no matching name can be found.
-
bcftbx.utils.get_user_from_uid(uid)
Return user name from UID
Looks up user name matching the supplied UID; returns None if no matching name can be found.
-
bcftbx.utils.get_uid_from_user(user)
Return UID from user name
Looks up UID matching the supplied user name; returns None if no matching name can be found.
NB returned UID will be an integer.
-
bcftbx.utils.get_group_from_gid(gid)
Return group name from GID
Looks up group name matching the supplied GID; returns None if no matching name can be found.
-
bcftbx.utils.get_gid_from_group(group)
Return GID from group name
Looks up GID matching the supplied group name; returns None if no matching name can be found.
NB returned GID will be an integer.
-
bcftbx.utils.walk(dirn, include_dirs=True, pattern=None)
Traverse the directory, subdirectories and files
Essentially this ‘walk’ function is a convenience wrapper for the ‘os.walk’ function.
Parameters: - dirn – top-level directory to start traversal from
- include_dirs – if True then yield directories as well as files (default)
- pattern – if not None then specifies a regular expression pattern which restricts the set of yielded files and directories to a subset of those which match the pattern
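A wrapper of this shape around os.walk might look like the sketch below. Several details are assumptions rather than documented behaviour: the pattern is searched against the full path, the top-level directory itself is not yielded, and walk_sketch is a hypothetical name.

```python
import os
import re


def walk_sketch(dirn, include_dirs=True, pattern=None):
    """Yield paths under dirn, optionally filtered by a regex."""
    matcher = re.compile(pattern) if pattern is not None else None
    for dirpath, dirnames, filenames in os.walk(dirn):
        names = list(filenames)
        if include_dirs:
            names = dirnames + names
        for name in names:
            path = os.path.join(dirpath, name)
            # Assumption: the pattern is matched against the full path
            if matcher is None or matcher.search(path):
                yield path
```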
-
bcftbx.utils.list_dirs(parent, matches=None, startswith=None)
Return list of subdirectories relative to ‘parent’
Parameters: - parent – directory to list subdirectories of
- matches – if not None then only include subdirectories that exactly match the supplied string
- startswith – if not None then return the subset of subdirectories that start with the supplied string
Returns: List of subdirectories (relative to the parent dir).
-
bcftbx.utils.strip_ext(name, ext=None)
Strip extension from file name
Given a file name or path, remove the extension (including the dot) and return just the leading part of the name.
If an extension is explicitly specified then only remove the extension if it matches.
Extension can be multipart e.g. ‘fastq.gz’ and can include a leading dot e.g. ‘.gz’ or ‘gz’.
Parameters: name – name of a file
Returns: Leading part of name excluding specified extension, or first extension i.e. to last dot.
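The rules above (strip to the last dot by default, or strip only a matching extension, with or without its leading dot) can be sketched as follows. This is an illustration under those stated rules, not the library's code; strip_ext_sketch is a hypothetical name.

```python
import os


def strip_ext_sketch(name, ext=None):
    """Remove a trailing extension from a file name or path."""
    if ext is None:
        # No extension given: strip from the last dot onwards
        return os.path.splitext(name)[0]
    # Accept the extension with or without a leading dot
    if not ext.startswith("."):
        ext = "." + ext
    if name.endswith(ext):
        return name[: -len(ext)]
    # Extension doesn't match: return the name unchanged
    return name


print(strip_ext_sketch("reads.fastq.gz"))
print(strip_ext_sketch("reads.fastq.gz", "fastq.gz"))
```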
Symbolic link handling
-
class bcftbx.utils.Symlink(path)
Class for interrogating and modifying symbolic links
The Symlink class provides an interface for getting information about a symbolic link.
To create a new Symlink instance do e.g.:
>>> l = Symlink('my_link.lnk')
Information about the link can be obtained via the various properties:
- target = returns the link target
- is_absolute = reports if the target represents an absolute link
- is_broken = reports if the target doesn’t exist
There are also methods:
- resolve_target() = returns the normalised absolute path to the target
- update_target() = updates the target to a new location
-
is_absolute
Return True if the link target is an absolute link
-
is_broken
Return True if the link target doesn’t exist i.e. link is broken
-
resolve_target()
Return the normalised absolute path to the link target
-
target
Return the target of the symlink
-
update_target(new_target)
Replace the current link target with new_target
Parameters: new_target – path to replace the existing target with
-
bcftbx.utils.links(dirn)
Traverse and return all symbolic links under a directory
Given a starting directory, traverses the structure underneath and yields the path for each symlink that is found.
Parameters: dirn – name of the top-level directory
Returns: Yields the name and full path for each symbolic link under ‘dirn’.
Sample name utilities
-
bcftbx.utils.extract_initials(name)
Return leading initials from the library or sample name
Conventionally the experimenter’s initials are the leading characters of the name, e.g. ‘DR’ for ‘DR1’, ‘EP’ for ‘EP_NCYC2669’, ‘CW’ for ‘CW_TI’ etc.
Parameters: name – the name of a sample or library
Returns: The leading initials from the name.
-
bcftbx.utils.extract_prefix(name)
Return the library or sample name prefix
Parameters: name – the name of a sample or library
Returns: The prefix consisting of the name with trailing numbers removed, e.g. ‘LD_C’ for ‘LD_C1’
-
bcftbx.utils.extract_index_as_string(name)
Return the library or sample name index as a string
Parameters: name – the name of a sample or library
Returns: The index, consisting of the trailing numbers from the name. It is returned as a string to preserve leading zeroes, e.g. ‘1’ for ‘LD_C1’, ‘07’ for ‘DR07’ etc.
-
bcftbx.utils.extract_index(name)
Return the library or sample name index as an integer
Parameters: name – the name of a sample or library
Returns: The index as an integer, or None if the index cannot be converted to integer format.
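The prefix/index split described for extract_prefix, extract_index_as_string and extract_index can be illustrated with a single regular expression. This is a sketch of the convention only; split_name_sketch and index_sketch are hypothetical names and may not mirror how the library handles edge cases.

```python
import re


def split_name_sketch(name):
    """Split a sample name into (prefix, index string)."""
    # Lazy prefix followed by the longest possible run of trailing digits
    match = re.match(r"^(.*?)(\d*)$", name)
    return match.group(1), match.group(2)


def index_sketch(name):
    """Return the trailing index as an integer, or None if absent."""
    index = split_name_sketch(name)[1]
    return int(index) if index else None


print(split_name_sketch("LD_C1"))
print(split_name_sketch("DR07"))
print(index_sketch("DR07"))
```

Keeping the index as a string in the first function preserves leading zeroes, which the integer conversion discards.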
-
bcftbx.utils.pretty_print_names(name_list)
Given a list of library or sample names, format for pretty printing.
Parameters: name_list – a list or tuple of library or sample names
Returns: String with a condensed description of the library names, for example: [‘DR1’, ‘DR2’, ‘DR3’, ‘DR4’] -> ‘DR1-4’
-
bcftbx.utils.name_matches(name, pattern)
Simple wildcard matching of project and sample names
Matching options are:
- exact match of a single name e.g. pattern ‘PJB’ matches ‘PJB’
- match start of a name using trailing ‘*’ e.g. pattern ‘PJ*’ matches ‘PJB’, ‘PJBriggs’ etc.
- match using multiple patterns by separating with comma e.g. pattern ‘PJB,IJD’ matches ‘PJB’ or ‘IJD’. Subpatterns can include trailing ‘*’ character to match more names.
Parameters: - name – text to match against pattern
- pattern – simple ‘glob’-like pattern to match against
Returns: True if name matches pattern; False otherwise.
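The matching options listed above can be approximated with the standard fnmatch module, as in the sketch below. Note that fnmatch is more permissive than the documented rules (it also accepts ‘?’ and ‘[...]’ and a ‘*’ anywhere in the pattern), so this is an illustration rather than an equivalent; name_matches_sketch is a hypothetical name.

```python
import fnmatch


def name_matches_sketch(name, pattern):
    """Return True if name matches any comma-separated subpattern."""
    for subpattern in pattern.split(","):
        # Case-sensitive glob-style comparison
        if fnmatch.fnmatchcase(name, subpattern):
            return True
    return False


print(name_matches_sketch("PJBriggs", "PJ*"))
print(name_matches_sketch("IJD", "PJB,IJD"))
```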
File manipulations
-
bcftbx.utils.concatenate_fastq_files(merged_fastq, fastq_files, bufsize=10240, overwrite=False, verbose=True)
Create a single FASTQ file by concatenating one or more FASTQs
Given a list or tuple of FASTQ files (which can be compressed or uncompressed or a combination), creates a single output FASTQ by concatenating the contents.
Parameters: - merged_fastq – name of output FASTQ file (must not already exist unless ‘overwrite’ is True)
- fastq_files – list of FASTQ files to concatenate
- bufsize – (optional) size of buffer to use for copying data
- overwrite – (optional) if True then overwrite the output file if it already exists (otherwise raise OSError); default is False
- verbose – (optional) if True then report operations to stdout, otherwise operate quietly
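A buffered concatenation of mixed plain and gzipped inputs could be sketched as below. The assumptions are labelled in the code: compression of both inputs and output is inferred from a ‘.gz’ extension, and the verbose reporting is omitted; concatenate_sketch is a hypothetical name, not the library function.

```python
import gzip
import os
import shutil


def concatenate_sketch(merged, inputs, bufsize=10240, overwrite=False):
    """Concatenate plain or gzipped files into one output file."""
    if os.path.exists(merged) and not overwrite:
        raise OSError("Output file '%s' already exists" % merged)
    # Assumption: compress the output only if its name ends with '.gz'
    opener = gzip.open if merged.endswith(".gz") else open
    with opener(merged, "wb") as out:
        for filen in inputs:
            # Assumption: inputs ending with '.gz' are gzip-compressed
            reader = gzip.open if filen.endswith(".gz") else open
            with reader(filen, "rb") as src:
                # Copy in chunks of bufsize bytes
                shutil.copyfileobj(src, out, bufsize)
```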
Text manipulations
-
bcftbx.utils.split_into_lines(text, char_limit, delimiters=' \t\n', sympathetic=False)
Split a string into multiple lines with maximum length
Splits a string into multiple lines on one or more delimiters (defaults to the whitespace characters i.e. ‘ ‘,tab and newline), such that each line is no longer than a specified length.
For example:
>>> split_into_lines("This is some text to split",10)
['This is','some text','to split']
If it’s not possible to split part of the text to a suitable length then the line is split “unsympathetically” at the line length, e.g.
>>> split_into_lines("This is supercalifragilicous text",10)
['This is','supercalif','ragilicous','text']
Set the ‘sympathetic’ flag to True to include a hyphen to indicate that a word has been broken, e.g.
>>> split_into_lines("This is supercalifragilicous text",10,
...                  sympathetic=True)
['This is','supercali-','fragilico-','us text']
To use an alternative set of delimiter characters, set the ‘delimiters’ argument, e.g.
>>> split_into_lines("This: is some text",10,delimiters=':')
['This',' is some t','ext']
Parameters: - text – string of text to be split into lines
- char_limit – maximum length for any given line
- delimiters – optional, specify a set of non-default delimiter characters (defaults to whitespace)
- sympathetic – optional, if True then add hyphen to indicate when a word has been broken
Returns: List of lines (i.e. strings).