Command reference
Note
This documentation has been auto-generated from the help output of each command.
The following utilities are available:
analyse_solid_run.py
usage: analyse_solid_run.py [-h] [--version] [--only] [--report]
[--report-paths] [--xls] [--verify] [--layout]
[--rsync] [--copy COPY_PATTERN]
[--gzip GZIP_PATTERN] [--md5 MD5_PATTERN]
[--md5sum] [--no-warnings] [--debug]
solid_run_dir [solid_run_dir ...]
Utility for performing various checks and operations on SOLiD run directories.
If a single solid_run_dir is specified then analyse_solid_run.py automatically
finds and operates on all associated directories from the same instrument and
with the same timestamp.
positional arguments:
solid_run_dir SOLiD run directory to operate on
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
--only only operate on the specified solid_run_dir, don't
locate associated run directories
--report print a report of the SOLiD run
--report-paths in report mode, also print full paths to primary data
files
--xls write report to Excel spreadsheet
--verify do verification checks on SOLiD run directories
--layout generate script for laying out analysis directories
--rsync generate script for rsyncing data
--copy COPY_PATTERN copy primary data files to pwd from specific library
where names match COPY_PATTERN, which should be of the
form '<sample>/<library>'
--gzip GZIP_PATTERN make gzipped copies of primary data files in pwd from
specific libraries where names match GZIP_PATTERN,
which should be of the form '<sample>/<library>'
--md5 MD5_PATTERN calculate md5sums for primary data files from specific
libraries where names match MD5_PATTERN, which should
be of the form '<sample>/<library>'
--md5sum calculate md5sums for all primary data files
(equivalent to --md5=*/*)
--no-warnings suppress warning messages
--debug turn on debugging output (nb overrides --no-warnings)
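The '<sample>/<library>' patterns accepted by --copy, --gzip and --md5 can be pictured as shell-style wildcards, which is what the stated equivalence of --md5sum and --md5=*/* suggests. The sketch below is an illustration only: it assumes glob-style matching via Python's fnmatch module, and the function name and sample data are hypothetical rather than taken from analyse_solid_run.py.

```python
from fnmatch import fnmatchcase

def match_libraries(pattern, libraries):
    """Return the (sample, library) pairs whose 'sample/library' name matches pattern.

    Assumption: patterns behave like shell globs (e.g. '*/*' matches everything).
    """
    return [(sample, library) for sample, library in libraries
            if fnmatchcase(f"{sample}/{library}", pattern)]

# Hypothetical sample/library names for illustration
libraries = [("AB", "AB_1"), ("AB", "AB_2"), ("CD", "CD_1")]
selected = match_libraries("AB/*", libraries)
everything = match_libraries("*/*", libraries)
```

Here 'AB/*' selects only the libraries belonging to sample AB, while '*/*' selects every library.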
annotate_probesets.py
usage: annotate_probesets.py [-h] [--version] [-o OUT_FILE] IN_FILE
Annotate probeset list based on name: reads in first column of tab-delimited
input file IN_FILE as a list of probeset names and outputs these
names to another tab-delimited file with a description for each. Output file
name can be specified with the -o option, otherwise it will be the input file
name with '_annotated' appended.
positional arguments:
IN_FILE input probeset file
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-o OUT_FILE specify output file name
best_exons.py
usage: best_exons.py [-h] [--version] [--rank-by {log2_fold_change,p_value}]
[--probeset-col PROBESET_COL]
[--gene-symbol-col GENE_SYMBOL_COL]
[--log2-fold-change-col LOG2_FOLD_CHANGE_COL]
[--p-value-col P_VALUE_COL] [--debug]
EXONS_IN BEST_EXONS
Reads exon and gene symbol data from EXONS_IN and picks the top three exons for
each gene symbol, then outputs averages of the associated values to
BEST_EXONS.
positional arguments:
EXONS_IN input file with exon and gene symbol data
BEST_EXONS output file with averages from the top three exons for each
gene symbol
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
--rank-by {log2_fold_change,p_value}
select the criterion for ranking the 'best' exons;
possible options are: 'log2_fold_change' (default), or
'p_value'.
--probeset-col PROBESET_COL
specify column with probeset names (default=0, columns
start counting from zero)
--gene-symbol-col GENE_SYMBOL_COL
specify column with gene symbols (default=1, columns
start counting from zero)
--log2-fold-change-col LOG2_FOLD_CHANGE_COL
specify column with log2 fold change (default=12,
columns start counting from zero)
--p-value-col P_VALUE_COL
specify column with p-value (default=13, columns start
counting from zero)
--debug Turn on debug output
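The "top three exons, then average" procedure described above can be sketched as follows. This is a minimal illustration, not the code in best_exons.py: the function name and data layout are hypothetical, and the ranking direction (largest log2 fold change first, smallest p-value first) is an assumption, since the help text does not state it.

```python
from collections import defaultdict

def best_exons(rows, rank_by="log2_fold_change", top_n=3):
    """rows: iterable of (probeset, gene_symbol, log2_fold_change, p_value).

    Returns {gene_symbol: (mean log2 fold change, mean p-value)} computed
    over the top_n ranked exons for each gene symbol.
    """
    by_gene = defaultdict(list)
    for _probeset, gene, log2_fc, p_value in rows:
        by_gene[gene].append((log2_fc, p_value))
    result = {}
    for gene, values in by_gene.items():
        if rank_by == "log2_fold_change":
            # Assumption: larger log2 fold change ranks higher
            ranked = sorted(values, key=lambda v: v[0], reverse=True)
        else:
            # Assumption: smaller p-value ranks higher
            ranked = sorted(values, key=lambda v: v[1])
        top = ranked[:top_n]
        result[gene] = (sum(v[0] for v in top) / len(top),
                        sum(v[1] for v in top) / len(top))
    return result

# Hypothetical input data
rows = [
    ("ps1", "GENE1", 2.0, 0.01),
    ("ps2", "GENE1", 1.0, 0.04),
    ("ps3", "GENE1", 4.0, 0.02),
    ("ps4", "GENE1", 3.0, 0.5),
    ("ps5", "GENE2", -1.0, 0.2),
]
by_fold_change = best_exons(rows)
by_p_value = best_exons(rows, rank_by="p_value")
```

A gene symbol with fewer than three exons is simply averaged over the exons it has.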
bowtie_mapping_stats.py
usage: bowtie_mapping_stats.py [-h] [--version] [-o xls_file] [-t]
BOWTIE_LOG_FILE [BOWTIE_LOG_FILE ...]
Extract mapping statistics for each sample referenced in the input bowtie log
files and summarise the data in an XLS spreadsheet. Handles output from both
Bowtie and Bowtie2.
positional arguments:
BOWTIE_LOG_FILE logfile output from Bowtie or Bowtie2
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-o xls_file specify name of the output XLS file (otherwise defaults to
'mapping_summary.xls').
-t write data to tab-delimited file in addition to the XLS
file. The tab file will have the same name as the XLS file,
with the extension replaced by .txt
extract_reads.py
usage: extract_reads.py [-h] [--version] [-m PATTERN] [-n N] [-s SEED]
infile [infile ...]
Extract subsets of reads from each of the supplied files according to
specified criteria (e.g. random, matching a pattern etc). Input files can be
any mixture of FASTQ (.fastq, .fq), CSFASTA (.csfasta) and QUAL (.qual).
positional arguments:
infile input FASTQ, CSFASTA, or QUAL file
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-m PATTERN, --match PATTERN
extract records that match Python regular expression
PATTERN
-n N extract N random reads from the input file(s). If
multiple files are supplied (e.g. R1/R2 pair) then the
same subsets will be extracted for each. (Optionally a
percentage can be supplied instead e.g. '50%' to
extract a subset of half the reads.)
-s SEED, --seed SEED specify seed for random number generator (used for -n
option; using the same seed should produce the same
'random' sample of reads)
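The interaction of -n and -s can be sketched as below: a seeded random number generator picks a set of record positions once, and the same positions are then taken from every input file so that R1/R2 pairs stay in step. This is a simplified illustration under stated assumptions; the function names are hypothetical, and the rounding used for percentages is a guess rather than the behaviour of extract_reads.py.

```python
import random

def choose_indices(n_reads, n, seed=None):
    """Pick the positions of the reads to extract.

    n is either a number of reads or a percentage string such as '50%'.
    Assumption: percentages are rounded down to a whole number of reads.
    """
    if isinstance(n, str) and n.endswith("%"):
        count = int(n_reads * float(n[:-1]) / 100.0)
    else:
        count = int(n)
    rng = random.Random(seed)  # same seed gives the same 'random' sample
    return sorted(rng.sample(range(n_reads), count))

def subset(records, indices):
    """Return the records at the chosen positions."""
    return [records[i] for i in indices]

# Hypothetical records standing in for the reads of an R1/R2 pair
r1 = ["r1_a", "r1_b", "r1_c", "r1_d"]
r2 = ["r2_a", "r2_b", "r2_c", "r2_d"]
indices = choose_indices(len(r1), "50%", seed=42)
r1_subset = subset(r1, indices)
r2_subset = subset(r2, indices)
```

Because the positions are chosen once and reused, each read in the R1 subset keeps its mate in the R2 subset.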
fastq_strand.py
Fastq_strand: version 1.13.1
usage: fastq_strand.py [-h] [--version] [-g GENOMEDIR] [--subset SUBSET]
[-o OUTDIR] [-c FILE] [-n N] [--counts]
[--keep-star-output]
READ1 [READ2]
Generate strandedness statistics for FASTQ or FASTQ pair, by running STAR using
one or more genome indexes
positional arguments:
READ1 R1 Fastq file
READ2 R2 Fastq file
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-g GENOMEDIR, --genome GENOMEDIR
path to directory with STAR index for genome to use
(use as an alternative to -c/--conf; can be specified
multiple times to include additional genomes)
--subset SUBSET use a random subset of read pairs from the input
Fastqs; set to zero to use all reads (default: 10000)
-o OUTDIR, --outdir OUTDIR
specify directory to write final outputs to (default:
current directory)
-c FILE, --conf FILE specify delimited 'conf' file with list of NAME and
STAR index directory pairs. NB if a conf file is
supplied then any indices specified on the command line
will be ignored
-n N number of threads to run STAR with (default: 1)
--counts include the count sums for unstranded, 1st read strand
aligned and 2nd read strand aligned in the output file
(default: only include percentages)
--keep-star-output keep the output from STAR (default: delete outputs on
completion)
log_seq_data.sh
Usage:
log_seq_data.sh <logging_file> [-d|-u] <seq_data_dir> [<description>]
log_seq_data.sh <logging_file> -c <seq_data_dir> <new_dir> [<description>]
log_seq_data.sh <logging_file> -i <seq_data_dir>
log_seq_data.sh <logging_file> -v
Add, update or delete an entry for <seq_data_dir> in <logging_file>, or
verify entries.
<seq_data_dir> can be a primary data directory from a sequencer or a
directory of derived data (e.g. analysis directory)
By default an entry is added for the specified data directory; each
entry is a tab-delimited line with the full path for the data directory
followed by the UNIX timestamp and the optional description text.
If <logging_file> doesn't exist then it will be created; if
<seq_data_dir> is already in the log file then it won't be added again.
Options:
-d deletes an existing entry
-u updates the description for an existing entry (or creates a new
one if an existing entry is not found)
-c changes an existing entry, updating the directory path and
(optionally) the description
-i print information about an entry
-v validates the entries in the logging file.
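The entry format described above (full path, UNIX timestamp, optional description, separated by tabs) and the "don't add twice" rule can be sketched in Python. This is an illustration only: log_seq_data.sh is a shell script, the function names here are hypothetical, and writing an empty third field when no description is given is an assumption.

```python
import os
import time

def make_entry(seq_data_dir, description="", timestamp=None):
    """Format one log line: full path, UNIX timestamp, description."""
    if timestamp is None:
        timestamp = int(time.time())
    return "\t".join([os.path.abspath(seq_data_dir), str(timestamp), description])

def add_entry(lines, seq_data_dir, description="", timestamp=None):
    """Return the log lines with a new entry appended.

    If the directory is already present in the log, the lines are
    returned unchanged.
    """
    path = os.path.abspath(seq_data_dir)
    if any(line.split("\t")[0] == path for line in lines):
        return lines
    return lines + [make_entry(path, description, timestamp)]

# Hypothetical directory and description
log = add_entry([], "/data/run1", "primary data", timestamp=1700000000)
log_again = add_entry(log, "/data/run1", "second attempt")
```

The second call leaves the log unchanged because '/data/run1' already has an entry.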
make_macs_xls.py
usage: make_macs_xls.py [-h] [--version] MACS_OUTPUT [XLS_OUT]
Create an XLS spreadsheet from the output of the MACS peak caller.
<MACS_OUTPUT> is the output '.xls' file from MACS; if supplied then <XLS_OUT>
is the name to use for the output file, otherwise it will be called
'XLS_<MACS_OUTPUT>.xls'.
positional arguments:
MACS_OUTPUT output .xls file from MACS
XLS_OUT output MS XLS file (defaults to 'XLS_<MACS_OUTPUT>.xls').
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
make_macs2_xls.py
usage: make_macs2_xls.py [-h] [--version] [-f XLS_FORMAT] [-b]
MACS2_XLS [XLS_OUT]
Create an XLS(X) spreadsheet from the output of the MACS2 peak caller.
MACS2_XLS is the output '.xls' file from MACS2; if supplied then XLS_OUT is
the name to use for the output file (otherwise it will be called
'XLS_<MACS2_XLS>.xls(x)').
positional arguments:
MACS2_XLS output '.xls' file from MACS2
XLS_OUT name to use for the output file (default is
'XLS_<MACS2_XLS>.xls(x)')
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-f XLS_FORMAT, --format XLS_FORMAT
specify the output Excel spreadsheet format; must be
one of 'xlsx' or 'xls' (default is 'xlsx')
-b, --bed write an additional TSV file with chrom,
abs_summit+100 and abs_summit-100 data as the columns.
(NB only possible for MACS2 run without --broad)
manage_seqs.py
usage: manage_seqs.py [-h] [--version] [-o OUT_FILE] [-a APPEND_FILE]
[-d DESCRIPTION]
INFILE [INFILE ...]
Read sequences and names from one or more INFILEs (which can be a mixture of
FastQC 'contaminants' format and/or Fasta format), check for redundancy (i.e.
sequences with multiple associated names) and contradictions (i.e. names with
multiple associated sequences).
positional arguments:
INFILE input sequences
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-o OUT_FILE write all sequences to OUT_FILE in FastQC 'contaminants'
format
-a APPEND_FILE append sequences to existing APPEND_FILE (not compatible
with -o)
-d DESCRIPTION supply arbitrary text to write to the header of the output
file
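The two checks named in the description, redundancy (one sequence with several names) and contradiction (one name with several sequences), can be sketched with two dictionaries. The function name and the example records are hypothetical, and this is not the implementation used by manage_seqs.py.

```python
from collections import defaultdict

def check_sequences(records):
    """records: iterable of (name, sequence) pairs.

    Returns (redundant, contradictory) where
    redundant maps a sequence to its multiple names, and
    contradictory maps a name to its multiple sequences.
    """
    names_by_seq = defaultdict(set)
    seqs_by_name = defaultdict(set)
    for name, seq in records:
        names_by_seq[seq].add(name)
        seqs_by_name[name].add(seq)
    redundant = {seq: sorted(names)
                 for seq, names in names_by_seq.items() if len(names) > 1}
    contradictory = {name: sorted(seqs)
                     for name, seqs in seqs_by_name.items() if len(seqs) > 1}
    return redundant, contradictory

# Hypothetical records
records = [
    ("adapterA", "ACGT"),
    ("adapterB", "ACGT"),   # same sequence, different name: redundancy
    ("primer", "GGCC"),
    ("primer", "TTAA"),     # same name, different sequence: contradiction
]
redundant, contradictory = check_sequences(records)
```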
md5checker.py
usage:
md5checker.py -d SOURCE_DIR DEST_DIR
md5checker.py -d FILE1 FILE2
md5checker.py [ -o CHKSUM_FILE ] DIR
md5checker.py [ -o CHKSUM_FILE ] FILE
md5checker.py -c CHKSUM_FILE
Compute and verify MD5 checksums for files and directories.
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-d, --diff for two directories: check that contents of directory
SOURCE_DIR are present in DEST_DIR and have the same MD5 sums;
for two files: check that FILE1 and FILE2 have the
same MD5 sums
-c, --check read MD5 sums from the specified file and check them
-q, --quiet suppress output messages and only report failures
Directory comparison (-d, --diff):
Check that the contents of SOURCE_DIR are present in DEST_DIR and have
matching MD5 sums. Note that files that are only present in DEST_DIR are
not reported.
File comparison (-d, --diff):
Check that FILE1 and FILE2 have matching MD5 sums.
Checksum generation:
MD5 checksums are calculated for all files in the specified directory, or
for a single specified file.
-o CHKSUM_FILE, --output CHKSUM_FILE
optionally write computed MD5 sums to CHKSUM_FILE
(otherwise the sums are written to stdout). The output
format is the same as that used by the Linux 'md5sum'
tool.
Checksum verification (-c, --check):
Check MD5 sums for each of the files listed in the specified CHKSUM_FILE
relative to the current directory. This option behaves the same as the
Linux 'md5sum' tool.
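For readers who want to see what checksum generation involves, the sketch below computes an MD5 sum with Python's standard hashlib module and formats it in the style of the Linux 'md5sum' tool (checksum, two spaces, file name). The function names are hypothetical and this is not the code in md5checker.py.

```python
import hashlib
import os
import tempfile

def md5sum(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

def checksum_line(path):
    """Format a checksum in 'md5sum' style: '<digest>  <path>'."""
    return f"{md5sum(path)}  {path}"

# Usage: checksum a small temporary file, then clean it up
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello\n")
line = checksum_line(tmp.name)
os.remove(tmp.name)
```

Reading in chunks keeps memory use constant, which matters for large sequencing data files.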
prep_sample_sheet.py
usage: prep_sample_sheet.py [-h] [--version] [-o SAMPLESHEET_OUT] [-f FMT]
[-V] [--fix-spaces] [--fix-duplicates]
[--fix-empty-projects] [--set-id SAMPLE_ID]
[--set-project SAMPLE_PROJECT]
[--reverse-complement-i5] [--ignore-warnings]
[--include-lanes LANES] [--set-adapter ADAPTER]
[--set-adapter-read2 ADAPTER_READ2]
[--truncate-barcodes BARCODE_LEN] [--miseq]
SAMPLE_SHEET
Utility to prepare SampleSheet files from Illumina sequencers. Can be used to
view, validate and update or fix information such as sample IDs and project
names before running BCL to FASTQ conversion.
positional arguments:
SAMPLE_SHEET input sample sheet file
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-o SAMPLESHEET_OUT output new sample sheet to SAMPLESHEET_OUT
-f FMT, --format FMT specify the format of the output sample sheet written
by the -o option; can be either 'CASAVA' or 'IEM'
(defaults to the format of the original file)
-V, --view view predicted outputs from sample sheet
--fix-spaces replace spaces in sample ID and project fields with
underscores
--fix-duplicates append unique indices to sample IDs where the original
ID and project name combination are duplicated
--fix-empty-projects create sample project names where these are blank in
the original sample sheet
--set-id SAMPLE_ID update/set the values in sample ID field; SAMPLE_ID
should be of the form '<lanes>:<name>', where <lanes>
is a single integer (e.g. 1), a set of integers (e.g.
1,3,...), a range (e.g. 1-3), or a combination (e.g.
1,3-5,7)
--set-project SAMPLE_PROJECT
update/set values in the sample project field;
SAMPLE_PROJECT should be of the form
'[<lanes>:]<name>', where the optional <lanes> part
can be a single integer (e.g. 1), a set of integers
(e.g. 1,3,...), a range (e.g. 1-3), or a combination
(e.g. 1,3-5,7). If no lanes are specified then all
samples will have their project set to <name>
--reverse-complement-i5
replace i5 index sequences with their reverse
complement
--ignore-warnings ignore warnings about spaces and duplicated
sampleID/sampleProject combinations when writing new
samplesheet.csv file
--include-lanes LANES
specify a subset of lanes to include in the output
sample sheet; LANES should be a single integer (e.g. 1),
a list of integers (e.g. 1,3,...), a range (e.g. 1-3)
or a combination (e.g. 1,3-5,7). Default is to include
all lanes
--set-adapter ADAPTER
set the adapter sequence in the 'Settings' section to
ADAPTER
--set-adapter-read2 ADAPTER_READ2
set the adapter sequence for read 2 in the
'Settings' section to ADAPTER_READ2
Deprecated options:
--truncate-barcodes BARCODE_LEN
trim barcode sequences in sample sheet to number of
bases specified by BARCODE_LEN. Default is to leave
barcode sequences unmodified (deprecated; only works
for CASAVA-style sample sheets)
--miseq convert input MiSEQ sample sheet to CASAVA-compatible
format (deprecated; specify -f/--format CASAVA to
convert IEM sample sheet to older format)
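Several options above (--set-id, --set-project, --include-lanes) share the same lane syntax: single integers, comma-separated lists, ranges, or a combination such as 1,3-5,7. A minimal sketch of how such a specification can be expanded is shown below. The function names are hypothetical and no input validation is attempted, so this should not be read as the parser used by prep_sample_sheet.py.

```python
def parse_lanes(spec):
    """Expand a lane specification such as '1,3-5,7' into a list of integers."""
    lanes = []
    for part in spec.split(","):
        if "-" in part:
            # A range 'a-b' includes both end points
            start, end = part.split("-")
            lanes.extend(range(int(start), int(end) + 1))
        else:
            lanes.append(int(part))
    return lanes

def split_set_value(value):
    """Split a '[<lanes>:]<name>' value into (lanes or None, name)."""
    if ":" in value:
        lanes, name = value.split(":", 1)
        return parse_lanes(lanes), name
    return None, value

lanes = parse_lanes("1,3-5,7")
with_lanes = split_set_value("1-3:ProjectA")   # hypothetical project name
all_lanes = split_set_value("ProjectA")
```

A return value of None for the lanes stands for "no lanes specified", which for --set-project means the name applies to all samples.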
reorder_fasta.py
usage: reorder_fasta.py [-h] [--version] FASTA
Reorder the chromosome records in a FASTA file into karyotypic order.
positional arguments:
FASTA FASTA file to reorder
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
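One way to picture karyotypic ordering is as a sort key that places numbered chromosomes in numeric order (so chr2 comes before chr10) rather than in plain alphabetical order. The sketch below is an assumption about what such an ordering could look like: the placement of X, Y and M after the numbered chromosomes, and of any other names last, is a guess and may differ from what reorder_fasta.py does.

```python
# Assumed ordering for the non-numeric chromosomes
_SPECIAL = {"X": 0, "Y": 1, "M": 2, "MT": 2}

def karyotype_key(name):
    """Sort key: numbered chromosomes first, then X, Y, M, then anything else."""
    label = name[3:] if name.startswith("chr") else name
    if label.isdigit():
        return (0, int(label), name)
    if label in _SPECIAL:
        return (1, _SPECIAL[label], name)
    return (2, 0, name)

names = ["chr10", "chrM", "chr2", "chrX", "chr1", "chrY"]
ordered = sorted(names, key=karyotype_key)
```

With this key the example names sort as chr1, chr2, chr10, chrX, chrY, chrM.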
sam2soap.py
usage: sam2soap.py [-h] [--version] [-o SOAPFILE] [--debug] [SAMFILE]
Convert SAM file to SOAP format - reads from stdin (or SAMFILE, if specified),
and writes output to stdout unless -o option is specified.
positional arguments:
SAMFILE SAM file to convert (or stdin if not specified)
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-o SOAPFILE Output SOAP file name
--debug Turn on debugging output
split_fasta.py
usage: split_fasta.py [-h] [--version] [fasta_file]
Split input FASTA file with multiple sequences into multiple files each
containing sequences for a single chromosome.
positional arguments:
fasta_file input FASTA file to split
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
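The splitting step can be sketched as grouping the lines of a multi-sequence FASTA by the name on each '>' header line. This illustration works in memory and leaves out writing the per-chromosome files, since the help text does not say how the output files are named; the function name is hypothetical.

```python
def split_fasta(lines):
    """Group FASTA lines by sequence name.

    Returns {name: [header line, sequence line, ...]}, where the name is
    the first word after '>' on the header line.
    """
    records = {}
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = [line]
        elif current is not None and line:
            records[current].append(line)
    return records

# Hypothetical two-chromosome FASTA
fasta = [">chr1 example\n", "ACGT\n", "TTGA\n", ">chr2\n", "GGCC\n"]
chromosomes = split_fasta(fasta)
```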
split_fastq.py
usage: split_fastq.py [-h] [--version] [-l LANES] FASTQ
Split input Fastq file into multiple output Fastqs where each output only
contains reads from a single lane.
positional arguments:
FASTQ Fastq to split
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-l LANES, --lanes LANES
lanes to extract: can be a single integer, a comma-
separated list (e.g. 1,3), a range (e.g. 5-7) or a
combination (e.g. 1,3,5-7). Default is to extract all
lanes in the Fastq
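Splitting by lane relies on the lane number recorded in each read header. The sketch below assumes Illumina-style headers of the form '@<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y> ...', where the lane is the fourth colon-separated field; other header formats would need different handling. The function names and example reads are hypothetical.

```python
def lane_from_header(header):
    """Extract the lane number from an Illumina-style Fastq header line.

    Assumption: the lane is the fourth colon-separated field.
    """
    return int(header.split(":")[3])

def split_by_lane(records, lanes=None):
    """Group Fastq records by lane.

    records: iterable of (header, sequence, separator, quality) tuples.
    lanes: optional collection of lanes to keep (default: keep all).
    """
    by_lane = {}
    for record in records:
        lane = lane_from_header(record[0])
        if lanes is None or lane in lanes:
            by_lane.setdefault(lane, []).append(record)
    return by_lane

# Hypothetical reads from two lanes
records = [
    ("@M00001:12:000000000-ABCDE:1:1101:15589:1332 1:N:0:1", "ACGT", "+", "IIII"),
    ("@M00001:12:000000000-ABCDE:2:1101:15590:1400 1:N:0:1", "TTGA", "+", "IIII"),
    ("@M00001:12:000000000-ABCDE:1:1101:15600:1500 1:N:0:1", "GGCC", "+", "IIII"),
]
all_lanes = split_by_lane(records)
lane_two = split_by_lane(records, lanes=[2])
```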
verify_paired.py
usage: verify_paired.py [-h] [--version] R1.fastq R2.fastq
Check that read headers for R1 and R2 fastq files are in agreement, and that
the files form an R1/R2 pair.
positional arguments:
R1.fastq Fastq file with R1 reads
R2.fastq Fastq file with R2 reads to check against R1 reads
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
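A simplified picture of a pairing check is to compare the read identifiers of the two files record by record, ignoring the part of the header that carries the read number. The sketch below assumes that the read number appears either after a space (e.g. ' 1:N:0:1') or as a trailing '/1' or '/2'; the exact checks made by verify_paired.py may well differ, and the function names are hypothetical.

```python
def read_id(header):
    """Strip the read-number part from a Fastq header line."""
    name = header.split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name

def is_pair(r1_headers, r2_headers):
    """Return True if both files have the same read identifiers in the same order."""
    r1 = list(r1_headers)
    r2 = list(r2_headers)
    if len(r1) != len(r2):
        return False
    return all(read_id(a) == read_id(b) for a, b in zip(r1, r2))

# Hypothetical headers
r1 = ["@M00001:12:FC:1:1101:100:200 1:N:0:1", "@M00001:12:FC:1:1101:101:201 1:N:0:1"]
r2 = ["@M00001:12:FC:1:1101:100:200 2:N:0:1", "@M00001:12:FC:1:1101:101:201 2:N:0:1"]
paired = is_pair(r1, r2)
shuffled = is_pair(r1, list(reversed(r2)))
```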
xrorthologs.py
usage: xrorthologs.py [-h] [--version] [--debug] LOOKUPFILE SPECIES1 SPECIES2
Cross-reference data from two species given a lookup file that maps probeset
IDs from one species onto those of the other. LOOKUPFILE is a tab-delimited
file with one probe set for species 1 per line in first column and a comma-
separated list of the equivalent probe sets for species 2 in the fourth
column. Data for the two species are in tab-delimited files SPECIES1 and
SPECIES2. Output is two files: SPECIES1_appended.txt (SPECIES1 with the cross-
referenced data from SPECIES2 appended to each line) and SPECIES2_appended.txt
(SPECIES2 with SPECIES1 data appended).
positional arguments:
LOOKUPFILE tab-delimited file with one probe set for species 1 per line in
first column and a comma-separated list of the equivalent probe
sets for species 2 in the fourth column
SPECIES1 data for species 1
SPECIES2 data for species 2
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
--debug Turn on debugging output
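The LOOKUPFILE layout described above (species 1 probe set in the first column, comma-separated species 2 probe sets in the fourth column) can be read into a mapping, and the reverse mapping derived from it, as sketched below. The function names and probe set IDs are hypothetical, and skipping lines with fewer than four columns is an assumption rather than documented behaviour of xrorthologs.py.

```python
def read_lookup(lines):
    """Map each species 1 probe set to its list of species 2 probe sets."""
    lookup = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4 or not fields[0]:
            continue  # assumption: ignore incomplete lines
        lookup[fields[0]] = [ps for ps in fields[3].split(",") if ps]
    return lookup

def reverse_lookup(lookup):
    """Map each species 2 probe set back to the species 1 probe sets."""
    reverse = {}
    for ps1, ps2_list in lookup.items():
        for ps2 in ps2_list:
            reverse.setdefault(ps2, []).append(ps1)
    return reverse

# Hypothetical lookup lines (columns 2 and 3 are not used here)
lines = [
    "A_1\tx\ty\tB_7,B_9\n",
    "A_2\tx\ty\tB_9\n",
    "A_3\tx\ty\t\n",
]
forward = read_lookup(lines)
backward = reverse_lookup(forward)
```

The forward mapping serves the SPECIES1 output and the reverse mapping the SPECIES2 output.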