Command reference

Note

This documentation has been auto-generated from the command help.

The following utilities are available:

analyse_solid_run.py

usage: analyse_solid_run.py [-h] [--version] [--only] [--report]
                            [--report-paths] [--xls] [--verify] [--layout]
                            [--rsync] [--copy COPY_PATTERN]
                            [--gzip GZIP_PATTERN] [--md5 MD5_PATTERN]
                            [--md5sum] [--no-warnings] [--debug]
                            solid_run_dir [solid_run_dir ...]

Utility for performing various checks and operations on SOLiD run directories.
If a single solid_run_dir is specified then analyse_solid_run.py automatically
finds and operates on all associated directories from the same instrument and
with the same timestamp.

positional arguments:
  solid_run_dir        SOLiD run directory to operate on

optional arguments:
  -h, --help           show this help message and exit
  --version            show program's version number and exit
  --only               only operate on the specified solid_run_dir, don't
                       locate associated run directories
  --report             print a report of the SOLiD run
  --report-paths       in report mode, also print full paths to primary data
                       files
  --xls                write report to Excel spreadsheet
  --verify             do verification checks on SOLiD run directories
  --layout             generate script for laying out analysis directories
  --rsync              generate script for rsyncing data
  --copy COPY_PATTERN  copy primary data files to pwd from specific libraries
                       where names match COPY_PATTERN, which should be of the
                       form '<sample>/<library>'
  --gzip GZIP_PATTERN  make gzipped copies of primary data files in pwd from
                       specific libraries where names match GZIP_PATTERN,
                       which should be of the form '<sample>/<library>'
  --md5 MD5_PATTERN    calculate md5sums for primary data files from specific
                       libraries where names match MD5_PATTERN, which should
                       be of the form '<sample>/<library>'
  --md5sum             calculate md5sums for all primary data files
                       (equivalent to --md5=*/*)
  --no-warnings        suppress warning messages
  --debug              turn on debugging output (nb overrides --no-warnings)

annotate_probesets.py

usage: annotate_probesets.py [-h] [--version] [-o OUT_FILE] IN_FILE

Annotate probeset list based on name: reads in first column of tab-delimited
input file IN_FILE as a list of probeset names and outputs these
names to another tab-delimited file with a description for each. Output file
name can be specified with the -o option, otherwise it will be the input file
name with '_annotated' appended.

positional arguments:
  IN_FILE      input probeset file

optional arguments:
  -h, --help   show this help message and exit
  --version    show program's version number and exit
  -o OUT_FILE  specify output file name

best_exons.py

usage: best_exons.py [-h] [--version] [--rank-by {log2_fold_change,p_value}]
                     [--probeset-col PROBESET_COL]
                     [--gene-symbol-col GENE_SYMBOL_COL]
                     [--log2-fold-change-col LOG2_FOLD_CHANGE_COL]
                     [--p-value-col P_VALUE_COL] [--debug]
                     EXONS_IN BEST_EXONS

Read exon and gene symbol data from EXONS_IN and pick the top three exons for
each gene symbol, then output averages of the associated values to
BEST_EXONS.

positional arguments:
  EXONS_IN              input file with exon and gene symbol data
  BEST_EXONS            output file with averages from the top three exons for
                        each gene symbol

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --rank-by {log2_fold_change,p_value}
                        select the criterion for ranking the 'best' exons;
                        possible options are: 'log2_fold_change' (default), or
                        'p_value'.
  --probeset-col PROBESET_COL
                        specify column with probeset names (default=0, columns
                        start counting from zero)
  --gene-symbol-col GENE_SYMBOL_COL
                        specify column with gene symbols (default=1, columns
                        start counting from zero)
  --log2-fold-change-col LOG2_FOLD_CHANGE_COL
                        specify column with log2 fold change (default=12,
                        columns start counting from zero)
  --p-value-col P_VALUE_COL
                        specify column with p-value (default=13, columns start
                        counting from zero)
  --debug               Turn on debug output
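
The "top three exons, then average" idea can be sketched in Python as below.
This is an illustration, not the code in best_exons.py. The help text does not
say how ties are broken or whether fold changes are ranked by absolute value,
so the sketch assumes the largest log2 fold change (or the smallest p-value)
ranks first. The function name and dictionary keys are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def best_exons(rows, rank_by="log2_fold_change", top_n=3):
    """Average the values of the top-ranked exons for each gene symbol."""
    by_gene = defaultdict(list)
    for row in rows:
        by_gene[row["gene_symbol"]].append(row)
    results = {}
    for gene, exons in by_gene.items():
        if rank_by == "log2_fold_change":
            # Assumption: largest fold change ranks first
            ranked = sorted(exons, key=lambda r: r["log2_fold_change"], reverse=True)
        else:
            # Assumption: smallest p-value ranks first
            ranked = sorted(exons, key=lambda r: r["p_value"])
        top = ranked[:top_n]
        results[gene] = {
            "log2_fold_change": mean(r["log2_fold_change"] for r in top),
            "p_value": mean(r["p_value"] for r in top),
        }
    return results
```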

bowtie_mapping_stats.py

usage: bowtie_mapping_stats.py [-h] [--version] [-o xls_file] [-t]
                               BOWTIE_LOG_FILE [BOWTIE_LOG_FILE ...]

Extract mapping statistics for each sample referenced in the input bowtie log
files and summarise the data in an XLS spreadsheet. Handles output from both
Bowtie and Bowtie2.

positional arguments:
  BOWTIE_LOG_FILE  logfile output from Bowtie or Bowtie2

optional arguments:
  -h, --help       show this help message and exit
  --version        show program's version number and exit
  -o xls_file      specify name of the output XLS file (otherwise defaults to
                   'mapping_summary.xls').
  -t               write data to tab-delimited file in addition to the XLS
                   file. The tab file will have the same name as the XLS file,
                   with the extension replaced by .txt

extract_reads.py

usage: extract_reads.py [-h] [--version] [-m PATTERN] [-n N] [-s SEED]
                        infile [infile ...]

Extract subsets of reads from each of the supplied files according to
specified criteria (e.g. random, matching a pattern etc). Input files can be
any mixture of FASTQ (.fastq, .fq), CSFASTA (.csfasta) and QUAL (.qual).

positional arguments:
  infile                input FASTQ, CSFASTA, or QUAL file

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -m PATTERN, --match PATTERN
                        extract records that match Python regular expression
                        PATTERN
  -n N                  extract N random reads from the input file(s). If
                        multiple files are supplied (e.g. R1/R2 pair) then the
                        same subsets will be extracted for each. (Optionally a
                        percentage can be supplied instead e.g. '50%' to
                        extract a subset of half the reads.)
  -s SEED, --seed SEED  specify seed for random number generator (used for -n
                        option; using the same seed should produce the same
                        'random' sample of reads)
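
One way to picture how a seed gives the same subset for every input file is
the Python sketch below. It illustrates the idea only and is not the code in
extract_reads.py: the read positions are chosen once from a seeded generator
and then applied to each file. The function names and toy records are
hypothetical.

```python
import random


def choose_read_indices(n_reads, n, seed=None):
    """Pick n distinct read positions; the same seed gives the same positions."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_reads), n))


def subset(records, indices):
    """Return the records at the chosen positions."""
    return [records[i] for i in indices]


# Toy stand-ins for the records of an R1/R2 pair
r1 = ["r1_read%d" % i for i in range(10)]
r2 = ["r2_read%d" % i for i in range(10)]

# Choose the positions once, then apply them to both files
indices = choose_read_indices(len(r1), 3, seed=42)
print(subset(r1, indices))
print(subset(r2, indices))
```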

fastq_strand.py

Fastq_strand: version 1.13.1
usage: fastq_strand.py [-h] [--version] [-g GENOMEDIR] [--subset SUBSET]
                       [-o OUTDIR] [-c FILE] [-n N] [--counts]
                       [--keep-star-output]
                       READ1 [READ2]

Generate strandedness statistics for FASTQ or FASTQ pair, by running STAR using
one or more genome indexes.

positional arguments:
  READ1                 R1 Fastq file
  READ2                 R2 Fastq file

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -g GENOMEDIR, --genome GENOMEDIR
                        path to directory with STAR index for genome to use
                        (use as an alternative to -c/--conf; can be specified
                        multiple times to include additional genomes)
  --subset SUBSET       use a random subset of read pairs from the input
                        Fastqs; set to zero to use all reads (default: 10000)
  -o OUTDIR, --outdir OUTDIR
                        specify directory to write final outputs to (default:
                        current directory)
  -c FILE, --conf FILE  specify delimited 'conf' file with list of NAME and
                        STAR index directory pairs. NB if a conf file is
                        supplied then any indices specified on the command line
                        will be ignored
  -n N                  number of threads to run STAR with (default: 1)
  --counts              include the count sums for unstranded, 1st read strand
                        aligned and 2nd read strand aligned in the output file
                        (default: only include percentages)
  --keep-star-output    keep the output from STAR (default: delete outputs on
                        completion)

log_seq_data.sh

Usage:
    log_seq_data.sh <logging_file> [-d|-u] <seq_data_dir> [<description>]
    log_seq_data.sh <logging_file> -c <seq_data_dir> <new_dir> [<description>]
    log_seq_data.sh <logging_file> -i <seq_data_dir>
    log_seq_data.sh <logging_file> -v

Add, update or delete an entry for <seq_data_dir> in <logging_file>, or
verify entries.

<seq_data_dir> can be a primary data directory from a sequencer or a
directory of derived data (e.g. analysis directory)

By default an entry is added for the specified data directory; each
entry is a tab-delimited line with the full path for the data directory
followed by the UNIX timestamp and the optional description text.

If <logging_file> doesn't exist then it will be created; if
<seq_data_dir> is already in the log file then it won't be added again.

Options:

     -d     deletes an existing entry
     -u     updates description for an existing entry (or creates a new one
            if an existing entry is not found)
     -c     changes an existing entry, updating the directory path and
            (optionally) the description
     -i     prints information about an entry
     -v     validates the entries in the logging file.
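
The entry layout described above (full path, UNIX timestamp, optional
description, separated by tabs) can be mimicked with the Python sketch below.
It only illustrates the line format and is not a port of log_seq_data.sh. The
function name is hypothetical, and writing the timestamp as whole seconds is
an assumption.

```python
import os
import time


def make_log_entry(seq_data_dir, description="", timestamp=None):
    """Build one tab-delimited entry: full path, UNIX timestamp, description."""
    if timestamp is None:
        # Assumption: the timestamp is recorded as whole seconds
        timestamp = int(time.time())
    return "\t".join([os.path.abspath(seq_data_dir), str(timestamp), description])


print(make_log_entry("run_directory", "example description"))
```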

make_macs_xls.py

usage: make_macs_xls.py [-h] [--version] MACS_OUTPUT [XLS_OUT]

Create an XLS spreadsheet from the output of the MACS peak caller.
<MACS_OUTPUT> is the output '.xls' file from MACS; if supplied then <XLS_OUT>
is the name to use for the output file, otherwise it will be called
'XLS_<MACS_OUTPUT>.xls'.

positional arguments:
  MACS_OUTPUT  output .xls file from MACS
  XLS_OUT      output MS XLS file (defaults to 'XLS_<MACS_OUTPUT>.xls').

optional arguments:
  -h, --help   show this help message and exit
  --version    show program's version number and exit

make_macs2_xls.py

usage: make_macs2_xls.py [-h] [--version] [-f XLS_FORMAT] [-b]
                         MACS2_XLS [XLS_OUT]

Create an XLS(X) spreadsheet from the output of the MACS2 peak caller.
MACS2_XLS is the output '.xls' file from MACS2; if supplied then XLS_OUT is
the name to use for the output file (otherwise it will be called
'XLS_<MACS2_XLS>.xls(x)').

positional arguments:
  MACS2_XLS             output '.xls' file from MACS2
  XLS_OUT               name to use for the output file (default is
                        'XLS_<MACS2_XLS>.xls(x)')

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -f XLS_FORMAT, --format XLS_FORMAT
                        specify the output Excel spreadsheet format; must be
                        one of 'xlsx' or 'xls' (default is 'xlsx')
  -b, --bed             write an additional TSV file with chrom,
                        abs_summit+100 and abs_summit-100 data as the columns.
                        (NB only possible for MACS2 run without --broad)

manage_seqs.py

usage: manage_seqs.py [-h] [--version] [-o OUT_FILE] [-a APPEND_FILE]
                      [-d DESCRIPTION]
                      INFILE [INFILE ...]

Read sequences and names from one or more INFILEs (which can be a mixture of
FastQC 'contaminants' format and/or Fasta format), check for redundancy (i.e.
sequences with multiple associated names) and contradictions (i.e. names with
multiple associated sequences).

positional arguments:
  INFILE          input sequences

optional arguments:
  -h, --help      show this help message and exit
  --version       show program's version number and exit
  -o OUT_FILE     write all sequences to OUT_FILE in FastQC 'contaminants'
                  format
  -a APPEND_FILE  append sequences to existing APPEND_FILE (not compatible
                  with -o)
  -d DESCRIPTION  supply arbitrary text to write to the header of the output
                  file

md5checker.py

usage:
  md5checker.py -d SOURCE_DIR DEST_DIR
  md5checker.py -d FILE1 FILE2
  md5checker.py [ -o CHKSUM_FILE ] DIR
  md5checker.py [ -o CHKSUM_FILE ] FILE
  md5checker.py -c CHKSUM_FILE

Compute and verify MD5 checksums for files and directories.

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -d, --diff            for two directories: check that contents of directory
                        SOURCE_DIR are present in DEST_DIR and have the same
                        MD5 sums; for two files: check that FILE1 and FILE2
                        have the same MD5 sums
  -c, --check           read MD5 sums from the specified file and check them
  -q, --quiet           suppress output messages and only report failures

Directory comparison (-d, --diff):
  Check that the contents of SOURCE_DIR are present in DEST_DIR and have
  matching MD5 sums. Note that files that are only present in DEST_DIR are
  not reported.

File comparison (-d, --diff):
  Check that FILE1 and FILE2 have matching MD5 sums.

Checksum generation:
  MD5 checksums are calculated for all files in the specified directory, or
  for a single specified file.

  -o CHKSUM_FILE, --output CHKSUM_FILE
                        optionally write computed MD5 sums to CHKSUM_FILE
                        (otherwise the sums are written to stdout). The output
                        format is the same as that used by the Linux 'md5sum'
                        tool.

Checksum verification (-c, --check):
  Check MD5 sums for each of the files listed in the specified CHKSUM_FILE
  relative to the current directory. This option behaves the same as the
  Linux 'md5sum' tool.
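
As a rough illustration of checksum generation, the Python sketch below
computes an MD5 sum with the standard hashlib module and formats it the way
the Linux 'md5sum' tool does (hex digest, two spaces, file name). It is not
the code in md5checker.py; the function names and the example file are
hypothetical.

```python
import hashlib
import os
import tempfile


def md5sum(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5sum_line(path):
    """Format a checksum line in the style of the 'md5sum' tool."""
    return "%s  %s" % (md5sum(path), path)


# Demonstrate on a small temporary file
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    with open(path, "wb") as fp:
        fp.write(b"hello\n")
    line = md5sum_line(path)
    print(line)
```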

prep_sample_sheet.py

usage: prep_sample_sheet.py [-h] [--version] [-o SAMPLESHEET_OUT] [-f FMT]
                            [-V] [--fix-spaces] [--fix-duplicates]
                            [--fix-empty-projects] [--set-id SAMPLE_ID]
                            [--set-project SAMPLE_PROJECT]
                            [--reverse-complement-i5] [--ignore-warnings]
                            [--include-lanes LANES] [--set-adapter ADAPTER]
                            [--set-adapter-read2 ADAPTER_READ2]
                            [--truncate-barcodes BARCODE_LEN] [--miseq]
                            SAMPLE_SHEET

Utility to prepare SampleSheet files from Illumina sequencers. Can be used to
view, validate and update or fix information such as sample IDs and project
names before running BCL to FASTQ conversion.

positional arguments:
  SAMPLE_SHEET          input sample sheet file

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -o SAMPLESHEET_OUT    output new sample sheet to SAMPLESHEET_OUT
  -f FMT, --format FMT  specify the format of the output sample sheet written
                        by the -o option; can be either 'CASAVA' or 'IEM'
                        (defaults to the format of the original file)
  -V, --view            view predicted outputs from sample sheet
  --fix-spaces          replace spaces in sample ID and project fields with
                        underscores
  --fix-duplicates      append unique indices to sample IDs where the original
                        ID and project name combination are duplicated
  --fix-empty-projects  create sample project names where these are blank in
                        the original sample sheet
  --set-id SAMPLE_ID    update/set the values in sample ID field; SAMPLE_ID
                        should be of the form '<lanes>:<name>', where <lanes>
                        is a single integer (e.g. 1), a set of integers (e.g.
                        1,3,...), a range (e.g. 1-3), or a combination (e.g.
                        1,3-5,7)
  --set-project SAMPLE_PROJECT
                        update/set values in the sample project field;
                        SAMPLE_PROJECT should be of the form
                        '[<lanes>:]<name>', where the optional <lanes> part
                        can be a single integer (e.g. 1), a set of integers
                        (e.g. 1,3,...), a range (e.g. 1-3), or a combination
                        (e.g. 1,3-5,7). If no lanes are specified then all
                        samples will have their project set to <name>
  --reverse-complement-i5
                        replace i5 index sequences with their reverse
                        complement
  --ignore-warnings     ignore warnings about spaces and duplicated
                        sampleID/sampleProject combinations when writing new
                        samplesheet.csv file
  --include-lanes LANES
                        specify a subset of lanes to include in the output
                        sample sheet; LANES should be a single integer (e.g. 1),
                        a list of integers (e.g. 1,3,...), a range (e.g. 1-3)
                        or a combination (e.g. 1,3-5,7). Default is to include
                        all lanes
  --set-adapter ADAPTER
                        set the adapter sequence in the 'Settings' section to
                        ADAPTER
  --set-adapter-read2 ADAPTER_READ2
                        set the adapter sequence for read 2 in the
                        'Settings' section to ADAPTER_READ2

Deprecated options:
  --truncate-barcodes BARCODE_LEN
                        trim barcode sequences in sample sheet to number of
                        bases specified by BARCODE_LEN. Default is to leave
                        barcode sequences unmodified (deprecated; only works
                        for CASAVA-style sample sheets)
  --miseq               convert input MiSEQ sample sheet to CASAVA-compatible
                        format (deprecated; specify -f/--format CASAVA to
                        convert IEM sample sheet to older format)
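
The lane specifications accepted by --set-id, --set-project and
--include-lanes (single integers, lists, ranges and combinations such as
'1,3-5,7') can be expanded as in the Python sketch below. This is an
illustration under the stated syntax, not the parser used by
prep_sample_sheet.py. The function names are hypothetical, and the sketch
assumes the name part of '[<lanes>:]<name>' contains no colon.

```python
def parse_lanes(spec):
    """Expand a lane specification such as '1,3-5,7' into a list of integers."""
    lanes = []
    for part in spec.split(","):
        if "-" in part:
            # A range such as '3-5' includes both end points
            start, end = part.split("-")
            lanes.extend(range(int(start), int(end) + 1))
        else:
            lanes.append(int(part))
    return lanes


def split_lanes_and_name(value):
    """Split '[<lanes>:]<name>' into (lanes or None, name)."""
    lanes, sep, name = value.partition(":")
    if not sep:
        # No lanes given: the whole value is the name
        return None, value
    return parse_lanes(lanes), name


print(parse_lanes("1,3-5,7"))
print(split_lanes_and_name("1-3:ProjectA"))
```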

reorder_fasta.py

usage: reorder_fasta.py [-h] [--version] FASTA

Reorder the chromosome records in a FASTA file into karyotypic order.

positional arguments:
  FASTA       FASTA file to reorder

optional arguments:
  -h, --help  show this help message and exit
  --version   show program's version number and exit
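
A karyotypic ordering can be approximated with a sort key, as in the Python
sketch below. The exact rules used by reorder_fasta.py are not given in the
help text, so the sketch assumes names of the form 'chr<label>', with numeric
labels sorted numerically and all other labels following in alphabetical
order. The function name is hypothetical.

```python
def karyotype_key(name):
    """Sort key placing chr1, chr2, ..., chr10 before chrX and chrY."""
    label = name[3:] if name.startswith("chr") else name
    if label.isdigit():
        # Numeric chromosomes first, in numeric (not string) order
        return (0, int(label), "")
    # Assumption: remaining names follow in alphabetical order
    return (1, 0, label)


names = ["chr10", "chrX", "chr2", "chr1", "chrY"]
print(sorted(names, key=karyotype_key))
```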

sam2soap.py

usage: sam2soap.py [-h] [--version] [-o SOAPFILE] [--debug] [SAMFILE]

Convert SAM file to SOAP format - reads from stdin (or SAMFILE, if specified),
and writes output to stdout unless -o option is specified.

positional arguments:
  SAMFILE      SAM file to convert (or stdin if not specified)

optional arguments:
  -h, --help   show this help message and exit
  --version    show program's version number and exit
  -o SOAPFILE  Output SOAP file name
  --debug      Turn on debugging output

split_fasta.py

usage: split_fasta.py [-h] [--version] [fasta_file]

Split input FASTA file with multiple sequences into multiple files each
containing sequences for a single chromosome.

positional arguments:
  fasta_file  input FASTA file to split

optional arguments:
  -h, --help  show this help message and exit
  --version   show program's version number and exit

split_fastq.py

usage: split_fastq.py [-h] [--version] [-l LANES] FASTQ

Split input Fastq file into multiple output Fastqs where each output only
contains reads from a single lane.

positional arguments:
  FASTQ                 Fastq to split

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -l LANES, --lanes LANES
                        lanes to extract: can be a single integer, a comma-
                        separated list (e.g. 1,3), a range (e.g. 5-7) or a
                        combination (e.g. 1,3,5-7). Default is to extract all
                        lanes in the Fastq
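
Splitting by lane relies on the lane number carried in each read header. The
Python sketch below assumes Illumina-style headers of the form
'@<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y> ...', where the lane is
the fourth colon-separated field. It is an illustration, not the code in
split_fastq.py; the function names and example headers are hypothetical.

```python
def lane_from_header(header):
    """Return the lane number from an Illumina-style Fastq header line."""
    return int(header.split(":")[3])


def split_by_lane(records):
    """Group records by lane; each record is a tuple starting with its header."""
    by_lane = {}
    for record in records:
        by_lane.setdefault(lane_from_header(record[0]), []).append(record)
    return by_lane


# Made-up records: (header, sequence, separator, quality)
records = [
    ("@INST:1:FLOWCELL:1:1101:100:200 1:N:0:ACGT", "ACGT", "+", "IIII"),
    ("@INST:1:FLOWCELL:2:1101:100:300 1:N:0:ACGT", "TTGA", "+", "IIII"),
    ("@INST:1:FLOWCELL:1:1101:100:400 1:N:0:ACGT", "GGCC", "+", "IIII"),
]
for lane, reads in sorted(split_by_lane(records).items()):
    print(lane, len(reads))
```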

verify_paired.py

usage: verify_paired.py [-h] [--version] R1.fastq R2.fastq

Check that read headers for R1 and R2 fastq files are in agreement, and that
the files form an R1/2 pair.

positional arguments:
  R1.fastq    Fastq file with R1 reads
  R2.fastq    Fastq file with R2 reads to check against R1 reads

optional arguments:
  -h, --help  show this help message and exit
  --version   show program's version number and exit
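
A header agreement check might look like the Python sketch below. The
comparison rules used by verify_paired.py are not spelled out in the help
text, so this sketch assumes two headers agree when they are identical up to
the read number, written either as a trailing '/1' or '/2' or after the first
space. It does not check that the read numbers themselves are 1 and 2. The
function names are hypothetical.

```python
def read_id(header):
    """Return the part of a Fastq header that R1 and R2 should share."""
    # Drop everything after the first whitespace (e.g. ' 1:N:0:ACGT')
    ident = header.split()[0]
    # Drop an older-style '/1' or '/2' read number suffix
    if ident.endswith("/1") or ident.endswith("/2"):
        ident = ident[:-2]
    return ident


def headers_are_paired(r1_header, r2_header):
    """True if the two headers refer to the same read."""
    return read_id(r1_header) == read_id(r2_header)
```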

xrorthologs.py

usage: xrorthologs.py [-h] [--version] [--debug] LOOKUPFILE SPECIES1 SPECIES2

Cross-reference data from two species given a lookup file that maps probeset
IDs from one species onto those of the other. LOOKUPFILE is a tab-delimited
file with one probe set for species 1 per line in first column and a comma-
separated list of the equivalent probe sets for species 2 in the fourth
column. Data for the two species are in tab-delimited files SPECIES1 and
SPECIES2. Output is two files: SPECIES1_appended.txt (SPECIES1 with the cross-
referenced data from SPECIES2 appended to each line) and SPECIES2_appended.txt
(SPECIES2 with SPECIES1 data appended).

positional arguments:
  LOOKUPFILE  tab-delimited file with one probe set for species 1 per line in
              first column and a comma-separated list of the equivalent probe
              sets for species 2 in the fourth column
  SPECIES1    data for species 1
  SPECIES2    data for species 2

optional arguments:
  -h, --help  show this help message and exit
  --version   show program's version number and exit
  --debug     Turn on debugging output
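
The LOOKUPFILE layout described above (species 1 probe set in the first
column, comma-separated species 2 probe sets in the fourth column) can be read
as in the Python sketch below. It is an illustration only, not the code in
xrorthologs.py; the function name and the probe set IDs are made up.

```python
def parse_lookup_line(line):
    """Return (species 1 probe set, list of equivalent species 2 probe sets)."""
    fields = line.rstrip("\n").split("\t")
    probeset = fields[0]
    # Fourth column (index 3) holds the comma-separated equivalents
    equivalents = [p for p in fields[3].split(",") if p]
    return probeset, equivalents


# Made-up lookup line with four tab-delimited columns
example = "probeA_at\tcol2\tcol3\tprobeX_at,probeY_at\n"
print(parse_lookup_line(example))
```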