Command reference
=================

.. note::

   This documentation has been auto-generated from the
   command help.

The following utilities are available:

.. contents:: :local:

.. _reference_analyse_solid_run:

analyse_solid_run.py
********************

::

    usage: analyse_solid_run.py [-h] [--version] [--only] [--report]
                                [--report-paths] [--xls] [--verify] [--layout]
                                [--rsync] [--copy COPY_PATTERN]
                                [--gzip GZIP_PATTERN] [--md5 MD5_PATTERN]
                                [--md5sum] [--no-warnings] [--debug]
                                solid_run_dir [solid_run_dir ...]
    
    Utility for performing various checks and operations on SOLiD run directories.
    If a single solid_run_dir is specified then analyse_solid_run.py automatically
    finds and operates on all associated directories from the same instrument and
    with the same timestamp.
    
    positional arguments:
      solid_run_dir        SOLiD run directory to operate on
    
    optional arguments:
      -h, --help           show this help message and exit
      --version            show program's version number and exit
      --only               only operate on the specified solid_run_dir, don't
                           locate associated run directories
      --report             print a report of the SOLiD run
      --report-paths       in report mode, also print full paths to primary data
                           files
      --xls                write report to Excel spreadsheet
      --verify             do verification checks on SOLiD run directories
      --layout             generate script for laying out analysis directories
      --rsync              generate script for rsyncing data
      --copy COPY_PATTERN  copy primary data files to pwd from specific library
                           where names match COPY_PATTERN, which should be of the
                           form '<sample>/<library>'
      --gzip GZIP_PATTERN  make gzipped copies of primary data files in pwd from
                           specific libraries where names match GZIP_PATTERN,
                           which should be of the form '<sample>/<library>'
      --md5 MD5_PATTERN    calculate md5sums for primary data files from specific
                           libraries where names match MD5_PATTERN, which should
                           be of the form '<sample>/<library>'
      --md5sum             calculate md5sums for all primary data files
                           (equivalent to --md5=*/*)
      --no-warnings        suppress warning messages
      --debug              turn on debugging output (nb overrides --no-warnings)
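
The ``--copy``, ``--gzip`` and ``--md5`` options all take a pattern of the form ``<sample>/<library>``. The help text only confirms that ``*/*`` matches every library; the sketch below assumes shell-style wildcards in both parts, and ``match_library`` is a hypothetical helper rather than part of the utility.

```python
from fnmatch import fnmatchcase


def match_library(pattern, sample, library):
    """Return True if sample/library matches a '<sample>/<library>' pattern.

    Assumption: both parts of the pattern use shell-style wildcards.
    """
    sample_pattern, _, library_pattern = pattern.partition("/")
    # Assumption: a pattern without a library part matches any library
    if not library_pattern:
        library_pattern = "*"
    return (fnmatchcase(sample, sample_pattern) and
            fnmatchcase(library, library_pattern))


print(match_library("*/*", "AB", "AB_lib1"))
print(match_library("CD/*", "AB", "AB_lib1"))
```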
    
.. _reference_annotate_probesets:

annotate_probesets.py
*********************

::

    usage: annotate_probesets.py [-h] [--version] [-o OUT_FILE] IN_FILE
    
    Annotate probeset list based on name: reads in first column of tab-delimited
    input file IN_FILE as a list of probeset names and outputs these
    names to another tab-delimited file with a description for each. Output file
    name can be specified with the -o option, otherwise it will be the input file
    name with '_annotated' appended.
    
    positional arguments:
      IN_FILE      input probeset file
    
    optional arguments:
      -h, --help   show this help message and exit
      --version    show program's version number and exit
      -o OUT_FILE  specify output file name
    
.. _reference_best_exons:

best_exons.py
*************

::

    usage: best_exons.py [-h] [--version] [--rank-by {log2_fold_change,p_value}]
                         [--probeset-col PROBESET_COL]
                         [--gene-symbol-col GENE_SYMBOL_COL]
                         [--log2-fold-change-col LOG2_FOLD_CHANGE_COL]
                         [--p-value-col P_VALUE_COL] [--debug]
                         EXONS_IN BEST_EXONS
    
    Read exon and gene symbol data from EXONS_IN, pick the top three exons for
    each gene symbol, then output averages of the associated values to
    BEST_EXONS.
    
    positional arguments:
      EXONS_IN              input file with exon and gene symbol data
      BEST_EXONS            output file with averages from top three exons for each
                            gene symbol
    
    optional arguments:
      -h, --help            show this help message and exit
      --version             show program's version number and exit
      --rank-by {log2_fold_change,p_value}
                            select the criterion for ranking the 'best' exons;
                            possible options are: 'log2_fold_change' (default), or
                            'p_value'.
      --probeset-col PROBESET_COL
                            specify column with probeset names (default=0, columns
                            start counting from zero)
      --gene-symbol-col GENE_SYMBOL_COL
                            specify column with gene symbols (default=1, columns
                            start counting from zero)
      --log2-fold-change-col LOG2_FOLD_CHANGE_COL
                            specify column with log2 fold change (default=12,
                            columns start counting from zero)
      --p-value-col P_VALUE_COL
                            specify column with p-value (default=13; columns start
                            counting from zero)
      --debug               Turn on debug output
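
The selection described above (top three exons per gene symbol, then averaging) might look roughly like the following. This is a minimal sketch and not the utility's own code: the ``best_exons`` function, the row layout, and the ranking directions (largest fold change first, smallest p-value first) are all assumptions.

```python
from collections import defaultdict


def best_exons(rows, rank_by="log2_fold_change", n=3):
    """Average the values of the top n exons for each gene symbol.

    rows: iterable of (probeset, gene_symbol, log2_fold_change, p_value)
    Returns a dict mapping gene symbol to (mean log2 fold change, mean p-value).
    """
    by_gene = defaultdict(list)
    for _probeset, gene, log2_fc, p_value in rows:
        by_gene[gene].append((log2_fc, p_value))
    results = {}
    for gene, exons in by_gene.items():
        if rank_by == "log2_fold_change":
            # Assumption: the largest fold change ranks first
            ranked = sorted(exons, key=lambda e: e[0], reverse=True)
        else:
            # Assumption: the smallest p-value ranks first
            ranked = sorted(exons, key=lambda e: e[1])
        top = ranked[:n]
        results[gene] = (sum(e[0] for e in top) / len(top),
                         sum(e[1] for e in top) / len(top))
    return results


rows = [("p1", "G", 1.0, 0.5), ("p2", "G", 2.0, 0.4),
        ("p3", "G", 3.0, 0.3), ("p4", "G", 4.0, 0.2)]
print(best_exons(rows))
```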
    
.. _reference_bowtie_mapping_stats:

bowtie_mapping_stats.py
***********************

::

    usage: bowtie_mapping_stats.py [-h] [--version] [-o xls_file] [-t]
                                   BOWTIE_LOG_FILE [BOWTIE_LOG_FILE ...]
    
    Extract mapping statistics for each sample referenced in the input bowtie log
    files and summarise the data in an XLS spreadsheet. Handles output from both
    Bowtie and Bowtie2.
    
    positional arguments:
      BOWTIE_LOG_FILE  logfile output from Bowtie or Bowtie2
    
    optional arguments:
      -h, --help       show this help message and exit
      --version        show program's version number and exit
      -o xls_file      specify name of the output XLS file (otherwise defaults to
                       'mapping_summary.xls').
      -t               write data to tab-delimited file in addition to the XLS
                       file. The tab file will have the same name as the XLS file,
                       with the extension replaced by .txt
    
.. _reference_extract_reads:

extract_reads.py
****************

::

    usage: extract_reads.py [-h] [--version] [-m PATTERN] [-n N] [-s SEED]
                            infile [infile ...]
    
    Extract subsets of reads from each of the supplied files according to
    specified criteria (e.g. random, matching a pattern etc). Input files can be
    any mixture of FASTQ (.fastq, .fq), CSFASTA (.csfasta) and QUAL (.qual).
    
    positional arguments:
      infile                input FASTQ, CSFASTA, or QUAL file
    
    optional arguments:
      -h, --help            show this help message and exit
      --version             show program's version number and exit
      -m PATTERN, --match PATTERN
                            extract records that match Python regular expression
                            PATTERN
      -n N                  extract N random reads from the input file(s). If
                            multiple files are supplied (e.g. R1/R2 pair) then the
                            same subsets will be extracted for each. (Optionally a
                            percentage can be supplied instead e.g. '50%' to
                            extract a subset of half the reads.)
      -s SEED, --seed SEED  specify seed for random number generator (used for -n
                            option; using the same seed should produce the same
                            'random' sample of reads)
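
To illustrate why a fixed seed gives a reproducible subset, and how the same subset can be taken from both files of an R1/R2 pair, here is a small sketch. The ``choose_indices`` and ``extract`` helpers are hypothetical, and the percentage handling is an assumption based on the ``-n`` description above.

```python
import random


def choose_indices(n_reads, n, seed=None):
    """Pick a reproducible random subset of read positions.

    n is either a count or a percentage string such as '50%'.
    """
    rng = random.Random(seed)
    if isinstance(n, str) and n.endswith("%"):
        n = int(n_reads * float(n[:-1]) / 100.0)
    return sorted(rng.sample(range(n_reads), int(n)))


def extract(records, indices):
    """Return the records at the chosen positions, in file order."""
    wanted = set(indices)
    return [r for i, r in enumerate(records) if i in wanted]


# The same indices are applied to both files so that pairs stay together
r1 = ["read%d/1" % i for i in range(10)]
r2 = ["read%d/2" % i for i in range(10)]
indices = choose_indices(len(r1), 3, seed=42)
print(extract(r1, indices))
print(extract(r2, indices))
```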
    
.. _reference_fastq_strand:

fastq_strand.py
***************

::

    Fastq_strand: version 1.14.0
    usage: fastq_strand.py [-h] [--version] [-g GENOMEDIR] [--subset SUBSET]
                           [-o OUTDIR] [-c FILE] [-n N] [--counts]
                           [--keep-star-output]
                           READ1 [READ2]
    
    Generate strandedness statistics for FASTQ or FASTQ pair, by running STAR
    using one or more genome indexes
    
    positional arguments:
      READ1                 R1 Fastq file
      READ2                 R2 Fastq file
    
    optional arguments:
      -h, --help            show this help message and exit
      --version             show program's version number and exit
      -g GENOMEDIR, --genome GENOMEDIR
                            path to directory with STAR index for genome to use
                            (use as an alternative to -c/--conf; can be specified
                            multiple times to include additional genomes)
      --subset SUBSET       use a random subset of read pairs from the input
                            Fastqs; set to zero to use all reads (default: 10000)
      -o OUTDIR, --outdir OUTDIR
                            specify directory to write final outputs to (default:
                            current directory)
      -c FILE, --conf FILE  specify delimited 'conf' file with list of NAME and
                            STAR index directory pairs. NB if a conf file is
                            supplied then any indices specified on the command line
                            will be ignored
      -n N                  number of threads to run STAR with (default: 1)
      --counts              include the count sums for unstranded, 1st read strand
                            aligned and 2nd read strand aligned in the output file
                            (default: only include percentages)
      --keep-star-output    keep the output from STAR (default: delete outputs on
                            completion)
    
.. _reference_log_seq_data:

log_seq_data.sh
***************

::

    Usage:
        log_seq_data.sh <logging_file> [-d|-u] <seq_data_dir> [<description>]
        log_seq_data.sh <logging_file> -c <seq_data_dir> <new_dir> [<description>]
        log_seq_data.sh <logging_file> -i <seq_data_dir>
        log_seq_data.sh <logging_file> -v
    
    Add, update or delete an entry for <seq_data_dir> in <logging_file>, or
    verify entries.
    
    <seq_data_dir> can be a primary data directory from a sequencer or a
    directory of derived data (e.g. analysis directory)
    
    By default an entry is added for the specified data directory; each
    entry is a tab-delimited line with the full path for the data directory
    followed by the UNIX timestamp and the optional description text.
    
    If <logging_file> doesn't exist then it will be created; if
    <seq_data_dir> is already in the log file then it won't be added again.
    
    Options:
    
         -d     deletes an existing entry
         -u     update description for an existing entry (or creates a new one
                if an existing entry not found)
         -c     changes an existing entry, updating the directory path and
                (optionally) the description
         -i     print information about an entry
         -v     validates the entries in the logging file.
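
Each log entry is described above as a tab-delimited line holding the full path, a UNIX timestamp and an optional description. The following sketch shows how such a line could be built and read back from Python; ``make_entry`` and ``parse_entry`` are hypothetical helpers, and storing the timestamp as whole seconds is an assumption.

```python
import os
import time


def make_entry(seq_data_dir, description=None, timestamp=None):
    """Build a tab-delimited log line: path, timestamp, optional description."""
    if timestamp is None:
        timestamp = time.time()
    # Assumption: the timestamp is stored as whole seconds
    fields = [os.path.abspath(seq_data_dir), str(int(timestamp))]
    if description:
        fields.append(description)
    return "\t".join(fields)


def parse_entry(line):
    """Split a log line back into (path, timestamp, description)."""
    fields = line.rstrip("\n").split("\t")
    description = fields[2] if len(fields) > 2 else None
    return fields[0], int(fields[1]), description


entry = make_entry("/data/run1", "test run", timestamp=1700000000)
print(parse_entry(entry))
```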
    
.. _reference_make_macs_xls:

make_macs_xls.py
****************

::

    usage: make_macs_xls.py [-h] [--version] MACS_OUTPUT [XLS_OUT]
    
    Create an XLS spreadsheet from the output of the MACS peak caller.
    <MACS_OUTPUT> is the output '.xls' file from MACS; if supplied then <XLS_OUT>
    is the name to use for the output file, otherwise it will be called
    'XLS_<MACS_OUTPUT>.xls'.
    
    positional arguments:
      MACS_OUTPUT  output .xls file from MACS
      XLS_OUT      output MS XLS file (defaults to 'XLS_<MACS_OUTPUT>.xls').
    
    optional arguments:
      -h, --help   show this help message and exit
      --version    show program's version number and exit
    
.. _reference_make_macs2_xls:

make_macs2_xls.py
*****************

::

    usage: make_macs2_xls.py [-h] [--version] [-f XLS_FORMAT] [-b]
                             MACS2_XLS [XLS_OUT]
    
    Create an XLS(X) spreadsheet from the output of the MACS2 peak caller.
    MACS2_XLS is the output '.xls' file from MACS2; if supplied then XLS_OUT is
    the name to use for the output file (otherwise it will be called
    'XLS_<MACS2_XLS>.xls(x)').
    
    positional arguments:
      MACS2_XLS             output '.xls' file from MACS2
      XLS_OUT               name to use for the output file (default is
                            'XLS_<MACS2_XLS>.xls(x)')
    
    optional arguments:
      -h, --help            show this help message and exit
      --version             show program's version number and exit
      -f XLS_FORMAT, --format XLS_FORMAT
                            specify the output Excel spreadsheet format; must be
                            one of 'xlsx' or 'xls' (default is 'xlsx')
      -b, --bed             write an additional TSV file with chrom,
                            abs_summit+100 and abs_summit-100 data as the columns.
                            (NB only possible for MACS2 run without --broad)
    
.. _reference_manage_seqs:

manage_seqs.py
**************

::

    usage: manage_seqs.py [-h] [--version] [-o OUT_FILE] [-a APPEND_FILE]
                          [-d DESCRIPTION]
                          INFILE [INFILE ...]
    
    Read sequences and names from one or more INFILEs (which can be a mixture of
    FastQC 'contaminants' format and/or Fasta format), check for redundancy (i.e.
    sequences with multiple associated names) and contradictions (i.e. names with
    multiple associated sequences).
    
    positional arguments:
      INFILE          input sequences
    
    optional arguments:
      -h, --help      show this help message and exit
      --version       show program's version number and exit
      -o OUT_FILE     write all sequences to OUT_FILE in FastQC 'contaminants'
                      format
      -a APPEND_FILE  append sequences to existing APPEND_FILE (not compatible
                      with -o)
      -d DESCRIPTION  supply arbitrary text to write to the header of the output
                      file
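
The two checks named above can be expressed with a pair of lookups: one from sequence to names (redundancy) and one from name to sequences (contradictions). This is only a sketch of the idea; ``check_sequences`` and its input format are assumptions, not the utility's implementation.

```python
from collections import defaultdict


def check_sequences(records):
    """Find redundant and contradictory entries in (name, sequence) pairs.

    Returns two dicts:
    - redundant: sequence -> sorted names (sequences with multiple names)
    - contradictory: name -> sorted sequences (names with multiple sequences)
    """
    names_by_seq = defaultdict(set)
    seqs_by_name = defaultdict(set)
    for name, seq in records:
        names_by_seq[seq].add(name)
        seqs_by_name[name].add(seq)
    redundant = {seq: sorted(names)
                 for seq, names in names_by_seq.items() if len(names) > 1}
    contradictory = {name: sorted(seqs)
                     for name, seqs in seqs_by_name.items() if len(seqs) > 1}
    return redundant, contradictory


records = [("A", "ACGT"), ("B", "ACGT"), ("C", "GGGG"), ("C", "TTTT")]
print(check_sequences(records))
```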
    
.. _reference_md5checker:

md5checker.py
*************

::

    usage: 
      md5checker.py -d SOURCE_DIR DEST_DIR
      md5checker.py -d FILE1 FILE2
      md5checker.py [ -o CHKSUM_FILE ] DIR
      md5checker.py [ -o CHKSUM_FILE ] FILE
      md5checker.py -c CHKSUM_FILE
    
    Compute and verify MD5 checksums for files and directories.
    
    optional arguments:
      -h, --help            show this help message and exit
      --version             show program's version number and exit
      -d, --diff            for two directories: check that contents of directory
                            SOURCE_DIR are present in DEST_DIR and have the same MD5 sums;
                            for two files: check that FILE1 and FILE2 have the
                            same MD5 sums
      -c, --check           read MD5 sums from the specified file and check them
      -q, --quiet           suppress output messages and only report failures
    
    Directory comparison (-d, --diff):
      Check that the contents of SOURCE_DIR are present in DEST_DIR and have
      matching MD5 sums. Note that files that are only present in DEST_DIR are
      not reported.
    
    File comparison (-d, --diff):
      Check that FILE1 and FILE2 have matching MD5 sums.
    
    Checksum generation:
      MD5 checksums are calculated for all files in the specified directory, or
      for a single specified file.
    
      -o CHKSUM_FILE, --output CHKSUM_FILE
                            optionally write computed MD5 sums to CHKSUM_FILE
                            (otherwise the sums are written to stdout). The output
                            format is the same as that used by the Linux 'md5sum'
                            tool.
    
    Checksum verification (-c, --check):
      Check MD5 sums for each of the files listed in the specified CHKSUM_FILE
      relative to the current directory. This option behaves the same as the
      Linux 'md5sum' tool.
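
For reference, an MD5 sum in the ``md5sum`` output format (hex digest, two spaces, file name) can be produced with Python's standard ``hashlib`` module. The sketch below is independent of ``md5checker.py`` itself; ``md5sum`` and ``md5sum_line`` are hypothetical helper names.

```python
import hashlib


def md5sum(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as fp:
        # Read fixed-size chunks so large files don't need to fit in memory
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def md5sum_line(path):
    """Format a checksum line the way the 'md5sum' tool does."""
    return "%s  %s" % (md5sum(path), path)
```

A line produced this way for each file can be written to a checksum file and later verified by recomputing the digest and comparing.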
    
.. _reference_prep_sample_sheet:

prep_sample_sheet.py
********************

::

    usage: prep_sample_sheet.py [-h] [--version] [-o SAMPLESHEET_OUT] [-f FMT]
                                [-V] [--fix-spaces] [--fix-duplicates]
                                [--fix-empty-projects] [--set-id SAMPLE_ID]
                                [--set-project SAMPLE_PROJECT]
                                [--reverse-complement-i5] [--ignore-warnings]
                                [--include-lanes LANES] [--set-adapter ADAPTER]
                                [--set-adapter-read2 ADAPTER_READ2]
                                [--truncate-barcodes BARCODE_LEN] [--miseq]
                                SAMPLE_SHEET
    
    Utility to prepare SampleSheet files from Illumina sequencers. Can be used to
    view, validate and update or fix information such as sample IDs and project
    names before running BCL to FASTQ conversion.
    
    positional arguments:
      SAMPLE_SHEET          input sample sheet file
    
    optional arguments:
      -h, --help            show this help message and exit
      --version             show program's version number and exit
      -o SAMPLESHEET_OUT    output new sample sheet to SAMPLESHEET_OUT
      -f FMT, --format FMT  specify the format of the output sample sheet written
                            by the -o option; can be either 'CASAVA' or 'IEM'
                            (defaults to the format of the original file)
      -V, --view            view predicted outputs from sample sheet
      --fix-spaces          replace spaces in sample ID and project fields with
                            underscores
      --fix-duplicates      append unique indices to sample IDs where the original
                            ID and project name combination are duplicated
      --fix-empty-projects  create sample project names where these are blank in
                            the original sample sheet
      --set-id SAMPLE_ID    update/set the values in sample ID field; SAMPLE_ID
                            should be of the form '<lanes>:<name>', where <lanes>
                            is a single integer (e.g. 1), a set of integers (e.g.
                            1,3,...), a range (e.g. 1-3), or a combination (e.g.
                            1,3-5,7)
      --set-project SAMPLE_PROJECT
                            update/set values in the sample project field;
                            SAMPLE_PROJECT should be of the form
                            '[<lanes>:]<name>', where the optional <lanes> part
                            can be a single integer (e.g. 1), a set of integers
                            (e.g. 1,3,...), a range (e.g. 1-3), or a combination
                            (e.g. 1,3-5,7). If no lanes are specified then all
                            samples will have their project set to <name>
      --reverse-complement-i5
                            replace i5 index sequences with their reverse
                            complement
      --ignore-warnings     ignore warnings about spaces and duplicated
                            sampleID/sampleProject combinations when writing new
                            samplesheet.csv file
      --include-lanes LANES
                            specify a subset of lanes to include in the output
                            sample sheet; LANES should be a single integer (e.g.
                            1), a list of integers (e.g. 1,3,...), a range (e.g.
                            1-3) or a combination (e.g. 1,3-5,7). Default is to
                            include all lanes
      --set-adapter ADAPTER
                            set the adapter sequence in the 'Settings' section to
                            ADAPTER
      --set-adapter-read2 ADAPTER_READ2
                            set the adapter sequence for read 2 in the
                            'Settings' section to ADAPTER_READ2
    
    Deprecated options:
      --truncate-barcodes BARCODE_LEN
                            trim barcode sequences in sample sheet to number of
                            bases specified by BARCODE_LEN. Default is to leave
                            barcode sequences unmodified (deprecated; only works
                            for CASAVA-style sample sheets)
      --miseq               convert input MiSEQ sample sheet to CASAVA-compatible
                            format (deprecated; specify -f/--format CASAVA to
                            convert IEM sample sheet to older format)
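
Several options above accept a lane specification such as ``1,3-5,7``, optionally followed by ``:<name>``. The sketch below shows one way such values could be expanded; ``parse_lanes`` and ``parse_set_project`` are hypothetical helpers and do not reproduce the utility's own parsing or error handling.

```python
def parse_lanes(lanes):
    """Expand a lane specification like '1,3-5,7' into a sorted list."""
    result = set()
    for part in lanes.split(","):
        if "-" in part:
            # A range such as '3-5' includes both end points
            start, end = part.split("-")
            result.update(range(int(start), int(end) + 1))
        else:
            result.add(int(part))
    return sorted(result)


def parse_set_project(value):
    """Split '[<lanes>:]<name>' into (lanes or None, name)."""
    lanes, sep, name = value.partition(":")
    if not sep:
        # No lanes given: the name applies to all samples
        return None, value
    return parse_lanes(lanes), name


print(parse_lanes("1,3-5,7"))
print(parse_set_project("1,3-5:ProjectA"))
```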
    
.. _reference_reorder_fasta:

reorder_fasta.py
****************

::

    usage: reorder_fasta.py [-h] [--version] FASTA
    
    Reorder the chromosome records in a FASTA file into karyotypic order.
    
    positional arguments:
      FASTA       FASTA file to reorder
    
    optional arguments:
      -h, --help  show this help message and exit
      --version   show program's version number and exit
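
The help text does not define "karyotypic order" precisely. A common reading is numbered chromosomes in numeric order followed by X, Y and the mitochondrial chromosome, and the sketch below sorts names on that assumption. ``karyotype_key`` and the ``SPECIAL`` table are hypothetical and may not match the ordering the utility actually applies.

```python
import re

# Assumption: sex and mitochondrial chromosomes follow the numbered ones
SPECIAL = {"X": 0, "Y": 1, "M": 2, "MT": 2}


def karyotype_key(name):
    """Sort key placing chr1, chr2, ..., chr10 before chrX, chrY, chrM."""
    label = name[3:] if name.lower().startswith("chr") else name
    match = re.match(r"(\d+)(.*)", label)
    if match:
        # Numbered chromosomes sort numerically, so chr2 precedes chr10
        return (0, int(match.group(1)), match.group(2))
    return (1, SPECIAL.get(label.upper(), len(SPECIAL)), label)


names = ["chr10", "chrX", "chr2", "chrM", "chr1", "chrY"]
print(sorted(names, key=karyotype_key))
```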
    
.. _reference_sam2soap:

sam2soap.py
***********

::

    usage: sam2soap.py [-h] [--version] [-o SOAPFILE] [--debug] [SAMFILE]
    
    Convert SAM file to SOAP format - reads from stdin (or SAMFILE, if specified),
    and writes output to stdout unless -o option is specified.
    
    positional arguments:
      SAMFILE      SAM file to convert (or stdin if not specified)
    
    optional arguments:
      -h, --help   show this help message and exit
      --version    show program's version number and exit
      -o SOAPFILE  Output SOAP file name
      --debug      Turn on debugging output
    
.. _reference_split_fasta:

split_fasta.py
**************

::

    usage: split_fasta.py [-h] [--version] [fasta_file]
    
    Split input FASTA file with multiple sequences into multiple files each
    containing sequences for a single chromosome.
    
    positional arguments:
      fasta_file  input FASTA file to split
    
    optional arguments:
      -h, --help  show this help message and exit
      --version   show program's version number and exit
    
.. _reference_split_fastq:

split_fastq.py
**************

::

    usage: split_fastq.py [-h] [--version] [-l LANES] FASTQ
    
    Split input Fastq file into multiple output Fastqs where each output only
    contains reads from a single lane.
    
    positional arguments:
      FASTQ                 Fastq to split
    
    optional arguments:
      -h, --help            show this help message and exit
      --version             show program's version number and exit
      -l LANES, --lanes LANES
                            lanes to extract: can be a single integer, a comma-
                            separated list (e.g. 1,3), a range (e.g. 5-7) or a
                            combination (e.g. 1,3,5-7). Default is to extract all
                            lanes in the Fastq
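
Splitting by lane relies on the lane number recorded in each read header. The sketch below assumes Illumina headers of the form ``@<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y> ...``, where the lane is the fourth colon-separated field; ``lane_of`` and ``split_by_lane`` are hypothetical helpers and other header formats are not handled.

```python
from collections import defaultdict


def lane_of(header):
    """Extract the lane number from an Illumina-style read header."""
    # The lane is the fourth colon-separated field of the first header part
    return int(header.split(" ")[0].split(":")[3])


def split_by_lane(records, lanes=None):
    """Group (header, sequence, plus, quality) records by lane.

    If lanes is given, only records from those lanes are kept.
    """
    by_lane = defaultdict(list)
    for record in records:
        lane = lane_of(record[0])
        if lanes is None or lane in lanes:
            by_lane[lane].append(record)
    return dict(by_lane)


records = [("@M1:1:FC:1:1101:1:1 1:N:0:ACGT", "A", "+", "I"),
           ("@M1:1:FC:2:1101:1:2 1:N:0:ACGT", "C", "+", "I")]
print(sorted(split_by_lane(records)))
```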
    
.. _reference_verify_paired:

verify_paired.py
****************

::

    usage: verify_paired.py [-h] [--version] R1.fastq R2.fastq
    
    Check that read headers for R1 and R2 fastq files are in agreement, and that
    the files form an R1/2 pair.
    
    positional arguments:
      R1.fastq    Fastq file with R1 reads
      R2.fastq    Fastq file with R2 reads to check against R1 reads
    
    optional arguments:
      -h, --help  show this help message and exit
      --version   show program's version number and exit
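
As an illustration of what agreement between R1 and R2 headers can mean, the sketch below compares headers position by position. It assumes Illumina-style headers where the part before the space identifies the read and the part after starts with the read number (``1`` or ``2``); ``headers_pair`` and ``verify_paired`` are hypothetical and the utility's actual checks may differ.

```python
from itertools import zip_longest


def headers_pair(header1, header2):
    """Return True if two headers look like the R1 and R2 of one read."""
    id1, _, tail1 = header1.partition(" ")
    id2, _, tail2 = header2.partition(" ")
    if id1 != id2:
        return False
    # Assumption: the read number is the first field after the space
    return tail1.split(":")[0] == "1" and tail2.split(":")[0] == "2"


def verify_paired(headers1, headers2):
    """Check that two header lists pair up read by read."""
    for h1, h2 in zip_longest(headers1, headers2):
        # A missing header means the files have different numbers of reads
        if h1 is None or h2 is None or not headers_pair(h1, h2):
            return False
    return True


r1 = ["@M1:1:FC:1:1101:1:1 1:N:0:ACGT"]
r2 = ["@M1:1:FC:1:1101:1:1 2:N:0:ACGT"]
print(verify_paired(r1, r2))
```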
    
.. _reference_xrorthologs:

xrorthologs.py
**************

::

    usage: xrorthologs.py [-h] [--version] [--debug] LOOKUPFILE SPECIES1 SPECIES2
    
    Cross-reference data from two species given a lookup file that maps probeset
    IDs from one species onto those of the other. LOOKUPFILE is a tab-delimited
    file with one probe set for species 1 per line in first column and a comma-
    separated list of the equivalent probe sets for species 2 in the fourth
    column. Data for the two species are in tab-delimited files SPECIES1 and
    SPECIES2. Output is two files: SPECIES1_appended.txt (SPECIES1 with the cross-
    referenced data from SPECIES2 appended to each line) and SPECIES2_appended.txt
    (SPECIES2 with SPECIES1 data appended).
    
    positional arguments:
      LOOKUPFILE  tab-delimited file with one probe set for species 1 per line in
                  first column and a comma-separated list of the equivalent probe
                  sets for species 2 in the fourth column
      SPECIES1    data for species 1
      SPECIES2    data for species 2
    
    optional arguments:
      -h, --help  show this help message and exit
      --version   show program's version number and exit
      --debug     Turn on debugging output
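
The lookup file layout described above (species 1 probe set in the first column, comma-separated species 2 probe sets in the fourth) can be read into a dictionary as sketched below. ``read_lookup`` and ``reverse_lookup`` are hypothetical helpers; skipping short lines is an assumption about how malformed input might be treated.

```python
def read_lookup(lines):
    """Map each species 1 probe set to its list of species 2 probe sets."""
    lookup = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Assumption: lines with fewer than four columns are skipped
        if len(fields) < 4:
            continue
        lookup[fields[0]] = [p.strip() for p in fields[3].split(",")
                             if p.strip()]
    return lookup


def reverse_lookup(lookup):
    """Invert the mapping: species 2 probe set -> species 1 probe sets."""
    reverse = {}
    for probeset, equivalents in lookup.items():
        for other in equivalents:
            reverse.setdefault(other, []).append(probeset)
    return reverse


lines = ["ps1\tx\ty\tq1,q2\n", "ps2\tx\ty\tq2\n"]
lookup = read_lookup(lines)
print(lookup)
print(reverse_lookup(lookup))
```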