Genome indexes and reference data utilities¶

Scripts for setting up genome indexes for various programs:

fetch_fasta.sh: download and build FASTA file for pre-defined organisms
build_indexes.sh: build all indexes from a FASTA file
bfast_build_indexes.sh: build bfast color-space indexes
bowtie_build_indexes.sh: build color- and base-space bowtie indexes
bowtie2_build_indexes.sh: build indexes for bowtie2
srma_build_indexes.sh: build indexes for srma
setup_genome_indexes.sh: automatically and reproducibly set up genome indexes
build_rRNA_bowtie_indexes.sh: create indexes and fastq_screen.conf for rRNA
make_seq_alignments.sh: build sequence alignment (.nib) files from FASTA

fetch_fasta.sh¶

Reproducibly downloads and builds FASTA files for pre-defined organisms.

Usage:

fetch_fasta.sh <name>

<name> identifies a specific organism and build, for example ‘hg18’ or mm9 (run without specifying a name to see a list of all the available organisms).

Outputs¶

Downloads and creates a FASTA file for the specified organism and puts this into a`fasta` subdirectory in the current working directory. When possible the script verifies the FASTA file by running an MD5 checksum it.

An .info file is also written which contains details about the FASTA file, such as source location and additional operations that were performed to unpack and construct the file, and the date and user who ran the script.

Adding new organisms¶

New organisms can be added to the script by creating additional setup_<name> functions for each, and defining the source and operations required to build it. For example:

function setup_hg18() {
    set_name    "Homo sapiens"
    set_species "Human"
    set_build   "HG18/NCBI36.1 March 2006"
    set_info    "Base chr. (1 to 22, X, Y), 'random' and chrM - unmasked"
    set_mirror  http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips
    set_archive chromFa.zip
    set_ext     fa
    set_md5sum  8cdfcaee2db09f2437e73d1db22fe681
    # Delete haplotypes
    add_processing_step "Delete haplotypes" "rm -f *hap*"
}

See the comments in the head of the script along with the existing setup_... functions for more specifics.

build_indexes.sh¶

Builds all indexes (bowtie, bowtie2, SRMA) within a standard directory structure from a FASTA file, by running the scripts for building the individual indexes.

Usage:

build_indexes.sh <fasta_file>

Outputs¶

Typically you would create a new directory for each organism, and then place the FASTA file in a fasta subdirectory e.g.:

hg18/
    fasta/
         hg18.fasta

Then invoke this script from within the top-level hg18 directory e.g.:

build_indexes.sh fasta/hg18.fasta

resulting in:

hg18/
    fasta/
    bfast/
    bowtie/

with the indexes placed in the appropriate directories (see the individual scripts for more details).

bfast_build_indexes.sh¶

Builds the bfast color-space indexes from a reference FASTA file.

Usage:

bfast_build_indexes.sh [OPTIONS] <genome_fasta_file>

Run with -h option to print full usage information.

Options:

-d <depth>¶: Specify depth-of-splitting used by Bfast (default 1)

-w <hash_width>¶: Specify hash width used by Bfast (default 14)

--dry-run¶: Print commands without executing them

-h¶: Print usage information and defaults

Outputs¶

Index files are created in the directory the script was run in.

.bif index files
.brg index files for base- and color-space
Symbolic link to the reference (input) FASTA file.

Warning

If .brg and/or .bif files already exist then bfast index may not run correctly. It’s recommended to remove any old files before rerunning the build script.

bowtie_build_indexes.sh¶

Builds the bowtie color and/or nucleotide space indexes from the reference FASTA file.

Usage:

bowtie_build_indexes.sh OPTIONS <genome_fasta_file>

Options:

By default both color- and nucleotide space indexes are built; to only build one or the other use one of:

--nt¶: build nucleotide-space indexes

--cs¶: build colorspace indexes

Outputs¶

Index files are created in the directory the script was run in.

Nucleotide indexes as <genome_name>.*.ebwt
Color space indexes as <genome_name>_c.*.ebwt

bowtie2_build_indexes.sh¶

Builds the indexes for bowtie2 (letter space only; bowtie2 doesn’t support colorspace) from the reference FASTA file.

Usage:

bowtie2_build_indexes.sh <genome_fasta_file>

Outputs¶

Index files are created in the directory the script was run in, with the names <genome_name>.*.bt2.

srma_build_indexes.sh¶

Creates the index files required by SRMA.

Note

By default the script expects the CreateSequenceDictionary.jar file to be in the /usr/share/java/picard-tools directory; if this is not the case then set the variable PICARD_TOOLS_DIR variable in your environment to point to the actual location.

For example for bash:

export PICARD_TOOLS_DIR=/path/to/my/picard-tools

Usage:

srma_build_indexes.sh <genome_fasta_file>

Outputs¶

Index files are created in the same directory as the reference FASTA file (which is where SRMA requires them to be); the script itself can be run from anywhere.

.fai and .dict files required by SRMA.

index_indexes.sh¶

Utility for exploring/reporting on existing genome indexes within a directory hierarchy.

Usage:

index_indexes.sh <dir>

Outputs¶

Searches <dir> and its subdirectories recursively and prints a report of the genome index-specific files (fasta, info etc) it finds.

setup_genome_indexes.sh¶

Automatically and reproducibly set up genome indexes.

Usage:

setup_genome_indexes.sh

The setup_genome_indexes.sh script doesn’t take any options, it runs through hard-coded lists of organisms for obtaining the sequence and creating bowtie, bfast and Picard/SRMA indexes, Galaxy .loc files and fastq_screen .conf files.

Outputs¶

The script outputs genome indexes based on the following directory structure for each organism:

pwd/
    organism/
            organism.info
            organism.chr.list
            bowtie/
                  ...bowtie indexes...
            bfast/
                  ...bfast indexes...
            fasta/
                  organism.fasta
                  ...picard/srma indexes...

It also creates:

fastq_screen directory: containing specified fastq_screen .conf files
Galaxy .loc files: for bowtie, bfast, picard, all_fasta and fastq_screen
genome_indexes.html file: HTML file listing the available genome indexes

build_rRNA_bowtie_indexes.sh¶

Create bowtie indexes and fastq_screen.conf file for rRNA sequences.

Usage:

build_rRNA_bowtie_indexes.sh <rRNAs>.tar.gz

The build_rRNA_bowtie_indexes.sh script unpacks the supplied archive file <rRNAs>.tar.gz and copies the FASTA-formatted sequence files it contains, then generates bowtie indexes from these and produces a fastq_screen.conf file for them.

Inputs¶

The script expects the input <rRNAs>.tar.gz file to unpack into the following directory structure:

rRNAs/
     fasta/
          ... fasta files ...

Outputs¶

The script creates the following directory structure in the current directory:

pwd/
    rRNAs/
         bowtie/
               ...bowtie indexes...
         fasta/
               ...rRNA fasta files...

It also creates fastq_screen_rRNAs.conf in the fastq_screen subdirectory of the current directory.

make_seq_alignments.sh¶

Build sequence alignment (.nib) files from a FASTA file.

Warning

faToNib is no longer distributed with the UCSC tools and .nib format is now deprecated in favour of .2bit.

The procedure is:

Split FASTA file into individual chromosomes (uses the split_fasta.py utility)
For each resulting chromosome run the UCSC tool faToNib to generate a sequence alignment file
Copy these to a specified destination directory

Usage:

make_seq_alignments.sh [--qsub=...] FASTA SEQ_DIR

Generates sequence alignment (.nib) files for each chromosome in FASTA, and copies them into the (pre-existing) directory SEQ_DIR.

Options:

--qsub[=...]¶: Run operations via Grid Engine (otherwise run directly). Optionally also supply extra arguments using --qsub="..." e.g. name of a specific queue.

Inputs¶

FASTA file with all chromosome sequences.

Outputs¶

A set of sequence alignment (.nib) files in the specified output directory.