Genome indexes and reference data utilities

Scripts for setting up genome indexes for various programs:

fetch_fasta.sh

Reproducibly downloads and builds FASTA files for pre-defined organisms.

Usage:

fetch_fasta.sh <name>

<name> identifies a specific organism and build, for example ‘hg18’ or mm9 (run without specifying a name to see a list of all the available organisms).

Outputs

Downloads and creates a FASTA file for the specified organism and puts this into a`fasta` subdirectory in the current working directory. When possible the script verifies the FASTA file by running an MD5 checksum it.

An .info file is also written which contains details about the FASTA file, such as source location and additional operations that were performed to unpack and construct the file, and the date and user who ran the script.

Adding new organisms

New organisms can be added to the script by creating additional setup_<name> functions for each, and defining the source and operations required to build it. For example:

function setup_hg18() {
    set_name    "Homo sapiens"
    set_species "Human"
    set_build   "HG18/NCBI36.1 March 2006"
    set_info    "Base chr. (1 to 22, X, Y), 'random' and chrM - unmasked"
    set_mirror  http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips
    set_archive chromFa.zip
    set_ext     fa
    set_md5sum  8cdfcaee2db09f2437e73d1db22fe681
    # Delete haplotypes
    add_processing_step "Delete haplotypes" "rm -f *hap*"
}

See the comments in the head of the script along with the existing setup_... functions for more specifics.

build_indexes.sh

Builds all indexes (bowtie, bowtie2, SRMA) within a standard directory structure from a FASTA file, by running the scripts for building the individual indexes.

Usage:

build_indexes.sh <fasta_file>

Outputs

Typically you would create a new directory for each organism, and then place the FASTA file in a fasta subdirectory e.g.:

hg18/
    fasta/
         hg18.fasta

Then invoke this script from within the top-level hg18 directory e.g.:

build_indexes.sh fasta/hg18.fasta

resulting in:

hg18/
    fasta/
    bfast/
    bowtie/

with the indexes placed in the appropriate directories (see the individual scripts for more details).

bfast_build_indexes.sh

Builds the bfast color-space indexes from a reference FASTA file.

Usage:

bfast_build_indexes.sh [OPTIONS] <genome_fasta_file>

Run with -h option to print full usage information.

Options:

-d <depth>

Specify depth-of-splitting used by Bfast (default 1)

-w <hash_width>

Specify hash width used by Bfast (default 14)

--dry-run

Print commands without executing them

-h

Print usage information and defaults

Outputs

Index files are created in the directory the script was run in.

  • .bif index files
  • .brg index files for base- and color-space
  • Symbolic link to the reference (input) FASTA file.

Warning

If .brg and/or .bif files already exist then bfast index may not run correctly. It’s recommended to remove any old files before rerunning the build script.

bowtie_build_indexes.sh

Builds the bowtie color and/or nucleotide space indexes from the reference FASTA file.

Usage:

bowtie_build_indexes.sh OPTIONS <genome_fasta_file>

Options:

By default both color- and nucleotide space indexes are built; to only build one or the other use one of:

--nt

build nucleotide-space indexes

--cs

build colorspace indexes

Outputs

Index files are created in the directory the script was run in.

  • Nucleotide indexes as <genome_name>.*.ebwt
  • Color space indexes as <genome_name>_c.*.ebwt

bowtie2_build_indexes.sh

Builds the indexes for bowtie2 (letter space only; bowtie2 doesn’t support colorspace) from the reference FASTA file.

Usage:

bowtie2_build_indexes.sh <genome_fasta_file>

Outputs

Index files are created in the directory the script was run in, with the names <genome_name>.*.bt2.

srma_build_indexes.sh

Creates the index files required by SRMA.

Note

By default the script expects the CreateSequenceDictionary.jar file to be in the /usr/share/java/picard-tools directory; if this is not the case then set the variable PICARD_TOOLS_DIR variable in your environment to point to the actual location.

For example for bash:

export PICARD_TOOLS_DIR=/path/to/my/picard-tools

Usage:

srma_build_indexes.sh <genome_fasta_file>

Outputs

Index files are created in the same directory as the reference FASTA file (which is where SRMA requires them to be); the script itself can be run from anywhere.

  • .fai and .dict files required by SRMA.

index_indexes.sh

Utility for exploring/reporting on existing genome indexes within a directory hierarchy.

Usage:

index_indexes.sh <dir>

Outputs

Searches <dir> and its subdirectories recursively and prints a report of the genome index-specific files (fasta, info etc) it finds.

setup_genome_indexes.sh

Automatically and reproducibly set up genome indexes.

Usage:

setup_genome_indexes.sh

The setup_genome_indexes.sh script doesn’t take any options, it runs through hard-coded lists of organisms for obtaining the sequence and creating bowtie, bfast and Picard/SRMA indexes, Galaxy .loc files and fastq_screen .conf files.

Outputs

The script outputs genome indexes based on the following directory structure for each organism:

pwd/
    organism/
            organism.info
            organism.chr.list
            bowtie/
                  ...bowtie indexes...
            bfast/
                  ...bfast indexes...
            fasta/
                  organism.fasta
                  ...picard/srma indexes...

It also creates:

  • fastq_screen directory: containing specified fastq_screen .conf files
  • Galaxy .loc files: for bowtie, bfast, picard, all_fasta and fastq_screen
  • genome_indexes.html file: HTML file listing the available genome indexes

build_rRNA_bowtie_indexes.sh

Create bowtie indexes and fastq_screen.conf file for rRNA sequences.

Usage:

build_rRNA_bowtie_indexes.sh <rRNAs>.tar.gz

The build_rRNA_bowtie_indexes.sh script unpacks the supplied archive file <rRNAs>.tar.gz and copies the FASTA-formatted sequence files it contains, then generates bowtie indexes from these and produces a fastq_screen.conf file for them.

Inputs

The script expects the input <rRNAs>.tar.gz file to unpack into the following directory structure:

rRNAs/
     fasta/
          ... fasta files ...

Outputs

The script creates the following directory structure in the current directory:

pwd/
    rRNAs/
         bowtie/
               ...bowtie indexes...
         fasta/
               ...rRNA fasta files...

It also creates fastq_screen_rRNAs.conf in the fastq_screen subdirectory of the current directory.

make_seq_alignments.sh

Build sequence alignment (.nib) files from a FASTA file.

Warning

faToNib is no longer distributed with the UCSC tools and .nib format is now deprecated in favour of .2bit.

The procedure is:

  • Split FASTA file into individual chromosomes (uses the split_fasta.py utility)
  • For each resulting chromosome run the UCSC tool faToNib to generate a sequence alignment file
  • Copy these to a specified destination directory

Usage:

make_seq_alignments.sh [--qsub=...] FASTA SEQ_DIR

Generates sequence alignment (.nib) files for each chromosome in FASTA, and copies them into the (pre-existing) directory SEQ_DIR.

Options:

--qsub[=...]

Run operations via Grid Engine (otherwise run directly). Optionally also supply extra arguments using --qsub="..." e.g. name of a specific queue.

Inputs

FASTA file with all chromosome sequences.

Outputs

A set of sequence alignment (.nib) files in the specified output directory.