1. Reference data preparation¶
1.1. Genome sequence data and indexes¶
1.1.1. Suggested directory structure¶
For a given genome build, the recommended basic structure is (e.g. for Rattus
norvegicus rn4
):
rn4/
|
+- rn4.info (metadata file)
|
+- fasta/ (sequences)
|
+- bowtie/ (indexes for bowtie - both color- and letterspace)
|
+- bfast/ (indexes for bfast)
|
+- liftOver/ (chain files for liftOver from this assembly to others)
|
+- seq/ (per-chromosome nib files for sequence alignments)
|
...
i.e. a top-level directory containing a .info
file, plus directories for
FASTA sequences and derived or additional genome indexes for various aligners
and other programs.
Note that the indexes for SRMA are placed in the “fasta” directory, as SRMA needs .fa, .fai and .dict files all to be placed in the same directory.
1.1.2. Creating a directory for a new genome¶
To add a new genome index:
- Create a new top-level directory for the organism and genome build
- Create a
fasta
subdirectory to hold the sequence data, and download and prepare the FASTA file(s) within this directory (see below for hints) - Create a
.info
file and record the details of the genome for future reference (see below for more detail on .info files) - Create and populate
bowtie
,bfast
etc subdirectories with the appropriate indexes (see below for advice on generating indexes)
1.1.3. Download and prepare FASTA genome files¶
Note
The fetch_fasta.sh script is intended to reproducibly create FASTA files for a set of genomes.
To see which genomes are available run the program without any arguments; to obtain the FASTA file do e.g.:
fetch_fasta.sh mm9
Where the reference genome is a collection of fasta files for each chromosome, it’s necessary to prepare a single file for the bfast and bowtie index generation by concatenating them together, e.g.:
cat chr* > hg18_random_chrM.fa
The individual chromosome fasta files can then be removed or archived, e.g.:
tar -cvf hg18_random_chrM.tar chr*
gzip hg18_random_chrM.tar
1.2. .info
metadata files¶
Standard practice when add a new genome index is to also create a .info
file (for example hg18_random_chrM.info
).
These are hand-generated text files consisting of header fields followed by free text.
A typical header looks like (e.g. from mm9_random_chrM.info
for Mus musculus
mm9
):
# Organism: Mus musculus
# Genome Build: MM9/NCBI37 July 2007
# Manipulations: Base chr. (1 to 19, X, Y), chrN_random, chrM and chrUn_random - unmasked
# Source: wget http://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/chromFa.tar.gz
The free text area can contain any additional information that the person preparing the indexes thinks is important (for example, scripts or commands used to generate the indexes for individual programs).
1.3. Generate indexes for mapping software¶
This package includes a number of scripts for fetching and generating genome
indexes for Bfast
, Bowtie
and SRMA
.
bowtie_build_indexes.sh can be used to generate color- and nuleotide-space indexes from a FASTA file.
To use, go to the bowtie subdirectory for the genome and do e.g.:
qsub -b y -V -cwd bowtie_build_indexes.sh ../fasta/genome.fa
This will create both color and nucleotide space indexes; to only generate colorspace use the
--cs
option of the script, to only get nucleotide space use--nt
.bowtie2_build_indexes.sh generates indexes for
bowtie2
(letter space only).bfast_build_indexes.sh prepares indexes for
bfast
.srma_build_indexes.sh prepare indexes for
SRMA
.