Installation and set up

Installing the genomics/bcftbx package

It is recommended to install the package directly from GitHub using pip:

pip install git+https://github.com/fls-bioinformatics-core/genomics.git

Alternatively, if you already have a copy of the source code, run pip install . from within the top-level source directory to install the package.

To use the package without installing it first, you will need to add the top-level source directory to your PYTHONPATH environment variable, and reference the scripts and programs using their full paths.
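As a minimal sketch of the uninstalled case, the helper below builds an environment with a source directory prepended to PYTHONPATH, which could then be passed to subprocess when calling a script by its full path. The function name env_with_pythonpath and the /path/to/genomics location are hypothetical, not part of the package.

```python
import os


def env_with_pythonpath(source_dir, environ=None):
    """Return a copy of the environment with source_dir first on PYTHONPATH.

    Hypothetical helper: source_dir stands for the top-level source
    directory of an uninstalled copy of the package.
    """
    env = dict(os.environ if environ is None else environ)
    existing = env.get("PYTHONPATH")
    if existing:
        # Keep any existing entries, but search source_dir first
        env["PYTHONPATH"] = source_dir + os.pathsep + existing
    else:
        env["PYTHONPATH"] = source_dir
    return env


if __name__ == "__main__":
    # Placeholder path; substitute the real location of the source
    env = env_with_pythonpath("/path/to/genomics")
    print(env["PYTHONPATH"])
```

The returned dictionary is a copy, so the calling process's own environment is left unchanged.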

Dependencies

The package consists predominantly of code written in Python, which has been used extensively with Python 2.6 and 2.7.

In addition there are scripts requiring:

  • bash
  • Perl
  • R

and the following packages are required for subsets of the code:

  • Perl: Statistics::Descriptive and BioPerl
  • Python: xlwt, xlrd and xlutils
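One quick way to see which of the Python packages listed above are available is to try importing each one. This is a sketch only: the function name missing_modules is hypothetical, and it checks importability, not versions.

```python
def missing_modules(names):
    """Return the module names from 'names' that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # The Python packages named in the list above
    missing = missing_modules(["xlwt", "xlrd", "xlutils"])
    if missing:
        print("Missing Python packages: %s" % ", ".join(missing))
    else:
        print("All listed Python packages are importable")
```

Since these packages are only needed by subsets of the code, a missing one is not necessarily a problem for every utility.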

Finally, some of the utilities also use 3rd-party software packages, including:

Core software

Illumina-specific

SOLiD-specific

Set up reference data

bowtie indexes

fastq_screen needs bowtie indexes for each of the reference genomes that you want to screen against.

The fetch_fasta.sh script can be used to acquire FASTA files for genome builds of common reference organisms, for example:

mkdir -p data/genomes/PhiX
cd data/genomes/PhiX/
fetch_fasta.sh PhiX

To generate bowtie indexes, use the bowtie_build_indexes.sh script, for example:

mkdir -p data/genomes/PhiX/bowtie
cd data/genomes/PhiX/bowtie/
bowtie_build_indexes.sh ../fasta/PhiX.fa

(This will create both colorspace and nucleotide space indexes by default.)

(Use bowtie2_build_indexes.sh to build indexes for bowtie2. Note that bowtie2 does not support colorspace.)

(Alternatively, use build_indexes.sh to make all the indexes: bfast, bowtie, bowtie2 and SRMA.)
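The fetch and build steps for a single genome could be scripted roughly as follows. This sketch assumes the directory layout of the PhiX example (a FASTA file named after the genome with a .fa extension under fasta/, and indexes built in bowtie/), which may not hold for every organism, and that the scripts are on the PATH. The function names index_build_plan and run_plan are hypothetical.

```python
import os
import subprocess


def index_build_plan(genome, root="data/genomes"):
    """Return (working_dir, command) pairs mirroring the PhiX example.

    Assumes the FASTA ends up as fasta/<genome>.fa under the genome
    directory, as in the PhiX example; other genomes may differ.
    """
    genome_dir = os.path.join(root, genome)
    bowtie_dir = os.path.join(genome_dir, "bowtie")
    fasta = os.path.join("..", "fasta", genome + ".fa")
    return [
        (genome_dir, ["fetch_fasta.sh", genome]),
        (bowtie_dir, ["bowtie_build_indexes.sh", fasta]),
    ]


def run_plan(plan):
    """Create each working directory if needed and run its command there."""
    for cwd, argv in plan:
        if not os.path.isdir(cwd):
            os.makedirs(cwd)
        # Raises CalledProcessError if a script exits with an error
        subprocess.check_call(argv, cwd=cwd)


if __name__ == "__main__":
    # Print the plan without running anything
    for cwd, argv in index_build_plan("PhiX"):
        print("%s: %s" % (cwd, " ".join(argv)))
```

Separating the plan from its execution makes it possible to inspect the commands before anything is downloaded or built.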

For rRNAs, get the rRNAs.tar.gz file and run the build_rRNA_bowtie_indexes.sh script, for example:

cd data/genomes/
wget .../rRNAs.tar.gz
build_rRNA_bowtie_indexes.sh rRNAs.tar.gz

which will extract the FASTA sequences to a subdirectory rRNAs/fasta/ and create nucleotide- and colorspace bowtie indexes in rRNAs/bowtie.

fastq_screen configuration files

The QC scripts currently assume that the following three fastq_screen configuration files exist:

  • fastq_screen_model_organisms.conf
  • fastq_screen_other_organisms.conf
  • fastq_screen_rRNA.conf

(The actual form of the names is:

fastq_screen_<NAME><EXT>.conf

where <NAME> is one of model_organisms, other_organisms or rRNA, and <EXT> is an extension which is used to distinguish between nucleotide- and colorspace indexes.)
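The naming scheme can be sketched as a small helper that lists the expected file names and reports any that are absent from a directory. The function names are hypothetical, and the extension values are not specified here, so the sketch takes the extension as a parameter that defaults to an empty string.

```python
import os

# The three NAME values listed above
CONF_NAMES = ("model_organisms", "other_organisms", "rRNA")


def conf_file_names(ext=""):
    """Return the expected conf file names for the given extension."""
    return ["fastq_screen_%s%s.conf" % (name, ext) for name in CONF_NAMES]


def missing_conf_files(conf_dir, ext=""):
    """Return the expected conf files that are not present in conf_dir."""
    return [name for name in conf_file_names(ext)
            if not os.path.isfile(os.path.join(conf_dir, name))]


if __name__ == "__main__":
    for name in conf_file_names():
        print(name)
```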

Each configuration file defines “databases” with lines of the form:

DATABASE     Fly (dm3)       /home/data/genomes/dm3/bowtie/dm3_het_chrM_chrU

for nucleotide space indexes, and

DATABASE     Fly (dm3)       /home/data/genomes/dm3/bowtie/dm3_het_chrM_chrU_c

for colorspace. (In each case the path is the base name for the index files.)
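A configuration file with these lines could be generated with something like the sketch below. It assumes, based only on the dm3 example above, that the colorspace index base name is the nucleotide base name with _c appended, and it separates the fields with tabs. The function names database_line and conf_text are hypothetical.

```python
def database_line(name, index_basename, colorspace=False):
    """Format one DATABASE line for a fastq_screen conf file.

    Assumption taken from the dm3 example: the colorspace index base
    name is the nucleotide base name with "_c" appended.
    """
    if colorspace:
        index_basename = index_basename + "_c"
    # Tab-separated fields (assumed separator for this sketch)
    return "DATABASE\t%s\t%s" % (name, index_basename)


def conf_text(databases, colorspace=False):
    """Build conf file content from (name, index_basename) pairs."""
    lines = [database_line(name, base, colorspace)
             for name, base in databases]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    databases = [
        ("Fly (dm3)", "/home/data/genomes/dm3/bowtie/dm3_het_chrM_chrU"),
    ]
    print(conf_text(databases))
    print(conf_text(databases, colorspace=True))
```

Generating the nucleotide and colorspace files from one list of databases keeps the two versions consistent with each other.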

Create qc.setup

When the package is installed, a template qc.setup.sample file is created in the config subdirectory. It needs to be copied to qc.setup and edited to set the locations of the external software and data.
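The copy step can be sketched as below, with a guard so that an existing, already edited qc.setup is never overwritten. The function name create_qc_setup is hypothetical, and the location of the config subdirectory depends on where the package was installed, so it is passed in as an argument.

```python
import os
import shutil


def create_qc_setup(config_dir):
    """Copy qc.setup.sample to qc.setup in config_dir.

    Returns True if qc.setup was created, or False if it already
    existed (in which case it is left untouched).
    """
    sample = os.path.join(config_dir, "qc.setup.sample")
    target = os.path.join(config_dir, "qc.setup")
    if os.path.exists(target):
        return False
    shutil.copyfile(sample, target)
    return True


if __name__ == "__main__":
    import sys
    # Usage: pass the path to the config subdirectory as the argument
    if create_qc_setup(sys.argv[1]):
        print("Created qc.setup; edit it to set software and data locations")
    else:
        print("qc.setup already exists; leaving it unchanged")
```

The new qc.setup still has to be edited by hand afterwards; the sketch only covers the copy.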