RNA-seq specific utilities
Scripts and tools for RNA-seq specific tasks.
- bowtie_mapping_stats.py: summarise statistics from bowtie output in a spreadsheet
- GFFedit: replace gene names in a GFF file with the gene ID
- qc_bash_script.sh: generalised QC pipeline for RNA-seq
- Split: filter reads from bowtie mapping against two genomes
bowtie_mapping_stats.py
Extract mapping statistics for each sample referenced in the input bowtie log files and summarise the data in an XLS spreadsheet. Handles output from both Bowtie and Bowtie2.
Usage:
bowtie_mapping_stats.py [options] bowtie_log_file [ bowtie_log_file ... ]
By default the output file is called mapping_summary.xls; use the -o option to specify the spreadsheet name explicitly.
Options:
- -o xls_file: specify the name of the output XLS file (otherwise defaults to mapping_summary.xls).
- -t: write the data to a tab-delimited file in addition to the XLS file. The tab file has the same name as the XLS file, with the extension replaced by .txt.
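The naming rule for the tab-delimited file can be illustrated with a short Python sketch. The function name tab_file_name is hypothetical and is not taken from the script itself; only the default name and the extension swap come from the description above.

```python
import os


def tab_file_name(xls_file="mapping_summary.xls"):
    """Return the tab-delimited file name that would accompany xls_file.

    Hypothetical helper: it only mirrors the documented rule (same name,
    extension replaced by .txt) and is not the script's own code.
    """
    base, _extension = os.path.splitext(xls_file)
    return base + ".txt"


print(tab_file_name())                   # mapping_summary.txt
print(tab_file_name("run1_mapping.xls"))  # run1_mapping.txt
```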
Input bowtie log file
The program expects the input log file to consist of multiple blocks of text of the form:
...
<SAMPLE_NAME>
Time loading reference: 00:00:01
Time loading forward index: 00:00:00
Time loading mirror index: 00:00:02
Seeded quality full-index search: 00:10:20
# reads processed: 39808407
# reads with at least one reported alignment: 2737588 (6.88%)
# reads that failed to align: 33721722 (84.71%)
# reads with alignments suppressed due to -m: 3349097 (8.41%)
Reported 2737588 alignments to 1 output stream(s)
Time searching: 00:10:27
Overall time: 00:10:27
...
The sample name is extracted along with four read counts: reads processed, reads with at least one reported alignment, reads that failed to align, and reads with alignments suppressed. These values are tabulated in the output spreadsheet.
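The extraction described above might look roughly like the following Python sketch. It is not the code used by bowtie_mapping_stats.py: it covers only the Bowtie block layout shown above (not Bowtie2 output), and it assumes that the sample name is the line immediately before "Time loading reference". The function name and dictionary keys are chosen for illustration only.

```python
import re

# Line prefixes copied from the example block; the short keys are
# illustrative names, not identifiers from the real script.
COUNT_PATTERNS = {
    "processed": re.compile(r"^# reads processed: (\d+)"),
    "aligned": re.compile(
        r"^# reads with at least one reported alignment: (\d+)"),
    "failed": re.compile(r"^# reads that failed to align: (\d+)"),
    "suppressed": re.compile(
        r"^# reads with alignments suppressed due to -m: (\d+)"),
}


def parse_bowtie_log(lines):
    """Return a list with one dict of counts per sample block."""
    samples = []
    previous = None  # last non-empty line seen
    current = None   # dict for the block being read
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Time loading reference:"):
            # Assumption: the sample name directly precedes this line.
            current = {"sample": previous}
            samples.append(current)
        elif current is not None:
            for key, pattern in COUNT_PATTERNS.items():
                match = pattern.match(line)
                if match:
                    current[key] = int(match.group(1))
        previous = line
    return samples


EXAMPLE_LOG = """\
sample_1
Time loading reference: 00:00:01
Time loading forward index: 00:00:00
Time loading mirror index: 00:00:02
Seeded quality full-index search: 00:10:20
# reads processed: 39808407
# reads with at least one reported alignment: 2737588 (6.88%)
# reads that failed to align: 33721722 (84.71%)
# reads with alignments suppressed due to -m: 3349097 (8.41%)
Reported 2737588 alignments to 1 output stream(s)
Time searching: 00:10:27
Overall time: 00:10:27
"""

for record in parse_bowtie_log(EXAMPLE_LOG.splitlines()):
    print(record)
```

Each dictionary corresponds to one row of the summary; writing the rows to an XLS file is left out of the sketch.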
GFFedit
Takes a GFF file, replaces each gene name that is not already of the DDB_G0… format with the corresponding gene ID, and writes out the edited GFF file.
Compilation:
javac GFFedit.java
jar cf GFFedit.jar GFFedit.class
Usage:
java -cp /path/to/GFFedit.jar GFFedit <myfile>.gff
Arguments:
- myfile.gff: input GFF file
Output:
- GFFedit_<myfile>.gff: edited version of input file
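As a rough illustration of the kind of edit GFFedit performs, here is a Python sketch (the tool itself is written in Java). It rests on assumptions that the description above does not confirm: that the ninth GFF column holds semicolon-separated ID= and Name= attributes, and that a name counts as already being a gene ID when it starts with DDB_G0. The function names and the sample line are hypothetical.

```python
import os


def edit_gff_line(line, prefix="DDB_G0"):
    """Replace Name=<gene name> with Name=<gene ID> on one GFF line.

    Assumption: column 9 looks like 'ID=...;Name=...'. Lines that do not
    fit this layout are returned unchanged.
    """
    if line.startswith("#"):
        return line  # comment or directive line
    ending = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return line
    attributes = fields[8].split(";")
    gene_id = None
    for attribute in attributes:
        if attribute.startswith("ID="):
            gene_id = attribute[len("ID="):]
    if gene_id is None:
        return line
    edited = []
    for attribute in attributes:
        is_name = attribute.startswith("Name=")
        if is_name and not attribute[len("Name="):].startswith(prefix):
            attribute = "Name=" + gene_id
        edited.append(attribute)
    fields[8] = ";".join(edited)
    return "\t".join(fields) + ending


def output_name(input_path):
    """Build the GFFedit_<myfile>.gff name for a given input file."""
    return "GFFedit_" + os.path.basename(input_path)


# Made-up line for illustration; 'DDB_G0xxxxxx' is a placeholder ID.
sample = "chr1\tsketch\tgene\t1\t100\t.\t+\t.\tID=DDB_G0xxxxxx;Name=abcA\n"
print(edit_gff_line(sample), end="")
print(output_name("/path/to/myfile.gff"))
```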
qc_bash_script.sh
Generalised QC pipeline for RNA-seq: runs bowtie, fastq_screen and qc_boxplotter on SOLiD data.
Usage:
qc_bash_script.sh <analysis_dir> <sample_name> <csfasta> <qual> <bowtie_genome_index>
Arguments:
- analysis_dir: directory to write the outputs to
- sample_name: name of the sample
- csfasta: input csfasta file
- qual: input qual file
- bowtie_genome_index: full path to bowtie genome index
Outputs:
Creates a qc subdirectory in analysis_dir which contains the fastq_screen and boxplotter output files.
Split
Takes in two SAM files from bowtie where the same sample has been mapped to two genomes (“genomeS” and “genomeB”), and filters the reads to isolate those which map only to genomeS, only to genomeB, and to both genomes (see “Output”, below).
Compilation:
javac Split.java
jar cf Split.jar Split.class
Usage:
java -cp /path/to/Split.jar Split <map_to_genomeS>.sam <map_to_genomeB>.sam
Arguments:
- map_to_genomeS.sam: SAM file from Bowtie with reads mapped to genomeS
- map_to_genomeB.sam: SAM file from Bowtie with reads mapped to genomeB
Outputs 4 SAM files:
- Reads that map to genomeS only
- Reads that map to genomeB only
- Reads that map to genomeS and genomeB keeping the genomeS genome coordinates
- Reads that map to genomeS and genomeB keeping the genomeB genome coordinates
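The four-way partition can be sketched in Python as a set operation on read names (Split itself is a Java program, and this is not its implementation). The sketch assumes that a read counts as mapped when bit 0x4 of its SAM FLAG field is unset, and that reads are matched between the two files by their query name; how Split actually decides this is not stated above. All function names and dictionary keys are hypothetical.

```python
def mapped_read_names(sam_lines):
    """Collect query names of mapped reads from SAM text lines."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        if int(fields[1]) & 0x4:  # FLAG bit 0x4: segment unmapped
            continue
        names.add(fields[0])
    return names


def select_records(sam_lines, names):
    """Keep alignment records whose query name is in names."""
    return [line for line in sam_lines
            if not line.startswith("@") and line.split("\t", 1)[0] in names]


def split_reads(sam_s, sam_b):
    """Partition records into the four groups listed above."""
    in_s = mapped_read_names(sam_s)
    in_b = mapped_read_names(sam_b)
    both = in_s & in_b
    return {
        "genomeS_only": select_records(sam_s, in_s - in_b),
        "genomeB_only": select_records(sam_b, in_b - in_s),
        "both_genomeS_coords": select_records(sam_s, both),
        "both_genomeB_coords": select_records(sam_b, both),
    }


# Tiny made-up inputs: r1 maps to S only, r2 to both, r3 to B only.
SAM_S = [
    "r1\t0\tchrS\t10\t255\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r2\t0\tchrS\t20\t255\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII",
]
SAM_B = [
    "r1\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII",
    "r2\t16\tchrB\t70\t255\t5M\t*\t0\t0\tTACGT\tIIIII",
    "r3\t0\tchrB\t90\t255\t5M\t*\t0\t0\tACGTA\tIIIII",
]

for group, records in split_reads(SAM_S, SAM_B).items():
    print(group, [record.split("\t", 1)[0] for record in records])
```

Holding the read names in memory keeps the sketch short; for large SAM files a streaming approach over name-sorted inputs would likely be preferable.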