NGS utilities

Reporting ChIP-seq outputs

The make_macs2_xls.py utility can be used to convert an output tab-delimited .XLS file from macs2 into an MS Excel spreadsheet (either .xlsx or .xls format).

Additionally a .bed format file can be output, provided that macs2 was not run with the --broad option.

To process output from older versions of macs (i.e. 1.4.2 and earlier) the legacy make_macs_xls.py utility can be used; however, this legacy utility only supports MS XLS format output, and has no option to output a .bed file.

Reporting RNA-seq outputs

The bowtie_mapping_stats.py utility can be used to summarise the mapping statistics produced by bowtie2 or bowtie, and output to an MS Excel spreadsheet file.

The utility reads the bowtie2 log file and expects this to consist of multiple blocks of text of the form:

...
<SAMPLE_NAME>
Time loading reference: 00:00:01
Time loading forward index: 00:00:00
Time loading mirror index: 00:00:02
Seeded quality full-index search: 00:10:20
# reads processed: 39808407
# reads with at least one reported alignment: 2737588 (6.88%)
# reads that failed to align: 33721722 (84.71%)
# reads with alignments suppressed due to -m: 3349097 (8.41%)
Reported 2737588 alignments to 1 output stream(s)
Time searching: 00:10:27
Overall time: 00:10:27
...

The sample name is extracted from each block along with four read counts: reads processed, reads with at least one reported alignment, reads that failed to align, and reads with alignments suppressed. These values are tabulated in the output spreadsheet.
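The extraction described above can be sketched in Python as follows. This is an illustrative sketch only, not the code used by bowtie_mapping_stats.py: the function name parse_bowtie_log, the dictionary keys and the sample name SAMPLE_1 are hypothetical, and it assumes that the sample name is the line directly before "Time loading reference".

```python
import re

# One pattern per count named in the text; each captures the integer only.
COUNT_PATTERNS = {
    "reads_processed": re.compile(r"^# reads processed: (\d+)"),
    "reads_aligned": re.compile(
        r"^# reads with at least one reported alignment: (\d+)"),
    "reads_failed": re.compile(r"^# reads that failed to align: (\d+)"),
    "reads_suppressed": re.compile(
        r"^# reads with alignments suppressed due to -m: (\d+)"),
}


def parse_bowtie_log(text):
    """Return a list of per-sample dicts from log text in the form shown above."""
    samples = []
    previous_line = None
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("Time loading reference:"):
            # Assumption: the sample name is the line just before this one
            current = {"sample": previous_line}
            samples.append(current)
        elif current is not None:
            for key, pattern in COUNT_PATTERNS.items():
                match = pattern.match(line)
                if match:
                    current[key] = int(match.group(1))
        previous_line = line
    return samples


# Hypothetical sample name, with the counts from the block shown above
EXAMPLE_LOG = """\
SAMPLE_1
Time loading reference: 00:00:01
Time loading forward index: 00:00:00
Time loading mirror index: 00:00:02
Seeded quality full-index search: 00:10:20
# reads processed: 39808407
# reads with at least one reported alignment: 2737588 (6.88%)
# reads that failed to align: 33721722 (84.71%)
# reads with alignments suppressed due to -m: 3349097 (8.41%)
Reported 2737588 alignments to 1 output stream(s)
Time searching: 00:10:27
Overall time: 00:10:27
"""

stats = parse_bowtie_log(EXAMPLE_LOG)
print(stats)
```

Each dictionary in the returned list corresponds to one row of the tabulated output; writing the rows to a spreadsheet is left out of the sketch.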

Determining strandedness of sequencing data

The fastq_strand.py utility can be used to determine the strandedness (forward, reverse, or unstranded) of sequencing data in Fastq format, using either a single Fastq file or an R1/R2 pair of Fastqs.

Note

The utility is a wrapper for the STAR mapper and requires that STAR has been installed separately and is available on the PATH.

The simplest example checks the strandedness for a single genome:

fastq_strand.py R1.fastq.gz R2.fastq.gz -g STARindex/mm10

In this example, STARindex/mm10 is a directory which contains the STAR indexes for the mm10 genome build.

The output is a file called R1_fastq_strand.txt which summarises the forward and reverse strandedness percentages:

#fastq_strand version: 0.0.1        #Aligner: STAR  #Reads in subset: 1000
#Genome     1st forward     2nd reverse
STARindex/mm10      13.13   93.21

To include the count sums for unstranded, 1st read strand aligned and 2nd read strand aligned in the output file, specify the --counts option:

#fastq_strand version: 0.0.1        #Aligner: STAR  #Reads in subset: 1000
#Genome     1st forward     2nd reverse     Unstranded      1st read strand aligned 2nd read strand aligned
STARindex/mm10      13.13   93.21   391087  51339   364535
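For downstream scripting, the output file can be read back with a few lines of Python. The sketch below is not part of fastq_strand.py: the function name parse_strand_output is hypothetical, and it assumes that columns are separated by whitespace and that genome names contain no spaces. It reads only the two percentage columns, so it accepts output produced with or without --counts.

```python
def parse_strand_output(text):
    """Map each genome name to its (forward, reverse) percentages."""
    results = {}
    for line in text.splitlines():
        # Header and metadata lines start with '#'
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        # Assumption: column 1 is the genome, columns 2 and 3 the percentages
        results[fields[0]] = (float(fields[1]), float(fields[2]))
    return results


# Contents of the output file shown above (with --counts)
EXAMPLE_OUTPUT = """\
#fastq_strand version: 0.0.1\t#Aligner: STAR\t#Reads in subset: 1000
#Genome\t1st forward\t2nd reverse\tUnstranded\t1st read strand aligned\t2nd read strand aligned
STARindex/mm10\t13.13\t93.21\t391087\t51339\t364535
"""

percentages = parse_strand_output(EXAMPLE_OUTPUT)
for genome, (forward, reverse) in percentages.items():
    print(f"{genome}: forward={forward} reverse={reverse}")
```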

Strandedness can be checked for multiple genomes by specifying additional STAR indexes on the command line with multiple -g flags:

fastq_strand.py R1.fastq.gz R2.fastq.gz -g STARindex/hg38 -g STARindex/mm10

Alternatively a panel of indexes can be supplied via a configuration file of the form:

#Name       STAR index
hg38        /mnt/data/STARindex/hg38
mm10        /mnt/data/STARindex/mm10

(NB blank lines and lines starting with a # are ignored.) Use the -c/--conf option to get the strandedness percentages using a configuration file. For example:

fastq_strand.py -c model_organisms.conf R1.fastq.gz R2.fastq.gz
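The rules for the configuration file format (one name and one STAR index path per line, with blank lines and # lines ignored) can be illustrated with a short Python sketch. This is not the parser used by fastq_strand.py: the function name parse_conf is hypothetical, and it assumes that the name and the path are separated by whitespace.

```python
def parse_conf(text):
    """Return a dict mapping genome names to STAR index paths."""
    indexes = {}
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and lines starting with '#' are ignored
        if not line or line.startswith("#"):
            continue
        # Assumption: the first whitespace run separates name from path
        name, path = line.split(None, 1)
        indexes[name] = path.strip()
    return indexes


# The example configuration shown above, plus a blank line
EXAMPLE_CONF = """\
#Name\tSTAR index

hg38\t/mnt/data/STARindex/hg38
mm10\t/mnt/data/STARindex/mm10
"""

indexes = parse_conf(EXAMPLE_CONF)
for name, path in indexes.items():
    print(name, path)
```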

By default a random subset of 1000 read pairs is used from the input Fastq pair; this can be changed using the --subset option. If the subset is set to zero then all reads are used.

The number of threads used to run STAR can be set via the -n option; to keep all the outputs from STAR specify the --keep-star-output option.

The strandedness statistics can also be generated for a single Fastq file, by only specifying one file on the command line. For example:

fastq_strand.py -c model_organisms.conf R1.fastq.gz

Manage contaminant sequences for FastQC

The manage_seqs.py utility can help create and update files with lists of so-called “contaminant” sequences, for input into the FastQC program (specifically, via FastQC’s --contaminants option).

For example, to create a new contaminants file using sequences from a FASTA file:

manage_seqs.py -o custom_contaminants.txt sequences.fa

To append sequences to an existing contaminants file:

manage_seqs.py -a custom_contaminants.txt additional_seqs.fa

The inputs can be a mixture of FastQC “contaminants” format and/or Fasta format files. The utility also checks for redundancy (i.e. sequences with multiple associated names) and contradictions (i.e. names with multiple associated sequences).
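The two checks can be illustrated with a minimal Python sketch that works on (name, sequence) pairs which have already been read from the input files. It is not the implementation used by manage_seqs.py: the function name find_redundancy_and_contradictions and all the example names and sequences are hypothetical.

```python
from collections import defaultdict


def find_redundancy_and_contradictions(entries):
    """Check (name, sequence) pairs for the two problems described above.

    Returns a pair of dicts:
    - redundant: sequence -> sorted list of names (more than one name)
    - contradictory: name -> sorted list of sequences (more than one sequence)
    """
    names_by_seq = defaultdict(set)
    seqs_by_name = defaultdict(set)
    for name, seq in entries:
        names_by_seq[seq].add(name)
        seqs_by_name[name].add(seq)
    redundant = {seq: sorted(names)
                 for seq, names in names_by_seq.items() if len(names) > 1}
    contradictory = {name: sorted(seqs)
                     for name, seqs in seqs_by_name.items() if len(seqs) > 1}
    return redundant, contradictory


# Hypothetical entries, made up for illustration only
EXAMPLE_ENTRIES = [
    ("Adapter A", "ACGTACGT"),
    ("Adapter B", "ACGTACGT"),  # same sequence, second name: redundancy
    ("Primer 1", "TTTTGGGG"),
    ("Primer 1", "TTTTCCCC"),   # same name, second sequence: contradiction
    ("Primer 2", "GGGGAAAA"),
]

redundant, contradictory = find_redundancy_and_contradictions(EXAMPLE_ENTRIES)
print("Redundant:", redundant)
print("Contradictory:", contradictory)
```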

Convert SAM file to SOAP format

The sam2soap.py utility converts a SAM file to SOAP format.