General NGS utilities¶

General NGS scripts that are used for both ChIP-seq and RNA-seq.

explain_sam_flag.sh: decodes bit-wise flag from SAM file
extract_reads.py: write out subsets of reads from input data files
fastq_edit.py: edit FASTQ files and data
fastq_sniffer.py: “sniff” FASTQ file to determine quality encoding
SamStats: counts uniquely map reads per chromosome/contig
splitBarcodes.pl: separate multiple barcodes in SOLiD data
remove_mispairs.pl: remove “singleton” reads from paired end fastq
remove_mispairs.pl: remove “singleton” reads from paired end fastq
reorder_fasta.py: reorder chromosomes in FASTA file in karyotypic order
sam2soap.py: convert from SAM file to SOAP format
separate_paired_fastq.pl: separate F3 and F5 reads from fastq
split_fasta.py: extract individual chromosome sequences from fasta file
split_fastq : split fastq file by lane
trim_fastq.pl: trim down sequences in fastq file from 5’ end
uncompress_fastqgz.sh: create ungzipped version of a compressed FASTQ file

explain_sam_flag.sh¶

Convert a decimal bitwise SAM flag value to binary representation and interpret each bit.

extract_reads.py¶

Usage:

extract_reads.py OPTIONS infile [infile ...]

Extract subsets of reads from each of the supplied files according to specified criteria (e.g. random, matching a pattern etc). Input files can be any mixture of FASTQ (.fastq, .fq), CSFASTA (.csfasta) and QUAL (.qual).

Output file names will be the input file names with .subset appended.

Options:

-m PATTERN, --match=PATTERN¶: Extract records that match Python regular expression PATTERN

..cmdoption:: -n N

Extract N random records from the input file(s) (default 500). If multiple input files are specified, the same subsets will be extracted for each.

fastq_edit.py¶

Usage:

fastq_edit.py [options] <fastq_file>

Perform various operations on FASTQ file.

Options:

--stats¶: Generate basic stats for input FASTQ

--instrument-name=INSTRUMENT_NAME¶: Update the instrument name in the sequence identifier part of each read record and write updated FASTQ file to stdout

fastq_sniffer.py¶

Usage:

fastq_sniffer.py <fastq_file>

“Sniff” FASTQ file to try and determine likely format and quality encoding.

Attempts to identify FASTQ format and quality encoding, and suggests likely datatype for import into Galaxy.

Use the --subset option to only use a subset of reads from the file for the type determination (using a smaller set speeds up the process at the risk of not being able to accuracy determine the encoding convention).

See http://en.wikipedia.org/wiki/FASTQ_format for information on the different quality encoding standards used in different FASTQ formats.

Options:

--subset=N_SUBSET¶: try to determine encoding from a subset of consisting of the first N_SUBSET reads. (Quicker than using all reads but may not be accurate if subset is not representative of the file as a whole.)

SamStats¶

Counts how many reads are uniquely mapped onto each chromosome or contig in a SAM file. To run:

java -classpath <dir_with_SamStats.class> SamStats <sam_file>

or (if using a Jar file):

java -cp /path/to/SamStats.jar SamStats <sam_file>

(To compile into a jar, do jar cf SamStats.jar SamStats.class)

Output is a text file SamStats_maponly_<sam_file>.stats

splitBarcodes.pl¶

Split csfasta and qual files containing multiple barcodes into separate sets.

Usage:

./splitBarcodes.pl <barcode.csfasta> <read.csfasta> <read.qual>

Expects BC.csfasta, F3.csfasta and F3.qual files containing multiple barcodes. Currently set up for ‘BC Kit Module 1-16’.

Note that this utility also requires BioPerl.

remove_mispairs.pl¶

Look through fastq file from solid2fastq that has interleaved paired end reads and remove singletons (missing partner)

Usage:

remove_mispairs.pl <interleaved FASTQ>

Outputs:

<FASTQ>.paired: copy of input fastq with all singleton reads removed
<FASTQ>.single.header: list of headers for all reads that were removed as singletons
<FASTQ>.pair.header: list of headers for all reads there were kept as part of a pair

remove_mispairs.py¶

Python implementation of remove_mispairs.pl which can also remove singletons for paired end fastq data file where the reads are not interleaved.

reorder_fasta.py¶

Reorder the chromosome records in a FASTA file into karyotypic order.

Usage:

reorder_fasta.py INFILE.fa

Reorders the chromosome records from a FASTA file into ‘kayrotypic’ order, e.g.:

chr1
chr2
...
chr10
chr11

The output FASTA file will be called INFILE.karyotypic.fa.

sam2soap.py¶

Convert a SAM file into SOAP format.

Usage:

sam2soap.py OPTIONS [ SAMFILE ]

Convert SAM file to SOAP format - reads from stdin (or SAMFILE, if specified), and writes output to stdout unless -o option is specified.

Options:

-o SOAPFILE¶: Output SOAP file name

separate_paired_fastq.pl¶

Takes a fastq file with paired F3 and F5 reads and separate into a file for each.

Usage:

separate_paired_fastq.pl <interleaved FASTQ>

split_fasta.py¶

Extract individual chromosome sequences from a fasta file.

Usage:

split_fasta.py fasta_file

Split input FASTA file with multiple sequences into multiple files each containing sequences for a single chromosome.

For each chromosome CHROM found in the input Fasta file (delimited by a line >CHROM), outputs a file called CHROM.fa in the current directory containing just the sequence for that chromosome.

split_fastq¶

Splits a Fastq file by lane.

Usage:

split_fastq.py [-h] [-l LANES] FASTQ

Split input Fastq file into multiple output Fastqs where each output only contains reads from a single lane.

Options:

-l LANES, --lanes LANES¶: lanes to extract: can be a single integer, a comma- separated list (e.g. 1,3), a range (e.g. 5-7) or a combination (e.g. 1,3,5-7). Default is to extract all lanes in the Fastq

trim_fastq.pl¶

Takes a fastq file and keeps the first (5’) bases of the sequences specified by the user.

Usage:

trim_fastq.pl <single end FASTQ> <desired length>

uncompress_fastqgz.sh¶

Create uncompressed copies of fastq.gz file (if input is fastq.gz).

Usage:

uncompress_fastqgz.sh <fastq>

<fastq> can be either fastq or fastq.gz file.

The original file will not be removed or altered.