NGS utilities
=============

**************************
Reporting ChIP-seq outputs
**************************

The :ref:`reference_make_macs2_xls` utility can be used to convert an
output tab-delimited ``.XLS`` file from ``macs2`` into an MS Excel
spreadsheet (either ``.xlsx`` or ``.xls`` format).

Additionally a ``.bed`` format file can be output, provided that
``macs2`` was not run with the ``--broad`` option.

To process output from older versions of ``macs`` (i.e. 1.4.2 and
earlier) the legacy :ref:`reference_make_macs_xls` utility can be used;
however for this version only MS XLS format is supported, and there is
no option to output a ``.bed`` file.

*************************
Reporting RNA-seq outputs
*************************

The :ref:`reference_bowtie_mapping_stats` utility can be used to
summarise the mapping statistics produced by ``bowtie2`` or ``bowtie``,
and output to an MS Excel spreadsheet file.

The utility reads the ``bowtie2`` log file and expects this to consist
of multiple blocks of text of the form:

::

    ...
    Time loading reference: 00:00:01
    Time loading forward index: 00:00:00
    Time loading mirror index: 00:00:02
    Seeded quality full-index search: 00:10:20
    # reads processed: 39808407
    # reads with at least one reported alignment: 2737588 (6.88%)
    # reads that failed to align: 33721722 (84.71%)
    # reads with alignments suppressed due to -m: 3349097 (8.41%)
    Reported 2737588 alignments to 1 output stream(s)
    Time searching: 00:10:27
    Overall time: 00:10:27
    ...

The sample name is extracted along with the numbers of reads processed,
reads with at least one reported alignment, reads that failed to align,
and reads with alignments suppressed; these values are tabulated in the
output spreadsheet.

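To illustrate the kind of extraction described above, the following is a minimal sketch of pulling the four read counts out of one log block. It is not the utility's actual implementation; the function name ``parse_bowtie_block`` and the dictionary keys are hypothetical, and only the line formats shown in the example block are assumed.

```python
import re

# Patterns for the four read-count lines in a log block of the form
# shown above (the key names are hypothetical labels for this sketch)
PATTERNS = {
    "processed": re.compile(r"^# reads processed: (\d+)"),
    "aligned": re.compile(
        r"^# reads with at least one reported alignment: (\d+)"),
    "failed": re.compile(r"^# reads that failed to align: (\d+)"),
    "suppressed": re.compile(
        r"^# reads with alignments suppressed due to -m: (\d+)"),
}


def parse_bowtie_block(text):
    """Return a dict of the read counts found in one block of log text."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        for key, pattern in PATTERNS.items():
            match = pattern.match(line)
            if match:
                # Keep the integer count, discard the percentage
                counts[key] = int(match.group(1))
    return counts


example = """\
Time loading reference: 00:00:01
# reads processed: 39808407
# reads with at least one reported alignment: 2737588 (6.88%)
# reads that failed to align: 33721722 (84.71%)
# reads with alignments suppressed due to -m: 3349097 (8.41%)
Reported 2737588 alignments to 1 output stream(s)
"""

counts = parse_bowtie_block(example)
print(counts)
```

In the example block the aligned, failed and suppressed counts add up to the number of reads processed, which gives a simple consistency check on the parsed values.
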

*******************************************
Determining strandedness of sequencing data
*******************************************

The :ref:`reference_fastq_strand` utility can be used to determine the
strandedness (forward, reverse, or unstranded) of sequencing data in
Fastq format, using either a single Fastq file, or an R1/R2 pair of
Fastqs.

.. note::

   The utility is a wrapper for the ``STAR`` mapper and requires that
   ``STAR`` has been installed separately and is available on the
   ``PATH``.

The simplest example checks the strandedness for a single genome:

::

    fastq_strand.py R1.fastq.gz R2.fastq.gz -g STARindex/mm10

In this example, ``STARindex/mm10`` is a directory which contains the
``STAR`` indexes for the ``mm10`` genome build.

The output is a file called ``R1_fastq_strand.txt`` which summarises the
forward and reverse strandedness percentages::

    #fastq_strand version: 0.0.1    #Aligner: STAR    #Reads in subset: 1000
    #Genome           1st forward    2nd reverse
    STARindex/mm10    13.13          93.21

To include the count sums for unstranded, 1st read strand aligned and
2nd read strand aligned in the output file, specify the ``--counts``
option::

    #fastq_strand version: 0.0.1    #Aligner: STAR    #Reads in subset: 1000
    #Genome           1st forward    2nd reverse    Unstranded    1st read strand aligned    2nd read strand aligned
    STARindex/mm10    13.13          93.21          391087        51339                      364535

Strandedness can be checked for multiple genomes by specifying
additional ``STAR`` indexes on the command line with multiple ``-g``
flags:

::

    fastq_strand.py R1.fastq.gz R2.fastq.gz -g STARindex/hg38 -g STARindex/mm10

Alternatively a panel of indexes can be supplied via a configuration
file of the form:

::

    #Name    STAR index
    hg38     /mnt/data/STARindex/hg38
    mm10     /mnt/data/STARindex/mm10

(NB blank lines and lines starting with a ``#`` are ignored).
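
As a rough illustration of the configuration file format described above, the sketch below reads name and index pairs while skipping blank lines and ``#`` comment lines. The function name ``read_star_index_conf`` is hypothetical, and the assumption that the two fields are separated by whitespace (for example a tab) is not confirmed by the text; this is not the utility's own parser.

```python
def read_star_index_conf(lines):
    """Return a list of (name, star_index_path) tuples.

    Assumes each data line holds a name and a path separated by
    whitespace; blank lines and lines starting with '#' are skipped.
    """
    indexes = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first run of whitespace only
        name, path = line.split(None, 1)
        indexes.append((name, path.strip()))
    return indexes


conf_text = """\
#Name\tSTAR index
hg38\t/mnt/data/STARindex/hg38

mm10\t/mnt/data/STARindex/mm10
"""

indexes = read_star_index_conf(conf_text.splitlines())
print(indexes)
```
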
Use the ``-c``/``--conf`` option to get the strandedness percentages
using a configuration file, for example:

::

    fastq_strand.py -c model_organisms.conf R1.fastq.gz R2.fastq.gz

By default a random subset of 1000 read pairs is used from the input
Fastq pair; this can be changed using the ``--subset`` option. If the
subset is set to zero then all reads are used.

The number of threads used to run ``STAR`` can be set via the ``-n``
option; to keep all the outputs from ``STAR`` specify the
``--keep-star-output`` option.

The strandedness statistics can also be generated for a single Fastq
file, by only specifying one file on the command line. For example:

::

    fastq_strand.py -c model_organisms.conf R1.fastq.gz

***************************************
Manage contaminant sequences for FastQC
***************************************

The :ref:`reference_manage_seqs` utility can help create and update
files with lists of so-called "contaminant" sequences, for input into
the FastQC program (specifically, via FastQC's ``--contaminants``
option).

For example, to create a new contaminants file using sequences from a
FASTA file:

::

    manage_seqs.py -o custom_contaminants.txt sequences.fa

To append sequences to an existing contaminants file:

::

    manage_seqs.py -a custom_contaminants.txt additional_seqs.fa

The inputs can be a mixture of FastQC "contaminants" format and/or Fasta
format files.

The utility also checks for redundancy (i.e. sequences with multiple
associated names) and contradictions (i.e. names with multiple
associated sequences).

*******************************
Convert SAM file to SOAP format
*******************************

The :ref:`reference_sam2soap` utility converts a SAM file to SOAP format.