Microarray data
===============

*******************
Probeset annotation
*******************

The :ref:`reference_annotate_probesets` utility can be used to annotate
a probeset list based on probeset names.

It requires a tab-delimited file as input, where the first column
contains the probeset names (any other columns are ignored), and
writes each name to a new tab-delimited file alongside a description
of that name.

For example, the following input:

::

    ...
    1769726_at
    1769727_s_at
    ...

generates:

::

    ...
    1769726_at      Rank 1: _at : anti-sense target (most probe sets on the array)
    1769727_s_at    Warning: _s_at : designates probe sets that share common probes among multiple transcripts from different genes
    ...

*****************************
Average data for 'best' exons
*****************************

The :ref:`reference_best_exons` utility picks the 'top' three exons
for each gene symbol from a tab-delimited (TSV) input file containing
the exon data, and outputs a single line for that gene symbol with
values averaged over the top three.

'Top' or 'best' exons are determined by ranking on either the
``log2FoldChange`` (the default) or the ``pValue`` (this is set using
the ``--rank-by`` option):

* For ``log2FoldChange``, the 'best' exon is the one with the largest
  absolute ``log2FoldChange``; if that value is positive or zero then
  the three exons with the largest fold change values are taken,
  otherwise the three with the lowest values are taken.
* For ``pValue``, the 'best' exon is the one with the smallest value.

The utility outputs a TSV file with one line per gene symbol plus the
average of each data value for the three best exons according to the
specified criterion. The averages are simply the mean of the values.

Input file format
-----------------

Tab separated values (TSV) file, with the first line optionally being a
header line.
By default the program assumes:

* Column 0: probeset name (change using ``--probeset-col``)
* Column 1: gene symbol (change using ``--gene-symbol-col``)
* Column 12: log2 fold change (change using ``--log2-fold-change-col``)
* Column 13: p-value (change using ``--p-value-col``)

Column numbering starts from zero.

Output file format
------------------

TSV file with one gene symbol per line plus averaged data for the
three 'best' exons (according to the specified criterion), and an
extra column which contains a ``*`` to indicate which gene symbols had
4 or fewer exons associated with them in the input file.

Note that the averages are simply the mean of the values.

************************************
Cross-reference data for two species
************************************

The :ref:`reference_xrorthologs` utility cross-references data from
two species, given a lookup file that maps probe set IDs from one
species onto those of the other.

The lookup file is a tab-delimited file with one probe set for
species 1 per line in the first column, and a comma-separated list of
the equivalent probe sets for species 2 in the fourth column (columns
two and three are ignored). For example:

::

    ...
    121_at       7849     18510    1418208_at,1446561_at
    1255_g_at    2978     14913    1421061_a
    1316_at      7067     21833    1426997_at,1443952_at,1454675_at
    1320_at      11099    24000    1419054_a_at,1419055_a_at,1453298_at
    1405_i_at    6352     20304    1418126_at
    ...

Data for the two species are supplied via tab-delimited files
``SPECIES1`` and ``SPECIES2``, where the first column in each is a
probe set ID (this is the only requirement).

The output consists of two files:

* ``SPECIES1_appended.txt``: a copy of ``SPECIES1`` with the
  cross-referenced data from ``SPECIES2`` appended to each line, and
* ``SPECIES2_appended.txt``: a copy of ``SPECIES2`` with the
  ``SPECIES1`` data appended.

Where there are multiple matching orthologs for a probe set ID, the
data for each match is appended onto a single line in the output.
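
The cross-referencing described above can be illustrated with a small
Python sketch. This is not the utility's actual implementation: the
function names (``read_lookup``, ``read_data``, ``invert_lookup``,
``append_orthologs``) and the data values are hypothetical, and the
handling of probe sets with no match (the line is left unchanged) is an
assumption made for the example.

```python
import csv
import io


def read_lookup(stream):
    """Map each species 1 probe set ID to its list of species 2 IDs.

    Uses column one (ID) and column four (comma-separated IDs);
    columns two and three are ignored.
    """
    lookup = {}
    for row in csv.reader(stream, delimiter="\t"):
        if len(row) < 4:
            continue  # skip blank or short lines
        lookup[row[0]] = [p for p in row[3].split(",") if p]
    return lookup


def read_data(stream):
    """Index the rows of a data file by the probe set ID in column one."""
    return {row[0]: row for row in csv.reader(stream, delimiter="\t") if row}


def invert_lookup(lookup):
    """Build the species 2 -> species 1 mapping for the reverse direction."""
    inverse = {}
    for id1, ids2 in lookup.items():
        for id2 in ids2:
            inverse.setdefault(id2, []).append(id1)
    return inverse


def append_orthologs(data, other, lookup):
    """Return the rows of ``data`` with every matching ``other`` row appended.

    Assumption: rows without any match are returned unchanged.
    """
    out = []
    for probe_id, row in data.items():
        combined = list(row)
        for other_id in lookup.get(probe_id, []):
            if other_id in other:
                combined.extend(other[other_id])
        out.append(combined)
    return out


# Probe set IDs come from the lookup example; the data values are made up.
lookup = read_lookup(io.StringIO(
    "121_at\t7849\t18510\t1418208_at,1446561_at\n"
    "1405_i_at\t6352\t20304\t1418126_at\n"
))
species1 = read_data(io.StringIO("121_at\t0.5\n1405_i_at\t1.2\n"))
species2 = read_data(io.StringIO("1418208_at\t2.0\n1446561_at\t3.1\n"))

for line in append_orthologs(species1, species2, lookup):
    print("\t".join(line))
```

In this sample the line for ``121_at`` gains the rows for both of its
orthologs on a single line, while ``1405_i_at`` is printed unchanged
because ``1418126_at`` has no row in the sample species 2 data. Passing
``invert_lookup(lookup)`` with the two data sets swapped gives the
equivalent of the second output file.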