Microarray data

Probeset annotation

The annotate_probesets.py utility can be used to annotate a probeset list based on probe set names.

It requires a tab-delimited file as input, where the first column contains the probeset names (any other columns are ignored), and writes each name to a new tab-delimited file alongside its description.

For example, the following input:

...
1769726_at
1769727_s_at
...

generates:

...
1769726_at  Rank 1: _at : anti-sense target (most probe sets on the array)
1769727_s_at        Warning: _s_at : designates probe sets that share common probes among multiple transcripts from different genes
...
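
The suffix-based lookup suggested by this example can be sketched as below. This is only an illustration and not the utility's actual code: the names SUFFIX_DESCRIPTIONS, describe_probeset and annotate are hypothetical, and the table holds only the two descriptions shown in the example output (the real utility may recognise other suffixes).

```python
# Hypothetical suffix table holding only the two descriptions shown in the
# example output; longer suffixes come first so that "_s_at" is not
# mistaken for a plain "_at".
SUFFIX_DESCRIPTIONS = [
    ("_s_at", "Warning: _s_at : designates probe sets that share common "
              "probes among multiple transcripts from different genes"),
    ("_at", "Rank 1: _at : anti-sense target (most probe sets on the array)"),
]


def describe_probeset(name):
    """Return the description for the first matching suffix ('' if none)."""
    for suffix, description in SUFFIX_DESCRIPTIONS:
        if name.endswith(suffix):
            return description
    return ""


def annotate(lines):
    """Yield 'name<TAB>description' for each tab-delimited input line.

    Only the first column is read; any other columns are ignored.
    """
    for line in lines:
        name = line.rstrip("\n").split("\t")[0]
        yield name + "\t" + describe_probeset(name)
```

Checking the longer suffix first matters here because every name ending in _s_at also ends in _at.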

Average data for ‘best’ exons

The best_exons.py utility picks the ‘top’ three exons for each gene symbol from a tab-delimited (TSV) input file containing the exon data, and outputs a single line for that gene symbol with values averaged over the top three.

‘Top’ or ‘best’ exons are determined by ranking on either the log2FoldChange (the default) or pValue (this is set using the --rank-by option):

  • For log2FoldChange, the ‘best’ exon is the one with the largest absolute log2FoldChange; if its value is positive or zero then the three exons with the largest fold change values are taken, otherwise the bottom three (the three smallest values) are taken.

  • For pValue, the ‘best’ exon is the one with the smallest value.
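
The two ranking rules can be sketched as follows. This is a minimal illustration under stated assumptions rather than the utility's implementation: the function name pick_best_exons, the dictionary keys and the rank_by strings are all hypothetical, and each exon is assumed to be a dict holding its fold change and p-value as floats.

```python
def pick_best_exons(exons, rank_by="log2FoldChange", n=3):
    """Return the n 'best' exons for one gene symbol.

    exons: list of dicts with float values under the (assumed) keys
    "log2FoldChange" and "pValue".
    """
    if rank_by == "pValue":
        # Smallest p-values are best.
        return sorted(exons, key=lambda e: e["pValue"])[:n]
    if rank_by != "log2FoldChange":
        raise ValueError("unknown ranking criterion: %s" % rank_by)
    # The exon with the largest absolute fold change sets the direction.
    best = max(exons, key=lambda e: abs(e["log2FoldChange"]))
    largest_first = best["log2FoldChange"] >= 0
    # Positive or zero: take the largest values; negative: take the smallest.
    ranked = sorted(exons, key=lambda e: e["log2FoldChange"],
                    reverse=largest_first)
    return ranked[:n]
```

If a gene symbol has fewer than n exons, the slice simply returns all of them.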

The utility outputs a TSV file with one line per gene symbol, giving the average of each data value over the three best exons according to the specified criterion. The averages are just the mean of the values.

Input file format

Tab-separated values (TSV) file, with the first line optionally being a header line.

By default the program assumes:

  • Column 0: probeset name (change using --probeset-col)

  • Column 1: gene symbol (change using --gene-symbol-col)

  • Column 12: log2 fold change (change using --log2-fold-change-col)

  • Column 13: p-value (change using --p-value-col)

Column numbering starts from zero.
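
As a rough sketch of how these zero-based column defaults might be used to read the input, the function below groups rows by gene symbol. The function name and its keyword arguments are hypothetical (they only mirror the option names above), and the explicit has_header switch is an assumption: the text does not say how the utility itself detects a header line.

```python
import csv
from collections import defaultdict


def group_exons_by_gene_symbol(fp, probeset_col=0, gene_symbol_col=1,
                               log2_fold_change_col=12, p_value_col=13,
                               has_header=False):
    """Read TSV rows from the file-like object fp, grouped by gene symbol.

    Returns a dict mapping each gene symbol to a list of
    (probeset name, log2 fold change, p-value) tuples.
    """
    reader = csv.reader(fp, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header line (assumed to be flagged by the caller)
    groups = defaultdict(list)
    for row in reader:
        if not row:
            continue  # ignore blank lines
        groups[row[gene_symbol_col]].append(
            (row[probeset_col],
             float(row[log2_fold_change_col]),
             float(row[p_value_col])))
    return dict(groups)
```

Because numbering starts from zero, the default of 12 refers to the thirteenth column of the file.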

Output file format

TSV file with one gene symbol per line plus averaged data for the three ‘best’ exons (according to the specified criterion), and an extra column which has a * to indicate which gene symbols had 4 or fewer exons associated with them in the input file.

Note that the averages are just the mean of all the values.

Cross-reference data for two species

The xrorthologs.py utility will cross-reference data from two species, given a lookup file that maps probe set IDs from one species onto those of the other.

The lookup file is a tab-delimited file with one probe set for species 1 per line in the first column, and a comma-separated list of the equivalent probe sets for species 2 in the fourth column (columns two and three are ignored).

For example:

...
121_at      7849    18510   1418208_at,1446561_at
1255_g_at   2978    14913   1421061_a
1316_at     7067    21833   1426997_at,1443952_at,1454675_at
1320_at     11099   24000   1419054_a_at,1419055_a_at,1453298_at
1405_i_at   6352    20304   1418126_at
...
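
A lookup file in this layout could be read along the following lines. This is an illustrative sketch only: read_lookup is a hypothetical name, and skipping lines with fewer than four columns is an assumption about how malformed lines might be handled.

```python
def read_lookup(lines):
    """Map each species 1 probe set ID to a list of species 2 probe set IDs.

    lines: iterable of tab-delimited lines; column one holds the species 1
    ID and column four the comma-separated species 2 IDs (columns two and
    three are ignored).
    """
    lookup = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # assumption: silently skip lines without a fourth column
        lookup[fields[0]] = [p for p in fields[3].split(",") if p]
    return lookup
```

For instance, the first example line above would map 121_at to the two IDs 1418208_at and 1446561_at.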

Data for the two species are supplied via tab-delimited files SPECIES1 and SPECIES2, where the first column in each is a probe set ID (this is the only requirement).

The output consists of two files:

  • SPECIES1_appended.txt: a copy of SPECIES1 with the cross-referenced data from SPECIES2 appended to each line, and

  • SPECIES2_appended.txt: a copy of SPECIES2 with the SPECIES1 data appended.

Where a probe set ID has multiple matching orthologs, the data for each match are appended to a single line in the output.
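
The appending step for SPECIES1 might look roughly like the sketch below. It is not the utility's own code: append_orthologs is a hypothetical name, lookup is assumed to be a dict mapping each species 1 probe set ID to a list of species 2 IDs, and appending the whole matching SPECIES2 line (including its probe set ID) is an assumption about the output layout.

```python
def append_orthologs(species1_lines, species2_lines, lookup):
    """Return SPECIES1 lines with matching SPECIES2 data appended.

    lookup: dict mapping a species 1 probe set ID to a list of species 2
    probe set IDs. Lines are tab-delimited with the probe set ID first.
    """
    # Index the species 2 data by probe set ID (the first column).
    species2 = {}
    for line in species2_lines:
        fields = line.rstrip("\n").split("\t")
        species2[fields[0]] = fields
    output = []
    for line in species1_lines:
        fields = line.rstrip("\n").split("\t")
        appended = list(fields)
        for match in lookup.get(fields[0], []):
            if match in species2:
                # Every match is appended to the same output line.
                appended.extend(species2[match])
        output.append("\t".join(appended))
    return output
```

Producing the SPECIES2 output would need the same procedure run with the lookup inverted, so that each species 2 ID maps back to its species 1 IDs.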