SOLiD data handling utilities¶

Utilities for transferring data from the SOLiD instrument to the cluster:

rsync_solid_to_cluster.sh: perform transfer of primary data
log_seq_data.sh: maintain logging file of transferred runs and analysis data
analyse_solid_run.py: report on the primary data directories from SOLiD runs
build_analysis_dir.py: construct analysis directories for experiments

rsync_solid_to_cluster.sh¶

Interactive script to semi-automate the transfer of data from the SOLiD instrument to a destination machine.

Usage:

rsync_solid_to_cluster.sh <local_dir> <user>@<remote_host>:<remote_dir> [<email_address>]

<local_dir> is the directory to be copied; <user> is an account on the destination machine <remote_host> and <remote_dir> is the parent directory where a copy of <local_dir> will be made.

The script operates interactively and prompts the user at each step to confirm each action:

Check that the information is correct
Do rsync --dry-run and inspect the output
Perform the actual rsync operation (including removing group write permission from the remote copy)
Check that the local and remote file sizes match

If there is more than one run with the same base name then the script will detect the second run and offer to copy both in a single script run.

The output from each rsync is also captured in a timestamped log file. If an email address is supplied on the command line then copies of the logs will be mailed to this address.

log_seq_data.sh¶

Script to add entries for transferred SOLiD run or analysis directories to a logging file.

Usage:

log_seq_data.sh [-u|-d] <logging_file> <solid_run_dir> [<description>]

A new entry for the directory <solid_run_dir> will be added to <logging_file>, consisting of the path to the directory, a UNIX timestamp, and the optional description.

The path can be relative or absolute; relative paths are automatically converted to full paths.

If the logging file doesn’t exist then it will be created. A new entry won’t be created for any SOLiD run directory that is already in the logging file.

Options:

-u¶: Updates an existing entry

-d¶: Deletes an existing entry

Examples:

Log a primary data directory:

log_seq_data.sh /mnt/data/SEQ_DATA.log /mnt/data/solid0127_20110914_FRAG_BC "Primary data"

Log an analysis directory (no description):

log_seq_data.sh /mnt/data/SEQ_DATA.log /mnt/data/solid0127_20110914_FRAG_BC_analysis

Update an entry to add a description:

log_seq_data.sh /mnt/data/SEQ_DATA.log -u /mnt/data/solid0127_20110914_FRAG_BC_analysis \
    "Analysis directory"

Delete an entry:

log_seq_data.sh /mnt/data/SEQ_DATA.log -d /mnt/data/solid0127_20110914_FRAG_BC_analysis

analyse_solid_run.py¶

Utility for performing various checks and operations on SOLiD run directories. If a single solid_run_dir is specified then analyse_solid_run.py automatically finds and operates on all associated directories from the same instrument and with the same timestamp.

Usage:

analyse_solid_run.py OPTIONS solid_run_dir [ solid_run_dir ... ]

Options:

--only¶: only operate on the specified solid_run_dir, don’t locate associated run directories

--report¶: print a report of the SOLiD run

--report-paths¶: in report mode, also print full paths to primary data files

--xls¶: write report to Excel spreadsheet

--verify¶: do verification checks on SOLiD run directories

--layout¶: generate script for laying out analysis directories

--rsync¶: generate script for rsyncing data

--copy=COPY_PATTERN¶: copy primary data files to pwd from specific library where names match COPY_PATTERN, which should be of the form '<sample>/<library>'

--gzip=GZIP_PATTERN¶: make gzipped copies of primary data files in pwd from specific libraries where names match GZIP_PATTERN, which should be of the form '<sample>/<library>'

--md5=MD5_PATTERN¶: calculate md5sums for primary data files from specific libraries where names match MD5_PATTERN, which should be of the form '<sample>/<library>'

--md5sum¶: calculate md5sums for all primary data files (equivalent to --md5=*/*)

--no-warnings¶: suppress warning messages

build_analysis_dir.py¶

Automatically construct analysis directories for experiments which contain links to the primary data files.

Usage:

build_analysis_dir.py [OPTIONS] EXPERIMENT [EXPERIMENT ...] <solid_run_dir>

General Options:

--dry-run¶: report the operations that would be performed

--debug¶: turn on debugging output

--top-dir=<dir>¶: create analysis directories as subdirs of <dir>; otherwise create them in cwd.

--naming-scheme=<scheme>¶: specify naming scheme for links to primary data (one of minimal - library names only, partial - includes instrument name, datestamp and library name (default) or full - same as source data file

--link=<type>¶: type of links to create to primary data files, either absolute (default) or relative

--run-pipeline=<script>¶: after creating analysis directories, run the specified <script> on SOLiD data file pairs in each

Options For Defining Experiments:

An “experiment” is defined by a group of options, which must be supplied in this order for each experiment specified on the command line:

--name=<name> [--type=<expt_type>] --source=<sample>/<library>
                                  [--source=... ]

<name> is an identifier (typically the user’s initials) used for the analysis directory e.g. PB

<expt_type> is e.g. reseq, ChIP-seq, RNAseq, miRNA…

<sample>/<library> specify the names for primary data files e.g. PB_JB_pool/PB*

Example:

--name=PB --type=ChIP-seq --source=PB_JB_pool/PB*

Both <sample> and <library> can include a trailing wildcard character (i.e. *) to match multiple names. */* will match all primary data files. Multiple --sources can be declared for each experiment.

For each experiment defined on the command line, a subdirectory called <name>_<expt_type> (e.g. PB_ChIP-seq - if no <expt_type> was supplied then just the name is used) will be made containing links to each of the primary data files.