SOLiD data handling utilities¶
Utilities for transferring data from the SOLiD instrument to the cluster:
- rsync_solid_to_cluster.sh: perform transfer of primary data
- log_seq_data.sh: maintain logging file of transferred runs and analysis data
- analyse_solid_run.py: report on the primary data directories from SOLiD runs
- build_analysis_dir.py: construct analysis directories for experiments
rsync_solid_to_cluster.sh¶
Interactive script to semi-automate the transfer of data from the SOLiD instrument to a destination machine.
Usage:
rsync_solid_to_cluster.sh <local_dir> <user>@<remote_host>:<remote_dir> [<email_address>]
<local_dir>
is the directory to be copied; <user>
is an account on the
destination machine <remote_host>
and <remote_dir>
is the parent directory
where a copy of <local_dir>
will be made.
The script operates interactively and prompts the user at each step to confirm each action:
- Check that the information is correct
- Do
rsync --dry-run
and inspect the output - Perform the actual rsync operation (including removing group write permission from the remote copy)
- Check that the local and remote file sizes match
If there is more than one run with the same base name then the script will detect the second run and offer to copy both in a single script run.
The output from each rsync
is also captured in a timestamped log file. If an email
address is supplied on the command line then copies of the logs will be mailed to this
address.
log_seq_data.sh¶
Script to add entries for transferred SOLiD run or analysis directories to a logging file.
Usage:
log_seq_data.sh [-u|-d] <logging_file> <solid_run_dir> [<description>]
A new entry for the directory <solid_run_dir>
will be added to
<logging_file>
, consisting of the path to the directory, a UNIX timestamp,
and the optional description.
The path can be relative or absolute; relative paths are automatically converted to full paths.
If the logging file doesn’t exist then it will be created. A new entry won’t be created for any SOLiD run directory that is already in the logging file.
Options:
-
-u
¶
Updates an existing entry
-
-d
¶
Deletes an existing entry
Examples:
Log a primary data directory:
log_seq_data.sh /mnt/data/SEQ_DATA.log /mnt/data/solid0127_20110914_FRAG_BC "Primary data"
Log an analysis directory (no description):
log_seq_data.sh /mnt/data/SEQ_DATA.log /mnt/data/solid0127_20110914_FRAG_BC_analysis
Update an entry to add a description:
log_seq_data.sh /mnt/data/SEQ_DATA.log -u /mnt/data/solid0127_20110914_FRAG_BC_analysis \
"Analysis directory"
Delete an entry:
log_seq_data.sh /mnt/data/SEQ_DATA.log -d /mnt/data/solid0127_20110914_FRAG_BC_analysis
analyse_solid_run.py¶
Utility for performing various checks and operations on SOLiD run directories. If a single solid_run_dir is specified then analyse_solid_run.py automatically finds and operates on all associated directories from the same instrument and with the same timestamp.
Usage:
analyse_solid_run.py OPTIONS solid_run_dir [ solid_run_dir ... ]
Options:
-
--only
¶
only operate on the specified solid_run_dir, don’t locate associated run directories
-
--report
¶
print a report of the SOLiD run
-
--report-paths
¶
in report mode, also print full paths to primary data files
-
--xls
¶
write report to Excel spreadsheet
-
--verify
¶
do verification checks on SOLiD run directories
-
--layout
¶
generate script for laying out analysis directories
-
--rsync
¶
generate script for rsyncing data
-
--copy
=COPY_PATTERN
¶ copy primary data files to pwd from specific library where names match
COPY_PATTERN
, which should be of the form'<sample>/<library>'
-
--gzip
=GZIP_PATTERN
¶ make gzipped copies of primary data files in pwd from specific libraries where names match
GZIP_PATTERN
, which should be of the form'<sample>/<library>'
-
--md5
=MD5_PATTERN
¶ calculate md5sums for primary data files from specific libraries where names match
MD5_PATTERN
, which should be of the form'<sample>/<library>'
-
--md5sum
¶
calculate md5sums for all primary data files (equivalent to
--md5=*/*
)
-
--no-warnings
¶
suppress warning messages
build_analysis_dir.py¶
Automatically construct analysis directories for experiments which contain links to the primary data files.
Usage:
build_analysis_dir.py [OPTIONS] EXPERIMENT [EXPERIMENT ...] <solid_run_dir>
General Options:
-
--dry-run
¶
report the operations that would be performed
-
--debug
¶
turn on debugging output
-
--top-dir
=<dir>
¶ create analysis directories as subdirs of
<dir>
; otherwise create them incwd
.
-
--naming-scheme
=<scheme>
¶ specify naming scheme for links to primary data (one of
minimal
- library names only,partial
- includes instrument name, datestamp and library name (default) orfull
- same as source data file
-
--link
=<type>
¶ type of links to create to primary data files, either
absolute
(default) orrelative
-
--run-pipeline
=<script>
¶ after creating analysis directories, run the specified
<script>
on SOLiD data file pairs in each
Options For Defining Experiments:
An “experiment” is defined by a group of options, which must be supplied in this order for each experiment specified on the command line:
--name=<name> [--type=<expt_type>] --source=<sample>/<library>
[--source=... ]
<name>
is an identifier (typically the user’s initials) used for the
analysis directory e.g. PB
<expt_type>
is e.g. reseq
, ChIP-seq
, RNAseq
, miRNA
…
<sample>/<library>
specify the names for primary data files e.g.
PB_JB_pool/PB*
Example:
--name=PB --type=ChIP-seq --source=PB_JB_pool/PB*
Both <sample>
and <library>
can include a trailing wildcard character
(i.e. *
) to match multiple names. */*
will match all primary data files.
Multiple --sources
can be declared for each experiment.
For each experiment defined on the command line, a subdirectory called
<name>_<expt_type>
(e.g. PB_ChIP-seq
- if no <expt_type>
was supplied then just the name is used) will be made containing links to
each of the primary data files.