2 Commits

Author SHA1 Message Date
e793a6678f update all docs 2018-08-03 22:08:31 -07:00
d2a6ceca81 add snakefile for utility tasks (and add .snakemake to .gitignore) 2018-07-24 01:02:51 -07:00
8 changed files with 252 additions and 10 deletions

1
.gitignore vendored

@@ -1 +1,2 @@
.snakemake/
site/

118
Snakefile Normal file

@@ -0,0 +1,118 @@
ghurl = 'git@github.com:charlesreid1/how-do-i-snakemake.git'
cmrurl = 'ssh://git@git.charlesreid1.com:222/charlesreid1/how-do-i-snakemake.git'
cmrmkm = 'ssh://git@git.charlesreid1.com:222/charlesreid1/mkdocs-material.git'
index = 'index.html'

rule default:
    message:
        """
        Welcome to the Snakefile for the how-do-i-snakemake repo.
        This Snakefile contains utility methods for building and
        deploying the site.
        ----------------------------------------
        Snakefile
        Add the -p or --printshellcmd flag to print the shell commands being run.
        Add the -n or --dryrun flag to do a dry run.
        To clone the deployed site to site/:
            snakemake -p clone_site
        To initialize submodules (if you did not clone this repo recursively):
            snakemake -p submodule_init
        To build the documentation in docs/ into the site/ directory:
            snakemake build
        To build and serve the documentation site locally (viewable at localhost:8000):
            snakemake serve
        To safely clean the documentation site before next deployment:
            snakemake clean_docs
        To build and deploy the updated documentation site to the gh-pages branch:
            snakemake deploy_docs
        """

rule clone_site:
    """
    Clone the deployed site to site/
    and add the proper remotes.
    """
    output:
        'site'
    shell:
        '''
        git clone -b gh-pages {ghurl} site/
        cd site \
            && git remote add cmr {cmrurl} \
            && git remote add gh {ghurl}
        '''

rule submodule_init:
    """
    Initialize the submodules (mkdocs-material)
    so we can build the documentation.
    """
    shell:
        '''
        git submodule update --init
        '''

rule build:
    """
    Build the documentation with mkdocs.
    """
    shell:
        '''
        mkdocs build
        '''

rule serve:
    """
    Serve the documentation with mkdocs.
    Visit localhost:8000 in your browser.
    """
    shell:
        '''
        mkdocs serve
        '''

rule clean_docs:
    """
    Safely clean the deployed documentation site.
    """
    shell:
        '''
        rm -fr site/content/*
        '''

rule deploy_docs:
    """
    Deploy the documentation.
    """
    shell:
        '''
        mkdocs build
        cd site/
        git add -A .
        git commit --allow-empty . -m 'updating site'
        git push cmr gh-pages
        git push gh gh-pages
        '''


@@ -25,7 +25,8 @@ Two recommendations to help overcome this:
smaller parts, and convert some of those parts into separate Snakemake rules.
-## Example: Filtering Sequencer Reads
+<a name="example"></a>
+## Example Overview: Filtering Sequencer Reads
Let's illustrate the process of converting a workflow from shell scripts to a
Snakefile, and doing so in stages, using a hypothetical workflow that involves
@@ -43,6 +44,7 @@ downloading data files containing reads from a sequencer from an external URL:
| `SRR609_2_reads.fq.gz` | <http://example.com/SRR609_2_reads.fq.gz> |
<a name="stage1"></a>
## Stage 1: Shell Script + Snakefile
### The Shell Script
@@ -333,6 +335,7 @@ You can force Snakemake to re-download the files two ways:
<a name="stage2"></a>
## Stage 2: Replace Script with Snakefile (Hard-Coded)
The next step in converting our workflow to Snakemake is to
@@ -388,8 +391,95 @@ rule download_reads:
The Python variables `read_file` and `read_url` are available to the shell command
through `{read_file}` and `{read_url}`.
The Snakefile above shows how the `run:` directive and `shell()` function call
can be combined. This is very convenient, since otherwise we would end up with
complicated subprocess command construction and funky string manipulations.
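
For comparison, here is a minimal sketch of what the same download loop might look like in plain Python without `shell()`. The `reads` mapping and the `curl -L ... -o ...` form are assumptions for illustration (the mapping mirrors the `read_file`/`read_url` pairs above), and `curl_command` is a hypothetical helper, not part of Snakemake. The commands are printed rather than executed because the URLs are fake.

```python
import shlex

# Hypothetical subset of the read file -> URL mapping (these links are fake)
reads = {
    "SRR606_1_reads.fq.gz": "http://example.com/SRR606_1_reads.fq.gz",
    "SRR606_2_reads.fq.gz": "http://example.com/SRR606_2_reads.fq.gz",
}


def curl_command(read_file, read_url):
    """Build the argument list for one download by hand."""
    return ["curl", "-L", read_url, "-o", read_file]


for read_file, read_url in reads.items():
    cmd = curl_command(read_file, read_url)
    # Print the command instead of running it; with real URLs,
    # subprocess.run(cmd, check=True) would execute it.
    print(shlex.join(cmd))
```

With `shell()`, the argument list, quoting, and error checking are handled by Snakemake, so the rule body only needs the command template.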
**Problem:** There's still one big problem, and that's how the task of
downloading each file is being divided up. We have a relatively short list of
files to download here, but suppose we had a list of 200 files. Now, we have
a single rule that is responsible for downloading 200 files. If any of those
files are missing, it will correctly assume the rule needs to be re-run, but
will end up running the entire rule, and re-downloading every file.
We really need to split our task up so that each rule corresponds to a single
task of downloading a single individual file. If we were hard-coding everything,
then we might end up hard-coding a bunch of rules, and that would stink.
Instead, let's use wildcards.
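
To make the cost concrete, the following sketch (plain Python, not Snakemake) simulates a working directory where seven of the eight read files are already present. The helper `missing_files` is hypothetical and only illustrates the bookkeeping: a single all-in-one rule would redo every download, while one job per file would redo only the missing one.

```python
import os
import tempfile


def missing_files(expected, directory="."):
    """Return the expected file names that do not exist in directory."""
    return [
        name for name in expected
        if not os.path.exists(os.path.join(directory, name))
    ]


# The eight read files used in this example
expected = [
    f"SRR60{n}_{d}_reads.fq.gz" for n in (6, 7, 8, 9) for d in (1, 2)
]

with tempfile.TemporaryDirectory() as workdir:
    # Pretend every file except the last one was already downloaded
    for name in expected[:-1]:
        open(os.path.join(workdir, name), "w").close()

    todo = missing_files(expected, workdir)

print("single rule for all files would re-download:", len(expected))
print("one job per file would re-download:", len(todo), todo)
```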
```
-----------------------------------8<----------------------------------------------
```
<a name="stage3"></a>
## Stage 3: Replace Script with Snakefile (Wildcards)
The next step in converting our workflow to Snakemake is to
replace the hard-coded download rule with a rule that uses wildcards,
so that Snakemake runs one download job per file. Here are the links:
| Read Files | URL (note: these links are fake) |
|------------|----------------------------------|
| `SRR606_1_reads.fq.gz` | <http://example.com/SRR606_1_reads.fq.gz> |
| `SRR606_2_reads.fq.gz` | <http://example.com/SRR606_2_reads.fq.gz> |
| `SRR607_1_reads.fq.gz` | <http://example.com/SRR607_1_reads.fq.gz> |
| `SRR607_2_reads.fq.gz` | <http://example.com/SRR607_2_reads.fq.gz> |
| `SRR608_1_reads.fq.gz` | <http://example.com/SRR608_1_reads.fq.gz> |
| `SRR608_2_reads.fq.gz` | <http://example.com/SRR608_2_reads.fq.gz> |
| `SRR609_1_reads.fq.gz` | <http://example.com/SRR609_1_reads.fq.gz> |
| `SRR609_2_reads.fq.gz` | <http://example.com/SRR609_2_reads.fq.gz> |
There are multiple ways to modify the Snakefile to download the files directly.
The approach shown below uses Snakemake's HTTP remote provider in the `input:`
directive, with wildcards shared between the `input:` and `output:` directives:
**`Snakefile`:**
```python
touchfile = '.downloaded_reads'

# from https://snakemake.readthedocs.io/en/stable/snakefiles/remote_files.html#read-only-web-http-s
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
HTTP = HTTPRemoteProvider()

# map of read files to read urls
reads = {
    "SRR606_1_reads.fq.gz" : "http://example.com/SRR606_1_reads.fq.gz",
    "SRR606_2_reads.fq.gz" : "http://example.com/SRR606_2_reads.fq.gz",
    "SRR607_1_reads.fq.gz" : "http://example.com/SRR607_1_reads.fq.gz",
    "SRR607_2_reads.fq.gz" : "http://example.com/SRR607_2_reads.fq.gz",
    "SRR608_1_reads.fq.gz" : "http://example.com/SRR608_1_reads.fq.gz",
    "SRR608_2_reads.fq.gz" : "http://example.com/SRR608_2_reads.fq.gz",
    "SRR609_1_reads.fq.gz" : "http://example.com/SRR609_1_reads.fq.gz",
    "SRR609_2_reads.fq.gz" : "http://example.com/SRR609_2_reads.fq.gz"
}

rule download_read:
    """
    Download a single individual read
    """
    input:
        # The input file is the remote HTTP url
        HTTP.remote("example.com/{prefix}_{direction}_reads.fq.gz", keep_local=True)
    output:
        # The output file is now the read file itself
        "{prefix}_{direction}_reads.fq.gz"
    shell:
        '''
        curl -L {input} -o {output}
        '''
```
Let's walk through this step by step.
We start by importing the `HTTPRemoteProvider`. This object checks
whether a remote file is available; if it is not, the rule fails to
execute.
The `input:` directive contains a call to `HTTP.remote()` that passes the
URL of the file, containing the wildcards that are matched in the `output:`
directive. The `keep_local=True` flag ensures the downloaded files are
not deleted.
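
The wildcard matching can be sketched in plain Python. This is an illustration of the idea only, not Snakemake's actual implementation: the helpers `pattern_to_regex` and `resolve_input` are hypothetical, and each wildcard is assumed to match one or more arbitrary characters (`.+`). A requested output file name is matched against the output pattern, and the captured wildcard values are substituted into the input pattern.

```python
import re

# Patterns copied from the download_read rule above
OUTPUT_PATTERN = "{prefix}_{direction}_reads.fq.gz"
INPUT_PATTERN = "example.com/{prefix}_{direction}_reads.fq.gz"


def pattern_to_regex(pattern):
    """Turn each {name} placeholder into a named regex group."""
    # re.split with a capturing group alternates literal text and names
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)      # literal text
        else:
            regex += f"(?P<{part}>.+)"    # wildcard
    return re.compile(regex)


def resolve_input(target):
    """Return the input path for a requested output file, or None."""
    match = pattern_to_regex(OUTPUT_PATTERN).fullmatch(target)
    if match is None:
        return None
    return INPUT_PATTERN.format(**match.groupdict())


print(resolve_input("SRR606_1_reads.fq.gz"))
print(resolve_input("notes.txt"))
```

Requesting `SRR606_1_reads.fq.gz` binds `prefix` to `SRR606` and `direction` to `1`, so the same rule serves every file in the table, one job per file.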


@@ -49,6 +49,14 @@ throughout this documentation and what they mean.
[Converting Workflows to Snakemake](converting.md) - strategies for
converting shell script workflows into Snakemake workflows.
[**Example Overview: Filtering Sequence Reads**](converting.md#example)
[**Stage 1: Shell Script + Snakefile**](converting.md#stage1)
[**Stage 2: Replace Script with Snakefile (Hard-Coded)**](converting.md#stage2)
[**Stage 3: Replace Script with Snakefile (Wildcards)**](converting.md#stage3)
## Useful Resources
@@ -56,3 +64,6 @@ Following is a list of useful Snakemake resources:
* <https://snakemake.readthedocs.io/>


@@ -25,8 +25,8 @@ Once you have Python installed, you should have `pip` available as well.
Snakemake can be installed using pip:
```
-$ virtualenv -p python3 .venv
-$ source .venv/bin/activate
+$ virtualenv vp
+$ source vp/bin/activate
$ pip install snakemake
```
@@ -35,5 +35,9 @@ $ pip install snakemake
If you are using conda, you can install Snakemake using conda by first
adding some conda channels, then installing Snakemake using `conda install`:
```
conda install -c bioconda -c conda-forge snakemake
```
This will install snakemake from the bioconda channel.


@@ -1,11 +1,32 @@
# Terminology
* **container** - Containers are like very lightweight
virtual machines that can provide an isolated,
consistent, reproducible environment in which to
run software.
* **directive** - this refers to subheadings of rules,
such as `input:` or `output:` or `shell:`
* **docker** - Docker is a program that allows running
containers. Docker is very popular in enterprise
software engineering, but presents challenges for
scientific computing because it requires an admin
account and presents security risks, making it
hard to run in an HPC environment.
* **rule** - Snakemake rules define a given task,
the input files it depends on, the output files
it produces, the shell commands it should run,
etc.
-* **directive** - this refers to subheadings of rules,
-such as `input:` or `output:` or `shell:`
* **singularity** - Singularity is a program that allows
running containers, like Docker, but without requiring
an admin account and without many of the security
concerns that Docker creates.
* **Snakefile** - Snakemake is a Python program that runs
the rules defined in a file; `Snakefile` is the
default name of the file in which Snakemake expects to
find those rule definitions.


@@ -21,7 +21,7 @@ theme:
font:
text: 'Bitter'
code: 'PT Mono'
-nav:
+pages:
- "Index" : "index.md"
- "Installing Snakemake" : "installing.md"
- "Snakemake Terminology" : "terminology.md"
@@ -34,6 +34,3 @@ markdown_extensions:
guess_lang: false
- toc:
permalink: true
-strict: true