Compare commits
2 Commits
Author | SHA1 | Date | |
---|---|---|---|
e793a6678f | |||
d2a6ceca81 |
1
.gitignore
vendored
@@ -1 +1,2 @@
.snakemake/
site/
118
Snakefile
Normal file
@@ -0,0 +1,118 @@
ghurl = 'git@github.com:charlesreid1/how-do-i-snakemake.git'
cmrurl = 'ssh://git@git.charlesreid1.com:222/charlesreid1/how-do-i-snakemake.git'
cmrmkm = 'ssh://git@git.charlesreid1.com:222/charlesreid1/mkdocs-material.git'
index = 'index.html'

rule default:
    message:
        """

        Welcome to the Snakefile for the how-do-i-snakemake repo.
        This Snakefile contains utility methods for building and
        deploying the site.

        ----------------------------------------
        Snakefile

        Add the -p or --printshellcmd flag to print the shell commands being run.
        Add the -n or --dryrun flag to do a dry run.


        To clone the deployed site to site/:
            snakemake -p clone_site


        To initialize submodules (if you did not clone this repo recursively):
            snakemake -p submodule_init


        To build the documentation in docs/ into the site/ directory:
            snakemake build


        To build and serve the documentation site locally (viewable at localhost:8000):
            snakemake serve


        To safely clean the documentation site before next deployment:
            snakemake clean_docs


        To build and deploy the updated documentation site to the gh-pages branch:
            snakemake deploy_docs

        """


rule clone_site:
    """
    Clone the deployed site to site/
    and add the proper remotes.
    """
    output:
        'site'
    shell:
        '''
        git clone -b gh-pages {ghurl} site/
        cd site \
            && git remote add cmr {cmrurl} \
            && git remote add gh {ghurl}
        '''


rule submodule_init:
    """
    Initialize the submodules (mkdocs-material)
    so we can build the documentation.
    """
    shell:
        '''
        git submodule update --init
        '''


rule build:
    """
    Build the documentation with mkdocs.
    """
    shell:
        '''
        mkdocs build
        '''


rule serve:
    """
    Serve the documentation with mkdocs.
    Visit localhost:8000 in your browser.
    """
    shell:
        '''
        mkdocs serve
        '''


rule clean_docs:
    """
    Safely clean the deployed documentation site.
    """
    shell:
        '''
        rm -fr site/content/*
        '''


rule deploy_docs:
    """
    Deploy the documentation.
    """
    shell:
        '''
        mkdocs build
        cd site/
        git add -A .
        git commit --allow-empty . -m 'updating site'
        git push cmr gh-pages
        git push gh gh-pages
        '''
@@ -25,7 +25,8 @@ Two recommendations to help overcome this:
smaller parts, and convert some of those parts into separate Snakemake rules.


## Example: Filtering Sequencer Reads
<a name="example"></a>
## Example Overview: Filtering Sequencer Reads

Let's illustrate the process of converting a workflow from shell scripts to a
Snakefile, and doing so in stages, using a hypothetical workflow that involves
@@ -43,6 +44,7 @@ downloading data files containing reads from a sequencer from an external URL:
| `SRR609_2_reads.fq.gz` | <http://example.com/SRR609_2_reads.fq.gz> |


<a name="stage1"></a>
## Stage 1: Shell Script + Snakefile

### The Shell Script
@@ -333,6 +335,7 @@ You can force Snakemake to re-download the files two ways:



<a name="stage2"></a>
## Stage 2: Replace Script with Snakefile (Hard-Coded)

The next step in converting our workflow to Snakemake is to
@@ -388,8 +391,95 @@ rule download_reads:
The Python variables `read_file` and `read_url` are available to the shell command
through `{read_file}` and `{read_url}`.

The Snakefile above shows how the `run:` directive and `shell()` function call
can be combined. This is very convenient, since otherwise we would end up with
complicated subprocess command construction and funky string manipulations.
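
As a rough illustration of why this is convenient, the sketch below mimics in plain Python how a command template containing `{read_file}` and `{read_url}` gets filled in. It uses `str.format` as a simplified stand-in for the substitution that `shell()` performs; the `build_command` helper and the two-entry `reads` dictionary are hypothetical names made up for this example, and no command is actually executed.

```python
# Simplified stand-in for shell()'s brace substitution
# (an illustration only, not Snakemake's own implementation).
def build_command(template, **variables):
    """Fill {name} placeholders in a shell command template."""
    return template.format(**variables)

# Hypothetical subset of the read files used in this example
reads = {
    "SRR606_1_reads.fq.gz": "http://example.com/SRR606_1_reads.fq.gz",
    "SRR606_2_reads.fq.gz": "http://example.com/SRR606_2_reads.fq.gz",
}

# Build one curl command per (file, url) pair without running anything
commands = [
    build_command("curl -L {read_url} -o {read_file}",
                  read_file=read_file, read_url=read_url)
    for read_file, read_url in reads.items()
]

for command in commands:
    print(command)
```

Without this kind of substitution, each command would have to be assembled by hand with string concatenation before being passed to a subprocess call.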

**Problem:** There's still one big problem, and that's how the task of
downloading each file is being divided up. We have a relatively short list of
files to download here, but suppose we had a list of 200 files. Now, we have
a single rule that is responsible for downloading 200 files. If any of those
files are missing, Snakemake will correctly assume the rule needs to be re-run, but
will end up running the entire rule, and re-downloading every file.

We really need to split our task up so that each rule corresponds to a single
task of downloading a single individual file. If we were hard-coding everything,
then we might end up hard-coding a bunch of rules, and that would stink.
Instead, let's use wildcards.
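
To make the cost concrete, here is a small plain-Python sketch (not Snakemake code) that compares the two designs. The function names and the file list are made up for illustration; each function returns the files that would be downloaded again when some outputs are missing.

```python
# Hypothetical comparison of one all-in-one rule versus one job per file.
def files_to_download_single_rule(expected, present):
    """One rule owns every file: any missing output re-runs the whole rule."""
    missing = [f for f in expected if f not in present]
    return list(expected) if missing else []

def files_to_download_per_file(expected, present):
    """One job per file: only the missing outputs are regenerated."""
    return [f for f in expected if f not in present]

# The eight read files from the table, with one of them pretended missing
expected = [f"SRR{n}_{d}_reads.fq.gz" for n in range(606, 610) for d in (1, 2)]
present = set(expected) - {"SRR608_2_reads.fq.gz"}

single = files_to_download_single_rule(expected, present)
per_file = files_to_download_per_file(expected, present)
print(len(single), "files re-downloaded with a single rule")
print(len(per_file), "file re-downloaded with one job per file")
```

The gap between the two grows with the length of the file list, which is the motivation for giving each file its own job.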


```
-----------------------------------8<----------------------------------------------
```


<a name="stage3"></a>
## Stage 3: Replace Script with Snakefile (Wildcards)

The next step in converting our workflow to Snakemake is to
replace the single hard-coded rule with a rule that uses wildcards,
so that each read file is downloaded by its own job. Here are the links again:

| Read Files | URL (note: these links are fake) |
|------------|----------------------------------|
| `SRR606_1_reads.fq.gz` | <http://example.com/SRR606_1_reads.fq.gz> |
| `SRR606_2_reads.fq.gz` | <http://example.com/SRR606_2_reads.fq.gz> |
| `SRR607_1_reads.fq.gz` | <http://example.com/SRR607_1_reads.fq.gz> |
| `SRR607_2_reads.fq.gz` | <http://example.com/SRR607_2_reads.fq.gz> |
| `SRR608_1_reads.fq.gz` | <http://example.com/SRR608_1_reads.fq.gz> |
| `SRR608_2_reads.fq.gz` | <http://example.com/SRR608_2_reads.fq.gz> |
| `SRR609_1_reads.fq.gz` | <http://example.com/SRR609_1_reads.fq.gz> |
| `SRR609_2_reads.fq.gz` | <http://example.com/SRR609_2_reads.fq.gz> |

There are multiple ways to modify the Snakefile to download the files directly.
The approach shown below uses Snakemake's HTTP remote provider in the `input:`
directive, wildcards in the `output:` directive, and a `shell:` directive to
run a shell command:

**`Snakefile`:**

```python
touchfile = '.downloaded_reads'

# from https://snakemake.readthedocs.io/en/stable/snakefiles/remote_files.html#read-only-web-http-s
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
HTTP = HTTPRemoteProvider()

# map of read files to read urls
reads = {
    "SRR606_1_reads.fq.gz" : "http://example.com/SRR606_1_reads.fq.gz",
    "SRR606_2_reads.fq.gz" : "http://example.com/SRR606_2_reads.fq.gz",
    "SRR607_1_reads.fq.gz" : "http://example.com/SRR607_1_reads.fq.gz",
    "SRR607_2_reads.fq.gz" : "http://example.com/SRR607_2_reads.fq.gz",
    "SRR608_1_reads.fq.gz" : "http://example.com/SRR608_1_reads.fq.gz",
    "SRR608_2_reads.fq.gz" : "http://example.com/SRR608_2_reads.fq.gz",
    "SRR609_1_reads.fq.gz" : "http://example.com/SRR609_1_reads.fq.gz",
    "SRR609_2_reads.fq.gz" : "http://example.com/SRR609_2_reads.fq.gz"
}

rule download_read:
    """
    Download a single individual read
    """
    input:
        # The input file is the remote HTTP url
        HTTP.remote("example.com/{prefix}_{direction}_reads.fq.gz", keep_local=True)
    output:
        # The output file is now the read file itself
        "{prefix}_{direction}_reads.fq.gz"
    shell:
        '''
        curl -L {input} -o {output}
        '''
```

Let's walk through this step by step.

We start by importing the `HTTPRemoteProvider`. This object checks whether a
remote file is available; if it is not, the rule fails to execute.

The `input:` directive contains a call to `HTTP.remote()` that is given the
URL of the file, written with the same wildcards that appear in the `output:`
directive. The `keep_local=True` flag ensures the downloaded files are
not deleted.
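
The wildcard matching described above can be sketched in plain Python. This is a simplified model rather than Snakemake's actual implementation: it assumes every wildcard matches one or more arbitrary characters (`.+`), and the helper names `pattern_to_regex` and `match_wildcards` are hypothetical.

```python
import re

def pattern_to_regex(pattern):
    """Turn a pattern like '{prefix}_{direction}_reads.fq.gz' into a regex.

    Assumption: every wildcard matches one or more characters ('.+').
    """
    # Splitting on a capture group alternates literal text and wildcard names
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)        # literal text
        else:
            regex += f"(?P<{part}>.+)"      # named wildcard
    return re.compile(regex + "$")

def match_wildcards(pattern, filename):
    """Return a dict of wildcard values, or None if the file does not match."""
    m = pattern_to_regex(pattern).match(filename)
    return m.groupdict() if m else None

# Match a requested output file against the output pattern
output_pattern = "{prefix}_{direction}_reads.fq.gz"
wildcards = match_wildcards(output_pattern, "SRR606_1_reads.fq.gz")
print(wildcards)

# The same values are then substituted into the input pattern
input_pattern = "example.com/{prefix}_{direction}_reads.fq.gz"
print(input_pattern.format(**wildcards))
```

In this model, requesting `SRR606_1_reads.fq.gz` fixes `prefix` and `direction`, and those values determine which remote file the rule asks for.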
@@ -49,6 +49,14 @@ throughout this documentation and what they mean.
[Converting Workflows to Snakemake](converting.md) - strategies for
converting shell script workflows into Snakemake workflows.

[**Example Overview: Filtering Sequencer Reads**](converting.md#example)

[**Stage 1: Shell Script + Snakefile**](converting.md#stage1)

[**Stage 2: Replace Script with Snakefile (Hard-Coded)**](converting.md#stage2)

[**Stage 3: Replace Script with Snakefile (Wildcards)**](converting.md#stage3)


## Useful Resources

@@ -56,3 +64,6 @@ Following is a list of useful Snakemake resources:

* <https://snakemake.readthedocs.io/>

@@ -25,8 +25,8 @@ Once you have Python installed, you should have `pip` available as well.
Snakemake can be installed using pip:

```
$ virtualenv -p python3 .venv
$ source .venv/bin/activate
$ virtualenv vp
$ source vp/bin/activate
$ pip install snakemake
```

@@ -35,5 +35,9 @@ $ pip install snakemake
If you are using conda, you can install Snakemake using conda by first
adding some conda channels, then installing Snakemake using `conda install`:

```
conda install -c bioconda -c conda-forge snakemake
```

This will install snakemake from the bioconda channel.

@@ -1,11 +1,32 @@
# Terminology

* **container** - Containers are like very lightweight
    virtual machines that can provide an isolated,
    consistent, reproducible environment in which to
    run software.

* **directive** - this refers to subheadings of rules,
    such as `input:` or `output:` or `shell:`

* **docker** - Docker is a program that allows running
    containers. Docker is very popular in enterprise
    software engineering, but presents challenges for
    scientific computing because it requires an admin
    account and presents security risks, making it
    hard to run in an HPC environment.

* **rule** - Snakemake rules define a given task,
    the input files it depends on, the output files
    it produces, the shell commands it should run,
    etc.

* **directive** - this refers to subheadings of rules,
    such as `input:` or `output:` or `shell:`
* **singularity** - Singularity is a program that allows
    running containers, like Docker, but without requiring
    an admin account and without many of the security
    concerns that Docker creates.

* **Snakefile** - Snakemake is a Python program that runs
    a set of rules defined in a file; `Snakefile` is the
    default name of the file in which Snakemake expects to
    find those rule definitions.

Submodule mkdocs-material updated: b0c6890853...6569122bb1
@@ -21,7 +21,7 @@ theme:
  font:
    text: 'Bitter'
    code: 'PT Mono'
nav:
pages:
- "Index" : "index.md"
- "Installing Snakemake" : "installing.md"
- "Snakemake Terminology" : "terminology.md"
@@ -34,6 +34,3 @@ markdown_extensions:
      guess_lang: false
  - toc:
      permalink: true


strict: true