update all docs

add snakefile for utility tasks (and add .snakemake to .gitignore)
2018-08-03 22:08:31 -07:00 · 2018-07-24 01:02:51 -07:00
8 changed files with 252 additions and 10 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -1 +1,2 @@
+.snakemake/
 site/
--- a/118
+++ b/118
@@ -0,0 +1,118 @@
+ghurl = 'git@github.com:charlesreid1/how-do-i-snakemake.git'
+cmrurl = 'ssh://git@git.charlesreid1.com:222/charlesreid1/how-do-i-snakemake.git'
+cmrmkm = 'ssh://git@git.charlesreid1.com:222/charlesreid1/mkdocs-material.git'
+index = 'index.html'
+
+rule default:
+    message:
+        """
+
+Welcome to the Snakefile for the how-do-i-snakemake repo.
+This Snakefile contains utility methods for building and
+deploying the site.
+        
+----------------------------------------
+            Snakefile
+
+Add the -p or --printshellcmd flag to print the shell commands being run.
+Add the -n or --dryrun flag to do a dry run.
+
+
+To clone the deployed site to site/:
+    snakemake -p clone_site
+
+
+To initialize submodules (if you did not clone this repo recursively):
+    snakemake -p submodule_init
+
+
+To build the documentation in docs/ into the site/ directory:
+    snakemake build
+        
+
+To build and serve the documentation site locally (viewable at localhost:8000):
+    snakemake serve
+
+
+To safely clean the documentation site before next deployment:
+    snakemake clean_docs
+
+
+To build and deploy the updated documentation site (pushes gh-pages to the cmr and gh remotes):
+    snakemake deploy_docs
+
+"""
+
+
+rule clone_site:
+    """
+    Clone the deployed site to site/
+    and add the proper remotes.
+    """
+    output:
+        'site'
+    shell:
+       '''
+       git clone -b gh-pages {ghurl} site/
+       cd site \
+            && git remote add cmr {cmrurl} \
+            && git remote add gh {ghurl}
+       '''
+
+
+rule submodule_init:
+    """
+    Initialize the submodules (mkdocs-material)
+    so we can build the documentation.
+    """
+    shell:
+        '''
+        git submodule update --init
+        '''
+
+
+rule build:
+    """
+    Build the documentation with mkdocs.
+    """
+    shell:
+        '''
+        mkdocs build
+        '''
+
+
+rule serve:
+    """
+    Serve the documentation with mkdocs.
+    Visit localhost:8000 in your browser.
+    """
+    shell:
+        '''
+        mkdocs serve 
+        '''
+
+
+rule clean_docs:
+    """
+    Safely clean the deployed documentation site.
+    """
+    shell:
+        '''
+        rm -fr site/content/*
+        '''
+
+
+rule deploy_docs:
+    """
+    Deploy the documentation.
+    """
+    shell:
+        '''
+        mkdocs build
+        cd site/
+        git add -A .
+        git commit --allow-empty . -m 'updating site'
+        git push cmr gh-pages
+        git push gh gh-pages 
+        '''
+
--- a/docs/converting.md
+++ b/docs/converting.md
@@ -25,7 +25,8 @@ Two recommendations to help overcome this:
   smaller parts, and convert some of those parts into separate Snakemake rules.


-## Example: Filtering Sequencer Reads
+<a name="example"></a>
+## Example Overview: Filtering Sequencer Reads

 Let's illustrate the process of converting a workflow from shell scripts to a
 Snakefile, and doing so in stages, using a hypothetical workflow that involves
@@ -43,6 +44,7 @@ downloading data files containing reads from a sequencer from an external URL:
 | `SRR609_2_reads.fq.gz` | <http://example.com/SRR609_2_reads.fq.gz> |


+<a name="stage1"></a>
 ## Stage 1: Shell Script + Snakefile

 ### The Shell Script
@@ -333,6 +335,7 @@ You can force Snakemake to re-download the files two ways:



+<a name="stage2"></a>
 ## Stage 2: Replace Script with Snakefile (Hard-Coded)

 The next step in converting our workflow to Snakemake is to 
@@ -388,8 +391,95 @@ rule download_reads:
 The Python variables `read_file` and `read_url` are available to the shell command
 through `{read_file}` and `{read_url}`.

+The Snakefile above shows how the `run:` directive and `shell()` function call
+can be combined. This is very convenient, since otherwise we would end up with
+complicated subprocess command construction and funky string manipulations.

+**Problem:** There's still one big problem, and that's how the task of
+downloading each file is being divided up. We have a relatively short list of
+files to download here, but suppose we had a list of 200 files. We would then
+have a single rule responsible for downloading all 200 files. If any of those
+files is missing, Snakemake will correctly conclude the rule needs to be re-run,
+but it will re-run the entire rule and re-download every file.
+
+We really need to split our task up so that each rule corresponds to a single
+task of downloading a single individual file. If we were hard-coding everything,
+then we might end up hard-coding a bunch of rules, and that would stink.
+Instead, let's use wildcards.
+
+
+```
+-----------------------------------8<----------------------------------------------
+```
+
+
+<a name="stage3"></a>
 ## Stage 3: Replace Script with Snakefile (Wildcards)

+The next step in converting our workflow to Snakemake is to 
+replace the hard-coded file names with wildcards, so that one rule downloads
+one file and Snakemake runs the curl command once per file. Here are the links:

+| Read Files | URL (note: these links are fake) |
+|------------|----------------------------------|
+| `SRR606_1_reads.fq.gz` | <http://example.com/SRR606_1_reads.fq.gz> | 
+| `SRR606_2_reads.fq.gz` | <http://example.com/SRR606_2_reads.fq.gz> |
+| `SRR607_1_reads.fq.gz` | <http://example.com/SRR607_1_reads.fq.gz> |
+| `SRR607_2_reads.fq.gz` | <http://example.com/SRR607_2_reads.fq.gz> |
+| `SRR608_1_reads.fq.gz` | <http://example.com/SRR608_1_reads.fq.gz> |
+| `SRR608_2_reads.fq.gz` | <http://example.com/SRR608_2_reads.fq.gz> |
+| `SRR609_1_reads.fq.gz` | <http://example.com/SRR609_1_reads.fq.gz> |
+| `SRR609_2_reads.fq.gz` | <http://example.com/SRR609_2_reads.fq.gz> |
+
+There are multiple ways to modify the Snakefile to download the files one at a time.
+The approach shown below uses wildcards in the `input:` and `output:` directives,
+and a `shell:` directive to run the curl command:
+
+**`Snakefile`:**
+
+```python
+touchfile = '.downloaded_reads'
+
+# from https://snakemake.readthedocs.io/en/stable/snakefiles/remote_files.html#read-only-web-http-s
+from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider
+HTTP = HTTPRemoteProvider()
+
+# map of read files to read urls
+reads = {
+    "SRR606_1_reads.fq.gz" : "http://example.com/SRR606_1_reads.fq.gz",
+    "SRR606_2_reads.fq.gz" : "http://example.com/SRR606_2_reads.fq.gz",
+    "SRR607_1_reads.fq.gz" : "http://example.com/SRR607_1_reads.fq.gz",
+    "SRR607_2_reads.fq.gz" : "http://example.com/SRR607_2_reads.fq.gz",
+    "SRR608_1_reads.fq.gz" : "http://example.com/SRR608_1_reads.fq.gz",
+    "SRR608_2_reads.fq.gz" : "http://example.com/SRR608_2_reads.fq.gz",
+    "SRR609_1_reads.fq.gz" : "http://example.com/SRR609_1_reads.fq.gz",
+    "SRR609_2_reads.fq.gz" : "http://example.com/SRR609_2_reads.fq.gz"
+}
+
+rule download_read:
+    """
+    Download a single individual read
+    """
+    input:
+        # The input file is the remote HTTP url
+        HTTP.remote("example.com/{prefix}_{direction}_reads.fq.gz", keep_local=True)
+    output:
+        # The output file is now the read file itself
+        "{prefix}_{direction}_reads.fq.gz"
+    shell:
+        '''
+        curl -L {input} -o {output}
+        '''
+```
+
+Let's walk through this step by step.
+
+We start by importing the `HTTPRemoteProvider`. This is an object that will
+check whether a remote file is available; if it is not, the rule fails to
+execute.
+
+The `input:` directive contains a call to `HTTP.remote()` that passes the 
+URL of the file, containing the wildcards that are matched in the `output:`
+directive. The `keep_local=True` flag ensures the downloaded files are 
+not deleted.
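
Reviewer note on the wildcard walkthrough above: the following is a minimal plain-Python sketch of the idea, not Snakemake's actual implementation. It assumes each wildcard behaves like the regular expression `.+`, and the helper names `pattern_to_regex` and `resolve_input` are hypothetical, introduced only for illustration. It shows how a requested file name is matched against the `output:` pattern and how the captured wildcard values are substituted into the `input:` pattern.

```python
import re


def pattern_to_regex(pattern):
    """Compile a pattern such as '{prefix}_{direction}_reads.fq.gz'
    into a regex with one named group per wildcard (hypothetical helper)."""
    # With one capture group, re.split alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)          # literal text, matched exactly
        else:
            regex += "(?P<%s>.+)" % part      # wildcard, assumed to match '.+'
    return re.compile("^" + regex + "$")


def resolve_input(target, output_pattern, input_pattern):
    """Return the input for a requested target, or None if the
    output pattern does not match the target (hypothetical helper)."""
    match = pattern_to_regex(output_pattern).match(target)
    if match is None:
        return None
    # Fill the same wildcard values into the input pattern.
    return input_pattern.format(**match.groupdict())


# Patterns copied from the download_read rule above.
output_pattern = "{prefix}_{direction}_reads.fq.gz"
input_pattern = "example.com/{prefix}_{direction}_reads.fq.gz"

print(resolve_input("SRR606_1_reads.fq.gz", output_pattern, input_pattern))
print(resolve_input("notes.txt", output_pattern, input_pattern))
```

Requesting one read file this way resolves exactly one input, which is why a missing file triggers a single download rather than a re-run of every download.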

--- a/docs/index.md
+++ b/docs/index.md
@@ -49,6 +49,14 @@ throughout this documentation and what they mean.
 [Converting Workflows to Snakemake](converting.md) - strategies for
 converting shell script workflows into Snakemake workflows.

+[**Example Overview: Filtering Sequencer Reads**](converting.md#example)
+
+[**Stage 1: Shell Script + Snakefile**](converting.md#stage1)
+
+[**Stage 2: Replace Script with Snakefile (Hard-Coded)**](converting.md#stage2)
+
+[**Stage 3: Replace Script with Snakefile (Wildcards)**](converting.md#stage3)
+

 ## Useful Resources

@@ -56,3 +64,6 @@ Following is a list of useful Snakemake resources:

 * <https://snakemake.readthedocs.io/>

+
+
+
--- a/docs/installing.md
+++ b/docs/installing.md
@@ -25,8 +25,8 @@ Once you have Python installed, you should have `pip` available as well.
 Snakemake can be installed using pip:

 ```
-$ virtualenv -p python3 .venv
-$ source .venv/bin/activate
+$ virtualenv vp
+$ source vp/bin/activate
 $ pip install snakemake
 ```

@@ -35,5 +35,9 @@ $ pip install snakemake
 If you are using conda, you can install Snakemake using conda by first 
 adding some conda channels, then installing Snakemake using `conda install`:

+```
+conda install -c bioconda -c conda-forge snakemake
+```

+This will install Snakemake from the bioconda channel.

--- a/docs/terminology.md
+++ b/docs/terminology.md
@@ -1,11 +1,32 @@
 # Terminology

+* **container** - Containers are like very lightweight
+    virtual machines that can provide an isolated,
+    consistent, reproducible environment in which to
+    run software.
+
+* **directive** - this refers to subheadings of rules,
+    such as `input:` or `output:` or `shell:`
+
+* **docker** - Docker is a program that allows running
+    containers. Docker is very popular in enterprise
+    software engineering, but presents challenges for
+    scientific computing because it requires an admin
+    account and presents security risks, making it
+    hard to run in an HPC environment.
+
 * **rule** - Snakemake rules define a given task,
    the input files it depends on, the output files
    it produces, the shell commands it should run,
    etc.

-* **directive** - this refers to subheadings of rules,
-    such as `input:` or `output:` or `shell:`
+* **singularity** - Singularity is a program that allows
+    running containers, like Docker, but without requiring
+    an admin account and without many of the security 
+    concerns that Docker creates.

+* **Snakefile** - Snakemake is a Python program that runs
+    a set of rules defined in a file; Snakefile is the
+    default name of the file in which Snakemake expects to
+    find those definitions.

--- a/2
+++ b/2
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -21,7 +21,7 @@ theme:
  font:
    text: 'Bitter'
    code: 'PT Mono'
-nav:
+pages:
  - "Index" : "index.md"
  - "Installing Snakemake" : "installing.md"
  - "Snakemake Terminology" : "terminology.md"
@@ -34,6 +34,3 @@ markdown_extensions:
      guess_lang: false
  - toc:
      permalink: true
-
-
-strict: true
Author	SHA1	Message	Date
Charles Reid	e793a6678f	update all docs	2018-08-03 22:08:31 -07:00
Charles Reid	d2a6ceca81	add snakefile for utility tasks (and add .snakemake to .gitignore)	2018-07-24 01:02:51 -07:00