Basic snakemake
Snakefile
The main workflow is contained within the snakefile
file in the top folder of the asr_curation
repository.
A snakefile
contains a series of rules that define -
- an input
- an output
- how to get to the output from the input, typically either a Python script or a shell command
Rules in snakefiles
For example, this rule defines -
- the input as an alignment,
- the output as a tree file (in Newick format)
- a shell command to run in order to generate this - in this case calling
FastTree
rule infer_tree:
input:
WORKDIR + "/{dataset}/subsets/{subset}/{dataset}_{subset}.aln"
output:
WORKDIR + "/{dataset}/subsets/{subset}/{dataset}_{subset}.nwk"
shell:
"FastTree {input} > {output}"
Anything defined within curly brackets {}
is a wildcard.
The wildcards in the shell command let us use the whatever we define as input
and output
within the shell command.
The wildcards in the input and output rules let as define different dataset
and subset
wildcards which makes it possible to run the
entire snakemake pipeline on different sets of data.
Triggering a snakemake pipeline
Once you trigger a snakemake pipeline, it will look for the first command defined in the snakefile
file and try to create the output.
By convention, this is a rule called all
which we place at the top of the snakefile
In asr_curation
the following rule is defined as the first rule -
rule all:
input:
annotations = [f'{WORKDIR}/{dataset}/subsets/{subset}/csv/{dataset}_{subset}_annotations.txt' for dataset in DATASETS for subset in subsets[dataset]],
ancestors = [f'{WORKDIR}/{dataset}/subsets/{subset}/csv/{dataset}_{subset}_ancestors.csv' for dataset in DATASETS for subset in subsets[dataset]]
Snakemake will look to see if these input
files already exists for each dataset
and subset
in terms of both the annotations.txt
and the ancestors.csv
If these files do not exist, snakemake finds a rule defined in the snakefile
that it would need to run in order to generate these files
Ignoring the other detail, focus here on the output line - this rule will generate the output of ancestors.csv
which is one of the files we needed as input to the all
rule.
If this file didn't already exist, then we know we must run this rule in order to generate it!
rule add_annotations_from_ancestors:
input:
csv = WORKDIR + "/{dataset}/subsets/{subset}/csv/{dataset}_{subset}_alignment.csv",
aln = WORKDIR + "/{dataset}/subsets/{subset}/grasp_results/GRASP_ancestors.fa",
output:
csv = WORKDIR + "/{dataset}/subsets/{subset}/csv/{dataset}_{subset}_ancestors.csv"
script:
CUSTOM_ANCESTOR_DIR + "/add_annotations_from_ancestors_for_use_in_example_workflow.py"
Snakemake pipelines only rerun what is needed
Snakemake backtracks in this way, checking to see which rules need to be run in order to generate the required input for the all
rule.
If no data at all has been generated it will start with the very first step in the workflow, but if some steps have been completed then it does not need to regenerate them. This is an important concept in workflow management and saves us a lot of time.
For example, in the asr_curation
pipeline, all of the subset files use the annotation_files
generated from going to the UniProt and BRENDA databases.
So only one call to the databases needs to be made, and then all downstream analysis simply uses the generated output from the annotation steps.
A closer look at snakemake rules
Here are two snakemake
rules. We can compare the differences in order to get a better sense of how to read snakemake
rules in the snakefile
rule validate_ids:
input:
WORKDIR + "/{dataset}/csv/original/{dataset}_original.csv"
output:
WORKDIR + "/{dataset}/csv/validated/{dataset}_validated.csv"
script:
"scripts2/validate_ids.py"
rule infer_tree:
input:
WORKDIR + "/{dataset}/subsets/{subset}/{dataset}_{subset}.aln"
output:
WORKDIR + "/{dataset}/subsets/{subset}/{dataset}_{subset}.nwk"
shell:
"FastTree {input} > {output}"
validate_ids
only uses thedataset
wildcard as it is only run once perdataset
infer_tree
uses both thedataset
and thesubset
wildcard as it is run once persubset
- Both rules use the wildcards to name the output files - for example, in the example workflow, the phylogenetic tree generated for the dataset
kari_example_ec_1_1_1_86
and subseteukaryotic_kari
is written to/{dataset}/subsets/{subset}/{dataset}_{subset}.nwk
or/kari_example_ec_1_1_1_86/subsets/eukaryotic_kari/kari_example_ec_1_1_1_86_eukaryotic_kari.nwk
. This means that files can easily be identified and will not overwrite each other. validate_ids
defines ascript
command - which tellssnakemake
to run the Python scriptscripts/validate_ids.py
. You can look in the scripts folder to see exactly what is being run for each step.infer_tree
defines ashell
commmand - which tellssnakemake
to run the defined command on the command line.