Defining input files
Defining the config files
Every snakemake pipeline needs to be run with a specific config file - a YAML file with a .yaml extension.
For example this call uses the --configfile
flag to set the location to the example_config.yaml
file stored in the asr_curation repository
snakemake --cores 1 --configfile ./asr_curation/config/example_config.yaml
A config file must define the following parameters
-
workdir - where the
datasets
folder and all of the created data will be stored -
fastadir - where the fasta files are stored
-
subdir - where the
subset
files are stored - they must match the name of the .fasta file they refer to but contain a .subset extension -
annotation_cols - this specifies which columns in the annotation file should be written to the FigTree annotation file
-
blocked_datasets - a list of datasets that will not be run - meaning you can keep the FASTA and .subset file in the fastadir and subdir and just skip over the generation of these files. An empty list
[]
will run every fasta and matching .subset file
A config file can contain the following optional parameters
These define the custom annotations that can be added.
See Customising the annotations scripts for more details
- custom_dir - if you want to use a custom
add_custom_annotations.py
file then set the directory location here and place a custom file nameadd_custom_annotations.py
into this folder - custom_align_dir - if you want to use a custom
add_annotations_from_alignment.py
file then set the directory location here and place a custom file nameadd_annotations_from_alignment.py
into this folder - custom_align_dir - if you want to use a custom
add_annotations_from_ancestors.py
file then set the directory location here and place a custom file nameadd_annotations_from_ancestors.py
into this folder
Example config file
Here is an example config file from the example_workflow in the asr_curation
repository
# Main working directory
workdir: "workflows/example_workflow"
# Store all FASTA files here
fastadir: "workflows/example_workflow/fasta"
# Each FASTA file needs an according .subset file in this folder (test1.fasta -> test1.subset)
subdir: "workflows/example_workflow/subset_rules"
# Which columns should we plot in the summary documents?
annotation_cols: ['lineage_superkingdom', 'xref_supfam', 'xref_panther', 'KARI_Class', 'ec']
# Directories for custom annotations
custom_dir : "scripts/example_custom_annotations"
custom_align_dir: "scripts/example_custom_annotations"
# Block a data set from being run (you can still keep the FASTA and .subset file in the fastadir and subdir)
blocked_datasets : []
Defining the subset files
Subset files need to be in the format
<subset_name> = <column_name> : <string> $ <other_column_name> : NOT <string>
Where subset_name
can be any name you wish to call the subset, column_name
and other_column_name
are columns that appear in the final annotation file (see /csv/custom/*.csv) and string
is a term that you wish to include or exclude.
For each row / sequence in your annotation file, if the given column contains the string (or substring) then it will be included.
If you use the NOT
modifier, for each row / sequence in your annotation file, if the given column contains the string (or substring) then it will not be included.
You can define multiple subsets within the one file.
# THIS IS A COMMENT LINE.
eukaryotic_kari = lineage_superkingdom : Eukaryota $ protein_name : NOT Deleted, Merged $ Non_AA_Character : FALSE
classI_kari = KARI_Class : Class_1 $ protein_name : NOT Deleted, Merged $ Non_AA_Character : FALSE
all_kari = *
Including all sequences in a subset
The default way to include every row / sequence is
<subset_name> = *
For example,
all = *