# Running example data
The repository contains example data that can be inspected and rerun.
You can see the config file at `config/example_config.yaml`. The first few lines of this config file define where data will be stored and where the required input files are located:
`config/example_config.yaml`:

```yaml
# Main working directory
workdir: "workflows/example_workflow"
# Store all FASTA files here
fastadir: "workflows/example_workflow/fasta"
# Each FASTA file needs an according .subset file in this folder (test1.fasta -> test1.subset)
subdir: "workflows/example_workflow/subset_rules"
```
All of these are directories within the main `asr_curation` repository.
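If you want to sanity-check these paths before running anything, you can load the config yourself. This is a minimal sketch (not part of the pipeline), assuming PyYAML is installed and that you run it from the repository root:

```python
# Minimal sketch: load the example config with PyYAML and report whether
# each configured directory exists. Assumes the asr_curation repo root
# as the working directory.
from pathlib import Path

import yaml

with open("config/example_config.yaml") as fh:
    config = yaml.safe_load(fh)

for key in ("workdir", "fastadir", "subdir"):
    path = Path(config[key])
    status = "exists" if path.is_dir() else "missing"
    print(f"{key}: {path} ({status})")
```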
## FASTA directory
You can see that the FASTA directory (`fastadir`) at `workflows/example_workflow/fasta` contains two FASTA files:

- `als_example_ec_2_2_1_6.fasta`
- `kari_example_ec_1_1_1_86.fasta`
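To get a quick feel for the input data, you can count the sequences in each file. A small sketch, assuming Biopython is installed and that you run it from the repository root:

```python
# Count the sequences in each example FASTA file under fastadir.
from pathlib import Path

from Bio import SeqIO

fastadir = Path("workflows/example_workflow/fasta")
for fasta in sorted(fastadir.glob("*.fasta")):
    n_records = sum(1 for _ in SeqIO.parse(fasta, "fasta"))
    print(f"{fasta.name}: {n_records} sequences")
```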
## Subset directory
And the subset directory (`subdir`) at `workflows/example_workflow/subset_rules` contains a matching subset rules file for each dataset:

- `als_example_ec_2_2_1_6.subset`
- `kari_example_ec_1_1_1_86.subset`
The ALS subset file contains a single subset:

```
als_interpro_IPR012782 = Non_AA_Character : False $ protein_name : NOT Deleted, Merged $ xref_interpro : IPR012782
```
And the KARI subset file contains three subsets: one will contain only eukaryotic KARI sequences, one will contain only Class I KARI sequences, and one will contain all of the KARI sequences.

```
eukaryotic_kari = lineage_superkingdom : Eukaryota $ protein_name : NOT Deleted, Merged $ Non_AA_Character : FALSE
classI_kari = KARI_Class : Class_1 $ protein_name : NOT Deleted, Merged $ Non_AA_Character : FALSE
all_kari = *
```
Therefore, if we look at the generated data, we will see one output dataset for the single ALS subset and three output datasets for the three KARI subsets.
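To make the rule syntax concrete, here is an illustrative parser for lines of the form `name = column : value $ column : value ...`, with `NOT` for exclusions and `*` for everything. This is only a sketch of the format as shown above, not the pipeline's own parser:

```python
# Illustrative parser for the .subset rule format shown above.
# Returns the subset name and a list of (column, negated, values) tuples.
def parse_subset_rule(line: str):
    name, _, rhs = (part.strip() for part in line.partition("="))
    if rhs == "*":
        return name, [("*", False, ["*"])]  # "*" selects all sequences
    conditions = []
    for clause in rhs.split("$"):
        column, _, value = (part.strip() for part in clause.partition(":"))
        negated = value.upper().startswith("NOT ")
        if negated:
            value = value[4:]  # drop the leading "NOT "
        values = [v.strip() for v in value.split(",")]
        conditions.append((column, negated, values))
    return name, conditions

name, conds = parse_subset_rule(
    "classI_kari = KARI_Class : Class_1 $ protein_name : NOT Deleted, Merged"
)
print(name, conds)
```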
## Running it via the tests
You can validate that everything is working correctly by running a small example workflow locally.
There is a test within `test/test_snakemake.py` called `test_snakemake_pipeline`. The test data contains only a small number of sequences so that the test can be run quickly.
You can run this test with:

```bash
conda activate asr_curation
pytest test/test_snakemake.py
```
This will run the entire Snakemake pipeline from end to end, including any database retrieval. The output folders remain after the test has run so that you can inspect them, but they are deleted every time the test is rerun.
See `test_snakemake_pipeline` in `test/test_snakemake.py` for more details about this test, and `test/files/config/test_config.yaml` for the full details of where output folders, FASTA directories, and subset directories are set.
There is also another test within `test/test_snakemake.py`, called `test_snakemake_pipeline_with_existing_annotation_file`, that performs the same test without deleting the original annotations file. This means that no calls to external databases need to be made, and only the subset generation, alignment, tree inference, and ancestor prediction steps are rerun, making this an even quicker way to test the pipeline.
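Because pytest can select an individual test by its node ID, you can run just this quicker variant on its own:

```bash
pytest test/test_snakemake.py::test_snakemake_pipeline_with_existing_annotation_file
```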
## Rerunning the entire example_workflow
You can also delete the entire output of the example_workflow locally and then run it again to check that you can regenerate the data.
Note that these datasets contain substantially more data than the tests, so the workflow will take longer to run.
Remove the `datasets` directory stored in `workflows/example_workflow`. Make sure to keep the `fastadir` and `subdir` directories:

```bash
rm -rf workflows/example_workflow/datasets/
```
Rerun the Snakemake pipeline to regenerate the data.
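A typical invocation is shown below; the Snakefile location and core count are assumptions here, so check the repository README for the canonical command:

```bash
conda activate asr_curation
snakemake --cores 1 --configfile config/example_config.yaml
```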
**Warning:** This will retrigger calls to the UniProt and BRENDA databases, so the annotations may not be identical to those stored in this repository, as sequences get updated in these databases over time. Any changes you make to these files are excluded from the Git repository, so a fresh `git pull` will overwrite any changes you make to the example data.