Customising workflows
Customising the annotation scripts
The following rules / Python files stored in scripts are run at three different steps during the asr_curation pipeline:
- add_custom_annotations (scripts/add_custom_annotations.py) - run after all of the database annotations are retrieved and before subsets are made
- add_annotations_from_alignment (scripts/add_annotations_from_alignment.py) - run after the sequences for a given subset have been aligned
- add_annotations_from_ancestors (scripts/add_annotations_from_ancestors.py) - run after the ancestors for a given subset have been predicted
Which script to use for a given custom annotation depends on what information you need in order to build the rules. For example, if your custom rules make use of aligned positions, they will need to run after alignments have been generated.
Currently only add_custom_annotations runs before the creation of subsets, so only annotations added from this custom script (or annotations from the database retrievals) can be used to make subsets.
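As an illustration (annot_df is the annotation dataframe your script receives, and the column names here are purely hypothetical), a column created in add_custom_annotations.py is available when subsets are defined, whereas one created in the alignment or ancestor scripts is not:
# Hypothetical illustration only - the column names depend on your own data.
# Because add_custom_annotations.py runs before subsets are made, a column
# created here can later be used as a subset criterion.
annot_df["length_over_300"] = annot_df["sequence"].str.len() > 300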
The files stored in scripts are template files - they will read in the data and write it out correctly, but will not add any annotations.
You can override these files by copying them to a new location and writing code there to annotate your data.
Within your config file you then specify the location of any custom script.
See Defining the config files for more information.
Writing code for new annotation scripts
You must keep the structure of the input and output the same; to help with this, all three files contain the following line -
# IF CUSTOMISING THIS FILE PLACE CUSTOM CODE HERE
An example of how to write custom code is given in the scripts/configs/example_custom_annotations folder, and the example_workflow uses these files to add custom annotations to its datasets.
There is also a lot of code in scripts/annot_functions that is designed to perform common annotation tasks.
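To get a feel for the required structure, the sketch below shows a minimal custom script: it keeps the same read and write steps as the templates and leaves the marked region for your own code (the snakemake input / output names follow the full example at the end of this page).
import pandas as pd
import annot_functions as an

# Read in the annotation dataframe produced by the previous pipeline step
annot_df = pd.read_csv(snakemake.input.csv)

# IF CUSTOMISING THIS FILE PLACE CUSTOM CODE HERE
# (add columns to annot_df using your own rules or the helpers in scripts/annot_functions)

# Write the dataframe back out with the same structure the pipeline expects
annot_df.to_csv(snakemake.output[0], index=False)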
Customising annotations
Provided annotation functions
The following is a description of some of the provided annotation functions and how to incorporate them.
All changes should be made in your add_custom_annotations.py file, and you should add the custom location of this file to your config file.
Regenerating annotation files without making new alignments or trees
Create a top column
If a value within a column occurs in more than a given percentage of sequences, this creates a new TOP_ column containing True or False values depending on whether a sequence has that dominant annotation.
This can be useful for grouping together multiple annotations that differ from the dominant value in a column.
annot_df = an.create_top_column(annot_df, 80)
Separate out the note values from the feature columns
annot_df = an.separate_notes(annot_df)
Generating embeddings and running DBSCAN
Add embeddings
Generate embeddings using TM-Vec. You need to download the checkpoint and config.json files and point to their paths in your add_custom_annotations.py code.
To download the files -
wget https://users.flatironinstitute.org/thamamsy/public_www/tm_vec_cath_model.ckpt
wget https://users.flatironinstitute.org/thamamsy/public_www/tm_vec_cath_model_params.json
Then set their paths in your file -
model_checkpoint_path = <path_to_ckpt>
model_config_path = <path_to_json>
And then create the embeddings. This stores the generated embeddings in a local file, so that they can be reused over subsequent runs of asr_curation.
The default path is a pickle file, "embeddings.pkl", in the custom_annotations folder, but this can be changed with the embedding_df_path parameter of the process_and_store_embeddings function -
annot_df = tm_vec_embed.process_and_store_embeddings(annot_df, 'Prot_T5', model_checkpoint_path, model_config_path)
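If you want the cached embeddings stored somewhere other than the default pickle file, you can pass embedding_df_path. The sketch below assumes it is accepted as a keyword argument, and the output location shown is just an illustrative choice.
# Assumed keyword usage of embedding_df_path; the path below is only an example location
annot_df = tm_vec_embed.process_and_store_embeddings(
    annot_df,
    'Prot_T5',
    model_checkpoint_path,
    model_config_path,
    embedding_df_path=f"{snakemake.input.custom_dir}/my_embeddings.pkl",
)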
Generate DBSCAN images
Generate a DBSCAN coverage image
db.generate_dbscan_coverage(annot_df, 'Prot_T5 Embed Encoded',
f"{snakemake.input.custom_dir}/{snakemake.wildcards.dataset}_dbscan_coverage" )
Here we provide the dataframe, the name of the column containing the embeddings (by default 'Prot_T5 Embed Encoded' if you used Prot_T5 in the previous step), and a prefix for the output files.
You can also exclude columns from the DBSCAN analysis by passing them to the skip_cols parameter -
db.generate_dbscan_coverage(annot_df, 'Prot_T5 Embed Encoded',
f"{snakemake.input.custom_dir}/{snakemake.wildcards.dataset}_dbscan_coverage_no_ft",
skip_cols = ['ft_var_seq||', 'ft_variant||', 'ft_conflict||', 'ft_chain||', 'ft_crosslnk||', 'ft_carbohyd||', 'ft_init_met||' , 'ft_mod_res||', 'ft_lipid||', 'ft_transit||', 'ft_compbias||', 'ft_domain||', 'ft_motif||', 'ft_region||', 'ft_repeat||', 'ft_zn_fing||', 'ft_binding||', 'ft_topo_dom||', 'ft_act_site||'])
Full minimal example of generating embeddings and running DBSCAN outlier detection
The following can be created as add_custom_annotations.py to generate embeddings. Make sure to specify that you are using a custom annotations file in your config file, and update the paths to the TM-Vec checkpoint / config.
import os
import annot_functions as an
import get_funfams as ff
import map_to_cdd as m2c
import seqcurate as sc
import pandas as pd
import numpy as np
import add_embeddings as embed
import add_tm_vec_embeddings as tm_vec_embed
import create_itol_files as itol
import create_dbscan_coverage as db
annot_df = pd.read_csv(snakemake.input.csv)
# Create TOP column for high scoring values
annot_df = an.create_top_column(annot_df, 80)
# Separate out the note values from the feature columns
annot_df = an.separate_notes(annot_df)
# Add embeddings
model_checkpoint_path = <path_to_checkpoint>
model_config_path = <path_to_config>
annot_df = tm_vec_embed.process_and_store_embeddings(annot_df, 'Prot_T5', model_checkpoint_path, model_config_path)
# Generate DBSCAN images
db.generate_dbscan_coverage(annot_df, 'Prot_T5 Embed Encoded',
f"{snakemake.input.custom_dir}/{snakemake.wildcards.dataset}_dbscan_coverage" )
db.generate_dbscan_coverage(annot_df, 'Prot_T5 Embed Encoded',
f"{snakemake.input.custom_dir}/{snakemake.wildcards.dataset}_dbscan_coverage_no_ft",
skip_cols = ['ft_var_seq||', 'ft_variant||', 'ft_conflict||', 'ft_chain||', 'ft_crosslnk||', 'ft_carbohyd||', 'ft_init_met||' , 'ft_mod_res||', 'ft_lipid||', 'ft_transit||', 'ft_compbias||', 'ft_domain||', 'ft_motif||', 'ft_region||', 'ft_repeat||', 'ft_zn_fing||', 'ft_binding||', 'ft_topo_dom||', 'ft_act_site||'])
annot_df.to_csv(snakemake.output[0], index=False)