rk.msacluster
Subsample sequences in an MSA with DBSCAN and uniform random sampling
This script is adapted from the AF_cluster repo.
rk.msacluster subsamples the sequences in an MSA using the DBSCAN algorithm as well as uniform random sampling, and writes an .a3m file for each subsample. It assumes the first sequence in the input fasta/a3m file is the query sequence.
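The uniform random sampling half of this workflow can be sketched in a few lines of standard-library Python. This is an illustration only, not the script's actual code: the function names (`read_a3m`, `uniform_controls`, `to_a3m`), the `sample_size` parameter, and the choice to strip lowercase insertion columns are assumptions made for the example. The one behaviour taken from the description above is that the first record is treated as the query and kept at the top of every subsample.

```python
import random

def read_a3m(text):
    """Parse fasta/a3m text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and a3m comment lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            # In a3m, lowercase letters mark insertions relative to the query.
            # Dropping them (an assumption here) leaves rows of equal length.
            chunks.append("".join(c for c in line if not c.islower()))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def uniform_controls(records, n_controls=10, sample_size=10, seed=0):
    """Draw n_controls uniform random subsamples, each led by the query.

    sample_size is a hypothetical parameter; the real tool may size its
    control MSAs differently.
    """
    rng = random.Random(seed)
    query, rest = records[0], records[1:]
    k = min(sample_size, len(rest))
    return [[query] + rng.sample(rest, k) for _ in range(n_controls)]

def to_a3m(records):
    """Serialise (header, sequence) pairs back to a3m text."""
    return "".join(f">{h}\n{s}\n" for h, s in records)
```

Each list returned by `uniform_controls` could then be passed to `to_a3m` and written to its own file in the output directory.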
options:
-h, --help show this help message and exit
-i I fasta/a3m file of original alignment, or path containing fasta/a3m files
-o O name of output directory to write MSAs to.
--n_controls N_CONTROLS
Number of control msas to generate (Default 10)
--verbose Print cluster info as they are generated.
--scan Select eps value on 1/4 of data, shuffled.
--eps_val EPS_VAL Use single value for eps instead of scanning.
--resample If included, will resample the original MSA with replacement before writing.
--gap_cutoff GAP_CUTOFF
Remove sequences with gaps representing more than this frac of seq.
--min_eps MIN_EPS Min epsilon value to scan for DBSCAN (Default 3).
--max_eps MAX_EPS Max epsilon value to scan for DBSCAN (Default 20).
--eps_step EPS_STEP step for epsilon scan for DBSCAN (Default 0.5).
--min_samples MIN_SAMPLES
Default min_samples for DBSCAN (Default 3, recommended no lower than that).
--run_PCA Run PCA on one-hot embedding of sequences and store in output_cluster_metadata.tsv
--run_TSNE Run TSNE on one-hot embedding of sequences and store in output_cluster_metadata.tsv
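To make the interplay of --gap_cutoff, --min_eps, --max_eps, --eps_step, --min_samples and --scan more concrete, here is a minimal, dependency-free sketch. It rests on several assumptions that the help text does not confirm: that gaps count as their own one-hot category (so the Euclidean distance between two aligned sequences is the square root of twice the number of mismatched columns), that the query is always kept by the gap filter, that max_eps is included in the scan, and that the scan keeps the eps value producing the most clusters. The tiny `dbscan` function stands in for a library implementation and is only meant to show the idea; all function names are hypothetical.

```python
import math
import random

def gap_fraction(seq):
    """Fraction of alignment columns that are gaps in this sequence."""
    return seq.count("-") / len(seq)

def filter_gappy(seqs, gap_cutoff):
    """Drop sequences whose gap fraction exceeds gap_cutoff.

    Keeping the query (first sequence) unconditionally is an assumption.
    """
    return seqs[:1] + [s for s in seqs[1:] if gap_fraction(s) <= gap_cutoff]

def onehot_distance(a, b):
    """Euclidean distance between one-hot encodings of two aligned sequences.

    Each mismatched column contributes 2 to the squared distance.
    """
    return math.sqrt(2 * sum(x != y for x, y in zip(a, b)))

def dbscan(seqs, eps, min_samples=3):
    """Minimal DBSCAN: one label per sequence, -1 meaning noise."""
    n = len(seqs)
    neighbors = [
        [j for j in range(n) if onehot_distance(seqs[i], seqs[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels

def scan_eps(seqs, min_eps=3.0, max_eps=20.0, eps_step=0.5, min_samples=3, seed=0):
    """Scan eps on a random quarter of the data; return (best_eps, n_clusters)."""
    rng = random.Random(seed)
    subset = rng.sample(seqs, max(1, len(seqs) // 4))
    best_eps, best_n = None, -1
    n_steps = int(round((max_eps - min_eps) / eps_step))
    for i in range(n_steps + 1):  # including max_eps is an assumption
        eps = min_eps + i * eps_step
        labels = dbscan(subset, eps, min_samples)
        n_clusters = len(set(labels) - {-1})
        if n_clusters > best_n:  # assumed criterion: most clusters wins
            best_eps, best_n = eps, n_clusters
    return best_eps, best_n
```

Under these assumptions, passing --eps_val would correspond to skipping `scan_eps` and calling `dbscan` once with the given value.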
typical command example:
rk.msacluster v1 -i alignments/ -o msaclusters --run_TSNE