rk.msacluster

Subsample sequences in an MSA with DBSCAN and uniform random sampling

This script is modified from the AF_cluster repo.

rk.msacluster subsamples the sequences in an MSA using the DBSCAN algorithm as well as uniform random sampling, and writes an .a3m file for each subsample. It assumes the first sequence in the fasta file is the query sequence.
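
To make the uniform random sampling step concrete, here is a minimal Python sketch of how a control subsample could be drawn while keeping the query as the first record. It is an illustration only, not the tool's code: the function names (read_fasta, uniform_subsample, to_a3m), the subsample size, and sampling without replacement are all assumptions.

```python
import random

def read_fasta(text):
    """Parse FASTA/a3m text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and a3m comment lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def uniform_subsample(records, n, seed=None):
    """Keep the query (first record) and add up to n other sequences.

    Assumption: sequences are drawn uniformly without replacement.
    """
    query, rest = records[0], records[1:]
    rng = random.Random(seed)
    picked = rng.sample(rest, min(n, len(rest)))
    return [query] + picked

def to_a3m(records):
    """Format records as a3m text, one header line and one sequence line each."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)
```

Calling uniform_subsample(read_fasta(text), 2) returns three records with the query first; passing the result to to_a3m gives text that can be written to one .a3m file per subsample.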

options:
  -h, --help            show this help message and exit
  -i I                  fasta/a3m file of original alignment, or path containing fasta/a3m files
  -o O                  name of output directory to write MSAs to.
  --n_controls N_CONTROLS
                        Number of control msas to generate (Default 10)
  --verbose             Print cluster info as they are generated.
  --scan                Select eps value on 1/4 of data, shuffled.
  --eps_val EPS_VAL     Use single value for eps instead of scanning.
  --resample            If included, will resample the original MSA with replacement before writing.
  --gap_cutoff GAP_CUTOFF
                        Remove sequences with gaps representing more than this frac of seq.
  --min_eps MIN_EPS     Min epsilon value to scan for DBSCAN (Default 3).
  --max_eps MAX_EPS     Max epsilon value to scan for DBSCAN (Default 20).
  --eps_step EPS_STEP   step for epsilon scan for DBSCAN (Default 0.5).
  --min_samples MIN_SAMPLES
                        Default min_samples for DBSCAN (Default 3, recommended no lower than that).
  --run_PCA             Run PCA on one-hot embedding of sequences and store in output_cluster_metadata.tsv
  --run_TSNE            Run TSNE on one-hot embedding of sequences and store in output_cluster_metadata.tsv
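
The epsilon scan options above (--scan, --min_eps, --max_eps, --eps_step, --min_samples) can be pictured with the sketch below. It is not the tool's implementation and rests on several assumptions: sequences are compared by the Euclidean distance between their one-hot encodings (which for equal-length aligned sequences equals sqrt(2 x Hamming distance)), the scan keeps the eps value that yields the most clusters, and DBSCAN is written out by hand in plain Python to keep the example dependency-free. In practice a library implementation such as scikit-learn's sklearn.cluster.DBSCAN would normally be used. The names dbscan and scan_eps are hypothetical.

```python
import math
import random

def hamming(a, b):
    """Number of mismatched columns between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))

def dbscan(seqs, eps, min_samples):
    """Plain DBSCAN; returns one label per sequence, -1 meaning noise."""
    n = len(seqs)
    # Assumption: Euclidean distance on one-hot encodings, i.e. sqrt(2 * Hamming).
    neighbors = [
        [j for j in range(n) if math.sqrt(2 * hamming(seqs[i], seqs[j])) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:  # not a core point
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        stack = list(neighbors[i])
        while stack:
            j = stack.pop()
            if labels[j] == -1:  # noise reachable from a core point joins as border
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:  # expand only through core points
                stack.extend(neighbors[j])
    return labels

def scan_eps(seqs, min_eps=3.0, max_eps=20.0, eps_step=0.5, min_samples=3, seed=0):
    """Scan eps on a shuffled quarter of the data; return (eps, n_clusters).

    Assumption: the best eps is the first one giving the most clusters.
    """
    rng = random.Random(seed)
    subset = rng.sample(seqs, max(1, len(seqs) // 4))
    best_eps, best_n = min_eps, -1
    n_steps = int(round((max_eps - min_eps) / eps_step))
    for k in range(n_steps + 1):
        eps = min_eps + k * eps_step
        labels = dbscan(subset, eps, min_samples)
        n_clusters = len({lab for lab in labels if lab != -1})
        if n_clusters > best_n:
            best_eps, best_n = eps, n_clusters
    return best_eps, best_n
```

The pairwise neighbor search is quadratic in the number of sequences and is repeated for every eps value, so this sketch only suits small alignments. Passing --eps_val would correspond to skipping scan_eps and calling dbscan once with the chosen value.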

typical command example:

rk.msacluster v1 -i alignments/ -o msaclusters --run_TSNE
