Latent State of a Human Serpin with MSA Subsampling
Using subsampling and data scoring before gradient-driven refinement
This tutorial walks through the reproduction of Figure 4 from our paper:
"Synergistic Integration of MSA Subsampling with Gradient-Based Optimization"
We demonstrate how ROCKET enables recovery of the latent conformation of the human serpin PAI-1 (PDB ID 1LJ5) using MSA subsampling and scoring-based selection of MSA subsamples, followed by gradient-based refinement.
The AlphaFold2 prediction for this serpin from the full MSA yields its metastable active conformation. The experimental structure instead captures its latent state (PDB ID 1LJ5). This example illustrates how to run MSA subsampling with rk.msacluster to improve sampling of alternate conformations, score predictions with rk.score based on experimental likelihood, and finally perform gradient-based refinement toward the native structure with rk.refine.

1. Collect the necessary files
We have prepared ROCKET inputs of this serpin case for download at https://zenodo.org/uploads/15084558.
Download and decompress it:
tar xvf Human_Serpin_Tutorial.tar.gz
You will see a folder organized in the following manner:
data_for_1lj5/
├── 1lj5_data
├── 1lj5_fasta
├── 1lj5_preprocessing_outputs
├── alignments
└── preprocessing_command.txt
For reproducibility, we have prepared all the necessary files in 1lj5_preprocessing_outputs. See the xray tutorial and the API for rk.preprocess if you want to do the preparation from scratch.
2. Subsample the MSA
Make sure you have activated the conda/mamba environment with ROCKET installed. Then change directory into data_for_1lj5/1lj5_preprocessing_outputs and run the following command to do MSA subsampling.
# Positional argument: name of this run ("subv1")
# -i: path to the full MSA files
# -o: path where the subsamples are saved
# --n_controls: number of uniformly sampled controls to include
# --run_TSNE: run t-SNE on a one-hot embedding of the sequences; stored in output_cluster_metadata.tsv
rk.msacluster \
  "subv1" \
  -i ./alignments \
  -o ./serpin_clusters \
  --n_controls 30 \
  --run_TSNE
This will subsample the MSA and create a serpin_clusters/ folder with subsampled *.a3m files. Files starting with subv1_ are subsamples clustered with DBSCAN; those starting with U10 or U100 are uniformly sampled MSAs with 10 or 100 sequences.
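As a quick sanity check on the output folder, the naming scheme described above can be turned into a small tally. This is a minimal sketch and not part of ROCKET: the helper names classify_subsample and tally_subsamples are hypothetical, and the only assumption about file names is the prefix convention stated above (subv1_, U10, U100).

```python
from collections import Counter
from pathlib import Path


def classify_subsample(filename, run_name="subv1"):
    """Classify one subsampled MSA file by its name prefix (hypothetical helper)."""
    name = Path(filename).name
    # Check "U100" before "U10": every name starting with "U100" also starts with "U10".
    if name.startswith("U100"):
        return "uniform_100"
    if name.startswith("U10"):
        return "uniform_10"
    if name.startswith(f"{run_name}_"):
        return "clustered"
    return "other"


def tally_subsamples(folder, run_name="subv1"):
    """Count the *.a3m files in `folder` per category."""
    return Counter(
        classify_subsample(path.name, run_name) for path in Path(folder).glob("*.a3m")
    )
```

Calling tally_subsamples("serpin_clusters") from 1lj5_preprocessing_outputs should then report how many clustered and uniformly sampled MSAs were written.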
3. Score the different MSA subsamples against the experimental data
Still in the 1lj5_preprocessing_outputs folder, run the following command to score the MSA subsamples against the experimental data.
# Positional arguments: working path (.) and file ID (1lj5)
# -i: prefix for the MSAs to use; the working path is prepended
# -o: name of the output directory; the working path is prepended
# --init_recycling: initial recycling; set to 1 for a fast run
# --score_fullmsa: also score the full MSA, assumed to be in <working path>/alignments/
# --datamode: whether this is a cryoem or xray dataset
rk.score \
  . \
  1lj5 \
  -i serpin_clusters/ \
  -o serpin_scores/ \
  --init_recycling 20 \
  --score_fullmsa \
  --datamode xray
This could take a while, depending on the number of subsampled MSAs you have. The default --init_recycling in ROCKET is 4, which is a decent compromise between speed and the geometry and quality of predictions, especially those from shallow MSAs. For the serpin case in the paper, however, we used 20, which was AF2's default. After running, you should see a folder named serpin_scores. This folder contains:
• A CSV file with experimental scores (LLGs), mean pLDDT, depth and R-factors for each subsampled MSA prediction.
• The prediction PDB files for each subsampled MSA.
serpin_scores/
├── fullmsa_postRBR.pdb
├── msa_scoring.csv
├── U10-subv1_000_postRBR.pdb
├── U10-subv1_001_postRBR.pdb
...
We can identify a cluster that scores highly in LLG and whose prediction is indeed closer in conformation to the experimentally resolved structure.

Note: The results from rk.msacluster and rk.score are somewhat stochastic (e.g., due to random seeds in subsampling and rigid body refinement). Therefore, our msa_scoring.csv may not be exactly reproducible in a new run. However, we consistently see that a prediction with an inserted loop appears among the top results across different seeds. You should see this as well; please let us know if you don't!
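To pick out the top-scoring subsamples from msa_scoring.csv, a few lines of standard-library Python are enough. The sketch below is illustrative only: the column names (msa_name, LLG) and the numbers in the example table are made up, so check the header of your own msa_scoring.csv and pass the real column name.

```python
import csv
import io


def rank_by_score(csv_text, score_column, top=5):
    """Return the `top` rows with the highest value in `score_column`."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: float(row[score_column]), reverse=True)
    return rows[:top]


# Made-up table for illustration only; column names and values are assumptions.
example_csv = """msa_name,LLG
fullmsa,10.0
subv1_000,55.5
U10-subv1_001,-3.2
"""

best = rank_by_score(example_csv, score_column="LLG", top=2)
print([row["msa_name"] for row in best])
```

For a real run, read the file first, for example with pathlib.Path("serpin_scores/msa_scoring.csv").read_text(), and pass the text to rank_by_score. Because of the stochasticity noted above, it is worth inspecting several of the top rows rather than only the first.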
4. Further improve the structure with gradient-driven refinement
For reproducibility, we attached the subsampled MSAs and their scores from our run in repro_msaclusters/ and repro_msascore/. Among these, we found the v1_034.a3m MSA to have the highest LLG score.
We therefore use the v1_034.a3m prediction for further gradient-driven refinement, running rk.refine in the 1lj5_preprocessing_outputs directory:
rk.refine refine_config.yml
We include the config file refine_config.yml in the Zenodo folder, with the settings below:
note: phase1_v1_034
data:
  datamode: xray
  free_flag: R-free-flags
  testset_value: 1
  min_resolution: 4.0
  max_resolution: null
  voxel_spacing: 8.0
  msa_subratio: null
  w_plddt: 0.0
  downsample_ratio: null
paths:
  path: ./
  file_id: 1lj5
  template_pdb: null
  input_msa: repro_msaclusters/v1_034.a3m
  sub_msa_path: null
  sub_delmat_path: null
  msa_feat_init_path: null
  starting_bias: null
  starting_weights: null
  uuid_hex: null
execution:
  cuda_device: 0
  num_of_runs: 3
  verbose: false
algorithm:
  bias_version: 3
  iterations: 100
  init_recycling: 20
  domain_segs: null
  optimization:
    additive_learning_rate: 0.05
    multiplicative_learning_rate: 1.0
    weight_decay: 0.0001
    batch_sub_ratio: 0.7
    number_of_batches: 1
    rbr_opt_algorithm: lbfgs
    rbr_lbfgs_learning_rate: 150.0
    smooth_stage_epochs: 50
    phase2_final_lr: 0.001
    l2_weight: 0.001
  features:
    solvent: true
    sfc_scale: true
    refine_sigmaA: true
    additional_chain: false
    bias_from_fullmsa: false
    chimera_profile: false
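If you would rather refine the top-scoring subsample from your own run than v1_034.a3m, the config presumably only needs a different input_msa (and, for bookkeeping, a different note). The sketch below edits those two lines as plain text, so it needs no YAML library. The helper retarget_config and the file name subv1_012.a3m are hypothetical, and whether other settings should change for a different MSA is not covered here.

```python
import re


def retarget_config(config_text, msa_path, note):
    """Return a copy of the config text with `note` and `input_msa` replaced."""

    def replace_key(text, key, value):
        # Match "<indent>key: anything" on a single line, keeping the indent.
        pattern = re.compile(rf"^([ \t]*){re.escape(key)}:.*$", re.MULTILINE)
        new_text, count = pattern.subn(
            lambda match: f"{match.group(1)}{key}: {value}", text, count=1
        )
        if count != 1:
            raise KeyError(f"key not found in config: {key}")
        return new_text

    text = replace_key(config_text, "note", note)
    return replace_key(text, "input_msa", msa_path)


# Shortened stand-in for refine_config.yml; the new MSA name is a placeholder.
sample = "note: phase1_v1_034\npaths:\n  input_msa: repro_msaclusters/v1_034.a3m\n"
print(retarget_config(sample, "serpin_clusters/subv1_012.a3m", "phase1_subv1_012"))
```

Writing the result to a new file (rather than overwriting refine_config.yml) keeps the provided reference configuration intact.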
The refinement results will be saved in ROCKET_outputs. We also include the refinement trajectory from our previous run in ROCKET_outputs/e80938425e.

Finalize geometry and B-factors
We always append a brief standard run of refinement (phenix.refine was used in the paper) to refine B-factors and polish the geometry of the models that come straight out of ROCKET.