Latent State of a Human Serpin with MSA Subsampling
Using subsampling and data scoring before gradient-driven refinement
This tutorial walks through the reproduction of Figure 4 from our paper:
"Synergistic Integration of MSA Subsampling with Gradient-Based Optimization"
We demonstrate how ROCKET enables recovery of the latent conformation of the human serpin PAI-1 (PDB ID 1LJ5) using MSA subsampling and scoring-based selection of MSA subsamples, followed by gradient-based refinement.
The AlphaFold2 prediction for this serpin from the full MSA yields its metastable active conformation. The experimental structure instead captures its latent state (PDB ID 1LJ5). This example illustrates how to run MSA subsampling with rk.msacluster to improve sampling of alternate conformations, score predictions with rk.score based on experimental likelihood, and finally perform gradient-based refinement toward the native structure with rk.refine.

1. Collect the necessary files
We have prepared ROCKET inputs of this serpin case for download at https://zenodo.org/uploads/15084558.
Download and decompress it:
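If you prefer to decompress from Python rather than the shell, a minimal sketch could look like the following. The helper names `unpack` and `list_top_level` are hypothetical and not part of ROCKET, and the archive path is left as a parameter because it depends on the file name you downloaded from Zenodo.

```python
import shutil
from pathlib import Path


def unpack(archive_path, dest_dir):
    """Extract an archive (zip, tar, tar.gz, ...) and return the destination path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # unpack_archive infers the archive format from the file extension.
    shutil.unpack_archive(str(archive_path), str(dest))
    return dest


def list_top_level(folder):
    """Return the sorted names of the entries directly inside `folder`."""
    return sorted(p.name for p in Path(folder).iterdir())
```

Calling `list_top_level` on the extracted folder is a quick way to compare what you have against the expected layout.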
You will see a folder organized in the following manner:
For reproducibility, we have prepared all the necessary files in the 1lj5_preprocessing_outputs folder. See the xray tutorial and the API documentation for rk.preprocess if you want to do the preparation from scratch.
2. Subsample the MSA
Make sure you have activated the conda/mamba environment with ROCKET installed. Then change directory into data_for_1lj5/1lj5_preprocessing_outputs and run the following command to do MSA subsampling.
This creates a serpin_clusters/ folder containing the subsampled *.a3m files. Files starting with subv1_ are subsamples clustered with DBSCAN; those starting with U10 or U100 are uniformly random subsamples of 10 or 100 sequences.
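To make the idea behind the U10 and U100 files concrete, here is a minimal Python sketch of uniform random subsampling of an A3M alignment. It is not ROCKET's implementation of rk.msacluster: the function names `read_a3m`, `uniform_subsample`, and `write_a3m` are hypothetical. The choice to always keep the first (query) sequence and to count it toward the requested size is also an assumption made for illustration.

```python
import random


def read_a3m(text):
    """Parse A3M/FASTA-style text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def uniform_subsample(records, n_sequences, seed=0):
    """Keep the query (first record) and draw the rest uniformly at random.

    Assumption: n_sequences includes the query sequence.
    """
    if not records:
        return []
    query, others = records[0], records[1:]
    rng = random.Random(seed)  # local RNG so the draw is reproducible
    n_extra = min(max(n_sequences - 1, 0), len(others))
    return [query] + rng.sample(others, n_extra)


def write_a3m(records):
    """Serialize (header, sequence) pairs back to A3M text."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)
```

Fixing the seed makes a given subsample reproducible, while varying it gives different shallow MSAs from the same alignment.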
3. Score the different MSA subsamples against the experimental data
Staying in the 1lj5_preprocessing_outputs folder, run the following command to score the MSA subsamples against the experimental data.
This can take some time, depending on the number of subsampled MSAs you have. ROCKET's default for --init_recycling is 4, which is a decent compromise between speed and the geometry and quality of predictions, especially from shallow MSAs. For the serpin case in the paper, however, we used 20, which was AF2's default. After running, you should see a folder named serpin_scores. This folder contains:
• A CSV file with experimental scores (LLGs), mean pLDDT, depth, and R-factors for each subsampled MSA prediction.
• The prediction PDB files for each subsampled MSA.
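Once the CSV exists, ranking the subsamples by LLG takes only a few lines. The sketch below uses just the Python standard library. The column names `name` and `llg` are assumptions, so check the header of your own CSV and adjust them. The example rows are made-up placeholders, not results from the paper.

```python
import csv
import io


def rank_by_llg(csv_text, llg_column="llg"):
    """Return the CSV rows sorted from highest to lowest LLG."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Convert to float so values sort numerically (10 ranks above 9),
    # rather than lexicographically as strings.
    return sorted(rows, key=lambda row: float(row[llg_column]), reverse=True)


# Made-up placeholder values with hypothetical column names.
EXAMPLE_CSV = "name,llg\nsample_a,9\nsample_b,10\nsample_c,-3.5\n"

ranked = rank_by_llg(EXAMPLE_CSV)
best = ranked[0]["name"]
```

For a real run, read the file contents with `open(path).read()` and pass the LLG column name that appears in your CSV.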
We can identify a cluster that scores highly in LLG and is indeed a closer conformation to the experimentally resolved one:

Note: The results from rk.msacluster and rk.score are somewhat stochastic (e.g., due to random seeds in subsampling and rigid body refinement). Therefore, our msa_scoring.csv may not be exactly reproducible in a new run. However, we consistently see that a prediction with an inserted loop appears among the top results across different seeds. You should see this as well; please let us know if you don't!
4. Further improve the structure with gradient-driven refinement
For reproducibility, we attached the subsampled MSAs and their scores from our run, in repro_msaclusters/ and repro_msascore/. Among these, we found the v1_034.a3m MSA to have the highest LLG score.
We therefore use the v1_034.a3m prediction for further gradient-driven refinement, running rk.refine in the 1lj5_preprocessing_outputs directory.
We include the config file refine_config.yml in the Zenodo folder, with the settings below:
The refinement results will be saved in ROCKET_outputs. We also include the refinement trajectory from our previous run in ROCKET_outputs/e80938425e

Finalize geometry and B-factors
We always follow ROCKET with a brief run of standard refinement (phenix.refine was used in the paper) to refine B-factors and polish the geometry of the models that come straight out of ROCKET.