Latent State of a Human Serpin with MSA Subsampling
Using subsampling and data scoring before gradient-driven refinement
This tutorial walks through the reproduction of Figure 4 from our paper:
"Synergistic Integration of MSA Subsampling with Gradient-Based Optimization"
We demonstrate how ROCKET enables recovery of the latent conformation of the human serpin PAI-1 (PDB ID 1LJ5) using MSA subsampling and scoring-based selection of MSA subsamples, followed by gradient-based refinement.
The AlphaFold2 prediction from the full MSA yields the metastable active conformation of this serpin, whereas the experimental structure (PDB ID 1LJ5) captures its latent state. This example shows how to run MSA subsampling with rk.msacluster, score the resulting predictions with rk.score, and then refine the best starting model with rk.refine.
The latent state conformation (PDB ID 1LJ5) is not accessible by gradient descent alone when ROCKET refinement is started from the full MSA prediction.
You will see a folder organized in the following manner:
For reproducibility, we have prepared all the necessary files in 1lj5_preprocessing_outputs. If you want to generate them from scratch, follow Launch with Your Own X-ray Data and rk.preprocess.
2. Subsample the MSA
Activate the environment with ROCKET installed. Then change into data_for_1lj5/1lj5_preprocessing_outputs and run:
This subsamples the MSA and creates a serpin_clusters/ folder containing the subsampled *.a3m files. Files starting with subv1_ are subsamples clustered with DBSCAN; those starting with U10 or U100 are uniformly sampled MSAs of 10 or 100 sequences.
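To make the uniform subsamples concrete, here is a minimal Python sketch of drawing a fixed number of sequences uniformly at random from an a3m file. It is an illustration only, not the code inside rk.msacluster: the helper names (read_a3m, uniform_subsample, write_a3m), the choice to always keep the query as the first record, and counting the query toward the target depth are all assumptions.

```python
import random


def read_a3m(text):
    """Parse a3m/FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and a3m comment lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def uniform_subsample(records, n, seed=0):
    """Keep the query (first record) plus up to n-1 other sequences,
    drawn uniformly without replacement (assumed convention)."""
    if n < 1 or not records:
        raise ValueError("need n >= 1 and a non-empty alignment")
    query, rest = records[0], records[1:]
    rng = random.Random(seed)  # fixed seed makes the draw repeatable
    return [query] + rng.sample(rest, min(n - 1, len(rest)))


def write_a3m(records):
    """Serialize (header, sequence) pairs back to a3m text."""
    return "".join(f">{h}\n{s}\n" for h, s in records)


# Toy alignment, made up for illustration only.
toy = ">query\nMKV\n>s1\nMRV\n>s2\nM-V\n>s3\nMKI\n>s4\nMKL\n"
sub = uniform_subsample(read_a3m(toy), n=3, seed=1)
print(write_a3m(sub))
```

Changing the seed changes which sequences are drawn, which is one reason two subsampling runs need not produce identical files.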
3. Score the different MSA subsamples against the experimental data
Keep working in 1lj5_preprocessing_outputs. Then run:
This may take some time, depending on how many subsampled MSAs you score. The default --init_recycling value in ROCKET is 4, which is a reasonable compromise between speed and quality. For the serpin case in the paper, we used 20, which matches the original AF2 default. After the run, you should see a serpin_scores/ folder with:
A CSV file with experimental scores, mean pLDDT, MSA depth, and R-factors for each prediction.
The prediction PDB files for each subsampled MSA.
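Once the scores are written, picking a starting model comes down to sorting the CSV by LLG, highest first. The sketch below uses only the Python standard library. The column names (name, llg, mean_plddt) and the values are hypothetical placeholders; check the header of your own msa_scoring.csv and adjust them.

```python
import csv
import io


def rank_by_llg(csv_text, llg_col="llg"):
    """Return CSV rows sorted by LLG, best (highest) first.

    llg_col is a hypothetical column name; use the one in your file.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return sorted(rows, key=lambda r: float(r[llg_col]), reverse=True)


# Made-up rows standing in for a real score table.
example = """name,llg,mean_plddt
a.a3m,120.5,91.2
b.a3m,310.0,78.4
c.a3m,45.1,93.0
"""

ranked = rank_by_llg(example)
for row in ranked:
    print(row["name"], row["llg"], row["mean_plddt"])
```

To use it on a real file, read the file contents first, for example with `open("msa_scoring.csv").read()`, and pass the resulting text to rank_by_llg.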
We can identify a cluster that scores highly in LLG and whose conformation is indeed closer to the experimentally resolved one:
Ranking predictions from MSA subsamples by their experimental likelihoods identifies a better starting model for gradient-based refinement (samples are ordered along the x-axis by increasing gain in experimental likelihood). pLDDT-based scoring does not: the samples highest along the y-axis do not resemble the experimental conformation.
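One way to check on your own score table whether pLDDT and LLG rank the subsamples differently is a Spearman rank correlation between the two columns. The sketch below implements it in plain Python. The numbers are made up for illustration and are not results from the paper; a value near +1 would mean the two scores agree on the ordering, while a value near zero or below would mean they do not.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores for four subsamples (not real data).
llg = [120.5, 310.0, 45.1, 200.0]
plddt = [91.2, 78.4, 93.0, 85.0]

rho = spearman(llg, plddt)
best_by_llg = max(range(len(llg)), key=lambda i: llg[i])
best_by_plddt = max(range(len(plddt)), key=lambda i: plddt[i])
print(rho, best_by_llg, best_by_plddt)
```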
Note: The results from rk.msacluster and rk.score are somewhat stochastic. Random seeds in subsampling and rigid-body refinement both contribute. Your msa_scoring.csv will not match ours exactly. Still, we consistently see a prediction with an inserted loop among the top results across different seeds.
4. Further improve the structure with gradient-driven refinement
For reproducibility, we attached the subsampled MSAs and scores from our run in repro_msaclusters/ and repro_msascore/. Among these, v1_034.a3m had the highest LLG score.
We therefore use the v1_034.a3m prediction for further gradient-driven refinement in 1lj5_preprocessing_outputs:
We include the config file refine_config.yml in the Zenodo folder, with the settings below:
The refinement results will be saved in ROCKET_outputs. We also include the refinement trajectory from our previous run in ROCKET_outputs/e80938425e.
Gradient-based refinement of the best starting prediction from the MSA subsamples results in a structure that closely resembles the latent conformation.
5. Finalize geometry and B-factors
We always follow ROCKET with a brief standard refinement run; in the paper we used phenix.refine. This refines the B-factors and polishes the geometry of the model that comes straight out of ROCKET.