Latent State of a Human Serpin with MSA Subsampling
Using subsampling and data scoring before gradient-driven refinement
This tutorial walks through the reproduction of Figure 4 from our paper:
"Synergistic Integration of MSA Subsampling with Gradient-Based Optimization"
We demonstrate how ROCKET enables recovery of the latent conformation of the human serpin PAI-1 (PDB ID 1LJ5) using MSA subsampling and scoring-based selection of MSA subsamples, followed by gradient-based refinement.
The AlphaFold2 prediction for this serpin from the full MSA yields its metastable active conformation. The experimental structure instead captures its latent state (PDB ID 1LJ5). This example illustrates how to run MSA subsampling with rk.msacluster to improve sampling of alternate conformations, score predictions with rk.score based on experimental likelihood, and finally perform gradient-based refinement toward the native structure with rk.refine.
The latent state conformation (PDB ID 1LJ5) is not accessible by gradient descent alone when ROCKET refinement is started from the full MSA prediction.
You will see a folder organized in the following manner:
For reproducibility, we have prepared all the necessary files in the 1lj5_preprocessing_outputs folder. See the X-ray tutorial and the API documentation for rk.preprocess if you want to do the preparation from scratch.
2. Subsample the MSA
Make sure you have activated the conda/mamba environment with ROCKET installed. Then change directory into data_for_1lj5/1lj5_preprocessing_outputs and run the following command to do MSA subsampling.
This subsamples the MSA and creates a serpin_clusters/ folder containing the subsampled *.a3m files. Files starting with subv1_ are subsamples clustered with DBSCAN; those starting with U10 or U100 are MSAs of 10 or 100 sequences sampled uniformly at random.
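As a quick sanity check on the output folder, the naming convention above can be turned into a small helper. This is a minimal sketch, not part of ROCKET: the function names and category labels are made up for illustration, and only the prefixes (subv1_, U10, U100) come from the description above.

```python
from collections import Counter
from pathlib import Path


def classify_subsample(filename: str) -> str:
    """Label an .a3m file using only its filename prefix (illustrative helper)."""
    name = Path(filename).name
    if name.startswith("subv1_"):
        return "dbscan_cluster"
    # Test U100 before U10, because "U100" also starts with "U10".
    if name.startswith("U100"):
        return "uniform_100"
    if name.startswith("U10"):
        return "uniform_10"
    return "other"


def summarize_clusters(cluster_dir: str) -> Counter:
    """Count the *.a3m files of each kind found directly inside cluster_dir."""
    return Counter(
        classify_subsample(path.name) for path in Path(cluster_dir).glob("*.a3m")
    )
```

Calling summarize_clusters("serpin_clusters") from the working directory should then report how many clustered and uniformly sampled MSAs were written.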
3. Score the different MSA subsamples against the experimental data
Staying in the 1lj5_preprocessing_outputs folder, run the following command to score the MSA subsamples against the experimental data.
This can take some time, depending on the number of subsampled MSAs you have. The default --init_recycling in ROCKET is 4, which is a decent compromise between speed and the geometry and quality of predictions, especially those from shallow MSAs. For the serpin case in the paper, however, we used 20, which was AF2's default. After running, you should see a folder named serpin_scores. This folder contains:
• A CSV file with experimental scores (LLGs), mean pLDDT, depth and R-factors for each subsampled MSA prediction.
• The prediction PDB files for each subsampled MSA.
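To pick out the best-scoring subsamples, the CSV can be sorted by its LLG column. The sketch below uses only the Python standard library. The function name is hypothetical, and the exact column headers in the CSV are not listed in this tutorial, so the column name is passed in as an argument rather than hard-coded; check the header row of your file first.

```python
import csv


def rank_by_column(csv_path, column, top_n=5):
    """Return the top_n rows of a CSV, sorted by `column` in descending order.

    `column` must match a header in the file exactly; the header names used
    by rk.score are an assumption here and should be checked in your output.
    """
    with open(csv_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    if rows and column not in rows[0]:
        raise KeyError(f"{column!r} not found; available columns: {list(rows[0])}")
    # Higher is better for a log-likelihood gain, so sort in descending order.
    rows.sort(key=lambda row: float(row[column]), reverse=True)
    return rows[:top_n]
```

For example, rank_by_column(path_to_scores_csv, "LLG") would list the five highest-LLG predictions, provided the LLG column is actually named "LLG" in your file.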
We can identify a cluster that scores highly in LLG and is indeed closer in conformation to the experimentally resolved structure:
Ranking predictions from MSA subsamples by their experimental likelihoods identifies a better starting model for gradient-based refinement (samples are ordered along the x-axis by increasing gains in experimental likelihood), unlike pLDDT-based scoring, which does not (highest samples along the y-axis do not resemble the experimental conformation).
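The contrast between the two rankings can be made explicit by placing each sample's LLG rank next to its pLDDT rank. This is a minimal sketch with hypothetical function names; it assumes you have already read the scores into two dictionaries keyed by sample name, and that both dictionaries contain the same samples.

```python
def rank_positions(scores):
    """Map each sample name to its 1-based rank (1 = highest score)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def compare_rankings(llg, plddt):
    """Pair each sample's LLG rank with its pLDDT rank.

    Both arguments are dicts of sample name -> score and are assumed
    to share the same keys.
    """
    llg_rank = rank_positions(llg)
    plddt_rank = rank_positions(plddt)
    return {name: (llg_rank[name], plddt_rank[name]) for name in llg}
```

A sample whose pair reads (1, 25), say, would be the best candidate by experimental likelihood while sitting far down the pLDDT ordering, which is the situation described above.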
Note: The results from rk.msacluster and rk.score are somewhat stochastic (e.g., due to random seeds in subsampling and rigid body refinement). Therefore, our msa_scoring.csv may not be exactly reproducible in a new run. However, we consistently see that a prediction with an inserted loop appears among the top results across different seeds. You should see this as well; please let us know if you don't!
4. Further improve the structure with gradient-driven refinement
For reproducibility, we attached the subsampled MSAs and their scores from our run, in repro_msaclusters/ and repro_msascore/. Among these, we found the v1_034.a3m MSA to have the highest LLG score.
We therefore use the v1_034.a3m prediction for further gradient-driven refinement with rk.refine, run from the 1lj5_preprocessing_outputs directory.
We include the config file refine_config.yml in the Zenodo folder, with the settings below:
The refinement results will be saved in ROCKET_outputs. We also include the refinement trajectory from our previous run in the ROCKET_outputs/e80938425e
Gradient-based refinement of the best starting prediction from the MSA subsamples results in a structure that closely resembles the latent conformation.
Finalize geometry and B-factors
We always append a brief standard refinement run (phenix.refine was used in the paper) to refine B-factors and polish the geometry of the models that come straight out of ROCKET.
# Arguments:
#   "subv1"        name of this run
#   -i             path to the full MSA files
#   -o             path where the subsamples are saved
#   --n_controls   number of uniformly sampled controls to include
#   --run_TSNE     run t-SNE on the one-hot embedding of sequences and store it in output_cluster_metadata.tsv
rk.msacluster \
    "subv1" \
    -i ./alignments \
    -o ./serpin_clusters \
    --n_controls 30 \
    --run_TSNE
# Arguments:
#   .                  working path
#   1lj5               file ID
#   -i                 prefix for the MSAs to use; the working path is prepended
#   -o                 name of the output directory; the working path is prepended
#   --init_recycling   number of initial recycling iterations; set to 1 for a faster run
#   --score_fullmsa    also score the full MSA, assumed to be in <working path>/alignments/
#   --datamode         whether this is a cryoem or xray dataset
rk.score \
    . \
    1lj5 \
    -i serpin_clusters/ \
    -o serpin_scores/ \
    --init_recycling 20 \
    --score_fullmsa \
    --datamode xray