Latent State of a Human Serpin with MSA Subsampling

Using subsampling and data scoring before gradient-driven refinement

This tutorial walks through the reproduction of Figure 4 from our paper:

"Synergistic Integration of MSA Subsampling with Gradient-Based Optimization"

We demonstrate how ROCKET recovers the latent conformation of the human serpin PAI-1 (PDB ID 1LJ5) using MSA subsampling, scoring-based selection of the subsamples, and subsequent gradient-based refinement.

The AlphaFold2 prediction for this serpin from the full MSA yields its metastable active conformation. The experimental structure instead captures its latent state (PDB ID 1LJ5). This example illustrates how to run MSA subsampling with rk.msacluster to improve sampling of alternate conformations, score predictions with rk.score based on experimental likelihood, and finally perform gradient-based refinement toward the native structure with rk.refine.

The latent state conformation (PDB ID 1LJ5) is not accessible by gradient descent alone when ROCKET refinement is started from the full MSA prediction.

1. Collect the necessary files

We have prepared ROCKET inputs of this serpin case for download at https://zenodo.org/uploads/15084558.

Download and decompress it:

tar xvf Human_Serpin_Tutorial.tar.gz

You will see a folder organized in the following manner:

data_for_1lj5/
├── 1lj5_data
├── 1lj5_fasta
├── 1lj5_preprocessing_outputs
├── alignments
└── preprocessing_command.txt

For reproducibility, we have prepared all the necessary files in the 1lj5_preprocessing_outputs folder. Check this xray tutorial and the API for rk.preprocess if you want to do the preparation from scratch.

2. Subsample the MSA

Make sure you have activated the conda/mamba environment with ROCKET installed. Then change directory into data_for_1lj5/1lj5_preprocessing_outputs and run the following command to perform MSA subsampling.

rk.msacluster \
  "subv1" \                    # Name of this run
  -i ./alignments \            # Path to the full MSA files
  -o ./serpin_clusters \       # Path to save the subsampled MSAs
  --n_controls 30 \            # Number of uniformly sampled controls to include
  --run_TSNE                   # Run t-SNE on one-hot embeddings of the sequences and store it in output_cluster_metadata.tsv

This will subsample the full MSA and create a serpin_clusters/ folder with the subsampled *.a3m files. Files starting with subv1_ are subsamples clustered with DBSCAN; those starting with U10 or U100 are uniformly random subsamples of 10 or 100 sequences.
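Before scoring, it can be useful to check the depth of each subsample. Below is a minimal Python sketch (not part of ROCKET), assuming you run it inside 1lj5_preprocessing_outputs; it relies only on the A3M convention that every sequence is introduced by a `>` header line.

```python
from pathlib import Path

def count_a3m_sequences(path):
    """Count sequences in an A3M file: each sequence starts with a '>' header line."""
    with open(path) as fh:
        return sum(1 for line in fh if line.startswith(">"))

# Report the depth of each subsampled MSA.
for a3m in sorted(Path("serpin_clusters").glob("*.a3m")):
    print(f"{a3m.name}\t{count_a3m_sequences(a3m)} sequences")
```

Shallow subsamples (e.g. the U10 controls) should report depths close to their nominal size.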

3. Score the different MSA subsamples against the experimental data

Still in the 1lj5_preprocessing_outputs folder, run the following command to score the MSA subsamples against the experimental data.

rk.score \
  . \                      # Working path
  1lj5 \                   # File ID
  -i serpin_clusters/ \    # Prefix for the MSAs to use; the working path is prepended
  -o serpin_scores/ \      # Name of the output directory; the working path is prepended
  --init_recycling 20 \    # Initial recycling; set to 1 for a fast run
  --score_fullmsa \        # Also score the full MSA, assumed to be in <working path>/alignments/
  --datamode xray          # Specify whether this is a [cryoem/xray] dataset

This could take some time depending on the number of subsampled MSAs you have. The default --init_recycling in ROCKET is 4, which is a decent compromise between speed and model geometry/prediction quality, especially for shallow MSAs. For the serpin case in the paper we used 20, which is AlphaFold2's default. After running, you should see a folder named serpin_scores. This folder contains:

• A CSV file with experimental scores (LLGs), mean pLDDT, MSA depth, and R-factors for each subsampled MSA prediction.

• The prediction PDB files for each subsampled MSA.

serpin_scores/
├── fullmsa_postRBR.pdb
├── msa_scoring.csv
├── U10-subv1_000_postRBR.pdb
├── U10-subv1_001_postRBR.pdb
└── ...
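To pick a starting model programmatically, you can rank the rows of msa_scoring.csv by LLG. The column names below ("msa", "LLG", "plddt") are illustrative assumptions; check the actual header of your CSV and adjust. A minimal stdlib sketch:

```python
import csv
import io

def top_by_llg(csv_file, n=5):
    """Rank scored MSA subsamples by LLG, highest first.

    Column names ("LLG") are assumed; adapt to your msa_scoring.csv header.
    """
    rows = list(csv.DictReader(csv_file))
    rows.sort(key=lambda r: float(r["LLG"]), reverse=True)
    return rows[:n]

# Toy table standing in for serpin_scores/msa_scoring.csv; with the real file,
# use: top_by_llg(open("serpin_scores/msa_scoring.csv"))
demo = io.StringIO("msa,LLG,plddt\nsubv1_000,120.5,88.1\nU10-subv1_001,95.2,91.4\n")
best = top_by_llg(demo)
print(best[0]["msa"])  # highest-LLG subsample
```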

We can identify a cluster that scores highly in LLG and is indeed conformationally closer to the experimentally resolved structure:

Ranking predictions from MSA subsamples by their experimental likelihoods identifies a better starting model for gradient-based refinement (samples are ordered along the x-axis by increasing gains in experimental likelihood), unlike pLDDT-based scoring, which does not (the highest-pLDDT samples along the y-axis do not resemble the experimental conformation).

Note: The results from rk.msacluster and rk.score are somewhat stochastic (e.g., due to random seeds in subsampling and rigid body refinement). Therefore, our msa_scoring.csv may not be exactly reproducible in a new run. However, we consistently see that a prediction with an inserted loop appears among the top results across different seeds. You should see this as well; please let us know if you don't!

4. Further improve the structure with gradient-driven refinement

For reproducibility, we include the subsampled MSAs and their scores from our run in repro_msaclusters/ and repro_msascore/. Among these, we found the v1_034.a3m MSA to have the highest LLG score.

We therefore use the v1_034.a3m prediction for further gradient-driven refinement with rk.refine, run from the 1lj5_preprocessing_outputs directory:

rk.refine refine_config.yml

We include the config file refine_config.yml in the Zenodo archive, with the settings below:

note: phase1_v1_034
data:
  datamode: xray
  free_flag: R-free-flags
  testset_value: 1
  min_resolution: 4.0
  max_resolution: null
  voxel_spacing: 8.0
  msa_subratio: null
  w_plddt: 0.0
  downsample_ratio: null
paths:
  path: ./
  file_id: 1lj5
  template_pdb: null
  input_msa: repro_msaclusters/v1_034.a3m
  sub_msa_path: null
  sub_delmat_path: null
  msa_feat_init_path: null
  starting_bias: null
  starting_weights: null
  uuid_hex: null
execution:
  cuda_device: 0
  num_of_runs: 3
  verbose: false
algorithm:
  bias_version: 3
  iterations: 100
  init_recycling: 20
  domain_segs: null
  optimization:
    additive_learning_rate: 0.05
    multiplicative_learning_rate: 1.0
    weight_decay: 0.0001
    batch_sub_ratio: 0.7
    number_of_batches: 1
    rbr_opt_algorithm: lbfgs
    rbr_lbfgs_learning_rate: 150.0
    smooth_stage_epochs: 50
    phase2_final_lr: 0.001
    l2_weight: 0.001
  features:
    solvent: true
    sfc_scale: true
    refine_sigmaA: true
    additional_chain: false
    bias_from_fullmsa: false
    chimera_profile: false
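If you want to scan settings (e.g. a shorter trial run or a different GPU) without hand-editing the YAML each time, the config can be loaded, modified, and re-serialized with PyYAML. A minimal sketch over a trimmed-down copy of the keys above; rk.refine itself remains the consumer of the resulting file.

```python
import yaml  # PyYAML

# Trimmed-down copy of the refine_config.yml keys shown above, for illustration.
cfg_text = """
execution:
  cuda_device: 0
  num_of_runs: 3
algorithm:
  iterations: 100
  init_recycling: 20
"""
cfg = yaml.safe_load(cfg_text)
cfg["algorithm"]["iterations"] = 50   # shorter trial run
cfg["execution"]["cuda_device"] = 1   # move to another GPU
print(yaml.safe_dump(cfg, sort_keys=False))
```

Write the modified dictionary back out with `yaml.safe_dump(cfg, open("refine_config_trial.yml", "w"), sort_keys=False)` and pass that file to rk.refine.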

The refinement results will be saved in ROCKET_outputs/. We also include the refinement trajectory from our previous run in ROCKET_outputs/e80938425e.

Gradient-based refinement of the best starting prediction from the MSA subsamples results in a structure that closely resembles the latent conformation.

5. Finalize geometry and B-factors

We always append a brief standard refinement run (phenix.refine was used in the paper) to refine B-factors and polish the geometry of the models that come straight out of ROCKET.
