# Latent State of a Human Serpin with MSA Subsampling

This tutorial walks through the reproduction of **Figure S7** from our paper:

**"Synergistic Integration of MSA Subsampling with Gradient-Based Optimization"**

We demonstrate how ROCKET enables recovery of the **latent conformation** of the human serpin **PAI-1 (PDB ID 1LJ5)** using MSA subsampling and scoring-based selection of MSA subsamples, followed by gradient-based refinement.

The AlphaFold2 prediction for this serpin from the full MSA yields its **metastable active conformation**, whereas the experimental structure (PDB ID 1LJ5) captures its **latent state**. This example shows how to run MSA subsampling with `rk.msacluster`, score the resulting predictions with `rk.score`, and then refine the best starting model with `rk.refine`.

<figure><img src="/files/R4NnsQlXJZJkMO6HpyjC" alt=""><figcaption><p>The latent state conformation (PDB ID 1LJ5) is not accessible by gradient descent alone when ROCKET refinement is started from the full MSA prediction.</p></figcaption></figure>

### 1. Collect the required files

The ROCKET inputs for this serpin case are available for download at <https://zenodo.org/records/19368349>.

Download the archive and decompress it:

```bash
tar xvf 1lj5_Human_Serpin_Tutorial.tar.gz
```

You will see a folder organized in the following manner:

```
data_for_1lj5/
├── 1lj5_data
├── 1lj5_fasta
├── 1lj5_preprocessing_outputs
├── alignments
└── preprocess.sh
```
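Before moving on, it can help to confirm that the archive unpacked completely. The following is a minimal sketch that checks for the entries shown in the listing above; `check_layout` is a hypothetical helper name and is not part of ROCKET.

```bash
# Report which of the expected top-level entries are missing after extraction.
# The entry names are copied from the folder listing above.
check_layout() {
  local root="${1:-data_for_1lj5}" missing=0 entry
  for entry in 1lj5_data 1lj5_fasta 1lj5_preprocessing_outputs alignments preprocess.sh; do
    if [ ! -e "$root/$entry" ]; then
      echo "missing: $root/$entry" >&2
      missing=1
    fi
  done
  # Exit status 0 when everything is present, 1 otherwise.
  return "$missing"
}

# Usage: check_layout data_for_1lj5 && echo "layout OK"
```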

For reproducibility, we have prepared all the necessary files in `1lj5_preprocessing_outputs`. If you want to generate them from scratch, follow [Launch with Your Own X-ray Data](/rocket-docs/launch-with-your-own-x-ray-data.md) and [rk.preprocess](/rocket-docs/api/rk.preprocess.md).

### 2. Subsample the MSA

Activate the environment with `ROCKET` installed. Then change into `data_for_1lj5/1lj5_preprocessing_outputs` and run:

```bash
rk.msacluster \
  subv1 \
  -i ./alignments \
  -o ./serpin_clusters \
  --n_controls 30 \
  --run_TSNE
```

This subsamples the MSA and creates a `serpin_clusters/` folder containing the subsampled `*.a3m` files. Files starting with `subv1_` are subsamples clustered with DBSCAN; those starting with `U10` or `U100` are uniformly sampled MSAs of 10 or 100 sequences.
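To get a quick overview of what was generated, you can count the sequences in each subsampled MSA. The sketch below relies only on the A3M convention that every sequence record starts with a `>` header line; `a3m_depth` and `list_depths` are hypothetical helper names, and the reported depth includes the query sequence.

```bash
# Count sequences in an A3M file: every record starts with a '>' header line.
a3m_depth() {
  grep -c '^>' "$1"
}

# Print "<depth><TAB><file name>" for every .a3m file in a directory,
# shallowest MSA first.
list_depths() {
  local dir="$1" f
  for f in "$dir"/*.a3m; do
    [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
    printf '%s\t%s\n' "$(a3m_depth "$f")" "$(basename "$f")"
  done | sort -n
}

# Usage: list_depths serpin_clusters
```

If the prefixes behave as described above, the `U10` and `U100` files should appear with depths close to 10 and 100, while the `subv1_` clusters vary in size.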

### 3. Score the different MSA subsamples against the experimental data

Keep working in `1lj5_preprocessing_outputs`. Then run:

```bash
rk.score \
  . \
  1lj5 \
  -i serpin_clusters/ \
  -o serpin_scores/ \
  --init_recycling 20 \
  --score_fullmsa \
  --datamode xray
```

This may take some time, depending on how many subsampled MSAs you score. The default `--init_recycling` value in ROCKET is `4`, a reasonable compromise between speed and quality. For the serpin case in the paper, we used `20`, which matches the original AF2 default. After the run, you should see a `serpin_scores/` folder containing:

* A CSV file with experimental scores, mean pLDDT, MSA depth, and R-factors for each prediction.
* The prediction PDB files for each subsampled MSA.

```
serpin_scores/
├── fullmsa_postRBR.pdb
├── msa_scoring.csv
├── U10-subv1_000_postRBR.pdb
├── U10-subv1_001_postRBR.pdb
└── ...
```

Ranking the predictions by log-likelihood gain (LLG) identifies a high-scoring cluster whose conformation is indeed closer to the experimentally resolved one:

<figure><img src="/files/TlVY79BBouc1aLzSUXPK" alt=""><figcaption><p>Ranking predictions from MSA subsamples by their experimental likelihoods identifies a better starting model for gradient-based refinement (samples are ordered along the x-axis by increasing gains in experimental likelihood), unlike pLDDT-based scoring, which does not (highest samples along the y-axis do not resemble the experimental conformation).</p></figcaption></figure>

**Note:** The results from `rk.msacluster` and `rk.score` are somewhat stochastic. Random seeds in subsampling and rigid-body refinement both contribute. Your `msa_scoring.csv` will not match ours exactly. Still, we consistently see a prediction with an inserted loop among the top results across different seeds.
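Because your scores will differ from ours, you will need to find the top-ranked subsample in your own `msa_scoring.csv`. The following is a minimal sketch that sorts the CSV by a named column, highest first. The column names are not listed on this page, so `LLG` in the usage line is an assumption: inspect the header with `head -n 1 serpin_scores/msa_scoring.csv` first. `rank_by_column` is a hypothetical helper, and it assumes a header row and no quoted commas inside fields.

```bash
# Print the data rows of a CSV sorted by a named column, highest value first.
rank_by_column() {
  local csv="$1" column="$2" idx
  # Find the 1-based index of the requested column in the header row.
  idx=$(head -n 1 "$csv" | tr ',' '\n' | grep -n -x -F -- "$column" | head -n 1 | cut -d: -f1)
  if [ -z "$idx" ]; then
    echo "column '$column' not found in $csv" >&2
    return 1
  fi
  # Skip the header, then sort on that field only; -g handles floats and
  # negative values, -r puts the largest value first.
  tail -n +2 "$csv" | sort -t, -k"$idx","$idx" -g -r
}

# Usage (column name is an assumption, check your header):
#   rank_by_column serpin_scores/msa_scoring.csv LLG | head -n 5
```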

### 4. Further improve the structure with gradient-driven refinement

For reproducibility, we include the subsampled MSAs and scores from our run in `repro_msaclusters/` and `repro_msascore/`. Among these, `v1_034.a3m` had the highest LLG score.

We therefore start gradient-driven refinement from the `v1_034.a3m` prediction. From `1lj5_preprocessing_outputs`, run:

```bash
rk.refine refine_config.yml
```

We include the config file `refine_config.yml` in the Zenodo archive, with the following settings:

```yaml
note: phase1_v1_034
data:
  datamode: xray
  free_flag: R-free-flags
  testset_value: 1
  min_resolution: 4.0
  max_resolution: null
  voxel_spacing: 8.0
  msa_subratio: null
  w_plddt: 0.0
  downsample_ratio: null
paths:
  path: ./
  file_id: 1lj5
  input_pdb: ./ROCKET_inputs/1lj5-pred-aligned.pdb
  template_pdb: null
  input_msa: serpin_clusters/v1_034.a3m
  sub_msa_path: null
  sub_delmat_path: null
  msa_feat_init_path: null
  starting_bias: null
  starting_weights: null
  uuid_hex: null
execution:
  cuda_device: 0
  num_of_runs: 3
  verbose: false
algorithm:
  bias_version: 3
  iterations: 100
  init_recycling: 20
  domain_segs: null
  optimization:
    additive_learning_rate: 0.05
    multiplicative_learning_rate: 1.0
    weight_decay: 0.0001
    batch_sub_ratio: 0.7
    number_of_batches: 1
    rbr_opt_algorithm: lbfgs
    rbr_lbfgs_learning_rate: 150.0
    smooth_stage_epochs: 50
    phase2_final_lr: 0.001
    l2_weight: 0.001
  features:
    solvent: true
    sfc_scale: true
    refine_sigmaA: true
    additional_chain: false
    total_chain_copy: 1.0
    bias_from_fullmsa: false
    chimera_profile: false
alphafold:
  use_deepspeed_evo_attention: true
monitoring:
  use_wandb: false
  wandb_project: null
  wandb_entity: null
  wandb_name: null
  wandb_tags: null
  wandb_notes: null
```
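Since `input_msa` must point to an existing subsampled MSA, a quick check before launching can save a failed run. The sketch below is a plain-text lookup rather than a real YAML parser, and it assumes that relative paths in the config are resolved against the directory you run `rk.refine` from. `yaml_value` and `check_input_msa` are hypothetical helpers, not ROCKET commands.

```bash
# Extract the value of a simple "key: value" line from a YAML file.
# Assumes the key appears once, on its own line, with an unquoted scalar value.
yaml_value() {
  local key="$1" file="$2"
  sed -n "s/^[[:space:]]*${key}:[[:space:]]*//p" "$file" | head -n 1
}

# Fail early if the MSA named by input_msa is unset or missing on disk.
check_input_msa() {
  local config="$1" msa
  msa=$(yaml_value input_msa "$config")
  if [ -z "$msa" ] || [ "$msa" = "null" ]; then
    echo "input_msa is not set in $config" >&2
    return 1
  fi
  if [ ! -f "$msa" ]; then
    echo "input_msa points to a missing file: $msa" >&2
    return 1
  fi
  echo "$msa"
}

# Usage: check_input_msa refine_config.yml && rk.refine refine_config.yml
```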

The refinement results will be saved in `ROCKET_outputs`. We also include the refinement trajectory from our previous run in `ROCKET_outputs/e80938425e`.
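If you launch several refinements, it can be convenient to locate the most recent one. This is a small sketch that assumes each run writes to its own subdirectory of `ROCKET_outputs`, as the `e80938425e` folder from our run suggests; `latest_run` is a hypothetical helper.

```bash
# Print the most recently modified run directory under an output folder.
latest_run() {
  local out="${1:-ROCKET_outputs}" d
  # The trailing slash in the pattern restricts the match to directories;
  # ls -t orders them newest first.
  d=$(ls -td "$out"/*/ 2>/dev/null | head -n 1)
  [ -n "$d" ] || return 1
  printf '%s\n' "${d%/}"
}

# Usage: ls "$(latest_run ROCKET_outputs)"
```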

<figure><img src="/files/vXtLIWjnqE3S0rJKcNqZ" alt=""><figcaption><p>Gradient-based refinement of the best starting prediction from the MSA subsamples results in a structure that closely resembles the latent conformation.</p></figcaption></figure>

### 5. Finalize geometry and B-factors

We always follow ROCKET with a brief standard refinement run; in the paper we used `phenix.refine`. This refines B-factors and polishes the geometry of the model that comes straight out of ROCKET.
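As a rough sketch of this last step, the wrapper below checks its inputs and then prints, or optionally runs, a default `phenix.refine` job, which accepts a model file and a reflection file on the command line. The two file arguments are placeholders: the names of the ROCKET output model and of the reflection file are not listed on this page, and the refinement settings we used in the paper are not reproduced here. `polish_model` and the `DRY_RUN` variable are hypothetical.

```bash
# Build (and optionally run) a default phenix.refine job on a ROCKET output model.
# Both arguments are placeholders: pass your own model and reflection file.
polish_model() {
  local model="$1" mtz="$2" f
  for f in "$model" "$mtz"; do
    if [ ! -f "$f" ]; then
      echo "file not found: $f" >&2
      return 1
    fi
  done
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Default: only print the command so it can be inspected first.
    echo "phenix.refine $model $mtz"
  else
    phenix.refine "$model" "$mtz"
  fi
}

# Usage: DRY_RUN=0 polish_model <rocket_output_model.pdb> <reflections.mtz>
```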
