rk.preprocess

ROCKET Preprocessing Command

rk.preprocess prepares predicted protein structures for ROCKET. It runs OpenFold inference, processes the predicted structures with Phenix, and then performs molecular replacement (X-ray) or docking (cryo-EM).

TL;DR.

  1. A typical preprocessing command for X-ray datasets (for our human serpin case):

rk.preprocess \
  --file_id 1lj5 \
  --method xray \
  --output_dir ./1lj5_processed \
  --xray_data_label FP,SIGFP \
  --max_recycling_iters 20 \
  --use_deepspeed_evoformer_attention

It must be run from a working directory that already contains the data files, organized as:

.
├── 1lj5_data
│   └── 1lj5-tng_withrfree.mtz
├── 1lj5_fasta
│   └── 1lj5.fasta
└── alignments
    └── 1lj5
        ├── bfd_uniclust_hits.a3m
        ├── mgnify_hits.a3m
        ├── pdb70_hits.hhr
        └── uniref90_hits.a3m

  2. A typical preprocessing command for Cryo-EM map datasets (for our GroEL case):

rk.preprocess \
  --file_id 8p4pA \
  --resolution 9.6 \
  --method cryoem \
  --output_dir 8p4pA_processed \
  --predocked_model 8p4pA_data/8p4pA_docked.pdb \
  --fixed_model 8p4pA_data/8p4pA-fixed-model-forchainA.pdb \
  --map1 8p4pA_data/emd_17425_half_map_1.map \
  --map2 8p4pA_data/emd_17425_half_map_2.map \
  --max_recycling_iters 20 \
  --use_deepspeed_evoformer_attention

It requires preexisting data files organized as:

.
├── 8p4pA_fasta
│   └── 8p4pA.fasta
├── alignments
│   └── 8p4pA
│       ├── bfd_uniclust_hits.a3m
│       ├── pdb70_hits.hhr
│       └── uniref90_hits.a3m
└── 8p4pA_data
    ├── emd_17425_half_map_1.map
    ├── emd_17425_half_map_2.map
    ├── 8p4pA-fixed-model-forchainA.pdb
    └── 8p4pA_docked.pdb

Input Parameters

| Argument | Description |
| --- | --- |
| --file_id | Identifier for input files. |
| --resolution | Best resolution of the cryo-EM map. Not used for X-ray data. |
| --method | Choose "xray" (calls Phaser) or "cryoem" (calls EMPlacement). |
| --output_dir | Directory to store results (default: "preprocessing_output"). |
| --precomputed_alignment_dir | Path to OpenFold precomputed alignments (default: "alignments/"). |
| --max_recycling_iters | Number of recycling iterations for the initial predictions (default: 4). |
| --use_deepspeed_evoformer_attention | Flag. Use the DeepSpeed Evoformer attention layer; deepspeed must be installed in the environment. |
| --jax_params_path | Path to the JAX parameter file ("params_model_1_ptm.npz"). Default: None, in which case the environment variable $OPENFOLD_RESOURCES is used. |

The script expects input files organized as follows:

<working_directory>/
├── {file_id}_fasta/
│   └── {file_id}.fasta       # FASTA file containing the chain to refine
│                             # Header should be "> {file_id}"
│
├── {file_id}_data/
│   ├── *.mtz                 # For X-ray data
│   ├── *_half_map*.mrc       # For Cryo-EM data
│   └── <optional files>      # e.g., predicted or docked models
│
└── alignments/               # (default: --precomputed_alignment_dir)
    └── {file_id}
        └── *.a3m / *.hhr
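
The layout above can be checked before launching a long run. The following is a minimal sketch in POSIX shell, not part of ROCKET: the function names check_rocket_inputs and has_match are hypothetical, and the checks only mirror the directory tree described here. Half-map files are not checked, since their paths are passed explicitly via --map1 and --map2. The FASTA header test is deliberately lenient: it only requires a line starting with ">" that contains the file_id.

```shell
# Hypothetical helper, not part of ROCKET: verifies the input layout
# described above before rk.preprocess is launched.

# Succeeds if at least one argument is an existing path.
# Unmatched globs are passed through literally and fail the -e test.
has_match() {
    for candidate in "$@"; do
        [ -e "$candidate" ] && return 0
    done
    return 1
}

# Usage: check_rocket_inputs FILE_ID METHOD [ALIGNMENT_DIR]
check_rocket_inputs() {
    file_id=$1
    method=$2
    align_dir=${3:-alignments}   # same default as --precomputed_alignment_dir
    status=0

    fasta="${file_id}_fasta/${file_id}.fasta"
    if [ -f "$fasta" ]; then
        # The header line should start with ">" and name the file_id.
        header=$(head -n 1 "$fasta")
        case $header in
            ">"*"$file_id"*) ;;
            *) echo "unexpected FASTA header: $header" >&2; status=1 ;;
        esac
    else
        echo "missing: $fasta" >&2
        status=1
    fi

    data_dir="${file_id}_data"
    if [ ! -d "$data_dir" ]; then
        echo "missing: $data_dir/" >&2
        status=1
    elif [ "$method" = "xray" ] && ! has_match "$data_dir"/*.mtz; then
        # X-ray runs need at least one MTZ reflection file.
        echo "missing: $data_dir/*.mtz" >&2
        status=1
    fi

    # At least one precomputed alignment (.a3m or .hhr) must be present.
    if ! has_match "$align_dir/$file_id"/*.a3m "$align_dir/$file_id"/*.hhr; then
        echo "missing: alignment files in $align_dir/$file_id/" >&2
        status=1
    fi

    return $status
}
```

Run from the working directory, `check_rocket_inputs 1lj5 xray` returns 0 when the layout is complete; otherwise it lists each missing item on stderr and returns 1.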

Additional Parameters for X-ray (--method xray)

| Argument | Description |
| --- | --- |
| --xray_data_label | Reflection data labels (e.g., "FP,SIGFP"). |

Additional Parameters for Cryo-EM (--method cryoem)

| Argument | Description |
| --- | --- |
| --map1 | Path to half-map 1. |
| --map2 | Path to half-map 2. |
| --full_composition | FASTA file containing the sequences of everything expected to be in the reconstruction, whether there is a model for it or not. |

Optional Arguments

| Argument | Description |
| --- | --- |
| --predocked_model | Path to an already docked model (default: None). |
| --fixed_model | Optional fixed model contribution (default: None). |

Outputs

After execution, results are written to the --output_dir directory with the following structure:

output_dir/
├── {file_id}.fasta                 # FASTA file containing the chain to refine, copied from input
├── alignments/                     # MSA files for the input sequence, copied from input
│   └── *.a3m / *.hhr
├── predictions/                    # OpenFold structure predictions and pkl files
│   └── xxx_processed_feats.pickle  # Processed feature dict with cluster profiles
├── ROCKET_inputs/                  # Final outputs for ROCKET main trunk
│   ├── {file_id}-pred-aligned.pdb  # Aligned prediction with pseudo-Bs
│   ├── {file_id}-Edata.mtz         # Experimental data in LLG convention
├── ROCKET_config_phase1.yaml       # Automatically generated config file for rk.refine phase1 run
├── ROCKET_config_phase2.yaml       # Automatically generated config file for rk.refine phase2 run
├── processed_predicted_files/      # Processed predictions from Phenix (including trimmed confidence loops)
├── docking_outputs/                # Cryo-EM docking results (if present)
└── phaser_files/                   # X-ray molecular replacement results (if present)
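
Before moving on to rk.refine, it can be useful to confirm that the files under ROCKET_inputs/ were actually produced. The sketch below is a hypothetical POSIX shell helper (check_rocket_outputs is not a ROCKET command); it assumes only the two ROCKET_inputs/ file names shown in the listing above and reports which method-specific directory is present.

```shell
# Hypothetical helper, not part of ROCKET.
# Usage: check_rocket_outputs OUTPUT_DIR FILE_ID
check_rocket_outputs() {
    out_dir=$1
    file_id=$2
    status=0

    # Files listed under ROCKET_inputs/ in the output layout.
    for rel in \
        "ROCKET_inputs/${file_id}-pred-aligned.pdb" \
        "ROCKET_inputs/${file_id}-Edata.mtz"
    do
        if [ ! -f "$out_dir/$rel" ]; then
            echo "missing: $out_dir/$rel" >&2
            status=1
        fi
    done

    # Report which method-specific directory is present, if any.
    for dir in docking_outputs phaser_files; do
        if [ -d "$out_dir/$dir" ]; then
            echo "found: $out_dir/$dir/"
        fi
    done

    return $status
}
```

For the X-ray example, `check_rocket_outputs ./1lj5_processed 1lj5` returns 0 only when both ROCKET_inputs/ files exist; missing files are listed on stderr.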
