rk.preprocess

ROCKET Preprocessing Command

rk.preprocess preprocesses predicted protein structures for ROCKET. It runs OpenFold inference, processes the predicted structures with Phenix, and then performs molecular replacement (X-ray data) or docking into the map (cryo-EM data).

TL;DR.

  1. A typical preprocessing command for X-ray datasets (for our human serpin case):

rk.preprocess \
  --file_id 1lj5 \
  --method xray \
  --output_dir ./1lj5_processed \
  --max_recycling_iters 20 \
  --use_deepspeed_evoformer_attention

Run it from a working directory in which the input data files are already organized as:

.
├── 1lj5_data
│   └── 1lj5-tng_withrfree.mtz
├── 1lj5_fasta
│   └── 1lj5.fasta
├── alignments
│   └── 1lj5
│       ├── bfd_uniclust_hits.a3m
│       ├── mgnify_hits.a3m
│       ├── pdb70_hits.hhr
│       └── uniref90_hits.a3m
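
Before launching the command, it can help to confirm that the inputs are in place. The following is a minimal sketch and not part of ROCKET: `check_rk_inputs` is a hypothetical helper, and it assumes the directory names follow the `<file_id>_data`, `<file_id>_fasta`, and `alignments/<file_id>` pattern seen in the 1lj5 example above. It accepts any `.mtz` file name, since the exact naming rule for the MTZ file is not stated here.

```shell
# Hypothetical pre-flight check (not part of ROCKET).
# Assumes the layout generalizes from the 1lj5 example:
#   <file_id>_data/*.mtz, <file_id>_fasta/<file_id>.fasta, alignments/<file_id>/
check_rk_inputs() {
  local file_id="$1"
  local status=0

  # At least one MTZ file in <file_id>_data
  if ! ls "${file_id}_data"/*.mtz > /dev/null 2>&1; then
    echo "missing: ${file_id}_data/*.mtz" >&2
    status=1
  fi

  # The FASTA file for the target sequence
  if [ ! -f "${file_id}_fasta/${file_id}.fasta" ]; then
    echo "missing: ${file_id}_fasta/${file_id}.fasta" >&2
    status=1
  fi

  # The directory of precomputed alignments for this target
  if [ ! -d "alignments/${file_id}" ]; then
    echo "missing: alignments/${file_id}/" >&2
    status=1
  fi

  return "$status"
}
```

For example, `check_rk_inputs 1lj5 && echo "inputs look complete"` prints the message only when all three checks pass; otherwise each missing item is reported on standard error.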

Note: If the MTZ file contains more than one set of relevant columns (e.g. intensities and their errors as well as structure factor amplitudes and their errors), Phaser chooses which set to use. Working from the intensities is likely better when they are available. To force a specific set of columns, provide an MTZ file that contains only those columns.

  2. A typical preprocessing command for Cryo-EM map datasets (for our groEL case):

It requires preexisting data files organized as:

Input Parameters

| Argument | Description |
| --- | --- |
| --file_id | Identifier for input files. |
| --resolution | Best resolution of the cryo-EM map. Not used for the X-ray case. |
| --method | Choose "xray" (calls Phaser) or "cryoem" (calls EMPlacement). |
| --output_dir | Directory to store results (default: "preprocessing_output"). |
| --precomputed_alignment_dir | Path to OpenFold precomputed alignments (default: "alignments/"). |
| --max_recycling_iters | Number of recycling iterations for the initial predictions (default: 4). |
| --use_deepspeed_evoformer_attention | Flag that enables the DeepSpeed evoformer attention layer. Requires deepspeed to be installed in the environment. |
| --jax_params_path | Path to the JAX parameter file ("params_model_1_ptm.npz"). Default: None, in which case the system environment variable $OPENFOLD_RESOURCES is used. |
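
To illustrate how these arguments combine, here is a small sketch that prints an X-ray command without running it. `print_xray_cmd` is a hypothetical helper, not part of ROCKET. It only encodes what is stated above about --jax_params_path: when no path is given, $OPENFOLD_RESOURCES must be set. How rk.preprocess locates the parameter file under that variable is not covered here, so the helper checks only that the variable is non-empty.

```shell
# Hypothetical helper (not part of ROCKET): print an X-ray rk.preprocess
# command built from the documented flags, without running it.
# Usage: print_xray_cmd FILE_ID OUTPUT_DIR [JAX_PARAMS_PATH]
print_xray_cmd() {
  local file_id="$1"
  local output_dir="$2"
  local jax_params="${3:-}"

  local cmd="rk.preprocess --file_id ${file_id} --method xray --output_dir ${output_dir}"

  if [ -n "$jax_params" ]; then
    # An explicit parameter file was supplied
    cmd="${cmd} --jax_params_path ${jax_params}"
  elif [ -z "${OPENFOLD_RESOURCES:-}" ]; then
    # No explicit path, and the documented fallback variable is unset
    echo "neither a JAX parameter path nor \$OPENFOLD_RESOURCES is set" >&2
    return 1
  fi

  echo "$cmd"
}
```

Calling `print_xray_cmd 1lj5 ./1lj5_processed` with $OPENFOLD_RESOURCES set prints the command line for review; with the variable unset and no third argument, it returns a non-zero status instead.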

The script expects input files organized as follows:

Additional Parameters for Cryo-EM (--method cryoem)

| Argument | Description |
| --- | --- |
| --map1 | Path to half-map 1. |
| --map2 | Path to half-map 2. |
| --full_composition | FASTA file containing the sequences of everything expected to be in the reconstruction, whether or not there is a model for it. |
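
As a rough sketch of how the cryo-EM flags fit together, the hypothetical helper below (not part of ROCKET) prints a command assembled only from flags documented on this page. Whether all of these flags are required together is an assumption, and every argument value in the usage note is a placeholder rather than a file from the groEL case.

```shell
# Hypothetical helper (not part of ROCKET): print a cryo-EM rk.preprocess
# command assembled from the documented flags, without running it.
# Usage: print_cryoem_cmd FILE_ID RESOLUTION MAP1 MAP2 FULL_COMPOSITION
print_cryoem_cmd() {
  if [ "$#" -ne 5 ]; then
    echo "usage: print_cryoem_cmd FILE_ID RESOLUTION MAP1 MAP2 FULL_COMPOSITION" >&2
    return 1
  fi

  # One documented flag per positional argument, in the order given above
  printf 'rk.preprocess --file_id %s --method cryoem --resolution %s --map1 %s --map2 %s --full_composition %s\n' \
    "$1" "$2" "$3" "$4" "$5"
}
```

For instance, `print_cryoem_cmd my_target 3.0 half_map_1.mrc half_map_2.mrc full_composition.fasta` prints a single command line that can be inspected before it is run by hand.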

Optional Arguments

| Argument | Description |
| --- | --- |
| --predocked_model | Path to an already docked model (default: None). |
| --fixed_model | Optional fixed model contribution (default: None). |

Outputs

After execution, results will be structured in the --output_dir directory:
