
# rk.preprocess

`rk.preprocess` prepares predicted protein structures for **ROCKET**. It runs **OpenFold inference**, processes the predicted structures with **Phenix**, and then performs **molecular replacement** (X-ray) or **cryo-EM docking**.

### TL;DR

1. A typical preprocessing command for **X-ray** datasets (for our human serpin case):

```bash
rk.preprocess \
  --file_id 1lj5 \
  --method xray \
  --output_dir ./1lj5_processed \
  --max_recycling_iters 20 \
  --use_deepspeed_evoformer_attention
```

Run it from a working directory that already contains the data files, organized as:

```
.
├── 1lj5_data
│   └── 1lj5-tng_withrfree.mtz
├── 1lj5_fasta
│   └── 1lj5.fasta
├── alignments
│   └── 1lj5
│       ├── bfd_uniclust_hits.a3m
│       ├── mgnify_hits.a3m
│       ├── pdb70_hits.hhr
│       └── uniref90_hits.a3m
```
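The layout above can be checked before launching a run. The following is a minimal sketch, not part of ROCKET: `has_file` and `check_xray_inputs` are hypothetical helper names, and the checks only mirror the directory tree shown above (one FASTA file, at least one MTZ file, and at least one alignment file for the given `file_id`).

```bash
# Hypothetical helpers (not part of ROCKET): verify the X-ray input layout
# for a given file_id before calling rk.preprocess.

# Succeeds if at least one of the given paths is an existing regular file.
# An unmatched glob stays a literal string and therefore fails the -f test.
has_file() {
  for f in "$@"; do
    if [ -f "$f" ]; then
      return 0
    fi
  done
  return 1
}

check_xray_inputs() {
  file_id="$1"
  missing=0

  # FASTA file containing the chain to refine
  if ! has_file "${file_id}_fasta/${file_id}.fasta"; then
    echo "missing: ${file_id}_fasta/${file_id}.fasta" >&2
    missing=1
  fi

  # At least one MTZ file with the experimental data
  if ! has_file "${file_id}_data"/*.mtz; then
    echo "missing: ${file_id}_data/*.mtz" >&2
    missing=1
  fi

  # At least one precomputed alignment (.a3m or .hhr)
  if ! has_file "alignments/${file_id}"/*.a3m "alignments/${file_id}"/*.hhr; then
    echo "missing: alignments/${file_id}/*.a3m or *.hhr" >&2
    missing=1
  fi

  return "$missing"
}
```

For example, `check_xray_inputs 1lj5 && rk.preprocess --file_id 1lj5 --method xray --output_dir ./1lj5_processed` starts preprocessing only when the layout is complete.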

**Note**: If the MTZ file provided contains more than one set of relevant columns (e.g. intensities and errors + structure factor amplitudes and errors), Phaser will pick the best set for you – it's likely better to work from the intensities if available! If you absolutely want to use a specific set of columns, you can provide an MTZ that contains only that data.

2. A typical preprocessing command for **Cryo-EM** map datasets (for our GroEL case):

```bash
rk.preprocess \
  --file_id 8p4pA \
  --resolution 9.6 \
  --method cryoem \
  --output_dir 8p4pA_processed \
  --predocked_model 8p4pA_data/8p4pA_docked.pdb \
  --fixed_model 8p4pA_data/8p4pA-fixed-model-forchainA.pdb \
  --map1 8p4pA_data/emd_17425_half_map_1.map \
  --map2 8p4pA_data/emd_17425_half_map_2.map \
  --max_recycling_iters 20 \
  --use_deepspeed_evoformer_attention
```

If you already have a `--predocked_model` and only a single post-processed map, you can use `--map` alone:

```bash
rk.preprocess \
  --file_id 8p4pA \
  --resolution 9.6 \
  --method cryoem \
  --output_dir 8p4pA_processed \
  --predocked_model 8p4pA_data/8p4pA_docked.pdb \
  --fixed_model 8p4pA_data/8p4pA-fixed-model-forchainA.pdb \
  --map 8p4pA_data/emd_17425_postprocessed.map \
  --max_recycling_iters 20 \
  --use_deepspeed_evoformer_attention
```

**Note:** `--map` alone only works when ROCKET can skip docking because you already supplied a predocked model. If ROCKET still needs to perform a docking search, you must provide both half maps with `--map1` and `--map2`.
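The rule in the note above can be written as a small decision helper. This is a sketch under stated assumptions: `select_map_args` is a hypothetical function, not a ROCKET command, and it only encodes the rule as described on this page (half maps are always acceptable; a single map is acceptable only together with a predocked model).

```bash
# Hypothetical helper: print the map flags to pass to
# rk.preprocess --method cryoem.
# Arguments (use "" for anything you do not have):
#   $1 predocked model, $2 post-processed map, $3 half map 1, $4 half map 2
select_map_args() {
  predocked="$1"
  full_map="$2"
  half1="$3"
  half2="$4"

  if [ -n "$half1" ] && [ -n "$half2" ]; then
    # Both half maps are available: valid with or without a docking search.
    echo "--map1 $half1 --map2 $half2"
  elif [ -n "$predocked" ] && [ -n "$full_map" ]; then
    # Docking can be skipped, so a single post-processed map is enough.
    echo "--map $full_map"
  else
    echo "error: a docking search needs both --map1 and --map2" >&2
    return 1
  fi
}
```

Calling `select_map_args 8p4pA_data/8p4pA_docked.pdb 8p4pA_data/emd_17425_postprocessed.map "" ""` prints `--map 8p4pA_data/emd_17425_postprocessed.map`. The output is meant to be spliced into the command line unquoted, so this sketch assumes paths without spaces.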

Both cryo-EM commands require preexisting data files organized as:

```
.
├── 8p4pA_fasta
│   └── 8p4pA.fasta
├── alignments
│   └── 8p4pA
│       ├── bfd_uniclust_hits.a3m
│       ├── pdb70_hits.hhr
│       └── uniref90_hits.a3m
├── 8p4pA_data
│   ├── emd_17425_half_map_1.map
│   ├── emd_17425_half_map_2.map
│   ├── emd_17425_postprocessed.map
│   ├── 8p4pA-fixed-model-forchainA.pdb
│   └── 8p4pA_docked.pdb
```

### Input Parameters

| Argument                              | Description                                                                                                            |
| ------------------------------------- | ---------------------------------------------------------------------------------------------------------------------- |
| `--file_id`                           | Identifier for input files.                                                                                            |
| `--resolution`                        | Best resolution of the cryo-EM map. Not used for X-ray data.                                                           |
| `--method`                            | Choose `"xray"` (calls Phaser) or `"cryoem"` (calls EMPlacement).                                                      |
| `--output_dir`                        | Directory to store results (default: `"preprocessing_output"`).                                                        |
| `--precomputed_alignment_dir`         | Path to OpenFold precomputed alignments (default: `"alignments/"`).                                                    |
| `--max_recycling_iters`               | Number of recycling iterations for the initial predictions (default: 4).                                               |
| `--use_deepspeed_evoformer_attention` | Flag to enable the DeepSpeed Evoformer attention layer. Requires DeepSpeed to be installed in the environment.         |
| `--jax_params_path`                   | Path to JAX parameter file (`"params_model_1_ptm.npz"`). Default `None`, will use system env var `$OPENFOLD_RESOURCES` |

The script expects input files organized as follows:

```
<working_directory>/
├── {file_id}_fasta/
│   └── {file_id}.fasta       # FASTA file containing the chain to refine
│                             # Header should be "> {file_id}"
│
├── {file_id}_data/
│   ├── *.mtz                 # For X-ray data
│   ├── *_half_map*.mrc       # For Cryo-EM data
│   └── <optional files>      # e.g., predicted or docked models
│
├── alignments/               # (default: --precomputed_alignment_dir)
│   └── {file_id}/
│       └── *.a3m / *.hhr
```
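The FASTA header requirement in the layout above is easy to get wrong, so it can be worth checking up front. The sketch below is not part of ROCKET: `check_fasta_header` is a hypothetical helper, and it assumes that whitespace between `>` and the identifier is not significant, which this page does not state explicitly.

```bash
# Hypothetical helper: check that the first line of
# {file_id}_fasta/{file_id}.fasta is a header naming {file_id}.
check_fasta_header() {
  file_id="$1"
  fasta="${file_id}_fasta/${file_id}.fasta"

  if [ ! -f "$fasta" ]; then
    echo "missing: $fasta" >&2
    return 1
  fi

  first_line=$(head -n 1 "$fasta")

  # A FASTA header line must start with ">".
  case "$first_line" in
    ">"*) ;;
    *)
      echo "no FASTA header on the first line of $fasta" >&2
      return 1
      ;;
  esac

  # Strip the leading ">" plus surrounding whitespace (assumed insignificant).
  header=$(printf '%s\n' "$first_line" | sed -e 's/^>[[:space:]]*//' -e 's/[[:space:]]*$//')

  if [ "$header" != "$file_id" ]; then
    echo "unexpected header '$header' in $fasta (expected '$file_id')" >&2
    return 1
  fi
  return 0
}
```

Run it from the working directory, for example `check_fasta_header 8p4pA`. A non-zero exit status means the file is missing or its header does not match the `--file_id` you intend to pass.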

#### Additional Parameters for Cryo-EM (`--method cryoem`)

| Argument             | Description                                                                                                                |
| -------------------- | -------------------------------------------------------------------------------------------------------------------------- |
| `--map`              | Path to a single post-processed map. Use this only when `--predocked_model` is provided and docking can be skipped.        |
| `--map1`             | Path to **Half-map 1**. Required when ROCKET needs to run a docking search.                                                |
| `--map2`             | Path to **Half-map 2**. Required when ROCKET needs to run a docking search.                                                |
| `--full_composition` | FASTA file containing sequences of everything expected to be in the reconstruction, whether there is a model for it or not |

#### Optional Arguments

| Argument            | Description                                          |
| ------------------- | ---------------------------------------------------- |
| `--predocked_model` | Path to an already docked model (default: `None`).   |
| `--fixed_model`     | Optional fixed model contribution (default: `None`). |

**Note:** When `--predocked_model` is present, `--map` is enough if you only have one post-processed map. Without a predocked model, or whenever docking still needs to be searched, use `--map1` and `--map2`.

## Outputs

After execution, results are organized in the `--output_dir` directory:

```
output_dir/
├── {file_id}.fasta                 # FASTA file containing the chain to refine, copied from input
├── alignments/                     # MSA files for the input sequence, copied from input
│   └── *.a3m / *.hhr
├── predictions/                    # OpenFold structure predictions and pkl files
│   └── xxx_processed_feats.pickle  # Processed feature dict with cluster profiles
├── ROCKET_inputs/                  # Final outputs for ROCKET main trunk
│   ├── {file_id}-pred-aligned.pdb  # Aligned prediction with pseudo-Bs
│   └── {file_id}-Edata.mtz         # Experimental data in LLG convention
├── ROCKET_config_phase1.yaml       # Automatically generated config file for rk.refine phase1 run
├── ROCKET_config_phase2.yaml       # Automatically generated config file for rk.refine phase2 run
├── processed_predicted_files/      # Processed predictions from Phenix (including trimmed confidence loops)
├── docking_outputs/                # Cryo-EM docking results (if present)
└── phaser_files/                   # X-ray molecular replacement results (if present)
```
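Before moving on to refinement, it can help to confirm that the key files were written. The sketch below is not part of ROCKET: `check_preprocess_outputs` is a hypothetical helper, and it assumes the file names and locations exactly as listed in the tree above, with the two config files at the top level of the output directory.

```bash
# Hypothetical helper: confirm that rk.preprocess wrote the files that the
# refinement step consumes. Paths follow the output tree documented above.
# Arguments: $1 output directory, $2 file_id
check_preprocess_outputs() {
  out_dir="$1"
  file_id="$2"
  status=0

  for path in \
    "ROCKET_inputs/${file_id}-pred-aligned.pdb" \
    "ROCKET_inputs/${file_id}-Edata.mtz" \
    "ROCKET_config_phase1.yaml" \
    "ROCKET_config_phase2.yaml"
  do
    if [ ! -f "${out_dir}/${path}" ]; then
      echo "missing: ${out_dir}/${path}" >&2
      status=1
    fi
  done

  return "$status"
}
```

For the X-ray example, `check_preprocess_outputs ./1lj5_processed 1lj5` returns zero only when all four files exist, and prints each missing path to standard error otherwise.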

