Learning RFdiffusion: building protein backbones with a diffusion ML model

I recently played with RFdiffusion, a tool that generates protein backbones using the new and hot diffusion ML model.

A very brief intro to diffusion models

The diffusion model is a generative model. It powers many text-prompted image-generating AIs like DALL-E2, where you give the model a prompt such as “A photo of a teddy bear on a skateboard in Times Square” and the AI generates the image:

Figure: DALL-E2 teddy bear in Times Square.

The model is trained to reverse the noise-adding process and recover the original data.

Figure: slides from MIT 6.S191: Introduction to Deep Learning.

In other words, the model is trained to “generate” an image containing much more information, recovering (imagining) details from an initial state that carries little information and is mostly noise.
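For intuition, here is a toy numpy sketch of the forward (noise-adding) half of a DDPM-style diffusion model; the schedule and values are arbitrary illustrations, not anything from DALL-E2’s or RFdiffusion’s actual code:

import numpy as np

# Toy DDPM-style forward process (illustrative sketch only).
T = 50
betas = np.linspace(1e-4, 0.02, T)        # noise added at each step
alpha_bars = np.cumprod(1.0 - betas)      # fraction of signal surviving t steps

def noisy_sample(x0, t, rng=np.random.default_rng(0)):
    """Jump from clean data x0 straight to its noisy version at step t."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

# Training teaches a network to predict eps from (noisy_sample, t); generation
# then starts from pure noise at t = T-1 and repeatedly removes the predicted
# noise, walking back to a clean sample.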

In the image-prompt example, DALL-E2 first converts the text into a sequence of tokens using a language model, then uses the tokens to generate an initial image based on the text input. This initial image is then fed to the diffusion model, where it is gradually refined and improved, yielding a high-quality image that matches the input text.

RFdiffusion: diffusion model to generate protein backbones

Back to the protein design business.

The Baker lab applied the diffusion model to generating protein backbones, developing RoseTTAFold Diffusion (RFdiffusion). As in the text-prompted image example, the prompt for making a protein backbone might be “100 residues”, or “30 residues of alpha helix followed by 20 residues of loop followed by 50 residues of beta sheet”, and the model will try to recover the details of those secondary structures and generate the backbone accordingly.

The gory details of the model can be found in their paper “Broadly applicable and accurate protein design by integrating structure prediction networks and diffusion generative models”.

The codebase is open-sourced on GitHub: RFdiffusion, and they offer a Google Colab notebook to explore and play with the examples.

Simple prompt: designing a protein with 30 residues

My customized Colab file (based on the Colab file on the RFdiffusion GitHub page) can be found here.

The design process:

  1. Let the diffusion model freely generate a protein backbone of 30 amino acids (i.e., provide only the residue count as the prompt).
  2. Run ProteinMPNN to generate sequences for that backbone.
  3. Use AlphaFold to validate the sequences.

The workflow of the Colab notebook:

  • Setup
    • Install RFdiffusion, ProteinMPNN, AlphaFold, etc. This should take about 2 minutes.
  • Generate the backbone with RFdiffusion
    • The key parameter to set is contigs = "30", which specifies a length of 30 residues.
    • We use the free (unconditional) mode in this simple example. We can add other conditions/prompts; for example, in a binder design, we can set contigs='A:50' pdb='4N5T' to diffuse a binder of length 50 against chain A of the given PDB.
    • There’s only one chain, so there is no need to set the symmetry.
    • The default number of diffusion time steps/iterations is 50. The program traces back through 50 time steps to gradually denoise the protein, and the notebook is set up to display the structure as the model traces back.
    • It should take about 2 minutes for the 50 steps to finish, and in our case we get a helical backbone 30 residues long.
  • Run ProteinMPNN to generate sequences and AlphaFold to validate
    • Use ProteinMPNN to calculate 8 sequences for the backbone.
    • Feed the 8 sequences to AlphaFold2 to get structures.
  • Overlap the structure of the best designed sequence with the backbone from diffusion.

Figure: Left: de novo designed backbone generated by RFdiffusion. Right: verifying the designability of the backbone by running ProteinMPNN to design sequences, feeding the sequences to AlphaFold, and comparing the best structure (in blue-ish color) with the de novo backbone generated by RFdiffusion (colored grey).
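The “compare” step in the figure boils down to superimposing the AlphaFold prediction onto the diffused backbone and measuring Cα RMSD. The notebook computes this internally; purely as an illustration of the idea (assuming you have already extracted the two matched Cα coordinate arrays), a self-contained numpy version could look like:

import numpy as np

def kabsch_rmsd(P, Q):
    """Cα RMSD between two (N, 3) coordinate arrays after optimal
    superposition (Kabsch algorithm). Illustrative sketch, not the
    notebook's actual code."""
    P = P - P.mean(axis=0)                    # center both structures
    Q = Q - Q.mean(axis=0)
    U, S, Vt = np.linalg.svd(P.T @ Q)         # covariance -> optimal rotation
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return np.sqrt(np.mean(np.sum((P @ R.T - Q) ** 2, axis=1)))

A low RMSD between the re-predicted structure and the diffused backbone (a threshold of around 2 Å is commonly used) is taken as evidence that the backbone is designable.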

Key command under the hood

Most of the code in the notebook is about installing the programs, setting up parameters, and displaying the results. The key lines that actually run the RFdiffusion process are as follows:

import subprocess

command = "./RFdiffusion/run_inference.py inference.output_prefix=outputs/test inference.num_designs=1 'contigmap.contigs=[30-30]'"

# run the shell command from Python, capturing stdout/stderr
process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True, text=True)
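The notebook then watches the process output to drive the live display as the structure denoises; the simplest version of that pattern (my paraphrase, not the notebook’s exact code) would be:

# stream RFdiffusion's log lines as the denoising trajectory runs
for line in process.stdout:
    print(line, end="")
process.wait()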

It’s the equivalent of running ./RFdiffusion/run_inference.py inference.output_prefix=outputs/test inference.num_designs=1 'contigmap.contigs=[30-30]' in a terminal.

This is very similar to the simplest example on the RFdiffusion README page: ./scripts/run_inference.py 'contigmap.contigs=[150-150]' inference.output_prefix=test_outputs/test inference.num_designs=10. The README example designs a 150-residue backbone by setting contigs to [150-150], and generates 10 backbones by setting num_designs to 10.
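If I read the contig syntax correctly, the two numbers in [150-150] are a length range, so unequal bounds let each design sample its own length. Adapting the Python call from above (the output prefix and counts here are my arbitrary choices):

# each of the 5 designs samples a backbone length between 100 and 200 residues
command = ("./RFdiffusion/run_inference.py inference.output_prefix=outputs/varlen "
           "inference.num_designs=5 'contigmap.contigs=[100-200]'")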


P.S. (unimportant random thoughts)

Some philosophical impressions/intuitions about the diffusion model, noise, and entropy

I am in awe of how powerful the diffusion model can be at generating/recovering details from a few prompts/settings.

It amazes me to think about the analogy that, in a sense, human creation is also a de-noising process: generating something from what seems to be random noise, with information emerging as the system’s entropy is reduced.

It reminds me of what David Foster Wallace said in his commencement speech “This Is Water”: that there is a “mystical oneness of all things deep down”.