2022 Archived Content

Inaugural

Machine Learning Approaches for Protein Engineering

Balancing Theory with Practice

May 5 - 6, 2022 | Hynes Convention Center, Boston, MA | EDT

Machine learning and AI are changing the way drugs are discovered, designed, and optimized, but these tools are still in early development, and much remains to be learned about how to adapt them for use in antibody and vaccine discovery, training, prediction, developability, simulation, and optimization.

Thursday, May 5

7:30 am Registration and Morning Coffee (Hynes Main Lobby)

8:25 am

Chairperson's Opening Remarks

Maria Wendt, PhD, Head, Biologics Research US & Global Head, Digital Biologics Platform (ML/AI), Large Molecule Research, Sanofi

8:30 am KEYNOTE PRESENTATION:

Protein Structure Prediction in a Post-AlphaFold2 World

Mohammed AlQuraishi, PhD, Assistant Professor, Systems Biology, Columbia University

In this talk, I will argue that with AlphaFold2, the core problem of static protein structure prediction is in some sense finished, but that further maturation is necessary before structure prediction informs questions beyond those of structure determination itself. I will outline some of these developments, highlighting one in particular: the prediction of structure from individual protein sequences, presenting new results on predicting structures of orphan and designed proteins.

9:00 am

Learned Surface Fingerprints for Protein Function Prediction and Design

Bruno Correia, PhD, Assistant Professor, Laboratory of Protein Design & Immunoengineering, University of Lausanne

A high-level representation of protein structure, the molecular surface, displays patterns of chemical and geometric features that fingerprint a protein’s modes of interactions with other biomolecules. We present MaSIF (molecular surface interaction fingerprinting), a conceptual framework based on a geometric deep learning method to capture fingerprints that are important for specific biomolecular interactions. Next, we sought to use MaSIF for the design of de novo PPIs, a longstanding problem in computational protein design. We show that the learned surface fingerprints hold exquisite information that enables us to understand functional features of proteins as well as to design novel ones.

9:30 am

Protein Engineering Guided by Accurate in silico Modeling of Disulfide Formation and Its Effect on Thermostability

Dmitry Lupyan, PhD, Senior Principal Scientist, Schrödinger

Introducing disulfide bonds into proteins has been shown to improve protein thermostability and enhance function. Here we use a combination of bioinformatics and structure-based descriptors to predict putative disulfide crosslinking, and use rigorous physics-based methods (FEP+) to interrogate the effect of disulfides on protein thermostability or affinity. We show the approach can predict the effect of disulfide bonds on physical properties and is a valuable tool for in silico protein engineering.

10:00 am Coffee Break in the Exhibit Hall with Poster Viewing (Exhibit Hall A & B)

10:40 am

Computational Design of a Synthetic PD-1 Agonist

Cassie Bryan, PhD, Senior Scientist, Synthetic Biology, Charles Stark Draper Laboratory, Inc.

PD-1 expressed on activated T cells inhibits T cell function and proliferation to prevent an excessive immune response, and disease can result if this delicate balance is shifted in either direction. Using a combination of computation and experiment, we designed a hyperstable 40-residue miniprotein, PD-MP1, that specifically binds PD-1 at the PD-L1 interface. The apo crystal structure shows that the binder folds as designed, and trimerization of PD-MP1 resulted in a PD-1 agonist that strongly inhibits T cell activation. PD-MP1 was computationally designed with an all-beta interface, and the trimeric agonist could contribute to treatments for autoimmune and inflammatory diseases.

11:10 am

Learning a Language Spoken by Nature: Protein Language Model, a Useful Tool for Protein Engineering

Yu Qiu, PhD, Senior Principal Scientist, Sanofi Genzyme R&D Center

Natural antibodies are optimized for general “fitness” by evolution and in vivo selections. Taking natural protein sequences as a language spoken by nature, learning the underlying “grammars” and “semantics” can help various engineering tasks. We have built protein language models (PLMs) trained on >2 billion natural human antibody sequences. The model showed promising results in predicting affinity, expression, and functional readout when trained and evaluated on retrospective data.

11:40 am

Deep Dive into Machine Learning Models for Protein Engineering

Deeptak Verma, PhD, Senior Scientist, Merck & Co.

Yuting Xu, PhD, Associate Principal Scientist, Biostatistics, Merck Sharp & Dohme Corp.

There has been increased interest in applying machine learning in the area of protein engineering. If built accurately, such models have the ability to virtually screen and discover large numbers of novel sequences. However, many state-of-the-art models and protein sequence descriptors have not been explored extensively. Our benchmark study, using different model types, descriptors, and datasets, suggests that Convolutional Neural Network models built with sequence-based amino acid property descriptors can be widely applicable to different types of protein redesign problems in the pharmaceutical industry.

12:10 pm Luncheon in the Exhibit Hall and Last Chance for Poster Viewing (Exhibit Hall A & B)

1:15 pm

Chairperson's Remarks

M. Frank Erasmus, PhD, Head, Bioinformatics, Specifica, Inc.

1:20 pm

Predicting Antibody Developability Profiles through Experimental and in silico Approaches

Laurence Fayadat-Dilman, PhD, Executive Director, Protein Sciences, Merck Research Laboratories

Selection of multi-parameter optimized antibody molecules, taking into consideration biological function, safety, and developability, allows for streamlined and successful development. We developed an efficient and practical high-throughput developability workflow, which identified novel patterns and correlations between biophysical assays. These patterns and correlations represent the basis for training deep neural networks and establishing machine learning algorithms for in silico interrogation and prediction of developability profiles.

1:50 pm

Protein Language Models for Improved Prediction of Immunogenicity and Biophysical Properties

Paolo Marcatili, PhD, Associate Professor, Bio & Health Informatics, Danish Technical University

Transformer-based language models are powerful machine learning algorithms that can digest and extract information from huge text-based datasets. By applying language models to proteins, we can generate meaningful representations of their structural and sequence landscape, which in turn can be exploited to improve downstream prediction tasks, such as immunogenicity prediction, mutagenesis, and biophysical characterization.

2:20 pm

Novel Deep Learning Models Enable Lead Antibody Optimization by Predicting Affinity and Naturalness of Sequence Variants

Roberto Spreafico, PhD, Principal AI Scientist, AbSci

Therapeutic antibodies require optimization of binding affinity and other properties. Traditional engineering approaches are time-consuming and explore only a subset of the solution sequence space. To address these challenges, we assist antibody development with AI. Models trained with affinity measurements of sequence variants of trastuzumab could quantitatively predict the binding strength of unseen variants. Models can also score antibody sequences for naturalness by comparison with human antibody repertoires, mitigating downstream developability issues.

2:50 pm Networking Refreshment Break (Hynes Main Lobby)

3:20 pm

De novo Design of Epitope Specific Antibodies with Machine Learning Methods

Philip M. Kim, PhD, Professor, Molecular Genetics & Computer Science, University of Toronto

I will present a set of machine learning technologies for the de novo design of antibodies as high-affinity binders to given epitopes. Our methods encompass structure-based design of CDRs for optimal epitope recognition and sequence-based generative models ensuring favorable developability properties. We show that we obtain nanomolar Fab binders to a specified novel epitope.

3:50 pm

A Cloud-Based Platform that Uses Unsupervised Machine Learning to Identify Hundreds of SARS-CoV-2 Antibodies with Optimal Properties

M. Frank Erasmus, PhD, Head, Bioinformatics, Specifica, Inc.

Efficient exploration of sequence outputs from discovery campaigns is often hindered by inefficient sampling of the CDR diversity and technical hurdles involved with the handling of big data, particularly from multiplexed experiments. To overcome these limitations, we developed a cloud-based bioinformatics platform, AbXtract™, which utilizes population-based statistics and unsupervised clustering of next-generation sequencing (NGS) data to rapidly identify leads from distinct and/or overlapping populations. To validate the platform, we carried out a discovery campaign using our in vitro Generation 3 Antibody Library Platform to select antibodies against the SARS-CoV-2 Spike protein trimer and its RBD and S1 subunits. After demultiplexing the NGS datasets from antibody selections against different antigen populations at decreasing concentrations, we were able to effectively sort through the maze of sequencing data to identify and test many of these top candidates for kinetics, binning, and in vitro neutralization experiments. The correlation of these data with NGS metrics strengthened our ability to identify high quality leads directly from NGS datasets, with many predicted leads displaying favorable properties such as sub-nanomolar affinities (≥13pM), distinct binding profiles and potent neutralization (IC50’s <1ng/ml) of many SARS-CoV-2 variants of concern.

4:20 pm

In Silico High Throughput Screening and Mutagenesis of Signal Peptides to Mitigate N-Terminal Miscleavage of Monoclonal Antibodies

Xin Yu, PhD, Senior Scientist, Global Biologics Discovery, Abbvie Bioresearch Center

We developed a novel high-throughput computational pipeline that generates millions of signal peptide mutants and uses a deep learning model to predict which of these mutants can alleviate N-terminal miscleavage in antibodies. The pipeline was optimized to screen a library of 296,077 unique mutants for each input antibody. We applied it to five antibodies with varying extents of miscleavage. In each case, multiple mutants were obtained, with miscleavage reduced to a non-detectable level and titers comparable with or better than those of the original signal peptides.

4:50 pm Close of Day

Friday, May 6

7:00 am Registration and Morning Coffee (Hynes Main Lobby)

7:30 am Interactive Discussions with Continental Breakfast (Ballroom Pre-Function)

Grab your breakfast and coffee and join a Discussion Group. Interactive Discussions are informal, moderated discussions, allowing participants to exchange ideas and experiences and develop future collaborations around a focused topic. Each discussion will be led by a facilitator who keeps the discussion on track and the group engaged. To get the most out of this format, please come prepared to share examples from your work, be a part of a collective problem-solving session, and participate in active idea sharing. Please visit the Interactive Discussion page on the conference website for a complete listing of topics and descriptions.

TABLE 1: Best Practices for Using Machine Learning in NGS-Guided Antibody Discovery

Andrew R.M. Bradbury, PhD, CSO, Specifica, Inc.

M. Frank Erasmus, PhD, Head, Bioinformatics, Specifica, Inc.

• How does unsupervised or supervised machine learning aid your discovery efforts (e.g., clustering, classification, inference)?

o If unsupervised – what approaches do you take (e.g., k-means, PCA, other)?

o If supervised, do shallow learning approaches (logistic regression, naïve Bayes, etc.) suffice, or do you require deep learning (recurrent NNs, 1D convnets, etc.)?

• What experimental and/or bioinformatics processing steps do you employ to ensure you have established an accurate ground truth for selected populations (e.g., binders / non-binders from FACS), and what classification strategies do you employ as they pertain to antibody discovery?

• What min/max read depth and/or fold coverage thresholds for the underlying region of interest (HCDR3) do you apply to the datasets you feed to machine learning algorithms to avoid under / overfitting? Is this sufficient based on your accuracy assessment from your acc/loss curves? Do you utilize data augmentation or regularization techniques?

• How do you typically allocate your training, validation, and test sets for NGS datasets from discovery campaigns?

• What data encoding methods do you employ (e.g. one-hot encoding, tokenization) to represent your sequence data? Does 3D coordinate information enhance your dataset?
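As a point of reference for the encoding question above, the following is a minimal sketch of one-hot encoding and integer tokenization for an antibody CDR sequence. It is illustrative only and does not represent any speaker's method: the function names, the 20-letter alphabet ordering, the zero-padding scheme, and the example sequence are all assumptions made for this example.

```python
# Illustrative sketch only: names, alphabet order, and padding are assumptions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, alphabetical by letter
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(seq, max_len):
    """Encode seq as max_len rows of 20 values; rows past the end of seq stay all-zero."""
    if len(seq) > max_len:
        raise ValueError("sequence is longer than max_len")
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, aa in enumerate(seq.upper()):
        if aa not in AA_INDEX:
            raise ValueError(f"unsupported residue: {aa!r}")
        matrix[pos][AA_INDEX[aa]] = 1
    return matrix


def tokenize(seq):
    """Map each residue to an integer token; 0 is reserved for padding."""
    return [AA_INDEX[aa] + 1 for aa in seq.upper()]


# Hypothetical short HCDR3-like fragment, padded to length 6.
encoded = one_hot_encode("ARDY", max_len=6)
tokens = tokenize("ARDY")
```

One-hot rows keep residues equidistant from one another, while integer tokens are typically passed to a learned embedding layer; which suits a given dataset is exactly the kind of question this discussion table raises.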

8:25 am

Chairperson's Remarks

Victor Greiff, PhD, Associate Professor, Immunology, University of Oslo

8:30 am

How Structure-Based Machine Learning Can Drive the Development of Biotherapeutics

Matthew Raybould, PhD, Postdoctoral Researcher, Immunoinformatics, University of Oxford, United Kingdom

Machine learning has shown its power across all of biology, and in this talk I will describe some of the novel machine learning tools we are pioneering in the area of biotherapeutics, from computational humanization to accurate, rapid structure prediction and virtual high-throughput screening.

9:00 am

Antibody CDR Design for Specific Binding Using High-Capacity Machine Learning

David K. Gifford, PhD, Professor, Electrical Engineering & Computer Science, Massachusetts Institute of Technology

We improve the binding of antibodies to a desired target and eliminate non-specific binders using machine learning (ML) models that are trained on CDR H3 sequences from high-throughput phage display assays. We tested 77,599 novel ML-designed sequences from 6 ML methods and found that ML could provide superior binders. We next used single-target ML models to eliminate non-specific antibodies and observed that ML methods outperformed conventional affinity competition assays.

9:30 am

Identifying Prospective Variants of SARS-CoV-2 by Deep Mutational Learning

Sai Reddy, PhD, Associate Professor, Systems and Synthetic Immunology, ETH Zurich, Switzerland

The continual evolution of the severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) and the emergence of variants that show resistance to vaccines and neutralizing antibodies threaten to prolong the coronavirus disease 2019 (COVID-19) pandemic. Selection and emergence of SARS-CoV-2 variants are driven in part by mutations within the viral spike protein and, in particular, the receptor-binding domain (RBD), which binds to the human ACE2 receptor and is a primary target site for neutralizing antibodies. Comprehensive mapping of single-position substitutions in the RBD has revealed key mutations that enhance binding to ACE2 and provide an escape from neutralizing antibodies. However, several SARS-CoV-2 variants such as beta (B.1.351), gamma (P.1), and delta (B.1.617.2) possess multiple, combinatorial mutations in their RBD. Here, we develop deep mutational learning (DML), a machine learning-guided protein engineering technology that enables the comprehensive interrogation of combinatorial mutations in the RBD and prediction of their impact on ACE2 binding and antibody escape. DML reveals a highly diverse sequence landscape of possible variants that maintain or enhance binding to ACE2 and escape from different classes of neutralizing antibodies. DML may be used in the future to comprehensively profile the breadth of candidate therapeutic antibodies against existing and prospective variants of SARS-CoV-2.

10:00 am

Clonal Hit Expansion to Discover Diverse Llama VHHs Using Alicanto

Natalie Castellana, CEO, Abterra Biosciences, Inc.

Next-generation sequencing of antibody repertoires has provided new insights into the single domain antibody repertoire. The correlation between the B-cell receptor repertoire and the serological antibody repertoire has only been analyzed in a small number of studies. In this talk, we will discuss the use of serum antibodies to guide antibody discovery in an immunized llama. Further, we use an autoencoder to deeply mine the clonal lineage of serum-identified antibodies.

10:30 am Networking Coffee Break (Hynes Main Lobby)

11:00 am

A Compact Vocabulary of Paratope-Epitope Interactions Enables Predictability of Antibody-Antigen Binding

Victor Greiff, PhD, Associate Professor, Immunology, University of Oslo

The prediction of antibody-antigen binding is a central question in biotechnology. A fundamental premise for the predictability of antibody-antigen binding is the existence of paratope-epitope interaction motifs universally shared among antibody-antigen structures. In a dataset of non-redundant antibody-antigen structures, we discovered a motif vocabulary of paratope-epitope interactions that govern antibody specificity, providing proof of principle that antibody-antigen binding is predictable, with implications for de novo antibody and (neo-)epitope design.

11:30 am

De novo Design of Nanobodies Targeting Specific Epitopes

Pietro Sormanni, PhD, Group Leader, Royal Society University Research Fellow, Chemistry of Health, Yusuf Hamied Department of Chemistry, University of Cambridge

De novo design methods promise a cheaper and faster route to antibody discovery, while enabling the targeting of predetermined epitopes and the screening of multiple biophysical properties. I will present recent advances in designing antibodies that target structured epitopes and in predicting solubility and formulation conditions. We show that we can rapidly obtain highly stable nanobodies binding predetermined epitopes with affinities in the nM range, and that we can accurately predict protein solubility at varying pH values.

12:00 pm Close of PEGS Summit