Poster Presentation 50th Lorne Proteins Conference 2025

On developing hidden Markov models using a structural alphabet (#234)

Ashar Malik 1,2, David Ascher 1,2
  1. School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, QLD, Australia
  2. Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute, Melbourne, VIC, Australia

The integration of machine learning with protein structure analysis has produced a novel 20-character structural alphabet that encodes protein structures as sequences while preserving their spatial context, so that sequence-based analytical techniques can be applied to structural data. This presentation will demonstrate the use of these encoded sequences to build hidden Markov models (HMMs), repurposing a traditional sequence analysis method to exploit structural patterns. Benchmarks will be presented to show how these structure-based HMMs, available through tools such as HMMology3D and HMMStruct, detect homologous proteins and predict folds for orphan proteins. The structure-based HMMs outperform existing approaches in structural homology detection and fit into existing workflows. This work builds on the structural alignment program Foldseek, extending it with generative models that capture structural variation across a set of related proteins rather than through pairwise comparisons alone. The implications extend to evolutionary studies, functional annotation, and the identification of structural variants, with potential applications in drug discovery and precision medicine. The approach also opens new possibilities in structural phylogenetics and offers deeper insight into protein function and evolution.
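To make the idea of a model built from structure-encoded sequences concrete, the following is a minimal sketch, not the method implemented in HMMology3D or HMMStruct. It assumes the 20 structural states are written with the 20 amino-acid letters, that the encoded family members are already aligned without gaps, and it keeps only match-state emissions (no insert or delete states, no transitions), so it is a deliberately simplified stand-in for a full profile HMM. The function names and the toy family are hypothetical.

```python
import math

# Assumption: the 20 structural states are written with the 20 amino-acid letters.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def column_emissions(alignment, pseudocount=1.0):
    """Estimate per-column emission probabilities from gap-free,
    equal-length structure-encoded sequences (match states only)."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same length")
    emissions = []
    for col in range(length):
        # Start every state at the pseudocount so unseen states keep
        # a small non-zero probability.
        counts = {state: pseudocount for state in ALPHABET}
        for seq in alignment:
            counts[seq[col]] += 1.0
        total = sum(counts.values())
        emissions.append({state: c / total for state, c in counts.items()})
    return emissions


def log_odds_score(emissions, query):
    """Sum over columns of log2(emission probability / uniform background)."""
    if len(query) != len(emissions):
        raise ValueError("query length must match the number of columns")
    background = 1.0 / len(ALPHABET)
    return sum(
        math.log2(column[state] / background)
        for column, state in zip(emissions, query)
    )


# Hypothetical toy family of structure-encoded sequences.
family = ["DVLVC", "DVLVC", "DVQVC", "DPLVC"]
model = column_emissions(family)

member_score = log_odds_score(model, "DVLVC")
unrelated_score = log_odds_score(model, "WWWWW")
print(member_score > unrelated_score)
```

A query that shares the family's structural states at each position scores above a uniform background, while an unrelated string scores below it. A full profile HMM would add insert and delete states with learned transition probabilities so that queries of different lengths can be aligned and scored.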

  1. Malik, Ashar J., et al. "On use of tertiary structure characters in hidden Markov models for protein fold prediction." bioRxiv (2024): 2024-04.
  2. van Kempen, Michel, et al. "Fast and accurate protein structure search with Foldseek." Nature Biotechnology 42.2 (2024): 243-246.