Characterizing Protein Conformational Spaces using Efficient Data Reduction and Algebraic Topology

Arpita Joshi, Nurit Haspel, Eduardo González


Datasets representing the conformational landscapes of protein structures are high-dimensional and therefore computationally challenging to analyze. Efficient and effective dimensionality reduction of these datasets is essential for analyzing the conformational landscapes of proteins and extracting important information about protein folding, conformational changes, and binding. Representing the structures with fewer attributes that capture most of the variance in the data allows faster and more precise analysis. In this study, we use dimensionality reduction methods both to reduce the number of instances and to reduce the number of features. The reduced dataset is then subjected to topological and quantitative analysis. In this step, we perform hierarchical clustering to obtain sets of conformation clusters that may correspond to intermediate structures. The structures represented by these conformations are then analyzed by studying their high-dimensional topological properties to identify truly distinct conformations and holes in the conformational space that may represent high energy barriers. Our results show that the clusters agree closely with known experimental results on intermediate structures as well as on binding and folding events.
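The pipeline summarized above (feature reduction, hierarchical clustering, then a topological summary) can be illustrated with a minimal sketch. The abstract does not name the specific algorithms, so everything here is an assumption made for illustration: PCA stands in for the feature reduction step, average-linkage clustering for the hierarchical clustering step, and only the zeroth Betti number (the count of connected components at a fixed distance scale `eps`) is computed. The function names, the synthetic data, and all parameter values are hypothetical; the instance reduction step is not shown.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist, squareform


def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components (via SVD)."""
    Xc = X - X.mean(axis=0)                       # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T               # coordinates in PC space


def cluster_conformations(Z, n_clusters):
    """Average-linkage hierarchical clustering, cut into at most n_clusters."""
    tree = linkage(Z, method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


def betti0(points, eps):
    """Zeroth Betti number at scale eps: the number of connected components
    of the graph that joins points at distance <= eps."""
    D = squareform(pdist(points))
    adjacency = csr_matrix((D <= eps).astype(int))
    n_components, _ = connected_components(adjacency, directed=False)
    return n_components


# Synthetic stand-in for conformations: 3 well-separated groups of 20 points
# in 30 dimensions (hypothetical data, not protein coordinates).
rng = np.random.default_rng(0)
centers = np.zeros((3, 30))
centers[1, 0] = 10.0
centers[2, 1] = 10.0
X = np.vstack([c + 0.1 * rng.standard_normal((20, 30)) for c in centers])

Z = pca_reduce(X, n_components=2)                 # feature reduction
labels = cluster_conformations(Z, n_clusters=3)   # candidate clusters
components = betti0(Z, eps=2.0)                   # topological summary
```

Higher Betti numbers, which count loops and voids and correspond to the "holes" mentioned above, require a persistent homology library and are outside the scope of this sketch.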


DOI: 10.28991/HEF-SP2022-01-01



Keywords: Dimensionality Reduction; Hierarchical Clustering; Betti Numbers; Protein Folding; BioScience.






Copyright (c) 2022 Arpita Joshi