Computer-driven molecular design combines the principles of chemistry, physics, and artificial intelligence to identify novel chemical compounds and materials with desired properties for a specific application. This rational in silico design requires both an advanced understanding of the structure–property and property–property relationships that exist across chemical compound space (CCS) and efficient methodologies for defining an inverse mapping from properties to 3D molecular structures. In this work, we first analyze these fundamental relationships in the sector of CCS spanned by small molecules using the QM7-X dataset, a comprehensive dataset of 42 quantum-mechanical (QM) properties for ≈4.2 million equilibrium and non-equilibrium structures of small organic molecules with up to seven non-hydrogen (C, N, O, S, Cl) atoms. By characterizing and enumerating progressively more complex manifolds of molecular property space (the high-dimensional space defined by the properties of each molecule in this sector of CCS), our analysis reveals a substantial degree of flexibility, or “freedom of design”, when searching for a single molecule with a desired pair of properties or for a set of distinct molecules sharing an array of properties. To explore how these insights can be exploited in the actual design of molecules with desired properties, we then develop a proof-of-concept implementation based on a variational autoencoder (VAE) architecture and demonstrate that it is feasible to parameterize the QM7-X chemical space with a finite set of extensive and intensive QM properties. We illustrate the capabilities of our approach through conditional generation of de novo molecular structures with targeted properties, transition path interpolation for chemical reactions, and insights into property–structure relationships.
These results thus provide a proof-of-concept demonstration toward enabling inverse property-to-structure design in diverse chemical spaces. We expect this work to motivate the development of advanced ML-based tools that will improve the in silico sampling, identification, and design of molecular systems for a specific application.