Over the past few years, we’ve seen a steady rise in the adoption of AI methods in drug discovery and early hints of the impact these technologies will have in the years to come.
However, outside of AI-first biotechnology companies and the largest pharmas, adoption of these technologies has lagged, largely because few scientists and engineers are cross-trained in both modern deep learning and chemistry. Conversely, machine learning scientists looking to apply their research to real-world problems often find it challenging to navigate the chemoinformatics universe. Molecules are ultimately a complex data type, and working with them requires expertise in both chemistry and computational science.
Molecular manipulation at Valence
At Valence, computational manipulation of molecules is a daily occurrence. From our ML Ops team to our full-stack engineers to our machine learning research scientists, everyone spends a significant part of their day manipulating molecules.
Like many others, we take full advantage of the open-source RDKit library, which builds on decades of community development and is widely considered the de facto standard in chemoinformatics. RDKit’s core machinery is written in C++ and exposed through Python bindings, so it can be used directly from one of the most popular languages in data science. RDKit is also actively maintained (major updates are released every 6 months) and has a large community, which ensures regular bug fixes and continuous integration of new features.
As our internal codebase at Valence has grown, we needed a way to centralize all procedures and functions used to manipulate molecules. Not only does this help reduce the (otherwise steep!) learning curve for machine learning scientists new to chemoinformatics, it also helps make our platform more robust via standardized procedures and smaller surface area for potential bugs.
🐍 Datamol is a Python library that aims to make working with molecules intuitive while still allowing for full control over your molecular processing workflows.
✅ All you need to get started is:
mamba install -c conda-forge datamol
⚗️ Datamol is a light library that directly manipulates RDKit Chem.Mol objects. Its API was designed with simplicity, flexibility, and modularity in mind, and it works with a single import (similar to Pandas and NumPy).
🕹️ Datamol also provides molecular conversion and IO functions to load, save, and convert between multiple molecular representations and file formats such as SMILES, SMARTS, InChI, SELFIES, SMI, SDF, CSV, Excel, DataFrame, and others. All the IO functions work transparently on local as well as remote filesystems (such as AWS S3 or Google Cloud Storage).
🧠 Molecular clustering, fragmentation, and scaffold enumeration are other common tasks when working with molecular datasets. Datamol has several functions for this such as centroid picking and BRICS fragmentation, among others. Datamol also contains additional modules that may be useful for 2D/3D visualization, conformer generation, reaction manipulation, molecule editing, and more.
🏭 At Valence, we rigorously test and validate code before pushing it into production. A continuous integration system verifies that Datamol can be installed and executed on the supported platforms (Linux, macOS, and Windows) with different combinations of RDKit and Python versions. Datamol also documents a compatibility matrix covering the supported versions of Python and RDKit.
Let us know what you think of Datamol!
Datamol is a mature library that we’ve been using internally at Valence for over a year. We’re excited to open-source it today and hope it can accelerate the adoption of molecular machine learning across the industry.