Open-sourcing Datamol, a Python library to intuitively manipulate molecules

TLDR: Datamol is now open-source! Get started at and follow along at @datamol_io.

Over the past few years, we’ve seen a steady rise in the adoption of AI methods in drug discovery and early hints of the impact these technologies will have in the years to come.

However, outside of AI-first biotechnology companies and the largest pharmas, adoption of these technologies has lagged, largely because of the lack of scientists and engineers cross-trained in both modern deep learning and chemistry. On the other hand, machine learning scientists looking to apply their research to real-world applications often find it challenging to navigate the chemoinformatics universe. Molecules are ultimately a complex data type requiring expertise in both chemistry and computational science.

Molecular manipulation at Valence

At Valence, computational manipulation of molecules is a daily occurrence. From our ML Ops team to our full-stack engineers to our machine learning research scientists, everyone spends a significant part of their day manipulating molecules.

Like many others, we take full advantage of the open-source RDKit library, which has been built on decades of community development and is widely considered the de facto standard in chemoinformatics. RDKit’s core machinery is written in C++ and exposed through Python wrappers, so it combines native performance with easy access from one of the most popular languages in data science: Python. RDKit is also actively maintained (major updates are released every 6 months) and has a large community, ensuring regular bug fixes as well as continuous integration of new features.

As our internal codebase at Valence has grown, we needed a way to centralize all procedures and functions used to manipulate molecules. Not only does this help reduce the (otherwise steep!) learning curve for machine learning scientists new to chemoinformatics, it also helps make our platform more robust via standardized procedures and smaller surface area for potential bugs.

Introducing Datamol

🐍 Datamol is a Python library that aims to make working with molecules intuitive while still allowing for full control over your molecular processing workflows.

✅ All you need to get started is:

mamba install -c conda-forge datamol

⚗️ Datamol is a lightweight library that directly manipulates RDKit Chem.Mol objects. Its API was designed with simplicity, flexibility, and modularity in mind, and it works with a single import (similar to Pandas and NumPy).

🕹️ Datamol also provides various molecular conversion and IO functions to load, save, and convert between multiple molecular representations and file formats such as SMILES, SMARTS, InChI, SELFIES, SMI, SDF, CSV, Excel, DataFrame, and others. All the IO functions work transparently on local as well as remote filesystems (such as AWS S3 or Google Cloud Storage).

🧠 Molecular clustering, fragmentation, and scaffold enumeration are other common tasks when working with molecular datasets. Datamol has several functions for this such as centroid picking and BRICS fragmentation, among others. Datamol also contains additional modules that may be useful for 2D/3D visualization, conformer generation, reaction manipulation, molecule editing, and more.

🏭 At Valence, we are very careful to rigorously test and validate code before pushing it into production. A continuous integration system guarantees Datamol can be installed and executed on the supported platforms (Linux, macOS, and Windows) with different combinations of RDKit and Python versions. Datamol also documents a compatibility matrix between the supported versions of Python and RDKit.

Let us know what you think of Datamol!

Datamol is a mature library that we’ve been using internally at Valence for over a year. We’re excited to open-source it today and hope it can accelerate the adoption of molecular machine learning across the industry more broadly.

You can check out our tutorials or try Datamol online. We welcome your feedback on the GitHub repository, on the forum, or on Twitter!




Musings Beyond the Baseline

Hadrien Mary

Biophysicist playing with small molecules at Valence Discovery.
