Accelerating Science with the Generative Toolkit for Scientific Discovery (GT4SD)

Matteo Manica
Open-Source Science (OSSci)
Jul 11, 2022

Scientific progress is based on a mixture of creativity and curiosity and is characterized by a continually evolving trial-and-error process. Over centuries, this process has been formalized into what we know as the scientific method, and it has helped us push the boundaries in many disciplines. While effective, this methodology can be painstakingly slow, but recent advances in applying machine learning to science can help us speed it up.

When working on a scientific problem, one essential step of the investigation consists of generating hypotheses about the phenomenon under study. Besides requiring profound knowledge and understanding of the problem, formulating valid assumptions requires a considerable time investment, and this is where machine learning models, such as generative models, can come to the rescue.

As a researcher with a passion for generative models, I have always been keen on keeping up with new advancements, papers, and algorithms (basically a full-time job on its own). While there are many amazing resources for following the latest and greatest publications or for benchmarking models (especially in materials science: GuacaMol and MOSES), I always felt something was missing that would enable me as a researcher to rapidly develop, test, and share my generative models with the scientific community. This is why we decided to put together a team to work on a library that could enable researchers to apply generative models to science faster and in a standardized, shareable, and consumable way.

Enter GT4SD (Generative Toolkit for Scientific Discovery).

With GT4SD, our mission is to build an open-source community and ecosystem for generative modeling applications in science, with the goal of enabling scientists to embed data-driven models in their hypothesis generation process and to share the models they build for broad usage (here you can find more info on the GT4SD mission and community guidelines).

The library implements pipelines for inference and training of generative models, and offers utilities for versioning and sharing algorithms for broader use in the community. The standardized interface lets you instantiate an algorithm and run it to generate samples in fewer than five lines of code (left panel). The CLI tools make it easy to run a full discovery pipeline in the terminal (right panel).
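To make the "instantiate an algorithm, then sample from it" pattern concrete, here is a minimal sketch in plain Python. It does not use GT4SD itself: `ToyConfiguration`, `ToyGenerator`, the `target` string, and the random-string "model" are all hypothetical stand-ins chosen for illustration, so treat it as a picture of the workflow rather than the library's actual API (the repository's quick start has the real one).

```python
import itertools
import random
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class ToyConfiguration:
    """Hypothetical configuration: pins a version and the generation settings."""

    algorithm_version: str = "v0"
    alphabet: str = "CNO"
    length: int = 6
    seed: int = 0


class ToyGenerator:
    """Hypothetical algorithm exposing a uniform sample() method."""

    def __init__(self, configuration: ToyConfiguration, target: str) -> None:
        self.configuration = configuration
        # The target is only stored here; a real model would condition on it.
        self.target = target
        self._rng = random.Random(configuration.seed)

    def _generate(self) -> Iterator[str]:
        # Stand-in for a trained generative model: an endless stream of
        # random strings drawn from the configured alphabet.
        cfg = self.configuration
        while True:
            yield "".join(self._rng.choice(cfg.alphabet) for _ in range(cfg.length))

    def sample(self, number_of_items: int) -> Iterator[str]:
        # Lazily yield exactly `number_of_items` candidates.
        return itertools.islice(self._generate(), number_of_items)


# The whole workflow: configure, instantiate, sample.
configuration = ToyConfiguration()
algorithm = ToyGenerator(configuration=configuration, target="hypothetical-target")
items = list(algorithm.sample(10))
```

Keeping the settings in a separate, immutable configuration object is one plausible way to make a run reproducible and shareable: anyone holding the same configuration (including its version string and seed) gets the same samples from this toy generator.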

GT4SD includes models that can generate new molecule designs based on properties such as target proteins, target omics profiles, scaffold distances, binding energies, and additional targets relevant for materials and drug discovery (currently 25+ models for materials science are available in the library). Check out this notebook to see how you can use GT4SD to easily run a wide range of models to generate molecules with desired properties.

More details can be found in the pre-print.

If you want to know more and get involved, take a look at the repository: https://github.com/GT4SD/gt4sd-core. There you can find quick-start instructions to run, train, and share models, as well as examples and all the info you need to help us expand and improve GT4SD.


Matteo is a research scientist with a passion for data-driven models applied to science, and a past as a jazz musician.