An Architecture and Domain Specific Language Framework for Repeated Domain-Specific Predictive Modeling

A couple of years ago, I wrote a blog post here, expanding on some great ideas from Jon Morra. In that post, I wrote that:

For a predictive model to be well-integrated with consuming applications and other supporting tools, it’s critical that the representations that are used to communicate between systems are well-defined and consistent.

I developed an expanded version of Jon’s formalism, and discussed briefly how it informed some previous software architecture decisions.

In 2017, I got an opportunity to talk about these formalisms and this architecture at the International Conference on Predictive Applications and APIs (PAPIs). As part of that presentation, I took the opportunity to do two of my favorite things — build bigger conceptual frameworks, and write open-source software!

And now, in 2018, I’m happy to say that a further-expanded writeup of these ideas has been published in the Proceedings of Machine Learning Research. You can read the full paper here, but I thought it’d be worthwhile to pull some highlights for this blog.

First, I defined a class of problems faced in industry, called Repeated Domain-Specific Modeling. In these problems, a set of closely related predictive models must be customized and built, and it is worth the effort to build a framework that ensures speed and quality.

Repeated Domain-Specific Modeling systems include several parts — the application, the predictive model, the data scientists who will be operationally responsible for keeping the predictive model tuned and accurate, and the users of the application. Communication among these parts, be they human-machine interfaces or machine-machine interfaces, must be efficient and consistent for the system as a whole to run smoothly.

Next, I defined some terminology for the components of this system, and the representations that are passed among the components.

Predictive modeling is essentially two linked processes, Training and Scoring. Importantly, they share a parallel structure with regard to the representations that flow through them. And equally importantly, the Training process can be thought of as a system, or a higher-order function, that generates the Scoring process as an output of its data processing. To maintain encapsulation, the application should provide Training and Scoring data to the model in an agreed-upon (by contract) format that reflects the semantics and structure used by the application.
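The higher-order-function view can be made concrete in a few lines. This is a minimal sketch, not the paper's implementation: `train` stands in for any fitting procedure (here, a trivial least-squares line), and the key point is only its shape — it consumes Training data and returns the Scoring function as a value.

```python
def train(examples):
    """Training as a higher-order function: consume (x, y) pairs,
    return a scoring function. The least-squares fit is a stand-in
    for any real fitting procedure."""
    n = len(examples)
    mean_x = sum(x for x, _ in examples) / n
    mean_y = sum(y for _, y in examples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in examples)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in examples)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x

    def score(x):
        # The Scoring process, generated as an output of Training.
        return slope * x + intercept

    return score

score = train([(1, 2), (2, 4), (3, 6)])
score(4)  # 8.0
```

Because `score` closes over everything it needs, the application can hand it data in the agreed-upon format without knowing how Training produced it — which is exactly the encapsulation the contract is meant to preserve.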

I then talk about the critical importance of building systems that allow Data Scientists to contribute their Substantive Expertise (obligatory Drew Conway Venn Diagram reference…), and finally walk through an open-source template for this problem.

Most of the steps that might be used in a production environment are included: defining a model with a DSL, fitting a model, viewing a model archive file, generating a standard report about the model, scoring the model via web service, and inspecting the model in production.
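The flow of those steps — declarative definition, fitting, archiving, and scoring — can be sketched end to end. All names here are hypothetical and the "DSL" is just a dictionary; the actual template's DSL and archive format differ, but the round trip is the point: the spec drives fitting, the fitted model is serialized to an archive, and the loaded archive is what gets scored.

```python
import pickle

# Hypothetical declarative model spec, standing in for the template's DSL.
spec = {"name": "demo_model", "features": ["x"], "target": "y"}

def fit(spec, rows):
    """Fit a trivial model (mean of the target) according to the spec."""
    mean_target = sum(r[spec["target"]] for r in rows) / len(rows)
    return {"spec": spec, "prediction": mean_target}

def score(model, row):
    """Score a new row with the fitted model (constant prediction here)."""
    return model["prediction"]

# Fit, write a model archive, reload it, and score — mimicking the
# define / fit / archive / score steps of a production workflow.
archive = pickle.dumps(fit(spec, [{"x": 1, "y": 0}, {"x": 2, "y": 1}]))
model = pickle.loads(archive)
score(model, {"x": 3})  # 0.5
```

In a real deployment the archive would live on disk or in object storage, and `score` would sit behind the web service; the sketch keeps everything in memory for clarity.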

As part of that template, I describe some key features, including the (clunkily named) Feature Transformer Generator pattern and the proper handling of missing data, all tied as closely as possible to the theoretical constructs I'd described earlier.
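To illustrate the shape of the pattern — this is a hypothetical sketch, not the template's code — a Feature Transformer Generator learns its parameters from Training data and returns a transformer that is then applied identically at both Training and Scoring time. Mean imputation of missing values is used here as the simplest example of a transformer whose behavior must be fixed at training time:

```python
def make_imputing_transformer(training_values):
    """Feature Transformer Generator (illustrative sketch): learn a fill
    value from the training data, then return a transformer that applies
    it consistently during both Training and Scoring."""
    observed = [v for v in training_values if v is not None]
    fill = sum(observed) / len(observed)  # learned at training time

    def transform(value):
        # Missing data is replaced with the *training-time* fill value,
        # so Scoring sees exactly the same treatment as Training did.
        return fill if value is None else value

    return transform

transform = make_imputing_transformer([1.0, None, 3.0])
transform(None)  # 2.0
transform(5.0)   # 5.0
```

The generator/transformer split mirrors the Training/Scoring split: the generator runs once, inside Training, while the transformer it produces travels with the model into production.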

I hope the longer exposition in the paper, and the open-source code, are useful to someone! I found it extremely clarifying to write both the theory and a scratch implementation of that theory. There's nothing quite like an extremely tight page count to force you to really distill ideas down to their essences.

It's also worth noting that I'm very glad to see other related work being discussed. The paper provides references to a related project at Uber, as well as Jon Morra and colleagues' recent writeup of the work that originally inspired me. If you know of other such frameworks, or have other thoughts, your comments would be very welcome!

Data Scientist; co-founder of Data Community DC and the Data Science DC Meetup; Brooklyn, NY.
