Feature Engineering Needs Domain Knowledge
A Simple Process [with code] for Integrating Domain Knowledge in FE
In a recent post, I explained how small and medium-sized businesses are often tied to data silos that make it difficult to access usable data. I also made the case that a data scientist's understanding of feature engineering may be a great way to overcome the challenges of these data silos.
What Are Data Silos & How Can Data Science Solve Them
A Case for Data Science in Small & Medium Sized Businesses
Thus, feature engineering is an essential practice for helping businesses to overcome data silos and build more useful data to support more sophisticated planning.
One challenge with feature engineering more generally is that there is often an overwhelming number of features one could engineer for a given business problem. Without domain expertise, it is hard to know which of those features are worth the time to build.
It is therefore often beneficial for data scientists to ask domain experts which features matter in the problem space. In some cases, that knowledge can be very complex and difficult to translate into data features.
Take document classification as an example. I recently helped a friend build a document classifier to automate the classification of millions of documents in the real estate space. The problem I faced when starting to build the model was that features generated automatically from the data alone were too noisy, and the model struggled to classify some documents correctly.
To solve this, I set up a simple spreadsheet and handed it to my friend, the domain expert, to fill out. In one column, I asked him to enter partial word patterns that would be unique to a specific document type. In another column, I asked him to include any additional rules. In this case, he indicated whether the pattern showed up at the top or the bottom of a given page.
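To make the spreadsheet layout concrete, here is a minimal sketch of how such a rule sheet might be represented and loaded. It is not the code from the project. The column names (doc_type, pattern, position), the allowed position values, and the sample rows are all assumptions invented for illustration, and the sheet is inlined as CSV text so the example runs on its own.

```python
import csv
import io
from dataclasses import dataclass

# Hypothetical sheet contents: these patterns and document types are
# invented for illustration and are not the rules from the real project.
SHEET_CSV = """doc_type,pattern,position
deed,warranty dee,top
deed,grantor,any
mortgage,promissory not,top
appraisal,appraised valu,bottom
"""

# Assumed vocabulary for the "additional rules" column.
VALID_POSITIONS = {"top", "bottom", "any"}


@dataclass(frozen=True)
class Rule:
    doc_type: str   # document type the pattern is meant to identify
    pattern: str    # partial word pattern, stored in lowercase
    position: str   # "top", "bottom", or "any"


def load_rules(csv_text):
    """Parse the expert's sheet into Rule objects, rejecting unknown positions."""
    rules = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Data rows start on line 2 of the sheet (line 1 is the header).
    for row_number, row in enumerate(reader, start=2):
        pattern = (row["pattern"] or "").strip().lower()
        position = (row["position"] or "any").strip().lower()
        if not pattern:
            continue  # skip rows the expert has not filled in yet
        if position not in VALID_POSITIONS:
            raise ValueError(f"row {row_number}: unknown position {position!r}")
        doc_type = (row["doc_type"] or "").strip().lower()
        rules.append(Rule(doc_type, pattern, position))
    return rules


rules = load_rules(SHEET_CSV)
for rule in rules:
    print(rule)
```

Validating the position column at load time means a typo in the sheet fails loudly instead of silently producing a feature that never fires. With a real spreadsheet file, the same rows could be read with something like pandas.read_excel instead of the csv module.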
Next, I developed Python code that would pull the spreadsheet in and create a feature matrix with one row per page and one column per rule. The nice thing about this approach was that my friend could update the rules from the spreadsheet without ever having to tell me about the changes. Each time it ran, my code would pick up the new rules, rebuild the features, and retrain the model.
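As a rough illustration of the idea, the sketch below turns a list of rules into a binary feature matrix with one row per page and one column per rule. It is my own simplified reconstruction for this post, not the project code. The interpretation of "top" and "bottom" as the first or last n_lines non-empty lines of a page, the function names, and the sample pages and rules are all assumptions.

```python
def page_region(text, position, n_lines=5):
    """Return the lowercase part of a page that a rule should search.

    Assumption: "top" means the first n_lines non-empty lines and
    "bottom" means the last n_lines; anything else searches the whole page.
    n_lines is expected to be at least 1.
    """
    lines = [line for line in text.lower().splitlines() if line.strip()]
    if position == "top":
        lines = lines[:n_lines]
    elif position == "bottom":
        lines = lines[-n_lines:]
    return "\n".join(lines)


def build_feature_matrix(pages, rules, n_lines=5):
    """Build one row per page and one 0/1 column per (pattern, position) rule."""
    matrix = []
    for text in pages:
        row = [
            int(pattern.lower() in page_region(text, position, n_lines))
            for pattern, position in rules
        ]
        matrix.append(row)
    return matrix


# Hypothetical rules and page text, invented for illustration only.
rules = [
    ("warranty dee", "top"),
    ("grantor", "any"),
    ("appraised valu", "bottom"),
]
pages = [
    "WARRANTY DEED\nThis indenture made between grantor and grantee\n"
    "Lot 4 Block 2\nSigned and sealed",
    "Property summary\nThree bedroom home\n"
    "Comparable sales reviewed\nFinal appraised value: see attached",
]

matrix = build_feature_matrix(pages, rules, n_lines=2)
for row in matrix:
    print(row)
```

Because the columns are derived from whatever rules are passed in, adding a row to the sheet adds a feature on the next run without any code change. The resulting matrix can be used on its own or combined with automatically generated text features before training a classifier.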
To see an example of this simple code, check out this link to my GitHub Gist: