The Next Generation of Feature Engineering Tool in Python

HEADJACK - Modularization Feature Engineering

Jim Liu
5 min read · Nov 1, 2022

Feature engineering has always been an important stage in a machine learning pipeline. Its core idea is to create or extract, through domain knowledge, features from which machine learning algorithms can learn more easily. Data scientists have struggled here for a long time, and a wide variety of techniques has emerged as a result.

In this article, the approach we present differs from traditional domain-knowledge techniques. The feature engineering function in Headjack (HJ) is a model trained by self-supervised learning. In other words, we can transform features between two domains without a shared key value, unlocking the potential of data that would not normally be used. For example, predictions on the Boston housing price task can be enhanced with features from the Titanic domain, or a customer churn task can be enhanced with features from an African traffic domain, and so on. This open library is currently being used in the Kaggle Spaceship Titanic competition by taking other people's solutions and simply adding Headjack, reaching a ranking of 8/2119 (2022/10/23). For details, please refer to:

HEADJACK AI

HJ is an open library that provides a framework for exchanging self-supervised learning models, similar to Hugging Face as a hub, but currently focused on exchanging features of tabular data models. Unlike textual data, each tabular dataset has a different number of columns and different attributes, which means it cannot be encoded consistently the way tokens are embedded in NLP tasks. HJ therefore differs from NLP pre-trained models, which transform within a single domain: it performs transformations between two different domains. Details can be found in the paper (currently under double-blind review; it will be released later).

From an application perspective, imagine that we turn every dataset into a model via self-supervised learning; each model can then serve as feature engineering for other tasks, so this can also be considered modularized feature engineering. HJ also hosts a community containing plenty of models trained by other data scientists: just train your own model first (the platform provides free public GPUs), then start transforming features with other models.

HEADJACK Public Pool and Leaderboard

How to start and install?

Getting started with HJ is easy: just go to the website and register an account (for free). After registering, you can manage your models and browse the pool on this page.

Signup Page

Installing HJ is just as easy; install it directly with pip:

pip install headjackai-sdk

Login and Check in Python

Log in from Python with the account you registered on the page.

from headjackai.headjackai_hub import headjackai_hub 
#host setting
hj_hub = headjackai_hub('http://www.headjackai.com:9000')
#account login
hj_hub.login(username='???', pwd='???')

Check the model pool; each model represents a feature engineering method:

#Check knowledge list on public pool 
hj_hub.knowledgepool_check(public_pool=True)[:10]
['kaggle_brijbhushannanda1979_bigmart_sales_data',
'kaggle_blastchar_telco_customer_churn',
'kaggle_lava18_google_play_store_apps',
'kaggle_madislemsalu_facebook_ad_campaign',
'kaggle_datasets_mhdzahier_travel_insurance',
'kaggle_zhijinzhai_loandata',
'kaggle_janiobachmann_bank_marketing_dataset',
'kaggle_santoshd3_bank_customers',
'kaggle_mahirahmzh_starbucks_customer_retention_malaysia_survey',
'kaggle_ihormuliar_starbucks_customer_data']
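Since the pool is returned as a plain list of strings, standard Python filtering is enough to narrow it down. A small sketch using the names shown above (the `'customer'` keyword is just an example):

```python
# Names copied from the public-pool output above
pool = [
    'kaggle_brijbhushannanda1979_bigmart_sales_data',
    'kaggle_blastchar_telco_customer_churn',
    'kaggle_lava18_google_play_store_apps',
    'kaggle_madislemsalu_facebook_ad_campaign',
    'kaggle_datasets_mhdzahier_travel_insurance',
    'kaggle_zhijinzhai_loandata',
    'kaggle_janiobachmann_bank_marketing_dataset',
    'kaggle_santoshd3_bank_customers',
    'kaggle_mahirahmzh_starbucks_customer_retention_malaysia_survey',
    'kaggle_ihormuliar_starbucks_customer_data',
]

# Keep only models whose source dataset mentions customers
customer_models = [name for name in pool if 'customer' in name]
```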

Train a Feature Engineering Model and Run Feature Transfer:

In addition to the login and check functions mentioned above, this library mainly provides two features: training a feature engineering model and transferring features. In this article, we will use the classic IRIS dataset to demonstrate both.

As mentioned earlier, HJ is a framework dedicated to transferring between two different tabular domains, so any dataset that wants to take part in feature transfer must first be trained into a model before entering the HJ framework:

#Load example data
from sklearn.datasets import load_iris
import pandas as pd

df = load_iris()
pd_df = pd.concat((pd.DataFrame(df['data']), pd.DataFrame(df['target'])), axis=1)
pd_df.columns = df['feature_names'] + ['label']

#Train a new knowledge model of genenral feature engineering
hj_hub.knowledge_fit(data=pd_df, target_domain='example_iris',
task_name='example_iris_task01', label='label')

Check the status after running:

#Check the status of training task  
hj_hub.fit_status_check(task_name='example_iris_task01')
{'account': ???,  'task_name': 'api_example_iris_task01',  
'process': 'knowledge_fit', 'status': 'running'}

The status will be displayed as completed when training finishes:

#Check the status of training task
hj_hub.fit_status_check(task_name='example_iris_task01')
{'account': ???, 'task_name': 'api_example_iris_task01',  
'process': 'knowledge_fit', 'status': 'completed'}
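Because `knowledge_fit` runs remotely, a script often needs to wait for the task to finish before moving on. A minimal polling helper, sketched here as an assumed pattern (it is not part of the SDK itself):

```python
import time

def wait_for_fit(check_fn, task_name, interval=10, timeout=3600):
    """Poll a status-check function until the task reports 'completed'."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = check_fn(task_name=task_name)['status']
        if status == 'completed':
            return status
        time.sleep(interval)
    raise TimeoutError(f'{task_name} did not finish within {timeout}s')
```

With a live session this would be called as `wait_for_fit(hj_hub.fit_status_check, 'example_iris_task01')`.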

After training, these models can be used to transfer features.

#apply general features engineering on your task
jackin_df = hj_hub.knowledge_transform(data=pd_df,
                                       target_domain='example_iris',
                                       source_domain='uci_wine',
                                       label='label')
Feature Transferring

We then train machine learning models directly on the feature matrix produced by HJ's transfer, in place of domain-knowledge feature engineering.
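That downstream step is ordinary supervised learning. A sketch with scikit-learn, using a randomly generated stand-in for the transferred frame (in practice `jackin_df` is the DataFrame returned by `knowledge_transform` above, which requires a live account; the column names here are invented):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in for the transferred frame: assume it holds the transferred
# features plus the original 'label' column.
rng = np.random.default_rng(0)
jackin_df = pd.DataFrame(rng.normal(size=(150, 8)),
                         columns=[f'hj_feat_{i}' for i in range(8)])
jackin_df['label'] = rng.integers(0, 3, size=150)

# Train directly on the transferred feature matrix
X = jackin_df.drop(columns=['label'])
y = jackin_df['label']
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
```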

In HJ, datasets are trained into pre-trained models and placed into the pool before any transfer happens; that is why we call HJ modularized feature engineering. It also breaks the limitation that two datasets must share a common key in order to be joined, so the results of feature engineering become very diverse. In practice, test the models separately to compare which ones work well on your task; in our experience you can find a source domain that directly improves the metric score on the validation set by about 3%-10%.
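That "test the models separately" step is a simple selection loop over candidate source domains, scored on the validation set. A hypothetical sketch: the real version would call `hj_hub.knowledge_transform(..., source_domain=domain)` inside `transform`, which here is faked with an identity function so the example runs offline.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

def transform(X, source_domain):
    # Placeholder for hj_hub.knowledge_transform(...); identity for the sketch
    return X

# Example candidate source domains from the public pool
candidates = ['uci_wine', 'kaggle_blastchar_telco_customer_churn']

# Pick the source domain whose transferred features score best on CV
best = max(candidates,
           key=lambda d: cross_val_score(LogisticRegression(max_iter=1000),
                                         transform(X, d), y, cv=5).mean())
```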

In this framework, we have the opportunity to unlock the potential of every dataset, instead of following the NLP culture of applying super-giant models directly. Everyone is able to contribute, and your model may be applied in scenarios you cannot even imagine. The more people use it, the more models there will be to exchange in the community, and the more complete the platform will become.

Website:headjackai.com

Library Code:github
