The 5th Place Solution to the ADIA Causal Discovery Challenge 2024
I recently participated in the ADIA Causal Discovery Challenge and emerged as a proud 5th place winner. Let’s dive into the exciting details of this challenge and my approach to solving it.
Problem setting: Given a dataset of 3–10 variables, you are asked to predict the causal relationship of each variable to two specific variables, X and Y. Based on the structure of these causal relationships, each variable falls into one of 8 classes. For more details, I recommend reading the challenge description on the organizers' website.
Unlike conventional data science competitions, where we’re handed various initial features, causal discovery presents a more intricate challenge. We’re given raw data without any initial features, which necessitates creative thinking to encode and extract meaningful information from each dataset, regardless of its dimensions.
To put it simply, I employed a supervised learning approach built on carefully hand-engineered features. The final model was a fusion of a cross-validated LGBM model and an automated machine learning model. An overview of my approach is depicted in the figure below.
In the following parts, I would like to share my experience during the competition. I believe many people are more curious about how I arrived at the final approach than about what it is. Finally, I will share some lessons that I learned through this competition.
1. My journey
Stage 1: How I came to the supervised learning approach
I was fortunate enough to get an early start on the challenge. Like many others, I began with the code base provided by the organizers, which leveraged the classic PC algorithm. For those unfamiliar with causal discovery, most current methods are unsupervised, falling into either the constraint-based or the score-based category. While constraint-based methods rely on conditional independence relationships, score-based methods optimize a score function such as the BIC. Part 3 of the book Causal Inference and Discovery in Python by Aleksander Molak offers a fantastic overview of these state-of-the-art techniques.
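For readers who want to see what this looks like in practice, here is a minimal sketch of running PC with the causal-learn library. This is my own illustration, not the organizers' code base; the exact call signature may differ slightly between causal-learn versions, and the random data is just a stand-in for one competition dataset.

```python
import numpy as np
from causallearn.search.ConstraintBased.PC import pc

# Stand-in for one dataset: 1,000 samples of 5 variables
data = np.random.randn(1000, 5)

# Run PC with the Fisher-Z conditional independence test
cg = pc(data, alpha=0.05, indep_test="fisherz")

# Adjacency matrix of the estimated CPDAG; mapping it to the relationship of
# each variable with X and Y is a separate post-processing step
print(cg.G.graph)
```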
The baseline PC algorithm yielded a score of ~37.79%. Undeterred, I delved into a variety of other unsupervised methods, including cutting-edge gradient-based approaches like NO-TEARS and DAGMA. Unfortunately, these didn’t significantly boost the performance. Ensembling these models through a voting strategy also proved insufficient, keeping the score below 40%.
A pivotal moment arrived when I realized the sheer scale of the dataset. With 23,500 training datasets and 1,880 test datasets, each comprising 1,000 samples of N variables, I saw an opportunity to leverage a supervised learning approach.
As a first step, I applied a simple polynomial transformation of degree 5 to the observed triplets (X, Y, K) and fed the result into an LGBM model to predict the relationship between variable K and the pair X, Y. This straightforward approach yielded an impressive score of ~47.13%, propelling me to the top 6 of the public leaderboard. This early success ignited my passion to explore the supervised learning direction further. It should be noted that I arrived at the supervised approach before the organization released its code base for this direction.
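Below is a rough sketch of what this baseline could look like. It is one plausible reading of the step above rather than my exact code: the way the polynomial terms are aggregated per dataset (mean and standard deviation) and the default LGBM settings are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from lightgbm import LGBMClassifier

def triplet_features(x, y, k, degree=5):
    """Expand an observed (X, Y, K) triplet with polynomial terms and summarize it."""
    triplet = np.column_stack([x, y, k])  # (n_samples, 3)
    expanded = PolynomialFeatures(degree=degree, include_bias=False).fit_transform(triplet)
    # Aggregate over the sample axis so each (dataset, K) pair becomes one training row
    return np.concatenate([expanded.mean(axis=0), expanded.std(axis=0)])

# One row per (dataset, candidate variable K); labels are the 8 relationship classes
# X_train = np.vstack([triplet_features(x, y, k) for (x, y, k) in training_triplets])
model = LGBMClassifier()
# model.fit(X_train, y_train)
```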
Stage 2: Feature engineering
By crafting a rich feature set that captures intricate statistical relationships, like correlations and conditional independence, I unlocked the potential to surpass 60% accuracy. But with a massive dataset and computationally intensive feature engineering, I knew I had to be strategic.
My approach was to start with a set of fixed features (such as the polynomial features in the first supervised baseline). For each new candidate feature, I evaluated how much it improved the overall score before adding it to my feature list. Finally, I used an LGBM model to evaluate the importance of all these features and ruthlessly pruned the feature set, ensuring only the most impactful ones contributed to the final model.
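As a small illustration of the pruning step, the sketch below trains an LGBM on placeholder data and keeps only the features above an importance threshold. The data, the feature names, and the percentile threshold are all assumptions made for illustration; the actual selection criterion may differ.

```python
import numpy as np
from lightgbm import LGBMClassifier

# Placeholder data: in practice X_train holds the engineered features and
# y_train the 8 relationship classes
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 40))
y_train = rng.integers(0, 8, size=500)
feature_names = [f"feat_{i}" for i in range(X_train.shape[1])]

model = LGBMClassifier(n_estimators=700, max_depth=15).fit(X_train, y_train)

# Prune the least impactful quarter of the features (the threshold is illustrative)
importance = model.feature_importances_
threshold = np.percentile(importance, 25)
keep = [name for name, imp in zip(feature_names, importance) if imp > threshold]
print(f"Keeping {len(keep)} of {len(feature_names)} features")
```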
These features can be categorized into the following groups (a small sketch of a few of them follows the list):
- ANM-based features: These involve reconstructing the underlying noise in the structural equation model (SEM) between two variables, then evaluating the correlation of the estimated noise with the predictor variable, along with some statistical descriptors of the estimated noise.
- Correlation-based features: These include well-known correlation metrics such as Pearson, Spearman, Kendall's tau, and distance correlation.
- Conditional independence-based features: These include both parametric and non-parametric metrics to measure the conditional independence between X and Y given a set of variables Z, such as partial correlation and conditional mutual information.
- Information theory-based features: These relate to the entropy and mutual information between the variables.
- Feature importance-based features: These aim to evaluate the role of X and Y in predicting the value of variable K and vice versa, using various machine learning models including linear regression and tree-based models.
- Other causal features: I also included mediation analysis of the paths X->Y->K, K->X->Y, and X->K->Y.
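To make a few of these groups concrete, here is an illustrative sketch of how such features can be computed for a pair (or triplet) of variables. The function names and the choice of regressor are mine rather than the competition code, and the real feature set was much larger.

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import GradientBoostingRegressor

def anm_features(x, y):
    """Fit y = f(x) + noise, then describe the estimated noise (ANM-based features)."""
    reg = GradientBoostingRegressor().fit(x.reshape(-1, 1), y)
    noise = y - reg.predict(x.reshape(-1, 1))
    return {
        "noise_corr_with_x": stats.pearsonr(noise, x)[0],  # close to 0 if the ANM direction holds
        "noise_skew": stats.skew(noise),
        "noise_kurtosis": stats.kurtosis(noise),
    }

def correlation_features(x, y):
    """Correlation-based features for a pair of variables."""
    return {
        "pearson": stats.pearsonr(x, y)[0],
        "spearman": stats.spearmanr(x, y)[0],
        "kendall": stats.kendalltau(x, y)[0],
    }

def partial_correlation(x, y, z):
    """Correlation between x and y after linearly regressing out a single conditioning variable z."""
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    return stats.pearsonr(rx, ry)[0]
```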
Stage 3: Tuning models
My primary focus was on crafting a powerful feature set. An LGBM model configured with 700 estimators and a maximum depth of 15 laid a solid foundation. As the competition progressed, the feature set grew substantially, necessitating adjustments to the model's hyperparameters: I increased the number of estimators, deepened the trees, and reduced the learning rate to accommodate the increasing complexity. To address the imbalanced dataset, I employed SMOTE oversampling, together with 5-fold cross-validation for robust model evaluation. Ensembling the predictions from these folds further boosted the performance, yielding a respectable 67% accuracy.
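A hedged sketch of this setup is shown below: SMOTE oversampling is applied inside each training fold, and the fold models' predicted probabilities are averaged on the test set. The learning rate and the assumption that labels are encoded as 0–7 are illustrative choices, not the exact competition configuration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import SMOTE
from lightgbm import LGBMClassifier

def fit_cv_ensemble(X, y, X_test, n_splits=5):
    """Train one LGBM per fold on SMOTE-resampled data and average their test predictions."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    test_proba = np.zeros((len(X_test), len(np.unique(y))))
    for train_idx, _ in skf.split(X, y):
        # Oversample only the training fold so the held-out fold stays untouched
        X_res, y_res = SMOTE(random_state=0).fit_resample(X[train_idx], y[train_idx])
        model = LGBMClassifier(n_estimators=700, max_depth=15, learning_rate=0.05)
        model.fit(X_res, y_res)
        test_proba += model.predict_proba(X_test) / n_splits  # average the fold models
    # Assumes class labels are encoded as 0..7 so argmax maps directly back to the label
    return test_proba.argmax(axis=1)
```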
Since I was quite busy during the final two weeks of the challenge, I almost stopped feature engineering at that point and let the computer handle hyperparameter tuning itself. I also explored how to tune models more efficiently, and AutoML caught my attention: I had seen AutoGluon achieve impressive performance in recent Kaggle challenges, so I decided to give it a try. To my surprise, it outperformed my carefully tuned LGBM model, achieving an impressive 69% accuracy. It was a testament to the power of automation and the rapid advancements in machine learning.
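For reference, a minimal AutoGluon sketch along these lines is shown below. The file names, the "label" column, the preset, and the time budget are my illustrative assumptions, not the exact configuration I submitted.

```python
import pandas as pd
from autogluon.tabular import TabularPredictor

# Hypothetical files holding the engineered features, with a "label" column
# containing the 8 relationship classes
train_df = pd.read_csv("train_features.csv")
test_df = pd.read_csv("test_features.csv")

predictor = TabularPredictor(label="label", eval_metric="accuracy").fit(
    train_df,
    presets="best_quality",  # illustrative preset and time budget
    time_limit=4 * 3600,
)
predictions = predictor.predict(test_df)
```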
While the organization released a neural network-based code base, I opted to focus on enhancing my existing approach. The out-of-sample phase presented an intriguing opportunity for further improvement, but I weighed the risks of potential overfitting. Considering the close alignment between my cross-validation scores and the public leaderboard, I decided to stick with my well-tuned ensemble.
In the end, a strategic combination of feature engineering, model selection, and efficient optimization led to a successful campaign. My final public score was 70.686%, while my out-of-sample score was 70.37%. As I expected, the two scores were quite similar.
2. Final thoughts
Personally, this challenge is the biggest competition I have joined so far, and it came at just the right time for me. As a PhD student studying causal discovery, it was a great opportunity to explore how my research applies to real-world problems.
Indeed, in research, we often confine ourselves to specific functional forms like linear non-Gaussian, additive noise, or heteroskedastic noise models. However, the real world presents far more complex and intricate challenges, where the underlying functional forms remain elusive. This thrilling complexity ignites my passion to delve deeper, explore innovative techniques, and bridge the gap between theoretical research and practical applications of causal discovery.
In addition, I’ve gained valuable insights through this challenge. Preprocessing large datasets proved to be quite frustrating and required significant optimization of my code. The competition also forced us to consider the practical implementation of our models, since the code had to run for each submission, so effective time management was crucial given the limited computational resources available on the platform. I benefited greatly from collaborating with my teammates (CDOZ) and learning from other competitors. While the competition rules prohibited code sharing among team members, we actively exchanged ideas on feature engineering and exploratory approaches. Finally, I was particularly impressed by the innovative solutions of the top three teams, which inspired me to push the boundaries of what’s possible in causal discovery.
Links to the top 3 solutions, for those who are interested:
Top 1: An end-to-end deep learning approach
Top 2: Unsupervised approach from traditional causal discovery methods
Top 3: A supervised approach with a simple yet amazing data augmentation strategy