isFraud?

My Second Kaggle

Nicholas Teague
Automunge
23 min read · Oct 22, 2019


The Morning, Port Scene — by Claude-Joseph Vernet (1780)

Preservation Hall Jazz Band — Live on KEXP

For those who haven’t been following along, I’ve been using this platform to document the development of Automunge, an open source platform to prepare tabular data for machine learning. If you consider mainstream “AutoML” frameworks, they share in common that a user is generally required to clean and prepare the data themselves prior to implementation. That is where Automunge comes in: we are the “auto” that gets you to “auto”. The tool is a python library, available now for install, that evaluates properties of a data set’s columns to apply simple feature engineering methods, converting raw data to the numerical encodings with infilled missing cells that are a prerequisite for modern machine learning libraries. A user does not have to defer to automation for the entire set; the tool also allows the assignment of feature engineering transformations to distinct columns, whether a single transformation or a “family tree” of transformations creating generations and branches of derivations originating from the same source column. Such transformations may be sourced from our “feature store” library of transformations or may be custom designed by the user, making use of just a few simple generic data structures, for incorporation into the platform. By applying transformations within the platform a user gets access to a host of unique and useful automated methods, such as automated machine learning derived infill to missing points (aka ML infill), automated feature importance evaluation with the shuffle permutation method, automated dimensionality reduction such as via feature importance or principal component analysis, automated preparation of data for oversampling in cases of class imbalance in labels, automated evaluation of data property drift between training data and subsequent data, and perhaps most importantly the simplest means for consistent processing of additional data with just a single function call. In short, we make machine learning easy.
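
To make that concrete, here’s a rough sketch of the basic workflow: a single automunge(.) call that encodes a raw training set and returns the postprocess_dict used later for consistent processing of additional data. The exact returned tuple and parameter names have evolved across versions, so treat the unpacking below as illustrative and check the README for the current form.

```python
# a minimal sketch of the basic automunge(.) call; the import path, parameter
# names, and the (rather long) returned tuple may differ across versions
import pandas as pd
from Automunge import Automunger

am = Automunger.AutoMunge()

df_train = pd.read_csv('train_transaction.csv')  # IEEE competition training set

# automunge(.) returns a tuple of prepared sets; the postprocess_dict (returned
# last) is the key artifact for consistently preparing later data
train, *other_returned_sets, postprocess_dict = \
    am.automunge(df_train,
                 labels_column='isFraud')  # label column from the IEEE set
```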

Part of the challenge for any new software platform, especially one that may serve in the foundations for implementation of mission critical objectives which may be the target for a machine learning evaluation, is identifying early users willing to share the risk of beta software which inevitably will have at least a few bugs. I mean I’m just one guy building this thing, and while I certainly have given serious effort to rigor in the implementations, the reality is that with currently almost 20,000 lines of original code, originating from a (kind of sort of) inexperienced developer, well it’s inevitable that some things will be hard to spot on a first pass. (Actually coming to terms with this reality was part of the inspiration for a recent extensive quality control audit in which I went into some detail validating the various transformations, certainly a worthwhile effort as I did find and fix a few previously hard to spot bugs, more on that below.) Failing early users, product design can become a challenging proposition, as feedback from a user base is a key source of fuel for identifying pain points and potential extensions.

“At its heart, a startup is a catalyst that transforms ideas into products. As customers interact with those products, they generate feedback and data. The feedback is both qualitative (such as what they like and don’t like) and quantitative (such as how many people use it and find it valuable). … the products a startup builds are experiments; the learning about how to build a sustainable business is the outcome of those experiments. For startups, that information is much more important than dollars, awards, or mentions in the press, because it can influence and reshape the next set of ideas.”

- Eric Ries, The Lean Startup

Facing the reality of an unrealized user base, I’ve found myself in the position of having to wear two hats so to speak, that of the developer and that of the beta tester as well. It was thus that I ended up turning to the Kaggle platform for machine learning competitions, for where better to prove the validity of the software than in the context of a competitive environment where I can actively benchmark against skilled practitioners incentivized by cash prizes. This was only my second full attempt at a Kaggle competition (I documented my first in real time in what was (in my defense) kind of a stream of consciousness attempt at code blogging I called “My First Kaggle”, which I wrote in 2017), and I somewhat arbitrarily selected a competition hosted by IEEE for detecting fraud in financial transactions, partly because as an IEEE member it seemed a cool way to get involved in an association initiative, but mostly just because it was a tabular data based competition of the kind the current version of Automunge is primarily built to address. I’ll use this essay to highlight a few of my takeaways from the competition, and perhaps a few tangent musings as well to keep it interesting.

The competition data sets consisted of a series of entries associated with unique financial transactions, with the training set including an extra label column with a simple boolean indication of whether a given entry was an example of fraud, with entrants then evaluated on their ability to predict fraud on a similarly structured test set. The competition turned out to have some unique challenges in that the bulk of the data set features (columns) were sort of anonymized so to speak, by which I mean most of the column labels and associated contents were ambiguous with respect to origination, and subject to any range of interpretations as to what they could represent (for example many of the columns contained integers which for all we knew could be counting something, codes for a categorical feature, or some other measure). Thus any entry trying to incorporate more esoteric feature engineering methods, such as adjoining related external data or aggregating features based on known relations, was primarily left to algorithmic inference of suitability. (Feature engineering refers to data transformations that take place prior to training a machine learning model, such as may make for more efficient training, which oh by the way is part of what makes Automunge so dang useful after all.)

“Early adopters use their imagination to fill in what a product is missing. … In enterprise products, it’s often about gaining a competitive advantage by taking a risk with something new that competitors don’t have yet. Early adopters are suspicious of something that is too polished: if it’s ready for everyone to adopt, how much advantage can one get by being early? As a result, additional features or polish beyond what early adopters demand is a form of wasted resources and time.”

- Eric Ries, The Lean Startup

Another of the challenges associated with the problem was the scale of data included in the set, which I suspect was selected to push the boundaries of what was possible for in-memory operations on the Kaggle platform. Although the training data sets (consisting of two related sets which could be paired by a transaction ID number) were collectively only about 700MB, when you incorporate feature engineering methods that generate supplemental columns the memory footprint can easily climb into multiple GB; in my case some of my derived sets climbed to about 5GB or so, to give you an idea. I was hoping this problem could be abated by the investment in a personal dedicated Nvidia GPU accelerated rig, which I had been meaning to invest in for a while anyway, with what I assumed would be an ample 32 GB of RAM, but when it came time to train even this setup turned out to have capacity constraints, especially for gradient boosting. (Yes I know what you’re thinking, why on earth would you buy a PC when there are so many cloud options, ranging from free services like Kaggle and Colaboratory to a whole host of cloud services renting by the hour like Google Cloud, AWS, Paperspace, and well etc. Well for me it was kind of a personal preference of just trying to strip out as much complexity from the training operation as I could, as when I was initially experimenting with a cloud vendor (I won’t say who), I was running into issues with whole instances getting corrupted by memory-full errors without any clear reason why or how to correct, and given the time constraints of the competition this investment seemed the path of least resistance.)

The competition turned out to be a great testing ground for one of the truly unique and useful features of the Automunge platform, namely the potential for running feature engineering experiments to evaluate the impact on predictive model accuracy from different configurations of transformations. I believe that versioning of data transformations is an until now unsolved problem in the machine learning space, and data scientists wanting to run these types of experiments have been forced to save iteration upon iteration of potentially multi-GB scale data sets, with sharing of results and reproducibility of experiments a real challenge between researchers for instance — but the challenges come not just from saving different versions of transformed data, but also from saving and versioning the series of data transformation pipelines for development of these sets, such as if one wanted to later consistently transform additional data. Automunge solves this problem by removing the need for versioning of huge data sets or their associated (potentially spaghetti code, let’s be honest) data transformation pipelines. With each set of transformations applied, a simple and unencrypted python dictionary is returned (called the “postprocess_dict”, inelegant name I know, whatevs) recording all of the steps of transformations and associated normalization parameters, such that with just this simple object one can consistently transform additional data with a single “postmunge” function call. And by consistently I mean just that: the transformations to additional data are all based on evaluated properties from the training data, such as to avoid any potential for data leakage. Thus when it comes time to archive our set of experiments, we can just throw away all of those huge iterations of dataset csv files and spaghetti code and simply save the postprocess_dict and trained model for each experiment.
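
As a sketch of what that looks like in practice (the exact postmunge(.) signature and returned values may differ from the current README, so double check there), the postprocess_dict is all that needs to be archived alongside the trained model:

```python
# a minimal sketch of consistent processing of additional data with postmunge(.);
# the returned tuple contents are illustrative and worth checking against the README
import pickle
import pandas as pd
from Automunge import Automunger

# archive just the postprocess_dict and the trained model, not the prepared csv's
with open('postprocess_dict.pickle', 'wb') as f:
    pickle.dump(postprocess_dict, f)

# later (or on another machine), consistently prepare the test set
with open('postprocess_dict.pickle', 'rb') as f:
    postprocess_dict = pickle.load(f)

am = Automunger.AutoMunge()
df_test = pd.read_csv('test_transaction.csv')

# postmunge(.) applies the same transformations and normalization parameters
# that were derived from the original training data, avoiding data leakage
test, *other_returned_sets = am.postmunge(postprocess_dict, df_test)
```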

Excerpt from experiment log

For each experiment that I ran, I ended up saving a few more iterations of data than may be typical, so let me quickly explain the rationale as to why. First, one of the challenges of this particular competition was the presence of class imbalance in the labels, by which I mean we had many more examples of legitimate transactions than fraudulent transactions. Automunge has an option to help address this challenge, via a parameter called “TrainLabelFreqLevel” (inelegant name I know, whatevs), which prepares the training data for oversampling of the lower frequency category by copying sets of rows to approximately levelize the data, and which supports categorical labels as well as numerical labels via standard deviation bins for instance. I used this option, but the challenge was that in so doing I had to be careful that we were only levelizing the data after a split for training and validation had already taken place, to avoid data leakage from duplicated points ending up in both training and validation. (As mentioned earlier, a key point of Automunge value is that training and validation data are processed separately to avoid data leakage.) As a result I created a levelized set of training data and corresponding labels for tuning model hyperparameters, a separate non-levelized set to serve as validation for this tuning operation, and then, once we had settled on hyperparameters, the intention was to retrain with the full training set levelized. And of course a consistently transformed test set (not levelized) for generation of predictions. Again, to be clear, “levelized” in this context refers to the automated duplication of a subset of rows to prepare a data set for oversampling in the training operation, such as to compensate for class imbalance in the labels.
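
Here’s a rough sketch of that split-then-levelize arrangement, continuing the earlier sketch. The TrainLabelFreqLevel name comes from the text above, but the argument form and the valpercent1 parameter name are assumptions on my part, so verify against the README.

```python
# a sketch of a levelized training split alongside a non-levelized validation
# split; the TrainLabelFreqLevel argument form and valpercent1 name are assumptions
train_lvl, *other_returned_sets, postprocess_dict = \
    am.automunge(df_train,
                 labels_column='isFraud',
                 valpercent1=0.2,            # validation carved out before levelizing
                 TrainLabelFreqLevel=True)   # duplicate minority-class rows in train only
```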

(I looked it up, this is actually depicting a “work fire”, it’s not like the vessel is on fire or anything. I’m still not clear why it’s on its side, perhaps it’s low tide or something.)

For the training aspects, I wanted to experiment with a few different approaches. I had actually originally intended to prepare an ensemble aggregation of models based on training a second tier “teacher” model to aggregate predictions of all completed experiments; in the end there were a few obstacles to this approach, not least of which were deadlines, others of which I’ll discuss below. I ended up primarily running experiments with XGBoost and TensorFlow via Keras (I did try a little PyTorch with fast.ai but got hung up on what I’m sure was probably a trivial point which I won’t get into here, another case where I ran out of time for experimentation). I owe a little gratitude to Jason Brownlee, whose Machine Learning Mastery website repeatedly served as a useful reference for a few basic operations; I’ll link to one of his books below (why aren’t you on Amazon Jason? you need to get on that geez). I did run a few experiments with Bayesian optimization tuning via Hyperopt, but never quite got the basics figured out and wanted to get going with training, and well ended up using a more beginner’s approach for XGBoost tuning with a procedural grid search inspired by a 2016 post by Aarshay Jain. If I had a little more time I probably would have done some random search with more parameters, or who knows maybe even looked for an evolutionary optimization library instead, but well heck I had this really handy procedure at hand and just went with it.

The gist of this procedural approach was the iterative tuning of XGBoost hyperparameters, starting with max_depth/min_child_weight, then gamma, then subsample/colsample_bytree, and finally the regularization parameters alpha and lambda. I generally found that the performance of the model was most sensitive to max_depth, which is pretty intuitive as it caps how deep each tree’s branching is allowed to go. I didn’t spend much energy tuning the learning rate, primarily because the memory constraints really seemed to come into play much below 0.05, so I just went with the smallest rate that would run (I believe XGBoost defaults to 0.1, to give you an idea). If I could go back, one additional experiment I’d like to run for this procedural approach would be, once all of the parameters have been addressed, to revisit those that were tuned at the start — in this case min_child_weight and especially max_depth. And of course once the hyperparameters were addressed the next step was to train the final model and generate predictions. I actually did this in a few steps, first training the full model on a “levelized” (via Automunge) train set and corresponding non-levelized validation set with early stopping to identify the number of rounds, and then retraining with the full levelized training set, including that portion that had previously been set aside for validation, in which I somewhat arbitrarily increased n_estimators (the number of rounds) by a small amount to account for the additional data.
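
For reference, here’s a condensed sketch of what that staged grid search looks like in code. This is not the exact routine I ran: the parameter grids and starting values are placeholders, and X_train / y_train / X_valid / y_valid stand in for the Automunge-prepared levelized training split and non-levelized validation split described above.

```python
# a condensed sketch of staged XGBoost tuning via grid search; grids, starting
# values, and the X_train / y_train / X_valid / y_valid names are placeholders
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

stages = [
    {'max_depth': [5, 7, 9], 'min_child_weight': [1, 3, 5]},
    {'gamma': [0, 0.1, 0.2]},
    {'subsample': [0.6, 0.8, 1.0], 'colsample_bytree': [0.6, 0.8, 1.0]},
    {'reg_alpha': [0, 0.01, 0.1], 'reg_lambda': [1, 1.5, 2]},
]

best_params = {'learning_rate': 0.05, 'n_estimators': 500}

for grid in stages:
    search = GridSearchCV(XGBClassifier(**best_params),
                          param_grid=grid,
                          scoring='roc_auc',   # matches the competition metric
                          cv=3)
    search.fit(X_train, y_train)
    best_params.update(search.best_params_)    # carry winners into the next stage

# final fit with early stopping against the non-levelized validation set
model = XGBClassifier(**best_params)
model.fit(X_train, y_train,
          eval_set=[(X_valid, y_valid)],
          eval_metric='auc',
          early_stopping_rounds=50)
```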

For the Keras experiments I didn’t feel nearly as comfortable that my approach was really anywhere near its potential. I mean from what I gather this is probably just a job for simple densely connected layers, which is what I used, but beyond that to be honest I’m not sure if there is a consensus for how to otherwise initialize. For instance I really struggled with selecting the number of hidden layers and associated layer widths. After all, while in other deep learning applications like images, language, generative models, et al you generally have a whole host of pre-configured and sometimes even pre-trained architectures to choose from, in tabular data I’m not sure if the problem is generalizable enough to have a go-to architecture; each application can be very different in its feature space (although I speculate that the kind of repeatable normalization procedures for different categories of data performed by Automunge may eventually help to tame that problem). I finally settled on a configuration loosely inspired by an old heuristic I think I had come across on Stack Overflow once, in which each layer width is calculated as a heuristic based on the square root of the width of the preceding layer (e.g. if a first layer was 1,000 neurons, the next would be 100, then 10, etc.). Of course it makes sense that architecture configurations could certainly be another target for some type of hyperparameter search, but it wasn’t really clear to me how to frame the setup in a manner to run these kinds of experiments in an algorithmic context.

Beyond the number and width of layers, I turned to some pretty standard methods like the ReLU activation unit, and oh yeah threw in some dropout regularization on each layer to prevent overfitting. Of course as a binary classification problem a sigmoid output neuron was sufficient, with binary cross entropy for the loss function. For the optimizer I used Adam, which from what I gather is generally a good starting point, although I think I’ve heard RMSprop can sometimes be used as a starting point in its place. I actually found a slight bit of confusion with respect to the evaluation metrics: if you look at the code sample you’ll see an AUC metric, which stands for area under curve, chosen because the contest was being evaluated on ROC AUC, the area under the receiver operating characteristic curve. And well forgive this thought because it’s been a few weeks since I looked at it, but I think I was unsure whether the AUC I used in Keras was consistent with the AUC from the contest; I found the Keras documentation a little confusing on this point. In hindsight I think one error may have been that I’m not sure, the way I set it up, whether the early stopping metric was being applied to validation data. So yeah, given some more time I would have liked to dive deeper here, hopefully will get a chance next time.
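
For illustration, here is a hedged sketch of roughly that kind of setup: decreasing dense layer widths in the spirit of the heuristic above, dropout on each layer, a sigmoid output with binary cross entropy, Adam, an AUC metric (which in Keras defaults to the ROC curve), and early stopping explicitly monitoring validation AUC to address the uncertainty I mention above. The widths, dropout rate, and training settings are placeholders, as are the X_train / y_train / X_valid / y_valid names.

```python
# a sketch of a dense Keras model for this kind of tabular binary classification;
# widths, dropout rate, batch size, and epochs are illustrative placeholders
from tensorflow import keras
from tensorflow.keras import layers

def build_model(n_features, widths=(512, 128, 32), dropout=0.3):
    model = keras.Sequential()
    model.add(layers.InputLayer(input_shape=(n_features,)))
    for w in widths:
        model.add(layers.Dense(w, activation='relu'))
        model.add(layers.Dropout(dropout))             # regularization on each layer
    model.add(layers.Dense(1, activation='sigmoid'))   # binary output
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=[keras.metrics.AUC(name='auc')])  # ROC curve by default
    return model

model = build_model(X_train.shape[1])

# monitoring 'val_auc' ensures early stopping is judged on the validation split
early_stop = keras.callbacks.EarlyStopping(monitor='val_auc', mode='max',
                                            patience=5, restore_best_weights=True)
model.fit(X_train, y_train,
          validation_data=(X_valid, y_valid),
          epochs=100, batch_size=1024,
          callbacks=[early_stop])
```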

Of course I’m not here to just show off some model prep code; this is an Automunge essay after all, and the goal is to give you an idea of just what is possible with this tool for automated data wrangling — it’s pretty cool! So the workflow here was that I prepared a data set with Automunge under full automation and then identified a reasonable set of hyperparameters for a predictive model — oh, and then tried a few different sets of feature engineering experiments on the data set to gauge impact on predictive accuracy. It turned out the most successful experiment came pretty early, in which I applied an Automunge feature importance evaluation to identify key columns and then just kind of threw the kitchen sink at those high-importance columns so to speak, by which I mean presented those columns to the machine learning model in multiple configurations / multiple transformations, while for the other columns not deemed of import just deferring to the automated feature engineering methods. Which, hey, that this worked so well I think is kind of a validation of having automated data wrangling in the first place. Here I’ll demonstrate.
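
Here’s a rough sketch along the lines of that demonstration, continuing the earlier sketches. The family tree primitives, processdict keys, and parameter names shown may differ from the current README, so double check there before running; the ‘kitchensink’ composition itself is just illustrative.

```python
# a hedged reconstruction of the feature-importance-then-kitchen-sink workflow;
# primitive names, processdict keys, and the column_dict lookup are assumptions
# worth verifying against the README
am = Automunger.AutoMunge()

# 1) feature importance evaluation, returning only the top 5% of columns
train, *rest, postprocess_dict = \
    am.automunge(df_train,
                 labels_column='isFraud',
                 featureselection=True,   # run the shuffle permutation evaluation
                 featuremethod='pct',     # keep only a top percentage of columns
                 featurepct=0.05)

# translate returned (suffix-appended) column names back to their source columns
top_source_columns = set()
for col in train.columns:
    # assumed structure: postprocess_dict['column_dict'][col]['origcolumn']
    top_source_columns.add(postprocess_dict['column_dict'][col]['origcolumn'])

# 2) define a 'kitchensink' root category presenting a column multiple ways:
#    box-cox (with its downstream z-score), plus min-max scaling and std-dev bins
transformdict = {'kitchensink': {'parents': ['bxcx'],
                                 'siblings': [],
                                 'auntsuncles': ['mnmx', 'bins'],
                                 'cousins': ['NArw'],
                                 'children': [],
                                 'niecesnephews': [],
                                 'coworkers': [],
                                 'friends': []}}

# a pretty vanilla processdict entry for the new root category; no custom
# transformation functions defined here, required keys per the README
processdict = {'kitchensink': {'dualprocess': None,
                               'singleprocess': None,
                               'postprocess': None,
                               'NArowtype': 'numeric',
                               'MLinfilltype': 'numeric'}}

# 3) assign the new root category to the identified high-importance columns
assigncat = {'kitchensink': list(top_source_columns)}

train, *rest, postprocess_dict = \
    am.automunge(df_train,
                 labels_column='isFraud',
                 assigncat=assigncat,
                 transformdict=transformdict,
                 processdict=processdict)
```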

So what’s being demonstrated here is first an Automunge application incorporating a feature importance evaluation, and by way of the ‘pct’ featuremethod the results of that evaluation are used to only return those columns in the top percent of evaluated performance, here specified as the top 5% via the featurepct parameter. Once we have those returned columns, note that they will include the suffix appenders which indicate the steps of transformations; also shown here is a short method to translate those returned column names back to the originals by way of data stored in the returned postprocess_dict. Now that we have identified columns of high import, we can create some custom sets of transformations to present those features to our training operation in multiple configurations, as would be expected to enable a more efficient extraction of properties for training the machine learning model. A custom set of transformations to be applied to one or more columns is first defined as a “family tree” of transformation category entries in a “transformdict”; those family primitives with offspring (parents/siblings) have their respective family trees checked for any offspring entries (children/niecesnephews/coworkers/friends), such that in this case bxcx will have offspring of a downstream z-score normalization. And then for any new category we’ll need a corresponding populated “processdict”, which as shown is a pretty vanilla version; if we wanted to define a custom transformation function this is where we would specify it. Note that although here we are only defining a new family tree root category ‘kitchensink’, that root category could also serve as a family tree entry in some other newly defined root category. (I encourage you to check out our GitHub README for full documentation on these data structures.) Finally, now that we have a custom set of transformations, we simply assign that root category to one or more columns in the “assigncat”. Then when we run automunge(.) on the original data set, we can pass the newly populated transformdict, processdict, and assigncat to run these custom sets of transformations on designated columns. And of course any columns that we did not assign to specific processing methods in the assigncat will just defer to the automated evaluation for appropriate steps of numerical encoding.

In addition to validation of existing functionality, another really worthwhile part of running experiments in the context of Kaggle competitions is to serve as fuel for ideas for extended functionality. At Automunge we have an ongoing mission to continually build out our library of feature engineering transformations. I’ll be honest, I had never heard of the phrase “feature store” before attending that TWiML conference the other week, but really in hindsight that’s exactly what we’re building: an open source catalog of feature engineering methods with built-in support for inferring properties of transformations from a training set for separate consistent transformations to validation, test, or additional training data sets, all within the context of automated numerical encoding of raw data, automated methods for feature importance, ML derived infill, evaluation of data property drift between training and subsequent data, and well I could go on, it’s just a really useful platform with a whole host of stuff to make your life easy. Basically we’re trying to solve all of those unmet needs that immediately precede the training of a ML model for tabular data. I hope some of you readers might give it a try; we welcome any feedback. Here’s just an example of a few of the transformations built into the catalog.

Selection of recent additions to transformation catalog aka a “feature store”

I mean this is just a small selection of the type of transforms that are available. What’s important to keep in mind, and I’ll reiterate, is that the properties of a transformation are derived from the data designated for training, so that consistent normalization parameters are applied in transformations to validation, test, or additional training data — no risk of data leakage between train and validation for instance. So a few shown here for numerical sets include ‘nmbr’ z-score normalization; ‘log0’ logarithmic transforms (note that values ineligible for log like 0 or nan are treated as subject to infill); then min-max scaling (oh, we also have a few variants of this scaling, such as ‘mnm6’ to ensure for example that subsequent data has a floor of 0 value for all-positive sets such as may be needed for kernel PCA, and a few other variants); the ‘dxdt’ is useful for unbounded numerical sets such as in time series to transform to time step specific properties (we also have a few variants of dxdt for things like velocity, acceleration, and jerk, oh and also some methods to spread the derivation over multiple rows to de-noise data for instance); and we’ve written about the ‘bxcx’ box-cox power law transform before (I think the addition of a z-score normalization to our default box-cox transform is really useful, as this number can get kind of wild based on properties of the data; there are other variants available in the catalog as well). I’ve previously demonstrated our ‘bins’ transform which aggregates a numerical set into bins based on the number of standard deviations from the mean; I consider the ‘pwrs’ transform kind of in the same neighborhood, it’s pretty useful for fat-tailed data (such as for instance perhaps the Transaction Amount values from this IEEE competition). Oh, and I’m pretty happy with how the ‘1010’ transform turned out — the whole idea here is that for categorical data with wider ranges of values, one-hot encoding may be less memory efficient. This ‘1010’ kind of preserves the benefits of one-hot encoding and can be used as an alternative to ordinal encoding for instance. Yep, plenty more, check out our README if you want to see what all is available.
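
As a quick illustration of the memory point behind ‘1010’ (just back-of-envelope arithmetic on column counts, not the Automunge implementation itself): a categorical feature with K unique values needs K columns under one-hot encoding, but only about ceil(log2(K)) columns under a binary style encoding.

```python
# back-of-envelope comparison of one-hot vs a binary ('1010'-style) encoding;
# this illustrates the column counts only, not the Automunge implementation
import math

for k in (2, 6, 50, 1000):          # number of unique categorical values
    onehot_cols = k
    binary_cols = math.ceil(math.log2(k))
    print(f'{k:>5} unique values: one-hot {onehot_cols:>5} columns, '
          f'binary {binary_cols:>2} columns')
```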

Turning focus back to my strategy for the IEEE competition, I mentioned earlier that the intent for my final submission was to aggregate the collection of experiments into a single ensemble with final output determined by a second tier of machine learning model trained on the collection of first tier experiments to generate a final prediction. If you look at that experiment log excerpt I mentioned earlier you’ll see a row for an archived csv I refer to as “training predictions (for teacher training)”, which is a set of predicted fraud probabilities corresponding to the training data, which I had intended to aggregate for all experiments into an array to train a second tier model using the same labels as the original training operation. Unfortunately this didn’t seem to be viable as the labels themselves were not probabilities, they were just boolean identifiers for presence of fraud, and so when I trained the second tier the fidelity of the corresponding probability predictions (such as by using “predict_proba” in XGBoost for example) actually went down from the first tier findings — in other words the model had so much ease at predicting 0/1 with the collection of probabilities that the corresponding second tier probabilistic outputs were almost all identical for the 0 labels and 1 labels. Which was an interesting predicament, wherein apparently if a target variable is too easy to predict, we lose the ability to extend predictions to probabilistic refinements.

“(Entrepreneurship’s) most vital function is learning. We must learn the truth about which elements of our strategy are working to realize our vision and which are just crazy. We must learn what customers really want, not what they say they want or what we think they should want.” — Eric Ries, The Lean Startup

Fortunately not all was lost in the ensemble methods, as I ended up running some experiments on manual aggregation which actually bore fruit. The hypothesis was that models trained with different hyperparameters or even different frameworks (e.g. XGBoost vs. Keras) would potentially have different biases, which could offset each other in an aggregation of each experiment’s predicted probabilities. And lo and behold, by running a few experiments in, like, averaging predicted probabilities and submitting to Kaggle, heck I climbed like 600 places on the leaderboard all without the use of a machine learning library, which I thought was kind of cool. For example, I found that my top two XGBoost models (which I believe were about a tenth of a percent apart in Kaggle submission ratings), when given something like a 75/25 weighted average, actually increased the top Kaggle assessment score materially, and heck, throwing in a different architecture with a lower score (from my admittedly somewhat less refined Keras attempt) — well I used a much smaller weighting on this one, something like a 99/1 weighting, but that improved the score as well. These types of experiments gave me about a 0.003 improvement on the Kaggle leaderboard score. That being said, when the final evaluation was performed I did have a noticeable drop from the initial assessment (on a subset of the test data) to the final assessment (with the entire test data set), so it’s not clear to me if this contributed. (My leaderboard score on the subset of test data was 0.942, and then the final private score based on the full test set was 0.909, so probably had a little overfit going on — to give you context, the final winning score was 0.946.)
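
The blending itself is about as simple as it sounds. Here’s a sketch along those lines, where the weights reflect the rough proportions mentioned above and the submission file names are placeholders:

```python
# a simple weighted blend of submission probabilities; file names are placeholders
import pandas as pd

xgb_best  = pd.read_csv('submission_xgb_best.csv')    # top XGBoost model
xgb_next  = pd.read_csv('submission_xgb_second.csv')  # runner-up XGBoost model
keras_sub = pd.read_csv('submission_keras.csv')       # lower-scoring Keras model

blend = xgb_best.copy()
# roughly 75/25 between the two XGBoost models, then a ~1% nudge from Keras
blend['isFraud'] = (0.99 * (0.75 * xgb_best['isFraud'] + 0.25 * xgb_next['isFraud'])
                    + 0.01 * keras_sub['isFraud'])

blend.to_csv('submission_blend.csv', index=False)
```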

I write all of this in the context of a recent visit to New Orleans for some homegrown food, unparalleled music, and one of a kind culture (where I found this amazingly ensembled painting on display at the New Orleans Museum of Art), and well, it wasn’t an uneventful trip. On my last day, how should I say this, a building fell down. The Hard Rock Hotel that was under construction, well it just kind of collapsed (literally). And I’m not going to speculate about the root cause, as it would be just that, speculation, but it did serve as a kind of wakeup call for my efforts with Automunge. Recognizing that such a failure of structure could easily have been the result of some failed engineering assessment, or heck possibly even the underlying software tools, it added a little weight to what I’m trying to accomplish with Automunge. This type of platform is infrastructure after all, and if this is successful, professionals in all kinds of industries could be performing evaluations based on this foundation — from fraud detection on financial transactions to heck potentially even mission critical systems like aviation infrastructure, who knows. And it was kind of a wakeup call that I perhaps have not been consistently following best practices for rigor in my testing of features prior to rollout (which will not shock anyone who has kept pace with the rate of updates). That same day I immediately began a quality control audit to extensively validate the performance, ranging from manual inspections of parameter impact to an automated evaluation of consistency in transformations between training data and test data. This was an extremely worthwhile effort, as in addition to a few transformation function typos, a material bug fundamental to the operation of populating infill — one of our core features after all — was found and fixed. It was a very easy bug to fix, just a matter of correcting the population of internal data structures, but the fact that it had gone unnoticed for so long was to me somewhat alarming.

“You will need to be prepared for the fact that the Five Whys is going to turn up unpleasant facts about your organization, especially at the beginning. It is going to call for investments in prevention that come at the expense of time and money that could be invested in new products or features. Under pressure, teams may feel that they don’t have time to waste on analyzing root causes even though it would give them more time in the long term. … Building an adaptive organization… requires executive leaders to sponsor and support the process.” — Eric Ries, The Lean Startup

I tell this story in the context of carrying a professional engineering certification, which carries some additional burdens of professional conduct and ethics which I take seriously (and which at least in theory also provides the benefit of a little extra credibility). I don’t think there is any equivalent in the field of machine learning; it is still a very young field after all, with fewer professional standards. But I think we in the machine learning field could all learn something from professional engineers in the mindset of having a code of ethics. A data scientist or machine learning practitioner, and especially a teacher thereof, should keep in mind that they have an ethical and professional responsibility to rigor in their methods. These predictions may form the basis of important decisions that impact lives and safety. But rigor alone is not enough. We need to recognize the boundaries of what is possible. If we are not assessing potential failure modes with associated exposures to harm and identifying core assumptions, then we are not doing our job correctly. Identifying bias is just one example. The tools and methods that we build now could potentially serve as the foundation for generations of predictive infrastructure. We here at Automunge take that responsibility very seriously. Thanks, I’ll get off my pulpit; if you enjoyed or got some value from this essay perhaps you can ask Alexa to play you a little Preservation Hall Jazz Band.

Books that were referenced here or otherwise inspired this post:

The Lean Startup — Eric Ries


Probability for Machine Learning — Jason Brownlee


As an Amazon Associate I earn from qualifying purchases.

Albums that were referenced here or otherwise inspired this post:

That’s It — Preservation Hall Jazz Band


As an Amazon Associate I earn from qualifying purchases.

For further readings please check out my Table of Contents, Book Recommendations, and Music Recommendations. For more on Automunge: automunge.com

