I’ve been meaning to write up my experiences with this Kaggle competition since it ended a few weeks ago, but I’ve been caught up with other things. I really loved the community and the culture of sharing that developed; it helped me learn a ton and finish near the top 15%. I was hoping to crack the top 10% but dropped quite a lot in the final standings. I now know why, and I look forward to more competitions in the future.
First, I want to give big shoutouts to everyone who shared code and advice. I myself didn’t have much to add to the conversations, unfortunately, but guys like Paul Duan (who eventually won the competition) and Miroslaw Horbal really had a huge impact on the community by sharing. All my code was built on the starter code these guys provided, and I am very grateful for the help. Some of my code is published on GitHub: https://github.com/bachow/kaggle-amazon-contest.
I first started writing my code in R, actually, but I ran into many problems with limited memory. It turns out that R keeps its working data in RAM, so when you’re working on big data sets on a machine with limited RAM (like my vintage 2011 MacBook Air with 4 GB of RAM), you’re going to have a bad time: R can’t swap/cache data to disk unless you use one of the packages written specifically to combat this problem. Since much of the community is based on Python, I made the call to switch over. Python has no (noticeable) issues with limited RAM (at certain points, my Python scripts were using about 20 GB of virtual memory!), and it’s very easy to learn. A friend of mine described Python as, “… easy enough that your boss can learn it!” In my own experience, I became fluent in Python over the course of 2–3 months by taking ‘Interactive Programming in Python’ on Coursera and working on Kaggle challenges.
I started out building simple logistic regression and random forest models on the provided training data (i.e., no feature engineering involved). They worked… that’s about all I can say for them. The AUCs were unimpressive (~0.65). The real problem is that the data set doesn’t provide much information: 30,000 rows and 9 features, one of which was highly correlated with the other features (I ended up dropping it). Additionally, many feature levels have small sample sizes. All of this meant I had to perform some sort of feature engineering to extract more information out of the data set.
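As a rough sketch of that baseline step (on synthetic stand-in data, since the competition set isn’t reproduced here), using scikit-learn:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy stand-in for the competition data: integer-coded features, binary target.
rng = np.random.default_rng(0)
X = rng.integers(0, 50, size=(1000, 8)).astype(float)
y = (X[:, 0] > 25).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit each baseline model and score it on the held-out 20%.
for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)):
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(type(model).__name__, round(auc, 3))
```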
I ended up modifying Miroslaw’s code for feature engineering. His code generated hash values from the combinations of feature-level values for all combinations of 2 and 3 features. So for the features RESOURCE, MGR_ID, and ROLE_ROLLUP_1, the output would be the new features RESOURCE+MGR_ID, RESOURCE+ROLE_ROLLUP_1, MGR_ID+ROLE_ROLLUP_1, and RESOURCE+MGR_ID+ROLE_ROLLUP_1, with the hash values as the new level values. There were a few cases where the hashing function produced negative or non-unique values, however, so I modified it to use Cantor pairing instead. I also tried generating fourth-order feature combinations, but these didn’t have much of an effect on AUC in the end. The total number of features grew with each order of combination: 8 + (8 choose 2) + (8 choose 3) = 8 + 28 + 56 = 92.
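Miroslaw’s actual code isn’t reproduced here, but as a sketch of the idea (function names are my own): the Cantor pairing function π(a, b) = (a + b)(a + b + 1)/2 + b maps each pair of non-negative integers to a unique non-negative integer, and nesting it extends the encoding to triples and beyond.

```python
from itertools import combinations

def cantor_pair(a, b):
    # Cantor pairing: a unique non-negative integer for each pair of
    # non-negative integers (never negative, never colliding).
    return (a + b) * (a + b + 1) // 2 + b

def cantor_tuple(values):
    # Fold the pairing function left-to-right to encode longer tuples.
    code = values[0]
    for v in values[1:]:
        code = cantor_pair(code, v)
    return code

def engineer(row):
    # row: tuple of integer feature levels; append codes for every
    # 2- and 3-feature combination, as described above.
    feats = list(row)
    for r in (2, 3):
        for combo in combinations(range(len(row)), r):
            feats.append(cantor_tuple([row[i] for i in combo]))
    return feats

row = (3, 7, 12, 5, 9, 2, 8, 1)
new_row = engineer(row)
assert len(new_row) == 8 + 28 + 56  # 92 features, matching the count above
```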
Speaking of AUC: it comes from Receiver Operating Characteristic (ROC) analysis. A classifier under test produces true positive, true negative, false positive, and false negative predictions; the ROC curve plots the true positive rate against the false positive rate as the classification threshold varies, and the AUC (area under that curve) quantifies how well the model’s predictions agree with the ground truth. Typically, to validate my models, I made 80/20 splits of the training set, trained the model on the 80% split, and tested its performance on the 20% split. I did this 10 times and took the mean of the AUCs.
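That validation loop might look something like this sketch (toy data; scikit-learn’s roc_auc_score does the area computation):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy data with a noisy linear signal in the first column.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

aucs = []
for seed in range(10):  # ten independent 80/20 splits
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    aucs.append(roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

mean_auc = float(np.mean(aucs))
```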
I also tried three other forms of feature engineering to squeeze more information out of the data: raw counts of each level within each feature, rankings of the levels within each feature by number of occurrences, and cutting off levels with low counts. For the last method, I wrote a cross-validation wrapper on top of my feature selection function so that I could select the cutoff count below which a level was discarded. In other words, if the cutoff that maximized AUC for a particular feature was 10, then all levels of that feature occurring fewer than 10 times were assigned to a common discard level.
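A minimal sketch of those three encodings (function names and the shared discard level are my own, purely for illustration):

```python
from collections import Counter

def count_features(column):
    # Replace each level with how often it occurs in the column.
    counts = Counter(column)
    return [counts[v] for v in column]

def rank_features(column):
    # Rank levels by frequency: the most common level gets rank 1.
    counts = Counter(column)
    order = {lvl: r for r, (lvl, _) in enumerate(counts.most_common(), start=1)}
    return [order[v] for v in column]

RARE = -1  # hypothetical shared "discard" level

def bucket_rare(column, cutoff):
    # Collapse levels seen fewer than `cutoff` times into one shared level.
    counts = Counter(column)
    return [v if counts[v] >= cutoff else RARE for v in column]

col = [10, 10, 10, 42, 42, 7]
assert count_features(col) == [3, 3, 3, 2, 2, 1]
assert bucket_rare(col, 2) == [10, 10, 10, 42, 42, RARE]
```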
I then tried various types of feature selection for my logistic models (random forest implementations generally have feature selection built in, so I didn’t need to worry about it there). AUC scores suffer when all of the engineered features are included in the model due to overfitting, so there should exist some combination of features that maximizes the AUC score. Because exhaustive feature selection is computationally expensive (see below), most algorithms don’t try to find the global maximum; they’re heuristic methods that may become trapped in local maxima. I tried a variety of selection algorithms with this problem in mind: forward greedy selection (part of Miroslaw’s code), backward greedy selection, and add-one, drop-one (A1D1) selection. The greedy methods are so named because once a feature is added or dropped, it stays that way; the risk is that a feature added (or dropped) early on might later raise the AUC if removed (or re-added), and a pure greedy search never revisits that decision. For this particular data set, that didn’t seem to happen: my A1D1 functions rarely, if ever, overturned earlier selections.
Feature selection consumed by far the most CPU time of any step. Consider the forward selection algorithm: on each iteration, every feature not yet in the model is tried, and each candidate is cross-validated by calculating AUC 10 times (10-fold CV). For the first iteration alone, that’s 92 features × 10 folds = 920 model fits! I found that most runs selected around 20–30 features, so that’s roughly 18,400 to 27,600 CV fits per run at around 1 second per fold! The time per fold also grows as more features are selected in forward selection, and shrinks in backward selection. I ended up rewriting the functions to support multithreading to speed up the process, but with the limited computing resources at my disposal, a run still took a few hours.
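A stripped-down sketch of forward greedy selection, with the cross-validated AUC abstracted into a pluggable score function (the toy scorer below is purely illustrative, not real CV):

```python
def forward_select(features, score):
    # Greedy forward selection: repeatedly add the candidate feature that
    # most improves the score; stop when no remaining feature helps.
    selected, best = [], float("-inf")
    remaining = list(features)
    while remaining:
        trials = [(score(selected + [f]), f) for f in remaining]
        trial_best, f = max(trials)
        if trial_best <= best:
            break  # no candidate improved the score
        best = trial_best
        selected.append(f)
        remaining.remove(f)
    return selected, best

# Toy scorer: pretend features "a" and "c" carry signal, "b" is noise.
weights = {"a": 0.2, "b": 0.0, "c": 0.1}
toy_score = lambda feats: 0.65 + sum(weights[f] for f in set(feats))

chosen, auc = forward_select(["a", "b", "c"], toy_score)
assert set(chosen) == {"a", "c"}  # "b" is never added
```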
The next step (again, for my logistic models) was to tune the hyperparameter C. I don’t completely understand hyperparameter tuning, so I’ll just link to a blog post I found on the topic: http://www.r-bloggers.com/automatic-hyperparameter-tuning-methods/. The gist is that C controls the penalty for large coefficients relative to the penalty for prediction errors while fitting the model to the data. For the most part, this part of my code was taken unmodified from Miroslaw’s, but I hope to learn more about performance tuning and optimization in the future.
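As a sketch of what that tuning step can look like in scikit-learn (where C is the inverse of the regularization strength: smaller C penalizes large coefficients more heavily), on toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy data with a clean linear decision boundary.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

# Simple grid search over C, scored by 5-fold cross-validated AUC.
best_c, best_auc = None, -1.0
for c in [0.01, 0.1, 1.0, 10.0]:
    auc = cross_val_score(LogisticRegression(C=c, max_iter=1000), X, y,
                          cv=5, scoring="roc_auc").mean()
    if auc > best_auc:
        best_c, best_auc = c, auc
```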
For some of my models, I generated ensemble learners: combinations of models whose predictions are averaged with weights determined by their performance during cross-validation. For instance, if model 1 had an AUC of 0.85 and model 2 an AUC of 0.87, then model 2 would carry more weight in the average. I created ensembles of purely logistic regressors as well as combinations of logistic and random forest classifiers. For the former, I wrote an AdaBoost-style implementation that iteratively draws a bootstrap sample from the data set, then cross-validates and fits a model. The model’s performance is analyzed with respect to individual data points, and points that are misclassified are weighted more heavily in the next bootstrap. After 20 or so iterations, the algorithm produces the weighted-average predictions.
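The weighted averaging itself is simple; here’s a sketch with made-up numbers (the two prediction vectors and AUCs are hypothetical):

```python
import numpy as np

def weighted_ensemble(predictions, aucs):
    # Average model predictions, weighting each model by its validation AUC
    # (weights normalized to sum to 1).
    w = np.asarray(aucs, dtype=float)
    w = w / w.sum()
    return np.average(np.asarray(predictions, dtype=float), axis=0, weights=w)

# Hypothetical predicted probabilities from two models on three test rows.
p1 = [0.9, 0.2, 0.6]   # model 1, validation AUC 0.85
p2 = [0.7, 0.4, 0.8]   # model 2, validation AUC 0.87

blend = weighted_ensemble([p1, p2], [0.85, 0.87])
# Model 2's predictions are pulled toward slightly more than half the weight.
```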
Random forest (RF) classifiers are more straightforward: there’s no external feature selection or hyperparameter tuning. The algorithm Just Works. Unfortunately, I found that these models lagged behind my logistic regression models, which is strange because RF models have a reputation for generating better, more accurate predictions than almost anything else available. It might be that the default RF implementation in Python’s scikit-learn is weak, but I don’t know. I would love to write my own, as I saw that some other competitors scored pretty high with their own custom implementations.
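For reference, a minimal scikit-learn random forest run on toy data. One caveat on the “defaults” point: older scikit-learn releases defaulted n_estimators to only 10 trees, which alone can depress AUC, so bumping the tree count is a cheap first tweak.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy integer-coded data, binary target driven by the first column.
rng = np.random.default_rng(3)
X = rng.integers(0, 20, size=(400, 8)).astype(float)
y = (X[:, 0] > 10).astype(int)

# More trees rarely hurt accuracy; they just cost more time to fit.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
auc = cross_val_score(rf, X, y, cv=5, scoring="roc_auc").mean()
```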
So those were the various approaches I tried for this competition. Ironically, my highest-scoring submission came from a single logistic regression learner on up-to-third-order feature combinations, not from my more complicated algorithms. I think my biggest problem was that my cross-validation was not thorough enough. I took the simple mean of the 10-fold CV and didn’t account for the associated variability, so during feature selection my algorithms may have chosen features that scored a higher mean AUC than others but did so on the strength of outlier folds. There were plenty of times when my internal AUC scores ended up much higher than the AUC scores Kaggle eventually reported. Coming from an academic biology background, I should have known better! In my defense, I planned to write a CV function that took variability into account, but I ended up spending my time writing the more complicated functions instead. Even though I implemented much of the same methodology as higher-ranked competitors, their cross-validation seemed to be more thorough, resulting in better-performing models. Lesson learned: cross-validation is really fucking important.
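One way to fold variability in, roughly what I had planned: compare candidates on a pessimistic “mean minus one standard deviation” score instead of the raw mean (the fold AUCs below are made up for illustration):

```python
import numpy as np

def cv_summary(aucs):
    # Mean alone hides spread; report mean, sample std, and a pessimistic
    # "mean minus one std" score for comparing candidate feature sets.
    a = np.asarray(aucs, dtype=float)
    return a.mean(), a.std(ddof=1), a.mean() - a.std(ddof=1)

# Two hypothetical feature sets: B has the higher mean but is far noisier.
feat_a = [0.86, 0.87, 0.85, 0.86, 0.87]
feat_b = [0.95, 0.80, 0.92, 0.78, 0.93]

mean_a, std_a, pess_a = cv_summary(feat_a)
mean_b, std_b, pess_b = cv_summary(feat_b)
# B wins on the raw mean, but A wins on the pessimistic score,
# which is exactly the kind of outlier-driven pick described above.
```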
All in all, I had a lot of fun with this competition and learned a lot about building robust classification models in Python. I’ve since started on the Belkin Energy Disaggregation competition, a different challenge but every bit as enjoyable.