Epi Notes

Weird Kaggle, the superiority of books, and other reflections

I recently entered a Kaggle competition to brush up on some modeling skills.1 The analysis problem is a pretty typical clinical prediction question. My final product is, well, "final product" is an extremely charitable description of it.2 But despite this it was a fascinating and worthwhile experience, full of interesting questions to ponder, such as "what is the fundamental difference between statistics and machine learning?" and "do they realize their evaluation metric is pretty silly?"

The following are some notes from my project log.3

Wait, you want me to maximize what?

As I read the rules, they started off normally. The competition asks you to predict the chance of survival after getting a bone marrow transplant. It evaluates submissions based on concordance, which is just a scoring method that rewards the model for correctly ranking who will survive longer (it doesn't matter if you are wrong about how much longer). However, in this competition the concordance scores are calculated within each race group: models with higher overall concordance get higher scores, but models that perform worse for some race groups than others are penalized.

(More precisely, they take each race group's concordance, and subtract the standard deviation from the mean of those stratified scores).
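
For concreteness, here's roughly how I think of that metric in code. This is just my own sketch: the column names are made up, and I'm using lifelines' concordance_index rather than whatever implementation the competition actually uses.

    import numpy as np
    import pandas as pd
    from lifelines.utils import concordance_index

    def stratified_score(df: pd.DataFrame) -> float:
        """Mean of the per-race-group concordances minus their standard deviation.
        Assumes hypothetical columns: 'race_group', 'time', 'event', and 'pred',
        where a higher 'pred' means predicted to survive longer."""
        group_scores = []
        for _, grp in df.groupby("race_group"):
            c = concordance_index(grp["time"], grp["pred"], event_observed=grp["event"])
            group_scores.append(c)
        group_scores = np.array(group_scores)
        # np.std is the population standard deviation; the competition may use the
        # sample version, but the idea is the same.
        return group_scores.mean() - group_scores.std()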

The idea is straightforward, and it seems extremely reasonable to me. Predictions like this can have a big influence on the decision to offer or accept treatment, and if that decision is much more accurate for white people we have a problem. People who should get the treatment won't be offered it, people who get the treatment will later learn it isn't as effective as they were told, etc. (There are lots of examples of bias in clinical predictions, not just by race but also by sex, age, etc).

Once I grasped what this metric was meant to accomplish, I thought "but surely if I increase one race group's score independently of the rest, the total score will always increase." Well, my statistical intuition wasn't clear on this, so I decided to prove it. And it turns out this is not always the case; for example, if the scores are {0.5, 0.52, 0.55, 0.57, 0.63}, then raising the 0.63 further will lower the overall score (see the appendix for a more thorough description of when this happens).
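
A quick numeric check of that example (using the population standard deviation; the conclusion is the same with the sample version):

    import numpy as np

    scores = np.array([0.50, 0.52, 0.55, 0.57, 0.63])
    print(scores.mean() - scores.std())   # ~0.509

    # Improve only the best group's concordance and the overall metric goes *down*.
    scores_better_top = np.array([0.50, 0.52, 0.55, 0.57, 0.68])
    print(scores_better_top.mean() - scores_better_top.std())   # ~0.501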

This is... probably not what we want. One strategy I thought of but never tried out is to actively worsen the highest scores and see if the overall score improves (I would do this by adding a bit of random noise to the predictions for those groups—literally obscuring the predictions). Surely this is completely against the spirit of the competition! A better metric would give a greater incentive to improve the worst score compared to the best, but never incentivize worsening a score.4

It gets even weirder. If I understand correctly (and I totally may not), the competition does not evaluate whether people are ranked correctly against those of other races. In other words, someone could win the competition with a model that says all white people will survive longer than all non-white people. "If you aren't X race, don't bother taking the treatment, it won't work as well for you." Seriously?!

A book is worth a thousand tutorials

The outcome is how long patients survive after treatment, so I had to brush up a bit on survival analysis. I have a bunch of experience with it from school and research a few years ago, and remembered the basic differences between available models (Cox models vs. accelerated failure-time, their assumptions and parameterizations, and the benefit of survival analysis over basic risk analysis). I also saw that XGBoost has built-in survival analysis that can account for people who we stop following before they have the event (right-censored data). My problem was I had confused myself about how censoring is accounted for and why it matters. Rather than blindly use the tool I "knew" was best, I figured I should remind myself why.

I skimmed a few tutorials and the Wikipedia page. This was basically a waste of time, and then I remembered I have one good book on epidemiology & biostatistics methods: Modern Epidemiology by Lash, VanderWeele, Haneuse, and Rothman. It has a short chapter with this outline:

  • [Description of an example study]. The outcome is survival time after treatment. The results are right-censored.
  • Naive idea #1: take the average time in each group (treated and untreated). This is bad because some times are censoring times, which are always shorter than the actual survival time. If one group has more censoring than the other, this messes up our comparison.
  • Naive idea #2: ignore the people who were censored. This works, but it ignores useful information (the fact that they survived for at least that long).
  • Correct plan: find the estimate that is the best fit for both types of people, the censored and the uncensored. By "the best fit" we mean [insert mathy details here5]; a rough code sketch of this idea follows below.
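
To make "the best fit for both types of people" a little more concrete, here is a minimal sketch of the right-censored log-likelihood for a log-normal AFT model (my own illustration, not the book's notation): people with an observed event contribute the density at their event time, while censored people contribute the probability of surviving past their censoring time.

    import numpy as np
    from scipy.stats import norm

    def lognormal_aft_loglik(times, event, mu, sigma):
        """Right-censored log-likelihood when log(T) ~ Normal(mu, sigma^2).
        times: event or censoring time; event: 1 = event observed, 0 = censored;
        mu: linear predictor (e.g. X @ beta); sigma: scale parameter."""
        z = (np.log(times) - mu) / sigma
        ll_event = norm.logpdf(z) - np.log(sigma) - np.log(times)  # log density of T at t
        ll_censored = norm.logsf(z)                                # log P(T > t)
        return np.sum(np.where(event == 1, ll_event, ll_censored))

Maximizing this over mu and sigma is what "the best fit" means here: drop the censored term and you're back to naive idea #2; treat censoring times as if they were event times and you're back to (roughly) naive idea #1.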

Now, all this information was in the tutorials, but none of them gave the right sequence of examples at the right level of detail for me to immediately grasp what I needed to grasp. I continually have this experience when learning a new concept: the single good reference is so far superior to random tutorials and documentation that the online stuff feels practically useless.

The unreasonable effectiveness of boring models

When I was younger, I imagined that machine learning was all about constructing highly intricate models with fancy math and stuff. But as I've developed more experience I've realized that's pretty far off.

Now, apparently XGBoost is all you need, and like many others I used it for this competition. So take decision tree ensembles as an example6:

  • Put the data points in a high dimensional space
  • Split up the space into a bunch of (high dimensional) boxes
  • Calculate the mean inside each box. That's the prediction (a toy sketch follows below).
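
Here's a toy version of that recipe, with one feature and a single split (my own illustration, obviously nothing like a tuned ensemble):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=200)
    y = np.sin(x) + rng.normal(0, 0.3, size=200)

    # One axis-aligned split carves the feature space into two "boxes";
    # the prediction inside each box is just the mean of the training outcomes in it.
    split = 5.0
    pred = np.where(x < split, y[x < split].mean(), y[x >= split].mean())

A real tree picks the splits to reduce error, and boosting stacks many such trees fit to (roughly) the previous trees' residuals, but the prediction is still "the mean inside each box."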

The value of XGBoost is not in some novel, intricate mathematical model ("just make some boxes" is not that impressive). It comes from the algorithm: it can be tuned easily with a few hyperparameters, it is highly optimized, and it scales to large datasets.

This is a very different feeling from the statistical approach I'm more familiar with, where you carefully design functions that are tailored to the relationship between the variables, where the functions are smooth and curve around in just the way you specify, and you have very precise ways to restrict or relax the model. Coming from that perspective, gradient-boosted decision trees feel like saying "who cares, just let it be any function and make the computer do a lot more work."

This helped it click for me why machine learning can still work pretty well when you have a massive number of observations and a massive number of variables. It works because the algorithms 1) are computationally straightforward and you can optimize the hell out of them; and 2) iteratively grow in complexity as they learn, making it easy to control the complexity of the model. The second one is a super vague and imprecise idea, but I'm talking about how XGBoost lets you train for a little bit of time, to get a model that vaguely summarizes the most significant patterns in the data; then if you want to relax the model and increase the amount of detail it expresses, you just train the existing model a little more. There's no need to decide exactly how to change the model specification, like you typically have to when you put your statistician hat on. As you add more variables, you probably have a worse understanding of those variables and of plausible model specifications, so this "just train a little more and check again" approach becomes much more efficient.
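
That "train a little more" workflow is something XGBoost supports directly: you can hand an existing booster back to xgb.train and stack more rounds on top of it. A minimal sketch on fake data:

    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))
    y = X[:, 0] + rng.normal(size=500)
    dtrain = xgb.DMatrix(X, label=y)

    params = {"objective": "reg:squarederror", "max_depth": 3, "eta": 0.1}

    # A rough first pass: a model that captures the biggest patterns.
    booster = xgb.train(params, dtrain, num_boost_round=50)

    # Want more detail? Keep training the *same* model instead of re-specifying anything.
    booster = xgb.train(params, dtrain, num_boost_round=50, xgb_model=booster)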

So to sum up, it seems like the statistics approach is more about carefully crafted models that you try to understand more deeply: you can apply background knowledge (and priors), and you can mathematically express theories and test them out. But the machine learning approach is more about "some hard problems can be solved by seemingly suboptimal methods and brute force much more quickly than they can be understood by building up the statistics." Or if there is some concept of "mathematically expressing a theory," it's all handled by the computer and the human doesn't have to understand it to verify that it works. This is all probably a very surface-level way to compare, I'm just musing here.7

Some open questions

An unexpected tuning bottleneck

After shaping the data and getting the model to run for the first time, I expected the main improvements to come from tuning the number of iterations, so I set up early stopping with a training set and a test set. I was surprised to find that the concordance of the resulting model was still extremely poor. As it turned out, tuning the scale parameter of the AFT distribution made an enormous difference. In particular, giving the AFT distribution a much higher variance helped.
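
For reference, this is roughly the setup I mean, on fake data; the point is the aft_loss_distribution_scale parameter, which is the variance knob I'm talking about (the exact value here is just for illustration):

    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))
    time = np.exp(X[:, 0] + rng.normal(size=500))   # fake survival times
    event = rng.integers(0, 2, size=500)            # 1 = event observed, 0 = right-censored

    dtrain = xgb.DMatrix(X)
    # AFT labels are an interval: [t, t] for observed events, [t, +inf) for censored people.
    dtrain.set_float_info("label_lower_bound", time)
    dtrain.set_float_info("label_upper_bound", np.where(event == 1, time, np.inf))

    params = {
        "objective": "survival:aft",
        "eval_metric": "aft-nloglik",
        "aft_loss_distribution": "normal",
        "aft_loss_distribution_scale": 1.5,   # <- the scale/variance knob
        "max_depth": 3,
        "learning_rate": 0.05,
    }
    booster = xgb.train(params, dtrain, num_boost_round=200)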

I've thought about this a little and done a quick search, but I still don't really understand why that's the case. Maybe if the variance is too low, the likelihood is mostly influenced by data points that are quite close to the mean, and harder-to-predict data points are basically ignored by the model. This still feels like a pretty vague idea, and I haven't been able to verify it.

What's the most straightforward way to maximize the stratified metric?

My approach was to start by maximizing the overall concordance, then see if I could optimize for the stratified metric used by the competition. I got a reasonably good stratified score right off the bat, and when I checked the concordances within each race group, they seemed close enough together that I wouldn't expect massive improvements from rebuilding my analysis, so I didn't. But I am curious what the best way to do that would be.

The idea that seemed the least feasible was building a custom loss function. Instead of just minimizing error in each individual prediction, a custom loss function would put higher penalties on errors if the prediction was for the race group that had higher errors overall, and conversely, lower penalties for predictions in the groups that had better predictions. The problem is that the loss function is no longer an independent calculation for each data point (it becomes non-decomposable), which is not supported by the library and likely comes with a big cost in terms of optimizing the computation.

I had two other ideas: I could weight the smaller race groups higher, to ensure that the model didn't trade performance on a smaller group for better performance on a larger group. Or, I could run stratified models for each race group, and make predictions using the corresponding model. For the latter option, I also had the idea of adding noise to the top performing group to see if that improved the score (see above about why the competition metric is a weird choice).
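
The weighting idea at least is simple to set up; a sketch with a made-up group column (the weights would then go into the model, e.g. via xgb.DMatrix's weight argument):

    import pandas as pd

    df = pd.DataFrame({"race_group": ["A"] * 700 + ["B"] * 200 + ["C"] * 100})  # toy data

    # Inverse-frequency weights: each group gets the same total weight, so the loss
    # can't trade a small group away for marginal gains on a large one.
    weights = 1.0 / df["race_group"].map(df["race_group"].value_counts())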

Appendix: proof that raising a score can decrease the metric

Write the stratified concordances as x_1, ..., x_n, with mean m = (1/n) Σ x_i and (population) standard deviation s = sqrt((1/n) Σ (x_i − m)²). The metric is M = m − s. Because Σ (x_i − m) = 0, the partial derivative of s with respect to a single score x_j is (x_j − m)/(n·s), so

∂M/∂x_j = 1/n − (x_j − m)/(n·s) = (1/n)·(1 − (x_j − m)/s).

This is negative exactly when x_j − m > s, i.e. when that group's concordance sits more than one standard deviation above the mean of the group scores. In the example {0.5, 0.52, 0.55, 0.57, 0.63}: m = 0.554 and s ≈ 0.045, while 0.63 − 0.554 = 0.076 > 0.045, so nudging the 0.63 upward lowers the metric. (If the metric instead uses the sample standard deviation s' with an n − 1 denominator, the same calculation gives the condition x_j − m > (n − 1)·s'/n; the example satisfies that too.)


  1. In my current role as an epidemiologist, I don't do a ton of modeling and spend more time on things related to study design, descriptive analysis, communications, and making our data accessible to the public. 

  2. I'm currently sitting at the 15th percentile, so no, not really my best effort. 

  3. If you don't already keep logs of all the dilemmas and thoughts and ideas you have during a project, I can't recommend it enough. It's easily worth the extra time it takes to write it all down in terms of thinking more efficiently and learning from the experience. 

  4. To be fair, I am not sure if the obfuscation strategy would actually make much difference in practice. Either way it feels wrong though. 

  5. Use the typical maximum likelihood framework, where we specify an outcome distribution as a function of parameter values, and then try lots of parameter values in order to make the corresponding outcome distribution as close as possible to the pattern observed in the actual data. For an AFT model, we start by specifying an expression for both the survival function and the probability density function. During maximum likelihood estimation, we first calculate the likelihood among the non-censored individuals using the PDF; then we calculate the likelihood among the censored individuals using the survival function. The product of these two is the model's likelihood function. The Cox model uses an analogous approach, but with only relative hazards, leaving the full survival function unspecified. 

  6. My description here is pretty silly since I didn't want to get super technical, but I do like the overview given in the XGBoost docs.

  7. Of course, there's a longstanding debate about the differences between the fields (which includes arguments that the entire debate is silly, as well as the slightly aggressive "regression is not machine learning!!", which definitely is silly). I enjoyed reading both this post by Sam Finlayson, with some reasons why it's a false dichotomy plus some historical notes about the origin of the terms, and this post by Frank Harrell, where he uses the two terms to refer to distinct modeling approaches and gives pragmatic notes to help decide between the two. My only purpose in talking about this was that I noticed the difference in approach, and I totally get how you could use regression in a machine learn-y way or decision tree ensembles in a statistics-y way.