How does Kaggle score submissions?

Are Kaggle Competitions won by chance?


Kaggle competitions determine final placements based on a held-out test set.

A held-out test set is a sample; it may not be representative of the population being modeled. Since each submission is like a hypothesis, the algorithm that won the competition may simply have fit that particular test set better than the others by chance. In other words, if a different test set were drawn and the competition repeated, would the placements stay the same?

For the sponsoring company, this doesn't really matter much (most likely any of the top 20 submissions would improve on their baseline). Ironically, however, they could end up using a first-place model that is actually worse than the rest of the top 5. To the contestants, Kaggle ultimately looks like a game of luck - luck is not needed to stumble upon the correct solution, but to stumble upon the solution that best matches the test set!
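To make the concern concrete, here is a minimal sketch (the model count, shared true accuracy, and test-set size are arbitrary assumptions): ten models with identical true accuracy are ranked on two independently drawn test sets of the same size, and the "winner" frequently changes between draws.

```python
# Hypothetical illustration: models with the SAME true accuracy, ranked on two
# independently drawn test sets of equal size. The leaderboard order is driven
# purely by sampling noise, so the winner often differs between the two draws.
import numpy as np

rng = np.random.default_rng(0)
n_models, true_acc, test_size = 10, 0.80, 2000

def leaderboard():
    # Each model's observed accuracy is a binomial draw around the same true accuracy.
    observed = rng.binomial(test_size, true_acc, size=n_models) / test_size
    order = np.argsort(observed)[::-1]          # best observed model first
    return order, observed

order_a, acc_a = leaderboard()
order_b, acc_b = leaderboard()                  # "repeat the competition" with a new test set

print("winner on test set A:", order_a[0], "with accuracy", acc_a[order_a[0]])
print("winner on test set B:", order_b[0], "with accuracy", acc_b[order_b[0]])
```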

Is it possible to change the competition so that all top teams that are statistically indistinguishable are declared winners? Or so that the most parsimonious or computationally cheapest model within that group wins?





Reply:


Yes, your reasoning is correct. If a different test set were drawn and the competition repeated, the ranking would change. Consider the following example: suppose every entry to a Kaggle binary classification contest simply guesses its predictions at random (and, say, independently of the others). By chance, one of them will agree with the holdout labels more than the rest, even though none of them is actually predicting anything.
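A minimal sketch of this coin-flipping scenario (the entry count and holdout size are arbitrary assumptions): every entry guesses labels at random, yet the best of them still clears 50% accuracy by a visible margin.

```python
# Hypothetical illustration: 1000 entries guess binary labels completely at random.
# None of them has any predictive power, yet the best one "beats" the 50% baseline
# on the holdout set simply because it is the maximum of many noisy scores.
import numpy as np

rng = np.random.default_rng(42)
n_entries, holdout_size = 1000, 500

holdout = rng.integers(0, 2, size=holdout_size)               # true binary labels
guesses = rng.integers(0, 2, size=(n_entries, holdout_size))  # pure random guessing
accuracies = (guesses == holdout).mean(axis=1)

print("best entry's accuracy:", accuracies.max())   # typically around 0.55-0.58
print("average accuracy:     ", accuracies.mean())  # about 0.50, as expected
```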

While this is a bit of a contrived example, it shows that variance in the individual models means that submitting many such entries effectively amounts to fitting the noise in the holdout set. It also tells us that (depending on each model's variance) the top N models are likely to generalize about equally well. This is the garden of forking paths, except that the forks are explored by different people (which doesn't change the conclusion).

Is it possible to change the competition so that all teams whose performance is statistically indistinguishable from the top performance on the test set are declared winners?

Indeed.

  • One approach (impractical as it may be) would be to require each entry to explicitly quantify the variance of its model, which would give us a confidence interval (CI) for its holdout performance.
  • Another approach, which could be computationally expensive, would be to bootstrap a CI for holdout performance, which would require making a training and testing API available for all models (a rough sketch of a cheaper variant is given below).
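Neither bullet spells out the mechanics, so here is a minimal sketch of a cheaper, related variant (all team names, accuracies, and sizes are made up for illustration): bootstrap the holdout set itself to get a CI for each entry's score, then treat every entry whose interval overlaps the leader's as statistically indistinguishable from the top.

```python
# Hypothetical illustration: percentile-bootstrap CIs for holdout accuracy, obtained by
# resampling the holdout rows (no retraining needed). Entries whose CI reaches the
# leader's lower bound are flagged as statistically indistinguishable from the top.
import numpy as np

rng = np.random.default_rng(7)
holdout_size, n_boot = 2000, 2000

y_true = rng.integers(0, 2, size=holdout_size)
# Three made-up entries with true accuracies of roughly 0.82, 0.81 and 0.75.
entries = {
    "team_a": np.where(rng.random(holdout_size) < 0.82, y_true, 1 - y_true),
    "team_b": np.where(rng.random(holdout_size) < 0.81, y_true, 1 - y_true),
    "team_c": np.where(rng.random(holdout_size) < 0.75, y_true, 1 - y_true),
}

def bootstrap_ci(y_true, y_pred, alpha=0.05):
    """Percentile bootstrap CI for accuracy, resampling holdout rows with replacement."""
    idx = rng.integers(0, len(y_true), size=(n_boot, len(y_true)))
    scores = (y_pred[idx] == y_true[idx]).mean(axis=1)
    return np.quantile(scores, [alpha / 2, 1 - alpha / 2])

point = {name: (pred == y_true).mean() for name, pred in entries.items()}
cis = {name: bootstrap_ci(y_true, pred) for name, pred in entries.items()}
leader = max(point, key=point.get)
leader_lo = cis[leader][0]

for name, (lo, hi) in cis.items():
    verdict = "indistinguishable from the leader" if hi >= leader_lo else "clearly behind"
    print(f"{name}: accuracy {point[name]:.3f}, 95% CI [{lo:.3f}, {hi:.3f}] -> {verdict}")
```

With a holdout of this size, team_a and team_b typically come out indistinguishable while team_c is clearly behind; with a smaller holdout, all three intervals may overlap.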






There are other types of Kaggle competitions that have no element of randomness. For example, Santa's Stolen Sleigh.

It's a discrete optimization problem, and there isn't even a private leaderboard. What you see on the public leaderboard is the final standing.

Compared to supervised learning, which is an easy entry point for many people, this type of competition is "tougher".
