ImageNet top-5 accuracy was once the north star of computer vision research. It remains one of the most replicated benchmarks in ML history — and one of the least predictive of real-world model performance. Models that rank first on ImageNet frequently underperform on domain-shifted variants of the same task, on long-tail categories, and on inputs with modest distribution shift such as changes in lighting or camera angle. The lesson is not that ImageNet is useless; it is that a single aggregate metric on a single benchmark dataset captures a specific slice of performance, and that slice may or may not overlap with your deployment slice.
The most common benchmarking mistake is evaluating only on a test set drawn from the same distribution as the training set. This is fine for detecting overfitting, but it tells you little about how the model behaves on the inputs that matter most: rare events, edge cases, and out-of-distribution examples. A well-designed evaluation suite includes a held-out in-distribution set, a distribution-shifted set (different camera, different geography, different user demographic), and a deliberately adversarial set targeting known failure modes. Teams that run all three surface a far more honest performance profile.
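As a rough illustration of the three-set suite, the sketch below scores one model on each evaluation set separately and reports per-class accuracy alongside the aggregate, so that a weak tail class is not averaged away. Everything in it is an assumption made for the example: the names `accuracy_report` and `evaluate_suite`, the toy string-matching `toy_predict` model, and the handful of hand-written examples standing in for real evaluation data.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (input, true label); hypothetical toy format


def accuracy_report(predict: Callable[[str], str],
                    examples: List[Example]) -> Dict[str, object]:
    """Overall and per-class accuracy for a single evaluation set."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for x, label in examples:
        totals[label] += 1
        if predict(x) == label:
            hits[label] += 1
    n = sum(totals.values())
    per_class = {c: hits[c] / totals[c] for c in totals}
    return {
        "overall": sum(hits.values()) / n if n else float("nan"),
        "per_class": per_class,
        # The weakest class is often more informative than the average.
        "worst_class": min(per_class, key=per_class.get) if per_class else None,
    }


def evaluate_suite(predict: Callable[[str], str],
                   suite: Dict[str, List[Example]]) -> Dict[str, Dict[str, object]]:
    """Score the same model on every named evaluation set, kept separate."""
    return {name: accuracy_report(predict, examples)
            for name, examples in suite.items()}


# Toy stand-in for a real model: says "dog" only when it sees "bark".
def toy_predict(x: str) -> str:
    return "dog" if "bark" in x else "cat"


# Hypothetical miniature suite mirroring the three sets described above.
suite = {
    "in_distribution": [("meow", "cat"), ("bark", "dog"),
                        ("purr", "cat"), ("bark loud", "dog")],
    "shifted": [("meow", "cat"), ("woof", "dog"),
                ("bark", "dog"), ("growl", "dog")],
    "adversarial": [("bark-like meow", "cat"), ("silent", "dog")],
}

report = evaluate_suite(toy_predict, suite)
for name, result in report.items():
    print(name, result)
```

In this toy setup the model looks perfect on the in-distribution set and degrades on the other two, which is the pattern a single in-distribution score would hide. A real suite would swap in an actual model and properly sampled data; the point of the sketch is only that the three numbers are reported side by side instead of being pooled.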
Corelabel's annotation teams build evaluation sets as a separate workstream from training data, with explicit briefs to maximise coverage of the tail distribution. This separation — combined with blind evaluation against an immutable held-out set — is the closest practical approximation of honest model measurement available without a live deployment.