CoreLabel – Your Data Annotation & Governance Partner

Catastrophic forgetting, where the model improves on domain-specific tasks while degrading on general capabilities, is the most documented fine-tuning failure mode, yet it remains underweighted in most fine-tuning recipes. The reason is that the benchmark suites used in papers rarely measure general-capability degradation. If you evaluate only on the target domain, you won't see the regression until users report that the model "got dumber" about everything else. Mixing 15–20% general-domain data into every fine-tuning batch substantially mitigates forgetting, typically at little cost to domain gains.
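One way to picture the batch mixing described above is the minimal sketch below. It assumes both datasets fit in memory as plain Python sequences. The function name `mixed_batches` and its parameters are hypothetical, not taken from any particular training framework, and the 20% default simply reflects the upper end of the 15–20% range mentioned here.

```python
import random
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def mixed_batches(
    domain: Sequence[T],
    general: Sequence[T],
    batch_size: int = 32,
    general_fraction: float = 0.2,  # assumption: upper end of the 15-20% range
    seed: int = 0,
) -> Iterator[List[T]]:
    """Yield batches in which a fixed share of every batch is general-domain data."""
    if not 0.0 <= general_fraction < 1.0:
        raise ValueError("general_fraction must be in [0, 1)")
    n_general = round(batch_size * general_fraction)
    n_domain = batch_size - n_general
    if n_domain < 1:
        raise ValueError("batch_size too small for the requested general_fraction")

    rng = random.Random(seed)
    domain_order = list(domain)
    rng.shuffle(domain_order)

    # One epoch is one pass over the domain data; a trailing partial batch is dropped.
    for start in range(0, len(domain_order) - n_domain + 1, n_domain):
        batch = domain_order[start:start + n_domain]
        # General examples are drawn with replacement, so a small general
        # pool can still cover a long domain epoch.
        batch += rng.choices(general, k=n_general)
        rng.shuffle(batch)  # avoid a fixed domain-then-general ordering
        yield batch
```

Fixing the share per batch, rather than concatenating the two datasets and shuffling, keeps the general-domain signal present at every optimisation step instead of leaving it to chance.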

Data contamination is a quieter and more insidious problem. If any of your fine-tuning examples were part of the base model's pre-training corpus, which is plausible for any publicly available domain text, you are not so much teaching the model something new as reinforcing memorisation. Responses that look like strong domain understanding may simply be verbatim recall. Decontamination is difficult without access to the pre-training data manifest, but computing n-gram overlap between your fine-tuning set and the model's known training sources is a reasonable approximation.
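The n-gram overlap check mentioned above can be sketched as follows. This is a rough approximation under stated assumptions: tokens are lowercased word characters, the n-gram length of 8 and the 0.5 flagging threshold are illustrative defaults rather than established values, and all function names are hypothetical. It also assumes you have some sample of the model's known training sources available as plain strings.

```python
import re
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int = 8) -> Set[NGram]:
    """Return the set of word-level n-grams in a text (lowercased)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_reference_index(reference_docs: Iterable[str], n: int = 8) -> Set[NGram]:
    """Collect every n-gram seen in the known training sources."""
    index: Set[NGram] = set()
    for doc in reference_docs:
        index |= ngrams(doc, n)
    return index


def overlap_fraction(example: str, index: Set[NGram], n: int = 8) -> float:
    """Share of the example's n-grams that also occur in the reference index."""
    grams = ngrams(example, n)
    if not grams:  # text shorter than n tokens: nothing to compare
        return 0.0
    return len(grams & index) / len(grams)


def flag_contaminated(
    examples: Iterable[str],
    index: Set[NGram],
    n: int = 8,
    threshold: float = 0.5,  # assumption: illustrative cut-off, tune per dataset
) -> List[str]:
    """Return the examples whose overlap with the reference meets the threshold."""
    return [ex for ex in examples if overlap_fraction(ex, index, n) >= threshold]
```

A high overlap fraction does not prove the example was in the pre-training corpus, and a low one does not rule it out. The score is best used to prioritise examples for manual review or removal.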

Finally, the quality of instruction-following examples matters more than their quantity. Teams that invest in carefully written, diverse, uniformly formatted instruction-response pairs consistently outperform teams that scale up poorly written data. A dataset of 5,000 high-quality instruction-response examples regularly outperforms 50,000 examples collected with minimal quality control. The ratio is uncomfortable for organisations accustomed to thinking about data volume as the primary lever.
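As a floor for what "quality control" might mean in practice, the sketch below applies a few mechanical checks to instruction-response pairs: missing fields, implausible response lengths, and duplicates after whitespace and case normalisation. The field names `instruction` and `response`, the length bounds, and the function name are all assumptions chosen for illustration. Checks like these catch only the crudest problems and are no substitute for careful writing and human review.

```python
from typing import Dict, Iterable, List, Set, Tuple


def basic_quality_filter(
    pairs: Iterable[Dict[str, str]],
    min_response_chars: int = 20,    # assumption: illustrative lower bound
    max_response_chars: int = 4000,  # assumption: illustrative upper bound
) -> List[Dict[str, str]]:
    """Drop pairs that are incomplete, badly sized, or duplicated."""
    seen: Set[Tuple[str, str]] = set()
    kept: List[Dict[str, str]] = []
    for pair in pairs:
        instruction = pair.get("instruction", "").strip()
        response = pair.get("response", "").strip()

        # Incomplete pairs carry no usable training signal.
        if not instruction or not response:
            continue
        # Very short or very long responses are often truncation or scraping errors.
        if not min_response_chars <= len(response) <= max_response_chars:
            continue
        # Normalise case and whitespace so trivially different copies collapse.
        key = (
            " ".join(instruction.lower().split()),
            " ".join(response.lower().split()),
        )
        if key in seen:
            continue
        seen.add(key)
        kept.append({"instruction": instruction, "response": response})
    return kept
```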

Fine-Tuning LLMs: What the Research Papers Miss
