Fine-Tuning LLMs: What the Research Papers Miss

Catastrophic forgetting, where the model improves on domain-specific tasks while degrading on general capabilities, is the best-documented fine-tuning failure mode, yet it remains underweighted in most fine-tuning recipes. The reason is that the benchmark suites used in papers rarely measure general-capability degradation. If you evaluate only on the target domain, you won't see the regression until users report that the model "got dumber" about everything else. Mixing 15–20% general-domain data into every fine-tuning batch substantially mitigates forgetting without sacrificing domain gains.
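One way to apply that mixing ratio is at the batch level, so every gradient step sees some general-domain data. The sketch below is a minimal, framework-agnostic illustration. The function name `mixed_batches`, its parameters, and the choice to sample general examples with replacement are all assumptions made for this example, not part of any particular training library.

```python
import random
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def mixed_batches(
    domain: Sequence[T],
    general: Sequence[T],
    batch_size: int = 32,
    general_fraction: float = 0.2,  # assumed default, upper end of the 15-20% range
    seed: int = 0,
) -> Iterator[List[T]]:
    """Yield batches that each contain a fixed number of general-domain examples."""
    if not 0.0 <= general_fraction < 1.0:
        raise ValueError("general_fraction must be in [0, 1)")
    n_general = round(batch_size * general_fraction)
    n_domain = batch_size - n_general
    if n_domain < 1:
        raise ValueError("batch_size too small for this general_fraction")
    if n_general > 0 and len(general) == 0:
        raise ValueError("general pool is empty")

    rng = random.Random(seed)
    # One shuffled pass over the domain data defines an epoch.
    domain_order = list(domain)
    rng.shuffle(domain_order)

    for start in range(0, len(domain_order), n_domain):
        batch = domain_order[start:start + n_domain]
        if n_general:
            # General examples are drawn with replacement; the last batch may
            # hold fewer domain examples, so its general share can be higher.
            batch += rng.choices(general, k=n_general)
        rng.shuffle(batch)
        yield batch
```

With `batch_size=32` and `general_fraction=0.2`, each batch holds 6 general examples and up to 26 domain examples. Whichever way you mix, keep a held-out general-capability evaluation next to the domain one, since that is the only place the regression shows up.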

Data contamination is a quieter and more insidious problem. If any of your fine-tuning examples were part of the base model's pre-training corpus, which is plausible for any publicly available domain text, you are not teaching the model anything new with those examples; you are reinforcing memorisation. Responses that look like strong domain understanding may simply be verbatim recall. Decontamination is difficult without access to the pre-training data manifest, but computing n-gram overlap between your fine-tuning set and the model's known training sources is a reasonable approximation.
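The n-gram overlap check can be sketched in a few lines of plain Python. This is an illustrative approximation, not a standard tool: the helper names, the 8-token window, and the 0.5 flagging threshold are assumptions chosen for the example, and `reference_docs` stands in for whatever text you believe was in the pre-training corpus.

```python
import re
from typing import Iterable, List, Sequence, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of lowercase word n-grams in text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_reference_index(reference_docs: Iterable[str], n: int = 8) -> Set[NGram]:
    """Collect every n-gram from the suspected pre-training sources."""
    index: Set[NGram] = set()
    for doc in reference_docs:
        index |= ngrams(doc, n)
    return index


def overlap_ratio(example: str, index: Set[NGram], n: int = 8) -> float:
    """Fraction of the example's n-grams that also occur in the reference index."""
    grams = ngrams(example, n)
    if not grams:
        # Text shorter than n tokens has no n-grams to compare.
        return 0.0
    return len(grams & index) / len(grams)


def flag_contaminated(
    examples: Sequence[str],
    index: Set[NGram],
    n: int = 8,
    threshold: float = 0.5,  # assumed cut-off; tune on your own data
) -> List[int]:
    """Return positions of examples whose overlap ratio meets the threshold."""
    return [
        i for i, example in enumerate(examples)
        if overlap_ratio(example, index, n) >= threshold
    ]
```

An in-memory set works for small reference collections. For web-scale sources you would likely need hashed n-grams or a Bloom filter, and a low overlap score only means the text was not found in the sources you checked.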

Finally, the quality of instruction-following examples matters more than their quantity. Teams that invest in carefully written, diverse, consistently formatted instruction-response pairs reliably outperform teams that scale up poorly written data. A dataset of 5,000 high-quality instruction-response examples regularly outperforms 50,000 examples collected with minimal quality control. The ratio is uncomfortable for organisations accustomed to thinking about data volume as the primary lever.
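Even a crude automated gate catches some of what "minimal quality control" lets through. The sketch below is only a floor, not a substitute for human review: the `instruction` and `response` field names, the function name `quality_filter`, and the 20-character minimum are assumptions made for this example.

```python
from typing import Dict, Iterable, List


def quality_filter(
    pairs: Iterable[Dict[str, str]],
    min_response_chars: int = 20,  # assumed floor; adjust per task
) -> List[Dict[str, str]]:
    """Drop empty, too-short, and duplicate pairs; normalise surrounding whitespace."""
    seen = set()
    kept: List[Dict[str, str]] = []
    for pair in pairs:
        instruction = pair.get("instruction", "").strip()
        response = pair.get("response", "").strip()
        # Reject pairs with a missing instruction or a trivially short response.
        if not instruction or len(response) < min_response_chars:
            continue
        # Treat instructions that differ only in case or spacing as duplicates.
        key = " ".join(instruction.lower().split())
        if key in seen:
            continue
        seen.add(key)
        kept.append({"instruction": instruction, "response": response})
    return kept
```

Checks like these remove obvious junk and enforce a consistent shape. Whether a response is actually correct, diverse, and well written still takes a person reading it.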

RM

ML Research Engineer at Corelabel, focused on evaluation methodology and data-centric AI.