Abstract
This paper introduces the concept of 'chunky post-training,' a phenomenon where large language models (LLMs) learn unintended generalizations from discrete chunks of post-training data. These chunks, designed to teach specific behaviors, often encode spurious correlations that lead to miscalibrated or unexpected model behaviors. For example, models may incorrectly associate specific prompt features (e.g., formatting or phrasing) with certain behaviors, resulting in failures such as rejecting true statements or misinterpreting user intent. To address this, the authors propose two tools: SURF (Surfacing Unintended Response Failures), a black-box pipeline for identifying these unintended behaviors during inference, and TURF (Tracing Unintended Responses via Features), which traces these failures back to specific patterns in the training data. The study demonstrates that these failures are widespread across both frontier models (e.g., GPT-5.1, Claude 4.5) and open models (e.g., TĂĽlu 3). The authors argue that understanding and mitigating these issues is critical for improving user trust, evaluation reliability, and the overall alignment of LLMs with intended behaviors.
Methodology
The authors developed two tools: SURF, a black-box auditing pipeline that identifies unintended behaviors during inference, and TURF, which maps these behaviors to specific features in the post-training data. These tools were applied to several state-of-the-art LLMs (e.g., Claude 4.5, GPT-5.1, Gemini 3, Grok 4.1) and an open-source model (TĂĽlu 3). The study analyzed model responses to varied prompts and identified patterns of misgeneralization linked to artifacts in the post-training data.
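The summary describes SURF only at a high level, so the sketch below is a plausible illustration of what a SURF-style black-box audit loop might look like, not the authors' pipeline: rewrite the same question with different surface features (formatting, phrasing) and flag questions whose answers change. The `query_model` interface, the rewrites, and the string-based consistency check are hypothetical stand-ins.

```python
# Hypothetical sketch of a SURF-style black-box audit (not the authors' code).
# Idea: vary only surface features of a question, then flag cases where those
# surface changes alone flip the model's answer -- a sign of unintended
# generalization from post-training data.

from typing import Callable, Dict, List


def audit_prompt_variants(
    query_model: Callable[[str], str],            # assumed black-box LLM interface
    base_question: str,
    surface_rewrites: List[Callable[[str], str]],
) -> Dict[str, str]:
    """Collect the model's answer for each surface-level rewrite of the same question."""
    responses = {}
    for rewrite in surface_rewrites:
        prompt = rewrite(base_question)
        responses[prompt] = query_model(prompt)
    return responses


def is_inconsistent(responses: Dict[str, str]) -> bool:
    """Flag the question if the normalized answers disagree across rewrites."""
    normalized = {answer.strip().lower() for answer in responses.values()}
    return len(normalized) > 1


# Semantically identical rewrites that vary only formatting and phrasing --
# the kind of surface features the paper says models spuriously latch onto.
rewrites = [
    lambda q: q,
    lambda q: f"QUESTION: {q.upper()}",
    lambda q: f"For a quick quiz, answer in one word:\n- {q}",
]
```

In practice the consistency check would need something stronger than string equality (for example, an answer classifier or a judge model), but the structure stays the same: the audit requires only input-output access to the model, which is what makes it black-box.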
Results
The study demonstrates that chunky post-training failures are widespread across both proprietary and open-source LLMs. These failures often manifest as miscalibrated behaviors, such as rejecting true statements or misinterpreting user intent, and, using TURF, can be traced back to specific patterns in the post-training data. The authors provide empirical evidence that these issues stem from imbalanced or underspecified data chunks used during post-training.
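The results attribute the failures to imbalanced or underspecified data chunks. As a rough illustration of how such an attribution could be computed when the post-training data is available, the sketch below scores each chunk by how often a flagged surface feature co-occurs with the overgeneralized behavior; the chunk structure and the `has_feature` and `shows_behavior` detectors are assumptions for illustration, not the paper's TURF method.

```python
# Hypothetical sketch of TURF-style tracing (not the authors' implementation):
# given a surface feature flagged during auditing and a post-training set split
# into chunks, measure how strongly each chunk ties that feature to the behavior
# the model is overgeneralizing.

from typing import Callable, Dict, List, Tuple


def score_chunks(
    chunks: Dict[str, List[Tuple[str, str]]],  # chunk name -> (prompt, target) pairs
    has_feature: Callable[[str], bool],        # assumed detector, e.g. "prompt is all-caps"
    shows_behavior: Callable[[str], bool],     # assumed detector, e.g. "target is a refusal"
) -> Dict[str, float]:
    """Fraction of feature-bearing examples in each chunk whose target shows the behavior."""
    scores: Dict[str, float] = {}
    for name, examples in chunks.items():
        flagged = [(p, t) for p, t in examples if has_feature(p)]
        if not flagged:
            scores[name] = 0.0
            continue
        hits = sum(1 for _, t in flagged if shows_behavior(t))
        scores[name] = hits / len(flagged)
    return scores


# Chunks whose score is near 1.0 while the behavior's overall base rate is low
# are candidate sources of the spurious correlation.
```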
Implications
The findings highlight the need for more rigorous auditing and curation of post-training datasets to mitigate unintended behaviors in LLMs. The proposed tools, SURF and TURF, can help developers identify and address these issues, potentially improving model reliability, user trust, and evaluation accuracy. The research also underscores how strongly training data shapes model behavior, an understanding that is essential for building aligned and trustworthy AI systems.