Abstract
BinaryPPO introduces a novel offline reinforcement learning framework for binary classification tasks, addressing challenges such as label noise, class imbalance, and sparse supervision that hinder traditional supervised fine-tuning (SFT) methods. By reformulating binary classification as a reward maximization problem, BinaryPPO employs a variant of Proximal Policy Optimization (PPO) with a confidence-weighted reward function. This approach penalizes uncertain or incorrect predictions, enabling robust decision-making under uncertainty. The framework operates entirely offline, making it suitable for sensitive domains such as toxicity detection, factuality verification, and causal inference. Across eight domain-specific benchmarks, BinaryPPO consistently outperforms supervised baselines, achieving accuracy improvements of 40–60 percentage points and reaching up to 99% accuracy. The paper also provides an in-depth analysis of reward shaping, advantage scaling, and policy stability, demonstrating the efficacy of confidence-based reward design as an alternative to SFT.
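The exact form of the confidence-weighted reward is not spelled out in this summary. As a rough illustration of how "penalize uncertain or incorrect predictions" can be realized, the sketch below scales a correctness signal by the model's confidence and subtracts an uncertainty penalty; the function name and the coefficient lambda_u are assumptions for illustration, not the paper's definitions.

```python
# Hypothetical confidence-weighted reward for a single binary prediction.
# NOTE: an illustrative assumption, not the reward defined in the paper.
def confidence_weighted_reward(pred: int, label: int, confidence: float,
                               lambda_u: float = 0.5) -> float:
    """Reward correct predictions in proportion to confidence,
    penalize incorrect ones, and discount low-confidence outputs."""
    correctness = 1.0 if pred == label else -1.0
    uncertainty_penalty = lambda_u * (1.0 - confidence)
    return correctness * confidence - uncertainty_penalty
```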
Methodology
BinaryPPO employs a variant of Proximal Policy Optimization (PPO) with a confidence-weighted reward function to optimize binary classification tasks. The framework operates offline, leveraging static datasets without requiring online interaction. The reward function integrates model confidence to penalize uncertain or incorrect predictions, while the composite loss function combines PPO loss, value function regularization, cross-entropy loss, and entropy regularization to stabilize learning.
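The summary names the loss terms but not their weighting or exact formulation. The sketch below shows one standard way to combine a clipped PPO surrogate with value-function regularization, cross-entropy on the gold labels, and an entropy bonus for a two-class policy; the coefficients (c_v, c_ce, c_ent) and the clip range are assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

# Illustrative composite loss combining the terms named in the summary:
# clipped PPO loss, value-function regularization, cross-entropy, and
# entropy regularization. Coefficients and the clip range are assumptions.
def composite_loss(logits, old_log_probs, actions, advantages, values, returns,
                   labels, clip_eps=0.2, c_v=0.5, c_ce=1.0, c_ent=0.01):
    log_probs_all = F.log_softmax(logits, dim=-1)                    # [batch, 2]
    new_log_probs = log_probs_all.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Clipped PPO surrogate objective on the offline advantages.
    ratio = torch.exp(new_log_probs - old_log_probs)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    ppo_loss = -torch.min(surr1, surr2).mean()

    # Value-function regularization toward the reward-derived returns.
    value_loss = F.mse_loss(values, returns)

    # Supervised cross-entropy anchor on the gold labels.
    ce_loss = F.cross_entropy(logits, labels)

    # Entropy bonus to keep the policy from collapsing prematurely.
    entropy = -(log_probs_all.exp() * log_probs_all).sum(dim=-1).mean()

    return ppo_loss + c_v * value_loss + c_ce * ce_loss - c_ent * entropy
```

In this kind of formulation, the cross-entropy term anchors the policy to the labeled data while the PPO and entropy terms govern how aggressively it departs from uncertain or noisy labels.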
Results
BinaryPPO demonstrated consistent performance improvements across eight benchmarks, achieving accuracy gains of 40–60 percentage points and reaching up to 99% accuracy. It outperformed supervised fine-tuning methods and standard PPO across tasks such as toxicity detection, factuality verification, sentiment analysis, and causal inference. The framework also exhibited stable loss convergence and robust decision-making under noisy supervision.
Implications
BinaryPPO provides a robust alternative to supervised fine-tuning for binary classification tasks, particularly in domains with noisy labels, sparse supervision, or class imbalance. Its offline nature makes it suitable for sensitive applications like harmful content detection and LLM safety evaluations. The confidence-based reward design could inspire future reinforcement learning approaches for classification tasks and improve reliability in AI systems operating under uncertainty.