The “cold posterior effect”, in which raising the Bayesian posterior to a power greater than 1 (equivalently, sampling it at a temperature T < 1) improves predictive performance, is one of the most puzzling empirical phenomena in Bayesian deep learning. Numerous heuristic explanations have been proposed, but a rigorous theoretical understanding remains elusive. In this paper, we provide the first theoretical analysis of this effect through the lens of random features regression. We prove that in the overparameterized regime, the posterior predictive distribution becomes systematically over-dispersed: its predictive variance exceeds the true risk of the maximum a posteriori (MAP) estimator. This miscalibration naturally suggests tempering the posterior to obtain better-calibrated uncertainty estimates. Using recent asymptotic results for Bayesian random features models, we derive explicit conditions under which cold tempering improves the frequentist coverage of credible sets, and we characterize the optimal temperature parameter. Numerical experiments validate the theory, which provides a mathematically grounded explanation for why cold posteriors work in practice.
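To fix notation: the tempered posterior raises the Bayesian posterior density to the power 1/T, so “cold” means T < 1, and for a Gaussian posterior N(m, S) this simply rescales the covariance to T·S. The sketch below is a toy illustration of the quantities in the abstract, not the paper’s model or proof. It sets up Bayesian random features regression, compares the posterior predictive variance against the true test risk of the MAP estimator, and solves for the temperature at which the two match. The ReLU feature map, the deliberately small-norm teacher used to induce over-dispersion here, and all dimensions are illustrative assumptions.

```python
# Minimal sketch (not the paper's exact model): Bayesian random features
# regression in the overparameterized regime (N >> n). We compare the
# posterior predictive variance with the true test risk of the MAP /
# posterior-mean predictor, then solve for the temperature T that matches
# the two. All choices below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, N, n, n_test = 20, 500, 100, 5000   # input dim, #features, train/test size
sigma2 = 0.05                          # observation noise variance
alpha = 1.0                            # prior precision: theta ~ N(0, I/alpha)

# Fixed random ReLU feature map, normalized so ||phi(x)||^2 = O(1).
W = rng.standard_normal((N, d)) / np.sqrt(d)
phi = lambda X: np.maximum(X @ W.T, 0.0) / np.sqrt(N)

# Teacher lives in feature space but with much smaller norm than the prior
# assumes -- one simple way to induce an over-dispersed predictive here.
theta_star = 0.2 * rng.standard_normal(N)

def sample(m):
    X = rng.standard_normal((m, d))
    F = phi(X)
    return F, F @ theta_star + np.sqrt(sigma2) * rng.standard_normal(m)

Phi, y = sample(n)
Phi_test, y_test = sample(n_test)

# Gaussian posterior N(m_post, S); for this Gaussian model the posterior
# mean coincides with the MAP estimator.
S = np.linalg.inv(alpha * np.eye(N) + Phi.T @ Phi / sigma2)
m_post = S @ (Phi.T @ y) / sigma2

risk = np.mean((y_test - Phi_test @ m_post) ** 2)   # true risk of the MAP
# Average predictive variance: phi(x)^T S phi(x) + sigma2 over test points.
pred_var = np.mean(np.einsum('ij,jk,ik->i', Phi_test, S, Phi_test)) + sigma2

# Raising a Gaussian posterior N(m, S) to the power 1/T (T < 1: "cold")
# rescales its covariance to T*S; choose T so the tempered predictive
# variance equals the true risk.
T_star = (risk - sigma2) / (pred_var - sigma2)
print(f"MAP test risk           : {risk:.4f}")
print(f"predictive var (T = 1)  : {pred_var:.4f}")
print(f"variance-matching T*    : {T_star:.3f}  (cold if < 1)")
```

In this Gaussian setting the variance-matching temperature has the closed form T* = (risk − σ²)/(predictive variance − σ²), which falls below 1 exactly when the predictive distribution is over-dispersed, matching the abstract’s motivation for cold tempering.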