model-evaluation - Abrarqasim Blogs

Your reasoning model is more fragile than the benchmarks say

A new paper tested 8 reasoning models with 14 formatting perturbations. Open-weight models lost up to 55% accuracy. Here's what that means for production use.

Rayyan | April 15, 2026 | 6 min