Which AI models are the most consistent over time? This report analyzes rank changes, state classifications, and sparkline volatility across 300 tracked models to produce a stability score from 0 to 100.
| Tier | Models |
|---|---|
| Rock Solid | 177 |
| Consistent | 83 |
| Variable | 38 |
| Volatile | 2 |
Top 20 models with the highest stability scores. These models maintain consistent rankings with minimal volatility.
| # | Model | Provider | Score | Stability | 24h | 7d |
|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.7 (Fast) | Anthropic | 94.7 | 100 | 0 | 0 |
| 2 | GPT-5.5 Pro | OpenAI | 90.3 | 100 | 0 | 0 |
| 3 | Claude Opus 4.6 (Fast) | Anthropic | 90.0 | 100 | 0 | 0 |
| 4 | Grok 4.20 | xAI | 88.3 | 100 | 0 | 0 |
| 5 | Grok 4.20 Multi-Agent | xAI | 87.4 | 100 | 0 | 0 |
| 6 | Gemma 4 31B (free) | Google | 80.1 | 100 | 0 | 0 |
| 7 | Gemini 3.5 Flash | Google | 78.8 | 100 | 0 | 0 |
| 8 | GPT-5.4 Nano | OpenAI | 78.8 | 100 | 0 | 0 |
| 9 | GPT-5.4 Mini | OpenAI | 78.8 | 100 | 0 | 0 |
| 10 | DeepSeek V4 Flash | DeepSeek | 77.2 | 100 | 0 | 0 |
| 11 | DeepSeek V4 Flash (free) | DeepSeek | 76.4 | 100 | 0 | 0 |
| 12 | GLM 5.1 | Zhipu AI | 76.0 | 100 | 0 | 0 |
| 13 | Kimi K2.6 | Moonshot AI | 75.5 | 100 | 0 | 0 |
| 14 | Grok 4.3 | xAI | 74.9 | 100 | -1 | -1 |
| 15 | Qwen3.6 Max Preview | Alibaba | 74.3 | 100 | -1 | -1 |
| 16 | Gemma 4 26B A4B (free) | Google | 72.7 | 100 | 0 | -1 |
| 17 | Gemma 4 26B A4B | Google | 72.7 | 100 | 0 | -1 |
| 18 | GPT Chat Latest | OpenAI | 40.0 | 100 | -2 | 0 |
| 19 | Mistral Medium 3.5 | Mistral AI | 40.0 | 100 | -2 | 0 |
| 20 | Nemotron 3 Nano Omni (free) | NVIDIA | 40.0 | 100 | -2 | 0 |
Bottom 20 models with the lowest stability scores. These models show significant ranking fluctuations or inconsistent states.
| # | Model | Provider | Score | Stability | 24h | 7d |
|---|---|---|---|---|---|---|
| 1 | MiMo-V2-Omni | Xiaomi | 69.7 | 19 | +109 | +111 |
| 2 | Step 3.5 Flash | StepFun | 66.5 | 22 | -5 | -6 |
| 3 | GPT-4o | OpenAI | 70.8 | 56 | +6 | -3 |
| 4 | DeepSeek V3 | DeepSeek | 69.0 | 58 | -4 | -4 |
| 5 | Llama 3.1 70B Instruct | Meta | 64.9 | 59 | -7 | -2 |
| 6 | MiniMax M2.1 | MiniMax | 69.5 | 59 | -7 | -2 |
| 7 | GPT-4o (2024-08-06) | OpenAI | 70.8 | 59 | +7 | -2 |
| 8 | Llama 3.1 8B Instruct | Meta | 44.1 | 62 | -5 | -1 |
| 9 | GPT-4o (2024-11-20) | OpenAI | 52.5 | 62 | +9 | -1 |
| 10 | GPT-4o-mini (2024-07-18) | OpenAI | 56.1 | 62 | +7 | -1 |
| 11 | Phi 4 | Microsoft | 59.9 | 62 | -9 | -1 |
| 12 | Mistral Large 3 2512 | Mistral AI | 66.6 | 62 | -5 | -1 |
| 13 | DeepSeek V3.2 Exp | DeepSeek | 69.8 | 62 | -6 | -1 |
| 14 | DeepSeek V3.2 | DeepSeek | 69.9 | 62 | -6 | -1 |
| 15 | GPT-4o Search Preview | OpenAI | 70.0 | 62 | +9 | -1 |
| 16 | GPT-4o Audio | OpenAI | 70.0 | 62 | +9 | -1 |
| 17 | GPT-4o (2024-05-13) | OpenAI | 70.8 | 62 | +8 | -1 |
| 18 | DeepSeek V3 0324 | DeepSeek | 71.4 | 62 | -8 | -1 |
| 19 | R1 Distill Llama 70B | DeepSeek | 42.0 | 63 | +148 | -1 |
| 20 | Devstral Small 1.1 | Mistral AI | 46.8 | 63 | -5 | -1 |
Aggregated stability metrics per provider. Providers are ranked by their average stability score across all models.
| Provider | Models | Avg Stability |
|---|---|---|
| poolside | 2 | 100.0 |
| ~anthropic | 3 | 100.0 |
| ~openai | 2 | 100.0 |
|  | 2 | 100.0 |
| ~moonshotai | 1 | 100.0 |
| essentialai | 1 | 100.0 |
| deepcogito | 1 | 99.9 |
| xAI | 4 | 98.0 |
| Writer | 1 | 97.9 |
| inclusionai | 3 | 97.3 |
| Kuaishou | 1 | 97.0 |
| Upstage | 1 | 95.9 |
| NVIDIA | 9 | 94.9 |
| AI21 Labs | 1 | 93.0 |
| Inception | 1 | 92.2 |
| perceptron | 1 | 92.0 |
| Windsurf | 1 | 91.5 |
| Liquid AI | 3 | 91.4 |
| Amazon | 5 | 90.7 |
|  | 24 | 90.1 |
| rekaai | 2 | 89.9 |
| Alibaba | 47 | 89.0 |
| Anthropic | 13 | 88.6 |
| Tencent | 2 | 88.3 |
| Perplexity | 5 | 88.0 |
| Baidu | 5 | 88.0 |
| aion-labs | 3 | 87.0 |
| ByteDance | 5 | 85.9 |
| Mistral AI | 22 | 83.8 |
| Moonshot AI | 6 | 83.5 |
| OpenAI | 59 | 82.9 |
| arcee-ai | 5 | 82.8 |
| IBM | 2 | 80.4 |
| MiniMax | 8 | 79.7 |
| Zhipu AI | 12 | 79.7 |
| Cursor | 2 | 79.0 |
| DeepSeek | 13 | 77.0 |
| Meta | 10 | 75.3 |
| Xiaomi | 5 | 74.5 |
| Allen AI | 1 | 74.2 |
| Microsoft | 2 | 71.8 |
| Cohere | 3 | 69.7 |
| StepFun | 1 | 22.2 |
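The per-provider aggregation described above can be sketched as a group-and-average step. This is a minimal illustration, not the report's actual pipeline: the function name `provider_averages`, the `(provider, stability)` input shape, and the sample rows are all hypothetical.

```python
from collections import defaultdict
from statistics import fmean

def provider_averages(models):
    """Group (provider, stability) pairs and rank providers by mean stability.

    Returns rows of (provider, model_count, average_stability), best first.
    The input shape is an assumption made for this sketch.
    """
    by_provider = defaultdict(list)
    for provider, stability in models:
        by_provider[provider].append(stability)

    rows = [
        (provider, len(scores), round(fmean(scores), 1))
        for provider, scores in by_provider.items()
    ]
    # Highest average first, matching the ordering used in the table.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows

# Hypothetical sample data, not taken from the report.
sample = [("ProviderB", 62), ("ProviderA", 100), ("ProviderA", 90)]
rows = provider_averages(sample)
print(rows)
```

Because the average is unweighted, a provider with a single tracked model can top or bottom the ranking on the strength of that one model, which is worth keeping in mind when comparing it with providers that have dozens of entries.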
How stability scores are distributed across all 300 tracked models.
Our stability scoring system uses three key signals to measure how consistently a model performs over time.
Rank changes are the most direct measure of stability. Models lose up to 25 points for large 24-hour rank changes (5 points per rank position moved, so the cap is reached at 5 positions) and up to 21 points for 7-day changes (3 points per position, capped at 7 positions). Models that hold their rank tightly score higher.
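The rank-change penalty can be sketched in a few lines. The function name `rank_change_penalty` is hypothetical, and the sketch assumes that the direction of the move is ignored (a gain of 5 positions costs the same as a loss of 5), which the wording "per rank position moved" suggests but does not state outright.

```python
def rank_change_penalty(change_24h: int, change_7d: int) -> int:
    """Points deducted for rank movement.

    Assumption: only the size of the move matters, not its direction.
    """
    penalty_24h = min(25, 5 * abs(change_24h))  # 5 points per position, capped at 25
    penalty_7d = min(21, 3 * abs(change_7d))    # 3 points per position, capped at 21
    return penalty_24h + penalty_7d

print(rank_change_penalty(-5, -6))    # 25 + 18 = 43
print(rank_change_penalty(109, 111))  # both caps reached: 25 + 21 = 46
print(rank_change_penalty(-1, -1))    # 5 + 3 = 8
```

The caps mean that rank movement alone can remove at most 46 points, so a model cannot fall below 54 on this signal by itself.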
Each model has a state reflecting its overall reliability. Models in a "stable" state receive a 10-point bonus, while "fragile" models are penalized 15 points. This captures systemic reliability beyond simple rank movement.
The 14-day sparkline data reveals hidden volatility. We compute the standard deviation of the sparkline and subtract up to 20 points. Even models that end where they started can be penalized if they oscillated wildly along the way.
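As a rough sketch of the sparkline penalty: the report states only that the standard deviation is computed and that up to 20 points are subtracted. The conversion from standard deviation to points is not given, so `points_per_unit` below is a hypothetical scaling parameter, and the choice of population standard deviation (`pstdev`) is likewise an assumption.

```python
from statistics import pstdev

def sparkline_penalty(sparkline, points_per_unit=1.0):
    """Penalty derived from sparkline spread, capped at 20 points.

    Assumptions: population standard deviation, and a linear
    points_per_unit scale (hypothetical, not from the report).
    """
    if len(sparkline) < 2:
        return 0.0  # not enough data to measure spread
    return min(20.0, points_per_unit * pstdev(sparkline))

flat = [50] * 14                                # never moves
oscillating = [50, 60, 50, 40] * 3 + [50, 50]   # starts and ends at 50
wild = [0, 100] * 7                             # large swings every day

print(sparkline_penalty(flat))         # 0.0
print(sparkline_penalty(oscillating))  # greater than 0 despite no net change
print(sparkline_penalty(wild))         # 20.0 (cap reached)
```

The `oscillating` series illustrates the point made above: its first and last values are identical, yet it still incurs a penalty because the intermediate values swing around the mean.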
The stability score starts at 100 and is reduced based on three factors: 24-hour rank changes (up to -25 points, at 5 per position moved), 7-day rank changes (up to -21 points, at 3 per position), and sparkline volatility measured by standard deviation (up to -20 points). Models in a "stable" state get a +10 bonus, while "fragile" models lose 15 points.
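Putting the pieces together, a minimal sketch of the full calculation might look like the following. The function name `stability_score` and its parameters are hypothetical. The sketch assumes rank direction is ignored, that the sparkline penalty is supplied already converted to points, and that the result is clamped to the 0 to 100 range the report describes.

```python
def stability_score(change_24h, change_7d, state, volatility_penalty):
    """Combine the three signals into a 0-100 stability score.

    Assumptions (not confirmed by the report): absolute rank changes,
    a precomputed volatility penalty, and clamping to [0, 100].
    """
    score = 100.0
    score -= min(25, 5 * abs(change_24h))             # 24-hour rank change
    score -= min(21, 3 * abs(change_7d))              # 7-day rank change
    score -= min(20.0, max(0.0, volatility_penalty))  # sparkline volatility

    if state == "stable":
        score += 10
    elif state == "fragile":
        score -= 15

    return max(0.0, min(100.0, score))

# Hypothetical inputs: the state and volatility values are assumed.
print(stability_score(-5, -6, "fragile", 20.0))  # 100 - 25 - 18 - 20 - 15 = 22.0
print(stability_score(-2, 0, "stable", 0.0))     # 90 + 10 = 100.0
print(stability_score(-1, -1, "stable", 0.0))    # 92 + 10, clamped to 100.0
```

Under these assumptions the "stable" bonus can fully offset a small rank move, which would explain how a model with a nonzero 24-hour change can still show a stability of 100.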
Models are classified into four tiers based on their stability score: "Rock Solid" (85-100) means extremely consistent performance with minimal fluctuation. "Consistent" (70-84) means generally reliable with minor variations. "Variable" (50-69) shows noticeable ranking fluctuations. "Volatile" (below 50) indicates significant instability and unpredictable performance.
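The tier boundaries translate directly into a threshold lookup. The function name `stability_tier` is hypothetical, and the handling of fractional scores between the listed ranges (for example 84.5) is an assumption: this sketch treats each lower bound as a threshold.

```python
def stability_tier(score: float) -> str:
    """Map a stability score to one of the four tiers.

    Assumption: each range's lower bound acts as a threshold, so a
    fractional score such as 84.5 falls into "Consistent".
    """
    if score >= 85:
        return "Rock Solid"
    if score >= 70:
        return "Consistent"
    if score >= 50:
        return "Variable"
    return "Volatile"

for value in (100, 84.5, 62, 22):
    print(value, stability_tier(value))
```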
Stability indicates how predictably a model will perform over time. A highly rated but volatile model may deliver inconsistent results, which is problematic for production applications requiring reliable output quality. Stable models provide more predictable performance, making them safer choices for mission-critical workloads even if they do not always hold the top rank.