DeepSeek V4-Pro and MiMo V2.5 price cuts: has the LLM API price war begun?

DeepSeek V4-Pro made its 75% discount permanent, while Xiaomi MiMo V2.5 announced permanent API price cuts of up to 99%. This is not a one-off promotion; it resets the lower bound for LLM API costs.

1. Two permanent cuts in one week

Late May 2026 gave LLM API buyers two price signals pointing in the same direction. First, DeepSeek's official API pricing page says that when the 75% promotion ends on May 31, 2026 at 15:59 UTC, deepseek-v4-pro will be repriced at one quarter of its original price, which in effect makes the discounted rate permanent. Second, Xiaomi's MiMo-V2.5 price announcement said the MiMo-V2.5 API would get a permanent reduction of up to 99%, effective May 27, 2026 Beijing time.

The important point is not that one vendor is running a campaign. Two large Chinese model teams have moved their long-context, agent-ready APIs into the same rough price band. That starts to look less like promotion and more like a new market floor.

2. Where the new price floor sits

The table below uses first-party API prices where available. Prices are USD per 1M tokens, checked on May 28, 2026.

Model                   Cache-hit input  Input    Output   Context
DeepSeek V4-Flash       $0.0028          $0.14    $0.28    1M
DeepSeek V4-Pro         $0.003625        $0.435   $0.87    1M
MiMo-V2.5               $0.0028          $0.14    $0.28    1M
MiMo-V2.5-Pro           $0.0036          $0.435   $0.87    1M
GPT-5.5                 $0.50            $5.00    $30.00   Under 270K standard tier
Claude Opus 4.7         $0.50            $5.00    $25.00   1M
Gemini 3 Flash Preview  $0.05            $0.50    $3.00    1M

DeepSeek V4-Pro and MiMo-V2.5-Pro now sit at effectively the same public API price: $0.435 input and $0.87 output per million tokens, with cache-hit input at roughly $0.0036 per million tokens, which is under four tenths of a cent. That is the part that should make teams revisit old cost assumptions.
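To make the gap concrete, the sketch below divides the premium list prices by the DeepSeek V4-Pro list prices, column by column. The figures are copied from the table above; the dictionary layout and the function name price_ratio are illustrative choices, not part of any vendor SDK.

```python
# USD per 1M tokens, copied from the table above (checked May 28, 2026).
PRICES = {
    "DeepSeek V4-Pro": {"cache_hit": 0.003625, "input": 0.435, "output": 0.87},
    "MiMo-V2.5-Pro": {"cache_hit": 0.0036, "input": 0.435, "output": 0.87},
    "GPT-5.5": {"cache_hit": 0.50, "input": 5.00, "output": 30.00},
    "Claude Opus 4.7": {"cache_hit": 0.50, "input": 5.00, "output": 25.00},
}


def price_ratio(expensive: str, cheap: str) -> dict[str, float]:
    """Return how many times more `expensive` costs than `cheap`, per column."""
    return {
        column: PRICES[expensive][column] / PRICES[cheap][column]
        for column in ("cache_hit", "input", "output")
    }


if __name__ == "__main__":
    for premium in ("GPT-5.5", "Claude Opus 4.7"):
        ratios = price_ratio(premium, "DeepSeek V4-Pro")
        print(premium, {name: round(value, 1) for name, value in ratios.items()})
```

On these list prices the ratio is smallest for uncached input and largest for cache-hit input, so the size of the saving depends heavily on how much of a workload is cacheable.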

3. Cache hits are the real lever

The output price gets attention, but the cache-hit price changes the economics of agents and long-context workflows. DeepSeek lists V4-Pro cache-hit input at $0.003625 per million tokens. Xiaomi lists MiMo-V2.5-Pro at $0.0036 per million tokens. If your app repeatedly sends the same repository, policy manual, system prompt, or conversation prefix, the cached portion can become almost free next to the $0.50 per million tokens that GPT-5.5 and Claude Opus 4.7 list for cache hits.
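A rough way to see this is to compute a blended input price from the cache hit rate. The sketch below is an assumption-laden model, not a billing formula from either vendor: it assumes the hit rate is the fraction of input tokens served from cache, that cached tokens are billed at the cache-hit price, and that all other input tokens are billed at the regular input price. The function names are hypothetical.

```python
def blended_input_price(hit_rate: float, cache_hit_price: float, miss_price: float) -> float:
    """Effective USD per 1M input tokens for a given token-level cache hit rate.

    Assumption: cached tokens bill at cache_hit_price, the rest at miss_price.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache_hit_price + (1.0 - hit_rate) * miss_price


def request_cost(
    input_tokens: int,
    output_tokens: int,
    hit_rate: float,
    cache_hit_price: float,
    miss_price: float,
    output_price: float,
) -> float:
    """Estimated USD cost of one request; all prices are per 1M tokens."""
    input_usd = input_tokens / 1_000_000 * blended_input_price(
        hit_rate, cache_hit_price, miss_price
    )
    output_usd = output_tokens / 1_000_000 * output_price
    return input_usd + output_usd


if __name__ == "__main__":
    # Illustrative request: 200K input tokens, 2K output tokens, V4-Pro list prices.
    for rate in (0.0, 0.5, 0.9):
        usd = request_cost(200_000, 2_000, rate, 0.003625, 0.435, 0.87)
        print(f"hit rate {rate:.0%}: ${usd:.5f} per request")
```

The 200K-token request and the hit rates are made-up inputs. Substitute numbers measured from your own traffic before drawing conclusions.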

Xiaomi's announcement explicitly points to inference-system work: Sliding Window Attention, SGLang HiCache, and lower KV-cache transfer between GPU memory, CPU memory, and SSD. In other words, the story they want developers to hear is not "temporary subsidy"; it is "serving long prompts got cheaper."

4. Workloads worth testing

The first workloads to test are high-volume but reversible: bulk extraction, document classification, code review, log analysis, QA over internal documents, and background agents that can retry or escalate. These are exactly the places where a 10x to 100x token-price difference can turn a weekly batch job into always-on infrastructure.

Do not move everything just because the token table looks cheap. Customer-facing medical, legal, financial, security, and regulated-data workflows still need quality evals, retention review, data residency review, and fallback routing. Cheap tokens reduce the cost of experiments; they do not remove vendor risk.

5. Risks beyond cheap tokens

  1. Compare first-party API pricing with aggregator pricing; routing layers may lag official cuts.
  2. Measure cache hit rate on real sessions, not synthetic prompts.
  3. Check where prompts, files, logs, and telemetry are retained or processed.
  4. Run your own evals for long-context recall, tool use, coding edits, and refusal behavior.
  5. Keep GPT, Claude, or Gemini fallbacks for high-stakes tasks until reliability is proven.
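
For the second check, a token-weighted hit rate over real sessions can be computed from logged usage data. This is a minimal sketch: the UsageRecord type and its field names are hypothetical, and the actual usage fields returned by each provider differ, so map them from your own logs.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class UsageRecord:
    """One logged request. Field names are hypothetical, not a provider schema."""

    cached_input_tokens: int
    uncached_input_tokens: int


def token_weighted_hit_rate(records: Iterable[UsageRecord]) -> float:
    """Fraction of all input tokens that were served from cache.

    Weighting by tokens (not by request) matters because one long cached
    prompt affects the bill far more than many short uncached ones.
    """
    cached = 0
    total = 0
    for record in records:
        cached += record.cached_input_tokens
        total += record.cached_input_tokens + record.uncached_input_tokens
    if total == 0:
        return 0.0  # No traffic logged; report zero rather than divide by zero.
    return cached / total


if __name__ == "__main__":
    sample = [UsageRecord(900, 100), UsageRecord(0, 1_000)]
    print(f"token-weighted hit rate: {token_weighted_hit_rate(sample):.2%}")
```

Run this over production sessions rather than synthetic prompts, since replayed test prompts tend to overstate how often prefixes repeat.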

The practical conclusion: the price war is real enough to change your spreadsheet, but it does not remove the need for architectural discipline. Treat DeepSeek V4-Pro and MiMo-V2.5-Pro as serious low-cost lanes in a multi-model routing strategy, not as universal replacements for every premium model.
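
One possible shape for that routing strategy is sketched below, assuming a caller-supplied call_model function and an is_acceptable quality check. Both are placeholders for your own client code and evals, and the HIGH_STAKES set simply mirrors the domains named earlier in this article; none of it reflects a specific vendor API.

```python
from typing import Callable, Sequence

# Domains this article flags as needing extra review before using low-cost lanes.
HIGH_STAKES = {"medical", "legal", "financial", "security", "regulated"}


def choose_lanes(
    domain: str, low_cost_lanes: Sequence[str], premium_lanes: Sequence[str]
) -> list[str]:
    """Order the lanes to try: high-stakes work skips the low-cost lanes."""
    if domain in HIGH_STAKES:
        return list(premium_lanes)
    return list(low_cost_lanes) + list(premium_lanes)


def run_with_fallback(
    prompt: str,
    lanes: Sequence[str],
    call_model: Callable[[str, str], str],
    is_acceptable: Callable[[str], bool],
) -> tuple[str, str]:
    """Try each lane in order and return (lane, answer) for the first good answer."""
    for lane in lanes:
        try:
            answer = call_model(lane, prompt)
        except Exception:
            continue  # Treat any provider error as a signal to escalate.
        if is_acceptable(answer):
            return lane, answer
    raise RuntimeError("no lane produced an acceptable answer")
```

A caller would pass its low-cost model names first and its premium fallbacks second, so reversible jobs such as log analysis start cheap and escalate only when a call fails or the quality check rejects the answer.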