I switched from Claude Sonnet 4.6 to Gemini 2.5 Flash to save 80% on costs, and encountered these real-world bugs. I hope you can avoid the same mistakes.
📅 Written on 2026-05-03 — A log of real pitfalls encountered in a self-operated service
The monthly API costs for running Anthropic Claude Sonnet 4.6 became a significant burden. Even downgrading to Haiku within the same model family still left the cost per token prohibitively high.
After re-evaluating the pricing:
| Model | Input | Output |
| --- | --- | --- |
| Claude Sonnet 4.6 | $3.00 / 1M | $15.00 / 1M |
| Claude Haiku 4.5 | $0.80 / 1M | $4.00 / 1M |
| **Gemini 2.5 Flash** (non-thinking) | **$0.15 / 1M** | **$0.60 / 1M** |
| Gemini Flash-Lite | $0.075 / 1M | $0.30 / 1M |
By the list prices above, Gemini 2.5 Flash is **20x cheaper** than Sonnet on input tokens (25x on output), and in my own tests its Korean language quality was similar. I decided to switch.
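As a quick sanity check on those ratios, the sketch below simply divides the list prices from the table. The dictionary keys and the `price_ratio` helper are illustrative names chosen for this example, not part of any SDK.

```python
# USD per 1M tokens, copied from the pricing table above.
PRICES = {
    "claude-sonnet-4.6": {"input": 3.00, "output": 15.00},
    "claude-haiku-4.5": {"input": 0.80, "output": 4.00},
    "gemini-2.5-flash": {"input": 0.15, "output": 0.60},
    "gemini-flash-lite": {"input": 0.075, "output": 0.30},
}


def price_ratio(expensive: str, cheap: str, kind: str) -> float:
    """Return how many times more `expensive` costs than `cheap` per token of `kind`."""
    return PRICES[expensive][kind] / PRICES[cheap][kind]


if __name__ == "__main__":
    for kind in ("input", "output"):
        ratio = price_ratio("claude-sonnet-4.6", "gemini-2.5-flash", kind)
        print(f"{kind}: Sonnet costs about {ratio:.0f}x as much as Flash")
```

The real saving for a given workload depends on its mix of input and output tokens, so the blended ratio lands somewhere between the two figures.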
The theory was clean. In reality, four traps awaited.
`gemini-2.5-flash` has thinking mode enabled by default. When this is on:
The symptom: For time-sensitive questions like "What's today's exchange rate?", it would answer using its own training data instead of triggering a search.
After 3 hours of debugging, I found the solution:
```python
# Assumed import: gtypes is the google-genai SDK's types module.
from google.genai import types as gtypes

config = gtypes.GenerateContentConfig(
    system_instruction=system_prompt,
    tools=[gtypes.Tool(google_search=gtypes.GoogleSearch())],
    max_output_tokens=8192,
    temperature=0.7,
    thinking_config=gtypes.ThinkingConfig(thinking_budget=0),  # ← This
)
```

Explicitly setting `thinking_budget=0` completely turns off thinking. The model responds quickly, like Flash-Lite, and the search trigger works correctly.
This was a code bug unique to our service, but I've seen similar patterns often.
Problematic code:
```python
last_count = (existing or {}).get("message_count_at_analysis") or 0
if last_count > 0 and len(messages) - last_count < 5:
    return  # ← Skip if less than 5 turns
```

This looks logical but contains a trap. **For new users, `last_count` is 0, so the condition always evaluates to `False`.** This means the analysis function runs on every chat turn.
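To make the trap concrete, here is a minimal sketch that isolates the guard as a pure function and evaluates it for a new user and a returning user. `should_skip` is a hypothetical helper written for this illustration; the real service code is not shown here.

```python
def should_skip(last_count: int, n_messages: int) -> bool:
    """The guard from the snippet above: skip when fewer than 5 new messages arrived."""
    return last_count > 0 and n_messages - last_count < 5


# New user: nothing stored yet, so last_count falls back to 0.
skips_for_new_user = [should_skip(0, n) for n in range(1, 11)]

# Returning user whose last analysis happened at message 10.
skips_for_returning_user = [should_skip(10, n) for n in range(11, 17)]

if __name__ == "__main__":
    print(any(skips_for_new_user))    # False: the guard never skips
    print(skips_for_returning_user)   # [True, True, True, True, False, False]
```

The `last_count > 0` term short-circuits the whole condition, so as long as the stored count is 0 the guard can never skip, no matter how few messages there are.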
The analysis function makes two Gemini API calls (profile JSON generation + injection text generation). With 200 messages as input, the cost per call is not insignificant.
If a few new users chat actively for two days:
Over two days, we spent over 1,000 KRW.
Correction:
```python
if last_count == 0:
    if len(messages) < 10:  # First analysis only if 10+ messages
        return
else:
    if len(messages) - last_count < 20:  # After that, 20-turn interval
        return
```

Additionally, I reduced the message input limit from 200 → 60 and the truncation p