dev.to · June 7, 2026

4 Pitfalls Discovered After Migrating from Anthropic to Gemini

I switched from Claude Sonnet 4.6 to Gemini 2.5 Flash to save 80% on costs, and encountered these real-world bugs. I hope you can avoid the same mistakes.

📅 Written on 2026-05-03: a log of real pitfalls encountered in a self-operated service

Why the Switch?

The monthly API costs for running Anthropic Claude Sonnet 4.6 became a significant burden. Even downgrading to Haiku within the same model family still left the cost per token prohibitively high.

After re-evaluating the pricing:

| Model | Input | Output |
| --- | --- | --- |
| Claude Sonnet 4.6 | $3.00 / 1M | $15.00 / 1M |
| Claude Haiku 4.5 | $0.80 / 1M | $4.00 / 1M |
| **Gemini 2.5 Flash** (non-thinking) | **$0.15 / 1M** | **$0.60 / 1M** |
| Gemini Flash-Lite | $0.075 / 1M | $0.30 / 1M |

At these list prices, Gemini 2.5 Flash is **20x cheaper** than Sonnet on input tokens and 25x cheaper on output, and in my own tests its Korean language quality was similar. I decided to switch.
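The ratios can be checked directly from the table. The following is a minimal sketch: the `PRICES` dict, the helper names, and the 10M-input / 2M-output workload are made up for illustration and are not figures from the service.

```python
# Prices in USD per 1M tokens, copied from the table above: (input, output).
PRICES = {
    "claude-sonnet-4.6": (3.00, 15.00),
    "claude-haiku-4.5": (0.80, 4.00),
    "gemini-2.5-flash": (0.15, 0.60),  # non-thinking
    "gemini-flash-lite": (0.075, 0.30),
}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a workload at the listed per-1M-token prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


def price_ratio(expensive: str, cheap: str) -> tuple:
    """Return (input ratio, output ratio) between two models."""
    (e_in, e_out), (c_in, c_out) = PRICES[expensive], PRICES[cheap]
    return e_in / c_in, e_out / c_out


if __name__ == "__main__":
    in_ratio, out_ratio = price_ratio("claude-sonnet-4.6", "gemini-2.5-flash")
    print(f"input: {in_ratio:.0f}x cheaper, output: {out_ratio:.0f}x cheaper")

    # Hypothetical monthly workload: 10M input tokens, 2M output tokens.
    for model in PRICES:
        print(f"{model}: ${workload_cost(model, 10_000_000, 2_000_000):.2f}")
```

The blended saving depends on the input/output mix of the workload, so it is worth running the numbers with your own token counts.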

The theory was clean. In reality, four traps awaited.

Trap 1: If `thinking_budget` isn't set to 0, search breaks

`gemini-2.5-flash` has thinking mode enabled by default. When this is on:

  • Response speed slows down (~2x)
  • Costs increase ($0.60 → $3.50 / 1M output)
  • And most frustratingly, the **`google_search` tool trigger weakens**

The symptom: For time-sensitive questions like "What's today's exchange rate?", it would answer using its own training data instead of triggering a search.

After 3 hours of debugging, I found the solution:

```python
from google.genai import types as gtypes

# system_prompt is the service's own prompt string, defined elsewhere.
config = gtypes.GenerateContentConfig(
    system_instruction=system_prompt,
    tools=[gtypes.Tool(google_search=gtypes.GoogleSearch())],
    max_output_tokens=8192,
    temperature=0.7,
    thinking_config=gtypes.ThinkingConfig(thinking_budget=0),  # ← This
)
```

Explicitly setting `thinking_budget=0` completely turns off thinking. The model responds quickly, like Flash-Lite, and the search trigger works correctly.
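One way to catch this regression early is to check whether a response actually used search. The sketch below assumes the response object exposes `candidates[0].grounding_metadata.web_search_queries`, which is how I understand the `google-genai` SDK to report grounding; verify the field names against your SDK version. The `search_was_triggered` helper is hypothetical, and the demo uses stand-in objects rather than a real API call.

```python
from types import SimpleNamespace


def search_was_triggered(response) -> bool:
    """Return True if the response carries Google Search grounding queries.

    Assumes the attribute path candidates[0].grounding_metadata.web_search_queries;
    missing or empty values at any level are treated as "no search".
    """
    candidates = getattr(response, "candidates", None) or []
    if not candidates:
        return False
    metadata = getattr(candidates[0], "grounding_metadata", None)
    queries = getattr(metadata, "web_search_queries", None) if metadata else None
    return bool(queries)


if __name__ == "__main__":
    # Stand-in responses shaped like the assumed SDK objects.
    grounded = SimpleNamespace(
        candidates=[
            SimpleNamespace(
                grounding_metadata=SimpleNamespace(
                    web_search_queries=["USD KRW exchange rate today"]
                )
            )
        ]
    )
    ungrounded = SimpleNamespace(
        candidates=[SimpleNamespace(grounding_metadata=None)]
    )
    print(search_was_triggered(grounded))
    print(search_was_triggered(ungrounded))
```

Logging this flag for a handful of time-sensitive test questions makes it obvious when a config change has stopped the search tool from firing.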

Trap 2: Nightly batch job analyzes new users every turn

This was a code bug unique to our service, but I've seen similar patterns often.

Problematic code:

```python
last_count = (existing or {}).get("message_count_at_analysis") or 0
if last_count > 0 and len(messages) - last_count < 5:
    return  # ← Skip if less than 5 turns
```

This looks logical but contains a trap. **For new users, `last_count` is 0, so the condition always evaluates to `False`** and the early return never fires. This means the analysis function runs on every chat turn.
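The behavior is easy to reproduce in isolation. This sketch mirrors the guard as a pure function; the `should_skip_buggy` name and the sample message counts are hypothetical, chosen only to show the two cases.

```python
from typing import Optional


def should_skip_buggy(existing: Optional[dict], message_count: int) -> bool:
    """Mirror of the original guard: True means the analysis is skipped."""
    last_count = (existing or {}).get("message_count_at_analysis") or 0
    return last_count > 0 and message_count - last_count < 5


# New user: no stored profile, so last_count is 0 and the guard never skips.
new_user_skips = [should_skip_buggy(None, n) for n in range(1, 21)]

# Returning user last analyzed at 40 messages: skipped until 5 more arrive.
returning_skips = [
    should_skip_buggy({"message_count_at_analysis": 40}, n) for n in (41, 44, 45)
]

if __name__ == "__main__":
    print(any(new_user_skips))  # False: every one of the 20 turns is analyzed
    print(returning_skips)      # [True, True, False]
```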

The analysis function makes two Gemini API calls (profile JSON generation + injection text generation). With 200 messages as input, the cost per call is not insignificant.

If a few new users chat actively for two days:

  • 1 user × 20 turns × 2 API calls × ~3 KRW = 120 KRW / user
  • The nightly batch also re-analyzes all users daily without interval checks → hundreds of KRW more

Over two days, we spent over 1,000 KRW.
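The per-user figure above is simple multiplication, sketched here so it can be rerun with other numbers. The function name and the five-user example are hypothetical; only the 20 turns, 2 calls, and ~3 KRW per call come from the estimate above.

```python
def per_turn_analysis_cost(
    turns: int, calls_per_analysis: int = 2, krw_per_call: float = 3.0
) -> float:
    """KRW spent when the analysis runs on every turn for one user."""
    return turns * calls_per_analysis * krw_per_call


if __name__ == "__main__":
    per_user = per_turn_analysis_cost(turns=20)
    print(per_user)  # 120.0 KRW for one user

    # Hypothetical: five equally active new users.
    print(5 * per_user)  # 600.0 KRW before counting the nightly batch
```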

Correction:

```python
if last_count == 0:
    if len(messages) < 10:    # First analysis only if 10+ messages
        return
else:
    if len(messages) - last_count < 20:   # After that, 20-turn interval
        return
```
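Pulling the corrected guard into a pure function makes the thresholds testable without touching the API. This is a sketch, not the service's actual code: the `should_analyze` name and its signature are assumptions, while the 10-message and 20-message thresholds are the ones from the correction above.

```python
def should_analyze(last_count: int, message_count: int) -> bool:
    """Return True when the profile analysis should run.

    last_count is the message count at the previous analysis (0 if none).
    """
    if last_count == 0:
        # First analysis: wait until the user has at least 10 messages.
        return message_count >= 10
    # Later analyses: require at least 20 new messages since the last one.
    return message_count - last_count >= 20


if __name__ == "__main__":
    print(should_analyze(0, 9))    # False: new user, too few messages
    print(should_analyze(0, 10))   # True: first analysis
    print(should_analyze(10, 29))  # False: only 19 new messages
    print(should_analyze(10, 30))  # True: 20 new messages
```

The calling code would then reduce to `if not should_analyze(last_count, len(messages)): return`, and the same function can gate the nightly batch.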

Additionally, I reduced the message input limit from 200 → 60 and the truncation p
