Benchmarks

The hidden cost of re-reading: a benchmark across 8 languages

We measured DRIP on production-shape codebases in 8 languages. Here's the savings breakdown, the methodology, and what surprised us about how compression behaves on Java vs Python.

DRIP contributors · 7 min read

When we shipped DRIP's first numbers — "34 to 88 % fewer tokens" — three people asked the same question: on what?

Fair. So we ran the benchmarks publicly. Here's the full breakdown.

The methodology, in 60 seconds

Production-shape fixtures means: real open-source files of typical refactor size, not the toy snippets you see in most benchmark suites. We pulled 45 effective samples across 8 languages (Python, TypeScript, Java, Go, Rust, Ruby, C++, C#), warmed up the runner to discard JIT noise, and measured p99 latency alongside token counts.

The numbers below are per-cell medians over those samples. p99 numbers and full data are in bench/source-map/ in the DRIP repo.

First-read savings (semantic compression)

DRIP recognises function bodies in 13 languages and elides them on first reads, keeping signatures, imports, type declarations, doc comments and class bodies intact. The agent gets the shape of the file. If it needs a specific function body, it asks — and that re-read returns the full body.
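To make the idea concrete, here is a minimal sketch of body elision for Python source using the standard-library ast module. It is not DRIP's implementation: the function name elide_bodies is hypothetical, and ast.unparse drops ordinary comments and original formatting, which a real compressor would want to preserve.

```python
import ast


def elide_bodies(source: str) -> str:
    """Replace every function body with `...`, keeping signatures and docstrings.

    Hypothetical illustration only; not how DRIP itself is implemented.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            new_body = []
            # If the function has a docstring, it is the first statement: keep it.
            if ast.get_docstring(node, clean=False) is not None:
                new_body.append(node.body[0])
            # Stand-in for the elided body.
            new_body.append(ast.Expr(value=ast.Constant(value=...)))
            node.body = new_body
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


sample = '''
import math

def area(r: float) -> float:
    """Area of a circle."""
    squared = r * r
    return math.pi * squared
'''

print(elide_bodies(sample))
```

The import, the signature and the docstring survive; the two body statements collapse into a single `...`.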

| Language | File size | Native tokens | DRIP first read | Savings |
|---|---|---|---|---|
| Python (651 ln) | 24 KB | 6,005 | 2,403 | 60 % |
| Java (1,210 ln) | 38 KB | 9,800 | 2,950 | 70 % |
| TypeScript (820 ln) | 31 KB | 7,600 | 2,420 | 68 % |
| Go (440 ln) | 14 KB | 3,820 | 1,612 | 58 % |
| Rust (590 ln) | 21 KB | 5,210 | 2,008 | 61 % |
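The Savings column is one minus the ratio of DRIP tokens to native tokens, rounded to a whole percent. A short sketch that recomputes it from the token counts in the table (the names ROWS and savings_pct are ours, not part of DRIP):

```python
# (native tokens, DRIP first-read tokens), copied from the table above.
ROWS = {
    "Python": (6005, 2403),
    "Java": (9800, 2950),
    "TypeScript": (7600, 2420),
    "Go": (3820, 1612),
    "Rust": (5210, 2008),
}


def savings_pct(native: int, drip: int) -> int:
    """Share of tokens saved, as a rounded percentage."""
    return round(100 * (1 - drip / native))


for language, (native, drip) in ROWS.items():
    print(f"{language}: {savings_pct(native, drip)} %")
```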

Java compresses the hardest because Java files are mostly boilerplate (visibility modifiers, getters/setters, @Override annotations). That is a low signal-to-noise ratio, which is exactly where semantic elision pays off.

Re-read savings (delta engine)

This is where the bigger numbers come from. On a 5-edit refactor cycle:

| File | Native re-reads | DRIP re-reads | Savings |
|---|---|---|---|
| 130 KB, 30 reads | 180,000 tok | 12,400 tok | 93 % |
| 38 KB, 20 reads + 3 edits | 196,000 tok | 18,900 tok | 90 % |
| 24 KB, 15 reads | 90,000 tok | 6,540 tok | 93 % |

The pattern: as session length grows, savings grow with it. Each DRIP re-read costs either an [unchanged] sentinel (~12 tokens) or a delta payload (200–400 tokens), and that cost stays flat regardless of file size, while every native re-read pays for the whole file again.
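A toy cost model shows the shape of that argument. It is a sketch under stated assumptions, not a reproduction of the measured table: the function names are hypothetical, the delta size is fixed at 300 tokens (the midpoint of the quoted range), and full-body requests are ignored, so it understates what a real DRIP session costs.

```python
def native_tokens(file_tokens: int, reads: int) -> int:
    """Every native read pays for the whole file."""
    return file_tokens * reads


def drip_tokens(first_read_tokens: int, reads: int, edits: int,
                sentinel_tokens: int = 12, delta_tokens: int = 300) -> int:
    """One compressed first read, then a delta after each edit and a sentinel otherwise.

    Assumed model for illustration; it ignores full-body requests.
    """
    rereads = max(reads - 1, 0)
    changed = min(edits, rereads)      # re-reads that follow an edit
    unchanged = rereads - changed      # re-reads of an untouched file
    return first_read_tokens + changed * delta_tokens + unchanged * sentinel_tokens


# Token counts for the 38 KB Java file from the first-read table.
for reads in (5, 10, 20):
    native = native_tokens(9800, reads)
    drip = drip_tokens(2950, reads, edits=3)
    print(f"{reads} reads: native={native} drip={drip}")
```

Doubling the number of reads doubles the native cost, while the modelled DRIP cost grows by only a dozen tokens per extra unchanged read.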

What surprised us

Python had one of the smallest wins on semantic compression

Python's terse syntax (no curly braces, optional types) means function bodies aren't that much bigger than their signatures. Java, by contrast, has 30–50 % of its volume tied up in boilerplate — so eliding bodies hits harder.

TypeScript reads benefited more from the delta engine than from semantic compression

TypeScript files tend to be edited a lot more often than Python files in our test sessions (more refactor work happens in TS in 2026). So most of the win came from the delta engine, not the first-read compressor.

Rust files saw mixed results because of macros

Function-like macros (println!, vec!, bail!) make body detection harder. We have a special path for them — bodies expand a lot, signatures stay small — but the heuristic is conservative. We'd rather miss a compression than emit a wrong elision.

Edit certificates outperformed expectations

We thought edit certificates (the 390-byte hash + touched-lines payload returned on a read-after-edit) would save 5–10 % of total tokens. In practice they save 12–18 %, because agents verify their own writes way more often than the docs hint at.
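For readers who haven't seen one, the following sketch shows what a certificate of this kind could look like: a content hash plus the line ranges an edit touched. The payload layout and the name edit_certificate are assumptions made for illustration; DRIP's actual format is not described in this article.

```python
import difflib
import hashlib
import json


def edit_certificate(before: str, after: str) -> str:
    """Hypothetical certificate: SHA-256 of the new content plus touched line ranges."""
    old_lines, new_lines = before.splitlines(), after.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    touched = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Convert the half-open range [j1, j2) in the new file to 1-based
            # inclusive lines; a pure deletion (j1 == j2) is reported as the
            # single position where the lines were removed.
            touched.append([j1 + 1, max(j2, j1 + 1)])
    digest = hashlib.sha256(after.encode("utf-8")).hexdigest()
    return json.dumps({"sha256": digest, "touched": touched})


before = "a = 1\nb = 2\nc = 3\n"
after = "a = 1\nb = 20\nc = 3\n"
print(edit_certificate(before, after))
```

The agent can check the hash and the touched ranges against what it intended to write, without pulling the whole file back into context.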

What does this mean for your bill?

If you're paying for Claude Sonnet 4.6 (the most common default at $3.00 / Mtok input):

  • A typical solo-dev refactor session burns 200K–500K input tokens.
  • DRIP cuts that to 40K–80K.
  • At $3.00 / Mtok, that's $0.48 to $1.26 saved per session.

Across a 10-developer team running 6 sessions/day, that comes to roughly $630–$1,660 / month in input-token savings, assuming 22 working days. The math gets steeper if you're on Opus 4.7 ($15 / Mtok input): same number of tokens, 5× the dollar value.
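If you want to plug in your own numbers, the arithmetic is short enough to script. This is a sketch: the function names are ours, and working_days=22 is an assumption about how many days a month the team actually runs sessions.

```python
def session_savings_usd(native_tokens: int, drip_tokens: int, usd_per_mtok: float) -> float:
    """Dollar value of the input tokens saved in one session."""
    return (native_tokens - drip_tokens) * usd_per_mtok / 1_000_000


def monthly_savings_usd(per_session_usd: float, developers: int,
                        sessions_per_day: int, working_days: int = 22) -> float:
    """Team-wide savings per month; working_days is an assumed default."""
    return per_session_usd * developers * sessions_per_day * working_days


low = session_savings_usd(200_000, 40_000, 3.00)
high = session_savings_usd(500_000, 80_000, 3.00)
print(f"per session: ${low:.2f} to ${high:.2f}")  # per session: $0.48 to $1.26

team_low = monthly_savings_usd(low, developers=10, sessions_per_day=6)
team_high = monthly_savings_usd(high, developers=10, sessions_per_day=6)
print(f"per month: ${team_low:,.0f} to ${team_high:,.0f}")
```

Swap 3.00 for 15.00 to see the same token savings priced at the higher input rate.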

How to reproduce

Clone DRIP, run bash scripts/bench_multilang.sh. Numbers will land in bench/results/. Tweak the fixtures to match your codebase shape; the methodology is the same.

Tags: benchmarks, semantic compression, tokens, languages