I watched a team meeting where someone bragged about hitting a token milestone and the applause felt thin. You can feel the shift when a dashboard number becomes a performance metric; I felt uneasy watching talent traded for throughput. It felt like scoring points by throwing confetti that costs a fortune.
I’ve been tracking how teams measure machine learning work for years. Now, internal leaderboards that show who burns the most LLM tokens are leaking from private Slack channels into performance reviews at places like Meta and Shopify, according to Kevin Roose’s column in the New York Times. You should care because what gets counted shapes how people spend their time.
At Meta, managers are rewarding workers who make heavy use of A.I. tools.
That’s the concrete observation Roose reported: token volume is part of evaluations. The incentive is obvious — vendors and VCs like growth metrics — but pay attention to the behavior those incentives produce.
When token consumption becomes a KPI, you don’t just get more effective prompts and finished features. You also get rapid, noisy experimentation, proxy metrics replacing outcome measures, and people gaming the system. That can look efficient on a scoreboard and be wasteful on a balance sheet.
An engineer at OpenAI reportedly used 210 billion tokens this year.
Numbers matter here because they narrate scale. Roose says 210 billion tokens is roughly 33 Wikipedias' worth of text.
That kind of usage signals two things: one, the power of consumerized agents and chains is finally cheap enough to saturate workflows; and two, it creates a perverse arms race. Engineers and product teams chase token counts the way social apps chase engagement — until the metric stops mapping to real value.
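A quick sanity check on the "33 Wikipedias" comparison: the Wikipedia word count and tokens-per-word ratio below are rough public estimates of my own, not figures from Roose's column, so treat this as a napkin sketch rather than a verified calculation.

```python
# Back-of-envelope check on "210 billion tokens ~ 33 Wikipedias".
# Both constants below are rough estimates, not reported figures.

TOKENS_USED = 210_000_000_000       # the engineer's reported annual usage
WIKIPEDIA_WORDS = 4_500_000_000     # ~4.5B words in English Wikipedia (estimate)
TOKENS_PER_WORD = 1.3               # typical BPE tokenizer ratio (estimate)

tokens_per_wikipedia = WIKIPEDIA_WORDS * TOKENS_PER_WORD
wikipedias = TOKENS_USED / tokens_per_wikipedia
print(f"~{wikipedias:.0f} Wikipedias")
```

Under those assumptions the math lands in the mid-thirties, the same ballpark as Roose's figure.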
Are tech employees being evaluated on how many tokens they use?
Yes, in some firms. You’ll see leaderboards, manager nudges, and direct references to token use in performance conversations. That has already migrated beyond a handful of startups into large employers where managers use token graphs to make raises and promotions feel data-driven.
Claude Code and OpenClaw are amplifying the behavior inside teams.
Claude Code rolled out mobile-friendly channels — Telegram and Discord controls — so you can message sessions from your phone, and fans have been treating agentic platforms like OpenClaw as a new mode of work.
That lobster-or-crab sprite Thariq tossed into the promo becomes shorthand for permissive token spend. When your assistant is reachable from your phone, it is easier to call it and harder to ask whether the call needed to be made at all.
How do companies track LLM token consumption?
Through APIs, usage dashboards, and internal tools that aggregate calls to models like GPT-5.4, Claude, or agent frameworks such as OpenClaw. Platform telemetry shows tokens per request, session length, and cumulative totals — the same telemetry your manager can wire into a leaderboard.
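The aggregation itself is trivial, which is part of why leaderboards spread so easily. Here is a minimal sketch, assuming per-request usage records with `user`, `prompt_tokens`, and `completion_tokens` fields; most LLM API responses report token counts in roughly this shape, but the exact field names vary by provider and these are my own stand-ins.

```python
from collections import defaultdict

# Minimal sketch: turn per-request usage records into a token leaderboard.
# Field names ("user", "prompt_tokens", "completion_tokens") are hypothetical
# stand-ins for the usage metadata most LLM APIs attach to each response.

def token_leaderboard(records):
    totals = defaultdict(int)
    for r in records:
        totals[r["user"]] += r["prompt_tokens"] + r["completion_tokens"]
    # Highest cumulative spend first: exactly the ranking a manager sees.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

usage = [
    {"user": "ana", "prompt_tokens": 1200, "completion_tokens": 800},
    {"user": "ben", "prompt_tokens": 300,  "completion_tokens": 150},
    {"user": "ana", "prompt_tokens": 500,  "completion_tokens": 400},
]
print(token_leaderboard(usage))  # ana first, with 2900 tokens to ben's 450
```

Note what this ranking cannot see: whether any of those tokens produced a shipped feature. That blindness is the whole problem.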
OpenAI’s own metrics have become part of the public narrative: Greg Brockman announced GPT-5.4 reached 5 trillion tokens per day and an annualized run rate of $1B (€920M) in net-new revenue. That headline tells investors the machines are busy; it doesn’t tell you whether that busyness is buying customer value.
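It is worth working out what those two headline numbers imply together. This is my own back-of-envelope division of the figures quoted above; the mix of free versus paid traffic behind them is unknown, so the result is a blended curiosity, not a unit economics claim.

```python
# Back-of-envelope: what $1B annualized net-new revenue on 5 trillion
# tokens/day implies per million tokens. Inputs are the headline figures
# quoted in the article; the free/paid traffic mix is unknown.

TOKENS_PER_DAY = 5_000_000_000_000
ANNUAL_REVENUE_USD = 1_000_000_000

tokens_per_year = TOKENS_PER_DAY * 365
usd_per_million_tokens = ANNUAL_REVENUE_USD / tokens_per_year * 1_000_000
print(f"${usd_per_million_tokens:.2f} per million tokens")
```

Roughly fifty-five cents of net-new revenue per million tokens: the machines are busy, but the dollars per unit of busyness are thin.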
A Swedish engineer says his company spends more on his Claude Code tokens than on his salary.
That anecdote is a micro-story of misaligned incentives: companies comfortable with high API spend can normalize enormous per-person costs.
If your employer treats tokens as a growth metric rather than a cost line, you get tokenmaxxing: people making as many calls as possible because the scoreboard rewards it. It’s like judging chefs by how many pans they smash — dramatic, visible, and ultimately unhelpful for taste.
I don’t want to demonize tooling. Agents and high-volume models are powerful forces for automation and creativity when used with discernment. But when token-count becomes both the scoreboard and the payday, you get a culture that prizes velocity over judgment, signals over outcomes, and noise over narrative.
Companies chasing scale through raw token volume will need to ask harder questions: are we buying product improvements, or just filling logs with expensive chatter? You should ask them that too — and ask it loudly when your next performance review hinges on a leaderboard.
Are we turning technical craft into a contest of who can spend the most, or into work that actually serves customers?