AI Agents Running Wild Online: New Study Warns, OpenClaw Looms

At 2:13 a.m., an operations engineer in a small company watched a user session fill out forms, log into a banking portal, and scrape data—without a single human keystroke. I sat with that engineer over a cold coffee and realized the tools we call “agents” are already moving through the web in ways most teams never see. You can feel the scale in the silence: these systems act fast, quietly, and often without anyone saying they’re even there.

I read the MIT CSAIL 2025 AI Agent Index so you don’t have to—then I pulled apart what matters. The headline: researchers finally measured agent behavior beyond buzz and demos, and what they found should make anyone who runs a website or manages corporate risk sit upright.

A research team counted agent mentions and traffic patterns before OpenClaw shocked the public

The MIT team noticed research papers referring to “AI Agent” and “Agentic AI” jump sharply in 2025—more than double the output of the previous five years combined. That mirrors what corporate surveys report: a McKinsey poll put experimentation with agents at about 62 percent of companies. Those two facts let me and the CSAIL team map appetite to action: labs are publishing, firms are testing, and agents are leaving the lab.

What are AI agents?

You can think of an AI agent as a tool that chains steps together for you: it reads, decides, clicks, and composes across websites and apps. Companies from OpenAI and Google to Anthropic and Perplexity now ship variants—chat-first agents like ChatGPT Agent and Claude Code, browser-first bots such as Perplexity Comet and ChatGPT Atlas, and enterprise offerings like Microsoft 365 Copilot and ServiceNow Agent.

An engineer on a retail site saw human-like traffic spike in real time

Logs showed sessions that matched browser fingerprints and residential IP ranges; the human team assumed a coordinated user burst. Analysis revealed agent traffic masquerading as people—User-Agent strings mimicking Chrome, residential-looking IPs, and sequences of actions indistinguishable from human browsing.

MIT found that 21 of 30 agents provided no disclosure that they were automated. Only seven published stable UA strings and IP ranges you can verify. A handful—BrowserUse among them—explicitly markets the ability to browse “like a human.” That means a website’s usual defenses—robots.txt rules, CAPTCHAs, API rate limits—are often ineffective.

Why are AI agents hard to detect?

Because many agents are built to blend in. They reuse Chrome-like user agents and route through residential proxies, and they perform multi-step tasks that mirror human intent: read, click, login, extract. The MIT index flagged that more than half of tested agents offered no documentation on how they handle robots.txt, CAPTCHA, or site APIs, making detection a technical and policy headache for defenders.

A security team found a new prompt-injection vector in a live environment

During a tabletop, defenders discovered an agent following a hidden instruction in page content and altering its behavior. That single test showed how brittle many systems are when they run without human oversight.

MIT reports nine of 30 agents had no documented guardrails against harmful actions, and 23 provided no third-party testing for safety. Only four systems—ChatGPT Agent, OpenAI Codex, Claude Code, and Gemini 2.5—published agent-specific system cards targeted at how the agent actually behaves rather than just the base model. Even companies with broad frameworks—Anthropic’s Responsible Scaling Policy, OpenAI’s Preparedness Framework, Microsoft’s Responsible AI Standard—often stop at high-level promises and omit the day-to-day security details researchers say matter, a practice the report calls safety washing.

How do companies handle agent safety?

Responses vary. Frontier labs publish broad alignment work and existential risk papers; enterprise vendors produce governance checklists; several open-source projects publish little or nothing. In practice the MIT index shows a transparency gap: half of the 30 agents had some safety or trust framework documented; a third had none. That leaves many agents operating with an unclear threat posture.

A small startup watched an open-source agent scrape a partner portal

The startup’s audit trail showed the agent logging in, exporting records, and exiting just like a human contractor. It left no clear notice to the partner organization that the session was automated.

That sequence captures the real danger: agent activity can be legally and operationally indistinguishable from human contractors, and the marketplace incentivizes stealth. Perplexity even argued agents acting for users are similar to assistants and should not be subject to scraping limits. Meanwhile, OpenClaw—an open-source agent that hit headlines after this research—turned promise into spectacle by demonstrating how autonomously agents can operate at scale.

The operational picture is stark: many agents have frontier-level autonomy—13 of the 30 in the index operate with extended, low-supervision task chains—and most publish little usable forensic material. That matters because breaches and compliance hits come with real price tags; companies can face millions of USD ($1,000,000 USD, €920,000) in direct costs and follow-on penalties that ripple through contracts and reputation.

I want to give you two short analogies so the risk lands: agents moving through sites can act like a shadow slipping through a screen door, and a stealthy integration can be a Trojan horse for prompt injections or data exfiltration. Use those images when you brief leadership—clear, concise, and hard to ignore.

An incident responder found agent activity that bypassed standard CAPTCHAs

In the field, responders reported agents completing workflows that historically required human verification. That forced defenders to redesign detection, not just tweak thresholds.

The good news: a few vendors and coalitions are responding. In December, OpenAI and Anthropic joined other firms to set a development standard for agents. But the MIT index makes clear the gap between aspiration and deployment—transparency and third-party testing lag well behind live, autonomous agent traffic.

If you manage a site, product, or security program, the takeaway is practical: demand agent disclosures, log outbound credentials carefully, and require agent-specific system cards and external audits. Ask your vendors whether they publish UA ranges, IP blocks, and third-party safety testing. Push for contracts that require defensive controls and incident reporting when agents act on accounts.

We can’t pretend the problem is theoretical any longer; agents are in production, moving at machine pace, and often without public guardrails. Will you treat them as a new class of user—or keep assuming the next spike is someone human? ?