Microsoft Researcher Uses Goats in AoE II to Expose LLMs

I opened a YouTube clip and laughed when the punchline landed: a goat headbutting a baffled housecat. Then I learned that a Microsoft researcher had built logic gates out of goats inside Age of Empires II, and that laugh hollowed out into a question. If a herd of virtual animals can mimic a basic AI, what does that do to our stories about machine minds?

I’m asking this with the calm impatience of someone who spends too much time coding experiments and even more time watching people misread them. You and I both know language invites belief: when software speaks like a person, our brains shop for a soul. My job here is to show you how a playful proof can expose a methodological blind spot—and to give you a clean way to judge the spectacle for yourself.

I watched a researcher at Microsoft turn Age of Empires II goats into logic gates

Adrian de Wynter, a Microsoft researcher, used the game’s scenario editor to make goats act as rudimentary bits—his paper coins the term bit-goat. He stitched NAND, XNOR and AND behaviors together until he had a one-bit perceptron: a tiny, functioning analog of basic AI.
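To make the gate-stitching concrete, here is a minimal Python sketch of how NAND, AND and XNOR gates could compose into a one-bit perceptron. The wiring is my assumption, not de Wynter’s actual construction: I borrow the common binarized-network trick in which XNOR stands in for multiplying an input bit by a weight bit, and I use AND as the firing threshold. Every function name here is hypothetical.

```python
from typing import Sequence


def nand(a: int, b: int) -> int:
    """The only primitive gate; everything else is built from it."""
    return 1 - (a & b)


def not_(a: int) -> int:
    return nand(a, a)


def and_(a: int, b: int) -> int:
    return not_(nand(a, b))


def or_(a: int, b: int) -> int:
    return nand(not_(a), not_(b))


def xnor(a: int, b: int) -> int:
    # 1 when both bits agree: (a AND b) OR (NOT a AND NOT b)
    return or_(and_(a, b), and_(not_(a), not_(b)))


def one_bit_perceptron(inputs: Sequence[int], weights: Sequence[int]) -> int:
    """Assumed design: fire (return 1) only if every input bit matches
    its one-bit weight. XNOR plays the role of multiplication, AND the
    role of the threshold."""
    fired = 1
    for x, w in zip(inputs, weights):
        fired = and_(fired, xnor(x, w))
    return fired


if __name__ == "__main__":
    weights = (1, 0)
    for x in ((0, 0), (0, 1), (1, 0), (1, 1)):
        print(x, one_bit_perceptron(x, weights))
```

Every gate in the sketch bottoms out in `nand`, so any substrate that can realize that single gate could, in principle, run the rest unchanged.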

On the surface it’s a stunt. Under the surface it’s a probe: if a game’s fauna can reproduce the observable input/output of an LLM, then our claims about what makes an LLM “human-like” need sharper tests.

A goat demonstration shows that the substrate matters

I saw the setup run live: goats trampling through pens, triggers firing, text produced elsewhere. The observation is simple—same input, same output, different mechanics.

De Wynter’s point is twofold. First, any sufficiently expressive substrate can implement an LLM-like mapping from prompts to responses, whether that substrate is a GPU cluster at OpenAI or a swarm of bit-goats in AoE II. Second, the medium in which that mapping is exposed changes how observers read the system’s inner life. When responses appear as polished text in ChatGPT, people are tempted to read agency into them. When the exact same behavior is produced by a visible herd inside a game, that temptation weakens.

Would you view a GoatGPT response differently from a ChatGPT response?

Yes—and that’s the experiment’s point. Your perception is shaped by interface, latency, and the visible machinery behind an answer. Ask the same question of ChatGPT, a smart speaker, and a goat simulation; you will, very likely, attribute different levels of intention to each. The bit-goats are vaudeville actors on a stage; they perform the lines but don’t carry the dramaturgical weight we grant human speakers.

I noticed debates about “sentience” get tangled in hidden assumptions

I’ve read interviews and op-eds that swing wildly from “these models are alive” to “they’re just parrots.” The real-world observation is ordinary: people project.

De Wynter argues that many tests for anthropomorphic traits sneak in prior beliefs. If you begin an experiment assuming an LLM understands, you’ll interpret its outputs as evidence. If you begin assuming it doesn’t, you’ll require different evidence. This is not a math error; it’s a psychology problem. The experiment acts as a funhouse mirror that shows how much of “sentience” lives in the observer, not the code.
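A toy Bayesian calculation shows how strongly the starting assumption steers the conclusion. This is my illustration, not something taken from de Wynter’s paper, and every number in it is hypothetical: two observers see the same output, agree it is three times more likely if the model “understands” than if it doesn’t, and differ only in their priors.

```python
def posterior(prior: float, likelihood_ratio: float) -> float:
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    odds = prior / (1.0 - prior) * likelihood_ratio
    return odds / (1.0 + odds)


# Hypothetical numbers, chosen only to illustrate the effect of priors.
SAME_EVIDENCE = 3.0      # likelihood ratio both observers agree on
BELIEVER_PRIOR = 0.9     # starts out assuming the model understands
SKEPTIC_PRIOR = 0.1      # starts out assuming it does not

if __name__ == "__main__":
    print(f"believer: {posterior(BELIEVER_PRIOR, SAME_EVIDENCE):.2f}")
    print(f"skeptic:  {posterior(SKEPTIC_PRIOR, SAME_EVIDENCE):.2f}")
```

Under these made-up numbers the believer ends up near certainty while the skeptic stays at one in four, even though both looked at the identical transcript.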

Can Age of Empires II be used to simulate an AI?

Technically yes. The AoE II scenario editor exposes game mechanics that can implement logic operations, and de Wynter used those to assemble a perceptron. That’s not the same as reproducing the optimization, data pipelines, or scale behind modern LLMs from companies like Anthropic or OpenAI—but it demonstrates functional equivalence for certain behaviors. Seeing the guts makes you update your priors.

I remember watching people anthropomorphize simple systems

One quick, ordinary memory: a chatbot answered a question and a colleague sighed, “It gets me.” That sigh was informative.

Language hides process. When an interface obfuscates mechanics—fast responses, conversational style, continuity—people apply human frameworks. When it exposes mechanics—visible goats, delays, weird intermediary steps—those frameworks fall away. This isn’t an anti-technology sermon; it’s a methodological warning. To test claims about LLM minds you must control for the interface and the assumptions you bring into the room.

If you want a practical rule, separate observable behavior from the stories you tell about it. Measure latency, trace state, treat the system’s explanations as outputs rather than admissions, and always ask who benefits from the interpretation you prefer: tech platforms, media narratives, or your own cognitive comfort.
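Here is one way that rule could look in practice: a small harness that records the output, the latency, and any exposed internal steps for each system under test. It is a sketch under my own assumptions. The two “systems” are hypothetical stand-ins that compute the same trivial mapping (reversing the prompt), one silently and one with its machinery on show. Neither is a real model or a real API.

```python
import time
from typing import Callable, Dict, List

# A system takes a prompt plus a list it may append trace lines to.
System = Callable[[str, List[str]], str]


def black_box(prompt: str, trace: List[str]) -> str:
    """Hypothetical opaque system: answers without exposing any steps."""
    return prompt[::-1]


def visible_corral(prompt: str, trace: List[str]) -> str:
    """Hypothetical transparent system: same mapping, every step logged."""
    carried = []
    for ch in reversed(prompt):
        trace.append(f"goat carries {ch!r}")
        carried.append(ch)
    return "".join(carried)


def run_trial(name: str, system: System, prompt: str) -> Dict[str, object]:
    """Record behavior (output), timing, and whatever state was exposed."""
    trace: List[str] = []
    start = time.perf_counter()
    output = system(prompt, trace)
    latency_s = time.perf_counter() - start
    return {"system": name, "output": output,
            "latency_s": latency_s, "trace": trace}


if __name__ == "__main__":
    for name, system in (("black_box", black_box),
                         ("visible_corral", visible_corral)):
        result = run_trial(name, system, "baa")
        print(result["system"], result["output"],
              f"{len(result['trace'])} trace lines")
```

The outputs match while the traces differ, which is the comparison I want on record before anyone starts arguing about what either system “understands.”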

Microsoft’s goat stunt doesn’t deny the usefulness or power of models from OpenAI or Anthropic. It does, however, force a simple question: are we crediting intelligence where we’ve only rewarded convincing performance?

The experiment also nudges policy and product thinking: regulators and researchers should build protocols that declare interface conditions and observer priors before testing claims about minds. Otherwise we circle in debates while the machines, and the market, march on.

I’ll leave you with the image de Wynter wanted you to hold: a chat response arriving not from a black box but from a visible corral of bit-goats running preprogrammed routes. The output is the same, but your judgement changes. Will your definition of intelligence bend to the medium, or will you insist on tests that separate actor from performance?