I clicked a link at 2 a.m. and landed on an article about a musician who never existed. The page read like a sober Wikipedia entry written by a party guest who’d had too much to say. I closed the tab and realized the noise wasn’t accidental.
I’ve spent years tracing how information mutates online, and you should care because this feels like watching the foundation of the house get swapped for mirror glass. You and I are watching language models eat the internet and spit versions of it back into the world—versions that can train the next crop of models. That loop is the problem.
On my browser, “Halupedia” sat between a recipe and a thread — and it made me pause
The site looks like a prank version of Wikipedia. Halupedia pages are created on demand by a language model and filled with confident nonsense. If someone already searched for “theatre of the moon” you get their nonsense; if you’re first, the model invents your nonsense from a menu of possible topics.
There’s a neat worry behind this: the first generation of LLMs learned from real material—the Gutenberg Project, forums, news archives, even that one Thought Catalog essay you vaguely remember. But as LLM output spreads, it becomes fodder for future training. The signal-to-noise ratio shifts. Bad patterns amplify. Text begins to resemble a photocopy of a photocopy—grainy, missing detail, confident about the wrong things.

Will AI ruin the internet?
No single algorithm will flip the switch, but the system dynamics matter. Search engines like Google already rank pages that look authoritative; if those pages are high-volume LLM output that merely dresses up emptiness, authority signals get warped. The more these outputs circulate, the more likely they are to show up in future model training sets. It’s a feedback loop that rewards style over substance.
On social feeds I saw a trending list that read like the internet’s id — and it worried me
The Halupedia trending sidebar is a snapshot of what people actually feed the model: shitposts, invented lore, and open racism. Some pages get deleted, but titles and traces linger. Moderation can prune content, but it can’t erase the pattern once it’s been copied elsewhere.

Can LLMs contaminate their own training data?
Yes. Models trained on a corpus that includes high-volume, low-quality machine text can inherit errors, stylistic tics, and confidently wrong claims. That’s not just academic. When big names—Jensen Huang or Sam Altman—talk about scaling compute and data, they’re implicitly betting on data quality. If the well is poisoned, more compute just churns the problem faster.
There’s an exploitative angle too. People have started to weaponize this effect: deliberately generating content to pollute future models. You don’t need a botnet—just a steady stream of plausible-seeming garbage. The result is a rumor that gets stronger as it travels, because each repeat adds the illusion of legitimacy.

On the Halupedia tip page, the creator winked at the idea of “polluting” training data
Bartłomiej Strama, who runs Halupedia, openly frames the project as an absurdist experiment. The site even has a “tips-for-tokens” note encouraging contributions that “will surely benefit society!” It’s performative, but it’s also a live stress test of the no-holds-barred internet.
What is Halupedia?
Halupedia is a wiki-shaped project where every entry is an LLM hallucination. It’s a social mirror as much as it’s a toy: it exposes what happens when you give models permission to invent and let people curate the results. The platform sits next to real-world infrastructure—search engines, content farms, social platforms—so its output doesn’t stay isolated.
There are two clear policy pressures here: one technical, one social. Technically, teams at Google, OpenAI, Anthropic, and elsewhere must build filters and provenance signals so training sets don’t copy every shiny thing. Socially, platforms need moderation that scales without becoming a blunt instrument. Right now moderation is uneven; Strama admits it’s “sometimes too restrict,” which shows the trade-offs in real time.
I’m not predicting apocalypse; I am saying there’s a plausible path where cheap, high-volume text undoes the advantage of human-curated archives. That path matters to anyone who cares about truth, research, or even reliable customer support. If we let the feed write the feed, do we deserve what comes next?