AI Startups Using Others’ Work Face Pushback from Artists, Publishers

I got the email at 7:12 a.m. An editor from a paper in Ohio had a screenshot of ChatGPT answering a question with paragraphs lifted from her newsroom. I read it twice and then felt the room tilt: the machines weren’t just learning facts, they were siphoning livelihoods.

I’m going to tell you what’s happening, who’s arguing, and why the fight now looks less like copyright law and more like a geopolitical tug-of-war. You’ll see where publishers, startups, and lawmakers are placing their bets.

On a courthouse docket one morning: Publishers file suit over “systematic and willful theft”

The scene in New York felt less cinematic than it sounded—a stack of filings and a press release that landed in dozens of inboxes.

Small and regional publishers that together own nearly 400 newspapers sued OpenAI and Microsoft, accusing them of scraping “hundreds of thousands of articles” to train ChatGPT and Copilot. Their claim: those products helped create “hundreds of billions of dollars in market value” (≈€250–€400 billion) for the defendants, and none of it flowed back to the journalists who produced the underlying work.

I’m on the publishers’ side when they say this feels like the only lever left: the courts. Some outlets chose a different path—license deals with AI labs that trade content access for cash, custom tools, and distribution experiments. The split looks like a map of risk tolerance: sue if you can, sell if you must.

At my last conference a developer whispered something unnerving: Distillation isn’t always innocent

Developers call it “adversarial distillation,” and the phrase sounds technical until you see it in practice.

Anthropic alleges Alibaba used nearly 25,000 fake Claude accounts to pump tens of millions of exchanges into a distillation pipeline—effectively teaching an Alibaba model by interrogating Claude. OpenAI has made similar accusations against DeepSeek. On paper, distillation can be a legitimate method to train smaller models from larger ones. But at industrial scale it starts to look like a photocopier that has learned to write poems: it replicates structure and style without asking the original author.

There’s a legal gray zone: AI companies argue that scraping and distilled outputs fall under fair use, while rivals call the same practices theft or a contract violation. Meanwhile, Michael Kratsios, director of the White House Office of Science and Technology Policy, warned about “industrial-scale campaigns to distill U.S. frontier AI systems,” and offered public-private cooperation as a countermeasure.

Can AI train on copyrighted material without permission?

The short answer is: contested. Courts will decide. Right now, big labs point to fair use defenses; publishers point to the scale of the copying and the commercial products built on it. You don’t need a legal degree to see why this matters: if models can be trained on billions of words of copyrighted content with no payment, the business model for news and creative work erodes fast.

In a Senate office the letter landed on a staffer’s desk: Anthropic urges action against Alibaba

Anthropic’s complaint went public as a letter to Senators Tim Scott and Elizabeth Warren, asking for help and offering cooperation.

The company framed the matter as not just a corporate grievance but a question of national security and economic policy: if foreign firms cheaply replicate U.S. models through mass distillation, the lead in frontier AI could shift. Anthropic has previously accused several Chinese startups, and OpenAI has leveled similar charges. But so far the companies have stopped short of alleging outright illegality; their claims often hinge on terms-of-service violations, paired with calls for a coordinated response.

What is adversarial distillation in AI?

Think of a “teacher” model like Claude or GPT-4 and a “student” model that asks the teacher tens of millions of questions. The student learns by observing answers rather than by being trained on the original datasets. That can be efficient, but it can also copy idiosyncrasies and hidden biases from the teacher—Microsoft has emphasized this risk and touted its MAI-Thinking-1 model, which Mustafa Suleyman said was trained “with absolutely zero distillation.”
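To make the teacher-and-student idea concrete, here is a toy sketch in Python. It is an illustration only, not how any lab's pipeline works: the `teacher` function is a hypothetical stand-in for a large model, and the "student" is just a straight line fitted to the teacher's answers. The point is that the student never sees the teacher's training data, only its responses to queries.

```python
import random


def teacher(x: float) -> float:
    """Hypothetical black-box teacher; the student only ever sees its answers."""
    return 3.0 * x + 2.0


def distill(query_fn, n_queries: int = 200, seed: int = 0):
    """Fit a linear student (slope, intercept) to the teacher's answers."""
    rng = random.Random(seed)
    # The student picks its own questions...
    xs = [rng.uniform(-10.0, 10.0) for _ in range(n_queries)]
    # ...and the teacher's answers are its only training data.
    ys = [query_fn(x) for x in xs]

    # Ordinary least squares for a single input variable.
    mean_x = sum(xs) / n_queries
    mean_y = sum(ys) / n_queries
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = distill(teacher)
print(f"student learned: y = {slope:.2f}x + {intercept:.2f}")
```

Whatever quirks the teacher has end up in the student too: if the hypothetical teacher's answers were skewed, the fitted line would inherit the same skew, which is the small-scale version of the copied-bias concern.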

A reporter in a small town said she felt vindicated: Local papers watch big tech argue about theft

There’s a quiet satisfaction in small newsrooms right now—justice, however delayed, finally in motion.

Editors who’ve watched their stories repurposed into AI outputs without attribution see a symmetry: tech firms that once dismissed their complaints are now vocal about model theft, but for different reasons. For publishers, this is a fight for revenue and recognition. For the tech firms, it’s a fight over competitive advantage and guardrails. The result is an unusual alignment of interests: some publishers suing, other media houses licensing, and lawmakers asking whether a coordinated policy should stem mass distillation.

How can publishers get paid by AI companies?

There are three pragmatic paths you’re likely to see: litigation, licensing deals that pay per-use or offer equity and tools, and technical defenses that block scraping. Each has trade-offs. Litigation can set precedent but takes time and money; licenses deliver cash and tools but cede control; technical blocks reduce exposure yet rarely stop determined actors.
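The technical-defense path often starts with a robots.txt file. The sketch below uses Python's standard-library `urllib.robotparser` to show how a compliant crawler would read such a file. The robots.txt contents and the `example.com` URL are assumptions made up for illustration; `GPTBot` is the user-agent name OpenAI publishes for its crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve: block one AI crawler,
# allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
story_url = "https://example.com/news/story"  # placeholder URL

ai_crawler_allowed = parser.can_fetch("GPTBot", story_url)
other_agent_allowed = parser.can_fetch("SomeOtherAgent", story_url)

print("GPTBot allowed:", ai_crawler_allowed)
print("Other agent allowed:", other_agent_allowed)
```

Note what this does and does not do: robots.txt is a request, not a lock. A crawler that ignores the file, or that identifies itself under a different user-agent, is not stopped by it, which is why blocks like this reduce exposure without halting determined actors.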

In a boardroom someone put numbers on the table: The market stakes are enormous

Executives quantify value in ways that make lawmakers pay attention: hundreds of billions of dollars in market capitalization, new enterprise products, and platform integration.

Microsoft, OpenAI, Anthropic, Google, and Alibaba are not just competing on features; they’re jockeying for datasets, partnership access, and legal cover. The U.S. administration’s memo and congressional attention give tech firms political cover to press their case. But that cover is thin. When the same companies that defend training on others’ work complain about model theft, it’s hard not to see a house of cards, with every legal filing a gust of wind.

I want you to leave with one clear frame: this isn’t just copyright vs. innovation. It’s also about power, bargaining, and who gets to define “acceptable training data.” The next year will tell us whether courts, contracts, or cooperation set the rules—so which will you bet on?