RESEARCH POST

The AI Industry Was Built on Stolen Work. Here's the Proof.

Estimated read time: 14 minutes
Tags: AI, Research, Copyright, Big Tech, Builders


Let me start with something that should bother you more than it probably does. Right now, you might have three or four AI tools open in your browser. Maybe Claude for writing. Maybe ChatGPT for code. Maybe Midjourney for visuals. These are products worth hundreds of billions of dollars combined. Products that have quietly, fundamentally changed how we work, create and think.

But nobody asked the people who made those tools possible. Not the YouTubers who spent years filming tutorials. Not the novelists who spent decades perfecting their craft. Not the Kenyan workers who spent nine-hour shifts reading the worst content imaginable on the internet so that your chatbot could stay "safe."

This is the story the AI industry really does not want you to understand. And the further I went down this rabbit hole, the darker it got. This is not a conspiracy theory piece. Every single claim in this post is linked to a court ruling, a government report, leaked internal documents or a major investigative newsroom.

I am a builder who uses AI every single day. I believe in the technology. But I think anyone building with these tools owes it to themselves to understand what they were actually built on. Let's get into it.

Part 01: The Data Heist Nobody Talks About

Here is something that sounds made up until you read the source. According to an investigation by The Atlantic's AI Watchdog, companies including Meta, Microsoft, Amazon, Nvidia, ByteDance, Snap and Tencent extracted over 15.8 million videos from more than 2 million YouTube channels without permission. These videos were used to train generative AI video models. All of it was done in clear violation of YouTube's own terms of service.

Fifteen point eight million videos. Two million channels. That is not a rounding error. That is a systematic, organised, premeditated extraction of human creative work at a scale that is genuinely difficult to comprehend. And it was not some secret operation run from a parking garage. It was being done inside the offices of the most valuable technology companies on the planet. One of those companies was Nvidia, currently valued at around two trillion dollars. And the documents we have on what Nvidia's team was doing internally are jaw-dropping.

The Nvidia Slack Messages

In August 2024, 404 Media obtained leaked internal Slack messages, emails and documents from inside Nvidia. These were not anonymous leaks. They were specific, timestamped communications between named executives and employees. What do they show? Nvidia employees were tasked with scraping millions of videos from YouTube for a commercial AI training project codenamed Cosmos. When workers raised concerns about whether this was legally authorized, management responded immediately. "This is an executive decision," wrote Ming-Yu Liu, a vice president of research at Nvidia. "We have an umbrella approval for all of the data." When someone worried that the scale of their scraping might cause YouTube to block their IP address, a colleague suggested using proxy IP addresses to route around it. Another employee responded that they had found a different solution: restarting a virtual machine on Amazon Web Services to get a fresh IP.
According to the investigation, Nvidia accumulated over 30 million URLs in the space of just one month. The internal documents also show employees discussing how to pull video from Netflix and Discovery, and how to scrape clips from IMDb. There is no evidence that videos from those streaming services were actually taken, but the intent was clearly on the table. When 404 Media asked Nvidia to comment on the legal and ethical dimensions of all this, the company responded that it was operating "in full compliance with the letter and the spirit of copyright law." The letter and the spirit. About downloading 30 million URLs using rotating proxy servers to avoid being blocked. One employee on the thread had put it honestly. "What we are doing will lead to zero publications," he wrote, referring to their decision not to publish any academic research about the project. "Given we are not publishing anything, there will be no negative sentiment." In other words: do it quietly, and nobody will know. A class action lawsuit has since been filed against Nvidia by YouTube creators including H3H3 Productions, alleging the company illegally circumvented protective measures to scrape their content without consent.

Part 02: The Piracy Layer (This One Is Especially Wild)

The YouTube scraping story, as bad as it is, still involves content that was publicly accessible. What happens when we go one layer deeper? This is where it gets genuinely uncomfortable, especially if you use Claude regularly, because this part is about Anthropic.

Anthropic, the company behind Claude, was sued in 2024 by a group of authors who alleged that the company had downloaded millions of copyrighted books from pirate websites to train its AI models. Not scraped from the public web. Not borrowed under some gray area fair use argument. Downloaded directly from known piracy websites.

The three original plaintiffs were thriller novelist Andrea Bartz and nonfiction writers Charles Graeber and Kirk Wallace Johnson. They represented a much larger class of authors. The case went to a federal judge in San Francisco named William Alsup, and what he found in his ruling was extraordinary.

According to the court's findings, Anthropic's cofounder Ben Mann downloaded 196,640 books from a dataset called Books3 in early 2021, books he knew had been assembled from unauthorized, pirated copies. That was just the beginning. In June 2021, Mann downloaded at least five million copies of books from Library Genesis, known as LibGen, which is one of the most famous piracy websites in the world. And in July 2022, Anthropic downloaded at least two million more copies from a site called the Pirate Library Mirror. In total, Anthropic downloaded over seven million pirated books.

The judge found that the company had done this deliberately. According to the ruling, Anthropic's CEO Dario Amodei had internally described the alternative, which was pursuing proper licensing agreements, as "legal/practice/business slog." So instead of dealing with that slog, they downloaded seven million pirated books.
When Anthropic eventually realised the legal exposure this created, they hired Tom Turvey, the former Google executive who had overseen the Google Books project, specifically to clean up the situation. He attempted to pursue licensing deals with major publishers, but according to the judge, those conversations were allowed to "wither" without reaching agreements. Instead, Anthropic began physically purchasing books in bulk, stripping off their bindings and scanning them page by page before feeding the digitised versions into the training pipeline. They then destroyed the physical books.

The case ended in September 2025 with Anthropic agreeing to a $1.5 billion settlement, paying authors approximately $3,000 per book across roughly 500,000 works. One legal analyst at Wolters Kluwer told NPR that if Anthropic had not settled, the company faced "a strong possibility of multiple billions of dollars, enough to potentially cripple or even put Anthropic out of business."

Now, a quick but important note: the judge also ruled that the actual act of training AI on books, even copyrighted ones, qualifies as "fair use" under US law when done with legally acquired material. So the training itself was not the crime. The crime was sourcing the material from pirate websites. The Claude you use today almost certainly contains knowledge derived from seven million pirated books. That is a strange thing to sit with.

And Anthropic was not alone. Internal documents have also revealed that Meta's CEO Mark Zuckerberg personally approved the use of LibGen despite internal warnings about legal risks. The Kadrey v. Meta case has surfaced very similar facts, and that litigation is still ongoing.

Part 03: The Lawsuit Avalanche

At the end of 2024 there were around 30 copyright infringement cases filed against AI companies in the United States. By the end of 2025, that number had more than doubled to over 70 cases. And as of March 2026, the total is now approaching 90 lawsuits worldwide. Let me just give you a sense of who is suing whom, because the list reads like someone pulled names out of a "who's who of powerful organizations" hat.

The New York Times sued OpenAI and Microsoft in one of the most closely watched cases in the entire AI litigation landscape. The case is now in discovery and involves demands for OpenAI to produce over 20 million ChatGPT conversation logs as evidence.

Disney, Universal and Warner Bros all sued Midjourney for copyright infringement related to its AI image generator. These three studios represent the first major Hollywood entities to enter the AI litigation arena. Disney's general counsel stated plainly: "Piracy is piracy. And the fact that it's done by an AI company does not make it any less infringing." The complaint is full of examples of Midjourney generating highly recognizable characters from Star Wars, The Simpsons, Shrek and Scooby Doo in response to entirely generic prompts. Not prompts crafted by lawyers. Just prompts typed by regular users who wanted to see what Shrek would look like as a 1950s greaser.

The RIAA, representing the major record labels, sued over music. Getty Images sued over photography. The Authors Guild sued OpenAI. Even Apple is now facing a lawsuit over its OpenELM model training. Disney also filed a lawsuit against a Chinese AI company called MiniMax, making it one of the first AI copyright cases to cross international borders.

The pattern is clear. The entire creative economy is mobilising against the way AI was built. And the legal reckoning has only just started.

Part 04: The Hidden Human Cost

Here is the part of the story that gets talked about the least, and probably should be talked about the most. There is a common assumption that AI is magical. That it learns from data the same way a human reads, automatically, effortlessly, at machine speed. And in some technical sense that is true. But it is not the whole truth.

In January 2023, TIME published an investigation revealing that in its effort to make ChatGPT less toxic, OpenAI outsourced work to Kenyan laborers earning between $1.32 and $2 per hour. These workers were employed by a San Francisco company called Sama, which operates in Kenya, Uganda and India. Their job was to read tens of thousands of text snippets that OpenAI had sent over, snippets pulled from the darkest parts of the internet, and label them. Content involving child sexual abuse, bestiality, murder, suicide, torture, self-harm and incest. Three dozen workers spent nine-hour shifts reading this content and categorizing it so that an AI safety filter could learn to recognise and block similar material.

The goal was noble. ChatGPT's predecessor, GPT-3, had been trained on hundreds of billions of words scraped from the internet, and that included a lot of genuinely horrific content. The only way to build a safety layer was to show a model what "bad" looked like, and labeling that required human eyes. But those human eyes belonged to people in Nairobi earning $1.32 an hour.

One worker told TIME he suffered from recurring visions after reading graphic descriptions of sexual abuse involving a child and an animal. "That was torture," he said. "You will read a number of statements like that all through the week. By the time it gets to Friday, you are disturbed from thinking through that picture." Four workers told TIME they had been mentally scarred by the work. Requests for one-on-one counseling were reportedly denied. The workers were offered group wellness sessions.
This is the labor that powered the "safe" version of the product that reached one million users in its first five days. Sama ultimately canceled its contract with OpenAI eight months early, in part because of the traumatic nature of the work and in part because a separate TIME investigation had just exposed similar conditions on its Meta contract. The OpenAI contract was reportedly worth $150,000 in total, for work that helped train a product now valued at hundreds of billions of dollars.
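To make the mechanics concrete: a safety filter is, at its core, a classifier trained on human-labeled examples. Here is a deliberately tiny sketch in Python. Everything in it is illustrative, not OpenAI's actual pipeline: the snippets are made up and benign, the two-label scheme stands in for a far larger taxonomy, and a real filter would be a neural model, not a word-count Naive Bayes.

```python
# Toy illustration of why safety filters need human-labeled data:
# the model only learns what "unsafe" looks like from snippets a
# person has already read and categorized.
from collections import Counter

# Hypothetical labeled snippets (stand-ins for the real, grim data).
labeled = [
    ("a recipe for chocolate cake", "safe"),
    ("how to bake fresh bread at home", "safe"),
    ("graphic description of violence", "unsafe"),
    ("detailed depiction of violence and harm", "unsafe"),
]

def train(examples):
    """Count word frequencies per label - a bare-bones Naive Bayes."""
    counts = {"safe": Counter(), "unsafe": Counter()}
    for text, label in examples:
        counts[label].update(text.lower().split())
    return counts

def classify(counts, text):
    """Score each label by smoothed word likelihood; pick the higher."""
    scores = {}
    for label, ctr in counts.items():
        total = sum(ctr.values())
        score = 1.0
        for word in text.lower().split():
            score *= (ctr[word] + 1) / (total + 1)  # add-one smoothing
        scores[label] = score
    return max(scores, key=scores.get)

model = train(labeled)
print(classify(model, "a depiction of graphic violence"))  # unsafe
```

The point of the sketch is that the labels on the right-hand side of each pair are the expensive part. The model is trivial; the human judgment it learns from is not, and that judgment is what the Nairobi workers were paid $1.32 an hour to provide.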

Part 05: What This Actually Means for Builders

I said at the top that I am a builder who uses AI every day. I am not writing this to make you feel guilty about using Claude or ChatGPT. These are genuinely useful tools and the products built on top of them can create real value. But I do think builders have a responsibility to understand the foundation they are building on, for a few reasons.

The legal risk is real and it is shifting. The US Copyright Office released a major report in 2025 concluding that certain uses of copyrighted material to train AI cannot be defended as fair use. Courts are beginning to draw distinctions between what was scraped, how it was sourced and what the output actually does. If you are building a product that generates content, you are downstream of a legal landscape that is actively being rewritten.

The data sourcing problem will create new product opportunities. The entire AI industry has a data provenance problem. Companies are rushing to sign licensing deals now that litigation has made the old approach untenable. There is a real opportunity for builders who create infrastructure around consented, licensed, traceable data. Data marketplaces, consent layers, attribution systems, licensing tooling. All of it is being built right now.

Transparency is becoming a competitive advantage. Creators are increasingly aware of how their work is being used. If you are building a product in the creative space, being explicit about how your AI works and what it was trained on is not just ethically better. It is also a differentiation strategy in a market where trust is becoming scarce.

The scrape first, ask later model is ending. For the first six years of the generative AI era, the dominant strategy was to grab everything and sort out permissions afterwards. That era is closing. Disney is suing. The New York Times is suing. Authors are settling for a billion and a half dollars.
The cost of the old approach is now being quantified in real dollars, and it is enormous. If you are building something new, the rules of the game are changing. That is actually good news for builders who are willing to think about this stuff early.
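What might "consented, licensed, traceable data" look like in practice? Here is a minimal hypothetical sketch in Python. The ProvenanceRecord fields, the License enum, and the admissibility rule are all my own illustrative assumptions, not an existing standard or any company's actual schema.

```python
# Hypothetical sketch of a minimal data-provenance layer: every
# training item carries its source, license, and an explicit consent
# flag, and the pipeline refuses anything unlicensed or unconsented.
from dataclasses import dataclass
from enum import Enum

class License(Enum):
    CC_BY = "cc-by"
    COMMERCIAL = "commercial"
    UNKNOWN = "unknown"

@dataclass(frozen=True)
class ProvenanceRecord:
    item_id: str
    source_url: str
    license: License
    creator_consented: bool

def admissible(record: ProvenanceRecord) -> bool:
    """Only items with a known license AND explicit consent pass."""
    return record.license is not License.UNKNOWN and record.creator_consented

corpus = [
    ProvenanceRecord("vid-001", "https://example.com/a", License.CC_BY, True),
    ProvenanceRecord("vid-002", "https://example.com/b", License.UNKNOWN, False),
]
training_set = [r for r in corpus if admissible(r)]
print([r.item_id for r in training_set])  # ['vid-001']
```

The interesting design question is the default: in this sketch an item with an unknown license is excluded, which is the inverse of the scrape-first model described above, where unknown provenance was the norm and exclusion was the exception.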

The Bottom Line

The AI industry was not built on innovation alone. It was built on millions of YouTube videos taken without permission. On seven million pirated books. On Kenyan workers reading unspeakable content for $1.32 an hour. On a legal grey zone that courts are now beginning to close.

None of this means AI is bad. The technology is real. The potential is real. But the story we were sold, the one about a handful of geniuses in San Francisco who just happened to crack the code of intelligence, is significantly less complete than it sounds. The people who made it possible deserve to at least be named.

The woodworker in the Atlantic investigation, Jon Peters, whose YouTube videos were scraped without his knowledge, said something that has stuck with me. "I think everything's gonna get stolen. Do I quit, or do I just keep making videos and hope people want to connect with a person?" That question is not just about AI. It is about what kind of ecosystem we want to build around it.

Sources

The Atlantic / AI Watchdog: Tech Giants Scraped 15.8M YouTube Videos
Proof News: Nvidia Scrapes YouTube, Eyes Netflix
404 Media: Leaked Nvidia AI Scraping Documents
Bloomberg Law: Nvidia Faces Class Action Over YouTube Scraping
PC Gamer: Nvidia Scraping 80 Years of Video Per Day
Engadget: Nvidia Scraped YouTube, Netflix Without Permission
NPR: Anthropic $1.5 Billion Settlement with Authors
PBS: Anthropic Landmark Settlement Explained
Publishers Weekly: Federal Judge Rules on Fair Use in Bartz v. Anthropic
Kluwer Copyright Blog: Understanding America's Largest Copyright Settlement
actuia.com: Claude at the Helm, Anthropic and LibGen Ruling
Copyright Alliance: AI Copyright Lawsuit Developments 2025
Chat GPT Is Eating the World: Master List of AI Lawsuits
Georgetown Law: Disney, Universal, DreamWorks v. Midjourney Analysis
TIME: OpenAI Used Kenyan Workers on Less Than $2 Per Hour
Rolling Stone: ChatGPT Moderators and PTSD
Al Jazeera: ChatGPT and the Sweatshops Powering the Digital Age
Copyright Lately: Anthropic's $1.5 Billion Speeding Ticket Analysis
McKool Smith: AI Infringement Case Updates 2025
WinBuzzer: AI's Original Sin Investigation

FAQs

Q: Is it illegal to train AI on copyrighted content?

Not automatically, no. A federal judge ruled in the Anthropic case in 2025 that training AI on legally acquired copyrighted books is "exceedingly transformative" and qualifies as fair use under US law. The legal problem Anthropic ran into was not the training itself but the fact that the books were obtained from piracy websites. The distinction is real, but it is also narrow, and courts are still working through it across dozens of ongoing cases.

Q: Can I still use AI tools like Claude and ChatGPT ethically?

Yes. Using these tools as an individual or builder does not make you legally or morally responsible for how they were trained. The ethical responsibility sits with the companies that made those choices. What you can do is stay informed, support platforms that are building with licensed data, and advocate for better industry standards when you have a platform to do so.

Q: Why did Nvidia say it was "in full compliance" when internal documents show otherwise?

That is a great question and honestly one that a lot of legal analysts have been picking apart. Nvidia's argument is essentially that copyright protects specific creative expressions but not facts, ideas or information, and that training an AI on videos is more like "learning" than "copying." Whether courts will accept that argument is still being tested. A class action lawsuit filed by YouTubers in late 2025 will likely force that question further.

Q: What happened to the Kenyan workers after Sama canceled its OpenAI contract?

Many of them ended up on lower-paying work streams or lost their jobs entirely. The contract cancellation left around 200 content moderation employees unemployed or facing reduced income.
One of the workers, Richard Mathenge, later went public as one of 150 African AI workers who voted to establish the first African Content Moderators Union in 2023, pushing for better conditions and recognition for the labor that makes AI safety possible.

Q: What should builders do differently because of all this?

Three things. First, understand the legal landscape around the AI tools and APIs you are integrating, especially if you are building for creative industries. Second, be transparent with your users about what AI is doing in your product. Third, keep an eye on the licensing economy that is emerging. Companies like Shutterstock, Getty and major publishers are now signing licensing deals with AI companies. That infrastructure is being built right now and there are product opportunities inside of it for early movers.

Q: Will AI companies be forced to retrain their models without pirated data?

Probably not in the near term. The Anthropic settlement required the company to destroy its LibGen and Pirate Library Mirror datasets, but the knowledge already extracted from those books remains embedded in Claude's neural networks. Untraining a model is not as simple as deleting files. This is actually one of the more unsettled and fascinating questions in AI law right now: what does it even mean to "remove" something from a trained model?

Q: Did any of the big AI companies pay for content legitimately?

Yes, some. Nvidia has licensing agreements with Shutterstock and Getty for certain projects. OpenAI has signed deals with several media outlets and publishers. AP, the Associated Press, reached a licensing deal with OpenAI in 2023. HarperCollins and Wiley are among the publishers who have signed agreements with unnamed AI companies. These deals are becoming more common now that litigation has raised the cost of the alternative. But they are newer developments, and for the most part the foundational models were trained before any of this was in place.

This post is part of the buildwithdev.xyz research series. Dev Chopra builds in public at buildwithdev.xyz and documents products, systems and research notes along the way. If you found this useful, the best thing you can do is share it with another builder who should probably know this stuff.

PUBLISHED: 3/19/2026