TL;DR: A trove of documents just unsealed in a copyright lawsuit shows how the LLM sausage was made: for some companies, by pirating books online; in Anthropic’s case, by spending tens of millions of dollars to buy, scan, and destroy millions of physical books. It’s a familiar pattern in tech: copy first, get sued later. The legal fight over copyrighted work used to train AI is still unfolding, but Anthropic’s settlement hints at the likely endgame: pay a fraction of your hundred-billion-dollar valuation and move on.

What happened: “It was a pleasure to burn.” It’s the famous opening line of Ray Bradbury’s Fahrenheit 451, a dystopian novel about the violent destruction of books. Somewhat ironically, it’s also one of the many titles Anthropic obtained while training its AI model.

A new Washington Post report based on the unsealed documents from a copyright lawsuit shows that AI companies did much more than quietly download text. Anthropic, for one, physically ripped up millions of books, slicing off their spines to efficiently scan every page, in an effort to improve its AI. (Don’t worry: Anthropic says the pages were recycled. Who says AI isn’t green?)

Since the first wave of AI copyright lawsuits in 2023, we’ve known that AI companies trained their LLMs on untold amounts of copyrighted material. But this report surfaces new details of how at least one firm actually pulled it off: as a full-scale operation with an intention to “destructively scan all the books in the world.”

How it worked:
- A man, a plan, a Panama project: Anthropic internally referred to its book-stripping effort as “Project Panama.” The project ramped up in early 2024, with the company considering purchases from libraries (including “chronically underfunded” ones) and stores like The Strand that sell used books.
- Book booty: Before landing on a plan to buy used books, an Anthropic co-founder personally downloaded large collections from shadow libraries like LibGen and the Pirate Library Mirror.
- The Google Books veteran: Anthropic brought on Tom Turvey, a former Google exec who was instrumental to launching Google Books, to help run its book buying and scanning operation.
- No return policy: Anthropic eventually spent tens of millions of dollars buying used books, then cutting off their spines with a “hydraulic powered cutting machine.”
- Not just Anthropic: Internal documents also revealed that Meta employees were worried about torrenting books from company laptops—and got the OK to use LibGen for one of its AI models, Llama 3.
- Thanks, AWS: Meta employees reportedly torrented books on rented Amazon servers to avoid the activity being traced back to Meta, which feels like the Big Tech equivalent of robbing a bank and stashing the cash in your neighbor's garage. (Meta denies that it torrented books named in a separate lawsuit.)
- Hard drive reformat: OpenAI has previously acknowledged it used LibGen but pinky swears that it deleted everything before ChatGPT launched.

Is any of this legal? Maybe. Some judges have suggested that training a model on copyrighted text could qualify as “transformative use.” But how that data was obtained matters. Torrenting pirated books is a clear problem; buying physical books and scanning them sits in a murkier zone, which helps explain Anthropic’s blade-heavy workaround. Under US copyright law, statutory damages can reach $150,000 per infringed work when the infringement is willful. At that ceiling, 500,000 works would imply as much as $75 billion in theoretical exposure. Anthropic’s strategy paid off: It settled its book-related copyright case last year for about $1.5 billion, or about $3,000 each for 500,000 works, avoiding damages that could have been an existential threat.

We’ve read this one before: As unsettling as the image of millions of ripped-up books is, the pattern of asking forgiveness, not permission, isn’t new in tech. Napster built its user base on unlicensed music and eventually collapsed under lawsuits before the industry fully embraced streaming. Google scanned millions of books first, then spent more than a decade litigating with authors and publishers before courts concluded it was legal.

Epilogue: While the most popular AI chatbots today were trained on the works of people who largely remain uncompensated, The Information reported last week that OpenAI wants to take a cut from discoveries and patents made with the help of ChatGPT.

-WK