AI’s Hidden Crisis: The Industry Is Running Out of Data
AI Collapse Thesis Pt 2: Why Human Knowledge Is the Next Great Moat

Fifty thousand people followed me on Instagram because I made a simple yet bold claim: the AI industry is running out of high-quality data, and humanists hold the key to solving this problem.
That line tends to raise eyebrows because I’m a historian and a PhD candidate, and my company, Analog Social, isn’t a tech startup. Am I out of my lane? The more I’ve studied where AI’s training data actually comes from, and what’s left, the more convinced I’ve become that this isn’t just a “tech” issue. This is a knowledge governance issue, and I argue that humanists, archivists, librarians, and knowledge workers are central to it.
The first phase of the GenAI boom was built on web scraping at an unprecedented scale. Datasets such as Common Crawl and Google’s C4 bundled together huge swaths of the public internet and became core ingredients in training large language models. Their ubiquity is documented in technical and critical studies alike, which analyze both what these web scrapes contain and what they conspicuously miss.[1]
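To make the scale and accessibility of these corpora concrete, here is a minimal sketch that streams a few records from the public C4 mirror on the Hugging Face Hub. The `allenai/c4` dataset name and its `text`/`url` fields reflect that mirror’s layout, not anything specific to this essay; treat it as an illustration of how low the barrier to this material has been.

```python
# pip install datasets
from itertools import islice

from datasets import load_dataset

# Stream the English split of C4 so nothing downloads up front;
# the full corpus is hundreds of gigabytes of cleaned web scrape.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

# Each record is just a web page's text plus its source URL and crawl
# timestamp: the raw material of the first generation of LLMs.
for record in islice(c4, 3):
    print(record["url"])
    print(record["text"][:200])
```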
But even a massive web crawl has limits. Several independent analyses suggest we are approaching a data cliff for the kind of high-quality, human-generated text that most improves model performance.[2] In 2022, researchers at Epoch, a non-profit AI research institute, projected that constraints on high-quality text could begin to bind mid-decade; in 2024 they updated their outlook and still argued that truly high-quality public text is getting harder to source and could be effectively exhausted in the latter half of the decade without new approaches. Their broader 2024–2025 work frames data scarcity as one of a few structural bottlenecks for scaling.

That scarcity explains the market’s next move: expensive licensing. When public data runs thin, private data becomes a marketplace. Academic publishers, news organizations, and other rights holders have announced deals that put real price tags on access. Wiley, one of the world’s largest academic publishers, disclosed it expected about $44 million from AI rights partnerships; Taylor & Francis projected roughly $75 million; and News Corp’s multiyear arrangement with OpenAI was reported at up to $250 million. These figures don’t describe one uniform contract; they signal an emerging price of admission for high-quality, rights-cleared text.
There’s a reason those deals matter. The average web page is written to be broadly readable, not to capture the density of a doctoral dissertation or a complex legal brief. Readability research and content-design guidelines commonly assume a seventh- to eighth-grade reading level for mass-audience text; that’s appropriate for the web, but it means the distribution of text online skews toward general prose rather than specialized, citation-heavy writing. That helps explain why models write serviceable emails but struggle when asked to produce original, expert-level scholarship without hallucinating sources.
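To make the grade-level claim concrete: the most common yardstick is the Flesch-Kincaid grade formula, which scores text from average sentence length and average syllables per word. The sketch below implements it with a crude vowel-run syllable counter, so its numbers are approximate; dedicated readability libraries use dictionary-based syllable counts.

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count runs of vowels as syllables (minimum 1).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    # Standard formula:
    # 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) \
        + 11.8 * (syllables / len(words)) - 15.59

print(round(flesch_kincaid_grade(
    "The cat sat on the mat. It was warm."), 1))  # simple web-style prose
print(round(flesch_kincaid_grade(
    "Epistemological historiography necessitates "
    "methodological self-consciousness."), 1))    # dense academic prose
```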
The hallucination problem is not abstract. Studies have documented inaccurate or fabricated citations in scientific and medical contexts, and courts have sanctioned lawyers for filing briefs with non-existent cases generated by AI tools. The pattern is clear: when systems are pushed beyond the contours of their training data, and beyond what they can verify, they improvise. That is not a “moral failing” of the technology; it is a signal about the limits of available, trustworthy data.
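Part of the verification gap is mechanically checkable: a cited DOI either resolves in a registry or it does not. The sketch below (my illustration, not a tool from the cases mentioned) queries the public Crossref API; note that an existing DOI only proves the source is real, not that it says what the citation claims.

```python
import json
import urllib.request
from urllib.error import HTTPError

def doi_exists(doi: str) -> bool:
    """First-pass check against fabricated citations: does Crossref
    know this DOI? (A real DOI can still be cited misleadingly.)"""
    url = f"https://api.crossref.org/works/{doi}"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            record = json.load(resp)
        # Surface the registered title so a human can compare it with
        # what the model claimed the source says.
        print(record["message"]["title"][0])
        return True
    except HTTPError:
        # Crossref returns 404 for DOIs it has never registered.
        return False

print(doi_exists("10.1038/s41586-024-07566-y"))  # Nature paper cited below: True
print(doi_exists("10.1038/s00000-000-00000-0"))  # fabricated DOI: False
```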
So where is the “good stuff,” the material that could actually raise the ceiling on what these systems can do? Much of it lives outside the open web. Think analog archives and special collections in libraries and historical societies; institutional repositories behind contracts and privacy walls; corporate knowledge bases, legal filings, clinical notes, field notebooks, and hard-won domain expertise that hasn’t been digitized, cleaned, or permissioned for machine learning.
Even when some of this material sits on a server, it is governed by confidentiality, ethics, patient privacy, or simple institutional memory. That’s why publishers can command eight- and nine-figure sums: they steward concentrated reservoirs of verified, higher-signal text.
This is where my lane begins. One of my first college internships was in knowledge management: tracking IP, records, and documents for an organization. It taught me that the hard part of knowledge isn’t storage; it’s context, consent, and curation. In the current AI rush, those human problems have returned with a vengeance.
We don’t have a technology shortage so much as a governance shortage: who owns what, who can license what, and how we rebuild trust after years of scraping without clear accountability. The open-access community has been warning about this new enclosure; even within higher education, commentators have tied the recent AI licensing spree to a renewed case for open models of access and participation.
I’m an optimist about what that means in practice. If high-quality data is scarce and valuable, we’ll need new entry-level jobs in digitization, cataloging, metadata, conservation, and rights management. We’ll need contracts that compensate creators and institutions for licensing, and processes that make opt-in and opt-out meaningful. We’ll need archivists and historians at the table to design ethical workflows that surface context instead of stripping it away. None of this requires believing that AI is replacing us.
It requires recognizing that original human thought, lived experience, and historically grounded interpretation are the last great moats in the information economy, and that these moats are dug and maintained by humanists.
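To ground “meaningful opt-out” in something concrete: the bluntest instrument a site has today is its robots.txt file, which can disallow published AI crawler user-agents such as OpenAI’s GPTBot and Common Crawl’s CCBot. This minimal sketch uses Python’s standard library to ask what a site currently permits; the URL is a placeholder, and robots.txt is a voluntary signal rather than an enforcement mechanism, which is exactly why contracts and consent frameworks have to do the real work.

```python
from urllib import robotparser

def crawl_permissions(site: str, page: str = "/") -> dict:
    """Check which known AI crawlers a site's robots.txt asks to stay out.
    robots.txt expresses a policy; it cannot enforce one."""
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{site.rstrip('/')}/robots.txt")
    rp.read()
    # GPTBot (OpenAI) and CCBot (Common Crawl) are published crawler
    # user-agents; "*" is the default rule for everyone else.
    agents = ["GPTBot", "CCBot", "*"]
    return {a: rp.can_fetch(a, f"{site.rstrip('/')}{page}") for a in agents}

# Hypothetical example: inspect an archive's current crawler policy.
print(crawl_permissions("https://example.org"))
```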
I think this is also an opportunity to finally get the AI industry to slow down and listen to stakeholders, as more and more people call for a ban on the development of Artificial General Intelligence (AGI).
If the first era of generative AI was “scrape now, apologize later,” the next era has to be consent-driven, archivally informed, and built on mutually agreed-upon partnerships with the people who actually steward knowledge.
This series will unpack the economics of data scarcity, the shifting market for licensing, and practical steps institutions and individuals can take to protect, monetize, or share their knowledge on their terms. If you care about where our models get their minds, this is the conversation to have, across universities, companies, newsrooms, labs, and living rooms.
I know this AI collapse thesis series may seem a little dense, but I think it’s important for all of us to understand what decisions are being made, and why, when it comes to our intellectual property, jobs, and futures.
If you care about the future of knowledge, now is the time to pay attention.
Share this article, subscribe to the series, and join the movement to build a human-centered AI economy.
TL;DR video explanation:
Long live the humanities,
Shae
Hi, I’m a PhD Candidate at Harvard and the Founder of Analog Social. My writing explores how history, ideas, and technology shape our understanding of what it means to be human. Here’s a guide for navigating my website, so you can find the content most relevant to you:
Reading Lists (Over 14 curated reading lists now!!)
Essays on History/Archiving (I’m moving all the SHAE THE HISTORIAN content over here)
Additional sources:
[1] Jesse Dodge et al., “Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus,” https://sites.rutgers.edu/critical-ai/wp-content/uploads/sites/586/2021/09/dodge2021documentingC4.pdf; Stefan Baack, “A Critical Analysis of the Largest Source for Generative AI Training Data: Common Crawl,” https://dl.acm.org/doi/10.1145/3630106.3659033.
[2] The Epoch analyses described in the text. See also Ilia Shumailov et al., “AI models collapse when trained on recursively generated data,” Nature 631, 755–759 (2024), https://doi.org/10.1038/s41586-024-07566-y.
