Information Overload

What happens when AI trains AI

Oct 02, 2023

A few weeks ago, we talked about Liquid Facts and how our current age of abundant information changes how we perceive information. We talked about this trend in very abstract terms, but there are a few ways I think this can play out in everyday life.

In an age of superabundance of facts, the emphasis is no longer on good data but on having lots of data. Finding and researching things has become a game of probability: the more content you publish, the better the odds that some of it gets found. So the incentives align to encourage people to pump as much information into the world as possible, regardless of its quality.

I think about this a lot in the world of SEO, or search engine optimization. SEO is the science and art of getting content to rank higher in search engines, and it’s generally geared towards creating lots and lots of content that contains keywords that people are searching for and, therefore, search engines can pick up on. On the one hand, there’s nothing wrong in general with making it easier for content on the web to be found by the websites we use to search the internet. On the other hand, this incentive leads to the creation of thousands of clickbait websites and pieces of content that exist to get people to click on them, which triggers an ad to be shown, earning a few cents for the company that created the clickbait content.

Over time, search engines have gotten better and better at filtering out the useless content created simply for clicks. This has consequences: it makes it harder for people making good content to have that content discovered, and it pushes the people making clickbait to get better at making clickbait so they keep attracting clicks.

Another facet is Large Language Models, or LLMs, like ChatGPT. These prediction engines take a prompt and create a response by repeatedly predicting the next word on the page. LLMs can be tremendous tools for many applications, but like any technology, they have their tradeoffs. Neil Postman wrote, “When statistics and computers are joined, volumes of garbage are generated in public discourse.”1 In our age of information superabundance, we have unleashed a new tool for pumping more and more content into the world that no human has reflected upon.
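To make "predicting the next word" concrete, here is a minimal sketch of the idea. It is a toy word-pair counter, not how ChatGPT or any real LLM works, and the function names and the sample sentence are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each word, which words follow it in the text."""
    follows = defaultdict(Counter)
    words = text.lower().split()
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = follows.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical training text, chosen only for illustration.
corpus = "the cat sat on the mat and the cat slept"
model = train_bigram(corpus)
print(predict_next(model, "the"))
```

The toy model has no idea what a cat is. It only knows which word most often came next in the text it was shown, which is why the quality of that text matters so much.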

In the same passage of Technopoly, Postman goes on to discuss the useless facts that sportscasters come up with to fill time on air. He cites some examples from the mid-80s, but having watched College Gameday on ESPN recently, I don’t have to look far for my own. Postman observes, “It is surprising how frequently such blather will serve as the backbone of conversations which are essentially meaningless.”2 Postman, of course, knew nothing of LLMs at the time of writing. Still, he perceived the outcome that could arise: a world flooded with generated text.

Going even further, the rise of LLM-generated web content poses a considerable risk to the creation of future LLMs. The models that underlie these tools are created by “training” a computer on lots of source data that helps it learn to predict what will come next. Training a model on its own output creates a dangerous loop in which the model’s only input is its own earlier output. As we flood the web with LLM-created text, we risk making it much harder to develop better LLMs in the future. Scientific American describes the problem this way:

Thanks to a boom in generative artificial intelligence, programs that can produce text, computer code, images and music are readily available to the average person. And we’re already using them: AI content is taking over the Internet, and text generated by “large language models” is filling hundreds of websites, including CNET and Gizmodo. But as AI developers scrape the Internet, AI-generated content may soon enter the data sets used to train new models to respond like humans. Some experts say that will inadvertently introduce errors that build up with each succeeding generation of models.
A growing body of evidence supports this idea. It suggests that a training diet of AI-generated text, even in small quantities, eventually becomes “poisonous” to the model being trained. Currently there are few obvious antidotes.3
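A rough way to see why this loop is a problem is a toy simulation. This is only a sketch under big simplifying assumptions: the "model" here just memorizes word frequencies and resamples from them, and the words `w0`, `w1`, ... and all the parameters are hypothetical. It is not the method used in the research the article describes.

```python
import random

def next_generation(corpus, rng):
    """'Train' on a corpus by memorizing its word frequencies, then
    'generate' a same-sized corpus by sampling from those frequencies."""
    return [rng.choice(corpus) for _ in range(len(corpus))]

def simulate(generations=30, vocab_size=50, corpus_size=200, seed=0):
    """Track how many distinct words survive when each generation
    is trained only on the previous generation's output."""
    rng = random.Random(seed)
    # Generation 0 stands in for human-written text (hypothetical words).
    corpus = [f"w{rng.randrange(vocab_size)}" for _ in range(corpus_size)]
    diversity = [len(set(corpus))]
    for _ in range(generations):
        corpus = next_generation(corpus, rng)
        diversity.append(len(set(corpus)))
    return diversity

diversity = simulate()
print(diversity)
```

Because each generation can only reuse words that appeared in the one before it, the count of distinct words can stay flat or shrink but never grow. A word that drops out by chance is gone for good, which is a small-scale picture of errors building up across generations.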

Humans are not probability engines and, on the whole, have more discernment than a large language model. But if consuming LLM-generated content is poisonous to LLMs themselves, we should be concerned about what a world flooded with unreflected-upon AI-generated mush will do to humans.

More is not always better, and in a world of superabundance of information, more makes it harder to know what we should be listening to.

As we go forward, how can we be more intentional about the content we are consuming, and as content creators, how can we balance the trade-offs we are incentivized to make in our age of information superabundance?

With that, thanks for reading, and see you again soon.

Social image by Izabel 🏳️‍🌈 on Unsplash

Neil Postman, Technopoly: The Surrender of Culture to Technology, p. 137.

“AI-Generated Data Can Poison Future AI Models,” Scientific American.

Context and Content