Large Language Models and the Information Ecosystem


Posted on: April 07, 2023

Recent developments in large language models (LLMs) raise many technical questions, many of which are hard to ponder beyond sheer speculation because of the opaque nature of current LLM development. This note considers the interplay between the contemporary information ecosystem and this technology.1 The objective is to highlight dynamics that do not seem to form a significant part of current discourse.

Different LLM training procedures adopt different data curation and preparation strategies, some publicly detailed, others proprietary and opaque. Generally, though, all LLMs rely on large amounts of web data.2 This data is the result of decades of content generation by numerous individuals with an immense variety of intents, incentives, and backgrounds, through both paid and unpaid labor. It includes articles from prestigious news outlets, posts from obscure blogs, technical discussions, encyclopedic essays, and even messages from micro-blogging platforms. It also often extends beyond natural language to include large swaths of publicly available code and semi-structured data (e.g., tables).

This data is as important to the training of LLMs as the computing techniques and resources the process uses, potentially even more so. The questions this note raises are: what impact, if any, will LLM-based services have on the processes that bring about the data they consume? Will this interplay affect deployed LLMs in the long run? And what may be the impact, if any, on the modern information ecosystem?

LLMs are being offered as general tools for any task that can be formulated as mapping text to text, including answering questions, writing code, and composing essays. The breadth of information-heavy activities that fall under this definition is immense. While it is too early to judge LLMs' actual applicability to such a wide range of use cases, broad adoption has a high potential to alter the economics of contemporary content generation by affecting both information seekers and information producers. This may create a cyclical feedback loop, because LLM training relies on the data currently generated by these same activities.

For example, users of LLM-driven search engines may be less likely to read content on third-party websites, thereby eliminating revenue streams and incentives that currently drive a significant amount of content creation.

Such users are also less likely to post queries to discussion boards, preferring instead the immediate response of an LLM-driven conversational service. This may eliminate or reduce the activity we see on expert discussion boards (e.g., Stack Overflow). Queries will disappear into the proprietary datasets of a small number of companies, similar to how search queries are treated today. With no queries, content generation is likely to decline dramatically. At the same time, remaining content creators may adopt LLMs to generate content, reducing their own effort but dramatically curtailing the diversity and innovation of Internet content.3

In broad terms: LLMs may end up eating the ecosystem that enabled their existence. Reduced incentives may lead to a significant decline in certain content creation activities. In turn, this will lead to stagnant datasets, and LLMs will not be able to benefit from new human creativity. With reduced human activity and static LLMs, the entire information ecosystem may lose much of its vibrancy. To conclude with more dramatic flair: we might forever be stuck in 2023. This is likely an exaggeration, but the scenario underscores the importance of considering the implications of automating away content creation, and the sustainability of this process in the long term.


  1. This note focuses on LLMs, but the issues raised are applicable, with some caveats, to generative vision models too. 

  2. This note is written under the assumption that the current status quo of fair use continues. The issues outlined here should likely bear on the ongoing and unresolved copyright discussions. 

  3. Not all content is likely to be affected in similar ways. For example, short social interactions (e.g., on Twitter) may be less likely to suffer. However, these do not seem to be the type of data that supports the most exciting capabilities of LLMs.