The New York Times Wants OpenAI to Pay Up
Big Journalism is preparing for a legal and economic battle against Big Tech. My money is on Big Tech.
Let's say you run a business selling robot Christmas tree ornaments, and you're doing pretty well. You sell about 500 of them a year, and your usual method is to scout for these precious trinkets one at a time at robot-themed holiday flea markets. It's not a very time-consuming process, and you've grown used to it; after all, there are plenty of places to explore in the hunt for holiday decorations celebrating robots. Then, one day, a vendor from whom you typically buy two ornaments per year approaches you with a straight face and demands a few hundred million dollars for their ornaments, or they will sue you. You’d probably return their straight face, blink a few times, then walk away while you try to find the humor in their proposition. The vendor was making a pretty aggressive ask without much, if any, leverage. That is bad business!
Now, a group of prominent news publishers, including the likes of The New York Times and News Corp, is gearing up to do just that with big AI companies. They're preparing to demand substantial payouts for the use of their articles in training large language models (LLMs) like Google’s Bard and OpenAI’s ChatGPT. Their presumption (which is very plausible) is that OpenAI, Google, and others are scraping their news articles to train their LLMs and generate revenue. Let’s be honest: they probably are. However, the publishers are also presuming they have economic leverage in this exchange, which is much more dubious.
LLMs, like the powerful ones built by OpenAI, Microsoft, and Google, rely on vast amounts of data, ranging from news articles to Wikipedia to Reddit message boards to any other kind of website with words on it. They need this vast range of inputs to be versatile and produce quality output. This is why the companies building LLMs are probably scraping news articles from the very publishers considering legal action.
However, the legality of using anything they can access on the internet is still up in the air. There are lots of lawsuits about this right now, and I think it is far from settled.
These publishers were left a bit bruised by the way the economics of news articles on social media platforms evolved. In the pursuit of a broader audience, they ended up giving away a substantial amount of value to the social media giants through the way their articles got shared and displayed. Now, they're determined not to repeat that mistake with AI companies.
The worry for publishers is that as these AI models grow more sophisticated, they might become powerful enough to displace traditional journalism. The AI companies could build out journalism tools that replace the traditional news industry; they could summarize the news and capture all of the eyeballs (and ad revenue) through their chatbots; or they could change the cost structure of news reporting so that their AI tools become essential to publishers, tools trained on those publishers' own data.
One of the core challenges for the publishers here is scale. Let’s assume they do not have legal protection from copyright law (which is still a very open question). Even an archive as massive as the New York Times' could go missing entirely, and it wouldn't make much of a dent in an AI company's effort to train its LLM. The New York Times archive contains roughly 2 billion words¹, but OpenAI’s GPT-3 was trained on a corpus of roughly 500 billion tokens (a token is roughly three-quarters of a word), and GPT-4 is reportedly trained on a dataset an order of magnitude larger than that. The New York Times' content is a drop in the bucket of LLM training data.
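To make that back-of-envelope math concrete, here is a quick sketch using the footnote's figures (3.5 million articles at ~600 words each). The 500-billion-token corpus size and the words-per-token ratio are rough, commonly cited estimates, not figures from OpenAI itself:

```python
# Back-of-envelope: how much of a GPT-3-scale training set is the NYT archive?
# Article counts and lengths are from this piece's footnote; the corpus size
# and words-per-token ratio are common public estimates, not OpenAI's numbers.

NYT_ARTICLES = 3_500_000          # articles in the NYT online archive
WORDS_PER_ARTICLE = 600           # average NYT article length
GPT3_TRAINING_TOKENS = 500e9      # commonly cited GPT-3 corpus size
WORDS_PER_TOKEN = 0.75            # rule of thumb: 1 token is ~3/4 of a word

nyt_words = NYT_ARTICLES * WORDS_PER_ARTICLE          # ~2.1 billion words
gpt3_words = GPT3_TRAINING_TOKENS * WORDS_PER_TOKEN   # ~375 billion words

print(f"NYT archive: ~{nyt_words / 1e9:.1f}B words")
print(f"GPT-3 corpus: ~{gpt3_words / 1e9:.0f}B words")
print(f"NYT share of a GPT-3-scale training set: ~{nyt_words / gpt3_words:.2%}")
```

Run the numbers and the NYT's entire archive comes out to roughly half a percent of a GPT-3-scale training set, before you even get to GPT-4's reportedly much larger dataset.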
So it's clear that individual news publishers standing alone wouldn't have much leverage in negotiations with the AI giants. To wield real bargaining power, they would need to unite entire segments of the internet, like ALL news publishers or ALL message boards. Collaborating with just a few big players in the news industry won’t be enough to get real leverage; they will need to build massive, industry-wide coalitions. Otherwise, they’ll be just one vendor in an internet-sized flea market of people selling similar robot Christmas tree ornaments. Big tech will probably just say “that’s cute” and walk away.
¹ There are about 3.5 million articles in the NYT online archive, and the average NYT article is ~600 words long.