Data Acquisition Steeplechase
Incumbents in the LLM development space will continue to benefit from the unobstructed access to training data they enjoyed before 2023, while new entrants will face many obstacles that limit their ability to compete
If you are OpenAI, you might be worried about people threatening you with lawsuits and future actions to make it harder for you to scrape their websites for training data for your Large Language Models (LLMs). The success of ChatGPT and other LLMs has drawn a lot of attention to the technology and how the sausage gets made. This has led to lots of people who own various forms of content getting really angry about the practice of scraping the web en masse to acquire training data.
There are real reasons these societal beefs could pose a headache to OpenAI:
How the Fair Use Doctrine in copyright law applies to a generative AI model that is fully capable of infringing on copyrights is very much an unresolved legal issue.
Some websites have Terms of Service that say how their websites can and can’t be used.
Taking the data of unsuspecting people and creators without their consent to build very lucrative businesses just looks bad!
In the future, website and content owners might take steps to make it a lot harder for you to scrape their data.
Like I said, you might be worried. But in reality, you probably aren’t:
The lawsuits alleging copyright infringement take a lot of leaps in their legal arguments and assumptions, and you can afford MUCH better lawyers than even Sarah Silverman.
When websites wrote their Terms of Service, they probably had no idea what a Large Language Model was. So, there are probably lots of loopholes that let you argue that scraping the website to train an AI model is allowed and fair game.
You might have a sullied reputation, but that probably doesn’t hurt you too much when you’re busy selling your tools to other tech companies with equally bad reputations. It’s kind of like being in the crew of a pirate ship, you probably brag about some of the mean things you did to get ahead because that’s the social currency.
People may do things in the future to make it harder for you to get their data, like update the Terms of Service, but you have already scraped most of the internet! Until English evolves in a significant way, you’re pretty good with what you’ve already got!
There are lots of reasons for you to feel pretty OK with your current position. In fact, you would probably even want to join in the fun and help the people who want to keep you out by giving them the code they can add to their website that will keep your crawlers away. There are two reasons to do this:
This makes you look good, like you’ve become the hero.
It really won’t change anything because, again, you already scraped their website, and any website that actually has a lot of content is probably sophisticated enough to have already figured out this code on their own.
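In practice, this is exactly what happened: OpenAI documented a named crawler, GPTBot, that respects the standard robots.txt opt-out mechanism. A site owner who wants to keep it out can add something like the following to the robots.txt file at their site's root (the GPTBot user-agent string is OpenAI's published crawler name; other companies' crawlers would each need their own entry):

```
# Block OpenAI's documented crawler from the entire site
User-agent: GPTBot
Disallow: /

# Everyone else can crawl as usual
User-agent: *
Allow: /
```

Note that robots.txt is purely honor-system: it only keeps out crawlers that choose to respect it, which is part of why publishing this snippet costs the incumbents so little.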
So, sure, pat yourself on the back and try to look like a good guy for a minute. It won’t really hurt you.
But what about the other guys? Well, you probably feel like a door is closing, fast. Remember, LLMs require massive datasets to train, and the more diverse they are, the better. Prior to 2023, private organizations basically had an unobstructed playing field in which to go out and scrape data. It was like running a 100-meter dash; just go as fast as you can to collect data.
However, data owners are putting up barriers to data collection quickly, which means 2024 and beyond will be more like the steeplechase for AI companies looking to train new LLMs.
This will cement the early leaders (OpenAI, Google, Meta, Cohere, and others) at the front of the pack since they can always pull from the data they scraped before the barriers went up. New entrants will have a hard time solving the data access problem before they can build competing models and products. This will likely make it an uneven competition because the incumbents can still basically run the 100-meter dash while their competitors are hurdling over gates and puddles.