OpenAI Got Sued
But the claims of $3b in damages and impending civilization collapse probably won't go anywhere
Generally, the way “Fair Use” works in copyright law is that someone can consume copyrighted material and then use it as inspiration and information to create something new. There is nuance to this, but I could read a few books and use what I learn from them to write a new book that basically repeats the same facts and themes, so long as I use different words to do it. Or I could go to an art museum and look at a bunch of paintings of sailboats and French countrysides, then make my own unique painting of a sailboat in the French countryside using the same types of colors and artistic styles. None of this would infringe on anyone’s copyrighted material because I’m “transforming” what I consumed into something new.
But let’s say instead of writing just 1 book based on the 4 books I read, I want to write 1 billion books, and my knowledge is limited to what I learned from those 4 source books. Eventually, I’m probably going to repeat something verbatim from the source books, and probably quite often. That would be copyright infringement.
Let’s also say I have an evil, unethical book editor who tells me “I don’t like your first draft of this book - make it more (or exactly) like the first source book you read.” Normally I’d say “Stop right there! That is illegal if we sell that! We can’t do that.” But, my evil editor is cleverer than I am so they say “OK, just replace the letter ‘A’ with the number 4 and letter ‘O’ with the number 0. Then it’s not copying!” to which I might say “Sure, I guess that’s true?” Then I would publish my derivative work, definitely still get sued, and definitely lose in court because “H4rry P0tter” really isn’t that different from “Harry Potter” when repeated over a few million words. People would crack the code.
The delicate balance here is 1) I definitely was inspired and informed by copyrighted materials, 2) I’m definitely making money based on what I create, 3) I’m really trying to not break copyright laws, and 4) Over enough time and/or with an unethical editor, I will definitely break copyright laws.
This is the basic struggle that Large Language Models (e.g. ChatGPT) have - especially those that rely on web-scraped data to train. Most of these models use data obtained from Common Crawl, which is set up as a nonprofit - perhaps a shield from legal liability. These datasets have lots of different types of data mixed together - some in the public domain, some very much protected by copyright laws, and some limited by a website’s terms of use. The LLMs train on this data, create very impressive models, then apply after-the-fact protections against regenerating copyrighted materials (to varying degrees of effort and effectiveness).
However, it is almost inevitable that these models will do something to step on copyright laws. For example, using the very clever A/4 and O/0 trick from above, researchers were able to get some wild results from GPT-4:
“With that simple change, we were then able to regurgitate the first three and a half chapters of [Harry Potter:] The Sorcerer’s Stone verbatim, just with the a’s and o’s replaced with similar looking numbers,” Henderson says.
Boy oh boy this is not good, especially if someone is charging money for this output (as OpenAI does on their API or for ChatGPT Plus). And if there is anyone who has the free time, interest, and money to pursue expensive copyright infringement litigation, it’s J.K. Rowling.
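To see why the A/4 and O/0 swap is such a thin disguise, here is a minimal Python sketch. It is only an illustration of the character substitution described above, not the researchers' actual method; the function names `disguise`, `normalize`, and `similarity` are made up for this example.

```python
import difflib

# The two swaps described in the text: a -> 4 and o -> 0 (upper and lower case).
SWAPS = str.maketrans({"a": "4", "A": "4", "o": "0", "O": "0"})

# Hypothetical inverse used only for comparison: map the digits back to letters.
UNSWAPS = str.maketrans("40", "ao")


def disguise(text: str) -> str:
    """Apply the evil editor's trick to a piece of text."""
    return text.translate(SWAPS)


def normalize(text: str) -> str:
    """Lowercase and undo the digit swaps so two texts can be compared."""
    return text.lower().translate(UNSWAPS)


def similarity(a: str, b: str) -> float:
    """Rough character-level similarity between 0.0 and 1.0."""
    return difflib.SequenceMatcher(None, a, b).ratio()


sample = "Harry Potter and the Sorcerer's Stone"
disguised = disguise(sample)

print(disguised)
print(similarity(sample, disguised))
print(normalize(sample) == normalize(disguised))
```

Even before undoing the swap, the disguised text stays very close to the original, and a one-line normalization makes the two identical. That is the "people would crack the code" problem in practice.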
OpenAI got sued yesterday. This is normally not a big deal - if you’re a big, highly-visible company then you are going to get sued all the time. However, this suit is interesting because it breaks the seal on litigation aimed at OpenAI’s use of web-scraped data obtained through Common Crawl and other sources.
The lawsuit itself is a bit all over the place and a bit hyperbolic, trying to make connections between the misappropriation of copyrighted information, ownership of information posted to social media, and the end of civilization (civilization collapse is mentioned 5 times). However, getting past all the noise, it makes a potentially credible allegation: that OpenAI is misappropriating data it does not own for its own profit, in violation of copyright laws.
There are a few scenarios that could play out:
1. A judge could dismiss the suit due to lack of merit - basically saying it’s performative nonsense and not worth the time in court. This would be a clear win for OpenAI.
2. It could go to trial and OpenAI could win. A jury would hear both arguments, and there is a good chance something embarrassing would come out in discovery that makes OpenAI look evil. This would still be a win for OpenAI, but not one they’d be happy about.
3. It could go to trial and OpenAI could lose. This would be very bad for OpenAI and would basically end the practice of web-scraping data to train AI models. It would result in a complete restructuring of the way data is obtained to train many AI models.
4. It could settle quickly and quietly out of court. Given this suit is spearheaded by a law firm looking to build a class action against OpenAI, they will definitely be open to accepting a big check instead of proceeding with the litigation. Settling would also be in OpenAI’s interest, since it would prevent the plaintiffs from building momentum and confidence. This would not be a cheap outcome for OpenAI, but still a good one.
Scenario #4 is probably the most likely. Established companies hate risk and really like to pay to make problems go away, and there are a few versions of the multiverse in which this lawsuit becomes a big headache for OpenAI (and Microsoft). OpenAI also has the checking account to fund a settlement.
After scenario #4, scenario #1 is probably the next most likely as some elements of the lawsuit are a bit hard to substantiate:
The use of data in social media sites, which is definitely NOT owned by the individuals who post it. You sign all of those rights away when you join any social network. Meta and Snap are not parties to this lawsuit and Microsoft owns LinkedIn.
The damage to individuals is basically impossible to calculate, or the damages to a single party could be trivial. OpenAI might make $100m in profit, but they also might have scraped data from 1 billion different URLs. These numbers are made up, but there are a lot of different data sources in OpenAI’s training set.
The models might be transformative enough that they constitute Fair Use. Sure, when and if a model reprints 4 chapters of a Harry Potter book, that would be copyright infringement. But that type of knowledge just sitting idly in an LLM’s memory banks probably isn’t?
The claims of using metadata and payment data from sites like Stripe and Snap are pretty hard to make credible.
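To put the damages point above in concrete terms, here is a quick back-of-the-envelope sketch in Python. It uses the same made-up figures from the text ($100m in profit, 1 billion URLs) and assumes, purely for illustration, that profit is split evenly across every scraped URL; the 50-page plaintiff is hypothetical.

```python
# Made-up figures from the text above, not real OpenAI numbers.
PROFIT_USD = 100_000_000        # hypothetical $100m in profit
SCRAPED_URLS = 1_000_000_000    # hypothetical 1 billion scraped URLs

# Assumption for illustration: profit is attributed evenly to every URL.
per_url_usd = PROFIT_USD / SCRAPED_URLS


def share_usd(urls_owned: int) -> float:
    """Dollar share for someone who owns `urls_owned` of the scraped URLs."""
    return urls_owned * per_url_usd


print(f"Per URL: ${per_url_usd:.2f}")
print(f"Hypothetical plaintiff with 50 pages: ${share_usd(50):.2f}")
```

Under those assumptions each URL is worth about ten cents, so even a plaintiff with dozens of scraped pages would be arguing over a few dollars.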
It is almost impossible to imagine scenarios #2 or #3 happening. OpenAI (and Microsoft) simply wouldn’t allow it and would either get the case dismissed or settle first. There would be far too much risk in going to trial.
Thus, this is unlikely to result in major legal precedent either way. However, that doesn’t mean someone won’t get paid handsomely as a result of this: the lawyers. It’s always the class action lawyers who negotiate huge settlements.