Generative AI Has a Copyright Problem
Generative AI models might violate copyright law by their very nature. So what can we expect from the future?
The basic way copyright law works in the US is this: if I create something creative, it’s mine, and you can’t use it without my permission. “Use” covers a lot of ground, including the right to “prepare derivative works based on the work”. So if I created the fictional character “The Data Science Dandy” in this newsletter, and he became wildly popular in memes I publish, you couldn’t go make a series of cartoons featuring The Dandy. This is meant to incentivize creators and protect them from having their hard work stolen.
But what if, instead of calling him “The Data Science Dandy”, you call him “The Data Science Dude”, and instead of wearing a tri-corner hat like The Dandy, The Dude wears a backwards baseball cap? Is that still a derivative work? I’m not an IP lawyer, but I understand this is a perennially hard question in courts of law.
The basic way generative AI models work is that they crawl the internet and copy massive amounts of data, then feed that data into the models’ training processes. The algorithms identify patterns, trends, and themes in the data and learn to replicate them in response to prompts or commands. So if I ask an image-generating model to produce an image of “a brownish-orange two-dimensional cartoon bear eating honey from a honey pot with a jolly belly, red shirt, no pants, and black button eyes”, it will probably draw me Winnie the Pooh. People playing with these tools have produced near-perfect replicas of pre-existing images.
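As a toy illustration of how pattern-learning can shade into regurgitation, here is a tiny character-level Markov chain. This is my own sketch, nothing like a production diffusion or transformer model, but the failure mode is the same in miniature: trained on a small corpus, it can only recombine spans of its training data, so its output is near-verbatim copying.

```python
import random
from collections import defaultdict

def train(text, order=4):
    """Build a character-level Markov model: map each context to the chars that follow it."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        context = text[i:i + order]
        model[context].append(text[i + order])
    return model

def generate(model, seed, length=60):
    """Sample one character at a time, conditioned on the last `order` characters."""
    order = len(seed)
    out = seed
    for _ in range(length):
        choices = model.get(out[-order:])
        if not choices:  # context never seen in training; nothing to say
            break
        out += random.choice(choices)
    return out

# Tiny "training set" -- the only data this model has ever seen.
corpus = "a jolly bear in a red shirt eating honey from a honey pot"
model = train(corpus)

# With so little data, every 5-character window of the output is lifted
# straight from the corpus: generation is stitched-together memorization.
print(generate(model, "a jo"))
```

Real models train on billions of examples and generalize far better, but the copyright question is precisely about the cases where they behave like this sketch and reproduce their inputs.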
This way of using data is intrinsic to how almost all ML models work, and it is why AI has accelerated as a field of study over the last ~2 decades. Earlier, 20th-century visions of AI were more rule-based: extremely long and tedious if-then logic trees. Advances in technology, data creation and storage, and computing power added “Data Science” and “Machine Learning” to the AI toolkit and enabled models trained on massive datasets. If you take away the underlying training data from these models, then you just have… nothing? A Python notebook of a theoretical model? A fun academic exercise, analogous to calculating the probability that unicorns exist?
The impending conflict here is that much of the data hoovered up to train generative AI models is also protected by copyright law: news articles, photographs, short stories, comic books, etc. So, if a model is trained on copyrighted material, are the works that model creates derivative works? This is still largely an open question, being litigated in many venues, with some embarrassing moments for the model makers along the way.
The outcome of this legal issue is really, really important! Big corporations are spending billions on new Generative AI technology. VCs are still pouring tons of money into Generative AI companies. Almost every startup founded in 2023 seems to be related to Generative AI. People in all types of professions are freaking out about whether AI is coming for their jobs.
The outcome could have seismic consequences for many, many industries, professions, companies, and investments. Entire business models could be upended, or mere uncertainty about coming change could chill investment and innovation in the space. It could also be very positive: it might create new revenue streams for copyright holders, much as the Napster era ultimately reshaped the economic model for music publishers.
This is not a legal newsletter and I’m not an expert on copyright law, so I’m not qualified to opine on which direction the pending litigation will go. What we can talk about here is the possible scenarios and their impacts.
Scenario 1: Nothing changes and we have an intellectual-property free-for-all. If courts generally rule against copyright-infringement plaintiffs and legislators don’t enact any meaningful new laws, then we enter this Wild West scenario. Companies will be free to hoover up all of the world’s data to train their models and generate outputs free of concern or limitation. This is bad for people worried about IP theft but very, very good for people focused on innovation and investment.
Scenario 2: Regulated and limited data use. Courts rule in favor of plaintiffs, and legislators enact laws that create guardrails for data use in model training and protections for works created by models. This would slow innovation by creating a significant barrier to data acquisition, and it would likely shut out many smaller entrants who couldn’t afford the legal red tape and data-acquisition costs.
Scenario 3: Nope. Courts rule in favor of plaintiffs, and legislation effectively shuts down the acquisition and use of copyright-protected data for training these models. This is the doomsday scenario for the technology industry, as it would require a completely new approach to building these models, one that would almost certainly be much harder and more expensive.
Again, I’m not a legal expert and I don’t know how courts or legislatures will react. This is a potent cocktail of capitalism, technology, inequality, and even free speech that could go in a lot of different directions in the public discourse.
However, if I were a betting man, I’d say we land somewhere between Scenarios 1 and 2. Why? Because of this:
[Chart: top lobbying spenders in the US, 2022. Source: Statista]
There are only two companies on this list of top lobbying spenders for 2022, and both have significant vested interests in letting these models live on and the investment continue. There is also nobody on this list representing the collective of copyright holders who are most aggrieved by these copyright issues.
So where do we go from here? Probably nowhere soon. Litigation and legislation take years. In the meantime, it’s hard to see the taps being closed on investment (ignoring other market factors). Further, the longer companies have to get familiar with these models and figure out how to monetize them, the more entrenched the models will become and the harder they will be to unwind. That entrenchment raises the cost of dramatic legislative changes that would put more guardrails on these models, and thus invites even heavier spending on lobbying.