This is the first position paper published by the Working Group on 8 Nov 2021 on the Creative Commons Medium publication.
This article summarises Creative Commons’ (CC) provisional position on matters relating to artificial intelligence (AI) and copyright law. Creative Commons is dedicated to enabling sharing of knowledge and culture in the public interest. In general, we believe that the use of publicly available data on the Internet leads to greater innovation, collaboration, and creativity.
This publication is the result of the efforts of one of four policy working groups established as part of the Creative Commons Copyright Policy platform.
The topic, AI and Copyright, covers many areas related to the use and generation of content by computer algorithms, so our first task was to decide what our areas of focus would be. We decided to divide our work and outputs into five areas: the definition of AI; text and data mining; training of AI and machine learning algorithms; AI generations and creations; and authors collaborating with AI.
The area of focus for this Working Group is of course one that is constantly changing, in terms of both the legislative and, particularly, the technological aspects that impact its work and position. What follows is a summary of the Working Group’s position on each of the topics, with a more detailed overview of the ongoing work and ideas around these topics published on the “CC AI Working Group” site, which the Working Group maintains so that the dialogue on its views and positions has a place for further exploration and development.
Clarity on basic definitions in the AI space is a prerequisite to competent regulation in the copyright arena. AI needs to be properly understood before any copyright implications can be addressed. There is nothing today, or — in our view — in the relatively near future, which reasonably could be construed as Artificial Intelligence, at least in any sense which matters for copyright issues.
“AI” itself is an evolving concept. At present it is an umbrella term that encompasses, for the most part, different types of algorithms. There is considerable confusion around related but distinct concepts that have come to be called “AI”, such as machine learning, natural language processing, predictive models, and neural networks, as well as other algorithms. Essentially, any algorithm capable of producing generative output, classifying data, or making decisions in a way that approximates human capabilities is referred to as “AI”, particularly if this capability is novel. However, that new capability is only “AI” until it becomes normalized as simply “software” or just another algorithm: what is considered “AI” as opposed to “normal software” will continually evolve as technology advances. That means that whatever copyright framework is put into place, if any, must remain flexible and technology-neutral enough to account for and adapt to the moving-target nature of AI.
There is also danger in categorizing all manner of algorithms as “AI” and in adopting rules or measures where these categories are arbitrarily determined.
Any policy or legal intervention in the field of copyright should be based on strong and reliable evidence and conceptual certainty, especially given the fast-paced evolution of “AI” technology.
It is presently not clear what “AI” exactly is and what it is capable of producing. As things stand, the term “AI” is not defined precisely enough to be used in the copyright arena. At the very least, any document dealing with “AI” should provide or refer to a clear and precise definition of the term.
Given this lack of clarity on what AI really means, intervention in copyright law is premature and any policy must be carried out with caution.
Text and data mining (TDM) activities are pivotal in supporting research and innovation and in the training of AI and machine learning systems.
Since TDM activities are non-consumptive and non-expressive uses of works, TDM does not compete with original markets for works, and may indeed enhance them by increasing demand for a wider range of works. As such, TDM should not be made subject to additional authorisations or payments once access is legitimate. Generally, TDM activities should not be considered copyright infringement and should not be restricted by copyright. Our position is that TDM should be allowed and supported pursuant to exceptions and limitations, in particular to enable a proper exercise of the right to research and algorithmic training-related activities.
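To make the non-consumptive, non-expressive character of TDM concrete, here is a minimal sketch in Python, using a hypothetical, illustrative corpus: the mining step reads the texts only to extract aggregate statistics, and its output conveys facts about the works rather than any work’s creative expression.

```python
import re
from collections import Counter

# A hypothetical corpus of lawfully accessed texts (illustrative only).
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "A lazy afternoon with a quick cup of tea.",
]

def mine_term_frequencies(texts):
    """Count word occurrences across a corpus.

    The result is purely statistical (non-expressive): it states facts
    about the works without reproducing their expressive content.
    """
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts

freqs = mine_term_frequencies(corpus)
print(freqs.most_common(3))
```

The point of the sketch is that no reader of the output could reconstruct or consume the underlying works from it, which is why such uses do not substitute for the originals in any market.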
In order to reduce bias, unfairness and exclusion in outputs, we encourage the use of larger and more diverse sets of data. This can be achieved by applying broad and open exceptions and limitations to support the most extensive possible use of copyrighted works as input into machine learning algorithms in order to encourage the elimination and minimisation of bias. This would require both minimising unnecessary barriers to TDM around copyright material that can be freely mined and facilitating uses across borders.
However, a careful balance must be struck between a push to reduce bias, unfairness and exclusion in algorithmic decision-making on the one hand and privacy rights and ethical and human rights considerations on the other.
We also take the position that there should be no digital rights management (DRM) or technological protection measures (TPM) restricting or preventing (otherwise legal) access to the data. There should, of course, be ethical requirements for transparency in the modalities of use of data; however, these should be established outside the boundaries of the copyright system.
With regard to database rights, we see them as a potential harm to the development of AI, especially given the exclusion of data and other mere facts from copyright protection under international law.
The use of copyright works as input or to train AI and Machine Learning (ML) applications should not necessarily be considered copyright infringement as a default. It should be generally allowed under clear and open exceptions and limitations where such use upholds the public interest.
Unfettered access to and use of data to improve and build upon AI encourages innovation and the development of algorithms in support of public-interest activities. It helps reduce bias, enhance inclusion, and promote important activities such as education and research.
That said, other concerns must be taken into account when using material to train AI / ML. For instance, one must consider the tension between the value of open data and legitimate concerns about privacy as well as ethical and responsible use of copyright material (especially openly licensed content) for algorithmic training.
In the context of openly CC-licensed content used to train AI applications, from a copyright perspective, CC has determined that no special or explicit permission regarding new technologies is required from the licensor. Creative Commons’ FAQs clarify how the CC licenses work in the context of openly licensed content that is used to train AI tools.
As a framework, CC licenses do not restrict any particular type of reuse, so long as the license terms are respected. Also, CC licenses do not override limitations and exceptions.
Speaking of copyright-protected data generally — whether released under an open license or not — assuming access to copyright works is legitimate at the point of input, use of data to train AI should be considered non-infringing by default. This use is non-expressive and does not compete with the original works in any market.
When data is used for objectionable or problematic purposes (facial recognition for instance), privacy and surveillance concerns must be taken seriously. Nevertheless, we do not believe that copyright is the most appropriate area of the law to protect individual privacy, to address research ethics in AI development or to regulate the use of surveillance tools employed online.
That said, copyright discussions cannot take place in a policy vacuum. To arrive at a good, comprehensive solution we need to fully capture the implications and concerns raised in different policy arenas.
The conversation on AI policy must be held in a coordinated and inclusive manner and through the lens of ethics, responsibility and sustainability, cultural rights, human rights, personality rights, privacy rights, and data protection. These other issues deserve equal attention and should not be marginalised.
At CC we intend to reflect upon these adjacent issues. As with any fundamental ideal, the “openness” of data is not an absolute end in itself and must be balanced with considerations for privacy and ethics. Whether this balance is found intrinsically within the copyright system or extrinsically in other areas of the law, and whether the CC licenses and tools have any role to play, is something we will continue to explore.
Copyright and related rights are unwarranted for AI-generated outputs as AI is currently understood, for two fundamental reasons: lack of a human author and lack of originality.
First, the notion of human authorship is a bedrock principle of copyright. Direct human involvement should remain a precondition to determining whether a work is worthy of protection and whether copyright can be claimed. While the conceptualisation of AI is still in flux, the technical nature of human inputs, combined with the mechanistic nature of AI algorithms and the absence of any personality rights recognised for AI, currently provides very little ground to justify any copyright protection for AI outputs.
Second, in most cases, AI algorithms use automated, mathematical means to encode statistical information about a set of inputs, such as copyright works. “AI” uses this statistical information, combined with some random seed, to generate output which is statistically similar to or indistinguishable from an arbitrary member of the set of input works. Algorithmically generated (AI) output is composed of snippets chosen arbitrarily from thousands or millions of input works and produced by a mathematical, stochastic function. Thus, AI output should be presumed to lack originality. In short, algorithmically generated (AI) outputs should be in the public domain, at least pending a clearer understanding of this evolving technology and clarity on what specific criteria a computer system should meet, should such systems ever be considered an author with rights.
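The “statistical information plus a random seed” point can be illustrated with a toy sketch, assuming a simple word-level Markov chain standing in for far more complex real systems: the model merely records which word follows which across the input works, and the output is drawn stochastically from those statistics rather than authored.

```python
import random
from collections import defaultdict

# Hypothetical input "works" (illustrative only).
inputs = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]

def build_model(texts):
    """Encode statistical information: which word follows which."""
    follows = defaultdict(list)
    for text in texts:
        words = text.split()
        for a, b in zip(words, words[1:]):
            follows[a].append(b)
    return follows

def generate(model, seed, start="the", length=6):
    """Produce output from the encoded statistics plus a random seed."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        options = model.get(out[-1])
        if not options:
            break
        out.append(rng.choice(options))
    return " ".join(out)

model = build_model(inputs)
print(generate(model, seed=42))
```

Given the same statistics and the same seed, the function deterministically emits the same string; every “choice” is a draw from frequencies observed in the inputs, which is the sense in which such output is stochastic rather than original.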
This is an area of technology and the law where clarity is still being sought. We are clear on our position, as detailed in the previous topic, that machine-generated outputs should be in the public domain. However, when it comes to this topic there is the additional consideration of the relative contributions of the algorithms and the human authors to a particular work. There aren’t yet many clear examples of this kind of collaborative output, although there are nascent technologies such as Decentralised Autonomous Organisations (DAOs) that are impacting the development of AI and content creation. For instance, with projects such as Botto, members of a DAO can band together and decide which AI generative art has enough meaning to be worth selling as an NFT, with each member contributing creativity and originality by voting on various criteria. Moreover, all the members of a DAO could be accomplished artists in their own right, who perceive AI as another medium of expression and DAOs as a way to pursue it with shared knowledge and skill in a craft.
We are still determining how generative art NFTs interact with copyright law. If these global NFT markets have enough demand, CC could help clarify that the copyright side of an NFT’s intellectual property should be treated as a public good, regardless of how original and creative the minting process may be.
One of the most interesting aspects of the explorations that this Working Group is undertaking is that, on the one hand, the outputs of what is considered AI can be very consequential in terms of their impact on the public interest, ethics, and privacy. Yet on the other hand, as it was phrased in some of our meetings, “there is no ‘there’ there” in terms of a clear definition of what would constitute an “AI” that can be granted any rights under the law.
The positions and recommendations mentioned here are just a summary of the details that have gone into exploring and considering the topics of the Working Group. We will continue to build on the work done so far in line with the developments in the social, technical, and legal aspects of AI and copyright. We invite you to explore and contribute to our work on the CC AI Working Group site and join the conversation on our channel in the Creative Commons Slack.
This policy position paper is the product of one of the four global working groups established in 2021 by the Creative Commons Copyright Platform, a global network of copyright advocates and practitioners, engaging with an emerging set of challenges affecting the open ecosystem.
Agnes Malatinszky, CommonLit
Ana Lazarova, Creative Commons Bulgaria
Andres Izquierdo, Program on Information Justice and Intellectual Property, American University — Washington College of Law
Ariadna Matas, Europeana
Beatriz Busaniche, CC Argentina
Benjamin White, Bournemouth University
Brigitte Vézina, CC — Director of Policy
Danièle Bourcier, CC France
Deborah De Angelis, CC Italy
Diane Peters, PIJIP
Francesco Vogelezang, Open Future
Franco Giandana, CC Argentina
Max Mahmoud Wardeh, CC UK
Maxwell Beganim, CC Ghana
Paul Keller, Open Future
Janel Thamkul, Google
Jonathan Poritz, CC USA
Kyle Smith, SeedTree
Laura Acion, Argentina
Rajeeb Dutta, India
Sarah Pearson, CC — General Counsel
Sean Flynn, PIJIP
Shanna Hollich, Creative Commons USA