Use of CC/openly licensed material in training AI
This document is the workspace for exploring the working group’s position on the topic of “AI/ML Training”.
The use of copyright works as input or to train AI applications should not necessarily be considered copyright infringement as a default. It should be generally allowed under clear and open exceptions and limitations where such use upholds the public interest.
Unfettered access to and use of data to improve and build upon AI encourages innovation and development of AI in support of public-interest activities. It helps reduce bias, enhance inclusion, and promote important activities such as education and research.
That said, other concerns must be taken into account when using material to train AI. For instance, one must consider the tension between the value of open data and legitimate concerns about privacy as well as ethical and responsible use of copyright material (especially openly licensed content) for algorithmic training.
The open access movement demonstrates the advantages of freely and openly accessible resources in spurring innovation, especially in times of crisis. AI innovation is bound to be stimulated by openly accessible materials.
In the context of openly CC-licensed content used to train AI applications, from a copyright perspective, CC has determined that no special or explicit permission regarding new technologies is required from the licensor. Creative Commons’ FAQs¹ clarify how the CC licenses work in the context of openly licensed content that is used to train AI tools.
As a framework, CC licenses do not limit reuse to any particular type of use, so long as the license terms are respected. Also, CC licenses do not override limitations and exceptions.
As concerns AI inputs and the training of AI applications, Creative Commons supports broad and unfettered access and use of copyright works to help reduce bias, enhance inclusion, promote important activities such as education and research, and foster innovation in the development of AI. Assuming access to copyright works is legitimate at the point of input, use of such works to train AI should be considered non-infringing by default. Indeed, such uses are non-expressive and do not compete with the original works in any market. In the context of openly licensed content used to train AI applications, our online FAQs clarify how Creative Commons licenses work: from a copyright perspective, no special or explicit permission regarding new technologies is required from the licensor.
Speaking of copyright-protected data generally — whether released under an open license or not — assuming access to copyright works is legitimate at the point of input, use of data to train AI should be considered non-infringing by default. This use is non-expressive and does not compete with the original works in any market. For example, text-and-data mining for research or education purposes should be allowed under an exception to copyright.
Limitations and exceptions for cross-border collaboration on AI can foster creativity, innovation and the public interest, such as for education and research purposes, and contribute to international development.
Licensing, including collective licensing, is not an appropriate alternative to a system of exceptions and limitations upholding the public interest to enable the use of copyright works as AI input.
When data is used for objectionable or problematic purposes (facial recognition, for instance), privacy and surveillance concerns must be taken seriously.
Nevertheless, copyright is not the most appropriate area of the law to protect individual privacy, to address research ethics in AI development or to regulate the use of surveillance tools employed online.
That said, copyright discussions cannot take place in a policy vacuum. To arrive at a good, comprehensive solution we need to fully capture the implications and concerns raised in different policy arenas.
The conversation on AI policy must be held in a coordinated and inclusive manner and through the lens of ethics, responsibility and sustainability, cultural rights, human rights, personality rights, privacy rights, and data protection. These other issues deserve equal attention and should not be marginalised.
At CC we intend to reflect upon these adjacent issues. As with any fundamental ideal, the “openness” of data is not an absolute end in itself and must be balanced with considerations for privacy and ethics. Whether this balance is found intrinsically within the copyright system or extrinsically in other areas of the law, and whether the CC licenses and tools have any role to play, is something we look forward to probing.