As more organizations embrace artificial intelligence (AI) technologies and big data, there is a growing need to share data sets and collaborate on them for analysis and AI training. But just as you should not release valuable software to the public without choosing an appropriate open source software license, you should not release data sets without a proper license written specifically for sharing data.
Today, the Linux Foundation AI announced the release of the CDLA-Permissive-2.0 license agreement, developed to make it easier than ever for governments, academic institutions, businesses, and other organizations to share, access, and protect open source data sets.
What does this release offer?
Like version 1.0, the version 2.0 agreement maintains clear rights to use, share, and modify the data, as well as to use without restriction any results that are generated through data analysis. This release offers:
- Plain language to express the grant of permissions and requirements
- One page of information to easily read and understand
- Simplified experience for adopting the license
- Support from IBM and other industry leaders
To recap, it is beautifully short (just 364 words long!) and tuned to data sets for AI use cases. A new license for a new era of data and AI.
IBM is using the license in our data sets
At IBM, we are excited about how this new license will enable better sharing of data sets that can be used in AI and machine learning work. We believe that the ease of using the new CDLA v2 license will benefit the AI and data science community. Today, we are announcing that one of the first IBM data sets to carry the CDLA-Permissive-2.0 license is the Project CodeNet data set.
Project CodeNet is a large data set that is made available via IBM’s Data Asset eXchange (DAX). Its purpose is to train AI models to understand and write code. The data set consists of some 14M code samples, about 500M lines of code, in 55+ different programming languages. We believe Project CodeNet can serve as a benchmark data set for source-to-source translation, and we aim for it to do for AI and code what the ImageNet data set did years ago for computer vision.
Why do we need a new license for data sharing?
Although open source software has a number of widely accepted licenses that have helped it thrive, these same licenses can’t be applied to the way data is shared. Similarly, licenses that govern sharing data for creative content don’t usually account for AI and machine learning use cases.
The laws and regulations that govern data sharing have different requirements. The types of data, the location where it is stored or accessed, and the way it’s consumed in AI or machine learning models all have different governance standards. Commonly used licenses for software and creative content might not apply in the intended ways for open data.
The CDLA-Permissive license was created to address concerns related to AI and ML models generated from open data. Because of our leadership in AI and experience writing open source licenses, IBM was involved in creating version 1 of the CDLA license.
What’s the difference between v1 and v2?
In 2017, IBM approached the Linux Foundation with early thinking around licenses for data sets, drawing on our experience with AI and open source. After IBM collaborated with the Linux Foundation and others, CDLA v1 was released. Feedback about the first version of the license suggested that it was overly complex for non-lawyers to use. To address these concerns, in 2019, Microsoft launched the Open Use of Data Agreement (O-UDA-1.0) to provide a more concise and simplified set of terms around the sharing and use of data for similar purposes. Microsoft contributed this license to the Linux Foundation in order to bring the two licenses together.
CDLA-Permissive v2 was developed to take the best from the original CDLA-Permissive v1 release and bring in the simplicity of the O-UDA-1.0, offering a streamlined license that most readers can use and understand.
Use the new license with your data sets
If you’re involved with AI or machine learning, be sure to check out the license and use it with your own data sets to speed collaboration and innovation in AI.
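One lightweight way to signal the license on a data set is to record its SPDX short identifier, CDLA-Permissive-2.0, in a metadata file that ships with the data. The sketch below is only an illustration: the field names and the helper functions are hypothetical and not a schema required by the license or by any particular data catalog, and the license text itself remains the authority on what must accompany shared data.

```python
import json

# SPDX short identifier for the license, as listed in the SPDX License List.
CDLA_PERMISSIVE_2_0 = "CDLA-Permissive-2.0"


def build_dataset_metadata(name: str, description: str) -> dict:
    """Return a metadata record that declares the data set's license.

    The keys used here ("name", "description", "license") are hypothetical
    and chosen only for illustration.
    """
    return {
        "name": name,
        "description": description,
        "license": CDLA_PERMISSIVE_2_0,
    }


def render_metadata(metadata: dict) -> str:
    """Serialize the metadata record as human-readable JSON text."""
    return json.dumps(metadata, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Example values are placeholders for your own data set.
    record = build_dataset_metadata(
        name="my-example-data-set",
        description="Placeholder description of the data set.",
    )
    print(render_metadata(record))
```

Saving the printed JSON next to the data, together with a copy of the license text, gives people and tools a consistent place to look for the terms.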