AI reference dataset hosting policy

1. Data licensing considerations

We will only host datasets with licenses that permit us to host them and that, at a minimum, allow for academic/non-commercial use of the data. For example, any of the standard Creative Commons licenses meet these requirements. Note that licenses that meet these criteria can still restrict certain uses of the data, such as commercial use, and can also impose additional data use requirements, such as requiring citation if the dataset is used for a publication. It is the responsibility of each reference dataset user to understand and abide by the dataset license(s).

2. Ethical considerations

Legal considerations (e.g., licensing) are only one part of responsible data use. Serious ethical concerns have recently been raised about some of the large-scale datasets used in AI research. Problems include failure to obtain informed consent, perpetuation of harmful stereotypes and biases, invasion of privacy, and voyeuristic or even illegal content (e.g., Prabhu and Birhane 2020). In light of these concerns, and to promote ethical data curation and use practices, we will only host a dataset that includes personally identifiable information (including, but not limited to, images of faces or recordings of voices) if the dataset meets the following three criteria:

  • Informed consent / privacy: Personally identifiable information, such as images of faces or audio recordings of voices, was obtained in such a way that the participants a) were aware that the data were being collected; and b) agreed to contribute the data for public use.
  • Curation: The dataset has a clear, well-documented curation process.
  • No known problems: There are no known problems with the dataset that would make it unethical to host on HiPerGator or to use the data for research.

3. Additional data use restrictions

Some datasets have additional data use guidelines or restrictions that are separate from the legal data use license(s). For example, the Mozilla Common Voice dataset has a very permissive Public Domain license (CC0), but also requires anyone who downloads the dataset to agree "to not attempt to determine the identity of speakers in the Common Voice dataset". In these cases, we will include an additional file with the dataset called "DATA_USE_REQUIREMENTS.txt" that includes any additional data use guidelines or restrictions. Again, it is the responsibility of each reference dataset user to understand and abide by the dataset license(s) and use requirements.
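
As an illustration only, a user could check a dataset directory for this file before starting work. The sketch below assumes the file sits at the top level of the dataset directory; the function name and the example path are hypothetical and are not part of any hosted tooling.

```python
from pathlib import Path
from typing import Optional

# File name taken from the policy text above.
REQUIREMENTS_FILENAME = "DATA_USE_REQUIREMENTS.txt"


def read_data_use_requirements(dataset_dir: str) -> Optional[str]:
    """Return the text of the additional data use requirements.

    Returns None when the dataset directory has no such file.
    Assumption: the file is stored at the top level of the dataset directory.
    """
    path = Path(dataset_dir) / REQUIREMENTS_FILENAME
    if not path.is_file():
        return None
    return path.read_text(encoding="utf-8")


if __name__ == "__main__":
    # "/path/to/dataset" is a placeholder, not a real hosting location.
    text = read_data_use_requirements("/path/to/dataset")
    if text is None:
        print("No additional data use requirements file found.")
    else:
        print(text)
```

A missing file only means that no additional requirements were recorded in this form; the dataset license(s) still apply.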

4. Reporting problems

If you believe that a dataset we host violates one or more of the guidelines described above, please let us know so that we can investigate and, if necessary, remove the dataset. Problems can be reported by opening a support ticket at