AI Reference Datasets


UFRC maintains a repository of reference AI datasets that can be accessed by all HiPerGator users. The primary purposes of this repository are researcher convenience, efficient use of filesystem space, and cost savings. Research groups do not have to use their Blue or Orange quota to host their own copies of these reference datasets.

Please note that although these datasets are all freely available, data use licenses and restrictions vary among them. If you use these datasets, it is your responsibility to ensure that your use of the data complies with applicable licenses and any other use restrictions.

You can request the addition of a reference dataset. All reference datasets hosted on HiPerGator must comply with Research Computing's AI reference dataset hosting policy.

The catalog below lists dataset locations under /data/ai. If a software application cannot find a file at that location, you may need to use the full path, which begins with /blue/data/ai.
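As a minimal sketch of the note above, the helper below rewrites a catalog location into its full-path form. It assumes that everything under /data/ai appears under /blue/data/ai with the same subdirectory layout; the function name `to_full_path` is hypothetical and not part of any HiPerGator tooling.

```python
from pathlib import PurePosixPath

# Short prefix used in the catalog and the full prefix mentioned above.
# Assumption: both prefixes share the same subdirectory layout.
SHORT_ROOT = PurePosixPath("/data/ai")
FULL_ROOT = PurePosixPath("/blue/data/ai")


def to_full_path(catalog_path):
    """Rewrite a /data/ai/... location to /blue/data/ai/... (hypothetical helper)."""
    path = PurePosixPath(catalog_path)
    try:
        relative = path.relative_to(SHORT_ROOT)
    except ValueError:
        # Not under /data/ai (for example, already a full path): return unchanged.
        return str(path)
    return str(FULL_ROOT / relative)


print(to_full_path("/data/ai/audio/FSD50K"))  # /blue/data/ai/audio/FSD50K
```

Paths that do not start with /data/ai are returned unchanged, so the helper can be applied to locations that are already in full-path form.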

Catalog of available datasets

| Name | Categories | Location on HiPerGator | Dataset size (approximate) | Version | License | Date added | Description |
|---|---|---|---|---|---|---|---|
| Free Spoken Digit Dataset (FSDD) | Audio | /data/ai/audio/free-spoken-digit-dataset-1.0.10 | 20.4 MiB | v1.0.10 | Creative Commons Attribution-ShareAlike 4.0 International | March 11, 2021 | A simple audio/speech dataset consisting of recordings of spoken digits in wav files at 8 kHz. The recordings are trimmed so that they have near-minimal silence at the beginnings and ends. |
| Freesound Dataset 50k (FSD50K) | Audio | /data/ai/audio/FSD50K | 32.2 GiB | 1.0 (10.5281/zenodo.4060432) | Mixed Creative Commons licenses | March 12, 2021 | FSD50K is an open dataset of human-labeled sound events containing 51,197 Freesound clips unequally distributed in 200 classes drawn from the AudioSet Ontology. |
| LibriSpeech ASR corpus | Audio | /data/ai/audio/LibriSpeech | 59.4 GiB | SLR12 | Creative Commons Attribution 4.0 International | March 11, 2021 | LibriSpeech is a corpus of approximately 1,000 hours of 16 kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read audiobooks from the LibriVox project and has been carefully segmented and aligned. |
| CIFAR-10 | Computer vision | /data/ai/image/cifar-10-batches-py | 177.6 MiB | | Not reported | March 12, 2021 | The CIFAR-10 dataset consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images. |
| alphafold | Model | /data/ai/proteinfolding/alphafold | 8.7 GiB | v2.0.0 | Apache License 2.0 | July 2021 | Predicts protein structures. If you publish research using alphafold, the original paper must be cited. |
| RoseTTAFold | Model | /data/ai/proteinfolding/rosettafold | 1.0 GiB | v1.0.0 | MIT License | July 2021 | Predicts protein structures. If you publish research using RoseTTAFold, the original paper must be cited. |
| DeepAtomDB_v2018-MD | Molecular Dynamics Trajectories | /data/ai/proteinbinding | 3.3 TiB | 0.1 | Creative Commons Attribution-ShareAlike 4.0 International | March 2022 | MD trajectories for drug-protein complexes extracted from the PDBbind, Binding MOAD, and Astex databases. |
| COCO | Object Recognition | /data/ai/image/COCO | 47.2 GiB | 2017 | CC-BY 4.0 | April 19, 2021 | Common Objects in Context. |
| PASCAL | Object Recognition | /data/ai/image/PASCAL | 3.5 GiB | VOC2012 | | April 19, 2021 | Object datasets from the VOC challenges. The main goal of this challenge is to recognize objects from a number of visual object classes in realistic scenes (i.e. not pre-segmented objects). It is fundamentally a supervised learning problem in that a training set of labelled images is provided. |
| Reading Comprehension Dataset | Text | /data/ai/text/benchmarks/RACE | 65.5 MiB | 2017 | Non-commercial Use | March 2021 | 28,000 passages with nearly 100,000 questions written for English exams for Chinese students at the middle and high school level. |
| SuperGLUE | Text | /data/ai/text/benchmarks/superglue | 176.8 MiB | January 2021 | The primary SuperGLUE tasks are built on and derived from existing datasets. We refer users to the original licenses accompanying each dataset, but it is our understanding that these licenses allow for their use and redistribution in a research context. | January 2021 | SuperGLUE is a new version of the GLUE benchmarks for NLP. |
| Wikipedia | Text | /data/ai/text/data/wikipedia | 30.9 GiB | January 2021 | Creative Commons Attribution-ShareAlike 4.0 International | January 2021 | Wikipedia articles as downloaded in January 2021, cleaned using the wikiextractor Python library. |
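
As an illustration of reading one of the cataloged datasets in place, the sketch below loads a single CIFAR-10 batch from the location listed above. It assumes the directory follows the standard CIFAR-10 Python distribution (pickled batch files such as data_batch_1 with `data` and `labels` entries) and that NumPy is installed, since the original batches store pixel data as NumPy arrays. The function name `load_cifar_batch` is hypothetical.

```python
import pickle
from pathlib import Path

# Catalog location of CIFAR-10. The file name "data_batch_1" follows the
# standard CIFAR-10 Python distribution and is an assumption about this copy.
CIFAR10_DIR = Path("/data/ai/image/cifar-10-batches-py")


def load_cifar_batch(batch_file):
    """Return (pixel_rows, labels) from one CIFAR-10 Python batch file."""
    with open(batch_file, "rb") as handle:
        # The batches were pickled under Python 2, so dictionary keys load as bytes.
        batch = pickle.load(handle, encoding="bytes")
    return batch[b"data"], batch[b"labels"]


if __name__ == "__main__":
    batch_path = CIFAR10_DIR / "data_batch_1"
    if batch_path.exists():
        data, labels = load_cifar_batch(batch_path)
        print(f"{len(labels)} images, {len(data[0])} values per image")
    else:
        print(f"{batch_path} not found; check the dataset location on your system")
```

Reading the shared copy directly, rather than copying it into a group directory, is what keeps the dataset from counting against a Blue or Orange quota.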