AI Reference Datasets
UFRC maintains a repository of reference AI datasets that can be accessed by all HiPerGator users. The primary purposes of this repository are researcher convenience, efficient use of filesystem space, and cost savings. Research groups do not have to use their Blue or Orange quota to host their own copies of these reference datasets.
Use https://support.rc.ufl.edu to request the addition of a reference dataset.
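Because the datasets are shared, a project can reference them in place rather than copying them into Blue or Orange storage. One way to do this, shown here as a minimal sketch and not an official UFRC recipe, is to create a symbolic link from a project directory to the shared location. The helper name `link_reference_dataset` and the `my_project/data` directory are hypothetical; the shared path is taken from the catalog on this page.

```python
import os
from pathlib import Path


def link_reference_dataset(shared_path, project_dir, link_name=None):
    """Create a symlink in project_dir that points at shared_path; return the link path."""
    shared = Path(shared_path)
    project = Path(project_dir)
    project.mkdir(parents=True, exist_ok=True)
    link = project / (link_name or shared.name)
    if link.is_symlink() or link.exists():
        # Leave an existing link (or a real directory of the same name) untouched.
        return link
    os.symlink(shared, link, target_is_directory=True)
    return link


if __name__ == "__main__":
    # Shared location from the catalog; "my_project/data" is a hypothetical directory.
    shared = "/data/ai/audio/LibriSpeech"
    if Path(shared).exists():
        print(link_reference_dataset(shared, "my_project/data"))
    else:
        print("Shared path not visible from this machine:", shared)
```

The link takes up no meaningful quota, and code in the project can then open `my_project/data/LibriSpeech` as if the data were local. Passing the shared path directly to a data loader works just as well.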
Catalog of available datasets
Name | Categories | Location on HiPerGator | Dataset size (approximate) | Version | License | Date added | Description
---|---|---|---|---|---|---|---
Free Spoken Digit Dataset (FSDD) | Audio | /data/ai/audio/free-spoken-digit-dataset-1.0.10 | 20.4 MiB | v1.0.10 | Creative Commons Attribution-ShareAlike 4.0 International | March 11, 2021 | A simple audio/speech dataset consisting of recordings of spoken digits in WAV files at 8 kHz. The recordings are trimmed so that they have near-minimal silence at the beginning and end.
Freesound Dataset 50k (FSD50K) | Audio | /data/ai/audio/FSD50K | 32.2 GiB | 1.0 (DOI 10.5281/zenodo.4060432) | Mixed Creative Commons licenses | March 12, 2021 | FSD50K is an open dataset of human-labeled sound events containing 51,197 Freesound clips unequally distributed in 200 classes drawn from the AudioSet Ontology.
LibriSpeech ASR corpus | Audio | /data/ai/audio/LibriSpeech | 59.4 GiB | SLR12 | Creative Commons Attribution 4.0 International | March 11, 2021 | LibriSpeech is a corpus of approximately 1,000 hours of 16 kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read audiobooks from the LibriVox project and has been carefully segmented and aligned.
CIFAR-10 | Computer vision | /data/ai/ref-data/image/cifar-10-batches-py | 177.6 MiB | Not reported | | March 12, 2021 | The CIFAR-10 dataset consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images.
CIFAR-10 | Computer vision | /data/ai/image/cifar-10-batches-py | 177.6 MiB | Not reported | | March 12, 2021 | The CIFAR-10 dataset consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images.
AlphaFold | Model | /data/ai/proteinfolding/alphafold | 8.7 GiB | v2.0.0 | Apache License 2.0 | July 2021 | Predicts protein structures. If you publish research using AlphaFold, the original paper must be cited.
RoseTTAFold | Model | /data/ai/proteinfolding/rosettafold | 1.0 GiB | v1.0.0 | MIT License | July 2021 | Predicts protein structures. If you publish research using RoseTTAFold, the original paper must be cited: https://www.biorxiv.org/content/10.1101/2021.06.14.448402v1
DeepAtomDB_v2018-MD | Molecular Dynamics Trajectories | /data/ai/proteinbinding | 3.3 TiB | 0.1 | Creative Commons Attribution-ShareAlike 4.0 International | March 2022 | MD trajectories for drug-protein complexes extracted from the PDBbind, Binding MOAD, and Astex databases.
COCO | Object Recognition | /data/ai/ref-data/image/COCO | 47.2 GiB | 2017 | CC-BY 4.0 | April 19, 2021 | Common Objects in Context
COCO | Object Recognition | /data/ai/image/COCO | 47.2 GiB | 2017 | CC-BY 4.0 | April 19, 2021 | Common Objects in Context
PASCAL | Object Recognition | /data/ai/ref-data/image/PASCAL | 3.5 GiB | VOC2012 | | April 19, 2021 | Object datasets from the VOC challenges. The main goal of this challenge is to recognize objects from a number of visual object classes in realistic scenes (i.e. not pre-segmented objects). It is fundamentally a supervised learning problem in that a training set of labelled images is provided.
PASCAL | Object Recognition | /data/ai/image/PASCAL | 3.5 GiB | VOC2012 | | April 19, 2021 | Object datasets from the VOC challenges. The main goal of this challenge is to recognize objects from a number of visual object classes in realistic scenes (i.e. not pre-segmented objects). It is fundamentally a supervised learning problem in that a training set of labelled images is provided.
Babelnet-v5 | Text | /data/ai/ref-data/nlp/babelnet | 5.2 KiB | May 2022 | https://babelnet.org/license (researchers who agree to the license will be granted access to the BabelNet databases) | May 2022 | BabelNet is an NLP dictionary of words and their meanings.
Reading Comprehension Dataset | Text | /data/ai/benchmarks/nlp/RACE | 65.5 MiB | 2017 | Non-commercial use | March 2021 | 28,000 passages with nearly 100,000 questions written for English exams for Chinese students at the middle and high school level. See https://aclanthology.org/D17-1082/ for more detail.
SuperGLUE | Text | /data/ai/benchmarks/nlp/superglue | 176.8 MiB | January 2021 | The primary SuperGLUE tasks are built on and derived from existing datasets. We refer users to the original licenses accompanying each dataset, but it is our understanding that these licenses allow for their use and redistribution in a research context. | January 2021 | SuperGLUE is a newer version of the GLUE benchmark for NLP.
Wikipedia | Text | /data/ai/ref-data/nlp/wikipedia | 30.9 GiB | January 2021 | Creative Commons Attribution-ShareAlike 4.0 International | January 2021 | Wikipedia articles as downloaded in January 2021 from https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2, cleaned using the wikiextractor Python library.
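The catalog locations can be read directly from jobs running on HiPerGator. As an illustration, the sketch below reads one batch from the CIFAR-10 entry. It assumes the directory follows the standard CIFAR-10 "python version" layout (pickled files named `data_batch_1` through `data_batch_5` and `test_batch`), which this page does not confirm. The function name `load_cifar_batch` is hypothetical.

```python
import pickle
from pathlib import Path

# Location from the catalog above. The file names inside it are assumed to follow
# the standard CIFAR-10 "python version" layout (data_batch_1 ... test_batch).
CIFAR10_DIR = Path("/data/ai/ref-data/image/cifar-10-batches-py")


def load_cifar_batch(batch_file):
    """Read one pickled CIFAR-10 batch and return (data, labels)."""
    with open(batch_file, "rb") as f:
        # The batches were pickled under Python 2, so the dict keys are bytes.
        batch = pickle.load(f, encoding="bytes")
    # In the real files, data is a NumPy uint8 array with one 3072-value row per
    # image (NumPy must be installed to unpickle it); labels is a list of ints.
    return batch[b"data"], batch[b"labels"]


if __name__ == "__main__":
    batch_path = CIFAR10_DIR / "data_batch_1"
    if batch_path.exists():
        data, labels = load_cifar_batch(batch_path)
        print(len(labels), "images in", batch_path.name)
    else:
        print("Not found (are you on HiPerGator?):", batch_path)
```

Reading the files in place like this avoids copying the dataset into a group's own quota. Only unpickle files from locations you trust, since `pickle` can execute arbitrary code on load.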