Difference between revisions of "Reference Data"

 
(17 intermediate revisions by 5 users not shown)
[[Category:Reference Data]]__NOTOC__
 
UFRC maintains a repository of reference data that can be accessed by all HiPerGator users. The primary purposes of this repository are researcher convenience, efficient use of filesystem space, and cost savings. We are happy to download and build reference datasets and configure applications installed on HiPerGator to automatically make use of the available reference data. Having UFRC host common reference data means that a research group does not have to use their Blue or Orange quota to host redundant copies of common reference data.
 
  
 
* [[BLASTDB|NCBI Blast Databases]]
* [[SnpEff Databases]]

Many other application-specific references are located in <code>/data/reference</code>.
  
 
==Raw Genomic Data==
* [http://gigadb.org/dataset/200001 3k Rice genomes] - /data/reference/genomes/rice3k, fastq format
* [https://www.internationalgenome.org/data/ 1000 Genomes Data Releases] - /data/reference/1000genomes
* [https://www.girinst.org/repbase/ RepBase] - /data/reference/repbase
* [https://bfd.mmseqs.com/ BFD] - /data/reference/bfd
* [https://www.uniprot.org/help/uniref Uniref30] - /data/reference/uniref30
* [https://www.uniprot.org/help/uniref Uniref90] - /data/reference/uniref90
* [https://www.ebi.ac.uk/metagenomics/ MGnify] - /data/reference/mgnify
* [http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/ pdb70] - /data/reference/pdb70
* Various other references - /data/reference/fasta
 
==AI Reference Datasets and Models==
A variety of reference machine learning and AI datasets are located in <code>/data/ai</code>. You may need to use the full path <code>/blue/data/ai</code> for applications to find the files. Browse the [[AI Reference Datasets|catalog of all available AI reference datasets]] to learn more.

* [https://mediatum.ub.tum.de/1474000 SEN12MS] - curated dataset of georeferenced multi-spectral Sentinel-1/2 imagery for deep learning and data fusion.
* [https://github.com/Open-Catalyst-Project/ocp/blob/main/DATASET.md OC20 Structure to Energy and Forces (S2EF)] - /data/reference/ocp/s2ef_train_all
 
==Repbase Sequencing Data==
[https://www.girinst.org/about/repbase.html Repbase] is 'a database of representative repetitive sequences from eukaryotic species. Repbase is being used worldwide as the reference standard for annotating the presence of repetitive DNA in genomic data.'

A university-wide license for Repbase data is currently in effect for the University of Florida. It will expire on March 22nd, 2022.

Repbase data can be [https://www.girinst.org/server/RepBase/index.php downloaded] from UF IP Addresses for use on local computers.

We maintain copies of Repbase releases in sub-directories under /data/reference/repbase on HiPerGator and also configure RepeatMasker and Maker applications to use Repbase data automatically.

==AI Training and Validation Data and Models==
Training and validation datasets and collections can be found in /data/reference/ai/data.
 
Pre-compiled models can be found in /data/reference/ai/models.
 
;Data
 
* [http://sintel.is.tue.mpg.de MPI-Sintel]
 

Latest revision as of 19:55, 2 November 2023

UFRC maintains a repository of reference data that can be accessed by all HiPerGator users. The primary purposes of this repository are researcher convenience, efficient use of filesystem space, and cost savings. We are happy to download and build reference datasets and configure applications installed on HiPerGator to automatically make use of the available reference data. Having UFRC host common reference data means that a research group does not have to use their Blue or Orange quota to host redundant copies of common reference data.

Use https://support.rc.ufl.edu to request the addition of reference data, or to ask for a directory in which you can place reference data for shared use.

The following is not an exhaustive list of the hosted reference data. If an existing reference is missing from the list below, please let us know and we will update the list.

Application-Specific Data

* NCBI Blast Databases
* SnpEff Databases

Many other application-specific references are located in /data/reference.

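Raw Genomic Data

* 3k Rice genomes - /data/reference/genomes/rice3k, fastq format
* 1000 Genomes Data Releases - /data/reference/1000genomes
* RepBase - /data/reference/repbase
* BFD - /data/reference/bfd
* Uniref30 - /data/reference/uniref30
* Uniref90 - /data/reference/uniref90
* MGnify - /data/reference/mgnify
* pdb70 - /data/reference/pdb70
* Various other references - /data/reference/fasta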
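
The raw genomic datasets documented on this page live at fixed paths under /data/reference. As a minimal sketch, a script could check which of them are visible from the current node before a job starts. The paths are copied from this page; the dictionary labels and the function name <code>find_available</code> are illustrative assumptions, not part of any UFRC tool.

```python
from pathlib import Path

# Paths copied from this page; the keys are hypothetical labels chosen for illustration.
REFERENCE_PATHS = {
    "rice3k": "/data/reference/genomes/rice3k",
    "1000genomes": "/data/reference/1000genomes",
    "repbase": "/data/reference/repbase",
    "bfd": "/data/reference/bfd",
    "uniref30": "/data/reference/uniref30",
    "uniref90": "/data/reference/uniref90",
    "mgnify": "/data/reference/mgnify",
    "pdb70": "/data/reference/pdb70",
    "fasta": "/data/reference/fasta",
}


def find_available(paths):
    """Return {label: Path} for every entry whose directory exists on this host."""
    return {label: Path(p) for label, p in paths.items() if Path(p).is_dir()}


if __name__ == "__main__":
    # Print only the datasets that are actually reachable from this node.
    for label, path in sorted(find_available(REFERENCE_PATHS).items()):
        print(f"{label}: {path}")
```

On a machine where none of these directories exist, the script simply prints nothing.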

AI Reference Datasets and Models

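A variety of reference machine learning and AI datasets are located in /data/ai. You may need to use the full path /blue/data/ai for applications to find the files. Browse the catalog of all available AI reference datasets to learn more.

* SEN12MS - curated dataset of georeferenced multi-spectral Sentinel-1/2 imagery for deep learning and data fusion.
* OC20 Structure to Energy and Forces (S2EF) - /data/reference/ocp/s2ef_train_all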
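
Because some applications may only find the files through the full path, a script can pick the dataset root once and pass it on. The sketch below is a hedged example: the function name <code>resolve_data_root</code> and its parameters are hypothetical, and it only assumes that /data/ai and /blue/data/ai refer to the same data, as the paragraph above suggests.

```python
from pathlib import Path


def resolve_data_root(short_path="/data/ai", full_path="/blue/data/ai"):
    """Pick a usable dataset root, trying the full path before the short one."""
    for candidate in (Path(full_path), Path(short_path)):
        if candidate.is_dir():
            # resolve() follows symlinks, so callers receive a canonical path.
            return candidate.resolve()
    raise FileNotFoundError(f"neither {full_path} nor {short_path} is available")
```

The returned path can then be handed to whatever data loader or command-line option the application expects.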

Repbase Sequencing Data

Repbase is 'a database of representative repetitive sequences from eukaryotic species. Repbase is being used worldwide as the reference standard for annotating the presence of repetitive DNA in genomic data.'

A university-wide license for Repbase data is currently in effect for the University of Florida. It will expire on March 22nd, 2022.

Repbase data can be downloaded from UF IP Addresses for use on local computers.

We maintain copies of Repbase releases in sub-directories under /data/reference/repbase on HiPerGator and also configure RepeatMasker and Maker applications to use Repbase data automatically.
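
Since releases are kept in separate sub-directories, it can help to list them before pointing a tool at a specific one. The following is a minimal sketch; the function name <code>list_releases</code> is hypothetical, and no assumption is made about how the release sub-directories are named.

```python
from pathlib import Path


def list_releases(root="/data/reference/repbase"):
    """Return the names of the sub-directories under root, sorted by name."""
    root_path = Path(root)
    if not root_path.is_dir():
        # Not on HiPerGator, or the repository is not mounted on this node.
        return []
    return sorted(entry.name for entry in root_path.iterdir() if entry.is_dir())


if __name__ == "__main__":
    for name in list_releases():
        print(name)
```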