Public / Shared Data available on Apocrita
To prevent duplication of data and save research time, we provide local copies of some widely used public datasets.
QMUL staff can contact us to request corrections, updates or the addition of new datasets to this repository.
Datasets available
| Name and Location on Apocrita | Description |
|---|---|
| BLAST databases /data/PublicDataSets/shared_dbs | Standard set of databases for BLAST (Basic Local Alignment Search Tool) |
| CADD /data/PublicDataSets/genomes/Homo_sapiens/CADD | Combined Annotation Dependent Depletion (CADD) is a tool for scoring the deleteriousness of single nucleotide variants as well as insertion/deletion variants in the human genome |
| CDD /data/PublicDataSets/CDD | The Conserved Domain Database is a resource for the annotation of functional units in proteins |
| GATK Bundle /data/PublicDataSets/GATKbundle | Standard files for working with human resequencing data with the GATK |
| Galaxy hg datasets /data/PublicDataSets/galaxy | Reference genomes for use with Galaxy |
| Illumina Genomes /data/PublicDataSets/genomes | Ready-to-use reference sequences and annotations |
| ImageNet 2012 /data/PublicDataSets/ImageNet-2012/ | Annotated image database for machine learning, 2012 version |
| ImageNet 2021 /data/PublicDataSets/ImageNet-2021/ | Annotated image database for machine learning, 2021 version. Full and resized images |
| MAESTRO /data/PublicDataSets/MAESTRO | MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organisation) |
| MusicNet /data/PublicDataSets/musicnet | A curated collection of labelled classical music in raw format |
| NCBI WGS /data/PublicDataSets/shared_dbs/wgs | Whole Genome Shotgun projects are genome assemblies of incomplete genomes |
| NR Protein sequences /data/PublicDataSets/shared_dbs/nr | Non-redundant protein sequences from GenPept, Swissprot, PIR, PRF, PDB, and NCBI RefSeq |
| Prot_RefSeq /data/PublicDataSets/shared_dbs/prot_refseq | Protein data for a subset of commonly used model organisms, downloaded from NCBI |
| Slakh /data/PublicDataSets/slakh2100/ | A dataset of multi-track audio and aligned MIDI for music source separation and multi-instrument automatic transcription |
| UniRef50 /data/PublicDataSets/shared_dbs/uniref50 | The UniProt Reference Clusters (UniRef) provide clustered sets of sequences from the UniProt knowledgebase |
| UniProt /data/PublicDataSets/shared_dbs/uniprot | Database of protein sequence and functional information |
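
Because these datasets are already on the cluster, a job can read them in place instead of copying them into personal storage. The following is a minimal sketch of that idea in Python. The helper name `dataset_path` and the constant `PUBLIC_ROOT` are hypothetical and are not part of any Apocrita tooling; the only assumption taken from this page is that the datasets live under `/data/PublicDataSets`.

```python
from pathlib import Path

# Root of the shared repository, taken from the locations listed on this page.
PUBLIC_ROOT = Path("/data/PublicDataSets")


def dataset_path(relative, root=PUBLIC_ROOT):
    """Return the path of a shared dataset directory under `root`.

    Hypothetical helper for illustration: it raises FileNotFoundError
    early if the directory is missing, so a job fails before doing any work.
    """
    path = Path(root) / relative
    if not path.is_dir():
        raise FileNotFoundError(f"Shared dataset not found: {path}")
    return path
```

For example, `dataset_path("shared_dbs/nr")` would resolve to the NR protein sequence directory listed above when run on the cluster. The `root` argument exists only so the helper can be exercised against a different directory, for instance when testing off the cluster.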