The GigaDB website allows any user to browse, search and view datasets and to access data files. If you want to submit a dataset, save searches or be alerted to new content of interest, we ask that you create an account.
The 'Latest news' section announces updates and new features of the database, and the RSS feed automatically announces each new dataset release.
The GigaDB homepage allows you to browse datasets by type eg Genomic, Metagenomic, Transcriptomic. Clicking on the DOI (digital object identifier) or image will take you directly to the webpage for the dataset of interest.
Alternatively you can use the search functions to find datasets, samples or files of interest.
To search across all Dataset, Sample and File records in GigaDB, simply enter a search term in the search bar found at the top of all GigaDB pages.
The search is case insensitive which means both uppercase and lowercase keywords will have the same result.
For each dataset result, the author names and DOI are displayed. Hovering over the dataset name shows the dataset description. Dataset and sample names link to the DOI page for those data, and file links are provided for download.
For each sample result, the sample name, species name and species ID are displayed with links to the NCBI taxonomy page for the species and to the GigaDB dataset page.
For each file result, the file name, file type and file size are displayed with a direct link to the FTP server location of that file.
Only those objects that have direct matches are displayed in the search results, i.e. the only files displayed will be those that match the search term; all other files within the same dataset will NOT be displayed.
For example, searching for the term “Potato” returns the dataset titled “Genomic data from the potato”, which contains 17 files. However, the search results table displays only 3 of those 17 files, because only 3 contain the search term “potato”. To find all data associated with a dataset, you must follow the link to the dataset page.
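The matching behaviour described above can be sketched in a few lines of Python. This is only an illustration, not the GigaDB implementation: the FileRecord class, the matching_files function and the file names are hypothetical, and the sketch assumes that matching is done on the file name alone.

```python
from dataclasses import dataclass


@dataclass
class FileRecord:
    """Hypothetical stand-in for a file entry in a dataset."""
    name: str
    dataset: str


def matching_files(files, term):
    """Return only the files whose name contains the term, ignoring case."""
    needle = term.casefold()
    return [f for f in files if needle in f.name.casefold()]


# Made-up files belonging to a single dataset.
files = [
    FileRecord("potato_assembly.fa", "Genomic data from the potato"),
    FileRecord("POTATO_genes.gff", "Genomic data from the potato"),
    FileRecord("readme.txt", "Genomic data from the potato"),
]

hits = matching_files(files, "Potato")
print([f.name for f in hits])  # ['potato_assembly.fa', 'POTATO_genes.gff']
```

The readme file belongs to the same dataset but is not listed, which is why the dataset page remains the place to see every file.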
On the left of the search results you can refine the results further by using the filters. By default all filters are disabled, so you see all search results for your keyword. To hide results, choose the filter for the criterion you are interested in and select the options that match what you want to see.
Filter options for Datasets:
Filter options for Samples:
Filter options for Files:
Click the 'Apply Filters' button to see your refined results table.
Because many of the GigaDB datasets are large (several are in the terabyte range), we have installed Aspera to provide a faster and more reliable method for users to download files from the GigaDB FTP server.
In order to use Aspera to download files you first need to install the free AsperaConnect web browser plug-in. For information on setup and use see the documentation on the plug-in site and the Aspera Connect User Guide.
For bulk downloads, we recommend downloading programmatically with the 'ascp' command-line utility, which is delivered with the AsperaConnect product.
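As a rough sketch of a scripted download, the Python helper below assembles an 'ascp' command line without running it. The host name, user name, key file path and remote path are placeholders and not GigaDB values; the options shown (-Q, -T, -l, -P, -i) are commonly used ascp options, but please check the Aspera documentation for the options supported by your installed version.

```python
import shlex


def build_ascp_command(remote_path, dest_dir, user, host, key_file,
                       rate_limit="100m", port=33001):
    """Build an ascp argument list; every value passed in is a placeholder."""
    return [
        "ascp",
        "-Q",               # fair transfer policy
        "-T",               # disable encryption to maximise throughput
        "-l", rate_limit,   # maximum transfer rate
        "-P", str(port),    # port used for session initiation
        "-i", key_file,     # private key file used for authentication
        f"{user}@{host}:{remote_path}",
        dest_dir,
    ]


# Hypothetical values, for illustration only.
command = build_ascp_command(
    remote_path="/path/to/dataset/file.fq",
    dest_dir="./downloads",
    user="username",
    host="ftp.example.org",
    key_file="/path/to/private_key",
)
print(shlex.join(command))
```

Building the command as a list makes it easy to loop over many remote paths and pass each list to subprocess.run once the real connection details are known.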
All sequence, assembly, variation, and microarray data must be deposited in a public database at NCBI, EBI, or DDBJ before you submit them to GigaDB. If you would like GigaDB to host files associated with genomic data that are not fully consented for public release, you must first submit the non-public data to dbGaP or EGA.
The template file contains:
Mandatory fields are highlighted in yellow.
Required information includes submitter name, email and affiliation, upload status [whether we can publish this dataset immediately after review (Publish) or should hold it until publication (HUP)], author list, dataset type(s) (selected from a controlled vocabulary list), dataset title and description, estimated total size of the files that will be submitted and dataset image information.
Optional information includes links to additional resources and related manuscripts, accessions for data in other databases (prefixes are found in the Links tab), and relationship (if any) to a previously published GigaDB dataset (selected from a controlled vocabulary list).
Optional information includes sample attributes (these are automatically populated in GigaDB if an NCBI BioSample ID is provided).
Required information includes a file name or path relative to your home directory and file type (selected from a controlled vocabulary list). A readme file must be provided.
Optional information includes a file description and a sample ID or name.
You can expect a response from the GigaDB team within 5 days to verify the information in your submission and to arrange upload of your files to our FTP site.
If you have any questions, please contact us at firstname.lastname@example.org.
Genomic - includes all genetic and genomic data eg sequence, assemblies, alignments, genotypes, variation and annotation.
Minimal requirements: DNA sequence data eg next-gen raw reads (fastq files) OR assembled DNA sequences (fasta files).
Epigenomic - includes methylation and histone modification data.
Minimal requirements: Details on methylation sites/status eg qmap files OR details on histone modification sites/status.
Metagenomic - includes all genetic and genomic data eg sequence, assemblies, alignments, genotypes, variation and annotation from environmental samples.
Minimal requirements: Environmental DNA sequence data eg next-gen raw reads (fastq files) OR assembled DNA sequences (fasta files).
Proteomic - includes all mass spec data.
Minimal requirements: Peptide/protein data eg mass spec.
Transcriptomic - includes all data relating to mRNA.
Minimal requirements: RNA sequence data eg next-gen raw reads (fastq files) OR transcript statistics eg RNA coverage/depth.
Additional dataset types can be added, upon review, as new submissions are received.
File types and examples of associated file extensions:
Alignments: .bam, .chain, .maf, .net, .sam
Allele frequencies: .frq
Annotation: .gff, .ipr, .kegg, .wego
Coding sequence: .cds, .fa
InDels: .gff, .txt, .vcf
ISA-Tab: see ISA tools
Genome assembly: .agp, .contig, .depth, .fa, .length, .scafseq
Genome sequence: .fastq, .fq
Methylome data: .fa, .qmap, .rpm, .txt
Protein sequence: .fa, .pep
Readme: .pdf, .txt
SNPs: .annotation, .gff, .txt, .vcf
SVs: .gff, .txt, .vcf
Transcriptome data: .depth, .rpkm, .wig
Other: .xls, .pdf, .txt
Additional file types can be added, upon review, as new submissions are received.
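Because several extensions appear under more than one file type, it can help to look up the candidate types for a file before filling in the template. The sketch below copies the list above into a Python dictionary; the function name candidate_types is hypothetical, ISA-Tab is left out because no extension is listed for it, and only the final extension of the file name is considered (so a compressed file such as genome.fa.gz would not match).

```python
from pathlib import PurePath

# File types and example extensions, copied from the list above.
FILE_TYPES = {
    "Alignments": [".bam", ".chain", ".maf", ".net", ".sam"],
    "Allele frequencies": [".frq"],
    "Annotation": [".gff", ".ipr", ".kegg", ".wego"],
    "Coding sequence": [".cds", ".fa"],
    "InDels": [".gff", ".txt", ".vcf"],
    "Genome assembly": [".agp", ".contig", ".depth", ".fa", ".length", ".scafseq"],
    "Genome sequence": [".fastq", ".fq"],
    "Methylome data": [".fa", ".qmap", ".rpm", ".txt"],
    "Protein sequence": [".fa", ".pep"],
    "Readme": [".pdf", ".txt"],
    "SNPs": [".annotation", ".gff", ".txt", ".vcf"],
    "SVs": [".gff", ".txt", ".vcf"],
    "Transcriptome data": [".depth", ".rpkm", ".wig"],
    "Other": [".xls", ".pdf", ".txt"],
}


def candidate_types(filename):
    """Return the file types whose example extensions include this file's extension."""
    ext = PurePath(filename).suffix.lower()
    return sorted(t for t, exts in FILE_TYPES.items() if ext in exts)


print(candidate_types("reads.fq"))   # ['Genome sequence']
print(candidate_types("calls.vcf"))  # ['InDels', 'SNPs', 'SVs']
```

When more than one type is returned, the correct choice depends on the content of the file rather than its extension.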
AGP (.agp) - the Accessioned Golden Path (AGP) file describes the assembly of a larger sequence object from smaller objects:
The large object can be a contig, a scaffold (supercontig), or a chromosome.
See AGP Specification v2.0
BAM (.bam) - the Binary Alignment/Map (BAM) format is the compressed binary version of the Sequence Alignment/Map (SAM) format, a compact and index-able representation of nucleotide sequence alignments.
BIGWIG (.bw) - the BIGWIG format is for storing dense, continuous data (such as GC percent, probability scores, and transcriptome data) that will be displayed in the UCSC Genome Browser as a graph. BIGWIG files are created initially from wiggle (WIG) type files, using the program wigToBigWig.
CHAIN (.chain) - the CHAIN format describes a pairwise alignment that allows gaps in both sequences simultaneously and is used by the UCSC Genome Browser.
CONTIG (.contig) - the CONTIG format is a direct output from the SOAPdenovo assembly program, with header lines such as:
>1 length 32 cvg_0.0_tip_0
>3 length 32 cvg_23.0_tip_0
>5 length 32 cvg_40.0_tip_0
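Header lines of this shape can be split into their fields with a regular expression. The sketch below is based only on the three example headers shown above, so the pattern may not cover every header SOAPdenovo can produce; the function name and the dictionary keys are hypothetical, and the meaning of the 'cvg' and 'tip' values is not defined here.

```python
import re

# Pattern derived from the example headers, e.g. ">3 length 32 cvg_23.0_tip_0".
HEADER_RE = re.compile(r"^>(\d+) length (\d+) cvg_([\d.]+)_tip_(\d+)$")


def parse_contig_header(line):
    """Split one .contig header line into its fields; raise ValueError if it does not match."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised contig header: {line!r}")
    contig_id, length, cvg, tip = match.groups()
    return {
        "id": int(contig_id),
        "length": int(length),
        "cvg": float(cvg),
        "tip": int(tip),
    }


print(parse_contig_header(">3 length 32 cvg_23.0_tip_0"))
# {'id': 3, 'length': 32, 'cvg': 23.0, 'tip': 0}
```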
EXCEL (.xls, .xlsx) - Microsoft Office spreadsheet files
FASTA (.fasta, .fa, .seq, .cds, .pep, .scafseq [SOAPdenovo output file - sequence of each scaffold]) - FASTA is a text-based format for representing either nucleotide sequences or peptide sequences.
FASTQ (.fq, .fastq) - the FASTQ format stores sequences (usually nucleotide sequence) and Phred qualities in a single file.
GFF (.gff) - the General Feature Format (GFF) is used for describing genes and other features of DNA, RNA and protein sequences.
MAF (.maf) - the Multiple Alignment Format (MAF) stores a series of multiple alignments at the DNA level between entire genomes.
NET (.net) - the NET file format is used to describe the axtNet data that underlie the net alignment annotations in the UCSC Genome Browser.
PDF (.pdf) - portable document format
PNG (.png) - portable network graphics
QUAL (.qual) - the QUAL file format stores base quality scores for next-generation sequencing data (similar in format to FASTA).
RPKM (.rpkm) - gene expression levels are calculated as Reads Per Kilobase per Million mapped reads (RPKM), eg a 1 kb transcript with 1000 alignments in a sample of 10 million reads (of which 8 million can be mapped) will have RPKM = 1000/(1 * 8) = 125.
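The same calculation written as a small Python function, as a minimal sketch (the function and parameter names are illustrative only). Note that the denominator uses the number of mapped reads, not the total number of reads in the sample.

```python
def rpkm(read_count, transcript_length_bp, mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    kilobases = transcript_length_bp / 1_000
    millions_mapped = mapped_reads / 1_000_000
    return read_count / (kilobases * millions_mapped)


# The example from the text: 1 kb transcript, 1000 alignments, 8 million mapped reads.
print(rpkm(read_count=1000, transcript_length_bp=1000, mapped_reads=8_000_000))  # 125.0
```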
SAM (.sam) - the Sequence Alignment/Map (SAM) format is a TAB-delimited text format consisting of a header section, which is optional, and an alignment section. Most often it is generated as a human readable version of its sister BAM format, which stores the same data in a compressed, indexed, binary form. Currently, most SAM format data is output from aligners that read FASTQ files and assign the sequences to a position with respect to a known reference genome. In the future, SAM will also be used to archive unaligned sequence data generated directly from sequencing machines.
See The Sequence Alignment/Map format and SAMtools
TAR (.tar) - an archive containing other files
TEXT (.doc, .readme, .text, .txt) - a text file
VCF (.vcf) - the Variant Call Format (VCF) is a text file format for representing eg SNPs, InDels, CNVs, SVs, microsatellites, genotypes.
UNKNOWN - any file format not in this list
XML (.xml) - eXtensible Markup Language
Publish: this dataset is fully consented for immediate release upon GigaDB approval
HUP: this dataset should be Held Until Publication (HUP)
The DOI relationship vocabulary is taken from the DataCite 'relationType' schema property (ID=12.2).
Definition: Description of the relationship of the resource being registered (A) and the related resource (B).
IsSupplementTo: indicates that A is a supplement to B
IsSupplementedBy: indicates that B is a supplement to A
IsNewVersionOf: indicates A is a new edition of B, where the new edition has been modified or updated
IsPreviousVersionOf: indicates A is a previous edition of B
IsPartOf: indicates A is a portion of B; may be used for elements of a series
HasPart: indicates A includes the part B
References: indicates B is used as a source of information for A
IsReferencedBy: indicates A is used as a source of information by B
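The eight relationship terms above form four pairs in which one term describes the link from A to B and the other describes the same link from B to A. The lookup below is a sketch of that pairing as read from the definitions above; the names INVERSE and flip are hypothetical and the DOIs in the example are made up.

```python
# Pairs of relationship terms, as read from the definitions listed above.
_PAIRS = [
    ("IsSupplementTo", "IsSupplementedBy"),
    ("IsNewVersionOf", "IsPreviousVersionOf"),
    ("IsPartOf", "HasPart"),
    ("References", "IsReferencedBy"),
]

# Map every term to its counterpart, in both directions.
INVERSE = {}
for forward, backward in _PAIRS:
    INVERSE[forward] = backward
    INVERSE[backward] = forward


def flip(a, relation, b):
    """Restate 'A relation B' from the point of view of B."""
    return (b, INVERSE[relation], a)


# Made-up DOIs, for illustration only.
print(flip("10.1234/new", "IsNewVersionOf", "10.1234/old"))
# ('10.1234/old', 'IsPreviousVersionOf', '10.1234/new')
```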