Upload & source file format
The genome processing component of GET-Evidence interprets GFF-formatted files that report differences between a genome and the reference. Requirements, assumptions, and the types of data interpreted are described here.
Accepted upload file formats
You may upload a file in any of the following formats:
- Complete Genomics var file (build 36 or 37)
- 23andme SNP data (assumed to be build 36)
- VCF format (assumes a single individual, so far only used for build 37 23andme exome data)
- GFF format used by GET-Evidence, build 36 or 37 (described below)
These files may be compressed as gzip (.gz extension) or bzip2 (.bz2 extension).
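As a minimal sketch of how a script on your side might read such files before upload, the helper below picks a reader based on the file extension. The function name `open_genome_file` is hypothetical and is not part of GET-Evidence; it only assumes the extension conventions listed above.

```python
import bz2
import gzip


def open_genome_file(path):
    """Open a plain, gzip, or bzip2 file in text mode, chosen by extension."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    # Anything else is assumed to be uncompressed plain text.
    return open(path, "r")
```

Choosing by extension mirrors the naming rules above; a more defensive reader could inspect the file's leading bytes instead.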
GET-Evidence’s GFF format
We use a variant of GFF files for internal genome processing and output. Files may be uploaded as plain text (with a .gff extension) or compressed with gzip (.gff.gz extension) or bzip2 (.gff.bz2 extension). If you click the “download” option at the top of a genome report you will see an example of input data we have used. We make a lot of assumptions about input data, so please read our descriptions below to be sure your data is processed properly.
A header line in the file should specify genome build in the following manner:
If unspecified, the processing currently assumes build 36. Either build 36 or build 37 may be used.
Columns must be tab separated (“\t” character) and should contain the following data:
- Chromosome (e.g. “chr1”, “chr12”, “chrX”, “chrM”. Must be hg18 / build 36.)
- Source (ignored)
- Type (e.g. “SNP”, “REF”, “SUB”, “INDEL”. Only “REF” matters: our variant processing skips these rows. Other values are ignored.)
- Start (1-based)
- End (1-based)
- Score (ignored, “.” may be used to leave the field empty)
- Strand (ignored, we assume to be “+”)
- Frame (ignored, “.” may be used to leave the field empty)
- Attributes: semicolon separated features, described further below
The variant and reference sequences for each position are contained within the final “Attributes” column. Attributes in this column should be separated by semicolons. Within each attribute, data is separated by whitespace, and the first value is taken to be the variable name. For example:
chr14 CGI SNP 93914700 93914700 . + . alleles C/T;amino_acid SERPINA1 E366K;db_xref dbsnp:rs28929474;ref_allele C
The “Attributes” column in this row is taken to have the following variables and values:
- alleles = “C/T”
- amino_acid = “SERPINA1 E366K”
- db_xref = “dbsnp:rs28929474”
- ref_allele = “C”
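The column and attribute rules above can be sketched as a small parser. This is an illustration under the stated rules, not the code GET-Evidence itself runs; `parse_gff_row` and the dictionary keys are hypothetical names.

```python
def parse_gff_row(line):
    """Split one tab-separated GFF row into named columns and an attribute dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    chrom, source, kind, start, end, score, strand, frame, attr_text = fields

    attributes = {}
    for item in attr_text.split(";"):
        parts = item.strip().split(None, 1)
        if not parts or parts[0] == ".":
            continue  # empty attribute column, as in "REF" rows
        # The first whitespace-delimited token is the variable name;
        # everything after it is the value.
        attributes[parts[0]] = parts[1] if len(parts) > 1 else ""

    return {
        "chromosome": chrom,
        "type": kind,
        "start": int(start),
        "end": int(end),
        "attributes": attributes,
    }


row = "\t".join([
    "chr14", "CGI", "SNP", "93914700", "93914700", ".", "+", ".",
    "alleles C/T;amino_acid SERPINA1 E366K;db_xref dbsnp:rs28929474;ref_allele C",
])
parsed = parse_gff_row(row)
print(parsed["attributes"]["amino_acid"])  # SERPINA1 E366K
```

Note that the row is built with real tab characters between columns; the spaces inside the attributes column separate names from values only.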
The “alleles” data is the only data needed in an uploaded file. How to use this variable to describe genome variants follows.
Single Nucleotide Substitutions
Single nucleotide substitutions should start and end at the same position. The substituted allele(s) must be described using the “alleles” variable name. Heterozygous alleles should be separated by a slash (e.g. “C/G”) while homozygous or hemizygous alleles can be reported without a slash. You may also describe the reference allele using the “ref_allele” variable name.
chr1 CGI SNP 31844 31844 . + . alleles G
chr1 CGI SNP 43069 43069 . + . alleles C/G
chr1 CGI SNP 45027 45027 . + . alleles A
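A short sketch of reading the “alleles” value for rows like these follows. The helper names are hypothetical, and treating a repeated value such as “C/C” as not heterozygous is an assumption made here rather than something the format description states.

```python
def split_alleles(alleles):
    """Return the alleles listed in an "alleles" value such as "C/G" or "A"."""
    return alleles.split("/")


def is_heterozygous(alleles):
    """True when the value lists two different alleles separated by a slash."""
    # Assumption: "C/C" would count as a single distinct allele.
    return len(set(split_alleles(alleles))) > 1


print(is_heterozygous("C/G"))  # True
print(is_heterozygous("G"))    # False
```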
Multiple Nucleotide Substitutions
These are described similarly to single nucleotide substitutions. Start and end positions should be different and should match the length of sequences given in the alleles variable.
chr2 CGI SUB 101087873 101087876 . + . alleles CACA/GGTG
chr2 CGI SUB 101210966 101210967 . + . alleles AC/CA
chr2 CGI SUB 101351061 101351062 . + . alleles AG
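The length rule for substitutions can be checked before upload. The sketch below assumes 1-based inclusive coordinates, as stated in the column list; `alleles_match_span` is a hypothetical helper name.

```python
def alleles_match_span(start, end, alleles):
    """Check that every allele is as long as the reference span start..end."""
    span_length = end - start + 1  # 1-based, inclusive coordinates
    return all(len(allele) == span_length for allele in alleles.split("/"))


print(alleles_match_span(101087873, 101087876, "CACA/GGTG"))  # True
print(alleles_match_span(101351061, 101351062, "A"))          # False
```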
Deletions
Start and end positions refer to the reference allele position. The “empty value” that replaces the reference at that position is written as “-”.
chr3 CGI INDEL 494450 494450 . + . alleles -
chr3 CGI INDEL 502274 502275 . + . alleles -/TT
chr3 CGI INDEL 507887 507887 . + . alleles -/A
Insertions
To specify the position of insertions in a unique manner (distinct from single nucleotide substitution positions), we use an “end” value one base before the “start” value. This violates the GFF specification, which requires end positions to always be equal to or after start positions, but we choose to do this to keep the interpretation of positions consistent. The insertion occurs between the end and start positions.
chr4 CGI INDEL 821159 821158 . + . alleles C
chr4 CGI INDEL 824865 824864 . + . alleles ACTT/-
chr4 CGI INDEL 871712 871711 . + . alleles CA/-
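The insertion convention can be sketched as follows: a row is an insertion when the end is exactly one less than the start, and “-” stands for an allele with nothing inserted. The function name `inserted_sequences` is hypothetical.

```python
def inserted_sequences(start, end, alleles):
    """Return the inserted bases for each allele of an insertion row."""
    if end != start - 1:
        raise ValueError("not an insertion row: end must equal start - 1")
    # "-" is the empty value, meaning this allele carries no insertion.
    return ["" if allele == "-" else allele for allele in alleles.split("/")]


print(inserted_sequences(824865, 824864, "ACTT/-"))  # ['ACTT', '']
```

With these coordinates the reference span has length zero (end - start + 1 == 0), which is what distinguishes an insertion from a single nucleotide substitution at the same position.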
Other length changing alleles
As in previous examples, these should have positions which describe the reference sequence positions that are replaced by the variant allele(s).
chr5 CGI INDEL 2237775 2237777 . + . alleles A/CTT
chr5 CGI INDEL 2336687 2336688 . + . alleles GTAGGA
chr5 CGI INDEL 2339000 2339000 . + . alleles AAA/A
Note: this last row may look like it should have been an “insertion”, but in this case the reference allele at this position is “C”. Both alleles called here are non-reference.
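One way to reason about rows like these is to compare each allele's length with the length of the reference span it replaces. The sketch below does only that; it does not look up the reference sequence, so it cannot tell whether an allele of unchanged length is reference or not. `length_changes` is a hypothetical helper name.

```python
def length_changes(start, end, alleles):
    """Return allele length minus reference span length, per allele."""
    ref_length = end - start + 1  # zero for insertion rows (end == start - 1)
    return [
        (0 if allele == "-" else len(allele)) - ref_length
        for allele in alleles.split("/")
    ]


print(length_changes(2237775, 2237777, "A/CTT"))  # [-2, 0]
print(length_changes(2339000, 2339000, "AAA/A"))  # [2, 0]
```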
Coverage information (positions which match reference)
We include regions which have been sequenced and match the reference genome in the GFF source data we make available. These are marked by the value “REF” in the third column (“Type”) and our processing system currently ignores these rows when analyzing genomes.
chr6 CGI REF 736528 736790 . + . .
chr6 CGI REF 736794 737031 . + . .
chr6 CGI SNP 737032 737032 . + . alleles C/T;ref_allele C
chr6 CGI REF 737033 737283 . + . .
In this example there is sequencing coverage for chr6:736528-736790 and chr6:736794-737283, with a heterozygous non-reference call made at position chr6:737032. The three-base region chr6:736791-736793 is missing: no sequencing call was made there, so it is not covered.
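The coverage reading of this example can be reproduced with a short interval merge. This is a sketch for rows on a single chromosome, with hypothetical function names; skipping insertion rows (whose end precedes their start) is an assumption made here, since they span no reference bases.

```python
def covered_intervals(spans):
    """Merge (start, end) spans from one chromosome into covered intervals."""
    merged = []
    for start, end in sorted(spans):
        if end < start:
            continue  # assumption: insertion rows add no coverage
        if merged and start <= merged[-1][1] + 1:
            # Overlapping or directly adjacent: extend the last interval.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def uncovered_gaps(intervals):
    """Return the gaps between consecutive covered intervals."""
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(intervals, intervals[1:])
    ]


spans = [(736528, 736790), (736794, 737031), (737032, 737032), (737033, 737283)]
covered = covered_intervals(spans)
print(covered)                  # [(736528, 736790), (736794, 737283)]
print(uncovered_gaps(covered))  # [(736791, 736793)]
```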
Other attributes data
We attach other attribute data to the GFF files during processing. If you include this data in an uploaded file it will, for the most part, be ignored and replaced. The exception is dbSNP data: you may already have dbSNP IDs attached to variant calls, and this data may be more thorough than the calls we attempt to make. If you upload data containing dbSNP IDs, please make sure it matches the format we describe here.
Example of pre-processed data:
chr7 CGI SNP 82419795 82419795 . + . alleles C/T
chr7 CGI SNP 82420782 82420782 . + . alleles C/T
chr7 CGI SNP 82423791 82423791 . + . alleles G/T
The same positions after processing:
chr7 CGI SNP 82419795 82419795 . + . alleles C/T;amino_acid PCLO A2804T;db_xref dbsnp:rs976714;ref_allele C
chr7 CGI SNP 82420782 82420782 . + . alleles C/T;amino_acid PCLO V2475I;db_xref dbsnp:rs10954696;ref_allele C
chr7 CGI SNP 82423791 82423791 . + . alleles G/T;amino_acid PCLO Q1472K;ref_allele G
ref_allele (ignored and overwritten during processing)
The reference allele. It should match the length inferred from start and end positions.
db_xref (we pay attention to this!)
We use this to report dbSNP IDs. Each dbSNP ID should begin with “dbsnp”, followed by optional other information, then a colon (“:”), then “rs” and the number. When there are multiple dbSNP IDs associated with the position they should be comma separated.
An example of some dbSNP IDs in uploaded data:
chr8 CGI SNP 145222820 145222820 . + . alleles G;db_xref dbsnp.116:rs7820984
chr8 CGI INDEL 145223654 145223681 . + . alleles C/GGCAGTGGGCATGTGGAATACTTCTCCA;db_xref dbsnp.130:rs67708571,dbsnp.130:rs73717807
chr8 CGI SNP 145225126 145225126 . + . alleles A/G
Post-processing it looks like this:
chr8 CGI SNP 145222820 145222820 . + . alleles G;amino_acid CYC1 M76V;db_xref dbsnp.116:rs7820984;ref_allele A
chr8 CGI INDEL 145223654 145223681 . + . alleles C/GGCAGTGGGCATGTGGAATACTTCTCCA;db_xref dbsnp.130:rs67708571,dbsnp.130:rs73717807;ref_allele GGCAGTGGGCATGTGGAATACTTCTCCA
chr8 CGI SNP 145225126 145225126 . + . alleles A/G;db_xref dbsnp:rs13254954;ref_allele A
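A db_xref value in this format can be checked with a regular expression before upload. The pattern below is a sketch of the rule as described (a “dbsnp” prefix, optional extra text, a colon, then “rs” and digits); it is not the parser GET-Evidence uses, and `dbsnp_ids` is a hypothetical name.

```python
import re

# "dbsnp", optional other information, a colon, then "rs" and the number.
DBSNP_ENTRY = re.compile(r"^dbsnp[^:]*:(rs\d+)$")


def dbsnp_ids(db_xref):
    """Extract the rs IDs from a comma-separated db_xref value."""
    ids = []
    for entry in db_xref.split(","):
        match = DBSNP_ENTRY.match(entry.strip())
        if match:
            ids.append(match.group(1))
    return ids


print(dbsnp_ids("dbsnp.130:rs67708571,dbsnp.130:rs73717807"))
# ['rs67708571', 'rs73717807']
```

Entries that do not match the pattern are silently skipped in this sketch; a stricter validator might report them instead.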
For any variants occurring within coding sequence, we check whether they change the predicted amino acid sequence. If so, we report this in the “amino_acid” variable. Please see our guide to amino acid calls for information on the nomenclature we have chosen.