Tutorial

This tutorial shows you how to setup Varda with the Aulë web interface and Manwë command line client, and how to import and query an example dataset.

The example dataset is taken from the Varda unit tests and is limited to the first 200,000 bases of human chromosome 20 (GRCh37/hg19).

Setting up Varda

Follow the installation instructions to install Varda. Configure Varda to use hg19.fa in the tests/data directory as reference genome and enable cross-origin resource sharing (CORS) (this allows Aulë to communicate with Varda). The Varda configuration file may look something like this:

DATA_DIR = 'data'
SQLALCHEMY_DATABASE_URI = 'sqlite:///varda.db'
BROKER_URL = 'redis://'
CELERY_RESULT_BACKEND = 'redis://'
GENOME = 'tests/data/hg19.fa'
CORS_ALLOW_ORIGIN = '*'

Remember to point the VARDA_SETTINGS environment variable to the configuration file before continuing.

See also

Configuration
More information on available configuration settings.

Start Varda and a Celery worker node as described in Running Varda:

$ varda debugserver

and:

$ celery worker -A varda.worker.celery -l info

Opening http://127.0.0.1:5000/genome in your browser should now show you a JSON representation of the reference genome configuration.

Setting up Aulë

Get the source code for Aulë, configure it to use MyGene.info with GRCh37/hg19, and run it:

$ git clone https://github.com/varda/aule.git
$ cd aule
$ nano config.js
AULE_CONFIG = {
  BASE: '/',
  API_ROOT: 'http://127.0.0.1:5000/',
  PAGE_SIZE: 50,
  MANY_PAGES: 13,
  MY_GENE_INFO: {
    species: 'human',
    exons_field: 'exons_hg19'
  }
  MY_GENE_INFO: null
};
$ npm install
$ npm run dev

You can now open http://localhost:8000/ in your browser, which should show you the Aulë homepage. Login with admin and the password you choose during Varda setup.

Setting up Manwë

Manwë authenticates with the Varda API using a token. You can generate a token in the Aulë web interface by choosing API tokens in the menu and clicking Generate API token. Copy the token by clicking Show token.

Install Manwë and create a configuration file with the token you just created:

$ pip install manwe
$ nano manwe.cfg
API_ROOT = 'http://127.0.0.1:5000'
TOKEN = 'c7fa8780025c8efa5077567434e0fcb56274fbb0'

Verify that everything is setup correctly by listing all Varda users:

$ manwe users list -c manwe.cfg
User:   /users/1
Name:   Admin User
Login:  admin
Roles:  admin

Note

Instead of including -c manwe.cfg in every invocation, you can also copy this file to ~/.config/manwe/config (config should be the name of the file) where Manwë will pick it up automatically.

Importing exome sequencing data

Let’s import an example set of variant calls from an exome sequencing experiment. The file tests/data/exome.vcf contains some variant calls on chromosome 20 for one individual and tests/data/exome.vcf contains regions on chromosome 20 where the sequencing was deep enough (or of high enough quality) to do variant calling:

$ cat tests/data/exome.vcf
##fileformat=VCFv4.1
##samtoolsVersion=0.1.16 (r963:234)
...
#CHROM  POS     ID  REF    ALT  QUAL  FILTER  INFO  FORMAT    -
chr20   76962   .   T      C    173   .       ...   GT:PL:GQ  0/1:203,0,221:99
chr20   126159  .   ACAAA  A    217   .       ...   GT:PL:GQ  0/1:255,0,255:99
chr20   126313  .   CCC    C    126   .       ...   GT:PL:GQ  0/1:164,250,0:99
...
$ cat tests/data/exome.bed
chr206811268631
chr207658177410
chr209002590400
...

Note

For any real data you import, it is best to always include both the variant calls in VCF format and a BED file of regions to include. This makes it possible for Varda to calculate accurate variant frequencies, also on regions that are not covered by some experiments.

Import the data as follows:

$ manwe samples import --vcf tests/data/exome.vcf --bed tests/data/exome.bed \
>     -l -w 'Exome sample'
Added sample: /samples/1
Added data source: /data_sources/1
Started variation import: /variations/1
Added data source: /data_sources/2
Started coverage import: /coverages/1
[################################] 100/100 - 00:00:02
Imported variations and coverages for sample: /samples/1

Note

The -l argument instructs Varda to use the PL column instead of the GT column to derive the genotypes. Use it when variant calling was done with Samtools.

Since Varda supports importing data for a sample in multiple steps, new samples are inactive by default to prevent using them in frequency calculations until everything is complete. Activate the sample you just imported with:

$ manwe samples activate /samples/1
Activated sample: /samples/1

If you go back to the Aulë web interface and choose Samples in the menu, you should see the exome sample you just imported.

Importing aggregate data from 1000 Genomes

Sometimes it makes sense to calculate variant frequencies within a dataset separately, as opposed to global frequencies over all datasets. An example might be a large public population study such as the 1000 Genomes project. Varda allows you to import a dataset like this without providing coverage data (i.e., the BED file).

The tests/data/1kg.vcf file contains a subset of variant calls from the 1000 Genomes project over 1092 individuals. Import it as follows:

$ manwe samples import --vcf ../varda/tests/data/1kg.vcf -s 1092 -p \
>     --no-coverage-profile -w '1000 Genomes'
Added sample: /samples/2
Added data source: /data_sources/3
Started variation import: /variations/2
[################################] 100/100 - 00:00:02
Imported variations and coverages for sample: /samples/2
$ manwe samples activate /samples/2
Activated sample: /samples/2

Note

Samples imported without coverage profile are automatically excluded from global variant frequency calculations. Instead, they may be queried separately.

Querying variant frequencies

Aulë allows for some ad-hoc querying of variant frequencies globally and per sample, as well as by variant, by region and by transcript region. Choose By region in the menu and set:

Query:
Global query
Chromosome:
chr20
Region begin:
1
Region end:
200000

This should show you the variants from the exome sequencing example, all with frequency 1.0 and N=1 (since it’s the only sample used in the calculation).

You can run the same query on the 1000 Genomes data by setting:

Query:
Sample query (1000 Genomes)

As an alternative to setting the region manually, you can also choose By transcript in the menu and select a region based on a gene transcript. The exome example has two variants in the DEFB126 gene. You can select it by clicking on Choose a transcript and typing DEFB126.

Annotating variants

The ad-hoc frequency queries with Aulë are nice for one-time lookups, but you would presumably also want to automate this on a larger scale. Manwë allows you to annotate local VCF or BED files with variant frequencies by supplying a list of queries:

$ manwe annotate-vcf -q GLOBAL '*' -q 1KG 'sample:/samples/2' -w \
>     tests/data/exome.vcf
Added data source: /data_sources/4
Started annotation: /annotations/1
[################################] 100/100 - 00:00:02
Annotated VCF file: /data_sources/5
$ manwe data-sources download /data_sources/5 > exome.annotated.vcf.gz

The resulting VCF file is annotated with several fields in the INFO column.