Todo list

These are some general todo notes. More specific notes can be found by grepping the source code for Todo.

Stricter validation of user input, especially file uploads (maximum file size and contents).
Implement caching control headers.
Implement HEAD requests.
Better organised and more comprehensive test suite.
Throttling.
Better rights/roles model.
Support input in BCF2 format.
Have a look at supporting the gVCF format.
Possibility to contact submitter of an observation.
Have a maintenance and/or read-only mode, probably with HTTP redirects.
Store phasing info, for example by numbering each allele (uniquely within a sample) and store the allele number with observations.
Support bigBed format.
What to do for variants where we have more observations than coverage? We could have a check in sample activation, but would we really like to enforce this?
Fallback modes to accommodate browsing the API with a standard web browser, e.g., query string alternative to pagination with Accept-Range headers. Perhaps this can be optional and implemented by patching the Request object before it reaches the API code.
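One way to patch the request before it reaches the API code is a small WSGI middleware. The following is only a sketch: the `range` query string parameter, the `items` range unit, and the assumption that the API paginates on the `Range` request header are all illustrative and not taken from the current code.

```python
from urllib.parse import parse_qs


class RangeFromQueryString:
    """WSGI middleware that copies a query string parameter into the
    Range request header, so pagination works from a plain web browser."""

    def __init__(self, app, parameter="range", unit="items"):
        self.app = app
        self.parameter = parameter  # hypothetical query parameter name
        self.unit = unit            # hypothetical range unit

    def __call__(self, environ, start_response):
        query = parse_qs(environ.get("QUERY_STRING", ""))
        values = query.get(self.parameter)
        # Only patch the request if the client did not send a Range header.
        if values and "HTTP_RANGE" not in environ:
            environ["HTTP_RANGE"] = "{}={}".format(self.unit, values[0])
        return self.app(environ, start_response)
```

With this in place, a request for `?range=0-19` would look to the API code like a request carrying `Range: items=0-19`. Since it wraps the WSGI application, it can be left out entirely when the fallback mode is disabled.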
We currently store variants as (position, reference, observed) and regions as (begin, end) where all positioning is one-based and inclusive. An alternative is implemented in the observation-format git branch where all positioning is zero-based and open-ended and variants are stored as (begin, end, observed).

Here are some advantages of the alternative representation:
- If a reference genome is configured, the reference field is superfluous and we can do with defining just a region.
- Zero-based and open-ended positioning follows Python indexing and slicing notation as well as the BED format.
- Insertions are perhaps more naturally modelled by giving an empty region on the reference genome.
- Overlaps between regions and variants are easier to query for with begin and end fields.
But it also has some downsides:
- The current variant representation follows existing practices and therefore all interfaces to the outside world more closely.
- If there is no reference genome configured, we don’t have a complete definition of our variants.
- It means a lot of conversions between representations.
Note that the current representation isn’t following VCF, since VCF requires both the reference and observed sequences to be non-empty. However, by normalizing (and also anticipating other sources than VCF) we trim every sequence as much as possible.

For now we think it is best to stick with the current representations, but this is still somewhat up for discussion.
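To make the trade-off concrete, here is a minimal sketch of the conversion between the two representations. The function names are hypothetical, and we assume that `position` is the one-based coordinate of the first reference base (or, for an insertion with an empty reference, of the base directly following the insertion point).

```python
def to_zero_based(position, reference, observed):
    """Convert (position, reference, observed), one-based and inclusive,
    to (begin, end, observed), zero-based and open-ended."""
    begin = position - 1
    end = begin + len(reference)
    return begin, end, observed


def to_one_based(begin, end, observed, reference_sequence):
    """Convert back; this needs the reference genome to recover the
    reference field."""
    return begin + 1, reference_sequence[begin:end], observed


def overlaps(begin, end, region_begin, region_end):
    """True if two zero-based, open-ended intervals share a position."""
    return begin < region_end and region_begin < end


genome = "ACGTACGT"  # toy reference sequence
snv = to_zero_based(3, "G", "T")         # (2, 3, "T")
insertion = to_zero_based(5, "", "TT")   # (4, 4, "TT"), an empty region
deletion = to_zero_based(2, "CG", "")    # (1, 3, "")
```

Note that `to_one_based` cannot be written without a reference sequence, which is exactly the downside listed above for setups without a configured reference genome.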
Have a section in the docs describing the unit tests. Also note that the unit tests use the first 200,000 bases of chromosome 19 as a reference genome.
Refactor how we handle Celery tasks. Don’t store the task uuid in the database. Probably also create the resulting resource in the task, not before starting the task like we do now.

A running task should be monitored and, when finished, it points to the resulting resource.

We can probably still list running tasks even though we don’t store them in the database, following what Flower does. This will only work when sending task events is enabled (-E option to celeryd). Also have a look at CELERY_SEND_EVENTS and CELERY_SEND_TASK_SENT_EVENT configuration options. As this post suggests, we probably also have to explicitly monitor the events.

Important: We still seem to have an issue with many long-running tasks where some of them may be run twice. In general, this will raise the TaskError('variation_imported', 'Variation already imported') exception but I have seen at least one case where the entire variation has been imported twice which is quite hard to recover from. My hope is that we can prevent this from happening by some refactoring here.
See if this issue affects us.
For simplicity, we are currently storing homozygous vs heterozygous for each alternate call. Shouldn’t we actually be storing the genotype, like 0/1 vs 1/1 (in reporting, we could include 0/0)? It is more general.

I can think of two reasons why we chose not to store genotypes. The first is that we don’t have reference calls (but we could simply omit 0/0). The second is that we don’t have a guarantee that a given chromosome was called using the same ploidy. Therefore, we could for example have genotypes from different samples on the Y chromosome as 0/0, 0/1, 1/1 versus 0, 1. We could report these as-is, or merge them to the highest ploidy which would be incorrect in this case. Or we store the ploidy for each chromosome system-wide.
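As an illustration of how the currently stored zygosity relates to genotypes, here is a small sketch. The `"homozygous"` and `"heterozygous"` values and the `ploidy` argument are hypothetical, and only haploid and diploid calls are covered.

```python
def genotype(zygosity, ploidy=2):
    """Genotype string for a single alternate call, e.g. '0/1' or '1/1'."""
    if ploidy not in (1, 2):
        raise ValueError("only haploid and diploid calls are supported")
    if zygosity == "homozygous":
        alleles = [1] * ploidy
    elif zygosity == "heterozygous":
        if ploidy == 1:
            raise ValueError("a haploid call cannot be heterozygous")
        alleles = [0, 1]
    else:
        raise ValueError("unknown zygosity: {}".format(zygosity))
    return "/".join(str(allele) for allele in alleles)
```

The same alternate call on chromosome Y gives `1` for a haploid call and `1/1` for a diploid call, which is the mismatch between samples described above.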
Having a pool size per sample is not granular enough in some situations. For example, the 1KG phase1 integrated call sets are over 1092 individuals for most chromosomes, but over 1083 and 535 for the mitochondrial genome and chromosome Y, respectively. Not sure if we can really solve this easily, since having a pool size per variation/coverage will not work for samples with coverage.
Options for logging in a production environment. Basically, if DEBUG=False, everything from log level warning and up should be logged to a file and every error should optionally be e-mailed.
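A possible setup with the standard library `logging` module is sketched below. The function name, its parameters, the logger name, and the e-mail subject are placeholders and not existing configuration options.

```python
import logging
import logging.handlers


def configure_logging(debug, log_file, mail_host=None, mail_from=None,
                      mail_to=(), name="app"):
    """Attach production log handlers to the named logger unless debugging."""
    logger = logging.getLogger(name)
    if debug:
        return logger

    # Everything from level WARNING and up goes to a file.
    file_handler = logging.FileHandler(log_file)
    file_handler.setLevel(logging.WARNING)
    file_handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s %(name)s: %(message)s"))
    logger.addHandler(file_handler)

    # Optionally, every error is also e-mailed.
    if mail_host and mail_from and mail_to:
        mail_handler = logging.handlers.SMTPHandler(
            mail_host, mail_from, list(mail_to), "Application error")
        mail_handler.setLevel(logging.ERROR)
        logger.addHandler(mail_handler)

    return logger
```

Leaving out the mail settings gives file logging only, which matches the "optionally e-mailed" part of this note.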
JSON is not a hypertext format, but still we can do better by using hypertext-like representations, for example using HAL.
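To give an idea of what a HAL representation could look like, here is a sketch that wraps a resource in the reserved `_links` and `_embedded` properties defined by HAL. The helper name, the URIs, and the sample fields are made up for the example.

```python
import json


def hal_resource(uri, properties, links=None, embedded=None):
    """Build a HAL document: plain properties plus _links and _embedded."""
    document = dict(properties)
    document["_links"] = {"self": {"href": uri}}
    for relation, href in (links or {}).items():
        document["_links"][relation] = {"href": href}
    if embedded:
        document["_embedded"] = embedded
    return document


sample = hal_resource(
    "/samples/3",                       # hypothetical URI
    {"name": "Sample 3", "pool_size": 1},
    links={"user": "/users/1"},         # hypothetical link relation
)
print(json.dumps(sample, indent=2, sort_keys=True))
```

Such a document would be served with the `application/hal+json` media type.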
Replace Resource base class by SingletonResource and CollectionResource. Implement the root, genome, and authentication resource using SingletonResource.
See if we can easily compress with bgzip instead of regular gzip.
Perhaps use Factory Boy instead of fixture. It looks like we don’t have to monkey patch Factory Boy.
Use JSONPatch for editing resources (example).
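For reference, the sketch below applies a small subset of JSON Patch (RFC 6902) to a resource: only the `add`, `replace`, and `remove` operations, and only on object members, not on arrays. The `apply_patch` name and the resource fields are hypothetical; a real implementation would more likely use an existing library.

```python
import copy


def _unescape(token):
    # JSON Pointer (RFC 6901) escapes: "~1" is "/" and "~0" is "~".
    return token.replace("~1", "/").replace("~0", "~")


def apply_patch(document, patch):
    """Return a patched copy of document, leaving the original untouched."""
    result = copy.deepcopy(document)
    for operation in patch:
        tokens = [_unescape(t) for t in operation["path"].split("/")[1:]]
        if not tokens:
            raise ValueError("patching the document root is not supported")
        parent = result
        for token in tokens[:-1]:
            parent = parent[token]
        key = tokens[-1]
        op = operation["op"]
        if op == "add":
            parent[key] = operation["value"]
        elif op == "replace":
            if key not in parent:
                raise KeyError(key)
            parent[key] = operation["value"]
        elif op == "remove":
            del parent[key]
        else:
            raise ValueError("unsupported operation: {}".format(op))
    return result


resource = {"name": "Sample 3", "notes": {"draft": True}}
patch = [
    {"op": "replace", "path": "/name", "value": "Sample three"},
    {"op": "add", "path": "/notes/public", "value": False},
    {"op": "remove", "path": "/notes/draft"},
]
patched = apply_patch(resource, patch)
```

A client would send such a patch in a `PATCH` request with the `application/json-patch+json` media type.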