Example Python notebook connecting to GenomicsDB on Azure Blob storage

[24]:

import genomicsdb
print(genomicsdb.version())

1.4.5-SNAPSHOT-855435f

Below we set up some (optional) environment variables to connect to Azure Blob Storage. Environment variables can also be used to connect to AWS S3 or GCS storage. In each case GenomicsDB uses the native cloud SDK, so refer to that SDK's documentation for the supported environment variables. Roles or service principals are also supported for access to cloud storage.

Next, we specify the cloud URIs for the GenomicsDB workspace, callset, vid and reference file, as well as the GenomicsDB array we wish to query. Lastly, we also specify the genomic attributes we wish to query from the workspace.

The environment variables and URIs have all been redacted below.
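
If key-based access is used, one option is to export the credentials in the shell before starting Jupyter and only check from Python that they are present, so that no secret is typed into the notebook. The sketch below assumes the two variable names used in this notebook; the helper `missing_env_vars` is a hypothetical name for illustration and is not part of GenomicsDB.

```python
import os

# Variable names taken from this notebook. Role- or service-principal-based
# access would not need them, so treat this check as optional.
REQUIRED_VARS = ("AZURE_STORAGE_ACCOUNT", "AZURE_STORAGE_KEY")

def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars()
if missing:
    print("Not set:", ", ".join(missing))
```

Passing a plain dict as `env` makes the check easy to try without touching the real environment.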

[25]:

# set up environment variables and configuration for query
import os
storageaccount = ""
os.environ["AZURE_STORAGE_ACCOUNT"] = storageaccount
os.environ["AZURE_STORAGE_KEY"] = ""

container = ""
workspace_prefix = ""
workspace = f"az://{container}@{storageaccount}.blob.core.windows.net/{workspace_prefix}"
callset_file = f"az://{container}@{storageaccount}.blob.core.windows.net/{workspace_prefix}/callset_mapping.json"
vid_file = f"az://{container}@{storageaccount}.blob.core.windows.net/{workspace_prefix}/vid_mapping.json"
array = ""
attributes = ["REF", "ALT", "GT"]
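
The three URIs above repeat the same container and storage account prefix. As a minimal sketch, a small helper can build them in one place. `azure_blob_uri` is a hypothetical name, the URI layout is copied from the f-strings above, and the container, account and path values are made-up placeholders.

```python
def azure_blob_uri(container, storage_account, *path_parts):
    """Build an az:// URI in the layout used by the f-strings above."""
    base = f"az://{container}@{storage_account}.blob.core.windows.net"
    # Drop empty parts and stray slashes so the path never contains "//".
    parts = [p.strip("/") for p in path_parts if p and p.strip("/")]
    return "/".join([base, *parts])

# Placeholder values, not real resources.
workspace_uri = azure_blob_uri("mycontainer", "myaccount", "my/workspace")
callset_uri = azure_blob_uri(
    "mycontainer", "myaccount", "my/workspace", "callset_mapping.json"
)
print(workspace_uri)
print(callset_uri)
```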

We use the connect_with_protobuf function to connect to the workspace. The GenomicsDB protobuf specification describes the fields that can be set on the query.

[26]:

from genomicsdb.protobuf import genomicsdb_export_config_pb2 as query_pb
from genomicsdb.protobuf import genomicsdb_coordinates_pb2 as query_coords

# create the query protobuf, and point to the workspace we want to query
query = query_pb.ExportConfiguration()
query.workspace = workspace
query.array_name = array
query.attributes.extend(attributes)
query.callset_mapping_file = callset_file
query.vid_mapping_file = vid_file

# specify the samples we wish to query
query.query_sample_names.extend([
    '0x00922C4598840C041CB1BB19DC75231C969E9057F7E3B9CC04EE8B44E714B793_1xAFO897WL2I',
    '0x00AFE0D460A52BC28399A89E68A62D4CF2D279B56F482166B11BA7F29AFB45C9_1xGQPZ8BXWU0',
    '0x008AA255A3EF6E54E308C9B6728F9D7D60D71428A2DDD339DD6A7956DFB73B90_1x7HNHG5JNCA',
    '0x00AFE0D460A52BC28399A89E68A62D4CF2D279B56F482166B11BA7F29AFB45C9_1x6E9Z1KV502'
])

# specify the genomic intervals we wish to query
intervals = []
intervals.append(query_coords.ContigInterval(contig="1"))
intervals.append(query_coords.ContigInterval(contig="2", begin=1, end=100000000))
intervals.append(query_coords.ContigInterval(contig="5", begin=5000000, end=75000000))
query.query_contig_intervals.extend(intervals)

gdb = genomicsdb.connect_with_protobuf(query)
list = gdb.query_variant_calls()
print(*list, sep='\n')

(0, 249250620, [{'Row': 1, 'Col': 11188011, 'Sample': '0x00922C4598840C041CB1BB19DC75231C969E9057F7E3B9CC04EE8B44E714B793_1xAFO897WL2I', 'CHROM': '1', 'POS': 11188012, 'END': 11188012, 'REF': 'C', 'ALT': '[T]', 'GT': '0/0'}, {'Row': 1, 'Col': 65310488, 'Sample': '0x00922C4598840C041CB1BB19DC75231C969E9057F7E3B9CC04EE8B44E714B793_1xAFO897WL2I', 'CHROM': '1', 'POS': 65310489, 'END': 65310489, 'REF': 'T', 'ALT': '[C]', 'GT': '0/0'}, {'Row': 3, 'Col': 65310488, 'Sample': '0x00AFE0D460A52BC28399A89E68A62D4CF2D279B56F482166B11BA7F29AFB45C9_1xGQPZ8BXWU0', 'CHROM': '1', 'POS': 65310489, 'END': 65310489, 'REF': 'T', 'ALT': '[C]', 'GT': '0/0'}])
(251743127, 351743126, [{'Row': 3, 'Col': 281159492, 'Sample': '0x00AFE0D460A52BC28399A89E68A62D4CF2D279B56F482166B11BA7F29AFB45C9_1xGQPZ8BXWU0', 'CHROM': '2', 'POS': 29416366, 'END': 29416366, 'REF': 'G', 'ALT': '[C]', 'GT': '0/0'}, {'Row': 1, 'Col': 281159698, 'Sample': '0x00922C4598840C041CB1BB19DC75231C969E9057F7E3B9CC04EE8B44E714B793_1xAFO897WL2I', 'CHROM': '2', 'POS': 29416572, 'END': 29416572, 'REF': 'T', 'ALT': '[C]', 'GT': '0/0'}, {'Row': 3, 'Col': 281159698, 'Sample': '0x00AFE0D460A52BC28399A89E68A62D4CF2D279B56F482166B11BA7F29AFB45C9_1xGQPZ8BXWU0', 'CHROM': '2', 'POS': 29416572, 'END': 29416572, 'REF': 'T', 'ALT': '[C]', 'GT': '0/0'}, {'Row': 3, 'Col': 281188584, 'Sample': '0x00AFE0D460A52BC28399A89E68A62D4CF2D279B56F482166B11BA7F29AFB45C9_1xGQPZ8BXWU0', 'CHROM': '2', 'POS': 29445458, 'END': 29445458, 'REF': 'G', 'ALT': '[T]', 'GT': '0/0'}, {'Row': 1, 'Col': 281241093, 'Sample': '0x00922C4598840C041CB1BB19DC75231C969E9057F7E3B9CC04EE8B44E714B793_1xAFO897WL2I', 'CHROM': '2', 'POS': 29497967, 'END': 29497967, 'REF': 'G', 'ALT': '[A]', 'GT': '0/0'}])

Results are returned as a list of tuples, one per queried interval. In the output above only two of the three queried intervals appear; the contig 5 interval is absent, presumably because it produced no calls. Each tuple consists of, in order:

Flattened start position of the genomic interval
Flattened end position of the genomic interval
List of variant calls represented as a dict
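
As a minimal sketch of walking this structure, the loop below unpacks each tuple and reports how many calls fall in each interval. The `results` list is a hand-written stand-in with the same shape as the output above, not a live query result; positions are abbreviated from that output and the sample names are shortened to hypothetical placeholders.

```python
# Stand-in with the same shape as the query output: (start, end, [call dicts]).
results = [
    (0, 249250620, [
        {"Row": 1, "Col": 11188011, "Sample": "sample_A", "CHROM": "1",
         "POS": 11188012, "REF": "C", "ALT": "[T]", "GT": "0/0"},
        {"Row": 1, "Col": 65310488, "Sample": "sample_A", "CHROM": "1",
         "POS": 65310489, "REF": "T", "ALT": "[C]", "GT": "0/0"},
    ]),
    (251743127, 351743126, [
        {"Row": 3, "Col": 281159492, "Sample": "sample_B", "CHROM": "2",
         "POS": 29416366, "REF": "G", "ALT": "[C]", "GT": "0/0"},
    ]),
]

def summarize(results):
    """Return one (start, end, n_calls, contigs) row per interval tuple."""
    rows = []
    for start, end, calls in results:
        contigs = sorted({call["CHROM"] for call in calls})
        rows.append((start, end, len(calls), contigs))
    return rows

for start, end, n_calls, contigs in summarize(results):
    print(f"flattened [{start}, {end}]: {n_calls} call(s) on contig(s) {contigs}")
```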

Some use cases only need a flat list of all the variant calls, which is a simple transformation of the data above. Below we flatten the results and build a pandas DataFrame from them.

[27]:

import pandas as pd

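# each result is (start, end, calls); keep only the per-interval call lists
_, _, calls = zip(*list)
# flatten the per-interval lists into a single list of call dicts
flattened = [variant for sublist in calls for variant in sublist]
# DataFrame.to_markdown() requires the optional 'tabulate' package
print(pd.DataFrame(flattened).to_markdown())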

|    |   Row |       Col | Sample                                                                          |   CHROM |      POS |      END | REF   | ALT   | GT   |
|---:|------:|----------:|:--------------------------------------------------------------------------------|--------:|---------:|---------:|:------|:------|:-----|
|  0 |     1 |  11188011 | 0x00922C4598840C041CB1BB19DC75231C969E9057F7E3B9CC04EE8B44E714B793_1xAFO897WL2I |       1 | 11188012 | 11188012 | C     | [T]   | 0/0  |
|  1 |     1 |  65310488 | 0x00922C4598840C041CB1BB19DC75231C969E9057F7E3B9CC04EE8B44E714B793_1xAFO897WL2I |       1 | 65310489 | 65310489 | T     | [C]   | 0/0  |
|  2 |     3 |  65310488 | 0x00AFE0D460A52BC28399A89E68A62D4CF2D279B56F482166B11BA7F29AFB45C9_1xGQPZ8BXWU0 |       1 | 65310489 | 65310489 | T     | [C]   | 0/0  |
|  3 |     3 | 281159492 | 0x00AFE0D460A52BC28399A89E68A62D4CF2D279B56F482166B11BA7F29AFB45C9_1xGQPZ8BXWU0 |       2 | 29416366 | 29416366 | G     | [C]   | 0/0  |
|  4 |     1 | 281159698 | 0x00922C4598840C041CB1BB19DC75231C969E9057F7E3B9CC04EE8B44E714B793_1xAFO897WL2I |       2 | 29416572 | 29416572 | T     | [C]   | 0/0  |
|  5 |     3 | 281159698 | 0x00AFE0D460A52BC28399A89E68A62D4CF2D279B56F482166B11BA7F29AFB45C9_1xGQPZ8BXWU0 |       2 | 29416572 | 29416572 | T     | [C]   | 0/0  |
|  6 |     3 | 281188584 | 0x00AFE0D460A52BC28399A89E68A62D4CF2D279B56F482166B11BA7F29AFB45C9_1xGQPZ8BXWU0 |       2 | 29445458 | 29445458 | G     | [T]   | 0/0  |
|  7 |     1 | 281241093 | 0x00922C4598840C041CB1BB19DC75231C969E9057F7E3B9CC04EE8B44E714B793_1xAFO897WL2I |       2 | 29497967 | 29497967 | G     | [A]   | 0/0  |
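
Once the calls are flattened, ordinary Python or pandas tooling applies. As one illustration, the sketch below counts calls per sample and contig using only the standard library. The `flattened` list is a small hand-written stand-in with hypothetical sample names; on the dataframe itself, `groupby(["Sample", "CHROM"]).size()` would be a comparable approach.

```python
from collections import Counter

# Stand-in for the flattened call dicts; only the keys used here are included.
flattened = [
    {"Sample": "sample_A", "CHROM": "1", "POS": 11188012},
    {"Sample": "sample_A", "CHROM": "1", "POS": 65310489},
    {"Sample": "sample_B", "CHROM": "1", "POS": 65310489},
    {"Sample": "sample_B", "CHROM": "2", "POS": 29416366},
]

def calls_per_sample_and_contig(calls):
    """Count call dicts keyed by (sample name, contig)."""
    return Counter((call["Sample"], call["CHROM"]) for call in calls)

counts = calls_per_sample_and_contig(flattened)
for (sample, contig), n in sorted(counts.items()):
    print(sample, contig, n)
```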