{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Example Python notebook connecting to GenomicsDB on Azure Blob storage" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.4.5-SNAPSHOT-855435f\n" ] } ], "source": [ "import genomicsdb\n", "print(genomicsdb.version())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "\n", "Below we set up some (optional) environment variables in order to connect to Azure Blob Storage. We also support using environment variables to connect AWS S3 or GCS storage -- in each case, we use the native cloud SDK so refer to the appropriate documentation for supported environment variables. In addition, we also support using roles or service principals for access to cloud storage.\n", "\n", "Next, we specify the cloud URIs for the GenomicsDB workspace, callset, vid and reference file, as well as the GenomicsDB array we wish to query. Lastly, we also specify the genomic attributes we wish to query from the workspace.\n", "\n", "The environment variables and URIs have all been redacted below.\n", "\n", "***" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "# set up environment variables and configuration for query\n", "import os\n", "storageaccount = \"\"\n", "os.environ[\"AZURE_STORAGE_ACCOUNT\"] = storageaccount\n", "os.environ[\"AZURE_STORAGE_KEY\"] = \"\"\n", "\n", "container = \"\"\n", "workspace_prefix = \"\"\n", "workspace = f\"az://{container}@{storageaccount}.blob.core.windows.net/{workspace_prefix}\"\n", "callset_file = f\"az://{container}@{storageaccount}.blob.core.windows.net/{workspace_prefix}/callset_mapping.json\"\n", "vid_file = f\"az://{container}@{storageaccount}.blob.core.windows.net/{workspace_prefix}/vid_mapping.json\"\n", "array = \"\"\n", "attributes = [\"REF\", \"ALT\", \"GT\"]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "\n", "We use the `connect_with_protobuf` function to connect to the workspace. Check out the GenomicsDB protobuf specification [here](https://genomicsdb.readthedocs.io/en/develop/protodoc/proto.html)\n", "\n", "***" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(0, 249250620, [{'Row': 1, 'Col': 11188011, 'Sample': '0x00922C4598840C041CB1BB19DC75231C969E9057F7E3B9CC04EE8B44E714B793_1xAFO897WL2I', 'CHROM': '1', 'POS': 11188012, 'END': 11188012, 'REF': 'C', 'ALT': '[T]', 'GT': '0/0'}, {'Row': 1, 'Col': 65310488, 'Sample': '0x00922C4598840C041CB1BB19DC75231C969E9057F7E3B9CC04EE8B44E714B793_1xAFO897WL2I', 'CHROM': '1', 'POS': 65310489, 'END': 65310489, 'REF': 'T', 'ALT': '[C]', 'GT': '0/0'}, {'Row': 3, 'Col': 65310488, 'Sample': '0x00AFE0D460A52BC28399A89E68A62D4CF2D279B56F482166B11BA7F29AFB45C9_1xGQPZ8BXWU0', 'CHROM': '1', 'POS': 65310489, 'END': 65310489, 'REF': 'T', 'ALT': '[C]', 'GT': '0/0'}])\n", "(251743127, 351743126, [{'Row': 3, 'Col': 281159492, 'Sample': '0x00AFE0D460A52BC28399A89E68A62D4CF2D279B56F482166B11BA7F29AFB45C9_1xGQPZ8BXWU0', 'CHROM': '2', 'POS': 29416366, 'END': 29416366, 'REF': 'G', 'ALT': '[C]', 'GT': '0/0'}, {'Row': 1, 'Col': 281159698, 'Sample': '0x00922C4598840C041CB1BB19DC75231C969E9057F7E3B9CC04EE8B44E714B793_1xAFO897WL2I', 'CHROM': '2', 'POS': 29416572, 'END': 29416572, 'REF': 'T', 'ALT': '[C]', 'GT': '0/0'}, {'Row': 3, 'Col': 281159698, 'Sample': '0x00AFE0D460A52BC28399A89E68A62D4CF2D279B56F482166B11BA7F29AFB45C9_1xGQPZ8BXWU0', 'CHROM': '2', 'POS': 29416572, 'END': 29416572, 'REF': 'T', 'ALT': '[C]', 'GT': '0/0'}, {'Row': 3, 'Col': 281188584, 'Sample': '0x00AFE0D460A52BC28399A89E68A62D4CF2D279B56F482166B11BA7F29AFB45C9_1xGQPZ8BXWU0', 'CHROM': '2', 'POS': 29445458, 'END': 29445458, 'REF': 'G', 'ALT': '[T]', 'GT': '0/0'}, {'Row': 1, 'Col': 281241093, 'Sample': '0x00922C4598840C041CB1BB19DC75231C969E9057F7E3B9CC04EE8B44E714B793_1xAFO897WL2I', 'CHROM': '2', 'POS': 29497967, 'END': 29497967, 'REF': 'G', 'ALT': '[A]', 'GT': '0/0'}])\n" ] } ], "source": [ "from genomicsdb.protobuf import genomicsdb_export_config_pb2 as query_pb\n", "from genomicsdb.protobuf import genomicsdb_coordinates_pb2 as query_coords\n", "\n", "# create the query protobuf, and point to the workspace we want to query\n", "query = query_pb.ExportConfiguration()\n", "query.workspace = workspace\n", "query.array_name = array\n", "query.attributes.extend([\"REF\", \"ALT\", \"GT\"])\n", "query.callset_mapping_file = callset_file\n", "query.vid_mapping_file = vid_file\n", "\n", "# specify the samples we wish to query\n", "query.query_sample_names.extend([\n", " '0x00922C4598840C041CB1BB19DC75231C969E9057F7E3B9CC04EE8B44E714B793_1xAFO897WL2I',\n", " '0x00AFE0D460A52BC28399A89E68A62D4CF2D279B56F482166B11BA7F29AFB45C9_1xGQPZ8BXWU0',\n", " '0x008AA255A3EF6E54E308C9B6728F9D7D60D71428A2DDD339DD6A7956DFB73B90_1x7HNHG5JNCA',\n", " '0x00AFE0D460A52BC28399A89E68A62D4CF2D279B56F482166B11BA7F29AFB45C9_1x6E9Z1KV502'\n", "])\n", "\n", "# specify the genomic intervals we wish to query\n", "intervals = []\n", "intervals.append(query_coords.ContigInterval(contig=\"1\"))\n", "intervals.append(query_coords.ContigInterval(contig=\"2\", begin=1, end=100000000))\n", "intervals.append(query_coords.ContigInterval(contig=\"5\", begin=5000000, end=75000000))\n", "query.query_contig_intervals.extend(intervals)\n", "\n", "gdb = genomicsdb.connect_with_protobuf(query)\n", "list = gdb.query_variant_calls()\n", "print(*list, sep='\\n')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "***\n", "\n", "Results are returned as a list of tuples, where the length of the list will correspond to the number of intervals being queried. Each entry will consist of:\n", "\n", "* Flattened start position of the genomic interval\n", "* Flattened end position of the genomic interval\n", "* List of variant calls represented as a dict\n", "\n", "Some use cases may only require a flattened list of all the variant calls - the above data can be easily transformed to achieve that. Below we show an example of doing so, and also create a Pandas dataframe from the results\n", "\n", "***" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "| | Row | Col | Sample | CHROM | POS | END | REF | ALT | GT |\n", "|---:|------:|----------:|:--------------------------------------------------------------------------------|--------:|---------:|---------:|:------|:------|:-----|\n", "| 0 | 1 | 11188011 | 0x00922C4598840C041CB1BB19DC75231C969E9057F7E3B9CC04EE8B44E714B793_1xAFO897WL2I | 1 | 11188012 | 11188012 | C | [T] | 0/0 |\n", "| 1 | 1 | 65310488 | 0x00922C4598840C041CB1BB19DC75231C969E9057F7E3B9CC04EE8B44E714B793_1xAFO897WL2I | 1 | 65310489 | 65310489 | T | [C] | 0/0 |\n", "| 2 | 3 | 65310488 | 0x00AFE0D460A52BC28399A89E68A62D4CF2D279B56F482166B11BA7F29AFB45C9_1xGQPZ8BXWU0 | 1 | 65310489 | 65310489 | T | [C] | 0/0 |\n", "| 3 | 3 | 281159492 | 0x00AFE0D460A52BC28399A89E68A62D4CF2D279B56F482166B11BA7F29AFB45C9_1xGQPZ8BXWU0 | 2 | 29416366 | 29416366 | G | [C] | 0/0 |\n", "| 4 | 1 | 281159698 | 0x00922C4598840C041CB1BB19DC75231C969E9057F7E3B9CC04EE8B44E714B793_1xAFO897WL2I | 2 | 29416572 | 29416572 | T | [C] | 0/0 |\n", "| 5 | 3 | 281159698 | 0x00AFE0D460A52BC28399A89E68A62D4CF2D279B56F482166B11BA7F29AFB45C9_1xGQPZ8BXWU0 | 2 | 29416572 | 29416572 | T | [C] | 0/0 |\n", "| 6 | 3 | 281188584 | 0x00AFE0D460A52BC28399A89E68A62D4CF2D279B56F482166B11BA7F29AFB45C9_1xGQPZ8BXWU0 | 2 | 29445458 | 29445458 | G | [T] | 0/0 |\n", "| 7 | 1 | 281241093 | 0x00922C4598840C041CB1BB19DC75231C969E9057F7E3B9CC04EE8B44E714B793_1xAFO897WL2I | 2 | 29497967 | 29497967 | G | [A] | 0/0 |\n" ] } ], "source": [ "import pandas as pd\n", "\n", "x,y,calls = zip(*list)\n", "flattened = [variant for sublist in calls for variant in sublist]\n", "print(pd.DataFrame(flattened).to_markdown())" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.14 ('env': venv)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.14" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "fe3c803602b7d0d79cc9c377488bf3dd64067cfd86c74a6de3c16da38dc4030a" } } }, "nbformat": 4, "nbformat_minor": 2 }