Java Import Package

The GenomicsDBImporter class is the main entry point for importing VCFs into GenomicsDB.

org::genomicsdb::importer::GenomicsDBImporter : public org.genomicsdb.importer.GenomicsDBImporterJni , public org.genomicsdb.importer.extensions.JsonFileExtensions , public org.genomicsdb.importer.extensions.CallSetMapExtensions , public org.genomicsdb.importer.extensions.VidMapExtensions

Java wrapper for vcf2genomicsdb - imports VCFs into GenomicsDB. All vid information is assumed to be set correctly by the user in the JSON files.

Public Functions

inline  GenomicsDBImporter (final String loaderJSONFile)

Constructor

Parameters

loaderJSONFile – GenomicsDB loader JSON configuration file

inline  GenomicsDBImporter (final String loaderJSONFile, final int rank)

Constructor

Parameters
  • loaderJSONFile – GenomicsDB loader JSON configuration file

  • rank – Rank of this process (TileDB/GenomicsDB partition idx)

inline  GenomicsDBImporter (final ImportConfig config)

Constructor that creates the required data structures from a list of GVCF files and a chromosome interval. It was developed specifically for running chromosome interval imports in parallel.

Parameters

config – Parallel import configuration

Throws

FileNotFoundException – when files could not be read/written

inline GenomicsDBVidMapProto.VidMappingPB getProtobufVidMapping ()

Function to return the vid mapping protobuf object.

Returns

protobuf object for vid mapping

inline void updateProtobufVidMapping (GenomicsDBVidMapProto.VidMappingPB vidMapPB)

Function to update the vid mapping protobuf object in the top-level config object. Used in cases where the VCF header doesn’t contain accurate information about how to parse fields, for instance allele-specific annotations.

Parameters

vidMapPB – vid mapping protobuf object that replaces the current one

inline int addSortedVariantContextIterator (final String streamName, final VCFHeader vcfHeader, Iterator< VariantContext > vcIterator, final long bufferCapacity, final VariantContextWriterBuilder.OutputType streamType, final Map< Integer, SampleInfo > sampleIndexToInfo)

Add a sorted VC iterator as the data source - caller must:

  1. Call setupGenomicsDBImporter() once all iterators are added

  2. Call doSingleImport()

  3. Done!

Parameters
  • streamName – Name of the stream being added - must be unique with respect to this GenomicsDBImporter object

  • vcfHeader – VCF header for the stream

  • vcIterator – Iterator over VariantContext objects

  • bufferCapacity – Capacity of the stream buffer in bytes

  • streamType – BCF_STREAM or VCF_STREAM

  • sampleIndexToInfo – map from sample index in the vcfHeader to a SampleInfo object, which contains the row index and globally unique name. Can be set to null, which implies that the mapping is stored in a callsets JSON file

Returns

returns the stream index

inline int addBufferStream (final String streamName, final VCFHeader vcfHeader, final long bufferCapacity, final VariantContextWriterBuilder.OutputType streamType, Iterator< VariantContext > vcIterator, final Map< Integer, SampleInfo > sampleIndexToInfo)

Add a buffer stream or VC iterator - internal function

Parameters
  • streamName – Name of the stream being added - must be unique with respect to this GenomicsDBImporter object

  • vcfHeader – VCF header for the stream

  • bufferCapacity – Capacity of the stream buffer in bytes

  • streamType – BCF_STREAM or VCF_STREAM

  • vcIterator – Iterator over VariantContext objects - can be null

  • sampleIndexToInfo – map from sample index in the vcfHeader to a SampleInfo object, which contains the row index and globally unique name. Can be set to null, which implies that the mapping is stored in a callsets JSON file

Returns

returns the stream index

inline void setupGenomicsDBImporter()

Set up the importer after all the buffer streams are added, but before any data is inserted into any stream. No more buffer streams can be added once setupGenomicsDBImporter() is called.

Throws

IOException – if the modified callsets JSON cannot be written

inline boolean add (VariantContext vc, final int streamIdx)

Write a VariantContext object to a stream. The write may fail if the buffer is full. It’s the caller’s responsibility to keep track of the VC object that was not written.

Parameters
  • vc – VariantContext object

  • streamIdx – index of the stream returned by the addBufferStream() call

Returns

true if the vc object was written successfully, false otherwise

inline boolean doSingleImport()

Only to be used in cases where iterators over VariantContext objects are not used. The data is written to the buffers directly, after which this function is called. See TestBufferStreamGenomicsDBImporter.java for an example.

Throws

IOException – if the import fails

Returns

true if the import process is done

inline void executeImport()

Import multiple chromosome intervals

inline void executeImport (final int numThreads)

Import multiple chromosome intervals

Parameters

numThreads – number of threads used to import partitions

inline void doConsolidate (final int numThreads)

Consolidate all intervals/arrays in a given workspace into a single fragment

Parameters

numThreads – number of threads to use to parallelize consolidation

inline long getNumExhaustedBufferStreams()

Returns

the number of buffer streams for which new data must be supplied

inline int getExhaustedBufferStreamIndex (final long i)

Get the buffer stream index of the i-th exhausted stream. There are mNumExhaustedBufferStreams exhausted streams (see getNumExhaustedBufferStreams()), and the caller must provide data for the streams with indexes getExhaustedBufferStreamIndex(0), getExhaustedBufferStreamIndex(1),…, getExhaustedBufferStreamIndex(mNumExhaustedBufferStreams-1).

Parameters

i – i-th exhausted buffer stream

Returns

the buffer stream index of the i-th exhausted stream
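
The push-based workflow described by add(), doSingleImport(), getNumExhaustedBufferStreams(), getExhaustedBufferStreamIndex() and isDone() can be sketched as a refill loop. The block below is a minimal, self-contained illustration and does not use the GenomicsDB library: MockImporter is a hypothetical stand-in that mimics only the documented contract (add() returns false when a buffer is full, exhausted streams must be refilled), records are plain strings, and buffer capacity is counted in records rather than bytes. The mock also marks every stream as exhausted after each round, which is a simplification.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

public class BufferStreamLoopSketch {

    /** Hypothetical stand-in for the importer; NOT the GenomicsDB class. */
    static class MockImporter {
        private final int capacity; // records per stream buffer (assumption: not bytes)
        private final List<Deque<String>> buffers = new ArrayList<>();
        private final List<Integer> exhausted = new ArrayList<>();
        private final List<String> imported = new ArrayList<>();
        private boolean done = false;

        MockImporter(int numStreams, int capacity) {
            this.capacity = capacity;
            for (int i = 0; i < numStreams; i++) {
                buffers.add(new ArrayDeque<>());
                exhausted.add(i); // every stream starts empty, so it needs data
            }
        }

        boolean add(String record, int streamIdx) {
            Deque<String> buf = buffers.get(streamIdx);
            if (buf.size() >= capacity) {
                return false; // buffer full: the caller keeps the record
            }
            buf.add(record);
            return true;
        }

        boolean doSingleImport() {
            int consumed = 0;
            exhausted.clear();
            for (int i = 0; i < buffers.size(); i++) {
                Deque<String> buf = buffers.get(i);
                consumed += buf.size();
                imported.addAll(buf);
                buf.clear();
                exhausted.add(i); // drained, so new data must be supplied
            }
            done = (consumed == 0); // nothing was supplied: the import is finished
            return done;
        }

        long getNumExhaustedBufferStreams() { return exhausted.size(); }
        int getExhaustedBufferStreamIndex(long i) { return exhausted.get((int) i); }
        boolean isDone() { return done; }
        List<String> imported() { return imported; }
    }

    /** Refill loop: top up exhausted streams, import, repeat until done. */
    static List<String> runImport(List<List<String>> sources, int capacity) {
        MockImporter importer = new MockImporter(sources.size(), capacity);
        List<Iterator<String>> readers = new ArrayList<>();
        List<String> pending = new ArrayList<>(); // record rejected by add(), per stream
        for (List<String> s : sources) {
            readers.add(s.iterator());
            pending.add(null);
        }
        while (!importer.isDone()) {
            long n = importer.getNumExhaustedBufferStreams();
            for (long i = 0; i < n; i++) {
                int idx = importer.getExhaustedBufferStreamIndex(i);
                Iterator<String> reader = readers.get(idx);
                boolean added = true;
                while (added) {
                    if (pending.get(idx) == null) {
                        if (!reader.hasNext()) break;
                        pending.set(idx, reader.next());
                    }
                    added = importer.add(pending.get(idx), idx);
                    if (added) pending.set(idx, null); // written, nothing left to retry
                }
            }
            importer.doSingleImport();
        }
        return importer.imported();
    }

    public static void main(String[] args) {
        List<List<String>> sources = Arrays.asList(
            Arrays.asList("a1", "a2", "a3"),
            Arrays.asList("b1"));
        System.out.println(runImport(sources, 2)); // prints [a1, a2, b1, a3]
    }
}
```

The point of the pending list is the rule stated for add(): a record that did not fit is not lost, it is retried on the next round before anything new is read from that stream.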

inline boolean isDone()

Checks whether the import process has completed

Returns

true if complete, false otherwise

inline MultiChromosomeIterator columnPartitionIterator(FeatureReader<VariantContext> reader)

Utility function that, given a FeatureReader, returns a MultiChromosomeIterator over the VariantContext objects provided by the reader that belong to the column partition specified by this object’s loader JSON file and rank/partition index

Parameters

reader – FeatureReader over VariantContext objects

Throws
  • IOException – when the reader’s query method throws an IOException

  • ParseException – when there is a bug in the JNI interface and a faulty JSON is returned

Returns

MultiChromosomeIterator that iterates over VariantContext objects in the reader belonging to the specified column partition

inline void write()

Write to TileDB/GenomicsDB using the configuration specified in the loader file passed to the constructor

inline void write (final int rank)

Write to TileDB/GenomicsDB using the configuration specified in the loader file passed to the constructor

Parameters

rank – Rank of this process (TileDB/GenomicsDB partition idx)

inline void write (final String loaderJSONFile, final int rank)

Write to TileDB/GenomicsDB

Parameters
  • loaderJSONFile – GenomicsDB loader JSON configuration file

  • rank – Rank of this process (TileDB/GenomicsDB partition idx)

inline void coalesceContigsIntoNumPartitions (final int partitions)

Coalesce contigs into fewer GenomicsDB partitions

Parameters

partitions – Approximate number of partitions to coalesce into
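
To illustrate why the partition count is described as approximate, here is a minimal sketch of one way contigs could be grouped. It is not the algorithm GenomicsDB uses: the greedy strategy, the coalesce helper, the contig names and the lengths are all assumptions made for illustration. Consecutive contigs are added to a group until the group reaches a target size of total length divided by the requested number of partitions, so the number of groups produced can differ from the number requested.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CoalesceContigsSketch {

    /**
     * Hypothetical greedy grouping of consecutive contigs (illustration only,
     * not the library's implementation).
     */
    static List<List<String>> coalesce(List<String> names, long[] lengths, int partitions) {
        if (partitions < 1) {
            throw new IllegalArgumentException("partitions must be >= 1");
        }
        long total = 0;
        for (long len : lengths) {
            total += len;
        }
        long target = Math.max(1, total / partitions); // goal size per group

        List<List<String>> groups = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentSize = 0;
        for (int i = 0; i < names.size(); i++) {
            current.add(names.get(i));
            currentSize += lengths[i];
            if (currentSize >= target) { // group is large enough, start a new one
                groups.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
        }
        if (!current.isEmpty()) {
            groups.add(current); // leftover contigs form the last group
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("chr1", "chr2", "chr3", "chr4", "chr5", "chr6");
        long[] lengths = {100, 80, 60, 30, 20, 10};
        // prints [[chr1], [chr2, chr3], [chr4, chr5, chr6]]
        System.out.println(coalesce(names, lengths, 3));
    }
}
```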

Public Static Functions

static inline MultiChromosomeIterator columnPartitionIterator (FeatureReader< VariantContext > reader, final String loaderJSONFile, final int partitionIdx)

Utility function that, given a FeatureReader, returns a MultiChromosomeIterator over the VariantContext objects provided by the reader that belong to the column partition specified by the loader JSON file and rank/partition index

Parameters
  • reader – FeatureReader over VariantContext objects

  • loaderJSONFile – path to loader JSON file

  • partitionIdx – rank/partition index

Throws
  • IOException – when the reader’s query method throws an IOException

  • ParseException – when there is a bug in the JNI interface and a faulty JSON is returned

Returns

MultiChromosomeIterator that iterates over VariantContext objects in the reader belonging to the specified column partition

The following classes are useful for specifying the import configuration.

class ImportConfig

This implementation extends what is in GenomicsDBImportConfiguration. It adds the extra data needed for parallel import.

Subclassed by org.genomicsdb.model.CommandLineImportConfig

Public Functions

inline  ImportConfig (final GenomicsDBImportConfiguration.ImportConfiguration importConfiguration, final boolean validateSampleToReaderMap, final boolean passAsVcf, final int batchSize, final Set< VCFHeaderLine > mergedHeader, final Map< String, URI > sampleNameToVcfPath, final Func< Map< String, URI >, Integer, Integer, Map< String, FeatureReader< VariantContext >>> sampleToReaderMapCreator, final boolean incrementalImport)

Main ImportConfig constructor

Parameters
  • importConfiguration – GenomicsDBImportConfiguration protobuf object

  • validateSampleToReaderMap – Flag for validating sample to reader map

  • passAsVcf – Flag for indicating that a VCF is being passed

  • batchSize – Batch size

  • mergedHeader – Required header

  • sampleNameToVcfPath – Sample name to VCF path map

  • sampleToReaderMapCreator – Function used for creating sampleToReaderMap

  • incrementalImport – Flag for indicating incremental import
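
The sampleToReaderMapCreator parameter is a three-argument function from a sample-to-path map and two integers to a sample-to-reader map. The sketch below shows what such a function could look like, under stated assumptions: the nested Func interface and its apply method name are a guess at the shape of the library's Func, StubReader is a hypothetical stand-in for FeatureReader<VariantContext>, and the two Integer arguments are assumed to mean batch size and start index within the sample list. None of this is confirmed by the reference above.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ReaderMapCreatorSketch {

    /** Assumed shape of the three-argument Func interface; the method name is a guess. */
    interface Func<T1, T2, T3, R> {
        R apply(T1 t1, T2 t2, T3 t3);
    }

    /** Hypothetical stand-in for FeatureReader<VariantContext>. */
    static class StubReader {
        final URI source;

        StubReader(URI source) {
            this.source = source;
        }
    }

    /** Creates readers for one batch of samples (argument meaning is an assumption). */
    static final Func<Map<String, URI>, Integer, Integer, Map<String, StubReader>> CREATOR =
        (sampleNameToVcfPath, batchSize, startIndex) -> {
            List<String> names = new ArrayList<>(sampleNameToVcfPath.keySet());
            Map<String, StubReader> readers = new LinkedHashMap<>();
            int end = Math.min(startIndex + batchSize, names.size());
            for (int i = startIndex; i < end; i++) {
                String name = names.get(i);
                readers.put(name, new StubReader(sampleNameToVcfPath.get(name)));
            }
            return readers;
        };

    public static void main(String[] args) {
        // Insertion-ordered map so that batches are deterministic
        Map<String, URI> paths = new LinkedHashMap<>();
        paths.put("s1", URI.create("file:///data/s1.g.vcf.gz"));
        paths.put("s2", URI.create("file:///data/s2.g.vcf.gz"));
        paths.put("s3", URI.create("file:///data/s3.g.vcf.gz"));
        System.out.println(CREATOR.apply(paths, 2, 0).keySet()); // prints [s1, s2]
    }
}
```

Opening readers one batch at a time, rather than all at once, keeps the number of simultaneously open files bounded by the batch size.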

template<T1, T2, T3, R>
interface Func
class BatchCompletionCallbackFunctionArgument