Java Import Package

The GenomicsDBImporter class is the main entry point for importing VCFs into GenomicsDB.

org::genomicsdb::importer::GenomicsDBImporter : public org.genomicsdb.importer.GenomicsDBImporterJni , public org.genomicsdb.importer.extensions.JsonFileExtensions , public org.genomicsdb.importer.extensions.CallSetMapExtensions , public org.genomicsdb.importer.extensions.VidMapExtensions

Java wrapper for vcf2genomicsdb that imports VCFs into GenomicsDB. All vid information is assumed to be set correctly by the user (in JSON files).

Public Functions

inline GenomicsDBImporter (final String loaderJSONFile)

Constructor

Parameters: loaderJSONFile – GenomicsDB loader JSON configuration file

inline GenomicsDBImporter (final String loaderJSONFile, final int rank)

Constructor

Parameters

loaderJSONFile – GenomicsDB loader JSON configuration file
rank – Rank of this process (TileDB/GenomicsDB partition idx)

inline GenomicsDBImporter (final ImportConfig config)

Constructor that creates the required data structures from a list of GVCF files and a chromosome interval. This constructor was developed specifically for running chromosome interval imports in parallel.

Parameters: config – Parallel import configuration
Throws: FileNotFoundException – when files could not be read/written

inline GenomicsDBVidMapProto.VidMappingPB getProtobufVidMapping ()

Function to return the vid mapping protobuf object.

Returns: protobuf object for vid mapping

inline void updateProtobufVidMapping (GenomicsDBVidMapProto.VidMappingPB vidMapPB)

Function to update the vid mapping protobuf object in the top-level config object. Used in cases where the VCF header doesn’t contain accurate information about how to parse fields, for instance allele-specific annotations.

Parameters: vidMapPB – vid mapping protobuf object to use as the new mapping

inline int addSortedVariantContextIterator (final String streamName, final VCFHeader vcfHeader, Iterator< VariantContext > vcIterator, final long bufferCapacity, final VariantContextWriterBuilder.OutputType streamType, final Map< Integer, SampleInfo > sampleIndexToInfo)

Add a sorted VariantContext (VC) iterator as the data source. After adding it, the caller must:

Call setupGenomicsDBImporter() once all iterators are added
Call doSingleImport()
The import is then complete.

Parameters

streamName – Name of the stream being added - must be unique with respect to this GenomicsDBImporter object
vcfHeader – VCF header for the stream
vcIterator – Iterator over VariantContext objects
bufferCapacity – Capacity of the stream buffer in bytes
streamType – BCF_STREAM or VCF_STREAM
sampleIndexToInfo – map from sample index in the vcfHeader to a SampleInfo object, which contains the row index and globally unique name. Can be set to null, which implies that the mapping is stored in a callsets JSON file

Returns

returns the stream index

inline int addBufferStream (final String streamName, final VCFHeader vcfHeader, final long bufferCapacity, final VariantContextWriterBuilder.OutputType streamType, Iterator< VariantContext > vcIterator, final Map< Integer, SampleInfo > sampleIndexToInfo)

Add a buffer stream or VC iterator - internal function

Parameters

streamName – Name of the stream being added - must be unique with respect to this GenomicsDBImporter object
vcfHeader – VCF header for the stream
bufferCapacity – Capacity of the stream buffer in bytes
streamType – BCF_STREAM or VCF_STREAM
vcIterator – Iterator over VariantContext objects - can be null
sampleIndexToInfo – map from sample index in the vcfHeader to a SampleInfo object, which contains the row index and globally unique name. Can be set to null, which implies that the mapping is stored in a callsets JSON file

Returns

returns the stream index

inline void setupGenomicsDBImporter()

Set up the importer after all the buffer streams are added, but before any data is inserted into any stream. No more buffer streams can be added once setupGenomicsDBImporter() is called.

Throws: IOException – if the modified callsets JSON cannot be written

inline boolean add (VariantContext vc, final int streamIdx)

Write a VariantContext object to the stream. The write may fail if the buffer is full. It’s the caller’s responsibility to keep track of any VC object that was not written.

Parameters

vc – VariantContext object
streamIdx – index of the stream returned by the addBufferStream() call

Returns

true if the vc object was written successfully, false otherwise
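
Because add() reports a full buffer by returning false, the caller has to hold on to the rejected record and offer it again once space is available. The bookkeeping can be sketched as below. This is a minimal illustration, not GenomicsDB code: `BufferSink`, `BoundedSink`, `drain`, and `writeAll` are hypothetical names, and `drain` is an assumed stand-in for whatever step empties the buffer in a real import. The stand-in lets the example run without the library.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class PendingRecordSketch {
    /** Hypothetical stand-in for the "add returns false when full" contract. */
    public interface BufferSink<T> {
        boolean add(T record); // false when the buffer is full
        void drain();          // assumed step that empties the buffer
    }

    /** Toy bounded sink, present only to make the sketch runnable. */
    public static final class BoundedSink implements BufferSink<String> {
        private final int capacity;
        private final List<String> buffer = new ArrayList<>();
        public final List<String> imported = new ArrayList<>();
        public int drains = 0;

        public BoundedSink(int capacity) {
            if (capacity < 1) {
                throw new IllegalArgumentException("capacity must be >= 1");
            }
            this.capacity = capacity;
        }

        @Override
        public boolean add(String record) {
            if (buffer.size() >= capacity) {
                return false; // buffer full: caller must keep the record
            }
            buffer.add(record);
            return true;
        }

        @Override
        public void drain() {
            imported.addAll(buffer);
            buffer.clear();
            drains++;
        }
    }

    /** Writes every record, retrying the rejected one after each drain. */
    public static <T> void writeAll(Iterator<T> records, BufferSink<T> sink) {
        T pending = null; // record rejected by the last add(), if any
        while (pending != null || records.hasNext()) {
            T next = (pending != null) ? pending : records.next();
            if (sink.add(next)) {
                pending = null;
            } else {
                pending = next; // remember it, free space, then retry
                sink.drain();
            }
        }
        sink.drain(); // flush whatever is still buffered
    }

    public static void main(String[] args) {
        BoundedSink sink = new BoundedSink(2);
        writeAll(List.of("a", "b", "c", "d", "e").iterator(), sink);
        System.out.println(sink.imported);
    }
}
```

The key point is the `pending` variable: a rejected record is never dropped or re-read from the iterator, it is simply offered again after space has been freed.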

inline boolean doSingleImport()

Only to be used in cases where VariantContext iterators are not used. The data is written to the buffers directly, after which this function is called. See TestBufferStreamGenomicsDBImporter.java for an example.

Throws: IOException – if the import fails
Returns: true if the import process is done

inline void executeImport(): Import multiple chromosome interval

inline void executeImport (final int numThreads)

Import multiple chromosome intervals

Parameters: numThreads – number of threads used to import partitions

inline void doConsolidate (final int numThreads)

Consolidate all intervals/arrays in a given workspace into a single fragment

Parameters: numThreads – number of threads to use to parallelize consolidation

inline long getNumExhaustedBufferStreams()

Returns: the number of buffer streams for which new data must be supplied

inline int getExhaustedBufferStreamIndex (final long i)

Get the buffer stream index of the i-th exhausted stream. There are mNumExhaustedBufferStreams exhausted streams, and the caller must provide data for the streams with indexes getExhaustedBufferStreamIndex(0), getExhaustedBufferStreamIndex(1), ..., getExhaustedBufferStreamIndex(mNumExhaustedBufferStreams-1).

Parameters: i – position of the exhausted buffer stream, from 0 to mNumExhaustedBufferStreams-1
Returns: the buffer stream index of the i-th exhausted stream
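
The indexing convention described above amounts to a simple loop over positions 0 to N-1. The sketch below collects the indexes of all streams that need new data. It runs against a hypothetical `ExhaustedStreamView` interface that copies only the two documented method signatures, so that it is runnable without the library; `exhaustedStreamIndexes` and `fakeView` are illustrative helpers, not part of GenomicsDB.

```java
import java.util.ArrayList;
import java.util.List;

public class ExhaustedStreamsSketch {
    /** Hypothetical stand-in exposing only the two documented accessors. */
    public interface ExhaustedStreamView {
        long getNumExhaustedBufferStreams();
        int getExhaustedBufferStreamIndex(long i);
    }

    /** Returns the stream index of every exhausted stream, in order. */
    public static List<Integer> exhaustedStreamIndexes(ExhaustedStreamView importer) {
        List<Integer> indexes = new ArrayList<>();
        long count = importer.getNumExhaustedBufferStreams();
        for (long i = 0; i < count; i++) {
            // i is a position in the exhausted list, not a stream index
            indexes.add(importer.getExhaustedBufferStreamIndex(i));
        }
        return indexes;
    }

    /** Fake view for demonstration: pretends the given streams are exhausted. */
    public static ExhaustedStreamView fakeView(int... exhausted) {
        return new ExhaustedStreamView() {
            @Override
            public long getNumExhaustedBufferStreams() {
                return exhausted.length;
            }

            @Override
            public int getExhaustedBufferStreamIndex(long i) {
                return exhausted[(int) i];
            }
        };
    }

    public static void main(String[] args) {
        // Streams 4 and 1 are treated as exhausted in this made-up example.
        System.out.println(exhaustedStreamIndexes(fakeView(4, 1)));
    }
}
```

Each returned value is a stream index of the kind returned by addBufferStream(), so it can be passed directly as the streamIdx argument of add().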

inline boolean isDone()

Checks whether the import process has completed.

Returns: true if complete, false otherwise

inline MultiChromosomeIterator columnPartitionIterator(FeatureReader<VariantContext> reader)

Utility function that returns a MultiChromosomeIterator for a given FeatureReader. The iterator covers the VariantContext objects provided by the reader that belong to the column partition specified by this object’s loader JSON file and rank/partition index.

Parameters

reader – AbstractFeatureReader over VariantContext objects

Throws

IOException – when the reader’s query method throws an IOException
ParseException – when there is a bug in the JNI interface and a faulty JSON is returned

Returns

MultiChromosomeIterator that iterates over VariantContext objects in the reader belonging to the specified column partition

inline void write(): Write to TileDB/GenomicsDB using the configuration specified in the loader file passed to constructor

inline void write (final int rank)

Write to TileDB/GenomicsDB using the configuration specified in the loader file passed to the constructor

Parameters: rank – Rank of this process (TileDB/GenomicsDB partition idx)

inline void write (final String loaderJSONFile, final int rank)

Write to TileDB/GenomicsDB

Parameters

loaderJSONFile – GenomicsDB loader JSON configuration file
rank – Rank of this process (TileDB/GenomicsDB partition idx)

inline void coalesceContigsIntoNumPartitions (final int partitions)

Coalesce contigs into fewer GenomicsDB partitions

Parameters: partitions – Approximate number of partitions to coalesce into

Public Static Functions

static inline MultiChromosomeIterator columnPartitionIterator (FeatureReader< VariantContext > reader, final String loaderJSONFile, final int partitionIdx)

Utility function that returns a MultiChromosomeIterator for a given FeatureReader. The iterator covers the VariantContext objects provided by the reader that belong to the column partition specified by the loader JSON file and rank/partition index.

Parameters

reader – AbstractFeatureReader over VariantContext objects
loaderJSONFile – path to loader JSON file
partitionIdx – rank/partition index

Throws

IOException – when the reader’s query method throws an IOException
ParseException – when there is a bug in the JNI interface and a faulty JSON is returned

Returns

MultiChromosomeIterator that iterates over VariantContext objects in the reader belonging to the specified column partition

The following classes and interfaces are useful for specifying the import configuration.

class ImportConfig

This implementation extends what is in GenomicsDBImportConfiguration. It adds extra data that is needed for parallel import.

Subclassed by org.genomicsdb.model.CommandLineImportConfig

Public Functions

inline ImportConfig (final GenomicsDBImportConfiguration.ImportConfiguration importConfiguration, final boolean validateSampleToReaderMap, final boolean passAsVcf, final int batchSize, final Set< VCFHeaderLine > mergedHeader, final Map< String, URI > sampleNameToVcfPath, final Func< Map< String, URI >, Integer, Integer, Map< String, FeatureReader< VariantContext >>> sampleToReaderMapCreator, final boolean incrementalImport)

Main ImportConfig constructor

Parameters

importConfiguration – GenomicsDBImportConfiguration protobuf object
validateSampleToReaderMap – Flag for validating sample to reader map
passAsVcf – Flag for indicating that a VCF is being passed
batchSize – Batch size
mergedHeader – Required header
sampleNameToVcfPath – Sample name to VCF path map
sampleToReaderMapCreator – Function used for creating sampleToReaderMap
incrementalImport – Flag for indicating incremental import

template<T1, T2, T3, R> interface Func
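
Func takes three type parameters for its arguments and one for its result. In the ImportConfig constructor it is instantiated as a function from (Map< String, URI >, Integer, Integer) to a sample-to-reader map. The sketch below shows what such a three-argument function could look like. Several things here are assumptions, since this reference does not state them: the method name `apply`, the reading of the two Integer arguments as batch size and lower sample index, and the use of String placeholders instead of real FeatureReader objects (to keep the example free of external dependencies). `TriFunc`, `openBatch`, and `demoSamples` are hypothetical names.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FuncSketch {
    /** Local stand-in for a three-argument function; "apply" is an assumed name. */
    @FunctionalInterface
    public interface TriFunc<T1, T2, T3, R> {
        R apply(T1 first, T2 second, T3 third);
    }

    /**
     * Assumed interpretation: returns placeholder "readers" for one batch of
     * samples, starting at lowerSampleIndex and holding at most batchSize entries.
     */
    public static Map<String, String> openBatch(
            Map<String, URI> sampleNameToVcfPath, int batchSize, int lowerSampleIndex) {
        List<String> names = new ArrayList<>(sampleNameToVcfPath.keySet());
        int upper = Math.min(lowerSampleIndex + batchSize, names.size());
        Map<String, String> batch = new LinkedHashMap<>();
        for (int i = lowerSampleIndex; i < upper; i++) {
            String name = names.get(i);
            // A real creator would open a FeatureReader for this URI here.
            batch.put(name, "reader:" + sampleNameToVcfPath.get(name));
        }
        return batch;
    }

    /** Made-up sample names and paths, for demonstration only. */
    public static Map<String, URI> demoSamples() {
        Map<String, URI> samples = new LinkedHashMap<>();
        samples.put("s1", URI.create("file:///data/s1.vcf.gz"));
        samples.put("s2", URI.create("file:///data/s2.vcf.gz"));
        samples.put("s3", URI.create("file:///data/s3.vcf.gz"));
        return samples;
    }

    public static void main(String[] args) {
        TriFunc<Map<String, URI>, Integer, Integer, Map<String, String>> creator =
                FuncSketch::openBatch;
        System.out.println(creator.apply(demoSamples(), 2, 0).keySet());
        System.out.println(creator.apply(demoSamples(), 2, 2).keySet());
    }
}
```

Passing the creator as a function lets the importer open readers one batch at a time instead of holding a reader for every sample at once.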

class BatchCompletionCallbackFunctionArgument