Java Import Package
The GenomicsDBImporter class is the main entry point for importing VCFs into GenomicsDB.
- class org.genomicsdb.importer.GenomicsDBImporter : public org.genomicsdb.importer.GenomicsDBImporterJni, public org.genomicsdb.importer.extensions.JsonFileExtensions, public org.genomicsdb.importer.extensions.CallSetMapExtensions, public org.genomicsdb.importer.extensions.VidMapExtensions
Java wrapper for vcf2genomicsdb that imports VCFs into GenomicsDB. All vid information is assumed to be set correctly by the user in the JSON files.
Public Functions
- inline GenomicsDBImporter (final String loaderJSONFile)
Constructor
- Parameters
loaderJSONFile – GenomicsDB loader JSON configuration file
- inline GenomicsDBImporter (final String loaderJSONFile, final int rank)
Constructor
- Parameters
loaderJSONFile – GenomicsDB loader JSON configuration file
rank – Rank of this process (TileDB/GenomicsDB partition idx)
- inline GenomicsDBImporter (final ImportConfig config)
Constructor that creates the required data structures from a list of GVCF files and a chromosome interval. This constructor was developed specifically for running chromosome interval imports in parallel.
- Parameters
config – Parallel import configuration
- Throws
FileNotFoundException – when files could not be read/written
- inline GenomicsDBVidMapProto.VidMappingPB getProtobufVidMapping ()
Function to return the vid mapping protobuf object.
- Returns
protobuf object for vid mapping
- inline void updateProtobufVidMapping (GenomicsDBVidMapProto.VidMappingPB vidMapPB)
Function to update the vid mapping protobuf object in the top-level config object. Used in cases where the VCF header doesn’t contain accurate information about how to parse fields, for instance for allele-specific annotations.
- Parameters
vidMapPB – vid mapping protobuf object that replaces the current one
- inline int addSortedVariantContextIterator (final String streamName, final VCFHeader vcfHeader, Iterator< VariantContext > vcIterator, final long bufferCapacity, final VariantContextWriterBuilder.OutputType streamType, final Map< Integer, SampleInfo > sampleIndexToInfo)
Add a sorted VC iterator as the data source. The caller must then:
1. Call setupGenomicsDBImporter() once all iterators are added
2. Call doSingleImport()
After these two calls the import is done.
- Parameters
streamName – Name of the stream being added - must be unique with respect to this GenomicsDBImporter object
vcfHeader – VCF header for the stream
vcIterator – Iterator over VariantContext objects
bufferCapacity – Capacity of the stream buffer in bytes
streamType – BCF_STREAM or VCF_STREAM
sampleIndexToInfo – map from sample index in the vcfHeader to a SampleInfo object, which contains the row index and globally unique name. Can be set to null, which implies that the mapping is stored in a callsets JSON file
- Returns
the stream index
- inline int addBufferStream (final String streamName, final VCFHeader vcfHeader, final long bufferCapacity, final VariantContextWriterBuilder.OutputType streamType, Iterator< VariantContext > vcIterator, final Map< Integer, SampleInfo > sampleIndexToInfo)
Add a buffer stream or VC iterator - internal function
- Parameters
streamName – Name of the stream being added - must be unique with respect to this GenomicsDBImporter object
vcfHeader – VCF header for the stream
bufferCapacity – Capacity of the stream buffer in bytes
streamType – BCF_STREAM or VCF_STREAM
vcIterator – Iterator over VariantContext objects - can be null
sampleIndexToInfo – map from sample index in the vcfHeader to a SampleInfo object, which contains the row index and globally unique name. Can be set to null, which implies that the mapping is stored in a callsets JSON file
- Returns
the stream index
- inline void setupGenomicsDBImporter ()
Set up the importer after all the buffer streams are added, but before any data is inserted into any stream. No more buffer streams can be added once setupGenomicsDBImporter() is called.
- Throws
IOException – if the modified callsets JSON cannot be written
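The ordering rules stated so far (stream names must be unique, and no stream can be added once setup has run) can be pictured with a small toy model. This is only a sketch: StreamRegistrySketch, addStream and setup are hypothetical names that do not exist in GenomicsDB, and the sequential index assignment and the exception types are assumptions, not documented behavior.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of the documented call-order rules; NOT the GenomicsDB implementation.
class StreamRegistrySketch {
    private final Map<String, Integer> streams = new LinkedHashMap<>();
    private boolean setupDone = false;

    // Registers a stream name and returns its index.
    int addStream(String streamName) {
        if (setupDone) {
            // Assumption: the real importer may signal this differently.
            throw new IllegalStateException("streams cannot be added after setup");
        }
        if (streams.containsKey(streamName)) {
            throw new IllegalArgumentException("duplicate stream name: " + streamName);
        }
        int index = streams.size(); // assumption: indexes are handed out sequentially from 0
        streams.put(streamName, index);
        return index;
    }

    // After this call the set of streams is frozen.
    void setup() {
        setupDone = true;
    }

    public static void main(String[] args) {
        StreamRegistrySketch registry = new StreamRegistrySketch();
        System.out.println(registry.addStream("sampleA")); // prints 0
        System.out.println(registry.addStream("sampleB")); // prints 1
        registry.setup();
        try {
            registry.addStream("sampleC");
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```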
- inline boolean add (VariantContext vc, final int streamIdx)
Write a VariantContext object to the stream. The write may fail if the buffer is full. It’s the caller’s responsibility to keep track of the VC object that was not written.
- Parameters
vc – VariantContext object
streamIdx – index of the stream returned by the addBufferStream() call
- Returns
true if the vc object was written successfully, false otherwise
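Because add() can return false when the buffer is full, the caller has to hold on to the rejected object and offer it again later. The sketch below shows one way to structure that bookkeeping. BoundedStream and its drain() method are hypothetical stand-ins, not GenomicsDB classes, and plain strings stand in for VariantContext objects; only the true/false contract of add() is taken from the description above.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for one buffer stream; it is NOT part of GenomicsDB.
class BoundedStream {
    private final int capacity; // must be > 0, otherwise no record can ever be accepted
    private final Deque<String> buffer = new ArrayDeque<>();
    int imported = 0;

    BoundedStream(int capacity) {
        this.capacity = capacity;
    }

    // Mirrors the documented contract: true if written, false if the buffer is full.
    boolean add(String record) {
        if (buffer.size() >= capacity) {
            return false;
        }
        buffer.addLast(record);
        return true;
    }

    // Stands in for an import pass that consumes everything currently buffered.
    void drain() {
        imported += buffer.size();
        buffer.clear();
    }
}

class AddRetrySketch {
    // Feeds every record into the stream and returns how many times add() was rejected.
    static int feedAll(List<String> records, BoundedStream stream) {
        int rejections = 0;
        String pending = null; // the record that was rejected and must be retried
        Iterator<String> it = records.iterator();
        while (pending != null || it.hasNext()) {
            String next = (pending != null) ? pending : it.next();
            if (stream.add(next)) {
                pending = null;
            } else {
                pending = next; // keep track of the unwritten record
                rejections++;
                stream.drain(); // free space, then retry the same record
            }
        }
        stream.drain(); // flush whatever is still buffered
        return rejections;
    }

    public static void main(String[] args) {
        BoundedStream stream = new BoundedStream(2);
        int rejections = feedAll(Arrays.asList("vc1", "vc2", "vc3", "vc4", "vc5"), stream);
        System.out.println("imported=" + stream.imported + " rejections=" + rejections);
    }
}
```

With a capacity of two and five records, the third and fifth records are each rejected once before being accepted on retry.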
- inline boolean doSingleImport ()
Use this only when iterators of VariantContext objects are not used: the data is written to the buffers directly, after which this function is called. See TestBufferStreamGenomicsDBImporter.java for an example.
- Throws
IOException – if the import fails
- Returns
true if the import process is done
- inline void executeImport ()
Import multiple chromosome intervals
- inline void executeImport (final int numThreads)
Import multiple chromosome intervals
- Parameters
numThreads – number of threads used to import partitions
- inline void doConsolidate (final int numThreads)
Consolidate all intervals/arrays in a given workspace into a single fragment
- Parameters
numThreads – number of threads to use to parallelize consolidation
- inline long getNumExhaustedBufferStreams ()
- Returns
the number of buffer streams for which new data must be supplied
- inline int getExhaustedBufferStreamIndex (final long i)
Get the buffer stream index of the i-th exhausted stream. There are mNumExhaustedBufferStreams exhausted streams, and the caller must provide data for the streams with indexes getExhaustedBufferStreamIndex(0), getExhaustedBufferStreamIndex(1), …, getExhaustedBufferStreamIndex(mNumExhaustedBufferStreams-1).
- Parameters
i – i-th exhausted buffer stream
- Returns
the buffer stream index of the i-th exhausted stream
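The enumeration described for getExhaustedBufferStreamIndex() amounts to a simple loop from 0 to the number of exhausted streams. The sketch below collects the stream indexes that need new data. ExhaustedStreamView is a hypothetical stand-in interface that copies only the two documented method signatures, so the example runs without GenomicsDB on the classpath; the fake implementation in main is illustrative data, not real importer output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in exposing only the two documented query methods.
interface ExhaustedStreamView {
    long getNumExhaustedBufferStreams();

    int getExhaustedBufferStreamIndex(long i);
}

class ExhaustedStreamsSketch {
    // Returns the buffer stream indexes for which the caller must supply new data.
    static List<Integer> streamsNeedingData(ExhaustedStreamView importer) {
        List<Integer> result = new ArrayList<>();
        long count = importer.getNumExhaustedBufferStreams();
        for (long i = 0; i < count; i++) {
            result.add(importer.getExhaustedBufferStreamIndex(i));
        }
        return result;
    }

    public static void main(String[] args) {
        final int[] exhausted = {3, 0, 7}; // made-up stream indexes for illustration
        ExhaustedStreamView fake = new ExhaustedStreamView() {
            @Override
            public long getNumExhaustedBufferStreams() {
                return exhausted.length;
            }

            @Override
            public int getExhaustedBufferStreamIndex(long i) {
                return exhausted[(int) i];
            }
        };
        System.out.println(streamsNeedingData(fake)); // prints [3, 0, 7]
    }
}
```

In a real import loop, each returned index would be passed as streamIdx to add() until that stream accepts no more data.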
- inline boolean isDone ()
Checks whether the import process has completed
- Returns
true if complete, false otherwise
- inline MultiChromosomeIterator columnPartitionIterator (FeatureReader< VariantContext > reader)
Utility function that, given a FeatureReader, returns a MultiChromosomeIterator over the VariantContext objects provided by the reader that belong to the column partition specified by this object’s loader JSON file and rank/partition index.
- Parameters
reader – AbstractFeatureReader over VariantContext objects
- Throws
IOException – when the reader’s query method throws an IOException
ParseException – when there is a bug in the JNI interface and a faulty JSON is returned
- Returns
MultiChromosomeIterator that iterates over VariantContext objects in the reader belonging to the specified column partition
- inline void write ()
Write to TileDB/GenomicsDB using the configuration specified in the loader file passed to the constructor
- inline void write (final int rank)
Write to TileDB/GenomicsDB using the configuration specified in the loader file passed to the constructor
- Parameters
rank – Rank of this process (TileDB/GenomicsDB partition idx)
- inline void write (final String loaderJSONFile, final int rank)
Write to TileDB/GenomicsDB
- Parameters
loaderJSONFile – GenomicsDB loader JSON configuration file
rank – Rank of this process (TileDB/GenomicsDB partition idx)
- inline void coalesceContigsIntoNumPartitions (final int partitions)
Coalesce contigs into fewer GenomicsDB partitions
- Parameters
partitions – Approximate number of partitions to coalesce into
Public Static Functions
- static inline MultiChromosomeIterator columnPartitionIterator (FeatureReader< VariantContext > reader, final String loaderJSONFile, final int partitionIdx)
Utility function that, given a FeatureReader, returns a MultiChromosomeIterator over the VariantContext objects provided by the reader that belong to the column partition specified by the loader JSON file and rank/partition index.
- Parameters
reader – AbstractFeatureReader over VariantContext objects
loaderJSONFile – path to loader JSON file
partitionIdx – rank/partition index
- Throws
IOException – when the reader’s query method throws an IOException
ParseException – when there is a bug in the JNI interface and a faulty JSON is returned
- Returns
MultiChromosomeIterator that iterates over VariantContext objects in the reader belonging to the specified column partition
The following classes are useful for specifying the import configuration.
- class ImportConfig
This implementation extends what is in GenomicsDBImportConfiguration by adding extra data that is needed for parallel import.
Subclassed by org.genomicsdb.model.CommandLineImportConfig
Public Functions
- inline ImportConfig (final GenomicsDBImportConfiguration.ImportConfiguration importConfiguration, final boolean validateSampleToReaderMap, final boolean passAsVcf, final int batchSize, final Set< VCFHeaderLine > mergedHeader, final Map< String, URI > sampleNameToVcfPath, final Func< Map< String, URI >, Integer, Integer, Map< String, FeatureReader< VariantContext >>> sampleToReaderMapCreator, final boolean incrementalImport)
Main ImportConfig constructor
- Parameters
importConfiguration – GenomicsDBImportConfiguration protobuf object
validateSampleToReaderMap – Flag for validating sample to reader map
passAsVcf – Flag for indicating that a VCF is being passed
batchSize – Batch size
mergedHeader – Required header
sampleNameToVcfPath – Sample name to VCF path map
sampleToReaderMapCreator – Function used for creating sampleToReaderMap
incrementalImport – Flag for indicating incremental import
- interface Func<T1, T2, T3, R>
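Func takes three type parameters for its inputs and one for its result, which matches the sampleToReaderMapCreator argument of the ImportConfig constructor (a map plus two integers in, a map out). The sketch below shows what such a three-argument functional interface and a matching lambda could look like. It is an illustration under assumptions: the method name apply, the reading of the two Integer arguments as batch size and batch index, and the use of String values in place of FeatureReader< VariantContext > are not taken from the source.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Assumed shape of a three-argument functional interface; the method name is an assumption.
@FunctionalInterface
interface Func<T1, T2, T3, R> {
    R apply(T1 t1, T2 t2, T3 t3);
}

class FuncSketch {
    // Hypothetical creator: selects the samples of one batch and maps each to a
    // placeholder string (a real creator would open a reader for the URI instead).
    static final Func<Map<String, URI>, Integer, Integer, Map<String, String>> BATCH_CREATOR =
        (paths, batchSize, batchIndex) -> {
            Map<String, String> batch = new LinkedHashMap<>();
            int start = batchSize * batchIndex; // assumption: batches are consecutive slices
            int position = 0;
            for (Map.Entry<String, URI> entry : paths.entrySet()) {
                if (position >= start && position < start + batchSize) {
                    batch.put(entry.getKey(), "reader for " + entry.getValue());
                }
                position++;
            }
            return batch;
        };

    public static void main(String[] args) {
        Map<String, URI> paths = new LinkedHashMap<>();
        paths.put("s1", URI.create("file:///data/s1.vcf"));
        paths.put("s2", URI.create("file:///data/s2.vcf"));
        paths.put("s3", URI.create("file:///data/s3.vcf"));
        // Second batch of size 2 contains only the third sample.
        System.out.println(BATCH_CREATOR.apply(paths, 2, 1).keySet()); // prints [s3]
    }
}
```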
- class BatchCompletionCallbackFunctionArgument