Currently, I am working on the Sequence Assembly problem. Our objective is to reconstruct bacterial and viral genomes from millions of short reads. Typically, the genome length is on the order of 2-10 Mbp and the read length is about 35-200 bp. One can think of Sequence Assembly as a jigsaw puzzle: each read is a piece, and overlaps between reads indicate which pieces fit together.
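To make the jigsaw analogy concrete, here is a minimal sketch of a toy greedy overlap-merge assembler in Python. It is a classic textbook heuristic, not the model we are developing, and it assumes error-free reads from a single strand. The function names `overlap` and `greedy_assemble` and the `min_len` threshold are hypothetical, chosen only for illustration. Its quadratic pairwise comparison would not scale to millions of reads.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no such overlap of at least `min_len` characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of contigs with the largest overlap.

    Toy assumptions: reads are error-free, come from one strand, and no
    read is fully contained in another.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        # Compare every ordered pair; this is O(n^2) overlap computations.
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        # Join the two contigs, writing the shared region only once.
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    # Three overlapping reads drawn from a made-up 14 bp sequence.
    reads = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
    print(greedy_assemble(reads))
```

Even in this toy form, the greedy choice can go wrong when the genome contains repeats, which is one reason real assemblers need more careful models.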
Because of the massive amounts of data involved, the challenge lies in developing time- and space-efficient algorithms for the problem. Complex variations in genomes and errors in the reads pose further challenges. Sequence Assembly is inherently hard, and one must rely on heuristics to solve it. We are developing a new model for the problem, and it will be interesting to provide a theoretical explanation of, and probabilistic guarantees on, the quality of its results.
An efficient method for assembling bacterial and viral genomes can help biologists characterize various properties of microbial organisms and understand them better.