This week’s BEACON Researchers at Work blog post is by University of Idaho graduate student Ilya Zhbannikov.
I graduated from the Moscow Aviation Institute (National Research University, Russia) with a Master's degree in Information Systems in 2009. After a year of working as a software developer, I joined the University of Idaho (USA) and graduated with a Master's degree in Computer Engineering (2012). Currently I am pursuing a PhD in Bioinformatics and Computational Biology at the same university and expect to finish it in 2014. I have a wide range of research interests: high-throughput sequencing data analysis, parallel computing, biomedical text mining and phylogeny.
The goal of my research in high-throughput sequencing data processing is to understand, analyze and improve the data so that it can be used in subsequent stages of research. With newly developed next-generation sequencing technologies and increasing interest in gene discovery, DNA mapping, functional genomics and genome annotation, the amount of data produced by sequencing is growing exponentially, doubling every month. Sequence data taken “as is” from automatic sequencing machines should not be considered “ready-to-use” for analysis, because various contaminants remain after sequencing. An additional cleaning step that filters out such remnants is needed to prepare reads for further analysis. To provide this service, I propose SeqyClean, a software tool to clean next-generation sequence data.
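To make the cleaning step concrete, here is a minimal sketch of what filtering a single read can involve. It is not SeqyClean's actual algorithm: the function name `trim_read`, the exact-match adapter search, and the quality threshold of 20 are all assumptions chosen for illustration.

```python
def trim_read(seq, quals, adapter, min_q=20):
    """Return a cleaned (sequence, qualities) pair for one read.

    Hypothetical helper for illustration only; real cleaning tools
    also handle inexact adapter matches, vectors, and paired reads.
    """
    # Cut the read at the first exact occurrence of the adapter, if any.
    cut = seq.find(adapter)
    if cut != -1:
        seq, quals = seq[:cut], quals[:cut]

    # Drop low-quality bases from the 3' end.
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1

    # Drop low-quality bases from the 5' end.
    start = 0
    while start < end and quals[start] < min_q:
        start += 1

    return seq[start:end], quals[start:end]


# Toy read: 8 genomic bases followed by a 10-base adapter.
read = "ACGTACGT" + "AGATCGGAAG"
scores = [5, 30, 30, 30, 30, 30, 10, 8] + [30] * 10
clean_seq, clean_quals = trim_read(read, scores, adapter="AGATCGGAAG")
print(clean_seq)
```

The adapter is removed first so that its high-quality bases cannot shield low-quality genomic bases at the 3' end from trimming.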
It has also become possible to sequence an entire genome quickly and inexpensively. However, in some experiments one only needs to extract and assemble a portion of the sequence reads, for example when performing transcriptome studies, sequencing mitochondrial genomes, or characterizing exomes. With the DNA library of a complete genome, one might think it would be no problem to identify the reads of interest. But it is not always easy to incorporate well-known tools such as BLAST, BLAT, Bowtie, and SOAP directly into a bioinformatics pipeline before the assembly stage, either due to incompatibility with the assembler’s file inputs, or because it is desirable to incorporate information that must be extracted separately. I am working on a tool, SlopMap, which can identify the reads of interest from the given DNA library.
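One simple way to pick out reads of interest before assembly is to compare the k-mers of each read against the k-mers of a reference sequence. The sketch below illustrates that general idea only; it should not be read as a description of SlopMap's internals, and the names `kmers` and `select_reads` as well as the parameters `k` and `min_fraction` are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def select_reads(reads, reference, k=8, min_fraction=0.5):
    """Keep reads that share enough of their k-mers with the reference.

    Illustrative assumption: only the forward strand is checked, so a
    real pipeline would also index the reverse complement.
    """
    ref_kmers = kmers(reference, k)
    selected = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k, nothing to compare
        shared = len(read_kmers & ref_kmers)
        if shared / len(read_kmers) >= min_fraction:
            selected.append(read)
    return selected


# Toy data: the first read is a substring of the reference.
reference = "ACGTTGCAAGGCTTAACCGGTT"
reads = ["TGCAAGGCTTAA", "GGGGGGGGGGGG", "ACG"]
print(select_reads(reads, reference))
```

Because the output is simply a filtered list of reads, a step like this can sit in front of an assembler without changing the assembler's input format.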
The goal of my research in biomedical text mining is to demonstrate how methods from Systems Biology, along with newly developed text mining techniques, can be applied to publication abstracts to discover hidden relationships and features within microbial communities. Recently, microorganisms such as bacteria, and whole bacterial communities, have become models for Systems Biology and Bioinformatics. Recent studies of the vaginal microbiome show improvements in learning and classification of microbiota but still lack a systematic approach to this problem. At the same time, many researchers still use very general problem-solving techniques that consist of systematically enumerating all possible candidates for the solution. Such an approach can be ineffective and tends to delay future discoveries. To alleviate this bottleneck, I propose a tool intended to reduce the range of possible solutions and suggest hypotheses by taking advantage of previously published work along with newly developed text-mining algorithms and graph theory. I have developed a beta version of an application, BALMNet, which provides these services by constructing microbial interaction networks from a set of PubMed abstracts.
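As a rough illustration of how a network can be derived from abstracts, the sketch below links two taxa whenever they are mentioned in the same abstract and weights the edge by the number of such abstracts. This is a generic co-occurrence approach, not BALMNet's method; the function `cooccurrence_edges`, the placeholder taxon names, and the toy abstracts are all invented for the example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(abstracts, taxa, min_count=1):
    """Count how often each pair of taxa is mentioned in the same abstract.

    Returns a dict mapping (taxon_a, taxon_b) to a co-mention count.
    Matching is plain case-insensitive substring search, which is an
    assumption; real text mining would need proper entity recognition.
    """
    counts = Counter()
    for text in abstracts:
        lowered = text.lower()
        # Sort so that each pair always appears in the same order.
        mentioned = sorted(t for t in taxa if t.lower() in lowered)
        for pair in combinations(mentioned, 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Placeholder abstracts and taxon names, made up for this sketch.
abstracts = [
    "Genus alpha inhibits Genus beta in culture.",
    "Genus beta and Genus gamma were found together.",
    "Both Genus alpha and Genus beta were detected.",
]
taxa = ["Genus alpha", "Genus beta", "Genus gamma"]
print(cooccurrence_edges(abstracts, taxa, min_count=2))
```

Raising `min_count` prunes weakly supported edges, which is one way to narrow the set of candidate interactions worth examining.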
Many comprehensive phylogenetic hypotheses have already been introduced to solve the problem of combining trees computed from data from different loci into one “supertree.” A supertree contains all taxa, while the smaller input trees each contain only a small part of a phylogeny and are often incompatible with one another. Input trees may not incorporate enough data for a perfect supertree, which leads to multiple supertrees and makes it impossible to distinguish among the constructed trees and choose the best one. The concept of Phylogenetic Decisiveness presented by Steel and Sanderson (Steel M., Sanderson M., “Characterizing phylogenetically decisive taxon coverage,” Appl Math Letters 23:82–86) addresses this problem by employing a special criterion, decisiveness, to estimate a unique phylogeny for the given taxa. I created the program decisivatoR, an R infrastructure package inspired by the work of Steel, Sanderson and Fischer. However, the problem of determining what to add to the data to produce a unique evolutionary tree still remains open. The web version of decisivatoR (see figure below) is available here: http://glimmer.rstudio.com/izhbannikov/DecisivatoR/.
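For small data sets, decisiveness can be checked by brute force. The sketch below assumes the criterion takes the form of the four-way partition property for unrooted trees: a collection of taxon sets (one per locus) is decisive if, for every partition of the taxa into four non-empty blocks, at least one set contains a taxon from each block. This is an illustrative Python sketch written under that assumption, not code from decisivatoR, and the function name `is_decisive` is hypothetical.

```python
from itertools import product


def is_decisive(taxa, coverage):
    """Brute-force test of the four-way partition property.

    taxa:     iterable of taxon names
    coverage: iterable of taxon sets, one per locus
    Runs in O(4^n) time, so it is only practical for a handful of taxa.
    """
    taxa = list(taxa)
    sets = [set(s) for s in coverage]
    # Assign every taxon to one of four blocks in all possible ways.
    for labels in product(range(4), repeat=len(taxa)):
        if len(set(labels)) < 4:
            continue  # some block is empty, so this is not a four-way partition
        block_of = dict(zip(taxa, labels))
        # Some locus must contain taxa from all four blocks.
        if not any(
            len({block_of[t] for t in s if t in block_of}) == 4 for s in sets
        ):
            return False
    return True


taxa = "abcde"
print(is_decisive(taxa, [set("abcde")]))
print(is_decisive(taxa, [set("abcd"), set("abce")]))
```

In the second call neither locus contains both `d` and `e`, so a partition that separates `a`, `b`, `d` and `e` into different blocks is hit by no single locus, and the coverage is reported as not decisive.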
For more information please check my blog: http://bioalgo.blogspot.com/