By: Natalie Vande Pol (PhD Candidate, Michigan State University)
I am a 5th year PhD student in the Microbiology and Molecular Genetics program at Michigan State University. This is the story of a side project that has been one of the most enjoyable and rewarding undertakings in my PhD career. CONSTAX was the first project on which I was a key contributor. The co-first authors both worked in community ecology and they wanted to develop a tool, but they needed some help writing Python scripts. That’s where I came in.
Community ecologists use a technique called amplicon sequencing, in which they extract DNA from a substrate (e.g., soil, plants, water) and sequence specific genes that they then use as a “barcode” to identify the organism from which the DNA originated (Figure 1). In bacteria, this barcode is the 16S ribosomal RNA gene. In fungi, we generally use one of two ribosomal regions: ITS1 or ITS2. Ecologists use these barcode sequences to study pooled communities of organisms, allowing comparison of community structure between different conditions (e.g., healthy vs. diseased gut/plant). Think of it like a census for soil fungi. These comparisons can sometimes point to organisms that are important in causing, preventing, detecting, or recovering from a given characteristic or disturbance.
One of the most important steps in a community analysis pipeline is to “translate” the barcode DNA sequences from the sample into the names of the organisms from which they originated. This is done by comparing sample sequences to reference sequences from known organisms, just as a barcode in a grocery store needs a computer reference to tell the cashier whether you are buying cilantro or parsley. With DNA sequences, the identification algorithm used to match up the sequences is called a classifier. Using different reference databases or different classifiers can yield different identifications.
To illustrate what happens with different classifiers, imagine you and two of your friends are all taking the same test. All three of you get 80/100 questions correct on the exam. However, when you compare your exams, you realize that while you all had 75 questions in common, the other 5 correctly answered questions were unique to each of you. So, on the surface your performances seem identical, but are in fact a bit different. Similarly, using a single classifier and different reference databases is analogous to each of you three taking the same exam having studied from three different textbooks (assuming otherwise identical performance). Your scores on the exam would probably vary.
Fortunately, for fungal research, UNITE is a well-curated reference sequence database, so the largest source of variation is between classifiers. Just as described in the first analogy above, different classifiers use different algorithms to assign taxonomies and estimate confidence/error rates, making it difficult to select a single classifier as the “best”. Therefore, our two community ecologists and I set out to develop a tool that eliminated the need to choose just one! If you and your two friends could collectively take that exam, you could have gotten 90/100, instead of just 80/100 on your own.
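To make the arithmetic of the exam analogy concrete, here is a small Python sketch. The question numbers are invented purely for illustration; the only figures taken from the analogy are 80 correct answers per person, 75 of them shared.

```python
# Each person answers 80 of 100 questions correctly: 75 shared, 5 unique.
# The specific question numbers are hypothetical.
shared = set(range(1, 76))                # questions 1-75, correct for everyone
you = shared | set(range(76, 81))         # plus questions 76-80
friend_a = shared | set(range(81, 86))    # plus questions 81-85
friend_b = shared | set(range(86, 91))    # plus questions 86-90

individual_scores = [len(s) for s in (you, friend_a, friend_b)]
combined = you | friend_a | friend_b      # pooling everyone's correct answers

print(individual_scores)  # [80, 80, 80]
print(len(combined))      # 90
```

The pooled score is 75 + 3 × 5 = 90, which is the kind of gain we hoped to get by combining classifiers.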
First, we chose the most commonly used and most recently developed classifiers: Ribosomal Database Project (RDP), UTAX, and SINTAX. We wrote a series of custom scripts to format the UNITE reference database to be compatible with each of the classifiers and ran our sequence datasets through each of the classifiers. Finally, we used Python scripts to standardize the output formats. This was all packaged and is automated by a single shell script constax.sh (Figure 2). Users simply place their input files in the specified folders and provide the names and desired parameters in a configuration file.
For each sequence, we compared the three assignments given for each taxonomic rank (Kingdom, Phylum, Class, Order, Family, Genus, and Species). If the confidence score for a given assignment was below a threshold value, that and all further taxonomic ranks were considered “Unidentified” for that sequence. In most cases, the three classifiers agreed on the taxonomy assigned. However, there were cases in which they disagreed, whether because one (or two) of the classifiers yielded an Unidentified, or because there were multiple different, confident assignments (Table 1). With three classifiers, we decided to implement a simple majority rule. Since classifiers provide an estimated confidence in taxonomic classifications, we used confidence scores to break ties.
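The voting logic described above can be sketched in a few lines of Python. This is a simplified illustration, not the code from our repository: the function names, the 0.8 threshold, the example assignments, and the choice to count “Unidentified” as an ordinary vote with zero confidence are all assumptions made for this sketch.

```python
from collections import Counter

RANKS = ["Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]
UNIDENTIFIED = "Unidentified"


def apply_threshold(assignment, threshold):
    """Mask the first low-confidence rank and every rank after it."""
    masked = []
    cut = False
    for name, conf in assignment:
        if cut or conf < threshold:
            cut = True
            masked.append((UNIDENTIFIED, 0.0))
        else:
            masked.append((name, conf))
    return masked


def consensus(assignments, threshold=0.8):
    """Majority vote per rank; ties go to the label with the highest confidence."""
    masked = [apply_threshold(a, threshold) for a in assignments]
    result = []
    for votes in zip(*masked):
        counts = Counter(name for name, _ in votes)
        best_conf = {}
        for name, conf in votes:
            best_conf[name] = max(conf, best_conf.get(name, 0.0))
        result.append(max(counts, key=lambda n: (counts[n], best_conf[n])))
    return result


# Hypothetical (name, confidence) pairs for one sequence, one list per classifier.
rdp = [("Fungi", 1.0), ("Ascomycota", 0.99), ("Sordariomycetes", 0.95),
       ("Hypocreales", 0.9), ("Nectriaceae", 0.85), ("Fusarium", 0.82),
       ("Fusarium oxysporum", 0.4)]
utax = [("Fungi", 1.0), ("Ascomycota", 0.9), ("Sordariomycetes", 0.6),
        ("Hypocreales", 0.5), ("Nectriaceae", 0.4), ("Fusarium", 0.3),
        ("Fusarium oxysporum", 0.2)]
sintax = [("Fungi", 1.0), ("Ascomycota", 0.98), ("Sordariomycetes", 0.93),
          ("Hypocreales", 0.88), ("Nectriaceae", 0.81), ("Gibberella", 0.9),
          ("Gibberella zeae", 0.85)]

result = consensus([rdp, utax, sintax])
for rank, name in zip(RANKS, result):
    print(f"{rank}: {name}")
```

In this made-up example, two classifiers outvote the third from Class through Family, the three-way disagreement at Genus is settled by the highest confidence score, and Species stays “Unidentified” because two of the three assignments fall below the threshold.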
We tested our tool on four different datasets from three different studies: barcode gene ITS1 or ITS2 of fungi from Soil or Plants (Figure 3). And it worked! Cross-referencing three classifiers corrected misassignments and improved overall performance. At the Kingdom level, the consensus taxonomy improved by only ~1% compared to any individual classifier. However, the finer taxonomic ranks showed much stronger improvement, on average 7-35%, depending on the taxonomic level and the individual classifier. The mean improvement in performance by CONSTAX over individual classifiers is slightly over-estimated due to particularly poor classification by UTAX, which had the most Unidentified levels.
What’s next for CONSTAX?
First, we would love to develop our tool to be compatible with bacterial community sequences. Fortunately, the classifiers were all written for bacterial community analysis in the first place! Unfortunately, the reference databases are either out of date or so poorly curated as to have misidentified reference sequences and some convoluted taxonomies. Bacteria seem to be renamed rather frequently and it’s difficult to know whether the assignment given is still correct. We focused our preliminary efforts on the SILVA database, as it is the most up-to-date, but it has some serious formatting issues, among other things. In theory, there should be 7 taxonomic ranks. A significant proportion of the SILVA taxa have 4-13 levels, requiring manual correction to determine the appropriate classification for each of the 7 expected levels. At least in fungi, the different taxonomic ranks have consistent suffixes that can be used to identify gaps/insertions and correctly place the ranks. In bacteria, suffixes only seem to be consistent within particular lineages, so I would only be able to fix one group at a time, and quite often the canonical seven taxonomic levels simply don’t exist for some bacterial lineages.
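To show what placing ranks by suffix looks like for fungi, here is a minimal Python sketch. It relies on the standard nomenclatural endings for fungal phylum (-mycota), class (-mycetes), order (-ales), and family (-aceae). The function name and the example lineage are hypothetical, and this is not the code we use in CONSTAX.

```python
# Standard endings for fungal ranks between Kingdom and Genus.
SUFFIX_TO_RANK = [
    ("mycota", "Phylum"),
    ("mycetes", "Class"),
    ("ales", "Order"),
    ("aceae", "Family"),
]


def place_by_suffix(lineage):
    """Assign names to Phylum/Class/Order/Family by suffix, ignoring extra levels."""
    placed = {rank: "Unidentified" for _, rank in SUFFIX_TO_RANK}
    for name in lineage:
        for suffix, rank in SUFFIX_TO_RANK:
            if name.endswith(suffix):
                placed[rank] = name
                break
    return placed


# Hypothetical 8-level lineage with inserted intermediate ranks and no family.
lineage = ["Fungi", "Dikarya", "Ascomycota", "Pezizomycotina",
           "Sordariomycetes", "Hypocreomycetidae", "Hypocreales", "Fusarium"]

placed = place_by_suffix(lineage)
print(placed)
```

The inserted levels (subkingdom, subphylum, subclass) are skipped because their endings do not match, and the missing family is left as “Unidentified”. A naive suffix check like this can be fooled by a genus that happens to end in one of these suffixes, and it has no way to handle lineages where the endings are not consistent, which is exactly the problem with bacteria.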
Secondly, we are very interested in incorporating new classifiers into our tool. UTAX, in particular, is becoming obsolete and had the highest rate of “Unidentified” taxonomic assignments. While this may make our tool look good, it’s not really representative of the best we can do. However, an even number of classifiers makes “voting” on a consensus assignment more complicated and we would prefer to have a more elegant and sound basis for breaking ties than just comparing confidence scores, since those metrics are each calculated slightly differently and don’t mean quite the same thing. It’s an excellent starting point, but future work in this area would be served by a more thorough evaluation of disagreements between classifiers.
If you’re interested in more detail or in using CONSTAX for your own research, this blog post is based on our publication (Gdanetz et al., 2017), and our code repository is on GitHub.
Agler MT, Ruhe J, Kroll S, Morhenn C, Kim S-T, Weigel D, et al. (2016) Microbial hub taxa link host and abiotic factors to plant microbiome variation. PLoS Biol. 14(1):e1002352–31.
Gdanetz K, Benucci GMN, Vande Pol N, Bonito G. (2017) CONSTAX: a tool for improved taxonomic resolution of environmental fungal ITS sequences. BMC Bioinformatics. 18(1):538.
Oliver AK, Callaham MA Jr, Jumpponen A. (2015) Soil fungal communities respond compositionally to recurring frequent prescribed burning in a managed southeastern US forest ecosystem. For Ecol Manag. 345:1–9.
Smith DP, Peay KG. (2014) Sequence depth, not PCR replication, improves ecological inference from next generation DNA sequencing. PLoS One. 9(2):e90234–12.