BEACON Team wins Best Paper Award in Evolutionary Machine Learning Track at GECCO 2019

Zhichao Lu and colleagues accepting the Best Paper Award at GECCO 2019

Congratulations to BEACONites Zhichao Lu, Ian Whalen, Vishnu Boddeti, Yashesh Dhebar, Kalyanmoy Deb, Erik Goodman, and Wolfgang Banzhaf! Their paper “NSGA-Net: Neural Architecture Search using Multi-Objective Genetic Algorithm” won the Best Paper Award in the Evolutionary Machine Learning track at GECCO 2019 in Prague.

In total, 64 papers were submitted to the Evolutionary Machine Learning (EML) track, only 16 of which were accepted as full papers. Two papers were nominated for the Best Paper Award. Zhichao Lu and colleagues won the award based on on-site voting by conference attendees.

Here is the abstract of the paper, which can be accessed from arXiv: https://arxiv.org/abs/1810.03522

This paper introduces NSGA-Net – an evolutionary approach for neural architecture search (NAS). NSGA-Net is designed with three goals in mind: (1) a procedure considering multiple and conflicting objectives, (2) an efficient procedure balancing exploration and exploitation of the space of potential neural network architectures, and (3) a procedure finding a diverse set of trade-off network architectures achieved in a single run. NSGA-Net is a population-based search algorithm that explores a space of potential neural network architectures in three steps, namely, a population initialization step that is based on prior-knowledge from hand-crafted architectures, an exploration step comprising crossover and mutation of architectures, and finally an exploitation step that utilizes the hidden useful knowledge stored in the entire history of evaluated neural architectures in the form of a Bayesian Network. Experimental results suggest that combining the dual objectives of minimizing an error metric and computational complexity, as measured by FLOPs, allows NSGA-Net to find competitive neural architectures. Moreover, NSGA-Net achieves a comparable error rate on the CIFAR-10 dataset when compared to other state-of-the-art NAS methods while using orders of magnitude less computational resources. These results are encouraging and show the promise to further use of EC methods in various deep-learning paradigms.
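To make the idea of a "diverse set of trade-off network architectures" concrete, the sketch below filters a handful of candidates down to those that no other candidate beats on both error and FLOPs. It is only an illustration of Pareto dominance, not the NSGA-Net implementation (which also involves crossover, mutation, and a Bayesian Network); the candidate names and numbers are invented for the example.

```python
from typing import List, Tuple

# (name, error rate, FLOPs in millions); all values are made up for illustration
Candidate = Tuple[str, float, float]


def dominates(p: Candidate, q: Candidate) -> bool:
    """Return True if p is no worse than q on both objectives and better on one."""
    no_worse = p[1] <= q[1] and p[2] <= q[2]
    strictly_better = p[1] < q[1] or p[2] < q[2]
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Hypothetical architectures: lower error and lower FLOPs are both better.
EXAMPLE_CANDIDATES: List[Candidate] = [
    ("net_a", 0.040, 600.0),
    ("net_b", 0.035, 900.0),
    ("net_c", 0.050, 300.0),
    ("net_d", 0.045, 700.0),  # worse than net_a on both objectives
    ("net_e", 0.036, 950.0),  # worse than net_b on both objectives
]

if __name__ == "__main__":
    for name, error, flops in pareto_front(EXAMPLE_CANDIDATES):
        print(f"{name}: error={error:.3f}, MFLOPs={flops:.0f}")
```

The surviving candidates are the trade-off set: each one is either more accurate or cheaper than every other survivor, so choosing among them depends on how much compute one is willing to spend.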

The source code for the paper can be accessed from GitHub: https://github.com/ianwhale/nsga-net


Genome Hackers – a near-peer, interdisciplinary summer program for high school girls

By: Cindy Yeh, Graduate Student (Dunham Lab, Genome Sciences), University of Washington

Only 26% of the professional computing workforce is made up of women, fewer than 10% of whom are women of color (ncwit.org). This is in contrast to the gender distribution in the life sciences, which is much closer to 50%. As technology continues to play an increasingly important role in our lives, addressing this gender disparity by giving young women access and exposure to computational thinking early is imperative.

I was introduced to programming as a high schooler, but never really learned how to code until I started my PhD program at the University of Washington in the Genome Sciences Department. Programming felt more intuitive when I was trying to implement a biological concept, such as finding the longest matching pattern in a DNA sequence using a suffix array or extracting information from FASTA files. Learning computer science can be intimidating, but I figured if this method allowed me to better understand its logic, it could be a great way to introduce young women to coding and make technical fields more accessible. Indeed, many research studies have found that integrated approaches are much more effective than traditional, non-interdisciplinary curricula. However, developing integrated lessons requires many hours of professional development for which many teachers may not have time (Lin et al., 2018; Struyf et al., 2019; Salami et al., 2015; Stohlmann et al., 2012; Thibaut et al., 2018).

In 2017, as first-year graduate students, my colleague Andria Ellis and I received a small grant from the National Center for Women & Information Technology (NCWIT) to run a one-week, half-day summer camp for high school girls called Genome Hackers. Our goal was not for participants to walk away as experts in computer science or genomics, but to introduce them to concepts that they otherwise would never have the opportunity to learn before college. The idea was that if they encountered these topics in the future, the topics would seem less abstract or intimidating. We also wanted to teach real-world applications of computer science and how it is used specifically in genomics. Guided by a team of graduate student instructors, our participants learned how to perform PCR to isolate and amplify a particular gene and then Sanger sequence the PCR product to retrieve raw sequences. Simultaneously, we taught them the basics of programming in Python. By the end of the camp, the participants had written transcription and translation scripts that take their Sanger sequencing results directly and determine the amino acid sequence of their gene. Furthermore, they shared their sequencing results with other students and generated a phylogenetic tree to investigate the relatedness of the same gene across species. They also used their final amino acid sequence to generate a predicted protein structure, which they compared across species as well.
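For readers curious what a transcription and translation script might look like, here is a minimal Python sketch. It is not the camp's actual curriculum code: the function names are hypothetical, the input is assumed to be the coding strand of the gene, translation is assumed to start at the first AUG, and the example sequence is invented.

```python
from itertools import product

# Standard genetic code, listed with bases in the order U, C, A, G.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # codons starting with U
    "LLLLPPPPHHQQRRRR"  # codons starting with C
    "IIIMTTTTNNKKSSRR"  # codons starting with A
    "VVVVAAAADDEEGGGG"  # codons starting with G
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Turn a coding-strand DNA sequence into mRNA by replacing T with U."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate from the first AUG until a stop codon or the end of the read."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        # Unknown codons (for example, ones containing N) become "X".
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    read = "CCATGGCCATTGTATAAGG"  # made-up sequence, not real camp data
    print(translate(transcribe(read)))
```

Mapping unknown codons to "X" rather than raising an error is a design choice for this sketch, since raw Sanger reads often contain ambiguous bases.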

Figure 1: 2017 participants learning how to pipette

Genome Hackers culminates in a poster session where the students share their many accomplishments over the course of the week with scientists in the department (and with their family and friends!). This really helps tie the week together, and participants walk away with something concrete that they can show off. Furthermore, our camp is affordable ($50/week, with scholarships available). In contrast, fees at many other biotechnology camps can be a deciding factor for applicants: they usually cost at least $300 per week, and sometimes upwards of $500 per participant!

Figure 2: Participants working hard on their transcription and translation scripts

After receiving overwhelmingly positive feedback from graduate students, faculty, teachers, and parents, we will be running Genome Hackers in 2019 for its third year in a row. We are also running iterations of this camp through two other campuses (SoundBio Labs and University of Chicago). Here we will determine what aspects of our current curriculum are easy to implement and what areas need improvement. Our final goal is to package our program into something any high school biology teacher or graduate student can pick up and implement on their own without my or Andria’s presence.

Figure 3: 2017 participants presenting their findings to teachers, parents, and scientists at the University of Washington

Figure 4: 2018 participants presenting their findings to teachers, parents, and scientists at the University of Washington

Several of our former participants have now also participated in Girls Who Code at Fred Hutch or gone on to pursue technical degrees. One former participant has even returned to Genome Hackers as a near-peer mentor and may lead her own session this year. I never would have guessed that this was something I would accomplish (or even want to accomplish) as a graduate student. While I did put a lot of energy towards outreach and service as an undergraduate, being able to take what I have learned in the lab as a graduate student and materialize it into teaching high school students has been one of the most rewarding activities I’ve ever pursued in and outside of my scientific career. Andria and I are also both very lucky that our PIs (Cole Trapnell and Maitreya Dunham, respectively) appreciate outreach activities and continue to encourage us to pursue them.

Figure 5: Group photo from 2018. Cindy and Andria are at the ends of the front row.

We are always searching for new ideas or collaborators who may be interested in running their own version of Genome Hackers. We have a website (genomehackers.org) and an e-mail address (genomehackersuw@gmail.com) and are very interested in hearing your comments.

Figure 6: Students’ confidence and interest levels before and after Genome Hackers

Participant Testimonials:

“I have been taught coding before, but I feel like […this program] introduced a new coding language very well.”

“I liked how I got to see how programming aided genome scientists.”

“My favorite part was getting to learn a new coding language, and combining two of my passions.”

“I wasn’t very interested in coding, but after actually doing some coding I now really like it and I might look into doing coding for a career with biology.”

“I will remember creating my first science poster. It felt amazing learning how to reach a conclusion and finally getting to have something to show for it.”

“I was really proud of myself for figuring out how to code a DNA strand into RNA.”

References

Lin, Y.-T., Wang, M.-T., Wu, C.-C., 2018. Design and Implementation of Interdisciplinary STEM Instruction: Teaching Programming by Computational Physics. The Asia-Pacific Education Researcher 28, 77–91. doi:10.1007/s40299-018-0415-0

Salami, M.K.A., Makela, C.J., Miranda, M.A.D., 2015. Assessing changes in teachers’ attitudes toward interdisciplinary STEM teaching. International Journal of Technology and Design Education 27, 63–88. doi:10.1007/s10798-015-9341-0

Stohlmann, M., Moore, T., Roehrig, G., 2012. Considerations for Teaching Integrated STEM Education. Journal of Pre-College Engineering Education Research 2, 28–34. doi:10.5703/1288284314653

Struyf, A., Loof, H.D., Pauw, J.B.-D., Petegem, P.V., 2019. Students’ engagement in different STEM learning environments: integrated STEM education as promising practice? International Journal of Science Education 41, 1387–1407. doi:10.1080/09500693.2019.1607983

Thibaut, L., Knipprath, H., Dehaene, W., Depaepe, F., 2018. The influence of teachers’ attitudes and school context on instructional practices in integrated STEM education. Teaching and Teacher Education 71, 190–205. doi:10.1016/j.tate.2017.12.014


The devil in the closet

By: Dr. Wenying Shou – Fred Hutchinson Cancer Research Center

Sometimes in science, a seemingly straightforward journey can take an enormous amount of time. Our paper in PLoS Biology (Hart et al., 2019) was one such journey. The question seemed easy enough: for a highly simplified microbial community (two yeast strains engineered to help each other, or “cooperate”), could we predict how fast the community might grow?

If you think that this question is esoteric, it is not. Cooperation is surprisingly common in biology: pathogenic bacteria cooperate with each other to launch infections; microbes in sewage treatment sludge cooperate to break down wastes. The faster a community can grow, the more likely it will survive perturbations or advance to new territories. Ultimately, a quantitative understanding of microbial communities will empower us to control and use communities as, for example, probiotics.

One remarkable aspect of this work — the very long and turbulent gestation — is invisible from the data themselves. When responding to journal reviewers’ critiques, I had the urge to write down this untold story of scientific discovery.

A humble dream

The project started when I was a postdoc about 17 years ago. To tie biology with mathematics, I joined a physicist’s lab at the Rockefeller University in New York City. I wanted to see how interacting “parts” of a biological system might generate quantitative properties of the system as a “whole”.

All biological systems consist of parts. For example, an ecological community consists of interacting species, and the human body consists of different cell types. A quantitative understanding of how a biological system works can be very powerful. For example, it could help us predict what would happen if we were to perturb a part.

A mathematical model consists of one or more equations. An equation describes how different quantities are linked to each other. For example, how fast population size changes equals how fast new members are added through birth, minus how fast the existing members die. The growth and death rates are examples of model parameters.
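The population example above can be written as a single equation. This is a textbook form rather than one taken from the paper, and it assumes that births and deaths are both proportional to the current population size $N$, with $\beta$ the per-capita birth rate and $\delta$ the per-capita death rate (symbols chosen here for illustration):

```latex
\frac{dN}{dt} = \beta N - \delta N = (\beta - \delta)\,N
```

Here $\beta$ and $\delta$ are the model parameters. Under this assumption the population grows exponentially when $\beta > \delta$ and shrinks when $\beta < \delta$.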

Back then, people had been modeling biological systems such as ecological communities, gene regulatory networks, and the cell division cycle. Some models matched data beautifully. However, the renowned mathematician and computer scientist John von Neumann once stated “with four parameters I can fit an elephant, and with five I can make him wiggle his trunk.” In other words, given enough “free parameters” (parameters one can choose freely rather than constrain with experimental measurements), a model can be made to fit any data. Although a fitted model can explain data, that does not mean the model is correct or can predict new data.

To avoid the “free parameter” problem, I decided to start with a very simple system. In such a system, I should know exactly how parts interact with each other. I could then write down the equations, know which parameters need to be measured, and measure all parameters. After much deliberation, I decided to engineer a highly simplified cooperative yeast community consisting of two strains, each supplying the other with an essential metabolite. My colleagues and I thought of a lovely name for it: CoSMO, short for Cooperation that is Synthetic and Mutually Obligatory. Unlike real-life communities where scientists often have trouble counting the number of species, CoSMO has two and only two strains. Unlike real-life communities where species influence each other by releasing many uncharacterized chemicals, in CoSMO each strain releases only one metabolite, which is required and consumed by the partner. Moreover, in CoSMO, the two strains coexist due to their interdependence, and thus I do not need to worry about losing either of them.

Now that I know how the two strains interact with each other, it should in principle be easy to predict community properties, such as how fast the community grows — or community growth rate. Community growth rate primarily depends on two traits of each strain: the metabolite release rate and the amount of metabolite consumed per birth. Thus, modeling community growth rate boils down to measuring four parameters. This is by no means ambitious!
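To see how four parameters can determine a community growth rate, consider a deliberately stripped-down toy model. It is not the model from Hart et al. (2019): it assumes every released metabolite is immediately consumed by the partner, ignores cell death, and uses invented parameter values. Under those assumptions, each strain's birth rate equals the partner's total release divided by the amount consumed per birth, and the community settles into exponential growth at rate sqrt(r_A * r_B / (c_A * c_B)). The sketch checks that formula against a simple simulation.

```python
import math


def predicted_growth_rate(release_a, release_b, consumed_a, consumed_b):
    """Long-term growth rate (per hour) of the toy two-strain model."""
    return math.sqrt((release_a * release_b) / (consumed_a * consumed_b))


def simulated_growth_rate(release_a, release_b, consumed_a, consumed_b,
                          total_hours=200.0, dt=0.01):
    """Euler simulation; growth rate is measured over the second half of the run."""
    a, b = 1.0, 1.0  # arbitrary starting cell numbers
    n_steps = int(round(total_hours / dt))
    half = n_steps // 2
    total_at_half = None
    for step in range(n_steps):
        if step == half:
            total_at_half = a + b
        # Births of each strain are limited by what the partner releases.
        births_a = release_b * b / consumed_a * dt
        births_b = release_a * a / consumed_b * dt
        a, b = a + births_a, b + births_b
    elapsed = (n_steps - half) * dt
    return math.log((a + b) / total_at_half) / elapsed


if __name__ == "__main__":
    # Made-up values: release in fmol/cell/hour, consumption in fmol per birth.
    params = dict(release_a=0.3, release_b=0.1, consumed_a=2.0, consumed_b=1.5)
    print("predicted:", predicted_growth_rate(**params))
    print("simulated:", simulated_growth_rate(**params))
```

The point of the toy is only that, once the interaction is known, the prediction has no free parameters: all four numbers are in principle measurable.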

Mission aborted

I measured all four parameters. I measured the metabolite release and consumption traits of each strain in the absence of its partner. I got rid of the partner so that the released metabolites would accumulate in the test tube for me to measure, instead of being immediately consumed by the partner. However, since the partner was not present, the measurement environment (called the “batch culture” environment) differed from the community environment. For example, to measure metabolite consumption in batch cultures, I would add a high dose of metabolite at the beginning of an experiment. In contrast, in communities, the consumed metabolite is constantly supplied by the partner at a low level. Measuring strain traits in a community-like environment would require a special experimental setup. So, I had to hope that the batch culture environment could approximate the community environment.

After measuring the four parameters, I predicted community growth rate. However, my prediction was far off from the experimental results. I was disappointed. In theory, a mathematical model is useful when it fails, because failure suggests that we are still missing important pieces. In reality, I was far from thrilled by the failure, because too many pieces could be missing, even in a system as simple as CoSMO. For example, the batch culture environment might not approximate the community environment; cells could be evolving… The problem immediately became monstrously messy and inelegant: a devil.

Eventually, I aborted the mission. I was forced to ask whether I could explain some other community property: the minimal total cell density required for the community to start to grow. That calculation required measuring more parameters, such as cell growth rates at various metabolite concentrations. I did not have the experimental setup for such measurements, so I gave in to free parameters. I did what was, and still is, commonly done: I looked for literature values. The literature values varied over an order of magnitude, so I naturally chose the value that could explain my data. I felt guilty, but comforted myself by noting that, at least, the free parameter I chose was not outrageous, and that the fraction of free parameters in my model was far lower than in most other models. This reasoning did not exonerate me, but it helped bring closure to my postdoc project (Shou et al., 2007).

Haunted by the devil

When I started my lab at the Hutch, I promptly locked up the devil in a closet. I felt caught: on the one hand, grant reviewers kept punishing me because “CoSMO is too simple”, yet on the other hand, I could not even understand a very basic property of the community. My modeling failure was indeed humiliating. There was no way I could rephrase the question in an exciting fashion to attract anyone, possibly even including myself.

My group started working on other more sexy problems, such as how the two cooperating strains might fend off cheaters who consume but do not contribute metabolites (Momeni et al., 2013b; Waite and Shou, 2012).

Despite group members’ successes, the devil kept haunting me. Babak Momeni, then a postdoctoral fellow in my lab, examined spatial patterning in CoSMO when CoSMO grew on an agarose pad. When we compared patterns predicted by our model versus patterns observed in experiments, they looked similar in a qualitative sense. However, the timing looked very different. This is not surprising given that we do not understand how fast the community grows. Fortunately, dynamics was not the focus of that paper, so we erased all time stamps from our simulations (Momeni et al., 2013a).

Figure 1. CoSMO patterning. The two cooperating strains were engineered to express green or red fluorescent proteins, and can thus be distinguished under a microscope. Time stamp was shown for experiment (right) and not simulation (left). This figure is from (Momeni et al., 2013a).

Devil breaking loose

Eventually, the devil of my past failure would not allow me to ignore it any further.

Chi-Chun Chen and Jose Pineda, two talented group members, were quantifying metabolite release rates of evolved cells. They wanted to see whether cells could evolve to be more “generous” by releasing more. However, Chi-Chun and Jose were getting highly variable results despite their superb experimental skills. It seemed that we got stuck when the question turned quantitative.

We suspected that the variable measurement results could be due to cell traits being highly sensitive to the measurement environment. To enable measurements in a community-like environment, David Skelding, a physicist in the lab, started to build devices called “chemostats”. In chemostats, nutrients are supplied in small doses (small drops) but frequently (at intervals of tens of seconds), mimicking the partner strain’s slow but constant metabolite release. It took David a good year or more to ensure that the chemostats worked reliably and precisely (Skelding et al., 2018).
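Why does a slow, steady nutrient supply pin cells at a slow growth rate? A textbook chemostat model gives the intuition: at steady state, cells must grow exactly as fast as they are washed out, so the growth rate equals the dilution rate set by the pump. The sketch below illustrates this with Monod growth kinetics. It is a generic model, not the one used for the lab's devices, and every parameter value is invented.

```python
def monod_growth_rate(metabolite, mu_max, k_s):
    """Growth rate (per hour) as a saturating function of metabolite level."""
    return mu_max * metabolite / (k_s + metabolite)


def simulate_chemostat(dilution_rate, feed_conc=10.0, mu_max=0.5, k_s=1.0,
                       consumed_per_birth=1.0, hours=500.0, dt=0.01):
    """Euler simulation of cells and limiting metabolite in one chamber."""
    cells, metabolite = 0.1, feed_conc
    for _ in range(int(round(hours / dt))):
        mu = monod_growth_rate(metabolite, mu_max, k_s)
        # Cells grow at rate mu and are washed out at the dilution rate.
        d_cells = (mu - dilution_rate) * cells
        # Metabolite flows in with fresh medium, flows out, and is consumed.
        d_metabolite = (dilution_rate * (feed_conc - metabolite)
                        - consumed_per_birth * mu * cells)
        cells += d_cells * dt
        metabolite += d_metabolite * dt
    return cells, metabolite


if __name__ == "__main__":
    for dilution_rate in (0.05, 0.1, 0.2):
        cells, metabolite = simulate_chemostat(dilution_rate)
        mu = monod_growth_rate(metabolite, mu_max=0.5, k_s=1.0)
        print(f"dilution {dilution_rate:.2f}/h -> growth rate {mu:.4f}/h")
```

In this model, changing the pump speed is enough to dial in any growth rate below the maximum, which is the sense in which a chemostat can mimic the slow growth seen in a community.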

 

Figure 2. Chemostats. This home-made multi-plexed chemostat has eight culturing chambers (tubes with yellow stoppers). The syringe pump on the left pushes the fresh medium into chambers through tubing. Sterile humidified air was also introduced into the chambers to push out excess waste. This figure is from (Skelding et al., 2018).

Taming the devil

When Sam Hart joined my lab as a research technician, he inherited the problems that Chi-Chun, Jose, and I had left behind.

Initially, Sam continued to quantify strain traits in batch cultures, because after all, chemostat measurements are much harder and are limited by the number of chambers. However, at some point, we realized that without getting our fundamentals on a solid footing, we would be chasing our tails: if we did not understand the two ancestral strains (i.e., why ancestral strains’ traits could not explain the ancestral community’s growth rate), there was no point in trying to understand evolved strains.

By that time, Sam had already invested a year or two. But Sam was unflustered because he understood the importance of asking the right, albeit inconvenient, question. Sam re-measured ancestral strains’ metabolite release and consumption traits in David’s chemostats. By controlling how slowly metabolites were supplied, Sam could force cells to grow at various slow rates observed in CoSMO. However, chemostats introduced their own devil: because of metabolite limitation, cells from both strains quickly evolved away from their original states while adapting to metabolite limitation. Sam then figured out ways to deal with this new problem.

Figure 3. Ancestral versus evolved clones. On agarose with low metabolite, ancestral cells failed to divide (arrows). Cells from a mildly-adapted evolved clone (center) showed mixed phenomena: some cells remained undivided (arrow), while other cells formed microcolonies of various sizes. Cells from a strongly-adapted evolved clone formed microcolonies of a uniform and large size. These images were taken using a cell phone camera and thus do not have a scale bar. For reference, an average yeast cell (e.g. black dots in “anc”) has a diameter of ~5 µm. Image from (Hart et al., 2019).

Eventually, Sam discovered that indeed, measurements of metabolite release and consumption traits could differ significantly in chemostats versus in batch cultures. Hanbing Mi, an undergraduate visiting student from China, figured out how to properly measure community growth rate when cells could evolve quickly. Once we took all these into consideration, we solved the puzzle (Hart et al., 2019). But only partially: we still do not understand CoSMO’s initial phase of slower growth.

Figure 4. Model can explain experimental observations of CoSMO long-term growth rate. Model prediction explained experiments (purple) when parameters were measured in community-like chemostat environments (green), and not when parameters were measured in batch culture environments (blue). Error bars mark 95% confidence interval. This figure is from (Hart et al., 2019).

Using the same quantification methodology that we have found to be trustworthy, Sam figured out what it means to be “generous” (Hart & Pineda et al., 2019), and which mutants evolved to be more generous (manuscript in preparation). Sam is now a graduate student at the University of Washington.

Summary

It takes a lot to do careful science. For science to advance, it must stand on a solid foundation. By demonstrating how to properly model a very simple living system, we have helped set the standard for future modeling of more complex systems such as probiotic communities or infectious diseases.

Acknowledgements

I am very grateful to my lab members, particularly Sam Hart, Hanbing Mi, Jose Pineda, Chi-Chun Chen, and David Skelding, for doing high-quality work.

References

Hart SFM, Pineda JMB, Chen C-C, Green R, Shou W. 2019. Disentangling strictly self-serving mutations from win-win mutations in a mutualistic microbial community. eLife Accepted.

Hart SFM, Mi H, Green R, Xie L, Pineda JMB, Momeni B, Shou W. 2019. Uncovering and resolving challenges of quantitative modeling in a simplified community of interacting cells. PLOS Biol 17:e3000135. doi:10.1371/journal.pbio.3000135

Momeni B, Brileya KA, Fields MW, Shou W. 2013a. Strong inter-population cooperation leads to partner intermixing in microbial communities. eLife 2:e00230. doi:10.7554/eLife.00230

Momeni B, Waite AJ, Shou W. 2013b. Spatial self-organization favors heterotypic cooperation over cheating. eLife 2:e00960. doi:10.7554/eLife.00960

Shou W, Ram S, Vilar JM. 2007. Synthetic cooperation in engineered yeast populations. Proc Natl Acad Sci USA 104:1877–1882. doi:10.1073/pnas.0610575104

Skelding D, Hart SF, Vidyasagar T, Pozhitkov AE, Shou W. 2018. Developing a low-cost milliliter-scale chemostat array for precise control of cellular growth. Quant Biol 6:129–141.

Waite AJ, Shou W. 2012. Adaptation to a new environment allows cooperators to purge cheaters stochastically. Proc Natl Acad Sci 109:19079–19086. doi:10.1073/pnas.1210190109

 


Paul Turner elected to National Academy of Sciences

Professor Paul Turner was elected to the National Academy of Sciences earlier this week (following his election to the American Academy of Arts & Sciences two weeks ago).

Paul Turner is a professor of Ecology and Evolutionary Biology at Yale University, and he joined BEACON as a Faculty Affiliate in 2013. Since then, he has been involved in many BEACON projects and has mentored several BEACON trainees. He spoke about his fascinating work on using viruses to control antibiotic-resistant bacteria at last summer’s BEACON Congress.

When announcing this exciting news to BEACON, Rich Lenski wrote, “Paul has done beautiful work on the evolution of viruses, including intriguing issues that arise when multiple virions infect the same host cell. In the last few years, Paul has also been performing clever, life-saving (literally) experiments in which phages (viruses that infect bacteria) are chosen that target bacterial pathogens that are resistant to every available antibiotic.”

You can read more about Paul and his work here: https://turnerlab.yale.edu/

Here’s a write-up of one case of using viruses to fight antibiotic-resistant bacteria:
https://www.statnews.com/2016/12/07/virus-bacteria-phage-therapy/

Congratulations, Paul!


Using a course-based undergraduate research experience to increase leadership opportunities for students

By: Katie Dickinson, research scientist, Kerr Lab (Department of Biology), University of Washington

Katie Dickinson is a research scientist based out of the Kerr Lab (Department of Biology) at the University of Washington

Course-based Undergraduate Research Experiences (CUREs) are becoming increasingly popular, as they enable all students to gain the positive outcomes associated with undergraduate research. In a CURE, students investigate real-world research questions without predefined outcomes.

With support from BEACON and the Howard Hughes Medical Institute, our team has developed a CURE on experimental evolution of antibiotic resistance in Escherichia coli for the introductory biology sequence at the University of Washington. In our CURE, students isolate bacterial strains that are sensitive and resistant to rifampicin and streptomycin, do daily transfers to conduct experimental evolution, and gather and analyze data on variation in the level of resistance, the fitness effects of resistance, and collateral effects. In addition, students analyze the products of their own evolution experiments: they sequence the relevant gene(s) of their sensitive and resistant bacterial isolates, look for mutations, and explore how those mutations change protein structure and cellular processes. In this way, the students gain an understanding of the genetic and phenotypic basis of drug resistance.

Currently, our CURE is being scaled so that several thousand students per year can participate. The goals of our new curriculum include improving undergraduate students’ understanding of key evolutionary concepts and their ability to design experiments, while also increasing their emotional engagement with their learning, academic performance, confidence, resiliency, and professional identity. One of our CURE’s keys to success: peer facilitators.

Peer facilitators (PFs) work with graduate teaching assistants (TAs) to run each session in the CURE sequence. In lab, PFs assist the TA by 1) demonstrating lab techniques, 2) answering student questions, and 3) facilitating active learning activities designed to increase understanding of evolutionary theory and experimental results. Their help is crucial because in many cases the PFs, having completed the CURE previously as students, have a deeper understanding of the protocols and underlying biology than the TAs, who are often new to the CURE. In addition, PFs play a key mentoring role for their younger peers, offering support, advice, and encouragement.

Peer Facilitator Margaux is assisting with lab preparations

Past PFs have said this experience helped them improve their communication and teaching skills, develop leadership qualities, reinforce their own study skills and science knowledge, and increase their confidence and motivation, in addition to enhancing their CVs. I asked current PFs for their thoughts on the program, and this is what a few of them had to say.

Bao N., a PF since autumn 2017.
“During my freshman year, I enrolled in the Biology CURE. While I never imagined taking the lead role in group projects, I worked diligently and did not hesitate to ask questions. To my surprise, at the end of the quarter, I was chosen among several students to become a peer facilitator, a mentor for students in the course’s next offering. I jumped at this opportunity, as it was my first leadership position in college. It remains one of the most meaningful experiences I have had at the UW. I learned to appreciate the rigorous scientific research happening during and after each class session. I learned to communicate effectively with students as well as other members of the teaching team. I learned to take responsibility for the knowledge and skills students receive, knowing that they will carry these skills into real-world settings, such as a clinic or a research lab. Being a PF is especially meaningful because I was able to support students more inclusively, especially when I can relate to the academic challenges a student can face in this class. Whereas the TA alone would have limited time helping individual students, my role allows me to spend a little extra time with each student. I was also able to incorporate my own experience as an alumnus of the same class in order to help the course developers build lesson plans. I gained many resources from my own peer facilitators and looked up to them as role models. In return, I strive to be very open with my students if they have questions or concerns about how to succeed in class, how to get involved with research or how to apply to certain scholarships.”

Khoi H., a PF since winter 2019.
“Personally, I really enjoy the idea of the CURE lab. The lab itself is refreshing in a way. In chemistry or physics labs, you come into lab reading a manual and you can always Google what’s about to happen beforehand, which makes the lab just a boring contest of who can repeat what they found online. The CURE lab, by contrast, is an immersive, collaborative effort by students and TAs to attempt to understand a subject. I immediately signed up to PF for CURE labs, because I think this is a great addition to the curriculum. The CURE lab allows me to support students and encourage science in them, whilst maintaining the fun and educational environment. To the students, having a PF is helpful because the students are able to relate to the PF since they are both undergrads, so students may be more comfortable asking PFs for help. This is beneficial to both TA and students because we act as a communication bridge between the two. Although we only formally meet in the classroom, a PF can still assist students outside of class, whether that is in other courses, socially, or emotionally. Additionally, being a PF taught me ways to interpret materials in various ways, making me feel more comfortable when it comes to finding another way to explain the material. Overall, having a PF is beneficial to the students, especially because it improves their understanding and allows them to be more engaged in the course.”

Grace D., a PF since winter 2017.
“Serving as a PF has furthered my love of teaching science in ways that are inclusive to all, as well as fostering a personal curiosity in research. Without this program, I would never have had the experience or confidence to pursue other research opportunities. It was also through meeting fellow undergrads interested in STEM education that I came to truly appreciate how extraordinary the CURE PF experience is. While the rest of my peers had similar experiences of tutoring and assisting students with worksheets during lecture, there was a unique difference in how we were able to take ownership of the course material and lab techniques, and also collaborate with, advise, and support both the students and other PFs too. As a result, my career goals have shifted more toward research and academia, something I previously didn’t know anything about, never thought I would be interested in, nor believed that I was capable of. It has been an honor to be a part of this incredible CURE family and I am deeply grateful for the ways it has pushed me to become a better scientist, teacher, and friend.”

Cindy T., a PF since autumn 2017.
“I never expected to find myself being a part of something like the CURE lab. During my freshman year, I came off as extremely quiet and shy around people – talking to classmates was something I did not voluntarily engage in. After going through the CURE program, I was surprised to be one of the many students that were eligible and selected to be a PF. At first, I doubted myself; would I be able to guide the undergrads in the “right” direction? However, my fears gradually subsided. The community members within the CURE program were so welcoming and accommodating. This small but growing community of TAs and PFs felt like a small family to me. Throughout my time as a PF in this program, I slowly gained the confidence to communicate more clearly and confidently. The concept of the CURE program also appealed to me. Giving undergrads the opportunity to gain lab experience while performing an actual experiment was something unheard of. Instead of doing a textbook lab experiment where there should be expected results, the data students obtain from this experiment do contribute to a greater cause – so there is some amount of real life application.”

Winter 2019 Peer Facilitator team: Back row, left to right: Bao, Grace, Khoi, Margaux, Deja, Julianna, and Angie. Front row, left to right: Sammi, Yuri, Richard, and Shannon. Not pictured: Ariel, Alena, Cindy, Lindsey, Rachael, Reilly, Tibebu, and Veronica.

LOOKING AHEAD:

One of our goals is to help low-income and underrepresented students build the skills and confidence needed to complete a STEM major. We aim to recruit PFs from diverse backgrounds to serve as role models in the classroom. In addition, we would like to create a PF mentoring ladder where experienced PFs are partnered with newer PFs to help encourage and train each other. At the core of the PF program are mentoring, research, and education. To help support the PFs, we are developing additional resources and training modules that will cover topics such as active learning, mentoring, diversity and equity, career support, general teambuilding, and undergraduate research. We hope that as PFs engage in peer mentoring and support activities, they will pay it forward and become leaders who teach others what they have learned.


Fish, You are the Father!

By: Isaac Miller-Crews, PhD Candidate, University of Texas at Austin

My job would be much easier if CVS sold paternity testing kits for fish instead of humans! I am interested in the evolution of the neural regulation of reproduction, which requires knowing whether an animal reproduced. Genetic testing, such as parentage analysis, allows us to figure out relationships among individuals without direct historical knowledge. This testing has generally relied on looking in the DNA for microsatellites, but we’re discovering new, more powerful, and cheaper ways to conduct these tests in the ‘Age of Big Data’ (Flanagan, 2018; Hodel, 2016). This is especially true if your fish population stubbornly refuses to have variable microsatellites!

Yet, common standards or guidelines for dealing with next-generation sequencing data still need to be figured out (Flanagan, 2018). Importantly, few bioinformatic tools exist that can differentiate well between closely related individuals or deal with DNA mixtures. Looking at single nucleotide polymorphisms (SNPs) across thousands of genomic sites gives researchers significantly more information on variability among samples than standard microsatellite approaches (Hodel, 2016). A new technique called restriction site-associated DNA sequencing (RAD-seq) helps us narrow down which places to look at on the DNA, because it only sequences certain fragments, and which fragments you get depends on which endonucleases you use to cut up the DNA. 2bRAD sequencing uses a type-2b endonuclease that gives you consistent fragments across your samples, and it is also very cost-effective (Wang, 2012).

The simplest form of paternity testing, exclusion, in which paternity is ruled out if a single site disagrees between the alleged father and the offspring-mother pair (Marshall, 1998), is prone to errors (Wang, 2010). Parental and sibship reconstruction can generate full sets of possible parental genotype profiles but cannot be used with pooled offspring samples (Wang, 2004). The most common paternity testing technique uses a likelihood model to categorically assign paternity between individuals (Meagher, 1986). Not only does this approach require setting a threshold to call genotypes, but it also limits paternity to the comparison of only two alleged fathers (Marshall, 1998). Furthermore, this type of technique cannot deal with cases of mixed or pooled samples, since it can only categorically assign paternity to one putative father.
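
To make the exclusion rule described above concrete, here is a minimal Python sketch. The genotype encoding (each locus as a pair of allele labels), the function names, the `max_mismatches` parameter, and the toy genotypes are all assumptions made for illustration; this is not the pipeline discussed in this post.

```python
def compatible(child, mother, father):
    """True if the child could carry one maternal and one paternal allele at this locus."""
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)


def exclude_fathers(child, mother, candidates, max_mismatches=0):
    """Return the names of candidate fathers that are NOT excluded.

    child, mother: lists of (allele, allele) tuples, one tuple per locus.
    candidates: dict mapping a name to a genotype list in the same locus order.
    max_mismatches=0 is strict exclusion; a larger value tolerates genotyping errors.
    """
    kept = []
    for name, father in candidates.items():
        mismatches = sum(
            not compatible(c, m, f) for c, m, f in zip(child, mother, father)
        )
        if mismatches <= max_mismatches:
            kept.append(name)
    return kept


# Toy genotypes at three loci (made up for this example).
mother = [("A", "A"), ("C", "T"), ("G", "G")]
child = [("A", "G"), ("C", "C"), ("G", "T")]
candidates = {
    "male_1": [("G", "G"), ("C", "T"), ("T", "T")],
    "male_2": [("A", "A"), ("C", "C"), ("G", "T")],
}

remaining = exclude_fathers(child, mother, candidates)
```

In this toy case `male_2` is ruled out by a single locus, which is exactly why one genotyping error can wrongly exclude a true father under strict exclusion.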

Luckily, there is always a Bayesian approach! Partial paternity testing assigns fractions of the offspring to candidate parents based on the highest Bayesian posterior probability (Hadfield, 2006) and outperforms categorical likelihood models, especially in being able to circumvent systematic biases, such as over-assigning paternity to males with a relatively higher number of homozygous loci (Devlin, 1988). Assigning partial paternity is thus perfect if you want to assess an entire brood or clutch or litter at once!
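
A rough sketch of the fractional idea, in Python, may help. It assumes simple Mendelian inheritance, a known mother, no genotyping error, and a uniform prior over candidate fathers unless one is supplied; the function names and toy genotypes are hypothetical, and published methods such as Hadfield (2006) are considerably richer than this.

```python
from itertools import product


def transmission_prob(child, mother, father):
    """P(child genotype | mother, father) at one locus under Mendelian inheritance."""
    target = tuple(sorted(child))
    hits = sum(
        tuple(sorted((m, f))) == target for m, f in product(mother, father)
    )
    return hits / 4.0


def fractional_paternity(child, mother, candidates, prior=None):
    """Posterior probability of paternity for each candidate father.

    Genotypes are lists of (allele, allele) tuples in the same locus order.
    """
    names = list(candidates)
    if prior is None:
        prior = {name: 1.0 / len(names) for name in names}
    weights = {}
    for name in names:
        likelihood = 1.0
        for c, m, f in zip(child, mother, candidates[name]):
            likelihood *= transmission_prob(c, m, f)
        weights[name] = prior[name] * likelihood
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no candidate is compatible with the offspring")
    return {name: w / total for name, w in weights.items()}


# Toy genotypes at two loci (made up for this example).
mother = [("A", "A"), ("C", "T")]
child = [("A", "G"), ("C", "T")]
candidates = {
    "male_1": [("G", "G"), ("C", "T")],
    "male_2": [("A", "G"), ("T", "T")],
    "male_3": [("A", "A"), ("C", "C")],
}

posterior = fractional_paternity(child, mother, candidates)
```

Instead of naming a single father, the result splits the offspring across candidates in proportion to how well each explains the genotype, so the fractions can be summed over a whole brood.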

Most parentage testing techniques assume that parents are unrelated and that the pool of putative parents contains no close relatives, which can lead to troubling situations where full siblings are assigned parentage over actual parents (Thompson, 1976). Populations with a lot of closely related individuals pose a problem to both microsatellite and SNP assays due to the lower variation among samples. In these cases, only 100 SNPs are required to outperform microsatellites (Flanagan, 2018). If close relatives are suspected to be in the sample, broader pedigree analysis is often required, such as identity-by-state (IBS) matrix clustering. Yet, to date, only one study has attempted to combine IBS clustering with any paternity testing method (categorical assignment, in that case) or with genotyping-by-sequencing RAD-seq data (Gutierrez, 2017). If only someone could combine the awesome power of IBS matrix clustering with the staggering potential of partial paternity testing!
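
For readers unfamiliar with IBS, the sketch below shows one common way to score identity-by-state between two individuals from biallelic SNP genotypes. It assumes genotypes are coded as 0, 1, or 2 copies of the alternate allele with `None` for missing calls; the function names, the 0.8 threshold, and the toy genotypes are illustrative assumptions, and a real analysis would follow this with a proper clustering step.

```python
def ibs_similarity(g1, g2):
    """Mean proportion of alleles shared identical-by-state across loci.

    Genotypes are lists of 0, 1, or 2 (copies of the alternate allele),
    with None marking a missing call. Loci missing in either sample are skipped.
    """
    shared = [
        1.0 - abs(a - b) / 2.0
        for a, b in zip(g1, g2)
        if a is not None and b is not None
    ]
    if not shared:
        raise ValueError("no loci genotyped in both individuals")
    return sum(shared) / len(shared)


def ibs_matrix(genotypes):
    """Pairwise IBS similarity for every pair of individuals, keyed by (name, name)."""
    names = list(genotypes)
    return {
        (x, y): ibs_similarity(genotypes[x], genotypes[y])
        for x in names
        for y in names
    }


def close_pairs(matrix, threshold=0.8):
    """Pairs of distinct individuals whose similarity meets a (hypothetical) threshold."""
    return sorted(
        (x, y) for (x, y), s in matrix.items() if x < y and s >= threshold
    )


# Toy genotypes at four SNPs (made up for this example).
genotypes = {
    "fish_a": [0, 1, 2, 1],
    "fish_b": [0, 1, 2, 2],
    "fish_c": [2, 1, 0, 0],
}

matrix = ibs_matrix(genotypes)
relatives = close_pairs(matrix)
```

Pairs flagged this way are the candidates whose close relatedness could bias a paternity assignment.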

The African cichlid fish Burton’s mouthbrooder, Astatotilapia burtoni, which forms highly complex and dynamic social communities, is a model system in social neuroscience. Adult male A. burtoni are considered either territorial or non-territorial (Fernald, 1977). A male’s position within the social dominance hierarchy is dynamic, as possession of territories is transient (Hofmann, 1999). A. burtoni reproduce within territorial bowers prior to female mouth-brooding for around two weeks, during which fry can be directly removed from the mother’s buccal cavity. Current estimates of male reproductive success usually integrate some combination of female behavior (proximity, duration/frequency in shelter, or number of eggs laid in a territory), with variation in female preference assumed from this proxy of male reproductive success (Kidd, 2006). Although a female may associate with a male, this does not directly equate to mating outcomes, meaning behavioral scoring is not enough to assign paternity (Theis, 2012).

My research aims to do just that by developing an NGS-based parentage analysis bioinformatics pipeline that integrates partial paternity assignment and IBS matrix clustering. The powerful pairing of these two parentage assignment methods allows detection of biases that might arise from closely related individuals in the alleged parent population and will handle pooled samples of multiple offspring. This is great, since our laboratory population of A. burtoni is quite inbred and produces fairly large broods (imagine mouth-brooding anywhere from 10-60 fry). Implementing paternity testing to measure reproductive outcomes can help us understand the interaction between dynamic systems such as the female reproductive cycle and male social dynamics (Fig. 1).

Figure 1. Research overview of how female internal reproductive state (blue) with male external social structure (red) interact and integrate into producing reproduction (purple). Measuring reproductive output requires the development of paternity testing methods.

The integration of a bioinformatics pipeline and the unique advantages of 2bRAD sequencing will allow for relatively easy expansion both into alternative DNA sequencing approaches and any species, regardless of available genomic resources. I plan to integrate paternity testing, as a measure of Darwinian fitness, into analyses of mate preferences and reproductive success in naturalistic communities of A. burtoni. While we use a lot of behavioral proxies of reproduction, such as social interactions or association time, nothing lets you know that the deed was done like genetically testing everyone. Layered on top of these models of reproductive success within a social hierarchy, I want to integrate neuromolecular techniques, from the spatial resolution of single genes up to transcriptomic networks. This means I will know information about an individual’s behavior, reproductive success, and neural profile all within the context of an actual social community. Talk about truly integrative!

Isaac Miller-Crews is a PhD candidate in the Hofmann Lab (Department of Integrative Biology) at the University of Texas at Austin.

References:

Devlin, B., Roeder, K., & Ellstrand, N. C. (1988). Fractional paternity assignment: theoretical development and comparison to other methods. Theoretical and Applied Genetics, 76(3), 369–380. https://doi.org/10.1007/BF00265336
Fernald, R. D., & Hirata, N. R. (1977). Field study of Haplochromis burtoni: Quantitative behavioral observations. Animal Behaviour, 25, 964–975.
Flanagan, S. P., & Jones, A. G. (2018). The future of parentage analysis: From microsatellites to SNPs and beyond. Molecular Ecology, mec.14988. https://doi.org/10.1111/mec.14988
Gutierrez, A. P., Turner, F., Gharbi, K., Talbot, R., Lowe, N. R., Peñaloza, C., … Houston, R. D. (2017). Development of a Medium Density Combined-Species SNP Array for Pacific and European Oysters (Crassostrea gigas and Ostrea edulis). G3 (Bethesda, Md.), 7(7), 2209–2218. https://doi.org/10.1534/g3.117.041780
Hadfield, J. D., Richardson, D. S., & Burke, T. (2006). Towards unbiased parentage assignment: Combining genetic, behavioural and spatial data in a Bayesian framework. Molecular Ecology, 15(12), 3715–3730. https://doi.org/10.1111/j.1365-294X.2006.03050.x
Hodel, R. G. J., Segovia-Salcedo, M. C., Landis, J. B., Crowl, A. A., Sun, M., Liu, X., … Soltis, P. S. (2016). The Report of My Death was an Exaggeration: A Review for Researchers Using Microsatellites in the 21st Century. Applications in Plant Sciences, 4(6), 1600025. https://doi.org/10.3732/apps.1600025
Hofmann, H. A., Benson, M. E., & Fernald, R. D. (1999). Social status regulates growth rate: consequences for life-history strategies. Proceedings of the National Academy of Sciences of the United States of America, 96(24), 14171–14176. https://doi.org/10.1073/pnas.96.24.14171
Kidd, M. R., Danley, P. D., & Kocher, T. D. (2006). A direct assay of female choice in cichlids: all the eggs in one basket. Journal of Fish Biology, 68(2), 373–384. https://doi.org/10.1111/j.0022-1112.2006.00896.x
Marshall, T. C., Slate, J., Kruuk, L. E. B., & Pemberton, J. M. (1998). Statistical confidence for likelihood-based paternity inference in natural populations. Molecular Ecology, 7(5), 639–655. https://doi.org/10.1046/j.1365-294x.1998.00374.x
Meagher, T. R., & Thompson, E. (1986). The relationship between single parent and parent pair genetic likelihoods in genealogy reconstruction. Theoretical Population Biology, 29(1), 87–106. https://doi.org/10.1016/0040-5809(86)90006-7
Thompson, E. A. (1976). A paradox of genealogical inference. Advances in Applied Probability, 8(04), 648–650. https://doi.org/10.2307/1425927
Wang, J. (2010). Effects of genotyping errors on parentage exclusion analysis. Molecular Ecology, 19(22), 5061–5078. https://doi.org/10.1111/j.1365-294X.2010.04865.x
Wang, J. (2004). Sibship Reconstruction from Genetic Data with Typing Errors. Genetics, 166(4), 1963–1979. https://doi.org/10.1534/genetics.166.4.1963
Wang, S., Meyer, E., McKay, J. K., & Matz, M. V. (2012). 2b-RAD: a simple and flexible method for genome-wide genotyping. Nature Methods, 9(8), 808–810. https://doi.org/10.1038/nmeth.2023

200 Years of Developmental Hourglass: Using Big Data to Increase Our Understanding of Vertebrate Embryogenesis from a Trickle to a Flood

By: Megan Chan, Undergraduate Student, University of Texas – Austin

When I started college at The University of Texas at Austin a couple of years ago, I enrolled as a biochemistry/pre-pharmacy major. I didn’t know anything about computational biology back then but have since had the opportunity to participate in computational biology research under the guidance of Dr. Rebecca Young and Dr. Hans Hofmann in the Department of Integrative Biology at UT Austin. Over the last couple of years, I have grown more and more interested in the realm of data analytics, and my experience in hands-on research has completely changed my goals for the future. Because of this, I finally transferred majors last year to computational biology.

Megan Chan

At the University of Texas, we have a program called the Freshman Research Initiative (FRI) that helps new students get experience in research labs. Although I originally applied just to get something interesting on my resume, I ended up gaining much more. As part of FRI, I joined a research stream called Big Data in Biology, led by Dhivya Arasappan. The goal of this stream was to introduce freshmen to concepts in genetics and how statistics and computer science are being used to study biological systems. I chose this stream over others I was interested in (like streams working in genetically engineering bacteria or chemical analysis of wine tannins) because I had really enjoyed a year of programming when I was in high school. I had never considered myself very knowledgeable about computers and often felt overwhelmed when around guys who had been writing code since middle school, but I found the challenge of solving problems and discovering something new exciting. In my sophomore year I realized that I wanted to continue exploring this field and completely changed my career focus from pharmacy to computational biology.

As part of FRI, I had the opportunity to join Dr. Young and Dr. Hofmann in an independent project adding evidence to a long-standing debate over the validity of what is commonly known as the hourglass model of vertebrate development. The hourglass model hypothesizes that the vertebrate body plan imposes a constraint on diversification of mid-embryonic development across vertebrate species. Early evidence for this theory was based on qualitative analysis of anatomical developmental variation, but in recent years gene expression data has been used as evidence for and against the hourglass model. The part of this overall project that I have been working on focuses on describing patterns of similarity in developmental gene expression through embryogenesis among several vertebrate species. This has involved the processing and analysis of over 150 open-source gene expression datasets representing developmental stages for six species. By comparing the similarity of gene expression between each combination of species at each time point in development, I can ask whether mid-embryonic stages are most similar in gene expression across species.
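
As a sketch of what such a comparison might look like, the Python below averages the pairwise correlation of expression profiles between species at each stage. The choice of Pearson correlation, the data layout, the function names, and the numbers are all assumptions for illustration (the values are made up and are not results from this project), and the sketch presumes that stages have already been matched across species.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of expression values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def mean_pairwise_similarity(expression):
    """Average correlation over all species pairs, separately for each stage.

    expression: {stage: {species: [expression of shared orthologous genes]}}
    """
    result = {}
    for stage, by_species in expression.items():
        pairs = list(combinations(sorted(by_species), 2))
        total = sum(pearson(by_species[a], by_species[b]) for a, b in pairs)
        result[stage] = total / len(pairs)
    return result


# Made-up expression values for four orthologous genes (illustration only).
expression = {
    "early": {"frog": [1, 2, 3, 4], "chicken": [4, 3, 2, 1], "fish": [1, 2, 3, 4]},
    "mid": {"frog": [1, 2, 3, 4], "chicken": [2, 4, 6, 8], "fish": [1, 2, 3, 4]},
    "late": {"frog": [1, 2, 3, 4], "chicken": [1, 3, 2, 4], "fish": [2, 1, 4, 3]},
}

similarity = mean_pairwise_similarity(expression)
most_similar_stage = max(similarity, key=similarity.get)
```

Under the hourglass model, the stage with the highest cross-species similarity would be expected to fall in mid-embryogenesis.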

A major challenge in achieving this goal has been the lack of consistency in staging for different species. There is not a common quantitative way to equate a particular stage of development in one species with that in another. To add to this problem, of the species we have data for, most only have data for a select set of stages, and the number of stages sequenced for each species is also different. For example, there are 8 out of 46 stages represented for chicken embryos and 24 out of a possible 44 stages for a species of frog (not including free-swimming tadpoles). To overcome this essential problem, I’ve turned to machine learning and comparing qualitative descriptions of stages to group developmental time points within each species into comparable sets.

Of the various methods I integrated into my approach, the first method I employed was K-means clustering. K-means is an unsupervised machine learning algorithm that iteratively computes the distance between each data point and a set of k centroids to calculate which points cluster together around a mean, with k being the number of clusters to find. This was the first method I tried because it is a fairly common way of classifying data without pre-determining classes. To find the appropriate k, I generated an elbow plot visualizing the amount of variation that would be accounted for by several possible numbers of clusters and chose a k that represented a reasonable amount of variation without dividing the data into clusters that were too small. A known feature of K-means, however, is that it randomizes the initial centroids, which can result in some variation in cluster membership when the clusters are not robust. To strengthen this method, I also used partitioned hierarchical clustering, another form of unsupervised machine learning. Like K-means, this algorithm aims to group the data points into a predetermined number of clusters with similar values, but it starts by considering the entire dataset one cluster and then partitions it into smaller pieces until it has reached the appropriate number of clusters. Hierarchical clustering, unlike K-means, tends to be consistent, and our results showed that, at an appropriate number of clusters found with the earlier described method, it also conserved the order of the developmental stages. Further analysis showed that these clusters could be defined by at least some biological significance. We are now confronted with the challenge of aligning these clusters across species.
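
For readers new to K-means, here is a small, self-contained Python sketch of the algorithm and of the numbers behind an elbow plot. It is a plain textbook version written for illustration, not the code used in this project; the function names, the `restarts` parameter, and the toy points are assumptions.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=0, iters=100):
    """Basic K-means. Returns (centroids, within-cluster sum of squares)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random starting centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    inertia = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, inertia


def elbow_curve(points, max_k, restarts=10):
    """Best within-cluster sum of squares for k = 1..max_k over several random starts."""
    return {
        k: min(kmeans(points, k, seed=s)[1] for s in range(restarts))
        for k in range(1, max_k + 1)
    }


# Toy data: three loose groups of points (made up for this example).
points = [
    (0, 0), (0, 1), (1, 0),
    (10, 10), (10, 11), (11, 10),
    (20, 0), (20, 1), (21, 0),
]

curve = elbow_curve(points, max_k=5)
```

Plotting `curve` against k gives the elbow plot: the value drops steeply while added clusters capture real structure and flattens afterwards. Taking the best of several restarts is one simple way to soften the sensitivity to random starting centroids mentioned above.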

Now, my work has turned from heavy computation to intense reading. I’ve made it this far without having to know too much about the details of what all these stages mean, but I’ve come to face the fact that I will need some biological knowledge of vertebrate development in order to compare these stages in any reasonable way. The beauty of being in an interdisciplinary field.

The knowledge that I’ve gained while working on this project is invaluable to me as I start to pursue my own projects and begin exploring my future options as graduation slowly approaches. I’ve enjoyed the work I’ve done in this lab so much that last year I started analyzing data for fun; in one instance looking for patterns in word choice in a dataset of Russian disinformation tweets, and in another instance predicting the length of time a dog will stay in the local shelter based on its age. This research experience has also opened many doors for me, allowing me the opportunity to pursue positions analyzing data for other labs on campus and jobs mentoring new students in research, and giving me the tools I needed to land a software internship in biotech this summer. In my last year, I hope to publish results for this project and leave an impact on future research.


Team yEvo goes to National Association of Biology Teachers

By: Bryce Taylor, Alexa Warwick, and Ryan Skophammer

Hi BEACONites! We are Ryan Skophammer of the Westridge School for Girls, Bryce Taylor of University of Washington, and Alexa Warwick of Michigan State University. We’ve been collaborating on a BEACON-funded grant to expand options for introductory Biology teachers who want to use labs that teach concepts in evolution. Specifically, we have developed a standards-based, hands-on, long-term yeast evolution project (‘yEvo’). Ryan has been developing lesson plans and teaching the lab in his AP biology class, Bryce is providing experimental support and data analysis, and Alexa is evaluating the impact of participation on student learning.

A subset of colorful yeast used in yEvo. The pigments allow us to monitor for contamination during student evolution experiments and serve as a strain-specific marker in competitions.

The yeast evolution project begins by having students choose a favorite color of yeast from a living ‘palette’ of S. cerevisiae strains that have been engineered to express vibrant pigments (courtesy of the Boeke lab at NYU). Over several weeks, students grow their yeast in the presence of an over-the-counter antifungal agent to select for mutants with higher tolerance. Classes at Westridge run for 80 minutes on an alternating block schedule. This means students attend AP Biology every other school day. At the beginning of each block, students inspect their experiments and transfer yeast from a saturated culture to fresh media using a disposable sterile swab.

After a few weeks, students purify a single clone from the culture and use it in a class-wide competition, in which they use the color of their yeast as a marker to determine which is “winning” in a mixed culture. Some of these clones are then sequenced by the Dunham lab at the University of Washington to determine mutations, which students analyze and research to form hypotheses about whether a given mutation is likely to be adaptive. Early results have yielded an exciting mix of mutations in genes with known roles in resistance to the active ingredient in our antifungal, which demonstrate the experiment worked, and genes that haven’t been implicated previously but seem worthy of further investigation.

Ryan’s students hard at work with their yeast.

In Ryan’s class, forty-five students completed the first pilot of the yeast evolution project in the 2017-18 school year. To iteratively improve the lessons and to evaluate impacts on student learning of evolution and motivation/attitudes toward science, we gave a post-survey to Ryan’s students in May 2018 (17 responses). When asked what they liked about the process of growing yeast in the presence of the fungicure, most of them mentioned watching their yeast survive or evolve over time (64.7%) and determining whether to increase the concentration of the antifungal (29.4%). When analyzing sequence data from their evolved strains, the students liked seeing the actual mutations (52.9%), but also found it confusing to figure out how to analyze the data (47%), suggesting that more scaffolding is needed in the design of this activity next time. Most of the students also liked the competition aspect (82.3%), but some disliked losing (17.6%), felt rushed (11.7%), or didn’t like counting (11.7%). Most students (94.1%) reported they were willing to do the activity again because it was fun; one person was uncertain. All students agreed or strongly agreed that they enjoyed participating. We also asked students to report on their interest in becoming a biologist as a result of their participation (41.2% agreed or strongly agreed) and their interest in STEM (47% agreed or strongly agreed).

In November of 2018, we traveled to the National Association of Biology Teachers conference in San Diego. It was the first time we’d all met in person and provided a great opportunity to catch up and plan out our next steps. Alexa and Ryan had been to the conference previously. Bryce attended for the first time, and was supported by travel funds from our BEACON grant. In addition to discussing yEvo, Alexa and Bryce presented posters on ConnectedBio (https://connectedbio.org/) and UW Genomics Salon, respectively. ConnectedBio is an NSF-funded grant project to develop curricular materials that are designed for the Next-Generation Science Standards (https://www.nextgenscience.org/) and foster integrated learning of high school genetics and evolution. The materials use the Evo-Ed cases (http://www.evo-ed.org/) as the phenomena that students explore through a series of technology-enhanced lessons as part of the collaboration between Michigan State University researchers and the Concord Consortium (https://concord.org/). Genomics Salon is an interdisciplinary discussion group at University of Washington that brings together academics and members of the broader UW community to talk about issues in science and society. Bryce shared a repository of discussion questions and resources from 2 years of meetings, which could be a good starting point for teachers interested in building lesson plans on topics we’ve covered, but who aren’t sure where to start.

Ryan introducing yEvo during his workshop at NABT.

Ryan led a workshop where he shared his experience designing and teaching yEvo. The teachers present had great ideas and feedback on the project that helped us to think through where to take the project next. After the workshop several teachers hung around with additional questions and feedback. Chatting with them helped us to recognize aspects that may or may not work in every school setting, which we aim to address as we further refine protocols and materials. One particularly enthusiastic participant had some fantastic ideas about future conditions or experimental setups we could try out. He’s stayed in contact since and is running yEvo in his classroom this semester!

The most important conference tradition: dinner with new friends.

One of Alexa’s highlights from the meeting was attending science writer Ed Yong’s talk and then going out to dinner with him. If you haven’t seen Ed’s articles in the Atlantic yet, we recommend them: https://www.theatlantic.com/author/ed-yong/. Bryce particularly enjoyed the exhibit hall. The vendors present brought a very cool mix of biology apps, games, and toys, which are a growing and fascinating component of education that play a big role in the early stages of science training, but that you don’t get to interact with often in higher-ed settings.


A Fly’s View of Retinoblastoma-Family Protein Conservation

By: Dhruva Kadiyala (Undergraduate Student at Michigan State University)

How do evolutionary perspectives illuminate cancer-related biochemistry? As a high school student, I was involved in a project to find targets to attack cancer cells. That project really inspired me to work on the retinoblastoma-family protein project in the Arnosti lab. I came into Dr. Arnosti’s lab in my freshman year at Michigan State University as a Professorial Assistant from the Honors College and immediately began to learn about retinoblastoma (Rb) tumor suppressor proteins in Drosophila species.

In humans, the retinoblastoma protein is a tumor suppressor and plays an active role in cell cycle regulation. Mutations in the Rb gene or its regulatory pathway are associated with many human cancers. Rb is ancient; the gene is evolutionarily conserved in most multicellular organisms and present as a single copy gene. In mammals, however, there are three Rb paralogs: Rb, p107 and p130. Independently, in Drosophila the Rb gene duplicated about 60 million years ago, and both paralogs, Rbf1 and Rbf2, have been retained in all modern Drosophila species. This situation provides a great model system to study Rb paralog evolution and function.

To understand the evolution of Rbf1 and Rbf2, I aligned Rbf1 protein sequences from 12 Drosophila species using the Clustal Omega multiple sequence alignment tool from the European Bioinformatics Institute. I split the proteins into three different domains (N-terminus, C-terminus, and Pocket Domain) to see which part of the protein is most conserved. I found that the Rbf1 gene, which most resembles the ancestral gene based on similarity with other organisms’ Rb genes, shows a higher degree of conservation, especially in the Pocket Domain, which is important for binding to transcription factors. The derived Rbf2 gene has a higher degree of variation within Drosophila, especially in the N- and C-termini.
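
The per-domain comparison can be sketched in a few lines of Python. The example below scores each alignment column by the fraction of sequences sharing the most common residue and then averages those scores within each domain. The scoring rule, the function names, the domain coordinates, and the short sequences are assumptions made up for illustration; they are not real Rbf1 data, and tools such as Jalview use more refined conservation measures.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue at each aligned column.

    alignment: list of equal-length strings, with '-' marking gaps.
    Gaps never count as the shared residue, so gappy columns score lower.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must all have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


def domain_conservation(alignment, domains):
    """Mean column conservation within each domain.

    domains: {name: (start, end)} using 0-based, end-exclusive alignment columns.
    """
    scores = column_conservation(alignment)
    return {
        name: sum(scores[start:end]) / (end - start)
        for name, (start, end) in domains.items()
    }


# Made-up aligned fragments and hypothetical domain boundaries (illustration only).
alignment = [
    "MKTLLVAAG",
    "MRTLLVA-G",
    "MQSLLVSCG",
]
domains = {"n_terminus": (0, 3), "pocket": (3, 6), "c_terminus": (6, 9)}

by_domain = domain_conservation(alignment, domains)
```

With a real alignment exported from Clustal Omega, the same averaging gives one number per domain that can be compared across paralogs.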

I also aligned both Rbf1 and Rbf2 sequences from D. melanogaster with the human Retinoblastoma-family proteins (Rb, p107, and p130). What was striking is that the more evolutionarily variable human Rb and fly Rbf2 proteins have changes, especially in the C-terminus, that impact a functional domain (the Instability Element, IE) important for protein turnover and transcriptional regulation, an apparent case of parallel evolution.

Why do most animals outside of vertebrates make do with a single Rb gene, while Drosophila have expanded their count? To assess the structural variation in Rb genes in arthropods in general, I compared Drosophila Rbf1 sequences with those of the red flour beetle (Tribolium castaneum), eastern honey bee (Apis cerana), monarch butterfly (Danaus plexippus), western flower thrips (Frankliniella occidentalis), green peach aphid (Myzus persicae), a drywood termite (Cryptotermes secundus), a springtail (Folsomia candida), the common house spider (Parasteatoda tepidariorum), and white-legged shrimp (Penaeus vannamei). Overall, conservation is greatest in the transcription factor binding Pocket Domain, although the internal “spacer” region within the domain is quite variable, something that may influence activity of the proteins. The C-terminus was least conserved, but IE sequences are conserved. Thus, evolutionary changes in this portion of the protein seem to be restricted to cases where there are paralogous genes.

To visualize these levels of conservation, I turned to Jalview software, which uses the multiple sequence alignment tools Clustal Omega and MUSCLE to generate visuals for analysis. Figure 1 shows residue-by-residue conservation of Rb genes from arthropod species.

Figure 1: Multiple sequence alignment of Rbf1 of D. melanogaster and Rb genes from other arthropod species. The height and color of the bars represent percent identity and similarity. Higher bars and yellow bars are more conserved than lower brown bars. The protein skeleton is based on the D. melanogaster Rbf1 protein with the following annotations: Blue: cyclin fold domain, Pink: A pocket, Green: B pocket, Purple: Instability element.

Overall, my work in Dr. Arnosti’s lab has been the most meaningful work for my development as a researcher. I experienced firsthand how proteins that play a major role in survival and development in cancer are evolutionarily conserved and yet evolve over time among species, and in doing so I have deepened my knowledge of biology and the mechanisms of evolution. I hope to continue working in the lab for the rest of my undergraduate career, discover a more disciplined researcher in myself, and contribute to science as I prepare to advance to medical studies.

Dhruva Kadiyala is a sophomore studying Neuroscience in Lyman Briggs College at Michigan State University. He is a pre-medical student also interested in biological research, and has worked with Cell and Molecular Biology Ph.D. student Rima Mouawad in the lab of David Arnosti.


CONSTAX: a tool to simplify and improve taxonomic classification of community sequences

By: Natalie Vande Pol (PhD Candidate, Michigan State University)

I am a 5th year PhD student in the Microbiology and Molecular Genetics program at Michigan State University. This is the story of a side project that has been one of the most enjoyable and rewarding undertakings in my PhD career. CONSTAX was the first project on which I was a key contributor. The co-first authors both worked in community ecology and they wanted to develop a tool, but they needed some help writing Python scripts. That’s where I came in.

Community ecologists use a technique called amplicon sequencing, in which they extract DNA from a substrate (e.g., soil, plants, water) and sequence specific genes that they then use as a “barcode” to identify the organism from which the DNA originated (Figure 1). In bacteria, this barcode is the 16S ribosomal RNA gene. In fungi, we generally use one of two ribosomal regions: ITS1 or ITS2. Ecologists use these barcode sequences to study pooled communities of organisms, allowing comparison of community structure between different conditions (e.g., healthy vs. diseased gut/plant). Think of it like a census for soil fungi. These comparisons can sometimes indicate organisms that are important to causing, preventing, detecting, or recovering from a given characteristic or disturbance.

Figure 1: Community barcoding. Barcode genes amplified from different organisms have small differences in sequence. So long as a sequence for that organism is included in the reference database, that sequence can be “translated” back into an organism name.

One of the most important steps in a community analysis pipeline is to “translate” the barcode DNA sequences from the sample into the names of the organisms from which they originated. This is done by comparing sample sequences to reference sequences from known organisms, just as a barcode in a grocery store needs a computer reference to tell the cashier whether you are buying cilantro or parsley. With DNA sequences, the identification algorithm used to match up the sequences is called a classifier. Using different reference databases or different classifiers can yield different identifications.

To illustrate what happens with different classifiers, imagine you and two of your friends are all taking the same test. All three of you get 80/100 questions correct on the exam. However, when you compare your exams, you realize that while you all had 75 questions in common, the other 5 correctly answered questions were unique to each of you. So, on the surface your performances seem identical, but are in fact a bit different. Similarly, using a single classifier and different reference databases is analogous to each of you three taking the same exam having studied from three different textbooks (assuming otherwise identical performance). Your scores on the exam would probably vary.

Fortunately, for fungal research, UNITE is a well-curated reference sequence database, so the largest source of variation is between classifiers. Just as described in the first analogy above, different classifiers use different algorithms to assign taxonomies and estimate confidence/error rates, making it difficult to select a single classifier as the “best”. Therefore, our two community ecologists and I set out to develop a tool that eliminated the need to choose just one! If you and your two friends could collectively take that exam, you could have gotten 90/100, instead of just 80/100 on your own.
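The arithmetic behind the exam analogy (you and two friends, 75 shared correct answers plus 5 unique ones each) can be checked with a few sets. The question numbers below are arbitrary placeholders; only the counts come from the analogy.

```python
# 75 questions that all three test-takers answered correctly.
shared = set(range(1, 76))

# Each person also got 5 questions right that nobody else did.
you = shared | set(range(76, 81))
friend_a = shared | set(range(81, 86))
friend_b = shared | set(range(86, 91))

individual_scores = [len(s) for s in (you, friend_a, friend_b)]

# Pooling the answers is a set union: 75 shared + 3 * 5 unique.
combined = you | friend_a | friend_b

print(individual_scores)  # [80, 80, 80]
print(len(combined))      # 90
```

The same idea motivates combining classifiers: each one gets a slightly different subset of sequences right, so pooling their answers can recover more than any one of them alone.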

First, we chose the most commonly used and most recently developed classifiers: the Ribosomal Database Project (RDP) Classifier, UTAX, and SINTAX. We wrote a series of custom scripts to format the UNITE reference database to be compatible with each of the classifiers and ran our sequence datasets through each of the classifiers. Finally, we used Python scripts to standardize the output formats. All of this is packaged and automated by a single shell script, constax.sh (Figure 2). Users simply place their input files in the specified folders and provide the names and desired parameters in a configuration file.

Figure 2: The CONSTAX workflow. The portion highlighted in the gray box is automated through a single master script to ensure ease of use.

For each sequence, we compared the three assignments given for each taxonomic rank (Kingdom, Phylum, Class, Order, Family, Genus, and Species). If the confidence score for a given assignment was below a threshold value, that and all further taxonomic ranks were considered “Unidentified” for that sequence. In most cases, the three classifiers agreed on the taxonomy assigned. However, there were cases in which they disagreed, whether because one (or two) of the classifiers yielded an Unidentified, or because there were multiple different, confident assignments (Table 1). With three classifiers, we decided to implement a simple majority rule. Since classifiers provide an estimated confidence in taxonomic classifications, we used confidence scores to break ties.

Table 1: The CONSTAX Voting Rules.
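To make the voting idea concrete, here is a minimal sketch of a threshold cutoff followed by a majority vote with a confidence tie-break. It is not the CONSTAX code and does not reproduce Table 1: the function names, the 0.8 threshold, the example assignments, and the choice to ignore “Unidentified” votes whenever at least one classifier made a confident call are all assumptions made for illustration.

```python
from collections import Counter

UNIDENTIFIED = "Unidentified"


def apply_threshold(assignment, threshold=0.8):
    """assignment: list of (name, confidence), one entry per rank.
    Once a rank falls below the threshold, that rank and all further
    ranks are marked Unidentified."""
    result, cut = [], False
    for name, conf in assignment:
        if cut or conf < threshold:
            cut = True
            result.append((UNIDENTIFIED, 0.0))
        else:
            result.append((name, conf))
    return result


def vote(calls):
    """calls: the (name, confidence) pairs from each classifier at one rank."""
    identified = [(n, c) for n, c in calls if n != UNIDENTIFIED]
    if not identified:
        return UNIDENTIFIED
    counts = Counter(n for n, _ in identified)
    best = max(counts.values())
    tied = [n for n, k in counts.items() if k == best]
    if len(tied) == 1:
        return tied[0]  # clear majority among the confident calls
    # Tie: fall back on the single highest confidence score.
    return max((nc for nc in identified if nc[0] in tied), key=lambda nc: nc[1])[0]


def consensus(per_classifier, threshold=0.8):
    filtered = [apply_threshold(a, threshold) for a in per_classifier]
    return [vote(list(rank_calls)) for rank_calls in zip(*filtered)]


# Hypothetical output for one sequence at Kingdom, Phylum, and Class.
rdp = [("Fungi", 1.0), ("Ascomycota", 0.95), ("Sordariomycetes", 0.85)]
utax = [("Fungi", 0.99), ("Ascomycota", 0.60), ("Sordariomycetes", 0.90)]
sintax = [("Fungi", 1.0), ("Basidiomycota", 0.90), ("Agaricomycetes", 0.82)]

result = consensus([rdp, utax, sintax])
print(result)  # ['Fungi', 'Ascomycota', 'Sordariomycetes']
```

In this example the second classifier drops out after Kingdom because its Phylum score is below the threshold, so the remaining two disagree and the higher confidence score decides.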

We tested our tool on four different datasets from three different studies: ITS1 or ITS2 barcode sequences of fungi from soil or plants (Figure 3). And it worked! Cross-referencing three classifiers corrected misassignments and improved overall performance. At the Kingdom level, the consensus taxonomy was only ~1% improved compared to any individual classifier. However, the more specific ranks below Kingdom showed much stronger improvement, on average 7-35%, depending on the taxonomic rank and the individual classifier. The mean improvement in performance by CONSTAX over individual classifiers is slightly overestimated due to particularly poor classification by UTAX, which had the most Unidentified levels.

Figure 3: CONSTAX Performance. a) Soil fungi barcoded with the ITS1 gene, from Smith & Peay. b) Soil fungi barcoded with the ITS2 gene, from Oliver et al. c-d) Plant fungi with the c) ITS1 and d) ITS2 genes, from Agler et al.

What’s next for CONSTAX?

First, we would love to develop our tool to be compatible with bacterial community sequences. Fortunately, the classifiers were all written for bacterial community analysis in the first place! Unfortunately, the reference databases are either out of date or so poorly curated as to have misidentified reference sequences and some convoluted taxonomies. Bacteria seem to be renamed rather frequently and it’s difficult to know whether the assignment given is still correct. We focused our preliminary efforts on the SILVA database, as it is the most up-to-date, but it has some serious formatting issues, among other things. In theory, there should be 7 taxonomic ranks. A significant proportion of the SILVA taxa have 4-13 levels, requiring manual correction to determine the appropriate classification for each of the 7 expected levels. At least in fungi, the different taxonomic ranks have consistent suffixes that can be used to identify gaps/insertions and correctly place the ranks. In bacteria, suffixes only seem to be consistent within particular lineages, so I would only be able to fix one group at a time, and quite often the canonical seven taxonomic levels simply don’t exist for some bacterial lineages.
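For fungi, the suffix trick mentioned above can be sketched in a few lines. The suffixes used here (-mycota for phylum, -mycetes for class, -ales for order, -aceae for family) are the standard endings in fungal nomenclature; the function name `place_ranks`, the assumption that the first entry is always the kingdom, and the decision to leave out Genus and Species (which have no fixed suffix) are simplifications for illustration, not how CONSTAX handles it.

```python
# Standard fungal rank suffixes; Genus and Species have no fixed ending.
SUFFIX_RANKS = {
    "mycota": "Phylum",
    "mycetes": "Class",
    "ales": "Order",
    "aceae": "Family",
}
RANKS = ["Kingdom", "Phylum", "Class", "Order", "Family"]


def place_ranks(lineage):
    """Map a variable-length lineage (kingdom first) onto fixed ranks.
    Extra levels such as subphyla are skipped; missing ranks stay
    'Unidentified'."""
    placed = {rank: "Unidentified" for rank in RANKS}
    if lineage:
        placed["Kingdom"] = lineage[0]  # assumption: first entry is the kingdom
    for name in lineage[1:]:
        for suffix, rank in SUFFIX_RANKS.items():
            if name.lower().endswith(suffix):
                placed[rank] = name
                break
    return placed


# A lineage with extra levels (subphylum and subclass are dropped).
long_lineage = ["Fungi", "Ascomycota", "Pezizomycotina", "Sordariomycetes",
                "Hypocreomycetidae", "Hypocreales", "Nectriaceae"]
# A lineage with gaps (no class or family given).
short_lineage = ["Fungi", "Ascomycota", "Hypocreales"]

placed_long = place_ranks(long_lineage)
placed_short = place_ranks(short_lineage)
print(placed_long)
print(placed_short)
```

A lookup like this breaks down for bacteria precisely because, as noted above, their suffixes are only consistent within particular lineages.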

Second, we are very interested in incorporating new classifiers into our tool. UTAX, in particular, is becoming obsolete and had the highest rate of “Unidentified” taxonomic assignments. While this may make our tool look good, it’s not really representative of the best we can do. However, an even number of classifiers, which is what we would have after adding a fourth, makes “voting” on a consensus assignment more complicated, and we would prefer to have a more elegant and sound basis for breaking ties than just comparing confidence scores, since those metrics are each calculated slightly differently and don’t mean quite the same thing. Our current approach is an excellent starting point, but future work in this area would be served by a more thorough evaluation of disagreements between classifiers.

If you’re interested in more detail or in using CONSTAX for your own research, this blog post is based on our publication (Gdanetz et al. 2017, listed in the references below), and our code repository is on GitHub.


References:

Agler MT, Ruhe J, Kroll S, Morhenn C, Kim S-T, Weigel D, et al. (2016) Microbial hub taxa link host and abiotic factors to plant microbiome variation. PLoS Biol. 14(1):e1002352–31.

Gdanetz K, Benucci GMN, Vande Pol N, Bonito G. (2017) CONSTAX: a tool for improved taxonomic resolution of environmental fungal ITS sequences. BMC Bioinformatics. 18(1):538.

Oliver AK, Callaham MA Jr, Jumpponen A. (2015) Soil fungal communities respond compositionally to recurring frequent prescribed burning in a managed southeastern US forest ecosystem. For Ecol Manag. 345:1–9.

Smith DP, Peay KG. (2014) Sequence depth, not PCR replication, improves ecological inference from next generation DNA sequencing. PLoS One. 9(2):e90234–12.
