Exploring Genetic Design Space with Phylosemantics

This post was written by UW grad student Bryan Bartley.

Synthetic biology is a fascinating, interdisciplinary field at the intersection of biology and engineering. Synthetic biologists envision that life can be re-programmed by rewriting the genetic code of organisms. A variety of biotechnologies for synthesizing, assembling, and editing DNA now make this possible. Of course, this idea has many profound and serious implications, one of many reasons why it is such an interesting field to work in.

Many people are uncomfortable with the idea of tinkering with the genetic code. My scientific and personal convictions lead me to believe that if humanity wants to live in harmony with nature, then we must learn to speak the language of life.

The language of life is written mostly in terms of A, C, T, and G, which, as you perhaps learned in biology, stand for the four molecular bases of the genetic code. These bases are strung together into long sequences of DNA by means of a polymeric backbone. It’s a bit of an oversimplification to describe DNA as genetic code, because frankly there is still a lot we don’t understand. However, every organism on earth, to our knowledge, uses DNA to encode living processes inside its cells. Human beings are related to the rest of the animal kingdom and in fact to all living organisms. The story is written in our DNA.

If you ever have the opportunity to take courses in biology or biochemistry, then you might just learn the basics of decoding DNA. However, unlocking the mysteries of the genetic code has taken decades, and continues to be a scientific challenge full of surprising discoveries. The approach I discuss in this week’s BEACON blog, called phylosemantics, is a technique for interpreting the genetic code that might be useful in some special cases.

Phylosemantics is a computational algorithm I developed as part of my PhD research in synthetic biology. It combines two methods: phylogenetics, which is commonly used in evolutionary biology, and semantic clustering, an idea with roots in artificial intelligence. Both methods use tree diagrams to classify information into families or groups with similar characteristics. There’s a good chance you have seen a phylogenetic tree before, and just don’t remember! In case you have forgotten what they look like, evogeneao.com has a nice interactive tree-of-life. Phylogenetics uses similarities in DNA sequence to group related sequences together. In contrast, phylosemantics makes a semantic comparison between different components of DNA.

For example, consider the Cox combinatorial promoter library [1], which consists of 288 variant genetic promoters. Each individual promoter is composed of three genetic operators arranged sequentially in distal, medial, and proximal positions (Fig. 1). The boundaries between positions are defined by the -35 and -10 sigma70 RNA polymerase binding sites. Promoter variants were derived by varying the operator type at each position (repressor, neutral, or activator). Operator sites may also be varied by substituting operators derived from different species. For example, the G and H variants represent operators specific to the LacI and TetR repressor proteins, respectively, while the activator variants J and K represent AraC and LuxR binding sites. Thus, it is possible for two operators to be semantically equivalent even though they differ in DNA sequence.
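The idea of semantic equivalence can be sketched in a few lines of Python. This is only an illustration, not the phylosemantics implementation: it assumes a promoter can be written as three operator letters read in distal, medial, proximal order, it uses only the four operator roles named above, and it introduces a hypothetical letter "N" as a stand-in for a neutral operator. The promoter names in the example are made up and are not members of the Cox library.

```python
# Roles stated in the post: G (LacI) and H (TetR) are repressor operators,
# J (AraC) and K (LuxR) are activator operators.
# "N" is a hypothetical placeholder for a neutral operator.
OPERATOR_ROLE = {
    "G": "repressor",
    "H": "repressor",
    "J": "activator",
    "K": "activator",
    "N": "neutral",
}


def semantic_config(promoter):
    """Map a three-letter promoter name to its (distal, medial, proximal) roles."""
    if len(promoter) != 3:
        raise ValueError("expected one operator letter per position")
    return tuple(OPERATOR_ROLE[letter] for letter in promoter)


def semantically_equivalent(a, b):
    """True when two promoters have the same role at every position."""
    return semantic_config(a) == semantic_config(b)


# "GNJ" and "HNK" are made-up names used only for illustration.
print(semantic_config("GNJ"))                 # ('repressor', 'neutral', 'activator')
print(semantically_equivalent("GNJ", "HNK"))  # True
```

The two example promoters use different operators at the distal and proximal positions, so their DNA sequences would differ, yet they share the same repressor, neutral, activator layout.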

The phylosemantic tree (Fig. 2) diagrams 12 variant promoters from the Cox library. This tree systematically groups the promoter variants into 3 families based on similar configurations. The lengths of the tree's branches correspond to the semantic distance between variant designs. If adjacent branches have no length, then the adjacent promoters have the same configuration. Tabulated next to each variant is its level of gene expression. The advantage of graphing these data with a phylosemantic tree is that some patterns in gene expression become more apparent.
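To make "semantic distance" and tree building concrete, here is a rough sketch. The post does not spell out the distance formula or the tree-building method, so both are stand-ins chosen for simplicity: the distance counts positions whose operator roles differ (a Hamming distance over roles), and the tree is built by plain average-linkage clustering. The role table reuses the four operators named earlier plus a hypothetical neutral letter "N", and the promoter names are invented.

```python
from itertools import combinations

# G/H are repressor operators and J/K activator operators (from the post);
# "N" is a hypothetical neutral operator added for illustration.
ROLE = {"G": "repressor", "H": "repressor",
        "J": "activator", "K": "activator", "N": "neutral"}


def semantic_distance(a, b):
    """Number of positions at which two promoters have different roles."""
    return sum(ROLE[x] != ROLE[y] for x, y in zip(a, b))


def build_tree(names):
    """Average-linkage clustering.

    Leaves are promoter names; each internal node is a tuple
    (left, right, height), where height is the mean semantic distance
    between the two merged groups.
    """
    # Each cluster is stored as (subtree, list of member names).
    clusters = [(name, [name]) for name in names]

    def average_distance(c1, c2):
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(semantic_distance(x, y) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        height = average_distance(clusters[i], clusters[j])
        merged = ((clusters[i][0], clusters[j][0], height),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Made-up promoter names, not members of the Cox library.
print(build_tree(["GNJ", "HNK", "NGJ", "NHK"]))
```

Promoters that merge at height 0 share the same configuration, which mirrors the zero-length branches described above.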

The first family of variants (FJK, IDD, FDB, and HEB) is clustered by my algorithm because each has a repressor operator in the distal position. These promoters exhibit high gene expression despite the presence of a repressor operator. In other words, repression in this family of promoters appears to fail. In contrast, the middle cluster contains the similar promoters DGB and AFI, each with a medial repressor operator. Promoters with a medial repressor operator exhibit very low gene expression, consistent with repression. This makes sense from a biophysical perspective: a repressor bound in the medial position will sterically hinder RNA polymerase binding. A design pattern may thus be stated: repressor operators in the medial position exhibit a pronounced repression effect, while repressor operators in the distal position appear ineffective. The point of the phylosemantic tree is to systematically organize the different genetic architectures and find patterns in their behavior.
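A pattern like this can also be checked numerically once promoters are grouped by where their repressor sits. The sketch below is an assumption-laden illustration rather than the analysis behind Fig. 2: it treats only G and H as repressor operators (the two named earlier), assumes three-letter names read in distal, medial, proximal order, and expects the caller to supply a dictionary of measured expression levels keyed by promoter name. No expression values are included here.

```python
from collections import defaultdict
from statistics import mean

# Only G (LacI) and H (TetR) are identified as repressor operators in the post.
REPRESSORS = {"G", "H"}
POSITIONS = ("distal", "medial", "proximal")


def repressor_positions(promoter):
    """Return the positions at which a promoter carries a repressor operator."""
    return tuple(pos for pos, letter in zip(POSITIONS, promoter)
                 if letter in REPRESSORS)


def mean_expression_by_repressor_position(measurements):
    """Average expression for each repressor position.

    `measurements` maps a promoter name to its measured expression level.
    Promoters without any repressor operator are grouped under "none".
    """
    grouped = defaultdict(list)
    for name, level in measurements.items():
        for pos in repressor_positions(name) or ("none",):
            grouped[pos].append(level)
    return {pos: mean(levels) for pos, levels in grouped.items()}
```

If the design pattern holds, the "medial" average should come out much lower than the "distal" one when the function is given real measurements.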

This brief explanation of phylosemantics barely scratches the surface, but I hope some readers will at least find it intriguing. Phylosemantics encompasses a number of related approaches that might apply in different scenarios. For example, different formulae for calculating semantic distance can produce trees that are more useful for one type of analysis versus another. Another choice with interesting implications is whether to construct a rooted versus unrooted tree. Scenarios in which phylosemantics might be useful include:

  • Phylosemantic classification might be useful for comparing different genetic architectures in natural biological variants.
  • Phylosemantics can be used to discover genetic design rules for synthetic biology.
  • Phylosemantic classification might be used to systematically classify permutations of genes in different orientations.
  • Phylosemantics could enable biodesign automation efforts by helping synthetic biologists plan rational assembly strategies starting from the given DNA templates.
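To illustrate the earlier remark that different formulae for semantic distance can suit different analyses, here is one possible alternative to simply counting mismatched positions. Everything numeric in it is a hypothetical choice made for the example: the substitution costs (a repressor swapped for an activator counts double) and the optional per-position weights are not taken from the phylosemantics work, and "N" is again a placeholder for a neutral operator.

```python
# G/H are repressor operators and J/K activator operators (from the post);
# "N" is a hypothetical neutral operator.
ROLE = {"G": "repressor", "H": "repressor",
        "J": "activator", "K": "activator", "N": "neutral"}


def role_cost(r1, r2):
    """Hypothetical cost of swapping one operator role for another."""
    if r1 == r2:
        return 0.0
    if "neutral" in (r1, r2):
        return 1.0
    return 2.0  # repressor versus activator: opposite roles


def weighted_distance(a, b, weights=(1.0, 1.0, 1.0)):
    """Sum of role costs, scaled per position (distal, medial, proximal)."""
    return sum(w * role_cost(ROLE[x], ROLE[y])
               for w, x, y in zip(weights, a, b))


# Made-up promoter names; weighting the medial position more heavily
# makes a medial difference count for more than a distal one.
medial_heavy = (0.5, 2.0, 1.0)
print(weighted_distance("GNJ", "NNJ", medial_heavy))  # 0.5
print(weighted_distance("NGJ", "NNJ", medial_heavy))  # 2.0
```

Swapping a distance function like this into a tree-building routine would stretch or shrink branches depending on which positions and role changes matter most for the question at hand.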

If you found this discussion interesting, I will be presenting this topic at the BEACON Congress and the International Workshop for Biodesign Automation in Pittsburgh, PA in August. I’m very interested in connecting with collaborators in industry or academia who are interested in applying phylosemantic approaches to a case study. Thanks for reading my post today!


[1] R. S. Cox et al., “Programming gene expression with combinatorial promoters,” Mol. Syst. Biol., vol. 3, no. 1, p. 145, 2007.
