Understanding Single-celled Life:

An Abstract Approach

by Ralph Butler, Ross Overbeek, ...


Part 1: The Cell: a Basic Abstraction

A cell is a bag (i.e., a volume enclosed by a membrane) that contains three types of things: compounds, cellular machines, and a genome.

By the term compound we refer to the normal notion of chemical compound.

A cellular machine is a set of proteins that together perform a function. Unless otherwise noted, when we use the term machine we will always be speaking of a cellular machine. Many machines transform one set of compounds into another set. Some machines (transport machines) are used to move compounds into or out of the cell. Later we will try to convey a more comprehensive notion of what functions are implemented by machines that we understand.

A protein is a string of amino acids (i.e., a string in the 20-character alphabet {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}).

A genome is a string of DNA bases (i.e., a string in the 4-character alphabet {A,C,G,T}).

A gene is a region in the genome that describes how to build a protein. The description is a sequence of 3-character codons. Each codon may be thought of as an instruction specifying which amino acid should come next in the protein the gene describes.   Thus, if the protein described by the gene contains 100 amino acids, then the gene would be composed of 100 codons (i.e., 300 DNA characters) followed by a codon that means "stop here" (a stop codon).  There are three stop codons: {TAA,TAG,TGA}. The genetic code is the table of correspondences between codons and amino acids:

Amino Acid Codons
A GCT, GCC, GCA, GCG
C TGT, TGC
D GAT, GAC
E GAA, GAG
F TTT, TTC
G GGT, GGC, GGA, GGG
H CAT, CAC
I ATT, ATC, ATA
K AAA, AAG
L TTA, TTG, CTT, CTC, CTA, CTG
M ATG
N AAT, AAC
P CCT, CCC, CCA, CCG
Q CAA, CAG
R CGT, CGC, CGA, CGG, AGA, AGG
S TCT, TCC, TCA, TCG, AGT, AGC
T ACT, ACC, ACA, ACG
V GTT, GTC, GTA, GTG
W TGG
Y TAT, TAC
* TAG, TGA, TAA [Stop codons]



The process of building a protein as a string of amino acids from the gene containing codons is called expressing the gene.
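As a minimal sketch of what expressing a gene means at this level of abstraction, the following Python fragment turns the genetic code table above into a lookup table and reads a gene one codon at a time until it meets a stop codon. The names CODONS_BY_AMINO_ACID, CODON_TABLE, and express are ours, chosen for illustration; a real cell does this with a great deal of machinery that the sketch ignores.

```python
# The genetic code, copied from the table above and keyed by amino acid.
# "*" marks the three stop codons.
CODONS_BY_AMINO_ACID = {
    "A": "GCT GCC GCA GCG", "C": "TGT TGC", "D": "GAT GAC", "E": "GAA GAG",
    "F": "TTT TTC", "G": "GGT GGC GGA GGG", "H": "CAT CAC", "I": "ATT ATC ATA",
    "K": "AAA AAG", "L": "TTA TTG CTT CTC CTA CTG", "M": "ATG", "N": "AAT AAC",
    "P": "CCT CCC CCA CCG", "Q": "CAA CAG", "R": "CGT CGC CGA CGG AGA AGG",
    "S": "TCT TCC TCA TCG AGT AGC", "T": "ACT ACC ACA ACG",
    "V": "GTT GTC GTA GTG", "W": "TGG", "Y": "TAT TAC", "*": "TAG TGA TAA",
}

# Invert the table so that each codon maps to its amino acid.
CODON_TABLE = {
    codon: amino_acid
    for amino_acid, codons in CODONS_BY_AMINO_ACID.items()
    for codon in codons.split()
}


def express(gene):
    """Translate a gene (a DNA string) into a protein string.

    Codons are read left to right; translation ends at the first stop codon.
    """
    protein = []
    for i in range(0, len(gene) - 2, 3):
        amino_acid = CODON_TABLE[gene[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For example, express("ATGGCTTGGTAA") reads the codons ATG, GCT, TGG, and then stops at TAA, giving the three-character protein "MAW".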

Problems in BioInformatics that Depend only on the Basic Abstraction

Identifying Genes within the Genome

If we plan on using a genome, it will usually be necessary to identify the genes within the genome.  How can this best be done?   First, it should be noted that this can be broken into two variations:

  1. Given a large collection of existing genomes in which the genes have been identified, find the set of genes in a new genome.
  2. Given no assumption of an existing body of previously identified genes, find the genes in a new genome.
The second problem had to be addressed in the early days of genomics, since there were no previously characterized genomes.  You could approach this problem by first seeking to estimate properties of the genes (for example, codon and dicodon usage), and then using these estimates to support a more refined search.  In many cases, we actually developed detailed inventories by doing actual work in the wet lab.

The first problem is much easier, in the sense that one can rapidly recognize genes similar to ones we have already analyzed (usually based on similarity of the corresponding protein sequences).
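To make the second variation concrete, here is a deliberately naive sketch that uses nothing but the basic abstraction: it scans each of the three reading frames of one strand for a start codon followed, in frame, by a stop codon. Two things in it are our assumptions rather than anything stated above: that a gene begins with the codon ATG, and that only the given strand needs to be searched. The function name find_orfs and the min_codons parameter are hypothetical. Real gene finders refine such candidates with statistics like codon and dicodon usage.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: genes begin with ATG


def find_orfs(genome, min_codons=2):
    """Return candidate genes as (start, end) index pairs.

    `end` is exclusive and includes the stop codon, so genome[start:end]
    is the candidate gene. `min_codons` counts codons before the stop.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(genome) - 2, 3):
            codon = genome[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # open a candidate at the first in-frame ATG
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close the candidate and keep scanning
    return sorted(orfs)
```

Candidates that are opened but never reach a stop codon are dropped, which is a design choice of the sketch, not a claim about biology.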

Identifying Similar Genes

Genes are said to be homologous if they share a common ancestor.  Tools have been developed to construct estimates of whether or not two genes, or the protein sequences they encode, are homologous.  Most of these are based on measuring the degree of similarity between the genes based on some metric.  The most basic versions of this problem are

  1. Given two genes (or proteins), are they homologs?  That is, estimate the likelihood that they are homologs.
  2. Given a gene and a database of other genes, extract a prioritized list from the database of genes that are likely to be homologs.  Similarly, given a protein sequence and a database of other protein sequences, which are most likely to be produced by homologous genes?
  3. Produce an alignment of two DNA or protein sequences that attempts to show corresponding characters in the two sequences.   For example,
<<<EXAMPLE of BINARY ALIGNMENT>>>
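One classical way to produce such a pairwise alignment is global alignment by dynamic programming (the Needleman-Wunsch approach). The sketch below is ours, not necessarily the method the tools mentioned above use, and the scoring values (match +1, mismatch -1, gap -1) are arbitrary assumptions chosen to keep the example small.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (aligned_a, aligned_b, score)."""
    rows, cols = len(a) + 1, len(b) + 1

    def pair_score(i, j):
        # Score for placing a[i-1] opposite b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # pair two characters
                score[i - 1][j] + gap,                    # gap in b
                score[i][j - 1] + gap,                    # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]
```

The two returned strings have equal length, and a "-" marks a position where one sequence has no corresponding character in the other.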

Multiple Sequence Alignment

Given a set of sequences (either DNA or protein sequences) that are similar, align them in a way that puts corresponding characters in the same column.  For example,

<<<MSA>>>

Producing more accurate multiple-sequence alignments, especially for large numbers of long sequences, is a fundamental problem, and there is still certainly room for improvement.

Construction of a Phylogenetic Tree, Given a Multiple-Sequence Alignment


One of the most central problems in bioinformatics is to take a multiple-sequence alignment and produce a phylogenetic tree estimating the historical relationship of the sequences.  There is a great deal that has been written on this problem, and a great deal more will be.
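As one small illustration, a simple distance-based method is UPGMA: compute a distance between every pair of aligned sequences, then repeatedly merge the two closest clusters. We name it plainly as a stand-in; it is only one of many tree-building methods and is not claimed to be the best. In the sketch below, the distance is the fraction of differing columns, the tree is returned as nested tuples, and the function names are ours.

```python
from itertools import combinations


def p_distance(s, t):
    """Fraction of aligned columns at which two equal-length rows differ."""
    return sum(x != y for x, y in zip(s, t)) / len(s)


def upgma(alignment):
    """Build a rooted tree from a dict mapping names to aligned sequences.

    Leaves are names; every internal node is a 2-tuple of subtrees.
    """
    clusters = {name: 1 for name in alignment}  # cluster -> number of leaves
    dist = {
        frozenset((a, b)): p_distance(alignment[a], alignment[b])
        for a, b in combinations(alignment, 2)
    }
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: dist[frozenset(pair)])
        merged = (a, b)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # Distance to the merged cluster is the size-weighted average.
        for c in clusters:
            dist[frozenset((merged, c))] = (
                size_a * dist[frozenset((a, c))] + size_b * dist[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))
```

Given three aligned rows in which x and y differ in one column and z differs from both in most columns, the sketch groups x with y first and then attaches z.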

What is "the tree of life" and How Might it Get Built?


The problem of constructing a single phylogenetic tree from a single alignment (the last problem) is relevant to this issue, but it does not cover it.  Suppose that you built 200 alignments  that contain the sequences common to almost all genomes.  Then, if you were to build 200 trees, and then you found that they were not identical (or even close in some cases), what would you infer, and how should you respond?  Is it even possible or desirable that we actually create an estimate of the history of how the existing micro-organisms have evolved from some ancestral organism?
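Before deciding how to respond to disagreeing trees, it helps to measure how much they disagree. A minimal sketch, assuming each rooted tree is written as nested tuples with string leaves (a representation we chose for illustration): collect the set of leaves under every internal node (a clade), and count the clades that appear in only one of the two trees. This is in the spirit of the Robinson-Foulds distance, applied here to rooted trees.

```python
def clades(tree):
    """Return (leaves, clades) for a nested-tuple rooted tree.

    `leaves` is a frozenset of leaf names; `clades` is the set of leaf sets
    found under each internal node, including the root.
    """
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)
    return leaves, found


def clade_distance(tree1, tree2):
    """Number of clades present in exactly one of the two trees."""
    return len(clades(tree1)[1] ^ clades(tree2)[1])
```

A distance of zero means the two trees group the sequences identically, regardless of the order in which children are written.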

Assuming that We Do Have an Estimate of the Tree of Life, which Proteins Characterize Subdivisions of the Tree?

It is clear that sequences are introduced into genomes through replication and (in addition) through horizontal transfer.  In the presence of large amounts of horizontal transfer, many genes will occur only in relatively small portions of a specific subtree (these represent relatively recent transfers).  Is it possible and meaningful to create inventories of proteins that tend to be unique to a subtree (or is the concept "tend to be unique" somewhat similar to "a little pregnant")?

Can We Identify Instances of Horizontal Transfer?

How can we construct tools to recognize horizontal transfer, and can these tools be good enough to sort out the actual details of the evolutionary history?

Can We Determine Which Columns and Sections of a Multiple-Sequence Alignment are Conserved (and Why)?

Conservation normally implies functional constraints (the reason a column has restricted content is that any evolutionary change  led to the death of the organism that had it).  Shifts of function relate to conserved sections that have changed (i.e., the sections are not random, but neither are they identical).  The correspondence between conservation and function is a rich source of significant problems.
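A first, crude measure of conservation can be computed directly from the alignment. The sketch below scores each column by the fraction of rows that hold the most common character in that column; a score of 1.0 means the column is identical in every sequence. The function name is ours, and treating a gap character as just another character is a simplifying assumption.

```python
from collections import Counter


def column_conservation(alignment):
    """Score each column of an alignment (a list of equal-length strings).

    The score is the fraction of rows holding the column's most common
    character, so it ranges from 1/len(alignment) up to 1.0.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows of an alignment must have the same length")
    scores = []
    for column in zip(*alignment):
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores
```

Runs of adjacent high-scoring columns are candidates for conserved sections; why they are conserved is the harder question raised above.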

To What Extent Can Structure (Secondary or Tertiary) be Predicted from a Multiple-Sequence Alignment?

Comparison of columns in a large multiple sequence alignment was the key to developing secondary structures for both DNA alignments and protein alignments.

The Machines: an Initial Inventory

Energy Issues

The following diagram offers a summary of the machines that relate to acquisition and storage of energy, as well as the production of a number of key compounds by breaking up sugar:




    
M1 harvesting light energy
M2 building sugar from smaller components and energy
M3 storing strings of sugar molecules as starch
M4 breaking up starch to give sugar
M5 breaking up sugar to get energy and smaller molecules

Many of our machines will need energy to run.  In the basic organism we are describing, we have included M1 to harvest energy from sunlight.  This process is called photosynthesis.  The cell stores energy in a molecule called ATP.  Whenever energy is needed, the molecule is broken into two pieces, releasing energy.  The cell maintains a fairly constant concentration of ATP, which allows reactions throughout the cell to depend on it.  This is similar in many respects to the way electricity is available throughout a house.  Appliances can be designed to plug in anywhere, and they assume the normal voltage will be available.  Similarly, we have a mechanism for maintaining the concentration of ATP, and this allows us to include reactions that depend on that concentration.

M2 is a machine that builds sugar from CO2 and energy.  This involves a number of transformations.  Eventually, we will need to examine the individual steps, but for now let us remain at this quite abstract level.

Machines M3 and M4  allow the cell to store sugars when energy is abundant, and then to use them later when energy is needed.  Starch should be thought of as just a string of sugar molecules, which is a convenient way to store them.  When sugar is needed, M4 can be used to break off a few.

Finally, M5 is a machine that takes sugar molecules and breaks them into smaller pieces, releasing energy (in the form of ATP) in the process.  These smaller molecules are the building blocks that are used over and over to construct things needed by the cell.  Here is a table that contains the abbreviations we use for these molecules.  Frankly, if you have not had biochemistry classes, you might simply work with the abbreviations, since the full names can be intimidating.

2OG 2-oxoglutarate
3PG 3-phosphoglycerate
A Adenine [one of the characters in a DNA string]
Ala Alanine [an amino acid]
Arg Arginine [an amino acid]
Asn Asparagine [an amino acid]
Asp Aspartate [an amino acid]
C Cytosine [one of the characters in a DNA string]
CHOR Chorismate
CO2 Carbon dioxide
Daughter genome the added cell after replication
E4P Erythrose 4-phosphate
Extra Membrane A little extra membrane for the new cell
G Guanine [one of the characters in a DNA string]
G6P Glucose 6-phosphate
Genome the DNA string in the cell that contains the genes
Gln Glutamine [an amino acid]
Glu Glutamate [an amino acid]
Gly Glycine [an amino acid]
HOM Homoserine
His Histidine [an amino acid]
Iso Isoleucine [an amino acid]
Leu Leucine [an amino acid]
Lys Lysine [an amino acid]
Membrane the thing enclosing the cell
Met Methionine [an amino acid]
OXLA Oxalacetate
PEP Phosphoenolpyruvate
PYR Pyruvate
Phe Phenylalanine [an amino acid]
Pro Proline [an amino acid]
R5P Ribose 5-phosphate
Ser Serine [an amino acid]
Starch A polymer of sugars (used for storage)
Sugar think glucose
T Thymine [one of the characters in a DNA string]
Thr Threonine [an amino acid]
Trp Tryptophan [an amino acid]
Tyr Tyrosine [an amino acid]
Val Valine [an amino acid]


Building the Amino Acids


M6 build glutamate and glutamine  from 2-oxoglutarate
M7 build proline
M8 build aspartate from oxalacetate
M9 build arginine
M10 build asparagine
M11 build serine



M12 build glycine from serine
M13 build cysteine from serine
M14 build methionine from homoserine and cysteine
M15 build lysine
M16 build homoserine from aspartate
M17 build threonine
M18 build isoleucine



M19 build alanine
M20 build valine
M21 build leucine
M22 build the intermediate chorismate
M23 build tyrosine and phenylalanine
M24 build tryptophan
M25 build ribose 5-phosphate 
M26 build histidine
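Several of the machines above name the compound they start from, which means some amino acids cannot be built until others exist. A minimal sketch of this dependency structure follows, using only the precursor relationships stated in the lists (M6, M8, M12, M13, M14, and M16) and Python's standard graphlib module (Python 3.9 or later). The abbreviation Cys for cysteine is ours, since it does not appear in the abbreviation table; the dictionary and function names are also ours.

```python
from graphlib import TopologicalSorter

# product -> set of compounds it is built from, as stated in the machine lists.
PRECURSORS = {
    "Glu": {"2OG"},         # M6
    "Gln": {"2OG"},         # M6
    "Asp": {"OXLA"},        # M8
    "Gly": {"Ser"},         # M12
    "Cys": {"Ser"},         # M13 (Cys is our abbreviation for cysteine)
    "Met": {"HOM", "Cys"},  # M14
    "HOM": {"Asp"},         # M16
}


def build_order(precursors):
    """Return the compounds in an order where precursors come before products."""
    return list(TopologicalSorter(precursors).static_order())
```

Any order returned this way places oxalacetate before aspartate, aspartate before homoserine, and both homoserine and cysteine before methionine. The machines whose inputs are not named in the lists are left out rather than guessed.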

Expressing Genes



M30 building a protein from amino acids and a gene

Motility

The cell we envision has some motility.  It can "turn on its motor and propellers" to move a bit, turn off the motility machinery, wait a while, turn it on again, and so forth.
We do not show a diagram or table of this machine, but we shall number it M31.

Replication


Replication is described in a somewhat imprecise manner.  We think of M27 as a machine that builds the nucleotides, which are the characters that make up the DNA genome.  Then M28 is a machine that takes these loose "characters" floating in the cell, along with the existing genome, and manufactures a copy of the genome.  Finally, M29 takes some extra membrane (see the output of M5), the genome copy, and "pinches" the extended cell, creating two separate cells, which we call the "original" (containing the original genome) and the "daughter" (containing the copy of the genome).

 



M27 build nucleotides
M28 build new genome
M29 split the cell into original and daughter
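The inventory above treats each machine as something that turns one set of compounds into another. A minimal sketch of that abstraction, with the cell's contents held as a count per compound, might look like the following. The class name Machine and the one-for-one stoichiometry used for M3 and M4 in the usage note are toy assumptions of ours; the text gives no quantities.

```python
from collections import Counter


class Machine:
    """A machine consumes a multiset of compounds and produces another."""

    def __init__(self, name, inputs, outputs):
        self.name = name
        self.inputs = Counter(inputs)
        self.outputs = Counter(outputs)

    def can_run(self, pool):
        """True if the pool (a Counter) holds every required input."""
        return all(pool[compound] >= n for compound, n in self.inputs.items())

    def run(self, pool):
        """Consume inputs and add outputs in place.

        Returns False, leaving the pool unchanged, if inputs are missing.
        """
        if not self.can_run(pool):
            return False
        pool.subtract(self.inputs)
        pool.update(self.outputs)
        return True
```

With m3 = Machine("M3", {"Sugar": 1}, {"Starch": 1}) and m4 = Machine("M4", {"Starch": 1}, {"Sugar": 1}), running m3 on a pool holding two units of sugar stores one of them as starch, and m4 can later recover it; running m4 on a pool with no starch does nothing.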

Problems in BioInformatics that Can Be Done Once the Notion of "Function" Exists


The inventory of machines has led us (albeit circuitously) into a discussion of "the function of a protein" and how to think about it.  These problems relate to the use of comparative analysis between the protein sequences from many distinct genomes (and what clues we can expect to develop in our attempts to make sense of it all).

Identifying the Functions of Genes

The general topic of how to assign function to genes is central to genome annotation.  Deciding when you can safely project function based on similarity is a topic that can profitably be pondered.

Predicting When Two Genes Implement Related Functions

There are many clues that can be used to improve the accuracy of function projection.  Conservation of contiguity, detection of gene fusions, protein-protein interaction data, and characterization of regulatory sites have all proven useful.  Integration of clues from a number of sources has been attempted (and will undoubtedly be important in the future).

Grouping Genes into Subsystems

The genes that encode proteins that together implement a single machine may be thought of as an instance of a subsystem.  In later tutorials we will discuss the notion of subsystem in more detail.  Essentially, it is an abstraction of the notion of machine, and it represents an important conceptual framework for analyzing the functions of genes from many genomes simultaneously.  So, how can you detect when two genes are components of the same machine?

Constructing Sets of Isofunctional Homologs

Homologs are genes that share a common ancestor.  Isofunctional genes implement the same function.  The goal of compiling sets of homologous genes (and the proteins they encode) that implement a single function is central to automating annotation of genomes.  Further, since we will be faced with annotating thousands of new genomes over the next few years (and the number will increase much more rapidly after that), almost all annotations will be automated.

Supporting Decision Procedures for Sets of Isofunctional Homologs

Suppose that you have a collection of sets of isofunctional homologs.  Suppose further that you have, say, 10,000 of these sets.  For each set, you will wish to develop a decision procedure which, when given as input a set and a new protein sequence, determines whether or not the protein should be added to the set.  In some cases, such decisions are easy, and you will wish to use a very fast decision procedure.  In others, they are very difficult, and you will need to bring many sources of clues to bear.
Construction of such decision procedures will become increasingly important.
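To give a feel for what a very fast decision procedure could look like, here is a toy sketch that accepts a new protein if it shares enough short substrings (k-mers) with some member of the set. Everything specific in it is an assumption of ours: the use of k-mers, the Jaccard measure, the value k=3, and the threshold of 0.5. It is a stand-in for illustration, and the difficult cases described above would need alignment and other sources of clues.

```python
def kmers(protein, k=3):
    """Return the set of all length-k substrings of a protein string."""
    return {protein[i:i + k] for i in range(len(protein) - k + 1)}


def jaccard(a, b):
    """Size of the intersection of two sets divided by the size of their union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def should_add(family, candidate, k=3, threshold=0.5):
    """Decide whether `candidate` belongs in `family` (a list of proteins).

    The candidate is accepted if its k-mer set is similar enough to that of
    at least one current member.
    """
    candidate_kmers = kmers(candidate, k)
    best = max(
        (jaccard(candidate_kmers, kmers(member, k)) for member in family),
        default=0.0,
    )
    return best >= threshold
```

A procedure of this kind is cheap enough to run against thousands of sets, which is its only real virtue; its answers near the threshold should not be trusted.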

Characterization of Regulons for a Genome

Genes are often co-regulated.  That is, expression of a set of genes may always be tightly coordinated.  In this case, we will think of the co-regulated set as a regulon.  Determination of which genes make up which regulons is a task that involves both bioinformatic analysis and wet lab confirmation.  Don't attempt this one without a close working relationship with a wet lab biologist.

Characterization of "States of the Cell"

It might be conjectured that a cell has a limited set of states.  Each state is characterized by the set of regulons that are expressed.  It seems likely that the cell should be viewed as "tending to stay in the same state" until forced to make a transition to another state.  That is, the states demonstrate a degree of homeostasis.  If we understood a comprehensive list of states, and we worked out the forces that determine transitions, we would begin to understand the cell as a dynamic system.
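The conjecture can be written down as a small state machine. In the sketch below, each state is the set of regulons it expresses, and the cell stays in its current state unless a signal forces a transition. Every state name, regulon name, and signal here is hypothetical, invented only to show the shape of the model.

```python
# Each state is characterized by the set of regulons expressed in it.
# All names below are hypothetical.
STATES = {
    "growing": frozenset({"R_photosynthesis", "R_amino_acids", "R_replication"}),
    "starving": frozenset({"R_starch_breakdown", "R_motility"}),
}

# (current state, signal) -> next state. Pairs not listed cause no change.
TRANSITIONS = {
    ("growing", "dark"): "starving",
    ("starving", "light"): "growing",
}


def step(state, signal):
    """Return the next state; the cell stays put unless a transition is forced."""
    return TRANSITIONS.get((state, signal), state)


def run(state, signals):
    """Return the list of states visited, starting with the initial state."""
    history = [state]
    for signal in signals:
        state = step(state, signal)
        history.append(state)
    return history
```

Starting from "growing" and applying the signals light, dark, dark, light, the cell stays in "growing", moves to "starving", stays there, and then returns to "growing", which is the "tending to stay in the same state" behavior described above.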