Understanding Single-celled Life:
An Abstract Approach
by Ralph Butler, Ross Overbeek, ...
Part 1: The Cell: a Basic Abstraction
A cell is a bag (i.e., a volume enclosed by a
membrane) that contains three types of things: compounds, cellular
machines, and a genome.
By the term compound we refer to the
normal notion of chemical compound.
A cellular machine is a set of proteins
that together perform a function. Unless otherwise noted,
when we use the term machine we will always be
speaking of a cellular machine.
Many machines
transform one set of compounds into another set. Some machines
(transport machines) are used to move compounds into
or out of the cell. Later we will try to convey a more comprehensive
notion of what functions are implemented
by machines that we understand.
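To make the idea of a machine as a transformation on compounds concrete, here is a minimal sketch. The `Machine` class, the `run` function, and the compound names `X` and `Y` are all hypothetical and chosen only for illustration; the sketch assumes that one run of a machine consumes fixed amounts of its inputs and produces fixed amounts of its outputs.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Machine:
    """A cellular machine that turns one set of compounds into another."""
    name: str
    inputs: dict   # compound -> units consumed per run (assumed fixed)
    outputs: dict  # compound -> units produced per run (assumed fixed)


def run(machine, compounds):
    """Return the cell's compounds after one run, or None if inputs are missing."""
    if any(compounds[c] < n for c, n in machine.inputs.items()):
        return None
    after = Counter(compounds)
    after.subtract(machine.inputs)
    after.update(machine.outputs)
    return +after  # unary plus drops compounds whose count fell to zero


# Hypothetical machine and quantities, for illustration only.
toy = Machine("toy", inputs={"X": 2}, outputs={"Y": 1})
cell = Counter({"X": 3})
print(run(toy, cell))
```

A transport machine could be modeled the same way by treating "X outside the cell" and "X inside the cell" as two different compounds.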
A protein is a string of amino acids
(i.e., a string in the 20-character alphabet
{A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}).
A genome is a string of DNA bases (i.e., a
string in the 4-character alphabet {A,C,G,T}).
A gene is a region in the genome that
describes how to build a
protein. The description is a sequence of 3-character codons. Each
codon may
be thought of as an
instruction specifying which amino acid should come next in the protein
the gene describes. Thus, if the protein described by the
gene
contains 100 amino acids, then the gene would be composed of 100 codons
(i.e., 300 DNA characters) followed by a codon that means "stop here"
(a stop codon).
There are three stop codons: {TAA,TAG,TGA}. The genetic code is the
table of correspondences between codons and amino acids:
| Amino Acid | Codons |
| A | GCT, GCC, GCA, GCG |
| C | TGT, TGC |
| D | GAT, GAC |
| E | GAA, GAG |
| F | TTT, TTC |
| G | GGT, GGC, GGA, GGG |
| H | CAT, CAC |
| I | ATT, ATC, ATA |
| K | AAA, AAG |
| L | TTA, TTG, CTT, CTC, CTA, CTG |
| M | ATG |
| N | AAT, AAC |
| P | CCT, CCC, CCA, CCG |
| Q | CAA, CAG |
| R | CGT, CGC, CGA, CGG, AGA, AGG |
| S | TCT, TCC, TCA, TCG, AGT, AGC |
| T | ACT, ACC, ACA, ACG |
| V | GTT, GTC, GTA, GTG |
| W | TGG |
| Y | TAT, TAC |
| * | TAG, TGA, TAA [Stop codons] |
The process of building a protein as a string of amino acids
from the gene containing codons is
called expressing the gene.
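Expressing a gene, as described above, can be sketched in a few lines. The dictionary below is copied from the genetic code table; the function name `express` is our own, and the sketch assumes the input string starts at the first codon of the gene and contains only the characters A, C, G, and T.

```python
# Genetic code from the table above, keyed by amino acid ("*" marks stop codons).
CODONS_BY_AMINO_ACID = {
    "A": "GCT GCC GCA GCG", "C": "TGT TGC", "D": "GAT GAC", "E": "GAA GAG",
    "F": "TTT TTC", "G": "GGT GGC GGA GGG", "H": "CAT CAC", "I": "ATT ATC ATA",
    "K": "AAA AAG", "L": "TTA TTG CTT CTC CTA CTG", "M": "ATG", "N": "AAT AAC",
    "P": "CCT CCC CCA CCG", "Q": "CAA CAG", "R": "CGT CGC CGA CGG AGA AGG",
    "S": "TCT TCC TCA TCG AGT AGC", "T": "ACT ACC ACA ACG",
    "V": "GTT GTC GTA GTG", "W": "TGG", "Y": "TAT TAC", "*": "TAG TGA TAA",
}

# Invert the table so that each codon maps to its amino acid.
CODON_TO_AMINO_ACID = {
    codon: amino_acid
    for amino_acid, codons in CODONS_BY_AMINO_ACID.items()
    for codon in codons.split()
}


def express(gene):
    """Read the gene three characters at a time until a stop codon appears."""
    protein = []
    for i in range(0, len(gene) - 2, 3):
        amino_acid = CODON_TO_AMINO_ACID[gene[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(express("ATGGCTTGGTAA"))
```

Here the gene has three codons (ATG, GCT, TGG) followed by the stop codon TAA, so the protein has three amino acids.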
Problems in Bioinformatics that Depend only on the Basic Abstraction
Identifying Genes within the Genome
If we plan on using a genome, it will usually be necessary to identify
the genes within the genome. How can this best be done?
First, it should be noted that this can be broken into two
variations:
- Given a large collection of existing genomes in which the
genes have been identified, find the set of genes in a new genome.
- Given no assumption of an existing body of previously
identified genes, find the genes in a new genome.
The second problem had to be addressed in the early days of genomics,
since there were no previously characterized genomes. You
could approach this problem by first seeking to estimate properties of
the genes (for example, codon and dicodon usage), and then using these
estimates to support a more refined search. In many cases, we
actually developed detailed inventories by doing work in the
wet lab.
The first problem is much easier, in the sense that one can rapidly
recognize genes similar to ones we have already analyzed (usually based
on similarity of the corresponding protein sequences).
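As a very rough starting point for the second variation, one can simply scan the genome for stretches of codons that end in a stop codon. The sketch below is a naive illustration and not a real gene finder. It assumes that genes begin with the codon ATG (an assumption that the basic abstraction above does not state), it looks only at the string as given and ignores the opposite strand, and the function name `find_candidate_genes` and the `min_codons` cutoff are our own. A serious tool would refine such candidates with estimates like codon and dicodon usage.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: candidate genes begin with ATG


def find_candidate_genes(genome, min_codons=2):
    """Return (start, end) positions of candidate genes, stop codon included."""
    candidates = []
    for frame in range(3):  # a gene may begin at any offset modulo 3
        start = None
        for i in range(frame, len(genome) - 2, 3):
            codon = genome[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    candidates.append((start, i + 3))
                start = None
    return sorted(candidates)


genome = "CCATGGCTTGGTAACC"
for start, end in find_candidate_genes(genome):
    print(start, end, genome[start:end])
```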
Identifying Similar Genes
Genes are said to be homologous
if they share a common ancestor. Tools have been developed to
construct estimates of whether or not two genes, or the protein
sequences they encode, are homologous. Most of these are
based on measuring the degree of similarity
between the genes based on some metric. The most basic
versions of this problem are
- Given two genes (or proteins), are they homologs?
That is, estimate the likelihood that they are homologs.
- Given a gene and a database of other genes, extract a
prioritized list from the database of genes that are likely to be
homologs. Similarly, given a protein sequence and a database
of other protein sequences, which are most likely to be produced by
homologous genes?
- Produce an alignment
of two DNA or protein sequences that attempts to show corresponding
characters in the two sequences. For example,
<<<EXAMPLE of BINARY ALIGNMENT>>>
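One classical way to produce such an alignment is global alignment by dynamic programming (the Needleman-Wunsch method). The sketch below is a minimal version of that technique and is not necessarily what any particular tool uses. The scoring values (`match`, `mismatch`, `gap`) are arbitrary assumptions, and real tools use more elaborate scoring, especially for proteins.

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two sequences and return the two gapped strings."""
    rows, cols = len(a) + 1, len(b) + 1

    def pair_score(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # align two characters
                score[i - 1][j] + gap,                   # gap in b
                score[i][j - 1] + gap,                   # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


for row in align("ACGT", "AGT"):
    print(row)
```

Corresponding characters end up in the same position of the two output strings, with "-" marking a character that has no partner in the other sequence.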
Multiple Sequence Alignment
Given a set of sequences (either DNA or protein sequences) that are
similar, align them in a way that puts corresponding characters in the
same column. For example,
<<<MSA>>>
Producing
more accurate multiple-sequence alignments, especially for large
numbers of long sequences, is a fundamental problem, and there is still
certainly room for improvement.
Construction of a Phylogenetic Tree, Given a Multiple-Sequence Alignment
One
of the most central problems in bioinformatics is to take a
multiple-sequence alignment and produce a phylogenetic tree estimating
the historical relationship of the sequences. There is a great
deal that has been written on this problem, and a great deal more will
be.
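To give a feel for the problem, here is a sketch of one of the simplest distance-based approaches, average-linkage clustering (often called UPGMA). It is only one of many methods, it assumes roughly constant rates of change along all lineages, and it returns just the branching pattern with no branch lengths. The function names and the toy alignment are hypothetical.

```python
from itertools import combinations


def p_distance(s, t):
    """Fraction of aligned columns at which two rows differ."""
    return sum(x != y for x, y in zip(s, t)) / len(s)


def upgma(alignment):
    """Cluster aligned sequences; return the tree as nested tuples of names."""
    # Each key is a subtree; each value lists the sequences under that subtree.
    clusters = {name: [seq] for name, seq in alignment.items()}

    def average_distance(c1, c2):
        pairs = [(s, t) for s in clusters[c1] for t in clusters[c2]]
        return sum(p_distance(s, t) for s, t in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two closest clusters into a new subtree.
        a, b = min(combinations(clusters, 2), key=lambda pair: average_distance(*pair))
        clusters[(a, b)] = clusters.pop(a) + clusters.pop(b)
    return next(iter(clusters))


# Hypothetical rows of a multiple-sequence alignment.
toy_alignment = {"a": "AAAA", "b": "AAAT", "c": "GGGG"}
print(upgma(toy_alignment))
```

In the toy alignment, rows a and b differ in one column while c differs from both in every column, so a and b are joined first.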
What is "the tree of life" and How Might it Get Built?
The
problem of constructing a single phylogenetic tree from a single
alignment (the last problem) is relevant to this issue, but it does not
cover it. Suppose that you built 200 alignments that
contain the sequences common to almost all genomes. Then, if you
were to build 200 trees, and then you found that they were not
identical (or even close in some cases), what would you infer, and how
should you respond? Is it even possible or desirable that we
actually create an estimate of the history of how the existing
micro-organisms have evolved from some ancestral organism?
Assuming that We Do Have an Estimate of the Tree of Life, which Proteins Characterize Subdivisions of the Tree?
It
is clear that sequences are introduced into genomes through replication
and (in addition) through horizontal transfer. In the presence of
large amounts of horizontal transfer, many genes will occur only in
relatively small portions of a specific subtree (these represent
relatively recent transfers). Is it possible and meaningful to
create inventories of proteins that tend to be unique to a subtree (or
is the concept "tend to be unique" somewhat similar to "a little
pregnant")?
Can We Identify Instances of Horizontal Transfer?
How
can we construct tools to recognize horizontal transfer, and can these
tools be good enough to sort out the actual details of the evolutionary
history?
Can We Determine Which Columns and Sections of a Multiple-Sequence Alignment are Conserved (and Why)?
Conservation
normally implies functional constraints (the reason a column has
restricted content is that any evolutionary change led to the
death of the organism that had it). Shifts of function relate to
conserved sections that have changed (i.e., the sections are not
random, but neither are they identical). The correspondence
between conservation and function is a rich source of significant
problems.
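A simple first measure of how restricted a column is can be computed from the frequencies of the characters in it. The sketch below uses Shannon entropy, which is one common choice among several. The function names and the `max_entropy` cutoff are our own, and gap characters are treated like any other character, which is a simplification.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy in bits; 0 means every row has the same character."""
    total = len(column)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(column).values()
    )


def conserved_columns(rows, max_entropy=0.0):
    """Return the indices of columns whose entropy is at most max_entropy."""
    return [
        i for i, column in enumerate(zip(*rows))
        if column_entropy(column) <= max_entropy
    ]


# Hypothetical rows of a multiple-sequence alignment.
rows = ["ACGT", "ACGA", "ACCT"]
print(conserved_columns(rows))
```

Raising `max_entropy` above zero admits columns that are restricted but not identical, which is the situation described above for shifts of function.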
To What Extent Can Structure (Secondary or Tertiary) be Predicted from a Multiple-Sequence Alignment?
Comparison
of columns in a large multiple sequence alignment was the key to
developing secondary structures for both DNA alignments and protein
alignments.
The Machines: an Initial Inventory
Energy Issues
The following diagram offers a summary of the machines that relate to
acquisition and storage of energy, as well as the production of a
number of key compounds by breaking up sugar:

| M1 | harvesting light energy |
| M2 | building sugar from smaller components and energy |
| M3 | storing strings of sugar molecules as starch |
| M4 | breaking up starch to give sugar |
| M5 | breaking up sugar to get energy and smaller molecules |
Many of our machines will need energy to run. In the basic
organism we are describing, we have included M1 to harvest energy
from sunlight. This process is called photosynthesis.
The cell stores energy in a molecule called ATP.
Whenever energy is needed, the molecule is broken into two
pieces, releasing energy. The cell maintains a fairly
constant concentration of ATP, which allows reactions throughout the
cell to depend on it. This is similar in many respects to the
way electricity is available throughout a house. Appliances
can be designed to plug in anywhere, and they assume the normal voltage
will be available. Similarly, we have a mechanism for
maintaining the concentration of ATP, and this allows us to include
reactions that depend on that concentration.
M2 is a
machine that builds sugar from CO2 and energy. This involves
a number of transformations. Eventually, we will need to
examine the individual steps, but for now let us remain at this quite
abstract level.
Machines M3
and M4 allow
the cell to store sugars when energy is abundant, and then to use them
later when energy is needed. Starch should be thought of as
just a string of sugar molecules, which is a convenient way to store
them. When sugar is needed, M4 can be used to
break off a few.
Finally, M5
is a machine that takes sugar molecules and breaks them into smaller
pieces, releasing energy (in the form of ATP) in the process.
These smaller molecules are the building blocks that are used
over and over to construct things needed by the
cell. Here is a table that contains the abbreviations we use
for these molecules. Frankly, if you have not had
biochemistry classes, you might simply work with the abbreviations,
since the full names can be intimidating.
| 2OG | 2-oxoglutarate |
| 3PG | 3-phosphoglycerate |
| A | Adenine [one of the characters in a DNA string] |
| Ala | Alanine [an amino acid] |
| Arg | Arginine [an amino acid] |
| Asn | Asparagine [an amino acid] |
| Asp | Aspartate [an amino acid] |
| C | Cytosine [one of the characters in a DNA string] |
| CHOR | Chorismate |
| CO2 | Carbon dioxide |
| Daughter genome | the added cell after replication |
| E4P | Erythrose 4-phosphate |
| Extra Membrane | A little extra membrane for the new cell |
| G | Guanine [one of the characters in a DNA string] |
| G6P | Glucose 6-phosphate |
| Genome | the DNA string in the cell that contains the genes |
| Gln | Glutamine [an amino acid] |
| Glu | Glutamate [an amino acid] |
| Gly | Glycine [an amino acid] |
| HOM | Homoserine |
| His | Histidine [an amino acid] |
| Iso | Isoleucine [an amino acid] |
| Leu | Leucine [an amino acid] |
| Lys | Lysine [an amino acid] |
| Membrane | the thing enclosing the cell |
| Met | Methionine [an amino acid] |
| OXLA | Oxalacetate |
| PEP | Phosphoenolpyruvate |
| PYR | Pyruvate |
| Phe | Phenylalanine [an amino acid] |
| Pro | Proline [an amino acid] |
| R5P | Ribose 5-phosphate |
| Ser | Serine [an amino acid] |
| Starch | A polymer of sugars (used for storage) |
| Sugar | think glucose |
| T | Thymine [one of the characters in a DNA string] |
| Thr | Threonine [an amino acid] |
| Trp | Tryptophan [an amino acid] |
| Tyr | Tyrosine [an amino acid] |
| Val | Valine [an amino acid] |
Building the Amino Acids

| M6 | build glutamate and glutamine from 2-oxoglutarate |
| M7 | build proline |
| M8 | build aspartate from oxalacetate |
| M9 | build arginine |
| M10 | build asparagine |
| M11 | build serine |

| M12 | build glycine from serine |
| M13 | build cysteine from serine |
| M14 | build methionine from homoserine and cysteine |
| M15 | build lysine |
| M16 | build homoserine from aspartate |
| M17 | build threonine |
| M18 | build isoleucine |

| M19 | build alanine |
| M20 | build valine |
| M21 | build leucine |
| M22 | build the intermediate chorismate |
| M23 | build tyrosine and phenylalanine |
| M24 | build tryptophan |
| M25 | build ribose 5-phosphate |
| M26 | build histidine |
Expressing Genes

| M30 | building a protein from amino acids and a gene |
Motility
The cell we envision has some motility. It can
"turn on its motor and propellers" to move a bit, turn off the motility
machinery, wait a while, turn it on again, and so forth.
We do not show a diagram or table of this machine, but we shall number it M31.
Replication
Replication is described in a somewhat imprecise manner. We think of M27
as a machine that builds the nucleotides, which are the characters that
make up the DNA genome. Then M28
is a machine that takes these loose "characters" floating in the cell,
along with the existing genome, and manufactures a copy of the genome.
Then, finally, M29 takes some extra membrane (see the output of M5),
the genome copy, and "pinches" the extended cell, creating two separate
cells, which we call the "original" (containing the original genome) and
the "daughter" (containing the copy of the genome).
| M27 | build nucleotides |
| M28 | build new genome |
| M29 | split the cell into original and daughter |
Problems in Bioinformatics that Can Be Done Once the Notion of "Function" Exists
The
inventory of machines has led us (albeit circuitously) into a
discussion of "the function of a protein" and how to think about it.
These problems relate to the use of comparative analysis between
the protein sequences from many distinct genomes (and what clues we can
expect to develop in our attempts to make sense of it all).
Identifying the Functions of Genes
The
general topic of how to assign function to genes is central to genome
annotation. Deciding when you can safely project function based
on similarity is a topic that can profitably be pondered.
Predicting When Two Genes Implement Related Functions
There
are many clues that can be used to improve the accuracy of function
projection. Conservation of contiguity, detection of gene
fusions, protein-protein interaction data, and characterization of
regulatory sites have all proven useful. Integration of clues from
a number of sources has been attempted (and will undoubtedly be
important in the future).
Grouping Genes into Subsystems
The genes that encode proteins that together implement a single machine may be thought of as an instance of a subsystem.
In later tutorials we will discuss the notion of subsystem in
more detail. Essentially, it is an abstraction of the notion of
machine, and it represents an important conceptual framework for
analyzing the functions of genes from many genomes simultaneously.
So, how can you detect when two genes are components of the same
machine?
Constructing Sets of Isofunctional Homologs
Homologs
are genes that share a common ancestor. Isofunctional genes
implement the same function. The goal of compiling sets of
homologous genes (and the proteins they encode) that implement a single
function is central to automating annotation of genomes. Further,
since we will be faced with annotating thousands of new genomes over
the next few years (and the number will increase much more rapidly after that),
almost all annotations will be automated.
Supporting Decision Procedures for Sets of Isofunctional Homologs
Suppose
that you have a collection of sets of isofunctional homologs.
Suppose further that you have, say, 10,000 of these sets.
For each set, you will wish to develop a decision procedure
which, when given as input a set and a new protein sequence,
determines whether or not the protein should be added to the set.
In some cases, such decisions are easy, and you will wish to use
a very fast decision procedure. In others, they are very
difficult, and you will need to bring many sources of clues to bear.
Construction of such decision procedures will become increasingly important.
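For the easy cases, a very fast decision procedure might do nothing more than compare short substrings. The sketch below is one hypothetical example of such a procedure and not an established method: it accepts a new protein if it shares enough 3-character substrings with any member of the set. The function names, the substring length `k`, and the `threshold` value are all assumptions made for illustration, and the difficult cases would need the additional sources of clues mentioned above.

```python
def kmers(protein, k=3):
    """Return the set of all length-k substrings of a protein sequence."""
    return {protein[i:i + k] for i in range(len(protein) - k + 1)}


def similarity(p, q, k=3):
    """Jaccard similarity of the k-mer sets of two sequences (0.0 to 1.0)."""
    a, b = kmers(p, k), kmers(q, k)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def belongs(members, candidate, threshold=0.5):
    """Fast decision: accept if the candidate closely resembles any member."""
    return any(similarity(m, candidate) >= threshold for m in members)


# Hypothetical protein sequences, for illustration only.
family = ["MKVLAAGIV", "MKVLSAGIV"]
print(belongs(family, "MKVLAAGIV"))
print(belongs(family, "WWWWWWWWW"))
```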
Characterization of Regulons for a Genome
Genes
are often co-regulated. That is, expression of a set of genes may
always be tightly coordinated. In this case, we will think of the
co-regulated set as a regulon.
Determination of which genes make up which regulons is a task
that involves both bioinformatic analysis and wet lab confirmation.
Don't attempt this one without a close working relationship with
a wet lab biologist.
Characterization of "States of the Cell"
It might be conjectured that a cell has a limited set of states.
Each state is characterized by the set of regulons that are
expressed. It seems likely that the cell should be viewed as
"tending to stay in the same state" until forced to make a transition
to another state. That is, the states demonstrate a degree of homeostasis.
If we understood a comprehensive list of states, and we worked out
the forces that determine transitions, we would begin to understand the
cell as a dynamic system.
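This conjecture can be phrased as a small state machine. In the sketch below, every state, regulon, and signal name is hypothetical; the only ideas taken from the text are that a state is characterized by the set of regulons expressed, and that the cell stays in its current state unless something forces a transition.

```python
# Hypothetical states, each characterized by the regulons it expresses.
STATES = {
    "growing": frozenset({"R1", "R2"}),
    "starving": frozenset({"R3"}),
}

# Hypothetical forces that move the cell from one state to another.
TRANSITIONS = {
    ("growing", "nutrients_depleted"): "starving",
    ("starving", "nutrients_restored"): "growing",
}


def step(state, signal):
    """Return the next state; with no matching transition the cell stays put."""
    return TRANSITIONS.get((state, signal), state)


def follow(state, signals):
    """Apply a sequence of signals and return the final state."""
    for signal in signals:
        state = step(state, signal)
    return state


final = follow("growing", ["light_change", "nutrients_depleted"])
print(final, sorted(STATES[final]))
```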