All content entered becomes and is (C)2007 Transtronics, Inc. the property of Transtronics, Inc. Rest assured that your contributions won't be sold and will be publicly available.
Genetic differences and information theory
From Transwiki
How should we measure and understand genetic differences in a meaningful way?
OK, my curiosity has got the better of me again - I often read things and find that, instead of understanding better, I have ended up with many more questions than answers.
I was listening to a lecture (celebrating Dawkins's book, The Selfish Gene) and the professor said that, of the information in our chromosomes, about 2% codes for proteins and perhaps 3% is used for control, but the remaining 95% is thought to be made up of remnants of retroviruses and empty "LINEs" (long interspersed nuclear elements) that don't do anything.
Couple that with a typical quote about the difference between human and chimp genes, this one from National Geographic:
"The goal is to answer the basic question: What makes us humans?" said Eichler. Eichler and his colleagues found that the human and chimp sequences differ by only 1.2 percent in terms of single-nucleotide changes to the genetic code. But 2.7 percent of the genetic difference between humans and chimps are duplications, in which segments of genetic code are copied many times in the genome. "If genetic code is a book, what we found is that entire pages of the book duplicated in one species but not the other," said Eichler. "This gives us some insight into the genetic diversity that's going on between chimp and human and identifies regions that contain genes that have undergone very rapid genomic changes."
And this one:
The new estimate could be a little misleading, said Saitou Naruya, an evolutionary geneticist at the National Institute of Genetics in Mishima, Japan. "There is no consensus about how to count numbers or proportion of nucleotide insertions and deletions," he said.
So I still don't think I have a good feel for the magnitude of the difference.
I have an interest in how to measure the difference between two sets of data - it is of key importance in data compression algorithms.
Reversing a block of data adds very little new information, and an insertion or deletion that merely shifts the position of the other data is not much of a true difference either. So how do they measure these gene differences in a meaningful way?
When they say that humans vary from chimps by only a few percent, that could mean a lot of different things. Are they only looking at the 5% that counts (that would make sense to me), or are they overstating it by looking at all the junk DNA? I don't think they are looking at mitochondrial DNA. How do they deal with reversals and relocations? Are they only looking at the DNA that codes for the amino acid sequences of proteins? How can this difference be expressed in a meaningful way?
On Linux-based computers there is a command called diff that creates a difference file. While this command deals well with insertions and deletions, it is not so good at reversals or rearrangements of blocks. Related to this program is one called rsync.
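To make the point about insertions versus reversals concrete, here is a small sketch using Python's standard difflib module. This is an assumption-laden stand-in: difflib is not the same algorithm as the diff command (which works line by line), and the short DNA-like string is made up for illustration. It only shows that an order-preserving comparison scores a small insertion as nearly identical while scoring a flipped block as a large difference.

```python
import difflib

def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] based on difflib's matching blocks (order-preserving)."""
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()

# Made-up 30-base sequence, purely illustrative.
original = "ACGTTGCAAGGCTTACCGATAGGCTAACGT"

# One 3-base insertion: everything after it shifts position.
inserted = original[:10] + "TTT" + original[10:]

# One 20-base block flipped end-over-end: no bases added or removed.
inverted = original[:5] + original[5:25][::-1] + original[25:]

sim_inserted = similarity(original, inserted)
sim_inverted = similarity(original, inverted)

print(f"after insertion: {sim_inserted:.2f}")
print(f"after inversion: {sim_inverted:.2f}")
```

The inversion arguably carries less new information than the insertion (just two end points and a "flip" instruction), yet it gets the worse score, which is the weakness described above.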
rsync was originally written by Andrew Tridgell as the basis of his PhD thesis. He is a key person and in some ways more important to Linux than Linus. (BTW - for a bit of self-reference, I use rsync to transfer updates of this web site!) See binary diffs for more on this.
He was solving a problem that often comes up with computer files where there are different versions of, let's say, a text file. Instead of sending the entire file, he came up with a system that breaks the file into blocks, creates a hash of each block (a mathematical method that identifies a block of data with a type of checksum), compares the hashes on both ends and then transmits only the differences (the differences are compressed as well). This allows files to be 'synchronized' without sending the whole thing, thus speeding up the process. Others have worked on more compact difference formats that use more complex algorithms at the expense of processor time.
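The block-and-hash idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not rsync's actual implementation: the real program pairs a cheap rolling checksum with a stronger hash so it can test every byte offset quickly, whereas this sketch simply recomputes a SHA-1 hash at each offset. The function names, the 8-byte block size and the sample strings are all hypothetical.

```python
import hashlib

BLOCK = 8  # tiny block size, chosen only to keep the example readable

def block_signatures(old: bytes, block: int = BLOCK) -> dict:
    """Receiver side: map the hash of each whole block of the old file to its index."""
    sigs = {}
    for i in range(0, len(old) - block + 1, block):
        sigs.setdefault(hashlib.sha1(old[i:i + block]).digest(), i // block)
    return sigs

def make_delta(new: bytes, sigs: dict, block: int = BLOCK) -> list:
    """Sender side: describe the new file as block references plus literal bytes."""
    delta, literal, pos = [], bytearray(), 0
    while pos < len(new):
        chunk = new[pos:pos + block]
        idx = sigs.get(hashlib.sha1(chunk).digest()) if len(chunk) == block else None
        if idx is not None:
            if literal:  # flush pending unmatched bytes first
                delta.append(("literal", bytes(literal)))
                literal = bytearray()
            delta.append(("copy", idx))
            pos += block
        else:
            literal.append(new[pos])  # no known block starts here; slide one byte
            pos += 1
    if literal:
        delta.append(("literal", bytes(literal)))
    return delta

def apply_delta(old: bytes, delta: list, block: int = BLOCK) -> bytes:
    """Receiver side: rebuild the new file from its old copy and the delta."""
    out = bytearray()
    for kind, value in delta:
        out += old[value * block:(value + 1) * block] if kind == "copy" else value
    return bytes(out)

old = b"The quick brown fox jumps over the lazy dog."
new = b"The quick brown fox leaps over the lazy dog!"
delta = make_delta(new, block_signatures(old))
rebuilt = apply_delta(old, delta)
literal_bytes = sum(len(v) for k, v in delta if k == "literal")
print(delta)
print(f"{literal_bytes} of {len(new)} bytes had to be sent literally")
```

Because the sender tests every offset, a block is still found after an insertion has shifted it, which is exactly the property that a naive position-by-position comparison lacks.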
The important point here is how to measure the actual meaningful difference. It seems to me that looking only at the 5% and then finding the best compression of the difference data would get us a representation that has meaning. Then making a fraction of the best compression of the difference over the best compression of the useful data is the way to go.
I've also heard estimates of how much information our genes represent - and again I don't know if they are looking at both the real data and the junk data in these estimates. Is it 700MB, or compressed to 250MB? Or does this include all the junk? Is there a difference in how the junk genetic information compresses?
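Here is a rough sketch of that fraction in Python. Several things are assumptions on my part: zlib stands in for "the best compression" (it is far from the best), the "compressed difference" is approximated as the extra compressed bytes needed for the second sequence once the first is already known, and the sequences are random made-up data rather than real DNA.

```python
import random
import zlib

def csize(data: bytes) -> int:
    """Compressed size in bytes, using zlib at its highest setting."""
    return len(zlib.compress(data, 9))

def difference_fraction(a: bytes, b: bytes) -> float:
    """Approximate (compressed difference) / (compressed data).

    The numerator is the extra compressed size b costs when it follows a,
    so anything in b that can be copied from a is nearly free.
    """
    extra = csize(a + b) - csize(a)
    return extra / csize(b)

rng = random.Random(0)  # fixed seed so the run is repeatable
base = bytes(rng.choice(b"ACGT") for _ in range(4000))

# A close relative: about 1% of positions get a freshly drawn base.
near = bytearray(base)
for i in rng.sample(range(len(near)), 40):
    near[i] = rng.choice(b"ACGT")
near = bytes(near)

# An unrelated sequence of the same length.
unrelated = bytes(rng.choice(b"ACGT") for _ in range(4000))

frac_near = difference_fraction(base, near)
frac_unrelated = difference_fraction(base, unrelated)
print(f"close relative: {frac_near:.2f}")
print(f"unrelated:      {frac_unrelated:.2f}")
```

A measure like this does not care where in the sequence a shared block sits, only whether it can be copied, so relocations cost little. Note that zlib does not recognise reversed blocks, so a better compressor would be needed to treat inversions fairly.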
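A back-of-the-envelope check on the uncompressed size is easy to do. The sketch below assumes a round figure of 3 billion base pairs for one copy of the human genome (my assumption, not a number from the estimates I heard) and 2 bits per base, since there are four possible bases; the 5% is the "useful" share from the lecture mentioned above.

```python
BASE_PAIRS = 3_000_000_000  # assumed round figure for one copy of the genome
BITS_PER_BASE = 2           # four bases (A, C, G, T) fit in 2 bits
USEFUL_PERCENT = 5          # protein-coding plus control, per the lecture

raw_bytes = BASE_PAIRS * BITS_PER_BASE // 8
useful_bytes = raw_bytes * USEFUL_PERCENT // 100

print(f"whole genome at 2 bits per base: {raw_bytes / 1e6:.1f} MB")
print(f"the {USEFUL_PERCENT}% that counts: {useful_bytes / 1e6:.1f} MB")
```

On these assumptions the raw figure lands near the 700MB estimate, which suggests that number counts everything, junk included, before any compression.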
Update
As if someone had read this page, there was an article in American Scientist, September-October 2007, called "Sorting Out the Genome: To put your genes in order, flip them like pancakes".
In it, Brian Hayes talks about genetic inversions (blocks of genes that have flipped end-over-end) and about figuring out what the actual information content is.
There is still a need for a good book by a mathematician that would nail genetic difference measurements to the wall of understanding.
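The idea of measuring a difference in flips rather than in changed positions can be sketched as follows. This is a naive greedy method of my own choosing, not an algorithm taken from the article: it puts each gene into place with one block reversal, so it gives an upper bound on the number of flips, not the minimum. The gene order is a made-up permutation.

```python
def sort_by_reversals(order):
    """Sort a permutation of 1..n by reversing blocks.

    Returns the sorted list and the (start, end) indices of each reversal.
    Greedy: uses at most n - 1 reversals, which is not necessarily the fewest.
    """
    genes = list(order)
    steps = []
    for i in range(len(genes)):
        j = genes.index(i + 1)  # where the gene that belongs at position i is now
        if j != i:
            genes[i:j + 1] = reversed(genes[i:j + 1])  # flip the block into place
            steps.append((i, j))
    return genes, steps

# Hypothetical gene order: two blocks have been flipped end-over-end.
scrambled = [1, 2, 5, 4, 3, 6, 8, 7]
result, steps = sort_by_reversals(scrambled)
misplaced = sum(1 for pos, gene in enumerate(scrambled) if gene != pos + 1)

print("reversals used:", steps)
print("positions that look different:", misplaced)
```

A position-by-position comparison sees several genes out of place, while the flip count sees only two events, which is closer to what actually happened.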
Meaningful differences
My understanding is that the only directly useful regions on the DNA are the genes -- every protein that an organism can possibly make is coded by some gene -- and that molecular biologists can now tell, just from the DNA sequence, where each gene stops and starts.
I agree that the "differ by only 1.2 percent" number is a little vague. Since the number of genes humans had was completely unknown before the human genome project published its working draft of the human DNA sequence in 2000, and the "1.2 percent" number was published before that, I'm pretty sure the "1.2 percent" number includes *everything*, both coding genes and non-coding "junk".
I agree that it would be interesting to know, in detail, what exactly is different between chimps and humans. One of many ways to categorize the differences between one cell and another is:
- differences in the non-gene "junk" DNA -- makes no difference in the kinds of proteins that an organism can possibly make. (But it is speculated that some areas of non-gene DNA may affect the quantities produced of some proteins, even though all the information on the exact protein molecule produced is described by some gene).
- minor differences in a gene that have no effect on the protein produced. (For example, if humans have 6 copies of some gene, and chimps have 7 copies of that gene, but each copy is identical ... or substituting UCU for AGC, both of which code for the identical amino acid -- Serine).
- "small" differences in a gene that cause it to produce a slightly different kind of protein.
- "large" differences in a gene ("this gene in this species corresponds to that gene in that species, even though there are many, many changes in the base sequence")
- "added" genes -- are there proteins that only chimps can make, unrelated to any that humans can make? Or vice versa? (The boundary may be fuzzy between "large differences in a gene" vs. "both species have added genes").
- differences between cells that are not reflected in the DNA -- the epigenetic code -- including things like DNA methylation. For example, the cells of a human brain and a human foot (from the same human) contain exactly the same DNA sequence. Even though both groups of cells contain exactly the same genes, the actual quantity produced of each particular protein (the actual frequency with which a particular gene is used) is very different between the brain and the foot. And yet -- that human brain looks a lot more like a chimp brain than it looks like that human foot.
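The UCU versus AGC example in the list above can be checked with a small sketch. The table holds only a handful of codons from the standard genetic code, and the three short RNA strings are made up for illustration; the function names are hypothetical.

```python
# Partial RNA codon table (standard genetic code), only the codons used below.
CODON_TABLE = {
    "UCU": "Ser", "UCC": "Ser", "UCA": "Ser", "UCG": "Ser",
    "AGU": "Ser", "AGC": "Ser",
    "GGU": "Gly", "GGC": "Gly",
    "AUG": "Met", "UGG": "Trp",
}

def translate(rna: str) -> list:
    """Translate an RNA string into amino acids, three bases at a time."""
    usable = len(rna) - len(rna) % 3
    return [CODON_TABLE[rna[i:i + 3]] for i in range(0, usable, 3)]

def classify_difference(rna_a: str, rna_b: str) -> tuple:
    """Return (bases that differ, amino acids that differ) for equal-length strings."""
    base_diffs = sum(x != y for x, y in zip(rna_a, rna_b))
    protein_diffs = sum(x != y for x, y in zip(translate(rna_a), translate(rna_b)))
    return base_diffs, protein_diffs

a = "AUGUCUGGU"  # Met-Ser-Gly
b = "AUGAGCGGU"  # Met-Ser-Gly: UCU replaced by AGC, the same amino acid
c = "AUGUGGGGU"  # Met-Trp-Gly: a different amino acid in the middle

print(classify_difference(a, b))  # many bases changed, protein unchanged
print(classify_difference(a, c))  # fewer bases changed, protein changed
```

Counting changed bases ranks the first pair as more different than the second, while counting changed amino acids ranks them the other way round, so the two measures can disagree even on a nine-base example.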
I hear that the Great Ape Genome Project plans to publish some results Real Soon Now. I hope that makes it possible to gain a better understanding of the differences and similarities between humans and others.
