The following information may have errors; It is not permissible to be read by anyone who has ever met a lawyer. Use should also be confined to Engineers with more than 370 course hours of electronic engineering and should only be used for theoretical studies. All content entered becomes and is (C)2007 Transtronics, Inc. the property of Transtronics, Inc. Rest assured that your contributions won't be sold and will be publicly available.

Genetic differences and information theory

From Transwiki

How should we measure and understand genetic differences in a meaningful way?

OK, my curiosity has me again - I often read things and find that instead of understanding better I have ended up with many more questions than answers.

I was listening to a lecture (celebrating Dawkins' book, The Selfish Gene) and the professor said that of the information in our chromosomes about 2% codes for proteins, perhaps 3% is used for control, but the remaining 95% is thought to be made up of remnants of retroviruses and empty "LINE genes" that don't do anything.

Couple that with a typical quote about the difference between human and chimp genes, this one from National Geographic:

"The goal is to answer the basic question: What makes us humans?" said Eichler. Eichler and his colleagues found that the human and chimp sequences differ by only 1.2 percent in terms of single-nucleotide changes to the genetic code. But 2.7 percent of the genetic difference between humans and chimps are duplications, in which segments of genetic code are copied many times in the genome. "If genetic code is a book, what we found is that entire pages of the book duplicated in one species but not the other," said Eichler. "This gives us some insight into the genetic diversity that's going on between chimp and human and identifies regions that contain genes that have undergone very rapid genomic changes."


And this one:

The new estimate could be a little misleading, said Saitou Naruya, an evolutionary geneticist at the National Institute of Genetics in Mishima, Japan. "There is no consensus about how to count numbers or proportion of nucleotide insertions and deletions," he said.

So I still don't think I have a good feel for the magnitude of the difference.

I have an interest in how to measure the difference between two sets of data - it is of key importance in data compression algorithms.

A reversal of data makes little real difference, and an insertion or deletion that merely shifts the position of other data is not much of a true difference either. So how do they measure these gene differences in a meaningful way?
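To make the point about insertions concrete, here is a minimal sketch comparing two ways of counting differences. The function names and the toy sequences are made up for illustration, and neither function is claimed to be what geneticists actually use. A position-by-position comparison treats a single inserted base as if almost everything had changed, while an edit distance (insertions, deletions and substitutions each cost 1) sees only one change.

```python
def positional_mismatches(a, b):
    """Compare slot by slot; any length difference counts as extra mismatches."""
    same_slot = sum(1 for x, y in zip(a, b) if x != y)
    return same_slot + abs(len(a) - len(b))


def edit_distance(a, b):
    """Classic Levenshtein distance computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # substitute or match
        prev = cur
    return prev[-1]


original = "ACGTACGTACGT"
shifted = "T" + original  # one base inserted at the front

print(positional_mismatches(original, shifted))
print(edit_distance(original, shifted))
```

Note that plain edit distance still has no notion of a reversal or a block move, so it only answers part of the question.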

When they say that humans vary from chimps by only a few percent - that could mean a lot of different things. Are they only looking at the 5% that counts? (That would make sense to me.) Or are they overstating it by looking at all the junk DNA? I don't think they are looking at mitochondrial DNA. How do they deal with reversals and relocations? Are they only looking at the DNA that codes for protein amino acid sequences? How can this difference be expressed in a meaningful way?

On Linux-based computers there is a command called diff that creates a difference file. While this command deals with insertions and deletions well, it is not so good at reversals or rearrangements of blocks. Related to this program is one called rsync.
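The behaviour is easy to see with a small experiment. The sketch below uses Python's standard difflib module as a stand-in for the diff command (it is a different implementation, so this is only an approximation of what diff itself would print), and the six "lines" are invented toy data. It counts how many lines a line-based comparison reports as added or removed.

```python
import difflib


def changed_lines(old, new):
    """Count lines that a line-based comparison marks as added or removed."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    count = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            count += (i2 - i1) + (j2 - j1)
    return count


original = ["AAAA", "CCCC", "GGGG", "TTTT", "ACGT", "TGCA"]

# One new line in the middle: a small, honest difference.
inserted = original[:2] + ["NNNN"] + original[2:]

# The first two pairs of lines swapped: nothing new, only rearranged.
moved = original[2:4] + original[:2] + original[4:]

print(changed_lines(original, inserted))
print(changed_lines(original, moved))
```

The rearranged version contains no new information at all, yet it is reported as a bigger change than the insertion, because the moved block shows up once as a deletion and once as an addition.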

rsync was originally written by Andrew Tridgell, and the algorithm behind it became part of his PhD thesis. He is a key person and in some ways more important to Linux than Linus. (BTW - for a bit of self referentialism, I use rsync to transfer updates of this web site!) See binary diffs for more on this.

He was solving a problem that often comes up with computer files where there are different versions of - let's say - a text file. Instead of sending the entire file, he came up with a system that breaks the file into parts, creates a hash of each part (a mathematical method that identifies a block of data with a type of checksum), compares the hashes on both ends and then transmits only the differences (the differences are compressed as well). This allows files to be 'synchronized' without sending the whole thing, thus speeding up the process. Other attempts to create compact differences use more complex algorithms at the expense of processor time.
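The block-and-hash idea can be sketched in a few lines. This is a deliberately simplified illustration and not rsync's actual algorithm: real rsync uses a rolling checksum so that matching blocks can be found at any byte offset, whereas this sketch only compares fixed, aligned blocks. The block size, the choice of SHA-256 and the function names are all assumptions made for the example.

```python
import hashlib

BLOCK = 8  # hypothetical block size, tiny so the example is readable


def block_hashes(data, block=BLOCK):
    """Hash every aligned block of the data."""
    return [hashlib.sha256(data[i:i + block]).hexdigest()
            for i in range(0, len(data), block)]


def blocks_to_send(old, new, block=BLOCK):
    """Return (offset, bytes) for blocks of `new` whose hash the old side lacks."""
    known = set(block_hashes(old, block))
    missing = []
    for i in range(0, len(new), block):
        chunk = new[i:i + block]
        if hashlib.sha256(chunk).hexdigest() not in known:
            missing.append((i, chunk))
    return missing


old = b"AAAAAAAACCCCCCCCGGGGGGGGTTTTTTTT"
edited = b"AAAAAAAACCCCNCCCGGGGGGGGTTTTTTTT"   # one byte changed
swapped = b"GGGGGGGGCCCCCCCCAAAAAAAATTTTTTTT"  # two blocks trade places

print(blocks_to_send(old, edited))
print(blocks_to_send(old, swapped))
```

In this toy version a rearrangement of whole blocks costs nothing to transmit beyond the hashes, which is closer to the intuition that moved data is not really new data.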

The important point here is how to measure the actual meaningful difference. It seems to me that looking only at the 5% that is used and then finding the best compression of the difference data would give us a representation that has meaning. Making a fraction of the best compression of the difference over the best compression of the useful data is then the way to go.
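A rough sketch of that fraction, under loud assumptions: zlib stands in for "the best compression" (it is far from the best, and it only looks back about 32 KB), and the compressed size of the difference is estimated as the extra bytes needed to compress the second sequence when the compressor has already seen the first. The function names and the random test sequences are invented for the example and are not real genetic data.

```python
import random
import zlib


def compressed_size(data):
    """Size in bytes after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def difference_fraction(reference, other):
    """Extra compressed bytes for `other` given `reference`, over `other` alone."""
    extra = compressed_size(reference + other) - compressed_size(reference)
    return extra / compressed_size(other)


rng = random.Random(0)  # fixed seed so the run is repeatable
reference = bytes(rng.choice(b"ACGT") for _ in range(2000))

# A copy with three single-letter changes.
near = bytearray(reference)
for pos in (100, 700, 1500):
    near[pos] = ord("N")
near_copy = bytes(near)

# An independent random sequence of the same length.
unrelated = bytes(rng.choice(b"ACGT") for _ in range(2000))

print(difference_fraction(reference, near_copy))
print(difference_fraction(reference, unrelated))
```

A near copy gives a small fraction and an unrelated sequence gives a fraction close to 1. A general-purpose compressor like this would not recognize a reversed block as related data, so a measure meant for genes would need a compressor that understands reversals.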

I've also heard estimates of how much information our genes represent - and again I don't know whether these estimates cover both the real data and the junk data. Is it 700MB, or compressed to 250MB? Or does this include all the junk? Is there a difference in how the junk genetic information compresses?
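For what it's worth, a figure in the 700MB range is about what falls out of a back-of-the-envelope calculation. The sketch below assumes a round three billion base pairs and two bits per base (four possible letters); both numbers are assumptions for illustration, not figures from the lecture, and the result says nothing about how much of it is junk.

```python
BASE_PAIRS = 3_000_000_000  # assumed round figure for one copy of the genome
BITS_PER_BASE = 2           # A, C, G, T: four symbols fit in two bits

raw_bytes = BASE_PAIRS * BITS_PER_BASE // 8
raw_megabytes = raw_bytes / 1_000_000

# Share of that raw size taken by the 5% described earlier as actually used.
useful_megabytes = raw_megabytes * 0.05

print(raw_megabytes, useful_megabytes)
```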

Update

As if someone read this page?? There was an article in American Scientist, September-October 2007, called "Sorting Out the Genome: To put your genes in order, flip them like pancakes".

Here Brian Hayes talks about genetic inversions, blocks of genes that have flipped end-over-end, and about figuring out what the actual information content is.
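One way to put a number on inversions is to ask how few flips are needed to turn one gene order into another. The sketch below is a brute-force illustration of that idea and is not taken from the article: genes are represented as plain integers, gene orientation is ignored, and the breadth-first search is only practical for very short lists. The function name is made up for the example.

```python
from collections import deque


def reversal_distance(order):
    """Fewest block reversals that put `order` into ascending order."""
    start = tuple(order)
    target = tuple(sorted(order))
    if start == target:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    n = len(start)
    while queue:
        state, steps = queue.popleft()
        # Try flipping every block of two or more neighbouring genes.
        for i in range(n - 1):
            for j in range(i + 2, n + 1):
                flipped = state[:i] + state[i:j][::-1] + state[j:]
                if flipped == target:
                    return steps + 1
                if flipped not in seen:
                    seen.add(flipped)
                    queue.append((flipped, steps + 1))
    return None  # not reached when the input holds distinct values


print(reversal_distance([1, 4, 3, 2, 5]))  # the block 4,3,2 is flipped
print(reversal_distance([3, 1, 2]))
```

Measured this way, a long flipped block counts as a single event, however many genes it contains.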

There is still a need for a good book by a mathematician that would nail genetic difference measurements to the wall of understanding.
