Protein Folding, HIV and Drug Design
By Michael Thorpe
Figure 1. The first 5 amino acids of the polypeptide chain of the protein HIV protease that result from the instructions in the DNA codons given in the text. Here the green spheres are C atoms, blue are N, red are O, and grey are H.
Figure 2. The protein HIV protease in a dimer that consists of 2 polypeptide chains, each containing 99 amino acids, and is an important part of the HIV virus.
It is difficult to pinpoint exactly when modern quantitative molecular biophysics started, but 50 years ago on 28 February 1953 is a natural entry point, when James Watson and Francis Crick completed their famous molecular model of the double helix structure of DNA, based on the x-ray diffraction work of Rosalind Franklin and Maurice Wilkins. In their paper in Nature, their understated remark "It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material," opened up the modern era in biochemistry.
The discovery of the structure of DNA is perhaps the most important scientific event of the second half of the 20th century. This structure introduced the idea of complementarity, in which knowing one strand determines what the opposite strand should be, with the bases A and T, and G and C, always pairing with each other. This pairing provides the basis for genetics and reproduction. This is all now taken for granted, but we should not forget that just months before the Watson and Crick paper, Linus Pauling had published a three-helix model for DNA.
While DNA provides the genetic code, this is only a set of encoded instructions, which must be decoded and implemented to be useful.
In the last 50 years, molecular biology has gradually become more quantitative. With the recent near-completion of the human, mouse and other genomes, the genetic code is now available as the base pairs of DNA. It is this sequence (ATGCAATCGA...), about 3 billion base pairs altogether, that contains all the information needed for life. DNA is largely inert; it is the computer code, but of much more interest is what happens when the program is run.
About 2% of the genetic material is genes, which are specific sequences of bases that encode instructions on how to string amino acids together to make the polypeptide chains that will then fold to form 3D proteins. At each site of DNA, there is a choice of one of the bases A, T, G, or C. Two adjacent sites would give only 16 choices, too few for the 20 amino acids, so it takes three adjacent sites of DNA, called a codon, which leads to 64 choices. This redundancy has bothered people, and many elegant encoding schemes were proposed. The truth turned out to be somewhat prosaic, with between 1 and 6 codons mapping onto each amino acid, in a many-to-one mapping.
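The counting argument in the preceding paragraph can be checked directly. What follows is a minimal Python sketch, assuming the standard genetic code; the compact table layout (bases cycled in the order T, C, A, G) and the names BASES, AMINO and CODON_TABLE are illustrative choices and do not come from the article.

```python
from collections import Counter
from itertools import product

# Assumption: the standard genetic code, written as one letter per codon with
# the bases cycled in the order T, C, A, G ('*' marks a stop codon).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

# product() varies the last base fastest, matching the layout of AMINO.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Number of distinct words of length 1, 2 and 3 over a 4-letter alphabet.
words = {n: len(BASES) ** n for n in (1, 2, 3)}

# How many codons map onto each amino acid (stop codons excluded).
degeneracy = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

print("words per length:", words)
print("amino acids:", len(degeneracy))
print("fewest and most codons per amino acid:",
      min(degeneracy.values()), max(degeneracy.values()))
```

Two sites give 16 words, fewer than the 20 amino acids, while three give 64, and the tally of codons per amino acid runs from 1 to 6.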
To physicists who are inclined to think that nature always goes for elegance, the translation scheme from DNA, via RNA transcription, to amino acid sequence can be a disappointment. However, realizing that the complex systems of biology have evolved continuously through a series of very small evolutionary steps, we may regard biology as being locally optimized. It is a design that works.
When the polypeptide chain folds into a 3D structure, it becomes a functioning protein. We are now seeing molecular biology being understood from the ground up, starting with the smallest molecular building blocks: from the base pairs of DNA that encode for the amino acid sequences of proteins, to the 3D structure of proteins, to complexes containing proteins, DNA and RNA, and up to self-contained biological entities like viruses and cells.
The forefront of molecular biophysics has now moved to proteins. Each one has a simple task, and out of these something as complex as a mouse, or you and me, is constructed. Indeed, mice and humans have very similar sets of proteins, totaling around 30,000, with about 95% being the same or very similar. This has come as a considerable shock to some humans, but mice have remained quiet about it. The linear sequence of amino acids, as shown in Figure 1, folds as it comes off the ribosome, which is the molecular machine where the amino acids are assembled into a polypeptide chain in the right sequence, according to the genetic code. The DNA sequence for the first five amino acids of the protein HIV protease is CCT CAA ATC ACT CTT, which encodes for the amino acids proline, glutamine, isoleucine, threonine and leucine.
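Reading the quoted sequence three bases at a time can be sketched in a few lines of Python. The lookup below contains only the five codons given in the text, so it is an illustration and not a full codon table; the names CODONS_FROM_TEXT and translate are hypothetical. It also reads the DNA directly and skips the intermediate RNA transcription step that takes place in the cell.

```python
# Only the five codon assignments quoted in the text; not a complete table.
CODONS_FROM_TEXT = {
    "CCT": "proline",
    "CAA": "glutamine",
    "ATC": "isoleucine",
    "ACT": "threonine",
    "CTT": "leucine",
}

def translate(dna):
    """Split a DNA string into codons and look up each amino acid."""
    dna = dna.replace(" ", "").upper()
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    return [CODONS_FROM_TEXT[codon] for codon in codons]

peptide = translate("CCT CAA ATC ACT CTT")
print(peptide)
```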
Proteins arise from exponentially unlikely sequences of amino acids as determined by evolution. The unfolded polypeptide chain is a rather uninteresting 1D random coil that has no useful function until it folds into a compact 3D protein as shown in Figure 2. How a protein folds in times that range from a microsecond to a few seconds remains the subject of much study, as the protein does not have enough time to sample all possible conformations. Thus some kind of sequential or hierarchical process must occur, where perhaps secondary structures like the alpha helices and beta sheets form first, and then they in turn come together to form a rather compact 3D structure. It is important that proteins are three-dimensional, both for functionality, and as the smallest component that defines us as three-dimensional objects.
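A rough estimate shows why an exhaustive search of conformations is out of the question. The sketch below uses the 99-residue chain of HIV protease from Figure 2; the figures of 3 conformations per residue and 10^13 conformations sampled per second are illustrative assumptions, not values from the article.

```python
# Assumptions (illustrative only): each residue can adopt 3 conformations,
# and the chain samples 1e13 conformations per second.
N_RESIDUES = 99              # one chain of HIV protease, from the text
STATES_PER_RESIDUE = 3       # assumed
SAMPLES_PER_SECOND = 1e13    # assumed
AGE_OF_UNIVERSE_S = 4.3e17   # roughly 13.8 billion years, in seconds

conformations = STATES_PER_RESIDUE ** N_RESIDUES
search_time_s = conformations / SAMPLES_PER_SECOND

print(f"conformations to sample: {conformations:.2e}")
print(f"time for exhaustive search: {search_time_s:.2e} s")
print(f"in units of the age of the universe: {search_time_s / AGE_OF_UNIVERSE_S:.1e}")
```

Even under these generous assumptions, the search time exceeds the age of the universe by many orders of magnitude, compared with observed folding times of microseconds to seconds.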
Proteins are compact macromolecules that exist and function in an aqueous environment. Protein functions are very varied, such as docking against another protein, RNA or DNA, forming a channel for ions to pass through, or the proteins called proteases which can chop a polypeptide chain into segments of the right length.
The balance between stability and function can be studied by using constraint theory. The constraints associated with the unfolded polypeptide chain are the covalent bond lengths and angles, and the locking of the dihedral angle associated with rotation around the peptide bond. The additional constraints that are responsible for the 3D protein structure are hydrophobic interactions and the hydrogen bonds that act to cross-link different pieces of the polypeptide chain.
Although constraint theory only involves topology, it is necessary to know the 3D protein structure from x-ray diffraction experiments in order to assign hydrophobic and hydrogen bond constraints. In 1970, the Dutch mathematician Gerard Laman worked out a complete theory of constraints for 2D systems. This allows the rigid regions and the flexible joints between them to be identified. Some parts of the rigid regions may be over-constrained and contain redundant bonds and these are also identified.
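The 2D counting can be illustrated on small networks. The sketch below applies Laman's condition by brute force: a set of bonds is independent if every subset of k sites spans at most 2k - 3 of them. This is an exponential-time toy for graphs with a handful of sites, and the function names are hypothetical; it is not the algorithm used for proteins.

```python
from itertools import combinations

def is_independent(edges):
    """Laman count: every subset of k >= 2 sites spans at most 2k - 3 bonds."""
    sites = sorted({v for edge in edges for v in edge})
    for k in range(2, len(sites) + 1):
        for subset in combinations(sites, k):
            chosen = set(subset)
            spanned = sum(1 for a, b in edges if a in chosen and b in chosen)
            if spanned > 2 * k - 3:
                return False
    return True

def count_constraints(n_sites, edges):
    """Return (independent bonds, redundant bonds, internal floppy modes) in 2D."""
    independent, redundant = [], []
    for edge in edges:
        # Add bonds one at a time; a bond that breaks the count is redundant.
        if is_independent(independent + [edge]):
            independent.append(edge)
        else:
            redundant.append(edge)
    # 2 degrees of freedom per site, minus 3 rigid-body motions of the plane.
    floppy = 2 * n_sites - 3 - len(independent)
    return len(independent), len(redundant), floppy

square = [(0, 1), (1, 2), (2, 3), (3, 0)]
braced = square + [(0, 2)]
double_braced = braced + [(1, 3)]

for name, net in [("square", square), ("braced", braced),
                  ("double braced", double_braced)]:
    print(name, count_constraints(4, net))
```

The bare square has one floppy mode (it can shear), one diagonal makes it rigid, and the second diagonal is redundant, which corresponds to an over-constrained region.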
In 1985, the mathematicians Tiong-Seng Tay and Walter Whiteley extended this work with the Molecular Framework conjecture, which applies to the subset of all 3D networks such as those found in macromolecules with covalent bonding, and so can be applied to proteins. It is now possible to determine the rigid regions in a 3D protein and the flexible joints between them, as shown in Figure 2. Such knowledge is important in understanding the stability and function of the protein. The lower panel in Figure 2 shows the same protein with a small ligand attached, which inhibits the function of the protein associated with the opening and closing of the two flaps at the top. These now become part of the rigid core with the ligand present. This is the basis of most drug design: shutting down the activity of one protein, while avoiding the toxicity that would result if other proteins were affected.
When proteins unfold due to environmental changes such as a change in the local pH, or in the lab by increasing the temperature, functionality is lost. While the detailed pathways associated with this unraveling differ for different proteins, the whole process can be characterized by an overall loss of rigidity. As non-covalent bond constraints are broken, the rigid core gets slightly smaller, and then fractures into smaller rigid pieces, very much like a weak first-order phase transition. There is a corresponding increase in the number of independent motions associated with internal degrees of freedom that are now possible. Within the constraint approach, these motions have no restoring force associated with them, but in reality they are low-frequency or floppy modes. This viewpoint emphasizes the similarities among all proteins, and identifies the mean coordination as the reaction coordinate for protein unfolding and folding. The mean coordination is the average number of constraints per atom taken over the whole protein.
Rigid region decompositions, shown in Figure 2, are a useful aid in understanding protein function, and also in applications such as the search for drug candidates in the pharmaceutical industry. Drugs are small molecules with typically between 10 and 100 atoms, which attach themselves to a protein and inhibit its function. A good attachment requires a good steric fit as well as favorable van der Waals contacts. The computer screening of large databases of molecules to search for drug candidates is becoming more sophisticated using software such as SLIDE. Of course this is only the first step in the drug discovery process. The lower panel in Figure 2 shows how an inhibitor attaches itself to the protein HIV protease. The flaps at the top can no longer open and close, rendering the protein dysfunctional, which in this case is of course very beneficial, and has given the best success so far in the worldwide fight against AIDS.
Michael Thorpe is currently University Distinguished Professor of Physics at Michigan State University. From the middle of the year he will be Professor of Physics, Chemistry and Biochemistry at Arizona State University. He is a Fellow of the APS.