CHEM 245
Biochemistry

J. D. Cronk

Lecture 6. Proteins

Thursday 31 January 2019

Peptides and proteins. Liquid chromatography. Gel electrophoresis. Protein purification and analysis. Protein primary structure. Primary structure as the covalent structure of proteins. Informational content of protein sequences. Experimental methods for primary structure determination and study of peptides. Computational methods: Sequence alignments.

Reading: Lehninger - Ch.3, pp.85-109.


Summary

Reading summary. §3.2 Peptides and proteins. Peptides are chains of amino acids. Peptides can be distinguished by their ionization behavior. Biologically active peptides and polypeptides occur in a vast range of sizes and compositions. Table 3-3 Amino acid composition of two proteins. Some proteins contain chemical groups other than amino acids. Table 3-4 Conjugated proteins. §3.3 Working with proteins. Proteins can be separated and purified. Proteins can be separated and characterized by electrophoresis. Unseparated proteins can be quantified. §3.4 The structure of proteins: Primary Structure. The function of a protein depends on its amino acid sequence. The amino acid sequences of millions of proteins have been determined. Protein chemistry is enriched by methods derived from classical polypeptide sequencing. Mass spectrometry offers an alternative method to determine amino acid sequences. Amino acid sequences provide important biochemical information. Protein sequences help elucidate the history of life on earth. Box 3-2 Consensus sequences and sequence logos.

***

We begin an exploration of protein structure and function with a ground level view of proteins and their experimental characterization. Proteins are biological polymers made up of amino acid monomers linked together via peptide bonds. Hence, another term used in this context is polypeptide. The structure of proteins is described in a hierarchical manner, and the lowest level of this hierarchy is the primary structure, or amino acid sequence, of a given polypeptide chain. We may well consider the primary structure of a protein to be its covalent structure - i.e. what atoms are connected together in covalent bonds as a macromolecule. In this way, we can incorporate any important post-translational modifications to a polypeptide chain under the umbrella of primary structure.

Proteins as polyprotic species and isoelectric point (pI)

Proteins provide a general example of a multiprotic system, one of obvious biological importance. A given protein molecule generally contains many acidic and basic groups. We also use the term polyelectrolytes to describe such molecules, since they can exist in many distinctly-charged states, depending on which acidic and basic groups are protonated or deprotonated at a given pH. In fact, we want to describe how the charge on a protein molecule varies with pH. This reasoning will form the basis for separation of different protein molecules on the basis of charge. The two principal methods we will discuss are isoelectric focusing, a gel electrophoretic technique, and ion-exchange chromatography.

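Let's start by defining the isoelectric point (or isoelectric pH), pI, of a protein (or any polyelectrolyte species, for that matter): it is the pH at which the molecule carries no net charge. Below the pI the net charge is positive, and above it the net charge is negative.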
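To make the dependence of charge on pH concrete, here is a minimal sketch that estimates the net charge of a peptide from its sequence and locates the pH at which that charge crosses zero. The pKa values are rough, assumed textbook-style numbers (they are not taken from these notes), each group is treated as independent, and the function names are hypothetical.

```python
# Rough, assumed pKa values for ionizable groups (illustrative only; real
# values shift with the local environment inside a folded protein).
PKA_BASIC = {"K": 10.5, "R": 12.5, "H": 6.0}            # protonated form is +1
PKA_ACIDIC = {"D": 3.7, "E": 4.3, "C": 8.2, "Y": 10.1}  # deprotonated form is -1
PKA_N_TERM = 9.0  # assumed alpha-amino pKa
PKA_C_TERM = 3.0  # assumed alpha-carboxyl pKa


def net_charge(seq, pH):
    """Average net charge of a peptide at a given pH (Henderson-Hasselbalch)."""
    basic = [PKA_N_TERM] + [PKA_BASIC[r] for r in seq if r in PKA_BASIC]
    acidic = [PKA_C_TERM] + [PKA_ACIDIC[r] for r in seq if r in PKA_ACIDIC]
    # Fraction of each basic group that is protonated (+1).
    positive = sum(1.0 / (1.0 + 10 ** (pH - pka)) for pka in basic)
    # Fraction of each acidic group that is deprotonated (-1).
    negative = sum(1.0 / (1.0 + 10 ** (pka - pH)) for pka in acidic)
    return positive - negative


def isoelectric_point(seq, tol=1e-4):
    """Bisect on pH 0-14; net charge falls steadily as pH rises."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:
            lo = mid  # still positive, so the pI lies at higher pH
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    for peptide in ("KKKK", "DDDD", "ELFVHRNYAN"):
        print(peptide, round(isoelectric_point(peptide), 2))
```

The sketch ignores disulfide-bonded cysteines and any pKa shifts caused by neighboring groups, so its output should be read as a qualitative estimate. It does show the key point: a lysine-rich peptide has a much higher pI than an aspartate-rich one.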

Primary or covalent structure of mature insulin

As an example, consider insulin. In humans, the mature, active form of insulin is secreted by the pancreas, circulates systemically, and acts on peripheral organs to stimulate the uptake and storage of carbohydrates and fats (lipids). The mature insulin molecule is not directly produced by ribosomal protein synthesis. Instead, it is synthesized by ribosomes as a single polypeptide chain of 110 amino acids that is targeted to the endoplasmic reticulum (ER) by the N-terminal portion of the nascent polypeptide chain, termed the signal peptide. The signal peptide is cleaved, and the remaining chain undergoes folding and oxidation of cysteine residues to form disulfide bonds. It is worth noting here that intracellular, cytosolic proteins are maintained in a relatively reducing environment, in which disulfide bonding is not very common. Extracellular proteins are exposed to relatively more oxidizing conditions, which promote disulfide bonding. Further processing in secretory vesicles results in the removal of an internal section of the polypeptide chain, yielding mature insulin, which consists of two disulfide-linked polypeptide chains, an A chain of 21 amino acids and a B chain of 30 amino acids, as shown in the figure. Thus, the covalent structure of insulin - a rather small protein - technically differs from the amino acid sequence, in that the latter does not convey the conversion of cysteine residues to specific pairs of disulfide-linked residues (the disulfide-linked pair is occasionally referred to as a cystine residue). Analogously, the post-translational modifications of individual amino acids - such as glycosylations or phosphorylations - are not explicitly indicated by the amino acid sequence. The processing and modification of polypeptide chains are in most cases of crucial importance to the biological roles they play, as the insulin example illustrates.

Polypeptide diversity

As we will subsequently see, the amino acid sequence of a polypeptide chain determines its structure at the highest level, i.e. its tertiary structure or three-dimensional conformation. The tertiary structure of a polypeptide chain, in turn, is inextricably linked to its biological function. Our shorthand expression for this principle is that structure determines function.

Protein methods

Structural hierarchy: A common conceptual framework for the description of protein structure sets forth multiple levels, or a hierarchy. The first level, primary structure, relates to the covalent description of a (possibly modified) polypeptide chain. This can often be represented as the amino acid sequence. Additional levels - secondary, tertiary and quaternary structure - define aspects of the three-dimensional conformation, overall fold, and assembly of polypeptide chains. Secondary structure is a "local" (in the sense of actual three-dimensional space) description of the conformation of a polypeptide chain. Secondary structures show regular patterns, the two prominent examples being the α helix (alpha helix) and the β sheet (beta sheet).

Tertiary structure is the overall shape of the polypeptide chain, which depends on the conformation of all the main chain and side chain bonds of the molecule. Quaternary structure refers to the cases in which separate polypeptide chains ("monomers" or "subunits") associate - usually noncovalently - to form "multimers" or "oligomers".


Informational content of protein sequences

Even just the primary structure of a protein provides a potential wealth of information. Since the amino acid sequence ultimately determines the structure of the protein - at all levels of the hierarchy - and the three-dimensional structure of a protein underlies its function, a protein's sequence already provides important clues to its function. In particular, if the amino acid sequence of a protein of unknown function is compared with those of proteins of known function and a significant similarity is identified between such sequences - in other words, a homology exists - the uncharacterized protein is highly likely to perform the same or similar function.

The comparison of sequences is represented by a sequence alignment. Sequence alignments can be performed for a pair of homologous proteins, or for multiple homologs. Multiple sequence alignments

Primary structure as protein covalent structure


Protein sequencing

Most proteins are much too large to sequence directly. Instead, a "divide and conquer" strategy is used, by which smaller peptide fragments are produced, followed by sequencing of the fragments by Edman degradation or mass spectrometry. If at least two distinct methods of fragmenting the protein are used, the assembly of the sequence is made tractable by the overlapping fragments produced. Two general methods are used to produce peptide fragments: endopeptidases, or proteases, which are enzymes that cleave the peptide bonds between amino acids within the polypeptide chain, and chemical methods.

The figure below shows schematically the hydrolysis reaction catalyzed by a protease. The peptide bond targeted by a protease is termed the scissile bond. For an endopeptidase, the scissile bond is internal to the chain, i.e. the protease is not acting on the first (N-terminal) or last (C-terminal) peptide bond. A protease that does act on a terminal peptide bond is termed an exopeptidase, and there are exopeptidases specific for one end or the other (aminopeptidases and carboxypeptidases).

Structural diagrams of the hydrolytic cleavage of a peptide bond as carried out by endopeptidases

For the approach to protein sequencing we are considering, endopeptidases are most useful. Furthermore, it is helpful if the endopeptidase used is specific - that is, it cleaves peptide bonds not in a random fashion, but only at certain locations in the chain, such as on the C-terminal side of certain amino acids. Ideally, the fragments yielded by treatment of a given protein with an endopeptidase are generated predictably and reproducibly. Fortunately, a degree of specificity is characteristic of enzymes in general. In fact, it is possible for the enzyme to be too specific in this context. A proteolytic enzyme employed for protein sequencing ideally acts with a moderate degree of specificity so that treatment of a protein with the enzyme yields an appropriate number and length of peptide fragments.

Two digestive enzymes of the serine protease family are useful in this sense, trypsin and chymotrypsin. Trypsin cleaves on the C-terminal side of Arg or Lys residues, although not if the following residue is Pro. Chymotrypsin is somewhat more loosely specific, cleaving on the C-terminal side of residues with large nonpolar sidechains, preferentially acting on the peptide bond following Phe, Tyr, and Trp (although again, not if the next residue is Pro).
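The cleavage rules just stated can be sketched as a small in-silico digestion. This is a simplified illustration under the stated rules only (cleave on the C-terminal side of the listed residues, unless the next residue is Pro); real enzymes also show missed cleavages and secondary preferences. The function names are hypothetical.

```python
def digest(seq, cleave_after, blocked_by="P"):
    """Split seq after each residue in cleave_after, unless the next
    residue is in blocked_by. Returns the fragments in N-to-C order."""
    fragments = []
    start = 0
    # The last residue has no following peptide bond, so stop one short.
    for i in range(len(seq) - 1):
        if seq[i] in cleave_after and seq[i + 1] not in blocked_by:
            fragments.append(seq[start:i + 1])
            start = i + 1
    fragments.append(seq[start:])
    return fragments


def trypsin(seq):
    """Cleave after Arg (R) or Lys (K), not before Pro."""
    return digest(seq, "KR")


def chymotrypsin(seq):
    """Cleave after Phe (F), Tyr (Y) or Trp (W), not before Pro."""
    return digest(seq, "FYW")


if __name__ == "__main__":
    peptide = "ELFVHRNYAN"
    print("trypsin:     ", trypsin(peptide))
    print("chymotrypsin:", chymotrypsin(peptide))
```

Because the two enzymes cut at different positions, their fragment sets overlap, which is what makes assembly of the full sequence possible.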

There are a few chemical methods with a specificity making them suitable for the generation of peptide fragments. The prime example is cyanogen bromide (CNBr), a reagent that cleaves on the C-terminal side of Met residues, generating a peptidyl homoserine lactone N-terminal fragment.

Schematic of CNBr cleavage reaction

Edman degradation

Peptides generated by the methods described above (or similar ones) can in most cases be readily sequenced by the iterative chemical procedure known as Edman degradation. The procedure is based on a three-stage reaction that labels and removes the N-terminal residue of a polypeptide, which can be identified as a PTH (phenylthiohydantoin) derivative. The product following the three steps is thus again a peptide, one residue shorter at its N-terminus, which can then be subjected to another round of the same reactions. The overall reactants and products of one round of the three-step Edman procedure are shown below. Note that the initial reaction requires the N-terminal amino group to be nucleophilic; hence the pH must be high enough to ensure that this group is in its neutral base form.

Overall reactants and products for one three-step round of Edman degradation

In favorable cases, the process can be repeated many times, the liberated PTH-amino acid identified each time, and up to 100 residues of sequence determined in this manner. Furthermore, with technological advancement permitting automation and incorporating highly sensitive means of detection, Edman degradation can be performed on small amounts of a peptide - 5-10 pmol or <0.1 μg.

Sequencing by mass spectrometry

Mass spectrometry is an analytical method that measures the mass-to-charge ratio (m/z) of ions in the gas phase. Generally, the larger a molecule, the lower its vapor pressure. Hence, a major hurdle to the application of mass spectrometry to biomolecules was to produce gas-phase ions of an intact large polymer such as a protein or peptide. Work overcoming this challenge was rewarded by half of the Nobel Prize in Chemistry 2002.

Electrospray ionization (ESI) is capable of producing multiply-charged intact polypeptides in the gas phase. ESI mass spectrometry is an extremely useful and accurate method for determining the mass of polypeptides and proteins. A polypeptide produces a family of ions whose successive members differ in both mass and charge by 1 atomic unit (based upon the number of H+ taken up by proton-accepting groups of the molecule). Our text (VVP4e) shows (Fig. 5-17, p.112) a mass spectrum of such a family for horse heart myoglobin, and demonstrates (Sample Calculation 5-1, p.113) how the m/z values for two successive peaks can be used in an algebraic determination of the molecular mass.
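The algebra behind the two-peak calculation can be sketched as follows. If a peak at p1 comes from the ion carrying z protons and the adjacent lower peak at p2 from the ion carrying z + 1 protons, then p1 = M/z + m_H and p2 = M/(z + 1) + m_H, which gives z = (p2 - m_H)/(p1 - p2) and M = z(p1 - m_H). The code below is a minimal illustration with synthetic peaks for a hypothetical 17,000 Da protein. It does not reproduce the textbook's data, and the function name is an assumption.

```python
PROTON_MASS = 1.00728  # mass of H+ in Da


def mass_from_adjacent_peaks(p1, p2):
    """Estimate charge and molecular mass from two adjacent ESI peaks.

    p1 is the m/z of the ion with z protons, p2 (< p1) the m/z of the
    ion with z + 1 protons. Returns (z, M).
    """
    if p2 >= p1:
        raise ValueError("expected p1 > p2 for adjacent charge states")
    z_estimate = (p2 - PROTON_MASS) / (p1 - p2)
    z = round(z_estimate)  # the charge must be a whole number
    mass = z * (p1 - PROTON_MASS)
    return z, mass


if __name__ == "__main__":
    # Synthetic peaks for a hypothetical protein of mass 17000.0 Da.
    true_mass = 17000.0
    p1 = true_mass / 10 + PROTON_MASS  # ion with 10 protons
    p2 = true_mass / 11 + PROTON_MASS  # ion with 11 protons
    z, mass = mass_from_adjacent_peaks(p1, p2)
    print(f"z = {z}, M = {mass:.1f} Da")
```

In practice every adjacent pair of peaks in the family gives an independent estimate of M, and averaging them improves the precision.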

Peptides of up to 25 residues can be sequenced by mass spectrometry. In tandem mass spectrometry (MS/MS), an analyte selected by a first mass spectrometer is directed into a second mass spectrometer after being fragmented. Through the determination of the masses of the many fragments that can be produced by breaking one of the peptide bonds of the analyte peptide, the sequence can be determined.

The advantages of peptide sequencing by mass spectrometry are that blocked N-termini (a roadblock for Edman degradation) pose no problem, that sequence data are acquired rapidly, and that common posttranslational modifications can be characterized.

A limitation of this method is that it cannot distinguish between Ile and Leu (as they have identical residue masses), and it has difficulty distinguishing Gln and Lys (whose residue masses are nearly identical).
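The idea of reading a sequence from fragment masses, and the Ile/Leu and Gln/Lys limitation, can be illustrated with a short sketch. It assumes an idealized spectrum containing only the singly charged N-terminal fragment series (commonly called b ions), each equal to the sum of its residue masses plus one proton. The residue masses are standard monoisotopic values, while the function names and the mass tolerance are assumptions made for illustration.

```python
# Monoisotopic residue masses in Da (amino acid minus water).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON_MASS = 1.00728


def b_ions(peptide):
    """m/z of the singly charged N-terminal fragments of a peptide
    (one value per internal peptide bond)."""
    total = 0.0
    ladder = []
    for residue in peptide[:-1]:
        total += RESIDUE_MASS[residue]
        ladder.append(total + PROTON_MASS)
    return ladder


def read_ladder(mz_values, tol=0.01):
    """Call residues from the mass differences between successive
    fragments. Ambiguous calls are reported as e.g. 'I/L'."""
    calls = []
    previous = PROTON_MASS
    for mz in mz_values:
        delta = mz - previous
        matches = sorted(r for r, m in RESIDUE_MASS.items()
                         if abs(m - delta) <= tol)
        calls.append("/".join(matches) or "?")
        previous = mz
    return calls


if __name__ == "__main__":
    print(read_ladder(b_ions("ELFVHR")))
    # A looser tolerance mimics a lower-resolution instrument.
    print(read_ladder(b_ions("QKA"), tol=0.05))
```

With a tight tolerance Gln and Lys can be told apart, but Ile and Leu remain ambiguous at any tolerance because their masses are identical. The last residue is not called here because the sketch uses only one fragment series; real spectra also contain a complementary C-terminal series.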

Sequence assembly: The "divide and conquer" strategy, piecing together overlapping peptide fragments produced by at least two different methods, is employed to deduce the complete sequence of the intact parent polypeptide. As a simple example, if treatment of a 10-residue peptide with trypsin yields fragments NYAN and ELFVHR, and treatment of the same peptide with chymotrypsin produces AN, ELF and VHRNY, then the intact peptide sequence must be ELFVHRNYAN.
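The overlap reasoning in this example can be checked with a brute-force sketch: try every ordering of each fragment set and keep the sequences that both sets can produce. This is only practical for a handful of fragments (the number of orderings grows factorially), and the function name is hypothetical.

```python
from itertools import permutations


def assemble(fragments_a, fragments_b):
    """Return every full sequence that can be built by concatenating
    all of fragments_a in some order AND all of fragments_b in some order."""
    orders_a = {"".join(p) for p in permutations(fragments_a)}
    orders_b = {"".join(p) for p in permutations(fragments_b)}
    return sorted(orders_a & orders_b)


if __name__ == "__main__":
    tryptic = ["NYAN", "ELFVHR"]
    chymotryptic = ["AN", "ELF", "VHRNY"]
    print(assemble(tryptic, chymotryptic))
```

A single result means the two digests determine the sequence uniquely. More than one result would mean a third fragmentation method is needed to resolve the ambiguity.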

Sequencing example: The first protein sequence, that of insulin, was determined in the 1950s by Frederick Sanger.

Sequence databases

A number of important online resources for retrieving information related to biological molecules are collectively indispensable, and represent the product of bioinformatics. The National Center for Biotechnology Information (NCBI) and the Protein Data Bank (PDB) will serve as principal bioinformatic portals in this course. Other resources (as provided in VVP4e, Table 5-5 on p.115):

Protein Information Resource (PIR): http://pir.georgetown.edu/

UniProt: http://www.uniprot.org/

Sequence analysis: alignments

Pairwise comparison of sequences of proteins. Multiple sequence alignments and conserved residues.
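As a rough illustration of how a pairwise alignment is computed, the sketch below uses the Needleman-Wunsch global alignment algorithm, a standard dynamic programming method that these notes do not specifically name. The scoring scheme (+1 match, -1 mismatch, -1 per gap position) is an assumption chosen for simplicity; real protein alignments use substitution matrices and more refined gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of sequences a and b.
    Returns (score, aligned_a, aligned_b), with '-' marking gaps."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Score for pairing residue a[i-1] with residue b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Identical positions as a percentage of the alignment length."""
    same = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * same / len(aligned_a)


if __name__ == "__main__":
    s, x, y = needleman_wunsch("ELFVHR", "ELVHR")
    print(s)
    print(x)
    print(y)
    print(round(percent_identity(x, y), 1))
```

When several alignments share the best score, the traceback returns just one of them. Percent identity computed this way depends on the scoring scheme, so it is best used for comparing sequences aligned under the same settings.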

Protein evolution

Evolutionary relationships are revealed by protein sequence comparisons. Phylogenetic trees can be constructed from multiple sequence alignments. Sequence comparisons provide information on protein structure and function.

Proteins evolve by duplication of genes or gene segments. Homologous proteins have protein sequences that have a high degree of identity and similarity (conservative substitutions). Not surprisingly, homology is indicative of evolutionary relationships. This is most evident in orthologous proteins, which are proteins that perform the same function in different organisms. For example, cytochrome c orthologs can be found in nearly every eukaryotic organism, and multiple sequence alignments of cytochrome c can be used to construct a phylogenetic tree - see VVP4e, Table 5-6 (pp.116-117) and Fig. 5-22 (p.119).

Among homologous proteins, paralogous proteins (paralogs) reside within the same organism. Paralogs arise as the result of gene duplication; hence they can evolve independently into divergent functional roles, in contrast to orthologous proteins.

Proteins evolve at different rates, providing evolutionary clocks on vastly different time scales.

Domains, their duplication and divergence. A domain is a segment of a protein sequence that is conserved, apparently as a result of the evolutionary utility of the tertiary structure and its associated biological function. The size of the smallest domains is set by the typical minimum number of residues necessary to form a stable tertiary conformation (or "fold"), usually cited as somewhere around 40 amino acids. The largest can be several hundred residues in length. Polypeptide chains of more than several hundred residues in length almost certainly fold into two or more domains. Our text at this point, at the end of Ch.5 (VVP4e, p.122), focused still on genetic information (or its translation into a protein sequence according to the genetic code), introduces domain shuffling as an evolutionary mechanism. Furthermore, domain shuffling may be a more rapid process by which protein diversity is generated than is gene duplication. Certainly domain duplication, followed by divergence, creates a greater multiplicity of independently evolving structural units. We'll revisit domains when we delve into protein tertiary structure in the next part of the course.