Welcome Guest, you are in: Login

Fulton Wiki

RSS RSS

Navigation





Search the wiki
»

PoweredBy

Bioinformatics

RSS
Modified on 2012/12/03 15:47 by Ben Fulton Categorized as Uncategorized
A strand of DNA consists of a series of nucleotides. There are just four nucleotides: Adenine, Guanine, Ctyosine, and Thymine. These nucleotides bond together in a long row to form a DNA molecule. The order of the nucleotides in the molecule is not relevant to the structure, so one molecule might consist of Guanine, Cytosine, Guanine, Guanine, Adenine, while another could be Cytosine, Thymine, Thymine, Guanine, Guanine. To simplify the description, it is sufficient to use the initial letters of the nucleotides to describe the molecule, so our first molecule would be GCGGA, and the second would be CTTGG. Bits of DNA are used as blueprints inside a cell to create proteins. A protein is also a sequence, but this time it is a sequence of amino acids rather than nucleotides. There are just 20 kinds of amino acids, so each one can be represented as a letter of the alphabet. Since three nucleotides can combine in 64 different ways, each triplet in an RNA chain can code a distinct protein. Inside the cell, a ribosome moves along the RNA and converts each triplet into a specific protein (many triplets can code to the same protein). There are also Start and Stop triplets that tell the ribosome where to start and stop the protein creation.

Alignment

Global Alignment

One of the most important problems in bioinformatics is to take two DNA molecules and determine how similar they are to each other. Typically this would be done by "scoring" the sequences - say, one point if there is a nucleotide match at a given position, and minus one point if they are different. So the sequences AACGT and AATGT would have a score of three - four points for the four matches and -1 for the mismatch. This is a good start, but sequences are often not the same length, and if they are not, they could differ only by an additional nucleotide in the middle. An extra C might be added to the second string in the example, to make it ACATGT. If we compare the sequences now, we have just one match and five mismatches, for a score of -4. We need to introduce a gap into the first string in order to find the best match, giving us
 	
A-ACGT
 	
ACATGT
 	
with a much better score of four matches and two mismatches. The question becomes, how do we decide where to put the gap? The Needleman-Wunsch algorithm provides an answer, by placing the two sequences in a matrix and assigning a score to each position in the matrix.

Local Alignment

Another, similar problem is to find, not just an exact alignment between two sequences, but a partial alignment. So perhaps, inside two larger sequences, there are sections coding for identical proteins, but in the first sequence, the section starts at position 100, and in the second sequence, the identical section starts at position 500. A global alignment in this case will assign a large gap penalty to get the two sections aligned, so we only want to find subsections of the strings and match them. This turns out to be a problem very similar to the global alignment problem. The Smith-Waterman algorithm provides some small modifications.

Scoring Matrices

Algorithms the choose alignments based on a matrix have to choose a score for each item in the matrix. If we have two amino acid sequences, ABCD, and BBCC, we create a cell in the matrix for each pair of elements, and assign it a score. It could be as simple as 1 for a match and 0 for a mismatch, but in biological terms, certain amino acids are more likely to be replaced by specific other ones. We account for this by a matrix that gives a specific score for each potential replacement. Popular scoring matrices are the BLOSUM series and the PAM series.

Multiple Alignment

After mastering the comparison of two sequences, we next move on and try to align multiple sequences at once. It would be theoretically possible to generalize the matrix theory, so to compare three sequences you'd use a three-dimensional matrix, four sequences a hypermatrix, and so on. This is monstrously expensive computationally, and so is not a strategy that is used in practice. Instead, we use progressive alignment, in which the most similar sequences are aligned first, and then a composite alignment, or profile, is used to continue the alignment to other sequences.

RNA Folding

DNA folds into double helixes; RNA: not so much. Non-coding RNA (ncRNA) just refers to RNA that doesn't code for proteins. It can have many other uses (regulation, localization). A riboswitch? RNA can fold into hairpin turns. The sequence CGAXXXXXXXUCG will pair bond C-G, G-C, A-U with the X's in the hairpin. The three pairs keep the structure but if one or two mutations there, the bonds aren't strong and the structure will be lost. This may make the RNA useless. (C-G pairs bond more strongly than A-U). Secondary structures: hairpin turns, loops. More complex folds: Pseudoknots, kissing hairpins, hair-bulge interactions. Pseudoknot has two loops, held together by bonds at the edges. Approaches to structure prediction: Energy minimization, comparative sequence analysis, simultaneous folding and alignment.

RNA folding prediction

Assumes: most stable structure is also most likely to form; energy is only influenced by local structure; doesn't take into account pseudoknots.

Nussinov Algorithm

Dynamic programming algorithm. Calculate score from I to J along the RNA - S(I,J) comes from one of three strings: by moving in one space from the front, from the back, or from both. OR, bifurcate the string at any location between I,J. Do the two bifurcated strings add together for greater energy? Fill in a matrix NxN, base pairs going both across and down. Put 0's in the main diagonal to represent the energy of a single nucleotide strand. Fill in table diagonal by diagonal, looking at cells below, to the left, and lower left. For bifurcation, for K between I and J, add the max of the scores in S(I+1,K) + S(K+1,J), etc. Final result will be S(1,N) (The score for the entire sequence) Complexity of the algorithm is N3. Traceback can be tricky due to the back-calculation of the bifurcated elements. You have to keep track of where the bifurcations were.

Inferring by comparative sequence analysis

Compared sequences need to be similar - but make sure not to penalize covariant changes! A paired AU can be replaced with a paired GC without much energy loss. Can simultaneously align and fold. Measuring covariance: Mutual Information - generate a correlation over parts of the sequence. RNAalifold - generates a covariance-like score. (Complex score involving summing over all possible pairs). (Have to reference a slide here).

Covariance Model

Very slow but accurate analysis. Multiple alignment looping with covariance estimations. Results in binary tree representation of secondary structure.

Energy Minimization

Mfold and Vienna packages. Mfold will be within a certain threshold of the calculated minimum energy. Vienna calculates the probability of base pairings.

Inferring by comparative sequence analysis

Compared sequences need to be similar - but make sure not to penalize covariant changes! A paired AU can be replaced with a paired GC without much energy loss. Can simultaneously align and fold. Measuring covariance: Mutual Information - generate a correlation over parts of the sequence. RNAalifold - generates a covariance-like score. (Complex score involving summing over all possible pairs). (Have to reference a slide here).

ncRNA gene finding

De novo methods - calculate from folding energy and number of sub-optimal structures Homology methods Rfam - collection of sequence alignments and covariance models. Infernal is a search tool for it. A lot of good references at the end of the slides.

Protein Bioinformatics

Lots of open problems. Detection of similarities, prediction of function, prediction of structure. membranes. Designing new proteins and engineering existing ones. SCOP classification - catalogues structures by structure. Folds, families, superfamilies. Experimental determination techniques: X-Ray crystallography (but getting crystals is a black art), Nuclear Magnetic Radiation (NRM) - determine the exact location of each hydrogen atom. Better for small proteins, 200-300 bp., Electron microscope (EM) for large complex structures. Structure alignment - take two proteins where the acids are known in three-dimensional space, apply translations and rotations to try to get them matching as closely as possible. Attempt to minimize RMSD. But what if we don't know the alignments? DALI - use distance matrix (expand on this). Flexible Structure Comparison (Maybe only comparing regions?) FATCAT

Protein Folding

Problems in folding: pathways? intermediate structures? folding speed and contact order? energy landscapes? misfolds?

Levinthal Paradox - proteins fold reliably to the correct conformation. No one knows why, really. Seems like it could fold into many different patterns.

Energy Landscape theory of protein folding. (Folding Funnel) Can go down different pathways and still fold to the right structure.

Molecular dynamics simulations - generate a trajectory for the molecule based on F=MA and integrate across all atoms. Can use all sorts of energies - hydrogen bonds, bond eneergey, van der vwall energy. Using a powerful computer can determine states of all atoms over time.

Can protein structure be predicted from sequences?

- Secondary structure is easier, and thus has many methods (statistical methods, nearest neighbors, neural networks, and more) Certain acids have preferences for helixes or sheets (eg, Glu is alpha, Val is beta).

- Side chains (Dead End Elimination algorithm). Figure out patterns that conflict and remove them (Rotamers - a possible location for an acid).

- Tertiary structure - can be done via thermodynamics but very computationally expensive. So: homology modelling. Steps: Find a template, and align it. Transfer coordinates from backbone and sidechain from the template. Predict any missing structures. Programs: Modeller, Rosetta.

CASP 8 - a competition for modelling

Cryo-EM, Flex-EM, De Novo

Even more complex structures

Blind testing

Function Prediction

Gene ontology

Homology-based: look at similar proteins and there's a good chance they have similar functions. But also look to see what proteins they are associated with in the cell (guilt by association).

What is predicting functional residues?

Hypothetical proteins - new genome sequencing projects that seem to be protein coded, but unknown what they do.

Gene ontology components: the biological process, the molecular function, the cellular component of the protein.

EC (Enzyme Commission Classification) describes function with just four numbers: For example, the tripeptide aminopeptidases have the code "EC 3.4.11.4".

But Gene Ontology (GO) is more extensive, AND it has a great acronym! Definitions, plus formal relationship words (is_a, has_a). Each term has two definitions: a human one and a graph one. Each term is related to others via the graph. (pretty much just is_a and has_a)

Can measure similarity not just by closeness to each other in the graph, but also by semantic similarity. EG, probability of each node is the probability of the term occurring in a database such as SWISS-Prot.

Proteomics

Mass Spectromety and MS/MS

Then identify, as usual, by database search or de novo. Also need to identify Post Transitionally Modified (PTM) peptides. And quantitative proteins. (different abundancies)

proteome coined by analogy with genome - means all the proteins in the cell.

MS components: An ion source, a mass analyzer, and a detector. Mass Analyzer separates the ions according to their mass-to-charge ratio.

Mass spectrometry-based proteomics, Nature 422:198, 2003 Applying mass spectrometry-based proteomics to genetics, genomics and network biology, Nature Reviews Genetics 10, 617-627, 2009

ScrewTurn Wiki version 3.0.4.560. Some of the icons created by FamFamFam.