Tuesday, August 27, 2013

Semantic web

The Semantic Web refers to the inclusion of machine-readable data on web pages. While all data is theoretically readable by machine, the semantic web generally takes advantage of ontologies - formal and consensual specifications that provide a shared and common understanding of terms.

An ontology is essentially metadata - data describing the data. A database schema is an example of an ontology, although probably not one in a format that is useful for web data. So is a class in an object-oriented language such as Java.

However, metadata is generally categorized into levels. The lowest level is simple data. The next level up is syntactic metadata: simple attributes of the data such as language, format, source, and creation date.

Next is structural metadata, such as DTDs, XSL, and clustering.

Then, we have semantic metadata, where terms have semantic meaning. In metadata concerning medical treatment, for example, we might see data like
region: upper abdomen, organ: liver, pathological structure: abscess
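
As a rough sketch of how a program might hold this kind of semantic metadata, the example above can be written as subject/predicate/object triples. The subject name "finding-1" and the helper function are hypothetical, chosen only for illustration; no real medical vocabulary is implied.

```python
# Each triple is (subject, predicate, object).
# "finding-1" is a made-up identifier for the thing being described.
TRIPLES = [
    ("finding-1", "region", "upper abdomen"),
    ("finding-1", "organ", "liver"),
    ("finding-1", "pathological structure", "abscess"),
]


def describe(subject, triples=TRIPLES):
    """Collect every predicate/object pair recorded for one subject."""
    return {pred: obj for subj, pred, obj in triples if subj == subject}


if __name__ == "__main__":
    print(describe("finding-1"))
```

Because each term is a named field rather than free text, a program can ask for the organ or the region directly instead of parsing a sentence.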

Finally, at the highest level, we have a full-fledged ontology. Examples of ontologies might include anatomy or diagnostics. Some currently existing ontologies include PapiNet.org, a vocabulary for the paper industry; BPMI.org, a vocabulary for exchanging business process models; and XML-HR, vocabularies for human resources.

Interesting things:

  • Pellet - an OWL reasoner for Java.  http://clarkparsia.com/pellet/
  • Simile - http://simile-widgets.org/ Web widgets for supporting data visualizations
  • http://schema.org, a site that defines shared vocabularies for structured data markup on web pages.
  • Google Brain, a learning project
  • Vivisimo - structured clustering
  • http://www.semantic-conference.com/primer.html


Thursday, February 21, 2013

Enzyme Inhibitor Types

Enzyme inhibitors come in two flavors: reversible and irreversible. Irreversible inhibitors act by forming a covalent bond with the enzyme. (Covalent bonds are very strong and difficult to break.) Aspirin and penicillin are examples of irreversible inhibitors. Aspirin modifies the Serine516 residue of the enzyme cyclooxygenase, by adding a COCH3 (acetyl) group. The residue is part of the enzyme's active site, and the acetylation causes it to lose its function.

Reversible inhibitors come in three types: competitive, noncompetitive, and uncompetitive. While the names of the second two are rather confusing, a competitive reversible inhibitor is straightforward: it mimics the structure of the substrate of the enzyme. The enzyme's active site thus binds to the inhibitor rather than the substrate.

An uncompetitive inhibitor binds to the enzyme only after the enzyme has already bound to the substrate. It might act by blocking the release of the product from the enzyme.

A noncompetitive inhibitor binds to the enzyme at a point other than the active site, causing some reaction, possibly a change in the conformation of the enzyme, that interferes with the working of the enzyme. It may bind to the enzyme itself, or it may bind to the enzyme/substrate combination.

Wednesday, February 20, 2013

Fundamental questions of Hidden Markov Models

There are three problems, canonically, that are solved with the use of Hidden Markov Models:


  • Given an HMM, and a sequence of outputs, evaluate the probability of the HMM producing the given output.
  • In the same conditions, determine the most likely sequence of states that would have produced the given output.
  • Given just a sequence of outputs and some states, create an HMM that would best account for the given output.

For the first, in the simplest case the probability is calculated by multiplying the probability of each individual output. A simple example is a single-state HMM that has the following outputs: 'A' with probability .5, 'B' with probability .3, and 'C' with probability .2. If the output sequence is ABCABC, multiplying the probabilities gives .0009. (With more than one state, the calculation must also account for every path through the states that could have produced the output.) Clearly, as the length of a sequence increases, the probability of it being output drops; however, we are generally more concerned with comparative probabilities. For example, the sequence AAAAAA is much more likely to have been generated by our model, with a probability of 0.015625.
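
The single-state arithmetic above can be checked with a few lines of Python. This is a minimal sketch that only covers the one-state case from the example; the function name is made up for illustration.

```python
from math import prod

# Output probabilities of the single-state model described above.
EMISSIONS = {"A": 0.5, "B": 0.3, "C": 0.2}


def sequence_probability(sequence, emissions=EMISSIONS):
    """Multiply the probability of each output; unknown symbols count as 0."""
    return prod(emissions.get(symbol, 0.0) for symbol in sequence)


if __name__ == "__main__":
    for seq in ("ABCABC", "AAAAAA"):
        print(seq, sequence_probability(seq))
```

For a model with several states, a plain product is no longer enough, since the same output can arise from many different state paths; the usual tool there is the forward algorithm.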

For the second, you might have a two-state HMM. In state 1, it outputs 'A' with a probability of .5 and 'B' with a probability of .5. In state 2, it outputs 'A' with a probability of .8, 'B' with a probability of .1, and 'C' with a probability of .1. Given the sequence CCC, we know that the model is in state 2 the entire time, as state 1 has no possibility of outputting a C. If the sequence is CAA, however, we don't know which state the model was in when it generated the second or third outputs. But if we also have a model parameter that says the probability of transitioning from state 2 to state 1 is .01, we can see that it is still overwhelmingly likely that the model never exited state 2. Given an output sequence of a few hundred characters, on the other hand, the likelihood is very high for an occasional jump into state 1. The Viterbi algorithm is the standard way to find the most likely sequence of states.
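
Here is a small sketch of the Viterbi algorithm, the standard dynamic-programming method for finding the most likely state sequence, applied to the two-state example. The output probabilities and the .01 chance of moving from state 2 to state 1 come from the text; the starting probabilities and the transitions out of state 1 are not given there, so the values used below are assumptions.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely state sequence.

    Assumes `observations` is non-empty.
    """
    # best[s] = (probability of the best path ending in state s, that path)
    best = {
        s: (start_p[s] * emit_p[s].get(observations[0], 0.0), [s])
        for s in states
    }
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Pick the predecessor that maximizes the probability of reaching s.
            prob, path = max(
                ((best[prev][0] * trans_p[prev][s], best[prev][1]) for prev in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * emit_p[s].get(obs, 0.0), path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


STATES = (1, 2)
START_P = {1: 0.5, 2: 0.5}                  # assumption: not given in the text
TRANS_P = {
    1: {1: 0.99, 2: 0.01},                  # assumption: not given in the text
    2: {1: 0.01, 2: 0.99},                  # 2 -> 1 is .01, as in the text
}
EMIT_P = {
    1: {"A": 0.5, "B": 0.5},
    2: {"A": 0.8, "B": 0.1, "C": 0.1},
}

if __name__ == "__main__":
    print(viterbi("CAA", STATES, START_P, TRANS_P, EMIT_P))
```

With these numbers the best path for CAA stays in state 2 throughout, which matches the reasoning above.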

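The third is more complicated, as it involves calculating the most likely parameters for the model based only on the output sequence. There are algorithms for this, however; the best known is the Baum-Welch algorithm, an iterative procedure that repeatedly re-estimates the parameters to better fit the observed output.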

Tuesday, February 19, 2013

Hidden Markov Model

A Markov Chain is a simple state machine. The machine can be in one of several states and from each state, there is a set probability (possibly zero) of transitioning to each other state. Consider a baseball game: One state might be two outs, runner on second. After the next at-bat, the state might change to three outs, or to two outs, runners on first and second. The probability of the state changing to one out, nobody on base is zero.
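
A Markov chain like this can be sketched as a table of transition probabilities. The table below is a deliberately tiny toy: the state names follow the baseball example, but every probability is made up for illustration, and most real game states are left out.

```python
import random

# Hypothetical transition table; each row must sum to 1.
TRANSITIONS = {
    "2 outs, runner on 2nd": {
        "3 outs": 0.7,
        "2 outs, runners on 1st and 2nd": 0.3,
    },
    # Toy simplification so the chain always ends: the next batter makes an out.
    "2 outs, runners on 1st and 2nd": {"3 outs": 1.0},
    "3 outs": {"3 outs": 1.0},
}


def transition_probability(src, dst):
    """Probability of moving from src to dst; missing entries mean zero."""
    return TRANSITIONS[src].get(dst, 0.0)


def simulate(start, steps, rng):
    """Walk the chain for a number of steps, returning the visited states."""
    path = [start]
    for _ in range(steps):
        options = TRANSITIONS[path[-1]]
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        path.append(nxt)
    return path


if __name__ == "__main__":
    print(transition_probability("2 outs, runner on 2nd", "1 out, nobody on base"))
    print(simulate("2 outs, runner on 2nd", 3, random.Random()))
```

Impossible transitions, such as going from two outs back to one out, simply have no entry in the table and so get a probability of zero.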

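A Hidden Markov Model is a Markov chain whose states cannot be observed directly. Instead, each state emits an observable output with some probability, and the states (and often the model's probabilities as well) must be inferred from those outputs. Estimating unknown probabilities from data works the same way it does for an ordinary Markov chain. You might, for example, observe hundreds of baseball games and determine that, out of 100 times there were two outs, runner on second, the batter made an out 63 times. Thus, we infer that the transition probability, which is unknown, is probably about 63%.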
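
The counting estimate in the baseball example can be sketched as follows. The function name is hypothetical, and the 37 non-out observations are lumped into a single made-up "other" state, since the text only says how many at-bats ended in an out.

```python
from collections import Counter, defaultdict


def estimate_transitions(observed_pairs):
    """Estimate transition probabilities as observed frequencies.

    `observed_pairs` is an iterable of (state_before, state_after) tuples.
    """
    counts = defaultdict(Counter)
    for src, dst in observed_pairs:
        counts[src][dst] += 1
    return {
        src: {dst: n / sum(row.values()) for dst, n in row.items()}
        for src, row in counts.items()
    }


if __name__ == "__main__":
    start = "2 outs, runner on 2nd"
    # 100 observations: 63 outs, 37 anything else ("other" is a placeholder).
    observations = [(start, "3 outs")] * 63 + [(start, "other")] * 37
    print(estimate_transitions(observations)[start])
```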

Michaelis-Menten


Two important characteristics of an enzymatic reaction are its Km and its Vmax. The Vmax is the theoretical maximum rate of conversion, reached when the enzyme is saturated with substrate, while the Km is the concentration of substrate at which the rate of conversion is half of Vmax. Km is important because it relates the rates at which the intermediate enzyme-substrate complex forms and breaks down: Km = (Vq + Vr) / Vs, where Vq and Vr are the rate constants for the two ways the complex can break down (releasing the unchanged substrate, or releasing the product), and Vs is the rate constant for the substrate binding to the enzyme to form the complex.

We are interested in finding the initial velocity (V0) of the reaction, the rate at which the enzyme converts a substrate S as soon as the two are mixed together. Given the Km and the Vmax, the reaction's V0 can be calculated using the Michaelis-Menten equation: V0 = Vmax * S / (S + Km), where S is the substrate concentration. Plotted against S, this equation gives a hyperbolic curve that approaches Vmax as S increases.

This equation takes on interesting characteristics in certain situations. If the concentration of S is much less than Km, S + Km becomes approximately Km, and the formula simplifies to Vmax * S / Km. Conversely, if S is much greater than Km, Km drops out of the equation, yielding Vmax * S / S, or Vmax. (Of course, if S = Km, the equation becomes Vmax * S / (S + S), or Vmax / 2, which is the definition of Km.)
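
The equation and its three special cases are easy to try out numerically. In this sketch the Vmax of 100 and Km of 5 are arbitrary illustrative values, not measurements of any real enzyme, and the function name is made up.

```python
def initial_velocity(vmax, s, km):
    """Michaelis-Menten equation: V0 = Vmax * S / (S + Km)."""
    return vmax * s / (s + km)


if __name__ == "__main__":
    vmax, km = 100.0, 5.0  # arbitrary example values
    for s in (0.05, 5.0, 5000.0):
        # Low S behaves like Vmax * S / Km, S = Km gives Vmax / 2,
        # and high S approaches Vmax.
        print(s, initial_velocity(vmax, s, km))
```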

Flynn's Taxonomy

Flynn's Taxonomy is a classification of computer architectures, based on whether the processor handles multiple instruction streams at once, and whether it operates on multiple data streams simultaneously. Since there are two parameters with two options each, the taxonomy becomes:

SISD - Single Instruction, Single Data
SIMD - Single Instruction, Multiple Data
MIMD - Multiple Instruction, Multiple Data
MISD - Multiple Instruction, Single Data

However, none of these is particularly interesting in 2013. The current pattern to use might be considered:

SPMD - Single Program, Multiple Data

In other words, to do parallel processing, one launches multiple copies of the same program as separate processes and splits up the tasks between them with a messaging framework. MPI is the usual standard for this, and Open MPI is a popular implementation of it.
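
The shape of an SPMD program can be sketched in a few lines of Python. To keep the example self-contained, threads stand in for the separate processes, and a plain sum stands in for the message that would combine the partial results; in a real MPI program the rank and size would come from the MPI library instead. The function names and the sum-of-squares task are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def worker(rank, size, data):
    """The single program: every worker runs this, differing only in rank."""
    my_slice = data[rank::size]          # each rank takes every size-th item
    return sum(x * x for x in my_slice)  # partial result for this rank


def run_spmd(data, size=4):
    """Launch `size` workers on the same program and combine their results."""
    with ThreadPoolExecutor(max_workers=size) as pool:
        partials = list(pool.map(lambda rank: worker(rank, size, data), range(size)))
    return sum(partials)


if __name__ == "__main__":
    print(run_spmd(list(range(10)), size=4))
```

The key point is that there is only one program; which part of the data a worker handles is decided entirely by its rank.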