# Difference between revisions of "graphical models for structured classification, with an application to interpreting images of protein subcellular location patterns"

Line 65: | Line 65: | ||

<center><math>\sum_{x_1}\sum_{x_2}...\sum_{k-1}\phi_j^*(x_1,...,x_k)\text{ }(7)</math></center> | <center><math>\sum_{x_1}\sum_{x_2}...\sum_{k-1}\phi_j^*(x_1,...,x_k)\text{ }(7)</math></center> | ||

− | <center><math>\phi_j^*(x_1,...,x_k)=m_1(x_1)m_2(x_2)...m_k(x_{k-1})\phi_j(x_1,...,x_k)\text{ }(7)</math></ | + | <center><math>\phi_j^*(x_1,...,x_k)=m_1(x_1)m_2(x_2)...m_k(x_{k-1})\phi_j(x_1,...,x_k)\text{ }(8)</math></center> |

+ | |||

+ | where <math>m_i(x_i)</math> is the message to factor <math>j</math> from variable <math>x_i</math>. If loops are removed from the factor graph, equation (8) would include only a subset of the above messages and the messages of the collapsed variables would be gathered in one message. | ||

+ | |||

+ | Equations (7) & (8) can be computed quickly if <math>\phi_j</math> is a sum of terms <math>\sum\psi_{ji}</math> where each term <math>\psi_{jl}</math> depends only on a small subset of its arguments <math>x_1...x_k</math>. There is one more condition that, when found, could cause the above equations to be computed rapidly; that is when \phi_j is a constant except at a small number of input vectors <math>(x_1,...,x_k)</math>. In the first case, let's say that <math>\phi_j</math> is a sum of low-arity terms <math>\psi_jl</math> and in the second case let's say that <math>\phi_j</math> is sparse. Equation (7) can then be written as a sum of products of low-arity functions: writing <math></math> | ||

<center><math></math></center> | <center><math></math></center> |

## Revision as of 02:39, 11 November 2011

## Contents

## Background

In structured classification problems, there is a direct conflict between expressive models and efficient inference: while graphical models such as factor graphs can represent arbitrary dependences among instance labels, the cost of inference via belief propagation in these models grows rapidly as the graph structure becomes more complicated. One important source of complexity in belief propagation is the need to marginalize large factors to compute messages. This operation takes time exponential in the number of variables in the factor, and can limit the expressiveness of the models used. A new class of potential functions is proposed, which is called decomposable k-way potentials. It provides efficient algorithms for computing messages from these potentials during belief propagation. These new potentials provide a good balance between expressive power and efficient inference in practical structured classification problems. Three instances of decomposable potentials are discussed: the associative Markov network potential, the nested junction tree, and the voting potential. The new representation and algorithm lead to substantial improvements in both inference speed and classification accuracy.

## Belief Propagation

Just to stick with the notion of the authors, let's assume that [math]\phi_i^{loc}(x_i)[/math] is the one-argument factor that represents the local evidence on [math]x_i[/math]. Moreover, Figure 1 shows the notion they use in graphs. The small squares denote potential functions, and, as usual, the shaded and unshaded circles represent observed and unobserved variables respectively.

Using such notion, the message sent from a variable [math]x_i[/math] to a potential function [math]\phi_k[/math] as:

Similarly, a message from a potential function [math]\phi_j[/math] to [math]x_k[/math] can be computed as:

## General graphs

The above is easily applied when the graph is tree-shaped. For graphs with loops, there are generally two alternatives, the first is to collapse groups of variable nodes together into combined nodes, which could turn the graph into a tree and makes it feasible to run Belief Propagation (BP). The second is to run an approximate inference algorithm that doesn't require a tree-shaped graph. One further solution is to combine both techniques. An example is to derive a tree-shaped graph for the graph shown in Figure 1. Figure 2 combines variables [math]x_1[/math] and [math]x_2[/math] to form the graph in Figure 2.

## Loopy Belief Propagation (LBP)

If a graph is collapsed all the way to a tree, inference can be done with the exact version of BP as above. If there are still some loops left, it's LBP that should be used. In LBP (as in BP), an arbitrary node is chosen to be the root and formulas 1 & 2 are used. However, each message may have to be updated repeatedly before the marginals converge. Inference with LBP is approximate because it can double-count evidence; messages to a node [math]i[/math] from two nodes [math]j[/math] and [math]k[/math] can both contain information from a common neighbor [math]l[/math] of [math]j[/math] and [math]k[/math]. If LBP oscillates between some steady states and does not converge, the process could be stopped after some number of iterations. Oscillations can be avoided by using momentum, which replaces the messages that were sent at time [math]t[/math] with a weighted average of the messages at times [math]t[/math] and [math]t-1[/math]. For either exact or loopy BP, run time for each path over the factor graph is exponential in the number of distinct original variables included in the largest factor. Therefore, inference can become prohibitively expensive if the factors are too large.

## Constructing factor graphs for structured classification

To construct factor graphs that encode "likely" label vectors, two steps are performed. First, domain specific heuristics are used to identify pairs of examples whose labels are likely to be the same in order to use such pairs to build a similarity graph with an edge between each pair of examples. The second step is to use this similarity graph to decide which potentials to add to the factor graph. Given the similarity graph of the protein subcellular location pattern classification problem, factor graphs built using different types of potentials are compared as we will see in the following sections.

### The Potts potential

The Potts potential is a two-argument factor which encourages two nodes [math]x_i[/math] and [math]x_j[/math] to have the same label:

whereas [math]\omega\gt 1[/math] is an arbitrary parameter expressing how strongly [math]x_i[/math] and [math]x_j[/math] are believed to have the same label. If the Potts potential is used for each edge in the similarity graph, the overall probability of a vector of labels x is as follows:

### The Voting potential

Assuming that [math]N(j)[/math] is the set of similarity graph neignbors of cell [math]j[/math], let's write the group of cells [math]V(j)=\{j\}\cup{N(j)}[/math]. The voting potential is then defined as follows:

whereas [math]n[/math] is the number of classes, [math]\lambda[/math] is a smoothing parameter and [math]I[/math] is an indicator function:

### The AMN (Associative Markov Network) potential

AMN potential is defined to be:

for parameters [math]\omega_y\gt 1[/math] where I(predicate) is defined to be [math]1[/math] if the predicate is true and [math]0[/math] if it is false. Therefore, the AMN potential is constant unless all the variables [math]x_1...x_k[/math] are assigned to the same class [math]y[/math].

## The proposed Decomposable potentials

while k-way factors can lead to more accurate inference, they can also slow down belief propagation. For a general k-way factor, it takes time exponential in k. For specific k-way potentials though, it is possible to take advantage of special structure to design a fast inference algorithm. In particular, for many potential functions, it is possible to write down an algorithm which efficiently performs sums of the form required for message computation:

where [math]m_i(x_i)[/math] is the message to factor [math]j[/math] from variable [math]x_i[/math]. If loops are removed from the factor graph, equation (8) would include only a subset of the above messages and the messages of the collapsed variables would be gathered in one message.

Equations (7) & (8) can be computed quickly if [math]\phi_j[/math] is a sum of terms [math]\sum\psi_{ji}[/math] where each term [math]\psi_{jl}[/math] depends only on a small subset of its arguments [math]x_1...x_k[/math]. There is one more condition that, when found, could cause the above equations to be computed rapidly; that is when \phi_j is a constant except at a small number of input vectors [math](x_1,...,x_k)[/math]. In the first case, let's say that [math]\phi_j[/math] is a sum of low-arity terms [math]\psi_jl[/math] and in the second case let's say that [math]\phi_j[/math] is sparse. Equation (7) can then be written as a sum of products of low-arity functions: writing [math][/math]