\[ \newcommand{\mlequation}[5]{ #1{\underbrace{\theta}_{\text{parameters}}} = \arg \min_{\theta} #2{\underbrace{L}_{\text{loss}}}( #3{\overbrace{Y}^{\text{labels}}} , #4{\underbrace{f(X; \theta)}_{\text{estimated value}}} ; #5{\overbrace{D}^{\text{data}}} ) } \newcommand{\colorfocus}{\color{red}} \newcommand{\linregequation}[5]{ #1{\{a, b\}} = \arg \min_{a, b} #2{\sum_{(x,y) \in #5{D}}} #2{\|} #3{y} - #4{(ax + b)} #2{\|_2} } \]

Probabilistic graphical models

Vikas Dhiman

Why should we still study probabilistic graphical models?
  • Combining neural networks with graphical models is the latest frontier
  • Ultimately, deep neural networks and graphical models are both tools for solving problems. You need as many tools as possible.

Machine learning recap

\[ \mlequation{\colorfocus}{}{}{}{} \]

\[ \linregequation{\colorfocus}{}{}{}{} \]

\[ \mlequation{}{\colorfocus}{}{}{} \]

\[ \linregequation{}{\colorfocus}{}{}{} \]

\[ \mlequation{}{}{\colorfocus}{}{} \]

\[ \linregequation{}{}{\colorfocus}{}{} \]

\[ \mlequation{}{}{}{}{\colorfocus} \]

\[ \linregequation{}{}{}{}{\colorfocus} \]

\[ \mlequation{}{}{}{\colorfocus}{} \]

\[ \linregequation{}{}{}{\colorfocus}{} \]

\[ \color{red}{\underbrace{f}_{\text{model}}} (\overbrace{X}^{\text{features}}; \theta) \]
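To make the recap concrete, here is a minimal sketch in Python/NumPy (not part of the slides) that solves the line-fitting instance of the recap equations above; the toy dataset and the closed-form least-squares solver are illustrative assumptions, and the slide's \(\|\cdot\|_2\) loss is read as the usual sum of squared residuals.

    import numpy as np

    # Toy dataset D = {(x, y)}: points near the line y = 2x + 1 (assumed for illustration).
    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=50)
    y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=50)

    # {a, b} = argmin_{a,b} sum of squared residuals (ordinary least squares).
    A = np.stack([x, np.ones_like(x)], axis=1)          # design matrix [x, 1]
    (a, b), *rest = np.linalg.lstsq(A, y, rcond=None)   # closed-form least-squares fit
    print(f"estimated a={a:.2f}, b={b:.2f}")            # close to 2.0 and 1.0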

What are our options for models?
  • Equation for lines
    \( f(x; a, b) = ax + b \)
  • Neural networks:
    \( f(X; W_1, W_2) = W_2\sigma(W_1X) \)
  • Probabilistic graphical models:
    \( f(X; \theta) = \arg \max_{Y} P(Y, X; \theta) \)

Probabilistic graphical models

\[ f(X; \theta) = \arg \max_{Y} P(Y, X; \theta) \]

\[ f(x; a, b) = \arg \max_{y} P(y, x; a, b) \]
\[ P(y, x; a, b) = \frac{1}{Z}\exp(-\| y - (ax + b) \|_2) \]
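A quick numeric sanity check, as a minimal Python/NumPy sketch (the values a = 2, b = 1, x = 0.5 are assumed for illustration): for a fixed x, the y that maximizes \(P(y, x; a, b)\) is \(ax + b\), so the probabilistic model reproduces the line model; the normalizer Z cancels inside the arg max.

    import numpy as np

    a, b, x = 2.0, 1.0, 0.5                              # assumed parameters and input
    ys = np.linspace(-5, 5, 10001)                       # candidate labels y
    unnormalized_p = np.exp(-np.abs(ys - (a * x + b)))   # Z cancels in the arg max
    y_star = ys[np.argmax(unnormalized_p)]
    print(y_star, a * x + b)                             # both are (approximately) 2.0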

The binary segmentation problem: A more practical example
For each pixel, define a binary random variable \(y_i\) that denotes whether the pixel belongs to the foreground or the background. \[ Y^* = \arg \max_{y_i \in \{0, 1\} \forall i} P(Y, X) \]

Look at the dimensionality of the search space

If the size of the image is \(100 \times 100\), then what is the size of the search space?

\( 2^{100 \times 100} \)
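To see why this rules out brute-force enumeration, a short Python check (illustrative only) counts the decimal digits of the search-space size:

    # 2**(100*100) is an exact Python integer with 3011 decimal digits,
    # so enumerating every labeling of a 100 x 100 binary image is hopeless.
    n_labelings = 2 ** (100 * 100)
    print(len(str(n_labelings)))   # 3011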

Two unanswered questions
  1. What should we do about the high dimensional search space?
  2. What is so "graphical" about "Probabilistic Graphical Models"?
Recall some probability
  • Independence: Events \(y\) and \(x\) are independent iff \[ P(y, x) = P(y)P(x) \]
  • Conditional probability of \(y\) given \(x\): \( P(y|x) = \frac{P(y, x)}{P(x)} \quad\text{ or }\quad P(y, x) = P(y|x)P(x) \)
  • Conditional independence:
    \(y\) and \(x\) are conditionally independent given \(z\) iff \[ P(y, x|z) = P(y|z)P(x|z) \]

Note that independence is intimately linked to factorization.
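This link can be checked numerically. The Python/NumPy sketch below builds a small joint distribution \(P(x, y, z) = P(x|z)P(y|z)P(z)\) from made-up tables (an illustrative assumption, not data from the lecture) and verifies that \(P(y, x|z) = P(y|z)P(x|z)\) holds while \(P(y, x) \ne P(y)P(x)\).

    import numpy as np

    # Made-up conditional tables for binary x, y, z.
    p_z = np.array([0.3, 0.7])                        # P(z)
    p_x_given_z = np.array([[0.9, 0.1], [0.2, 0.8]])  # rows: z, cols: x
    p_y_given_z = np.array([[0.6, 0.4], [0.1, 0.9]])  # rows: z, cols: y

    # Joint built to satisfy the conditional-independence factorization:
    # p[x, y, z] = P(x|z) P(y|z) P(z)
    p = np.einsum('zx,zy,z->xyz', p_x_given_z, p_y_given_z, p_z)

    # P(x, y | z) equals P(x|z) P(y|z) for every z.
    p_xy_given_z = p / p.sum(axis=(0, 1), keepdims=True)
    factorized = np.einsum('zx,zy->xyz', p_x_given_z, p_y_given_z)
    print(np.allclose(p_xy_given_z, factorized))      # True

    # ...but x and y are not marginally independent.
    p_xy = p.sum(axis=2)
    print(np.allclose(p_xy, np.outer(p_xy.sum(axis=1), p_xy.sum(axis=0))))  # False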

Probabilistic graphical models (PGMs) represent conditional independence relations between random variables.

Different types of PGMs

  • Bayes net
  • Factor graphs
  • Conditional Random Fields
  • Markov Random Fields
Markov Random Fields (MRFs)

Define a graph \(G = (V, E)\), where \(V\) is a set of random variables and \(E\) is a set of edges, such that each random variable (or set of random variables) is conditionally independent of all the others given its neighbors.
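In code, the structure of an MRF is nothing more than an undirected graph. A minimal Python sketch (the 2 x 2 pixel grid is an assumed toy example, not taken from the slides):

    # MRF structure: random variable -> set of neighbors (its Markov blanket).
    # Here: the four pixel labels of a 2 x 2 image, 4-connected.
    mrf = {
        'y00': {'y01', 'y10'},
        'y01': {'y00', 'y11'},
        'y10': {'y00', 'y11'},
        'y11': {'y01', 'y10'},
    }

    def markov_blanket(graph, v):
        """Given its neighbors, v is conditionally independent of every other variable."""
        return graph[v]

    print(markov_blanket(mrf, 'y00'))   # {'y01', 'y10'}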

Examples
  • Draw an MRF for two random variables (not independent).
    (figure: MRF over \(x\) and \(y\))
  • Draw an MRF for three random variables (not independent).
    (figure: MRF over \(x\), \(y\), \(z\))
  • Draw an MRF for four random variables (not independent).
    (figure: MRF over \(x\), \(y\), \(z\), \(w\))
  • Draw an MRF for four random variables that are all independent.
    (figure: MRF over \(x\), \(y\), \(z\), \(w\))
  • Draw an MRF for three random variables \(x,y,z\) such that \(x\) and \(y\) are conditionally independent given \(z\).
    (figure: MRF over \(x\), \(y\), \(z\))
  • Draw an MRF for four random variables \(w, x,y,z\) such that \(x\) and \(z\) are conditionally independent given \(y\) and \(w\)
    and \(y\) and \(w\) are conditionally independent given \(x\) and \(z\)
    (figure: MRF over \(w\), \(x\), \(y\), \(z\))
Let's make some independence assumptions

Assume that the RV \(y_1\) is conditionally independent of the rest of the graph given its neighbors \(y_2, y_3, y_4, y_5\).

\[ P(Y, X) = \prod_{i, j: i \ne j}P_{ij}(y_i, y_j, X) \prod_{i}P_i(y_i, X) \]
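For the segmentation MRF, this pairwise product runs over neighboring pixels only. The Python sketch below (assumed 4-connected grid; the factor definitions are made up for illustration) enumerates the pairwise index set and evaluates \(P(Y, X)\), up to normalization, as a product of factors for one candidate labeling.

    import numpy as np

    H, W = 3, 3                                   # a tiny 3 x 3 "image" (assumed)
    rng = np.random.default_rng(0)
    X = rng.uniform(size=(H, W))                  # fake pixel intensities
    Y = (X > 0.5).astype(int)                     # one candidate labeling to score

    def neighbor_pairs(h, w):
        """Index pairs (i, j) of 4-connected neighboring pixels."""
        for r in range(h):
            for c in range(w):
                if c + 1 < w:
                    yield (r, c), (r, c + 1)
                if r + 1 < h:
                    yield (r, c), (r + 1, c)

    def unary(y_i, x_i):
        return np.exp(-abs(y_i - x_i))            # made-up P_i(y_i, X): label matches intensity

    def pairwise(y_i, y_j):
        return 1.0 if y_i == y_j else np.exp(-1)  # made-up P_ij: neighbors prefer equal labels

    p = 1.0
    for i in np.ndindex(H, W):
        p *= unary(Y[i], X[i])
    for i, j in neighbor_pairs(H, W):
        p *= pairwise(Y[i], Y[j])
    print(p)                                      # unnormalized P(Y, X) for this labeling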

The factorization

\[ P(Y, X) = \prod_{i, j: i \ne j}P_{ij}(y_i, y_j, X) \prod_{i}P_i(y_i, X) \]

is equivalent to the MRF described above.

This relationship was proved by Hammersley and Clifford in 1971.

J. M. Hammersley and P. Clifford, "Markov fields on finite graphs and lattices," unpublished manuscript, 1971.

Recap: A recipe for solving problems using MRFs

  • Model function as probability \( f(X; \theta) = \arg \max_{Y} P(Y, X; \theta) \)
  • Make independence assumptions
  • Write corresponding factorization \[ P(Y, X) = \prod_{i, j: i \ne j}P_{ij}(y_i, y_j, X) \prod_{i}P_i(y_i, X) \]
  • Solve the probability maximization efficiently using PGM algorithms (see the Gibbs-sampling sketch after this list):
    1. Graph cuts
    2. Gibbs sampling
    3. Belief propagation
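A minimal Gibbs-sampling sketch for the binary segmentation MRF, in Python/NumPy. Everything here, the toy image, the quadratic data term, the Ising-style smoothness weight, and the number of sweeps, is an illustrative assumption, not the lecture's reference implementation.

    import numpy as np

    rng = np.random.default_rng(0)
    H, W = 8, 8
    X = np.zeros((H, W))
    X[2:6, 2:6] = 1.0                          # toy image: bright square on dark background
    X += 0.3 * rng.normal(size=(H, W))         # plus noise

    def unary(label, x):                       # data term: label should match intensity
        return (label - x) ** 2
    beta = 1.0                                 # smoothness: neighbors prefer equal labels

    def neighbors(r, c):
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < H and 0 <= cc < W:
                yield rr, cc

    Y = rng.integers(0, 2, size=(H, W))        # random initial labeling
    for sweep in range(50):                    # Gibbs sweeps over all pixels
        for r in range(H):
            for c in range(W):
                # Energy of each candidate label given the current neighboring labels.
                e = np.array([unary(k, X[r, c])
                              + beta * sum(k != Y[rr, cc] for rr, cc in neighbors(r, c))
                              for k in (0, 1)])
                p1 = np.exp(-e[1]) / (np.exp(-e[0]) + np.exp(-e[1]))
                Y[r, c] = rng.random() < p1    # sample y_rc from its conditional
    print(Y)                                   # samples tend to recover the bright square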
Product of probabilities is the same as a sum of energies \[ f(X; \theta) = \arg \max_{Y} P(Y, X; \theta) \] \[ P(Y, X) = \prod_{i, j: i \ne j}P_{ij}(y_i, y_j, X) \prod_{i}P_i(y_i, X) \]

Define \( E(Y, X) = -\log P(Y, X) \), or equivalently,

\[ E(Y, X) = \sum_{i, j: i \ne j}E_{ij}(y_i, y_j, X) + \sum_{i}E_i(y_i, X) \] \[ f(X; \theta) = \arg \min_{Y} E(Y, X; \theta) \]
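On a tiny problem we can still enumerate every labeling, which makes the equivalence easy to check. A Python sketch with made-up energies on a 2 x 2 binary grid (all numbers are assumptions): brute-forcing \(\arg \min_Y E(Y, X)\) gives the same labeling as \(\arg \max_Y P(Y, X)\) with \(P \propto \exp(-E)\).

    import itertools
    import numpy as np

    # Made-up energies for a 2 x 2 binary labeling problem.
    rng = np.random.default_rng(0)
    unary = rng.uniform(size=(4, 2))           # E_i(y_i): 4 pixels x 2 labels
    edges = [(0, 1), (2, 3), (0, 2), (1, 3)]   # 4-connected 2 x 2 grid

    def pairwise(yi, yj):                      # E_ij: smoothness
        return 0.0 if yi == yj else 0.5

    def energy(Y):
        return (sum(unary[i, Y[i]] for i in range(4))
                + sum(pairwise(Y[i], Y[j]) for i, j in edges))

    labelings = list(itertools.product((0, 1), repeat=4))
    best_by_energy = min(labelings, key=energy)                      # argmin_Y E(Y, X)
    best_by_prob = max(labelings, key=lambda Y: np.exp(-energy(Y)))  # argmax_Y P(Y, X)
    print(best_by_energy == best_by_prob)                            # True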

The following slides are borrowed from L. Ladicky.
For further reading
  • For Gibbs sampling: D. J. MacKay, “Introduction to Monte Carlo methods,” in Learning in Graphical Models. Springer, 1998, pp. 175–204.
  • For belief propagation: F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, “Factor graphs and the sum-product algorithm,” IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 498–519, 2001.