$\newcommand{\mlequation}[5]{ #1{\underbrace{\theta}_{\text{parameters}}} = \arg \min_{\theta} #2{\underbrace{L}_{\text{loss}}}( #3{\overbrace{Y}^{\text{labels}}} , #4{\underbrace{f(X; \theta)}_{\text{estimated value}}} ; #5{\overbrace{D}^{\text{data}}} ) } \newcommand{\colorfocus}{\color{red}} \newcommand{\linregequation}[5]{ #1{\{a, b\}} = \arg \min_{a, b} #2{\sum_{(x,y) \in #5{D}}} #2{\|} #3{y} - #4{(ax + b)} #2{\|_2} }$

### Probabilistic graphical models

Vikas Dhiman

##### Why should we still study probabilistic graphical models?
• Combining neural networks with graphical models is an active research frontier.
• Ultimately, deep neural networks and graphical models are both tools for solving problems, and you need as many tools as possible.

### Machine learning recap

$\mlequation{\colorfocus}{}{}{}{}$

$\linregequation{\colorfocus}{}{}{}{}$

$\mlequation{}{\colorfocus}{}{}{}$

$\linregequation{}{\colorfocus}{}{}{}$

$\mlequation{}{}{\colorfocus}{}{}$

$\linregequation{}{}{\colorfocus}{}{}$

$\mlequation{}{}{}{}{\colorfocus}$

$\linregequation{}{}{}{}{\colorfocus}$

$\mlequation{}{}{}{\colorfocus}{}$

$\linregequation{}{}{}{\colorfocus}{}$

$\mlequation{}{}{}{\colorfocus}{}$

$\color{red}{\underbrace{f}_{\text{model}}} (\overbrace{X}^{\text{features}}; \theta)$

What are our options for models?
• Equations for lines:
$$f(x; a, b) = ax + b$$
• Neural networks:
$$f(X; W_1, W_2) = W_2\sigma(W_1X)$$
• Probabilistic graphical models:
$$f(X; \theta) = \arg \max_{Y} P(Y, X; \theta)$$

#### Probabilistic graphical models

$f(X; \theta) = \arg \max_{Y} P(Y, X; \theta)$

$f(x; a, b) = \arg \max_{y} P(y, x; a, b)$
$P(y, x; a, b) = \frac{1}{Z}\exp(-\| y - (ax + b) \|_2)$
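As a quick sanity check on this view of the line model, the sketch below evaluates the unnormalized density on a grid of candidate $$y$$ values and confirms that its maximizer coincides with $$ax + b$$. The values of `a`, `b`, `x` and the grid are illustrative assumptions, not taken from the slides.

```python
import numpy as np

# Hypothetical line parameters and input, chosen only for illustration.
a, b, x = 2.0, 1.0, 3.0

# Candidate outputs y on a grid that contains a * x + b = 7.0.
y_grid = np.linspace(0.0, 10.0, 1001)

# Unnormalized P(y, x; a, b) = exp(-|y - (a x + b)|).
# The constant Z is omitted because it does not change the arg max.
p_unnorm = np.exp(-np.abs(y_grid - (a * x + b)))

# The most probable y should match the line prediction a * x + b.
y_star = y_grid[np.argmax(p_unnorm)]
print(y_star, a * x + b)
```

For a scalar $$y$$ the norm reduces to an absolute value, which is why `np.abs` is used here.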

The binary segmentation problem: A more practical example

For each pixel, define a binary random variable $$y_i$$ that denotes whether the pixel belongs to the foreground or the background.

$Y^* = \arg \max_{y_i \in \{0, 1\} \forall i} P(Y, X)$

Look at the dimensionality of the search space

If the image is 100 × 100 pixels, what is the size of the search space?

$$2^{100 \times 100}$$
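To get a feel for how large this number is, the short sketch below counts its decimal digits using Python's arbitrary-precision integers. The variable names are illustrative.

```python
# Image size from the example above.
height, width = 100, 100
num_pixels = height * width

# Each pixel has two possible labels, so there are 2 ** num_pixels labelings.
num_labelings = 2 ** num_pixels

# Count the decimal digits of the search-space size.
num_digits = len(str(num_labelings))
print(num_digits)  # 3011
```

A number with 3011 digits rules out enumerating every labeling, which motivates the structure exploited in the rest of the lecture.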

1. What should we do about the high-dimensional search space?
2. What is so "graphical" about "Probabilistic Graphical Models"?
##### Recall some probability
• Independence: Events $$y$$ and $$x$$ are independent iff $P(y, x) = P(y)P(x)$
• Conditional Probability of $$y$$ given $$x$$ is \begin{align} P(y|x) = \frac{P(y, x)}{P(x)} &\qquad \text{ or }P(y, x) = P(y|x)P(x) \end{align}
• Conditional independence:
$$y$$ and $$x$$ are conditionally independent given $$z$$ iff $P(y, x|z) = P(y|z)P(x|z)$

Note that independence is intimately linked to factorization.
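The definitions above can be checked numerically on a small example. The sketch below builds a joint distribution over three binary variables that factorizes as $$P(y|z)P(x|z)P(z)$$, then verifies that $$y$$ and $$x$$ are conditionally independent given $$z$$ but not marginally independent. All probability tables are made-up values for illustration.

```python
import numpy as np

# Hypothetical distributions over binary variables, for illustration only.
p_z = np.array([0.3, 0.7])                  # P(z)
p_y_given_z = np.array([[0.9, 0.1],         # P(y | z=0)
                        [0.2, 0.8]])        # P(y | z=1)
p_x_given_z = np.array([[0.6, 0.4],         # P(x | z=0)
                        [0.5, 0.5]])        # P(x | z=1)

# Joint P(y, x, z) = P(y|z) P(x|z) P(z), indexed as joint[y, x, z].
joint = np.einsum("zy,zx,z->yxz", p_y_given_z, p_x_given_z, p_z)

# Conditional independence: P(y, x | z) == P(y | z) P(x | z) for every z.
p_yx_given_z = joint / joint.sum(axis=(0, 1), keepdims=True)
p_y_z = p_yx_given_z.sum(axis=1, keepdims=True)
p_x_z = p_yx_given_z.sum(axis=0, keepdims=True)
cond_indep = np.allclose(p_yx_given_z, p_y_z * p_x_z)

# Marginal independence: P(y, x) == P(y) P(x) after summing out z.
p_yx = joint.sum(axis=2)
marg_indep = np.allclose(p_yx, np.outer(p_yx.sum(axis=1), p_yx.sum(axis=0)))

print(cond_indep, marg_indep)
```

With these tables the conditional check passes while the marginal check fails, since both variables depend on the shared $$z$$.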

Probabilistic graphical models (PGMs) represent conditional independence relations between random variables.

Different types of PGMs

• Bayes net
• Factor graphs
• Conditional Random Fields
• Markov Random Fields
##### Markov Random fields (MRF)

Define a graph $$G = (V, E)$$ where $$V$$ is a set of random variables and $$E$$ is a set of edges such that each random variable (or set of random variables) is conditionally independent of all other random variables given its neighbors.

Examples
• Draw an MRF for two random variables (not independent).
• Draw an MRF for three random variables (not independent).
• Draw an MRF for four random variables (not independent).
• Draw an MRF for four random variables that are all independent.
• Draw an MRF for three random variables $$x,y,z$$ such that $$x$$ and $$y$$ are conditionally independent given $$z$$.
• Draw an MRF for four random variables $$w, x, y, z$$ such that $$x$$ and $$z$$ are conditionally independent given $$y$$ and $$w$$, and $$y$$ and $$w$$ are conditionally independent given $$x$$ and $$z$$.

Let's make some independence assumptions

Assume that the random variable $$y_1$$ is independent of the rest of the graph given its neighbors $$y_2, y_3, y_4, y_5$$.

$P(Y, X) = \prod_{i, j: i \ne j}P_{ij}(y_i, y_j, X) \prod_{i}P_i(y_i, X)$

The factorization

$P(Y, X) = \prod_{i, j: i \ne j}P_{ij}(y_i, y_j, X) \prod_{i}P_i(y_i, X)$

is equivalent to the MRF. This relationship was proved by Hammersley and Clifford in 1971.

Hammersley, J. M.; Clifford, P. (1971), Markov fields on finite graphs and lattices

Recap: A recipe for solving problems using MRFs

• Model the function as a probability: $$f(X; \theta) = \arg \max_{Y} P(Y, X; \theta)$$
• Make independence assumptions.
• Write the corresponding factorization: $P(Y, X) = \prod_{i, j: i \ne j}P_{ij}(y_i, y_j, X) \prod_{i}P_i(y_i, X)$
• Solve the probability maximization efficiently using PGM algorithms:
1. Graph cuts
2. Gibbs sampling
3. Belief propagation
Product of probabilities is the same as summation of energies

$f(X; \theta) = \arg \max_{Y} P(Y, X; \theta)$

$P(Y, X) = \prod_{i, j: i \ne j}P_{ij}(y_i, y_j, X) \prod_{i}P_i(y_i, X)$

Define $$E(Y, X) = -\log P(Y, X)$$, with $$E_{ij} = -\log P_{ij}$$ and $$E_i = -\log P_i$$, so that

$E(Y, X) = \sum_{i, j: i \ne j}E_{ij}(y_i, y_j, X) + \sum_{i}E_i(y_i, X)$

$f(X; \theta) = \arg \min_{Y} E(Y, X; \theta)$
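The equivalence between maximizing probability and minimizing energy can be illustrated on a problem small enough to enumerate. The sketch below is a brute-force check on a hypothetical 2 × 2 binary segmentation. The intensities, the absolute-difference unary term, the Potts-style pairwise term, and the `smoothness` weight are all assumptions made for illustration. Exhaustive enumeration is not one of the PGM algorithms listed in the recipe and is only feasible for a handful of pixels.

```python
import itertools
import numpy as np

# Hypothetical 2x2 "image", flattened row by row, with intensities in [0, 1].
X = np.array([0.9, 0.8, 0.1, 0.7])
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]  # 4-connected neighbors on the grid
smoothness = 0.25                         # assumed weight of the pairwise term

def unary_energy(y_i, x_i):
    # Assumed unary term E_i: low energy when the label matches the intensity.
    return abs(y_i - x_i)

def pairwise_energy(y_i, y_j):
    # Assumed pairwise term E_ij: penalize neighbors with different labels.
    return smoothness * (y_i != y_j)

def energy(Y):
    # E(Y, X) = sum_i E_i(y_i, X) + sum_{(i, j)} E_ij(y_i, y_j, X)
    unary = sum(unary_energy(Y[i], X[i]) for i in range(len(X)))
    pairwise = sum(pairwise_energy(Y[i], Y[j]) for i, j in edges)
    return unary + pairwise

# Enumerate all 2^4 labelings and compute energies and probabilities.
labelings = list(itertools.product([0, 1], repeat=len(X)))
energies = np.array([energy(Y) for Y in labelings])
probs = np.exp(-energies)
probs /= probs.sum()  # normalize by Z

y_min_energy = labelings[int(np.argmin(energies))]
y_max_prob = labelings[int(np.argmax(probs))]
print(y_min_energy, y_max_prob)
```

Both selections return the same labeling because $$\exp(-E)$$ is a decreasing function of $$E$$ and the normalizer $$Z$$ is a positive constant.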

The following slides are borrowed from L. Ladicky