Generative Models
Introduction to Generative Models [1]
When discussing generative models, it’s essential to understand how machine learning approaches tasks. Consider a scenario where we aim to distinguish between elephants and dogs. There are primarily two modeling approaches: discriminative and generative:
- Discriminative Modeling: This approach involves building a model that directly predicts classification labels or identifies the decision boundary between elephants and dogs.
- Generative Modeling: This approach entails constructing separate models for elephants and dogs, capturing their respective characteristics. A new animal is then compared against each model to determine which it resembles more closely.
In discriminative modeling, the focus is on learning the conditional probability of labels given the input data, denoted as \(p(y|{x})\). Techniques like logistic regression exemplify this by modeling the probability of a label based on input features. Alternatively, methods such as the perceptron algorithm aim to find a decision boundary that maps new observations to specific labels \(\{0,1\}\), such as \(0\) for dogs and \(1\) for elephants.
Conversely, generative modeling focuses on understanding how the data is generated by learning the joint probability distribution \(p(x,y)\) or the likelihood \(p(x|{y})\) along with the prior probability \(p(y)\). This approach models the distribution of the input data for each class, enabling the generation of new data points and facilitating classification by applying Bayes’ theorem to compute the posterior probability:
\[p(y|x) = \frac{p(x|y)p(y)}{p(x)}\]The denominator \(p(x)\) is the marginal probability that sums the joint probability \(p(x,y)\) over all possible labels \(y\):
\[\begin{aligned} p(x) &= \sum_{y} p(x,y) \\ &= \sum_{y} p(x|y)p(y) \\ &= p(x|y=0)p(y=0) + p(x|y=1)p(y=1) \end{aligned}\]Note that \(p(x)\) acts as a normalization constant: it does not depend on the label \(y\), so it stays the same no matter which label we consider. Therefore, when we only care about which label maximizes \(p(y|x)\), we do not need to compute \(p(x)\):
\[\begin{aligned} \arg\max_y p(y|x) &= \arg\max_y \frac{p(x|y)p(y)}{p(x)}\\ &= \arg\max_y p(x|y)p(y) \\ \end{aligned}\]Let’s consider a new task of spam classification. \(x^{(i)}\) is the feature vector of the \(i\)-th email, and \(y^{(i)}\) is the label indicating whether the email is spam (\(1\)) or not spam (\(0\)). The following examples illustrate how discriminative and generative models approach the same problem differently.
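Before turning to those examples, here is a tiny numeric sketch of the \(\arg\max_y p(x|y)p(y)\) decision rule for the elephant-versus-dog setting; the prior and likelihood values are invented purely for illustration:

```python
# Hypothetical class priors p(y) and class-conditional likelihoods p(x|y)
# evaluated at a single new animal x (all numbers invented for illustration).
prior = {"dog": 0.7, "elephant": 0.3}          # p(y)
likelihood = {"dog": 0.02, "elephant": 0.05}   # p(x|y) at this particular x

# Unnormalized posterior p(x|y) p(y); dividing by p(x) would not change the argmax.
joint = {label: likelihood[label] * prior[label] for label in prior}
prediction = max(joint, key=joint.get)

# The full posterior p(y|x) only requires the extra normalization by p(x).
p_x = sum(joint.values())                      # p(x) = sum_y p(x|y) p(y)
posterior = {label: joint[label] / p_x for label in joint}

print(prediction)   # 'elephant', since 0.015 > 0.014
print(posterior)    # same ranking as the unnormalized joint probabilities
```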
Example 1: Logistic Regression as a Discriminative Model
Since this is a binary classification problem, it makes sense to choose a hypothesis \(h_{\theta}(x)\) that takes values in \((0,1)\) to represent the probability \(p(y=1|x)\), with \(p(y=0|x) = 1 - h_{\theta}(x)\). We can set a threshold of \(0.5\) on \(h_{\theta}(x)\) to predict whether an email is spam. The logistic function fits this case well, as it maps \(z\in(-\infty, +\infty)\) into \((0,1)\):
\[h_{\theta}(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}\]where
\[g(z) = \frac{1}{1 + e^{-z}}\]is called the logistic function or sigmoid function. Below is a plot of the sigmoid function:
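A minimal matplotlib sketch that produces such a plot might look as follows; the plotting range \([-10, 10]\) is an arbitrary choice:

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 200)                     # arbitrary plotting range
plt.plot(z, sigmoid(z))
plt.axhline(0.5, linestyle="--", linewidth=0.8)   # g(0) = 0.5
plt.xlabel("z")
plt.ylabel("g(z)")
plt.title("Sigmoid function")
plt.show()
```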
From the plot, we can see that \(g(z)\) tends to \(0\) as \(z\to-\infty\) and tends to \(1\) as \(z\to+\infty\). When \(z=0\), \(g(z)=0.5\). The function \(g(z)\) or \(h_{\theta}(x)\) is always bounded between \(0\) and \(1\). To maintain the convention of letting \(x_0=1\), we can rewrite the expression of \(z\) in the hypothesis as \(z = \theta^T x = \theta_0 + \sum_{j=1}^n \theta_j x_j\), where \(\theta_0\) is the bias term and \(\theta_j\) is the weight of the \(j\)-th feature \(x_j\). Please note that other functions that smoothly and monotonically increase from \(0\) to \(1\) can also be considered for \(h_{\theta}(x)\).
Now we can use maximum likelihood estimation to find the best parameters \(\theta\) for the logistic regression model. To make the dependence on the parameter vector \(\theta\) explicit, we write the conditional probability \(p(y|x)\) as:
\[\begin{aligned} p(y=1|x;\theta) &= h_{\theta}(x) \\ p(y=0|x;\theta) &= 1 - h_{\theta}(x) \\ p(y|x;\theta) &= (h_{\theta}(x))^y (1-h_{\theta}(x))^{1-y} \end{aligned}\]Assuming that \(n\) training examples are drawn independently from the same distribution, we can write the likelihood function of parameter \(\theta\) as:
\[\begin{aligned} L(\theta) &= p(\vec{y}|X;\theta) \\ &= p(y^{(1)},y^{(2)},...,y^{(n)}|x^{(1)},x^{(2)},...,x^{(n)};\theta) \\ &= \prod_{i=1}^n p(y^{(i)}|x^{(i)};\theta) \\ &= \prod_{i=1}^n (h_{\theta}(x^{(i)}))^{y^{(i)}}(1-h_{\theta}(x^{(i)}))^{1-y^{(i)}} \\ \end{aligned}\]Since it is easier to work with the log-likelihood function to maximize, we can take the logarithm of the likelihood function:
\[\begin{aligned} \ell(\theta) = \log{L}(\theta) &= \log\prod_{i=1}^n (h_{\theta}(x^{(i)}))^{y^{(i)}}(1-h_{\theta}(x^{(i)}))^{1-y^{(i)}} \\ &= \sum_{i=1}^n y^{(i)}\log h_{\theta}(x^{(i)}) + \sum_{i=1}^n (1-y^{(i)})\log(1-h_{\theta}(x^{(i)})) \\ &= \left[\begin{array}{cccc} y^{(1)} & y^{(2)} & \cdots & y^{(n)} \\ \end{array}\right]\log \begin{bmatrix} g(\theta^T x^{(1)}) \\ g(\theta^T x^{(2)}) \\ \vdots \\ g(\theta^T x^{(n)}) \end{bmatrix} + \left[\begin{array}{cccc} 1-y^{(1)} & 1-y^{(2)} & \cdots & 1-y^{(n)} \\ \end{array}\right]\log \begin{bmatrix} (1-g(\theta^T x^{(1)})) \\ (1-g(\theta^T x^{(2)})) \\ \vdots \\ (1-g(\theta^T x^{(n)})) \end{bmatrix} \\ &= y^T\log{g(X\theta)} + (1-y)^T\log{(1-g(X\theta))} \end{aligned}\]As we choose the sigmoid function \(g(z)\) to represent the hypothesis \(h_{\theta}(x)\), let’s first compute the derivative of \(g(z)\) with respect to \(z\):
\[\begin{aligned} g^\prime(z) &= \frac{d}{dz}\frac{1}{1+e^{-z}} \\ &= \frac{-1}{(1+e^{-z})^2}\frac{d}{dz}(1+e^{-z}) = \frac{-1}{(1+e^{-z})^2}(-e^{-z}) \\ &= \frac{e^{-z}}{(1+e^{-z})^2} = \frac{1}{(1+e^{-z})}(1-\frac{1}{1+e^{-z}}) \\ &= g(z)(1-g(z)) \end{aligned}\]Then apply the chain rule to compute the derivative of \(\ell(\theta)\) with respect to \(\theta\). To simplify, we will only consider the derivative of \(\ell(\theta)\) with respect to \(\theta_j\) for each training example \(x^{(i)}\) and label \(y^{(i)}\). As we only use sample \(i\) to compute the derivative, we can drop the index \(i\) for convenience:
\[\begin{aligned} \frac{\partial}{\partial \theta_j} \ell(\theta) &= \left(y\frac{1}{g(\theta^T{x})} - (1-y)\frac{1}{1-g(\theta^T{x})}\right)\frac{\partial}{\partial \theta_j}g(\theta^T{x}) \\ &= \left(y\frac{1}{g(\theta^T{x})} - (1-y)\frac{1}{1-g(\theta^T{x})}\right)g(\theta^T{x})(1-g(\theta^T{x}))\frac{\partial}{\partial \theta_j}\theta^T{x} \\ &= \left(y\frac{1}{g(\theta^T{x})} - (1-y)\frac{1}{1-g(\theta^T{x})}\right)g(\theta^T{x})(1-g(\theta^T{x}))x_j \\ &= \left(y(1-g(\theta^T{x})) - (1-y)g(\theta^T{x})\right)x_j \\ &= (y-h_{\theta}(x))x_j \end{aligned}\]This leads us to the stochastic gradient ascent rule \[\theta_j := \theta_j + \alpha(y^{(i)}-h_{\theta}(x^{(i)}))x_j^{(i)}\]where \(\alpha\) is the learning rate and \((y^{(i)}-h_{\theta}(x^{(i)}))x_j^{(i)}\) is the gradient of the log-likelihood function with respect to the \(i\)-th training example.
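As a concrete illustration, here is a minimal NumPy sketch that trains logistic regression with this stochastic gradient ascent rule; the synthetic data, learning rate \(\alpha = 0.1\), and number of epochs are arbitrary choices made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary-classification data (invented for illustration):
# two Gaussian blobs with labels y in {0, 1}.
n, d = 200, 2
X = np.vstack([rng.normal(-1.0, 1.0, size=(n // 2, d)),
               rng.normal(+1.0, 1.0, size=(n // 2, d))])
y = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])

X = np.hstack([np.ones((n, 1)), X])   # prepend x_0 = 1 for the intercept term
theta = np.zeros(d + 1)
alpha = 0.1                           # learning rate (arbitrary)

def h(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x)."""
    return 1.0 / (1.0 + np.exp(-theta @ x))

for epoch in range(50):
    for i in rng.permutation(n):      # stochastic: update on one example at a time
        # theta_j := theta_j + alpha * (y^(i) - h_theta(x^(i))) * x_j^(i)
        theta += alpha * (y[i] - h(theta, X[i])) * X[i]

predictions = (np.array([h(theta, x) for x in X]) >= 0.5).astype(int)
print("training accuracy:", (predictions == y).mean())
```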
Example 2: Gaussian Discriminant Analysis as a Generative Model
Let’s say the feature vector \(x\) of an email consists of TF-IDF [2] values that measure the importance of words in the email. TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that reflects the importance of a word in a document relative to a collection of documents (corpus). It is calculated as:
\[\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)\]where:
\[\begin{aligned} \text{TF}(t, d) &= \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} \\ \text{IDF}(t) &= \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t}\right) \end{aligned}\]As TF-IDF values are continuous, we can model \(p(x|y)\) as a multivariate normal distribution. The model can then be represented as:
\[\begin{aligned} y &\sim \text{Bernoulli}(\phi) \\ x|y=0 &\sim \mathcal{N}(\mu_0, \Sigma) \\ x|y=1 &\sim \mathcal{N}(\mu_1, \Sigma) \end{aligned}\]Writing out the distributions, we have:
\[\begin{aligned} p(y) &= \phi^y(1-\phi)^{1-y} \\ p(x|y=0) &= \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu_0)^T\Sigma^{-1}(x-\mu_0)\right) \\ p(x|y=1) &= \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu_1)^T\Sigma^{-1}(x-\mu_1)\right) \\ \end{aligned}\]Therefore, the log-likelihood of the data is:
\[\begin{aligned} \ell(\phi, \mu_0, \mu_1, \Sigma) &= \log\prod_{i=1}^n p(x^{(i)}, y^{(i)};\phi, \mu_0, \mu_1, \Sigma) \\ &= \log\prod_{i=1}^n p(x^{(i)}|y^{(i)};\mu_0, \mu_1, \Sigma)p(y^{(i)};\phi) \end{aligned}\]By maximizing the log-likelihood, we can derive the parameters of the model. The MLE estimates of the parameters are:
\[\begin{aligned} \phi &= \frac{1}{n}\sum_{i=1}^n 1\{y^{(i)}=1\} \\ \mu_0 &= \frac{\sum_{i=1}^n 1\{y^{(i)}=0\}x^{(i)}}{\sum_{i=1}^n 1\{y^{(i)}=0\}} \\ \mu_1 &= \frac{\sum_{i=1}^n 1\{y^{(i)}=1\}x^{(i)}}{\sum_{i=1}^n 1\{y^{(i)}=1\}} \\ \Sigma &= \frac{1}{n}\sum_{i=1}^n (x^{(i)}-\mu_{y^{(i)}})(x^{(i)}-\mu_{y^{(i)}})^T \end{aligned}\]The following figure shows the training set and the contours of the two Gaussian distributions. The two Gaussians share the same covariance matrix \(\Sigma\), so their contours have the same shape and orientation, but their different means \(\mu_0\) and \(\mu_1\) place the contours at different positions. The straight line in the figure is the decision boundary at which \(p(y=1|x) = 0.5\): on the left side of the line the model predicts \(y=0\), and on the right side it predicts \(y=1\).
Figure 1: Gaussian Discriminant Analysis. Image source: Section 4.1.2 on page 40 from Stanford CS229 Notes.
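A minimal NumPy sketch of these closed-form maximum likelihood estimates might look as follows; the two-dimensional toy data standing in for TF-IDF feature vectors is invented for illustration:

```python
import numpy as np

def fit_gda(X, y):
    """Closed-form MLE for GDA with a shared covariance matrix."""
    n = X.shape[0]
    phi = np.mean(y == 1)                         # p(y = 1)
    mu0 = X[y == 0].mean(axis=0)                  # class-conditional means
    mu1 = X[y == 1].mean(axis=0)
    mu = np.where((y == 1)[:, None], mu1, mu0)    # mu_{y^(i)} for each example
    centered = X - mu
    Sigma = centered.T @ centered / n             # shared covariance matrix
    return phi, mu0, mu1, Sigma

# Toy data: two Gaussian blobs standing in for TF-IDF feature vectors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
               rng.normal(+1.0, 1.0, size=(100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])

phi, mu0, mu1, Sigma = fit_gda(X, y)
print(phi, mu0, mu1, Sigma, sep="\n")
```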
Comparison between Discriminative and Generative Models
Applying Bayes’ theorem to the generative GDA (Gaussian Discriminant Analysis) model, we have:
\[\begin{aligned} p(y=1|x) &= \frac{p(x|y=1)p(y=1)}{p(x|y=1)p(y=1) + p(x|y=0)p(y=0)} \\ &= \frac{\exp\left\{ -\frac{1}{2} (x - \mu_1)^T \Sigma^{-1}(x - \mu_1) \right\} \phi} {\exp\left\{ -\frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) \right\} \phi + \exp\left\{ -\frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0) \right\} (1 - \phi)} \\ &= \frac{1}{1+\exp\left\{ \frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) - \frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0) \right\} \frac{1 - \phi}{\phi}} \\ &= \frac{1}{1 + \exp\left\{ -\left[ (\Sigma^{-1} (\mu_1 - \mu_0))^T x + \frac{1}{2} (\mu_0 + \mu_1)^T \Sigma^{-1} (\mu_0 - \mu_1) - \ln \left( \frac{1 - \phi}{\phi} \right) \right]\right\} }\\ \end{aligned}\]Here the shared Gaussian normalization constant \(\frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\) cancels in the second step, and in the last step we collect the terms that are linear in \(x\) and those that are constant, defining \[\begin{aligned} \theta &= \Sigma^{-1} (\mu_1 - \mu_0) \\ \theta_0 &= \frac{1}{2} (\mu_0 + \mu_1)^T \Sigma^{-1} (\mu_0 - \mu_1) - \ln \left( \frac{1 - \phi}{\phi} \right) \end{aligned}\]From the derivation above, we find that \(p(y=1|x;\phi, \mu_0, \mu_1, \Sigma)\) can be viewed as a function of \(x\) of the following form:
\[p(y=1|x;\phi, \mu_0, \mu_1, \Sigma) = \frac{1}{1+\exp(-\theta^T x)}\]where we again follow the convention \(x_0 = 1\), so that the intercept \(\theta_0\) is absorbed into \(\theta^T x\). This highlights an interesting link between generative and discriminative models: \(\theta\) can be expressed as a function of the GDA parameters \(\phi, \mu_0, \mu_1, \Sigma\), and the resulting form is identical to the hypothesis function of the logistic regression model, which models the conditional probability \(p(y=1|x)\) in a discriminative manner.
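To make this correspondence concrete, the sketch below starts from a set of hypothetical GDA parameters, computes the implied \(\theta\) and \(\theta_0\), and checks that the sigmoid form reproduces the posterior obtained directly from Bayes’ theorem:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical GDA parameters (e.g., as estimated by the MLE formulas above).
phi = 0.4
mu0 = np.array([-1.0, -1.0])
mu1 = np.array([+1.0, +1.0])
Sigma = np.array([[1.0, 0.2],
                  [0.2, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

# Logistic-regression-style parameters implied by the GDA model.
theta = Sigma_inv @ (mu1 - mu0)
theta0 = 0.5 * (mu0 + mu1) @ Sigma_inv @ (mu0 - mu1) - np.log((1 - phi) / phi)

x = np.array([0.3, -0.5])                       # an arbitrary query point

# Posterior via Bayes' theorem with the Gaussian class-conditionals.
p1 = multivariate_normal.pdf(x, mean=mu1, cov=Sigma) * phi
p0 = multivariate_normal.pdf(x, mean=mu0, cov=Sigma) * (1 - phi)
posterior_bayes = p1 / (p0 + p1)

# The same posterior through the sigmoid form derived above.
posterior_sigmoid = 1.0 / (1.0 + np.exp(-(theta @ x + theta0)))

print(posterior_bayes, posterior_sigmoid)       # the two values agree
```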
Generally, generative models and discriminative models produce different decision boundaries when trained on the same dataset. The following points show the differences between the generative GDA model and the discriminative logistic regression model:
- For GDA, if \(p(x|y)\) is a multivariate Gaussian distribution with a shared covariance matrix, then \(p(y=1|x)\) will necessarily take the form of a sigmoid function. However, the reverse is not true: setting \(p(y=1|x)\) to be a sigmoid function does not guarantee that \(p(x|y)\) is multivariate Gaussian. This indicates that the GDA model makes stronger assumptions than logistic regression.
- Due to its stronger assumptions, GDA performs well when those assumptions align with the actual data. Conversely, logistic regression, with its weaker assumptions, tends to be more robust across various data distributions, provided sufficient training data is available.
In summary, when there is prior knowledge about the data distribution, generative GDA is more efficient as it requires less training data. However, if the data distribution is unknown, discriminative logistic regression is preferable because it is less sensitive to the validity of prior assumptions, although it demands more data to approximate the real data distribution.
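As a rough empirical sketch of this trade-off, one could fit both models on small synthetic training sets, once with Gaussian class-conditionals and once with heavy-tailed ones. Below, scikit-learn’s `LinearDiscriminantAnalysis` (which is GDA with a shared covariance matrix) and `LogisticRegression` stand in for the two models; the sample sizes and distributions are arbitrary choices, and the exact accuracies will vary with the random seed:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, gaussian=True):
    """Two classes centred at -1 and +1, with Gaussian or heavy-tailed noise."""
    if gaussian:
        noise = rng.normal(0.0, 1.0, size=(n, 2))
    else:
        noise = rng.standard_t(2.0, size=(n, 2))  # heavy tails violate the GDA assumption
    centers = np.vstack([np.full((n // 2, 2), -1.0), np.full((n // 2, 2), +1.0)])
    y = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])
    return centers + noise, y

for gaussian in (True, False):
    X_train, y_train = make_data(60, gaussian)     # deliberately small training set
    X_test, y_test = make_data(2000, gaussian)
    gda = LinearDiscriminantAnalysis().fit(X_train, y_train)
    logreg = LogisticRegression().fit(X_train, y_train)
    print("gaussian data:" if gaussian else "heavy-tailed data:",
          "GDA", round(gda.score(X_test, y_test), 3),
          "| logistic regression", round(logreg.score(X_test, y_test), 3))
```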
References
[1] Ng, Andrew. “CS229: Machine Learning Course Notes”. Stanford University, 2018.
[2] Salton, Gerard & McGill, Michael J. “Introduction to Modern Information Retrieval”. McGraw-Hill, 1983.