Day 5 - Supervised Learning - Logistic Regression
Posted on June 2, 2017
by Govind Gopakumar
Prelude
Announcements
- New project groups : Meet after class for short discussion
- Old project groups : Meeting tomorrow
- Programming tutorials to be put up tonight / tomorrow
- Webpage - govg.github.io/acass
Recap
Our first Regression model
- How to fit a line through our data
- How is this formed?
- Analytical solution for 1D
- Problems with this for more than 1D
Matrix factorization as regression
- Reduction of a “complicated” problem to simple problems.
- “Random” method to optimize - alternating optimization
- Works for non-convex loss functions
Probabilistic Classification
Logistic Regression - I
Why do we need this?
- Wish to predict “probability” of a label
- Useful to quantify “confidence” about prediction
Idea from linear regression
- \(\langle w, x \rangle\) : Similarity between parameter and point
- How do we extend this to classification?
- Very simple model : a weighted sum of the features!
Logistic Regression - II
Model overview
- Learn a parameter \(w\)
- \(p(y_i = 1) = \mu_i\)
- \(\mu_i = \frac{1}{1+\exp(-w^Tx_i)}\)
- Computes a “score” : \(\langle w, x \rangle\)
- Squashes it between (0,1)
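As a minimal sketch of the two steps above in NumPy (the parameter and data values here are made up for illustration):

```python
import numpy as np

def sigmoid(score):
    """Squash a real-valued score into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-score))

# Illustrative parameter and data (not from the course material).
w = np.array([0.5, -1.0])
X = np.array([[ 2.0, 0.1],
              [ 0.0, 0.0],
              [-3.0, 1.0]])

scores = X @ w        # the "score" <w, x_i> for every point
mu = sigmoid(scores)  # p(y_i = 1): near 1 for large scores, near 0 for very negative ones
print(mu)
```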
Interpretation?
- Very high “scores” - ?
- Very low “scores” - ?
- When are we not “confident”?
Logistic Regression - III
Learning
- We need to find out this \(w\) parameter.
- What does the decision rule look like?
- \(\log \frac{p(y_i = 1)}{p(y_i = 0)}\) = ?
- Intuitive explanation of this? (see the derivation below)
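For the derivation referenced above: plugging \(\mu_i = \frac{1}{1+\exp(-w^Tx_i)}\) into the log-odds gives

\[
\log \frac{p(y_i = 1)}{p(y_i = 0)} = \log \frac{\mu_i}{1 - \mu_i} = w^T x_i,
\]

so the decision rule \(p(y_i = 1) \ge p(y_i = 0)\) is simply \(w^Tx_i \ge 0\) : a linear rule, which is why we are still learning a line.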
Geometry of the solution
- Still learning a line!
- How does this differ from other “lines”?
- Why is this useful then?
Logistic Regression - IV
Learning the parameter
- Can we come up with a loss function?
- Why will this be easy or hard?
- How can we optimize this?
Problems with the squared loss
- Can we differentiate this easily?
- Is this convex?
Logistic Regression - V
Constructing a loss
- How do we choose a loss?
- Loss should be high when predicted and actual are different.
- Loss should be low when predicted is same as actual.
Two way loss
- If \(y_i = 1\), loss \(l(w) = -\log(\mu_i)\)
- If \(y_i = 0\), loss \(l(w) = -\log(1-\mu_i)\)
- Why does this seem right?
Logistic Regression - VI
Final cross-entropy loss
- \(l(w) = -y_i \log(\mu_i) - (1-y_i)\log(1-\mu_i)\)
- “Cross” entropy : related to the entropy we saw earlier
- How do we write this in terms of \(w\)?
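A small sketch of this loss in NumPy, just to sanity-check the two cases above (the inputs are illustrative):

```python
import numpy as np

def cross_entropy(y, mu):
    """Cross-entropy loss for a single point: -y*log(mu) - (1-y)*log(1-mu)."""
    return -y * np.log(mu) - (1 - y) * np.log(1 - mu)

# If y = 1 and mu is close to 1, the loss is small; if mu is close to 0, it blows up.
print(cross_entropy(1, 0.99))  # ~0.01
print(cross_entropy(1, 0.01))  # ~4.6
# Symmetrically for y = 0.
print(cross_entropy(0, 0.01))  # ~0.01
print(cross_entropy(0, 0.99))  # ~4.6
```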
Loss function
- Setting \(\mu_i = \frac{\exp(w^Tx_i)}{1 + \exp(w^Tx_i)}\)
- \(L(w) = -\sum (y_i w^Tx_i - \log(1 + \exp(w^Tx_i)))\)
- How do we impose control on solution?
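Spelling out the substitution: with \(\mu_i = \frac{\exp(w^Tx_i)}{1+\exp(w^Tx_i)}\) we get \(\log \mu_i = w^Tx_i - \log(1+\exp(w^Tx_i))\) and \(\log(1-\mu_i) = -\log(1+\exp(w^Tx_i))\), so summing the per-point losses gives

\[
L(w) = \sum_i \left[ -y_i \log \mu_i - (1-y_i)\log(1-\mu_i) \right] = -\sum_i \left( y_i w^Tx_i - \log(1 + \exp(w^Tx_i)) \right).
\]

Control on the solution can be imposed the same way as in linear regression, e.g. by adding a regularizer such as \(\lambda \|w\|^2\) to \(L(w)\).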
Logistic Regression - VII
Optimizing this loss
- \(L(w) = -\sum (y_i w^Tx_i - \log(1 + \exp(w^Tx_i)))\)
- \(g = -\sum \left(y_i - \frac{\exp(w^Tx_i)}{1 + \exp(w^Tx_i)}\right)x_i\)
- Is there a simple form? Yes!
Final expression
- \(g = -\sum (y_i - \mu_i)x_i\)
- Can we set it to zero?
- What do we do now?
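A minimal NumPy sketch of the loss and this gradient (function and variable names are illustrative):

```python
import numpy as np

def loss_and_grad(w, X, y):
    """Logistic loss L(w) and its gradient g = -sum_i (y_i - mu_i) x_i.

    X : (n, d) data matrix, y : (n,) labels in {0, 1}, w : (d,) parameter.
    """
    scores = X @ w                      # w^T x_i for every point
    mu = 1.0 / (1.0 + np.exp(-scores))  # mu_i
    # np.logaddexp(0, s) computes log(1 + exp(s)) in a numerically stable way.
    loss = -np.sum(y * scores - np.logaddexp(0.0, scores))
    grad = -X.T @ (y - mu)
    return loss, grad
```

Setting \(g = 0\) has no closed-form solution, since the \(\mu_i\) depend on \(w\) non-linearly; that is why we turn to gradient descent next.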
Logistic Regression - VIII
Gradient descent
- Update using \(w^{t+1} = w^{t} - \eta g_t\)
- \(w^{t+1} = w^{t} - \eta \sum (\mu_i^t - y_i) x_i\)
Analyzing the update step
- Which \(x_i\) contributes more to the update of \(w^t\)?
- Does this sort of update make sense now?
- How much time do we require to compute this?
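A sketch of the full-batch version of this update (the step size and iteration count are arbitrary illustrative choices):

```python
import numpy as np

def fit_logistic_gd(X, y, eta=0.1, n_iters=1000):
    """Batch gradient descent: w <- w - eta * sum_i (mu_i - y_i) x_i."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        mu = 1.0 / (1.0 + np.exp(-(X @ w)))
        w = w - eta * (X.T @ (mu - y))   # the update from the slide
    return w
```

Each iteration touches every data point, so a single update costs \(O(nd)\); that is the computation-time question above.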
Gradient Descent - I
Improving gradient descent
- Choice of \(\eta\) is crucial!
- Can add a momentum term \(w^{t+1} = w^t - \eta g_t + \alpha^t(w^t - w^{t-1})\)
- Can also use “second-order” methods (beyond the scope of this class)
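As a sketch, the momentum variant only changes the update line; \(\alpha\) is kept fixed here for simplicity (the slide's \(\alpha^t\) may vary with \(t\)), and all names and defaults are illustrative:

```python
import numpy as np

def fit_logistic_momentum(X, y, eta=0.1, alpha=0.9, n_iters=1000):
    """Gradient descent with momentum: w_{t+1} = w_t - eta*g_t + alpha*(w_t - w_{t-1})."""
    n, d = X.shape
    w = np.zeros(d)
    w_prev = np.zeros(d)
    for _ in range(n_iters):
        mu = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = X.T @ (mu - y)
        w, w_prev = w - eta * grad + alpha * (w - w_prev), w
    return w
```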
Speeding up gradient descent
- We need to compute gradient across entire data
- Is there a naive solution to this?
Gradient Descent - II
Mini-batch Gradient Descent
- Approximate the loss function using a subset
- Gradient becomes faster to compute
- Why should this work?
Stochastic Gradient Descent
- Let’s take it to the extreme - use just one point!
- Extremely fast gradient descent
- Why would this work at all?
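A sketch of the mini-batch idea; setting `batch_size=1` recovers stochastic gradient descent (names, defaults, and the shuffling scheme are illustrative choices):

```python
import numpy as np

def fit_logistic_sgd(X, y, eta=0.1, batch_size=32, n_epochs=10, seed=0):
    """Mini-batch / stochastic gradient descent on the logistic loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        order = rng.permutation(n)               # shuffle the data each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            mu = 1.0 / (1.0 + np.exp(-(Xb @ w)))
            w = w - eta * (Xb.T @ (mu - yb))     # gradient estimated on the batch only
    return w
```

The intuition for why this works: on average, the mini-batch gradient points in the same direction as the full gradient, so the cheap noisy steps still make progress.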
Logistic Regression - via Probability - I
Choosing a likelihood
- What is appropriate?
- Can we relate this to something we know?
- How do we write down entire likelihood?
Doing “Maximum” probability
- \(p(y_i) = \mu_i^{y_i} (1-\mu_i)^{1-y_i}\)
- What will we get? Any guesses?
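To spell out the guess: assuming the points are independent, the log-likelihood is

\[
\log \prod_i p(y_i) = \sum_i \left[ y_i \log \mu_i + (1-y_i) \log(1-\mu_i) \right],
\]

which is exactly \(-L(w)\). Maximizing the likelihood is therefore the same as minimizing the cross-entropy loss: both routes give the same answer.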
Logistic Regression - IX
Multiclass
- Naturally extend this to multiclass - how?
- Can think of it both in loss function sense and probability sense
- Same methods will apply, with some tweaks
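One standard way to do this extension (the lecture does not fix a particular one) is the softmax model, with one weight vector \(w_k\) per class:

\[
p(y_i = k) = \frac{\exp(w_k^T x_i)}{\sum_{c} \exp(w_c^T x_i)},
\]

trained by minimizing \(-\sum_i \log p(y_i)\), the multiclass cross-entropy; for two classes this reduces to the logistic model above.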
Comments
- Gives a probability estimate of the class, instead of a hard decision
- Gradient descent can be done fast
- Widely used, in different fields as well
- Used as modules in neural networks!
Yet another Classifier
Perceptron - I
Extending the Logistic model
- \(w^{t+1} = w^t - \eta_t (\mu^t_i - y_i)x_i\)
- Replace with a cutoff for \(\mu_i\)
- \(w^{t+1} = w^t - \eta_t (\hat{y}_i - y_i)x_i\)
Analyzing the new update
- When does this update actually take place?
- What is this update when it does take place?
- For ease, let us assume labels \(y_i \in \{-1, 1\}\).
Perceptron - II
Mistake driven learning
- Update upon mistake : \(w^{t+1} = w^t + 2\eta_t y_i x_i\)
- What does this update look like?
- Why does the update work?
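A minimal sketch of the mistake-driven loop, assuming labels \(y_i \in \{-1, 1\}\) as above (names, the fixed step size, and the number of passes are illustrative):

```python
import numpy as np

def fit_perceptron(X, y, eta=1.0, n_epochs=10):
    """Perceptron: update w only when the current prediction is a mistake.

    X : (n, d) data, y : (n,) labels in {-1, +1}.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        for i in range(n):
            y_hat = 1 if X[i] @ w >= 0 else -1
            if y_hat != y[i]:                      # mistake: y_hat - y_i = -2 y_i
                w = w + 2 * eta * y[i] * X[i]      # the update from the slide
    return w
```

The update adds a misclassified point (with its sign) to \(w\), rotating the separating line towards classifying that point correctly; the factor 2 can be absorbed into \(\eta\).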
Geometry of the classifier
- What will the loss surface be?
- Learns a linear surface!
- Why is it useful then? - Extremely fast way to construct it
Perceptron - III
Significance of Perceptrons
- One of the first “classifiers” ever built
- Can be thought of as a model for a brain
- Led to AI “winter” : ML research stalled for a while
- Comes with an actual theoretical bound on the number of mistakes!
Usage of perceptrons
- Multilayer Perceptrons : Starting point for neural networks
- Almost every “deep neural network” builds on the MLP
- Non-linear methods : do a transformation! (when we discuss kernels)
Halfway round up
General techniques
Loss functions
- Why choice of a loss function matters
- Common loss functions : squared loss!
- How some loss functions can be bad.
Probability method
- Maximize the likelihood of the observed data!
- How to choose a likelihood model
- How it (possibly) leads to the same answer as above
Methods discussed
Classification
- K - nearest neighbors
- Decision Trees
- Random Forests
- Logistic Regression
- Perceptron
Regression
- Adaptation of KNN
- Adaptation of Decision Tree?
- Linear Regression
Agenda for next week
Unsupervised Learning and Advanced methods
- Cover some unsupervised learning methods
- Cover some “advanced” material (SVM, Neural Networks, Kernels)
Greater focus on programming
- Every class will have a programming assignment
- (Hopefully) deal with “realistic” datasets
- Two classes on feature “extraction” and modelling
- One class purely on best practices for experiments
Conclusion
Concluding Remarks
Takeaways
- Another classification technique : Logistic Regression
- Gradient descent and stochastic gradient descent
- The perceptron algorithm
Announcements
- Extra class : Monday 3 - 4 pm (purely a Python tutorial)
- Quiz 1 : Automatically graded
- Assignment 2 : Working on the MNIST dataset