A brief introduction

The simplest two-word explanation of ML - Curve Fitting

Basically, given a dataset, can a computer learn & interpret it the way it ideally should, i.e. the way a human would?

This is achieved by mathematically encoding everything - images, text, analytics - into vectors that are passed through a function we want to come up with, one that gives an output close to what a human brain would produce for the same input information.

The accuracy, of course, is measured by drawing curves & seeing how well the machine's output matches real-world thinking, i.e. how well the predicted curve fits the actual curve.

Basic Algorithms

I've been a developer for most of my career, so I usually learn or stumble across new technologies when I run into a business use case that demands a new solution or approach as the most optimal way forward.

Hence, in this post, instead of explaining in depth what each algorithm does & the mathematics behind it, I will primarily focus on the basic algorithms used in Supervised (we have both input & output values) & Unsupervised (we have input values only) Learning, & when to use which algorithm.

We will briefly cover :

  1. Linear Regression
  2. Logistic Regression (Classification)
  3. Neural Networks
  4. Support Vector Machine
  5. K-Means Clustering - deriving structure from data
  6. Principal Component Analysis
  7. Anomaly Detection
  8. Recommender Systems
  9. Online Learning
  10. Reinforcement Learning

For each algorithm, we'll discuss :

  1. Optimization Objective
  2. How to actually use it
  3. When to use it

Also, as a bonus, I'll point you to some resources on actually debugging learning algorithms so that you can improve upon an implemented solution. 🚀

Some basic definitions before we get started :
Say my dataset is the collection of characteristics of "houses in London" & their prices

  1. Features - The characteristics/traits recorded for each entry in the dataset, along with their actual values. Eg : number of square feet (1200 square feet), water supply hours (24 hour water supply).
  2. Parameters aka Weights - The numbers our learning algorithm attaches to each feature, deciding how much that feature contributes to the final prediction.

Basically, our learning algorithm (especially in supervised learning) is mostly about assigning the right values to these weights, one per feature, to come up with a function (aka the "hypothesis function") that consistently gives a good predicted value.
Eg : water supply hours should be given a higher weight compared to other features, since most people want it; such houses will be in high demand, & hence their prices will be higher.

1. Linear Regression

Regression Multivariate Gradient Descent Normalization Regularization

Used when - We have a labelled dataset & we wish to devise a function that uses the features + parameters & predicts the most accurate output. Since the results are continuous values & not discrete classes/buckets, we use the term "regression".

Eg application - Guessing the price of a house, starting with a labelled dataset of various houses’ features such as area, locality etc. & their actual prices

"linear", since we mostly only deal with features independent of each other. Eg : size of a house does not ideally depend directly on how near its location is to a school. In case we do have dependent features we use - "Multivariate regression" a technique used to measure the degree to which the various independent variable and various dependent variables are linearly related to each other

We optimize the function we come up with using the "Gradient Descent" approach, which iteratively nudges the coefficient of each feature in the direction that reduces the prediction error.

Also, we do something called "regularization" to keep our function from overfitting the dataset we have, tuning its strength so that we don't underfit either. Get it just right :)

Lastly, we can also use something called "Stochastic Gradient Descent" for improving our hypothesis function, in case our dataset is too large & iterating over all training examples for each step of gradient descent is too time & compute consuming.
This approach looks at only 1 random training example in each iteration of gradient descent, as opposed to all examples in the training set, & improves the parameters just for that 1 example in that step, before moving on to the next random example in the next step. This drastically brings down the computation we have to perform for each step of gradient descent.
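To make this concrete, here is a minimal sketch of batch gradient descent for linear regression in plain numpy. The house sizes, water supply hours & prices are made-up toy numbers, & the learning rate & regularization strength are arbitrary illustrative choices - a sketch, not a definitive implementation.

```python
import numpy as np

# Toy dataset (made-up numbers): house size in sq ft & water supply hours -> price in thousands
sizes = np.array([1200.0, 1500.0, 800.0, 2000.0])
water = np.array([24.0, 12.0, 6.0, 24.0])
y = np.array([300.0, 380.0, 150.0, 520.0])

# Normalization (feature scaling) so gradient descent converges smoothly
features = np.column_stack([sizes, water])
features = (features - features.mean(axis=0)) / features.std(axis=0)
X = np.column_stack([np.ones(len(y)), features])   # prepend a bias column of 1s

theta = np.zeros(X.shape[1])   # the parameters / weights we want to learn
alpha, lam = 0.1, 0.01         # learning rate & regularization strength (arbitrary)
for _ in range(1000):
    error = X @ theta - y                      # h(x) - y for every training example
    grad = (X.T @ error) / len(y)
    grad[1:] += (lam / len(y)) * theta[1:]     # regularization term (bias is not regularized)
    theta -= alpha * grad                      # one step of batch gradient descent

print("learned weights:", theta)
```

Stochastic gradient descent would replace the full `X @ theta - y` pass with the error on a single randomly picked example per step.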

2. Logistic Regression

Classification Sigmoid Gradient Descent One-vs-all

Use case - We have a labelled dataset & we wish to devise a function that uses features + parameters & predicts the most accurate classified output

Eg application - Looking at a series of emails & classifying each as "spam" or "not-spam". Multiclass example - Looking at a series of vehicle images & predicting whether each one is a bike, car, ship, bus etc.

We basically predict the probability (between 0 & 1) that a given data point lies in one of the classes. We use a "sigmoid" function as part of the hypothesis we come up with, so that its outputs always lie in the range 0 to 1.

For multiclass classification, we predict, for each data point, the probability that it lies in one class vs the probability of it not lying in that class (i.e. the probability of it lying in any class except that one). This is called the "one-vs-all" approach; we repeat it for every class & pick the class with the highest probability.
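Here is a minimal sketch of binary logistic regression in numpy, using a tiny made-up "spam vs not-spam" dataset with two hypothetical features per email (the numbers & features are purely illustrative).

```python
import numpy as np

def sigmoid(z):
    """Squashes any real number into (0, 1), so it can be read as a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up data: 2 features per email (e.g. "spammy word score", "known-sender score")
X = np.array([[2.0, 0.1], [1.5, 0.3], [0.2, 2.5], [0.1, 3.0]])
y = np.array([1, 1, 0, 0])          # 1 = spam, 0 = not-spam

theta = np.zeros(X.shape[1])
alpha = 0.5                          # learning rate (arbitrary)
for _ in range(2000):
    p = sigmoid(X @ theta)                        # predicted probability of "spam"
    theta -= alpha * (X.T @ (p - y)) / len(y)     # gradient descent on the logistic loss

print(sigmoid(np.array([1.8, 0.2]) @ theta))      # close to 1 -> classify as spam
```

For one-vs-all, the same loop would simply be run once per class, each time relabelling that class as 1 & everything else as 0.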

3. Neural Networks

Complex non-linear hypotheses Forward Propagation Backward Propagation

Use case - We have a labelled dataset & we wish to learn interesting features starting from the initial features, & then perform classification. Basically, cases where logistic regression falls short for complex hypotheses, since it is only a linear classifier

Eg application - Handwriting recognition

Each set of interesting features that we learn - starting from the initial input features & training examples, & building up on what was learnt before - constitutes a layer.
Eg : Starting from 32x32 pixel black & white images, one layer learns from the raw pixels what a line, arc or curve is, & the next layer uses those to understand how Arabic numerals are constructed as combinations of these newly learnt, interesting features.

We use "Forward Propagation" algorithm to predict output based on the hypothesis function the layers of the neural network collectively. We use "Backward Propagation" algorithm to optimize the accuracy of the neural network

4. Support Vector Machine

Kernels Gaussian Kernels

Use case - Classifying labelled training data, especially when the boundary between the classes is complex & non-linear

Gives a more powerful & cleaner way of learning complex non-linear decision boundaries compared to logistic regression, & is often cheaper to train than a neural network - the boundary it learns depends only on the most important training examples (the "support vectors") in the given set.
With kernels, it can effectively work in a much larger feature space than the one we started with, without us having to construct those features by hand.

Eg application - Text or image classification - essentially any problem where we need a classifier with a complex, non-linear decision boundary

We use kernels, i.e. predefined similarity functions (the Gaussian kernel being the most common), that let the SVM behave as if it had a much richer, non-linear feature set derived from the existing features - without ever paying the compute cost of building that feature set explicitly
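A minimal sketch with scikit-learn's SVC: a made-up dataset where one class sits inside a circle & the other outside, which no linear classifier can separate but an SVM with a Gaussian (RBF) kernel handles easily. The gamma & C values are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up data: points inside the unit circle are class 1, outside are class 0
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0).astype(int)

clf = SVC(kernel="rbf", gamma=1.0, C=1.0)    # Gaussian kernel; gamma controls its width
clf.fit(X, y)

print(clf.predict([[0.1, 0.2], [1.9, 1.8]]))  # expected: [1 0]
```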

5. K-Means Clustering

Cluster Assignment Move Centroid

Use case - Given an unlabelled dataset that we want to organize into discrete classes/clusters/groups of data

Eg application - Grouping of Tshirts, by using dimension values, into S, M, L, XL categories

Basically, we start by picking random points as "cluster leaders" (aka centroids) & then assign every other data point to the cluster whose leader it is most similar (i.e. closest) to. This is known as the "cluster assignment" step

Once this is done, we recompute each "cluster leader's" value to be the average of the values of all points in its cluster. This is known as the "move centroid" step, since on an actual graph this average is a centroid that moves upon recomputation

We keep alternating the "cluster assignment" & "move centroid" steps for a number of iterations, till the cluster leaders' values stop changing, or change only negligibly
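Here is a minimal numpy sketch of that loop, on made-up (height, weight) measurements standing in for the T-shirt example (all numbers & the choice of K = 3 are illustrative; a production version would also handle clusters that lose all their points).

```python
import numpy as np

# Made-up people data: (height in cm, weight in kg), drawn around three rough body types
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([160, 55], 5, size=(50, 2)),
               rng.normal([175, 75], 5, size=(50, 2)),
               rng.normal([190, 95], 5, size=(50, 2))])

K = 3
centroids = X[rng.choice(len(X), K, replace=False)]   # start from K random points

for _ in range(20):
    # Cluster assignment step: each point joins its nearest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = np.argmin(distances, axis=1)
    # Move centroid step: each centroid moves to the mean of the points assigned to it
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])

print("final centroids:\n", centroids)
```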

6. Principal Component Analysis

Dimensionality Reduction Eigen Vector Reconstruction

Use case - Reducing feature set for unlabelled & labelled training examples aka "dimensionality reduction"

It is a statistical process that converts observations of correlated features into a set of linearly uncorrelated features with the help of an orthogonal transformation

Eg application - Reduce memory needed to store data & to speed up learning algorithm

It uses the concept of "eigenvectors", which basically helps us extract the most impactful directions in the data, i.e. a smaller set of derived features that still capture most of the information in the original feature set

Also, we use "reconstruction" of feature set, from the output that this algorithm gives, & use it in comparison to the actual feature set & assess & improve this algorithm.

7. Anomaly Detection

Gaussian Distribution

Use case - Learning from heavily skewed datasets (so few positive examples that they are almost as good as unlabelled) to figure out a way to predict whether a new data point is an anomaly or not

Eg application - Fraud detection based on users' web activities, manufacturing quality checks, detecting malfunctioning computers in a data center

We basically assume that the dataset follows a Gaussian Distribution (data is symmetric about the mean, with values near the mean more frequent than values far from it) & use that to compute how probable the data point we're looking at is; points whose probability falls below a chosen threshold get flagged as anomalies
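A minimal numpy sketch of that idea, on made-up "healthy server" metrics; the per-feature Gaussian fit, the independence assumption & the threshold value are all illustrative choices (in practice the threshold is tuned on a small labelled validation set).

```python
import numpy as np

# Made-up metrics from normally behaving servers: (CPU load, memory use), both 0-1
rng = np.random.default_rng(4)
X_train = rng.normal([0.3, 0.5], [0.05, 0.1], size=(500, 2))

# Fit one Gaussian per feature: estimate mean & variance from the (mostly normal) data
mu = X_train.mean(axis=0)
var = X_train.var(axis=0)

def gaussian_prob(x):
    """Density of x under the fitted per-feature Gaussians, treated as independent."""
    return np.prod(np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var))

epsilon = 1e-3   # anomaly threshold (illustrative)
for x in [np.array([0.31, 0.52]), np.array([0.90, 0.95])]:
    print(x, "-> anomaly" if gaussian_prob(x) < epsilon else "-> normal")
```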

8. Recommender Systems

Content Based Recommendation Collaborative Filtering

Use case - Build a system that recommends products to users based on previous feedback & also uses this same feedback to tag products better for an improved experience the next time around.

Basically, we treat each user as a linear regression model of their own (i.e. their own parameters over shared movie/product features, so a hypothesis function at an individual level), based on their previous actions.

Eg application - Building the recommendation engine for a streaming service (eg : Netflix) or an e-commerce website (eg : Amazon)

Content Based Recommendation - Predicting/Improving parameters keeping features fixed.
This is more like a summation of linear regressions over a bunch of users. It addresses a problem of the type - If Alice liked (gave a rating of 4 and above out of 5) movie 1 (which was tagged as "romance" & "action") & she also liked movie 2 (which was tagged as "romance" & "comedy"), to what extent will she like movie 3 (which was tagged as "romance" & "tragedy")? Is movie 3 good enough to be recommended to Alice?

Collaborative Filtering - Predicting/Improving features, keeping parameters fixed.
This is, yet again, a summation of linear regressions over a bunch of users. It addresses a problem of the type - If Alice likes "romance" & "tragedy" movies & John likes "action" & "romance" movies & Emily likes "drama" & "romance" movies, can a movie X, that all three have seen & liked, be tagged as "romance"?

Thus, we use Content Based Recommendation & Collaborative Filtering in conjunction in most cases, to improve our own training dataset & the UX as our service is used more & more.
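A tiny content-based sketch of the Alice example above: the movie tag strengths & Alice's parameter vector are made-up illustrative numbers, as if her parameters had already been fit by linear regression on her past ratings.

```python
import numpy as np

# Hypothetical movie features (tag strengths): [romance, action, comedy]
movies = {
    "movie 1": np.array([0.9, 0.8, 0.1]),
    "movie 2": np.array([0.95, 0.1, 0.7]),
    "movie 3": np.array([0.9, 0.0, 0.0]),   # the one we want to decide about
}

# Alice's parameter vector, as if learnt from her past ratings (illustrative numbers)
alice_theta = np.array([4.0, 0.3, 0.8])

for title, features in movies.items():
    predicted_rating = alice_theta @ features    # content-based prediction: theta^T x
    print(title, "->", round(predicted_rating, 2))
```

Collaborative filtering runs the same idea in reverse: hold every user's parameter vector fixed & use gradient descent to learn the movies' feature vectors from everyone's ratings (& in practice we alternate between, or combine, the two).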

9. Online Learning

Large Dataset

Use case - You have continuous, large streams of data & you want your ML model to learn from each example as it arrives & then discard/throw it away

Eg application - Product search algorithm showing the user only the most relevant, popular items, based on the latest learnings
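A minimal sketch of one online-learning update, using logistic regression for a "will the user click this search result?" signal; the feature encoding & learning rate are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def learn_from_event(theta, features, clicked, alpha=0.1):
    """One online-learning step: update the parameters from a single (features, label)
    pair; after this the example can be discarded."""
    prediction = sigmoid(theta @ features)               # current guess: will the user click?
    return theta - alpha * (prediction - clicked) * features

theta = np.zeros(3)   # model parameters, updated continuously as events stream in
theta = learn_from_event(theta, np.array([1.0, 0.8, 0.2]), clicked=1)
theta = learn_from_event(theta, np.array([1.0, 0.1, 0.9]), clicked=0)
print(theta)
```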

10. Reinforcement Learning

Goal Oriented Algorithms Rewards Punishments

Use case - Enables an agent to learn in an interactive environment by trial & error based on feedback from its own actions & experiences.
Uses rewards & punishments for positive & negative behavior.
Usually modelled as a Markov Decision Process

Eg application - Finding a suitable action policy that maximizes the total cumulative reward, e.g. building a bot that can play a game really well. Other use cases - robotics, business strategy planning, traffic light control, web system configuration
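To make the reward/punishment loop concrete, here is a tiny tabular Q-learning sketch on a made-up 5-state corridor (the states, actions, rewards & hyperparameters are toy assumptions, not a general-purpose implementation).

```python
import numpy as np

# Toy world: a 5-state corridor, start at state 0, reward +1 for reaching state 4.
# Actions: 0 = step left, 1 = step right.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))   # expected future reward for each (state, action)
alpha, gamma = 0.5, 0.9               # learning rate & discount factor (arbitrary)
rng = np.random.default_rng(5)

for _ in range(200):                  # episodes of trial & error
    s = 0
    while s != 4:
        a = rng.integers(n_actions)                      # explore by acting randomly
        s_next = max(0, s - 1) if a == 0 else s + 1
        reward = 1.0 if s_next == 4 else 0.0             # the "reward" for good behaviour
        # Q-learning update: nudge Q[s, a] toward reward + discounted best future value
        Q[s, a] += alpha * (reward + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(np.argmax(Q[:4], axis=1))   # learned policy for states 0-3: always 1, i.e. "step right"
```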

Debugging Machine Learning Algorithms

The best way to go about this is :

  1. Start with a quick, simple implementation of an ML model, chosen based on the business use case you have, using just a small subset of your entire training dataset.
  2. Plot "learning curves" to get an idea of how well your ML model addresses your use case.
  3. Calculate the metrics you care about - such as Precision, Recall, F1 Score, Accuracy etc. - & figure out which ones you want to improve upon (see the small sketch after this list)
  4. Then, bring into account things like regularization, adding more features etc. to fix any underfitting/overfitting problems you might have
  5. Lastly, bring in your full dataset, & use methods like dimensionality reduction (PCA), MapReduce (basically splitting your training dataset into chunks & computing each part in parallel on a separate core) & online learning to scale your ML model.
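For step 3, scikit-learn's metrics module makes these numbers one-liners; the labels below are made up purely to show the calls.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical spam-filter outputs on a small validation set (1 = spam)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))  # of everything flagged as spam, how much really was
print("recall   :", recall_score(y_true, y_pred))     # of all the real spam, how much we caught
print("f1 score :", f1_score(y_true, y_pred))         # harmonic mean of precision & recall
```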

Word of advice

Sometimes, by just looking at an ML problem, we might not know which ML algorithm to use. More than the algorithm itself, the following play a big part in training a model :

  1. How much data you have
  2. How skilled you are at error analysis, debugging learning algorithms, designing new features, figuring out which features actually help your learning algorithm, & so on

Afterword

This is it - a quick run-through of the most important, fundamental concepts of machine learning & when to use them. Massive thanks to Andrew Ng's exhaustive Machine Learning course, which forms the bedrock of this blog.
Hope you don't use this to build the AI that will conquer humanity one day :)