Machine Learning

Instructor: Dr. Aric LaBarr


Introduction

Machine learning can be a hard concept to define. The Merriam-Webster dictionary defines machine learning in this way:

Machine Learning - Merriam Webster

a computational method that is a sub-field of artificial intelligence and that enables a computer to learn to perform tasks by analyzing a large data set without being explicitly programmed.

The Oxford dictionary is similar in nature:

Machine Learning - Oxford

a type of artificial intelligence in which computers use huge amounts of data to learn how to do tasks rather than being programmed to do them.

Sounds complicated. However, let’s examine a list of some of the most common machine learning methods:

  1. Linear Regression

  2. Logistic Regression

  3. k-Nearest Neighbors

  4. Decision Trees

  5. Random Forests

  6. Time Series Analysis

  7. Clustering

  8. Naïve Bayes

  9. Principal Components Analysis

  10. Gradient Boosting

Notice how linear and logistic regression commonly appear on such lists. We will take the approach that machine learning algorithms are exactly what the definitions above state, and that they fall along a spectrum ranging from extremely interpretable models to models that focus more on predictive ability. Although a tremendous amount of work has been done in recent years to make some of the more advanced algorithms more interpretable, they were originally created for predictive power and complexity rather than interpretation. We will discuss some of the ways to interpret these algorithms at the end of this code deck. This code deck assumes only a basic introduction to linear regression as its foundation.

Let’s examine the “learning” piece of machine learning. There are four main categories of “learning” at the time of writing this code deck:

  • Supervised Learning: Models learn from labeled data. For example, linear regression.

  • Unsupervised Learning: Models learn from unlabeled data. For example, clustering.

  • Semi-supervised Learning: Models learn from limited labeled data mixed with large amounts of unlabeled data. For example, image recognition.

  • Reinforcement Learning: Models learn by interacting with the environment around them (trial and error). For example, autonomous vehicles.

This course will focus primarily on supervised learning with an introduction to unsupervised learning at the end.

In supervised learning, variables are used to predict or explain a known target variable. In supervised regression learning, that target variable is continuous:

Supervised Regression Modeling
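As a small illustration of supervised regression (not from the original deck), the sketch below fits a linear regression to synthetic data with a continuous target; the simulated data and the use of scikit-learn here are assumptions for demonstration only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for this sketch): a continuous target
# driven by one predictor plus random noise, y = 3x + 5 + noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 1, size=100)

# Supervised regression: the model learns from labeled (X, y) pairs
model = LinearRegression().fit(X, y)

# The fitted slope and intercept should be close to the true 3 and 5
print(model.coef_[0], model.intercept_)
```

Because the data are labeled with a known continuous target, the model can be checked against the values used to generate them.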

In supervised classification learning, that target variable is categorical; a binary target variable is shown in the example below:

Supervised Classification Modeling
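A matching illustration for supervised classification (again, a sketch with simulated data, not part of the original deck) fits a logistic regression to a binary target:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for this sketch): the binary label
# depends on whether the sum of the two predictors is positive
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Supervised classification: the model learns the labeled boundary
clf = LogisticRegression().fit(X, y)
acc = clf.score(X, y)  # accuracy on the training data
```

Since the labels follow a simple linear rule, logistic regression recovers the boundary almost exactly.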

Unsupervised learning models do not have a target variable to relate the predictor variables to.

Unsupervised Modeling
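To contrast with the supervised examples above, the following sketch (with simulated data, an assumption for illustration) runs k-means clustering, which receives no labels at all and must find structure on its own:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (an assumption for this sketch): two well-separated
# groups of points -- note that no labels are passed to the model
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])

# Unsupervised learning: k-means assigns each point to one of 2 clusters
km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)
labels = km.labels_
```

With groups this far apart, the recovered cluster assignments line up with the two groups, even though the algorithm never saw a target variable.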

The reason we cover all of the algorithms in this code deck is that we do not know ahead of time which algorithm will work best for any specific data set. Stephen Lee from the University of Idaho has a great visual to summarize this idea:

Comparing Algorithms (Lee, University of Idaho)
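The idea that no single algorithm wins on every data set can be made concrete with cross-validation. The sketch below (a minimal illustration; the simulated data set and the three chosen models are assumptions, not from the original deck) scores several algorithms on the same data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Simulated classification data (an assumption for this sketch)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# A few of the algorithms from the list earlier in this deck
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "k-nearest neighbors": KNeighborsClassifier(),
}

# 5-fold cross-validation gives each model a comparable accuracy score
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

Rerunning this on a different data set can easily reorder the models, which is exactly why the deck covers many algorithms rather than one.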

This markdown file contains information on how to perform many different machine learning algorithms, and all of their components, in both Python and R. In each section you can toggle back and forth between the code and output you want to view.

The libraries for Python and R will be loaded as we go throughout the code.