FastAI Lesson 5 Notes
deep learning

January 17, 2024


All of this code was written by Jeremy Howard and the FastAI team. The original code comes from the notebooks "Linear model and neural net from scratch" and "Why you should use a framework".


In this lesson, Jeremy trains a model from scratch, first as a linear model, then as a neural network, and then as a deep learning model, before finally walking through training a model with fastai + PyTorch and building an ensemble. This lesson actually spans lesson 3, lesson 5 and part of lesson 6, so I had to go back and review lesson 3 to make sure I understood the material for lesson 5. I highly recommend going over lesson 3 and chapter 4 before this lesson, because Jeremy doesn’t go as deep into the meaning of tensor shape and rank here as he does in chapter 4. This lesson was really exciting from the programming side because I learned more about Python and numerical programming: partials, broadcasting, data cleaning with pandas, and feature engineering.

Titanic - Machine Learning From Disaster

The Titanic - Machine Learning from Disaster competition on Kaggle is used as the case study for this lesson. More information about the data can be found on the competition page, Titanic - Machine Learning from Disaster.

Load Data and Libraries

# import libraries and files

# required libraries + packages for any ml/data science project
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch

# fastai builds on the packages above; fastai.imports pulls in the common ones
!pip install -Uqq fastai
from fastai.imports import *

# packages for file handling and the Kaggle download
import os
from pathlib import Path
import zipfile

def loadData(creds, dataFile):
    '''Function for loading kaggle datasets locally or on kaggle
    Returns a local path to data files
    - input: Kaggle API Login Credentials, Kaggle Contest Name '''
    # variable to check whether we're running on kaggle website or not
    iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

    # kaggle: the competition data is already available under ../input
    if iskaggle:
        return Path('../input') / dataFile

    # local machine: save the kaggle API credentials if they are not on disk yet
    cred_path = Path('~/.kaggle/kaggle.json').expanduser()
    if creds and not cred_path.exists():
        cred_path.parent.mkdir(exist_ok=True)
        cred_path.write_text(creds)
        cred_path.chmod(0o600)

    # Download data from Kaggle to path and extract files at path location
    path = Path(dataFile)
    if not path.exists():
        import kaggle
        kaggle.api.competition_download_cli(str(path))
        zipfile.ZipFile(f'{path}.zip').extractall(path)

    return path
creds = ''
dataFile = 'titanic'
path = loadData(creds, dataFile)
# check data files
# set up default settings
import warnings, logging, torch
torch.set_printoptions(linewidth=140, sci_mode=False, edgeitems=7)
pd.set_option('display.width', 140)

Problem Statement


What sorts of people were more likely to survive the Titanic disaster?

  • use machine learning to create a model that predicts which passengers survived the Titanic shipwreck using passenger data (name, age, gender, socio-economic class, etc.)

Training Data

  • contains a subset of the passengers on board the Titanic (891 passengers) with information on whether they survived or not (ground truth)

Test(Inference) Data

  • contains the same information as the train data but does not disclose the ground truth (whether the passenger survived or not)
  • using patterns found in the train data, predict whether the other 418 passengers on board (test data) survived

Evaluation Goal

  • Predict if a passenger survived the sinking of the Titanic or not
  • For each passenger in the test set, predict a 0 or 1 value for the Survived variable

Evaluation Metric

  • Score is the percentage of passengers correctly predicted (accuracy)

Submission Format

  • PassengerId, Survived (contains binary predictions: 1 for survived, 0 for deceased)

Exploratory Data Analysis

Exploratory Data Analysis: Data Processing

# load data and view data
df = pd.read_csv(path/'train.csv')
891 rows × 12 columns

# Count number of missing values in each column
# - isna() marks each NaN value with 1 (True)
# - summation tells how many NaN values are in each column
df.isna().sum()
# Replace NaN with mode
# - replace missing values with something meaningful -> mean, median, mode etc
# - mode() can return several rows in case of ties, so select the first row
# find modes in different categories
modes = df.mode().iloc[0]
# Replace NaN with mode and verify there are no NaN values
df.fillna(modes, inplace=True)
df.isna().sum()
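To convince myself how the mode imputation works, here is a minimal sketch on a toy DataFrame. The values and the names `toy`, `toy_modes` and `filled` are made up for illustration and are not part of the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one missing entry per column
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 22.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

# isna() marks missing cells as True; summing counts them per column
missing_before = toy.isna().sum()

# mode() returns a DataFrame because a column can have ties;
# row 0 holds the first mode of each column
toy_modes = toy.mode().iloc[0]

# fillna with a Series fills each column with the value under the matching label
filled = toy.fillna(toy_modes)
print(filled)
```

Because `fillna` matches the Series labels to the column names, one call handles numeric and string columns alike.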
Exploratory Data Analysis: Numeric Data

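# summary of all numeric columns in data
df.describe(include=(np.number))
# histogram of fare data
# - the histogram has a long tail to the right
df['Fare'].hist()

# take the log of the fare data to compress the long tail (+1 avoids log(0) for zero fares)
df['LogFare'] = np.log(df['Fare'] + 1)
# histogram of the log fare data
df['LogFare'].hist()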
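A small sketch of what the log transform does to a long-tailed column. The fare values below are made up, and `np.log1p` is shown only as an equivalent way of writing `np.log(x + 1)`.

```python
import numpy as np

# Made-up fares with a long right tail
fares = np.array([0.0, 7.25, 8.05, 26.0, 512.33])

log_fares = np.log(fares + 1)    # same transform as LogFare above
log1p_fares = np.log1p(fares)    # equivalent, written with log1p

# Range of the values before and after the transform
spread_before = fares.max() - fares.min()
spread_after = log_fares.max() - log_fares.min()
print(spread_before, spread_after)
```

The largest fare no longer dwarfs the others after the transform, which is what keeps one column from dominating the sums in the linear model.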

Exploratory Data Analysis: Categorical Data

# Passenger Classes
pclasses = sorted(df.Pclass.unique())
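# Convert Categorical Data to Numerical Data - Dummy Variables
# - a dummy variable is a column that contains 1 where a particular column contains a particular value
#   and 0 otherwise
# create dummy variables for categorical variables
# - dtype=float keeps the dummies numeric (recent pandas versions default to bool)
df = pd.get_dummies(df, columns=["Sex","Pclass","Embarked"], dtype=float)
# names of the new dummy columns
added_cols = ['Sex_male', 'Sex_female', 'Pclass_1', 'Pclass_2', 'Pclass_3', 'Embarked_C', 'Embarked_Q', 'Embarked_S']
df[added_cols].head()
Sex_male Sex_female Pclass_1 Pclass_2 Pclass_3 Embarked_C Embarked_Q Embarked_S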
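Here is a minimal sketch of dummy variables on a three-row toy frame. The data and the names `toy` and `dummies` are made up; `dtype=float` is my own choice so the result can go straight into a float tensor.

```python
import pandas as pd

# Toy frame with two categorical columns (made-up rows)
toy = pd.DataFrame({"Sex": ["male", "female", "female"], "Pclass": [3, 1, 3]})

# One new column per distinct value; 1.0 where the row has that value, 0.0 otherwise
dummies = pd.get_dummies(toy, columns=["Sex", "Pclass"], dtype=float)
print(dummies)
```

Only the values that actually occur get a column, so `Pclass_2` does not appear in this toy example.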
Linear Model

Linear Model Variables

  • Independent Variables - predictors: all continuous variables + dummy variables
  • Dependent Variable - target: Survived
# Linear Model Data Processing
from torch import tensor

# independent(predictors)
indep_cols = ['Age', 'SibSp', 'Parch', 'LogFare'] + added_cols
t_indep = tensor(df[indep_cols].values, dtype=torch.float)

# dependent(target) variables - Survived
t_dep = tensor(df.Survived)

# print information about tensors
print(f"Independent Tensors Shape: {t_indep.shape}")
print(f"Independent Tensors Rank: {len(t_indep.shape)}")
print(f"Dependent Tensors Shape: {t_dep.shape}")
print(f"Dependent Tensors Rank: {len(t_dep.shape)}")
Independent Tensors Shape: torch.Size([891, 12])
Independent Tensors Rank: 2
Dependent Tensors Shape: torch.Size([891])
Dependent Tensors Rank: 1
# seeding the random number generator (torch.manual_seed) makes the random coefficients reproducible while experimenting
# - do not seed manually once you are done with experimentation

n_coeff = t_indep.shape[1]
coeffs = torch.rand(n_coeff) - 0.5
print(f"Coefficients shape: {coeffs.shape}")
print(f"Coefficients rank: {len(coeffs.shape)}")
print(f"Coefficients: {coeffs}")
Coefficients shape: torch.Size([12])
Coefficients rank: 1
Coefficients: tensor([-0.4629,  0.1386,  0.2409, -0.2262, -0.2632, -0.3147,  0.4876,  0.3136,  0.2799, -0.4392,  0.2103,  0.3625])
# element-wise multiplication using broadcasting - multiply every row by coefficients
#  - can be interpreted as looping 891 times and multiplying each row value by corresponding coeff value
t_indep * coeffs
# The sum of each row is dominated by Age since Age is much larger than all the other variables
# - rescale every column to between 0 and 1 by dividing it by its maximum value

# find max val in each column (dim=0 reduces over the rows)
vals,indices = t_indep.max(dim=0)
print(f"vals shape {vals.shape}")
print(f"vals rank: {len(vals.shape)}")

# - can be interpreted as looping 891 times and dividing each row value by corresponding value in vals
t_indep = t_indep / vals
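The broadcasting rule used above can be checked on a tiny tensor. This is a sketch with made-up numbers; `x`, `col_max`, `w` and `weighted` are illustrative names only.

```python
import torch

# 3 rows, 2 columns: the first column is much larger than the second
x = torch.tensor([[22.0, 1.0],
                  [38.0, 0.0],
                  [80.0, 1.0]])

# max over dim=0 collapses the rows and leaves one maximum per column
col_max, _ = x.max(dim=0)           # shape [2]

# [3, 2] / [2]: the vector is applied to every row (broadcasting)
x_norm = x / col_max

# The same rule applies to multiplying every row by a coefficient vector
w = torch.tensor([0.5, -1.0])
weighted = x_norm * w
print(x_norm)
print(weighted)
```

After the division every column tops out at 1, so no single column dominates the row sums.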
# Calculate Prediction
preds = (t_indep * coeffs).sum(axis=1)
print(f"first few predictions: {preds[:10]}")
first few predictions: tensor([ 0.1927, -0.6239,  0.0979,  0.2056,  0.0968,  0.0066,  0.1306,  0.3476,  0.1613, -0.6285])
# Loss Function -> Mean Absolute Error
# - Loss function is required for doing gradient descent
loss = torch.abs(preds - t_dep).mean()
print(f"Loss: {loss}")
Loss: 0.5382388234138489
# Functions for computing predictions and loss

# Compute Predictions
def calc_preds(coeffs, indeps):
  return (indeps * coeffs).sum(axis=1)

# Compute Loss
def calc_loss(coeffs, indeps, deps):
  return torch.abs(calc_preds(coeffs, indeps) - deps).mean()

Gradient Descent

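# Tell PyTorch to calculate gradients for the coefficients
coeffs.requires_grad_()
# Calculate loss
loss = calc_loss(coeffs, t_indep, t_dep)
loss
tensor(0.5382, grad_fn=<MeanBackward0>)
# Calculate gradients - backward() stores them in the grad attribute of coeffs
loss.backward()
# - each call to backward, gradients are added to the value stored in grad attribute
loss = calc_loss(coeffs, t_indep, t_dep)
loss.backward()
# take a single gradient step, then reset gradients to zero
loss = calc_loss(coeffs, t_indep, t_dep)
loss.backward()
with torch.no_grad():
  coeffs.sub_(coeffs.grad * 0.1)
  coeffs.grad.zero_()
  print(calc_loss(coeffs, t_indep, t_dep))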
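To see the gradient accumulation and a single descent step in isolation, here is a small self-contained sketch. The two-row dataset, the starting weights and the helper `toy_loss` are made up for illustration; the loss is the same mean absolute error used above.

```python
import torch

# Made-up data: 2 rows, 2 features, and their targets
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
y = torch.tensor([1.0, 0.0])
w = torch.tensor([1.0, -2.0], requires_grad=True)

def toy_loss(w):
    # mean absolute error of a linear model without a constant term
    return torch.abs((x * w).sum(axis=1) - y).mean()

loss_before = toy_loss(w)
loss_before.backward()
grad_once = w.grad.clone()

# A second backward() without zeroing adds to the stored gradient
toy_loss(w).backward()
grad_twice = w.grad.clone()

# Reset, recompute, then take one step against the gradient
w.grad.zero_()
toy_loss(w).backward()
with torch.no_grad():
    w.sub_(w.grad * 0.1)
    w.grad.zero_()
loss_after = toy_loss(w)
print(loss_before.item(), loss_after.item())
```

The doubled gradient after the second `backward()` is the reason the gradients are zeroed after each step.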

Linear Model Training

# Data split
from fastai.data.transforms import RandomSplitter
trn_split, val_split = RandomSplitter(seed=42)(df)

print(f"Training Data Size: {len(trn_split)}")
print(f"Validation Data Size: {len(val_split)}")
print(f"Training Data Indices: {trn_split}")
print(f"Validation Data Indices: {val_split}")
# Training Data, Validation Data
trn_indep, val_indep = t_indep[trn_split], t_indep[val_split]
trn_dep, val_dep = t_dep[trn_split], t_dep[val_split]

print(f"Training Independent Data Size: {len(trn_indep)}")
print(f"Training Dependent Data Size: {len(trn_dep)}")
print(f"Validation Independent Data Size: {len(val_indep)}")
print(f"Validation Dependent Data Size: {len(val_dep)}")
Training Independent Data Size: 713
Training Dependent Data Size: 713
Validation Independent Data Size: 178
Validation Dependent Data Size: 178
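The 713 / 178 sizes come from holding out 20% of the 891 rows. Below is a rough sketch of a random index split written with plain PyTorch. The helper `random_split` is hypothetical and is not fastai's implementation, so it will not reproduce the same indices as `RandomSplitter`.

```python
import torch

def random_split(n, valid_pct=0.2, seed=None):
    """Hypothetical helper: shuffle the indices 0..n-1 and cut off valid_pct for validation."""
    g = torch.Generator()
    if seed is not None:
        g.manual_seed(seed)
    idx = torch.randperm(n, generator=g)
    cut = int(valid_pct * n)
    # everything after the cut is training data, the first `cut` indices are validation data
    return idx[cut:], idx[:cut]

trn_idx, val_idx = random_split(891, seed=42)
print(len(trn_idx), len(val_idx))
```

Indexing a tensor with `trn_idx` or `val_idx` then selects the matching rows, the same way `t_indep[trn_split]` does above.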
# Randomly initialize coefficients
def init_coeffs():
  return (torch.rand(n_coeff) - 0.5).requires_grad_()

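# Update coefficients
def update_coeffs(coeffs, lr):
    coeffs.sub_(coeffs.grad * lr)
    # reset gradients so they don't accumulate across epochs
    coeffs.grad.zero_()

# One full gradient descent step
def one_epoch(coeffs, lr):
    loss = calc_loss(coeffs, trn_indep, trn_dep)
    # compute gradients of the loss with respect to the coefficients
    loss.backward()
    with torch.no_grad():
        update_coeffs(coeffs, lr)
    print(f"{loss:.3f}", end="; ")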

# Train model
def train_model(epochs=30, lr=0.01):
    coeffs = init_coeffs()
    for i in range(epochs):
      one_epoch(coeffs, lr=lr)
    return coeffs

# Calculate average accuracy of model
def acc(coeffs):
  return (val_dep.bool() == (calc_preds(coeffs, val_indep) > 0.5)).float().mean()

# Show coefficients for each column
def show_coeffs():
  return dict(zip(indep_cols, coeffs.requires_grad_(False)))
# Train Model
coeffs = train_model(18, lr=0.2)
# Coefficients for every column
{'Age': tensor(-0.2694),
 'SibSp': tensor(0.0901),
 'Parch': tensor(0.2359),
 'LogFare': tensor(0.0280),
 'Sex_male': tensor(-0.3990),
 'Sex_female': tensor(0.2345),
 'Pclass_1': tensor(0.7232),
 'Pclass_2': tensor(0.4112),
 'Pclass_3': tensor(0.3601),
 'Embarked_C': tensor(0.0955),
 'Embarked_Q': tensor(0.2395),
 'Embarked_S': tensor(0.2122)}
# Calculate accuracy
preds = calc_preds(coeffs, val_indep)

# - assume that any passenger with score > 0.5 is predicted to survive
# - correct for  each row where preds > 0.5 is the same as dependent variable
results = val_dep.bool()==(preds > 0.5)
print(f"First 16 results: {results[:16]}")

# Average accuracy
avg_acc = results.float().mean()
print(f"Average Accuracy: {avg_acc}")
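The accuracy calculation is easier to follow on four hand-written predictions. This is a sketch with made-up numbers; `toy_preds` and `toy_targets` are illustrative names.

```python
import torch

toy_preds = torch.tensor([0.9, 0.2, 0.6, 0.4])   # made-up model scores
toy_targets = torch.tensor([1, 0, 0, 0])         # made-up ground truth

# A prediction counts as "survived" when the score is above 0.5;
# it is correct when that matches the target
toy_results = toy_targets.bool() == (toy_preds > 0.5)

# The mean of the 0/1 results is the fraction of correct predictions
toy_acc = toy_results.float().mean()
print(toy_results, toy_acc)
```

Three of the four rows match here, so the accuracy comes out as 0.75.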
# - some of the predictions of the survival probability are > 1 or < 0
# - can fix this issue by passing prediction through sigmoid function
# Sigmoid function squashes any input into the range between 0 and 1
import sympy
sympy.plot("1/(1+exp(-x))", xlim=(-5,5));

# Update calc predictions to use sigmoid
def calc_preds(coeffs, indeps):
  return torch.sigmoid((indeps*coeffs).sum(axis=1))
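A quick check of what the sigmoid does to raw scores, as a sketch with made-up inputs. The manual formula is the same one plotted with sympy above.

```python
import torch

# Made-up raw scores, including some far outside the 0..1 range
raw = torch.tensor([-10.0, -1.0, 0.0, 1.0, 10.0])

probs = torch.sigmoid(raw)
manual = 1 / (1 + torch.exp(-raw))   # same formula written out by hand
print(probs)
```

Large negative scores end up close to 0, large positive scores close to 1, and a score of 0 maps to exactly 0.5, which is why 0.5 stays a sensible threshold.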
# Train Model with Sigmoid Predictions
coeffs = train_model(lr=100)
# check coeffcients
{'Age': tensor(-1.5061),
 'SibSp': tensor(-1.1575),
 'Parch': tensor(-0.4267),
 'LogFare': tensor(0.2543),
 'Sex_male': tensor(-10.3320),
 'Sex_female': tensor(8.4185),
 'Pclass_1': tensor(3.8389),
 'Pclass_2': tensor(2.1398),
 'Pclass_3': tensor(-6.2331),
 'Embarked_C': tensor(1.4771),
 'Embarked_Q': tensor(2.1168),
# Calculate accuracy
acc(coeffs)
tensor(0.8258)

Final Linear Model Setup

# number of coefficients
n_coeff = t_indep.shape[1]

# Randomly initialize coefficients
def init_coeffs():
  return (torch.rand(n_coeff)-0.5).requires_grad_()

# Loss Function - MAE (Mean Absolute Error)
def calc_loss(coeffs, indeps, deps):
  return torch.abs(calc_preds(coeffs, indeps) - deps).mean()

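# Update coefficients
def update_coeffs(coeffs, lr):
    coeffs.sub_(coeffs.grad * lr)
    # reset gradients so they don't accumulate across epochs
    coeffs.grad.zero_()

# One full gradient descent step
def one_epoch(coeffs, lr):
    loss = calc_loss(coeffs, trn_indep, trn_dep)
    # compute gradients of the loss with respect to the coefficients
    loss.backward()
    with torch.no_grad():
      update_coeffs(coeffs, lr)
    print(f"{loss:.3f}", end="; ")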

# Train model
def train_model(epochs=30, lr=0.01):
    coeffs = init_coeffs()
    for i in range(epochs):
      one_epoch(coeffs, lr=lr)
    return coeffs

# Calculate average accuracy of model
def acc(coeffs):
  return (val_dep.bool() == (calc_preds(coeffs, val_indep) > 0.5)).float().mean()

# Calculate predictions
def calc_preds(coeffs, indeps):
  return torch.sigmoid((indeps * coeffs).sum(axis=1))

# Show coefficients for each column
def show_coeffs():
  return dict(zip(indep_cols, coeffs.requires_grad_(False)))
Test model on inference data

Inference Data Processing

# load data
tst_df = pd.read_csv(path/'test.csv')

# One passenger is missing a Fare value -> substitute 0 to fix the issue
tst_df['Fare'] = tst_df.Fare.fillna(0)
# - these steps follow the same process as the training data processing
tst_df.fillna(modes, inplace=True)
tst_df['LogFare'] = np.log(tst_df['Fare'] + 1)
# - dtype=float keeps the dummies numeric (recent pandas versions default to bool)
tst_df = pd.get_dummies(tst_df, columns=["Sex","Pclass","Embarked"], dtype=float)

added_cols = ['Sex_male', 'Sex_female', 'Pclass_1', 'Pclass_2', 'Pclass_3', 'Embarked_C', 'Embarked_Q', 'Embarked_S']
indep_cols = ['Age', 'SibSp', 'Parch', 'LogFare'] + added_cols

tst_indep = tensor(tst_df[indep_cols].values, dtype=torch.float)
tst_indep = tst_indep / vals
# Calculate predictions of which passengers survived in titanic dataset
tst_df['Survived'] = (calc_preds(coeffs, tst_indep) > 0.5).int()

Linear Model: Submit Results to Kaggle

# Submit to Kaggle
sub_df = tst_df[['PassengerId','Survived']]
sub_df.to_csv('sub.csv', index=False)

# check first few rows
Cleaning up Linear Model Code

# - Multiplying elements together and then adding across rows is the same as matrix-vector multiply
# Original Matrix-Vector multiply
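The claim that multiply-then-sum equals a matrix-vector product can be checked directly. This is a sketch on random made-up tensors; `x`, `w` and `w_col` are illustrative names.

```python
import torch

torch.manual_seed(0)
x = torch.rand(5, 3)   # 5 rows, 3 features
w = torch.rand(3)      # one coefficient per feature

# Element-wise multiply, then add across each row
elementwise = (x * w).sum(axis=1)

# The same computation as a matrix-vector product
matvec = x @ w

# With a column vector of coefficients the result is a [5, 1] matrix instead
w_col = w[:, None]
matmat = x @ w_col
print(elementwise.shape, matvec.shape, matmat.shape)
```

The column-vector form is the one that generalizes to several outputs per row, which is what the hidden layers of a neural network need.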
Linear Model: PyTorch Matrix-Vector Multiply

# number of coefficients
n_coeff = t_indep.shape[1]

# Randomly initialize coefficients
def init_coeffs():
  # - the second dimension of size 1 turns the torch.rand() output into a column vector (n_coeff x 1)
  return (torch.rand(n_coeff, 1) * 0.1).requires_grad_()

# Loss Function - MAE (Mean Absolute Error)
def calc_loss(coeffs, indeps, deps):
  return torch.abs(calc_preds(coeffs, indeps) - deps).mean()

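# Update coefficients
def update_coeffs(coeffs, lr):
    coeffs.sub_(coeffs.grad * lr)
    # reset gradients so they don't accumulate across epochs
    coeffs.grad.zero_()

# One full gradient descent step
def one_epoch(coeffs, lr):
    loss = calc_loss(coeffs, trn_indep, trn_dep)
    # compute gradients of the loss with respect to the coefficients
    loss.backward()
    with torch.no_grad():
      update_coeffs(coeffs, lr)
    print(f"{loss:.3f}", end="; ")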

# Train model
def train_model(epochs=30, lr=0.01):
    coeffs = init_coeffs()
    for i in range(epochs):
      one_epoch(coeffs, lr=lr)
    return coeffs

# Calculate average accuracy of model
def acc(coeffs):
  return (val_dep.bool() == (calc_preds(coeffs, val_indep) > 0.5)).float().mean()

# Calculate predictions
def calc_preds(coeffs, indeps):
  return torch.sigmoid(indeps@coeffs)

# Show coefficients for each column
def show_coeffs():
  return dict(zip(indep_cols, coeffs.requires_grad_(False)))
# change dependent variable into a column vector - rank 2 tensor
trn_dep = trn_dep[:,None]
val_dep = val_dep[:,None]
print(f"Training Data Shape: {trn_dep.shape}")
print(f"Training Data Rank: {len(trn_dep.shape)}")
print(f"Validation Data Shape: {val_dep.shape}")
print(f"Validation Data Rank: {len(val_dep.shape)}")
Training Data Shape: torch.Size([713, 1])
Training Data Rank: 2
Validation Data Shape: torch.Size([178, 1])
Validation Data Rank: 2
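The reason the dependent variable has to become a column vector is worth seeing once. The sketch below uses made-up tensors to show what broadcasting does when a [n, 1] prediction meets a rank-1 target.

```python
import torch

toy_preds = torch.zeros(4, 1)   # predictions from a matrix product: shape [4, 1]
toy_deps = torch.ones(4)        # rank-1 targets: shape [4]

# [4, 1] - [4] broadcasts to [4, 4]: every prediction is compared with every target
wrong = toy_preds - toy_deps

# Turning the targets into a column vector keeps the comparison row by row
right = toy_preds - toy_deps[:, None]
print(wrong.shape, right.shape)
```

No error is raised in the first case, so the loss would silently be computed over the wrong pairs.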
coeffs = train_model(lr=100)
print(f"coefficients shape: {coeffs.shape}")
print(f"coefficients rank: {len(coeffs.shape)}")
print(f"accuracy: {acc(coeffs)}")
Neural Network

  • Define coefficients for each layer of the neural network

  • n_hidden - a higher number gives the neural network more flexibility to approximate the data, but makes it slower and harder to train

First Layer

  • input - n_coeff values
  • output - n_hidden values (input to second layer)
  • need a matrix of size n_coeff by n_hidden
  • divide the coefficients by n_hidden so that when we sum them up in the next layer we end up with numbers of similar magnitude to what we started with

Second Layer

  • input - n_hidden values (output of first layer)
  • output - 1 value
  • need a matrix of size n_hidden by 1, plus a constant term

Steps

  1. Two matrix products - indeps@l1 and res@l2 (res is output of first layer)
  2. First layer output is passed to F.relu (non-linearity)
  3. Second layer output is passed to sigmoid

def init_coeffs(n_hidden=20):
    # set of coefficients to go from input to hidden
    layer1 = (torch.rand(n_coeff, n_hidden) - 0.5) / n_hidden

    # set of coefficients to go from hidden to an output
    layer2 = torch.rand(n_hidden, 1) - 0.3
    const = torch.rand(1)[0]

    # return a tuple of layer1, layer2 and the constant, each with gradient tracking enabled
    return layer1.requires_grad_(), layer2.requires_grad_(), const.requires_grad_()
import torch.nn.functional as F

# neural network
def calc_preds(coeffs, indeps):
    l1, l2, const = coeffs

    # layer 1
    # replace negative values with zeroes
    res = F.relu(indeps@l1)

    # layer 2
    res = res@l2 + const
    return torch.sigmoid(res)
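It helped me to trace the tensor shapes through the two layers. The sketch below uses 4 made-up input rows with 12 features and 20 hidden units; the variable names are illustrative and the weights are random, so only the shapes and value ranges are meaningful.

```python
import torch
import torch.nn.functional as F

n_rows, n_in, n_hid = 4, 12, 20     # made-up batch size, inputs, hidden units

x = torch.rand(n_rows, n_in)
l1 = (torch.rand(n_in, n_hid) - 0.5) / n_hid
l2 = torch.rand(n_hid, 1) - 0.3
const = torch.rand(1)[0]

# Layer 1: [4, 12] @ [12, 20] -> [4, 20]; relu replaces negatives with zero
hidden = F.relu(x @ l1)

# Layer 2: [4, 20] @ [20, 1] -> [4, 1]; sigmoid maps each value into (0, 1)
out = torch.sigmoid(hidden @ l2 + const)
print(hidden.shape, out.shape)
```

Each input row ends up as a single number between 0 and 1, which is why the dependent variable also needs shape [n, 1].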
# Update Coefficients
# - Three sets of coefficients to update per epoch(layer1, layer2, constant)
def update_coeffs(coeffs, lr):
  for layer in coeffs:
    layer.sub_(layer.grad * lr)
    # reset gradients so they don't accumulate across epochs
    layer.grad.zero_()
# Train Model
coeffs = train_model(lr=1.4)
# Train Model
coeffs = train_model(lr=20)
Deep Learning

def init_coeffs():
    # size of each hidden layer
    # two hidden layers - 10 activations in each layer
    hiddens = [10, 10]
    # - n_coeff to 10
    # - 10 to 10
    # - 10 to 1
    sizes = [n_coeff] + hiddens + [1]
    n = len(sizes)
    layers = [(torch.rand(sizes[i], sizes[i + 1])- 0.3)/sizes[i + 1]*4 for i in range(n - 1)]
    consts = [(torch.rand(1)[0] - 0.5) * 0.1 for i in range(n - 1)]
    # enable gradient tracking for every layer and constant
    for l in layers + consts:
        l.requires_grad_()
    return layers,consts
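The `sizes` list is what determines the shape of every weight matrix. A tiny sketch of that bookkeeping in plain Python, assuming the 12 input columns from earlier; the name `shapes` is illustrative.

```python
n_coeff = 12          # number of input columns, as printed for t_indep earlier
hiddens = [10, 10]    # two hidden layers with 10 activations each

sizes = [n_coeff] + hiddens + [1]

# Each layer maps sizes[i] inputs to sizes[i + 1] outputs
shapes = [(sizes[i], sizes[i + 1]) for i in range(len(sizes) - 1)]
print(shapes)
```

Adding another entry to `hiddens` adds one more matrix to the chain without touching the rest of the code.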
import torch.nn.functional as F

def calc_preds(coeffs, indeps):
    layers,consts = coeffs
    n = len(layers)
    res = indeps
    for i,l in enumerate(layers):
        res = res@l + consts[i]
        # RELU for every layer except for last layer
        if i != n - 1:
          res = F.relu(res)
        # sigmoid only for the last layer
    return torch.sigmoid(res)
def update_coeffs(coeffs, lr):
    layers,consts = coeffs
    for layer in layers + consts:
        layer.sub_(layer.grad * lr)
        # reset gradients so they don't accumulate across epochs
        layer.grad.zero_()
# Train Model
coeffs = train_model(lr=4)
Framework: fastai + PyTorch

# load stuff
from fastai.tabular.all import *

pd.options.display.float_format = '{:.2f}'.format
# Load Data
df = pd.read_csv(path/'train.csv')

# Feature Engineering
def add_features(df):
    df['LogFare'] = np.log1p(df['Fare'])
    df['Deck'] = df.Cabin.str[0].map(dict(A="ABC", B="ABC", C="ABC", D="DE", E="DE", F="FG", G="FG"))
    df['Family'] = df.SibSp+df.Parch
    df['Alone'] = df.Family==0
    df['TicketFreq'] = df.groupby('Ticket')['Ticket'].transform('count')
    df['Title'] = df.Name.str.split(', ', expand=True)[1].str.split('.', expand=True)[0]
    df['Title'] = df.Title.map(dict(Mr="Mr",Miss="Miss",Mrs="Mrs",Master="Master"))

add_features(df)
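Some of the pandas one-liners in the feature engineering step are dense, so here is a sketch of what they produce on three rows. The rows below are written by hand for illustration and the variable names are my own.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina", "Uruchurtu, Don. Manuel E"],
    "Ticket": ["A/5 21171", "A/5 21171", "PC 17601"],
    "SibSp": [1, 0, 0],
    "Parch": [0, 0, 0],
})

# Take the text after ", " and before the first "." to get the title
title = toy.Name.str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
# Keep only the common titles; anything else becomes NaN
title = title.map(dict(Mr="Mr", Miss="Miss", Mrs="Mrs", Master="Master"))

# How many passengers share each ticket
ticket_freq = toy.groupby("Ticket")["Ticket"].transform("count")

# Family size, and whether the passenger travels alone
family = toy.SibSp + toy.Parch
alone = family == 0
print(title.tolist(), ticket_freq.tolist(), alone.tolist())
```

Rare titles such as "Don" fall outside the mapping and are left as missing values.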

# Data Split
splits = RandomSplitter(seed=42)(df)
# Tabular Dataloaders
dls = TabularPandas(
    # splits for indices of training and validation sets
    df, splits=splits,
    # Turn strings into categories, fill missing values in numeric columns with the median, normalise all numeric columns
    procs = [Categorify, FillMissing, Normalize],
    # categorical independent variables
    cat_names=["Sex","Pclass","Embarked","Deck", "Title"],
    # continuous independent variables
    cont_names=['Age', 'SibSp', 'Parch', 'LogFare', 'Alone', 'TicketFreq', 'Family'],
    # dependent variable
    y_names="Survived",
    # dependent variable is categorical(build a classification model)
    y_block = CategoryBlock(),
).dataloaders(path=".")
# Train Model
# - data + model = Learner
# - dls -> data
# - layers -> size of each hidden layer
# - metrics -> metrics reported during training; they are not used as the loss function
learn = tabular_learner(dls, metrics=accuracy, layers=[10,10])
# find learning rate
learn.lr_find(suggest_funcs=(slide, valley))
# specify number of epochs (n_epochs) and learning rate and train model
learn.fit(n_epochs, lr=0.03)
Test fastai model on inference data

# Inference Data Processing
tst_df = pd.read_csv(path/'test.csv')
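tst_df['Fare'] = tst_df.Fare.fillna(0)
# add the same engineered features as for the training data
add_features(tst_df)
# apply the preprocessing learned from the training data (procs) to the inference data
tst_dl = learn.dls.test_dl(tst_df)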
# get predictions for the inference data
preds,_ = learn.get_preds(dl=tst_dl)

fastai Model: Submit to Kaggle

# submit to kaggle
tst_df['Survived'] = (preds[:,1] > 0.5).int()
sub_df = tst_df[['PassengerId','Survived']]
sub_df.to_csv('framework_sub.csv', index=False)

# check predictions file
Ensembling

  • create multiple models and combine predictions
def ensemble():
    learn = tabular_learner(dls, metrics=accuracy, layers=[10,10])
    # n_epochs: number of training epochs, lr: learning rate
    with learn.no_bar(),learn.no_logging(): learn.fit(n_epochs, lr=0.03)
    return learn.get_preds(dl=tst_dl)[0]
# create a set of 5 different predictions
learns = [ensemble() for _ in range(5)]
# take average of all predictions
ens_preds = torch.stack(learns).mean(0)
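The averaging step can be sketched with three hand-written prediction tensors standing in for the models. All numbers are made up; each row holds the probabilities for "did not survive" and "survived".

```python
import torch

# Made-up class probabilities from three models for two passengers
p1 = torch.tensor([[0.9, 0.1], [0.4, 0.6]])
p2 = torch.tensor([[0.7, 0.3], [0.6, 0.4]])
p3 = torch.tensor([[0.8, 0.2], [0.2, 0.8]])

# stack adds a new leading dimension: [3 models, 2 passengers, 2 classes]
stacked = torch.stack([p1, p2, p3])

# mean(0) averages over the models and leaves [2 passengers, 2 classes]
avg = stacked.mean(0)

# Column 1 is the probability of survival
survived = (avg[:, 1] > 0.5).int()
print(avg, survived)
```

The second model on its own would have predicted that the second passenger died, but the average still lands on the prediction the other two models agree on.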

Ensembling: Submit to Kaggle

# submit to kaggle
tst_df['Survived'] = (ens_preds[:,1] > 0.5).int()
sub_df = tst_df[['PassengerId','Survived']]
sub_df.to_csv('ensemble_sub.csv', index=False)

# check predictions file
