FastAI Lesson 5: From-scratch model

FastAI Lesson 5 Notes
learning
fastai
deep learning
Author

Pranav Rajan

Published

January 17, 2024

Acknowledgements

All of this code was written by Jeremy Howard and the FastAI team. I modified it slightly to include my own print statements, comments, and additional helper functions based on Jeremy's code. The original code comes from Linear model and neural net from scratch and Why you should use a framework.

Summary

In this lesson, Jeremy goes over training a model from scratch as a linear model, a neural network, and a deep learning model, before finally walking through training a model with fastai + PyTorch and an ensemble. This lesson actually spans lesson 3, lesson 5, and part of lesson 6, so I had to go back and review lesson 3 to make sure I understood the material for lesson 5. I highly recommend going over lesson 3 and chapter 4 before this lesson because Jeremy doesn't go as deep into the meaning of tensor shape and rank as he does in chapter 4. This lesson was really exciting from the programming side because I learned more about Python and numerical programming with partials (see the sketch below), broadcasting, data cleaning with pandas, and feature engineering.
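
Since partials aren't actually used in this notebook's code, here is a minimal sketch of functools.partial in the spirit of Jeremy's quadratic example from How does a neural net really work? (my own example, not from the lesson):

# functools.partial fixes some arguments of a function and returns a new function
from functools import partial

def quad(x, a, b, c):
    return a * x**2 + b * x + c

f = partial(quad, a=3, b=2, c=1)  # a quadratic with its coefficients fixed
print(f(1.5))  # 3*1.5**2 + 2*1.5 + 1 = 10.75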

Titanic - Machine Learning From Disaster

The Titanic - Machine Learning from Disaster Competition is used as the case study for this lesson. More information about the data can be found here Titanic - Machine Learning from Disaster.

Load Data and Libraries

# import libraries and files

# required libraries + packages for any ml/data science project
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch

# the fastai library includes and wraps all the packages above
!pip install -Uqq fastai

# kaggle API package install
!pip install kaggle
from fastai.imports import *
import os
from pathlib import Path
import zipfile


def loadData(creds, dataFile):
    '''Function for loading Kaggle datasets locally or on Kaggle.
    Returns a local path to the data files.
    - input: Kaggle API login credentials, Kaggle contest name'''
    # variable to check whether we're running on the Kaggle website or not
    iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

    # path for kaggle API credentials
    cred_path = Path('~/.kaggle/kaggle.json').expanduser()

    if not cred_path.exists():
        cred_path.parent.mkdir(exist_ok=True)
        cred_path.write_text(creds)
        cred_path.chmod(0o600)

    # Download data from Kaggle to path and extract files at path location

    # local machine
    path = Path(dataFile)
    if not iskaggle and not path.exists():
        import kaggle
        kaggle.api.competition_download_cli(str(path))
        zipfile.ZipFile(f'{path}.zip').extractall(path)

    # kaggle
    if iskaggle:
        fileName = '../input/' + dataFile
        path = fileName

    return path
creds = ''
dataFile = 'titanic'
path = loadData(creds, dataFile)
Downloading titanic.zip to /content
100%|██████████| 34.1k/34.1k [00:00<00:00, 38.6MB/s]
# check data files
!ls {path}
gender_submission.csv  test.csv  train.csv
# set up default settings
import warnings, logging, torch
torch.set_printoptions(linewidth=140, sci_mode=False, edgeitems=7)
pd.set_option('display.width', 140)
warnings.simplefilter('ignore')
logging.disable(logging.WARNING)

Problem Statement

Problem

What sorts of people were more likely to survive the Titanic disaster?

  • use machine learning to create a model that predicts which passengers survived the Titanic shipwreck using passenger data (name, age, gender, socio-economic class, etc.)

Training Data

  • contains a subset of the passengers on board the Titanic (891 passengers), with information on whether they survived or not (the ground truth)

Test(Inference) Data

  • contains the same information as the train data but does not disclose the ground truth (whether a passenger survived or not)
  • using patterns found in the train data, predict whether the other 418 passengers on board (the test data) survived

Evaluation Goal

  • Predict if a passenger survived the sinking of the Titanic or not
  • For each passenger in the test set, predict a 0 or 1 value for the Survived variable

Evaluation Metric

  • Score is the percentage of passengers correctly predicted (accuracy) - see the quick sketch below
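
As a quick illustration of the metric (my own toy numbers, not competition data), accuracy is just the fraction of predictions that match the ground truth:

import torch  # already imported above; repeated so the snippet stands alone
preds = torch.tensor([1, 0, 1, 1])
truth = torch.tensor([1, 0, 0, 1])
# 3 of the 4 predictions match -> accuracy 0.75
print((preds == truth).float().mean())  # tensor(0.7500)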

Submission Format

  • PassengerId, Survived (contains binary predictions: 1 for survived, 0 for deceased)

Exploratory Data Analysis

Exploratory Data Analysis: Data Processing

# load data and view data
df = pd.read_csv(path/'train.csv')
df
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

# Count number of missing values in each column
# - isna() marks each missing (NaN) value with True (1)
# - summing tells how many NaN values are in each column
df.isna().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
# Replace NaN with the mode
# - replace missing values with something meaningful -> mean, median, mode, etc.
# - in case of ties, select the first value
# find the mode of each column
modes = df.mode().iloc[0]
modes
PassengerId                      1
Survived                       0.0
Pclass                         3.0
Name           Abbing, Mr. Anthony
Sex                           male
Age                           24.0
SibSp                          0.0
Parch                          0.0
Ticket                        1601
Fare                          8.05
Cabin                      B96 B98
Embarked                         S
Name: 0, dtype: object
# Replace NaN with mode and verify there are no NaN values
df.fillna(modes, inplace=True)
df.isna().sum()
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64

Exploratory Data Analysis: Numeric Data

# summary of all numeric columns in data
df.describe(include=(np.number))
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 891.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 28.566970 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 13.199572 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 22.000000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 24.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 35.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
# histogram of fare data
# - the distribution has a long tail to the right
df['Fare'].hist();

# histogram of log fare data - the log tames the long right tail (+1 avoids log(0))
df['LogFare'] = np.log(df['Fare'] + 1)
df['LogFare'].hist();
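
One small note (my addition): np.log(x + 1) is exactly np.log1p(x), which is what the framework section's add_features uses later.

# np.log(Fare + 1) and np.log1p(Fare) are the same transform
print(np.allclose(np.log(df['Fare'] + 1), np.log1p(df['Fare'])))  # True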

Exploratory Data Analysis: Categorical Data

# Passenger Classes
pclasses = sorted(df.Pclass.unique())
pclasses
[1, 2, 3]
# non-numeric data summary
df.describe(include=[object])
Name Sex Ticket Cabin Embarked
count 891 891 891 891 891
unique 891 2 681 147 3
top Braund, Mr. Owen Harris male 347082 B96 B98 S
freq 1 577 7 691 646
# Convert Categorical Data to Numerical Data - Dummy Variables
# - A dummy variable is a column that contains 1 where a particular column contains a particular value,
# and 0 otherwise
# create dummy variables for categorical variables
df = pd.get_dummies(df, columns=["Sex","Pclass","Embarked"])
df.columns
Index(['PassengerId', 'Survived', 'Name', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'LogFare', 'Sex_female', 'Sex_male',
       'Pclass_1', 'Pclass_2', 'Pclass_3', 'Embarked_C', 'Embarked_Q', 'Embarked_S'],
      dtype='object')
# new data with dummy variables
added_cols = ['Sex_male', 'Sex_female', 'Pclass_1', 'Pclass_2', 'Pclass_3', 'Embarked_C', 'Embarked_Q', 'Embarked_S']
df[added_cols].head()
Sex_male Sex_female Pclass_1 Pclass_2 Pclass_3 Embarked_C Embarked_Q Embarked_S
0 1 0 0 0 1 0 0 1
1 0 1 1 0 0 1 0 0
2 0 1 0 0 1 0 0 1
3 0 1 1 0 0 0 0 1
4 1 0 0 0 1 0 0 1

Linear Model

Linear Model Variables

  • Independent Variables - predictors: all continuous variables + dummy variables
  • Dependent Variable - target: Survived
# Linear Model Data Processing
from torch import tensor

# independent (predictor) variables
indep_cols = ['Age', 'SibSp', 'Parch', 'LogFare'] + added_cols
t_indep = tensor(df[indep_cols].values, dtype=torch.float)

# dependent (target) variable - Survived
t_dep = tensor(df.Survived)

# print information about tensors
print(f"Indendent Tensors Shape: {t_indep.shape}")
print(f"Independent Tensors Rank: {len(t_indep.shape)}")
print(f"Dependent Tensors Shape: {t_dep.shape}")
print(f"Dependent Tensors Rank: {len(t_dep.shape)}")
Independent Tensors Shape: torch.Size([891, 12])
Independent Tensors Rank: 2
Dependent Tensors Shape: torch.Size([891])
Dependent Tensors Rank: 1
# do not seed manually in practice -> this is only to ensure reproducibility during experimentation
# remove the manual seed when done with experimentation
torch.manual_seed(442)

n_coeff = t_indep.shape[1]
coeffs = torch.rand(n_coeff) - 0.5
print(f"Coefficients shape: {coeffs.shape}")
print(f"Coefficients rank: {len(coeffs.shape)}")
print(f"Coefficients: {coeffs}")
Coefficients shape: torch.Size([12])
Coefficients rank: 1
Coefficients: tensor([-0.4629,  0.1386,  0.2409, -0.2262, -0.2632, -0.3147,  0.4876,  0.3136,  0.2799, -0.4392,  0.2103,  0.3625])
# element-wise multiplication using broadcasting - multiply every row by the coefficients
# - can be interpreted as looping 891 times and multiplying each row's values by the corresponding coeff values (see the check after the output below)
t_indep * coeffs
tensor([[-10.1838,   0.1386,   0.0000,  -0.4772,  -0.2632,  -0.0000,   0.0000,   0.0000,   0.2799,  -0.0000,   0.0000,   0.3625],
        [-17.5902,   0.1386,   0.0000,  -0.9681,  -0.0000,  -0.3147,   0.4876,   0.0000,   0.0000,  -0.4392,   0.0000,   0.0000],
        [-12.0354,   0.0000,   0.0000,  -0.4950,  -0.0000,  -0.3147,   0.0000,   0.0000,   0.2799,  -0.0000,   0.0000,   0.3625],
        [-16.2015,   0.1386,   0.0000,  -0.9025,  -0.0000,  -0.3147,   0.4876,   0.0000,   0.0000,  -0.0000,   0.0000,   0.3625],
        [-16.2015,   0.0000,   0.0000,  -0.4982,  -0.2632,  -0.0000,   0.0000,   0.0000,   0.2799,  -0.0000,   0.0000,   0.3625],
        [-11.1096,   0.0000,   0.0000,  -0.5081,  -0.2632,  -0.0000,   0.0000,   0.0000,   0.2799,  -0.0000,   0.2103,   0.0000],
        [-24.9966,   0.0000,   0.0000,  -0.8973,  -0.2632,  -0.0000,   0.4876,   0.0000,   0.0000,  -0.0000,   0.0000,   0.3625],
        ...,
        [-11.5725,   0.0000,   0.0000,  -0.4717,  -0.2632,  -0.0000,   0.0000,   0.0000,   0.2799,  -0.0000,   0.0000,   0.3625],
        [-18.0531,   0.0000,   1.2045,  -0.7701,  -0.0000,  -0.3147,   0.0000,   0.0000,   0.2799,  -0.0000,   0.2103,   0.0000],
        [-12.4983,   0.0000,   0.0000,  -0.5968,  -0.2632,  -0.0000,   0.0000,   0.3136,   0.0000,  -0.0000,   0.0000,   0.3625],
        [ -8.7951,   0.0000,   0.0000,  -0.7766,  -0.0000,  -0.3147,   0.4876,   0.0000,   0.0000,  -0.0000,   0.0000,   0.3625],
        [-11.1096,   0.1386,   0.4818,  -0.7229,  -0.0000,  -0.3147,   0.0000,   0.0000,   0.2799,  -0.0000,   0.0000,   0.3625],
        [-12.0354,   0.0000,   0.0000,  -0.7766,  -0.2632,  -0.0000,   0.4876,   0.0000,   0.0000,  -0.4392,   0.0000,   0.0000],
        [-14.8128,   0.0000,   0.0000,  -0.4905,  -0.2632,  -0.0000,   0.0000,   0.0000,   0.2799,  -0.0000,   0.2103,   0.0000]])
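To convince myself of that looping interpretation, here is a quick check (my addition) that broadcasting matches an explicit Python loop over the rows:

# sanity check: the broadcast product equals an explicit row-by-row loop
manual = torch.stack([row * coeffs for row in t_indep])
print(torch.allclose(manual, t_indep * coeffs))  # True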
# The sum of each row is dominated by Age since Age is larger than all the other variables
# scale the data to between 0 and 1 by dividing each column by its maximum

# find the max value in each column (max over dim=0 reduces across the rows)
vals,indices = t_indep.max(dim=0)
print(f"vals shape {vals.shape}")
print(f"vals rank: {len(vals.shape)}")

# - can be interpreted as looping 891 times and dividing each row value by corresponding value in vals
t_indep = t_indep / vals
vals shape torch.Size([12])
vals rank: 1
# Recompute the products using the newly scaled independent variable values
t_indep * coeffs
tensor([[-0.1273,  0.0173,  0.0000, -0.0765, -0.2632, -0.0000,  0.0000,  0.0000,  0.2799, -0.0000,  0.0000,  0.3625],
        [-0.2199,  0.0173,  0.0000, -0.1551, -0.0000, -0.3147,  0.4876,  0.0000,  0.0000, -0.4392,  0.0000,  0.0000],
        [-0.1504,  0.0000,  0.0000, -0.0793, -0.0000, -0.3147,  0.0000,  0.0000,  0.2799, -0.0000,  0.0000,  0.3625],
        [-0.2025,  0.0173,  0.0000, -0.1446, -0.0000, -0.3147,  0.4876,  0.0000,  0.0000, -0.0000,  0.0000,  0.3625],
        [-0.2025,  0.0000,  0.0000, -0.0798, -0.2632, -0.0000,  0.0000,  0.0000,  0.2799, -0.0000,  0.0000,  0.3625],
        [-0.1389,  0.0000,  0.0000, -0.0814, -0.2632, -0.0000,  0.0000,  0.0000,  0.2799, -0.0000,  0.2103,  0.0000],
        [-0.3125,  0.0000,  0.0000, -0.1438, -0.2632, -0.0000,  0.4876,  0.0000,  0.0000, -0.0000,  0.0000,  0.3625],
        ...,
        [-0.1447,  0.0000,  0.0000, -0.0756, -0.2632, -0.0000,  0.0000,  0.0000,  0.2799, -0.0000,  0.0000,  0.3625],
        [-0.2257,  0.0000,  0.2008, -0.1234, -0.0000, -0.3147,  0.0000,  0.0000,  0.2799, -0.0000,  0.2103,  0.0000],
        [-0.1562,  0.0000,  0.0000, -0.0956, -0.2632, -0.0000,  0.0000,  0.3136,  0.0000, -0.0000,  0.0000,  0.3625],
        [-0.1099,  0.0000,  0.0000, -0.1244, -0.0000, -0.3147,  0.4876,  0.0000,  0.0000, -0.0000,  0.0000,  0.3625],
        [-0.1389,  0.0173,  0.0803, -0.1158, -0.0000, -0.3147,  0.0000,  0.0000,  0.2799, -0.0000,  0.0000,  0.3625],
        [-0.1504,  0.0000,  0.0000, -0.1244, -0.2632, -0.0000,  0.4876,  0.0000,  0.0000, -0.4392,  0.0000,  0.0000],
        [-0.1852,  0.0000,  0.0000, -0.0786, -0.2632, -0.0000,  0.0000,  0.0000,  0.2799, -0.0000,  0.2103,  0.0000]])
# Calculate Prediction
preds = (t_indep * coeffs).sum(axis=1)
print(f"first few predictions: {preds[:10]}")
first few predictions: tensor([ 0.1927, -0.6239,  0.0979,  0.2056,  0.0968,  0.0066,  0.1306,  0.3476,  0.1613, -0.6285])
# Loss Function -> Mean Absolute Error
# - Loss function is required for doing gradient descent
loss = torch.abs(preds - t_dep).mean()
print(f"Loss: {loss}")
Loss: 0.5382388234138489
# Functions for computing predictions and loss

# Compute Predictions
def calc_preds(coeffs, indeps):
  return (indeps * coeffs).sum(axis=1)

# Compute Loss
def calc_loss(coeffs, indeps, deps):
  return torch.abs(calc_preds(coeffs, indeps) - deps).mean()
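
A tiny sanity check of calc_preds and calc_loss with toy numbers (my own, not the Titanic data):

# with coeffs [1, 0], rows [[0.5, 9], [1.0, 9]] give preds [0.5, 1.0]
# MAE against targets [1, 0] = (|0.5 - 1| + |1.0 - 0|) / 2 = 0.75
toy_indep = tensor([[0.5, 9.0], [1.0, 9.0]])
toy_coeffs = tensor([1.0, 0.0])
toy_dep = tensor([1.0, 0.0])
print(calc_loss(toy_coeffs, toy_indep, toy_dep))  # tensor(0.7500)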

Gradient Descent

# Tell pytorch to calculate gradients
coeffs.requires_grad_()
tensor([-0.4629,  0.1386,  0.2409, -0.2262, -0.2632, -0.3147,  0.4876,  0.3136,  0.2799, -0.4392,  0.2103,  0.3625], requires_grad=True)
# Calculate loss
loss = calc_loss(coeffs, t_indep, t_dep)
loss
tensor(0.5382, grad_fn=<MeanBackward0>)
# Calculate gradients
loss.backward()
# Gradients
coeffs.grad
tensor([-0.0106,  0.0129, -0.0041, -0.0484,  0.2099, -0.2132, -0.1212, -0.0247,  0.1425, -0.1886, -0.0191,  0.2043])
# - on each call to backward, gradients are added to the value stored in the grad attribute
loss = calc_loss(coeffs, t_indep, t_dep)
loss.backward()
coeffs.grad
tensor([-0.0212,  0.0258, -0.0082, -0.0969,  0.4198, -0.4265, -0.2424, -0.0494,  0.2851, -0.3771, -0.0382,  0.4085])
# reset gradients to zero after doing a single gradient step
loss = calc_loss(coeffs, t_indep, t_dep)
loss.backward()
with torch.no_grad():
  coeffs.sub_(coeffs.grad * 0.1)
  coeffs.grad.zero_()
  print(calc_loss(coeffs, t_indep, t_dep))
tensor(0.4945)

Linear Model Training

# Data split
from fastai.data.transforms import RandomSplitter
trn_split,val_split=RandomSplitter(seed=42)(df)

print(f"Training Data Size: {len(trn_split)}")
print(f"Validation Data Size: {len(val_split)}")
print(f"Training Data Indices: {trn_split}")
print(f"Validation Data Indices: {val_split}")
Training Data Size: 713
Validation Data Size: 178
Training Data Indices: [788, 525, 821, 253, 374, 98, 215, 313, 281, 305, 701, 812, 76, 50, 387, 47, 516, 564, 434, 117, 150, 513, 676, 470, 569, 603, 816, 719, 120, 88, 204, 617, 615, 61, 648, 139, 840, 831, 302, 118, 58, 257, 404, 24, 618, 730, 371, 104, 370, 592, 548, 633, 216, 682, 157, 103, 512, 574, 650, 312, 757, 225, 241, 557, 808, 827, 334, 208, 23, 2, 28, 319, 463, 77, 34, 637, 842, 30, 460, 888, 217, 405, 10, 66, 852, 291, 249, 872, 75, 450, 597, 377, 178, 207, 737, 318, 573, 64, 415, 220, 184, 49, 384, 97, 121, 111, 568, 873, 343, 495, 611, 712, 723, 829, 871, 29, 641, 69, 844, 383, 560, 394, 817, 643, 820, 832, 409, 645, 441, 732, 636, 848, 475, 317, 884, 881, 367, 562, 689, 841, 805, 716, 81, 54, 44, 136, 364, 35, 796, 373, 342, 550, 543, 851, 185, 85, 451, 826, 761, 399, 16, 125, 264, 162, 197, 309, 804, 42, 545, 846, 622, 177, 273, 559, 815, 74, 248, 809, 403, 728, 613, 96, 549, 258, 192, 780, 263, 109, 803, 20, 715, 487, 375, 570, 108, 628, 153, 474, 572, 306, 341, 763, 877, 227, 454, 535, 767, 33, 193, 793, 62, 311, 285, 861, 687, 738, 498, 235, 507, 5, 743, 499, 160, 686, 445, 368, 660, 748, 813, 614, 431, 534, 152, 810, 314, 423, 166, 567, 447, 296, 621, 688, 147, 176, 662, 481, 407, 22, 878, 702, 542, 219, 604, 448, 866, 508, 267, 786, 190, 339, 222, 473, 365, 363, 469, 746, 554, 349, 428, 60, 421, 25, 331, 429, 39, 752, 419, 553, 879, 558, 802, 360, 372, 19, 156, 483, 251, 801, 882, 547, 744, 704, 602, 164, 608, 78, 750, 201, 722, 142, 171, 762, 3, 828, 482, 472, 458, 130, 576, 231, 468, 843, 43, 647, 11, 132, 760, 785, 206, 556, 72, 886, 413, 293, 284, 348, 751, 759, 488, 68, 529, 503, 389, 381, 883, 432, 830, 238, 345, 269, 555, 83, 672, 14, 818, 167, 651, 819, 718, 502, 316, 836, 410, 99, 625, 278, 352, 640, 295, 765, 741, 112, 847, 539, 165, 616, 772, 596, 320, 546, 181, 709, 395, 401, 230, 347, 510, 626, 223, 127, 205, 756, 355, 52, 773, 697, 565, 610, 337, 158, 329, 795, 15, 749, 792, 486, 789, 430, 416, 544, 149, 73, 210, 855, 254, 776, 522, 739, 114, 661, 777, 380, 600, 107, 36, 856, 396, 838, 354, 161, 351, 708, 590, 734, 393, 632, 406, 703, 138, 426, 845, 674, 668, 500, 247, 680, 659, 653, 620, 262, 41, 781, 652, 307, 350, 382, 577, 40, 209, 357, 17, 717, 527, 666, 196, 174, 116, 124, 530, 82, 133, 607, 679, 850, 272, 771, 849, 137, 57, 457, 195, 268, 601, 497, 145, 745, 95, 839, 453, 561, 287, 684, 858, 571, 675, 433, 876, 422, 654, 186, 627, 673, 511, 493, 214, 520, 308, 338, 720, 323, 727, 496, 101, 234, 692, 455, 13, 663, 237, 665, 203, 724, 501, 449, 521, 261, 860, 514, 609, 259, 726, 169, 731, 541, 80, 335, 420, 86, 612, 794, 536, 244, 332, 669, 328, 664, 655, 524, 667, 677, 461, 834, 271, 92, 199, 243, 631, 155, 696, 46, 392, 325, 711, 379, 523, 494, 578, 9, 200, 485, 93, 437, 800, 221, 87, 154, 27, 255, 290, 833, 53, 327, 678, 425, 48, 583, 198, 0, 835, 194, 491, 94, 6, 353, 594, 79, 100, 863, 90, 310, 336, 859, 402, 775, 444, 159, 634, 791, 623, 466, 528, 649, 340, 624, 887, 294, 240, 478, 823, 588, 714, 889, 671, 862, 356, 551, 304, 135, 378, 346, 552, 629, 733, 517, 212, 7, 321, 265, 398, 442, 400, 84, 228, 436, 725, 783, 681, 411, 700, 67, 584, 635, 91, 277, 875, 388, 239, 582, 774, 440, 141, 694, 4, 766, 857, 424, 105, 683, 279, 867, 55, 51, 479, 747, 657, 242, 110, 326, 260, 59, 297, 180, 408, 63, 32, 822, 106, 270, 427, 698, 806, 755, 799, 824, 693, 465, 256, 418, 446, 695, 276, 26, 885, 599, 707, 585, 646, 595, 758, 189, 8, 170, 144, 280, 146, 391, 807, 179, 173, 462, 358, 532, 344]
Validation Data Indices: [303, 778, 531, 385, 134, 476, 691, 443, 386, 128, 579, 65, 869, 359, 202, 187, 456, 880, 705, 797, 656, 467, 581, 754, 131, 768, 172, 252, 163, 300, 417, 71, 566, 814, 322, 119, 515, 369, 89, 729, 226, 563, 148, 218, 12, 638, 509, 333, 825, 376, 245, 480, 182, 784, 619, 70, 126, 605, 37, 735, 452, 721, 630, 115, 854, 102, 864, 811, 188, 706, 870, 140, 439, 868, 168, 764, 589, 224, 191, 586, 183, 742, 753, 250, 526, 685, 505, 435, 874, 658, 769, 45, 537, 798, 644, 292, 397, 330, 289, 143, 175, 274, 519, 606, 690, 288, 286, 670, 484, 477, 506, 18, 740, 412, 464, 366, 736, 790, 699, 299, 266, 587, 459, 233, 598, 782, 246, 315, 282, 236, 504, 639, 518, 123, 31, 283, 232, 713, 122, 211, 213, 591, 129, 890, 229, 275, 471, 533, 853, 362, 38, 593, 56, 787, 301, 324, 837, 113, 21, 538, 710, 361, 490, 390, 1, 779, 580, 865, 438, 492, 642, 575, 151, 770, 414, 540, 298, 489]
# Training Data, Validation Data
trn_indep, val_indep = t_indep[trn_split], t_indep[val_split]
trn_dep, val_dep = t_dep[trn_split], t_dep[val_split]

print(f"Training Independent Data Size: {len(trn_indep)}")
print(f"Training Dependent Data Size: {len(trn_dep)}")
print(f"Validation Indepdent Data Size: {len(val_indep)}")
print(f"Validation Dependent Data Size: {len(val_dep)}")
Training Independent Data Size: 713
Training Dependent Data Size: 713
Validation Independent Data Size: 178
Validation Dependent Data Size: 178
# Randomly initialize coefficients
def init_coeffs():
  return (torch.rand(n_coeff) - 0.5).requires_grad_()

# Update coefficients
def update_coeffs(coeffs, lr):
    coeffs.sub_(coeffs.grad * lr)
    coeffs.grad.zero_()

# One full gradient descent step
def one_epoch(coeffs, lr):
    loss = calc_loss(coeffs, trn_indep, trn_dep)
    loss.backward()
    with torch.no_grad():
        update_coeffs(coeffs, lr)
    print(f"{loss:.3f}", end="; ")

# Train model
def train_model(epochs=30, lr=0.01):
    torch.manual_seed(442)
    coeffs = init_coeffs()
    for i in range(epochs):
      one_epoch(coeffs, lr=lr)
    return coeffs

# Calculate average accuracy of model
def acc(coeffs):
  return (val_dep.bool() == (calc_preds(coeffs, val_indep) > 0.5)).float().mean()

# Show coefficients for each column
def show_coeffs():
  return dict(zip(indep_cols, coeffs.requires_grad_(False)))
# Train Model
coeffs = train_model(18, lr=0.2)
0.536; 0.502; 0.477; 0.454; 0.431; 0.409; 0.388; 0.367; 0.349; 0.336; 0.330; 0.326; 0.329; 0.304; 0.314; 0.296; 0.300; 0.289; 
# Coefficients for every column
show_coeffs()
{'Age': tensor(-0.2694),
 'SibSp': tensor(0.0901),
 'Parch': tensor(0.2359),
 'LogFare': tensor(0.0280),
 'Sex_male': tensor(-0.3990),
 'Sex_female': tensor(0.2345),
 'Pclass_1': tensor(0.7232),
 'Pclass_2': tensor(0.4112),
 'Pclass_3': tensor(0.3601),
 'Embarked_C': tensor(0.0955),
 'Embarked_Q': tensor(0.2395),
 'Embarked_S': tensor(0.2122)}
# Calculate accuracy
preds = calc_preds(coeffs, val_indep)

# - assume that any passenger with a score > 0.5 is predicted to survive
# - a prediction is correct for each row where (preds > 0.5) matches the dependent variable
results = val_dep.bool()==(preds > 0.5)
print(f"First 16 results: {results[:16]}")

# Average accuracy
avg_acc = results.float().mean()
print(f"Average Accuracy: {avg_acc}")
First 16 results: tensor([ True,  True,  True,  True,  True,  True,  True,  True,  True,  True, False, False, False,  True,  True, False])
Average Accuracy: 0.7865168452262878

Sigmoid

# - some of the predicted survival probabilities are > 1 or < 0
# - we can fix this issue by passing the predictions through a sigmoid function
preds[:28]
tensor([ 0.8160,  0.1295, -0.0148,  0.1831,  0.1520,  0.1350,  0.7279,  0.7754,  0.3222,  0.6740,  0.0753,  0.0389,  0.2216,  0.7631,
         0.0678,  0.3997,  0.3324,  0.8278,  0.1078,  0.7126,  0.1023,  0.3627,  0.9937,  0.8050,  0.1153,  0.1455,  0.8652,  0.3425])
# The sigmoid function is bounded between 0 (as x -> -inf) and 1 (as x -> +inf)
import sympy
sympy.plot("1/(1+exp(-x))", xlim=(-5,5));
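
A quick numeric check (my addition) that torch.sigmoid squashes any real input into the interval (0, 1):

# extreme inputs saturate toward 0 and 1; an input of 0 maps to exactly 0.5
xs = torch.tensor([-100.0, -1.0, 0.0, 1.0, 100.0])
print(torch.sigmoid(xs))  # approx. tensor([0.0000, 0.2689, 0.5000, 0.7311, 1.0000])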

# Update calc predictions to use sigmoid
def calc_preds(coeffs, indeps):
  return torch.sigmoid((indeps*coeffs).sum(axis=1))
# Train Model with Sigmoid Predictions
coeffs = train_model(lr=100)
0.510; 0.327; 0.294; 0.207; 0.201; 0.199; 0.198; 0.197; 0.196; 0.196; 0.196; 0.195; 0.195; 0.195; 0.195; 0.195; 0.195; 0.195; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 
# check coefficients
show_coeffs()
{'Age': tensor(-1.5061),
 'SibSp': tensor(-1.1575),
 'Parch': tensor(-0.4267),
 'LogFare': tensor(0.2543),
 'Sex_male': tensor(-10.3320),
 'Sex_female': tensor(8.4185),
 'Pclass_1': tensor(3.8389),
 'Pclass_2': tensor(2.1398),
 'Pclass_3': tensor(-6.2331),
 'Embarked_C': tensor(1.4771),
 'Embarked_Q': tensor(2.1168),
 'Embarked_S': tensor(-4.7958)}
# Check Accuracy
acc(coeffs)
tensor(0.8258)

Final Linear Model Setup

# number of coefficients
n_coeff = t_indep.shape[1]

# Randomly initialize coefficients
def init_coeffs():
  return (torch.rand(n_coeff)-0.5).requires_grad_()

# Loss Function - MAE (Mean Absolute Error)
def calc_loss(coeffs, indeps, deps):
  return torch.abs(calc_preds(coeffs, indeps) - deps).mean()

# Update coefficients
def update_coeffs(coeffs, lr):
    coeffs.sub_(coeffs.grad * lr)
    coeffs.grad.zero_()

# One full gradient descent step
def one_epoch(coeffs, lr):
    loss = calc_loss(coeffs, trn_indep, trn_dep)
    loss.backward()
    with torch.no_grad():
      update_coeffs(coeffs, lr)
    print(f"{loss:.3f}", end="; ")

# Train model
def train_model(epochs=30, lr=0.01):
    torch.manual_seed(442)
    coeffs = init_coeffs()
    for i in range(epochs):
      one_epoch(coeffs, lr=lr)
    return coeffs

# Calculate average accuracy of model
def acc(coeffs):
  return (val_dep.bool() == (calc_preds(coeffs, val_indep) > 0.5)).float().mean()
acc(coeffs)

# Calculate predictions
def calc_preds(coeffs, indeps):
  return torch.sigmoid((indeps * coeffs).sum(axis=1))

# Show coefficients for each column
def show_coeffs():
  return dict(zip(indep_cols, coeffs.requires_grad_(False)))
coeffs = train_model(lr=100)
0.510; 0.327; 0.294; 0.207; 0.201; 0.199; 0.198; 0.197; 0.196; 0.196; 0.196; 0.195; 0.195; 0.195; 0.195; 0.195; 0.195; 0.195; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 

Test model on inference data

Inference Data Processing

# load data
tst_df = pd.read_csv(path/'test.csv')

# one passenger is missing Fare data -> substitute 0 to fix the issue
tst_df['Fare'] = tst_df.Fare.fillna(0)
tst_df
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
... ... ... ... ... ... ... ... ... ... ... ...
413 1305 3 Spector, Mr. Woolf male NaN 0 0 A.5. 3236 8.0500 NaN S
414 1306 1 Oliva y Ocana, Dona. Fermina female 39.0 0 0 PC 17758 108.9000 C105 C
415 1307 3 Saether, Mr. Simon Sivertsen male 38.5 0 0 SOTON/O.Q. 3101262 7.2500 NaN S
416 1308 3 Ware, Mr. Frederick male NaN 0 0 359309 8.0500 NaN S
417 1309 3 Peter, Master. Michael J male NaN 1 1 2668 22.3583 NaN C

418 rows × 11 columns

# - these steps follow the same process as the training data processing
tst_df.fillna(modes, inplace=True)
tst_df['LogFare'] = np.log(tst_df['Fare'] + 1)
tst_df = pd.get_dummies(tst_df, columns=["Sex","Pclass","Embarked"])

added_cols = ['Sex_male', 'Sex_female', 'Pclass_1', 'Pclass_2', 'Pclass_3', 'Embarked_C', 'Embarked_Q', 'Embarked_S']
indep_cols = ['Age', 'SibSp', 'Parch', 'LogFare'] + added_cols

tst_indep = tensor(tst_df[indep_cols].values, dtype=torch.float)
tst_indep = tst_indep / vals
# Calculate predictions of which passengers survived in the Titanic test set
# (note: pass coeffs first to match the calc_preds(coeffs, indeps) signature)
tst_df['Survived'] = (calc_preds(coeffs, tst_indep) > 0.5).int()

Linear Model: Submit Results to Kaggle

# Submit to Kaggle
sub_df = tst_df[['PassengerId','Survived']]
sub_df.to_csv('sub.csv', index=False)

# check first few rows
!head sub.csv
PassengerId,Survived
892,0
893,0
894,0
895,0
896,0
897,0
898,1
899,0
900,1

Cleaning up Linear Model Code

# - Multiplying elements together and then adding across rows is the same as matrix-vector multiply
# original version: element-wise multiply, then sum across rows (a manual matrix-vector multiply)
(val_indep*coeffs).sum(axis=1)
tensor([ 12.3288, -14.8119, -15.4540, -13.1513, -13.3512, -13.6469,   3.6248,   5.3429, -22.0878,   3.1233, -21.8742, -15.6421, -21.5504,
          3.9393, -21.9190, -12.0010, -12.3775,   5.3550, -13.5880,  -3.1015, -21.7237, -12.2081,  12.9767,   4.7427, -21.6525, -14.9135,
         -2.7433, -12.3210, -21.5886,   3.9387,   5.3890,  -3.6196, -21.6296, -21.8454,  12.2159,  -3.2275, -12.0289,  13.4560, -21.7230,
         -3.1366, -13.2462, -21.7230, -13.6831,  13.3092, -21.6477,  -3.5868, -21.6854, -21.8316, -14.8158,  -2.9386,  -5.3103, -22.2384,
        -22.1097, -21.7466, -13.3780, -13.4909, -14.8119, -22.0690, -21.6666, -21.7818,  -5.4439, -21.7407, -12.6551, -21.6671,   4.9238,
        -11.5777, -13.3323, -21.9638, -15.3030,   5.0243, -21.7614,   3.1820, -13.4721, -21.7170, -11.6066, -21.5737, -21.7230, -11.9652,
        -13.2382, -13.7599, -13.2170,  13.1347, -21.7049, -21.7268,   4.9207,  -7.3198,  -5.3081,   7.1065,  11.4948, -13.3135, -21.8723,
        -21.7230,  13.3603, -15.5670,   3.4105,  -7.2857, -13.7197,   3.6909,   3.9763, -14.7227, -21.8268,   3.9387, -21.8743, -21.8367,
        -11.8518, -13.6712, -21.8299,   4.9440,  -5.4471, -21.9666,   5.1333,  -3.2187, -11.6008,  13.7920, -21.7230,  12.6369,  -3.7268,
        -14.8119, -22.0637,  12.9468, -22.1610,  -6.1827, -14.8119,  -3.2838, -15.4540, -11.6950,  -2.9926,  -3.0110, -21.5664, -13.8268,
          7.3426, -21.8418,   5.0744,   5.2582,  13.3415, -21.6289, -13.9898, -21.8112,  -7.3316,   5.2296, -13.4453,  12.7891, -22.1235,
        -14.9625,  -3.4339,   6.3089, -21.9839,   3.1968,   7.2400,   2.8558,  -3.1187,   3.7965,   5.4667, -15.1101, -15.0597, -22.9391,
        -21.7230,  -3.0346, -13.5206, -21.7011,  13.4425,  -7.2690, -21.8335, -12.0582,  13.0489,   6.7993,   5.2160,   5.0794, -12.6957,
        -12.1838,  -3.0873, -21.6070,   7.0744, -21.7170, -22.1001,   6.8159, -11.6002, -21.6310], grad_fn=<SumBackward1>)
# PyTorch optimized matrix-vector multiply
# - Python uses the @ operator to indicate matrix products, and it is supported by PyTorch tensors
val_indep@coeffs
tensor([ 12.3288, -14.8119, -15.4540, -13.1513, -13.3511, -13.6468,   3.6248,   5.3429, -22.0878,   3.1233, -21.8742, -15.6421, -21.5504,
          3.9393, -21.9190, -12.0010, -12.3775,   5.3550, -13.5880,  -3.1015, -21.7237, -12.2081,  12.9767,   4.7427, -21.6525, -14.9135,
         -2.7433, -12.3210, -21.5886,   3.9387,   5.3890,  -3.6196, -21.6296, -21.8454,  12.2159,  -3.2275, -12.0289,  13.4560, -21.7230,
         -3.1366, -13.2462, -21.7230, -13.6831,  13.3092, -21.6477,  -3.5868, -21.6854, -21.8316, -14.8158,  -2.9386,  -5.3103, -22.2384,
        -22.1097, -21.7466, -13.3780, -13.4909, -14.8119, -22.0690, -21.6666, -21.7818,  -5.4439, -21.7407, -12.6551, -21.6671,   4.9238,
        -11.5777, -13.3323, -21.9638, -15.3030,   5.0243, -21.7614,   3.1820, -13.4721, -21.7170, -11.6066, -21.5737, -21.7230, -11.9652,
        -13.2382, -13.7599, -13.2170,  13.1347, -21.7049, -21.7268,   4.9207,  -7.3198,  -5.3081,   7.1065,  11.4948, -13.3135, -21.8723,
        -21.7230,  13.3603, -15.5670,   3.4105,  -7.2857, -13.7197,   3.6909,   3.9763, -14.7227, -21.8268,   3.9387, -21.8743, -21.8367,
        -11.8518, -13.6712, -21.8299,   4.9440,  -5.4471, -21.9666,   5.1333,  -3.2187, -11.6008,  13.7920, -21.7230,  12.6369,  -3.7268,
        -14.8119, -22.0637,  12.9468, -22.1610,  -6.1827, -14.8119,  -3.2838, -15.4540, -11.6950,  -2.9926,  -3.0110, -21.5664, -13.8268,
          7.3426, -21.8418,   5.0744,   5.2582,  13.3415, -21.6289, -13.9898, -21.8112,  -7.3316,   5.2296, -13.4453,  12.7891, -22.1235,
        -14.9625,  -3.4339,   6.3089, -21.9839,   3.1968,   7.2400,   2.8558,  -3.1187,   3.7965,   5.4667, -15.1101, -15.0597, -22.9391,
        -21.7230,  -3.0346, -13.5206, -21.7011,  13.4425,  -7.2690, -21.8335, -12.0582,  13.0489,   6.7993,   5.2160,   5.0794, -12.6957,
        -12.1838,  -3.0873, -21.6070,   7.0744, -21.7170, -22.1001,   6.8159, -11.6002, -21.6310], grad_fn=<MvBackward0>)
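
A quick check (my addition) that the two formulations agree up to floating-point round-off:

# broadcast-multiply-and-sum vs the @ operator
print(torch.allclose((val_indep * coeffs).sum(axis=1), val_indep @ coeffs))  # True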

Linear Model: PyTorch Matrix-Vector Multiply

# number of coefficients
n_coeff = t_indep.shape[1]

# Randomly initialize coefficients
def init_coeffs():
  # - the second dimension of 1 turns torch.rand() into a column vector (rank-2 tensor)
  return (torch.rand(n_coeff, 1) * 0.1).requires_grad_()

# Loss Function - MAE (Mean Absolute Error)
def calc_loss(coeffs, indeps, deps):
  return torch.abs(calc_preds(coeffs, indeps) - deps).mean()

# Update coefficients
def update_coeffs(coeffs, lr):
    coeffs.sub_(coeffs.grad * lr)
    coeffs.grad.zero_()

# One full gradient descent step
def one_epoch(coeffs, lr):
    loss = calc_loss(coeffs, trn_indep, trn_dep)
    loss.backward()
    with torch.no_grad():
      update_coeffs(coeffs, lr)
    print(f"{loss:.3f}", end="; ")

# Train model
def train_model(epochs=30, lr=0.01):
    torch.manual_seed(442)
    coeffs = init_coeffs()
    for i in range(epochs):
      one_epoch(coeffs, lr=lr)
    return coeffs

# Calculate average accuracy of model
def acc(coeffs):
  return (val_dep.bool() == (calc_preds(coeffs, val_indep) > 0.5)).float().mean()
acc(coeffs)

# Calculate predictions
def calc_preds(coeffs, indeps):
  return torch.sigmoid(indeps@coeffs)

# Show coefficients for each column
def show_coeffs():
  return dict(zip(indep_cols, coeffs.requires_grad_(False)))
# change dependent variable into a column vector - rank 2 tensor
trn_dep = trn_dep[:,None]
val_dep = val_dep[:,None]
print(f"Training Data Shape: {trn_dep.shape}")
print(f"Training Data Rank: {len(trn_dep.shape)}")
print(f"Validation Data Shape: {val_dep.shape}")
print(f"Validation Data Rank: {len(val_dep.shape)}")
Training Data Shape: torch.Size([713, 1])
Training Data Rank: 2
Validation Data Shape: torch.Size([178, 1])
Validation Data Rank: 2
coeffs = train_model(lr=100)
print()
print(f"coefficients shape: {coeffs.shape}")
print(f"coefficients rank: {len(coeffs.shape)}")
print(f"accuracy: {acc(coeffs)}")
0.512; 0.323; 0.290; 0.205; 0.200; 0.198; 0.197; 0.197; 0.196; 0.196; 0.196; 0.195; 0.195; 0.195; 0.195; 0.195; 0.195; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 0.194; 
coefficients shape: torch.Size([12, 1])
coefficients rank: 2
accuracy: 0.8258426785469055

Neural Network

  • Define coefficients for each layer of the neural network

n_hidden - a higher number gives the neural network more flexibility to approximate the data, but makes it slower and harder to train

First Layer

  • input - n_coeff values; output - n_hidden values (the input to the second layer)
  • needs a matrix of size n_coeff by n_hidden
  • divide the coefficients by n_hidden so that when we sum them up in the next layer we end up with numbers of similar magnitude to what we started with

Second Layer

  • input - n_hidden values (the output of the first layer); output - 1 value
  • needs a matrix of size n_hidden by 1, plus a constant term

Steps

  1. Two matrix products - indeps@l1 and res@l2 (res is the output of the first layer)
  2. The first layer's output is passed to F.relu (a non-linearity)
  3. The second layer's output is passed to sigmoid

def init_coeffs(n_hidden=20):
    # set of coefficients to go from the input to the hidden layer
    layer1 = (torch.rand(n_coeff, n_hidden) - 0.5) / n_hidden

    # set of coefficients to go from the hidden layer to the output
    layer2 = torch.rand(n_hidden, 1) - 0.3
    const = torch.rand(1)[0]

    # return a tuple of the layer1, layer2, and constant tensors with gradients enabled
    return layer1.requires_grad_(), layer2.requires_grad_(), const.requires_grad_()
import torch.nn.functional as F

# neural network
def calc_preds(coeffs, indeps):
    l1, l2, const = coeffs

    # layer 1
    # replace negative values with zeroes
    res = F.relu(indeps@l1)

    # layer 2
    res = res@l2 + const
    return torch.sigmoid(res)
# Update Coefficients
# - Three sets of coefficients to update per epoch (layer1, layer2, constant)
def update_coeffs(coeffs, lr):
  for layer in coeffs:
    layer.sub_(layer.grad * lr)
    layer.grad.zero_()
# Train Model
coeffs = train_model(lr=1.4)
0.543; 0.532; 0.520; 0.505; 0.487; 0.466; 0.439; 0.407; 0.373; 0.343; 0.319; 0.301; 0.286; 0.274; 0.264; 0.256; 0.250; 0.245; 0.240; 0.237; 0.234; 0.231; 0.229; 0.227; 0.226; 0.224; 0.223; 0.222; 0.221; 0.220; 
# Train Model
coeffs = train_model(lr=20)
0.543; 0.400; 0.260; 0.390; 0.221; 0.211; 0.197; 0.195; 0.193; 0.193; 0.193; 0.193; 0.193; 0.193; 0.193; 0.193; 0.193; 0.192; 0.192; 0.192; 0.192; 0.192; 0.192; 0.192; 0.192; 0.192; 0.192; 0.192; 0.192; 0.192; 
# Check Accuracy
acc(coeffs)
tensor(0.8258)

Deep Learning

def init_coeffs():
    # size of each hidden layer
    # two hidden layers - 10 activations in each layer
    hiddens = [10, 10]
    # layer sizes:
    # - n_coeff to 10
    # - 10 to 10
    # - 10 to 1
    sizes = [n_coeff] + hiddens + [1]
    n = len(sizes)
    layers = [(torch.rand(sizes[i], sizes[i + 1]) - 0.3) / sizes[i + 1] * 4 for i in range(n - 1)]
    consts = [(torch.rand(1)[0] - 0.5) * 0.1 for i in range(n - 1)]
    for l in layers + consts:
        l.requires_grad_()
    return layers, consts
import torch.nn.functional as F

def calc_preds(coeffs, indeps):
    layers,consts = coeffs
    n = len(layers)
    res = indeps
    for i,l in enumerate(layers):
        res = res@l + consts[i]
        # ReLU for every layer except the last layer
        if i != n - 1:
          res = F.relu(res)
    # sigmoid only for the last layer
    return torch.sigmoid(res)
def update_coeffs(coeffs, lr):
    layers,consts = coeffs
    for layer in layers + consts:
        layer.sub_(layer.grad * lr)
        layer.grad.zero_()
# Train Model
coeffs = train_model(lr=4)
0.521; 0.483; 0.427; 0.379; 0.379; 0.379; 0.379; 0.378; 0.378; 0.378; 0.378; 0.378; 0.378; 0.378; 0.378; 0.378; 0.377; 0.376; 0.371; 0.333; 0.239; 0.224; 0.208; 0.204; 0.203; 0.203; 0.207; 0.197; 0.196; 0.195; 
# Check Accuracy
acc(coeffs)
tensor(0.8258)

Framework: fastai + PyTorch

# load stuff
from fastai.tabular.all import *

pd.options.display.float_format = '{:.2f}'.format
set_seed(42)
# Load Data
df = pd.read_csv(path/'train.csv')

# Feature Engineering
def add_features(df):
    df['LogFare'] = np.log1p(df['Fare'])
    df['Deck'] = df.Cabin.str[0].map(dict(A="ABC", B="ABC", C="ABC", D="DE", E="DE", F="FG", G="FG"))
    df['Family'] = df.SibSp+df.Parch
    df['Alone'] = df.Family==0
    df['TicketFreq'] = df.groupby('Ticket')['Ticket'].transform('count')
    df['Title'] = df.Name.str.split(', ', expand=True)[1].str.split('.', expand=True)[0]
    df['Title'] = df.Title.map(dict(Mr="Mr",Miss="Miss",Mrs="Mrs",Master="Master"))

add_features(df)
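
To see what one of these features does, here is a quick illustration (my own toy example) of how the Title feature falls out of a Name string:

# split "Cumings, Mrs. John Bradley (...)" on ", ", take the second part,
# then split on "." and keep the first part -> "Mrs"
name = pd.Series(["Cumings, Mrs. John Bradley (Florence Briggs Thayer)"])
title = name.str.split(', ', expand=True)[1].str.split('.', expand=True)[0]
print(title[0])  # Mrs
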
# Data Split
splits = RandomSplitter(seed=42)(df)
# Tabular Dataloaders
dls = TabularPandas(
    # splits for indices of training and validation sets
    df, splits=splits,
    # Turn strings into categories, fill missing values in numeric columns with the median, normalise all numeric columns
    procs = [Categorify, FillMissing, Normalize],
    # categorical independent variables
    cat_names=["Sex","Pclass","Embarked","Deck", "Title"],
    # continuous independent variables
    cont_names=['Age', 'SibSp', 'Parch', 'LogFare', 'Alone', 'TicketFreq', 'Family'],
    # dependent variable
    y_names="Survived",
    # dependent variable is categorical(build a classification model)
    y_block = CategoryBlock(),
).dataloaders(path=".")
# Train Model
# - data + model = Learner
# - dls -> data
# - layers -> size of each hidden layer
# - metrics -> metrics to report during training (accuracy here; this is not the loss function)
learn = tabular_learner(dls, metrics=accuracy, layers=[10,10])
# find learning rate
learn.lr_find(suggest_funcs=(slide, valley))
SuggestedLRs(slide=0.05754399299621582, valley=0.013182567432522774)

# specify number of epochs and learning rate and train model
learn.fit(16, 0.03)
epoch train_loss valid_loss accuracy time
0 0.577146 0.582949 0.606742 00:00
1 0.510818 0.498523 0.786517 00:00
2 0.467023 0.459841 0.797753 00:00
3 0.439957 0.468547 0.797753 00:00
4 0.427232 0.415261 0.825843 00:00
5 0.416340 0.437362 0.820225 00:00
6 0.408347 0.413253 0.848315 00:00
7 0.400442 0.406075 0.803371 00:00
8 0.397265 0.443730 0.820225 00:00
9 0.392389 0.432267 0.831461 00:00
10 0.389977 0.416234 0.837079 00:00
11 0.386176 0.424609 0.814607 00:00
12 0.382634 0.440516 0.837079 00:00
13 0.378518 0.432545 0.825843 00:00
14 0.374405 0.434594 0.825843 00:00
15 0.372745 0.426979 0.837079 00:00

Test fastai model on inference data

# Inference Data Processing
tst_df = pd.read_csv(path/'test.csv')
tst_df['Fare'] = tst_df.Fare.fillna(0)
add_features(tst_df)
# apply the same data processing steps from the learner to the inference data
tst_dl = learn.dls.test_dl(tst_df)
# get predictions for the inference data
preds,_ = learn.get_preds(dl=tst_dl)

fastai Model: Submit to Kaggle

# submit to kaggle
tst_df['Survived'] = (preds[:,1] > 0.5).int()
sub_df = tst_df[['PassengerId','Survived']]
sub_df.to_csv('framework_sub.csv', index=False)

# check predictions file
!head framework_sub.csv
PassengerId,Survived
892,0
893,0
894,0
895,0
896,1
897,0
898,1
899,0
900,1

Ensembling

  • create multiple models and combine predictions
def ensemble():
    learn = tabular_learner(dls, metrics=accuracy, layers=[10,10])
    with learn.no_bar(),learn.no_logging():
      learn.fit(16, lr=0.03)
    return learn.get_preds(dl=tst_dl)[0]
# create 5 sets of predictions (one from each independently trained model)
learns = [ensemble() for _ in range(5)]
# take the average of the predictions across the 5 models
ens_preds = torch.stack(learns).mean(0)
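
A shape check (my addition; assuming 5 models, 418 test rows, and 2 classes for the CategoryBlock output):

# stacking gives one prediction tensor per model; mean(0) averages across models
stacked = torch.stack(learns)
print(stacked.shape)    # expected: torch.Size([5, 418, 2])
print(ens_preds.shape)  # expected: torch.Size([418, 2])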

Ensembling: Submit to Kaggle

# submit to kaggle
# - use the ensemble's averaged predictions (ens_preds), not the single-model preds
tst_df['Survived'] = (ens_preds[:,1] > 0.5).int()
sub_df = tst_df[['PassengerId','Survived']]
sub_df.to_csv('ensemble_sub.csv', index=False)

# check predictions file
!head ensemble_sub.csv
PassengerId,Survived
892,0
893,0
894,0
895,0
896,1
897,0
898,1
899,0
900,1

Resources

  1. FastAI Lesson 3
  2. FastAI Lesson 5
  3. FastAI Chapter 4
  4. FastAI Chapter 9
  5. Titanic - Machine Learning from Disaster
  6. How does a neural net really work?
  7. Linear model and neural net from scratch
  8. Why you should use a framework
  9. Titanic - Advanced Feature Engineering Tutorial
  10. Jeremy Howard FastAI Live Coding
  11. fast.ai docs