Pranav Rajan’s Blog - FastAI Lesson 7: Collaborative Filtering

Acknowledgements

All of this code was written by Jeremy Howard and the FastAI Team. I modified it slightly to include my own print statements, comments and additional helper functions based on Jeremy’s code. This is the source for the original code Scaling Up: Road to the Top, Part 3, Multi-Target: Road to the Top, Part 4 and Collaborative Filtering Deep Dive.

Summary

In this lesson, Jeremy explains Collaborative Filtering and its application in recommendation systems and continues the discussion of his process for the Paddy Doctor: Paddy Disease Classification challenge. The section on cross-entropy loss was difficult for me but rewatching the cross entropy loss section of the video and reworking the cross entropy loss spreadsheet calculations helped me understand how cross entropy loss works.

Jeremy Howard’s Advice

The stuff that happens in the first layer of the model and the last layer including the loss function that sits between the last layer and the loss are very important in deep learning

Terminology

Ensemble - model which is itself the result of combining a number of other models. The simplest way to do ensembling is to take the average of the predictions of each model.

Gradient Accumulation - rather than updating the model weights after ever batch based on that batch’s gradients, instead keep accumulating (adding) the gradients for a few batches, and thenupdate the model weights with those accumulated gradients.

Collaborative Filtering (Recommendation Systems) - look at what products the current user has used or liked, find other users that hae used or liked similar products and then recommend other products that those users have used or liked.

Latent Factors - Find what features matter. In the movie case study, the latent factors are what things matter most to a person when they choose to watch a particular movie.

Embeddings - special layer in pytorch that indexes into a vector using an integer, and has its derivative calculated in such a way that it is identical to what would have been if it had done a matrix multiplication with a one-hot-encoded vector.

Weight Decay - weight decay or L2 Regularization is adding to your loss function the sum of all the weights squared. When computing the gradients, this forces the weights to be as small as possibl. This helps prevent overfitting because the larger the coefficients are, the sharper the canyons appear in the loss function

Cross Entropy Loss

Cross Entropy Loss is confusing to understand just by looking at the math equations in the pytorch documentation. Here I will try my best to explain how it works.

Like the other loss functions Jeremy has talked about cross entropy loss is a loss function we are using to determine how good our model is. In order to compute the cross entropy loss we do the following steps:

Compute the SoftMax
Compute the Cross Entropy Loss using the SoftMax

SoftMax

In the FastAI Lesson 7 video, Jeremy uses an example with five different image classes: cat, dog, plane, fish, building with the goal of predicting whether some image is one of those categories. I will be using the same example.

Given some random weights and after doing some work, the image model will output 5 output values corresponding to the image classes from above:

cat = -4.89 dog = 2.60 plane = 0.59 fish = -2.07 building = -4.57

To use these values in cross entropy loss, they need to be converted to probabilities. Softmax is a function that converts numbers to probabilities using the following equation: \[\frac{e^{z_{i}}}{\sum_{j=1}^{K}e^{z_{j}}}\]

In the equation, K represents the number of categories, zj and zirepresents the output value for the corresponding category. Adding up the softmax results for each category, we get a total of 1.0 because the sum of all probabilities in an experiment is 1.0.

Cross Entropy

The cross entropy function is defined as the following: \[-\sum_{i=1}^{M}y_{i}\log{p(y_{i})} + (1 - y_{i})\log(1 - p(y_{i}))\]

In the equation, M represents the number of categories and yi represents the category.

Using the probabilities calculated by the softmax function we can simplify the cross entropy function to the following: -sum(actual target value * log(prediction probability)).

Cross Entropy is finding the probability of the target class where the actual target value is 1 and then taking the log of the probability - where 1 is the correct value and 0 is the incorrect value.

Binary Cross Entropy

\[-\sum_{i=1}^{N}y_{i}log(p(y_{i}) + (1 - y_{i})log(1 - p(y_{i}))\]

In the equation, the sum represents the total over the number of trials, yi represents the label.

Load Data and Libraries

# import libraries and files

# required libraries + packages for any ml/data science project
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch

# fastai library contains all the packages above and wraps them in the fastai library
!pip install -Uqq fastai

# install PyTorch Image Models (TIMM)
!pip install timm==0.6.13

# kaggle API package install
!pip install kaggle

!pip install pynvml

Collecting timm==0.6.13
  Downloading timm-0.6.13-py3-none-any.whl (549 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 549.1/549.1 kB 2.9 MB/s eta 0:00:00
Requirement already satisfied: torch>=1.7 in /usr/local/lib/python3.10/dist-packages (from timm==0.6.13) (2.1.0+cu121)
Requirement already satisfied: torchvision in /usr/local/lib/python3.10/dist-packages (from timm==0.6.13) (0.16.0+cu121)
Requirement already satisfied: pyyaml in /usr/local/lib/python3.10/dist-packages (from timm==0.6.13) (6.0.1)
Requirement already satisfied: huggingface-hub in /usr/local/lib/python3.10/dist-packages (from timm==0.6.13) (0.20.3)
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch>=1.7->timm==0.6.13) (3.13.1)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torch>=1.7->timm==0.6.13) (4.9.0)
Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch>=1.7->timm==0.6.13) (1.12)
Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=1.7->timm==0.6.13) (3.2.1)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=1.7->timm==0.6.13) (3.1.3)
Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch>=1.7->timm==0.6.13) (2023.6.0)
Requirement already satisfied: triton==2.1.0 in /usr/local/lib/python3.10/dist-packages (from torch>=1.7->timm==0.6.13) (2.1.0)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from huggingface-hub->timm==0.6.13) (2.31.0)
Requirement already satisfied: tqdm>=4.42.1 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub->timm==0.6.13) (4.66.1)
Requirement already satisfied: packaging>=20.9 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub->timm==0.6.13) (23.2)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from torchvision->timm==0.6.13) (1.25.2)
Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /usr/local/lib/python3.10/dist-packages (from torchvision->timm==0.6.13) (9.4.0)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=1.7->timm==0.6.13) (2.1.5)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub->timm==0.6.13) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub->timm==0.6.13) (3.6)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub->timm==0.6.13) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface-hub->timm==0.6.13) (2024.2.2)
Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=1.7->timm==0.6.13) (1.3.0)
Installing collected packages: timm
Successfully installed timm-0.6.13
Requirement already satisfied: kaggle in /usr/local/lib/python3.10/dist-packages (1.5.16)
Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.10/dist-packages (from kaggle) (1.16.0)
Requirement already satisfied: certifi in /usr/local/lib/python3.10/dist-packages (from kaggle) (2024.2.2)
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.10/dist-packages (from kaggle) (2.8.2)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from kaggle) (2.31.0)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from kaggle) (4.66.1)
Requirement already satisfied: python-slugify in /usr/local/lib/python3.10/dist-packages (from kaggle) (8.0.4)
Requirement already satisfied: urllib3 in /usr/local/lib/python3.10/dist-packages (from kaggle) (2.0.7)
Requirement already satisfied: bleach in /usr/local/lib/python3.10/dist-packages (from kaggle) (6.1.0)
Requirement already satisfied: webencodings in /usr/local/lib/python3.10/dist-packages (from bleach->kaggle) (0.5.1)
Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.10/dist-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->kaggle) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->kaggle) (3.6)
Collecting pynvml
  Downloading pynvml-11.5.0-py3-none-any.whl (53 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.1/53.1 kB 897.6 kB/s eta 0:00:00
Installing collected packages: pynvml
Successfully installed pynvml-11.5.0

from fastai.imports import *
import os
from pathlib import Path
import zipfile

'''Function for loading kaggle datasets locally or on kaggle
Returns a local path to data files
- input: Kaggle API Login Credentials, Kaggle Contest Name '''
def loadData(creds, dataFile):
    # variable to check whether we're running on kaggle website or not
    iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

    # path for kaggle API credentials
    cred_path = Path('~/.kaggle/kaggle.json').expanduser()

    if not cred_path.exists():
        cred_path.parent.mkdir(exist_ok=True)
        cred_path.write_text(creds)
        cred_path.chmod(0o600)

    # Download data from Kaggle to path and extract files at path location

    # local machine
    path = Path(dataFile)
    if not iskaggle and not path.exists():
        import kaggle
        kaggle.api.competition_download_cli(str(path))
        zipfile.ZipFile(f'{path}.zip').extractall(path)

    # kaggle
    if iskaggle:
        fileName = '../input/' + dataFile
        path = fileName

    return path

creds = ''
dataFile = 'paddy-disease-classification'
path = loadData(creds, dataFile)

Downloading paddy-disease-classification.zip to /content

100%|██████████| 1.02G/1.02G [00:14<00:00, 76.2MB/s]

# check data files
! ls {path}

sample_submission.csv  test_images  train.csv  train_images

# set up default settings
import warnings, logging, torch
warnings.simplefilter('ignore')
logging.disable(logging.WARNING)

# load data files
from fastai.vision.all import *
set_seed(42)

df = pd.read_csv(path/'train.csv')
test_files = get_image_files(path/'test_images').sorted()

Paddy Doctor

Part 3

Memory + Gradient Accumulation

Goal: Train an ensemble of large models with large inputs
The bottleneck for training such models is GPU memory
Kaggle GPUS have 16280 MiB of available memory

# file count
df.label.value_counts()

normal                      1764
blast                       1738
hispa                       1594
dead_heart                  1442
tungro                      1088
brown_spot                   965
downy_mildew                 620
bacterial_leaf_blight        479
bacterial_leaf_streak        380
bacterial_panicle_blight     337
Name: label, dtype: int64

# - bacterial panicle blight has the least files so use that for testing models and images
trn_path = path/'train_images'/'bacterial_panicle_blight'

# - finetune argument to specify whether to run fine_tune() or the fit_one_cycyle() -> fit_one_cycle()
# is faster since it doesn't do an intiial fine-tuning of the head
# - In the finetune function the TTA predictions on the test set are calculated and returned
# - accum argument is used for calculating gradient accumulation
def train(arch, size, item=Resize(480, method='squish'), accum=1, finetune=True, epochs=12):
    dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, item_tfms=item,
        batch_tfms=aug_transforms(size=size, min_scale=0.75), bs=64//accum)
    cbs = GradientAccumulation(64) if accum else []
    learn = vision_learner(dls, arch, metrics=error_rate, cbs=cbs).to_fp16()
    if finetune:
        learn.fine_tune(epochs, 0.01)
        return learn.tta(dl=dls.test_dl(test_files))
    else:
        learn.unfreeze()
        learn.fit_one_cycle(epochs, 0.01)

Gradient Accumulation

fastai has a GradientAccumulation parameter that you pass to define how many batches of gradients are accumulated
After adding up all the gradients over accum batches, we need to divide the batch size by that number
the resulting loop is nearly identical to using the original batch size but the amount of memory used is the same as using a batch size accum times smaller

# single epoch training loop without gradient accumulation

# for x, y in dl:
#   calc_loss(coeffs, x, y).backward()
#   coeffs.data.sub_(coeffs.grad * lr)
#   coeffs.grad.zero_()

# gradient accumulation added (assuming a target effective batch size of 64)

# # number of items seen since last weight update
# count = 0
# for x,y in dl:
#  # update count based on this minibatch size
#     count += len(x)
#     calc_loss(coeffs, x, y).backward()
#  # count is greater than accumulation target, so do weight update
#     if count > 64:
#         coeffs.data.sub_(coeffs.grad * lr)
#         coeffs.grad.zero_()
# # reset count
#         count = 0

Memory Consumption Analysis

import gc
def report_gpu():
    print(torch.cuda.list_gpu_processes())
    gc.collect()
    torch.cuda.empty_cache()

# accum = 1
train('convnext_small_in22k', 128, epochs=1, accum=1, finetune=False)
report_gpu()

Downloading: "https://dl.fbaipublicfiles.com/convnext/convnext_small_22k_224.pth" to /root/.cache/torch/hub/checkpoints/convnext_small_22k_224.pth

epoch	train_loss	valid_loss	error_rate	time
0	0.000000	0.000000	0.000000	00:08

GPU:0
process       4883 uses     3260.000 MB GPU memory

# accum = 2
train('convnext_small_in22k', 128, epochs=1, accum=2, finetune=False)
report_gpu()

epoch	train_loss	valid_loss	error_rate	time
0	0.000000	0.000000	0.000000	00:03

GPU:0
process       4883 uses     2200.000 MB GPU memory

# accum = 4
train('convnext_small_in22k', 128, epochs=1, accum=4, finetune=False)
report_gpu()

epoch	train_loss	valid_loss	error_rate	time
0	0.000000	0.000000	0.000000	00:05

GPU:0
process       4883 uses     1664.000 MB GPU memory

# convnext large
train('convnext_large_in22k', 224, epochs=1, accum=2, finetune=False)
report_gpu()

Downloading: "https://dl.fbaipublicfiles.com/convnext/convnext_large_22k_224.pth" to /root/.cache/torch/hub/checkpoints/convnext_large_22k_224.pth

epoch	train_loss	valid_loss	error_rate	time
0	0.000000	0.000000	0.000000	00:07

GPU:0
process       4883 uses    10082.000 MB GPU memory

train('convnext_large_in22k', (320,240), epochs=1, accum=2, finetune=False)
report_gpu()

epoch	train_loss	valid_loss	error_rate	time
0	0.000000	0.000000	0.000000	00:08

GPU:0
process       4883 uses    13422.000 MB GPU memory

# vit_large - close to going over 16280MiB in Kaggle
train('vit_large_patch16_224', 224, epochs=1, accum=2, finetune=False)
report_gpu()

epoch	train_loss	valid_loss	error_rate	time
0	0.000000	0.000000	0.000000	00:09

GPU:0
process       4883 uses    14360.000 MB GPU memory

# swinv2
train('swinv2_large_window12_192_22k', 192, epochs=1, accum=2, finetune=False)
report_gpu()

Downloading: "https://github.com/SwinTransformer/storage/releases/download/v2.0.0/swinv2_large_patch4_window12_192_22k.pth" to /root/.cache/torch/hub/checkpoints/swinv2_large_patch4_window12_192_22k.pth

epoch	train_loss	valid_loss	error_rate	time
0	0.000000	0.000000	0.000000	00:09

GPU:0
process       4883 uses    12508.000 MB GPU memory

# swin
train('swin_large_patch4_window7_224', 224, epochs=1, accum=2, finetune=False)
report_gpu()

Downloading: "https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_large_patch4_window7_224_22kto1k.pth" to /root/.cache/torch/hub/checkpoints/swin_large_patch4_window7_224_22kto1k.pth

epoch	train_loss	valid_loss	error_rate	time
0	0.000000	0.000000	0.000000	00:07

GPU:0
process       4883 uses    10924.000 MB GPU memory

Experimenting with different models

# image sizes
res = 640, 480

# different models
models = {
    'convnext_large_in22k': {
        (Resize(res), (320,224)),
    }, 'vit_large_patch16_224': {
        (Resize(480, method='squish'), 224),
        (Resize(res), 224),
    }, 'swinv2_large_window12_192_22k': {
        (Resize(480, method='squish'), 192),
        (Resize(res), 192),
    }, 'swin_large_patch4_window7_224': {
        (Resize(res), 224),
    }
}

# switch to all training images
trn_path = path/'train_images'

# - each model is using different training and validation sets so results are not comparable
# - append each set of TTA predictions on the test set into the tta_res list
tta_res = []

for arch,details in models.items():
    for item,size in details:
        print('---',arch)
        print(size)
        print(item.name)
        tta_res.append(train(arch, size, item=item, accum=2)) #, epochs=1))
        gc.collect()
        torch.cuda.empty_cache()

--- convnext_large_in22k
(320, 224)
Resize -- {'size': (480, 640), 'method': 'crop', 'pad_mode': 'reflection', 'resamples': (<Resampling.BILINEAR: 2>, <Resampling.NEAREST: 0>), 'p': 1.0}
--- vit_large_patch16_224
224
Resize -- {'size': (480, 480), 'method': 'squish', 'pad_mode': 'reflection', 'resamples': (<Resampling.BILINEAR: 2>, <Resampling.NEAREST: 0>), 'p': 1.0}
--- vit_large_patch16_224
224
Resize -- {'size': (480, 640), 'method': 'crop', 'pad_mode': 'reflection', 'resamples': (<Resampling.BILINEAR: 2>, <Resampling.NEAREST: 0>), 'p': 1.0}
--- swinv2_large_window12_192_22k
192
Resize -- {'size': (480, 480), 'method': 'squish', 'pad_mode': 'reflection', 'resamples': (<Resampling.BILINEAR: 2>, <Resampling.NEAREST: 0>), 'p': 1.0}
--- swinv2_large_window12_192_22k
192
Resize -- {'size': (480, 640), 'method': 'crop', 'pad_mode': 'reflection', 'resamples': (<Resampling.BILINEAR: 2>, <Resampling.NEAREST: 0>), 'p': 1.0}
--- swin_large_patch4_window7_224
224
Resize -- {'size': (480, 640), 'method': 'crop', 'pad_mode': 'reflection', 'resamples': (<Resampling.BILINEAR: 2>, <Resampling.NEAREST: 0>), 'p': 1.0}

epoch	train_loss	valid_loss	error_rate	time
0	0.858722	0.499460	0.165786	03:24

epoch	train_loss	valid_loss	error_rate	time
0	0.385716	0.208010	0.061509	04:33
1	0.310007	0.212928	0.061028	04:31
2	0.351554	0.228603	0.066314	04:28
3	0.221846	0.182669	0.054301	04:27
4	0.153551	0.156956	0.045171	04:29
5	0.149726	0.149253	0.037001	04:27
6	0.090776	0.127437	0.031235	04:26
7	0.081606	0.115022	0.028832	04:30
8	0.044530	0.108681	0.023546	04:26
9	0.035370	0.105452	0.023546	04:25
10	0.022027	0.110059	0.023546	04:25
11	0.023722	0.106580	0.022585	04:25

epoch	train_loss	valid_loss	error_rate	time
0	1.004288	0.582304	0.187890	03:54

epoch	train_loss	valid_loss	error_rate	time
0	0.384640	0.259779	0.074003	05:11
1	0.352266	0.273039	0.075925	05:14
2	0.378463	0.300954	0.073522	05:10
3	0.246590	0.394313	0.087458	05:04
4	0.241524	0.235776	0.061509	05:04
5	0.139398	0.231180	0.061989	05:04
6	0.114390	0.198267	0.042287	05:04
7	0.079242	0.174894	0.035079	05:13
8	0.040460	0.150516	0.028352	05:14
9	0.035262	0.140247	0.029313	05:09
10	0.032867	0.124250	0.025469	05:01
11	0.019166	0.123131	0.024027	05:02

epoch	train_loss	valid_loss	error_rate	time
0	0.945174	0.587339	0.188852	03:56

epoch	train_loss	valid_loss	error_rate	time
0	0.431525	0.264968	0.085536	05:07
1	0.373674	0.285950	0.083133	05:07
2	0.294009	0.283589	0.075925	05:06
3	0.283224	0.297093	0.083133	05:06
4	0.196644	0.185746	0.048054	05:09
5	0.153443	0.165554	0.037963	05:15
6	0.122116	0.162079	0.039885	05:22
7	0.105744	0.148939	0.036040	05:13
8	0.073754	0.101312	0.024988	05:07
9	0.038434	0.106780	0.021624	05:04
10	0.034042	0.098371	0.019222	05:11
11	0.022197	0.096102	0.020183	05:12

epoch	train_loss	valid_loss	error_rate	time
0	0.941143	0.501398	0.165786	03:40

epoch	train_loss	valid_loss	error_rate	time
0	0.438842	0.208749	0.061028	04:25
1	0.355431	0.220552	0.061509	04:33
2	0.349027	0.397794	0.120135	04:35
3	0.293179	0.189101	0.053820	04:41
4	0.206543	0.163805	0.046612	04:40
5	0.186177	0.128186	0.035560	04:41
6	0.113470	0.114702	0.025949	04:24
7	0.112366	0.092310	0.024988	04:24
8	0.074616	0.091132	0.021144	04:25
9	0.042644	0.079768	0.020663	04:24
10	0.042276	0.080345	0.019222	04:24
11	0.035028	0.081594	0.020183	04:24

epoch	train_loss	valid_loss	error_rate	time
0	0.965668	0.443010	0.141278	03:42

epoch	train_loss	valid_loss	error_rate	time
0	0.448268	0.263037	0.083133	04:27
1	0.355133	0.252433	0.081211	04:26
2	0.333021	0.213186	0.064873	04:28
3	0.278992	0.219155	0.066314	04:26
4	0.204435	0.168250	0.053820	04:26
5	0.180749	0.170310	0.045171	04:25
6	0.143278	0.142150	0.032196	04:25
7	0.100807	0.110142	0.026910	04:27
8	0.051518	0.101470	0.023546	04:24
9	0.039993	0.097524	0.025469	04:25
10	0.035260	0.094065	0.020183	04:25
11	0.036998	0.096248	0.020663	04:25

epoch	train_loss	valid_loss	error_rate	time
0	0.966472	0.521585	0.164344	03:16

33.33% [4/12 15:46<31:32]

epoch	train_loss	valid_loss	error_rate	time
0	0.409947	0.254923	0.073522	03:56
1	0.363681	0.227124	0.070639	03:56
2	0.361192	0.202026	0.065834	03:56
3	0.276248	0.296600	0.083614	03:56

19.23% [50/260 00:41<02:55 0.2991]

Ensembling all the models

# save all results
save_pickle('tta_res.pkl', tta_res)

# - Learner.tta returns predictions and targets for each row

# get predictions
tta_prs = first(zip(*tta_res))

# - Jeremy's experiments and research figured out that vit was a better than all the other models
# - Based on the experiments, vit has double the weight in the ensemble
# - vit double weight -> add vit results to the ensemble (or compute a weighted average)
tta_prs += tta_prs[1:3]

# - ensemble: a model which is itself the result of combining a number of other models
# - simplest way of ensembling is to take the average of the predictions of each model
avg_pr = torch.stack(tta_prs).mean(0)
print(f"avg prediction shape: {avg_pr.shape}")
print(f"avg prediction rank: {len(avg_pr.shape)}")

Submission

dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, item_tfms=Resize(480, method='squish'),
    batch_tfms=aug_transforms(size=224, min_scale=0.75))

idxs = avg_pr.argmax(dim=1)
vocab = np.array(dls.vocab)
sample_sub = pd.read_csv(path/'sample_submission.csv')
sample_sub['label'] = vocab[idxs]
sample_sub.to_csv('part3_subm.csv', index=False)
!head part3_subm.csv

Part 4

predict what disease the rice paddy has AND what kind of rice is shown

# load data files
from fastai.vision.all import *
from fastcore.parallel import *
set_seed(42)

trn_path = path/'train_images'
df = pd.read_csv(path/'train.csv', index_col='image_id')

# data summary
df.head()

	label	variety	age
image_id
100330.jpg	bacterial_leaf_blight	ADT45	45
100365.jpg	bacterial_leaf_blight	ADT45	45
100382.jpg	bacterial_leaf_blight	ADT45	45
100632.jpg	bacterial_leaf_blight	ADT45	45
101918.jpg	bacterial_leaf_blight	ADT45	45

# - get rice variety from rice image data
df.loc['100330.jpg', 'variety']

'ADT45'

# rice variety helper function
def get_variety(p):
  return df.loc[p.name, 'variety']

# Dataloader + Datablock setup
dls = DataBlock(
    # - create an image(contents of file), 2 categorical variables(disease and variety)
    blocks=(ImageBlock,CategoryBlock,CategoryBlock),
    # - 1 input(the image), 2 outputs(disease category, variety category)
    n_inp=1,
    # - get list of inputs (image files)
    get_items=get_image_files,
    # - create the outputs for each image file (parent image label, variety(from variety function))
    get_y = [parent_label,get_variety],
    # - split data into 80% training, 20% validation
    splitter=RandomSplitter(0.2, seed=42),
    # - batch transforms
    item_tfms=Resize(192, method='squish'),
    batch_tfms=aug_transforms(size=128, min_scale=0.75)
).dataloaders(trn_path)

# batch analysis
dls.show_batch(max_n=6)

Disease Model

metric and loss function will take three inputs:

model outputs(metric and loss function inputs)
two targets (disease and variety)

Metric + Loss Functions

# metric and loss functions

# metric function
def disease_err(inp,disease,variety):
  return error_rate(inp,disease)

# loss function - cross entropy
def disease_loss(inp,disease,variety):
   return F.cross_entropy(inp,disease)

# learner setup
arch = 'convnext_small_in22k'
learn = vision_learner(dls, arch, loss_func=disease_loss, metrics=disease_err, n_out=10).to_fp16()
# learning rate
lr = 0.01

# train and fine tune model
learn.fine_tune(5, lr)

epoch	train_loss	valid_loss	disease_err	time
0	1.295816	0.941801	0.281115	01:24

epoch	train_loss	valid_loss	disease_err	time
0	0.641056	0.494316	0.152331	01:23
1	0.466232	0.332578	0.095147	01:25
2	0.304405	0.187158	0.059587	01:24
3	0.170902	0.167094	0.048054	01:24
4	0.126892	0.147188	0.038924	01:24

Multi-Target Model

# - to predict probability of disease and variety we need model to output a tensor of length 20
# 10 possible diseases, 10 possible varities -> n_out=20 sets this up
learn = vision_learner(dls, arch, n_out=20).to_fp16()

Metric + Loss Functions

# loss functions
# - input tensor is length 20 but need to use first 10 values for disease

# disease loss
def disease_loss(inp, disease, variety):
  return F.cross_entropy(inp[:,:10],disease)

# variety loss
def variety_loss(inp,disease,variety):
  return F.cross_entropy(inp[:,10:],variety)

# total loss
def combine_loss(inp,disease,variety):
  return disease_loss(inp,disease,variety) + variety_loss(inp,disease,variety)

# error rate functions

# disease metric
def disease_err(inp,disease,variety):
  return error_rate(inp[:,:10],disease)

# variety metric
def variety_err(inp,disease,variety):
  return error_rate(inp[:,10:],variety)

# all error rate
err_metrics = (disease_err, variety_err)

# all metrics
all_metrics = err_metrics + (disease_loss,variety_loss)

Train Model

# train learner and finetune
learn = vision_learner(dls, arch, loss_func=combine_loss, metrics=all_metrics, n_out=20).to_fp16()
learn.fine_tune(5, lr)

epoch	train_loss	valid_loss	disease_err	variety_err	disease_loss	variety_loss	time
0	2.307458	1.334855	0.288323	0.133109	0.893397	0.441458	01:19

epoch	train_loss	valid_loss	disease_err	variety_err	disease_loss	variety_loss	time
0	1.010617	0.681740	0.154253	0.062951	0.472450	0.209290	01:25
1	0.763498	0.449330	0.101874	0.046132	0.312340	0.136990	01:33
2	0.470782	0.310100	0.066314	0.030274	0.220500	0.089600	01:28
3	0.289418	0.217040	0.044210	0.018260	0.155922	0.061118	01:29
4	0.204842	0.204496	0.042287	0.017299	0.155690	0.048806	01:26

Collaborative Filtering Deep Dive

Load Data + Set Up

# set up some stuff
from fastai.collab import *
from fastai.tabular.all import *
set_seed(42)

# movie dataset
path = untar_data(URLs.ML_100k)

100.15% [4931584/4924029 00:00<00:00]

# load data files
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None, names=['user','movie','rating','timestamp'])

Exploratory Data Analysis

# exploratory analysis
ratings.head()

	user	movie	rating	timestamp
0	196	242	3	881250949
1	186	302	3	891717742
2	22	377	1	878887116
3	244	51	2	880606923
4	166	346	1	886397596

Experiments

# - very science fiction = 0.98
# - very action = 0.9
# - very not as old = -0.9
last_skywalker = np.array([0.98,0.9,-0.9])

# user who likes modern sci-fi movies
# - very science fiction = 0.9
# - very action = 0.8
# - very not as old = -0.6
user1 = np.array([0.9,0.8,-0.6])

# match between lastskywalker and user1
(user1 * last_skywalker).sum()

2.1420000000000003

# - very science fiction = -0.99
# - very action = -0.3
# - very not as old = 0.8
casablanca = np.array([-0.99,-0.3,0.8])

# match between casablanca and user
(user1 * casablanca).sum()

-1.611

Learning Latent Factors

gradient descent can be used to learn the latent factors

randomly initialize some parameters. parameters will be a set of latent factors for each user and movie
calculate predictions. take dot product of each movie and user. High number represents match, low number represents mismatch
calculate loss. mean square error is good for representing accuracy of a prediction

# movie data
movies = pd.read_csv(path/'u.item',  delimiter='|', encoding='latin-1',
                     usecols=(0,1), names=('movie','title'), header=None)
movies.head()

	movie	title
0	1	Toy Story (1995)
1	2	GoldenEye (1995)
2	3	Four Rooms (1995)
3	4	Get Shorty (1995)
4	5	Copycat (1995)

# movie + rating data
ratings = ratings.merge(movies)
ratings.head()

	user	movie	rating	timestamp	title
0	196	242	3	881250949	Kolya (1996)
1	63	242	3	875747190	Kolya (1996)
2	226	242	5	883888671	Kolya (1996)
3	154	242	3	879138235	Kolya (1996)
4	306	242	5	876503793	Kolya (1996)

# Dataloaders
# - first column is for the user
# - second column is for the item (movies)
# - third column is for the rating
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
dls.show_batch()

	user	title	rating
0	542	My Left Foot (1989)	4
1	422	Event Horizon (1997)	3
2	311	African Queen, The (1951)	4
3	595	Face/Off (1997)	4
4	617	Evil Dead II (1987)	1
5	158	Jurassic Park (1993)	5
6	836	Chasing Amy (1997)	3
7	474	Emma (1996)	3
8	466	Jackie Chan's First Strike (1996)	3
9	554	Scream (1996)	3

# represent movie and user late factor table as matrices
# - result = index of the move in movie latent factor matrix * index of the user in user late factor matrix
n_users  = len(dls.classes['user'])
n_movies = len(dls.classes['title'])
n_factors = 5
user_factors = torch.randn(n_users, n_factors)
movie_factors = torch.randn(n_movies, n_factors)

Embeddings

# convert index lookup to matrix product
# - replace indices with one-hot encoded vectors

# multiply a vector by a one hot-encoded vector representing index 3
one_hot_3 = one_hot(3, n_users).float()
print(f"dot product of user factors and one-hot encoded three: {user_factors.t() @ one_hot_3}")
print(f"vector at index 3 in user factor matrix: {user_factors[3]}")

dot product of user factors and one-hot encoded three: tensor([-0.4586, -0.9915, -0.4052, -0.3621, -0.5908])
vector at index 3 in user factor matrix: tensor([-0.4586, -0.9915, -0.4052, -0.3621, -0.5908])

Collaborative Filtering From Scratch

# Dot Product Class
class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)

    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return (users * movies).sum(dim=1)

#  - input of the model is a tensor of shape batch_size x 2
# - first column x[:0] contains user IDS
# - second column x[:1] contains movie IDS
# - embedding layers represent matrices of user and movie latent factors
x,y = dls.one_batch()
x.shape

torch.Size([64, 2])

model = DotProduct(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())

# train model
learn.fit_one_cycle(5, 5e-3)

epoch	train_loss	valid_loss	time
0	1.344786	1.279100	00:08
1	1.093331	1.109981	00:07
2	0.958258	0.990199	00:08
3	0.814234	0.894916	00:08
4	0.780714	0.882022	00:08

# force predictions to be between 0 and 5 (match the reviews)
# - good to have range go a bit over 5
class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return sigmoid_range((users * movies).sum(dim=1), *self.y_range)

# train model
model = DotProduct(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)

epoch	train_loss	valid_loss	time
0	0.986799	1.005294	00:08
1	0.878134	0.918898	00:08
2	0.675850	0.875467	00:09
3	0.483372	0.877939	00:08
4	0.378927	0.881887	00:08

Dot Product + Bias

some users are more positive or negative in their recommendations than others, and some movies are better or worse than others
the dot product representation from above does not encode this information

# add biases
class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users, 1)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias = Embedding(n_movies, 1)
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        res = (users * movies).sum(dim=1, keepdim=True)
        res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])
        return sigmoid_range(res, *self.y_range)

# train model
model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)

epoch	train_loss	valid_loss	time
0	0.938634	0.952516	00:09
1	0.846664	0.865633	00:09
2	0.608090	0.865127	00:09
3	0.413482	0.887318	00:08
4	0.286971	0.894876	00:09

Weight Decay

# Weight decay with parabola
x = np.linspace(-2,2,100)
a_s = [1,2,5,10,50]
ys = [a * x**2 for a in a_s]
_,ax = plt.subplots(figsize=(8,6))
for a,y in zip(a_s,ys): ax.plot(x,y, label=f'a={a}')
ax.set_ylim([0,5])
ax.legend()

<matplotlib.legend.Legend at 0x7ab900191d50>

# train movie model with weight decay parameter
# - weight decay is parameter that controls the sum of squares added to loss function
# - in practice it is very inefficient (and potentially numerically unstable) to compute large sums and add them to the loss
# - fastai sets weight decay with weight decay parameter
model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)

epoch	train_loss	valid_loss	time
0	0.932776	0.961672	00:09
1	0.888625	0.882614	00:09
2	0.771066	0.832743	00:09
3	0.599807	0.822374	00:09
4	0.504981	0.822528	00:09

Embeddings from Scratch

# Embedding Module From Scratch
class T(Module):
  def __init__(self):
    self.a = nn.Linear(1, 3, bias=False)
    # self.a = nn.Parameter(torch.ones(3))

t = T()
L(t.parameters())

# check type of t
print(f"T type: {type(t.a.weight)}")

T type: <class 'torch.nn.parameter.Parameter'>

# create tensor with random initialization
def create_params(size):
    return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))

# DotProduct with bias (and without embedding)
class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = create_params([n_users, n_factors])
        self.user_bias = create_params([n_users])
        self.movie_factors = create_params([n_movies, n_factors])
        self.movie_bias = create_params([n_movies])
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors[x[:,0]]
        movies = self.movie_factors[x[:,1]]
        res = (users*movies).sum(dim=1)
        res += self.user_bias[x[:,0]] + self.movie_bias[x[:,1]]
        return sigmoid_range(res, *self.y_range)

# train model
model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)

epoch	train_loss	valid_loss	time
0	0.929254	0.953444	00:09
1	0.865246	0.878304	00:10
2	0.720294	0.838921	00:10
3	0.582796	0.829129	00:09
4	0.474043	0.829031	00:09

Interpreting Embeddings + Biases

# - biases are easiest to interpret
# - movies with the lowest values in bias vector
# - for each of these movies, even when a user is matched to its latent factors, they still don't generally like it
# - does not tell us whether a movie is of a kind that people tend to enjoy watching but that people tend not to like
# watching it even if it is of a kind they would otherwise enjoy
movie_bias = learn.model.movie_bias.squeeze()
idxs = movie_bias.argsort()[:5]
[dls.classes['title'][i] for i in idxs]

['Lawnmower Man 2: Beyond Cyberspace (1996)',
 'Children of the Corn: The Gathering (1996)',
 'Mortal Kombat: Annihilation (1997)',
 'Amityville 3-D (1983)',
 'Beautician and the Beast, The (1997)']

# movies with high bias
idxs = movie_bias.argsort(descending=True)[:5]
[dls.classes['title'][i] for i in idxs]

['Titanic (1997)',
 'Shawshank Redemption, The (1994)',
 'Silence of the Lambs, The (1991)',
 'L.A. Confidential (1997)',
 "Schindler's List (1993)"]

# PCA analysis of the two strongest PCA components
g = ratings.groupby('title')['rating'].count()
top_movies = g.sort_values(ascending=False).index.values[:1000]
top_idxs = tensor([learn.dls.classes['title'].o2i[m] for m in top_movies])
movie_w = learn.model.movie_factors[top_idxs].cpu().detach()
movie_pca = movie_w.pca(3)
fac0,fac1,fac2 = movie_pca.t()
idxs = list(range(50))
X = fac0[idxs]
Y = fac2[idxs]
plt.figure(figsize=(12,12))
plt.scatter(X, Y)
for i, x, y in zip(top_movies[idxs], X, Y):
    plt.text(x,y,i, color=np.random.rand(3)*0.7, fontsize=11)
plt.show()

fastai collaborative filtering

learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))
learn.fit_one_cycle(5, 5e-3, wd=0.1)

epoch	train_loss	valid_loss	time
0	0.939463	0.954959	00:10
1	0.841215	0.876151	00:09
2	0.724404	0.832099	00:09
3	0.597228	0.816953	00:08
4	0.481373	0.817286	00:09

# print layer names of model
learn.model

EmbeddingDotBias(
  (u_weight): Embedding(944, 50)
  (i_weight): Embedding(1665, 50)
  (u_bias): Embedding(944, 1)
  (i_bias): Embedding(1665, 1)
)

# high bias movies
movie_bias = learn.model.i_bias.weight.squeeze()
idxs = movie_bias.argsort(descending=True)[:5]
[dls.classes['title'][i] for i in idxs]

['L.A. Confidential (1997)',
 'Titanic (1997)',
 'Shawshank Redemption, The (1994)',
 'Silence of the Lambs, The (1991)',
 'Rear Window (1954)']

Embedding Distance

# - distance between two movie embeddings can define two movies that are nearly identical 
# because users that would like them would be nearly exactly the same

# find movie most similar to Silence of the Lambs
movie_factors = learn.model.i_weight.weight
idx = dls.classes['title'].o2i['Silence of the Lambs, The (1991)']
distances = nn.CosineSimilarity(dim=1)(movie_factors, movie_factors[idx][None])
idx = distances.argsort(descending=True)[1]
dls.classes['title'][idx]

'Before the Rain (Pred dozhdot) (1994)'

Deep Learning + Collaborative Filtering

# fastai has a function that returns the recommended size for embeddings matrices for your data based on a heuristic FastAI found
# that works well in practice
embs = get_emb_sz(dls)
embs

[(944, 74), (1665, 102)]

# Deep Learning Collaborative Filtering Class
class CollabNN(Module):
    def __init__(self, user_sz, item_sz, y_range=(0,5.5), n_act=100):
        self.user_factors = Embedding(*user_sz)
        self.item_factors = Embedding(*item_sz)
        self.layers = nn.Sequential(
            nn.Linear(user_sz[1]+item_sz[1], n_act),
            nn.ReLU(),
            nn.Linear(n_act, 1))
        self.y_range = y_range

    def forward(self, x):
        embs = self.user_factors(x[:,0]),self.item_factors(x[:,1])
        x = self.layers(torch.cat(embs, dim=1))
        return sigmoid_range(x, *self.y_range)

# create model and train it
model = CollabNN(*embs)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.01)

epoch	train_loss	valid_loss	time
0	0.943857	0.951898	00:10
1	0.914082	0.898525	00:10
2	0.848892	0.884356	00:10
3	0.814803	0.875278	00:10
4	0.761398	0.878594	00:09

# model with hidden layers of 100 and 50
learn = collab_learner(dls, use_nn=True, y_range=(0, 5.5), layers=[100,50])
learn.fit_one_cycle(5, 5e-3, wd=0.1)

epoch	train_loss	valid_loss	time
0	1.003178	0.998673	00:11
1	0.877362	0.934763	00:11
2	0.887651	0.898290	00:11
3	0.815599	0.865441	00:12
4	0.788975	0.864559	00:11

Acknowledgements

Summary

Jeremy Howard’s Advice

Terminology

Cross Entropy Loss

SoftMax

Cross Entropy

Binary Cross Entropy

Load Data and Libraries

Paddy Doctor

Part 3

Memory + Gradient Accumulation

Gradient Accumulation

Memory Consumption Analysis

Experimenting with different models

Ensembling all the models

Submission

Part 4

Disease Model

Metric + Loss Functions

Multi-Target Model

Metric + Loss Functions

Train Model

Collaborative Filtering Deep Dive

Load Data + Set Up

Exploratory Data Analysis

Experiments

Learning Latent Factors

Embeddings

Collaborative Filtering From Scratch

Dot Product + Bias

Weight Decay

Embeddings from Scratch

Interpreting Embeddings + Biases

fastai collaborative filtering

Embedding Distance

Deep Learning + Collaborative Filtering

Resources