FastAI Lesson 4: Natural Language Processing (NLP)

FastAI Lesson 4 Notes
learning
fastai
deep learning
Author

Pranav Rajan

Published

January 9, 2024

Announcements

Happy New Year! I’m finally back to FastAI after taking a break for job applications, learning Golang, and reflecting on my job search. This lesson was particularly challenging for me because I ran into a bunch of weird bugs and API changes in the Hugging Face Transformers library. At first I had no idea what the code was doing in the FastAI example, despite copying Jeremy’s code verbatim and adding print statements. After spending some time debugging and working out what each part of the Hugging Face setup and Jeremy’s code was doing, I finally feel comfortable writing a blog post about this lesson.

Acknowledgements

All of this code was written by Jeremy Howard and the FastAI team. I modified it slightly to include my own print statements, comments, and additional helper functions based on Jeremy’s code. The sources for the original code are Getting Started With NLP for Absolute Beginners and Iterate Like A Grandmaster.

Summary

In this lesson, Jeremy gives an overview of his pioneering work on ULMFiT, how to use Transformers, and the setup for participating in Kaggle competitions. I’m no NLP expert and don’t have a particular interest in NLP, but what really excited me was that I finally understood the iteration and development process for a Kaggle competition. This had been one of my goals when I embarked on FastAI.

Jeremy Howard’s Advice

Know the tools of your trade really well - Data Science Core Tools: Python, NumPy, Pandas, Matplotlib, scikit-learn, PyTorch, SciPy

Data Science/Kaggle Competition Workflow

This is the workflow that I figured out from this chapter.

  1. Import data and check files
  2. Exploratory Data Analysis
  3. Data Representation (Wrangling, Tokenization, Numericalization)
  4. Metrics
  5. Training Model
  6. Evaluate Model Performance

I found the steps similar to the visualization process. I could have dedicated more time to iterating on and fine-tuning my model to score higher on this Kaggle competition, but in this chapter I was aiming to understand the general workflow and process that goes into building and iterating on a model.

US Patent Phrase to Phrase Matching Competition

The US Patent Phrase to Phrase Matching Kaggle competition is used as the case study for learning the basics of NLP, how to use Transformers, and the Kaggle workflow. More information about the data and competition can be found here: US Patent Phrase to Phrase Matching Competition

Load Data and Libraries

# import libraries and files

# required libraries + packages for any ml/data science project
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch

# fastai wraps all of the packages above in a single library
!pip install -Uqq fastai

# kaggle API package install
!pip install kaggle

# Hugging Face required libraries + packages
!pip install -q datasets
!pip install transformers sentencepiece
!pip install transformers[torch]
Requirement already satisfied: kaggle in /usr/local/lib/python3.10/dist-packages (1.5.16)
...
Successfully installed sentencepiece-0.1.99
...
Successfully installed accelerate-0.26.1
# import stuff from fastai
from fastai.imports import *
import os
from pathlib import Path
import zipfile


def loadData(creds, dataFile):
    '''Load a Kaggle competition dataset locally or on Kaggle.
    Input: Kaggle API login credentials, Kaggle contest name
    Returns: a local path to the data files'''
    # variable to check whether we're running on kaggle website or not
    iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

    # path for kaggle API credentials
    cred_path = Path('~/.kaggle/kaggle.json').expanduser()

    if not cred_path.exists():
        cred_path.parent.mkdir(exist_ok=True)
        cred_path.write_text(creds)
        cred_path.chmod(0o600)

    # Download data from Kaggle to path and extract files at path location

    # local machine
    path = Path(dataFile)
    if not iskaggle and not path.exists():
        import kaggle
        kaggle.api.competition_download_cli(str(path))
        zipfile.ZipFile(f'{path}.zip').extractall(path)

    # kaggle
    if iskaggle:
        fileName = '../input/' + dataFile
        path = fileName

    return path
creds = ''
dataFile = 'us-patent-phrase-to-phrase-matching'
path = loadData(creds, dataFile)
Downloading us-patent-phrase-to-phrase-matching.zip to /content
100%|██████████| 682k/682k [00:00<00:00, 727kB/s]

Document - A file containing some text

Large Documents - One text file per document, often organized into one folder per category

Smaller Documents - One document per row in a CSV File
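To make these two layouts concrete, here is a minimal sketch (with hypothetical paths and file names) of loading each form:

# Minimal sketch, assuming hypothetical paths: loading both document layouts
from pathlib import Path
import pandas as pd

# Large documents: one text file per document, one folder per category
docs = [(p.parent.name, p.read_text()) for p in Path('data').glob('*/*.txt')]

# Smaller documents: one document per row in a CSV file
small_docs = pd.read_csv('data/train.csv')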

# check data files
!ls {path}
sample_submission.csv  test.csv  train.csv

Problem Statement

Problem: given a pair of phrases (an anchor and a target phrase), rate how similar they are on a scale of 0 (not similar) to 1 (identical in meaning). Similarity has been scored within a patent’s context, specifically the CPC classification, which indicates the subject to which the patent relates.

Training Data

  • contains phrases, contexts and similarity scores

Test(Inference) Data

  • The unseen test set contains approximately 12k pairs of phrases. A small public test set has been provided for testing purposes but is not used in scoring.

  • It has an identical structure to the training data but without the score.

Evaluation Goal

  • Build a model to match phrases in order to extract contextual information, helping the patent community connect the dots between millions of patent documents

Evaluation Metric

  • Submissions are evaluated on the Pearson Correlation Coefficient between predicted and actual similarity scores.

Submission Format

  • id (representing a pair of phrases) in the test set, and a score representing the similarity (a sketch of this file follows the list)

  • A score of 1 means the two inputs have identical meaning

  • A score of 0 means the two inputs have totally different meanings

  • A score in between, e.g. 0.5, means they are somewhat similar but not identical

  • Problem Type - NLP classification problem -> automatically classify a document into some category
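As a rough sketch of the required format (the ids come from the test set shown later in this post; the scores are made up for illustration):

# Hypothetical sketch of building a submission file in the required format
import pandas as pd

sub = pd.DataFrame({'id': ['4112d61851461f60', '09e418c93a776564'],
                    'score': [0.5, 0.7]})
sub.to_csv('submission.csv', index=False)
# submission.csv contents:
# id,score
# 4112d61851461f60,0.5
# 09e418c93a776564,0.7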

Exploratory Data Analysis

# set up data path

# training data path
df = pd.read_csv(path/'train.csv')

# testing data path
eval_df = pd.read_csv(path/'test.csv')

# sample submission data path
submit_df = pd.read_csv(path/'sample_submission.csv')
# Training Data Length
print(f"{len(df)}")
36473
# Training Data Info
df
id anchor target context score
0 37d61fd2272659b1 abatement abatement of pollution A47 0.50
1 7b9652b17b68b7a4 abatement act of abating A47 0.75
2 36d72442aefd8232 abatement active catalyst A47 0.25
3 5296b0c19e1ce60e abatement eliminating process A47 0.50
4 54c1e3b9184cb5b6 abatement forest region A47 0.00
... ... ... ... ... ...
36468 8e1386cbefd7f245 wood article wooden article B44 1.00
36469 42d9e032d1cd3242 wood article wooden box B44 0.50
36470 208654ccb9e14fa3 wood article wooden handle B44 0.50
36471 756ec035e694722b wood article wooden material B44 0.75
36472 8d135da0b55b8c88 wood article wooden substrate B44 0.50

36473 rows × 5 columns

# Training Data Summary
df.describe(include='object')
id anchor target context
count 36473 36473 36473 36473
unique 36473 733 29340 106
top 37d61fd2272659b1 component composite coating composition H01
freq 1 152 24 2186
# Test Data Length
print(f"{len(eval_df)}")
36
# Test Data Info
eval_df.head()
id anchor target context
0 4112d61851461f60 opc drum inorganic photoconductor drum G02
1 09e418c93a776564 adjust gas flow altering gas flow F23
2 36baf228038e314b lower trunnion lower locating B60
3 1f37ead645e7f0c8 cap component upper portion D06
4 71a5b6ad068d531f neural stimulation artificial neural network H04
# Test Data Summary
eval_df.describe()
id anchor target context
count 36 36 36 36
unique 36 34 36 29
top 4112d61851461f60 el display inorganic photoconductor drum G02
freq 1 2 1 3

Data Representation (Tokenization, Numericalization, Data Wrangling)

Models require numbers as inputs -> we need some strategy for mapping words, characters, etc. to numerical values

Tokenization: Split each text up into tokens

Numericalization: Convert each token to a number

AutoTokenizer: Creates a tokenizer appropriate for the selected NLP model

Vocab: A special list in the tokenizer which contains a unique integer for every possible token string

The start of a new word is indicated by ▁ (a special underscore-like character)
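As a minimal sketch of these two steps (tokz is the AutoTokenizer created in the code below; the example id for '▁of' is taken from the vocab exploration later in this post):

# Minimal sketch: tokenization followed by numericalization
# (assumes tokz, the AutoTokenizer created in the next code block)
toks = tokz.tokenize('abatement of pollution')  # tokenization: text -> tokens
ids = tokz.convert_tokens_to_ids(toks)          # numericalization: tokens -> integers, e.g. '▁of' -> 265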

from torch.utils.data import DataLoader
import warnings,transformers,logging,torch
from transformers import TrainingArguments,Trainer
from transformers import AutoModelForSequenceClassification,AutoTokenizer
import datasets
from datasets import load_dataset, Dataset, DatasetDict

# ignore hugging face warnings
warnings.simplefilter('ignore')
logging.disable(logging.WARNING)
# Tokenization + Numericalization

# - Models require numbers as inputs -> need to convert text to numbers:
# Tokenization - split each text into tokens
# Numericalization - convert each token into a number
# - This process depends on the model that is used. AutoTokenizer creates the appropriate tokenizer based on the selected model

# Model
# - This is a reasonable model to start nearly any NLP problem. Replace "small" with "large" for a slower but more accurate model
# once data exploration and experimentation is completed
model_nm = 'microsoft/deberta-v3-small'

# Tokenizer (AutoTokenizer was already imported above)
tokz = AutoTokenizer.from_pretrained(model_nm)

# Test Tokenizer
print(f"Favorite Line from The Bear: {tokz.tokenize('Every Second Counts')}")
print(f"Uncommon words: {tokz.tokenize('A platypus is an ornithorhynchus anatinus.')}")

# function for tokenizing
def tok_func(x):
  return tokz(x["input"])
Favorite Line from The Bear: ['▁Every', '▁Second', '▁Counts']
Uncommon words: ['▁A', '▁platypus', '▁is', '▁an', '▁or', 'ni', 'tho', 'rhynch', 'us', '▁an', 'at', 'inus', '.']
# Combine Context + Anchor + Target
# Method 1 - Baseline
df['input'] = 'CONTEXT: ' + df.context + '; TEXT1: ' + df.anchor + '; TEXT2: ' + df.target
eval_df['input'] = 'CONTEXT: ' + eval_df.context + '; TEXT1: ' + eval_df.anchor + '; TEXT2: ' + eval_df.target
# Convert data to Transformer DataSet Data Structure
# rename score to labels for training data -> Transformers require a label column
ds = Dataset.from_pandas(df).rename_column('score', 'label')
eval_ds = Dataset.from_pandas(eval_df)
# Tokenize Training Data
print("Original Input Data")
print(ds)

# test tokenizing function
# tok_func(ds[0])

# Tokenize all input data in parallel
tok_ds = ds.map(tok_func, batched=True)

print("Tokenized Input Data")
print(tok_ds)

# check tokenized data
print("Check Tokenized Input Data")
print(tok_ds[0])
Original Input Data
Dataset({
    features: ['id', 'anchor', 'target', 'context', 'label', 'input'],
    num_rows: 36473
})
Tokenized Input Data
Dataset({
    features: ['id', 'anchor', 'target', 'context', 'label', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 36473
})
Check Tokenized Input Data
{'id': '37d61fd2272659b1', 'anchor': 'abatement', 'target': 'abatement of pollution', 'context': 'A47', 'label': 0.5, 'input': 'CONTEXT: A47; TEXT1: abatement; TEXT2: abatement of pollution', 'input_ids': [1, 20967, 104917, 294, 336, 5753, 346, 54453, 435, 294, 47284, 346, 54453, 445, 294, 47284, 265, 6435, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# Tokenize Inference Data
print("Original Inference Data")
print(eval_ds)

eval_ds = eval_ds.map(tok_func, batched=True)

print("Tokenized Inference Data")
print(eval_ds)


print("Check Tokenized Inference Data")
print(eval_ds[0])
Original Inference Data
Dataset({
    features: ['id', 'anchor', 'target', 'context', 'input'],
    num_rows: 36
})
Tokenized Inference Data
Dataset({
    features: ['id', 'anchor', 'target', 'context', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 36
})
Check Tokenized Inference Data
{'id': '4112d61851461f60', 'anchor': 'opc drum', 'target': 'inorganic photoconductor drum', 'context': 'G02', 'input': 'CONTEXT: G02; TEXT1: opc drum; TEXT2: inorganic photoconductor drum', 'input_ids': [1, 20967, 104917, 294, 1098, 4159, 346, 54453, 435, 294, 8847, 1207, 8263, 346, 54453, 445, 294, 31553, 1456, 48133, 8263, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
# Tokenization Exploration

# - Vocab -> special list in the tokenizer which contains a unique integer for every possible token string
# - Input IDs are generated from the vocab list
row = tok_ds[0]
print(f"Input: {row['input']}, Input IDs: {row['input_ids']}")

# Token for the word "of"
print(f"Token for the word of: {tokz.vocab['▁of']}")
Input: CONTEXT: A47; TEXT1: abatement; TEXT2: abatement of pollution, Input IDs: [1, 20967, 104917, 294, 336, 5753, 346, 54453, 435, 294, 47284, 346, 54453, 445, 294, 47284, 265, 6435, 2]
Token for the word of: 265

Datasets: Training, Validation, Testing (Inference)

Training Dataset

A set of data that the model uses to learn the weights needed to create a function that best approximates the data.

Validation Dataset

  • A dataset that is used to determine whether the model is under-fitting, over-fitting, or fitting the data (to some threshold degree). This dataset is withheld from training and the model never sees it; the validation set is only used for evaluating the model during training and is never used as input to train the model
  • Creating validation datasets is an active area of research and engineering. See this post by Dr. Rachel Thomas on validation datasets - How (and why) to create a good validation set
  • Transformers call the validation dataset in the DatasetDict object test

Test (Inference) Set

  • A dataset that is withheld from both training and metric reporting. The accuracy of the model on the test set is only checked after completing the entire training process -> trying different models, training methods, data processing, optimizations, etc.
  • Kaggle has a second test set which is a held out dataset used only at the end of competitions to assess predictions (private leaderboard).
# Validation Set - subset of the training dataset which the model doesn't see at all
# - The validation set is used to check if the model is underfit, overfit or just right
# - The validation set is not used as input to the training model
# - Transformers use DatasetDict to store both the training and validation dataset
# - fastai automatically creates the validation set for you if you don't have one and reports the metrics
# (accuracy of model) using the validation set

# 25% validation set, 75% training set using Hugging Face Transformers
# "test" refers to the validation set
dds = tok_ds.train_test_split(0.25, seed=42)

print("Check Training-Validation Split")
print(dds)
Check Training-Validation Split
DatasetDict({
    train: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'label', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 27354
    })
    test: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'label', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9119
    })
})

Metrics

  • Measurements that help evaluate how good the model is

  • Kaggle tells users the metric and how submissions are evaluated in the problem description. In industry and research problems developing metrics is more complicated. See this post by Dr. Rachel Thomas on metrics - The problem with metrics is a big problem for AI

  • In this challenge, Kaggle has stated that submissions are evaluated based on the Pearson Correlation Coefficient between the predicted and actual similarity scores.

  • Pearson’s Correlation Coefficient is represented by r and is one of the most widely used measures of the degree of relationship between two variables (see the formula after this list).

  • r can vary between -1, indicating a perfect inverse correlation, and +1, indicating a perfect positive correlation

  • r is sensitive to outliers -> a few outliers can dominate the data in a Pearson correlation visualization, producing skewed results.
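For reference, Pearson’s r for paired samples x and y is the covariance of the two variables divided by the product of their standard deviations:

$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}} $$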

# Metrics - Correlation

# function for returning the correlation between two variables
def corr(x,y):
  return np.corrcoef(x,y)[0][1]

# - Transformers expect metrics to be returned as a dict, so the Trainer knows what name to report each metric under
def corr_d(eval_pred):
   return {'pearson': corr(*eval_pred)}
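As a quick sanity check of corr_d on a made-up (predictions, labels) tuple (the values are illustrative only):

# Illustrative check: eval_pred is a (predictions, labels) pair
import numpy as np

preds = np.array([0.1, 0.5, 0.9])
labels = np.array([0.0, 0.5, 1.0])
print(corr_d((preds, labels)))  # {'pearson': 1.0} (up to floating point), since the points lie on a line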

Train Model

# Set up hyperparameter values

# learning rate
lr = 8e-5

# batch size
bs = 128

# weight decay
wd = 0.01

# epochs
epochs = 4
# - Transformers use the TrainingArguments class to set up the training arguments
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=wd, report_to='none')
# create model and Trainer -> Trainer is a class combining the data and model together (like Learner does in fastai)
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokz, compute_metrics=corr_d)
# Train the model
trainer.train()
[856/856 03:37, Epoch 4/4]
Epoch Training Loss Validation Loss Pearson
1 No log 0.028198 0.803834
2 No log 0.022185 0.823395
3 0.034200 0.022253 0.833714
4 0.034200 0.022595 0.834576

TrainOutput(global_step=856, training_loss=0.025547120615700695, metrics={'train_runtime': 220.4986, 'train_samples_per_second': 496.221, 'train_steps_per_second': 3.882, 'total_flos': 717218678299260.0, 'train_loss': 0.025547120615700695, 'epoch': 4.0})

Evaluate Model Performance

# Evaluate Model on the Inference Data
preds = trainer.predict(eval_ds).predictions.astype(float)
preds = np.clip(preds, 0, 1)
preds
array([[0.4855957 ],
       [0.69238281],
       [0.54833984],
       [0.34765625],
       [0.        ],
       [0.47460938],
       [0.49023438],
       [0.        ],
       [0.34765625],
       [1.        ],
       [0.23352051],
       [0.27612305],
       [0.76757812],
       [0.91113281],
       [0.81347656],
       [0.37280273],
       [0.26318359],
       [0.        ],
       [0.63720703],
       [0.34619141],
       [0.40063477],
       [0.23999023],
       [0.14367676],
       [0.22607422],
       [0.60595703],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.57910156],
       [0.30908203],
       [0.07293701],
       [0.71191406],
       [0.53564453],
       [0.42602539],
       [0.24609375]])

Improving + Iterating on the Model for Kaggle and Beyond

In the previous code, I mostly followed a standard template for getting a model running and producing values for a Kaggle competition.

This section is based on Jeremy’s advice and experience for understanding how one might go about refining and fine-tuning a model using transfer learning to win a Kaggle competition. My goal was to beat the baseline evaluation from the previous section, but I didn’t run enough experiments to develop a model that could match or improve on my first attempt. In this section I was more interested in getting a feel for the iteration and engineering workflow of developing a model for a Kaggle contest/data science challenge, so I might revisit this post in the future to see if I can produce a better result once I gain more experience iterating on and building models.

Problem Statement

  • Compare two words or short phrases and score how similar they are based on the patent class they were used in.

  • A score of 1 means the two inputs have identical meaning

  • A score of 0 means the two inputs have totally different meanings

  • A score in between, e.g. 0.5, means they are somewhat similar but not identical

  • Problem Type - NLP classification problem -> automatically classify a document into some category

Data Exploration: Data Meaning + Representation

anchor - the first phrase

target - the second phrase

context - the CPC classification, which indicates the subject within which the similarity is to be scored

score - the similarity; a combination of one or more manual expert ratings (for example, the first training row rates "abatement" vs "abatement of pollution" in context A47 at 0.5)

Exploratory Data Analysis

# Distribution of values of Target
df.target.value_counts()
composition                    24
data                           22
metal                          22
motor                          22
assembly                       21
                               ..
switching switch over valve     1
switching switch off valve      1
switching over valve            1
switching off valve             1
wooden substrate                1
Name: target, Length: 29340, dtype: int64
# Distribution of values of Anchor
df.anchor.value_counts()
component composite coating              152
sheet supply roller                      150
source voltage                           140
perfluoroalkyl group                     136
el display                               135
                                        ... 
plug nozzle                                2
shannon                                    2
dry coating composition1                   2
peripheral nervous system stimulation      1
conduct conducting material                1
Name: anchor, Length: 733, dtype: int64
# Distribution of values of Context
df.context.value_counts()
H01    2186
H04    2177
G01    1812
A61    1477
F16    1091
       ... 
B03      47
F17      33
B31      24
A62      23
F26      18
Name: context, Length: 106, dtype: int64
# Distribution of values of Section
# Patent Section - first character of context is the section the patent was filed under
df['section'] = df.context.str[0]
df.section.value_counts()
B    8019
H    6195
G    6013
C    5288
A    4094
F    4054
E    1531
D    1279
Name: section, dtype: int64
eval_df['section'] = eval_df.context.str[0]
eval_df.section.value_counts()
B    8
G    7
F    6
H    5
C    4
A    3
E    2
D    1
Name: section, dtype: int64
# Distribution of values of Score
df.score.hist()
<Axes: >

# Items that were identical
df[df.score==1]
id anchor target context score input section
28 473137168ebf7484 abatement abating F24 1.0 CONTEXT: F24; TEXT1: abatement; TEXT2: abating F
158 621b048d70aa8867 absorbent properties absorbent characteristics D01 1.0 CONTEXT: D01; TEXT1: absorbent properties; TEXT2: absorbent characteristics D
161 bc20a1c961cb073a absorbent properties absorption properties D01 1.0 CONTEXT: D01; TEXT1: absorbent properties; TEXT2: absorption properties D
311 e955700dffd68624 acid absorption absorption of acid B08 1.0 CONTEXT: B08; TEXT1: acid absorption; TEXT2: absorption of acid B
315 3a09aba546aac675 acid absorption acid absorption B08 1.0 CONTEXT: B08; TEXT1: acid absorption; TEXT2: acid absorption B
... ... ... ... ... ... ... ...
36398 913141526432f1d6 wiring trough wiring troughs F16 1.0 CONTEXT: F16; TEXT1: wiring trough; TEXT2: wiring troughs F
36435 ee0746f2a8ecef97 wood article wood articles B05 1.0 CONTEXT: B05; TEXT1: wood article; TEXT2: wood articles B
36440 ecaf479135cf0dfd wood article wooden article B05 1.0 CONTEXT: B05; TEXT1: wood article; TEXT2: wooden article B
36464 8ceaa2b5c2d56250 wood article wood article B44 1.0 CONTEXT: B44; TEXT1: wood article; TEXT2: wood article B
36468 8e1386cbefd7f245 wood article wooden article B44 1.0 CONTEXT: B44; TEXT1: wood article; TEXT2: wooden article B

1154 rows × 7 columns

Training Model - Setup

# Constants for Training

# model
model_nm = 'microsoft/deberta-v3-small'

# tokenizer
tokz = AutoTokenizer.from_pretrained(model_nm)

# learning rate
lr = 8e-5

# batch size
bs = 128

# weight decay
wd = 0.01

# epochs
epochs = 4
# function that determines the split for validation and training data
# - splits by anchor so that no anchor appears in both sets (otherwise near-duplicate
#   phrases would leak from training into validation)
# default validation dataset split percentage: 25%
# default training dataset split percentage: 75%
def get_dataSplit(vp):
  anchors = df.anchor.unique()
  print(f"Unique Anchors: {len(anchors)}")
  np.random.seed(42)
  np.random.shuffle(anchors)

  # specify how much validation set we want
  val_prop = vp
  val_sz = int(len(anchors)*val_prop)
  val_anchors = anchors[:val_sz]

  # Find which rows match the valid anchors and get their indices
  is_val = np.isin(df.anchor, val_anchors)
  idxs = np.arange(len(df))

  # Validation Data
  val_idxs = idxs[ is_val]
  print(f"Validation Dataset Length: {len(val_idxs)}")
  # Training Data
  trn_idxs = idxs[~is_val]
  print(f"Training Dataset Length: {len(trn_idxs)}")

  return (val_idxs, trn_idxs)
# Function to tokenize the data and build the training/validation DatasetDict
def get_dds(df, vp=0.25):
  ds = Dataset.from_pandas(df).rename_column('score', 'label')
  print("Original Input Dataset")
  print(ds)
  eval_ds = Dataset.from_pandas(eval_df)
  print("Original Inference Dataset")
  print(eval_ds)
  inps = "anchor","target","context"
  tok_ds = ds.map(tok_func, batched=True, remove_columns=inps+('input','id','section'))
  print("Tokenized Input Dataset")
  print(tok_ds)
  print("Tokenized Inference Dataset")
  print(eval_ds)
  val_idxs, trn_idxs = get_dataSplit(vp)
  return DatasetDict({"train":tok_ds.select(trn_idxs), "test": tok_ds.select(val_idxs)})
# Function to get the model
def get_model():
  return AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
# Function to create a Trainer
# - Trainer -> class which combines the data and model together (similar to Learner in FastAI)
def get_trainer(dds, model=None):

    # initialize model
    if model is None: model = get_model()

    # Transformers require the TrainingArguments class to set up the arguments for the trainer
    args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
        evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
        num_train_epochs=epochs, weight_decay=wd, report_to='none')

    return Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                   tokenizer=tokz, compute_metrics=corr_d)

Data Representation

Combine Context + Anchor + Target Experiments

Method 1 - Baseline

# Method 1 - Baseline
df['input'] = 'CONTEXT: ' + df.context + '; TEXT1: ' + df.anchor + '; TEXT2: ' + df.target
eval_df['input'] = 'CONTEXT: ' + eval_df.context + '; TEXT1: ' + eval_df.anchor + '; TEXT2: ' + eval_df.target

dds = get_dds(df)

# create and train model
get_trainer(dds).train()
Original Input Dataset
Dataset({
    features: ['id', 'anchor', 'target', 'context', 'label', 'input', 'section'],
    num_rows: 36473
})
Original Inference Dataset
Dataset({
    features: ['id', 'anchor', 'target', 'context', 'input', 'section'],
    num_rows: 36
})
Tokenized Input Dataset
Dataset({
    features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 36473
})
Tokenized Inference Dataset
Dataset({
    features: ['id', 'anchor', 'target', 'context', 'input', 'section'],
    num_rows: 36
})
Unique Anchors: 733
Validation Dataset Length: 9116
Training Dataset Length: 27357
[856/856 03:36, Epoch 4/4]
Epoch Training Loss Validation Loss Pearson
1 No log 0.026578 0.796921
2 No log 0.023371 0.818602
3 0.036700 0.024409 0.820100
4 0.036700 0.023927 0.819576

TrainOutput(global_step=856, training_loss=0.02694343183642236, metrics={'train_runtime': 217.0934, 'train_samples_per_second': 504.059, 'train_steps_per_second': 3.943, 'total_flos': 723683047099410.0, 'train_loss': 0.02694343183642236, 'epoch': 4.0})

Method 2 - Separator Token

# Method 2 - Separator Token
sep = tokz.sep_token
print(f"Separator Token: {sep}")
df['input'] = df.context + sep + df.anchor + sep + df.target
eval_df['input'] = eval_df.context + sep + eval_df.anchor + sep + eval_df.target

dds = get_dds(df)

# create and train model
get_trainer(dds).train()
Separator Token: [SEP]
Original Input Dataset
Dataset({
    features: ['id', 'anchor', 'target', 'context', 'label', 'input', 'section'],
    num_rows: 36473
})
Original Inference Dataset
Dataset({
    features: ['id', 'anchor', 'target', 'context', 'input', 'section'],
    num_rows: 36
})
Tokenized Input Dataset
Dataset({
    features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 36473
})
Tokenized Inference Dataset
Dataset({
    features: ['id', 'anchor', 'target', 'context', 'input', 'section'],
    num_rows: 36
})
Unique Anchors: 733
Validation Dataset Length: 9116
Training Dataset Length: 27357
[856/856 03:10, Epoch 4/4]
Epoch Training Loss Validation Loss Pearson
1 No log 0.026330 0.792181
2 No log 0.025083 0.809545
3 0.032000 0.024699 0.812255
4 0.032000 0.024956 0.812318

TrainOutput(global_step=856, training_loss=0.024021479013924287, metrics={'train_runtime': 190.9271, 'train_samples_per_second': 573.14, 'train_steps_per_second': 4.483, 'total_flos': 468872298987930.0, 'train_loss': 0.024021479013924287, 'epoch': 4.0})

Method 3 - Change Token Separator

# Method 3 - change the type of separator
sep = " [s] "
df['input'] = df.context + sep + df.anchor + sep + df.target
eval_df['input'] = eval_df.context + sep + eval_df.anchor + sep + eval_df.target

dds = get_dds(df)

# create and train model
get_trainer(dds).train()
Original Input Dataset
Dataset({
    features: ['id', 'anchor', 'target', 'context', 'label', 'input', 'section'],
    num_rows: 36473
})
Original Inference Dataset
Dataset({
    features: ['id', 'anchor', 'target', 'context', 'input', 'section'],
    num_rows: 36
})
Tokenized Input Dataset
Dataset({
    features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 36473
})
Tokenized Inference Dataset
Dataset({
    features: ['id', 'anchor', 'target', 'context', 'input', 'section'],
    num_rows: 36
})
Unique Anchors: 733
Validation Dataset Length: 9116
Training Dataset Length: 27357
[856/856 03:17, Epoch 4/4]
Epoch Training Loss Validation Loss Pearson
1 No log 0.027578 0.789799
2 No log 0.025510 0.814414
3 0.031600 0.023443 0.817882
4 0.031600 0.024294 0.817224

TrainOutput(global_step=856, training_loss=0.023982213479336177, metrics={'train_runtime': 197.7627, 'train_samples_per_second': 553.33, 'train_steps_per_second': 4.328, 'total_flos': 582121520370810.0, 'train_loss': 0.023982213479336177, 'epoch': 4.0})

Method 4 - Method 3 + Lowercase

# Method 4 - change to all lower case
df['input'] = df.context + sep + df.anchor + sep + df.target
df['input'] = df.input.str.lower()
eval_df['input'] = eval_df.context + sep + eval_df.anchor + sep + eval_df.target
eval_df['input'] = eval_df.input.str.lower()

dds = get_dds(df)
get_trainer(dds).train()
Original Input Dataset
Dataset({
    features: ['id', 'anchor', 'target', 'context', 'label', 'input', 'section'],
    num_rows: 36473
})
Original Inference Dataset
Dataset({
    features: ['id', 'anchor', 'target', 'context', 'input', 'section'],
    num_rows: 36
})
Tokenized Input Dataset
Dataset({
    features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 36473
})
Tokenized Inference Dataset
Dataset({
    features: ['id', 'anchor', 'target', 'context', 'input', 'section'],
    num_rows: 36
})
Unique Anchors: 733
Validation Dataset Length: 9116
Training Dataset Length: 27357
[856/856 03:13, Epoch 4/4]
Epoch Training Loss Validation Loss Pearson
1 No log 0.026599 0.794113
2 No log 0.024963 0.817053
3 0.031500 0.023034 0.818722
4 0.031500 0.024056 0.818191

TrainOutput(global_step=856, training_loss=0.02395056954054075, metrics={'train_runtime': 193.6779, 'train_samples_per_second': 565.0, 'train_steps_per_second': 4.42, 'total_flos': 582121520370810.0, 'train_loss': 0.02395056954054075, 'epoch': 4.0})

Method 5 - Special Tokens

# Method 5 - Special Tokens
df['sectok'] = '[' + df.section + ']'
eval_df['sectok'] = '[' + eval_df.section + ']'
sectoks = list(df.sectok.unique())
tokz.add_special_tokens({'additional_special_tokens': sectoks})
8
df['input'] = df.sectok + sep + df.context + sep + df.anchor.str.lower() + sep + df.target
eval_df['input'] = eval_df.sectok + sep + eval_df.context + sep + eval_df.anchor.str.lower() + sep + eval_df.target
dds = get_dds(df)
Original Input Dataset
Dataset({
    features: ['id', 'anchor', 'target', 'context', 'label', 'input', 'section', 'sectok'],
    num_rows: 36473
})
Original Inference Dataset
Dataset({
    features: ['id', 'anchor', 'target', 'context', 'input', 'section', 'sectok'],
    num_rows: 36
})
Tokenized Input Dataset
Dataset({
    features: ['label', 'sectok', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 36473
})
Tokenized Inference Dataset
Dataset({
    features: ['id', 'anchor', 'target', 'context', 'input', 'section', 'sectok'],
    num_rows: 36
})
Unique Anchors: 733
Validation Dataset Length: 9116
Training Dataset Length: 27357
# resize embedding matrix in model - the new special tokens extend the vocab, so the model's token embeddings must grow to match
model = get_model()
model.resize_token_embeddings(len(tokz))
Embedding(128009, 768)
# train the model
trainer = get_trainer(dds, model=model)
trainer.train()
[856/856 03:33, Epoch 4/4]
Epoch Training Loss Validation Loss Pearson
1 No log 0.029230 0.803785
2 No log 0.024406 0.821807
3 0.031400 0.023469 0.820734
4 0.031400 0.024289 0.820907

TrainOutput(global_step=856, training_loss=0.023841174406425976, metrics={'train_runtime': 214.1222, 'train_samples_per_second': 511.054, 'train_steps_per_second': 3.998, 'total_flos': 695370741753690.0, 'train_loss': 0.023841174406425976, 'epoch': 4.0})
# Train on (almost) the entire training dataset before doing final tests with the inference dataset
# (the real change is shrinking the validation split to 1% via get_dds(df, 0.01) below)
def final_get_trainer(dds, model=None):
    # initialize model
    if model is None:
        model = get_model()

    # Transformers require the TrainingArguments class to set up the arguments for the trainer
    args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
        evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
        num_train_epochs=epochs, weight_decay=wd, report_to='none')

    return Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                   tokenizer=tokz, compute_metrics=corr_d)
dds = get_dds(df, 0.01)
trainer = final_get_trainer(dds, model=model)
trainer.train()
Original Input Dataset
Dataset({
    features: ['id', 'anchor', 'target', 'context', 'label', 'input', 'section', 'sectok'],
    num_rows: 36473
})
Original Inference Dataset
Dataset({
    features: ['id', 'anchor', 'target', 'context', 'input', 'section', 'sectok'],
    num_rows: 36
})
Tokenized Input Dataset
Dataset({
    features: ['label', 'sectok', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 36473
})
Tokenized Inference Dataset
Dataset({
    features: ['id', 'anchor', 'target', 'context', 'input', 'section', 'sectok'],
    num_rows: 36
})
Unique Anchors: 733
Validation Dataset Length: 347
Training Dataset Length: 36126
[1132/1132 04:31, Epoch 4/4]
Epoch Training Loss Validation Loss Pearson
1 No log 0.021583 0.836893
2 0.015400 0.019293 0.863323
3 0.015400 0.018774 0.860989
4 0.009500 0.018202 0.864261

TrainOutput(global_step=1132, training_loss=0.011905024835161943, metrics={'train_runtime': 272.1518, 'train_samples_per_second': 530.968, 'train_steps_per_second': 4.159, 'total_flos': 918764683369440.0, 'train_loss': 0.011905024835161943, 'epoch': 4.0})

Evaluate Performance of Model on Inference Data

# Test model on the evaluation data
preds = trainer.predict(eval_ds).predictions.astype(float)
preds = np.clip(preds, 0, 1)
preds
array([[0.57763672],
       [0.57177734],
       [0.47412109],
       [0.34521484],
       [0.        ],
       [0.48852539],
       [0.51611328],
       [0.        ],
       [0.3503418 ],
       [0.64208984],
       [0.33959961],
       [0.28930664],
       [0.65234375],
       [0.71289062],
       [0.56591797],
       [0.37866211],
       [0.30786133],
       [0.05584717],
       [0.61181641],
       [0.4050293 ],
       [0.58447266],
       [0.26806641],
       [0.28491211],
       [0.29077148],
       [0.53857422],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.59912109],
       [0.35766602],
       [0.07025146],
       [0.68945312],
       [0.48779297],
       [0.41601562],
       [0.29858398]])

Submission for Kaggle Competition

# Submit for kaggle competition
import datasets

submission = datasets.Dataset.from_dict({
    'id': eval_ds['id'],
    'score': preds
})

submission.to_csv('submission.csv', index=False)
1039

Validation Set and Modeling Experiments

# Function to model
def f(x):
  return -3*x**2 + 2*x + 20

# Function for visualizing data
def plot_function(f, min=-2.1, max=2.1, color='r'):
    x = np.linspace(min,max, 100)[:,None]
    plt.plot(x, f(x), color)
print(f"Function we're trying to fit")
plot_function(f)
Function we're trying to fit

# Functions for creating noisy data
from numpy.random import normal,seed,uniform
np.random.seed(42)

def noise(x, scale):
   return normal(scale=scale, size=x.shape)

def add_noise(x, mult, add):
  return x * (1+noise(x,mult)) + noise(x,add)
# Simulate Noisy Data
x = np.linspace(-2, 2, num=20)[:,None]
y = add_noise(f(x), 0.2, 1.3)
plt.scatter(x,y);

# Function for trying to find polynomial function that fits data
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

def plot_poly(degree):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x, y)
    plt.scatter(x,y)
    plot_function(model.predict)
# Underfit
print(f"Polynomial of Degree 1: Underfit")
plot_poly(1)
Polynomial of Degree 1: Underfit

# Overfit
print(f"Polynomial of Degree 10: Overfit")
plot_poly(10)
Polynomial of Degree 10: Overfit

# Approximation of close fit
print(f"Polynomial of Degree 2: A close approximation of exact fit")
plot_poly(2)

# the original true function (one we're trying to fit)
plot_function(f, color='b')
Polynomial of Degree 2: A close approximation of exact fit

Metric Experiments: Correlation

from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing(as_frame=True)
housing = housing['data'].join(housing['target']).sample(1000, random_state=52)
housing.head()
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude MedHouseVal
7506 3.0550 37.0 5.152778 1.048611 729.0 5.062500 33.92 -118.28 1.054
4720 3.0862 35.0 4.697897 1.055449 1159.0 2.216061 34.05 -118.37 3.453
12888 2.5556 24.0 4.864905 1.129222 1631.0 2.395007 38.66 -121.35 1.057
13344 3.0057 32.0 4.212687 0.936567 1378.0 5.141791 34.05 -117.64 0.969
7173 1.9083 42.0 3.888554 1.039157 1535.0 4.623494 34.05 -118.19 1.192
# correlation coefficients for every combination of columns
np.set_printoptions(precision=2, suppress=True)
np.corrcoef(housing, rowvar=False)
array([[ 1.  , -0.12,  0.43, -0.08,  0.01, -0.07, -0.12,  0.04,  0.68],
       [-0.12,  1.  , -0.17, -0.06, -0.31,  0.  ,  0.03, -0.13,  0.12],
       [ 0.43, -0.17,  1.  ,  0.76, -0.09, -0.07,  0.12, -0.03,  0.21],
       [-0.08, -0.06,  0.76,  1.  , -0.08, -0.07,  0.09,  0.  , -0.04],
       [ 0.01, -0.31, -0.09, -0.08,  1.  ,  0.16, -0.15,  0.13,  0.  ],
       [-0.07,  0.  , -0.07, -0.07,  0.16,  1.  , -0.16,  0.17, -0.27],
       [-0.12,  0.03,  0.12,  0.09, -0.15, -0.16,  1.  , -0.93, -0.16],
       [ 0.04, -0.13, -0.03,  0.  ,  0.13,  0.17, -0.93,  1.  , -0.03],
       [ 0.68,  0.12,  0.21, -0.04,  0.  , -0.27, -0.16, -0.03,  1.  ]])
# Correlation for two variables
np.corrcoef(housing.MedInc, housing.MedHouseVal)
array([[1.  , 0.68],
       [0.68, 1.  ]])
# function returning a single correlation coefficient between two variables

def corr(x,y):
  return np.corrcoef(x,y)[0][1]

# test function
corr(housing.MedInc, housing.MedHouseVal)

# function for visualizing correlation between two variables
def show_corr(df, a, b):
    x,y = df[a],df[b]
    plt.scatter(x,y, alpha=0.5, s=4)
    plt.title(f'{a} vs {b}; r: {corr(x, y):.2f}')
show_corr(housing, 'MedInc', 'MedHouseVal')

show_corr(housing, 'MedInc', 'AveRooms')

subset = housing[housing.AveRooms < 15]
show_corr(subset, 'MedInc', 'AveRooms')

Resources

  1. FastAI Lesson 4
  2. FastAI Chapter 10
  3. US Patent Phrase to Phrase Matching
  4. Getting Started with NLP for Absolute Beginners
  5. Iterate Like a Grandmaster
  6. How (and why) to create a good validation dataset
  7. The problem with metrics is a big problem for AI
  8. Utah CS 5340/6340 - Natural Language Processing
  9. Hugging Face Transformers docs
  10. Jeremy Howard FastAI Live Coding
  11. fast.ai docs