{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Spam Detector Example\n",
"\n",
"This notebook demonstrates a spam detector."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"Let's import our libraries:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And some more helper libraries:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"from zipfile import ZipFile\n",
"from tqdm.notebook import tqdm_notebook as tqdm\n",
"import unicodedata"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And SciKit algorithms:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.dummy import DummyClassifier\n",
"from sklearn.naive_bayes import MultinomialNB\n",
"from sklearn.pipeline import Pipeline, make_pipeline\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Set up our RNG:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"rng = np.random.RandomState(20201106)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load TREC Spam\n",
"\n",
"Now we're going to load the TREC Spam data set.\n",
"\n",
"I downloaded this data from , and converted the TGZ file from TREC to a Zip file so that we can read it directly from the compressed file. This is because each e-mail is in a separate file, all in the same directory; a directory with 75K files does not perform well on sime file systems. Here is the command I used to convert it (with Node.js installed):\n",
"\n",
"    npx tar-to-zip trec07p.tgz\n",
"\n",
"We're going to start by opening the zip file so we can access its contents:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"trec_zf = ZipFile('trec07p.zip')"
]
},
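{
"cell_type": "markdown",
"metadata": {},
"source": [
"An aside on the API: `ZipFile.open` returns a file-like object for a single archive member, so we can stream each message without extracting the archive to disk. A minimal self-contained sketch, using a throwaway in-memory zip rather than the TREC data (the member name below is made up):\n",
"\n",
"```python\n",
"import io\n",
"from zipfile import ZipFile\n",
"\n",
"# Build a tiny zip in memory with one member\n",
"buf = io.BytesIO()\n",
"with ZipFile(buf, 'w') as zf:\n",
"    zf.writestr('data/msg.1', b'hello spam')\n",
"\n",
"# Read the member back as a stream, without extracting to disk\n",
"with ZipFile(buf) as zf:\n",
"    with zf.open('data/msg.1') as f:\n",
"        content = f.read()\n",
"content  # b'hello spam'\n",
"```"
]
},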
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we want to load the labels — these are in the file `trec07p/full/index`. We'll get a data frame, which contains the class (spam or ham) and the filename:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
label
\n",
"
path
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
spam
\n",
"
../data/inmail.1
\n",
"
\n",
"
\n",
"
1
\n",
"
ham
\n",
"
../data/inmail.2
\n",
"
\n",
"
\n",
"
2
\n",
"
spam
\n",
"
../data/inmail.3
\n",
"
\n",
"
\n",
"
3
\n",
"
spam
\n",
"
../data/inmail.4
\n",
"
\n",
"
\n",
"
4
\n",
"
spam
\n",
"
../data/inmail.5
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" label path\n",
"0 spam ../data/inmail.1\n",
"1 ham ../data/inmail.2\n",
"2 spam ../data/inmail.3\n",
"3 spam ../data/inmail.4\n",
"4 spam ../data/inmail.5"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"with trec_zf.open('trec07p/full/index') as idxf:\n",
" trec_labels = pd.read_table(idxf, sep=' ', names=['label', 'path'])\n",
"trec_labels.head()"
]
},
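{
"cell_type": "markdown",
"metadata": {},
"source": [
"A note on the parsing: `sep=' '` splits each line on a single space, which matches the two-column `label path` layout of the index file. A small self-contained sketch with an in-memory buffer (the sample lines are made up, not taken from the corpus):\n",
"\n",
"```python\n",
"import io\n",
"import pandas as pd\n",
"\n",
"sample = 'spam ../data/inmail.1\\nham ../data/inmail.2'\n",
"df = pd.read_table(io.StringIO(sample), sep=' ', names=['label', 'path'])\n",
"df['label'].tolist()  # ['spam', 'ham']\n",
"```"
]
},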
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 75419 entries, 0 to 75418\n",
"Data columns (total 2 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 label 75419 non-null object\n",
" 1 path 75419 non-null object\n",
"dtypes: object(2)\n",
"memory usage: 9.9 MB\n"
]
}
],
"source": [
"trec_labels.info(memory_usage='deep')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's double-check that we don't have any duplicate paths:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"75419"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"trec_labels['path'].nunique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use these filenames to extract the individual messages. Let's do this:\n",
"\n",
"1. Extract the filename (after the `/`) for use as a key\n",
"2. Load each file's contents into a string\n",
"3. Merge with labels for a labeled spam/ham data set\n",
"\n",
"Start by replacing everything up to the final `/` with nothing:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
label
\n",
"
path
\n",
"
name
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
spam
\n",
"
../data/inmail.1
\n",
"
inmail.1
\n",
"
\n",
"
\n",
"
1
\n",
"
ham
\n",
"
../data/inmail.2
\n",
"
inmail.2
\n",
"
\n",
"
\n",
"
2
\n",
"
spam
\n",
"
../data/inmail.3
\n",
"
inmail.3
\n",
"
\n",
"
\n",
"
3
\n",
"
spam
\n",
"
../data/inmail.4
\n",
"
inmail.4
\n",
"
\n",
"
\n",
"
4
\n",
"
spam
\n",
"
../data/inmail.5
\n",
"
inmail.5
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" label path name\n",
"0 spam ../data/inmail.1 inmail.1\n",
"1 ham ../data/inmail.2 inmail.2\n",
"2 spam ../data/inmail.3 inmail.3\n",
"3 spam ../data/inmail.4 inmail.4\n",
"4 spam ../data/inmail.5 inmail.5"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"trec_labels['name'] = trec_labels['path'].str.replace(r'^.*/', '', regex=True)\n",
"trec_labels.head()"
]
},
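{
"cell_type": "markdown",
"metadata": {},
"source": [
"A note for newer pandas: `Series.str.replace` treats the pattern as a literal string unless you pass `regex=True` (the old regex-by-default behavior was removed in pandas 2.0), so pattern replacements like the one above should spell it out. A tiny self-contained sketch:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"s = pd.Series(['../data/inmail.1', 'trec07p/data/inmail.2'])\n",
"# The greedy ^.*/ eats everything up to the final slash\n",
"s.str.replace(r'^.*/', '', regex=True).tolist()  # ['inmail.1', 'inmail.2']\n",
"```"
]
},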
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we're going to load all the mails. No way to get around doing this as a loop - we could do it with `apply`, but that's just a loop. We'll put it in a dictionary, then convert that to a series; the result is a series indexed by name, whose values are the e-mails. We're also going to use TQDM to get a progress bar.\n",
"\n",
"While we are loading the data, we will also perform our **decoding** and **text normalization** steps."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "bb300b9b64ad440ab818cb6f97b9f80a",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"HBox(children=(FloatProgress(value=0.0, max=75419.0), HTML(value='')))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"text/plain": [
"75419"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"trec_mails = {}\n",
"for name in tqdm(trec_labels['name']):\n",
" path = f'trec07p/data/{name}'\n",
" with trec_zf.open(path) as mailf:\n",
" content = mailf.read()\n",
" content = content.decode('latin1')\n",
" content = unicodedata.normalize('NFKD', content)\n",
" trec_mails[name] = content\n",
"trec_mails = pd.Series(trec_mails, name='content')\n",
"len(trec_mails)"
]
},
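{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what the NFKD step buys us: compatibility decomposition expands ligatures, full-width forms, and accented characters into simpler code point sequences, which keeps the tokenizer from treating visually identical strings as distinct. A small self-contained sketch:\n",
"\n",
"```python\n",
"import unicodedata\n",
"\n",
"unicodedata.normalize('NFKD', '\\ufb01le')    # the 'fi' ligature expands, giving 'file'\n",
"unicodedata.normalize('NFKD', 'caf\\u00e9')   # 'é' becomes 'e' plus a combining acute accent\n",
"```"
]
},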
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we can merge with the labels. Let's create an `IsSpam` logical to mark the spams, then merge:"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"