{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Measuring similarity between categorical data records\n", "\n", "We are interested in the problem of detecting similar records within categorical data. These may be records which differ slightly due to data misentry, missing data fields, slight variations in reporting or coding or a variety of other factors. Alternatively they may just be natural groupings of records within a particular data set. \n", "\n", "We will look at this problem from the perspective of inducing a similarity or distance function over the set of all records and then clustering the results into piles of near duplicate candidates for a human to either explore or to follow up and triage. \n", "\n", "There are a variety of interesting ways to measure similarity between categorical records. Many of them involve building a complex function testing matches and non-matches between rows. Unfortunately, these approaches don't easily scale to very large amounts of data since they typically require $O(n^2)$ similarity comparisons. For the purposes of this notebook we will make use of some recent advances in nearest neighbour search in order to reduce that complexity. That may initially force us to restrict ourselves to simpler models. In the case that these aren't sufficient one take an iterative approach where a coarse similarity measure was used to estimate which record pairs were likely near duplicates and a refined (and more expensive) similarity measure was applied to only those candidate records." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Representation\n", "In order to make use of modern techniques we need to convert our data from columns of categorical values into numerical vectors. There are a variety of ways to do this but by far the most common would be to use a **one hot encoding**. That is to build a new column for each categorical variable name and value and then place a one in that new column if that particular value was present within a particular row. \n", "\n", "Two of the most common are pandas [get_dummies](https://pandas.pydata.org/pandas-docs/version/0.23.3/generated/pandas.get_dummies.html) function or scikit-learns [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html). Both are fine functions, but for large amounts of data (particulary if some of your columns have many values) you really want to ensure that you keep good track of your column names and maintain a sparse matrix representation. A sparse matrix representation is one where you never actually write down any of the zero values. This is important in a one-hot encoded representation because you have one column for every single categorical value in your data set. If, for example, one of your columns contained a name or phone number then you could easily have added hundreds of thousands of (mostly zero) columns to every row within your data matrix. For even medium sized data sets this can quickly exceed your computers memory limit. \n", "\n", "For the purposes of this tutorial I'll make use of our Vectorizers library for performing this task." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import Some libraries\n", "\n", "We'll need CategoricalColumnTransformer and NgramVectorizer from our vectorizers library along with a few helper functions for dealing with our data and plotting it." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "from vectorizers import NgramVectorizer\n", "from vectorizers.transformers import CategoricalColumnTransformer, InformationWeightTransformer\n", "from sklearn.decomposition import TruncatedSVD\n", "from random import sample\n", "\n", "import pandas as pd\n", "import numpy as np\n", "import umap\n", "import umap.plot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll add some bokeh imports for easy interactive plots" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/javascript": [ "(function(root) {\n", " function now() {\n", " return new Date();\n", " }\n", "\n", " const force = true;\n", "\n", " if (typeof root._bokeh_onload_callbacks === \"undefined\" || force === true) {\n", " root._bokeh_onload_callbacks = [];\n", " root._bokeh_is_loading = undefined;\n", " }\n", "\n", "const JS_MIME_TYPE = 'application/javascript';\n", " const HTML_MIME_TYPE = 'text/html';\n", " const EXEC_MIME_TYPE = 'application/vnd.bokehjs_exec.v0+json';\n", " const CLASS_NAME = 'output_bokeh rendered_html';\n", "\n", " /**\n", " * Render data to the DOM node\n", " */\n", " function render(props, node) {\n", " const script = document.createElement(\"script\");\n", " node.appendChild(script);\n", " }\n", "\n", " /**\n", " * Handle when an output is cleared or removed\n", " */\n", " function handleClearOutput(event, handle) {\n", " const cell = handle.cell;\n", "\n", " const id = cell.output_area._bokeh_element_id;\n", " const server_id = cell.output_area._bokeh_server_id;\n", " // Clean up Bokeh references\n", " if (id != null && id in Bokeh.index) {\n", " Bokeh.index[id].model.document.clear();\n", " delete Bokeh.index[id];\n", " }\n", "\n", " if (server_id !== undefined) {\n", " // Clean up Bokeh references\n", " const cmd_clean = \"from bokeh.io.state import curstate; print(curstate().uuid_to_server['\" + server_id + \"'].get_sessions()[0].document.roots[0]._id)\";\n", " cell.notebook.kernel.execute(cmd_clean, {\n", " iopub: {\n", " output: function(msg) {\n", " const id = msg.content.text.trim();\n", " if (id in Bokeh.index) {\n", " Bokeh.index[id].model.document.clear();\n", " delete Bokeh.index[id];\n", " }\n", " }\n", " }\n", " });\n", " // Destroy server and session\n", " const cmd_destroy = \"import bokeh.io.notebook as ion; ion.destroy_server('\" + server_id + \"')\";\n", " cell.notebook.kernel.execute(cmd_destroy);\n", " }\n", " }\n", "\n", " /**\n", " * Handle when a new output is added\n", " */\n", " function handleAddOutput(event, handle) {\n", " const output_area = handle.output_area;\n", " const output = handle.output;\n", "\n", " // limit handleAddOutput to display_data with EXEC_MIME_TYPE content only\n", " if ((output.output_type != \"display_data\") || (!Object.prototype.hasOwnProperty.call(output.data, EXEC_MIME_TYPE))) {\n", " return\n", " }\n", "\n", " const toinsert = output_area.element.find(\".\" + CLASS_NAME.split(' ')[0]);\n", "\n", " if (output.metadata[EXEC_MIME_TYPE][\"id\"] !== undefined) {\n", " toinsert[toinsert.length - 1].firstChild.textContent = output.data[JS_MIME_TYPE];\n", " // store reference to embed id on output_area\n", " output_area._bokeh_element_id = output.metadata[EXEC_MIME_TYPE][\"id\"];\n", " }\n", " if (output.metadata[EXEC_MIME_TYPE][\"server_id\"] !== undefined) {\n", " const bk_div = document.createElement(\"div\");\n", " bk_div.innerHTML = output.data[HTML_MIME_TYPE];\n", " const script_attrs = bk_div.children[0].attributes;\n", " for (let i = 0; i < script_attrs.length; i++) {\n", " toinsert[toinsert.length - 1].firstChild.setAttribute(script_attrs[i].name, script_attrs[i].value);\n", " toinsert[toinsert.length - 1].firstChild.textContent = bk_div.children[0].textContent\n", " }\n", " // store reference to server id on output_area\n", " output_area._bokeh_server_id = output.metadata[EXEC_MIME_TYPE][\"server_id\"];\n", " }\n", " }\n", "\n", " function register_renderer(events, OutputArea) {\n", "\n", " function append_mime(data, metadata, element) {\n", " // create a DOM node to render to\n", " const toinsert = this.create_output_subarea(\n", " metadata,\n", " CLASS_NAME,\n", " EXEC_MIME_TYPE\n", " );\n", " this.keyboard_manager.register_events(toinsert);\n", " // Render to node\n", " const props = {data: data, metadata: metadata[EXEC_MIME_TYPE]};\n", " render(props, toinsert[toinsert.length - 1]);\n", " element.append(toinsert);\n", " return toinsert\n", " }\n", "\n", " /* Handle when an output is cleared or removed */\n", " events.on('clear_output.CodeCell', handleClearOutput);\n", " events.on('delete.Cell', handleClearOutput);\n", "\n", " /* Handle when a new output is added */\n", " events.on('output_added.OutputArea', handleAddOutput);\n", "\n", " /**\n", " * Register the mime type and append_mime function with output_area\n", " */\n", " OutputArea.prototype.register_mime_type(EXEC_MIME_TYPE, append_mime, {\n", " /* Is output safe? */\n", " safe: true,\n", " /* Index of renderer in `output_area.display_order` */\n", " index: 0\n", " });\n", " }\n", "\n", " // register the mime type if in Jupyter Notebook environment and previously unregistered\n", " if (root.Jupyter !== undefined) {\n", " const events = require('base/js/events');\n", " const OutputArea = require('notebook/js/outputarea').OutputArea;\n", "\n", " if (OutputArea.prototype.mime_types().indexOf(EXEC_MIME_TYPE) == -1) {\n", " register_renderer(events, OutputArea);\n", " }\n", " }\n", " if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n", " root._bokeh_timeout = Date.now() + 5000;\n", " root._bokeh_failed_load = false;\n", " }\n", "\n", " const NB_LOAD_WARNING = {'data': {'text/html':\n", " \"\\n\"+\n", " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", " \"
\\n\"+\n", " \"\\n\"+\n",
" \"from bokeh.resources import INLINE\\n\"+\n",
" \"output_notebook(resources=INLINE)\\n\"+\n",
" \"\\n\"+\n",
" \"\\n\"+\n \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n \"
\\n\"+\n \"\\n\"+\n \"from bokeh.resources import INLINE\\n\"+\n \"output_notebook(resources=INLINE)\\n\"+\n \"\\n\"+\n \"| \n", " | REPORT_ID | \n", "REPORT_NO | \n", "VERSION_NO | \n", "DATRECEIVED | \n", "DATINTRECEIVED | \n", "MAH_NO | \n", "REPORT_TYPE_CODE | \n", "REPORT_TYPE_ENG | \n", "REPORT_TYPE_FR | \n", "GENDER_CODE | \n", "... | \n", "HOSP_REQUIRED | \n", "OTHER_MEDICALLY_IMP_COND | \n", "REPORTER_TYPE_ENG | \n", "REPORTER_TYPE_FR | \n", "SOURCE_CODE | \n", "SOURCE_ENG | \n", "SOURCE_FR | \n", "E2B_IMP_SAFETYREPORT_ID | \n", "AUTHORITY_NUMB | \n", "COMPANY_NUMB | \n", "
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | \n", "1 | \n", "000000001 | \n", "0 | \n", "05-JUN-73 | \n", "05-JUN-73 | \n", "NaN | \n", "7.0 | \n", "Spontaneous | \n", "Déclaration spontanée | \n", "2.0 | \n", "... | \n", "NaN | \n", "NaN | \n", "Physician | \n", "Physicien | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "
| 1 | \n", "2 | \n", "000000002 | \n", "0 | \n", "05-JUN-73 | \n", "05-JUN-73 | \n", "NaN | \n", "7.0 | \n", "Spontaneous | \n", "Déclaration spontanée | \n", "2.0 | \n", "... | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "3.0 | \n", "Hospital | \n", "Hôpital | \n", "NaN | \n", "NaN | \n", "NaN | \n", "
| 2 | \n", "3 | \n", "000000003 | \n", "0 | \n", "05-JUN-73 | \n", "05-JUN-73 | \n", "NaN | \n", "7.0 | \n", "Spontaneous | \n", "Déclaration spontanée | \n", "1.0 | \n", "... | \n", "NaN | \n", "NaN | \n", "NaN | \n", "NaN | \n", "3.0 | \n", "Hospital | \n", "Hôpital | \n", "NaN | \n", "NaN | \n", "NaN | \n", "
3 rows × 42 columns
\n", "CategoricalColumnTransformer from our Vectorizers library to construct a new pandas Series with the index being our objects and the values being the corresponding list of categorical values associated with each object. Notionaly, we've decided to say that two objects are similar if they match on many of these categorical values. \n",
"\n",
"Since we're combining categorical values from multiple columns we should also append the column name to each categorical value in order to be sure to not confuse things. We'll set include_column_name=True in order to have our categorical descriptors take the form of the string f'{column_name}:{value}'."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 57.3 s, sys: 540 ms, total: 57.8 s\n",
"Wall time: 58.1 s\n"
]
},
{
"data": {
"text/plain": [
"1019799"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"report_list = CategoricalColumnTransformer(object_column_name='REPORT_ID',\n",
" descriptor_column_name=['GENDER_ENG','OUTCOME_ENG'],\n",
" include_column_name=True).fit_transform(data)\n",
"len(report_list)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we had one REPORT_ID for every row in our data frame we now a series of the same length with a list of at most two descriptors assicated with each."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"REPORT_ID\n",
"1 [GENDER_ENG:Female, OUTCOME_ENG:Unknown]\n",
"2 [GENDER_ENG:Female, OUTCOME_ENG:Unknown]\n",
"3 [GENDER_ENG:Male, OUTCOME_ENG:Unknown]\n",
"dtype: object"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"report_list.head(3)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2 967530\n",
"1 52267\n",
"0 2\n",
"Name: count, dtype: int64"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"report_list.apply(len).value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We implicitely filter all categorical values that are Null. As we see here, this can result in smaller or empty lists. Around 46,000 records only had one of our fields present and 2 of our records had neither field present."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Enrich with reactions\n",
"\n",
"In order to find out what the reports we are interested in are actually about we need to enrich this data with the adverse reaction being reported. This information comes from the reactions table which contains just over 3 million reactions associated with our reports. REPORT_ID is also contained in this table so it can also be used to describe our objects of interest, but now we will describe our reports as a list of the categorical values contained within the PT_NAME_ENG column. This is the Adverse Reaction Term column. "
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(3866373, 10)"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = pd.read_csv(f'{file_path}/reactions.txt', sep='$', header=None, low_memory=False)\n",
"column_names = 'REACTION_ID,REPORT_ID,DURATION,DURATION_UNIT_ENG,DURATION_UNIT_FR,PT_NAME_ENG,PT_NAME_FR,SOC_NAME_ENG,SOC_NAME_FR,MEDDRA_VERSION'.split(',')\n",
"data.columns = column_names\n",
"data.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Of note here, we have more rows than objects of interest. That is because a single report can be responsible for multiple reactions. That is going to necessitate grouping up the rows with the same REPORT_ID we can do this with a pandas groupby manually but this is exactly what our CategoricalColumnTransformer is doing under the hood. As such, just as before, we simply specify our object_column_name and the columns from this table that we'd like to use as descriptors and apply our transformer."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 37.1 s, sys: 1.27 s, total: 38.4 s\n",
"Wall time: 39.1 s\n"
]
},
{
"data": {
"text/plain": [
"1019212"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"reaction_list = CategoricalColumnTransformer(object_column_name='REPORT_ID',\n",
" descriptor_column_name='PT_NAME_ENG',\n",
" include_column_name=True).fit_transform(data)\n",
"len(reaction_list)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, just as before, we have a series with an index that is REPORT_ID and values that are lists of descriptors (strings)."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"REPORT_ID\n",
"1 [PT_NAME_ENG:Pruritus, PT_NAME_ENG:Oedema]\n",
"2 [PT_NAME_ENG:Urticaria]\n",
"3 [PT_NAME_ENG:Erythema, PT_NAME_ENG:Periorbital...\n",
"dtype: object"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"reaction_list.head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Drug ingredient data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lastly, we will include some categorical columns to describe the drug being reported on. This will give us the oportunity to see how we might make use a pair of linked tables to enrich our data further. It will also give us enough descriptors to induce an interesting notion of similarity on our data. Similarity measures based on very few categorical values will end up with a tremendous number of records at exactly the same distance making for an uninteresting metric space that is also difficult to work with.\n",
"\n",
"To do this we will enrich our reports with the name of the drug involved in the report and the active ingredients which make up that drug. We've included the active ingredients in this notion to because there are a wide variety of drugs with different names that contain the same drug. For example, there are over three hundred different drugs who's only active ingredient is listed as cannabis. We'd like our notion of distance to treat reports about these drugs as being similar. \n",
"\n",
"As before, we load the table that contains our active ingredients for each drug we know about."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(354757, 5)"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"input_file = f'{file_path}/drug_product_ingredients.txt'\n",
"data = pd.read_csv(input_file, sep='$', header=None, low_memory=False)\n",
"column_names = 'DRUG_PRODUCT_INGREDIENT_ID,DRUG_PRODUCT_ID,DRUGNAME,ACTIVE_INGREDIENT_ID,ACTIVE_INGREDIENT_NAME'.split(\",\")\n",
"data.columns = column_names\n",
"data.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first thing we notice is that this table has no concept of reports or a REPORT_ID. In order to get this information we will have to load a another table which contains this linking information. Since we only want to use this table to link our drug ids with our report ids we will only load the two columns of interest from our file."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(4280493, 2)"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"input_file = f'{file_path}/report_drug.txt'\n",
"drug2report_joining = pd.read_csv(input_file, sep='$', header=None, low_memory=False, usecols=[1,2])\n",
"drug2report_joining.columns = 'REPORT_ID,DRUG_PRODUCT_ID'.split(',')\n",
"drug2report_joining.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we make use of pandas excellent merge function to perform an inner join between this data. This will create a new DataFrame that contains both our ingredient lists as well as the reports which their corresponding drugs were contained in. \n",
"\n",
"It should be noted that this is a much larger table due to the fact any given drug may have many reports associated with it."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(21606832, 6)"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"merged_data = data.merge(drug2report_joining, how='inner', on='DRUG_PRODUCT_ID')\n",
"merged_data.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lastly, we deploy our CategoricalColumnTransformer in order to reduce this data to an pandas Series containing lists of our categorical column values of interest and indexed by our column of interest, REPORT_ID. We've included unique_values=True in this transformer to ensure that any report that potentially contains multiple drugs with overlapping ingredients is only represented by the unique set of ingredients across all the drugs involved."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 2min 41s, sys: 5.55 s, total: 2min 46s\n",
"Wall time: 2min 48s\n"
]
},
{
"data": {
"text/plain": [
"1016353"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%time\n",
"report_drug_ingredient_list = CategoricalColumnTransformer(object_column_name='REPORT_ID',\n",
" descriptor_column_name=['DRUGNAME','ACTIVE_INGREDIENT_NAME'],\n",
" include_column_name=True, unique_values=True).fit_transform(merged_data)\n",
"len(report_drug_ingredient_list)"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"REPORT_ID\n",
"1 [DRUGNAME:GANTRISIN, ACTIVE_INGREDIENT_NAME:su...\n",
"2 [DRUGNAME:PENBRITIN, ACTIVE_INGREDIENT_NAME:am...\n",
"3 [DRUGNAME:NEOSPORIN OINTMENT, ACTIVE_INGREDIEN...\n",
"dtype: object"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"report_drug_ingredient_list.head(3)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2 101065\n",
"3 97990\n",
"4 54916\n",
"5 56476\n",
"6 33446\n",
" ... \n",
"727 1\n",
"796 1\n",
"817 1\n",
"903 1\n",
"962 1\n",
"Name: count, Length: 446, dtype: int64"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"report_drug_ingredient_list.apply(len).value_counts().sort_index()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that no report contains less than two tokens in our new representation. One from its DRUGNAME and one from its ACTIVE_INGREDIENT. In other words all drugs have both a name and at least one active ingredient. On the other end of the spectrum there is a drug that has been reported on that contain closer to 814 active ingredients. \n",
"\n",
"That last statement should be taken with a grain of salt since in this particular data set various synonyms for drug ingredients are sometimes recorded multiple times for a given drug. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Combine our series\n",
"\n",
"Now we have three pandas Series each indexed by REPORT_ID and containing a list of strings representing our categorical values.\n",
"\n",
"Since these series all share an index we can use pandas concat to join them together. For those familiar with database terminology this can be done via either an outer or inner join. An outer join says that if an object is present in either series index it will be included in our results. An inner join requires the index to be present in all of our series.\n",
"\n",
"Outer joins may introduce NaN into our newly joined_df if some of our index values aren't present in all series. Since pandas doesn't allow us to fillna with lists we would have to do this manually ourselves via the commented code included below. In this case we are are performing an inner join since we only care about reports that have both symptoms and drugs associated with them."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1015830 records joined\n"
]
},
{
"data": {
"text/html": [
"| \n", " | 0 | \n", "1 | \n", "2 | \n", "
|---|---|---|---|
| REPORT_ID | \n", "\n", " | \n", " | \n", " |
| 1 | \n", "[GENDER_ENG:Female, OUTCOME_ENG:Unknown] | \n", "[PT_NAME_ENG:Pruritus, PT_NAME_ENG:Oedema] | \n", "[DRUGNAME:GANTRISIN, ACTIVE_INGREDIENT_NAME:su... | \n", "
| 2 | \n", "[GENDER_ENG:Female, OUTCOME_ENG:Unknown] | \n", "[PT_NAME_ENG:Urticaria] | \n", "[DRUGNAME:PENBRITIN, ACTIVE_INGREDIENT_NAME:am... | \n", "
| 3 | \n", "[GENDER_ENG:Male, OUTCOME_ENG:Unknown] | \n", "[PT_NAME_ENG:Erythema, PT_NAME_ENG:Periorbital... | \n", "[DRUGNAME:NEOSPORIN OINTMENT, ACTIVE_INGREDIEN... | \n", "
REPORT_ID."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'd now like to combine all the lists in the columns of this data frame into one big list per REPORT_ID. To do this we will take advantage of the fact that python lists have overloaded the addition opperator with concatination. Adding two lists together just merges them in exactly the way we'd like. As such, what we'd like to do with our dataframe is just sum up all the values (i.e. lists) in every row. Adding two Series (i.e. columns) together just performs elementwise addition on all their elements."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"joined_report = joined_df[0] + joined_df[1] + joined_df[2]"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"REPORT_ID\n",
"1 [GENDER_ENG:Female, OUTCOME_ENG:Unknown, PT_NA...\n",
"2 [GENDER_ENG:Female, OUTCOME_ENG:Unknown, PT_NA...\n",
"3 [GENDER_ENG:Male, OUTCOME_ENG:Unknown, PT_NAME...\n",
"dtype: object"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"joined_report.head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a quick summary of the distribution of how many reports have however many tokens. You see that many of our reports have between 4 and 20 tokens while one of them has as many as 825 tokens associated with it. As we saw earlier this large list is mostly made up of active drug ingredients."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4 5133\n",
"5 49422\n",
"6 62134\n",
"7 57920\n",
"8 56035\n",
" ... \n",
"736 1\n",
"802 1\n",
"827 1\n",
"952 1\n",
"1021 1\n",
"Name: count, Length: 462, dtype: int64"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"joined_report.apply(len).value_counts().sort_index()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we have a series with one entry for each REPORT_ID and a value that is a list of all the categorical descriptors we've derived from multiple columns across multiple tables. \n",
"\n",
"In the above analysis we have constructed our reprsentation via joining multiple seperate CategoricalColumnTransformer operations. Alternatively, we could have done this by building one large table containing all the columns of interest to us and then applying our CategoricalColumnTransformer to that single larger table. It is generally preferable to reduce the size of our tables via grouping and filtering before joining them together which is why we went with the approach we have here. That said, depending on your data base and compute access you could easily take a different approach. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Ngram Vectorizer\n",
"\n",
"In order to quickly and efficiently compare variable length sequences of categorical tokens we will convert into a sparse matrix format. This is a commonly practice in natural language processing where documents are often described as a 'bag of words'. In essence we have a single column for each unique categorical value in our list. That column has a one in it if that categorical value is present in our list. The sparse representation simply means that we don't encode any of the multitude of zeroes that would otherwise be contained in this matrix. \n",
"\n",
"We will do this with our NgramVectorizer. This is the simplest use case for our Ngram vectorizer with the default value of N=1. "
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"record_id_labels = list(joined_report.index)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ngram is now a