{ "cells": [ { "cell_type": "markdown", "id": "4e938668-ce23-4bf4-8696-0b11dc95216a", "metadata": {}, "source": [ "# InformationWeightTransformer\n", "The information weight transformer is designed to improve the embeddings produced by a simple Ngram Vectorizer by taking into account the amount of information that each token provides. It plays a similar role to the Term Frequency - Inverse Document Frequency (TF-IDF) transform for weighting count vectors, but its weights come from a calculation grounded in Bayesian inference and information theory.\n", "\n", "It is inspired by the paper *An information-theoretic perspective of tf–idf measures* by A. Aizawa (https://doi.org/10.1016/S0306-4573(02)00021-3)" ] }, { "cell_type": "markdown", "id": "60f1156e-082f-49dc-a8cb-eb80bbd2fb9e", "metadata": {}, "source": [ "## Example: Distinctive Ingredients from Regional Cuisines\n", "Consider a dataset of recipes, each labelled with the regional cuisine it comes from and described by the list of ingredients it uses." ] }, { "cell_type": "code", "execution_count": 2, "id": "43b05d84-91ba-45b3-8205-a98e694248b0", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from vectorizers.transformers import InformationWeightTransformer\n", "from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer\n", "from pathlib import Path\n", "import json\n", "from zipfile import ZipFile" ] }, { "cell_type": "code", "execution_count": 3, "id": "cc7b33af-a925-4fb0-8cc8-c074b5c0b284", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | id | cuisine | ingredients |
|---|---|---|---|
| 0 | 10259 | greek | [romaine lettuce, black olives, grape tomatoes, garlic, pepper, purple onion, seasoning, garbanzo beans, feta cheese... |
| 1 | 25693 | southern_us | [plain flour, ground pepper, salt, tomatoes, ground black pepper, thyme, eggs, green tomatoes, yellow corn meal, mil... |
| 2 | 20130 | filipino | [eggs, pepper, salt, mayonaise, cooking oil, green chilies, grilled chicken breasts, garlic powder, yellow onion, so... |
| 3 | 22213 | indian | [water, vegetable oil, wheat, salt] |
| 4 | 13162 | indian | [black pepper, shallots, cornflour, cayenne pepper, onions, garlic paste, milk, butter, salt, lemon juice, water, ch... |
| ... | ... | ... | ... |
| 39769 | 29109 | irish | [light brown sugar, granulated sugar, butter, warm water, large eggs, all-purpose flour, whole wheat flour, cooking ... |
| 39770 | 11462 | italian | [KRAFT Zesty Italian Dressing, purple onion, broccoli florets, rotini, pitted black olives, Kraft Grated Parmesan Ch... |
| 39771 | 2238 | irish | [eggs, citrus fruit, raisins, sourdough starter, flour, hot tea, sugar, ground nutmeg, salt, ground cinnamon, milk, ... |
| 39772 | 41882 | chinese | [boneless chicken skinless thigh, minced garlic, steamed white rice, baking powder, corn starch, dark soy sauce, kos... |
| 39773 | 2362 | mexican | [green chile, jalapeno chilies, onions, ground black pepper, salt, chopped cilantro fresh, green bell pepper, garlic... |
39774 rows × 3 columns
| | Cuisine | Most Frequent |
|---|---|---|
| 0 | Brazilian | Olive oil, Onions, Salt, |
| 1 | British | Butter, All purpose flour, Salt, |
| 2 | Cajun_creole | Garlic, Onions, Salt, |
| 3 | Chinese | Salt, Sesame oil, Soy sauce, |
| 4 | Filipino | Water, Garlic, Salt, |
| 5 | French | All purpose flour, Sugar, Salt, |
| 6 | Greek | Dried oregano, Olive oil, Salt, |
| 7 | Indian | Garam masala, Onions, Salt, |
| 8 | Irish | Butter, All purpose flour, Salt, |
| 9 | Italian | Garlic cloves, Olive oil, Salt, |
| 10 | Jamaican | Water, Onions, Salt, |
| 11 | Japanese | Mirin, Salt, Soy sauce, |
| 12 | Korean | Garlic, Sesame oil, Soy sauce, |
| 13 | Mexican | Ground cumin, Onions, Salt, |
| 14 | Moroccan | Ground cumin, Olive oil, Salt, |
| 15 | Russian | Onions, Sugar, Salt, |
| 16 | Southern_us | All purpose flour, Butter, Salt, |
| 17 | Spanish | Garlic cloves, Olive oil, Salt, |
| 18 | Thai | Salt, Garlic, Fish sauce, |
| 19 | Vietnamese | Salt, Sugar, Fish sauce, |
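
A "Most Frequent" column like the one above can be approximated with a plain per-cuisine count. The following is a minimal sketch using only the standard library; the `recipes` list is a hypothetical toy sample (not rows from the real dataset) and `most_frequent_by_cuisine` is an illustrative helper name, not a function from the notebook or from `vectorizers`.

```python
from collections import Counter, defaultdict

# Hypothetical toy sample shaped like the recipe data: (cuisine, ingredients).
recipes = [
    ("greek", ["feta cheese", "olive oil", "salt"]),
    ("greek", ["olive oil", "salt", "dried oregano"]),
    ("indian", ["garam masala", "salt", "onions"]),
    ("indian", ["salt", "onions", "water"]),
]


def most_frequent_by_cuisine(recipes, top_k=3):
    """Return the top_k ingredients per cuisine, counted once per recipe."""
    counts = defaultdict(Counter)
    for cuisine, ingredients in recipes:
        # sorted(set(...)) counts each ingredient once per recipe and keeps
        # the insertion order deterministic across runs.
        counts[cuisine].update(sorted(set(ingredients)))
    return {
        cuisine: [ingredient for ingredient, _ in counter.most_common(top_k)]
        for cuisine, counter in counts.items()
    }


top = most_frequent_by_cuisine(recipes)
for cuisine, ingredients in top.items():
    print(cuisine, ingredients)
```

Raw counts favour ingredients such as salt that appear in almost every cuisine, which is why the table above says little about what makes each cuisine distinctive.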
| | Cuisine | Most Frequent | TF-IDF | Info Weight |
|---|---|---|---|---|
| 0 | Brazilian | Olive oil, Onions, Salt, | Sweetened condensed milk, Lime, Cachaca, | Chocolate sprinkles, Açai, Cachaca, |
| 1 | British | Butter, All purpose flour, Salt, | Salt, All purpose flour, Milk, | Beef drippings, Suet, Stilton cheese, |
| 2 | Cajun_creole | Garlic, Onions, Salt, | Onions, Green bell pepper, Cajun seasoning, | Creole seasoning, Andouille sausage, Cajun sea... |
| 3 | Chinese | Salt, Sesame oil, Soy sauce, | Corn starch, Sesame oil, Soy sauce, | Shaoxing wine, Soy sauce, Sesame oil, |
| 4 | Filipino | Water, Garlic, Salt, | Water, Garlic, Soy sauce, | Calamansi juice, Fish sauce, Soy sauce, |
| 5 | French | All purpose flour, Sugar, Salt, | All purpose flour, Salt, Unsalted butter, | Grated gruyère cheese, Cognac, Gruyere cheese, |
| 6 | Greek | Dried oregano, Olive oil, Salt, | Feta cheese, Olive oil, Feta cheese crumbles, | Greek seasoning, Feta cheese, Feta cheese crum... |
| 7 | Indian | Garam masala, Onions, Salt, | Ground turmeric, Salt, Garam masala, | Cumin seed, Ground turmeric, Garam masala, |
| 8 | Irish | Butter, All purpose flour, Salt, | Salt, All purpose flour, Butter, | Irish cream liqueur, Guinness beer, Irish whis... |
| 9 | Italian | Garlic cloves, Olive oil, Salt, | Salt, Grated parmesan cheese, Olive oil, | Shredded mozzarella cheese, Ricotta cheese, Gr... |
| 10 | Jamaican | Water, Onions, Salt, | Dried thyme, Salt, Ground allspice, | Ground allspice, Jamaican jerk season, Scotch ... |
| 11 | Japanese | Mirin, Salt, Soy sauce, | Sake, Soy sauce, Mirin, | Dashi, Sake, Mirin, |
| 12 | Korean | Garlic, Sesame oil, Soy sauce, | Sesame seeds, Soy sauce, Sesame oil, | Sesame oil, Kimchi, Gochujang base, |
| 13 | Mexican | Ground cumin, Onions, Salt, | Chili powder, Jalapeno chilies, Salt, | Flour tortillas, Salsa, Corn tortillas, |
| 14 | Moroccan | Ground cumin, Olive oil, Salt, | Ground cinnamon, Olive oil, Ground cumin, | Ras el hanout, Preserved lemon, Couscous, |
| 15 | Russian | Onions, Sugar, Salt, | Sour cream, Sugar, Salt, | Dill, Fresh dill, Beets, |
| 16 | Southern_us | All purpose flour, Butter, Salt, | All purpose flour, Butter, Salt, | Bourbon whiskey, Grits, Buttermilk, |
| 17 | Spanish | Garlic cloves, Olive oil, Salt, | Salt, Extra virgin olive oil, Olive oil, | Serrano ham, Saffron threads, Spanish chorizo, |
| 18 | Thai | Salt, Garlic, Fish sauce, | Lemongrass, Coconut milk, Fish sauce, | Thai red curry paste, Lemongrass, Fish sauce, |
| 19 | Vietnamese | Salt, Sugar, Fish sauce, | Garlic, Sugar, Fish sauce, | Beansprouts, Lemongrass, Fish sauce, |
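
To give a feel for why an information-based weight behaves like the "Info Weight" column above, here is a simplified sketch in plain Python. It is an assumption-laden illustration, not the calculation that `InformationWeightTransformer` actually performs: it scores each token by the KL divergence between where the token occurs across documents and where it would occur if spread in proportion to document length, and it leaves out any Bayesian prior or smoothing. The function names and the toy `docs` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy documents (token lists); not taken from the recipe dataset.
docs = [
    ["salt", "olive oil", "feta cheese"],
    ["salt", "olive oil", "garlic cloves"],
    ["salt", "soy sauce", "sesame oil"],
]


def idf_weights(docs):
    """Plain inverse document frequency: log(N / df), with no smoothing."""
    n_docs = len(docs)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    return {token: math.log(n_docs / df) for token, df in doc_freq.items()}


def kl_information_weights(docs):
    """KL divergence of each token's distribution over documents from a
    baseline proportional to document length (simplified, unsmoothed)."""
    total_tokens = sum(len(doc) for doc in docs)
    baseline = [len(doc) / total_tokens for doc in docs]
    per_doc_counts = [Counter(doc) for doc in docs]
    token_totals = Counter(token for doc in docs for token in doc)

    weights = {}
    for token, token_total in token_totals.items():
        kl = 0.0
        for counts, q in zip(per_doc_counts, baseline):
            count = counts.get(token, 0)
            if count:  # terms with p == 0 contribute nothing to the sum
                p = count / token_total
                kl += p * math.log(p / q)
        weights[token] = kl
    return weights


idf = idf_weights(docs)
info = kl_information_weights(docs)
for token in sorted(info, key=info.get, reverse=True):
    print(f"{token:15s} idf={idf[token]:.3f} info={info[token]:.3f}")
```

In this toy case every document has the same length and each token appears at most once per document, so the KL weight reduces exactly to log(N / df) and matches the unsmoothed IDF. The two measures separate once document lengths and counts vary, and any prior-based smoothing (omitted here) would shift the weights for rare tokens further.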