{ "cells": [ { "cell_type": "markdown", "id": "90bd7db7-2d7b-4869-80c2-8b5a9c94e182", "metadata": {}, "source": [ "# Keyphrase extractors\n", "\n", "Toponymy uses keyphrases to provide clear topical information about the content of potential clusters of objects. Ranking candidate keyphrases is handled within Toponymy itself, but extracting candidate keyphrases from the text associated with objects is handled by a KeyphraseBuilder. This tutorial works through what KeyphraseBuilders are, and how they can be used and tuned to your needs.\n", "\n", "To begin, let's load some libraries and collect data to demonstrate how KeyphraseBuilders work." ] }, { "cell_type": "code", "execution_count": 1, "id": "ea2ff348-52c0-4c1f-b645-1bbce212bc91", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import dask.dataframe as dd\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "id": "e6622b0f-384c-40d7-a10f-db5bb1e01ff7", "metadata": {}, "source": [ "For data we will collect a subset of the arxiv_ml dataset, which compiles titles and abstracts of arXiv articles about machine learning. You can consult the [dataset card on Hugging Face] for more details on its composition. To keep things to a manageable size we'll simply grab one of the eight chunks that make up the full dataset." ] }, { "cell_type": "code", "execution_count": 2, "id": "5539e06f-f9b0-4ce3-9068-75e661fe4f73", "metadata": {}, "outputs": [], "source": [ "arxiv_ml_df = pd.read_parquet(\"hf://datasets/lmcinnes/arxiv_ml/data/train-00000-of-00008-f3c9b137f969d545.parquet\")" ] }, { "cell_type": "code", "execution_count": 3, "id": "e06676f5-431c-439d-8b43-a17cc47cecd3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | date_created | abstract | title | categories | arxiv_id | year | embedding_str | embedding | data_map |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2007-04-01 13:06:50 | The intelligent acoustic emission locator is... | Intelligent location of simultaneously active ... | cs.NE cs.AI | 0704.0047 | 2007 | # Intelligent location of simultaneously activ... | [-0.04404345899820328, 0.028888946399092674, -... | [9.242573738098145, 2.837263584136963] |
| 1 | 2007-04-01 18:53:13 | Part I describes an intelligent acoustic emi... | Intelligent location of simultaneously active ... | cs.NE cs.AI | 0704.0050 | 2007 | # Intelligent location of simultaneously activ... | [-0.03380154073238373, 0.005963773000985384, -... | [9.297958374023438, 2.8726723194122314] |
| 2 | 2007-04-03 02:08:48 | This paper discusses the benefits of describ... | The World as Evolving Information | cs.IT cs.AI math.IT q-bio.PE | 0704.0304 | 2007 | # The World as Evolving Information\\n\\n This ... | [-0.005738923791795969, 0.01626133918762207, 0... | [3.5662941932678223, 10.24143123626709] |
| 3 | 2007-04-05 02:57:15 | The problem of statistical learning is to co... | Learning from compressed observations | cs.IT cs.LG math.IT | 0704.0671 | 2007 | # Learning from compressed observations\\n\\n T... | [0.004663723520934582, 0.02371317893266678, -0... | [-0.9903357625007629, 7.9114227294921875] |
| 4 | 2007-04-06 21:58:52 | In a sensor network, in practice, the commun... | Sensor Networks with Random Links: Topology De... | cs.IT cs.LG math.IT | 0704.0954 | 2007 | # Sensor Networks with Random Links: Topology ... | [0.03113919124007225, 0.01722194068133831, -0.... | [-2.087822675704956, 9.253089904785156] |