{ "cells": [ { "cell_type": "markdown", "id": "2597f991-4064-4375-86ac-50112914ae26", "metadata": {}, "source": [ "# Getting Started with Toponymy\n", "\n", "Toponymy is a library that can provide rich well named topics for large collections of vectorizable data. Primarily that mean copora of documents, making use of neural text embedding models, but can extend to other modalities which will be discussed in later tutorials. The aim of this tutorial is to walk you through the basic usage of toponymy to get you started on using it. Further tutorials, looking at the details of clustering, different LLMs, keyphrase extraction, other data modalities and more, will follow. For now let's get started getting Toponymy up and running.\n", "\n", "To start we'll need some basic libraries to allow us to get some data suitable for applying Toponymy." ] }, { "cell_type": "code", "execution_count": 1, "id": "244441b4-c27a-4fea-839b-265a5a3220e2", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "markdown", "id": "ecd987a8-9438-4843-9789-e86f1d3e120d", "metadata": {}, "source": [ "For a dataset we'll be using the venerable 20-newsgroups dataset, a classic NLP (Natural Language Processing) dataset of posts to twenty different newsgroups from the 1990s. The dataset contains around twenty thousand posts on a wide variety of topics (despite being directed to partricular named newsgroups, people are inclined to go off-topic at times). To make use of this data in Toponymy we will need to turn the the newsgroup posts into vectors, and ideally produce a lower dimensional clusterable representation of those vectors. Toponymy tries to be agnostic to how this is done, so you can use whatever tools you wish. 
However, since vectorizing that much text can be computationally expensive (or cost real money if you are using an embedding service), and we want to get you up and running as fast as possible, let's use a version of 20-newsgroups that comes complete with embedding vectors (built using ``all-mpnet-base-v2`` from sentence-transformers) and a 2D representation we can use for clustering and plotting (built using UMAP)." ] }, { "cell_type": "code", "execution_count": 2, "id": "a74e1789", "metadata": {}, "outputs": [], "source": [ "newsgroups_df = pd.read_parquet(\"hf://datasets/lmcinnes/20newsgroups_embedded/data/train-00000-of-00001.parquet\")" ] }, { "cell_type": "markdown", "id": "f731b756-f698-412a-98aa-df6b2d6cfb3f", "metadata": {}, "source": [ "We can get a sense of the data by looking at the first few rows:" ] }, { "cell_type": "code", "execution_count": 3, "id": "9b4cc2fa-1049-4f3a-b2c4-d7fc9b97cf2a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr>\n",
" <th></th>\n",
" <th>post</th>\n",
" <th>newsgroup</th>\n",
" <th>embedding</th>\n",
" <th>map</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>\\n\\nI am sure some bashers of Pens fans are pr...</td>\n",
" <td>rec.sport.hockey</td>\n",
" <td>[-0.04380008950829506, 0.08495834469795227, -0...</td>\n",
" <td>[-0.13199903070926666, 10.1972017288208]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>My brother is in the market for a high-perform...</td>\n",
" <td>comp.sys.ibm.pc.hardware</td>\n",
" <td>[0.006855607498437166, -0.05531690642237663, -...</td>\n",
" <td>[11.03041934967041, 9.509867668151855]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>\\n\\n\\n\\n\\tFinally you said what you dream abou...</td>\n",
" <td>talk.politics.mideast</td>\n",
" <td>[0.01537406351417303, 0.03572937101125717, -0....</td>\n",
" <td>[1.7360589504241943, -0.31686803698539734]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>\\nThink!\\n\\nIt's the SCSI card doing the DMA t...</td>\n",
" <td>comp.sys.ibm.pc.hardware</td>\n",
" <td>[0.010156078264117241, -0.07253803312778473, -...</td>\n",
" <td>[10.975887298583984, 10.715202331542969]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>1) I have an old Jasmine drive which I cann...</td>\n",
" <td>comp.sys.mac.hardware</td>\n",
" <td>[-0.008448092266917229, 0.06011670082807541, 0...</td>\n",
" <td>[10.498811721801758, 11.010639190673828]</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " post \\\n", "0 \\n\\nI am sure some bashers of Pens fans are pr... \n", "1 My brother is in the market for a high-perform... \n", "2 \\n\\n\\n\\n\\tFinally you said what you dream abou... \n", "3 \\nThink!\\n\\nIt's the SCSI card doing the DMA t... \n", "4 1) I have an old Jasmine drive which I cann... \n", "\n", " newsgroup \\\n", "0 rec.sport.hockey \n", "1 comp.sys.ibm.pc.hardware \n", "2 talk.politics.mideast \n", "3 comp.sys.ibm.pc.hardware \n", "4 comp.sys.mac.hardware \n", "\n", " embedding \\\n", "0 [-0.04380008950829506, 0.08495834469795227, -0... \n", "1 [0.006855607498437166, -0.05531690642237663, -... \n", "2 [0.01537406351417303, 0.03572937101125717, -0.... \n", "3 [0.010156078264117241, -0.07253803312778473, -... \n", "4 [-0.008448092266917229, 0.06011670082807541, 0... \n", "\n", " map \n", "0 [-0.13199903070926666, 10.1972017288208] \n", "1 [11.03041934967041, 9.509867668151855] \n", "2 [1.7360589504241943, -0.31686803698539734] \n", "3 [10.975887298583984, 10.715202331542969] \n", "4 [10.498811721801758, 11.010639190673828] " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "newsgroups_df.head()" ] }, { "cell_type": "markdown", "id": "7b4d1b6a-a130-455a-af77-e22ccae0360d", "metadata": {}, "source": [ "We have posts, which are the text content of the posts (with headers, footers, quotes and signatures stripped), the newsgroup the message was posted to, an embedding vector, and a 2D data map representation. In this version of the dataset we have slightly less than the twenty thousand posts, since there were a number of very short posts (once quotes and signatures were stripped) that barely have enough text to be worth embedding, and these have simply been removed.\n", "\n", "For Toponymy we will want the embedding vectors and the clusterable vectors in numpy format (as opposed to a pandas series of python lists of floats), so let's extract that out from the dataframe. 
If you are interested in trying the whole process, the cell below contains the relevant code to generate the sentence embeddings and clusterable data map directly from the text -- simply change the ``if False`` to ``if True`` to try running that step yourself. Be warned, depending on the hardware available (e.g. if you have no GPU) this could be very time consuming." ] }, { "cell_type": "code", "execution_count": 4, "id": "90e27ce1-9f8c-4d17-a633-943b5cc4648a", "metadata": {}, "outputs": [], "source": [ "if False:\n", "    # Recompute the embeddings and the 2D map from the raw text (slow without a GPU)\n", "    from sentence_transformers import SentenceTransformer\n", "    from umap import UMAP\n", "\n", "    embedding_model = SentenceTransformer(\"all-mpnet-base-v2\")\n", "    embedding_vectors = embedding_model.encode(newsgroups_df[\"post\"].tolist(), show_progress_bar=True)\n", "    clusterable_vectors = UMAP(metric=\"cosine\").fit_transform(embedding_vectors)\n", "else:\n", "    # Use the precomputed vectors shipped with the dataset\n", "    embedding_vectors = np.stack(newsgroups_df[\"embedding\"].values)\n", "    clusterable_vectors = np.stack(newsgroups_df[\"map\"].values)" ] }, { "cell_type": "markdown", "id": "e0cf6c8e-9890-4a35-8393-55ab56c7a300", "metadata": {}, "source": [ "## Running Toponymy\n", "\n", "Now that we have some suitable data, and have extracted the relevant embedding vectors, let's get started using Toponymy. We will need to import a few pieces to get things working. First we'll need the ``Toponymy`` class that we can use to train a topic model. Alongside that we will also need to import a ``Clusterer`` and a ``KeyphraseBuilder``. Clusterers are pluggable, and you can even write your own fairly easily -- see the tutorials on clusterers for more details. The ``KeyphraseBuilder`` is used to extract potential keyphrases from the text of the corpus. 
Both of these are provided as separate classes as each has a number of configuration options unique to them, and we wanted to separate configuration of these tasks from the overall ``Toponymy`` class so as not to clutter the interface with a large array of options.\n", "\n", "We will also need an LLM to distill out the final human readable topic names. Toponymy provides wrappers around a number of LLMs, including LlamaCpp and HuggingFace for local models, and services via OpenAI, Anthropic, Cohere, and AzureAI. In this tutorial we'll be using an Azure AI Foundry instance of a Cohere model, but you can substitute in your preferred LLM provider. See the documentation of LLM wrappers for details on using your preferred LLM.\n", "\n", "Lastly, since Toponymy does internal semantic similarity work with both keyphrases and topic names, we will need a text embedding model. Note that this *does not* have to be the same as the model used to create the text embedding for the documents. Either a sentence-transformer model, or a model from ``toponymy.embedding_wrappers`` that wraps various embedding services, will suffice. Here we'll use a very lightweight sentence-transformer model to save time on the internal embedding work since the quality of this embedding (as opposed to the embeddings of the documents) is less important. 
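
For readers unfamiliar with the term, the following is a generic illustration of what semantic similarity between embedded texts usually means, using cosine similarity on made-up vectors. It is only a sketch: the function name ``cosine_similarity`` and the toy vectors are hypothetical, and it is not claimed to be how Toponymy computes similarity internally.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3D vectors standing in for embedded phrases
phrase_a = np.array([1.0, 0.0, 1.0])
phrase_b = np.array([2.0, 0.0, 2.0])  # same direction as phrase_a
phrase_c = np.array([0.0, 1.0, 0.0])  # orthogonal to phrase_a

sim_ab = cosine_similarity(phrase_a, phrase_b)
sim_ac = cosine_similarity(phrase_a, phrase_c)
print(sim_ab, sim_ac)
```

Because only the direction of the vectors matters here, a small and fast embedding model is often adequate for this kind of comparison.
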
" ] }, { "cell_type": "code", "execution_count": null, "id": "4bb89037", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.\n" ] } ], "source": [ "from toponymy import Toponymy, ToponymyClusterer, KeyphraseBuilder\n", "from toponymy.llm_wrappers import AzureAINamer\n", "\n", "from sentence_transformers import SentenceTransformer\n", "embedding_model = SentenceTransformer(\"paraphrase-MiniLM-L3-v2\")\n", "\n", "azure_api_key = open(\"../azure_cohere_api_key.txt\").read().strip()" ] }, { "cell_type": "markdown", "id": "9b5da471-f4bc-47fb-a04e-4298121a39e1", "metadata": {}, "source": [ "With the preliminaries out of the way, let's create a Toponymy topic modeller. We do this with the ``Toponymy`` class, which takes a number of parameters. The primary things we will need are an ``llm_wrapper`` instatiated from ``toponymy.llm_wrappers``, a ``text_embedding_model``, a ``clusterer`` and a ``keyphrase_builder. The latter two can be instatiated with suitable parameters for your needs -- see the documentation on those for more details, but more oftehn than not the default will suffice.\n", "\n", "To get more out of the topic modeller it can be helpful to provide an ``object_description`` and a ``corpus_description``. This will help provide context on what the individual documents and, and what kind of corpus they were drawn from. This can improve the final topic names produced. Lastly, in this particular case, it will be useful to provide ``exemplar_delimiters``. If you just have short sentences or other clean text in your corpus the default delimiters will be fine, however the newsgroup posts contain all manner of quoting, bullet lists, ascii art, and other messy aspects, so it will be beneficial to provide some clear delimiters around example texts that will be passed to the LLM. 
Here we will use tags ``\"\\n\"`` and ``\"\\n\\n\\n\"`` to denote the start and end of an example text." ] }, { "cell_type": "code", "execution_count": null, "id": "cbc20725", "metadata": {}, "outputs": [], "source": [ "topic_model = Toponymy(\n", " llm_wrapper=AzureAINamer(\n", " azure_api_key, \n", " endpoint=\"https://azureaitimcuse5821437469.services.ai.azure.com/models\",\n", " model=\"Cohere-command-r-08-2024\",\n", " ),\n", " text_embedding_model=embedding_model,\n", " clusterer=ToponymyClusterer(min_clusters=4, verbose=True),\n", " keyphrase_builder=KeyphraseBuilder(ngram_range=(1,6), max_features=15_000, verbose=True),\n", " object_description=\"newsgroup posts\",\n", " corpus_description=\"20-newsgroups dataset\",\n", " exemplar_delimiters=[\"\\n\",\"\\n\\n\\n\"],\n", ")" ] }, { "cell_type": "markdown", "id": "fabc8fc7-25f6-46b5-b0ea-115bc7dbd337", "metadata": {}, "source": [ "Having instantiated the model, we now need to fit it to data. That means we need to provide the newsgroup posts, along with ``embedding_vectors`` and ``clusterable_vectors``. Toponymy hews to a scikit-learn style API, with a fit command that takes the data to fit the model to (the text) along with other arguments -- in our case those ``embedding_vectors`` and ``clusterable_vectors``. This process will take some time as the model constructs various layers of clustering resolution, extracts relevant information for each cluster, and then uses LLM calls to provide topic names for all the clusters -- making use of finer resolution clusters to help name larger clusters for which sampling text would not provide sufficient coverage. " ] }, { "cell_type": "code", "execution_count": 7, "id": "6db788d0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Layer 0 found 429 clusters\n", "Layer 1 found 134 clusters\n", "Layer 2 found 41 clusters\n", "Layer 3 found 14 clusters\n", "Layer 4 found 5 clusters\n", "Building keyphrase matrix ... 
\n", "Chunking into 1 chunks of size 20000 for keyphrase identification.\n", "Combining count dictionaries ...\n", "Found 15000 keyphrases.\n", "Chunking into 1 chunks of size 20000 for keyphrase count construction.\n", "Combining count matrix chunks ...\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Selecting central exemplars: 0%| | 0/429 [00:00" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "topic_model.fit(\n", " newsgroups_df[\"post\"].str.strip().values, \n", " embedding_vectors=embedding_vectors, \n", " clusterable_vectors=clusterable_vectors\n", ")" ] }, { "cell_type": "markdown", "id": "519e58f4-9479-497d-918e-c18fa6e3a9ca", "metadata": {}, "source": [ "Having fit the model we can start looking at the results. The first thing to look at is which topics were found in the dataset. The simplest way to see that is via the ``topic_names_`` attribute. This is a list of lists of topic names at various resolution layers, with the first being the finest granularity topics, and the last being the highest level topics. Let's look at the top-most topic names -- the big picture topics that the topic model was able to extract:" ] }, { "cell_type": "code", "execution_count": 8, "id": "faedaf89", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Sports',\n", " 'PoliticsReligion',\n", " 'Vehicle Discussion',\n", " 'Computer Hardware',\n", " 'Computer Graphics']" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "topic_model.topic_names_[-1]" ] }, { "cell_type": "markdown", "id": "49fe1d82-6c48-479c-b475-200412275379", "metadata": {}, "source": [ "So overall we see sports, a topic on religion and politics, vehicles, and computer-related topics split into hardware and graphics. Is this what we might expect as the big topics seen in the corpus? 
What are the newsgroups that the posts were from?" ] }, { "cell_type": "code", "execution_count": 9, "id": "497b1d22-a6ad-4ad2-b7a0-b70444e43252", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['rec.sport.hockey',\n", " 'comp.sys.ibm.pc.hardware',\n", " 'talk.politics.mideast',\n", " 'comp.sys.mac.hardware',\n", " 'sci.electronics',\n", " 'talk.religion.misc',\n", " 'sci.crypt',\n", " 'sci.med',\n", " 'alt.atheism',\n", " 'rec.motorcycles',\n", " 'rec.autos',\n", " 'comp.windows.x',\n", " 'comp.graphics',\n", " 'sci.space',\n", " 'talk.politics.guns',\n", " 'misc.forsale',\n", " 'rec.sport.baseball',\n", " 'talk.politics.misc',\n", " 'comp.os.ms-windows.misc',\n", " 'soc.religion.christian']" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "newsgroups_df.newsgroup.unique().tolist()" ] }, { "cell_type": "markdown", "id": "05432c04-c797-4f87-9e5c-7defb517913a", "metadata": {}, "source": [ "So there are a lot of newsgroups on religion and politics, a lot of newsgroups on computers (both hardware and software), two newsgroups on sports (hockey and baseball), two newsgroups on cars and motorcycles, and a few science newsgroups that aren't showing up at the top level (likely because they are each on different specific topics, and so represent smaller topics overall compared with these larger topics made up of multiple newsgroups each). Let's go down to the next layer of topic resolution and see if we can pick up those other newsgroups, and possibly differentiate a little better among the larger topics like religion and politics."
] }, { "cell_type": "code", "execution_count": 10, "id": "3f27dede-75b2-465c-9fa4-51fb14df8c46", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Baseball Players',\n", " 'NHL Game Updates',\n", " 'Space Exploration Innovations',\n", " 'US Surveillance and Encryption',\n", " 'Medical Treatments and Health',\n", " 'Middle East Conflicts',\n", " 'Vehicle Discussion',\n", " 'Religion and Beliefs',\n", " 'Waco Siege Analysis',\n", " 'Political Scandals and Debate',\n", " 'Gun Control and Politics',\n", " 'Computer Graphics',\n", " 'Computer Hardware Storage',\n", " 'Computer Hardware Sales']" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "topic_model.topic_names_[-2]" ] }, { "cell_type": "markdown", "id": "243be6ba-7d9f-4d0b-b276-5a093bd134f6", "metadata": {}, "source": [ "Now we see the science topics represented individually as 'Space Exploration Innovations', 'US Surveillance and Encryption', and 'Medical Treatments and Health'. We also see several of the larger topics refine into more specific topics matching the newsgroups -- baseball and hockey are split apart, and religion and politics break down into much more specific topics matching many of the newsgroups (Middle East politics, gun control, more general political discourse, and a topic on religion and beliefs). Notably we also see some of the topics splitting in ways not evident from the newsgroup names. For example, while there was a single newsgroup 'talk.politics.guns' we can see separate topics for gun control versus discussion of the Waco siege (which was a significant topic for gun rights advocates in the 1990s).\n", "\n", "How fine-grained do the topics get? Let's look at the 0th layer, which contains the most fine-grained topics. There will be far too many individual topics there to look at easily, so let's look at the first ten and get at least a general idea of the level of specificity in topics we can expect to see."
] }, { "cell_type": "code", "execution_count": 11, "id": "936ab9be-4046-4cc3-b077-4a5d6a0e0fc1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[\"Detroit Red Wings' Stanley Cup Octopus Tradition\",\n", " 'Kirlian Photography and Aura Imaging',\n", " 'Sports Radio Stations and Teams',\n", " 'Ice Hockey Broadcasts and Schedules',\n", " 'Baseball Game Length and Pacing',\n", " 'Baseball Tickets and Schedules',\n", " 'Sports Team Mailing Lists',\n", " 'NHL Captain Trades and Trivia',\n", " 'Hacker Ethics and the Evolution of Computing',\n", " 'NHL Hockey Violence and Cheap Shots']" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "topic_model.topic_names_[0][:10]" ] }, { "cell_type": "markdown", "id": "94b8f8b1-c164-484c-b82b-7f5c0dcf89aa", "metadata": {}, "source": [ "As you can see the topics really get down into the weeds at the fine-grained level.\n", "\n", "To see how these topics fit together, from top level to fine-grained, we can also look at the topic tree via the ``topic_tree_`` attribute. This is actually a class that provides a few ways to view the topic tree, including pretty-printing, but the default representation as an expandable tree in HTML will suffice for viewing in a notebook like this." ] }, { "cell_type": "code", "execution_count": 12, "id": "4005d1ac-6b13-4761-9386-12dea971afd4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " \n", "
" ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "topic_model.topic_tree_" ] }, { "cell_type": "markdown", "id": "8d973ce0-242d-4325-8d40-923f740f3a76", "metadata": {}, "source": [ "Click on \"Topic Tree\" to start expanding, and then click on a topic that has a triangle marker to further expand that topic. Note that not every topic from more fine grained layers has a parent at the top level, so the first expansion of the tree contains many topics. The relative resolution layer in the topic tree is displayed by the font-weight (bolder fonts represent higher level topics)." ] }, { "cell_type": "markdown", "id": "666aa8f7-0ea2-468f-8884-3987cda1b782", "metadata": {}, "source": [ "## Plotting an Interactive Document Data Map\n", "\n", "Ultimately one wants to see which documents are related to which topics. That is accessible directly through the ``topic_name_vectors_`` attribute of the topic model, which provides a list of vectors, each of length equal to the number of documents in the corpus, with values taken from the topic names. The *n*th entry of the *i*th array is the topic name assigned to the *n*th document at the *i*th layer of topic resolution. This provides a direct document to topic mapping. This is quite sufficient for any programmatic analysis of documents and topics. It is, however, a little unweidly to parse through by eye. A better approach would be to use the 2D representation of the documents to plot a data map of all the documents, and then overlay the layers of topics on top of that. Fortunately, to make that easy, there is a library called [datamapplot](https://github.com/TutteInstitute/datamapplot) that can handle all of the heavy lifting for us. 
We just need to import the library:" ] }, { "cell_type": "code", "execution_count": 13, "id": "112fee1a", "metadata": {}, "outputs": [], "source": [ "import datamapplot\n", "import datamapplot.selection_handlers" ] }, { "cell_type": "markdown", "id": "062c813e-6a88-406b-980b-a3b313860038", "metadata": {}, "source": [ "To create an interactive plot we can view directly in the notebook (or export to an HTML file to share with others) we simply need to use the ``create_interactive_plot`` function. The main parameters we need for that are the 2D coordinates of the data map, and then some number of layers of cluster names. The required format of those layers of cluster names is ... exactly the format of the ``topic_name_vectors_`` attribute (expanded to a set of arguments via the ``*`` expansion of lists). That means we already have everything we need. From there datamapplot provides a number of extra tools for providing hover tooltips (we'll pass in the content of the newsgroup post), tweaking aesthetics, adding text search and alternate colormaps, and even allowing for lasso selections that trigger events (we'll create a word cloud from the selected newsgroup posts). We won't go into all the details of those parameters here, but please check the [datamapplot documentation](https://datamapplot.readthedocs.io/) for more details. 
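
To make the layer format and the ``*`` expansion concrete, here is a toy sketch with made-up labels. The function ``describe_layers`` is a hypothetical stand-in that only mimics the calling pattern (coordinates first, then any number of label layers); it is not part of datamapplot or Toponymy.

```python
import numpy as np

# Hypothetical stand-in for a plotting function that takes 2D coordinates
# followed by any number of label layers as positional arguments
def describe_layers(coordinates, *label_layers):
    # For each layer: (number of documents, number of distinct topic names)
    return [(len(layer), len(set(layer))) for layer in label_layers]

toy_coordinates = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])

# One array per resolution layer, each with one topic name per document
toy_layers = [
    np.array(['Hockey', 'Baseball', 'Disk Drives', 'Graphics Cards']),  # finest
    np.array(['Sports', 'Sports', 'Computer Hardware', 'Computer Hardware']),  # coarsest
]

# The * expands the list so each layer is passed as its own argument
summary = describe_layers(toy_coordinates, *toy_layers)
print(summary)
```

Each layer has one entry per document, and coarser layers simply reuse fewer distinct names, which is why the list can be handed over as-is.
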
" ] }, { "cell_type": "code", "execution_count": 14, "id": "b10b6ecb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "plot = datamapplot.create_interactive_plot(\n", " clusterable_vectors, \n", " *topic_model.topic_name_vectors_,\n", " title=\"20-Newsgroups\",\n", " sub_title=\"A data map of 20-newsgroups using all-mpnet-basev2, Toponymy, Cohere and UMAP\",\n", " hover_text=newsgroups_df[\"post\"].values,\n", " font_family=\"Cormorant SC\",\n", " marker_size_array=np.asarray([np.log(len(x)) for x in newsgroups_df[\"post\"].values]),\n", " colormaps={\"newsgroup\": pd.Series(newsgroups_df[\"newsgroup\"].values)},\n", " cluster_layer_colormaps=True,\n", " enable_search=True,\n", " selection_handler=datamapplot.selection_handlers.WordCloud(height=300),\n", ")\n", "plot" ] }, { "cell_type": "markdown", "id": "dc8fc81c-dbc6-49f0-b724-90a5c028a79e", "metadata": {}, "source": [ "The output is an interactive datamap that you can pan and zoom around in. Topics are overlaid in text, with finer grained topics appearing as required as you zoom further in. If you hold down the shift key and click and drag the mouse you can lasso-select points to get word clouds assoociated to the selected documents (select empty space to clear the word cloud). Text search, and colormaps (including by newsgroup title) are also available. This concludes the \"getting started\" tutorial, and hopefully provides you with enough to at least get started using Toponymy with your data." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.11" } }, "nbformat": 4, "nbformat_minor": 5 }