{ "cells": [ { "cell_type": "markdown", "id": "2597f991-4064-4375-86ac-50112914ae26", "metadata": {}, "source": [ "# Getting Started with Toponymy\n", "\n", "Toponymy is a library that can provide rich, well-named topics for large collections of vectorizable data. Primarily that means corpora of documents, making use of neural text embedding models, but it can extend to other modalities, which will be discussed in later tutorials. The aim of this tutorial is to walk you through the basic usage of Toponymy so you can start using it. Further tutorials, looking at the details of clustering, different LLMs, keyphrase extraction, other data modalities and more, will follow. For now, let's get Toponymy up and running.\n", "\n", "To start, we'll need some basic libraries that let us load data suitable for Toponymy." ] }, { "cell_type": "code", "execution_count": 1, "id": "244441b4-c27a-4fea-839b-265a5a3220e2", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "markdown", "id": "ecd987a8-9438-4843-9789-e86f1d3e120d", "metadata": {}, "source": [ "For a dataset we'll be using the venerable 20-newsgroups dataset, a classic NLP (Natural Language Processing) dataset of posts to twenty different newsgroups from the 1990s. The dataset contains around twenty thousand posts on a wide variety of topics (despite being directed to particular named newsgroups, people are inclined to go off-topic at times). To make use of this data in Toponymy we will need to turn the newsgroup posts into vectors, and ideally produce a lower-dimensional, clusterable representation of those vectors. Toponymy tries to be agnostic to how this is done, so you can use whatever tools you wish. 
However, since vectorizing that much text can be computationally expensive (or simply cost money if you are using an embedding service), and we want to get you up and running as fast as possible, let's use a version of 20-newsgroups that comes complete with embedding vectors (built using ``all-mpnet-base-v2`` from sentence-transformers) and a 2D representation we can use for clustering and plotting (built using UMAP)." ] }, { "cell_type": "code", "execution_count": 2, "id": "a74e1789", "metadata": {}, "outputs": [], "source": [ "newsgroups_df = pd.read_parquet(\"hf://datasets/lmcinnes/20newsgroups_embedded/data/train-00000-of-00001.parquet\")" ] }, { "cell_type": "markdown", "id": "f731b756-f698-412a-98aa-df6b2d6cfb3f", "metadata": {}, "source": [ "We can get a sense of the data by looking at the first few rows:" ] }, { "cell_type": "code", "execution_count": 3, "id": "9b4cc2fa-1049-4f3a-b2c4-d7fc9b97cf2a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| \n", " | post | \n", "newsgroup | \n", "embedding | \n", "map | \n", "
|---|---|---|---|---|
| 0 | \n", "\\n\\nI am sure some bashers of Pens fans are pr... | \n", "rec.sport.hockey | \n", "[-0.04380008950829506, 0.08495834469795227, -0... | \n", "[-0.13199903070926666, 10.1972017288208] | \n", "
| 1 | \n", "My brother is in the market for a high-perform... | \n", "comp.sys.ibm.pc.hardware | \n", "[0.006855607498437166, -0.05531690642237663, -... | \n", "[11.03041934967041, 9.509867668151855] | \n", "
| 2 | \n", "\\n\\n\\n\\n\\tFinally you said what you dream abou... | \n", "talk.politics.mideast | \n", "[0.01537406351417303, 0.03572937101125717, -0.... | \n", "[1.7360589504241943, -0.31686803698539734] | \n", "
| 3 | \n", "\\nThink!\\n\\nIt's the SCSI card doing the DMA t... | \n", "comp.sys.ibm.pc.hardware | \n", "[0.010156078264117241, -0.07253803312778473, -... | \n", "[10.975887298583984, 10.715202331542969] | \n", "
| 4 | \n", "1) I have an old Jasmine drive which I cann... | \n", "comp.sys.mac.hardware | \n", "[-0.008448092266917229, 0.06011670082807541, 0... | \n", "[10.498811721801758, 11.010639190673828] | \n", "