{ "cells": [ { "cell_type": "markdown", "id": "a999b850-e89d-4070-b58b-d3aaf966fdc4", "metadata": {}, "source": [ "# Clusterers for Toponymy\n", "\n", "Toponymy supports different ways to cluster your data. The default approach, the ``ToponymyClusterer``, works well with the rest of the toolchain. However, other clusterers exist, and it can be relatively easy to create your own clusterer that plugs directly into the existing Toponymy infrastructure. This tutorial looks at the clusterers available and briefly explores what is involved in writing your own clusterer class that works with Toponymy.\n", "\n", "Before we start, let's import some basic libraries and load some data." ] }, { "cell_type": "code", "execution_count": 1, "id": "244441b4-c27a-4fea-839b-265a5a3220e2", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "id": "816c5cdf-8315-462a-80e0-758a07564497", "metadata": {}, "source": [ "We'll use the standard 20-newsgroup data with pre-built embeddings and a pre-built data map." ] }, { "cell_type": "code", "execution_count": 2, "id": "a74e1789", "metadata": {}, "outputs": [], "source": [ "newsgroups_df = pd.read_parquet(\"hf://datasets/lmcinnes/20newsgroups_embedded/data/train-00000-of-00001.parquet\")" ] }, { "cell_type": "code", "execution_count": 3, "id": "9b4cc2fa-1049-4f3a-b2c4-d7fc9b97cf2a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | post | newsgroup | embedding | map |
|---|---|---|---|---|
| 0 | \\n\\nI am sure some bashers of Pens fans are pr... | rec.sport.hockey | [-0.04380008950829506, 0.08495834469795227, -0... | [-0.13199903070926666, 10.1972017288208] |
| 1 | My brother is in the market for a high-perform... | comp.sys.ibm.pc.hardware | [0.006855607498437166, -0.05531690642237663, -... | [11.03041934967041, 9.509867668151855] |
| 2 | \\n\\n\\n\\n\\tFinally you said what you dream abou... | talk.politics.mideast | [0.01537406351417303, 0.03572937101125717, -0.... | [1.7360589504241943, -0.31686803698539734] |
| 3 | \\nThink!\\n\\nIt's the SCSI card doing the DMA t... | comp.sys.ibm.pc.hardware | [0.010156078264117241, -0.07253803312778473, -... | [10.975887298583984, 10.715202331542969] |
| 4 | 1) I have an old Jasmine drive which I cann... | comp.sys.mac.hardware | [-0.008448092266917229, 0.06011670082807541, 0... | [10.498811721801758, 11.010639190673828] |