{ "cells": [ { "cell_type": "markdown", "id": "9bc0a65b-d47c-40f0-bf21-11a9ee3e3c7a", "metadata": {}, "source": [ "# Clusterer Options\n", "\n", "This tutorial will look at some of the hyperparameters available when performing clustering for Toponymy. We will focus specifically on the ``ToponymyClusterer``, although the ``EVoCClusterer`` works very similarly, albeit with some extra options made available via [EVōC](https://github.com/TutteInstitute/evoc). To get started let's load up some initial libraries and get some data to try clustering with." ] }, { "cell_type": "code", "execution_count": 1, "id": "244441b4-c27a-4fea-839b-265a5a3220e2", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "id": "83b66aae-9db9-4d4a-badf-cc71cb427ef0", "metadata": {}, "source": [ "For our dataset we will use the venerable 20-newsgroups dataset, a classic NLP (Natural Language Processing) dataset of posts to twenty different newsgroups from the 1990s. We will grab a version of the dataset that comes complete with precomputed embedding vectors and a precomputed clusterable representation (that we can visualize to see how the clustering is working). This is a dataset we have used elsewhere in tutorials, so hopefully it is somewhat familiar by now." ] }, { "cell_type": "code", "execution_count": 2, "id": "a74e1789", "metadata": {}, "outputs": [], "source": [ "newsgroups_df = pd.read_parquet(\"hf://datasets/lmcinnes/20newsgroups_embedded/data/train-00000-of-00001.parquet\")" ] }, { "cell_type": "code", "execution_count": 3, "id": "9b4cc2fa-1049-4f3a-b2c4-d7fc9b97cf2a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | post | newsgroup | embedding | map |
|---|---|---|---|---|
| 0 | \\n\\nI am sure some bashers of Pens fans are pr... | rec.sport.hockey | [-0.04380008950829506, 0.08495834469795227, -0... | [-0.13199903070926666, 10.1972017288208] |
| 1 | My brother is in the market for a high-perform... | comp.sys.ibm.pc.hardware | [0.006855607498437166, -0.05531690642237663, -... | [11.03041934967041, 9.509867668151855] |
| 2 | \\n\\n\\n\\n\\tFinally you said what you dream abou... | talk.politics.mideast | [0.01537406351417303, 0.03572937101125717, -0.... | [1.7360589504241943, -0.31686803698539734] |
| 3 | \\nThink!\\n\\nIt's the SCSI card doing the DMA t... | comp.sys.ibm.pc.hardware | [0.010156078264117241, -0.07253803312778473, -... | [10.975887298583984, 10.715202331542969] |
| 4 | 1) I have an old Jasmine drive which I cann... | comp.sys.mac.hardware | [-0.008448092266917229, 0.06011670082807541, 0... | [10.498811721801758, 11.010639190673828] |