{ "cells": [ { "cell_type": "markdown", "id": "967c35da-f951-4a61-9854-3a55248b2a62", "metadata": {}, "source": [ "# How Toponymy Works\n", "\n", "Toponymy can produce compelling deconstructions of large corpora into well-named topics at a wide variety of resolutions -- from large-scale themes to fine-grained sub-topics. This tutorial walks you through how it achieves this through a combination of clustering, feature extraction, LLMs, and careful cleanup of initially generated topic names. To get started, let's import some initial libraries and then start collecting a (small) dataset we can explore." ] }, { "cell_type": "code", "execution_count": 1, "id": "993e6f9f-1466-4cf5-af0c-9013971565f9", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/work/home/lmmcinn/.conda/envs/toponymy_docs/lib/python3.12/site-packages/sentence_transformers/cross_encoder/CrossEncoder.py:13: TqdmExperimentalWarning: Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. in jupyter console)\n", " from tqdm.autonotebook import tqdm, trange\n" ] } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "import os\n", "import matplotlib.pyplot as plt\n", "\n", "from sentence_transformers import SentenceTransformer\n", "from collections import Counter" ] }, { "cell_type": "markdown", "id": "e1f55b66-dbb4-41e0-97a3-038622e39632", "metadata": {}, "source": [ "For our dataset we are going to use the category theory section of arXiv -- all the papers tagged with ``math.CT`` as one of the categories they were deemed to fall under. To keep things simple we will work with just the titles of the papers, and we'll use a dataset that has embeddings and a data map pre-built. 
This should provide a good example: the dataset is small enough to work through quickly (only around 8500 items, and the text is just the relatively short title of each paper), yet it is still fairly complex, involving a lot of domain-specific language that can be challenging to work through." ] }, { "cell_type": "code", "execution_count": 2, "id": "166cd82f-0e94-4f0a-952d-b15ad5e7024d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | date_created | title | categories | arxiv_id | year | embedding | data_map |
|---|---|---|---|---|---|---|---|
| 0 | 2007-04-04 06:58:08 | Generic representations of orthogonal groups: ... | math.AT math.CT | 0704.0502 | 2007 | [-0.5840985178947449, 0.9819669723510742, -4.2... | [11.470277786254883, 10.948436737060547] |
| 1 | 2007-04-11 09:08:03 | Triangulated categories without models | math.AT math.CT math.KT | 0704.1378 | 2007 | [0.13958382606506348, 1.518001675605774, -4.65... | [13.929526329040527, 10.156951904296875] |
| 2 | 2007-04-12 17:21:24 | Complete Segal spaces arising from simplicial ... | math.AT math.CT | 0704.1624 | 2007 | [0.8432340025901794, 1.7251262664794922, -3.68... | [13.688016891479492, 7.38051700592041] |
| 3 | 2007-04-17 06:55:01 | Associated Graded Algebras and Coalgebras | math.CT math.QA | 0704.2106 | 2007 | [0.7835264801979065, 1.1688741445541382, -4.08... | [12.427974700927734, 8.874861717224121] |
| 4 | 2007-04-17 17:52:10 | Adjoint Functors and Heteromorphisms | math.CT math.LO | 0704.2207 | 2007 | [-0.21176309883594513, 1.6140257120132446, -3.... | [11.670499801635742, 11.705853462219238] |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 8452 | 2005-06-16 13:51:21 | Quantum information-flow, concretely, and axio... | quant-ph math.CT math.LO | quant-ph/0506132 | 2005 | [1.2231957912445068, 2.196269989013672, -4.278... | [9.509384155273438, 4.773165225982666] |
| 8453 | 2005-06-16 14:15:14 | De-linearizing Linearity: Projective Quantum A... | quant-ph math.CT math.LO | quant-ph/0506134 | 2005 | [0.9992305040359497, 0.9833099842071533, -3.19... | [9.6536283493042, 5.01566219329834] |
| 8454 | 2005-10-04 15:53:48 | Kindergarten Quantum Mechanics | quant-ph math.CT | quant-ph/0510032 | 2005 | [1.0677509307861328, 1.4395443201065063, -4.65... | [9.42551326751709, 4.819393634796143] |
| 8455 | 2006-08-03 16:53:06 | Quantum measurements without sums | quant-ph math.CT math.LO math.QA | quant-ph/0608035 | 2006 | [1.2694358825683594, 1.7442468404769897, -4.39... | [9.383161544799805, 4.770181179046631] |
| 8456 | 2006-08-08 18:10:10 | POVMs and Naimark's theorem without sums | quant-ph math-ph math.CT math.MP math.QA | quant-ph/0608072 | 2006 | [1.1692596673965454, 2.0193357467651367, -4.15... | [11.293802261352539, 7.266186237335205] |

8457 rows × 7 columns
\n", "