Getting Started with Toponymy
Toponymy is a library that can provide rich, well-named topics for large collections of vectorizable data. Primarily that means corpora of documents, making use of neural text embedding models, but it can extend to other modalities, which will be discussed in later tutorials. The aim of this tutorial is to walk you through the basic usage of Toponymy to get you started. Further tutorials, looking at the details of clustering, different LLMs, keyphrase extraction, other data modalities and more, will follow. For now let’s get Toponymy up and running.
To start we’ll need some basic libraries to allow us to get some data suitable for applying Toponymy.
[1]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
For a dataset we’ll be using the venerable 20-newsgroups dataset, a classic NLP (Natural Language Processing) dataset of posts to twenty different newsgroups from the 1990s. The dataset contains around twenty thousand posts on a wide variety of topics (despite being directed to particular named newsgroups, people are inclined to go off-topic at times). To make use of this data in Toponymy we will need to turn the newsgroup posts into vectors, and ideally produce a lower dimensional clusterable representation of those vectors. Toponymy tries to be agnostic to how this is done, so you can use whatever tools you wish. However, since vectorizing that much text can be computationally expensive (or just cost dollars if you are using an embedding service), and we want to get you up and running as fast as possible, let’s use a version of 20-newsgroups that comes complete with embedding vectors (built using all-mpnet-base-v2 from sentence-transformers) and a 2D representation we can use for clustering and plotting (built using UMAP).
[2]:
newsgroups_df = pd.read_parquet("hf://datasets/lmcinnes/20newsgroups_embedded/data/train-00000-of-00001.parquet")
We can get a sense of the data by looking at the first few rows:
[3]:
newsgroups_df.head()
[3]:
| | post | newsgroup | embedding | map |
|---|---|---|---|---|
| 0 | \n\nI am sure some bashers of Pens fans are pr... | rec.sport.hockey | [-0.04380008950829506, 0.08495834469795227, -0... | [-0.13199903070926666, 10.1972017288208] |
| 1 | My brother is in the market for a high-perform... | comp.sys.ibm.pc.hardware | [0.006855607498437166, -0.05531690642237663, -... | [11.03041934967041, 9.509867668151855] |
| 2 | \n\n\n\n\tFinally you said what you dream abou... | talk.politics.mideast | [0.01537406351417303, 0.03572937101125717, -0.... | [1.7360589504241943, -0.31686803698539734] |
| 3 | \nThink!\n\nIt's the SCSI card doing the DMA t... | comp.sys.ibm.pc.hardware | [0.010156078264117241, -0.07253803312778473, -... | [10.975887298583984, 10.715202331542969] |
| 4 | 1) I have an old Jasmine drive which I cann... | comp.sys.mac.hardware | [-0.008448092266917229, 0.06011670082807541, 0... | [10.498811721801758, 11.010639190673828] |
We have posts, which are the text content of the posts (with headers, footers, quotes and signatures stripped), the newsgroup the message was posted to, an embedding vector, and a 2D data map representation. This version of the dataset has slightly fewer than twenty thousand posts: a number of posts were so short (once quotes and signatures were stripped) that they barely had enough text to be worth embedding, and these have simply been removed.
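The tutorial does not state how the short posts were identified, so the following is only a sketch of one plausible approach for your own data: strip whitespace and drop anything under a character threshold. The dataframe contents and the `MIN_CHARS` value are made up for illustration.

```python
import pandas as pd

# Hypothetical toy corpus; the real dataset's filtering rule is not documented here.
toy_posts = pd.DataFrame(
    {
        "post": [
            "  ok\n",
            "Thanks!",
            "A longer post with enough text to be worth embedding.",
        ]
    }
)

MIN_CHARS = 20  # assumed threshold, chosen only for this example

# Measure length after stripping surrounding whitespace, then keep longer posts.
stripped_lengths = toy_posts["post"].str.strip().str.len()
filtered_posts = toy_posts[stripped_lengths >= MIN_CHARS].reset_index(drop=True)

print(len(toy_posts), "->", len(filtered_posts))  # 3 -> 1
```

If you filter your own corpus this way, apply the filter before embedding so the text, embedding vectors and data map all keep the same row order.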
For Toponymy we will want the embedding vectors and the clusterable vectors in numpy format (as opposed to a pandas Series of Python lists of floats), so let’s extract those from the dataframe. If you are interested in trying the whole process, the cell below contains the relevant code to generate the sentence embeddings and clusterable data map directly from the text: simply change the if False to if True to run that step yourself. Be warned, depending on the hardware available (e.g. if you have no GPU) this could be very time consuming.
[4]:
if False:
    # Optional: recompute the embeddings and the 2D data map from the raw text.
    from sentence_transformers import SentenceTransformer
    from umap import UMAP
    embedding_model = SentenceTransformer("all-mpnet-base-v2")
    embedding_vectors = embedding_model.encode(newsgroups_df["post"].tolist(), show_progress_bar=True)
    clusterable_vectors = UMAP(metric="cosine").fit_transform(embedding_vectors)
else:
    # Default: use the precomputed vectors that ship with the dataset.
    embedding_vectors = np.stack(newsgroups_df["embedding"].values)
    clusterable_vectors = np.stack(newsgroups_df["map"].values)
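As a quick sanity check on this conversion, the sketch below uses a small made-up dataframe (the column names mirror the tutorial's, the values are invented) to show how np.stack turns a column of per-row lists into a single 2D array, and confirms that both arrays have one row per post.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for newsgroups_df: three rows, 4-dimensional
# "embeddings" and 2-dimensional "map" coordinates (all values made up).
toy_df = pd.DataFrame(
    {
        "post": ["first post", "second post", "third post"],
        "embedding": [
            [0.1, 0.2, 0.3, 0.4],
            [0.5, 0.6, 0.7, 0.8],
            [0.9, 1.0, 1.1, 1.2],
        ],
        "map": [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]],
    }
)

# np.stack joins the per-row sequences along a new first axis,
# producing one row per document.
toy_embedding_vectors = np.stack(toy_df["embedding"].values)
toy_clusterable_vectors = np.stack(toy_df["map"].values)

print(toy_embedding_vectors.shape)    # (3, 4)
print(toy_clusterable_vectors.shape)  # (3, 2)

# Both arrays must line up row for row with the text.
assert toy_embedding_vectors.shape[0] == len(toy_df)
assert toy_clusterable_vectors.shape[0] == len(toy_df)
```

The same two assertions can be run against embedding_vectors, clusterable_vectors and newsgroups_df before fitting, since a row mismatch would pair documents with the wrong vectors.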
Running Toponymy
Now that we have some suitable data, and have extracted the relevant embedding vectors, let’s get started using Toponymy. We will need to import a few pieces to get things working. First we’ll need the Toponymy class that we can use to train a topic model. Alongside that we will also need to import a Clusterer and a KeyphraseBuilder. Clusterers are pluggable, and you can even write your own fairly easily; see the tutorials on clusterers for more details. The KeyphraseBuilder is used to extract potential keyphrases from the text of the corpus. Both of these are provided as separate classes because each has a number of configuration options unique to it, and we wanted to separate configuration of these tasks from the overall Toponymy class so as not to clutter the interface with a large array of options.
We will also need an LLM to distill out the final human-readable topic names. Toponymy provides wrappers around a number of LLMs, including LlamaCpp and HuggingFace for local models, and services via OpenAI, Anthropic, Cohere, and AzureAI. In this tutorial we’ll be using an Azure AI Foundry instance of a Cohere model, but you can substitute your preferred LLM provider. See the documentation of LLM wrappers for details on using your preferred LLM.
Lastly, since Toponymy does internal semantic similarity work with both keyphrases and topic names, we will need a text embedding model. Note that this does not have to be the same as the model used to create the text embedding for the documents. Either a sentence-transformer model, or a model from toponymy.embedding_wrappers that wraps various embedding services, will suffice. Here we’ll use a very light-weight sentence-transformer model to save time on the internal embedding work since
the quality of this embedding (as opposed to the embeddings of the documents) is less important.
[ ]:
from toponymy import Toponymy, ToponymyClusterer, KeyphraseBuilder
from toponymy.llm_wrappers import AzureAINamer
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("paraphrase-MiniLM-L3-v2")
azure_api_key = open("../azure_cohere_api_key.txt").read().strip()
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
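The cell above reads the API key from a text file next to the notebook. If you would rather keep keys out of the filesystem, one common alternative is an environment variable with the file as a fallback. The helper below is a sketch of that pattern; load_api_key and the variable name EXAMPLE_LLM_API_KEY are hypothetical names for this example and are not part of Toponymy.

```python
import os
from pathlib import Path


def load_api_key(env_var, fallback_path=None):
    """Return an API key from an environment variable, else from a file.

    Hypothetical helper for illustration only; not part of Toponymy.
    """
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    if fallback_path is not None and Path(fallback_path).is_file():
        return Path(fallback_path).read_text().strip()
    raise RuntimeError(
        f"No API key found: set {env_var} or provide a readable key file."
    )


# Demonstration with a dummy value so the example runs anywhere.
os.environ["EXAMPLE_LLM_API_KEY"] = "not-a-real-key"
print(load_api_key("EXAMPLE_LLM_API_KEY"))  # not-a-real-key
```

The returned string can then be passed wherever the tutorial uses azure_api_key.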
With the preliminaries out of the way, let’s create a Toponymy topic modeller. We do this with the Toponymy class, which takes a number of parameters. The primary things we will need are an llm_wrapper instantiated from toponymy.llm_wrappers, a text_embedding_model, a clusterer and a keyphrase_builder. The latter two can be instantiated with suitable parameters for your needs (see the documentation on those for more details), but more often than not the defaults will suffice.
To get more out of the topic modeller it can be helpful to provide an object_description and a corpus_description. These give the LLM context on what the individual documents are, and what kind of corpus they were drawn from, which can improve the final topic names produced. Lastly, in this particular case, it will be useful to provide exemplar_delimiters. If you just have short sentences or other clean text in your corpus the default delimiters will be fine. However, the newsgroup posts contain all manner of quoting, bullet lists, ASCII art, and other messy aspects, so it will be beneficial to provide some clear delimiters around example texts that will be passed to the LLM. Here we will use the tags "<EXAMPLE_POST>\n" and "\n</EXAMPLE_POST>\n\n" to denote the start and end of an example text.
[ ]:
topic_model = Toponymy(
    llm_wrapper=AzureAINamer(
        azure_api_key,
        endpoint="https://azureaitimcuse5821437469.services.ai.azure.com/models",
        model="Cohere-command-r-08-2024",
    ),
    text_embedding_model=embedding_model,
    clusterer=ToponymyClusterer(min_clusters=4, verbose=True),
    keyphrase_builder=KeyphraseBuilder(ngram_range=(1,6), max_features=15_000, verbose=True),
    object_description="newsgroup posts",
    corpus_description="20-newsgroups dataset",
    exemplar_delimiters=["<EXAMPLE_POST>\n","\n</EXAMPLE_POST>\n\n"],
)
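To see why explicit delimiters help with messy text, here is a purely illustrative sketch of wrapping example posts in the two tags. The wrap_exemplars function and the sample posts are invented for this example; how Toponymy actually assembles its prompts is not shown in this tutorial and may differ.

```python
def wrap_exemplars(texts, delimiters):
    """Wrap each text between a start and an end delimiter, then concatenate.

    Hypothetical helper: it only illustrates the idea of delimiting examples.
    """
    start, end = delimiters
    return "".join(f"{start}{text}{end}" for text in texts)


# Made-up posts containing quoting and a bullet list, the kind of content
# that makes it hard to tell where one example stops and the next begins.
example_posts = [
    "> quoted line\nMy reply, with a list:\n - item one\n - item two",
    "Short post.",
]
delimiters = ["<EXAMPLE_POST>\n", "\n</EXAMPLE_POST>\n\n"]

prompt_fragment = wrap_exemplars(example_posts, delimiters)
print(prompt_fragment)
```

With the tags in place, the boundary of each post is unambiguous even though the first post contains its own line breaks and list markers.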
Having instantiated the model, we now need to fit it to data. That means we need to provide the newsgroup posts, along with embedding_vectors and clusterable_vectors. Toponymy hews to a scikit-learn style API, with a fit method that takes the data to fit the model to (the text) along with other arguments, in our case those embedding_vectors and clusterable_vectors. This process will take some time as the model constructs various layers of clustering resolution, extracts relevant information for each cluster, and then uses LLM calls to provide topic names for all the clusters, making use of finer resolution clusters to help name larger clusters for which sampling text would not provide sufficient coverage.
[7]:
%%time
topic_model.fit(
    newsgroups_df["post"].str.strip().values,
    embedding_vectors=embedding_vectors,
    clusterable_vectors=clusterable_vectors
)
Layer 0 found 429 clusters
Layer 1 found 134 clusters
Layer 2 found 41 clusters
Layer 3 found 14 clusters
Layer 4 found 5 clusters
Building keyphrase matrix ...
Chunking into 1 chunks of size 20000 for keyphrase identification.
Combining count dictionaries ...
Found 15000 keyphrases.
Chunking into 1 chunks of size 20000 for keyphrase count construction.
Combining count matrix chunks ...
CPU times: user 36.5 s, sys: 2.87 s, total: 39.4 s
Wall time: 19min 4s
[7]:
<toponymy.toponymy.Toponymy at 0x3454af430>
Having fit the model we can start looking at the results. The first thing to look at is which topics were found in the dataset. The simplest way to see that is via the topic_names_ attribute. This is a list of lists of topic names at various resolution layers, with the first list holding the finest granularity topics, and the last list holding the highest level topics. Let’s look at the top-most topic names, the big picture topics that the topic model was able to extract:
[8]:
topic_model.topic_names_[-1]
[8]:
['Sports',
'PoliticsReligion',
'Vehicle Discussion',
'Computer Hardware',
'Computer Graphics']
So overall we see sports, a topic on religion and politics, vehicles, and computer-related topics split into hardware and graphics. Is this what we might expect as the big topics seen in the corpus? What are the newsgroups that the posts were from?
[9]:
newsgroups_df.newsgroup.unique().tolist()
[9]:
['rec.sport.hockey',
'comp.sys.ibm.pc.hardware',
'talk.politics.mideast',
'comp.sys.mac.hardware',
'sci.electronics',
'talk.religion.misc',
'sci.crypt',
'sci.med',
'alt.atheism',
'rec.motorcycles',
'rec.autos',
'comp.windows.x',
'comp.graphics',
'sci.space',
'talk.politics.guns',
'misc.forsale',
'rec.sport.baseball',
'talk.politics.misc',
'comp.os.ms-windows.misc',
'soc.religion.christian']
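One quick way to summarise this list is to count newsgroups by their top-level hierarchy (the part of the name before the first dot). The sketch below copies the twenty names from the output above; the grouping by prefix is just a convenience for this comparison and is not something Toponymy uses.

```python
from collections import Counter

# The twenty newsgroup names, copied from the output above.
newsgroups = [
    "rec.sport.hockey", "comp.sys.ibm.pc.hardware", "talk.politics.mideast",
    "comp.sys.mac.hardware", "sci.electronics", "talk.religion.misc",
    "sci.crypt", "sci.med", "alt.atheism", "rec.motorcycles", "rec.autos",
    "comp.windows.x", "comp.graphics", "sci.space", "talk.politics.guns",
    "misc.forsale", "rec.sport.baseball", "talk.politics.misc",
    "comp.os.ms-windows.misc", "soc.religion.christian",
]

# Count newsgroups per top-level hierarchy, e.g. "comp" or "rec".
prefix_counts = Counter(name.split(".")[0] for name in newsgroups)

for prefix, count in prefix_counts.most_common():
    print(f"{prefix}: {count}")
```

This gives five comp groups and four each for rec, sci and talk, with alt, misc and soc contributing one each.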
So we have a lot of newsgroups on religion and politics, a lot of newsgroups on computers (both hardware and software), two newsgroups on sports (hockey and baseball), two newsgroups on cars and motorcycles, and a few science newsgroups that aren’t showing up at the top level (likely because they are each on different specific topics, so represent smaller topics overall compared with these larger topics made up of multiple newsgroups each). Let’s go down to the next layer of topic resolution and see if we can pick up those other newsgroups, and possibly differentiate a little better among the larger topics like religion and politics.
[10]:
topic_model.topic_names_[-2]
[10]:
['Baseball Players',
'NHL Game Updates',
'Space Exploration Innovations',
'US Surveillance and Encryption',
'Medical Treatments and Health',
'Middle East Conflicts',
'Vehicle Discussion',
'Religion and Beliefs',
'Waco Siege Analysis',
'Political Scandals and Debate',
'Gun Control and Politics',
'Computer Graphics',
'Computer Hardware Storage',
'Computer Hardware Sales']
Now we see the science topics represented individually as ‘Space Exploration Innovations’, ‘US Surveillance and Encryption’, and ‘Medical Treatments and Health’. We also see several of the larger topics refine into more specific topics matching the newsgroups: baseball and hockey are split apart, and religion and politics break down into more specific topics matching many of the newsgroups (Middle East politics, gun control, more general political discourse, and a separate religion topic). Notably, we also see some of the topics splitting in ways not evident from the newsgroup names. For example, while there was a single newsgroup ‘talk.politics.guns’, we can see separate topics for gun control versus discussion of the Waco siege (which was a significant topic for gun rights advocates in the 1990s).
How fine-grained do the topics get? Let’s look at the 0th layer, which contains the most fine-grained topics. There will be far too many individual topics there to look at easily, so let’s look at the first ten and get at least a general idea of the level of specificity in topics we can expect to see.
[11]:
topic_model.topic_names_[0][:10]
[11]:
["Detroit Red Wings' Stanley Cup Octopus Tradition",
'Kirlian Photography and Aura Imaging',
'Sports Radio Stations and Teams',
'Ice Hockey Broadcasts and Schedules',
'Baseball Game Length and Pacing',
'Baseball Tickets and Schedules',
'Sports Team Mailing Lists',
'NHL Captain Trades and Trivia',
'Hacker Ethics and the Evolution of Computing',
'NHL Hockey Violence and Cheap Shots']
As you can see, the topics really get down into the weeds at the fine-grained level.
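Because topic_names_ is a plain list of lists, ordinary Python is enough to explore it. The sketch below uses a tiny stand-in list (a handful of names copied from the outputs above, with the finest layer first) so that it runs without a fitted model; layer_sizes and find_topics are hypothetical helper names for this example.

```python
# Stand-in for topic_model.topic_names_: finest layer first, coarsest last.
# The names are copied from outputs in this tutorial, heavily truncated.
toy_topic_names = [
    [
        "Detroit Red Wings' Stanley Cup Octopus Tradition",
        "Baseball Game Length and Pacing",
        "NHL Captain Trades and Trivia",
        "Kirlian Photography and Aura Imaging",
    ],
    ["Baseball Players", "NHL Game Updates", "Computer Graphics"],
    ["Sports", "Computer Graphics"],
]


def layer_sizes(topic_names):
    """Return the number of topics in each layer, finest layer first."""
    return [len(layer) for layer in topic_names]


def find_topics(topic_names, keyword):
    """Return (layer_index, name) pairs whose name contains the keyword."""
    keyword = keyword.lower()
    return [
        (layer_index, name)
        for layer_index, layer in enumerate(topic_names)
        for name in layer
        if keyword in name.lower()
    ]


print(layer_sizes(toy_topic_names))  # [4, 3, 2]
for layer_index, name in find_topics(toy_topic_names, "nhl"):
    print(layer_index, name)
```

On a fitted model you would pass topic_model.topic_names_ in place of toy_topic_names to see how many topics each layer holds and where a keyword shows up across resolutions.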
To see how these topics fit together, from top level to fine-grained, we can also look at the topic tree via the topic_tree_ attribute. This is actually a class that provides a few ways to view the topic tree, including pretty-printing, but the default representation as an expandable tree in HTML will suffice for viewing in a notebook like this.
[12]:
topic_model.topic_tree_
[12]:
-
Topic Tree
-
Sports
-
NHL Game Updates
-
NHL Game Results, Standings, and Playoff Updates
-
NHL Game Results, Standings, and Playoff Updates
- NHL Game Results and Player Statistics
- Ice Hockey Standings and Tiebreakers
- NHL Playoff Predictions and Results
-
-
NHL Hockey Game Analysis and Predictions
-
NHL Hockey Game Analysis and Predictions
- NHL Hockey Player Analysis and Debate
- Ice Hockey Game Analysis
- Ice Hockey Match Reports and Predictions
-
-
NHL Team Dynamics and Player Trades
-
NHL Team Dynamics and Player Trades
- NHL Team Management and Coaching Strategies
- NHL Teams and Player Trades
- NHL Coaching Changes and Team Dynamics
-
-
Don Cherry's Hockey Insights and Goalie Mask Analysis
- Don Cherry's Hockey Commentary
- NHL Goalie Mask Designs and Preferences
-
NHL European Presence and Fan Culture
- Ice Hockey and Fan Culture
- European Presence in NHL
-
NHL Team Relocations and Franchise Issues
- NHL Team Relocations and Franchise Issues
- NHL Hockey Violence and Cheap Shots
- NHL Playoffs: Sabres vs Bruins
- NHL Captain Trades and Trivia
-
-
Baseball Players
-
Baseball Player Stats
-
Baseball Player Performance Analysis
- Baseball Players and Statistics
- Baseball Statistics and Analysis
- Baseball Statistics and Player Performance
-
Baseball Player Performance Analysis
- Baseball Player Analysis and Comparisons
- Baseball Player Performance Analysis
-
-
Baseball Teams and Players
-
Baseball Teams, Players, and Stadiums
- Baseball Teams and Players
- Philadelphia Phillies Baseball Team
- Baseball Teams and Stadiums
-
Baseball Management and Player Performance
- Baseball Management and Player Performance
- Baseball Player Performance and Team Strategies
- Baseball Player Performance and Injuries
-
-
Baseball Umpiring Controversies
- Baseball Rules and Strategies
- Baseball Umpire Controversy
- Baseball Game Length and Pacing
- Baseball Team Discussions and Fan Interactions
- Baseball Player Injuries and Recovery
- Sports Stadiums and Schedules
- Baseball Player Comparisons
- Sports Team Mailing Lists
- Baseball Tickets and Schedules
- Jewish Baseball Players and Their Legacy
- Baseball Uniforms and Team Aesthetics
- MLB Standings and Scores
- Baseball Statistics and Updates
-
-
NHL Hockey Broadcasting and Coverage
-
NHL Hockey Broadcasting and Coverage
- Sports Radio Stations and Teams
- Ice Hockey Broadcasts and Schedules
- NHL Hockey Coverage and Team Performance
- Sports Broadcasting: ESPN's Hockey and Baseball Coverage
-
- Detroit Red Wings' Stanley Cup Octopus Tradition
-
-
PoliticsReligion
-
Gun Control and Politics
-
Gun Control and Constitutional Rights
-
Gun Control and Constitutional Rights
- Gun Control and Second Amendment Interpretation
- Gun Rights and NRA Advocacy
- Gun Control and Crime
- Gun Control and Self-Defense
- Gun Control and Second Amendment Rights
- Gun Control and Legislation
- Constitutional Debates
- Military Draft and Voluntary Service
- Gun Safety and Performance
- Gun Control and Self-Defense
-
-
US Politics and Government
-
Libertarianism and Government Regulation
- Libertarianism and Government Regulation
-
U.S. Government Fiscal Policy
- U.S. Government Fiscal Policy
- US Politics and Government
-
-
-
Religion and Beliefs
-
Theism vs Atheism
-
Theism, Atheism, and Religious Beliefs
- Theism and Atheism Discussions
- Religious Beliefs and Arrogance
- Christianity and Atheism: Beliefs, Experience, and Spirituality
-
Atheism vs Theism Debates
- Religion and Atheism Debate
- Theism vs Atheism Debate
- Atheism and Religious Beliefs
- Theism and Atheism Debate
- Theology and Science
- Christianity and Friendship
-
-
Christian Eschatology and Theology
-
Hell and Eschatology
- Theology and the Concept of Hell
- Biblical Interpretations of Satan and Heaven
- Christian Theology and Eschatology
-
Christianity and Spiritual Growth
- Christianity and Spiritual Growth
- Christian Theology and Trinity
-
-
Christianity: Textual Criticism
-
Christianity: Resurrection & Textual Criticism
- Christianity and the Resurrection
- Biblical Textual Criticism
-
Christian Prophecy and Scripture
- Prophecy and Scripture Interpretation
- Christianity and Speaking in Tongues
-
-
Christianity and LGBTQ+ Issues
-
Christianity and Religious Studies
- Christianity and Religious Studies
-
Catholic Liturgy and Traditions
- Catholic Liturgy and Traditions
-
Christianity, Homosexuality, and Biblical Interpretation
- Christianity and Homosexuality
- Christianity and Biblical Interpretation
-
LGBTQ+ and Religion
- LGBTQ+ and Religious Views
- Religion and Homosexuality
- Christianity and Faith
-
-
Rushdie, Islam, and Religious Controversies
- Rushdie and Islam: Fatwa, Apostasy, and Media
- Islamic Banking and BCCI Controversy
- Islam and Religious Debates
- Occult and Esoteric Orders
-
Religion in Public Institutions
- Discussion on National Motto and Religious References
- Religion in Public Schools
- Religion and Politics
- Atheism and Usenet FAQ
- Infant Baptism and Original Sin
- Religion and its Impact on History and Society
- Religious Music and its Impact on Faith
- Religion and Beliefs
- Christianity and Pacifism
- Christian Faith and Works Debate
-
-
Waco Siege Analysis
-
Waco Siege Analysis
-
Waco Siege: Government, ATF, and Cult Actions
- Waco Siege and Government Responsibility
- Waco Siege and ATF Raid
- Waco Siege and Branch Davidian Standoff
- Waco Siege and Government Actions
- Waco Siege Analysis
-
Waco Siege and Koresh's Cult
- Waco Siege and Koresh's Cult
- Waco Siege and David Koresh
-
-
Waco Siege Investigation and Analysis
-
Waco Siege Investigation and Analysis
- Waco Incident: Gas Effects and Safety
- Waco Siege Investigation
- Waco Siege and Davidian Cult Investigation
- Waco Siege Analysis
-
-
Legal Trials and Civil Rights Cases
- Legal Trials and Civil Rights Cases
-
-
Space Exploration Innovations
-
Space Exploration Innovations
-
Space Exploration and Astronomy
- Space Exploration and Astronomy
-
Space Exploration Innovations
- Space Exploration and its Impact
- Space Exploration and Lunar Power Systems
-
Astronomy and Space Exploration
- Astronomy and Space Exploration
-
Space Exploration Funding and Incentives
- Space Exploration Funding and Incentives
-
Space Advertising and Light Pollution
- Space Advertising and Light Pollution
-
Space Exploration and Launch Vehicles
- Space Launch Vehicles and Programs
- Space Exploration and Rocket Technology
-
Space Exploration and Mission Updates
- Space Exploration and Mission Updates
-
Gamma Ray Bursts and Origins
- Gamma Ray Bursts and their Origins
- Astronomical Phenomena and Software Bugs
-
Hubble Space Telescope Servicing and Reboost
- Hubble Space Telescope Servicing and Reboost
- Advertising and Commercial Use of Internet
- Spacecraft Command and Control
- Rocketry and Spaceflight
-
-
-
Political Scandals and Debate
-
Political Scandals and Debate
-
Online Debate and Political Scandals
- Political Scandals and University Controversies
- Insults and Sarcasm in Online Interactions
- Recreational Activities and Online Communities
- Personal Interactions and Nicknames
- Media and Entertainment Criticism
- Discussion and Debate Etiquette
-
Death Penalty and Punishment Debate
- Detroit Riots and Tank Usage
- Death Penalty and Cruel Punishment
- Death Penalty Debate
-
Online Free Speech and Censorship
- Online Free Speech and Censorship
-
LGBTQ+ Rights and Discrimination
- Gay Rights and Discrimination
- Homosexuality and Child Molestation
- LGBTQ+ Rights and Sexual Harassment
- Political Affairs and Current Events
- Political and Social Commentary
- Race and Crime in America
-
-
-
Medical Treatments and Health
-
Medical Treatments and Health
-
Medical Treatments and Alternative Debates
- Medical Conditions and Treatments
- Health and Medicine
- Alternative Medicine Debate
- Mental Health & Pharmaceuticals
-
MSG and Food Additive Effects
- MSG and Food Additives
- Dietary Seizure Triggers
- Weight Management and Dietary Habits
-
Candida Overgrowth and Allergy Remedies
- Nasal Allergy and Infection Remedies
- Yeast Infections and Candida Treatment
-
Medical Conditions and Treatments
- Medical Conditions and Treatments
- Medical Conditions and Treatments
- Skin Care and Treatment
- Healthcare Insurance Reform
- Healthcare and Medical Services
- Space and Altitude Medicine
- Medical Procedures and Hospital Recommendations
- Health and Sexuality
- Hepatitis and Liver Diseases
- Health Risks of Barbecuing and Grilling
- Acupuncture and Injection Safety
- Medical Aspects of Circumcision and Pregnancy
- Lyme Disease Diagnosis and Treatment
-
-
-
Middle East Conflicts
-
Middle East Conflicts
-
Bosnian Conflict: Ethnic, Religious, and International Dimensions
- Bosnian Conflict and Ethnic Identity
- Bosnian Conflict and International Response
-
Middle East Conflict
- Middle East Conflict
-
Arab-Israeli Conflict and Discrimination
- Arab-Israeli Relations and Discrimination
- Arab-Israeli Peace Negotiations
- Israeli-Palestinian Conflict
-
Middle East Conflict and Zionism
- Middle East Conflict and Zionism
- Nazi Ideology and its Impact
- Zionism and Jewish Identity
- Israeli-Palestinian Conflict and Human Rights
- Holocaust and Israeli-Palestinian Conflict
- Media Bias and Israeli-Palestinian Conflict
- Deir Yassin Massacre Debate
- US-Middle East Relations and Gulf War
-
-
-
Morality, Science, and Philosophy
-
Morality, Science, and Philosophy
- Philosophy of Science and Morality
- Philosophy of Science
- Objective Morality and Relativism Debate
- Quantum Physics and the Reality of Atoms
- Animal Behavior and Morality
- Moral Relativism and Objectivism Debate
-
-
Catholic Theology: Mary and Sinlessness
- Catholic Theology: Mary and Sinlessness
-
Mormon Marriage Ceremonies
- Mormonism and its Beliefs
- Marriage and Religious Ceremonies
-
Gay Rights and Sexual Behavior Debates
- Gay Promiscuity and Societal Attitudes
- Gay Rights March on Washington
- Gay and Heterosexual Statistics Debate
- Gay and Bisexual Male Sexual Behavior Studies
- Social Issues and Family Dynamics
- Environmental Skepticism and Policy
- Newsgroup Post Deletions and Snippets
- Religious Holidays and Etymology
- Weapons of Mass Destruction
- Christian Sabbath Debate
- Cults and Religion
- Firearms and Ammunition Safety
-
-
Vehicle Discussion
-
Vehicle Discussion
-
Motorcycle Safety and Laws
-
Motorcycle Safety and Road Etiquette
- Motorcycle Culture and Safety
- Road Safety and Motorcycle Riding Tips
- Bike Safety and Traffic Etiquette
- Travel and Road Trip Destinations
- Motorcycle Riding in Windy Conditions
-
Drunk Driving Laws and Court Proceedings
- Drunk Driving and Impaired Driving Laws
- Traffic Violations and Court Proceedings
- Motorcycle and Biker Movies
-
-
Vehicle Reviews and Sales
-
Car Reviews and Specifications
- Sports Car Reviews and Discussions
- Automotive Engineering and Performance
- SUV and Car Models
- Car Engine Specifications and Comparisons
-
Used Car Sales and Features
- Car Reviews and Maintenance
- Used Car Sales and Features
- Car Models and Maintenance
-
Motorcycle Enthusiasts and Racing Bike Recommendations
- Motorcycle Enthusiasts and Racing Bike Recommendations
-
-
Motorcycle Sales and BMW Maintenance
-
Motorcycle Sales and Accessories
- Automotive Enthusiasts and Clubs
- Motorcycle Sales and Accessories
-
BMW Motorcycle Modifications and Maintenance
- Motorcycle Enthusiasts and Modifications
- BMW Motorcycle Owners Association and Related Topics
- Motorcycle Maintenance and Modifications
-
-
Vehicle Care and Detailing
-
Cleaning and Solvent Solutions
- Cleaning and Solvent Solutions
-
Automotive Care and Detailing
- Automotive Care and Detailing
-
-
Car Maintenance and Monitoring
- Car Maintenance and Oil Change Procedures
- Vehicle Instrumentation and Monitoring
- Automotive Odometer and Speedometer Electronics
-
Automotive Insurance and Rates
- Automotive Insurance and Rates
-
Lead-Acid Battery Storage and Maintenance
- Lead-Acid Battery Storage and Maintenance
- Motorcycle Riding Techniques and Countersteering
- Circuit Board Manufacturing and Materials
- Engine Configuration and Design
- Motorcycle Shaft Drive and Wheelie Performance
- Automotive and Motorcycle Maintenance
- Biker Culture and Etiquette
- Vehicle Safety and Accident Prevention
- Automotive Enthusiasts and Technology
- Motorcycle Passenger Safety and Riding Tips
- Motorcycle Security and Locking Mechanisms
- Car Buying and Pricing Strategies
- Convertible Car Models and Recommendations
- Automotive Safety and Security Features
-
-
-
Computer Hardware
-
Computer Hardware Sales
-
Computer Hardware and Upgrades
-
Computer Hardware and Retail Experiences
- Computer Hardware and Retail Experiences
- Gateway and Leading Edge Computer Hardware Configurations
- Computer Hardware Components and Specifications
- Powerbook and Duo Computers
- Macintosh Portable Computers
-
Computer Processor Upgrades and Comparisons: 486 CPU Performance
- Computer Processor Upgrades and Comparisons
- 486 CPU Performance and Upgrades
-
Apple Mac Hardware and Expansion Cards
- Apple Mac Hardware and Expansion Cards
- Computer Hardware and Architecture
-
-
Computer Hardware Sales: Peripherals and Accessories
-
Computer Hardware Sales: Peripherals and Accessories
- Computer Hardware and Software Sales
- Computer Hardware and Software Sales
- Computer Hardware Sales
- Computer Hardware Sales: Printers and Scanners
-
-
Computer Monitor Compatibility and Recommendations
-
Computer Monitor Compatibility and Recommendations
- Video and Monitor Compatibility
- Apple and Mac Monitor Compatibility
- Computer Monitor Recommendations and Comparisons
-
-
Computer Graphics Hardware
-
Computer Graphics Hardware Benchmarking and Performance
- Computer Graphics Hardware and Benchmarking
- Computer Graphics Hardware
-
Computer Graphics Drivers and Troubleshooting
- Computer Graphics Drivers
- Computer Graphics Card Troubleshooting
- Computer Hardware and Software Compatibility
-
Apple Monitor Issues and Troubleshooting
- Apple Monitor Issues and Troubleshooting
-
-
Computer Memory and SIMM Compatibility
- Computer Memory and SIMM Compatibility
-
Apple Mac Hardware Upgrades
- Apple Mac Hardware Upgrades
-
Macintosh Hardware and Peripherals Sales
- Computer Hardware and Peripherals
- Computer Hardware and Monitor Sales
- Macintosh Hardware and Software Support
-
Computer Hardware Configuration: Serial Ports and Connections
- Computer Hardware Configuration
- Serial Port Connections and Null Modem Cables
- Computer Hardware and Software Troubleshooting: Date and Time Issues
- Computer Hardware and Accessories
- Computer Mouse Troubleshooting and Maintenance
- Computer Hardware and Performance
-
-
Computer Hardware Storage
-
Computer Hardware Storage
-
Computer Storage Devices and Media
- Floptical Drives and Storage Devices
- Computer Storage Devices
- Computer Hardware and Peripherals
-
Computer Hardware Troubleshooting and Repairs
- Computer Hardware Troubleshooting and Repairs
- Computer Hardware Troubleshooting
- Floppy Drive Troubleshooting and Issues
-
IDE Hard Drive Configuration and Troubleshooting
- IDE Hard Drive Configuration and Troubleshooting
-
Computer Hardware and Storage
- Computer Hardware and Storage
- Computer Hardware Troubleshooting
- Computer Hardware Configuration and Troubleshooting
- Computer Hardware and Storage Solutions
- Computer Hardware and Storage
- Macintosh and Unix Disk Compatibility
- Hard Drive Sales and Offers
-
-
-
Electronics Sales and Repairs
-
Audio Equipment Sales
- Audio Equipment Sales
- Music and Audio Equipment Sales
- Electronics and Audio Equipment Sales
-
Audio Electronics Sales and Repairs
- Audio Equipment and Electronics
- Electronics and Audio Equipment Sales
-
Sales of Electronics, Cameras, and Musical Gear
- Camera and Photography Equipment Sales
- Musical Instruments and Equipment for Sale
- Electronics and Equipment Sales
- Electronics and Gadgets for Sale
-
Computer Power Management and Energy Efficiency
- Computer Power Management and Energy Efficiency
- TV Repair and Troubleshooting
- Audio Engineering and Equipment
-
-
Modem and Fax Sales
-
Modem and Fax Sales
- Modem and Fax Sales
-
Computer Hardware and Peripherals
- Computer Hardware and Peripherals
- Macintosh Audio and Hardware
-
-
Computer Graphics and Display Modes
- Computer Graphics and Display Modes
- Computer Hardware Troubleshooting and Technical Support
-
-
Computer Graphics
-
Computer Graphics
-
DOS and Windows Compatibility
-
DOS 6 Compatibility and Compression Issues
- DOS and Windows Compatibility Issues
- DOS 6 and Disk Compression Issues
-
Windows 3.1 Memory Management and Performance Troubleshooting
- Computer Troubleshooting: DOS and Windows Compatibility Issues
- Windows 3.1 Memory Management and Performance Issues
-
-
X-Windows and GUI
-
X-Windows Performance and Remote Access
- X-Windows and Remote Access
- X11R5 and X11R6 X Server Performance and Features
-
X-Windows Software and Tools
- X-Windows and GUI Development
- Computer Software and Operating Systems
- GUI and X Window System Tools
-
-
X11 and Motif Troubleshooting
-
X11 and Motif Application Troubleshooting
- X11/Xlib Widget Programming and Troubleshooting
- X11 and Motif Application Issues
- X11 Display and Window Management Issues
- X11 and XDMCP Troubleshooting
-
X Window System: Management and Programming
- X11 Window Management and Event Handling
- X Window System Programming
- Window Manager Positioning and Decoration
-
X11R5 Compilation and Linking Errors
- X11R5 Compilation Issues
- Software Compilation and Linking Errors
- X11R5 Color Management Issues
- X11R5 Software Distribution and Compatibility
- X Window System Graphics and Colormaps
-
-
Microsoft OS and Workstation Software
-
Microsoft OS and Workstation Software
- Operating Systems and Workstations
- Computer Software and Operating Systems
- Microsoft Software and User Experience
-
-
Computer Graphics and Imaging
-
Image File Conversion and Viewing
- Graphics and Image File Formats
- Image File Conversion and Viewing
- Graphics File Formats and Conversion
-
Computer Software, Utilities, and Networking Tools
- Windows Software and Utilities
- PostScript and Ghostscript Tools
- Computer Networking and Operating Systems
-
Computer Graphics Software and CAD
- Computer Graphics and CAD Software
- Computer Graphics Software and Animation
-
Computer Graphics and Visualization: Imagery and Distribution
- Space and Earth Imagery Copyright and Distribution
- Computer Graphics and Visualization
-
Image File Formats, Viewers, and Computer Graphics
- Image File Formats and Viewers
- Computer Graphics and Animation
- Computer Graphics Software and Tools
- Image File Conversion and Visualization
-
-
Motif Widgets Development and Software
- Motif and Widgets Development
- Widget Software and Development
-
3D Polygon and Sphere Algorithms
- 3D Polygon Algorithms and Rendering
- Computer Graphics: Spline and Curve Algorithms
- Sphere and Radius Calculation Algorithms
-
XV Software Licensing and Copyright Issues
- XV Software Licensing and Copyright
- X Window System Graphics and Text Display
-
Computer Graphics and FTP Resources
- Computer Graphics and FTP Resources
-
Windows Customization and Management
- Windows File Management and Customization
- Windows Program Management Issues
- Windows Program Icon Customization and Shortcuts
- Computer Software and Database Management
- Software and File Sharing Recommendations
- Computer Software and Hardware
-
-
-
US Surveillance and Encryption
-
US Surveillance and Encryption
-
NSA Surveillance and Encryption Policies
- Government Surveillance and Encryption Policies
- Government Surveillance and Encryption Policy
- NSA Surveillance and Encryption Debate
-
Government Surveillance and Encryption Policies
- Government Surveillance and Encryption Policies
- Escrow and Surveillance Debate
- Clipper Chip Wiretapping and Privacy Concerns
- Law Enforcement and Encryption
-
DES Encryption and Key Search Attacks: Cracking and Source Code
- DES Encryption and Key Search Attacks
- Encryption and Source Code Access
-
Clipper Chip Encryption and Skipjack Algorithm Security
- Clipper Chip Encryption and Security
- Skipjack Encryption Algorithm and Chip Security
-
Douglas Adams' 'Hitchhiker's Guide' and Economic Theories
- Douglas Adams' 'Hitchhiker's Guide to the Galaxy' References and Theories
- Economics and Global Affairs
- Cryptography and Encryption Techniques
- US Politics and Government Surveillance
- US Politics: Privacy and Surveillance
- Wiretapping and Privacy Concerns
- Cryptography and Secure Communication
-
-
-
Armenian-Turkish Conflict
-
Armenian Genocide and Turkish-Armenian Conflict
- Armenian Genocide
- Armenian Genocide and Turkish-Armenian Conflict
-
Greek-Turkish Conflict and Human Rights
- Greek-Turkish Conflict and Human Rights
-
-
Phone Services and Diagnostics: Dialing and Troubleshooting
-
Phone Services and Diagnostics: Dialing and Troubleshooting
- Electrical Grounding and Wiring Safety
- International Phone Compatibility and Dialing Systems
- Phone Services and Numbers
- Telephone Line Diagnostics and Troubleshooting
- Business Contact Information and Technical Support
-
-
Computer Communication & Mailing Lists
-
comp.graphics.organization
- comp.graphics.organization
-
FAQ and Mailing List Management
- FAQ Requests and Discussions
- Mailing List Management and Subscription Requests
-
Computer Science Education and Correspondence
- Education and University Programs
- Computer Science and Technology Discussions
- Computer and Internet Communication
- Email and Correspondence
-
-
Printer Performance and Postscript
-
Printer Performance, Drivers, and Postscript
- Printer Quality and Performance
- Printer Driver Compatibility and Updates
- Printer and Postscript Issues
-
Windows Font Management and Display Issues
- Windows Font and Character Display Issues
- Font Management and Printing Technologies
-
-
Radio Interference & Surveillance
-
Radio Interference & Surveillance
- Ham Radio Interference and Troubleshooting
- Surveillance and Electromagnetic Emissions
- Radio Frequency Transmitters and Receivers
-
-
Drug Legalization Debate
- Drug Legalization Debate
-
Dog-related motorcycle incidents and safety
- Dog-related motorcycle incidents and safety
-
Academic Book Sales: Computer Science & Math
- Academic Book Sales
- Book Sales and Publications
-
Software Piracy and Protection Measures
- Windows Registration and File Editing
- Software Piracy and Copy Protection Measures
-
Sales and Rentals: Travel, Real Estate, and Furnishings
- Real Estate and Home Furnishing Sales
- Travel and Airline Ticket Sales
- Summer Housing and Sublets
-
Car Transmission Types and Preferences
- Car Transmission Types and Preferences
-
Radar Detectors and Driving Safety
- Radar Detectors and Driving Safety
-
Video Game Sales and Trading
- Video Game Sales and Trading
-
CPU Cooling and Fan Solutions
- CPU Cooling and Fan Solutions
-
Speech Compression and Real-Time Encryption
- Speech Compression and Real-Time Encryption
-
Media Sales and Exchanges
- Music Media Sales
- Comic Book Sales and Auctions
- VHS Movie Sales and Exchanges
- Kirlian Photography and Aura Imaging
- Hacker Ethics and the Evolution of Computing
- Electrical Engineering and Units
- CView Software Issues and Workarounds
- Discussion on Testing and Tricks
- Computer Joystick Hardware and Software
- Motorcycle Helmet Safety and Care
- Nuclear Power Plant Cooling Systems
- Computer Hardware Sales: Memory Modules
- Blue LED Technology and Availability
- UV Lighting and Portable Sources
- Image File Formats and TIFF Complexity
- Electronics and Component Procurement
- Video and Image Processing
- Scanner Technology and Recommendations
- High-Speed Analog-Digital Conversion and DSP Applications
-
Click on “Topic Tree” to start expanding, and then click on any topic that has a triangle marker to expand it further. Note that not every topic from the more fine-grained layers has a parent at the top level, so the first expansion of the tree contains many topics. Each topic's resolution layer is indicated by its font weight (bolder fonts represent higher-level topics).
Plotting an Interactive Document Data Map
Ultimately one wants to see which documents are related to which topics. That is accessible directly through the topic_name_vectors_ attribute of the topic model, which provides a list of arrays, one per layer of topic resolution, each of length equal to the number of documents in the corpus, with values taken from the topic names. The nth entry of the ith array is the topic name assigned to the nth document at the ith layer of topic resolution. This provides a direct document-to-topic mapping, which is quite sufficient for any programmatic analysis of documents and topics. It is, however, a little unwieldy to parse through by eye. A better approach is to use the 2D representation of the documents to plot a data map of all the documents, and then overlay the layers of topics on top of that. Fortunately, there is a library called datamapplot that can handle all of the heavy lifting for us. We just need to import the library:
[13]:
import datamapplot
import datamapplot.selection_handlers
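Before moving on to plotting, here is a minimal sketch of the kind of programmatic analysis the document-to-topic mapping allows. It uses a small hand-made stand-in for topic_model.topic_name_vectors_ so that it runs on its own; the toy values, the "Unlabelled" placeholder for unclustered documents, and the helper names documents_per_topic and topics_for_document are all assumptions made for illustration, not part of Toponymy.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for topic_model.topic_name_vectors_: one array per
# layer of topic resolution, each holding one topic name per document.
topic_name_vectors = [
    np.array([
        "X11R5 Compilation Issues",
        "Drug Legalization Debate",
        "Unlabelled",  # assumed placeholder for a document with no topic
        "X11R5 Compilation Issues",
    ]),
    np.array([
        "X11 and Motif Troubleshooting",
        "Unlabelled",
        "Unlabelled",
        "X11 and Motif Troubleshooting",
    ]),
]


def documents_per_topic(name_vectors, layer):
    """Count how many documents carry each topic name at one layer."""
    return pd.Series(name_vectors[layer]).value_counts()


def topics_for_document(name_vectors, doc_index):
    """Return the topic name of one document at every layer, in layer order."""
    return [str(layer_names[doc_index]) for layer_names in name_vectors]


counts = documents_per_topic(topic_name_vectors, layer=0)
doc_topics = topics_for_document(topic_name_vectors, doc_index=0)
print(counts)
print(doc_topics)
```

With a fitted model, passing topic_model.topic_name_vectors_ in place of the toy list should give the same kind of summary for the real corpus.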
To create an interactive plot we can view directly in the notebook (or export to an HTML file to share with others) we simply need to use the create_interactive_plot function. The main parameters we need for that are the 2D coordinates of the data map, and then some number of layers of cluster names. The required format of those layers of cluster names is … exactly the format of the topic_name_vectors_ attribute (expanded to a set of arguments via the * expansion of lists). That means we already have everything we need. From there datamapplot provides a number of extra tools for providing hover tooltips (we’ll pass in the content of the newsgroup post), tweaking aesthetics, adding text search and alternate colormaps, and even allowing for lasso selections that trigger events (we’ll create a word cloud from the selected newsgroup posts). We won’t go into all the details of those parameters here, but please check the datamapplot documentation for more details.
[14]:
plot = datamapplot.create_interactive_plot(
    clusterable_vectors,               # 2D coordinates of the data map
    *topic_model.topic_name_vectors_,  # one array of topic names per layer
    title="20-Newsgroups",
    sub_title="A data map of 20-newsgroups using all-mpnet-base-v2, Toponymy, Cohere and UMAP",
    hover_text=newsgroups_df["post"].values,  # tooltip shows the post content
    font_family="Cormorant SC",
    # Scale each marker by the log of the post length
    marker_size_array=np.asarray([np.log(len(x)) for x in newsgroups_df["post"].values]),
    colormaps={"newsgroup": pd.Series(newsgroups_df["newsgroup"].values)},
    cluster_layer_colormaps=True,
    enable_search=True,
    # Lasso-selecting points builds a word cloud from the selected posts
    selection_handler=datamapplot.selection_handlers.WordCloud(height=300),
)
plot
[14]:
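The marker_size_array argument in the call above scales each point by the log of the post length. One thing to be aware of is that np.log(0) evaluates to -inf, so an empty post (if the data contained one) would get a non-finite size. The following is a small sketch of a guarded variant using np.log1p; the helper name log_marker_sizes is hypothetical, and swapping log for log1p is a suggestion rather than what the notebook does.

```python
import numpy as np


def log_marker_sizes(texts):
    """Return one marker size per text, growing with log(1 + length)."""
    lengths = np.asarray([len(text) for text in texts], dtype=float)
    # log1p(0) is 0.0, so an empty text gets size zero rather than -inf.
    return np.log1p(lengths)


sizes = log_marker_sizes(["", "short post", "a much longer newsgroup post " * 10])
print(sizes)
```

The resulting array could be passed as marker_size_array in place of the list comprehension; adding a small constant to it would keep very short posts visible.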