.. _intro:

Introduction to Toponymy
========================

In the modern world, we are often faced with vast collections of information: large corpora of documents, extensive image libraries, or streams of user feedback. While powerful tools exist to embed this data into semantic vector spaces where similar items are close together, understanding the *structure* within these spaces remains a challenge. Simple clustering can group the data, but often leaves us with abstract groups identified only by numbers or lists of keywords.

How do we navigate this "information space" effectively? How do we understand *what* these discovered regions truly represent?

This is the core problem **Toponymy** aims to solve. Inspired by the geographic practice of naming places (*topos* 'place' + *onuma* 'name'), Toponymy seeks to place meaningful, human-readable names on the landmarks and regions within your data's embedding space. Instead of just knowing *that* a group of documents or images exists, Toponymy tells you *what* that group is about, using concise and descriptive names generated by Large Language Models (LLMs).

Why Use Toponymy?
-----------------

Traditional topic modeling techniques often rely on bag-of-words models, potentially missing nuances captured by semantic embeddings, and typically produce topics represented as lists of weighted keywords. While useful, interpreting these keyword lists and synthesizing a coherent understanding can be time-consuming and subjective.

Toponymy offers a different approach:

1. **Leverages Semantic Embeddings:** It works directly with modern vector embeddings (from text, images, etc.), capturing deeper semantic relationships than raw word counts.
2. **Provides Interpretable Names:** The primary output is not a list of keywords, but well-formed topic names (e.g., "Computer Hardware and Storage", "Space Exploration and Astronomy") generated by LLMs, making interpretation faster and more intuitive.
3. **Hierarchical Topic Discovery:** Data rarely has structure at only one level. Toponymy discovers topics at multiple resolutions, from broad, high-level themes down to fine-grained sub-topics, presenting them as a navigable hierarchy. This allows you to zoom in and out on the structure of your data.
4. **Flexibility and Scalability:** Designed to handle large datasets, Toponymy allows plugging in different clustering algorithms, keyphrase extraction techniques, embedding models, and LLM providers to suit your specific needs and infrastructure.

How It Works (Conceptually)
---------------------------

At its heart, Toponymy follows a multi-stage process:

1. **Embedding:** Assumes you have (or can generate) vector embeddings for your data items.
2. **Dimensionality Reduction:** Assumes you have reduced the dimensionality of the embeddings using techniques like UMAP or t-SNE to facilitate clustering and visualization.
3. **Multi-Scale Clustering:** Applies custom multiresolution clustering methods to identify groups of data points at various levels of granularity.
4. **Information Extraction:** For each cluster at each level, it identifies representative data points (exemplars), extracts relevant keyphrases, and identifies sub-topics.
5. **LLM-Powered Naming:** It synthesizes the exemplars, keyphrases, and potentially sub-topic names from finer levels, feeding them into an LLM with carefully crafted prompts to generate a concise, descriptive name for the cluster's topic.
6. **Refinement:** Includes steps to disambiguate similar topic names and ensure coherence across the hierarchy.

The result is a rich, multi-layered understanding of your dataset, complete with intuitive names that facilitate exploration and analysis.

Who Is This For?
----------------

Toponymy is designed for anyone needing to make sense of large, unstructured, or semi-structured datasets that can be represented via embeddings. This includes:

* **Data Scientists & Analysts:** Exploring large text corpora (customer reviews, news articles, scientific papers), image collections, or other embedded data to understand key themes.
* **Researchers:** Analyzing experimental results, survey responses, or literature collections.
* **Developers:** Building applications that require summarizing or navigating large amounts of content.

Next Steps
----------

Ready to start placing names on your information landmarks?

* Head to the :ref:`installation` guide to get Toponymy set up.
* Walk through the :doc:`basic_usage` tutorial to see a practical example in action.
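
Appendix: A Conceptual Sketch
-----------------------------

To make the multi-stage process from *How It Works* a little more concrete, the following toy sketch walks through multi-scale clustering, exemplar selection, and naming on six hand-placed 2D points. It is not Toponymy's API. Every function name here (``cluster_at_scale``, ``pick_exemplar``, ``name_cluster``, ``build_layers``) is hypothetical, a simple distance-threshold grouping stands in for Toponymy's custom multiresolution clustering, and ``name_cluster`` is a placeholder where a real pipeline would prompt an LLM.

```python
import math
from collections import defaultdict


def cluster_at_scale(points, radius):
    """Group point indices into connected components, linking any two
    points that lie within `radius` of each other (a toy stand-in for
    a real multiresolution clustering method)."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= radius:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(len(points)):
        groups[find(i)].append(i)
    return sorted(groups.values())


def pick_exemplar(points, members):
    """Return the index of the member closest to the cluster centroid."""
    dim = len(points[0])
    centroid = [sum(points[i][d] for i in members) / len(members) for d in range(dim)]
    return min(members, key=lambda i: math.dist(points[i], centroid))


def name_cluster(exemplar_text):
    """Placeholder for the LLM call: a real pipeline would send exemplars
    and keyphrases to an LLM and get back a concise topic name."""
    return f"Topic resembling {exemplar_text!r}"


def build_layers(points, texts, radii):
    """Cluster at each radius (fine to coarse) and attach a name to each cluster."""
    layers = []
    for radius in radii:
        layer = []
        for members in cluster_at_scale(points, radius):
            exemplar = pick_exemplar(points, members)
            layer.append({"members": members, "name": name_cluster(texts[exemplar])})
        layers.append(layer)
    return layers


# Hand-placed "reduced embeddings" and their source texts (illustrative data only).
points = [(0.0, 0.0), (0.5, 0.0), (3.0, 0.0), (3.5, 0.0), (20.0, 0.0), (20.5, 0.0)]
texts = ["hard drive failure", "SSD upgrade", "GPU drivers",
         "graphics card review", "Mars rover", "Moon landing"]

layers = build_layers(points, texts, radii=[1.0, 5.0])
for level, layer in enumerate(layers):
    for cluster in layer:
        print(level, cluster["members"], cluster["name"])
```

With these inputs the finer radius yields three small groups and the coarser radius merges the two hardware-related groups into one, which is the kind of zoomable hierarchy the introduction describes. The actual library handles keyphrase extraction, prompt construction, and name disambiguation, none of which this sketch attempts.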