Topic summaries and explanations

Toponymy can now generate more than a short human-readable topic name. By using the summary layer, each topic also receives a short summary paragraph and a more detailed explanation. This is useful when topic names are going to appear in a browser, report, search interface, or review workflow where a compact label alone is not quite enough context.

The summary pipeline follows the usual Toponymy flow: cluster documents, choose keyphrases and exemplars, use lower-level topics to help name higher-level topics, and finally store per-layer topic outputs. The main difference is that the LLM prompt asks for structured JSON containing a topic name, topic summary, topic explanation, and specificity score.

The short version

If you already have a working Toponymy script, the summary version usually requires just two changes:

Use ClusterLayerSummaryText as the layer_class.
Use SUMMARY_PROMPT_TEMPLATES as the prompt_template.

After fitting, the model has the usual topic_names_ and topic_name_vectors_ attributes, plus topic_summaries_ and topic_explanations_. The names, summaries, and explanations are each stored as a list of lists, with one inner list per cluster layer.
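
A minimal sketch of how these per-layer lists line up, using a stand-in object rather than a fitted model. The attribute names come from the paragraph above; the fake_model object, the topic_record helper, and all sample values are invented for illustration.

```python
from types import SimpleNamespace

# Stand-in for a fitted model: two layers with invented values.
fake_model = SimpleNamespace(
    topic_names_=[['Graph neural networks', 'Speech recognition'], ['Deep learning']],
    topic_summaries_=[['Summary A', 'Summary B'], ['Summary C']],
    topic_explanations_=[['Explanation A', 'Explanation B'], ['Explanation C']],
)


def topic_record(model, layer_index, topic_index):
    """Collect the name, summary, and explanation for one topic."""
    return {
        'name': model.topic_names_[layer_index][topic_index],
        'summary': model.topic_summaries_[layer_index][topic_index],
        'explanation': model.topic_explanations_[layer_index][topic_index],
    }


record = topic_record(fake_model, layer_index=0, topic_index=1)
print(record['name'])
```

The same layer index and topic index address all three lists, so a topic's name, summary, and explanation can always be read together.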

Data

This tutorial uses the small arXiv AI example data bundled with Toponymy. The helper functions below use generic filepaths: one function loads documents, one loads high-dimensional document vectors, and one loads the low-dimensional document map used for clustering. For your own data, keep the same function shape and replace the filepath constants with your storage layout.

[1]:

from pathlib import Path

import numpy as np
import pandas as pd

def find_base_dir() -> Path:
    candidates = [
        Path('../examples'),
        Path('examples'),
        Path.cwd() / 'examples',
        Path.cwd().parent / 'examples',
    ]
    for candidate in candidates:
        if (candidate / 'ai_arxiv_papers.zip').exists():
            return candidate
    raise FileNotFoundError('Could not find the bundled examples directory.')


BASE_DIR = find_base_dir()
DOCS_FILEPATH = BASE_DIR / 'ai_arxiv_papers.zip'
DOCUMENT_VECTORS_FILEPATH = BASE_DIR / 'ai_arxiv_vectors.npy'
DOCUMENT_MAP_FILEPATH = BASE_DIR / 'ai_arxiv_coordinates.npz.npy'
MODEL_OUTPUT_FILEPATH = Path('topic_summary_outputs.pkl')

[2]:

def load_docs(filepath: str | Path):
    docs_df = pd.read_csv(filepath)
    return (
        docs_df['title'].str.strip()
        + '\n\n'
        + docs_df['abstract'].str.strip()
    ).to_numpy()


def load_vecs(filepath: str | Path):
    return np.load(filepath)


def load_map(filepath: str | Path):
    return np.load(filepath)


documents = load_docs(DOCS_FILEPATH)
document_vectors = load_vecs(DOCUMENT_VECTORS_FILEPATH)
clusterable_vectors = load_map(DOCUMENT_MAP_FILEPATH)

assert len(documents) == document_vectors.shape[0]
assert len(documents) == clusterable_vectors.shape[0]

documents.shape, document_vectors.shape, clusterable_vectors.shape

[2]:

((10000,), (10000, 768), (10000, 2))
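
For your own data, a loader with the same shape might look like the sketch below. It assumes a JSON Lines file in which each record has title and abstract fields; the load_docs_jsonl name, the file layout, and the sample records are hypothetical and not part of Toponymy.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def load_docs_jsonl(filepath):
    """Read one JSON object per line and join its title and abstract."""
    docs = []
    with open(filepath, encoding='utf-8') as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            docs.append(record['title'].strip() + '\n\n' + record['abstract'].strip())
    return np.array(docs, dtype=object)


# Tiny demonstration with two invented records.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / 'docs.jsonl'
    demo_path.write_text(
        json.dumps({'title': ' Paper A ', 'abstract': 'Abstract A'}) + '\n'
        + json.dumps({'title': 'Paper B', 'abstract': ' Abstract B '}) + '\n',
        encoding='utf-8',
    )
    demo_docs = load_docs_jsonl(demo_path)

print(demo_docs.shape)
```

Whatever the storage format, the loader should return one text per row, in the same order as the rows of the document vectors and the document map.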

Model wrappers

The document vectors in the example above are already built. Toponymy still needs a text embedder, because it embeds keyphrases and generated topic names internally when it builds and relates topic layers. The embedder does not need to be the model that created the original document vectors, but it should represent the vocabulary of your corpus well.

The OpenAI-compatible wrapper below can work either with the public OpenAI API or with an OpenAI-compatible hosted endpoint. Set OPENAI_API_KEY as usual. If you use a hosted endpoint, set TOPONYMY_API_BASE_URL or OPENAI_BASE_URL. You can also set TOPONYMY_TEXT_EMBEDDING_MODEL and TOPONYMY_TOPIC_NAMING_MODEL to override the default model names.

[ ]:

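import os

from toponymy.embedding_wrappers import OpenAIEmbedder
from toponymy.llm_wrappers import LLMWrapper, OpenAINamer

# Model names can be overridden through environment variables.
TEXT_EMBEDDING_MODEL_NAME = os.environ.get(
    'TOPONYMY_TEXT_EMBEDDING_MODEL', 'llama-embed-nemotron'
)
TOPIC_NAMING_MODEL_NAME = os.environ.get(
    'TOPONYMY_TOPIC_NAMING_MODEL', 'nemotron-3-nano'
)
# Hosted endpoint, if any. The two variables are checked in this order;
# the result is None when neither is set.
API_BASE_URL = (
    os.environ.get('TOPONYMY_API_BASE_URL')
    or os.environ.get('OPENAI_BASE_URL')
)


def get_toponymy_embedder() -> OpenAIEmbedder:
    return OpenAIEmbedder(
        api_key=os.environ['OPENAI_API_KEY'],
        base_url=API_BASE_URL,
        model=TEXT_EMBEDDING_MODEL_NAME,
    )


def get_toponymy_llm() -> LLMWrapper:
    return OpenAINamer(
        api_key=os.environ['OPENAI_API_KEY'],
        base_url=API_BASE_URL,
        model=TOPIC_NAMING_MODEL_NAME,
    )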

Configuring the summary layer

The summary layer accepts the same kinds of controls as the ordinary text layer, but summaries tend to benefit from a little more context. Wrapping the class in functools.partial() passes the same parameters to every layer Toponymy builds while fitting the model. The parameters below ask for up to 32 keyphrases, exemplars, and subtopics per topic. That is a generous setting for a serious run; for a cheaper first pass, you can use the defaults.

The diversity alpha values control how strongly Toponymy tries to avoid near-duplicate context items. Larger values allow the diversify algorithm to converge to the appropriate value and select the correct number of keyphrases, exemplars or subtopics.

[4]:

from functools import partial

from toponymy.cluster_layer import ClusterLayerSummaryText


def make_layer_class():
    return partial(
        ClusterLayerSummaryText,
        n_keyphrases=32,
        keyphrase_diversify_alpha=3.0,
        n_exemplars=32,
        exemplars_diversify_alpha=3.0,
        n_subtopics=32,
        subtopic_diversify_alpha=3.0,
    )
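
To see what partial() does here without running Toponymy, the sketch below uses a hypothetical stand-in class. DemoLayer, its parameters, and its default values are invented for illustration and do not reflect the real ClusterLayerSummaryText signature.

```python
from functools import partial


class DemoLayer:
    """Hypothetical stand-in for a layer class; not part of Toponymy."""

    def __init__(self, cluster_labels, n_keyphrases=16, n_exemplars=8):
        self.cluster_labels = cluster_labels
        self.n_keyphrases = n_keyphrases
        self.n_exemplars = n_exemplars


# Pre-bind one keyword argument; the rest keep their defaults.
make_demo_layer = partial(DemoLayer, n_keyphrases=32)

# Whoever calls the partial later only supplies the remaining arguments.
layer = make_demo_layer(cluster_labels=[0, 1, 1])
print(layer.n_keyphrases, layer.n_exemplars)
```

The caller builds each layer as if it were using the plain class, while the pre-bound settings travel along with it.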

[5]:

from toponymy import ToponymyClusterer

def get_toponymy_clusterer() -> ToponymyClusterer:
    return ToponymyClusterer(
        min_clusters=4,
        base_min_cluster_size=20,
        max_layers=4,
        verbose=True,
    )

Fitting a model with summaries

The Toponymy object is almost identical to the naming-only version. The important summary-specific arguments are layer_class=make_layer_class() and prompt_template=SUMMARY_PROMPT_TEMPLATES.

The object_description and corpus_description matter more for summaries than they do for terse labels, because the LLM is writing paragraph-level prose. Make them concrete: say what one object is, what fields are included, and what larger corpus the objects came from.

[6]:

from toponymy import Toponymy
from toponymy.templates import SUMMARY_PROMPT_TEMPLATES


def run_toponymy() -> Toponymy:
    # ==================== Load models ======================================
    toponymy_embedder = get_toponymy_embedder()
    toponymy_llm = get_toponymy_llm()

    # ==================== Load documents and embeddings ====================
    documents = load_docs(DOCS_FILEPATH)
    document_vectors = load_vecs(DOCUMENT_VECTORS_FILEPATH)
    document_map = load_map(DOCUMENT_MAP_FILEPATH)

    assert len(documents) == document_vectors.shape[0] == document_map.shape[0]

    # ==================== Run Toponymy =====================================
    toponymy_clusterer = get_toponymy_clusterer()
    layer_class = make_layer_class()

    topic_model = Toponymy(
        llm_wrapper=toponymy_llm,
        text_embedding_model=toponymy_embedder,
        clusterer=toponymy_clusterer,
        layer_class=layer_class,
        prompt_template=SUMMARY_PROMPT_TEMPLATES,
        object_description=(
            'Title and abstract of an arXiv research paper'
        ),
        corpus_description=(
            'arXiv research papers on artificial intelligence'
        ),
    )

    topic_model.fit(
        documents,
        embedding_vectors=document_vectors,
        clusterable_vectors=document_map,
    )
    return topic_model

Running the next cell will call your embedding model for keyphrase/topic embeddings and your LLM for every topic in every layer. For a quick documentation read, it is fine to leave it unrun. For your own corpus, run it once your API credentials and model names are configured.

[7]:

topic_model = run_toponymy()

Layer 0 found 150 clusters
Layer 1 found 49 clusters
Layer 2 found 14 clusters
Layer 3 found 4 clusters

embedding texts: 100%|██████████| 521/521 [11:23<00:00,  1.31s/it]
embedding texts: 100%|██████████| 2/2 [00:02<00:00,  1.05s/it]
embedding texts: 100%|██████████| 1/1 [00:00<00:00,  1.09it/s]
embedding texts: 100%|██████████| 1/1 [00:00<00:00,  4.57it/s]
embedding texts: 100%|██████████| 1/1 [00:00<00:00, 10.97it/s]
Building topic names by layer: 100%|██████████| 4/4 [05:02<00:00, 75.74s/layer]
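
A rough way to size the LLM workload is to add up the cluster counts printed above. Because the LLM is called for every topic in every layer, the total number of topics is a lower bound on the number of naming requests. This sketch assumes one request per topic and does not count retries or any additional passes.

```python
# Cluster counts reported for layers 0 to 3 in the output above.
clusters_per_layer = [150, 49, 14, 4]

# One summary request per topic, as a lower bound.
total_topics = sum(clusters_per_layer)
print(total_topics)  # 217
```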

Saving names, summaries, and explanations

The summary outputs can be saved with the same core payload as a standard Toponymy run, with the addition of topic_summaries and topic_explanations. In a production script you may want to build partition-specific filenames, but the essential helper only needs a fitted model and a filepath.

[8]:

from pathlib import Path
import pickle

from toponymy import Toponymy


def save_toponymy_model(
    toponymy_model: Toponymy,
    filepath: str | Path,
) -> None:
    # ============ SAVE RESULTS ============
    topic_names = toponymy_model.topic_names_
    topic_summaries = toponymy_model.topic_summaries_
    topic_explanations = toponymy_model.topic_explanations_
    topic_name_vectors = toponymy_model.topic_name_vectors_
    cluster_layers = toponymy_model.cluster_layers_
    cluster_tree = toponymy_model.cluster_tree_
    keyphrase_list = toponymy_model.keyphrase_list_
    keyphrase_vectors = toponymy_model.keyphrase_vectors_

    filepath = Path(filepath)
    filepath.parent.mkdir(parents=True, exist_ok=True)

    payload = {
        'topic_names': topic_names,
        'topic_summaries': topic_summaries,
        'topic_explanations': topic_explanations,
        'topic_name_vectors': topic_name_vectors,
        'layer_cluster_labels': [
            layer.cluster_labels.tolist() for layer in cluster_layers
        ],
        'cluster_tree': cluster_tree,
        'keyphrase_list': keyphrase_list,
        'keyphrase_vectors': keyphrase_vectors,
    }

    with open(filepath, 'wb') as f:
        pickle.dump(payload, f)

[9]:

if topic_model is None:
    print('Fit the example model first to save summary outputs.')
else:
    save_toponymy_model(topic_model, MODEL_OUTPUT_FILEPATH)
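
Reading the saved outputs back only needs the standard library. The sketch below round-trips a tiny invented payload with the same key names as the helper above; load_toponymy_outputs is a hypothetical helper name, not a Toponymy function. Only unpickle files you trust, since loading a pickle can execute arbitrary code.

```python
import pickle
import tempfile
from pathlib import Path


def load_toponymy_outputs(filepath):
    """Load the payload dictionary written by the save helper."""
    with open(filepath, 'rb') as f:
        return pickle.load(f)


# Round-trip demonstration with an invented, much smaller payload.
demo_payload = {
    'topic_names': [['Topic A', 'Topic B'], ['Topic C']],
    'topic_summaries': [['Summary A', 'Summary B'], ['Summary C']],
    'topic_explanations': [['Explanation A', 'Explanation B'], ['Explanation C']],
}

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / 'demo_outputs.pkl'
    with open(demo_path, 'wb') as f:
        pickle.dump(demo_payload, f)
    restored = load_toponymy_outputs(demo_path)

print(restored['topic_summaries'][0][1])
```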

[12]:

import numpy as np


def _shorten(text, width=260):
    text = ' '.join(str(text).split())
    return text if len(text) <= width else text[: width - 3].rstrip() + '...'


def print_topic_summary_report(topic_model, topics_per_layer=3):
    if topic_model is None:
        print('Fit the example model first to print topic summary results.')
        return

    print('TOPONYMY SUMMARY MODEL')
    print('=' * 88)
    print(f'Layers: {len(topic_model.topic_names_)}')
    print(f'Documents: {topic_model.embedding_vectors_.shape[0]:,}')
    print(f'Document vectors: {topic_model.embedding_vectors_.shape}')
    print(f'Document map: {topic_model.clusterable_vectors_.shape}')
    print(f'Keyphrases: {len(topic_model.keyphrase_list_):,}')
    print('=' * 88)

    for layer_index, layer in enumerate(topic_model.cluster_layers_):
        topic_names = topic_model.topic_names_[layer_index]
        topic_summaries = topic_model.topic_summaries_[layer_index]
        topic_explanations = topic_model.topic_explanations_[layer_index]
        labels = layer.cluster_labels
        topic_sizes = np.bincount(labels[labels >= 0], minlength=len(topic_names))
        topic_order = np.argsort(topic_sizes)[::-1][:topics_per_layer]

        print(f'\nLayer {layer_index}: {len(topic_names):,} topics')
        print('-' * 88)
        for rank, topic_index in enumerate(topic_order, start=1):
            print(
                f'{rank}. Topic {topic_index} '
                f'({int(topic_sizes[topic_index]):,} documents): '
                f'{topic_names[topic_index]}'
            )
            print(f'   Summary: {_shorten(topic_summaries[topic_index], 240)}')
            print(f'   Explanation: {_shorten(topic_explanations[topic_index], 360)}')


print_topic_summary_report(topic_model)

TOPONYMY SUMMARY MODEL
========================================================================================
Layers: 4
Documents: 10,000
Document vectors: (10000, 768)
Document map: (10000, 2)
Keyphrases: 50,000
========================================================================================

Layer 0: 150 topics
----------------------------------------------------------------------------------------
1. Topic 2 (126 documents): AI Energy Efficiency and Cross-Domain Technical Analysis
   Summary: This group explores AI energy consumption measurement, cross-lingual transfer, video editing, discourse modeling, federated learning, and related computational systems through diverse technical lenses including efficiency, robustness, an...
   Explanation: The papers collectively investigate energy-efficient AI workloads, zero-shot cross-lingual NLP transfer, synthetic video editing frameworks, discourse-aware evaluation benchmarks, personalized federated learning with layer dropping, contextual bandit platforms for user experience optimization, knowledge graph query answering with bidirectional encoders, g...
2. Topic 58 (116 documents): Multi-modal AI Systems and Embodied Agents Research
   Summary: This group comprises studies on multi-modal artificial intelligence systems, focusing on the development of embodied agents that can perceive, interpret, and interact with complex environments using diverse sensory inputs and multimodal...
   Explanation: The research spans foundational architectures for multimodal perception-action coupling, agent-based reasoning in physical and virtual spaces, and the integration of symbolic and subsymbolic representations to enable human-like social cognition and task execution. Key themes include the use of neural-symbolic frameworks for explanation and reasoning, the...
3. Topic 43 (107 documents): Interpretable Machine Learning and Explainable AI with Constrained Clustering and Calibration
   Summary: This group explores interpretable machine learning techniques, focusing on constrained clustering frameworks, model calibration using explanations, and explainable reinforcement learning through causal models. It emphasizes theoretical g...
   Explanation: The group centers on developing interpretable AI systems through constrained clustering with theoretical guarantees, leveraging explanations for model calibration, and applying causal reasoning to explain reinforcement learning behavior. It integrates methods like SAT-based clustering, explanation-guided confidence adjustment, and hierarchical active infe...

Layer 1: 49 topics
----------------------------------------------------------------------------------------
1. Topic 43 (302 documents): AI-Driven Complex Systems Optimization Analysis
   Summary: This group focuses on AI-enhanced methods for modeling, optimization, and analysis across diverse domains including reinforcement learning, federated learning, causal inference, and data-driven decision-making, emphasizing scalability, i...
   Explanation: The group encompasses research on AI-driven optimization techniques applied to complex systems, covering neural program induction for algorithmic efficiency (e.g., sorting networks), uncertainty-aware learning, fairness-aware domain adaptation, and scalable AI methods for real-world challenges like smart grids, supply chains, and healthcare. It integrates...
2. Topic 30 (296 documents): AI Ethics and Adaptive Cruise Control for Autonomous Vehicles
   Summary: This group explores AI ethics, fairness, explainability, and multilingual data challenges across healthcare, military, and social media domains, while also researching safety verification for autonomous vehicle systems like adaptive crui...
   Explanation: The group's core focus spans two major, distinct subtopics: (1) AI safety and adaptive cruise control for autonomous vehicles, emphasizing formal safety verification, hybrid automaton frameworks, and reinforcement learning methods to prevent unsafe behavior; and (2) AI ethics and multilingual data for fairness and explainability, addressing interdisciplin...
3. Topic 28 (271 documents): Multimodal Deep Learning for Adaptive Real-World Systems and Intelligent Automation
   Summary: The group focuses on advanced deep learning techniques that integrate multimodal data to enhance decision-making, personalization, and efficiency across diverse real-world applications including scheduling, fleet management, stress detec...
   Explanation: The group's research spans multimodal deep learning methodologies—including reinforcement learning, contrastive learning, diffusion models, and multimodal fusion—to solve complex real-world challenges. Key themes include adaptive system design (e.g., job shop scheduling, fleet management), robust handling of noisy data and concept drift, personalized feed...

Layer 2: 14 topics
----------------------------------------------------------------------------------------
1. Topic 4 (768 documents): Neural Approximation and Active Learning Innovations
   Summary: This group focuses on advanced neural network approximation techniques for solving PDEs using Monte Carlo methods and neural networks, combined with active learning frameworks that leverage counter-examples to improve partial label learn...
   Explanation: The group's core work centers on neural functionals for PDE approximation, active learning with counter-example-driven strategies, and specialized neural architectures (COFENET, STFNNs) for domain-specific tasks. It bridges theoretical advances (sharpness-aware optimization, information bottlenecks) with practical applications in computational biology, tr...
2. Topic 5 (678 documents): Explainable Multimodal AI for Biomedical and Scientific Discovery
   Summary: The group focuses on developing explainable AI systems that integrate multimodal data across biomedical and scientific domains, emphasizing interpretability, knowledge integration, and domain-specific validation through techniques like a...
   Explanation: The group's core research revolves around explainable AI frameworks for multimodal systems, with major subtopics covering biomedical signal processing (e.g., neural decoding, adaptive deep learning for medical diagnostics), ethical AI deployment (e.g., fairness in healthcare and social media), and scientific discovery (e.g., integrating multimodal data wi...
3. Topic 13 (437 documents): Data-Driven AI Methods for Vision, Language, and Systems Modeling
   Summary: This group focuses on advancing data-driven AI techniques across diverse domains such as computer vision, natural language processing, reinforcement learning, and systems engineering. Key contributions include novel algorithms for effici...
   Explanation: The group encompasses research on data-driven AI methods applied to vision, language, and systems modeling, with major subtopics covering advanced representation learning, multimodal AI systems, scientific computing, and hierarchical knowledge graph embeddings. Minor subtopics include multimodal data analysis, causal inference, and privacy-preserving tech...

Layer 3: 4 topics
----------------------------------------------------------------------------------------
1. Topic 1 (3,741 documents): Multimodal AI Systems
   Summary: The group encompasses advanced research on multimodal deep learning techniques that integrate vision, language, sensor, and graph-structured data to build adaptive, explainable, and ethically aligned AI systems for complex real-world app...
   Explanation: The group's work unifies multimodal learning frameworks (e.g., contrastive learning, diffusion models, neural symbolic integration) with safety-critical applications (e.g., autonomous vehicles, healthcare diagnostics) and ethical AI governance. It spans technical innovations in representation learning (e.g., graph neural networks, latent variables), adapt...
2. Topic 0 (2,083 documents): Multi-modal AI Systems
   Summary: The group focuses on developing advanced multi-modal AI systems and embodied agents that integrate diverse sensory inputs, perform neural-symbolic reasoning, and operate in real-world contexts such as robotics and human-robot interaction...
   Explanation: The group encompasses foundational work on multi-modal AI architectures, embodied agent behavior design, and integration of symbolic and subsymbolic representations. It addresses critical challenges in AI safety, interpretability, and alignment through methods like superhuman performance certification, adversarial vulnerability detection, and conservative...
3. Topic 2 (768 documents): Neural Approximation and Active Learning Innovations
   Summary: This group focuses on advanced neural network approximation techniques for solving PDEs using Monte Carlo methods and neural networks, combined with active learning frameworks that leverage counter-examples to improve partial label learn...
   Explanation: The group's core work centers on neural functionals for PDE approximation, active learning with counter-example-driven strategies, and specialized neural architectures (COFENET, STFNNs) for domain-specific tasks. It bridges theoretical advances (sharpness-aware optimization, information bottlenecks) with practical applications in computational biology, tr...
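
For a browser, report, or review workflow, it is often easier to work with one row per topic than with nested lists. The sketch below flattens the three per-layer lists into a pandas DataFrame. The topics_to_frame helper and the sample values are invented for illustration; with a fitted model you would pass topic_names_, topic_summaries_, and topic_explanations_ instead.

```python
import pandas as pd


def topics_to_frame(topic_names, topic_summaries, topic_explanations):
    """Build one row per topic from lists of lists indexed by layer."""
    rows = []
    layers = zip(topic_names, topic_summaries, topic_explanations)
    for layer_index, (names, summaries, explanations) in enumerate(layers):
        topics = zip(names, summaries, explanations)
        for topic_index, (name, summary, explanation) in enumerate(topics):
            rows.append({
                'layer': layer_index,
                'topic': topic_index,
                'name': name,
                'summary': summary,
                'explanation': explanation,
            })
    return pd.DataFrame(rows)


# Invented sample values standing in for a fitted model's outputs.
frame = topics_to_frame(
    [['Topic A', 'Topic B'], ['Topic C']],
    [['Summary A', 'Summary B'], ['Summary C']],
    [['Explanation A', 'Explanation B'], ['Explanation C']],
)
print(frame[['layer', 'topic', 'name']])
```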

Practical tuning notes

Summaries are more expressive than names, but they also cost more output tokens. Start with a smaller partition, fewer layers, or smaller n_keyphrases, n_exemplars, and n_subtopics values while you are tuning. Once the generated summaries have the right level of detail, increase the context budget for the production run.

Good descriptions matter. A vague object_description such as 'documents' gives the LLM less guidance than 'research paper title and abstract' or 'support ticket subject, body, product area, and severity'. Likewise, a precise corpus_description helps the model choose language appropriate to the collection.

If you write a custom prompt template, keep the same structured output contract used by SUMMARY_PROMPT_TEMPLATES: the layer prompt should return JSON with topic_name, topic_summary, topic_explanation, and topic_specificity fields, and the extraction function should return the name, summary, and explanation together.
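
The output contract described above can be sketched as a small parsing function. This is not Toponymy's own extraction function, and the signature Toponymy expects for a custom extractor is not shown here; extract_topic_fields and the sample response are hypothetical. Only the four field names come from the text above.

```python
import json

REQUIRED_FIELDS = (
    'topic_name',
    'topic_summary',
    'topic_explanation',
    'topic_specificity',
)


def extract_topic_fields(llm_response):
    """Parse a JSON response and return (name, summary, explanation)."""
    payload = json.loads(llm_response)
    missing = [field for field in REQUIRED_FIELDS if field not in payload]
    if missing:
        raise ValueError(f'LLM response is missing fields: {missing}')
    return (
        payload['topic_name'],
        payload['topic_summary'],
        payload['topic_explanation'],
    )


# Invented response used only to exercise the parser.
sample_response = json.dumps({
    'topic_name': 'Federated learning',
    'topic_summary': 'Papers on training models across decentralized data.',
    'topic_explanation': 'Covers aggregation, privacy, and personalization.',
    'topic_specificity': 0.7,
})
name, summary, explanation = extract_topic_fields(sample_response)
print(name)
```

Checking for all four fields before returning makes a malformed response fail loudly instead of silently producing a topic with an empty summary.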