{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "11f252c2",
   "metadata": {},
   "source": [
    "# Topic summaries and explanations\n",
    "\n",
    "Toponymy can generate more than a short human-readable topic name. With the summary layer, each topic also receives a short summary paragraph and a more detailed explanation. This is useful when topics appear in a browser, report, search interface, or review workflow where a compact label alone does not give enough context.\n",
    "\n",
    "The summary pipeline follows the usual Toponymy flow: cluster documents, choose keyphrases and exemplars, use lower-level topics to help name higher-level topics, and finally store per-layer topic outputs. The main difference is that the LLM prompt asks for structured JSON containing a topic name, a topic summary, a topic explanation, and a specificity score."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "006550e2",
   "metadata": {},
   "source": [
    "## The short version\n",
    "\n",
    "If you already have a working Toponymy script, the summary version usually requires just two changes:\n",
    "\n",
    "1. Use ``ClusterLayerSummaryText`` as the ``layer_class``.\n",
    "2. Use ``SUMMARY_PROMPT_TEMPLATES`` as the ``prompt_template``.\n",
    "\n",
    "After fitting, the model has the usual ``topic_names_`` and ``topic_name_vectors_`` attributes, plus ``topic_summaries_`` and ``topic_explanations_``. The names, summaries, and explanations are each stored as a list of lists: one inner list per cluster layer, with one entry per topic in that layer."
   ]
  },
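  {
   "cell_type": "markdown",
   "id": "b7e41a90",
   "metadata": {},
   "source": [
    "As a rough illustration of that layout, the sketch below walks two nested lists shaped like the fitted attributes. The values and the ``iter_topics`` helper are hypothetical stand-ins, not real model output or part of Toponymy; with a fitted model you would pass ``topic_model.topic_names_`` and ``topic_model.topic_summaries_`` instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c2d85f13",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical stand-in values that mimic the layout of the fitted\n",
    "# attributes: one inner list per layer, one entry per topic in that layer.\n",
    "example_topic_names = [\n",
    "    ['Graph neural networks', 'Speech recognition', 'Robot navigation'],\n",
    "    ['Deep learning', 'Robotics'],\n",
    "]\n",
    "example_topic_summaries = [\n",
    "    ['Summary for layer 0, topic 0.', 'Summary for layer 0, topic 1.',\n",
    "     'Summary for layer 0, topic 2.'],\n",
    "    ['Summary for layer 1, topic 0.', 'Summary for layer 1, topic 1.'],\n",
    "]\n",
    "\n",
    "\n",
    "def iter_topics(topic_names, topic_summaries):\n",
    "    # Yield (layer_index, topic_index, name, summary) for every topic.\n",
    "    for layer_index, (names, summaries) in enumerate(\n",
    "        zip(topic_names, topic_summaries)\n",
    "    ):\n",
    "        for topic_index, (name, summary) in enumerate(zip(names, summaries)):\n",
    "            yield layer_index, topic_index, name, summary\n",
    "\n",
    "\n",
    "for layer_index, topic_index, name, summary in iter_topics(\n",
    "    example_topic_names, example_topic_summaries\n",
    "):\n",
    "    print(f'layer {layer_index}, topic {topic_index}: {name} | {summary}')"
   ]
  },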
  {
   "cell_type": "markdown",
   "id": "5931a097",
   "metadata": {},
   "source": [
    "## Data\n",
    "\n",
    "This tutorial uses the small arXiv AI example data bundled with Toponymy. The helper functions below use generic filepaths: one function loads documents, one loads high-dimensional document vectors, and one loads the low-dimensional document map used for clustering. For your own data, keep the same function shape and replace the filepath constants with your storage layout."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "99268490",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "def find_base_dir() -> Path:\n",
    "    candidates = [\n",
    "        Path('../examples'),\n",
    "        Path('examples'),\n",
    "        Path.cwd() / 'examples',\n",
    "        Path.cwd().parent / 'examples',\n",
    "    ]\n",
    "    for candidate in candidates:\n",
    "        if (candidate / 'ai_arxiv_papers.zip').exists():\n",
    "            return candidate\n",
    "    raise FileNotFoundError('Could not find the bundled examples directory.')\n",
    "\n",
    "\n",
    "BASE_DIR = find_base_dir()\n",
    "DOCS_FILEPATH = BASE_DIR / 'ai_arxiv_papers.zip'\n",
    "DOCUMENT_VECTORS_FILEPATH = BASE_DIR / 'ai_arxiv_vectors.npy'\n",
    "DOCUMENT_MAP_FILEPATH = BASE_DIR / 'ai_arxiv_coordinates.npz.npy'\n",
    "MODEL_OUTPUT_FILEPATH = Path('topic_summary_outputs.pkl')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "06393ee7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((10000,), (10000, 768), (10000, 2))"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def load_docs(filepath: str | Path):\n",
    "    docs_df = pd.read_csv(filepath)\n",
    "    return (\n",
    "        docs_df['title'].str.strip()\n",
    "        + '\\n\\n'\n",
    "        + docs_df['abstract'].str.strip()\n",
    "    ).to_numpy()\n",
    "\n",
    "\n",
    "def load_vecs(filepath: str | Path):\n",
    "    return np.load(filepath)\n",
    "\n",
    "\n",
    "def load_map(filepath: str | Path):\n",
    "    return np.load(filepath)\n",
    "\n",
    "\n",
    "documents = load_docs(DOCS_FILEPATH)\n",
    "document_vectors = load_vecs(DOCUMENT_VECTORS_FILEPATH)\n",
    "clusterable_vectors = load_map(DOCUMENT_MAP_FILEPATH)\n",
    "\n",
    "assert len(documents) == document_vectors.shape[0]\n",
    "assert len(documents) == clusterable_vectors.shape[0]\n",
    "\n",
    "documents.shape, document_vectors.shape, clusterable_vectors.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "daaba9b7",
   "metadata": {},
   "source": [
    "## Model wrappers\n",
    "\n",
    "The document vectors in the example above are precomputed. Toponymy still needs a text embedder, because it embeds keyphrases and generated topic names internally when it builds and relates topic layers. The embedder does not need to be the model that created the original document vectors, but it should represent the vocabulary of your corpus well.\n",
    "\n",
    "The OpenAI-compatible wrapper below can work either with the public OpenAI API or with an OpenAI-compatible hosted endpoint. Set ``OPENAI_API_KEY`` as usual. If you use a hosted endpoint, set ``TOPONYMY_API_BASE_URL`` or ``OPENAI_BASE_URL``. You can also set ``TOPONYMY_TEXT_EMBEDDING_MODEL`` and ``TOPONYMY_TOPIC_NAMING_MODEL`` to override the default model names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0ea17bf2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "from toponymy.embedding_wrappers import OpenAIEmbedder\n",
    "from toponymy.llm_wrappers import LLMWrapper, OpenAINamer\n",
    "\n",
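    "# Model names can be overridden through environment variables.\n",
    "TEXT_EMBEDDING_MODEL_NAME = os.environ.get(\n",
    "    'TOPONYMY_TEXT_EMBEDDING_MODEL', 'llama-embed-nemotron'\n",
    ")\n",
    "TOPIC_NAMING_MODEL_NAME = os.environ.get(\n",
    "    'TOPONYMY_TOPIC_NAMING_MODEL', 'nemotron-3-nano'\n",
    ")\n",
    "# Hosted endpoints set one of these variables. If neither is set this is\n",
    "# None, which is assumed to select the client's default endpoint.\n",
    "API_BASE_URL = (\n",
    "    os.environ.get('TOPONYMY_API_BASE_URL')\n",
    "    or os.environ.get('OPENAI_BASE_URL')\n",
    ")\n",
    "\n",
    "\n",
    "def get_toponymy_embedder() -> OpenAIEmbedder:\n",
    "    return OpenAIEmbedder(\n",
    "        api_key=os.environ['OPENAI_API_KEY'],\n",
    "        base_url=API_BASE_URL,\n",
    "        model=TEXT_EMBEDDING_MODEL_NAME,\n",
    "    )\n",
    "\n",
    "\n",
    "def get_toponymy_llm() -> LLMWrapper:\n",
    "    return OpenAINamer(\n",
    "        api_key=os.environ['OPENAI_API_KEY'],\n",
    "        base_url=API_BASE_URL,\n",
    "        model=TOPIC_NAMING_MODEL_NAME,\n",
    "    )"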
   ]
  },
  {
   "cell_type": "markdown",
   "id": "21765ca2",
   "metadata": {},
   "source": [
    "## Configuring the summary layer\n",
    "\n",
    "The summary layer accepts the same kinds of controls as the ordinary text layer, but summaries tend to benefit from a little more context. Wrapping the class in ``functools.partial`` passes the same parameters to every layer that Toponymy builds while fitting the model. The parameters below ask for up to 32 keyphrases, 32 exemplars, and 32 subtopics per topic. That is a generous setting for a serious run; for a cheaper first pass, you can use the defaults.\n",
    "\n",
    "The ``*_diversify_alpha`` values control how strongly Toponymy tries to avoid near-duplicate context items. Larger values, such as the ``3.0`` used below, give the diversification step room to converge and still select the requested number of keyphrases, exemplars, or subtopics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "1d9e48e3",
   "metadata": {},
   "outputs": [],
   "source": [
    "from functools import partial\n",
    "\n",
    "from toponymy.cluster_layer import ClusterLayerSummaryText\n",
    "\n",
    "\n",
    "def make_layer_class():\n",
    "    return partial(\n",
    "        ClusterLayerSummaryText,\n",
    "        n_keyphrases=32,\n",
    "        keyphrase_diversify_alpha=3.0,\n",
    "        n_exemplars=32,\n",
    "        exemplars_diversify_alpha=3.0,\n",
    "        n_subtopics=32,\n",
    "        subtopic_diversify_alpha=3.0,\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "ee10f55b",
   "metadata": {},
   "outputs": [],
   "source": [
    "from toponymy import ToponymyClusterer\n",
    "\n",
    "def get_toponymy_clusterer() -> ToponymyClusterer:\n",
    "    return ToponymyClusterer(\n",
    "        min_clusters=4,\n",
    "        base_min_cluster_size=20,\n",
    "        max_layers=4,\n",
    "        verbose=True,\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5ee54c86",
   "metadata": {},
   "source": [
    "## Fitting a model with summaries\n",
    "\n",
    "The ``Toponymy`` object is almost identical to the naming-only version. The important summary-specific arguments are ``layer_class=make_layer_class()`` and ``prompt_template=SUMMARY_PROMPT_TEMPLATES``.\n",
    "\n",
    "The ``object_description`` and ``corpus_description`` matter more for summaries than they do for terse labels, because the LLM is writing paragraph-level prose. Make them concrete: say what one object is, what fields are included, and what larger corpus the objects came from."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "feb3a311",
   "metadata": {},
   "outputs": [],
   "source": [
    "from toponymy import Toponymy\n",
    "from toponymy.templates import SUMMARY_PROMPT_TEMPLATES\n",
    "\n",
    "\n",
    "def run_toponymy() -> Toponymy:\n",
    "    # ==================== Load models ======================================\n",
    "    toponymy_embedder = get_toponymy_embedder()\n",
    "    toponymy_llm = get_toponymy_llm()\n",
    "\n",
    "    # ==================== Load documents and embeddings ====================\n",
    "    documents = load_docs(DOCS_FILEPATH)\n",
    "    document_vectors = load_vecs(DOCUMENT_VECTORS_FILEPATH)\n",
    "    document_map = load_map(DOCUMENT_MAP_FILEPATH)\n",
    "\n",
    "    assert len(documents) == document_vectors.shape[0]\n",
    "    assert document_map.shape[0] == document_vectors.shape[0]\n",
    "\n",
    "    # ==================== Run Toponymy =====================================\n",
    "    toponymy_clusterer = get_toponymy_clusterer()\n",
    "    layer_class = make_layer_class()\n",
    "\n",
    "    topic_model = Toponymy(\n",
    "        llm_wrapper=toponymy_llm,\n",
    "        text_embedding_model=toponymy_embedder,\n",
    "        clusterer=toponymy_clusterer,\n",
    "        layer_class=layer_class,\n",
    "        prompt_template=SUMMARY_PROMPT_TEMPLATES,\n",
    "        object_description=(\n",
    "            'Title and abstract of an academic paper'\n",
    "        ),\n",
    "        corpus_description=(\n",
    "            'AI research papers from arXiv'\n",
    "        ),\n",
    "    )\n",
    "\n",
    "    topic_model.fit(\n",
    "        documents,\n",
    "        embedding_vectors=document_vectors,\n",
    "        clusterable_vectors=document_map,\n",
    "    )\n",
    "    return topic_model"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cdedacea",
   "metadata": {},
   "source": [
    "Running the next cell will call your embedding model for keyphrase/topic embeddings and your LLM for every topic in every layer. For a quick documentation read, it is fine to leave it unrun. For your own corpus, run it once your API credentials and model names are configured."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "6aafbbb3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Layer 0 found 150 clusters\n",
      "Layer 1 found 49 clusters\n",
      "Layer 2 found 14 clusters\n",
      "Layer 3 found 4 clusters\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "embedding texts: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 521/521 [11:23<00:00,  1.31s/it]\n",
      "embedding texts: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 2/2 [00:02<00:00,  1.05s/it]?layer/s]\n",
      "embedding texts: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 1/1 [00:00<00:00,  1.09it/s]23, 207.97s/layer]\n",
      "embedding texts: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 1/1 [00:00<00:00,  4.57it/s]07, 123.85s/layer]\n",
      "embedding texts: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 1/1 [00:00<00:00, 10.97it/s]15, 75.81s/layer] \n",
      "Building topic names by layer: 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 4/4 [05:02<00:00, 75.74s/layer]\n"
     ]
    }
   ],
   "source": [
    "topic_model = run_toponymy()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a2967cd1",
   "metadata": {},
   "source": [
    "## Saving names, summaries, and explanations\n",
    "\n",
    "The summary outputs can be saved with the same core payload as a standard Toponymy run, with the addition of ``topic_summaries`` and ``topic_explanations``. In a production script you may want to build partition-specific filenames, but the essential helper only needs a fitted model and a filepath."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "8e0e7007",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "import pickle\n",
    "\n",
    "from toponymy import Toponymy\n",
    "\n",
    "\n",
    "def save_toponymy_model(\n",
    "    toponymy_model: Toponymy,\n",
    "    filepath: str | Path,\n",
    ") -> None:\n",
    "    # ============ SAVE RESULTS ============\n",
    "    topic_names = toponymy_model.topic_names_\n",
    "    topic_summaries = toponymy_model.topic_summaries_\n",
    "    topic_explanations = toponymy_model.topic_explanations_\n",
    "    topic_name_vectors = toponymy_model.topic_name_vectors_\n",
    "    cluster_layers = toponymy_model.cluster_layers_\n",
    "    cluster_tree = toponymy_model.cluster_tree_\n",
    "    keyphrase_list = toponymy_model.keyphrase_list_\n",
    "    keyphrase_vectors = toponymy_model.keyphrase_vectors_\n",
    "\n",
    "    filepath = Path(filepath)\n",
    "    filepath.parent.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "    payload = {\n",
    "        'topic_names': topic_names,\n",
    "        'topic_summaries': topic_summaries,\n",
    "        'topic_explanations': topic_explanations,\n",
    "        'topic_name_vectors': topic_name_vectors,\n",
    "        'layer_cluster_labels': [\n",
    "            layer.cluster_labels.tolist() for layer in cluster_layers\n",
    "        ],\n",
    "        'cluster_tree': cluster_tree,\n",
    "        'keyphrase_list': keyphrase_list,\n",
    "        'keyphrase_vectors': keyphrase_vectors,\n",
    "    }\n",
    "\n",
    "    with open(filepath, 'wb') as f:\n",
    "        pickle.dump(payload, f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "82ec5fd1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# topic_model is undefined if the fitting cell above was skipped.\n",
    "if globals().get('topic_model') is None:\n",
    "    print('Fit the example model first to save summary outputs.')\n",
    "else:\n",
    "    save_toponymy_model(topic_model, MODEL_OUTPUT_FILEPATH)"
   ]
  },
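  {
   "cell_type": "markdown",
   "id": "d4a96b27",
   "metadata": {},
   "source": [
    "Reading the saved outputs back is the mirror image of the helper above. The sketch below is a minimal example, assuming the file was written with ``pickle.dump`` as shown; ``load_toponymy_outputs`` is a hypothetical name rather than a Toponymy function, and the small payload is made up so that the cell runs without a fitted model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e5b07c38",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pickle\n",
    "import tempfile\n",
    "from pathlib import Path\n",
    "\n",
    "\n",
    "def load_toponymy_outputs(filepath):\n",
    "    # Hypothetical helper: read back a payload written with pickle.dump.\n",
    "    # Only unpickle files you trust, since pickle can run code on load.\n",
    "    with open(filepath, 'rb') as f:\n",
    "        return pickle.load(f)\n",
    "\n",
    "\n",
    "# Round-trip a small made-up payload to show the access pattern.\n",
    "example_payload = {\n",
    "    'topic_names': [['Topic A', 'Topic B'], ['Topic AB']],\n",
    "    'topic_summaries': [['Summary A', 'Summary B'], ['Summary AB']],\n",
    "}\n",
    "with tempfile.TemporaryDirectory() as tmp_dir:\n",
    "    example_path = Path(tmp_dir) / 'example_outputs.pkl'\n",
    "    with open(example_path, 'wb') as f:\n",
    "        pickle.dump(example_payload, f)\n",
    "    loaded = load_toponymy_outputs(example_path)\n",
    "\n",
    "# Index by layer first, then by topic within that layer.\n",
    "print(loaded['topic_names'][1][0], '|', loaded['topic_summaries'][1][0])"
   ]
  },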
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "dd328078",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "TOPONYMY SUMMARY MODEL\n",
      "========================================================================================\n",
      "Layers: 4\n",
      "Documents: 10,000\n",
      "Document vectors: (10000, 768)\n",
      "Document map: (10000, 2)\n",
      "Keyphrases: 50,000\n",
      "========================================================================================\n",
      "\n",
      "Layer 0: 150 topics\n",
      "----------------------------------------------------------------------------------------\n",
      "1. Topic 2 (126 documents): AI Energy Efficiency and Cross-Domain Technical Analysis\n",
      "   Summary: This group explores AI energy consumption measurement, cross-lingual transfer, video editing, discourse modeling, federated learning, and related computational systems through diverse technical lenses including efficiency, robustness, an...\n",
      "   Explanation: The papers collectively investigate energy-efficient AI workloads, zero-shot cross-lingual NLP transfer, synthetic video editing frameworks, discourse-aware evaluation benchmarks, personalized federated learning with layer dropping, contextual bandit platforms for user experience optimization, knowledge graph query answering with bidirectional encoders, g...\n",
      "2. Topic 58 (116 documents): Multi-modal AI Systems and Embodied Agents Research\n",
      "   Summary: This group comprises studies on multi-modal artificial intelligence systems, focusing on the development of embodied agents that can perceive, interpret, and interact with complex environments using diverse sensory inputs and multimodal...\n",
      "   Explanation: The research spans foundational architectures for multimodal perception-action coupling, agent-based reasoning in physical and virtual spaces, and the integration of symbolic and subsymbolic representations to enable human-like social cognition and task execution. Key themes include the use of neural-symbolic frameworks for explanation and reasoning, the...\n",
      "3. Topic 43 (107 documents): Interpretable Machine Learning and Explainable AI with Constrained Clustering and Calibration\n",
      "   Summary: This group explores interpretable machine learning techniques, focusing on constrained clustering frameworks, model calibration using explanations, and explainable reinforcement learning through causal models. It emphasizes theoretical g...\n",
      "   Explanation: The group centers on developing interpretable AI systems through constrained clustering with theoretical guarantees, leveraging explanations for model calibration, and applying causal reasoning to explain reinforcement learning behavior. It integrates methods like SAT-based clustering, explanation-guided confidence adjustment, and hierarchical active infe...\n",
      "\n",
      "Layer 1: 49 topics\n",
      "----------------------------------------------------------------------------------------\n",
      "1. Topic 43 (302 documents): AI-Driven Complex Systems Optimization Analysis\n",
      "   Summary: This group focuses on AI-enhanced methods for modeling, optimization, and analysis across diverse domains including reinforcement learning, federated learning, causal inference, and data-driven decision-making, emphasizing scalability, i...\n",
      "   Explanation: The group encompasses research on AI-driven optimization techniques applied to complex systems, covering neural program induction for algorithmic efficiency (e.g., sorting networks), uncertainty-aware learning, fairness-aware domain adaptation, and scalable AI methods for real-world challenges like smart grids, supply chains, and healthcare. It integrates...\n",
      "2. Topic 30 (296 documents): AI Ethics and Adaptive Cruise Control for Autonomous Vehicles\n",
      "   Summary: This group explores AI ethics, fairness, explainability, and multilingual data challenges across healthcare, military, and social media domains, while also researching safety verification for autonomous vehicle systems like adaptive crui...\n",
      "   Explanation: The group's core focus spans two major, distinct subtopics: (1) AI safety and adaptive cruise control for autonomous vehicles, emphasizing formal safety verification, hybrid automaton frameworks, and reinforcement learning methods to prevent unsafe behavior; and (2) AI ethics and multilingual data for fairness and explainability, addressing interdisciplin...\n",
      "3. Topic 28 (271 documents): Multimodal Deep Learning for Adaptive Real-World Systems and Intelligent Automation\n",
      "   Summary: The group focuses on advanced deep learning techniques that integrate multimodal data to enhance decision-making, personalization, and efficiency across diverse real-world applications including scheduling, fleet management, stress detec...\n",
      "   Explanation: The group's research spans multimodal deep learning methodologies\u2014including reinforcement learning, contrastive learning, diffusion models, and multimodal fusion\u2014to solve complex real-world challenges. Key themes include adaptive system design (e.g., job shop scheduling, fleet management), robust handling of noisy data and concept drift, personalized feed...\n",
      "\n",
      "Layer 2: 14 topics\n",
      "----------------------------------------------------------------------------------------\n",
      "1. Topic 4 (768 documents): Neural Approximation and Active Learning Innovations\n",
      "   Summary: This group focuses on advanced neural network approximation techniques for solving PDEs using Monte Carlo methods and neural networks, combined with active learning frameworks that leverage counter-examples to improve partial label learn...\n",
      "   Explanation: The group's core work centers on neural functionals for PDE approximation, active learning with counter-example-driven strategies, and specialized neural architectures (COFENET, STFNNs) for domain-specific tasks. It bridges theoretical advances (sharpness-aware optimization, information bottlenecks) with practical applications in computational biology, tr...\n",
      "2. Topic 5 (678 documents): Explainable Multimodal AI for Biomedical and Scientific Discovery\n",
      "   Summary: The group focuses on developing explainable AI systems that integrate multimodal data across biomedical and scientific domains, emphasizing interpretability, knowledge integration, and domain-specific validation through techniques like a...\n",
      "   Explanation: The group's core research revolves around explainable AI frameworks for multimodal systems, with major subtopics covering biomedical signal processing (e.g., neural decoding, adaptive deep learning for medical diagnostics), ethical AI deployment (e.g., fairness in healthcare and social media), and scientific discovery (e.g., integrating multimodal data wi...\n",
      "3. Topic 13 (437 documents): Data-Driven AI Methods for Vision, Language, and Systems Modeling\n",
      "   Summary: This group focuses on advancing data-driven AI techniques across diverse domains such as computer vision, natural language processing, reinforcement learning, and systems engineering. Key contributions include novel algorithms for effici...\n",
      "   Explanation: The group encompasses research on data-driven AI methods applied to vision, language, and systems modeling, with major subtopics covering advanced representation learning, multimodal AI systems, scientific computing, and hierarchical knowledge graph embeddings. Minor subtopics include multimodal data analysis, causal inference, and privacy-preserving tech...\n",
      "\n",
      "Layer 3: 4 topics\n",
      "----------------------------------------------------------------------------------------\n",
      "1. Topic 1 (3,741 documents): Multimodal AI Systems\n",
      "   Summary: The group encompasses advanced research on multimodal deep learning techniques that integrate vision, language, sensor, and graph-structured data to build adaptive, explainable, and ethically aligned AI systems for complex real-world app...\n",
      "   Explanation: The group's work unifies multimodal learning frameworks (e.g., contrastive learning, diffusion models, neural symbolic integration) with safety-critical applications (e.g., autonomous vehicles, healthcare diagnostics) and ethical AI governance. It spans technical innovations in representation learning (e.g., graph neural networks, latent variables), adapt...\n",
      "2. Topic 0 (2,083 documents): Multi-modal AI Systems\n",
      "   Summary: The group focuses on developing advanced multi-modal AI systems and embodied agents that integrate diverse sensory inputs, perform neural-symbolic reasoning, and operate in real-world contexts such as robotics and human-robot interaction...\n",
      "   Explanation: The group encompasses foundational work on multi-modal AI architectures, embodied agent behavior design, and integration of symbolic and subsymbolic representations. It addresses critical challenges in AI safety, interpretability, and alignment through methods like superhuman performance certification, adversarial vulnerability detection, and conservative...\n",
      "3. Topic 2 (768 documents): Neural Approximation and Active Learning Innovations\n",
      "   Summary: This group focuses on advanced neural network approximation techniques for solving PDEs using Monte Carlo methods and neural networks, combined with active learning frameworks that leverage counter-examples to improve partial label learn...\n",
      "   Explanation: The group's core work centers on neural functionals for PDE approximation, active learning with counter-example-driven strategies, and specialized neural architectures (COFENET, STFNNs) for domain-specific tasks. It bridges theoretical advances (sharpness-aware optimization, information bottlenecks) with practical applications in computational biology, tr...\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "\n",
    "\n",
    "def _shorten(text, width=260):\n",
    "    text = ' '.join(str(text).split())\n",
    "    return text if len(text) <= width else text[: width - 3].rstrip() + '...'\n",
    "\n",
    "\n",
    "def print_topic_summary_report(topic_model, topics_per_layer=3):\n",
    "    if topic_model is None:\n",
    "        print('Fit the example model first to print topic summary results.')\n",
    "        return\n",
    "\n",
    "    print('TOPONYMY SUMMARY MODEL')\n",
    "    print('=' * 88)\n",
    "    print(f'Layers: {len(topic_model.topic_names_)}')\n",
    "    print(f'Documents: {topic_model.embedding_vectors_.shape[0]:,}')\n",
    "    print(f'Document vectors: {topic_model.embedding_vectors_.shape}')\n",
    "    print(f'Document map: {topic_model.clusterable_vectors_.shape}')\n",
    "    print(f'Keyphrases: {len(topic_model.keyphrase_list_):,}')\n",
    "    print('=' * 88)\n",
    "\n",
    "    for layer_index, layer in enumerate(topic_model.cluster_layers_):\n",
    "        topic_names = topic_model.topic_names_[layer_index]\n",
    "        topic_summaries = topic_model.topic_summaries_[layer_index]\n",
    "        topic_explanations = topic_model.topic_explanations_[layer_index]\n",
    "        labels = layer.cluster_labels\n",
    "        topic_sizes = np.bincount(labels[labels >= 0], minlength=len(topic_names))\n",
    "        topic_order = np.argsort(topic_sizes)[::-1][:topics_per_layer]\n",
    "\n",
    "        print(f'\\nLayer {layer_index}: {len(topic_names):,} topics')\n",
    "        print('-' * 88)\n",
    "        for rank, topic_index in enumerate(topic_order, start=1):\n",
    "            print(\n",
    "                f'{rank}. Topic {topic_index} '\n",
    "                f'({int(topic_sizes[topic_index]):,} documents): '\n",
    "                f'{topic_names[topic_index]}'\n",
    "            )\n",
    "            print(f'   Summary: {_shorten(topic_summaries[topic_index], 240)}')\n",
    "            print(f'   Explanation: {_shorten(topic_explanations[topic_index], 360)}')\n",
    "\n",
    "\n",
    "# topic_model is undefined if the fitting cell above was skipped.\n",
    "print_topic_summary_report(globals().get('topic_model'))"
   ]
  },
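  {
   "cell_type": "markdown",
   "id": "f6c18d49",
   "metadata": {},
   "source": [
    "The topic sizes in the report come from ``np.bincount`` over each layer's cluster labels. The toy example below isolates that step. It assumes, as the ``labels >= 0`` filter in the report suggests, that negative labels mark documents not assigned to any topic in a layer; the label values themselves are made up."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a7d29e5a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Made-up labels for eight documents; -1 is assumed to mean unassigned.\n",
    "example_labels = np.array([0, 2, 2, -1, 0, 2, -1, 3])\n",
    "n_topics = 5  # topics 1 and 4 have no documents in this toy example\n",
    "\n",
    "# Drop negative labels first: np.bincount rejects negative values.\n",
    "# minlength keeps a slot for every topic, including empty ones.\n",
    "example_sizes = np.bincount(\n",
    "    example_labels[example_labels >= 0], minlength=n_topics\n",
    ")\n",
    "\n",
    "# Sort ascending, reverse for largest first, then keep the top three.\n",
    "largest_topics = np.argsort(example_sizes)[::-1][:3]\n",
    "\n",
    "print(example_sizes)\n",
    "print(largest_topics)"
   ]
  },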
  {
   "cell_type": "markdown",
   "id": "8567aa04",
   "metadata": {},
   "source": [
    "## Practical tuning notes\n",
    "\n",
    "Summaries are more expressive than names, but they also cost more output tokens. Start with a smaller partition, fewer layers, or smaller ``n_keyphrases``, ``n_exemplars``, and ``n_subtopics`` values while you are tuning. Once the generated summaries have the right level of detail, increase the context budget for the production run.\n",
    "\n",
    "Good descriptions matter. A vague ``object_description`` such as ``documents`` gives the LLM less guidance than ``research paper title and abstract`` or ``support ticket subject, body, product area, and severity``. Likewise, a precise ``corpus_description`` helps the model choose language appropriate to the collection.\n",
    "\n",
    "If you write a custom prompt template, keep the same structured output contract used by ``SUMMARY_PROMPT_TEMPLATES``: the layer prompt should return JSON with ``topic_name``, ``topic_summary``, ``topic_explanation``, and ``topic_specificity`` fields, and the extraction function should return the name, summary, and explanation together."
   ]
  }
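  ,
  {
   "cell_type": "markdown",
   "id": "b8e30f6b",
   "metadata": {},
   "source": [
    "To make that contract concrete, here is a minimal sketch of a response check. It is not Toponymy's own extraction function, whose signature is not shown in this tutorial: ``parse_summary_response`` is a hypothetical helper, and the example reply, including its ``topic_specificity`` value, is made up. Only the four field names come from the contract described above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c9f4107c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "# Field names taken from the structured output contract described above.\n",
    "REQUIRED_FIELDS = (\n",
    "    'topic_name',\n",
    "    'topic_summary',\n",
    "    'topic_explanation',\n",
    "    'topic_specificity',\n",
    ")\n",
    "\n",
    "\n",
    "def parse_summary_response(raw_response):\n",
    "    # Hypothetical helper: decode one JSON reply and check the contract.\n",
    "    parsed = json.loads(raw_response)\n",
    "    missing = [field for field in REQUIRED_FIELDS if field not in parsed]\n",
    "    if missing:\n",
    "        raise ValueError(f'Response is missing fields: {missing}')\n",
    "    # Return the name, summary, and explanation together.\n",
    "    return (\n",
    "        parsed['topic_name'],\n",
    "        parsed['topic_summary'],\n",
    "        parsed['topic_explanation'],\n",
    "    )\n",
    "\n",
    "\n",
    "# Made-up reply used only to exercise the helper.\n",
    "example_response = json.dumps({\n",
    "    'topic_name': 'Federated learning',\n",
    "    'topic_summary': 'Papers on training models across decentralized data.',\n",
    "    'topic_explanation': 'Covers aggregation, privacy, and personalization.',\n",
    "    'topic_specificity': 0.7,\n",
    "})\n",
    "name, summary, explanation = parse_summary_response(example_response)\n",
    "print(name)"
   ]
  }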
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "None",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.9"
  },
  "nbsphinx": {
   "execute": "never"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}