Main Classes

class toponymy.Toponymy(llm_wrapper: ~toponymy.llm_wrappers.LLMWrapper, text_embedding_model: ~toponymy.embedding_wrappers.TextEmbedderProtocol, clusterer: ~toponymy.clustering.Clusterer = <toponymy.clustering.ToponymyClusterer object>, layer_class: ~typing.Type[~toponymy.cluster_layer.ClusterLayer] = <class 'toponymy.cluster_layer.ClusterLayerText'>, prompt_template: ~typing.Dict[str, ~typing.Any] = {'disambiguate_topics': {'combined': <Template memory:7def8a1b1c90>, 'extract_topic_names': <function default_extract_topic_names>, 'get_topic_names_regex': '\\{\\s*"new_topic_name_mapping":\\s*.*?,\\s*"topic_specificities": .*?\\}', 'system': <Template memory:7def8b304650>, 'user': <Template memory:7def8b651890>}, 'layer': {'combined': <Template memory:7def8a08d310>, 'extract_topic_name': <function <lambda>>, 'get_topic_name_regex': '\\{\\s*"topic_name":\\s*.*?,\\s*"topic_specificity":\\s*[\\w.]+\\s*\\}', 'system': <Template memory:7def8b856e10>, 'user': <Template memory:7def8e7b42d0>}}, keyphrase_builder: ~toponymy.keyphrases.KeyphraseBuilder = <toponymy.keyphrases.KeyphraseBuilder object>, object_description: str = 'objects', corpus_description: str = 'collection of objects', lowest_detail_level: float = 0.0, highest_detail_level: float = 1.0, exemplar_delimiters: ~typing.List[str] = [' * "', '"\n'], verbose: bool | None = None, show_progress_bars: bool | None = None)

Bases: object

A class for generating topic names for vector-based topic modeling.
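The default prompt_template in the signature above includes a get_topic_name_regex entry. As a rough illustration of what that pattern matches, the sketch below applies it to a made-up LLM reply using only the standard library. The reply text and the helper name extract_topic_name_sketch are hypothetical, and this is not necessarily how Toponymy processes responses internally.

```python
import json
import re

# Pattern copied from the 'layer' entry of the default prompt_template.
TOPIC_NAME_REGEX = r'\{\s*"topic_name":\s*.*?,\s*"topic_specificity":\s*[\w.]+\s*\}'


def extract_topic_name_sketch(reply: str) -> dict:
    """Hypothetical helper: parse the first matching JSON object in a reply."""
    match = re.search(TOPIC_NAME_REGEX, reply)
    if match is None:
        raise ValueError("no topic-name JSON found in reply")
    return json.loads(match.group(0))


# Made-up reply with extra chatter around the JSON payload.
reply = (
    "Here is the result:\n"
    '{"topic_name": "Deep learning optimizers", "topic_specificity": 0.7}\n'
    "Hope that helps."
)
parsed = extract_topic_name_sketch(reply)
print(parsed["topic_name"], parsed["topic_specificity"])
```

The pattern tolerates surrounding text, which is why a regex is applied before JSON parsing in this sketch.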

fit(objects: List[Any], embedding_vectors: ndarray, clusterable_vectors: ndarray, exemplar_method: str = 'central', keyphrase_method: str = 'information_weighted', subtopic_method: str = 'central'): Vectorizes using the class's embedding_model and constructs a low-dimensional data map with UMAP if object_vectors and object_map are not specified.

fit_predict(objects: List[Any], object_vectors: ndarray, clusterable_vectors: ndarray, exemplar_method: str = 'central', keyphrase_method: str = 'information_weighted', subtopic_method: str = 'facility_location') → List[ndarray]: Fit the model with objects and return the topic names.

property topic_tree_: TopicTree: Returns the topic tree.

class toponymy.ToponymyClusterer(min_clusters: int = 6, min_samples: int = 5, base_min_cluster_size: int | None = 10, base_n_clusters: int | None = None, next_cluster_size_quantile: float = 0.85, max_layers: int | None = None, verbose: bool | None = None, show_progress_bar: bool | None = None, n_threads: int = -1)

Bases: Clusterer

A class for clustering data using a layered version of HDBSCAN.

Parameters:

min_clusters (int, optional): The minimum number of clusters to form in a layer (default is 6).
min_samples (int, optional): The minimum number of samples for HDBSCAN-style clustering (default is 5).
base_min_cluster_size (int, optional): The base minimum size of clusters for the most fine-grained cluster layer (default is 10).
base_n_clusters (Optional[int], optional): The base number of clusters for the most fine-grained cluster layer (default is None). If None, base_min_cluster_size is used; otherwise this value overrides base_min_cluster_size.
next_cluster_size_quantile (float, optional): The quantile used to determine the minimum cluster size for the next layer (default is 0.85).
max_layers (Optional[int], optional): The maximum number of layers to create (default is None). If None, no limit is imposed.
verbose (bool, optional): Whether to show progress bars and verbose output. If True, shows all output. If False, suppresses all output.
show_progress_bar (bool, optional, deprecated): Deprecated. Use verbose instead.
n_threads (int, optional): The number of threads to use for parallel computation (default is -1, use all available cores).
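To make the role of next_cluster_size_quantile more concrete, here is a minimal sketch under a stated assumption: that the next layer's minimum cluster size is taken as a quantile of the current layer's cluster sizes. The sizes are invented, the quantile helper is a plain linear-interpolation implementation written for this example, and the actual ToponymyClusterer logic may differ.

```python
from typing import List


def quantile(values: List[int], q: float) -> float:
    """Linearly interpolated quantile of a non-empty list, with 0 <= q <= 1."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# Hypothetical cluster sizes from the most fine-grained layer.
layer_cluster_sizes = [10, 12, 15, 20, 30, 45, 80]

# Assumption: the next layer's minimum cluster size is this quantile of the sizes.
next_min_cluster_size = quantile(layer_cluster_sizes, 0.85)
print(next_min_cluster_size)
```

Under this reading, a higher quantile pushes the next layer toward fewer, larger clusters.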

Attributes:

cluster_layers_ (List[ClusterLayer]): A list of the created cluster layers.
cluster_tree_ (Dict[Tuple[int, int], List[Tuple[int, int]]]): A dictionary representing the cluster tree. Keys are tuples of (layer, cluster index) and values are lists of tuples representing child clusters.
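The cluster_tree_ structure described above can be walked with ordinary dictionary lookups. The sketch below uses a hand-written tree rather than one produced by a fitted clusterer, and it assumes that higher layer numbers are coarser; leaves_under is a hypothetical helper, not part of the library.

```python
from typing import Dict, List, Tuple

Node = Tuple[int, int]  # (layer, cluster index)

# Hypothetical tree: layer 2 is assumed coarsest, layer 0 most fine-grained.
cluster_tree: Dict[Node, List[Node]] = {
    (2, 0): [(1, 0), (1, 1)],
    (1, 0): [(0, 0), (0, 1)],
    (1, 1): [(0, 2)],
}


def leaves_under(tree: Dict[Node, List[Node]], node: Node) -> List[Node]:
    """Collect the childless descendants of a node, depth first."""
    children = tree.get(node, [])
    if not children:
        return [node]
    leaves: List[Node] = []
    for child in children:
        leaves.extend(leaves_under(tree, child))
    return leaves


print(leaves_under(cluster_tree, (2, 0)))
```

Nodes that never appear as keys are treated as leaves, which matches a dictionary that only stores clusters with children.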

fit(clusterable_vectors: ndarray, embedding_vectors: ndarray, layer_class: Type[ClusterLayer] = <class 'toponymy.cluster_layer.ClusterLayerText'>, verbose: bool | None = None, show_progress_bar: bool | None = None, **layer_kwargs) → Clusterer

fit_predict(clusterable_vectors: ndarray, embedding_vectors: ndarray, layer_class: Type[ClusterLayer] = <class 'toponymy.cluster_layer.ClusterLayerText'>, verbose: bool | None = None, show_progress_bar: bool | None = None, **layer_kwargs) → Tuple[List[ClusterLayer], Dict[Tuple[int, int], List[Tuple[int, int]]]]

class toponymy.PLSCANClusterer(min_clusters: int = 6, min_samples: int = 5, base_min_cluster_size: int = 10, max_layers: int | None = 10, verbose: bool | None = None, show_progress_bar: bool | None = None)

Bases: Clusterer

A class for clustering dense vector data in layers using fast_hdbscan.PLSCAN.

Parameters:

min_clusters (int, optional): The minimum number of non-noise clusters to keep in a layer (default is 6).
min_samples (int, optional): The minimum number of samples used by PLSCAN (default is 5).
base_min_cluster_size (int, optional): The base minimum cluster size passed to PLSCAN (default is 10).
max_layers (Optional[int], optional): The maximum number of hierarchy layers to keep (default is 10).
verbose (bool, optional): Whether to show progress bars and verbose output. If True, shows all output. If False, suppresses all output.
show_progress_bar (bool, optional, deprecated): Deprecated. Use verbose instead.

Attributes:

cluster_layers_ (List[ClusterLayer]): A list of the created cluster layers.
cluster_tree_ (Dict[Tuple[int, int], List[Tuple[int, int]]]): A dictionary representing the cluster tree.
cluster_probabilities_ (List[np.ndarray]): Membership probabilities for each returned layer.
cluster_persistence_scores_ (List[float]): Persistence scores for each returned layer.
plscan_min_cluster_sizes_ (Optional[np.ndarray]): The minimum cluster sizes explored by PLSCAN, when exposed by the upstream implementation.

fit(clusterable_vectors: ndarray, embedding_vectors: ndarray, layer_class: Type[ClusterLayer] = <class 'toponymy.cluster_layer.ClusterLayerText'>, verbose: bool | None = None, show_progress_bar: bool | None = None, **layer_kwargs) → Clusterer

fit_predict(clusterable_vectors: ndarray, embedding_vectors: ndarray, layer_class: Type[ClusterLayer] = <class 'toponymy.cluster_layer.ClusterLayerText'>, verbose: bool | None = None, show_progress_bar: bool | None = None, **layer_kwargs) → Tuple[List[ClusterLayer], Dict[Tuple[int, int], List[Tuple[int, int]]]]

class toponymy.ClusterLayerText(cluster_labels: ndarray, centroid_vectors: ndarray, layer_id: int, text_embedding_model: TextEmbedderProtocol | None = None, n_keyphrases: int = 16, keyphrase_diversify_alpha: float = 1.0, n_exemplars: int = 8, exemplars_diversify_alpha: float = 1.0, n_subtopics: int = 16, subtopic_diversify_alpha: float = 1.0, exemplar_delimiters: List[str] = None, prompt_format: str = 'combined', prompt_template: Dict[str, Any] | None = None, verbose: bool | None = None, show_progress_bar: bool | None = None, **kwargs: Any)

Bases: ClusterLayer

A cluster layer class for dealing with text data. A cluster layer is a layer of a cluster hierarchy.

Attributes:

cluster_labels: vector of numeric cluster labels for the clusters in the layer
centroid_vectors: list of centroid vectors of the clusters in the layer

Methods:

make_prompts: creates and stores a list of prompts for the clusters in the layer
make_keywords: generates and stores a list of keywords for each cluster in the layer
make_subtopics: generates and stores a list of subtopics for each cluster in the layer
make_sample_texts: generates and stores a list of sample texts for each cluster in the layer
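As a rough picture of how cluster_labels and centroid_vectors relate, the sketch below computes one mean vector per label from toy data. It uses plain lists instead of ndarray to stay dependency-free, and it assumes that negative labels mark noise points, as in HDBSCAN output. The helper centroids_by_label is hypothetical, and the library may compute centroids differently.

```python
from collections import defaultdict
from typing import Dict, List


def centroids_by_label(
    labels: List[int], vectors: List[List[float]]
) -> Dict[int, List[float]]:
    """Average the vectors that share a non-negative cluster label."""
    groups: Dict[int, List[List[float]]] = defaultdict(list)
    for label, vector in zip(labels, vectors):
        if label >= 0:  # assumption: negative labels are noise and are skipped
            groups[label].append(vector)
    return {
        label: [sum(column) / len(members) for column in zip(*members)]
        for label, members in sorted(groups.items())
    }


# Toy data: five 2-D points, one of them labelled as noise.
labels = [0, 0, 1, -1, 1]
vectors = [[0.0, 0.0], [2.0, 2.0], [4.0, 0.0], [9.0, 9.0], [6.0, 2.0]]
centroids = centroids_by_label(labels, vectors)
print(centroids)
```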

make_exemplar_texts(object_list: List[str], object_vectors: ndarray, method='facility_location') → Tuple[List[List[str]], List[List[int]]]

make_keyphrases(keyphrase_list: List[str], object_x_keyphrase_matrix: spmatrix, keyphrase_vectors: ndarray, embedding_model: TextEmbedderProtocol | None = None, method: str = 'information_weighted') → List[List[str]]

make_prompts(detail_level: float, all_topic_names: List[List[str]], object_description: str, corpus_description: str, cluster_tree: dict | None = None, prompt_format: str | None = None, prompt_template: Dict[str, Any] | None = None, all_topic_summaries: List[List[str]] | None = None, all_topic_explanations: List[List[str]] | None = None) → List[str | Dict[str, str]]

make_subtopics(topic_list: List[str], topic_labels: ndarray, topic_vectors: ndarray | None = None, embedding_model: TextEmbedderProtocol | None = None, method: str = 'facility_location', topic_summaries: List[str] | None = None, topic_explanations: List[str] | None = None) → List[List[str]]

make_topic_name_vector() → ndarray

name_topics(llm, detail_level: float, all_topic_names: List[List[str]], object_description: str, corpus_description: str, cluster_tree: dict | None = None, embedding_model: TextEmbedderProtocol | None = None, all_topic_summaries: List[List[str]] | None = None, all_topic_explanations: List[List[str]] | None = None) → List[str]

class toponymy.KeyphraseBuilder(object_to_text: Callable[[Any], str] | None = None, ngram_range: Tuple[int, int] = (1, 4), tokenizer: TokenizerLike | None = None, token_pattern: str = "(?u)\\b\\w[-'\\w]+\\b", max_features: int = 50000, min_occurrences: int = 2, stop_words: FrozenSet[str] = sklearn.feature_extraction.text.ENGLISH_STOP_WORDS, n_jobs: int = -1, embedder: TextEmbedderProtocol | None = None, verbose: bool = None)

Bases: object

A class for building keyphrase count matrices from a list of objects. This can be useful because keyphrases can be a more specific way of helping prompt an LLM for a topic name. To make use of keyphrases you need to be able to convert objects to text. For basic short-text topic modeling, you can use the default settings, which simply assume objects are already short texts. For other kinds of topic modeling you may need to provide a function that converts objects to text.

Parameters:

object_to_text (Optional[Callable[[Any], str]], optional): A function that converts objects to text, by default None. If None, it is assumed that the objects are strings. An example of another case would be if objects were images and this function was a zero-shot image captioning model.
ngram_range (Tuple[int, int], optional): The range of n-grams to consider, by default (1, 4).
tokenizer (Optional[TokenizerLike], optional): A tokenizer object that has encode and decode methods, by default None. If None, a CountVectorizer is used.
token_pattern (str, optional): The regular expression pattern to use for tokenization, by default "(?u)\b\w[-'\w]+\b".
max_features (int, optional): The maximum number of features to consider, by default 50_000.
min_occurrences (int, optional): The minimum number of occurrences for a keyphrase to be included, by default 2, so keyphrases have to re-occur.
stop_words (FrozenSet[str], optional): The set of stop words to use, by default sklearn.feature_extraction.text.ENGLISH_STOP_WORDS.
n_jobs (int, optional): The number of jobs to use in parallel processing, by default -1. If -1, all available cores are used.
embedder (Optional[TextEmbedderProtocol], optional): An optional embedder to generate keyphrase vectors, by default None.
verbose (bool, optional): Whether to show progress bars and verbose output. If True, shows all output. If False, suppresses all output.
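To see what the default token_pattern from the signature accepts, the short sketch below runs it through Python's re module on an invented sentence. This only demonstrates the regular expression itself; it does not go through KeyphraseBuilder.

```python
import re

# Default token_pattern from the KeyphraseBuilder signature.
TOKEN_PATTERN = r"(?u)\b\w[-'\w]+\b"

text = "State-of-the-art topic models don't use a GPU"
tokens = re.findall(TOKEN_PATTERN, text)
print(tokens)
```

The pattern requires at least two characters per token, so the single-letter word "a" is dropped, while hyphenated and apostrophised words stay intact.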

Attributes:

object_x_keyphrase_matrix_ (scipy.sparse.spmatrix): A sparse count matrix of keyphrases in the objects.
keyphrase_list_ (List[str]): A list of keyphrases in the same order as columns in object_x_keyphrase_matrix.

fit(objects: List[Any])

fit_transform(objects: List[Any]) → Tuple[spmatrix, List[str], ndarray | None]: Fits the KeyphraseBuilder to the objects and returns the object x keyphrase matrix, keyphrase list, and keyphrase vectors.
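The shape of the object x keyphrase matrix and keyphrase list can be sketched without the library. The example below is a simplified stand-in: it tokenizes on whitespace, ignores stop words, returns dense lists rather than a sparse matrix, and assumes min_occurrences counts total occurrences across all objects. The function keyphrase_counts_sketch and its inputs are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], ngram_range: Tuple[int, int]) -> List[str]:
    """All word n-grams with lengths in the inclusive ngram_range."""
    low, high = ngram_range
    return [
        " ".join(tokens[i : i + n])
        for n in range(low, high + 1)
        for i in range(len(tokens) - n + 1)
    ]


def keyphrase_counts_sketch(
    texts: List[str],
    ngram_range: Tuple[int, int] = (1, 2),
    min_occurrences: int = 2,
) -> Tuple[List[List[int]], List[str]]:
    """Return (dense count matrix, keyphrase list) for short texts."""
    per_object = [Counter(ngrams(t.lower().split(), ngram_range)) for t in texts]
    totals: Counter = Counter()
    for counts in per_object:
        totals.update(counts)
    # Keep only phrases that re-occur often enough across the corpus.
    keyphrases = sorted(p for p, c in totals.items() if c >= min_occurrences)
    matrix = [[counts[p] for p in keyphrases] for counts in per_object]
    return matrix, keyphrases


texts = ["topic models name topics", "neural topic models", "neural networks"]
matrix, keyphrases = keyphrase_counts_sketch(texts)
print(keyphrases)
print(matrix)
```

Each row corresponds to one object and each column to one entry of the keyphrase list, mirroring the ordering guarantee stated for keyphrase_list_.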