Main Classes
- class toponymy.Toponymy(llm_wrapper: toponymy.llm_wrappers.LLMWrapper, text_embedding_model: toponymy.embedding_wrappers.TextEmbedderProtocol, clusterer: toponymy.clustering.Clusterer = <toponymy.clustering.ToponymyClusterer object>, layer_class: typing.Type[toponymy.cluster_layer.ClusterLayer] = <class 'toponymy.cluster_layer.ClusterLayerText'>, prompt_template: typing.Dict[str, typing.Any] = {'disambiguate_topics': {'combined': <Template memory:712cc66fb410>, 'extract_topic_names': <function default_extract_topic_names>, 'get_topic_names_regex': '\\{\\s*"new_topic_name_mapping":\\s*.*?,\\s*"topic_specificities": .*?\\}', 'system': <Template memory:712cc7690d90>, 'user': <Template memory:712cc76dee90>}, 'layer': {'combined': <Template memory:712cc7250ad0>, 'extract_topic_name': <function <lambda>>, 'get_topic_name_regex': '\\{\\s*"topic_name":\\s*.*?,\\s*"topic_specificity":\\s*[\\w.]+\\s*\\}', 'system': <Template memory:712cc7af6ad0>, 'user': <Template memory:712cca597790>}}, keyphrase_builder: toponymy.keyphrases.KeyphraseBuilder = <toponymy.keyphrases.KeyphraseBuilder object>, object_description: str = 'objects', corpus_description: str = 'collection of objects', lowest_detail_level: float = 0.0, highest_detail_level: float = 1.0, exemplar_delimiters: typing.List[str] = [' * "', '"\n'], verbose: bool | None = None, show_progress_bars: bool | None = None)
Bases: object
A class for generating topic names for vector-based topic modeling.
- fit(objects: List[Any], embedding_vectors: ndarray, clusterable_vectors: ndarray, exemplar_method: str = 'central', keyphrase_method: str = 'information_weighted', subtopic_method: str = 'central')
Vectorizes the objects using the class's embedding model and constructs a low-dimensional data map with UMAP if object_vectors and object_map aren't specified.
- fit_predict(objects: List[Any], object_vectors: ndarray, clusterable_vectors: ndarray, exemplar_method: str = 'central', keyphrase_method: str = 'information_weighted', subtopic_method: str = 'facility_location') -> List[ndarray]
Fit the model with objects and return the topic names.
- property topic_tree_: TopicTree
Returns the topic tree.
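The default prompt_template in the signature above carries a get_topic_name_regex entry for the 'layer' prompts. The following is a minimal sketch of how such a pattern could be used to pull a topic name out of a raw LLM response. The helper name extract_topic_name and the sample response are hypothetical, and how the library itself applies the regex (flags, fallbacks, error handling) is not shown in this reference, so treat this as an illustration only.

```python
import json
import re

# Regex copied from the 'layer' entry of the default prompt_template
# (written here as a raw string instead of the escaped repr).
TOPIC_NAME_REGEX = r'\{\s*"topic_name":\s*.*?,\s*"topic_specificity":\s*[\w.]+\s*\}'


def extract_topic_name(response: str) -> str:
    """Hypothetical helper: return the topic name embedded in an LLM response."""
    match = re.search(TOPIC_NAME_REGEX, response)
    if match is None:
        raise ValueError("no topic-name JSON object found in response")
    # The matched span is expected to be a small JSON object.
    payload = json.loads(match.group(0))
    return payload["topic_name"]


# Made-up response with chatter around the JSON payload.
response = (
    "Here is the name you asked for:\n"
    '{"topic_name": "Gradient-based optimizers", "topic_specificity": 0.7}\n'
    "Let me know if you need more detail."
)
topic_name = extract_topic_name(response)
print(topic_name)
```

Matching the JSON object with a regex first, rather than parsing the whole response, tolerates models that wrap their answer in extra prose.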
- class toponymy.ToponymyClusterer(min_clusters: int = 6, min_samples: int = 5, base_min_cluster_size: int | None = 10, base_n_clusters: int | None = None, next_cluster_size_quantile: float = 0.85, max_layers: int | None = None, verbose: bool | None = None, show_progress_bar: bool | None = None, n_threads: int = -1)
Bases: Clusterer
A class for clustering data using a layered version of HDBSCAN.
- Parameters:
- min_clusters : int, optional
The minimum number of clusters to form in a layer (default is 6).
- min_samples : int, optional
The minimum number of samples for hdbscan style clustering (default is 5).
- base_min_cluster_size : int, optional
The base minimum size of clusters for the most fine-grained cluster layer (default is 10).
- base_n_clusters : Optional[int], optional
The base number of clusters for the most fine-grained cluster layer (default is None). If None then base_min_cluster_size is used; otherwise this value will override base_min_cluster_size.
- next_cluster_size_quantile : float, optional
The quantile of cluster sizes used to determine the minimum cluster size for the next layer (default is 0.85).
- max_layers : Optional[int], optional
The maximum number of layers to create (default is None). If None, no limit is imposed.
- verbose : bool, optional
Whether to show progress bars and verbose output. If True, shows all output. If False, suppresses all output.
- show_progress_bar : bool, optional, deprecated
Deprecated. Use verbose instead.
- n_threads : int, optional
The number of threads to use for parallel computation (default is -1, use all available cores).
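To make the next_cluster_size_quantile parameter more concrete, here is a rough sketch of one way a quantile of the current layer's cluster sizes could set the minimum cluster size for the next, coarser layer. This reference only states that a quantile is used. The exact formula, the helper names quantile and next_min_cluster_size, and the sample sizes are assumptions for illustration, not the library's implementation.

```python
import math
from typing import Sequence


def quantile(values: Sequence[float], q: float) -> float:
    """Linear-interpolation quantile of a non-empty sequence, with 0 <= q <= 1."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def next_min_cluster_size(cluster_sizes: Sequence[int], q: float = 0.85) -> int:
    """Assumed rule: the next layer's minimum cluster size is the q-quantile
    of the current layer's cluster sizes, truncated to an integer."""
    return int(quantile(cluster_sizes, q))


# Made-up cluster sizes for one layer.
layer_sizes = [10, 12, 15, 20, 30, 45, 60, 100, 150, 400]
print(next_min_cluster_size(layer_sizes))
```

Under this assumed rule, a higher quantile gives a larger minimum size and therefore fewer, coarser clusters in the next layer.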
- Attributes:
- cluster_layers_ : List[ClusterLayer]
A list of the created cluster layers.
- cluster_tree_ : Dict[Tuple[int, int], List[Tuple[int, int]]]
A dictionary representing the cluster tree. Keys are a tuple of (layer, cluster index) and values are lists of tuples representing child clusters.
- fit(clusterable_vectors: ndarray, embedding_vectors: ndarray, layer_class: Type[ClusterLayer] = <class 'toponymy.cluster_layer.ClusterLayerText'>, verbose: bool | None = None, show_progress_bar: bool | None = None, **layer_kwargs) -> Clusterer
- fit_predict(clusterable_vectors: ndarray, embedding_vectors: ndarray, layer_class: Type[ClusterLayer] = <class 'toponymy.cluster_layer.ClusterLayerText'>, verbose: bool | None = None, show_progress_bar: bool | None = None, **layer_kwargs) -> Tuple[List[ClusterLayer], Dict[Tuple[int, int], List[Tuple[int, int]]]]
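The cluster_tree_ attribute is described as a dictionary from (layer, cluster index) keys to lists of child clusters. The sketch below shows one way to walk a dictionary of that shape. The helper name descendants and the example tree are made up, and the assumption that children sit in lower-numbered (finer) layers is not confirmed by this reference.

```python
from typing import Dict, List, Tuple

Node = Tuple[int, int]  # (layer, cluster index)


def descendants(tree: Dict[Node, List[Node]], node: Node) -> List[Node]:
    """Return every cluster below `node`, in depth-first order."""
    found: List[Node] = []
    # Reverse so that children are visited in their listed order.
    stack = list(reversed(tree.get(node, [])))
    while stack:
        current = stack.pop()
        found.append(current)
        stack.extend(reversed(tree.get(current, [])))
    return found


# Made-up tree: layer 2 is assumed to be the coarsest, layer 0 the finest.
example_tree: Dict[Node, List[Node]] = {
    (2, 0): [(1, 0), (1, 1)],
    (1, 0): [(0, 0), (0, 1)],
    (1, 1): [(0, 2)],
}
print(descendants(example_tree, (2, 0)))
```

Clusters without children simply do not appear as keys in this sketch, which is why tree.get is used with an empty default.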
- class toponymy.ClusterLayerText(cluster_labels: ndarray, centroid_vectors: ndarray, layer_id: int, text_embedding_model: TextEmbedderProtocol | None = None, n_keyphrases: int = 16, keyphrase_diversify_alpha: float = 1.0, n_exemplars: int = 8, exemplars_diversify_alpha: float = 1.0, n_subtopics: int = 16, subtopic_diversify_alpha: float = 1.0, exemplar_delimiters: List[str] = None, prompt_format: str = 'combined', prompt_template: Dict[str, Any] | None = None, verbose: bool | None = None, show_progress_bar: bool | None = None, **kwargs: Any)
Bases: ClusterLayer
A cluster layer class for dealing with text data. A cluster layer is a layer of a cluster hierarchy.
Attributes:
- cluster_labels: vector of numeric cluster labels for the clusters in the layer
- centroid_vectors: list of centroid vectors of the clusters in the layer
Methods:
- make_prompts: creates and stores a list of prompts for the clusters in the layer
- make_keyphrases: generates and stores a list of keyphrases for each cluster in the layer
- make_subtopics: generates and stores a list of subtopics for each cluster in the layer
- make_exemplar_texts: generates and stores a list of exemplar texts for each cluster in the layer
- make_exemplar_texts(object_list: List[str], object_vectors: ndarray, method='facility_location') -> Tuple[List[List[str]], List[List[int]]]
- make_keyphrases(keyphrase_list: List[str], object_x_keyphrase_matrix: spmatrix, keyphrase_vectors: ndarray, embedding_model: TextEmbedderProtocol | None = None, method: str = 'information_weighted') -> List[List[str]]
- make_prompts(detail_level: float, all_topic_names: List[List[str]], object_description: str, corpus_description: str, cluster_tree: dict | None = None, prompt_format: str | None = None, prompt_template: Dict[str, Any] | None = None, all_topic_summaries: List[List[str]] | None = None, all_topic_explanations: List[List[str]] | None = None) -> List[str | Dict[str, str]]
- make_subtopics(topic_list: List[str], topic_labels: ndarray, topic_vectors: ndarray | None = None, embedding_model: TextEmbedderProtocol | None = None, method: str = 'facility_location', topic_summaries: List[str] | None = None, topic_explanations: List[str] | None = None) -> List[List[str]]
- name_topics(llm, detail_level: float, all_topic_names: List[List[str]], object_description: str, corpus_description: str, cluster_tree: dict | None = None, embedding_model: TextEmbedderProtocol | None = None, all_topic_summaries: List[List[str]] | None = None, all_topic_explanations: List[List[str]] | None = None) -> List[str]
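make_exemplar_texts returns, per cluster, a list of exemplar texts and a list of their indices. This reference does not define the exemplar methods, so the sketch below only illustrates one plausible reading of a 'central' method: picking the objects closest to the cluster centroid. The function name central_exemplars and the sample data are hypothetical, and the sketch ignores the diversification controlled by exemplars_diversify_alpha.

```python
import math
from typing import List, Sequence, Tuple


def central_exemplars(
    texts: Sequence[str],
    vectors: Sequence[Sequence[float]],
    centroid: Sequence[float],
    n_exemplars: int = 8,
) -> Tuple[List[str], List[int]]:
    """Assumed 'central' selection for one cluster: the n_exemplars objects
    whose vectors are nearest (Euclidean distance) to the centroid."""
    distances = [math.dist(vector, centroid) for vector in vectors]
    # Sort indices by distance; ties keep their original order.
    order = sorted(range(len(texts)), key=lambda i: distances[i])
    chosen = order[:n_exemplars]
    return [texts[i] for i in chosen], chosen


# Made-up members of a single cluster in a 2-D embedding space.
cluster_texts = ["alpha", "beta", "gamma", "delta"]
cluster_vectors = [(0.0, 0.0), (1.0, 0.0), (0.1, 0.1), (5.0, 5.0)]
exemplars, indices = central_exemplars(
    cluster_texts, cluster_vectors, centroid=(0.0, 0.0), n_exemplars=2
)
print(exemplars, indices)
```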
- class toponymy.KeyphraseBuilder(object_to_text: Callable[[Any], str] | None = None, ngram_range: Tuple[int, int] = (1, 4), tokenizer: TokenizerLike | None = None, token_pattern: str = "(?u)\\b\\w[-'\\w]+\\b", max_features: int = 50000, min_occurrences: int = 2, stop_words: FrozenSet[str] = sklearn.feature_extraction.text.ENGLISH_STOP_WORDS, n_jobs: int = -1, embedder: TextEmbedderProtocol | None = None, verbose: bool = None)
Bases: object
A class for building keyphrase count matrices from a list of objects. This can be useful as keyphrases can be a more specific way of helping prompt an LLM for a topic name. To make use of keyphrases you need to be able to convert objects to text. For basic short-text topic modeling, you can use the default settings, which simply assume objects are already short texts. For other kinds of topic modeling you may need to provide a function that converts objects to text.
- Parameters:
- object_to_text : Optional[Callable[[Any], str]], optional
A function that converts objects to text, by default None. If None, it is assumed that the objects are strings. An example of another case would be if objects were images and this function was a zero-shot image captioning model.
- ngram_range : Tuple[int, int], optional
The range of n-grams to consider, by default (1, 4).
- tokenizer : Optional[TokenizerLike], optional
A tokenizer object that has encode and decode methods, by default None. If None, a CountVectorizer is used.
- token_pattern : str, optional
The regular expression pattern to use for tokenization, by default "(?u)\b\w[-'\w]+\b".
- max_features : int, optional
The maximum number of features to consider, by default 50_000.
- min_occurrences : int, optional
The minimum number of occurrences for a keyphrase to be included, by default 2, so keyphrases have to re-occur.
- stop_words : FrozenSet[str], optional
The set of stop words to use, by default sklearn.feature_extraction.text.ENGLISH_STOP_WORDS.
- n_jobs : int, optional
The number of jobs to use in parallel processing, by default -1. If -1, all available cores are used.
- embedder : Optional[TextEmbedderProtocol], optional
An optional embedder to generate keyphrase vectors, by default None.
- verbose : bool, optional
Whether to show progress bars and verbose output. If True, shows all output. If False, suppresses all output.
- Attributes:
- object_x_keyphrase_matrix_ : scipy.sparse.spmatrix
A sparse count matrix of keyphrases in the objects.
- keyphrase_list_ : List[str]
A list of keyphrases in the same order as columns in object_x_keyphrase_matrix.
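As a rough illustration of what an object-by-keyphrase count matrix looks like, the sketch below builds one from short texts using only the standard library. It is a simplified stand-in, not the library's implementation: the function name build_keyphrase_counts is hypothetical, rows are plain dicts rather than a scipy sparse matrix, the stop-word set is a tiny made-up one, stop words are dropped before n-grams are formed, and min_occurrences is applied to total counts across all texts. All of these are assumptions.

```python
import re
from collections import Counter
from typing import Dict, FrozenSet, List, Sequence, Tuple

# Default token pattern from the KeyphraseBuilder signature.
TOKEN_PATTERN = re.compile(r"(?u)\b\w[-'\w]+\b")

# Tiny illustrative stop-word set (the documented default is sklearn's list).
SMALL_STOP_WORDS: FrozenSet[str] = frozenset({"a", "the", "of", "for", "are", "is"})


def build_keyphrase_counts(
    texts: Sequence[str],
    ngram_range: Tuple[int, int] = (1, 2),
    min_occurrences: int = 2,
    stop_words: FrozenSet[str] = SMALL_STOP_WORDS,
) -> Tuple[List[str], List[Dict[int, int]]]:
    """Return (keyphrase_list, rows); rows[i] maps column index -> count."""
    per_text: List[Counter] = []
    totals: Counter = Counter()
    for text in texts:
        tokens = [t for t in TOKEN_PATTERN.findall(text.lower()) if t not in stop_words]
        counts: Counter = Counter()
        for n in range(ngram_range[0], ngram_range[1] + 1):
            for start in range(len(tokens) - n + 1):
                counts[" ".join(tokens[start:start + n])] += 1
        per_text.append(counts)
        totals.update(counts)
    # Keep only phrases that re-occur often enough across the corpus.
    keyphrases = sorted(p for p, c in totals.items() if c >= min_occurrences)
    column = {phrase: j for j, phrase in enumerate(keyphrases)}
    rows = [
        {column[p]: c for p, c in counts.items() if p in column}
        for counts in per_text
    ]
    return keyphrases, rows


sample_texts = [
    "Topic modeling for short texts",
    "Neural topic modeling",
    "Short texts are noisy",
]
keyphrases, rows = build_keyphrase_counts(sample_texts)
print(keyphrases)
print(rows)
```

The column order of each row follows the keyphrase list, mirroring the relationship described between keyphrase_list_ and object_x_keyphrase_matrix_.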