Embedding Wrappers
This document provides an overview of the embedding wrappers available in the embedding_wrappers module. These wrappers allow various text embedding services and APIs to be used seamlessly within Toponymy for generating vector representations of keyphrases and topic names.
Embedding models play a crucial role in Toponymy’s topic naming process. While your documents may already have embeddings from any model, Toponymy uses a separate embedding model internally to encode and compare keyphrases and topic names. This allows for semantic similarity calculations that ensure diversity among selected keyphrases and enable effective topic name disambiguation.
Installing required libraries
Each wrapper may require specific libraries to be installed. You can install them using pip or uv. For example, to enable the use of the OpenAI embedding wrapper you would need to install the openai library:
pip install openai
The following wrappers require the following libraries:
openai: For the OpenAI embedding wrapper.
anthropic: For the Anthropic embedding wrapper.
cohere: For the Cohere embedding wrapper.
azure-ai-inference: For the Azure AI embedding wrapper.
mistralai: For the Mistral embedding wrapper.
requests: For the Voyage AI embedding wrapper.
vllm: For the vLLM embedding wrapper.
Role in Topic Naming
Understanding the role of embeddings in Toponymy’s workflow is essential for choosing the right embedding model. Unlike document embeddings, which need to capture the full semantic content of potentially long texts, the embedding models used by Toponymy focus specifically on short keyphrases and topic names. This creates different requirements and opens up different optimization opportunities.
The primary use cases for embeddings in Toponymy include keyphrase selection diversity, where embeddings ensure that selected keyphrases for each cluster represent diverse aspects of the topic rather than near-duplicates; topic name disambiguation, where semantically similar topic names are identified and re-prompted to create more distinctive labels; and subtopic selection, where embeddings help select representative subtopic names from lower clustering layers to inform higher-level topic naming.
Since these embeddings are used for comparison rather than absolute representation, the choice of embedding model is somewhat flexible. The key requirements are reasonable semantic understanding of domain-specific terminology, consistency in representation, and computational efficiency for processing potentially thousands of keyphrases. You don’t necessarily need the most powerful or expensive embedding model—a good balance of quality and speed is often optimal.
Most embedding wrappers in Toponymy process texts in batches of 96 items to balance API efficiency with memory usage. They include progress bars for long-running operations and handle API rate limiting and retry logic automatically. All wrappers return standardized numpy arrays, ensuring consistent interfaces regardless of the underlying embedding service.
Available Wrappers
API-Based Embedding Wrappers
OpenAIEmbedder
The OpenAIEmbedder provides access to OpenAI’s text embedding models through their API. OpenAI’s embedding models are known for their strong performance across diverse domains and languages.
from toponymy.embedding_wrappers import OpenAIEmbedder
# Initialize with OpenAI API
embedder = OpenAIEmbedder(
api_key="your-openai-api-key", # Or set OPENAI_API_KEY env var
model="text-embedding-3-small", # Cost-effective and performant
base_url="https://api.openai.com/v1" # Optional custom endpoint
)
# Generate embeddings for keyphrases
keyphrases = ["machine learning", "neural networks", "deep learning"]
embeddings = embedder.encode(keyphrases, show_progress_bar=True)
Available Models:
text-embedding-3-small: 1536 dimensions, $0.02/1M tokens (recommended)
text-embedding-3-large: 3072 dimensions, $0.13/1M tokens (higher quality)
text-embedding-ada-002: 1536 dimensions, $0.10/1M tokens (legacy)
CohereEmbedder
The CohereEmbedder provides access to Cohere’s embedding models, which are optimized for search and retrieval tasks. This makes them particularly well-suited for Toponymy’s keyphrase comparison needs.
from toponymy.embedding_wrappers import CohereEmbedder
# Initialize with Cohere API
embedder = CohereEmbedder(
api_key="your-cohere-api-key", # Or set CO_API_KEY env var
model="embed-multilingual-v3.0", # Supports multiple languages
base_url=None, # Optional custom endpoint
httpx_client=None # Optional custom HTTP client
)
# Generate embeddings
embeddings = embedder.encode(
texts=["category theory", "topology", "algebra"],
show_progress_bar=True
)
The Cohere embedder uses input_type=”search_query” by default, which is optimized for comparing keyphrases and topic names against document content.
AnthropicEmbedder
The AnthropicEmbedder provides access to embedding capabilities through Anthropic’s API. While primarily known for their language models, Anthropic also offers embedding services.
from toponymy.embedding_wrappers import AnthropicEmbedder
# Initialize with Anthropic API
embedder = AnthropicEmbedder(
api_key="your-anthropic-api-key", # Or set ANTHROPIC_API_KEY env var
model="claude-3-haiku-20240307", # Model for embedding generation
base_url=None, # Optional custom endpoint
httpx_client=None # Optional custom HTTP client
)
Note: The Anthropic embedder processes texts individually rather than in batches, which may result in slower processing for large keyphrase lists.
AzureAIEmbedder
The AzureAIEmbedder provides access to embedding models deployed through Azure AI services, offering enterprise-grade infrastructure with comprehensive compliance and security features.
from toponymy.embedding_wrappers import AzureAIEmbedder
# Initialize with Azure AI
embedder = AzureAIEmbedder(
api_key="your-azure-api-key",
endpoint="https://your-endpoint.inference.ai.azure.com",
model="your-deployed-embedding-model"
)
# Generate embeddings with automatic retry logic
embeddings = embedder.encode(
texts=["machine learning", "data science", "artificial intelligence"],
show_progress_bar=True
)
The Azure AI embedder includes built-in retry logic with exponential backoff to handle transient API failures gracefully.
MistralEmbedder
The MistralEmbedder provides access to Mistral’s embedding models through their API, offering competitive performance and pricing for text embedding tasks.
from toponymy.embedding_wrappers import MistralEmbedder
# Initialize with Mistral API
embedder = MistralEmbedder(
api_key="your-mistral-api-key",
model="mistral-embed" # Mistral's embedding model
)
# Generate embeddings
embeddings = embedder.encode(
texts=["natural language processing", "text mining", "information retrieval"],
show_progress_bar=True
)
VoyageAIEmbedder
The VoyageAIEmbedder provides access to Voyage AI’s embedding models, which are specifically optimized for retrieval and search applications, making them well-suited for Toponymy’s needs.
from toponymy.embedding_wrappers import VoyageAIEmbedder
# Initialize with Voyage AI API
embedder = VoyageAIEmbedder(
api_key="your-voyage-api-key",
model="voyage-2" # High-performance embedding model
)
# Generate embeddings
embeddings = embedder.encode(
texts=["computer vision", "image processing", "pattern recognition"],
show_progress_bar=True
)
Local Embedding Wrappers
VLLMEmbedder
The VLLMEmbedder provides high-performance local embedding generation using the vLLM library. This wrapper is ideal for scenarios requiring data privacy, high throughput, or freedom from API costs.
from toponymy.embedding_wrappers import VLLMEmbedder
# Initialize with a local embedding model
embedder = VLLMEmbedder(
model="all-MiniLM-L6-v2", # Popular and efficient embedding model
kwargs={
"tensor_parallel_size": 1, # Number of GPUs for tensor parallelism
"gpu_memory_utilization": 0.8, # Fraction of GPU memory to use
"max_model_len": 512 # Maximum sequence length
}
)
# Generate embeddings locally
embeddings = embedder.encode(
texts=["distributed systems", "microservices", "containerization"],
show_progress_bar=True
)
Supported Models:
Popular embedding models that work well with vLLM include:
all-MiniLM-L6-v2: Fast and efficient, good for most use cases
all-mpnet-base-v2: Higher quality, more resource intensive
sentence-transformers/all-MiniLM-L6-v2: Explicit sentence-transformers model
intfloat/e5-base-v2: Strong performance on various tasks
Using Local Embedding Models
While not technically a wrapper, many users find that using SentenceTransformers directly provides an excellent balance of simplicity, performance, and model selection for Toponymy’s embedding needs:
from sentence_transformers import SentenceTransformer
# Initialize a local embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
# Use directly with Toponymy
from toponymy.toponymy import Toponymy
topic_model = Toponymy(
# ... other parameters ...
text_embedding_model=embedding_model, # Pass the model directly
)
This approach provides direct access to the extensive SentenceTransformers model library and avoids the overhead of wrapper layers for local processing.
Choosing the Right Embedding Wrapper
Selecting the appropriate embedding wrapper depends on several key factors that mirror but differ from the considerations for LLM wrappers. The primary dimensions include cost efficiency, processing speed, data privacy requirements, model quality, and integration complexity. However, because embedding models are used for comparison tasks rather than generation, and because keyphrases are typically much shorter than full documents, the requirements are often less stringent than for document embeddings.
Cost Considerations
For API-based embedding services, costs are typically much lower than LLM costs because embeddings are smaller and require less computation than text generation. However, for projects processing large numbers of keyphrases, costs can still accumulate:
Provider |
Model |
Cost per 1M tokens |
Notes |
|---|---|---|---|
OpenAI |
text-embedding-3-small |
$0.02 |
Recommended |
Cohere |
embed-multilingual-v3.0 |
$0.10 |
Multilingual support |
OpenAI |
text-embedding-3-large |
$0.13 |
Higher dimensionality |
Voyage AI |
voyage-2 |
$0.10 |
Optimized for retrieval |
Local (vLLM) |
all-MiniLM-L6-v2 |
Hardware only |
One-time setup cost |
Quality vs. Speed Trade-offs
For Toponymy’s specific use cases, embedding quality requirements are moderate because the models are used for relative comparisons rather than absolute semantic understanding. This means that many embedding models will perform adequately:
High Performance (Recommended):
# Best balance of cost, speed, and quality
from toponymy.embedding_wrappers import OpenAIEmbedder
embedder = OpenAIEmbedder(model="text-embedding-3-small")
# Good multilingual support
from toponymy.embedding_wrappers import CohereEmbedder
embedder = CohereEmbedder(model="embed-multilingual-v3.0")
# Local processing with good performance
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer("all-MiniLM-L6-v2")
Budget-Conscious:
# Most cost-effective API option
from toponymy.embedding_wrappers import OpenAIEmbedder
embedder = OpenAIEmbedder(model="text-embedding-3-small")
# Free local processing (after hardware costs)
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer("all-MiniLM-L6-v2")
High Quality:
# Higher dimensional embeddings for better precision
from toponymy.embedding_wrappers import OpenAIEmbedder
embedder = OpenAIEmbedder(model="text-embedding-3-large")
# Specialized for retrieval tasks
from toponymy.embedding_wrappers import VoyageAIEmbedder
embedder = VoyageAIEmbedder(model="voyage-2")
Privacy and Security
For organizations with strict data privacy requirements, local embedding models are essential:
# Complete data privacy with local processing
from toponymy.embedding_wrappers import VLLMEmbedder
embedder = VLLMEmbedder(model="all-MiniLM-L6-v2")
# Alternative local approach
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer("all-MiniLM-L6-v2")
Enterprise Integration
For enterprise environments, Azure AI integration often provides the best fit with existing infrastructure:
# Enterprise-grade with compliance features
from toponymy.embedding_wrappers import AzureAIEmbedder
embedder = AzureAIEmbedder(
api_key="your-azure-api-key",
endpoint="https://your-endpoint.inference.ai.azure.com",
model="your-deployed-embedding-model"
)
Performance Guidance
Embedding performance in Toponymy is generally not a bottleneck compared to LLM processing, but understanding performance characteristics can help optimize your workflows, especially when processing large numbers of keyphrases.
Batch Size Optimization
Most embedding wrappers use a default batch size of 96 items, which provides a good balance between API efficiency and memory usage. For local models, you may be able to increase batch sizes:
# For local models, larger batches may be more efficient
from toponymy.embedding_wrappers import VLLMEmbedder
embedder = VLLMEmbedder(
model="all-MiniLM-L6-v2",
kwargs={
"max_num_seqs": 256, # Process more sequences in parallel
"gpu_memory_utilization": 0.9
}
)
Processing Large Keyphrase Lists
When working with very large keyphrase lists (>10,000 items), consider the following optimizations:
# Enable progress bars for long-running operations
embeddings = embedder.encode(
texts=large_keyphrase_list,
show_progress_bar=True, # Monitor progress
verbose=True # Additional logging
)
# For very large lists, consider processing in chunks
import numpy as np
chunk_size = 5000
all_embeddings = []
for i in range(0, len(large_keyphrase_list), chunk_size):
chunk = large_keyphrase_list[i:i+chunk_size]
chunk_embeddings = embedder.encode(chunk, show_progress_bar=True)
all_embeddings.append(chunk_embeddings)
final_embeddings = np.vstack(all_embeddings)
Memory Management
For memory-constrained environments, consider using smaller embedding models or processing in smaller batches:
# Smaller model for memory-constrained environments
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer("all-MiniLM-L6-v2") # ~90MB model
# Alternative: Even smaller model
embedder = SentenceTransformer("paraphrase-MiniLM-L3-v2") # ~60MB model
Integration with Toponymy
Embedding wrappers integrate seamlessly with Toponymy’s main workflow. Here’s how to use them effectively in different scenarios:
Basic Usage
from toponymy.toponymy import Toponymy
from toponymy.embedding_wrappers import OpenAIEmbedder
from toponymy.llm_wrappers import OpenAI
# Initialize embedding and LLM models
embedding_model = OpenAIEmbedder(
api_key="your-openai-api-key",
model="text-embedding-3-small"
)
llm_model = OpenAI(
api_key="your-openai-api-key",
model="gpt-4o-mini"
)
# Create Toponymy instance
topic_model = Toponymy(
llm_wrapper=llm_model,
text_embedding_model=embedding_model,
# ... other parameters ...
)
Mixed Local and API Approach
You can use local embeddings with API-based LLMs, or vice versa, depending on your specific requirements:
from sentence_transformers import SentenceTransformer
from toponymy.llm_wrappers import OpenAI
# Local embeddings for privacy, API LLM for quality
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
llm_model = OpenAI(api_key="your-api-key", model="gpt-4o-mini")
topic_model = Toponymy(
llm_wrapper=llm_model,
text_embedding_model=embedding_model,
)
Enterprise Configuration
For enterprise environments, Azure AI services can provide a unified platform:
from toponymy.embedding_wrappers import AzureAIEmbedder
from toponymy.llm_wrappers import AzureAI
# Unified Azure AI configuration
embedding_model = AzureAIEmbedder(
api_key="your-azure-api-key",
endpoint="https://your-embedding-endpoint.inference.ai.azure.com",
model="your-embedding-model"
)
llm_model = AzureAI(
api_key="your-azure-api-key",
endpoint="https://your-llm-endpoint.inference.ai.azure.com",
model="your-llm-model"
)
topic_model = Toponymy(
llm_wrapper=llm_model,
text_embedding_model=embedding_model,
)
Troubleshooting Common Issues
Authentication Errors:
Ensure API keys are correctly set either as parameters or environment variables:
export OPENAI_API_KEY="your-openai-api-key"
export CO_API_KEY="your-cohere-api-key"
export ANTHROPIC_API_KEY="your-anthropic-api-key"
Memory Issues with Local Models:
Reduce batch sizes or use smaller models:
# Reduce memory usage
from toponymy.embedding_wrappers import VLLMEmbedder
embedder = VLLMEmbedder(
model="all-MiniLM-L6-v2",
kwargs={
"gpu_memory_utilization": 0.5, # Use less GPU memory
"max_model_len": 256 # Shorter sequences
}
)
Rate Limiting Issues:
Most wrappers include automatic retry logic, but you can adjust batch sizes if needed:
# Process in smaller batches to avoid rate limits
import time
batch_size = 48 # Smaller than default 96
all_embeddings = []
for i in range(0, len(texts), batch_size):
batch = texts[i:i+batch_size]
embeddings = embedder.encode(batch)
all_embeddings.append(embeddings)
time.sleep(1) # Brief pause between batches
final_embeddings = np.vstack(all_embeddings)
Recommended Configurations
Based on common use cases and requirements, here are recommended embedding configurations:
For Getting Started:
# Simple, reliable, cost-effective
from toponymy.embedding_wrappers import OpenAIEmbedder
embedder = OpenAIEmbedder(
api_key="your-openai-api-key",
model="text-embedding-3-small"
)
For Budget-Conscious Projects:
# Free after initial setup, good performance
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer("all-MiniLM-L6-v2")
For Maximum Privacy:
# Complete local processing
from toponymy.embedding_wrappers import VLLMEmbedder
embedder = VLLMEmbedder(model="all-MiniLM-L6-v2")
For Multilingual Content:
# Strong multilingual support
from toponymy.embedding_wrappers import CohereEmbedder
embedder = CohereEmbedder(
api_key="your-cohere-api-key",
model="embed-multilingual-v3.0"
)
For Enterprise Environments:
# Enterprise-grade with compliance
from toponymy.embedding_wrappers import AzureAIEmbedder
embedder = AzureAIEmbedder(
api_key="your-azure-api-key",
endpoint="https://your-endpoint.inference.ai.azure.com",
model="your-deployed-model"
)
For High-Performance Requirements:
# High-quality embeddings for demanding applications
from toponymy.embedding_wrappers import OpenAIEmbedder
embedder = OpenAIEmbedder(
api_key="your-openai-api-key",
model="text-embedding-3-large"
)
The choice of embedding wrapper should align with your overall project architecture, security requirements, and performance needs. Since embedding costs are typically much lower than LLM costs, it’s often worth choosing a slightly higher-quality option for better topic naming results, especially for production applications where topic quality is important.