Embedding Wrappers

This document provides an overview of the embedding wrappers available in the embedding_wrappers module. These wrappers allow various text embedding services and APIs to be used seamlessly within Toponymy for generating vector representations of keyphrases and topic names.

Embedding models play a crucial role in Toponymy’s topic naming process. While your documents may already have embeddings from any model, Toponymy uses a separate embedding model internally to encode and compare keyphrases and topic names. This allows for semantic similarity calculations that ensure diversity among selected keyphrases and enable effective topic name disambiguation.

Installing required libraries

Each wrapper may require specific libraries to be installed. You can install them using pip or uv. For example, to use the OpenAI embedding wrapper, install the openai library:

pip install openai

Each wrapper depends on one of the following libraries:

  • openai: For the OpenAI embedding wrapper.

  • anthropic: For the Anthropic embedding wrapper.

  • cohere: For the Cohere embedding wrapper.

  • azure-ai-inference: For the Azure AI embedding wrapper.

  • mistralai: For the Mistral embedding wrapper.

  • requests: For the Voyage AI embedding wrapper.

  • vllm: For the vLLM embedding wrapper.

Role in Topic Naming

Understanding the role of embeddings in Toponymy’s workflow is essential for choosing the right embedding model. Unlike document embeddings, which need to capture the full semantic content of potentially long texts, the embedding models used by Toponymy focus specifically on short keyphrases and topic names. This creates different requirements and opens up different optimization opportunities.

The primary use cases for embeddings in Toponymy are:

  • Keyphrase selection diversity: embeddings ensure that the keyphrases selected for each cluster represent diverse aspects of the topic rather than near-duplicates.

  • Topic name disambiguation: semantically similar topic names are identified and re-prompted to create more distinctive labels.

  • Subtopic selection: embeddings help select representative subtopic names from lower clustering layers to inform higher-level topic naming.
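
To make the idea of keyphrase selection diversity concrete, here is a minimal sketch of one way to pick mutually dissimilar items from a set of embeddings. It is a greedy farthest-point selection written for illustration only; the function name select_diverse and the toy vectors are assumptions, and this is not necessarily the procedure Toponymy uses internally.

```python
import numpy as np


def select_diverse(embeddings: np.ndarray, n_select: int) -> list[int]:
    """Greedily pick indices whose embeddings are mutually dissimilar."""
    # Normalize rows so that dot products are cosine similarities.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    selected = [0]  # Start from the first (e.g. highest-ranked) keyphrase.
    # Track each candidate's highest similarity to anything already selected.
    max_sim = unit @ unit[0]
    while len(selected) < min(n_select, len(unit)):
        max_sim[selected] = np.inf  # Never re-pick a selected item.
        best = int(np.argmin(max_sim))  # Least similar to the current selection.
        selected.append(best)
        max_sim = np.maximum(max_sim, unit @ unit[best])
    return selected


# Toy 2-D vectors standing in for real keyphrase embeddings.
vectors = np.array([
    [1.0, 0.0],   # first keyphrase
    [0.99, 0.1],  # near-duplicate of the first
    [0.0, 1.0],   # unrelated keyphrase
])
picked = select_diverse(vectors, n_select=2)
print(picked)
```

With these toy vectors the near-duplicate is skipped in favor of the unrelated keyphrase, which is the behavior the diversity step is meant to achieve.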

Since these embeddings are used for comparison rather than absolute representation, the choice of embedding model is somewhat flexible. The key requirements are reasonable semantic understanding of domain-specific terminology, consistency in representation, and computational efficiency for processing potentially thousands of keyphrases. You don’t necessarily need the most powerful or expensive embedding model; a good balance of quality and speed is often optimal.
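
Because only relative similarity matters, a comparison such as finding near-duplicate topic names reduces to pairwise cosine similarity. The sketch below shows that calculation with plain numpy; the helper name similar_pairs and the 0.9 threshold are illustrative assumptions, not values taken from Toponymy.

```python
import numpy as np


def similar_pairs(embeddings: np.ndarray, threshold: float = 0.9) -> list[tuple[int, int]]:
    """Return index pairs whose cosine similarity is at or above the threshold."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sims = unit @ unit.T
    # Only look above the diagonal so each pair is reported once.
    rows, cols = np.triu_indices(len(unit), k=1)
    mask = sims[rows, cols] >= threshold
    return [(int(i), int(j)) for i, j in zip(rows[mask], cols[mask])]


# Toy vectors standing in for topic-name embeddings.
name_vectors = np.array([[1.0, 0.0], [0.98, 0.2], [0.0, 1.0]])
pairs = similar_pairs(name_vectors, threshold=0.9)
print(pairs)
```

Any embedding model that places related phrases closer together than unrelated ones will produce usable results here, which is why the exact model choice is flexible.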

Most embedding wrappers in Toponymy process texts in batches of 96 items to balance API efficiency with memory usage. They include progress bars for long-running operations and handle API rate limiting and retry logic automatically. All wrappers return standardized numpy arrays, ensuring consistent interfaces regardless of the underlying embedding service.
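
The batching pattern described above can be sketched as follows. This is an illustration of the general technique rather than the wrappers' actual code: encode_in_batches and fake_embed are hypothetical names, and fake_embed stands in for a real API call.

```python
from typing import Callable, Sequence

import numpy as np


def encode_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], np.ndarray],
    batch_size: int = 96,
) -> np.ndarray:
    """Embed texts batch by batch and stack the results into one array."""
    chunks = [
        embed_batch(texts[start:start + batch_size])
        for start in range(0, len(texts), batch_size)
    ]
    if not chunks:
        return np.empty((0, 0))  # Nothing to embed.
    return np.vstack(chunks)


def fake_embed(batch: Sequence[str]) -> np.ndarray:
    """Hypothetical stand-in for an API call: one 4-dimensional vector per text."""
    return np.array([[len(text), 0.0, 0.0, 1.0] for text in batch], dtype=float)


texts = [f"keyphrase {i}" for i in range(200)]
embeddings = encode_in_batches(texts, fake_embed)
print(embeddings.shape)
```

Here 200 texts are sent as three batches (96, 96, and 8 items), and the caller still receives a single two-dimensional array with one row per text.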

Available Wrappers

API-Based Embedding Wrappers

OpenAIEmbedder

The OpenAIEmbedder provides access to OpenAI’s text embedding models through their API. OpenAI’s embedding models are known for their strong performance across diverse domains and languages.

from toponymy.embedding_wrappers import OpenAIEmbedder

# Initialize with OpenAI API
embedder = OpenAIEmbedder(
    api_key="your-openai-api-key",  # Or set OPENAI_API_KEY env var
    model="text-embedding-3-small",  # Cost-effective and performant
    base_url="https://api.openai.com/v1"  # Optional custom endpoint
)

# Generate embeddings for keyphrases
keyphrases = ["machine learning", "neural networks", "deep learning"]
embeddings = embedder.encode(keyphrases, show_progress_bar=True)

Available Models:

  • text-embedding-3-small: 1536 dimensions, $0.02/1M tokens (recommended)

  • text-embedding-3-large: 3072 dimensions, $0.13/1M tokens (higher quality)

  • text-embedding-ada-002: 1536 dimensions, $0.10/1M tokens (legacy)

CohereEmbedder

The CohereEmbedder provides access to Cohere’s embedding models, which are optimized for search and retrieval tasks. This makes them particularly well-suited for Toponymy’s keyphrase comparison needs.

from toponymy.embedding_wrappers import CohereEmbedder

# Initialize with Cohere API
embedder = CohereEmbedder(
    api_key="your-cohere-api-key",  # Or set CO_API_KEY env var
    model="embed-multilingual-v3.0",  # Supports multiple languages
    base_url=None,  # Optional custom endpoint
    httpx_client=None  # Optional custom HTTP client
)

# Generate embeddings
embeddings = embedder.encode(
    texts=["category theory", "topology", "algebra"],
    show_progress_bar=True
)

The Cohere embedder uses input_type="search_query" by default, which is optimized for comparing keyphrases and topic names against document content.

AnthropicEmbedder

The AnthropicEmbedder provides access to embedding capabilities through Anthropic’s API. While primarily known for their language models, Anthropic also offers embedding services.

from toponymy.embedding_wrappers import AnthropicEmbedder

# Initialize with Anthropic API
embedder = AnthropicEmbedder(
    api_key="your-anthropic-api-key",  # Or set ANTHROPIC_API_KEY env var
    model="claude-3-haiku-20240307",  # Model for embedding generation
    base_url=None,  # Optional custom endpoint
    httpx_client=None  # Optional custom HTTP client
)

Note: The Anthropic embedder processes texts individually rather than in batches, which may result in slower processing for large keyphrase lists.

AzureAIEmbedder

The AzureAIEmbedder provides access to embedding models deployed through Azure AI services, offering enterprise-grade infrastructure with comprehensive compliance and security features.

from toponymy.embedding_wrappers import AzureAIEmbedder

# Initialize with Azure AI
embedder = AzureAIEmbedder(
    api_key="your-azure-api-key",
    endpoint="https://your-endpoint.inference.ai.azure.com",
    model="your-deployed-embedding-model"
)

# Generate embeddings with automatic retry logic
embeddings = embedder.encode(
    texts=["machine learning", "data science", "artificial intelligence"],
    show_progress_bar=True
)

The Azure AI embedder includes built-in retry logic with exponential backoff to handle transient API failures gracefully.
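
For readers unfamiliar with the pattern, retry with exponential backoff can be sketched in a few lines. This is a generic illustration under assumed defaults (five attempts, a one-second base delay); with_backoff and flaky_call are hypothetical names, and the wrapper's real retry settings may differ.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(
    call: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on failure with delays of base_delay * 2**attempt."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # Out of retries: surface the last error.
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("max_retries must be at least 1")


# Simulate a call that fails twice before succeeding.
attempts = {"count": 0}


def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


delays: list[float] = []
result = with_backoff(flaky_call, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

The sleep function is injectable so the example records the delays instead of actually waiting; in real use the default time.sleep applies.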

MistralEmbedder

The MistralEmbedder provides access to Mistral’s embedding models through their API, offering competitive performance and pricing for text embedding tasks.

from toponymy.embedding_wrappers import MistralEmbedder

# Initialize with Mistral API
embedder = MistralEmbedder(
    api_key="your-mistral-api-key",
    model="mistral-embed"  # Mistral's embedding model
)

# Generate embeddings
embeddings = embedder.encode(
    texts=["natural language processing", "text mining", "information retrieval"],
    show_progress_bar=True
)

VoyageAIEmbedder

The VoyageAIEmbedder provides access to Voyage AI’s embedding models, which are specifically optimized for retrieval and search applications, making them well-suited for Toponymy’s needs.

from toponymy.embedding_wrappers import VoyageAIEmbedder

# Initialize with Voyage AI API
embedder = VoyageAIEmbedder(
    api_key="your-voyage-api-key",
    model="voyage-2"  # High-performance embedding model
)

# Generate embeddings
embeddings = embedder.encode(
    texts=["computer vision", "image processing", "pattern recognition"],
    show_progress_bar=True
)

Local Embedding Wrappers

VLLMEmbedder

The VLLMEmbedder provides high-performance local embedding generation using the vLLM library. This wrapper is ideal for scenarios requiring data privacy, high throughput, or freedom from API costs.

from toponymy.embedding_wrappers import VLLMEmbedder

# Initialize with a local embedding model
embedder = VLLMEmbedder(
    model="all-MiniLM-L6-v2",  # Popular and efficient embedding model
    kwargs={
        "tensor_parallel_size": 1,  # Number of GPUs for tensor parallelism
        "gpu_memory_utilization": 0.8,  # Fraction of GPU memory to use
        "max_model_len": 512  # Maximum sequence length
    }
)

# Generate embeddings locally
embeddings = embedder.encode(
    texts=["distributed systems", "microservices", "containerization"],
    show_progress_bar=True
)

Supported Models:

Popular embedding models that work well with vLLM include:

  • all-MiniLM-L6-v2: Fast and efficient, good for most use cases

  • all-mpnet-base-v2: Higher quality, more resource intensive

  • sentence-transformers/all-MiniLM-L6-v2: Explicit sentence-transformers model

  • intfloat/e5-base-v2: Strong performance on various tasks

Using Local Embedding Models

Although SentenceTransformers is not technically a wrapper, many users find that using it directly provides an excellent balance of simplicity, performance, and model selection for Toponymy’s embedding needs:

from sentence_transformers import SentenceTransformer

# Initialize a local embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Use directly with Toponymy
from toponymy.toponymy import Toponymy

topic_model = Toponymy(
    # ... other parameters ...
    text_embedding_model=embedding_model,  # Pass the model directly
)

This approach provides direct access to the extensive SentenceTransformers model library and avoids the overhead of wrapper layers for local processing.
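
The examples in this document all call encode(texts, show_progress_bar=...) and receive a numpy array, which suggests that any object with a compatible encode method could be passed as text_embedding_model. Assuming that holds, a small stand-in embedder can be handy for testing a pipeline without downloading a model or calling an API. The HashingEmbedder class below is hypothetical and its vectors carry no semantic meaning, so it is only suitable for checking that the wiring works.

```python
import hashlib

import numpy as np


class HashingEmbedder:
    """Hypothetical test double: deterministic vectors derived from a hash of each text."""

    def __init__(self, dimension: int = 16):
        self.dimension = dimension

    def encode(self, texts, show_progress_bar: bool = False, **kwargs) -> np.ndarray:
        rows = []
        for text in texts:
            # Seed a random generator from the text so equal texts get equal vectors.
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            seed = int.from_bytes(digest[:8], "big")
            rng = np.random.default_rng(seed)
            rows.append(rng.standard_normal(self.dimension))
        return np.vstack(rows)


embedder = HashingEmbedder(dimension=8)
vectors = embedder.encode(["topology", "algebra", "topology"])
print(vectors.shape)
```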

Choosing the Right Embedding Wrapper

Selecting the appropriate embedding wrapper depends on several factors that overlap with, but are not identical to, the considerations for LLM wrappers. The primary dimensions include cost efficiency, processing speed, data privacy requirements, model quality, and integration complexity. However, because embedding models are used for comparison tasks rather than generation, and because keyphrases are typically much shorter than full documents, the requirements are often less stringent than for document embeddings.

Cost Considerations

For API-based embedding services, costs are typically much lower than LLM costs because producing an embedding requires far less computation than generating text. However, for projects processing large numbers of keyphrases, costs can still accumulate:

Embedding Cost Comparison (approximate)

Provider      Model                    Cost per 1M tokens  Notes
------------  -----------------------  ------------------  -----------------------
OpenAI        text-embedding-3-small   $0.02               Recommended
Cohere        embed-multilingual-v3.0  $0.10               Multilingual support
OpenAI        text-embedding-3-large   $0.13               Higher dimensionality
Voyage AI     voyage-2                 $0.10               Optimized for retrieval
Local (vLLM)  all-MiniLM-L6-v2         Hardware only       One-time setup cost
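
To turn the approximate prices listed above into a rough budget, multiply the expected token count by the per-token rate. The sketch below does that arithmetic; the figure of 4 tokens per keyphrase and the 50,000-keyphrase workload are illustrative assumptions, and actual token counts depend on the provider's tokenizer.

```python
# Approximate prices in US dollars per one million tokens, as listed above.
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.02,
    "embed-multilingual-v3.0": 0.10,
    "text-embedding-3-large": 0.13,
    "voyage-2": 0.10,
}


def estimate_cost(n_keyphrases: int, tokens_per_keyphrase: float, model: str) -> float:
    """Estimate the API cost in dollars for embedding a set of keyphrases."""
    total_tokens = n_keyphrases * tokens_per_keyphrase
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]


# Assumed workload: 50,000 keyphrases at roughly 4 tokens each (200,000 tokens).
cost_small = estimate_cost(50_000, 4, "text-embedding-3-small")
cost_large = estimate_cost(50_000, 4, "text-embedding-3-large")
print(f"small: ${cost_small:.4f}, large: ${cost_large:.4f}")
```

Under these assumptions the whole workload costs well under one dollar with any of the listed API models, so for short keyphrases price differences usually matter less than quality, latency, and privacy.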

Quality vs. Speed Trade-offs

For Toponymy’s specific use cases, embedding quality requirements are moderate because the models are used for relative comparisons rather than absolute semantic understanding. This means that many embedding models will perform adequately:

High Performance (Recommended):

# Best balance of cost, speed, and quality
from toponymy.embedding_wrappers import OpenAIEmbedder
embedder = OpenAIEmbedder(model="text-embedding-3-small")

# Good multilingual support
from toponymy.embedding_wrappers import CohereEmbedder
embedder = CohereEmbedder(model="embed-multilingual-v3.0")

# Local processing with good performance
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer("all-MiniLM-L6-v2")

Budget-Conscious:

# Most cost-effective API option
from toponymy.embedding_wrappers import OpenAIEmbedder
embedder = OpenAIEmbedder(model="text-embedding-3-small")

# Free local processing (after hardware costs)
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer("all-MiniLM-L6-v2")

High Quality:

# Higher dimensional embeddings for better precision
from toponymy.embedding_wrappers import OpenAIEmbedder
embedder = OpenAIEmbedder(model="text-embedding-3-large")

# Specialized for retrieval tasks
from toponymy.embedding_wrappers import VoyageAIEmbedder
embedder = VoyageAIEmbedder(model="voyage-2")

Privacy and Security

For organizations with strict data privacy requirements, local embedding models are essential:

# Complete data privacy with local processing
from toponymy.embedding_wrappers import VLLMEmbedder
embedder = VLLMEmbedder(model="all-MiniLM-L6-v2")

# Alternative local approach
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer("all-MiniLM-L6-v2")

Enterprise Integration

For enterprise environments, Azure AI integration often provides the best fit with existing infrastructure:

# Enterprise-grade with compliance features
from toponymy.embedding_wrappers import AzureAIEmbedder
embedder = AzureAIEmbedder(
    api_key="your-azure-api-key",
    endpoint="https://your-endpoint.inference.ai.azure.com",
    model="your-deployed-embedding-model"
)

Performance Guidance

Embedding performance in Toponymy is generally not a bottleneck compared to LLM processing, but understanding performance characteristics can help optimize your workflows, especially when processing large numbers of keyphrases.

Batch Size Optimization

Most embedding wrappers use a default batch size of 96 items, which provides a good balance between API efficiency and memory usage. For local models, you may be able to increase batch sizes:

# For local models, larger batches may be more efficient
from toponymy.embedding_wrappers import VLLMEmbedder

embedder = VLLMEmbedder(
    model="all-MiniLM-L6-v2",
    kwargs={
        "max_num_seqs": 256,  # Process more sequences in parallel
        "gpu_memory_utilization": 0.9
    }
)

Processing Large Keyphrase Lists

When working with very large keyphrase lists (>10,000 items), consider the following optimizations:

# Enable progress bars for long-running operations
embeddings = embedder.encode(
    texts=large_keyphrase_list,
    show_progress_bar=True,  # Monitor progress
    verbose=True  # Additional logging
)

# For very large lists, consider processing in chunks
import numpy as np

chunk_size = 5000
all_embeddings = []

for i in range(0, len(large_keyphrase_list), chunk_size):
    chunk = large_keyphrase_list[i:i+chunk_size]
    chunk_embeddings = embedder.encode(chunk, show_progress_bar=True)
    all_embeddings.append(chunk_embeddings)

final_embeddings = np.vstack(all_embeddings)

Memory Management

For memory-constrained environments, consider using smaller embedding models or processing in smaller batches:

# Smaller model for memory-constrained environments
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # ~90MB model

# Alternative: Even smaller model
embedder = SentenceTransformer("paraphrase-MiniLM-L3-v2")  # ~60MB model
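
Model size is only part of the memory picture; the embedding matrix itself also takes space. A rough estimate is the number of texts times the embedding dimension times the bytes per value. The sketch below assumes float32 storage and uses 384 dimensions for all-MiniLM-L6-v2 and 3072 for text-embedding-3-large; the helper name is hypothetical.

```python
import numpy as np


def embedding_matrix_megabytes(n_texts: int, dimension: int, dtype=np.float32) -> float:
    """Approximate size in MiB of an (n_texts, dimension) embedding array."""
    bytes_total = n_texts * dimension * np.dtype(dtype).itemsize
    return bytes_total / 1024 ** 2


small = embedding_matrix_megabytes(10_000, 384)   # 384-dimensional local model
large = embedding_matrix_megabytes(10_000, 3072)  # 3072-dimensional API model
print(f"{small:.1f} MiB vs {large:.1f} MiB")
```

For 10,000 keyphrases this works out to roughly 15 MiB versus 117 MiB, so lower-dimensional models reduce memory use for the stored vectors as well as for the model weights.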

Integration with Toponymy

Embedding wrappers integrate seamlessly with Toponymy’s main workflow. Here’s how to use them effectively in different scenarios:

Basic Usage

from toponymy.toponymy import Toponymy
from toponymy.embedding_wrappers import OpenAIEmbedder
from toponymy.llm_wrappers import OpenAI

# Initialize embedding and LLM models
embedding_model = OpenAIEmbedder(
    api_key="your-openai-api-key",
    model="text-embedding-3-small"
)

llm_model = OpenAI(
    api_key="your-openai-api-key",
    model="gpt-4o-mini"
)

# Create Toponymy instance
topic_model = Toponymy(
    llm_wrapper=llm_model,
    text_embedding_model=embedding_model,
    # ... other parameters ...
)

Mixed Local and API Approach

You can use local embeddings with API-based LLMs, or vice versa, depending on your specific requirements:

from sentence_transformers import SentenceTransformer
from toponymy.llm_wrappers import OpenAI

# Local embeddings for privacy, API LLM for quality
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
llm_model = OpenAI(api_key="your-api-key", model="gpt-4o-mini")

topic_model = Toponymy(
    llm_wrapper=llm_model,
    text_embedding_model=embedding_model,
)

Enterprise Configuration

For enterprise environments, Azure AI services can provide a unified platform:

from toponymy.embedding_wrappers import AzureAIEmbedder
from toponymy.llm_wrappers import AzureAI

# Unified Azure AI configuration
embedding_model = AzureAIEmbedder(
    api_key="your-azure-api-key",
    endpoint="https://your-embedding-endpoint.inference.ai.azure.com",
    model="your-embedding-model"
)

llm_model = AzureAI(
    api_key="your-azure-api-key",
    endpoint="https://your-llm-endpoint.inference.ai.azure.com",
    model="your-llm-model"
)

topic_model = Toponymy(
    llm_wrapper=llm_model,
    text_embedding_model=embedding_model,
)

Troubleshooting Common Issues

Authentication Errors:

Ensure API keys are correctly set either as parameters or environment variables:

export OPENAI_API_KEY="your-openai-api-key"
export CO_API_KEY="your-cohere-api-key"
export ANTHROPIC_API_KEY="your-anthropic-api-key"
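
Before constructing a wrapper, it can help to confirm from Python that the expected variables are actually visible to the process. This is a minimal sketch using only the standard library; the helper name missing_api_keys is hypothetical, and the variable names are the ones shown above.

```python
import os

API_KEY_VARIABLES = ["OPENAI_API_KEY", "CO_API_KEY", "ANTHROPIC_API_KEY"]


def missing_api_keys(names, environ=os.environ):
    """Return the variable names that are unset or contain only whitespace."""
    return [name for name in names if not environ.get(name, "").strip()]


missing = missing_api_keys(API_KEY_VARIABLES)
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("All API key variables are set.")
```

Only the keys for the providers you actually use need to be set; the others can be ignored.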

Memory Issues with Local Models:

Reduce batch sizes or use smaller models:

# Reduce memory usage
from toponymy.embedding_wrappers import VLLMEmbedder

embedder = VLLMEmbedder(
    model="all-MiniLM-L6-v2",
    kwargs={
        "gpu_memory_utilization": 0.5,  # Use less GPU memory
        "max_model_len": 256  # Shorter sequences
    }
)

Rate Limiting Issues:

Most wrappers include automatic retry logic, but you can adjust batch sizes if needed:

# Process in smaller batches to avoid rate limits
import time

import numpy as np

batch_size = 48  # Smaller than default 96
all_embeddings = []

for i in range(0, len(texts), batch_size):
    batch = texts[i:i+batch_size]
    embeddings = embedder.encode(batch)
    all_embeddings.append(embeddings)
    time.sleep(1)  # Brief pause between batches

final_embeddings = np.vstack(all_embeddings)