VectorFlow is a powerful vector database and similarity search API built with FastAPI. It provides efficient vector indexing, storage, and retrieval capabilities for developing AI applications that require semantic search.
VectorFlow enables you to:
- Store and organize text data in a hierarchical structure (libraries → documents → chunks)
- Generate vector embeddings for text using the Cohere API
- Index vectors using different algorithms optimized for various use cases
- Perform fast similarity searches with optional metadata filtering
- Update your vector database incrementally without full rebuilds
Prerequisites:
- Python 3.11 (recommended; Python 3.9+ also works)
- Docker (for containerization)
- Kubernetes & Minikube (for deployment)
- Cohere API key
VectorFlow uses the Cohere API for generating embeddings. You need to set up an API key:
- Create a .env file in the project root:
      COHERE_API_KEY=your_api_key_here
- The application reads this file automatically when it runs.
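The README does not say how the key is loaded, so the following is only a minimal standard-library sketch of the idea. The helper names `load_env_file` and `get_cohere_api_key` are hypothetical and are not part of VectorFlow; the sketch assumes plain `KEY=VALUE` lines and prefers a value already present in the process environment.

```python
import os
from pathlib import Path
from typing import Dict


def load_env_file(path: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines from a .env-style file (hypothetical helper)."""
    values: Dict[str, str] = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_cohere_api_key(env_path: str = ".env") -> str:
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("COHERE_API_KEY")
    if not key and Path(env_path).exists():
        key = load_env_file(env_path).get("COHERE_API_KEY")
    if not key:
        raise RuntimeError("COHERE_API_KEY is not set")
    return key
```

Failing early with a clear error when the key is missing tends to be easier to debug than a failed embedding request later on.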
- Clone the repository:
      git clone https://github.com/yourusername/VectorFlow.git
      cd VectorFlow
- Set up a Python virtual environment with Python 3.11 (recommended):
      python3.11 -m venv venv
      source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
      pip install -r requirements.txt
- Make the run scripts executable (if they are not already):
      chmod +x run.sh
      chmod +x run/*.sh
- Run the application using the provided script:
      ./run.sh
  Or manually with uvicorn:
      uvicorn app.main:app --reload
- Access the API documentation at http://localhost:8000/docs
- Start Minikube:
      minikube start
- Make the deployment and run scripts executable (if they are not already):
      chmod +x deploy-minikube.sh
      chmod +x run.sh
      chmod +x run/*.sh
- Deploy VectorFlow:
      ./deploy-minikube.sh
- Access the API:
      minikube service vectorflow --url
VectorFlow includes a comprehensive interactive CLI demo to help you explore its capabilities. The demo is implemented through a modular collection of shell scripts.
./run.sh
The interactive CLI provides two main modes:
- Guided Demo: Walks you through a complete example workflow
- Custom Steps: Allows you to execute specific operations in any order
run/
├── main.sh                   # Main script coordinating all demo functionality
├── functions/                # Shared utility functions
│   ├── dependencies.sh       # Dependency checks and environment setup
│   └── utils.sh              # Helper functions and formatting utilities
└── steps/                    # Individual operation implementations
    ├── create_library.sh     # Create vector libraries
    ├── create_document.sh    # Add documents to libraries
    ├── add_chunks.sh         # Add text chunks with embeddings
    ├── build_index.sh        # Build vector search indexes
    ├── search.sh             # Perform vector similarity searches
    ├── list_operations.sh    # List libraries, documents, and chunks
    ├── set_ids.sh            # Manually set library and document IDs
    └── delete_operations.sh  # Delete libraries, documents, and chunks
The demo provides access to all core VectorFlow functionality:
- Library Management
  - Create libraries with metadata
  - List available libraries
  - Delete libraries and their contents
- Document Management
  - Add documents to libraries with metadata
  - List documents in a library
  - Delete documents
- Chunk Operations
  - Add text chunks with automatic embedding generation
  - Delete specific chunks
  - List chunks in a document
- Vector Search Capabilities
  - Build different types of vector indexes (Linear, KD-Tree, LSH)
  - Perform vector similarity searches
  - Run text-to-vector searches with automatic embedding generation
  - Apply metadata filters to search results
- Administration
  - View index status and statistics
  - Set library and document IDs manually
When you run ./run.sh, the script first checks dependencies and ensures the API server is running. It then presents a menu with the following options:
- Run Example Demo: Follows a recommended sequence of operations:
  - Create a library
  - Add a document
  - Add text chunks with embeddings
  - Build a vector index
  - Perform searches
- Run Custom Steps: Opens a menu where you can choose specific operations:
  - Library operations
  - Document operations
  - Chunk operations
  - Index building and searches
  - Listing and deleting operations
The demo provides helpful prompts and colorized output to guide you through the process.
VectorFlow follows a modular architecture with these key components:
- Hierarchical data structure with libraries, documents, and chunks
- Safe concurrent access using async/await patterns
- Atomic operations with library-level locking
- Linear Index: Simple brute force approach (O(n×d) query time)
- KD-Tree Index: Space-partitioning for lower dimensions (O(log n) to O(n) query time)
- LSH Index: Locality-sensitive hashing for high dimensions (sublinear query time)
- Pydantic schemas for type validation and serialization
- Support for metadata at all levels of the hierarchy
- Summary models optimized for API responses
- REST API with FastAPI
- Endpoints for managing libraries, documents, and chunks
- Vector and text search capabilities
VectorFlow/
├── app/ # Main application package
│ ├── api/ # API endpoints
│ │ ├── endpoints/ # Individual endpoint modules
│ │ └── api.py # API router
│ ├── core/ # Core application components
│ ├── db/ # Database components
│ ├── models/ # Data models
│ ├── services/ # Business logic services
│ └── main.py # Application entry point
├── helm/ # Helm chart for Kubernetes deployment
├── tests/ # Test directory
└── Dockerfile # Docker image definition
VectorFlow uses Pydantic to define a robust schema hierarchy:
from typing import Any, List, Optional
from uuid import UUID, uuid4

from pydantic import Field

# Core entity models with hierarchical relationships
# (the *Base and *Metadata classes are not shown in this excerpt)
class Chunk(ChunkBase):
    id: UUID = Field(default_factory=uuid4)
    text: str
    embedding: List[float]
    metadata: ChunkMetadata

class Document(DocumentBase):
    id: UUID = Field(default_factory=uuid4)
    metadata: DocumentMetadata  # title, author, etc.
    chunks: List[Chunk] = []

class Library(LibraryBase):
    id: UUID = Field(default_factory=uuid4)
    name: str
    metadata: LibraryMetadata  # description, etc.
    documents: List[Document] = []
    index: Optional[Any] = None  # Vector index
VectorFlow implements specialized models for API interactions:
- Create Models (LibraryCreate, DocumentCreate, ChunkCreate): For validating creation requests
- Summary Models (LibrarySummary, DocumentSummary, ChunkSummary): Lightweight representations excluding heavy data like embeddings
- Response Models (LibraryResponse): Special models for API responses that exclude non-serializable fields
This schema design ensures:
- Type safety throughout the application
- Clear validation rules for API inputs
- Efficient API responses without unnecessary data
- Proper serialization/deserialization
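To illustrate why a summary model keeps responses small, here is a rough sketch that uses standard-library dataclasses as a stand-in for the Pydantic models. The class names `ChunkSketch` and `ChunkSummarySketch` and the fields `text_preview` and `embedding_dim` are hypothetical; the actual fields of `ChunkSummary` are not shown in this README.

```python
from dataclasses import dataclass, field
from typing import Dict, List
from uuid import UUID, uuid4


@dataclass
class ChunkSketch:
    """Full chunk record, including the (potentially large) embedding."""
    text: str
    embedding: List[float]
    metadata: Dict[str, str] = field(default_factory=dict)
    id: UUID = field(default_factory=uuid4)


@dataclass
class ChunkSummarySketch:
    """Lightweight view for API responses: no embedding payload."""
    id: UUID
    text_preview: str
    embedding_dim: int

    @classmethod
    def from_chunk(cls, chunk: ChunkSketch, preview_len: int = 40) -> "ChunkSummarySketch":
        # Copy only the small fields; record the embedding size, not its values.
        return cls(
            id=chunk.id,
            text_preview=chunk.text[:preview_len],
            embedding_dim=len(chunk.embedding),
        )


chunk = ChunkSketch(text="Quarterly revenue grew 12%.", embedding=[0.1] * 1024)
summary = ChunkSummarySketch.from_chunk(chunk)
print(summary.text_preview, summary.embedding_dim)
```

Listing endpoints can then return summaries for many chunks without serializing thousands of floats per chunk.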
VectorFlow implements three indexing strategies, each with different performance characteristics:
class LinearIndex(BaseIndex):
    """Linear index implementation using brute force search"""
Complexities:
- Build Time: O(n) - simply stores vectors in an array
- Query Time: O(n × d) - compares query against every vector
- Space Complexity: O(n × d) - stores all n vectors with d dimensions
- Memory Access Pattern: Sequential, cache-friendly
Optimizations:
- Batch processing to leverage CPU cache efficiency
- Normalization of vectors for faster dot product comparisons
- Early termination of dissimilar vectors
Best For:
- Small datasets (<10K vectors)
- Situations requiring exact results
- Development/testing environments
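The brute-force strategy described above can be sketched in a few lines of pure Python. This is an illustration only, not the project's `LinearIndex`: the class name `LinearIndexSketch` and its `add` and `query` methods are assumptions, and the batching and early-termination optimizations are omitted.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b (0.0 if either is a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class LinearIndexSketch:
    """Exact search: compare the query against every stored vector."""

    def __init__(self) -> None:
        self._vectors: List[Tuple[str, List[float]]] = []

    def add(self, chunk_id: str, vector: Sequence[float]) -> None:
        # "Building" the index is just appending to a list.
        self._vectors.append((chunk_id, list(vector)))

    def query(self, query_vector: Sequence[float], k: int = 3) -> List[Tuple[str, float]]:
        # O(n x d): one similarity computation per stored vector.
        scored = [(cid, cosine_similarity(query_vector, vec)) for cid, vec in self._vectors]
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:k]


index = LinearIndexSketch()
index.add("a", [1.0, 0.0])
index.add("b", [0.0, 1.0])
index.add("c", [1.0, 1.0])
results = index.query([1.0, 0.1], k=2)
print([chunk_id for chunk_id, _ in results])  # ['a', 'c']
```

Because every vector is scored, the results are exact, which is why this approach suits small datasets and tests.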
class KDTreeIndex(BaseIndex):
    """KD-Tree implementation for efficient vector search in lower dimensions"""
Complexities:
- Build Time: O(n log n) - recursive partitioning to build balanced tree
- Query Time:
- Best case: O(log n) in low dimensions
- Average case: O(n^(1-1/d)) - degrades with increasing dimensions
- Worst case: O(n) in high dimensions
- Space Complexity: O(n) - only stores references in tree nodes
- Memory Access Pattern: Non-sequential, less cache-friendly
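To get a feel for how quickly the average-case term above degrades, the short calculation below evaluates n^(1-1/d) for a few dimensions. It only plugs numbers into the quoted formula; the function name `kd_tree_average_visits` is made up for this example, and constant factors are ignored.

```python
def kd_tree_average_visits(n: int, d: int) -> float:
    """Evaluate n^(1 - 1/d), the average-case query term for a KD-tree."""
    return n ** (1.0 - 1.0 / d)


n = 1_000_000
for d in (2, 8, 20, 128):
    visits = kd_tree_average_visits(n, d)
    # Show the estimate both as a count and as a fraction of the dataset.
    print(f"d={d:>3}: ~{visits:,.0f} of {n:,} vectors ({visits / n:.1%})")
```

With a million vectors, the estimate is about a thousand visits at d = 2 but already more than half the dataset at d = 20, which is consistent with the d ≤ 20 guidance for this index.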
Optimizations:
- Variance-based split axis selection for better balance
- Optimized quickselect for median finding
- Buffered updates to avoid tree rebalancing
Best For:
- Low to medium dimensions (d ≤ 20)
- Data with natural clusters or structure
- Applications needing balance between speed and accuracy
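For readers unfamiliar with KD-trees, a compact sketch of the build and nearest-neighbor steps follows. It is deliberately simplified and is not the project's `KDTreeIndex`: it cycles through axes instead of choosing the highest-variance axis, sorts at each level instead of using quickselect (so the build is slower than O(n log n)), and returns a single nearest point. All names here are illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    point: Point
    axis: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(points: List[Point], depth: int = 0) -> Optional[Node]:
    """Recursively split on the median of one axis per level."""
    if not points:
        return None
    axis = depth % len(points[0])  # simple axis cycling
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(
        point=ordered[mid],
        axis=axis,
        left=build(ordered[:mid], depth + 1),
        right=build(ordered[mid + 1:], depth + 1),
    )


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node: Optional[Node], target: Point, best: Optional[Point] = None) -> Optional[Point]:
    """Descend toward the target, then backtrack only where a closer point could exist."""
    if node is None:
        return best
    if best is None or sq_dist(node.point, target) < sq_dist(best, target):
        best = node.point
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)
    # Visit the far side only if the splitting plane is closer than the best match.
    if diff * diff < sq_dist(best, target):
        best = nearest(far, target, best)
    return best


points = [(2.0, 3.0), (5.0, 4.0), (9.0, 6.0), (4.0, 7.0), (8.0, 1.0), (7.0, 2.0)]
tree = build(points)
print(nearest(tree, (9.0, 2.0)))  # (8.0, 1.0)
```

The pruning test in `nearest` is what yields the logarithmic best case; in high dimensions it rarely succeeds, so the search visits most of the tree.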
class LSHIndex(BaseIndex):
    """LSH implementation for efficient vector search in high dimensions"""
Complexities:
- Build Time: O(n × d × L × K) - where L is number of tables, K is hash size
- Query Time: O(L + nL/2^K) - often sublinear in practice
- Space Complexity: O(n × L) - stores vector references in L tables
- Memory Access Pattern: Random access, less cache-friendly
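As a worked example of the nL/2^K term in the query-time estimate, the snippet below computes the expected number of candidate vectors examined per query. It assumes hash values are spread uniformly across the 2^K buckets of each table, which real data only approximates; the function name is hypothetical.

```python
def expected_lsh_candidates(n: int, num_tables: int, hash_bits: int) -> float:
    """Expected candidates per query: n * L / 2^K under uniform hashing."""
    return n * num_tables / (2 ** hash_bits)


# 100K vectors, L = 10 tables: longer hashes shrink the candidate set sharply.
print(expected_lsh_candidates(100_000, num_tables=10, hash_bits=8))   # 3906.25
print(expected_lsh_candidates(100_000, num_tables=10, hash_bits=16))  # 15.2587890625
```

Larger K reduces the work per query but also lowers the chance that true neighbors share a bucket, which is the usual reason for adding tables or exploring neighboring buckets.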
Optimizations:
- Hybrid Approach: Combines traditional LSH with neighbor exploration
- Adaptive neighbor exploration based on result quality
- Dynamic bucket size monitoring
- Fallback mechanisms for low-recall scenarios
Best For:
- High dimensional data (d > 20)
- Very large datasets (>100K vectors)
- Applications where query speed is critical
- Approximate nearest neighbor search
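The sketch below shows one common form of LSH for cosine similarity, random hyperplane hashing, with L tables of K-bit signatures. It is an assumption that this resembles the project's approach: the README does not name the hash family, and the neighbor exploration and fallback mechanisms listed above are not implemented here. The class `RandomHyperplaneLSH` and its methods are illustrative names.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Set, Tuple


class RandomHyperplaneLSH:
    """L hash tables, each keyed by a K-bit sign signature."""

    def __init__(self, dim: int, num_tables: int = 4, hash_bits: int = 6, seed: int = 0) -> None:
        rng = random.Random(seed)
        # One independent set of K random hyperplanes per table.
        self.planes = [
            [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(hash_bits)]
            for _ in range(num_tables)
        ]
        self.tables: List[Dict[Tuple[int, ...], List[int]]] = [
            defaultdict(list) for _ in range(num_tables)
        ]
        self.vectors: List[List[float]] = []

    def _signature(self, table: int, vector: Sequence[float]) -> Tuple[int, ...]:
        # Each bit records which side of a hyperplane the vector falls on.
        return tuple(
            1 if sum(p * v for p, v in zip(plane, vector)) >= 0 else 0
            for plane in self.planes[table]
        )

    def add(self, vector: Sequence[float]) -> int:
        idx = len(self.vectors)
        self.vectors.append(list(vector))
        for t, table in enumerate(self.tables):
            table[self._signature(t, vector)].append(idx)
        return idx

    def candidates(self, query: Sequence[float]) -> Set[int]:
        """Union of the query's buckets across all tables (to be re-ranked exactly)."""
        found: Set[int] = set()
        for t, table in enumerate(self.tables):
            found.update(table.get(self._signature(t, query), []))
        return found


lsh = RandomHyperplaneLSH(dim=3)
first = lsh.add([1.0, 2.0, 3.0])
lsh.add([-3.0, 0.5, -1.0])
print(first in lsh.candidates([2.0, 4.0, 6.0]))  # True: same direction, same signatures
```

A full index would then score the candidates exactly and return the top k; the candidate set is where the approximation comes from, since a true neighbor can land in a different bucket in every table.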
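VectorFlow uses per-library asyncio locks so that concurrent requests cannot interleave unsafely (note that asyncio.Lock coordinates coroutines on one event loop; it is not a thread lock):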
import asyncio
from typing import Dict
from uuid import UUID

class VectorDatabase:
    def __init__(self):
        self.libraries: Dict[UUID, Library] = {}
        self.locks: Dict[UUID, asyncio.Lock] = {}

    async def _get_lock(self, library_id: UUID) -> asyncio.Lock:
        # Create the per-library lock lazily on first use
        if library_id not in self.locks:
            self.locks[library_id] = asyncio.Lock()
        return self.locks[library_id]
- Library-Level Granularity
  - Each library has its own lock
  - Operations on different libraries can proceed concurrently
  - Prevents system-wide blocking during operations
- Async/Await Pattern
  - All database methods are implemented as async functions
  - Context managers ensure locks are properly released
  - Compatible with FastAPI's async architecture
- Lock Acquisition Strategy
      async def add_chunk(self, library_id: UUID, document_id: UUID, chunk: Chunk):
          async with await self._get_lock(library_id):
              # Safe operations within the lock context
  - Atomic operations within lock contexts
  - Prevents data races during critical operations
- Read/Write Consistency
  - Reads also acquire locks to ensure consistent views
  - Prevents reading partially updated data structures
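The following self-contained sketch shows why the per-library lock matters when an update yields control partway through. It is a simplified stand-in rather than the project's `VectorDatabase`: `LockedStoreSketch` stores plain strings, and the `asyncio.sleep(0)` call simulates an await (for example, an embedding request) inside a read-modify-write sequence.

```python
import asyncio
from collections import defaultdict
from typing import Dict, List
from uuid import UUID, uuid4


class LockedStoreSketch:
    def __init__(self) -> None:
        self.chunks: Dict[UUID, List[str]] = defaultdict(list)
        self.locks: Dict[UUID, asyncio.Lock] = {}

    def _get_lock(self, library_id: UUID) -> asyncio.Lock:
        # No await between the check and the insert, so two coroutines
        # cannot both create a lock for the same library.
        if library_id not in self.locks:
            self.locks[library_id] = asyncio.Lock()
        return self.locks[library_id]

    async def add_chunk(self, library_id: UUID, text: str) -> None:
        async with self._get_lock(library_id):
            current = list(self.chunks[library_id])  # read
            await asyncio.sleep(0)                   # yield mid-update
            current.append(text)                     # modify
            self.chunks[library_id] = current        # write back


async def main() -> int:
    store = LockedStoreSketch()
    library_id = uuid4()
    await asyncio.gather(*(store.add_chunk(library_id, f"chunk {i}") for i in range(50)))
    return len(store.chunks[library_id])


count = asyncio.run(main())
print(count)  # 50; without the lock, concurrent writers would overwrite each other
```

Because each library has its own lock, the same 50 writes spread across different libraries would not wait on one another.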
VectorFlow maintains data consistency and integrity through several mechanisms:
- Atomic Operations
  - All CRUD operations are performed atomically within lock contexts
  - Ensures all-or-nothing updates
- Index-Data Synchronization
      # First update data structure
      doc.chunks.append(chunk)
      # Then try to update index incrementally
      if lib.index and Indexer.is_index_updateable(lib.index):
          try:
              lib.index.add_chunk(chunk)
          except Exception:
              # Graceful failure handling
              lib.index = None
  - Data structure updated first, then indexes
  - Indexes marked invalid if update fails
  - Ensures data and indexes stay in sync
- Cascading Deletes
  - Deleting a library removes all its documents and chunks
  - Deleting a document removes all its chunks
  - Ensures referential integrity
- Validation Checks
  - Verifies entity existence before operations
  - Raises clear, specific exceptions for invalid operations
  - Ensures data integrity throughout operations
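The "data first, then index, invalidate on failure" rule can be exercised in isolation with a small sketch. The classes `FlakyIndex` and `LibrarySketch` and the `needs_rebuild` method are hypothetical and exist only for this illustration; how VectorFlow actually schedules a rebuild is not described here.

```python
from typing import List, Optional


class FlakyIndex:
    """Hypothetical index whose incremental update fails for one input."""

    def __init__(self, fail_on: str) -> None:
        self.fail_on = fail_on
        self.items: List[str] = []

    def add_chunk(self, chunk: str) -> None:
        if chunk == self.fail_on:
            raise ValueError("incremental update failed")
        self.items.append(chunk)


class LibrarySketch:
    def __init__(self, index: Optional[FlakyIndex]) -> None:
        self.chunks: List[str] = []
        self.index = index

    def add_chunk(self, chunk: str) -> None:
        self.chunks.append(chunk)  # 1. the data store is always updated first
        if self.index is not None:
            try:
                self.index.add_chunk(chunk)  # 2. then the index, incrementally
            except Exception:
                self.index = None  # 3. drop the stale index instead of serving it

    def needs_rebuild(self) -> bool:
        return self.index is None and bool(self.chunks)


library = LibrarySketch(FlakyIndex(fail_on="bad"))
library.add_chunk("ok")
library.add_chunk("bad")
print(library.chunks, library.needs_rebuild())  # ['ok', 'bad'] True
```

The design choice is that the stored data is the source of truth: a failed index update costs a rebuild, but it never loses a chunk or leaves an index that silently disagrees with the data.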
VectorFlow implements a powerful metadata filtering system:
@staticmethod
def create_metadata_filter(**kwargs):
    def filter_func(chunk):
        for key, value in kwargs.items():
            if key.endswith('_after') and key[:-6] in chunk.metadata.__dict__:
                chunk_value = getattr(chunk.metadata, key[:-6])
                if chunk_value <= value:
                    return False
            # ... more filter implementations
        return True
    return filter_func
- Dynamic Filter Generation
  - Creates filter functions on the fly
  - Combines multiple criteria with AND logic
- Rich Comparison Operators
  - field: Exact match (equality)
  - field_after: Greater than comparison
  - field_before: Less than comparison
  - field_contains: Substring matching
- Integration with Queries
      # In API endpoint
      results = index.query(
          query_vector,
          k=10,
          metadata_filter=Indexer.create_metadata_filter(**filter_criteria)
      )
- Application Examples
      # Filter by date and category
      metadata_filter={
          "date_after": "2023-01-01",
          "category": "finance",
          "title_contains": "quarterly report"
      }
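Since the excerpt above elides the remaining operators, here is one possible complete version covering the four suffix conventions listed. It is a sketch under assumptions, not the project's code: it treats a missing metadata field as a non-match, does case-sensitive substring matching, and compares dates as ISO-formatted strings. `SimpleNamespace` stands in for the chunk and metadata models.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict

# Suffix -> comparison between the chunk's value and the requested value.
_SUFFIX_TESTS: Dict[str, Callable[[Any, Any], bool]] = {
    "_after": lambda actual, expected: actual > expected,
    "_before": lambda actual, expected: actual < expected,
    "_contains": lambda actual, expected: expected in actual,
}


def create_metadata_filter(**criteria: Any) -> Callable[[Any], bool]:
    """Build a predicate that requires every criterion to hold (AND logic)."""

    def filter_func(chunk: Any) -> bool:
        for key, expected in criteria.items():
            for suffix, test in _SUFFIX_TESTS.items():
                if key.endswith(suffix):
                    field = key[: -len(suffix)]
                    break
            else:
                # No known suffix: fall back to exact equality on the field.
                field, test = key, lambda actual, exp: actual == exp
            actual = getattr(chunk.metadata, field, None)
            if actual is None or not test(actual, expected):
                return False
        return True

    return filter_func


chunk = SimpleNamespace(
    metadata=SimpleNamespace(date="2023-06-30", category="finance", title="Q2 quarterly report")
)
matches = create_metadata_filter(
    date_after="2023-01-01", category="finance", title_contains="quarterly report"
)
print(matches(chunk))  # True
```

Comparing ISO-8601 date strings lexicographically gives the same order as comparing the dates themselves, which is why the string criterion works here; other date formats would need parsing first.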
Endpoint | Method | Description |
---|---|---|
/libraries/ | GET | Retrieve all libraries with summary information |
/libraries/ | POST | Create a new library |
/libraries/{library_id} | GET | Retrieve a library by its ID |
/libraries/{library_id} | DELETE | Delete a library and all its contents |
/libraries/{library_id}/index | POST | Build or update a vector index |
/libraries/{library_id}/index | GET | Get the status of a library's index |
/libraries/{library_id}/search | POST | Search for similar documents using a vector query |
/libraries/{library_id}/text-search | POST | Search for documents using a text query |

Endpoint | Method | Description |
---|---|---|
/libraries/{library_id}/documents | GET | Retrieve all documents in a library |
/libraries/{library_id}/documents | POST | Add a new document to a library |
/libraries/{library_id}/documents/{document_id} | DELETE | Delete a document from a library |

Endpoint | Method | Description |
---|---|---|
/libraries/{library_id}/documents/{document_id}/chunks | GET | Retrieve all chunks in a document |
/libraries/{library_id}/documents/{document_id}/chunks | POST | Add a new chunk to a document |
/libraries/{library_id}/documents/{document_id}/chunks/{chunk_id} | DELETE | Delete a chunk |
/libraries/{library_id}/batch-chunks | POST | Process a batch of texts with automatic embedding generation |
VectorFlow includes several optimizations for high-performance vector search:
- Batched Processing
  - Linear search processes vectors in batches to optimize CPU cache usage
  - Significant speedup for large datasets
- Vector Normalization
  - Pre-normalizes vectors to unit length
  - Enables faster dot product calculations for cosine similarity
- Adaptive Index Selection
  - Automatic warnings for suboptimal index choices
  - KD-Tree warns when dimensions exceed recommended threshold
- Incremental Index Updates
  - Tracks changes to determine when full rebuilds are necessary
  - Applies incremental updates when possible to avoid rebuilding
- LSH Enhancements
  - Neighboring hash exploration for improved recall
  - Intelligent candidate collection strategies
  - Fallback to broader search when results are insufficient
- Memory Optimizations
  - Uses __slots__ in performance-critical classes to reduce memory overhead
  - Avoids duplicate storage of embeddings
- Query Optimizations
  - Early termination for dissimilar vectors
  - Heap-based top-k selection instead of sorting all distances
  - Optimized distance calculations
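Two of the optimizations above, vector normalization and heap-based top-k selection, combine naturally, as the following sketch shows. It is illustrative rather than taken from the codebase; `normalize` and `top_k_by_dot` are assumed names. Once stored vectors have unit length, cosine similarity reduces to a dot product, and `heapq.nlargest` keeps only k items instead of sorting every score.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def top_k_by_dot(
    query: Sequence[float], unit_vectors: Sequence[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (index, similarity) for the k most similar pre-normalized vectors."""
    q = normalize(query)
    # For unit vectors the dot product equals the cosine similarity.
    scores = ((sum(a * b for a, b in zip(q, vec)), i) for i, vec in enumerate(unit_vectors))
    # nlargest maintains a heap of size k: O(n log k) rather than O(n log n).
    return [(i, score) for score, i in heapq.nlargest(k, scores)]


stored = [normalize(v) for v in ([3.0, 4.0], [1.0, 0.0], [0.0, 2.0], [-1.0, -1.0])]
top = top_k_by_dot([6.0, 8.0], stored, k=2)
print([i for i, _ in top])  # [0, 2]
```

Normalizing once at insert time moves the square roots out of the query path, so each comparison costs only d multiplications and additions.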
# Create a new library
curl -X POST "http://localhost:8000/libraries/" \
  -H "Content-Type: application/json" \
  -d '{"name": "Research Papers", "metadata": {"description": "AI research papers"}}'
# Build an LSH index for high-dimensional vectors
curl -X POST "http://localhost:8000/libraries/{library_id}/index" \
  -H "Content-Type: application/json" \
  -d '{"algorithm": "lsh"}'
# Search with a vector query and a metadata filter
curl -X POST "http://localhost:8000/libraries/{library_id}/search" \
  -H "Content-Type: application/json" \
  -d '{
    "query": [0.1, 0.2, ...],
    "k": 5,
    "metadata_filter": {"category": "finance"}
  }'
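The same search call can be prepared from Python with only the standard library. This sketch mirrors the curl example's URL and body fields; `build_search_request` and `BASE_URL` are names introduced here, the library ID is a placeholder, and the three-element query vector is for illustration only, since real queries must match the dimension of the stored embeddings. The request is built but not sent, so the snippet runs without a server.

```python
import json
import urllib.request
from typing import Any, Dict, List, Optional

BASE_URL = "http://localhost:8000"  # assumption: the local server started earlier


def build_search_request(
    library_id: str,
    query: List[float],
    k: int = 5,
    metadata_filter: Optional[Dict[str, Any]] = None,
) -> urllib.request.Request:
    """Prepare a POST to the vector search endpoint with a JSON body."""
    payload: Dict[str, Any] = {"query": query, "k": k}
    if metadata_filter:
        payload["metadata_filter"] = metadata_filter
    return urllib.request.Request(
        url=f"{BASE_URL}/libraries/{library_id}/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request(
    "your-library-id", [0.1, 0.2, 0.3], metadata_filter={"category": "finance"}
)
print(request.get_method(), request.full_url)

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     results = json.load(response)
```

The shape of the response is not documented in this README, so the commented-out lines simply parse whatever JSON the server returns.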
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.