When searching for video content using AI-generated embeddings (like "man with white t-shirt"), current frame-by-frame indexing creates a result clustering problem:
- Current Setup: Each video frame is stored as a separate searchable item
- Search Behavior: Query returns individual frames ranked by similarity score
- Undesired Outcome: Top results often come from the same video, even if other videos contain equally relevant content
Searching for "man with white t-shirt" might return:
1. Video_A_Frame_15 (similarity: 0.95) ← Same video
2. Video_A_Frame_16 (similarity: 0.94) ← Same video
3. Video_A_Frame_17 (similarity: 0.93) ← Same video
4. Video_B_Frame_08 (similarity: 0.85) ← Different video (buried)
5. Video_C_Frame_22 (similarity: 0.84) ← Different video (buried)
Business Impact: Users miss relevant content from other videos because results are dominated by sequential frames from one video.
Qdrant's "Multiple Vectors Per Point" feature allows storing an entire video clip as a single searchable entity, with each frame represented as a named vector within that entity.
Before (Frame-level indexing):

```
Point 1: {id: "VIDEO_A_FRAME_001", vector: [...], payload: {video: "A"}}
Point 2: {id: "VIDEO_A_FRAME_002", vector: [...], payload: {video: "A"}}
Point 3: {id: "VIDEO_A_FRAME_003", vector: [...], payload: {video: "A"}}
Point 4: {id: "VIDEO_B_FRAME_001", vector: [...], payload: {video: "B"}}
```
After (Video-level indexing):

```
Point 1: {
  id: "VIDEO_A",
  vector: [
    [...],  // frame 001 embedding
    [...],  // frame 002 embedding
    [...]   // frame 003 embedding
  ],
  payload: {video_metadata: "..."}
}
Point 2: {
  id: "VIDEO_B",
  vector: [
    [...],  // frame 001 embedding
    [...]   // frame 002 embedding
  ],
  payload: {video_metadata: "..."}
}
```
How a search behaves:
- Query: a single vector representing "man with white t-shirt"
- Qdrant Processing: compares the query against ALL frame vectors within each video point
- Result: returns the best-matching videos, not individual frames
- Ranking: based on the highest-scoring frame within each video (see the sketch after this list)

Why this solves the clustering problem:
- Natural Diversity: each video can appear at most once in the results
- Semantic Grouping: videos are treated as coherent units
- Efficient Storage: video metadata is stored once per clip rather than once per frame
- Scalability: fewer total points to manage (one per video instead of one per frame)
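To make the ranking rule concrete, here is a minimal NumPy sketch (illustrative only, not Qdrant code) of MaxSim scoring with a single query vector: the video's score is simply the cosine similarity of its best-matching frame.

```python
import numpy as np

def max_sim_score(query: np.ndarray, frame_vectors: np.ndarray) -> float:
    """Score a video point: cosine similarity of its best-matching frame.

    query:         (d,) embedding of the text query
    frame_vectors: (n_frames, d) per-frame embeddings stored in the point
    """
    q = query / np.linalg.norm(query)
    f = frame_vectors / np.linalg.norm(frame_vectors, axis=1, keepdims=True)
    return float(np.max(f @ q))  # the best frame decides the video's rank

# Toy example: three frames with 4-dimensional embeddings (hypothetical data)
frames = np.random.rand(3, 4)
query = np.random.rand(4)
print(max_sim_score(query, frames))
```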
```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# One multivector field per point: a list of per-frame embeddings scored
# with MaxSim, so a point's score is that of its best-matching frame.
# Requires Qdrant 1.10+ and a matching qdrant-client version.
client.create_collection(
    collection_name="video_clips",
    vectors_config=models.VectorParams(
        size=512,  # must match the frame embedding dimension
        distance=models.Distance.COSINE,
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM
        ),
    ),
)

# Create a video-level point with multiple frame vectors.
# Qdrant point IDs must be unsigned integers or UUIDs, so the
# human-readable clip name lives in the payload instead.
video_point = models.PointStruct(
    id=1454,
    vector=[embedding_vector_1, embedding_vector_2, embedding_vector_3],
    payload={
        "clip_id": "EVENT0145_CLIP4",
        "event_name": "EVENT0145",
        "clip_number": 4,
        "duration_seconds": 10.5,
        "frame_count": 3,
    },
)
client.upsert(collection_name="video_clips", points=[video_point])

# Search returns video-level results: one hit per clip.
results = client.query_points(
    collection_name="video_clips",
    query=[query_vector],  # a single query vector, passed as a one-element multivector
    limit=10,              # returns 10 different videos
)
```
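Because the frames live in one list-valued vector field, each clip can hold a different number of frames; a fixed set of named vector fields would not allow this, since named vectors must be declared up front in the collection schema.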
Elasticsearch has no direct equivalent of multi-vector points, but several workarounds approximate the same video-level grouping.
Approach 1 (Aggregations): use Elasticsearch aggregations to group results by video and return only the best frame per video.
```python
search_body = {
    "size": 0,  # don't return individual frame hits
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "cosineSimilarity(params.query_vector, 'embedding_vector') + 1.0",
                "params": {"query_vector": query_vector},
            },
        }
    },
    "aggs": {
        "unique_videos": {
            "terms": {
                "field": "video_id",
                "size": 10,  # top 10 videos
                # Order buckets by their best frame's score; the default
                # (document count) would favor long videos, not relevant ones.
                "order": {"max_score": "desc"},
            },
            "aggs": {
                "max_score": {"max": {"script": {"source": "_score"}}},
                "best_frame": {
                    "top_hits": {
                        "size": 1,  # best frame per video
                        "sort": [{"_score": "desc"}],
                    }
                },
            },
        }
    },
}
```
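To consume these results, run the query and read one hit per bucket. A minimal sketch, assuming a local cluster and a frame-level index named video_frames (both hypothetical):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

response = es.search(index="video_frames", body=search_body)

# One bucket per video, each carrying only its single best-scoring frame
for bucket in response["aggregations"]["unique_videos"]["buckets"]:
    best = bucket["best_frame"]["hits"]["hits"][0]
    print(bucket["key"], round(best["_score"], 3), best["_id"])
```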
Pros:
- Works with existing frame-level data
- Leverages ES's powerful aggregation engine
- Highly flexible grouping criteria
Cons:
- More complex query structure
- Potentially higher computational overhead
- Requires careful aggregation configuration
Approach 2 (Post-Processing): retrieve more results than needed, then deduplicate in application code.
```python
def deduplicate_video_results(search_results, max_per_video=1):
    """Keep only the top N frames per video."""
    video_counts = {}
    filtered_results = []
    for result in search_results:
        video_id = result["_source"]["video_id"]
        if video_counts.get(video_id, 0) < max_per_video:
            filtered_results.append(result)
            video_counts[video_id] = video_counts.get(video_id, 0) + 1
    return filtered_results
```
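In practice the initial search has to over-fetch so that enough distinct videos survive deduplication. A sketch of the full round trip, reusing the script_score query from the aggregation example; the 10x over-fetch factor is an assumed starting point to tune:

```python
raw = es.search(
    index="video_frames",
    body={
        "size": 100,  # over-fetch roughly 10x the 10 videos we want to show
        "query": search_body["query"],  # same script_score query as above
    },
)
top_videos = deduplicate_video_results(raw["hits"]["hits"], max_per_video=1)[:10]
```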
Pros:
- Simple to implement and understand
- Works with any vector database
- Easy to adjust deduplication logic
Cons:
- Network overhead (retrieving extra results)
- Application-level complexity
- Less efficient than database-level solutions
Approach 3 (Nested Docs): use ES nested documents to store each video with its frame data embedded.
```json
{
  "video_id": "EVENT0145_CLIP4",
  "video_metadata": {...},
  "frames": [
    {"frame_id": 1, "embedding": [...]},
    {"frame_id": 2, "embedding": [...]},
    {"frame_id": 3, "embedding": [...]}
  ]
}
```
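A nested mapping for this document shape might look like the sketch below (the 512-dimension figure is an assumption; match it to your embedding model). Note that recent Elasticsearch releases (around 8.11 and later) can run kNN directly over nested dense_vector fields, scoring each parent video by its best-matching frame, which narrows the gap with the Qdrant approach:

```python
# Sketch of a nested index mapping: one document per video,
# one nested object per frame.
nested_mapping = {
    "mappings": {
        "properties": {
            "video_id": {"type": "keyword"},
            "video_metadata": {"type": "object", "enabled": False},
            "frames": {
                "type": "nested",
                "properties": {
                    "frame_id": {"type": "integer"},
                    "embedding": {
                        "type": "dense_vector",
                        "dims": 512,  # assumed embedding size
                        "index": True,
                        "similarity": "cosine",
                    },
                },
            },
        }
    }
}
es.indices.create(index="video_clips_nested", body=nested_mapping)
```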
Pros:
- Semantic grouping similar to Qdrant approach
- Leverages ES nested query capabilities
Cons:
- Complex nested queries required
- Vector similarity search on nested fields is challenging
- May require custom scoring functions
By contrast, the Qdrant multi-vector approach scores well on every axis:
- Query Complexity: simple (a single API call)
- Network Overhead: minimal (only relevant videos are returned)
- Computational Efficiency: high (the engine is optimized for multi-vector points)
- Storage Efficiency: good (reduced metadata duplication)
- Development Complexity: low (built-in feature)
| Approach | Query Complexity | Network Overhead | Computational Cost | Development Effort |
|---|---|---|---|---|
| Aggregations | High | Low | Medium-High | Medium |
| Post-Processing | Low | High | Low | Low |
| Nested Docs | Very High | Low | High | High |
For Production Use:
- Qdrant: Multiple vectors per point (if using Qdrant)
- Elasticsearch: Aggregation-based approach for the best balance of efficiency and functionality
For Prototyping:
- Both platforms: Post-processing approach for simplicity and cross-platform compatibility
The Qdrant solution is more elegant and efficient for this specific use case, while Elasticsearch requires more complex workarounds but offers greater flexibility in other areas.