When searching for video content using AI-generated embeddings (like "man with white t-shirt"), current frame-by-frame indexing creates a result clustering problem:
- Current Setup: Each video frame is stored as a separate searchable item
- Search Behavior: Query returns individual frames ranked by similarity score
- Undesired Outcome: Top results often come from the same video, even if other videos contain equally relevant content
Searching for "man with white t-shirt" might return:
1. Video_A_Frame_15 (similarity: 0.95) ← Same video
2. Video_A_Frame_16 (similarity: 0.94) ← Same video
3. Video_A_Frame_17 (similarity: 0.93) ← Same video
4. Video_B_Frame_08 (similarity: 0.85) ← Different video (buried)
5. Video_C_Frame_22 (similarity: 0.84) ← Different video (buried)
Business Impact: Users miss relevant content from other videos because results are dominated by sequential frames from one video.
Qdrant's "Multiple Vectors Per Point" feature allows storing an entire video clip as a single searchable entity, with each frame represented as a named vector within that entity.
Before (Frame-level indexing):

```
Point 1: {id: "VIDEO_A_FRAME_001", vector: [...], payload: {video: "A"}}
Point 2: {id: "VIDEO_A_FRAME_002", vector: [...], payload: {video: "A"}}
Point 3: {id: "VIDEO_A_FRAME_003", vector: [...], payload: {video: "A"}}
Point 4: {id: "VIDEO_B_FRAME_001", vector: [...], payload: {video: "B"}}
```
After (Video-level indexing):

```
Point 1: {
  id: "VIDEO_A",
  vector: [
    [...],  // frame 001 embedding
    [...],  // frame 002 embedding
    [...]   // frame 003 embedding
  ],
  payload: {video_metadata: "..."}
}
Point 2: {
  id: "VIDEO_B",
  vector: [
    [...],  // frame 001 embedding
    [...]   // frame 002 embedding
  ],
  payload: {video_metadata: "..."}
}
```
How a search behaves:
- Query: a single vector representing "man with white t-shirt"
- Qdrant Processing: compares the query against ALL frame vectors within each video point
- Result: returns the best-matching videos, not individual frames
- Ranking: based on the highest-scoring frame within each video (see the sketch after this list)

Why this solves the clustering problem:
- Natural Diversity: each video can appear at most once in the results
- Semantic Grouping: videos are treated as coherent units
- Efficient Storage: video metadata is stored once per clip rather than once per frame
- Scalability: fewer total points to manage (one per video instead of one per frame)
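To make the ranking rule concrete, here is a minimal NumPy sketch (illustrative only, not Qdrant code) of MaxSim scoring with a single query vector: the video's score is simply the cosine similarity of its best-matching frame.

```python
import numpy as np

def max_sim_score(query: np.ndarray, frame_vectors: np.ndarray) -> float:
    """Score a video point: cosine similarity of its best-matching frame.

    query:         (d,) embedding of the text query
    frame_vectors: (n_frames, d) per-frame embeddings stored in the point
    """
    q = query / np.linalg.norm(query)
    f = frame_vectors / np.linalg.norm(frame_vectors, axis=1, keepdims=True)
    return float(np.max(f @ q))  # the best frame decides the video's rank

# Toy example: three frames with 4-dimensional embeddings (hypothetical data)
frames = np.random.rand(3, 4)
query = np.random.rand(4)
print(max_sim_score(query, frames))
```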
```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# One multivector field per point: a list of per-frame embeddings scored
# with MaxSim, so a point's score is that of its best-matching frame.
# Requires Qdrant 1.10+ and a matching qdrant-client version.
client.create_collection(
    collection_name="video_clips",
    vectors_config=models.VectorParams(
        size=512,  # must match the frame embedding dimension
        distance=models.Distance.COSINE,
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM
        ),
    ),
)

# Create a video-level point with multiple frame vectors.
# Qdrant point IDs must be unsigned integers or UUIDs, so the
# human-readable clip name lives in the payload instead.
video_point = models.PointStruct(
    id=1454,
    vector=[embedding_vector_1, embedding_vector_2, embedding_vector_3],
    payload={
        "clip_id": "EVENT0145_CLIP4",
        "event_name": "EVENT0145",
        "clip_number": 4,
        "duration_seconds": 10.5,
        "frame_count": 3,
    },
)
client.upsert(collection_name="video_clips", points=[video_point])

# Search returns video-level results: one hit per clip.
results = client.query_points(
    collection_name="video_clips",
    query=[query_vector],  # a single query vector, passed as a one-element multivector
    limit=10,              # returns 10 different videos
)
```
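Because the frames live in one list-valued vector field, each clip can hold a different number of frames; a fixed set of named vector fields would not allow this, since named vectors must be declared up front in the collection schema.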
Elasticsearch has no direct equivalent of multi-vector points, but several workarounds approximate the same video-level grouping.
Approach 1 (Aggregations): use Elasticsearch aggregations to group results by video and return only the best frame per video.
```python
search_body = {
    "size": 0,  # don't return individual frame hits
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "cosineSimilarity(params.query_vector, 'embedding_vector') + 1.0",
                "params": {"query_vector": query_vector},
            },
        }
    },
    "aggs": {
        "unique_videos": {
            "terms": {
                "field": "video_id",
                "size": 10,  # top 10 videos
                # Order buckets by their best frame's score; the default
                # (document count) would favor long videos, not relevant ones.
                "order": {"max_score": "desc"},
            },
            "aggs": {
                "max_score": {"max": {"script": {"source": "_score"}}},
                "best_frame": {
                    "top_hits": {
                        "size": 1,  # best frame per video
                        "sort": [{"_score": "desc"}],
                    }
                },
            },
        }
    },
}
```
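To consume these results, run the query and read one hit per bucket. A minimal sketch, assuming a local cluster and a frame-level index named video_frames (both hypothetical):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

response = es.search(index="video_frames", body=search_body)

# One bucket per video, each carrying only its single best-scoring frame
for bucket in response["aggregations"]["unique_videos"]["buckets"]:
    best = bucket["best_frame"]["hits"]["hits"][0]
    print(bucket["key"], round(best["_score"], 3), best["_id"])
```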
Pros:
- Works with existing frame-level data
- Leverages ES's powerful aggregation engine
- Highly flexible grouping criteria
Cons:
- More complex query structure
- Potentially higher computational overhead
- Requires careful aggregation configuration
Approach 2 (Post-Processing): retrieve more results than needed, then deduplicate in application code.
```python
def deduplicate_video_results(search_results, max_per_video=1):
    """Keep only the top N frames per video."""
    video_counts = {}
    filtered_results = []
    for result in search_results:
        video_id = result["_source"]["video_id"]
        if video_counts.get(video_id, 0) < max_per_video:
            filtered_results.append(result)
            video_counts[video_id] = video_counts.get(video_id, 0) + 1
    return filtered_results
```
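In practice the initial search has to over-fetch so that enough distinct videos survive deduplication. A sketch of the full round trip, reusing the script_score query from the aggregation example; the 10x over-fetch factor is an assumed starting point to tune:

```python
raw = es.search(
    index="video_frames",
    body={
        "size": 100,  # over-fetch roughly 10x the 10 videos we want to show
        "query": search_body["query"],  # same script_score query as above
    },
)
top_videos = deduplicate_video_results(raw["hits"]["hits"], max_per_video=1)[:10]
```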
Pros:
- Simple to implement and understand
- Works with any vector database
- Easy to adjust deduplication logic
Cons:
- Network overhead (retrieving extra results)
- Application-level complexity
- Less efficient than database-level solutions
Approach 3 (Nested Docs): use ES nested documents to store each video with its frame data embedded.
```json
{
  "video_id": "EVENT0145_CLIP4",
  "video_metadata": {...},
  "frames": [
    {"frame_id": 1, "embedding": [...]},
    {"frame_id": 2, "embedding": [...]},
    {"frame_id": 3, "embedding": [...]}
  ]
}
```
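A nested mapping for this document shape might look like the sketch below (the 512-dimension figure is an assumption; match it to your embedding model). Note that recent Elasticsearch releases (around 8.11 and later) can run kNN directly over nested dense_vector fields, scoring each parent video by its best-matching frame, which narrows the gap with the Qdrant approach:

```python
# Sketch of a nested index mapping: one document per video,
# one nested object per frame.
nested_mapping = {
    "mappings": {
        "properties": {
            "video_id": {"type": "keyword"},
            "video_metadata": {"type": "object", "enabled": False},
            "frames": {
                "type": "nested",
                "properties": {
                    "frame_id": {"type": "integer"},
                    "embedding": {
                        "type": "dense_vector",
                        "dims": 512,  # assumed embedding size
                        "index": True,
                        "similarity": "cosine",
                    },
                },
            },
        }
    }
}
es.indices.create(index="video_clips_nested", body=nested_mapping)
```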
Pros:
- Semantic grouping similar to Qdrant approach
- Leverages ES nested query capabilities
Cons:
- Complex nested queries required
- Vector similarity search on nested fields is challenging
- May require custom scoring functions
By contrast, the Qdrant multi-vector approach scores well on every axis:
- Query Complexity: simple (a single API call)
- Network Overhead: minimal (only relevant videos are returned)
- Computational Efficiency: high (the engine is optimized for multi-vector points)
- Storage Efficiency: good (reduced metadata duplication)
- Development Complexity: low (built-in feature)
| Approach | Query Complexity | Network Overhead | Computational Cost | Development Effort |
|---|---|---|---|---|
| Aggregations | High | Low | Medium-High | Medium |
| Post-Processing | Low | High | Low | Low |
| Nested Docs | Very High | Low | High | High |
For Production Use:
- Qdrant: Multiple vectors per point (if using Qdrant)
- Elasticsearch: Aggregation-based approach for the best balance of efficiency and functionality
For Prototyping:
- Both platforms: Post-processing approach for simplicity and cross-platform compatibility
The Qdrant solution is more elegant and efficient for this specific use case, while Elasticsearch requires more complex workarounds but offers greater flexibility in other areas.