@Slyracoon23
Created July 8, 2025 21:57

Happenstance Affinity Ranking - 1 Page Proposal

Problem & Solution

  • Need: Rank user connections by "affinity strength" for search results and question prioritization
  • Current: Affinity ranking only works for users with connected email headers
  • Goal: Works for ALL users, efficient queries, production-ready ASAP

Solution: XGBoost classifier trained on binary labels (0/1) that outputs probability scores (0.0-1.0) for ranking
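
For illustration, a minimal sketch of this setup using the xgboost Python package; the feature matrix, labels, and hyperparameters here are placeholders, not the production configuration:

```python
# Minimal sketch: train an XGBoost binary classifier and use its predicted
# probabilities (0.0-1.0) as affinity scores for ranking.
# Features and labels below are synthetic placeholders for the sketch.
import numpy as np
from xgboost import XGBClassifier

# X: one row per user-person pair; y: heuristic binary labels (0/1)
X = np.random.rand(1000, 6)                      # e.g. email counts, recency, overlap flags
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)        # placeholder labels

model = XGBClassifier(
    n_estimators=200,
    max_depth=4,
    learning_rate=0.1,
    eval_metric="logloss",
)
model.fit(X, y)

# predict_proba returns [P(label=0), P(label=1)]; the second column is the
# affinity score used for ranking.
affinity_scores = model.predict_proba(X)[:, 1]
ranked = np.argsort(-affinity_scores)            # pair indices, highest affinity first
```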

Features (What Makes People Connected?)

Affinity scores are only calculated for user-person pairs where at least one relevant feature or connection exists. This includes:

  • Email Signals: Total emails sent/received, bidirectional exchange score
  • LinkedIn Signals: Connected status, connection recency, mutual connections
  • Professional Overlap: Same company (current/past), shared education, network overlap
  • Engagement Patterns: Interaction consistency, response rates, communication trends

Pairs with no features or connections are ignored to ensure scalability and efficiency.
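
As a concrete illustration, one way the per-pair feature vector and the skip rule could look in Python; the field names are assumptions for this proposal, not a finalized schema:

```python
# Illustrative per-pair feature vector; field names are assumptions.
from dataclasses import dataclass, astuple

@dataclass
class PairFeatures:
    # Email signals
    emails_sent: int
    emails_received: int
    bidirectional_score: float   # e.g. min(sent, received) / max(sent, received, 1)
    # LinkedIn signals
    linkedin_connected: int      # 0/1
    connection_age_days: float
    mutual_connections: int
    # Professional overlap
    same_company: int            # 0/1 (current or past)
    shared_education: int        # 0/1
    # Engagement patterns
    response_rate: float         # fraction of the user's emails that got a reply

def has_any_signal(f: PairFeatures) -> bool:
    """Pairs with no features or connections are skipped entirely."""
    return any(astuple(f))       # all-zero vectors are ignored
```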

Training Data (Binary Labels)

  • High Affinity (1): Bidirectional email exchange + recent LinkedIn connection + professional overlap
  • Low Affinity (0): Old LinkedIn-only connections, one-way emails, distant network ties

Model Output: Probability scores (0.0-1.0), well suited for ranking
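
A sketch of the heuristic labeling rule; the 180-day recency threshold and the exact conjunction of signals are assumptions to be tuned:

```python
# Heuristic binary labeling for training data; thresholds are assumptions.
def heuristic_label(
    emails_sent: int,
    emails_received: int,
    linkedin_connected: bool,
    connection_age_days: float,
    same_company: bool,
    shared_education: bool,
) -> int:
    """1 = high affinity, 0 = low affinity (training label, not the final score)."""
    bidirectional_email = emails_sent > 0 and emails_received > 0
    recent_linkedin = linkedin_connected and connection_age_days <= 180
    professional_overlap = same_company or shared_education

    if bidirectional_email and recent_linkedin and professional_overlap:
        return 1
    return 0  # e.g. old LinkedIn-only ties, one-way emails, distant network ties
```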

Architecture

Email/LinkedIn Data → Feature Engineering → XGBoost Model → Affinity Scores (0-1)
                                                          ↓
Search Results: Ranked by affinity score
Questions: Show highest affinity people first

Storage

  • PostgreSQL: user_person_affinity(user_id, person_id, score, updated_at)
  • Redis: Hot scores cached for <10ms lookups
  • Indexes: (user_id, score DESC) for fast ranking queries
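
A minimal sketch of the schema and write path, assuming psycopg2; the DSN, column types, and batch shape are placeholders:

```python
# Sketch of the PostgreSQL schema, ranking index, and batch upsert.
import psycopg2
from psycopg2.extras import execute_values

DDL = """
CREATE TABLE IF NOT EXISTS user_person_affinity (
    user_id    BIGINT      NOT NULL,
    person_id  BIGINT      NOT NULL,
    score      REAL        NOT NULL,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (user_id, person_id)
);
CREATE INDEX IF NOT EXISTS idx_affinity_user_score
    ON user_person_affinity (user_id, score DESC);
"""

UPSERT = """
INSERT INTO user_person_affinity (user_id, person_id, score)
VALUES %s
ON CONFLICT (user_id, person_id)
DO UPDATE SET score = EXCLUDED.score, updated_at = now();
"""

def write_scores(conn, rows):
    """rows: iterable of (user_id, person_id, score) tuples from the batch job."""
    with conn, conn.cursor() as cur:
        cur.execute(DDL)
        execute_values(cur, UPSERT, rows)
```

The daily job would call write_scores(conn, rows) with the freshly computed (user_id, person_id, score) tuples, leaving updated_at to the database.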

Automated Affinity Score Update (Cron Job)

A daily cron job (scheduled batch process) is responsible for keeping affinity scores up to date in the database. This job performs the following steps:

  • Schedule: Runs every 24 hours (e.g., 2:00 AM UTC)
  • Steps:
    1. Select candidate pairs: Identify user-person pairs with at least one relevant feature or connection (e.g., email, LinkedIn, or professional overlap). Ignore pairs with no features.
    2. Extract latest features from Email and LinkedIn data for these candidate pairs
    3. Apply the trained XGBoost model to compute updated affinity scores (0-1)
    4. Update the user_person_affinity table in PostgreSQL with new scores and timestamps
    5. Refresh hot scores in Redis cache for fast access

Example Cron Schedule:

0 2 * * * /usr/bin/python3 /app/batch_update_affinity_scores.py

This ensures that search results and question prioritization always use the most recent data, while also supporting real-time updates for new interactions as needed, without incurring unnecessary computation for unrelated pairs.
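
For illustration, a possible skeleton for batch_update_affinity_scores.py; the helper functions are placeholders for the five steps above, and the model path is assumed:

```python
#!/usr/bin/env python3
"""Illustrative skeleton of batch_update_affinity_scores.py.

The helpers below are placeholders for the five steps listed above; model
path and data access are assumptions for this sketch.
"""
import joblib  # assumed: the trained XGBoost classifier is persisted with joblib


def select_candidate_pairs():
    """Step 1 (placeholder): user-person pairs with >=1 signal; skip the rest."""
    return []


def extract_features(pairs):
    """Step 2 (placeholder): latest email/LinkedIn features for each pair."""
    return []


def write_scores(pairs, scores):
    """Step 4 (placeholder): upsert rows into user_person_affinity."""


def refresh_cache(pairs, scores):
    """Step 5 (placeholder): refresh hot scores in Redis."""


def run_daily_update():
    pairs = select_candidate_pairs()
    if not pairs:
        return

    features = extract_features(pairs)

    # Step 3: apply the trained model; column 1 of predict_proba is P(high affinity)
    model = joblib.load("/app/models/affinity_xgb.joblib")  # assumed path
    scores = model.predict_proba(features)[:, 1]

    write_scores(pairs, scores)
    refresh_cache(pairs, scores)


if __name__ == "__main__":
    run_daily_update()
```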

Key Questions Answered

What signals? Email frequency, LinkedIn connections, professional overlap, network proximity

How to compute? Daily batch jobs for feature extraction + real-time updates on new interactions

What to store? Affinity scores (0-1) in PostgreSQL with user_id/person_id indexes

Runtime queries?

  • Search: SELECT p.*, a.score FROM people p JOIN user_person_affinity a ON a.person_id = p.id WHERE a.user_id = X ORDER BY a.score DESC
  • Questions: SELECT person_id FROM user_person_affinity WHERE user_id = X AND score > 0.5 ORDER BY score DESC

Performance targets: <50ms search ranking, <10ms individual lookups
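
A sketch of the two read paths under these targets, assuming psycopg2 and redis-py and a people table keyed by an id column; the cache key format is illustrative:

```python
# Read paths: ranked search via the (user_id, score DESC) index, and a
# Redis-first single lookup. Connection settings and key format are assumptions.
import psycopg2
import redis

SEARCH_SQL = """
SELECT p.*, a.score
FROM user_person_affinity a
JOIN people p ON p.id = a.person_id
WHERE a.user_id = %s
ORDER BY a.score DESC
LIMIT %s;
"""


def ranked_people(conn, user_id, limit=25):
    """Search ranking: served by the (user_id, score DESC) index, <50ms target."""
    with conn, conn.cursor() as cur:
        cur.execute(SEARCH_SQL, (user_id, limit))
        return cur.fetchall()


cache = redis.Redis()  # assumed local Redis instance


def affinity_score(conn, user_id, person_id):
    """Individual lookup: Redis first for the <10ms target, PostgreSQL fallback."""
    key = f"affinity:{user_id}:{person_id}"
    hit = cache.get(key)
    if hit is not None:
        return float(hit)
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT score FROM user_person_affinity "
            "WHERE user_id = %s AND person_id = %s",
            (user_id, person_id),
        )
        row = cur.fetchone()
    if row is None:
        return None
    cache.set(key, row[0], ex=86400)  # backfill the hot-score cache for a day
    return float(row[0])
```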

Implementation (4 weeks)

Week 1-2: MVP

  • Basic feature extraction (email + LinkedIn)
  • Heuristic binary labeling
  • Train XGBoost classifier
  • Batch scoring pipeline
  • Incorporate user feedback and LLMs for labeling: Begin collecting explicit user feedback on connection relevance and experiment with using large language models (LLMs) to assist in labeling ambiguous or large-scale data.

Week 3-4: Production

  • Optimize database schema/indexes
  • Real-time scoring API
  • Deploy with monitoring
  • A/B test vs current system
  • Expand user feedback and LLM labeling: Integrate user feedback loops into the product and use LLMs to continuously improve label quality and coverage.

Architecture Diagram

[Architecture diagram image]

Bottom Line: A simple, scalable system that works for all users. A binary classifier trained on clear heuristics outputs nuanced 0-1 scores that are ideal for ranking, getting us to production fast while laying the foundation for future improvements.
