@rjurney
Created November 11, 2025 23:34
SERF entity resolution - round two results
============================================================
2025-11-11 15:33:37,107 - abzu.spark.er_eval - INFO - ENTITY RESOLUTION EVALUATION SUMMARY - ITERATION 2
2025-11-11 15:33:37,107 - abzu.spark.er_eval - INFO - ============================================================
2025-11-11 15:33:37,107 - abzu.spark.er_eval - INFO - Original raw companies (before matching): 13,641 unique
2025-11-11 15:33:37,107 - abzu.spark.er_eval - INFO - Companies that went into matching: 11,093
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - Skipped (singletons/errors): 2,548
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO -
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - MATCHING RESULTS:
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - BAML-processed companies: 4,930 unique
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - Companies merged: 6,163 (55.56%)
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - IDs dropped by BAML: 82 (0.74%) - recovered via UUID tracking
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO -
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - FINAL OUTPUT:
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - Total companies: 7,478 (4,930 matched + 2,548 skipped)
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - Total reduction: 6,163 companies (45.18%)
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO -
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - UUID VERIFICATION (BAML-processed companies only):
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - Overlap with original: 0 UUIDs (0.00%) - should be 0%
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - Overlap with previous: 0 UUIDs (0.00%) - should be 0%
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO -
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - SOURCE UUID COVERAGE:
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - Original companies tracked: 11,253/13,641 (82.49%)
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - Previous iteration tracked: 4,922/7,149 (68.85%)
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - Total unique source_uuids: 13,912
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO -
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - SOURCE UUID VALIDATION:
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - Valid references: 21,410/21,410 (100.00%)
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - Invalid references: 0/21,410 (0.00%)
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO -
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - RECOVERY STATISTICS:
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - Total recovered (match_skip=True): 2,548
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - BAML-processed records: 4,930
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - Skipped in iteration 2: 2,548
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - Recovery reasons:
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - - Error recovery: 0
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - - Missing in match output (legacy): 82
2025-11-11 15:33:37,108 - abzu.spark.er_eval - INFO - ============================================================
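
The percentages in the summary above can be cross-checked from the raw counts. The following is a minimal sketch that recomputes them. The variable names and the `pct` helper are hypothetical and are not taken from `abzu.spark.er_eval`; only the numbers come from the log.

```python
# Counts copied from the iteration 2 log summary above.
# All names here are illustrative, not part of abzu.spark.er_eval.
raw_companies = 13_641      # "Original raw companies (before matching)"
went_into_matching = 11_093  # "Companies that went into matching"
skipped = 2_548             # "Skipped (singletons/errors)"
baml_processed = 4_930      # "BAML-processed companies"
ids_dropped = 82            # "IDs dropped by BAML"


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Every raw company is either matched or skipped.
assert raw_companies == went_into_matching + skipped

# Companies merged away inside the matching step.
merged = went_into_matching - baml_processed

# Final output is the matched survivors plus the skipped records.
final_total = baml_processed + skipped

# The same merged count is reported against two different denominators:
# the matching input (55.56%) and the full raw set (45.18%).
merged_pct_of_matching = pct(merged, went_into_matching)
reduction_pct_of_raw = pct(merged, raw_companies)
dropped_pct = pct(ids_dropped, went_into_matching)

print(f"merged: {merged:,} ({merged_pct_of_matching}% of matching input)")
print(f"final total: {final_total:,}")
print(f"total reduction: {raw_companies - final_total:,} ({reduction_pct_of_raw}% of raw)")
print(f"IDs dropped: {ids_dropped} ({dropped_pct}%)")
```

This makes explicit that "Companies merged" and "Total reduction" report the same 6,163 records, measured once against the 11,093 companies that entered matching and once against all 13,641 raw companies.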