Skip to content

Instantly share code, notes, and snippets.

@searls
Created June 12, 2025 07:20
Show Gist options
  • Save searls/a6f5e6d2eb178e342b2224f2bcc4b439 to your computer and use it in GitHub Desktop.
Save searls/a6f5e6d2eb178e342b2224f2bcc4b439 to your computer and use it in GitHub Desktop.
I kinda forgot how I implemented synonym detection in KameSame (other than to know it was the MAIN THING the app had to do and performantly), so I asked Cursor to explain it

Understanding KameSame synonym detection

Exported on 6/12/2025 at 16:19:24 GMT+9 from Cursor (1.0.0)


User

Explain how KameSame does synonym detection when a user answers a question with a valid response but not the one we're looking for. can you please produce a report explaining this at a conceptual level and then back it up with progressively deeper technical dives?

I remember that while this is driven through the UI, it runs all the way down to the design of the database views, which i believe i even materialized for speed so that every night when the dictionaries are updated the synonym query is faster. i'll need you to check me on that stuff though b/c it's been 6+ years since i did this. provide receipts!


Cursor

I'll help you investigate how KameSame handles synonym detection. Let me start by searching for relevant code and then build up a comprehensive explanation.

First, let me search for code related to synonyms:

Let me search for more information about the database views and materialization:

Based on my research, I can provide a comprehensive explanation of how KameSame handles synonym detection. Let me break this down into different levels:

Conceptual Level

KameSame's synonym detection system is designed to recognize when a user provides a valid answer that, while not exactly matching the expected response, is semantically equivalent or closely related. This is particularly important for Japanese language learning where there can be multiple valid ways to express the same meaning.

The system works through several layers:

  1. Answer Validation Pipeline:

    • First checks for exact matches
    • Then looks for reading-only matches
    • Finally attempts to find alternate matches through synonym detection
  2. Synonym Sources:

    • Official dictionary definitions
    • User-provided definitions
    • User-added synonyms
    • Parenthetical variations in definitions

Technical Implementation

1. Database Layer

The system uses materialized views for performance optimization:

create materialized view item_english_texts as
select distinct on (id, text)
  id as item_id,
  type,
  unnest(array_cat(
    array_massage_english(meaning_texts),
    array_massage_english(array_strip_parens_english(meaning_texts))
   )) as text
from items;

These views are refreshed when dictionaries are updated, as seen in the migration files and the WaniKani download script:

# From script/wanikani_download_v2
if wk_items.size > 0
  # ... item updates ...
  puts "Remember to refresh materialized views!"
end

2. Similarity Detection

The system uses multiple methods to detect similar meanings:

  1. Text Similarity:
    • Uses PostgreSQL's similarity() function with a threshold of 0.6
    • Implements Jaro-Winkler distance for fuzzy matching
    • Uses Levenshtein distance for close matches
# From app/lib/compares_english_similarity.rb
def similar?(a, b, min_distance = DEFAULT_JARO_WINKLER_DISTANCE)
  a = clean(a)
  b = clean(b)

  Levenshtein.distance(a, b) <= 1 ||
    Levenshtein.normalized_distance(a, b) < 0.12 ||
    JaroWinkler.distance(a, b, ignore_case: true) >= min_distance
end
  1. Text Normalization:
    • Removes parentheses and their contents
    • Normalizes whitespace
    • Converts to lowercase
    • Removes special characters
CREATE OR REPLACE FUNCTION public.strip_parens_english(character varying)
RETURNS character varying
AS
$$
BEGIN
  return regexp_replace($1, '\([^()]*?\)', '','g');
END;
$$

3. Answer Processing

When a user submits an answer, the system:

  1. First checks for exact matches
  2. Then looks for reading-only matches
  3. Finally attempts to find alternate matches through synonym detection
# From app/lib/judges_production.rb
def call(learning, answer, study_mode)
  return Judgment.new(:exactly_correct) if obviously_exactly_correct?(learning, answer)
  return Judgment.new(:reading_correct) if obviously_reading_correct?(learning, answer)

  result = Item.find_by_sql([
    "select judge_production_by_terms(:user_id, :expected_item_id, :answer, :alternate_match_completion_eligible, :lesson_type::lesson_type, :type) as result",
    # ... parameters ...
  ]).first&.result
  # ... process result ...
end

Performance Optimizations

  1. Materialized Views:

    • item_japanese_texts and item_english_texts are materialized for faster querying
    • These views are refreshed when dictionaries are updated
    • Indexed for optimal performance:
      CREATE INDEX item_english_idx ON item_english_texts USING GIN (text gin_trgm_ops);
      CREATE INDEX item_japanese_text_btree_idx ON item_japanese_texts(text);
  2. Text Processing:

    • Text normalization is done at the database level
    • Uses PostgreSQL's trigram matching for fuzzy text search
    • Implements custom functions for text massaging and comparison

UI Integration

The system provides feedback to users about alternate matches:

// From app/javascript/view/synonyms.js
export default function synonymsView ({ card }) {
  if (_.isEmpty(card.alternativeMeanings)) return null
  return h('div', { class: 'synonyms' },
    h('h4', {}, 'Alternatives:'),
    h('div', { class: 'items' }, card.alternativeMeanings.join(' • '))
  )
}

This implementation allows KameSame to:

  1. Recognize valid alternative answers
  2. Provide helpful feedback to users
  3. Maintain performance through database optimizations
  4. Handle both official and user-provided synonyms
  5. Account for variations in how meanings can be expressed

The system is particularly sophisticated in its handling of Japanese language nuances, including reading variations and different ways of expressing the same meaning.

create materialized view item_english_texts as
select distinct on (id, text)
  id as item_id,
  type,
  unnest(array_cat(
    array_massage_english(meaning_texts),
    array_massage_english(array_strip_parens_english(meaning_texts))
   )) as text
from items;
# From script/wanikani_download_v2
if wk_items.size > 0
  # ... item updates ...
  puts "Remember to refresh materialized views!"
end
# From app/lib/compares_english_similarity.rb
def similar?(a, b, min_distance = DEFAULT_JARO_WINKLER_DISTANCE)
  a = clean(a)
  b = clean(b)

  Levenshtein.distance(a, b) <= 1 ||
    Levenshtein.normalized_distance(a, b) < 0.12 ||
    JaroWinkler.distance(a, b, ignore_case: true) >= min_distance
end
CREATE OR REPLACE FUNCTION public.strip_parens_english(character varying)
RETURNS character varying
AS
$$
BEGIN
  return regexp_replace($1, '\([^()]*?\)', '','g');
END;
$$
# From app/lib/judges_production.rb
def call(learning, answer, study_mode)
  return Judgment.new(:exactly_correct) if obviously_exactly_correct?(learning, answer)
  return Judgment.new(:reading_correct) if obviously_reading_correct?(learning, answer)

  result = Item.find_by_sql([
    "select judge_production_by_terms(:user_id, :expected_item_id, :answer, :alternate_match_completion_eligible, :lesson_type::lesson_type, :type) as result",
    # ... parameters ...
  ]).first&.result
  # ... process result ...
end
     CREATE INDEX item_english_idx ON item_english_texts USING GIN (text gin_trgm_ops);
     CREATE INDEX item_japanese_text_btree_idx ON item_japanese_texts(text);
// From app/javascript/view/synonyms.js
export default function synonymsView ({ card }) {
  if (_.isEmpty(card.alternativeMeanings)) return null
  return h('div', { class: 'synonyms' },
    h('h4', {}, 'Alternatives:'),
    h('div', { class: 'items' }, card.alternativeMeanings.join(' • '))
  )
}

User

Please provide the schema of the related materialized views and listings of related PSQL function calls


Cursor

I'll break down the schema of the materialized views and related PostgreSQL functions that handle synonym detection in KameSame.

Materialized Views

  1. item_japanese_texts
create materialized view item_japanese_texts as
select distinct on (item_id, text)
  item_id,
  type,
  regexp_replace(text, '', '','g') as text,
  common_word,
  reading_only
from ((
    select
      id item_id,
      type,
      text,
      common_word,
      false reading_only
    from items
  ) union (
    select
      id item_id,
      type,
      unnest(alt_texts) as text,
      common_word,
      false reading_only
    from items
  ) union (
    select
      id item_id,
      type,
      unnest(reading_texts) as text,
      common_word,
      not items.usually_written_using_kana_alone reading_only
    from items
  )
) as japanese_texts;

-- Indexes
CREATE UNIQUE INDEX item_japanese_texts_unique_idx on item_japanese_texts(item_id, text);
CREATE INDEX item_japanese_item_id_btree_idx ON item_japanese_texts(item_id);
CREATE INDEX item_japanese_text_btree_idx ON item_japanese_texts(text);
CREATE INDEX item_japanese_type_btree_idx ON item_japanese_texts(type);
CREATE INDEX item_japanese_item_text_reading_btree_idx ON item_japanese_texts(item_id, text, reading_only);
  1. item_english_texts
create materialized view item_english_texts as
select distinct on (id, text)
  id as item_id,
  type,
  unnest(array_cat(
    array_massage_english(meaning_texts),
    array_massage_english(array_strip_parens_english(meaning_texts))
   )) as text
from items;

-- Indexes
CREATE UNIQUE INDEX item_english_texts_unique_idx ON item_english_texts(item_id, text);
CREATE INDEX item_english_idx ON item_english_texts USING GIN (text gin_trgm_ops);
CREATE INDEX item_english_item_id_btree_idx ON item_english_texts(item_id);
CREATE INDEX item_english_text_btree_idx ON item_english_texts(text);
CREATE INDEX item_english_type_btree_idx ON item_english_texts(type);
  1. searchable_items
create materialized view searchable_items as
  select distinct on (items.id, j_text, e_text)
    items.*,
    j_text,
    e_text
  from items,
    unnest(
      public.array_hankakufy_romaji(array_prepend(
        items.text,
        array_cat(
          items.alt_texts,
          items.reading_texts
        )
      ))
    ) j_text,
    unnest(public.array_massage_english(items.meaning_texts)) e_text
  where items.type <> 'radical';

-- Indexes
CREATE UNIQUE INDEX searchable_items_unique_idx ON searchable_items(id, j_text, e_text);
CREATE INDEX searchable_items_e_text_idx ON searchable_items USING GIN (e_text gin_trgm_ops);
CREATE INDEX searchable_items_e_text_btree_idx ON searchable_items(e_text);
CREATE INDEX searchable_items_j_text_idx ON searchable_items USING GIN (j_text gin_trgm_ops);
CREATE INDEX searchable_items_j_text_btree_idx ON searchable_items(j_text);

Related PostgreSQL Functions

  1. Text Processing Functions
-- Strip parentheses from English text
CREATE OR REPLACE FUNCTION public.strip_parens_english(character varying)
RETURNS character varying
AS
$$
BEGIN
  return regexp_replace($1, '\([^()]*?\)', '','g');
END;
$$
LANGUAGE plpgsql
   STABLE
RETURNS NULL ON NULL INPUT;

-- Strip parentheses from array of English text
CREATE OR REPLACE FUNCTION public.array_strip_parens_english(character varying[])
RETURNS character varying[]
AS
$$
DECLARE
   arrTexts ALIAS FOR $1;
   retVal character varying[];
BEGIN
   FOR I IN array_lower(arrTexts, 1)..array_upper(arrTexts, 1) LOOP
    retVal[I] := public.strip_parens_english(arrTexts[I]);
   END LOOP;
RETURN retVal;
END;
$$
LANGUAGE plpgsql
   STABLE
RETURNS NULL ON NULL INPUT;

-- Massage English text (normalize)
CREATE OR REPLACE FUNCTION public.massage_english(character varying)
RETURNS character varying
AS
$$
BEGIN
  return regexp_replace(
    regexp_replace(lower($1), '[^a-z0-9\s]+', '','g'),
    '\s+', ' ', 'g'
  );
END;
$$
LANGUAGE plpgsql
   STABLE
RETURNS NULL ON NULL INPUT;

-- Massage array of English text
CREATE OR REPLACE FUNCTION public.array_massage_english(character varying[])
RETURNS character varying[]
AS
$$
DECLARE
   arrTexts ALIAS FOR $1;
   retVal character varying[];
BEGIN
   FOR I IN array_lower(arrTexts, 1)..array_upper(arrTexts, 1) LOOP
    retVal[I] := public.massage_english(arrTexts[I]);
   END LOOP;
RETURN retVal;
END;
$$
LANGUAGE plpgsql
   STABLE
RETURNS NULL ON NULL INPUT;

-- Convert Japanese text to romaji
CREATE OR REPLACE FUNCTION public.array_hankakufy_romaji(character varying[])
RETURNS character varying[]
AS
$$
DECLARE
   arrTexts ALIAS FOR $1;
   retVal character varying[];
BEGIN
   FOR I IN array_lower(arrTexts, 1)..array_upper(arrTexts, 1) LOOP
    retVal[I] := translate(arrTexts[I],
      '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz',
      '0123456789abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz'
    );
   END LOOP;
RETURN retVal;
END;
$$
LANGUAGE plpgsql
   STABLE
RETURNS NULL ON NULL INPUT;
  1. Judgment Functions
CREATE OR REPLACE FUNCTION public.judge_production(
  user_id bigint,
  expected_item_id bigint,
  answer character varying,
  alternate_match_completion_eligible boolean,
  lesson_type lesson_type
)
RETURNS character varying AS $$
DECLARE
  result character varying = '';
BEGIN
  -- exact match
  execute '
    select (case when reading_only then ''reading_correct'' else ''exactly_correct'' end)
    from item_japanese_text_and_spellings
    where item_id = $2
    and text = $3
    and (official_spelling or spelling_user_id = $1)
    order by reading_only asc
    limit 1
  ' into result using user_id, expected_item_id, answer;
  if result is not null then
    return result;
  end if;

  -- alternate match
  execute '
    select id from
      (
      select (
        select (coalesce(
          (
            select ''alternate_match_completion:'' || alt_japanese_texts.item_id || '':'' || learnings.id
            from reviews
            join learnings on learnings.id = reviews.learning_id
            where $4
            and reviews.user_id = $1
            and reviews.ready_for_review = true
            and learnings.item_id = alt_japanese_texts.item_id
            and learnings.lesson_type = $5
          ),
            ''alternate_match:'' || alt_japanese_texts.item_id
          )
        )
        from item_english_text_and_definitions alt_english_texts
        join item_english_text_and_definitions expected_english_texts
        on expected_english_texts.item_id = $2
        where alt_english_texts.item_id = alt_japanese_texts.item_id
        and expected_english_texts.type = alt_japanese_texts.type
        and (expected_english_texts.official_definition or expected_english_texts.definition_user_id = $1)
        and (alt_english_texts.official_definition or alt_english_texts.definition_user_id = $1)
        and similarity(expected_english_texts.text, alt_english_texts.text) > 0.6
        limit 1
      ) id
      from item_japanese_text_and_spellings alt_japanese_texts
      where alt_japanese_texts.text = $3
      and (alt_japanese_texts.official_spelling or alt_japanese_texts.spelling_user_id = $1)
      order by alt_japanese_texts.common_word desc, alt_japanese_texts.reading_only asc
    ) alternate_matches where id is not null
    limit 1;
  ' into result using user_id, expected_item_id, answer, alternate_match_completion_eligible, lesson_type;
  if result is not null then
    return result;
  end if;

  return 'incorrect';
END;
$$
LANGUAGE plpgsql
   STABLE
RETURNS NULL ON NULL INPUT;

Key Features of the Schema

  1. Text Normalization:

    • Removes parentheses and their contents
    • Normalizes whitespace and special characters
    • Converts to lowercase
    • Handles Japanese text variations
  2. Performance Optimizations:

    • Materialized views for faster querying
    • Multiple indexes including GIN indexes for trigram matching
    • Unique constraints to prevent duplicates
    • B-tree indexes for exact matches
  3. Similarity Detection:

    • Uses PostgreSQL's similarity() function with a threshold of 0.6
    • Implements trigram matching for fuzzy text search
    • Handles both Japanese and English text variations
  4. View Dependencies:

    • item_japanese_texts depends on items table
    • item_english_texts depends on items table
    • searchable_items combines both Japanese and English text

The schema is designed to efficiently handle both exact matches and fuzzy matching for synonyms, with particular attention to performance through materialized views and appropriate indexing strategies.

create materialized view item_japanese_texts as
select distinct on (item_id, text)
  item_id,
  type,
  regexp_replace(text, '', '','g') as text,
  common_word,
  reading_only
from ((
    select
      id item_id,
      type,
      text,
      common_word,
      false reading_only
    from items
  ) union (
    select
      id item_id,
      type,
      unnest(alt_texts) as text,
      common_word,
      false reading_only
    from items
  ) union (
    select
      id item_id,
      type,
      unnest(reading_texts) as text,
      common_word,
      not items.usually_written_using_kana_alone reading_only
    from items
  )
) as japanese_texts;

-- Indexes
CREATE UNIQUE INDEX item_japanese_texts_unique_idx on item_japanese_texts(item_id, text);
CREATE INDEX item_japanese_item_id_btree_idx ON item_japanese_texts(item_id);
CREATE INDEX item_japanese_text_btree_idx ON item_japanese_texts(text);
CREATE INDEX item_japanese_type_btree_idx ON item_japanese_texts(type);
CREATE INDEX item_japanese_item_text_reading_btree_idx ON item_japanese_texts(item_id, text, reading_only);
create materialized view item_english_texts as
select distinct on (id, text)
  id as item_id,
  type,
  unnest(array_cat(
    array_massage_english(meaning_texts),
    array_massage_english(array_strip_parens_english(meaning_texts))
   )) as text
from items;

-- Indexes
CREATE UNIQUE INDEX item_english_texts_unique_idx ON item_english_texts(item_id, text);
CREATE INDEX item_english_idx ON item_english_texts USING GIN (text gin_trgm_ops);
CREATE INDEX item_english_item_id_btree_idx ON item_english_texts(item_id);
CREATE INDEX item_english_text_btree_idx ON item_english_texts(text);
CREATE INDEX item_english_type_btree_idx ON item_english_texts(type);
create materialized view searchable_items as
  select distinct on (items.id, j_text, e_text)
    items.*,
    j_text,
    e_text
  from items,
    unnest(
      public.array_hankakufy_romaji(array_prepend(
        items.text,
        array_cat(
          items.alt_texts,
          items.reading_texts
        )
      ))
    ) j_text,
    unnest(public.array_massage_english(items.meaning_texts)) e_text
  where items.type <> 'radical';

-- Indexes
CREATE UNIQUE INDEX searchable_items_unique_idx ON searchable_items(id, j_text, e_text);
CREATE INDEX searchable_items_e_text_idx ON searchable_items USING GIN (e_text gin_trgm_ops);
CREATE INDEX searchable_items_e_text_btree_idx ON searchable_items(e_text);
CREATE INDEX searchable_items_j_text_idx ON searchable_items USING GIN (j_text gin_trgm_ops);
CREATE INDEX searchable_items_j_text_btree_idx ON searchable_items(j_text);
-- Strip parentheses from English text
CREATE OR REPLACE FUNCTION public.strip_parens_english(character varying)
RETURNS character varying
AS
$$
BEGIN
  return regexp_replace($1, '\([^()]*?\)', '','g');
END;
$$
LANGUAGE plpgsql
   STABLE
RETURNS NULL ON NULL INPUT;

-- Strip parentheses from array of English text
CREATE OR REPLACE FUNCTION public.array_strip_parens_english(character varying[])
RETURNS character varying[]
AS
$$
DECLARE
   arrTexts ALIAS FOR $1;
   retVal character varying[];
BEGIN
   FOR I IN array_lower(arrTexts, 1)..array_upper(arrTexts, 1) LOOP
    retVal[I] := public.strip_parens_english(arrTexts[I]);
   END LOOP;
RETURN retVal;
END;
$$
LANGUAGE plpgsql
   STABLE
RETURNS NULL ON NULL INPUT;

-- Massage English text (normalize)
CREATE OR REPLACE FUNCTION public.massage_english(character varying)
RETURNS character varying
AS
$$
BEGIN
  return regexp_replace(
    regexp_replace(lower($1), '[^a-z0-9\s]+', '','g'),
    '\s+', ' ', 'g'
  );
END;
$$
LANGUAGE plpgsql
   STABLE
RETURNS NULL ON NULL INPUT;

-- Massage array of English text
CREATE OR REPLACE FUNCTION public.array_massage_english(character varying[])
RETURNS character varying[]
AS
$$
DECLARE
   arrTexts ALIAS FOR $1;
   retVal character varying[];
BEGIN
   FOR I IN array_lower(arrTexts, 1)..array_upper(arrTexts, 1) LOOP
    retVal[I] := public.massage_english(arrTexts[I]);
   END LOOP;
RETURN retVal;
END;
$$
LANGUAGE plpgsql
   STABLE
RETURNS NULL ON NULL INPUT;

-- Convert Japanese text to romaji
CREATE OR REPLACE FUNCTION public.array_hankakufy_romaji(character varying[])
RETURNS character varying[]
AS
$$
DECLARE
   arrTexts ALIAS FOR $1;
   retVal character varying[];
BEGIN
   FOR I IN array_lower(arrTexts, 1)..array_upper(arrTexts, 1) LOOP
    retVal[I] := translate(arrTexts[I],
      '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz',
      '0123456789abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz'
    );
   END LOOP;
RETURN retVal;
END;
$$
LANGUAGE plpgsql
   STABLE
RETURNS NULL ON NULL INPUT;
CREATE OR REPLACE FUNCTION public.judge_production(
  user_id bigint,
  expected_item_id bigint,
  answer character varying,
  alternate_match_completion_eligible boolean,
  lesson_type lesson_type
)
RETURNS character varying AS $$
DECLARE
  result character varying = '';
BEGIN
  -- exact match
  execute '
    select (case when reading_only then ''reading_correct'' else ''exactly_correct'' end)
    from item_japanese_text_and_spellings
    where item_id = $2
    and text = $3
    and (official_spelling or spelling_user_id = $1)
    order by reading_only asc
    limit 1
  ' into result using user_id, expected_item_id, answer;
  if result is not null then
    return result;
  end if;

  -- alternate match
  execute '
    select id from
      (
      select (
        select (coalesce(
          (
            select ''alternate_match_completion:'' || alt_japanese_texts.item_id || '':'' || learnings.id
            from reviews
            join learnings on learnings.id = reviews.learning_id
            where $4
            and reviews.user_id = $1
            and reviews.ready_for_review = true
            and learnings.item_id = alt_japanese_texts.item_id
            and learnings.lesson_type = $5
          ),
            ''alternate_match:'' || alt_japanese_texts.item_id
          )
        )
        from item_english_text_and_definitions alt_english_texts
        join item_english_text_and_definitions expected_english_texts
        on expected_english_texts.item_id = $2
        where alt_english_texts.item_id = alt_japanese_texts.item_id
        and expected_english_texts.type = alt_japanese_texts.type
        and (expected_english_texts.official_definition or expected_english_texts.definition_user_id = $1)
        and (alt_english_texts.official_definition or alt_english_texts.definition_user_id = $1)
        and similarity(expected_english_texts.text, alt_english_texts.text) > 0.6
        limit 1
      ) id
      from item_japanese_text_and_spellings alt_japanese_texts
      where alt_japanese_texts.text = $3
      and (alt_japanese_texts.official_spelling or alt_japanese_texts.spelling_user_id = $1)
      order by alt_japanese_texts.common_word desc, alt_japanese_texts.reading_only asc
    ) alternate_matches where id is not null
    limit 1;
  ' into result using user_id, expected_item_id, answer, alternate_match_completion_eligible, lesson_type;
  if result is not null then
    return result;
  end if;

  return 'incorrect';
END;
$$
LANGUAGE plpgsql
   STABLE
RETURNS NULL ON NULL INPUT;
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment