@thomasdavis
Created April 24, 2025 15:01
Below is a “from-PDF-to-production” blueprint that lets you pour the entire Grammar of Kuku Yalanji into a single modern stack – relational tables for precision, a vector index for AI search, and a graph/RDF layer for linked-data reuse.


1 Why three layers?

| Layer | What it gives you | Typical tech |
| --- | --- | --- |
| Relational / JSONB | Loss-less storage of paradigms, rules, example IDs; fast SQL & GraphQL | PostgreSQL 16 |
| Vector index | Semantic retrieval for RAG (“find the paragraph that explains ergative case”) | pgvector inside Postgres, or an external DB like Weaviate |
| Graph / RDF | Publish to web standards, link lemmas → rules → examples, interoperate with Wikidata | OntoLex-Lemon in a triple-store or Neo4j |

A single Postgres instance with the pgvector extension lets you keep both SQL tables and ANN search, but you can swap in Weaviate or Qdrant if you prefer a dedicated vector DB.


2 Extraction pipeline (ETL)

grammar.pdf ──► 1. pandoc --pdf-engine=…  →  Markdown per page
                2. custom parser           →  JSON chunks
                3. db_loader.py            →  POSTGRES
  1. Chunking strategy
    One heading = one chunk (use the TOC you pasted). Store section_id, parent_id, title, level, page_start, page_end, markdown.
  2. Tables (simplified):
SECTION(id, parent_id, title, level, pg_start, pg_end, md_text, embedding VECTOR)
GRAM_FEATURE(id, section_id, tag, value, note_json)
PARADIGM_CELL(id, lemma, tags_json, form, source_page)
IGT_EXAMPLE(id, section_id, src_text, gloss_json, translation)
  3. Embeddings
    ALTER TABLE section ADD COLUMN embedding vector(1536);
    UPDATE section
      SET embedding = openai_embed(md_text);
    CREATE INDEX ON section USING hnsw (embedding vector_cosine_ops);

The openai_embed() call is pseudocode – wrap the OpenAI (or local) API inside a PL/Python function or push the vectors to Weaviate.
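One way to make that pseudocode concrete is a small Python loader run outside the database instead of as a PL/Python function. This is a sketch: the `embed_sections` and `to_pgvector` names are mine, and it assumes the OpenAI v1 Python client plus a psycopg connection to the schema above.

```python
from typing import Iterable

def to_pgvector(vec: Iterable[float]) -> str:
    """Format a list of floats as a pgvector literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"

def embed_sections(conn, client, model: str = "text-embedding-3-small"):
    """Fetch un-embedded sections, call the embedding API, write vectors back."""
    with conn.cursor() as cur:
        cur.execute("SELECT id, md_text FROM section WHERE embedding IS NULL")
        for sec_id, md_text in cur.fetchall():
            resp = client.embeddings.create(model=model, input=md_text)
            cur.execute(
                "UPDATE section SET embedding = %s::vector WHERE id = %s",
                (to_pgvector(resp.data[0].embedding), sec_id),
            )
    conn.commit()
```

Batching the `input` list and rate-limit handling are left out for brevity; the `::vector` cast is what lets a plain text literal land in the pgvector column.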


3 Add linguistics-aware formats

3.1 CLDF “Grammar Profiles”

Store feature/value pairs in the Cross-Linguistic Data Formats (CLDF) StructureTable – already a CSV+JSON spec used by Glottobank.
Your loader can emit:

ID, Parameter_ID, Language_ID, Value, Source
erg-optional,kuku,gvn,optional,"Patz §4.1.4"
case-stacking,kuku,gvn,yes,"Patz §3.2.3.3"
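Emitting those rows is a one-liner with the stdlib csv module, which also handles quoting when a `Source` field contains a comma. A sketch (the function name and in-memory row list are mine):

```python
import csv
import io

# The same feature/value pairs shown above, as loader output.
ROWS = [
    ("erg-optional", "kuku", "gvn", "optional", "Patz §4.1.4"),
    ("case-stacking", "kuku", "gvn", "yes", "Patz §3.2.3.3"),
]

def write_structure_table(rows) -> str:
    """Render a CLDF StructureTable as CSV text with the standard header."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["ID", "Parameter_ID", "Language_ID", "Value", "Source"])
    writer.writerows(rows)
    return buf.getvalue()
```

In practice cldfbench would generate the accompanying JSON metadata; this only covers the CSV half of the spec.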

3.2 Interlinear examples

Encode every glossed text in XIGT JSON (or Xigt XML) – an extensible standard for IGT that loads cleanly into Python. Keep only Example IDs in Postgres; the heavy IGT files live in object storage.
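The division of labour might look like this: Postgres holds the ID, while the object-storage record is tier-structured JSON in the XIGT style. The shape below is illustrative only, with placeholder text rather than real Kuku Yalanji data – consult the Xigt docs for the normative schema.

```python
# One interlinear example as XIGT-style tiers: phrase, words, glosses,
# free translation.  The id mirrors IGT_EXAMPLE.id in Postgres.
example = {
    "id": "ex-042",
    "tiers": [
        {"type": "phrases",      "items": [{"id": "p1", "text": "<source sentence>"}]},
        {"type": "words",        "items": [{"id": "w1", "segmentation": "p1"}]},
        {"type": "glosses",      "items": [{"id": "g1", "alignment": "w1", "text": "<gloss>"}]},
        {"type": "translations", "items": [{"id": "t1", "alignment": "p1", "text": "<free translation>"}]},
    ],
}
```

The `alignment`/`segmentation` references are what make the format extensible: new tiers (phonemic, prosodic) can point at existing items without touching them.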

3.3 Linked-data export

Nightly script converts lemmas + rules → OntoLex-Lemon RDF; publish via Apache Jena Fuseki. Now any SPARQL user can ask “give me all interrogative pronouns in Kuku Yalanji”.
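A minimal sketch of that nightly export, building Turtle by hand with no RDF library. The base URI and function name are hypothetical; `ontolex:LexicalEntry`, `ontolex:canonicalForm`, and `ontolex:writtenRep` are the core OntoLex-Lemon terms, and `gvn` is the ISO 639-3 language tag used for the written form.

```python
ONTOLEX = "http://www.w3.org/ns/lemon/ontolex#"

def lemma_to_turtle(base_uri: str, lemma_id: str, written_rep: str) -> str:
    """Render one lemma as an ontolex:LexicalEntry with its canonical form."""
    return (
        f"@prefix ontolex: <{ONTOLEX}> .\n\n"
        f"<{base_uri}{lemma_id}> a ontolex:LexicalEntry ;\n"
        f'    ontolex:canonicalForm [ ontolex:writtenRep "{written_rep}"@gvn ] .\n'
    )
```

For anything beyond a toy export you would reach for RDFLib, which handles escaping and serialisation formats; the point here is only how little structure a lemma entry needs before Fuseki can serve it.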


4 GraphQL / API layer

Use Hasura or PostGraphile on Postgres:

query  {
  section_by_pk(id: "3.2.1") {
    title
    md_text
    nearest (limit: 3, query: "case alignment") {
      id title
    }
  }
}

That nearest field is a custom function wrapping pgvector similarity search.
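Under the hood, that function boils down to one pgvector query: embed the search string, then order sections by cosine distance with the `<=>` operator (matching the `vector_cosine_ops` index created earlier). A hedged Python sketch of the query and its parameter binding – names are mine, and the caller is assumed to have already embedded the query text:

```python
# Top-k nearest sections by cosine distance (pgvector's <=> operator).
NEAREST_SQL = """
SELECT id, title
FROM section
ORDER BY embedding <=> %(query_embedding)s::vector
LIMIT %(k)s
"""

def nearest_params(query_embedding: list[float], k: int = 3) -> dict:
    """Bind parameters for NEAREST_SQL, formatting the pgvector literal."""
    literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    return {"query_embedding": literal, "k": k}
```

Wrapped in a `CREATE FUNCTION … RETURNS SETOF section`, Hasura and PostGraphile will both expose it as a computed field like the `nearest` shown in the query above.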


5 RAG workflow for MobTranslate

User English sentence
        │
        ├─►  Keyword scan → GRAM_FEATURE + PARADIGM_CELL
        ├─►  Vector search → SECTION.embedding (top-k)
        │
        ▼
Prompt:  dictionary snippets + grammar chunks + user text
        │
        ▼
     LLM / GPT-4
        │
        ▼
  Kuku Yalanji output + cite_ids

Each cited section_id lets the UI show “Source: §3.8.4.2 Functions of verbal inflections” with a pop-up of the paragraph.
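The prompt-assembly step in the middle of that diagram can be sketched as follows; the function and prompt wording are mine, and the keyword/vector retrieval is assumed to have already returned its snippets:

```python
def build_prompt(user_text: str, dictionary_snippets: list[str],
                 grammar_chunks: list[tuple[str, str]]) -> str:
    """Assemble the LLM prompt; grammar_chunks are (section_id, md_text) pairs
    so the model can cite section IDs the UI can later resolve."""
    cites = "\n\n".join(f"[§{sid}] {text}" for sid, text in grammar_chunks)
    words = "\n".join(dictionary_snippets)
    return (
        "Translate the sentence into Kuku Yalanji. "
        "Cite the section IDs you relied on.\n\n"
        f"Dictionary entries:\n{words}\n\n"
        f"Grammar excerpts:\n{cites}\n\n"
        f"Sentence: {user_text}\n"
    )
```

Because each excerpt carries its `section_id` inline, the model's cited IDs can be matched back to SECTION rows for the source pop-up without any extra bookkeeping.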


6 Modern tooling starter kit

| Task | Open-source tool |
| --- | --- |
| PDF → MD/OCR | pandoc, pymupdf, tesserocr |
| Postgres + vectors | pgvector (ACID & SQL joins) |
| Vector-native alternative | Weaviate, cloud or self-hosted |
| CLDF helpers | cldfbench – generates StructureTables |
| IGT parsing | xigt library |
| Graph export | RDFLib + OntoLex |
| RAG glue | LangChain with its pgvector integration |

7 What “modern” gives you

  • One source of truth – everything lives in Postgres; vectors and JSON sidecars ride in the same DB.
  • AI-ready – embeddings let you power search, chatbots, and translation prompts.
  • Linked Data – linguists can harvest your OntoLex endpoint and cite individual rules.
  • No lock-in – CLDF CSVs, XIGT JSON, RDF triples are all open standards.

With this setup you can ingest any new grammar or dictionary just by adding rows and re-embedding. Scrapes of contemporary Kuku texts, speech transcripts, or children’s stories slot into the same schema and immediately become searchable for models and humans alike.
