@thomasdavis
Created April 24, 2025 15:01
Below is a “from-PDF-to-production” blueprint that lets you pour the entire Grammar of Kuku Yalanji into a single modern stack – relational tables for precision, a vector index for AI search, and a graph/RDF layer for linked-data reuse.


1 Why three layers?

| Layer | What it gives you | Typical tech |
| --- | --- | --- |
| Relational / JSONB | Loss-less storage of paradigms, rules, example IDs; fast SQL & GraphQL | PostgreSQL 16 |
| Vector index | Semantic retrieval for RAG (“find the paragraph that explains ergative case”) | pgvector inside Postgres, or an external DB like Weaviate |
| Graph / RDF | Publish to web standards, link lemmas → rules → examples, interoperate with Wikidata | OntoLex-Lemon in a triple-store or Neo4j |

A single Postgres instance with the pgvector extension lets you keep both SQL tables and ANN search, but you can swap in Weaviate or Qdrant if you prefer a dedicated vector DB.


2 Extraction pipeline (ETL)

grammar.pdf ──► 1. pandoc --pdf-engine=…  →  Markdown per page
                2. custom parser           →  JSON chunks
                3. db_loader.py            →  POSTGRES
  1. Chunking strategy
    One heading = one chunk (use the TOC you pasted). Store section_id, parent_id, title, level, page_start, page_end, markdown.
  2. Tables (simplified):
SECTION(id, parent_id, title, level, pg_start, pg_end, md_text, embedding VECTOR)
GRAM_FEATURE(id, section_id, tag, value, note_json)
PARADIGM_CELL(id, lemma, tags_json, form, source_page)
IGT_EXAMPLE(id, section_id, src_text, gloss_json, translation)
  3. Embeddings
    ALTER TABLE section ADD COLUMN embedding vector(1536);
    UPDATE section
      SET embedding = openai_embed(md_text);
    CREATE INDEX ON section USING hnsw (embedding vector_cosine_ops);

The openai_embed() call is pseudocode – wrap the OpenAI (or local) API inside a PL/Python function or push the vectors to Weaviate.
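One way to make that pseudocode concrete is a small Python loader run outside the database instead of as a PL/Python function. This is a sketch: the `embed_sections` and `to_pgvector` names are mine, and it assumes the OpenAI v1 Python client plus a psycopg connection to the schema above.

```python
from typing import Iterable

def to_pgvector(vec: Iterable[float]) -> str:
    """Format a list of floats as a pgvector literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"

def embed_sections(conn, client, model: str = "text-embedding-3-small"):
    """Fetch un-embedded sections, call the embedding API, write vectors back."""
    with conn.cursor() as cur:
        cur.execute("SELECT id, md_text FROM section WHERE embedding IS NULL")
        for sec_id, md_text in cur.fetchall():
            resp = client.embeddings.create(model=model, input=md_text)
            cur.execute(
                "UPDATE section SET embedding = %s::vector WHERE id = %s",
                (to_pgvector(resp.data[0].embedding), sec_id),
            )
    conn.commit()
```

Batching the `input` list and rate-limit handling are left out for brevity; the `::vector` cast is what lets a plain text literal land in the pgvector column.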


3 Add linguistics-aware formats

3.1 CLDF “Grammar Profiles”

Store feature/value pairs in the Cross-Linguistic Data Formats (CLDF) StructureTable – already a CSV+JSON spec used by Glottobank.
Your loader can emit:

ID, Parameter_ID, Language_ID, Value, Source
erg-optional,kuku,gvn,optional,"Patz §4.1.4"
case-stacking,kuku,gvn,yes,"Patz §3.2.3.3"
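Emitting those rows is a one-liner with the stdlib csv module, which also handles quoting when a `Source` field contains a comma. A sketch (the function name and in-memory row list are mine):

```python
import csv
import io

# The same feature/value pairs shown above, as loader output.
ROWS = [
    ("erg-optional", "kuku", "gvn", "optional", "Patz §4.1.4"),
    ("case-stacking", "kuku", "gvn", "yes", "Patz §3.2.3.3"),
]

def write_structure_table(rows) -> str:
    """Render a CLDF StructureTable as CSV text with the standard header."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["ID", "Parameter_ID", "Language_ID", "Value", "Source"])
    writer.writerows(rows)
    return buf.getvalue()
```

In practice cldfbench would generate the accompanying JSON metadata; this only covers the CSV half of the spec.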

3.2 Interlinear examples

Encode every glossed text in XIGT JSON (or Xigt XML) – an extensible standard for IGT that loads cleanly into Python. Keep only Example IDs in Postgres; the heavy IGT files live in object storage.
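The division of labour might look like this: Postgres holds the ID, while the object-storage record is tier-structured JSON in the XIGT style. The shape below is illustrative only, with placeholder text rather than real Kuku Yalanji data – consult the Xigt docs for the normative schema.

```python
# One interlinear example as XIGT-style tiers: phrase, words, glosses,
# free translation.  The id mirrors IGT_EXAMPLE.id in Postgres.
example = {
    "id": "ex-042",
    "tiers": [
        {"type": "phrases",      "items": [{"id": "p1", "text": "<source sentence>"}]},
        {"type": "words",        "items": [{"id": "w1", "segmentation": "p1"}]},
        {"type": "glosses",      "items": [{"id": "g1", "alignment": "w1", "text": "<gloss>"}]},
        {"type": "translations", "items": [{"id": "t1", "alignment": "p1", "text": "<free translation>"}]},
    ],
}
```

The `alignment`/`segmentation` references are what make the format extensible: new tiers (phonemic, prosodic) can point at existing items without touching them.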

3.3 Linked-data export

Nightly script converts lemmas + rules → OntoLex-Lemon RDF; publish via Apache Jena Fuseki. Now any SPARQL user can ask “give me all interrogative pronouns in Kuku Yalanji”.
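A minimal sketch of that nightly export, building Turtle by hand with no RDF library. The base URI and function name are hypothetical; `ontolex:LexicalEntry`, `ontolex:canonicalForm`, and `ontolex:writtenRep` are the core OntoLex-Lemon terms, and `gvn` is the ISO 639-3 language tag used for the written form.

```python
ONTOLEX = "http://www.w3.org/ns/lemon/ontolex#"

def lemma_to_turtle(base_uri: str, lemma_id: str, written_rep: str) -> str:
    """Render one lemma as an ontolex:LexicalEntry with its canonical form."""
    return (
        f"@prefix ontolex: <{ONTOLEX}> .\n\n"
        f"<{base_uri}{lemma_id}> a ontolex:LexicalEntry ;\n"
        f'    ontolex:canonicalForm [ ontolex:writtenRep "{written_rep}"@gvn ] .\n'
    )
```

For anything beyond a toy export you would reach for RDFLib, which handles escaping and serialisation formats; the point here is only how little structure a lemma entry needs before Fuseki can serve it.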


4 GraphQL / API layer

Use Hasura or PostGraphile on Postgres:

query  {
  section_by_pk(id: "3.2.1") {
    title
    md_text
    nearest (limit: 3, query: "case alignment") {
      id title
    }
  }
}

That nearest field is a custom function wrapping pgvector similarity search.
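Under the hood, that function boils down to one pgvector query: embed the search string, then order sections by cosine distance with the `<=>` operator (matching the `vector_cosine_ops` index created earlier). A hedged Python sketch of the query and its parameter binding – names are mine, and the caller is assumed to have already embedded the query text:

```python
# Top-k nearest sections by cosine distance (pgvector's <=> operator).
NEAREST_SQL = """
SELECT id, title
FROM section
ORDER BY embedding <=> %(query_embedding)s::vector
LIMIT %(k)s
"""

def nearest_params(query_embedding: list[float], k: int = 3) -> dict:
    """Bind parameters for NEAREST_SQL, formatting the pgvector literal."""
    literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    return {"query_embedding": literal, "k": k}
```

Wrapped in a `CREATE FUNCTION … RETURNS SETOF section`, Hasura and PostGraphile will both expose it as a computed field like the `nearest` shown in the query above.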


5 RAG workflow for MobTranslate

User English sentence
        │
        ├─►  Keyword scan → GRAM_FEATURE + PARADIGM_CELL
        ├─►  Vector search → SECTION.embedding (top-k)
        │
        ▼
Prompt:  dictionary snippets + grammar chunks + user text
        │
        ▼
     LLM / GPT-4
        │
        ▼
  Kuku Yalanji output + cite_ids

Each cited section_id lets the UI show “Source: §3.8.4.2 Functions of verbal inflections” with a pop-up of the paragraph.
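The prompt-assembly step in the middle of that diagram can be sketched as follows; the function and prompt wording are mine, and the keyword/vector retrieval is assumed to have already returned its snippets:

```python
def build_prompt(user_text: str, dictionary_snippets: list[str],
                 grammar_chunks: list[tuple[str, str]]) -> str:
    """Assemble the LLM prompt; grammar_chunks are (section_id, md_text) pairs
    so the model can cite section IDs the UI can later resolve."""
    cites = "\n\n".join(f"[§{sid}] {text}" for sid, text in grammar_chunks)
    words = "\n".join(dictionary_snippets)
    return (
        "Translate the sentence into Kuku Yalanji. "
        "Cite the section IDs you relied on.\n\n"
        f"Dictionary entries:\n{words}\n\n"
        f"Grammar excerpts:\n{cites}\n\n"
        f"Sentence: {user_text}\n"
    )
```

Because each excerpt carries its `section_id` inline, the model's cited IDs can be matched back to SECTION rows for the source pop-up without any extra bookkeeping.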


6 Modern tooling starter kit

| Task | Open-source tool |
| --- | --- |
| PDF → MD/OCR | pandoc, pymupdf, tesserocr |
| Postgres + vectors | pgvector (ACID & SQL joins) |
| Vector-native alternative | Weaviate, cloud or self-hosted |
| CLDF helpers | cldfbench – generates StructureTables |
| IGT parsing | xigt library |
| Graph export | RDFLib + OntoLex |
| RAG glue | LangChain with its pgvector integration |

7 What “modern” gives you

  • One source of truth – everything lives in Postgres; vectors and JSON sidecars ride in the same DB.
  • AI-ready – embeddings let you power search, chatbots, and translation prompts.
  • Linked Data – linguists can harvest your OntoLex endpoint and cite individual rules.
  • No lock-in – CLDF CSVs, XIGT JSON, RDF triples are all open standards.

With this setup you can ingest any new grammar or dictionary just by adding rows and re-embedding. Scrapes of contemporary Kuku texts, speech transcripts, or children’s stories slot into the same schema and immediately become searchable for models and humans alike.
