Below is a “from-PDF-to-production” blueprint that lets you pour the entire Grammar of Kuku Yalanji into a single modern stack – relational tables for precision, a vector index for AI search, and a graph/RDF layer for linked-data reuse.
| Layer | What it gives you | Typical tech |
|---|---|---|
| Relational / JSONB | Lossless storage of paradigms, rules, example IDs; fast SQL & GraphQL | PostgreSQL 16 |
| Vector index | Semantic retrieval for RAG ("find the paragraph that explains ergative case") | pgvector inside Postgres, or an external DB like Weaviate |
| Graph / RDF | Publish to web standards, link lemmas → rules → examples, interoperate with Wikidata | OntoLex-Lemon in a triple store or Neo4j |
A single Postgres instance with the pgvector extension lets you keep both SQL tables and ANN search, but you can swap in Weaviate or Qdrant if you prefer a dedicated vector DB.
```
grammar.pdf ──► 1. pandoc --pdf-engine=… → Markdown per page
                2. custom parser        → JSON chunks
                3. db_loader.py         → Postgres
```
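Step 2 of that pipeline can be sketched in a few lines: walk the per-page Markdown and cut a new chunk at every heading, tracking nesting so each chunk knows its parent. This is a minimal sketch — the function name, `section_id` scheme, and field names are illustrative, not a fixed API:

```python
import json
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def chunk_markdown(md: str) -> list[dict]:
    """Split Markdown into one chunk per heading, tracking parent sections."""
    chunks, stack = [], []  # stack holds (level, chunk) for parent lookup
    current = None
    for line in md.splitlines():
        m = HEADING.match(line)
        if m:
            level = len(m.group(1))
            # Pop siblings/deeper levels so stack[-1] is the true parent.
            while stack and stack[-1][0] >= level:
                stack.pop()
            current = {
                "section_id": f"s{len(chunks) + 1}",
                "parent_id": stack[-1][1]["section_id"] if stack else None,
                "title": m.group(2).strip(),
                "level": level,
                "markdown": "",
            }
            chunks.append(current)
            stack.append((level, current))
        elif current is not None:
            current["markdown"] += line + "\n"
    return chunks

chunks = chunk_markdown("# Nouns\n## Case\nErgative is optional.\n")
print(json.dumps(chunks, indent=2))
```

`db_loader.py` then only has to add `page_start`/`page_end` (recoverable from pandoc's per-page output) and bulk-insert the dicts.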
- Chunking strategy – one heading = one chunk (use the TOC you pasted). Store `section_id`, `parent_id`, `title`, `level`, `page_start`, `page_end`, `markdown`.
- Tables (simplified):
```
SECTION(id, parent_id, title, level, pg_start, pg_end, md_text, embedding VECTOR)
GRAM_FEATURE(id, section_id, tag, value, note_json)
PARADIGM_CELL(id, lemma, tags_json, form, source_page)
IGT_EXAMPLE(id, section_id, src_text, gloss_json, translation)
```
- Embeddings
```sql
ALTER TABLE section ADD COLUMN embedding vector(1536);
UPDATE section SET embedding = openai_embed(md_text);
CREATE INDEX ON section USING hnsw (embedding vector_cosine_ops);
```
The `openai_embed()` call is pseudocode – wrap the OpenAI (or a local) API inside a PL/Python function, or push the vectors to Weaviate instead.
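If you embed from application code rather than inside PL/Python, the only Postgres-specific detail is pgvector's text literal format (`'[0.1,0.2,…]'`). A minimal sketch, assuming some `embed()` callable exists that returns a list of floats (the helper names are illustrative):

```python
def to_pgvector_literal(vec: list[float]) -> str:
    """Format a float list as a pgvector text literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"

def update_embedding_sql(section_id: str, vec: list[float]) -> tuple[str, tuple]:
    """Build a parameterised UPDATE; the text literal is cast to vector."""
    sql = "UPDATE section SET embedding = %s::vector WHERE id = %s"
    return sql, (to_pgvector_literal(vec), section_id)

sql, params = update_embedding_sql("3.2.1", [0.1, 0.2, 0.3])
```

Run the resulting statement through psycopg (or any driver) per section, or batch sections per API call to stay under embedding-rate limits.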
Store feature/value pairs in the Cross-Linguistic Data Formats (CLDF) StructureTable – an existing CSV+JSON spec used by Glottobank.
Your loader can emit:
```csv
ID,Parameter_ID,Language_ID,Value,Source
erg-optional,kuku,gvn,optional,"Patz §4.1.4"
case-stacking,kuku,gvn,yes,"Patz §3.2.3.3"
```
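Emitting those rows is plain CSV writing; a standard-library sketch is below (column order follows the header above — for a fully valid CLDF dataset you would normally let cldfbench generate the accompanying metadata JSON):

```python
import csv
import io

# Rows mirror the StructureTable example above.
ROWS = [
    ("erg-optional", "kuku", "gvn", "optional", "Patz §4.1.4"),
    ("case-stacking", "kuku", "gvn", "yes", "Patz §3.2.3.3"),
]

def write_structure_table(rows) -> str:
    """Serialise feature/value rows as a CLDF-style StructureTable CSV."""
    buf = io.StringIO()
    w = csv.writer(buf)
    w.writerow(["ID", "Parameter_ID", "Language_ID", "Value", "Source"])
    w.writerows(rows)
    return buf.getvalue()

print(write_structure_table(ROWS))
```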
Encode every glossed text in Xigt JSON (or Xigt XML) – an extensible standard for IGT that loads cleanly into Python. Keep only example IDs in Postgres; the heavy IGT files live in object storage.
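The core of the Xigt model is tiers of aligned items. A hand-rolled sketch of what one example can look like as JSON follows — tier names and the `alignment` field illustrate the idea rather than reproduce the normative Xigt schema, and the token strings are placeholders:

```python
import json

# One IGT example as aligned tiers: words, glosses, free translation.
# Tier names and alignment fields are illustrative; tokens are placeholders.
example = {
    "id": "igt-001",
    "tiers": [
        {"type": "words", "items": [
            {"id": "w1", "text": "tok1"},
            {"id": "w2", "text": "tok2"},
        ]},
        {"type": "glosses", "items": [
            {"id": "g1", "alignment": "w1", "text": "gloss1"},
            {"id": "g2", "alignment": "w2", "text": "gloss2"},
        ]},
        {"type": "translations", "items": [
            {"id": "t1", "text": "free translation"},
        ]},
    ],
}

# Round-trips cleanly, so Postgres only needs the example ID while the
# JSON blob lives in object storage.
blob = json.dumps(example)
assert json.loads(blob)["tiers"][1]["items"][0]["alignment"] == "w1"
```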
Nightly script converts lemmas + rules → OntoLex-Lemon RDF; publish via Apache Jena Fuseki. Now any SPARQL user can ask “give me all interrogative pronouns in Kuku Yalanji”.
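Per lemma, the nightly export can be as small as emitting a handful of N-Triples; here is a stdlib-only sketch (in practice RDFLib gives you serialisation and namespace handling for free — the base URI is a placeholder, while the `gvn` language tag matches the Language_ID used in the CLDF rows above):

```python
ONTOLEX = "http://www.w3.org/ns/lemon/ontolex#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "https://example.org/kuku-yalanji/"  # placeholder namespace

def lemma_triples(lemma_id: str, written_form: str) -> list[str]:
    """Emit N-Triples for one ontolex:LexicalEntry with a canonical form."""
    entry = f"<{BASE}entry/{lemma_id}>"
    form = f"<{BASE}form/{lemma_id}>"
    return [
        f"{entry} <{RDF_TYPE}> <{ONTOLEX}LexicalEntry> .",
        f"{entry} <{ONTOLEX}canonicalForm> {form} .",
        f'{form} <{ONTOLEX}writtenRep> "{written_form}"@gvn .',
    ]

for t in lemma_triples("lemma1", "form1"):
    print(t)
```

Load the resulting file into Fuseki with `tdb2.tdbloader` or the upload UI, and the SPARQL endpoint is live.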
Use Hasura or PostGraphile on Postgres:
```graphql
query {
  section_by_pk(id: "3.2.1") {
    title
    md_text
    nearest(limit: 3, query: "case alignment") {
      id
      title
    }
  }
}
```
That `nearest` field is a custom function wrapping a pgvector similarity search.
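Under the hood, such a resolver boils down to one `ORDER BY` on pgvector's cosine-distance operator `<=>` (which matches the `vector_cosine_ops` HNSW index created earlier). A sketch of the query text it might run — the embedding parameter is the query string's vector, computed the same way as the stored ones:

```python
def nearest_sql(k: int = 3) -> str:
    """SQL a `nearest` resolver might run: cosine distance via pgvector."""
    return (
        "SELECT id, title FROM section "
        "ORDER BY embedding <=> %s::vector "
        f"LIMIT {int(k)}"
    )

print(nearest_sql(3))
```

Hasura exposes this by tracking a Postgres function that wraps the statement; PostGraphile picks up such functions automatically.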
```
User English sentence
        │
        ├─► Keyword scan  → GRAM_FEATURE + PARADIGM_CELL
        ├─► Vector search → SECTION.embedding (top-k)
        │
        ▼
Prompt: dictionary snippets + grammar chunks + user text
        │
        ▼
     LLM / GPT-4
        │
        ▼
Kuku Yalanji output + cite_ids
```
Each cited `section_id` lets the UI show “Source: §3.8.4.2 Functions of verbal inflections” with a pop-up of the paragraph.
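The routing above is easy to express as plain functions. Here is a sketch with stubbed-out retrievers — the retriever internals, prompt template, and return shapes are assumptions, not a fixed design:

```python
def keyword_hits(text: str) -> list[dict]:
    """Stub: exact-match lookup against GRAM_FEATURE / PARADIGM_CELL."""
    if "ergative" in text.lower():
        return [{"id": "erg-optional", "value": "optional"}]
    return []

def vector_hits(text: str, k: int = 3) -> list[dict]:
    """Stub: top-k pgvector search over SECTION.embedding."""
    return [{"section_id": "3.2.1", "md_text": "Case marking ..."}][:k]

def build_prompt(user_text: str) -> tuple[str, list[str]]:
    """Fuse both retrieval paths into one LLM prompt plus citation IDs."""
    features = keyword_hits(user_text)
    sections = vector_hits(user_text)
    cite_ids = [s["section_id"] for s in sections]
    prompt = "\n\n".join([
        "Grammar features: " + repr(features),
        "Grammar chunks:\n" + "\n".join(s["md_text"] for s in sections),
        "User text: " + user_text,
    ])
    return prompt, cite_ids  # cite_ids drive the "Source: §…" pop-ups

prompt, cites = build_prompt("How do I mark the ergative?")
```

Swapping the stubs for real SQL and pgvector queries (or LangChain retrievers) keeps the control flow unchanged.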
| Task | Open-source tool |
|---|---|
| PDF → MD/OCR | pandoc, pymupdf, tesserocr |
| Postgres + vectors | pgvector (ACID & SQL joins) |
| Vector-native alternative | Weaviate (cloud or self-hosted) |
| CLDF helpers | cldfbench – generates StructureTables |
| IGT parsing | xigt library |
| Graph export | RDFLib + the OntoLex-Lemon vocabulary |
| RAG glue | LangChain with its PGVector vector-store integration |
- One source of truth – everything lives in Postgres; vectors and JSON sidecars ride in the same DB.
- AI-ready – embeddings let you power search, chatbots, and translation prompts.
- Linked Data – linguists can harvest your OntoLex endpoint and cite individual rules.
- No lock-in – CLDF CSVs, XIGT JSON, RDF triples are all open standards.
With this setup you can ingest any new grammar or dictionary just by adding rows and re-embedding. Scrapes of contemporary Kuku texts, speech transcripts, or children’s stories slot into the same schema and immediately become searchable for models and humans alike.