WrenAI Architecture Documentation

Overview

WrenAI is an AI-powered data modeling and query generation platform that turns natural language questions into SQL queries. The system is made up of several interconnected services that together provide intelligent data analysis.

High-Level Architecture

graph TB
    subgraph "User Interface Layer"
        UI[Wren UI<br/>Next.js Frontend<br/>Port: 3000]
        CLI[Wren Launcher<br/>CLI Tool<br/>Go Application]
    end
    
    subgraph "API Gateway Layer"
        GQL[GraphQL API<br/>Apollo Server]
        REST[REST API<br/>FastAPI]
    end
    
    subgraph "Core Services"
        AI[Wren AI Service<br/>Python FastAPI<br/>Port: 5555]
        ENGINE[Wren Engine<br/>Java/Rust<br/>Port: 8080]
        IBIS[Ibis Server<br/>Python<br/>Port: 8000]
    end
    
    subgraph "Data Layer"
        QDRANT[Qdrant Vector DB<br/>Port: 6333]
        SQLITE[SQLite Database<br/>Metadata Storage]
        CACHE[In-Memory Cache<br/>Query Results]
    end
    
    subgraph "External Services"
        LLM[LLM Providers<br/>OpenAI, Custom]
        DATASOURCES[Data Sources<br/>BigQuery, PostgreSQL<br/>MySQL, Snowflake, etc.]
    end
    
    CLI --> UI
    UI --> GQL
    GQL --> AI
    GQL --> ENGINE
    GQL --> IBIS
    
    AI --> REST
    AI --> QDRANT
    AI --> LLM
    
    ENGINE --> DATASOURCES
    IBIS --> DATASOURCES
    
    UI --> SQLITE
    AI --> CACHE
    
    style UI fill:#e3f2fd
    style AI fill:#fff3e0
    style ENGINE fill:#f3e5f5
    style QDRANT fill:#e8f5e8
    style LLM fill:#fce4ec

Component Architecture

1. User Interface Layer

Wren UI (Frontend)

  • Technology: Next.js 14, React, TypeScript, Ant Design
  • Port: 3000
  • Responsibilities:
    • Data source connection and configuration
    • Visual data modeling interface
    • Natural language query input
    • Result visualization and charts
    • User authentication and settings

Key Components:

// Main App Structure
src/pages/_app.tsx          // Application root
src/pages/home/index.tsx    // Query interface
src/pages/setup/            // Data source setup
src/pages/modeling/         // Data modeling tools
src/components/diagram/     // Visual modeling components

Wren Launcher (CLI)

  • Technology: Go
  • Purpose: System orchestration and setup
  • Responsibilities:
    • Interactive system setup
    • Docker container management
    • Configuration generation
    • Service health monitoring

2. Core Services Architecture

Wren AI Service

  • Technology: Python, FastAPI, Hamilton pipelines
  • Port: 5555
  • Architecture Pattern: Pipeline-based processing
graph LR
    subgraph "AI Service Internal Architecture"
        ROUTER[FastAPI Router] --> PIPELINE[Pipeline Manager]
        PIPELINE --> INDEXING[Indexing Pipelines]
        PIPELINE --> RETRIEVAL[Retrieval Pipelines]
        PIPELINE --> GENERATION[Generation Pipelines]
        
        INDEXING --> DBSCHEMA[DB Schema Indexing]
        INDEXING --> HISTORICAL[Historical Q&A]
        
        RETRIEVAL --> DBRETRIEVAL[DB Schema Retrieval]
        RETRIEVAL --> QARETRIEVAL[Q&A Retrieval]
        
        GENERATION --> SQLGEN[SQL Generation]
        GENERATION --> FOLLOWUP[Follow-up Generation]
        GENERATION --> ADJUSTMENT[Chart Adjustment]
    end
    
    style DBSCHEMA fill:#e8f5e8
    style SQLGEN fill:#fff3e0
    style RETRIEVAL fill:#f3e5f5

Core Pipelines:

  1. Indexing Pipelines (src/pipelines/indexing/):

    • db_schema.py - Database schema vectorization
    • historical_question.py - Q&A pair indexing
  2. Retrieval Pipelines (src/pipelines/retrieval/):

    • db_schema_retrieval.py - Schema context retrieval
    • historical_question_retrieval.py - Similar question retrieval
  3. Generation Pipelines (src/pipelines/generation/):

    • sql_generation.py - SQL query generation
    • followup_generation.py - Follow-up question generation
    • chart_generation.py - Visualization generation

Wren Engine

  • Technology: Java (Traditional) / Rust (Experimental)
  • Port: 8080
  • Purpose: SQL execution and optimization
  • Responsibilities:
    • SQL query execution
    • Query optimization
    • Result caching
    • Performance monitoring

Ibis Server

  • Technology: Python, Ibis framework
  • Port: 8000
  • Purpose: Data source abstraction
  • Responsibilities:
    • Multi-database connectivity
    • Schema introspection
    • Query translation
    • Metadata extraction

Data Flow Architecture

3. Query Processing Flow

sequenceDiagram
    participant User as User
    participant UI as Wren UI
    participant AI as AI Service
    participant Engine as Wren Engine
    participant Vector as Qdrant
    participant LLM as LLM Provider
    participant DS as Data Source
    
    User->>UI: Natural language query
    UI->>AI: POST /v1/asks
    
    Note over AI: Query Processing Pipeline
    AI->>Vector: Retrieve schema context
    Vector-->>AI: Relevant schema documents
    
    AI->>Vector: Retrieve similar questions
    Vector-->>AI: Historical Q&A pairs
    
    AI->>LLM: Generate SQL with context
    LLM-->>AI: SQL query + reasoning
    
    AI->>Engine: Validate SQL
    Engine-->>AI: Validation result
    
    AI-->>UI: SQL + explanation
    UI->>Engine: Execute SQL
    Engine->>DS: Query execution
    DS-->>Engine: Results
    Engine-->>UI: Formatted results
    UI-->>User: Visualization + data

4. Data Modeling Flow

sequenceDiagram
    participant User as User
    participant UI as Wren UI
    participant Ibis as Ibis Server
    participant AI as AI Service
    participant Vector as Qdrant
    participant DS as Data Source
    
    User->>UI: Connect data source
    UI->>Ibis: GET /v1/ibis/metadata
    Ibis->>DS: Schema introspection
    DS-->>Ibis: Schema metadata
    Ibis-->>UI: Structured schema
    
    UI->>UI: Visual modeling
    User->>UI: Define relationships
    UI->>AI: POST /v1/semantics-preparations
    
    Note over AI: MDL Processing
    AI->>AI: Parse MDL schema
    AI->>AI: Generate documents
    AI->>Vector: Store embeddings
    Vector-->>AI: Indexing complete
    
    AI-->>UI: Deployment success

Data Storage Architecture

5. Storage Components

Vector Database (Qdrant)

  • Purpose: Semantic search and embeddings storage
  • Port: 6333
  • Data Types:
    • Schema embeddings
    • Historical question embeddings
    • Metadata for filtering

Document Types:

# Schema Documents
{
    "type": "TABLE_SCHEMA",
    "name": "table_name",
    "content": "CREATE TABLE...",
    "meta": {
        "project_id": "uuid",
        "type": "TABLE"
    }
}

# Q&A Documents  
{
    "type": "QA_PAIR",
    "question": "natural language",
    "sql": "SELECT...",
    "meta": {
        "project_id": "uuid"
    }
}

SQLite Database

  • Purpose: Application metadata and configuration
  • Tables:
    • Projects and data sources
    • Models and relationships
    • User queries and threads
    • Dashboard configurations

In-Memory Cache

  • Technology: Redis-compatible caching
  • Purpose: Query result caching and session management

Service Communication Architecture

6. Inter-Service Communication

graph TB
    subgraph "Communication Patterns"
        subgraph "Synchronous"
            HTTP[HTTP/REST APIs]
            GQL[GraphQL Queries]
            GRPC[gRPC Calls]
        end
        
        subgraph "Asynchronous"
            QUEUE[Task Queues]
            EVENTS[Event Streaming]
            WEBHOOK[Webhooks]
        end
    end
    
    subgraph "Service Mesh"
        UI_SVC[UI Service<br/>:3000]
        AI_SVC[AI Service<br/>:5555]
        ENGINE_SVC[Engine Service<br/>:8080]
        IBIS_SVC[Ibis Service<br/>:8000]
        VECTOR_SVC[Vector Service<br/>:6333]
    end
    
    UI_SVC -->|GraphQL| AI_SVC
    UI_SVC -->|REST| ENGINE_SVC
    UI_SVC -->|REST| IBIS_SVC
    AI_SVC -->|HTTP| VECTOR_SVC
    AI_SVC -->|REST| ENGINE_SVC
    
    style UI_SVC fill:#e3f2fd
    style AI_SVC fill:#fff3e0
    style ENGINE_SVC fill:#f3e5f5
    style VECTOR_SVC fill:#e8f5e8

7. API Architecture

GraphQL API (UI ↔ Services)

Endpoint: /api/graphql

type Query {
    # Data source management
    dataSources: [DataSource!]!
    
    # Model management  
    models: [Model!]!
    relationships: [Relationship!]!
    
    # Query interface
    threads: [Thread!]!
    askingTask(taskId: ID!): AskingTask
}

type Mutation {
    # Data modeling
    createModel(data: CreateModelInput!): Model!
    updateModel(where: ModelWhereUniqueInput!, data: UpdateModelInput!): Model!
    
    # Query execution
    createThread(data: CreateThreadInput!): Thread!
    createAsk(data: CreateAskInput!): AskingTask!
}
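
As an illustration of how a client could call this endpoint, here is a hedged Python sketch using requests against /api/graphql on the UI's port 3000. Only the askingTask operation and its taskId argument come from the schema above; the selected status field and the task ID value are guesses for illustration.

import requests

ASKING_TASK_QUERY = """
query AskingTask($taskId: ID!) {
  askingTask(taskId: $taskId) {
    status   # illustrative field; check the real schema for the available selection set
  }
}
"""

response = requests.post(
    "http://localhost:3000/api/graphql",
    json={"query": ASKING_TASK_QUERY, "variables": {"taskId": "1"}},
    timeout=30,
)
print(response.json())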

REST API (AI Service)

Base URL: http://localhost:5555/v1

# Key endpoints
POST /asks                     # Natural language queries
POST /semantics-preparations   # Schema deployment
GET  /asks/{id}               # Query status
POST /sql-explanations        # SQL explanations
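
A hedged usage sketch for the ask workflow above, written with Python requests. The endpoint paths are the ones listed; the request and response field names (query, id, status) are illustrative assumptions rather than the documented payload schema.

import time
import requests

BASE_URL = "http://localhost:5555/v1"

# Submit a natural-language question (body field names are assumed).
task = requests.post(f"{BASE_URL}/asks", json={"query": "What is the average book price?"}).json()
task_id = task.get("id")  # assumed response field

# Poll the task until it reaches a terminal state (status values are assumed).
while True:
    status = requests.get(f"{BASE_URL}/asks/{task_id}").json()
    if status.get("status") in {"finished", "failed"}:
        break
    time.sleep(1)

print(status)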

Deployment Architecture

8. Container Architecture

graph TB
    subgraph "Docker Compose Stack"
        subgraph "Application Containers"
            UI_CONTAINER[wren-ui<br/>Node.js Container]
            AI_CONTAINER[wren-ai-service<br/>Python Container]
            ENGINE_CONTAINER[wren-engine<br/>Java/Rust Container]
            IBIS_CONTAINER[ibis-server<br/>Python Container]
        end
        
        subgraph "Data Containers"
            QDRANT_CONTAINER[qdrant<br/>Vector DB Container]
            BOOTSTRAP_CONTAINER[bootstrap<br/>Init Container]
        end
        
        subgraph "Volumes"
            DATA_VOLUME[data<br/>Persistent Volume]
            CONFIG_VOLUME[config<br/>Configuration Volume]
        end
    end
    
    subgraph "External"
        DATASOURCES_EXT[External Data Sources]
        LLM_EXT[LLM Providers]
    end
    
    UI_CONTAINER --> AI_CONTAINER
    AI_CONTAINER --> ENGINE_CONTAINER
    AI_CONTAINER --> QDRANT_CONTAINER
    ENGINE_CONTAINER --> DATASOURCES_EXT
    AI_CONTAINER --> LLM_EXT
    
    DATA_VOLUME --> QDRANT_CONTAINER
    DATA_VOLUME --> UI_CONTAINER
    CONFIG_VOLUME --> AI_CONTAINER
    
    style UI_CONTAINER fill:#e3f2fd
    style AI_CONTAINER fill:#fff3e0
    style QDRANT_CONTAINER fill:#e8f5e8

9. Configuration Management

Environment Variables

# Core service ports
WREN_UI_PORT: 3000
WREN_AI_SERVICE_PORT: 5555
WREN_ENGINE_PORT: 8080
IBIS_SERVER_PORT: 8000
QDRANT_PORT: 6333

# Service endpoints
WREN_AI_ENDPOINT: http://wren-ai-service:5555
WREN_ENGINE_ENDPOINT: http://wren-engine:8080
IBIS_SERVER_ENDPOINT: http://ibis-server:8000
QDRANT_HOST: wren-qdrant

# AI Configuration
OPENAI_API_KEY: sk-...
GENERATION_MODEL: gpt-4o-mini
EMBEDDING_MODEL: text-embedding-3-small

# Database
DB_TYPE: sqlite
SQLITE_FILE: /app/data/db.sqlite3
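
A minimal sketch of how a service process might pick up these settings, assuming the variable names above and falling back to the documented docker-compose service names and ports:

import os

WREN_AI_ENDPOINT = os.getenv("WREN_AI_ENDPOINT", "http://wren-ai-service:5555")
WREN_ENGINE_ENDPOINT = os.getenv("WREN_ENGINE_ENDPOINT", "http://wren-engine:8080")
IBIS_SERVER_ENDPOINT = os.getenv("IBIS_SERVER_ENDPOINT", "http://ibis-server:8000")
QDRANT_HOST = os.getenv("QDRANT_HOST", "wren-qdrant")
GENERATION_MODEL = os.getenv("GENERATION_MODEL", "gpt-4o-mini")

# The OpenAI key has no sensible default; fail fast if it is missing.
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY must be set for the default OpenAI provider")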

Configuration Files

  • docker/docker-compose.yaml - Service orchestration
  • docker/.env.example - Environment template
  • wren-ai-service/config.yaml - AI service configuration
  • wren-ui/next.config.js - UI build configuration

Security Architecture

10. Security Components

graph TB
    subgraph "Security Layers"
        subgraph "Network Security"
            NETWORK[Internal Network<br/>Docker Bridge]
            FIREWALL[Port Restrictions]
            TLS[TLS/SSL Encryption]
        end
        
        subgraph "Authentication"
            API_KEY[API Key Management]
            SESSION[Session Management]
            CORS[CORS Configuration]
        end
        
        subgraph "Data Security"
            ENCRYPTION[Data Encryption]
            SANITIZATION[Input Sanitization]
            VALIDATION[Schema Validation]
        end
    end
    
    subgraph "External Security"
        LLM_AUTH[LLM Provider Auth]
        DB_AUTH[Database Authentication]
        SECRETS[Secret Management]
    end
    
    NETWORK --> API_KEY
    API_KEY --> ENCRYPTION
    ENCRYPTION --> LLM_AUTH
    ENCRYPTION --> DB_AUTH
    
    style NETWORK fill:#e3f2fd
    style ENCRYPTION fill:#e8f5e8
    style LLM_AUTH fill:#fce4ec

Security Features

  • Network Isolation: Services communicate through internal Docker network
  • API Security: Request validation and rate limiting
  • Data Protection: Sensitive data encryption at rest and in transit
  • Secret Management: Environment-based secret configuration
  • Input Validation: SQL injection prevention and input sanitization

Performance Architecture

11. Performance Optimization

Caching Strategy

graph LR
    subgraph "Multi-Level Caching"
        L1[L1: Browser Cache<br/>Static Assets]
        L2[L2: Application Cache<br/>Query Results]
        L3[L3: Database Cache<br/>Metadata]
        L4[L4: Vector Cache<br/>Embeddings]
    end
    
    REQUEST[User Request] --> L1
    L1 --> L2
    L2 --> L3
    L3 --> L4
    L4 --> DATABASE[Database Query]
    
    style L1 fill:#e3f2fd
    style L2 fill:#fff3e0
    style L3 fill:#f3e5f5
    style L4 fill:#e8f5e8

Scalability Features

  • Horizontal Scaling: Stateless service design
  • Load Balancing: Container orchestration support
  • Resource Optimization: Efficient memory and CPU usage
  • Query Optimization: Intelligent SQL optimization
  • Vector Search: Optimized semantic search with Qdrant

12. Monitoring and Observability

Telemetry Architecture

graph TB
    subgraph "Observability Stack"
        METRICS[Metrics Collection<br/>Performance Data]
        LOGS[Centralized Logging<br/>Application Logs]
        TRACES[Distributed Tracing<br/>Request Flows]
        HEALTH[Health Checks<br/>Service Status]
    end
    
    subgraph "Analytics"
        POSTHOG[PostHog Analytics<br/>User Behavior]
        LANGFUSE[Langfuse<br/>LLM Observability]
        CUSTOM[Custom Metrics<br/>Business Logic]
    end
    
    METRICS --> POSTHOG
    LOGS --> LANGFUSE
    TRACES --> CUSTOM
    
    style METRICS fill:#e3f2fd
    style POSTHOG fill:#fff3e0
    style LANGFUSE fill:#e8f5e8

Integration Architecture

13. External Integrations

Data Source Connectors

  • Supported Databases:
    • Google BigQuery
    • PostgreSQL / MySQL
    • Snowflake
    • ClickHouse
    • SQL Server / Oracle
    • Amazon Athena / Redshift
    • DuckDB

LLM Provider Integration

  • OpenAI: GPT-4, GPT-3.5, Embeddings
  • Custom Providers: Configurable LLM endpoints
  • Embedding Models: Text-embedding-3-small/large

Third-Party Services

  • Vector Database: Qdrant for semantic search
  • Analytics: PostHog for user behavior tracking
  • Observability: Langfuse for LLM monitoring

Development Architecture

14. Development Workflow

graph LR
    subgraph "Development Environment"
        DEV_UI[UI Development<br/>yarn dev]
        DEV_AI[AI Service<br/>python -m src.__main__]
        DEV_ENGINE[Engine Service<br/>Docker]
        DEV_VECTOR[Vector DB<br/>Docker]
    end
    
    subgraph "Testing"
        UNIT[Unit Tests<br/>pytest, jest]
        INTEGRATION[Integration Tests<br/>API Testing]
        E2E[E2E Tests<br/>Playwright]
    end
    
    subgraph "Deployment"
        BUILD[Container Build<br/>Docker Images]
        DEPLOY[Deployment<br/>Docker Compose]
        MONITOR[Monitoring<br/>Health Checks]
    end
    
    DEV_UI --> UNIT
    DEV_AI --> INTEGRATION
    UNIT --> E2E
    E2E --> BUILD
    BUILD --> DEPLOY
    DEPLOY --> MONITOR
    
    style DEV_UI fill:#e3f2fd
    style BUILD fill:#e8f5e8
    style DEPLOY fill:#fff3e0

Technology Stack Summary

| Component | Technology | Purpose |
| --- | --- | --- |
| Frontend | Next.js 14, React, TypeScript | User interface |
| AI Service | Python, FastAPI, Hamilton | AI/ML processing |
| Engine | Java/Rust | SQL execution |
| Vector DB | Qdrant | Semantic search |
| Database | SQLite/PostgreSQL | Metadata storage |
| Orchestration | Docker Compose | Service management |
| API | GraphQL, REST | Service communication |
| Authentication | API Keys, Sessions | Security |
| Monitoring | PostHog, Langfuse | Observability |

Conclusion

WrenAI's architecture is designed for:

  1. Scalability: Microservices architecture with container orchestration
  2. Extensibility: Plugin-based data source and LLM provider support
  3. Performance: Multi-level caching and optimized query processing
  4. Security: Network isolation and data protection
  5. Observability: Comprehensive monitoring and analytics
  6. Developer Experience: Clear separation of concerns and modern tooling

The architecture supports both development and production environments with flexible deployment options and comprehensive monitoring capabilities.

WrenAI Pipelines Documentation

Overview

WrenAI employs a pipeline architecture built on the Hamilton (async execution) and Haystack (component-based) frameworks. The system is organized into three main pipeline categories that work together to provide intelligent SQL generation and data analysis:

  1. Indexing Pipelines - Data ingestion and storage
  2. Retrieval Pipelines - Context and schema retrieval
  3. Generation Pipelines - AI-powered content generation

Architecture

Core Pipeline Framework

All pipelines inherit from the BasicPipeline abstract base class:

from abc import ABCMeta, abstractmethod
from typing import Any, Dict

# Pipeline drivers come from the Hamilton and Haystack frameworks mentioned above;
# the exact import paths depend on the installed versions.
from haystack import Pipeline
from hamilton.driver import Driver
from hamilton.experimental.h_async import AsyncDriver


class BasicPipeline(metaclass=ABCMeta):
    def __init__(self, pipe: Pipeline | AsyncDriver | Driver):
        self._pipe = pipe

    @abstractmethod
    def run(self, *args, **kwargs) -> Dict[str, Any]:
        ...
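
Continuing from the class above, a concrete pipeline supplies the driver and implements run(). The sketch below is purely illustrative: the driver and pipeline names are invented, not WrenAI's real classes.

# Hypothetical stand-in for a Hamilton or Haystack driver object.
class EchoDriver:
    def execute(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        return {"indexed": len(inputs.get("documents", []))}


class ToyIndexingPipeline(BasicPipeline):
    """Validates its input, then delegates the work to the wrapped driver."""

    def run(self, documents: list) -> Dict[str, Any]:
        if not documents:
            raise ValueError("nothing to index")
        return self._pipe.execute({"documents": documents})


print(ToyIndexingPipeline(EchoDriver()).run(["CREATE TABLE book (...)"]))  # {'indexed': 1}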

Pipeline Components

Each pipeline utilizes standardized components:

  • LLM Provider - Language model integration
  • Embedder Provider - Text embedding services
  • Document Store Provider - Vector and document storage
  • Engine - SQL execution and validation

System Architecture Overview

graph TB
    subgraph "User Interface"
        UI[User Query]
    end
    
    subgraph "Generation Pipelines"
        IC[Intent Classification]
        SG[SQL Generation]
        CG[Chart Generation]
        QR[Question Recommendation]
        SA[SQL Answer]
    end
    
    subgraph "Retrieval Pipelines"
        DSR[DB Schema Retrieval]
        HQR[Historical Question Retrieval]
        SPR[SQL Pairs Retrieval]
        IR[Instructions Retrieval]
        SFR[SQL Functions Retrieval]
        SE[SQL Executor]
    end
    
    subgraph "Indexing Pipelines"
        DBI[DB Schema Indexing]
        TDI[Table Description Indexing]
        HQI[Historical Question Indexing]
        SPI[SQL Pairs Indexing]
        II[Instructions Indexing]
        PMI[Project Meta Indexing]
    end
    
    subgraph "Storage Layer"
        VS[Vector Store]
        DS[Document Store]
        MS[Metadata Store]
    end
    
    subgraph "External Components"
        LLM[Language Models]
        EMB[Embedders]
        DB[Database Engines]
    end
    
    UI --> IC
    IC --> DSR
    IC --> SG
    DSR --> SG
    HQR --> SG
    SPR --> SG
    IR --> SG
    SFR --> SG
    SG --> SE
    SE --> SA
    SE --> CG
    SG --> QR
    
    DBI --> VS
    TDI --> VS
    HQI --> VS
    SPI --> VS
    II --> VS
    PMI --> MS
    
    DSR --> VS
    HQR --> VS
    SPR --> VS
    IR --> VS
    SFR --> VS
    
    SG --> LLM
    IC --> LLM
    CG --> LLM
    DSR --> EMB
    SE --> DB

Pipeline Component Architecture

graph TB
    subgraph "Pipeline Components"
        BP[BasicPipeline Abstract Class]
        
        subgraph "Providers"
            LLMP[LLM Provider]
            EMBP[Embedder Provider]
            DSP[Document Store Provider]
            EP[Engine Provider]
        end
        
        subgraph "Core Components"
            GEN[Generator]
            EMB[Embedder]
            RET[Retriever]
            WRT[Writer]
            ENG[Engine]
        end
        
        subgraph "Processing Components"
            PB[Prompt Builder]
            PP[Post Processor]
            VAL[Validator]
            CHK[Chunker]
        end
    end
    
    BP --> LLMP
    BP --> EMBP
    BP --> DSP
    BP --> EP
    
    LLMP --> GEN
    EMBP --> EMB
    DSP --> RET
    DSP --> WRT
    EP --> ENG
    
    GEN --> PB
    GEN --> PP
    EMB --> VAL
    WRT --> CHK

1. Indexing Pipelines

Indexing pipelines are responsible for processing and storing various types of data that will later be retrieved for query processing.

Indexing Pipeline Architecture

graph TB
    subgraph "Data Sources"
        MDL[MDL Schema]
        QH[Query History]
        SP[SQL Pairs]
        INST[Instructions]
        META[Project Metadata]
    end
    
    subgraph "Indexing Pipelines"
        DBI[DB Schema Indexing]
        TDI[Table Description Indexing]
        HQI[Historical Question Indexing]
        SPI[SQL Pairs Indexing]
        II[Instructions Indexing]
        PMI[Project Meta Indexing]
    end
    
    subgraph "Processing Components"
        VAL[Validator]
        CHK[Chunker]
        EMB[Embedder]
        WRT[Writer]
        CLN[Cleaner]
    end
    
    subgraph "Storage"
        VS[Vector Store<br/>Qdrant]
        DS[Document Store]
    end
    
    MDL --> DBI
    MDL --> TDI
    QH --> HQI
    SP --> SPI
    INST --> II
    META --> PMI
    
    DBI --> VAL
    VAL --> CHK
    CHK --> EMB
    EMB --> WRT
    WRT --> CLN
    CLN --> VS
    
    TDI --> VS
    HQI --> VS
    SPI --> VS
    II --> VS
    PMI --> DS
    
    style DBI fill:#e1f5fe
    style TDI fill:#e8f5e8
    style HQI fill:#fff3e0
    style SPI fill:#f3e5f5
    style II fill:#e0f2f1
    style PMI fill:#fce4ec

1.1 DB Schema Indexing (DBSchema)

Purpose: Indexes database schema information including tables, columns, relationships, views, and metrics.

Key Features:

  • DDL Chunking: Converts MDL (Model Definition Language) into searchable document chunks
  • Column Batching: Processes columns in configurable batch sizes (default: 50)
  • Relationship Processing: Handles foreign key relationships and joins
  • Calculated Fields: Indexes pre-computed calculated fields and metrics
  • Vector Embedding: Generates embeddings for semantic search

Process Flow:

  1. MDL Validation - Validates input MDL structure
  2. Model Preprocessing - Processes models, views, metrics, and relationships
  3. DDL Chunking - Creates document chunks for each schema component
  4. Embedding Generation - Generates vector embeddings
  5. Document Storage - Stores in document store with metadata
flowchart TD
    A[MDL Input] --> B[MDL Validator]
    B --> C[DDL Chunker]
    C --> D[Model Preprocessor]
    D --> E[Column Batcher]
    E --> F[Relationship Processor]
    F --> G[Embedder]
    G --> H[Document Writer]
    H --> I[Vector Store]
    
    subgraph "Document Types"
        J[TABLE_SCHEMA]
        K[TABLE_COLUMNS]
        L[VIEW]
        M[METRIC]
    end
    
    C --> J
    C --> K
    C --> L
    C --> M
    
    style A fill:#e3f2fd
    style I fill:#e8f5e8

Output Documents:

  • TABLE_SCHEMA - Table structure documents
  • TABLE_COLUMNS - Column-specific documents (batched)
  • VIEW - View definition documents
  • METRIC - Metric calculation documents

1.2 Table Description Indexing (TableDescription)

Purpose: Indexes high-level table descriptions for table discovery and selection.

Key Features:

  • Semantic table descriptions
  • Table-level metadata storage
  • Table relationship context

1.3 Historical Question Indexing (HistoricalQuestion)

Purpose: Indexes previously asked questions and their corresponding SQL queries for learning and recommendation.

Key Features:

  • Question-SQL pair storage
  • Semantic similarity indexing
  • Query pattern learning

1.4 SQL Pairs Indexing (SqlPairs)

Purpose: Indexes curated SQL examples and question-answer pairs for training and reference.

Key Features:

  • Example query storage
  • Question-SQL mapping
  • Training data management

1.5 Instructions Indexing (Instructions)

Purpose: Indexes custom instructions and business rules for query generation.

Key Features:

  • Custom instruction storage
  • Business rule indexing
  • Context-specific guidance

1.6 Project Meta Indexing (ProjectMeta)

Purpose: Indexes project-level metadata and configuration information.

Key Features:

  • Project configuration storage
  • Metadata validation
  • Project-specific settings

2. Retrieval Pipelines

Retrieval pipelines fetch relevant context and information needed for query processing and generation.

Retrieval Pipeline Architecture

graph TB
    subgraph "Input"
        UQ[User Query]
        HIST[Query History]
        CTX[Context]
    end
    
    subgraph "Retrieval Pipelines"
        DSR[DB Schema Retrieval]
        HQR[Historical Question Retrieval]
        SPR[SQL Pairs Retrieval]
        IR[Instructions Retrieval]
        SFR[SQL Functions Retrieval]
        SE[SQL Executor]
        PSD[Preprocess SQL Data]
    end
    
    subgraph "Retrieval Components"
        EMB[Embedder]
        RET[Retrievers]
        FLT[Filters]
        PRC[Processors]
    end
    
    subgraph "Storage Sources"
        VS[Vector Store]
        DS[Document Store]
        MS[Metadata Store]
    end
    
    subgraph "Output Context"
        SC[Schema Context]
        EC[Example Context]
        IC[Instruction Context]
        FC[Function Context]
        RC[Result Context]
    end
    
    UQ --> EMB
    EMB --> DSR
    EMB --> HQR
    EMB --> SPR
    EMB --> IR
    EMB --> SFR
    
    DSR --> RET
    HQR --> RET
    SPR --> RET
    IR --> RET
    SFR --> RET
    
    RET --> VS
    RET --> DS
    RET --> MS
    
    DSR --> FLT
    FLT --> PRC
    PRC --> SC
    
    HQR --> EC
    SPR --> EC
    IR --> IC
    SFR --> FC
    SE --> RC
    
    style DSR fill:#e1f5fe
    style HQR fill:#e8f5e8
    style SPR fill:#fff3e0
    style IR fill:#f3e5f5
    style SFR fill:#e0f2f1
    style SE fill:#fce4ec

2.1 DB Schema Retrieval (DbSchemaRetrieval)

Purpose: The most critical retrieval pipeline that intelligently selects relevant schema components for query generation.

Key Features:

  • Smart Column Pruning: Uses LLM to select only necessary columns
  • Table Selection: Identifies relevant tables based on query intent
  • Context Window Management: Manages token limits for LLM processing
  • Semantic Retrieval: Uses vector similarity for schema component selection

Column Pruning Algorithm:

  1. Semantic Analysis - Analyzes user question semantically
  2. Table Identification - Identifies potentially relevant tables
  3. Column Selection - LLM selects minimal necessary columns
  4. Reasoning Generation - Provides justification for each selection
  5. Schema Construction - Builds filtered schema context

Process Flow:

flowchart TD
    A[User Query] --> B[Embedding Generation]
    B --> C[Table Retrieval]
    C --> D[Schema Retrieval]
    D --> E[Context Window Check]
    E --> F{Enable Column Pruning?}
    F -->|Yes| G[Column Selection LLM]
    F -->|No| H[Full Schema]
    G --> I[Column Reasoning]
    I --> J[Schema Construction]
    H --> J
    J --> K[Filtered Context]
    
    subgraph "Column Pruning Process"
        G --> L[Table Analysis]
        L --> M[Column Selection]
        M --> N[Reasoning Chain]
        N --> O[Context Validation]
    end
    
    style A fill:#e3f2fd
    style G fill:#fff9c4
    style K fill:#e8f5e8
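
The core of the column-pruning step can be pictured as filtering the retrieved schemas down to the columns the LLM kept. The sketch below is a simplified illustration; the function and variable names are made up.

from typing import Dict, List

def prune_schema(
    schemas: Dict[str, List[str]],   # table -> all retrieved column names
    selected: Dict[str, List[str]],  # table -> columns the LLM decided to keep
) -> Dict[str, List[str]]:
    pruned = {}
    for table, columns in schemas.items():
        keep = [c for c in columns if c in set(selected.get(table, []))]
        if keep:  # drop tables with no selected columns at all
            pruned[table] = keep
    return pruned

schemas = {
    "orders": ["order_id", "customer_id", "order_date", "order_status"],
    "customers": ["customer_id", "customer_name", "city"],
}
selected = {"orders": ["order_id", "order_date"], "customers": ["customer_id"]}
print(prune_schema(schemas, selected))
# {'orders': ['order_id', 'order_date'], 'customers': ['customer_id']}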

2.2 Historical Question Retrieval (HistoricalQuestionRetrieval)

Purpose: Retrieves similar historical questions and their SQL solutions.

Key Features:

  • Semantic similarity matching
  • Historical pattern recognition
  • Query template suggestions

2.3 SQL Pairs Retrieval (SqlPairsRetrieval)

Purpose: Retrieves relevant SQL examples and patterns.

Key Features:

  • Example query matching
  • Pattern-based retrieval
  • Template selection

2.4 Instructions Retrieval (Instructions)

Purpose: Retrieves applicable instructions and business rules.

Key Features:

  • Context-specific instruction selection
  • Business rule application
  • Custom guidance retrieval

2.5 SQL Functions Retrieval (SqlFunctions)

Purpose: Retrieves available SQL functions and their usage patterns.

Key Features:

  • Function documentation retrieval
  • Usage example selection
  • Database-specific function support

2.6 SQL Executor (SQLExecutor)

Purpose: Executes SQL queries and retrieves results.

Key Features:

  • Query execution with timeout management
  • Result data fetching
  • Error handling and validation

2.7 Preprocess SQL Data (PreprocessSqlData)

Purpose: Preprocesses SQL data for downstream processing.

Key Features:

  • Token counting and limit management
  • Data format validation
  • Context preparation

3. Generation Pipelines

Generation pipelines use AI models to create various types of content based on retrieved context.

Generation Pipeline Architecture

graph TB
    subgraph "Input Context"
        SC[Schema Context]
        EC[Example Context]
        IC[Instruction Context]
        FC[Function Context]
        HC[History Context]
    end
    
    subgraph "Generation Pipelines"
        ICL[Intent Classification]
        SG[SQL Generation]
        SGR[SQL Generation Reasoning]
        FSG[Follow-up SQL Generation]
        SR[SQL Regeneration]
        SCR[SQL Correction]
        CG[Chart Generation]
        CA[Chart Adjustment]
        QR[Question Recommendation]
        RR[Relationship Recommendation]
        SA[SQL Answer]
        SD[Semantics Description]
    end
    
    subgraph "Generation Components"
        PB[Prompt Builder]
        LLM[Language Model]
        PP[Post Processor]
        VAL[Validator]
        DRY[Dry Run Engine]
    end
    
    subgraph "Output"
        SQL[Generated SQL]
        CHART[Chart Specs]
        ANS[Natural Language Answers]
        REC[Recommendations]
        DESC[Descriptions]
    end
    
    SC --> PB
    EC --> PB
    IC --> PB
    FC --> PB
    HC --> PB
    
    PB --> ICL
    PB --> SG
    PB --> SGR
    PB --> FSG
    PB --> SR
    PB --> SCR
    PB --> CG
    PB --> CA
    PB --> QR
    PB --> RR
    PB --> SA
    PB --> SD
    
    ICL --> LLM
    SG --> LLM
    SGR --> LLM
    FSG --> LLM
    SR --> LLM
    SCR --> LLM
    CG --> LLM
    CA --> LLM
    QR --> LLM
    RR --> LLM
    SA --> LLM
    SD --> LLM
    
    LLM --> PP
    PP --> VAL
    VAL --> DRY
    
    SG --> SQL
    CG --> CHART
    CA --> CHART
    SA --> ANS
    QR --> REC
    RR --> REC
    SD --> DESC
    
    style ICL fill:#e1f5fe
    style SG fill:#e8f5e8
    style CG fill:#fff3e0
    style QR fill:#f3e5f5

3.1 Intent Classification (IntentClassification)

Purpose: Classifies user queries into different intent categories to route them appropriately.

Intent Categories:

  • TEXT_TO_SQL - Queries requiring SQL generation
  • GENERAL - General database information requests
  • USER_GUIDE - WrenAI feature and usage questions
  • MISLEADING_QUERY - Off-topic or irrelevant queries
flowchart TD
    A[User Query] --> B[Context Analysis]
    B --> C[Schema Retrieval]
    C --> D[Intent Classification LLM]
    D --> E{Classification Result}
    E -->|TEXT_TO_SQL| F[SQL Generation Pipeline]
    E -->|GENERAL| G[General Assistance Pipeline]
    E -->|USER_GUIDE| H[User Guide Pipeline]
    E -->|MISLEADING_QUERY| I[Misleading Assistance Pipeline]
    
    F --> J[Schema Context + SQL Generation]
    G --> K[General Information Response]
    H --> L[User Guide Response]
    I --> M[Polite Redirection]
    
    style D fill:#fff9c4
    style F fill:#e8f5e8
    style G fill:#e3f2fd
    style H fill:#f3e5f5
    style I fill:#ffebee

Key Features:

  • Multi-turn conversation context
  • Question rephrasing for follow-ups
  • Schema-aware classification
  • Reasoning generation

3.2 SQL Generation (SQLGeneration)

Purpose: Core pipeline that generates SQL queries from natural language questions.

flowchart TD
    A[Natural Language Query] --> B[Context Preparation]
    B --> C[Schema Context]
    B --> D[Example Context]
    B --> E[Instruction Context]
    B --> F[Function Context]
    
    C --> G[Prompt Builder]
    D --> G
    E --> G
    F --> G
    
    G --> H[SQL Generation LLM]
    H --> I[Generated SQL]
    I --> J[Post Processor]
    J --> K[Quote Addition]
    K --> L{Validation Method}
    L -->|Dry Plan| M[Dry Plan Validation]
    L -->|Dry Run| N[Dry Run Validation]
    
    M --> O{Valid?}
    N --> O
    O -->|Yes| P[Valid SQL Output]
    O -->|No| Q[Error Analysis]
    Q --> R[SQL Correction Pipeline]
    
    subgraph "SQL Rules Enforcement"
        S[SELECT Only]
        T[Table Qualification]
        U[JOIN Requirements]
        V[Performance Patterns]
        W[Calculated Fields]
    end
    
    H --> S
    H --> T
    H --> U
    H --> V
    H --> W
    
    style H fill:#fff9c4
    style P fill:#e8f5e8
    style Q fill:#ffebee

Key Features:

  • Performance Optimization Rules: Prevents expensive query patterns
  • Schema-Aware Generation: Uses retrieved schema context
  • Calculated Field Integration: Leverages pre-computed fields
  • Dry Run Validation: Validates queries before execution
  • SQL Rule Enforcement: Follows strict SQL generation guidelines

Generation Rules:

  • Only SELECT statements allowed
  • Mandatory table qualification
  • Proper JOIN usage requirements
  • Calculated field utilization
  • Performance-optimized patterns
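
To make the "SELECT only" rule above concrete, here is a minimal guard of the kind a post-processor might apply before validation. It is a sketch only; WrenAI's real enforcement happens in the generation prompt and the dry-run validation, and the function name is invented.

import re

FORBIDDEN = re.compile(r"^\s*(insert|update|delete|drop|alter|create|truncate)\b", re.IGNORECASE)

def is_select_only(sql: str) -> bool:
    """Accept only statements that start with SELECT or a WITH ... SELECT CTE."""
    stripped = sql.strip().rstrip(";")
    if FORBIDDEN.match(stripped):
        return False
    return bool(re.match(r"^\s*(select|with)\b", stripped, re.IGNORECASE))

print(is_select_only("SELECT b.Title FROM book AS b"))  # True
print(is_select_only("DROP TABLE book"))                # False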

3.3 SQL Generation Reasoning (SQLGenerationReasoning)

Purpose: Generates step-by-step reasoning for SQL query construction.

Key Features:

  • Logical step breakdown
  • Decision justification
  • Educational explanations

3.4 Follow-up SQL Generation (FollowUpSQLGeneration)

Purpose: Handles follow-up questions that modify or extend previous queries.

Key Features:

  • Conversation context awareness
  • Query modification logic
  • Incremental query building

3.5 SQL Regeneration (SQLRegeneration)

Purpose: Regenerates SQL queries with improvements or corrections.

Key Features:

  • Query optimization
  • Error correction
  • Performance improvements

3.6 SQL Correction (SQLCorrection)

Purpose: Corrects invalid or problematic SQL queries.

Key Features:

  • Error detection and analysis
  • Syntax correction
  • Logic error fixes

3.7 Chart Generation (ChartGeneration)

Purpose: Generates visualization specifications from query results.

Key Features:

  • Vega-Lite specification generation
  • Chart type selection
  • Data binding optimization

3.8 Chart Adjustment (ChartAdjustment)

Purpose: Modifies existing chart specifications based on user feedback.

Key Features:

  • Chart property modification
  • Visual adjustment logic
  • Interactive chart updates

3.9 Question Recommendation (QuestionRecommendation)

Purpose: Suggests relevant questions users can ask about their data.

Key Features:

  • Schema-based suggestions
  • Categorized recommendations
  • Difficulty-appropriate questions

3.10 Relationship Recommendation (RelationshipRecommendation)

Purpose: Suggests potential relationships between tables in the schema.

Key Features:

  • Foreign key pattern detection
  • Relationship type classification
  • Confidence scoring

3.11 SQL Answer (SQLAnswer)

Purpose: Generates natural language explanations of SQL query results.

Key Features:

  • Result interpretation
  • Summary generation
  • Business context explanation

3.12 Semantics Description (SemanticsDescription)

Purpose: Generates semantic descriptions of data models and their business meaning.

Key Features:

  • Business context generation
  • Model documentation
  • Semantic relationship explanation

3.13 Data Assistance (DataAssistance)

Purpose: Provides guidance and assistance for data analysis tasks.

Key Features:

  • Analysis suggestions
  • Data interpretation help
  • Methodology recommendations

3.14 User Guide Assistance (UserGuideAssistance)

Purpose: Answers questions about WrenAI features and usage.

Key Features:

  • Feature documentation
  • Usage guidance
  • Troubleshooting help

3.15 Misleading Assistance (MisleadingAssistance)

Purpose: Handles off-topic or irrelevant queries gracefully.

Key Features:

  • Polite redirection
  • Capability explanation
  • Guidance back to relevant topics

3.16 SQL Question (SQLQuestion)

Purpose: Generates questions that could be answered by given SQL queries.

Key Features:

  • Reverse engineering questions
  • Educational query examples
  • Documentation generation

3.17 SQL Tables Extraction (SQLTablesExtraction)

Purpose: Extracts table names and dependencies from SQL queries.

Key Features:

  • SQL parsing and analysis
  • Dependency identification
  • Table usage tracking

Pipeline Orchestration

Typical Query Processing Flow

sequenceDiagram
    participant User
    participant IC as Intent Classification
    participant DSR as DB Schema Retrieval
    participant CR as Context Retrieval
    participant SG as SQL Generation
    participant SE as SQL Executor
    participant CG as Chart Generation
    participant SA as SQL Answer
    
    User->>IC: Natural Language Query
    IC->>IC: Classify Intent
    
    alt TEXT_TO_SQL Intent
        IC->>DSR: Request Schema Context
        DSR->>DSR: Column Pruning & Selection
        DSR->>CR: Schema Context
        
        par Context Retrieval
            CR->>CR: Historical Questions
            CR->>CR: SQL Examples
            CR->>CR: Instructions
            CR->>CR: SQL Functions
        end
        
        CR->>SG: Combined Context
        SG->>SG: Generate SQL with Reasoning
        SG->>SG: Dry Run Validation
        
        alt Valid SQL
            SG->>SE: Execute SQL
            SE->>SE: Fetch Results
            SE->>CG: Query Results
            CG->>CG: Generate Visualizations
            SE->>SA: Generate Explanation
            SA->>User: Complete Response
        else Invalid SQL
            SG->>SG: SQL Correction Pipeline
            SG->>SE: Corrected SQL
        end
        
    else GENERAL Intent
        IC->>SA: General Information Request
        SA->>User: General Response
        
    else USER_GUIDE Intent
        IC->>SA: User Guide Query
        SA->>User: User Guide Response
        
    else MISLEADING_QUERY Intent
        IC->>SA: Redirect Query
        SA->>User: Polite Redirection
    end
  1. Intent Classification - Determine query type and route appropriately
  2. Schema Retrieval - Fetch relevant schema components with column pruning
  3. Context Retrieval - Gather historical examples, instructions, and functions
  4. SQL Generation - Generate optimized SQL with reasoning
  5. Validation - Dry run validation and error checking
  6. Execution - Execute query and fetch results
  7. Chart Generation - Create visualizations if applicable
  8. Answer Generation - Provide natural language explanations
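
Step 3 gathers several kinds of context side by side, matching the par block in the sequence diagram. A minimal asyncio sketch of that fan-out is shown below; the retriever here is a stand-in, not one of WrenAI's real pipelines.

import asyncio

async def retrieve(kind: str, query: str) -> dict:
    await asyncio.sleep(0.1)  # stand-in for an embedder call plus a vector-store lookup
    return {kind: f"context for {query!r}"}

async def gather_context(query: str) -> dict:
    parts = await asyncio.gather(
        retrieve("historical_questions", query),
        retrieve("sql_pairs", query),
        retrieve("instructions", query),
        retrieve("sql_functions", query),
    )
    merged = {}
    for part in parts:
        merged.update(part)
    return merged

print(asyncio.run(gather_context("monthly revenue by city")))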

Performance Optimizations

graph TD
    subgraph "Performance Optimization Strategies"
        A[Input Query] --> B[Column Pruning]
        B --> C[Token Management]
        C --> D[Context Window Optimization]
        D --> E[Smart Caching]
        E --> F[Dry Run Validation]
        F --> G[Calculated Field Usage]
        G --> H[Optimized SQL Output]
    end
    
    subgraph "Optimization Benefits"
        I[60-80% Schema Size Reduction]
        J[Prevents Context Overflow]
        K[Catches Expensive Queries]
        L[Leverages Pre-computed Metrics]
        M[Reduces Processing Time]
    end
    
    B --> I
    C --> J
    F --> K
    G --> L
    E --> M
    
    style B fill:#e8f5e8
    style F fill:#fff3e0
    style G fill:#f3e5f5
  1. Column Pruning - Reduces schema size by 60-80%
  2. Token Management - Prevents context window overflow
  3. Dry Run Validation - Catches expensive queries before execution
  4. Calculated Field Usage - Leverages pre-computed metrics
  5. Smart Caching - Reuses retrieved contexts when possible
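
As a rough illustration of the token-management idea in item 2, the sketch below keeps retrieved documents, in order, until an assumed token budget is exhausted. The words-to-tokens factor is a crude estimate; a real implementation would use the model's tokenizer.

def fit_to_budget(documents: list, max_tokens: int = 1000) -> list:
    """Keep documents in order until a rough token estimate exceeds the budget."""
    kept, used = [], 0
    for doc in documents:
        estimate = int(len(doc.split()) * 1.3)  # crude words-to-tokens conversion
        if used + estimate > max_tokens:
            break
        kept.append(doc)
        used += estimate
    return kept

docs = ["CREATE TABLE book (Book_ID INTEGER, Title VARCHAR, Price REAL)"] * 100
print(len(fit_to_budget(docs, max_tokens=200)))  # only a prefix of the documents fits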

Error Handling and Fallbacks

graph TD
    A[SQL Generation] --> B{Valid SQL?}
    B -->|No| C[Error Analysis]
    C --> D[SQL Correction Pipeline]
    D --> E{Correctable?}
    E -->|Yes| F[Corrected SQL]
    E -->|No| G[Fallback Strategy]
    
    G --> H[Simplified Query]
    G --> I[Alternative Approach]
    G --> J[User Guidance]
    
    F --> K[Validation Check]
    K --> L{Valid?}
    L -->|Yes| M[Execute Query]
    L -->|No| N[Graceful Degradation]
    
    N --> O[Reduced Functionality]
    N --> P[Error Message with Suggestions]
    
    style C fill:#ffebee
    style D fill:#fff3e0
    style G fill:#e3f2fd
    style O fill:#f3e5f5
  1. SQL Correction Pipeline - Fixes invalid queries
  2. Fallback Strategies - Alternative generation approaches
  3. Graceful Degradation - Reduced functionality under constraints
  4. User Guidance - Helpful error messages and suggestions

Configuration and Deployment

Pipeline Configuration

graph TB
    subgraph "Configuration Structure"
        A[Pipeline Configuration]
        
        subgraph "Provider Types"
            B[LLM Provider]
            C[Embedder Provider]
            D[Document Store Provider]
            E[Engine Provider]
        end
        
        subgraph "Pipeline Mappings"
            F[Indexing Pipelines]
            G[Retrieval Pipelines]
            H[Generation Pipelines]
        end
        
        subgraph "Component Instances"
            I[OpenAI GPT-4o]
            J[Text Embedding 3 Large]
            K[Qdrant Vector Store]
            L[Wren Engine]
        end
    end
    
    A --> B
    A --> C
    A --> D
    A --> E
    
    B --> F
    B --> G
    B --> H
    
    C --> F
    C --> G
    
    D --> F
    D --> G
    D --> H
    
    E --> G
    E --> H
    
    B --> I
    C --> J
    D --> K
    E --> L

Pipelines are configured through provider mappings:

pipeline:
  type: pipeline
  pipes:
    - name: indexing
      llm: openai_llm.gpt-4o-mini
      embedder: openai_embedder.text-embedding-3-large
      document_store: qdrant
      engine: wren_ui
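
A small sketch of reading such a provider mapping with PyYAML. The keys mirror the snippet above; the loader itself is illustrative and not WrenAI's actual configuration code.

import yaml  # PyYAML

CONFIG = """
pipeline:
  type: pipeline
  pipes:
    - name: indexing
      llm: openai_llm.gpt-4o-mini
      embedder: openai_embedder.text-embedding-3-large
      document_store: qdrant
      engine: wren_ui
"""

config = yaml.safe_load(CONFIG)
for pipe in config["pipeline"]["pipes"]:
    print(pipe["name"], "->", pipe["llm"], "/", pipe["embedder"])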

Scalability Features

  1. Async Processing - Non-blocking pipeline execution
  2. Batch Processing - Efficient bulk operations
  3. Resource Management - Configurable timeouts and limits
  4. Load Balancing - Distributed processing capabilities

Monitoring and Observability

graph TB
    subgraph "Monitoring Stack"
        A[Pipeline Execution]
        
        subgraph "Tracing & Monitoring"
            B[Langfuse Integration]
            C[Performance Metrics]
            D[Error Tracking]
            E[Cost Tracking]
        end
        
        subgraph "Metrics Collection"
            F[Execution Time]
            G[Token Usage]
            H[Success Rate]
            I[Error Rate]
            J[Resource Usage]
        end
        
        subgraph "Dashboards"
            K[Performance Dashboard]
            L[Error Dashboard]
            M[Cost Dashboard]
            N[Usage Dashboard]
        end
    end
    
    A --> B
    A --> C
    A --> D
    A --> E
    
    B --> F
    B --> G
    C --> H
    C --> I
    D --> I
    E --> G
    E --> J
    
    F --> K
    G --> M
    H --> K
    I --> L
    J --> N
  1. Langfuse Integration - Comprehensive tracing and monitoring
  2. Performance Metrics - Pipeline execution statistics
  3. Error Tracking - Detailed error analysis and reporting
  4. Cost Tracking - LLM usage and cost monitoring

Best Practices

Pipeline Development

  1. Modular Design - Keep pipelines focused and composable
  2. Error Handling - Implement comprehensive error handling
  3. Validation - Validate inputs and outputs thoroughly
  4. Documentation - Maintain clear documentation and examples

Performance Optimization

  1. Context Management - Minimize context size while maintaining quality
  2. Caching Strategies - Cache expensive operations when possible
  3. Batch Processing - Process multiple items together when feasible
  4. Resource Monitoring - Monitor and optimize resource usage

Quality Assurance

  1. Testing - Comprehensive unit and integration testing
  2. Validation - Multi-stage validation and verification
  3. Monitoring - Continuous quality monitoring
  4. Feedback Loops - Incorporate user feedback for improvements

This pipeline architecture enables WrenAI to deliver intelligent, efficient, and scalable AI-powered data analysis while maintaining high performance and accuracy.

What is MDL Schema?

MDL (Model Definition Language) is WrenAI's proprietary schema format that defines the semantic layer of your data warehouse. It's a JSON-based specification that describes your data models, relationships, metrics, views, and business logic in a structured way that WrenAI can understand and use for intelligent SQL generation.

Core Components of MDL Schema:

  1. Models - Represent tables or data sources
  2. Relationships - Define how models connect to each other
  3. Metrics - Pre-defined aggregations and business calculations
  4. Views - Saved SQL queries for common use cases
  5. Calculated Fields - Computed columns with custom expressions

MDL Schema Structure

{
  "catalog": "string",      // Database catalog name
  "schema": "string",       // Database schema name
  "models": [...],          // Array of data models
  "relationships": [...],   // Array of relationships between models
  "metrics": [...],         // Array of business metrics
  "views": [...],           // Array of saved views
  "enumDefinitions": [...], // Array of enum definitions
  "macros": [...]          // Array of reusable macros
}
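
A quick, illustrative sanity check that an MDL document contains the top-level sections described above. The helper is made up for this example; WrenAI performs its own validation when the MDL is indexed, and several sections (metrics, views, enumDefinitions, macros) may legitimately be empty or absent.

import json

EXPECTED_KEYS = {"catalog", "schema", "models", "relationships", "metrics", "views"}

def missing_mdl_sections(mdl_json: str) -> set:
    """Return the expected top-level keys that are absent from an MDL document."""
    mdl = json.loads(mdl_json)
    return EXPECTED_KEYS - set(mdl)

sample = '{"catalog": "wrenai", "schema": "spider", "models": [], "relationships": []}'
print(missing_mdl_sections(sample))  # metrics and views are absent in this minimal sample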

Examples

1. Simple MDL Example - Book Store

{
  "catalog": "wrenai",
  "schema": "spider",
  "models": [
    {
      "name": "book",
      "properties": {
        "description": "Book information table",
        "displayName": "Books"
      },
      "refSql": "select * from \"wrenai\".spider.\"book_2-book\"",
      "columns": [
        {
          "name": "Book_ID",
          "type": "INTEGER",
          "notNull": false,
          "isCalculated": false,
          "expression": "Book_ID",
          "properties": {
            "description": "Unique identifier for each book",
            "displayName": "Book ID"
          }
        },
        {
          "name": "Title",
          "type": "VARCHAR",
          "notNull": false,
          "isCalculated": false,
          "expression": "Title",
          "properties": {
            "description": "Title of the book",
            "displayName": "Book Title"
          }
        },
        {
          "name": "Writer",
          "type": "VARCHAR",
          "notNull": false,
          "isCalculated": false,
          "expression": "Writer",
          "properties": {
            "description": "Author of the book",
            "displayName": "Author"
          }
        },
        {
          "name": "Price",
          "type": "REAL",
          "notNull": false,
          "isCalculated": false,
          "expression": "Price",
          "properties": {
            "description": "Price of the book",
            "displayName": "Book Price"
          }
        }
      ],
      "primaryKey": "Book_ID",
      "cached": false
    },
    {
      "name": "publication",
      "properties": {
        "description": "Publication information",
        "displayName": "Publications"
      },
      "refSql": "select * from \"wrenai\".spider.\"book_2-publication\"",
      "columns": [
        {
          "name": "Publication_ID",
          "type": "INTEGER",
          "notNull": false,
          "isCalculated": false,
          "expression": "Publication_ID",
          "properties": {
            "description": "Unique identifier for publication",
            "displayName": "Publication ID"
          }
        },
        {
          "name": "Book_ID",
          "type": "INTEGER",
          "notNull": false,
          "isCalculated": false,
          "expression": "Book_ID",
          "properties": {
            "description": "Reference to book",
            "displayName": "Book ID"
          }
        },
        {
          "name": "Publisher",
          "type": "VARCHAR",
          "notNull": false,
          "isCalculated": false,
          "expression": "Publisher",
          "properties": {
            "description": "Name of the publisher",
            "displayName": "Publisher Name"
          }
        },
        {
          "name": "Publication_Date",
          "type": "VARCHAR",
          "notNull": false,
          "isCalculated": false,
          "expression": "Publication_Date",
          "properties": {
            "description": "Date of publication",
            "displayName": "Publication Date"
          }
        }
      ],
      "primaryKey": "Publication_ID",
      "cached": false
    }
  ],
  "relationships": [
    {
      "name": "BookPublication",
      "models": ["book", "publication"],
      "joinType": "ONE_TO_MANY",
      "condition": "book.Book_ID = publication.Book_ID",
      "properties": {
        "description": "Relationship between books and their publications"
      }
    }
  ],
  "views": [
    {
      "name": "book_summary",
      "statement": "SELECT COUNT(*) as total_books, AVG(Price) as avg_price FROM book",
      "properties": {
        "question": "How many books are there and what's the average price?",
        "summary": "Summary statistics for all books",
        "displayName": "Book Summary"
      }
    }
  ]
}

2. E-commerce MDL Example with Complex Relationships

{
  "catalog": "ecommerce",
  "schema": "public",
  "models": [
    {
      "name": "customers",
      "tableReference": {
        "catalog": "ecommerce",
        "schema": "public",
        "table": "customers"
      },
      "columns": [
        {
          "name": "customer_id",
          "type": "STRING",
          "notNull": true,
          "isCalculated": false,
          "properties": {
            "description": "Unique customer identifier",
            "displayName": "Customer ID"
          }
        },
        {
          "name": "customer_name",
          "type": "STRING",
          "notNull": false,
          "isCalculated": false,
          "properties": {
            "description": "Customer full name",
            "displayName": "Customer Name"
          }
        },
        {
          "name": "city",
          "type": "STRING",
          "notNull": false,
          "isCalculated": false,
          "properties": {
            "description": "Customer city",
            "displayName": "City"
          }
        },
        {
          "name": "total_orders",
          "type": "INTEGER",
          "notNull": false,
          "isCalculated": true,
          "expression": "COUNT(orders.order_id)",
          "relationship": "CustomerOrders",
          "properties": {
            "description": "Total number of orders by customer",
            "displayName": "Total Orders"
          }
        }
      ],
      "primaryKey": "customer_id",
      "cached": false,
      "properties": {
        "description": "Customer information",
        "displayName": "Customers"
      }
    },
    {
      "name": "orders",
      "tableReference": {
        "catalog": "ecommerce",
        "schema": "public",
        "table": "orders"
      },
      "columns": [
        {
          "name": "order_id",
          "type": "STRING",
          "notNull": true,
          "isCalculated": false,
          "properties": {
            "description": "Unique order identifier",
            "displayName": "Order ID"
          }
        },
        {
          "name": "customer_id",
          "type": "STRING",
          "notNull": true,
          "isCalculated": false,
          "properties": {
            "description": "Customer who placed the order",
            "displayName": "Customer ID"
          }
        },
        {
          "name": "order_date",
          "type": "TIMESTAMP",
          "notNull": false,
          "isCalculated": false,
          "properties": {
            "description": "When the order was placed",
            "displayName": "Order Date"
          }
        },
        {
          "name": "order_status",
          "type": "STRING",
          "notNull": false,
          "isCalculated": false,
          "properties": {
            "description": "Current status of the order",
            "displayName": "Order Status"
          }
        }
      ],
      "primaryKey": "order_id",
      "cached": false,
      "properties": {
        "description": "Order information",
        "displayName": "Orders"
      }
    },
    {
      "name": "order_items",
      "tableReference": {
        "catalog": "ecommerce",
        "schema": "public",
        "table": "order_items"
      },
      "columns": [
        {
          "name": "order_id",
          "type": "STRING",
          "notNull": true,
          "isCalculated": false,
          "properties": {
            "description": "Reference to order",
            "displayName": "Order ID"
          }
        },
        {
          "name": "product_id",
          "type": "STRING",
          "notNull": true,
          "isCalculated": false,
          "properties": {
            "description": "Reference to product",
            "displayName": "Product ID"
          }
        },
        {
          "name": "quantity",
          "type": "INTEGER",
          "notNull": false,
          "isCalculated": false,
          "properties": {
            "description": "Quantity ordered",
            "displayName": "Quantity"
          }
        },
        {
          "name": "price",
          "type": "DECIMAL",
          "notNull": false,
          "isCalculated": false,
          "properties": {
            "description": "Price per item",
            "displayName": "Unit Price"
          }
        }
      ],
      "primaryKey": "",
      "cached": false,
      "properties": {
        "description": "Individual items within orders",
        "displayName": "Order Items"
      }
    }
  ],
  "relationships": [
    {
      "name": "CustomerOrders",
      "models": ["customers", "orders"],
      "joinType": "ONE_TO_MANY",
      "condition": "customers.customer_id = orders.customer_id",
      "properties": {
        "description": "Relationship between customers and their orders"
      }
    },
    {
      "name": "OrderItems",
      "models": ["orders", "order_items"],
      "joinType": "ONE_TO_MANY",
      "condition": "orders.order_id = order_items.order_id",
      "properties": {
        "description": "Relationship between orders and their line items"
      }
    }
  ],
  "metrics": [
    {
      "name": "revenue_metrics",
      "baseObject": "order_items",
      "dimension": [
        {
          "name": "order_date",
          "type": "TIMESTAMP",
          "isCalculated": false,
          "notNull": false,
          "properties": {
            "description": "Date dimension for time-based analysis",
            "displayName": "Order Date"
          }
        },
        {
          "name": "customer_city",
          "type": "STRING",
          "isCalculated": false,
          "notNull": false,
          "properties": {
            "description": "Customer city for geographic analysis",
            "displayName": "Customer City"
          }
        }
      ],
      "measure": [
        {
          "name": "total_revenue",
          "type": "DECIMAL",
          "isCalculated": true,
          "expression": "SUM(order_items.price * order_items.quantity)",
          "notNull": false,
          "properties": {
            "description": "Total revenue from sales",
            "displayName": "Total Revenue"
          }
        },
        {
          "name": "order_count",
          "type": "BIGINT",
          "isCalculated": true,
          "expression": "COUNT(DISTINCT orders.order_id)",
          "notNull": false,
          "properties": {
            "description": "Number of unique orders",
            "displayName": "Order Count"
          }
        }
      ],
      "timeGrain": [
        {
          "name": "order_date_grain",
          "refColumn": "order_date",
          "dateParts": ["YEAR", "QUARTER", "MONTH", "DAY"]
        }
      ],
      "cached": true,
      "refreshTime": "1h",
      "properties": {
        "description": "Revenue and order metrics for business analysis",
        "displayName": "Revenue Metrics"
      }
    }
  ],
  "views": [
    {
      "name": "monthly_revenue",
      "statement": "SELECT DATE_TRUNC('month', order_date) as month, SUM(total_revenue) as revenue FROM revenue_metrics GROUP BY DATE_TRUNC('month', order_date)",
      "properties": {
        "question": "What is the monthly revenue trend?",
        "summary": "Monthly revenue aggregation",
        "displayName": "Monthly Revenue Trend"
      }
    }
  ]
}

3. Advanced MDL with Table References and Nested Columns

{
  "catalog": "analytics",
  "schema": "events",
  "models": [
    {
      "name": "user_events",
      "tableReference": {
        "catalog": "analytics",
        "schema": "events",
        "table": "user_events"
      },
      "columns": [
        {
          "name": "event_id",
          "type": "STRING",
          "notNull": true,
          "isCalculated": false,
          "properties": {
            "description": "Unique event identifier",
            "displayName": "Event ID"
          }
        },
        {
          "name": "user_id",
          "type": "STRING",
          "notNull": true,
          "isCalculated": false,
          "properties": {
            "description": "User who triggered the event",
            "displayName": "User ID"
          }
        },
        {
          "name": "event_params",
          "type": "ARRAY<STRUCT<key STRING, value STRING>>",
          "notNull": false,
          "isCalculated": false,
          "properties": {
            "description": "Event parameters as nested structure",
            "displayName": "Event Parameters",
            "nestedDisplayName.event_params.key": "Parameter Key",
            "nestedDescription.event_params.key": "Parameter name",
            "nestedDisplayName.event_params.value": "Parameter Value",
            "nestedDescription.event_params.value": "Parameter value"
          }
        }
      ],
      "primaryKey": "event_id",
      "cached": true,
      "refreshTime": "30m",
      "properties": {
        "description": "User event tracking data",
        "displayName": "User Events"
      }
    }
  ],
  "enumDefinitions": [
    {
      "name": "EventType",
      "values": [
        {
          "name": "PAGE_VIEW",
          "value": "page_view",
          "properties": {
            "description": "User viewed a page"
          }
        },
        {
          "name": "CLICK",
          "value": "click",
          "properties": {
            "description": "User clicked on an element"
          }
        },
        {
          "name": "PURCHASE",
          "value": "purchase",
          "properties": {
            "description": "User completed a purchase"
          }
        }
      ],
      "properties": {
        "description": "Types of user events"
      }
    }
  ]
}

Key MDL Features

1. Model Definition Options:

  • refSql - Define model using raw SQL
  • tableReference - Reference existing database table
  • baseObject - Reference another model

2. Column Types:

  • Regular columns - Direct database columns
  • Calculated fields - Computed using expressions
  • Relationship columns - Connect to other models

3. Relationship Types:

  • ONE_TO_ONE - 1:1 relationship
  • ONE_TO_MANY - 1:N relationship
  • MANY_TO_ONE - N:1 relationship
  • MANY_TO_MANY - N:N relationship

4. Metric Components:

  • Dimensions - Categorical data for grouping
  • Measures - Quantitative calculations
  • Time Grains - Time-based aggregation levels

5. Caching & Performance:

  • Model-level caching with refresh intervals
  • Calculated field pre-computation
  • Metric materialization

This MDL schema provides WrenAI with the semantic understanding needed to generate intelligent SQL queries, optimize performance, and provide business context for data analysis.

Where MDL Schemas Originally Come From

MDL schemas in WrenAI originate from multiple sources depending on how you set up your WrenAI project. Here are the main sources:


1. πŸ”Œ External Database Connections (Primary Source)

The most common source of MDL schemas is automatic introspection of external databases that you connect to WrenAI.

Supported Data Sources:

  • BigQuery - Google's cloud data warehouse
  • PostgreSQL - Open-source relational database
  • MySQL - Popular relational database
  • Snowflake - Cloud data platform
  • ClickHouse - Columnar database
  • SQL Server (MSSQL) - Microsoft's database
  • Oracle - Enterprise database
  • Trino - Distributed SQL query engine
  • Athena - AWS serverless query service
  • Redshift - AWS data warehouse
  • DuckDB - In-process OLAP database

How Database Introspection Works:

  1. Connection Setup: You provide database credentials through the WrenAI UI
  2. Metadata Discovery: WrenAI uses two adapters to scan your database:
    • Ibis Adapter: For most external databases (BigQuery, PostgreSQL, etc.)
    • Wren Engine Adapter: For DuckDB (direct control)
  3. Table Discovery: The system automatically discovers:
    • All available tables and views
    • Column names and data types
    • Primary keys and constraints
    • Table relationships (foreign keys)
  4. MDL Generation: The MDLBuilder class converts this metadata into MDL format

Example Connection Flow:

// 1. User connects to BigQuery
const connectionInfo = {
  projectId: "my-gcp-project",
  datasetId: "my-dataset", 
  credentials: "service-account-key.json"
};

// 2. WrenAI scans tables automatically
const tables = await ibisAdaptor.getTables(DataSourceName.BIG_QUERY, connectionInfo);

// 3. MDL is generated from discovered metadata
const mdl = mdlBuilder.build({
  project,
  models: discoveredTables,
  columns: discoveredColumns,
  relationships: discoveredRelationships
});

2. πŸ“‹ Pre-built Sample Datasets

WrenAI provides curated sample datasets that come with pre-configured MDL schemas for learning and demonstration purposes.

Available Sample Datasets:

E-commerce Dataset

  • Tables: customers, orders, order_items, products, reviews, payments, sellers, geolocation
  • Source: Brazilian e-commerce data (Olist)
  • File Format: Parquet files hosted on S3
  • Use Case: Learning about retail analytics

NBA Dataset

  • Tables: games, teams, players, player_games, line_scores
  • Source: NBA statistics data
  • File Format: CSV files hosted on S3
  • Use Case: Sports analytics examples

HR/Employee Dataset

  • Tables: employees, salaries, departments, titles
  • Source: Employee management data
  • File Format: Parquet files hosted on S3
  • Use Case: HR analytics demonstrations

Music Dataset

  • Tables: artists, albums, tracks, customers, invoices, invoice_lines
  • Source: Music store sales data
  • File Format: CSV files hosted on S3
  • Use Case: Music industry analytics

Sample Dataset Structure:

export const sampleDatasets: Record<string, SampleDataset> = {
  ecommerce: {
    name: SampleDatasetName.ECOMMERCE,
    tables: [
      {
        tableName: 'olist_customers_dataset',
        filePath: 'https://assets.getwren.ai/sample_data/brazilian-ecommerce/olist_customers_dataset.parquet',
        schema: [
          { columnName: 'customer_id', dataType: 'VARCHAR' },
          { columnName: 'customer_city', dataType: 'VARCHAR' }
        ],
        properties: {
          displayName: 'customers',
          description: 'Customer information'
        }
      }
    ],
    relations: [
      {
        fromModelName: 'customers',
        fromColumnName: 'customer_id', 
        toModelName: 'orders',
        toColumnName: 'customer_id',
        type: RelationType.ONE_TO_MANY
      }
    ]
  }
};

3. πŸ—οΈ Manual MDL Creation

You can also manually create or import MDL schemas:

Direct MDL Import

  • Upload JSON files that conform to the MDL schema specification
  • Useful for migrating from other semantic layer tools
  • Allows for custom business logic definitions

SQL-Based Model Creation

{
  "models": [
    {
      "name": "sales_summary",
      "refSql": "SELECT customer_id, SUM(amount) as total_sales FROM orders GROUP BY customer_id",
      "columns": [
        {
          "name": "customer_id",
          "type": "VARCHAR",
          "isCalculated": false
        },
        {
          "name": "total_sales", 
          "type": "DECIMAL",
          "isCalculated": true,
          "expression": "SUM(amount)"
        }
      ]
    }
  ]
}

Bootstrap Initialization

The system can also start from a minimal MDL that contains no models:

{
  "catalog": "test_catalog",
  "schema": "test_schema", 
  "models": []
}

4. πŸ”„ MDL Generation Process

Here's how WrenAI transforms various sources into MDL:

Database β†’ MDL Transformation:

  1. Discovery Phase:

    // Scan database for tables
    const tables = await metadataService.listTables(project);
    const constraints = await metadataService.listConstraints(project);
  2. Model Creation:

    // Convert tables to MDL models
    mdlBuilder.addModel(); // Creates model definitions
    mdlBuilder.addNormalField(); // Adds column definitions
    mdlBuilder.addCalculatedField(); // Adds computed columns
  3. Relationship Mapping:

    // Map foreign keys to relationships
    mdlBuilder.addRelation(); // Creates relationship definitions
  4. Final Assembly:

    // Build complete MDL
    const manifest = mdlBuilder.build();

5. 🎯 MDL Enrichment

Once the basic MDL is created from any source, WrenAI allows you to enrich it with:

Business Context

  • Display Names: User-friendly column names
  • Descriptions: Business meaning of fields
  • Categories: Grouping related fields

Advanced Features

  • Calculated Fields: Custom computed columns
  • Metrics: Pre-defined aggregations
  • Views: Saved SQL queries
  • Relationships: Business logic connections

Performance Optimizations

  • Caching: Model-level caching strategies
  • Refresh Schedules: Automated data updates
  • Indexing: Vector embeddings for AI queries

Summary

MDL schemas in WrenAI primarily originate from:

  1. πŸ”Œ Database Introspection (the most common path) - Automatic scanning of connected databases
  2. πŸ“‹ Sample Datasets - Pre-built examples for learning and demos
  3. πŸ—οΈ Manual Creation (the least common path) - Custom MDL files or SQL-based models

The system is designed to minimize manual work by automatically discovering your data structure and generating semantically rich MDL schemas that can then be enhanced with business context and optimizations through the WrenAI interface.

This approach allows data teams to quickly get started with AI-powered analytics while maintaining the flexibility to customize and extend their semantic layer as needed.

MDL to Vector Documents Transformation in WrenAI

Overview

This document explains how WrenAI's DB Schema Indexing pipeline transforms Model Definition Language (MDL) files into vector documents that are stored in the vector database for semantic search. The transformation process involves parsing, chunking, preprocessing, and structuring the schema information into searchable document formats.

Architecture Overview

graph TB
    subgraph "Input Layer"
        MDL[MDL File<br/>JSON Schema]
    end
    
    subgraph "Processing Pipeline"
        VAL[MDL Validator]
        CHUNK[DDL Chunker]
        PREPROC[Model Preprocessor]
        HELPER[Helper System]
        EMB[Embedder]
    end
    
    subgraph "Document Types"
        TABLE[TABLE Documents]
        COLUMNS[TABLE_COLUMNS Documents]
        VIEW[VIEW Documents]
        METRIC[METRIC Documents]
        FK[FOREIGN_KEY Documents]
    end
    
    subgraph "Storage Layer"
        VECTOR[Vector Store<br/>Qdrant]
        META[Metadata Store]
    end
    
    MDL --> VAL
    VAL --> CHUNK
    CHUNK --> PREPROC
    PREPROC --> HELPER
    HELPER --> TABLE
    HELPER --> COLUMNS
    HELPER --> VIEW
    HELPER --> METRIC
    HELPER --> FK
    
    TABLE --> EMB
    COLUMNS --> EMB
    VIEW --> EMB
    METRIC --> EMB
    FK --> EMB
    
    EMB --> VECTOR
    EMB --> META
    
    style MDL fill:#e3f2fd
    style VECTOR fill:#e8f5e8
    style TABLE fill:#fff3e0
    style COLUMNS fill:#f3e5f5
    style VIEW fill:#e0f2f1
    style METRIC fill:#fce4ec
    style FK fill:#fff8e1

MDL Structure to Document Mapping

1. MDL Schema Components

graph LR
    subgraph "MDL Schema"
        MODELS[Models]
        VIEWS[Views]
        RELATIONSHIPS[Relationships]
        METRICS[Metrics]
        
        subgraph "Model Components"
            COLS[Columns]
            PROPS[Properties]
            PK[Primary Keys]
            CALC[Calculated Fields]
            NESTED[Nested Columns]
        end
        
        MODELS --> COLS
        MODELS --> PROPS
        MODELS --> PK
        MODELS --> CALC
        MODELS --> NESTED
    end
    
    subgraph "Document Types Generated"
        TD[TABLE Documents]
        TCD[TABLE_COLUMNS Documents]
        VD[VIEW Documents]
        MD[METRIC Documents]
        FKD[FOREIGN_KEY Documents]
    end
    
    MODELS --> TD
    COLS --> TCD
    CALC --> TCD
    NESTED --> TCD
    RELATIONSHIPS --> FKD
    VIEWS --> VD
    METRICS --> MD
    
    style MODELS fill:#e1f5fe
    style TD fill:#fff3e0
    style TCD fill:#f3e5f5

Document Transformation Process

2. Model to TABLE Document Transformation

sequenceDiagram
    participant MDL as MDL Model
    participant PROC as Model Preprocessor
    participant CHUNK as DDL Chunker
    participant DOC as TABLE Document
    
    MDL->>PROC: Model with properties
    Note over MDL: {"name": "user",<br/>"properties": {<br/>"description": "User table",<br/>"displayName": "Users"}}
    
    PROC->>PROC: Extract metadata
    PROC->>CHUNK: Processed model
    
    CHUNK->>CHUNK: Generate table command
    Note over CHUNK: Creates table structure<br/>with comments
    
    CHUNK->>DOC: TABLE Document
    Note over DOC: {"type": "TABLE",<br/>"comment": "/* metadata */",<br/>"name": "user"}

3. Columns to TABLE_COLUMNS Document Transformation

flowchart TD
    A[Model Columns] --> B{Column Type?}
    
    B -->|Regular Column| C[Column Preprocessor]
    B -->|Calculated Field| D[Calculated Field Handler]
    B -->|Nested Column| E[Nested Column Handler]
    B -->|Relationship Column| F[Skip - Handled Separately]
    
    C --> G[Column Command Generator]
    D --> H[Add Expression Metadata]
    E --> I[Add Nested Metadata]
    
    H --> G
    I --> G
    G --> J[Column Batcher]
    
    J --> K{Batch Size Reached?}
    K -->|Yes| L[Create TABLE_COLUMNS Document]
    K -->|No| M[Add to Current Batch]
    M --> J
    
    L --> N[Vector Document]
    
    style A fill:#e3f2fd
    style N fill:#e8f5e8
    style D fill:#fff3e0
    style E fill:#f3e5f5

Column Batching Strategy

4. Column Batching Process

graph TB
    subgraph "Input: Model with 7 Columns"
        C1[Column 1: id]
        C2[Column 2: name]
        C3[Column 3: email]
        C4[Column 4: age]
        C5[Column 5: profile_data]
        C6[Column 6: created_at]
        C7[Column 7: updated_at]
    end
    
    subgraph "Batch Size = 3"
        subgraph "Batch 1"
            B1C1[id]
            B1C2[name]
            B1C3[email]
        end
        
        subgraph "Batch 2"
            B2C1[age]
            B2C2[profile_data]
            B2C3[created_at]
        end
        
        subgraph "Batch 3"
            B3C1[updated_at]
        end
    end
    
    subgraph "Generated Documents"
        DOC1[TABLE_COLUMNS Document 1<br/>Contains: id, name, email]
        DOC2[TABLE_COLUMNS Document 2<br/>Contains: age, profile_data, created_at]
        DOC3[TABLE_COLUMNS Document 3<br/>Contains: updated_at]
    end
    
    C1 --> B1C1
    C2 --> B1C2
    C3 --> B1C3
    C4 --> B2C1
    C5 --> B2C2
    C6 --> B2C3
    C7 --> B3C1
    
    B1C1 --> DOC1
    B1C2 --> DOC1
    B1C3 --> DOC1
    B2C1 --> DOC2
    B2C2 --> DOC2
    B2C3 --> DOC2
    B3C1 --> DOC3
    
    style DOC1 fill:#e8f5e8
    style DOC2 fill:#e8f5e8
    style DOC3 fill:#e8f5e8
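
The batching step itself is simple to express in code. Below is a minimal sketch (not the actual WrenAI implementation) of how a model's columns could be split into fixed-size batches, each batch becoming one TABLE_COLUMNS document; the batch size of 3 and the document shape are assumptions chosen to match the diagram above.

from typing import Iterator

def batch_columns(columns: list[dict], batch_size: int = 3) -> Iterator[list[dict]]:
    """Yield successive fixed-size batches of column definitions."""
    for start in range(0, len(columns), batch_size):
        yield columns[start : start + batch_size]

def build_table_columns_documents(table_name: str, columns: list[dict]) -> list[dict]:
    """Turn each batch of columns into one TABLE_COLUMNS document payload."""
    return [
        {"type": "TABLE_COLUMNS", "name": table_name, "columns": batch}
        for batch in batch_columns(columns)
    ]

# 7 columns with batch size 3 -> 3 documents (3 + 3 + 1 columns)
docs = build_table_columns_documents(
    "users",
    [{"name": c, "data_type": "VARCHAR"} for c in
     ["id", "name", "email", "age", "profile_data", "created_at", "updated_at"]],
)
print(len(docs))  # 3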

Relationship Processing

5. Relationship to FOREIGN_KEY Document Transformation

graph TB
    subgraph "MDL Relationship Definition"
        REL["Relationship: UserOrders<br/>condition: user.id = order.user_id<br/>joinType: ONE_TO_MANY<br/>models: [user, order]"]
    end
    
    subgraph "Processing Logic"
        PARSE[Parse Condition]
        EXTRACT[Extract FK Columns]
        BUILD[Build Constraint]
    end
    
    subgraph "Generated FOREIGN_KEY Documents"
        FK1["Document for user table<br/>FOREIGN KEY id REFERENCES order user_id"]
        FK2["Document for order table<br/>FOREIGN KEY user_id REFERENCES user id"]
    end
    
    REL --> PARSE
    PARSE --> EXTRACT
    EXTRACT --> BUILD
    BUILD --> FK1
    BUILD --> FK2
    
    style REL fill:#e3f2fd
    style FK1 fill:#fff8e1
    style FK2 fill:#fff8e1
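
As a rough illustration of the relationship processing above, the sketch below parses a join condition of the form "left_table.left_col = right_table.right_col" and emits one FOREIGN_KEY entry per side. It assumes simple equality conditions and uses a simplified output shape; the real chunker produces document content rather than plain dicts.

def relationship_to_foreign_keys(condition: str, join_type: str) -> list[dict]:
    """Parse 'user.id = order.user_id' into one FK entry per participating table."""
    left, right = (part.strip() for part in condition.split("="))
    left_table, left_col = left.split(".")
    right_table, right_col = right.split(".")
    return [
        {
            "type": "FOREIGN_KEY",
            "table": left_table,
            "constraint": f"FOREIGN KEY ({left_col}) REFERENCES {right_table}({right_col})",
            "join_type": join_type,
        },
        {
            "type": "FOREIGN_KEY",
            "table": right_table,
            "constraint": f"FOREIGN KEY ({right_col}) REFERENCES {left_table}({left_col})",
            "join_type": join_type,
        },
    ]

print(relationship_to_foreign_keys("user.id = order.user_id", "ONE_TO_MANY"))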

Complex Data Type Handling

6. Nested Column Processing

graph TB
    subgraph "Input: Nested Column"
        NC["Column: user_profile<br/>Type: STRUCT with name STRING and addresses ARRAY<br/>Properties: nested metadata"]
    end
    
    subgraph "Helper System Processing"
        HELPER[Properties Helper]
        NESTED[Nested Column Extractor]
        JSON[JSON Serializer]
    end
    
    subgraph "Generated Comment Structure"
        COMMENT["Comment: JSON metadata<br/>alias: User Profile<br/>description: User profile data<br/>nested_columns: extracted paths<br/>nested.user_profile.name<br/>nested.user_profile.addresses"]
    end
    
    subgraph "Final Document"
        DOC["TABLE_COLUMNS Document<br/>with nested metadata"]
    end
    
    NC --> HELPER
    HELPER --> NESTED
    NESTED --> JSON
    JSON --> COMMENT
    COMMENT --> DOC
    
    style NC fill:#e3f2fd
    style COMMENT fill:#fff3e0
    style DOC fill:#e8f5e8
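
The nested-column handling can be approximated with a small helper that collects the nestedDisplayName.* / nestedDescription.* property keys (the key naming follows the MDL examples earlier in this document) into a JSON comment attached to the column. The output format here is an assumption for illustration, not the exact comment layout produced by the helper system.

import json

def nested_column_comment(column: dict) -> str:
    """Serialize a column's nested metadata into a comment string."""
    props = column.get("properties", {})
    nested = {
        key: value
        for key, value in props.items()
        if key.startswith(("nestedDisplayName.", "nestedDescription."))
    }
    comment = {
        "alias": props.get("displayName"),
        "description": props.get("description"),
        "nested_columns": nested,
    }
    return f"-- {json.dumps(comment)}"

column = {
    "name": "event_params",
    "type": "ARRAY<STRUCT<key STRING, value STRING>>",
    "properties": {
        "displayName": "Event Parameters",
        "description": "Event parameters as nested structure",
        "nestedDisplayName.event_params.key": "Parameter Key",
        "nestedDescription.event_params.key": "Parameter name",
    },
}
print(nested_column_comment(column))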

Document Content Examples

7. Complete Transformation Example

graph LR
    subgraph "MDL Input"
        MDL_IN["
        {
          'models': [{
            'name': 'customers',
            'properties': {
              'description': 'Customer data',
              'displayName': 'Customers'
            },
            'columns': [
              {
                'name': 'id',
                'type': 'BIGINT',
                'isPrimaryKey': true
              },
              {
                'name': 'profile',
                'type': 'STRUCT<name STRING>',
                'properties': {
                  'nested.profile.name': {...}
                }
              },
              {
                'name': 'total_orders',
                'type': 'INTEGER',
                'isCalculated': true,
                'expression': 'COUNT(orders.id)'
              }
            ]
          }],
          'views': [{
            'name': 'customer_summary',
            'statement': 'SELECT...'
          }],
          'relationships': [{
            'name': 'CustomerOrders',
            'condition': 'customers.id = orders.customer_id'
          }]
        }
        "]
    end
    
    subgraph "Generated Documents"
        DOC1["TABLE Document
        {
          'type': 'TABLE',
          'name': 'customers',
          'comment': '/* metadata */'
        }"]
        
        DOC2["TABLE_COLUMNS Document
        {
          'type': 'TABLE_COLUMNS',
          'columns': [
            {
              'type': 'COLUMN',
              'name': 'id',
              'is_primary_key': true
            }
          ]
        }"]
        
        DOC3["TABLE_COLUMNS Document
        {
          'type': 'TABLE_COLUMNS',
          'columns': [
            {
              'type': 'COLUMN',
              'name': 'profile',
              'comment': '-- nested metadata'
            }
          ]
        }"]
        
        DOC4["TABLE_COLUMNS Document
        {
          'type': 'TABLE_COLUMNS',
          'columns': [
            {
              'type': 'COLUMN',
              'name': 'total_orders',
              'comment': '-- Calculated Field'
            }
          ]
        }"]
        
        DOC5["VIEW Document
        {
          'type': 'VIEW',
          'name': 'customer_summary',
          'statement': 'SELECT...'
        }"]
        
        DOC6["FOREIGN_KEY Document
        {
          'type': 'FOREIGN_KEY',
          'constraint': 'FOREIGN KEY...'
        }"]
    end
    
    MDL_IN --> DOC1
    MDL_IN --> DOC2
    MDL_IN --> DOC3
    MDL_IN --> DOC4
    MDL_IN --> DOC5
    MDL_IN --> DOC6
    
    style MDL_IN fill:#e3f2fd
    style DOC1 fill:#fff3e0
    style DOC2 fill:#f3e5f5
    style DOC3 fill:#f3e5f5
    style DOC4 fill:#f3e5f5
    style DOC5 fill:#e0f2f1
    style DOC6 fill:#fff8e1

Vector Embedding Process

8. Document to Vector Transformation

sequenceDiagram
    participant DOCS as Generated Documents
    participant EMB as Embedder
    participant VECTOR as Vector Store
    participant META as Metadata Store
    
    DOCS->>EMB: Document batch
    Note over DOCS: Multiple document types<br/>with structured content
    
    EMB->>EMB: Generate embeddings
    Note over EMB: Text β†’ Vector<br/>(1536 dimensions for OpenAI)
    
    EMB->>VECTOR: Store vectors
    Note over VECTOR: Vector + Metadata<br/>in Qdrant
    
    EMB->>META: Store metadata
    Note over META: Document metadata<br/>for filtering
    
    VECTOR-->>EMB: Confirmation
    META-->>EMB: Confirmation
    EMB-->>DOCS: Indexing complete
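
For orientation, here is a stand-alone sketch of this final step using the qdrant-client library directly. WrenAI goes through its own embedder and document-store abstractions, so the collection name, the placeholder embed() function, and the payload fields below are assumptions made only for this example.

import uuid
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

def embed(text: str) -> list[float]:
    # Placeholder embedding: in practice the configured embedding model
    # (e.g. OpenAI) is called here and returns a 1536-dimensional vector.
    return [0.0] * 1536

client = QdrantClient(url="http://localhost:6333")   # assumes a local Qdrant on its default port
collection = "db_schema"                             # assumed collection name

client.recreate_collection(
    collection_name=collection,
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

documents = [
    {
        "content": '{"type": "TABLE", "name": "customers", "comment": "/* Customer data */"}',
        "meta": {"type": "TABLE_SCHEMA", "name": "customers", "project_id": "p1"},
    },
]

client.upsert(
    collection_name=collection,
    points=[
        PointStruct(id=str(uuid.uuid4()), vector=embed(doc["content"]), payload=doc["meta"])
        for doc in documents
    ],
)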

Search and Retrieval Flow

9. Query to Document Retrieval

graph TB
    subgraph "User Query"
        QUERY["Natural Language Query:<br/>Show me customer information"]
    end
    
    subgraph "Retrieval Process"
        EMBED[Query Embedding]
        SEARCH[Vector Similarity Search]
        FILTER[Metadata Filtering]
        RANK[Relevance Ranking]
    end
    
    subgraph "Retrieved Documents"
        R1["TABLE: customers<br/>Relevance: 0.95"]
        R2["TABLE_COLUMNS: customer fields<br/>Relevance: 0.89"]
        R3["VIEW: customer_summary<br/>Relevance: 0.82"]
        R4["METRIC: customer_metrics<br/>Relevance: 0.78"]
    end
    
    subgraph "Schema Context"
        CONTEXT["Assembled Schema Context<br/>for SQL Generation"]
    end
    
    QUERY --> EMBED
    EMBED --> SEARCH
    SEARCH --> FILTER
    FILTER --> RANK
    
    RANK --> R1
    RANK --> R2
    RANK --> R3
    RANK --> R4
    
    R1 --> CONTEXT
    R2 --> CONTEXT
    R3 --> CONTEXT
    R4 --> CONTEXT
    
    style QUERY fill:#e3f2fd
    style CONTEXT fill:#e8f5e8
    style R1 fill:#fff3e0
    style R2 fill:#f3e5f5
    style R3 fill:#e0f2f1
    style R4 fill:#fce4ec
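
A corresponding retrieval sketch, again calling qdrant-client directly rather than WrenAI's pipeline code (the collection name, the project_id filter field, and the placeholder embed() function are the same assumptions as in the indexing sketch above):

from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

def embed(text: str) -> list[float]:
    # Same placeholder embedding as in the indexing sketch above.
    return [0.0] * 1536

client = QdrantClient(url="http://localhost:6333")

hits = client.search(
    collection_name="db_schema",                        # assumed collection name
    query_vector=embed("Show me customer information"),
    query_filter=Filter(
        must=[FieldCondition(key="project_id", match=MatchValue(value="p1"))]
    ),
    limit=10,                                            # top-k schema documents
)

# Each hit carries the stored metadata and a similarity score used for ranking.
for hit in hits:
    print(hit.score, hit.payload.get("name"))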

Performance Optimizations

10. Column Pruning in Retrieval

graph TB
    subgraph "Initial Retrieval"
        LARGE[Large Schema Context<br/>All columns retrieved<br/>~10,000 tokens]
    end
    
    subgraph "Column Pruning Process"
        LLM[Column Selection LLM]
        ANALYZE[Analyze Query Intent]
        SELECT[Select Relevant Columns]
        REASON[Generate Reasoning]
    end
    
    subgraph "Optimized Context"
        SMALL[Pruned Schema Context<br/>Only relevant columns<br/>~2,000 tokens]
    end
    
    subgraph "Benefits"
        SPEED[Faster Processing]
        COST[Lower LLM Costs]
        QUALITY[Better Focus]
    end
    
    LARGE --> LLM
    LLM --> ANALYZE
    ANALYZE --> SELECT
    SELECT --> REASON
    REASON --> SMALL
    
    SMALL --> SPEED
    SMALL --> COST
    SMALL --> QUALITY
    
    style LARGE fill:#ffebee
    style SMALL fill:#e8f5e8
    style SPEED fill:#e0f2f1
    style COST fill:#e0f2f1
    style QUALITY fill:#e0f2f1
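
Conceptually, column pruning is a single LLM call that maps (question, full schema context) to a subset of columns. The sketch below shows the shape of that step; the call_llm parameter is a hypothetical stand-in for whatever LLM provider is configured, and the prompt wording and JSON contract are illustrative only.

import json
from typing import Callable

def prune_columns(
    question: str,
    schema_docs: list[str],
    call_llm: Callable[[str], str],
) -> list[str]:
    """Ask an LLM which columns are actually needed to answer the question."""
    prompt = (
        "Given the question and the database schema below, return a JSON array "
        "of the fully-qualified column names needed to answer it.\n\n"
        f"Question: {question}\n\nSchema:\n" + "\n".join(schema_docs)
    )
    return json.loads(call_llm(prompt))

# Stubbed LLM response for illustration; a real deployment calls the configured provider.
fake_llm = lambda _prompt: '["customers.customer_id", "customers.customer_city"]'
print(prune_columns("Show me customer information",
                    ["CREATE TABLE customers (customer_id VARCHAR, customer_city VARCHAR)"],
                    fake_llm))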

Document Metadata Structure

11. Document Metadata Schema

classDiagram
    class Document {
        +String id
        +Dict meta
        +String content
    }
    
    class DocumentMeta {
        +String type
        +String name
        +String project_id
        +Dict additional_fields
    }
    
    class TableDocument {
        +type: "TABLE_SCHEMA"
        +name: table_name
    }
    
    class ColumnDocument {
        +type: "TABLE_SCHEMA"
        +name: table_name
        +column_info: embedded_in_content
    }
    
    class ViewDocument {
        +type: "TABLE_SCHEMA"
        +name: view_name
        +statement: embedded_in_content
    }
    
    class MetricDocument {
        +type: "TABLE_SCHEMA"
        +name: metric_name
        +base_object: embedded_in_content
    }
    
    Document --> DocumentMeta
    DocumentMeta <|-- TableDocument
    DocumentMeta <|-- ColumnDocument
    DocumentMeta <|-- ViewDocument
    DocumentMeta <|-- MetricDocument
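
In code, each stored document boils down to a content string plus a small metadata dict. A plain-Python approximation of the class diagram above (field names follow the diagram; the concrete values are illustrative):

from dataclasses import dataclass, field

@dataclass
class SchemaDocument:
    """Simplified stand-in for the documents written to the vector store."""
    id: str
    content: str                      # serialized TABLE / TABLE_COLUMNS / VIEW / METRIC payload
    meta: dict = field(default_factory=dict)

doc = SchemaDocument(
    id="customers-table",
    content='{"type": "TABLE", "name": "customers", "comment": "/* Customer data */"}',
    meta={"type": "TABLE_SCHEMA", "name": "customers", "project_id": "p1"},
)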

Content Structure Patterns

12. Document Content Patterns

graph TB
    subgraph "Content Structure Patterns"
        subgraph "TABLE Pattern"
            T_TYPE[type: 'TABLE']
            T_NAME[name: table_name]
            T_COMMENT[comment: metadata]
        end
        
        subgraph "TABLE_COLUMNS Pattern"
            C_TYPE[type: 'TABLE_COLUMNS']
            C_COLS[columns: array]
            
            subgraph "Column Object"
                COL_TYPE[type: 'COLUMN'/'FOREIGN_KEY']
                COL_NAME[name: column_name]
                COL_DATA[data_type: sql_type]
                COL_COMMENT[comment: semantic_info]
                COL_PK[is_primary_key: boolean]
            end
        end
        
        subgraph "VIEW Pattern"
            V_TYPE[type: 'VIEW']
            V_NAME[name: view_name]
            V_STMT[statement: sql_query]
            V_COMMENT[comment: description]
        end
        
        subgraph "METRIC Pattern"
            M_TYPE[type: 'METRIC']
            M_NAME[name: metric_name]
            M_BASE[base_object: source_table]
            M_COLS[columns: dimensions_measures]
            M_COMMENT[comment: metric_info]
        end
    end
    
    C_COLS --> COL_TYPE
    C_COLS --> COL_NAME
    C_COLS --> COL_DATA
    C_COLS --> COL_COMMENT
    C_COLS --> COL_PK
    
    style T_TYPE fill:#fff3e0
    style C_TYPE fill:#f3e5f5
    style V_TYPE fill:#e0f2f1
    style M_TYPE fill:#fce4ec

Error Handling and Validation

13. Validation and Error Handling Flow

flowchart TD
    A[MDL Input] --> B[MDL Validator]
    B --> C{Valid JSON?}
    C -->|No| D[JSON Error]
    C -->|Yes| E[Structure Validation]
    
    E --> F{Required Fields?}
    F -->|Missing| G[Add Default Fields]
    F -->|Present| H[DDL Chunker]
    
    G --> H
    H --> I[Model Preprocessor]
    I --> J{Processing Error?}
    J -->|Yes| K[Log Error & Skip]
    J -->|No| L[Generate Documents]
    
    K --> M[Partial Success]
    L --> N[Embedding Process]
    N --> O{Embedding Error?}
    O -->|Yes| P[Retry with Backoff]
    O -->|No| Q[Store in Vector DB]
    
    P --> Q
    Q --> R[Success]
    
    style D fill:#ffebee
    style K fill:#fff3e0
    style M fill:#fff3e0
    style R fill:#e8f5e8
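
A minimal validation pass along the lines of this flow might look like the following sketch. The set of required top-level keys mirrors the MDL examples in this document, but the defaults and the function itself are assumptions for illustration.

import json

TOP_LEVEL_DEFAULTS = {
    "models": [],
    "relationships": [],
    "views": [],
    "metrics": [],
}

def validate_mdl(raw: str) -> dict:
    """Parse MDL JSON and fill in missing top-level sections with empty defaults."""
    try:
        mdl = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Invalid MDL JSON: {exc}") from exc
    for key, default in TOP_LEVEL_DEFAULTS.items():
        mdl.setdefault(key, default)
    return mdl

mdl = validate_mdl('{"catalog": "test_catalog", "schema": "test_schema", "models": []}')
print(sorted(mdl.keys()))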

Performance Metrics and Monitoring

14. Pipeline Performance Monitoring

graph TB
    subgraph "Input Metrics"
        MDL_SIZE[MDL File Size]
        MODEL_COUNT[Model Count]
        COLUMN_COUNT[Column Count]
        REL_COUNT[Relationship Count]
    end
    
    subgraph "Processing Metrics"
        PROC_TIME[Processing Time]
        DOC_COUNT[Documents Generated]
        BATCH_COUNT[Batches Created]
        ERROR_RATE[Error Rate]
    end
    
    subgraph "Output Metrics"
        EMBED_TIME[Embedding Time]
        STORAGE_TIME[Storage Time]
        VECTOR_COUNT[Vectors Stored]
        SUCCESS_RATE[Success Rate]
    end
    
    subgraph "Quality Metrics"
        COVERAGE[Schema Coverage]
        SEARCHABILITY[Search Quality]
        RETRIEVAL_ACC[Retrieval Accuracy]
        LATENCY[Query Latency]
    end
    
    MDL_SIZE --> PROC_TIME
    MODEL_COUNT --> DOC_COUNT
    COLUMN_COUNT --> BATCH_COUNT
    
    PROC_TIME --> EMBED_TIME
    DOC_COUNT --> VECTOR_COUNT
    BATCH_COUNT --> STORAGE_TIME
    
    VECTOR_COUNT --> COVERAGE
    SUCCESS_RATE --> SEARCHABILITY
    ERROR_RATE --> RETRIEVAL_ACC
    
    style PROC_TIME fill:#e3f2fd
    style EMBED_TIME fill:#e8f5e8
    style COVERAGE fill:#e0f2f1
    style RETRIEVAL_ACC fill:#fce4ec

Conclusion

The MDL to Vector Documents transformation in WrenAI is a sophisticated process that:

  1. Preserves Semantic Information: Rich metadata and comments maintain the business context
  2. Enables Efficient Search: Structured document types allow for targeted retrieval
  3. Supports Complex Data Types: Handles nested columns, calculated fields, and relationships
  4. Optimizes Performance: Column batching and pruning reduce processing overhead
  5. Maintains Data Integrity: Comprehensive validation and error handling ensure reliability

This transformation enables WrenAI to provide intelligent, context-aware SQL generation by making database schemas semantically searchable through vector embeddings while preserving the rich structural and business information contained in the original MDL files.

References

  • Source Code: wren-ai-service/src/pipelines/indexing/db_schema.py
  • Tests: wren-ai-service/tests/pytest/pipelines/indexing/test_db_schema.py
  • Helper System: wren-ai-service/src/pipelines/indexing/utils/helper.py
  • Pipeline Documentation: WrenAI_Pipelines_Documentation.md


/v1/asks Handler Logic - Complete Flow

Based on my analysis of the WrenAI codebase, here's the complete handler logic for the /v1/asks endpoint:

🎯 Endpoint Overview

  • Router: wren-ai-service/src/web/v1/routers/ask.py
  • Service: wren-ai-service/src/web/v1/services/ask.py
  • Method: POST /v1/asks

πŸ“ Request/Response Models

# Request Model
class AskRequest(BaseRequest):
    query: str                                    # Natural language query
    mdl_hash: Optional[str]                      # Model hash/ID (legacy: "id")
    histories: Optional[list[AskHistory]]        # Query history (max 5)
    ignore_sql_generation_reasoning: bool = False
    enable_column_pruning: bool = False

# Response Model  
class AskResponse(BaseModel):
    query_id: str                                # UUID for tracking

# Result Model (via GET /v1/asks/{query_id}/result)
class AskResultResponse(BaseModel):
    status: Literal["understanding", "searching", "planning", 
                   "generating", "correcting", "finished", "failed", "stopped"]
    type: Optional[Literal["GENERAL", "TEXT_TO_SQL"]]
    response: Optional[List[AskResult]]          # Generated SQL results
    error: Optional[AskError]                    # Error details if failed
    # ... additional fields

πŸ”„ Handler Flow Diagram

sequenceDiagram
    participant Client as Client
    participant Router as FastAPI Router
    participant Service as Ask Service
    participant Pipelines as AI Pipelines
    participant Cache as TTL Cache
    participant LLM as LLM Provider
    
    Client->>Router: POST /v1/asks
    Router->>Router: Generate UUID
    Router->>Cache: Store initial status="understanding"
    Router->>Service: Background task: ask()
    Router-->>Client: Return query_id
    
    Note over Service: Async Processing Begins
    
    Service->>Pipelines: 1. Historical Question Retrieval
    alt Historical Result Found
        Service->>Cache: Update status="finished"
        Service-->>Client: Return cached result
    else No Historical Result
        Service->>Pipelines: 2. SQL Pairs + Instructions Retrieval
        Service->>Pipelines: 3. Intent Classification
        
        alt Intent = MISLEADING_QUERY
            Service->>Pipelines: Misleading Assistance
            Service->>Cache: Update status="finished", type="GENERAL"
        else Intent = GENERAL  
            Service->>Pipelines: Data Assistance
            Service->>Cache: Update status="finished", type="GENERAL"
        else Intent = TEXT_TO_SQL
            Service->>Cache: Update status="searching"
            Service->>Pipelines: 4. DB Schema Retrieval
            
            alt No Relevant Data
                Service->>Cache: Update status="failed", error="NO_RELEVANT_DATA"
            else Data Found
                Service->>Cache: Update status="planning"
                Service->>Pipelines: 5. SQL Generation Reasoning
                Service->>Cache: Update status="generating"
                Service->>Pipelines: 6. SQL Generation
                
                alt Valid SQL Generated
                    Service->>Cache: Update status="finished", response=SQL
                else Invalid SQL
                    Service->>Cache: Update status="correcting"
                    Service->>Pipelines: 7. SQL Correction (retry up to 3x)
                    
                    alt Correction Successful
                        Service->>Cache: Update status="finished", response=SQL
                    else Correction Failed
                        Service->>Cache: Update status="failed", error="NO_RELEVANT_SQL"
                    end
                end
            end
        end
    end

πŸ—οΈ Detailed Handler Logic

1. Router Handler (ask.py:81-101)

@router.post("/asks")
async def ask(
    ask_request: AskRequest,
    background_tasks: BackgroundTasks,
    service_container: ServiceContainer = Depends(get_service_container),
    service_metadata: ServiceMetadata = Depends(get_service_metadata),
) -> AskResponse:
    # Generate unique query ID
    query_id = str(uuid.uuid4())
    ask_request.query_id = query_id
    
    # Initialize result cache with "understanding" status
    service_container.ask_service._ask_results[query_id] = AskResultResponse(
        status="understanding",
    )

    # Start async processing in background
    background_tasks.add_task(
        service_container.ask_service.ask,
        ask_request,
        service_metadata=asdict(service_metadata),
    )
    
    # Return immediately with query_id for tracking
    return AskResponse(query_id=query_id)

2. Service Processing Pipeline (ask.py:170-628)

The main ask() method processes through multiple stages:

Stage 1: Understanding (Historical Question Check)
# Status: "understanding"
historical_question = await self._pipelines["historical_question"].run(
    query=user_query,
    project_id=ask_request.project_id,
)

# If exact match found, return immediately
if historical_question_result:
    api_results = [AskResult(sql=result.get("statement"), ...)]
    # Skip to finished

Stage 2: Intent Classification
# Retrieve SQL samples and instructions concurrently
sql_samples_task, instructions_task = await asyncio.gather(
    self._pipelines["sql_pairs_retrieval"].run(...),
    self._pipelines["instructions_retrieval"].run(...),
)

# Classify user intent
intent_classification_result = await self._pipelines["intent_classification"].run(...)
intent = intent_classification_result.get("intent")

# Handle different intents
if intent == "MISLEADING_QUERY":
    # Provide guidance for unclear queries
    asyncio.create_task(self._pipelines["misleading_assistance"].run(...))
    return # status="finished", type="GENERAL"
    
elif intent == "GENERAL":
    # Provide data insights without SQL
    asyncio.create_task(self._pipelines["data_assistance"].run(...))
    return # status="finished", type="GENERAL"
    
elif intent == "USER_GUIDE":
    # Provide user guidance
    asyncio.create_task(self._pipelines["user_guide_assistance"].run(...))
    return # status="finished", type="GENERAL"

Stage 3: Schema Retrieval
# Status: "searching"
retrieval_result = await self._pipelines["db_schema_retrieval"].run(
    query=user_query,
    histories=histories,
    project_id=ask_request.project_id,
    enable_column_pruning=enable_column_pruning,
)

documents = retrieval_result.get("construct_retrieval_results", {}).get("retrieval_results", [])

if not documents:
    # No relevant schema found
    return # status="failed", error="NO_RELEVANT_DATA"

Stage 4: SQL Generation Planning
# Status: "planning" (if reasoning enabled)
if histories:
    sql_generation_reasoning = await self._pipelines["followup_sql_generation_reasoning"].run(...)
else:
    sql_generation_reasoning = await self._pipelines["sql_generation_reasoning"].run(...)

Stage 5: SQL Generation
# Status: "generating"
if histories:
    text_to_sql_generation_results = await self._pipelines["followup_sql_generation"].run(...)
else:
    text_to_sql_generation_results = await self._pipelines["sql_generation"].run(...)

# Check if valid SQL was generated
if sql_valid_result := text_to_sql_generation_results["post_process"]["valid_generation_result"]:
    api_results = [AskResult(sql=sql_valid_result.get("sql"), type="llm")]

Stage 6: SQL Correction (if needed)
# Status: "correcting" 
elif failed_dry_run_result := text_to_sql_generation_results["post_process"]["invalid_generation_result"]:
    # Retry up to max_sql_correction_retries (default: 3)
    while current_sql_correction_retries < max_sql_correction_retries:
        sql_correction_results = await self._pipelines["sql_correction"].run(...)
        
        if valid_generation_result := sql_correction_results["post_process"]["valid_generation_result"]:
            api_results = [AskResult(sql=valid_generation_result.get("sql"), type="llm")]
            break

Stage 7: Completion
if api_results:
    # Status: "finished"
    self._ask_results[query_id] = AskResultResponse(
        status="finished",
        type="TEXT_TO_SQL", 
        response=api_results,
        ...
    )
else:
    # Status: "failed"
    self._ask_results[query_id] = AskResultResponse(
        status="failed",
        error=AskError(code="NO_RELEVANT_SQL", message="No relevant SQL"),
        ...
    )

πŸ”§ Key Features

Asynchronous Processing

  • Returns query_id immediately
  • Processes in background using FastAPI BackgroundTasks
  • Client polls /v1/asks/{query_id}/result for status updates

Status Progression

understanding β†’ searching β†’ planning β†’ generating β†’ correcting β†’ finished/failed
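
From the client's point of view, this flow reduces to one POST plus polling. A minimal sketch with the requests library, assuming the AI service is reachable on its default port 5555 (the endpoint paths and field names follow the models shown above):

import time
import requests

BASE_URL = "http://localhost:5555"   # assumed local deployment of the AI service

# 1. Submit the question; the service returns immediately with a query_id.
resp = requests.post(f"{BASE_URL}/v1/asks",
                     json={"query": "What is the monthly revenue trend?"})
query_id = resp.json()["query_id"]

# 2. Poll the result endpoint until a terminal status is reached.
while True:
    result = requests.get(f"{BASE_URL}/v1/asks/{query_id}/result").json()
    status = result["status"]
    print("status:", status)
    if status in ("finished", "failed", "stopped"):
        break
    time.sleep(1)

if status == "finished" and result.get("type") == "TEXT_TO_SQL":
    print(result["response"])        # list of generated SQL candidates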

Error Handling

  • NO_RELEVANT_DATA: No schema matches found
  • NO_RELEVANT_SQL: SQL generation failed
  • OTHERS: Unexpected errors

Caching Strategy

  • Uses TTL Cache (default: 120 seconds, 1M entries)
  • Stores intermediate results for status polling
  • Automatic cleanup of expired queries
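
The caching behaviour described above corresponds to a TTL-bounded in-memory map. A minimal sketch with the cachetools library, using the defaults listed above (120-second TTL, 1M max entries); the key and value shapes are simplified:

from cachetools import TTLCache

# Entries expire 120 seconds after insertion; the oldest are evicted past 1M items.
ask_results = TTLCache(maxsize=1_000_000, ttl=120)

ask_results["query-id-123"] = {"status": "understanding"}
ask_results["query-id-123"] = {"status": "finished", "type": "TEXT_TO_SQL"}

# Within the TTL the latest status is returned; after 120 seconds the entry is
# evicted and a poll would see a cache miss instead.
print(ask_results.get("query-id-123"))   # {'status': 'finished', 'type': 'TEXT_TO_SQL'}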

Pipeline Integration

  • Historical Question: Check for exact matches
  • Intent Classification: Determine query type
  • DB Schema Retrieval: Find relevant tables/columns
  • SQL Generation: Create SQL from natural language
  • SQL Correction: Fix invalid SQL with retries

Configuration Options

  • enable_column_pruning: Optimize retrieved schema
  • ignore_sql_generation_reasoning: Skip reasoning step
  • max_histories: Limit conversation context (default: 5)
  • max_sql_correction_retries: SQL fix attempts (default: 3)

πŸ“Š Performance Optimizations

  1. Concurrent Operations: SQL samples and instructions retrieved in parallel
  2. Early Returns: Historical matches bypass full pipeline
  3. Column Pruning: Reduces schema context size
  4. Retry Logic: Smart SQL correction with timeout handling
  5. Caching: TTL cache prevents duplicate processing

This handler demonstrates a sophisticated AI pipeline that transforms natural language into SQL through multiple intelligent processing stages, with robust error handling and performance optimizations.
