WrenAI Architecture Documentation

Overview

WrenAI is an AI-powered data modeling and query generation platform that turns natural language questions into SQL queries. The system is made up of several interconnected services that together provide intelligent data analysis.

High-Level Architecture

graph TB
    subgraph "User Interface Layer"
        UI[Wren UI<br/>Next.js Frontend<br/>Port: 3000]
        CLI[Wren Launcher<br/>CLI Tool<br/>Go Application]
    end
    
    subgraph "API Gateway Layer"
        GQL[GraphQL API<br/>Apollo Server]
        REST[REST API<br/>FastAPI]
    end
    
    subgraph "Core Services"
        AI[Wren AI Service<br/>Python FastAPI<br/>Port: 5555]
        ENGINE[Wren Engine<br/>Java/Rust<br/>Port: 8080]
        IBIS[Ibis Server<br/>Python<br/>Port: 8000]
    end
    
    subgraph "Data Layer"
        QDRANT[Qdrant Vector DB<br/>Port: 6333]
        SQLITE[SQLite Database<br/>Metadata Storage]
        CACHE[In-Memory Cache<br/>Query Results]
    end
    
    subgraph "External Services"
        LLM[LLM Providers<br/>OpenAI, Custom]
        DATASOURCES[Data Sources<br/>BigQuery, PostgreSQL<br/>MySQL, Snowflake, etc.]
    end
    
    CLI --> UI
    UI --> GQL
    GQL --> AI
    GQL --> ENGINE
    GQL --> IBIS
    
    AI --> REST
    AI --> QDRANT
    AI --> LLM
    
    ENGINE --> DATASOURCES
    IBIS --> DATASOURCES
    
    UI --> SQLITE
    AI --> CACHE
    
    style UI fill:#e3f2fd
    style AI fill:#fff3e0
    style ENGINE fill:#f3e5f5
    style QDRANT fill:#e8f5e8
    style LLM fill:#fce4ec

Component Architecture

1. User Interface Layer

Wren UI (Frontend)

  • Technology: Next.js 14, React, TypeScript, Ant Design
  • Port: 3000
  • Responsibilities:
    • Data source connection and configuration
    • Visual data modeling interface
    • Natural language query input
    • Result visualization and charts
    • User authentication and settings

Key Components:

// Main App Structure
src/pages/_app.tsx          // Application root
src/pages/home/index.tsx    // Query interface
src/pages/setup/            // Data source setup
src/pages/modeling/         // Data modeling tools
src/components/diagram/     // Visual modeling components

Wren Launcher (CLI)

  • Technology: Go
  • Purpose: System orchestration and setup
  • Responsibilities:
    • Interactive system setup
    • Docker container management
    • Configuration generation
    • Service health monitoring

2. Core Services Architecture

Wren AI Service

  • Technology: Python, FastAPI, Hamilton pipelines
  • Port: 5555
  • Architecture Pattern: Pipeline-based processing
graph LR
    subgraph "AI Service Internal Architecture"
        ROUTER[FastAPI Router] --> PIPELINE[Pipeline Manager]
        PIPELINE --> INDEXING[Indexing Pipelines]
        PIPELINE --> RETRIEVAL[Retrieval Pipelines]
        PIPELINE --> GENERATION[Generation Pipelines]
        
        INDEXING --> DBSCHEMA[DB Schema Indexing]
        INDEXING --> HISTORICAL[Historical Q&A]
        
        RETRIEVAL --> DBRETRIEVAL[DB Schema Retrieval]
        RETRIEVAL --> QARETRIEVAL[Q&A Retrieval]
        
        GENERATION --> SQLGEN[SQL Generation]
        GENERATION --> FOLLOWUP[Follow-up Generation]
        GENERATION --> ADJUSTMENT[Chart Adjustment]
    end
    
    style DBSCHEMA fill:#e8f5e8
    style SQLGEN fill:#fff3e0
    style RETRIEVAL fill:#f3e5f5

Core Pipelines:

  1. Indexing Pipelines (src/pipelines/indexing/):

    • db_schema.py - Database schema vectorization
    • historical_question.py - Q&A pair indexing
  2. Retrieval Pipelines (src/pipelines/retrieval/):

    • db_schema_retrieval.py - Schema context retrieval
    • historical_question_retrieval.py - Similar question retrieval
  3. Generation Pipelines (src/pipelines/generation/):

    • sql_generation.py - SQL query generation
    • followup_generation.py - Follow-up question generation
    • chart_generation.py - Visualization generation

Wren Engine

  • Technology: Java (Traditional) / Rust (Experimental)
  • Port: 8080
  • Purpose: SQL execution and optimization
  • Responsibilities:
    • SQL query execution
    • Query optimization
    • Result caching
    • Performance monitoring

Ibis Server

  • Technology: Python, Ibis framework
  • Port: 8000
  • Purpose: Data source abstraction
  • Responsibilities:
    • Multi-database connectivity
    • Schema introspection
    • Query translation
    • Metadata extraction

Data Flow Architecture

3. Query Processing Flow

sequenceDiagram
    participant User as User
    participant UI as Wren UI
    participant AI as AI Service
    participant Engine as Wren Engine
    participant Vector as Qdrant
    participant LLM as LLM Provider
    participant DS as Data Source
    
    User->>UI: Natural language query
    UI->>AI: POST /v1/asks
    
    Note over AI: Query Processing Pipeline
    AI->>Vector: Retrieve schema context
    Vector-->>AI: Relevant schema documents
    
    AI->>Vector: Retrieve similar questions
    Vector-->>AI: Historical Q&A pairs
    
    AI->>LLM: Generate SQL with context
    LLM-->>AI: SQL query + reasoning
    
    AI->>Engine: Validate SQL
    Engine-->>AI: Validation result
    
    AI-->>UI: SQL + explanation
    UI->>Engine: Execute SQL
    Engine->>DS: Query execution
    DS-->>Engine: Results
    Engine-->>UI: Formatted results
    UI-->>User: Visualization + data

4. Data Modeling Flow

sequenceDiagram
    participant User as User
    participant UI as Wren UI
    participant Ibis as Ibis Server
    participant AI as AI Service
    participant Vector as Qdrant
    participant DS as Data Source
    
    User->>UI: Connect data source
    UI->>Ibis: GET /v1/ibis/metadata
    Ibis->>DS: Schema introspection
    DS-->>Ibis: Schema metadata
    Ibis-->>UI: Structured schema
    
    UI->>UI: Visual modeling
    User->>UI: Define relationships
    UI->>AI: POST /v1/semantics-preparations
    
    Note over AI: MDL Processing
    AI->>AI: Parse MDL schema
    AI->>AI: Generate documents
    AI->>Vector: Store embeddings
    Vector-->>AI: Indexing complete
    
    AI-->>UI: Deployment success

Data Storage Architecture

5. Storage Components

Vector Database (Qdrant)

  • Purpose: Semantic search and embeddings storage
  • Port: 6333
  • Data Types:
    • Schema embeddings
    • Historical question embeddings
    • Metadata for filtering

Document Types:

# Schema Documents
{
    "type": "TABLE_SCHEMA",
    "name": "table_name",
    "content": "CREATE TABLE...",
    "meta": {
        "project_id": "uuid",
        "type": "TABLE"
    }
}

# Q&A Documents  
{
    "type": "QA_PAIR",
    "question": "natural language",
    "sql": "SELECT...",
    "meta": {
        "project_id": "uuid"
    }
}

SQLite Database

  • Purpose: Application metadata and configuration
  • Tables:
    • Projects and data sources
    • Models and relationships
    • User queries and threads
    • Dashboard configurations

In-Memory Cache

  • Technology: Redis-compatible caching
  • Purpose: Query result caching and session management

Service Communication Architecture

6. Inter-Service Communication

graph TB
    subgraph "Communication Patterns"
        subgraph "Synchronous"
            HTTP[HTTP/REST APIs]
            GQL[GraphQL Queries]
            GRPC[gRPC Calls]
        end
        
        subgraph "Asynchronous"
            QUEUE[Task Queues]
            EVENTS[Event Streaming]
            WEBHOOK[Webhooks]
        end
    end
    
    subgraph "Service Mesh"
        UI_SVC[UI Service<br/>:3000]
        AI_SVC[AI Service<br/>:5555]
        ENGINE_SVC[Engine Service<br/>:8080]
        IBIS_SVC[Ibis Service<br/>:8000]
        VECTOR_SVC[Vector Service<br/>:6333]
    end
    
    UI_SVC -->|GraphQL| AI_SVC
    UI_SVC -->|REST| ENGINE_SVC
    UI_SVC -->|REST| IBIS_SVC
    AI_SVC -->|HTTP| VECTOR_SVC
    AI_SVC -->|REST| ENGINE_SVC
    
    style UI_SVC fill:#e3f2fd
    style AI_SVC fill:#fff3e0
    style ENGINE_SVC fill:#f3e5f5
    style VECTOR_SVC fill:#e8f5e8

7. API Architecture

GraphQL API (UI ↔ Services)

Endpoint: /api/graphql

type Query {
    # Data source management
    dataSources: [DataSource!]!
    
    # Model management  
    models: [Model!]!
    relationships: [Relationship!]!
    
    # Query interface
    threads: [Thread!]!
    askingTask(taskId: ID!): AskingTask
}

type Mutation {
    # Data modeling
    createModel(data: CreateModelInput!): Model!
    updateModel(where: ModelWhereUniqueInput!, data: UpdateModelInput!): Model!
    
    # Query execution
    createThread(data: CreateThreadInput!): Thread!
    createAsk(data: CreateAskInput!): AskingTask!
}
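
As an illustration of how a client could call this endpoint, here is a hedged Python sketch using requests against /api/graphql on the UI's port 3000. Only the askingTask operation and its taskId argument come from the schema above; the selected status field and the task ID value are guesses for illustration.

import requests

ASKING_TASK_QUERY = """
query AskingTask($taskId: ID!) {
  askingTask(taskId: $taskId) {
    status   # illustrative field; check the real schema for the available selection set
  }
}
"""

response = requests.post(
    "http://localhost:3000/api/graphql",
    json={"query": ASKING_TASK_QUERY, "variables": {"taskId": "1"}},
    timeout=30,
)
print(response.json())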

REST API (AI Service)

Base URL: http://localhost:5555/v1

# Key endpoints
POST /asks                     # Natural language queries
POST /semantics-preparations   # Schema deployment
GET  /asks/{id}               # Query status
POST /sql-explanations        # SQL explanations
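
A hedged usage sketch for the ask workflow above, written with Python requests. The endpoint paths are the ones listed; the request and response field names (query, id, status) are illustrative assumptions rather than the documented payload schema.

import time
import requests

BASE_URL = "http://localhost:5555/v1"

# Submit a natural-language question (body field names are assumed).
task = requests.post(f"{BASE_URL}/asks", json={"query": "What is the average book price?"}).json()
task_id = task.get("id")  # assumed response field

# Poll the task until it reaches a terminal state (status values are assumed).
while True:
    status = requests.get(f"{BASE_URL}/asks/{task_id}").json()
    if status.get("status") in {"finished", "failed"}:
        break
    time.sleep(1)

print(status)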

Deployment Architecture

8. Container Architecture

graph TB
    subgraph "Docker Compose Stack"
        subgraph "Application Containers"
            UI_CONTAINER[wren-ui<br/>Node.js Container]
            AI_CONTAINER[wren-ai-service<br/>Python Container]
            ENGINE_CONTAINER[wren-engine<br/>Java/Rust Container]
            IBIS_CONTAINER[ibis-server<br/>Python Container]
        end
        
        subgraph "Data Containers"
            QDRANT_CONTAINER[qdrant<br/>Vector DB Container]
            BOOTSTRAP_CONTAINER[bootstrap<br/>Init Container]
        end
        
        subgraph "Volumes"
            DATA_VOLUME[data<br/>Persistent Volume]
            CONFIG_VOLUME[config<br/>Configuration Volume]
        end
    end
    
    subgraph "External"
        DATASOURCES_EXT[External Data Sources]
        LLM_EXT[LLM Providers]
    end
    
    UI_CONTAINER --> AI_CONTAINER
    AI_CONTAINER --> ENGINE_CONTAINER
    AI_CONTAINER --> QDRANT_CONTAINER
    ENGINE_CONTAINER --> DATASOURCES_EXT
    AI_CONTAINER --> LLM_EXT
    
    DATA_VOLUME --> QDRANT_CONTAINER
    DATA_VOLUME --> UI_CONTAINER
    CONFIG_VOLUME --> AI_CONTAINER
    
    style UI_CONTAINER fill:#e3f2fd
    style AI_CONTAINER fill:#fff3e0
    style QDRANT_CONTAINER fill:#e8f5e8

9. Configuration Management

Environment Variables

# Core service ports
WREN_UI_PORT: 3000
WREN_AI_SERVICE_PORT: 5555
WREN_ENGINE_PORT: 8080
IBIS_SERVER_PORT: 8000
QDRANT_PORT: 6333

# Service endpoints
WREN_AI_ENDPOINT: http://wren-ai-service:5555
WREN_ENGINE_ENDPOINT: http://wren-engine:8080
IBIS_SERVER_ENDPOINT: http://ibis-server:8000
QDRANT_HOST: wren-qdrant

# AI Configuration
OPENAI_API_KEY: sk-...
GENERATION_MODEL: gpt-4o-mini
EMBEDDING_MODEL: text-embedding-3-small

# Database
DB_TYPE: sqlite
SQLITE_FILE: /app/data/db.sqlite3
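
A minimal sketch of how a service process might pick up these settings, assuming the variable names above and falling back to the documented docker-compose service names and ports:

import os

WREN_AI_ENDPOINT = os.getenv("WREN_AI_ENDPOINT", "http://wren-ai-service:5555")
WREN_ENGINE_ENDPOINT = os.getenv("WREN_ENGINE_ENDPOINT", "http://wren-engine:8080")
IBIS_SERVER_ENDPOINT = os.getenv("IBIS_SERVER_ENDPOINT", "http://ibis-server:8000")
QDRANT_HOST = os.getenv("QDRANT_HOST", "wren-qdrant")
GENERATION_MODEL = os.getenv("GENERATION_MODEL", "gpt-4o-mini")

# The OpenAI key has no sensible default; fail fast if it is missing.
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY must be set for the default OpenAI provider")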

Configuration Files

  • docker/docker-compose.yaml - Service orchestration
  • docker/.env.example - Environment template
  • wren-ai-service/config.yaml - AI service configuration
  • wren-ui/next.config.js - UI build configuration

Security Architecture

10. Security Components

graph TB
    subgraph "Security Layers"
        subgraph "Network Security"
            NETWORK[Internal Network<br/>Docker Bridge]
            FIREWALL[Port Restrictions]
            TLS[TLS/SSL Encryption]
        end
        
        subgraph "Authentication"
            API_KEY[API Key Management]
            SESSION[Session Management]
            CORS[CORS Configuration]
        end
        
        subgraph "Data Security"
            ENCRYPTION[Data Encryption]
            SANITIZATION[Input Sanitization]
            VALIDATION[Schema Validation]
        end
    end
    
    subgraph "External Security"
        LLM_AUTH[LLM Provider Auth]
        DB_AUTH[Database Authentication]
        SECRETS[Secret Management]
    end
    
    NETWORK --> API_KEY
    API_KEY --> ENCRYPTION
    ENCRYPTION --> LLM_AUTH
    ENCRYPTION --> DB_AUTH
    
    style NETWORK fill:#e3f2fd
    style ENCRYPTION fill:#e8f5e8
    style LLM_AUTH fill:#fce4ec

Security Features

  • Network Isolation: Services communicate through internal Docker network
  • API Security: Request validation and rate limiting
  • Data Protection: Sensitive data encryption at rest and in transit
  • Secret Management: Environment-based secret configuration
  • Input Validation: SQL injection prevention and input sanitization

Performance Architecture

11. Performance Optimization

Caching Strategy

graph LR
    subgraph "Multi-Level Caching"
        L1[L1: Browser Cache<br/>Static Assets]
        L2[L2: Application Cache<br/>Query Results]
        L3[L3: Database Cache<br/>Metadata]
        L4[L4: Vector Cache<br/>Embeddings]
    end
    
    REQUEST[User Request] --> L1
    L1 --> L2
    L2 --> L3
    L3 --> L4
    L4 --> DATABASE[Database Query]
    
    style L1 fill:#e3f2fd
    style L2 fill:#fff3e0
    style L3 fill:#f3e5f5
    style L4 fill:#e8f5e8

Scalability Features

  • Horizontal Scaling: Stateless service design
  • Load Balancing: Container orchestration support
  • Resource Optimization: Efficient memory and CPU usage
  • Query Optimization: Intelligent SQL optimization
  • Vector Search: Optimized semantic search with Qdrant

12. Monitoring and Observability

Telemetry Architecture

graph TB
    subgraph "Observability Stack"
        METRICS[Metrics Collection<br/>Performance Data]
        LOGS[Centralized Logging<br/>Application Logs]
        TRACES[Distributed Tracing<br/>Request Flows]
        HEALTH[Health Checks<br/>Service Status]
    end
    
    subgraph "Analytics"
        POSTHOG[PostHog Analytics<br/>User Behavior]
        LANGFUSE[Langfuse<br/>LLM Observability]
        CUSTOM[Custom Metrics<br/>Business Logic]
    end
    
    METRICS --> POSTHOG
    LOGS --> LANGFUSE
    TRACES --> CUSTOM
    
    style METRICS fill:#e3f2fd
    style POSTHOG fill:#fff3e0
    style LANGFUSE fill:#e8f5e8

Integration Architecture

13. External Integrations

Data Source Connectors

  • Supported Databases:
    • Google BigQuery
    • PostgreSQL / MySQL
    • Snowflake
    • ClickHouse
    • SQL Server / Oracle
    • Amazon Athena / Redshift
    • DuckDB

LLM Provider Integration

  • OpenAI: GPT-4, GPT-3.5, Embeddings
  • Custom Providers: Configurable LLM endpoints
  • Embedding Models: Text-embedding-3-small/large

Third-Party Services

  • Vector Database: Qdrant for semantic search
  • Analytics: PostHog for user behavior tracking
  • Observability: Langfuse for LLM monitoring

Development Architecture

14. Development Workflow

graph LR
    subgraph "Development Environment"
        DEV_UI[UI Development<br/>yarn dev]
        DEV_AI[AI Service<br/>python -m src.__main__]
        DEV_ENGINE[Engine Service<br/>Docker]
        DEV_VECTOR[Vector DB<br/>Docker]
    end
    
    subgraph "Testing"
        UNIT[Unit Tests<br/>pytest, jest]
        INTEGRATION[Integration Tests<br/>API Testing]
        E2E[E2E Tests<br/>Playwright]
    end
    
    subgraph "Deployment"
        BUILD[Container Build<br/>Docker Images]
        DEPLOY[Deployment<br/>Docker Compose]
        MONITOR[Monitoring<br/>Health Checks]
    end
    
    DEV_UI --> UNIT
    DEV_AI --> INTEGRATION
    UNIT --> E2E
    E2E --> BUILD
    BUILD --> DEPLOY
    DEPLOY --> MONITOR
    
    style DEV_UI fill:#e3f2fd
    style BUILD fill:#e8f5e8
    style DEPLOY fill:#fff3e0

Technology Stack Summary

| Component | Technology | Purpose |
| --- | --- | --- |
| Frontend | Next.js 14, React, TypeScript | User interface |
| AI Service | Python, FastAPI, Hamilton | AI/ML processing |
| Engine | Java/Rust | SQL execution |
| Vector DB | Qdrant | Semantic search |
| Database | SQLite/PostgreSQL | Metadata storage |
| Orchestration | Docker Compose | Service management |
| API | GraphQL, REST | Service communication |
| Authentication | API Keys, Sessions | Security |
| Monitoring | PostHog, Langfuse | Observability |

Conclusion

WrenAI's architecture is designed for:

  1. Scalability: Microservices architecture with container orchestration
  2. Extensibility: Plugin-based data source and LLM provider support
  3. Performance: Multi-level caching and optimized query processing
  4. Security: Network isolation and data protection
  5. Observability: Comprehensive monitoring and analytics
  6. Developer Experience: Clear separation of concerns and modern tooling

The architecture supports both development and production environments with flexible deployment options and comprehensive monitoring capabilities.

WrenAI Pipelines Documentation

Overview

WrenAI employs a pipeline architecture built on the Hamilton (async execution) and Haystack (component-based) frameworks. The system is organized into three main pipeline categories that work together to provide intelligent SQL generation and data analysis:

  1. Indexing Pipelines - Data ingestion and storage
  2. Retrieval Pipelines - Context and schema retrieval
  3. Generation Pipelines - AI-powered content generation

Architecture

Core Pipeline Framework

All pipelines inherit from the BasicPipeline abstract base class:

from abc import ABCMeta, abstractmethod
from typing import Any, Dict

# Pipeline drivers come from the Hamilton and Haystack frameworks mentioned above;
# the exact import paths depend on the installed versions.
from haystack import Pipeline
from hamilton.driver import Driver
from hamilton.experimental.h_async import AsyncDriver


class BasicPipeline(metaclass=ABCMeta):
    def __init__(self, pipe: Pipeline | AsyncDriver | Driver):
        self._pipe = pipe

    @abstractmethod
    def run(self, *args, **kwargs) -> Dict[str, Any]:
        ...
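
Continuing from the class above, a concrete pipeline supplies the driver and implements run(). The sketch below is purely illustrative: the driver and pipeline names are invented, not WrenAI's real classes.

# Hypothetical stand-in for a Hamilton or Haystack driver object.
class EchoDriver:
    def execute(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        return {"indexed": len(inputs.get("documents", []))}


class ToyIndexingPipeline(BasicPipeline):
    """Validates its input, then delegates the work to the wrapped driver."""

    def run(self, documents: list) -> Dict[str, Any]:
        if not documents:
            raise ValueError("nothing to index")
        return self._pipe.execute({"documents": documents})


print(ToyIndexingPipeline(EchoDriver()).run(["CREATE TABLE book (...)"]))  # {'indexed': 1}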

Pipeline Components

Each pipeline utilizes standardized components:

  • LLM Provider - Language model integration
  • Embedder Provider - Text embedding services
  • Document Store Provider - Vector and document storage
  • Engine - SQL execution and validation

System Architecture Overview

graph TB
    subgraph "User Interface"
        UI[User Query]
    end
    
    subgraph "Generation Pipelines"
        IC[Intent Classification]
        SG[SQL Generation]
        CG[Chart Generation]
        QR[Question Recommendation]
        SA[SQL Answer]
    end
    
    subgraph "Retrieval Pipelines"
        DSR[DB Schema Retrieval]
        HQR[Historical Question Retrieval]
        SPR[SQL Pairs Retrieval]
        IR[Instructions Retrieval]
        SFR[SQL Functions Retrieval]
        SE[SQL Executor]
    end
    
    subgraph "Indexing Pipelines"
        DBI[DB Schema Indexing]
        TDI[Table Description Indexing]
        HQI[Historical Question Indexing]
        SPI[SQL Pairs Indexing]
        II[Instructions Indexing]
        PMI[Project Meta Indexing]
    end
    
    subgraph "Storage Layer"
        VS[Vector Store]
        DS[Document Store]
        MS[Metadata Store]
    end
    
    subgraph "External Components"
        LLM[Language Models]
        EMB[Embedders]
        DB[Database Engines]
    end
    
    UI --> IC
    IC --> DSR
    IC --> SG
    DSR --> SG
    HQR --> SG
    SPR --> SG
    IR --> SG
    SFR --> SG
    SG --> SE
    SE --> SA
    SE --> CG
    SG --> QR
    
    DBI --> VS
    TDI --> VS
    HQI --> VS
    SPI --> VS
    II --> VS
    PMI --> MS
    
    DSR --> VS
    HQR --> VS
    SPR --> VS
    IR --> VS
    SFR --> VS
    
    SG --> LLM
    IC --> LLM
    CG --> LLM
    DSR --> EMB
    SE --> DB

Pipeline Component Architecture

graph TB
    subgraph "Pipeline Components"
        BP[BasicPipeline Abstract Class]
        
        subgraph "Providers"
            LLMP[LLM Provider]
            EMBP[Embedder Provider]
            DSP[Document Store Provider]
            EP[Engine Provider]
        end
        
        subgraph "Core Components"
            GEN[Generator]
            EMB[Embedder]
            RET[Retriever]
            WRT[Writer]
            ENG[Engine]
        end
        
        subgraph "Processing Components"
            PB[Prompt Builder]
            PP[Post Processor]
            VAL[Validator]
            CHK[Chunker]
        end
    end
    
    BP --> LLMP
    BP --> EMBP
    BP --> DSP
    BP --> EP
    
    LLMP --> GEN
    EMBP --> EMB
    DSP --> RET
    DSP --> WRT
    EP --> ENG
    
    GEN --> PB
    GEN --> PP
    EMB --> VAL
    WRT --> CHK

1. Indexing Pipelines

Indexing pipelines are responsible for processing and storing various types of data that will later be retrieved for query processing.

Indexing Pipeline Architecture

graph TB
    subgraph "Data Sources"
        MDL[MDL Schema]
        QH[Query History]
        SP[SQL Pairs]
        INST[Instructions]
        META[Project Metadata]
    end
    
    subgraph "Indexing Pipelines"
        DBI[DB Schema Indexing]
        TDI[Table Description Indexing]
        HQI[Historical Question Indexing]
        SPI[SQL Pairs Indexing]
        II[Instructions Indexing]
        PMI[Project Meta Indexing]
    end
    
    subgraph "Processing Components"
        VAL[Validator]
        CHK[Chunker]
        EMB[Embedder]
        WRT[Writer]
        CLN[Cleaner]
    end
    
    subgraph "Storage"
        VS[Vector Store<br/>Qdrant]
        DS[Document Store]
    end
    
    MDL --> DBI
    MDL --> TDI
    QH --> HQI
    SP --> SPI
    INST --> II
    META --> PMI
    
    DBI --> VAL
    VAL --> CHK
    CHK --> EMB
    EMB --> WRT
    WRT --> CLN
    CLN --> VS
    
    TDI --> VS
    HQI --> VS
    SPI --> VS
    II --> VS
    PMI --> DS
    
    style DBI fill:#e1f5fe
    style TDI fill:#e8f5e8
    style HQI fill:#fff3e0
    style SPI fill:#f3e5f5
    style II fill:#e0f2f1
    style PMI fill:#fce4ec

1.1 DB Schema Indexing (DBSchema)

Purpose: Indexes database schema information including tables, columns, relationships, views, and metrics.

Key Features:

  • DDL Chunking: Converts MDL (Model Definition Language) into searchable document chunks
  • Column Batching: Processes columns in configurable batch sizes (default: 50)
  • Relationship Processing: Handles foreign key relationships and joins
  • Calculated Fields: Indexes pre-computed calculated fields and metrics
  • Vector Embedding: Generates embeddings for semantic search

Process Flow:

  1. MDL Validation - Validates input MDL structure
  2. Model Preprocessing - Processes models, views, metrics, and relationships
  3. DDL Chunking - Creates document chunks for each schema component
  4. Embedding Generation - Generates vector embeddings
  5. Document Storage - Stores in document store with metadata
flowchart TD
    A[MDL Input] --> B[MDL Validator]
    B --> C[DDL Chunker]
    C --> D[Model Preprocessor]
    D --> E[Column Batcher]
    E --> F[Relationship Processor]
    F --> G[Embedder]
    G --> H[Document Writer]
    H --> I[Vector Store]
    
    subgraph "Document Types"
        J[TABLE_SCHEMA]
        K[TABLE_COLUMNS]
        L[VIEW]
        M[METRIC]
    end
    
    C --> J
    C --> K
    C --> L
    C --> M
    
    style A fill:#e3f2fd
    style I fill:#e8f5e8

Output Documents:

  • TABLE_SCHEMA - Table structure documents
  • TABLE_COLUMNS - Column-specific documents (batched)
  • VIEW - View definition documents
  • METRIC - Metric calculation documents

1.2 Table Description Indexing (TableDescription)

Purpose: Indexes high-level table descriptions for table discovery and selection.

Key Features:

  • Semantic table descriptions
  • Table-level metadata storage
  • Table relationship context

1.3 Historical Question Indexing (HistoricalQuestion)

Purpose: Indexes previously asked questions and their corresponding SQL queries for learning and recommendation.

Key Features:

  • Question-SQL pair storage
  • Semantic similarity indexing
  • Query pattern learning

1.4 SQL Pairs Indexing (SqlPairs)

Purpose: Indexes curated SQL examples and question-answer pairs for training and reference.

Key Features:

  • Example query storage
  • Question-SQL mapping
  • Training data management

1.5 Instructions Indexing (Instructions)

Purpose: Indexes custom instructions and business rules for query generation.

Key Features:

  • Custom instruction storage
  • Business rule indexing
  • Context-specific guidance

1.6 Project Meta Indexing (ProjectMeta)

Purpose: Indexes project-level metadata and configuration information.

Key Features:

  • Project configuration storage
  • Metadata validation
  • Project-specific settings

2. Retrieval Pipelines

Retrieval pipelines fetch relevant context and information needed for query processing and generation.

Retrieval Pipeline Architecture

graph TB
    subgraph "Input"
        UQ[User Query]
        HIST[Query History]
        CTX[Context]
    end
    
    subgraph "Retrieval Pipelines"
        DSR[DB Schema Retrieval]
        HQR[Historical Question Retrieval]
        SPR[SQL Pairs Retrieval]
        IR[Instructions Retrieval]
        SFR[SQL Functions Retrieval]
        SE[SQL Executor]
        PSD[Preprocess SQL Data]
    end
    
    subgraph "Retrieval Components"
        EMB[Embedder]
        RET[Retrievers]
        FLT[Filters]
        PRC[Processors]
    end
    
    subgraph "Storage Sources"
        VS[Vector Store]
        DS[Document Store]
        MS[Metadata Store]
    end
    
    subgraph "Output Context"
        SC[Schema Context]
        EC[Example Context]
        IC[Instruction Context]
        FC[Function Context]
        RC[Result Context]
    end
    
    UQ --> EMB
    EMB --> DSR
    EMB --> HQR
    EMB --> SPR
    EMB --> IR
    EMB --> SFR
    
    DSR --> RET
    HQR --> RET
    SPR --> RET
    IR --> RET
    SFR --> RET
    
    RET --> VS
    RET --> DS
    RET --> MS
    
    DSR --> FLT
    FLT --> PRC
    PRC --> SC
    
    HQR --> EC
    SPR --> EC
    IR --> IC
    SFR --> FC
    SE --> RC
    
    style DSR fill:#e1f5fe
    style HQR fill:#e8f5e8
    style SPR fill:#fff3e0
    style IR fill:#f3e5f5
    style SFR fill:#e0f2f1
    style SE fill:#fce4ec

2.1 DB Schema Retrieval (DbSchemaRetrieval)

Purpose: The most critical retrieval pipeline that intelligently selects relevant schema components for query generation.

Key Features:

  • Smart Column Pruning: Uses LLM to select only necessary columns
  • Table Selection: Identifies relevant tables based on query intent
  • Context Window Management: Manages token limits for LLM processing
  • Semantic Retrieval: Uses vector similarity for schema component selection

Column Pruning Algorithm:

  1. Semantic Analysis - Analyzes user question semantically
  2. Table Identification - Identifies potentially relevant tables
  3. Column Selection - LLM selects minimal necessary columns
  4. Reasoning Generation - Provides justification for each selection
  5. Schema Construction - Builds filtered schema context

Process Flow:

flowchart TD
    A[User Query] --> B[Embedding Generation]
    B --> C[Table Retrieval]
    C --> D[Schema Retrieval]
    D --> E[Context Window Check]
    E --> F{Enable Column Pruning?}
    F -->|Yes| G[Column Selection LLM]
    F -->|No| H[Full Schema]
    G --> I[Column Reasoning]
    I --> J[Schema Construction]
    H --> J
    J --> K[Filtered Context]
    
    subgraph "Column Pruning Process"
        G --> L[Table Analysis]
        L --> M[Column Selection]
        M --> N[Reasoning Chain]
        N --> O[Context Validation]
    end
    
    style A fill:#e3f2fd
    style G fill:#fff9c4
    style K fill:#e8f5e8
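
The core of the column-pruning step can be pictured as filtering the retrieved schemas down to the columns the LLM kept. The sketch below is a simplified illustration; the function and variable names are made up.

from typing import Dict, List

def prune_schema(
    schemas: Dict[str, List[str]],   # table -> all retrieved column names
    selected: Dict[str, List[str]],  # table -> columns the LLM decided to keep
) -> Dict[str, List[str]]:
    pruned = {}
    for table, columns in schemas.items():
        keep = [c for c in columns if c in set(selected.get(table, []))]
        if keep:  # drop tables with no selected columns at all
            pruned[table] = keep
    return pruned

schemas = {
    "orders": ["order_id", "customer_id", "order_date", "order_status"],
    "customers": ["customer_id", "customer_name", "city"],
}
selected = {"orders": ["order_id", "order_date"], "customers": ["customer_id"]}
print(prune_schema(schemas, selected))
# {'orders': ['order_id', 'order_date'], 'customers': ['customer_id']}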

2.2 Historical Question Retrieval (HistoricalQuestionRetrieval)

Purpose: Retrieves similar historical questions and their SQL solutions.

Key Features:

  • Semantic similarity matching
  • Historical pattern recognition
  • Query template suggestions

2.3 SQL Pairs Retrieval (SqlPairsRetrieval)

Purpose: Retrieves relevant SQL examples and patterns.

Key Features:

  • Example query matching
  • Pattern-based retrieval
  • Template selection

2.4 Instructions Retrieval (Instructions)

Purpose: Retrieves applicable instructions and business rules.

Key Features:

  • Context-specific instruction selection
  • Business rule application
  • Custom guidance retrieval

2.5 SQL Functions Retrieval (SqlFunctions)

Purpose: Retrieves available SQL functions and their usage patterns.

Key Features:

  • Function documentation retrieval
  • Usage example selection
  • Database-specific function support

2.6 SQL Executor (SQLExecutor)

Purpose: Executes SQL queries and retrieves results.

Key Features:

  • Query execution with timeout management
  • Result data fetching
  • Error handling and validation

2.7 Preprocess SQL Data (PreprocessSqlData)

Purpose: Preprocesses SQL data for downstream processing.

Key Features:

  • Token counting and limit management
  • Data format validation
  • Context preparation

3. Generation Pipelines

Generation pipelines use AI models to create various types of content based on retrieved context.

Generation Pipeline Architecture

graph TB
    subgraph "Input Context"
        SC[Schema Context]
        EC[Example Context]
        IC[Instruction Context]
        FC[Function Context]
        HC[History Context]
    end
    
    subgraph "Generation Pipelines"
        ICL[Intent Classification]
        SG[SQL Generation]
        SGR[SQL Generation Reasoning]
        FSG[Follow-up SQL Generation]
        SR[SQL Regeneration]
        SCR[SQL Correction]
        CG[Chart Generation]
        CA[Chart Adjustment]
        QR[Question Recommendation]
        RR[Relationship Recommendation]
        SA[SQL Answer]
        SD[Semantics Description]
    end
    
    subgraph "Generation Components"
        PB[Prompt Builder]
        LLM[Language Model]
        PP[Post Processor]
        VAL[Validator]
        DRY[Dry Run Engine]
    end
    
    subgraph "Output"
        SQL[Generated SQL]
        CHART[Chart Specs]
        ANS[Natural Language Answers]
        REC[Recommendations]
        DESC[Descriptions]
    end
    
    SC --> PB
    EC --> PB
    IC --> PB
    FC --> PB
    HC --> PB
    
    PB --> ICL
    PB --> SG
    PB --> SGR
    PB --> FSG
    PB --> SR
    PB --> SCR
    PB --> CG
    PB --> CA
    PB --> QR
    PB --> RR
    PB --> SA
    PB --> SD
    
    ICL --> LLM
    SG --> LLM
    SGR --> LLM
    FSG --> LLM
    SR --> LLM
    SCR --> LLM
    CG --> LLM
    CA --> LLM
    QR --> LLM
    RR --> LLM
    SA --> LLM
    SD --> LLM
    
    LLM --> PP
    PP --> VAL
    VAL --> DRY
    
    SG --> SQL
    CG --> CHART
    CA --> CHART
    SA --> ANS
    QR --> REC
    RR --> REC
    SD --> DESC
    
    style ICL fill:#e1f5fe
    style SG fill:#e8f5e8
    style CG fill:#fff3e0
    style QR fill:#f3e5f5

3.1 Intent Classification (IntentClassification)

Purpose: Classifies user queries into different intent categories to route them appropriately.

Intent Categories:

  • TEXT_TO_SQL - Queries requiring SQL generation
  • GENERAL - General database information requests
  • USER_GUIDE - WrenAI feature and usage questions
  • MISLEADING_QUERY - Off-topic or irrelevant queries
flowchart TD
    A[User Query] --> B[Context Analysis]
    B --> C[Schema Retrieval]
    C --> D[Intent Classification LLM]
    D --> E{Classification Result}
    E -->|TEXT_TO_SQL| F[SQL Generation Pipeline]
    E -->|GENERAL| G[General Assistance Pipeline]
    E -->|USER_GUIDE| H[User Guide Pipeline]
    E -->|MISLEADING_QUERY| I[Misleading Assistance Pipeline]
    
    F --> J[Schema Context + SQL Generation]
    G --> K[General Information Response]
    H --> L[User Guide Response]
    I --> M[Polite Redirection]
    
    style D fill:#fff9c4
    style F fill:#e8f5e8
    style G fill:#e3f2fd
    style H fill:#f3e5f5
    style I fill:#ffebee

Key Features:

  • Multi-turn conversation context
  • Question rephrasing for follow-ups
  • Schema-aware classification
  • Reasoning generation

3.2 SQL Generation (SQLGeneration)

Purpose: Core pipeline that generates SQL queries from natural language questions.

flowchart TD
    A[Natural Language Query] --> B[Context Preparation]
    B --> C[Schema Context]
    B --> D[Example Context]
    B --> E[Instruction Context]
    B --> F[Function Context]
    
    C --> G[Prompt Builder]
    D --> G
    E --> G
    F --> G
    
    G --> H[SQL Generation LLM]
    H --> I[Generated SQL]
    I --> J[Post Processor]
    J --> K[Quote Addition]
    K --> L{Validation Method}
    L -->|Dry Plan| M[Dry Plan Validation]
    L -->|Dry Run| N[Dry Run Validation]
    
    M --> O{Valid?}
    N --> O
    O -->|Yes| P[Valid SQL Output]
    O -->|No| Q[Error Analysis]
    Q --> R[SQL Correction Pipeline]
    
    subgraph "SQL Rules Enforcement"
        S[SELECT Only]
        T[Table Qualification]
        U[JOIN Requirements]
        V[Performance Patterns]
        W[Calculated Fields]
    end
    
    H --> S
    H --> T
    H --> U
    H --> V
    H --> W
    
    style H fill:#fff9c4
    style P fill:#e8f5e8
    style Q fill:#ffebee

Key Features:

  • Performance Optimization Rules: Prevents expensive query patterns
  • Schema-Aware Generation: Uses retrieved schema context
  • Calculated Field Integration: Leverages pre-computed fields
  • Dry Run Validation: Validates queries before execution
  • SQL Rule Enforcement: Follows strict SQL generation guidelines

Generation Rules:

  • Only SELECT statements allowed
  • Mandatory table qualification
  • Proper JOIN usage requirements
  • Calculated field utilization
  • Performance-optimized patterns
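
To make the "SELECT only" rule above concrete, here is a minimal guard of the kind a post-processor might apply before validation. It is a sketch only; WrenAI's real enforcement happens in the generation prompt and the dry-run validation, and the function name is invented.

import re

FORBIDDEN = re.compile(r"^\s*(insert|update|delete|drop|alter|create|truncate)\b", re.IGNORECASE)

def is_select_only(sql: str) -> bool:
    """Accept only statements that start with SELECT or a WITH ... SELECT CTE."""
    stripped = sql.strip().rstrip(";")
    if FORBIDDEN.match(stripped):
        return False
    return bool(re.match(r"^\s*(select|with)\b", stripped, re.IGNORECASE))

print(is_select_only("SELECT b.Title FROM book AS b"))  # True
print(is_select_only("DROP TABLE book"))                # False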

3.3 SQL Generation Reasoning (SQLGenerationReasoning)

Purpose: Generates step-by-step reasoning for SQL query construction.

Key Features:

  • Logical step breakdown
  • Decision justification
  • Educational explanations

3.4 Follow-up SQL Generation (FollowUpSQLGeneration)

Purpose: Handles follow-up questions that modify or extend previous queries.

Key Features:

  • Conversation context awareness
  • Query modification logic
  • Incremental query building

3.5 SQL Regeneration (SQLRegeneration)

Purpose: Regenerates SQL queries with improvements or corrections.

Key Features:

  • Query optimization
  • Error correction
  • Performance improvements

3.6 SQL Correction (SQLCorrection)

Purpose: Corrects invalid or problematic SQL queries.

Key Features:

  • Error detection and analysis
  • Syntax correction
  • Logic error fixes

3.7 Chart Generation (ChartGeneration)

Purpose: Generates visualization specifications from query results.

Key Features:

  • Vega-Lite specification generation
  • Chart type selection
  • Data binding optimization

3.8 Chart Adjustment (ChartAdjustment)

Purpose: Modifies existing chart specifications based on user feedback.

Key Features:

  • Chart property modification
  • Visual adjustment logic
  • Interactive chart updates

3.9 Question Recommendation (QuestionRecommendation)

Purpose: Suggests relevant questions users can ask about their data.

Key Features:

  • Schema-based suggestions
  • Categorized recommendations
  • Difficulty-appropriate questions

3.10 Relationship Recommendation (RelationshipRecommendation)

Purpose: Suggests potential relationships between tables in the schema.

Key Features:

  • Foreign key pattern detection
  • Relationship type classification
  • Confidence scoring

3.11 SQL Answer (SQLAnswer)

Purpose: Generates natural language explanations of SQL query results.

Key Features:

  • Result interpretation
  • Summary generation
  • Business context explanation

3.12 Semantics Description (SemanticsDescription)

Purpose: Generates semantic descriptions of data models and their business meaning.

Key Features:

  • Business context generation
  • Model documentation
  • Semantic relationship explanation

3.13 Data Assistance (DataAssistance)

Purpose: Provides guidance and assistance for data analysis tasks.

Key Features:

  • Analysis suggestions
  • Data interpretation help
  • Methodology recommendations

3.14 User Guide Assistance (UserGuideAssistance)

Purpose: Answers questions about WrenAI features and usage.

Key Features:

  • Feature documentation
  • Usage guidance
  • Troubleshooting help

3.15 Misleading Assistance (MisleadingAssistance)

Purpose: Handles off-topic or irrelevant queries gracefully.

Key Features:

  • Polite redirection
  • Capability explanation
  • Guidance back to relevant topics

3.16 SQL Question (SQLQuestion)

Purpose: Generates questions that could be answered by given SQL queries.

Key Features:

  • Reverse engineering questions
  • Educational query examples
  • Documentation generation

3.17 SQL Tables Extraction (SQLTablesExtraction)

Purpose: Extracts table names and dependencies from SQL queries.

Key Features:

  • SQL parsing and analysis
  • Dependency identification
  • Table usage tracking

Pipeline Orchestration

Typical Query Processing Flow

sequenceDiagram
    participant User
    participant IC as Intent Classification
    participant DSR as DB Schema Retrieval
    participant CR as Context Retrieval
    participant SG as SQL Generation
    participant SE as SQL Executor
    participant CG as Chart Generation
    participant SA as SQL Answer
    
    User->>IC: Natural Language Query
    IC->>IC: Classify Intent
    
    alt TEXT_TO_SQL Intent
        IC->>DSR: Request Schema Context
        DSR->>DSR: Column Pruning & Selection
        DSR->>CR: Schema Context
        
        par Context Retrieval
            CR->>CR: Historical Questions
            CR->>CR: SQL Examples
            CR->>CR: Instructions
            CR->>CR: SQL Functions
        end
        
        CR->>SG: Combined Context
        SG->>SG: Generate SQL with Reasoning
        SG->>SG: Dry Run Validation
        
        alt Valid SQL
            SG->>SE: Execute SQL
            SE->>SE: Fetch Results
            SE->>CG: Query Results
            CG->>CG: Generate Visualizations
            SE->>SA: Generate Explanation
            SA->>User: Complete Response
        else Invalid SQL
            SG->>SG: SQL Correction Pipeline
            SG->>SE: Corrected SQL
        end
        
    else GENERAL Intent
        IC->>SA: General Information Request
        SA->>User: General Response
        
    else USER_GUIDE Intent
        IC->>SA: User Guide Query
        SA->>User: User Guide Response
        
    else MISLEADING_QUERY Intent
        IC->>SA: Redirect Query
        SA->>User: Polite Redirection
    end
  1. Intent Classification - Determine query type and route appropriately
  2. Schema Retrieval - Fetch relevant schema components with column pruning
  3. Context Retrieval - Gather historical examples, instructions, and functions
  4. SQL Generation - Generate optimized SQL with reasoning
  5. Validation - Dry run validation and error checking
  6. Execution - Execute query and fetch results
  7. Chart Generation - Create visualizations if applicable
  8. Answer Generation - Provide natural language explanations
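
Step 3 gathers several kinds of context side by side, matching the par block in the sequence diagram. A minimal asyncio sketch of that fan-out is shown below; the retriever here is a stand-in, not one of WrenAI's real pipelines.

import asyncio

async def retrieve(kind: str, query: str) -> dict:
    await asyncio.sleep(0.1)  # stand-in for an embedder call plus a vector-store lookup
    return {kind: f"context for {query!r}"}

async def gather_context(query: str) -> dict:
    parts = await asyncio.gather(
        retrieve("historical_questions", query),
        retrieve("sql_pairs", query),
        retrieve("instructions", query),
        retrieve("sql_functions", query),
    )
    merged = {}
    for part in parts:
        merged.update(part)
    return merged

print(asyncio.run(gather_context("monthly revenue by city")))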

Performance Optimizations

graph TD
    subgraph "Performance Optimization Strategies"
        A[Input Query] --> B[Column Pruning]
        B --> C[Token Management]
        C --> D[Context Window Optimization]
        D --> E[Smart Caching]
        E --> F[Dry Run Validation]
        F --> G[Calculated Field Usage]
        G --> H[Optimized SQL Output]
    end
    
    subgraph "Optimization Benefits"
        I[60-80% Schema Size Reduction]
        J[Prevents Context Overflow]
        K[Catches Expensive Queries]
        L[Leverages Pre-computed Metrics]
        M[Reduces Processing Time]
    end
    
    B --> I
    C --> J
    F --> K
    G --> L
    E --> M
    
    style B fill:#e8f5e8
    style F fill:#fff3e0
    style G fill:#f3e5f5
  1. Column Pruning - Reduces schema size by 60-80%
  2. Token Management - Prevents context window overflow
  3. Dry Run Validation - Catches expensive queries before execution
  4. Calculated Field Usage - Leverages pre-computed metrics
  5. Smart Caching - Reuses retrieved contexts when possible
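
As a rough illustration of the token-management idea in item 2, the sketch below keeps retrieved documents, in order, until an assumed token budget is exhausted. The words-to-tokens factor is a crude estimate; a real implementation would use the model's tokenizer.

def fit_to_budget(documents: list, max_tokens: int = 1000) -> list:
    """Keep documents in order until a rough token estimate exceeds the budget."""
    kept, used = [], 0
    for doc in documents:
        estimate = int(len(doc.split()) * 1.3)  # crude words-to-tokens conversion
        if used + estimate > max_tokens:
            break
        kept.append(doc)
        used += estimate
    return kept

docs = ["CREATE TABLE book (Book_ID INTEGER, Title VARCHAR, Price REAL)"] * 100
print(len(fit_to_budget(docs, max_tokens=200)))  # only a prefix of the documents fits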

Error Handling and Fallbacks

graph TD
    A[SQL Generation] --> B{Valid SQL?}
    B -->|No| C[Error Analysis]
    C --> D[SQL Correction Pipeline]
    D --> E{Correctable?}
    E -->|Yes| F[Corrected SQL]
    E -->|No| G[Fallback Strategy]
    
    G --> H[Simplified Query]
    G --> I[Alternative Approach]
    G --> J[User Guidance]
    
    F --> K[Validation Check]
    K --> L{Valid?}
    L -->|Yes| M[Execute Query]
    L -->|No| N[Graceful Degradation]
    
    N --> O[Reduced Functionality]
    N --> P[Error Message with Suggestions]
    
    style C fill:#ffebee
    style D fill:#fff3e0
    style G fill:#e3f2fd
    style O fill:#f3e5f5
  1. SQL Correction Pipeline - Fixes invalid queries
  2. Fallback Strategies - Alternative generation approaches
  3. Graceful Degradation - Reduced functionality under constraints
  4. User Guidance - Helpful error messages and suggestions

Configuration and Deployment

Pipeline Configuration

graph TB
    subgraph "Configuration Structure"
        A[Pipeline Configuration]
        
        subgraph "Provider Types"
            B[LLM Provider]
            C[Embedder Provider]
            D[Document Store Provider]
            E[Engine Provider]
        end
        
        subgraph "Pipeline Mappings"
            F[Indexing Pipelines]
            G[Retrieval Pipelines]
            H[Generation Pipelines]
        end
        
        subgraph "Component Instances"
            I[OpenAI GPT-4o]
            J[Text Embedding 3 Large]
            K[Qdrant Vector Store]
            L[Wren Engine]
        end
    end
    
    A --> B
    A --> C
    A --> D
    A --> E
    
    B --> F
    B --> G
    B --> H
    
    C --> F
    C --> G
    
    D --> F
    D --> G
    D --> H
    
    E --> G
    E --> H
    
    B --> I
    C --> J
    D --> K
    E --> L

Pipelines are configured through provider mappings:

pipeline:
  type: pipeline
  pipes:
    - name: indexing
      llm: openai_llm.gpt-4o-mini
      embedder: openai_embedder.text-embedding-3-large
      document_store: qdrant
      engine: wren_ui
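
A small sketch of reading such a provider mapping with PyYAML. The keys mirror the snippet above; the loader itself is illustrative and not WrenAI's actual configuration code.

import yaml  # PyYAML

CONFIG = """
pipeline:
  type: pipeline
  pipes:
    - name: indexing
      llm: openai_llm.gpt-4o-mini
      embedder: openai_embedder.text-embedding-3-large
      document_store: qdrant
      engine: wren_ui
"""

config = yaml.safe_load(CONFIG)
for pipe in config["pipeline"]["pipes"]:
    print(pipe["name"], "->", pipe["llm"], "/", pipe["embedder"])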

Scalability Features

  1. Async Processing - Non-blocking pipeline execution
  2. Batch Processing - Efficient bulk operations
  3. Resource Management - Configurable timeouts and limits
  4. Load Balancing - Distributed processing capabilities

Monitoring and Observability

graph TB
    subgraph "Monitoring Stack"
        A[Pipeline Execution]
        
        subgraph "Tracing & Monitoring"
            B[Langfuse Integration]
            C[Performance Metrics]
            D[Error Tracking]
            E[Cost Tracking]
        end
        
        subgraph "Metrics Collection"
            F[Execution Time]
            G[Token Usage]
            H[Success Rate]
            I[Error Rate]
            J[Resource Usage]
        end
        
        subgraph "Dashboards"
            K[Performance Dashboard]
            L[Error Dashboard]
            M[Cost Dashboard]
            N[Usage Dashboard]
        end
    end
    
    A --> B
    A --> C
    A --> D
    A --> E
    
    B --> F
    B --> G
    C --> H
    C --> I
    D --> I
    E --> G
    E --> J
    
    F --> K
    G --> M
    H --> K
    I --> L
    J --> N
  1. Langfuse Integration - Comprehensive tracing and monitoring
  2. Performance Metrics - Pipeline execution statistics
  3. Error Tracking - Detailed error analysis and reporting
  4. Cost Tracking - LLM usage and cost monitoring

Best Practices

Pipeline Development

  1. Modular Design - Keep pipelines focused and composable
  2. Error Handling - Implement comprehensive error handling
  3. Validation - Validate inputs and outputs thoroughly
  4. Documentation - Maintain clear documentation and examples

Performance Optimization

  1. Context Management - Minimize context size while maintaining quality
  2. Caching Strategies - Cache expensive operations when possible
  3. Batch Processing - Process multiple items together when feasible
  4. Resource Monitoring - Monitor and optimize resource usage

Quality Assurance

  1. Testing - Comprehensive unit and integration testing
  2. Validation - Multi-stage validation and verification
  3. Monitoring - Continuous quality monitoring
  4. Feedback Loops - Incorporate user feedback for improvements

This pipeline architecture enables WrenAI to deliver intelligent, efficient, and scalable AI-powered data analysis while maintaining high performance and accuracy.

What is MDL Schema?

MDL (Model Definition Language) is WrenAI's proprietary schema format that defines the semantic layer of your data warehouse. It's a JSON-based specification that describes your data models, relationships, metrics, views, and business logic in a structured way that WrenAI can understand and use for intelligent SQL generation.

Core Components of MDL Schema:

  1. Models - Represent tables or data sources
  2. Relationships - Define how models connect to each other
  3. Metrics - Pre-defined aggregations and business calculations
  4. Views - Saved SQL queries for common use cases
  5. Calculated Fields - Computed columns with custom expressions

MDL Schema Structure

{
  "catalog": "string",      // Database catalog name
  "schema": "string",       // Database schema name
  "models": [...],          // Array of data models
  "relationships": [...],   // Array of relationships between models
  "metrics": [...],         // Array of business metrics
  "views": [...],           // Array of saved views
  "enumDefinitions": [...], // Array of enum definitions
  "macros": [...]          // Array of reusable macros
}
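
A quick, illustrative sanity check that an MDL document contains the top-level sections described above. The helper is made up for this example; WrenAI performs its own validation when the MDL is indexed, and several sections (metrics, views, enumDefinitions, macros) may legitimately be empty or absent.

import json

EXPECTED_KEYS = {"catalog", "schema", "models", "relationships", "metrics", "views"}

def missing_mdl_sections(mdl_json: str) -> set:
    """Return the expected top-level keys that are absent from an MDL document."""
    mdl = json.loads(mdl_json)
    return EXPECTED_KEYS - set(mdl)

sample = '{"catalog": "wrenai", "schema": "spider", "models": [], "relationships": []}'
print(missing_mdl_sections(sample))  # metrics and views are absent in this minimal sample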

Examples

1. Simple MDL Example - Book Store

{
  "catalog": "wrenai",
  "schema": "spider",
  "models": [
    {
      "name": "book",
      "properties": {
        "description": "Book information table",
        "displayName": "Books"
      },
      "refSql": "select * from \"wrenai\".spider.\"book_2-book\"",
      "columns": [
        {
          "name": "Book_ID",
          "type": "INTEGER",
          "notNull": false,
          "isCalculated": false,
          "expression": "Book_ID",
          "properties": {
            "description": "Unique identifier for each book",
            "displayName": "Book ID"
          }
        },
        {
          "name": "Title",
          "type": "VARCHAR",
          "notNull": false,
          "isCalculated": false,
          "expression": "Title",
          "properties": {
            "description": "Title of the book",
            "displayName": "Book Title"
          }
        },
        {
          "name": "Writer",
          "type": "VARCHAR",
          "notNull": false,
          "isCalculated": false,
          "expression": "Writer",
          "properties": {
            "description": "Author of the book",
            "displayName": "Author"
          }
        },
        {
          "name": "Price",
          "type": "REAL",
          "notNull": false,
          "isCalculated": false,
          "expression": "Price",
          "properties": {
            "description": "Price of the book",
            "displayName": "Book Price"
          }
        }
      ],
      "primaryKey": "Book_ID",
      "cached": false
    },
    {
      "name": "publication",
      "properties": {
        "description": "Publication information",
        "displayName": "Publications"
      },
      "refSql": "select * from \"wrenai\".spider.\"book_2-publication\"",
      "columns": [
        {
          "name": "Publication_ID",
          "type": "INTEGER",
          "notNull": false,
          "isCalculated": false,
          "expression": "Publication_ID",
          "properties": {
            "description": "Unique identifier for publication",
            "displayName": "Publication ID"
          }
        },
        {
          "name": "Book_ID",
          "type": "INTEGER",
          "notNull": false,
          "isCalculated": false,
          "expression": "Book_ID",
          "properties": {
            "description": "Reference to book",
            "displayName": "Book ID"
          }
        },
        {
          "name": "Publisher",
          "type": "VARCHAR",
          "notNull": false,
          "isCalculated": false,
          "expression": "Publisher",
          "properties": {
            "description": "Name of the publisher",
            "displayName": "Publisher Name"
          }
        },
        {
          "name": "Publication_Date",
          "type": "VARCHAR",
          "notNull": false,
          "isCalculated": false,
          "expression": "Publication_Date",
          "properties": {
            "description": "Date of publication",
            "displayName": "Publication Date"
          }
        }
      ],
      "primaryKey": "Publication_ID",
      "cached": false
    }
  ],
  "relationships": [
    {
      "name": "BookPublication",
      "models": ["book", "publication"],
      "joinType": "ONE_TO_MANY",
      "condition": "book.Book_ID = publication.Book_ID",
      "properties": {
        "description": "Relationship between books and their publications"
      }
    }
  ],
  "views": [
    {
      "name": "book_summary",
      "statement": "SELECT COUNT(*) as total_books, AVG(Price) as avg_price FROM book",
      "properties": {
        "question": "How many books are there and what's the average price?",
        "summary": "Summary statistics for all books",
        "displayName": "Book Summary"
      }
    }
  ]
}

2. E-commerce MDL Example with Complex Relationships

{
  "catalog": "ecommerce",
  "schema": "public",
  "models": [
    {
      "name": "customers",
      "tableReference": {
        "catalog": "ecommerce",
        "schema": "public",
        "table": "customers"
      },
      "columns": [
        {
          "name": "customer_id",
          "type": "STRING",
          "notNull": true,
          "isCalculated": false,
          "properties": {
            "description": "Unique customer identifier",
            "displayName": "Customer ID"
          }
        },
        {
          "name": "customer_name",
          "type": "STRING",
          "notNull": false,
          "isCalculated": false,
          "properties": {
            "description": "Customer full name",
            "displayName": "Customer Name"
          }
        },
        {
          "name": "city",
          "type": "STRING",
          "notNull": false,
          "isCalculated": false,
          "properties": {
            "description": "Customer city",
            "displayName": "City"
          }
        },
        {
          "name": "total_orders",
          "type": "INTEGER",
          "notNull": false,
          "isCalculated": true,
          "expression": "COUNT(orders.order_id)",
          "relationship": "CustomerOrders",
          "properties": {
            "description": "Total number of orders by customer",
            "displayName": "Total Orders"
          }
        }
      ],
      "primaryKey": "customer_id",
      "cached": false,
      "properties": {
        "description": "Customer information",
        "displayName": "Customers"
      }
    },
    {
      "name": "orders",
      "tableReference": {
        "catalog": "ecommerce",
        "schema": "public",
        "table": "orders"
      },
      "columns": [
        {
          "name": "order_id",
          "type": "STRING",
          "notNull": true,
          "isCalculated": false,
          "properties": {
            "description": "Unique order identifier",
            "displayName": "Order ID"
          }
        },
        {
          "name": "customer_id",
          "type": "STRING",
          "notNull": true,
          "isCalculated": false,
          "properties": {
            "description": "Customer who placed the order",
            "displayName": "Customer ID"
          }
        },
        {
          "name": "order_date",
          "type": "TIMESTAMP",
          "notNull": false,
          "isCalculated": false,
          "properties": {
            "description": "When the order was placed",
            "displayName": "Order Date"
          }
        },
        {
          "name": "order_status",
          "type": "STRING",
          "notNull": false,
          "isCalculated": false,
          "properties": {
            "description": "Current status of the order",
            "displayName": "Order Status"
          }
        }
      ],
      "primaryKey": "order_id",
      "cached": false,
      "properties": {
        "description": "Order information",
        "displayName": "Orders"
      }
    },
    {
      "name": "order_items",
      "tableReference": {
        "catalog": "ecommerce",
        "schema": "public",
        "table": "order_items"
      },
      "columns": [
        {
          "name": "order_id",
          "type": "STRING",
          "notNull": true,
          "isCalculated": false,
          "properties": {
            "description": "Reference to order",
            "displayName": "Order ID"
          }
        },
        {
          "name": "product_id",
          "type": "STRING",
          "notNull": true,
          "isCalculated": false,
          "properties": {
            "description": "Reference to product",
            "displayName": "Product ID"
          }
        },
        {
          "name": "quantity",
          "type": "INTEGER",
          "notNull": false,
          "isCalculated": false,
          "properties": {
            "description": "Quantity ordered",
            "displayName": "Quantity"
          }
        },
        {
          "name": "price",
          "type": "DECIMAL",
          "notNull": false,
          "isCalculated": false,
          "properties": {
            "description": "Price per item",
            "displayName": "Unit Price"
          }
        }
      ],
      "primaryKey": "",
      "cached": false,
      "properties": {
        "description": "Individual items within orders",
        "displayName": "Order Items"
      }
    }
  ],
  "relationships": [
    {
      "name": "CustomerOrders",
      "models": ["customers", "orders"],
      "joinType": "ONE_TO_MANY",
      "condition": "customers.customer_id = orders.customer_id",
      "properties": {
        "description": "Relationship between customers and their orders"
      }
    },
    {
      "name": "OrderItems",
      "models": ["orders", "order_items"],
      "joinType": "ONE_TO_MANY",
      "condition": "orders.order_id = order_items.order_id",
      "properties": {
        "description": "Relationship between orders and their line items"
      }
    }
  ],
  "metrics": [
    {
      "name": "revenue_metrics",
      "baseObject": "order_items",
      "dimension": [
        {
          "name": "order_date",
          "type": "TIMESTAMP",
          "isCalculated": false,
          "notNull": false,
          "properties": {
            "description": "Date dimension for time-based analysis",
            "displayName": "Order Date"
          }
        },
        {
          "name": "customer_city",
          "type": "STRING",
          "isCalculated": false,
          "notNull": false,
          "properties": {
            "description": "Customer city for geographic analysis",
            "displayName": "Customer City"
          }
        }
      ],
      "measure": [
        {
          "name": "total_revenue",
          "type": "DECIMAL",
          "isCalculated": true,
          "expression": "SUM(order_items.price * order_items.quantity)",
          "notNull": false,
          "properties": {
            "description": "Total revenue from sales",
            "displayName": "Total Revenue"
          }
        },
        {
          "name": "order_count",
          "type": "BIGINT",
          "isCalculated": true,
          "expression": "COUNT(DISTINCT orders.order_id)",
          "notNull": false,
          "properties": {
            "description": "Number of unique orders",
            "displayName": "Order Count"
          }
        }
      ],
      "timeGrain": [
        {
          "name": "order_date_grain",
          "refColumn": "order_date",
          "dateParts": ["YEAR", "QUARTER", "MONTH", "DAY"]
        }
      ],
      "cached": true,
      "refreshTime": "1h",
      "properties": {
        "description": "Revenue and order metrics for business analysis",
        "displayName": "Revenue Metrics"
      }
    }
  ],
  "views": [
    {
      "name": "monthly_revenue",
      "statement": "SELECT DATE_TRUNC('month', order_date) as month, SUM(total_revenue) as revenue FROM revenue_metrics GROUP BY DATE_TRUNC('month', order_date)",
      "properties": {
        "question": "What is the monthly revenue trend?",
        "summary": "Monthly revenue aggregation",
        "displayName": "Monthly Revenue Trend"
      }
    }
  ]
}

3. Advanced MDL with Table References and Nested Columns

{
  "catalog": "analytics",
  "schema": "events",
  "models": [
    {
      "name": "user_events",
      "tableReference": {
        "catalog": "analytics",
        "schema": "events",
        "table": "user_events"
      },
      "columns": [
        {
          "name": "event_id",
          "type": "STRING",
          "notNull": true,
          "isCalculated": false,
          "properties": {
            "description": "Unique event identifier",
            "displayName": "Event ID"
          }
        },
        {
          "name": "user_id",
          "type": "STRING",
          "notNull": true,
          "isCalculated": false,
          "properties": {
            "description": "User who triggered the event",
            "displayName": "User ID"
          }
        },
        {
          "name": "event_params",
          "type": "ARRAY<STRUCT<key STRING, value STRING>>",
          "notNull": false,
          "isCalculated": false,
          "properties": {
            "description": "Event parameters as nested structure",
            "displayName": "Event Parameters",
            "nestedDisplayName.event_params.key": "Parameter Key",
            "nestedDescription.event_params.key": "Parameter name",
            "nestedDisplayName.event_params.value": "Parameter Value",
            "nestedDescription.event_params.value": "Parameter value"
          }
        }
      ],
      "primaryKey": "event_id",
      "cached": true,
      "refreshTime": "30m",
      "properties": {
        "description": "User event tracking data",
        "displayName": "User Events"
      }
    }
  ],
  "enumDefinitions": [
    {
      "name": "EventType",
      "values": [
        {
          "name": "PAGE_VIEW",
          "value": "page_view",
          "properties": {
            "description": "User viewed a page"
          }
        },
        {
          "name": "CLICK",
          "value": "click",
          "properties": {
            "description": "User clicked on an element"
          }
        },
        {
          "name": "PURCHASE",
          "value": "purchase",
          "properties": {
            "description": "User completed a purchase"
          }
        }
      ],
      "properties": {
        "description": "Types of user events"
      }
    }
  ]
}

Key MDL Features

1. Model Definition Options:

  • refSql - Define model using raw SQL
  • tableReference - Reference existing database table
  • baseObject - Reference another model

2. Column Types:

  • Regular columns - Direct database columns
  • Calculated fields - Computed using expressions
  • Relationship columns - Connect to other models

3. Relationship Types:

  • ONE_TO_ONE - 1:1 relationship
  • ONE_TO_MANY - 1:N relationship
  • MANY_TO_ONE - N:1 relationship
  • MANY_TO_MANY - N:N relationship

4. Metric Components:

  • Dimensions - Categorical data for grouping
  • Measures - Quantitative calculations
  • Time Grains - Time-based aggregation levels

5. Caching & Performance:

  • Model-level caching with refresh intervals
  • Calculated field pre-computation
  • Metric materialization

This MDL schema provides WrenAI with the semantic understanding needed to generate intelligent SQL queries, optimize performance, and provide business context for data analysis.

Where MDL Schemas Originally Come From

MDL schemas in WrenAI originate from multiple sources depending on how you set up your WrenAI project. Here are the main sources:


1. πŸ”Œ External Database Connections (Primary Source)

The most common source of MDL schemas is automatic introspection of external databases that you connect to WrenAI.

Supported Data Sources:

  • BigQuery - Google's cloud data warehouse
  • PostgreSQL - Open-source relational database
  • MySQL - Popular relational database
  • Snowflake - Cloud data platform
  • ClickHouse - Columnar database
  • SQL Server (MSSQL) - Microsoft's database
  • Oracle - Enterprise database
  • Trino - Distributed SQL query engine
  • Athena - AWS serverless query service
  • Redshift - AWS data warehouse
  • DuckDB - In-process OLAP database

How Database Introspection Works:

  1. Connection Setup: You provide database credentials through the WrenAI UI
  2. Metadata Discovery: WrenAI uses two adapters to scan your database:
    • Ibis Adapter: For most external databases (BigQuery, PostgreSQL, etc.)
    • Wren Engine Adapter: For DuckDB (direct control)
  3. Table Discovery: The system automatically discovers:
    • All available tables and views
    • Column names and data types
    • Primary keys and constraints
    • Table relationships (foreign keys)
  4. MDL Generation: The MDLBuilder class converts this metadata into MDL format

Example Connection Flow:

// 1. User connects to BigQuery
const connectionInfo = {
  projectId: "my-gcp-project",
  datasetId: "my-dataset", 
  credentials: "service-account-key.json"
};

// 2. WrenAI scans tables automatically
const tables = await ibisAdaptor.getTables(DataSourceName.BIG_QUERY, connectionInfo);

// 3. MDL is generated from discovered metadata
const mdl = mdlBuilder.build({
  project,
  models: discoveredTables,
  columns: discoveredColumns,
  relationships: discoveredRelationships
});

2. πŸ“‹ Pre-built Sample Datasets

WrenAI provides curated sample datasets that come with pre-configured MDL schemas for learning and demonstration purposes.

Available Sample Datasets:

E-commerce Dataset

  • Tables: customers, orders, order_items, products, reviews, payments, sellers, geolocation
  • Source: Brazilian e-commerce data (Olist)
  • File Format: Parquet files hosted on S3
  • Use Case: Learning about retail analytics

NBA Dataset

  • Tables: games, teams, players, player_games, line_scores
  • Source: NBA statistics data
  • File Format: CSV files hosted on S3
  • Use Case: Sports analytics examples

HR/Employee Dataset

  • Tables: employees, salaries, departments, titles
  • Source: Employee management data
  • File Format: Parquet files hosted on S3
  • Use Case: HR analytics demonstrations

Music Dataset

  • Tables: artists, albums, tracks, customers, invoices, invoice_lines
  • Source: Music store sales data
  • File Format: CSV files hosted on S3
  • Use Case: Music industry analytics

Sample Dataset Structure:

export const sampleDatasets: Record<string, SampleDataset> = {
  ecommerce: {
    name: SampleDatasetName.ECOMMERCE,
    tables: [
      {
        tableName: 'olist_customers_dataset',
        filePath: 'https://assets.getwren.ai/sample_data/brazilian-ecommerce/olist_customers_dataset.parquet',
        schema: [
          { columnName: 'customer_id', dataType: 'VARCHAR' },
          { columnName: 'customer_city', dataType: 'VARCHAR' }
        ],
        properties: {
          displayName: 'customers',
          description: 'Customer information'
        }
      }
    ],
    relations: [
      {
        fromModelName: 'customers',
        fromColumnName: 'customer_id', 
        toModelName: 'orders',
        toColumnName: 'customer_id',
        type: RelationType.ONE_TO_MANY
      }
    ]
  }
};

3. πŸ—οΈ Manual MDL Creation

You can also manually create or import MDL schemas:

Direct MDL Import

  • Upload JSON files that conform to the MDL schema specification
  • Useful for migrating from other semantic layer tools
  • Allows for custom business logic definitions

SQL-Based Model Creation

{
  "models": [
    {
      "name": "sales_summary",
      "refSql": "SELECT customer_id, SUM(amount) as total_sales FROM orders GROUP BY customer_id",
      "columns": [
        {
          "name": "customer_id",
          "type": "VARCHAR",
          "isCalculated": false
        },
        {
          "name": "total_sales", 
          "type": "DECIMAL",
          "isCalculated": true,
          "expression": "SUM(amount)"
        }
      ]
    }
  ]
}

Bootstrap Initialization

The system can also start from a minimal MDL that contains no models:

{
  "catalog": "test_catalog",
  "schema": "test_schema", 
  "models": []
}

4. πŸ”„ MDL Generation Process

Here's how WrenAI transforms various sources into MDL:

Database β†’ MDL Transformation:

  1. Discovery Phase:

    // Scan database for tables
    const tables = await metadataService.listTables(project);
    const constraints = await metadataService.listConstraints(project);
  2. Model Creation:

    // Convert tables to MDL models
    mdlBuilder.addModel(); // Creates model definitions
    mdlBuilder.addNormalField(); // Adds column definitions
    mdlBuilder.addCalculatedField(); // Adds computed columns
  3. Relationship Mapping:

    // Map foreign keys to relationships
    mdlBuilder.addRelation(); // Creates relationship definitions
  4. Final Assembly:

    // Build complete MDL
    const manifest = mdlBuilder.build();

5. 🎯 MDL Enrichment

Once the basic MDL is created from any source, WrenAI allows you to enrich it with:

Business Context

  • Display Names: User-friendly column names
  • Descriptions: Business meaning of fields
  • Categories: Grouping related fields

Advanced Features

  • Calculated Fields: Custom computed columns
  • Metrics: Pre-defined aggregations
  • Views: Saved SQL queries
  • Relationships: Business logic connections

Performance Optimizations

  • Caching: Model-level caching strategies
  • Refresh Schedules: Automated data updates
  • Indexing: Vector embeddings for AI queries

Summary

MDL schemas in WrenAI primarily originate from:

  1. πŸ”Œ Database Introspection (the most common path) - Automatic scanning of connected databases
  2. πŸ“‹ Sample Datasets - Pre-built examples for learning and demos
  3. πŸ—οΈ Manual Creation (the least common path) - Custom MDL files or SQL-based models

The system is designed to minimize manual work by automatically discovering your data structure and generating semantically rich MDL schemas that can then be enhanced with business context and optimizations through the WrenAI interface.

This approach allows data teams to quickly get started with AI-powered analytics while maintaining the flexibility to customize and extend their semantic layer as needed.

MDL to Vector Documents Transformation in WrenAI

Overview

This document explains how WrenAI's DB Schema Indexing pipeline transforms Model Definition Language (MDL) files into vector documents that are stored in the vector database for semantic search. The transformation process involves parsing, chunking, preprocessing, and structuring the schema information into searchable document formats.

Architecture Overview

graph TB
    subgraph "Input Layer"
        MDL[MDL File<br/>JSON Schema]
    end
    
    subgraph "Processing Pipeline"
        VAL[MDL Validator]
        CHUNK[DDL Chunker]
        PREPROC[Model Preprocessor]
        HELPER[Helper System]
        EMB[Embedder]
    end
    
    subgraph "Document Types"
        TABLE[TABLE Documents]
        COLUMNS[TABLE_COLUMNS Documents]
        VIEW[VIEW Documents]
        METRIC[METRIC Documents]
        FK[FOREIGN_KEY Documents]
    end
    
    subgraph "Storage Layer"
        VECTOR[Vector Store<br/>Qdrant]
        META[Metadata Store]
    end
    
    MDL --> VAL
    VAL --> CHUNK
    CHUNK --> PREPROC
    PREPROC --> HELPER
    HELPER --> TABLE
    HELPER --> COLUMNS
    HELPER --> VIEW
    HELPER --> METRIC
    HELPER --> FK
    
    TABLE --> EMB
    COLUMNS --> EMB
    VIEW --> EMB
    METRIC --> EMB
    FK --> EMB
    
    EMB --> VECTOR
    EMB --> META
    
    style MDL fill:#e3f2fd
    style VECTOR fill:#e8f5e8
    style TABLE fill:#fff3e0
    style COLUMNS fill:#f3e5f5
    style VIEW fill:#e0f2f1
    style METRIC fill:#fce4ec
    style FK fill:#fff8e1

MDL Structure to Document Mapping

1. MDL Schema Components

graph LR
    subgraph "MDL Schema"
        MODELS[Models]
        VIEWS[Views]
        RELATIONSHIPS[Relationships]
        METRICS[Metrics]
        
        subgraph "Model Components"
            COLS[Columns]
            PROPS[Properties]
            PK[Primary Keys]
            CALC[Calculated Fields]
            NESTED[Nested Columns]
        end
        
        MODELS --> COLS
        MODELS --> PROPS
        MODELS --> PK
        MODELS --> CALC
        MODELS --> NESTED
    end
    
    subgraph "Document Types Generated"
        TD[TABLE Documents]
        TCD[TABLE_COLUMNS Documents]
        VD[VIEW Documents]
        MD[METRIC Documents]
        FKD[FOREIGN_KEY Documents]
    end
    
    MODELS --> TD
    COLS --> TCD
    CALC --> TCD
    NESTED --> TCD
    RELATIONSHIPS --> FKD
    VIEWS --> VD
    METRICS --> MD
    
    style MODELS fill:#e1f5fe
    style TD fill:#fff3e0
    style TCD fill:#f3e5f5

Document Transformation Process

2. Model to TABLE Document Transformation

sequenceDiagram
    participant MDL as MDL Model
    participant PROC as Model Preprocessor
    participant CHUNK as DDL Chunker
    participant DOC as TABLE Document
    
    MDL->>PROC: Model with properties
    Note over MDL: {"name": "user",<br/>"properties": {<br/>"description": "User table",<br/>"displayName": "Users"}}
    
    PROC->>PROC: Extract metadata
    PROC->>CHUNK: Processed model
    
    CHUNK->>CHUNK: Generate table command
    Note over CHUNK: Creates table structure<br/>with comments
    
    CHUNK->>DOC: TABLE Document
    Note over DOC: {"type": "TABLE",<br/>"comment": "/* metadata */",<br/>"name": "user"}

3. Columns to TABLE_COLUMNS Document Transformation

flowchart TD
    A[Model Columns] --> B{Column Type?}
    
    B -->|Regular Column| C[Column Preprocessor]
    B -->|Calculated Field| D[Calculated Field Handler]
    B -->|Nested Column| E[Nested Column Handler]
    B -->|Relationship Column| F[Skip - Handled Separately]
    
    C --> G[Column Command Generator]
    D --> H[Add Expression Metadata]
    E --> I[Add Nested Metadata]
    
    H --> G
    I --> G
    G --> J[Column Batcher]
    
    J --> K{Batch Size Reached?}
    K -->|Yes| L[Create TABLE_COLUMNS Document]
    K -->|No| M[Add to Current Batch]
    M --> J
    
    L --> N[Vector Document]
    
    style A fill:#e3f2fd
    style N fill:#e8f5e8
    style D fill:#fff3e0
    style E fill:#f3e5f5

Column Batching Strategy

4. Column Batching Process

graph TB
    subgraph "Input: Model with 7 Columns"
        C1[Column 1: id]
        C2[Column 2: name]
        C3[Column 3: email]
        C4[Column 4: age]
        C5[Column 5: profile_data]
        C6[Column 6: created_at]
        C7[Column 7: updated_at]
    end
    
    subgraph "Batch Size = 3"
        subgraph "Batch 1"
            B1C1[id]
            B1C2[name]
            B1C3[email]
        end
        
        subgraph "Batch 2"
            B2C1[age]
            B2C2[profile_data]
            B2C3[created_at]
        end
        
        subgraph "Batch 3"
            B3C1[updated_at]
        end
    end
    
    subgraph "Generated Documents"
        DOC1[TABLE_COLUMNS Document 1<br/>Contains: id, name, email]
        DOC2[TABLE_COLUMNS Document 2<br/>Contains: age, profile_data, created_at]
        DOC3[TABLE_COLUMNS Document 3<br/>Contains: updated_at]
    end
    
    C1 --> B1C1
    C2 --> B1C2
    C3 --> B1C3
    C4 --> B2C1
    C5 --> B2C2
    C6 --> B2C3
    C7 --> B3C1
    
    B1C1 --> DOC1
    B1C2 --> DOC1
    B1C3 --> DOC1
    B2C1 --> DOC2
    B2C2 --> DOC2
    B2C3 --> DOC2
    B3C1 --> DOC3
    
    style DOC1 fill:#e8f5e8
    style DOC2 fill:#e8f5e8
    style DOC3 fill:#e8f5e8
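
The batching step itself is simple to express in code. Below is a minimal sketch (not the actual WrenAI implementation) of how a model's columns could be split into fixed-size batches, each batch becoming one TABLE_COLUMNS document; the batch size of 3 and the document shape are assumptions chosen to match the diagram above.

from typing import Iterator

def batch_columns(columns: list[dict], batch_size: int = 3) -> Iterator[list[dict]]:
    """Yield successive fixed-size batches of column definitions."""
    for start in range(0, len(columns), batch_size):
        yield columns[start : start + batch_size]

def build_table_columns_documents(table_name: str, columns: list[dict]) -> list[dict]:
    """Turn each batch of columns into one TABLE_COLUMNS document payload."""
    return [
        {"type": "TABLE_COLUMNS", "name": table_name, "columns": batch}
        for batch in batch_columns(columns)
    ]

# 7 columns with batch size 3 -> 3 documents (3 + 3 + 1 columns)
docs = build_table_columns_documents(
    "users",
    [{"name": c, "data_type": "VARCHAR"} for c in
     ["id", "name", "email", "age", "profile_data", "created_at", "updated_at"]],
)
print(len(docs))  # 3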

Relationship Processing

5. Relationship to FOREIGN_KEY Document Transformation

graph TB
    subgraph "MDL Relationship Definition"
        REL["Relationship: UserOrders<br/>condition: user.id = order.user_id<br/>joinType: ONE_TO_MANY<br/>models: [user, order]"]
    end
    
    subgraph "Processing Logic"
        PARSE[Parse Condition]
        EXTRACT[Extract FK Columns]
        BUILD[Build Constraint]
    end
    
    subgraph "Generated FOREIGN_KEY Documents"
        FK1["Document for user table<br/>FOREIGN KEY id REFERENCES order user_id"]
        FK2["Document for order table<br/>FOREIGN KEY user_id REFERENCES user id"]
    end
    
    REL --> PARSE
    PARSE --> EXTRACT
    EXTRACT --> BUILD
    BUILD --> FK1
    BUILD --> FK2
    
    style REL fill:#e3f2fd
    style FK1 fill:#fff8e1
    style FK2 fill:#fff8e1
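
As a rough illustration of the relationship processing above, the sketch below parses a join condition of the form "left_table.left_col = right_table.right_col" and emits one FOREIGN_KEY entry per side. It assumes simple equality conditions and uses a simplified output shape; the real chunker produces document content rather than plain dicts.

def relationship_to_foreign_keys(condition: str, join_type: str) -> list[dict]:
    """Parse 'user.id = order.user_id' into one FK entry per participating table."""
    left, right = (part.strip() for part in condition.split("="))
    left_table, left_col = left.split(".")
    right_table, right_col = right.split(".")
    return [
        {
            "type": "FOREIGN_KEY",
            "table": left_table,
            "constraint": f"FOREIGN KEY ({left_col}) REFERENCES {right_table}({right_col})",
            "join_type": join_type,
        },
        {
            "type": "FOREIGN_KEY",
            "table": right_table,
            "constraint": f"FOREIGN KEY ({right_col}) REFERENCES {left_table}({left_col})",
            "join_type": join_type,
        },
    ]

print(relationship_to_foreign_keys("user.id = order.user_id", "ONE_TO_MANY"))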

Complex Data Type Handling

6. Nested Column Processing

graph TB
    subgraph "Input: Nested Column"
        NC["Column: user_profile<br/>Type: STRUCT with name STRING and addresses ARRAY<br/>Properties: nested metadata"]
    end
    
    subgraph "Helper System Processing"
        HELPER[Properties Helper]
        NESTED[Nested Column Extractor]
        JSON[JSON Serializer]
    end
    
    subgraph "Generated Comment Structure"
        COMMENT["Comment: JSON metadata<br/>alias: User Profile<br/>description: User profile data<br/>nested_columns: extracted paths<br/>nested.user_profile.name<br/>nested.user_profile.addresses"]
    end
    
    subgraph "Final Document"
        DOC["TABLE_COLUMNS Document<br/>with nested metadata"]
    end
    
    NC --> HELPER
    HELPER --> NESTED
    NESTED --> JSON
    JSON --> COMMENT
    COMMENT --> DOC
    
    style NC fill:#e3f2fd
    style COMMENT fill:#fff3e0
    style DOC fill:#e8f5e8
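
The nested-column handling can be approximated with a small helper that collects the nestedDisplayName.* / nestedDescription.* property keys (the key naming follows the MDL examples earlier in this document) into a JSON comment attached to the column. The output format here is an assumption for illustration, not the exact comment layout produced by the helper system.

import json

def nested_column_comment(column: dict) -> str:
    """Serialize a column's nested metadata into a comment string."""
    props = column.get("properties", {})
    nested = {
        key: value
        for key, value in props.items()
        if key.startswith(("nestedDisplayName.", "nestedDescription."))
    }
    comment = {
        "alias": props.get("displayName"),
        "description": props.get("description"),
        "nested_columns": nested,
    }
    return f"-- {json.dumps(comment)}"

column = {
    "name": "event_params",
    "type": "ARRAY<STRUCT<key STRING, value STRING>>",
    "properties": {
        "displayName": "Event Parameters",
        "description": "Event parameters as nested structure",
        "nestedDisplayName.event_params.key": "Parameter Key",
        "nestedDescription.event_params.key": "Parameter name",
    },
}
print(nested_column_comment(column))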

Document Content Examples

7. Complete Transformation Example

graph LR
    subgraph "MDL Input"
        MDL_IN["
        {
          'models': [{
            'name': 'customers',
            'properties': {
              'description': 'Customer data',
              'displayName': 'Customers'
            },
            'columns': [
              {
                'name': 'id',
                'type': 'BIGINT',
                'isPrimaryKey': true
              },
              {
                'name': 'profile',
                'type': 'STRUCT<name STRING>',
                'properties': {
                  'nested.profile.name': {...}
                }
              },
              {
                'name': 'total_orders',
                'type': 'INTEGER',
                'isCalculated': true,
                'expression': 'COUNT(orders.id)'
              }
            ]
          }],
          'views': [{
            'name': 'customer_summary',
            'statement': 'SELECT...'
          }],
          'relationships': [{
            'name': 'CustomerOrders',
            'condition': 'customers.id = orders.customer_id'
          }]
        }
        "]
    end
    
    subgraph "Generated Documents"
        DOC1["TABLE Document
        {
          'type': 'TABLE',
          'name': 'customers',
          'comment': '/* metadata */'
        }"]
        
        DOC2["TABLE_COLUMNS Document
        {
          'type': 'TABLE_COLUMNS',
          'columns': [
            {
              'type': 'COLUMN',
              'name': 'id',
              'is_primary_key': true
            }
          ]
        }"]
        
        DOC3["TABLE_COLUMNS Document
        {
          'type': 'TABLE_COLUMNS',
          'columns': [
            {
              'type': 'COLUMN',
              'name': 'profile',
              'comment': '-- nested metadata'
            }
          ]
        }"]
        
        DOC4["TABLE_COLUMNS Document
        {
          'type': 'TABLE_COLUMNS',
          'columns': [
            {
              'type': 'COLUMN',
              'name': 'total_orders',
              'comment': '-- Calculated Field'
            }
          ]
        }"]
        
        DOC5["VIEW Document
        {
          'type': 'VIEW',
          'name': 'customer_summary',
          'statement': 'SELECT...'
        }"]
        
        DOC6["FOREIGN_KEY Document
        {
          'type': 'FOREIGN_KEY',
          'constraint': 'FOREIGN KEY...'
        }"]
    end
    
    MDL_IN --> DOC1
    MDL_IN --> DOC2
    MDL_IN --> DOC3
    MDL_IN --> DOC4
    MDL_IN --> DOC5
    MDL_IN --> DOC6
    
    style MDL_IN fill:#e3f2fd
    style DOC1 fill:#fff3e0
    style DOC2 fill:#f3e5f5
    style DOC3 fill:#f3e5f5
    style DOC4 fill:#f3e5f5
    style DOC5 fill:#e0f2f1
    style DOC6 fill:#fff8e1

Vector Embedding Process

8. Document to Vector Transformation

sequenceDiagram
    participant DOCS as Generated Documents
    participant EMB as Embedder
    participant VECTOR as Vector Store
    participant META as Metadata Store
    
    DOCS->>EMB: Document batch
    Note over DOCS: Multiple document types<br/>with structured content
    
    EMB->>EMB: Generate embeddings
    Note over EMB: Text β†’ Vector<br/>(1536 dimensions for OpenAI)
    
    EMB->>VECTOR: Store vectors
    Note over VECTOR: Vector + Metadata<br/>in Qdrant
    
    EMB->>META: Store metadata
    Note over META: Document metadata<br/>for filtering
    
    VECTOR-->>EMB: Confirmation
    META-->>EMB: Confirmation
    EMB-->>DOCS: Indexing complete
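
For orientation, here is a stand-alone sketch of this final step using the qdrant-client library directly. WrenAI goes through its own embedder and document-store abstractions, so the collection name, the placeholder embed() function, and the payload fields below are assumptions made only for this example.

import uuid
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

def embed(text: str) -> list[float]:
    # Placeholder embedding: in practice the configured embedding model
    # (e.g. OpenAI) is called here and returns a 1536-dimensional vector.
    return [0.0] * 1536

client = QdrantClient(url="http://localhost:6333")   # assumes a local Qdrant on its default port
collection = "db_schema"                             # assumed collection name

client.recreate_collection(
    collection_name=collection,
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

documents = [
    {
        "content": '{"type": "TABLE", "name": "customers", "comment": "/* Customer data */"}',
        "meta": {"type": "TABLE_SCHEMA", "name": "customers", "project_id": "p1"},
    },
]

client.upsert(
    collection_name=collection,
    points=[
        PointStruct(id=str(uuid.uuid4()), vector=embed(doc["content"]), payload=doc["meta"])
        for doc in documents
    ],
)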

Search and Retrieval Flow

9. Query to Document Retrieval

graph TB
    subgraph "User Query"
        QUERY["Natural Language Query:<br/>Show me customer information"]
    end
    
    subgraph "Retrieval Process"
        EMBED[Query Embedding]
        SEARCH[Vector Similarity Search]
        FILTER[Metadata Filtering]
        RANK[Relevance Ranking]
    end
    
    subgraph "Retrieved Documents"
        R1["TABLE: customers<br/>Relevance: 0.95"]
        R2["TABLE_COLUMNS: customer fields<br/>Relevance: 0.89"]
        R3["VIEW: customer_summary<br/>Relevance: 0.82"]
        R4["METRIC: customer_metrics<br/>Relevance: 0.78"]
    end
    
    subgraph "Schema Context"
        CONTEXT["Assembled Schema Context<br/>for SQL Generation"]
    end
    
    QUERY --> EMBED
    EMBED --> SEARCH
    SEARCH --> FILTER
    FILTER --> RANK
    
    RANK --> R1
    RANK --> R2
    RANK --> R3
    RANK --> R4
    
    R1 --> CONTEXT
    R2 --> CONTEXT
    R3 --> CONTEXT
    R4 --> CONTEXT
    
    style QUERY fill:#e3f2fd
    style CONTEXT fill:#e8f5e8
    style R1 fill:#fff3e0
    style R2 fill:#f3e5f5
    style R3 fill:#e0f2f1
    style R4 fill:#fce4ec
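
A corresponding retrieval sketch, again calling qdrant-client directly rather than WrenAI's pipeline code (the collection name, the project_id filter field, and the placeholder embed() function are the same assumptions as in the indexing sketch above):

from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

def embed(text: str) -> list[float]:
    # Same placeholder embedding as in the indexing sketch above.
    return [0.0] * 1536

client = QdrantClient(url="http://localhost:6333")

hits = client.search(
    collection_name="db_schema",                        # assumed collection name
    query_vector=embed("Show me customer information"),
    query_filter=Filter(
        must=[FieldCondition(key="project_id", match=MatchValue(value="p1"))]
    ),
    limit=10,                                            # top-k schema documents
)

# Each hit carries the stored metadata and a similarity score used for ranking.
for hit in hits:
    print(hit.score, hit.payload.get("name"))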

Performance Optimizations

10. Column Pruning in Retrieval

graph TB
    subgraph "Initial Retrieval"
        LARGE[Large Schema Context<br/>All columns retrieved<br/>~10,000 tokens]
    end
    
    subgraph "Column Pruning Process"
        LLM[Column Selection LLM]
        ANALYZE[Analyze Query Intent]
        SELECT[Select Relevant Columns]
        REASON[Generate Reasoning]
    end
    
    subgraph "Optimized Context"
        SMALL[Pruned Schema Context<br/>Only relevant columns<br/>~2,000 tokens]
    end
    
    subgraph "Benefits"
        SPEED[Faster Processing]
        COST[Lower LLM Costs]
        QUALITY[Better Focus]
    end
    
    LARGE --> LLM
    LLM --> ANALYZE
    ANALYZE --> SELECT
    SELECT --> REASON
    REASON --> SMALL
    
    SMALL --> SPEED
    SMALL --> COST
    SMALL --> QUALITY
    
    style LARGE fill:#ffebee
    style SMALL fill:#e8f5e8
    style SPEED fill:#e0f2f1
    style COST fill:#e0f2f1
    style QUALITY fill:#e0f2f1
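
Conceptually, column pruning is a single LLM call that maps (question, full schema context) to a subset of columns. The sketch below shows the shape of that step; the call_llm parameter is a hypothetical stand-in for whatever LLM provider is configured, and the prompt wording and JSON contract are illustrative only.

import json
from typing import Callable

def prune_columns(
    question: str,
    schema_docs: list[str],
    call_llm: Callable[[str], str],
) -> list[str]:
    """Ask an LLM which columns are actually needed to answer the question."""
    prompt = (
        "Given the question and the database schema below, return a JSON array "
        "of the fully-qualified column names needed to answer it.\n\n"
        f"Question: {question}\n\nSchema:\n" + "\n".join(schema_docs)
    )
    return json.loads(call_llm(prompt))

# Stubbed LLM response for illustration; a real deployment calls the configured provider.
fake_llm = lambda _prompt: '["customers.customer_id", "customers.customer_city"]'
print(prune_columns("Show me customer information",
                    ["CREATE TABLE customers (customer_id VARCHAR, customer_city VARCHAR)"],
                    fake_llm))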

Document Metadata Structure

11. Document Metadata Schema

classDiagram
    class Document {
        +String id
        +Dict meta
        +String content
    }
    
    class DocumentMeta {
        +String type
        +String name
        +String project_id
        +Dict additional_fields
    }
    
    class TableDocument {
        +type: "TABLE_SCHEMA"
        +name: table_name
    }
    
    class ColumnDocument {
        +type: "TABLE_SCHEMA"
        +name: table_name
        +column_info: embedded_in_content
    }
    
    class ViewDocument {
        +type: "TABLE_SCHEMA"
        +name: view_name
        +statement: embedded_in_content
    }
    
    class MetricDocument {
        +type: "TABLE_SCHEMA"
        +name: metric_name
        +base_object: embedded_in_content
    }
    
    Document --> DocumentMeta
    DocumentMeta <|-- TableDocument
    DocumentMeta <|-- ColumnDocument
    DocumentMeta <|-- ViewDocument
    DocumentMeta <|-- MetricDocument
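
In code, each stored document boils down to a content string plus a small metadata dict. A plain-Python approximation of the class diagram above (field names follow the diagram; the concrete values are illustrative):

from dataclasses import dataclass, field

@dataclass
class SchemaDocument:
    """Simplified stand-in for the documents written to the vector store."""
    id: str
    content: str                      # serialized TABLE / TABLE_COLUMNS / VIEW / METRIC payload
    meta: dict = field(default_factory=dict)

doc = SchemaDocument(
    id="customers-table",
    content='{"type": "TABLE", "name": "customers", "comment": "/* Customer data */"}',
    meta={"type": "TABLE_SCHEMA", "name": "customers", "project_id": "p1"},
)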

Content Structure Patterns

12. Document Content Patterns

graph TB
    subgraph "Content Structure Patterns"
        subgraph "TABLE Pattern"
            T_TYPE[type: 'TABLE']
            T_NAME[name: table_name]
            T_COMMENT[comment: metadata]
        end
        
        subgraph "TABLE_COLUMNS Pattern"
            C_TYPE[type: 'TABLE_COLUMNS']
            C_COLS[columns: array]
            
            subgraph "Column Object"
                COL_TYPE[type: 'COLUMN'/'FOREIGN_KEY']
                COL_NAME[name: column_name]
                COL_DATA[data_type: sql_type]
                COL_COMMENT[comment: semantic_info]
                COL_PK[is_primary_key: boolean]
            end
        end
        
        subgraph "VIEW Pattern"
            V_TYPE[type: 'VIEW']
            V_NAME[name: view_name]
            V_STMT[statement: sql_query]
            V_COMMENT[comment: description]
        end
        
        subgraph "METRIC Pattern"
            M_TYPE[type: 'METRIC']
            M_NAME[name: metric_name]
            M_BASE[base_object: source_table]
            M_COLS[columns: dimensions_measures]
            M_COMMENT[comment: metric_info]
        end
    end
    
    C_COLS --> COL_TYPE
    C_COLS --> COL_NAME
    C_COLS --> COL_DATA
    C_COLS --> COL_COMMENT
    C_COLS --> COL_PK
    
    style T_TYPE fill:#fff3e0
    style C_TYPE fill:#f3e5f5
    style V_TYPE fill:#e0f2f1
    style M_TYPE fill:#fce4ec

Error Handling and Validation

13. Validation and Error Handling Flow

flowchart TD
    A[MDL Input] --> B[MDL Validator]
    B --> C{Valid JSON?}
    C -->|No| D[JSON Error]
    C -->|Yes| E[Structure Validation]
    
    E --> F{Required Fields?}
    F -->|Missing| G[Add Default Fields]
    F -->|Present| H[DDL Chunker]
    
    G --> H
    H --> I[Model Preprocessor]
    I --> J{Processing Error?}
    J -->|Yes| K[Log Error & Skip]
    J -->|No| L[Generate Documents]
    
    K --> M[Partial Success]
    L --> N[Embedding Process]
    N --> O{Embedding Error?}
    O -->|Yes| P[Retry with Backoff]
    O -->|No| Q[Store in Vector DB]
    
    P --> Q
    Q --> R[Success]
    
    style D fill:#ffebee
    style K fill:#fff3e0
    style M fill:#fff3e0
    style R fill:#e8f5e8
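
A minimal validation pass along the lines of this flow might look like the following sketch. The set of required top-level keys mirrors the MDL examples in this document, but the defaults and the function itself are assumptions for illustration.

import json

TOP_LEVEL_DEFAULTS = {
    "models": [],
    "relationships": [],
    "views": [],
    "metrics": [],
}

def validate_mdl(raw: str) -> dict:
    """Parse MDL JSON and fill in missing top-level sections with empty defaults."""
    try:
        mdl = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Invalid MDL JSON: {exc}") from exc
    for key, default in TOP_LEVEL_DEFAULTS.items():
        mdl.setdefault(key, default)
    return mdl

mdl = validate_mdl('{"catalog": "test_catalog", "schema": "test_schema", "models": []}')
print(sorted(mdl.keys()))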

Performance Metrics and Monitoring

14. Pipeline Performance Monitoring

graph TB
    subgraph "Input Metrics"
        MDL_SIZE[MDL File Size]
        MODEL_COUNT[Model Count]
        COLUMN_COUNT[Column Count]
        REL_COUNT[Relationship Count]
    end
    
    subgraph "Processing Metrics"
        PROC_TIME[Processing Time]
        DOC_COUNT[Documents Generated]
        BATCH_COUNT[Batches Created]
        ERROR_RATE[Error Rate]
    end
    
    subgraph "Output Metrics"
        EMBED_TIME[Embedding Time]
        STORAGE_TIME[Storage Time]
        VECTOR_COUNT[Vectors Stored]
        SUCCESS_RATE[Success Rate]
    end
    
    subgraph "Quality Metrics"
        COVERAGE[Schema Coverage]
        SEARCHABILITY[Search Quality]
        RETRIEVAL_ACC[Retrieval Accuracy]
        LATENCY[Query Latency]
    end
    
    MDL_SIZE --> PROC_TIME
    MODEL_COUNT --> DOC_COUNT
    COLUMN_COUNT --> BATCH_COUNT
    
    PROC_TIME --> EMBED_TIME
    DOC_COUNT --> VECTOR_COUNT
    BATCH_COUNT --> STORAGE_TIME
    
    VECTOR_COUNT --> COVERAGE
    SUCCESS_RATE --> SEARCHABILITY
    ERROR_RATE --> RETRIEVAL_ACC
    
    style PROC_TIME fill:#e3f2fd
    style EMBED_TIME fill:#e8f5e8
    style COVERAGE fill:#e0f2f1
    style RETRIEVAL_ACC fill:#fce4ec

Conclusion

The MDL to Vector Documents transformation in WrenAI is a sophisticated process that:

  1. Preserves Semantic Information: Rich metadata and comments maintain the business context
  2. Enables Efficient Search: Structured document types allow for targeted retrieval
  3. Supports Complex Data Types: Handles nested columns, calculated fields, and relationships
  4. Optimizes Performance: Column batching and pruning reduce processing overhead
  5. Maintains Data Integrity: Comprehensive validation and error handling ensure reliability

This transformation enables WrenAI to provide intelligent, context-aware SQL generation by making database schemas semantically searchable through vector embeddings while preserving the rich structural and business information contained in the original MDL files.

References

  • Source Code: wren-ai-service/src/pipelines/indexing/db_schema.py
  • Tests: wren-ai-service/tests/pytest/pipelines/indexing/test_db_schema.py
  • Helper System: wren-ai-service/src/pipelines/indexing/utils/helper.py
  • Pipeline Documentation: WrenAI_Pipelines_Documentation.md


/v1/asks Handler Logic - Complete Flow

Based on my analysis of the WrenAI codebase, here's the complete handler logic for the /v1/asks endpoint:

🎯 Endpoint Overview

  • Router: wren-ai-service/src/web/v1/routers/ask.py
  • Service: wren-ai-service/src/web/v1/services/ask.py
  • Method: POST /v1/asks

πŸ“ Request/Response Models

# Request Model
class AskRequest(BaseRequest):
    query: str                                    # Natural language query
    mdl_hash: Optional[str]                      # Model hash/ID (legacy: "id")
    histories: Optional[list[AskHistory]]        # Query history (max 5)
    ignore_sql_generation_reasoning: bool = False
    enable_column_pruning: bool = False

# Response Model  
class AskResponse(BaseModel):
    query_id: str                                # UUID for tracking

# Result Model (via GET /v1/asks/{query_id}/result)
class AskResultResponse(BaseModel):
    status: Literal["understanding", "searching", "planning", 
                   "generating", "correcting", "finished", "failed", "stopped"]
    type: Optional[Literal["GENERAL", "TEXT_TO_SQL"]]
    response: Optional[List[AskResult]]          # Generated SQL results
    error: Optional[AskError]                    # Error details if failed
    # ... additional fields

πŸ”„ Handler Flow Diagram

sequenceDiagram
    participant Client as Client
    participant Router as FastAPI Router
    participant Service as Ask Service
    participant Pipelines as AI Pipelines
    participant Cache as TTL Cache
    participant LLM as LLM Provider
    
    Client->>Router: POST /v1/asks
    Router->>Router: Generate UUID
    Router->>Cache: Store initial status="understanding"
    Router->>Service: Background task: ask()
    Router-->>Client: Return query_id
    
    Note over Service: Async Processing Begins
    
    Service->>Pipelines: 1. Historical Question Retrieval
    alt Historical Result Found
        Service->>Cache: Update status="finished"
        Service-->>Client: Return cached result
    else No Historical Result
        Service->>Pipelines: 2. SQL Pairs + Instructions Retrieval
        Service->>Pipelines: 3. Intent Classification
        
        alt Intent = MISLEADING_QUERY
            Service->>Pipelines: Misleading Assistance
            Service->>Cache: Update status="finished", type="GENERAL"
        else Intent = GENERAL  
            Service->>Pipelines: Data Assistance
            Service->>Cache: Update status="finished", type="GENERAL"
        else Intent = TEXT_TO_SQL
            Service->>Cache: Update status="searching"
            Service->>Pipelines: 4. DB Schema Retrieval
            
            alt No Relevant Data
                Service->>Cache: Update status="failed", error="NO_RELEVANT_DATA"
            else Data Found
                Service->>Cache: Update status="planning"
                Service->>Pipelines: 5. SQL Generation Reasoning
                Service->>Cache: Update status="generating"
                Service->>Pipelines: 6. SQL Generation
                
                alt Valid SQL Generated
                    Service->>Cache: Update status="finished", response=SQL
                else Invalid SQL
                    Service->>Cache: Update status="correcting"
                    Service->>Pipelines: 7. SQL Correction (retry up to 3x)
                    
                    alt Correction Successful
                        Service->>Cache: Update status="finished", response=SQL
                    else Correction Failed
                        Service->>Cache: Update status="failed", error="NO_RELEVANT_SQL"
                    end
                end
            end
        end
    end

πŸ—οΈ Detailed Handler Logic

1. Router Handler (ask.py:81-101)

@router.post("/asks")
async def ask(
    ask_request: AskRequest,
    background_tasks: BackgroundTasks,
    service_container: ServiceContainer = Depends(get_service_container),
    service_metadata: ServiceMetadata = Depends(get_service_metadata),
) -> AskResponse:
    # Generate unique query ID
    query_id = str(uuid.uuid4())
    ask_request.query_id = query_id
    
    # Initialize result cache with "understanding" status
    service_container.ask_service._ask_results[query_id] = AskResultResponse(
        status="understanding",
    )

    # Start async processing in background
    background_tasks.add_task(
        service_container.ask_service.ask,
        ask_request,
        service_metadata=asdict(service_metadata),
    )
    
    # Return immediately with query_id for tracking
    return AskResponse(query_id=query_id)

2. Service Processing Pipeline (ask.py:170-628)

The main ask() method processes through multiple stages:

Stage 1: Understanding (Historical Question Check)
# Status: "understanding"
historical_question = await self._pipelines["historical_question"].run(
    query=user_query,
    project_id=ask_request.project_id,
)

# If exact match found, return immediately
if historical_question_result:
    api_results = [AskResult(sql=result.get("statement"), ...)]
    # Skip to finished

Stage 2: Intent Classification
# Retrieve SQL samples and instructions concurrently
sql_samples_task, instructions_task = await asyncio.gather(
    self._pipelines["sql_pairs_retrieval"].run(...),
    self._pipelines["instructions_retrieval"].run(...),
)

# Classify user intent
intent_classification_result = await self._pipelines["intent_classification"].run(...)
intent = intent_classification_result.get("intent")

# Handle different intents
if intent == "MISLEADING_QUERY":
    # Provide guidance for unclear queries
    asyncio.create_task(self._pipelines["misleading_assistance"].run(...))
    return # status="finished", type="GENERAL"
    
elif intent == "GENERAL":
    # Provide data insights without SQL
    asyncio.create_task(self._pipelines["data_assistance"].run(...))
    return # status="finished", type="GENERAL"
    
elif intent == "USER_GUIDE":
    # Provide user guidance
    asyncio.create_task(self._pipelines["user_guide_assistance"].run(...))
    return # status="finished", type="GENERAL"

Stage 3: Schema Retrieval
# Status: "searching"
retrieval_result = await self._pipelines["db_schema_retrieval"].run(
    query=user_query,
    histories=histories,
    project_id=ask_request.project_id,
    enable_column_pruning=enable_column_pruning,
)

documents = retrieval_result.get("construct_retrieval_results", {}).get("retrieval_results", [])

if not documents:
    # No relevant schema found
    return # status="failed", error="NO_RELEVANT_DATA"

Stage 4: SQL Generation Planning
# Status: "planning" (if reasoning enabled)
if histories:
    sql_generation_reasoning = await self._pipelines["followup_sql_generation_reasoning"].run(...)
else:
    sql_generation_reasoning = await self._pipelines["sql_generation_reasoning"].run(...)

Stage 5: SQL Generation
# Status: "generating"
if histories:
    text_to_sql_generation_results = await self._pipelines["followup_sql_generation"].run(...)
else:
    text_to_sql_generation_results = await self._pipelines["sql_generation"].run(...)

# Check if valid SQL was generated
if sql_valid_result := text_to_sql_generation_results["post_process"]["valid_generation_result"]:
    api_results = [AskResult(sql=sql_valid_result.get("sql"), type="llm")]

Stage 6: SQL Correction (if needed)
# Status: "correcting" 
elif failed_dry_run_result := text_to_sql_generation_results["post_process"]["invalid_generation_result"]:
    # Retry up to max_sql_correction_retries (default: 3)
    while current_sql_correction_retries < max_sql_correction_retries:
        sql_correction_results = await self._pipelines["sql_correction"].run(...)
        
        if valid_generation_result := sql_correction_results["post_process"]["valid_generation_result"]:
            api_results = [AskResult(sql=valid_generation_result.get("sql"), type="llm")]
            break

Stage 7: Completion
if api_results:
    # Status: "finished"
    self._ask_results[query_id] = AskResultResponse(
        status="finished",
        type="TEXT_TO_SQL", 
        response=api_results,
        ...
    )
else:
    # Status: "failed"
    self._ask_results[query_id] = AskResultResponse(
        status="failed",
        error=AskError(code="NO_RELEVANT_SQL", message="No relevant SQL"),
        ...
    )

πŸ”§ Key Features

Asynchronous Processing

  • Returns query_id immediately
  • Processes in background using FastAPI BackgroundTasks
  • Client polls /v1/asks/{query_id}/result for status updates

Status Progression

understanding β†’ searching β†’ planning β†’ generating β†’ correcting β†’ finished/failed
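
From the client's point of view, this flow reduces to one POST plus polling. A minimal sketch with the requests library, assuming the AI service is reachable on its default port 5555 (the endpoint paths and field names follow the models shown above):

import time
import requests

BASE_URL = "http://localhost:5555"   # assumed local deployment of the AI service

# 1. Submit the question; the service returns immediately with a query_id.
resp = requests.post(f"{BASE_URL}/v1/asks",
                     json={"query": "What is the monthly revenue trend?"})
query_id = resp.json()["query_id"]

# 2. Poll the result endpoint until a terminal status is reached.
while True:
    result = requests.get(f"{BASE_URL}/v1/asks/{query_id}/result").json()
    status = result["status"]
    print("status:", status)
    if status in ("finished", "failed", "stopped"):
        break
    time.sleep(1)

if status == "finished" and result.get("type") == "TEXT_TO_SQL":
    print(result["response"])        # list of generated SQL candidates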

Error Handling

  • NO_RELEVANT_DATA: No schema matches found
  • NO_RELEVANT_SQL: SQL generation failed
  • OTHERS: Unexpected errors

Caching Strategy

  • Uses TTL Cache (default: 120 seconds, 1M entries)
  • Stores intermediate results for status polling
  • Automatic cleanup of expired queries
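
The caching behaviour described above corresponds to a TTL-bounded in-memory map. A minimal sketch with the cachetools library, using the defaults listed above (120-second TTL, 1M max entries); the key and value shapes are simplified:

from cachetools import TTLCache

# Entries expire 120 seconds after insertion; the oldest are evicted past 1M items.
ask_results = TTLCache(maxsize=1_000_000, ttl=120)

ask_results["query-id-123"] = {"status": "understanding"}
ask_results["query-id-123"] = {"status": "finished", "type": "TEXT_TO_SQL"}

# Within the TTL the latest status is returned; after 120 seconds the entry is
# evicted and a poll would see a cache miss instead.
print(ask_results.get("query-id-123"))   # {'status': 'finished', 'type': 'TEXT_TO_SQL'}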

Pipeline Integration

  • Historical Question: Check for exact matches
  • Intent Classification: Determine query type
  • DB Schema Retrieval: Find relevant tables/columns
  • SQL Generation: Create SQL from natural language
  • SQL Correction: Fix invalid SQL with retries

Configuration Options

  • enable_column_pruning: Optimize retrieved schema
  • ignore_sql_generation_reasoning: Skip reasoning step
  • max_histories: Limit conversation context (default: 5)
  • max_sql_correction_retries: SQL fix attempts (default: 3)

πŸ“Š Performance Optimizations

  1. Concurrent Operations: SQL samples and instructions retrieved in parallel
  2. Early Returns: Historical matches bypass full pipeline
  3. Column Pruning: Reduces schema context size
  4. Retry Logic: Smart SQL correction with timeout handling
  5. Caching: TTL cache prevents duplicate processing

This handler demonstrates a sophisticated AI pipeline that transforms natural language into SQL through multiple intelligent processing stages, with robust error handling and performance optimizations.
