This document outlines the design specification for a high-performance Rust-based web crawler integrated with Apache Kafka and PostgreSQL. The crawler will act as a worker within a distributed system, consuming URLs from Kafka topics, crawling the associated web pages, and storing the results in a PostgreSQL database. Docker Compose will be utilized to manage the infrastructure, ensuring seamless deployment and orchestration of the services.
- Develop a high-performance web crawler using Rust that integrates with Kafka and PostgreSQL.
- Ensure scalable crawling of up to 10 million websites.
- Use Kafka for task distribution, allowing the crawler to act as a worker in a distributed system.
- Store crawled data and metadata in a PostgreSQL database.
- Use Docker Compose to manage the infrastructure and deployment of services.
The crawler will follow a modular architecture with Kafka handling the distribution of crawl tasks and PostgreSQL as the data storage backend. Docker Compose will manage the orchestration of all services, including the crawler, Kafka, and PostgreSQL.
- Crawler Manager: Central component managing the crawling process by interacting with Kafka and PostgreSQL.
- Kafka Consumer: Consumes URLs from Kafka topics and passes them to the Crawler Manager for processing.
- HTTP Request Handler: Manages sending HTTP requests to websites, handling retries, and timeouts.
- HTML Parser: Extracts useful data (e.g., links, metadata) from the HTML content.
- Kafka Producer: Sends the parsed content and metadata to a Kafka topic for further processing.
- PostgreSQL Database: Stores the raw HTML, metadata, and any additional information collected during crawling.
```mermaid
graph TD
subgraph Kafka
KT[Kafka Topics]
end
subgraph PostgreSQL
PDB[(PostgreSQL Database)]
PDB -->|Stores| UT[URLs Table]
PDB -->|Stores| CDT[Crawled Data Table]
PDB -->|Stores| RRT[Robots Rules Table]
end
subgraph "Crawler Manager"
CM[Crawler Manager]
KC[Kafka Consumer]
KP[Kafka Producer]
HRH[HTTP Request Handler]
HP[HTML Parser]
RH[Robots.txt Handler]
end
subgraph "External"
WS[Websites]
end
%% Data Flow
KT -->|Consumes URLs| KC
KC -->|Passes URLs| CM
CM -->|Sends URL| HRH
HRH -->|Fetches HTML| WS
WS -->|Returns HTML| HRH
HRH -->|Passes HTML| HP
HP -->|Extracts data| CM
CM -->|Stores data| PDB
CM -->|Publishes results| KP
KP -->|Sends status| KT
%% Additional Interactions
CM -->|Checks rules| RH
RH -->|Fetches rules| WS
RH -->|Stores rules| PDB
%% Database Interactions
CM -->|Reads/Writes| UT
CM -->|Reads/Writes| CDT
RH -->|Reads/Writes| RRT
%% Styling
classDef kafka fill:#ff9900,stroke:#333,stroke-width:2px;
classDef postgres fill:#336791,stroke:#333,stroke-width:2px,color:#fff;
classDef crawler fill:#deb887,stroke:#333,stroke-width:2px;
classDef external fill:#c0c0c0,stroke:#333,stroke-width:2px;
class KT kafka;
class PDB,UT,CDT,RRT postgres;
class CM,KC,KP,HRH,HP,RH crawler;
class WS external;
```
- The Kafka Consumer retrieves a URL from a Kafka topic and sends it to the Crawler Manager.
- The Crawler Manager passes the URL to the HTTP Request Handler to fetch the content.
- The HTML Parser extracts relevant data from the HTML content.
- The Crawler Manager stores the crawled data (HTML, metadata) in the PostgreSQL Database.
- The Kafka Producer sends a message to a Kafka topic with the status of the crawl and any additional metadata.
- The Robots.txt Handler fetches and stores robots.txt rules, which the Crawler Manager checks before crawling.
- Responsibilities: Coordinates the crawling process by consuming tasks from Kafka and managing the queue of URLs. Also manages interactions with the PostgreSQL database for storing data.
- Implementation:
- Use Kafka to distribute URLs to the crawler workers.
- Implement logic to prioritize and manage URLs.
- Interact with PostgreSQL to store and retrieve crawl data.
- Responsibilities: Listens to Kafka topics for new URLs to crawl.
- Implementation:
- Use the `rust-rdkafka` crate to interface with Kafka (a minimal consumer sketch follows below).
- Deserialize messages to extract URLs and pass them to the Crawler Manager.
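A minimal consumer sketch using `rust-rdkafka`'s `StreamConsumer` is shown below; the broker address, the `crawl_tasks` topic name, and the consumer group id are illustrative assumptions, and the crate's default `tokio`-based async support is assumed.

```rust
use rdkafka::config::ClientConfig;
use rdkafka::consumer::{CommitMode, Consumer, StreamConsumer};
use rdkafka::Message;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Broker address mirrors the docker-compose setup later in this document;
    // topic and group id are placeholder names.
    let consumer: StreamConsumer = ClientConfig::new()
        .set("bootstrap.servers", "kafka:9092")
        .set("group.id", "crawler-workers")
        .set("enable.auto.commit", "false")
        .create()?;
    consumer.subscribe(&["crawl_tasks"])?;

    loop {
        let msg = consumer.recv().await?;
        // Each message payload is expected to be a UTF-8 URL string.
        if let Some(Ok(url)) = msg.payload_view::<str>() {
            println!("received URL: {url}"); // hand off to the Crawler Manager here
        }
        // Commit only after the message has been handed off.
        consumer.commit_message(&msg, CommitMode::Async)?;
    }
}
```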
- Responsibilities: Handles the actual HTTP requests to the websites, including retries and timeouts.
- Implementation:
- Use `reqwest` for making asynchronous HTTP requests.
- Implement retry logic with exponential backoff for robustness (sketched below).
- Handle HTTP headers and User-Agent customization.
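The following sketch shows one way to combine `reqwest` timeouts, a custom User-Agent, and exponential backoff; the 30-second timeout, retry count, and User-Agent string are placeholder values.

```rust
use reqwest::Client;
use std::time::Duration;

fn build_client() -> reqwest::Result<Client> {
    Client::builder()
        .timeout(Duration::from_secs(30)) // per-request timeout
        .user_agent("rust-crawler/0.1 (+https://example.com/bot)") // placeholder UA string
        .build()
}

/// Fetch a page body, retrying transient failures with exponential backoff (1s, 2s, 4s, ...).
async fn fetch_with_retries(client: &Client, url: &str, max_retries: u32) -> reqwest::Result<String> {
    let mut attempt = 0;
    loop {
        let result = client
            .get(url)
            .send()
            .await
            .and_then(|resp| resp.error_for_status()); // treat 4xx/5xx responses as errors
        match result {
            Ok(resp) => return resp.text().await,
            Err(_) if attempt < max_retries => {
                attempt += 1;
                tokio::time::sleep(Duration::from_secs(1u64 << (attempt - 1))).await;
            }
            Err(err) => return Err(err),
        }
    }
}
```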
- Responsibilities: Extracts meaningful content and metadata from the fetched HTML pages.
- Implementation:
- Use the `scraper` crate for HTML parsing (see the example below).
- Extract and sanitize URLs, titles, meta descriptions, and other relevant data.
- Handle edge cases, including malformed HTML and various document structures.
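A possible parsing routine with the `scraper` crate is sketched below; the extracted fields (title, meta description, outgoing links) mirror the metadata listed above, while sanitizing and resolving links against the base URL is left out for brevity.

```rust
use scraper::{Html, Selector};

/// Illustrative subset of fields extracted from one page.
struct PageData {
    title: Option<String>,
    meta_description: Option<String>,
    links: Vec<String>,
}

fn parse_page(html: &str) -> PageData {
    let document = Html::parse_document(html);
    // Selector::parse only fails on invalid CSS, so unwrap is safe for these literals.
    let title_sel = Selector::parse("title").unwrap();
    let meta_sel = Selector::parse(r#"meta[name="description"]"#).unwrap();
    let link_sel = Selector::parse("a[href]").unwrap();

    PageData {
        title: document
            .select(&title_sel)
            .next()
            .map(|t| t.text().collect::<String>().trim().to_string()),
        meta_description: document
            .select(&meta_sel)
            .next()
            .and_then(|m| m.value().attr("content").map(str::to_string)),
        links: document
            .select(&link_sel)
            .filter_map(|a| a.value().attr("href").map(str::to_string))
            .collect(),
    }
}
```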
- Responsibilities: Publishes the crawled data (HTML, metadata) to a Kafka topic for further processing.
- Implementation:
- Use `rust-rdkafka` to produce messages to Kafka (a producer sketch follows below).
- Serialize the crawled data (HTML content, metadata) into a suitable format (e.g., JSON, Avro) before sending it to Kafka.
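A sketch of the producer side with `rust-rdkafka`'s `FutureProducer` and JSON serialization via `serde_json`; the `crawl_results` topic name and the `CrawlResult` fields are illustrative, and Avro with a schema registry would be a heavier-weight alternative.

```rust
use rdkafka::config::ClientConfig;
use rdkafka::producer::{FutureProducer, FutureRecord};
use serde::Serialize;
use std::time::Duration;

// Illustrative result payload; the real schema would carry whatever metadata
// downstream consumers need.
#[derive(Serialize)]
struct CrawlResult {
    url: String,
    status: String,
    title: Option<String>,
}

fn build_producer(broker: &str) -> rdkafka::error::KafkaResult<FutureProducer> {
    ClientConfig::new().set("bootstrap.servers", broker).create()
}

async fn publish_result(
    producer: &FutureProducer,
    result: &CrawlResult,
) -> Result<(), Box<dyn std::error::Error>> {
    // Serialize the result to JSON before producing it.
    let payload = serde_json::to_string(result)?;
    producer
        .send(
            FutureRecord::to("crawl_results")
                .key(&result.url)
                .payload(&payload),
            Duration::from_secs(5), // queue timeout
        )
        .await
        .map_err(|(err, _msg)| err)?;
    Ok(())
}
```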
- Responsibilities: Stores the crawled data and metadata for later analysis and processing.
- Implementation:
- Define a schema to store URLs, HTML content, metadata, and crawl status.
- Use the `postgres` crate to interact with the PostgreSQL database from Rust (see the example below).
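The sketch below stores one crawled page with the synchronous `postgres` client; it assumes the crate's `with-serde_json-1` feature for the `JSONB` column and reuses the `DATABASE_URL` from the Docker Compose file further down. Opening a connection per call is only for illustration; connection pooling is covered under performance considerations.

```rust
use postgres::{Client, NoTls};
use serde_json::json;

/// Store one crawled page in the crawled_data table.
fn store_crawled_page(
    database_url: &str,
    url_id: i32,
    html: &str,
    metadata: &serde_json::Value,
) -> Result<(), postgres::Error> {
    let mut client = Client::connect(database_url, NoTls)?;
    client.execute(
        "INSERT INTO crawled_data (url_id, html_content, metadata, crawl_timestamp)
         VALUES ($1, $2, $3, NOW())",
        &[&url_id, &html, metadata],
    )?;
    Ok(())
}

fn main() -> Result<(), postgres::Error> {
    // Matches the crawler service environment in docker-compose.
    let db = "postgres://crawler_user:crawler_password@postgres:5432/crawler_db";
    store_crawled_page(db, 1, "<html>...</html>", &json!({ "title": "Example" }))
}
```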
- Responsibilities: Ensures the crawler respects the `robots.txt` rules for each website.
- Implementation:
- Fetch and parse `robots.txt` rules before crawling a domain.
- Implement logic to skip disallowed URLs based on the `robots.txt` file (a simplified check is sketched below).
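As a rough illustration of the skip logic, the sketch below applies only the `Disallow` prefixes from the `User-agent: *` group; a production handler would also honor `Allow` rules, agent-specific groups, wildcards, and case-insensitive directives.

```rust
/// Very simplified robots.txt check: collects `Disallow:` prefixes that apply to
/// all user agents ("User-agent: *") and rejects paths that start with any of them.
fn is_allowed(robots_txt: &str, path: &str) -> bool {
    let mut applies = false;
    let mut disallowed: Vec<String> = Vec::new();
    for line in robots_txt.lines() {
        // Drop comments and surrounding whitespace.
        let line = line.split('#').next().unwrap_or("").trim();
        if let Some(agent) = line.strip_prefix("User-agent:") {
            applies = agent.trim() == "*";
        } else if applies {
            if let Some(rule) = line.strip_prefix("Disallow:") {
                let rule = rule.trim();
                if !rule.is_empty() {
                    disallowed.push(rule.to_string());
                }
            }
        }
    }
    !disallowed.iter().any(|prefix| path.starts_with(prefix))
}
```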
The PostgreSQL database will have the following tables:
- `urls`:
  - `id SERIAL PRIMARY KEY`: Unique identifier for each URL.
  - `url TEXT NOT NULL UNIQUE`: The URL of the webpage.
  - `status VARCHAR(20)`: Status of the URL (e.g., pending, in-progress, completed, failed).
  - `last_crawled TIMESTAMP`: The last time the URL was crawled.
- `crawled_data`:
  - `id SERIAL PRIMARY KEY`: Unique identifier for the crawled data.
  - `url_id INTEGER REFERENCES urls(id)`: Foreign key linking to the URL.
  - `html_content TEXT`: The raw HTML content of the webpage.
  - `metadata JSONB`: JSON object containing extracted metadata (e.g., title, meta description).
  - `crawl_timestamp TIMESTAMP`: Timestamp when the data was crawled.
- `robots_rules`:
  - `id SERIAL PRIMARY KEY`: Unique identifier for the robots.txt rules.
  - `domain TEXT NOT NULL UNIQUE`: The domain of the website.
  - `rules JSONB`: JSON object containing the parsed robots.txt rules.
- Index on `urls.url` to speed up lookups and ensure uniqueness.
- Index on `crawled_data.url_id` for fast retrieval of crawled data based on URL (see the DDL sketch below).
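The schema above can be created by the crawler at startup; the sketch below issues the corresponding DDL through the `postgres` crate's `batch_execute` (the index name is arbitrary, and the `UNIQUE` constraint on `urls.url` already provides its index).

```rust
use postgres::{Client, NoTls};

/// Create the tables and indexes described above if they do not already exist.
fn init_schema(client: &mut Client) -> Result<(), postgres::Error> {
    client.batch_execute(
        "
        CREATE TABLE IF NOT EXISTS urls (
            id           SERIAL PRIMARY KEY,
            url          TEXT NOT NULL UNIQUE,
            status       VARCHAR(20),
            last_crawled TIMESTAMP
        );

        CREATE TABLE IF NOT EXISTS crawled_data (
            id              SERIAL PRIMARY KEY,
            url_id          INTEGER REFERENCES urls(id),
            html_content    TEXT,
            metadata        JSONB,
            crawl_timestamp TIMESTAMP
        );

        CREATE TABLE IF NOT EXISTS robots_rules (
            id     SERIAL PRIMARY KEY,
            domain TEXT NOT NULL UNIQUE,
            rules  JSONB
        );

        CREATE INDEX IF NOT EXISTS idx_crawled_data_url_id ON crawled_data (url_id);
        ",
    )
}

fn main() -> Result<(), postgres::Error> {
    let mut client = Client::connect(
        "postgres://crawler_user:crawler_password@postgres:5432/crawler_db",
        NoTls,
    )?;
    init_schema(&mut client)
}
```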
Docker Compose will be used to manage the infrastructure, including the crawler, Kafka, Zookeeper (required by Kafka), and PostgreSQL. This ensures that the entire system can be easily started, stopped, and managed as a cohesive unit.
Here's a sample docker-compose.yml configuration:
```yaml
version: '3.8'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:latest
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
  kafka:
    image: confluentinc/cp-kafka:latest
    depends_on:
      - zookeeper
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    ports:
      - "9092:9092"
  postgres:
    image: postgres:latest
    environment:
      POSTGRES_USER: crawler_user
      POSTGRES_PASSWORD: crawler_password
      POSTGRES_DB: crawler_db
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data
  crawler:
    build: .
    depends_on:
      - kafka
      - postgres
    environment:
      KAFKA_BROKER: kafka:9092
      DATABASE_URL: postgres://crawler_user:crawler_password@postgres:5432/crawler_db
volumes:
  postgres_data:
```
- Building the Crawler: Ensure the Rust-based crawler is built into a Docker image by defining the appropriate `Dockerfile` in the project root.
- Orchestration: Use Docker Compose commands (`up`, `down`, `logs`, etc.) to manage the lifecycle of the services.
- Asynchronous Crawling: Leverage Rust's async/await with `tokio` to handle a large number of concurrent requests (see the sketch below).
- Worker Threads: Implement a thread pool to manage CPU-bound tasks, such as HTML parsing.
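One common pattern for bounding concurrency is a `tokio` semaphore around spawned fetch tasks, as sketched below; the limit of 200 in-flight requests is an arbitrary example value, not a recommendation.

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

/// Crawl a batch of URLs with a bounded number of concurrent requests so the
/// worker does not exhaust sockets or memory.
async fn crawl_all(client: reqwest::Client, urls: Vec<String>) {
    let permits = Arc::new(Semaphore::new(200)); // illustrative concurrency limit
    let mut handles = Vec::new();
    for url in urls {
        // Wait for a free slot before spawning the next fetch task.
        let permit = permits.clone().acquire_owned().await.unwrap();
        let client = client.clone();
        handles.push(tokio::spawn(async move {
            let _permit = permit; // released automatically when the task finishes
            match client.get(&url).send().await {
                Ok(resp) => println!("{url}: {}", resp.status()),
                Err(err) => eprintln!("{url}: {err}"),
            }
        }));
    }
    for handle in handles {
        let _ = handle.await;
    }
}
```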
- Domain-Specific Rate Limits: Apply rate limits per domain to avoid overwhelming servers.
- Global Rate Limit: Implement a global rate limit to control the overall crawl rate.
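A naive per-domain limiter can be built from a shared map of "next allowed" request times, as in the sketch below; a production crawler would likely combine this with a global limiter or a dedicated rate-limiting crate.

```rust
use std::collections::HashMap;
use std::sync::Arc;
use std::time::{Duration, Instant};
use tokio::sync::Mutex;

/// Per-domain politeness delay: each domain gets a "next allowed" time slot,
/// and callers sleep until their slot comes up.
#[derive(Clone)]
struct DomainLimiter {
    next_allowed: Arc<Mutex<HashMap<String, Instant>>>,
    min_delay: Duration,
}

impl DomainLimiter {
    fn new(min_delay: Duration) -> Self {
        Self {
            next_allowed: Arc::new(Mutex::new(HashMap::new())),
            min_delay,
        }
    }

    /// Wait until a request to `domain` is allowed, then reserve the next slot.
    async fn wait_for(&self, domain: &str) {
        let wait = {
            let mut slots = self.next_allowed.lock().await;
            let now = Instant::now();
            let slot = slots
                .get(domain)
                .copied()
                .filter(|t| *t > now)
                .unwrap_or(now);
            slots.insert(domain.to_string(), slot + self.min_delay);
            slot - now
        };
        if !wait.is_zero() {
            tokio::time::sleep(wait).await;
        }
    }
}
```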
- In-Memory Cache: Implement an LRU cache for frequently accessed data (e.g., robots.txt rules) to reduce database load.
- Batch Inserts: Use batch inserts for storing crawled data to improve database performance.
- Connection Pooling: Implement connection pooling to efficiently manage database connections.
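For batch inserts, one option is to pass whole arrays of values and expand them server-side with `UNNEST`, as sketched below for newly discovered URLs; the `pending` status matches the `urls.status` convention above, and `ON CONFLICT` skips URLs that are already queued.

```rust
use postgres::{Client, Error};

/// Insert a batch of newly discovered URLs in a single round trip instead of one
/// INSERT per row; UNNEST expands the two parameter arrays into rows server-side.
fn enqueue_urls(client: &mut Client, urls: &[String]) -> Result<u64, Error> {
    let statuses = vec!["pending".to_string(); urls.len()];
    client.execute(
        "INSERT INTO urls (url, status)
         SELECT * FROM UNNEST($1::text[], $2::text[])
         ON CONFLICT (url) DO NOTHING",
        &[&urls, &statuses],
    )
}
```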
- Implement metrics collection for key performance indicators (e.g., crawl rate, error rate, latency).
- Use a time-series database (e.g., Prometheus) for storing metrics.
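One way to expose such metrics for Prometheus scraping is the `prometheus` crate (an assumption; the design does not mandate a specific client library): counters are registered once and rendered in the text exposition format, typically behind a `/metrics` endpoint.

```rust
use prometheus::{Encoder, IntCounter, Registry, TextEncoder};

fn metrics_example() -> Result<(), Box<dyn std::error::Error>> {
    let registry = Registry::new();
    let pages_crawled = IntCounter::new("pages_crawled_total", "Pages successfully crawled")?;
    let crawl_errors = IntCounter::new("crawl_errors_total", "Failed crawl attempts")?;
    registry.register(Box::new(pages_crawled.clone()))?;
    registry.register(Box::new(crawl_errors.clone()))?;

    // These counters would be incremented from the crawl loop.
    pages_crawled.inc();

    // Render the current state in the Prometheus text format.
    let mut buffer = Vec::new();
    TextEncoder::new().encode(&registry.gather(), &mut buffer)?;
    println!("{}", String::from_utf8(buffer)?);
    Ok(())
}
```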
- Implement structured logging using a crate like `slog` for easier log analysis (a typical setup is sketched below).
- Use log levels appropriately to distinguish between different types of events.
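A typical `slog` setup with terminal and asynchronous drains is sketched below (it additionally assumes the `slog-term` and `slog-async` crates); structured key-value pairs keep crawl events machine-parseable.

```rust
use slog::{info, o, Drain};

fn build_logger() -> slog::Logger {
    // Terminal output with async dispatch; a JSON drain (slog-json) is a common
    // alternative when logs are shipped to an aggregator.
    let decorator = slog_term::TermDecorator::new().build();
    let drain = slog_term::FullFormat::new(decorator).build().fuse();
    let drain = slog_async::Async::new(drain).build().fuse();
    slog::Logger::root(drain, o!("service" => "crawler"))
}

fn main() {
    let log = build_logger();
    info!(log, "page crawled"; "url" => "https://example.com", "status" => 200, "bytes" => 10_240);
}
```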
- Set up alerts for critical issues (e.g., high error rates, service downtime).
- Use TLS for all network communications (Kafka, PostgreSQL).
- Implement proper firewall rules to restrict access to services.
- Encrypt sensitive data at rest in PostgreSQL.
- Implement proper access controls for the database.
- Respect `robots.txt` rules and website terms of service.
- Implement politeness delays between requests to the same domain.
- Implement distributed tracing for better debugging and performance analysis.
- Add support for JavaScript rendering using a headless browser for dynamic content.
- Implement a web interface for monitoring and controlling the crawler.
- Enhance the parsing capabilities to extract structured data from specific types of websites.
This design document provides a comprehensive blueprint for building a high-performance, scalable web crawler using Rust, Kafka, and PostgreSQL. By following this design, we can create a robust system capable of efficiently crawling millions of websites while maintaining flexibility for future enhancements.