- System Overview
- Architecture Patterns
- Core Components
- Data Flow
- Database Schema
- API Documentation
- Integration Guide
- External Dependencies
- Development Patterns
BookLister is a comprehensive book cataloging and marketplace preparation tool designed to automate the process of converting book cover images into structured, market-ready listings. The system combines computer vision, web scraping, and data enrichment APIs to provide a complete workflow from image capture to marketplace listing.
- Image Processing: Extract book information from cover photographs using OpenAI Vision API
- Data Enrichment: Enhance book records with comprehensive metadata from Google Books API
- Pricing Research: Gather competitive pricing data from AbeBooks marketplace
- Multi-View Support: Handle multiple perspectives (cover, back, spine, inner pages) of the same book
- Quality Control: Prevent duplicates and validate data accuracy
- Web Interface: Provide modern UI for managing book collections and generating listings
- Book resellers preparing inventory for online marketplaces (Depop, eBay, Amazon)
- Personal book collection cataloging
- Library/bookstore inventory management
- Bulk book processing for estate sales or acquisitions
The application follows a modular monolith pattern with clear separation of concerns:
- CLI Layer: Unified command-line interface
- Web Layer: RESTful API and modern web interface
- Business Logic Layer: Processing engines and workflows
- Data Access Layer: Database operations and caching
- Integration Layer: External API clients and adapters
Image processing follows an event-driven workflow:
Image Upload → Classification → Extraction → Enrichment → Storage → Notification
External services are abstracted through adapter interfaces:
- Vision processing (OpenAI Vision API)
- Book metadata (Google Books API)
- Pricing data (AbeBooks scraper)
- Image optimization (PIL/ImageIO)
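A minimal sketch of what these adapter interfaces could look like as Python protocols (the protocol names here are illustrative; the concrete classes are documented under Core Components):

from typing import Dict, Optional, Protocol

class VisionProvider(Protocol):
    """Any vision backend; the OpenAI client is one implementation."""
    def classify_and_extract(self, image_path: str) -> Dict: ...

class MetadataProvider(Protocol):
    """Any metadata source; Google Books is one implementation."""
    def search_by_isbn(self, isbn: str) -> Optional[Dict]: ...
    def search_by_title_author(self, title: str, author: str) -> Optional[Dict]: ...

class PricingProvider(Protocol):
    """Any pricing source; the AbeBooks scraper is one implementation."""
    def lookup_book_pricing(self, title: str, author: str, isbn: str) -> Dict: ...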
SQLite database serves as the central source of truth with comprehensive schema supporting:
- Multi-view image relationships
- Book condition assessment
- Pricing history
- Processing audit trails
- Data quality metrics
Purpose: Unified command-line tool providing all system functionality.

Key Features:
- Database management (backup, restore, migration)
- Batch processing operations
- Server lifecycle management
- Data import/export utilities
# Key Classes and Methods
class DatabaseManager:
    def clear_all_data(self): ...
    def backup_database(self, filename): ...
    def show_statistics(self): ...
    def fix_incorrect_isbns(self): ...

class ServerManager:
    def start_server(self, debug, port): ...
    def stop_server(self): ...
    def get_server_pid(self): ...
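One plausible way `get_server_pid` and `stop_server` could work is a PID file; the sketch below is illustrative only (the `server.pid` path and structure are assumptions, not the project's actual code):

import os
import signal
from pathlib import Path

PID_FILE = Path("server.pid")  # assumed location, for illustration only

def get_server_pid():
    """Return the recorded PID if that process is still running, else None."""
    if not PID_FILE.exists():
        return None
    pid = int(PID_FILE.read_text().strip())
    try:
        os.kill(pid, 0)  # signal 0 checks existence without terminating
        return pid
    except ProcessLookupError:
        return None  # stale PID file

def stop_server():
    """Send SIGTERM to the recorded server process, if it is running."""
    pid = get_server_pid()
    if pid is not None:
        os.kill(pid, signal.SIGTERM)
        PID_FILE.unlink(missing_ok=True)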
Purpose: Flask-based REST API and web UI for interactive book management.

Architecture:
- RESTful API endpoints
- Server-side rendering with Jinja2 templates
- Real-time image upload with progress tracking
- Bootstrap-based responsive UI
# Key API Endpoints
@app.route('/api/books', methods=['GET', 'POST'])
@app.route('/api/book/<int:book_id>')
@app.route('/api/lookup_pricing', methods=['POST'])
@app.route('/api/book/<int:book_id>/upload_images', methods=['POST'])
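As an illustration, a client can exercise these endpoints with `requests` (this sketch assumes the server is running on port 8000, the default shown in the CLI reference below):

import requests

BASE = "http://localhost:8000"  # assumed dev port

# List all books with enhanced metadata
books = requests.get(f"{BASE}/api/books", timeout=10).json()["books"]

# Create a book record manually
resp = requests.post(
    f"{BASE}/api/books",
    json={"title": "Book Title", "author": "Author Name", "isbn": "9781234567890"},
    timeout=10,
)
print(resp.json()["message"])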
Purpose: SQLite database operations with comprehensive schema.

Key Features:
- Multi-view image relationship management
- Advanced duplicate detection
- Pricing data integration
- Book condition tracking
- Processing status management
# Key Classes and Methods
class BookDatabase:
    def insert_book(self, book_data) -> Optional[int]: ...
    def find_duplicate_book(self, title, author, isbn) -> Optional[Dict]: ...
    def get_book_images(self, book_id) -> List[Dict]: ...
    def update_consolidated_data(self, book_id, data, views): ...
    def update_book_pricing(self, book_id, pricing_data): ...
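A typical call sequence against these methods might look like the following (the dictionary keys are assumptions based on the schema documented below):

db = BookDatabase()
book_data = {"title": "Example Book", "author": "Author Name", "isbn": "9781234567890"}

# Reuse an existing record when the duplicate check finds one
existing = db.find_duplicate_book(
    book_data["title"], book_data["author"], book_data["isbn"]
)
book_id = existing["id"] if existing else db.insert_book(book_data)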
Purpose: Main orchestration logic for the image processing workflow.

Architecture:
- Single-pass image classification and extraction
- Multi-view file organization
- Intelligent duplicate detection
- Batch processing support
- Error handling and recovery
# Key Classes and Methods
class BookProcessor:
    def process_single_image(self, image_path) -> Dict: ...
    def process_all_images(self, use_batch=False): ...
    def _find_existing_book(self, extracted_info) -> Optional[Dict]: ...
    def _process_new_book(self, image_path, view_type, extracted_info) -> Dict: ...
Purpose: OpenAI Vision API integration for image analysis.

Key Features:
- View type classification (cover, back, spine, inner)
- Information extraction from book images
- Book condition assessment
- Confidence scoring
- HEIC image format support
# Key Classes and Methods
class ViewClassifier:
    def classify_and_extract(self, image_path) -> Dict: ...

class LLMProcessor:
    def extract_book_info_from_image(self, image_path) -> Dict: ...
    def validate_and_enrich_book_data(self, ocr_data, api_data) -> Dict: ...
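Internally, a call like `classify_and_extract` amounts to base64-encoding the photo and sending it to the Chat Completions API. A simplified sketch (the prompt wording is illustrative; the model and token settings mirror PROCESSING_CONFIG under Development Patterns):

import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_and_extract(image_path: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        max_tokens=800,
        temperature=0.1,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Classify this book photo (cover/back/spine/inner) "
                         "and extract title, author, and ISBN as JSON."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content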
Purpose: Google Books API integration for metadata enhancement.

Key Features:
- Title/author → ISBN lookup
- ISBN → comprehensive metadata retrieval
- Missing field population
- Data quality validation
# Key Classes and Methods
class GoogleBooksAPI:
    def search_by_title_author(self, title, author) -> Optional[Dict]: ...
    def search_by_isbn(self, isbn) -> Optional[Dict]: ...

class BookEnrichmentService:
    def enrich_book(self, title, author, isbn) -> Optional[Dict]: ...
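For reference, the public Google Books volumes endpoint covers both lookup directions; a minimal sketch of the ISBN path (the field mapping is illustrative):

import requests

def search_by_isbn(isbn: str):
    """Look up one ISBN against the public Google Books volumes endpoint."""
    resp = requests.get(
        "https://www.googleapis.com/books/v1/volumes",
        params={"q": f"isbn:{isbn}"},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json().get("items")
    if not items:
        return None
    info = items[0]["volumeInfo"]
    return {
        "title": info.get("title"),
        "author": ", ".join(info.get("authors", [])),
        "publisher": info.get("publisher"),
        "description": info.get("description"),
    }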
Purpose: AbeBooks marketplace scraping for competitive pricing.

Key Features:
- Intelligent search query construction
- Price range analysis
- Condition-based pricing
- Rate limiting and error handling
# Key Functions
def lookup_book_pricing(title, author, isbn, max_results=10) -> Dict: ...
def extract_book_prices(html_content) -> List[Dict]: ...
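Once listing prices are scraped, the summary fields stored on a book (see the pricing columns in the Database Schema) reduce to simple statistics; a sketch:

from statistics import mean

def summarize_prices(prices):
    """Collapse scraped listing prices into the per-book pricing fields."""
    if not prices:
        return {"found": False}
    return {
        "found": True,
        "min_price": min(prices),
        "max_price": max(prices),
        "avg_price": round(mean(prices), 2),
        "price_range": f"${min(prices):.2f} - ${max(prices):.2f}",
        "listings_found": len(prices),
    }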
Purpose: Advanced duplicate detection and data validation.

Features:
- Multi-strategy matching (ISBN, title+author, fuzzy matching)
- Confidence scoring
- Provisional record handling
- Match metadata tracking
# Key Classes and Methods
class DuplicateDetector:
    def find_existing_book_enhanced(self, extracted_info, confidence) -> Optional[MatchResult]: ...
    def calculate_title_similarity(self, title1, title2) -> float: ...
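The fuzzy title comparison can be approximated with the standard library; this sketch (the normalization rules are an assumption, not the actual implementation) returns a 0.0-1.0 score:

import re
from difflib import SequenceMatcher

def calculate_title_similarity(title1: str, title2: str) -> float:
    """Compare two titles after lowercasing and stripping punctuation."""
    def normalize(t: str) -> str:
        return re.sub(r"[^a-z0-9 ]", "", t.lower()).strip()
    return SequenceMatcher(None, normalize(title1), normalize(title2)).ratio()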
graph TD
    A[Image Upload] --> B[HEIC Conversion]
    B --> C[View Classification]
    C --> D[Information Extraction]
    D --> E[API Enrichment]
    E --> F[Duplicate Detection]
    F --> G{Existing Book?}
    G -->|Yes| H[Add View to Book]
    G -->|No| I[Create New Book]
    H --> J[Update Consolidated Data]
    I --> J
    J --> K[Store in Database]
    K --> L[Move to Processed]
[Image Upload]
      ↓
[HEIC Conversion]
      ↓
[View Classification] ← OpenAI Vision API
      ↓
[Information Extraction]
      ↓
[API Enrichment] ← Google Books API
      ↓
[Duplicate Detection] ← Database Query
      ↓
{Existing Book?}
  Yes ↓         ↓ No
[Add View]  [Create New Book]
      ↓         ↓
[Update Consolidated Data]
      ↓
[Store in Database] → SQLite
      ↓
[Move to Processed] → File System
graph TD
    A[Select Images] --> B[Create Batch Job]
    B --> C[Submit to OpenAI Batch API]
    C --> D[Poll for Completion]
    D --> E[Download Results]
    E --> F[Process Each Result]
    F --> G[Store in Database]
    G --> H[Generate Report]
graph TD
    A[Book Record] --> B{Has ISBN?}
    B -->|Yes| C[ISBN → Metadata]
    B -->|No| D[Title+Author → ISBN]
    C --> E[Enhance Record]
    D --> E
    E --> F[Validate Data Quality]
    F --> G[Update Database]
CREATE TABLE listings (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    input_path TEXT NOT NULL,
    title TEXT,
    author TEXT,
    isbn TEXT,
    publisher TEXT,
    publication_date TEXT,
    description TEXT,
    -- Processing metadata
    processing_status TEXT DEFAULT 'processing',
    status TEXT DEFAULT 'New',
    confidence TEXT DEFAULT 'medium',
    record_type TEXT DEFAULT 'confirmed',
    -- Multi-view support
    consolidated_data TEXT,
    primary_cover_image_id INTEGER,
    total_views_count INTEGER DEFAULT 0,
    match_metadata TEXT,
    -- Pricing data
    price_range TEXT,
    min_price REAL,
    max_price REAL,
    avg_price REAL,
    pricing_timestamp TIMESTAMP,
    pricing_listings_found INTEGER,
    -- Audit fields
    notes TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE book_images (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    book_id INTEGER NOT NULL,
    image_path TEXT NOT NULL,
    view_type TEXT NOT NULL, -- cover, back, spine, inner, unknown
    -- Extracted data
    extracted_info TEXT, -- JSON
    book_condition TEXT, -- JSON
    processing_notes TEXT,
    -- File metadata
    file_size_bytes INTEGER,
    image_dimensions TEXT,
    -- Audit fields
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (book_id) REFERENCES listings(id) ON DELETE CASCADE
);
CREATE INDEX idx_listings_isbn ON listings(isbn);
CREATE INDEX idx_listings_title_author ON listings(title, author);
CREATE INDEX idx_book_images_book_id ON book_images(book_id);
CREATE INDEX idx_book_images_view_type ON book_images(view_type);
GET /api/books: Retrieve all books with enhanced metadata
{
  "books": [
    {
      "id": 1,
      "title": "Example Book",
      "author": "Author Name",
      "isbn": "9781234567890",
      "status": "Ready to List",
      "images": [...],
      "pricing_info": {...},
      "condition_summary": {...}
    }
  ]
}
POST /api/books: Create a new book record manually
// Request
{
  "title": "Book Title",
  "author": "Author Name",
  "isbn": "9781234567890",
  "publisher": "Publisher",
  "description": "Book description"
}

// Response
{
  "success": true,
  "book": {...},
  "message": "Book created successfully"
}
GET /api/book/&lt;book_id&gt;: Retrieve a specific book with full details
{
  "id": 1,
  "title": "Example Book",
  "author": "Author Name",
  "images": [
    {
      "id": 1,
      "view_type": "cover",
      "image_path": "/path/to/image.jpg",
      "extracted_info": {...},
      "book_condition": {...}
    }
  ],
  "consolidated_data": {...}
}
POST /api/book/&lt;book_id&gt;/upload_images: Upload additional images for a book
// Form Data
const formData = new FormData();
formData.append('files', file1);
formData.append('files', file2);
formData.append('view_types', 'cover');
formData.append('view_types', 'back');
// Response
{
  "success": true,
  "uploaded": 2,
  "errors": 0,
  "images": [
    {
      "id": 123,
      "filename": "cover_image.jpg",
      "view_type": "cover",
      "url": "/processed_book_images/1/cover_image.jpg"
    }
  ]
}
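The same upload can be driven from Python; `requests` sends repeated form fields when given lists of tuples, mirroring the FormData calls above (port and filenames are illustrative):

import requests

book_id = 1  # example book
files = [
    ("files", open("cover.jpg", "rb")),
    ("files", open("back.jpg", "rb")),
]
data = [("view_types", "cover"), ("view_types", "back")]

resp = requests.post(
    f"http://localhost:8000/api/book/{book_id}/upload_images",
    files=files, data=data, timeout=30,
)
print(resp.json())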
Delete a specific image
{
  "success": true,
  "message": "Image deleted successfully"
}
POST /api/lookup_pricing: Look up pricing for a book
// Request
{
  "book_id": 1,
  "title": "Book Title",
  "author": "Author Name",
  "isbn": "9781234567890"
}

// Response
{
  "found": true,
  "price_range": "$12.00 - $45.00",
  "min_price": 12.00,
  "max_price": 45.00,
  "avg_price": 28.50,
  "listings_found": 15,
  "search_url": "https://abebooks.com/...",
  "timestamp": "2024-01-15T10:30:00Z"
}
Update book status
// Request
{
  "book_id": 1,
  "status": "Ready to List"
}

// Response
{
  "success": true
}
# Statistics and management
python booklister.py db stats
python booklister.py db list [status]
python booklister.py db backup [filename]
python booklister.py db clear [--images-only]
# Data repair
python booklister.py db fix-isbns [--dry-run] [--limit N]
python booklister.py db fix-missing-titles [--dry-run] [--limit N]
# CSV import
python booklister.py db import-csv file.csv [--dry-run]
python booklister.py db needs-images [--limit N]
# Image processing
python booklister.py process [directory] [--batch] [--no-skip-duplicates]
# Pricing lookup
python booklister.py pricing lookup -t "Title" -a "Author" [-i "ISBN"]
python booklister.py pricing batch-db [--update-db] [--status "New"] [--limit N]
# Data enrichment
python booklister.py enrich stats
python booklister.py enrich batch [--missing description|isbn|both] [--limit N]
# Web server
python booklister.py server start [--debug] [--port 8000]
python booklister.py server stop
python booklister.py server restart
python booklister.py server status
- Create Provider Adapter
class CustomVisionProvider:
    def classify_and_extract(self, image_path: str) -> Dict:
        # Implement vision processing
        return {
            'view_type': 'cover|back|spine|inner|unknown',
            'confidence_level': 'high|medium|low',
            'extracted_info': {...},
            'book_condition': {...}
        }
- Register in Configuration
# config.py
VISION_PROVIDERS = {
    'openai': OpenAIVisionProvider,
    'custom': CustomVisionProvider
}
- Implement BookAPI Interface
class CustomBookAPI:
    def search_by_title_author(self, title: str, author: str) -> Optional[Dict]:
        # Query your provider here, then map its response into this shape
        return {
            'title': title,
            'author': author,
            'isbn': None,         # fill from the provider's response
            'publisher': None,    # fill from the provider's response
            'description': None   # fill from the provider's response
        }

    def search_by_isbn(self, isbn: str) -> Optional[Dict]:
        # ISBN lookup implementation
        pass
- Update Book APIs Module
# book_apis.py
def search_book_info(title, author, isbn, provider='google'):
    providers = {
        'google': GoogleBooksAPI,
        'custom': CustomBookAPI
    }
    api = providers[provider]()
    if isbn:  # prefer the precise ISBN lookup when one is available
        return api.search_by_isbn(isbn)
    return api.search_by_title_author(title, author)
- Extend BookProcessor
class CustomBookProcessor(BookProcessor):
    def process_single_image(self, image_path: Path) -> Dict:
        # Add custom preprocessing here
        result = super().process_single_image(image_path)
        # Add custom post-processing
        self._custom_post_process(result)
        return result

    def _custom_post_process(self, result: Dict):
        # Custom logic here
        pass
- Custom Web Endpoints
@app.route('/api/custom/<int:book_id>', methods=['POST'])
def custom_endpoint(book_id):
    # Custom functionality
    return jsonify({'success': True})
- OpenAI API: Vision processing and text analysis
- Google Books API: Book metadata enrichment
- AbeBooks: Pricing research (web scraping)
- Barcode APIs: ISBN extraction from barcodes
flask>=2.3.0          # Web framework
openai>=1.0.0         # OpenAI API client
requests>=2.31.0      # HTTP client
pillow>=10.0.0        # Image processing
python-dotenv>=1.0.0  # Environment management
# Note: sqlite3 ships with Python's standard library; it is not a pip dependency.
- Python: 3.11+
- Storage: SQLite database + processed images
- Memory: 2GB+ recommended for batch processing
- Network: Internet access for API calls
try:
    result = process_image(image_path)
except ProcessingError as e:
    # Log error and move to needs_review
    move_to_needs_review(image_path, str(e))
    return {'success': False, 'error': str(e)}
except Exception as e:
    # Unexpected error - log and re-raise
    logger.error(f"Unexpected error: {e}")
    raise
# config.py
MULTI_VIEW_CONFIG = {
    'view_types': ['cover', 'back', 'spine', 'inner', 'unknown'],
    'min_confidence': 0.7,
    'max_images_per_book': 10
}

PROCESSING_CONFIG = {
    'vision_model': 'gpt-4o',
    'max_tokens': 800,
    'temperature': 0.1,
    'timeout': 30
}
# Test structure
from pathlib import Path
from unittest.mock import patch

class TestBookProcessor:
    def test_single_image_processing(self):
        processor = BookProcessor('test_images/')
        result = processor.process_single_image(Path('test_cover.jpg'))
        assert result['success'] is True
        assert result['view_type'] == 'cover'

    @patch('book_apis.search_book_info')
    def test_api_integration(self, mock_api):
        mock_api.return_value = {'title': 'Test Book'}
        # Test with the mocked API in place
        ...
- Environment Variables: Store API keys securely
- Database Backups: Regular automated backups
- Image Storage: Consider cloud storage for large collections
- Rate Limiting: Respect API rate limits
- Monitoring: Log processing metrics and errors
- Scaling: Consider async processing for high volumes
sqlite3.OperationalError: database is locked
# Solution: Ensure proper connection handling
with sqlite3.connect(self.db_path) as conn:
    # Always use a context manager: it commits on success, rolls back on error
    cursor = conn.cursor()
    # Operations here
    conn.commit()  # Explicit commit (the context manager also commits on clean exit)
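If lock errors persist under concurrent access (web server plus CLI), SQLite's write-ahead log and a busy timeout usually help; both are standard SQLite pragmas (the database path shown is illustrative):

import sqlite3

conn = sqlite3.connect("book_listings.db", timeout=30)  # wait up to 30s on locks
conn.execute("PRAGMA journal_mode=WAL")    # readers no longer block the writer
conn.execute("PRAGMA busy_timeout=30000")  # in milliseconds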
# Error: Rate limit exceeded
# Solution: Implement exponential backoff
import time
import random

def api_call_with_retry(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            if "rate limit" in str(e).lower() and attempt < max_retries - 1:
                wait_time = (2 ** attempt) + random.uniform(0, 1)
                time.sleep(wait_time)
            else:
                raise
# Handle PIL image errors gracefully
from PIL import Image

def safe_image_processing(image_path):
    try:
        with Image.open(image_path) as img:
            # Verify image integrity
            img.verify()
        # Reopen for processing (an image must be reopened after verify())
        with Image.open(image_path) as img:
            return process_image(img)
    except (IOError, OSError) as e:
        print(f"Invalid image file {image_path}: {e}")
        return None
# Process images in chunks to manage memory
import gc

def process_large_batch(image_paths, chunk_size=10):
    for i in range(0, len(image_paths), chunk_size):
        chunk = image_paths[i:i + chunk_size]
        yield process_chunk(chunk)
        # Force garbage collection after each chunk
        gc.collect()
-- Add these indexes for better query performance
CREATE INDEX idx_listings_processing_status ON listings(processing_status);
CREATE INDEX idx_listings_created_at ON listings(created_at);
CREATE INDEX idx_book_images_created_at ON book_images(created_at);
-- Composite indexes for complex queries
CREATE INDEX idx_listings_status_created ON listings(status, created_at);
# Optimal batch size for OpenAI API
OPTIMAL_BATCH_SIZE = 50  # Balances cost savings vs processing time

# Prepare batch requests efficiently
def prepare_efficient_batch(images):
    return [
        {
            "custom_id": f"image_{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": create_vision_request(img)
        }
        for i, img in enumerate(images[:OPTIMAL_BATCH_SIZE])
    ]
import logging

# Add to config.py
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('booklister.log'),
        logging.StreamHandler()
    ]
)

# Use in modules
logger = logging.getLogger(__name__)
logger.debug(f"Processing image: {image_path}")
# Add query logging to database.py
def execute_with_logging(cursor, query, params=None):
    logger.debug(f"SQL: {query}")
    if params:
        logger.debug(f"Params: {params}")
    cursor.execute(query, params or [])
    logger.debug(f"Rows affected: {cursor.rowcount}")
import unittest
from unittest.mock import patch, MagicMock
from pathlib import Path

class TestBookProcessor(unittest.TestCase):
    def setUp(self):
        self.processor = BookProcessor('test_directory')
        self.sample_image = Path('test_data/sample_cover.jpg')

    @patch('book_processor.ViewClassifier')
    def test_image_classification(self, mock_classifier):
        # Mock the classifier response
        mock_classifier.return_value.classify_and_extract.return_value = {
            'view_type': 'cover',
            'confidence_level': 'high',
            'extracted_info': {'title': 'Test Book', 'author': 'Test Author'}
        }
        result = self.processor.process_single_image(self.sample_image)
        self.assertTrue(result['success'])
        self.assertEqual(result['view_type'], 'cover')
class TestIntegration(unittest.TestCase):
    def test_complete_workflow(self):
        # Test end-to-end processing
        processor = BookProcessor('test_images')

        # Process a test image
        result = processor.process_single_image(Path('test_cover.jpg'))

        # Verify database state
        db = BookDatabase()
        books = db.get_all_books()
        self.assertEqual(len(books), 1)

        # Verify file organization
        processed_files = list(Path('processed_book_images').glob('**/*.jpg'))
        self.assertGreater(len(processed_files), 0)
@app.route('/health')
def health_check():
    """System health check endpoint"""
    health_status = {
        'status': 'healthy',
        'timestamp': datetime.now().isoformat(),
        'components': {}
    }

    # Check database connectivity
    try:
        db = BookDatabase()
        db.get_all_books()
        health_status['components']['database'] = 'healthy'
    except Exception as e:
        health_status['components']['database'] = f'error: {str(e)}'
        health_status['status'] = 'unhealthy'

    # Check API keys
    health_status['components']['openai_api'] = 'configured' if os.getenv('OPENAI_API_KEY') else 'missing'
    health_status['components']['google_books_api'] = 'configured' if os.getenv('GOOGLE_BOOKS_API_KEY') else 'missing'

    return jsonify(health_status)
# Add to web_server.py
import time
from functools import wraps

def track_processing_time(f):
    @wraps(f)
    def decorated_function(*args, **kwargs):
        start_time = time.time()
        result = f(*args, **kwargs)
        processing_time = time.time() - start_time
        # Log metrics
        logger.info(f"Function {f.__name__} took {processing_time:.2f}s")
        # Store in database for analysis
        # db.log_metric(f.__name__, processing_time)
        return result
    return decorated_function

@track_processing_time
def process_image_request():
    # Processing logic here
    pass
# 1. Python environment
python3.11 -m venv .venv
source .venv/bin/activate
# 2. Install dependencies
pip install -r requirements.txt
# 3. Environment variables
cp .env.example .env
# Edit .env with your API keys
# 4. Database initialization
python booklister.py db migrate
# 5. Test installation
python booklister.py server start --debug
# production_config.py
import os

class ProductionConfig:
    # Security
    SECRET_KEY = os.environ.get('SECRET_KEY', 'change-this-in-production')

    # Database
    DATABASE_URL = os.environ.get('DATABASE_URL', 'book_listings_prod.db')

    # API Settings
    OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY')
    GOOGLE_BOOKS_API_KEY = os.environ.get('GOOGLE_BOOKS_API_KEY')

    # Rate Limiting
    RATE_LIMIT_ENABLED = True
    MAX_REQUESTS_PER_MINUTE = 60

    # File Storage
    MAX_FILE_SIZE = 10 * 1024 * 1024  # 10MB
    ALLOWED_EXTENSIONS = {'jpg', 'jpeg', 'png', 'heic'}
This architecture documentation provides a comprehensive foundation for understanding, extending, and maintaining the BookLister application. The modular design allows for easy customization while maintaining system integrity and performance.
1. Clone and Setup

   git clone <repository>
   cd BookReview
   python3.11 -m venv .venv
   source .venv/bin/activate
   pip install -r requirements.txt

2. Configure Environment

   cp .env.example .env
   # Add your OpenAI and Google Books API keys

3. Initialize Database

   python booklister.py db migrate

4. Start Development Server

   python booklister.py server start --debug --port 8000

5. Test Image Processing

   # Add test images to listHelper/books_to_sell/
   python booklister.py process --no-skip-duplicates
This documentation serves as both an architectural reference and practical development guide for the BookLister application.