- System Overview
- Architecture Patterns
- Core Components
- Data Flow
- Database Schema
- API Documentation
- Integration Guide
- External Dependencies
- Development Patterns
BookLister is a comprehensive book cataloging and marketplace preparation tool designed to automate the process of converting book cover images into structured, market-ready listings. The system combines computer vision, web scraping, and data enrichment APIs to provide a complete workflow from image capture to marketplace listing.
- Image Processing: Extract book information from cover photographs using OpenAI Vision API
- Data Enrichment: Enhance book records with comprehensive metadata from Google Books API
- Pricing Research: Gather competitive pricing data from AbeBooks marketplace
- Multi-View Support: Handle multiple perspectives (cover, back, spine, inner pages) of the same book
- Quality Control: Prevent duplicates and validate data accuracy
- Web Interface: Provide modern UI for managing book collections and generating listings
- Book resellers preparing inventory for online marketplaces (Depop, eBay, Amazon)
- Personal book collection cataloging
- Library/bookstore inventory management
- Bulk book processing for estate sales or acquisitions
The application follows a modular monolith pattern with clear separation of concerns:
- CLI Layer: Unified command-line interface
- Web Layer: RESTful API and modern web interface
- Business Logic Layer: Processing engines and workflows
- Data Access Layer: Database operations and caching
- Integration Layer: External API clients and adapters
Image processing follows an event-driven workflow:
Image Upload → Classification → Extraction → Enrichment → Storage → Notification
External services are abstracted through adapter interfaces:
- Vision processing (OpenAI Vision API)
- Book metadata (Google Books API)
- Pricing data (AbeBooks scraper)
- Image optimization (PIL/ImageIO)
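A minimal sketch of what these adapter interfaces could look like as Python protocols (the protocol names here are illustrative; the concrete classes are documented under Core Components):

from typing import Dict, Optional, Protocol

class VisionProvider(Protocol):
    """Any vision backend; the OpenAI client is one implementation."""
    def classify_and_extract(self, image_path: str) -> Dict: ...

class MetadataProvider(Protocol):
    """Any metadata source; Google Books is one implementation."""
    def search_by_isbn(self, isbn: str) -> Optional[Dict]: ...
    def search_by_title_author(self, title: str, author: str) -> Optional[Dict]: ...

class PricingProvider(Protocol):
    """Any pricing source; the AbeBooks scraper is one implementation."""
    def lookup_book_pricing(self, title: str, author: str, isbn: str) -> Dict: ...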
SQLite database serves as the central source of truth with comprehensive schema supporting:
- Multi-view image relationships
- Book condition assessment
- Pricing history
- Processing audit trails
- Data quality metrics
Purpose: Unified command-line tool providing all system functionality.

Key Features:
- Database management (backup, restore, migration)
- Batch processing operations
- Server lifecycle management
- Data import/export utilities
# Key Classes and Methods
class DatabaseManager:
    def clear_all_data(self): ...
    def backup_database(self, filename): ...
    def show_statistics(self): ...
    def fix_incorrect_isbns(self): ...

class ServerManager:
    def start_server(self, debug, port): ...
    def stop_server(self): ...
    def get_server_pid(self): ...
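One plausible way `get_server_pid` and `stop_server` could work is a PID file; the sketch below is illustrative only (the `server.pid` path and structure are assumptions, not the project's actual code):

import os
import signal
from pathlib import Path

PID_FILE = Path("server.pid")  # assumed location, for illustration only

def get_server_pid():
    """Return the recorded PID if that process is still running, else None."""
    if not PID_FILE.exists():
        return None
    pid = int(PID_FILE.read_text().strip())
    try:
        os.kill(pid, 0)  # signal 0 checks existence without terminating
        return pid
    except ProcessLookupError:
        return None  # stale PID file

def stop_server():
    """Send SIGTERM to the recorded server process, if it is running."""
    pid = get_server_pid()
    if pid is not None:
        os.kill(pid, signal.SIGTERM)
        PID_FILE.unlink(missing_ok=True)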
Purpose: Flask-based REST API and web UI for interactive book management.

Architecture:
- RESTful API endpoints
- Server-side rendering with Jinja2 templates
- Real-time image upload with progress tracking
- Bootstrap-based responsive UI
# Key API Endpoints
@app.route('/api/books', methods=['GET', 'POST'])
@app.route('/api/book/<int:book_id>')
@app.route('/api/lookup_pricing', methods=['POST'])
@app.route('/api/book/<int:book_id>/upload_images', methods=['POST'])
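As an illustration, a client can exercise these endpoints with `requests` (this sketch assumes the server is running on port 8000, the default shown in the CLI reference below):

import requests

BASE = "http://localhost:8000"  # assumed dev port

# List all books with enhanced metadata
books = requests.get(f"{BASE}/api/books", timeout=10).json()["books"]

# Create a book record manually
resp = requests.post(
    f"{BASE}/api/books",
    json={"title": "Book Title", "author": "Author Name", "isbn": "9781234567890"},
    timeout=10,
)
print(resp.json()["message"])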
Purpose: SQLite database operations with comprehensive schema.

Key Features:
- Multi-view image relationship management
- Advanced duplicate detection
- Pricing data integration
- Book condition tracking
- Processing status management
# Key Classes and Methods
class BookDatabase:
    def insert_book(self, book_data) -> Optional[int]: ...
    def find_duplicate_book(self, title, author, isbn) -> Optional[Dict]: ...
    def get_book_images(self, book_id) -> List[Dict]: ...
    def update_consolidated_data(self, book_id, data, views): ...
    def update_book_pricing(self, book_id, pricing_data): ...
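A typical call sequence against these methods might look like the following (the dictionary keys are assumptions based on the schema documented below):

db = BookDatabase()
book_data = {"title": "Example Book", "author": "Author Name", "isbn": "9781234567890"}

# Reuse an existing record when the duplicate check finds one
existing = db.find_duplicate_book(
    book_data["title"], book_data["author"], book_data["isbn"]
)
book_id = existing["id"] if existing else db.insert_book(book_data)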
Purpose: Main orchestration logic for the image processing workflow.

Architecture:
- Single-pass image classification and extraction
- Multi-view file organization
- Intelligent duplicate detection
- Batch processing support
- Error handling and recovery
# Key Classes and Methods
class BookProcessor:
    def process_single_image(self, image_path) -> Dict: ...
    def process_all_images(self, use_batch=False): ...
    def _find_existing_book(self, extracted_info) -> Optional[Dict]: ...
    def _process_new_book(self, image_path, view_type, extracted_info) -> Dict: ...
Purpose: OpenAI Vision API integration for image analysis.

Key Features:
- View type classification (cover, back, spine, inner)
- Information extraction from book images
- Book condition assessment
- Confidence scoring
- HEIC image format support
# Key Classes and Methods
class ViewClassifier:
    def classify_and_extract(self, image_path) -> Dict: ...

class LLMProcessor:
    def extract_book_info_from_image(self, image_path) -> Dict: ...
    def validate_and_enrich_book_data(self, ocr_data, api_data) -> Dict: ...
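Internally, a call like `classify_and_extract` amounts to base64-encoding the photo and sending it to the Chat Completions API. A simplified sketch (the prompt wording is illustrative; the model and token settings mirror PROCESSING_CONFIG under Development Patterns):

import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_and_extract(image_path: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        max_tokens=800,
        temperature=0.1,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Classify this book photo (cover/back/spine/inner) "
                         "and extract title, author, and ISBN as JSON."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content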
Purpose: Google Books API integration for metadata enhancement.

Key Features:
- Title/author → ISBN lookup
- ISBN → comprehensive metadata retrieval
- Missing field population
- Data quality validation
# Key Classes and Methods
class GoogleBooksAPI:
    def search_by_title_author(self, title, author) -> Optional[Dict]: ...
    def search_by_isbn(self, isbn) -> Optional[Dict]: ...

class BookEnrichmentService:
    def enrich_book(self, title, author, isbn) -> Optional[Dict]: ...
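For reference, the public Google Books volumes endpoint covers both lookup directions; a minimal sketch of the ISBN path (the field mapping is illustrative):

import requests

def search_by_isbn(isbn: str):
    """Look up one ISBN against the public Google Books volumes endpoint."""
    resp = requests.get(
        "https://www.googleapis.com/books/v1/volumes",
        params={"q": f"isbn:{isbn}"},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json().get("items")
    if not items:
        return None
    info = items[0]["volumeInfo"]
    return {
        "title": info.get("title"),
        "author": ", ".join(info.get("authors", [])),
        "publisher": info.get("publisher"),
        "description": info.get("description"),
    }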
Purpose: AbeBooks marketplace scraping for competitive pricing.

Key Features:
- Intelligent search query construction
- Price range analysis
- Condition-based pricing
- Rate limiting and error handling
# Key Functions
def lookup_book_pricing(title, author, isbn, max_results=10) -> Dict: ...
def extract_book_prices(html_content) -> List[Dict]: ...
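Once listing prices are scraped, the summary fields stored on a book (see the pricing columns in the Database Schema) reduce to simple statistics; a sketch:

from statistics import mean

def summarize_prices(prices):
    """Collapse scraped listing prices into the per-book pricing fields."""
    if not prices:
        return {"found": False}
    return {
        "found": True,
        "min_price": min(prices),
        "max_price": max(prices),
        "avg_price": round(mean(prices), 2),
        "price_range": f"${min(prices):.2f} - ${max(prices):.2f}",
        "listings_found": len(prices),
    }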
Purpose: Advanced duplicate detection and data validation.

Features:
- Multi-strategy matching (ISBN, title+author, fuzzy matching)
- Confidence scoring
- Provisional record handling
- Match metadata tracking
# Key Classes and Methods
class DuplicateDetector:
    def find_existing_book_enhanced(self, extracted_info, confidence) -> Optional[MatchResult]: ...
    def calculate_title_similarity(self, title1, title2) -> float: ...
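The fuzzy title comparison can be approximated with the standard library; this sketch (the normalization rules are an assumption, not the actual implementation) returns a 0.0-1.0 score:

import re
from difflib import SequenceMatcher

def calculate_title_similarity(title1: str, title2: str) -> float:
    """Compare two titles after lowercasing and stripping punctuation."""
    def normalize(t: str) -> str:
        return re.sub(r"[^a-z0-9 ]", "", t.lower()).strip()
    return SequenceMatcher(None, normalize(title1), normalize(title2)).ratio()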
graph TD
    A[Image Upload] --> B[HEIC Conversion]
    B --> C[View Classification]
    C --> D[Information Extraction]
    D --> E[API Enrichment]
    E --> F[Duplicate Detection]
    F --> G{Existing Book?}
    G -->|Yes| H[Add View to Book]
    G -->|No| I[Create New Book]
    H --> J[Update Consolidated Data]
    I --> J
    J --> K[Store in Database]
    K --> L[Move to Processed]
[Image Upload]
      ↓
[HEIC Conversion]
      ↓
[View Classification] ← OpenAI Vision API
      ↓
[Information Extraction]
      ↓
[API Enrichment] ← Google Books API
      ↓
[Duplicate Detection] ← Database Query
      ↓
{Existing Book?}
  Yes ↓         ↓ No
[Add View]  [Create New Book]
      ↓         ↓
[Update Consolidated Data]
      ↓
[Store in Database] → SQLite
      ↓
[Move to Processed] → File System
graph TD
    A[Select Images] --> B[Create Batch Job]
    B --> C[Submit to OpenAI Batch API]
    C --> D[Poll for Completion]
    D --> E[Download Results]
    E --> F[Process Each Result]
    F --> G[Store in Database]
    G --> H[Generate Report]
graph TD
    A[Book Record] --> B{Has ISBN?}
    B -->|Yes| C[ISBN → Metadata]
    B -->|No| D[Title+Author → ISBN]
    C --> E[Enhance Record]
    D --> E
    E --> F[Validate Data Quality]
    F --> G[Update Database]
CREATE TABLE listings (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    input_path TEXT NOT NULL,
    title TEXT,
    author TEXT,
    isbn TEXT,
    publisher TEXT,
    publication_date TEXT,
    description TEXT,
    -- Processing metadata
    processing_status TEXT DEFAULT 'processing',
    status TEXT DEFAULT 'New',
    confidence TEXT DEFAULT 'medium',
    record_type TEXT DEFAULT 'confirmed',
    -- Multi-view support
    consolidated_data TEXT,
    primary_cover_image_id INTEGER,
    total_views_count INTEGER DEFAULT 0,
    match_metadata TEXT,
    -- Pricing data
    price_range TEXT,
    min_price REAL,
    max_price REAL,
    avg_price REAL,
    pricing_timestamp TIMESTAMP,
    pricing_listings_found INTEGER,
    -- Audit fields
    notes TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE book_images (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    book_id INTEGER NOT NULL,
    image_path TEXT NOT NULL,
    view_type TEXT NOT NULL, -- cover, back, spine, inner, unknown
    -- Extracted data
    extracted_info TEXT, -- JSON
    book_condition TEXT, -- JSON
    processing_notes TEXT,
    -- File metadata
    file_size_bytes INTEGER,
    image_dimensions TEXT,
    -- Audit fields
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (book_id) REFERENCES listings(id) ON DELETE CASCADE
);
CREATE INDEX idx_listings_isbn ON listings(isbn);
CREATE INDEX idx_listings_title_author ON listings(title, author);
CREATE INDEX idx_book_images_book_id ON book_images(book_id);
CREATE INDEX idx_book_images_view_type ON book_images(view_type);
GET /api/books: Retrieve all books with enhanced metadata
{
  "books": [
    {
      "id": 1,
      "title": "Example Book",
      "author": "Author Name",
      "isbn": "9781234567890",
      "status": "Ready to List",
      "images": [...],
      "pricing_info": {...},
      "condition_summary": {...}
    }
  ]
}
POST /api/books: Create a new book record manually
// Request
{
  "title": "Book Title",
  "author": "Author Name",
  "isbn": "9781234567890",
  "publisher": "Publisher",
  "description": "Book description"
}

// Response
{
  "success": true,
  "book": {...},
  "message": "Book created successfully"
}
GET /api/book/&lt;book_id&gt;: Retrieve a specific book with full details
{
  "id": 1,
  "title": "Example Book",
  "author": "Author Name",
  "images": [
    {
      "id": 1,
      "view_type": "cover",
      "image_path": "/path/to/image.jpg",
      "extracted_info": {...},
      "book_condition": {...}
    }
  ],
  "consolidated_data": {...}
}
POST /api/book/&lt;book_id&gt;/upload_images: Upload additional images for a book
// Form Data
const formData = new FormData();
formData.append('files', file1);
formData.append('files', file2);
formData.append('view_types', 'cover');
formData.append('view_types', 'back');
// Response
{
  "success": true,
  "uploaded": 2,
  "errors": 0,
  "images": [
    {
      "id": 123,
      "filename": "cover_image.jpg",
      "view_type": "cover",
      "url": "/processed_book_images/1/cover_image.jpg"
    }
  ]
}
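The same upload can be driven from Python; `requests` sends repeated form fields when given lists of tuples, mirroring the FormData calls above (port and filenames are illustrative):

import requests

book_id = 1  # example book
files = [
    ("files", open("cover.jpg", "rb")),
    ("files", open("back.jpg", "rb")),
]
data = [("view_types", "cover"), ("view_types", "back")]

resp = requests.post(
    f"http://localhost:8000/api/book/{book_id}/upload_images",
    files=files, data=data, timeout=30,
)
print(resp.json())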
Delete a specific image
{
  "success": true,
  "message": "Image deleted successfully"
}
POST /api/lookup_pricing: Look up pricing for a book
// Request
{
  "book_id": 1,
  "title": "Book Title",
  "author": "Author Name",
  "isbn": "9781234567890"
}

// Response
{
  "found": true,
  "price_range": "$12.00 - $45.00",
  "min_price": 12.00,
  "max_price": 45.00,
  "avg_price": 28.50,
  "listings_found": 15,
  "search_url": "https://abebooks.com/...",
  "timestamp": "2024-01-15T10:30:00Z"
}
Update book status
// Request
{
  "book_id": 1,
  "status": "Ready to List"
}

// Response
{
  "success": true
}
# Statistics and management
python booklister.py db stats
python booklister.py db list [status]
python booklister.py db backup [filename]
python booklister.py db clear [--images-only]
# Data repair
python booklister.py db fix-isbns [--dry-run] [--limit N]
python booklister.py db fix-missing-titles [--dry-run] [--limit N]
# CSV import
python booklister.py db import-csv file.csv [--dry-run]
python booklister.py db needs-images [--limit N]
# Image processing
python booklister.py process [directory] [--batch] [--no-skip-duplicates]
# Pricing lookup
python booklister.py pricing lookup -t "Title" -a "Author" [-i "ISBN"]
python booklister.py pricing batch-db [--update-db] [--status "New"] [--limit N]
# Data enrichment
python booklister.py enrich stats
python booklister.py enrich batch [--missing description|isbn|both] [--limit N]
# Web server
python booklister.py server start [--debug] [--port 8000]
python booklister.py server stop
python booklister.py server restart
python booklister.py server status
- Create Provider Adapter
class CustomVisionProvider:
    def classify_and_extract(self, image_path: str) -> Dict:
        # Implement vision processing
        return {
            'view_type': 'cover|back|spine|inner|unknown',
            'confidence_level': 'high|medium|low',
            'extracted_info': {...},
            'book_condition': {...}
        }
- Register in Configuration
# config.py
VISION_PROVIDERS = {
    'openai': OpenAIVisionProvider,
    'custom': CustomVisionProvider
}
- Implement BookAPI Interface
class CustomBookAPI:
    def search_by_title_author(self, title: str, author: str) -> Optional[Dict]:
        # Query your provider here, then map its response into this shape
        return {
            'title': title,
            'author': author,
            'isbn': None,         # fill from the provider's response
            'publisher': None,    # fill from the provider's response
            'description': None   # fill from the provider's response
        }

    def search_by_isbn(self, isbn: str) -> Optional[Dict]:
        # ISBN lookup implementation
        pass
- Update Book APIs Module
# book_apis.py
def search_book_info(title, author, isbn, provider='google'):
    providers = {
        'google': GoogleBooksAPI,
        'custom': CustomBookAPI
    }
    api = providers[provider]()
    if isbn:  # prefer the precise ISBN lookup when one is available
        return api.search_by_isbn(isbn)
    return api.search_by_title_author(title, author)
- Extend BookProcessor
class CustomBookProcessor(BookProcessor):
    def process_single_image(self, image_path: Path) -> Dict:
        # Add custom preprocessing here
        result = super().process_single_image(image_path)
        # Add custom post-processing
        self._custom_post_process(result)
        return result

    def _custom_post_process(self, result: Dict):
        # Custom logic here
        pass
- Custom Web Endpoints
@app.route('/api/custom/<int:book_id>', methods=['POST'])
def custom_endpoint(book_id):
    # Custom functionality
    return jsonify({'success': True})
- OpenAI API: Vision processing and text analysis
- Google Books API: Book metadata enrichment
- AbeBooks: Pricing research (web scraping)
- Barcode APIs: ISBN extraction from barcodes
flask>=2.3.0          # Web framework
openai>=1.0.0         # OpenAI API client
requests>=2.31.0      # HTTP client
pillow>=10.0.0        # Image processing
python-dotenv>=1.0.0  # Environment management
# Note: sqlite3 ships with Python's standard library; it is not a pip dependency.
- Python: 3.11+
- Storage: SQLite database + processed images
- Memory: 2GB+ recommended for batch processing
- Network: Internet access for API calls
try:
    result = process_image(image_path)
except ProcessingError as e:
    # Log error and move to needs_review
    move_to_needs_review(image_path, str(e))
    return {'success': False, 'error': str(e)}
except Exception as e:
    # Unexpected error - log and re-raise
    logger.error(f"Unexpected error: {e}")
    raise
# config.py
MULTI_VIEW_CONFIG = {
    'view_types': ['cover', 'back', 'spine', 'inner', 'unknown'],
    'min_confidence': 0.7,
    'max_images_per_book': 10
}

PROCESSING_CONFIG = {
    'vision_model': 'gpt-4o',
    'max_tokens': 800,
    'temperature': 0.1,
    'timeout': 30
}
# Test structure
from pathlib import Path
from unittest.mock import patch

class TestBookProcessor:
    def test_single_image_processing(self):
        processor = BookProcessor('test_images/')
        result = processor.process_single_image(Path('test_cover.jpg'))
        assert result['success'] is True
        assert result['view_type'] == 'cover'

    @patch('book_apis.search_book_info')
    def test_api_integration(self, mock_api):
        mock_api.return_value = {'title': 'Test Book'}
        # Test with the mocked API in place
        ...
- Environment Variables: Store API keys securely
- Database Backups: Regular automated backups
- Image Storage: Consider cloud storage for large collections
- Rate Limiting: Respect API rate limits
- Monitoring: Log processing metrics and errors
- Scaling: Consider async processing for high volumes
sqlite3.OperationalError: database is locked
# Solution: Ensure proper connection handling
with sqlite3.connect(self.db_path) as conn:
    # Always use a context manager: it commits on success, rolls back on error
    cursor = conn.cursor()
    # Operations here
    conn.commit()  # Explicit commit (the context manager also commits on clean exit)
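If lock errors persist under concurrent access (web server plus CLI), SQLite's write-ahead log and a busy timeout usually help; both are standard SQLite pragmas (the database path shown is illustrative):

import sqlite3

conn = sqlite3.connect("book_listings.db", timeout=30)  # wait up to 30s on locks
conn.execute("PRAGMA journal_mode=WAL")    # readers no longer block the writer
conn.execute("PRAGMA busy_timeout=30000")  # in milliseconds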
# Error: Rate limit exceeded
# Solution: Implement exponential backoff
import time
import random

def api_call_with_retry(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            if "rate limit" in str(e).lower() and attempt < max_retries - 1:
                wait_time = (2 ** attempt) + random.uniform(0, 1)
                time.sleep(wait_time)
            else:
                raise
# Handle PIL image errors gracefully
from PIL import Image

def safe_image_processing(image_path):
    try:
        with Image.open(image_path) as img:
            # Verify image integrity
            img.verify()
        # Reopen for processing (an image must be reopened after verify())
        with Image.open(image_path) as img:
            return process_image(img)
    except (IOError, OSError) as e:
        print(f"Invalid image file {image_path}: {e}")
        return None
# Process images in chunks to manage memory
import gc

def process_large_batch(image_paths, chunk_size=10):
    for i in range(0, len(image_paths), chunk_size):
        chunk = image_paths[i:i + chunk_size]
        yield process_chunk(chunk)
        # Force garbage collection after each chunk
        gc.collect()
-- Add these indexes for better query performance
CREATE INDEX idx_listings_processing_status ON listings(processing_status);
CREATE INDEX idx_listings_created_at ON listings(created_at);
CREATE INDEX idx_book_images_created_at ON book_images(created_at);
-- Composite indexes for complex queries
CREATE INDEX idx_listings_status_created ON listings(status, created_at);
# Optimal batch size for OpenAI API
OPTIMAL_BATCH_SIZE = 50  # Balances cost savings vs processing time

# Prepare batch requests efficiently
def prepare_efficient_batch(images):
    return [
        {
            "custom_id": f"image_{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": create_vision_request(img)
        }
        for i, img in enumerate(images[:OPTIMAL_BATCH_SIZE])
    ]
import logging

# Add to config.py
logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('booklister.log'),
        logging.StreamHandler()
    ]
)

# Use in modules
logger = logging.getLogger(__name__)
logger.debug(f"Processing image: {image_path}")
# Add query logging to database.py
def execute_with_logging(cursor, query, params=None):
    logger.debug(f"SQL: {query}")
    if params:
        logger.debug(f"Params: {params}")
    cursor.execute(query, params or [])
    logger.debug(f"Rows affected: {cursor.rowcount}")
import unittest
from unittest.mock import patch, MagicMock
from pathlib import Path

class TestBookProcessor(unittest.TestCase):
    def setUp(self):
        self.processor = BookProcessor('test_directory')
        self.sample_image = Path('test_data/sample_cover.jpg')

    @patch('book_processor.ViewClassifier')
    def test_image_classification(self, mock_classifier):
        # Mock the classifier response
        mock_classifier.return_value.classify_and_extract.return_value = {
            'view_type': 'cover',
            'confidence_level': 'high',
            'extracted_info': {'title': 'Test Book', 'author': 'Test Author'}
        }
        result = self.processor.process_single_image(self.sample_image)
        self.assertTrue(result['success'])
        self.assertEqual(result['view_type'], 'cover')
class TestIntegration(unittest.TestCase):
    def test_complete_workflow(self):
        # Test end-to-end processing
        processor = BookProcessor('test_images')

        # Process a test image
        result = processor.process_single_image(Path('test_cover.jpg'))

        # Verify database state
        db = BookDatabase()
        books = db.get_all_books()
        self.assertEqual(len(books), 1)

        # Verify file organization
        processed_files = list(Path('processed_book_images').glob('**/*.jpg'))
        self.assertGreater(len(processed_files), 0)
@app.route('/health')
def health_check():
    """System health check endpoint"""
    health_status = {
        'status': 'healthy',
        'timestamp': datetime.now().isoformat(),
        'components': {}
    }

    # Check database connectivity
    try:
        db = BookDatabase()
        db.get_all_books()
        health_status['components']['database'] = 'healthy'
    except Exception as e:
        health_status['components']['database'] = f'error: {str(e)}'
        health_status['status'] = 'unhealthy'

    # Check API keys
    health_status['components']['openai_api'] = 'configured' if os.getenv('OPENAI_API_KEY') else 'missing'
    health_status['components']['google_books_api'] = 'configured' if os.getenv('GOOGLE_BOOKS_API_KEY') else 'missing'

    return jsonify(health_status)
# Add to web_server.py
import time
from functools import wraps

def track_processing_time(f):
    @wraps(f)
    def decorated_function(*args, **kwargs):
        start_time = time.time()
        result = f(*args, **kwargs)
        processing_time = time.time() - start_time
        # Log metrics
        logger.info(f"Function {f.__name__} took {processing_time:.2f}s")
        # Store in database for analysis
        # db.log_metric(f.__name__, processing_time)
        return result
    return decorated_function

@track_processing_time
def process_image_request():
    # Processing logic here
    pass
# 1. Python environment
python3.11 -m venv .venv
source .venv/bin/activate
# 2. Install dependencies
pip install -r requirements.txt
# 3. Environment variables
cp .env.example .env
# Edit .env with your API keys
# 4. Database initialization
python booklister.py db migrate
# 5. Test installation
python booklister.py server start --debug
# production_config.py
import os

class ProductionConfig:
    # Security
    SECRET_KEY = os.environ.get('SECRET_KEY', 'change-this-in-production')

    # Database
    DATABASE_URL = os.environ.get('DATABASE_URL', 'book_listings_prod.db')

    # API Settings
    OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY')
    GOOGLE_BOOKS_API_KEY = os.environ.get('GOOGLE_BOOKS_API_KEY')

    # Rate Limiting
    RATE_LIMIT_ENABLED = True
    MAX_REQUESTS_PER_MINUTE = 60

    # File Storage
    MAX_FILE_SIZE = 10 * 1024 * 1024  # 10MB
    ALLOWED_EXTENSIONS = {'jpg', 'jpeg', 'png', 'heic'}
This architecture documentation provides a comprehensive foundation for understanding, extending, and maintaining the BookLister application. The modular design allows for easy customization while maintaining system integrity and performance.
1. Clone and Setup

   git clone <repository>
   cd BookReview
   python3.11 -m venv .venv
   source .venv/bin/activate
   pip install -r requirements.txt

2. Configure Environment

   cp .env.example .env
   # Add your OpenAI and Google Books API keys

3. Initialize Database

   python booklister.py db migrate

4. Start Development Server

   python booklister.py server start --debug --port 8000

5. Test Image Processing

   # Add test images to listHelper/books_to_sell/
   python booklister.py process --no-skip-duplicates
This documentation serves as both an architectural reference and practical development guide for the BookLister application.