This document describes a command-line Python application for automatic time tracking by watching the computer screen. The application periodically captures screenshots on macOS, processes them with a vision-based language model, and generates time-tracking reports.
The design covers:
- Overall architecture
- System components
- Data flows
- Implementation details
- Configuration and extensibility
- Security and privacy considerations
- Scheduler triggers screenshot capture at a configurable interval (default: 1 minute).
- Screenshot Capturer saves screenshots to a designated directory.
- Vision Processor (using a vision-based LLM) generates textual descriptions of each screenshot.
- Data Extractor parses and normalizes textual descriptions into structured activity logs.
- Aggregator & Reporter summarizes daily/weekly/monthly data into CSV and JSON reports.
Platform: macOS (with possible future extension to Windows/Linux).
Runtime: Python 3.x
Interface: Command line (CLI)
The application is split into several logical modules to keep concerns separated and the code maintainable.
┌─────────────────────┐
│       main.py       │
│  (CLI Entry Point)  │
└─────────┬───────────┘
          │
          ▼
┌────────────────────┐
│   config_manager   │
│  (Loads Settings)  │
└────────────────────┘
          │
          ▼
┌────────────────────┐    ┌────────────────────────┐
│    scheduler.py    │→→→→│  screenshot_capturer   │
│ (Interval Trigger) │    │ (Captures Screenshots) │
└────────────────────┘    └────────────────────────┘
                                       │
                                       ▼
                          ┌───────────────────────┐
                          │   vision_processor    │
                          │(LLM-based Description)│
                          └───────────────────────┘
                                       │
                                       ▼
                          ┌───────────────────────┐
                          │   data_extractor.py   │
                          │ (Parses & Categorizes)│
                          └───────────────────────┘
                                       │
                                       ▼
                          ┌───────────────────────┐
                          │ aggregator_reporter.py│
                          │ (Summaries & Reports) │
                          └───────────────────────┘
- Parse command-line arguments (e.g., --interval, --output-dir).
- Initialize and load configuration.
- Start the scheduler and handle graceful shutdown.
- parse_arguments()
- load_config()
- run()
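
A minimal sketch of this entry point (the --interval and --output-dir flags come from the list above; the --config flag, its defaults, and the wiring in the __main__ block are illustrative assumptions):

import argparse
import time

def parse_arguments():
    parser = argparse.ArgumentParser(
        description="Automatic time tracking by watching the computer screen")
    parser.add_argument("--interval", type=int, default=60,
                        help="screenshot interval in seconds")
    parser.add_argument("--output-dir", default="~/time_tracker_data",
                        help="directory for screenshots and logs")
    parser.add_argument("--config", default="config.yaml",
                        help="path to the configuration file")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_arguments()
    try:
        # start_scheduler(capture_callback, args.interval)  # wired up via the modules below
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        # Ctrl+C gives the graceful shutdown mentioned above
        print("Time tracker stopped.")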
- Load and validate configuration from a file (e.g., config.yaml) or command-line parameters.
- Provide a central object or dictionary containing all settings.
- Screenshot interval (default 60 seconds)
- Output directory (default ~/time_tracker_data)
- LLM mode (e.g., local vs. cloud API)
- Report generation frequency (e.g., daily/weekly/monthly)
- API keys (if using a cloud-based LLM like OpenAI’s GPT-4-Vision)
- Use a timer or loop to periodically trigger screenshot captures.
- Could be implemented via:
  - A simple while True loop with time.sleep(interval).
  - A more robust scheduling library (e.g., schedule in Python).
- start_scheduler(capture_callback, interval)
- Internally calls capture_callback() at every interval.
- Perform the actual screenshot capture on macOS.
- Handle multi-monitor setups by stitching or capturing each screen separately.
- Save images to disk with a timestamp-based filename (e.g., YYYYMMDD_HHMMSS.png).
screencapture command-line tool:
import subprocess

def capture_screenshot(output_path: str):
    subprocess.run(["screencapture", "-x", output_path])
mss Library (cross-platform option):
from mss import mss

def capture_screenshot(output_path: str):
    with mss() as sct:
        sct.shot(output=output_path)
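
A small helper for the timestamp-based filenames mentioned above (a sketch; the screenshots/ subdirectory matches the data flow section later in this document):

import os
from datetime import datetime

def screenshot_path(output_dir: str) -> str:
    # Build a filename like 20250101_120000.png inside <output_dir>/screenshots
    filename = datetime.now().strftime("%Y%m%d_%H%M%S") + ".png"
    directory = os.path.join(os.path.expanduser(output_dir), "screenshots")
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, filename)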
#### Key Functions:
- capture_screenshot(output_path)
- Take the screenshot image path as input.
- Call a vision-based LLM (e.g., GPT-4 Vision or a local model).
- Return a textual description of the screenshot content.
def generate_description(image_path: str, llm_settings: dict) -> str:
    # 1. Open the image file
    # 2. If using a cloud API, send the image to the LLM endpoint
    # 3. Receive and return textual description
    pass
- LLM Integration:
  - If using OpenAI:
    - Configure the client with llm_settings["api_key"].
    - Call the appropriate vision endpoint (see the example code later in this document).
  - If using a local model (e.g., Ollama):
    - Start the local server or invoke it from the command line with the image.
- Error Handling:
  - Handle timeouts or network failures gracefully.
  - Retry if necessary (configurable retry limit); a minimal retry helper is sketched below.
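
A minimal retry sketch, assuming a simple linear backoff policy (the function and parameter names are illustrative):

import time

def call_with_retries(func, max_retries: int = 3, backoff_seconds: float = 2.0):
    # Retry a flaky network/LLM call, sleeping a bit longer after each failure.
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except Exception:  # in practice, narrow this to timeout/network errors
            if attempt == max_retries:
                raise
            time.sleep(backoff_seconds * attempt)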
- Parse raw textual descriptions (from vision LLM) into structured data using a separate text-based LLM.
- The LLM should extract:
  - List of applications in use
  - Task categories (e.g., Work, Leisure, Communication)
  - Estimated focus level (Deep work, Passive browsing, etc.)
- Guide the LLM with a structured prompt that ensures consistent output format.
- Validate and normalize the LLM's output.
- Store the structured data for logging.
SYSTEM_PROMPT = '''
You are an activity analyzer for a time tracking system. Given a description of computer activity,
extract structured data about:
1. Applications in use (from a predefined list)
2. Activity categories (work, leisure, communication, etc.)
3. Focus level (deep_work, light_work, passive)

Your output should be JSON formatted like:
{
    "applications": ["Visual Studio Code", "Chrome"],
    "categories": ["coding", "research"],
    "focus_level": "deep_work"
}
'''
async def extract_data(description: str) -> ActivityData:
    # 1. Send description to LLM with system prompt
    # 2. Parse JSON response into structured data
    # 3. Validate against known applications and categories
    # 4. Return normalized ActivityData object
    pass
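
A synchronous sketch of steps 1 and 2, assuming the cloud path via OpenAI's chat completions API (the model choice is illustrative, and a local model could be substituted):

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_data_raw(description: str) -> dict:
    # Step 1: send the description together with the system prompt
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any instruction-following model works
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": description},
        ],
    )
    # Step 2: parse the JSON reply (may need stripping of markdown fences in practice)
    return json.loads(response.choices[0].message.content)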
Each screenshot's structured data (validated LLM output) is stored as a JSON file:
{
    "timestamp": "20250101_120000",
    "applications": ["Visual Studio Code", "Chrome"],
    "activities": ["coding", "research"],
    "focus_level": "deep_work",
    "raw_description": "User is writing code in VSCode while researching in Chrome"
}
The data extractor should:
- Maintain a list of valid applications and categories for validation (see the dataclass sketch after this list)
- Use a consistent system prompt that guides the LLM to produce well-structured output
- Handle LLM failures gracefully (fallback to basic keyword matching)
- Cache common patterns to reduce API usage
- Batch process descriptions when possible
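
A sketch of the validation step and the ActivityData container (the whitelists and field names are illustrative assumptions):

from dataclasses import dataclass, field

KNOWN_APPLICATIONS = {"Visual Studio Code", "Chrome", "Slack"}          # illustrative
KNOWN_CATEGORIES = {"coding", "research", "communication", "leisure"}   # illustrative

@dataclass
class ActivityData:
    timestamp: str
    applications: list = field(default_factory=list)
    categories: list = field(default_factory=list)
    focus_level: str = "unknown"

def normalize(raw: dict, timestamp: str) -> ActivityData:
    # Drop anything the LLM invented that is not on the known lists.
    return ActivityData(
        timestamp=timestamp,
        applications=[a for a in raw.get("applications", []) if a in KNOWN_APPLICATIONS],
        categories=[c for c in raw.get("categories", []) if c in KNOWN_CATEGORIES],
        focus_level=raw.get("focus_level", "unknown"),
    )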
- Read and aggregate the structured data from logs.
- Compute daily, weekly, monthly summaries:
  - Top 5 most-used applications.
  - Percentage breakdown by category.
  - Focused vs. passive time.
- Export results to CSV/JSON.
- aggregate_data(time_range: str) -> dict:
  - time_range could be 'daily', 'weekly', or 'monthly'.
  - Returns a dictionary of summary stats.
- generate_report(summary_dict: dict, output_format: str) -> None:
  - Creates CSV or JSON files with summary data.
{
    "total_time_monitored": "8h",
    "top_applications": ["Chrome (2h)", "Slack (1.5h)", ...],
    "category_breakdown": {
        "Work": 50,
        "Communication": 30,
        "Entertainment": 20
    },
    "focus_distribution": {
        "Deep work": "4h",
        "Passive": "2h",
        ...
    }
}
- Capture: Every interval (default: 60 seconds), the scheduler calls capture_screenshot().
- Store: The screenshot is saved to ~/time_tracker_data/screenshots/YYYYMMDD_HHMMSS.png.
- Process: vision_processor.generate_description() is invoked to describe the screenshot.
- Extract: data_extractor.extract_data() normalizes the description into structured fields.
- Log: A row is appended to data_log.csv or a local database with the extracted data (a sketch follows this list).
- Aggregate: At user request or on a scheduled basis (daily, weekly, monthly), aggregator_reporter aggregates data from data_log.csv.
- Report: The aggregator outputs a summary in CSV and/or JSON.
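
A sketch of the Log step, assuming the CSV route with one row per screenshot (field names follow the JSON record shown earlier):

import csv
import os

LOG_FIELDS = ["timestamp", "applications", "activities", "focus_level", "raw_description"]

def append_log_row(log_path: str, entry: dict) -> None:
    # Append one extracted record, writing the header if the file is new.
    is_new = not os.path.exists(log_path)
    with open(log_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=LOG_FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({k: entry.get(k, "") for k in LOG_FIELDS})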
A suggested directory layout:
time_tracker/
├── main.py
├── config_manager.py
├── scheduler.py
├── screenshot_capturer.py
├── vision_processor.py
├── data_extractor.py
├── aggregator_reporter.py
├── requirements.txt
└── config.yaml (example)
config.yaml Example:
screenshot_interval: 60 # in seconds
output_directory: "/Users/username/time_tracker_data"
llm_mode: "cloud"
llm_api_key: "sk-XXXXX"
report_frequency: "daily"
exclude_applications: ["1Password", "Keychain Access"]
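
A minimal loader sketch for config_manager, assuming PyYAML and a defaults-then-file-then-CLI precedence (the merge logic and signature are illustrative):

import os
import yaml  # PyYAML

DEFAULTS = {
    "screenshot_interval": 60,
    "output_directory": os.path.expanduser("~/time_tracker_data"),
    "llm_mode": "cloud",
    "report_frequency": "daily",
}

def load_config(path: str = "config.yaml", overrides: dict = None) -> dict:
    # Precedence: defaults < config file < command-line overrides
    config = dict(DEFAULTS)
    if os.path.exists(path):
        with open(path) as f:
            config.update(yaml.safe_load(f) or {})
    for key, value in (overrides or {}).items():
        if value is not None:
            config[key] = value
    return config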
Option 1: Use a simple infinite loop with sleep:
import time

def start_scheduler(capture_callback, interval):
    while True:
        capture_callback()
        time.sleep(interval)
Option 2: Use the schedule library for more robust scheduling:
import schedule
import time

def start_scheduler(capture_callback, interval):
    schedule.every(interval).seconds.do(capture_callback)
    while True:
        schedule.run_pending()
        time.sleep(1)
If using an OpenAI vision-capable model (a sketch against the chat completions API; the exact model name and prompt are illustrative):

import base64
from openai import OpenAI

def generate_description(image_path: str, llm_settings: dict) -> str:
    client = OpenAI(api_key=llm_settings["api_key"])
    # Encode the screenshot for transport in the request body
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model=llm_settings.get("model", "gpt-4o"),
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what the user is doing in this screenshot."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_data}"}},
            ],
        }],
    )
    return response.choices[0].message.content
If using a local model (e.g., a vision-capable model such as LLaVA served by Ollama; a sketch against Ollama's /api/generate HTTP endpoint):

import base64
import json
import urllib.request

def generate_description(image_path: str, llm_settings: dict) -> str:
    # Send the screenshot to a locally running Ollama server
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    payload = json.dumps({
        "model": llm_settings.get("model", "llava"),
        "prompt": "Describe what the user is doing in this screenshot.",
        "images": [image_b64],
        "stream": False,
    }).encode("utf-8")
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"].strip()
Maintain a list of known application keywords and categories:
APP_KEYWORDS = {
    "chrome": "Web Browser",
    "slack": "Communication",
    "word": "Document Editing",
    "excel": "Spreadsheet",
    ...
}

CATEGORY_RULES = {
    "Communication": ["slack", "teams", "outlook"],
    "Work": ["word", "excel", "jupyter", "pycharm"],
    "Entertainment": ["youtube", "netflix", "spotify"]
}
Pseudo-code:
def extract_data(description: str) -> dict:
    apps_found = []
    categories_found = set()
    desc_lower = description.lower()

    for keyword, app_name in APP_KEYWORDS.items():
        if keyword in desc_lower:
            apps_found.append(app_name)

    for cat_name, cat_keywords in CATEGORY_RULES.items():
        for kw in cat_keywords:
            if kw in desc_lower:
                categories_found.add(cat_name)

    # Simple heuristic for focus level
    if any(site in desc_lower for site in ["youtube", "netflix"]):
        focus_level = "Passive"
    else:
        focus_level = "Deep work"

    return {
        "applications": apps_found,
        "categories": list(categories_found),
        "focus_level": focus_level,
    }
- Data Source: data_log.csv (or a lightweight SQLite DB).
- Daily Aggregation:
  - Filter rows by date.
  - Count occurrences/total time for each application.
  - Sum categories/focus levels.
- Generate Output:
import csv
import json

def aggregate_data(time_range: str) -> dict:
    # read data_log.csv
    # filter rows by time_range
    # compute sums, totals, breakdowns
    summary_dict = {}  # placeholder for the computed summary
    return summary_dict

def generate_report(summary_dict: dict, output_format: str, output_path: str):
    if output_format == "json":
        with open(output_path, "w") as f:
            json.dump(summary_dict, f, indent=2)
    elif output_format == "csv":
        with open(output_path, "w", newline="") as f:
            writer = csv.writer(f)
            # format summary_dict into rows and write
    print(f"Report written to {output_path}")
- Config File: Users can edit config.yaml to customize intervals, LLM mode, or advanced options.
- Command-Line Overrides: For quick changes, e.g.:
python main.py --interval 30 --output-dir /tmp/screens
- Future Enhancements:
  - Real-time dashboard in a web UI (Phase 3/4).
  - Integration with task management tools (Trello, Jira).
  - Additional heuristics or advanced ML for activity classification.
- Local Storage: By default, all screenshots and logs remain on the local machine.
- Encryption: Optionally, screenshots can be encrypted at rest using user-supplied credentials (e.g., with the cryptography library; see the sketch after this list).
- Sensitive Window Exclusion: A future feature might detect certain window titles or processes (e.g., password managers) and blur them or skip captures.
- Network Security: If using a cloud-based LLM, ensure connections use HTTPS and keys are not exposed in logs.
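
A sketch of the optional encryption-at-rest step using the cryptography library's Fernet recipe (key management is left to the user, e.g., via the macOS Keychain):

from cryptography.fernet import Fernet

def encrypt_file(path: str, key: bytes) -> None:
    # Encrypt a screenshot in place with a symmetric Fernet key.
    fernet = Fernet(key)
    with open(path, "rb") as f:
        ciphertext = fernet.encrypt(f.read())
    with open(path, "wb") as f:
        f.write(ciphertext)

# One-time setup: key = Fernet.generate_key()  # store securely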
- LLM Failure: If the LLM call fails, store a placeholder description ("LLM error") and log the event for debugging.
- Network Timeout: Retry n times (configurable). If still failing, proceed with a placeholder.
- Screenshot Failure: Catch exceptions when taking screenshots (e.g., permission issues). Prompt the user to grant screen capture permission in System Preferences (macOS).
- No Screenshots: If no screenshots exist for a report period, produce an empty or minimal report with a warning message.
- Unit Tests:
  - test_screenshot_capturer.py: Mocks screenshot capture and verifies correct file output.
  - test_vision_processor.py: Mocks LLM calls; checks parsing of responses.
  - test_data_extractor.py: Uses sample descriptions to validate extraction logic.
  - test_aggregator_reporter.py: Checks correct summation and CSV/JSON output.
- Integration Tests:
  - End-to-end test: run the app for a short interval (e.g., 10 seconds), verifying logs and generated reports.
- User Acceptance Tests:
  - Validate daily/weekly/monthly summary correctness with known sample data.
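
As an illustration of the mocking approach, a pytest-style sketch for the screenshot capturer (it assumes the screencapture-based capture_screenshot variant shown earlier):

from unittest import mock

from screenshot_capturer import capture_screenshot

def test_capture_screenshot_invokes_screencapture(tmp_path):
    # The macOS backend should shell out to screencapture with the target path.
    output = str(tmp_path / "shot.png")
    with mock.patch("subprocess.run") as run:
        capture_screenshot(output)
    run.assert_called_once_with(["screencapture", "-x", output])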
The above design provides a modular, extensible, and configurable Python-based solution for automated time tracking through screenshot capture and vision-based LLM analysis. The outlined modules and data flows ensure that a programmer can implement the system with minimal ambiguity. Future extensions—such as real-time dashboards or advanced task classification—can be layered on top without major structural changes.