Gist by @vivekhaldar, created February 9, 2025.

Software Design Document

1. Introduction

This document describes a command-line Python application for Automatic Time Tracking by Watching Computer Screen. The application periodically captures screenshots on macOS, processes them using a vision-based language model, and generates time-tracking reports.

The design covers:

  • Overall architecture
  • System components
  • Data flows
  • Implementation details
  • Configuration and extensibility
  • Security and privacy considerations

2. System Overview

2.1 High-Level Flow

  1. Scheduler triggers screenshot capture at a configurable interval (default: 1 minute).
  2. Screenshot Capturer saves screenshots to a designated directory.
  3. Vision Processor (using a vision-based LLM) generates textual descriptions of each screenshot.
  4. Data Extractor parses and normalizes textual descriptions into structured activity logs.
  5. Aggregator & Reporter summarizes daily/weekly/monthly data into CSV and JSON reports.

2.2 Execution Context

Platform: macOS (with possible future extension to Windows/Linux).

Runtime: Python 3.x

Interface: Command line (CLI)

3. Architecture & Components

The application is split into several logical modules to keep concerns separated and the code maintainable.

┌─────────────────────┐
│       main.py       │
│  (CLI Entry Point)  │
└─────────┬───────────┘
          │
          ▼
┌────────────────────┐
│   config_manager   │
│  (Loads Settings)  │
└────────────────────┘
          │
          ▼
┌────────────────────┐      ┌────────────────────────┐
│    scheduler.py    │ ───▶ │  screenshot_capturer   │
│ (Interval Trigger) │      │ (Captures Screenshots) │
└────────────────────┘      └────────────────────────┘
                                        │
                                        ▼
                            ┌───────────────────────┐
                            │    vision_processor   │
                            │(LLM-based Description)│
                            └───────────────────────┘
                                        │
                                        ▼
                            ┌───────────────────────┐
                            │   data_extractor.py   │
                            │ (Parses & Categorizes)│
                            └───────────────────────┘
                                        │
                                        ▼
                            ┌───────────────────────┐
                            │ aggregator_reporter.py│
                            │ (Summaries & Reports) │
                            └───────────────────────┘

3.1 main.py (CLI Entry Point)

Responsibilities:

  • Parse command-line arguments (e.g., --interval, --output-dir, etc.).
  • Initialize and load configuration.
  • Start the scheduler and handle graceful shutdown.

Key Functions:

  • parse_arguments()
  • load_config()
  • run()
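A minimal sketch of parse_arguments(), assuming the flag names shown in section 6 (--interval, --output-dir) and defaults mirroring section 3.2; the actual argument set may grow with the config options:

```python
import argparse

def parse_arguments(argv=None) -> argparse.Namespace:
    """Parse CLI flags; defaults here mirror the config-file defaults."""
    parser = argparse.ArgumentParser(
        description="Automatic time tracking via periodic screenshots")
    parser.add_argument("--interval", type=int, default=60,
                        help="seconds between screenshot captures")
    parser.add_argument("--output-dir", default="~/time_tracker_data",
                        help="directory for screenshots and logs")
    return parser.parse_args(argv)
```

Passing `argv` explicitly (instead of reading sys.argv) keeps the function easy to unit-test.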

3.2 config_manager.py

Responsibilities:

  • Load and validate configuration from a file (e.g., config.yaml) or command-line parameters.
  • Provide a central object or dictionary containing all settings.

Configurable Parameters:

  • Screenshot interval (default 60 seconds)
  • Output directory (default ~/time_tracker_data)
  • LLM mode (e.g., local vs. cloud API)
  • Report generation frequency (e.g., daily/weekly/monthly)
  • API keys (if using a cloud-based LLM like OpenAI’s GPT-4-Vision)
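One way config_manager might combine these sources is a simple precedence merge (defaults < config file < CLI flags). This is a sketch; the key names follow section 5.2's config.yaml example, and build_config is a hypothetical helper:

```python
DEFAULTS = {
    "screenshot_interval": 60,            # seconds
    "output_directory": "~/time_tracker_data",
    "llm_mode": "cloud",
    "report_frequency": "daily",
}

def build_config(file_settings: dict, cli_overrides: dict) -> dict:
    """Merge settings with increasing precedence: defaults, file, CLI."""
    config = dict(DEFAULTS)
    config.update({k: v for k, v in file_settings.items() if v is not None})
    config.update({k: v for k, v in cli_overrides.items() if v is not None})
    return config
```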

3.3 scheduler.py

Responsibilities:

  • Use a timer or loop to periodically trigger screenshot captures.
  • Could be implemented via:
    • A simple while True loop with time.sleep(interval).
    • A more robust scheduling library (e.g., schedule in Python).

Key Functions:

  • start_scheduler(capture_callback, interval)
  • Internally calls capture_callback() at every interval.

3.4 screenshot_capturer.py

Responsibilities:

  • Perform the actual screenshot on macOS.
  • Handle multi-monitor setups by stitching or capturing each screen separately.
  • Save images to disk with a timestamp-based filename (e.g., YYYYMMDD_HHMMSS.png).

Implementation Options:

screencapture command-line tool:

import subprocess
def capture_screenshot(output_path: str):
    subprocess.run(["screencapture", "-x", output_path])

mss Library (cross-platform option):

from mss import mss
def capture_screenshot(output_path: str):
    with mss() as sct:
        sct.shot(output=output_path)

Key Functions:

  • capture_screenshot(output_path)

3.5 vision_processor.py

Responsibilities:

  • Take the screenshot image path as input.
  • Call a vision-based LLM (e.g., GPT-4 Vision or a local model).
  • Return a textual description of the screenshot content.

Pseudo-Code:

def generate_description(image_path: str, llm_settings: dict) -> str:
    # 1. Open the image file
    # 2. If using a cloud API, send the image to the LLM endpoint
    # 3. Receive and return textual description
    pass

Details:

  • LLM Integration:
    • If using OpenAI:
      • openai.api_key = llm_settings["api_key"]
      • Call the appropriate vision endpoint (if available).
    • If using a local model (e.g., Ollama):
      • Start local server or run command-line with the image.
  • Error Handling:
    • Handle timeouts or network failures gracefully.
    • Retries if necessary (configurable retry limit).
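A generic retry wrapper for the LLM call might look like the following sketch; the "LLM error" placeholder on final failure follows section 8, and the linear backoff is an assumption:

```python
import time

def with_retries(func, max_retries=3, backoff_seconds=2.0):
    """Call func(); on exception, retry up to max_retries with linear backoff.

    Returns the placeholder string "LLM error" if every attempt fails.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                return "LLM error"  # placeholder description, per section 8
            time.sleep(backoff_seconds * attempt)
```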

3.6 data_extractor.py

Responsibilities:

  • Parse raw textual descriptions (from vision LLM) into structured data using a separate text-based LLM.
  • The LLM should extract:
    • List of applications in use
    • Task categories (e.g., Work, Leisure, Communication)
    • Estimated focus level (Deep work, Passive browsing, etc.)
  • Guide the LLM with a structured prompt that ensures consistent output format.
  • Validate and normalize the LLM's output.
  • Store the structured data for logging.

Possible Approach:

SYSTEM_PROMPT = '''
You are an activity analyzer for a time tracking system. Given a description of computer activity,
extract structured data about:
1. Applications in use (from a predefined list)
2. Activity categories (work, leisure, communication, etc.)
3. Focus level (deep_work, light_work, passive)

Your output should be JSON formatted like:
{
    "applications": ["Visual Studio Code", "Chrome"],
    "categories": ["coding", "research"],
    "focus_level": "deep_work"
}
'''

async def extract_data(description: str) -> ActivityData:
    # 1. Send description to LLM with system prompt
    # 2. Parse JSON response into structured data
    # 3. Validate against known applications and categories
    # 4. Return normalized ActivityData object
    pass

Data Storage:

Each screenshot's structured data (validated LLM output) is stored as a JSON file:

{
    "timestamp": "20250101_120000",
    "applications": ["Visual Studio Code", "Chrome"],
    "activities": ["coding", "research"],
    "focus_level": "deep_work",
    "raw_description": "User is writing code in VSCode while researching in Chrome"
}
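Persisting one of these records could be as simple as the sketch below; the activity/ subdirectory name and timestamp-based filename are assumptions consistent with the output-directory layout in section 4:

```python
import json
from pathlib import Path

def store_activity_record(record: dict, output_dir: str) -> Path:
    """Write one screenshot's structured data as <timestamp>.json."""
    activity_dir = Path(output_dir) / "activity"
    activity_dir.mkdir(parents=True, exist_ok=True)
    path = activity_dir / f"{record['timestamp']}.json"
    path.write_text(json.dumps(record, indent=2))
    return path
```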

LLM Integration:

The data extractor should:

  1. Maintain a list of valid applications and categories for validation
  2. Use a consistent system prompt that guides the LLM to produce well-structured output
  3. Handle LLM failures gracefully (fallback to basic keyword matching)
  4. Cache common patterns to reduce API usage
  5. Batch process descriptions when possible
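Steps 1-3 above can be sketched as a validator over the LLM's JSON reply. The valid-value sets and the passive/empty fallback defaults are illustrative assumptions:

```python
import json

# Illustrative whitelists; in practice these come from configuration.
VALID_APPS = {"Visual Studio Code", "Chrome", "Slack"}
VALID_CATEGORIES = {"coding", "research", "communication", "leisure"}
VALID_FOCUS = {"deep_work", "light_work", "passive"}

def validate_llm_output(raw_response: str) -> dict:
    """Parse the LLM's JSON reply and drop any values outside the known lists."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        # Graceful fallback when the LLM returns malformed output.
        return {"applications": [], "categories": [], "focus_level": "passive"}
    return {
        "applications": [a for a in data.get("applications", []) if a in VALID_APPS],
        "categories": [c for c in data.get("categories", []) if c in VALID_CATEGORIES],
        "focus_level": data.get("focus_level")
        if data.get("focus_level") in VALID_FOCUS else "passive",
    }
```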

3.7 aggregator_reporter.py

Responsibilities:

  • Read and aggregate the structured data from logs.
  • Compute daily, weekly, and monthly summaries:
    • Top 5 most-used applications.
    • Percentage breakdown by category.
    • Focused vs. passive time.
  • Export results to CSV/JSON.

Key Functions:

  • aggregate_data(time_range: str) -> dict:
    • time_range can be 'daily', 'weekly', or 'monthly'.
    • Returns a dictionary of summary stats.
  • generate_report(summary_dict: dict, output_format: str) -> None:
    • Creates CSV or JSON files with summary data.

Example Summaries:

{
  "total_time_monitored": "8h",
  "top_applications": ["Chrome (2h)", "Slack (1.5h)", ...],
  "category_breakdown": {
    "Work": 50,
    "Communication": 30,
    "Entertainment": 20
  },
  "focus_distribution": {
    "Deep work": "4h",
    "Passive": "2h",
    ...
  }
}

4. Data Flow

  1. Capture: Every interval seconds (default 60), the scheduler calls capture_screenshot().
  2. Store: Screenshot gets saved to ~/time_tracker_data/screenshots/YYYYMMDD_HHMMSS.png.
  3. Process: vision_processor.generate_description() is invoked to describe the screenshot.
  4. Extract: data_extractor.extract_data() normalizes the description into structured fields.
  5. Log: A row is appended to data_log.csv or a local database with the extracted data.
  6. Aggregate: At user request or on a scheduled basis (daily, weekly, monthly), aggregator_reporter aggregates data from data_log.csv.
  7. Report: The aggregator outputs a summary in CSV and/or JSON.
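Step 5 (appending to data_log.csv) could be sketched as below; the column set and the semicolon-joined list encoding are assumptions, not a fixed schema:

```python
import csv
from pathlib import Path

FIELDS = ["timestamp", "applications", "categories", "focus_level"]

def append_log_row(record: dict, log_path: str) -> None:
    """Append one extracted record to data_log.csv, writing a header on first use."""
    path = Path(log_path)
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "timestamp": record["timestamp"],
            "applications": ";".join(record["applications"]),
            "categories": ";".join(record["categories"]),
            "focus_level": record["focus_level"],
        })
```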

5. Implementation Details

5.1 Directory Structure

A suggested directory layout:

time_tracker/
├── main.py
├── config_manager.py
├── scheduler.py
├── screenshot_capturer.py
├── vision_processor.py
├── data_extractor.py
├── aggregator_reporter.py
├── requirements.txt
└── config.yaml (example)

5.2 Configuration Format

config.yaml Example:

screenshot_interval: 60        # in seconds
output_directory: "/Users/username/time_tracker_data"
llm_mode: "cloud"
llm_api_key: "sk-XXXXX"
report_frequency: "daily"
exclude_applications: ["1Password", "Keychain Access"]

5.3 Scheduling

Option 1: Use a simple infinite loop with sleep:

import time
def start_scheduler(capture_callback, interval):
    while True:
        capture_callback()
        time.sleep(interval)

Option 2: Use the schedule library for more robust scheduling:

import schedule
import time

def start_scheduler(capture_callback, interval):
    schedule.every(interval).seconds.do(capture_callback)
    while True:
        schedule.run_pending()
        time.sleep(1)

5.4 Vision LLM Interaction

If using OpenAI GPT-4 Vision (hypothetical code, as GPT-4 Vision specifics may differ):

import openai

def generate_description(image_path: str, llm_settings: dict) -> str:
    openai.api_key = llm_settings["api_key"]
    # open the image
    with open(image_path, "rb") as f:
        image_data = f.read()

    # Hypothetical endpoint for GPT-4 Vision
    response = openai.Image.create_description(
        image=image_data,
        # additional parameters
    )
    return response['description']

If using a Local Model (e.g., Ollama):

import subprocess
import json

def generate_description(image_path: str, llm_settings: dict) -> str:
    # Hypothetical invocation; the actual Ollama subcommand/flags may differ:
    result = subprocess.run([
        "ollama",
        "describe-image",
        "--model", llm_settings.get("model_path"),
        image_path
    ], capture_output=True, text=True)
    # parse result
    return result.stdout.strip()

5.5 Data Extraction (Regex / Rule-Based)

Maintain a list of known application keywords and categories:

APP_KEYWORDS = {
    "chrome": "Web Browser",
    "slack": "Communication",
    "word": "Document Editing",
    "excel": "Spreadsheet",
    ...
}

CATEGORY_RULES = {
    "Communication": ["slack", "teams", "outlook"],
    "Work": ["word", "excel", "jupyter", "pycharm"],
    "Entertainment": ["youtube", "netflix", "spotify"]
}

Pseudo-code:

import re

def extract_data(description: str) -> dict:
    apps_found = []
    categories_found = set()

    desc_lower = description.lower()
    for keyword, app_name in APP_KEYWORDS.items():
        if keyword in desc_lower:
            apps_found.append(app_name)

    for cat_name, cat_keywords in CATEGORY_RULES.items():
        for kw in cat_keywords:
            if kw in desc_lower:
                categories_found.add(cat_name)

    # Simple heuristic for focus level
    if any(site in desc_lower for site in ["youtube", "netflix"]):
        focus_level = "Passive"
    else:
        focus_level = "Deep work"

    return {
        "applications": apps_found,
        "categories": list(categories_found),
        "focus_level": focus_level,
    }

5.6 Aggregation & Reporting

  • Data Source: data_log.csv (or a lightweight SQLite DB).
  • Daily Aggregation:
    • Filter rows by date.
    • Count occurrences/total time for each application.
    • Sum categories/focus levels.
  • Generate Output:
import csv
import json
from datetime import datetime

def aggregate_data(time_range: str) -> dict:
    summary_dict = {}
    # read data_log.csv
    # filter rows by time_range
    # compute sums, totals, breakdowns into summary_dict
    return summary_dict

def generate_report(summary_dict: dict, output_format: str, output_path: str):
    if output_format == "json":
        with open(output_path, "w") as f:
            json.dump(summary_dict, f, indent=2)
    elif output_format == "csv":
        with open(output_path, "w", newline="") as f:
            writer = csv.writer(f)
            for key, value in summary_dict.items():
                writer.writerow([key, value])
    print(f"Report written to {output_path}")

6. Configuration & Extensibility

  • Config File: Users can edit config.yaml to customize intervals, LLM mode, or advanced options.
  • Command-Line Overrides: For quick changes, e.g.:
python main.py --interval 30 --output-dir /tmp/screens
  • Future Enhancements:
    • Real-time dashboard in a web UI (Phase 3/4).
    • Integration with task management tools (Trello, Jira).
    • Additional heuristics or advanced ML for activity classification.

7. Security & Privacy

  • Local Storage: By default, all screenshots and logs remain on the local machine.
  • Encryption: Optionally, screenshots can be encrypted at rest using user-supplied credentials (e.g., with cryptography library).
  • Sensitive Window Exclusion: A future feature might detect certain window titles or processes (e.g., password managers) and blur them or skip captures.
  • Network Security: If using a cloud-based LLM, ensure connections use HTTPS and keys are not exposed in logs.
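The sensitive-window exclusion could start as a simple filter against the exclude_applications list from config.yaml. Detecting the frontmost application itself (e.g., via macOS APIs) is out of scope here; should_skip_capture is a hypothetical helper showing only the filtering step:

```python
def should_skip_capture(frontmost_app: str, exclude_applications: list) -> bool:
    """Return True when the active application is on the user's exclusion list."""
    return frontmost_app.strip().lower() in {a.lower() for a in exclude_applications}
```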

8. Error Handling & Edge Cases

  • LLM Failure: If the LLM call fails, store a placeholder description ("LLM error") and log the event for debugging.
  • Network Timeout: Retry n times (configurable). If still failing, proceed with a placeholder.
  • Screenshot Failure: Catch exceptions when taking screenshots (e.g., permission issues). Prompt the user to grant screen capture permission in System Preferences (macOS).
  • No Screens: If no screenshots exist for a report period, produce an empty or minimal report with a warning message.

9. Testing Strategy

  • Unit Tests:
    • test_screenshot_capturer.py: Mocks screenshot capture and verifies correct file output.
    • test_vision_processor.py: Mocks LLM calls; checks parsing of responses.
    • test_data_extractor.py: Uses sample descriptions to validate extraction logic.
    • test_aggregator_reporter.py: Checks correct summation and CSV/JSON output.
  • Integration Tests:
    • End-to-end test: run the app for a short interval (e.g., 10 seconds), verifying logs and generated reports.
  • User Acceptance Tests:
    • Validate daily/weekly/monthly summary correctness with known sample data.
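The mocking pattern for the LLM-dependent tests might look like this sketch (run with python -m unittest). describe_and_extract is a hypothetical seam for illustration, not a function defined by the modules above:

```python
import unittest
from unittest.mock import MagicMock

def describe_and_extract(image_path: str, describe, extract) -> dict:
    """Tiny pipeline seam: describe a screenshot, then extract structured data."""
    return extract(describe(image_path))

class PipelineTest(unittest.TestCase):
    def test_llm_call_is_mocked(self):
        # Mocks stand in for the vision LLM and the data extractor.
        describe = MagicMock(return_value="User is coding in VSCode")
        extract = MagicMock(return_value={"applications": ["Visual Studio Code"]})
        result = describe_and_extract("fake.png", describe, extract)
        describe.assert_called_once_with("fake.png")
        self.assertEqual(result["applications"], ["Visual Studio Code"])
```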

10. Conclusion

The above design provides a modular, extensible, and configurable Python-based solution for automated time tracking through screenshot capture and vision-based LLM analysis. The outlined modules and data flows ensure that a programmer can implement the system with minimal ambiguity. Future extensions—such as real-time dashboards or advanced task classification—can be layered on top without major structural changes.
