This document describes a command-line Python application for automatic time tracking by watching the computer screen. The application periodically captures screenshots on macOS, processes them with a vision-based language model, and generates time-tracking reports.
The design covers:
- Overall architecture
- System components
- Data flows
- Implementation details
- Configuration and extensibility
- Security and privacy considerations
- Scheduler triggers screenshot capture at a configurable interval (default: 1 minute).
- Screenshot Capturer saves screenshots to a designated directory.
- Vision Processor (using a vision-based LLM) generates textual descriptions of each screenshot.
- Data Extractor parses and normalizes textual descriptions into structured activity logs.
- Aggregator & Reporter summarizes daily/weekly/monthly data into CSV and JSON reports.
Platform: macOS (with possible future extension to Windows/Linux).
Runtime: Python 3.x
Interface: Command line (CLI)
The application is split into several logical modules to keep concerns separated and the code maintainable.
┌─────────────────────┐
│       main.py       │
│  (CLI Entry Point)  │
└─────────┬───────────┘
          │
          ▼
┌────────────────────┐
│   config_manager   │
│  (Loads Settings)  │
└────────────────────┘
          │
          ▼
┌────────────────────┐    ┌────────────────────────┐
│    scheduler.py    │→→→→│  screenshot_capturer   │
│ (Interval Trigger) │    │ (Captures Screenshots) │
└────────────────────┘    └────────────────────────┘
                                       │
                                       ▼
                          ┌───────────────────────┐
                          │   vision_processor    │
                          │(LLM-based Description)│
                          └───────────────────────┘
                                       │
                                       ▼
                          ┌───────────────────────┐
                          │   data_extractor.py   │
                          │ (Parses & Categorizes)│
                          └───────────────────────┘
                                       │
                                       ▼
                          ┌───────────────────────┐
                          │ aggregator_reporter.py│
                          │ (Summaries & Reports) │
                          └───────────────────────┘
- Parse command-line arguments (e.g., --interval, --output-dir).
- Initialize and load configuration.
- Start the scheduler and handle graceful shutdown.
- parse_arguments()
- load_config()
- run()
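
A minimal sketch of this entry point (the --interval and --output-dir flags come from the list above; the --config flag, its defaults, and the wiring in the __main__ block are illustrative assumptions):

import argparse
import time

def parse_arguments():
    parser = argparse.ArgumentParser(
        description="Automatic time tracking by watching the computer screen")
    parser.add_argument("--interval", type=int, default=60,
                        help="screenshot interval in seconds")
    parser.add_argument("--output-dir", default="~/time_tracker_data",
                        help="directory for screenshots and logs")
    parser.add_argument("--config", default="config.yaml",
                        help="path to the configuration file")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_arguments()
    try:
        # start_scheduler(capture_callback, args.interval)  # wired up via the modules below
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        # Ctrl+C gives the graceful shutdown mentioned above
        print("Time tracker stopped.")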
- Load and validate configuration from a file (e.g., config.yaml) or command-line parameters.
- Provide a central object or dictionary containing all settings.
- Screenshot interval (default 60 seconds)
- Output directory (default ~/time_tracker_data)
- LLM mode (e.g., local vs. cloud API)
- Report generation frequency (e.g., daily/weekly/monthly)
- API keys (if using a cloud-based LLM like OpenAI’s GPT-4-Vision)
- Use a timer or loop to periodically trigger screenshot captures.
- Could be implemented via:
  - A simple while True loop with time.sleep(interval).
  - A more robust scheduling library (e.g., schedule in Python).
- start_scheduler(capture_callback, interval)
- Internally calls capture_callback() at every interval.
- Perform the actual screenshot capture on macOS.
- Handle multi-monitor setups by stitching or capturing each screen separately.
- Save images to disk with a timestamp-based filename (e.g., YYYYMMDD_HHMMSS.png).
screencapture command-line tool:
import subprocess

def capture_screenshot(output_path: str):
    subprocess.run(["screencapture", "-x", output_path])
mss Library (cross-platform option):
from mss import mss

def capture_screenshot(output_path: str):
    with mss() as sct:
        sct.shot(output=output_path)
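
A small helper for the timestamp-based filenames mentioned above (a sketch; the screenshots/ subdirectory matches the data flow section later in this document):

import os
from datetime import datetime

def screenshot_path(output_dir: str) -> str:
    # Build a filename like 20250101_120000.png inside <output_dir>/screenshots
    filename = datetime.now().strftime("%Y%m%d_%H%M%S") + ".png"
    directory = os.path.join(os.path.expanduser(output_dir), "screenshots")
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, filename)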
#### Key Functions:
- capture_screenshot(output_path)
- Take the screenshot image path as input.
- Call a vision-based LLM (e.g., GPT-4 Vision or a local model).
- Return a textual description of the screenshot content.
def generate_description(image_path: str, llm_settings: dict) -> str:
    # 1. Open the image file
    # 2. If using a cloud API, send the image to the LLM endpoint
    # 3. Receive and return textual description
    pass
- LLM Integration:
  - If using OpenAI:
    - Configure the client with llm_settings["api_key"].
    - Call the appropriate vision endpoint (see the example code later in this document).
  - If using a local model (e.g., Ollama):
    - Start the local server or invoke it from the command line with the image.
- Error Handling:
  - Handle timeouts or network failures gracefully.
  - Retry if necessary (configurable retry limit); a minimal retry helper is sketched below.
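
A minimal retry sketch, assuming a simple linear backoff policy (the function and parameter names are illustrative):

import time

def call_with_retries(func, max_retries: int = 3, backoff_seconds: float = 2.0):
    # Retry a flaky network/LLM call, sleeping a bit longer after each failure.
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except Exception:  # in practice, narrow this to timeout/network errors
            if attempt == max_retries:
                raise
            time.sleep(backoff_seconds * attempt)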
- Parse raw textual descriptions (from vision LLM) into structured data using a separate text-based LLM.
- The LLM should extract:
  - List of applications in use
  - Task categories (e.g., Work, Leisure, Communication)
  - Estimated focus level (Deep work, Passive browsing, etc.)
- Guide the LLM with a structured prompt that ensures consistent output format.
- Validate and normalize the LLM's output.
- Store the structured data for logging.
SYSTEM_PROMPT = '''
You are an activity analyzer for a time tracking system. Given a description of computer activity,
extract structured data about:
1. Applications in use (from a predefined list)
2. Activity categories (work, leisure, communication, etc.)
3. Focus level (deep_work, light_work, passive)

Your output should be JSON formatted like:
{
    "applications": ["Visual Studio Code", "Chrome"],
    "categories": ["coding", "research"],
    "focus_level": "deep_work"
}
'''
async def extract_data(description: str) -> ActivityData:
    # 1. Send description to LLM with system prompt
    # 2. Parse JSON response into structured data
    # 3. Validate against known applications and categories
    # 4. Return normalized ActivityData object
    pass
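
A synchronous sketch of steps 1 and 2, assuming the cloud path via OpenAI's chat completions API (the model choice is illustrative, and a local model could be substituted):

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_data_raw(description: str) -> dict:
    # Step 1: send the description together with the system prompt
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any instruction-following model works
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": description},
        ],
    )
    # Step 2: parse the JSON reply (may need stripping of markdown fences in practice)
    return json.loads(response.choices[0].message.content)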
Each screenshot's structured data (validated LLM output) is stored as a JSON file:
{
    "timestamp": "20250101_120000",
    "applications": ["Visual Studio Code", "Chrome"],
    "activities": ["coding", "research"],
    "focus_level": "deep_work",
    "raw_description": "User is writing code in VSCode while researching in Chrome"
}
The data extractor should:
- Maintain a list of valid applications and categories for validation (see the dataclass sketch after this list)
- Use a consistent system prompt that guides the LLM to produce well-structured output
- Handle LLM failures gracefully (fallback to basic keyword matching)
- Cache common patterns to reduce API usage
- Batch process descriptions when possible
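
A sketch of the validation step and the ActivityData container (the whitelists and field names are illustrative assumptions):

from dataclasses import dataclass, field

KNOWN_APPLICATIONS = {"Visual Studio Code", "Chrome", "Slack"}          # illustrative
KNOWN_CATEGORIES = {"coding", "research", "communication", "leisure"}   # illustrative

@dataclass
class ActivityData:
    timestamp: str
    applications: list = field(default_factory=list)
    categories: list = field(default_factory=list)
    focus_level: str = "unknown"

def normalize(raw: dict, timestamp: str) -> ActivityData:
    # Drop anything the LLM invented that is not on the known lists.
    return ActivityData(
        timestamp=timestamp,
        applications=[a for a in raw.get("applications", []) if a in KNOWN_APPLICATIONS],
        categories=[c for c in raw.get("categories", []) if c in KNOWN_CATEGORIES],
        focus_level=raw.get("focus_level", "unknown"),
    )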
- Read and aggregate the structured data from logs.
- Compute daily, weekly, monthly summaries:
  - Top 5 most-used applications.
  - Percentage breakdown by category.
  - Focused vs. passive time.
- Export results to CSV/JSON.
- aggregate_data(time_range: str) -> dict:
  - time_range could be 'daily', 'weekly', or 'monthly'.
  - Returns a dictionary of summary stats.
- generate_report(summary_dict: dict, output_format: str) -> None:
  - Creates CSV or JSON files with summary data.
{
    "total_time_monitored": "8h",
    "top_applications": ["Chrome (2h)", "Slack (1.5h)", ...],
    "category_breakdown": {
        "Work": 50,
        "Communication": 30,
        "Entertainment": 20
    },
    "focus_distribution": {
        "Deep work": "4h",
        "Passive": "2h",
        ...
    }
}
- Capture: Every interval (default: 60 seconds), the scheduler calls capture_screenshot().
- Store: The screenshot is saved to ~/time_tracker_data/screenshots/YYYYMMDD_HHMMSS.png.
- Process: vision_processor.generate_description() is invoked to describe the screenshot.
- Extract: data_extractor.extract_data() normalizes the description into structured fields.
- Log: A row is appended to data_log.csv or a local database with the extracted data (a sketch follows this list).
- Aggregate: At user request or on a scheduled basis (daily, weekly, monthly), aggregator_reporter aggregates data from data_log.csv.
- Report: The aggregator outputs a summary in CSV and/or JSON.
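
A sketch of the Log step, assuming the CSV route with one row per screenshot (field names follow the JSON record shown earlier):

import csv
import os

LOG_FIELDS = ["timestamp", "applications", "activities", "focus_level", "raw_description"]

def append_log_row(log_path: str, entry: dict) -> None:
    # Append one extracted record, writing the header if the file is new.
    is_new = not os.path.exists(log_path)
    with open(log_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=LOG_FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({k: entry.get(k, "") for k in LOG_FIELDS})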
A suggested directory layout:
time_tracker/
├── main.py
├── config_manager.py
├── scheduler.py
├── screenshot_capturer.py
├── vision_processor.py
├── data_extractor.py
├── aggregator_reporter.py
├── requirements.txt
└── config.yaml (example)
config.yaml Example:
screenshot_interval: 60 # in seconds
output_directory: "/Users/username/time_tracker_data"
llm_mode: "cloud"
llm_api_key: "sk-XXXXX"
report_frequency: "daily"
exclude_applications: ["1Password", "Keychain Access"]
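
A minimal loader sketch for config_manager, assuming PyYAML and a defaults-then-file-then-CLI precedence (the merge logic and signature are illustrative):

import os
import yaml  # PyYAML

DEFAULTS = {
    "screenshot_interval": 60,
    "output_directory": os.path.expanduser("~/time_tracker_data"),
    "llm_mode": "cloud",
    "report_frequency": "daily",
}

def load_config(path: str = "config.yaml", overrides: dict = None) -> dict:
    # Precedence: defaults < config file < command-line overrides
    config = dict(DEFAULTS)
    if os.path.exists(path):
        with open(path) as f:
            config.update(yaml.safe_load(f) or {})
    for key, value in (overrides or {}).items():
        if value is not None:
            config[key] = value
    return config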
Option 1: Use a simple infinite loop with sleep:
import time

def start_scheduler(capture_callback, interval):
    while True:
        capture_callback()
        time.sleep(interval)
Option 2: Use the schedule library for more robust scheduling:
import schedule
import time

def start_scheduler(capture_callback, interval):
    schedule.every(interval).seconds.do(capture_callback)
    while True:
        schedule.run_pending()
        time.sleep(1)
If using an OpenAI vision-capable model (a sketch against the chat completions API; the exact model name and prompt are illustrative):

import base64
from openai import OpenAI

def generate_description(image_path: str, llm_settings: dict) -> str:
    client = OpenAI(api_key=llm_settings["api_key"])
    # Encode the screenshot for transport in the request body
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model=llm_settings.get("model", "gpt-4o"),
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what the user is doing in this screenshot."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_data}"}},
            ],
        }],
    )
    return response.choices[0].message.content
If using a local model (e.g., a vision-capable model such as LLaVA served by Ollama; a sketch against Ollama's /api/generate HTTP endpoint):

import base64
import json
import urllib.request

def generate_description(image_path: str, llm_settings: dict) -> str:
    # Send the screenshot to a locally running Ollama server
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    payload = json.dumps({
        "model": llm_settings.get("model", "llava"),
        "prompt": "Describe what the user is doing in this screenshot.",
        "images": [image_b64],
        "stream": False,
    }).encode("utf-8")
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"].strip()
Maintain a list of known application keywords and categories:
APP_KEYWORDS = {
    "chrome": "Web Browser",
    "slack": "Communication",
    "word": "Document Editing",
    "excel": "Spreadsheet",
    ...
}

CATEGORY_RULES = {
    "Communication": ["slack", "teams", "outlook"],
    "Work": ["word", "excel", "jupyter", "pycharm"],
    "Entertainment": ["youtube", "netflix", "spotify"]
}
Pseudo-code:
def extract_data(description: str) -> dict:
    apps_found = []
    categories_found = set()
    desc_lower = description.lower()

    for keyword, app_name in APP_KEYWORDS.items():
        if keyword in desc_lower:
            apps_found.append(app_name)

    for cat_name, cat_keywords in CATEGORY_RULES.items():
        for kw in cat_keywords:
            if kw in desc_lower:
                categories_found.add(cat_name)

    # Simple heuristic for focus level
    if any(site in desc_lower for site in ["youtube", "netflix"]):
        focus_level = "Passive"
    else:
        focus_level = "Deep work"

    return {
        "applications": apps_found,
        "categories": list(categories_found),
        "focus_level": focus_level,
    }
- Data Source: data_log.csv (or a lightweight SQLite DB).
- Daily Aggregation:
  - Filter rows by date.
  - Count occurrences/total time for each application.
  - Sum categories/focus levels.
- Generate Output:
import csv
import json

def aggregate_data(time_range: str) -> dict:
    # read data_log.csv
    # filter rows by time_range
    # compute sums, totals, breakdowns
    summary_dict = {}  # placeholder for the computed summary
    return summary_dict

def generate_report(summary_dict: dict, output_format: str, output_path: str):
    if output_format == "json":
        with open(output_path, "w") as f:
            json.dump(summary_dict, f, indent=2)
    elif output_format == "csv":
        with open(output_path, "w", newline="") as f:
            writer = csv.writer(f)
            # format summary_dict into rows and write
    print(f"Report written to {output_path}")
- Config File: Users can edit config.yaml to customize intervals, LLM mode, or advanced options.
- Command-Line Overrides: For quick changes, e.g.:
python main.py --interval 30 --output-dir /tmp/screens
- Future Enhancements:
  - Real-time dashboard in a web UI (Phase 3/4).
  - Integration with task management tools (Trello, Jira).
  - Additional heuristics or advanced ML for activity classification.
- Local Storage: By default, all screenshots and logs remain on the local machine.
- Encryption: Optionally, screenshots can be encrypted at rest using user-supplied credentials (e.g., with the cryptography library; see the sketch after this list).
- Sensitive Window Exclusion: A future feature might detect certain window titles or processes (e.g., password managers) and blur them or skip captures.
- Network Security: If using a cloud-based LLM, ensure connections use HTTPS and keys are not exposed in logs.
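
A sketch of the optional encryption-at-rest step using the cryptography library's Fernet recipe (key management is left to the user, e.g., via the macOS Keychain):

from cryptography.fernet import Fernet

def encrypt_file(path: str, key: bytes) -> None:
    # Encrypt a screenshot in place with a symmetric Fernet key.
    fernet = Fernet(key)
    with open(path, "rb") as f:
        ciphertext = fernet.encrypt(f.read())
    with open(path, "wb") as f:
        f.write(ciphertext)

# One-time setup: key = Fernet.generate_key()  # store securely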
- LLM Failure: If the LLM call fails, store a placeholder description ("LLM error") and log the event for debugging.
- Network Timeout: Retry n times (configurable). If still failing, proceed with a placeholder.
- Screenshot Failure: Catch exceptions when taking screenshots (e.g., permission issues). Prompt the user to grant screen capture permission in System Preferences (macOS).
- No Screenshots: If no screenshots exist for a report period, produce an empty or minimal report with a warning message.
- Unit Tests:
  - test_screenshot_capturer.py: Mocks screenshot capture and verifies correct file output.
  - test_vision_processor.py: Mocks LLM calls; checks parsing of responses.
  - test_data_extractor.py: Uses sample descriptions to validate extraction logic.
  - test_aggregator_reporter.py: Checks correct summation and CSV/JSON output.
- Integration Tests:
  - End-to-end test: run the app for a short interval (e.g., 10 seconds), verifying logs and generated reports.
- User Acceptance Tests:
  - Validate daily/weekly/monthly summary correctness with known sample data.
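
As an illustration of the mocking approach, a pytest-style sketch for the screenshot capturer (it assumes the screencapture-based capture_screenshot variant shown earlier):

from unittest import mock

from screenshot_capturer import capture_screenshot

def test_capture_screenshot_invokes_screencapture(tmp_path):
    # The macOS backend should shell out to screencapture with the target path.
    output = str(tmp_path / "shot.png")
    with mock.patch("subprocess.run") as run:
        capture_screenshot(output)
    run.assert_called_once_with(["screencapture", "-x", output])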
The above design provides a modular, extensible, and configurable Python-based solution for automated time tracking through screenshot capture and vision-based LLM analysis. The outlined modules and data flows ensure that a programmer can implement the system with minimal ambiguity. Future extensions—such as real-time dashboards or advanced task classification—can be layered on top without major structural changes.