@vivekhaldar
Created February 9, 2025 17:45

Product Specification: Automatic Time Tracking by Watching Computer Screen

1. Overview

Goal

This software automatically measures and analyzes how time is spent on a computer. It periodically captures screenshots, uses a vision-capable large language model (LLM) to describe their content, and summarizes the descriptions into a high-level report of time allocation across applications and tasks.

Use Case

The software is intended for users who want to track and optimize their productivity by understanding which applications and tasks take up most of their time.


2. Functional Requirements

2.1 Screenshot Capture

  • The software will take screenshots of the desktop at a configurable interval (default: 1 minute).
  • Screenshots should include all active screens in a multi-monitor setup.
  • The files will be saved in a designated directory with a timestamped filename (e.g., YYYYMMDD_HHMMSS.png).
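
As an illustration of the capture requirements above, here is a minimal sketch of the timestamped naming and the interval loop. The names `screenshot_path`, `capture_loop`, and the `grab` callback are hypothetical and not part of this spec. The actual screen capture is left to the caller (for example a Pillow- or mss-based function) because multi-monitor support differs by platform.

```python
import time
from datetime import datetime
from pathlib import Path
from typing import Callable, List, Optional


def screenshot_path(directory: Path, when: datetime) -> Path:
    """Build a timestamped path in the YYYYMMDD_HHMMSS.png format."""
    return directory / f"{when:%Y%m%d_%H%M%S}.png"


def capture_loop(
    directory: Path,
    grab: Callable[[Path], None],
    interval_s: float = 60.0,          # spec default: 1 minute
    max_shots: Optional[int] = None,   # None means run until interrupted
) -> List[Path]:
    """Call grab(path) once per interval and return the paths written."""
    directory.mkdir(parents=True, exist_ok=True)
    saved: List[Path] = []
    while max_shots is None or len(saved) < max_shots:
        path = screenshot_path(directory, datetime.now())
        grab(path)  # hypothetical hook: the caller supplies the real capture
        saved.append(path)
        # Sleep only if another capture will follow.
        if max_shots is None or len(saved) < max_shots:
            time.sleep(interval_s)
    return saved
```

Passing the capture function in as a parameter keeps the scheduling logic testable without a display. Intervals shorter than one second would produce colliding filenames under this naming scheme.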

2.2 Image Processing & Description

  • A vision-based LLM will process each screenshot and generate a textual description.
  • The description should include:
    • Visible applications
    • Any identifiable tasks (e.g., “Editing a document in Microsoft Word,” “Browsing a webpage on Chrome”)
    • Major content indicators (e.g., “Writing an email,” “Watching a video on YouTube”)
  • The generated descriptions will be stored as text files alongside the screenshots, also timestamped.
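
One possible way to store each description next to its screenshot is sketched below. The `describe` callback stands in for whichever vision LLM is chosen, and its signature, the prompt wording, and the function names are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable

# Hypothetical prompt covering the three items the description should include.
PROMPT = (
    "Describe this screenshot. List the visible applications, "
    "any identifiable tasks, and major content indicators."
)


def description_path(screenshot: Path) -> Path:
    """Same directory and timestamp as the screenshot, with a .txt suffix."""
    return screenshot.with_suffix(".txt")


def describe_and_store(
    screenshot: Path,
    describe: Callable[[Path, str], str],
) -> Path:
    """Ask the model for a description and write it beside the screenshot."""
    text = describe(screenshot, PROMPT)  # hypothetical vision LLM call
    out = description_path(screenshot)
    out.write_text(text.strip() + "\n", encoding="utf-8")
    return out
```

Reusing the screenshot's filename stem means the description inherits the timestamp without any extra bookkeeping.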

2.3 Data Extraction & Summarization

  • Extract structured data from the LLM-generated descriptions:
    • List of applications in use
    • Task categories (e.g., "Work," "Leisure," "Communication")
    • Estimated focus level (e.g., "Deep work," "Passive browsing")
  • Summarize these data points into structured logs for further aggregation.
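
The structured log entry could take a shape like the following sketch. It assumes the LLM is asked to return a JSON object with `applications`, `category`, and `focus` keys. That schema, the `ActivityRecord` type, and the fallback values are hypothetical choices, not something this spec fixes.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Tuple


@dataclass(frozen=True)
class ActivityRecord:
    timestamp: datetime
    applications: Tuple[str, ...]
    category: str   # e.g. "Work", "Leisure", "Communication"
    focus: str      # e.g. "Deep work", "Passive browsing"


def parse_record(stem: str, llm_json: str) -> ActivityRecord:
    """Combine a filename stem (YYYYMMDD_HHMMSS) with the model's JSON output."""
    data = json.loads(llm_json)
    return ActivityRecord(
        timestamp=datetime.strptime(stem, "%Y%m%d_%H%M%S"),
        applications=tuple(data.get("applications", [])),
        # Fallbacks for fields the model omitted (assumed labels).
        category=data.get("category", "Uncategorized"),
        focus=data.get("focus", "Unknown"),
    )
```

Malformed JSON raises `json.JSONDecodeError` here. A real pipeline would need to decide whether to retry the model call or skip the record.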

2.4 Aggregation & Reporting

  • Generate a high-level summary of time spent on different apps and tasks.
  • Support daily, weekly, and monthly reports.
  • Provide insights such as:
    • Top 5 most-used applications
    • Percentage breakdown by category (e.g., "50% work, 30% communication, 20% entertainment")
    • Time spent on focused tasks vs. passive activities
  • Reports should be exportable in CSV and JSON formats for further analysis.
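
A rough sketch of the aggregation step follows. It assumes each record is a dict with `applications` and `category` keys and that every screenshot stands for one full capture interval. Both are simplifying assumptions, and a screenshot showing several applications credits the whole interval to each of them.

```python
import csv
import io
import json
from collections import Counter
from typing import Dict, List


def summarize(records: List[dict], interval_min: float = 1.0) -> Dict:
    """Estimate minutes per application and percentage of time per category."""
    app_minutes: Counter = Counter()
    category_counts: Counter = Counter()
    for rec in records:
        for app in rec["applications"]:
            app_minutes[app] += interval_min
        category_counts[rec["category"]] += 1
    total = sum(category_counts.values())
    percent = {}
    if total:
        percent = {c: round(100 * n / total, 1) for c, n in category_counts.items()}
    return {
        "top_applications": app_minutes.most_common(5),  # top 5 most-used
        "category_percent": percent,
    }


def to_csv(summary: Dict) -> str:
    """Export the per-application totals as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["application", "minutes"])
    writer.writerows(summary["top_applications"])
    return buf.getvalue()


def to_json(summary: Dict) -> str:
    """Export the full summary as JSON text."""
    return json.dumps(summary, indent=2)
```

Daily, weekly, and monthly reports would apply the same function to records filtered by timestamp range.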

3. Technical Requirements

3.1 System Architecture

  • Frontend: Minimal UI for configuration (optional, CLI-based interaction acceptable)
  • Backend:
    • Screenshot capture module (e.g., Python + OpenCV/Pillow)
    • Vision LLM API integration (e.g., OpenAI’s GPT-4-Vision, local models using Ollama)
    • Data extraction and summarization pipeline
    • Storage and report generation logic
  • Storage:
    • Screenshots stored in a dedicated directory
    • Text descriptions stored as .txt files
    • Summary reports stored in CSV/JSON

3.2 Performance Considerations

  • Efficient screenshot capture with minimal CPU/memory overhead
  • Batched processing for LLM calls to reduce API usage costs
  • Local or cloud storage options for large datasets
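
Batching can be as simple as grouping pending screenshots into fixed-size chunks before they are sent to the model. The helper below is a minimal sketch with a hypothetical name. Whether batching actually lowers cost depends on how the chosen LLM provider bills requests.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items; the last may be shorter."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final partial batch
        yield batch
```

For example, `batched(paths, 10)` would let a worker process ten screenshots per model request or per scheduled run.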

4. Non-Functional Requirements

  • Security & Privacy: Screenshots should stay on the local machine unless the user explicitly uploads them for cloud analysis. Stored user data should be kept private and encrypted.
  • Configurable Settings:
    • Screenshot interval
    • Output directory for data storage
    • LLM processing mode (local/cloud)
    • Report generation frequency
  • Cross-Platform Compatibility: Should work on Windows, macOS, and Linux.
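
The configurable settings listed above could be represented as a small validated structure loaded from a JSON file. The field names, defaults other than the 1-minute interval, and the `load_settings` helper are assumptions made for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class Settings:
    screenshot_interval_s: int = 60      # spec default: 1 minute
    output_dir: str = "screenshots"      # assumed default directory name
    llm_mode: str = "local"              # "local" or "cloud"
    report_frequency: str = "daily"      # "daily", "weekly", or "monthly"

    def __post_init__(self) -> None:
        # Reject values that the rest of the pipeline could not act on.
        if self.screenshot_interval_s <= 0:
            raise ValueError("screenshot_interval_s must be positive")
        if self.llm_mode not in {"local", "cloud"}:
            raise ValueError("llm_mode must be 'local' or 'cloud'")
        if self.report_frequency not in {"daily", "weekly", "monthly"}:
            raise ValueError("report_frequency must be daily, weekly, or monthly")


def load_settings(text: str) -> Settings:
    """Parse JSON text; keys that are missing fall back to the defaults."""
    return Settings(**json.loads(text))
```

Defaulting `llm_mode` to `"local"` matches the privacy requirement that screenshots are not uploaded unless the user opts in.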

5. Future Enhancements

  • Real-time dashboard visualization
  • AI-based anomaly detection (e.g., detecting unproductive behaviors)
  • Integration with task management tools (e.g., Notion, Jira, Trello)
  • Mobile app companion for reviewing reports

6. Development Roadmap

  1. Phase 1: Core functionality (screenshot capture, vision LLM processing, text storage)
  2. Phase 2: Data aggregation and basic reporting
  3. Phase 3: Configurable UI and cross-platform compatibility
  4. Phase 4: Advanced analytics and integrations

7. Open Questions

  • What level of accuracy is expected from the vision LLM? Should it refine descriptions based on historical context?
  • Should there be a mechanism to exclude sensitive windows (e.g., password managers, private documents)?
  • What are the ideal default categories for task classification?