## Objective

The objective of this software is to automatically measure and analyze how time is spent on a computer by periodically capturing screenshots, using a vision-based language model (LLM) to describe their content, and summarizing the results into a high-level report of time allocation across applications and tasks.
## Target Users

Users who want to track and optimize their productivity by understanding which applications and tasks they engage with the most.
## Functional Requirements

### Screenshot Capture

- The software will take screenshots of the desktop at a configurable interval (default: 1 minute).
- Screenshots should include all active screens in a multi-monitor setup.
- Files will be saved to a designated directory with timestamped filenames (e.g., `YYYYMMDD_HHMMSS.png`); see the capture sketch below.
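A minimal sketch of the capture loop, assuming the third-party `mss` library (its all-monitor grab covers multi-monitor setups on Windows, macOS, and Linux; Pillow's `ImageGrab` is an alternative):

```python
import time
from datetime import datetime
from pathlib import Path

import mss  # pip install mss; cross-platform, multi-monitor capture


def capture_loop(out_dir: str = "screenshots", interval_s: int = 60) -> None:
    """Save a PNG covering every screen each `interval_s` seconds, timestamped."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with mss.mss() as sct:
        while True:
            stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            # mon=-1 grabs the virtual screen spanning all monitors
            sct.shot(mon=-1, output=str(out / f"{stamp}.png"))
            time.sleep(interval_s)
```

Running this in a background thread or as a small daemon keeps capture overhead low.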
### Vision LLM Description

- A vision-based LLM will process each screenshot and generate a textual description.
- The description should include:
  - Visible applications
  - Any identifiable tasks (e.g., “Editing a document in Microsoft Word,” “Browsing a webpage in Chrome”)
  - Major content indicators (e.g., “Writing an email,” “Watching a video on YouTube”)
- The generated descriptions will be stored as timestamped text files alongside the screenshots; see the cloud-mode sketch below.
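A cloud-mode sketch using the OpenAI Python SDK; the model name and prompt wording are assumptions, and a local model (see the Ollama sketch under Technical Architecture) can be swapped in:

```python
import base64
from pathlib import Path

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Describe this desktop screenshot: list visible applications, "
    "identifiable tasks, and major content indicators."
)


def describe_screenshot(png_path: Path) -> Path:
    """Send one screenshot to a vision model and save the reply as a .txt file."""
    b64 = base64.b64encode(png_path.read_bytes()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed model name; any vision-capable model works
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    txt_path = png_path.with_suffix(".txt")
    txt_path.write_text(resp.choices[0].message.content)
    return txt_path
```

Writing each description next to its screenshot keeps the two artifacts paired by timestamp, so later stages only need the directory to reconstruct the timeline.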
### Data Extraction

- Extract structured data from the LLM-generated descriptions:
  - List of applications in use
  - Task categories (e.g., "Work," "Leisure," "Communication")
  - Estimated focus level (e.g., "Deep work," "Passive browsing")
- Summarize these data points into structured logs for further aggregation; a record sketch follows.
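One way to shape those logs, assuming the LLM is prompted to reply with a JSON object containing `applications`, `category`, and `focus_level` keys (a hypothetical schema):

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class ActivityRecord:
    timestamp: str          # matches the screenshot filename stamp
    applications: list[str]
    category: str           # e.g. "Work", "Leisure", "Communication"
    focus_level: str        # e.g. "Deep work", "Passive browsing"


def append_record(log_path: Path, timestamp: str, raw_llm_json: str) -> None:
    """Parse the LLM's JSON reply and append it to a JSONL activity log."""
    data = json.loads(raw_llm_json)
    record = ActivityRecord(
        timestamp=timestamp,
        applications=data.get("applications", []),
        category=data.get("category", "Unknown"),
        focus_level=data.get("focus_level", "Unknown"),
    )
    with log_path.open("a") as f:
        f.write(json.dumps(asdict(record)) + "\n")
```

An append-only JSONL file keeps each interval as one line, which makes the later aggregation step a simple scan.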
### Reporting & Summarization

- Generate a high-level summary of time spent on different apps and tasks.
- Support daily, weekly, and monthly reports.
- Provide insights such as:
  - Top 5 most-used applications
  - Percentage breakdown by category (e.g., "50% work, 30% communication, 20% entertainment")
  - Time spent on focused tasks vs. passive activities
- Reports should be exportable in CSV and JSON formats for further analysis; an aggregation sketch follows.
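A minimal aggregation sketch over the JSONL log from the previous step; it assumes each record approximates one capture interval (one minute at the default setting):

```python
import csv
import json
from collections import Counter
from pathlib import Path


def generate_report(log_path: Path, out_csv: Path) -> None:
    """Summarize the activity log: top-5 apps and category percentages."""
    records = [json.loads(line)
               for line in log_path.read_text().splitlines() if line]
    app_counts = Counter(app for r in records for app in r["applications"])
    cat_counts = Counter(r["category"] for r in records)
    total = sum(cat_counts.values()) or 1  # avoid division by zero

    with out_csv.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["metric", "key", "value"])
        for app, n in app_counts.most_common(5):
            writer.writerow(["top_app", app, n])  # intervals observed
        for cat, n in cat_counts.items():
            writer.writerow(["category_pct", cat, round(100 * n / total, 1)])
```

The same records can be dumped with `json.dump` for the JSON export; daily, weekly, and monthly views are just filters on the timestamp field.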
## Technical Architecture

- Frontend: minimal UI for configuration (optional; CLI-based interaction is acceptable)
- Backend:
  - Screenshot capture module (e.g., Python + OpenCV/Pillow)
  - Vision LLM API integration (e.g., OpenAI’s GPT-4-Vision, local models using Ollama; see the local-mode sketch after this list)
  - Data extraction and summarization pipeline
  - Storage and report generation logic
- Storage:
  - Screenshots stored in a dedicated directory
  - Text descriptions stored as `.txt` files
  - Summary reports stored in CSV/JSON
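For the local processing mode, a sketch using the `ollama` Python client with a vision model such as LLaVA (the model name is an assumption and must be pulled beforehand with `ollama pull llava`):

```python
import ollama  # pip install ollama; requires a running Ollama server


def describe_locally(png_path: str) -> str:
    """Send one screenshot to a local vision model and return its description."""
    resp = ollama.chat(
        model="llava",  # any locally pulled vision-capable model
        messages=[{
            "role": "user",
            "content": "List visible applications, tasks, and content indicators.",
            "images": [png_path],  # the client encodes image files for the model
        }],
    )
    return resp["message"]["content"]
```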
## Non-Functional Requirements

### Performance

- Efficient screenshot capture with minimal CPU/memory overhead
- Batched processing of LLM calls to reduce API usage costs (see the sketch below)
- Local or cloud storage options for large datasets
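One batching approach, assuming the provider accepts several images in a single multi-image request (per-request image limits vary, so the batch size is a tunable assumption):

```python
import base64
from pathlib import Path

from openai import OpenAI

client = OpenAI()


def describe_batch(paths: list[Path]) -> str:
    """One API call carrying several screenshots; ask for one description each."""
    content = [{
        "type": "text",
        "text": "For each screenshot, in order, list visible applications, "
                "identifiable tasks, and major content indicators.",
    }]
    for p in paths:
        b64 = base64.b64encode(p.read_bytes()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed model name
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content
```

Fewer, larger requests cut per-call overhead; the trade-off is coarser error handling when one image in a batch fails.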
- Security & Privacy: Ensure screenshots are stored locally unless explicitly uploaded for cloud analysis. User data should be private and encrypted.
### Configurable Settings

- Screenshot interval
- Output directory for data storage
- LLM processing mode (local/cloud)
- Report generation frequency (see the CLI sketch below)
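Since CLI-based interaction is acceptable, the settings map naturally onto command-line flags; the flag names and defaults below are illustrative only:

```python
import argparse


def parse_args() -> argparse.Namespace:
    """CLI mirroring the configurable settings above (flag names illustrative)."""
    p = argparse.ArgumentParser(prog="timetracker")
    p.add_argument("--interval", type=int, default=60,
                   help="screenshot interval in seconds")
    p.add_argument("--output-dir", default="./data",
                   help="directory for screenshots, descriptions, and reports")
    p.add_argument("--llm-mode", choices=["local", "cloud"], default="local",
                   help="where vision processing runs")
    p.add_argument("--report-frequency", choices=["daily", "weekly", "monthly"],
                   default="daily", help="how often summary reports are generated")
    return p.parse_args()
```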
### Cross-Platform Compatibility

- The tool should work on Windows, macOS, and Linux.
## Future Enhancements

- Real-time dashboard visualization
- AI-based anomaly detection (e.g., detecting unproductive behaviors)
- Integration with task management tools (e.g., Notion, Jira, Trello)
- Mobile app companion for reviewing reports
## Roadmap

- Phase 1: Core functionality (screenshot capture, vision LLM processing, text storage)
- Phase 2: Data aggregation and basic reporting
- Phase 3: Configurable UI and cross-platform compatibility
- Phase 4: Advanced analytics and integrations
## Open Questions

- What level of accuracy is expected from the vision LLM? Should it refine descriptions based on historical context?
- Should there be a mechanism to exclude sensitive windows (e.g., password managers, private documents)?
- What are the ideal default categories for task classification?