A multimodal agent framework to automate your computer like a human.
It watches the screen, decides intelligently, and acts via mouse + keyboard to complete tasks autonomously.
This project turns your computer into a self-operating intelligent agent that can:
- See the screen using screenshots (like a human)
- Understand it using GPT‑4o, Gemini, Claude 3, or LLaVA
- Act with mouse and keyboard to achieve objectives (click buttons, type text, navigate apps); a minimal loop sketch follows this list
- Optionally use voice commands and OCR detection for better precision
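Under the hood the flow is: capture a screenshot, ask the multimodal model for the next action, then execute that action with the mouse or keyboard. The sketch below is a minimal, hypothetical version of that loop; it assumes `pyautogui` for input control and a made-up `ask_model()` placeholder standing in for whichever model backend you configure (the real prompting and action parsing live in `core/`).

```python
# Minimal sketch of the see -> decide -> act loop (hypothetical helper names).
import pyautogui  # screenshots plus mouse/keyboard control


def ask_model(objective, screenshot):
    """Placeholder for the multimodal model call (GPT-4o, Claude 3, Gemini, LLaVA).

    Expected to return something like {"action": "click", "x": 100, "y": 200},
    {"action": "type", "text": "hello"}, or {"action": "done"}.
    """
    raise NotImplementedError


def run(objective: str, max_steps: int = 20):
    for _ in range(max_steps):
        screenshot = pyautogui.screenshot()           # see the screen
        decision = ask_model(objective, screenshot)   # decide what to do next
        if decision["action"] == "click":             # act with the mouse
            pyautogui.click(decision["x"], decision["y"])
        elif decision["action"] == "type":            # act with the keyboard
            pyautogui.write(decision["text"], interval=0.05)
        elif decision["action"] == "done":            # objective reached
            break
```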
👇 Terminal-based usage of the `operate` command with AI automation:

*(demo video: `final-low.mp4`)*
- 🧠 Multimodal AI Models Supported: GPT‑4o, Claude 3, Gemini Pro Vision, LLaVA
- 🎯 Operates with real mouse & keyboard like a human
- 🧩 Modular skill system with plugin support
- 🎤 Voice objective input (`--voice` flag)
- 🔍 OCR vision mode for clickable element mapping (sketched below)
- 🖥️ Local model support via Ollama
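The OCR mode works by mapping text visible on screen to pixel coordinates, so clicks can target labels instead of guessed positions. Here is a rough sketch of that idea using `pytesseract` (an assumption on our part; the project's actual OCR stack may differ):

```python
# Rough sketch of OCR-based clickable-element mapping.
# Assumes pytesseract (plus the Tesseract binary) and pyautogui are installed.
import pyautogui
import pytesseract


def map_clickable_text():
    """Return {text: (center_x, center_y)} for every word OCR finds on screen."""
    screenshot = pyautogui.screenshot()
    data = pytesseract.image_to_data(screenshot, output_type=pytesseract.Output.DICT)
    elements = {}
    for text, left, top, width, height in zip(
        data["text"], data["left"], data["top"], data["width"], data["height"]
    ):
        if text.strip():
            elements[text] = (left + width // 2, top + height // 2)
    return elements


# Example: click whatever element is labeled "Submit", if OCR found one.
# targets = map_clickable_text()
# if "Submit" in targets:
#     pyautogui.click(*targets["Submit"])
```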
Install and run:

```bash
pip install self-operating-computer-automation
operate
```
You'll be prompted to enter an OpenAI, Claude, or Gemini API key (→ Get OpenAI Key).

This app needs screen recording and accessibility permissions on macOS/Windows.
| Mode Flag | Description |
|---|---|
| *(default)* | `-m gpt-4-with-ocr` for best click accuracy |
| `-m gemini-pro-vision` | Use Gemini Pro Vision |
| `-m claude-3` | Use Claude 3 |
| `-m llava` | Use local model via Ollama |
| `--voice` | Voice input support for hands-free operation |
Run a local LLaVA model via Ollama:

```bash
# Step 1: Install Ollama (https://ollama.ai/download)

# Step 2: Pull the model
ollama pull llava

# Step 3: Start Ollama
ollama serve

# Step 4: Run your agent
operate -m llava
```
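For reference, a local LLaVA query over Ollama's HTTP API looks roughly like the sketch below (a hand-rolled example, not the project's `models/` wrapper): the screenshot is base64-encoded and posted to `/api/generate` along with the prompt.

```python
# Sketch of a local LLaVA query through Ollama's /api/generate endpoint.
# Assumes `ollama serve` is running on its default port (11434).
import base64
import io

import pyautogui
import requests


def ask_llava(prompt: str) -> str:
    buf = io.BytesIO()
    pyautogui.screenshot().save(buf, format="PNG")          # capture the screen
    image_b64 = base64.b64encode(buf.getvalue()).decode()   # Ollama expects base64 images
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llava", "prompt": prompt, "images": [image_b64], "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]


# print(ask_llava("What application is focused, and where is its close button?"))
```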
```bash
# Clone the repo and install audio dependencies
git clone https://github.com/masfaatanveer/_Self-Operating-Computer-Automation_.git
cd _Self-Operating-Computer-Automation_
pip install -r requirements-audio.txt

# Install system audio libs
# Mac:
brew install portaudio
# Linux:
sudo apt install portaudio19-dev python3-pyaudio

# Run with voice
operate --voice
```
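To see how a spoken objective might be captured, here is a small illustrative sketch using the `speech_recognition` package; the project's own audio pipeline (installed via `requirements-audio.txt`) may use a different library.

```python
# Illustrative sketch: capture a spoken objective before handing it to the agent.
# Assumes the SpeechRecognition package plus PyAudio (which needs portaudio, see above).
import speech_recognition as sr


def listen_for_objective() -> str:
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:                  # microphone access requires PyAudio
        recognizer.adjust_for_ambient_noise(source)
        print("State your objective...")
        audio = recognizer.listen(source)
    return recognizer.recognize_google(audio)        # free Google Web Speech API


# objective = listen_for_objective()
# print(f"Objective: {objective}")
```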
```bash
git clone https://github.com/masfaatanveer/_Self-Operating-Computer-Automation_.git
cd _Self-Operating-Computer-Automation_
pip install -r requirements.txt
```

Run the dev build:

```bash
operate
```
```
📁 self-operating-computer-automation/
├── operate                  # CLI entry
├── core/                    # Main agent logic
├── vision/                  # Screenshot and OCR tools
├── plugins/                 # Custom skill scripts (example below)
├── models/                  # API model wrappers
├── requirements.txt
├── requirements-audio.txt
└── README.md
```
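The `plugins/` folder holds custom skill scripts. The exact plugin interface isn't documented here, so the snippet below is only a guess at the general shape: a small class the agent can discover and invoke by name instead of driving the GUI.

```python
# plugins/open_url.py -- hypothetical example; the real plugin interface may differ.
import webbrowser


class OpenUrlSkill:
    """A custom skill the agent could call instead of clicking through a browser."""

    name = "open_url"
    description = "Open the given URL in the default web browser."

    def run(self, url: str) -> str:
        webbrowser.open(url)  # standard-library call, no GUI automation needed
        return f"Opened {url}"
```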
`automation` `self-operating` `windows-automation` `multimodal-ai` `gpt-4o` `gemini-pro-vision` `claude-3` `ollama` `agentic-ai` `ai-agent` `python` `autopilot`
MIT License — free for personal and commercial use. Attribution appreciated!