A multimodal agent framework to automate your computer like a human.
It watches the screen, decides intelligently, and acts via mouse + keyboard to complete tasks autonomously.
This project turns your computer into a self-operating intelligent agent that can:
- See the screen using screenshots (like a human)
- Understand it using GPT‑4o, Gemini, Claude 3, or LLaVA
- Act with mouse and keyboard to achieve objectives (click buttons, type text, navigate apps); a minimal loop sketch follows this list
- Optionally use voice commands and OCR detection for better precision
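Under the hood the flow is: capture a screenshot, ask the multimodal model for the next action, then execute that action with the mouse or keyboard. The sketch below is a minimal, hypothetical version of that loop; it assumes `pyautogui` for input control and a made-up `ask_model()` placeholder standing in for whichever model backend you configure (the real prompting and action parsing live in `core/`).

```python
# Minimal sketch of the see -> decide -> act loop (hypothetical helper names).
import pyautogui  # screenshots plus mouse/keyboard control


def ask_model(objective, screenshot):
    """Placeholder for the multimodal model call (GPT-4o, Claude 3, Gemini, LLaVA).

    Expected to return something like {"action": "click", "x": 100, "y": 200},
    {"action": "type", "text": "hello"}, or {"action": "done"}.
    """
    raise NotImplementedError


def run(objective: str, max_steps: int = 20):
    for _ in range(max_steps):
        screenshot = pyautogui.screenshot()           # see the screen
        decision = ask_model(objective, screenshot)   # decide what to do next
        if decision["action"] == "click":             # act with the mouse
            pyautogui.click(decision["x"], decision["y"])
        elif decision["action"] == "type":            # act with the keyboard
            pyautogui.write(decision["text"], interval=0.05)
        elif decision["action"] == "done":            # objective reached
            break
```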
👇 Terminal-based usage of the `operate` command with AI automation:

*(demo video: `final-low.mp4`)*
- 🧠 Multimodal AI Models Supported: GPT‑4o, Claude 3, Gemini Pro Vision, LLaVA
- 🎯 Operates with real mouse & keyboard like a human
- 🧩 Modular skill system with plugin support
- 🎤 Voice objective input (`--voice` flag)
- 🔍 OCR vision mode for clickable element mapping (sketched below)
- 🖥️ Local model support via Ollama
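The OCR mode works by mapping text visible on screen to pixel coordinates, so clicks can target labels instead of guessed positions. Here is a rough sketch of that idea using `pytesseract` (an assumption on our part; the project's actual OCR stack may differ):

```python
# Rough sketch of OCR-based clickable-element mapping.
# Assumes pytesseract (plus the Tesseract binary) and pyautogui are installed.
import pyautogui
import pytesseract


def map_clickable_text():
    """Return {text: (center_x, center_y)} for every word OCR finds on screen."""
    screenshot = pyautogui.screenshot()
    data = pytesseract.image_to_data(screenshot, output_type=pytesseract.Output.DICT)
    elements = {}
    for text, left, top, width, height in zip(
        data["text"], data["left"], data["top"], data["width"], data["height"]
    ):
        if text.strip():
            elements[text] = (left + width // 2, top + height // 2)
    return elements


# Example: click whatever element is labeled "Submit", if OCR found one.
# targets = map_clickable_text()
# if "Submit" in targets:
#     pyautogui.click(*targets["Submit"])
```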
Install and run:

```bash
pip install self-operating-computer-automation
operate
```
You'll be prompted to enter an OpenAI, Claude, or Gemini API key (→ Get OpenAI Key).

This app needs screen recording and accessibility permissions on macOS/Windows.
| Mode Flag | Description |
|---|---|
| *(default)* | `-m gpt-4-with-ocr` for best click accuracy |
| `-m gemini-pro-vision` | Use Gemini Pro Vision |
| `-m claude-3` | Use Claude 3 |
| `-m llava` | Use local model via Ollama |
| `--voice` | Voice input support for hands-free operation |
Run a local LLaVA model via Ollama:

```bash
# Step 1: Install Ollama (https://ollama.ai/download)

# Step 2: Pull the model
ollama pull llava

# Step 3: Start Ollama
ollama serve

# Step 4: Run your agent
operate -m llava
```
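For reference, a local LLaVA query over Ollama's HTTP API looks roughly like the sketch below (a hand-rolled example, not the project's `models/` wrapper): the screenshot is base64-encoded and posted to `/api/generate` along with the prompt.

```python
# Sketch of a local LLaVA query through Ollama's /api/generate endpoint.
# Assumes `ollama serve` is running on its default port (11434).
import base64
import io

import pyautogui
import requests


def ask_llava(prompt: str) -> str:
    buf = io.BytesIO()
    pyautogui.screenshot().save(buf, format="PNG")          # capture the screen
    image_b64 = base64.b64encode(buf.getvalue()).decode()   # Ollama expects base64 images
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llava", "prompt": prompt, "images": [image_b64], "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]


# print(ask_llava("What application is focused, and where is its close button?"))
```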
```bash
# Clone the repo and install audio dependencies
git clone https://github.com/masfaatanveer/_Self-Operating-Computer-Automation_.git
cd _Self-Operating-Computer-Automation_
pip install -r requirements-audio.txt

# Install system audio libs
# Mac:
brew install portaudio
# Linux:
sudo apt install portaudio19-dev python3-pyaudio

# Run with voice
operate --voice
```
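To see how a spoken objective might be captured, here is a small illustrative sketch using the `speech_recognition` package; the project's own audio pipeline (installed via `requirements-audio.txt`) may use a different library.

```python
# Illustrative sketch: capture a spoken objective before handing it to the agent.
# Assumes the SpeechRecognition package plus PyAudio (which needs portaudio, see above).
import speech_recognition as sr


def listen_for_objective() -> str:
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:                  # microphone access requires PyAudio
        recognizer.adjust_for_ambient_noise(source)
        print("State your objective...")
        audio = recognizer.listen(source)
    return recognizer.recognize_google(audio)        # free Google Web Speech API


# objective = listen_for_objective()
# print(f"Objective: {objective}")
```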
```bash
git clone https://github.com/masfaatanveer/_Self-Operating-Computer-Automation_.git
cd _Self-Operating-Computer-Automation_
pip install -r requirements.txt
```

Run the dev build:

```bash
operate
```
```
📁 self-operating-computer-automation/
├── operate                  # CLI entry
├── core/                    # Main agent logic
├── vision/                  # Screenshot and OCR tools
├── plugins/                 # Custom skill scripts (example below)
├── models/                  # API model wrappers
├── requirements.txt
├── requirements-audio.txt
└── README.md
```
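The `plugins/` folder holds custom skill scripts. The exact plugin interface isn't documented here, so the snippet below is only a guess at the general shape: a small class the agent can discover and invoke by name instead of driving the GUI.

```python
# plugins/open_url.py -- hypothetical example; the real plugin interface may differ.
import webbrowser


class OpenUrlSkill:
    """A custom skill the agent could call instead of clicking through a browser."""

    name = "open_url"
    description = "Open the given URL in the default web browser."

    def run(self, url: str) -> str:
        webbrowser.open(url)  # standard-library call, no GUI automation needed
        return f"Opened {url}"
```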
`automation` `self-operating` `windows-automation` `multimodal-ai` `gpt-4o` `gemini-pro-vision` `claude-3` `ollama` `agentic-ai` `ai-agent` `python` `autopilot`
MIT License — free for personal and commercial use. Attribution appreciated!