This is a fully autonomous, self-operating computer automation system designed to automate tasks on Windows without any user interaction. It runs scheduled or trigger-based workflows using Python, system tools, and smart agents — ideal for repetitive tasks, bots, or self-executing pipelines.

masfaatanveer/_Self-Operating-Computer-Automation_

🧠 Self Operating Computer Automation

By Masfa Tanveer

A multimodal agent framework that automates your computer the way a human would use it.
It watches the screen, decides what to do next, and acts via mouse and keyboard to complete tasks autonomously.

Demo Screenshot

🚀 Overview

This project turns your computer into a self-operating intelligent agent that can:

  • See the screen using screenshots (like a human)
  • Understand it using GPT‑4o, Gemini, Claude 3, or LLaVA
  • Act with mouse and keyboard to achieve objectives (click buttons, type text, navigate apps)
  • Optionally use voice commands and OCR detection for better precision
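
The see, understand, act cycle above can be sketched as a simple loop. This is a minimal illustration, not the project's actual code: `Action`, `run_agent`, and the three callables (`capture`, `decide`, `act`) are hypothetical names. A real agent would plug in a screenshot library, a vision-model client, and a mouse/keyboard driver.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    """One step chosen by the model (hypothetical schema)."""
    kind: str                     # "click", "type", or "done"
    x: Optional[float] = None     # click position as a fraction of screen width
    y: Optional[float] = None     # click position as a fraction of screen height
    text: Optional[str] = None    # text to type for "type" actions


def run_agent(
    objective: str,
    capture: Callable[[], bytes],
    decide: Callable[[str, bytes, List[Action]], Action],
    act: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Repeat screenshot -> model decision -> input event until done."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture()                            # see
        action = decide(objective, screenshot, history)   # understand
        if action.kind == "done":
            break
        act(action)                                       # act
        history.append(action)
    return history
```

Passing the three stages in as callables keeps the loop testable without a real screen or API key. The `max_steps` cap is an assumed safeguard so a confused model cannot click forever.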

📽 Demo Video

👇 Terminal-based usage of the operate command with AI automation:

final-low.mp4

✨ Key Features

  • 🧠 Multimodal AI Models Supported: GPT‑4o, Claude 3, Gemini Pro Vision, LLaVA
  • 🎯 Operates with real mouse & keyboard like a human
  • 🧩 Modular skill system with plugin support
  • 🎤 Voice objective input (--voice flag)
  • 🔍 OCR vision mode for clickable element mapping
  • 🖥️ Local model support via Ollama
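
The idea behind "OCR vision mode for clickable element mapping" can be illustrated with a small, self-contained sketch. It assumes an OCR engine has already produced text snippets with pixel bounding boxes. `OcrBox` and `find_click_point` are hypothetical names, and the project's real data structures may differ.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class OcrBox:
    """A piece of on-screen text plus its pixel bounding box (assumed shape)."""
    text: str
    left: int
    top: int
    width: int
    height: int


def find_click_point(boxes: Iterable[OcrBox], label: str) -> Optional[Tuple[int, int]]:
    """Return the center of the first box whose text contains `label`.

    Matching is case-insensitive; returns None if nothing matches.
    """
    wanted = label.strip().lower()
    if not wanted:
        return None
    for box in boxes:
        if wanted in box.text.lower():
            return (box.left + box.width // 2, box.top + box.height // 2)
    return None
```

With this mapping the model only has to name the label it wants ("Sign In") rather than guess raw coordinates, which is one plausible reason OCR improves click accuracy.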

⚙️ Install and Run

1. Install via pip

pip install self-operating-computer-automation

2. Run the agent

operate

3. Enter your API key

On first run, you'll be prompted to enter an OpenAI, Claude, or Gemini API key.
Get OpenAI Key
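
For unattended runs, a prompt can be avoided by reading the key from the environment first. The sketch below is an assumption about how that could work, not documented behavior of `operate`: `resolve_api_key` is a hypothetical helper, and the variable names follow common provider conventions rather than anything this README states.

```python
import os
from getpass import getpass
from typing import Callable, Mapping, Optional

# Assumed variable names; this README does not say which ones `operate` reads.
ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "claude": "ANTHROPIC_API_KEY",
    "gemini": "GOOGLE_API_KEY",
}


def resolve_api_key(
    provider: str,
    env: Optional[Mapping[str, str]] = None,
    prompt: Optional[Callable[[str], str]] = None,
) -> str:
    """Return the key from the environment, or fall back to a hidden prompt."""
    env = os.environ if env is None else env
    prompt = getpass if prompt is None else prompt
    key = env.get(ENV_VARS[provider], "").strip()
    if not key:
        key = prompt(f"Enter your {provider} API key: ").strip()
    if not key:
        raise ValueError(f"no API key provided for {provider}")
    return key
```

Using `getpass` for the fallback keeps the key out of the terminal scrollback.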


🛑 Required Permissions

This app needs screen recording and accessibility (input control) permissions on Windows/Mac so it can take screenshots and drive the mouse and keyboard.


🧠 Supported Modes

Mode Flag              Description
(default)              -m gpt-4-with-ocr for best click accuracy
-m gemini-pro-vision   Use Gemini Pro Vision
-m claude-3            Use Claude 3
-m llava               Use local model via Ollama
--voice                Voice input support for hands-free operation
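
To make the flag table concrete, here is a minimal `argparse` sketch that accepts the same options. It is an illustration only, not the real `operate` entry point. The long form `--model` is an assumption, and the default follows the table's first row.

```python
import argparse
from typing import List, Optional

# Mode names taken from the table above.
MODES = ["gpt-4-with-ocr", "gemini-pro-vision", "claude-3", "llava"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented flags."""
    parser = argparse.ArgumentParser(prog="operate")
    parser.add_argument(
        "-m", "--model",              # "--model" is an assumed long form
        choices=MODES,
        default="gpt-4-with-ocr",
        help="which vision model backend to use",
    )
    parser.add_argument(
        "--voice",
        action="store_true",
        help="take the objective from the microphone instead of typed input",
    )
    return parser


def parse_cli(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse `argv` (or sys.argv[1:] when None)."""
    return build_parser().parse_args(argv)
```

For example, `parse_cli(["-m", "llava", "--voice"])` yields a namespace with `model="llava"` and `voice=True`, while an unknown mode name exits with a usage error.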

🔥 Using LLaVA Locally via Ollama

# Step 1: Install Ollama (https://ollama.ai/download)

# Step 2: Pull model
ollama pull llava

# Step 3: Start Ollama
ollama serve

# Step 4: Run your agent
operate -m llava
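
Before Step 4 it can help to confirm that the Ollama server is up and the model has been pulled. The sketch below queries Ollama's local HTTP API (`/api/tags` on the default port 11434) using only the standard library. `fetch_tags` and `has_model` are hypothetical helper names and are not part of this project.

```python
import json
import urllib.request
from typing import Any, Dict

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 3.0) -> Dict[str, Any]:
    """Ask the local Ollama server which models are installed.

    Raises urllib.error.URLError if the server is not running.
    """
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


def has_model(tags: Dict[str, Any], name: str) -> bool:
    """True if `name` (with or without a ':tag' suffix) appears in the payload."""
    for model in tags.get("models", []):
        installed = model.get("name", "")
        if installed == name or installed.split(":", 1)[0] == name:
            return True
    return False
```

With `ollama serve` running, `has_model(fetch_tags(), "llava")` should be true once `ollama pull llava` has finished.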

🧩 Voice Mode (Optional)

# Clone the repo and install audio dependencies
git clone https://github.com/masfaatanveer/_Self-Operating-Computer-Automation_.git
cd _Self-Operating-Computer-Automation_
pip install -r requirements-audio.txt

# Install system audio libs
# Mac:
brew install portaudio

# Linux:
sudo apt install portaudio19-dev python3-pyaudio

# Run with voice
operate --voice

🛠 Development Setup

git clone https://github.com/masfaatanveer/_Self-Operating-Computer-Automation_.git
cd _Self-Operating-Computer-Automation_
pip install -r requirements.txt

Run the dev build:

operate

📂 Repo Structure

📁 self-operating-computer-automation/
├── operate                     # CLI entry
├── core/                       # Main agent logic
├── vision/                     # Screenshot and OCR tools
├── plugins/                    # Custom skill scripts
├── models/                     # API model wrappers
├── requirements.txt
├── requirements-audio.txt
└── README.md
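
The README does not document how scripts in `plugins/` are discovered, so the following is only one plausible shape for a "modular skill system": a decorator-based registry. The names `skill`, `run_skill`, and the string-in, string-out signature are all assumptions made for illustration.

```python
from typing import Callable, Dict

Skill = Callable[[str], str]

# Registry of skill name -> function (hypothetical; not the project's real API).
_SKILLS: Dict[str, Skill] = {}


def skill(name: str) -> Callable[[Skill], Skill]:
    """Decorator that registers a function under `name`."""
    def register(func: Skill) -> Skill:
        if name in _SKILLS:
            raise ValueError(f"skill already registered: {name}")
        _SKILLS[name] = func
        return func
    return register


def run_skill(name: str, argument: str) -> str:
    """Look up a registered skill and call it with `argument`."""
    if name not in _SKILLS:
        raise KeyError(f"unknown skill: {name}")
    return _SKILLS[name](argument)


@skill("echo")
def echo(argument: str) -> str:
    """Trivial example skill that returns its input unchanged."""
    return argument
```

A plugin file would then only need to import `skill` and decorate its functions. The agent core could dispatch by name without knowing each plugin in advance.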

📌 Tags / Topics

automation
self-operating
windows-automation
multimodal-ai
gpt-4o
gemini-pro-vision
claude-3
ollama
agentic-ai
ai-agent
python
autopilot

👨‍💻 Created by Masfa Dhillon

GitHub | LinkedIn


📄 License

MIT License — free for personal and commercial use. Attribution appreciated!

