47 changes: 43 additions & 4 deletions README.md
@@ -16,13 +16,21 @@
</h5>

## <img src="./assets/ootb_icon.png" alt="Star" style="height:25px; vertical-align:middle; filter: invert(1) brightness(2);"> Overview
**Computer Use <span style="color:rgb(106, 158, 210)">O</span><span style="color:rgb(111, 163, 82)">O</span><span style="color:rgb(209, 100, 94)">T</span><span style="color:rgb(238, 171, 106)">B</span>**<img src="./assets/ootb_icon.png" alt="Star" style="height:20px; vertical-align:middle; filter: invert(1) brightness(2);"> is an out-of-the-box (OOTB) solution for Desktop GUI Agent, including API-based (**Claude 3.5 Computer Use**) and locally-running models (**<span style="color:rgb(106, 158, 210)">S</span><span style="color:rgb(111, 163, 82)">h</span><span style="color:rgb(209, 100, 94)">o</span><span style="color:rgb(238, 171, 106)">w</span>UI**, **UI-TARS**).
**Computer Use <span style="color:rgb(106, 158, 210)">O</span><span style="color:rgb(111, 163, 82)">O</span><span style="color:rgb(209, 100, 94)">T</span><span style="color:rgb(238, 171, 106)">B</span>**<img src="./assets/ootb_icon.png" alt="Star" style="height:20px; vertical-align:middle; filter: invert(1) brightness(2);"> is an out-of-the-box (OOTB) solution for Desktop GUI Agent, including API-based (**Claude 3.5 Computer Use**, **OpenRouter**) and locally-running models (**<span style="color:rgb(106, 158, 210)">S</span><span style="color:rgb(111, 163, 82)">h</span><span style="color:rgb(209, 100, 94)">o</span><span style="color:rgb(238, 171, 106)">w</span>UI**, **UI-TARS**).

**No Docker** is required, and it supports both **Windows** and **macOS**. OOTB provides a user-friendly interface based on Gradio.🎨

### ⚡ **Key Optimizations & Features**
- 🔄 **Smart Model Routing**: Automatically select optimal models via OpenRouter
- 💰 **Cost Optimization**: Reduced token costs with intelligent model selection
- 🚀 **Enhanced Performance**: Improved inference speed with 4-bit quantization
- 📊 **Multi-Provider Support**: Seamless switching between OpenAI, Anthropic, Qwen, and OpenRouter
- 🛠️ **Flexible Architecture**: Unified & modular planner-actor configurations

Visit our study of Claude 3.5 Computer Use as a GUI agent [[project page]](https://computer-use-ootb.github.io). 🌐

## Update
- **[2025/02/08]** We've added support for [**UI-TARS**](https://github.com/bytedance/UI-TARS). Follow [Cloud Deployment](https://github.com/bytedance/UI-TARS?tab=readme-ov-file#cloud-deployment) or [VLLM deployment](https://github.com/bytedance/UI-TARS?tab=readme-ov-file#local-deployment-vllm) to deploy UI-TARS and run it locally in OOTB.
- **[2025/01/22]** 🚀 **OpenRouter Integration** & **Performance Optimizations** are now live! Access 100+ AI models through a single API with [**OpenRouter**](https://openrouter.ai), including GPT-4o, Claude, Qwen-VL, and more. Enjoy **cost-efficient routing**, **automatic failover**, and **competitive pricing** 💰!
- **Major Update! [2024/12/04]** **Local Run🔥** is now live! Say hello to [**<span style="color:rgb(106, 158, 210)">S</span><span style="color:rgb(111, 163, 82)">h</span><span style="color:rgb(209, 100, 94)">o</span><span style="color:rgb(238, 171, 106)">w</span>UI**](https://github.com/showlab/ShowUI), an open-source 2B vision-language-action (VLA) model for GUI agents. Now compatible with `"gpt-4o + ShowUI" (~200x cheaper)`* & `"Qwen2-VL + ShowUI" (~30x cheaper)`* for only a few cents per task💰! <span style="color: grey; font-size: small;">*compared to Claude Computer Use</span>.
- **[2024/11/20]** We've added some examples to help you get hands-on experience with Claude 3.5 Computer Use.
@@ -87,7 +95,36 @@ pip install -r requirements.txt

2. Test your UI-TARS server with the script `.\install_tools\test_ui-tars_server.py`.

### 2.4 (Optional) If you want to deploy Qwen model as planner on ssh server
### 2.4 (Optional) Prepare for **OpenRouter** Integration 🌐

[OpenRouter](https://openrouter.ai) provides unified access to 100+ AI models through a single API, offering cost-efficient routing and competitive pricing.

**Benefits:**
- 🔄 **Automatic failover** between models
- 💰 **Cost optimization** with smart routing
- 🚀 **100+ models** including GPT-4o, Claude, Gemini, and more
- 📊 **Transparent pricing** and usage analytics

**Setup:**
1. Sign up at [OpenRouter](https://openrouter.ai/)
2. Get your API key from the [Keys page](https://openrouter.ai/keys)
3. Set your environment variable:
```bash
# Windows PowerShell
$env:OPENROUTER_API_KEY="sk-or-xxxxx"

# macOS/Linux
export OPENROUTER_API_KEY="sk-or-xxxxx"
```

**Popular Models Available:**
- `openrouter/auto` - Automatically route to the best available model
- GPT-4o, GPT-4o-mini
- Claude 3.5 Sonnet, Claude 3 Haiku
- Gemini Pro, PaLM 2
- And many more...
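
Once the key is set, you can sanity-check it outside of OOTB with a minimal chat request. This is only a sketch: it assumes the `requests` package (already in `requirements.txt`) and uses the same OpenAI-compatible endpoint and payload shape as the integration code in this PR.

```python
import os
import requests

# Minimal OpenRouter chat/completions request; "openrouter/auto" lets OpenRouter pick a model.
resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "model": "openrouter/auto",
        "messages": [{"role": "user", "content": "Reply with one word: pong"}],
        "max_tokens": 16,
    },
    timeout=60,
)
data = resp.json()
print(data["choices"][0]["message"]["content"])
print("total tokens:", data["usage"]["total_tokens"])
```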

### 2.5 (Optional) If you want to deploy the Qwen model as the planner on an SSH server
1. `git clone` this project on your SSH server.

2. Run `python computer_use_demo/remote_inference.py` on the server.
@@ -104,13 +141,14 @@ If you successfully start the interface, you will see two URLs in the terminal:
```


> <u>For convenience</u>, we recommend running one or more of the following command to set API keys to the environment variables before starting the interface. Then you dont need to manually pass the keys each run. On Windows Powershell (via the `set` command if on cmd):
> <u>For convenience</u>, we recommend running one or more of the following commands to set the API keys as environment variables before starting the interface. Then you don't need to pass the keys manually on each run. On Windows PowerShell (use the `set` command if on cmd):
> ```bash
> $env:ANTHROPIC_API_KEY="sk-xxxxx"  # Replace with your own key
> $env:QWEN_API_KEY="sk-xxxxx"
> $env:OPENAI_API_KEY="sk-xxxxx"
> $env:OPENROUTER_API_KEY="sk-xxxxx" # For OpenRouter integration
> ```
> On macOS/Linux, replace `$env:ANTHROPIC_API_KEY` with `export ANTHROPIC_API_KEY` in the above command.
> On macOS/Linux, replace `$env:ANTHROPIC_API_KEY` with `export ANTHROPIC_API_KEY` in the above command.


### 4. Control Your Computer with Any Device That Can Access the Internet
@@ -173,6 +211,7 @@ Now, OOTB supports customizing the GUI Agent via the following models:
<ul>
<li><a href="">GPT-4o</a></li>
<li><a href="">Qwen2-VL-Max</a></li>
<li><a href="https://openrouter.ai">OpenRouter (100+ models)</a></li>
<li><a href="">Qwen2-VL-2B(ssh)</a></li>
<li><a href="">Qwen2-VL-7B(ssh)</a></li>
<li><a href="">Qwen2.5-VL-7B(ssh)</a></li>
57 changes: 44 additions & 13 deletions app.py
@@ -61,6 +61,8 @@ def setup_state(state):
state["anthropic_api_key"] = os.getenv("ANTHROPIC_API_KEY", "")
if "qwen_api_key" not in state:
state["qwen_api_key"] = os.getenv("QWEN_API_KEY", "")
if "openrouter_api_key" not in state:
state["openrouter_api_key"] = os.getenv("OPENROUTER_API_KEY", "")
if "ui_tars_url" not in state:
state["ui_tars_url"] = ""

@@ -72,6 +74,8 @@ def setup_state(state):
state["planner_api_key"] = state["anthropic_api_key"]
elif state["planner_provider"] == "qwen":
state["planner_api_key"] = state["qwen_api_key"]
elif state["planner_provider"] == "openrouter":
state["planner_api_key"] = state["openrouter_api_key"]
else:
state["planner_api_key"] = ""

@@ -278,7 +282,7 @@ def process_input(user_input, state):
label="API Provider",
choices=[option.value for option in APIProvider],
value="openai",
interactive=False,
interactive=True,
)
with gr.Column():
planner_api_key = gr.Textbox(
@@ -393,9 +397,9 @@ def update_planner_model(model_selection, state):
logger.info(f"Model updated to: {state['planner_model']}")

if model_selection == "qwen2-vl-max":
provider_choices = ["qwen"]
provider_choices = ["qwen", "openrouter"]
provider_value = "qwen"
provider_interactive = False
provider_interactive = True
api_key_interactive = True
api_key_placeholder = "qwen API key"
actor_model_choices = ["ShowUI", "UI-TARS"]
@@ -432,10 +436,10 @@ def update_planner_model(model_selection, state):
state["api_key"] = ""

elif model_selection == "gpt-4o" or model_selection == "gpt-4o-mini":
# Set provider to "openai", make it unchangeable
provider_choices = ["openai"]
# Allow OpenAI or OpenRouter as provider
provider_choices = ["openai", "openrouter"]
provider_value = "openai"
provider_interactive = False
provider_interactive = True
api_key_interactive = True
api_key_type = "password" # Display API key in password form

@@ -470,6 +474,8 @@ def update_planner_model(model_selection, state):
state["api_key"] = state.get("anthropic_api_key", "")
elif provider_value == "qwen":
state["api_key"] = state.get("qwen_api_key", "")
elif provider_value == "openrouter":
state["api_key"] = state.get("openrouter_api_key", "")
elif provider_value == "local":
state["api_key"] = ""
# The SSH case was already handled above; no need to handle it again here
@@ -502,19 +508,44 @@ def update_actor_model(actor_model_selection, state):
logger.info(f"Actor model updated to: {state['actor_model']}")

def update_api_key_placeholder(provider_value, model_selection):
# Persist provider selection into state for use in sampling loop
state.value["planner_provider"] = provider_value
# Choose placeholder and value based on provider/model
if model_selection == "claude-3-5-sonnet-20241022":
if provider_value == "anthropic":
return gr.update(placeholder="anthropic API key")
placeholder = "anthropic API key"
value = state.value.get("anthropic_api_key", "")
elif provider_value == "bedrock":
return gr.update(placeholder="bedrock API key")
placeholder = "bedrock API key"
value = "" # credentials via environment
elif provider_value == "vertex":
return gr.update(placeholder="vertex API key")
placeholder = "vertex API key"
value = "" # credentials via environment
else:
return gr.update(placeholder="")
elif model_selection == "gpt-4o + ShowUI":
return gr.update(placeholder="openai API key")
placeholder = ""
value = ""
else:
return gr.update(placeholder="")
if provider_value == "openai":
placeholder = "openai API key"
value = state.value.get("openai_api_key", "")
elif provider_value == "openrouter":
placeholder = "openrouter API key"
value = state.value.get("openrouter_api_key", "")
elif provider_value == "qwen":
placeholder = "qwen API key"
value = state.value.get("qwen_api_key", "")
elif provider_value == "ssh":
placeholder = "ssh host and port (e.g. localhost:8000)"
value = state.value.get("planner_api_key", "")
elif provider_value == "local":
placeholder = "not required"
value = ""
else:
placeholder = ""
value = ""
# Update state mirrored key used by loop
state.value["planner_api_key"] = value
return gr.update(placeholder=placeholder, value=value, type="password", interactive=True)

def update_system_prompt_suffix(system_prompt_suffix, state):
state["custom_system_prompt"] = system_prompt_suffix
91 changes: 79 additions & 12 deletions computer_use_demo/gui_agent/llm_utils/oai.py
@@ -1,3 +1,71 @@
def run_openrouter_interleaved(messages: list, system: str, llm: str, api_key: str, max_tokens=256, temperature=0):
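"""Send an interleaved text/image conversation to OpenRouter's OpenAI-compatible chat/completions endpoint; returns (response_text, total_tokens) on success."""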

api_key = api_key or os.environ.get("OPENROUTER_API_KEY")
if not api_key:
raise ValueError("OPENROUTER_API_KEY is not set")

headers = {"Content-Type": "application/json",
"Authorization": f"Bearer {api_key}"}

final_messages = [{"role": "system", "content": system}]

if isinstance(messages, list):
for item in messages:
print(f"item: {item}")
contents = []
if isinstance(item, dict):
for cnt in item["content"]:
if isinstance(cnt, str):
if is_image_path(cnt):
base64_image = encode_image(cnt)
content = {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
else:
content = {"type": "text", "text": cnt}

contents.append(content)
message = {"role": item["role"], "content": contents}

elif isinstance(item, str):
if is_image_path(item):
base64_image = encode_image(item)
contents.append({"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}})
message = {"role": "user", "content": contents}
else:
contents.append({"type": "text", "text": item})
message = {"role": "user", "content": contents}

else:  # neither dict nor str; fall back to treating the item as text
contents.append({"type": "text", "text": item})
message = {"role": "user", "content": contents}

final_messages.append(message)


elif isinstance(messages, str):
final_messages.append({"role": "user", "content": messages})

print("[openrouter] sending messages:", [f"{k}: {v}, {k}" for k, v in final_messages])

payload = {
"model": llm,
"messages": final_messages,
"max_tokens": max_tokens,
"temperature": temperature,
}

response = requests.post(
"https://openrouter.ai/api/v1/chat/completions", headers=headers, json=payload
)

try:
text = response.json()['choices'][0]['message']['content']
token_usage = int(response.json()['usage']['total_tokens'])
return text, token_usage

except Exception as e:
print(f"Error in interleaved openAI: {e}. This may due to your invalid OPENROUTER_API_KEY. Please check the response: {response.json()} ")
return response.json()

import os
import logging
import base64
@@ -214,17 +282,16 @@ def encode_image(image_path: str, max_size=1024) -> str:
# temperature=0)

# print(text, token_usage)
text, token_usage = run_ssh_llm_interleaved(
messages= [{"content": [
"What is in the screenshot?",
"tmp/outputs/screenshot_5a26d36c59e84272ab58c1b34493d40d.png"],
"role": "user"
}],
llm="Qwen2.5-VL-7B-Instruct",
ssh_host="10.245.92.68",
ssh_port=9192,
text, token_usage = run_openrouter_interleaved(
messages=[{"content": [
"What is in the screenshot?",
"tmp/outputs/screenshot_5a26d36c59e84272ab58c1b34493d40d.png"],
"role": "user"
}],
llm="openrouter/auto",
system="You are a helpful assistant",
api_key=api_key,
max_tokens=256,
temperature=0.7
)
temperature=0)

print(text, token_usage)
# There is an introduction describing the Calyx... 36986
35 changes: 32 additions & 3 deletions computer_use_demo/gui_agent/planner/api_vlm_planner.py
@@ -11,7 +11,7 @@
from anthropic.types.beta import BetaMessage, BetaTextBlock, BetaToolUseBlock, BetaMessageParam

from computer_use_demo.tools.screen_capture import get_screenshot
from computer_use_demo.gui_agent.llm_utils.oai import run_oai_interleaved, run_ssh_llm_interleaved
from computer_use_demo.gui_agent.llm_utils.oai import run_oai_interleaved, run_ssh_llm_interleaved, run_openrouter_interleaved
from computer_use_demo.gui_agent.llm_utils.qwen import run_qwen
from computer_use_demo.gui_agent.llm_utils.llm_utils import extract_data, encode_image
from computer_use_demo.tools.colorful_text import colorful_text_showui, colorful_text_vlm
@@ -43,9 +43,10 @@ def __init__(
self.model = "Qwen2-VL-7B-Instruct"
elif model == "qwen2.5-vl-7b (ssh)":
self.model = "Qwen2.5-VL-7B-Instruct"
elif model == "openrouter/auto":
self.model = "openrouter/auto"
else:
raise ValueError(f"Model {model} not supported")

self.provider = provider
self.system_prompt_suffix = system_prompt_suffix
self.api_key = api_key
@@ -92,7 +93,23 @@ def __call__(self, messages: list):

print(f"Sending messages to VLMPlanner: {planner_messages}")

if self.model == "gpt-4o-2024-11-20":
# If provider is explicitly OpenRouter, route via OpenRouter regardless of model string
provider_str = self.provider.value if hasattr(self.provider, "value") else str(self.provider)
if provider_str == "openrouter":
# Use a generic auto model on OpenRouter unless a specific compatible ID is set elsewhere
or_model = "openrouter/auto"
vlm_response, token_usage = run_openrouter_interleaved(
messages=planner_messages,
system=self.system_prompt,
llm=or_model,
api_key=self.api_key,
max_tokens=self.max_tokens,
temperature=0,
)
print(f"openrouter token usage: {token_usage}")
self.total_token_usage += token_usage
self.total_cost += (token_usage * 0.15 / 1000000) # Placeholder cost
elif self.model == "gpt-4o-2024-11-20":
vlm_response, token_usage = run_oai_interleaved(
messages=planner_messages,
system=self.system_prompt,
Expand All @@ -117,6 +134,18 @@ def __call__(self, messages: list):
print(f"qwen token usage: {token_usage}")
self.total_token_usage += token_usage
self.total_cost += (token_usage * 0.02 / 7.25 / 1000) # 1USD=7.25CNY, https://help.aliyun.com/zh/dashscope/developer-reference/tongyi-qianwen-vl-plus-api
elif self.model == "openrouter/auto":
vlm_response, token_usage = run_openrouter_interleaved(
messages=planner_messages,
system=self.system_prompt,
llm=self.model,
api_key=self.api_key,
max_tokens=self.max_tokens,
temperature=0,
)
print(f"openrouter token usage: {token_usage}")
self.total_token_usage += token_usage
self.total_cost += (token_usage * 0.15 / 1000000) # Placeholder cost
elif "Qwen" in self.model:
# Parse host and port from api_key
try: