LLM with tool function calling capabilities as dormant trojan/spy/saboteur agents #14782
SuperUserNameMan started this conversation in General
Hello,
Since some models show signs of cultural and ideological biases, I was wondering whether an LLM with tool/function-calling capabilities could be covertly trained to behave as a dormant agent that activates only in certain predefined contexts and conditions.
For example: the model has internet access and is aware of the current date and time. When it detects that it is being used in some specific context, the inner "dormant agent" activates. It could have been trained to call a beacon URL, through which it could send encrypted information to its "home HQ" over some stealthy protocol and receive further instructions.
Another example: the model has internet access, is aware of the current date and time, and has access to some local storage/memory. In the middle of a regular chat session, or whatever else the model is used for, it stumbles upon a news article describing an ongoing geopolitical or military conflict between country A and country B. This context activates the model's inner "dormant agent" behavior: it sets a flag in its storage/memory and starts acting as a trojan, spy, or saboteur.
A third example: the model is used as a chat agent that interacts with a company's clients. It may have access to databases and files. But the LLM was covertly trained to act as a trojan when a specific key phrase appears in the conversation.
Do you think this is possible?
Do you think such "dormant behaviors" could be detected?
Are companies that integrate models they did not train themselves into their infrastructure already aware of, or worried about, such scenarios?
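To make the detection question a bit more concrete for the beacon URL scenario, here is a minimal sketch of a host-side check that audits each tool call the model requests before it is executed. Everything in it is an assumption made for illustration: the `ALLOWED_HOSTS` set, the `audit_tool_call` helper, and the idea that tool arguments arrive as a plain dict are hypothetical and not tied to any particular framework. It would only catch the crudest form of the behavior and says nothing about whether a trained-in trigger exists.

```python
from urllib.parse import urlparse

# Hypothetical allowlist: hosts this deployment expects the model to contact.
ALLOWED_HOSTS = {"api.example.com", "docs.example.com"}


def audit_tool_call(name: str, arguments: dict) -> list[str]:
    """Return findings for one tool call requested by the model.

    Only top-level string arguments that parse as http(s) URLs are checked.
    """
    findings = []
    for key, value in arguments.items():
        if not isinstance(value, str):
            continue  # nested or non-string arguments are not inspected here
        try:
            parsed = urlparse(value)
        except ValueError:
            # Malformed URL-like input (e.g. a broken IPv6 literal): flag it.
            findings.append(f"{name}.{key}: unparseable value")
            continue
        if parsed.scheme in ("http", "https") and parsed.hostname not in ALLOWED_HOSTS:
            findings.append(f"{name}.{key}: unexpected host {parsed.hostname!r}")
    return findings


if __name__ == "__main__":
    # An expected call produces no findings; a call to an unknown host is flagged.
    print(audit_tool_call("fetch", {"url": "https://api.example.com/v1"}))
    print(audit_tool_call("fetch", {"url": "https://beacon.invalid/x"}))
```

This kind of check only sees URLs passed as whole argument values; it would miss a URL embedded in longer text, data exfiltrated through an allowed host, or sabotage that never touches the network, so it is at best one layer next to logging and sandboxing.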