| Publish Date | Title | Authors | arXiv ID | Code |
|---|---|---|---|---|
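| 2025-08-25 | Speculative Safety-Aware Decoding Main content: [This paper proposes Speculative Safety-Aware Decoding (SSD), a lightweight and efficient decoding-time method that strengthens the safety alignment of large language models (LLMs) without tuning model parameters. SSD draws on the safety properties of a small language model, using speculative sampling and the match ratio to switch dynamically between decoding schemes, which improves safety while preserving utility. Experiments show that SSD equips large models with the desired safety properties, stays helpful on benign queries, and speeds up inference.] Conclusion: [SSD effectively strengthens the safety alignment of existing LLMs while preserving utility and inference efficiency. Experimental results show that SSD performs well across multiple attack scenarios; for less safe models such as Vicuna in particular, it outperforms directly fine-tuning the original model. SSD also avoids over-refusal and can still respond appropriately on sensitive topics.] | Xuekang Wang et.al. | 2508.17739 | null | 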
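| 2025-08-24 | Activation Transport Operators Main content: [This paper proposes Activation Transport Operators (ATO), linear maps from upstream to downstream residual-stream layers, used to predict whether a given feature is transported linearly or synthesized by nonlinear computation. ATOs are learned from paired activations, require no fine-tuning, and are computationally cheap. They can assess how efficiently features are transported linearly through the residual stream and how large the corresponding subspace is, which helps with model safety, debugging, and understanding which parts of LLM behavior are linear.] Conclusion: [ATO provides a simple, testable way to map feature flow. Experiments show that linear transport occurs mainly between adjacent layers and weakens as the distance between layers grows. The transport efficiency metric quantifies how close the operator comes to the best linear predictor, and the analysis shows that the dimension of the linear-transport subspace is closely tied to the optimal rank of the ATO. ATOs significantly recover the language-modeling ability lost under zero intervention, supporting their use for targeted diagnosis and editing.] | Andrzej Szablewski et.al. | 2508.17540 | null | 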
| 2025-07-23 | Enabling Cyber Security Education through Digital Twins and Generative AI | Vita Santa Barletta et.al. | 2507.17518 | null | 
| 2025-07-22 | DREAM: Scalable Red Teaming for Text-to-Image Generative Systems via Distribution Modeling | Boheng Li et.al. | 2507.16329 | null | 
| 2025-07-21 | Red-Team Multi-Agent Reinforcement Learning for Emergency Braking Scenario | Yinsong Chen et.al. | 2507.15587 | null | 
| 2025-07-20 | AlphaAlign: Incentivizing Safety Alignment with Extremely Simplified Reinforcement Learning | Yi Zhang et.al. | 2507.14987 | null | 
| 2025-07-14 | PRM-Free Security Alignment of Large Models via Red Teaming and Adversarial Training | Pengfei Du et.al. | 2507.14202 | null | 
| 2025-07-18 | Innocence in the Crossfire: Roles of Skip Connections in Jailbreaking Visual Language Models | Palash Nandi et.al. | 2507.13761 | null | 
| 2025-07-17 | Paper Summary Attack: Jailbreaking LLMs through LLM Safety Papers | Liang Lin et.al. | 2507.13474 | null | 
| 2025-07-16 | Exploiting Jailbreaking Vulnerabilities in Generative AI to Bypass Ethical Safeguards for Facilitating Phishing Attacks | Rina Mishra et.al. | 2507.12185 | null | 
| 2025-07-16 | LLMs Encode Harmfulness and Refusal Separately | Jiachen Zhao et.al. | 2507.11878 | null | 
| 2025-07-15 | Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility | Brendan Murphy et.al. | 2507.11630 | null | 
| 2025-07-14 | ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning | Zhengyue Zhao et.al. | 2507.11500 | null | 
| 2025-07-15 | The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs | Zichen Wen et.al. | 2507.11097 | null | 
| 2025-07-17 | SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems | Wenliang Shan et.al. | 2507.08898 | null | 
| 2025-07-10 | A Dynamic Stackelberg Game Framework for Agentic AI Defense Against LLM Jailbreaking | Zhengye Han et.al. | 2507.08207 | null | 
| 2025-07-06 | Mass-Scale Analysis of In-the-Wild Conversations Reveals Complexity Bounds on LLM Jailbreaking | Aldan Creo et.al. | 2507.08014 | null | 
| 2025-07-10 | GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing | Peiyan Zhang et.al. | 2507.07735 | null | 
| 2025-07-11 | Medical Red Teaming Protocol of Language Models: On the Importance of User Perspectives in Healthcare Settings | Jean-Philippe Corbeil et.al. | 2507.07248 | null | 
| 2025-07-09 | An attention-aware GNN-based input defender against multi-turn jailbreak on LLMs | Zixuan Huang et.al. | 2507.07146 | null | 
| 2025-07-09 | On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks | Stephen Obadinma et.al. | 2507.06489 | null | 
| 2025-07-09 | Foundation Model Self-Play: Open-Ended Strategy Innovation via Foundation Models | Aaron Dharna et.al. | 2507.06466 | null | 
| 2025-07-08 | The bitter lesson of misuse detection | Hadrien Mariaccia et.al. | 2507.06282 | null | 
| 2025-07-07 | Evaluating the Critical Risks of Amazon's Nova Premier under the Frontier Model Safety Framework | Satyapriya Krishna et.al. | 2507.06260 | null | 
| 2025-07-08 | CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations | Xiaohu Li et.al. | 2507.06043 | null | 
| 2025-07-08 | RabakBench: Scaling Human Annotations to Construct Localized Multilingual Safety Benchmarks for Low-Resource Languages | Gabriel Chua et.al. | 2507.05980 | null | 
| 2025-07-08 | TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data | Aravind Cheruvu et.al. | 2507.05660 | null | 
| 2025-07-07 | Red Teaming AI Red Teaming | Subhabrata Majumdar et.al. | 2507.05538 | null | 
| 2025-07-07 | Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models | Ziqi Miao et.al. | 2507.05248 | null | 
| 2025-07-07 | Trojan Horse Prompting: Jailbreaking Conversational Multimodal Models by Forging Assistant Message | Wei Duan et.al. | 2507.04673 | null | 
| 2025-07-09 | Tail-aware Adversarial Attacks: A Distributional Approach to Efficient LLM Jailbreaking | Tim Beyer et.al. | 2507.04446 | null | 
| 2025-07-06 | Attention Slipping: A Mechanistic Understanding of Jailbreak Attacks and Defenses in LLMs | Xiaomeng Hu et.al. | 2507.04365 | null | 
| 2025-07-08 | On Jailbreaking Quantized Language Models Through Fault Injection Attacks | Noureldin Zahran et.al. | 2507.03236 | null | 
| 2025-07-03 | Adversarial Manipulation of Reasoning Models using Internal Representations | Kureha Yamaguchi et.al. | 2507.03167 | null | 
| 2025-07-01 | 'For Argument's Sake, Show Me How to Harm Myself!': Jailbreaking LLMs in Suicide and Self-Harm Contexts | Annika M Schoene et.al. | 2507.02990 | null | 
| 2025-06-29 | A Representation Engineering Perspective on the Effectiveness of Multi-Turn Jailbreaks | Blake Bullwinkel et.al. | 2507.02956 | null | 
| 2025-07-03 | Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection | Ziqi Miao et.al. | 2507.02844 | null | 
| 2025-07-03 | Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models | Riccardo Cantini et.al. | 2507.02799 | null | 
| 2025-07-03 | PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage | Krishna Kanth Nakka et.al. | 2507.02332 | null | 
| 2025-07-02 | MGC: A Compiler Framework Exploiting Compositional Blindness in Aligned LLMs for Malware Generation | Lu Yan et.al. | 2507.02057 | null | 
| 2025-07-02 | SafePTR: Token-Level Jailbreak Defense in Multimodal LLMs via Prune-then-Restore Mechanism | Beitao Chen et.al. | 2507.01513 | null | 
| 2025-04-18 | AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models | Aashray Reddy et.al. | 2507.01020 | null | 
| 2025-07-01 | Reasoning as an Adaptive Defense for Safety | Taeyoun Kim et.al. | 2507.00971 | null | 
| 2025-07-01 | SafeMobile: Chain-level Jailbreak Detection and Automated Evaluation for Multimodal Mobile Agents | Siyuan Liang et.al. | 2507.00841 | null | 
| 2025-06-30 | Linearly Decoding Refused Knowledge in Aligned Language Models | Aryan Shrivastava et.al. | 2507.00239 | null | 
| 2025-07-18 | STACK: Adversarial Attacks on LLM Safeguard Pipelines | Ian R. McKenzie et.al. | 2506.24068 | null | 
| 2025-06-30 | Logit-Gap Steering: Efficient Short-Suffix Jailbreaks for Aligned Large Language Models | Tung-Ling Li et.al. | 2506.24056 | null | 
| 2025-06-30 | Leveraging the Potential of Prompt Engineering for Hate Speech Detection in Low-Resource Languages | Ruhina Tabasshum Prome et.al. | 2506.23930 | null | 
| 2025-06-30 | Evaluating Multi-Agent Defences Against Jailbreaking Attacks on Large Language Models | Maria Carolina Cornelia Wit et.al. | 2506.23576 | null | 
| 2025-06-28 | Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models | Younwoo Choi et.al. | 2506.22957 | null | 
| 2025-06-27 | VERA: Variational Inference Framework for Jailbreaking Large Language Models | Anamika Lochab et.al. | 2506.22666 | null | 
| 2025-06-27 | MetaCipher: A General and Extensible Reinforcement Learning Framework for Obfuscation-Based Jailbreak Attacks on Black-Box LLMs | Boyuan Chen et.al. | 2506.22557 | null | 
| 2025-07-02 | Red Teaming for Generative AI, Report on a Copyright-Focused Exercise Completed in an Academic Medical Center | James Wen et.al. | 2506.22523 | null | 
| 2025-06-27 | Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses | Mohamed Ahmed et.al. | 2506.21972 | null | 
| 2025-06-24 | PrivacyXray: Detecting Privacy Breaches in LLMs through Semantic Consistency and Probability Certainty | Jinwen He et.al. | 2506.19563 | null | 
| 2025-06-24 | MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models | Yinan Xia et.al. | 2506.19257 | null | 
| 2025-06-23 | Command-V: Pasting LLM Behaviors via Activation Profiles | Barry Wang et.al. | 2506.19140 | null | 
| 2025-06-23 | Security Assessment of DeepSeek and GPT Series Models against Jailbreak Attacks | Xiaodong Wu et.al. | 2506.18543 | null | 
| 2025-06-23 | NSFW-Classifier Guided Prompt Sanitization for Safe Text-to-Image Generation | Yu Xie et.al. | 2506.18325 | null | 
| 2025-06-22 | Multi-turn Jailbreaking via Global Refinement and Active Fabrication | Hua Tang et.al. | 2506.17881 | null | 
| 2025-06-20 | Kaleidoscopic Teaming in Multi Agent Simulations | Ninareh Mehrabi et.al. | 2506.17514 | null | 
| 2025-06-17 | LLM Jailbreak Oracle | Shuyi Lin et.al. | 2506.17299 | null | 
| 2025-05-26 | Efficient and Stealthy Jailbreak Attacks via Adversarial Prompt Distillation from LLMs to SLMs | Xiang Li et.al. | 2506.17231 | null | 
| 2025-06-20 | From Concepts to Components: Concept-Agnostic Attention Module Discovery in Transformers | Jingtong Su et.al. | 2506.17052 | null | 
| 2025-06-20 | MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning | Muyang Zheng et.al. | 2506.16792 | null | 
| 2025-06-20 | Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models | Lei Jiang et.al. | 2506.16760 | null | 
| 2025-06-19 | Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models | Biao Yi et.al. | 2506.16447 | null | 
| 2025-06-19 | Probing the Robustness of Large Language Models Safety to Latent Perturbations | Tianle Gu et.al. | 2506.16078 | link | 
| 2025-06-18 | Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts | Kartik Sharma et.al. | 2506.15751 | null | 
| 2025-07-20 | From LLMs to MLLMs to Agents: A Survey of Emerging Paradigms in Jailbreak Attacks and Defenses within LLM Ecosystem | Yanxu Mao et.al. | 2506.15170 | null | 
| 2025-06-24 | FORTRESS: Frontier Risk Evaluation for National Security and Public Safety | Christina Q. Knight et.al. | 2506.14922 | null | 
| 2025-06-17 | AIRTBench: Measuring Autonomous AI Red Teaming Capabilities in Language Models | Ads Dawson et.al. | 2506.14682 | link | 
| 2025-06-16 | Alignment Quality Index (AQI) : Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer wise Pooled Representations | Abhilekh Borah et.al. | 2506.13901 | null | 
| 2025-06-16 | We Should Identify and Mitigate Third-Party Safety Risks in MCP-Powered Agent Systems | Junfeng Fang et.al. | 2506.13666 | link | 
| 2025-06-17 | Safe-Child-LLM: A Developmental Benchmark for Evaluating LLM Safety in Child-LLM Interactions | Junfeng Jiao et.al. | 2506.13510 | link | 
| 2025-06-16 | From Promise to Peril: Rethinking Cybersecurity Red and Blue Teaming in the Age of LLMs | Alsharif Abuadbba et.al. | 2506.13434 | null | 
| 2025-06-15 | Jailbreak Strength and Model Similarity Predict Transferability | Rico Angell et.al. | 2506.12913 | null | 
| 2025-06-15 | Universal Jailbreak Suffixes Are Strong Attention Hijackers | Matan Ben-Tov et.al. | 2506.12880 | link | 
| 2025-06-15 | SecurityLingua: Efficient Defense of LLM Jailbreak Attacks via Security-Aware Prompt Compression | Yucheng Li et.al. | 2506.12707 | null | 
| 2025-06-15 | Alphabet Index Mapping: Jailbreaking LLMs through Semantic Dissimilarity | Bilal Saleh Husain et.al. | 2506.12685 | null | 
| 2025-07-11 | Pushing the Limits of Safety: A Technical Report on the ATLAS Challenge 2025 | Zonghao Ying et.al. | 2506.12430 | link | 
| 2025-06-21 | Exploring the Secondary Risks of Large Language Models | Jiawei Chen et.al. | 2506.12382 | null | 
| 2025-06-14 | QGuard:Question-based Zero-shot Guard for Multi-modal LLM Safety | Taegyeong Lee et.al. | 2506.12299 | null | 
| 2025-06-13 | InfoFlood: Jailbreaking Large Language Models with Information Overload | Advait Yadav et.al. | 2506.12274 | null | 
| 2025-06-13 | Investigating Vulnerabilities and Defenses Against Audio-Visual Attacks: A Comprehensive Survey Emphasizing Multimodal Models | Jinming Wen et.al. | 2506.11521 | null | 
| 2025-06-04 | RedDebate: Safer Responses through Multi-Agent Red Teaming Debates | Ali Asad et.al. | 2506.11083 | null | 
| 2025-06-12 | How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts? | Sohee Yang et.al. | 2506.10979 | null | 
| 2025-06-12 | SoK: Evaluating Jailbreak Guardrails for Large Language Models | Xunguang Wang et.al. | 2506.10597 | link | 
| 2025-07-01 | Data-Centric Safety and Ethical Measures for Data and AI Governance | Srija Chakraborty et.al. | 2506.10217 | null | 
| 2025-06-11 | GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models | Zilong Wang et.al. | 2506.10047 | null | 
| 2025-06-10 | Evaluation empirique de la sécurisation et de l'alignement de ChatGPT et Gemini: analyse comparative des vulnérabilités par expérimentations de jailbreaks | Rafaël Nouailles et.al. | 2506.10029 | null | 
| 2025-06-09 | LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges | Haoyang Li et.al. | 2506.10022 | link | 
| 2025-06-07 | From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment | Kyubyung Chae et.al. | 2506.10020 | null | 
| 2025-06-22 | Effective Red-Teaming of Policy-Adherent Agents | Itay Nakash et.al. | 2506.09600 | null | 
| 2025-06-11 | LLMs Cannot Reliably Judge (Yet?): A Comprehensive Assessment on the Robustness of LLM-as-a-Judge | Songze Li et.al. | 2506.09443 | link | 
| 2025-06-08 | Enhancing the Safety of Medical Vision-Language Models by Synthetic Demonstrations | Zhiyu Xue et.al. | 2506.09067 | null | 
| 2025-06-11 | AdversariaL attacK sAfety aLIgnment(ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement- Introducing Adversarial Vulnerability Quality Index (AVQI) | Danush Khanna et.al. | 2506.08885 | null | 
| 2025-06-11 | RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards | Jingnan Zheng et.al. | 2506.07736 | null | 
| 2025-06-09 | Evaluating LLMs Robustness in Less Resourced Languages with Proxy Models | Maciej Chrabąszcz et.al. | 2506.07645 | null | 
| 2025-06-09 | TwinBreak: Jailbreaking LLM Security Alignments based on Twin Prompts | Torsten Krauß et.al. | 2506.07596 | null | 
| 2025-06-09 | When Style Breaks Safety: Defending Language Models Against Superficial Style Alignment | Yuxin Xiao et.al. | 2506.07452 | link | 
| 2025-06-09 | Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment Failures | Yukai Zhou et.al. | 2506.07402 | null | 
| 2025-06-08 | Quality-Diversity Red-Teaming: Automated Generation of High-Quality and Diverse Attackers for Large Language Models | Ren-Jian Wang et.al. | 2506.07121 | null | 
| 2025-06-08 | AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint | Leheng Sheng et.al. | 2506.07022 | link | 
| 2025-06-11 | Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test | Xiaoyuan Zhu et.al. | 2506.06975 | null | 
| 2025-07-09 | Saffron-1: Safety Inference Scaling | Ruizhong Qiu et.al. | 2506.06444 | link | 
| 2025-06-06 | Small Models, Big Support: A Local LLM Framework for Teacher-Centric Content Creation and Assessment using RAG and CAG | Zarreen Reza et.al. | 2506.05925 | null | 
| 2025-06-09 | A Red Teaming Roadmap Towards System-Level Safety | Zifan Wang et.al. | 2506.05376 | null | 
| 2025-06-05 | Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets | Lei Hsiung et.al. | 2506.05346 | null | 
| 2025-06-11 | HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model | Youngwan Lee et.al. | 2506.04704 | null | 
| 2025-06-04 | RedRFT: A Light-Weight Benchmark for Reinforcement Fine-Tuning-Based Red Teaming | Xiang Zheng et.al. | 2506.04302 | link | 
| 2025-06-03 | Adversarial Attacks on Robotic Vision Language Action Models | Eliot Krzysztof Jones et.al. | 2506.03350 | link | 
| 2025-06-03 | It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics | Matthew Kowal et.al. | 2506.02873 | null | 
| 2025-06-03 | From Prompts to Protection: Large Language Model-Enabled In-Context Learning for Smart Public Safety UAV | Yousef Emami et.al. | 2506.02649 | null | 
| 2025-06-03 | BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage | Kalyan Nakka et.al. | 2506.02479 | link | 
| 2025-05-30 | Towards Secure MLOps: Surveying Attacks, Mitigation Strategies, and Research Challenges | Raj Patel et.al. | 2506.02032 | null | 
| 2025-06-02 | Red Teaming AI Policy: A Taxonomy of Avoision and the EU AI Act | Rui-Jie Yew et.al. | 2506.01931 | null | 
| 2025-06-02 | ReGA: Representation-Guided Abstraction for Model-based Safeguarding of LLMs | Zeming Wei et.al. | 2506.01770 | link | 
| 2025-06-02 | Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models | Youze Wang et.al. | 2506.01307 | null | 
| 2025-06-01 | XGUARD: A Graded Benchmark for Evaluating Safety Failures of Large Language Models on Extremist Content | Vadivel Abishethvarman et.al. | 2506.00973 | null | 
| 2025-06-01 | Predicting Empirical AI Research Outcomes with Language Models | Jiaxin Wen et.al. | 2506.00794 | null | 
| 2025-06-01 | Jailbreak-R1: Exploring the Jailbreak Capabilities of LLMs via Reinforcement Learning | Weiyang Guo et.al. | 2506.00782 | null | 
| 2025-06-01 | CoP: Agentic Red-teaming for Large Language Models using Composition of Principles | Chen Xiong et.al. | 2506.00781 | null | 
| 2025-05-31 | Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities | Jiahui Geng et.al. | 2506.00548 | link | 
| 2025-05-31 | Keeping an Eye on LLM Unlearning: The Hidden Risk and Remedy | Jie Ren et.al. | 2506.00359 | null | 
| 2025-05-29 | SafeCOMM: What about Safety Alignment in Fine-Tuned Telecom Large Language Models? | Aladin Djuhera et.al. | 2506.00062 | null | 
| 2025-05-30 | TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis | Xiaorui Wu et.al. | 2505.24672 | link | 
| 2025-05-30 | Benchmarking Large Language Models for Cryptanalysis and Mismatched-Generalization | Utsav Maskey et.al. | 2505.24621 | null | 
| 2025-05-30 | AMIA: Automatic Masking and Joint Intention Analysis Makes LVLMs Robust Jailbreak Defenders | Yuqi Zhang et.al. | 2505.24519 | null | 
| 2025-05-30 | Model Unlearning via Sparse Autoencoder Subspace Guided Projections | Xu Wang et.al. | 2505.24428 | null | 
| 2025-05-30 | A Reward-driven Automated Webshell Malicious-code Generator for Red-teaming | Yizhong Ding et.al. | 2505.24252 | null | 
| 2025-05-30 | Locating Risk: Task Designers and the Challenge of Risk Disclosure in RAI Content Work | Alice Qian Zhang et.al. | 2505.24246 | null | 
| 2025-05-30 | From Hallucinations to Jailbreaks: Rethinking the Vulnerability of Large Foundation Models | Haibo Jin et.al. | 2505.24232 | null | 
| 2025-05-28 | GeneBreaker: Jailbreak Attacks against DNA Language Models with Pathogenicity Guidance | Zaixi Zhang et.al. | 2505.23839 | link | 
| 2025-06-29 | CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring | Benjamin Arnav et.al. | 2505.23575 | null | 
| 2025-05-29 | Understanding Refusal in Language Models with Sparse Autoencoders | Wei Jie Yeo et.al. | 2505.23556 | link | 
| 2025-07-23 | MEF: A Capability-Aware Multi-Encryption Framework for Evaluating Vulnerabilities in Black-Box Large Language Models | Mingyu Yu et.al. | 2505.23404 | null | 
| 2025-05-28 | Adaptive Detoxification: Safeguarding General Capabilities of LLMs through Toxicity-Aware Knowledge Editing | Yifan Lu et.al. | 2505.22298 | null | 
| 2025-05-28 | Test-Time Immunization: A Universal Defense Framework Against Jailbreaks for (Multimodal) Large Language Models | Yongcan Yu et.al. | 2505.22271 | null | 
| 2025-05-28 | Jailbreak Distillation: Renewable Safety Benchmarking | Jingyu Zhang et.al. | 2505.22037 | null | 
| 2025-06-01 | RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments | Zeyi Liao et.al. | 2505.21936 | link | 
| 2025-05-27 | Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation | Tharindu Kumarage et.al. | 2505.21784 | null | 
| 2025-05-26 | Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts | Hee-Seon Kim et.al. | 2505.21556 | null | 
| 2025-05-28 | Breaking the Ceiling: Exploring the Potential of Jailbreak Attacks through Expanding Strategy Space | Yao Huang et.al. | 2505.21277 | link | 
| 2025-05-27 | Red-Teaming Text-to-Image Systems by Rule-based Preference Modeling | Yichuan Cao et.al. | 2505.21074 | null | 
| 2025-05-27 | Improved Representation Steering for Language Models | Zhengxuan Wu et.al. | 2505.20809 | link | 
| 2025-05-22 | Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs | Amr Hegazy et.al. | 2505.20309 | null | 
| 2025-05-26 | Lifelong Safety Alignment for Language Models | Haoyu Wang et.al. | 2505.20259 | link | 
| 2025-05-26 | Capability-Based Scaling Laws for LLM Red-Teaming | Alexander Panfilov et.al. | 2505.20162 | link | 
| 2025-05-26 | Attention! You Vision Language Model Could Be Maliciously Manipulated | Xiaosen Wang et.al. | 2505.19911 | null | 
| 2025-05-26 | What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs | Sangyeop Kim et.al. | 2505.19773 | null | 
| 2025-05-26 | SGM: A Framework for Building Specification-Guided Moderation Filters | Masoomali Fatehkia et.al. | 2505.19766 | null | 
| 2025-05-28 | VisCRA: A Visual Chain Reasoning Attack for Jailbreaking Multimodal Large Language Models | Bingrui Sima et.al. | 2505.19684 | null | 
| 2025-05-30 | JailBound: Jailbreaking Internal Safety Boundaries of Vision-Language Models | Jiaxin Song et.al. | 2505.19610 | null | 
| 2025-05-25 | GhostPrompt: Jailbreaking Text-to-image Generative Models based on Dynamic Optimization | Zixuan Chen et.al. | 2505.18979 | null | 
| 2025-05-31 | Security Concerns for Large Language Models: A Survey | Miles Q. Li et.al. | 2505.18889 | null | 
| 2025-05-24 | Audio Jailbreak Attacks: Exposing Vulnerabilities in SpeechGPT in a White-Box Framework | Binhao Ma et.al. | 2505.18864 | link | 
| 2025-05-24 | Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment? | Hongzheng Yang et.al. | 2505.18672 | null | 
| 2025-05-24 | Safety Alignment via Constrained Knowledge Unlearning | Zesheng Shi et.al. | 2505.18588 | null | 
| 2025-05-24 | Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation | Jun Zhuang et.al. | 2505.18556 | null | 
| 2025-05-22 | Towards medical AI misalignment: a preliminary study | Barbara Puccio et.al. | 2505.18212 | null | 
| 2025-05-23 | An Example Safety Case for Safeguards Against Misuse | Joshua Clymer et.al. | 2505.18003 | null | 
| 2025-05-28 | Survival Games: Human-LLM Strategic Showdowns under Severe Resource Scarcity | Zhihong Chen et.al. | 2505.17937 | link | 
| 2025-05-23 | Does Chain-of-Thought Reasoning Really Reduce Harmfulness from Jailbreaking? | Chengda Lu et.al. | 2505.17650 | null | 
| 2025-05-28 | Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models | Jiawei Kong et.al. | 2505.17601 | null | 
| 2025-05-23 | One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs | Linbao Li et.al. | 2505.17598 | link | 
| 2025-05-23 | JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models | Zifan Peng et.al. | 2505.17568 | link | 
| 2025-05-23 | Chain-of-Lure: A Synthetic Narrative-Driven Approach to Compromise Large Language Models | Wenhan Chang et.al. | 2505.17519 | null | 
| 2025-05-22 | Refusal Direction is Universal Across Safety-Aligned Languages | Xinpeng Wang et.al. | 2505.17306 | null | 
| 2025-05-22 | MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming | Weiyang Guo et.al. | 2505.17147 | link | 
| 2025-06-07 | Robustifying Vision-Language Models via Dynamic Token Reweighting | Tanqiu Jiang et.al. | 2505.17132 | null | 
| 2025-05-21 | RRTL: Red Teaming Reasoning Large Language Models in Tool Learning | Yifei Liu et.al. | 2505.17106 | null | 
| 2025-05-20 | Trust Me, I Can Handle It: Self-Generated Adversarial Scenario Extrapolation for Robust Language Models | Md Rafi Ur Rashid et.al. | 2505.17089 | null | 
| 2025-06-27 | Improving LLM Outputs Against Jailbreak Attacks with Expert Model Integration | Tatia Tsmindashvili et.al. | 2505.17066 | null | 
| 2025-05-22 | When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques | Jianing Geng et.al. | 2505.16765 | null | 
| 2025-05-23 | Finetuning-Activated Backdoors in LLMs | Thibaud Gloaguen et.al. | 2505.16567 | link | 
| 2025-05-22 | Implicit Jailbreak Attacks via Cross-Modal Information Concealment on Vision-Language Models | Zhaoxin Wang et.al. | 2505.16446 | null | 
| 2025-05-26 | Three Minds, One Legend: Jailbreak Large Reasoning Model with Adaptive Stacked Ciphers | Viet-Anh Nguyen et.al. | 2505.16241 | null | 
| 2025-05-22 | SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning | Kaiwen Zhou et.al. | 2505.16186 | null | 
| 2025-05-21 | Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval | Taiye Chen et.al. | 2505.15753 | null | 
| 2025-05-21 | Alignment Under Pressure: The Case for Informed Adversaries When Evaluating LLM Defenses | Xiaoxue Yang et.al. | 2505.15738 | link | 
| 2025-05-21 | Silent Leaks: Implicit Knowledge Extraction Attack on RAG Systems through Benign Queries | Yuhao Wang et.al. | 2505.15420 | null | 
| 2025-05-21 | Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models | Zirui Song et.al. | 2505.15406 | link | 
| 2025-05-20 | Soft Prompts for Evaluation: Measuring Conditional Distance of Capabilities | Ross Nordby et.al. | 2505.14943 | link | 
| 2025-05-31 | SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment | Wonje Jeung et.al. | 2505.14667 | null | 
| 2025-05-20 | sudoLLM : On Multi-role Alignment of Language Models | Soumadeep Saha et.al. | 2505.14607 | null | 
| 2025-05-20 | Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders | Agam Goyal et.al. | 2505.14536 | null | 
| 2025-05-23 | Hidden Ghost Hand: Unveiling Backdoor Vulnerabilities in MLLM-Powered Mobile GUI Agents | Pengzhou Cheng et.al. | 2505.14418 | null | 
| 2025-05-20 | Exploring Jailbreak Attacks on LLMs through Intent Concealment and Diversion | Tiehan Cui et.al. | 2505.14316 | null | 
| 2025-05-20 | EVA: Red-Teaming GUI Agents via Evolving Indirect Prompt Injection | Yijie Lu et.al. | 2505.14289 | null | 
| 2025-05-20 | "Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs | Darpan Aswal et.al. | 2505.14226 | null | 
| 2025-05-21 | AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models | Guangke Chen et.al. | 2505.14103 | null | 
| 2025-05-26 | PandaGuard: Systematic Evaluation of LLM Safety against Jailbreaking Attacks | Guobin Shen et.al. | 2505.13862 | link | 
| 2025-05-18 | SPIRIT: Patching Speech Language Models against Jailbreak Attacks | Amirbek Djanibekov et.al. | 2505.13541 | null | 
| 2025-05-18 | Logic Jailbreak: Efficiently Unlocking LLM Safety Restrictions Through Formal Logical Expression | Jingyu Peng et.al. | 2505.13527 | null | 
| 2025-05-19 | I'll believe it when I see it: Images increase misinformation sharing in Vision-Language Models | Alice Plebe et.al. | 2505.13302 | link | 
| 2025-05-19 | CURE: Concept Unlearning via Orthogonal Representation Editing in Diffusion Models | Shristi Das Biswas et.al. | 2505.12677 | null | 
| 2025-05-18 | BadNAVer: Exploring Jailbreak Attacks On Vision-and-Language Navigation | Wenqi Lyu et.al. | 2505.12443 | null | 
| 2025-05-18 | The Tower of Babel Revisited: Multilingual Jailbreak Prompts on Closed-Source Large Language Models | Linghan Huang et.al. | 2505.12287 | null | 
| 2025-05-17 | Why Not Act on What You Know? Unleashing Safety Potential of LLMs via Self-Aware Guard Enhancement | Peng Ding et.al. | 2505.12060 | link | 
| 2025-05-17 | Multilingual Collaborative Defense for Large Language Models | Hongliang Li et.al. | 2505.11835 | link | 
| 2025-05-20 | JULI: Jailbreak Large Language Models by Self-Introspection | Jesson Wang et.al. | 2505.11790 | null | 
| 2025-05-16 | Unveiling the Black Box: A Multi-Layer Framework for Explaining Reinforcement Learning-Based Cyber Agents | Diksha Goel et.al. | 2505.11708 | null | 
| 2025-05-16 | CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs | Sijia Chen et.al. | 2505.11413 | null | 
| 2025-05-16 | AutoRAN: Weak-to-Strong Jailbreaking of Large Reasoning Models | Jiacheng Liang et.al. | 2505.10846 | link | 
| 2025-05-16 | LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs | Ran Li et.al. | 2505.10838 | null | 
| 2025-05-15 | Dark LLMs: The Growing Threat of Unaligned AI Models | Michael Fire et.al. | 2505.10066 | null | 
| 2025-05-16 | PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization | Yidan Wang et.al. | 2505.09921 | link | 
| 2025-05-14 | Adversarial Attack on Large Language Models using Exponentiated Gradient Descent | Sajib Biswas et.al. | 2505.09820 | link | 
| 2025-05-14 | Adversarial Suffix Filtering: a Defense Pipeline for LLMs | David Khachaturov et.al. | 2505.09602 | null | 
| 2025-05-11 | TokenProber: Jailbreaking Text-to-image Models via Fine-grained Word Impact Analysis | Longtian Wang et.al. | 2505.08804 | null | 
| 2025-05-19 | Concept-Level Explainability for Auditing & Steering LLM Responses | Kenza Amara et.al. | 2505.07610 | link | 
| 2025-05-12 | One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models | Haoran Gu et.al. | 2505.07167 | null | 
| 2025-05-25 | Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety | Zihan Guan et.al. | 2505.06843 | link | 
| 2025-06-17 | T2V-OptJail: Discrete Prompt Optimization for Text-to-Video Jailbreak Attacks | Jiayang Liu et.al. | 2505.06679 | null | 
| 2025-05-10 | Practical Reasoning Interruption Attacks on Reasoning Large Language Models | Yu Cui et.al. | 2505.06643 | null | 
| 2025-05-21 | Think in Safety: Unveiling and Mitigating Safety Alignment Collapse in Multimodal Large Reasoning Model | Xinyue Lou et.al. | 2505.06538 | link | 
| 2025-05-10 | System Prompt Poisoning: Persistent Attacks on Large Language Models Beyond User Injection | Jiawei Guo et.al. | 2505.06493 | null | 
| 2025-05-09 | Offensive Security for AI Systems: Concepts, Practices, and Applications | Josh Harguess et.al. | 2505.06380 | null | 
| 2025-05-07 | DMRL: Data- and Model-aware Reward Learning for Data Extraction | Zhiqiang Wang et.al. | 2505.06284 | null | 
| 2025-06-14 | AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents | Zhun Wang et.al. | 2505.05849 | null | 
| 2025-05-12 | LiteLMGuard: Seamless and Lightweight On-Device Prompt Filtering for Safeguarding Small Language Models against Quantization-induced Risks and Vulnerabilities | Kalyan Nakka et.al. | 2505.05619 | link | 
| 2025-05-08 | Safety by Measurement: A Systematic Literature Review of AI Safety Evaluation Methods | Markov Grey et.al. | 2505.05541 | null | 
| 2025-05-13 | Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs | Chetan Pathade et.al. | 2505.04806 | null | 
| 2025-05-28 | The Aloe Family Recipe for Open and Specialized Healthcare LLMs | Dario Garcia-Gasulla et.al. | 2505.04388 | null | 
| 2025-05-07 | Unmasking the Canvas: A Dynamic Benchmark for Image Generation Jailbreaking and LLM Content Safety | Variath Madhupal Gautham Nair et.al. | 2505.04146 | null | 
| 2025-05-06 | LlamaFirewall: An open source guardrail system for building secure AI agents | Sahana Chennabasappa et.al. | 2505.03574 | null | 
| 2025-06-27 | Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs | Haoming Yang et.al. | 2505.02862 | null | 
| 2025-05-04 | Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents | Christian Schroeder de Witt et.al. | 2505.02077 | null | 
| 2025-05-05 | Helping Large Language Models Protect Themselves: An Enhanced Filtering and Summarization System | Sheikh Samit Muhaimin et.al. | 2505.01315 | null | 
| 2025-05-01 | OET: Optimization-based prompt injection Evaluation Toolkit | Jinsheng Pan et.al. | 2505.00843 | link | 
| 2025-07-11 | Red Teaming Large Language Models for Healthcare | Vahid Balazadeh et.al. | 2505.00467 | null | 
| 2025-05-19 | HyPerAlign: Interpretable Personalized LLM Alignment via Hypothesis Generation | Cristina Garbacea et.al. | 2505.00038 | null | 
| 2025-04-21 | Jailbreak Detection in Clinical Training LLMs Using Feature-Based Predictive Models | Tri Nguyen et.al. | 2505.00010 | null | 
| 2025-04-30 | XBreaking: Explainable Artificial Intelligence for Jailbreaking LLMs | Marco Arazzi et.al. | 2504.21700 | null | 
| 2025-04-30 | Hoist with His Own Petard: Inducing Guardrails to Facilitate Denial-of-Service Attacks on Retrieval-Augmented Generation of LLMs | Pan Suo et.al. | 2504.21680 | null | 
| 2025-06-02 | The Dual Power of Interpretable Token Embeddings: Jailbreaking Attacks and Defenses for Diffusion Model Unlearning | Siyi Chen et.al. | 2504.21307 | null | 
| 2025-04-28 | Prefill-Based Jailbreak: A Novel Approach of Bypassing LLM Safety Boundary | Yakai Li et.al. | 2504.21038 | link | 
| 2025-06-13 | AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security | Zikui Cai et.al. | 2504.20965 | link | 
| 2025-04-29 | When Testing AI Tests Us: Safeguarding Mental Health on the Digital Frontlines | Sachin R. Pendse et.al. | 2504.20910 | null | 
| 2025-04-29 | Inception: Jailbreak the Memory Mechanism of Text-to-Image Generation Systems | Shiqian Zhao et.al. | 2504.20376 | null | 
| 2025-04-25 | Understanding and Mitigating Risks of Generative AI in Financial Services | Sebastian Gehrmann et.al. | 2504.20086 | null | 
| 2025-04-29 | The Automation Advantage in AI Red Teaming | Rob Mulla et.al. | 2504.19855 | null | 
| 2025-04-28 | | Madhur Jindal et.al. | 2504.19674 | link | 
| 2025-05-09 | Security Steerability is All You Need | Itay Hazan et.al. | 2504.19521 | null | 
| 2025-04-28 | JailbreaksOverTime: Detecting Jailbreak Attacks Under Distribution Shift | Julien Piet et.al. | 2504.19440 | link | 
| 2025-04-26 | Graph of Attacks: Improved Black-Box and Interpretable Jailbreaks for LLMs | Mohammad Akbar-Tajari et.al. | 2504.19019 | link | 
| 2025-04-21 | DualBreach: Efficient Dual-Jailbreaking via Target-Driven Initialization and Multi-Target Optimization | Xinzhe Huang et.al. | 2504.18564 | null | 
| 2025-04-25 | RAG LLMs are Not Safer: A Safety Analysis of Retrieval-Augmented Generation for Large Language Models | Bang An et.al. | 2504.18041 | null | 
| 2025-04-23 | Amplified Vulnerabilities: Structured Jailbreak Attacks on LLM-based Multi-Agent Debate | Senmao Qi et.al. | 2504.16489 | null | 
| 2025-04-26 | T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models | Siyuan Liang et.al. | 2504.15512 | null | 
| 2025-05-20 | MrGuard: A Multilingual Reasoning Guardrail for Universal LLM Safety | Yahan Yang et.al. | 2504.15241 | null | 
| 2025-04-21 | RainbowPlus: Enhancing Adversarial Prompt Generation via Evolutionary Quality-Diversity Search | Quy-Anh Dang et.al. | 2504.15047 | link | 
| 2025-04-20 | LLM-Enabled In-Context Learning for Data Collection Scheduling in UAV-assisted Sensor Networks | Yousef Emami et.al. | 2504.14556 | null | 
| 2025-04-18 | DETAM: Defending LLMs Against Jailbreak Attacks via Targeted Attention Modification | Yu Li et.al. | 2504.13562 | null | 
| 2025-04-15 | X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents | Salman Rahman et.al. | 2504.13203 | null | 
| 2025-04-15 | Concept Enhancement Engineering: A Lightweight and Efficient Robust Defense Against Jailbreak Attacks in Embodied AI | Jirui Yang et.al. | 2504.13201 | null | 
| 2025-04-17 | GraphAttack: Exploiting Representational Blindspots in LLM Safety Mechanisms | Sinan He et.al. | 2504.13052 | null | 
| 2025-04-17 | ZeroSumEval: Scaling LLM Evaluation with Inter-Model Competition | Haidar Khan et.al. | 2504.12562 | link | 
| 2025-04-17 | ELAB: Extensive LLM Alignment Benchmark in Persian Language | Zahra Pourbahman et.al. | 2504.12553 | null | 
| 2025-04-10 | AttentionDefense: Leveraging System Prompt Attention for Explainable Defense Against Novel Jailbreaks | Charlotte Siska et.al. | 2504.12321 | null | 
| 2025-07-14 | Bypassing LLM Guardrails: An Empirical Analysis of Evasion Attacks against Prompt Injection and Jailbreak Detection Systems | William Hackett et.al. | 2504.11168 | null | 
| 2025-04-15 | Token-Level Constraint Boundary Search for Jailbreaking Text-to-Image Models | Jiangtao Liu et.al. | 2504.11106 | null | 
| 2025-04-14 | The Jailbreak Tax: How Useful are Your Jailbreak Outputs? | Kristina Nikolić et.al. | 2504.10694 | link | 
| 2025-04-29 | Demo: ViolentUTF as An Accessible Platform for Generative AI Red Teaming | Tam n. Nguyen et.al. | 2504.10603 | null | 
| 2025-04-16 | LLM Unlearning Reveals a Stronger-Than-Expected Coreset Effect in Current Benchmarks | Soumyadeep Pal et.al. | 2504.10185 | link | 
| 2025-04-14 | RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability | Yichi Zhang et.al. | 2504.10081 | null | 
| 2025-05-30 | The Structural Safety Generalization Problem | Julius Broomfield et.al. | 2504.09712 | link | 
| 2025-05-16 | Mitigating Many-Shot Jailbreaking | Christopher M. Ackerman et.al. | 2504.09604 | null | 
| 2025-04-13 | AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender | Weixiang Zhao et.al. | 2504.09466 | null | 
| 2025-04-13 | SaRO: Enhancing LLM Safety through Reasoning-based Alignment | Yutao Mou et.al. | 2504.09420 | null | 
| 2025-04-12 | Feature-Aware Malicious Output Detection and Mitigation | Weilong Dong et.al. | 2504.09191 | null | 
| 2025-04-09 | SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models | Junfeng Fang et.al. | 2504.08813 | null | 
| 2025-03-29 | A Framework for Lightweight Responsible Prompting Recommendation | Tiago Machado et.al. | 2504.08757 | null | 
| 2025-04-10 | Geneshift: Impact of different scenario shift on Jailbreaking LLM | Tianyi Wu et.al. | 2504.08104 | null | 
| 2025-04-10 | Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge | Riccardo Cantini et.al. | 2504.07887 | link | 
| 2025-04-10 | Achilles Heel of Distributed Multi-Agent Systems | Yiting Zhang et.al. | 2504.07461 | null | 
| 2025-03-20 | How Robust Are Router-LLMs? Analysis of the Fragility of LLM Routing Capabilities | Aly M. Kassem et.al. | 2504.07113 | null | 
| 2025-04-09 | Bypassing Safety Guardrails in LLMs Using Humor | Pedro Cisneros-Velarde et.al. | 2504.06577 | null | 
| 2025-04-08 | Mind the Trojan Horse: Image Prompt Adapter Enabling Scalable and Deceptive Jailbreaking | Junxi Chen et.al. | 2504.05838 | link | 
| 2025-05-24 | Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking | Yu-Hang Wu et.al. | 2504.05652 | link | 
| 2025-04-07 | How to evaluate control measures for LLM agents? A trajectory from today to superintelligence | Tomek Korbak et.al. | 2504.05259 | null | 
| 2025-04-07 | A Domain-Based Taxonomy of Jailbreak Vulnerabilities in Large Language Models | Carlos Peláez-González et.al. | 2504.04976 | null | 
| 2025-05-14 | Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models | Yubo Li et.al. | 2504.04717 | link | 
| 2025-04-06 | StyleRec: A Benchmark Dataset for Prompt Recovery in Writing Style Transformation | Shenyang Liu et.al. | 2504.04373 | null | 
| 2025-07-16 | JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model | Yi Nian et.al. | 2504.03770 | link | 
| 2025-04-04 | RWKVTTS: Yet another TTS based on RWKV-7 | Lin Yueyu et.al. | 2504.03289 | link | 
| 2025-04-04 | Multi-lingual Multi-turn Automated Red Teaming for LLMs | Abhishek Singhania et.al. | 2504.03174 | null | 
| 2025-07-22 | More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment | Yifan Wang et.al. | 2504.02193 | null | 
| 2025-04-02 | Evolving Security in LLMs: A Study of Jailbreak Attacks and Defenses | Zhengchun Shang et.al. | 2504.02080 | null | 
| 2025-07-15 | Representation Bending for Large Language Model Safety | Ashkan Yousefpour et.al. | 2504.01550 | link | 
| 2025-04-02 | LightDefense: A Lightweight Uncertainty-Driven Defense against Jailbreaks via Shifted Token Distribution | Zhuoran Yang et.al. | 2504.01533 | null | 
| 2025-06-21 | PiCo: Jailbreaking Multimodal Large Language Models via $\textbf{Pi}$ctorial | Aofan Liu et.al. | 2504.01444 | null | 
| 2025-04-07 | Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks | Jiawei Wang et.al. | 2504.01308 | link | 
| 2025-04-02 | Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning | Si Chen et.al. | 2504.01278 | null | 
| 2025-04-01 | Multilingual and Multi-Accent Jailbreaking of Audio LLMs | Jaechul Roh et.al. | 2504.01094 | null | 
| 2025-04-01 | Exposing the Ghost in the Transformer: Abnormal Detection for Large Language Models via Hidden State Forensics | Shide Zhou et.al. | 2504.00446 | null | 
| 2025-03-31 | Output Constraints as Attack Surface: Exploiting Structured Generation to Bypass LLM Safety Mechanisms | Shuoming Zhang et.al. | 2503.24191 | null | 
| 2025-03-28 | Effective Automation to Support the Human Infrastructure in AI Red Teaming | Alice Qian Zhang et.al. | 2503.22116 | null | 
| 2025-03-27 | Prompt, Divide, and Conquer: Bypassing Large Language Model Safety Filters via Segmented and Distributed Prompt Processing | Johan Wahréus et.al. | 2503.21598 | null | 
| 2025-03-26 | Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy | Joonhyun Jeong et.al. | 2503.20823 | link | 
| 2025-03-26 | Iterative Prompting with Persuasion Skills in Jailbreaking Large Language Models | Shih-Wen Ke et.al. | 2503.20320 | null | 
| 2025-06-08 | sudo rm -rf agentic_security | Sejin Lee et.al. | 2503.20279 | link | 
| 2025-03-25 | Red Teaming with Artificial Intelligence-Driven Cyberattacks: A Scoping Review | Mays Al-Azzawi et.al. | 2503.19626 | null | 
| 2025-03-24 | MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks | Wenhao You et.al. | 2503.19134 | null | 
| 2025-05-29 | Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment | Ruoxi Cheng et.al. | 2503.18991 | null | 
| 2025-03-24 | Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training | Brian R. Bartoldson et.al. | 2503.18929 | null | 
| 2025-04-19 | Reason2Attack: Jailbreaking Text-to-Image Models via LLM Reasoning | Chenyu Zhang et.al. | 2503.17987 | null | 
| 2025-03-23 | Smoke and Mirrors: Jailbreaking LLM-based Code Generation via Implicit Malicious Prompts | Sheng Ouyang et.al. | 2503.17953 | null | 
| 2025-03-23 | STShield: Single-Token Sentinel for Real-Time Jailbreak Detection in Large Language Models | Xunguang Wang et.al. | 2503.17932 | null | 
| 2025-03-21 | Jailbreaking the Non-Transferable Barrier via Test-Time Data Disguising | Yongli Xiang et.al. | 2503.17198 | null | 
| 2025-03-25 | In-House Evaluation Is Not Enough: Towards Robust Third-Party Flaw Disclosure for General-Purpose AI | Shayne Longpre et.al. | 2503.16861 | null | 
| 2025-03-20 | REVAL: A Comprehension Evaluation on Reliability and Values of Large Vision-Language Models | Jie Zhang et.al. | 2503.16566 | null | 
| 2025-01-24 | OpenAI's Approach to External Red Teaming for AI Models and Systems | Lama Ahmad et.al. | 2503.16431 | null | 
| 2025-05-19 | Detecting LLM-Generated Peer Reviews | Vishisht Rao et.al. | 2503.15772 | link | 
| 2025-03-20 | AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration | Andy Zhou et.al. | 2503.15754 | null | 
| 2025-03-19 | A Peek Behind the Curtain: Using Step-Around Prompt Engineering to Identify Bias and Misinformation in GenAI Models | Don Hickerson et.al. | 2503.15205 | null | 
| 2025-03-19 | MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models | Chejian Xu et.al. | 2503.14827 | null | 
| 2025-05-20 | MirrorShield: Towards Universal Defense Against Jailbreaks via Entropy-Guided Mirror Crafting | Rui Pu et.al. | 2503.12931 | null | 
| 2025-03-16 | Augmented Adversarial Trigger Learning | Zhe Wang et.al. | 2503.12339 | null | 
| 2025-04-21 | A Framework for Evaluating Emerging Cyberattack Capabilities of AI | Mikel Rodriguez et.al. | 2503.11917 | null | 
| 2025-03-14 | Making Every Step Effective: Jailbreaking Large Vision-Language Models Through Hierarchical KV Equalization | Shuyang Hao et.al. | 2503.11750 | null | 
| 2025-03-14 | Tit-for-Tat: Safeguarding Large Vision-Language Models Against Jailbreak Attacks via Adversarial Defense | Shuyang Hao et.al. | 2503.11619 | null | 
| 2025-03-14 | Align in Depth: Defending Jailbreak Attacks via Progressive Answer Detoxification | Yingjie Zhang et.al. | 2503.11185 | null | 
| 2025-03-21 | TAIJI: Textual Anchoring for Immunizing Jailbreak Images in Vision Language Models | Xiangyu Yin et.al. | 2503.10872 | null | 
| 2025-03-21 | CeTAD: Towards Certified Toxicity-Aware Distance in Vision Language Models | Xiangyu Yin et.al. | 2503.10661 | null | 
| 2025-05-28 | Tempest: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search | Andy Zhou et.al. | 2503.10619 | null | 
| 2025-03-13 | Red Teaming Contemporary AI Models: Insights from Spanish and Basque Perspectives | Miguel Romero-Arjona et.al. | 2503.10192 | null | 
| 2025-07-04 | Probing Latent Subspaces in LLM for AI Security: Identifying and Manipulating Adversarial States | Xin Wei Chia et.al. | 2503.09066 | null | 
| 2025-03-12 | JBFuzz: Jailbreaking LLMs Efficiently and Effectively Using Fuzzing | Vasudev Gohil et.al. | 2503.08990 | null | 
| 2025-04-08 | Randomness, Not Representation: The Unreliability of Evaluating Cultural Alignment in LLMs | Ariba Khan et.al. | 2503.08688 | link | 
| 2025-03-11 | Dialogue Injection Attack: Jailbreaking LLMs through Context Manipulation | Wenlong Meng et.al. | 2503.08195 | link | 
| 2025-03-10 | Safety Guardrails for LLM-Enabled Robots | Zachary Ravichandran et.al. | 2503.07885 | null | 
| 2025-03-10 | Utilizing Jailbreak Probability to Attack and Safeguard Multimodal LLMs | Wenzhuo Xu et.al. | 2503.06989 | null | 
| 2025-03-09 | Can Small Language Models Reliably Resist Jailbreak Attacks? A Comprehensive Evaluation | Wenhui Zhang et.al. | 2503.06519 | null | 
| 2025-05-06 | Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models | Thomas Winninger et.al. | 2503.06269 | link | 
| 2025-06-18 | MAD-MAX: Modular And Diverse Malicious Attack MiXtures for Automated LLM Red Teaming | Stefan Schoepf et.al. | 2503.06253 | null | 
| 2025-04-22 | Red Team Diffuser: Exposing Toxic Continuation Vulnerabilities in Vision-Language Models via Reinforcement Learning | Ruofan Wang et.al. | 2503.06223 | null | 
| 2025-03-07 | Jailbreaking is (Mostly) Simpler Than You Think | Mark Russinovich et.al. | 2503.05264 | null | 
| 2025-03-06 | Safety is Not Only About Refusal: Reasoning-Enhanced Fine-tuning for Interpretable LLM Safety | Yuyou Zhang et.al. | 2503.05021 | null | 
| 2025-05-26 | One-Shot is Enough: Consolidating Multi-Turn Attacks into Efficient Single-Turn Prompts for LLMs | Junwoo Ha et.al. | 2503.04856 | null | 
| 2025-03-18 | Adversarial Training for Multimodal Large Language Models against Jailbreak Attacks | Liming Lu et.al. | 2503.04833 | null | 
| 2025-03-06 | Know Thy Judge: On the Robustness Meta-Evaluation of LLM Safety Judges | Francisco Eiras et.al. | 2503.04474 | null | 
| 2025-06-12 | Improving LLM Safety Alignment with Dual-Objective Optimization | Xuandong Zhao et.al. | 2503.03710 | link | 
| 2025-03-05 | CURVALID: Geometrically-guided Adversarial Prompt Detection | Canaan Yung et.al. | 2503.03502 | link | 
| 2025-03-04 | LLM-Safety Evaluations Lack Robustness | Tim Beyer et.al. | 2503.02574 | null | 
| 2025-06-03 | Unnatural Languages Are Not Bugs but Features for LLMs | Keyu Duan et.al. | 2503.01926 | null | 
| 2025-06-06 | UDora: A Unified Red Teaming Framework against LLM Agents by Dynamically Hijacking Their Own Reasoning | Jiawei Zhang et.al. | 2503.01908 | link | 
| 2025-02-25 | Guiding not Forcing: Enhancing the Transferability of Jailbreaking Attacks on LLMs via Removing Superfluous Constraints | Junxiao Yang et.al. | 2503.01865 | link | 
| 2025-03-03 | Jailbreaking Safeguarded Text-to-Image Models via Large Language Models | Zhengyuan Jiang et.al. | 2503.01839 | null | 
| 2025-03-05 | Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models | Alberto Purpura et.al. | 2503.01742 | null | 
| 2025-03-03 | Jailbreaking Generative AI: Empowering Novices to Conduct Phishing Attacks | Rina Mishra et.al. | 2503.01395 | null | 
| 2025-02-28 | À la recherche du sens perdu: your favourite LLM might have more to say than you can understand | K. O. T. Erziev et.al. | 2503.00224 | link | 
| 2025-02-28 | Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks | Hanjiang Hu et.al. | 2503.00187 | link | 
| 2025-06-07 | from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors | Yu Yan et.al. | 2503.00038 | null | 
| 2025-06-10 | FC-Attack: Jailbreaking Multimodal Large Language Models via Auto-Generated Flowcharts | Ziyi Zhang et.al. | 2502.21059 | null | 
| 2025-02-28 | Efficient Jailbreaking of Large Models by Freeze Training: Lower Layers Exhibit Greater Sensitivity to Harmful Content | Hongyuan Shen et.al. | 2502.20952 | null | 
| 2025-02-28 | SafeText: Safe Text-to-image Models via Aligning the Text Encoder | Yuepeng Hu et.al. | 2502.20623 | null | 
| 2025-05-26 | Beyond the Tip of Efficiency: Uncovering the Submerged Threats of Jailbreak Attacks in Small Language Models | Sibo Yi et.al. | 2502.19883 | null | 
| 2025-03-28 | Foot-In-The-Door: A Multi-turn Jailbreak for LLMs | Zixuan Weng et.al. | 2502.19820 | link | 
| 2025-07-12 | No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms | Joshua Kazdan et.al. | 2502.19537 | null | 
| 2025-05-28 | Beyond Surface-Level Patterns: An Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs | Shiyu Xiang et.al. | 2502.19041 | null | 
| 2025-02-26 | JailBench: A Comprehensive Chinese Security Assessment Benchmark for Large Language Models | Shuyi Liu et.al. | 2502.18935 | null | 
| 2025-06-04 | TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice | Aman Goel et.al. | 2502.18504 | link | 
| 2025-02-24 | How Do Large Language Monkeys Get Their Power (Laws)? | Rylan Schaeffer et.al. | 2502.17578 | null | 
| 2025-05-29 | Dataset Featurization: Uncovering Natural Language Features through Unsupervised Data Reconstruction | Michal Bravansky et.al. | 2502.17541 | null | 
| 2025-02-24 | REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective | Simon Geisler et.al. | 2502.17254 | link | 
| 2025-07-09 | GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-wild LLM Jailbreak Methods | Ruixuan Huang et.al. | 2502.16903 | null | 
| 2025-06-12 | Guardians of the Agentic System: Preventing Many Shots Jailbreak with Agentic System | Saikat Barua et.al. | 2502.16750 | link | 
| 2025-02-22 | Be a Multitude to Itself: A Prompt Evolution Framework for Red Teaming | Rui Li et.al. | 2502.16109 | null | 
| 2025-06-03 | A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos | Yang Yao et.al. | 2502.15806 | null | 
| 2025-05-23 | SafeInt: Shielding Large Language Models from Jailbreak Attacks via Safety-Aware Representation Intervention | Jiaqi Wu et.al. | 2502.15594 | null | 
| 2025-02-21 | Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders | Xuansheng Wu et.al. | 2502.15576 | null | 
| 2025-02-21 | Single-pass Detection of Jailbreaking Input in Large Language Models | Leyla Naz Candogan et.al. | 2502.15435 | null | 
| 2025-02-21 | Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs | Giulio Zizzo et.al. | 2502.15427 | link | 
| 2025-02-21 | Attention Eclipse: Manipulating Attention to Bypass LLM Safety-Alignment | Pedram Zaree et.al. | 2502.15334 | null | 
| 2025-06-02 | Red-Teaming LLM Multi-Agent Systems via Communication Attacks | Pengfei He et.al. | 2502.14847 | null | 
| 2025-06-23 | HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States | Yilei Jiang et.al. | 2502.14744 | link | 
| 2025-02-20 | How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation | Zhuohang Long et.al. | 2502.14486 | null | 
| 2025-06-03 | Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region | Chak Tou Leong et.al. | 2502.13946 | null | 
| 2025-02-25 | Efficient Safety Retrofitting Against Jailbreaking for LLMs | Dario Garcia-Gasulla et.al. | 2502.13603 | null | 
| 2025-02-19 | Exploiting Prefix-Tree in Structured Output Interfaces for Enhancing Jailbreak Attacking | Yanzeng Li et.al. | 2502.13527 | link | 
| 2025-02-19 | Integrating Sequential Hypothesis Testing into Adversarial Games: A Sun Zi-Inspired Framework | Haosheng Zhou et.al. | 2502.13462 | null | 
| 2025-02-25 | Towards Robust and Secure Embodied AI: A Survey on Vulnerabilities and Attacks | Wenpeng Xing et.al. | 2502.13175 | null | 
| 2025-02-16 | ShieldLearner: A New Paradigm for Jailbreak Attack Defense in LLMs | Ziyi Ni et.al. | 2502.13162 | null | 
| 2025-02-18 | Understanding and Rectifying Safety Perception Distortion in VLMs | Xiaohan Zou et.al. | 2502.13095 | null | 
| 2025-05-29 | Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking | Junda Zhu et.al. | 2502.12970 | link | 
| 2025-02-27 | H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking | Martin Kuo et.al. | 2502.12893 | link | 
| 2025-02-27 | The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1 | Kaiwen Zhou et.al. | 2502.12659 | null | 
| 2025-02-18 | SoK: Understanding Vulnerabilities in the Large Language Model Supply Chain | Shenao Wang et.al. | 2502.12497 | null | 
| 2025-02-18 | Computational Safety for Generative AI: A Signal Processing Perspective | Pin-Yu Chen et.al. | 2502.12445 | null | 
| 2025-05-16 | To Think or Not to Think: Exploring the Unthinking Vulnerability in Large Reasoning Models | Zihao Zhu et.al. | 2502.12202 | link | 
| 2025-05-29 | DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing | Yi Wang et.al. | 2502.11647 | null | 
| 2025-02-17 | Adversary-Aware DPO: Enhancing Safety Alignment in Vision Language Models via Adversarial Training | Fenghua Weng et.al. | 2502.11455 | null | 
| 2025-02-17 | Detecting and Filtering Unsafe Training Data via Data Attribution | Yijun Pan et.al. | 2502.11411 | null | 
| 2025-02-17 | CCJA: Context-Coherent Jailbreak Attack for Aligned Large Language Models | Guanghao Zhou et.al. | 2502.11379 | null | 
| 2025-02-18 | SafeDialBench: A Fine-Grained Safety Benchmark for Large Language Models in Multi-Turn Dialogues with Diverse Jailbreak Attacks | Hongye Cao et.al. | 2502.11090 | link | 
| 2025-05-30 | Rewrite to Jailbreak: Discover Learnable and Transferable Implicit Harmfulness Instruction | Yuting Huang et.al. | 2502.11084 | link | 
| 2025-03-11 | Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models | Zonghao Ying et.al. | 2502.11054 | link | 
| 2025-02-16 | Prompt Inject Detection with Generative Explanation as an Investigative Tool | Jonathan Pan et.al. | 2502.11006 | null | 
| 2025-06-17 | Distraction is All You Need for Multimodal Large Language Model Jailbreaking | Zuopeng Yang et.al. | 2502.10794 | null | 
| 2025-02-14 | Fast Proxies for LLM Robustness Evaluation | Tim Beyer et.al. | 2502.10487 | null | 
| 2025-02-09 | Injecting Universal Jailbreak Backdoors into LLMs in Minutes | Zhuowei Chen et.al. | 2502.10438 | link | 
| 2025-02-04 | Position: Stop Acting Like Language Model Agents Are Normal Agents | Elija Perrier et.al. | 2502.10420 | null | 
| 2025-03-06 | X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability | Xiaoya Lu et.al. | 2502.09990 | link | 
| 2025-06-05 | Jailbreak Attack Initializations as Extractors of Compliance Directions | Amit Levi et.al. | 2502.09755 | null | 
| 2025-05-26 | QueryAttack: Jailbreaking Aligned Large Language Models Using Structured Non-natural Query Language | Qingsong Zou et.al. | 2502.09723 | link | 
| 2025-05-27 | The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Analysis of Orthogonal Safety Directions | Wenbo Pan et.al. | 2502.09674 | link | 
| 2025-05-29 | Jailbreaking to Jailbreak | Jeremy Kritz et.al. | 2502.09638 | null | 
| 2025-02-13 | FLAME: Flexible LLM-Assisted Moderation Engine | Ivan Bakulin et.al. | 2502.09175 | null | 
| 2025-04-07 | MetaSC: Test-Time Safety Specification Optimization for Language Models | Víctor Gallego et.al. | 2502.07985 | link | 
| 2025-02-11 | JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation | Shenyi Zhang et.al. | 2502.07557 | link | 
| 2025-02-19 | A Frontier AI Risk Management Framework: Bridging the Gap Between Current AI Practices and Established Risk Management | Simeon Campos et.al. | 2502.06656 | null | 
| 2025-02-10 | Predictive Red Teaming: Breaking Policies Without Breaking Robots | Anirudha Majumdar et.al. | 2502.06575 | null | 
| 2025-02-11 | When Data Manipulation Meets Attack Goals: An In-depth Survey of Attacks for VLMs | Aobotao Dai et.al. | 2502.06390 | link | 
| 2025-05-27 | Towards LLM Unlearning Resilient to Relearning Attacks: A Sharpness-Aware Minimization Perspective and Beyond | Chongyu Fan et.al. | 2502.05374 | link | 
| 2025-02-05 | KDA: A Knowledge-Distilled Attacker for Generating Diverse Prompts to Jailbreak LLMs | Buyun Liang et.al. | 2502.05223 | null | 
| 2025-06-02 | Safety at Scale: A Comprehensive Survey of Large Model Safety | Xingjun Ma et.al. | 2502.05206 | link | 
| 2025-06-16 | Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions | Yik Siu Chan et.al. | 2502.04322 | link | 
| 2025-06-07 | Short-length Adversarial Training Helps LLMs Defend Long-length Jailbreak Attacks: Theoretical and Empirical Evidence | Shaopeng Fu et.al. | 2502.04204 | link | 
| 2025-05-30 | Safety Reasoning with Guidelines | Haoyu Wang et.al. | 2502.04040 | null | 
| 2025-05-17 | Understanding and Enhancing the Transferability of Jailbreaking Attacks | Runqi Lin et.al. | 2502.03052 | link | 
| 2025-02-06 | When Anti-Fraud Laws Become a Barrier to Computer Science Research | Madelyne Xiao et.al. | 2502.02767 | null | 
| 2025-06-27 | STAIR: Improving Safety Alignment with Introspective Reasoning | Yichi Zhang et.al. | 2502.02384 | link | 
| 2025-06-12 | PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling | Avery Ma et.al. | 2502.01925 | link | 
| 2025-05-26 | Firewalls to Secure Dynamic LLM Agentic Networks | Sahar Abdelnabi et.al. | 2502.01822 | null | 
| 2025-06-25 | Adversarial Reasoning at Jailbreaking Time | Mahdi Sabbaghi et.al. | 2502.01633 | link | 
| 2025-02-03 | Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models | Hashmat Shadab Malik et.al. | 2502.01576 | link | 
| 2025-04-14 | Towards Safer Chatbots: A Framework for Policy Compliance Evaluation of Custom GPTs | David Rodriguez et.al. | 2502.01436 | null | 
| 2025-02-03 | Peering Behind the Shield: Guardrail Identification in Large Language Models | Ziqing Yang et.al. | 2502.01241 | null | 
| 2025-02-03 | Eliciting Language Model Behaviors with Investigator Agents | Xiang Lisa Li et.al. | 2502.01236 | null | 
| 2025-02-03 | Jailbreaking with Universal Multi-Prompts | Yu-Ling Hsu et.al. | 2502.01154 | link | 
| 2025-06-05 | Blink of an eye: a simple theory for feature localization in generative models | Marvin Li et.al. | 2502.00921 | null | 
| 2025-04-14 | AgentBreeder: Mitigating the AI Safety Impact of Multi-Agent Scaffolds via Self-Improvement | J Rosser et.al. | 2502.00757 | link | 
| 2025-05-18 | 'Do as I say not as I do': A Semi-Automated Approach for Jailbreak Prompt Attack against Multimodal LLMs | Chun Wai Chiu et.al. | 2502.00735 | null | 
| 2025-07-10 | "I am bad": Interpreting Stealthy, Universal and Robust Audio Jailbreaks in Audio-Language Models | Isha Gupta et.al. | 2502.00718 | null | 
| 2025-02-02 | Safety Alignment Depth in Large Language Models: A Markov Chain Perspective | Ching-Chia Kao et.al. | 2502.00669 | null | 
| 2025-06-01 | LLM Safety Alignment is Divergence Estimation in Disguise | Rajdeep Haldar et.al. | 2502.00657 | link | 
| 2025-02-02 | Towards Robust Multimodal Large Language Models Against Jailbreak Attacks | Ziyi Yin et.al. | 2502.00653 | null | 
| 2025-02-01 | Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation | Stuart Armstrong et.al. | 2502.00580 | link | 
| 2025-07-08 | Agents Are All You Need for LLM Unlearning | Debdeep Sanyal et.al. | 2502.00406 | null | 
| 2025-06-30 | Riddle Me This! Stealthy Membership Inference for Retrieval-Augmented Generation | Ali Naseh et.al. | 2502.00306 | null | 
| 2025-01-31 | Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning | Xianglin Yang et.al. | 2501.19180 | null | 
| 2025-01-31 | Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming | Mrinank Sharma et.al. | 2501.18837 | null | 
| 2025-06-13 | Graph of Attacks with Pruning: Optimizing Stealthy Jailbreak Prompt Generation for Enhanced LLM Content Moderation | Daniel Schwartz et.al. | 2501.18638 | link | 
| 2025-03-04 | Towards Safe AI Clinicians: A Comprehensive Study on Large Language Model Jailbreaking in Healthcare | Hang Zhang et.al. | 2501.18632 | null | 
| 2025-01-27 | Indiana Jones: There Are Always Some Useful Ancient Relics | Junchen Ding et.al. | 2501.18628 | null | 
| 2025-05-31 | The TIP of the Iceberg: Revealing a Hidden Class of Task-in-Prompt Adversarial Attacks on LLMs | Sergey Berezin et.al. | 2501.18626 | null | 
| 2025-05-17 | Jailbreaking LLMs' Safeguard with Universal Magic Words for Text Embedding Models | Haoyu Liang et.al. | 2501.18280 | null | 
| 2025-01-29 | RICoTA: Red-teaming of In-the-wild Conversation with Test Attempts | Eujeong Choi et.al. | 2501.17715 | link | 
| 2025-01-29 | Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation | Tiansheng Huang et.al. | 2501.17433 | link | 
| 2025-01-28 | A sketch of an AI control safety case | Tomek Korbak et.al. | 2501.17315 | null | 
| 2025-01-30 | xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking | Sunbowen Lee et.al. | 2501.16727 | link | 
| 2025-01-27 | Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs | Jean-Charles Noirot Ferrand et.al. | 2501.16534 | null | 
| 2025-01-27 | Smoothed Embeddings for Robust Language Models | Ryo Hase et.al. | 2501.16497 | null | 
| 2025-01-24 | Internal Activation Revision: Safeguarding Vision Language Models Without Parameter Update | Qing Li et.al. | 2501.16378 | null | 
| 2025-01-26 | ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer | Lin Yueyu et.al. | 2501.15570 | link | 
| 2025-01-26 | Token Democracy: The Architectural Limits of Alignment in Transformer-Based Language Models | Robin Young et.al. | 2501.15446 | null | 
| 2025-01-24 | Siren: A Learning-Based Multi-Turn Attack Framework for Simulating Real-World Human Jailbreak Behaviors | Yi Zhao et.al. | 2501.14250 | link | 
| 2025-02-18 | LLMs are Vulnerable to Malicious Prompts Disguised as Scientific Language | Yubin Ge et.al. | 2501.14073 | null | 
| 2025-06-01 | Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models | Hao Cheng et.al. | 2501.13772 | null | 
| 2025-02-17 | Dagger Behind Smile: Fool LLMs with a Happy Ending Story | Xurui Song et.al. | 2501.13115 | null | 
| 2025-01-21 | You Can't Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense | Wuyuao Mai et.al. | 2501.12210 | null | 
| 2025-01-19 | Can Safety Fine-Tuning Be More Principled? Lessons Learned from Cybersecurity | David Williams-King et.al. | 2501.11183 | null | 
| 2025-03-13 | Jailbreaking Large Language Models in Infinitely Many Ways | Oliver Goldstein et.al. | 2501.10800 | null | 
| 2025-05-30 | Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks | Xin Yi et.al. | 2501.10639 | link | 
| 2024-12-17 | What Information Should Be Shared with Whom "Before and During Training"? | Haydn Belfield et.al. | 2501.10379 | null | 
| 2025-01-16 | A Survey on Responsible LLMs: Inherent Risk, Malicious Use, and Mitigation Strategy | Huandong Wang et.al. | 2501.09431 | null | 
| 2025-01-14 | Playing Devil's Advocate: Unmasking Toxicity and Vulnerabilities in Large Vision-Language Models | Abdulkadir Erol et.al. | 2501.09039 | null | 
| 2025-03-28 | SAIF: A Comprehensive Framework for Evaluating the Risks of Generative AI in the Public Sector | Kyeongryul Lee et.al. | 2501.08814 | null | 
| 2025-01-14 | Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints | Jonathan Nöther et.al. | 2501.08246 | null | 
| 2025-02-01 | Self-Instruct Few-Shot Jailbreaking: Decompose the Attack into Pattern and Behavior Learning | Jiaqi Hua et.al. | 2501.07959 | link | 
| 2025-02-02 | Gandalf the Red: Adaptive Security for LLMs | Niklas Pfister et.al. | 2501.07927 | link | 
| 2025-01-13 | Lessons From Red Teaming 100 Generative AI Products | Blake Bullwinkel et.al. | 2501.07238 | null | 
| 2025-06-27 | Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency | Shiji Zhao et.al. | 2501.04931 | null | 
| 2025-02-12 | Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense | Yang Ouyang et.al. | 2501.02629 | link | 
| 2025-01-03 | Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models | Ziwei Zheng et.al. | 2501.02029 | null | 
| 2025-01-02 | Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs | Joao Fonseca et.al. | 2501.02018 | null | 
| 2025-01-09 | Turning Logic Against Itself: Probing Model Defenses Through Contrastive Questions | Rachneet Sachdeva et.al. | 2501.01872 | link | 
| 2025-01-03 | Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models | Yanjiang Liu et.al. | 2501.01830 | null | 
| 2025-04-29 | WeAudit: Scaffolding User Auditors and AI Practitioners in Auditing Generative AI | Wesley Hanwen Deng et.al. | 2501.01397 | null | 
| 2025-01-02 | CySecBench: Generative AI-based CyberSecurity-focused Prompt Dataset for Benchmarking Large Language Models | Johan Wahréus et.al. | 2501.01335 | link | 
| 2024-12-29 | Adversarial Negotiation Dynamics in Generative Language Models | Arinbjörn Kolbeinsson et.al. | 2501.00069 | null | 
| 2024-12-28 | LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models | Miao Yu et.al. | 2501.00055 | link | 
| 2025-02-06 | InfAlign: Inference-aware language model alignment | Ananth Balashankar et.al. | 2412.19792 | null | 
| 2024-12-24 | Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning | Alex Beutel et.al. | 2412.18693 | null | 
| 2024-12-25 | Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for Large Language Models | Xiaomeng Hu et.al. | 2412.18171 | null | 
| 2024-12-23 | Retention Score: Quantifying Jailbreak Risks for Vision Language Models | Zaitang Li et.al. | 2412.17544 | null | 
| 2025-01-05 | DiffusionAttacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak | Hao Wang et.al. | 2412.17522 | null | 
| 2025-05-21 | Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models | Lang Gao et.al. | 2412.17034 | null | 
| 2024-12-22 | Robustness of Large Language Models Against Adversarial Attacks | Yiyi Tao et.al. | 2412.17011 | null | 
| 2024-12-21 | OpenAI o1 System Card | OpenAI et.al. | 2412.16720 | null | 
| 2025-02-10 | POEX: Understanding and Mitigating Policy Executable Jailbreak Attacks against Embodied AI | Xuancun Lu et.al. | 2412.16633 | null | 
| 2025-05-29 | Divide and Conquer: A Hybrid Strategy Defeats Multimodal Large Language Models | Yanxu Mao et.al. | 2412.16555 | null | 
| 2025-01-08 | Deliberative Alignment: Reasoning Enables Safer Language Models | Melody Y. Guan et.al. | 2412.16339 | null | 
| 2024-12-20 | Logical Consistency of Large Language Models in Fact-checking | Bishwamittra Ghosh et.al. | 2412.16100 | null | 
| 2024-12-20 | JailPO: A Novel Black-box Jailbreak Framework via Preference Optimization against Aligned LLMs | Hongyi Li et.al. | 2412.15623 | null | 
| 2025-06-09 | SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage | Xiaoning Dong et.al. | 2412.15289 | link | 
| 2025-03-04 | Toxicity Detection towards Adaptability to Changing Perturbations | Hankun Kang et.al. | 2412.15267 | null | 
| 2024-12-18 | Evaluation of LLM Vulnerabilities to Being Misused for Personalized Disinformation Generation | Aneta Zugecova et.al. | 2412.13666 | null | 
| 2024-12-17 | Concept-ROT: Poisoning Concepts in Large Language Models with Model Editing | Keltin Grimes et.al. | 2412.13341 | link | 
| 2024-12-17 | Jailbreaking? One Step Is Enough! | Weixiong Zheng et.al. | 2412.12621 | null | 
| 2025-03-19 | Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols? | Alex Mallen et.al. | 2412.12480 | null | 
| 2024-12-13 | No Free Lunch for Defending Against Prefilling Attack by In-Context Learning | Zhiyu Xue et.al. | 2412.12192 | null | 
| 2025-02-22 | Na'vi or Knave: Jailbreaking Language Models via Metaphorical Avatars | Yu Yan et.al. | 2412.12145 | null | 
| 2024-12-15 | SpearBot: Leveraging Large Language Models in a Generative-Critique Framework for Spear-Phishing Email Generation | Qinglin Qi et.al. | 2412.11109 | null | 
| 2025-05-26 | Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models | Di Wu et.al. | 2412.11041 | link | 
| 2024-12-14 | IntelEX: A LLM-driven Attack-level Threat Intelligence Extraction Framework | Ming Xu et.al. | 2412.10872 | null | 
| 2025-06-12 | Towards Action Hijacking of Large Language Model-based Agent | Yuyang Zhang et.al. | 2412.10807 | null | 
| 2025-04-14 | Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM | Shaoqing Zhang et.al. | 2412.10423 | link | 
| 2024-12-13 | AdvPrefix: An Objective for Nuanced LLM Jailbreaks | Sicheng Zhu et.al. | 2412.10321 | link | 
| 2025-04-03 | AI red-teaming is a sociotechnical challenge: on values, labor, and harms | Tarleton Gillespie et.al. | 2412.09751 | null | 
| 2025-02-08 | Obfuscated Activations Bypass LLM Latent-Space Defenses | Luke Bailey et.al. | 2412.09565 | null | 
| 2024-12-16 | Exploiting the Index Gradients for Optimization-Based Jailbreaking on Large Language Models | Jiahui Li et.al. | 2412.08615 | link | 
| 2024-12-11 | AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models | Mintong Kang et.al. | 2412.08608 | null | 
| 2024-12-11 | Model-Editing-Based Jailbreak against Safety-aligned Large Language Models | Yuxi Li et.al. | 2412.08201 | null | 
| 2024-12-11 | Antelope: Potent and Concealed Jailbreak Attack Strategy | Xin Zhao et.al. | 2412.08156 | null | 
| 2025-03-31 | Evil twins are not that evil: Qualitative insights into machine-generated prompts | Nathanaël Carraz Rakotonirina et.al. | 2412.08127 | null | 
| 2024-12-16 | Granite Guardian | Inkit Padhi et.al. | 2412.07724 | link | 
| 2024-12-10 | FlexLLM: Exploring LLM Customization for Moving Target Defense on Black-Box LLMs Against Jailbreak Attacks | Bocheng Chen et.al. | 2412.07672 | null | 
| 2025-03-17 | TraSCE: Trajectory Steering for Concept Erasure | Anubhav Jain et.al. | 2412.07658 | link | 
| 2025-06-10 | PrisonBreak: Jailbreaking Large Language Models with Fewer Than Twenty-Five Targeted Bit-flips | Zachary Coalson et.al. | 2412.07192 | null | 
| 2024-11-03 | Poison Attacks and Adversarial Prompts Against an Informed University Virtual Assistant | Ivan A. Fernandez et.al. | 2412.06788 | null | 
| 2024-12-09 | Enhancing Adversarial Resistance in LLMs with Recursion | Bryan Li et.al. | 2412.06181 | null | 
| 2025-01-03 | Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models | Ma Teng et.al. | 2412.05934 | link | 
| 2025-02-03 | PBI-Attack: Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for Toxicity Maximization | Ruoxi Cheng et.al. | 2412.05892 | null | 
| 2024-12-07 | PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage | Yuzhou Nie et.al. | 2412.05734 | link | 
| 2024-12-06 | BadGPT-4o: stripping safety finetuning from GPT models | Ekaterina Krupkina et.al. | 2412.05346 | null | 
| 2025-07-03 | LIAR: Leveraging Inference Time Alignment (Best-of-N) to Jailbreak LLMs in Seconds | James Beetham et.al. | 2412.05232 | null | 
| 2024-12-19 | Best-of-N Jailbreaking | John Hughes et.al. | 2412.03556 | link | 
| 2025-03-25 | Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts? | Sravanti Addepalli et.al. | 2412.03235 | null | 
| 2024-12-03 | Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach | Tony T. Wang et.al. | 2412.02159 | null | 
| 2025-06-30 | Trust & Safety of LLMs and LLMs in Trust & Safety | Doohee You et.al. | 2412.02113 | null | 
| 2024-12-02 | Improved Large Language Model Jailbreak Detection via Pretrained Embeddings | Erick Galinkin et.al. | 2412.01547 | null | 
| 2025-06-18 | Jailbreak Large Vision-Language Models Through Multi-Modal Linkage | Yu Wang et.al. | 2412.00473 | link | 
| 2024-11-30 | Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models | Sanghyun Kim et.al. | 2412.00357 | null | 
| 2024-12-19 | PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning | Shenghui Li et.al. | 2411.19335 | null | 
| 2025-03-09 | DIESEL -- Dynamic Inference-Guidance via Evasion of Semantic Embeddings in LLMs | Ben Ganon et.al. | 2411.19038 | null | 
| 2025-06-14 | Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment | Soumya Suvra Ghosal et.al. | 2411.18688 | null | 
| 2025-02-10 | Embodied Red Teaming for Auditing Robotic Foundation Models | Sathwik Karnik et.al. | 2411.18676 | null | 
| 2024-11-28 | Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models | Shuyang Hao et.al. | 2411.18000 | null | 
| 2024-11-26 | Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats | Jiaxin Wen et.al. | 2411.17693 | null | 
| 2025-01-14 | Don't Command, Cultivate: An Exploratory Study of System-2 Alignment | Yuhang Wang et.al. | 2411.17075 | link | 
| 2025-02-12 | In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models | Zhi-Yi Chin et.al. | 2411.16769 | null | 
| 2024-11-23 | ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain | Haochen Zhao et.al. | 2411.16736 | null | 
| 2025-03-20 | "Moralized" Multi-Step Jailbreak Prompts: Black-Box Testing of Guardrails in Large Language Models for Verbal Attacks | Libo Wang et.al. | 2411.16730 | link | 
| 2025-05-01 | Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks | Han Wang et.al. | 2411.16721 | link | 
| 2024-11-25 | Preventing Jailbreak Prompts as Malicious Tools for Cybercriminals: A Cyber Defense Perspective | Jean Marie Tshimula et.al. | 2411.16642 | null | 
| 2024-11-22 | Universal and Context-Independent Triggers for Precise Control of LLM Outputs | Jiashuo Liang et.al. | 2411.14738 | null | 
| 2024-11-21 | Global Challenge for Safe and Secure LLMs Track 1 | Xiaojun Jia et.al. | 2411.14502 | null | 
| 2025-06-25 | GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs | Advik Raj Basani et.al. | 2411.14133 | link | 
| 2025-04-09 | A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection | Gabriel Chua et.al. | 2411.12946 | link | 
| 2024-11-27 | Playing Language Game with LLMs Leads to Jailbreaking | Yu Peng et.al. | 2411.12762 | null | 
| 2025-04-12 | Conceptwm: A Diffusion Model Watermark for Concept Protection | Liangqi Lei et.al. | 2411.11688 | null | 
| 2024-12-08 | TrojanRobot: Backdoor Attacks Against LLM-based Embodied Robots in the Physical World | Xianlong Wang et.al. | 2411.11683 | null | 
| 2024-11-28 | Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models | Chenhang Cui et.al. | 2411.11496 | link | 
| 2024-11-18 | The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models | Xikang Yang et.al. | 2411.11407 | link | 
| 2025-05-22 | Steering Language Model Refusal with Sparse Autoencoders | Kyle O'Brien et.al. | 2411.11296 | null | 
| 2025-04-24 | JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit | Zeqing He et.al. | 2411.11114 | null | 
| 2024-12-09 | Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey | Xuannan Liu et.al. | 2411.09259 | link | 
| 2024-11-14 | DROJ: A Prompt-Driven Attack against Large Language Models | Leyang Hu et.al. | 2411.09125 | link | 
| 2024-11-13 | LLMStinger: Jailbreaking LLMs using RL fine-tuned LLMs | Piyush Jha et.al. | 2411.08862 | null | 
| 2025-03-06 | The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense | Yangyang Guo et.al. | 2411.08410 | null | 
| 2024-11-12 | Zer0-Jack: A Memory-efficient Gradient-based Jailbreaking Method for Black-box Multi-modal Large Language Models | Tiejin Chen et.al. | 2411.07559 | null | 
| 2024-11-12 | Rapid Response: Mitigating LLM Jailbreaks with a Few Examples | Alwin Peng et.al. | 2411.07494 | null | 
| 2024-11-11 | HarmLevelBench: Evaluating Harm-Level Compliance and the Impact of Quantization on Model Alignment | Yannis Belkhiter et.al. | 2411.06835 | null | 
| 2025-05-28 | SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains | Bijoy Ahmed Saiem et.al. | 2411.06426 | null | 
| 2025-05-11 | Diversity Helps Jailbreak Large Language Models | Weiliang Zhao et.al. | 2411.04223 | null | 
| 2025-01-07 | MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue | Fengxiang Wang et.al. | 2411.03814 | null | 
| 2025-05-14 | What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks | Nathalie Kirch et.al. | 2411.03343 | link | 
| 2024-12-05 | Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment | Jason Vega et.al. | 2411.02785 | link | 
| 2025-01-31 | UniGuard: Towards Universal Safety Guardrails for Jailbreak Attacks on Multimodal Large Language Models | Sejoon Oh et.al. | 2411.01703 | null | 
| 2025-05-21 | SQL Injection Jailbreak: A Structural Disaster of Large Language Models | Jiawei Zhao et.al. | 2411.01565 | link | 
| 2024-11-03 | AURA: Amplifying Understanding, Resilience, and Awareness for Responsible AI Content Work | Alice Qian Zhang et.al. | 2411.01426 | null | 
| 2024-12-11 | Plentiful Jailbreaks with String Compositions | Brian R. Y. Huang et.al. | 2411.01084 | null | 
| 2025-07-11 | Emoji Attack: Enhancing Jailbreak Attacks Against Judge LLM Detection | Zhipeng Wei et.al. | 2411.01077 | link | 
| 2025-03-08 | IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves | Ruofan Wang et.al. | 2411.00827 | null | 
| 2024-11-26 | Desert Camels and Oil Sheikhs: Arab-Centric Red Teaming of Frontier LLMs | Muhammed Saeed et.al. | 2410.24049 | null | 
| 2024-10-31 | Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models | Hao Yang et.al. | 2410.23861 | link | 
| 2025-05-17 | Adversarial Attacks of Vision Tasks in the Past 10 Years: A Survey | Chiyu Zhang et.al. | 2410.23687 | null | 
| 2024-11-27 | Transferable Ensemble Black-box Jailbreak Attacks on Large Language Models | Yiqi Yang et.al. | 2410.23558 | null | 
| 2024-10-30 | ProTransformer: Robustify Transformers via Plug-and-Play Paradigm | Zhichao Hou et.al. | 2410.23182 | link | 
| 2024-10-29 | Benchmarking LLM Guardrails in Handling Multilingual Toxicity | Yahan Yang et.al. | 2410.22153 | null | 
| 2024-10-29 | AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts | Vishal Kumar et.al. | 2410.22143 | null | 
| 2024-10-29 | SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types | Yutao Mou et.al. | 2410.21965 | link | 
| 2025-03-06 | Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring | Honglin Mu et.al. | 2410.21083 | null | 
| 2025-02-12 | BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks | Yunhan Zhao et.al. | 2410.20971 | null | 
| 2024-10-25 | RobustKV: Defending Large Language Models against Jailbreak Attacks via KV Eviction | Tanqiu Jiang et.al. | 2410.19937 | null | 
| 2025-06-15 | An Auditing Test To Detect Behavioral Shift in Language Models | Leo Richter et.al. | 2410.19406 | link | 
| 2025-06-23 | ADVLLM: Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities | Chung-En Sun et.al. | 2410.18469 | link | 
| 2025-02-27 | Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks | Samuele Poppi et.al. | 2410.18210 | null | 
| 2025-02-09 | Dynamic Guided and Domain Applicable Safeguards for Enhanced Security in Large Language Models | Weidi Luo et.al. | 2410.17922 | link | 
| 2025-05-31 | AdvAgent: Controllable Blackbox Red-teaming on Web Agents | Chejian Xu et.al. | 2410.17401 | null | 
| 2024-10-22 | LLM-Assisted Red Teaming of Diffusion Models through "Failures Are Fated, But Can Be Faded" | Som Sagar et.al. | 2410.16738 | null | 
| 2024-11-02 | Bayesian scaling laws for in-context learning | Aryaman Arora et.al. | 2410.16531 | link | 
| 2024-11-16 | Insights and Current Gaps in Open-Source LLM Vulnerability Scanners: A Comparative Analysis | Jonathan Brokman et.al. | 2410.16527 | null | 
| 2025-07-08 | Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs | Rui Pu et.al. | 2410.16327 | null | 
| 2025-05-30 | An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks | Valentyn Boreiko et.al. | 2410.16222 | link | 
| 2025-06-26 | A Troublemaker with Contagious Jailbreak Makes Chaos in Honest Towns | Tianyi Men et.al. | 2410.16155 | null | 
| 2024-11-03 | Boosting Jailbreak Transferability for Large Language Models | Hanqing Liu et.al. | 2410.15645 | link | 
| 2024-10-21 | SMILES-Prompting: A Novel Approach to LLM Jailbreak Attacks in Chemical Synthesis | Aidan Wong et.al. | 2410.15641 | link | 
| 2024-10-20 | Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models | Xiao Li et.al. | 2410.15362 | null | 
| 2025-05-08 | Jailbreaking and Mitigation of Vulnerabilities in Large Language Models | Benji Peng et.al. | 2410.15236 | null | 
| 2024-10-16 | SoK: Prompt Hacking of Large Language Models | Baha Rababah et.al. | 2410.13901 | null | 
| 2024-10-15 | A Formal Framework for Assessing and Mitigating Emergent Security Risks in Generative AI Models: Bridging Theory and Dynamic Risk Mitigation | Aviral Srivastava et.al. | 2410.13897 | null | 
| 2024-10-21 | Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents | Priyanshu Kumar et.al. | 2410.13886 | link | 
| 2024-10-17 | PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment | Zekun Moore Wang et.al. | 2410.13785 | null | 
| 2024-10-17 | Persistent Pre-Training Poisoning of LLMs | Yiming Zhang et.al. | 2410.13722 | null | 
| 2024-11-09 | Jailbreaking LLM-Controlled Robots | Alexander Robey et.al. | 2410.13691 | null | 
| 2025-01-02 | BiasJailbreak: Analyzing Ethical Biases and Jailbreak Vulnerabilities in Large Language Models | Isack Lee et.al. | 2410.13334 | link | 
| 2024-10-17 | SPIN: Self-Supervised Prompt INjection | Leon Zhou et.al. | 2410.13236 | null | 
| 2024-10-18 | JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework | Fan Liu et.al. | 2410.12855 | null | 
| 2024-10-19 | Multi-round jailbreak attack on large language models | Yihua Zhou et.al. | 2410.11533 | null | 
| 2024-10-15 | Jigsaw Puzzles: Splitting Harmful Questions to Jailbreak Large Language Models | Hao Yang et.al. | 2410.11459 | link | 
| 2025-01-20 | Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation | Qizhang Li et.al. | 2410.11317 | link | 
| 2025-06-05 | AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment | Pankayaraj Pathmanathan et.al. | 2410.11283 | null | 
| 2024-10-15 | Cognitive Overload Attack: Prompt Injection for Long Context | Bibek Upadhayay et.al. | 2410.11272 | link | 
| 2025-05-25 | LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts | Qibing Ren et.al. | 2410.10700 | link | 
| 2025-02-23 | On Calibration of LLM-based Guard Models for Reliable Content Moderation | Hongfu Liu et.al. | 2410.10414 | link | 
| 2024-10-14 | Jailbreak Instruction-Tuned LLMs via end-of-sentence MLP Re-weighting | Yifan Luo et.al. | 2410.10150 | null | 
| 2024-11-27 | BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models | Xinyuan Wang et.al. | 2410.09804 | null | 
| 2024-10-18 | VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment | Lei Li et.al. | 2410.09421 | null | 
| 2024-12-17 | Recent advancements in LLM Red-Teaming: Techniques, Defenses, and Ethical Considerations | Tarun Raheja et.al. | 2410.09097 | null | 
| 2024-10-11 | AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation | Zijun Wang et.al. | 2410.09040 | link | 
| 2025-04-18 | AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents | Maksym Andriushchenko et.al. | 2410.09024 | null | 
| 2024-11-29 | RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process | Peiran Wang et.al. | 2410.08660 | null | 
| 2025-06-18 | Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level | Xinyi Zeng et.al. | 2410.06809 | null | 
| 2024-10-04 | Developing Assurance Cases for Adversarial Robustness and Regulatory Compliance in LLMs | Tomas Bueno Momcilovic et.al. | 2410.05304 | null | 
| 2025-04-22 | AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs | Xiaogeng Liu et.al. | 2410.05295 | link | 
| 2024-10-06 | Attention Shift: Steering AI Away from Unsafe Content | Shivank Garg et.al. | 2410.04447 | null | 
| 2025-02-16 | Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks | Zi Wang et.al. | 2410.04234 | null | 
| 2024-10-05 | Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models | Yiting Dong et.al. | 2410.04190 | null | 
| 2025-06-03 | Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step | Wenxuan Wang et.al. | 2410.03869 | null | 
| 2024-10-08 | You Know What I'm Saying: Jailbreak Attack via Implicit Reference | Tianyu Wu et.al. | 2410.03857 | link | 
| 2024-12-16 | SciSafeEval: A Comprehensive Benchmark for Safety Alignment of Large Language Models in Scientific Tasks | Tianhao Li et.al. | 2410.03769 | null | 
| 2024-10-23 | Gradient-based Jailbreak Images for Multimodal Fusion Models | Javier Rando et.al. | 2410.03489 | link | 
| 2025-04-09 | LLM Safeguard is a Double-Edged Sword: Exploiting False Positives for Denial-of-Service Attacks | Qingzhao Zhang et.al. | 2410.02916 | null | 
| 2024-10-02 | FlipAttack: Jailbreak LLMs via Flipping | Yue Liu et.al. | 2410.02832 | link | 
| 2024-10-01 | PyRIT: A Framework for Security Risk Identification and Red Teaming in Generative AI System | Gary D. Lopez Munoz et.al. | 2410.02828 | link | 
| 2024-10-03 | SteerDiff: Steering towards Safe Text-to-Image Diffusion Models | Hongxiang Zhang et.al. | 2410.02710 | null | 
| 2025-06-03 | CulturalBench: A Robust, Diverse, and Challenging Cultural Benchmark by Human-AI CulturalTeaming | Yu Ying Chiu et.al. | 2410.02677 | null | 
| 2025-02-07 | Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models | Guobin Shen et.al. | 2410.02298 | null | 
| 2025-02-18 | Data to Defense: The Role of Curation in Customizing LLMs Against Jailbreaking Attacks | Xiaoqun Liu et.al. | 2410.02220 | null | 
| 2025-03-06 | Adversarial Decoding: Generating Readable Documents for Adversarial Objectives | Collin Zhang et.al. | 2410.02163 | link | 
| 2024-10-02 | Automated Red Teaming with GOAT: the Generative Offensive Agent Tester | Maya Pavlova et.al. | 2410.01606 | null | 
| 2025-02-24 | HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models | Seanie Lee et.al. | 2410.01524 | link | 
| 2025-02-02 | The Great Contradiction Showdown: How Jailbreak and Stealth Wrestle in Vision-Language Models? | Ching-Chia Kao et.al. | 2410.01438 | null | 
| 2025-05-10 | Endless Jailbreaks with Bijection Learning | Brian R. Y. Huang et.al. | 2410.01294 | null | 
| 2024-12-19 | Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models | Wei Zhao et.al. | 2410.00451 | link | 
| 2024-09-29 | Survey of Security and Data Attacks on Machine Unlearning In Financial and E-Commerce | Carl E. J. Brodzinski et.al. | 2410.00055 | null | 
| 2025-03-20 | Robust LLM safeguarding via refusal feature adversarial training | Lei Yu et.al. | 2409.20089 | null | 
| 2024-09-28 | Overriding Safety protections of Open-source Models | Sachin Kumar et.al. | 2409.19476 | link | 
| 2024-09-27 | HM3: Heterogeneous Multi-Class Model Merging | Stefan Hackmann et.al. | 2409.19173 | null | 
| 2025-06-10 | Multimodal Pragmatic Jailbreak on Text-to-image Models | Tong Liu et.al. | 2409.19149 | null | 
| 2025-05-31 | An Adversarial Perspective on Machine Unlearning for AI Safety | Jakub Łucki et.al. | 2409.18025 | link | 
| 2024-10-04 | MoJE: Mixture of Jailbreak Experts, Naive Tabular Classifiers as Guard for Prompt Attacks | Giandomenico Cornacchia et.al. | 2409.17699 | null | 
| 2025-06-07 | RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking | Yifan Jiang et.al. | 2409.17458 | link | 
| 2024-09-25 | Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction | Jinchuan Zhang et.al. | 2409.16783 | link | 
| 2024-09-25 | RoleBreak: Character Hallucination as a Jailbreak Attack in Role-Playing Systems | Yihong Tang et.al. | 2409.16727 | null | 
| 2024-09-23 | Attack Atlas: A Practitioner's Perspective on Challenges and Pitfalls in Red Teaming GenAI | Ambrish Rawat et.al. | 2409.15398 | null | 
| 2024-09-18 | Multitask Mayhem: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning | Essa Jan et.al. | 2409.15361 | null | 
| 2025-03-03 | PAPILLON: Efficient and Stealthy Fuzz Testing-Powered Jailbreaks for LLMs | Xueluan Gong et.al. | 2409.14866 | link | 
| 2024-10-03 | PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach | Zhihao Lin et.al. | 2409.14177 | null | 
| 2024-10-29 | Towards Safe Multilingual Frontier AI | Artūrs Kanepajs et.al. | 2409.13708 | link | 
| 2024-11-05 | Jailbreaking Large Language Models with Symbolic Mathematics | Emet Bethany et.al. | 2409.11445 | null | 
| 2024-09-17 | Hackphyr: A Local Fine-Tuned LLM Agent for Network Security Environments | Maria Rigaki et.al. | 2409.11276 | null | 
| 2024-09-14 | What Is Wrong with My Model? Identifying Systematic Problems with Semantic Data Slicing | Chenyang Yang et.al. | 2409.09261 | link | 
| 2024-09-27 | Multi-Robot Coordination Induced in an Adversarial Graph-Traversal Game | James Berneburg et.al. | 2409.08222 | null | 
| 2024-10-19 | Securing Large Language Models: Addressing Bias, Misinformation, and Prompt Attacks | Benji Peng et.al. | 2409.08087 | null | 
| 2024-09-12 | Unleashing Worms and Extracting Data: Escalating the Outcome of Attacks against RAG-based Inference in Scale and Severity Using Jailbreaking | Stav Cohen et.al. | 2409.08045 | link | 
| 2024-09-12 | Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols | Charlie Griffin et.al. | 2409.07985 | null | 
| 2024-09-11 | AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs | Lijia Lv et.al. | 2409.07503 | link | 
| 2024-09-11 | Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks | Md Zarif Hossain et.al. | 2409.07353 | null | 
| 2025-07-13 | DiPT: Enhancing LLM reasoning through diversified perspective-taking | Hoang Anh Just et.al. | 2409.06241 | null | 
| 2024-09-07 | Exploring Straightforward Conversational Red-Teaming | George Kour et.al. | 2409.04822 | null | 
| 2025-06-09 | HSF: Defending against Jailbreak Attacks with Hidden State Filtering | Cheng Qian et.al. | 2409.03788 | null | 
| 2024-11-29 | Conversational Complexity for Assessing Risk in Large Language Models | John Burden et.al. | 2409.01247 | null | 
| 2025-06-11 | Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models | Bang An et.al. | 2409.00598 | link | 
| 2024-08-31 | Testing and Evaluation of Large Language Models: Correctness, Non-Toxicity, and Fairness | Wenxuan Wang et.al. | 2409.00551 | null | 
| 2025-03-14 | PrivacyLens: Evaluating Privacy Norm Awareness of Language Models in Action | Yijia Shao et.al. | 2409.00138 | link | 
| 2024-08-29 | Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks | Tom Gibbs et.al. | 2409.00137 | null | 
| 2024-11-07 | FRACTURED-SORRY-Bench: Framework for Revealing Attacks in Conversational Turns Undermining Refusal Efficacy and Defenses over SORRY-Bench (Automated Multi-shot Jailbreaks) | Aman Priyanshu et.al. | 2408.16163 | null | 
| 2024-08-28 | Red Team Redemption: A Structured Comparison of Open-Source Tools for Adversary Emulation | Max Landauer et.al. | 2408.15645 | null | 
| 2024-09-05 | Legilimens: Practical and Unified Content Moderation for Large Language Model Services | Jialin Wu et.al. | 2408.15488 | link | 
| 2024-09-04 | LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet | Nathaniel Li et.al. | 2408.15221 | null | 
| 2025-04-01 | Understanding the Effectiveness of Coverage Criteria for Large Language Models: A Special Angle from Jailbreak Attacks | Shide Zhou et.al. | 2408.15207 | null | 
| 2024-10-05 | Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models | Hongfu Liu et.al. | 2408.14866 | link | 
| 2025-02-16 | Atoxia: Red-teaming Large Language Models with Target Toxic Answers | Yuhao Du et.al. | 2408.14853 | null | 
| 2024-12-15 | HTS-Attack: Heuristic Token Search for Jailbreaking Text-to-Image Models | Sensen Gao et.al. | 2408.13896 | null | 
| 2024-08-14 | SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming | Anurakt Kumar et.al. | 2408.11851 | null | 
| 2024-09-14 | Efficient Detection of Toxic Prompts in Large Language Models | Yi Liu et.al. | 2408.11727 | null | 
| 2025-04-02 | An Optimizable Suffix Is Worth A Thousand Templates: Efficient Black-box Jailbreaking without Affirmative Phrases via LLM as Optimizer | Weipeng Jiang et.al. | 2408.11313 | link | 
| 2024-08-21 | EEG-Defender: Defending against Jailbreak through Early Exit Generation of Large Language Models | Chongwen Zhao et.al. | 2408.11308 | null | 
| 2025-02-07 | Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Carrier Articles | Zhilong Wang et.al. | 2408.11182 | null | 
| 2025-02-06 | DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization | Pucheng Dang et.al. | 2408.11071 | null | 
| 2025-01-02 | Security Attacks on LLM-based Code Completion Tools | Wen Cheng et.al. | 2408.11006 | link | 
| 2025-02-09 | Perception-guided Jailbreak against Text-to-Image Models | Yihao Huang et.al. | 2408.10848 | null | 
| 2024-08-20 | Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique | Tej Deep Pala et.al. | 2408.10701 | link | 
| 2024-08-20 | Towards Robust Knowledge Unlearning: An Adversarial Framework for Assessing and Improving Unlearning Robustness in Large Language Models | Hongbang Yuan et.al. | 2408.10682 | null | 
| 2024-08-26 | Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation | Haoyu Wang et.al. | 2408.10668 | null | 
| 2024-08-18 | Characterizing and Evaluating the Reliability of LLMs against Jailbreak Attacks | Kexin Chen et.al. | 2408.09326 | null | 
| 2025-04-22 | BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger | Yulin Chen et.al. | 2408.09093 | null | 
| 2024-08-22 | Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks | Jiawei Zhao et.al. | 2408.08924 | link | 
| 2024-08-11 | Kov: Transferable and Naturalistic Black-Box LLM Attacks using Markov Decision Processes and Tree Search | Robert J. Moss et.al. | 2408.08899 | link | 
| 2024-10-22 | | Fenghua Weng et.al. | 2408.08464 | link | 
| 2024-12-19 | Alignment-Enhanced Decoding: Defending via Token-Level Adaptive Refining of Probability Distributions | Quan Liu et.al. | 2408.07663 | link | 
| 2025-02-06 | On Effects of Steering Latent Representation for Large Language Model Unlearning | Dang Huu-Tien et.al. | 2408.06223 | link | 
| 2024-08-09 | A Jailbroken GenAI Model Can Cause Substantial Harm: GenAI-powered Applications are Vulnerable to PromptWares | Stav Cohen et.al. | 2408.05061 | link | 
| 2025-03-25 | h4rm3l: A language for Composable Jailbreak Attack Synthesis | Moussa Koulako Bala Doumbouya et.al. | 2408.04811 | null | 
| 2024-08-08 | Multi-Turn Context Jailbreak Attack on Large Language Models From First Principles | Xiongtao Sun et.al. | 2408.04686 | null | 
| 2024-08-08 | Compromesso! Italian Many-Shot Jailbreaks Undermine the Safety of Large Language Models | Fabio Pernisi et.al. | 2408.04522 | null | 
| 2024-08-07 | EnJa: Ensemble Jailbreak on Large Language Models | Jiahao Zhang et.al. | 2408.03603 | null | 
| 2025-07-17 | Scaling Trends for Data Poisoning in LLMs | Dillon Bowen et.al. | 2408.02946 | link | 
| 2024-08-05 | Can Reinforcement Learning Unlock the Hidden Dangers in Aligned Large Language Models? | Mohammad Bahrami Karkevandi et.al. | 2408.02651 | null | 
| 2024-12-23 | SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models | Muxi Diao et.al. | 2408.02632 | null | 
| 2024-08-02 | Mission Impossible: A Statistical Perspective on Jailbreaking LLMs | Jingtong Su et.al. | 2408.01420 | null | 
| 2024-08-01 | WHITE PAPER: A Brief Exploration of Data Exfiltration using GCG Suffixes | Victor Valbuena et.al. | 2408.00925 | null | 
| 2025-02-10 | Tamper-Resistant Safeguards for Open-Weight LLMs | Rishub Tamirisa et.al. | 2408.00761 | link | 
| 2025-06-24 | Fuzz-Testing Meets LLM-Based Agents: An Automated and Efficient Framework for Jailbreaking Text-To-Image Generation Models | Yingkai Dong et.al. | 2408.00523 | null | 
| 2024-10-17 | Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models | Yue Xu et.al. | 2407.21659 | link | 
| 2025-01-16 | Direct Unlearning Optimization for Robust and Safe Text-to-Image Models | Yong-Hyun Park et.al. | 2407.21035 | null | 
| 2025-02-04 | BadRobot: Jailbreaking Embodied LLMs in the Physical World | Hangtao Zhang et.al. | 2407.20242 | null | 
| 2025-06-05 | Scaling Trends in Language Model Robustness | Nikolaus Howe et.al. | 2407.18213 | link | 
| 2024-12-24 | The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models | Zihui Wu et.al. | 2407.17915 | link | 
| 2024-10-01 | FLRT: Fluent Student-Teacher Redteaming | T. Ben Thompson et.al. | 2407.17447 | link | 
| 2024-10-07 | Revisiting Who's Harry Potter: Towards Targeted Unlearning from a Causal Intervention Perspective | Yujian Liu et.al. | 2407.16997 | link | 
| 2024-12-31 | From Sands to Mansions: Simulating Full Attack Chain with LLM-Organized Knowledge | Lingzhi Wang et.al. | 2407.16928 | null | 
| 2024-08-23 | Can Large Language Models Automatically Jailbreak GPT-4V? | Yuanwei Wu et.al. | 2407.16686 | null | 
| 2024-07-23 | RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent | Huiyu Xu et.al. | 2407.16667 | null | 
| 2024-10-26 | Course-Correction: Safety Alignment Using Synthetic Preferences | Rongwu Xu et.al. | 2407.16637 | link | 
| 2024-07-23 | PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing | Blazej Manczak et.al. | 2407.16318 | link | 
| 2025-06-18 | LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models | Shi Lin et.al. | 2407.16205 | link | 
| 2024-07-26 | Breaking the Global North Stereotype: A Global South-centric Benchmark Dataset for Auditing and Mitigating Biases in Facial Recognition Systems | Siddharth D Jaiswal et.al. | 2407.15810 | null | 
| 2024-08-21 | Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs | Abhay Sheshadri et.al. | 2407.15549 | link | 
| 2024-12-16 | Failures to Find Transferable Image Jailbreaks Between Vision-Language Models | Rylan Schaeffer et.al. | 2407.15211 | null | 
| 2024-07-21 | Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts | Yi Liu et.al. | 2407.15050 | null | 
| 2024-07-23 | RogueGPT: dis-ethical tuning transforms ChatGPT4 into a Rogue AI in 158 Words | Alessio Buscemi et.al. | 2407.15009 | null | 
| 2025-07-10 | Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs) | Apurv Verma et.al. | 2407.14937 | link | 
| 2024-08-23 | Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle | Emman Haider et.al. | 2407.13833 | null | 
| 2024-07-16 | Continuous Embedding Attacks via Clipped Inputs in Jailbreaking Large Language Models | Zihao Xu et.al. | 2407.13796 | link | 
| 2024-07-18 | LLMs as Function Approximators: Terminology, Taxonomy, and Questions for Evaluation | David Schlangen et.al. | 2407.13744 | null | 
| 2024-07-17 | AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases | Zhaorun Chen et.al. | 2407.12784 | link | 
| 2024-10-28 | Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models | Chao Gong et.al. | 2407.12383 | link | 
| 2024-07-17 | The Better Angels of Machine Personality: How Personality Relates to LLM Safety | Jie Zhang et.al. | 2407.12344 | link | 
| 2025-04-17 | Does Refusal Training in LLMs Generalize to the Past Tense? | Maksym Andriushchenko et.al. | 2407.11969 | link | 
| 2024-08-21 | What Makes and Breaks Safety Fine-tuning? A Mechanistic Study | Samyak Jain et.al. | 2407.10264 | null | 
| 2024-07-13 | MOAT: Securely Mitigating Rowhammer with Per-Row Activation Counters | Moinuddin Qureshi et.al. | 2407.09995 | null | 
| 2025-05-28 | ASTPrompter: Preference-Aligned Automated Language Model Red-Teaming to Generate Low-Perplexity Unsafe Prompts | Amelia F. Hardy et.al. | 2407.09447 | link | 
| 2025-01-27 | Self-interpreting Adversarial Images | Tingwei Zhang et.al. | 2407.08970 | link | 
| 2025-02-11 | Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing | Huanqian Wang et.al. | 2407.08770 | link | 
| 2025-02-13 | Are Large Language Models Really Bias-Free? Jailbreak Prompts for Assessing Adversarial Robustness to Bias Elicitation | Riccardo Cantini et.al. | 2407.08441 | link | 
| 2024-09-11 | The Human Factor in AI Red Teaming: Perspectives from Social and Collaborative Computing | Alice Qian Zhang et.al. | 2407.07786 | null | 
| 2024-07-12 | A Survey of Attacks on Large Vision-Language Models: Resources, Advances, and Future Trends | Daizong Liu et.al. | 2407.07403 | link | 
| 2024-09-08 | T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models | Yibo Miao et.al. | 2407.05965 | null | 
| 2024-07-08 | | Mintong Kang et.al. | 2407.05557 | link | 
| 2024-07-06 | Safe Generative Chats in a WhatsApp Intelligent Tutoring System | Zachary Levonian et.al. | 2407.04915 | null | 
| 2024-08-30 | Jailbreak Attacks and Defenses Against Large Language Models: A Survey | Sibo Yi et.al. | 2407.04295 | null | 
| 2024-12-21 | Automated Progressive Red Teaming | Bojian Jiang et.al. | 2407.03876 | link | 
| 2024-07-03 | Soft Begging: Modular and Efficient Shielding of LLMs against Prompt Injection and Jailbreaking based on Prompt Tuning | Simon Ostermann et.al. | 2407.03391 | null | 
| 2024-07-03 | SOS! Soft Prompt Attack Against Open-Source Large Language Models | Ziqing Yang et.al. | 2407.03160 | null | 
| 2024-07-03 | JailbreakHunter: A Visual Analytics Approach for Jailbreak Prompts Discovery from Large-Scale Human-LLM Conversational Datasets | Zhihua Jin et.al. | 2407.03045 | null | 
| 2025-05-20 | From Theft to Bomb-Making: The Ripple Effect of Unlearning in Defending Against Jailbreak Attacks | Zhexin Zhang et.al. | 2407.02855 | link | 
| 2024-10-30 | Breach By A Thousand Leaks: Unsafe Information Leakage in 'Safe' AI Responses | David Glukhov et.al. | 2407.02551 | null | 
| 2024-08-26 | Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything | Xiaotian Zou et.al. | 2407.02534 | null | 
| 2025-03-02 | SeqAR: Jailbreak LLMs with Sequential Auto-Generated Characters | Yan Yang et.al. | 2407.01902 | link | 
| 2024-07-01 | Purple-teaming LLMs with Adversarial Defender Training | Jingyan Zhou et.al. | 2407.01850 | null | 
| 2024-07-25 | JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models | Haibo Jin et.al. | 2407.01599 | link | 
| 2025-06-28 | Enhancing the Capability and Robustness of Large Language Models through Reinforcement Learning-Driven Query Refinement | Xiaohua Wang et.al. | 2407.01461 | link | 
| 2024-07-01 | Badllama 3: removing safety finetuning from Llama 3 in minutes | Dmitrii Volkov et.al. | 2407.01376 | null | 
| 2025-05-23 | Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks | Yue Zhou et.al. | 2407.00869 | link | 
| 2025-02-28 | Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference | Anton Xue et.al. | 2407.00075 | null | 
| 2024-07-11 | Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection | Yuqi Zhou et.al. | 2406.19845 | null | 
| 2024-10-03 | Jailbreaking LLMs with Arabic Transliteration and Arabizi | Mansour Al Ghanim et.al. | 2406.18725 | link | 
| 2024-07-08 | The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm | Aakanksha et.al. | 2406.18682 | null | 
| 2024-06-26 | WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models | Liwei Jiang et.al. | 2406.18510 | link | 
| 2024-12-09 | WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs | Seungju Han et.al. | 2406.18495 | link | 
| 2024-06-26 | Poisoned LangChain: Jailbreak LLMs by LangChain | Ziqiu Wang et.al. | 2406.18122 | null | 
| 2024-12-24 | SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance | Caishuang Huang et.al. | 2406.18118 | link | 
| 2024-06-25 | CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference | Erxin Yu et.al. | 2406.17626 | link | 
| 2024-06-25 | Leveraging Reinforcement Learning in Red Teaming for Advanced Ransomware Attack Simulations | Cheng Wang et.al. | 2406.17576 | null | 
| 2024-06-21 | Steering Without Side Effects: Improving Post-Deployment Control of Language Models | Asa Cooper Stickland et.al. | 2406.15518 | link | 
| 2025-06-11 | Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding | Haneul Yoo et.al. | 2406.15481 | link | 
| 2024-06-21 | From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking | Siyuan Wang et.al. | 2406.14859 | null | 
| 2024-07-01 | Adversaries Can Misuse Combinations of Safe Models | Erik Jones et.al. | 2406.14595 | null | 
| 2025-04-19 | Jailbreaking as a Reward Misspecification Problem | Zhihui Xie et.al. | 2406.14393 | link | 
| 2024-06-20 | Finding Safety Neurons in Large Language Models | Jianhui Chen et.al. | 2406.14144 | null | 
| 2025-01-27 | Jailbreaking Large Language Models Through Alignment Vulnerabilities in Out-of-Distribution Settings | Yue Huang et.al. | 2406.13662 | link | 
| 2024-08-21 | SHIELD: Evaluation and Defense Strategies for Copyright Compliance in LLM Text Generation | Xiaoze Liu et.al. | 2406.12975 | link | 
| 2025-01-07 | ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates | Fengqing Jiang et.al. | 2406.12935 | link | 
| 2024-06-21 | [WIP] Jailbreak Paradox: The Achilles' Heel of LLMs | Abhinav Rao et.al. | 2406.12702 | null | 
| 2024-06-17 | Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner | Kenneth Li et.al. | 2406.11978 | link | 
| 2024-10-16 | CELL your Model: Contrastive Explanations for Large Language Models | Ronny Luss et.al. | 2406.11785 | null | 
| 2024-10-23 | STAR: SocioTechnical Approach to Red Teaming Language Models | Laura Weidinger et.al. | 2406.11757 | null | 
| 2024-10-30 | Refusal in Language Models Is Mediated by a Single Direction | Andy Arditi et.al. | 2406.11717 | link | 
| 2025-06-09 | Knowledge-to-Jailbreak: Investigating Knowledge-driven Jailbreaking Attacks for Large Language Models | Shangqing Tu et.al. | 2406.11682 | link | 
| 2025-02-03 | "Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak | Lingrui Mei et.al. | 2406.11668 | link | 
| 2024-06-17 | Ruby Teaming: Improving Quality Diversity Search with Memory for Automated Red Teaming | Vernon Toh Yan Han et.al. | 2406.11654 | null | 
| 2024-06-16 | garak: A Framework for Security Probing Large Language Models | Leon Derczynski et.al. | 2406.11036 | link | 
| 2024-06-16 | Threat Modelling and Risk Analysis for Large Language Model (LLM)-Powered Applications | Stephen Burabari Tete et.al. | 2406.11007 | null | 
| 2024-12-02 | Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis | Yuping Lin et.al. | 2406.10794 | link | 
| 2024-11-06 | Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs | Zhao Xu et.al. | 2406.09324 | link | 
| 2025-02-04 | JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models | Delong Ran et.al. | 2406.09321 | link | 
| 2024-10-05 | Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models | Sarah Ball et.al. | 2406.09289 | link | 
| 2025-02-25 | Towards Effective Evaluations and Comparisons for LLM Unlearning Methods | Qizhou Wang et.al. | 2406.09179 | null | 
| 2025-02-18 | StructuralSleight: Automated Jailbreak Attacks on Large Language Models Utilizing Uncommon Text-Organization Structures | Bangxin Li et.al. | 2406.08754 | null | 
| 2024-06-13 | RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs | Xuan Chen et.al. | 2406.08725 | null | 
| 2025-01-27 | When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search | Xuan Chen et.al. | 2406.08705 | link | 
| 2024-06-13 | MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models | Tianle Gu et.al. | 2406.07594 | link | 
| 2024-07-14 | Merging Improves Self-Critique Against Jailbreak Attacks | Victor Gallego et.al. | 2406.07188 | link | 
| 2024-12-06 | MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models | Yichi Zhang et.al. | 2406.07057 | null | 
| 2024-06-07 | Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs | Fan Liu et.al. | 2406.06622 | null | 
| 2024-07-03 | Unveiling the Safety of GPT-4o: An Empirical Study using Jailbreak Attacks | Zonghao Ying et.al. | 2406.06302 | link | 
| 2024-06-10 | Safety Alignment Should Be Made More Than Just a Few Tokens Deep | Xiangyu Qi et.al. | 2406.05946 | link | 
| 2024-06-13 | How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States | Zhenhong Zhou et.al. | 2406.05644 | link | 
| 2025-02-05 | SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner | Xunguang Wang et.al. | 2406.05498 | null | 
| 2025-03-05 | Is On-Device AI Broken and Exploitable? Assessing the Trust and Ethics in Small Language Models | Kalyan Nakka et.al. | 2406.05364 | null | 
| 2024-07-01 | Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt | Zonghao Ying et.al. | 2406.04031 | link | 
| 2024-06-06 | AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens | Lin Lu et.al. | 2406.03805 | null | 
| 2024-09-25 | Ranking Manipulation for Conversational Search Engines | Samuel Pfrommer et.al. | 2406.03589 | link | 
| 2025-02-01 | Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits | Andis Draguns et.al. | 2406.02619 | link | 
| 2024-05-28 | Are PPO-ed Language Models Hackable? | Suraj Anand et.al. | 2406.02577 | null | 
| 2025-05-06 | Towards Universal and Black-Box Query-Response Only Attack on LLMs with QROA | Hussein Jawad et.al. | 2406.02044 | link | 
| 2024-10-30 | Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses | Xiaosen Zheng et.al. | 2406.01288 | link | 
| 2025-03-06 | Get my drift? Catching LLM Task Drift with Activation Deltas | Sahar Abdelnabi et.al. | 2406.00799 | link | 
| 2024-06-01 | Exploring Vulnerabilities and Protections in Large Language Models: A Survey | Frank Weizhen Liu et.al. | 2406.00240 | null | 
| 2024-07-29 | Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization | Yuanpu Cao et.al. | 2406.00045 | link | 
| 2024-06-05 | Improved Techniques for Optimization-Based Jailbreaking on Large Language Models | Xiaojun Jia et.al. | 2405.21018 | link | 
| 2024-11-01 | Improved Generation of Adversarial Examples Against Safety-aligned LLMs | Qizhang Li et.al. | 2405.20778 | link | 
| 2024-08-21 | Medical MLLM is Vulnerable: Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models | Xijie Huang et.al. | 2405.20775 | link | 
| 2024-06-12 | Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character | Siyuan Ma et.al. | 2405.20773 | null | 
| 2025-06-17 | Mind the Inconspicuous: Revealing the Hidden Weakness in Aligned LLMs' Refusal Boundaries | Jiahao Yu et.al. | 2405.20653 | null | 
| 2024-05-30 | Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters | Haibo Jin et.al. | 2405.20413 | null | 
| 2024-10-17 | TAIA: Large Language Models are Out-of-Distribution Data Learners | Shuyang Jiang et.al. | 2405.20192 | link | 
| 2025-06-04 | Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks | Chen Xiong et.al. | 2405.20099 | null | 
| 2025-05-17 | Efficient Indirect LLM Jailbreak via Multimodal-LLM Jailbreak | Zhenxing Niu et.al. | 2405.20015 | null | 
| 2024-05-30 | AutoBreach: Universal and Adaptive Jailbreaking with Efficient Wordplay-Guided Optimization | Jiawei Chen et.al. | 2405.19668 | null | 
| 2024-10-11 | ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users | Guanlin Li et.al. | 2405.19360 | link | 
| 2024-05-31 | Robustifying Safety-Aligned Large Language Models through Clean Data Curation | Xiaoqun Liu et.al. | 2405.19358 | null | 
| 2024-05-29 | ConceptPrune: Concept Editing in Diffusion Models via Skilled Neuron Pruning | Ruchika Chavhan et.al. | 2405.19237 | link | 
| 2024-05-29 | Voice Jailbreak Attacks Against GPT-4o | Xinyue Shen et.al. | 2405.19103 | link | 
| 2024-12-20 | DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints | Andrew Zhao et.al. | 2405.19026 | link | 
| 2025-04-21 | Certifying Counterfactual Bias in LLMs | Isha Chaudhary et.al. | 2405.18780 | link | 
| 2024-11-18 | A Theoretical Understanding of Self-Correction through In-context Alignment | Yifei Wang et.al. | 2405.18634 | null | 
| 2025-02-28 | Learning diverse attacks on large language models for robust red-teaming and safety tuning | Seanie Lee et.al. | 2405.18540 | link | 
| 2024-06-14 | Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing | Wei Zhao et.al. | 2405.18166 | link | 
| 2024-10-14 | White-box Multimodal Jailbreaks Against Large Vision-Language Models | Ruofan Wang et.al. | 2405.17894 | link | 
| 2024-05-28 | Automatic Jailbreaking of the Text-to-Image Generative AI Systems | Minseon Kim et.al. | 2405.16567 | link | 
| 2024-05-24 | Hacc-Man: An Arcade Game for Jailbreaking LLMs | Matheus Valentim et.al. | 2405.15902 | null | 
| 2024-10-08 | Extracting Prompts by Inverting LLM Outputs | Collin Zhang et.al. | 2405.15012 | link | 
| 2024-10-30 | Representation Noising: A Defence Mechanism Against Harmful Finetuning | Domenic Rosati et.al. | 2405.14577 | link | 
| 2024-05-23 | Impact of Non-Standard Unicode Characters on Security and Comprehension in Large Language Models | Johan S Daniel et.al. | 2405.14490 | link | 
| 2024-05-22 | WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response | Tianrong Zhang et.al. | 2405.14023 | null | 
| 2024-05-22 | Safety Alignment for Vision Language Models | Zhendong Liu et.al. | 2405.13581 | null | 
| 2024-07-07 | TrojanRAG: Retrieval-Augmented Generation Can Be Backdoor Driver in Large Language Models | Pengzhou Cheng et.al. | 2405.13401 | link | 
| 2024-10-15 | GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation | Govind Ramesh et.al. | 2405.13077 | null | 
| 2024-06-19 | Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation | Yuxi Li et.al. | 2405.13068 | link | 
| 2024-06-17 | Tiny Refinements Elicit Resilience: Toward Efficient Prefix-Model Against LLM Red-Teaming | Jiaxu Liu et.al. | 2405.12604 | null | 
| 2025-03-28 | Single Image Unlearning: Efficient Machine Unlearning in Multimodal Large Language Models | Jiaqi Li et.al. | 2405.12523 | null | 
| 2024-08-06 | Hummer: Towards Limited Competitive Preference Dataset | Li Jiang et.al. | 2405.11647 | null | 
| 2024-05-15 | Benchmark Early and Red Team Often: A Framework for Assessing and Managing Dual-Use Hazards of AI Foundation Models | Anthony M. Barrett et.al. | 2405.10986 | null | 
| 2024-10-05 | Red Teaming Language Models for Processing Contradictory Dialogues | Xiaofei Wen et.al. | 2405.10128 | link | 
| 2025-02-12 | Efficient LLM Jailbreak via Adaptive Dense-to-sparse Constrained Optimization | Kai Hu et.al. | 2405.09113 | link | 
| 2024-05-15 | A safety realignment framework via subspace-oriented model fusion for large language models | Xin Yi et.al. | 2405.09055 | link | 
| 2024-05-14 | SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models | Raghuveer Peri et.al. | 2405.08317 | null | 
| 2024-05-14 | PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition | Ziyang Zhang et.al. | 2405.07932 | link | 
| 2025-06-13 | PLeak: Prompt Leaking Attacks against Large Language Model Applications | Bo Hui et.al. | 2405.06823 | link | 
| 2024-08-29 | Mitigating Exaggerated Safety in Large Language Models | Ruchira Ray et.al. | 2405.05418 | null | 
| 2025-05-21 | Red-Teaming for Inducing Societal Bias in Large Language Models | Chu Fei Luo et.al. | 2405.04756 | link | 
| 2024-05-07 | Learning To See But Forgetting To Follow: Visual Instruction Tuning Makes LLMs More Prone To Jailbreak Attacks | Georgios Pantazopoulos et.al. | 2405.04403 | link | 
| 2024-05-07 | Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent | Shang Shang et.al. | 2405.03654 | null | 
| 2024-05-03 | Aloe: A Family of Fine-tuned Open Healthcare LLMs | Ashwin Kumar Gururajan et.al. | 2405.01886 | null | 
| 2025-03-02 | Boosting Jailbreak Attack with Momentum | Yihao Zhang et.al. | 2405.01229 | link | 
| 2025-06-01 | Mixture of insighTful Experts (MoTE): The Synergy of Thought Chains and Expert Mixtures in Self-Alignment | Zhili Liu et.al. | 2405.00557 | null | 
| 2024-05-10 | Evaluating and Mitigating Linguistic Discrimination in Large Language Models | Guoliang Dong et.al. | 2404.18534 | null | 
| 2024-04-26 | Probabilistic Inference in Language Models via Twisted Sequential Monte Carlo | Stephen Zhao et.al. | 2404.17546 | link | 
| 2025-06-02 | AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs | Anselm Paulus et.al. | 2404.16873 | link | 
| 2025-07-02 | Don't Say No: Jailbreaking LLM by Suppressing Refusal | Yukai Zhou et.al. | 2404.16369 | link | 
| 2025-04-08 | Investigating Adversarial Trigger Transfer in Large Language Models | Nicholas Meade et.al. | 2404.16020 | link | 
| 2024-04-23 | Bias patterns in the application of LLMs for clinical decision support: A comprehensive study | Raphael Poulain et.al. | 2404.15149 | link | 
| 2024-04-23 | A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AI | Seliem El-Sayed et.al. | 2404.15058 | null | 
| 2024-06-06 | Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs | Javier Rando et.al. | 2404.14461 | link | 
| 2024-10-10 | Protecting Your LLMs with Information Bottleneck | Zichuan Liu et.al. | 2404.13968 | link | 
| 2024-04-19 | The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions | Eric Wallace et.al. | 2404.13208 | null | 
| 2024-04-18 | Advancing the Robustness of Large Language Models through Self-Denoised Smoothing | Jiabao Ji et.al. | 2404.12274 | link | 
| 2025-06-06 | JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models | Yingchaojie Feng et.al. | 2404.08793 | null | 
| 2024-06-24 | ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming | Simone Tedeschi et.al. | 2404.08676 | link | 
| 2024-04-12 | Subtoxic Questions: Dive Into Attitude Change of LLM's Response in Jailbreak Attempts | Tianyu Zhang et.al. | 2404.08309 | null | 
| 2024-11-24 | AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs | Zeyi Liao et.al. | 2404.07921 | link | 
| 2024-04-10 | CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge | Yu Ying Chiu et.al. | 2404.06664 | null | 
| 2024-05-07 | Rethinking How to Evaluate Language Model Jailbreak | Hongyu Cai et.al. | 2404.06407 | link | 
| 2024-07-03 | Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge | Weikai Lu et.al. | 2404.05880 | link | 
| 2024-04-16 | Hidden You Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Logic Chain Injection | Zhilong Wang et.al. | 2404.04849 | null | 
| 2024-09-09 | Fine-Tuning, Quantization, and LLMs: Navigating Unintended Outcomes | Divyanshu Kumar et.al. | 2404.04392 | null | 
| 2024-12-15 | Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks? | Shuo Chen et.al. | 2404.03411 | link | 
| 2024-11-24 | JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks | Weidi Luo et.al. | 2404.03027 | null | 
| 2025-05-26 | Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models | Jiachen Ma et.al. | 2404.02928 | null | 
| 2024-04-03 | Learn to Disguise: Avoid Refusal Responses in LLM's Defense via a Multi-agent Attacker-Disguiser Game | Qianqiao Xu et.al. | 2404.02532 | null | 
| 2025-04-17 | Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks | Maksym Andriushchenko et.al. | 2404.02151 | link | 
| 2024-04-02 | Red-Teaming Segment Anything Model | Krzysztof Jankowski et.al. | 2404.02067 | link | 
| 2025-02-26 | Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack | Mark Russinovich et.al. | 2404.01833 | null | 
| 2024-10-31 | JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models | Patrick Chao et.al. | 2404.01318 | link | 
| 2024-08-20 | What is in Your Safe Data? Identifying Benign Data that Breaks Safety | Luxi He et.al. | 2404.01099 | link | 
| 2024-11-26 | Against The Achilles' Heel: A Survey on Red Teaming for Generative Models | Lizhi Lin et.al. | 2404.00629 | link | 
| 2024-12-27 | Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code | Taishi Nakamura et.al. | 2404.00399 | null | 
| 2025-04-28 | Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation | Yutong He et.al. | 2403.19103 | null | 
| 2024-03-27 | IterAlign: Iterative Constitutional Alignment of Large Language Models | Xiusi Chen et.al. | 2403.18341 | null | 
| 2025-03-03 | Optimization-based Prompt Injection Attack to LLM-as-a-Judge | Jiawen Shi et.al. | 2403.17710 | link | 
| 2024-09-30 | Don't Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models | Zhiyuan Yu et.al. | 2403.17336 | null | 
| 2024-03-22 | Risk and Response in Large Language Models: Evaluating Key Threat Categories | Bahareh Harandizadeh et.al. | 2403.14988 | null | 
| 2024-06-24 | Testing the Limits of Jailbreaking Defenses with the Purple Problem | Taeyoun Kim et.al. | 2403.14725 | link | 
| 2024-07-23 | RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content | Zhuowen Yuan et.al. | 2403.13031 | link | 
| 2024-03-18 | EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models | Weikang Zhou et.al. | 2403.12171 | link | 
| 2024-05-14 | Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation | Jessica Quaye et.al. | 2403.12075 | link | 
| 2025-01-13 | Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models | Yifan Li et.al. | 2403.09792 | link | 
| 2024-10-15 | Eyes Closed, Safety On: Protecting Multimodal LLMs via Image-to-Text Transformation | Yunhao Gou et.al. | 2403.09572 | null | 
| 2024-03-14 | AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting | Yu Wang et.al. | 2403.09513 | link | 
| 2024-07-17 | The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models? | Qinyu Zhao et.al. | 2403.09037 | link | 
| 2024-03-19 | Review of Generative AI Methods in Cybersecurity | Yagmur Yigit et.al. | 2403.08701 | null | 
| 2024-09-30 | Distract Large Language Models for Automatic Jailbreak Attack | Zeguan Xiao et.al. | 2403.08424 | link | 
| 2024-03-14 | HRLAIF: Improvements in Helpfulness and Harmlessness in Open-domain Reinforcement Learning From AI Feedback | Ang Li et.al. | 2403.08309 | null | 
| 2024-03-14 | Red Teaming Models for Hyperspectral Image Analysis Using Explainable AI | Vladimir Zaigrajew et.al. | 2403.08017 | null | 
| 2025-07-18 | Defending Against Unforeseen Failure Modes with Latent Adversarial Training | Stephen Casper et.al. | 2403.05030 | link | 
| 2024-03-07 | A Safe Harbor for AI Evaluation and Red Teaming | Shayne Longpre et.al. | 2403.04893 | null | 
| 2024-11-14 | AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks | Yifan Zeng et.al. | 2403.04783 | link | 
| 2024-03-11 | Using Hallucinations to Bypass GPT4's Filter | Benjamin Lemkin et.al. | 2403.04769 | null | 
| 2024-10-04 | Aligners: Decoupling LLMs and Alignment | Lilian Ngweta et.al. | 2403.04224 | link | 
| 2025-02-05 | ImgTrojan: Jailbreaking Vision-Language Models with ONE Image | Xijia Tao et.al. | 2403.02910 | link | 
| 2025-01-30 | Here Comes The AI Worm: Unleashing Zero-click Worms that Target GenAI-Powered Applications | Stav Cohen et.al. | 2403.02817 | link | 
| 2024-03-02 | AutoAttacker: A Large Language Model Guided System to Implement Automatic Cyber-attacks | Jiacen Xu et.al. | 2403.01038 | null | 
| 2024-11-07 | Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes | Xiaomeng Hu et.al. | 2403.00867 | null | 
| 2024-02-28 | TroubleLLM: Align to Red Team Expert | Zhuoer Xu et.al. | 2403.00829 | null | 
| 2024-09-19 | Enhancing Jailbreak Attacks with Diversity Guidance | Xu Zhang et.al. | 2403.00292 | null | 
| 2024-02-29 | Curiosity-driven Red-teaming for Large Language Models | Zhang-Wei Hong et.al. | 2402.19464 | link | 
| 2024-06-10 | Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction | Tong Liu et.al. | 2402.18104 | link | 
| 2024-10-30 | Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue | Zhenhong Zhou et.al. | 2402.17262 | null | 
| 2024-11-11 | DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers | Xirui Li et.al. | 2402.16914 | link | 
| 2024-02-26 | CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models | Huijie Lv et.al. | 2402.16717 | link | 
| 2024-06-06 | Defending LLMs against Jailbreaking Attacks via Backtranslation | Yihan Wang et.al. | 2402.16459 | link | 
| 2024-02-28 | Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing | Jiabao Ji et.al. | 2402.16192 | link | 
| 2024-06-04 | ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings | Hao Wang et.al. | 2402.16006 | null | 
| 2024-02-24 | PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails | Neal Mangaokar et.al. | 2402.15911 | null | 
| 2024-03-04 | LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper | Daoyuan Wu et.al. | 2402.15727 | null | 
| 2024-02-24 | Foot In The Door: Understanding Large Language Model Jailbreaking via Cognitive Psychology | Zhenhua Wang et.al. | 2402.15690 | null | 
| 2024-02-23 | Fast Adversarial Attacks on Language Models In One GPU Minute | Vinu Sankar Sadasivan et.al. | 2402.15570 | link | 
| 2024-11-16 | How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries | Somnath Banerjee et.al. | 2402.15302 | link | 
| 2024-02-27 | Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement | Heegyu Kim et.al. | 2402.15180 | null | 
| 2024-06-20 | Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment | Jiongxiao Wang et.al. | 2402.14968 | null | 
| 2024-02-27 | Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs | Xiaoxia Li et.al. | 2402.14872 | null | 
| 2024-06-18 | Is the System Message Really Important to Jailbreaks in Large Language Models? | Xiaotian Zou et.al. | 2402.14857 | null | 
| 2024-02-21 | Coercing LLMs to do and reveal (almost) anything | Jonas Geiping et.al. | 2402.14020 | link | 
| 2024-02-26 | AttackGNN: Red-Teaming GNNs in Hardware Security Using Reinforcement Learning | Vasudev Gohil et.al. | 2402.13946 | null | 
| 2025-04-30 | Round Trip Translation Defence against Large Language Model Jailbreaking Attacks | Canaan Yung et.al. | 2402.13517 | link | 
| 2024-05-29 | GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis | Yueqi Xie et.al. | 2402.13494 | link | 
| 2024-05-17 | A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models | Zihao Xu et.al. | 2402.13457 | link | 
| 2025-02-21 | Defending Jailbreak Prompts via In-Context Adversarial Game | Yujun Zhou et.al. | 2402.13148 | null | 
| 2024-06-06 | TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification | Martin Gubri et.al. | 2402.12991 | link | 
| 2024-06-07 | ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs | Fengqing Jiang et.al. | 2402.11753 | link | 
| 2024-08-16 | ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages | Junjie Ye et.al. | 2402.10753 | link | 
| 2025-03-16 | When "Competency" in Reasoning Opens the Door to Vulnerability: Jailbreaking LLMs via Novel Complex Ciphers | Divij Handa et.al. | 2402.10601 | link | 
| 2024-08-27 | A StrongREJECT for Empty Jailbreaks | Alexandra Souly et.al. | 2402.10260 | link | 
| 2024-02-15 | A Trembling House of Cards? Mapping Adversarial Attacks against Language Agents | Lingbo Mo et.al. | 2402.10196 | link | 
| 2024-10-02 | Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks | Yixin Cheng et.al. | 2402.09177 | null | 
| 2024-02-16 | Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues | Zhiyuan Chang et.al. | 2402.09091 | null | 
| 2024-07-25 | SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding | Zhangchen Xu et.al. | 2402.08983 | link | 
| 2024-06-07 | COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability | Xingang Guo et.al. | 2402.08679 | link | 
| 2024-06-03 | Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast | Xiangming Gu et.al. | 2402.08567 | link | 
| 2024-02-13 | Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning | Gelei Deng et.al. | 2402.08416 | null | 
| 2024-10-31 | Fight Back Against Jailbreaking via Prompt Adversarial Tuning | Yichuan Mo et.al. | 2402.06255 | link | 
| 2025-03-20 | EmojiPrompt: Generative Prompt Obfuscation for Privacy-Preserving Communication with Cloud-based LLMs | Sam Lin et.al. | 2402.05868 | link | 
| 2025-05-26 | JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs | Junjie Chu et.al. | 2402.05668 | link | 
| 2024-02-08 | Rapid Optimization for Jailbreaking LLMs via Subconscious Exploitation and Echopraxia | Guangyu Shen et.al. | 2402.05467 | link | 
| 2024-10-24 | Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications | Boyi Wei et.al. | 2402.05162 | null | 
| 2024-02-27 | HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal | Mantas Mazeika et.al. | 2402.04249 | link | 
| 2024-02-05 | Nevermind: Instruction Override and Moderation in Large Language Models | Edward Kim et.al. | 2402.03303 | null | 
| 2025-05-26 | GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models | Haibo Jin et.al. | 2402.03299 | null | 
| 2024-02-04 | Jailbreaking Attack against Multimodal Large Language Model | Zhenxing Niu et.al. | 2402.02309 | link | 
| 2024-06-17 | Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models | Yongshuo Zong et.al. | 2402.02207 | link | 
| 2024-01-25 | MULTIVERSE: Exposing Large Language Model Alignment Problems in Diverse Worlds | Xiaolong Jin et.al. | 2402.01706 | null | 
| 2024-11-14 | Security and Privacy Challenges of Large Language Models: A Survey | Badhan Chandra Das et.al. | 2402.00888 | null | 
| 2024-02-01 | Investigating Bias Representations in Llama 2 Chat via Activation Steering | Dawn Lu et.al. | 2402.00402 | null | 
| 2024-06-03 | On Prompt-Driven Safeguarding for Large Language Models | Chujie Zheng et.al. | 2401.18018 | link | 
| 2024-11-08 | Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks | Andy Zhou et.al. | 2401.17263 | link | 
| 2025-07-11 | Weak-to-Strong Jailbreaking on Large Language Models | Xuandong Zhao et.al. | 2401.17256 | link | 
| 2024-01-30 | A Cross-Language Investigation into Jailbreak Attacks in Large Language Models | Jie Li et.al. | 2401.16765 | null | 
| 2024-01-30 | Gradient-Based Language Model Red Teaming | Nevan Wichers et.al. | 2401.16656 | link | 
| 2024-01-29 | Towards Red Teaming in Multimodal and Multilingual Translation | Christophe Ropers et.al. | 2401.16247 | null | 
| 2024-08-27 | Red-Teaming for Generative AI: Silver Bullet or Security Theater? | Michael Feffer et.al. | 2401.15897 | null | 
| 2024-01-23 | Red Teaming Visual Language Models | Mukai Li et.al. | 2401.12915 | null | 
| 2024-01-24 | Digital cloning of online social networks for language-sensitive agent-based modeling of misinformation spread | Prateek Puri et.al. | 2401.12509 | null | 
| 2024-07-10 | The Ethics of Interaction: Mitigating Security Threats in LLMs | Ashutosh Kumar et.al. | 2401.12273 | null | 
| 2024-01-20 | InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance | Pengyu Wang et.al. | 2401.11206 | link | 
| 2024-10-31 | Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning | Adib Hasan et.al. | 2401.10862 | link | 
| 2024-05-16 | Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models | Rima Hazra et.al. | 2401.10647 | link | 
| 2024-02-12 | All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks | Kazuhiro Takemoto et.al. | 2401.09798 | link | 
| 2025-03-18 | AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models | Dong Shu et.al. | 2401.09002 | null | 
| 2025-02-21 | Revisiting Jailbreaking for Large Language Models: A Representation Engineering Perspective | Tianlong Li et.al. | 2401.06824 | null | 
| 2024-12-16 | Intention Analysis Makes LLMs A Good Jailbreak Defender | Yuqi Zhang et.al. | 2401.06561 | link | 
| 2024-01-23 | How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs | Yi Zeng et.al. | 2401.06373 | link | 
| 2024-01-11 | Combating Adversarial Attacks with Multi-Agent Debate | Steffi Chern et.al. | 2401.05998 | link | 
| 2024-04-01 | The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Model Performance | Abel Salinas et.al. | 2401.03729 | link | 
| 2024-08-19 | Malla: Demystifying Real-world Large Language Model Integrated Malicious Services | Zilong Lin et.al. | 2401.03315 | link | 
| 2024-01-03 | A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity | Andrew Lee et.al. | 2401.01967 | link | 
| 2023-12-30 | Red Teaming for Large Language Models At Scale: Tackling Hallucinations on Mathematics Tasks | Aleksander Buszydlik et.al. | 2401.00290 | link | 
| 2023-12-28 | Scalable and automated Evaluation of Blue Team cyber posture in Cyber Ranges | Federica Bianchi et.al. | 2312.17221 | null | 
| 2024-08-04 | Exploiting Novel GPT-4 APIs | Kellin Pelrine et.al. | 2312.14302 | link | 
| 2023-12-12 | Maatphor: Automated Variant Analysis for Prompt Injection Attacks | Ahmed Salem et.al. | 2312.11513 | null | 
| 2023-12-08 | A Red Teaming Framework for Securing AI in Maritime Autonomous Systems | Mathew J. Walter et.al. | 2312.11500 | null | 
| 2025-03-15 | JailGuard: A Universal Detection Framework for LLM Prompt-based Attacks | Xiaoyu Zhang et.al. | 2312.10766 | null | 
| 2023-12-16 | Comprehensive Evaluation of ChatGPT Reliability Through Multilingual Inquiries | Poorna Chander Reddy Puttaparthi et.al. | 2312.10524 | link | 
| 2023-12-04 | Generative AI in Writing Research Papers: A New Type of Algorithmic Bias and Uncertainty in Scholarly Work | Rishab Jain et.al. | 2312.10057 | null | 
| 2023-12-14 | OSTINATO: Cross-host Attack Correlation Through Attack Activity Similarity Detection | Sutanu Kumar Ghosh et.al. | 2312.09321 | null | 
| 2024-04-17 | Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF | Anand Siththaranjan et.al. | 2312.08358 | link | 
| 2023-12-13 | Causality Analysis for Evaluating the Security of Large Language Models | Wei Zhao et.al. | 2312.07876 | link | 
| 2024-07-23 | AI Control: Improving Safety Despite Intentional Subversion | Ryan Greenblatt et.al. | 2312.06942 | link | 
| 2024-05-30 | Privacy Issues in Large Language Models: A Survey | Seth Neel et.al. | 2312.06717 | link | 
| 2023-12-11 | Control Risk for Potential Misuse of Artificial Intelligence in Science | Jiyan He et.al. | 2312.06632 | link | 
| 2023-12-08 | Seamless: Multilingual Expressive and Streaming Speech Translation | Seamless Communication et.al. | 2312.05187 | link | 
| 2023-12-12 | DeceptPrompt: Exploiting LLM-driven Code Generation via Adversarial Natural Language Instructions | Fangzhou Wu et.al. | 2312.04730 | null | 
| 2024-02-23 | Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak | Yanrui Du et.al. | 2312.04127 | null | 
| 2024-10-31 | Tree of Attacks: Jailbreaking Black-Box LLMs Automatically | Anay Mehrotra et.al. | 2312.02119 | link | 
| 2024-06-09 | Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections | Yuanpu Cao et.al. | 2312.00027 | link | 
| 2024-03-03 | Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition | Sander Schulhoff et.al. | 2311.16119 | link | 
| 2023-11-27 | How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs | Haoqin Tu et.al. | 2311.16101 | link | 
| 2023-11-27 | InfoPattern: Unveiling Information Propagation Patterns in Social Media | Chi Han et.al. | 2311.15642 | null | 
| 2023-11-15 | Towards Publicly Accountable Frontier LLMs: Building an External Scrutiny Ecosystem under the ASPIRE Framework | Markus Anderljung et.al. | 2311.14711 | null | 
| 2024-04-29 | Universal Jailbreak Backdoors from Poisoned Human Feedback | Javier Rando et.al. | 2311.14455 | link | 
| 2024-03-24 | Unmasking and Improving Data Credibility: A Study with Datasets for Training Harmless Language Models | Zhaowei Zhu et.al. | 2311.11202 | link | 
| 2025-05-29 | Hijacking Large Language Models via Adversarial In-Context Learning | Xiangyu Zhou et.al. | 2311.09948 | link | 
| 2024-02-29 | Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking | Nan Xu et.al. | 2311.09827 | null | 
| 2024-06-19 | RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models | Jiongxiao Wang et.al. | 2311.09641 | null | 
| 2023-11-16 | JAB: Joint Adversarial Prompting and Belief Augmentation | Ninareh Mehrabi et.al. | 2311.09473 | null | 
| 2024-08-15 | Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment | Haoran Wang et.al. | 2311.09433 | link | 
| 2024-01-20 | Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts | Yuanwei Wu et.al. | 2311.09127 | null | 
| 2024-06-12 | Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization | Zhexin Zhang et.al. | 2311.09096 | link | 
| 2023-11-29 | AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications | Bhaktipriya Radharapu et.al. | 2311.08592 | null | 
| 2024-04-07 | A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily | Peng Ding et.al. | 2311.08268 | link | 
| 2023-11-13 | MART: Improving LLM Safety with Multi-round Automatic Red-Teaming | Suyu Ge et.al. | 2311.07689 | null | 
| 2024-05-22 | Flames: Benchmarking Value Alignment of LLMs in Chinese | Kexin Huang et.al. | 2311.06899 | link | 
| 2024-12-10 | Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming | Nanna Inie et.al. | 2311.06237 | null | 
| 2024-04-01 | Fake Alignment: Are LLMs Really Aligned Well? | Yixu Wang et.al. | 2311.05915 | link | 
| 2025-01-19 | FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts | Yichen Gong et.al. | 2311.05608 | link | 
| 2024-03-08 | Can LLMs Follow Simple Rules? | Norman Mu et.al. | 2311.04235 | link | 
| 2023-11-24 | Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation | Rusheb Shah et.al. | 2311.03348 | null | 
| 2024-11-28 | DeepInception: Hypnotize Large Language Model to Be Jailbreaker | Xuan Li et.al. | 2311.03191 | link | 
| 2024-05-22 | LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B | Simon Lermen et.al. | 2310.20624 | null | 
| 2024-03-10 | From Chatbots to PhishBots? -- Preventing Phishing scams created using ChatGPT, Google Bard and Claude | Sayak Saha Roy et.al. | 2310.19181 | null | 
| 2024-03-22 | Self-Guard: Empower the LLM to Safeguard Itself | Zezhong Wang et.al. | 2310.15851 | null | 
| 2023-12-14 | AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models | Sicheng Zhu et.al. | 2310.15140 | null | 
| 2023-11-13 | Language Model Unalignment: Parametric Red-Teaming to Expose Hidden Harms and Biases | Rishabh Bhardwaj et.al. | 2310.14303 | null | 
| 2023-10-20 | Adaptive Experimental Design for Intrusion Data Collection | Kate Highnam et.al. | 2310.13224 | null | 
| 2023-10-28 | Probing LLMs for hate speech detection: strengths and vulnerabilities | Sarthak Roy et.al. | 2310.12860 | null | 
| 2023-10-19 | Attack Prompt Generation for Red Teaming and Defending Large Language Models | Boyi Deng et.al. | 2310.12505 | link | 
| 2023-10-17 | Learning from Red Teaming: Gender Bias Provocation and Mitigation in Large Language Models | Hsuan Su et.al. | 2310.11079 | null | 
| 2023-10-16 | Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks | Erfan Shayegani et.al. | 2310.10844 | null | 
| 2024-02-16 | Large Language Model Unlearning | Yuanshun Yao et.al. | 2310.10683 | link | 
| 2024-06-07 | Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? | Yu-Lin Tsai et.al. | 2310.10012 | link | 
| 2023-11-11 | ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models | Alex Mei et.al. | 2310.09624 | link | 
| 2024-07-18 | Jailbreaking Black Box Large Language Models in Twenty Queries | Patrick Chao et.al. | 2310.08419 | link | 
| 2023-10-10 | Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation | Yangsibo Huang et.al. | 2310.06987 | link | 
| 2024-03-04 | Multilingual Jailbreak Challenges in Large Language Models | Yue Deng et.al. | 2310.06474 | link | 
| 2024-05-25 | Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations | Zeming Wei et.al. | 2310.06387 | null | 
| 2024-03-20 | AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models | Xiaogeng Liu et.al. | 2310.04451 | link | 
| 2023-09-17 | Red Teaming Generative AI/NLP, the BB84 quantum cryptography protocol and the NIST-approved Quantum-Resistant Cryptographic Algorithms | Petar Radanliev et.al. | 2310.04425 | null | 
| 2023-10-05 | Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! | Xiangyu Qi et.al. | 2310.03693 | link | 
| 2024-06-11 | SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks | Alexander Robey et.al. | 2310.03684 | link | 
| 2024-01-27 | Low-Resource Languages Jailbreak GPT-4 | Zheng-Xin Yong et.al. | 2310.02446 | null | 
| 2023-10-03 | Jailbreaker in Jail: Moving Target Defense for Large Language Models | Bocheng Chen et.al. | 2310.02417 | null | 
| 2023-10-03 | Can Language Models be Instructed to Protect Personal Information? | Yang Chen et.al. | 2310.02224 | null | 
| 2024-01-22 | Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench | Jen-tse Huang et.al. | 2310.01386 | link | 
| 2023-10-02 | No Offense Taken: Eliciting Offensiveness from Language Models | Anugya Srivastava et.al. | 2310.00892 | link | 
| 2024-07-28 | Evolving Diverse Red-team Language Models in Multi-round Multi-agent Games | Chengdong Ma et.al. | 2310.00322 | null | 
| 2024-06-12 | Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM | Bochuan Cao et.al. | 2309.14348 | link | 
| 2024-06-27 | GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts | Jiahao Yu et.al. | 2309.10253 | link | 
| 2024-06-08 | Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts | Zhi-Yi Chin et.al. | 2309.06135 | link | 
| 2024-04-14 | FuzzLLM: A Novel and Universal Fuzzing Framework for Proactively Discovering Jailbreak Vulnerabilities in Large Language Models | Dongyu Yao et.al. | 2309.05274 | link | 
| 2024-08-05 | Open Sesame! Universal Black Box Jailbreaking of Large Language Models | Raz Lapid et.al. | 2309.01446 | null | 
| 2023-09-04 | Baseline Defenses for Adversarial Attacks Against Aligned Language Models | Neel Jain et.al. | 2309.00614 | null | 
| 2023-08-28 | The Promise and Peril of Artificial Intelligence -- Violet Teaming Offers a Balanced Path Forward | Alexander J. Titus et.al. | 2308.14253 | null | 
| 2023-11-07 | Detecting Language Model Attacks with Perplexity | Gabriel Alon et.al. | 2308.14132 | null | 
| 2023-08-25 | Self-Deception: Reverse Penetrating the Semantic Firewall of Large Language Models | Zhenhua Wang et.al. | 2308.11521 | null | 
| 2023-08-21 | On the Adversarial Robustness of Multi-Modal Foundation Models | Christian Schlarmann et.al. | 2308.10741 | link | 
| 2023-08-21 | Using Large Language Models for Cybersecurity Capture-The-Flag Challenges and Certification Questions | Wesley Tann et.al. | 2308.10443 | null | 
| 2023-08-30 | Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment | Rishabh Bhardwaj et.al. | 2308.09662 | link | 
| 2024-05-06 | Robustness Over Time: Understanding Adversarial Examples' Effectiveness on Longitudinal Versions of Large Language Models | Yugeng Liu et.al. | 2308.07847 | null | 
| 2024-03-26 | GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher | Youliang Yuan et.al. | 2308.06463 | link | 
| 2023-08-16 | Where's the Liability in Harmful AI Speech? | Peter Henderson et.al. | 2308.04635 | null | 
| 2024-11-07 | FLIRT: Feedback Loop In-context Red Teaming | Ninareh Mehrabi et.al. | 2308.04265 | null | 
| 2024-05-15 | "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models | Xinyue Shen et.al. | 2308.03825 | link | 
| 2024-04-01 | XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models | Paul Röttger et.al. | 2308.01263 | link | 
| 2023-08-03 | Confidence-Building Measures for Artificial Intelligence: Workshop Proceedings | Sarah Shoker et.al. | 2308.00862 | null | 
| 2023-12-20 | Universal and Transferable Adversarial Attacks on Aligned Language Models | Andy Zou et.al. | 2307.15043 | link | 
| 2023-10-10 | Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models | Erfan Shayegani et.al. | 2307.14539 | null | 
| 2023-10-25 | MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots | Gelei Deng et.al. | 2307.08715 | null | 
| 2023-08-28 | Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models | Huachuan Qiu et.al. | 2307.08487 | link | 
| 2023-07-05 | Jailbroken: How Does LLM Safety Training Fail? | Alexander Wei et.al. | 2307.02483 | null | 
| 2023-07-03 | From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy | Maanak Gupta et.al. | 2307.00691 | null | 
| 2023-08-16 | Visual Adversarial Examples Jailbreak Aligned Large Language Models | Xiangyu Qi et.al. | 2306.13213 | link | 
| 2024-02-26 | DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models | Boxin Wang et.al. | 2306.11698 | null | 
| 2023-10-11 | Explore, Establish, Exploit: Red Teaming Language Models from Scratch | Stephen Casper et.al. | 2306.09442 | link | 
| 2023-05-30 | Seeing Seeds Beyond Weeds: Green Teaming Generative AI for Beneficial Uses | Logan Stapleton et.al. | 2306.03097 | null | 
| 2023-10-19 | Red Teaming Language Model Detectors with Language Models | Zhouxing Shi et.al. | 2305.19713 | link | 
| 2023-05-27 | Query-Efficient Black-Box Red Teaming via Bayesian Optimization | Deokjae Lee et.al. | 2305.17444 | link | 
| 2024-03-27 | Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks | Abhinav Rao et.al. | 2305.14965 | link | 
| 2024-03-10 | Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study | Yi Liu et.al. | 2305.13860 | null | 
| 2023-11-10 | SneakyPrompt: Jailbreaking Text-to-image Generative Models | Yuchen Yang et.al. | 2305.12082 | link | 
| 2023-05-11 | Towards best practices in AGI safety and governance: A survey of expert opinion | Jonas Schuett et.al. | 2305.07153 | null | 
| 2023-05-09 | Generating Phishing Attacks using ChatGPT | Sayak Saha Roy et.al. | 2305.05133 | null | 
| 2023-10-19 | Automatic Prompt Optimization with "Gradient Descent" and Beam Search | Reid Pryzant et.al. | 2305.03495 | link | 
| 2023-04-21 | Power to the Data Defenders: Human-Centered Disclosure Risk Calibration of Open Data | Kaustav Bhattacharjee et.al. | 2304.11278 | null | 
| 2024-06-03 | Fundamental Limitations of Alignment in Large Language Models | Yotam Wolf et.al. | 2304.11082 | link | 
| 2023-11-01 | Multi-step Jailbreaking Privacy Attacks on ChatGPT | Haoran Li et.al. | 2304.05197 | link | 
| 2023-07-27 | Clustered Federated Learning Architecture for Network Anomaly Detection in Large Scale Heterogeneous IoT Networks | Xabier Sáez-de-Cámara et.al. | 2303.15986 | null | 
| 2023-03-09 | Personalisation within bounds: A risk taxonomy and policy framework for the alignment of large language models with personalised feedback | Hannah Rose Kirk et.al. | 2303.05453 | null | 
| 2023-09-21 | Red Teaming Deep Neural Networks with Feature Synthesis Tools | Stephen Casper et.al. | 2302.10894 | null | 
| 2023-01-05 | Can Large Language Models Change User Preference Adversarially? | Varshini Subhash et.al. | 2302.10291 | null | 
| 2023-05-29 | Red teaming ChatGPT via Jailbreaking: Bias, Robustness, Reliability and Toxicity | Terry Yue Zhuo et.al. | 2301.12867 | null | 
| 2024-08-23 | Asymptotically Normal Estimation of Local Latent Network Curvature | Steven Wilkins-Reeves et.al. | 2211.11673 | link | 
| 2023-05-05 | Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks | Stephen Casper et.al. | 2211.10024 | link | 
| 2023-06-07 | Beyond the Surface: Investigating Malicious CVE Proof of Concept Exploits on GitHub | Soufian El Yadmani et.al. | 2210.08374 | null | 
| 2022-11-10 | Red-Teaming the Stable Diffusion Safety Filter | Javier Rando et.al. | 2210.04610 | null | 
| 2022-11-22 | Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned | Deep Ganguli et.al. | 2209.07858 | link | 
| 2023-10-13 | Red Teaming with Mind Reading: White-Box Adversarial Policies Against RL Agents | Stephen Casper et.al. | 2209.02167 | link | 
| 2022-08-16 | CTI4AI: Threat Intelligence Generation and Sharing after Red Teaming AI Models | Chuyen Nguyen et.al. | 2208.07476 | null | 
| 2022-08-12 | PRIVEE: A Visual Analytic Workflow for Proactive Privacy Risk Inspection of Open Data | Kaustav Bhattacharjee et.al. | 2208.06481 | null | 
| 2022-07-30 | 'PeriHack': Designing a Serious Game for Cybersecurity Awareness | Roberto Dillon et.al. | 2208.00235 | null | 
| 2023-07-27 | Gotham Testbed: a Reproducible IoT Testbed for Security Experiments and Dataset Generation | Xabier Sáez-de-Cámara et.al. | 2207.13981 | link | 
| 2022-02-07 | Red Teaming Language Models with Language Models | Ethan Perez et.al. | 2202.03286 | null | 
| 2021-12-22 | Catch Me If You GAN: Using Artificial Intelligence for Fake Log Generation | Christian Toemmel et.al. | 2112.12006 | null | 
| 2021-12-18 | Dynamic Defender-Attacker Blotto Game | Daigo Shishika et.al. | 2112.09890 | null | 
| 2021-11-24 | Needle in a Haystack: Detecting Subtle Malicious Edits to Additive Manufacturing G-code Files | Caleb Beckwith et.al. | 2111.12746 | null | 
| 2021-10-04 | Automating Privilege Escalation with Deep Reinforcement Learning | Kalle Kujanpää et.al. | 2110.01362 | null | 
| 2021-08-20 | CybORG: A Gym for the Development of Autonomous Cyber Agents | Maxwell Standen et.al. | 2108.09118 | null | 
| 2021-05-27 | Hopper: Modeling and Detecting Lateral Movement (Extended Report) | Grant Ho et.al. | 2105.13442 | link | 
| 2021-04-23 | Predicting Adversary Lateral Movement Patterns with Deep Learning | Nathan Danneman et.al. | 2104.13195 | null | 
| 2021-03-29 | Automating Defense Against Adversarial Attacks: Discovery of Vulnerabilities and Application of Multi-INT Imagery to Protect Deployed Models | Josh Kalin et.al. | 2103.15897 | null | 
| 2022-06-28 | Dynamically Modelling Heterogeneous Higher-Order Interactions for Malicious Behavior Detection in Event Logs | Corentin Larroche et.al. | 2103.15708 | link | 
| 2021-05-04 | An In-memory Embedding of CPython for Offensive Use | Ateeq Sharfuddin et.al. | 2103.15202 | null | 
| 2020-11-26 | Investigation on Research Ethics and Building a Benchmark | Shun Inagaki et.al. | 2011.13925 | null | 
| 2020-09-17 | Can ROS be used securely in industry? Red teaming ROS-Industrial | Víctor Mayoral-Vilches et.al. | 2009.08211 | null | 
| 2020-07-17 | HARMer: Cyber-attacks Automation and Evaluation | Simon Yusuf Enoch et.al. | 2006.14352 | null | 
| 2021-04-16 | HACK3D: Crowdsourcing the Assessment of Cybersecurity in Digital Manufacturing | Michael Linares et.al. | 2005.04368 | null | 
| 2020-03-11 | Passlab: A Password Security Tool for the Blue Team | Saul Johnson et.al. | 2003.07208 | null | 
| 2020-10-02 | SoK: A Survey of Open-Source Threat Emulators | Polina Zilberman et.al. | 2003.01518 | null | 
| 2020-02-26 | CybORG: An Autonomous Cyber Operations Research Gym | Callum Baillie et.al. | 2002.10667 | null | 
| 2021-01-29 | Anomaly Detection in Large Scale Networks with Latent Space Models | Wesley Lee et.al. | 1911.05522 | null | 
| 2019-06-17 | The Little Phone That Could Ch-Ch-Chroot | Jack Whitter-Jones et.al. | 1906.07242 | null | 
| 2019-06-12 | Relative Hausdorff Distance for Network Analysis | Sinan G. Aksoy et.al. | 1906.04936 | null | 
| 2019-10-24 | Quantifiable & Comparable Evaluations of Cyber Defensive Capabilities: A Survey & Novel, Unified Approach | Michael D. Iannacone et.al. | 1902.00053 | null | 
| 2018-10-13 | Two Can Play That Game: An Adversarial Evaluation of a Cyber-alert Inspection System | Ankit Shah et.al. | 1810.05921 | null | 
| 2018-02-27 | A Multi-Disciplinary Review of Knowledge Acquisition Methods: From Human to Autonomous Eliciting Agents | George Leu et.al. | 1802.09669 | null | 
| 2018-02-27 | Computational Red Teaming in a Sudoku Solving Context: Neural Network Based Skill Representation and Acquisition | George Leu et.al. | 1802.09660 | null | 
| 2018-02-26 | Shaping Influence and Influencing Shaping: A Computational Red Teaming Trust-based Swarm Intelligence Model | Jiangjun Tang et.al. | 1802.09647 | null | 
| 2018-01-06 | SLEUTH: Real-time Attack Scenario Reconstruction from COTS Audit Data | Md Nahid Hossain et.al. | 1801.02062 | null | 
| 2017-12-02 | Recurrent Neural Network Language Models for Open Vocabulary Event-Level Cyber Anomaly Detection | Aaron Tuor et.al. | 1712.00557 | link | 
| 2015-04-07 | Security Toolbox for Detecting Novel and Sophisticated Android Malware | Benjamin Holland et.al. | 1504.01693 | null |