Aladdin Security's "Compound Jailbreak" Research Paper Accepted at International Conference on Artificial Intelligence and Cybersecurity (AISEC 2026) and Japanese Society for Artificial Intelligence National Conference (JSAI 2026)

Aladdin Security's research paper on "Compound Jailbreaks" has been accepted by the International Conference on Artificial Intelligence and Cybersecurity (AISEC 2026) and the Japanese Society for Artificial Intelligence National Conference (JSAI 2026). This research theoretically and empirically demonstrates that LLM safety mechanisms can be bypassed by combining individually preventable attacks.
Source: PR Times

Published: May 1, 2026 at 19:10

Aladdin Security Inc. (Headquarters: Kyoto City, Kyoto Prefecture; Representative Director and CEO: Keigo Kansa) announced that its research team's paper on the safety evaluation of large language models (LLMs), titled "Generalization Limits of Reinforcement Learning Alignment: Detecting LLM Vulnerabilities through Compound Jailbreaks," has been accepted by two conferences: the International Conference on Artificial Intelligence and Cybersecurity (AISEC 2026), an international conference specializing in the intersection of AI and cybersecurity, and the 40th Japanese Society for Artificial Intelligence National Conference (JSAI 2026), one of Japan's largest AI research conferences.

This research theoretically and empirically demonstrates that current mainstream LLM safety mechanisms can be bypassed by combining "individually preventable attacks," and its findings directly contribute to the proactive safety evaluation of Sovereign AI (national sovereign AI).

About the Accepted Conferences

【International Conference】International Conference on Artificial Intelligence and Cybersecurity (AISEC 2026)

Official Name: International Conference on Artificial Intelligence and Cybersecurity 2026

Positioning: An international conference specializing in the intersection of AI and cybersecurity. It is a venue where researchers from around the world present their findings after peer review, and the acceptance of this paper indicates that Aladdin Security's white-box AI security research has gained international recognition.

Japanese Society for Artificial Intelligence National Conference (JSAI 2026)

Official Name: The 40th Annual Conference of the Japanese Society for Artificial Intelligence

Positioning: One of Japan's largest conferences for artificial intelligence research, where cutting-edge AI research results are presented.

Background of the Research ── Is Reinforcement Learning Alignment Truly "Generalizing"?

Since the advent of ChatGPT, LLMs have been used in a wide range of areas such as dialogue, code generation, and document creation, but they also carry risks such as generating harmful information, misinformation, and malicious code. In response, modern LLMs implement multi-layered safety mechanisms combining RLHF (Reinforcement Learning from Human Feedback), Instruction Hierarchy, and Deliberative Alignment.

However, it has remained unclear whether these methods generalize to unknown attack patterns. Recent theoretical research points out that "training through reinforcement learning is not the acquisition of new capabilities, but merely a redistribution of the probability of utilizing existing capabilities." Applied to safety training, this suggests that the distribution of attack patterns included in the training data structurally constrains how far safety mechanisms can generalize.
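
One way to make the "redistribution" claim concrete is the standard KL-regularized reinforcement learning objective. This formulation is an assumption chosen for illustration; the source does not say which objective the cited theoretical work analyzes. The symbols are likewise illustrative: \pi_ref is the model before safety training, r is a reward, and \beta > 0 is the regularization strength.

```latex
\pi^{*} = \arg\max_{\pi}\;
  \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
\quad\Longrightarrow\quad
\pi^{*}(y \mid x) =
  \frac{\pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big( r(x, y) / \beta \big)}
       {\sum_{y'} \pi_{\mathrm{ref}}(y' \mid x)\, \exp\!\big( r(x, y') / \beta \big)}
```

Under this assumption the trained policy is the reference policy reweighted by the reward: probability mass moves among responses the reference model could already produce, and the reweighting is only shaped by the prompts on which the reward was actually applied during training.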

Overview of the Research ── "Compound Jailbreaks"

The research targets OpenAI's open-weight model gpt-oss-20b and proposes a new attack paradigm called "Compound Jailbreaks," which combines multiple attack methods, each of which is defended against when used alone, to saturate the cognitive resources of the LLM.

The core Compound Role-Playing approach combines the following three elements:

Contrastive Structure: Juxtaposing pairs of harmful and harmless responses in a Markdown table format, framing harmful content generation as "educational text."

Authoritative Persona: Assigning roles such as security expert or medical professional to imply the legitimacy of harmful content generation.

Self-Assessment Demand: Requesting confidence score output in JSON format, adding metacognitive load.

The difference from conventional jailbreak research is that each element functions not as a "contradiction" but as a "cognitive load," saturating the process of maintaining the instruction hierarchy itself.
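
From a defender's point of view, the three elements described above are structural features that can be checked for in an incoming prompt. The following is a minimal sketch of such a check, not the paper's method: the regular expressions, the function names `structural_signals` and `is_compound`, and the rule "flag only when all three co-occur" are all assumptions made for illustration, and simple keyword heuristics like these would be easy to evade in practice.

```python
import re

# Hypothetical heuristics, one per element named in the article.
# A Markdown table row: a line that starts and ends with a pipe.
TABLE_ROW = re.compile(r"^[ \t]*\|.*\|[ \t]*$", re.MULTILINE)
# A role assignment that ends in an authority noun within the same sentence.
PERSONA = re.compile(
    r"\b(?:you are|act as|as an?)\b[^.\n]*\b(?:expert|professional|specialist)\b",
    re.IGNORECASE,
)
# A self-assessment demand: both "confidence" and "JSON" appear.
CONFIDENCE = re.compile(r"\bconfidence\b", re.IGNORECASE)
JSON_WORD = re.compile(r"\bjson\b", re.IGNORECASE)


def structural_signals(prompt: str) -> dict[str, bool]:
    """Report which of the three structural elements appear in the prompt."""
    return {
        "contrastive_table": len(TABLE_ROW.findall(prompt)) >= 2,
        "authoritative_persona": PERSONA.search(prompt) is not None,
        "self_assessment": bool(
            CONFIDENCE.search(prompt) and JSON_WORD.search(prompt)
        ),
    }


def is_compound(prompt: str) -> bool:
    """Flag a prompt only when all three elements co-occur."""
    return all(structural_signals(prompt).values())


if __name__ == "__main__":
    sample = (
        "You are a security expert.\n"
        "| Safe reply | Unsafe reply |\n"
        "|---|---|\n"
        "Report a confidence score in JSON."
    )
    print(structural_signals(sample))
    print(is_compound(sample))
```

Requiring co-occurrence mirrors the article's point that each element is handled when it appears alone; a filter of this kind would serve only as a cheap first-pass signal for routing prompts to closer review.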

Figure 1: Mechanism to bypass safety mechanisms by combining three elements and saturating cognitive resources.

Key Findings

Finding 1: Single Attack ASR 14.3% → Compound Attack ASR 71.4%
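
As a quick arithmetic check of the two rates in this finding, the sketch below assumes ASR is the share of prompts on which the attack succeeds, measured over the 70-prompt evaluation set described for this finding. The success counts are not stated in the source; they are inferred here as the integers out of 70 that round to the reported percentages, and the function name `asr` is hypothetical.

```python
def asr(successes: int, total: int) -> float:
    """Attack success rate as a percentage of evaluated prompts."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * successes / total


TOTAL_PROMPTS = 70  # stated in the source

# Assumption: counts inferred from the reported percentages, not given in the source.
single_successes = round(0.143 * TOTAL_PROMPTS)
compound_successes = round(0.714 * TOTAL_PROMPTS)

single_asr = asr(single_successes, TOTAL_PROMPTS)
compound_asr = asr(compound_successes, TOTAL_PROMPTS)

if __name__ == "__main__":
    print(f"single:   {single_successes}/{TOTAL_PROMPTS} = {single_asr:.1f}%")
    print(f"compound: {compound_successes}/{TOTAL_PROMPTS} = {compound_asr:.1f}%")
    print(f"difference: {compound_asr - single_asr:.1f} percentage points")
    print(f"ratio: {compound_asr / single_asr:.1f}x")
```

Under that assumption the reported figures correspond to 10 of 70 prompts for a single method and 50 of 70 for the compound attack, a fivefold increase.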

In an evaluation using a total of 70 prompts across 7 categories (biological weapons, malware, phishing, illegal drugs, weapon manufacturing, fraud, and personal information leakage), the attack success rate (ASR) jumped from 14.3% with a single method to 71.4% with compounding. This is Instruct