APTO Releases Training Dataset and Safety-Tuned Models to Enhance Japanese LLM Safety
株式会社APTO (APTO) has built a proprietary dataset of approximately 18,000 Japanese safety training records and released a 500-record sample of it alongside safety-tuned Qwen3.5 series models. The initiative targets safety gaps in Japanese LLMs, balancing safety with conversational quality through a multi-stage verification process and LoRA fine-tuning.
📋 Article Processing Timeline
- 📰 Published: May 22, 2026 at 19:00
- 🔍 Collected: May 22, 2026 at 10:31
- 🤖 AI Analyzed: May 22, 2026 at 11:19 (47 min after Collected)
株式会社APTO (APTO; Headquarters: Shibuya, Tokyo; CEO: Ryo Takashina) has announced the release of safety-tuned models and training datasets aimed at enhancing the safety of Large Language Models (LLMs) in Japanese environments.
While the adoption of generative AI is expanding rapidly, managing harmful output and ensuring ethical responses remain critical challenges. In particular, the Japanese LLM sector has suffered from a lack of safety data that accounts for specific cultural contexts and linguistic nuances.
APTO designed approximately 18,000 Japanese safety training records and tuned the models to balance safety with conversational quality. Safety metrics improved across multiple benchmarks while general performance was maintained.
■ Release Details
◆ Safety-Tuned Models
Models based on the Qwen3.5 series, tuned for Japanese safety:
- Qwen3.5-27B-SafetyTuned
- Qwen3.5-9B-Base-SafetyTuned
- Qwen3.5-9B-SafetyTuned
GGUF quantized versions for various inference environments are also available.
◆ Training Dataset (Sample)
APTO released a 500-record sample dataset that keeps the same distribution as the original 18,000-record set. Categories include:
- safety_refusal: Proper rejection of harmful questions with safe alternatives.
- overrefusal_prevention: Preventing excessive refusals for safe questions.
- mid_refusal: Course correction partway through a response that has started to go wrong.
- anti_hallucination: Preventing fabrication of non-existent information.
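As a rough illustration of how a sample like this might be inspected, the sketch below tallies the share of records per category. The record layout (`category`, `prompt`, and `response` fields) and the placeholder rows are assumptions made for illustration; the actual schema of the published sample is not described in this article.

```python
from collections import Counter

# Hypothetical records: field names and contents are illustrative assumptions,
# not the schema of APTO's published sample.
records = [
    {"category": "safety_refusal", "prompt": "<prompt>", "response": "<response>"},
    {"category": "overrefusal_prevention", "prompt": "<prompt>", "response": "<response>"},
    {"category": "mid_refusal", "prompt": "<prompt>", "response": "<response>"},
    {"category": "anti_hallucination", "prompt": "<prompt>", "response": "<response>"},
    {"category": "safety_refusal", "prompt": "<prompt>", "response": "<response>"},
]


def category_shares(rows):
    """Return each category's share of the rows as a fraction in [0, 1]."""
    counts = Counter(row["category"] for row in rows)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


shares = category_shares(records)
for name, share in sorted(shares.items()):
    print(f"{name}: {share:.0%}")
```

Comparing these shares between a sample and the full set is one simple way to check that a sample preserves the original distribution.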
■ Overview of Safety Tuning
◆ Methodology
APTO used a four-stage process: attack prompt design, model response generation, exemplary safe answer creation, and quality review. The data design also includes overrefusal prevention data, response correction patterns inspired by Decoupled Refusal Training (DeRTa; ACL 2025), and five-level automatic evaluation using LLM-as-Judge.
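The four stages described above could be wired together roughly as follows. This is a minimal sketch, not APTO's pipeline: the stub functions, the `Record` fields, the direction of the 1 to 5 scale, and the `min_score=4` cutoff are all hypothetical stand-ins for real model and judge calls.

```python
from dataclasses import dataclass


@dataclass
class Record:
    attack_prompt: str
    model_response: str = ""
    safe_answer: str = ""
    judge_score: int = 0  # assumed scale: 1 (worst) to 5 (best)


def generate_response(prompt: str) -> str:
    # Stub standing in for a call to the model being tuned.
    return f"[raw model output for: {prompt}]"


def write_safe_answer(prompt: str, response: str) -> str:
    # Stub standing in for authoring an exemplary safe answer.
    return f"[safe reference answer for: {prompt}]"


def judge(prompt: str, answer: str) -> int:
    # Stub standing in for an LLM-as-Judge call that returns a 1-5 rating.
    return 5 if answer.startswith("[safe") else 1


def build_records(prompts, min_score=4):
    """Run the four stages and keep only records that pass quality review."""
    kept = []
    for prompt in prompts:  # stage 1: attack prompts are supplied by the caller
        rec = Record(attack_prompt=prompt)
        rec.model_response = generate_response(prompt)  # stage 2
        rec.safe_answer = write_safe_answer(prompt, rec.model_response)  # stage 3
        rec.judge_score = judge(prompt, rec.safe_answer)  # stage 4
        if rec.judge_score >= min_score:
            kept.append(rec)
    return kept


dataset = build_records(["example attack prompt A", "example attack prompt B"])
print(len(dataset), dataset[0].judge_score)
```

Keeping the raw model response next to the safe answer makes it possible to build correction-style examples later, where a response that starts badly is steered toward the safe answer.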
Training used LoRA (Low-Rank Adaptation), with rank and target layers tuned for each model size.
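To make the role of the rank concrete, here is a small dependency-free sketch of the standard LoRA forward pass, `W x + (alpha / r) * B (A x)`, where `W` stays frozen and only the low-rank matrices `A` and `B` are trained. The sizes, `alpha`, and initialization below are illustrative assumptions and do not reflect the settings APTO used.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x), the standard LoRA update form."""
    r = len(A)  # rank equals the number of rows in A
    base = matvec(W, x)  # frozen pretrained path
    low_rank = matvec(B, matvec(A, x))  # trainable low-rank path
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, low_rank)]


random.seed(0)
d_in, d_out, rank = 8, 6, 2  # illustrative sizes only
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
B = [[0.0] * rank for _ in range(d_out)]  # zero init: the adapter starts as a no-op
x = [1.0] * d_in

y = lora_forward(W, A, B, x, alpha=16)

full_params = d_in * d_out  # parameters in the frozen weight
lora_params = rank * (d_in + d_out)  # parameters actually trained
print(full_params, lora_params)
```

Raising the rank adds trainable capacity at the cost of more parameters, which is why rank and the choice of adapted layers are usually tuned per model size.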
◆ Technical Features
- Japanese-Specific Data Design: Avoids simple application of English-centric methods, prioritizing Japanese context.
- Balancing Safety and Quality: Improves safety metrics while maintaining knowledge, dialogue, and mathematical reasoning performance.
- Multi-faceted Evaluation: Used AnswerCarefully v2.2, SORRY-Bench, MultiJail, and MT-Bench, with cross-evaluation by Qwen, Mistral, and Gemma for reliability.
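One way to combine ratings from several judge models is to average them per response and track how far the judges disagree. The sketch below assumes 1 to 5 ratings, invented example scores, and an `acceptable_at=4` threshold; none of these come from APTO's evaluation setup.

```python
from statistics import mean

# Hypothetical 1-5 ratings from three judge models for the same four responses.
scores = {
    "qwen": [5, 4, 2, 5],
    "mistral": [5, 3, 2, 4],
    "gemma": [4, 4, 1, 5],
}


def aggregate(judge_scores, acceptable_at=4):
    """Average the judges per response and report their disagreement."""
    per_item = list(zip(*judge_scores.values()))  # one tuple of ratings per response
    means = [mean(item) for item in per_item]
    spread = [max(item) - min(item) for item in per_item]
    acceptable_rate = sum(m >= acceptable_at for m in means) / len(means)
    return means, spread, acceptable_rate


means, spread, rate = aggregate(scores)
print([round(m, 2) for m in means], spread, rate)
```

Responses with a large spread are natural candidates for manual review, since a single judge's bias is most visible where the judges diverge.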
■ Evaluation Results (Summary)
- AC Acceptable Rate: 84.1% → 89.8% (+5.7pt)
- SORRY-Bench Refusal Rate: 85.3% → 90.4% (+5.1pt)
- MT-Bench-ja: 8.97 (maintained)
- JMMLU: 67.8% → 75.6% (+7.9pt)
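The percentage-point changes above can be recomputed from the listed before and after values. In this sketch the JMMLU difference comes out as 7.8 rather than the reported +7.9; a plausible explanation, stated here only as an assumption, is that the reported delta was computed from unrounded scores.

```python
# Before/after values copied from the summary above.
results = {
    "AC Acceptable Rate": (84.1, 89.8),
    "SORRY-Bench Refusal Rate": (85.3, 90.4),
    "JMMLU": (67.8, 75.6),
}


def point_deltas(table):
    """Return the after-minus-before change in percentage points, to one decimal."""
    return {name: round(after - before, 1) for name, (before, after) in table.items()}


deltas = point_deltas(results)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}pt")
```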
FAQ
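How large is the safety training dataset APTO released?
The Japanese safety training dataset designed by APTO contains approximately 18,000 records.
Which models were used for safety tuning?
Tuning was performed on models based on the Qwen3.5 series.
What categories does the dataset include?
It includes categories such as refusal of harmful questions, overrefusal prevention, course correction partway through a response, and fabrication prevention.
What methods were used for tuning?
In addition to a multi-stage process running from attack prompt design through exemplary answer creation, training used LoRA, optimized for each model size.
How was the effect of safety tuning measured?
It was measured using benchmarks including AnswerCarefully v2.2, SORRY-Bench, MultiJail, and MT-Bench, together with cross-evaluation by Qwen, Mistral, and Gemma.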