Ministry of Digital Affairs Collaborates with Academia Sinica to Release Research Corpus, Boosting Diverse AI Applications
The Ministry of Digital Affairs (MODA) and Academia Sinica are collaborating to release research corpora to the Taiwan Sovereign AI Training Corpus, supporting diverse AI applications. Academia Sinica has uploaded over 6.2 million tokens of high-quality traditional Chinese corpora, including academic research, policy analysis, historical culture, and popular science texts. These specialized corpora enhance AI models' understanding and inference accuracy in specific fields, contributing to the development of RAG knowledge bases and professional Q&A systems. Since its launch late last year, the corpus has accumulated over 3,000 datasets and 1.2 billion tokens, with plans for continued expansion.
📋 Article Processing Timeline
- 📰 Published: April 10, 2026 at 17:23
The Ministry of Digital Affairs (MODA) is promoting the Taiwan Sovereign AI Training Corpus and collecting high-quality, localized Traditional Chinese text from a wide range of sources. Since the corpus went online late last year, MODA has continued working with partner organizations to expand its content. AI model trainers can apply for access and obtain the latest datasets to broaden the range of AI applications they can build.
In a press release issued today, MODA announced that Academia Sinica has uploaded several representative collections of research and popular science texts, covering academic research, policy analysis, history and culture, and popular science. The collections total more than 6.2 million tokens and offer both professional depth and knowledge value.
According to MODA, the corpora uploaded by Academia Sinica include "Policy Recommendation Reports," which cover policy recommendations in areas such as agriculture, technology, and financial reform and provide in-depth analysis and forward-looking perspectives on major issues. "Selected Research Achievements" compiles highlights of Taiwan's research across the humanities and social sciences, mathematics and natural sciences, and life sciences, helping AI models grasp the background knowledge and context of different fields more accurately.
MODA added that "Research Insights" and "Popular Science Columns" present scientific knowledge in a vivid, accessible way, turning complex content into easily understood explanations. They serve as important material for AI to learn varied tones and knowledge translation. "Selected Collections from the Institute of Taiwan History" contains rich local history and cultural memory, deepening the model's understanding of Taiwan's historical perspectives. The "Academia Sinica Integrity E-newsletter" adds material on research ethics and institutional matters, strengthening the model's judgment and responses on ethical issues.
MODA explained that specialized domain corpora improve a model's understanding and performance in specific contexts and fields, and help strengthen its inference capabilities and response accuracy. Such corpora can be used to build Retrieval-Augmented Generation (RAG) knowledge bases, develop professional Q&A systems, fine-tune models, and support tasks such as summarization, classification, and knowledge extraction, promoting the development of deeper, more specialized AI applications.
MODA stated that since its launch late last year, the Taiwan Sovereign AI Training Corpus has accumulated more than 3,000 datasets and over 1.2 billion tokens. MODA will continue to work with government agencies and academic research institutions to release more text data with Taiwanese characteristics and professional value, jointly strengthening the foundation for Taiwan's sovereign AI development. (Edited by Yang Lan-hsuan) 1150410
FAQ
What kind of corpora are provided in the Taiwan Sovereign AI Training Corpus?
The corpus collects high-quality, localized Traditional Chinese text. Academia Sinica's latest contribution adds over 6.2 million tokens, including academic research, policy analysis, historical and cultural material, and popular science texts.
What benefits do specialized knowledge corpora bring to AI models?
They enhance the model's understanding and inference accuracy in specific fields, contributing to the development of RAG knowledge bases, professional Q&A systems, model fine-tuning, summarization, classification, and knowledge extraction tasks.