NTT Establishes World's First 'Token Commonization' Technology to Overcome the 'Vocabulary Barrier' Between LLMs

NTT has established the world's first 'token commonization' technology that overcomes the 'vocabulary barrier' between different Large Language Models (LLMs). This technology allows for the flexible reduction of LLM token vocabularies during inference without accuracy degradation, enabling knowledge integration and transfer between heterogeneous LLMs.
Source: PR Times

Published: April 23, 2026

Key Points of the Announcement:

NTT has established the world's first theory and algorithm for flexibly reducing the vocabulary of 'tokens,' the input/output units of Large Language Model (LLM) inference, without degrading inference accuracy.

This technology enables collaboration between arbitrary heterogeneous LLMs via a common vocabulary set.

By applying this technology to collaboration techniques such as ensemble and NTT's unique portable tuning, knowledge integration and transfer can be achieved among a wider variety of heterogeneous LLMs.

NTT Corporation (Headquarters: Chiyoda-ku, Tokyo; President and CEO: Akira Shimada; hereinafter 'NTT') has established the world's first inference technology that reduces the vocabulary of 'tokens,' the input/output units of Large Language Models (LLMs), without degrading accuracy, and that makes token vocabularies common even across different LLMs. Until now, inference-time collaboration among multiple LLMs, of which ensemble*1 is a representative example, required the token vocabularies of all participating LLMs to match. This technology removes that constraint. Inference-time collaborations such as ensemble and NTT's unique portable tuning*2, which were previously difficult, become possible between arbitrary heterogeneous LLMs, and knowledge integration and transfer yield higher accuracy. This achievement will be presented at the International Conference on Learning Representations (ICLR) 2026*3, the most competitive international conference in the field of deep learning, held in Rio de Janeiro, Brazil, from April 23 to 27, 2026.

1. Background

In recent years, Large Language Models (LLMs) have spread rapidly as AI that can perform inference in natural language. Rather than outputting text one character at a time, LLMs work efficiently in subword units called 'tokens.' More specifically, they repeatedly perform 'next token prediction,' which assigns a probability to each candidate for the next output token, and proceed with inference based on those predictions. The set of candidate tokens is called the 'token vocabulary' and contains tens of thousands to hundreds of thousands of tokens. Token vocabularies commonly differ between LLMs, especially those developed by different organizations or at different times. When the vocabularies differ, the models' next token predictions cannot be compared with or referenced against one another during inference. This 'vocabulary barrier' has made it difficult to apply token-level collaboration technologies between heterogeneous LLMs, such as ensemble, which improves inference accuracy by integrating the predictions of multiple LLMs, and portable tuning, which transfers specialized knowledge to another model.
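The vocabulary barrier can be made concrete with a small sketch. The two vocabularies and the raw scores below are hypothetical toy values, not taken from any real LLM; the point is only that each model's prediction is a probability table keyed by its own tokens, so entries that exist in one table have no counterpart in the other.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_prediction(vocab, logits):
    """Map each token of a vocabulary to its predicted probability."""
    return dict(zip(vocab, softmax(logits)))

# Hypothetical toy vocabularies; real ones hold tens to hundreds of
# thousands of tokens.
vocab_a = ["the", "cat", "s", "cats", "<eos>"]
vocab_b = ["the", "ca", "t", "s", "<eos>"]

# Hypothetical raw scores for one inference step.
pred_a = next_token_prediction(vocab_a, [0.1, 2.0, 0.3, 1.5, -1.0])
pred_b = next_token_prediction(vocab_b, [0.2, 1.8, 0.4, 0.1, -0.5])

# Tokens that only one model can predict cannot be compared directly.
only_a = sorted(set(vocab_a) - set(vocab_b))
only_b = sorted(set(vocab_b) - set(vocab_a))
shared = sorted(set(vocab_a) & set(vocab_b))

print("only in A:", only_a)
print("only in B:", only_b)
print("shared:   ", shared)
```

Here model A can place probability on 'cat' or 'cats' as single tokens, while model B would have to spell the same text as 'ca' followed by 't', so averaging the two tables entry by entry is not meaningful.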

2. Overview of Research Results

In this research, we established the world's first technology that can flexibly reduce the token vocabulary an LLM uses for inference without accuracy degradation (Figure 1). Specifically, at each inference step, the next token prediction that the LLM computes over all of its tokens is converted into a prediction whose only candidates are a specified subset of tokens (a partial vocabulary). The conversion algorithm is designed, on the basis of a unique theory, so that the overall tendency of the final output text does not change. This makes inference with an arbitrary partial vocabulary possible without degrading the accuracy of the original LLM. Applied to LLMs with different vocabularies, the technology allows them to collaborate at inference time on their 'maximum common vocabulary' (Figure 2). In other words, inference-time collaborations that token vocabulary mismatches previously hindered, such as knowledge integration through ensemble and knowledge transfer through portable tuning, can now be realized between heterogeneous LLMs via common tokens. Experiments on ensembles of LLMs with different token vocabularies confirmed that collaboration via common tokens is possible while maintaining the performance of each LLM, and that this collaboration can improve inference accuracy.

Figure 1: The established conversion technology flexibly reduces the token vocabulary while maintaining the LLM's output tendencies.

Figure 2: Ensemble is possible even between LLMs with different vocabularies via a common vocabulary.
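Once every model predicts over the same tokens, an ensemble reduces to combining probability tables. The sketch below rests on two assumptions that the announcement does not spell out: that the 'maximum common vocabulary' is the plain intersection of the vocabularies, and that the ensemble is a weighted average of probabilities. The function names and the converted distributions are hypothetical placeholders; NTT's conversion algorithm itself is not reproduced here.

```python
def common_vocabulary(*vocabs):
    """Tokens present in every vocabulary (assumed: plain set intersection)."""
    common = set(vocabs[0])
    for vocab in vocabs[1:]:
        common &= set(vocab)
    return sorted(common)

def ensemble(distributions, weights=None):
    """Weighted average of next-token distributions over identical tokens."""
    n = len(distributions)
    weights = weights or [1.0 / n] * n
    tokens = distributions[0].keys()
    return {
        t: sum(w * d[t] for w, d in zip(weights, distributions))
        for t in tokens
    }

# Hypothetical toy vocabularies of two heterogeneous models.
vocab_a = ["a", "b", "ab", "<eos>"]
vocab_b = ["a", "b", "ba", "<eos>"]
common = common_vocabulary(vocab_a, vocab_b)

# Placeholder values standing in for each model's prediction AFTER it has
# been converted onto the common vocabulary by some lossless conversion.
converted_a = {"<eos>": 0.1, "a": 0.6, "b": 0.3}
converted_b = {"<eos>": 0.2, "a": 0.4, "b": 0.4}

combined = ensemble([converted_a, converted_b])
best_token = max(combined, key=combined.get)

print(common)
print(combined, best_token)
```

Averaging probabilities is only one possible combination rule; averaging log-probabilities or using unequal weights would fit the same interface.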

3. Key Points of the Technology

① New Theory for Flexible Vocabulary Reduction 'Without Loss'

We established the world's first theory that enables next token prediction with fewer token candidates (a partial vocabulary) while maintaining the output quality of the LLM. The simplest way to reduce the candidates is to delete the information about unwanted tokens from the next token prediction, but this degrades performance significantly because strings that should have been output can no longer be generated. In this research, we constructed a theoretical framework that handles the token probabilities over the original vocabulary and those over the partial vocabulary in a unified way, and on that basis designed a conversion algorithm that keeps the distribution of the final output string invariant.
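The difference between naive reduction and a reduction that keeps the output string distribution invariant can be seen in a toy case. The sketch below is not NTT's algorithm, which the announcement does not detail. It assumes a hypothetical model that emits exactly one token and then stops, and it obtains the distribution-preserving prediction by brute force over all output strings, which is only feasible at toy scale. All names and probabilities are illustrative.

```python
from fractions import Fraction as F

EOS = "<eos>"

# Hypothetical toy model: it emits exactly one token and then stops,
# so each token is also a complete output string.
original = {"ab": F(6, 10), "a": F(3, 10), "b": F(1, 10)}
partial_vocab = ["a", "b"]  # the token "ab" is to be removed

def naive_reduction(pred, keep):
    """Delete unwanted tokens and renormalize what is left."""
    kept = {t: p for t, p in pred.items() if t in keep}
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}

def prefix_mass(string_probs, prefix):
    """Total probability of output strings that start with `prefix`."""
    return sum(p for s, p in string_probs.items() if s.startswith(prefix))

def next_token_probs(string_probs, prefix, keep):
    """Next-token prediction over the partial vocabulary, defined as the
    conditional probability under the ORIGINAL string distribution.
    Assumes every token in `keep` is a single character."""
    mass = prefix_mass(string_probs, prefix)
    probs = {t: prefix_mass(string_probs, prefix + t) / mass for t in keep}
    probs[EOS] = string_probs.get(prefix, F(0)) / mass
    return probs

def regenerate(string_probs, keep, prefix="", reach=F(1)):
    """String distribution obtained by generating step by step with the
    partial vocabulary only."""
    out = {}
    for tok, p in next_token_probs(string_probs, prefix, keep).items():
        if p == 0:
            continue
        if tok == EOS:
            out[prefix] = reach * p
        else:
            out.update(regenerate(string_probs, keep, prefix + tok, reach * p))
    return out

naive = naive_reduction(original, partial_vocab)
preserved = regenerate(original, partial_vocab)

print("naive:    ", naive)      # the string "ab" can no longer be produced
print("preserved:", preserved)  # same string distribution as `original`
```

In the naive case the probability of the removed token is redistributed to 'a' and 'b', so the most likely output 'ab' disappears. In the preserving case that probability is carried by the two-step path 'a' then 'b', and the distribution over output strings is unchanged.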