REAL WORLD Inc. (RLWRLD; Japanese corporation: Chiyoda-ku, Tokyo; Representative: Junhee Ryu), which develops world-leading physical AI foundation models, officially released its proprietary Robotics Foundation Model (RFM), 'RLDX-1 (RealDex)', on May 7, 2026. The model follows a 'Dexterity-First' design that processes force, touch, and working memory alongside vision and language within a single model. According to the company, this approach, which departs from conventional vision- and language-centric VLA (Vision-Language-Action) models, is what enables its high performance.

A New Design Philosophy Starting with Dexterity: 'Dexterity-First'

The industry has generally assumed that 'if intelligence is achieved first, dexterity will naturally follow.' REAL WORLD (RLWRLD) proposes a different approach. RLDX-1 is built on the premise that 'dexterity is not something that follows as a result of intelligence, but rather an essential process when intelligence acts in the physical world.' This reflects the view that precise automation of manual tasks in industrial settings requires handling signals that vision alone cannot capture, such as force (torque), touch, and contact timing.

To demonstrate this concept, REAL WORLD (RLWRLD) focused on repetitive manual tasks in industrial settings and systematically defined them through its proprietary benchmark, 'DexBench.' Specifically, the benchmark quantitatively evaluates hand manipulation capabilities from five perspectives: ① Grasp Diversity, ② Spatial Precision, ③ Temporal Precision, ④ Contact Precision, and ⑤ Context Awareness.

High Performance Confirmed in Global Benchmarks and Superiority Demonstrated in Real Robot Environments

RLDX-1 recorded results surpassing existing State-of-the-Art (SOTA) models such as NVIDIA's GR00T and Physical Intelligence's π0 (Pi Zero) on eight global public benchmarks. On 'RoboCasa Kitchen,' designed for long-horizon, contact-centric tasks, it scored 70.6 points, reaching the 70-point range as a VLA model. On the humanoid-specific evaluation 'GR-1 Tabletop,' it achieved 58.7 points, 10.7 points ahead of the next-best model. Furthermore, on 'LIBERO-Plus,' which evaluates robustness against seven variables including camera, lighting, language, and background, it recorded 86.7%, outperforming the comparison models.

An even larger gap was observed in real robot environments. In an evaluation using WiRobotics' humanoid 'ALLEX,' which REAL WORLD (RLWRLD) helped develop, RLDX-1 achieved a 70.8% success rate on the 'Pot-to-Cup Pouring' task, which involves dynamic weight changes. Because the comparison models remained in the high 30% range, this represents approximately double their performance.
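As a quick arithmetic check on the figures quoted above, the following minimal Python sketch derives the implied score of the next-best model on 'GR-1 Tabletop' and the ratio behind the 'approximately double' statement. The release does not give exact scores for the comparison models, so the 35.0 to 39.9 band used for 'high 30% range' is an assumption made only for illustration.

```python
# Figures quoted in the release.
gr1_tabletop_score = 58.7      # points, RLDX-1 on GR-1 Tabletop
gr1_tabletop_margin = 10.7     # points ahead of the next-best model
pouring_success = 70.8         # percent, RLDX-1 on Pot-to-Cup Pouring

# Implied score of the next-best model on GR-1 Tabletop.
next_best_gr1 = round(gr1_tabletop_score - gr1_tabletop_margin, 1)

# ASSUMPTION: "high 30% range" is read as 35.0% to 39.9%.
# The release does not state the exact baseline values.
baseline_low, baseline_high = 35.0, 39.9

# Ratio of RLDX-1's success rate to the assumed baseline band.
ratio_min = pouring_success / baseline_high
ratio_max = pouring_success / baseline_low

print(f"Implied next-best GR-1 Tabletop score: {next_best_gr1:.1f}")
print(f"Pouring success ratio: {ratio_min:.2f}x to {ratio_max:.2f}x")
```

Under that assumed band, the ratio falls between roughly 1.8 and 2.0, which is consistent with the 'approximately double' description.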

Adopts 'MSAT' Core Architecture that 'Sees, Feels, Remembers, and Adapts'

REAL WORLD (RLWRLD)'s technological core is the Multi-Stream Action Transformer (MSAT). Whereas conventional VLA (Vision-Language-Action) models process different signals, such as vision, language, action, touch, and memory, in a single stream, MSAT establishes an independent stream for each modality and integrates them through joint attention across modalities. Furthermore, physical signals that vision cannot capture, such as force (torque) and touch, as well as long-term memory, are processed by dedicated modules (Physics Module, Memory Module), enabling a single model to 'see, feel, remember, and adapt.'
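The general idea of independent per-modality streams combined through joint attention can be illustrated with a minimal, dependency-free Python sketch. This is not RLDX-1's implementation, which the release does not describe in detail. The token counts, embedding width, random weights, and the function name joint_attention are all hypothetical, and the sketch assumes the simplest form of joint attention: each modality keeps its own projection weights, and a single softmax attention runs over the concatenated tokens of all streams.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    """Small random matrix standing in for learned parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def joint_attention(streams, weights):
    """Hypothetical joint attention across modality streams.

    streams: dict of modality name -> token matrix (n_tokens x dim)
    weights: dict of modality name -> (Wq, Wk, Wv), separate per modality
    """
    names = list(streams)
    q, k, v, lengths = [], [], [], []
    for name in names:
        wq, wk, wv = weights[name]          # per-stream projections
        q += matmul(streams[name], wq)
        k += matmul(streams[name], wk)
        v += matmul(streams[name], wv)
        lengths.append(len(streams[name]))
    dim = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    # One attention over all tokens, so every stream can attend to every other.
    scores = matmul(q, k_t)
    attn = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    mixed = matmul(attn, v)
    # Split the mixed tokens back into their own streams.
    out, start = {}, 0
    for name, n in zip(names, lengths):
        out[name] = mixed[start:start + n]
        start += n
    return out, attn

rng = random.Random(0)
DIM = 4  # assumed embedding width, for illustration only
token_counts = {"vision": 3, "language": 2, "action": 2, "touch": 1, "memory": 1}
streams = {m: random_matrix(n, DIM, rng) for m, n in token_counts.items()}
weights = {m: tuple(random_matrix(DIM, DIM, rng) for _ in range(3))
           for m in token_counts}

outputs, attn = joint_attention(streams, weights)
for name, tokens in outputs.items():
    print(name, len(tokens), "tokens of width", len(tokens[0]))
```

In this sketch each modality's tokens pass through their own parameters before and after the shared attention step, so a stream such as touch retains its own representation while still exchanging information with vision, language, action, and memory.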

Paik Jae-kyung, CTO of REAL WORLD (RLWRLD), stated, 'The core of RLDX-1 is the separation of structures to allow each modality to fully exhibit its characteristics. The ability to precisely capture the moment of contact through torque signals and infer dynamic changes over time is an area that was structurally difficult to handle with conventional VLAs.'

FACT BOX

  • Source: PR TIMES
  • Category: New Product
  • Organizations: NVIDIA / Physical Intelligence / WiRobotics
  • Products / services: DexBench / MSAT (Multi-Stream Action Transformer)