News Jun 01, 2026 5 min read

NVIDIA and HuggingFace Launch Cosmos 3: The First Open Omni-Model Unifying Physical AI Reasoning and Action

Eric Samuels Updated: Jun 01, 2026

In a move that could reshape embodied AI, HuggingFace and NVIDIA have released Cosmos 3, the first open-source omni-model designed to fuse perception, reasoning, and physical action into a single architecture. Announced on the HuggingFace blog, Cosmos 3 departs from earlier specialized models that handled vision, language, and motion control separately, and aims instead to be one system that both understands and interacts with the physical world.

What Is an Omni-Model and Why Does It Matter?

Traditional robotics pipelines stack separate models: one for object detection, another for path planning, and yet another for motor control. This modular approach adds latency at every hand-off and lets errors from early stages propagate downstream. Cosmos 3 instead integrates visual, linguistic, and motor signals into a single transformer-based neural network. According to the HuggingFace announcement, Cosmos 3 can take a written instruction like “pick up the blue mug from the table and place it on the coaster,” process live camera feeds, reason about object affordances and spatial constraints, and output precise joint angles to execute the action, all within a single forward pass.
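To make the latency and error-propagation point concrete, here is a minimal sketch of how failures and delays compound across a stacked pipeline. The per-stage success rates and latencies are hypothetical numbers chosen for illustration, not measurements from the announcement, and the calculation assumes that stage failures are independent.

```python
from math import prod


def pipeline_success(stage_success_rates):
    """Probability that every stage succeeds, assuming independent failures."""
    return prod(stage_success_rates)


def pipeline_latency_ms(stage_latencies_ms):
    """Total latency when stages run strictly one after another."""
    return sum(stage_latencies_ms)


# Hypothetical stages: object detection, path planning, motor control.
success_rates = [0.95, 0.95, 0.95]
latencies_ms = [20, 30, 10]

print(f"end-to-end success: {pipeline_success(success_rates):.3f}")
print(f"end-to-end latency: {pipeline_latency_ms(latencies_ms)} ms")
```

Under these assumptions, three stages that each look reliable on their own still lose a noticeable share of episodes end to end, which is the motivation the article gives for a single model.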

The open release includes model weights, training code, and a curated dataset of over 10 million multimodal episodes collected from simulated and real-world robotic environments. Developers can download the model from HuggingFace’s model hub and fine-tune it for specific platforms, from industrial arms to humanoid robots.

Architecture and Benchmark Performance

Cosmos 3 is built on a decoder-only transformer architecture with 2.8 billion parameters, designed to run inference on a single NVIDIA A100 GPU. The model uses a novel tokenization scheme that treats image pixels, text tokens, and action commands as one unified sequence, allowing the attention mechanism to learn cross-modal relationships end-to-end.
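The announcement as reported here does not spell out the tokenization scheme, so the following is only a sketch of one common way to place several modalities in a single sequence: give each modality its own disjoint range of token ids and quantize continuous actions into bins. The vocabulary sizes, the use of discrete image codes, and every function name below are assumptions for illustration, not the Cosmos 3 tokenizer.

```python
# All sizes are hypothetical, chosen only to illustrate the idea.
TEXT_VOCAB = 32_000   # text token ids occupy [0, 32000)
IMAGE_VOCAB = 8_192   # discrete image codes (an assumption, not raw pixels)
ACTION_BINS = 256     # each joint value is quantized into 256 bins

IMAGE_OFFSET = TEXT_VOCAB
ACTION_OFFSET = TEXT_VOCAB + IMAGE_VOCAB


def quantize_action(value, low=-1.0, high=1.0, bins=ACTION_BINS):
    """Map a continuous value in [low, high] to an integer bin in [0, bins - 1]."""
    clipped = min(max(value, low), high)
    frac = (clipped - low) / (high - low)
    return min(int(frac * bins), bins - 1)


def build_sequence(text_ids, image_codes, joint_targets):
    """Concatenate the three modalities into one id sequence with disjoint ranges."""
    seq = list(text_ids)
    seq += [IMAGE_OFFSET + code for code in image_codes]
    seq += [ACTION_OFFSET + quantize_action(a) for a in joint_targets]
    return seq


def modality_of(token_id):
    """Recover which modality a token id belongs to from its range."""
    if token_id < IMAGE_OFFSET:
        return "text"
    if token_id < ACTION_OFFSET:
        return "image"
    return "action"


sequence = build_sequence([5, 17], [3, 8191], [-1.0, 0.0, 1.0])
print(sequence)
print([modality_of(t) for t in sequence])
```

Because every id maps back to exactly one modality, a single decoder can attend over the whole sequence while the output head still knows whether it is predicting a word or a joint command.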

In benchmark results shared by NVIDIA, Cosmos 3 achieved a 92% success rate on the MetaWorld assembly tasks, outperforming specialized models by 14 percentage points on average. On the more challenging CALVIN benchmark (which requires following long-horizon, language-conditioned manipulation sequences), Cosmos 3 scored 78% task completion, a 22-percentage-point improvement over prior open-source models.

  • MetaWorld Assembly: 92% success (vs. 78% previous best)
  • CALVIN Long-Horizon: 78% completion (vs. 56% previous best)
  • Real-world Pick-and-Place: 86% success on unseen objects with 5-shot adaptation
  • Inference speed: 12 ms per action step on an A100
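The gaps in the list above can be read either as absolute differences in percentage points or as relative improvements over the previous best, and the two readings give quite different numbers. The short calculation below uses only the figures quoted in the list. The helper names are illustrative, and the steps-per-second value is simply the reciprocal of the listed 12 ms, an upper bound that ignores any other overhead.

```python
def gains(new_pct, old_pct):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    points = new_pct - old_pct
    relative = points / old_pct * 100
    return points, relative


def max_steps_per_second(step_ms):
    """Upper bound on action steps per second for a given per-step latency."""
    return 1000 / step_ms


results = [
    ("MetaWorld Assembly", 92, 78),
    ("CALVIN Long-Horizon", 78, 56),
]

for name, new_pct, old_pct in results:
    points, relative = gains(new_pct, old_pct)
    print(f"{name}: +{points} points, +{relative:.1f}% relative")

print(f"12 ms per step allows at most {max_steps_per_second(12):.1f} steps/s")
```

The 14 and 22 quoted in the article match the percentage-point reading. The relative improvements work out to roughly 18% and 39%.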

Implications for Developers and Businesses

For AI developers, Cosmos 3 eliminates the need to maintain separate vision-language models (VLMs) and motion controllers. The unified token space simplifies fine-tuning: developers can adapt the model to new environments using as few as 100 annotated demonstrations, thanks to the model’s strong multimodal pretraining. The complete training stack is available on GitHub, including scripts for data collection in Isaac Sim, enabling teams to generate synthetic training episodes for any robotic setup.

Businesses in manufacturing, logistics, and service robotics stand to benefit directly. With Cosmos 3, a warehouse operator could deploy a single model that drives robot arms, autonomous guided vehicles, and inspection drones, all sharing the same model weights, which would reduce hardware costs and maintenance complexity. The open license allows commercial use, though NVIDIA recommends consulting the model card for specific restrictions on safety-critical applications.

Expert Analysis: The Race Toward Physical AI

Dr. Elena Torres, a robotics researcher at MIT (not involved in the project), described the release as “a watershed moment for open-source physical AI.” She noted, “Previous attempts at unified models either required massive proprietary datasets or were too large to deploy on edge hardware. Cosmos 3 strikes a practical balance, and the open release will accelerate research into sim-to-real transfer and generalization.”

However, she cautioned that the model still operates on a 10 Hz action loop (one action every 100 ms), which is sufficient for many manipulation tasks but inadequate for high-speed assembly lines. “Developers targeting sub-millisecond control loops will still need traditional controllers as a safety layer,” she said.
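The layering described in that quote can be sketched as a slow policy that sets targets and a faster conventional loop that moves toward them under a hard speed limit. Only the 10 Hz policy rate comes from the article. The 1 kHz inner-loop rate, the speed limit, and the function names are hypothetical, and this is a toy rate limiter rather than a real robot controller.

```python
POLICY_HZ = 10          # action rate quoted in the article
CONTROL_HZ = 1000       # hypothetical inner-loop rate
MAX_JOINT_SPEED = 1.0   # rad/s, hypothetical safety limit


def step_toward(current, target, max_step):
    """Move current toward target by at most max_step (a simple rate limiter)."""
    delta = target - current
    if delta > max_step:
        delta = max_step
    elif delta < -max_step:
        delta = -max_step
    return current + delta


def run_inner_loop(current, target):
    """Run all inner-loop ticks that fit inside one policy period."""
    ticks = CONTROL_HZ // POLICY_HZ          # 100 ticks per policy action
    max_step = MAX_JOINT_SPEED / CONTROL_HZ  # largest allowed move per tick
    for _ in range(ticks):
        current = step_toward(current, target, max_step)
    return current


# The policy asks for a 0.5 rad jump; the safety layer caps how far the
# joint can actually travel before the next policy action arrives.
print(f"{run_inner_loop(0.0, 0.5):.3f}")
# A small request is reached well within one policy period.
print(f"{run_inner_loop(0.0, 0.05):.3f}")
```

The design choice is that the learned model never commands the motors directly: whatever it outputs, the inner loop bounds the motion that can happen between two policy actions.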

What’s Next for Cosmos

NVIDIA and HuggingFace have indicated that Cosmos 3 is the first of a series, with future releases expected to incorporate temporal reasoning over video sequences and support for multi-robot collaboration. Community contributions are already flowing in: within 24 hours of release, the repository received over 2,000 stars and 50 pull requests improving adapters for the popular Franka Emika Panda arm.

The omni-model approach may ultimately blur the line between “perception” and “action” entirely—a shift that could redefine how we think about AI in the real world. For now, Cosmos 3 gives developers a powerful open foundation to experiment with unified physical AI, free from proprietary walls.

Interested teams can access the model and training pipeline at the official HuggingFace Cosmos 3 repository.


Source: HuggingFace. This article was produced with AI assistance and reviewed for accuracy.


About Eric Samuels

Eric Samuels is a Software Engineering graduate, certified Python Associate Developer, and founder of AI Herald. He has 5+ years of hands-on experience building production applications with large language models, AI agents, and Flask. He personally tests every AI model he writes about and publishes in-depth guides so developers and businesses can ship reliable AI products.
