The Agent Stack #040 — Monday Build


NVIDIA just released Cosmos 3, and it’s the first open model that can reason about physics and plan actions in the real world. This isn’t another chatbot that hallucinates physics. It’s a foundation model trained on millions of videos that understands how objects move, collide, and interact.

Why This Changes Everything for Builders

Most AI agents live in pure text or image land. They can write code and analyse data, but ask them to help a robot stack boxes or navigate a room? Disaster. Cosmos 3 bridges that gap by understanding physical cause and effect.

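The model comes in three sizes: 5B, 14B, and 40B parameters. All three are available on Hugging Face under the Apache 2.0 licence. The 14B model is the practical choice for serious experimentation on a single RTX 4090, though its float16 weights alone exceed the card’s 24 GB, so expect to quantise or offload part of the model to CPU.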
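
As a rough check on which sizes fit which cards, weight memory is approximately parameter count times bytes per parameter. The sketch below runs that arithmetic for the three sizes. It assumes decimal gigabytes and the RTX 4090's 24 GB of VRAM, and it ignores activations and KV cache, so real usage will be higher. The function names are illustrative only.

```python
# Back-of-envelope weight memory: parameters x bytes per parameter.
# Assumptions: decimal GB (1e9 bytes); activations and KV cache are ignored.
BYTES_PER_PARAM = {"float16": 2.0, "int8": 1.0, "int4": 0.5}
RTX_4090_VRAM_GB = 24.0


def weight_memory_gb(params_billions: float, dtype: str) -> float:
    """Approximate size of the model weights in GB."""
    return params_billions * BYTES_PER_PARAM[dtype]


def fits_in_vram(params_billions: float, dtype: str,
                 vram_gb: float = RTX_4090_VRAM_GB) -> bool:
    """True if the weights alone fit; a real run needs extra headroom."""
    return weight_memory_gb(params_billions, dtype) < vram_gb


if __name__ == "__main__":
    for size in (5, 14, 40):
        for dtype in BYTES_PER_PARAM:
            gb = weight_memory_gb(size, dtype)
            verdict = "fits" if fits_in_vram(size, dtype) else "does not fit"
            print(f"{size}B @ {dtype}: {gb:.1f} GB of weights, {verdict} in 24 GB")
```

By this estimate the 14B model needs 8-bit or 4-bit weights to sit entirely on a 24 GB card, and the 40B model only fits at 4-bit with little room to spare.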

Here’s what makes it different: it predicts what happens next in physical scenarios. Show it a video of someone dropping a ball, and it knows the ball falls down, not up. Show it a robotic arm reaching for an object, and it can predict whether the grasp will succeed.

Building Your First Physical AI Agent

Let’s walk through setting up a simple agent that can plan robotic actions. You’ll need the Cosmos 3 model, the transformers and torch packages, and OpenCV (opencv-python) for frame capture.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import cv2

# Load Cosmos 3 (14B model)
model = AutoModelForCausalLM.from_pretrained(
    "nvidia/cosmos-3-14b",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("nvidia/cosmos-3-14b")

class PhysicalReasoningAgent:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
    
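    def encode_video_sequence(self, video_frames):
        # Placeholder (not defined in the original listing): summarise the clip
        # as text so the prompt below is well-formed. A real pipeline would pass
        # the frames through the model's own video preprocessing instead.
        if not video_frames:
            return "an empty clip"
        height, width = video_frames[0].shape[:2]
        return f"{len(video_frames)} frames at {width}x{height}"

    def predict_action_outcome(self, video_frames, planned_action):
        # Encode video context
        context = self.encode_video_sequence(video_frames)

        # Add action description
        prompt = f"Given this scene: {context}\nIf we {planned_action}, what happens next?"

        # Move the inputs to the same device as the model
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=256,
                do_sample=True,  # temperature only takes effect when sampling
                temperature=0.7
            )

        # Decode only the newly generated tokens, not the echoed prompt,
        # so the planned action's own wording cannot trip the keyword check
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        prediction = self.tokenizer.decode(new_tokens, skip_special_tokens=True)
        return prediction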
    
    def score_action_safety(self, video_frames, action):
        prediction = self.predict_action_outcome(video_frames, action)
        
        # Simple safety scoring based on keywords
        danger_words = ['fall', 'break', 'collision', 'tip over', 'unstable']
        safety_score = 1.0 - sum(word in prediction.lower() for word in danger_words) * 0.2
        
        return max(0.0, safety_score), prediction

# Usage example
agent = PhysicalReasoningAgent(model, tokenizer)

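# Test with webcam feed
cap = cv2.VideoCapture(0)
if not cap.isOpened():
    raise RuntimeError("Could not open webcam (device 0)")

frames = []
for _ in range(30):  # Capture 30 frames (about 1 second if the camera runs at 30fps)
    ret, frame = cap.read()
    if ret:
        frames.append(frame)

cap.release()  # Free the camera once capture is done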

# Evaluate a planned robotic action
action = "move the robotic arm to grasp the red cup"
safety_score, outcome = agent.score_action_safety(frames, action)

print(f"Action: {action}")
print(f"Safety Score: {safety_score:.2f}")
print(f"Predicted Outcome: {outcome}")

The key idea is to use the model’s physics understanding to evaluate actions before executing them, which helps avoid the classic “robot accidentally knocks everything over” problem.
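
One way to act on that idea is a gate that scores every candidate action and only releases the best one if it clears a threshold. The sketch below is an illustration under stated assumptions: `choose_safest_action`, the `min_score` of 0.8, and the canned predictions are hypothetical, and the scorer works on already-generated prediction text so the example runs without the model.

```python
from typing import Callable, Optional

DANGER_WORDS = ("fall", "break", "collision", "tip over", "unstable")


def keyword_score(prediction: str) -> float:
    """Keyword heuristic from the listing above: subtract 0.2 per danger word."""
    hits = sum(word in prediction.lower() for word in DANGER_WORDS)
    return max(0.0, 1.0 - 0.2 * hits)


def choose_safest_action(
    candidates: list[str],
    predict: Callable[[str], str],
    min_score: float = 0.8,  # hypothetical threshold; tune it for your setup
) -> Optional[str]:
    """Return the highest-scoring action, or None if nothing clears min_score."""
    if not candidates:
        return None
    scored = [(keyword_score(predict(action)), action) for action in candidates]
    best_score, best_action = max(scored, key=lambda pair: pair[0])
    return best_action if best_score >= min_score else None


# Canned predictions stand in for model output so the sketch runs offline.
CANNED = {
    "grasp the cup from the side": "The cup slides but stays upright.",
    "push the cup to the edge": "The cup will tip over and fall off the table.",
}

if __name__ == "__main__":
    print(choose_safest_action(list(CANNED), CANNED.__getitem__))
```

Returning `None` when no candidate clears the bar lets the caller stop rather than run the least-bad option. Note that the heuristic matches substrings, so it cannot tell “will fall” from “will not fall”. Treat it as a crude first filter rather than a safety guarantee.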

Architecture for Real-World Deployment

For production systems, you want this pattern:

  1. Perception Module: Computer vision pipeline that processes camera feeds
  2. Physics Reasoning: Cosmos 3 evaluates possible actions
  3. Action Planning: Traditional robotics stack executes the safest viable action
  4. Feedback Loop: Results feed back to improve future predictions

The magic happens in step 2. Instead of hard-coded physics rules, you get learned intuition about how the world works.
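
The four stages above can be sketched as a single control loop. Everything named here (`AgentLoop`, `run_cycle`, `CycleResult`, the `history` list) is hypothetical scaffolding rather than a Cosmos 3 or robotics API. Each stage is passed in as a plain function so real perception, reasoning, and execution components can be swapped in.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class CycleResult:
    action: str
    predicted: str
    succeeded: bool


@dataclass
class AgentLoop:
    perceive: Callable[[], str]          # 1. camera feed -> scene description
    predict: Callable[[str, str], str]   # 2. (scene, action) -> predicted outcome
    score: Callable[[str], float]        # 2. predicted outcome -> safety score
    execute: Callable[[str], bool]       # 3. run the action, report success
    history: list = field(default_factory=list)  # 4. feedback log

    def run_cycle(self, candidates: list[str]) -> CycleResult:
        scene = self.perceive()
        predictions = {a: self.predict(scene, a) for a in candidates}
        # Pick the candidate whose predicted outcome scores as safest
        best = max(candidates, key=lambda a: self.score(predictions[a]))
        result = CycleResult(best, predictions[best], self.execute(best))
        self.history.append(result)  # kept for later evaluation or tuning
        return result


if __name__ == "__main__":
    # Stub stages so the loop runs without a camera, model, or robot.
    loop = AgentLoop(
        perceive=lambda: "a red cup near the table edge",
        predict=lambda scene, a: "the cup may fall" if "push" in a else "the cup is lifted cleanly",
        score=lambda text: 0.0 if "fall" in text else 1.0,
        execute=lambda a: True,
    )
    print(loop.run_cycle(["push the cup", "grasp the cup"]))
```

Keeping predicted and actual outcomes side by side in `history` is what makes step 4 possible: mismatches between the two are the cases worth reviewing.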

Early tests show a 73% improvement in manipulation success rates compared to traditional robotics planners. The model particularly excels at predicting failure modes that rule-based systems miss.

Quick Hits

Memory bottleneck solved: XCENA raised £108M betting that AI’s real limit is memory bandwidth, not compute. Their in-memory processing chips could 10x inference speed for models like Cosmos 3.

GitHub Copilot goes token-based: Developers are furious about the new usage-based pricing. Expect £15-30/month bills for heavy users. Time to optimise those prompts.

Shift wants your cleaning data: This startup will clean your home for free in exchange for training footage. It’s weird, but the robot training data market is worth billions.

One Thing to Try

Download the Cosmos 3 14B model and run it on a simple physics prediction task. Feed it a video of objects falling or moving, then ask it to predict what happens next. Compare its predictions to reality. You’ll quickly see where current physical AI shines and where it still struggles.
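
To keep that comparison honest, it helps to log each prediction next to what actually happened and tally the agreement. Here is a minimal sketch, assuming you label each clip by hand with a short outcome tag. The `Trial` record, the file names, and the tags are made up for illustration and are not real results.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    clip: str        # which video was shown
    predicted: str   # outcome tag taken from the model's answer
    observed: str    # outcome tag recorded from the real footage


def summarise(trials: list[Trial]) -> tuple[float, Counter]:
    """Return overall accuracy and a count of (predicted, observed) misses."""
    if not trials:
        return 0.0, Counter()
    misses = Counter(
        (t.predicted, t.observed) for t in trials if t.predicted != t.observed
    )
    accuracy = 1.0 - sum(misses.values()) / len(trials)
    return accuracy, misses


if __name__ == "__main__":
    # Illustrative labels only, not measurements.
    trials = [
        Trial("ball_drop.mp4", "falls", "falls"),
        Trial("stacked_boxes.mp4", "stays", "topples"),
        Trial("rolling_can.mp4", "rolls", "rolls"),
        Trial("cup_push.mp4", "slides", "slides"),
    ]
    accuracy, misses = summarise(trials)
    print(f"accuracy: {accuracy:.0%}")
    print(f"misses: {dict(misses)}")
```

The miss counter is the more useful half: it shows which kinds of scene the model gets wrong, not just how often.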

The robots are coming. Make sure they understand physics first.