
How does Sora handle physics simulation?

Sora’s physics were not explicitly programmed; instead, the model learned physics implicitly from its training data through its transformer representations:

Learned Physics Representation: Unlike traditional physics engines that encode Newton’s laws as mathematical rules, Sora learned physics as a statistical pattern in its training data. The model observed millions of videos depicting gravity, collisions, momentum, and interactions, and internalized these patterns as learned representations in its neural network weights.

Advanced Interactions: Sora 2 demonstrated robust physics understanding across complex scenarios:

  • Gravity and Falls: Objects fell realistically, accelerating predictably. A dropped ball fell faster as time progressed, matching gravitational dynamics.
  • Collisions and Bouncing: Balls bounced off surfaces with appropriate energy loss. Basketball rebounds off backboards looked physically plausible. Collisions between objects produced realistic deflection and momentum transfer.
  • Fluids and Splashing: Water, rain, and liquids moved with fluid-like properties. Pouring water into cups maintained continuity and obeyed fluid dynamics.
  • Breaking and Shattering: Objects breaking apart moved with realistic trajectories. While not perfect, breaking animations generally showed conservation of momentum.
  • Complex Multi-Object Scenarios: When multiple objects interacted, Sora understood stacking, balance, and support. A ball rolling down a ramp would accelerate, and a stack of blocks would collapse realistically if the bottom block was removed.
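The "accelerating predictably" behavior in the first bullet corresponds to constant-acceleration kinematics, which Sora must reproduce statistically rather than compute. For reference, the ground-truth relationships are just:

```python
# Free fall from rest under constant gravity: speed grows linearly
# with time (v = g*t) and distance grows quadratically (d = g*t^2 / 2).
# These are the reference values a generated video implicitly matches.
G = 9.81  # gravitational acceleration, m/s^2

def fall(t):
    """Speed (m/s) and distance fallen (m) after t seconds, from rest."""
    return G * t, 0.5 * G * t ** 2

for t in (0.5, 1.0, 1.5):
    v, d = fall(t)
    print(f"t={t:.1f}s  v={v:.2f} m/s  d={d:.2f} m")
```

A model that renders a dropped ball covering roughly four times the distance in the second second that it covered in the first has, in effect, internalized this curve.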

Implicit World Simulation: Sora appeared to implicitly simulate a consistent 3D world. If a basketball player shot and missed, the ball rebounded off the backboard rather than teleporting or vanishing. This suggested Sora maintained internal representations of 3D space, object persistence, and physical rules across time.

Advantage Over Previous Models: Sora 2 excelled compared to predecessors (and many competitors) because:

  • Long-Range Consistency: Previous models degraded rapidly beyond 5-10 seconds. Sora maintained coherent physics across 20-60 second sequences.
  • Multi-Object Tracking: Sora tracked multiple interacting objects simultaneously without “merging” them or losing count.
  • Sophisticated Interactions: Sora understood complex scenarios like people moving furniture, machines operating, or sports interactions—physics beyond simple falling objects.

Critical Limitations:

Despite these strengths, Sora struggled with many kinds of interaction:

  • Glass and Brittle Materials: Sora did not accurately model glass shattering. The fracture pattern and fragment behavior were unconvincing.
  • State Changes: Eating food didn’t result in realistic changes—the food didn’t disappear proportionally or show chewing mechanics. Similarly, liquid consumption from cups wasn’t modeled accurately.
  • Precise Mechanics: Complex mechanical interactions—gears turning, pulleys operating, pistons moving—often behaved incorrectly.
  • Object Permanence in Crowded Scenes: When multiple similar objects interacted (many balls, many people), Sora sometimes “merged” identical items or lost track of individuals. Objects spontaneously multiplied or disappeared.
  • Hand-Object Interaction: While Sora generated better hands than competitors, precise manipulation—threading a needle, typing accurately, or detailed hand-object contact—remained error-prone.
  • Temporal Degradation: Physics accuracy degraded noticeably in videos longer than 20-30 seconds. Accumulated generation errors compounded, and physical laws “drifted” away from realism.

Future AI systems combining video generation with embodied robotics will need persistent, searchable memory of visual observations. Milvus handles the vector storage layer for video similarity search and content retrieval. Production deployments can leverage Zilliz Cloud.
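As an illustration of that retrieval layer, the sketch below runs brute-force cosine-similarity search over a handful of made-up frame embeddings in plain Python. A vector database such as Milvus replaces this linear scan with indexed, approximate search at scale; the labels and 4-dimensional vectors here are purely hypothetical stand-ins for real encoder outputs:

```python
import math

# Toy in-memory similarity search over "video frame embeddings".
# In production, a vector database (e.g. Milvus) stores the embeddings
# and serves nearest-neighbor queries; this linear scan shows the idea.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical embeddings for three stored video clips.
frames = {
    "ball_drop":   [0.9, 0.1, 0.0, 0.1],
    "water_pour":  [0.1, 0.9, 0.2, 0.0],
    "glass_break": [0.0, 0.2, 0.9, 0.1],
}

def search(query, k=2):
    """Return the k stored clips most similar to the query embedding."""
    ranked = sorted(frames, key=lambda name: cosine(query, frames[name]),
                    reverse=True)
    return ranked[:k]

print(search([0.8, 0.2, 0.1, 0.1]))  # a query resembling "ball_drop"
```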

Comparison to Alternatives:

  • Runway Gen-4: Prioritized reliability in short clips but struggled with complex physics
  • Google Veo 2: Some analyses suggested superior physical accuracy in specific scenarios
  • Kling 3.0: Competitive physics simulation but less consistent than Sora

No available video generation system achieved true physics simulation.

Why Learned Physics Falls Short:

The fundamental issue: learned physics from data works well for common scenarios depicted frequently in training videos (people walking, objects falling, basic collisions). Rare or novel scenarios—glass breaking, precise mechanical interactions—appear less frequently in training data, so the model learns weaker representations.

This contrasts with explicit physics engines (used in video games) that encode physics as mathematical rules and work equally well for any scenario. AI-learned physics trades perfect accuracy for the ability to generate realistic video aesthetics and handle scenarios outside explicit programming.
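The contrast can be made concrete with a few lines of rule-based simulation (a toy sketch, not any particular engine): every constant below is hand-coded by a human, whereas a learned model must infer the equivalent behavior from video statistics alone.

```python
# Explicit-rules physics: a bouncing ball via Euler integration.
# Gravity, restitution, and the rest threshold are hard-coded
# constants rather than patterns learned from training data.

G = 9.81           # gravitational acceleration, m/s^2
RESTITUTION = 0.7  # fraction of speed kept after each bounce
REST_SPEED = 0.1   # below this impact speed the ball comes to rest, m/s
DT = 0.001         # integration timestep, s

def simulate(height, duration):
    """Drop a ball from `height` metres; return (final height, bounce count)."""
    y, v, t, bounces = height, 0.0, 0.0, 0
    while t < duration:
        v -= G * DT               # gravity acts every step
        y += v * DT
        if y <= 0.0 and v < 0.0:  # floor contact
            y = 0.0
            if -v > REST_SPEED:
                v = -v * RESTITUTION  # reverse and lose energy
                bounces += 1
            else:
                v = 0.0               # too slow: ball comes to rest
        t += DT
    return y, bounces

final_height, bounces = simulate(height=1.0, duration=3.0)
print(f"after 3 s: height={final_height:.3f} m, bounces={bounces}")
```

Because the rules are explicit, this code handles a drop from 1 m or 100 m identically; a learned model's accuracy instead depends on how often comparable scenes appeared in its training videos.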

Production Implications:

For professional video production, Sora’s physics worked well for broad strokes (objects falling, basic collisions, people moving through space) but required careful prompt engineering and sometimes manual correction for demanding scenarios. Users learned to avoid requesting complex physics interactions or accepted imperfect results.
