Sora’s realism capabilities were among the strongest in AI video generation, though significant limitations remained:
Strengths in Realism: Sora 2 generated notably photorealistic output across many scenarios. Key realism advantages included:
- Cinematic Quality: Sora produced the most “cinematic” output by default, with sophisticated color grading, composition, and lighting. Videos had a film-quality aesthetic that other models struggled to match.
- Face and Hand Generation: Character faces and hands were generally better rendered with Sora than with competitors. While imperfect, facial features remained recognizable and proportionally accurate across longer sequences.
- Physics Accuracy: Sora 2 handled complex multi-object interactions more robustly than competing models. Objects fell, bounced, broke, and collided with plausible dynamics; gravity, momentum, and fluid behavior were simulated with a fidelity other models couldn’t reproduce consistently.
- Long-Range Coherence: Sora maintained world consistency across longer videos. Lighting remained consistent, scene geometry held together, and objects didn’t spontaneously appear or vanish as frequently as in competing systems.
- Dynamic Camera Motion: As cameras shifted and rotated, people and scene elements moved consistently through 3D space, creating convincing parallax and depth cues that enhanced photorealism.
Critical Limitations: Despite these strengths, Sora couldn’t achieve true photorealism in all scenarios:
- Physics Failures: Sora failed at modeling many basic interactions—glass shattering, food being eaten (objects didn’t change state realistically), and liquid dynamics. Complex mechanical interactions like gears and pulleys behaved incorrectly.
- Object Permanence Issues: In complex scenarios with multiple similar objects, Sora “merged” them or lost track. Objects sometimes spontaneously multiplied, disappeared, or changed appearance between frames.
- Temporal Degradation: Physics accuracy degraded noticeably beyond 20-30 seconds. Accumulated errors in the generation process compounded, leading to increasingly unrealistic behavior in longer videos.
- Hand and Motion Artifacts: While it outperformed competitors here, Sora still struggled with precise hand gestures, small-object manipulation, and subtle human expressions.
- Synthetic Look: While cinematic, Sora’s output retained a subtle “synthetic” quality. Perceptual realism metrics showed that careful viewers could often identify Sora videos as AI-generated.
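The temporal degradation described above is consistent with compounding error: if each generated frame inherits the drift of the frames before it and adds a little of its own, total deviation grows super-linearly with video length. This is a hedged toy model, not Sora’s actual generation process; the function name, per-frame error rate, and frame rate below are illustrative assumptions.

```python
import random

def accumulated_drift(seconds, fps=24, per_frame_error=0.001, seed=0):
    """Toy simulation of compounding per-frame error over a video.

    Each frame adds a small random error scaled by the drift already
    accumulated, so errors compound rather than merely sum. Hypothetical
    parameters; not a model of any real video generator.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    drift = 0.0
    for _ in range(seconds * fps):
        # new error is amplified by existing drift (compounding)
        drift += per_frame_error * (1 + drift) * rng.uniform(0.5, 1.5)
    return drift

for s in (5, 15, 30, 60):
    print(f"{s:>3}s clip: accumulated drift ≈ {accumulated_drift(s):.3f}")
```

Under these toy assumptions, drift grows modestly through short clips and accelerates past the 20-30 second mark, mirroring the qualitative pattern reported for longer generations.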
Deepfake Realism Problem: The primary issue wasn’t whether Sora could create photorealistic content—it could, often convincingly. The problem was that its realism made it an effective deepfake tool. Research by NewsGuard showed Sora 2 could be prompted to generate false or misleading videos 80% of the time. This realism combined with misinformation potential created significant societal risk.
Comparison to Alternatives: Runway Gen-4 prioritized reliability in short clips (4-10 seconds) with fewer artifacts. Google Veo 3.1 offered higher resolution and longer clips. But Sora retained the edge in perceived cinematic quality and world consistency—the hallmarks of photorealism. However, none of these systems achieved true, consistent photorealism across arbitrary scenarios.