Why Motion-o
From snapshots to trajectories
Existing grounded video models can localize evidence in space and time, but usually leave the motion connecting those observations implicit. Motion-o formalizes that missing dimension as Spatial-Temporal-Trajectory (STT) reasoning.
Instead of relying on fluent but weakly grounded narration, Motion-o pairs <obj>, <box>, and <t> evidence with a structured <motion/> tag so trajectory claims are explicit and verifiable.