
Motion-o: Trajectory-Grounded Video Reasoning

Motion-o extends grounded video reasoning from where and when to how objects move with a Spatial-Temporal-Trajectory evidence chain and a structured Motion Chain of Thought.

Recent video reasoning systems can produce fluent <think> traces, but those traces are often weakly grounded: they do not explicitly tie claims back to where and when evidence occurs in the video, and they do not encode the motion that connects successive observations. Motion-o frames this missing capability as Spatial-Temporal-Trajectory reasoning.

The core mechanism is a trajectory-faithful evidence chain: timestamped <obj>, <box>, and <t> observations paired with a structured <motion/> tag that summarizes direction, speed, and scale change. This turns motion from an implicit interpolation step into an explicit and verifiable part of the reasoning process, while requiring no architectural modification to the underlying model.

From snapshots to trajectories

Existing grounded video models can localize evidence in space and time, but usually leave the motion connecting those observations implicit. Motion-o formalizes that missing dimension as Spatial-Temporal-Trajectory (STT) reasoning.

Instead of relying on fluent but weakly grounded narration, Motion-o pairs <obj>, <box>, and <t> evidence with a structured <motion/> tag so trajectory claims are explicit and verifiable.

Spatial <obj> + <box>

Ground the relevant object.

Temporal <t>

Anchor the observation in time.

Trajectory <motion/>

Summarize direction, speed, and scale change.
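The three tags above can be rendered together into one evidence step. As a minimal sketch (the exact tag grammar, attribute names, and box format are assumptions for illustration, not taken from the paper):

```python
def render_step(obj, box, t, motion=None):
    """Render one Spatial-Temporal-Trajectory evidence step.

    obj: object label; box: (x1, y1, x2, y2) in pixels; t: timestamp
    in seconds; motion: optional dict of motion attributes (assumed
    schema, e.g. direction / speed / scale).
    """
    parts = [
        f"<obj>{obj}</obj>",
        "<box>" + ",".join(str(v) for v in box) + "</box>",
        f"<t>{t:.1f}</t>",
    ]
    if motion:
        # Summarize the trajectory as a self-closing <motion/> tag.
        attrs = " ".join(f'{k}="{v}"' for k, v in motion.items())
        parts.append(f"<motion {attrs}/>")
    return " ".join(parts)

step = render_step("girl", (40, 120, 90, 260), 2.5,
                   {"direction": "right", "speed": "walking"})
```

Because the trace is structured rather than free-form narration, each claim can be checked against the video: the box against the frame, the timestamp against the clip, and the motion attributes against the track.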

Top: video frames without motion grounding.

Bottom: video frames with the Spatial-Temporal-Trajectory evidence chain (with motion grounding).

Question

What is the girl's trajectory across the playground?

Answer

The girl walks from the left side of the frame, passes in front of the jungle gym, and exits on the right side of the frame.

Motion-o makes the evidence chain explicit

Motion-o grounds the trajectory with explicit spatial, temporal, and motion tags; the baseline stays fluent but leaves that structure implicit.

Comparison: Motion-o (Spatial-Temporal-Trajectory evidence + Motion Chain of Thought) versus the Qwen2.5-VL-7B baseline with standard CoT.

Teach the model to speak motion explicitly

Motion-o expands sparse keyframes into denser tracks, computes motion primitives, injects them into the reasoning chain during SFT, and reinforces them with trajectory-grounded rewards during RL.
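The first two pipeline stages can be sketched as follows: densify sparse keyframe boxes by interpolation, then derive direction, speed, and scale change between observations. This is a minimal sketch under assumed conventions (boxes as (x1, y1, x2, y2) pixels, linear interpolation, scale measured as an area ratio); function names are hypothetical, and the paper's actual tracker and primitive definitions may differ.

```python
import math

def interpolate_track(keyframes, fps=4):
    """Densify sparse (t, box) keyframes into a track by linear
    interpolation. keyframes: time-sorted list of (t_seconds, box)."""
    track = []
    for (t0, b0), (t1, b1) in zip(keyframes, keyframes[1:]):
        n = max(1, int((t1 - t0) * fps))
        for i in range(n):
            a = i / n
            box = tuple(v0 + a * (v1 - v0) for v0, v1 in zip(b0, b1))
            track.append((t0 + a * (t1 - t0), box))
    track.append(keyframes[-1])
    return track

def motion_primitives(t0, b0, t1, b1):
    """Direction, speed, and scale change between two box observations."""
    cx0, cy0 = (b0[0] + b0[2]) / 2, (b0[1] + b0[3]) / 2
    cx1, cy1 = (b1[0] + b1[2]) / 2, (b1[1] + b1[3]) / 2
    dx, dy, dt = cx1 - cx0, cy1 - cy0, t1 - t0
    direction = "right" if dx > 0 else "left" if dx < 0 else "static"
    speed = math.hypot(dx, dy) / dt          # box-center pixels per second
    area0 = (b0[2] - b0[0]) * (b0[3] - b0[1])
    area1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    scale = area1 / area0                    # > 1 suggests approaching
    return direction, speed, scale
```

Primitives computed this way can be serialized into the reasoning chain for SFT and compared against model outputs to score trajectory-grounded rewards during RL.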

Motion-o training pipeline figure

The missing step is explicit motion

CoT narrates snapshots. MCoT inserts an inspectable motion step between them and makes trajectory reasoning explicit.
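The contrast can be made concrete with a toy trace builder. This is an illustrative sketch, not the paper's implementation: the observations, tag attributes, and the simple horizontal-direction heuristic are all assumptions.

```python
obs = [("duck", (30, 80, 70, 120), 1.0),
       ("duck", (120, 78, 168, 126), 3.0)]

def cot_trace(obs):
    # Baseline CoT: narrate each snapshot independently; the motion
    # between snapshots is left for the reader to interpolate.
    return [f"At t={t}s the {o} is near {b}." for o, b, t in obs]

def mcot_trace(obs):
    # MCoT: insert an explicit, inspectable motion step between each
    # adjacent pair of grounded snapshots.
    out = [f"<obj>{obs[0][0]}</obj> <box>{obs[0][1]}</box> <t>{obs[0][2]}</t>"]
    for (o0, b0, t0), (o1, b1, t1) in zip(obs, obs[1:]):
        dx = (b1[0] + b1[2]) / 2 - (b0[0] + b0[2]) / 2
        direction = "right" if dx > 0 else "left"
        speed = abs(dx) / (t1 - t0)  # horizontal pixels per second
        out.append(f'<motion direction="{direction}" speed="{speed:.0f}px/s"/>')
        out.append(f"<obj>{o1}</obj> <box>{b1}</box> <t>{t1}</t>")
    return out
```

The CoT trace contains two disconnected snapshots; the MCoT trace contains the same snapshots plus a motion step that can be checked against the track.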

Figure: animated comparison of chain of thought (CoT) and motion chain of thought (MCoT).

Qualitative Examples

01 Duck trajectory: animated example showing the question, the answer, and a grounded, motion-aware reasoning trace.

BibTeX

If you find Motion-o useful for your research, please cite our paper:

@article{galoaa2026motion,
  title   = {Motion-Aware Trajectory Reasoning for Video Understanding},
  author  = {Galoaa, Bishoy* and Moezzi, Shayda* and Bai, Xiangyu and Ostadabbas, Sarah},
  journal = {arXiv preprint arXiv:2603.18856},
  year    = {2026},
  url     = {https://arxiv.org/abs/2603.18856}
}