Static-Cue Dominance
Salient appearance and context can overshadow low-signal but decisive pixel dynamics, causing state transitions and motion to be underweighted or missed.
Prior-Driven Temporal Hallucination
Learned event priors can override the observed motion and complete the most likely script even when the video shows otherwise.
Video is inherently temporal: the meaning of an event often lies not in what appears in any single moment but in how states change, interact, and unfold over time. The medium makes this dynamic structure explicit, encoding motion, interaction, and causality directly in pixel-level changes across frames.
Together, these patterns misorder the evidence hierarchy of video understanding, allowing models to succeed without explicitly tracking state evolution over time. Temporal and causal claims must therefore remain grounded in observed pixel dynamics.
Current Video LLMs achieve remarkable performance on recent video benchmarks, yet these systems still fail at rudimentary temporal understanding and movement perception. The apparent success of these models often reflects how well they satisfy benchmark requirements, not necessarily how well they perceive temporal dynamics.
If a video understanding benchmark can be solved without seeing the video, it is not measuring video understanding; it is measuring language reasoning. Frame shuffling, semantic-only probes, and shortcut-aware diagnostics repeatedly show how weak current progress metrics can be with respect to time.
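As a concrete illustration of what a frame-shuffling probe measures, here is a minimal sketch; `answer_fn(frames, question)` is a hypothetical wrapper around a Video LLM, not an existing API. If the model's answer survives random reordering of the frames, the question was never really about time.

```python
import random

def temporal_order_sensitivity(answer_fn, frames, question, n_shuffles=5, seed=0):
    """Fraction of random frame shuffles that leave the model's answer unchanged.

    A value near 1.0 means the prediction is insensitive to temporal order,
    so the question is likely answerable from static appearance alone.
    `answer_fn(frames, question) -> str` is a hypothetical Video LLM wrapper.
    """
    rng = random.Random(seed)
    ordered_answer = answer_fn(frames, question)
    unchanged = 0
    for _ in range(n_shuffles):
        shuffled = list(frames)   # copy so the original ordering is preserved
        rng.shuffle(shuffled)
        unchanged += (answer_fn(shuffled, question) == ordered_answer)
    return unchanged / n_shuffles
```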
Current architectures systematically compress or defer temporal information. Image-first encoders, shallow fusion modules, and language-heavy reasoning stacks preserve object semantics far more reliably than the dynamics that emerge only through transitions over time.
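A toy sketch makes this failure mode concrete. It is not any particular model's code, but it shows why mean-pooling per-frame embeddings from an image encoder cannot preserve dynamics: the pooled representation is identical whether the clip is played forward or backward.

```python
import torch

def pool_frame_features(frame_features: torch.Tensor) -> torch.Tensor:
    """frame_features: (T, D) embeddings produced independently per frame.

    Mean pooling over the time axis is permutation-invariant, so motion
    direction and state transitions cannot survive this fusion step.
    """
    return frame_features.mean(dim=0)

feats = torch.randn(16, 768)                                # 16 sampled frames
assert torch.allclose(pool_frame_features(feats),
                      pool_frame_features(feats.flip(0)))   # reversed clip, same token
```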
Prompt: “Describe the motion of the balls in the video.”
The physically invalid Newton’s cradle clip makes the failure unusually clear: the rightmost ball remains stationary after impact, yet the model restores the canonical momentum-transfer story.
Video LLMs can appear temporally competent while systematically underusing the very signal that makes video distinct: dynamic evolution in pixels. Across recent diagnostic probes, two recurring patterns emerge: static-cue dominance and prior-driven temporal hallucination.
The prevailing paradigm still inherits an image-language bias: video is treated as a set of sampled frames rather than a continuous spatiotemporal signal. When appearance is held nearly fixed and only the trajectory changes, even simple motion primitives become fragile.
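A back-of-the-envelope sketch shows how coarse this discretization is. Assuming a 30 fps source (an assumption, not stated by the benchmarks below), a 3-second clip reduced to 16 uniformly sampled frames leaves gaps of roughly six frames, enough for a brief impact to fall entirely between samples.

```python
import numpy as np

def uniform_sample_indices(num_frames: int, num_samples: int = 16) -> np.ndarray:
    """Evenly spaced frame indices, the common Video LLM preprocessing step."""
    return np.round(np.linspace(0, num_frames - 1, num_samples)).astype(int)

idx = uniform_sample_indices(90, 16)   # 3 s at an assumed 30 fps = 90 frames
print(idx)                             # [ 0  6 12 ... 89]
print(np.diff(idx).max())              # largest gap between consecutive samples: 6
```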
Accuracy (%) on 3-second collision videos from AVoE, using 16 sampled frames and a binary Yes/No task: does the left/right object change direction after the collision?
| Model | Input Frames | Expected | Surprising |
|---|---|---|---|
These 3-second collision clips isolate the binary question of direction change after impact while holding appearance nearly fixed. Across all three examples, both models answer incorrectly.
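For reference, a probe like this needs only a few lines of glue code to score. The sketch below is not the benchmark's official harness; it assumes the same hypothetical `answer_fn` wrapper as above and a naive yes/no parse of the model's free-form reply.

```python
def score_direction_probe(answer_fn, clips):
    """clips: iterable of (frames, side, truth) with truth in {'yes', 'no'}:
    does the object on `side` change direction after the collision?
    answer_fn(frames, question) -> str is a hypothetical Video LLM wrapper."""
    correct = 0
    for frames, side, truth in clips:
        question = (f"Does the {side} object change direction after the collision? "
                    f"Answer yes or no.")
        pred = "yes" if "yes" in answer_fn(frames, question).lower() else "no"
        correct += (pred == truth)
    return 100.0 * correct / len(clips)
```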
Directionality is encoded only across time; no single frame reveals it. Across these rotation clips, the model repeatedly misclassifies clockwise rotation as counter-clockwise.
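By contrast, even classical optical flow recovers rotation direction from just two consecutive frames, underscoring where the missing signal lives. The sketch below uses OpenCV's Farneback flow and a curl-sign heuristic; it is an illustration of the information content, not a proposed fix.

```python
import cv2
import numpy as np

def rotation_sign(prev_gray: np.ndarray, next_gray: np.ndarray) -> float:
    """Mean curl of dense optical flow between two consecutive grayscale frames.

    No single frame contains this signal. With image coordinates (y pointing
    down), a positive mean curl indicates clockwise rotation on screen and a
    negative value indicates counter-clockwise rotation.
    """
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    vx, vy = flow[..., 0], flow[..., 1]
    curl = np.gradient(vy, axis=1) - np.gradient(vx, axis=0)
    return float(curl.mean())
```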
When the visual evidence is subtle or counterintuitive, Video LLMs often do not default to uncertainty. Instead, they hallucinate a false reality, substituting canonical event scripts for evidence-based state tracking.
These examples expose environmental hallucination, role reversal, invented trajectories, and fabricated causal mechanisms under motion-focused prompting.
Prompt: “Describe the complete sequence of motion and events in this video from start to finish. Focus specifically on the dynamics of the scene.”
In an IntPhys2 teleportation video, the ball appears on the other side of a wall after a camera pan. Rather than identifying a physical impossibility, the model fabricates an impact-driven wall rotation.
Gemini-2.5-Pro output
“The force of the moving ball strikes the left side of this panel. This transfer of kinetic energy causes the panel to pivot sharply on its vertical axis.”
Recent benchmark design increasingly exposes the same blind spot: the models are often fluent about time without being reliably grounded in it.
Progress in video understanding will not come from scaling context windows or language-model capacity alone, but from representational, structural, and evaluative changes that make spatiotemporal evidence unavoidable.