Sora won't stay a slop machine forever

Vision will soon reason, just like language before it.

There's no denying it: OpenAI Sora's feed is pure dopamine sludge.

Most of it is borderline unwatchable, and it fries your brain anyway.

we've pretty much maxed our lifetime quota of sama's facefeeds

But the research underneath, thankfully, points the other way.

DeepMind's new paper on Veo 3 signals that we may have reached the "hello world" moment of visual reasoning.

Video models are zero-shot learners and reasoners
The remarkable zero-shot capabilities of Large Language Models (LLMs) have propelled natural language processing from task-specific models to unified, generalist foundation models. This transformation emerged from simple primitives: large, generative models trained on web-scale data. Curiously, the same primitives apply to today’s generative video models. Could video models be on a trajectory towards general-purpose vision understanding, much like LLMs developed general-purpose language understanding? We demonstrate that Veo 3 can solve a broad variety of tasks it wasn’t explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and more. These abilities to perceive, model, and manipulate the visual world enable early forms of visual reasoning like maze and symmetry solving. Veo’s emergent zero-shot capabilities indicate that video models are on a path to becoming unified, generalist vision foundation models.

So what are Veo 3, Sora, and their ilk actually cooking under all that slop?

Veo 3 builds intelligence across four rough rungs. It:

  • Perceives (edges, masks, denoising),
  • Models (physics, materials, light),
  • Manipulates (edits and transforms that hold up),
  • Reasons (goal → sequence → outcome).

It doesn't just render pixels. It reads a scene, maintains a world model over time, applies targeted changes, and plays out a plan.

Given only a prompt and a still image, the model animates its way through tasks once reserved for bespoke CV stacks. No task-specific heads, no fine-tunes.

Just describe the task and let the frames do the work.
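To make that concrete, here's a minimal sketch of what that looks like from the caller's side. Everything in it, the `VideoModelClient` class, its `generate` signature, and the prompts, is a hypothetical stand-in rather than a real Veo 3 or Sora API; the point is the shape of it: one endpoint, task phrasing in, frames out.

```python
# Hypothetical sketch of "describe the task, let the frames do the work".
# VideoModelClient, its generate() signature, and the prompts are illustrative
# stand-ins, not a real Veo 3 / Sora API.
from dataclasses import dataclass
from typing import List


@dataclass
class VideoResult:
    frames: List[bytes]  # decoded frames of the generated clip


class VideoModelClient:
    """Stand-in for a single generalist video-model endpoint."""

    def generate(self, prompt: str, image: bytes, seconds: int = 4) -> VideoResult:
        # A real client would call the model and decode frames; we fake a clip.
        return VideoResult(frames=[b""] * (seconds * 24))


client = VideoModelClient()
still = b"fake-png-bytes"  # stand-in for the conditioning frame

# The "CV stack" collapses into task phrasing:
tasks = {
    "segmentation": "Cover each distinct object with a flat green mask; keep the camera still.",
    "edge_detection": "Redraw the scene as thin white edges on a black background.",
    "maze": "Move the red dot along the corridors to the exit without crossing a wall.",
}

for name, prompt in tasks.items():
    clip = client.generate(prompt=prompt, image=still, seconds=4)
    # The "answer" is read off the generated frames, not off a task-specific head.
    print(name, "->", len(clip.frames), "frames")
```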

at this point, hard to dismiss these models as a gimmick

You can see the vision-native edge here. When you render a maze as text, a top-tier LLM can certainly hang. But flip that same maze into an image, and Veo 3 cleanly wins.

Which checks out.

Planning lands better when you start in pixels, not tokens, and that's where most of our real-world work lives anyway.


In practice, it plans: perceive → update → check → continue. That simple loop recurs everywhere, from mazes and object extraction to rule following and image manipulation.
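Stripped of the pixels, here's the shape of that loop as a toy sketch: a tiny grid maze stands in for the frames, and plain depth-first search stands in for whatever the model actually does internally. The structure is the point: perceive the scene, propose an update, check, continue (or backtrack).

```python
# Toy sketch of the perceive -> update -> check -> continue loop, run on a
# grid maze instead of real video frames. DFS is a stand-in for the model.
GRID = [
    "#######",
    "#S..#.#",
    "#.#.#.#",
    "#.#...#",
    "#...#E#",
    "#######",
]


def perceive(grid):
    """Read the 'scene': locate start, exit, and walls."""
    cells = {}
    for r, row in enumerate(grid):
        for c, ch in enumerate(row):
            cells[(r, c)] = ch
    start = next(p for p, ch in cells.items() if ch == "S")
    goal = next(p for p, ch in cells.items() if ch == "E")
    return cells, start, goal


def propose(cells, pos, visited):
    """Update: pick the next legal, unvisited move (plain DFS ordering)."""
    r, c = pos
    for nxt in [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]:
        if cells.get(nxt, "#") != "#" and nxt not in visited:
            return nxt
    return None


def solve(grid):
    cells, pos, goal = perceive(grid)          # perceive
    path, visited = [pos], {pos}
    while True:
        if pos == goal:                        # check
            return path
        nxt = propose(cells, pos, visited)     # update
        if nxt is None:                        # dead end: backtrack, then continue
            path.pop()
            if not path:
                return None
            pos = path[-1]
            continue
        visited.add(nxt)
        path.append(nxt)
        pos = nxt                              # continue


print(solve(GRID))
```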

Veo 3 also improves materially with cheap retries, the same way LLMs jump from pass@1 to pass@k.

In short, accuracy economics now scale with test-time sampling, not just larger pre-trains.
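A minimal sketch of that economics, with the video model and the task checker abstracted into plain callables. The pass@k estimate assumes independent samples; everything else is a toy stand-in.

```python
# Cheap-retries economics: sample up to k generations, verify each, accept the
# first one that passes. generate/verify are stand-ins for a video-model call
# and a task-specific checker.
import random
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def best_of_k(generate: Callable[[], T],
              verify: Callable[[T], bool],
              budget: int) -> Optional[T]:
    """Spend a retry budget instead of demanding pass@1."""
    for _ in range(budget):
        candidate = generate()
        if verify(candidate):
            return candidate
    return None


def expected_pass_at_k(p: float, k: int) -> float:
    """If one sample succeeds with prob p (independently), pass@k = 1 - (1-p)^k."""
    return 1 - (1 - p) ** k


# Toy stand-in: a "model" whose samples pass the checker ~30% of the time.
rng = random.Random(0)
result = best_of_k(
    generate=lambda: rng.random(),      # stand-in for sampling one clip
    verify=lambda clip: clip < 0.3,     # stand-in for a verifier passing ~30% of clips
    budget=8,
)
print("solved within budget:", result is not None)
print("pass@1 = 0.30 -> pass@8 ≈", round(expected_pass_at_k(0.3, 8), 2))
```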

Operationally, that shows up at the prompt layer.

Prompts now leave fingerprints you can quantify. For instance, a green backdrop beats white on segmentation, likely a learned keying prior, and benchmarks back it up. Veo 3 posts strong mean Intersection over Union (mIoU) on instance segmentation and clears small puzzles that stalled earlier versions.

fyi, green screens were first adopted for being the easiest to key out, hardest to mess up
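Back to the benchmark number: mIoU is just intersection-over-union per mask, averaged across instances. A toy version with NumPy boolean masks (matching predicted instances to ground-truth ones is assumed to have happened already):

```python
# What the mIoU number measures: per-mask intersection over union, averaged.
import numpy as np


def iou(pred: np.ndarray, true: np.ndarray) -> float:
    inter = np.logical_and(pred, true).sum()
    union = np.logical_or(pred, true).sum()
    return float(inter) / float(union) if union else 1.0


def mean_iou(pairs) -> float:
    return float(np.mean([iou(p, t) for p, t in pairs]))


# Tiny example: two 4x4 instance masks, one prediction a column too wide.
true_a = np.zeros((4, 4), bool); true_a[0:2, 0:2] = True
pred_a = np.zeros((4, 4), bool); pred_a[0:2, 0:3] = True   # over-segmented
true_b = np.zeros((4, 4), bool); true_b[2:4, 2:4] = True
pred_b = true_b.copy()                                     # perfect match

print(round(mean_iou([(pred_a, true_a), (pred_b, true_b)]), 3))  # ~0.833
```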

It echoes the early LLM era circa 2022, only this time for Visual Foundation Models (VFMs).

Once a generalist clears "good enough" across the long tail, the incentive shifts from wiring dozens of point tools to orchestrating one model with verifiers and a retry budget.
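Roughly what that shift looks like in code: a registry of task phrasings and verifiers routed through one generalist endpoint, with a retry budget instead of a new tool per task. The endpoint, the prompts, and the verifiers below are illustrative placeholders, not anything shipping.

```python
# Orchestration sketch: one generalist model behind a registry of task
# phrasings and verifiers, instead of N point tools. All placeholders.
from typing import Callable, Dict, Optional, Tuple


def call_video_model(prompt: str, image: bytes) -> bytes:
    """Stand-in for the single generalist video-model endpoint."""
    return b"generated-frames"


# task name -> (how you phrase it, how you check the result)
REGISTRY: Dict[str, Tuple[str, Callable[[bytes], bool]]] = {
    "segment": ("Cover each object with a flat green mask.", lambda out: len(out) > 0),
    "deblur":  ("Sharpen the frame without inventing detail.", lambda out: len(out) > 0),
    "track":   ("Follow the red car with a tight bounding box.", lambda out: len(out) > 0),
}


def run(task: str, image: bytes, budget: int = 4) -> Optional[bytes]:
    """Orchestrate: phrase the task, sample within a retry budget, verify."""
    prompt, verify = REGISTRY[task]
    for _ in range(budget):
        out = call_video_model(prompt, image)
        if verify(out):
            return out
    return None


print(run("segment", b"fake-image-bytes") is not None)
```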


Put simply, if language became programmable through LLMs, vision is becoming programmable through video models.

In the near term it'll still look like creator toys, but the durable value is universal visual operators embedded into real workflows, such as:

  • product imagery that updates to spec,
  • QA that evaluates motion not just pixels,
  • UI automation written in plain English,
  • robot demos distilled straight into action logic.

The moat won't just be a cute feed.

It will be the control stack around the model: task phrasing, memory, verifiers, and cheap sampling that turn today's Sora thirst trap into a system you can ship and hang a KPI on.


For now, Sora's invite frenzy is the biggest headline, with codes going for ~$500(!) on eBay.

apparently oai's burning $1 a clip while resellers cash in. can't make this up

But the line to watch is this:

  • Once framewise reasoning hits stable accuracy at short clip lengths, standard resolutions, and sane costs, a single VFM begins subsuming much of CV, the same way LLMs displaced bespoke NLP.

Just as language had its turn under large models, vision's up next.


and then there's Gary Marcus.

gotta give it up for that consistency


He's probably not gonna love hearing this, but Marcus, if you're listening,

we're still in inning one.