Why AI Video Sometimes Gets It Backwards



If 2022 was the year in which generative AI captured the wider public's imagination, 2025 is the year in which the new breed of generative video frameworks coming from China seems set to do the same.

Tencent’s Hunyuan Video has made a major impact on the hobbyist AI community with its open-source release of a full-world video diffusion model that users can tailor to their needs.

Close on its heels is Alibaba’s more recent Wan 2.1, one of the most powerful image-to-video FOSS solutions of this period – now supporting customization through Wan LoRAs.

Besides the availability of the recent human-centric foundation model SkyReels, at the time of writing we also await the release of Alibaba’s comprehensive VACE video creation and editing suite:

Click to play. The pending release of Alibaba’s multi-function AI-editing suite VACE has excited the user community. Source: https://ali-vilab.github.io/VACE-Page/

Sudden Impact

The generative video AI research scene itself is no less explosive; it is still the first half of March, and Tuesday’s submissions to Arxiv’s Computer Vision section (a hub for generative AI papers) came to nearly 350 entries – a figure more associated with the height of conference season.

The two years since the launch of Stable Diffusion in the summer of 2022 (and the subsequent development of Dreambooth and LoRA customization methods) were characterized by a lack of further major developments, until the past few weeks, in which new releases and innovations have proceeded at such a breakneck pace that it is almost impossible to keep apprised of it all, much less cover it all.

Video diffusion models such as Hunyuan and Wan 2.1 have solved, at last, and after years of failed efforts from hundreds of research initiatives, the problem of temporal consistency as it pertains to the generation of humans, and largely also to environments and objects.

There can be little doubt that VFX studios are currently applying staff and resources to adapting the new Chinese video models to solve immediate challenges such as face-swapping, despite the current lack of ControlNet-style ancillary mechanisms for these systems.

It must be quite a relief that one such significant obstacle has probably been overcome, albeit not through the avenues anticipated.

Of the problems that remain, this one, however, is not insignificant:

Click to play. Based on the prompt ‘A small rock tumbles down a steep, rocky hillside, displacing soil and small stones’, Wan 2.1, which achieved the highest scores in the new paper, makes one simple error. Source: https://videophy2.github.io/

Up The Hill Backwards

All text-to-video and image-to-video systems currently available, including commercial closed-source models, have a tendency to produce physics bloopers such as the one above, where the video shows a rock rolling uphill, based on the prompt ‘A small rock tumbles down a steep, rocky hillside, displacing soil and small stones’.

One theory as to why this happens, recently proposed in an academic collaboration between Alibaba and UAE, is that models always train on single images, in a sense, even when they’re training on videos (which are written out to single-frame sequences for training purposes); and so they may not necessarily learn the correct temporal order of ‘before’ and ‘after’ pictures.

However, the most likely explanation is that the models in question have used data augmentation routines that involve exposing a source training clip to the model both forwards and backwards, effectively doubling the training data.

It has long been known that this shouldn’t be done arbitrarily, because some movements work in reverse, but many do not. A 2019 study from the UK’s University of Bristol sought to develop a method that could distinguish equivariant, invariant and irreversible source video clips that co-exist in a single dataset (see image below), with the notion that unsuitable source clips might be filtered out of data augmentation routines.

Examples of three types of movement, only one of which is freely reversible while maintaining plausible physical dynamics. Source: https://arxiv.org/abs/1909.09422


The authors of that work frame the problem clearly:

‘We find the realism of reversed videos to be betrayed by reversal artefacts, aspects of the scene that would not be possible in a natural world. Some artefacts are subtle, while others are easy to spot, like a reversed ‘throw’ action where the thrown object spontaneously rises from the ground.

‘We observe two types of reversal artefacts, physical, those exhibiting violations of the laws of nature, and improbable, those depicting a possible but unlikely scenario. These are not exclusive, and many reversed actions suffer both types of artefacts, like when uncrumpling a piece of paper.

‘Examples of physical artefacts include: inverted gravity (e.g. ‘dropping something’), spontaneous impulses on objects (e.g. ‘spinning a pen’), and irreversible state changes (e.g. ‘burning a candle’). An example of an improbable artefact: taking a plate from the cabinet, drying it, and placing it on the drying rack.’

This kind of re-use of data is very common at training time, and can be beneficial – for example, in making sure that the model does not learn only one view of an image or object which can be flipped or rotated without losing its central coherency and logic.

This only works for objects that are truly symmetrical, of course; and learning physics from a ‘reversed’ video only works if the reversed version makes as much sense as the forward version.
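To make the idea concrete, here is a minimal Python sketch (not taken from either paper) of a reversal-based augmentation step that only plays clips backwards when they carry a ‘reversible’ label – the kind of label that a filtering method like the Bristol study’s could supply, or that could come from manual review.

```python
import random
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Clip:
    frames: np.ndarray      # (num_frames, height, width, channels)
    caption: str
    is_reversible: bool     # assumed label from a filtering model or manual review


def augment_with_reversal(clips: List[Clip], reverse_prob: float = 0.5) -> List[Clip]:
    """Optionally add time-reversed copies of clips, but only for clips whose motion
    is plausible in both directions (a swinging pendulum, say) – never for
    irreversible actions such as a candle burning down."""
    augmented = list(clips)
    for clip in clips:
        if clip.is_reversible and random.random() < reverse_prob:
            reversed_frames = clip.frames[::-1].copy()  # flip the time axis
            augmented.append(Clip(reversed_frames, clip.caption, clip.is_reversible))
    return augmented
```

An unguarded version of the same routine – one that reverses every clip – is exactly how a ‘rock rolling uphill’ could enter a model’s training distribution.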

Temporary Reversals

We don’t have any evidence that systems such as Hunyuan Video and Wan 2.1 allowed arbitrarily ‘reversed’ clips to be exposed to the model during training (neither group of researchers has been specific regarding data augmentation routines).

Yet the only reasonable alternative possibility, in the face of so many reports (and my own practical experience), would seem to be that the hyperscale datasets powering these models may contain clips that actually feature movements occurring in reverse.

The rock in the example video embedded above was generated using Wan 2.1, and features in a new study that examines how well video diffusion models handle physics.

In tests for this project, Wan 2.1 achieved a score of only 22% in terms of its ability to consistently adhere to physical laws.

However, that’s the best score of any system tested for the work, indicating that we may have found our next stumbling block for video AI:

Scores obtained by leading open and closed-source systems, with the output of the frameworks evaluated by human annotators. Source: https://arxiv.org/pdf/2503.06800


The authors of the new work have developed a benchmarking system, now in its second iteration, called VideoPhy, with the code available at GitHub.

Though the scope of the work is beyond what we can comprehensively cover here, let’s take a general look at its methodology, and its potential for establishing a metric that could help steer the course of future model-training sessions away from these bizarre instances of reversal.

The study, conducted by six researchers from UCLA and Google Research, is called VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation. A crowded accompanying project site is also available, along with code and datasets at GitHub, and a dataset viewer at Hugging Face.

Click to play. Here, the feted OpenAI Sora model fails to understand the interactions between oars and reflections, and is not able to provide a logical physical flow either for the person in the boat or the way that the boat interacts with her.

Method

The authors describe the latest version of their work, VideoPhy-2, as a ‘challenging commonsense evaluation dataset for real-world actions.’ The collection features 197 actions across a range of diverse physical activities such as hula-hooping, gymnastics and tennis, as well as object interactions, such as bending an object until it breaks.

A large language model (LLM) is used to generate 3,940 prompts from these seed actions, and the prompts are then used to synthesize videos via the various frameworks being trialed.

Throughout the process the authors have developed a list of ‘candidate’ physical rules and laws that AI-generated videos should satisfy, using vision-language models for evaluation.

The authors state:

‘For example, in a video of a sportsperson playing tennis, a physical rule would be that a tennis ball should follow a parabolic trajectory under gravity. For gold-standard judgments, we ask human annotators to score each video based on overall semantic adherence and physical commonsense, and to mark its compliance with various physical rules.’

Above: A text prompt is generated from an action using an LLM and used to create a video with a text-to-video generator. A vision-language model captions the video, identifying possible physical rules at play. Below: Human annotators evaluate the video’s realism, verify rule violations, add missing rules, and check whether the video matches the original prompt.

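In outline, the loop depicted above can be expressed as pseudocode. The sketch below is illustrative only: the helper objects (`llm`, `t2v_model`, `vlm`, `annotators`) and their methods are hypothetical stand-ins for the LLM prompt generator, the text-to-video model under test, the vision-language captioner, and the human annotation stage described by the authors.

```python
from typing import Dict, List


def evaluate_action(action: str, llm, t2v_model, vlm, annotators) -> Dict:
    """Illustrative outline of a VideoPhy-2-style evaluation loop for one seed action.

    All four dependencies are hypothetical stand-ins; none of these method names
    come from the released code."""
    # 1. Expand the seed action into a detailed, physically grounded prompt.
    prompt = llm.generate_prompt(action)

    # 2. Synthesize a video from the prompt with the model being evaluated.
    video = t2v_model.generate(prompt)

    # 3. Caption the video and propose candidate physical rules it should satisfy.
    caption = vlm.describe(video)
    candidate_rules: List[str] = vlm.propose_rules(caption)

    # 4. Human annotators score the result and verify/extend the rule list.
    review = annotators.review(video, prompt, candidate_rules)
    return {
        "semantic_adherence": review.semantic_score,      # 1-5 scale
        "physical_commonsense": review.physics_score,     # 1-5 scale
        "rule_violations": review.violated_rules,
    }
```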

Initially the researchers curated a set of actions to evaluate physical commonsense in AI-generated videos. They began with over 600 actions sourced from the Kinetics, UCF-101, and SSv2 datasets, focusing on activities involving sports, object interactions, and real-world physics.

Two independent groups of STEM-trained student annotators (with a minimum undergraduate qualification obtained) reviewed and filtered the list, selecting actions that tested principles such as gravity, momentum, and elasticity, while removing low-motion tasks such as typing, petting a cat, or chewing.

After further refinement with Gemini-2.0-Flash-Exp to eliminate duplicates, the final dataset comprised 197 actions, with 54 involving object interactions and 143 centered on physical and sports activities:

Samples from the distilled actions.


In the second stage, the researchers used Gemini-2.0-Flash-Exp to generate 20 prompts for each action in the dataset, resulting in a total of 3,940 prompts. The generation process focused on visible physical interactions that could be clearly represented in a generated video. This excluded non-visual elements such as emotions, sensory details, and abstract language, but included diverse characters and objects.

For example, instead of a simple prompt like ‘An archer releases the arrow’, the model was guided to produce a more detailed version such as ‘An archer draws the bowstring back to full tension, then releases the arrow, which flies straight and strikes a bullseye on a paper target‘.

Since modern video models can interpret longer descriptions, the researchers further refined the captions using the Mistral-NeMo-12B-Instruct prompt upsampler, to add visual details without altering the original meaning.
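As a rough illustration of this kind of prompt upsampling, the snippet below drives a transformers-compatible Mistral-NeMo instruct checkpoint through the Hugging Face text-generation pipeline; the instruction wording, generation settings and checkpoint name are my own assumptions rather than the authors’ published configuration.

```python
from transformers import pipeline

# A transformers-compatible Mistral-NeMo instruct checkpoint is assumed here; the paper
# names Mistral-NeMo-12B-Instruct but does not publish its exact prompt or settings.
upsampler = pipeline(
    "text-generation",
    model="mistralai/Mistral-Nemo-Instruct-2407",
    device_map="auto",
)

INSTRUCTION = (
    "Rewrite the following video caption with richer visual detail. "
    "Do not change the action, the objects involved, or the physical outcome.\n\n"
    "Caption: {caption}\nRewritten caption:"
)


def upsample_caption(caption: str) -> str:
    """Expand a short caption into a more detailed one without altering its meaning."""
    out = upsampler(
        INSTRUCTION.format(caption=caption),
        max_new_tokens=120,
        do_sample=False,
        return_full_text=False,
    )
    return out[0]["generated_text"].strip()


print(upsample_caption("An archer releases the arrow"))
```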

Sample prompts from VideoPhy-2, categorized by physical activities or object interactions. Each prompt is paired with its corresponding action and the relevant physical principle it tests.


For the third stage, physical rules were not derived from text prompts but from generated videos, since generative models can struggle to adhere to conditioned text prompts.

Videos were first created using VideoPhy-2 prompts, then ‘up-captioned’ with Gemini-2.0-Flash-Exp to extract key details. The model proposed three expected physical rules per video, which human annotators reviewed and expanded by identifying additional potential violations.

Examples from the upsampled captions.


Next, to identify the most challenging actions, the researchers generated videos using CogVideoX-5B with prompts from the VideoPhy-2 dataset. They then selected 60 actions out of 197 where the model consistently failed to follow both the prompts and basic physical commonsense.

These actions involved physics-rich interactions such as momentum transfer in discus throwing, state changes such as bending an object until it breaks, balancing tasks such as tightrope walking, and complex motions that included back-flips, pole vaulting, and pizza tossing, among others. In total, 1,200 prompts were chosen to increase the difficulty of the sub-dataset.
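A simplified version of that selection step might look like the sketch below, which averages per-video scores by action and keeps the worst performers; the assumed 0-1 combined score is my own illustration, and only the 60-action cut-off mirrors the description above.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def select_hard_actions(scores: List[Tuple[str, float]], num_hard: int = 60) -> List[str]:
    """Pick the actions on which the probe model (CogVideoX-5B in the paper) performs worst.

    `scores` holds (action, combined_score) pairs, one per generated video, where
    combined_score is assumed to blend prompt adherence and physical commonsense
    on a 0-1 scale."""
    per_action: Dict[str, List[float]] = defaultdict(list)
    for action, score in scores:
        per_action[action].append(score)

    # Rank actions by mean score, lowest (hardest) first, and keep the worst `num_hard`.
    ranked = sorted(per_action, key=lambda a: mean(per_action[a]))
    return ranked[:num_hard]
```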

The resulting dataset comprised 3,940 captions – 5.72 times more than the earlier version of VideoPhy. The average length of the original captions is 16 tokens, while the upsampled captions reach 138 tokens – 1.88 times and 16.2 times longer than the earlier version’s captions, respectively.

The dataset also features 102,000 human annotations covering semantic adherence, physical commonsense, and rule violations across multiple video generation models.

Evaluation

The researchers then defined clear criteria for evaluating the videos. The main goal was to assess how well each video matched its input prompt and followed basic physical principles.

Instead of simply ranking videos by preference, they used rating-based feedback to capture specific successes and failures. Human annotators scored videos on a five-point scale, allowing for more detailed judgments, while the evaluation also checked whether videos followed various physical rules and laws.

For human evaluation, a group of 12 annotators were selected from trials on Amazon Mechanical Turk (AMT), and provided ratings after receiving detailed remote instructions. For fairness, semantic adherence and physical commonsense were evaluated separately (in the original VideoPhy study, they were assessed jointly).

The annotators first rated how well videos matched their input prompts, then separately evaluated physical plausibility, scoring rule violations and overall realism on a five-point scale. Only the original prompts were shown, to maintain a fair comparison across models.

The interface presented to the AMT annotators.


Though human judgment remains the gold standard, it’s expensive and comes with a number of caveats. Therefore automated evaluation is essential for faster and more scalable model assessments.

The paper’s authors tested several video-language models, including Gemini-2.0-Flash-Exp and VideoScore, on their ability to score videos for semantic accuracy and for ‘physical commonsense’.

The models again rated each video on a five-point scale, while a separate classification task determined whether physical rules were followed, violated, or unclear.

Experiments showed that existing video-language models struggled to match human judgments, mainly due to weak physical reasoning and the complexity of the prompts. To improve automated evaluation, the researchers developed VideoPhy-2-Autoeval, a 7B-parameter model fine-tuned on the VideoCon-Physics model using 50,000 human annotations*, and designed to provide more accurate predictions across three categories: semantic adherence; physical commonsense; and rule compliance.
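The released code defines its own interfaces, but conceptually the automated judge has to emit three outputs per video, which can then be compared against the human ratings. The schema and toy agreement check below are my own illustration, not the VideoPhy-2-Autoeval API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AutoEvalResult:
    semantic_adherence: int      # 1-5: how well the video matches its prompt
    physical_commonsense: int    # 1-5: overall physical plausibility
    rule_compliance: str         # "followed", "violated" or "unclear"


def agreement_rate(predicted: List[int], human: List[int], tolerance: int = 0) -> float:
    """Fraction of videos where the automated 1-5 score matches the human score,
    optionally within +/- `tolerance` points."""
    assert len(predicted) == len(human) and human, "score lists must be aligned and non-empty"
    hits = sum(abs(p - h) <= tolerance for p, h in zip(predicted, human))
    return hits / len(human)
```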

Data and Tests

With these tools in place, the authors tested a number of generative video systems, both through local installations and, where necessary, via commercial APIs: CogVideoX-5B; VideoCrafter2; HunyuanVideo-13B; Cosmos-Diffusion; Wan2.1-14B; OpenAI Sora; and Luma Ray.

The models were prompted with upsampled captions where possible, except that Hunyuan Video and VideoCrafter2 operate under 77-token CLIP limitations, and cannot accept prompts above a certain length.
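One plausible way to handle that constraint – the authors do not spell out their exact fallback – is to measure the upsampled caption against CLIP’s 77-token limit and revert to the shorter original prompt when it will not fit, as in the sketch below using the standard CLIP tokenizer from `transformers`.

```python
from transformers import CLIPTokenizer

# The CLIP text encoder used by many diffusion-based video models truncates at 77 tokens.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")


def choose_prompt(original: str, upsampled: str, max_tokens: int = 77) -> str:
    """Prefer the detailed upsampled caption, but fall back to the original prompt
    if the upsampled version would be truncated by a CLIP text encoder."""
    num_tokens = len(tokenizer(upsampled)["input_ids"])  # count includes BOS/EOS tokens
    return upsampled if num_tokens <= max_tokens else original
```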

Videos generated were kept to less than 6 seconds, since shorter output is easier to evaluate.

The driving data came from the VideoPhy-2 dataset, which was split into a benchmark set and a training set. 590 videos were generated per model, except for Sora and Ray2, for which, due to the cost factor, correspondingly lower numbers of videos were generated.

(Please refer to the source paper for further evaluation details, which are exhaustively chronicled there)

The initial evaluation dealt with physical activities/sports (PA) and object interactions (OI), and tested both the general dataset and the aforementioned ‘harder’ subset:

Results from the initial round.


Here the authors comment:

‘Even the best-performing model, Wan2.1-14B, achieves only 32.6% and 21.9% on the full and hard splits of our dataset, respectively. Its relatively strong performance compared to other models can be attributed to the diversity of its multimodal training data, along with robust motion filtering that preserves high-quality videos across a wide range of actions.

‘Furthermore, we observe that closed models, such as Ray2, perform worse than open models like Wan2.1-14B and CogVideoX-5B. This suggests that closed models are not necessarily superior to open models in capturing physical commonsense.

‘Notably, Cosmos-Diffusion-7B achieves the second-best score on the hard split, even outperforming the much larger HunyuanVideo-13B model. This may be due to the high representation of human actions in its training data, along with synthetically rendered simulations.’

The results showed that video models struggled more with physical activities like sports than with simpler object interactions. This suggests that improving AI-generated videos in this area will require better datasets – particularly high-quality footage of sports such as tennis, discus, baseball, and cricket.

The study also examined whether a model’s physical plausibility correlated with other video quality metrics, such as aesthetics and motion smoothness. The findings revealed no strong correlation, meaning a model cannot improve its performance on VideoPhy-2 simply by producing visually appealing or fluid motion – it needs a deeper understanding of physical commonsense.
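That kind of check is straightforward to reproduce in outline: given per-video physical-commonsense ratings and scores from any off-the-shelf aesthetic or smoothness metric, a rank correlation near zero indicates the two measure different things. The arrays below are placeholders, not the paper’s data.

```python
from scipy.stats import spearmanr

# Placeholder per-video scores: in the study these would be human physical-commonsense
# ratings and a separate aesthetics or motion-smoothness metric.
physics_scores = [2, 4, 1, 3, 5, 2, 4, 1, 3, 2]
aesthetic_scores = [4, 3, 5, 4, 2, 5, 3, 4, 2, 5]

rho, p_value = spearmanr(physics_scores, aesthetic_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```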

Though the paper gives plentiful qualitative examples, few of the static examples supplied in the PDF appear to relate to the extensive video-based examples that the authors furnish on the project site. Therefore we will look at a small selection of the static examples, and then some more of the actual project videos.

The top row shows videos generated by Wan2.1. (a) In Ray2, the jet-ski on the left lags behind before moving backward. (b) In Hunyuan-13B, the sledgehammer deforms mid-swing, and a broken wooden board appears unexpectedly. (c) In Cosmos-7B, the javelin expels sand before making contact with the ground.


Regarding the above qualitative test, the authors comment:

‘[We] observe violations of physical commonsense, such as jetskis moving unnaturally in reverse and the deformation of a solid sledgehammer, defying the principles of elasticity. However, even Wan suffers from a lack of physical commonsense, as shown in [the clip embedded at the start of this article].

‘In this case, we highlight that a rock starts rolling and accelerating uphill, defying the physical law of gravity.’

Further examples from the project site:

Click to play. Here the caption was ‘A person vigorously twists a wet towel, water spraying outwards in a visible arc’ – but the resulting supply of water is much more like a water-hose than a towel.

Click to play. Here the caption was ‘A chemist pours a clear liquid from a beaker into a test tube, carefully avoiding spills’, but we can see that the amount of water being added to the beaker is not consistent with the amount exiting the jug.

As I mentioned at the outset, the amount of material related to this project far exceeds what can be covered here. Therefore please refer to the source paper, project site and associated sites mentioned earlier, for a truly exhaustive outline of the authors’ procedures, and considerably more testing examples and procedural details.

 

* As for the provenance of the annotations, the paper only specifies ‘obtained for these tasks’ – it seems a lot to have been generated by 12 AMT workers.

First published Thursday, March 13, 2025
