HunyuanCustom Brings Single-Image Video Deepfakes, With Audio and Lip Sync

This article discusses a new release of the multimodal Hunyuan Video model, called ‘HunyuanCustom’. The new paper’s breadth of coverage, combined with several issues in many of the supplied example videos at the project page*, constrains us to more general coverage than usual, and to limited reproduction of the huge amount of video material accompanying this release (since many of the videos require significant re-editing and processing in order to improve the readability of the layout).

Please note additionally that the paper refers to the API-based generative system Kling as ‘Keling’. For clarity, I refer to ‘Kling’ instead throughout.

 

Tencent is in the process of releasing a new version of its Hunyuan Video model, titled HunyuanCustom. The new release is apparently capable of making Hunyuan LoRA models redundant, by allowing the user to create ‘deepfake’-style video customization through a single image:

Click to play. Prompt: ‘A man is listening to music and cooking snail noodles in the kitchen’. The new method is compared here to both closed-source and open-source methods, including Kling, which is a significant opponent in this space. Source: https://hunyuancustom.github.io/ (warning: CPU/memory-intensive site!)

In the left-most column of the video above, we see the single source image supplied to HunyuanCustom, followed by the new system’s interpretation of the prompt in the second column, next to it. The remaining columns show the results from various proprietary and FOSS systems: Kling; Vidu; Pika; Hailuo; and the Wan-based SkyReels-A2.

In the video below, we see renders of three scenarios essential to this release: respectively, person + object; single-character emulation; and virtual try-on (person + clothes):

Click to play. Three examples edited from the material at the supporting site for Hunyuan Video.

We can notice a few things from these examples, mostly related to the system relying on a single source image, instead of multiple images of the same subject.

In the first clip, the man is essentially still facing the camera. He dips his head down and sideways at not much more than 20-25 degrees of rotation but, at an inclination in excess of that, the system would really have to start guessing what he looks like in profile – which is hard, probably impossible, to gauge accurately from a sole frontal image.

In the second example, we see that the little girl is smiling in the rendered video, just as she is in the single static source image. Again, with this sole image as reference, HunyuanCustom would have to make a relatively uninformed guess about what her ‘resting face’ looks like. Additionally, her face does not deviate from a camera-facing stance any more than in the prior example (‘man eating crisps’).

In the last example, we see that since the source material – the woman and the clothes she is prompted into wearing – are not complete images, the render has cropped the scenario to fit – which is actually rather a good solution to a data issue!

The point is that though the new system can handle multiple images (such as person + crisps, or person + clothes), it does not apparently allow for multiple angles or alternative views of a single character, so that diverse expressions or unusual angles could be accommodated. To this extent, the system may therefore struggle to replace the growing ecosystem of LoRA models that have sprung up around HunyuanVideo since its release last December, since these can help HunyuanVideo to produce consistent characters from any angle and with any facial expression represented in the training dataset (20-60 images is typical).

Wired for Sound

For audio, HunyuanCustom leverages the LatentSync system (notoriously hard for hobbyists to set up and get good results from) for obtaining lip movements that are matched to audio and text that the user supplies:

Features audio. Click to play. Various examples of lip-sync from the HunyuanCustom supplementary site, edited together.

At the time of writing, there are no English-language examples, but these appear to be rather good – the more so if the method of creating them is easily-installable and accessible.

Editing Existing Video

The new system offers what appear to be very impressive results for video-to-video (V2V, or Vid2Vid) editing, wherein a segment of an existing (real) video is masked off and intelligently replaced by a subject given in a single reference image. Below is an example from the supplementary materials site:

Click to play. Only the central object is targeted, but what remains around it also gets altered in a HunyuanCustom vid2vid pass.

As we can see, and as is standard in a vid2vid scenario, the entire video is to some extent altered by the process, though most altered in the targeted region, i.e., the plush toy. Presumably pipelines could be developed to create such transformations under a garbage matte approach that leaves the majority of the video content identical to the original. This is what Adobe Firefly does under the hood, and does quite well –  but it is an under-studied process in the FOSS generative scene.

That said, most of the alternative examples provided do a better job of targeting these integrations, as we can see in the assembled compilation below:

Click to play. Diverse examples of interjected content using vid2vid in HunyuanCustom, exhibiting notable respect for the untargeted material.

A New Start?

This initiative is a development of the Hunyuan Video project, not a hard pivot away from that development stream. The project’s enhancements are introduced as discrete architectural insertions rather than sweeping structural changes, aiming to allow the model to maintain identity fidelity across frames without relying on subject-specific fine-tuning, as with LoRA or textual inversion approaches.

To be clear, therefore, HunyuanCustom is not trained from scratch, but rather is a fine-tuning of the December 2024 HunyuanVideo foundation model.

Those who have developed HunyuanVideo LoRAs may wonder if they will still work with this new edition, or whether they will have to reinvent the LoRA wheel yet again if they want more customization capabilities than are built into this new release.

In general, a heavily fine-tuned release of a hyperscale model alters the model weights enough that LoRAs made for the earlier model will not work properly, or at all, with the newly-refined model.

Sometimes, however, a fine-tune’s popularity can challenge its origins: one example of a fine-tune becoming an effective fork, with a dedicated ecosystem and followers of its own, is the Pony Diffusion tuning of Stable Diffusion XL (SDXL). Pony currently has 592,000+ downloads on the ever-changing CivitAI domain, with a vast range of LoRAs that have used Pony (and not SDXL) as the base model, and which require Pony at inference time.

Releasing

The project page for the new paper (which is titled HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation) features links to a GitHub site that, as I write, just became functional, and appears to contain all code and necessary weights for local implementation, together with a proposed timeline (where the only important thing yet to come is ComfyUI integration).

At the time of writing, the project’s Hugging Face presence is still a 404. There is, however, an API-based version where one can apparently demo the system, so long as you can provide a WeChat scan code.

I have rarely seen such an elaborate and extensive usage of such a wide variety of projects in one assembly, as is evident in HunyuanCustom – and presumably some of the licenses would in any case oblige a full release.

Two models are announced at the GitHub page: a 720px1280px version requiring 80GB of GPU peak memory, and a 512px896px version requiring 60GB of GPU peak memory.

The repository states ‘The minimum GPU memory required is 24GB for 720px1280px129f but very slow…We recommend using a GPU with 80GB of memory for better generation quality’ – and reiterates that the system has so far only been tested on Linux.

The earlier Hunyuan Video model has, since official release, been quantized down to sizes where it can be run on less than 24GB of VRAM, and it seems reasonable to assume that the new model will likewise be adapted into more consumer-friendly forms by the community, and that it will quickly be adapted for use on Windows systems too.

Due to time constraints and the overwhelming amount of information accompanying this release, we can take only a broad, rather than in-depth, look at it. Nonetheless, let’s pop the hood on HunyuanCustom a little.

A Look at the Paper

The data pipeline for HunyuanCustom, apparently compliant with the GDPR framework, incorporates both synthesized and open-source video datasets, including OpenHumanVid, with eight core categories represented: humans, animals, plants, landscapes, vehicles, objects, architecture, and anime.

From the release paper, an overview of the diverse contributing packages in the HunyuanCustom data construction pipeline. Source: https://arxiv.org/pdf/2505.04512


Initial filtering begins with PySceneDetect, which segments videos into single-shot clips. TextBPN-Plus-Plus is then used to remove videos containing excessive on-screen text, subtitles, watermarks, or logos.

To address inconsistencies in resolution and duration, clips are standardized to five seconds in length and resized to 512 or 720 pixels on the short side. Aesthetic filtering is handled using Koala-36M, with a custom threshold of 0.06 applied for the custom dataset curated by the new paper’s researchers.
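
As a very loose illustration of what this kind of curation stage looks like in practice, the sketch below uses the real PySceneDetect API for shot segmentation, but stubs out the text-detection and aesthetic scoring (stand-ins for TextBPN-Plus-Plus and Koala-36M); only the 0.06 aesthetic threshold is taken from the paper, and the text-coverage cut-off is an arbitrary placeholder:

```python
from scenedetect import detect, ContentDetector  # PySceneDetect shot segmentation

AESTHETIC_THRESHOLD = 0.06   # custom threshold cited in the paper
TEXT_COVERAGE_LIMIT = 0.1    # arbitrary placeholder, not from the paper

def text_coverage(clip) -> float:
    """Stub: fraction of frames carrying on-screen text (TextBPN-Plus-Plus stand-in)."""
    return 0.0

def aesthetic_score(clip) -> float:
    """Stub: aesthetic score for the clip (Koala-36M stand-in)."""
    return 1.0

def curate(video_path: str) -> list:
    """Split a source video into single-shot clips and keep only those passing the filters."""
    shots = detect(video_path, ContentDetector())       # list of (start, end) timecodes
    kept = []
    for start, end in shots:
        clip = (video_path, start, end)
        if text_coverage(clip) > TEXT_COVERAGE_LIMIT:    # drop text/watermark-heavy clips
            continue
        if aesthetic_score(clip) < AESTHETIC_THRESHOLD:  # drop low-aesthetic clips
            continue
        kept.append(clip)   # downstream: trim to five seconds, resize short side to 512/720px
    return kept
```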

The subject extraction process combines the Qwen7B Large Language Model (LLM), the YOLO11X object recognition framework, and the popular InsightFace architecture, to identify and validate human identities.

For non-human subjects, QwenVL and Grounded SAM 2 are used to extract relevant bounding boxes, which are discarded if too small.
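
A minimal sketch of how the identity validation and bounding-box filtering described above can be expressed, with the face-embedding call reduced to a stub standing in for InsightFace, and with illustrative thresholds that do not come from the paper:

```python
import numpy as np

MIN_BOX_AREA_FRAC = 0.01   # illustrative: discard boxes covering under 1% of the frame
ID_SIM_THRESHOLD = 0.5     # illustrative cosine-similarity cut-off, not from the paper

def face_embedding(image: np.ndarray) -> np.ndarray:
    """Stub standing in for an InsightFace embedding of the detected face."""
    return np.random.randn(512)

def same_identity(ref_img: np.ndarray, frame: np.ndarray) -> bool:
    """Check that a frame shows the same person as the reference image."""
    a, b = face_embedding(ref_img), face_embedding(frame)
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return cos >= ID_SIM_THRESHOLD

def box_big_enough(box, frame_shape) -> bool:
    """Discard bounding boxes that are too small relative to the frame."""
    x0, y0, x1, y1 = box
    h, w = frame_shape[:2]
    return ((x1 - x0) * (y1 - y0)) / float(h * w) >= MIN_BOX_AREA_FRAC
```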

Examples of semantic segmentation with Grounded SAM 2, used in the Hunyuan Control project. Source: https://github.com/IDEA-Research/Grounded-SAM-2


Multi-subject extraction utilizes Florence2 for bounding box annotation, and Grounded SAM 2 for segmentation, followed by clustering and temporal segmentation of training frames.

The processed clips are further enhanced via annotation, using a proprietary structured-labeling system developed by the Hunyuan team, which furnishes layered metadata such as descriptions and camera motion cues.

Mask augmentation strategies, including conversion to bounding boxes, were applied during training to reduce overfitting and ensure the model adapts to diverse object shapes.
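
The bounding-box conversion mentioned above is a generic operation, and reduces to something like the following (not code from the paper):

```python
import numpy as np

def mask_to_box_mask(mask: np.ndarray) -> np.ndarray:
    """Replace a binary segmentation mask with its filled bounding box – a common
    augmentation to stop a model over-fitting to exact object silhouettes."""
    ys, xs = np.where(mask > 0)
    if len(xs) == 0:
        return mask.copy()                      # empty mask: nothing to convert
    box = np.zeros_like(mask)
    box[ys.min():ys.max() + 1, xs.min():xs.max() + 1] = 1
    return box
```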

Audio data was synchronized using the aforementioned LatentSync, and clips were discarded if their synchronization scores fell below a minimum threshold.

The blind image quality assessment framework HyperIQA was used to exclude videos scoring under 40 (on HyperIQA’s bespoke scale). Valid audio tracks were then processed with Whisper to extract features for downstream tasks.
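
Sketched very roughly, with the HyperIQA and sync scorers stubbed out (only the score-of-40 cut-off comes from the paper; the sync threshold shown is a placeholder), and using Whisper’s standard Python API for the feature step, this gating stage might look like:

```python
import whisper   # openai-whisper

MIN_IQA_SCORE = 40     # HyperIQA threshold cited in the paper
MIN_SYNC_SCORE = 0.5   # placeholder: the actual sync threshold is not given in this article

def hyperiqa_score(video_path: str) -> float:
    """Stub standing in for the HyperIQA blind image-quality model."""
    return 100.0

def sync_score(video_path: str, audio_path: str) -> float:
    """Stub standing in for a LatentSync-style audio-visual sync confidence."""
    return 1.0

def keep_av_clip(video_path: str, audio_path: str) -> bool:
    """Apply the quality and synchronization gates before feature extraction."""
    return (hyperiqa_score(video_path) >= MIN_IQA_SCORE
            and sync_score(video_path, audio_path) >= MIN_SYNC_SCORE)

def whisper_features(audio_path: str, model_size: str = "base"):
    """Extract Whisper encoder features from a surviving audio track."""
    model = whisper.load_model(model_size)
    audio = whisper.pad_or_trim(whisper.load_audio(audio_path))
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    return model.encoder(mel.unsqueeze(0))   # (1, frames, dim) feature tensor
```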

The authors incorporate the LLaVA language assistant model during the annotation phase, and they emphasize the central position that this framework has in HunyuanCustom. LLaVA is used to generate image captions and assist in aligning visual content with text prompts, supporting the construction of a coherent training signal across modalities:

The HunyuanCustom framework supports identity-consistent video generation conditioned on text, image, audio, and video inputs.


By leveraging LLaVA’s vision-language alignment capabilities, the pipeline gains a further layer of semantic consistency between visual elements and their textual descriptions – particularly valuable in multi-subject or complex-scene situations.

Custom Video

To permit video generation based on a reference image and a prompt, the two modules centered around LLaVA were created, first adapting the input structure of HunyuanVideo so that it could accept an image along with text.

This involved formatting the prompt in a way that embeds the image directly, or tags it with a short identity description. A separator token was used to stop the image embedding from overwhelming the prompt content.

Since LLaVA’s visual encoder tends to compress or discard fine-grained spatial details during the alignment of image and text features (particularly when translating a single reference image into a general semantic embedding), an identity enhancement module was incorporated. Since nearly all video latent diffusion models have some difficulty maintaining an identity without a LoRA, even in a five-second clip, the performance of this module in community testing may prove significant.

In any case, the reference image is then resized and encoded using the causal 3D-VAE from the original HunyuanVideo model, and its latent inserted into the video latent along the temporal axis, with a spatial offset applied to prevent the image from being directly reproduced in the output, while still guiding generation.
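
In schematic form only (this is not the official implementation; the tensor shapes, the use of torch.roll for the spatial offset, and the simple concatenation are assumptions made for illustration), the injection can be pictured like this:

```python
import torch

def inject_reference(video_latent: torch.Tensor,
                     ref_latent: torch.Tensor,
                     spatial_offset: int = 8) -> torch.Tensor:
    """Schematic only: prepend an image latent to a video latent along the temporal
    axis, shifted spatially so that it guides generation rather than being copied.

    video_latent: (B, C, T, H, W) latent from the causal 3D-VAE
    ref_latent:   (B, C, 1, H, W) encoded reference image
    """
    # torch.roll here is a stand-in for whatever spatial offset the authors use.
    shifted = torch.roll(ref_latent, shifts=(spatial_offset, spatial_offset), dims=(-2, -1))
    # Concatenate along the temporal dimension, so the denoiser sees the reference
    # as an extra conditioning 'frame'.
    return torch.cat([shifted, video_latent], dim=2)
```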

The model was trained using Flow Matching, with noise samples drawn from a logit-normal distribution – and the network was trained to recover the correct video from these noisy latents. LLaVA and the video generator were both fine-tuned together so that the image and prompt could guide the output more fluently and keep the subject identity consistent.
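
For readers unfamiliar with the training objective, a generic flow-matching step with logit-normal timestep sampling (not the authors’ code, and not specific to HunyuanCustom) looks roughly like this:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1, cond, p_mean=0.0, p_std=1.0):
    """Generic rectified-flow / flow-matching training step (illustrative only).
    x1: clean video latents (B, ...); cond: whatever conditioning the model takes."""
    b = x1.shape[0]
    # Logit-normal timestep sampling: t = sigmoid(n), n ~ N(p_mean, p_std)
    t = torch.sigmoid(p_mean + p_std * torch.randn(b, device=x1.device))
    t_ = t.view(b, *([1] * (x1.dim() - 1)))
    x0 = torch.randn_like(x1)            # pure noise endpoint
    xt = (1.0 - t_) * x0 + t_ * x1       # point on the straight path from noise to data
    target_v = x1 - x0                   # constant velocity along that path
    pred_v = model(xt, t, cond)          # the network predicts the velocity
    return F.mse_loss(pred_v, target_v)
```

At inference, the learned velocity field is simply integrated from noise towards data over a number of sampling steps.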

For multi-subject prompts, each image-text pair was embedded separately and assigned a distinct temporal position, allowing identities to be distinguished, and supporting the generation of scenes involving multiple interacting subjects.

Sound and Vision

HunyuanCustom conditions audio/speech generation using both user-input audio and a text prompt, allowing characters to speak within scenes that reflect the described setting.

To support this, an Identity-disentangled AudioNet module introduces audio features without disrupting the identity signals embedded from the reference image and prompt. These features are aligned with the compressed video timeline, divided into frame-level segments, and injected using a spatial cross-attention mechanism that keeps each frame isolated, preserving subject consistency and avoiding temporal interference.

A second temporal injection module provides finer control over timing and motion, working in tandem with AudioNet, mapping audio features to specific regions of the latent sequence, and using a Multi-Layer Perceptron (MLP) to convert them into token-wise motion offsets. This allows gestures and facial movement to follow the rhythm and emphasis of the spoken input with greater precision.
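
Purely as a schematic of the two injection routes described (frame-isolated cross-attention plus MLP-derived token-wise offsets), and emphatically not the authors’ actual module, the idea can be pictured as:

```python
import torch
import torch.nn as nn

class AudioInjectionSketch(nn.Module):
    """Illustrative stand-in: per-frame cross-attention from video tokens to that
    frame's audio features, plus an MLP producing token-wise motion offsets."""
    def __init__(self, dim: int, audio_dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=audio_dim,
                                          vdim=audio_dim, batch_first=True)
        self.offset_mlp = nn.Sequential(nn.Linear(audio_dim, dim), nn.GELU(),
                                        nn.Linear(dim, dim))

    def forward(self, video_tokens, audio_feats):
        """video_tokens: (B, T, N, D) spatial tokens per frame; audio_feats: (B, T, A, Da)."""
        out = []
        for t in range(video_tokens.shape[1]):   # each frame attends only to its own audio
            attended, _ = self.attn(video_tokens[:, t], audio_feats[:, t], audio_feats[:, t])
            offset = self.offset_mlp(audio_feats[:, t].mean(dim=1, keepdim=True))
            out.append(video_tokens[:, t] + attended + offset)
        return torch.stack(out, dim=1)
```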

HunyuanCustom allows subjects in existing videos to be edited directly, replacing or inserting people or objects into a scene without needing to rebuild the entire clip from scratch. This makes it useful for tasks that involve altering appearance or motion in a targeted way.

Click to play. A further example from the supplementary site.

To facilitate efficient subject-replacement in existing videos, the new system avoids the resource-intensive approach of current methods such as the currently-popular VACE, or those that merge entire video sequences together, favoring instead the compression of a reference video using the pretrained causal 3D-VAE – aligning it with the generation pipeline’s internal video latents, and then adding the two together. This keeps the process relatively lightweight, while still allowing external video content to guide the output.

A small neural network handles the alignment between the clean input video and the noisy latents used in generation. The system tests two ways of injecting this information: merging the two sets of features before compressing them again; and adding the features frame by frame. The second method works better, the authors found, and avoids quality loss while keeping the computational load unchanged.
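
A schematic of the second, preferred route, again with the shapes and the small alignment network as assumptions rather than the paper’s actual code:

```python
import torch
import torch.nn as nn

class VideoConditionerSketch(nn.Module):
    """Illustrative: align clean reference-video latents with the noisy generation
    latents and add them frame by frame (the variant the authors found better)."""
    def __init__(self, channels: int):
        super().__init__()
        self.align = nn.Conv3d(channels, channels, kernel_size=1)   # small alignment net

    def forward(self, noisy_latent: torch.Tensor, ref_video_latent: torch.Tensor):
        """Both latents: (B, C, T, H, W), already compressed by the causal 3D-VAE."""
        aligned = self.align(ref_video_latent)
        return noisy_latent + aligned   # per-frame addition; no merge-and-recompress step
```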

Data and Tests

In tests, the metrics used were: identity consistency, via ArcFace, which extracts facial embeddings from both the reference image and each frame of the generated video, and then calculates the average cosine similarity between them; subject similarity, via sending YOLO11x segments to Dino 2 for comparison; CLIP-B text-video alignment, which measures similarity between the prompt and the generated video; CLIP-B again, to calculate similarity between each frame and both its neighboring frames and the first frame, as a measure of temporal consistency; and dynamic degree, as defined by VBench.
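
As a rough idea of how the first of these metrics (ArcFace-style identity consistency) reduces to code, with the embedding extractor stubbed out:

```python
import numpy as np

def face_embed(image: np.ndarray) -> np.ndarray:
    """Stub standing in for an ArcFace facial embedding."""
    return np.random.randn(512)

def identity_consistency(ref_image: np.ndarray, frames: list) -> float:
    """Average cosine similarity between the reference face and every generated frame."""
    ref = face_embed(ref_image)
    ref = ref / np.linalg.norm(ref)
    sims = []
    for frame in frames:
        emb = face_embed(frame)
        sims.append(float(ref @ (emb / np.linalg.norm(emb))))
    return float(np.mean(sims)) if sims else 0.0
```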

As indicated earlier, the baseline closed-source rivals were Hailuo; Vidu 2.0; Kling (1.6); and Pika. The competing FOSS frameworks were VACE and SkyReels-A2.

Model performance evaluation comparing HunyuanCustom with leading video customization methods across ID consistency (Face-Sim), subject similarity (DINO-Sim), text-video alignment (CLIP-B-T), temporal consistency (Temp-Consis), and motion intensity (DD). Optimal and sub-optimal results are shown in bold and underlined, respectively.


Of these results, the authors state:

‘Our [HunyuanCustom] achieves the best ID consistency and subject consistency. It also achieves comparable results in prompt following and temporal consistency. [Hailuo] has the best clip score because it can follow text instructions well with only ID consistency, sacrificing the consistency of non-human subjects (the worst DINO-Sim). In terms of Dynamic-degree, [Vidu] and [VACE] perform poorly, which may be due to the small size of the model.’

Though the project site is saturated with comparison videos (the format of which seems to have been designed for site aesthetics rather than easy comparison), it does not currently feature a video equivalent of the static results crammed together in the PDF, in regard to the initial qualitative tests. Though I include it here, I encourage the reader to make a close examination of the videos at the project site, as they give a better impression of the results:

From the paper, a comparison on object-centered video customization. Though the viewer should (as always) refer to the source PDF for better resolution, the videos at the project site might be a more illuminating resource.


The authors comment here:

‘It can be seen that [Vidu], [Skyreels A2] and our method achieve relatively good results in prompt alignment and subject consistency, but our video quality is better than Vidu and Skyreels, thanks to the good video generation performance of our base model, i.e., [Hunyuanvideo-13B].

‘Among commercial products, though [Kling] has a good video quality, the first frame of the video has a copy-paste [problem], and sometimes the subject moves too fast and [blurs], leading to a poor viewing experience.’

The authors further comment that Pika performs poorly in terms of temporal consistency, introducing subtitle artifacts (a result of poor data curation, where text elements in video clips have been allowed to pollute the core concepts).

Hailuo maintains facial identity, they state, but fails to preserve full-body consistency. Among open-source methods, VACE, the researchers assert, is unable to maintain identity consistency, while they contend that HunyuanCustom produces videos with strong identity preservation, while retaining quality and diversity.

Next, tests were conducted for multi-subject video customization, against the same contenders. As in the previous example, the flattened PDF results are not print equivalents of videos available at the project site, but are unique among the results presented:

Comparisons using multi-subject video customizations. Please see PDF for better detail and resolution.


The paper states:

‘[Pika] can generate the required subjects but exhibits instability in video frames, with instances of a man disappearing in one scenario and a woman failing to open a door as prompted. [Vidu] and [VACE] partially capture human identity but lose significant details of non-human objects, indicating a limitation in representing non-human subjects.

‘[SkyReels A2] experiences severe frame instability, with noticeable changes in chips and numerous artifacts in the right scenario.

‘In contrast, our HunyuanCustom effectively captures both human and non-human subject identities, generates videos that adhere to the given prompts, and maintains high visual quality and stability.’

A further experiment was ‘virtual human advertisement’, wherein the frameworks were tasked to integrate a product with a person:

From the qualitative testing round, examples of neural 'product placement'. Please see PDF for better detail and resolution.


For this round, the authors state:

‘The [results] demonstrate that HunyuanCustom effectively maintains the identity of the human while preserving the details of the target product, including the text on it.

‘Furthermore, the interaction between the human and the product appears natural, and the video adheres closely to the given prompt, highlighting the substantial potential of HunyuanCustom in generating advertisement videos.’

One area where video results would have been very useful was the qualitative round for audio-driven subject customization, where the character speaks the corresponding audio in a text-described scene and posture.

Partial results given for the audio round – though video results might have been preferable in this case. Only the top half of the PDF figure is reproduced here, as it is large and hard to accommodate in this article. Please refer to source PDF for better detail and resolution.


The authors assert:

‘Previous audio-driven human animation methods input a human image and an audio, where the human posture, attire, and environment remain consistent with the given image and cannot generate videos in other gesture and environment, which may [restrict] their application.

‘…[Our] HunyuanCustom enables audio-driven human customization, where the character speaks the corresponding audio in a text-described scene and posture, allowing for more flexible and controllable audio-driven human animation.’

Further tests (please see the PDF for full details) included a round pitting the new system against VACE and Kling 1.6 for video subject replacement:

Testing subject replacement in video-to-video mode. Please refer to source PDF for better detail and resolution.


Of these, the final tests presented in the new paper, the researchers opine:

‘VACE suffers from boundary artifacts due to strict adherence to the input mask, resulting in unnatural subject shapes and disrupted motion continuity. [Kling], in contrast, exhibits a copy-paste effect, where subjects are directly overlaid onto the video, leading to poor integration with the background.

‘In comparison, HunyuanCustom effectively avoids boundary artifacts, achieves seamless integration with the video background, and maintains strong identity preservation – demonstrating its superior performance in video editing tasks.’

Conclusion

This is a fascinating release, not least because it addresses something that the ever-discontent hobbyist scene has been complaining about more lately – the lack of lip-sync, so that the increased realism achievable in systems such as Hunyuan Video and Wan 2.1 can be given a new dimension of authenticity.

Though the format of nearly all the comparative video examples at the project site makes it rather difficult to compare HunyuanCustom’s capabilities against prior contenders, it must be noted that very, very few projects in the video synthesis space have the courage to pit themselves in tests against Kling, the commercial video diffusion API which is always hovering at or near the top of the leaderboards; Tencent appears to have made headway against this incumbent in a rather impressive manner.

 

* The issue being that some of the videos are so large, fast, and high-resolution that they will not play in standard video players such as VLC or Windows Media Player, instead showing black screens.

First published Thursday, May 8, 2025
