Seedance 2.0 VIP

ByteDance's cinematic multimodal video model with native synced audio

Overview

Seedance 2.0 VIP is ByteDance's (Doubao) native multimodal video model. Instead of stitching audio onto the visuals after the fact, it generates picture and sound as one.

Feed it text, an image, a video, or audio, and it returns cinematic video with natively synced sound — lip-sync, dialogue, ambient noise, and background music all arrive together with the picture.

Capabilities

It's strong at multi-shot consistency, keeping characters and scenes coherent across cuts, with cinematic camera control to match.

Supported resolutions are 480p, 720p, 1080p, and 4K, and clips run 4 to 15 seconds at 24 fps.
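As a rough illustration of those limits, the sketch below checks a requested clip against the resolutions and duration range stated above and works out how many frames it contains at 24 fps. The function and constant names are hypothetical and do not correspond to any Seedance or ByteDance API; only the numbers come from this page.

```python
# Limits taken from the description above. All names are illustrative only
# and are not part of any Seedance or ByteDance API.
ALLOWED_RESOLUTIONS = ("480p", "720p", "1080p", "4K")
MIN_SECONDS, MAX_SECONDS = 4, 15
FPS = 24


def frame_count(resolution: str, seconds: int) -> int:
    """Validate a clip request against the stated limits and return its frame count."""
    if resolution not in ALLOWED_RESOLUTIONS:
        raise ValueError(f"unsupported resolution: {resolution!r}")
    if not MIN_SECONDS <= seconds <= MAX_SECONDS:
        raise ValueError(
            f"duration must be {MIN_SECONDS} to {MAX_SECONDS} seconds, got {seconds}"
        )
    return seconds * FPS


if __name__ == "__main__":
    # A 5-second clip at 24 fps
    print(frame_count("720p", 5))
```

Under these assumptions a 5-second clip is 120 frames and the 15-second maximum is 360 frames, regardless of resolution.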

This is the complete, full-power model rather than a cut-down variant, so you get everything it is capable of.

How to use here

Open «Generate → Video», then either describe the shot you have in mind or upload a reference image or clip.

There is no render queue: hit generate and it starts working right away.

Real human-face reference photos are enabled here too. Seedance rejects real faces by default, but here that restriction is lifted, so you can drop in a photo of a real face as a reference. This is reference-based generation, not face-swap or deepfake.

Credits

A default 720p, 5-second clip costs around 40 credits.

Higher resolution and longer duration scale the cost up from there. One credit is roughly ¥0.1.
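To make the arithmetic concrete, here is a minimal sketch that converts credits to an approximate yuan amount and estimates how many default clips a credit balance covers. It assumes only the two figures given above (about 40 credits per default clip, roughly ¥0.1 per credit). Both are approximate, the helper names are hypothetical, and the sketch does not model how cost scales with resolution or duration, since that scaling is not specified here.

```python
# Figures from the text above; both are approximate.
YUAN_PER_CREDIT = 0.1       # "One credit is roughly ¥0.1"
DEFAULT_CLIP_CREDITS = 40   # default 720p, 5-second clip

# Helper names below are illustrative, not part of any real API.


def credits_to_yuan(credits: float) -> float:
    """Approximate cost in yuan for a given number of credits."""
    return round(credits * YUAN_PER_CREDIT, 2)


def default_clips_affordable(budget_credits: int) -> int:
    """How many default 720p, 5-second clips a credit balance covers."""
    return budget_credits // DEFAULT_CLIP_CREDITS


if __name__ == "__main__":
    print(credits_to_yuan(DEFAULT_CLIP_CREDITS))
    print(default_clips_affordable(1000))
```

By this estimate a default clip comes to about ¥4, and a balance of 1,000 credits covers 25 default clips.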

Best for & tips

It shines on short films, ads, product demos, and brief dialogue-driven scenes: anything where sound and image need to land as one.

A tip: rough out your shot and pacing at 720p and 5 seconds first, then bump up to 1080p or 4K for the final render. You'll save credits and move faster.