Replace the black swan with a white cat.
Overview
Balancing semantic edits and motion fidelity
Current instruction-guided video editing systems often face a tradeoff between strong semantic modification and faithful motion preservation. SAMA addresses this by separating the problem into sparse semantic planning and motion modeling, avoiding heavy reliance on brittle external priors such as VLM features or structural control signals.
- Semantic Anchoring jointly predicts semantic tokens and video latents at sparse anchor frames.
- Motion Alignment learns temporal dynamics from raw videos through motion-centric pretext tasks.
- A two-stage training pipeline produces strong zero-shot editing and competitive benchmark performance.
Teaser Figure
Method
Factorized semantic anchoring and motion alignment
Semantic Anchoring
SAMA predicts semantic tokens together with target latents at sparse anchor frames, providing instruction-aware structure planning in semantic space while retaining high-fidelity rendering in latent space.
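As a rough illustration of what "sparse anchor frames" and "joint prediction" could look like in code, the sketch below picks evenly spaced anchor indices and combines a latent term with a semantic-token term evaluated only at those anchors. Everything here is an assumption made for illustration: the page does not say how anchors are chosen or how the two objectives are weighted, and the names `anchor_indices`, `joint_loss`, and `sem_weight`, as well as the use of mean squared error, are hypothetical.

```python
import numpy as np

def anchor_indices(num_frames, num_anchors):
    """Pick evenly spaced anchor frames (uniform spacing is an assumption)."""
    if not 1 <= num_anchors <= num_frames:
        raise ValueError("num_anchors must be in [1, num_frames]")
    idx = np.linspace(0, num_frames - 1, num_anchors).round().astype(int)
    return np.unique(idx)

def joint_loss(latent_pred, latent_tgt, sem_pred, sem_tgt, anchors, sem_weight=1.0):
    """Hypothetical joint objective: latent error on all frames plus
    semantic-token error on the anchor frames only (MSE is a placeholder)."""
    latent_term = np.mean((latent_pred - latent_tgt) ** 2)
    sem_term = np.mean((sem_pred[anchors] - sem_tgt[anchors]) ** 2)
    return latent_term + sem_weight * sem_term

# Toy usage: 16 frames, 4 anchors, random stand-ins for latents and tokens.
rng = np.random.default_rng(0)
anchors = anchor_indices(16, 4)
latents = rng.normal(size=(16, 8))
tokens = rng.normal(size=(16, 4))
loss_value = joint_loss(latents, latents, tokens, tokens, anchors)
```

In this toy setup the semantic term only sees the anchor frames, so non-anchor frames are supervised through the latent term alone.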
Motion Alignment
The same backbone is pre-trained with cube inpainting, speed perturbation, and tube shuffle tasks so it can internalize temporal coherence directly from raw videos.
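The three pretext perturbations can be pictured with a small NumPy sketch operating on a clip of shape `(T, H, W, C)`. This is a simplified reading of the task names, not the authors' implementation: the cube size, the speed factor, the spatial grid, and the function names `cube_mask`, `speed_perturb`, and `tube_shuffle` are all assumptions. In particular, "tube" is interpreted here as a spatial cell extended across every frame.

```python
import numpy as np

def cube_mask(video, cube=(4, 32, 32), rng=None):
    """Zero out one random spatio-temporal cube; the model would inpaint it."""
    rng = np.random.default_rng() if rng is None else rng
    t, h, w, _ = video.shape
    ct, ch, cw = min(cube[0], t), min(cube[1], h), min(cube[2], w)
    t0 = rng.integers(0, t - ct + 1)
    y0 = rng.integers(0, h - ch + 1)
    x0 = rng.integers(0, w - cw + 1)
    mask = np.zeros((t, h, w), dtype=bool)
    mask[t0:t0 + ct, y0:y0 + ch, x0:x0 + cw] = True
    out = video.copy()
    out[mask] = 0
    return out, mask

def speed_perturb(video, factor):
    """Resample frames: factor > 1 speeds the clip up, factor < 1 slows it down."""
    t = video.shape[0]
    idx = np.floor(np.arange(0, t, factor)).astype(int)
    return video[idx]

def tube_shuffle(video, grid=(2, 2), rng=None):
    """Permute spatial cells, keeping each cell intact across all frames."""
    rng = np.random.default_rng() if rng is None else rng
    t, h, w, c = video.shape
    gh, gw = grid
    ph, pw = h // gh, w // gw
    v = video[:, :gh * ph, :gw * pw]  # crop so the grid divides evenly
    tubes = v.reshape(t, gh, ph, gw, pw, c).transpose(1, 3, 0, 2, 4, 5)
    tubes = tubes.reshape(gh * gw, t, ph, pw, c)
    perm = rng.permutation(gh * gw)
    out = tubes[perm].reshape(gh, gw, t, ph, pw, c).transpose(2, 0, 3, 1, 4, 5)
    return out.reshape(t, gh * ph, gw * pw, c), perm

# Toy clip: 8 frames of 8x8 RGB with unique values per element.
clip = np.arange(8 * 8 * 8 * 3, dtype=np.float64).reshape(8, 8, 8, 3)
masked, mask = cube_mask(clip, cube=(2, 4, 4), rng=np.random.default_rng(0))
fast = speed_perturb(clip, 2)
shuffled, perm = tube_shuffle(clip, grid=(2, 2), rng=np.random.default_rng(0))
```

Each function returns enough information (the mask, the resampled clip, the permutation) to define a recovery target, which is what makes these usable as self-supervised tasks on raw videos.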
Two-Stage Training
Stage 0 performs factorized pre-training on perturbed videos and captions. Stage 1 then fine-tunes on paired editing data to resolve remaining semantic-motion conflicts and improve fidelity.
Results
Comparison with other methods
Add a white cartoon cat to the right of the boy.
Turn into watercolor style.
Change the color of every person's clothes to black.
Remove the football.
Add a brown hat on the man's head.
Replace the background with a dynamic desert highway scene. Heat waves should shimmer above the asphalt, dust occasionally drifts across the road, and distant birds fly across the sky. The lighting should be bright and natural, with subtle shadows moving as if from a passing cloud. The man and SUV remain perfectly still.
Replace the subtitles "What's locked in these frames could rewrite her story." at the bottom of the video with "Her fingers trace fragile film - soft light, a life unspooling." in white text without border style.
Add a gorilla on the left, standing and trying to copy the dance moves.
Remove the young boy standing at the chalkboard.
Remove the wolf-like dog on the right.
Benchmark highlights
Open-source SOTA
SAMA achieves state-of-the-art performance among open-source instruction-guided video editing models and remains competitive with leading commercial systems.
More visualizations
Add task
Add a tiny red fez hat on the monkey's head.
Add a third, smaller Shiba Inu puppy sitting between the two dogs.
Add a woman with short black hair and big gold earrings sitting in front of the beige L-shaped sofa.
Add a woman with alternating light and dark hair sitting in the box.
Add a small pink paper boat floating on the water to the left of the duck.
Add a small, orange octopus sitting on the seal's head.
Add a desk lamp on the desk to the left of the laptop.
Add a golden crown hat on the head of the chipmunk on the left.
Replace task
Replace the man in the green hoodie with a fluffy St. Bernard dog sitting in the chair, paws on the desk.
Replace the cows with a wolf.
Change the color of women's trousers to pink.
Replace the man's light gray hoodie with a sharp dark navy blue business suit, white shirt, and black tie, maintaining the same position and pose within the scene.
Replace the blue sports car with an electric green eco-friendly car, ensuring it maintains the same position and pose within the video scene.
Replace the gray squirrel with a small pigeon pecking food from the hand.
Replace the spotted baby seal on the sand with a red crab.
Replace the pigeon on the right with a squirrel.
Remove task
Remove the subtitles at the top of the video.
Remove the boy.
Remove the woman.
Remove the bowl in the middle.
Remove the black hat on the man's head.
Remove the young woman with long, wavy brown hair and a serene expression from the entire video sequence. She is wearing a textured, off-white knit sweater with wide, ruffled sleeves, gazing upwards with her lips slightly parted and eyes softly closed, her head tilting slightly to the side while maintaining a relaxed posture with shoulders subtly leaning back. The background must be reconstructed with temporal consistency, and all other video content must remain unchanged.
Remove the woman on the left.
Remove the man in the green hoodie.
Style task
Convert the video into a Ghibli style.
Make it snowy.
Transform the entire scene into pixel art style.
Convert the video into a Sketch style.
Zero-shot video editing behavior
Change the color of the boat to yellow.
Add a little girl with long brown hair beside the boy.
Remove the dancing man.
Transform into illustration style.
The factorized pre-training stage alone already induces strong zero-shot video editing behavior before supervised fine-tuning on paired editing data.
Citation
BibTeX
@misc{zhang2026samafactorizedsemanticanchoring,
  title={SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing},
  author={Xinyao Zhang and Wenkai Dong and Yuxin Song and Bo Fang and Qi Zhang and Jing Wang and Fan Chen and Hui Zhang and Haocheng Feng and Yu Lu and Hang Zhou and Chun Yuan and Jingdong Wang},
  year={2026},
  eprint={2603.19228},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.19228},
}