arXiv 2026

SAMA

Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

Xinyao Zhang1,2,*, Wenkai Dong1,*, Yuxin Song1,*,†, Bo Fang1,3, Qi Zhang1, Jing Wang1,2, Fan Chen1, Hui Zhang1, Haocheng Feng1, Yu Lu4,‡, Hang Zhou1, Chun Yuan2, Jingdong Wang1

1 Baidu Inc 2 Tsinghua University 3 City University of Hong Kong 4 Zhejiang University

* Equal Contribution  † Project leader  ‡ Corresponding Author

Balancing semantic edits and motion fidelity

Current instruction-guided video editing systems often face a tradeoff between strong semantic modification and faithful motion preservation. SAMA addresses this by separating the problem into sparse semantic planning and motion modeling, avoiding heavy reliance on brittle external priors such as VLM features or structural control signals.

  • Semantic Anchoring jointly predicts semantic tokens and video latents at sparse anchor frames.
  • Motion Alignment learns temporal dynamics from raw videos through motion-centric pretext tasks.
  • A two-stage training pipeline produces strong zero-shot editing and competitive benchmark performance.

Teaser Figure

Figure 1: Teaser from the SAMA paper.

Factorized semantic anchoring and motion alignment

Figure 2: Overall pipeline of SAMA.

Semantic Anchoring

SAMA predicts semantic tokens together with target latents at sparse anchor frames, providing instruction-aware structure planning in semantic space while retaining high-fidelity rendering in latent space.
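To make "sparse anchor frames" concrete, here is a minimal sketch of one way to pick them. The page does not say how anchors are selected, so the uniform spacing and the function name `anchor_indices` are assumptions for illustration only, not the authors' procedure.

```python
def anchor_indices(num_frames: int, num_anchors: int) -> list[int]:
    """Return evenly spaced frame indices (hypothetical anchor choice).

    With two or more anchors, the first and last frames are always included.
    """
    if not 1 <= num_anchors <= num_frames:
        raise ValueError("need 1 <= num_anchors <= num_frames")
    if num_anchors == 1:
        return [0]
    # Integer arithmetic keeps the spacing exact and avoids float rounding.
    return [i * (num_frames - 1) // (num_anchors - 1) for i in range(num_anchors)]


if __name__ == "__main__":
    # A 17-frame clip with 5 anchors: semantic tokens would be predicted
    # only at these frames, while latents cover the full clip.
    print(anchor_indices(17, 5))  # [0, 4, 8, 12, 16]
```

Under this assumption the semantic branch touches only a handful of frames per clip, which is what keeps the planning step sparse.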

Motion Alignment

The same backbone is pre-trained with cube inpainting, speed perturbation, and tube shuffle tasks so it can internalize temporal coherence directly from raw videos.
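The three pretext tasks can be pictured as perturbations applied to a raw clip, which the backbone then learns to undo. The NumPy sketch below is one plausible reading of each task name; the cube and tube sizes, the masking ratio, the resampling rule, and all function names are assumptions, since the page does not give these details.

```python
import numpy as np


def cube_mask(video, cube=(4, 16, 16), ratio=0.5, rng=None):
    """Zero out random spatio-temporal cubes. Returns (masked_video, mask)."""
    rng = rng or np.random.default_rng()
    t, h, w, _ = video.shape
    ct, ch, cw = cube
    assert t % ct == 0 and h % ch == 0 and w % cw == 0, "cube must tile the clip"
    # Decide per cube whether it is masked, then expand to per-pixel resolution.
    grid = rng.random((t // ct, h // ch, w // cw)) < ratio
    mask = grid.repeat(ct, 0).repeat(ch, 1).repeat(cw, 2)
    masked = video.copy()
    masked[mask] = 0
    return masked, mask


def speed_perturb(video, factor):
    """Resample frames: factor > 1 speeds the clip up, factor < 1 slows it down."""
    t = video.shape[0]
    idx = np.clip(np.floor(np.arange(0, t, factor)), 0, t - 1).astype(int)
    return video[idx]


def tube_shuffle(video, tube=(16, 16), rng=None):
    """Permute spatial tubes (patches spanning all frames). Returns (video, perm)."""
    rng = rng or np.random.default_rng()
    t, h, w, c = video.shape
    th, tw = tube
    assert h % th == 0 and w % tw == 0, "tube must tile the frame"
    gh, gw = h // th, w // tw
    # (t, h, w, c) -> (gh * gw, t, th, tw, c): one row per tube.
    tubes = video.reshape(t, gh, th, gw, tw, c).transpose(1, 3, 0, 2, 4, 5)
    tubes = tubes.reshape(gh * gw, t, th, tw, c)
    perm = rng.permutation(gh * gw)
    shuffled = tubes[perm].reshape(gh, gw, t, th, tw, c)
    # Invert the layout change to get back a (t, h, w, c) clip.
    return shuffled.transpose(2, 0, 3, 1, 4, 5).reshape(t, h, w, c), perm


if __name__ == "__main__":
    clip = np.random.default_rng(0).random((8, 32, 32, 3), dtype=np.float32)
    masked, mask = cube_mask(clip)
    fast = speed_perturb(clip, 2)
    shuffled, perm = tube_shuffle(clip)
    print(masked.shape, fast.shape, shuffled.shape)
```

In each case the clean clip (or the returned mask or permutation) would serve as the training target, so no paired editing data is needed at this stage.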

Two-Stage Training

Stage 0 performs factorized pre-training on perturbed videos and captions. Stage 1 then fine-tunes on paired editing data to resolve remaining semantic-motion conflicts and improve fidelity.

Comparison with other methods

Source
SAMA
Kling-o1
UniVideo

Replace the blackswan with a white cat.

Source
SAMA
Kling-o1
UniVideo

Add a white cartoon cat in the right of the boy.

Source
SAMA
Kling-o1
UniVideo

Turn into watercolor style.

Source
SAMA
Kling-o1
UniVideo

Change all person's clothes color into black.

Source
SAMA
Kling-o1
UniVideo

Remove the football.

Source
SAMA
Kling-o1
UniVideo

Add a brown hat on the man's head.

Source
SAMA
Kling-o1
UniVideo

Replace the background with a dynamic desert highway scene. Heat waves should shimmer above the asphalt, dust occasionally drifts across the road, and distant birds fly across the sky. The lighting should be bright and natural, with subtle shadows moving as if from a passing cloud. The man and SUV remain perfectly still.

Source
SAMA
Kling-o1
UniVideo

Replace the subtitles "What's locked in these frames could rewrite her story." at the bottom of the video with "Her fingers trace fragile film - soft light, a life unspooling." in white text without border style.

Source
SAMA
Kling-o1
UniVideo

Add a gorilla on the left, standing and trying to copy the dance moves.

Source
SAMA
Kling-o1
UniVideo

Remove the young boy standing at the chalkboard.

Source
SAMA
Kling-o1
UniVideo

Remove the wolf-like dog on the right.

Benchmark highlights

Table 2: Comparison results on VIE-Bench.

Table 3: Comparison results on OpenVE-Bench.

Table 4: Comparison results on ReCo-Bench.

Open-source SOTA

SAMA achieves state-of-the-art performance among open-source instruction-guided video editing models and remains competitive with leading commercial systems.

Add task

Source Video
SAMA

Add a tiny red fez hat on the monkey's head.

Source Video
SAMA

Add a third, smaller Shiba Inu puppy sitting between the two dogs.

Source Video
SAMA

Add a woman with short black hair and big gold earrings sitting in front of the beige L-shaped sofa.

Source Video
SAMA

Add a woman with alternating light and dark hair sitting in the box.

Source Video
SAMA

Add a small pink paper boat floating on the water to the left of the duck.

Source Video
SAMA

Add a small, orange octopus sitting on the seal's head.

Source Video
SAMA

Add a desk lamp on the desk to the left of the laptop.

Source Video
SAMA

Add a golden crown hat on the head of the chipmunk on the left.

Replace task

Source Video
SAMA

Replace the man in the green hoodie with a fluffy St. Bernard dog sitting in the chair, paws on the desk.

Source Video
SAMA

Replace the cows with a wolf.

Source Video
SAMA

Change the color of women's trousers to pink.

Source Video
SAMA

Replace the man's light gray hoodie with a sharp dark navy blue business suit, white shirt, and black tie, maintaining the same position and pose within the scene.

Source Video
SAMA

Replace the blue sports car with an electric green eco-friendly car, ensuring it maintains the same position and pose within the video scene.

Source Video
SAMA

Replace the gray squirrel with a small pigeon pecking food from the hand.

Source Video
SAMA

Replace the spotted baby seal on the sand with a red crab.

Source Video
SAMA

Replace the pigeon on the right with a squirrel.

Remove task

Source Video
SAMA

Remove the subtitles at the top of the video.

Source Video
SAMA

Remove the boy.

Source Video
SAMA

Remove the woman.

Source Video
SAMA

Remove the bowl in the middle.

Source Video
SAMA

Remove the black hat on the man's head.

Source Video
SAMA

Remove the young woman with long, wavy brown hair and a serene expression from the entire video sequence. She is wearing a textured, off-white knit sweater with wide, ruffled sleeves, gazing upwards with her lips slightly parted and eyes softly closed, her head tilting slightly to the side while maintaining a relaxed posture with shoulders subtly leaning back. The background must be reconstructed with temporal consistency, and all other video content must remain unchanged.

Source Video
SAMA

Remove the woman on the left.

Source Video
SAMA

Remove the man in the green hoodie.

Style task

Source Video
SAMA

Convert the video into a Ghibli style.

Source Video
SAMA

Make it snowy.

Source Video
SAMA

Transform the entire scene into pixel art style.

Source Video
SAMA

Convert the video into a Sketch style.

Zero-shot video editing behavior

Source
SAMA-stage0 Result
SAMA-stage1 Result

Change the color of the boat to yellow.

Source
SAMA-stage0 Result
SAMA-stage1 Result

Add a little girl with long brown hair beside the boy.

Source
SAMA-stage0 Result
SAMA-stage1 Result

Remove the dancing man.

Source
SAMA-stage0 Result
SAMA-stage1 Result

Transform into illustration style.

The factorized pre-training stage alone already induces strong zero-shot video editing behavior before supervised fine-tuning on paired editing data.

BibTeX

@misc{zhang2026samafactorizedsemanticanchoring,
      title={SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing},
      author={Xinyao Zhang and Wenkai Dong and Yuxin Song and Bo Fang and Qi Zhang and Jing Wang and Fan Chen and Hui Zhang and Haocheng Feng and Yu Lu and Hang Zhou and Chun Yuan and Jingdong Wang},
      year={2026},
      eprint={2603.19228},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.19228},
}