SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

Overview

Balancing semantic edits and motion fidelity

Current instruction-guided video editing systems often face a tradeoff between strong semantic modification and faithful motion preservation. SAMA addresses this by separating the problem into sparse semantic planning and motion modeling, avoiding heavy reliance on brittle external priors such as VLM features or structural control signals.

Semantic Anchoring jointly predicts semantic tokens and video latents at sparse anchor frames.
Motion Alignment learns temporal dynamics from raw videos through motion-centric pretext tasks.
A two-stage training pipeline produces strong zero-shot editing and competitive benchmark performance.

Teaser Figure

Method

Factorized semantic anchoring and motion alignment

Semantic Anchoring

SAMA predicts semantic tokens together with target latents at sparse anchor frames, providing instruction-aware structure planning in semantic space while retaining high-fidelity rendering in latent space.
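To make "sparse anchor frames" concrete, here is a minimal sketch of picking a few anchor indices from a clip. The page does not say how anchors are chosen. Uniform spacing, the function name `anchor_indices`, and its parameters are all assumptions made for illustration and are not taken from the SAMA implementation.

```python
def anchor_indices(num_frames: int, num_anchors: int) -> list[int]:
    """Return evenly spaced frame indices (hypothetical anchor selection rule)."""
    if num_frames <= 0 or num_anchors <= 0:
        raise ValueError("num_frames and num_anchors must be positive")
    if num_anchors >= num_frames:
        # Not sparse at all: every frame becomes an anchor.
        return list(range(num_frames))
    if num_anchors == 1:
        return [0]
    # Spread anchors from the first to the last frame, inclusive.
    step = (num_frames - 1) / (num_anchors - 1)
    return [round(i * step) for i in range(num_anchors)]


# Usage: 5 anchors in a 17-frame clip.
anchors = anchor_indices(17, 5)  # [0, 4, 8, 12, 16]
```

Under this assumption, semantic tokens would be predicted only at the returned indices, while the video latents cover the whole clip.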

Motion Alignment

The same backbone is pre-trained with cube inpainting, speed perturbation, and tube shuffle tasks so it can internalize temporal coherence directly from raw videos.
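The three pretext tasks named above can be illustrated as simple perturbations of a raw video array. This is a rough NumPy sketch under stated assumptions, not the authors' code: the page gives no task details, so the function names, the `(T, H, W, C)` layout, the zero fill value, and the reading of a "tube" as a spatial cell extended across all frames are hypothetical.

```python
import numpy as np


def cube_mask(video, t0, y0, x0, size):
    """Zero out a spatio-temporal cube; return the masked copy and the mask."""
    dt, dh, dw = size
    mask = np.zeros(video.shape[:3], dtype=bool)
    mask[t0:t0 + dt, y0:y0 + dh, x0:x0 + dw] = True
    masked = video.copy()
    masked[mask] = 0  # assumption: the hidden region is filled with zeros
    return masked, mask


def speed_perturb(video, stride):
    """Simulate faster playback by keeping every `stride`-th frame."""
    return video[::stride]


def tube_shuffle(video, grid, rng):
    """Permute spatial cells ("tubes") that each span the full clip length."""
    T, H, W, C = video.shape
    gh, gw = grid
    ph, pw = H // gh, W // gw
    # Crop to a multiple of the grid, then split into gh*gw tubes.
    tubes = video[:, :gh * ph, :gw * pw].reshape(T, gh, ph, gw, pw, C)
    tubes = tubes.transpose(1, 3, 0, 2, 4, 5).reshape(gh * gw, T, ph, pw, C)
    perm = rng.permutation(gh * gw)
    tubes = tubes[perm]  # output position k now holds original tube perm[k]
    out = tubes.reshape(gh, gw, T, ph, pw, C).transpose(2, 0, 3, 1, 4, 5)
    return out.reshape(T, gh * ph, gw * pw, C), perm


# Usage on a toy clip: 8 frames of 4x4 single-channel pixels.
video = np.arange(8 * 4 * 4, dtype=np.float32).reshape(8, 4, 4, 1)
masked, mask = cube_mask(video, t0=2, y0=1, x0=1, size=(3, 2, 2))
fast = speed_perturb(video, stride=2)  # shape (4, 4, 4, 1)
shuffled, perm = tube_shuffle(video, grid=(2, 2), rng=np.random.default_rng(0))
```

In a pretext-task setup of this kind, the model would be asked to undo or identify each perturbation (fill in the cube, recover the speed, restore the tube order), which requires it to reason about motion rather than appearance alone.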

Two-Stage Training

Stage 0 performs factorized pre-training on perturbed videos and captions. Stage 1 then fine-tunes on paired editing data to resolve remaining semantic-motion conflicts and improve fidelity.

Results

Comparison with other methods

Each example compares four videos (Source, SAMA, Kling-o1, UniVideo) for one editing instruction:

Replace the blackswan with a white cat.

Add a white cartoon cat in the right of the boy.

Turn into watercolor style.

Change all person's clothes color into black.

Remove the football.

Add a brown hat on the man's head.

Replace the background with a dynamic desert highway scene. Heat waves should shimmer above the asphalt, dust occasionally drifts across the road, and distant birds fly across the sky. The lighting should be bright and natural, with subtle shadows moving as if from a passing cloud. The man and SUV remain perfectly still.

Replace the subtitles "What's locked in these frames could rewrite her story." at the bottom of the video with "Her fingers trace fragile film - soft light, a life unspooling." in white text without border style.

Add a gorilla on the left, standing and trying to copy the dance moves.

Remove the young boy standing at the chalkboard.

Remove the wolf-like dog on the right.

Benchmark highlights

Table 3: Comparison results on OpenVE-Bench.

Table 4: Comparison results on ReCo-Bench.

Open-source SOTA

SAMA achieves state-of-the-art performance among open-source instruction-guided video editing models and remains competitive with leading commercial systems.

More visualizations

Add task

Each example pairs a Source Video with the SAMA result for one instruction:

Add a tiny red fez hat on the monkey's head.

Add a third, smaller Shiba Inu puppy sitting between the two dogs.

Add a woman with short black hair and big gold earrings sitting in front of the beige L-shaped sofa.

Add a woman with alternating light and dark hair sitting in the box.

Add a small pink paper boat floating on the water to the left of the duck.

Add a small, orange octopus sitting on the seal's head.

Add a desk lamp on the desk to the left of the laptop.

Add a golden crown hat on the head of the chipmunk on the left.

Replace task

Each example pairs a Source Video with the SAMA result for one instruction:

Replace the man in the green hoodie with a fluffy St. Bernard dog sitting in the chair, paws on the desk.

Replace the cows with a wolf.

Change the color of women's trousers to pink.

Replace the man's light gray hoodie with a sharp dark navy blue business suit, white shirt, and black tie, maintaining the same position and pose within the scene.

Replace the blue sports car with an electric green eco-friendly car, ensuring it maintains the same position and pose within the video scene.

Replace the gray squirrel with a small pigeon pecking food from the hand.

Replace the spotted baby seal on the sand with a red crab.

Replace the pigeon on the right with a squirrel.

Remove task

Each example pairs a Source Video with the SAMA result for one instruction:

Remove the subtitles at the top of the video.

Remove the boy.

Remove the woman.

Remove the bowl in the middle.

Remove the black hat on the man's head.

Remove the young woman with long, wavy brown hair and a serene expression from the entire video sequence. She is wearing a textured, off-white knit sweater with wide, ruffled sleeves, gazing upwards with her lips slightly parted and eyes softly closed, her head tilting slightly to the side while maintaining a relaxed posture with shoulders subtly leaning back. The background must be reconstructed with temporal consistency, and all other video content must remain unchanged.

Remove the woman on the left.

Remove the man in the green hoodie.

Style task

Each example pairs a Source Video with the SAMA result for one instruction:

Convert the video into a Ghibli style.

Make it snowy.

Transform the entire scene into pixel art style.

Convert the video into a Sketch style.

Zero-shot video editing behavior

Each example compares three videos (Source, SAMA-stage0 Result, SAMA-stage1 Result) for one instruction:

Change the color of the boat to yellow.

Add a little girl with long brown hair beside the boy.

Remove the dancing man.

Transform into illustration style.

The factorized pre-training stage alone already induces strong zero-shot video editing behavior before supervised fine-tuning on paired editing data.

Citation

BibTeX

@article{zhang2026sama,
      title={SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing},
      author={Zhang, Xinyao and Dong, Wenkai and Song, Yuxin and Fang, Bo and Zhang, Qi and Wang, Jing and Chen, Fan and Zhang, Hui and Feng, Haocheng and Lu, Yu and others},
      journal={arXiv preprint arXiv:2603.19228},
      year={2026}
}