How to Write Better Prompts for AI Video Generation with Reference Images

Learn how to write stronger prompts for DojoClip AI video generation with reference images, including how to choose your images, avoid contradictions, and guide subject, style, and motion more clearly.

Pansa Legrand · How to use

If Start + End Frames mode is about directing a transition, Reference Images mode is about directing consistency.

You use reference images when you want the model to stay anchored to:

  • a person
  • a character
  • a product
  • a visual style
  • a composition language

In DojoClip AI Video Generation, Reference Images mode lets you upload 1 to 3 reference images. The point is not to overload the model with random inspiration. The point is to give it a stable visual anchor, then use the prompt to describe what should happen in the video.

That last part matters.

The easiest rule to remember is this:

Reference images define what it is. The prompt defines what it does.

If you keep that rule in mind, your prompts will immediately get better.


What reference images are good for

Reference images are especially useful when you want:

  • the same person to stay recognizable
  • a product to keep its exact shape and design
  • a character to remain visually consistent
  • a campaign style to carry through the shot
  • a video to feel based on a specific visual world

This mode is usually stronger than pure text prompting when consistency matters.

For example:

  • a beauty ad with one exact bottle design
  • a fashion clip featuring one specific model look
  • a mascot or toy character that must stay recognizable
  • a branded lifestyle scene with a stable visual identity

If identity matters, reference images help a lot.


The biggest beginner mistake

Many beginners upload reference images and then write a prompt like this:

A woman with long dark hair wearing a cream trench coat and gold earrings in a softly lit luxury hotel hallway with beige walls and warm cinematic lighting.

This is weak for the same reason weak image-to-video prompts are weak:

  • it mostly re-describes what the images already show
  • it does not clearly describe the motion
  • it does not direct the camera

Better prompt:

Elegant slow tracking shot as the subject walks forward with calm confidence and briefly looks toward camera. Soft fabric movement, warm hallway reflections, and subtle depth-of-field create a premium fashion-film mood.

The images already tell the model who the subject is and what she looks like. Your prompt should focus on:

  • action
  • camera movement
  • scene energy
  • mood

That is where the useful control lives.


How to choose better reference images

The quality of the reference set matters as much as the wording of the prompt.

Good reference images are usually:

  • sharp and high quality
  • visually consistent with each other
  • focused on the same subject or product
  • shot from slightly different angles or framings
  • consistent in lighting and styling

Bad reference sets are often:

  • low quality
  • contradictory
  • mixing different people or products
  • wildly different in wardrobe, age, colors, or artistic style
  • trying to teach too many ideas at once

If the images in your set disagree with each other, your prompt has to fight unnecessary confusion.


What a strong 1-to-3 image set looks like

In practice, a good reference set often follows one of these patterns:

Pattern 1: One subject, one look, three useful angles

Use this when identity is the priority.

Example:

  • image 1: clean front-facing portrait
  • image 2: three-quarter angle
  • image 3: full-body or medium shot showing outfit silhouette

This works well for:

  • fashion
  • characters
  • influencers
  • portraits

Pattern 2: One product, three clarity shots

Use this when product design is the priority.

Example:

  • image 1: front hero angle
  • image 2: side angle showing form
  • image 3: close-up of material or label detail

This works well for:

  • perfume
  • skincare
  • sneakers
  • packaging

Pattern 3: One subject plus one style direction

Use this carefully.

If your subject is already stable, the extra image should reinforce the visual world, not contradict it. If the style image is too different, the result can drift.


A beginner-friendly prompt formula for reference images

Use this formula:

[shot type / camera move] + [subject or product action] + [environment motion] + [mood / style] + [ending emphasis]

Example template:

Smooth [camera move] as the subject [action]. [Environment motion] adds life to the scene. The overall feeling is [tone words], with a clean, cinematic finish.
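
If you write many prompts, the formula above can be turned into a small helper. The following is a minimal Python sketch, not part of DojoClip: the names ShotPlan and build_prompt are hypothetical, and the sentence layout is just one way to fill the template.

```python
from dataclasses import dataclass


@dataclass
class ShotPlan:
    """Hypothetical container for the five slots of the formula."""
    camera_move: str         # shot type / camera move
    action: str              # what the subject or product does
    environment_motion: str  # what moves around the subject
    mood: str                # tone words
    ending: str              # ending emphasis


def build_prompt(plan: ShotPlan) -> str:
    """Join the five slots into one prompt string, in formula order."""
    sentences = [
        f"{plan.camera_move} as the subject {plan.action}.",
        f"{plan.environment_motion}.",
        f"The overall feeling is {plan.mood}, {plan.ending}.",
    ]
    return " ".join(sentences)


plan = ShotPlan(
    camera_move="Smooth tracking shot",
    action="walks forward and briefly looks toward camera",
    environment_motion="Soft fabric movement adds life to the scene",
    mood="calm and premium",
    ending="with a clean, cinematic finish",
)
print(build_prompt(plan))
```

Keeping each slot as a separate field makes it easy to see when one is missing, for example a prompt with a mood but no camera move.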

Because the images already carry appearance, you often do not need to write:

  • exact hair color
  • exact outfit details
  • exact product design
  • every background object

Instead, focus on what the video should do.


Keep your wording general when the image already shows the subject

This is a subtle but useful trick.

When you have already supplied reference images, it often works better to refer to the person or object in broad terms like:

  • the subject
  • the woman
  • the man
  • the model
  • the bottle
  • the product

This keeps the prompt clean and avoids over-specifying details that the images already contain.

For example, instead of this:

The brunette woman with a cream trench coat and gold earrings turns slowly as her hair moves.

Try this:

The subject turns slowly as the fabric and hair move gently in the air.

Cleaner prompts are often stronger prompts.
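
One way to catch over-specified prompts before you generate is a quick word check. The sketch below is a rough illustration in Python, not a DojoClip feature: the APPEARANCE_WORDS list and the appearance_words_in function are hypothetical, and the list only covers the words from the example above, so you would extend it with the descriptors you tend to repeat.

```python
import re

# Hypothetical starter list; extend it with the words you tend to overuse.
APPEARANCE_WORDS = {"brunette", "blonde", "cream", "gold", "beige", "earrings", "trench"}


def appearance_words_in(prompt: str) -> list[str]:
    """Return appearance words found in the prompt, in order of first appearance."""
    found = []
    for word in re.findall(r"[a-z]+", prompt.lower()):
        if word in APPEARANCE_WORDS and word not in found:
            found.append(word)
    return found


over_specified = (
    "The brunette woman with a cream trench coat and gold earrings "
    "turns slowly as her hair moves."
)
general = "The subject turns slowly as the fabric and hair move gently in the air."

print(appearance_words_in(over_specified))
print(appearance_words_in(general))
```

Any word the check returns is a candidate to delete, because the reference images are already carrying that detail.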


Reference Images mode is not for random moodboards

This is worth saying clearly.

Do not treat the 1 to 3 image slots like a Pinterest board.

If one image is:

  • a red sports car

and the next is:

  • a watercolor anime portrait

and the third is:

  • a luxury perfume bottle

you are not helping the model. You are creating conflict.

Reference images should point in the same direction.

Ask yourself:

  • Are these images describing the same subject or visual world?
  • Would a human art director see them as one coherent set?
  • Is each image adding clarity instead of confusion?

If the answer is no, change the set before you change the prompt.


Prompt examples you can test

Here are three prompts you can adapt and test with your own reference sets.

Example 1: Fashion portrait

Reference set idea: three images of the same model in the same outfit from different angles

Prompt:

Smooth tracking shot as the subject walks toward camera with restrained confidence, then briefly turns her gaze to the side. Soft air movement lifts the hair and coat slightly, while reflected city lights shimmer in the background. The mood feels premium, editorial, and cinematic.

Why it works:

  • reference images handle identity and wardrobe
  • prompt handles motion and mood
  • camera instruction is simple and usable

Example 2: Product commercial

Reference set idea: three images of the same skincare bottle, including one close-up of texture and label

Prompt:

Elegant slow push-in on the product as condensation gathers on the surface and soft light glides across the bottle. Water droplets roll gently, background highlights shimmer, and the shot feels clean, modern, and luxurious with a polished commercial finish.

Why it works:

  • keeps the product central
  • motion is minimal but visually rich
  • avoids re-describing the label design line by line

Example 3: Stylized character video

Reference set idea: two or three images of the same illustrated character with consistent clothing, face, and palette

Prompt:

Slow cinematic push forward as the subject stands still for a beat, then raises their chin and lets a faint smile appear. Wind moves through the hair and clothing, glowing particles drift through the frame, and the atmosphere feels heroic, calm, and slightly magical.

Why it works:

  • the references hold character identity
  • the prompt creates the performance
  • the scene stays focused on one emotional beat

A bad prompt vs a better prompt

Bad:

Make a really beautiful luxury fashion video with a stylish woman and amazing cinematic lighting and expensive vibes.

Why it is weak:

  • vague
  • almost no motion direction
  • no camera direction
  • no scene behavior

Better:

Slow side-tracking shot as the subject walks through the hallway and lightly brushes one hand against the wall. The fabric moves softly, warm reflections pulse across the floor, and the mood feels elegant, quiet, and high-end.

Why it is better:

  • clear camera idea
  • clear action
  • clear environment motion
  • clear mood

How many reference images should you use?

Use the fewest images that clearly teach the model what matters.

Use 1 image when:

  • the subject is simple
  • style is obvious
  • you only need one strong anchor

Use 2 images when:

  • you need a second angle
  • you want identity plus pose clarity

Use 3 images when:

  • the subject or product has important details from multiple views
  • each image adds real clarity

Do not use 3 just because 3 is available.

More is only better when each image helps.


Final checklist for better Reference Images prompts

Before you generate, ask:

  • Do my reference images all describe the same subject or product?
  • Are the images high quality and visually consistent?
  • Does my prompt focus on motion instead of re-describing appearance?
  • Did I clearly define the camera move?
  • Am I asking for only one scene and one emotional beat?

That is enough to improve your results immediately.
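
Two of the checklist items can be checked mechanically. The following Python sketch is only an illustration under stated assumptions: the preflight function and the CAMERA_TERMS list are hypothetical, the keyword match is a crude substring test, and the only fact taken from DojoClip is the limit of 1 to 3 reference images. Image quality and consistency still need a human eye.

```python
# Hypothetical keyword list for spotting a camera instruction in a prompt.
CAMERA_TERMS = ("tracking shot", "push-in", "push forward", "dolly", "orbit", "zoom")


def preflight(image_count: int, prompt: str) -> list[str]:
    """Return a list of warnings; an empty list means the basic checks passed."""
    warnings = []
    # Reference Images mode accepts 1 to 3 images.
    if not 1 <= image_count <= 3:
        warnings.append("Reference Images mode takes 1 to 3 images.")
    # Crude substring check for a camera move; it will miss unusual phrasing.
    lowered = prompt.lower()
    if not any(term in lowered for term in CAMERA_TERMS):
        warnings.append("No camera move found in the prompt.")
    return warnings


good = "Slow side-tracking shot as the subject walks through the hallway."
vague = "Make a really beautiful luxury fashion video."

print(preflight(3, good))
print(preflight(4, vague))
```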

The best reference-image prompts are usually not the longest. They are the ones where the images carry identity, and the words clearly direct the action.

If you want to test these ideas directly, try the DojoClip AI Video Generator.