Domain: video

Workflow

Full description of the Synthesia Dify workflow — inputs, internal logic, and node-by-node flow.


Inputs

Input              Type       Default            Description
text               paragraph                     Source content for the video
course_title       string                        Localized course title
video_title        string                        Title of the current video section
avatar_id          string                        Synthesia avatar UUID (from trainer asset)
tone_of_voice      paragraph                     Trainer voice profile (from trainer asset)
sections_position  string                        Section number prefix, e.g. "1.1.2"
background_image   string                        URL of the slide background image
video_length       select     normal             short or normal; short = 2–3 slide template, 100–150 words total
test_mode          select     true               "true" or "false"; test = does not publish the video in Synthesia
fake_mode          select     false              "true" or "false"; see below
language           select     Default            Default (auto detection), fr, or nl
format             select     VideoPresentation  VideoPresentation or VideoTalkingHead

Full pipeline

Start
  ├─► IF fake_mode = true ──► fake UUID ──► End (no API call made)
  │
  └─► Synthesia GET /v2/templates
       │
       └─► Filter by curated allowlist from repo
            (34 supported template IDs; see templates.md)
            │
            └─► Split by format
                 (VideoPresentation or VideoTalkingHead)
                 │
                 └─► LLM Step 1 — template selection (Gemini 2.5 Pro)
                      (choose best template, generate image prompts if needed)
                      │
                      └─► IF generate_image = true?
                           ├─ true ──► Iteration (parallel ×4):
                           │            for each image prompt →
                           │            IF aspect_ratio = horizontal?
                           │              ├─ true ──► Nano Banana 1536×1024
                           │              └─ false ─► Nano Banana 1024×1536
                           │            → collect {variable_name, image_url}
                           └─ false ─┐
                                     └─► Template-transform (variables + image URLs)
                                          │
                                          └─► Loop (up to 10 retries):
                                               LLM Step 2 — build payload (Gemini 2.5 Pro)
                                               │
                                               └─► Validate JSON (Python)
                                                    └─► IF valid?
                                                         ├─ true ──► Exit loop
                                                         └─ false ─► retry
                                               │
                                               └─► POST https://api.synthesia.io/v2/videos/fromTemplate
                                                    │
                                                    └─► IF status_code = 6000?
                                                         ├─ yes ──► fallback POST (hardcoded payload)
                                                     └─ no ───► Extract video_id ──► End
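The tail of the pipeline (validated payload → POST → fallback → video id) can be sketched as follows. This is a hedged sketch: the `post` callable, the `FALLBACK` payload, and the assumption that error code 6000 surfaces in the response body are illustrative; only the endpoint URL comes from the workflow.

```python
# Sketch of the final POST step with the status-6000 fallback.
# `post` stands in for a real HTTP client; FALLBACK is a placeholder
# for the hardcoded payload used by the fallback POST node.
SYNTHESIA_URL = "https://api.synthesia.io/v2/videos/fromTemplate"

FALLBACK = {"templateId": "placeholder", "templateData": {}}  # hypothetical

def create_video(payload: dict, post) -> str:
    """POST the validated payload; on status_code 6000, retry once with
    the hardcoded fallback payload, then extract the video id."""
    resp = post(SYNTHESIA_URL, payload)
    if resp.get("status_code") == 6000:
        resp = post(SYNTHESIA_URL, FALLBACK)
    return resp["id"]
```

Injecting a stub client as `post` makes the fallback branch easy to exercise without spending Synthesia credits.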

Template sources and filtering

available_templates is not passed in by the caller. The workflow builds it internally at runtime from two different sources with different roles:

  1. Repository allowlist — the pipeline supports a curated set of 34 template IDs. This curated list is documented in templates.md. The repo is the source of truth for which templates are allowed in Fraya.
  2. Synthesia API — GET /v2/templates?limit=100&offset=0&source=workspace fetches all templates currently present in the Synthesia workspace, including their live metadata: id, title, description, and variables.

At runtime, the workflow:

  1. fetches the full template catalog from Synthesia;
  2. filters that catalog by the curated allowlist of 34 IDs supported by Fraya;
  3. keeps the live Synthesia metadata for the matching templates;
  4. splits the result into presentations vs talking_head;
  5. passes only the subset matching the format input to Step 1.

This distinction matters:

  • The repo decides which template IDs are allowed.
  • Synthesia provides the current metadata for those IDs.
  • The workflow does not treat the repo as the source of variable definitions or template descriptions at runtime.
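The runtime filtering can be sketched as below. The catalog is stubbed out (in the real workflow it comes from the GET /v2/templates call), the second allowlist ID and the `kind` field are illustrative placeholders, and the real split may key on different metadata.

```python
# Sketch of the allowlist + format filtering; the "kind" field and the
# second allowlist ID are assumptions for illustration.
ALLOWLIST = {"9c23e3c3-401b-4373-9ca2-e15b669974a3", "another-allowed-id"}

def select_templates(catalog: list, fmt: str) -> list:
    """Keep only allowlisted templates, then only the requested format."""
    allowed = [t for t in catalog if t["id"] in ALLOWLIST]
    wanted = "talking_head" if fmt == "VideoTalkingHead" else "presentations"
    return [t for t in allowed if t["kind"] == wanted]

# stubbed catalog standing in for the live Synthesia response
catalog = [
    {"id": "9c23e3c3-401b-4373-9ca2-e15b669974a3", "kind": "talking_head"},
    {"id": "not-allowed", "kind": "talking_head"},
    {"id": "another-allowed-id", "kind": "presentations"},
]
subset = select_templates(catalog, "VideoTalkingHead")
```

Note that the live metadata (title, description, variables) rides along on the filtered dicts; the allowlist contributes only the IDs.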

Transition note: Google Sheets

Historically, the allowlist was maintained in Google Sheets and later copied into the repo. Some Dify workflow exports may still contain a Google Sheets node from that earlier setup.

For documentation purposes, treat Google Sheets as legacy migration residue, not as the intended long-term source of truth. The target model is:

repo allowlist + live Synthesia metadata

If the workflow implementation still reads the old sheet in some environment, it should be understood as transitional behavior rather than the architectural design of the system.


Prompt sourcing

The Synthesia workflow should not treat prompt text as manually copied workflow content. The intended pattern is:

  • prompt definitions live in prompts/video/
  • Dify fetches prompt metadata from GET /api/prompts/:name when it needs the contract
  • Dify renders prompt text from POST /api/prompts/:name at runtime
  • the prompt API always returns system + user, even if the underlying YAML file was authored as a single-message prompt

For the video workflow, this means Step 1 and Step 2 prompt text should be loaded from the Fraya API rather than maintained as separate prompt copies inside Dify.
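A hedged sketch of the runtime render call follows. The Fraya base URL, the prompt name, and the exact response shape are assumptions; the endpoint path and the system + user guarantee come from the pattern described above.

```python
# Sketch of the prompt-sourcing call. FRAYA_API and the prompt name are
# placeholders; only POST /api/prompts/:name is taken from the doc.
import json
import urllib.request

FRAYA_API = "https://fraya.example.com"  # placeholder base URL

def render_prompt(name: str, variables: dict,
                  opener=urllib.request.urlopen) -> dict:
    """POST /api/prompts/:name — render prompt text at runtime."""
    req = urllib.request.Request(
        f"{FRAYA_API}/api/prompts/{name}",
        data=json.dumps(variables).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    body = json.loads(opener(req).read())
    # the prompt API always returns both messages, even when the
    # underlying YAML file was authored as a single-message prompt
    return {"system": body["system"], "user": body["user"]}
```

The injectable `opener` keeps the sketch testable without a live Fraya instance.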


fake_mode

When fake_mode = "true", the workflow skips the entire pipeline and immediately returns a randomly generated fake video ID. No LLM calls, no API calls, no image generation.

Use this to test the calling workflow (e.g. section content pipeline) without spending Synthesia credits or waiting for actual rendering.
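The short-circuit amounts to a few lines; this sketch uses an illustrative function name, not the actual Dify node.

```python
# Minimal sketch of the fake_mode short-circuit described above.
import uuid

def fake_video_id(fake_mode: str):
    """Return a random fake video ID when fake_mode is "true", else None
    so the real pipeline (templates, LLM steps, Synthesia POST) runs."""
    if fake_mode == "true":
        return str(uuid.uuid4())
    return None
```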


Image generation

If Step 1 decides images are needed (generate_image = true), the workflow runs an Iteration node (4 parallel threads) before Step 2.

For each image prompt returned by Step 1:

  • aspect_ratio = horizontal → Nano Banana at 1536×1024
  • aspect_ratio = vertical → Nano Banana at 1024×1536

Results are collected as {variable_name, image_url} pairs and passed to Step 2.

Two image styles — see content_rules.md for prompting guidelines.
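The dispatch and parallel iteration can be sketched as below, assuming Step 1 returns prompts with `variable_name`, `aspect_ratio`, and `prompt` fields; the `generate` callable stands in for the actual Nano Banana request.

```python
# Sketch of the aspect-ratio dispatch and 4-thread iteration node.
from concurrent.futures import ThreadPoolExecutor

SIZES = {"horizontal": (1536, 1024), "vertical": (1024, 1536)}

def run_image_jobs(prompts, generate):
    """Render each Step 1 image prompt at the size implied by its
    aspect_ratio, on up to 4 parallel threads, and collect the
    {variable_name, image_url} pairs handed to Step 2."""
    def render(p):
        width, height = SIZES[p["aspect_ratio"]]
        return {
            "variable_name": p["variable_name"],
            "image_url": generate(p["prompt"], width, height),
        }
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(render, prompts))
```

`pool.map` preserves input order, so the collected pairs line up with the prompts Step 1 produced.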


JSON validation loop

After Step 2 generates the payload, a Python validator runs inside a loop (up to 10 retries).

What it checks:

  • Required top-level fields present: test, templateData, visibility, templateId, title
  • templateData is parseable (array of "key: value" strings → converts to dict)
  • avatar_id is in templateData (injected from workflow input if missing)
  • background_image is in templateData (injected from workflow input if missing)
  • All required template variables from Step 1 are present (no missing keys)
  • No empty values (empty string → replaced with single space " ")
  • Whitespace normalized (\u00A0, \u200B removed)

If valid = false, the LLM retries with the same instructions. In practice, 1–2 iterations suffice.

The validated dict (not the original array-of-strings) is what gets POSTed to Synthesia.
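The checks above can be sketched as a single normalization pass. This is an approximation of the production node: assume templateData may arrive either as the Step 2 array-of-strings or already as a dict, and that exact behavior may differ in detail.

```python
# Sketch of the validation pass described above; the real Dify Python
# node may differ in details.
REQUIRED_TOP = {"test", "templateData", "visibility", "templateId", "title"}

def validate_payload(payload: dict, inputs: dict, required_vars: set):
    """Return (valid, payload) with templateData normalized to a dict."""
    if not REQUIRED_TOP <= payload.keys():
        return False, payload
    raw = payload["templateData"]
    if isinstance(raw, list):
        # array of "key: value" strings → dict
        try:
            data = dict(item.split(": ", 1) for item in raw)
        except ValueError:
            return False, payload
    else:
        data = dict(raw)
    # inject avatar_id / background_image from workflow input if missing
    data.setdefault("avatar_id", inputs["avatar_id"])
    data.setdefault("background_image", inputs["background_image"])
    # every required template variable from Step 1 must be present
    if not required_vars <= data.keys():
        return False, payload
    for key in list(data):
        # remove non-breaking / zero-width spaces; never leave empty values
        value = data[key].replace("\u00A0", "").replace("\u200B", "")
        data[key] = value if value else " "
    payload["templateData"] = data
    return True, payload
```

On `valid = False` the loop re-runs Step 2; on success the dict form of templateData is what goes into the POST body.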


Worked example

A real short-form talking-head run, simplified from production output.

1. Workflow input

{
  "text": "<p>Hai costruito le tue fondamenta interiori... Esploriamo come formare queste connessioni forti e interdipendenti.</p>",
  "course_title": "Le 7 abitudini per essere più efficace",
  "video_title": "La vittoria pubblica: dall'io al noi",
  "avatar_id": "49dc8f46-8c08-45f1-8608-57069c173827",
  "sections_position": "R:171272 - 2.1.1(1787)",
  "tone_of_voice": "Tone: Clear, insightful, minimalist, guiding...",
  "test_mode": false,
  "fake_mode": false,
  "background_image": "synthesia.3f9b74bc-f870-4cc6-881e-7bff2394b686",
  "video_length": "short",
  "language": "Default (auto detection)",
  "format": "VideoTalkingHead"
}

2. What the pipeline does with it

  1. Fetches all workspace templates from Synthesia.
  2. Filters them by Fraya's curated allowlist of supported template IDs.
  3. Keeps only the VideoTalkingHead subset.
  4. Step 1 selects template 9c23e3c3-401b-4373-9ca2-e15b669974a3: [de] Talking head (A-Q-A-F).
  5. No image generation is needed for this template.
  6. Step 2 writes the slide scripts and text variables.
  7. The validator normalizes the payload and converts templateData into a dict before the API call.

3. Final payload sent to Synthesia

{
  "test": false,
  "templateData": {
    "course_title": "Le 7 abitudini per essere più efficace",
    "video_title": "La vittoria pubblica: dall'io al noi",
    "slide_0_script_intro": "Ora, passiamo dall'io al noi.",
    "slide_1_script": "Hai costruito le tue fondamenta interiori. Ma l'autosufficienza non è il punto d'arrivo. È il terreno stabile di cui hai bisogno per connetterti con gli altri.",
    "slide_2_key_idea_max_50": "Dall'autosufficienza alla collaborazione.",
    "slide_3_script": "Per raggiungere una vera collaborazione, devi prima costruire la fiducia, la valuta di ogni relazione. <break time=\"1s\" /> In questo modo, la tua mente si allontana dalla competizione e inizi ad ascoltare profondamente per comprendere.",
    "slide_4_title_max_30": "Connessioni interdipendenti",
    "slide_4_textblock": "🔹 Costruire fiducia\n🔹 Ascoltare per capire\n🔹 Creare soluzioni insieme",
    "slide_4_script": "Alla fine, impari a fondere punti di vista diversi per creare soluzioni che nessuna singola persona potrebbe costruire da sola. Questo è il potere delle connessioni interdipendenti.",
    "avatar_id": "49dc8f46-8c08-45f1-8608-57069c173827",
    "background_image": "synthesia.3f9b74bc-f870-4cc6-881e-7bff2394b686"
  },
  "visibility": "public",
  "templateId": "9c23e3c3-401b-4373-9ca2-e15b669974a3",
  "title": "R:171272 - 2.1.1(1787) La vittoria pubblica: dall'io al noi ([de] Talking head (A-Q-A-F))"
}

4. Workflow output

{
  "video_id": "701d1437-d1ec-4d19-8598-29a3366fe19b",
  "json": {
    "test": false,
    "templateData": {
      "course_title": "Le 7 abitudini per essere più efficace",
      "video_title": "La vittoria pubblica: dall'io al noi",
      "slide_0_script_intro": "Ora, passiamo dall'io al noi.",
      "slide_1_script": "Hai costruito le tue fondamenta interiori. Ma l'autosufficienza non è il punto d'arrivo. È il terreno stabile di cui hai bisogno per connetterti con gli altri.",
      "slide_2_key_idea_max_50": "Dall'autosufficienza alla collaborazione.",
      "slide_3_script": "Per raggiungere una vera collaborazione, devi prima costruire la fiducia, la valuta di ogni relazione. <break time=\"1s\" /> In questo modo, la tua mente si allontana dalla competizione e inizi ad ascoltare profondamente per comprendere.",
      "slide_4_title_max_30": "Connessioni interdipendenti",
      "slide_4_textblock": "🔹 Costruire fiducia\n🔹 Ascoltare per capire\n🔹 Creare soluzioni insieme",
      "slide_4_script": "Alla fine, impari a fondere punti di vista diversi per creare soluzioni che nessuna singola persona potrebbe costruire da sola. Questo è il potere delle connessioni interdipendenti.",
      "avatar_id": "49dc8f46-8c08-45f1-8608-57069c173827",
      "background_image": "synthesia.3f9b74bc-f870-4cc6-881e-7bff2394b686"
    },
    "visibility": "public",
    "templateId": "9c23e3c3-401b-4373-9ca2-e15b669974a3",
    "title": "R:171272 - 2.1.1(1787) La vittoria pubblica: dall'io al noi ([de] Talking head (A-Q-A-F))"
  }
}

5. Why this example is useful

  • It shows the difference between the LLM output contract of Step 2 and the validated runtime payload used by the workflow.
  • It demonstrates a real VideoTalkingHead selection for a short video.
  • It shows that the final workflow result returns both the created video_id and the normalized JSON payload that was sent to Synthesia.

Production design rules

These rules apply when creating templates in Synthesia, not during the LLM pipeline.

Backgrounds and transitions:

  • Use environmental backgrounds on talking-head (avatar-only) slides.
  • Use blue solid backgrounds on presentation (list/text) slides.
  • Background changes between slides → set transition to fade.
  • Background stays the same between slides → no transition.

Layout:

  • Logo in bottom right corner. If occupied → top right.
  • Use in-between title slides (Q-type) in presentation templates.
  • Use different zoom levels across slides for visual rhythm.

Slide structure:

  • Every template starts with slide_0_script_intro + avatar + video_title + course_title.
  • Q slides display a framing question or key statement in the centre — no voiceover.
  • Limit continuous I (diagram) slides to 4. For larger models, use grouping.