Voice Instructions in Golpo: One Field, Twelve Different Voices
A second optional field — voice_instructions — quietly controls more of how your video lands than the script itself. We analyzed five thousand real Golpo generations to find the eight categories users actually use it for, then produced twelve side-by-side demos that show how dramatically one or two sentences can reshape the narrator.

If video_instructions is the most powerful single field in the Golpo API (we made that case here), then voice_instructions is the quietest. Most users leave it blank. The ones who fill it in produce videos that sound like they were narrated by a real person hired for the job, not a default TTS read.
To find out what people actually write in that field, we pulled the 5,000 most recent Golpo generations from production and bucketed them. There were 1,495 distinct voice prompts, and despite that variety, the prompts collapse into a small number of repeating categories. Eight, to be exact. Some are obvious (accent, tone, pacing). Some are not (line-specific pauses, demographic anchoring, "what the voice must NOT sound like"). The single most common voice prompt across all 5,000 generations was the same six words: "warm, clear teacher — engaging and encouraging", used 431 times.
Below: one or two side-by-side demos for each of the eight categories. Same script style. Same Golpo Sketch engine. The only variable is the voice_instructions string. Press play on a few and you'll hear how much one or two sentences can reshape who you're listening to.
→ Accent & language · Persona · Tone · Pacing · Pauses · Pronunciation · Demographics · Negative constraints
What we found in the data
Before the demos, the high-level shape of 5,000 generations:
- ~30% are short directives — under 25 characters. Things like "British Accent", "talk like a professor", "español latino". Short prompts tend to nudge a single dimension.
- ~50% are structured paragraphs — 80–500 characters covering tone + pace + accent in one breath. This is the sweet spot.
- ~20% are full directorial briefs: line-by-line pause markers, energy curves, and pronunciation tables for Greek and Hebrew terms or finance tickers. These are the high-effort prompts shipped by power users.
- The top three repeated prompts were "warm, clear teacher — engaging and encouraging" (431x), the same prompt with a four-line expansion (356x), and the bare phrase "British Accent" (146x).
- Eight content categories covered roughly 95% of all distinct prompts: accent / language, persona / archetype, tone / mood, pacing / tempo, pause & rhythm control, pronunciation guidance, demographic anchoring, and explicit negative constraints.
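For readers who want to run the same kind of tally on their own prompt history, here is a minimal Python sketch of the length bucketing and the repeat count. The under-25 and 80–500 character ranges come from the list above; the "over 500 characters" cutoff for briefs and the "unclassified" label for the 25–79 gap are assumptions, since the breakdown above gives no lengths for them. The function names are illustrative.

```python
from collections import Counter


def length_bucket(prompt: str) -> str:
    """Assign a voice prompt to a length bucket."""
    n = len(prompt.strip())
    if n < 25:
        return "short directive"
    if 80 <= n <= 500:
        return "structured paragraph"
    if n > 500:
        # Assumption: no length is stated for directorial briefs.
        return "directorial brief"
    # 25-79 characters falls between the stated ranges.
    return "unclassified"


def summarize(prompts: list[str]) -> tuple[Counter, list[tuple[str, int]]]:
    """Return bucket counts and the three most repeated prompts."""
    cleaned = [p.strip() for p in prompts if p.strip()]  # blank field = no prompt
    buckets = Counter(length_bucket(p) for p in cleaned)
    top_three = Counter(cleaned).most_common(3)
    return buckets, top_three


sample = ["British Accent", "British Accent", "talk like a professor", "español latino", "   "]
buckets, top_three = summarize(sample)
print(buckets)
print(top_three)
```

Length is only a rough proxy. Sorting prompts into the eight content categories still takes a human read or a separate classifier.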
Twelve demos follow, structured by those eight categories. Style was held constant (Golpo Sketch Classic) so your ears can isolate the voice change.
1. Accent and language — where the narrator lives
The single most-named axis. Users routinely specify an accent ("British", "Indian English", "español de España", "Korean broadcast"). The interesting bit is how often the prompt blends two: a base accent plus a softer secondary modifier underneath. The best example we found in real data layered a light educated London accent over a subtle Caribbean lilt — and it works.
Educated London with a subtle Caribbean warmth — used 64× in production
Prompt: Three small habits of great everyday teachers — explained warmly. · Voice slot: solo-female-3 · Length: 1 minute.
voice_instructions: "Female voice, mid-20s to early 30s. Light British accent — think educated London, not posh, not RP, definitely not cockney. Underneath the British is a subtle Caribbean warmth — the melodic rise and fall in the cadence carries a hint of island lilt, but the vowels and consonants stay British. Not patois, not exaggerated. Warm, conversational, slightly playful — like a friend explaining something interesting over coffee."
Voice · solo-female-3 · UK + Caribbean
Castellano neutral for an older audience — used 37× in production
Prompt: Tres pasos sencillos para entender la inteligencia artificial. · Voice slot: solo-male-3 · Language: es · Length: 30 seconds.
voice_instructions: "Voz masculina, español de España (castellano), acento neutro de España (no latinoamericano). Tono cálido, pausado y de confianza, para un público de personas mayores."
Voice · solo-male-3 · Castellano
What we learned: Accent prompts are honored most reliably when the narration language matches the cultural target. For non-English accents, set language explicitly: the model picks up regional voice cues from the language code as much as from the prompt. For English-language regional accents (UK, Indian English, Irish), the voice_instructions string does the work, and stacked accents ("British base with a Caribbean lilt") get surprisingly far. The harder failure mode is over-specifying: prompts that ask for "posh RP British" or "thick Cockney" tend to come out more cartoonish than the lighter "educated London" framing.
2. Persona and character archetype — who the narrator is
The second-most-used category. Instead of describing voice qualities, users name a character. "Talk like a TEDx keynote speaker" was used 68 times. "Authoritative scholarly teacher" — paired with strict negative constraints — recurs across multiple religious-studies and academic accounts. Naming a persona pulls a bundle of tonal, pacing, and rhythm choices in one shorthand.
TEDx keynote speaker — used 68× in production
Prompt: Why one small daily habit beats a hundred motivational speeches. · Voice slot: solo-male-3 · Length: 1 minute.
voice_instructions: "Talk like a TEDx keynote speaker, emphasizing the right words, speaking slowly when necessary and keeping the audience hooked. Confident, warm, deliberate. Pause before key reveals. Energy builds toward the close."
Voice · solo-male-3 · TEDx
Authoritative scholarly teacher — explicitly NOT a pastor
Prompt: What the Greek word "metanoia" actually meant — and how we lost the meaning. · Voice slot: solo-male-3 · Length: 1 minute.
voice_instructions: "Authoritative scholarly teacher. The voice of someone who has spent years in primary sources and is now making the work accessible. NOT a pastor, NOT an inspirational speaker, NOT a devotional guide, NOT a motivational coach. No revival cadence, no sermon stress, no sing-song phrasing, no vocal fry, no warmth filler. Pronunciation: metanoia — meh-TAH-noy-ah; noos — NOH-ohs; paenitentia — pie-nih-TEN-tee-ah. Em dashes — brief beat. Closing line: deliberate, quiet authority. Do not rush it."
Voice · solo-male-3 · Scholar
What we learned: Naming a persona is the single highest-leverage move you can make in three to seven words. "TEDx keynote speaker" pulls confident pacing, deliberate emphasis, and pre-reveal pauses without you having to spell any of them out. But the scholarly example shows an important pattern: persona names alone aren't always enough. For categories where the default voice has a "wrong attractor" (sermon cadence, motivational coach energy), pairing the positive persona with a paragraph of negative constraints does more work than either half alone.
3. Tone and emotional register — how the narrator feels
The most populated category by sheer volume of unique prompts. Where persona names a character, tone names a feeling. The two most distinctive tonal patterns we saw repeatedly in production: "deep, gritty, knowing-insider" (used 42×) for content that wants to feel like a quiet truth being shared, and "warm, calm, and grounded" (used 12+ times in multiple variations) for self-improvement and wellness content.
Deep, gritty, knowing-insider tone — used 42× in production
Prompt: The one thing every junior engineer believes about senior engineers — and why it's wrong. · Voice slot: solo-male-3 · Length: 1 minute.
voice_instructions: "Deep, gritty, knowing-insider tone. Like a friend who has done the research and is dropping a hard truth. Confident but not preachy. Slight edge. Speak at a deliberate, steady pace. Let each idea land before moving on. Pause naturally between sentences. Do not rush. Each sentence should take long enough for the listener to fully picture what you said before the next sentence starts."
Voice · solo-male-3 · Gritty insider
Warm, calm, and grounded — used 50+ times across variations
Prompt: What your nervous system does when you skip lunch — explained in three calm minutes. · Voice slot: solo-female-3 · Length: 1 minute.
voice_instructions: "Warm, calm, and grounded voice. Speak at a moderate, thoughtful pace with natural pauses. Tone should feel honest and slightly reflective, encouraging self-awareness rather than judgment. The overall feeling should be wise, steady, and gently eye-opening rather than dramatic. Avoid any sense of urgency or hype."
Voice · solo-female-3 · Calm wise
What we learned: Tone prompts are honored best when they bundle three things: a feeling word (gritty, grounded, calm), an analogy (like a friend, like a trusted teacher), and a pacing direction (deliberate, moderate, unhurried). Together, those three layers push the voice toward a coherent emotional center. A feeling word alone tends to drift: "calm" without "moderate pace" sometimes produces something monotone rather than calm.
4. Pacing and tempo — how fast the narrator moves
Pacing is the bluntest dial on the voice. Some prompts are one line ("talk in a slow and calm tone", used 35×). Others target a specific words-per-minute number ("120-130 words/min"). Two ends of the spectrum showed up most clearly in production: a deliberately slow, meditative pace for self-improvement content, and a fast, punchy broadcast tempo for short-form social.
Slow and unhurried — meditation-teacher pacing
Prompt: Three reasons your meditation streak keeps breaking — and what to do. · Voice slot: solo-female-3 · Length: 1 minute.
voice_instructions: "Talk in a slow and calm tone. Speak deliberately. Pause naturally between sentences. Tone should feel grounded and unhurried, like a meditation teacher giving practical advice — not breathy or hushed. Just steady and clear."
Voice · solo-female-3 · Slow & calm
Fast, punchy Korean broadcast — used 14× in production
Prompt: 다섯 가지 놀라운 인공지능 활용 사례 — 1분 안에 빠르게. · Voice slot: solo-male-3 · Language: ko · Length: 1 minute.
voice_instructions: "Fast, energetic, punchy delivery. Korean broadcast narrator. Quick pace with sharp emphasis on key words. Minimal pauses. Urgent and driven rhythm throughout. No slow or drawn-out narration."
Voice · solo-male-3 · Korean punchy
What we learned: Pacing prompts work best when paired with a category of speaker (meditation teacher, broadcast anchor, sports commentator). A bare "fast" or "slow" instruction tends to land as a general nudge; "fast like a Korean broadcast anchor" pulls a whole rhythm template. WPM numbers (140, 150, 170) get honored loosely — they're a directional hint, not an exact metronome.
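Because a WPM figure is only a directional hint, it helps to check that the number you ask for is compatible with your script and target length. The sketch below does that arithmetic. It assumes whitespace-separated words, so it suits English scripts better than Korean ones, and the function names are illustrative.

```python
def target_word_count(duration_seconds: float, wpm: float) -> int:
    """Words that fit in the given duration at a steady pace."""
    return round(duration_seconds / 60 * wpm)


def estimated_seconds(script: str, wpm: float) -> float:
    """Rough narration time for a script at a steady pace."""
    word_count = len(script.split())  # assumption: words are whitespace-separated
    return word_count / wpm * 60


# A 1-minute video at "120-130 words/min" needs roughly this many words.
print(target_word_count(60, 120), target_word_count(60, 130))

# A 150-word script at the slow end of that range runs over a minute.
print(estimated_seconds("word " * 150, 120))
```

If the estimate is far from the video length you selected, a pacing instruction is unlikely to close the gap on its own. Trimming or extending the script is the more dependable fix.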
5. Pause and rhythm control — where the narrator stops
The most underrated category. Pause instructions are how power users sculpt emphasis. Some users list specific quoted lines that must be followed by a beat ("pause after: 'The job market changed'"). Others give numeric specifications ("0.8 seconds after rhetorical questions"). The line-specific approach we found in production for a recruiting brand is one of the better directorial briefs we've ever seen pass through the field.
Pause after specific quoted lines — used 8× for the same brand
Prompt: Three quiet truths every job applicant should hear about the modern hiring market. · Voice slot: solo-female-3 · Length: 1 minute.
voice_instructions: "Use a calm, confident British female voice. Tone should feel intelligent, grounded and reassuring. Do not sound excited, salesy, dramatic or overly motivational. Speak naturally, with enough space between ideas for the visuals to land.
Pause slightly after these lines:
'The job market changed.'
'But that world no longer exists.'
'Being qualified is no longer enough.'
'You have to be visible.'
'Conversations do.'
The final line, 'Apply for an assessment conversation,' should be clear and calm, not rushed."
Voice · solo-female-3 · Line-specific pauses
What we learned: When you quote the exact line in the prompt, the model treats that line as a landmark and respects the pause around it. This is the only reliable way to get a specific beat at a specific moment. Numeric pause times (0.5s, 0.8s, etc.) are honored only loosely; quoted lines act as anchors and work much better.
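Quoted lines can only act as landmarks if they match the narration word for word. Here is a minimal sketch of a helper that builds the pause block and refuses lines that do not appear verbatim in the script. It assumes you have the final script text in hand; the function name and the exact wording of the header are illustrative, modeled on the recruiting example above.

```python
def pause_block(script: str, lines: list[str]) -> str:
    """Build a 'pause after these lines' block from exact script lines."""
    missing = [line for line in lines if line not in script]
    if missing:
        # A paraphrased line gives the model nothing to anchor on.
        raise ValueError(f"Lines not found verbatim in script: {missing}")
    quoted = "\n".join(f"'{line}'" for line in lines)
    return "Pause slightly after these lines:\n" + quoted


script = "The job market changed. Being qualified is no longer enough. You have to be visible."
print(pause_block(script, ["The job market changed.", "You have to be visible."]))
```

Append the returned block to the rest of your voice_instructions string, after the tone and accent sentences.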
6. Pronunciation guidance — proper names, tickers, foreign terms
For finance, religious-studies, medical, and pharmaceutical content, the difference between a video that sounds credible and one that doesn't is whether the narrator pronounces the right things correctly. Power users supply explicit pronunciation tables. Two patterns we found repeatedly: ticker spellouts for finance content, and Greek/Hebrew/Latin syllable guides for religious and academic content.
Financial ticker spellouts and number formatting — recurring pattern
Prompt: Three stocks that defined the AI boom — and what every retail investor missed. · Voice slot: solo-male-3 · Length: 1 minute.
voice_instructions: "Confident, fast-moving, retail-investor energy, like a sharp trader sharing alpha. Pronounce these tickers letter by letter: N V D A, M U, C R W V, A A P L, A I. Say 'one point six T' as written. Say large numbers clearly and deliberately. Brief pause before the big reveals, and slow down slightly on the key numbers so they land. No em dashes — read it as natural speech."
Voice · solo-male-3 · Finance / tickers
What we learned: Pronunciation tables are the highest-effort, highest-payoff prompts. The pattern that works most consistently: list each name once, respelled phonetically in lowercase with hyphens between syllables and only the stressed syllable capitalized (meh-TAH-noy-ah, ZY-mer-gen, NO-vo-niks). For tickers, force letter-by-letter spellout: "N V D A", with spaces between letters. Without the explicit spellout the model frequently reads tickers as words (which sounds wrong half the time and unintentionally funny the other half).
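Both formats are mechanical enough to generate rather than type by hand. The sketch below produces hyphenated respellings with one capitalized stressed syllable, plus spaced ticker spellouts. The helper names are illustrative, and the clause wording copies the demo prompts above rather than any documented Golpo syntax.

```python
SEP = " \u2014 "  # em dash separator, as used in the scholarly demo prompt


def respell(syllables: list[str], stressed: int) -> str:
    """Join syllables with hyphens, capitalizing only the stressed one."""
    if not 0 <= stressed < len(syllables):
        raise ValueError("stressed index is out of range")
    return "-".join(
        s.upper() if i == stressed else s.lower() for i, s in enumerate(syllables)
    )


def spell_ticker(ticker: str) -> str:
    """Turn 'NVDA' into 'N V D A' so it is read letter by letter."""
    return " ".join(ticker.upper())


def pronunciation_clause(entries: list[tuple[str, list[str], int]]) -> str:
    """Format (term, syllables, stressed index) entries as one clause."""
    pairs = [f"{term}{SEP}{respell(syl, idx)}" for term, syl, idx in entries]
    return "Pronunciation: " + "; ".join(pairs) + "."


def ticker_clause(tickers: list[str]) -> str:
    """Format a letter-by-letter instruction for a list of tickers."""
    spelled = ", ".join(spell_ticker(t) for t in tickers)
    return f"Pronounce these tickers letter by letter: {spelled}."


print(pronunciation_clause([("metanoia", ["meh", "tah", "noy", "ah"], 1)]))
print(ticker_clause(["NVDA", "MU", "AAPL"]))
```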
7. Demographic anchoring — age, gender, voice timbre
The fourth-most-named axis. Users supply concrete demographic specs: "male, late 30s, deep baritone", "female mid-20s, light, curious", "45-year-old American doctor". These prompts work as guardrails on the voice slot — solo-male-3 might default to a young-sounding read, but adding "deep baritone with warm resonance, late 30s" pushes the voice toward a specific point in the demographic space.
Deep baritone late 30s — mentor energy
Prompt: Three life lessons no one tells you until your thirties — and why they're worth the wait. · Voice slot: solo-male-3 · Length: 1 minute.
voice_instructions: "Male, deep baritone with warm resonance — confident, grounded, and wise. Accent: neutral North American (no regional drawl). Tone: calm authority, motivational, compassionate — sounds like he's explaining life lessons, not reading a script. Style: educational / cinematic narrator. Energy curve: start low and reflective for the intro; grow firmer and emotionally charged on key ideas; ease back into a smooth, hopeful register by the outro."
Voice · solo-male-3 · Baritone mentor
What we learned: Demographic specs are most useful when they describe timbre, not just age. "Deep baritone with warm resonance" gives the model something to honor; "male, late 30s" alone is too thin. The "energy curve" instruction (low/reflective → firmer → smooth) is a power-user pattern worth borrowing — you're telling the model how the voice should evolve across the video, not just how it should start.
8. Negative constraints — what the narrator must NOT sound like
The most consistently high-leverage pattern in the entire dataset. Across every category, the prompts that landed most reliably included an explicit list of what the voice should not be. The pattern is so common we pulled it out as its own category. Users specifying anti-guru, anti-pastor, anti-influencer, anti-salesy, anti-trailer-voice prompts produce videos that sound calibrated rather than generic.
"Just keep the voice human" — anti-guru, anti-trailer
Prompt: Three things every productivity guru gets dangerously wrong about being busy. · Voice slot: solo-male-3 · Length: 1 minute.
voice_instructions: "Calm, sharp, warm male voice. Natural delivery with slight dry humor. Honest, not motivational. Pause after punchlines. No dramatic trailer voice, no guru tone, no influencer hype. Just keep the voice human. Read it as if you're explaining the truth to a friend over coffee, not delivering a TED talk."
Voice · solo-male-3 · Anti-guru, human
What we learned: Negative constraints work because they exclude the "wrong attractors" — the default modes the TTS model gravitates toward when given vague positive prompts. "Be calm" can drift into bored. "Be calm. Do not sound like a meditation guru, no breathy reverence, no spa voice" lands far more reliably. The more specific the negative, the better. "No trailer voice" beats "don't be dramatic".
What we learned across all 12
Seven observations from looking at 5,000 production generations and running these 12 demos side-by-side:
- 1. Persona names are shortcuts. Three to seven words ("TEDx keynote speaker", "true-crime narrator", "high-end documentary voice", "investment-banker British") pull a whole bundle of pacing, tone, and emphasis choices. Use them.
- 2. Stack two cultural references for distinctive voices. "Educated London accent with a subtle Caribbean lilt underneath" produces a more specific narrator than either half alone. The blend is the differentiator.
- 3. Quote the exact line you want a pause around. Numeric pause times (0.5s, 0.8s) are honored loosely; quoted lines act as landmarks. This is the only reliable way to control rhythm at a specific moment.
- 4. Pronunciation tables earn their weight. For finance tickers, religious terms, brand names, drug names — spell each one phonetically with capitalized stress and hyphens. The model honors them remarkably well.
- 5. Force letter-by-letter on acronyms and tickers. Without explicit spaced spellout ("N V D A"), the model will read them as words half the time.
- 6. Negative constraints beat positive ones — again. Same lesson as video_instructions: listing the wrong attractors ("no guru tone, no trailer voice, no sermon cadence") does more to shape the output than any positive description alone.
- 7. Length doesn't matter; anchoring does. The single most effective six-word prompt in the entire dataset is "warm, clear teacher — engaging and encouraging" (used 431 times). The shortest prompts that work consistently are the ones that name a persona AND specify a tone in one breath.
How to write your own
The pattern that produced the most coherent voices across the 12 demos and the 5,000-generation dataset:
- Open with a persona or accent name — one short clause ("TEDx keynote speaker", "Castellano neutral", "deep gritty knowing-insider").
- Add one tonal anchor and one pacing anchor — "warm and grounded, moderate pace", "confident and fast, broadcast tempo".
- List 2–4 absolute negatives — what NOT to be. "Not a pastor. Not a motivational speaker. No vocal fry. No salesy energy."
- If pronunciation matters, add a small table — name → stressed syllable in caps. Five entries is plenty.
- If a specific moment needs a beat, quote the line — the model treats quoted lines as anchors.
- Stop there. Prompts of 100–400 characters are honored more reliably than paragraphs of 800+ characters.
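The checklist above can be sketched as a small builder that assembles the pieces in order and flags prompts outside the 100–400 character range. This is an illustration of the recipe, not a Golpo utility: every function and parameter name here is hypothetical, and the range check is a guideline drawn from the observations above rather than a limit the field enforces.

```python
def build_voice_instructions(
    anchor: str, tone: str, pacing: str, negatives: list[str]
) -> str:
    """Assemble persona/accent, tone + pacing, then 2-4 negatives."""
    if not 2 <= len(negatives) <= 4:
        raise ValueError("list between 2 and 4 negatives")
    positive = f"{anchor}. {tone}, {pacing}."
    negative = " ".join(f"{n}." for n in negatives)
    return f"{positive} {negative}"


def in_recommended_range(text: str) -> bool:
    """True when the prompt is 100-400 characters long."""
    return 100 <= len(text) <= 400


prompt = build_voice_instructions(
    anchor="TEDx keynote speaker",
    tone="Warm and grounded",
    pacing="moderate pace",
    negatives=["Not a pastor", "Not a motivational speaker", "No vocal fry", "No salesy energy"],
)
print(prompt)
print(len(prompt), in_recommended_range(prompt))
```

Pronunciation entries and quoted pause lines, when you need them, go after the negatives. Re-check the length once they are added.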
Two-line template:
"[Persona / accent / language]. [One tonal anchor + one pacing anchor]."
"[2–4 absolute negatives — what the voice must NOT sound like.]"
Where the field lives
In the Golpo dashboard, voice_instructions is the text area labeled Voice Instructions on the create-video screen — directly below the script and voice-slot pickers, available on the Business plan and above (see pricing). Type your instruction string there before hitting generate.
If you're calling the Golpo API, pass voice_instructions as a string field in the request payload. See the API payload examples guide and the API access guide for the full request shape. The field accepts free text in any language, and Korean voice prompts written in Korean are honored better than transliterated versions.
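As a rough illustration, the request body might be assembled like this in Python. Only voice_instructions is a field name confirmed in this guide; the "prompt" and "language" keys are assumptions based on the demo labels above, and the endpoint and authentication are deliberately left out, so check the API guides for the real request shape.

```python
import json
from typing import Optional


def build_payload(
    prompt: str, voice_instructions: str, language: Optional[str] = None
) -> dict:
    """Assemble a request body; key names other than voice_instructions are assumed."""
    payload = {
        "prompt": prompt,  # assumed key name
        "voice_instructions": voice_instructions,
    }
    if language is not None:
        payload["language"] = language  # assumed key name, e.g. "es" or "ko"
    return payload


payload = build_payload(
    prompt="Tres pasos sencillos para entender la inteligencia artificial.",
    voice_instructions="Voz masculina, español de España (castellano). Tono cálido, pausado y de confianza.",
    language="es",
)

# ensure_ascii=False keeps non-English instructions readable instead of \u-escaped.
body = json.dumps(payload, ensure_ascii=False, indent=2)
print(body)
```

Writing the instruction in the target language and sending it as UTF-8 avoids the transliteration step entirely.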
Related guides
- Video Instructions: One Line, Fifteen Different Looks — the visual sibling of this guide. Same exercise, opposite field.
- Every Golpo Video Style — what changes when the engine, not the prompt, changes.
- Golpo prompt cheatsheet — companion patterns for the script itself.
- Use your own narration (audio_clip) — when voice_instructions isn't enough and you want to upload a real human voice.
- Golpo AI complete tutorial — full walkthrough of the dashboard.
Want help calibrating a brand voice that survives across thousands of videos? Book a 15-minute call — we'll help you write the prompt.


