Written by Greg Martin
20 min read
APR 30 2026

Making AI Sound (Almost) Human

Crafting realistic conversations using AI voice in ElevenLabs Creative.

In April 2026, I was asked to lead a workshop on using ElevenLabs Creative to craft realistic conversations with AI voice. What follows is that presentation adapted into article format.

We've all heard bad artificial voices; in fact, some of us grew up with them. So I was genuinely surprised when, in late 2024, a project needing voiceover led me to investigate how much AI voice had improved.

AI voice today kind of blows my mind.

Ok, it has improved a LOT. Far more than I expected, especially given how often YouTube and TikTok still serve up robotic, lifeless VO.

Is it human quality? Yes, in small doses. Can you still tell it's AI? Almost always. But it's getting better fast. To appreciate where AI voice stands today, two things are worth keeping in mind. First, you have to not hate AI. If you do (no judgment), no amount of progress is going to impress you. Second, consider what it's replacing. Is AI voice better than a real human performance? Almost certainly not. Is it a massive improvement over the robotic voices we've lived with until now? Yes. A thousand times over.

Introducing ElevenLabs Creative

If you haven't heard of ElevenLabs, you've likely heard their work, whether in a more human-sounding call agent or while listening to a news article. They're not the only player in the AI voice space, but they're a significant one. Since launching in 2022, they've built a multifaceted suite of APIs, agents, and creative tools for multimedia production. For this article, I'm focusing purely on the latter: the ElevenLabs Creative toolset, which I spent considerable time with on a recent project.

So let's talk a little about that project, and where AI voice played a role.

Let's build an engaging educational platform for beginner day traders.

In 2019, I did some work for a client building a trading journal tool for day traders. We reconnected in 2024 to explore other opportunities in that space, and the one we latched onto was education. Day trading is a dense, often intimidating topic. Beginners looking for information have to sift through a lot of noise to find something useful. There's a lot of ego, and a lot of paywalls, and very few well-crafted educational experiences. That last part is surprising, given how much the educational space has grown and evolved. So we started exploring what it would take to build a platform for day trader education, with three goals: make it informative, make it engaging, and make it accessible.

I want to share a little about this project so you have context for why I reached for AI voice as a tool, and what I learned along the way.

Learning from the best.

Every project starts with a bit of research, and we had a wealth of experiences to learn from (pun definitely intended). Here are three of our top inspirations, and why.

Duolingo is something of a gold standard for education, teaching through trial and error. It has some lovely UI patterns for quizzing users, which we took notes on. Its approach to pedagogy, however, wasn't the best fit for what we were trying to do. Duolingo emphasizes learning through quiz questions, but doesn't do much to explain concepts beforehand. Still, lots to learn from.

Codecademy has a lovely blend of explanation and applied learning throughout. I particularly liked their use of "fill in the blank" quizzes at the end of more conceptual lessons. Unlike matching or multiple choice, this format pushes users to recall answers on their own, without a crutch, which is a more reliable way to verify that key ideas have sunk in.

We loved Brilliant's focus on STEM topics. The tighter scope makes the offering easier to understand and, like a restaurant with a focused menu, allows it to craft better materials. We also appreciated its pattern of explanation followed by quiz questions, which keeps lessons interactive and prevents them from becoming just another textbook experience. I particularly liked how it turns both correct and incorrect quiz answers into learning moments, rather than penalizing or monetizing students the way Duolingo does.

What’s our ideal educational format?

When thinking about the ideal format for making complex topics accessible, we immediately thought of podcasts. I'm a reflective learner, meaning I learn better from dialogue that approaches a topic from multiple angles and probes with questions, rather than a straight lecture from a single voice.

“Hey, that’s kind of what podcasts do.”

Podcasts with more than one person, whether a straight discussion or an interview, fit this mold perfectly. Having two voices explore a topic is not only engaging (assuming the hosts are enthusiastic) but allows for questions and illuminating back-and-forth, where complex ideas can be picked apart from multiple directions. It makes me wish all my college lectures had been taught by two or three-person panels. So many topics would have benefited from the energy multiple voices bring.

Combining this with our research into existing educational platforms, we quickly aligned on a lesson format built around a series of voice and motion vignettes, broken up by short quiz sections. This would give us a "learn a little, then apply it" cadence, with each segment teaching just enough to quiz on before moving to the next.

[Diagram of lesson structure]

But there we ran into a problem: neither of us wanted to record our own voices for a podcast-style lesson. Since the AI renaissance was, and still is, in full swing, we decided to explore what AI voice could do instead.

How good could a podcast built on AI voice actually be?

Enter NotebookLM. If you haven't looked into this tool from Google, you should. It's kind of awesome, allowing you to gather insights and query any informational content you put into it. One of its features is the ability to generate an AI-driven podcast from your content. We used this as a quick proof of concept for how our lesson content might sound as a podcast, and as a personal proof point that AI voice could create an engaging educational experience. And it delivered. From a rough draft of a single lesson, NotebookLM generated a 28-minute podcast that, aside from a few artifacts and glitches, was a thoroughly engaging ramble through our source material.

This made us feel solid on the hypothesis that AI voices can be crafted into podcast-style lessons that are engaging and educational, without distracting from what's being taught.

The only downside we saw in NotebookLM was the utter lack of control over how the information is presented, but we'll get into that in a moment. As a proof of concept, we thought it was brilliant.

Prototyping before the deep dive.

But we still needed to test our lesson format before committing to a full build, and that's where we come back to ElevenLabs and crafting conversations with AI voice. In February 2025, we took our working curriculum, selected a single lesson as our test case, and built it out as a fully dressed experience to put in front of others and see how our assumptions played out. Below are the first few vignettes from that prototype lesson (covering candlestick charts) to give you a sense of how the final prototyped experience looked, felt, and sounded.

Let’s talk about how I built this.

There's a lot more to our prototype lesson than I'm sharing here (it runs about 10 to 15 minutes), but I want to move away from the product discussion and start digging into how I built the experience — and what I learned about crafting conversations along the way. Specifically, I want to cover writing a script, selecting voices, tuning each line to sound human, and editing those lines into a realistic podcast-style conversation.

Writing our conversational script.

First, we need words. Without them, our AI voices have nothing to say. And as it turns out, writing a script is actually a challenge if you don't already do that sort of thing. It took a few trial runs, but the process my partner and I eventually landed on was simple: he would write the lesson content the way he wanted it, then hand it to me to translate into scripted dialogue while keeping his intent intact. Here's an example of his output:

A Candlestick Chart is a type of financial chart that provides the same information as a bar chart (Open, High, Low, and Close, or OHLC) but presents it in a visually intuitive format. Its design, which uses color-coded candlesticks, makes it easier to interpret trends and market sentiment at a glance.

Candlestick Body: The rectangular portion of the candlestick, known as the body, represents the range between the opening and closing prices during a specific timeframe.

Compact Display of Information: A single candlestick contains all four key price points (OHLC) within a specific timeframe.
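
For the code-inclined, the OHLC idea in that excerpt maps neatly onto a tiny data structure. Here's a minimal sketch in Python; the field and method names are mine, not from any trading library.

```python
from dataclasses import dataclass


@dataclass
class Candlestick:
    """One OHLC snapshot for a single timeframe (e.g. one minute)."""
    open: float
    high: float
    low: float
    close: float

    @property
    def bullish(self) -> bool:
        # Price closed above where it opened: typically drawn green.
        return self.close > self.open

    @property
    def body(self) -> float:
        # The body spans the open-to-close range described above.
        return abs(self.close - self.open)


candle = Candlestick(open=100.0, high=104.0, low=99.5, close=102.5)
print(candle.bullish, candle.body)  # True 2.5
```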

Dry, textbook-style content. Not bad by any means, but not engaging — nor was it supposed to be. Here's what I would translate it into:

A: Alright, let's begin with the basics.

B: What exactly is a candlestick, and why does everyone use them in trading?

A: Candlesticks are a visual representation of a stock’s price used to show us how much it moved during a specific timeframe.

B: Right, so if you’re looking at, say, a one minute timeframe, then every candlestick represents how price has changed over the course of 60 seconds. Kind of like a snapshot.

A: Exactly. Each candlestick captures where the price opened at the beginning of the minute.

B: Known as its opening price.

A: Right, as well as how much it had moved 60 seconds later.

B: ...known as its closing price.

A: And they also capture the highest and lowest point the price reached in that time.

B: Oh, vocabulary time! The opening, high, low, and closing prices are collectively known as OHLC in trader parlance. You’ll probably hear that acronym a lot.

A: Oh yeah... you'll definitely hear that acronym a lot.

B: So, y'know, throw it around if you want to sound fancy.

A: Anyway, candlesticks capture all of this in a simple visual format we can read at a glance.

B: They're kind of awesome in a deeply nerdy way.

You get the idea. We went this route because we had clear opinions on how we wanted our lessons to flow — what we covered, how we explained it, and in what order we unpacked it so each segment built on the last. I'd take his block of dry, textbook-style content with annotated visual sketches and rewrite it so two imagined podcast hosts could make the material more engaging and accessible.

If we were doing this today, there would actually be three ways to generate a script, and they run on a spectrum from control to convenience.

Write your own (100% Control)

When you need absolute control over your script, there's no substitute for owning it yourself.


Co-author with LLM support (50/50 Control + Convenience)

Give your LLM of choice a solid prompt for your topic, any information sources (docs, URLs), and notes on how you'd like to shape the conversation. Include not only how you'd like the topic unpacked, but direction notes for the style of conversation: how many speakers, their relative personalities, how the voices should complement each other. In short, not just what you want the conversation to cover, but the experience you want it to present to your audience. Then prepare to tweak iteratively. A lot. At the end you should have something ready to paste into an ElevenLabs Creative project. Honestly, this is probably the route I'd take if I were doing this project over again today.


Let AI generate your podcast (100% Convenience)

ElevenLabs has a built-in podcast script generator that lets you pull in an information source (a doc, a URL, etc.), pick two voices, and hit the ground running. This approach is great for getting something on the board quickly, but it gives you very little control over the initial script. That might be fine for your purposes; just prepare yourself for a lot of editing.


We’ve got our script, now we need voices.


It's time to dig into ElevenLabs and their extensive voice library. Selecting one voice is hard enough (the dilemma of choice is real). But if you're building a conversation with two voices, how those voices work together matters a lot, because the voices you select tell a story. This means selecting two or more voices is a full-on casting exercise, and it helps to have a clear set of criteria for what you're looking for, even if that changes later.

Here are a few key qualities I considered while shopping for voices, and why they mattered for my specific project. Yours may differ.


Age

Just like in real life, the impression of age can affect the perceived credibility of your voices.

For my project we wanted voices that sounded like they were in their late twenties to early thirties. Not super young, not super old. That range lends credibility when discussing complex financial topics, while still leaving room for levity to keep things engaging.


Energy + Personality

Every voice comes with an energy level and an ingrained personality that you can enhance or subvert with your script.



High energy or bubbly personalities mesh well with social media or popular topics, while more serious discussions like politics or STEM may benefit from a calmer, more measured tone. Voices with wildly different personalities give you a hook to play off, whether for comedic effect or to represent two different audience perspectives. A classic example: a younger, more exuberant host interviewing an older, calmer voice to make a complex topic more accessible to a younger audience.



For my project we wanted relatively calm voices with room to stretch into humor and levity. Not stoic or bland, but not excessively bubbly either. Think calm, with the potential for dry wit. We also wanted them to be complementary, since we planned to have them trade speaking roles throughout the lessons.


Gender

All men? All women? Carefully neutral? Mixed? Like age, this decision is shaped by your topic and the kind of representation you want your conversation to project.

For my project, having a female voice was important — day trading is a male-dominated space and we wanted to push back on that. Our view was that having both male and female voices speak with equal confidence and authority would make the content more accessible to our entire audience, not just aspiring finance bros.


It's also worth noting that not all voices in ElevenLabs' library have the same audio quality. Some sound like they were recorded in a professional booth, while others have reverb or ambient noise, as if recorded in a kitchen. This is a minor thing, but unless you want to do extra audio cleanup, it's worth keeping an ear out for when pairing voices. That said, maybe having people sound like they're calling in from home is actually a good thing for your project.


Test-driving your voices.


As nice as the sample audio is for each voice in the ElevenLabs library, you have to take them for a spin to see how they mesh. Here's the sample script I used to test our finalist voice pairings. It not only removed the variable of random dialogue during the selection process, but also combined informational speech with more natural, human asides.


A

We shouldn't feel super confident we know what's about to happen unless we have more than one piece of evidence.

B

It’s super important. One of the most common ways beginner traders lose money is by getting overly excited and acting on a single piece of information.

A

Yeah... don’t do that.

B

Yup, nope, not a good idea.

And here's the sample audio for our three finalist pairings, along with notes on how each was assessed for our specific needs.



Mark + Alania

These two work really well together because they feel about the same age and energy level. Mark has a calm confidence, while Alania is laid back and quiet (she'd be amazing in a jazz podcast). We ended up using Mark for our male voice, but as a pairing they were a little too similar, like vanilla ice cream with white sprinkles.



Finn + Jessica

These two are noticeably younger, with a higher energy level. They complement each other nicely, but sound a bit younger than we wanted for the topic of day trading. They'd be great for a lighter topic or a younger audience.



Mark + Juniper (winner)

Compared to Alania, Juniper is a much better complement to Mark. Both share the same calm confidence, but Juniper comes across as a bit more emphatic, which gives them slightly different personalities. Both also have the ability to slide into humor where appropriate: Mark holds down the dry wit, Juniper leans into a little sass. Perfect for our needs.


Building lines and conversations.


We have a script and we have our voices. It's time to build our podcast dialogue. This is where things get technical with the ElevenLabs toolset. Once I'd pasted the script into an ElevenLabs project and assigned my voices, I needed to answer two questions:

  1. How do I get my voices to say their lines the way I want them to?

  2. How do I get my scripted conversation to feel natural?



Let's tackle these one at a time.


A new project in ElevenLabs showing a script with voices assigned to their respective lines.


We'll start by pulling our script into an ElevenLabs project and assigning our voices to their respective lines.


How do I get my voices to say things the way I want them to?


First, hats off to ElevenLabs. The first time you generate a line of voice, there's a good chance it'll sound great. Not perfect, but better than good. There's still work to be done to make each line sound as human and engaging as possible, and that work happens on a line-by-line basis. Here are the tools at your disposal:

  1. Regeneration

  2. Formatting and Punctuation

  3. Overrides

  4. Voice Models and Audio Tags


Serendipity is still your friend.


The first tool for getting your voices to say lines the way you want is simple: regenerate the line. There's a certain amount of variation in how the AI will read a given line, not unlike how a human might vary in repetition. This means you can get significantly different results from the same scripted line. Here's an example.



This is the same line regenerated over and over. As with most generative AI, you can get fairly different results just by giving the AI another go.


The power of formatting and punctuation.


The way you write a line has a huge impact on how the AI voice reads it. Simple things like punctuation and capitalization signal to the AI where to place emphasis and where to pause. And it's not just normal grammar rules. To get your AI to say things the way you want, your script might need to look a little weird.

Here's a quick snippet of script using our Juniper voice, written normally.



Juniper

(original)

They're like a dynamic map that shows us where prices have traveled, and give us hints for where they might head next!


And here's a second take with some extra punctuation and formatting applied. It may be subtle, but notice how the timing and emphasis change.



Juniper

(adjusted)

They're like a "dynamic map" that shows us where prices have traveled ...and give us HINTS for where they might head next!


Intonation, emphasis, and timing are huge components of how we speak when we're not just reading lines. They're also critical for directing your audience's attention to what matters.

Here are a few examples of formatting tweaks you can use for voice direction:


Emphasis

‘single quotes’
“double quotes”
CAPITALIZE


Pauses

ellipsis...
comma,
dash —


Inflection

period.
ellipsis...
exclamation!
question?
no punctuation

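Because all of these cues are just plain text, they're easy to apply programmatically if you're pre-processing a long script before pasting it in. A tiny illustrative sketch (the helper names are mine, not part of any ElevenLabs tooling):

```python
def emphasize(line: str, word: str) -> str:
    """Upper-case one word so the voice model leans on it."""
    return line.replace(word, word.upper(), 1)

def pause_before(line: str, word: str) -> str:
    """Prefix a word with an ellipsis to cue a beat before it."""
    return line.replace(word, "..." + word, 1)

# The Juniper tweak from earlier, reproduced mechanically:
line = "and give us hints for where they might head next!"
line = emphasize(line, "hints")
line = pause_before(line, "and")
print(line)  # ...and give us HINTS for where they might head next!
```

In practice I still do most of this tuning by hand, line by line, but a helper like this keeps a house style consistent across dozens of lessons.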

Playing with overrides.


In addition to script formatting, ElevenLabs also provides a set of override tools to change how your voice speaks. Here’s the set you can play with for the Multilingual v2 Model (more on models in a moment) and broadly what they do for you. It’s important to note that there are no specific recipes for specific outcomes here — these are broadly “play around and see what you get” tools.


ElevenLabs voice overrides for the v2 model.

Speed

How fast is the voice reading the line? This is useful for two things. First, some voices naturally talk slower or faster than others, so this is a good way to normalize them if you want multiple voices speaking at the same pace. Second, dialogue speed can be a solid emotional cue. Slower reading can come across as intentional or thoughtful, while faster delivery may read as nervous or excited.


Stability

How consistently does the voice read the line between regenerations? This is similar to the --chaos attribute in Midjourney, increasing or decreasing variation between generations so you can explore broader or more specific nuance between takes.


Similarity

How clear and consistent is the voice across a line? I'll hedge on this one, but I believe this slider is most useful for longer blocks of dialogue, helping the voice remain consistent from start to finish. Most of my dialogue has been shorter segments, so I haven't seen it make much of a difference in my own work. Your mileage may vary.


Style Exaggeration

How exaggerated is the voice style on a given line? Every voice has a certain degree of personality baked in, and this slider dials that up or down, affecting intonation and inflection. It's a great tool for getting more (or less) human variation out of your voice, but pushed to the extreme it can introduce erratic, stuttering artifacts.

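If you ever drive these sliders through the ElevenLabs API instead of the UI, they correspond to fields on the voice_settings object, all 0 to 1 floats (Similarity is similarity_boost, Style Exaggeration is style). Here's a minimal sketch of that mapping; it's just a dictionary builder, and the example values are my own, not recommended defaults:

```python
def voice_settings(stability: float, similarity: float, style: float) -> dict:
    """Map the three UI sliders onto the ElevenLabs voice_settings payload.

    All values are 0-1 floats. Field names follow the public
    text-to-speech API: similarity_boost is the Similarity slider,
    style is Style Exaggeration.
    """
    for name, value in (
        ("stability", stability),
        ("similarity", similarity),
        ("style", style),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return {
        "stability": stability,
        "similarity_boost": similarity,
        "style": style,
    }

# A steadier read with a touch of exaggeration, e.g. for narration:
settings = voice_settings(stability=0.7, similarity=0.8, style=0.3)
```

The nice part of scripting this is repeatability: once you find a combination that works for a voice, you can pin it per character instead of nudging sliders by hand.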

Playing with voice models.


The default voice model in ElevenLabs is v2, which provides consistent, studio-quality output with lifelike intonation and the ability to evoke emotion. All of the examples so far have used this model, including the lesson prototype (v3 came out right as we finished our work).

At the time of writing, there is also a v3 model. It's much more expressive and can be directed via prompts for emotion, tone, cadence, and other non-verbal elements. In short, it's pretty great. Let's listen to how v2 compares to v3, using Juniper again to demonstrate.

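For anyone generating lines programmatically, the model is just another parameter on the same text-to-speech endpoint. Here's a sketch that assembles the request without sending it, using only the standard library. The voice ID and API key are placeholders, and the model IDs (eleven_multilingual_v2 for v2, eleven_v3 for v3) reflect my understanding at the time of writing; check the current docs before relying on them:

```python
import json

API_BASE = "https://api.elevenlabs.io/v1"

def build_tts_request(voice_id: str, text: str, model_id: str, api_key: str):
    """Assemble (url, headers, body) for the ElevenLabs text-to-speech
    endpoint. Nothing is sent here; hand the result to any HTTP client."""
    url = f"{API_BASE}/text-to-speech/{voice_id}"
    headers = {"xi-api-key": api_key, "Content-Type": "application/json"}
    body = json.dumps({"text": text, "model_id": model_id})
    return url, headers, body

# v2 for steady narration; v3 when you want audio tags to take effect:
url, headers, body = build_tts_request(
    voice_id="YOUR_VOICE_ID",      # placeholder
    text="[cheerful] They're like a dynamic map!",
    model_id="eleven_v3",          # assumed model ID; verify in the docs
    api_key="YOUR_API_KEY",        # placeholder
)
```

Keeping the request assembly separate from the send makes it easy to batch a whole script's worth of lines and review them before spending generation credits.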

ElevenLabs voice models

Juniper

v2 Model

They're like a dynamic map that shows us where prices have traveled, and give us hints for where they might head next!

And here's Juniper using the full range of what v3 provides:


Juniper

v3 Model

[sassy and faster][laugh] They're like a dynamic map that shows us where prices have traveled [inhale][slowing down][cheerful] and give us hints for where they might head next!

A significant difference, not only in how it sounds (I wouldn't advise mixing v2 and v3 models in long-form audio) but in how it can be directed. Let's talk about those bracketed elements. These are audio tags: small inline prompts that let you directly instruct the AI voice on cadence, flow, tone, emotional state, and non-verbal exclamations. Here are a few examples:


Story Beats

[pause] [continues softly] [hesitates] [resigned]


Tone

[dramatic tone] [lighthearted] [reflective] [serious tone]


Emotion

[awe] [sarcastic tone] [wistful] [matter-of-fact]


Rhythm + Flow

[slows down] [rushed] [emphasized]


Nonverbal

[laugh] [sigh] [snort] [cough] [gasp]


I'm just scratching the surface here, but you can dive deeper by reading ElevenLabs' blog post on audio tags.

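Since audio tags are just bracketed text inline with the script, you can keep your prose clean and layer the tags on at render time. A tiny helper, which is my own convention rather than anything in the ElevenLabs toolset:

```python
def tagged(line: str, *tags: str) -> str:
    """Prefix a script line with v3 audio tags like [laugh] or [wistful].

    Pure string assembly; the tags only mean something to the v3 model
    at generation time.
    """
    prefix = "".join(f"[{t}]" for t in tags)
    return f"{prefix} {line}".strip()

tagged("They're like a dynamic map!", "sassy and faster", "laugh")
# -> "[sassy and faster][laugh] They're like a dynamic map!"
```

Keeping tags out of the source script also makes it painless to try the same line with several different emotional reads, which is where v3 really earns its keep.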

A project in ElevenLabs showing a script with assigned and tuned voices.

How do I get my scripted conversation to feel natural?


diagram of linear conversation

We've done a lot of work getting our individual lines out of the uncanny valley. Now, how do we stitch them into a realistic conversational flow?

If you play through your scripted conversation without any editing, it'll sound pretty flat, with each line coming one after the other. Even with well-tuned lines, this feels off. That's because this isn't how people normally speak, especially when they're engaged conversational partners. They probably should let each other finish. But they usually don't.

Here’s a sample playback of a few tuned lines, unedited.


Linear playback of the conversation prior to editing.

Engaged partners jump in.


diagram of conversation with tighter structure

Engaged partners tumble over each other, building on or reacting to what the other is saying, often before the previous line has finished. This speeds up the conversation, injecting enthusiasm and energy into the dialogue. It's how people talk when they're excited about a topic and actively engaged. Maybe not all the time, but often.


Engaged partners backchannel.


diagram of conversation with backchannel VO

If jumping in is how engaged partners show enthusiasm, backchannel VO is how they show they're listening. Backchannel voice-over is all the little vocalizations person B makes while reacting to what person A is saying, communicating agreement, disagreement, shock, or surprise. Things like "mm-hmm," or "yeah," or "no..." in the background. These need to be used strategically to keep from feeling rude or distracting, but they make a huge difference in communicating the dynamic between your voices.

Here's the same conversational sample with tighter editing and backchannel elements baked in:


Same conversation with editing and a few backchannel VO elements.
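The editing itself happens on a timeline in your audio tool of choice, but the underlying arithmetic is simple: each clip starts a little before the previous one ends. Here's a sketch of that scheduling logic in plain Python; the durations and lead-in values are invented for illustration:

```python
def schedule(lines):
    """Compute start times for clips that 'jump in' on each other.

    lines: sequence of (duration_s, lead_in_s), where lead_in_s is how
    early the clip starts before the current end of the timeline
    (0 = wait politely for the previous line to finish).
    Returns start times in seconds.
    """
    starts, timeline_end = [], 0.0
    for duration, lead_in in lines:
        start = max(0.0, timeline_end - lead_in)
        starts.append(start)
        # A short backchannel can end before the line it overlaps,
        # so the timeline end only ever moves forward.
        timeline_end = max(timeline_end, start + duration)
    return starts

# A's opener, B jumping in half a second early, and a quick
# backchannel landing 1.5s before the timeline's current end:
schedule([(3.0, 0.0), (2.0, 0.5), (0.5, 1.5)])
# -> [0.0, 2.5, 3.0]
```

Even if you never script it, thinking in these terms (how early does each reply start, how deep inside a line does each backchannel sit) makes the manual editing much faster.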

Engaged partners have human moments, and maybe even a little fun.


A

So... candlesticks are basically little mood rings for the market, right?

B

...wow, our metaphor game is on point today.

A

Right? We are FANCY.

We have one more tool to deploy in our quest for human-sounding conversations: unscripted human moments. Interruptions, talking over each other, quick apologies, the odd tangent or joke. These are things you'll probably need to add to your script — snippets incidental to the main point of dialogue. But honestly, they're some of the most fun elements to build.


A fully tuned and edited project in ElevenLabs.

If we pull all of that together (the tuned lines, the tighter editing, the backchannel vocalizations, and a few human moments), we can get some truly convincing conversation. Here's a final clip, built using the ElevenLabs v3 model with all of the above learnings applied.


Full conversation with editing, backchannel VO, and a few human moments.

A few closing insights.


AI voice has come a long way, and tools like ElevenLabs are making it more accessible and more capable by the month. But the most important thing I took away from my experience so far is that the technology is only part of the equation. The craft of writing, casting, and editing a conversation is where the real work happens, and where the real returns are. Get that right, and the result can be genuinely compelling. With that, here are a few closing thoughts from everything I learned along the way.


The perceived quality and impact of handcrafted conversations is highly subjective.

Humor, cadence, and topic will all land differently with different people. And from my own experience, you become intensely biased while building these things, because you have the context of where it started and how much it improved. So always test your crafted conversations with fresh ears before calling them done.


AI voice is best used to make artificial experiences sound more human, not to replace existing human voices.

I'll go on record: at the time of writing, AI voice is tremendous at making imperfect artificial interactions feel more human. It is not an effective replacement for human voices where real human dialogue is expected and valued, largely because of how much work it takes just to break even on the "convincing humanity" factor.


Even the most human-sounding AI voice won't convince an audience determined to hate AI.

Trained on countless poorly executed examples, these folks will actively seek out any hint of uncanny valley. There is precious little middle ground here, and the inclusion of AI voice may cause a non-trivial portion of your audience to disengage from your content.


The real power of AI voice isn't creative puppetry. It's building behavior and patterns to make agents more realistic and, where control matters more than convenience, directable.

I thoroughly enjoyed the process of writing and tuning conversations, likening it to building a kind of voice diorama. But even I have to admit it's a lot of work to make something sound human when I could just record humans. Many people I talked with during my workshop asked whether AI could learn ideal conversational patterns and take the tuning work out of the equation. That would be amazing: using voice tuning to build real, repeatable personalities would be an incredible boost to AI voice as a production tool. And if I’m thinking it, then by the inviolable laws of the internet I’m sure someone is already working on it.


If you’re looking for a design partner, leader, or collaborator, we should talk.

Whether you’re building a team, launching a product, or shaping your vision, I’m here to help — and I’m always up for a great conversation.

© 2025 Greg Martin. All rights reserved.

Find Me
