
Giving our blog a voice: how we turned posts into audio with open models, locally, on a Mac

Helder Vasconcelos
Apr 29, 2026

The walk that started it 🚢

A few weeks ago I was on a walk, half-listening to a podcast, half-thinking about the article I'd published the week before. And it hit me. That article had probably been read by hundreds of people sitting at a desk, but nobody had listened to it. Not because they didn't want to. Because there was nothing to listen to.

That bothered me more than it should have. We spend weeks writing these posts at LayerX, polishing arguments, choosing the right metaphors, trimming a paragraph for the third time. Then we ship a wall of text, and that's the only door into the content. If you're commuting, cooking, or on a flight without good Wi‑Fi, the post simply isn't there for you.

So I asked the obvious question. How hard would it be to give every post on this blog its own voice? Not a robotic voice. A real, listenable one. And, because we're us, could we do it without sending a single line of our writing to a third-party TTS API, without paying per character, without any cloud at all? Just the laptop, open-source models, and a coffee.

That little side question turned into a project I fell in love with. This is the story of how we built it.

The article you're reading right now? It has an audio version. Hit play. We'll be here when you're done. 🎧

The naive first sketch ✏️

The instinct, when you hear "blog post to audio," is to grab the first text-to-speech library you find, point it at a markdown file, and let it talk. We tried that early on, just to feel the shape of the problem. The result was instructive. By instructive I mean: it sounded awful.

Markdown is written for the eye, not the ear. Headings get spoken as part of the prose, like someone is awkwardly announcing chapter titles mid-sentence. Code blocks become long, painful, character-by-character recitations of curly braces and semicolons. Bullet lists collapse into a stream of disconnected fragments. URLs are read out letter by letter. There are no breaths, no pauses, no shape. It's a wall of text wearing a costume.

That experiment told us something important. A blog post is not a script. Turning one into the other is its own creative job, not a translation job. Anyone who's ever recorded an audiobook knows this. The narrator doesn't just read. They edit on the fly. They smooth out things that work on a page but trip the tongue. They pause where the text wants you to breathe. They paraphrase the parts the ear wasn't built for.

So we stopped looking for a one-shot TTS solution and started designing a small pipeline that mimics how a human would prepare a piece for narration.

The three-stage idea 🧩

The pipeline we ended up with has three stages, and the easiest way to think about it is as three different people in a small studio.

  • The first one is the reader. They take the raw blog post, markdown or HTML, and break it apart. Headings here, paragraphs there, code blocks set aside in their own pile. Every section gets labelled with its heading level, like dividers in a binder.
  • The second one is the editor. This is where the real magic happens. The editor takes each section and rewrites it for the ear. Code blocks stop being literal text and become short, conversational descriptions, like "here we define a function that takes a list of paragraphs and returns the cleaned-up version." Long, dense paragraphs get gentle pauses inserted. Heading transitions get a slightly longer breath before them. The editor's job is to produce a version of the post that sounds like the post, instead of one that is the post read aloud.
  • The third one is the voice actor. They take the script the editor produced and turn it into actual audio. They pronounce the words, they respect the pauses, they keep the right pacing.

That's the whole architecture. Three roles, one pipeline. Once we settled on it, the rest of the project mostly designed itself.
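
If it helps to see that shape as code, here's a deliberately tiny sketch of the three roles wired together. Every function below is a hypothetical stub, not our actual module layout; the real reader, editor, and voice actor are described in the sections that follow.

    # A tiny, hypothetical sketch of the three-stage pipeline. Each stage is a
    # stub here; the real versions are fleshed out later in the post.

    def reader(markdown: str) -> list[str]:
        """Stage 1: break the post into sections (headings, prose, code set aside)."""
        return [block for block in markdown.split("\n\n") if block.strip()]

    def editor(section: str) -> str:
        """Stage 2: rewrite a section for the ear (an LLM call in the real pipeline)."""
        return section + " [PAUSE]"

    def voice_actor(script: str) -> bytes:
        """Stage 3: synthesise the script (a local TTS model in the real pipeline)."""
        return script.encode("utf-8")  # placeholder for audio samples

    def narrate(markdown: str) -> list[bytes]:
        return [voice_actor(editor(section)) for section in reader(markdown)]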

Why we put an LLM in the middle 🧠

The editor stage is where most TTS pipelines we'd seen cut a corner. Skipping it is tempting. You save a network call, you save tokens, you ship faster. It's also where the listening experience lives or dies.

A reasoning model (we use a small, fast one through OpenRouter) handles this kind of rewriting well. Hand it a section that includes a code snippet and it does the right thing instinctively: describe the snippet's intent, skip the syntax, keep the surrounding narrative intact. Hand it a paragraph with five short sentences in a row and it figures out where a brief pause helps the listener catch up. Hand it a heading and it knows the section that follows should start fresh, with a beat of silence first.

We encode those decisions as tiny markers in the script: [PAUSE], [SHORT PAUSE], [LONG PAUSE]. They're musical rests. The voice actor downstream knows to insert silence whenever it sees one. That's how you get prose that breathes.
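
To make the editor concrete, here's a hedged sketch of that rewriting step as a single call through OpenRouter's OpenAI-compatible API. The model name and the prompt wording are illustrative assumptions, not the exact ones we run.

    # Sketch of the editor stage: ask a small model to rewrite one section for
    # the ear. The model id and prompt are illustrative, not our exact setup.
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",   # OpenRouter speaks the OpenAI API
        api_key=os.environ["OPENROUTER_API_KEY"],
    )

    EDITOR_PROMPT = (
        "Rewrite the following blog section as a narration script. Describe code "
        "blocks conversationally instead of reading them literally, expand "
        "abbreviations, and insert [PAUSE], [SHORT PAUSE] or [LONG PAUSE] markers "
        "where a listener needs a breath. Return only the script."
    )

    def rewrite_section(section_text: str, model: str = "openai/gpt-4o-mini") -> str:
        response = client.chat.completions.create(
            model=model,  # any small, fast reasoning model on OpenRouter will do
            messages=[
                {"role": "system", "content": EDITOR_PROMPT},
                {"role": "user", "content": section_text},
            ],
        )
        return response.choices[0].message.content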

The audio version of a post should feel like the post, instead of the post being read by a stranger who's never seen it before.

There's a second, quieter benefit to this stage. The LLM also catches things that visually work but acoustically don't: abbreviations that should be expanded, technical terms that need smoother phrasing, formatting artefacts that snuck in from the markdown source. By the time the script reaches the voice actor, it's clean.

The part where everything runs on a laptop 💻

Here's the bit I get most excited about.

For a long time, "good text-to-speech" meant calling an expensive cloud API: ElevenLabs, OpenAI TTS, Google Cloud Text-to-Speech, Azure Speech. The voices were impressive, but the price scaled with every paragraph, the latency depended on someone else's data centre, and your content quietly left the building every time you generated audio. For a feature we wanted to run on every post we ever publish, the math didn't work, and the data-handling story didn't either.

What changed is the combination of two things that landed in the last couple of years. Open-source TTS models that actually sound human, and MLX, Apple's machine-learning framework built for the unified-memory architecture of M-series chips. Together they unlock something that felt impossible eighteen months ago: running a high-quality narrator entirely on the same MacBook that's open in front of you.

MLX deserves a paragraph of its own, because it's the piece of this stack that quietly does the heavy lifting. It's an open-source array framework from Apple, designed from scratch for Apple Silicon. The trick that matters most is unified memory. On a normal machine, the CPU and the GPU live in different worlds with their own separate memory, and shipping data between them is half the cost of running a model. On an M-series chip, they share the same pool. Tensors don't get copied around to be used. The model weights load once and the GPU starts running. The practical effect is that a laptop with an M1 Pro can hold and run models that would have needed a dedicated GPU server eighteen months ago, with no driver setup, no CUDA, no Docker image with NVIDIA tooling baked in. You pip install mlx-audio, you load the model, you generate audio. That's the entire onboarding.
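
To show how short that onboarding really is, here's a minimal sketch of a single generation call with mlx-audio, using the Kokoro model and voice we mention below. The helper and its parameter names follow mlx-audio's examples and may differ slightly between versions, so treat it as a starting point rather than gospel.

    # Minimal sketch: synthesise one scripted section locally with mlx-audio.
    # Parameter names follow the mlx-audio examples and can vary by version.
    from mlx_audio.tts.generate import generate_audio

    generate_audio(
        text="Giving our blog a voice. A few weeks ago I was on a walk...",
        model_path="mlx-community/Kokoro-82M-bf16",  # fetched once, cached locally
        voice="bf_isabella",                         # the warm, calm default we ship
        file_prefix="section_01",                    # name of the output audio file
        audio_format="wav",
    )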

The economics shift accordingly. Local inference, on hardware you already own, is effectively free. No per-character billing. No tokens metered against a quota. No request-per-second cap. The marginal cost of generating one more minute of narration is the electricity it takes to run the chip for a few seconds, which is rounding error against any cloud TTS bill. For a feature like blog narration, where every published post triggers a generation, that gap compounds quickly. After a couple of months of usage, the cloud version of this pipeline would have cost more than the laptop did.

And it's not just cheap. It's fast. Kokoro narrates faster than real-time on an M1 Pro, which means a five-minute audio clip is ready in well under a minute. The latency that used to live in a network round-trip is gone. You hit generate, you hear the result.

We tried two models, and I want to talk about both because they're each interesting in different ways.

The first is Kokoro, an 82M-parameter open-source model that punches well above its weight. It's tiny, it's fast, and the default voice we chose (a warm, calm tone called bf_isabella) sounds pleasant. On an M-series Mac, Kokoro generates audio faster than you'd expect. For a typical 1,500-word post, narration is ready in under a minute. For most posts, this is the model we ship.

The second is Qwen3 TTS, an expressive model from Alibaba's open-weights family. It's bigger, slower, and noticeably more theatrical. The thing that makes it special is that you can give it instructions alongside the text. "Narrate this dramatically." "Read this conversationally." "This is a heading, emphasise it." That single feature changes everything. We pass the heading level from our parser through the pipeline and ask Qwen to give the title a different reading from a sub-heading, which gets a different reading from a body paragraph. It's a small thing, and it's what separates a robot reciting from a narrator performing.
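
As a small illustration, the instruction we attach can be as simple as a lookup on the heading level coming out of the parser. The wording below is hypothetical; the point is that per-section metadata survives all the way to the voice actor.

    # Hypothetical mapping from heading level to a narration instruction for an
    # instructable TTS model like Qwen3 TTS. Level 0 means a body paragraph.
    def narration_instruction(heading_level: int) -> str:
        if heading_level == 1:
            return "This is the article title. Read it slowly, with clear emphasis."
        if heading_level in (2, 3):
            return "This is a section heading. Announce it, then take a short beat."
        return "Read this conversationally, at a relaxed narration pace."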

Both models load through mlx-audio, both run on the GPU side of the Apple Silicon unified memory pool, and both stay completely offline. Our content never leaves the machine. After the initial model download, our cost per minute of generated audio is whatever electricity costs.

What the pipeline actually produces 📦

The whole thing is a small Python pipeline. You give it a blog post, and out the other end come two artefacts that fit neatly into any modern audio player.

The first is the audio itself, a single mono .wav or .mp3 file. Mono on purpose. Narration doesn't benefit from stereo, and going mono cuts the file size roughly in half, which matters when your reader is on a phone with patchy reception.

The second is a peak JSON file. This is the unsung hero of a good audio player. Drawing a waveform in the browser usually means downloading the entire audio file just to look at it, which is wasteful when the listener might never press play. So as we encode the audio, we also down-sample it into a small JSON array of around a thousand floating-point values between -1 and 1. That's the shape of the waveform, pre-computed and ready to render. The player loads the peak file in milliseconds, draws the waveform straight away, and only fetches the full audio when the listener decides to listen.
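
For anyone reproducing the peak file, it's a few lines of NumPy and soundfile: read the finished track, split it into roughly a thousand buckets, and keep the loudest signed value from each. The bucket count and the signed-peak convention are our choices, not a standard format.

    # Down-sample a finished narration into ~1,000 peak values for the player.
    import json
    import numpy as np
    import soundfile as sf

    def write_peaks(audio_path: str, peaks_path: str, buckets: int = 1000) -> None:
        samples, _sample_rate = sf.read(audio_path)   # mono track, values in [-1, 1]
        if samples.ndim > 1:                          # fold a stereo file to mono, just in case
            samples = samples.mean(axis=1)
        chunks = np.array_split(samples, buckets)
        # Keep the value with the largest magnitude in each bucket, sign included,
        # so the drawn waveform keeps its up-and-down shape.
        peaks = [float(c[np.argmax(np.abs(c))]) if len(c) else 0.0 for c in chunks]
        with open(peaks_path, "w") as f:
            json.dump(peaks, f)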

Two files. Both small enough to cache aggressively and drop into any frontend without much ceremony.

One last detail that's easy to miss but matters a lot in practice. The HTTP server you put in front of the mp3 has to support HTTP range requests. That's the mechanism browsers use to fetch a slice of a file instead of the whole thing. Without them, a five-minute narration becomes a five-megabyte stall before any sound comes out. Most modern object stores and CDNs handle this for free, but it's worth checking explicitly. The audio file is the largest artefact in the pipeline, and a player that can't seek or stream will undo a lot of the work upstream.
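
A quick way to check is to ask the server for a single byte and see whether it answers with 206 Partial Content. The URL below is a placeholder for one of your own files.

    # Check that the server in front of the audio honours HTTP range requests.
    # The URL is a placeholder; point it at one of your own mp3 files.
    import urllib.request

    def supports_range_requests(url: str) -> bool:
        request = urllib.request.Request(url, headers={"Range": "bytes=0-0"})
        with urllib.request.urlopen(request) as response:
            # 206 Partial Content means the slice was honoured; 200 means the
            # server ignored the Range header and sent the whole file.
            return response.status == 206

    print(supports_range_requests("https://example.com/audio/post.mp3"))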

What it feels like in production 🎧

The end-to-end experience, from a writer's perspective, is unremarkable in the best possible way. You publish a post the way you always have. A few minutes later, an audio player shows up at the top of the article, with a waveform and a play button. You hit play and you hear the post. Your post, with the right pauses, the right emphasis, the headings landing like real headings.

What's striking is how quickly the audio version stops feeling like an add-on and starts feeling like a first-class version of the content. Some readers (listeners now) tell us they prefer it. They listen on the train, on a run, while doing the dishes. The blog is suddenly available in spaces where reading was never going to happen.

And the running cost is, honestly, hilarious. No monthly TTS bill. No API quota to watch. No third-party data agreement to keep on file. A MacBook Pro, two open-source models, a bit of glue, and a CDN. That's the whole stack.

The stack, in one place 🧱

For anyone who wants to reproduce this on their own blog, here's everything that's actually in the box.

Hardware

  • An Apple Silicon Mac. The whole thing was built and runs on my MacBook Pro with the M1 Pro chip, which tells you something about how little hardware you actually need for this. Anything from an M1 onward will narrate blog-length content comfortably.
  • That's it. No GPU, no separate inference box, no cloud instance.

Models

  • Kokoro (mlx-community/Kokoro-82M-bf16) - our default TTS model. Tiny, fast, pleasant voice. Open-source.
  • Qwen3 TTS (mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit) - used when we want expressive, instructable narration. Open-weights.
  • A small reasoning LLM through OpenRouter for the markdown-to-script rewriting step. Easy to swap; we've tried several and the quality difference at this scale is small.

Core frameworks

  • Python 3.11+ as the only runtime.
  • MLX and mlx-audio - Apple's machine-learning framework and the audio toolkit on top of it. This is what makes high-quality TTS practical on a laptop.
  • markdown-it-py for parsing markdown blog posts. The Python stdlib html.parser for HTML.
  • NumPy and soundfile for stitching the per-section audio chunks into a single mono track and writing the final wav/mp3 (a short sketch follows this list).
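
As a rough sketch of that last item (the gap length and sample rate here are assumptions, not our exact values): read each per-section chunk, fold it to mono if needed, add a short beat of silence between sections, and write one track.

    # Sketch: stitch per-section wav chunks into one mono track with a beat of
    # silence between sections. The 0.6 s gap and 24 kHz rate are assumptions.
    import numpy as np
    import soundfile as sf

    def stitch(chunk_paths: list[str], out_path: str, sample_rate: int = 24000,
               gap_seconds: float = 0.6) -> None:
        gap = np.zeros(int(gap_seconds * sample_rate), dtype=np.float32)
        pieces = []
        for path in chunk_paths:
            samples, rate = sf.read(path, dtype="float32")
            if samples.ndim > 1:              # fold any stereo chunk down to mono
                samples = samples.mean(axis=1)
            assert rate == sample_rate, f"unexpected sample rate in {path}"
            pieces += [samples, gap]
        track = np.concatenate(pieces[:-1])   # drop the trailing gap
        sf.write(out_path, track, sample_rate)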

The whole thing is well under a thousand lines of Python. Most of those lines are parsing edge cases, not AI.
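
On the parsing side, the reader stage is mostly a walk over markdown-it-py's token stream: group everything under the most recent heading and set fenced code aside. A condensed sketch, with our actual section structure simplified:

    # Condensed sketch of the reader stage: group markdown-it-py tokens into
    # sections keyed by the most recent heading, keeping code blocks separate.
    from markdown_it import MarkdownIt

    def parse_sections(markdown: str) -> list[dict]:
        tokens = MarkdownIt().parse(markdown)
        sections = [{"heading": None, "level": 0, "text": [], "code": []}]
        i = 0
        while i < len(tokens):
            token = tokens[i]
            if token.type == "heading_open":
                level = int(token.tag[1])          # "h2" -> 2
                title = tokens[i + 1].content      # the inline token that follows
                sections.append({"heading": title, "level": level, "text": [], "code": []})
                i += 3                             # skip the inline and heading_close tokens
                continue
            if token.type == "fence":              # fenced code block, set aside for the editor
                sections[-1]["code"].append(token.content)
            elif token.type == "inline":           # prose inside paragraphs, lists, quotes
                sections[-1]["text"].append(token.content)
            i += 1
        return [s for s in sections if s["heading"] or s["text"] or s["code"]]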

Reflections 🪞

This project sits squarely in the middle of a shift we've been talking about a lot at LayerX: the move from "AI as a paid cloud service you call into" to "AI as a small, capable model running on the hardware you already own." It's the same shift that pushed us into fine-tuning small open-source LLMs for our own customer support earlier this year, and it runs through how our engineering workflow has evolved with agentic AI too. Eighteen months ago, building this with comparable quality would have meant integrating a hosted TTS provider, accepting their pricing, and shipping a feature that depended on their uptime. Today it's an afternoon of writing parsing code, an afternoon of pipeline glue, and a couple of pip install commands.

The lesson we keep relearning: the interesting AI work happens above the model now, not inside it. Models are commodities. What turns one into a product is the small editorial intelligence around it. The parser that knows what a heading is. The editor that knows code shouldn't be read literally. The peak file that lets a player draw a waveform without downloading the audio. None of that is AI. It's what makes the AI usable.

If you write a blog and you've been quietly wishing it had voices, I'd encourage you to try this approach. Open-source TTS is good now. MLX makes it fast on Apple Silicon. The pipeline is short enough to fit in your head. And, selfishly, the world is better when more good writing is also listenable.

Key takeaways 💡

  • A blog post is not a script. The single biggest quality win came from inserting a rewriting step between parsing and synthesis, instead of feeding raw markdown straight into a TTS engine.
  • Small open-source models are production-ready. Kokoro at 82M parameters and Qwen3 TTS produce audio we're happy to ship, with no cloud dependency, on a single Apple Silicon machine.
  • MLX changes the economics. Running TTS locally on a Mac removes per-character billing, latency variance, and third-party data exposure from the equation.
  • The output is two simple files. A mono .wav or .mp3 and a peak JSON. Both small, both cache-friendly, both ready for any modern audio player.
  • The interesting work is above the model. Parsing, pause planning, heading-level emphasis, waveform pre-rendering. None of that is AI, and it's all what makes the AI feel like a product.

Frequently asked questions ❓

Can I really run text-to-speech locally on a Mac? Yes, and surprisingly well. With MLX and the latest open-source TTS models, an M-series MacBook can generate broadcast-quality narration faster than real time. A 1,500-word blog post is typically ready in well under a minute. No cloud, no API key, no per-character billing.

Is Kokoro good enough to ship to production? For most blog posts, yes. Kokoro is an 82M-parameter open-source TTS model that produces clean, natural-sounding speech at very low latency. We use it as the default for the LayerX blog. For posts that need more expressive, theatrical narration (interviews, longer-form storytelling), we switch to Qwen3 TTS and use its instruct parameter to direct the performance.

Do I need a GPU to run MLX? No. MLX is designed for the unified memory architecture of Apple Silicon (M1/M2/M3/M4 chips), so the same chip that runs your editor runs the model. There's no separate GPU to provision, no CUDA setup, no container with NVIDIA drivers. If you already have an M-series Mac, you have everything you need.

How does this compare to ElevenLabs, OpenAI TTS, or Google Cloud Text-to-Speech? Cloud TTS providers still have an edge in voice variety and ultra-realistic prosody. The trade-off is cost (per-character pricing scales aggressively if you generate audio for every post you publish), latency (network round trips), and data handling (your content leaves your infrastructure on every call). For a recurring, high-volume use case like an entire blog, a local open-source pipeline wins on economics and privacy.

Do I have to use markdown? Can I narrate HTML pages? Both work. The pipeline accepts markdown or HTML, parses it into sections by heading, and feeds each section through the same LLM rewriting step before synthesis.

Why use an LLM in the middle? Can't I just feed text straight into the TTS engine? You can, but the result sounds like a robot reading a document. Code blocks, bullet lists, URLs, and headings all need to be transformed for the ear before they're synthesised. The LLM is what turns "a blog post" into "a script."

The pipeline described here is what generates the audio version of every article on this blog. If you're building something at the intersection of AI, content, and product UX, I'd love to hear about it. Find me on X or come say hi at layerx.xyz.
