How to Turn PDFs into Private, Cheap Audio — a hybrid 'local summary + cloud TTS' pattern
The problem
You want audio from a PDF. Maybe it’s a scanned paper. Maybe it’s sensitive. You don’t want to upload the whole file. And you don’t want a big monthly bill.
There are two obvious choices. One: upload the whole thing to cloud services (OCR, LLM, TTS) and get a finished MP3 fast. Two: do everything locally and keep everything private. Both work. Both have costs — money, time, and complexity.
There’s a third option. A hybrid pattern that keeps raw documents local, sends only compact scripts to the cloud, and saves money while tightening privacy.
The hybrid pattern, in one line
Run OCR locally. Extract the raw text. Summarize it, or draft a short narration script, on your own machine, covering the whole document or just the sections you need. Then send only that short script, not the whole file, to a cloud TTS. If you need full zero‑upload privacy, run TTS locally instead.
Why this matters now
• OCRmyPDF adds a local OCR text layer to scanned PDFs so you can extract searchable text without sending the source file to a service (OCRmyPDF project page).
• High-quality cloud TTS services like ElevenLabs offer production voices and developer APIs, but they bill by the character. The API exposes an enable_logging flag, with full zero‑retention modes available to enterprise customers, and public pricing is quoted per 1K characters (ElevenLabs API docs and pricing pages).
Put simply: if you can reduce what you send to the TTS API by 80–90%, you cut TTS billable units by 80–90% while keeping the document itself private on your machine or network.
Step-by-step workflow (practical)
- OCR locally. Use OCRmyPDF to add a text layer to any scanned PDF. That gives you machine text you can extract with pdftotext or tools of your choice.
- Slice the document. Break the text into digestible chunks: title, abstract/summary, section headings, and one paragraph per section. That’s your raw material.
- Summarize on‑device. Run a local summarizer or write a short script (3–6 sentences) that captures the takeaway per section. You don’t have to summarize the whole paper — pick the sections you need.
- Assemble a narration script. Stitch the short section summaries into a single narration script. Keep it tight: 500–2,500 characters is usually enough for a listen of roughly one to three minutes at typical speech rates.
- Send the script to the TTS API. Use a cloud TTS for natural voice output. If you use ElevenLabs, their convert endpoint returns MP3 and supports an enable_logging flag to limit retention for higher‑privacy requests (ElevenLabs API reference).
- Optional: publish privately. Upload the MP3 to a private podcast host or internal S3 bucket and share a private RSS feed if you need distribution inside a team.
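The steps above can be sketched in a few dozen lines of Python. The section-splitting heuristic, the voice ID, and the exact request shape of the TTS call are illustrative assumptions; check your provider's API reference before relying on them. The local summarization step is left as a plain join here so the sketch stays self-contained.

```python
# Hybrid pipeline sketch: local OCR + local script assembly, cloud TTS only
# for the short final script. Assumes ocrmypdf and pdftotext are installed
# locally and an ELEVENLABS_API_KEY environment variable is set.
import os
import re
import subprocess


def ocr_and_extract(pdf_path: str, ocr_path: str) -> str:
    """Run OCRmyPDF locally, then pull plain text with pdftotext."""
    # --skip-text leaves pages that already have a text layer untouched.
    subprocess.run(["ocrmypdf", "--skip-text", pdf_path, ocr_path], check=True)
    result = subprocess.run(["pdftotext", ocr_path, "-"],
                            check=True, capture_output=True, text=True)
    return result.stdout


def split_sections(text: str) -> list[str]:
    """Naive slicing: treat blank-line-separated blocks as sections."""
    return [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]


def assemble_script(summaries: list[str], max_chars: int = 2500) -> str:
    """Stitch per-section summaries into one narration script, capped in length."""
    script = " ".join(s.strip() for s in summaries if s.strip())
    return script[:max_chars]


def synthesize(script: str, out_path: str, voice_id: str = "YOUR_VOICE_ID") -> None:
    """Send only the short script to the cloud TTS (request shape per the
    ElevenLabs convert endpoint; enable_logging=false is the retention knob
    the article mentions — verify both against current docs)."""
    import json
    import urllib.request
    req = urllib.request.Request(
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}?enable_logging=false",
        data=json.dumps({"text": script}).encode(),
        headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"],
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())  # MP3 bytes by default
```

In a real pipeline you would replace the plain join in `assemble_script` with your local summarizer's output per section; everything before `synthesize` runs entirely on your machine.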
Concrete tradeoffs and a quick cost check
Cloud TTS is priced by characters. ElevenLabs’ public pricing shows per‑1K character rates for their TTS models (see pricing page). That means a long document will cost linearly more.
Example math (illustrative): sending a 50,000‑character script to a TTS endpoint costs roughly 50 × the per‑1K character rate; a 5,000‑character script costs one‑tenth as much.
The scaling is linear: plug in your plan's per‑1K price and the math is exact. The takeaway: reduce the script length and you directly reduce cost.
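The per‑character billing model reduces to a one-line function; the $0.10 per 1K characters used below is an illustrative placeholder, not ElevenLabs' actual rate.

```python
def tts_cost(script_chars: int, price_per_1k: float) -> float:
    """Cloud TTS bills linearly by character: cost = chars / 1000 * per-1K rate."""
    return script_chars / 1000 * price_per_1k


# Whole document vs. short summary script, at an illustrative $0.10 / 1K chars:
full_doc = tts_cost(50_000, 0.10)   # 5.0
summary = tts_cost(5_000, 0.10)     # 0.5 — one-tenth the cost
```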
Privacy tradeoffs: keeping OCR and summarization local means you never transmit the original file. Cloud TTS receives a short script instead. If you need stronger guarantees, ElevenLabs’ API documents an enable_logging parameter and mentions zero‑retention modes available to enterprise customers — a useful privacy knob when you must use cloud TTS.
When to use full cloud pipelines
If you need verbatim narration of the entire document (every paragraph read aloud), or you want AI to extract facts and re-write in a new voice, full cloud processing simplifies the pipeline and reduces engineering time. But costs scale with length, and you’re sending more data off‑site.
When to run everything locally
If the PDF is legally protected, contains PHI, or your compliance requires no uploads, run OCR and TTS on‑device. OCRmyPDF handles OCR locally. For TTS, open‑source engines (Coqui, etc.) can run on a decent laptop or server; they trade off voice naturalness and speed for privacy.
Quick checklist for engineers and power users
- Use OCRmyPDF to add a local text layer to scanned PDFs.
- Extract needed sections, not the whole document.
- Produce a short narration script locally; target ~500–2,500 characters for short listens.
- If using a cloud TTS, enable privacy options where available; check provider docs for retention knobs.
- If automation is required, trigger the pipeline with a local script, or use webhooks that send only the final script to the TTS provider.
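For the automation item in the checklist, a local trigger script only needs a way to decide what still has to be processed. A minimal sketch, assuming an inbox/outbox directory layout (the directory names and matching-by-stem convention are illustrative):

```python
from pathlib import Path


def pending_pdfs(inbox: Path, outbox: Path) -> list[Path]:
    """Return PDFs in `inbox` that have no matching MP3 in `outbox` yet.
    A cron job or file watcher can call this, then run the OCR -> summarize
    -> TTS pipeline on each result, so only final scripts ever leave the box."""
    done = {p.stem for p in outbox.glob("*.mp3")}
    return sorted(p for p in inbox.glob("*.pdf") if p.stem not in done)
```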
Sources and further reading
- OCRmyPDF project (adds local OCR text layer to scanned PDFs): https://github.com/ocrmypdf/OCRmyPDF
- ElevenLabs TTS API — convert endpoint and request flags (enable_logging / zero retention): https://elevenlabs.io/docs/api-reference/text-to-speech/convert
- ElevenLabs API Pricing (per‑1K characters): https://elevenlabs.io/pricing/api
- How to create a private podcast (for private episode distribution): https://transistor.fm/private-podcast/
Bottom line
You don’t have to choose between leaking your PDF and paying a large TTS bill. Run OCR locally, summarize locally, and send only what you need for a great listening experience. It’s cheaper. It’s faster in practice. And it keeps the source document off cloud servers.
Summary: local OCR + local summarization + short cloud TTS (or local TTS) gives you privacy, speed, and predictable costs.