Integrating ElevenLabs AI Text to Speech with a Headless CMS

September 27, 2024

In this tutorial, we’ll explore how to generate audio versions of your content entries using Storyblok and ElevenLabs. ElevenLabs is an innovative technology company specializing in realistic and natural-sounding AI-generated voices. Using ElevenLabs text-to-speech API endpoint in combination with Storyblok’s webhooks and APIs, we’ll create a serverless function that automatically produces an mp3 file hosted in Storyblok’s built-in DAM. This approach is technology-agnostic, allowing you to serve high-quality narrated content across different channels, improving the user experience and accessibility of your digital offerings.

Are you curious to jump right into the code? Check out the GitHub repository containing the complete logic required in the serverless function.

Requirements

In order to follow this tutorial, please make sure you meet these requirements:

A basic understanding of JavaScript and TypeScript, serverless functions, and webhooks
A Storyblok account
An ElevenLabs account
A Netlify account
Netlify CLI installed
Node.js LTS installed

Creating the content model in Storyblok

First of all, let’s set up a suitable component schema in Storyblok. If you’re unfamiliar with this step, please refer to our documentation. For any content type of your choice, create an Asset field with the technical name audio, configured only to accept audio files. And…that’s it! As we will crawl your production website to retrieve the content as it is delivered to the user, this designated field is all we need to proceed.

Creating the serverless function

In a blank local project initialized with npm, run netlify init followed by netlify functions:create. Pick Serverless function (Node/Go) and TypeScript from the options. We can use the basic [hello-world] example as a template. Let’s call the function text-to-speech. After the installation has been completed, you’ll find the generated boilerplate code in netlify/functions/text-to-speech/text-to-speech.ts.

Let’s create a few additional folders so that the project has the following folder structure:

netlify
└── functions
    └── text-to-speech
src
└── lib
    ├── elevenlabs
    └── storyblok

As you can see, we’ll manage the relevant logic for ElevenLabs and Storyblok in two respective folders. Before aggregating the final serverless function, let’s therefore create and discuss these separately.

Before moving on, let’s install all required additional dependencies:

npm i elevenlabs storyblok-js-client cheerio dotenv formdata-node

Also, let’s make sure to set "type": "module" in the package.json.

Retrieving an audio stream from ElevenLabs

Now, let’s create src/lib/elevenlabs/text_to_speech_stream.ts with the following content:

src/lib/elevenlabs/text_to_speech_stream.ts

import { ElevenLabsClient } from "elevenlabs"

if (!process.env.ELEVENLABS_API_KEY) {
  throw new Error("Missing ELEVENLABS_API_KEY in environment variables.")
}

const client = new ElevenLabsClient({
  apiKey: process.env.ELEVENLABS_API_KEY,
})

export const createAudioStreamFromText = async (
  text: string
): Promise<Buffer | undefined> => {
  try {
    const audioStream = await client.generate({
      voice: "Rachel",
      model_id: "eleven_turbo_v2",
      text,
    })

    const chunks: Buffer[] = []
    for await (const chunk of audioStream) {
      chunks.push(chunk)
    }

    const content = Buffer.concat(chunks)
    return content
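  } catch (err) {
    // Log the failure so it appears in the function logs; the caller
    // treats an undefined result as an error.
    console.log(err)
    return
  }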
}

ElevenLabs conveniently provides a JavaScript client. After checking whether the ElevenLabs API key has been provided as an environment variable, the client is initialized and the createAudioStreamFromText function is exported. This function accepts a string as a parameter, allowing us to pass text content from Storyblok at a later stage. Using the generate method, an audio stream of the text content is created; its chunks are collected into a single Buffer, which the function returns. If the request fails, the function returns undefined. It is possible to customize the output by selecting a voice and a model.

hint:

You could create voice and model fields in your content model to allow content creators to influence the AI-generated audio output per story.
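As a minimal sketch of that idea, the helper below picks the voice and model from the story content and falls back to the values used in createAudioStreamFromText. The field names voice and model and the function name resolveVoiceSettings are assumptions made for illustration; they are not part of the tutorial’s repository.

```typescript
// Hypothetical fields: "voice" and "model" are assumed to be text or
// single-option fields added to the content type.
interface StoryContentWithVoice {
  voice?: string
  model?: string
}

interface VoiceSettings {
  voice: string
  model_id: string
}

// Defaults taken from the createAudioStreamFromText function above.
const DEFAULT_SETTINGS: VoiceSettings = {
  voice: "Rachel",
  model_id: "eleven_turbo_v2",
}

export const resolveVoiceSettings = (
  content: StoryContentWithVoice
): VoiceSettings => ({
  // Unset or whitespace-only fields fall back to the defaults.
  voice: content.voice?.trim() || DEFAULT_SETTINGS.voice,
  model_id: content.model?.trim() || DEFAULT_SETTINGS.model_id,
})
```

The returned object could then be spread into the options passed to client.generate alongside the text.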

Let’s export the function by creating src/lib/elevenlabs/index.ts with the following content:

src/lib/elevenlabs/index.ts

export { createAudioStreamFromText } from "./text_to_speech_stream"

Handling the content entry and uploading the audio file to Storyblok

In the src/lib/storyblok folder, we’ll need to provide the code to:

retrieve and process the content of a published or modified Storyblok story to hand it over to the previously created createAudioStreamFromText function
upload the generated audio file as an asset in the Storyblok space and assign it to the audio asset field of the relevant story

For both operations, we’ll require the Storyblok JavaScript client. We can initialize the client in a dedicated src/lib/storyblok/client.ts file:

src/lib/storyblok/client.ts

import StoryblokClient from "storyblok-js-client"

if (!process.env.STORYBLOK_PERSONAL_ACCESS_TOKEN) {
  throw new Error(
    "Missing STORYBLOK_PERSONAL_ACCESS_TOKEN in environment variables."
  )
}
const Storyblok = new StoryblokClient({
  oauthToken: process.env.STORYBLOK_PERSONAL_ACCESS_TOKEN,
})

export default Storyblok

Similarly to the ElevenLabs counterpart, we’ll make sure that the access token is provided before proceeding to initialize the client. For our purposes, the client needs to be initialized with a personal access token in order to be able to use Storyblok’s Management API.

Hereafter, we can proceed to retrieve and process the content. In this example, we’ve chosen to use Cheerio in order to crawl and parse the content as it is delivered to the audience, directly from the production environment. The advantage of this approach is that the logic that aggregates the story content in the frontend does not have to be reproduced in the serverless function (which is particularly a concern when dealing with complex stories with many nested components).

hint:

Should your project not have any crawlable production environment, it is also possible to fetch and aggregate the story content directly in the serverless function.
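If you go down that route, the story content has to be flattened into plain text in the function itself. The following is a minimal sketch, assuming the article body lives in a rich text field, which Storyblok stores as a nested JSON tree of nodes. The function name richTextToPlainText and the list of block-level node types are assumptions for illustration; other field types and nested components would need their own handling.

```typescript
// Minimal shape of a rich text node: a tree whose leaves carry `text`.
interface RichTextNode {
  type: string
  text?: string
  content?: RichTextNode[]
}

// Assumed block-level node types. A line break is appended after each so
// that sentences from adjacent blocks do not run together.
const BLOCK_TYPES = new Set(["paragraph", "heading", "list_item", "blockquote"])

export const richTextToPlainText = (node: RichTextNode): string => {
  if (node.type === "text") {
    return node.text ?? ""
  }
  // Recurse into the children and concatenate their text.
  const inner = (node.content ?? []).map(richTextToPlainText).join("")
  return BLOCK_TYPES.has(node.type) ? `${inner.trim()}\n` : inner
}
```

The resulting string could be passed to createAudioStreamFromText in place of the crawled text.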

Let’s create src/lib/storyblok/get_story_content.ts with the following content:

src/lib/storyblok/get_story_content.ts

import * as cheerio from "cheerio"
import Storyblok from "./client"

const titleSelector = "h1"
const bodySelector = "[data-blog-content]"

const getStoryUrl = async (
  spaceId: number,
  storyId: number
): Promise<string> => {
  try {
    const res = await Storyblok.get(`/spaces/${spaceId}/stories/${storyId}`)
    return res.data.story.full_slug
  } catch (err) {
    console.log(err)
    return ""
  }
}

if (!process.env.PRODUCTION_DOMAIN) {
  throw new Error("Missing PRODUCTION_DOMAIN in environment variables.")
}

export const getStoryContent = async (
  spaceId: number,
  storyId: number
): Promise<string> => {
  try {
    const domain = process.env.PRODUCTION_DOMAIN
    const url = await getStoryUrl(spaceId, storyId)
    const urlToCrawl = `${domain}${url}?ts=${Date.now()}`
    const res = await fetch(urlToCrawl)
    const urlText = await res.text()
    const cheerioDocument = cheerio.load(urlText)
    return `Article title: ${cheerioDocument(titleSelector).text()}. <break time="1.0s" /> Article content: ${cheerioDocument(bodySelector).text()}`
  } catch (err) {
    console.log(err)
    return ""
  }
}

The first function, getStoryUrl, requires the parameters spaceId and storyId (which are included in the webhook payload), fetches the full story object, and returns the story’s full_slug. It uses the stories endpoint of the Management API.

In combination with the production domain (defined via an environment variable), we can dynamically construct the URL to be crawled using Cheerio, which occurs in the getStoryContent function. You can customize the titleSelector and bodySelector to match the DOM of your production environment.
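Note that the template literal in getStoryContent concatenates the domain and the slug directly, so it relies on PRODUCTION_DOMAIN ending with a slash. As a hedged alternative, the sketch below builds the URL with the standard URL class and tolerates both variants. The function name buildCrawlUrl is hypothetical and not part of the tutorial’s repository.

```typescript
export const buildCrawlUrl = (
  domain: string,
  fullSlug: string,
  timestamp: number = Date.now()
): string => {
  // Normalize so that exactly one slash separates domain and slug.
  const base = domain.endsWith("/") ? domain : `${domain}/`
  const url = new URL(fullSlug.replace(/^\/+/, ""), base)
  // Same cache-busting query parameter as in getStoryContent.
  url.searchParams.set("ts", String(timestamp))
  return url.toString()
}
```

Inside getStoryContent, the urlToCrawl assignment could then become buildCrawlUrl(domain, url).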

Once the content has been loaded using the JavaScript Fetch API and parsed using Cheerio’s load method, the final output is improved by separating the main headline from the main section of the text and instructing ElevenLabs to pause after reading the headline using the <break time="1.0s" /> tag.

Lastly, we need to take care of uploading the audio stream returned by the createAudioStreamFromText function as an asset to Storyblok. In order to accomplish this, let’s create src/lib/storyblok/upload_asset.ts with the following content:

src/lib/storyblok/upload_asset.ts

import { FormData } from "formdata-node"
import Storyblok from "./client"

interface SignedResponse {
  fields: { [key: string]: string }
  post_url: string
  pretty_url: string
  id: number
}

interface StoryblokAssetResponse {
  data: SignedResponse
}

export const uploadAsset = async (
  fileContent: Buffer,
  spaceId: number,
  storyId: number
): Promise<boolean> => {
  const fileName = `${storyId}-text-to-speech.mp3`
  try {
    const newAssetEntry = (await Storyblok.post(`/spaces/${spaceId}/assets/`, {
      filename: fileName,
    })) as unknown as StoryblokAssetResponse

    const signedResponse = newAssetEntry.data as SignedResponse
    const blob = new Blob([fileContent])
    const assetRequestBody = new FormData()

    for (let key in signedResponse.fields) {
      if (signedResponse.fields[key])
        assetRequestBody.set(key, signedResponse.fields[key])
    }

    assetRequestBody.set("file", blob, fileName)

    await fetch(signedResponse.post_url, {
      method: "POST",
      body: assetRequestBody,
    })

    await Storyblok.get(
      `spaces/${spaceId}/assets/${signedResponse.id}/finish_upload`
    )

    const getStoryRes = await Storyblok.get(
      `/spaces/${spaceId}/stories/${storyId}`
    )
    const updatePayload = getStoryRes.data
    const oldAudio = getStoryRes.data.story.content.audio

    updatePayload.story.content.audio = {
      filename: signedResponse.pretty_url,
      fieldtype: "asset",
      is_external_url: false,
      id: signedResponse.id,
    }

    await Storyblok.put(`/spaces/${spaceId}/stories/${storyId}`, updatePayload)

    // Only delete when the field previously referenced an actual asset
    if (oldAudio?.id) {
      try {
        await Storyblok.delete(`/spaces/${spaceId}/assets/${oldAudio.id}`, {})
      } catch (err) {
        console.log(err)
      }
    }
    return true
  } catch (err) {
    console.log(err)
    return false
  }
}

The uploadAsset function accepts three parameters: the file content retrieved from ElevenLabs, and the space and story IDs included in the webhook payload. First, a new asset entry is created, and Storyblok’s API returns a signed response. The file is then posted to the signed URL together with the returned form fields, and the upload is finalized via the finish_upload endpoint. Once the asset upload has been completed, the whole story object is fetched in order to replace the asset referenced in the audio field and update the story. Lastly, the previous asset is removed.
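One detail worth noting is that fetch does not reject on HTTP error statuses, so a failed POST to the signed URL would go unnoticed in the code above. A small guard, sketched here under the assumption that you want such a failure to end up in the existing catch block, could look as follows. The helper name assertUploadSucceeded is hypothetical.

```typescript
// Structural subset of the Fetch API's Response, so the helper can be
// exercised without a network call.
interface ResponseLike {
  ok: boolean
  status: number
  statusText: string
}

export const assertUploadSucceeded = (res: ResponseLike): void => {
  // `ok` is true only for 2xx statuses.
  if (!res.ok) {
    throw new Error(
      `Asset upload failed with status ${res.status} ${res.statusText}`
    )
  }
}
```

In uploadAsset, the result of the fetch call to signedResponse.post_url could be passed to this helper before finish_upload is requested; a thrown error would then make the function return false.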

Let’s export both functions by creating src/lib/storyblok/index.ts with the following content:

src/lib/storyblok/index.ts

export { uploadAsset } from './upload_asset'
export { getStoryContent } from './get_story_content'

Finalizing and deploying

The last step is to replace the code of the dummy serverless function located under netlify/functions/text-to-speech/text-to-speech.ts with the following:

netlify/functions/text-to-speech/text-to-speech.ts

import { createAudioStreamFromText } from "~/lib/elevenlabs"
import { getStoryContent, uploadAsset } from "~/lib/storyblok"

export default async (req: Request) => {
  if (req.method === "OPTIONS") {
    return new Response("", { status: 200 })
  }
  if (req.method !== "POST") {
    return new Response("", { status: 405 })
  }

  const body = await req.json()
  const content = await getStoryContent(body.space_id, body.story_id)
  if (!content) {
    return Response.json(
      { message: "The entry content is empty." },
      { status: 500 }
    )
  }

  const audioBuffer = await createAudioStreamFromText(content.slice(0, 100))
  if (!audioBuffer) {
    return Response.json(
      { message: "Error while generating the audio." },
      { status: 500 }
    )
  }

  const uploadAssetRes = await uploadAsset(
    audioBuffer,
    body.space_id,
    body.story_id
  )

  return uploadAssetRes
    ? Response.json(
        { message: "Text-to-Speech successfully created and uploaded." },
        { status: 200 }
      )
    : Response.json({ message: "Something went wrong." }, { status: 500 })
}

Here, we’ll use all of the functions created previously. OPTIONS requests are answered with an empty 200 response, and any method other than POST is rejected with a 405. For POST requests, the webhook payload is parsed into a JavaScript object, allowing us to provide the space and story IDs as parameters for getStoryContent. Subsequently, the crawled content is provided as a string parameter for createAudioStreamFromText in order to retrieve the audio from ElevenLabs. Note that the example passes only the first 100 characters (content.slice(0, 100)); remove the slice call to narrate the complete entry. Finally, the generated audio is uploaded to Storyblok and linked to the story using uploadAsset.
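The handler reads space_id and story_id from the request body without checking them. As a minimal sketch of the kind of validation you might add, the following type guard returns null for anything that does not carry both IDs as integers. The name parseWebhookPayload is hypothetical, and only the two properties used in this tutorial are checked.

```typescript
interface WebhookPayload {
  space_id: number
  story_id: number
}

export const parseWebhookPayload = (body: unknown): WebhookPayload | null => {
  if (typeof body !== "object" || body === null) return null
  const { space_id, story_id } = body as Record<string, unknown>
  // Both IDs must be present and must be integers.
  if (!Number.isInteger(space_id) || !Number.isInteger(story_id)) return null
  return { space_id: space_id as number, story_id: story_id as number }
}
```

In the handler, a null result could be answered with a 400 response before any content is crawled.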

hint:

In a production scenario, you should configure a webhook secret and verify the signature of each request so that not just anyone can invoke the serverless function.
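A minimal sketch of such a check is shown below. It assumes that Storyblok sends a hex-encoded HMAC-SHA1 of the raw request body, computed with the webhook secret, in a webhook-signature header; please confirm the header name and algorithm against the current webhook documentation. The function name isValidSignature is hypothetical.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto"

export const isValidSignature = (
  rawBody: string,
  signature: string | null,
  secret: string
): boolean => {
  if (!signature) return false
  // Assumption: the signature is the hex HMAC-SHA1 digest of the raw body.
  const expected = createHmac("sha1", secret).update(rawBody).digest("hex")
  const a = Buffer.from(expected, "utf8")
  const b = Buffer.from(signature, "utf8")
  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  return a.length === b.length && timingSafeEqual(a, b)
}
```

To use it, the handler would read the body with req.text() instead of req.json(), pass it along with req.headers.get("webhook-signature") and the secret (for example from an environment variable of your choosing), return a 401 response on failure, and only then call JSON.parse on the raw body.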

Now, we can conveniently deploy the serverless function by running netlify build followed by netlify deploy --prod. Don’t forget to add your environment variables to the Netlify project. After deploying, a link to the function logs of the Netlify project is provided. There, we can copy the URL of the endpoint.

hint:

In case the articles are very long and the serverless function times out, you could consider alternative backend solutions such as Netlify’s Background Functions.

Configuring the webhook in Storyblok

As we’d like to automatically generate and attach a new audio file whenever new or changed content is published, we’ll have to configure a webhook that gets fired whenever this event occurs. Therefore, in our Storyblok space, let’s head to Settings > Webhooks and create a New Webhook. Let’s name this webhook “Generate Audio via ElevenLabs” and paste the endpoint copied from Netlify beforehand. In the Triggers section, we’ll have to check Story published.

learn:

Please refer to our webhook documentation to learn more.

And that’s it! Try publishing any story, and you should see the generated audio file has been added to the story shortly after. You can also check the Netlify function logs to confirm whether the function has been invoked and run correctly.

Now, you can fetch the audio file directly from Storyblok’s Asset CDN and implement it in your digital offerings to provide state-of-the-art, AI-driven audio versions of your content.
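How the file is embedded depends on your frontend. As a framework-agnostic sketch, the helper below turns the value of the audio field into an HTML audio element and returns an empty string when no file is attached. The function names are hypothetical, and only the filename property of the asset object is used.

```typescript
// Subset of the asset field value; `filename` holds the asset URL.
interface AudioAsset {
  filename?: string | null
}

// Escape characters that would break out of an HTML attribute value.
const escapeAttribute = (value: string): string =>
  value
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")

export const renderAudioPlayer = (audio?: AudioAsset | null): string => {
  // Render nothing while the story has no generated audio yet.
  if (!audio?.filename) return ""
  return `<audio controls preload="none" src="${escapeAttribute(audio.filename)}"></audio>`
}
```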

Take this barebones example and feel free to customize it as required for your project. For example, you could provide additional customization and control options for editors, add more robust error handling, check whether the request is for a story from a particular folder or of a specific content type, and much more.
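As one example of such a customization, the following sketch decides whether a story qualifies for audio generation based on its folder and content type. It relies on the full_slug and content.component properties of the story object; the folder "blog/" and the content type "article" are placeholder values, and the function name shouldGenerateAudio is hypothetical.

```typescript
interface StoryLike {
  full_slug: string
  content: { component: string }
}

// Placeholder values: adjust to your own folder and content type names.
const ALLOWED_FOLDER = "blog/"
const ALLOWED_CONTENT_TYPES = ["article"]

export const shouldGenerateAudio = (story: StoryLike): boolean =>
  // The story must live in the folder and use an allowed content type.
  story.full_slug.startsWith(ALLOWED_FOLDER) &&
  ALLOWED_CONTENT_TYPES.includes(story.content.component)
```

Since getStoryUrl already fetches the full story object from the Management API, the check could be applied to that response before any text is sent to ElevenLabs.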

Authors

Christian Zoppi

Christian is a full-stack developer and he's the Head of the Website & Developer Experience department. He's from Viareggio, Italy.

Manuel Schröder

A former International Relations graduate, Manuel ultimately pursued a career in web development, working as a frontend engineer. His favorite technologies, other than Storyblok, include Vue, Astro, and Tailwind. These days, Manuel coordinates and oversees Storyblok's technical documentation, combining his technical expertise with his passion for writing and communication.
