How to generate a Sitemap with a headless CMS?

A sitemap is a crucial, often underestimated component that improves a website's visibility. It acts as a "table of contents", giving search engines a structured overview of the content available on the site so that they can navigate and index your pages more efficiently.

The sitemap.xml is an XML file with a specific structure to allow the search engine to understand your website's structure better.

For a dynamic website where the pages could be created, deleted, or moved, the sitemap file must reflect the updated structure of the pages you want to make visible to the search engines.

In this article, we will build a dynamic sitemap.xml based on the structure of a Storyblok space with the published stories.

Sitemap information

The sitemap file, typically named sitemap.xml, is an XML file that contains one entry for each URL you want to expose. Each entry can carry the following information:

  • loc: the location URI of a document. The URI must conform to RFC 2396. This element is required.
  • lastmod: the date the document was last modified. The date must conform to the W3C DATETIME format, for example, 2005-05-10. It can also include the time, for example, 2005-05-10T17:33:30+08:00. This element is optional.
  • changefreq: a string that indicates how frequently a URL's content is likely to change. Valid values are: always, hourly, daily, weekly, monthly, yearly, and never. Use always for documents that change each time they are accessed, and never for archived URLs. This value is a friendly suggestion for the search engine crawler, not a command. This element is optional.
  • priority: a number between 0.0 and 1.0 that indicates the priority of a particular URL relative to other pages on the same site. The value 0.0 identifies the lowest-priority pages, and the default priority of a page is 0.5. Priority is used to select between pages on your site, so setting a priority of 1.0 for all URLs will not help you, because only the relative priority of the pages is considered. This element is optional. If it is omitted for all the URLs, they are all handled with the same priority.
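
As an illustration of the fields listed above, here is a minimal sketch that builds a single sitemap entry with the loc, lastmod, changefreq, and priority elements of the sitemap protocol. The helper names escapeXml and buildUrlEntry and the sample URL are assumptions made up for this example, not part of any library:

```javascript
// Escape the characters that XML reserves, so URLs containing
// "&" or similar characters stay valid inside the sitemap.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Hypothetical helper: builds one <url> entry. Only loc is required;
// lastmod, changefreq, and priority are added when provided.
function buildUrlEntry({ loc, lastmod, changefreq, priority }) {
  const parts = [`<loc>${escapeXml(loc)}</loc>`];
  if (lastmod) parts.push(`<lastmod>${escapeXml(lastmod)}</lastmod>`);
  if (changefreq) parts.push(`<changefreq>${escapeXml(changefreq)}</changefreq>`);
  if (priority !== undefined) {
    parts.push(`<priority>${Number(priority).toFixed(1)}</priority>`);
  }
  return `<url>${parts.join("")}</url>`;
}

const entry = buildUrlEntry({
  loc: "https://www.example.org/articles/my-article?ref=a&b=1",
  lastmod: "2005-05-10",
  changefreq: "weekly",
  priority: 0.8,
});
console.log(entry);
```

Escaping matters mostly for the loc value, because query strings often contain the & character, which is not allowed unescaped in XML.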

In order to build a Sitemap using the Storyblok platform, we need at least to retrieve the stories' real paths. To retrieve all the real paths for the published stories, we can use the links endpoint to filter the published stories.

Typically for any listing, the Storyblok content delivery API paginates the results, and you need to walk through all the pages to obtain all the stories (if you have more than 100 stories and/or folders). Therefore, we can use the Storyblok JavaScript Client, which allows us to use pagination conveniently. Let's see how to retrieve the real path of all the published stories in an easy way.

The Content Delivery API exposes the Links endpoint that you can use to retrieve the real path of the stories.

If you plan to use the Storyblok JavaScript Client, you can install it via your package manager:

        
      npm i storyblok-js-client --save
    

Once the package is installed, you can start using the StoryblokClient object:

        
      import StoryblokClient from "storyblok-js-client"

const Storyblok = new StoryblokClient({
    accessToken: 'your-access-token'
})
    

To fill the accessToken parameter, you need to retrieve the access token from your current Storyblok space.

Once you have the Storyblok instance, you can call the getAll() method. The first parameter is the path of the Links endpoint, cdn/links, and the second parameter is an object that defines the request parameters. In this case, we need only the published stories, so we set the version parameter to published, according to the Retrieve Multiple Links documentation.

        
      let links = await Storyblok.getAll('cdn/links', {
    version: 'published'
})
    

The getAll() method returns an array of link objects, covering both stories and folders, according to the parameters passed as the second argument. A link object looks like this sample:

        
      {
  id: 353978374,
  slug: 'categories-2/products/productvariants/bike-004-xl',
  name: 'Bike 004 XL',
  is_folder: false,
  parent_id: 353978364,
  published: true,
  path: null,
  position: 0,
  uuid: '07a906b9-d6fa-4d48-a73e-8c9191043590',
  is_startpage: false,
  real_path: '/categories-2/products/productvariants/bike-004-xl'
}
    

The relevant attributes from the link object for creating a sitemap are is_folder and real_path. The is_folder attribute helps to retrieve only stories and skip the folders to generate the sitemap URLs. The real_path constitutes the correct and complete URL. Once you have your array of story links, you can create a loop, for example:

        
      let total = 0
links.forEach((item) => {
    if (!item.is_folder) {
        total++
        console.log(
            item.real_path,
        )
    }
});
    

Now that we know how to retrieve all the valid URLs, we can build the sitemap file with the proper format.

For demonstration purposes, we will create a string that contains the XML. Then, once you have the XML string, you can serve or store the string as you prefer (for creating a response or storing it in a file to be served with a static approach).

        
      // define the URL prefix with the HTTPS protocol and the hostname
const prefixUrl = 'https://www.example.org'
// loop through the links via the map() function
let sitemap_entries = links.map((link) => {
    // skip the folders
    if (link.is_folder) return ''
    // create the <url> element, with the real_path in <loc>
    return `\n    <url><loc>${prefixUrl}${link.real_path}</loc></url>`
})
// compose the whole XML: the XML declaration, then the urlset root element
let sitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    ${sitemap_entries.join('')}
</urlset>`
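
If you choose to serve the XML string as a response, one possible approach is to wrap it in a standard Response object, which is available globally in Node.js 18+ and accepted by several JavaScript frameworks. This is only a sketch: the helper name sitemapResponse, the sample sitemap, and the one-hour cache duration are assumptions for the example:

```javascript
// Hypothetical helper: wraps the sitemap string in a standard Response
// (the Fetch API Response, global in Node.js 18+ and in browsers).
function sitemapResponse(sitemapXml) {
  return new Response(sitemapXml, {
    status: 200,
    headers: {
      // tell the crawlers that the body is XML
      "Content-Type": "application/xml; charset=utf-8",
      // assumption: let caches keep the sitemap for one hour
      "Cache-Control": "public, max-age=3600",
    },
  });
}

// Sample sitemap string, standing in for the one built from the links
const sampleSitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>https://www.example.org/articles/my-article</loc></url>
</urlset>`;

const response = sitemapResponse(sampleSitemap);
console.log(response.status, response.headers.get("Content-Type"));
```

How you return the Response depends on the framework or server you use, so treat the helper as a starting point rather than a complete route handler.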
    

Filtering the URLs by folder

Suppose you want to create a sitemap for a specific website section, for example, with all the stories stored in the articles folder. In that case, you can use the starts_with parameter, which you pass in the second argument of the getAll() method, next to the version parameter.

For example, if you need to filter the stories by the articles folder:

        
      let links = await Storyblok.getAll('cdn/links', {
    version: 'published',
    starts_with: 'articles/'
})
    

If you have nested folders, you can also set a multi-level path, for example, articles/javascript/:

        
      let links = await Storyblok.getAll('cdn/links', {
    version: 'published',
    starts_with: 'articles/javascript/'
})
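
If you need the links of several sections, for example to build one sitemap per folder, a possible sketch is to repeat the call with a different starts_with value for each folder. The helper getLinksBySection and the stub client below are hypothetical and exist only to make the example runnable without a Storyblok space. With a real StoryblokClient instance, you would pass that instance as the first argument:

```javascript
// Hypothetical helper: retrieves the published links of several folders,
// one getAll() call per folder, and returns them grouped by folder.
// `client` is any object exposing getAll(path, params), such as a
// StoryblokClient instance.
async function getLinksBySection(client, folders) {
  const result = {};
  // sequential calls, to keep the number of parallel requests low
  for (const folder of folders) {
    result[folder] = await client.getAll("cdn/links", {
      version: "published",
      starts_with: `${folder}/`,
    });
  }
  return result;
}

// Stub client used only for this sketch: it mimics starts_with by
// filtering a fixed list of made-up links.
const stubClient = {
  async getAll(path, params) {
    const all = [
      { slug: "articles/javascript/intro", is_folder: false, real_path: "/articles/javascript/intro" },
      { slug: "articles/php/intro", is_folder: false, real_path: "/articles/php/intro" },
      { slug: "products/bike", is_folder: false, real_path: "/products/bike" },
    ];
    return all.filter((link) => link.slug.startsWith(params.starts_with));
  },
};

getLinksBySection(stubClient, ["articles", "products"]).then((sections) => {
  console.log(Object.keys(sections), sections.articles.length);
});
```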
    

Understanding the pagination mechanism

The getAll() method retrieves an array with all the items (for example, all the stories or all the links). It walks through every page for you, so you get all the link objects for the stories and folders that match the version and starts_with parameters without worrying about pagination.

Suppose you need to control the pagination yourself, for example, because you want to retrieve only the first page or to limit the number of entries retrieved. In that case, you can use the paginated parameter.

To manage the pagination manually, you need to use the get() method instead of the getAll() method, and you have to set the paginated parameter to 1. To customize the number of items and define the page you want to retrieve, you can use the per_page and page parameters (in addition to the paginated parameter). The per_page parameter allows you to set the number of entries you want for each page. The default is 25.

The page parameter lets you specify the page number you want to retrieve. By default, the page number is 1.

Hence, the API call with the Storyblok JavaScript Client and the pagination parameters could look like this:

        
      import StoryblokClient from "storyblok-js-client";

const getLinks = async function () {
  const Storyblok = new StoryblokClient({
    accessToken: "your-access-token",
    cache: {
      clear: "auto",
      type: "memory"
    },
  });
  const per_page = 100;
  let currentPage = 1;
  // the links are collected in an object, keyed by the uuid of each entry
  let links = {};

  let thatsAll = false;

  while (!thatsAll) {
    // request one page of published links
    const response = await Storyblok.get("cdn/links/", {
      per_page: per_page,
      page: currentPage,
      paginated: 1,
      version: "published"
    });
    // merge the links of the current page into the result
    links = { ...links, ...response.data.links };
    // stop once the pages retrieved so far cover the total number of items
    thatsAll = response.headers.total <= currentPage * per_page;
    currentPage++;
  }
  return links;
};
    

In the case of the cdn/links endpoint, another difference exists between the get() method and the getAll() method. The getAll() method returns an array of links. Thus, to loop through the links, you can use all the array functions provided by the JavaScript APIs. Instead, the get() method returns a response with some attributes:

  • data: contains the links object. The links object contains all the links, with the uuid of the entry as the key and the link object as the value;
  • headers: the headers of the HTTP response;
  • perPage: the number of items per page (present only in case of pagination);
  • total: the total number of items across all the pages (present only in case of pagination).
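
Because the links object is keyed by uuid, you may want to convert it into an array before looping. The following sketch uses a hand-written sample response shaped like the attributes listed above. The values, including the second uuid, are made up for illustration:

```javascript
// Sample response, shaped like the result of the get() method
// (the values are made up for illustration).
const response = {
  data: {
    links: {
      "07a906b9-d6fa-4d48-a73e-8c9191043590": {
        is_folder: false,
        real_path: "/categories-2/products/productvariants/bike-004-xl",
      },
      "11111111-2222-3333-4444-555555555555": {
        is_folder: true,
        real_path: "/categories-2/products",
      },
    },
  },
  headers: {},
  perPage: 25,
  total: 60,
};

// The links object is keyed by uuid; Object.values() turns it into an array
const linksArray = Object.values(response.data.links);

// Keep only the stories and extract their real paths
const storyPaths = linksArray
  .filter((link) => !link.is_folder)
  .map((link) => link.real_path);

// Number of pages needed to retrieve all the items
const totalPages = Math.ceil(response.total / response.perPage);

console.log(storyPaths, totalPages);
```

With a total of 60 items and 25 items per page, three requests are needed to retrieve everything.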

With the get() method, if you do not set the paginated parameter, the behavior differs between the spaces created before and after the 9th of May, 2023. For the older spaces, the API returns all the links by default. For the newer spaces, the cdn/links endpoint paginates the responses by default and returns pages of 25 items. If you need consistency across old and new spaces, set the paginated parameter explicitly.

If you installed the Translatable Slugs app, you can retrieve the translated links for each story. With the Translatable Slugs app installed, the content editor can define a specific path for each story in each configured language.

From the API perspective, with the application installed, the link object included in the HTTP response has an additional attribute named alternates: an array of alternate links, one for each language configured in the space. For example, the alternates attribute could look like this:

        
      [
  {
    path: 'page-1',
    name: null,
    lang: 'fr',
    published: null,
    translated_slug: 'folder-01/page-1'
  },
  {
    path: 'pagina-1',
    name: 'Prima pagina',
    lang: 'it',
    published: true,
    translated_slug: 'folder-01/pagina-1'
  }
]
    

Therefore, to retrieve the right language, you need to filter the elements with the proper lang attribute. For example, if you want to filter the translated slug in the Italian language:

        
      import StoryblokClient from "storyblok-js-client";
const Storyblok = new StoryblokClient({
  accessToken: "your-access-token",
});
let links = await Storyblok.getAll("cdn/links", {
  version: "published"
});
const prefixUrl = "https://www.example.org";
const langToFilter = "it";
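let sitemap_entries = links.map((link) => {
  // skip the folders
  if (link.is_folder) return "";
  // find the alternate for the language (alternates may be missing)
  let translated_item = (link.alternates || []).find(
    (item) => item.lang === langToFilter,
  );
  // skip the stories without an entry for the language
  if (!translated_item) return "";
  // using the translated_slug
  return `\n    <url><loc>${prefixUrl}/${translated_item.translated_slug}</loc></url>`;
});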
let sitemap = `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    ${sitemap_entries.join("")}
</urlset>`;

console.log(sitemap);
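
If you need the paths for all the configured languages at once, a possible variation is to collect the translated slugs in an object keyed by language code. The helper translatedPaths is hypothetical, and the sample link reuses the alternates shown earlier, with an assumed real_path and the other fields trimmed:

```javascript
// Hypothetical helper: returns the paths of one link for every language
// found in its alternates array, keyed by language code.
function translatedPaths(link) {
  const paths = {};
  for (const alternate of link.alternates || []) {
    paths[alternate.lang] = `/${alternate.translated_slug}`;
  }
  return paths;
}

// Sample link (the real_path value is an assumption for this example)
const sampleLink = {
  is_folder: false,
  real_path: "/folder-01/page-1",
  alternates: [
    { path: "page-1", name: null, lang: "fr", published: null, translated_slug: "folder-01/page-1" },
    { path: "pagina-1", name: "Prima pagina", lang: "it", published: true, translated_slug: "folder-01/pagina-1" },
  ],
};

console.log(translatedPaths(sampleLink));
```

The resulting object makes it straightforward to build one sitemap per language, or to pick a single language by its code.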
    

Managing rate limits

Especially when you perform multiple API calls for the pagination, you need to consider the rate limits of the API.

A rate limit is a mechanism implemented by systems, servers, or APIs to control the rate at which incoming requests or operations are allowed to occur within a certain timeframe. It's essentially a way to prevent an excessive number of requests from overwhelming a system or service, which could lead to performance degradation, instability, or abuse.

The Storyblok rate limit for the Content Delivery API depends on a few factors. The first is caching, that is, whether the request is served from the CDN cache. The second is the size of the page. This is why it is important to balance the size of the page.

The limits for the Content Delivery APIs are listed here: https://www.storyblok.com/docs/api/content-delivery/v2#topics/rate-limit .

If your code exceeds the defined limit, the Storyblok Content Delivery API responds with an error message and a specific HTTP status code (429 Too Many Requests). To act on the page size, you can set the per_page parameter even when you call the getAll() method:

        
      let links = await Storyblok.getAll("cdn/links", {
  version: "published",
  per_page: 25, // set your number
});
    

The other option is the rateLimit parameter, which you set during the StoryblokClient object initialization:

        
      import StoryblokClient from "storyblok-js-client";
const Storyblok = new StoryblokClient({
  accessToken: "your-access-token",
  rateLimit: 10 // set your rate limit
});
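
In addition to these settings, you may want to retry a call that still hits the limit. The following is a minimal sketch of a retry with exponential backoff. The helper withRetry is hypothetical, and it assumes that the rejected error exposes the HTTP status code in a status property, which you should verify for the client version you use:

```javascript
// Hypothetical helper: retries an async call when it fails with HTTP 429,
// waiting longer after each attempt (exponential backoff).
// Assumption: the rejected error exposes the HTTP status as `status`.
async function withRetry(fn, { retries = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      const retryable = error && error.status === 429;
      if (!retryable || attempt >= retries) throw error;
      // 500 ms, 1000 ms, 2000 ms, ... with the default base delay
      const delayMs = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Stand-in for an API call: fails twice with 429, then succeeds
let calls = 0;
async function flakyCall() {
  calls++;
  if (calls < 3) throw { status: 429, message: "Too Many Requests" };
  return ["link-1", "link-2"];
}

withRetry(flakyCall, { baseDelayMs: 10 }).then((links) => {
  console.log(links.length, calls);
});
```

With a real client, the wrapped call could be something like withRetry(() => Storyblok.getAll("cdn/links", { version: "published" })).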
    

Calling the API

In the previous examples, we used the Storyblok JavaScript Client, which is the recommended option if you work with JavaScript. Ultimately, the Client collects the parameters and prepares the URL string to perform an HTTP GET request. For example, if you want to use curl to retrieve the data, you can use something like:

        
      curl -L "https://api.storyblok.com/v2/cdn/links?cv=undefined&paginated=1&per_page=25&page=1&token=pC1yEE4AKy5J4BqdDJN6Ngtt&version=published"
    

As you can see, because the caching mechanism requires the cv parameter, you need the -L option, which tells curl to follow redirects. If you call the Content Delivery API with an invalid value for the cv parameter, you will receive a redirect to the original URL with the correct cv value in place.

Calling the API directly via HTTP requires you to manage all of these cases yourself, which is one of the reasons to use the Storyblok JS client instead of the fetch function in JavaScript.
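
If you still want to call the endpoint directly, a sketch of how the request URL could be assembled is shown below. The helper buildLinksUrl is hypothetical, and the parameter names are the ones that appear in the curl example:

```javascript
// Hypothetical helper: builds the URL of a paginated cdn/links request.
// The parameter names (token, version, paginated, per_page, page, cv)
// are the ones used in the curl example.
function buildLinksUrl({ token, version = "published", page = 1, perPage = 25, cv }) {
  const url = new URL("https://api.storyblok.com/v2/cdn/links");
  url.searchParams.set("token", token);
  url.searchParams.set("version", version);
  url.searchParams.set("paginated", "1");
  url.searchParams.set("per_page", String(perPage));
  url.searchParams.set("page", String(page));
  // cv is the cache version; add it only when a value is known
  if (cv !== undefined) url.searchParams.set("cv", String(cv));
  return url.toString();
}

const url = buildLinksUrl({ token: "your-access-token", page: 2, perPage: 100 });
console.log(url);
```

The resulting string can be passed to fetch(), which follows redirects by default. Reading the pagination headers and looping over the pages remains your responsibility.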

Recap

A good and updated sitemap can improve the SEO performance of your website and can positively impact your search engine rankings. It helps the search engines understand your website's structure, ensuring that all your important pages get indexed.

It can improve the visibility and discoverability of your content because search engines can easily navigate your site through the sitemap. It is difficult for a search engine to navigate through a complex JavaScript menu built for the end user.

It can speed up the indexing of new pages, and search engines detect updates to existing pages quickly, leading to faster indexing of your content.

To prioritize some content, you can set priority levels in the sitemap.xml to indicate the relative importance of different pages, helping search engines understand your content hierarchy.

Lastly and importantly, sitemaps are widely supported by various search engines, making it a standardized way to improve your site's visibility across different platforms.

Author

Roberto Butti

Roberto is a Developer Relations Engineer at Storyblok who loves supporting teams for building websites focused on the code's performance and code quality. His favorite technologies include Laravel, PHP, Vue, Svelte and ... Storyblok. He loves exploring and designing composable architectures.