Handling Large Multi-Language Sitemaps with PHP/Laravel
This article describes a resource-efficient approach to generating sitemaps for 60 million URLs. The same ideas can be applied to CSV files, log files, or any other large files you need to produce.
I've been working on hobby projects in my free time for a long time. During this journey, I've learned tons of stuff, and I'm still learning. It also helps me handle big data at my company, where we're dealing with a 100TB database and 100 million requests per day. So we need every optimization we can get.
If you're thinking, "what the hell is a sitemap?", let me answer that first.
Sitemaps are XML files that list all of your pages as a map for search engine crawlers. Crawlers visit your sitemap files to get the list of URLs on your website, which saves them a lot of resources while exploring your pages. It's a kind of integration between your website and the crawlers. If you want to learn more about sitemaps, you can visit this link.
This content is about ideas, not direct implementation. If you’re looking for something like “How to create a sitemap with Laravel?” spatie/laravel-sitemap may help you.
When you start researching how to build a sitemap with Laravel, you'll come across a few packages, including spatie/laravel-sitemap. It helps you create sitemaps quickly, with support for sitemap indexes. At first glance, it looks like it can create sitemaps with tons of items. But it can't, not at this scale. You need some memory, query, and structure optimizations when you want to handle 3 million pages with 20 language variants each. That means 60 million URLs to handle.
Let's look at what we need.
- Every sitemap file can include at most 50,000 URLs and must be smaller than 50MB. (Google has some information about this here.)
- Every sitemap needs to include the language variants of the pages. (Google has a reference for the implementation; see the example entry after this list.)
- The sitemaps need to stay up to date, including newly created and updated content.
- Creating or updating sitemap files when content is created or updated shouldn't take too much of the server's time or resources.
- Only indexable content should be in the sitemap. If you unpublish content, it needs to be removed from the sitemap.
- The site is available in 20 different languages, so each content variant must be linked as an alternate in the sitemaps.
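For reference, a single URL entry with its language alternates looks roughly like this in the sitemap XML (the domain, paths, and hreflang values below are just placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://example.com/en/some-page</loc>
    <lastmod>2023-01-15</lastmod>
    <xhtml:link rel="alternate" hreflang="en" href="https://example.com/en/some-page"/>
    <xhtml:link rel="alternate" hreflang="de" href="https://example.com/de/some-page"/>
    <xhtml:link rel="alternate" hreflang="fr" href="https://example.com/fr/some-page"/>
    <!-- ...one xhtml:link per language, 20 in total for this site -->
  </url>
</urlset>
```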
Let me show you the slug models first. Slugs are the URL paths reserved by their related models, with one slug per language for each piece of content. The database structure of the slugs and the other models shows how I handle multi-language content and URLs.
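The original schema diagram isn't reproduced here, but a simplified sketch of a polymorphic, per-language slugs table could look like this (table and column names are assumptions, not my exact schema):

```php
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

// Simplified sketch: one slug row per model per language.
Schema::create('slugs', function (Blueprint $table) {
    $table->id();
    $table->morphs('model');      // model_type + model_id (Blog Post, Category, User Profile, ...)
    $table->string('locale', 5);  // one row per language variant
    $table->string('slug');
    $table->timestamps();

    $table->unique(['locale', 'slug']);
});
```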
The first method I used
When implementing this logic in my CMS, I predicted that this subject would need a deep-dive inspection and custom strategies. Based on this idea, I developed a command that uses queues to automatically generate sitemaps and sitemap indexes from the underlying database. But it was never enough.
All models, their tables, and their relations have proper indexes on the MySQL side.
Steps for the current method:
- Get the total model count.
- Split the total count into a number of loops using a limit.
- For each loop, get limited models from the database.
- Process each model and add it to the sitemap.
- Write sitemaps and sitemap index.
- Upload them to the storage server. (I'm using Amazon S3 as a storage server.)
Here's an example from my code:
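The real command is longer and dispatches queued jobs, but a condensed sketch of the loop looks like this (MyModel, the relation names, and addToSitemap() are placeholders, not my actual code):

```php
// Condensed sketch of the first method's loop, inside the artisan command.
$perSitemap = 50000;
$total = MyModel::where('is_indexable', true)->count();
$loops = (int) ceil($total / $perSitemap);

for ($i = 0; $i < $loops; $i++) {
    // Full rows (no select() optimization) plus eager-loaded relations,
    // 50k models at a time.
    $models = MyModel::with(['slugs', 'translations', 'category'])
        ->where('is_indexable', true)
        ->orderBy('id')
        ->offset($i * $perSitemap)
        ->limit($perSitemap)
        ->get();

    // Hypothetical helper: build sitemap-{$i}.xml from this chunk and upload it to S3.
    $this->addToSitemap($models, $i);
}
```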
In this case, as you can see, the query seems optimized. But there are a few problems hiding underneath.
- There is no select field optimization, which creates network overhead because of the size of the transferred data. With 1,000 models it's not a problem, but when you're pulling 50k records from MySQL, it causes considerable slowness.
- As you can see, there are eager-loaded relationships. That also looks useful, but consider 50k models times at least two more related objects each, plus 20 translations. In my case, MyModel carries 23 related model objects. This consumes an excessive amount of memory and resources.
- Loading 50k records into memory was an issue in itself. I just wasn't affected by it until I reached that scale.
- The spatie/laravel-sitemap package didn't let me write like a stream, meaning I couldn't fetch 5k records per loop and keep appending to the file until I reached the sitemap limit.
Here's an example of the laravel-sitemap package implementation:
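A minimal sketch of how the package is used to build a sitemap and its index; the model attributes and relations used here (url, updated_at, translations) are placeholders:

```php
use Spatie\Sitemap\Sitemap;
use Spatie\Sitemap\SitemapIndex;
use Spatie\Sitemap\Tags\Url;

// $models is one chunk of models fetched by the loop above.
$sitemap = Sitemap::create();

foreach ($models as $model) {
    $url = Url::create($model->url)
        ->setLastModificationDate($model->updated_at);

    // One alternate link per language variant.
    foreach ($model->translations as $translation) {
        $url->addAlternate($translation->url, $translation->locale);
    }

    $sitemap->add($url);
}

$sitemap->writeToFile(public_path('sitemaps/sitemap-1.xml'));

// The sitemap index simply lists every generated sitemap file.
SitemapIndex::create()
    ->add('/sitemaps/sitemap-1.xml')
    ->writeToFile(public_path('sitemap.xml'));
```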
With this package, you can't open a sitemap whenever you want, write items to it as a stream, and then close it. So a chunked loop with a counter like the one below doesn't really solve the problem:
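Here's a sketch of that kind of counter-based loop (MyModel and the file paths are placeholders):

```php
use Spatie\Sitemap\Sitemap;
use Spatie\Sitemap\Tags\Url;

// Even though we only fetch 5k records per query, every Url object keeps
// accumulating inside the Sitemap instance until writeToFile() is called.
$sitemap = Sitemap::create();
$processedItems = 0;
$fileIndex = 1;

MyModel::where('is_indexable', true)
    ->chunkById(5000, function ($models) use (&$sitemap, &$processedItems, &$fileIndex) {
        foreach ($models as $model) {
            $sitemap->add(Url::create($model->url));
            $processedItems++;

            if ($processedItems === 50000) {
                // Only now do the 50k buffered items hit the disk.
                $sitemap->writeToFile(public_path("sitemaps/sitemap-{$fileIndex}.xml"));
                $sitemap = Sitemap::create();
                $processedItems = 0;
                $fileIndex++;
            }
        }
    });

// Write whatever is left over after the last full file.
if ($processedItems > 0) {
    $sitemap->writeToFile(public_path("sitemaps/sitemap-{$fileIndex}.xml"));
}
```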
As you can see, the items keep piling up inside the sitemap object until the processedItems counter reaches the 50,000 limit; they all stay in memory until then. Instead, we could stream the items into the XML file as we process them, part by part, with a 5k loop limit.
The second method that I've implemented
With this approach, we can no longer use the Spatie package, because we need to be able to open and close our sitemap files whenever it's required.
This method lets you fetch small chunks of a huge model set and write them into the same sitemap file until you reach the limit. If you have limited RAM, this is the way to keep memory usage under control.
Here's an example implementation:
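The actual command is more involved (queued jobs, sitemap index generation, cleanup), but here's a condensed sketch of the streaming idea. MyModel, the slugs relation, the url accessors, and the "s3" disk name are placeholders:

```php
use Illuminate\Support\Facades\Storage;

class StreamedSitemapWriter
{
    private $handle;
    private int $fileIndex = 0;
    private int $itemsInFile = 0;
    private array $writtenFiles = [];

    public function generate(): void
    {
        $this->openNewFile();

        MyModel::query()
            ->select(['id', 'updated_at'])              // only the columns we need
            ->with(['slugs:id,model_id,locale,slug'])   // constrained eager load
            ->where('is_indexable', true)
            ->chunkById(5000, function ($models) {
                foreach ($models as $model) {
                    $this->writeUrl($model);
                }
            });

        $this->closeCurrentFile();
        // A sitemap index referencing $this->writtenFiles would be written here.
    }

    private function writeUrl($model): void
    {
        // Roll over to a new file once the current one holds 50k URLs.
        if ($this->itemsInFile >= 50000) {
            $this->closeCurrentFile();
            $this->openNewFile();
        }

        // $model->url and $slug->url are hypothetical accessors that build
        // the absolute, localized URL for the page.
        $entry = '<url><loc>' . e($model->url) . '</loc>';
        $entry .= '<lastmod>' . $model->updated_at->toAtomString() . '</lastmod>';

        foreach ($model->slugs as $slug) {
            $entry .= '<xhtml:link rel="alternate" hreflang="' . $slug->locale . '" href="' . e($slug->url) . '"/>';
        }

        fwrite($this->handle, $entry . '</url>');
        $this->itemsInFile++;
    }

    private function openNewFile(): void
    {
        $this->fileIndex++;
        $this->handle = fopen(storage_path("app/sitemap-{$this->fileIndex}.xml"), 'w');

        fwrite($this->handle, '<?xml version="1.0" encoding="UTF-8"?>');
        fwrite($this->handle, '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml">');
    }

    private function closeCurrentFile(): void
    {
        fwrite($this->handle, '</urlset>');
        fclose($this->handle);

        $file = "sitemap-{$this->fileIndex}.xml";

        // Upload the finished file to S3 and remember it for the sitemap index.
        Storage::disk('s3')->put("sitemaps/{$file}", file_get_contents(storage_path("app/{$file}")));
        $this->writtenFiles[] = $file;
        $this->itemsInFile = 0;
    }
}
```

With something like this, only one 5k chunk of hydrated models plus a single open file handle live in memory at any given time.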
This approach increased the speed and decreased resource usage. But… yeah, there's another but…
We're still joining 5–6 tables to build the sitemaps from the models. We could speed this up further by writing raw queries instead of using the Eloquent query builder. But I don't want to spend time on that for now; this method already lets me handle 4 million URLs. One day I'll need another optimization, and I'm ready for it. I know this method can still be improved, and I have another idea. Let's dig a bit deeper…
The third method that I'm going to implement
In both approaches, I have many models that can add their URLs to the sitemap: Slug models, Blog Post, Blog Category, User Profile, and so on.
I'm going to create another model, something like SitemapURL, with fields like model, url, alternates, and updated_at. Every model that wants its URLs in the sitemap will have this relation, and the relation will be updated on every CRUD action of the model. Observer classes will probably help me with that.
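Since this part is only planned, here's a rough sketch of how that table and its observer could look; all names, fields, and helper methods are tentative:

```php
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

// Tentative migration: one denormalized row per indexable page.
Schema::create('sitemap_urls', function (Blueprint $table) {
    $table->id();
    $table->morphs('model');      // which model (Blog Post, Category, ...) owns this URL
    $table->string('url');
    $table->json('alternates');   // [locale => url] pairs for the 20 languages
    $table->timestamps();

    $table->index('updated_at');  // so sitemaps can be ordered by freshness
});

// Tentative observer, registered for every sitemap-aware model. It assumes
// hypothetical sitemapUrl(), isIndexable(), and alternateUrls() helpers.
class SitemapUrlObserver
{
    public function saved($model): void
    {
        if (! $model->isIndexable()) {
            // Unpublished content leaves the sitemap immediately.
            $model->sitemapUrl()->delete();

            return;
        }

        $model->sitemapUrl()->updateOrCreate([], [
            'url'        => $model->url,
            'alternates' => $model->alternateUrls(),
        ]);
    }

    public function deleted($model): void
    {
        $model->sitemapUrl()->delete();
    }
}
```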
Then I'll read all sitemap URLs from that single table, which will let me order URLs across models by their updated_at field.
This is another way to optimize how search engines crawl updated or recently added pages: the crawlers will encounter those URLs first when reading the XML file(s).
There is a myth that if you put the recently added or updated URLs at the top of the sitemap, the crawler bots will pick them up first.
I hope you enjoyed the article. If you have different ideas, let's meet in the comments section.
Thanks for your time.
By the way, if you're interested in being part of a team that handles 100 million requests per day, you can join us through the links below.