Why you should tell Google not to index a lot of your page urls
If you have a lot of urls on your site, you should think carefully about which pages Google should index.
I am a co-founder of Epiloge, a platform to connect with others in your field of interest, write about projects and share knowledge. By its very nature, the site has a lot of urls that are dynamically created by users. And like everyone else who runs a website, we want to grow.
Part of growing a user base is SEO — search engine optimization. What good are a lot of urls with interesting content and users, if nobody can find them… and then sign up?
You want Google to index your pages and show them in their search results — and not on page 5 or 17, but preferably on the first page.
At Epiloge, we have public user profiles, organization profiles and projects/articles as well as other uploads. All of them are potential landing pages for users coming from Google search results.
So, after we launched and started promoting Epiloge in early 2020, we made sure Google indexed every page we could think of that should potentially appear in Google searches. That’s what you want to do, right? Every single page indexed by Google can get you traffic, so why not simply maximize page indexing?
Well, the answer to that is actually — no, you shouldn’t index every last page. You want Google to index all relevant pages with relevant content you have, not every last one.
Making sure Google finds all urls for indexing
If you know what SEO is, you probably know that you can get your website urls indexed by Google by not doing anything at all. A single link to your site and a good link structure within your site can get Google to index a lot, if not most, of your urls.
If Google finds one page, it saves all links on it, follows them and ends up finding more links — until it has crawled most of your site.
For websites that have static content only, i.e. urls that rarely change or are only edited or deleted by the website owner, don’t worry about adding a meta-tag for ‘index’ or ‘noindex’. You should focus on having a good link structure for your site so Google can follow all links and crawl your pages.
Things change if you have dynamically changing urls, i.e. new user profiles, added comments or posts and more, which you want Googlebot to discover and index.
After creating a sitemap of all the urls we considered relevant, we uploaded our sitemap.xml to our server, headed over to the Google Search Console and told it we now had an updated sitemap. At that point, our sitemap included over 1500 urls.
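For illustration, here is a minimal Python sketch of how a sitemap like this could be generated from a list of dynamically created urls. The helper name and the example urls are made up for this post, not our actual code.

# Minimal sketch: build a sitemap.xml from a list of dynamically created urls.
# The example urls below are hypothetical, not Epiloge's real url structure.
from xml.etree import ElementTree as ET

def build_sitemap(urls, path="sitemap.xml"):
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for loc in urls:
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = loc
    ET.ElementTree(urlset).write(path, encoding="utf-8", xml_declaration=True)

# Example usage with made-up urls:
build_sitemap([
    "https://epiloge.com/@some-user",
    "https://epiloge.com/some-article-url",
])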
Google started crawling within the next few days, and after a week or so most of our urls had been indexed. Great! Or so we thought.
Search volume and impressions went up, but our ‘site authority’ suffered
Google SEO is a double-edged sword. On the one hand, you want to get indexed and ranked highly; on the other hand, you don’t want Google to think your site has a large number of uninteresting or even nearly empty pages.
We kept updating our sitemap.xml file with our algorithm throughout March and early April and soon reached 2800 urls indexed. The growth was principally due to more users signing up, more projects, articles and other uploads being posted, and more organization profiles being created.
But it dawned on us that a substantial part of these urls, especially a large number of organization profiles, aren’t necessarily a good fit for Google indexing. In addition, around 25% of our users’ profiles, especially those of new users, have only limited content. And lastly, some users published projects and uploads with very brief descriptions, which work fine on their profiles but are not good landing pages from Google search by themselves.
Overall, while our search impressions and inbound traffic went up nicely, we started wondering whether excluding urls that aren’t interesting to outside readers is actually a recommended SEO strategy.
Well, long story short, it is. You can check out “Why the noindex tag is good for SEO if you are careful” or “Want More Traffic? Deindex Your Pages”.
Changing our mindset — adding noindex to all pages that aren’t relevant as landing pages
Google and other search engines have a default setting. Unless you add a meta-tag that includes ‘noindex’ or ‘noindex,nofollow’ to one of your pages, they will crawl and index that page if they deem it relevant.
<!-- this is the default, you don't need to add this to any page -->
<meta name="robots" content="index,follow">

<!-- this will prevent a page from being indexed, but links will still be followed (recommended) -->
<meta name="robots" content="noindex">

<!-- use this if you are sure a page is irrelevant, including its links -->
<meta name="robots" content="noindex,nofollow">
We thought about which pages are actually relevant to be indexed. The criterion for that should be: for an outside user, is this page a relevant landing page?
For user and organization profiles, that means profiles that are interesting to read. For content, a good relevance factor is a certain minimum length. For static pages, ask yourself whether the information is relevant enough for users coming from outside your site.
Epiloge also has a lot of lists of users, followers, explore pages and so on. While those are good pages to crawl to find other urls, they are definitely not something you want as a landing page from Google Search. Nobody googles for the list of followers of a specific user. You want the user profile to be found, not that list.
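To make this concrete, here is a rough Python sketch of the kind of rule we mean. The page types and the 300-character threshold are illustrative assumptions, not our exact criteria.

# Rough sketch: decide which robots meta tag a page should get.
# Page types and the 300-character threshold are illustrative assumptions.
def robots_meta_tag(page_type, content_length):
    # Helper and list pages: let Google follow the links, but don't index the page itself.
    if page_type in ("follower_list", "explore", "search"):
        return '<meta name="robots" content="noindex">'
    # Nearly empty profiles and very short posts make poor landing pages.
    if page_type in ("user_profile", "organization_profile", "article") and content_length < 300:
        return '<meta name="robots" content="noindex">'
    # Everything else keeps the default and gets indexed.
    return '<meta name="robots" content="index,follow">'

print(robots_meta_tag("follower_list", 50))   # noindex
print(robots_meta_tag("user_profile", 1200))  # index,follow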
Google’s crawl budget
There are two key reasons you want to work with ‘noindex’ on your site if you have a lot of urls. The first one is Google’s crawl budget.
If you just started out on your blog or another website, the crawl budget won’t impact you. You will have a handful of pages and Google can just index them and that’s it.
But once you have thousands of urls or even a lot more (especially dynamically created urls such as from pagination), Google’s crawl budget can have an impact.
The more pages you have in relation to your site’s overall relevance, the smaller the available crawl budget.
Now you could say that doesn’t really matter; you just need pages indexed once. But what if your pages are changing a lot as users add information or you yourself change things? You may want Google to actually reindex those pages. And not in a month, but within a few days or a week.
Not to mention that you definitely want to have new content indexed as fast as possible and not weeks or months after it is created.
In other words, not going for a blanket ‘index everything’ strategy, but limiting search engines to indexing only pages that are actually relevant to users who search on Google, can go a long way toward helping with your crawl budget.
Site or domain authority
Epiloge is a small website in comparison to other sites. However, we already have interesting content and, due to the nature of our site, thousands of urls (and potentially millions of url variations due to dynamic url creation for search, explore or list functions).
For years, there has been a debate on whether Google ranks newer websites’ content lower just because such sites don’t have a certain site or domain authority. Domain authority refers to the overall reputation of a single domain — and how it affects standalone content. Short answer: Google doesn’t. Sort of.
While Google doesn’t use the concept of domain authority to rank individual pages (and I can attest to that, as a lot of our urls rank on the first page of Google search), it does have a lot of adjacent metrics that look into similar things to determine the relevance of content.
As you may guess, a site with a lot of irrelevant content, or content users click on but don’t want to read, can influence the ranking of other content on your site. Not because of domain authority, but because of Google’s use of adjacent metrics… in other words, Google sort of uses a soft type of domain or site authority.
To sum up, indexing all your urls on Google is great, indexing only relevant content is greater
Our suggestion, based on our experience, is to take a smart approach to Google indexing.
Make sure that your sitemap.xml includes all relevant content that Google users may be interested in. Have Google read it and index all these pages.
But think hard about whether you have a lot of page urls that users actually should not land on from a Google search, such as lists, helper pages and largely empty profiles. If so, make sure your site includes a noindex meta tag on all of these pages.
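And if you generate your sitemap programmatically, the same relevance rule can be reused so that the sitemap and the noindex tags never contradict each other. This sketch reuses the hypothetical robots_meta_tag and build_sitemap helpers from above; the example pages are made up.

# Sketch: only submit urls that pass the same relevance check used for the meta tag.
# Reuses the hypothetical robots_meta_tag() and build_sitemap() from the sketches above.
all_pages = [
    {"url": "https://epiloge.com/@active-user", "type": "user_profile", "length": 1200},
    {"url": "https://epiloge.com/@new-user/followers", "type": "follower_list", "length": 40},
]

relevant_urls = [
    page["url"]
    for page in all_pages
    if "noindex" not in robots_meta_tag(page["type"], page["length"])
]

build_sitemap(relevant_urls)  # only the user profile ends up in sitemap.xml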