Magento indexing issues are often caused by incorrectly configured sites, but much of the problem can lie with poor site data and structure.
Changing content and URL structure requires carefully considered decisions and compromises, but the benefit is that, as well as solving SEO issues, the changes also give site visitors a better experience.
A common example is the number of attributes your anchor categories show. A category with a large number of filterable attributes can be overwhelming for users, and if a visitor is having difficulty, so will a search engine robot trying to make sense of the site. A site with many filterable attributes can generate tens of thousands of attribute permutations as crawlable URL combinations.
To solve this, the number of attributes should be rationalised. Are there any that could be considered duplicates? Are there filters that only return a single product or a small subset of products? Importantly, are these attributes useful to a customer, i.e. do people shop for products by these criteria?
By removing excessive attributes you are not only reducing the number of links on the page but also creating a simpler navigation experience for visitors.
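To get a feel for the scale, here is a rough back-of-the-envelope sketch in Python. It assumes each filter is single-select (an attribute is either unset or set to exactly one option) and ignores parameter order, sorting and pagination, all of which would push the real number higher. The attribute counts are made up for illustration.

```python
from math import prod


def count_filter_urls(options_per_attribute):
    """Count distinct filtered URLs for one category.

    Assumes single-select filters: each attribute is either unset or set
    to exactly one of its options. The unfiltered category page itself
    is excluded from the count.
    """
    combinations = prod(n + 1 for n in options_per_attribute)
    return combinations - 1


# Hypothetical category: six filterable attributes with five options each.
before = count_filter_urls([5, 5, 5, 5, 5, 5])
# The same category after trimming to the three attributes shoppers use.
after = count_filter_urls([5, 5, 5])

print(before, after)
```

Under these assumptions the category drops from 46,655 possible filter URLs to 215, which is why trimming even a few attributes makes such a difference.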
There are multiple approaches to controlling how search engines crawl and index the URLs that remain, and each has its pros and cons, though some have clear advantages over the rest. They are the nofollow rel attribute value, robots.txt, Google Search Console parameters and meta robots tags. We’ll look at each of these in turn.
The nofollow value for the rel attribute on anchor tags was introduced by Google in 2005 to annotate a link as saying “I can’t or don’t want to vouch for this link.” It was quickly picked up as a technique for what became known as PageRank sculpting, where site owners tried to concentrate PageRank on their favoured pages. Around 2008 Google changed how PageRank flows, so the technique no longer provides any real benefit and can in fact be harmful, as it stops PageRank from ultimately being passed back. This post by Inchoo explains why: http://inchoo.net/ecommerce/why-relnofollow-in-ecommerce-menus-is-a-bad-idea/
In general, using nofollow is an “unnatural” technique and should be avoided. As Google engineer Matt Cutts advises: “I pretty much let PageRank flow freely throughout my site, and I’d recommend that you do the same. I don’t add nofollow on my category or my archive pages.” https://www.mattcutts.com/blog/pagerank-sculpting/
Although PageRank is no longer quite the same thing it was back then, the general concept of authority passing through links remains a signal in search engine algorithms and is often referred to as “link juice”.
There are a couple of cases where you may want to use nofollow, such as links to register, sign-in and checkout pages, though these can just as easily be excluded via some of the other techniques discussed later.
robots.txt and Google Search Console parameters
Both robots.txt and Google Search Console parameters share some similar drawbacks and benefits.
These crawl-blocking techniques have the drawback that the URL may still appear in the search index, but with “A description for this result is not available because of this site’s robots.txt – learn more” as its description.
As with rel=nofollow, using robots.txt can create a PageRank dead end.
With robots.txt you can use wildcard patterns to block all parameters (robots.txt supports the * and $ wildcards rather than full regular expressions), but you also need to make sure you do not block pagination, by whitelisting combinations involving the p= parameter. You can take the alternative approach and blacklist individual parameters, but this creates extra maintenance work as your site evolves and new product attributes are added to layered navigation.
Another pitfall arises when you want to remove pages from the index quickly. If you just block them with robots.txt, Google won’t remove them from its index any time soon, because it can’t crawl them to see whether they return a 404, get redirected or carry a robots meta tag such as “noindex, follow”. This is a common mistake.
However, there are a few circumstances where you may consider blocking pages with robots.txt. One is blocking static content, such as a txt or html file shipped with the Magento platform, e.g. README.html. Another is when your server has limited resources and you wish to stop crawlers visiting specific pages such as /catalogsearch/, which tends to place a heavy load on the server.
The other circumstance, and only if it becomes an issue, is when your domain has a large number of products but relatively low domain authority. Such a site may have a low crawl budget, and you may want to conserve what you have. Matt Cutts offers the following explanation of crawl budget: https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
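As a sketch of the whitelist approach, the Python snippet below holds a hypothetical set of rules and evaluates them the way Google documents its matching: * matches any run of characters, a trailing $ anchors the end of the URL, the longest matching rule wins, and Allow wins a tie. The rules and URLs are illustrative assumptions, not a drop-in robots.txt, and other crawlers may interpret wildcards differently.

```python
import re

# Hypothetical rules, written as they would appear in robots.txt:
#   Disallow: /*?        block every URL that has a query string
#   Allow:    /*?p=      ...except those whose first parameter is p=
#   Disallow: /*?p=*&    ...unless p= is combined with other parameters
RULES = [
    ("disallow", "/*?"),
    ("allow", "/*?p="),
    ("disallow", "/*?p=*&"),
]


def rule_matches(pattern, path):
    """Match a robots.txt path pattern using * and trailing $ wildcards."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    regex = "^" + body + ("$" if anchored else "")
    return re.match(regex, path) is not None


def is_crawlable(path, rules=RULES):
    """Longest matching rule wins; Allow wins a tie; no match means allowed."""
    best = None
    for kind, pattern in rules:
        if rule_matches(pattern, path):
            candidate = (len(pattern), kind == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


for path in ["/shoes", "/shoes?p=2", "/shoes?colour=red", "/shoes?p=2&colour=red"]:
    print(path, "crawlable" if is_crawlable(path) else "blocked")
```

With these rules the plain category and its pagination-only URL stay crawlable, while any URL carrying a filter parameter is blocked. Running a handful of real URLs from your own site through a check like this before deploying is a cheap way to catch a rule that accidentally blocks pagination.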
The best way to think about it is that the number of pages that we crawl is roughly proportional to your PageRank. So if you have a lot of incoming links on your root page, we’ll definitely crawl that. Then your root page may link to other pages, and those will get PageRank and we’ll crawl those as well. As you get deeper and deeper in your site, however, PageRank tends to decline.
Another way to think about it is that the low PageRank pages on your site are competing against a much larger pool of pages with the same or higher PageRank. The pages that get linked to often, both internally and externally, tend to get discovered and crawled quite quickly but the lower PageRank pages are likely to be crawled not as often. This can prove a challenge when a site has deep category structures and large changing product catalogs.
In this scenario I’d also configure all your parameters in Google Search Console and explicitly define what you do and do not want crawled.
“noindex, follow” meta robots tag
This is the most “natural” technique and the one I recommend in the majority of cases. It’s the easiest to manage and configure. A number of extensions will add this to your layered navigation and search pages, such as https://paulnrogers.com/mageseo/ and https://www.creare.co.uk/blog/news/creare-seo-magento-extension. The latter is more feature-rich and generally my go-to, but MageSEO has the useful option of setting the meta robots tag on any page URL using regex or URL matching, which is handy when you need extra control over pages from other third-party extensions.
It’s also fairly straightforward for a developer to add the following to your theme’s local.xml:
<?xml version="1.0"?>
<layout>
    <catalog_category_layered>
        <reference name="head">
            <action method="setRobots"><meta>NOINDEX,FOLLOW</meta></action>
        </reference>
    </catalog_category_layered>
    <catalog_category_layered_nochildren>
        <reference name="head">
            <action method="setRobots"><meta>NOINDEX,FOLLOW</meta></action>
        </reference>
    </catalog_category_layered_nochildren>
</layout>
By allowing search engines to crawl the page you are still letting your link juice flow naturally throughout the site. Once set up, this approach will require less attention and work overall, as you won’t need to manage an ongoing list of parameters in your robots.txt as described in the approaches above.
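Once the change is deployed, it is worth confirming that filtered pages really emit the tag. The following is a minimal sketch using only Python’s standard library. The sample markup is hypothetical, and in practice you would feed in the HTML of one of your own layered navigation URLs.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html):
    """Return the set of lower-cased robots directives found in the HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical head of a filtered category page after the layout change.
sample = '<html><head><meta name="robots" content="NOINDEX,FOLLOW" /></head></html>'
found = robots_directives(sample)
print("noindex" in found and "follow" in found)
```

A page that returns an empty set has no robots meta tag at all, which usually means the layout handle did not apply to that URL.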