Quickly identify thin & duplicate content for Google Panda

In the age of Google’s Panda algorithm, fixing thin and duplicate content is more important than ever. However, this can be daunting on a big site with millions of pages and potentially thousands of duplicate pages, or pages low on unique content. Just sifting through the site to identify this content is a big task in itself, before you even start addressing it.

To help with some of these challenges, you can now use my free duplicate & thin content checker tool.

The tool is pretty simple to use:

1. Gather a list of URLs to check

You can enter a plain text list, or specify an XML Sitemap file for the tool to pull URLs from. If you’re getting URLs from a Google site: query, URLProfiler has a free SERP scraping tool to help you.
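If you want to prepare the URL list yourself before pasting it in, the two input formats can be read roughly as follows. This is a minimal sketch using only Python's standard library, not the tool's own code; the function names and the example.com URLs are made up for illustration, and it assumes the sitemap uses the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard XML Sitemaps (sitemaps.org protocol).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_text):
    """Return every <loc> value found in a sitemap document.

    Note: for a sitemap index file this returns the child sitemap
    locations rather than page URLs.
    """
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.iter(SITEMAP_NS + "loc"):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


def urls_from_text(text):
    """Return one URL per non-empty line of a plain text list."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical example data.
example_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url><loc>http://www.example.com/article-1</loc></url>
</urlset>"""

print(urls_from_sitemap(example_sitemap))
print(urls_from_text("http://www.example.com/a\n\n  http://www.example.com/b  \n"))
```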

You can also pull a list from the HTML Improvements area of Google’s Webmaster Tools, to quickly check the content Google has flagged as duplicate:

[Screenshot: Webmaster Tools HTML Improvements report for www.mirror.co.uk]

On its own this export is awkward to work with, as it gives you a ‘ | ‘ separated list of URLs in a single column. However, you can just copy & paste this column into the tool and it will split the URLs out for you:

[Screenshot]
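The splitting step is simple enough to reproduce if you ever need it outside the tool. A rough sketch, assuming each pasted cell holds URLs separated by a pipe character with optional surrounding spaces (the function name and sample URLs are hypothetical, and this is not necessarily how the tool does it):

```python
def split_pipe_separated(cell_text):
    """Split pasted ' | ' separated cells into a flat list of URLs."""
    urls = []
    for line in cell_text.splitlines():
        # Each pasted row may hold several URLs joined by '|'.
        for part in line.split("|"):
            url = part.strip()
            if url:
                urls.append(url)
    return urls


# Hypothetical pasted column: two rows, the first holding two URLs.
pasted = (
    "http://www.example.com/a | http://www.example.com/a?page=2\n"
    "http://www.example.com/b"
)
for url in split_pipe_separated(pasted):
    print(url)
```

One caveat: a URL containing an unencoded `|` would be split incorrectly by this approach.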

2. Tell the tool where content lives

If you’re checking articles, you can specify a CSS3 selector to point the tool at your article content. Otherwise it will look at all the content within the <body> tag.

[Screenshot: Google Panda Helper]
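To illustrate the idea of scoping the check to one part of the page, here is a simplified sketch built on Python's standard `html.parser`. It is an assumption-laden stand-in rather than the tool's implementation: it only understands three very basic selector forms (`tag`, `#id` and `.class`), not full CSS3 selectors, and it assumes reasonably well-formed markup where non-void tags are closed. `ContentExtractor` and `extract_text` are names invented for this example.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIPPED_TAGS = {"script", "style"}
INLINE_TAGS = {"a", "abbr", "b", "code", "em", "i", "small", "span",
               "strong", "sub", "sup", "u"}


class ContentExtractor(HTMLParser):
    """Collect text inside the first element matching a simple selector."""

    def __init__(self, selector="body"):
        super().__init__(convert_charrefs=True)
        self.selector = selector
        self.depth = 0           # nesting depth inside the matched element
        self.in_skipped = False  # True while inside <script> or <style>
        self.done = False        # True once the matched element has closed
        self.chunks = []

    def _matches(self, tag, attrs):
        attrs = dict(attrs)
        if self.selector.startswith("#"):
            return attrs.get("id") == self.selector[1:]
        if self.selector.startswith("."):
            return self.selector[1:] in (attrs.get("class") or "").split()
        return tag == self.selector.lower()

    def _boundary(self, tag):
        # Block-level tags separate words; inline tags such as <b> do not.
        if self.depth > 0 and tag not in INLINE_TAGS:
            self.chunks.append(" ")

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self._boundary(tag)
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif self._matches(tag, attrs):
            self.depth = 1
        if tag in SKIPPED_TAGS:
            self.in_skipped = True

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.in_skipped = False
        if self.done or self.depth == 0 or tag in VOID_TAGS:
            return
        self._boundary(tag)
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth > 0 and not self.in_skipped:
            self.chunks.append(data)


def extract_text(html, selector="body"):
    """Return whitespace-normalised text for the selected part of a page."""
    parser = ContentExtractor(selector)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


# Hypothetical page used to compare the two scopes.
page = """<html><head><title>T</title><style>p{}</style></head>
<body><div id="nav">Home | About</div>
<div class="article body"><h1>Headline</h1><p>First <b>paragraph</b>.</p>
<script>var x = 1;</script></div>
<div>Footer</div></body></html>"""

print(extract_text(page, ".article"))  # article content only
print(extract_text(page))              # everything inside <body>
```

Scoping to the article container keeps navigation and footer text out of the comparison, which matters because that shared template text would otherwise inflate word counts on every page.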

3. Run the report

As with most of my SEO tools, the browser gives you real-time feedback while the report runs. When you stop the report, or it finishes, you’ll see a screen similar to the following:

[Screenshot: Google Panda Helper results]

You can then download the CSV into Excel, where you can apply Conditional Formatting > Highlight Cells Rules > Duplicate Values to the Text Hash column to identify duplicate content: pages with identical text share the same hash.
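If you'd rather not open Excel, the same duplicate check can be scripted against the CSV. The sketch below rests on several assumptions: only the "Text Hash" column name comes from the report described here, while the "URL" column name is a guess you may need to adjust to match your export. The `text_hash` helper shows one plausible way such a hash could be produced (SHA-256 over lower-cased, whitespace-normalised text); the hashing method the tool actually uses isn't documented here.

```python
import csv
import hashlib
import io
from collections import defaultdict


def text_hash(text):
    """Hash case- and whitespace-normalised text (algorithm is an assumption)."""
    normalised = " ".join(text.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def duplicate_groups(csv_text, url_column="URL", hash_column="Text Hash"):
    """Group URLs by hash and keep only hashes shared by 2+ URLs."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row[hash_column]].append(row[url_column])
    return {h: urls for h, urls in groups.items() if len(urls) > 1}


# Hypothetical pages used to build a small CSV in the assumed layout.
pages = {
    "http://www.example.com/a": "Same article text.",
    "http://www.example.com/b": "same   article TEXT.",
    "http://www.example.com/c": "A different article.",
}
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["URL", "Text Hash"])
for url, text in pages.items():
    writer.writerow([url, text_hash(text)])

for urls in duplicate_groups(buffer.getvalue()).values():
    print("Duplicate set:", urls)
```

An exact hash only catches pages whose extracted text is identical after normalisation; near-duplicates with small differences will get different hashes.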

The word count also lets you identify thin content, and using Excel’s Conditional Formatting > Color Scales rule you can quickly spot any outliers:

[Screenshot]
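The outlier spotting can also be approximated in code. This is only a sketch of one possible rule: it flags pages whose word count falls below an absolute floor or well below the site's median. Both thresholds (`min_words=200` and `median_ratio=0.25`) are arbitrary illustrative values, not recommendations from Google or from the tool, and the URLs and counts are invented.

```python
import statistics


def thin_pages(word_counts, min_words=200, median_ratio=0.25):
    """Return (url, count) pairs that look thin, lowest count first.

    word_counts maps URL -> word count. A page is flagged when its
    count is below the larger of min_words and median * median_ratio.
    """
    if not word_counts:
        return []
    median = statistics.median(word_counts.values())
    cutoff = max(min_words, median * median_ratio)
    flagged = [(url, n) for url, n in word_counts.items() if n < cutoff]
    return sorted(flagged, key=lambda item: item[1])


# Hypothetical word counts taken from a report export.
counts = {
    "http://www.example.com/a": 850,
    "http://www.example.com/b": 40,
    "http://www.example.com/c": 920,
    "http://www.example.com/d": 150,
    "http://www.example.com/e": 1100,
}
for url, n in thin_pages(counts):
    print(n, url)
```

Comparing against the median rather than a fixed number alone helps on sites where typical page length differs a lot from the default floor.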