SEO Audit, Part 1: Crawlability and Indexation

Conducting an SEO audit for your business website doesn’t have to cost a bomb. In fact, leaving aside the time you’ll need to spend actually doing it – which presumably does have some kind of monetary value – it can be done for free.

This guide will not only take you through all the steps you need to perform an SEO audit but also introduce you to free versions of all the tools you need to do it.

NB: This guide is aimed at small business and personal websites. Large sites with hundreds of pages or more warrant forking out for the professional tools. Nevertheless, the methodology is the same and I’ll nod in the direction of the paid tools I’d recommend.

Why perform an SEO audit?

Visibility in the major search engines – particularly Google – can dramatically increase the number of visitors to your website. Whether people are searching for your name, your brand name, or the products or services you provide, being indexed by the search engines and ranking highly for those terms greatly increases the odds of people finding you and clicking through to your site.

The aim of an SEO audit is to identify where your site is falling short and where improvements can be made to help you get indexed and rank higher for the search terms which are important to you.

1. Crawlability and Indexation

If Google et al can’t crawl your site then they can’t index it and there’s no chance of any of your pages showing up in search results. So the first thing we need to do is ensure that all the pages you want to be in Google’s index are either already in Google’s index or are on their way to being indexed.

Free tool: Xenu’s Link Sleuth

Okay, this looks like a pretty complicated way to start, and maybe it is, but Xenu’s Link Sleuth is actually remarkably easy to use and provides data that we’ll need for several parts of this audit. It crawls a website in much the same way that a search engine does, recording information about a page and then following links to other pages to do the same. Once you’ve downloaded it, run it and hit ‘Check URL…’ under the file menu, enter your website’s URL and hit OK.


Link Sleuth will then crawl your website – how long it takes depends on how many pages you have. Upon completion you’ll be told how many broken links were found and asked if you want a report. Hit ‘Yes’ and a webpage will open in your browser which gives details of all the broken links on your site and where to find them.

In the actual program you’ll see a page of results like this.


Save your results for safe keeping, then choose ‘Export to TAB separated file…’ under the File menu. You can open this exported file in Microsoft Excel (or, if we’re keeping things genuinely free, LibreOffice Calc). What you’ll find is a list of all the links that Xenu found on your site and what it found when it followed them. We’ll need to perform a little Excel (or Calc) trickery to extract what we need first: a list of all the pages on your website that it’s possible to navigate to.

Select all and apply filters. Then filter by ‘Type’ so that you only see ‘text/html’ – this provides a list of webpages only, cutting out all the clutter of other filetypes. Now apply a text filter to the ‘Address’ column so that you only see entries which contain your domain – this gets rid of any external links. Now copy what’s left – your website’s pages – to a new sheet and call it ‘Pages’ or something.
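
If you’d rather not do the filtering by hand, the same two filters can be sketched in a few lines of Python. This is only a rough illustration: the ‘Address’ and ‘Type’ column names come from the export described above, while the sample data, the example.com domain and the function name are made up for the example.

```python
import csv
import io
from urllib.parse import urlparse


def internal_html_pages(tsv_text, domain):
    """Return the addresses of text/html rows that belong to `domain`."""
    # QUOTE_NONE stops stray quote marks in page titles confusing the parser.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    pages = []
    for row in reader:
        address = (row.get("Address") or "").strip()
        content_type = (row.get("Type") or "").strip().lower()
        host = (urlparse(address).hostname or "").lower()
        # Keep the bare domain and any subdomain such as www.
        is_internal = host == domain or host.endswith("." + domain)
        is_html = content_type.startswith("text/html")
        if is_html and is_internal and address not in pages:
            pages.append(address)
    return pages


# A tiny made-up export; a real one has more columns and many more rows.
SAMPLE_EXPORT = (
    "Address\tStatus-Code\tType\n"
    "http://example.com/\t200\ttext/html\n"
    "http://example.com/about/\t200\ttext/html\n"
    "http://example.com/logo.png\t200\timage/png\n"
    "http://othersite.org/\t200\ttext/html\n"
    "http://www.example.com/contact/\t200\ttext/html; charset=utf-8\n"
)

pages = internal_html_pages(SAMPLE_EXPORT, "example.com")
for page in pages:
    print(page)
```

To try it on a real crawl, read the exported file’s text and pass it in along with your own domain. Matching on the hostname rather than a plain ‘contains’ test avoids keeping external links that merely mention your domain somewhere in their URL.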

Free tool: Google Webmaster Tools

Google Webmaster Tools, or GWT for short, is a free suite of tools which Google provides for just this purpose. When you first visit GWT you’ll be invited to verify your site and given instructions on how to do so (it’s easier if you’ve already set up Google Analytics). You’ll need to set it up in advance as it takes a few days to populate with your data.

The tool we’re interested in first is the ‘Index Status’ tool under ‘Google Index’ in the menu. The number of pages reported as indexed should be close to the number of entries in the ‘Pages’ sheet we constructed from your Xenu crawl. If it’s much lower then there’s some impediment to your pages being indexed by Google.

Possible causes include:

Robots.txt – this is a file on your server that tells Google what it is and isn’t allowed to crawl. You can see its contents in GWT under ‘Blocked URLs’. A typical robots.txt file for a WordPress site looks like this:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/

Here Google is being told not to crawl two folders: the admin area and WordPress’s core include files. If your robots.txt file contains other ‘Disallow’ rules, check them carefully, as you may be blocking access to large sections of your site. A single ‘Disallow: /’ blocks all of it.

Sitemaps – a sitemap is an XML file which, as the name suggests, provides a map of all the indexable content on your site to search engines. Without a sitemap you’re leaving Google to find your pages by itself. You can generate a sitemap with Xenu (under the file menu ‘Create Google Sitemap File’) and submit it in GWT under ‘Sitemaps’.

Duplicate content – Google will de-index any pages that it thinks are duplicates of other pages. Under ‘HTML Improvements’ in GWT you’ll find a list of pages with duplicate titles and meta descriptions – a clue that those pages may contain identical content. We’ll look at duplicate content again later.

Broken links – links which lead to ‘404 not found’ errors are a sign that your site is badly maintained and maybe not worth Google’s attention. GWT provides a list of your broken links under ‘Crawl Errors’ but it is often out of date. The broken link report that Xenu provided is more likely to be accurate and also shows broken links from two angles (ordered by link and ordered by page), helping you find them and correct or remove them more easily. You can also run the crawl again when you’ve finished your remedial action to check that it worked.

Thin content – since Google introduced its Panda update in 2011 it has removed a lot of pages with ‘thin’ content from its index. Nobody is certain of the precise definition of thin content, but if you have a large number of pages with very little original content on them they might not be getting indexed. More on this later.
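
Of the causes above, the robots.txt one is the easiest to check with a script. Here’s a minimal sketch using Python’s standard urllib.robotparser module, fed with the WordPress rules shown earlier. The page URLs and the function name are invented for illustration, and Python’s parser doesn’t promise to interpret every rule exactly as Google does, so treat the result as a first pass rather than the final word.

```python
from urllib.robotparser import RobotFileParser

# The example WordPress rules from the robots.txt item above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
"""


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that the given robots.txt rules forbid crawling."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Made-up pages; in practice this would be your 'Pages' list.
PAGES = [
    "http://example.com/",
    "http://example.com/about/",
    "http://example.com/wp-admin/options.php",
]

blocked = blocked_urls(ROBOTS_TXT, PAGES)
print(blocked)
```

If any page you actually want indexed turns up in the blocked list, the rule that matches it is the first thing to fix.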

How to find out which of your pages aren’t being indexed

Now that you have a list of your pages generated from the Xenu crawl you can check their indexation status in Google. You can either do this manually, by searching for the page title in Google with the ‘site:’ operator (e.g. site:yourdomain.com [page title]) and seeing if anything shows up, or by using Niels Bosma’s SEO Tools for Excel.
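
If you go the manual route, typing out a ‘site:’ search for every page gets tedious. As a rough sketch, the queries can be built from your page list and turned into search links to open in your browser. The function names, the example page and the choice to wrap the title in quotes are my own assumptions here, not anything Google requires.

```python
from urllib.parse import urlencode, urlparse


def site_query(page_url, page_title):
    """Build a 'site:' query that limits the search to the page's own domain."""
    host = urlparse(page_url).hostname or ""
    return f'site:{host} "{page_title}"'


def google_search_url(query):
    """Turn a query string into a Google search link to open by hand."""
    return "https://www.google.com/search?" + urlencode({"q": query})


# A made-up page for illustration.
query = site_query("http://example.com/about/", "About Us")
print(query)
print(google_search_url(query))
```

This only builds the links; you still open each one yourself and look at whether your page appears in the results.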

Free tool: Niels Bosma’s SEO Tools for Excel

Once you’ve incorporated these tools into Excel by following the instructions on Niels’ site you can check the indexation status of any URL with one of the new formulas which become available to you: GoogleIndexCount, which will return 1 if the URL is indexed and 0 otherwise. This should help you identify whether there is a pattern among the pages that aren’t being indexed, which is essential for working out why they aren’t indexed.
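
One simple way to look for a pattern is to count the unindexed pages by the first folder in their URL. The sketch below assumes you’ve copied your URLs and their 1/0 results out of the spreadsheet into a Python dictionary; the data and function names are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def section_of(url):
    """Return the first folder of the URL path, or '/' for the homepage."""
    segments = [part for part in urlparse(url).path.split("/") if part]
    return "/" + segments[0] if segments else "/"


def unindexed_by_section(index_status):
    """Count unindexed pages per section; index_status maps URL to 1 or 0."""
    return Counter(
        section_of(url) for url, indexed in index_status.items() if not indexed
    )


# Made-up results: 1 means indexed, 0 means not indexed.
STATUS = {
    "http://example.com/": 1,
    "http://example.com/about/": 1,
    "http://example.com/tag/seo/": 0,
    "http://example.com/tag/audit/": 0,
    "http://example.com/blog/first-post/": 1,
    "http://example.com/blog/second-post/": 0,
}

sections = unindexed_by_section(STATUS)
for section, count in sections.most_common():
    print(section, count)
```

If most of the missing pages sit in one section, as the tag pages do in this made-up data, that’s a strong hint to look for a cause specific to that section, such as a robots.txt rule, duplicate titles or thin content.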

By Jamie Griffiths

Jamie Griffiths is a freelance content strategist, copywriter and SEO consultant who lives and works in Hackney, East London.
