Eli Schwartz

The root of technical SEO is a deep understanding of a website’s architecture and how Google relates to the pages on the site. In my experience, the best way to gain this level of knowledge on any individual site is through deep log file analysis of Googlebot and user access logs.

Google Search Console reveals some crawl issues, but only at a very high, aggregated level. For a highly trafficked site, it would be nearly impossible to find the specific pages with issues.

On a site with many Googlebot entries, the number of rows will easily overwhelm your computer's RAM if you try to do this in Excel. Just opening the file will slow the system to a crawl, and that is before you even try to run any queries.

Therefore, my favorite tool for this task is Splunk. For the unfamiliar, Splunk is a fantastic big data tool that lets you parse large amounts of data quickly and easily so you can make important decisions. Splunk even has a free version, which allows you to index up to 500MB per day. For many websites, this free version should be more than enough to upload and analyze your access logs.
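
If you're not sure whether your logs fit under that cap, you can estimate your daily log volume from within Splunk once a sample is indexed. This is only a rough sketch (len(_raw) counts characters, which is close enough to the raw size of a typical access log event):

index={name of your index} | eval bytes=len(_raw) | timechart span=1d sum(bytes) as raw_bytes | eval raw_MB=round(raw_bytes/1024/1024,1)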

Here are the top 3 ways I use Splunk to help me with technical SEO efforts.

Find 404 pages generated by a Googlebot visit

404 pages (not found errors) are a wasted visit for every bot or human visitor. Every time a user hits a 404 page instead of the page they meant to see, you miss an opportunity to show them the correct content, and they have a subpar experience with your site. You can always proactively find 404s with a crawling tool like ScreamingFrog, DeepCrawl, or Oncrawl, but if you have a lot of broken URLs, fixing all of them might not be a realistic goal.

Additionally, this doesn't help you find a 404 resulting from an incorrect link on someone else's site. When Googlebot discovers one of these links, it gets sent to a non-existent page.

This is where log parsing becomes very helpful: you can discover URLs returning 404s that are frequently accessed by users and bots, and then choose to either fix them or redirect the traffic to a working page.

Once you have your data imported into Splunk, here’s how you set up the query to find the 404 pages:

  1. First choose your time period. For this type of query, I usually use 30 days, but you can choose whatever you want.
  2. Type the following into the query box.

index={the name of your index} status=404 | top limit=50 uri

Your limit can be whatever you want, but I like to work with 50 URLs for 404 pages to make sure I don't miss any. Once this query completes, click on the statistics tab, and you will see all the URLs you need to urgently address laid out in a table.

Google expects 404 errors on every website, so their existence isn't necessarily an urgent issue. However, some 404 URLs could be the result of an unintended error or a valuable link (internal or external) pointing to the wrong page. Running this analysis will allow you to make an educated decision.
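
To tell whether a 404 comes from a broken internal link or from someone else's site, you can also pull in the referring page. Here's a sketch that assumes your sourcetype extracts a referer field (Splunk's standard access_combined extraction does):

index={name of your index} status=404 | stats count by uri, referer | sort - count | head 50

A 404 with an external referer is usually a candidate for a redirect, while one with an internal referer points to a broken link you can fix at the source.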

Calculate the number of pages crawled by Google every day

If you use Google's Search Console, then you are probably familiar with the screen where Google shows how many URLs they crawl per day. This data may or may not be accurate, but you won't know until you look in your logs to see how many URLs Google actually crawls per day. Finding the daily crawl count is very easy in Splunk once your data is uploaded.

  1. Choose a time period of 30 days (or 7 if you have a lot of data)
  2. Type the following query:

index={name of your index} googlebot | timechart span=1d count

Once the query completes, click on the statistics tab, and you will have the true number of pages crawled by Googlebot each day. For added fun, you can check out the visualization tab and see how this changes over the searched time period.

This is more of an FYI than an urgent fix, but it is helpful to know if Google is picking up new categories on a site or slowing its crawl. If either of these is true, it could be time to dig into the data.
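
One way to dig in is to split the same daily count by response code, which shows whether a change in crawl volume lines up with a spike in errors or redirects. A quick sketch, assuming your sourcetype extracts the status field used in the 404 query above:

index={name of your index} googlebot | timechart span=1d count by status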

Find rogue URLs wasting crawl budget

As most marketers (should) know, Google allots a crawl budget to each site based on its PageRank – not the visible one, but the real one inside the Google black box.

If Googlebot wastes some of your valuable budget on URLs you don't care about, it obviously has less bandwidth to spend on the URLs that matter. Without knowing where Googlebot is spending its time, you can't know if your budget is being used effectively.

Splunk can help you quickly discover all the URLs Googlebot is crawling, which gives you the data to decide what should be added to your robots.txt file.

Here's how you find the URLs that Googlebot is crawling:

  1. Choose your time period. This can be any amount of time, and you should keep trying different time periods to find problematic URLs.
  2. Type in the following query:

index={name of your index} googlebot uri="*" | top limit=20 uri

You can set the limit to whatever you want, but 20 is an easily manageable number. Once the query completes, click on the statistics tab, and you will have a table showing the top URLs that Google is crawling. Now you can make a decision about any pages that should be removed, blocked by a robots.txt file, or noindexed in the head of the page.
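
To see which sections of the site are eating that crawl budget, you can also roll the crawled URLs up to their first path segment. This sketch assumes URLs in the form /section/page and the same uri field as the earlier queries; adjust the field name to match your own extractions:

index={name of your index} googlebot | eval section=mvindex(split(uri,"/"),1) | stats count by section | sort - count

Any section with a high crawl count but little search value is a candidate for a robots.txt disallow or a noindex tag.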

I use Splunk in over a dozen different ways to help me accomplish various SEO tasks, and these are just three of my most common uses.