Hey kids, most of you have heard about the arrival of Google Analytics Universal.
This new version of Google Analytics is about:

  • revolutionizing analytics measurement with a unified protocol,
  • giving you a better, user-centric view of the customer experience via multiple platforms and devices,
  • giving you access to custom dimensions and metrics,
  • tracking offline activity (although you need *some* connectivity to send data home to the GA mothership)

In this post, I intend to share a technique / proof of concept for on-the-fly measurement of PDF file downloads with Universal Analytics. Without JavaScript.
Please note that n00bs are now strongly advised to leave this page (I can live with this bounce rate!) or continue at the risk of their own mental sanity.

I can haz Universal Analytics, kplzthxnomnomnom?

First of all, if you have looked into this topic, you will have noticed that the new Universal Analytics measurement protocol no longer requests the now infamous __utm.gif pixel.
The new “pixel” used by Universal Analytics is what we API nerds call an “endpoint” (where API queries go to die, hence the name).

In the case of Universal Analytics, our endpoint is now:

  http://www.google-analytics.com/collect

HTTPS version where available :-)

Now we get to add more parameters to that endpoint, such as the hit type, the tracking ID and a visitor ID (they all show up in the finished URL further down).

What about cookie information and assorted random numbers and timestamps, you ask?
Move along, sir, nothing to see here; it’s all taken care of in Google Universal Analytics’s back-end. Which kind of reads funny, come to think of it. Oh well.

Long story short: less taggy, more clicky! (in the admin console).

The problem

But back to our example. The problem with tracking file downloads is that there is no way of knowing whether the download was completed – or not.

I once consulted with a now-defunct Linux distribution publisher. This Linux distribution, like any good “distro”, was made available for download in DVD ISO format – a nice, wholesome, bandwidth-hogging 4.7 GB download. Made your modem sing “youuuu shook me all night looooong!” (<3 Angus)
The problem is that we had no visibility on the actual download status. Even by tagging links with Google Analytics (virtual page view, event), the best we could get was a half-assed download intention rate.

Why? Because unless you look at the server logs, you would be none the wiser, with no knowledge of a potential download interruption due to power failure, act of God, alien invasion, etc.

Today, the size of the Interweb’s series of pipes is significantly higher than it was 20 years ago so we don’t worry so much about smaller files. Anything under a gig’ failed to download properly? “Meh”. Large files such as the aforementioned ISO dump now take an average of 2 hours to download instead of 12 ;-)

But the underlying problem remains: you are still stuck with download intentions.

Solution (yay!)

My suggestion/POC is to replace PDF link tagging with Universal Analytics tracking that sends data to your Google Analytics account – without JavaScript.

First things first: figure out where your web server logfile sits. Any self-respecting Apache web server keeps its logfiles in /var/log/apache2. If you use daily or weekly rotation of logs, good for you, but here we will focus on the current log, which is commonly called access_log (or access.log, depending on your distribution).

In this log, requests for your PDF files will most likely look like this:
127.0.0.1 - - [04/Apr/2013:21:05:24 +0200] "GET /test.pdf HTTP/1.1" 200 72839

Let us concentrate on the end of that line. In the example above, the entry mentions the filename ‘test.pdf’, served with no error (HTTP status code 200), with 72839 bytes (about 72 KB) served.
With a little log injection or Apache log configuration, you can also insert the actual file size.

Example:
127.0.0.1 - - [04/Apr/2013:21:05:24 +0200] "GET /test.pdf HTTP/1.1" 200 72839
becomes
127.0.0.1 - - [04/Apr/2013:21:05:24 +0200] "GET /test.pdf HTTP/1.1" 200 47816 72839
where 47816 is the file size served (what the user actually received) and 72839 is the reference size (what the user should have received).

In the latter case, we are faced with an incomplete download: they don’t match.
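Here is a rough sketch of that comparison in bash. It assumes the extended log format described above, i.e. the last two fields of the line are the bytes served and the reference size, both as plain integers. The function name `download_status` is made up for this example, so adapt it to your own log layout.

```shell
#!/bin/bash

# Hypothetical helper: classify a log line whose LAST TWO fields are
# "bytes served" and "reference size" (assumed extended log format).
download_status() {
  local line=$1
  local served reference

  # Pull out the last two whitespace-separated fields of the line.
  read -r served reference <<< "$(awk '{ print $(NF-1), $NF }' <<< "$line")"

  # Only compare when both fields really are integers.
  if [[ $served =~ ^[0-9]+$ && $reference =~ ^[0-9]+$ ]]; then
    # 10# forces base 10 so a leading zero is not read as octal.
    if (( 10#$served >= 10#$reference )); then
      echo "complete"
    else
      echo "incomplete"
    fi
  else
    echo "unknown"
  fi
}

# Usage with the incomplete download from the example above
download_status '127.0.0.1 - - [04/Apr/2013:21:05:24 +0200] "GET /test.pdf HTTP/1.1" 200 47816 72839'
```

With something like this in place, you could decide to send a hit only for complete downloads, or send a different event for the interrupted ones.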

Now let’s create a small shell script that will scan your log file to find the latest mention of my PDF file.

For this example, I defined a relatively simple regular expression:

GET /(.*\.pdf)

The expression in parentheses is a capture group: it tells the regex engine to keep the matched value (the file path) so I can use it later, especially for constructing the URL of my endpoint.
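To see the capture group at work on its own, here is a small sketch using bash's `=~` operator. The function name `extract_pdf` and the sample log line are made up for illustration; the regex is the one shown above.

```shell
#!/bin/bash

# Same regex as above: the parentheses capture the path of the PDF.
pattern='GET /(.*\.pdf)'

# Hypothetical helper: print the captured PDF path, or fail if there is none.
extract_pdf() {
  if [[ $1 =~ $pattern ]]; then
    # BASH_REMATCH[0] is the whole match, BASH_REMATCH[1] the first group.
    printf '%s\n' "${BASH_REMATCH[1]}"
  else
    return 1
  fi
}

# Usage with an illustrative log line
extract_pdf '127.0.0.1 - - [04/Apr/2013:21:05:24 +0200] "GET /docs/test.pdf HTTP/1.1" 200 72839'
```

Keep in mind that `.*` is greedy: if a line mentions ".pdf" more than once (in a referrer, say), the capture runs up to the last one.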

The finished URL looks like this:

http://www.google-analytics.com/collect
?v=1 // Measurement Protocol version
&t=event // GA hit type (event here)
&tid=UA-7634164-5 // GA tracking ID (web property ID)
&cid=555 // anonymous client (visitor) ID
&dh=juliencoquet.com // hostname
&ec=Downloads // Event - Category
&ea=PDF // Event - Action
&el=$myfile // Event - Label
&ev=1 // Event - Value
&cm3=1 // Custom metric #3 (Downloads), incremented by 1

Pardon my French

As soon as my call to Universal Analytics triggers, data flows into my Google Analytics profile!
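One thing the URL above glosses over: the filename goes into the query string as-is, so a space or an ampersand in a PDF name would break the hit. Below is a minimal sketch of building the same URL with a percent-encoded label. The helpers `urlencode` and `build_hit_url` and the `GA_TID` / `GA_CID` environment variables are my own inventions for this example, and the encoder is only meant for plain ASCII filenames.

```shell
#!/bin/bash

# Hypothetical helper: percent-encode a string for use as a query-string value.
# Sketch only: intended for ASCII input.
urlencode() {
  local s=$1 out='' c i
  local LC_ALL=C   # make character classes and lengths byte-oriented
  for (( i = 0; i < ${#s}; i++ )); do
    c=${s:i:1}
    case $c in
      [a-zA-Z0-9.~_-]) out+=$c ;;                    # unreserved: keep as-is
      *) printf -v c '%%%02X' "'$c"; out+=$c ;;      # everything else: %XX
    esac
  done
  printf '%s\n' "$out"
}

# Hypothetical helper: assemble the event hit URL for one downloaded file.
# GA_TID and GA_CID are assumed overrides; defaults are the values used above.
build_hit_url() {
  local file=$1
  local tid=${GA_TID:-UA-7634164-5}
  local cid=${GA_CID:-555}
  printf 'http://www.google-analytics.com/collect?v=1&t=event&tid=%s&cid=%s&dh=%s&ec=Downloads&ea=PDF&el=%s&ev=1&cm3=1\n' \
    "$tid" "$cid" "juliencoquet.com" "$(urlencode "$file")"
}

# Usage: only prints the URL, nothing is sent
build_hit_url 'my report.pdf'
```

The function just prints the URL, so you can eyeball it before handing it to wget.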

Click to download the complete script tracking PDFs in Universal Analytics
(And yes I also track this download with the same method)

#!/bin/bash

# Define PDF detection regex; the capture group holds the file path
pattern='GET /(.*\.pdf)'

# Follow the log file (-F keeps following across log rotation) and check each new line against the regex
tail -F access_log | grep --line-buffered '\.pdf' | while read -r line; do
  echo "$line"
  if [[ $line =~ $pattern ]]; then
    # PDF filename is stored in BASH_REMATCH[1]
    myfile=${BASH_REMATCH[1]}
    # Trigger for Universal Analytics endpoint, with event tracking, yadda yadda.
    # Only fires when the regex matched; the response pixel is thrown away.
    # Note: $myfile is sent as-is, without URL encoding.
    wget -q -O /dev/null "http://www.google-analytics.com/collect?v=1&t=event&tid=UA-7634164-5&cid=555&dh=juliencoquet.com&ec=Downloads&ea=PDF&el=$myfile&ev=1&cm3=1"
  fi
done

So, with this code, which you run on your web server (ideally launched at server startup), you will send an event to Google Analytics Universal as soon as a new line mentioning a PDF file pops up in your log file.

Pretty neat, huh?

As you can imagine, this code is easily adaptable to the format of your logfile to track size difference between the actual filesize and size served! You can also create different types of hits/events based on the file extension (.doc, .xls, .zip etc.).
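As a sketch of the file-extension idea, the regex can capture the extension in a second group and turn it into the event action. The function name `classify_download`, the list of extensions and the "ACTION path" output format are assumptions picked for this example.

```shell
#!/bin/bash

# Group 1: file path, group 2: extension. The trailing group makes sure the
# extension is followed by a space, "?", a quote or end of line, so that
# "report.docx" is not mistaken for a ".doc" file.
pattern='GET /([^ ?"]*\.(pdf|doc|xls|zip))([ ?"]|$)'

# Hypothetical helper: print "<ACTION> <file>" for a tracked download,
# or return non-zero when the line is not one of the tracked file types.
classify_download() {
  local line=$1 file ext
  if [[ $line =~ $pattern ]]; then
    file=${BASH_REMATCH[1]}
    # Upper-case the extension to get an event action such as PDF or ZIP.
    ext=$(printf '%s' "${BASH_REMATCH[2]}" | tr '[:lower:]' '[:upper:]')
    printf '%s %s\n' "$ext" "$file"
  else
    return 1
  fi
}

# Usage with an illustrative log line
classify_download '127.0.0.1 - - [04/Apr/2013:21:05:24 +0200] "GET /files/archive.zip HTTP/1.1" 200 1024'
```

The first word of the output could then replace the hard-coded `ea=PDF` in the hit, with the second word going into `el`.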

Obviously, this is more a proof of concept than a usable methodology, especially if your PDF files get downloaded a lot and the volume of new logfile lines becomes too much to handle one hit at a time.

As always, constructive comments are welcome!