In this post, I intend to share a technique / proof of concept for on-the-fly measurement of PDF file downloads with Universal Analytics. Without JavaScript.
Please note that n00bs are now strongly advised to leave this page (I can live with this bounce rate!) or continue at the risk of their own mental sanity.

Most of you have heard about the arrival of Google Analytics Universal.
This new version of Google Analytics is about:

  • revolutionizing analytics measurement with a unified protocol,
  • giving you a better, user-centric view of the customer experience via multiple platforms and devices,
  • giving you access to custom dimensions and metrics,
  • tracking offline activity (although you need *some* connectivity to send data home to the GA mothership)

I can haz Universal Analytics, kplzthxnomnomnom?

First of all, if you have looked into this topic, you’ll have noticed that the new Universal Analytics measurement protocol no longer requests the now infamous __utm.gif pixel.
The new “pixel” used by Universal Analytics is what we API nerds call an “endpoint” (where API queries go to die, hence the name).

In the case of Universal Analytics, our endpoint is now:

  https://www.google-analytics.com/collect

Now we get to add parameters to that endpoint, such as the protocol version (v), the hit type (t), the tracking ID (tid) and a client ID (cid).

What about cookie information and assorted random numbers and timestamps, you ask? It’s all taken care of in Google Universal Analytics’ back-end. Which kind of reads funny, come to think of it. Oh well.

Long story short: less taggy, more clicky! (in the admin console).

The problem

But back to our example. The problem with tracking file downloads is that there is no way of knowing whether the download was completed – or not.

I once consulted with a now-defunct Linux distribution publisher. This Linux distribution, like any good “distro”, was made available for download in DVD ISO format – a nice, wholesome, bandwidth-hogging 4.7 GB download. Made your modem sing “youuuu shook me all night looooong!”

The problem is that we had no visibility on the actual download status. Even by tagging links with Google Analytics (virtual page view, event), the best we could get was a half-assed download intention rate.

Why? Because unless you look at the server logs, you would be none the wiser, with no knowledge of a potential download interruption due to power failure, act of God, alien invasion, etc. 🙂

Today, the size of the Interweb’s series of tubes is significantly higher than it was 25 years ago so we don’t worry so much about smaller files. Anything under a gig’ failed to download properly? “Meh”. Large files such as the aforementioned ISO dump now take an average of 2 hours to download instead of 12.

But the problem is not resolved if you are stuck with measuring download intentions.

Solution (yay!)

My suggestion/POC is to replace PDF link tagging with Universal Analytics tracking that sends data to your Google Analytics account – without JavaScript.

First things first: figure out where your web server logfile sits. Any good self-respecting web server has its Apache logfile in /var/log/apache2. If you use daily or weekly log rotation, good for you, but here we will focus on the current log, which is commonly called access_log.

In this log, requests for your PDF files will most likely look like this:

127.0.0.1 - [04/Apr/2013:21:05:24 +0200] "GET /test.pdf HTTP/1.1" 200 72839

Let us concentrate on the end of the log line. In the example above, the entry mentions the filename ‘test.pdf’, served with no error (HTTP status code 200), with 72,839 bytes (roughly 72 KB) served.
With a little log injection or Apache log configuration, you can also insert the actual file size.

Example:

127.0.0.1 - [04/Apr/2013:21:05:24 +0200] "GET /test.pdf HTTP/1.1" 200 72839

becomes:

127.0.0.1 - [04/Apr/2013:21:05:24 +0200] "GET /test.pdf HTTP/1.1" 200 47816 72839

where 47816 is the size served (what the user actually received) and 72839 is the reference size (what the user should have received).

In the latter case, we are faced with an incomplete download: the filesizes don’t match.
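As a rough sketch of that comparison, the function below reads the last two fields of a log line and reports whether they match. It assumes the extended log format from the example (served size, then reference size, as the final two fields); the name `download_status` is made up for illustration.

```bash
#!/bin/bash
# Hypothetical helper: compare the served size with the reference size.
# Assumes the last two whitespace-separated fields of the log line are
# <bytes served> <reference size>, as in the extended example.
download_status() {
  local line=$1
  local served reference
  # awk splits on whitespace; NF is the number of fields on the line
  served=$(printf '%s\n' "$line" | awk '{ print $(NF-1) }')
  reference=$(printf '%s\n' "$line" | awk '{ print $NF }')
  # String comparison, so a "-" placeholder simply counts as a mismatch
  if [ "$served" = "$reference" ]; then
    echo "complete"
  else
    echo "incomplete"
  fi
}

download_status '127.0.0.1 - [04/Apr/2013:21:05:24 +0200] "GET /test.pdf HTTP/1.1" 200 47816 72839'
download_status '127.0.0.1 - [04/Apr/2013:21:05:24 +0200] "GET /test.pdf HTTP/1.1" 200 72839 72839'
```

The first call prints "incomplete" and the second "complete". The result could then feed a different event action or label in the hit sent to Universal Analytics.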

Now let’s create a small shell script that will scan your log file to find the latest mention of my PDF file.

For this example, I defined a relatively simple regular expression:

GET /(.*\.pdf)

The expression in parentheses is a capture group: I retrieve this value and use it later, notably to build the URL for my Universal Analytics endpoint.
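To see the capture in action, here is a minimal sketch that runs the regular expression against one log line in bash. The sample line is illustrative and modeled on the example entry; in the real script the line comes from the log file.

```bash
#!/bin/bash
# Same pattern as in the post; the parentheses form capture group 1.
pattern='GET /(.*\.pdf)'
# Sample log line (illustrative, modeled on the example entry)
line='127.0.0.1 - [04/Apr/2013:21:05:24 +0200] "GET /test.pdf HTTP/1.1" 200 72839'

if [[ $line =~ $pattern ]]; then
  # BASH_REMATCH[0] holds the whole match, BASH_REMATCH[1] the first group
  myfile=${BASH_REMATCH[1]}
  echo "$myfile"
fi
```

This prints test.pdf. A PDF sitting in a subdirectory would be captured with its path (for instance docs/guide.pdf), since the group starts right after the leading slash.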

The finished URL looks like this:

http://www.google-analytics.com/collect
?v=1 // Measurement Protocol version
&t=event // GA hit type (event here)
&tid=UA-7634164-5 // GA tracking ID (property ID)
&cid=555 // anonymous client ID
&dh=juliencoquet.com // hostname
&ec=Downloads // Event - Category
&ea=PDF // Event - Action
&el=$myfile // Event - Label
&ev=1 // Event - Value
&cm3=1 // Custom metric #3 (Downloads), incremented by 1

As soon as my call to Universal Analytics triggers, data flows into my Google Analytics property!
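One thing to watch when dropping a filename into that URL: characters such as spaces, `&` or `+` would break or alter the query string. The sketch below percent-encodes the label before assembling the hit URL. Both function names (`urlencode`, `build_hit_url`) are hypothetical helpers, not part of the original script, and the encoder assumes plain ASCII filenames.

```bash
#!/bin/bash
# Hypothetical helper: percent-encode a string for use in a query string.
# Assumes ASCII input; multi-byte characters would need extra care.
urlencode() {
  local s=$1 out='' c i
  for (( i = 0; i < ${#s}; i++ )); do
    c=${s:i:1}
    case $c in
      [a-zA-Z0-9.~_-]) out+=$c ;;               # unreserved: keep as-is
      *) out+=$(printf '%%%02X' "'$c") ;;       # anything else: %XX
    esac
  done
  printf '%s\n' "$out"
}

# Hypothetical helper: assemble the hit URL from the parameters of the post.
build_hit_url() {
  local file=$1
  printf '%s' "http://www.google-analytics.com/collect?v=1&t=event"
  printf '%s' "&tid=UA-7634164-5&cid=555&dh=juliencoquet.com"
  printf '%s\n' "&ec=Downloads&ea=PDF&el=$(urlencode "$file")&ev=1&cm3=1"
}

build_hit_url 'my report.pdf'
```

The printed URL carries `el=my%20report.pdf`. Request paths in an access log are often already percent-encoded by the browser, in which case the label in your reports would show the encoded form, so adjust to taste.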

Click to download the complete script tracking PDFs in Universal Analytics
(And yes I also track this download with the same method)

#!/bin/bash
# Define PDF detection regex (the escaped dot matches a literal ".")
pattern='GET /(.*\.pdf)'
# Follow the log file as it grows and check each new PDF line against the regex
tail -f access_log | grep --line-buffered '\.pdf' | while read -r line; do
  echo "$line"
  if [[ $line =~ $pattern ]]; then
    # PDF filename is stored in BASH_REMATCH
    myfile=${BASH_REMATCH[1]}
    # Trigger for Universal Analytics endpoint, with event tracking
    # (only fires when the line actually matched, so no stale filename is sent)
    wget -q -O /tmp/pixel.gif "http://www.google-analytics.com/collect?v=1&t=event&tid=UA-7634164-5&cid=555&dh=juliencoquet.com&ec=Downloads&ea=PDF&el=$myfile&ev=1&cm3=1"
  fi
done

So, with this code, which you run on your web server (ideally launched at server startup), you will send an event to Universal Analytics as soon as a new line mentioning a PDF file pops up in your log file.

Pretty neat, huh?

As you can imagine, this code is easily adaptable to the format of your logfile to track size difference between the actual filesize and size served! You can also create different types of hits/events based on the file extension (.doc, .xls, .zip etc.).
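Here is one way the extension-based variant might look, as a sketch rather than a drop-in replacement. The helper name `event_action_for` and the action names are made up for illustration, and the pattern is a generalised version of the PDF one that captures the whole request path.

```bash
#!/bin/bash
# Hypothetical helper: map a requested path to an event action by extension.
# The action names are illustrative, not taken from the original script.
event_action_for() {
  local path=$1
  case "${path##*.}" in
    pdf) echo "PDF" ;;
    doc) echo "DOC" ;;
    xls) echo "XLS" ;;
    zip) echo "ZIP" ;;
    *) return 1 ;;   # not a tracked file type
  esac
}

# Generalised pattern: capture the request path up to the first space.
pattern='GET /([^ ]+)'
line='127.0.0.1 - [04/Apr/2013:21:05:24 +0200] "GET /files/report.xls HTTP/1.1" 200 72839'

if [[ $line =~ $pattern ]] && action=$(event_action_for "${BASH_REMATCH[1]}"); then
  # These two values would go into the ea and el parameters of the hit
  echo "ea=$action el=${BASH_REMATCH[1]}"
fi
```

For the sample line this prints `ea=XLS el=files/report.xls`; lines for untracked extensions are silently skipped because the helper returns a non-zero status.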

This is both a proof of concept and a usable methodology, but do mind that it can become a heavy process, especially if your PDF files are downloaded a lot and the volume of new logfile lines grows too large to process comfortably.

As always, constructive comments are welcome!