Track PDF downloads with Google Universal Analytics – no Javascript!

In this post, I intend to share a technique / proof of concept for on-the-fly measurement of PDF file downloads with Universal Analytics. Without JavaScript.
Please note that n00bs are now strongly advised to leave this page (I can live with this bounce rate!) or continue at the risk of their own mental sanity.

Most of you have heard about the arrival of Google Analytics Universal.
This new version of Google Analytics is about:

  • revolutionizing analytics measurement with a unified protocol,
  • giving you a better, user-centric view of the customer experience via multiple platforms and devices,
  • giving you access to custom dimensions and metrics,
  • tracking offline activity (although you need *some* connectivity to send data home to the GA mothership)

I can haz Universal Analytics, kplzthxnomnomnom?

First of all, if you have looked into this topic, you’ll have noticed that the new Universal Analytics measurement protocol no longer requests the now infamous __utm.gif pixel.
The new “pixel” used by Universal Analytics is what we API nerds call an “endpoint” (where API queries go to die, hence the name).

In the case of Universal Analytics, our endpoint is now:

  https://www.google-analytics.com/collect

Now we get to add parameters to that endpoint, such as the protocol version (v), the hit type (t), the property ID (tid) and the client ID (cid).

What about cookie information and assorted random numbers and timestamps, you ask? It’s all taken care of in Google Universal Analytics’ back-end. Which kind of reads funny, come to think of it. Oh well.

Long story short: less taggy, more clicky! (in the admin console).

The problem

But back to our example. The problem with tracking file downloads is that there is no way of knowing whether the download was completed – or not.

I once consulted with a now-defunct Linux distribution publisher. This Linux distribution, like any good “distro”, was made available for download in DVD ISO format – a nice, wholesome, bandwidth-hogging 4.7 GB download. Made your modem sing “youuuu shook me all night looooong!”

The problem is that we had no visibility on the actual download status. Even by tagging links with Google Analytics (virtual page view, event), the best we could get was a half-assed download intention rate.

Why? Because unless you look at the server logs, you would be none the wiser, with no knowledge of a potential download interruption due to power failure, act of God, alien invasion, etc. 🙂

Today, the size of the Interweb’s series of tubes is significantly higher than it was 25 years ago so we don’t worry so much about smaller files. Anything under a gig’ failed to download properly? “Meh”. Large files such as the aforementioned ISO dump now take an average of 2 hours to download instead of 12.

But the problem is not resolved if you are stuck with measuring download intentions.

Solution (yay!)

My suggestion/POC is to replace PDF link tagging with Universal Analytics tracking that sends data to your Google Analytics account – without Javascript.

First things first: figure out where your web server logfile sits. Any self-respecting Apache server keeps its logfiles in /var/log/apache2. If you rotate your logs daily or weekly, good for you, but here we will focus on the current log, which is commonly called access_log.

In this log, requests for your PDF files will most likely look like this:

127.0.0.1 - - [04/Apr/2013:21:05:24 +0200] "GET /test.pdf HTTP/1.1" 200 72839

Let us concentrate on the tail end of this log entry. In the example above, it mentions the filename ‘test.pdf’, served with no error (HTTP status code 200), and about 72 KB served.
With a little log injection or Apache log configuration, you can also insert the actual file size.

Example:

127.0.0.1 - - [04/Apr/2013:21:05:24 +0200] "GET /test.pdf HTTP/1.1" 200 72839

becomes:

127.0.0.1 - - [04/Apr/2013:21:05:24 +0200] "GET /test.pdf HTTP/1.1" 200 47816 72839

where 47816 is the file size served (what the user has received) and 72839 is the reference size (what the user should have received).

In the latter case, we are faced with an incomplete download: the filesizes don’t match.
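As a minimal sketch of that comparison, assuming the extended log format above where the last two fields of a line are the size served and the reference size. The function name download_status is made up for this illustration.

```shell
#!/bin/bash
# Hypothetical helper: tell complete downloads from incomplete ones.
# Assumes the last two whitespace-separated fields of the log line are
# <bytes served> <reference size>, as in the extended example above.
download_status() {
  local line=$1
  local -a fields
  read -r -a fields <<< "$line"
  local n=${#fields[@]}
  # A line with fewer than two fields cannot be classified
  (( n >= 2 )) || return 1
  local served=${fields[n-2]}
  local expected=${fields[n-1]}
  if [[ $served == "$expected" ]]; then
    echo "complete"
  else
    echo "incomplete ($served of $expected bytes)"
  fi
}

download_status '127.0.0.1 - - [04/Apr/2013:21:05:24 +0200] "GET /test.pdf HTTP/1.1" 200 47816 72839'
```

The result could then decide which event (or which event value) gets sent to Universal Analytics.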

Now let’s create a small shell script that will scan your log file to find the latest mention of my PDF file.

For this example, I defined a relatively simple regular expression:

GET /(.*\.pdf)

The expression in parentheses is a capture group: it lets me retrieve the matched value and use it later, especially for constructing the URL for my Universal Analytics endpoint.
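To see the capture at work in isolation, here is a small sketch using bash's built-in regex matching. The helper name extract_pdf and the sample log lines are assumptions for the demo.

```shell
#!/bin/bash
# The regular expression from the text; the parentheses form capture group 1.
pattern='GET /(.*\.pdf)'

# Hypothetical helper: print the PDF path found in a log line, if any.
extract_pdf() {
  local line=$1
  if [[ $line =~ $pattern ]]; then
    # BASH_REMATCH[0] is the whole match, BASH_REMATCH[1] the first group
    echo "${BASH_REMATCH[1]}"
  else
    return 1
  fi
}

extract_pdf '127.0.0.1 - - [04/Apr/2013:21:05:24 +0200] "GET /test.pdf HTTP/1.1" 200 72839'
# A line without a PDF request prints nothing and returns a non-zero status
extract_pdf '127.0.0.1 - - [04/Apr/2013:21:05:30 +0200] "GET /index.html HTTP/1.1" 200 512' || echo "no PDF here"
```

Note that .* is greedy: if a later part of the line (a referrer, say) also contains ".pdf", the capture will run all the way to it. Replacing .* with [^ ]* is one way to stop at the first space.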

The finished URL looks like this:

http://www.google-analytics.com/collect
?v=1 // Measurement Protocol version
&t=event // GA hit type (event here)
&tid=UA-7634164-5 // GA property (tracking) ID
&cid=555 // anonymous client ID
&dh=juliencoquet.com // hostname
&ec=Downloads // Event - Category
&ea=PDF // Event - Action
&el=$myfile // Event - Label
&ev=1 // Event - Value
&cm3=1 // Custom metric #3 (Downloads), incremented by 1

As soon as my call to Universal Analytics triggers, data flows into my Google Analytics property!
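One thing the URL above glosses over: a file name containing spaces or other special characters should be percent-encoded before it goes into the el parameter. Here is a rough sketch of how the URL could be assembled with that in mind. The helpers urlencode and build_hit_url are my own naming, not part of any GA tooling, and the encoder is only meant for plain ASCII file names.

```shell
#!/bin/bash
# Hypothetical helper: percent-encode a string for use in a query parameter.
# Sketch for ASCII input; anything outside [a-zA-Z0-9.~_-] becomes %XX.
urlencode() {
  local LC_ALL=C
  local s=$1 out='' c i
  for (( i = 0; i < ${#s}; i++ )); do
    c=${s:i:1}
    case $c in
      [a-zA-Z0-9.~_-]) out+=$c ;;
      *) printf -v c '%%%02X' "'$c"; out+=$c ;;
    esac
  done
  printf '%s\n' "$out"
}

# Assemble the hit URL from the parameters listed above.
# Tracking ID, client ID and hostname are the values used in this post.
build_hit_url() {
  local myfile=$1
  printf '%s' "http://www.google-analytics.com/collect"
  printf '%s' "?v=1&t=event&tid=UA-7634164-5&cid=555&dh=juliencoquet.com"
  printf '%s\n' "&ec=Downloads&ea=PDF&el=$(urlencode "$myfile")&ev=1&cm3=1"
}

build_hit_url 'white paper.pdf'
```

The resulting string can be handed to wget or curl as-is.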

Click to download the complete script tracking PDFs in Universal Analytics
(And yes I also track this download with the same method)

#!/bin/bash
# Define PDF detection regex (the group captures the path of the PDF)
pattern='GET /(.*\.pdf)'
# Follow the log file and keep only the lines that mention a PDF
tail -f access_log | grep --line-buffered '\.pdf' | while read -r line; do
  echo "$line"
  if [[ $line =~ $pattern ]]; then
    # PDF filename is stored in BASH_REMATCH
    myfile=${BASH_REMATCH[1]}
    # Trigger the Universal Analytics endpoint, with event tracking
    # (only when the line matched, so no hit goes out with a stale or empty label)
    wget -q -O /tmp/pixel.gif "http://www.google-analytics.com/collect?v=1&t=event&tid=UA-7634164-5&cid=555&dh=juliencoquet.com&ec=Downloads&ea=PDF&el=$myfile&ev=1&cm3=1"
  fi
done

So, with this code, which you run on your web server (ideally launched at server startup), you will send an event to Universal Analytics as soon as a new line mentioning a PDF file pops up in your log file.

Pretty neat, huh?

As you can imagine, this code is easily adaptable to the format of your logfile to track size difference between the actual filesize and size served! You can also create different types of hits/events based on the file extension (.doc, .xls, .zip etc.).
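For the file extension idea, a sketch could look like the following. The function name event_action and the action labels (Word, Excel, Archive, Other) are purely illustrative; pick whatever suits your reports.

```shell
#!/bin/bash
# Hypothetical mapping from file extension to the event action (ea) to send.
event_action() {
  local file=$1
  local ext=${file##*.}   # text after the last dot
  # Lowercase the extension so that .PDF and .pdf are treated alike
  ext=$(printf '%s' "$ext" | tr '[:upper:]' '[:lower:]')
  case $ext in
    pdf)      echo "PDF" ;;
    doc|docx) echo "Word" ;;
    xls|xlsx) echo "Excel" ;;
    zip)      echo "Archive" ;;
    *)        echo "Other" ;;
  esac
}

for f in test.pdf report.XLS distro.zip index.html; do
  echo "$f -> ea=$(event_action "$f")"
done
```

To feed it, the detection regex would also need to accept the extra extensions, for instance GET /(.*\.(pdf|doc|xls|zip)) with the file path still in the first capture group.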

This is both a proof of concept and a usable methodology, but do mind that this can become a heavy process, especially if you serve a lot of PDF files and the volume of new logfile lines becomes cumbersome.

As always, constructive comments are welcome!

Author: Julien Coquet

A web analytics expert for more than 15 years, Julien Coquet is a senior digital analytics consultant and head of product and evangelism for Hub’Scan, a quality assurance solution for analytics tagging.

8 thoughts on “Track PDF downloads with Google Universal Analytics – no Javascript!”

  1. Oh, log files! Can we go back? I am so ready! (Sadly, yes, I am serious.)

    Real comment: Back in ancient times, we also got 206 requests in our logs especially for PDFs. Did the big pipes make this a non-thing now? Or is that a special thing only some servers do? I know it’s a non-concrete question because I haven’t seen a log file in years, but I was just curious.

  2. Weird – I wish I remembered where I saw them now. Thanks for the fast response! I am not at all shocked Hub’Scan has this covered 😉

  3. Jen, that’s hilarious I was thinking the same thing.

    Would repackaging the PDF with some Js in it to send home usage events be easier in these situations?

  4. Many thanks for the post, Julien. You are at the very least pushing the limits of our imagination (who would have thought we would end up going back to those log files!).

    My quick question: How do you handle multiple requests from the same PDF? If I remember well, this is how Acrobat deals with large files directly viewed on the browser (it serves a few pages first, then downloads more as you scroll down). At least JavaScript got rid of that issue!

    Best,

    1. Hola Sergio,

      I was not aware of that mechanism in Adobe Reader. My guess is that if they did that, they’d need to use a 206 code embedded in Reader. Which other “natural” PDF readers like Preview.app (Mac) do not have, to the best of my knowledge. In the case of a 206, we’d get to the point Jennifer was making in the first comment.

      See you at Adobe Summit London? 😉

      Un saludo,

      Julien

  5. Julien,
    I like smart. I like simple. The logfile tail with UA is smart and simple! Props to you for this idea. As you say, a proof of concept so one extension I propose is to manage the identification and appropriate handling of bot requests. Whilst you can add solid bot filtering at a profile level, the UA call needs to be decorated with enough data to grab and deal with bot requests. On that subject, if the log file is sufficiently rich a whole wealth of request meta data also needs to be pumped in via the UA request.

    This has my brain flying along with ideas!

    Regards

    Doug
