Track PDF downloads with Google Universal Analytics – no Javascript!

Hey kids, most of you have heard about the arrival of Google Analytics Universal.
This new version of Google Analytics is about:

  • revolutionizing analytics measurement with a unified protocol,
  • giving you a better, user-centric view of the customer experience via multiple platforms and devices,
  • giving you access to custom dimensions and metrics,
  • tracking offline activity (although you need *some* connectivity to send data home to the GA mothership)

In this post, I intend to share a technique / proof of concept for on-the-fly measurement of PDF file downloads with Universal Analytics. Without JavaScript.
Please note that n00bs are now strongly advised to leave this page (I can live with this bounce rate!) or continue at the risk of their own mental sanity.

I can haz Universal Analytics, kplzthxnomnomnom?

First of all, if you have looked into this topic, you’ll have noticed that the new Universal Analytics measurement protocol no longer requests the now infamous __utm.gif pixel.
The new “pixel” used by Universal Analytics is what we API nerds call an “endpoint” (where API queries go to die, hence the name).

In the case of Universal Analytics, our endpoint is now:

  http://www.google-analytics.com/collect

HTTPS version where available :-)

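Now we get to add parameters to that endpoint, such as the hit type (t), the tracking ID (tid) and an anonymous visitor ID (cid); the full annotated URL appears further down.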

What about cookie information and assorted random numbers and timestamps, you ask?
Move along, sir, nothing to see here; it’s all taken care of in Google Universal Analytics’ back-end. Which kind of reads funny, come to think of it. Oh well.

Long story short: less taggy, more clicky! (in the admin console).

The problem

But back to our example. The problem with tracking file downloads is that there is no way of knowing whether the download was completed – or not.

I once consulted with a now-defunct Linux distribution publisher. This Linux distribution, like any good “distro”, was made available for download in DVD ISO format – a nice, wholesome, bandwidth-hogging 4.7 GB download. Made your modem sing “youuuu shook me all night looooong!” (<3 Angus)
The problem is that we had no visibility on the actual download status. Even by tagging links with Google Analytics (virtual page view, event), the best we could get was a half-assed download intention rate.

Why? Because unless you look at the server logs, you would be none the wiser, with no knowledge of a potential download interruption due to power failure, act of God, alien invasion, etc.

Today, the Interweb’s series of pipes is significantly wider than it was 20 years ago, so we don’t worry so much about smaller files. Anything under a gig’ failed to download properly? “Meh”. Large files such as the aforementioned ISO dump now take an average of 2 hours to download instead of 12 ;-)

But the problem is not solved: you are still stuck with download intentions.

Solution (yay!)

My suggestion/POC is to replace PDF link tagging with Universal Analytics tracking that sends data to your Google Analytics account – without Javascript.

First things first: figure out where your web server logfile sits. Any good self-respecting web server has its Apache logfile in /var/log/apache2. If you use daily or weekly rotation of logs, good for you, but here we will focus on the current log, which is commonly called access_log.

In this log, calls to your PDF files will most likely look as follows:
127.0.0.1 - - [04/Apr/2013:21:05:24 +0200] "GET /test.pdf HTTP/1.1" 200 72839

Let us concentrate on the tail end of the log line. In the example above, the entry mentions the filename ‘test.pdf’, served with no error (HTTP status code 200), with 72839 bytes (roughly 72 KB) served.
With a little log injection or Apache log configuration, you can also insert the file’s reference size next to the size served.

Example:
127.0.0.1 - - [04/Apr/2013:21:05:24 +0200] "GET /test.pdf HTTP/1.1" 200 72839
becomes
127.0.0.1 - - [04/Apr/2013:21:05:24 +0200] "GET /test.pdf HTTP/1.1" 200 47816 72839
where 47816 is the file size served (that the user has received) and 72839 is the reference size (the one that the user should have received)

In the latter case, we are faced with an incomplete download: the two sizes don’t match.
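
To make the size comparison concrete, here is a minimal sketch in bash. It assumes the modified log format shown above, where the last two fields of each line are the size served and the reference size; the function name download_status is made up for illustration, and a stock access log (with a single size field) would not work with it as-is.

```bash
#!/bin/bash
# Sketch: classify a log line whose LAST TWO fields are
# "bytes served" and "reference size" (assumption: modified log format).
download_status() {
  local line=$1
  local -a fields
  # Split the line on whitespace into an array
  read -r -a fields <<< "$line"
  local n=${#fields[@]}
  if (( n < 2 )); then
    echo "unknown"
    return 1
  fi
  local served=${fields[n-2]}
  local expected=${fields[n-1]}
  # Both fields must be plain integers, otherwise we cannot decide
  if [[ ! $served =~ ^[0-9]+$ || ! $expected =~ ^[0-9]+$ ]]; then
    echo "unknown"
    return 1
  fi
  # 10# forces base 10 so a leading zero is not read as octal
  if (( 10#$served >= 10#$expected )); then
    echo "complete"
  else
    echo "incomplete"
  fi
}

# Example with the incomplete download from the article
download_status '127.0.0.1 - - [04/Apr/2013:21:05:24 +0200] "GET /test.pdf HTTP/1.1" 200 47816 72839'
```

With the sample line, the function prints "incomplete", since 47816 is smaller than 72839. The result could then feed a different event label or value in the hit you send to Universal Analytics.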

Now let’s create a small shell script that will scan your log file to find the latest mention of my PDF file.

For this example, I defined a relatively simple regular expression:

GET /(.*\.pdf)

The expression in parentheses is a capture group: it lets me retrieve the matched file name and use it later, especially for constructing the URL of my endpoint.
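
Here is a small sketch of how that capture works in bash, using the regular expression above with the `[[ ... =~ ... ]]` operator. The function name extract_pdf and the sample log lines are only there for illustration.

```bash
#!/bin/bash
# Sketch: pull the PDF path out of a log line with the article's regex.
extract_pdf() {
  local pattern='GET /(.*\.pdf)'
  if [[ $1 =~ $pattern ]]; then
    # BASH_REMATCH[0] is the whole match, [1] is the first capture group
    printf '%s\n' "${BASH_REMATCH[1]}"
  else
    # No PDF request on this line
    return 1
  fi
}

extract_pdf '127.0.0.1 - - [04/Apr/2013:21:05:24 +0200] "GET /test.pdf HTTP/1.1" 200 72839'
```

The call prints test.pdf. One thing to keep in mind: `.*` is greedy, so if your log format also records a referrer that ends in .pdf, the capture may run further along the line than you expect, and a tighter pattern would be needed.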

The finished URL looks like this:

http://www.google-analytics.com/collect
?v=1 // Measurement Protocol version
&t=event // GA hit type (event here)
&tid=UA-7634164-5 // GA tracking ID (web property)
&cid=555 // anonymous visitor ID
&dh=juliencoquet.com // hostname
&ec=Downloads // Event - Category
&ea=PDF // Event - Action
&el=$myfile // Event - Label
&ev=1 // Event - Value
&cm3=1 // Custom metric #3 (Downloads), incremented by 1


As soon as my call to Universal Analytics triggers, data flows into my Google Analytics profile!
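
One detail worth a sketch: file names containing spaces, ampersands and the like should be percent-encoded before they go into the el parameter, otherwise they can break the query string. The helpers below (urlencode and build_hit_url are names I made up) only build the URL and send nothing; the tracking ID and hostname are the ones from the example above, so swap in your own.

```bash
#!/bin/bash
# Sketch: percent-encode a string, byte by byte, in pure bash.
urlencode() {
  local LC_ALL=C   # treat the string as raw bytes
  local s=$1 out='' c i
  for (( i = 0; i < ${#s}; i++ )); do
    c=${s:i:1}
    case $c in
      # Unreserved characters are kept as-is
      [a-zA-Z0-9.~_-]) out+=$c ;;
      # Everything else becomes %XX (hex value of the byte)
      *) printf -v c '%%%02X' "'$c"; out+=$c ;;
    esac
  done
  printf '%s\n' "$out"
}

# Sketch: assemble the event hit URL for a given file name.
build_hit_url() {
  local label
  label=$(urlencode "$1")
  printf 'http://www.google-analytics.com/collect?v=1&t=event&tid=%s&cid=%s&dh=%s&ec=Downloads&ea=PDF&el=%s&ev=1&cm3=1\n' \
    "UA-7634164-5" "555" "juliencoquet.com" "$label"
}

build_hit_url 'my file.pdf'
```

For 'my file.pdf', the label ends up as my%20file.pdf in the URL. The resulting string can be handed to wget or curl exactly like the hard-coded URL in the script.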

Click to download the complete script tracking PDFs in Universal Analytics
(And yes I also track this download with the same method)

#!/bin/bash

# Define PDF detection regex; the capture group holds the requested file
pattern='GET /(.*\.pdf)'

# Follow the log file as it grows and check each new PDF line against the regex
tail -f access_log | grep --line-buffered '\.pdf' | while read -r line; do
  echo "$line"
  if [[ $line =~ $pattern ]]; then
    # PDF filename is stored in BASH_REMATCH
    myfile=${BASH_REMATCH[1]}
    # Trigger for Universal Analytics endpoint, with event tracking, yadda yadda
    # (only fired when the regex matched, so no stale filename is ever sent)
    wget -q -O /tmp/pixel.gif "http://www.google-analytics.com/collect?v=1&t=event&tid=UA-7634164-5&cid=555&dh=juliencoquet.com&ec=Downloads&ea=PDF&el=$myfile&ev=1&cm3=1"
  fi
done

So, with this code, which you run on your web server (ideally launched at server startup), you will send an event to Google Universal Analytics as soon as a new line mentioning a PDF file pops up in your log file.

Pretty neat, huh?

As you can imagine, this code is easily adaptable to the format of your logfile to track size difference between the actual filesize and size served! You can also create different types of hits/events based on the file extension (.doc, .xls, .zip etc.).
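
As a rough sketch of that last idea, the regular expression can be widened to a handful of extensions, with the extension itself reused as the event action. The pattern and the event_action_for name below are my own illustration, not part of the original script; the trailing `[ ?]` is there so that, for instance, .docx is not mistaken for .doc.

```bash
#!/bin/bash
# Hypothetical pattern: group 1 = file path, group 2 = extension.
# The path stops at the first space or "?", which must follow the extension.
pattern='GET /([^ ?]*\.(pdf|doc|xls|zip))[ ?]'

# Sketch: print an event action (PDF, DOC, XLS, ZIP) for a log line.
event_action_for() {
  local line=$1
  if [[ $line =~ $pattern ]]; then
    # Upper-case the extension to mirror the "PDF" action used earlier
    printf '%s\n' "${BASH_REMATCH[2]}" | tr '[:lower:]' '[:upper:]'
  else
    # Not a tracked file type
    return 1
  fi
}

event_action_for '127.0.0.1 - - [04/Apr/2013:21:05:24 +0200] "GET /files/report.xls HTTP/1.1" 200 1024'
```

The sample call prints XLS. In the main loop, the value could replace the hard-coded ea=PDF, while `${BASH_REMATCH[1]}` still provides the label.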

Obviously, this is more a proof of concept than a usable methodology, especially if your PDF files are downloaded heavily and the volume of new logfile lines becomes too much for a simple shell loop to keep up with.

As always, constructive comments are welcome!

Track PDF downloads with Google Universal Analytics – no Javascript! was last modified: August 28th, 2013 by Julien Coquet

Julien Coquet

A web analytics expert for more than 15 years, Julien Coquet is a senior digital analytics consultant and head of product and evangelism for Hub’Scan, a quality assurance solution for analytics tagging.

About Julien Coquet


8 thoughts on “Track PDF downloads with Google Universal Analytics – no Javascript!”

  1. Jen

    Oh, log files! Can we go back? I am so ready! (Sadly, yes, I am serious.)

    Real comment: Back in ancient times, we also got 206 requests in our logs especially for PDFs. Did the big pipes make this a non-thing now? Or is that a special thing only some servers do? I know it’s a non-concrete question because I haven’t seen a log file in years, but I was just curious.

  2. Jen

    Weird – I wish I remembered where I saw them now. Thanks for the fast response! I am not at all shocked Hub’Scan has this covered 😉

  3. Kevin

    Jen, that’s hilarious I was thinking the same thing.

    Would repackaging the PDF with some Js in it to send home usage events be easier in these situations?

  4. Sergio Maldonado

    Many thanks for the post, Julien. You are at the very least pushing the limits of our imagination (who would have thought we would end up going back to those log files!).

    My quick question: How do you handle multiple requests from the same PDF? If I remember well, this is how Acrobat deals with large files directly viewed on the browser (it serves a few pages first, then downloads more as you scroll down). At least JavaScript got rid of that issue!

    Best,

    1. Julien Coquet Post author

      Hola Sergio,

      I was not aware of that mechanism in Adobe Reader. My guess is that if they did that, they’d need to use a 206 code embedded in Reader. Which other “natural” PDF readers like Preview.app (Mac) do not have, to the best of my knowledge. In the case of a 206, we’d get to the point Jennifer was making in the first comment.

      See you at Adobe Summit London? 😉

      Un saludo,

      Julien

  5. Doug Hall

    Julien,
    I like smart. I like simple. The logfile tail with UA is smart and simple! Props to you for this idea. As you say, a proof of concept so one extension I propose is to manage the identification and appropriate handling of bot requests. Whilst you can add solid bot filtering at a profile level, the UA call needs to be decorated with enough data to grab and deal with bot requests. On that subject, if the log file is sufficiently rich a whole wealth of request meta data also needs to be pumped in via the UA request.

    This has my brain flying along with ideas!

    Regards

    Doug
