Filtering language codes in Google Analytics

Ever looked at a language report in Google Analytics and gotten confused by all the seemingly duplicate values? This post shows you how to rationalize everything – or try to anyway 🙂

Hey do you speak… OMG what is that thing?!

First things first, let’s take a look at the mess that is the Audience > Geo > Language report in Google Analytics:

Lots of values there. They are supposed to follow the ISO 639-1 standard for language codes, which uses a 2-letter code per language, such as en for English and fr for French. That is easy and should be implemented everywhere.

Except in Web browsers, which use extended versions of the 639-1 codes with a region code added. For instance, en-gb is English as spoken in Great Britain. This comes from the specification that Web browsers and servers use to negotiate which content language should be requested and served (the HTTP Accept-Language header). Because fuck standards, am I right?

Assuming you want to restore peace and order to your language reports, here is a quick filter to set everything right again.

Creating the language code filter

Go to your Google Analytics admin console and look for the view on which you want to apply the filter.

Next we want to click “Add filter” (big red button) and start with the following filter definition:

Make sure it’s a custom filter and that it uses advanced mode.

The regular expression in Field A matches the first 2 letters of the language code and captures them; the filter then overwrites the language code field with that captured value. $A1 references whatever was matched by the first set of parentheses in the Field A box. This means the long, extended language codes stop appearing in new data, so your reports normalize over time.
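To see what such a filter does to individual values, here is a minimal Python sketch of the same idea. The pattern `^([a-z]{2})` is an assumption about what goes in Field A based on the description above, and `normalize_language` is a hypothetical helper name, not something Google Analytics provides.

```python
import re

# Assumed Field A pattern: capture the first two letters of the language code.
FIELD_A = re.compile(r"^([a-z]{2})")

def normalize_language(code: str) -> str:
    """Mimic the filter: overwrite the value with the first captured group ($A1)."""
    match = FIELD_A.match(code.strip().lower())
    # When the pattern does not match, the value is left untouched.
    return match.group(1) if match else code

for raw in ["en-gb", "en-us", "fr", "fr-ca", "(not set)"]:
    print(raw, "->", normalize_language(raw))
```

One thing to keep in mind with a two-letter pattern like this: a three-letter code such as fil (Filipino) would be cut down to fi, which is the code for Finnish.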

As with most things Google Analytics, this filter is not retroactive, meaning that for a while you’ll have reports including both the short-form, standard, ISO 639-1-compliant language codes and the ab-cde-wtf nonsense you may have observed in the past.
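Since the historical data keeps its long codes, one option is to collapse them outside of Analytics when working from an exported report. Here is a rough Python sketch, assuming the export has been reduced to (language, sessions) pairs; `collapse_languages` is a hypothetical helper and the sample numbers are made up for illustration.

```python
from collections import Counter

def collapse_languages(rows):
    """Sum sessions per primary language subtag (the part before the first hyphen)."""
    totals = Counter()
    for language, sessions in rows:
        # "en-gb" -> "en"; a plain "fr" stays "fr".
        totals[language.lower().split("-")[0]] += sessions
    return dict(totals)

# Made-up sample rows, not real report data.
sample = [("en-us", 120), ("en-gb", 45), ("en", 10), ("fr-fr", 60), ("fr", 5)]
print(collapse_languages(sample))
```

Splitting on the hyphen rather than cutting at two characters keeps three-letter codes intact, at the cost of behaving slightly differently from a two-letter filter.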

In closing

You might not use your Audience > Geo > Language report any more often after this.

Don’t get me wrong: the language code data will be clean and easily readable. However, it is not necessarily going to be more useful. Why? Because the report only includes the language setting for the browser, not the language setting for the content being viewed.

Arguably, the language code report could tell me if I have, say, lots of users with Chinese language browsers looking at my content, which I know is primarily in English and French. But I still have no idea whether they’re using the Google Translate extension in Chrome for instance. Should I start writing in Mandarin? 🙂

At any rate, start cross-referencing filtered language code reports with content language reports. You will start getting more “insights” there 🙂

What about you? Do you look at your browser language reports? Do you use them as they are or do you clean them up outside of Analytics?

Let me know in the comments!

Author: Julien Coquet

A web analytics expert for more than 15 years, Julien Coquet is a senior digital analytics consultant and head of product and evangelism for Hub’Scan, a quality assurance solution for analytics tagging.

One thought on “Filtering language codes in Google Analytics”

  1. Had a look around this report last week and thought “what a bloody mess”. Will implement this filter for sure to stop seeing this mixed bag of codes and claim some structure.
