The spam referrals problem in Google Analytics is turning into the new “not provided”. Almost everyone I follow has posted, retweeted or said something about it in the past month. I didn’t want to miss this opportunity to bring back some of the scripts I wrote for myself over the past year and share them with everyone. So in this post I’m going to try to address the spam problem from another perspective, one that is not based on referral lists and filters.
As most of you probably know, this problem comes from the Google Analytics Measurement Protocol not using any authentication mechanism. And since websites are tracked client-side, it’s almost impossible to hide any protection mechanism unless we force users to be logged in to some service, which is not an option because we still need a way to track anonymous users. They’re using the Measurement Protocol to spam us? OK, so let’s play that game too and use the Measurement Protocol to protect our data from being spammed =)
Some years ago I wrote some PHP scripts to track ecommerce transactions server-side and avoid client-side problems such as users blocking GA, JavaScript errors causing ga to fail, users’ network problems, etc.
So I updated that code to work with Universal Analytics to achieve the following goals:
- Our real UA number will be totally hidden from others, i.e. there will be no way for anyone to find out our real property number.
- We’ll attach a string to our hostname parameter (&dh), and we’ll use it to filter our new view, which stops spammers that loop through all UA-XXXXX-Y numbers to send hits.
- As we’re going to proxy all our ga hits through our server-side script, other people could load that PHP file directly, so we’re using PHP sessions to reject any request to that file when there is no open session for that user.
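To make the first two goals more concrete, here is a minimal sketch of the rewrite a proxy could apply to each hit payload. It is written in JavaScript only for illustration (the real script is PHP), the `rewritePayload` helper is hypothetical and not taken from the ga-proxy repository, and I’m assuming the hash is simply appended to the end of the hostname.

```javascript
// Hypothetical helper, not part of the ga-proxy repository.
// Takes the query string sent by analytics.js and returns the one
// that would be forwarded to the real property.
function rewritePayload(payload, realPropertyId, filterHash) {
  const params = new URLSearchParams(payload);

  // Swap the public (decoy) property for the real one, which only
  // the server knows.
  params.set('tid', realPropertyId);

  // Tag the hostname so a view filter can recognise proxied hits.
  // Assumption: the hash is appended to the end of the hostname.
  const host = params.get('dh') || '';
  params.set('dh', host + filterHash);

  return params.toString();
}

const rewritten = rewritePayload(
  'v=1&t=pageview&tid=UA-XXXXXX-Y&cid=555&dh=example.com&dp=%2Fhome',
  'UA-AAAAAAA-B',
  'dontspamme'
);
console.log(rewritten);
```

The page only ever exposes UA-XXXXXX-Y, so a spammer copying the tag from the source code ends up hitting the decoy property.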
With Google Analytics Classic, or the first Measurement Protocol releases, tracking every hit server-side was a problem because we couldn’t send the real users’ IP addresses (so we lost all the geolocation data) or their user agents (although that one could be worked around by forcing the request’s User-Agent header). For some months now this is no longer a problem, as we can send those values as parameters in the hit payload.
The first thing we need to do is tell Google Analytics to send a copy of each hit to our own PHP tracking file (the one that takes care of proxying the hits to the real UA). This can be achieved using “Tasks”, so our Universal Analytics tag will look similar to:
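The two parameters in question are `uip` (IP override) and `ua` (user agent override) from the Measurement Protocol. As a rough sketch of how a proxy could attach them, assuming it already knows the visitor’s IP and User-Agent header from the incoming request (the `withClientOverrides` helper is a name I made up for this example, and it’s JavaScript just for illustration):

```javascript
// Hypothetical helper: adds the visitor's IP and user agent to a hit
// payload so Google Analytics attributes the hit to the visitor and
// not to the server that forwards it.
function withClientOverrides(payload, clientIp, userAgent) {
  const params = new URLSearchParams(payload);
  params.set('uip', clientIp);  // Measurement Protocol: IP override
  params.set('ua', userAgent);  // Measurement Protocol: user agent override
  return params.toString();
}

const forwarded = withClientOverrides(
  'v=1&t=pageview&tid=UA-AAAAAAA-B&cid=555',
  '203.0.113.42',                       // example address (documentation range)
  'Mozilla/5.0 (X11; Linux x86_64)'     // example User-Agent header value
);
console.log(forwarded);
```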
ga('create', 'UA-XXXXXX-Y', 'auto');
ga(function(tracker) {
    var originalSendHitTask = tracker.get('sendHitTask');
    tracker.set('sendHitTask', function(model) {
        var payLoad = model.get('hitPayload');
        originalSendHitTask(model);
        var i = new Image(1, 1);
        i.src = "/collect.php" + "?" + payLoad;
        i.onload = function() { return; };
    });
});
ga('send', 'pageview');
If you are interested in how this piece of code works, you can read this other post (it’s in Spanish, sorry …)
Now we’ll need to install our server-side tracking code. It has just 2 files: gaproxy.class.php and collect.php. Upload the collect.php file to an accessible path on your site and make sure the path in the code above matches it.
You can grab the files from the following GitHub Repo: https://github.com/thyngster/ga-proxy
Now that we have everything in place, let’s configure it. First, set your own variables in the class file; it should be pretty straightforward:
// Configuration Start

// Set your real Property Number where you want to redirect the data
private $property_id = 'UA-AAAAAAA-B';

// This will be attached to the hostname value, so we can then filter any hit not coming from this script
private $filterHash = 'dontspamme';

// Set this to true if you want to remove the IP's last octet
private $anonymizeIp = true;

// Configuration End
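For reference, this is roughly what the `$anonymizeIp` option means. The sketch below is JavaScript for illustration only, the `anonymizeIpv4` name is hypothetical, and it assumes that “removing” the last octet means replacing it with 0; the actual class may do it differently.

```javascript
// Hypothetical helper illustrating last-octet anonymization.
// Assumption: the last octet is replaced with 0.
function anonymizeIpv4(ip) {
  const octets = ip.split('.');
  // Leave anything that isn't a dotted IPv4 address untouched.
  if (octets.length !== 4) {
    return ip;
  }
  octets[3] = '0';
  return octets.join('.');
}

console.log(anonymizeIpv4('203.0.113.42'));
```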
Now we’ll need to add this code to all our pages (in the header) so that it creates the session, which is used to prevent collect.php from being called directly.
<?php
include_once('gaproxy.class.php');
$ga = new GaProxy();
$ga->setupProxy();
?>
We’re finished! Now if collect.php is called directly without any page having been loaded first, or by a script that doesn’t accept and keep cookies, the session will not be active and our file won’t send those requests to the Google Analytics endpoint.
Let’s recap what we did:
- We told analytics.js to forward a copy of the hits to our local collect.php
- We added 3 lines of php into our pages, so a session is started and a tracking token is set.
- We configured the gaproxy.class.php file with our real UA number, and added a little hash so we can filter for the real hits in our property.
Now we’ll need to configure 2 filters in our views:
- One excluding all hits where the field hostname does not include our filterHash.
- One replacing our filterHash by an empty string.
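If it helps to picture what those two filters do to the incoming hits, here is a small simulation in JavaScript. This is not how you set them up (that’s done in the view’s filter settings); the `applyViewFilters` function and the sample hits are made up for the example.

```javascript
// Hypothetical simulation of the two view filters, in order:
// 1) drop hits whose hostname does not contain the hash,
// 2) remove the hash from the hostname of the remaining hits.
function applyViewFilters(hits, filterHash) {
  return hits
    .filter(function (hit) {
      return hit.hostname.includes(filterHash);
    })
    .map(function (hit) {
      const cleanHost = hit.hostname.split(filterHash).join('');
      return Object.assign({}, hit, { hostname: cleanHost });
    });
}

const sampleHits = [
  { hostname: 'example.comdontspamme', page: '/home' },  // sent through the proxy
  { hostname: 'spammy-site.example', page: '/' },        // sent straight to GA by a spammer
  { hostname: 'example.com', page: '/home' }             // right hostname, but no hash
];

console.log(applyViewFilters(sampleHits, 'dontspamme'));
```

Only the first hit survives, and it is reported under the clean hostname. A spammer who guesses the real hostname still gets filtered out, as long as the hash stays on the server.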
As you may see in the code, there’s a reserved function that will be used to check the requesting IPs against blacklists, maybe with some throttling mechanism, or to check the referrers against a blacklist that can be updated automatically, etc. This is still in progress, as I need to think about the best way to implement these features, so any suggestion is really welcome. I’m planning to port it to other languages such as Python or Ruby too, but I’d like to have a more polished PHP version before that.
I know this workaround will not be accessible to everyone, and it may take more effort to get running than a plain filter, but it has more benefits and you won’t need to keep your filters updated for each new spammer that starts to mess around with your account.