Tracking the anchor text for the incoming links in Google Tag Manager
Introduction
It's been a long time since I took care of this blog's "Analytics" (in the blacksmith's house, a wooden knife), and I realized it would be cool to know which anchor text the sites linking to me are using.
So I'm sharing the solution I built today to capture the anchor text used on the referring URL and send it back to Google Tag Manager. From there we'll be able to send an event to APP+WEB or to any other place we want :)
How it works
Execution Flow Chart
The flow chart on the right shows how the execution flow works. We'll have 2 main pieces:
- One GTM CUSTOM HTML Tag
- One PHP File
The first one is responsible for the main logic and for making an HTTP request (via fetch) to the second one, which reads the visitor's referrer page and scrapes it to try to find the anchor text that the user clicked.
We're using extensive logic to avoid false positives and duplicate hits, for example when a user navigates back on a mobile phone or swipes. We don't want to count these "page reloads" as landings, even though they may still hold valid referrer info.
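The landing check described above can be sketched in a few lines. This is only an illustration in Python (the real check runs in the browser inside the GTM tag); the names `hostname` and `is_external_landing` are made up for this example, and `navigation_type` is assumed to mimic `performance.navigation.type`, where 0 means a regular navigation.

```python
from urllib.parse import urlparse


def hostname(url):
    """Return the hostname of a URL without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def is_external_landing(referrer, current_url, navigation_type=0):
    """True only for a fresh navigation that arrived from another domain."""
    # 0 = regular navigation; reloads (1) and back/forward moves (2) are skipped
    if navigation_type != 0:
        return False
    ref_host = hostname(referrer)
    cur_host = hostname(current_url)
    # An empty or malformed referrer gives us nothing to scrape
    if not ref_host or not cur_host:
        return False
    # Our own domain must not appear in the referrer's hostname
    return cur_host not in ref_host
```

A referrer on a subdomain of our own site counts as internal here, which is probably what we want for this report.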
Server Side Code
PHP Snippet Code
First we need to upload the following PHP snippet to any server supporting PHP 7.x (because of the use of array literals).
This code can be improved a lot, for example by adding a timeout in case the page is not reachable. If someone asks, I may add more sanity checks to the script.
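<?php
// David Vallejo (@thyngster)
// 2020-04-14
// Needs PHP7.X

if (!isset($_GET["url"])) {
    die("missing url parameter");
}

$links = [];
if (isset($_SERVER["HTTP_REFERER"])) {
    $url = $_GET["url"];
    // Only fetch http(s) URLs, so the endpoint can't be pointed at local files
    $referrer_link_html_content = preg_match('#^https?://#i', $url) ? @file_get_contents($url) : false;
    // Domain of the page calling this endpoint (our own site), without "www."
    $current_domain = str_replace("www.", "", (string) parse_url($_SERVER["HTTP_REFERER"], PHP_URL_HOST));

    // Skip the parsing if the download failed or the domain is unknown
    if ($referrer_link_html_content && $current_domain) {
        // Real-world HTML is rarely valid; keep libxml warnings out of the JSON output
        libxml_use_internal_errors(true);
        $doc = new DOMDocument();
        $doc->loadHTML($referrer_link_html_content);
        libxml_clear_errors();

        // Collect every link whose href contains our domain
        foreach ($doc->getElementsByTagName('a') as $row) {
            if ($row instanceof DOMElement) {
                // preg_quote() escapes the dots in the domain before using it as a pattern
                if (preg_match('/' . preg_quote($current_domain, '/') . '/i', $row->getAttribute('href'))) {
                    $links[] = [
                        "url" => $row->getAttribute('href'),
                        "anchor_text" => $row->textContent
                    ];
                }
            }
        }
    }
}

header('Content-type: application/json; charset=UTF-8');
header("Access-Control-Allow-Origin: *");
echo json_encode($links, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
exit;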
Python Snippet Code
I know this code is not the best since I'm not a Python coder, but it gives an overall idea of how to do the same thing in Python.
It should be used like this:
python anchor.py REFERRER_LINK LINKTOSEARCH
#!/usr/bin/env python3
# use: python anchor.py REFERRER_LINK LINKTOSEARCH
import json
import sys
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

links = []

# First argument: the referring page to download and scrape
if len(sys.argv) > 1:
    url = sys.argv[1]
else:
    print("REFERRER_LINK argument is missing")
    sys.exit(1)

# Second argument: a URL on our site; only its domain is used for matching
if len(sys.argv) > 2:
    link_to_search = sys.argv[2]
else:
    print("LINKTOSEARCH argument is missing")
    sys.exit(1)

domain = urlparse(link_to_search).netloc.replace("www.", "")

headers = {'User-Agent': 'Mozilla/5.0'}
# The timeout keeps the script from hanging on unreachable pages
response = requests.get(url, headers=headers, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Select every <a> whose href contains our domain
for ahref in soup.select('a[href*="' + domain + '"]'):
    links.append({
        "url": ahref.attrs["href"],
        "anchor_text": ahref.text
    })

print(json.dumps(links, sort_keys=True, indent=4, separators=(',', ': ')))
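If you'd rather not depend on requests and BeautifulSoup, the same extraction can be sketched with only the standard library. Take this as a rough sketch and not as the script above: `AnchorCollector` and `find_anchor_texts` are hypothetical names, and the function takes the HTML as a string, so downloading the page (and its timeout) is left to you.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects the href and text of every <a> whose href contains a domain."""

    def __init__(self, domain):
        super().__init__()
        self.domain = domain.lower()
        self.links = []
        self._href = None  # href of the matching <a> we are inside, if any
        self._text = []    # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            # Track this link only when its href contains our domain
            self._href = href if self.domain in href.lower() else None
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append({
                "url": self._href,
                "anchor_text": "".join(self._text).strip(),
            })
            self._href = None


def find_anchor_texts(html, domain):
    """Return [{'url': ..., 'anchor_text': ...}] for links pointing at domain."""
    collector = AnchorCollector(domain)
    collector.feed(html)
    collector.close()
    return collector.links
```

Text inside nested tags (a bold word within the link, for example) is joined into a single anchor text, and the output keeps the same `url` / `anchor_text` keys as the other snippets.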
GTM Custom HTML Code
NOTE: Remember that the following code needs to be added to GTM wrapped between <script> and </script> tags!
Also remember to update the endPointUrl value to the URL where we've uploaded the PHP script.
(function(){
    try{
        var endPointUrl = 'https://domain.com/getLinkInfo.php';
        // We don't want this to run on page reloads or back/forward navigations. Just on real landings
        if (window.performance && window.performance.navigation && window.performance.navigation.type === 0) {
            var referrer = document.referrer;
            var current_url = document.location.href;
            // Returns the hostname of a URL without the "www." prefix
            var grab_hostname_from_url = function(url) {
                var a = document.createElement("a");
                a.href = url;
                return a.hostname.replace('www.', '');
            };
            // Only continue if the current referrer is set to a valid URL
            if (referrer.match(/^(?:http(s)?:\/\/)?[\w.-]+(?:\.[\w\.-]+)+[\w\-\._~:/?#[\]@!\$&'\(\)\*\+,;=.]+$/)) {
                // Only continue if the referrer domain != current domain
                if (grab_hostname_from_url(referrer).indexOf(grab_hostname_from_url(current_url)) === -1) {
                    // Encode the referrer so its own query string survives as a single parameter
                    fetch(endPointUrl + '?url=' + encodeURIComponent(referrer)).then(function(response) {
                        return response.json();
                    }).then(function(json) {
                        window.dataLayer = window.dataLayer || [];
                        json.forEach(function(link) {
                            // Only report the links pointing to the current landing page
                            if (current_url.indexOf(link.url) > -1) {
                                window.dataLayer.push({
                                    event: 'incoming-link',
                                    linked_url: link.url,
                                    landing_url: document.location.href,
                                    referring_url: referrer,
                                    // The endpoint returns the text under the "anchor_text" key
                                    anchor_text: link.anchor_text
                                });
                            }
                        });
                    });
                }
            }
        }
    }catch(e){}
})();
Now we're only one step away from having this working: we need to set up a firing trigger for our tag. Ideally this should be the All Pages trigger, so the tag fires as soon as possible.
Reported Data Info
dataLayer Key | dataLayer Value |
---|---|
event | incoming-link |
linked_url | The href of the matching link found on the referring page |
landing_url | Current page URL (document.location.href) |
referring_url | Full referrer URL (document.referrer) |
anchor_text | The Anchor Text on the referrer page linking to your site |
Caveats
Please note that this solution relies on the current document.referrer, so don't expect it to work for all referrals: some sites strip the full referrer info, like Google SERPs do, and some browsers may trim the referrer down to the origin for privacy reasons.
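One cheap way to spot the trimmed case is to check whether the referrer has any path at all. The helper below is a hypothetical sketch (the name `is_origin_only` is made up for this example). When it returns True, either the link really sits on the referring site's home page or the path was stripped, and in the second case the scraper will most likely find nothing.

```python
from urllib.parse import urlparse


def is_origin_only(referrer):
    """True when the referrer carries no path, query or fragment."""
    parts = urlparse(referrer)
    return (
        bool(parts.netloc)
        and parts.path in ("", "/")
        and not parts.query
        and not parts.fragment
    )
```

You could report these hits with a flag, so the missing anchor texts are easier to explain later.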
It may also happen that the referring URL links to us in more than one place. In that case the scraping endpoint will return all the matching links and anchor texts. From that point on, it's up to you how you report it in Google Analytics or any other tool :D
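As one possible way to handle several matches, the sketch below prefers the links whose URL equals the landing URL and then de-duplicates the anchor texts. It's only an assumption about what you may want to report; `pick_anchor_texts` is a hypothetical name, and the input is the list returned by the scraping endpoint.

```python
def pick_anchor_texts(links, landing_url):
    """Return the distinct anchor texts, preferring exact landing-URL matches."""
    def normalize(url):
        # Ignore a trailing slash so /post and /post/ compare equal
        return url.rstrip("/")

    exact = [link for link in links
             if normalize(link["url"]) == normalize(landing_url)]
    # Fall back to every returned link when none matches the landing URL
    candidates = exact or links

    seen = []
    for link in candidates:
        text = link["anchor_text"].strip()
        # Skip empty anchors (e.g. image links) and repeated texts
        if text and text not in seen:
            seen.append(text)
    return seen
```

Joining the result with a separator would let you send it as a single event parameter, if that suits your reports better.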
In any case, this should work for most common referral traffic.