Anonymize IP Geolocation Accuracy Impact Assessment
Impact assessment 2.0 available here! (The new version uses a much more accurate methodology.)
The General Data Protection Regulation (GDPR) is fast approaching, and anonymizing IP addresses may well become compulsory for companies that need to comply with data protection regulations. German data protection authorities have already implemented such rules for IP address anonymization. The rest of Europe may follow when GDPR comes into force on 25th May 2018.
Google Analytics' _anonymizeIp function
Google Analytics provides a function to help users comply with such regulations, _anonymizeIp, described as follows:
When a customer of Google Analytics requests IP address anonymization, Google Analytics anonymizes the address as soon as technically feasible at the earliest possible stage of the collection network. The IP anonymization feature in Google Analytics sets the last octet of IPv4 visitor IP addresses and the last 80 bits of IPv6 addresses to zeros in memory shortly after being sent to the Google Analytics Collection Network. The full IP address is never written to disk in this case.
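The masking rule quoted above (zero the last octet of an IPv4 address, zero the last 80 bits of an IPv6 address) can be illustrated with a short Python sketch. This is only an illustration of the rule, not Google's implementation; the function name `anonymize_ip` is our own, and the sample addresses come from the reserved documentation ranges.

```python
import ipaddress


def anonymize_ip(address: str) -> str:
    """Zero the last octet (IPv4) or the last 80 bits (IPv6) of an address."""
    ip = ipaddress.ip_address(address)
    # Bits that survive masking: 24 of 32 for IPv4, 48 of 128 for IPv6.
    keep_bits = 24 if ip.version == 4 else 48
    # strict=False lets ip_network() clear the host bits for us.
    network = ipaddress.ip_network(f"{ip}/{keep_bits}", strict=False)
    return str(network.network_address)


print(anonymize_ip("203.0.113.57"))                          # 203.0.113.0
print(anonymize_ip("2001:db8:85a3:8d3:1319:8a2e:370:7348"))  # 2001:db8:85a3::
```

After masking, all 256 IPv4 addresses in the same /24 block become indistinguishable, which is why finer-grained lookups such as city are the most exposed to error.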
Online sources on this topic typically note that the whole process results in a slight reduction in the accuracy of geographic reporting. But exactly how large is that reduction? We ran a test to find out.
This experiment aims to quantify the reduction in geolocation accuracy under various circumstances:
- Various geographic levels:
  - at continent level
  - at country level
  - at state (for US) or region (for UK) level
  - at city level
- Domestic visitors vs. Overseas visitors
- UK based site vs. US based site
We found that the finer the geographic level you ask for, the larger the impact. At the continent and country level the added inaccuracy is negligible (about 1.5% or less overall). If you’re asking for city level data, prepare for discrepancies of roughly 17% to 20%.
Experiment – Anonymized IP vs. Full IP
The test ran simultaneously on the UK site and the US site of one of our clients from 2017-02-07 to 2017-05-17. Each site had two properties set up in GA: one using the full IP address to identify users’ geographic locations, the other using the partially blocked IP address produced by the _anonymizeIp function.
We start from the assumption that the geolocation identified with a full IP address is 100% accurate, and compare the anonymized IP address version against it to assess the impact of IP address anonymization. It is worth noting, however, that inferring geolocation from an IP address is never fully accurate in practice, due to the nature of IP addresses: Stéphane Hamel’s study shows the precision varies from a few meters to 250 km.
The detailed methodology is available here.
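To make the idea of a per-level discrepancy concrete, here is a minimal Python sketch. It is not the methodology used in this study: it assumes, purely for illustration, that each visit has been looked up twice (once with the full IP, once with the anonymized IP) and that the two results are available as a pair. The `Location` type, the `discrepancy_rates` function and the sample pairs are all hypothetical.

```python
from typing import NamedTuple


class Location(NamedTuple):
    continent: str
    country: str
    region: str
    city: str


def discrepancy_rates(pairs):
    """Share of (full IP, anonymized IP) pairs that disagree, per geographic level."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair of lookups")
    rates = {}
    for index, level in enumerate(Location._fields):
        # The full-IP lookup is treated as ground truth, as in the text above.
        mismatches = sum(1 for full, anon in pairs if full[index] != anon[index])
        rates[level] = mismatches / len(pairs)
    return rates


# Made-up sample data, for illustration only.
sample_pairs = [
    (Location("Europe", "United Kingdom", "England", "London"),
     Location("Europe", "United Kingdom", "England", "London")),
    (Location("Europe", "United Kingdom", "England", "Leeds"),
     Location("Europe", "United Kingdom", "England", "Bradford")),
    (Location("Europe", "United Kingdom", "Scotland", "Glasgow"),
     Location("Europe", "United Kingdom", "England", "Carlisle")),
    (Location("Europe", "France", "Ile-de-France", "Paris"),
     Location("Europe", "Belgium", "Brussels", "Brussels")),
]

rates = discrepancy_rates(sample_pairs)
print(rates)
# {'continent': 0.0, 'country': 0.25, 'region': 0.5, 'city': 0.75}
```

Each level is compared independently here, so a visit attributed to the wrong country also counts as a mismatch at region and city level whenever those values differ.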
UK Anonymized IP Geolocation Identification Discrepancies
In addition, in a few cases, the location attributed using the partially blocked IP address is not even on the same continent. The weighted average discrepancy at the continent level is 0.94%.
As indicated in the table above, for a UK-based site, IP address anonymization is more likely to distort geolocation identification at the city level than at the country and region level. Overall, the weighted average inaccuracy is 1.44% at the country level and 2.38% at the region level, whereas at the city level the weighted average discrepancy widens to 17.00%.
Additionally, whether visitors come from within or outside the UK makes a big difference for a UK-based site. Geolocation accuracy at the country and region level is much more likely to be affected by IP anonymization for overseas visitors than for domestic visitors. As shown in the table, IP anonymization causes only a negligible 0.91% and 1.19% discrepancy at the country and region level for UK-based visitors, while for overseas visitors the discrepancies in country and region attribution widen to 4.36% and 9.01%. However, city attribution for overseas visitors seems to have a smaller discrepancy than for domestic visitors.
US Anonymized IP Geolocation Identification Discrepancies
In addition, in a few cases, the location attributed using the partially blocked IP address is not even on the same continent. The weighted average discrepancy at the continent level is 0.64%.
The US-based site appears to exhibit the same pattern as the UK-based site: IP address anonymization has a much bigger impact on geolocation identification at the city level than at the country and state level. Overall, the weighted average inaccuracy is 1.07% at the country level and 4.03% at the state level, whereas at the city level the weighted average discrepancy widens to 20.33%.
Similar to the UK site, whether visitors are domestic or overseas has a big impact on location attribution. For a US-based site, geolocation accuracy at the country and state level is much more likely to be affected by IP address anonymization for overseas visitors than for domestic visitors. As shown in the table, IP address anonymization causes only a 0.35% and 3.06% discrepancy at the country and state level for US-based visitors, while for overseas visitors the discrepancies in country and state attribution widen to 3.04% and 6.69%. However, at the city level, overseas visitors seem to have a smaller discrepancy than domestic visitors.
UK vs. US
Full IP vs. Anonymized IP Discrepancies:
Overall, location attribution accuracy on the UK site suffered slightly less from IP address anonymization than on the US site.
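The overall, domestic and overseas figures reported for each site can be cross-checked with a little arithmetic. The sketch below assumes that each overall figure is a traffic-weighted average of the domestic and overseas figures. That is our assumption, since the weights are not given in this article. Under that assumption, the country-level figures imply an overseas traffic share, which can then be used to predict the region/state-level overall figure. The function names are hypothetical.

```python
def implied_overseas_share(overall, domestic, overseas):
    """Solve overall = (1 - s) * domestic + s * overseas for the share s."""
    if overseas == domestic:
        raise ValueError("share is undetermined when both segments are equal")
    return (overall - domestic) / (overseas - domestic)


def blended(domestic, overseas, share):
    """Traffic-weighted average of the domestic and overseas discrepancies."""
    return (1 - share) * domestic + share * overseas


# (overall, domestic, overseas) discrepancies in percent, as quoted in the text.
REPORTED = {
    "UK": {"country": (1.44, 0.91, 4.36), "region": (2.38, 1.19, 9.01)},
    "US": {"country": (1.07, 0.35, 3.04), "region": (4.03, 3.06, 6.69)},
}

results = {}
for site, levels in REPORTED.items():
    share = implied_overseas_share(*levels["country"])
    reported, domestic, overseas = levels["region"]
    predicted = blended(domestic, overseas, share)
    results[site] = (share, predicted)
    print(f"{site}: implied overseas share {share:.1%}, "
          f"predicted region/state {predicted:.2f}% vs reported {reported:.2f}%")
# UK: implied overseas share 15.4%, predicted region/state 2.39% vs reported 2.38%
# US: implied overseas share 26.8%, predicted region/state 4.03% vs reported 4.03%
```

The predicted and reported values agree to within rounding, which is consistent with the weighting assumption but does not prove it. It also suggests the US site had a larger share of overseas traffic, which matters when comparing the two sites' overall figures.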
IP anonymization impact on geolocation accuracy:
- Overall: City (17%~21%) > Region/State (2%~4%) > Country (1%) > Continent (0.6%~0.9%)
- Overseas vs. Domestic Visitors:
  - Country & State/Region Level: Overseas visitors > Domestic visitors
  - City Level: Overseas visitors < Domestic visitors
- US vs. UK Site:
  - Country Level: UK > US
  - State/Region & City Level: UK < US (mostly)