The IAB Spiders & Robots enrichment uses the IAB/ABC International Spiders and Bots List to determine whether an event was produced by a user or by a robot/spider, based on its IP address and user agent.
Spiders & bots are sometimes considered a necessary evil of the web. We want search engine crawlers to find our site, but we also don’t want a lot of non-human traffic clouding our reporting.
The Interactive Advertising Bureau (IAB) is an advertising business organization that develops industry standards, conducts research, and provides legal support for the online advertising industry.
Their internationally recognized list of spiders and bots is regularly maintained to identify the IP addresses and user agents of known bots and spiders.
There are three fields that can be added to the parameters section of the enrichment configuration JSON:
Each field corresponds to one of the IAB/ABC database files and must have two inner fields:
database field containing the name of the database file.
uri field containing the URI of the bucket in which the database file is found. This field supports
The table below describes the three types of database fields:
|Field name||Database description||Database filename|
|Blacklist of IP addresses considered to be robots or spiders|
|Blacklist of useragent strings considered to be robots or spiders|
|Whitelist of useragent strings considered to be browsers|
All three of these fields must be added to the enrichment JSON, as the IAB lookup process uses all three databases in order to detect robots and spiders. Note that the database files are commercial and proprietary and should not be stored publicly – for instance, on unprotected HTTPS or in a public S3 bucket.
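The shape described above (three database fields, each with a database and a uri inner field) can be checked with a small script before deploying the configuration. This is a minimal sketch, not part of the enrichment itself: the function name validate_parameters, the outer field names, the filenames, and the bucket URI in the example are all hypothetical placeholders.

```python
from typing import Any, Dict, List

# Inner fields that every database entry must carry, per the text above.
REQUIRED_INNER_FIELDS = ("database", "uri")


def validate_parameters(parameters: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a 'parameters' section (empty if none)."""
    problems: List[str] = []
    # The lookup needs all three databases, so expect exactly three entries.
    if len(parameters) != 3:
        problems.append(f"expected 3 database fields, found {len(parameters)}")
    for name, value in parameters.items():
        if not isinstance(value, dict):
            problems.append(f"{name}: expected an object")
            continue
        for inner in REQUIRED_INNER_FIELDS:
            # Each inner field must be a non-empty string.
            if not isinstance(value.get(inner), str) or not value[inner]:
                problems.append(f"{name}: missing or empty '{inner}'")
    return problems


# Hypothetical example: the outer names, filenames and bucket are placeholders,
# not the real field names or IAB filenames.
example_parameters = {
    "placeholder_ip_field": {"database": "ip-list.txt", "uri": "s3://my-private-bucket/iab"},
    "placeholder_ua_blacklist_field": {"database": "ua-exclude.txt", "uri": "s3://my-private-bucket/iab"},
    "placeholder_ua_whitelist_field": {"database": "ua-include.txt", "uri": "s3://my-private-bucket/iab"},
}

if __name__ == "__main__":
    print(validate_parameters(example_parameters))
```

The check is deliberately agnostic about the outer field names, so it only verifies the structure and cannot tell whether the names are ones the enrichment recognises.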
This enrichment uses the following fields of a Snowplow event:
useragent to determine an event’s user agent, which will be validated against the databases described in
user_ipaddress to determine an event’s IP address, which will be validated against the database described in
derived_tstamp to determine an event’s datetime. Some entries in the Spiders & Robots List can be considered “stale”, and will be given a
ACTIVE_SPIDER_OR_ROBOT based on their age.
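To illustrate how the three event fields and the three databases could fit together, here is a rough sketch of a lookup. It is an assumption about the general approach, not the IAB client's actual algorithm: the matching rules (substring match on user agents, CIDR match on IPs), the order of checks, and the names UaEntry, Verdict, check_event and inactive_since are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from ipaddress import ip_address, ip_network
from typing import List, Optional


@dataclass
class UaEntry:
    pattern: str                                # substring matched case-insensitively
    inactive_since: Optional[datetime] = None   # hypothetical staleness marker


@dataclass
class Verdict:
    spider_or_robot: bool
    stale: bool = False   # True when the matching entry was already inactive


def check_event(
    useragent: str,
    user_ipaddress: str,
    derived_tstamp: datetime,
    ip_blacklist: List[str],
    ua_blacklist: List[UaEntry],
    ua_whitelist: List[str],
) -> Verdict:
    # 1. IP blacklist: flag the event if its address falls in any listed network.
    addr = ip_address(user_ipaddress)
    if any(addr in ip_network(net) for net in ip_blacklist):
        return Verdict(True)

    ua = useragent.lower()

    # 2. User agent blacklist: flag on a match; the event time decides staleness.
    for entry in ua_blacklist:
        if entry.pattern.lower() in ua:
            stale = (
                entry.inactive_since is not None
                and derived_tstamp >= entry.inactive_since
            )
            return Verdict(True, stale)

    # 3. User agent whitelist: anything that is not a known browser is flagged.
    if not any(pattern.lower() in ua for pattern in ua_whitelist):
        return Verdict(True)

    return Verdict(False)
```

For example, with an IP blacklist of ["203.0.113.0/24"] and a whitelist of ["mozilla"], an event from 198.51.100.1 with a Firefox user agent would not be flagged under these assumed rules, while the same user agent from 203.0.113.9 would be.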
This enrichment adds a new context to the enriched event with this schema.