First and last touch attribution models in SQL [tutorial]¶
In order to calculate the return on marketing spend on individual campaigns, digital marketers need to connect revenue events, downstream in a user journey, with marketing touch events, upstream in a user journey. This connection is necessary so that the cost of those associated with the marketing campaign that drove those marketing touches can be connected to profit associated with the conversion events later on.
Different attribution models involve applying different logic to connecting those marketing touch events with subsequent revenue events. In this blog post, we will document the analytic approach and the corresponding SQL statements to perform the most basic types of attribution: first and last touch. Our example SQL will focus on the web use case. The same undelying analytic approach can, however, be applied to any channel.
Note that all the SQL given below is Redshift compatible. It can be easier to do attribution analysis with other dialects of SQL that support complicated data types (arrays and objects) in particular.
Identifying the different marketing touches¶
First we need to identify all our marketing touch events. We can generate a table with all of these as follows:
create table derived.marketing_touches as ( select domain_userid, derived_tstamp, event_id, mkt_medium, mkt_source, mkt_term, mkt_content, mkt_campaign, refr_medium, refr_source, refr_term from atomic.events where refr_medium != 'internal' and refr_medium is not null order by 1,2);
The above table includes a line of data per marketing touch, ordered by user (as identified via the first party cookie id
domain_userid) and time (as identified by the
Identifying the different revenue events¶
Now lets create a table with all our different revenue events. What these look like will depend on your own particular event schema - for this example we’ll assume that revenue events are standard Snowplow transaction events. It should be straightforward to modify / update the below SQL with your own set of revenue events.
create table derived.revenue_events as ( select domain_userid, derived_tstamp, event_id, tr_total from atomic.events where event_name = 'transaction' order by 1,2);
Connecting marketing touches with revenue events: first touch attribution model¶
Now that we have our marketing touches and our revenue events, we need to join them together.
How we do the join is firstly a question of business logic: what type of attribution model do we want to apply? The simplest model is a first touch model - this credits all the value associated with the revenue event to the first marketing touch for each user.
To do this, we create a new
derived.first_marketing_touch table, that records only the first marketing touch for each user. This is a subset of the marketing touches recorded in the
with first_touch_tstamps as ( select domain_userid, min(derived_tstamp) as first_touch_tstamp from derived.marketing_touches group by 1, order by 1 ) create table derived.first_marketing_touch as ( select m.domain_userid, m.derived_tstamp, m.event_id, m.mkt_medium, m.mkt_source, m.mkt_term, m.mkt_content, m.mkt_campaign, m.refr_medium, m.refr_source, m.refr_term from derived.marketing_touches m join first_touch_tstamps f -- only return first touch tstamps on m.domain_userid = f.domain_userid and m.derived_tstamp = f.first_touch_tstamp );
Now it is trivial to join our
derived.first_marketing_touch table with our
select f.*, r.tr_total from derived.first_marketing_touch f right join derived.revenue_events r -- right join in case there is no marketing touch event to join to the revenue event on f.domain_userid = r.domain_userid
Bingo! We have a table with a line of data for each revenue event, and all the marketing data associated with the corresponding first touch event for that user.
Note that the above is especially straightforward because:
It is easy to identify the first marketing touch for each user. (It’s simply the one wiht the earlierst timestamp.) By the time we do the join we have a table with a maximum of one marketing touch event per user ID, so we do not have to worry about generating a cartesian product when performing the join. Because neither of the above is true when we are applying a last click attribution model, the SQL gets a bit more complicated.
Connecting marketing touches with revenue events: last touch attribution model¶
In a last touch attribution model, we want to credit all the value associated with each revenue event to the most recent marketing touch that occurred prior to that event.
There are a number of ways to do this in SQL - I think the following is the most straightforward, but welcome any suggestions at alternatives that are clearer / more performant.
First, we need to identify for each revenue event what is the corresponding marketing touch event that we wish to connect. To do this, we first union our marketing touches and revenue event tables into a single table that contains both the marketing touches and the revenue events. For performance reasons, we only include a subset of the columns in our marketing touches table.
create table derived.marketing_touches_and_revenue_events as ( select domain_userid, derived_tstamp, event_id as marketing_event_id, null as revenue_event_id, 'marketing touch' as event_type, null as revenue from derived.marketing_touches union select domain_userid, derived_tstamp, null as marketing_event_id, event_id as revenue_event_id 'revenue event' as event_type, tr_total as revenue from derived.revenue_events );
The above table includes a line for every marketing touch event and every revenue events. Marketing touch events have a
marketing_event_id set and revenue events have a
We need to aggregate over this table so that we set the
marketing_event_id for each revenue event. This will be the
event_id of the most recent marketing touch event prior to the revenue event. We’ll then be able to use the event ID to join back with our marketing touches table, to pull all the metadata associated with that marketing touch to the revenue event.
To do that, we use a window function to identify the most recent marketing touch event prior to the revenue event:
select domain_userid, derived_tstamp, last_value(marketing_event_id ignore nulls) over ( partition by domain_userid order by derived_tstamp rows between unbounded preceding and current row ) as marketing_event_id, revenue_event_id, event_type, tr_total from derived.marketing_touches_and_revenue_events
The window function is doing a lot of work for us, so it is worth explaining what’s going on before we use the above query to generate our final result set:
- First, it partitions our marketing touch and revenue events by user ID
- Then it orders the event stream by time
- Then for each event, it fetches the most recent not null marketing_event_id value. Note that this will be applied to every row in the table i.e. marketing touch events and transaction events.
Where it is applied to marketing events, the most recent marketing event ID will be the marketing event ID for the current event. That doesn’t matter (we’re going to filter these events out of the event stream in the next step). The important thing is that for revenue events, it will correctly fetch the most recent marketing event ID. (Because the marketing event ID for the current row will be null, so will be ignored.)
Now we apply the above window function to generate our final result set:
with last_touch_event_ids_calculated as ( select domain_userid, derived_tstamp, last_value(marketing_event_id ignore nulls) over ( partition by domain_userid order by derived_tstamp rows between unbounded preceding and current row ) as last_marketing_event_id, revenue_event_id, event_type, tr_total from derived.marketing_touches_and_revenue_events ) select r.domain_userid, r.derived_tstamp as revenue_event_tstamp, r.last_marketing_event_id r.revenue_event_id, r.tr_total, m.mkt_medium, m.mkt_source, m.mkt_term, m.mkt_content, m.mkt_campaign, m.refr_medium, m.refr_source, m.refr_term from last_touch_event_ids_calculated r right join derived.marketing_touches m on r.last_marketing_event_id = m.event_id -- only perform the join for the last touch event where r.event_type = 'revenue event' -- only fetch revenue events from the last_touch_event_ids_calculated table
The above query will generate a result set with one line of data per revenue event, and each line including the marketing data associated with the last channel that each user engaged with prior before the revenue event occurred.