Transition from Google Analytics to GA4
It’s about time to think about transition from Google Analytics to GA4. In my opinion learning GA4 panel (logging in to see default reports) is a waste of time, at least for me. As GA4 allows you to have RAW DATA in BigQuery you can create any report you wish. You have all information 🙂
Why should i bother about Google Analytics 4
Probably you are a user of Google Analytics. And everything works, is reported. Live is a great journey. Nice! However Google will halt Google Analytics in a couple of months. It will work till the end of June 2023. Meaning if your reporting is based on Google Analytics, you will not be able to report anything in a couple of months!
I suppose it is very important argument, why you should start using something new. For example Google Analytics 4
GA4 data as in GA
Data in BigQuery (RawData) allows you to create any report which you so far use. It gives you possibility to create reports for all reports needed (mentioned before). From my experience the trickiest one is Ads reporting. Why? Because to report it you need information about:
- Session – sessionaizing data in GA4
- Source, medium, campaign of each session traffic
- Others (landing page, bounce rate, session duration etc.)
- Additionals
It get’s tricky 😃
Sessionaizing data in GA4
There is many ways to do it. In this article I would like to show the easiest one. Maybe it is not the most robust one, BUT it is (in most cases) good enought one 🙂
Let’s start with the beggining just showing all data from GA4:
DECLARE a DATE DEFAULT "2023-01-01"; DECLARE b DATE DEFAULT "2023-02-24"; SELECT * FROM `ga4dronsomcom.analytics_263819657.events_*` WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',a) AND FORMAT_DATE('%Y%m%d',b)
How to add session identifier to it? Add it from event_params:
DECLARE a DATE DEFAULT "2023-01-01"; DECLARE b DATE DEFAULT "2023-02-24"; SELECT *, cast((select value.int_value from unnest(event_params) where key = 'ga_session_number') as string) as session_number FROM `ga4dronsomcom.analytics_263819657.events_*` WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',a) AND FORMAT_DATE('%Y%m%d',b)
As you can see there is only one line change (here divided into 4):
cast(( select value.int_value from unnest(event_params) where key = 'ga_session_number' ) as string) as session_number
Source, medium, campaign of each session
This one is REALLY tricky. It took me some time, and I’m pretty sure it is possible to improve my code. The thing is – code works. I tested it with real data it correlates with GA’s data ~95-99%, so it’s enough. The code is:
DECLARE a DATE DEFAULT "2023-01-01"; DECLARE b DATE DEFAULT "2023-02-24"; with AllDataSessionized as ( SELECT *, cast((select value.int_value from unnest(event_params) where key = 'ga_session_number') as string) as session_number FROM `ga4dronsomcom.analytics_263819657.events_*` WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',a) AND FORMAT_DATE('%Y%m%d',b) ), #Getting more information about traffic - referrer, source, medium, campaign, term, content AddingInformationAboutTrafficOrigin as ( SELECT event_date, event_timestamp, user_pseudo_id, session_number ,(select value.string_value from unnest(event_params) where key = 'page_referrer') as page_referrer_ga4 ,(select value.string_value from unnest(event_params) where key = 'source') as source_ga4 ,(select value.string_value from unnest(event_params) where key = 'medium') as medium_ga4 ,(select value.string_value from unnest(event_params) where key = 'campaign') as campaign_ga4 ,(select value.string_value from unnest(event_params) where key = 'term') as term_ga4 ,(select value.string_value from unnest(event_params) where key = 'content') as content_ga4 , if( ( lower((select value.string_value from unnest(event_params) where key = 'page_location')) like "%gclid%" ), (select value.string_value from unnest(event_params) where key = 'page_location'), null ) as page_location_ga4 FROM AllDataSessionized ), FirstSessionTimestamp as ( select min(event_timestamp) as min_event_timestamp, user_pseudo_id as min_user_pseudo_id, session_number as min_session_number, event_date as min_event_date from AddingInformationAboutTrafficOrigin group by 2,3,4 ), OnlySessionBeggining as ( select * except(min_event_timestamp, min_user_pseudo_id, min_session_number, min_event_date) from ( select * from AddingInformationAboutTrafficOrigin left join FirstSessionTimestamp on event_timestamp=min_event_timestamp and user_pseudo_id = min_user_pseudo_id and session_number = min_session_number and event_date = min_event_date ) where min_event_date is not null ), #Take only the first timestamp of session #There is chance that in first timestamp there will be multiple events with different ... origins #Let's take all and later with row_number choose which one to select FirstSessionTimestampGrouped as ( select * except(event_timestamp) from OnlySessionBeggining group by 1,2,3,4,5,6,7,8,9,10 ), DescribedSourceOrigin as ( select *, #This if is two times. Second time to give number for row_number to know what to be based on #If all is null then direct if(page_referrer_ga4 is null and source_ga4 is null and medium_ga4 is null and campaign_ga4 is null and page_location_ga4 is null,"direct", #If link has gclid it's google / cpc if(lower(page_location_ga4) like "%gclid%", "google / cpc", #If contains source / medium #If it does then concat source, medium, campaign into one column. It is easier this way. if(source_ga4 is not null and medium_ga4 is not null, concat(source_ga4, "_____", medium_ga4, "_____", campaign_ga4), #If refferer is from page domain itself if(page_referrer_ga4 is not null and lower(page_referrer_ga4) like "%dronsom%","internal / referrer", #If referrer, I would like to get only domain if(page_referrer_ga4 is not null, concat( left(replace(replace(replace(page_referrer_ga4,"https://",""),"http://",""),"www.",""), if(strpos(replace(replace(replace(page_referrer_ga4,"https://",""),"http://",""),"www.",""),"/")>1,strpos(replace(replace(replace(page_referrer_ga4,"https://",""),"http://",""),"www.",""),"/")-1,length(replace(replace(replace(page_referrer_ga4,"https://",""),"http://",""),"www.","")))) , " / referral"), "anotherTraffic"))))) as TrafficOrigin, #If all is null then direct if(page_referrer_ga4 is null and source_ga4 is null and medium_ga4 is null and campaign_ga4 is null and page_location_ga4 is null,6, #If link has gclid it's google / cpc if(lower(page_location_ga4) like "%gclid%", 1, #If contains source / medium if(source_ga4 is not null and medium_ga4 is not null, 2, #If refferer is from page domain itself if(page_referrer_ga4 is not null and lower(page_referrer_ga4) like "%dronsom%",4, #If referrer, I would like to get only domain if(page_referrer_ga4 is not null, 3, 5))))) #Order which is set: #google / cpc #source / medium #referrer #referrer of domain itself #anotherTraffic (it should be empty!) #direct as OrderTrafficOrigin, #Getting cd parameter from url. It is google campaign id. REGEXP_EXTRACT(page_location_ga4, r'[?&]cd=([^&]+)') as google_campaign_id from FirstSessionTimestampGrouped ), FullDescriptionTrafficOrigin as ( select event_date as event_date_combining, user_pseudo_id as user_pseudo_id_combining, session_number as session_number_combining, TrafficOrigin as TrafficOrigin_of_Session, google_campaign_id, term_ga4, content_ga4 from ( select *, ROW_NUMBER() OVER (PARTITION BY event_date, user_pseudo_id, session_number ORDER BY OrderTrafficOrigin ASC) AS NumberOfSourceOrigin from ( select * except(page_referrer_ga4, source_ga4, medium_ga4, campaign_ga4, page_location_ga4) from DescribedSourceOrigin group by 1,2,3,4,5,6,7,8 )) where NumberOfSourceOrigin=1 ), AllDataSessionizedWithOrigin as ( select * except(event_date_combining, user_pseudo_id_combining, session_number_combining) from ( select * from AllDataSessionized left join FullDescriptionTrafficOrigin on event_date = event_date_combining and user_pseudo_id = user_pseudo_id_combining and session_number = session_number_combining ) ) select *, PARSE_DATE("%Y%m%d", event_date) as data_partition from AllDataSessionizedWithOrigin
For simplicity I am going to save this into separate table. Remember to use partition table in such case. Of course you can set automatic process for it so your data is ready every day
CREATE or replace TABLE `ga4dronsomcom.analytics_263819657.DronsomDataSessionized` PARTITION BY data_partition as (
Landing page, bounce rate etc.
The trickiest part is done – now you have sessionized data with origin (source, medium, campaign) of each session! The toughest one is … google traffic. As you do not use utm’s, and you need to get campaign id from separate column (google_campaign_id). Getting landing page per session or bounce rate is much much easier.
Landing Page:
with AllData as ( SELECT * FROM `ga4dronsomcom.analytics_263819657.DronsomDataSessionized` WHERE data_partition >= "2023-01-01" ) select user_pseudo_id, TrafficOrigin_of_Session, event_date, session_number, (select value.string_value from unnest(event_params) where key = 'page_location') as SessionLandingPage from ( select *, #User Row_number to get the very first page_view event of each session, for each users, from each trafficOriginOfSession ROW_NUMBER() OVER (PARTITION BY user_pseudo_id, TrafficOrigin_of_Session, event_date, session_number ORDER BY event_timestamp ASC) AS NumberOfPageInSession from AllData where event_name="page_view" ) #Show only the first page in session - which is landing page where NumberOfPageInSession=1
Bounce Rate:
with AllData as ( SELECT * FROM `ga4dronsomcom.analytics_263819657.DronsomDataSessionized` WHERE data_partition >= "2023-01-01" ), AllDataWithPageLocation as ( select *, (select value.string_value from unnest(event_params) where key = 'page_location') as page_location_ga4 from AllData ), IsSessionBounced as ( select *, if(NumberOfDifferentPageLocationsInSession>=2,0,1) as BouncedSession from ( select user_pseudo_id, TrafficOrigin_of_Session, event_date, session_number, count(distinct(page_location_ga4)) as NumberOfDifferentPageLocationsInSession from AllDataWithPageLocation group by 1,2,3,4 ) ) #Here count(*) calculates number of sessions select TrafficOrigin_of_Session, sum(BouncedSession)/count(*) as BounceRate from IsSessionBounced group by 1
This query checks only if user was on one or more url’s it does not take into account scroll events, time on page etc.
Additionals
There is additional room for improvement, for example:
- Session definition is different from GA’s. It does not treat especially direct the same way as GA and user opening a site from another origin
- Maybe in your case you should take only specific events (for example page_view) to set session traffic or calculate time on page etc.
- Combining user_id with pseudo_id (and any other identifier which you collect) can improve user definition
- Page_location can be improved not to divide pages between those with or without utms
- Don’t forget about referrers from payment gates, it also can be handled in a better way
- You can exclude traffic from any other domains (your test ones or some bots)
Summary
Moving from one tool to another can be challenging. Especially if it is really important tool which is a base for data-driven approach in your company. GA4 is powerful, especially if you use RawData in BigQuery. This way you are able to create any report, and replace GA. As shown it is not so simple, but You can do it! Especially if you are an data analyst, then you even should do it. If your company needs any help you can contact me at: rkozlowski@dronsom.com
Let’s find out what we can do together 🙂
Leave a Reply