SQL & BigQuery & Google Analytics 4

SQL in BigQuery – First steps in Raw Data Google Analytics 4

How to make an SQL report from Raw Data of Google Analytics 4? First, you need to have data. Have already? Log in to Google BigQuery

 

BigQuery panel for queries

On the left panel choose your project (sugestowo in my screen) -> analytics_XXXXX -> event_s and click Query table. Now the magic starts 😉

 

I heard the opinion, that if you know English, you know SQL. I partially agree 🙂 It’s not that hard. Let’s start with the basics:

SELECT * FROM `sugestowo.analytics_263648625.events_20210303`

Select – it means that you are choosing something

“*” – means ALL (all columns)

FROM – means which table would you like to get data from

When you click run, you will get all the data for the 3rd of March 2021:

Select Star SQL

As you can see in the green in my example query processed 363.3 KB it’s very small amount. Queries easily take 10 GB. And Google provides you with 1TB free per month (each next TB cost ~5USD).

You can check a number of rows in your data looking at the bottom of the page (580), or by running:

SELECT count(*) FROM `sugestowo.analytics_263648625.events_20210303`

count(*) means to count all rows, so it provides you with 580. You can change your column name easily with “as”:

SELECT count(*) as NumberOfRows FROM `sugestowo.analytics_263648625.events_20210303`

As it is data from Google Analytics 4 let’s check which events are measured, event_name shows that:

SELECT event_name as NumberOfRows FROM `sugestowo.analytics_263648625.events_20210303`

But it gives … 580 rows of different types. So let’s make them unique, meaning … distinct. Yes, that’s the function 😉

SELECT distinct(event_name) as NumberOfRows FROM `sugestowo.analytics_263648625.events_20210303`

 

BigQuery / SQL Counting and grouping by

How to calculate how many times each event was fired? You need both “event_name” and “count(*)” so the first approach would be:

SELECT event_name, count(*) as NumberOfRows FROM `sugestowo.analytics_263648625.events_20210303`

But it will not work. Google will show you what’s wrong

Error handling in BigQuery

You need to “group by” as google doesn’t know how to count(*) if there is an additional column. You need to make in the query group by event_name. This way it will now that you want to calculate a number of raws for each specific event_name:

SELECT event_name, count(*) as NumberOfRows FROM `sugestowo.analytics_263648625.events_20210303` group by event_name

To make it easier you do not need to provide event_name after group by. “1” will be enough as event_name is the first column

Grouping By for Counting - SQL - Google Analytics 4 Raw Data

Date(s) handling in BigQuery SQL – Google Analytics 4

Let’s get to something more complicated. As on the end of a table name (in from), there is a date (20210303), how to create a table, which will take into account range of dates, for example between 1st and 3rd of march? Fairly simple. At the begining of your statement declare data variables:

DECLARE a DATE DEFAULT "2021-03-01"; DECLARE b DATE DEFAULT "2021-03-03"; #And in the next line:
SELECT event_name, count(*) FROM `sugestowo.analytics_263648625.events_*` WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',a) AND FORMAT_DATE('%Y%m%d',b) group by 1

 

Why I added “#”? Because with the hash you are commenting:

Date range handling in BigQuery SQL

How to make queries (codes) more user-friendly?

Let’s dive into “with as”. This way you can create a “helping” query for example:

with AllData as (
SELECT * FROM `sugestowo.analytics_263648625.events_*` WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',a) AND FORMAT_DATE('%Y%m%d',b)
)

Then in your statement, you can use it in an easier way – by making it from “AllData”. Let’s look at it:

SubQueries - BigQuery SQL - using with

If you want to use many such sub-queries you need to do something like:

with FirstSubQuery as (XXXXXX),

SecondSubQuery as (YYYYY),

ThirdSubQuery as (ZZZZZ)

As you can see you provide “with” only one time. And you need to add a comma after “)” to let BigQuery SQL know, that you are starting the next subquery. If you have only one, you don’t use “,” after “)”.

Filtering results in BigQuery

You won’t guess it – to filter results you need to add “where” to a statement. So for example I would like to take only those raws of data in which event_name is outband_link

select * from AllData where event_name="outband_link"

Something more advanced? I want to check which users (cookies) made such activity on the 3rd of March:

#Don't forget to change dates to 3rd of march (beggining and end)

select user_pseudo_id from AllData where event_name="outband_link"

How about making those users distinct? (I don’t want duplicates)

select distinct(user_pseudo_id) from AllData where event_name="outband_link"

Disticnt & Filter - SQL query

And now the funny things start. Lets’ check which of those users (who made an outband_link on the 3rd of March) were on a site between 28th of Feb till 2nd of March.

We need to start with getting those users. It is fairly simple just declare two additional dates, and get all user_pseudo_id

select distinct(user_pseudo_id) from AllData2
#where AllData2 is also declared, but based on date range between 28th of Feb and 2nd of March

SQL Left join in BigQuery based on user cookies

Now we can select both, export them to Google Sheets and check it. However, there is of course better way 🙂 The query is:

#Start Date
DECLARE
a DATE DEFAULT "2021-03-03";
#End date
DECLARE
b DATE DEFAULT "2021-03-03";
#Start Date
DECLARE
c DATE DEFAULT "2021-02-28";
#End date
DECLARE
d DATE DEFAULT "2021-03-02";
#All data betwenn a and b date
with AllData as (SELECT * FROM `sugestowo.analytics_263648625.events_*` WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',a) AND FORMAT_DATE('%Y%m%d',b) ),
AllData2 as ( SELECT * FROM `sugestowo.analytics_263648625.events_*` WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',c) AND FORMAT_DATE('%Y%m%d',d) ),
DistinctUserIDMarchOutbound as (select distinct(user_pseudo_id) from AllData),
DistinctUserIDFebMarchVisit as (select distinct(user_pseudo_id) as user_pseudo_id_v1 from AllData2)
select * from DistinctUserIDMarchOutbound
left join
DistinctUserIDFebMarchVisit on user_pseudo_id=user_pseudo_id_v1

The only new thing here is left join. First I have a table of unique users who made an event outbound link on the 3rd of march (“select * from DistinctUserIDMarchOutbound”), then left join (add to this table to the additional columns) DistinctUserIDFebMarchVisit. Match data based on user_pseudo_id. The results are here:

Left Join - SQL - Matching on cookie id - pseudo_cookie_id

Handling nested data with Cross Join in BigQuery

Let’s look at exactly one raw of data, for example, some random pageview:

How nested data looks - raw which looks as 12 rows

Yes, this is one raw. The data in event_params is nested. You can unnest this data by:

CROSS JOIN UNNEST (event_params) AS params

But what the heck. It’s 12 rows now!

Let’s look to the right:

Unnesting - creating multiple row - Google Analytics 4 BigQuery

Unnest took the nested data:

Nested Data in BigQuery

And created a copy for each of them, but on the very right, it divided it. So to take only page URL on which user was in this event you need to:

  1. Filter event_params.key to be equal page_location
  2. Have a column: event_params.value.string_value

Let’s do it:

#Start Date
DECLARE
a DATE DEFAULT "2021-03-03";
#End date
DECLARE
b DATE DEFAULT "2021-03-03";
with AllData as (SELECT * FROM `sugestowo.analytics_263648625.events_*` WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',a) AND FORMAT_DATE('%Y%m%d',b) )
Select value.string_value from AllData
CROSS JOIN UNNEST (event_params) AS params
where event_timestamp=1614805011838408 and user_pseudo_id="1984128012.1614805012" and event_name="page_view" and key="page_location"

URL get from page_locatin - Google Analytics 4

It was an example with only one user. How to get all the users, and their pages? In select add “user_pseudo_id” and in where delete not necessary conditions:

String value in nested data - Google Analytics 4

How to get users and visited URL’s

Finally how to get distinct values? Use group by. So the final query is:

#Start Date
DECLARE
a DATE DEFAULT "2021-03-03";
#End date
DECLARE
b DATE DEFAULT "2021-03-03";
with AllData as (SELECT * FROM `sugestowo.analytics_263648625.events_*` WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',a) AND FORMAT_DATE('%Y%m%d',b) )
Select user_pseudo_id, value.string_value from AllData
CROSS JOIN UNNEST (event_params) AS params
where key="page_location"
group by 1,2

Working on nested data in BigQuery

If you prefer video version of tutorials here you will find it:

 

Those are the basics of reporting with Google Analytics 4 raw data. Let me know in a comment how do you like this article 🙂

Leave a Reply

Your email address will not be published. Required fields are marked *