BigQuery - easier way to calculate median - using UDF function

Median in BigQuery is a joke

Have you ever tried calculating a median in BigQuery?

It’s a joke how to do that! Let’s take only two values 1 and 50. Median should be 25.5, in BigQuery there is APPROX_QUANTILES function which calculates quantiles in 1 line. Median should be the same as 50th percentile. But it is not!

BigQuery_not_properly_calculates_median

 

Standard approach – calculating median in BigQuery

There is a way to calculate it precisely. However it’s not the easiest way. So far I was using this approach thou, as I didn’t know better.

 

Let’s say you have dataset with columns: country, year, number_of_cars_produced, car_type (RandomData at the bottom of article). You need median and mean.

Mean is easy:

SELECT country, year, avg(number_of_cars_produced) as mean FROM RandomData group by 1,2

But if you do the same for median:


SELECT country, year, PERCENTILE_CONT(number_of_cars_produced, 0.5) OVER (partition by country, year) AS median FROM RandomData

It won’t work properly. As it will give you as many rows as you already have in your table. So to have it properly you need additional group by:


select * from (
SELECT country, year, PERCENTILE_CONT(number_of_cars_produced, 0.5) OVER (partition by country, year) AS median FROM RandomData 
) group by 1,2,3

Which is not the best approach, because if you need: country, year, mean, median you need to combine those two. You are not able to do it in one line.
It’s median, one of the most commonly used statistic and it is not included in BigQuery the easy way for user. What a joke!

Solution – Median (or any other percentile) in BigQuery

I just found out (using internet, more internet, chat gpt and thinking) how to resolve it 🙂
Solution is to create and UDF function. You can save function by running below code:


CREATE OR REPLACE FUNCTION `.udf_functions.percentileExact`(p FLOAT64, providedValues ARRAY, excludeNegativeValues BOOL )
RETURNS FLOAT64
LANGUAGE js AS """
    if (p < 0 || p > 1) {
        throw new Error("Percentile p must be between 0 (0%) and 1 (100%).");
    }

    const filteredValues = providedValues.map(value =>
        excludeNegativeValues && value < 0 ? null : value // Change negative values to null, if third parameter of function is true ).filter(value => value != null); // Filter out both null and undefined values in one line

    //If there are no values after filtering then return null
    if (filteredValues.length < 1) {
        return null
    }

    // Check if all elements are numbers
    if (!filteredValues.every(Number.isFinite)) {
        throw new Error("All elements must be finite numbers");
    }

    //If there is only 1 value after filtering, then return this value (it works like this in postgreSQL percentile_cont)
    if (filteredValues.length < 2) { return filteredValues[0] } // Sort filtered values filteredValues.sort((a, b) => a - b);

    const i = p * (filteredValues.length - 1); // Calculates the exact index position for percentile
    const floored = Math.floor(i); // Gets the lower bound index
    if (floored === i) {
        //No interpolation is needed. For example if p=0.5 it means there is odd number of values in array, so median is just the value in center.
        return filteredValues[i];
    }
    const decimal = i - floored; // Calculate the fractional part of the index
    const difference = filteredValues[floored + 1] - filteredValues[floored]; // Difference between the neighboring elements
    return filteredValues[floored] + difference * decimal; // Interpolated value
"""

Now you have just what you needed. One line for both median and mean:


SELECT country, year, udf_functions.percentileExact(0.5,array_agg(cast(number_of_cars_produced as float64)),false) AS median, avg(number_of_cars_produced) as mean FROM RandomData group by 1,2

Is BigQuery a joke?

Not at all! I use it all the time. The are some glitches like this. This is not the only one. With this article I wish google will improve it (and other BigQuery issues).
What’s your opinion about it? 🙂

Example dataset for calculations below:


WITH RandomData AS (
  -- Data for 2018
  SELECT 'USA' AS country, 2018 AS year, 1000 AS number_of_cars_produced, 'Sedan' AS car_type UNION ALL
  SELECT 'USA', 2018, 1200, 'SUV' UNION ALL
  SELECT 'USA', 2018, 1300, 'Convertible' UNION ALL
  SELECT 'USA', 2018, 1400, 'Coupe' UNION ALL
  SELECT 'USA', 2018, 1500, 'Pickup' UNION ALL
  SELECT 'USA', 2018, 1100, 'Hatchback' UNION ALL
  SELECT 'USA', 2018, 1250, 'Estate' UNION ALL
  SELECT 'USA', 2018, 1350, 'Sports Car' UNION ALL
  SELECT 'USA', 2018, 1450, 'Minivan' UNION ALL
  SELECT 'Europe', 2018, 1050, 'Sedan' UNION ALL
  SELECT 'Europe', 2018, 1250, 'SUV' UNION ALL
  SELECT 'Europe', 2018, 1350, 'Convertible' UNION ALL
  SELECT 'Europe', 2018, 1450, 'Coupe' UNION ALL
  SELECT 'Europe', 2018, 1550, 'Pickup' UNION ALL
  SELECT 'Europe', 2018, 1150, 'Hatchback' UNION ALL
  SELECT 'Europe', 2018, 1300, 'Estate' UNION ALL
  SELECT 'Europe', 2018, 1400, 'Sports Car' UNION ALL
  SELECT 'Europe', 2018, 1500, 'Minivan' UNION ALL
  -- Data for 2019
  SELECT 'USA', 2019, 1020, 'Sedan' UNION ALL
  SELECT 'USA', 2019, 1220, 'SUV' UNION ALL
  SELECT 'USA', 2019, 1320, 'Convertible' UNION ALL
  SELECT 'USA', 2019, 1420, 'Coupe' UNION ALL
  SELECT 'USA', 2019, 1520, 'Pickup' UNION ALL
  SELECT 'USA', 2019, 1120, 'Hatchback' UNION ALL
  SELECT 'USA', 2019, 1270, 'Estate' UNION ALL
  SELECT 'USA', 2019, 1370, 'Sports Car' UNION ALL
  SELECT 'USA', 2019, 1470, 'Minivan' UNION ALL
  SELECT 'Europe', 2019, 1070, 'Sedan' UNION ALL
  SELECT 'Europe', 2019, 1270, 'SUV' UNION ALL
  SELECT 'Europe', 2019, 1370, 'Convertible' UNION ALL
  SELECT 'Europe', 2019, 1470, 'Coupe' UNION ALL
  SELECT 'Europe', 2019, 1570, 'Pickup' UNION ALL
  SELECT 'Europe', 2019, 1170, 'Hatchback' UNION ALL
  SELECT 'Europe', 2019, 1320, 'Estate' UNION ALL
  SELECT 'Europe', 2019, 1420, 'Sports Car' UNION ALL
  SELECT 'Europe', 2019, 1520, 'Minivan' UNION ALL
  -- Data for 2020
  SELECT 'USA', 2020, 1040, 'Sedan' UNION ALL
  SELECT 'USA', 2020, 1240, 'SUV' UNION ALL
  SELECT 'USA', 2020, 1340, 'Convertible' UNION ALL
  SELECT 'USA', 2020, 1440, 'Coupe' UNION ALL
  SELECT 'USA', 2020, 1540, 'Pickup' UNION ALL
  SELECT 'USA', 2020, 1140, 'Hatchback' UNION ALL
  SELECT 'USA', 2020, 1290, 'Estate' UNION ALL
  SELECT 'USA', 2020, 1390, 'Sports Car' UNION ALL
  SELECT 'USA', 2020, 1490, 'Minivan' UNION ALL
  SELECT 'Europe', 2020, 1090, 'Sedan' UNION ALL
  SELECT 'Europe', 2020, 1290, 'SUV' UNION ALL
  SELECT 'Europe', 2020, 1390, 'Convertible' UNION ALL
  SELECT 'Europe', 2020, 1490, 'Coupe' UNION ALL
  SELECT 'Europe', 2020, 1590, 'Pickup' UNION ALL
  SELECT 'Europe', 2020, 1190, 'Hatchback' UNION ALL
  SELECT 'Europe', 2020, 1340, 'Estate' UNION ALL
  SELECT 'Europe', 2020, 1440, 'Sports Car' UNION ALL
  SELECT 'Europe', 2020, 1540, 'Minivan' UNION ALL
  -- Data for 2021
  SELECT 'USA', 2021, 1060, 'Sedan' UNION ALL
  SELECT 'USA', 2021, 1260, 'SUV' UNION ALL
  SELECT 'USA', 2021, 1360, 'Convertible' UNION ALL
  SELECT 'USA', 2021, 1460, 'Coupe' UNION ALL
  SELECT 'USA', 2021, 1560, 'Pickup' UNION ALL
  SELECT 'USA', 2021, 1160, 'Hatchback' UNION ALL
  SELECT 'USA', 2021, 1310, 'Estate' UNION ALL
  SELECT 'USA', 2021, 1410, 'Sports Car' UNION ALL
  SELECT 'USA', 2021, 1510, 'Minivan' UNION ALL
  SELECT 'Europe', 2021, 1110, 'Sedan' UNION ALL
  SELECT 'Europe', 2021, 1310, 'SUV' UNION ALL
  SELECT 'Europe', 2021, 1410, 'Convertible' UNION ALL
  SELECT 'Europe', 2021, 1510, 'Coupe' UNION ALL
  SELECT 'Europe', 2021, 1610, 'Pickup' UNION ALL
  SELECT 'Europe', 2021, 1210, 'Hatchback' UNION ALL
  SELECT 'Europe', 2021, 1360, 'Estate' UNION ALL
  SELECT 'Europe', 2021, 1460, 'Sports Car' UNION ALL
  SELECT 'Europe', 2021, 1560, 'Minivan' UNION ALL
  -- Data for 2022
  SELECT 'USA', 2022, 1080, 'Sedan' UNION ALL
  SELECT 'USA', 2022, 1280, 'SUV' UNION ALL
  SELECT 'USA', 2022, 1380, 'Convertible' UNION ALL
  SELECT 'USA', 2022, 1480, 'Coupe' UNION ALL
  SELECT 'USA', 2022, 1580, 'Pickup' UNION ALL
  SELECT 'USA', 2022, 1180, 'Hatchback' UNION ALL
  SELECT 'USA', 2022, 1330, 'Estate' UNION ALL
  SELECT 'USA', 2022, 1430, 'Sports Car' UNION ALL
  SELECT 'USA', 2022, 1530, 'Minivan' UNION ALL
  SELECT 'Europe', 2022, 1130, 'Sedan' UNION ALL
  SELECT 'Europe', 2022, 1330, 'SUV' UNION ALL
  SELECT 'Europe', 2022, 1430, 'Convertible' UNION ALL
  SELECT 'Europe', 2022, 1530, 'Coupe' UNION ALL
  SELECT 'Europe', 2022, 1630, 'Pickup' UNION ALL
  SELECT 'Europe', 2022, 1230, 'Hatchback' UNION ALL
  SELECT 'Europe', 2022, 1380, 'Estate' UNION ALL
  SELECT 'Europe', 2022, 1480, 'Sports Car' UNION ALL
  SELECT 'Europe' AS country, 2022 AS year, 1580 AS number_of_cars_produced, 'Minivan' AS car_type
)

Leave a Reply

Your email address will not be published. Required fields are marked *