Median in BigQuery is a joke
Have you ever tried calculating a median in BigQuery?
It’s a joke how to do that! Let’s take only two values 1 and 50. Median should be 25.5, in BigQuery there is APPROX_QUANTILES function which calculates quantiles in 1 line. Median should be the same as 50th percentile. But it is not!
Standard approach – calculating median in BigQuery
There is a way to calculate it precisely. However it’s not the easiest way. So far I was using this approach thou, as I didn’t know better.
Let’s say you have dataset with columns: country, year, number_of_cars_produced, car_type (RandomData at the bottom of article). You need median and mean.
Mean is easy:
SELECT country, year, avg(number_of_cars_produced) as mean FROM RandomData group by 1,2
SELECT country, year, PERCENTILE_CONT(number_of_cars_produced, 0.5) OVER (partition by country, year) AS median FROM RandomData
select * from (
SELECT country, year, PERCENTILE_CONT(number_of_cars_produced, 0.5) OVER (partition by country, year) AS median FROM RandomData
) group by 1,2,3
Solution – Median (or any other percentile) in BigQuery
I just found out (using internet, more internet, chat gpt and thinking) how to resolve it 🙂
Solution is to create and UDF function. You can save function by running below code:
CREATE OR REPLACE FUNCTION `.udf_functions.percentileExact`(p FLOAT64, providedValues ARRAY, excludeNegativeValues BOOL )
RETURNS FLOAT64
LANGUAGE js AS """
if (p < 0 || p > 1) {
throw new Error("Percentile p must be between 0 (0%) and 1 (100%).");
}
const filteredValues = providedValues.map(value =>
excludeNegativeValues && value < 0 ? null : value // Change negative values to null, if third parameter of function is true ).filter(value => value != null); // Filter out both null and undefined values in one line
//If there are no values after filtering then return null
if (filteredValues.length < 1) {
return null
}
// Check if all elements are numbers
if (!filteredValues.every(Number.isFinite)) {
throw new Error("All elements must be finite numbers");
}
//If there is only 1 value after filtering, then return this value (it works like this in postgreSQL percentile_cont)
if (filteredValues.length < 2) { return filteredValues[0] } // Sort filtered values filteredValues.sort((a, b) => a - b);
const i = p * (filteredValues.length - 1); // Calculates the exact index position for percentile
const floored = Math.floor(i); // Gets the lower bound index
if (floored === i) {
//No interpolation is needed. For example if p=0.5 it means there is odd number of values in array, so median is just the value in center.
return filteredValues[i];
}
const decimal = i - floored; // Calculate the fractional part of the index
const difference = filteredValues[floored + 1] - filteredValues[floored]; // Difference between the neighboring elements
return filteredValues[floored] + difference * decimal; // Interpolated value
"""
SELECT country, year, udf_functions.percentileExact(0.5,array_agg(cast(number_of_cars_produced as float64)),false) AS median, avg(number_of_cars_produced) as mean FROM RandomData group by 1,2
Is BigQuery a joke?
Not at all! I use it all the time. The are some glitches like this. This is not the only one. With this article I wish google will improve it (and other BigQuery issues).
What’s your opinion about it? 🙂
Example dataset for calculations below:
WITH RandomData AS (
-- Data for 2018
SELECT 'USA' AS country, 2018 AS year, 1000 AS number_of_cars_produced, 'Sedan' AS car_type UNION ALL
SELECT 'USA', 2018, 1200, 'SUV' UNION ALL
SELECT 'USA', 2018, 1300, 'Convertible' UNION ALL
SELECT 'USA', 2018, 1400, 'Coupe' UNION ALL
SELECT 'USA', 2018, 1500, 'Pickup' UNION ALL
SELECT 'USA', 2018, 1100, 'Hatchback' UNION ALL
SELECT 'USA', 2018, 1250, 'Estate' UNION ALL
SELECT 'USA', 2018, 1350, 'Sports Car' UNION ALL
SELECT 'USA', 2018, 1450, 'Minivan' UNION ALL
SELECT 'Europe', 2018, 1050, 'Sedan' UNION ALL
SELECT 'Europe', 2018, 1250, 'SUV' UNION ALL
SELECT 'Europe', 2018, 1350, 'Convertible' UNION ALL
SELECT 'Europe', 2018, 1450, 'Coupe' UNION ALL
SELECT 'Europe', 2018, 1550, 'Pickup' UNION ALL
SELECT 'Europe', 2018, 1150, 'Hatchback' UNION ALL
SELECT 'Europe', 2018, 1300, 'Estate' UNION ALL
SELECT 'Europe', 2018, 1400, 'Sports Car' UNION ALL
SELECT 'Europe', 2018, 1500, 'Minivan' UNION ALL
-- Data for 2019
SELECT 'USA', 2019, 1020, 'Sedan' UNION ALL
SELECT 'USA', 2019, 1220, 'SUV' UNION ALL
SELECT 'USA', 2019, 1320, 'Convertible' UNION ALL
SELECT 'USA', 2019, 1420, 'Coupe' UNION ALL
SELECT 'USA', 2019, 1520, 'Pickup' UNION ALL
SELECT 'USA', 2019, 1120, 'Hatchback' UNION ALL
SELECT 'USA', 2019, 1270, 'Estate' UNION ALL
SELECT 'USA', 2019, 1370, 'Sports Car' UNION ALL
SELECT 'USA', 2019, 1470, 'Minivan' UNION ALL
SELECT 'Europe', 2019, 1070, 'Sedan' UNION ALL
SELECT 'Europe', 2019, 1270, 'SUV' UNION ALL
SELECT 'Europe', 2019, 1370, 'Convertible' UNION ALL
SELECT 'Europe', 2019, 1470, 'Coupe' UNION ALL
SELECT 'Europe', 2019, 1570, 'Pickup' UNION ALL
SELECT 'Europe', 2019, 1170, 'Hatchback' UNION ALL
SELECT 'Europe', 2019, 1320, 'Estate' UNION ALL
SELECT 'Europe', 2019, 1420, 'Sports Car' UNION ALL
SELECT 'Europe', 2019, 1520, 'Minivan' UNION ALL
-- Data for 2020
SELECT 'USA', 2020, 1040, 'Sedan' UNION ALL
SELECT 'USA', 2020, 1240, 'SUV' UNION ALL
SELECT 'USA', 2020, 1340, 'Convertible' UNION ALL
SELECT 'USA', 2020, 1440, 'Coupe' UNION ALL
SELECT 'USA', 2020, 1540, 'Pickup' UNION ALL
SELECT 'USA', 2020, 1140, 'Hatchback' UNION ALL
SELECT 'USA', 2020, 1290, 'Estate' UNION ALL
SELECT 'USA', 2020, 1390, 'Sports Car' UNION ALL
SELECT 'USA', 2020, 1490, 'Minivan' UNION ALL
SELECT 'Europe', 2020, 1090, 'Sedan' UNION ALL
SELECT 'Europe', 2020, 1290, 'SUV' UNION ALL
SELECT 'Europe', 2020, 1390, 'Convertible' UNION ALL
SELECT 'Europe', 2020, 1490, 'Coupe' UNION ALL
SELECT 'Europe', 2020, 1590, 'Pickup' UNION ALL
SELECT 'Europe', 2020, 1190, 'Hatchback' UNION ALL
SELECT 'Europe', 2020, 1340, 'Estate' UNION ALL
SELECT 'Europe', 2020, 1440, 'Sports Car' UNION ALL
SELECT 'Europe', 2020, 1540, 'Minivan' UNION ALL
-- Data for 2021
SELECT 'USA', 2021, 1060, 'Sedan' UNION ALL
SELECT 'USA', 2021, 1260, 'SUV' UNION ALL
SELECT 'USA', 2021, 1360, 'Convertible' UNION ALL
SELECT 'USA', 2021, 1460, 'Coupe' UNION ALL
SELECT 'USA', 2021, 1560, 'Pickup' UNION ALL
SELECT 'USA', 2021, 1160, 'Hatchback' UNION ALL
SELECT 'USA', 2021, 1310, 'Estate' UNION ALL
SELECT 'USA', 2021, 1410, 'Sports Car' UNION ALL
SELECT 'USA', 2021, 1510, 'Minivan' UNION ALL
SELECT 'Europe', 2021, 1110, 'Sedan' UNION ALL
SELECT 'Europe', 2021, 1310, 'SUV' UNION ALL
SELECT 'Europe', 2021, 1410, 'Convertible' UNION ALL
SELECT 'Europe', 2021, 1510, 'Coupe' UNION ALL
SELECT 'Europe', 2021, 1610, 'Pickup' UNION ALL
SELECT 'Europe', 2021, 1210, 'Hatchback' UNION ALL
SELECT 'Europe', 2021, 1360, 'Estate' UNION ALL
SELECT 'Europe', 2021, 1460, 'Sports Car' UNION ALL
SELECT 'Europe', 2021, 1560, 'Minivan' UNION ALL
-- Data for 2022
SELECT 'USA', 2022, 1080, 'Sedan' UNION ALL
SELECT 'USA', 2022, 1280, 'SUV' UNION ALL
SELECT 'USA', 2022, 1380, 'Convertible' UNION ALL
SELECT 'USA', 2022, 1480, 'Coupe' UNION ALL
SELECT 'USA', 2022, 1580, 'Pickup' UNION ALL
SELECT 'USA', 2022, 1180, 'Hatchback' UNION ALL
SELECT 'USA', 2022, 1330, 'Estate' UNION ALL
SELECT 'USA', 2022, 1430, 'Sports Car' UNION ALL
SELECT 'USA', 2022, 1530, 'Minivan' UNION ALL
SELECT 'Europe', 2022, 1130, 'Sedan' UNION ALL
SELECT 'Europe', 2022, 1330, 'SUV' UNION ALL
SELECT 'Europe', 2022, 1430, 'Convertible' UNION ALL
SELECT 'Europe', 2022, 1530, 'Coupe' UNION ALL
SELECT 'Europe', 2022, 1630, 'Pickup' UNION ALL
SELECT 'Europe', 2022, 1230, 'Hatchback' UNION ALL
SELECT 'Europe', 2022, 1380, 'Estate' UNION ALL
SELECT 'Europe', 2022, 1480, 'Sports Car' UNION ALL
SELECT 'Europe' AS country, 2022 AS year, 1580 AS number_of_cars_produced, 'Minivan' AS car_type
)
Leave a Reply