Problem Description –
Employee table holds all employees. The employee table has three columns: Employee Id, Company Name, and Salary.
Write a SQL query to find the median salary of each company. Bonus points if you can solve it without using any built-in SQL functions.
Difficulty Level – Hard
WITH cte AS ( SELECT id, company, Salary, ROW_NUMBER() OVER(PARTITION BY company ORDER BY Salary) as rn, COUNT(*) OVER(PARTITION BY company ) as rc FROM Employee ) SELECT Id, company, salary FROM cte WHERE rn IN ( (rc+1 ) DIV 2 , (rc+2) DIV 2 )
Most of the time when the interviewers asked you to find the median, they want you to calculate it without any functions like percentile_cont , so that is what we are going to do.
Let’s first look at the common table expression. If you run the following query –
SELECT id, company, Salary, ROW_NUMBER() OVER(PARTITION BY company ORDER BY Salary) as rn, COUNT(*) OVER(PARTITION BY company ) as rc FROM Employee
The output will be –
For Company A, we have six rows of data, So ROW_NUMBER() function will assign values from 1 to 6, and the COUNT() window function will produce the output of 6 for all the rows. For company B we have 6 rows and for company C we have 5 rows.
Now let’s look at the other part of the query –
SELECT Id, company, salary FROM cte WHERE rn IN ( (rc+1 ) DIV 2 , (rc+2) DIV 2 )
In the WHERE clause, we are using DIV for division instead of ‘/’ because we want an integer division. When we have an even number of rows like 6 in the case of companies A and B. This expression will evaluate to –
SELECT (6+1) DIV 2 first_div, (6+2) DIV 2 second_div
The first division will result in 3 and the second one is 4, which is what we need. The median will be the average of the 3rd and the 4th-row value.
And when we will have odd numbers, then the expression will evaluate to –
SELECT (5+1) DIV 2 first_div, (5+2) DIV 2 second_div
For both the division we will get 3. If we have 5 numbers then the median will be the number in the 3rd row given the data is sorted from lowest to highest.
So, this expression takes care of both the scenario of even and odd rows of data.