SQL: Window Functions

SQL: Window Functions

Introduction

Firstly we excute the following SQL statements and create a table Payroll in Postgresql:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
CREATE DATABASE WindowFunctions;

CREATE TABLE Payroll (
employee_id INTEGER,
employee_name VARCHAR(20) NOT NULL,
department_name VARCHAR(15) NOT NULL,
salary_amount DECIMAL (8,2) NOT NULL CHECK (salary_amount > 0.00)
);


INSERT INTO Payroll
(employee_id, employee_name, department_name, salary_amount)
VALUES
(1, 'Mark Stone', 'Accounting', 16000.00),
(2, 'Maria Stone', 'Accounting', 13000.00),
(3, 'Geetha Singh', 'Accounting', 13000.00),
(4, 'Richard Hathaway', 'Accounting', 14000.00),
(5, 'Joseph Bastion', 'Accounting', 14000.00),
(6, 'Arthur Prince', 'Production',12000.00),
(7, 'Adele Morse', 'Production', 13000.00),
(8, 'Sheamus O Kelly', 'Production', 24000.00),
(9, 'Sheilah Flask', 'Production', 24000.00),
(10, 'Brian James', 'Production', 16000.00),
(11, 'Adam Scott', 'Production', 16000.00),
(12, 'Maurice Moss', 'IT',12000.00),
(13, 'Roy', 'IT', 12001.00),
(14, 'Jen Barber', 'IT', 28000.00),
(15, 'Richard Hammond', 'IT', 10000.00),
(16, 'James May', 'IT', 10000.00),
(18, 'Jeremy Clarkson', 'IT', 10000.00),
(17, 'John Doe', 'IT', 100000.00); --Note how employee_id was switched around here

Now we have a table payroll in the database windowfunctions:

window_function_1

Then we can execute the following SQL:

1
2
3
4
5
6
7
8
9
10
SELECT
department_name,
COUNT(*) AS dept_employee_count,
MIN(salary_amount) AS min_dept_salary,
MAX(salary_amount) AS max_dept_salary,
AVG(salary_amount)::DECIMAL(8,2) AS average_dept_salary,
SUM(salary_amount) AS total_dept_salaries
FROM Payroll
GROUP BY department_name
ORDER BY department_name;

And the output is:

window_function_2

Now we want to “combine” the result of the above SQL and SELECT * FROM Payroll, to get the following table:

window_function_3

The traditional way is to use JOIN:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
SELECT 
p.*,
t.dept_employee_count,
t.min_dept_salary,
t.max_dept_salary,
t.average_dept_salary,
t.total_dept_salaries
FROM Payroll p
JOIN
(
SELECT
department_name,
COUNT(*) AS dept_employee_count,
MIN(salary_amount) AS min_dept_salary,
MAX(salary_amount) AS max_dept_salary,
AVG(salary_amount)::DECIMAL(8,2) AS average_dept_salary,
SUM(salary_amount) AS total_dept_salaries
FROM Payroll
GROUP BY department_name
ORDER BY department_name
) t
ON (p.department_name = t.department_name)
ORDER BY p.department_name;

And a more simple way is to use window functions:

1
2
3
4
5
6
7
8
9
SELECT 
*,
COUNT(*) OVER (PARTITION BY department_name) AS dept_employee_count,
MIN(salary_amount) OVER (PARTITION BY department_name) AS min_dept_salary,
MAX(salary_amount) OVER (PARTITION BY department_name) AS max_dept_salary,
AVG(salary_amount) OVER (PARTITION BY department_name)::DECIMAL(8,2) AS avg_dept_sal,
SUM(salary_amount) OVER (PARTITION BY department_name) AS total_dept_salaries
FROM Payroll
ORDER BY department_name;

That is why we use window functions, which can simplify our SQL statements.

Basic Syntax of Window Functions

We use an OVER clause to open the windows, and ( ) is the window, i.e., a set of rows. That is, if a function is followed by OVER(), we can treat it as a window function:

1
2
3
4
5
window_function() OVER() --> () is the window

window_function() OVER(PARTITION BY ...) --> PARTITION BY divides the rows into groups

window_function() OVER(PARTITION BY ... ORDER BY ...) --> ORDER BY controls the order in which rows are processed by window functions

Common Window Functions

Aggregate Functions

Aggregate functions, i.e., COUNT, SUM, MIN, MAX and AVG, can be used as window functions. For example:

1
2
3
4
5
6
7
SELECT 
*,
COUNT(*) OVER () AS total_employee_count,
AVG(salary_amount) OVER(PARTITION BY department_name)::DECIMAL(8,2) AS avg_dept_salary
FROM
Payroll
ORDER BY employee_id;

window_function_4

One thing worth noting is that windows are affected by the WHERE clause, that is, windows cannot work outside our base data set. For example:

1
2
3
4
5
6
7
SELECT 
*,
COUNT(*) OVER () AS total_employee_count,
AVG(salary_amount) OVER(PARTITION BY department_name)::DECIMAL(8,2) AS avg_dept_salary
FROM Payroll
WHERE department_name = 'IT'
ORDER BY employee_id;

window_function_5

As shown, the total_employee_count is 7 rather than 18, which means that we firstly execute the statement SELECT * FROM Payroll WHERE department_name = 'IT', then open the windows, then ORDER BY employee_id.

ROW_NUMBER() & RANK() & DENSE_RANK()

ROW_NUMBER(), RANK(), and DENSE_RANK() functions assign an integer to each row based on its order in its result set.

ROW_NUMBER()

Here is the the first example of ROW_NUMBER():

1
2
3
4
5
SELECT 
*,
ROW_NUMBER() OVER () AS "Base Row No"
FROM Payroll
ORDER BY employee_id;

window_function_6

We can break the above statement into 3 steps:

  1. SELECT * FROM Payroll
  2. Open the window, i.e., ROW_NUMBER() OVER () AS "Base Row No"
  3. ORDER BY employee_id

window_function_7

Understanding the above executing procedures, we can see another ROW_NUMBER() example, which following the similar procedures:

1
2
3
4
5
6
SELECT
*,
-- Order each partition first, then assign row numbers.
ROW_NUMBER() OVER (PARTITION BY department_name ORDER BY salary_amount) AS "Dept,Salary Row No"
FROM Payroll
ORDER BY employee_id;

window_function_8

RANK() & DENSE_RANK()

For RANK(), duplicated values are assigned the same rank, skipping the next numebr in line:

1
2
3
4
5
1                            1
2 suppose line 2,3 2
3 have the same value 2
4 ====================> 4 (here we skipped 3)
5 5

While for DENSE_RANK(), duplicated values are assigned the same rank, and no values are skipped:

1
2
3
4
5
1                            1
2 suppose line 2,3 2
3 have the same value 2
4 ====================> 3 (here we did not skip 3)
5 4

For example:

1
2
3
4
5
6
7
SELECT 
*,
-- Ranks --> Equal values are ranked the same, creating gaps in numbering
RANK() OVER (PARTITION BY department_name ORDER BY salary_amount DESC),
-- Dense_Ranks --> Equal values are ranked the same, without gaps in numbering
DENSE_RANK() OVER (PARTITION BY department_name ORDER BY salary_amount DESC)
FROM Payroll;

window_function_9

There is one thing that we should notice, if we execute the following statement:

1
2
3
4
5
6
7
SELECT
*,
RANK() OVER (), -- Ranks are useless without an ORDER BY
DENSE_RANK() OVER (), -- Ranks are useless without an ORDER BY
RANK() OVER (ORDER BY salary_amount DESC),
DENSE_RANK() OVER (ORDER BY salary_amount DESC)
FROM Payroll;

window_function_10

As shown, we must use ORDER BY for ranks, otherwise each row’s rank would be 1.

Statistical Functions

  • percent_rank(): Relative rank of the current row, i.e., (rank-1)/(total rows - 1)

  • cume_dist(): Relative rank of the current row, i.e., (rank)/(total rows)

  • ntile(num_buckets integer): Returns integer ranging from 1 to the argument value, dividing the partition as equally as possible

For example:

1
2
3
4
5
6
7
8
9
10
11
SELECT 	*,
RANK() OVER(ORDER BY series),
(PERCENT_RANK() OVER (ORDER BY series))::DEC(8,2), -- (rank-1)/(total rows - 1)
(CUME_DIST() OVER (ORDER BY series))::DEC(8,2), -- (rank)/(total rows)
NTILE(2) OVER (ORDER BY series), -- Split dataset into 2 parts
NTILE(3) OVER (ORDER BY series) -- Split dataset into 3 parts
FROM
(
SELECT generate_series(1,10) AS series
) dataset
ORDER BY series;

window_function_11

Offset Functions

Offset functions includes FIRST_VALUE(), LAST_VALUE(), LEAD() and LAG().

FIRST_VALUE() & LAST_VALUE()

For example:

1
2
3
4
5
6
SELECT
*,
FIRST_VALUE(employee_name) OVER (PARTITION BY department_name ORDER BY salary_amount DESC) AS "Higest Earner",
LAST_VALUE(employee_name) OVER (PARTITION BY department_name ORDER BY salary_amount DESC) AS "Lowest Earner"
FROM Payroll
ORDER BY employee_id;

window_function_12

LEAD() & LAG()

For example:

1
2
3
4
5
6
7
8
9
10
SELECT
employee_id,
employee_name,
department_name,
LAG(employee_name) OVER (PARTITION BY department_name ORDER BY employee_id) AS "Previous Employee",
LEAD(employee_name) OVER (PARTITION BY department_name ORDER BY employee_id) AS "Next Employee",
LAG(employee_name, 2) OVER (PARTITION BY department_name ORDER BY employee_id) AS "Previous Offset 2",
LAG(employee_name, 2, 'none') OVER (PARTITION BY department_name ORDER BY employee_id) AS "Previous Offset 2 with Defaults"
FROM Payroll
ORDER BY employee_id;

window_function_13