Advanced SQL: Window Functions Explained

Welcome to this exploration of
Advanced SQL – Window Functions Part 4.

** To learn frame type – ROWS, RANGE, and GROUPS, refer to the
part 1 of this blog series
** To Discover more about EXCLUDE, PARTITION BY, LAG and LEAD, refer
to the part 2 of this blog series
** To Discover more about FIRST_VALUE, LAST_VALUE, NTH_VALUE,
ROW_NUMBER, DENSE_RANK, RANK, NTILE, refer to the part 3 of this
blog series

RECAP:

What are window functions?

With SQL:2003, the ISO SQL Standard introduced
window functions, a new mode of row-based computation:

Copy Code

 
                                              Input            Output
  Aggregate functions(sum, avg, count)    group of rows →  row (one per group)
  window function                         row vicinity →  row (one per row)

Window functions operate on a set of rows ( commonly known as window or frame or row
vicinity) and return a single value for each row.
The term window describes the set of rows on which the function operates.
A window function uses values from the rows in a window.

In the Part 1 of this blog series, we learnt how to create a frame
using frame_types – ROWS, RANGE and GROUPS.

In the part 2 of this blog series, we learnt about different
functions like EXCLUDE, PARTITION BY, LAG and LEAD

In the part 3 of this blog series, we learnt about different
functions like FIRST_VALUE, LAST_VALUE, NTH_VALUE, ROW_NUMBER,
DENSE_RANK, RANK, NTILE

Now we’ll see more about the other functionality that window
functions provide.

USE CASE – What is the likelihood of nice weather on weekends?

Input – Daily weather sensor readings consisting of day, weekday,
temperature, rain

The weather is fine on day d if—on d and the two days prior—the minimum temperature is above
15°C and
the overall rainfall is
less than 600ml/m².

Copy Code

 
  DROP TABLE IF EXISTS sensors;
  CREATE TABLE sensors (
    day     int PRIMARY KEY, -- day of month
    weekday text,            -- day of week (Mon...Sun)
    temp    float,           -- temperature in °C
    rain    float);          -- rainfall in ml

  INSERT INTO sensors(day, weekday, temp, rain) VALUES
    ( 1, 'Thu', 13,   0),
    ( 2, 'Fri', 10, 800),
    ( 3, 'Sat', 12, 300),
    ( 4, 'Sun', 16, 100),
    ( 5, 'Mon', 20, 400),
    ( 6, 'Tue', 20,  80),
    ( 7, 'Wed', 18, 500),
    ( 8, 'Thu', 14,   0),
    ( 9, 'Fri', 10,   0),
    (10, 'Sat', 12, 500),
    (11, 'Sun', 14, 300),
    (12, 'Mon', 14, 800),
    (13, 'Tue', 16,   0),
    (14, 'Wed', 15,   0),
    (15, 'Thu', 18, 100),
    (16, 'Fri', 17, 100),
    (17, 'Sat', 15,   0),
    (18, 'Sun', 16, 300),
    (19, 'Mon', 16, 400),
    (20, 'Tue', 19, 200),
    (21, 'Wed', 19, 100),
    (22, 'Thu', 18,   0),
    (23, 'Fri', 17,   0),
    (24, 'Sat', 16, 200);

As per the problem statement, the weather is fine
on day d if—on d and the two days prior—the minimum
temperature is above 15°C and the overall rainfall is less than
600ml/m².

Pseudo code

Step 1

Collect the weather data for each day and 2 days prior as a frame.
Find the minimum temperature over the 3 days in the window
Find the overall rainfall over the 3 days in the window

weather data for each day and 2 days prior
– ORDER BY s.dayROWS BETWEEN 2 PRECEDING AND CURRENT ROW

minimum temperature over the 3 days in the window –
MIN(s.temp) OVER three_days

overall rainfall over the 3 days in the window
SUM(s.rain) OVER three_days

Combining all the above into a query will be

Copy Code

 
  SELECT s.day, s.weekday,
           MIN(s.temp) OVER three_days AS temp,
           SUM(s.rain) OVER three_days AS rain
    FROM   sensors AS s
    WINDOW three_days AS (ORDER BY s.day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)

Step 2:
Over the collected data for current day and 2 prior, check if
minimum temperature is above 15°C and total rainfall is less than
600 ml/m2

Copy Code

 
  SELECT s.day, s.weekday,
           CASE WHEN s.temp >= 15 and s.rain <= 600
                THEN 'sunny'
                ELSE 'rainy'
           END AS condition
  FROM   three_day_sensors AS s

Step 3:

Now that we have determined whether a day is sunny or rainy. From this data collect only the
weekend data (saturday and sunday)
Group all the weekends and weekdays data separately
For each group, count the number of days it was sunny and calculate its percentage.

Copy Code

 
  SELECT w.weekday IN ('Sat', 'Sun') AS "weekend?",
          (COUNT(*) FILTER (WHERE w.condition = '☀') * 100.0 /
           COUNT(*)) :: int  AS "% fine"
    FROM   weather AS w
    GROUP BY "weekend?";

FInal Query will be

Copy Code

 
  WITH
  three_day_sensors(day, weekday, temp, rain) AS (
    SELECT s.day, s.weekday,
           MIN(s.temp) OVER three_days AS temp,
           SUM(s.rain) OVER three_days AS rain 
    FROM sensors AS s  
    WINDOW three_days AS (ORDER BY s.day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
  ),

  weather(day, weekday, condition) AS (
    SELECT s.day, s.weekday,
           CASE WHEN s.temp >= 15 and s.rain <= 600
                THEN 'sunny'
                ELSE 'rainy'
           END AS condition  
    FROM   three_day_sensors AS s 
  )

  SELECT w.weekday IN ('Sat', 'Sun') AS "weekend?",
        (COUNT(*) FILTER (WHERE w.condition = 'sunny') * 100.0 /
        COUNT(*)) :: int  AS "% fine"  
  FROM   weather AS w  
  GROUP BY "weekend?";

OUTPUT -

  weekend? | % fine
  ---------+--------
     f     |  29
     t     |  43

Weekdays have pleasant weather 29% of the time and weekends have
pleasant weather 43% of the time.

USE CASE - Sessionization

Input:
A table that contains timestamps of 2 user's logs. For a particular
user, if the difference between 2 timestamps is more than 30
seconds, it should have a new session id

Copy Code

 
  DROP TABLE IF EXISTS log;
  CREATE TABLE log(uid text NOT NULL,ts  timestamp NOT NULL);
  
  INSERT INTO log(uid, ts) VALUES
    ('User1', '05-25-2020 07:25:12'),  -- assumes datestyle = 'ISO, MDY'
    ('User1', '05-25-2020 07:25:18'),
    ('User1', '05-25-2020 07:25:21'),
    ('User2', '05-25-2020 08:01:55'),
    ('User2', '05-25-2020 08:05:07'),
    ('User2', '05-25-2020 08:05:30'),
    ('User2', '05-25-2020 08:05:53'),
    ('User2', '05-25-2020 08:06:19'), -- ⎱ sessions by User2 and User1 within
    ('User1', '05-25-2020 08:06:30'), -- ⎰ 30 seconds, still *two* sessions ⇒ partition by uid ⁑
    ('User1', '05-25-2020 08:06:42'),
    ('User1', '05-25-2020 18:32:07'),
    ('User1', '05-25-2020 18:32:27'),
    ('User1', '05-25-2020 18:32:44'),
    ('User1', '05-25-2020 18:33:00'),
    ('User2', '05-25-2020 22:20:06'),
    ('User2', '05-25-2020 22:20:16');

Pseudo code

1. Create 2 groups, for user 1 and user 2, and order the timestamps
in ascending order.
PARTITION BY uid ORDER BY ts
Partition by is used to separate user1 and user2 logs and the
timestamps are ordered.

2. If the time difference between the current row and its previous
row is greater than 30 secs then mark the session as 1 else mark it
as 0.
Note : 1 marks the beginning of new session
Current.ts > LAG(ts, 1, '-infinity) OVER window_function

Copy Code

 
  SELECT l.*,
           CASE WHEN l.ts > LAG (l.ts, 1, '-infinity') OVER (PARTITION BY l.uid ORDER BY l.ts) + :'inactivity'
              THEN 1
              ELSE 0
          END AS sos
  FROM   log AS l
  ORDER BY l.uid, l.ts  -- ← for presentation purposes only

OUTPUT -

  uid  |         ts          | sos
-------+---------------------+---------------
 User1 | 2020-05-25 07:25:12 |   1
 User1 | 2020-05-25 07:25:18 |   0
 User1 | 2020-05-25 07:25:21 |   0
 User1 | 2020-05-25 08:06:30 |   1
 User1 | 2020-05-25 08:06:42 |   0
 User1 | 2020-05-25 18:32:07 |   1
 User1 | 2020-05-25 18:32:27 |   0
 User1 | 2020-05-25 18:32:44 |   0
 User1 | 2020-05-25 18:33:00 |   0
 User2 | 2020-05-25 08:01:55 |   1
 User2 | 2020-05-25 08:05:07 |   1
 User2 | 2020-05-25 08:05:30 |   0
 User2 | 2020-05-25 08:05:53 |   0
 User2 | 2020-05-25 08:06:19 |   0
 User2 | 2020-05-25 22:20:06 |   1
 User2 | 2020-05-25 22:20:16 |   0

3. Perform a running sum over the start of session tag to assign
session IDs

Copy Code

 
  SELECT t.*,
         SUM (t.sos) OVER (PARTITION BY t.uid ORDER BY t.ts) AS session
    FROM   tagged AS t
    ORDER BY t,uid, t.ts

    uid   |         ts          | sos | session
  --------+---------------------+-----+------------
    User1 | 2020-05-25 07:25:12 |   1 |       1
    User1 | 2020-05-25 07:25:18 |   0 |       1
    User1 | 2020-05-25 07:25:21 |   0 |       1
    User1 | 2020-05-25 08:06:30 |   1 |       2
    User1 | 2020-05-25 08:06:42 |   0 |       2
    User1 | 2020-05-25 18:32:07 |   1 |       3
    User1 | 2020-05-25 18:32:27 |   0 |       3
    User1 | 2020-05-25 18:32:44 |   0 |       3
    User1 | 2020-05-25 18:33:00 |   0 |       3
    User2 | 2020-05-25 08:01:55 |   1 |       1
    User2 | 2020-05-25 08:05:07 |   1 |       2
    User2 | 2020-05-25 08:05:30 |   0 |       2
    User2 | 2020-05-25 08:05:53 |   0 |       2
    User2 | 2020-05-25 08:06:19 |   0 |       2
    User2 | 2020-05-25 22:20:06 |   1 |       3
    User2 | 2020-05-25 22:20:16 |   0 |       3

4. With this information, we can calculate the analytical data such
as time per session, total sessions etc.

Copy Code

 
  SELECT s.uid,
         s.session,
         MAX(s.ts) - MIN(s.ts) AS duration
    FROM   sessionized AS s
    GROUP BY s.uid, s.session
    ORDER BY s.uid, s.session


  uid   | session | duration
  ------+---------+----------
  User1 |       1 | 00:00:09
  User1 |       2 | 00:00:12
  User1 |       3 | 00:00:53
  User2 |       1 | 00:00:00
  User2 |       2 | 00:01:12
  User2 |       3 | 00:00:10

FINAL QUERY

Copy Code

 
  set inactivity '30 seconds'

  WITH

  tagged(uid, ts, sos) AS (
    SELECT l.*,
          CASE WHEN l.ts >          
                    LAG (l.ts, 1, '-infinity') OVER (PARTITION BY l.uid ORDER BY l.ts) + :'inactivity'
                THEN 1 
                ELSE 0 
            END AS sos
    FROM   log AS l
    ORDER BY l.uid, l.ts  
  ),

  sessionized(uid, ts, sos, session) AS (
    SELECT t.*,
          SUM (t.sos) OVER (PARTITION BY t.uid ORDER BY t.ts) AS session

    FROM   tagged AS t
    ORDER BY t,uid, t.ts  
  ),

  measured(uid, session, duration) AS (
    SELECT s.uid,
          s.session,
          MAX(s.ts) - MIN(s.ts) AS duration
    FROM   sessionized AS s
    GROUP BY s.uid, s.session
    ORDER BY s.uid, s.session  
  )
  TABLE measured;

USE CASE - Identify Consecutive Ranges

Input -

Table of random numbers - example - [5,2,14,3,1,42,6,10,7,13] shall
be converted to [1-3,5-7,10,13&14,42].

Input - [5,2,14,3,1,42,6,10,7,13]
On order by/ sorting - [1,2,3,5,6,7,10,13,14,42 ]
Reducing -
[ 1,2,3 ] to 1-3 [5,6,7] to 5-7 [13,14] to 13&14 42 - 42
Approach -
5. Sort the numbers
6. Create row numbers to the table using ROW_NUMBER()
7. Subtract the number with ROW NUMBER.
8. Similar values are grouped together.

  ┌─────
  │ ref │
  ├─────
  │   1 │ -  1 =  0    ⎫
  │   2 │ -  2 =  0    ⎬  range 0
  │   3 │ -  3 =  0 __ ⎭
  │   5 │ -  4 =  1    ⎫
  │   6 │ -  5 =  1    ⎬  range 1
  │   7 │ -  6 =  1 __ ⎭
  │  10 │ -  7 =  3 __ }  range 3
  │  13 │ -  8 =  5    ⎱  range 5
  │  14 │ -  9 =  5 __ ⎰
  │  42 │ - 10 = 32    }  range 32
  └─────

Copy Code

 
  DROP TABLE IF EXISTS citations;
  CREATE TABLE citations(ref int PRIMARY KEY);

  INSERT INTO citations VALUES
    (5), (2), (14), (3), (1), (42), (6), (10), (7), (13);

1. Sort the numbers
2. Create row numbers to the table using ROW_NUMBER()
3. Subtract the number with ROW NUMBER.

Copy Code

 
  ref - ROW_NUMBER() OVER (ORDER BY c.ref) AS range

  SELECT c.ref,
         c.ref - ROW_NUMBER() OVER (ORDER BY c.ref) AS range  
    FROM citations AS c

  ref | range
  ----+-------
  1   |   0
  2   |   0
  3   |   0
  5   |   1
  6   |   1
  7   |   1
  10  |   3
  13  |   5
  14  |   5
  42  |   32

From a given range, find the min and max values. These are the first
and last values of the range

Copy Code

 
  SELECT r.range, MIN(r.ref) AS first, MAX(r.ref) AS last  
    FROM   ranges AS r  
    GROUP BY r.range

  range  | first | last
  -------+-------+------
    3    |  10   |   10
    5    |  13   |   14
    0    |  1    |    3
    32   |  42   |   42
    1    |  5    |    7

Next step is to add - or & between first and last values based on
their difference.

FINAL QUERY

Copy Code

 
  WITH ranges(ref, range) AS (
    SELECT c.ref,
           c.ref - ROW_NUMBER() OVER (ORDER BY c.ref) AS range
    FROM citations AS c
  ),
  outputs(range, first, last) AS (
    SELECT r.range, MIN(r.ref) AS first, MAX(r.ref) AS last
    FROM   ranges AS r
    GROUP BY r.range
  )
  SELECT string_agg(CASE o.last - o.first
                     WHEN 0 THEN o.first :: text
                     WHEN 1 THEN o.first || '&' || o.last
                     ELSE        o.first || '-' || o.last
                   END,
                  ','
                  ORDER BY o.range) AS citations  
  FROM   outputs AS o;

citations
-----------------------------
1-3,5-7,10,13&14,42

RECAP:

What are window functions?

USE CASE – What is the likelihood of nice weather on weekends?

Pseudo code

USE CASE - Sessionization

USE CASE - Identify Consecutive Ranges

Take the first step and Connect with us today

Mastering Advanced SQL: Part 3

Efficient Code Splitting in React

Our AI Products

Our Services

Industries

Insights

About

RECAP:

What are window functions?

USE CASE – What is the likelihood of nice weather on weekends?

Pseudo code

USE CASE - Sessionization

USE CASE - Identify Consecutive Ranges

Take the first step and Connect with us today

Tags:

Mastering Advanced SQL: Part 3

Efficient Code Splitting in React