我们知道在 SQL 中有一类函数叫做聚合函数,例如 sum()、avg()、max() 等等,这类函数可以将多行数据按照规则聚集为一行,一般来讲聚集后的行数是要少于聚集前的行数的。但是,有时候我们既要显示聚集前的数据,又要显示聚集后的数据,此时我们便引入了窗口函数。窗口函数主要用于 OLAP 数据分析。

0x00 说在前面

本文采用的数据表是 order_window(user_name, order_date, amount),详细如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
jack    2015-01-01    10
tony 2015-01-02 15
jack 2015-02-03 23
tony 2015-01-04 29
jack 2015-01-05 46
jack 2015-04-06 42
tony 2015-01-07 50
jack 2015-01-08 55
mart 2015-04-08 62
mart 2015-04-09 68
neil 2015-05-10 12
mart 2015-04-11 75
neil 2015-06-12 80
mart 2015-04-13 94

在深入研究 over子句 之前,一定要注意:在SQL处理中,窗口函数都是最后一步执行,而且仅位于order by字句之前。

0x01 OVER子句

官方 OVER子句 包括几个部分:

  • 聚合函数(count, sum, min, max, avg)
  • OVER 子句
  • PARTITION BY 子句
  • ORDER BY 子句
  • WINDOW 子句

结合具体的业务场景,SQL 语句如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
---1)201504月份的销售额
select sum(amount) as total_amt
from order_window
where substr(order_date,1,7)='2015-04'
;
---2)201504月份的订单明细与销售额
select user_name, order_date, amount
,sum(amount) over() as total_amt
from order_window
where substr(order_date,1,7)='2015-04'
;
---3)客户的订单明细与月购买金额
select user_name, order_date, amount
,sum(amount) over (partition by month(order_date)) month_amt
from order_window
;
---4)客户的订单明细与累计购买金额
select user_name, order_date, amount
,sum(amount) over (partition by month(order_date) order by order_date) month_add_amt
from order_window
;
---5)不同窗口的销售额
select
user_name
,order_date
,amount
,sum(amount) over() as sample1 --所有行相加
,sum(amount) over(partition by user_name) as sample2 --按name分组,组内数据相加
,sum(amount) over(partition by user_name order by order_date) as sample3 --按name分组,组内数据累加
,sum(amount) over(partition by user_name order by order_date rows between UNBOUNDED PRECEDING and current row) as sample4 --和sample3一样,由起点到当前行的聚合
,sum(amount) over(partition by user_name order by order_date rows between 1 PRECEDING and current row) as sample5 --当前行和前面一行做聚合
,sum(amount) over(partition by user_name order by order_date rows between 1 PRECEDING and 1 FOLLOWING) as sample6 --当前行和前边一行及后面一行
,sum(amount) over(partition by user_name order by order_date rows between current row and UNBOUNDED FOLLOWING) as sample7 --当前行及后面所有行
from order_window
;

0x02 WINDOW子句

带有窗口规范的OVER子句。窗口可以在WINDOW子句中单独定义。窗口规范支持如下格式:

1
2
3
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
关键字 说明
PRECEDING 表示当前行之前的行
UNBOUNDED PRECEDING 表示当前行之前无边界行,即第一行
num PRECEDING 表示当前行之前第num行
CURRENT ROW 表示当前行
FOLLOWING 表示当前行后面的行
UNBOUNDED FOLLOWING 表示当前行后面无边界行,即最后一行
num FOLLOWING 表示当前行后面第num行

当缺少WINDOW子句并指定使用ORDER BY时,窗口规范默认为RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW,即从第一行到当前行。

当缺少ORDER BY和WINDOW子句时,窗口规范默认为ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING,即第一行到最后一行。

0x03 窗口函数

窗口函数 描述
LAG() 返回分区中当前行之前行(可以指定第几行)的值。如果没有行,则返回null。
LEAD() 返回分区中当前行后面行(可以指定第几行)的值。如果没有行,则返回null。
FIRST_VALUE 返回相对于窗口中第一行的指定列的值。
LAST_VALUE 返回相对于窗口中最后一行的指定列的值。

结合具体的业务场景,SQL 语句如下:

1
2
3
4
5
6
7
select user_name, order_date, amount
,lag(order_date,1,'1900-01-01') over(partition by user_name order by order_date) as pre_order_date --当前订单的上一单日期
,lead(order_date,1,'9999-12-31') over(partition by user_name order by order_date) as fol_order_date --当前订单的下一单日期
,first_value(amount,true) over (partition by user_name order by order_date) as first_order_amt --截至当前订单的第一单金额
,last_value(amount,true) over (partition by user_name order by order_date) as last_order_amt --截至当前订单的最后一单金额
from order_window
;

参考文献

Windowing and Analytics Functions
Hive窗口函数
窗口函数与分析函数