在互联网环境下,存量市场的竞争越来越激烈,许多公司都将用户提到了战略高度。借助用户生命周期管理与用户画像等,实现对用户的精细化运营。而增长模型是支撑这些应用的基础数据模型,包括日活、新增与留存。

0x00 表模型概览

这套增长模型是基于公司客户端日志来设计的,ETL处理流程如下:

增长模型ETL流程

涉及的表有5张,请参考表说明:

表名 说明
ods.com_client_log_di Kafka上报的客户端日志
cdm.dwd_bhv_app_startup_di APP启动信息表
cdm.dws_bhv_device_actuser_di 用户活跃信息表
cdm.dim_bhv_device_ds 用户全量信息表 (新增)
ads.dm_bhv_device_remain_di 用户留存信息表

接下来会逐个介绍每个表的ETL逻辑。

0x01 客户端日志

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
create table if not exists ods.com_client_log_di (
log_id bigint comment '全局唯一日志ID'
,log_timestamp bigint comment '上报日志时间戳'
,device string comment '唯一设备ID'
,os string comment '操作系统'
,brand string comment '品牌'
,model string comment '机型'
,manufacturer string comment '制造商'
,ip string comment '客户端IP'
,network string comment '网络类型'
,app_version string comment '应用版本'
,package_name string comment '包名'
,sdk_version string comment 'SDK版本'
,event string comment '事件'
,page string comment '页面'
,extend map<string,string> comment '扩展属性集'
)
comment '客户端日志'
partitioned by (
dt string comment '日期分区'
)
stored as parquet
tblproperties('parquet.compression'='SNAPPY')
;

一般客户端日志的上报量非常大,采用 Kafka 入库 Hive 的方式,消费时延可以做到小时级别。

0x02 APP启动信息表

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
insert overwrite table cdm.dwd_bhv_app_startup_di partition (dt = '${YESTERDAY}')
select
cast(log_id as string) as log_id
,from_unixtime(log_timestamp div 1000, 'yyyy-MM-dd HH:mm:ss') as log_time
,device
,os
,brand
,model
,manufacturer
,ip
,network
,app_version
,package_name
,sdk_version
,from_unixtime(unix_timestamp(), 'yyyy-MM-dd HH:mm:ss') as etl_time
from ods.com_client_log_di
where dt = '${YESTERDAY}'
and event = 'app_startup'
;

0x03 用户活跃信息表

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
insert overwrite table cdm.dws_bhv_device_actuser_di partition (dt = '${YESTERDAY}')
select
device
,os
,ip
,from_unixtime(unix_timestamp(), 'yyyy-MM-dd HH:mm:ss') as etl_time
from (
select
device
,os
,ip
,row_number() over(partition by device order by log_id asc) as rank
from cdm.dwd_bhv_app_startup_di
where dt = '${YESTERDAY}'
and device is not null
and device != ''
) t
where rank = 1
;

0x04 用户全量信息表

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
insert overwrite table cdm.dim_bhv_device_ds partition(dt = '${YESTERDAY}')
select
coalesce(a.device,b.device) as device
,coalesce(b.activate_date,a.dt) as activate_date
,if(b.device is null,1,0) as is_new
,case when b.device is not null and a.device is not null
then concat(b.actlist,',1')
when b.device is null and a.device is not null
then concat(repeat('0,',datediff('${YESTERDAY}','2019-01-01')),'1')
when b.device is not null and a.device is null
then concat(b.actlist,',0')
else 'other'
end as actlist
,from_unixtime(unix_timestamp(), 'yyyy-MM-dd HH:mm:ss') as etl_time
from (
select dt, device
from cdm.dws_bhv_device_actuser_di
where dt = '${YESTERDAY}'
) a
full join (
select dt, device, activate_date, actlist
from cdm.dim_bhv_device_ds
where dt = date_sub('${YESTERDAY}',1)
) b
on a.device = b.device
;

0x05 用户留存信息表

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
insert overwrite table ads.dm_bhv_device_remain_di partition(dt) 
select
a.device
,udf.calc_remain(b.actlist, b.dt, a.dt, 1) as remain_s1d
,udf.calc_remain(b.actlist, b.dt, a.dt, 2) as remain_s2d
,udf.calc_remain(b.actlist, b.dt, a.dt, 3) as remain_s3d
,udf.calc_remain(b.actlist, b.dt, a.dt, 4) as remain_s4d
,udf.calc_remain(b.actlist, b.dt, a.dt, 5) as remain_s5d
,udf.calc_remain(b.actlist, b.dt, a.dt, 6) as remain_s6d
,udf.calc_remain(b.actlist, b.dt, a.dt, 7) as remain_s7d
,udf.calc_remain(b.actlist, b.dt, a.dt, 14) as remain_s14d
,udf.calc_remain(b.actlist, b.dt, a.dt, 30) as remain_s30d
,from_unixtime(unix_timestamp(), 'yyyy-MM-dd HH:mm:ss') as etl_time
,a.dt
from (
select device, dt
from cdm.dws_bhv_device_actuser_di
where dt in (date_add('${YESTERDAY}',-1),date_add('${YESTERDAY}',-2)
,date_add('${YESTERDAY}',-3),date_add('${YESTERDAY}',-4)
,date_add('${YESTERDAY}',-5),date_add('${YESTERDAY}',-6)
,date_add('${YESTERDAY}',-7),date_add('${YESTERDAY}',-14)
,date_add('${YESTERDAY}',-30))
) a
left join (
select dt, device, actlist
from cdm.dim_bhv_device_ds
where dt = '${YESTERDAY}'
) b
on a.device = b.device
;

0x06 延伸

正如开篇所说的,这套增长模式是简化版的,还可以衍生出很多的数据模型,比如新增留存信息表等。另外,每家公司对于日活、新增和留存的业务逻辑定义不一样,ETL处理流程会比本文复杂得多,比如剔除作弊用户等。

参考文献

数据仓库的初级手册