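The defining trait of an offline data warehouse is that it is offline: data becomes visible only after ETL processing. The most common arrangement is T+1, where a given day's data is available the following day. The ETL jobs, designed in advance, usually have to run in the small hours, and data engineers cannot get up before dawn every day to launch them by hand, so a task scheduling system is needed. It must not only start jobs on a timer but also manage the complex dependencies between them, so that thousands of jobs run in an orderly way and the data is ready by a set deadline (for example, before the workday starts) for colleagues to consult.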
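To make the two responsibilities above concrete (picking the T+1 data date and respecting task dependencies), here is a minimal sketch using only the Python standard library (3.9+). The task names and the `DEPENDENCIES` mapping are hypothetical examples, not taken from any real pipeline, and a production scheduler does far more (retries, alerting, distributed workers).

```python
from datetime import date, timedelta
from graphlib import TopologicalSorter

# Hypothetical ETL tasks: each key maps to the set of tasks it depends on.
DEPENDENCIES = {
    "extract_orders": set(),
    "extract_users": set(),
    "join_orders_users": {"extract_orders", "extract_users"},
    "daily_report": {"join_orders_users"},
}


def data_date(run_day: date) -> date:
    """T+1: a run on `run_day` processes the previous day's data."""
    return run_day - timedelta(days=1)


def execution_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return one valid order in which every task follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    day = data_date(date.today())
    for task in execution_order(DEPENDENCIES):
        print(f"run {task} for {day.isoformat()}")
```

In practice the script would be triggered by a timer (cron, for instance) in the early morning; the schedulers introduced below bundle that trigger, the dependency ordering, and monitoring into one system.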

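Because the task scheduling system is so critical, many companies with the engineering capacity choose to build their own. That carries a cost, but it pays off in stability and in meeting company-specific needs. For start-up internet companies, there are also many open-source task scheduling systems available; after repeated iterations their stability has improved considerably, in some cases reaching enterprise grade. The rest of this post introduces several open-source task schedulers, each with successful deployments at major internet companies, so you can use them with confidence.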

0x00 Luigi

Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.

github: https://github.com/spotify/luigi
documentation: https://luigi.readthedocs.io/en/stable/

0x01 Azkaban

Azkaban is a batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy to use web user interface to maintain and track your workflows.

github: https://github.com/azkaban/azkaban
documentation: https://azkaban.readthedocs.io/en/latest/

0x02 Apache Airflow

Apache Airflow (or simply Airflow) is a platform to programmatically author, schedule, and monitor workflows.

When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative.

Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
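The idea of executing tasks "on an array of workers while following the specified dependencies" can be illustrated with a small standard-library sketch. This is not Airflow's API or implementation; `run_dag`, its parameters, and the task names in the usage note are hypothetical, and a thread pool stands in for the workers.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_dag(dependencies, actions, max_workers=2):
    """Run every task after its upstream tasks, using a pool of workers.

    `dependencies` maps a task name to the set of task names it depends on;
    `actions` maps a task name to a zero-argument callable.
    Returns the task names in the order they finished.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    finished = []
    running = {}  # future -> task name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every task whose upstream tasks have all completed.
            for name in sorter.get_ready():
                running[pool.submit(actions[name])] = name
            # Block until at least one running task completes.
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                future.result()  # re-raise the task's exception, if any
                name = running.pop(future)
                finished.append(name)
                sorter.done(name)  # unblocks downstream tasks
    return finished
```

For example, with `deps = {"a": set(), "b": set(), "c": {"a", "b"}}` and a matching `actions` dictionary, `run_dag(deps, actions)` may run `a` and `b` concurrently but starts `c` only after both have finished.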

github: https://github.com/apache/airflow
documentation: https://airflow.apache.org/

0x03 Dolphin Scheduler

Dolphin Scheduler is a distributed, easily extensible visual DAG workflow scheduling system. It is dedicated to resolving complex dependencies in data processing, so that the scheduling system works out of the box for data processing.

github: https://github.com/apache/incubator-dolphinscheduler
documentation: https://dolphinscheduler.apache.org/en-us/docs/user_doc/quick-start.html

0x04 Further Thoughts

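The open-source task scheduling systems above are feature-rich, but their ease of use leaves room for improvement, and they are even less able to meet a company's specific requirements, so customizing them through secondary development is recommended.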
