July 13th, 2020

Author

Bruce Liu

In the past decade, two software breakthroughs have had a lasting impact on digital transformation: machine learning/artificial intelligence (ML/AI) and big data computing engines. Many people are familiar with ML/AI-powered functions such as facial and voice recognition or recommendation feeds. Big data computing engines, on the other hand, are less well known outside developer circles. In this post, we highlight two widely adopted big data computing engines.

Apache Flink

Released in 2011, Apache Flink is an open-source stream-processing framework developed by the Apache Software Foundation. Tech giants have adopted it widely because it scales well across large computer clusters. For example, Alibaba chose Apache Flink to process data for its various e-commerce platforms because Flink supports both stream and batch processing. Over time, Alibaba has become the most prominent tech company contributing to the development of Apache Flink.

Apache Flink is Alibaba's preferred data computing engine

Alibaba even developed its own in-house version of Flink, called Blink. Blink serves 70% of Alibaba's internal business, and AliCloud also runs it for new cloud solutions such as ET Brain and Data Middle Platform.

Blink: Alibaba's in-house Apache Flink

In addition to Alibaba, nearly all other cloud giants have adopted Apache Flink for internal and external services. For example, Amazon Kinesis, the AWS service for real-time video and data stream analytics, supports Apache Flink. Netflix also uses Apache Flink for massive data processing.

Apache Spark

Apache Spark is another popular open-source distributed cluster-computing framework for big data, originally developed at UC Berkeley's AMPLab. Like Apache Flink, it handles both stream and batch processing of large data sets. Spark is best known for its in-memory processing, which is widely credited for its speed, especially on batch workloads. The original creators of Apache Spark started Databricks in 2013. Like other open-source companies such as MongoDB and Elastic, Databricks provides both a free version of Spark and premium SaaS services.
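To make the difference between batch and stream processing concrete, here is a minimal sketch in plain Python. It does not use the Spark or Flink APIs; the function names and the sample `orders` data are made up for illustration. A batch job sees the whole data set and produces one result, while a streaming job updates its result as each event arrives.

```python
from typing import Iterable, Iterator, Sequence


def batch_total(events: Sequence[int]) -> int:
    """Batch: the full data set is available, so compute the answer once."""
    return sum(events)


def stream_totals(events: Iterable[int]) -> Iterator[int]:
    """Streaming: emit an updated running total as each event arrives."""
    running = 0
    for value in events:
        running += value
        yield running


# Hypothetical order amounts, used only for this illustration.
orders = [20, 5, 12]

batch_result = batch_total(orders)
stream_results = list(stream_totals(iter(orders)))

print(batch_result)
print(stream_results)
```

Once the stream has delivered every event, its last running total matches the batch result. Engines such as Flink and Spark apply the same idea across many machines.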

5 Trillion Records Processed Per Day on Databricks

Leveraging the popularity of Apache Spark on the cloud, Databricks created a unified analytics platform (SaaS) that runs on major public clouds to better serve its large customer base. Today, Databricks provides not only big data computing but also many newer functions such as AutoML, cybersecurity, genomics, and graph processing. Databricks raised $400 million at a $6.2 billion valuation in late 2019.

Databricks' Unified Data analytics platform

Compared to Apache Flink, Apache Spark is more advanced in machine learning and is used by more than 500,000 data scientists worldwide. Machine learning usually takes three steps: a) ETL (extract, transform, load), b) training, and c) inference. Apache Spark is critical in the ETL stage.
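As a rough illustration of the ETL stage, the sketch below uses only the Python standard library rather than Spark itself. The sample CSV, the column names, and the three function names are assumptions made up for this example. A real pipeline would run the same three steps over far larger data on a cluster.

```python
import csv
import io

# Hypothetical raw input; in practice this would come from files or a message queue.
RAW_CSV = """user,amount
alice,10.50
bob,not_a_number
alice,4.50
"""


def extract(text):
    """Extract: parse raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert amounts to floats and drop rows that fail to parse."""
    clean = []
    for row in rows:
        try:
            clean.append((row["user"], float(row["amount"])))
        except ValueError:
            continue  # skip malformed records
    return clean


def load(records):
    """Load: aggregate per-user totals into a dict, ready for a training step."""
    totals = {}
    for user, amount in records:
        totals[user] = totals.get(user, 0.0) + amount
    return totals


features = load(transform(extract(RAW_CSV)))
print(features)
```

The malformed `bob` row is discarded during the transform step, so only clean records reach the load step.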

Nvidia recently announced end-to-end GPU acceleration for Apache Spark 3.0. Nvidia claimed that its Ampere-based GPU (A100) offers 20x performance improvements over the previous Volta GPU architecture. In Nvidia CEO Jensen Huang's own words: "Native GPU acceleration for Spark pipeline, from extracting, transforming and loading the data to training to inference, delivers the performance and the scale needed to finally connect the potential of big data with the power of AI."