Why Developers are Ditching Hadoop for Kubernetes
Author: Tom Hoblitzell | 5 min read | July 31, 2019
Big data software companies that used to run their applications on Hadoop are now switching to Kubernetes. What’s behind the recent move from Hadoop to Kubernetes, and where is the big data landscape going in the future?
A Brief History of Hadoop
Apache Hadoop is an open-source software framework for distributed storage and processing of massive data sets. Hadoop reached its 1.0 release in 2011, when the big data landscape was significantly more constrained in terms of network latency and scalability.
Hadoop consists of three main types of tools:
- Data: The Hadoop Distributed File System (HDFS), which stores data across hundreds or thousands of nodes.
- Orchestration: Apache Hadoop YARN for resource management and job scheduling.
- Middleware: Processing engines such as Apache Spark, Apache Pig, Apache Hive, and MapReduce for various big data workloads (a brief sketch follows this list).
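To make the middleware layer concrete, here is a minimal PySpark word count that reads from HDFS. It is a sketch only, assuming a running cluster: the namenode address and file path are placeholders.

```python
# Minimal PySpark word count -- middleware (Spark) reading from the data layer (HDFS).
# The HDFS URI and file path are placeholders; adjust them for a real cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-word-count").getOrCreate()

# Read a text file stored in HDFS.
lines = spark.sparkContext.textFile("hdfs://namenode:8020/data/logs.txt")

# Classic MapReduce-style word count expressed as Spark transformations.
counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)

# Print the ten most frequent words on the driver.
for word, count in counts.takeOrdered(10, key=lambda pair: -pair[1]):
    print(word, count)

spark.stop()
```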
What is Kubernetes?
Originally developed by Google, Kubernetes is an open-source container orchestration system for deploying, scaling, and managing Linux containers, whether on-premises or in the cloud. By packaging applications together with their required libraries and dependencies, containers create a consistent, reliable experience when running software in different computing environments.
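In practice, you describe the desired state of an application and Kubernetes keeps it running. The sketch below uses the official Kubernetes Python client to create a small Deployment; it assumes a valid kubeconfig, and the image name, labels, and replica count are placeholders.

```python
# Minimal sketch: creating a Deployment with the official Kubernetes Python client.
# Assumes a working kubeconfig; image, labels, and replica count are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside a pod

container = client.V1Container(
    name="analytics-worker",
    image="example.registry.io/analytics-worker:1.0",  # hypothetical image
    resources=client.V1ResourceRequirements(
        requests={"cpu": "500m", "memory": "1Gi"},
        limits={"cpu": "1", "memory": "2Gi"},
    ),
)

deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="analytics-worker"),
    spec=client.V1DeploymentSpec(
        replicas=3,
        selector=client.V1LabelSelector(match_labels={"app": "analytics-worker"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "analytics-worker"}),
            spec=client.V1PodSpec(containers=[container]),
        ),
    ),
)

# Kubernetes schedules the pods and keeps the desired replica count running.
client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```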
Given these benefits, adoption of container technologies like Kubernetes has grown dramatically over the past several years, boosted by the parallel growth of the cloud.
According to a 2018 survey by Cloud Foundry, 38 percent of companies now use container technology in production, and that share is rising. Kubernetes is the dominant container orchestrator in the public cloud: it powers 85 percent of containerized workloads on Google Cloud Platform and 65 percent on Microsoft Azure.
From Hadoop to Kubernetes
YARN is the closest analogue to Kubernetes in the Hadoop ecosystem. Since version 2.6 of Hadoop, YARN has been able to handle Docker containers.
One drawback of YARN and Hadoop, however, is that users are largely confined to JVM-based tooling. In the years since Hadoop's release, many other big data and machine learning stacks have emerged in languages such as Python, home to the popular libraries NumPy, pandas, and scikit-learn.
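For illustration only, here is the kind of Python-native workflow that grew up outside the JVM-centric Hadoop world; the data is synthetic and the feature names are made up.

```python
# Illustrative only: a NumPy / pandas / scikit-learn workflow on synthetic data.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "sessions": rng.integers(1, 50, size=1_000),
    "avg_spend": rng.normal(25.0, 8.0, size=1_000),
})
df["churned"] = (df["sessions"] < 10).astype(int)  # toy label

X_train, X_test, y_train, y_test = train_test_split(
    df[["sessions", "avg_spend"]], df["churned"], test_size=0.2, random_state=42
)

model = LogisticRegression().fit(X_train, y_train)
print(f"Holdout accuracy: {model.score(X_test, y_test):.2f}")
```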
Kubernetes is independent of any single programming language, operating system, or cloud provider, and that flexibility makes it an appealing choice for many developers. Technology consultant Erkan Yanar has speculated that Kubernetes could become an infrastructure layer of its own, forming a “lingua franca” between different tech ecosystems. You can even use Kubernetes, rather than YARN, as the orchestration layer for Hadoop workloads if you still want access to Hadoop-specific functionality.
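As a rough sketch of what that looks like, the snippet below points Spark at a Kubernetes cluster instead of YARN. It assumes Spark 2.4+ with Kubernetes support; the API server URL, container image, and namespace are placeholders, and production jobs are more commonly submitted with spark-submit --master k8s://... in cluster mode than built in-process like this.

```python
# Rough sketch: running Spark with Kubernetes as the scheduler instead of YARN.
# API server URL, image, and namespace are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-on-k8s-demo")
    .master("k8s://https://kubernetes.example.com:6443")  # Kubernetes API server (placeholder)
    .config("spark.kubernetes.container.image", "example.registry.io/spark-py:3.5.0")
    .config("spark.kubernetes.namespace", "data-jobs")
    .config("spark.executor.instances", "4")
    .getOrCreate()
)

# Executors now run as pods scheduled by Kubernetes rather than as YARN containers.
print(spark.range(1_000_000).selectExpr("sum(id)").collect())
spark.stop()
```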
Platforms like Hadoop were created during, and for, a different era in big data. When Hadoop was first released, Internet speeds were slower and most big data assets were stored on-premises rather than in the cloud.
Today, most organizations are not only using the cloud, but going for a multi-cloud, hybrid cloud, or private cloud strategy that combines multiple options. In this more complex big data ecosystem, businesses need the guarantee that applications running in one environment will behave identically when deployed in another.
That’s where technologies like containers and Kubernetes come in. By provisioning resources for containers and managing their lifecycle from start to finish, Kubernetes handles much of the IT groundwork that must be done before big data applications can run. Hybrid and multi-cloud environments are more popular than ever, which will likely only increase the adoption of container platforms like Kubernetes for big data.