Hadoop: The ultimate list of frameworks
As a developer, understanding the Hadoop ecosystem can make you very valuable. Companies are leveraging it for more projects each day, and the average Hadoop developer salary is around $120,000 a year. At first glance, the Hadoop ecosystem can seem overwhelming (What is Hive? Pig? Flume? How do all of these frameworks fit together in Hadoop?). Thankfully, it's not as intimidating as it sounds. To get you started, we've put together this quick reference guide explaining the frameworks. Please note that this list focuses on the Apache open source Hadoop applications and frameworks; most of these are top-level projects, but some are incubating. Let's dive in.
Hadoop: This is a software library written in Java for processing large amounts of data in a distributed environment. It allows developers to set up clusters of computers, starting with a single node and scaling up to thousands of nodes.
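To make "processing data in a distributed environment" more concrete, here is a minimal single-machine sketch of the MapReduce model that later entries in this list mention. It is not Hadoop's Java API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and everything runs in one Python process rather than on a cluster.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single total."""
    return key, sum(values)

def word_count(lines):
    # On a real cluster, different nodes would map different blocks of input.
    # Here the "nodes" are simply iterations of a loop (an assumption of this sketch).
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

The point of the split is that the map and reduce steps only ever look at one line or one key at a time, which is what lets a framework spread them across many machines.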
Hive: Hive is a data warehousing framework that's built on Hadoop. It allows for structuring data and querying it using a SQL-like language called HiveQL. Developers can use Hive and HiveQL to run complex MapReduce jobs over structured data in a distributed file system without writing those jobs by hand. Hive is the closest thing to a relational database in the Hadoop ecosystem.
Pig: Pig is an application for transforming large data sets. Like Hive, Pig has its own high-level language, called Pig Latin, although Pig Latin is a data-flow scripting language rather than a SQL dialect. Where Hive is used for structured data, Pig excels at transforming semi-structured and unstructured data. Pig Latin allows developers to write complex MapReduce jobs without having to write them in Java.
Flume: Odds are that if you work in the Hadoop ecosystem, you will need to move large amounts of data around. Flume is a distributed service that helps collect, aggregate and move large amounts of log data. It's written in Java and typically delivers files directly into HDFS.
Drill: Why not use tools with cool names like Drill and Drillbits? Apache Drill is a schema-free SQL query engine for data exploration. Drill is billed as real SQL and not just "SQL-like," which allows developers or analysts to use their existing SQL knowledge to begin writing queries in minutes. Apache Drill is extensible with user-defined functions (UDFs).
Kafka: Another great tool for moving data around in Hadoop is Kafka, a distributed messaging system. Kafka is often used as the queuing system that feeds data to Storm.
Tez: If you're using YARN, you'll want to learn about the Tez project. Tez allows for building applications that process a DAG (directed acyclic graph) of tasks. Basically, Tez allows Hive and Pig scripts to be executed with fewer MapReduce jobs, which makes those scripts run faster.
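As a rough illustration of what "a DAG of tasks" means, the sketch below describes a job as a graph and asks Python's standard graphlib module for an order in which every task runs after the tasks it depends on. This is not the Tez API, and the task names are hypothetical; it only shows the dependency structure that such an engine schedules.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical job: each task maps to the set of tasks whose output it needs.
dag = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
    "write_result": {"aggregate"},
}

def execution_order(graph):
    """Return one valid order in which every task follows its dependencies."""
    return list(TopologicalSorter(graph).static_order())

order = execution_order(dag)
print(order)
```

Note that the two scan tasks have no dependency on each other, so a real engine would be free to run them in parallel; the list returned here is just one of several valid orders.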
Sqoop: Do you have structured data in a relational database, such as SQL Server or MySQL, and want to pull that data into your Big Data platform? Well, Sqoop can help. Sqoop allows developers to transfer data from a relational database into Hadoop.
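The idea of pulling relational rows out into flat files can be sketched on a single machine as follows. This is not Sqoop itself: an in-memory SQLite database stands in for SQL Server or MySQL, a string stands in for a file in HDFS, and the table name and helper function are hypothetical.

```python
import csv
import io
import sqlite3

def table_to_delimited_text(conn, table):
    """Read every row of `table` and return it as comma-delimited text."""
    # The table name is interpolated directly; acceptable for a fixed demo name,
    # but never do this with untrusted input.
    cursor = conn.execute(f"SELECT * FROM {table} ORDER BY 1")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(cursor)  # each row from the cursor becomes one text line
    return buffer.getvalue()

# Stand-in for the source relational database (assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])

exported = table_to_delimited_text(conn, "customers")
print(exported)
```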
Storm: Hadoop works in batch processing, but many applications need real-time processing, and this is where Storm fits in. Storm processes streaming data, so analysis can happen in real time. Storm boasts a benchmark speed of over a million tuples processed per second per node.
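The difference between batch and streaming can be sketched with plain Python generators: instead of waiting for all the input and producing one final answer, the code below emits an updated result after every incoming event. This is only an analogy for stream processing in general, not Storm's API, and the event names are invented.

```python
from collections import Counter

def event_source(events):
    """Stands in for a live feed: yields one event at a time."""
    for event in events:
        yield event

def running_count(stream):
    """Emit an updated (page, count) result immediately after each event."""
    counts = Counter()
    for page in stream:
        counts[page] += 1
        yield page, counts[page]

# Hypothetical page-view events arriving one by one.
views = ["home", "cart", "home", "checkout", "home"]
results = list(running_count(event_source(views)))
print(results)
```

A batch job over the same data would only report the final totals; here a consumer sees the count for "home" move from 1 to 2 to 3 as the events arrive.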
Ambari: One of the most useful tools you'll use if you're administering a Hadoop cluster, Ambari allows administrators to install, manage and monitor Hadoop clusters with a simple Web interface. Ambari provides an easy-to-follow wizard for setting up a Hadoop cluster of any size.
HBase: When developing applications, you'll often want real-time read/write access to your data. Hadoop runs processes in batch, and files in HDFS can't be modified in place once they are written, which is what makes HBase so popular. HBase provides the capability to modify data in real time and still run in an HDFS environment.
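To give a feel for the kind of access this describes, here is a toy in-memory table keyed by row, with columns named in the family:qualifier style that HBase uses. It is a simplified sketch, not the HBase client API; the class and method names are hypothetical, and nothing here is persisted or distributed.

```python
class ToyTable:
    """A toy row-key store: each row key maps to a dict of column -> value."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, column, value):
        """Write or overwrite a single cell; no need to rewrite the whole data set."""
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key):
        """Random read of one row by its key (empty dict if the row is absent)."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start, stop):
        """Yield rows in sorted key order for keys in the range [start, stop)."""
        for key in sorted(self._rows):
            if start <= key < stop:
                yield key, dict(self._rows[key])

table = ToyTable()
table.put("user#001", "info:name", "Ada")
table.put("user#002", "info:name", "Grace")
table.put("user#001", "info:name", "Ada Lovelace")  # in-place update of one cell

print(table.get("user#001"))
print(list(table.scan("user#001", "user#003")))
```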
Mahout: Looking to run singular value decomposition, k-nearest neighbors, or Naive Bayes classification in a Hadoop environment? Mahout can help. Mahout provides specialized data analysis algorithms that run over data in a distributed file system. Think of Mahout as a Java library of distributed algorithms to reference in MapReduce jobs.
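For readers who haven't met these algorithms, the following is a minimal single-machine sketch of one of them, k-nearest neighbors classification. It is not Mahout's implementation or API; the function name and the sample points are invented, and a library like Mahout exists precisely because this naive version does not scale to cluster-sized data.

```python
import math
from collections import Counter

def knn_classify(training, point, k=3):
    """Return the majority label among the k training points closest to `point`.

    `training` is a list of (features, label) pairs with numeric feature tuples.
    """
    nearest = sorted(training, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled points forming two well-separated clusters.
training = [
    ((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
    ((8, 8), "b"), ((9, 8), "b"), ((8, 9), "b"),
]

print(knn_classify(training, (2, 2)))
print(knn_classify(training, (8.5, 8.5)))
```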
ZooKeeper: ZooKeeper provides centralized services for Hadoop cluster configuration management, synchronization and group services. For example, think about how a global configuration file works in a Web application; ZooKeeper is like that configuration file, but at a much higher level.
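The configuration-file analogy can be pushed one step further with a toy sketch: a store of path-like keys where a client can read a value and ask to be notified when it next changes. This is not the ZooKeeper client API; the class, method names and paths are hypothetical, and the real service is replicated across servers rather than held in one Python object.

```python
class ToyConfigStore:
    """A toy central config store with one-shot change notifications."""

    def __init__(self):
        self._nodes = {}     # path -> value
        self._watchers = {}  # path -> callbacks waiting for the next change

    def set(self, path, value):
        """Store a value and notify, then discard, any watchers on that path."""
        self._nodes[path] = value
        for callback in self._watchers.pop(path, []):
            callback(path, value)

    def get(self, path, watch=None):
        """Read a value; optionally register a callback for the next change."""
        if watch is not None:
            self._watchers.setdefault(path, []).append(watch)
        return self._nodes.get(path)

store = ToyConfigStore()
seen = []

store.set("/app/db_host", "db1")
current = store.get("/app/db_host", watch=lambda path, value: seen.append((path, value)))

store.set("/app/db_host", "db2")  # the watcher fires once
store.set("/app/db_host", "db3")  # no watcher is registered any more

print(current, seen)
```

The one-shot behavior is a deliberate choice in this sketch: a client that wants to keep following a value has to register a new watch each time it reads.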
Spark: A fast, general-purpose engine for data processing, Spark works in memory and boasts speeds up to 100 times faster than Hadoop MapReduce. Spark supports Scala, Python and Java. It also includes a machine learning library (MLlib), which provides scalable machine learning algorithms comparable to Mahout's.
Zeppelin: Zeppelin is a Web-based notebook for interactive data analytics. It makes data visualization as easy as drag and drop. Zeppelin works with Hive, Spark (in all of its supported languages) and Markdown.
OK, so you may be feeling a bit overwhelmed at realizing how much is on this list (especially once you notice that it's not even a complete list, as new frameworks are being developed each day). But the important thing is that you work toward a basic understanding of these frameworks, so that when a new one pops up, you can relate it back to one of the above. By learning the basic frameworks you're building a strong foundation that will accelerate your learning in the Hadoop ecosystem.