<code> Monkey </code>: What is Hadoop? (from an interview with Amr Awadallah)

First, it’s worth making the important clarifying point that Hadoop is not a database. Hadoop is a data processing system, and in fact, I would even go as far as saying Hadoop is an operating system. The core of an operating system boils down to a file system, the storage of files, and a process scheduling system that runs applications on top of these files.

There are many other components that help with devices, credentials and user access, and so on, but that is the core. Hadoop is exactly the same thing. The core of Hadoop is the Hadoop Distributed File System, which is a file system that runs across many nodes. It links together the file systems on many local nodes to make them into one big file system. Hadoop MapReduce is really the job scheduling system that takes care of scheduling jobs on top of all those nodes.
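To make the two halves concrete, here is a minimal single-process sketch in Python. It is not the Hadoop API: the node names, the `NODES` layout, and the functions `read_file`, `map_task`, and `run_job` are all hypothetical, and a real cluster would run the tasks in parallel on separate machines.

```python
from collections import Counter

# Hypothetical cluster: each node's local disk holds some blocks of one logical file.
NODES = {
    "node-1": ["the quick brown fox", "jumps over"],
    "node-2": ["the lazy dog"],
    "node-3": ["the end"],
}


def read_file(nodes):
    """File-system view: the blocks on every node read as one big file."""
    for name in sorted(nodes):
        yield from nodes[name]


def map_task(block):
    """Runs where the block lives; counts the words in that block only."""
    return Counter(block.split())


def run_job(nodes):
    """Scheduler view: one map task per block, then merge the partial counts."""
    total = Counter()
    for name in sorted(nodes):
        for block in nodes[name]:
            total += map_task(block)
    return total


print(list(read_file(NODES)))
print(run_job(NODES).most_common(1))
```

The point of the sketch is the division of labor: `read_file` hides which node holds which block, while `run_job` sends work to each block instead of moving the data to the work.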

The key distinction between Hadoop’s approach and that of database systems is that Hadoop, at its heart, does not require any structure in your data. You can upload files directly from anywhere, such as a web server, an RFID device, or a mobile phone, into Hadoop.

They could be images, videos, or just a bunch of bits. They don’t have to have a schema with column types and so on, which gives you tremendous agility and flexibility.

Hadoop has a very nice model that I sometimes refer to as schema on read. Defining your schema as you’re writing the data in limits what you can put in, because everything has to conform to the schema that you created. Hadoop instead allows you to define the schema as you’re reading the data out.
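Schema on read can be illustrated with a short Python sketch. Everything in it is an assumption made for illustration: the sample log lines, the `SCHEMA` definition, and the `read_with_schema` helper are hypothetical and not part of Hadoop. The raw text is stored untouched, and column names and types are applied only when a reader asks for them.

```python
import io

# Raw lines stored exactly as they arrived; nothing was validated on write.
RAW_LOG = "2011-03-01\t/home\t200\n2011-03-01\t/cart\t500\nnot a log line\n"

# Hypothetical schema, chosen by the reader at query time, not by the writer.
SCHEMA = [("day", str), ("path", str), ("status", int)]


def read_with_schema(stream, schema):
    """Yield one dict per line that fits the schema; skip lines that do not."""
    for line in stream:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != len(schema):
            continue  # the raw file may hold anything, so tolerate misfits
        try:
            yield {name: cast(value) for (name, cast), value in zip(schema, fields)}
        except ValueError:
            continue  # a value that cannot be cast to the column type


rows = list(read_with_schema(io.StringIO(RAW_LOG), SCHEMA))
print(rows)
```

A different reader could apply a different schema to the same bytes, which is the flexibility the schema-on-write approach gives up.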

That gives you a lot of flexibility and agility, since you can add files that have dynamic parts, like JSON, or that use newer formats like Avro. Avro is a very good project coming out of the Hadoop project that’s similar to Protocol Buffers from Google and Thrift from Facebook. Avro files carry a schema with them as well, but these schemas are semi-structured, rather than conforming to a strict relational model.
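As a small illustration of records with dynamic parts, the sketch below uses plain JSON lines and Python's standard `json` module. It does not use Avro, and the sample records and the `project` helper are hypothetical.

```python
import json

# Hypothetical JSON-lines input: every record names its own fields.
LINES = [
    '{"user": "a", "device": "phone"}',
    '{"user": "b", "device": "rfid", "tag_id": 17}',
    '{"user": "c"}',
]


def project(lines, fields):
    """Pick the requested fields from each record; absent ones become None."""
    for line in lines:
        record = json.loads(line)
        yield tuple(record.get(field) for field in fields)


print(list(project(LINES, ["user", "tag_id"])))
```

No record had to declare `tag_id` in advance; the field simply appears in the records that have it.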

That said, it’s also important to point out that structured data is a subset of unstructured data. The fact that Hadoop at its heart is a file system doesn’t mean that it can’t do relational database work. It can, in the same way that Windows at its heart is a file system, but you can run SQL Server on top of it to get the relational services, schemas, column types, and so on.

One of the key projects on top of Hadoop is Hive, which actually came out of Facebook. Hive essentially provides a relational database on top of Hadoop that utilizes the underlying file system but has a metastore that keeps the schema of the files.

It knows that a given file is tab delimited, for example, and it knows the column types for these files. Hive allows you to write SQL against these files. It will look up the schema and then write the MapReduce jobs for you, so that you don’t have to go and learn MapReduce from scratch.
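The flow described above (look up the schema, then generate map and reduce steps) can be sketched in Python. This is a toy, not Hive: the `METASTORE` and `FILES` dictionaries and the `count_group_by` function are hypothetical, and the only "query" supported is a single count with a group-by.

```python
from collections import defaultdict

# Hypothetical metastore: table name -> file layout and column types.
METASTORE = {
    "page_views": {"delimiter": "\t", "columns": [("path", str), ("status", int)]},
}

# Stand-in for a tab-delimited file in the distributed file system.
FILES = {"page_views": ["/home\t200", "/cart\t500", "/home\t200", "/home\t404"]}


def count_group_by(table, column):
    """Toy equivalent of: SELECT column, COUNT(*) FROM table GROUP BY column."""
    meta = METASTORE[table]                    # look up the schema
    names = [name for name, _ in meta["columns"]]
    index = names.index(column)
    cast = meta["columns"][index][1]

    def map_fn(line):                          # generated map step
        value = line.split(meta["delimiter"])[index]
        return cast(value), 1

    def reduce_fn(key, counts):                # generated reduce step
        return key, sum(counts)

    groups = defaultdict(list)                 # shuffle: group values by key
    for line in FILES[table]:
        key, one = map_fn(line)
        groups[key].append(one)
    return dict(reduce_fn(key, counts) for key, counts in groups.items())


print(count_group_by("page_views", "path"))
```

The caller names only a table and a column; the delimiter, the column position, and the type conversion all come from the stored schema.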

Now you have the flexibility of going either way. One approach is to get at the core of the MapReduce framework using Java MapReduce, which we sometimes refer to as being like assembly language for Hadoop. It gives you the most flexibility and performance, but it is fairly complex and difficult to learn.

Alternatively, you can go in with a high-level language like Hive. In this case, you can just use SQL, if that’s what you’re used to, to write your job. Hive itself has lots of optimizations. It understands the underlying MapReduce framework, so it can properly map your problem on top of your data.