Saturday 18 August 2012

Why is everyone talking about Hadoop?

Hadoop is an Apache project, which means it’s open source software, and it’s written in Java. It supports data-intensive distributed applications. It grew out of work at Google on MapReduce and the Google File System, and it allows applications to run across thousands of independent computers and petabytes of data.

Yahoo has been a big contributor to the project. The Yahoo Search Webmap is a Hadoop application that is used in every Yahoo search. Facebook claims to have the largest Hadoop cluster in the world. Other users include Amazon, eBay, LinkedIn, and Twitter. But now, there’s talk of IBM taking more than a passing interest.

According to IBM: “Apache Hadoop has two main subprojects:
  • MapReduce – The framework that understands and assigns work to the nodes in a cluster.
  • HDFS – A file system that spans all the nodes in a Hadoop cluster for data storage. It links together the file systems on many local nodes to make them into one big file system. HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes.”
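To make the MapReduce side of that more concrete, here’s the classic word-count example, sketched with Hadoop’s Java API. It isn’t tied to IBM or to any particular platform, and the input and output paths are just placeholders passed on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: each call receives one line of input and emits (word, 1) pairs.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: all the counts for one word arrive together and are summed.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // combine locally before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output locations in HDFS, supplied on the command line.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The mappers run on whichever nodes hold the relevant blocks of input, the framework shuffles the intermediate (word, count) pairs, and the reducers add them up. If a node fails mid-job, the work is simply rescheduled against another replica of the data held in HDFS.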

It goes on to say: “Hadoop changes the economics and the dynamics of large-scale computing. Its impact can be boiled down to four salient characteristics. Hadoop enables a computing solution that is:
  • Scalable – New nodes can be added as needed, and added without needing to change data formats, how data is loaded, how jobs are written, or the applications on top.
  • Cost effective – Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.
  • Flexible – Hadoop is schema-less, and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways enabling deeper analyses than any one system can provide.
  • Fault tolerant – When you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.”

According to Alan Radding, writing in IBM Systems Magazine (http://www.ibmsystemsmag.com/mainframe/trends/whatsnew/hadoop_mainframe/), IBM “is taking a federated approach to the big data challenge by blending traditional data management technologies with what it sees as complementary new technologies, like Hadoop, that address speed and flexibility, and are ideal for data exploration, discovery and unstructured analysis.”

Hadoop could run on any mainframe already running Java or Linux. Radding lists tools that make life easier, such as:
  • Sqoop – imports data from relational databases into Hadoop.
  • Hive – enables data to be queried using an SQL-like language called HiveQL (there’s a short sketch of this below).
  • Apache Pig – a high-level platform for creating the MapReduce programs used with Hadoop.

There’s also ZooKeeper, which provides a centralized infrastructure and services that enable synchronization across a cluster.
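To give a flavour of what HiveQL looks like in practice, here’s a minimal, hypothetical sketch that runs a query from a Java program over Hive’s JDBC driver. Everything in it – the driver class and connection URL (which vary with the Hive version), the host name, and the web_logs table with its page column – is an assumption for illustration only.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
  public static void main(String[] args) throws Exception {
    // Assumption: a HiveServer2 endpoint on port 10000 and the matching JDBC
    // driver on the classpath. Older Hive releases use a different driver
    // class and URL scheme.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection con = DriverManager.getConnection(
        "jdbc:hive2://hadoop-gateway:10000/default", "user", "password");

    // Hypothetical table: web_logs(page STRING, ...). HiveQL reads like SQL,
    // but Hive turns the query into MapReduce jobs behind the scenes.
    String hiveQl = "SELECT page, COUNT(*) AS hits "
                  + "FROM web_logs GROUP BY page ORDER BY hits DESC LIMIT 10";

    Statement stmt = con.createStatement();
    ResultSet rs = stmt.executeQuery(hiveQl);
    while (rs.next()) {
      System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
    }
    rs.close();
    stmt.close();
    con.close();
  }
}

The SQL-like surface sits on top of the same machinery as the word-count example above: Hive compiles the query into MapReduce jobs and runs them across the cluster.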

Harry Battan, data serving manager for System z, suggests that 2,000 instances of Hadoop could run on Linux on the System z, which would make a fairly large Hadoop configuration.

Hadoop still needs to be certified for mainframe use, but sites with the newer hybrid machines (z114 or z196) could have Hadoop today by putting it on their x86 blades, for which Hadoop is already certified, and it could then process data from DB2 on the mainframe. You can see why customers might want it on their mainframes: it gives them a way to get more information out of the masses of data they already possess, and data analysis is often seen as key to continuing business success for larger organizations.
