Accelerating Hadoop With Cumulus Linux

By Atul Chavan - 4:43 PM

One of the questions I’ve encountered in talking to our customers has been “What environments are a good example of something that works well on top of the Layer-3 Clos design?” Most engineers are familiar with the classic Layer-2 based Core/Distribution/Access/Edge model for building a data center. While that model has served us well for older client-server, north-south traffic flows and in smaller deployments, modern distributed applications stress it to its breaking point. Since L2 designs normally need to be built around pairs of devices, relying on individual platforms to carry 50% of your data center traffic can present a risk at scale. On top of this, you need a long list of protocols, which can result in a brittle and operationally complex environment as you deploy tens of devices.
Hence the rise of the L3 Clos approach, which combines many small boxes, each carrying only a subset of your traffic, and runs industry-standard protocols that have a long history of operational stability and ease of troubleshooting. While the approach can be applied to many different problems, building a practical implementation is the best way to show that it works. With that in mind, we recently set up a Hadoop cluster, which led to a solution validation guide we are publishing with our new release.
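To put rough numbers on the "subset of your traffic" point, here is a minimal sketch. It assumes traffic is spread evenly across the devices at a given tier, which is a simplification, and the function name and device counts are purely illustrative rather than taken from our validated design.

```python
def traffic_share_per_device(num_devices: int) -> float:
    """Fraction of a tier's traffic carried by one device.

    Assumes (hypothetically) that traffic is balanced evenly
    across all devices in the tier.
    """
    if num_devices < 1:
        raise ValueError("need at least one device")
    return 1.0 / num_devices


# Compare a classic pair of devices with wider Clos tiers.
for n in (2, 4, 8, 16):
    share = traffic_share_per_device(n)
    print(f"{n:>2} devices: each carries {share:.1%} of the traffic")
```

Under that even-spread assumption, a pair of devices puts half of the traffic on each box, while every box you add to the tier shrinks the share that any single failure can affect.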
Big Data analytics is becoming increasingly common across businesses of all sizes. With the growth of genomic, geographic, social-graph, search-indexing and other large data sources, the ability of a single computer to process these data sets in a reasonable time has diminished. Distributed processing models like Hadoop have increasingly become the way to approach the data, breaking the processing down into steps that can be distributed, along with the data, across the compute nodes.
Many of the published Hadoop solutions assume that the network is expensive, so they have focused on servers attached at 1Gig Ethernet, emphasizing locality to keep traffic on the same ToR and off the rest of the network. Even 10Gig Ethernet cannot keep up with locally attached storage. But you can now build a low-to-no oversubscription network fabric at 10Gig and higher, and most Big Data class servers ship with 10Gig Ethernet integrated on the motherboard (LOM). Together, these bring the price of such a design in line with solutions built around 1Gig Ethernet, and they free your Big Data results from being so tightly tied to the locality of the data in your environment.
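For readers who want to check what "low-to-no oversubscription" means for their own racks, the ratio is simply the server-facing bandwidth on a ToR divided by its uplink bandwidth. The sketch below uses made-up port counts and speeds as assumptions; they are not the configuration from our validation guide.

```python
def oversubscription_ratio(server_ports: int, server_gbps: float,
                           uplink_ports: int, uplink_gbps: float) -> float:
    """Server-facing bandwidth divided by uplink bandwidth on one ToR.

    A result of 1.0 means no oversubscription; higher values mean the
    servers can offer more traffic than the uplinks can carry.
    """
    downstream = server_ports * server_gbps
    upstream = uplink_ports * uplink_gbps
    if upstream <= 0:
        raise ValueError("uplink bandwidth must be positive")
    return downstream / upstream


# Hypothetical ToR layouts, for illustration only.
print(oversubscription_ratio(48, 10, 4, 40))  # 48 x 10Gig down, 4 x 40Gig up
print(oversubscription_ratio(16, 10, 4, 40))  # 16 x 10Gig down, 4 x 40Gig up
```

With these example numbers the first layout is 3:1 oversubscribed and the second is 1:1, which is the kind of fabric that lets Hadoop jobs depend less on data locality.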
When it came to building a network for Hadoop, we chose the enterprise-grade Hortonworks Data Platform (HDP) from Hortonworks as the platform to stand up and test for a new validated solution. Hortonworks, a major contributor to open source projects (Apache Hadoop, HDFS, Pig, Hive, HBase, Zookeeper), has extensive experience managing production-level Hadoop clusters. Given the open nature of both Cumulus Linux and Hortonworks, we were able to stand up our environment quickly and validate operations on the topology. As a follow-on to this project, we will combine this work with the automation power of tools like Ansible and have a demo in Cumulus Workbench showing how you can automate and deploy the environment, on both the network and the servers, using a single tool. Keep an eye out for it.
