Big Data Management

Hadoop Cluster Build

As part of its ongoing research and development, PSG found that it needed to work with big data, which in turn required a big data store. Since we already had a local cloud running on commodity hardware with OpenStack, we decided to use Apache Ambari to deploy and manage a Hadoop cluster as that data store. Ambari not only lets us manage the cluster but also provides other configurable services for data analytics.

The first thing we had to do was decide how many resources our cluster would use. The Hortonworks Apache Ambari Installation document gave us general guidelines on this. Next, we prepared our environment by configuring our OpenStack implementation to offer these resources as virtual machines (VMs).

Once the environment was ready, we had OpenStack provision the VMs, and we prepared them for the Ambari installation. This meant updating the package index, upgrading the existing software packages, installing the Ambari prerequisites and packages, and establishing password-less SSH between the VMs.
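The preparation steps above can be sketched as an ordered list of shell commands per VM. This is only an illustration, not our actual script: it assumes Debian/Ubuntu VMs that use apt-get, and the function names, the package list, the user name, and the key path are all hypothetical placeholders. The functions only build the command strings; running them on each VM is left to whatever remote execution tool you prefer.

```python
import shlex


def prep_commands(packages):
    """Return the shell commands that prepare one VM, in order.

    Assumes an apt-based distribution; `packages` is whatever
    prerequisite and Ambari packages your setup needs.
    """
    quoted = " ".join(shlex.quote(p) for p in packages)
    return [
        "sudo apt-get update",       # refresh the package index
        "sudo apt-get -y upgrade",   # upgrade existing packages
        "sudo apt-get -y install " + quoted,
    ]


def ssh_key_commands(user, hosts, key_path="~/.ssh/id_rsa"):
    """Return commands that create a key pair without a passphrase
    and copy the public key to every other VM."""
    commands = [f"ssh-keygen -t rsa -N '' -f {key_path}"]
    commands += [f"ssh-copy-id -i {key_path}.pub {user}@{host}" for host in hosts]
    return commands


if __name__ == "__main__":
    # Hypothetical package and host names, for illustration only.
    for line in prep_commands(["ntp", "ambari-agent"]):
        print(line)
    for line in ssh_key_commands("ubuntu", ["node1", "node2"]):
        print(line)
```

Note that ssh-copy-id still asks for the remote password the first time it runs, so a fully unattended build would more likely inject the public key when the VM is provisioned.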

Finally, we were ready to install Ambari and configure a cluster. And we did. We built, tore down, and rebuilt many clusters in our quest for the cluster we wanted. Along the way we discovered many caveats and best practices, which turned into a very specific sequence of steps. Once our cluster was built, we realized we also wanted to be able to repeat the build. We had our primary cluster, but we wanted to experiment with things like cluster configuration and cluster security without risking our active cluster.

So, we scripted the cluster build. The script runs from a host VM in our OpenStack environment. It tells OpenStack to provision the VMs, then the script preps the VMs, and finally the script builds and configures a cluster. By executing a single command, we are able to build a complete cluster that is ready to be used.
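The overall flow of such a script (provision, prepare, then configure) can be sketched as follows. This is a minimal outline under stated assumptions, not the script itself: build_cluster and the three callables it receives are hypothetical names, and the real work of talking to OpenStack and Ambari would live inside those callables.

```python
def build_cluster(names, provision, prepare, configure):
    """Run the three build stages in order and return the hosts.

    names     -- the VM names to create
    provision -- callable(name) -> host; asks the cloud for one VM
    prepare   -- callable(host); installs prerequisites on one VM
    configure -- callable(hosts); builds the cluster across all VMs
    """
    # Stage 1: provision every VM before touching any of them.
    hosts = [provision(name) for name in names]
    # Stage 2: prepare each VM for the Ambari installation.
    for host in hosts:
        prepare(host)
    # Stage 3: build and configure the cluster on the prepared VMs.
    configure(hosts)
    return hosts


if __name__ == "__main__":
    # Dry run with stand-in stages that only print what they would do.
    build_cluster(
        ["node1", "node2", "node3"],
        provision=lambda name: print("provision", name) or name,
        prepare=lambda host: print("prepare", host),
        configure=lambda hosts: print("configure", hosts),
    )
```

Passing the stages in as callables keeps the ordering logic separate from the environment-specific details, so the same outline can be dry-run or pointed at a different environment.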

Now we are sharing this script with the public so you too can stand up a cluster quickly, accurately, and cheaply. The value of such a script really shines when you need identical clusters in multiple environments (e.g. dev, test, and prod), or when you need a cluster to test something quickly and don't want to spend a lot of hours getting a data store set up.

Here at PSG we are glad we could help you. If you would like assistance, please feel free to contact us.