May 21, 2015

How to fix Incompatible clusterIDs in Hadoop?

When you are installing and setting up your Hadoop cluster, you might face an issue like the one below.
FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool (Datanode Uuid unassigned) service to master/ Exiting. Incompatible clusterIDs in /home/hadoop/hadoop/data: namenode clusterID = CID-68a4c0d2-5524-486e-8bc9-e1fc3c5c2e29; datanode clusterID = CID-c6c3e9e5-be1c-4a3f-a4b2-bb9441a989c5
I have quoted only the first two lines of the error; the full stack trace is considerably longer.

You might not have formatted your name node properly. If this happened in a test environment, you can simply delete the data and name node folders and reformat HDFS. To format, run the command below.

hdfs namenode -format
But if you have a lot of data in your Hadoop cluster, you can't simply format it. In that case, this post is for you.

First, stop all running Hadoop processes. Then log in to your name node and find the value of the dfs.namenode.name.dir property (in hdfs-site.xml). Run the command below with your namenode folder.
cat <>/current/VERSION
Then you will see content like below.
#Thu May 21 08:29:01 UTC 2015
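The timestamp is just the first line. The full file is a small set of key=value properties; the clusterID line is the one we care about. A namenode VERSION file typically looks like the following (the clusterID is the namenode ID from the error message quoted earlier; the other values are illustrative):

```properties
#Thu May 21 08:29:01 UTC 2015
namespaceID=1234567890
clusterID=CID-68a4c0d2-5524-486e-8bc9-e1fc3c5c2e29
cTime=0
storageType=NAME_NODE
blockpoolID=BP-1234567890-192.168.33.10-1432196941000
layoutVersion=-60
```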
Copy the clusterID from the namenode. Then log in to the problematic slave node and find the datanode data folder (the dfs.datanode.data.dir property). Run the command below to edit the VERSION file.
vim <>/current/VERSION 
Your datanode VERSION file will look like below. Replace its clusterID with the one you copied from the name node.
#Thu May 21 08:31:31 UTC 2015
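A datanode VERSION file has a similar shape; change only the clusterID line so that it matches the namenode's value, and leave everything else untouched. Illustratively (this shows the mismatching datanode ID from the error message; other values are placeholders):

```properties
#Thu May 21 08:31:31 UTC 2015
storageID=DS-1234abcd-5678-90ef-aaaa-bbbbccccdddd
clusterID=CID-c6c3e9e5-be1c-4a3f-a4b2-bb9441a989c5
cTime=0
storageType=DATA_NODE
layoutVersion=-56
```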
Then start Hadoop again and everything will be okay!

Hadoop MultipleInputs Example

Let's assume you are working for ABC Group, which owns the ABC America airline, ABM Mobile, ABC Money, an ABC hotel and so on. ABC this and that. So you have multiple data sources with different types and columns, which means you can't run an ordinary single Hadoop job on all the data.

You got several data files from all these businesses.
(I edited this data file 33 times to get it aligned. ;) Don't tell anyone!)

So your job is to calculate the total amount each person spent across ABC Group. You could run a job for each company and then run another job to calculate the sum, but what I'm going to tell you is "NOOOO! You can do this with one job." Your Hadoop administrator will love this idea.
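Conceptually, the single job just sums every amount it sees for a customer ID, no matter which business the record came from. Stripped of the Hadoop machinery, the reduce-side idea looks like this plain-Java sketch (the record layout and sample values are hypothetical, except that they reproduce two rows of the result shown later in this post):

```java
import java.util.HashMap;
import java.util.Map;

public class SpendAggregator {
    // Sums amounts per customer id. In the real job this logic lives in the
    // Reducer; here it is a plain method so the idea is easy to see.
    public static Map<String, Integer> aggregate(String[][] records) {
        Map<String, Integer> totals = new HashMap<>();
        for (String[] record : records) {
            String customerId = record[0];
            int amount = Integer.parseInt(record[1]);
            totals.merge(customerId, amount, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical records from different ABC businesses: {customerId, amount}
        String[][] records = {
            {"12345678", "300"},  // airline booking
            {"12345678", "200"},  // hotel stay
            {"23452345", "937"},  // mobile bill
        };
        Map<String, Integer> totals = aggregate(records);
        System.out.println(totals.get("12345678")); // 500
        System.out.println(totals.get("23452345")); // 937
    }
}
```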

You need to develop a custom InputFormat and a custom RecordReader. I created the RecordReader as an inner class of the custom InputFormat. A sample InputFormat should look like below.
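Here is a sketch of such a class, in the same spirit as my sample project (the class name, the comma-separated column layout, and the choice of which fields become key and value are assumptions; adjust them to your files):

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Custom InputFormat with its RecordReader as an inner class.
public class AirlineInputFormat extends FileInputFormat<Text, Text> {

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split,
                                                       TaskAttemptContext context) {
        return new AirlineRecordReader();
    }

    public static class AirlineRecordReader extends RecordReader<Text, Text> {
        // Delegate the raw line reading to Hadoop's own LineRecordReader.
        private final LineRecordReader lineReader = new LineRecordReader();
        private final Text key = new Text();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            lineReader.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            if (!lineReader.nextKeyValue()) {
                return false;
            }
            // Hypothetical layout: comma-separated, customer id first, amount last.
            String[] fields = lineReader.getCurrentValue().toString().split(",");
            key.set(fields[0].trim());
            value.set(fields[fields.length - 1].trim());
            return true;
        }

        @Override
        public Text getCurrentKey() { return key; }

        @Override
        public Text getCurrentValue() { return value; }

        @Override
        public float getProgress() throws IOException {
            return lineReader.getProgress();
        }

        @Override
        public void close() throws IOException { lineReader.close(); }
    }
}
```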

The nextKeyValue() method is the place where you should write the parsing code that matches your data files.

Developing custom InputFormat classes alone is not enough; you also need to change the main class of your job. Your main class should look like below.
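A driver along these lines would do it. Note that BookInputFormat, MobileInputFormat, and SpendSumReducer are hypothetical names standing in for your other custom classes; the MultipleInputs call itself is the real Hadoop API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultiInputDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "abc-group-spend");
        job.setJarByClass(MultiInputDriver.class);

        // One addInputPath call per data source, each with its own InputFormat.
        MultipleInputs.addInputPath(job, new Path(args[0]), AirlineInputFormat.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), BookInputFormat.class);
        MultipleInputs.addInputPath(job, new Path(args[2]), MobileInputFormat.class);

        // No job-wide mapper is set; the sum happens in the reducer.
        job.setReducerClass(SpendSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[3]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```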

The MultipleInputs.addInputPath() calls add your custom inputs to the job. Note that you don't set a Mapper class separately here. If you want, you can develop separate mapper classes for your different file types; I'll write a blog post about that method as well.
To build the JAR from my sample project you need Maven. Run the command below to build the JAR; you can find the JAR file inside the target folder once the build finishes.
mvn clean install
/user/hadoop
     |----airline_data
     |    |----airline.txt
     |----book_data
     |    |----book.txt
     |----mobile_data
     |    |----...
With this change you may have to change the way you run the job. My file structure looks like above: I have a different folder for each data type. You can run the job with the command below.
hadoop jar /vagrant/muiltiinput-sample-1.0-SNAPSHOT.jar /user/hadoop/airline_data /user/hadoop/book_data /user/hadoop/mobile_data output_result
If you have followed all the steps properly, you will get the job's output like this.

The job will create a folder called output_result. If you want to see its content, you can run the command below.
hdfs dfs -cat output_result/part*
I ran my sample project on my sample data set. My result file looked like below.
12345678 500
23452345 937
34252454 850
43545666 1085
56785678 709
67856783 384
Source code of this project is available on GitHub

Enjoy Hadoop!

May 18, 2015

IMAP Java Test program and JMeter Script

One of my colleagues wanted to write a JMeter script to test IMAP, but his code failed, so I got involved as well. JMeter's BeanShell uses Java in the backend, so first I tried with a Maven project, and finally I managed to write code that lists the IMAP folders. The Java implementation is shown below.
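The core of it looks something like this JavaMail sketch (the host name and credentials are placeholders; you need the javax.mail dependency on the classpath, and a reachable IMAP server, for this to run):

```java
import java.util.Properties;

import javax.mail.Folder;
import javax.mail.Session;
import javax.mail.Store;

// Connects over IMAPS and lists every folder under the root folder.
public class ImapFolderLister {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("mail.store.protocol", "imaps");

        Session session = Session.getInstance(props);
        Store store = session.getStore("imaps");
        // Placeholder host and credentials; replace with your server's values.
        store.connect("imap.example.com", "user@example.com", "password");

        for (Folder folder : store.getDefaultFolder().list("*")) {
            System.out.println(folder.getFullName());
        }
        store.close();
    }
}
```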

Then we wrote code for the JMeter BeanShell sampler that prints the IMAP folder count. The code is shown below.

The complete Maven project is available on GitHub -

Increase memory and CPUs on Vagrant Virtual Machines

In my last post I showed how to create multiple nodes in a single Vagrant project. The "ubuntu/trusty64" box usually comes with 512 MB of RAM, but some developers need more RAM and more CPUs. In this post I'm going to show how to increase the memory and the number of CPUs in a Vagrant project. Run the commands below.
mkdir testProject1
cd testProject1
vagrant init
Then edit the Vagrantfile like below.
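A minimal Vagrantfile for this, using the VirtualBox provider block (the box name matches the one used in my earlier post; the exact values are of course up to you):

```ruby
Vagrant.configure(2) do |config|
  config.vm.box = "ubuntu/trusty64"

  config.vm.provider "virtualbox" do |vb|
    vb.memory = "8192"  # 8 GB of RAM
    vb.cpus = 2         # one core more than the single-core default
  end
end
```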

The changes above will increase the memory to 8 GB and add one more CPU core. Run the commands below to start the Vagrant machine and get SSH access.
vagrant up
vagrant ssh
If you have an existing project, you just have to add these lines; when you reload the machine (vagrant reload), the memory will be increased.

Multiple nodes on Vagrant

Recently I started working with Vagrant, a good tool that you can use for development. In this post I'm going to explain how to create multiple nodes in a Vagrant project.
mkdir testProject
cd testProject
vagrant init

If you run the commands above, they will create a Vagrant project for you. Now we have to make changes to the Vagrantfile. Your initial Vagrantfile will look like below.
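With the generated comments stripped out, the initial Vagrantfile is essentially just:

```ruby
Vagrant.configure(2) do |config|
  config.vm.box = "base"
end
```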

You have to edit the Vagrantfile and add content like below.
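Something along these lines defines three machines using config.vm.define. The names master and slave1 appear later in this post; slave2, the box name, and the private-network IPs are assumptions you can change:

```ruby
Vagrant.configure(2) do |config|
  config.vm.box = "ubuntu/trusty64"

  config.vm.define "master" do |master|
    master.vm.hostname = "master"
    master.vm.network "private_network", ip: "192.168.33.10"
  end

  config.vm.define "slave1" do |slave1|
    slave1.vm.hostname = "slave1"
    slave1.vm.network "private_network", ip: "192.168.33.11"
  end

  config.vm.define "slave2" do |slave2|
    slave2.vm.hostname = "slave2"
    slave2.vm.network "private_network", ip: "192.168.33.12"
  end
end
```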

The sample Vagrantfile above will create three nodes. Now run the command below to start the Vagrant virtual machines.
vagrant up

If you followed the instructions properly, you will get an output like below.

If you want to connect to master node, run below command.
vagrant ssh master
If you want to connect to slave1 node, run below command.
vagrant ssh slave1
Whichever machine you want to connect to, you just have to type vagrant ssh followed by the machine name. Hope this will help you!

May 14, 2015

Alfresco 5.0.1 Document Preview doesn't work on Ubuntu?

I recently installed Alfresco for testing in a Vagrant instance, using an Ubuntu image. But I forgot to install the libraries that need to be installed on Ubuntu before you install Alfresco. Fortunately, Alfresco worked without those dependencies.

The link above lists the libraries you should install before you install Alfresco. Run the command below to install them.
sudo apt-get install libice6 libsm6 libxt6 libxrender1 libfontconfig1 libcups2
But office document previews still didn't work properly; some documents worked but some didn't. Then I tried to debug it with one of my colleagues, and we found the text below in our logs.

Then we tried to run the soffice application from the terminal. Look what we got!
/home/vagrant/alfresco-5.0.1/libreoffice/program/oosplash: error while loading shared libraries: libXinerama.so.1: cannot open shared object file: No such file or directory
Then we realised that we should install that library. Run the command below on the Ubuntu server to install the missing library.

sudo apt-get install libxinerama1

Make sure you run both commands above!