May 21, 2015

How to fix Incompatible clusterIDs in Hadoop?

When you are installing and setting up your Hadoop cluster, you might face an issue like the one below.
FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool (Datanode Uuid unassigned) service to master/192.168.1.1:9000. Exiting. 
java.io.IOException: Incompatible clusterIDs in /home/hadoop/hadoop/data: namenode clusterID = CID-68a4c0d2-5524-486e-8bc9-e1fc3c5c2e29; datanode clusterID = CID-c6c3e9e5-be1c-4a3f-a4b2-bb9441a989c5
I have quoted only the first two lines of the error here; the full stack trace is much longer.

You might not have formatted your NameNode properly, or you reformatted it while the DataNodes still held data from the old cluster. If this is a test environment, you can simply delete the data and name node folders and reformat HDFS. To format, you can run the command below.

WARNING!!!: IF YOU RUN THE COMMAND BELOW YOU WILL LOSE ALL YOUR DATA.
hdfs namenode -format
But if you have a lot of data in your Hadoop cluster and you can't easily format it, then this post is for you.

First, stop all running Hadoop processes. Then log into your NameNode and find the value of the dfs.namenode.name.dir property. Run the command below with your NameNode folder.
cat <dfs.namenode.name.dir>/current/VERSION
Then you will see content like the following.
#Thu May 21 08:29:01 UTC 2015
namespaceID=1938842004
clusterID=CID-68a4c0d2-5524-486e-8bc9-e1fc3c5c2e29
cTime=0
storageType=NAME_NODE
blockpoolID=BP-2104944316-127.0.1.1-1430820636449
layoutVersion=-60
Copy the clusterID from the NameNode. Then log into the problematic slave node and find the dfs.datanode.data.dir folder. Run the command below to edit the VERSION file.
vim <dfs.datanode.data.dir>/current/VERSION 
Your DataNode VERSION file will look like the one below. Replace its clusterID with the one you copied from the NameNode.
#Thu May 21 08:31:31 UTC 2015
storageID=DS-b7d3c421-0366-4a66-8d14-78362389ed73
clusterID=CID-c6c3e9e5-be1c-4a3f-a4b2-bb9441a989c5
cTime=0
datanodeUuid=724f8bad-c0ca-4ded-98d6-a860d3165289
storageType=DATA_NODE
layoutVersion=-56
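If you don't want to edit the file by hand, a sed one-liner like the sketch below should do the same thing. It uses the NameNode clusterID from the example above, so substitute your own value and data directory.
sed -i 's/^clusterID=.*/clusterID=CID-68a4c0d2-5524-486e-8bc9-e1fc3c5c2e29/' <dfs.datanode.data.dir>/current/VERSION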
Then start your Hadoop processes again, and everything will be okay!

Hadoop MultipleInputs Example

Let's assume you are working for ABC Group. They have ABC America airline, ABC Mobile, ABC Money, ABC Hotel, blah blah; ABC this and that. So you have multiple data sources with different types and columns, which means you can't run a single Hadoop job on all of the data.

You get several data files from all these businesses; in this example there are airline, book and mobile data files.

So your job is to calculate the total amount that each person spent across ABC Group. You could run a job for each company and then run another job to calculate the sum. But what I'm going to tell you is, "NOOOO! You can do this with one job." Your Hadoop administrator will love this idea.

You need to develop a custom InputFormat and a custom RecordReader. I have created the RecordReader as an inner class of the custom InputFormat. A sample InputFormat could look like the sketch below.
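Here is a minimal sketch, assuming each record is a comma-separated line of customerId,amount. The class names (AirlineInputFormat, AirlineRecordReader) and the column layout are only illustrative, so adjust them to your own files.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Illustrative InputFormat: emits (customerId, amountSpent) pairs from comma-separated text.
public class AirlineInputFormat extends FileInputFormat<Text, LongWritable> {

    @Override
    public RecordReader<Text, LongWritable> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        return new AirlineRecordReader();
    }

    // RecordReader kept inside the InputFormat class; it wraps LineRecordReader
    // and parses each line into (customerId, amount).
    public static class AirlineRecordReader extends RecordReader<Text, LongWritable> {

        private final LineRecordReader lineReader = new LineRecordReader();
        private final Text currentKey = new Text();
        private final LongWritable currentValue = new LongWritable();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            lineReader.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            // This is the method you adapt to each data file's columns.
            if (!lineReader.nextKeyValue()) {
                return false;
            }
            String[] columns = lineReader.getCurrentValue().toString().split(",");
            currentKey.set(columns[0].trim());                    // customer id
            currentValue.set(Long.parseLong(columns[1].trim()));  // amount spent
            return true;
        }

        @Override
        public Text getCurrentKey() {
            return currentKey;
        }

        @Override
        public LongWritable getCurrentValue() {
            return currentValue;
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return lineReader.getProgress();
        }

        @Override
        public void close() throws IOException {
            lineReader.close();
        }
    }
}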


The nextKeyValue() method is the place where you write the parsing logic that matches your data files.

Developing custom InputFormat classes alone is not enough; you also need to change the main class of your job. Your main class could look like the sketch below.
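Here is a minimal sketch of the driver. BookInputFormat and MobileInputFormat are assumed to be written the same way as the AirlineInputFormat above, and the mapper and reducer names are only illustrative.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TotalSpendJob {

    // The custom RecordReaders already emit (customerId, amount), so the mapper just passes them through.
    public static class SpendMapper extends Mapper<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void map(Text key, LongWritable value, Context context)
                throws IOException, InterruptedException {
            context.write(key, value);
        }
    }

    // Sums every amount seen for one customer id.
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable value : values) {
                total += value.get();
            }
            context.write(key, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "abc-group-total-spend");
        job.setJarByClass(TotalSpendJob.class);

        job.setMapperClass(SpendMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // One input path per business, each with its own InputFormat.
        MultipleInputs.addInputPath(job, new Path(args[0]), AirlineInputFormat.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), BookInputFormat.class);
        MultipleInputs.addInputPath(job, new Path(args[2]), MobileInputFormat.class);

        FileOutputFormat.setOutputPath(job, new Path(args[3]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}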

The MultipleInputs.addInputPath() calls add your custom inputs to the job. In this example a single Mapper class handles all of the inputs, but if you want you can develop separate mapper classes for your different file types. I'll write a blog post about that method as well.
To build the JAR from my sample project you need Maven. Run the command below to build the Maven project; you will find the JAR file inside the target folder once the build finishes.
mvn clean install
/
|----/user
     |----/hadoop
          |----/airline_data
          |    |----/airline.txt
          |----/book_data
          |    |----/book.txt
          |----/mobile_data
               |----/mobile.txt
With this change you may also have to change the way you run the job. My file structure looks like the one above, with a different folder for each data type. You can run the job with the command below.
hadoop jar /vagrant/muiltiinput-sample-1.0-SNAPSHOT.jar /user/hadoop/airline_data /user/hadoop/book_data /user/hadoop/mobile_data output_result
If you have followed all the steps properly, the job will run successfully.

The job will create a folder called output_result. If you want to see its contents, you can run the command below.
hdfs dfs -cat output_result/part*
I ran my sample project on my sample data set, and my result file looked like this.
12345678 500
23452345 937
34252454 850
43545666 1085
56785678 709
67856783 384
The source code of this project is available on GitHub:
https://github.com/dedunu/hadoop-multiinput-sample

Enjoy Hadoop!

May 18, 2015

IMAP Java Test program and JMeter Script

One of my colleagues wanted to write a JMeter script to test IMAP, but the code failed, so I got involved as well. JMeter's BeanShell uses Java in the backend, so I first tried it out in a Maven project and eventually got code working that lists the IMAP folders. A sketch of the Java implementation is shown below.
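Here is a minimal sketch using JavaMail; the host, user and password are placeholders, so substitute your own server and credentials.

import java.util.Properties;

import javax.mail.Folder;
import javax.mail.Session;
import javax.mail.Store;

public class ImapFolderList {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("mail.store.protocol", "imaps");

        Session session = Session.getInstance(props);
        Store store = session.getStore("imaps");
        store.connect("imap.example.com", "user@example.com", "password");

        // "*" lists every folder under the default (root) folder.
        Folder[] folders = store.getDefaultFolder().list("*");
        for (Folder folder : folders) {
            System.out.println(folder.getFullName());
        }

        store.close();
    }
}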

Then we wrote code for the JMeter BeanShell sampler that prints the IMAP folder count. The code is shown below.
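Here is a minimal BeanShell sketch; it assumes the JavaMail jars are on JMeter's classpath, and the host and credentials are again placeholders. The log variable is the one JMeter provides inside the BeanShell sampler.

import java.util.Properties;
import javax.mail.Folder;
import javax.mail.Session;
import javax.mail.Store;

Properties props = new Properties();
props.setProperty("mail.store.protocol", "imaps");

Session session = Session.getInstance(props);
Store store = session.getStore("imaps");
store.connect("imap.example.com", "user@example.com", "password");

Folder[] folders = store.getDefaultFolder().list("*");

// "log" is provided by the JMeter BeanShell sampler.
log.info("IMAP folder count: " + folders.length);

store.close();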

The complete Maven project is available on GitHub - https://github.com/dedunu/imapTest

Increase memory and CPUs on Vagrant Virtual Machines

In my last post I showed how to create multiple nodes in a single Vagrant project. The "ubuntu/trusty64" box usually comes with 500MB of RAM, but some developers need more RAM and more CPUs. In this post I'm going to show how to increase the memory and the number of CPUs in a Vagrant project. Run the commands below.
mkdir testProject1
cd testProject1
vagrant init
Then edit the Vagrantfile as shown in the sketch below.
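Here is a minimal sketch of the relevant part of the Vagrantfile; it assumes VirtualBox as the provider, with the memory and CPU values the rest of this post talks about.

Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"

  # Raise the VirtualBox VM to 8GB of RAM and 2 CPU cores.
  config.vm.provider "virtualbox" do |vb|
    vb.memory = 8192
    vb.cpus = 2
  end
end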


The changes above increase the memory to 8GB and add one more core. Run the commands below to start the Vagrant machine and get SSH access.
vagrant up
vagrant ssh
If you have an existing project, you just have to add these provider lines to its Vagrantfile. When you restart the virtual machine, the memory will be increased.