August 15, 2016

Cross-Section vs Time-Series Data

There are two types of data sets, based on time.

Cross-Section Data

Cross-section data is collected at a single point in time. There may be many different variables, but all of them are recorded for the same period, so you won't see a time column. The heights of all students measured on one day is an example.

Time-Series Data 

Time-series data spans multiple periods: the same variable is recorded at different points in time, such as monthly sales recorded over several years.


Types of Variables

Quantitative Variable

Quantitative variables are variables that can be measured numerically.

Discrete Variable

Discrete variables are countable. They don't have to be whole numbers; shoe sizes, for example, come in half steps but are still countable.

Continuous Variable 

Continuous variables can take any value within a range and are therefore uncountable, in contrast to discrete variables. Height and temperature are typical examples.

Qualitative Variables

Non-numerical variables are called qualitative variables. Sometimes qualitative variables are encoded as numbers (for example, 1 for male and 2 for female), but performing arithmetic on those codes is meaningless.

Point Estimation and Interval Estimation

Point Estimation

A point estimate is a single statistic computed from a sample data set and used as a best guess for the population parameter. The sample mean, for instance, is a point estimate of the population mean.

Interval Estimation

Interval estimation describes a range that is likely to contain the population parameter, in contrast to the single value given by point estimation.
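
As a quick illustration, here is a minimal Python sketch with made-up sample values (not data from any real study): the sample mean is a point estimate of the population mean, and the t-based 95% confidence interval is an interval estimate.

import numpy as np
from scipy import stats

# hypothetical sample drawn from some larger population
sample = np.array([12.1, 9.8, 11.4, 10.9, 12.7, 10.2, 11.8, 10.5])

# point estimate: a single best guess for the population mean
point_estimate = sample.mean()

# interval estimate: a 95% confidence interval around that guess
interval = stats.t.interval(0.95, len(sample) - 1,
                            loc=point_estimate,
                            scale=stats.sem(sample))

print(point_estimate)   # 11.175
print(interval)         # a (lower, upper) range likely to contain the population mean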

May 14, 2016

Apache Spark Job with Maven

Today, I'm going to show you how to write a sample word count application using Apache Spark. For dependency resolution and build tasks, I'm using Apache Maven; however, you could use SBT (Simple Build Tool) instead. Most Java developers are familiar with Maven, so I decided to use it for this example.


This application is very similar to Hadoop's classic WordCount example and does exactly the same thing. The content of Driver.scala is outlined below.
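
A minimal sketch of such a driver (the package and object names are assumed from the spark-submit command further down) would be:

package org.dedunu.datascience.sample

import org.apache.spark.{SparkConf, SparkContext}

object Driver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    sc.textFile(args(0))                     // read every file in the input folder
      .flatMap(line => line.split(" "))      // tokenize each line on spaces
      .map(word => (word, 1))
      .reduceByKey(_ + _)                    // count each word individually
      .saveAsTextFile(args(1))               // dump the result to the output folder

    sc.stop()
  }
}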

The job reads all the files in the input folder, tokenizes each line on spaces (" "), and counts every word individually. You can also see that the application reads its arguments from the args variable: the first argument is the input folder, and the second is the folder where the output is dumped.

Maven projects need a pom.xml. The content of the pom.xml is outlined below.
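
A minimal sketch of a pom.xml that builds this kind of fat jar is shown next; the groupId, the Spark and plugin versions, and the finalName are assumptions, chosen so that the assembled jar matches the name used in the spark-submit command further down.

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.dedunu.datascience</groupId>
  <artifactId>sample-Spark-Job</artifactId>
  <version>1.0-SNAPSHOT</version>

  <dependencies>
    <!-- Spark core for the Scala 2.10 line, matching Spark 1.6.1 -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.6.1</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <!-- compiles the Scala sources under src/main/scala -->
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.2</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <!-- bundles the job and its compile-scope dependencies into a single jar -->
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <finalName>sample-Spark-Job</finalName>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>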

Run the command below to build the Maven project.
mvn clean package
Maven will download all the dependencies on your behalf. Then all you need to do is run the job, using the command below in your terminal window.
/home/dedunu/bin/spark-1.6.1/bin/spark-submit          \
     --class org.dedunu.datascience.sample.Driver      \
     target/sample-Spark-Job-jar-with-dependencies.jar \
     /home/dedunu/input                                \
     /home/dedunu/output


Output of the job will look like below.


You can find the project on Github - https://github.com/dedunu/spark-example
Enjoy Spark!

Getting Public Data Sets for Data Science Projects

All of us are interested in doing brilliant things with data sets. Most people use Twitter data streams for their projects, but there are a lot of free data sets on the Internet. Today, I'm going to list a few of them. I found almost all of these links in a Lynda.com course called Up and Running with Public Data Sets. If you want more details, please watch the complete course on Lynda.com.


Moreover, Quandl offers an R package that lets you pull data into your R projects very easily. I'm hoping to write a blog post on getting Quandl data with R. This site mainly offers economic data sets.


I'm from Sri Lanka, and Sri Lankan researchers might need Sri Lankan data sets as well. The links below will help you find them.


Hope you enjoy playing with those data sets!

April 24, 2016

vboxdrv setup says make not found

After you update the kernel, you need to run vboxdrv setup. But if you are compiling the modules for the first time, or after removing the build-essential package, you might see the error below.

user@ubuntu:~$ sudo /etc/init.d/vboxdrv setup
[sudo] password for user:
Stopping VirtualBox kernel modules ...done.
Recompiling VirtualBox kernel modules ...failed!
  (Look at /var/log/vbox-install.log to find out what went wrong)
user@ubuntu:~$ cat /var/log/vbox-install.log
/usr/share/virtualbox/src/vboxhost/build_in_tmp: 62: 
/usr/share/virtualbox/src/vboxhost/build_in_tmp: make: not found
/usr/share/virtualbox/src/vboxhost/build_in_tmp: 62: 
/usr/share/virtualbox/src/vboxhost/build_in_tmp: make: not found
/usr/share/virtualbox/src/vboxhost/build_in_tmp: 62: 
/usr/share/virtualbox/src/vboxhost/build_in_tmp: make: not found

Ubuntu needs build-essential to run the above command. Run the commands below to install build-essential and re-run the setup.

sudo apt-get install build-essential
sudo /etc/init.d/vboxdrv setup

Then you can use VirtualBox!

March 03, 2016

How to create an EMR cluster using Boto3?

I wrote a blog post about Boto2 and EMR clusters a few months ago. Today I'm going to show how to create EMR clusters using Boto3. The Boto3 documentation is available at https://boto3.readthedocs.org/en/latest/.
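
A minimal sketch of that kind of cluster creation, using boto3's run_job_flow call, is below. The region, instance types, key pair name, and IAM role names are placeholder assumptions, so adjust them for your account.

import boto3

# assumed region; use whichever region you normally run EMR in
emr = boto3.client('emr', region_name='us-east-1')

response = emr.run_job_flow(
    Name='boto3-sample-cluster',
    ReleaseLabel='emr-4.2.0',
    Instances={
        'MasterInstanceType': 'm3.xlarge',
        'SlaveInstanceType': 'm3.xlarge',
        'InstanceCount': 3,
        'KeepJobFlowAliveWhenNoSteps': True,
        'Ec2KeyName': 'my-key-pair',        # assumed key pair name
    },
    Applications=[{'Name': 'Hadoop'}, {'Name': 'Spark'}],
    JobFlowRole='EMR_EC2_DefaultRole',      # the default EMR instance profile
    ServiceRole='EMR_DefaultRole',          # the default EMR service role
    VisibleToAllUsers=True,
)

print(response['JobFlowId'])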

HDFS - How to recover corrupted HDFS metadata in Hadoop 1.2.X?

You might be running Hadoop in production, sometimes with terabytes of data residing in it. HDFS metadata can get corrupted, and in such cases the Namenode won't start. When you check the Namenode logs, you might see exceptions like this:

ERROR org.apache.hadoop.dfs.NameNode: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:178)
    at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
    at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
    at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:433)
    at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:759)
    at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:639)
    at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
    at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:79)
    at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:254)
    at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:235)
    at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:131)
    at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:176)
    at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:162)
    at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:846)
    at org.apache.hadoop.dfs.NameNode.main(NameNode.java:855)

If you have a development environment, you can always format HDFS and continue. This blog post even suggests that: https://autofei.wordpress.com/2011/03/27/hadoop-namenode-failed-and-reset/

BUT IF YOU FORMAT HDFS, YOU LOSE ALL THE FILES IN HDFS!

So Hadoop administrators can't simply format HDFS. But you can recover HDFS to the last checkpoint. You might lose some data files, but more than 90% of the data might be safe. Let's see how to recover corrupted HDFS metadata.

Hadoop creates checkpoints periodically in the Namenode directory. You might see three folders there:
  1. current
  2. image
  3. previous.checkpoint

The current folder is most probably the corrupted one. To recover:
  • Stop all the Hadoop services from all the nodes.
  • Backup both "current" and "previous.checkpoint" directories. 
  • Delete "current" directory. 
  • Rename "previous.checkpoint" to "current"
  • Restart Hadoop services. 

The steps I followed are listed above. The commands below were run to recover HDFS; they might change slightly depending on your installation.

/usr/local/hadoop/stop-all.sh
cd <namenode.dir>
cp -r current current.old
cp -r previous.checkpoint previous.checkpoint.old
rm -r current
mv previous.checkpoint current
/usr/local/hadoop/start-all.sh

That's all!!!! Everything was okay after that!

January 29, 2016

How to fix InsecurePlatformWarning on Ubuntu?

Python modules sometimes cause issues. We got the warning below from a Python application.

/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/util/ssl_.py:120: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning


After a little research, we found that this might be caused by an outdated module. We ran the commands below on that server, and after that the warning didn't occur again.

$ sudo apt-get install build-essential python-dev libffi-dev libssl-dev
$ sudo pip install --upgrade ndg-httpsclient

January 14, 2016

Vagrant on Windows 7 vs Ubuntu 14.04

My whole team had to work on a project that uses Vagrant. Most of us had 8GB of memory, except one unfortunate intern who had only 4GB on his workstation. All the team members could spawn Vagrant machines without a problem except him.

So we requested more memory and insisted that the IT department upgrade his machine to 8GB. Oh no! Our IT is going to retire desktops, so they don't buy new parts for existing desktop systems. Somehow we managed to get a 1GB memory module, which brought him up to 5GB. The computer had an Athlon processor (I cannot recall the model number).

Then we tried to spin it up again. Provisioning the Vagrant machine took at least three hours, and sometimes packages got corrupted. He eventually stopped provisioning machines once he realized it was useless.

Then we moved him to another task where he had to work closely with Ubuntu, so I pushed him to install it, and he was fine with that. I created an Ubuntu 14.04 bootable USB drive and he installed Ubuntu. After installation he tried to start Vagrant again, and the machine was provisioned within minutes, with more than 1GB of memory still free.

Windows 7 vs Ubuntu 14.04: Ubuntu 14.04 wins when it comes to memory consumption.

January 05, 2016

How to specify ReleaseLabel for EMR cluster with Boto2

Boto is the AWS SDK for Python. You can create clusters, instances, or almost anything with it, but sometimes Boto imposes limitations. I wanted to create an EMR cluster with ReleaseLabel 4.2.0, but we were using Boto2. ReleaseLabel is an option in Boto3; for Boto2 there was no documented option for it.

So I found a way to create EMR (Elastic MapReduce) clusters using Boto2 with a given ReleaseLabel.
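
One such way is to pass ReleaseLabel straight through to the EMR API using boto2's api_params argument. A minimal sketch is below; the region, instance types, key pair name, and IAM role names are placeholder assumptions, so change them to suit your setup.

import boto.emr
from boto.emr.instance_group import InstanceGroup

# assumed region
conn = boto.emr.connect_to_region('us-east-1')

instance_groups = [
    InstanceGroup(1, 'MASTER', 'm3.xlarge', 'ON_DEMAND', 'Master nodes'),
    InstanceGroup(2, 'CORE', 'm3.xlarge', 'ON_DEMAND', 'Core nodes'),
]

cluster_id = conn.run_jobflow(
    name='boto2-releaselabel-sample',
    instance_groups=instance_groups,
    ec2_keyname='my-key-pair',              # assumed key pair name
    job_flow_role='EMR_EC2_DefaultRole',
    service_role='EMR_DefaultRole',
    keep_alive=True,
    enable_debugging=False,
    # ami_version is intentionally left out: ReleaseLabel picks the AMI instead
    api_params={'ReleaseLabel': 'emr-4.2.0'},
)

print(cluster_id)

If you hit the "No Default VPC found." error mentioned below, the subnet can be passed the same way, for example by adding 'Instances.Ec2SubnetId': 'subnet-xxxxxxxx' (a placeholder ID) to api_params.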


I have commented out the AMI version because ReleaseLabel will pick the right AMI automatically. The above program will print the cluster ID in the terminal.

Sometimes you might get an error saying "No Default VPC found." This is a network-related issue; in that case you need to specify a subnet ID for the EMR cluster, and then you don't need to specify an availability zone.