May 14, 2016

Apache Spark Job with Maven

Today, I'm going to show you how to write a sample word count application using Apache Spark. For dependency resolution and build tasks, I'm using Apache Maven; however, you could use SBT (Simple Build Tool) instead. Most Java developers are familiar with Maven, so I decided to show an example using Maven.

This application is quite similar to Hadoop's classic WordCount example; this job does exactly the same thing. The content of Driver.scala is given below.
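The original listing did not survive here, so the following is a minimal sketch of what Driver.scala could look like, based on the description in this post and the class name used in the spark-submit command further down. The exact body is an assumption, not the original code:

```scala
package org.dedunu.datascience.sample

import org.apache.spark.{SparkConf, SparkContext}

object Driver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Sample Spark Job")
    val sc = new SparkContext(conf)

    // args(0) is the input folder; args(1) is where the output is dumped
    val counts = sc.textFile(args(0))
      .flatMap(line => line.split(" "))   // tokenize every line on spaces
      .map(word => (word, 1))             // pair each word with a count of 1
      .reduceByKey(_ + _)                 // sum the counts per word

    counts.saveAsTextFile(args(1))
    sc.stop()
  }
}
```

The flatMap/map/reduceByKey chain is the standard RDD word-count pattern in Spark 1.x.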

This job reads all the files in the input folder, tokenizes every line on spaces (" "), and then counts each word individually. Moreover, you can see that the application reads its arguments from the args variable: the first argument is the input folder, and the second argument is where the output will be dumped.

Every Maven project needs a pom.xml. The content of the pom.xml is given below.
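The original pom.xml listing is also missing here, so below is a minimal sketch. The groupId and artifactId are guessed from the package and jar names that appear in the spark-submit command, and the Spark version from the spark-1.6.1 path, so treat all coordinates and versions as assumptions:

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.dedunu.datascience</groupId>
  <artifactId>sample-Spark-Job</artifactId>
  <version>1.0</version>

  <dependencies>
    <!-- Spark core; "provided" because spark-submit supplies it at runtime -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.6.1</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <!-- compiles Driver.scala -->
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.2</version>
        <executions>
          <execution>
            <goals><goal>compile</goal></goals>
          </execution>
        </executions>
      </plugin>
      <!-- produces the ...-jar-with-dependencies.jar used below -->
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
        <executions>
          <execution>
            <phase>package</phase>
            <goals><goal>single</goal></goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
```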

Run the command below to build the Maven project.
mvn clean package
Maven will download all the dependencies and handle everything else on your behalf. Then all you need to do is run the job. To do that, run the command below in your terminal window.
/home/dedunu/bin/spark-1.6.1/bin/spark-submit          \
     --class org.dedunu.datascience.sample.Driver      \
     target/sample-Spark-Job-jar-with-dependencies.jar \
     /home/dedunu/input                                \
     /home/dedunu/output

The output of the job will look like the following.

You can find the project on Github -
Enjoy Spark!

Getting Public Data Sets for Data Science Projects

All of us are interested in doing brilliant things with data sets. Most people use Twitter data streams for their projects, but there are a lot of free data sets on the Internet. Today, I'm going to list a few of them. I found almost all of these links in a course called Up and Running with Public Data Sets. If you want more details, please watch the complete course on

Moreover, Quandl offers an R package with which you can pull data into your R projects very easily. I'm hoping to write a blog post on getting Quandl data with R. This site mainly offers economic data sets.

I'm from Sri Lanka, and Sri Lankan researchers might need Sri Lankan data sets as well. The links below will help you find them.

Hope you enjoy playing with those data sets!

April 24, 2016

vboxdrv setup says make not found

After you update the kernel, you need to run vboxdrv setup. But if you are trying to compile the modules for the first time, or after removing the build-essential package, you might see the error below.

user@ubuntu:~$ sudo /etc/init.d/vboxdrv setup
[sudo] password for user:
Stopping VirtualBox kernel modules ...done.
Recompiling VirtualBox kernel modules ...failed!
  (Look at /var/log/vbox-install.log to find out what went wrong)
user@ubuntu:~$ cat /var/log/vbox-install.log
/usr/share/virtualbox/src/vboxhost/build_in_tmp: 62: 
/usr/share/virtualbox/src/vboxhost/build_in_tmp: make: not found
/usr/share/virtualbox/src/vboxhost/build_in_tmp: 62: 
/usr/share/virtualbox/src/vboxhost/build_in_tmp: make: not found
/usr/share/virtualbox/src/vboxhost/build_in_tmp: 62: 
/usr/share/virtualbox/src/vboxhost/build_in_tmp: make: not found

Ubuntu needs build-essential to run the above command. Run the commands below to install build-essential and re-run the setup.

sudo apt-get install build-essential
sudo /etc/init.d/vboxdrv setup

Then you can use VirtualBox!

March 03, 2016

How to create an EMR cluster using Boto3?

I wrote a blog post about Boto2 and EMR clusters a few months ago. Today I'm going to show how to create EMR clusters using Boto3. Boto3 documentation is available at
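The rest of this post is cut off here, but a minimal sketch of creating an EMR cluster with Boto3 looks like the following. The cluster name, key name, instance types, release label, and region below are all illustrative assumptions, not values from the original post:

```python
def build_cluster_config(name="sample-cluster", key_name="my-key"):
    """Build the run_job_flow parameters for a small EMR cluster.

    All values here are illustrative defaults.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-4.3.0",
        "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": "m3.xlarge",
            "SlaveInstanceType": "m3.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
            "Ec2KeyName": key_name,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def create_emr_cluster(region="us-east-1"):
    # Imported here so building the config works even without boto3 installed.
    import boto3

    # boto3 reads credentials from the environment or ~/.aws/credentials.
    client = boto3.client("emr", region_name=region)
    response = client.run_job_flow(**build_cluster_config())
    return response["JobFlowId"]
```

run_job_flow returns a JobFlowId, which you can use afterwards to add steps or terminate the cluster.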