August 15, 2016

Cross section vs Time Series Data

With respect to time, there are two types of data sets.

Cross-Section Data

Cross-section data is collected at a single point in time. There may be several variables, but all of them are recorded for the same period, so you won't see a time column.

Time-Series Data 

Time-series data spans multiple periods: the same variable is recorded for different points in time.


Types of Variables

Quantitative Variable

Variables that can be measured numerically are quantitative variables.

Discrete Variable

Countable variables are discrete variables. Their values do not have to be whole numbers; for example, shoe sizes such as 8.5 are still discrete.

Continuous Variable 

Variables that can take any value within a range (uncountably many values) are continuous variables. This contrasts with discrete variables.

Qualitative Variables

Non-numerical variables are called qualitative variables. Sometimes qualitative variables are represented by numbers, but performing arithmetic operations on those numbers is meaningless.
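For instance, a colour variable might be stored as numeric codes. A tiny sketch (the codes here are invented) shows why arithmetic on them is meaningless:

```scala
// A qualitative variable (colour) stored as numeric codes.
object Encoding {
  // Hypothetical encoding: the numbers carry no quantitative meaning.
  val codes = Map("red" -> 1, "green" -> 2, "blue" -> 3)

  def encode(observations: Seq[String]): Seq[Int] = observations.map(codes)

  def main(args: Array[String]): Unit = {
    val encoded = encode(Seq("red", "blue", "blue", "green"))
    // The "average colour" (1 + 3 + 3 + 2) / 4 = 2.25 corresponds to no colour
    // at all, which is why arithmetic on category codes produces nonsense.
    println(encoded.sum / encoded.size.toDouble)
  }
}
```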

Point Estimation and Interval Estimation

Point Estimation

A point estimate is a single statistic inferred from a sample data set. It is the best single guess of the population parameter.

Interval Estimation

Interval estimation describes a range that is likely to contain the population parameter. This contrasts with point estimation.
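As a concrete sketch (the sample values here are made up), the sample mean is a point estimate of the population mean, and a rough 95% interval estimate can be built around it using the standard error and the normal critical value 1.96:

```scala
// Point and interval estimates for the mean of a (made-up) sample.
object Estimation {
  // Returns (point estimate, lower bound, upper bound) of a ~95% interval.
  def meanEstimate(sample: Seq[Double]): (Double, Double, Double) = {
    val n = sample.size
    val mean = sample.sum / n                                       // point estimate
    val variance = sample.map(x => math.pow(x - mean, 2)).sum / (n - 1)
    val stdErr = math.sqrt(variance / n)
    // ~95% interval using the normal critical value 1.96
    (mean, mean - 1.96 * stdErr, mean + 1.96 * stdErr)
  }

  def main(args: Array[String]): Unit = {
    val (mean, lo, hi) = meanEstimate(Seq(52.0, 61.0, 58.0, 49.0, 66.0, 57.0, 63.0, 54.0))
    println(f"point estimate: $mean%.2f, interval estimate: [$lo%.2f, $hi%.2f]")
  }
}
```

(For small samples a t critical value would be more appropriate than 1.96; the normal value keeps the sketch simple.)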

May 14, 2016

Apache Spark Job with Maven

Today, I'm going to show you how to write a sample word count application using Apache Spark. For dependency resolution and build tasks I'm using Apache Maven; however, you could use SBT (Simple Build Tool) instead. Most Java developers are familiar with Maven, hence I decided to show an example using Maven.


This application is pretty much similar to Hadoop's WordCount example; this job does exactly the same thing. The content of Driver.scala is given below.
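The Driver.scala listing appears to be missing from this copy of the post. Based on the description that follows (read the input folder, split on spaces, count each word) and the class name from the spark-submit command, a minimal sketch using Spark 1.6's RDD API might look like this:

```scala
package org.dedunu.datascience.sample

import org.apache.spark.{SparkConf, SparkContext}

object Driver {
  def main(args: Array[String]): Unit = {
    // args(0) = input folder, args(1) = output folder
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    sc.textFile(args(0))            // read all the files in the input folder
      .flatMap(_.split(" "))        // tokenize every line on spaces
      .map(word => (word, 1))       // pair each word with a count of 1
      .reduceByKey(_ + _)           // sum the counts per word
      .saveAsTextFile(args(1))      // dump the output

    sc.stop()
  }
}
```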

This job basically reads all the files in the input folder, tokenizes every line on spaces (" "), and then counts each word individually. Moreover, you can see that the application reads its arguments from the args variable: the first argument is the input folder, and the second argument is where the output will be dumped.

Every Maven project needs a pom.xml. The content of the pom.xml is given below.
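The pom.xml listing is also missing from this copy. A minimal sketch, assuming Spark 1.6.1 on Scala 2.10 and the maven-assembly-plugin (which would produce the -jar-with-dependencies.jar referenced below; a real build would also need the scala-maven-plugin to compile Scala sources), might be:

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.dedunu.datascience</groupId>
  <artifactId>sample-Spark-Job</artifactId>
  <version>1.0</version>

  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.6.1</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
```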

Run below command to build the Maven project.
mvn clean package
Maven will download all the dependencies and do everything else on your behalf. Then all you need to do is run the job. To run it, execute the command below in your terminal window.
/home/dedunu/bin/spark-1.6.1/bin/spark-submit          \
     --class org.dedunu.datascience.sample.Driver      \
     target/sample-Spark-Job-jar-with-dependencies.jar \
     /home/dedunu/input                                \
     /home/dedunu/output


The output of the job will look like the listing below.


You can find the project on Github - https://github.com/dedunu/spark-example
Enjoy Spark!