Thursday, 9 April 2015

Apache Storm single node installation

Video Reference 


Step 1:  Download Zookeeper

wget http://www.eu.apache.org/dist/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz

Step 2:
 tar -zxvf zookeeper-3.4.6.tar.gz
 cd zookeeper-3.4.6
 cd conf

 Step 3:
 cp zoo_sample.cfg zoo.cfg
 vi zoo.cfg
 tickTime=2000
 dataDir=/home/hadoop/zookeeper
 clientPort=2181
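
It is a good idea to create the dataDir before starting ZooKeeper (assuming the path above):

mkdir -p /home/hadoop/zookeeper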


Step 4: Download Storm

wget http://apache.mesi.com.ar/storm/apache-storm-0.9.3/apache-storm-0.9.3.tar.gz

Step 5:
tar -zxvf apache-storm-0.9.3.tar.gz
cd apache-storm-0.9.3
cd conf

Step 6:
vi storm.yaml

storm.zookeeper.servers:
    - "localhost"

storm.zookeeper.port: 2181
nimbus.host: "localhost"


Step 7: Start all the services (ZooKeeper + Storm)

Zookeeper

bin/zkServer.sh start
jps
QuorumPeerMain


Storm 
 bin/storm nimbus


 bin/storm supervisor

 bin/storm ui
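
Nimbus, the supervisor and the UI all run in the foreground, so either open a separate terminal for each or push them into the background, for example:

nohup bin/storm nimbus &       # master daemon
nohup bin/storm supervisor &   # worker daemon
nohup bin/storm ui &           # web UI (port 8080)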

Run jps to verify that all the services are running.



Step 8: Check the Storm UI

localhost:8080


Additional native dependencies (optional, but needed when you move on to advanced use):

wget http://download.zeromq.org/zeromq-2.1.7.tar.gz

tar -xzf zeromq-2.1.7.tar.gz

cd  zeromq-2.1.7

./configure

make

sudo make install

Download and installation commands for JZMQ:

Obtain JZMQ using

git clone https://github.com/nathanmarz/jzmq.git

cd jzmq

sudo apt-get install autoconf
sudo apt-get install automake
sudo apt-get install libtool


./autogen.sh

./configure

make

sudo make install






Wednesday, 18 March 2015

Apache Spark and Hadoop Integration with example

Step 1: Install Hadoop on your machine (1.x or 2.x). You also need to set the Java and Scala paths in .bashrc (for setting the paths, refer to the Spark installation post).

Step 2: Check that all Hadoop daemons are up and running.
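
You can verify this with jps; on a typical Hadoop 2.x single-node setup you would expect to see something like NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager (JobTracker and TaskTracker on 1.x):

jps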



Step 3: Write some data into HDFS (here the file name in HDFS is word).
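
A minimal sketch, assuming a small local file and the HDFS file name word used in this post:

echo -e "I love bigdata\nI like bigdata" > word.txt    # sample local file
hadoop fs -put word.txt /word                          # copy it into HDFS as /word
hadoop fs -cat /word                                   # verify the contents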

Step 4: Download Apache Spark prebuilt for Hadoop 1.x or 2.x, depending on the Hadoop version you installed in Step 1.



Step 5: Untar the Spark-for-Hadoop archive.
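
For example, assuming the Hadoop 2.x prebuilt package was downloaded (the exact file name depends on the version chosen in Step 4):

tar -zxvf spark-1.2.1-bin-hadoop2.4.tgz
cd spark-1.2.1-bin-hadoop2.4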

Step 6: Start the Spark shell.
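
From the extracted Spark directory:

bin/spark-shell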



Step 7: Once the Spark shell has started, type the following commands.
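
A minimal word-count sketch, assuming the input was loaded into HDFS as /word and that the NameNode listens on port 9000 (adjust the hdfs:// URI to match your fs.defaultFS setting):

scala> val textFile = sc.textFile("hdfs://localhost:9000/word")
scala> val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> counts.collect()
scala> counts.saveAsTextFile("hdfs://localhost:9000/word_output")

collect() prints the counts in the terminal (Step 8), and saveAsTextFile() writes them back to HDFS so they show up in the NameNode UI (Step 9).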



Step 8: See the output in the terminal.



Step 9: Check the Namenode UI (localhost:50070)







Step 10: Check the Spark UI (localhost:4040) to monitor the job.


Tuesday, 10 March 2015

Apache Spark word count in Scala and Python

Wordcount in Scala

bin/spark-shell

scala>val textFile = sc.textFile("/home/username/word.txt")

scala>val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

scala>counts.collect()

Wordcount in Python

bin/pyspark

>>>text_file = sc.textFile("/home/username/word.txt")

>>>counts = text_file.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

>>>counts.collect()

Input file: word.txt

I love bigdata
I like bigdata
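
With that input, counts.collect() should return something like the following (the order of the pairs may vary):

Array((love,1), (like,1), (I,2), (bigdata,2))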

Spark web UI


Friday, 6 March 2015

Apache Spark videos

Apache Spark Quick introduction  Lesson 1



Apache Spark wordcount in scala and python 



Apache Spark Installation

Step 1 Download Spark Click here

Step 2 Download Scala Click here


Step 3 Download Java

Click here

NOTE: Install git. Go to the terminal and run: sudo apt-get install git

Step 4 Untar Spark, Scala and the JDK
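
For example (the archive names below are assumptions; use whatever versions you downloaded in Steps 1-3):

tar -zxvf spark-1.2.1.tgz
tar -zxvf scala-2.10.4.tgz
tar -zxvf jdk-7u71-linux-x64.tar.gz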




Step 5 Set the environment path in .bashrc
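
A sketch of the .bashrc entries, assuming everything was extracted under /home/username (adjust the paths to your machine):

export JAVA_HOME=/home/username/jdk1.7.0_71
export SCALA_HOME=/home/username/scala-2.10.4
export SPARK_HOME=/home/username/spark-1.2.1
export PATH=$PATH:$JAVA_HOME/bin:$SCALA_HOME/bin:$SPARK_HOME/bin

Run source ~/.bashrc (or open a new terminal) for the changes to take effect.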


Step 6 Start building spark using sbt 
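
For a source download of Spark from this era, the build is driven by the bundled sbt script, roughly:

cd spark-1.2.1        # the extracted Spark source directory (name assumed)
sbt/sbt assembly      # builds Spark; this can take a while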




Step 7 Start spark shell



For Video Apache Spark installation  



Thursday, 13 November 2014

Top 30 Hive Interview Questions

1. What is the Hive shell?

The shell is the primary way we interact with Hive, using HiveQL commands. In other words, the shell is simply a prompt where you enter HiveQL commands to interact with Hive.

2. How can we enter the Hive shell from a normal terminal?

Just by entering the hive command, e.g. 'bin/hive'.

3. How can we check whether the Hive shell is working or not?

After entering the Hive shell, just run another HiveQL command such as 'show databases;'.
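
For example, a minimal session from the Hive installation directory:

bin/hive
hive> show databases;
hive> quit;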

4. Is it necessary to add a semicolon (;) at the end of HiveQL commands?

Yes, we have to add a semicolon (;) at the end of every HiveQL command.

Wednesday, 12 November 2014

Top 60 Hadoop Interview Questions

1. What is Hadoop framework?
Ans: Hadoop is an open-source framework written in Java by the Apache Software Foundation. The framework is used to write software applications that need to process vast amounts of data (it can handle multiple terabytes of data). It works in parallel on large clusters, which can have thousands of computers (nodes). It also processes data in a very reliable and fault-tolerant manner.

2. On What concept the Hadoop framework works?
Ans: It works on the MapReduce concept, which was devised by Google.

3. What is MapReduce?

Ans: MapReduce is an algorithm or concept for processing huge amounts of data in a faster way. As per its name, you can divide it into Map and Reduce.
• Map task: the MapReduce job usually splits the input data set into independent chunks (big data sets into multiple small data sets) that are processed in parallel.
• Reduce task: the output of the map tasks becomes the input of the reduce tasks, which produce the final result.
Your business logic is written in the Map task and the Reduce task. Typically, both the input and the output of the job are stored in a file system (not a database). The framework takes care of scheduling tasks, monitoring them, and re-executing the failed tasks.