Summary

1. Introduction

2. Architecture

3. CentOS setup

4. Hadoop Setup

5. Launch and shutdown the Hadoop cluster services

6. Verify the Hadoop cluster is up and healthy

7. End


1. Introduction

This post explains in detail how to set up a Hadoop cluster on a CentOS Linux system. Before you read this article, I assume you already have a basic understanding of Hadoop and the Linux operating system.

P.S. Since the Hadoop 2.7.x line is not yet considered mature by the community, I downgraded to version 2.6.5.

2. Architecture

IP Address       Hostname  Role
192.168.171.132  master    NameNode, ResourceManager
192.168.171.133  slave1    SecondaryNameNode, DataNode, NodeManager
192.168.171.134  slave2    DataNode, NodeManager


3. CentOS setup

3.1. install necessary packages for OS

We use the CentOS minimal ISO as the installation base; once the system is installed, we need a few more basic packages:

sudo yum install -y net-tools
sudo yum install -y openssh-server
sudo yum install -y wget

The first line installs ifconfig, the second allows remote peers to log in over SSH, and the third provides wget for downloading files.
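
A quick sanity check that everything is in place (these are standard CentOS 7 commands, nothing specific to this guide):

ifconfig                # provided by net-tools
systemctl status sshd   # should be active so remote peers can log in
wget --version          # confirms wget is available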

3.2. setup hostname for all nodes

This step is optional, but it helps you identify which node you are on when you use the same username across different nodes.

For example, on the master node:

sudo hostnamectl set-hostname master

Re-login to check the effect.
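
A quick way to confirm the change (run the same check on slave1 and slave2 with their own names):

hostnamectl status      # the static hostname should now read master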

3.3. setup jdk for all nodes

Install the JDK from the Oracle official website:

sudo wget --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u121-b13/e9e7ea248e2c4826b92b3f075a80e441/jdk-8u121-linux-x64.rpm
sudo yum localinstall -y jdk-8u121-linux-x64.rpm
sudo rm jdk-8u121-linux-x64.rpm

Add a java.sh file under /etc/profile.d/ with the following content:

export JAVA_HOME=/usr/java/jdk1.8.0_121
export JRE_HOME=/usr/java/jdk1.8.0_121/jre
export CLASSPATH=$JAVA_HOME/lib:.
export PATH=$PATH:$JAVA_HOME/bin

Re-login, and you'll find that the environment variables are set and Java is properly installed.

To verify the installation:

java -version
ls $JAVA_HOME
echo $PATH

If the wrong Java version is reported, run

sudo update-alternatives --config java

and choose the correct version.

3.4. setup user and user group on all nodes

sudo groupadd hadoop
sudo useradd -d /home/hadoop -g hadoop hadoop
sudo passwd hadoop
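
To confirm that the user and group were created as expected:

id hadoop               # should print uid/gid entries that both mention hadoop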

3.5. modify the hosts file on all nodes so they can resolve each other by name

Append the mappings as root (a plain '>>' redirection would fail for a normal user, so tee -a is used here):

echo '192.168.171.132 master' | sudo tee -a /etc/hosts
echo '192.168.171.133 slave1' | sudo tee -a /etc/hosts
echo '192.168.171.134 slave2' | sudo tee -a /etc/hosts

Check that name resolution works:

ping master
ping slave1
ping slave2

3.6. setup passwordless ssh login on all nodes

su - hadoop
ssh-keygen -t rsa
ssh-copy-id master
ssh-copy-id slave1
ssh-copy-id slave2

Now you can log in to all 3 nodes over SSH without a password; give it a try to check.
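
One simple way to verify from the master (each command should print the remote hostname without asking for a password; you may be asked to confirm the host key the very first time):

ssh master hostname
ssh slave1 hostname
ssh slave2 hostname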

3.7. stop & disable firewall

sudo systemctl stop firewalld.service
sudo systemctl disable firewalld.service
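
Disabling the firewall is the easiest option for a lab cluster. If you prefer to keep firewalld running, a rough alternative is to open only the ports used later in this post; the sketch below is incomplete (DataNode, NodeManager and a few other ports would need to be added as well):

sudo firewall-cmd --permanent --add-port=9000/tcp    # fs.defaultFS
sudo firewall-cmd --permanent --add-port=50070/tcp   # NameNode web UI
sudo firewall-cmd --permanent --add-port=8088/tcp    # ResourceManager web UI
sudo firewall-cmd --reload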


4. Hadoop Setup

P.S. All of the Step 4 operations happen on a single node, let's say master. In addition, we'll log in as the hadoop user to perform all operations.

su - hadoop

4.1. Download and untar Hadoop on the file system

wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.6.5/hadoop-2.6.5.tar.gz
tar -xvf hadoop-2.6.5.tar.gz
rm hadoop-2.6.5.tar.gz
chmod 775 hadoop-2.6.5

4.2. Add environment variables for hadoop

Append the following content to ~/.bashrc:

export HADOOP_HOME=/home/hadoop/hadoop-2.6.5
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Then make these variables take effect:

source ~/.bashrc
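
A quick check that the variables are in effect (hadoop version should report 2.6.5):

echo $HADOOP_HOME
hadoop version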

4.3. Modify configuration files for hadoop

  • Add slave node hostnames into $HADOOP_HOME/etc/hadoop/slaves file
echo slave1 > $HADOOP_HOME/etc/hadoop/slaves
echo slave2 >> $HADOOP_HOME/etc/hadoop/slaves
  • Add secondary node hostname into $HADOOP_HOME/etc/hadoop/masters file
echo slave1 > $HADOOP_HOME/etc/hadoop/masters
  • Modify $HADOOP_HOME/etc/hadoop/core-site.xml as follows
<configuration>
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000/</value>
    <description>namenode settings</description>
</property>
<property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/hadoop-2.6.5/tmp/hadoop-${user.name}</value>
    <description> temp folder </description>
</property>  
<property>
    <name>hadoop.proxyuser.hadoop.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.hadoop.groups</name>
    <value>*</value>
</property>
</configuration>
  • Modify $HADOOP_HOME/etc/hadoop/hdfs-site.xml as follows
<configuration>  
    <property>  
        <name>dfs.namenode.http-address</name>  
        <value>master:50070</value>  
        <description> fetch NameNode images and edits </description>  
    </property>
    <property>  
        <name>dfs.namenode.secondary.http-address</name>  
        <value>slave1:50090</value>  
        <description> fetch the SecondaryNameNode fsimage </description>
    </property> 
    <property>
        <name>dfs.replication</name>
        <value>2</value>
        <description> replica count </description>
    </property>
    <property>  
        <name>dfs.namenode.name.dir</name>  
        <value>file:///home/hadoop/hadoop-2.6.5/hdfs/name</value>  
        <description> namenode </description>  
    </property>  
    <property>  
        <name>dfs.datanode.data.dir</name>
        <value>file:///home/hadoop/hadoop-2.6.5/hdfs/data</value>  
        <description> DataNode </description>  
    </property>  
    <property>  
        <name>dfs.namenode.checkpoint.dir</name>  
        <value>file:///home/hadoop/hadoop-2.6.5/hdfs/namesecondary</value>  
        <description>  check point </description>  
    </property> 
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>dfs.stream-buffer-size</name>
        <value>131072</value>
        <description> buffer </description>
    </property> 
    <property>  
        <name>dfs.namenode.checkpoint.period</name>  
        <value>3600</value>  
        <description> duration </description>  
    </property> 
</configuration>
  • Modify $HADOOP_HOME/etc/hadoop/mapred-site.xml as follows (if this file does not exist yet, copy it from mapred-site.xml.template first)
<configuration>  
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobtracker.address</name>
        <value>hdfs://master:9001</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>master:10020</value>
        <description>MapReduce JobHistory Server host:port, default port is 10020.</description>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>master:19888</value>
        <description>MapReduce JobHistory Server Web UI host:port, default port is 19888.</description>
    </property>
</configuration>
  • Modify $HADOOP_HOME/etc/hadoop/yarn-site.xml as follows
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>master:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>master:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>master:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>master:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>master:8088</value>
    </property>
</configuration>
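
Optionally, you can sanity-check the syntax of the XML files you just edited (a small sketch; it assumes xmllint from the libxml2 package is available, which may require an extra install on a minimal system):

cd $HADOOP_HOME/etc/hadoop
xmllint --noout core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml   # no output means the XML is well-formed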

4.4. Create necessary folders

mkdir -p $HADOOP_HOME/tmp
mkdir -p $HADOOP_HOME/hdfs/name
mkdir -p $HADOOP_HOME/hdfs/data
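
The checkpoint directory referenced in hdfs-site.xml can be created up front as well (optional; HDFS creates it on demand on the SecondaryNameNode host):

mkdir -p $HADOOP_HOME/hdfs/namesecondary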

4.5. Copy the Hadoop folder and environment settings to the slaves

scp ~/.bashrc slave1:~/
scp ~/.bashrc slave2:~/

scp -r ~/hadoop-2.6.5 slave1:~/
scp -r ~/hadoop-2.6.5 slave2:~/
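
To confirm that both the Hadoop folder and the environment settings arrived on the slaves (the .bashrc is sourced explicitly because a non-interactive shell may not load it):

ssh slave1 'source ~/.bashrc; hadoop version'
ssh slave2 'source ~/.bashrc; hadoop version'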


5. Launch and shutdown the Hadoop cluster services

  • Format the namenode before the first launch
hdfs namenode -format
  • Launch the HDFS distributed file system
start-dfs.sh
  • Launch the YARN distributed computing system
start-yarn.sh
  • Shut down the Hadoop cluster when you are done
stop-yarn.sh
stop-dfs.sh
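  • Optionally, manage the MapReduce JobHistory server configured in mapred-site.xml (mr-jobhistory-daemon.sh ships with Hadoop 2.x under sbin, which is already on the PATH)
mr-jobhistory-daemon.sh start historyserver
mr-jobhistory-daemon.sh stop historyserver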


6. Verify the Hadoop cluster is up and healthy

6.1. Verify with jps processes

Run jps on each node and check the output.

  • Normal jps output on the master node:
# jps
32967 NameNode
33225 Jps
32687 ResourceManager 
  • On slave1 node:
# jps
28227 SecondaryNameNode
28496 Jps
28179 DataNode
28374 NodeManager 
  • On slave2 node:
# jps
27680 DataNode
27904 Jps
27784 NodeManager
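
Besides jps, you can also query HDFS and YARN directly (run as the hadoop user; with this layout you should see two live DataNodes and two NodeManagers):

hdfs dfsadmin -report   # lists live DataNodes and their capacity
yarn node -list         # lists registered NodeManagers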

6.2. Verify via the web interface

Visit http://192.168.171.132:50070 to view the HDFS storage status.
Visit http://192.168.171.132:8088 to view YARN computing resources and application status.
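
If you only have terminal access, a rough alternative is to probe the same ports with curl (any HTTP response means the corresponding web UI is answering):

curl -sI http://192.168.171.132:50070/
curl -sI http://192.168.171.132:8088/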

7. End

This covers a basic 3-node Hadoop cluster. The high-availability version and related Hadoop ecosystem components will be covered in other posts. Feel free to contact me if you spot a mistake, have suggestions, or find anything unclear. Hope you all enjoy the Hadoop journey! :)