These instructions have only been tested on:
- Red Hat Enterprise Linux Server release 6.8
Select the Hadoop distribution of your choice. Supported Hadoop versions are 2.6.0, 2.7.5 and 2.9.0.
Step 1 — Install Hadoop 2.x.x
For example: Hadoop 2.6.0
Download and extract the hadoop-2.6.0 binary onto your machine. It is available at hadoop-2.6.0.tar.gz.
$ mkdir ~/Hadoop
$ cd ~/Hadoop
$ wget https://archive.apache.org/dist/hadoop/core/hadoop-2.6.0/hadoop-2.6.0.tar.gz
$ tar -xvzf hadoop-2.6.0.tar.gz
Set the environment variables in the ~/.bashrc file.
$ vim ~/.bashrc
Add the following text to the file and update <where Java locates> and <where hadoop-2.6.0 locates> with the paths where Java and Hadoop are located on your system.
export JAVA_HOME="<where Java locates>" #e.g. ~/opt/jdk1.8.0_91
export HADOOP_HOME="<where hadoop-2.6.0 locates>" #e.g. ~/hadoop-2.6.0
export YARN_HOME=$HADOOP_HOME
export PATH=$HADOOP_HOME/bin:$JAVA_HOME/bin:$PATH
source $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
Run the following command to make sure the changes are applied.
$ source ~/.bashrc
Check if environment variables are set correctly by running the following command.
$ hadoop
The results should look similar to the example below.
Usage: hadoop [--config confdir] COMMAND
       where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
  checknative [-a|-h]  check native hadoop and compression libraries availability
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  credential           interact with credential providers
  daemonlog            get/set the log level for each daemon
  trace                view and modify Hadoop tracing settings
 or
  CLASSNAME            run the class named CLASSNAME

Most commands print help when invoked w/o parameters.
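As an optional extra check (not part of the original instructions), you can also print the installed version; the first line of the output should report the version you extracted, e.g. Hadoop 2.6.0.
$ hadoop version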
Follow steps (i)-(v) to modify the following files in the Apache Hadoop distribution.
(i). $HADOOP_HOME/etc/hadoop/core-site.xml
$ vim $HADOOP_HOME/etc/hadoop/core-site.xml
Copy the following text into the file and replace ${namenode} with the IP address of the name node and ${user.name} with your user name.
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://${namenode}:9010</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
  </property>
</configuration>
(ii). $HADOOP_HOME/etc/hadoop/hdfs-site.xml
$ vim $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Copy the following text into the file and replace ${hadoop_home} with the path where Hadoop is located on your system and ${namenode} with the IP address of the name node.
<configuration>
  <property>
    <name>dfs.hosts</name>
    <value>${hadoop_home}/etc/hadoop/slaves</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.http-address</name>
    <value>${namenode}:50271</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>${namenode}:50291</value>
  </property>
</configuration>
(iii). $HADOOP_HOME/etc/hadoop/mapred-site.xml
You will be creating this file. It doesn’t exist in the original package.
$ vim $HADOOP_HOME/etc/hadoop/mapred-site.xml
Copy the following text into the file.
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.map.collective.memory.mb</name>
    <value>100000</value>
  </property>
  <property>
    <name>mapreduce.map.collective.java.opts</name>
    <value>-Xmx90000m -Xms90000m</value>
  </property>
</configuration>
(iv). $HADOOP_HOME/etc/hadoop/yarn-site.xml
$ vim $HADOOP_HOME/etc/hadoop/yarn-site.xml
Copy the following text into the file and replace ${namenode} with the IP address of your name node and ${user.name} with your user name.
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>${namenode}</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>${namenode}:8132</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>${namenode}:8230</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/tmp/hadoop-${user.name}</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>128000</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>120000</value>
  </property>
  <property>
    <name>yarn.nodemanager.delete.debug-delay-sec</name>
    <value>10000000</value>
  </property>
</configuration>
(v). $HADOOP_HOME/etc/hadoop/slaves
$ vim $HADOOP_HOME/etc/hadoop/slaves
Update the slaves file by replacing ${namenode}, ${other node 1}, etc. with the IP addresses of the name node and the other data nodes.
${namenode}
${other node 1}
${other node 2}
...
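For example, on a hypothetical cluster whose name node is 192.168.1.10 and whose two data nodes are 192.168.1.11 and 192.168.1.12 (placeholder addresses only), the slaves file would contain:
192.168.1.10
192.168.1.11
192.168.1.12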
Format the file system using the following code.
$ hdfs namenode -format
You should be able to see it exit with status 0 as follows.
...
...
xx/xx/xx xx:xx:xx INFO util.ExitUtil: Exiting with status 0
xx/xx/xx xx:xx:xx INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at xxx.xxx.xxx.xxx
Launch NameNode, SecondaryNameNode and DataNode daemons.
$ $HADOOP_HOME/sbin/start-dfs.sh
Launch ResourceManager and NodeManager Daemons.
$ $HADOOP_HOME/sbin/start-yarn.sh
Check if the daemons started successfully by running the following command.
$ jps
The output should look similar to the following text, with xxxxx replaced by the process IDs for “NameNode”, “SecondaryNameNode”, etc.
xxxxx NameNode
xxxxx SecondaryNameNode
xxxxx DataNode
xxxxx NodeManager
xxxxx Jps
xxxxx ResourceManager
If any of the processes listed above are missing from your output, recheck your configuration and rerun steps 6 through 8 after executing the following commands.
Replace ${user.name} with the name given in step 5 (a).
$ $HADOOP_HOME/sbin/stop-dfs.sh
$ $HADOOP_HOME/sbin/stop-yarn.sh
$ rm -r /tmp/hadoop-${user.name}
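If a daemon still fails to start, its log file usually explains why; by default Hadoop writes its logs under $HADOOP_HOME/logs. The commands below are a general troubleshooting hint, not part of the original steps (the exact log file names depend on your user name and host name).
$ ls $HADOOP_HOME/logs
$ tail -n 50 $HADOOP_HOME/logs/hadoop-*-namenode-*.log
Once all the daemons are running, you can optionally confirm that HDFS accepts commands:
$ hdfs dfs -mkdir /test
$ hdfs dfs -ls /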
Step 2 — Install Harp
Clone the Harp repository using the following command. It is available at DSC-SPIDAL/harp.
$ git clone https://github.com/DSC-SPIDAL/harp.git
Set the environment variables in the ~/.bashrc file.
$ vim ~/.bashrc
Add the following text to the file and replace <where Harp locates> with the path where Harp is located on your system.
export HARP_ROOT_DIR="<where Harp locates>" #e.g. ~/harp
export HARP_HOME=$HARP_ROOT_DIR/core/
Run the following command to make sure the changes are applied.
$ source ~/.bashrc
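As an optional quick check (not part of the original steps), confirm that the variable points at your Harp clone; listing the directory should show sub-directories such as core and ml, which are used in the steps below.
$ echo $HARP_ROOT_DIR
$ ls $HARP_ROOT_DIR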
If Hadoop is still running, stop it first with the following commands.
$ $HADOOP_HOME/sbin/stop-dfs.sh
$ $HADOOP_HOME/sbin/stop-yarn.sh
Enter the Harp home directory using the following command.
$ cd $HARP_ROOT_DIR
Compile Harp
Select the profile matching your Hadoop version (for example, hadoop-2.6.0) and compile using Maven. Supported Hadoop versions are 2.6.0, 2.7.5 and 2.9.0.
$ mvn clean package -Phadoop-2.6.0
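If you are running one of the other supported Hadoop versions, use the matching profile instead. For example, assuming the profiles follow the same hadoop-<version> naming pattern:
$ mvn clean package -Phadoop-2.7.5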
Install the Harp plugin into Hadoop as shown below.
$ cp core/harp-collective/target/harp-collective-0.1.0.jar $HADOOP_HOME/share/hadoop/mapreduce/
$ cp core/harp-hadoop/target/harp-hadoop-0.1.0.jar $HADOOP_HOME/share/hadoop/mapreduce/
$ cp third_party/fastutil-7.0.13.jar $HADOOP_HOME/share/hadoop/mapreduce/
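As an optional check (not part of the original steps), verify that the three jars are now in Hadoop's mapreduce directory:
$ ls $HADOOP_HOME/share/hadoop/mapreduce/ | grep -e harp -e fastutil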
Edit mapred-site.xml in $HADOOP_HOME/etc/hadoop by using the following code.
$ vim $HADOOP_HOME/etc/hadoop/mapred-site.xml
Add java opts settings for map-collective tasks. For example:
<property>
  <name>mapreduce.map.collective.memory.mb</name>
  <value>512</value>
</property>
<property>
  <name>mapreduce.map.collective.java.opts</name>
  <value>-Xmx256m -Xms256m</value>
</property>
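These per-task values should fit within the YARN limits configured earlier in yarn-site.xml (yarn.scheduler.maximum-allocation-mb and yarn.nodemanager.resource.memory-mb); otherwise YARN cannot allocate the map-collective containers. As an optional check, you can review the current limits with:
$ grep -A 1 -e maximum-allocation-mb -e resource.memory-mb $HADOOP_HOME/etc/hadoop/yarn-site.xml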
You have completed the Harp installation.
Note
To develop Harp applications, add the following property when configuring the job.
jobConf.set("mapreduce.framework.name", "map-collective");
Step 3 — Run the Harp kmeans example
Format the other data nodes. Replace ${data node} with the IP address of each data node.
$ ssh ${data node}
$ hadoop datanode -format
Do this step on every node except the name node.
Copy the Harp examples to $HADOOP_HOME using the following code.
$ cp $HARP_ROOT_DIR/ml/java/target/harp-java-0.1.0.jar $HADOOP_HOME
Start Hadoop.
$ cd $HADOOP_HOME
$ sbin/start-dfs.sh
$ sbin/start-yarn.sh
ssh to the other data nodes and check whether the Hadoop processes are running.
$ jps
This output will only appear on the data nodes. The output should look similar to the following text, with xxxxx replaced by the process IDs for “DataNode” and “NodeManager”.
xxxxx DataNode
xxxxx NodeManager
xxxxx Jps
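Optionally, you can also verify from the name node that every data node has registered with HDFS; the number of live datanodes reported should match your slaves file.
$ hdfs dfsadmin -report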
To view your running applications in the terminal, use:
$ yarn application -list
To shutdown a running application, use:
$ yarn application -kill <application id>
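For example, to kill a job whose ID was shown by the -list command (the ID below is a made-up placeholder; real IDs follow the application_<cluster timestamp>_<sequence> pattern):
$ yarn application -kill application_1514200000000_0001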
Run the Kmeans map-collective job. Make sure you are in the $HADOOP_HOME folder. The usage is:
$ hadoop jar harp-java-0.1.0.jar edu.iu.kmeans.regroupallgather.KMeansLauncher <num of points> <num of centroids> <vector size> <num of point files per worker> <number of map tasks> <num threads> <number of iteration> <work dir> <local points dir>
<num of points> — the number of data points to generate randomly
<num of centroids> — the number of centroids to cluster the data into
<vector size> — the number of dimensions of the data
<num of point files per worker> — the number of files containing data points on each worker
<number of map tasks> — the number of map tasks
<num threads> — the number of threads to launch on each worker
<number of iteration> — the number of iterations to run
<work dir> — the root working directory for this run in HDFS
<local points dir> — Harp kmeans first generates the files containing data points in a local directory; this argument sets that directory.
For example:
$ hadoop jar harp-java-0.1.0.jar edu.iu.kmeans.regroupallgather.KMeansLauncher 1000 10 100 5 2 2 10 /kmeans /tmp/kmeans
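After the job finishes, you can optionally confirm that the work directory was created in HDFS; the exact sub-directories and file names depend on the example's implementation.
$ hdfs dfs -ls /kmeans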
To fetch the results, use the following command:
$ hdfs dfs -get <work dir> <local dir> #e.g. hdfs dfs -get /kmeans ~/Document