These instructions have only been tested on:

  • Red Hat Enterprise Linux Server release 6.8

Select the Hadoop version of your choice. Supported versions are 2.6.0, 2.7.5, and 2.9.0.

Step 1 — Install Hadoop 2.x.x

The following steps use Hadoop 2.6.0 as an example.

  1. Download the hadoop-2.6.0 binary release (hadoop-2.6.0.tar.gz) and extract it on your machine.

    $ mkdir ~/Hadoop
    $ cd ~/Hadoop
    $ wget https://archive.apache.org/dist/hadoop/core/hadoop-2.6.0/hadoop-2.6.0.tar.gz
    $ tar -xvzf hadoop-2.6.0.tar.gz
    
  2. Set the environment variables in file ~/.bashrc.

    $ vim ~/.bashrc
    

    Add the following text to the file, replacing <where Java is located> and <where hadoop-2.6.0 is located> with the paths where Java and Hadoop are installed on your system.

    export JAVA_HOME="<where Java is located>"
    #e.g. ~/opt/jdk1.8.0_91
    export HADOOP_HOME="<where hadoop-2.6.0 is located>"
    #e.g. ~/Hadoop/hadoop-2.6.0
    export YARN_HOME=$HADOOP_HOME
    export PATH=$HADOOP_HOME/bin:$JAVA_HOME/bin:$PATH
    source $HADOOP_HOME/etc/hadoop/hadoop-env.sh
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export HADOOP_MAPRED_HOME=$HADOOP_HOME
    export HADOOP_COMMON_HOME=$HADOOP_HOME
    export HADOOP_HDFS_HOME=$HADOOP_HOME
    export HADOOP_YARN_HOME=$HADOOP_HOME
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
    
  3. Run the following command to make sure the changes are applied.

    $ source ~/.bashrc
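    

    As a quick sanity check (optional), confirm that the variables resolve and that the hadoop binary is on your PATH.

    $ echo $JAVA_HOME
    $ echo $HADOOP_HOME
    $ which hadoop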
    
  4. Check if environment variables are set correctly by running the following command.

    $ hadoop
    

    The results should look similar to the example below.

    Usage: hadoop [--config confdir] COMMAND
           where COMMAND is one of:
      fs                   run a generic filesystem user client
      version              print the version
      jar <jar>            run a jar file
      checknative [-a|-h]  check native hadoop and compression libraries availability
      distcp <srcurl> <desturl> copy file or directories recursively
      archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
      classpath            prints the class path needed to get the
                           Hadoop jar and the required libraries
      credential           interact with credential providers
      daemonlog            get/set the log level for each daemon
      trace                view and modify Hadoop tracing settings
     or
      CLASSNAME            run the class named CLASSNAME
    Most commands print help when invoked w/o parameters.
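    

    You can also confirm the installed version; the first line of the output should read Hadoop 2.6.0.

    $ hadoop version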
    
  5. Follow steps (i)-(v) to modify the following files in the Apache Hadoop distribution.

    (i) $HADOOP_HOME/etc/hadoop/core-site.xml:

    $ vim $HADOOP_HOME/etc/hadoop/core-site.xml
    

    Copy the following text into the file and replace ${namenode} with the IP address of the name node and ${user.name} with your user name.

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://${namenode}:9010</value>
      </property>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/tmp/hadoop-${user.name}</value>
        <description>A base for other temporary directories.</description>
      </property>
    </configuration>
    

    (ii) $HADOOP_HOME/etc/hadoop/hdfs-site.xml:

    $ vim $HADOOP_HOME/etc/hadoop/hdfs-site.xml
    

    Copy the following text into the file and replace ${hadoop_home} with the path of where Hadoop is located in your system and ${namenode} with the IP address of the name node.

    <configuration>
      <property>
        <name>dfs.hosts</name>
        <value>${hadoop_home}/etc/hadoop/slaves</value>
      </property>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
      <property>
        <name>dfs.namenode.http-address</name>
        <value>${namenode}:50271</value>
      </property>
      <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>${namenode}:50291</value>
      </property>
    </configuration>
    

    (iii) $HADOOP_HOME/etc/hadoop/mapred-site.xml: You will create this file; it does not exist in the original package.

    $ vim $HADOOP_HOME/etc/hadoop/mapred-site.xml
    

    Copy the following text into the file.

    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
      <property>
        <name>mapreduce.map.collective.memory.mb</name>
        <value>100000</value>
      </property>
      <property>
        <name>mapreduce.map.collective.java.opts</name>
        <value>-Xmx90000m -Xms90000m</value>
      </property>
    </configuration>
    

    (iv) $HADOOP_HOME/etc/hadoop/yarn-site.xml:

    $ vim $HADOOP_HOME/etc/hadoop/yarn-site.xml
    

    Copy the following text into the file. Replace ${namenode} with the IP address of your name node and ${user.name} with your user name.

    <configuration>
      <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>${namenode}</value>
      </property>
      <property>
        <name>yarn.resourcemanager.address</name>
        <value>${namenode}:8132</value>
      </property>
      <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>${namenode}:8230</value>
      </property>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
      <property>
        <name>yarn.nodemanager.log-dirs</name>
        <value>/tmp/hadoop-${user.name}</value>
      </property>
      <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>128000</value>
      </property>
      <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>120000</value>
      </property>
      <property>
        <name>yarn.nodemanager.delete.debug-delay-sec</name>
        <value>10000000</value>
      </property>
    </configuration>
    

    (v) $HADOOP_HOME/etc/hadoop/slaves:

    $ vim $HADOOP_HOME/etc/hadoop/slaves
    

    Update the slaves file by replacing ${namenode} and the other placeholders with the IP addresses of the name node and the data nodes.

    ${namenode}
    ${other node 1}
    ${other node 2}
    ...
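    

    Before formatting, you can optionally confirm that Hadoop picks up the values you just configured. The getconf tool reads the core and HDFS configuration; the keys below are ones set above.

    $ hdfs getconf -confKey fs.default.name
    $ hdfs getconf -confKey dfs.replication
    $ hdfs getconf -namenodes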
    
  6. Format the file system using the following command.

    $ hdfs namenode -format
    

    You should see it exit with status 0, as shown below.

    ...
    ...
    xx/xx/xx xx:xx:xx INFO util.ExitUtil: Exiting with status 0
    xx/xx/xx xx:xx:xx INFO namenode.NameNode: SHUTDOWN_MSG:
    /************************************************************
    SHUTDOWN_MSG: Shutting down NameNode at xxx.xxx.xxx.xxx
    
  7. Launch NameNode, SecondaryNameNode and DataNode daemons.

    $ $HADOOP_HOME/sbin/start-dfs.sh
    
  8. Launch ResourceManager and NodeManager Daemons.

    $ $HADOOP_HOME/sbin/start-yarn.sh
    
  9. Check if the daemons started successfully by running the following command.

    $ jps
    

    The output should look similar to the following text, with xxxxx replaced by the process IDs for “NameNode”, “SecondaryNameNode”, etc.

    xxxxx NameNode
    xxxxx SecondaryNameNode
    xxxxx DataNode
    xxxxx NodeManager
    xxxxx Jps
    xxxxx ResourceManager
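    

    If all six processes are present, you can also confirm that the DataNode has registered with the NameNode; the report should show at least one live data node.

    $ hdfs dfsadmin -report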
    

    If all the processes listed above aren’t in your output, recheck your configurations, run the following commands, and then rerun steps 6 through 8.

    Replace ${user.name} with the user name given in step 5 (i).

    $ $HADOOP_HOME/sbin/stop-dfs.sh
    $ $HADOOP_HOME/sbin/stop-yarn.sh
    $ rm -r /tmp/hadoop-${user.name} 
    

Step 2 — Install Harp

  1. Clone the Harp repository (DSC-SPIDAL/harp on GitHub) using the following command.

    $ git clone https://github.com/DSC-SPIDAL/harp.git
    
  2. Set the environment variables in file ~/.bashrc.

    $ vim ~/.bashrc
    

    Add the following text to the file, replacing <where Harp is located> with the path where Harp is located on your system.

    export HARP_ROOT_DIR="<where Harp is located>"
    #e.g. ~/harp
    export HARP_HOME=$HARP_ROOT_DIR/core/
    
  3. Run the following command to make sure the changes are applied.

    $ source ~/.bashrc
    
  4. If Hadoop is still running, stop it first with the following commands.

    $ $HADOOP_HOME/sbin/stop-dfs.sh
    $ $HADOOP_HOME/sbin/stop-yarn.sh
    
  5. Enter the Harp home directory using the following command.

    $ cd $HARP_ROOT_DIR
    
  6. Compile Harp.

    Select the Maven profile that matches your Hadoop version (for example, hadoop-2.6.0) and compile with Maven. Supported Hadoop versions are 2.6.0, 2.7.5, and 2.9.0.

    $ mvn clean package -Phadoop-2.6.0
    
  7. Install the Harp plugin into Hadoop by copying the jars as shown below.

    $ cp core/harp-collective/target/harp-collective-0.1.0.jar $HADOOP_HOME/share/hadoop/mapreduce/
    $ cp core/harp-hadoop/target/harp-hadoop-0.1.0.jar $HADOOP_HOME/share/hadoop/mapreduce/
    $ cp third_party/fastutil-7.0.13.jar $HADOOP_HOME/share/hadoop/mapreduce/
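    

    To confirm the jars are in place, a quick optional check:

    $ ls $HADOOP_HOME/share/hadoop/mapreduce/ | grep -E 'harp|fastutil'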
    
  8. Edit mapred-site.xml in $HADOOP_HOME/etc/hadoop using the following command.

    $ vim $HADOOP_HOME/etc/hadoop/mapred-site.xml
    

    Add the memory and Java opts settings for map-collective tasks. If you already set these properties in Step 1, adjust the values to fit the memory of your nodes. For example:

    <property>
      <name>mapreduce.map.collective.memory.mb</name>
      <value>512</value>
    </property>
    <property>
      <name>mapreduce.map.collective.java.opts</name>
      <value>-Xmx256m -Xms256m</value>
    </property>
    

    You have completed the Harp installation.

    Note

    To develop Harp applications, set the following property when configuring the job.

    jobConf.set("mapreduce.framework.name", "map-collective");
    

Step 3 — Run the Harp K-means example

  1. Format the other data nodes. Replace ${data node} with the IP address of each data node.

    $ ssh ${data node}
    $ hadoop datanode -format
    

    You have to do this step on every node except the name node; with several data nodes, a small loop (sketched below) saves repetition.
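
    A minimal sketch, assuming passwordless ssh, that the remote shell picks up the Hadoop environment from ~/.bashrc, and that datanode1 and datanode2 stand in for your data node addresses:

    $ for node in datanode1 datanode2; do ssh $node 'source ~/.bashrc; hadoop datanode -format'; done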

  2. Copy the Harp examples jar to $HADOOP_HOME using the following command.

    $ cp $HARP_ROOT_DIR/ml/java/target/harp-java-0.1.0.jar $HADOOP_HOME
    
  3. Start Hadoop.

    $ cd $HADOOP_HOME
    $ sbin/start-dfs.sh
    $ sbin/start-yarn.sh
    
  4. SSH into the other data nodes and check whether the Hadoop processes are running.

    $ jps
    

    This output appears on the data nodes only. It should look similar to the following text, with xxxxx replaced by the process IDs for “DataNode” and “NodeManager”.

    xxxxx DataNode
    xxxxx NodeManager
    xxxxx Jps
    
  5. To view your running applications in the terminal, use:

    $ yarn application -list
    
  6. To shut down a running application, use:

    $ yarn application -kill <application-id>
    
  7. Run the K-means map-collective job. Make sure you are in the $HADOOP_HOME directory. The usage is:

    $ hadoop jar harp-java-0.1.0.jar edu.iu.kmeans.regroupallgather.KMeansLauncher <num of points> <num of centroids> \
        <vector size> <num of point files per worker> <number of map tasks> <num threads> <number of iteration> <work dir> <local points dir>
    
    • <num of points> — the number of data points to generate randomly
    • <num of centroids> — the number of centroids to cluster the data into
    • <vector size> — the number of dimensions of the data
    • <num of point files per worker> — the number of files containing data points on each worker
    • <number of map tasks> — the number of map tasks
    • <num threads> — the number of threads to launch on each worker
    • <number of iteration> — the number of iterations to run
    • <work dir> — the root directory for this run in HDFS
    • <local points dir> — Harp K-means first generates the files containing data points in a local directory; this argument sets that directory

    For example:

    $ hadoop jar harp-java-0.1.0.jar edu.iu.kmeans.regroupallgather.KMeansLauncher 1000 10 100 5 2 2 10 /kmeans /tmp/kmeans
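    

    Once the job completes, you can list what it wrote to the work directory (assuming the /kmeans work directory from the example above):

    $ hdfs dfs -ls /kmeans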
    
  8. To fetch the results, use the following command:

    $ hdfs dfs -get <work dir> <local dir>
    #e.g. hdfs dfs -get /kmeans ~/Document