These instructions have only been tested on:

  • Red Hat Enterprise Linux Server release 6.8

Select the Hadoop version of your choice. Supported versions are 2.6.0, 2.7.5, and 2.9.0.

Step 1 — Install Hadoop 2.x.x

The following steps use Hadoop 2.6.0 as an example.

  1. Download the hadoop-2.6.0 binary release (hadoop-2.6.0.tar.gz) and extract it on your machine.

    $ mkdir ~/Hadoop
    $ cd ~/Hadoop
    $ wget https://archive.apache.org/dist/hadoop/core/hadoop-2.6.0/hadoop-2.6.0.tar.gz
    $ tar -xvzf hadoop-2.6.0.tar.gz
    
  2. Set the environment variables in file ~/.bashrc.

    $ vim ~/.bashrc
    

    Add the following text to the file, replacing <where Java is located> and <where hadoop-2.6.0 is located> with the paths where Java and Hadoop are installed on your system.

    export JAVA_HOME="<where Java is located>"
    #e.g. ~/opt/jdk1.8.0_91
    export HADOOP_HOME="<where hadoop-2.6.0 is located>"
    #e.g. ~/Hadoop/hadoop-2.6.0
    export YARN_HOME=$HADOOP_HOME
    export PATH=$HADOOP_HOME/bin:$JAVA_HOME/bin:$PATH
    source $HADOOP_HOME/etc/hadoop/hadoop-env.sh
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export HADOOP_MAPRED_HOME=$HADOOP_HOME
    export HADOOP_COMMON_HOME=$HADOOP_HOME
    export HADOOP_HDFS_HOME=$HADOOP_HOME
    export HADOOP_YARN_HOME=$HADOOP_HOME
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
    
  3. Run the following command to make sure the changes are applied.

    $ source ~/.bashrc
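    

    As a quick sanity check (optional), confirm that the variables resolve and that the hadoop binary is on your PATH.

    $ echo $JAVA_HOME
    $ echo $HADOOP_HOME
    $ which hadoop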
    
  4. Check if environment variables are set correctly by running the following command.

    $ hadoop
    

    The results should look similar to the example below.

    Usage: hadoop [--config confdir] COMMAND
           where COMMAND is one of:
      fs                   run a generic filesystem user client
      version              print the version
      jar <jar>            run a jar file
      checknative [-a|-h]  check native hadoop and compression libraries availability
      distcp <srcurl> <desturl> copy file or directories recursively
      archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
      classpath            prints the class path needed to get the
                           Hadoop jar and the required libraries
      credential           interact with credential providers
      daemonlog            get/set the log level for each daemon
      trace                view and modify Hadoop tracing settings
     or
      CLASSNAME            run the class named CLASSNAME
    Most commands print help when invoked w/o parameters.
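    

    You can also confirm the installed version; the first line of the output should read Hadoop 2.6.0.

    $ hadoop version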
    
  5. Follow steps (i)-(v) to modify the following files in the Apache Hadoop distribution.

    (i) $HADOOP_HOME/etc/hadoop/core-site.xml:

    $ vim $HADOOP_HOME/etc/hadoop/core-site.xml
    

    Copy the following text into the file and replace ${namenode} with the IP address of the name node and ${user.name} with your user name.

    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://${namenode}:9010</value>
      </property>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/tmp/hadoop-${user.name}</value>
        <description>A base for other temporary directories.</description>
      </property>
    </configuration>
    

    (ii) $HADOOP_HOME/etc/hadoop/hdfs-site.xml:

    $ vim $HADOOP_HOME/etc/hadoop/hdfs-site.xml
    

    Copy the following text into the file and replace ${hadoop_home} with the path of where Hadoop is located in your system and ${namenode} with the IP address of the name node.

    <configuration>
      <property>
        <name>dfs.hosts</name>
        <value>${hadoop_home}/etc/hadoop/slaves</value>
      </property>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
      <property>
        <name>dfs.namenode.http-address</name>
        <value>${namenode}:50271</value>
      </property>
      <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>${namenode}:50291</value>
      </property>
    </configuration>
    

    (iii) $HADOOP_HOME/etc/hadoop/mapred-site.xml: You will create this file; it does not exist in the original package.

    $ vim $HADOOP_HOME/etc/hadoop/mapred-site.xml
    

    Copy the following text into the file.

    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
      <property>
        <name>mapreduce.map.collective.memory.mb</name>
        <value>100000</value>
      </property>
      <property>
        <name>mapreduce.map.collective.java.opts</name>
        <value>-Xmx90000m -Xms90000m</value>
      </property>
    </configuration>
    

    (iv) $HADOOP_HOME/etc/hadoop/yarn-site.xml:

    $ vim $HADOOP_HOME/etc/hadoop/yarn-site.xml
    

    Copy the following text into the file. Replace ${namenode} with the IP address of your name node and ${user.name} with your user name.

    <configuration>
      <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>${namenode}</value>
      </property>
      <property>
        <name>yarn.resourcemanager.address</name>
        <value>${namenode}:8132</value>
      </property>
      <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>${namenode}:8230</value>
      </property>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
      <property>
        <name>yarn.nodemanager.log-dirs</name>
        <value>/tmp/hadoop-${user.name}</value>
      </property>
      <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>128000</value>
      </property>
      <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>120000</value>
      </property>
      <property>
        <name>yarn.nodemanager.delete.debug-delay-sec</name>
        <value>10000000</value>
      </property>
    </configuration>
    

    (v) $HADOOP_HOME/etc/hadoop/slaves:

    $ vim $HADOOP_HOME/etc/hadoop/slaves
    

    Update the slaves file by replacing ${namenode} and the other placeholders with the IP addresses of the name node and the data nodes.

    ${namenode}
    ${other node 1}
    ${other node 2}
    ...
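    

    Before formatting, you can optionally confirm that Hadoop picks up the values you just configured. The getconf tool reads the core and HDFS configuration; the keys below are ones set above.

    $ hdfs getconf -confKey fs.default.name
    $ hdfs getconf -confKey dfs.replication
    $ hdfs getconf -namenodes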
    
  6. Format the file system using the following command.

    $ hdfs namenode -format
    

    You should see it exit with status 0, as shown below.

    ...
    ...
    xx/xx/xx xx:xx:xx INFO util.ExitUtil: Exiting with status 0
    xx/xx/xx xx:xx:xx INFO namenode.NameNode: SHUTDOWN_MSG:
    /************************************************************
    SHUTDOWN_MSG: Shutting down NameNode at xxx.xxx.xxx.xxx
    
  7. Launch NameNode, SecondaryNameNode and DataNode daemons.

    $ $HADOOP_HOME/sbin/start-dfs.sh
    
  8. Launch ResourceManager and NodeManager Daemons.

    $ $HADOOP_HOME/sbin/start-yarn.sh
    
  9. Check if the daemons started successfully by running the following command.

    $ jps
    

    The output should look similar to the following text, with xxxxx replaced by the process IDs for “NameNode”, “SecondaryNameNode”, etc.

    xxxxx NameNode
    xxxxx SecondaryNameNode
    xxxxx DataNode
    xxxxx NodeManager
    xxxxx Jps
    xxxxx ResourceManager
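    

    If all six processes are present, you can also confirm that the DataNode has registered with the NameNode; the report should show at least one live data node.

    $ hdfs dfsadmin -report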
    

    If all the processes listed above aren’t in your output, recheck your configurations, run the following commands, and then rerun steps 6 through 8.

    Replace ${user.name} with the user name given in step 5 (i).

    $ $HADOOP_HOME/sbin/stop-dfs.sh
    $ $HADOOP_HOME/sbin/stop-yarn.sh
    $ rm -r /tmp/hadoop-${user.name} 
    

Step 2 — Install Harp

  1. Clone the Harp repository (DSC-SPIDAL/harp on GitHub) using the following command.

    $ git clone https://github.com/DSC-SPIDAL/harp.git
    
  2. Set the environment variables in file ~/.bashrc.

    $ vim ~/.bashrc
    

    Add the following text to the file, replacing <where Harp is located> with the path where Harp is located on your system.

    export HARP_ROOT_DIR="<where Harp is located>"
    #e.g. ~/harp
    export HARP_HOME=$HARP_ROOT_DIR/core/
    
  3. Run the following command to make sure the changes are applied.

    $ source ~/.bashrc
    
  4. If Hadoop is still running, stop it first with the following commands.

    $ $HADOOP_HOME/sbin/stop-dfs.sh
    $ $HADOOP_HOME/sbin/stop-yarn.sh
    
  5. Enter the Harp home directory using the following command.

    $ cd $HARP_ROOT_DIR
    
  6. Compile Harp.

    Select the Maven profile that matches your Hadoop version (for example, hadoop-2.6.0) and compile with Maven. Supported Hadoop versions are 2.6.0, 2.7.5, and 2.9.0.

    $ mvn clean package -Phadoop-2.6.0
    
  7. Install the Harp plugin into Hadoop by copying the jars as shown below.

    $ cp core/harp-collective/target/harp-collective-0.1.0.jar $HADOOP_HOME/share/hadoop/mapreduce/
    $ cp core/harp-hadoop/target/harp-hadoop-0.1.0.jar $HADOOP_HOME/share/hadoop/mapreduce/
    $ cp third_party/fastutil-7.0.13.jar $HADOOP_HOME/share/hadoop/mapreduce/
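    

    To confirm the jars are in place, a quick optional check:

    $ ls $HADOOP_HOME/share/hadoop/mapreduce/ | grep -E 'harp|fastutil'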
    
  8. Edit mapred-site.xml in $HADOOP_HOME/etc/hadoop using the following command.

    $ vim $HADOOP_HOME/etc/hadoop/mapred-site.xml
    

    Add the memory and Java opts settings for map-collective tasks. If you already set these properties in Step 1, adjust the values to fit the memory of your nodes. For example:

    <property>
      <name>mapreduce.map.collective.memory.mb</name>
      <value>512</value>
    </property>
    <property>
      <name>mapreduce.map.collective.java.opts</name>
      <value>-Xmx256m -Xms256m</value>
    </property>
    

    You have completed the Harp installation.

    Note

    To develop Harp applications, set the following property when configuring the job.

    jobConf.set("mapreduce.framework.name", "map-collective");
    

Step 3 — Run the Harp K-means example

  1. Format the other data nodes. Replace ${data node} with the IP address of each data node.

    $ ssh ${data node}
    $ hadoop datanode -format
    

    You have to do this step on every node except the name node; with several data nodes, a small loop (sketched below) saves repetition.
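
    A minimal sketch, assuming passwordless ssh, that the remote shell picks up the Hadoop environment from ~/.bashrc, and that datanode1 and datanode2 stand in for your data node addresses:

    $ for node in datanode1 datanode2; do ssh $node 'source ~/.bashrc; hadoop datanode -format'; done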

  2. Copy the Harp examples jar to $HADOOP_HOME using the following command.

    $ cp $HARP_ROOT_DIR/ml/java/target/harp-java-0.1.0.jar $HADOOP_HOME
    
  3. Start Hadoop.

    $ cd $HADOOP_HOME
    $ sbin/start-dfs.sh
    $ sbin/start-yarn.sh
    
  4. SSH into the other data nodes and check whether the Hadoop processes are running.

    $ jps
    

    This output appears on the data nodes only. It should look similar to the following text, with xxxxx replaced by the process IDs for “DataNode” and “NodeManager”.

    xxxxx DataNode
    xxxxx NodeManager
    xxxxx Jps
    
  5. To view your running applications in the terminal, use:

    $ yarn application -list
    
  6. To shut down a running application, use:

    $ yarn application -kill <application-id>
    
  7. Run the K-means map-collective job. Make sure you are in the $HADOOP_HOME directory. The usage is:

    $ hadoop jar harp-java-0.1.0.jar edu.iu.kmeans.regroupallgather.KMeansLauncher <num of points> <num of centroids> \
        <vector size> <num of point files per worker> <number of map tasks> <num threads> <number of iteration> <work dir> <local points dir>
    
    • <num of points> — the number of data points to generate randomly
    • <num of centroids> — the number of centroids to cluster the data into
    • <vector size> — the number of dimensions of the data
    • <num of point files per worker> — the number of files containing data points on each worker
    • <number of map tasks> — the number of map tasks
    • <num threads> — the number of threads to launch on each worker
    • <number of iteration> — the number of iterations to run
    • <work dir> — the root directory for this run in HDFS
    • <local points dir> — Harp K-means first generates the files containing data points in a local directory; this argument sets that directory

    For example:

    $ hadoop jar harp-java-0.1.0.jar edu.iu.kmeans.regroupallgather.KMeansLauncher 1000 10 100 5 2 2 10 /kmeans /tmp/kmeans
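    

    Once the job completes, you can list what it wrote to the work directory (assuming the /kmeans work directory from the example above):

    $ hdfs dfs -ls /kmeans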
    
  8. To fetch the results, use the following command:

    $ hdfs dfs -get <work dir> <local dir>
    #e.g. hdfs dfs -get /kmeans ~/Document