Sunday, June 18, 2017

HiBench Suite―How to Build and Run the Big Data Benchmarks

As described in a previous article:
Three Benchmarks for SQL Coverage in HiBench Suite ― a Bigdata Micro Benchmark Suite
HiBench Suite is a big data benchmark suite that helps evaluate different big data frameworks in terms of speed, throughput and system resource utilization.

When your big data platform (e.g., HDP) evolves, there comes a time when you need to upgrade your benchmark suite accordingly.

In this article, we will cover how to pick up the latest HiBench Suite (i.e., version 6.1) to work with Spark 2.1.



HiBench Suite


To download the master branch of HiBench Suite, you can visit its home page. As of 06/18/2017, its latest version is 6.1.

To download, we have selected "Download ZIP" and saved it to our Linux system.


Maven


From the home page, you can select the "docs" link to view all available document links.
The build-hibench.md document explains how to build HiBench Suite using Maven. For example, to build all workloads in HiBench, use the following command:

mvn -Dspark=2.1 -Dscala=2.11 clean package
This can be time-consuming because hadoopbench (one of the workloads) relies on third-party tools such as Mahout and Nutch, which the build process downloads automatically. If you won't run those workloads, you can build only a specific framework (e.g., sparkbench) to speed up the build.
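
As a sketch of the narrower build mentioned above, the snippet below composes the Maven command for a single framework profile and prints it for review instead of running it. The helper name hibench_build_cmd is hypothetical, and the -Psparkbench profile name is taken from build-hibench.md as we recall it, so treat it as an assumption and check the document for your HiBench version.

```shell
#!/usr/bin/env bash
# Sketch: compose the HiBench build command for one framework profile.
# Assumption: profile names such as "sparkbench" follow build-hibench.md.
# SPARK_VERSION and SCALA_VERSION mirror the values used in this article.
SPARK_VERSION="${SPARK_VERSION:-2.1}"
SCALA_VERSION="${SCALA_VERSION:-2.11}"

# hibench_build_cmd is a hypothetical helper, not part of HiBench.
# With no argument it prints the full build; with a profile it adds -P<profile>.
hibench_build_cmd() {
  local profile="${1:-}"
  local cmd="mvn"
  if [ -n "$profile" ]; then
    cmd="$cmd -P$profile"
  fi
  echo "$cmd -Dspark=$SPARK_VERSION -Dscala=$SCALA_VERSION clean package"
}

# Print the command rather than executing it, so it can be inspected first.
hibench_build_cmd sparkbench
```

Once the printed command looks right, it can be pasted into the shell from the HiBench root directory.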

To get familiar with Maven, you can start with this pdf file (see [8]). In it, you will learn how to download Maven and how to set up your system to run it. Here we will discuss only the issues that we ran into while building all workloads using Maven.


Maven Installation Issues and Solutions


Proxy Server

Since our Linux system sits behind a firewall, we need to set the following environment variables:
export http_proxy=http://your.proxy.com:80/
export https_proxy=http://your.proxy.com:80/
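
If you switch between networks, it may help to wrap these exports in a small function. The sketch below is one way to do that; set_proxy is a hypothetical helper, the host and port are the same placeholders as above, and the no_proxy list is only an example.

```shell
#!/usr/bin/env bash
# set_proxy is a hypothetical helper; host and port are placeholders.
set_proxy() {
  local host="$1"
  local port="$2"
  local url="http://${host}:${port}/"
  export http_proxy="$url"
  export https_proxy="$url"
  # Some tools read only the upper-case names, so export both spellings.
  export HTTP_PROXY="$url"
  export HTTPS_PROXY="$url"
  # Hosts that should bypass the proxy (example list, adjust as needed).
  export no_proxy="localhost,127.0.0.1"
}

set_proxy your.proxy.com 80
echo "$http_proxy"
```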

Environment Setup

As instructed in the pdf file, we have set up the following additional environment variables:

export JAVA_HOME=~/JVMs/8u40_fcs
export PATH=/scratch/username/maven/apache-maven-3.5.0/bin:$PATH
export PATH=$JAVA_HOME/bin:$PATH
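
Before starting a long build, it can be worth confirming that the shell actually picks up the intended java and mvn binaries. The following is a minimal sketch of such a check; require_tool is a hypothetical helper and not part of Maven or HiBench.

```shell
#!/usr/bin/env bash
# require_tool is a hypothetical helper: it reports whether a command is on PATH.
require_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1 -> $(command -v "$1")"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Check the two tools the build depends on; print versions only when present.
require_tool java && java -version
require_tool mvn && mvn -v

# JAVA_HOME should point at the JDK that owns the java binary found above.
echo "JAVA_HOME=${JAVA_HOME:-<unset>}"
```

If either tool is reported as missing, revisit the PATH exports above before running the build.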


Maven Configuration & Debugging

POM stands for Project Object Model, which
  • Is the fundamental unit of work in Maven
  • Is an XML file
  • Always resides in the base directory of the project as pom.xml.

The POM contains information about the project and the various configuration details used by Maven to build the project(s).

Separately from the POM, Maven reads user-level settings from ~/.m2/settings.xml. In that file, we have set the following entries:

<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">
  <localRepository>/scratch/username/.m2/repository</localRepository>
  <servers>
    <server>
      <id>central</id>
      <configuration>
        <httpConfiguration>
          <all>
            <!-- Timeouts are in milliseconds -->
            <connectionTimeout>120000</connectionTimeout>
            <readTimeout>120000</readTimeout>
          </all>
        </httpConfiguration>
      </configuration>
    </server>
  </servers>
</settings>

First, we have set localRepository to a new location because of an issue described in [7,8]. Second, we have set longer timeouts (120000 ms, i.e., two minutes) for both connection and read.

If you run into issues with a plugin, you can use "help:describe" to display a list of its attributes and goals for debugging:
mvn help:describe -Dplugin=com.googlecode.maven-download-plugin:maven-download-plugin
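
The help:describe goal also accepts -Ddetail (to list every parameter) and -Dgoal (to narrow the output to one goal). The sketch below prints such a command for review; describe_cmd is a hypothetical helper, and "wget" is assumed to be the goal of interest for the download plugin.

```shell
#!/usr/bin/env bash
# Plugin coordinates as used in this article.
PLUGIN="com.googlecode.maven-download-plugin:maven-download-plugin"

# describe_cmd is a hypothetical helper that prints a help:describe command line.
# $1 (optional) narrows the output to a single goal; -Ddetail lists all parameters.
describe_cmd() {
  local goal="${1:-}"
  local cmd="mvn help:describe -Dplugin=$PLUGIN -Ddetail"
  if [ -n "$goal" ]; then
    cmd="$cmd -Dgoal=$goal"
  fi
  echo "$cmd"
}

# Assumption: "wget" is the goal we want to inspect.
describe_cmd wget
```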

How to Run Sparkbench


To learn how to run sparkbench, see the following document:
run-sparkbench.md
Without further ado, we will focus on the configuration part of the task. For other details, please refer to that document.

New Configuration Files

In the new HiBench, there are two levels of configuration:

(Global level)
${hibench.home}/conf/hadoop.conf
${hibench.home}/conf/hibench.conf
${hibench.home}/conf/spark.conf

(Workload level)
${hibench.home}/conf/workloads/micro/terasort.conf

It has also introduced a new hierarchy (i.e., categories such as micro, websearch, sql, etc.) to organize workload runtime scripts:
${hibench.home}/bin/workloads/<benchmark>/<framework>
  where <benchmark> could be:
    micro/terasort
    websearch/pagerank
    sql/aggregation
    sql/join
    sql/scan
  where <framework> could be:
    spark
    hadoop
    prepare
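
To make the hierarchy concrete, here is a small sketch that maps a benchmark and a framework to a script path. It rests on assumptions that do not come from this article: HIBENCH_HOME is a placeholder for the unpacked HiBench directory, the bin/workloads prefix and the prepare.sh / run.sh script names follow the HiBench run documentation as we recall it, and workload_script is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Assumption: HIBENCH_HOME is a placeholder for the unpacked HiBench directory.
HIBENCH_HOME="${HIBENCH_HOME:-/path/to/HiBench}"

# workload_script is a hypothetical helper, not part of HiBench.
# Assumption: the "prepare" step uses prepare.sh, every other framework uses run.sh,
# and all scripts live under bin/workloads.
workload_script() {
  local benchmark="$1"
  local framework="$2"
  local script="run.sh"
  if [ "$framework" = "prepare" ]; then
    script="prepare.sh"
  fi
  echo "$HIBENCH_HOME/bin/workloads/$benchmark/$framework/$script"
}

# Typical order: generate the input data first, then run the Spark workload.
workload_script micro/terasort prepare
workload_script micro/terasort spark
```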
Similarly, workload-specific configuration files are stored under the new category level:

${hibench.home}/conf/workloads/<benchmark.conf>
  where <benchmark.conf> could be:
    micro/terasort.conf
    websearch/pagerank.conf
    sql/aggregation.conf
    sql/join.conf
    sql/scan.conf
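
One way to reason about the two configuration levels is to ask which file supplies the effective value of a property. The sketch below assumes that a workload-level file overrides the global files when both define the same key, and that each line has the form "key value"; neither assumption is confirmed in this article. conf_lookup is a hypothetical helper, and the files and values are sample data created in a temporary directory.

```shell
#!/usr/bin/env bash
# conf_lookup is a hypothetical helper, not part of HiBench.
# Usage: conf_lookup <key> <file>...
# Prints the value from the last listed file that defines <key>;
# returns non-zero if no file defines it.
conf_lookup() {
  local key="$1"
  shift
  awk -v key="$key" '
    /^[[:space:]]*#/ { next }               # skip comment lines
    $1 == key { value = $2; found = 1 }     # later files overwrite earlier ones
    END { if (found) print value; else exit 1 }
  ' "$@"
}

# Sample data only: a global file and a workload file that define the same key.
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/conf/workloads/micro"
printf 'hibench.scale.profile tiny\n' > "$demo_dir/conf/hibench.conf"
printf 'hibench.scale.profile large\n' > "$demo_dir/conf/workloads/micro/terasort.conf"

# Global file first, workload file last, so the workload value is printed: large
conf_lookup hibench.scale.profile \
  "$demo_dir/conf/hibench.conf" \
  "$demo_dir/conf/workloads/micro/terasort.conf"

rm -rf "$demo_dir"
```

Listing the files from most general to most specific keeps the precedence rule visible in the call itself.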


References

  1. Hortonworks Data Platform (HDP®)
  2. Readme (HiBench 6.1)
  3. HiBench Download
  4. How to build HiBench (HiBench 6.1)
  5. How to run sparkbench (HiBench 6.1)
  6. How-to documents (HiBench 6.1)
  7. Idiosyncrasies of ${HOME} that is an NFS Share (Xml and More)
  8. Apache Maven Build Tool (pdf)
  9. How do I set the location of my local Maven repository?
  10. Guide to Configuring Plug-ins (Apache Maven Project)
  11. Available Plugins (Apache Maven Project)
  12. MojoExecutionException
  13. Installing Maven Plugins (SourceForge.net)
  14. Download Plugin For Maven » 1.2.0
  15. Group: com.googlecode.maven-download-plugin
