Saturday, August 19, 2017

JMeter—How to Reorder and Regroup JMeter Elements

The JMeter test tree contains elements that are both hierarchical and ordered. Some elements in the test tree are strictly hierarchical (Listeners, Config Elements, Post-Processors, Pre-Processors, Assertions, Timers), and some are primarily ordered (controllers, samplers).[1,7]

When you add a new JMeter element to a parent element (e.g., Test Plan, Thread Group, etc.), it is appended to the end of the parent's child list.  So, sometimes you need to reorder elements in the list or regroup them under different parent elements.  For this purpose, the following JMeter GUI features come in handy:
  • Drag and drop
  • Cut, copy, and paste
 In this article, we will demonstrate these editing capabilities of Apache JMeter.

Cut, Copy and Paste (Case #1)

In the following Test Plan, we have three different Thread Groups under the Test Plan.  At the beginning, all child elements were listed under jp@gc - Stepping Thread Group.

To move these child elements into Thread Group, I can click on CSV Data Set Config, hold the Shift key, extend the selection to all child elements with the Down Arrow, and press Ctrl-X to cut them.

To paste them into Thread Group, I click on Thread Group and press Ctrl-V.

In this scenario, it will be easy for you to experiment with the three different Thread Group plugins and learn their different capabilities.

Cut, Copy, and Paste (Case #2)

Sometimes you want to copy JMeter elements from one .jmx file to another.  In this case, you can launch two JMeter GUIs following the instructions here.  For example, on Windows you can double-click jmeter.bat twice to start two different JMeter sessions.

After you open two Test Plans in two different GUIs, you can copy and paste elements from one JMeter to the other, similar to the previous example.

Drag and Drop

For elements stored in the test tree, you can also drag and drop them from one position to another or change their levels in the tree hierarchy.  Note that the level of an element in the test tree determines the scope of its effect.  Read [1,6,7] for more information.

To drag a child element from one position to another position, for example,  I can click on HTTP Cache Manager,

drag it to a new position (i.e., before HTTP Cookie Manager), and drop it.

Note that the ordering of Cookie and Cache Managers in this example doesn't matter.


Monday, August 14, 2017

JMeter―Using the Transaction Controller

Apache JMeter is a performance testing tool written in Java that supports many operating systems. Controllers are a main part of JMeter; they control the execution flow of JMeter scripts during load testing.

For example, you can use a Transaction Controller to get the total execution time of a transaction (i.e., an end-to-end scenario), which might include the following transaction steps:
Login → Compute Details → Billing Metrics → Back to Dashboard → Logout

Watch the video below for more details on the Transaction Controller.


JMeter has two types of Controllers:[3]

  • Samplers
    • Can be used to specify which types of requests are sent to a server
    • You may add Configuration Elements to these Samplers to customize your server requests.
    • Examples of Samplers include, but are not limited to:
      • HTTP Request
      • FTP Request
      • JDBC Request
      • Java Request
      • SOAP/XML-RPC Request
      • WebService (SOAP) Request
      • LDAP Request
      • LDAP Extended Request
      • Access Log Sampler
      • BeanShell Sampler
  • Logic Controllers
    • Can be used to customize the logic that JMeter uses to decide when to send requests
      • For these requests, JMeter may randomly select (using Random Controller), repeat (using Loop Controller), interleave (using Interleave Controller), etc.
      • The child elements of a Logic Controller may comprise samplers and other logic controllers
    • Examples of Logic Controllers include, but are not limited to:
      • Transaction Controller
      • Simple Controller
      • Loop Controller
      • Interleave Controller
      • Random Controller
      • Random Order Controller
      • Throughput Controller
      • Recording Controller
In this article, we will focus mainly on the Transaction Controller, which may be used to
  • Generate a “virtual” sample that measures the aggregate time of all nested samples

Option: "Generate Parent Sample"

When "Generate parent sample" in Transaction Controller is
  • Checked 
    • Only Transaction Controller virtual sample will be generated and  all other Transaction Controller's nested samples will not be displayed in the report
  • Unchecked 
    • Additional parent sample (i.e. Transaction Controller virtual sample) after nested samples will be displayed in the report

Option: "Include Duration of Timer and Pre-Post Processors in Generated Sample"

Each Sampler can be preceded by one or more Pre-processor elements and followed by one or more Post-processor elements. The Transaction Controller also provides an option to include or exclude the execution time of timers, pre-processors, and post-processors in the generated virtual sample.

When the check box "Include duration of timer and pre-post processors in generated sample" is
  • Checked
    • The aggregate time includes all processing within the controller scope, not just the nested samples
  • Unchecked
    • The aggregate time includes just the nested samples and excludes all pre/post processing within the controller scope
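For reference, these two checkboxes map to boolean properties in the saved .jmx file. Below is a sketch of the element as JMeter saves it; the property names are taken from plans saved by recent JMeter versions, so verify them against a .jmx saved by your own version:

```xml
<TransactionController guiclass="TransactionControllerGui"
                       testclass="TransactionController"
                       testname="Login Transaction" enabled="true">
  <!-- "Generate parent sample" -->
  <boolProp name="TransactionController.parent">true</boolProp>
  <!-- "Include duration of timer and pre-post processors in generated sample" -->
  <boolProp name="TransactionController.includeTimers">false</boolProp>
</TransactionController>
```

Editing these properties directly is handy when you want to flip the options across many Transaction Controllers at once.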

Sunday, August 13, 2017

JMeter: Using the HTTP Cookie Manager

In a stateless internet, many sites and applications use cookies to retain a handle between sessions or to keep some state on the client side. If you plan to use JMeter to test such web applications, a cookie manager will be required.

To learn how to enable the HTTP Cookie Manager and run tests in JMeter, you can watch the video below.

In this article, we will cover two topics:
  1. Why cookie manager?
  2. Where to put cookie manager?

Why Cookie Manager

If you need to extract cookie data from a response, one option is to use a Regular Expression Extractor on the response headers.[4] A simpler option is to add an HTTP Cookie Manager, which automatically handles cookies in many configurable ways.

HTTP Cookie Manager has three functions:
  1. Stores and sends cookies just like a web browser
    • Each JMeter thread has its own "cookie storage area".
      • Note that such cookies do not appear on the Cookie Manager display, but they can be seen using the View Results Tree Listener.
  2. Received Cookies can be stored as JMeter thread variables
    • Versions of JMeter 2.3.2+ no longer do this by default
    • To save cookies as variables, define the corresponding JMeter property by
      • Setting it in the JMeter properties file, or
      • Passing the corresponding parameter to the JMeter startup script
    • The names of the saved cookies carry the prefix "COOKIE_", which can be configured via another property
    • See [4] for an example
  3. Supports adding a cookie to the Cookie Manager manually
    • Note that if you do this, the cookie will be shared by all JMeter threads—such cookies are created with an expiration date far in the future.
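For completeness, the two properties mentioned above can be sketched as follows; the property names are taken from the JMeter component reference, so verify them against your JMeter version:

```properties
# In jmeter.properties (or user.properties), or pass with -J on the command line.
# Save received cookies as JMeter thread variables:
CookieManager.save.cookies=true
# Change the "COOKIE_" prefix added to the saved variable names:
CookieManager.name.prefix=COOKIE_
```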

Where to Put Cookie Manager

Nearly all web testing should use cookie support, unless your application specifically doesn't use cookies. To add cookie support, simply add an HTTP Cookie Manager to each Thread Group in your test plan. This ensures that each thread gets its own cookies, which are then shared across all HTTP Request samplers in that thread.

To add the HTTP Cookie Manager, simply select the Thread Group and choose Add → Config Element → HTTP Cookie Manager, either from the Edit menu or from the right-click pop-up menu.


  1. Using the HTTP Cookie Manager in JMeter
  2. Understanding and Using JMeter Cookie Manager
  3. Adding Cookie Support
  4. Header Cookie “sid” value to a variable
  5. Using RegEx (Regular Expression Extractor) with JMeter
  6. JMeter: How to Turn Off Captive Portal from the Recording Using Firefox (Xml and More)

Friday, August 11, 2017

JMeter: How to Turn Off Captive Portal from the Recording Using Firefox

Apache JMeter is an Apache project that can be used as a load testing tool for analyzing and measuring the performance of a variety of services, with a focus on web applications.

Assume you have installed JMeter and are familiar with it; otherwise, you can watch a good series of videos here to get started.  In this article, we will discuss how to remove the extra HTTP request (1 /success.txt) from the recording (click the diagram below to enlarge).

HTTP(S) Test Script Recorder

You can follow the instructions here to record your web test with the "HTTP(S) Test Script Recorder".  In this article, we have chosen Firefox as the browser for JMeter's proxy recorder.

When we recorded a test plan, some repetitive HTTP requests related to the Captive Portal feature in Firefox were captured.  This HTTP traffic is not related to the tested web application and should be excluded from the test plan.  So, how can we achieve that?

How to Turn Off "Captive Portal"

The Captive Portal feature in Firefox covers the detection of captive portals and the handling of their login pages inside the browser: upon detecting a captive portal, Firefox is expected to handle its login page.

There is no UI checkbox for disabling Captive Portal.  But, you can turn off Captive Portal using the Configuration Editor of Firefox:[3]
  1. In a new tab, type or paste about:config in the address bar and press Enter/Return. Click the button promising to be careful.
  2. In the search box above the list, type or paste captiv and pause while the list is filtered.
  3. Double-click the network.captive-portal-service.enabled preference to switch its value from true to false.
If you are in a managed environment using an autoconfig file, for example, you could use this to switch the default:
user_pref("network.captive-portal-service.enabled", false)


  1. Apache JMeter
  2. JMeter Beginner Tutorial 21 - How to use Test Script Recorder
  3. HTTP(S) Test Script Recorder 
  4. Proxy Step by Step (Apache JMeter)
  5. Book: Apache JMeter (Publisher: Packt Publishing) 
  6. Turn off captive portal (Mozilla Support)

Saturday, June 24, 2017

How to Access OAuth Protected Resources Using Postman

To access an OAuth 2.0 protected resource, you need to provide an access token.  For example, in the new implementation of Oracle Event Hub Cloud Service, Kafka brokers are OAuth 2.0 protected resources.

In this article, we will demonstrate how to obtain an access token of "bearer" type using Postman.

OAuth 2.0

OAuth enables clients to access protected resources by obtaining an access token, which is defined in "The OAuth 2.0 Authorization Framework" as "a string representing an access authorization issued to the client", rather than using the resource owner's credentials directly.

There are different access token types.  For example,

Each access token type specifies the additional attributes (if any) sent to the client together with the "access_token" response parameter. It also defines the HTTP authentication method used to include the access token when making a protected resource request.

For example, in this article, you will learn how to retrieve a bearer token using Postman; the generated HTTP response will look like the following:

    {
      "access_token": "eyJ4NXQjUzI1Ni <snipped> M8Ei_VoT0kjc",
      "token_type": "Bearer",
      "expires_in": 3600
    }

To prevent misuse, bearer tokens need to be protected from disclosure in storage and in transport.


Postman is a Google Chrome app for interacting with HTTP APIs. It presents you with a friendly GUI for constructing requests and reading responses. To download it, click on this link.

You can generate code snippets using Postman for sharing purposes (see the diagram above; a better alternative, however, is to export/import a collection).  For example, we will use the following snippets for illustration in this article.

POST /oauth2/v1/token HTTP/1.1
Content-Type: application/x-www-form-urlencoded
Accept: application/json
Cache-Control: no-cache
Postman-Token: 55cfed4b-509c-5a6f-a415-8542d04fc7ad


Generating Bearer Token

To access OAuth protected resources, you need to retrieve an access token first.  In this example, we will demonstrate with an access token of the bearer type.

Based on the shared code snippets above, we need to send an HTTP POST request to the token endpoint URL, which is composed from the following information in the snippets:

POST /oauth2/v1/token HTTP/1.1

Note that we have used https instead of http in the URL.

For the Authorization, we have specified the "Basic Auth" type with a Username and a Password; in the snippets, it shows as below:


In the "Header" part, we have specified two headers in addition to the "Authorization" header using "Bulk Edit" mode:


In the "Body" part, we have copied the last line from the code snippets to it in raw mode:


Note that the above body part is specific to the Oracle Identity Cloud Service (IDCS) implementation.  Similarly, the "Authorization" part requires us to specify the "Client ID" and "Client Secret" as username and password, which are also IDCS-specific.
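Under the hood, Postman's "Basic Auth" simply base64-encodes "username:password" (here, the Client ID and Client Secret) into an Authorization header. A minimal sketch with placeholder credentials:

```shell
# Placeholder credentials -- substitute your IDCS Client ID and Client Secret.
CLIENT_ID="my-client-id"
CLIENT_SECRET="my-client-secret"

# HTTP Basic authentication: base64("client_id:client_secret")
AUTH=$(printf '%s:%s' "$CLIENT_ID" "$CLIENT_SECRET" | base64)
echo "Authorization: Basic $AUTH"
# prints: Authorization: Basic bXktY2xpZW50LWlkOm15LWNsaWVudC1zZWNyZXQ=
```

This is exactly the header Postman shows in the generated snippets, which is why the snippets alone don't reveal which credentials were used.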

How to Use Bearer Token

To access OAuth protected resources, you specify the retrieved access token in the header of subsequent HTTP requests in the following format:

Authorization: Bearer eyJ4NXQjUzI1Ni <snipped> M8Ei_VoT0kjc

Note that this access token will expire in one hour as noted in the HTTP response:

"expires_in": 3600


In this article, we have demonstrated:
  • What a Bearer Token is
  • What an access token looks like
  • How to share a code snippet
    • We have shown that reverse-engineering the shared code snippets into the final setup in Postman is not straightforward.  For example, the code snippets don't tell us:
      • What "Username" and "Password" are to be used.  In this case, we need to know that the "Client ID" and "Client Secret" of the application are required.
    • Therefore, if you share code snippets with co-workers, you also need to add further annotations to allow them to reproduce the HTTP requests to be sent.

Sunday, June 18, 2017

HiBench Suite―How to Build and Run the Big Data Benchmarks

As known from a previous article:
Three Benchmarks for SQL Coverage in HiBench Suite ― a Bigdata Micro Benchmark Suite
HiBench Suite is a big data benchmark suite that helps evaluate different big data frameworks in terms of speed, throughput and system resource utilization.

When your big data platform (e.g., HDP) evolves, there comes a time when you need to upgrade your benchmark suite accordingly.

In this article, we will cover how to pick up the latest HiBench Suite (i.e., version 6.1) to work with Spark 2.1.

HiBench Suite

To download the master branch of HiBench Suite (click the diagram to enlarge), you can visit its home page here.  As of 06/18/2017, the latest version is 6.1.

To download, we have selected "Download ZIP" and saved it to our Linux system.


From the home page, you can select the "docs" link to view all available document links.
The build document tells you how to build HiBench Suite using Maven. For example, to build all workloads in HiBench, use the command below:

mvn -Dspark=2.1 -Dscala=2.11 clean package
This could be time-consuming because hadoopbench (one of the workloads) relies on 3rd-party tools like Mahout and Nutch. The build process automatically downloads these tools for you. If you won't run these workloads, you can build only a specific framework (e.g., sparkbench) to speed up the build process.

To get familiar with Maven, you can start with this pdf file, in which you will learn how to download Maven and how to set up your system to run it. Here we will just discuss some issues that we ran into while building all workloads using Maven.

Maven Installation Issues and Solutions

Proxy Server

Since our Linux system sits behind a firewall, we need to set up the following environment variables:
export http_proxy=
export https_proxy=

Environment Setup

As instructed in the pdf file, we have set up the following additional environment variables:

export JAVA_HOME=~/JVMs/8u40_fcs
export PATH=/scratch/username/maven/apache-maven-3.5.0/bin:$PATH
export PATH=$JAVA_HOME/bin:$PATH

Maven Configuration & Debugging

POM stands for Project Object Model, which
  • Is the fundamental unit of work in Maven
  • Is an XML file
  • Always resides in the base directory of the project as pom.xml

The POM contains information about the project and various configuration details used by Maven to build the project(s).

In the default ~/.m2/settings.xml, we have set the following entries:

<settings xmlns=""

First, we have set localRepository to a new location because of an issue described here.[7,8] Second, we have set longer timeouts for both connection and read.
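The truncated snippet above can be fleshed out as a minimal sketch; the namespace is the standard Maven settings namespace, and the repository path is only an example (point localRepository at any local, non-NFS directory):

```xml
<!-- ~/.m2/settings.xml (sketch) -->
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0">
  <!-- Relocate the local repository off the NFS-mounted home directory. -->
  <localRepository>/scratch/username/maven/repository</localRepository>
</settings>
```

The longer connection/read timeouts can alternatively be passed on the command line, e.g. -Dmaven.wagon.http.connectionTimeout and -Dmaven.wagon.http.readTimeout (values in milliseconds), assuming your Maven version uses the wagon HTTP provider.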

If you run into issues with a plugin, you can use "help:describe"
mvn help:describe -Dplugin=com.googlecode.maven-download-plugin:maven-download-plugin
to display a list of its attributes and goals for debugging.

How to Run Sparkbench

To learn how to run a specific benchmark named sparkbench, you can click on the document link below:
Without much ado, we will focus on the configuration and tuning part of the task. For other details, please refer to the document.

New Configuration Files

In the new HiBench, there are two levels of configuration:

(Global level)

(Workload level)

It has also introduced a new hierarchy (i.e., categories like micro, websearch, sql, etc.) to organize the workload runtime scripts:
  where <benchmark> could be:
  where <framework> could be:
Similarly, the workload-specific configuration files are stored under the new category level:

  where <benchmark.conf> could be:


  2. Readme (HiBench 6.1)
  3. HiBench Download
  4. How to build HiBench (HiBench 6.1)
  5. How to run sparkbench (HiBench 6.1)
  6. How-to documents (HiBench 6.1)
  7. Idiosyncrasies of ${HOME} that is an NFS Share (Xml and More)
  8. Apache Maven Build Tool (pdf)
  9. How do I set the location of my local Maven repository?
  10. Guide to Configuring Plug-ins (Apache Maven Project)
  11. Available Plugins (Apache Maven Project)
  12. MojoExecutionException
  13. Installing Maven Plugins
  14. Download Plugin For Maven » 1.2.0
  15. Group: com.googlecode.maven-download-plugin

Tuesday, June 13, 2017

Linux sar Command: Using -o and -f in Pairs

System Activity Reporter (SAR) is one of the important tools for monitoring Linux servers. Using this command, you can analyse the history of different resource usages.

In this article, we will examine how to monitor resource usages of servers (e.g., in a cluster) during the entire run of an application (e.g., a benchmark) using the following sar command pairs:
  • Data Collection
    • nohup sar -A -o /tmp/ 10 > /dev/null &
  • Record Extraction
    • sar -f /tmp/ [-u | -d | -n DEV]

Sar Command Options

In the data collection phase, we will use -o option to save data in a file of binary format and then use -f option combined with other options (e.g.,  [-u | -d | -n DEV]) to extract records related to different statistics (e.g., CPU, I/O, Network):

Main options

       -o [ filename ]
              Save the readings in the file in binary form. Each reading is in
              a separate record. The default value of the  filename  parameter
              is  the  current daily data file, the /var/log/sa/sadd file. The
              -o option is exclusive of the -f option.  All the data available
              from  the  kernel  are saved in the file (in fact, sar calls its
              data collector sadc with the option "-S ALL". See sadc(8) manual
              page.)

       -f [ filename ]
              Extract records from filename (created by the -o filename flag).
              The default value of the filename parameter is the current daily
              data file, the /var/log/sa/sadd file. The -f option is exclusive
              of the -o option.


       -u [ ALL ]
              Report CPU utilization. The ALL keyword indicates that  all  the
              CPU fields should be displayed.

       -d    Report activity for each block device (kernels 2.4 and newer only).

       -n { keyword [,...] | ALL }
              Report network statistics.

Monitoring the Entire Run of a Benchmark

In the illustration, we will use three benchmarks (i.e., scan / aggregation / join) in the HiBench suite as examples (see [2] for details).  At the beginning of each benchmark run, we start up sar commands on the servers of a cluster, followed by running the Spark application of a specific workload; finally, we kill the sar processes at the end of the run.

if [ $# -ne 2 ]; then
  echo "usage: $0 <workload> <target>"
  echo "  where <workload> could be:"
  echo "    scan"
  echo "    aggregation"
  echo "    join"
  echo "  where <target> could be:"
  echo "    mapreduce"
  echo "    spark/java"
  echo "    spark/scala"
  echo "    spark/python"
  exit 1
fi

workload=$1
target=$2


mkdir ~/$workload/$target

echo "start all sar commands ..."

./ start

while read -r vmIp
do
  echo "start stats on $vmIp"
  ./myssh opc@$vmIp "~/ start" &
done < vm.lst

# run a test in different workloads using different lang interfaces

echo "stop all sar commands ..."
./ stop

while read -r vmIp
do
  echo "stop stats on $vmIp"
  ./myssh opc@$vmIp "~/ stop" &
done < vm.lst


case $1 in
  start)
    pkill sar
    rm /tmp/    # sar output filename elided in the original post
    nohup sar -A -o /tmp/ 10 > /dev/null &
    ;;
  stop)
    pkill sar
    scp /tmp/ ~
    ;;
  *)
    echo "usage: $0 start|stop"
    ;;
esac

CPU Statistics

To view the overall CPU statistics, you can use option -u as follows:

$ sar -f -u

03:39:28 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle

03:39:38 PM     all      0.03      0.00      0.01      0.02      0.00     99.94

03:39:48 PM     all      0.05      0.00      0.05      0.02      0.01     99.88


Average:        all      0.09      0.00      0.02      0.02      0.00     99.86


I/O Statistics of Block Devices

To view the activity for each block device, you can use option -d as follows:

$ sar -f -d

03:39:28 PM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
03:39:38 PM dev202-16      1.20      0.00     16.06     13.33      0.02     14.67      6.50      0.78
03:39:38 PM dev202-32      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:39:38 PM dev202-48      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:39:38 PM dev202-64      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:39:38 PM dev202-80      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:39:38 PM  dev251-0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:39:38 PM  dev251-1      1.20      0.00     16.06     13.33      0.02     14.67      6.50      0.78
03:39:38 PM  dev251-2      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:39:38 PM  dev251-3      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:39:38 PM  dev251-4      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

Average:          DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
Average:    dev202-16      1.22      0.00     15.79     12.99      0.01     11.85      6.57      0.80
Average:    dev202-32      0.85      0.00      8.92     10.46      0.01     10.27      4.18      0.36
Average:    dev202-48      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:    dev202-64      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:    dev202-80      0.21      0.00      1.74      8.43      0.00      0.30      0.08      0.00
Average:     dev251-0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:     dev251-1      1.25      0.00     15.97     12.73      0.01     11.78      6.37      0.80
Average:     dev251-2      0.90      0.00      8.92      9.88      0.01     10.44      3.95      0.36
Average:     dev251-3      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:     dev251-4      0.22      0.00      1.74      8.00      0.00      0.28      0.08      0.00

If you are interested in the average tps of dev251-1:
                     Indicate  the  number  of  transfers per second that were
                     issued to the device.  Multiple logical requests  can  be
                     combined  into  a  single  I/O  request  to the device. A
                     transfer is of indeterminate size.
you can specify the following command:
$ sar -f "$destDir/" -d | grep Average  | grep dev251-1 | awk '{print $3}'

Network Statistics

To view the overall statistics of network devices like eth0, bond, etc., you can use option -n as follows:

sar -n [VALUE]
The VALUE can be:
  • DEV: For network devices like eth0, bond, etc. 
  • EDEV: For network device failure details 
  • NFS: For NFS client info 
  • NFSD: For NFS server info 
  • SOCK: For sockets in use for IPv4 
  • IP: For IPv4 network traffic 
  • EIP: For IPv4 network errors 
  • ICMP: For ICMPv4 network traffic 
  • EICMP: For ICMPv4 network errors 
  • TCP: For TCPv4 network traffic 
  • ETCP: For TCPv4 network errors 
  • UDP: For UDPv4 network traffic 
  • SOCK6, IP6, EIP6, ICMP6, UDP6 : For IPv6 
  • ALL: For all above mentioned information.
$ sar -f -n DEV

03:39:28 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s

03:39:38 PM      eth0     12.35     16.47      1.34      4.04      0.00      0.00      0.00
03:39:38 PM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
03:39:48 PM      eth0      9.63     14.64      1.17      4.03      0.00      0.00      0.00
03:39:48 PM        lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00

Average:        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
Average:         eth0     11.26     16.14      3.95      6.46      0.00      0.00      0.00
Average:           lo      1.23      1.23      0.33      0.33      0.00      0.00      0.00

If you are interested in the average rxkB/s or txkB/s of eth0:
                     Total number of kilobytes received per second.

                     Total number of kilobytes transmitted per second.

you can specify the following command:
sar -f "$destDir/" -n DEV|grep Average|grep eth0 |awk '{print $5}'
sar -f "$destDir/" -n DEV|grep Average|grep eth0 |awk '{print $6}'


  1. sar command for Linux system performance monitoring
  2. Three Benchmarks for SQL Coverage in HiBench Suite ― a Bigdata Micro Benchmark Suite

Sunday, April 23, 2017

Spark SQL―Knowing the Basics

There are two ways to interact with SparkSQL:[1]

Features of SparkSQL

SparkSQL is one of Spark's modules; it provides a SQL interface to Spark. Below is a list of SparkSQL functionalities:
  • Supports both schema-on-write and schema-on-read
    • schema-on-write
      • Requires data to be modeled before it can be stored (hint: traditional database systems)
      • SparkSQL supports schema-on-write through columnar formats such as Parquet and ORC (Optimized Row Columnar)
    • schema-on-read
      • A schema is applied to data when it is read
        • A user can store data in its native format without worrying about how it will be queried
        • It not only enables agility but also allows complex evolving data
      • One disadvantage of a schema-on-read system is that queries are slower than those executed on data stored in a schema-on-write system.[19]
  • Has a SQLContext and a HiveContext
    • SQLContext (i.e., org.apache.spark.sql.SQLContext)
      • Can read data directly from the filesystem
        • This is useful when the data you are trying to analyze does not reside in Hive (for example, JSON files stored in HDFS).[13]
    • HiveContext (or org.apache.spark.sql.hive.HiveContext)
  • Compatible with Apache Hive
    • Not only supports HiveQL, but can also access Hive metastore, SerDes (i.e. Hive serialization and deserialization libraries), and UDFs (i.e., user-defined functions)
      • HiveQL queries run much faster on Spark SQL than on Hive
    • Existing Hive workloads can be easily migrated to Spark SQL.
    • You can use Spark SQL with or without Hive
    • Can be configured to read Hive metastores created with different versions of Hive
  • Prepackaged with a Thrift/JDBC/ODBC server
    • A client application can connect to this server and submit SQL/HiveQL queries using Thrift, JDBC, or ODBC interface
  • Bundled with Beeline
    • Which can be used to submit HiveQL queries

Architecture of SparkSQL

The architecture of SparkSQL contains three layers:
  • Data Sources
  • Schema RDD
  • Language API


Data Sources

Usually the data source for spark-core is a text file, an Avro file, etc. However, Spark SQL operates on a variety of data sources through the DataFrame (see details below). The default data source is parquet, unless otherwise configured via spark.sql.sources.default.

Some of data sources supported by SparkSQL are listed below:
  • JSON Datasets
    • Spark SQL can automatically capture the schema of a JSON dataset and load it as a DataFrame.
  • Hive Tables
    • Hive comes bundled with the Spark library as HiveContext
  • Parquet Files
    • Use a columnar format
  • Cassandra database

Read here for the methods of loading and saving data using the Spark Data Sources and options that are available for the built-in data sources.

Schema RDD (or DataFrame)

Spark Core is designed around a special data structure called the RDD (Spark's native data structure). Spark SQL, in contrast, works on schemas, tables, and records via the SchemaRDD, which was later renamed the “DataFrame” API.

With a SQLContext, applications can create DataFrames from an array of different sources such as:
  • Hive tables
  • Structured Data files
  • External databases
  • Existing RDDs.
A DataFrame can also be registered as a temporary table. Registering a DataFrame as a table allows you to run SQL queries over its data.

Here is the summary of DataFrame API:
  • DataFrame vs RDD
    • DataFrame stores much more information about the structure of the data, such as the data types and names of the columns, than an RDD does.
      • This allows the DataFrame API to optimize processing much more effectively than Spark transformations and Spark actions operating on an RDD.
      • Once data has been transformed into a DataFrame with a schema, it can then be stored in
        • Hive (for persistence)
          • If it needs to be accessed on a regular basis
        • Temp table(s)
          • Which will exist only as long as the parent Spark application and its executors (the application can run indefinitely)
    • Conversions from RDD to DataFrame and vice versa
  • Registration of DataFrames as Tables
    • An existing RDD can be implicitly converted to a DataFrame and then be registered as a table.
      • All of the tables that have been registered can then be made available for access as a JDBC/ODBC data source via the Spark thrift server.
  • Supports big datasets (up to Petabytes)
  • Supports different data formats and storage systems
    • Data formats
      • Avro, csv, elastic search, and Cassandra
    • Storage systems
      • HDFS, HIVE tables, mysql, etc.
  • Provides language APIs for Python, Java, Scala, and R Programming
    • It also achieves consistent performance of DataFrame API calls across languages using state-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer (a tree transformation framework)

Language API

Spark SQL comes prepackaged with a Thrift/JDBC/ODBC server. A client application can connect to this server and submit SQL/HiveQL queries using the Thrift, JDBC, or ODBC interface; the queries are then executed as Spark jobs.
Spark is compatible with different languages. All the supported programming languages of Spark can be used to develop applications using the DataFrame API of Spark SQL. For example, Spark SQL supports the following language APIs:
  • Python
  • Scala
  • Java
  • R
  • HiveQL
    • Is an SQL-like language with schema on read that transparently converts queries to MapReduce, Apache Tez, and Spark jobs.
      • An SQL dialect with differences in structure and working.
        • The differences are mainly because Hive is built on top of the Hadoop ecosystem and has to comply with the restrictions of Hadoop and MapReduce.

Finally, the Spark SQL API consists of three key abstractions (as described above):
  • SQLContext
  • HiveContext
  • DataFrame


    1. Using Spark SQL (Hortonworks)
    2. Setting Up HiveServer2 (Apache Hive)
    3. HiveServer2
    4. Hive Metastore Administration (Apache)
    5. HiveServer2 Overview (Apache)
    6. SQLLine 1.0.2
    7. Hadoop Cluster Maintenance
    8. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing
    9. Apache Hive—Hive CLI vs Beeline (Xml And More)
      • Beeline is a JDBC client based on the SQLLine CLI — although the JDBC driver used communicates with HiveServer2 using HiveServer2’s Thrift APIs.
    10. Apache Hive Essentials
    11. Three Benchmarks for SQL Coverage in HiBench Suite ― a Bigdata Micro Benchmark Suite
    12. Accessing ORC Files from Spark (Hortonworks)
    13. Using the Spark DataFrame API
    14. Spark Shell — spark-shell shell script
    15. Tuning Spark (Hortonworks)
    16. Spark SQL, DataFrames and Datasets Guide (Spark 1.6.1)
    17. spark.sql.sources.default (default: parquet)
    18. Mastering Apache Spark
    19. Three Benchmarks for SQL Coverage in HiBench Suite ― a Bigdata Micro Benchmark Suite
    20. Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer
    21. Accessing Spark SQL through JDBC and ODBC (Hortonworks)
    22. Using Spark to Virtually Integrate Hadoop with External Systems
      • This article focuses on how to use SparkSQL to integrate, expose, and accelerate multiple sources of data from a single "Federation Tier".