Sunday, April 23, 2017

Spark SQL—Knowing the Basics

There are two ways to interact with Spark SQL: interactively through the Spark shell (using a SQLContext or HiveContext), or from client applications through its Thrift/JDBC/ODBC server.[1]


Features of SparkSQL


Spark SQL is one of Spark's modules; it provides a SQL interface to Spark. Below is a list of Spark SQL's main features:
  • Supports both schema-on-write and schema-on-read
    • schema-on-write
      • Requires data to be modeled before it can be stored (hint: traditional database systems)
      • SparkSQL supports schema-on-write through columnar formats such as Parquet and ORC (Optimized Row Columnar)
    • schema-on-read
      • A schema is applied to data when it is read
        • A user can store data in its native format without worrying about how it will be queried
        • It not only enables agility but also allows for complex, evolving data
      • One disadvantage of a schema-on-read system is that queries are slower than those executed on data stored in a schema-on-write system.[19]
  • Has a SQLContext and a HiveContext
    • SQLContext (i.e., org.apache.spark.sql.SQLContext)
      • Can read data directly from the filesystem
        • This is useful when the data you are trying to analyze does not reside in Hive (for example, JSON files stored in HDFS).[13]
    • HiveContext (i.e., org.apache.spark.sql.hive.HiveContext), a superset of SQLContext that adds Hive support (see the sketch at the end of this list)
  • Compatible with Apache Hive
    • Not only supports HiveQL, but can also access the Hive metastore, SerDes (i.e., Hive serialization and deserialization libraries), and UDFs (i.e., user-defined functions)
      • HiveQL queries typically run much faster on Spark SQL than on Hive's MapReduce engine
    • Existing Hive workloads can be easily migrated to Spark SQL.
    • You can use Spark SQL with or without Hive
    • Can be configured to read Hive metastores created with different versions of Hive
  • Prepackaged with a Thrift/JDBC/ODBC server
    • A client application can connect to this server and submit SQL/HiveQL queries using Thrift, JDBC, or ODBC interface
  • Bundled with Beeline
    • Which can be used to submit HiveQL queries
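
To make the two entry points concrete, here is a minimal sketch of creating both contexts from a SparkContext (Spark 1.6-style API; the application name is hypothetical, and in spark-shell sc and sqlContext are already provided):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setAppName("SparkSQLBasics")   // hypothetical application name
val sc = new SparkContext(conf)

// SQLContext can read data directly from the filesystem (no Hive required)
val sqlContext = new SQLContext(sc)

// HiveContext additionally gives access to the Hive metastore, SerDes, and UDFs
val hiveContext = new HiveContext(sc)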

Architecture of SparkSQL


The architecture of SparkSQL contains three layers:
  • Data Sources
  • Schema RDD
  • Language API


Data Sources


Usually the data source for Spark core is a text file, an Avro file, etc. However, Spark SQL operates on a variety of data sources through the DataFrame interface (see details below). The default data source is parquet unless otherwise configured by spark.sql.sources.default; this default is used whenever a read or write does not specify a format explicitly (a short sketch follows the list of data sources below).[17]

Some of data sources supported by SparkSQL are listed below:
  • JSON Datasets
    • Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame.
  • Hive Tables
    • Hive support comes bundled with the Spark library in the form of HiveContext
  • Parquet Files
    • Parquet uses a columnar storage format
  • Cassandra database

Read here for the methods of loading and saving data using the Spark Data Sources and options that are available for the built-in data sources.
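
As a minimal sketch (Spark 1.6-style API; the HDFS paths are hypothetical), reading and writing a couple of these sources through a SQLContext looks like this:

// Schema-on-read: the JSON schema is inferred automatically
val events = sqlContext.read.json("hdfs:///data/events.json")

// Schema-on-write: persist the DataFrame in the columnar Parquet format
events.write.parquet("hdfs:///data/events_parquet")

// No explicit format given, so spark.sql.sources.default (parquet) is used
val reloaded = sqlContext.read.load("hdfs:///data/events_parquet")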

Schema RDD (or DataFrame)


Spark Core is designed around a special data structure called the RDD (Spark's native data structure). Spark SQL, however, works with schemas, tables, and records via the SchemaRDD, which was later renamed the "DataFrame" API.

With a SQLContext, applications can create DataFrames from an array of different sources such as:
  • Hive tables
  • Structured Data files
  • External databases
  • Existing RDDs.
A DataFrame can also be registered as a temporary table. Registering a DataFrame as a table allows you to run SQL queries over its data.
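
For example, here is a minimal sketch (Spark 1.6-style API; the file name, table name, and columns are hypothetical):

val people = sqlContext.read.json("people.json")   // create a DataFrame from a JSON file
people.registerTempTable("people")                 // register it as a temporary table
val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()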

Here is a summary of the DataFrame API:
  • DataFrame vs RDD
    • DataFrame stores much more information about the structure of the data, such as the data types and names of the columns, than RDD.
      • This allows Spark SQL to optimize the processing much more effectively than plain Spark transformations and actions operating on an RDD.
      • Once data has been transformed into a DataFrame with a schema, it can then be stored in
        • Hive (for persistence)
          • If it needs to be accessed on a regular basis
        • or registered as temp table(s)
          • Which will exist only as long as the parent Spark application and its executors (the application can run indefinitely)
    • Conversions from RDD to DataFrame and vice versa (see the sketch after this list)
  • Registration of DataFrames as Tables
    • An existing RDD can be implicitly converted to a DataFrame and then be registered as a table.
      • All of the tables that have been registered can then be made available for access as a JDBC/ODBC data source via the Spark thrift server.
  • Supports big datasets (up to Petabytes)
  • Supports different data formats and storage systems
    • Data formats
      • Avro, CSV, Elasticsearch, and Cassandra
    • Storage systems
      • HDFS, Hive tables, MySQL, etc.
  • Provides language APIs for Python, Java, Scala, and R Programming
    • It also achieves consistent performance of DataFrame API calls across languages by using state-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer (a tree transformation framework)
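
As an illustration of the RDD-to-DataFrame conversion mentioned above, here is a minimal sketch (Spark 1.6-style API; the case class and sample data are hypothetical):

import sqlContext.implicits._                 // enables the implicit rdd.toDF() conversion

case class Person(name: String, age: Int)     // hypothetical schema
val rdd = sc.parallelize(Seq(Person("Ann", 34), Person("Bob", 19)))

val df = rdd.toDF()                           // RDD -> DataFrame
df.registerTempTable("person")                // now queryable with SQL (and reachable via the Thrift server)
val rows = df.rdd                             // DataFrame -> RDD[Row]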

Language API


Spark SQL comes prepackaged with a Thrift/JDBC/ODBC server. A client application can connect to this server and submit SQL/HiveQL queries using the Thrift, JDBC, or ODBC interface; Spark SQL executes the submitted queries as Spark jobs.
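
For illustration, here is a minimal sketch of a JDBC client talking to this Thrift server from Scala (the host, port, database, and user are assumptions; the Hive JDBC driver must be on the classpath):

import java.sql.DriverManager

// Connect to the Spark Thrift server using the HiveServer2 JDBC driver
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SHOW TABLES")
while (rs.next()) println(rs.getString(1))    // print the first column of each returned row
conn.close()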
    Spark is compatible with several programming languages, and all of them can be used to develop applications with the DataFrame API of Spark SQL. Spark SQL supports the following language APIs:
    • Python
    • Scala
    • Java
    • R
    • HiveQL
      • An SQL-like language with schema-on-read that transparently converts queries into MapReduce, Apache Tez, and Spark jobs (see the sketch after this list)
        • It is a SQL dialect with some differences in structure and behavior.
          • The differences exist mainly because Hive is built on top of the Hadoop ecosystem and has to comply with the restrictions of Hadoop and MapReduce.
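
To make the HiveQL path concrete, here is a minimal sketch of submitting HiveQL through a HiveContext (the table name and columns are hypothetical):

hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")   // HiveQL DDL
hiveContext.sql("SELECT COUNT(*) FROM src").show()                          // HiveQL query, executed as Spark jobs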
    Finally, the Spark SQL API consists of three key abstractions (as described above):
    • SQLContext
    • HiveContext
    • DataFrame

    References

    1. Using Spark SQL (Hortonworks)
    2. Setting Up HiveServer2 (Apache Hive)
    3. HiveServer2 (slideshare.net)
    4. Hive Metastore Administration (Apache)
    5. HiveServer2 Overview (Apache)
    6. SQLLine 1.0.2
    7. Hadoop Cluster Maintenance
    8. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing
    9. Apache Hive—Hive CLI vs Beeline (Xml And More)
      • Beeline is a JDBC client based on the SQLLine CLI — although the JDBC driver used communicates with HiveServer2 using HiveServer2’s Thrift APIs.
    10. Apache Hive Essentials
    11. Three Benchmarks for SQL Coverage in HiBench Suite ― a Bigdata Micro Benchmark Suite
    12. Accessing ORC Files from Spark (Hortonworks)
    13. Using the Spark DataFrame API
    14. Spark Shell — spark-shell shell script
    15. Tuning Spark (Hortonworks)
    16. Spark SQL, DataFrames and Datasets Guide (Spark 1.6.1)
    17. spark.sql.sources.default (default: parquet)
    18. Mastering Apache Spark
    19. Three Benchmarks for SQL Coverage in HiBench Suite ― a Bigdata Micro Benchmark Suite
    20. Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer (slideshare.net)
    21. Accessing Spark SQL through JDBC and ODBC (Hortonworks)
    22. Using Spark to Virtually Integrate Hadoop with External Systems
      • This article focuses on how to use SparkSQL to integrate, expose, and accelerate multiple sources of data from a single "Federation Tier".

    Saturday, April 22, 2017

    Apache Hive—Hive CLI vs Beeline

    Lineage of Apache Hive
    1. Original model 
      • was a heavyweight command-line tool that accepted queries and executed them utilizing MapReduce
    2. Client-server model
      1. Hive CLI + HiveServer1
      2. Beeline + HiveServer2 (HS2)
    In this article, we will examine the differences between Hive CLI and Beeline, especially the new Hive CLI implementation (i.e., Beeline + embedded HS2).


    Hive CLI vs Beeline


    Hive CLI is an Apache Thrift-based client, whereas Beeline is a JDBC client based on the SQLLine CLI — although the JDBC driver used communicates with HiveServer2 using HiveServer2’s Thrift APIs.

    In the latest Apache Hive, both Hive CLI and Beeline are supported via
    exec "${HIVE_HOME}/bin/hive.distro" "$@"
    For example, to launch either command-line interface, you run:

    Hive CLI
    $ hive --service cli --help

    Beeline

    $ hive --service beeline --help

    Using Hive (version: 1.2.1000.2.4.2.0-258) as an example, here is the list of available services:
    beeline cleardanglingscratchdir cli help hiveburninclient hiveserver2 hiveserver hwi jar lineage metastore metatool orcfiledump rcfilecat schemaTool version
    Note that "beeline" command is equivalent to "hive --service beeline".

    Hive CLI (New)


    Because of the wide use of Hive CLI, the Hive community is replacing Hive CLI's implementation with a new Hive CLI on top of Beeline plus an embedded HiveServer2 (HIVE-10511) so that it only needs to maintain a single code path.[2]

    In this way, the new Hive CLI is just an alias to Beeline at two levels:
    • Shell script level 
    • High code level. 

    Using JMH to measure the average time cost of retrieving a data set, the community has reported that there is no clear performance gap between the new Hive CLI and Beeline in terms of retrieving data.

    Interactive Shell Commands Support

    When $HIVE_HOME/bin/hive is run without either the -e or -f option, it enters interactive shell mode. To learn more, see the references at the end of this article.

    Beeline


    With HiveServer2 (HS2), Beeline is the recommended command-line interface. To learn more, see the references at the end of this article.

    References

    1. Migrating from Hive CLI to Beeline: A Primer
    2. Replacing the Implementation of Hive CLI Using Beeline
    3. Setting up HiveServer2 (Apache Hive)
    4. Hive CLI
    5. HiveServer2 Clients (Apache) 
    6. SQLLine Manual
    7. Beeline—Command Line Shell
    8. Embedded mode
      • Running Hive client tools with embedded servers is a convenient way to test a query or debug a problem. While both Hive CLI and Beeline can embed a Hive server instance, you would start them in embedded mode in slightly different ways. 
    9. Using the Hive command line and Beeline (Book: Apache Hive Essentials)
      • For Beeline, ; is not needed after the command that starts with !.
      • When running a query in Hive CLI, the MapReduce statistics information is shown in the console screen while processing, whereas Beeline does not.
      • Both Beeline and Hive CLI do not support running a pasted query with <tab> inside, because <tab> is used for autocomplete by default in the environment. Alternatively, running the query from files has no such issues.
      • Hive CLI shows the exact line and position of the Hive query or syntax errors when the query has multiple lines. However, Beeline processes the multiple-line query as a single line, so only the position is shown for query or syntax errors with the line number as 1 for all instances. For this aspect, Hive CLI is more convenient than Beeline for debugging the Hive query.
      • In both Hive CLI and Beeline, using the up and down arrow keys can retrieve up to 10,000 previous commands. The !history command can be used in Beeline to show all history.
      • Both Hive CLI and Beeline support variable substitution.

    Sunday, April 16, 2017

    Idiosyncrasies of ${HOME} that is an NFS Share

    NFS is perhaps best for more 'permanent' network mounted directories such as /homedir or regularly accessed shared resources.  In this article, we will cover the following topics:
    • Set up NFS share via automounter
    • Idiosyncrasies of  /homedir that is an NFS share 

    Automounter


    One drawback to using /etc/fstab is that, regardless of how infrequently a user accesses the NFS mounted file system, the system must dedicate resources to keep the mounted file system in place. This is not a problem with one or two mounts, but when the system is maintaining mounts to many systems at one time, overall system performance can be affected.

    An alternative to /etc/fstab is to use the kernel-based automount utility.  An automounter consists of two components:[1]
    • A kernel module
      • implements a file system
    • A user-space daemon
      • performs all of the other functions

    The automount utility can mount and unmount NFS file systems automatically (on-demand mounting), thereby saving system resources. The automount utility can also be used to mount other file systems, including AFS, SMBFS, CIFS, and local file systems.

    ${HOME}


    When your home directory is automounted, it behaves differently from other file systems because it is shared. For example, you could run into the following two issues:
    • cp: cannot stat  "KeePass-2.14.zip": Permission denied[2]
    • ".bashrc" E509: Cannot create backup file (add ! to override)"
    In the sections below, we discuss these two issues in more detail.

    cp: cannot stat "KeePass-2.14.zip" : Permission denied


    In [2], the author describes an issue encountered while trying to copy a file from the home directory to /usr:
    $ chmod 777 KeePass-2.14.zip
    $ cp KeePass-2.14.zip /usr/keepass/
    cp: cannot create regular file `/usr/keepass/KeePass-2.14.zip': Permission denied
    $ sudo cp KeePass-2.14.zip /usr/keepass/
    cp: cannot stat `KeePass-2.14.zip': Permission denied
    However, sudo cp can't stat KeePass-2.14.zip because ${HOME} is on an NFS mount and the NFS server doesn't grant your machine root permission to the NFS share.

    To work around this "cannot stat: Permission denied" issue, you need to copy the file to another directory (e.g., /tmp) first:
    cp KeePass-2.14.zip /tmp
    sudo cp /tmp/KeePass-2.14.zip /usr/keepass/

    ".bashrc" E509: Cannot create backup file (add ! to override)"


    One time when I edited and saved my $HOME/.bashrc, the system threw the following message:

    ".bashrc" E509: Cannot create backup file (add ! to override)"
    Then I used the "df" command to find the disk space available in my home directory:

    $ df -h .
    Filesystem            Size  Used Avail Use% Mounted on
    server1:/export/home4/myusername
                          5.0T  1.4T  3.7T  28% /home/myusername
    It showed that there was still plenty of space. However, because ${HOME} is an NFS share that holds the home directories of many users, every user is assigned a disk quota. To find out how much quota has been assigned to your home directory, you can run:

    $ quota -Q -s
    Disk quotas for user myusername (uid 40000):
         Filesystem  blocks   quota   limit   grace   files   quota   limit   grace
    server1:/export/home4/myusername
                      1624M   2048M   2048M               0       0       0

    So, to resolve this issue, you can simply remove junk files from your home directory to free up space for saving the file.

    References

    1. autofs
    2. How to copy a file from my home folder to /usr
    3. Automount mini-Howto
    4. How to configure autofs in Linux and what are its advantages?
    5. Is it feasible to have home folder hosted with NFS?