There are two ways to interact with SparkSQL:[1]
- As a distributed SQL query engine (i.e., via spark-shell)
  - When spark-shell is run, it automatically creates a SparkContext and a HiveContext
  - Variants of the Spark shell
    - spark-shell for Scala
    - pyspark for Python
- As a library
  - Accessed through one of the following two APIs, together with the Spark Thrift Server (a minimal client sketch follows this list):
    - JDBC
      - using Java code or via the Beeline JDBC client
    - ODBC
      - via the Simba ODBC driver
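As a rough illustration of the library/JDBC route, the sketch below opens a JDBC connection to a Spark Thrift Server. It assumes the server is running on localhost:10000 (the default HiveServer2 port), that the Hive JDBC driver (org.apache.hive.jdbc.HiveDriver) is on the classpath, and that a table named `people` exists; all of these are illustrative assumptions, not part of this post.

```scala
import java.sql.DriverManager

object ThriftServerClient {
  def main(args: Array[String]): Unit = {
    // Register the Hive JDBC driver; the Spark Thrift Server speaks the HiveServer2 protocol.
    Class.forName("org.apache.hive.jdbc.HiveDriver")

    // Assumed connection details: default Thrift Server host/port and an example user.
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "spark", "")
    try {
      val stmt = conn.createStatement()
      // Hypothetical table, used only for illustration.
      val rs = stmt.executeQuery("SELECT name, age FROM people LIMIT 10")
      while (rs.next()) {
        println(s"${rs.getString("name")}\t${rs.getInt("age")}")
      }
    } finally {
      conn.close()
    }
  }
}
```

Beeline reaches the same server with the same URL, e.g. `beeline -u jdbc:hive2://localhost:10000`.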
Features of SparkSQL
SparkSQL is one of Spark's modules; it provides a SQL interface to Spark. Below is a list of its main features:
- Supports both schema-on-write and schema-on-read
  - Schema-on-write
    - Requires data to be modeled before it can be stored (as in traditional database systems)
    - SparkSQL supports schema-on-write through columnar formats such as Parquet and ORC (Optimized Row Columnar)
  - Schema-on-read
    - A schema is applied to data when it is read
    - A user can store data in its native format without worrying about how it will be queried
    - This not only enables agility but also accommodates complex, evolving data
    - One disadvantage of a schema-on-read system is that queries are slower than those executed on data stored in a schema-on-write system.[19]
- Has a SQLContext and a HiveContext (see the sketch after this list)
  - SQLContext (i.e., org.apache.spark.sql.SQLContext)
    - Can read data directly from the filesystem
    - This is useful when the data you are trying to analyze does not reside in Hive (for example, JSON files stored in HDFS).[13]
  - HiveContext (i.e., org.apache.spark.sql.hive.HiveContext)
    - Is a superset of the SQLContext (HiveContext extends SQLContext)
    - Reads data by interacting with the Hive metastore
    - For example, to start using ORC, you need to define a HiveContext instance first.[12]
- Compatible with Apache Hive
  - Not only supports HiveQL, but can also access the Hive metastore, SerDes (Hive serialization and deserialization libraries), and UDFs (user-defined functions)
  - HiveQL queries run much faster on Spark SQL than on Hive
  - Existing Hive workloads can be easily migrated to Spark SQL
  - You can use Spark SQL with or without Hive
  - Can be configured to read Hive metastores created with different versions of Hive
- Prepackaged with a Thrift/JDBC/ODBC server
  - A client application can connect to this server and submit SQL/HiveQL queries using the Thrift, JDBC, or ODBC interface
  - Bundled with Beeline
    - Which can be used to submit HiveQL queries
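To make the two contexts concrete, here is a minimal spark-shell-style sketch against the Spark 1.x API this post describes. It assumes a SparkContext `sc` is already available (as it is in spark-shell) and uses a hypothetical path `/data/people.json`; as noted above, the ORC write requires the HiveContext.

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

// SQLContext: reads data directly from the filesystem (no Hive required).
val sqlContext = new SQLContext(sc)
val peopleDF = sqlContext.read.json("/data/people.json")   // schema-on-read: schema inferred at read time
peopleDF.printSchema()

// HiveContext: superset of SQLContext; talks to the Hive metastore and enables ORC.
val hiveContext = new HiveContext(sc)
val hivePeopleDF = hiveContext.read.json("/data/people.json")
hivePeopleDF.write.format("orc").save("/data/people_orc")  // schema-on-write: schema stored with the columnar files
```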
Architecture of SparkSQL
The architecture of SparkSQL contains three layers:
- Data Sources
- Schema RDD
- Language API
Data Sources
Usually the data source for Spark Core is a text file, an Avro file, etc. Spark SQL, however, operates on a variety of data sources through the DataFrame interface (see details below). The default data source is Parquet unless otherwise configured via spark.sql.sources.default; this default is used when (a short sketch of the reader/writer API follows the list):
- Reading or writing data with DataFrameReader and DataFrameWriter
- Creating an external table from a path (in Catalog.createExternalTable)
- Streaming with DataStreamReader and DataStreamWriter
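A minimal sketch of the reader/writer API in Spark 1.x style, assuming an existing `sqlContext` and hypothetical paths: `load`/`save` without a format fall back to the default source configured by spark.sql.sources.default (Parquet), while `format(...)` selects a source explicitly.

```scala
// Uses the default data source (parquet, unless spark.sql.sources.default says otherwise).
val usersDF = sqlContext.read.load("/data/users.parquet")
usersDF.select("name", "favorite_color").write.save("/data/names_and_colors.parquet")

// Selects a data source explicitly via format(...).
val logsDF = sqlContext.read.format("json").load("/data/logs.json")
logsDF.write.format("parquet").save("/data/logs_parquet")
```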
Some of the data sources supported by SparkSQL are listed below:
- JSON datasets
  - Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame.
- Hive tables
  - Hive support comes bundled with the Spark library via HiveContext (see the sketch at the end of this section)
- Parquet files
  - Use a columnar format
- Cassandra database
Read here for the methods of loading and saving data using the Spark data sources, and for the options that are available for the built-in data sources.
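As an illustration of the Hive tables source, the sketch below assumes a HiveContext `hiveContext` (created as shown earlier) and a hypothetical table `web_logs` already registered in the Hive metastore.

```scala
// Query a table registered in the Hive metastore directly with HiveQL.
val serverErrors = hiveContext.sql(
  "SELECT host, status, COUNT(*) AS hits FROM web_logs WHERE status >= 500 GROUP BY host, status")
serverErrors.show()

// The same table is also reachable through the DataFrame API.
val webLogsDF = hiveContext.table("web_logs")
println(webLogsDF.count())
```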
Schema RDD (or DataFrame)
Spark Core is designed around a special data structure called the RDD (a native data structure of Spark). Spark SQL, however, works on schemas, tables, and records via the SchemaRDD, which was later renamed the “DataFrame” API.
With a SQLContext, applications can create DataFrames from a variety of sources, such as:
- Hive tables
- Structured data files
- External databases
- Existing RDDs
Here is a summary of the DataFrame API:
- DataFrame vs. RDD
  - A DataFrame stores much more information about the structure of the data, such as the data types and names of the columns, than an RDD does.
  - This allows Spark to optimize DataFrame processing much more effectively than it can optimize Spark transformations and Spark actions operating on RDDs.
- Once data has been transformed into a DataFrame with a schema, it can then be:
  - Stored in Hive (for persistence)
    - If it needs to be accessed on a regular basis
  - Or registered as temp table(s)
    - Which exist only as long as the parent Spark application and its executors (the application can run indefinitely)
- Conversions from RDD to DataFrame and vice versa
  - Many of the Spark transformations and Spark actions that are typically applied to RDDs can also be applied to DataFrames.
  - Spark SQL supports two different methods for converting existing RDDs into DataFrames: inferring the schema via reflection, or specifying the schema programmatically.
- Registration of DataFrames as tables (see the sketch after this list)
  - An existing RDD can be implicitly converted to a DataFrame and then registered as a table.
  - All of the tables that have been registered can then be made available for access as a JDBC/ODBC data source via the Spark Thrift Server.
- Supports big datasets (up to petabytes)
- Supports different data formats and storage systems
  - Data formats
    - Avro, CSV, Elasticsearch, and Cassandra
  - Storage systems
    - HDFS, Hive tables, MySQL, etc.
- Provides language APIs for Python, Java, Scala, and R
  - It also achieves consistent performance of DataFrame API calls across languages, using state-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer (a tree transformation framework)
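To make the conversion and registration steps concrete, here is a minimal Spark 1.x sketch. It assumes an existing `sc` and `sqlContext` (as in spark-shell) and uses a hypothetical `Person` case class; `registerTempTable` is the Spark 1.x call (later renamed `createOrReplaceTempView`).

```scala
// Hypothetical record type used for illustration.
case class Person(name: String, age: Int)

import sqlContext.implicits._   // enables the implicit RDD-to-DataFrame conversion (toDF)

// Start from a plain RDD and convert it to a DataFrame via reflection-based schema inference.
val peopleRDD = sc.parallelize(Seq(Person("Ana", 34), Person("Bo", 21)))
val peopleDF  = peopleRDD.toDF()

// Register the DataFrame as a temp table so it can be queried with SQL
// (and, as noted above, exposed as a JDBC/ODBC data source via the Spark Thrift Server).
peopleDF.registerTempTable("people")
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.show()

// Going back: a DataFrame is also an RDD of Rows.
val backToRDD = adults.rdd
```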
Language API
Spark is compatible with different languages, and all of Spark's supported programming languages can be used to develop applications against the DataFrame API of Spark SQL. Spark SQL exposes the following language APIs and entry points (a short sketch comparing the SQL and DataFrame forms of the same query follows this list):
- Python
- Scala
- Java
- R
- HiveQL
- SQLContext
- HiveContext
- DataFrame
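For instance, the same logical query can be written in HiveQL/SQL or directly through the DataFrame API, and both are optimized by Catalyst. A minimal sketch, assuming the `people` temp table and `peopleDF` DataFrame from the earlier example:

```scala
// SQL / HiveQL form, submitted through the SQLContext.
val viaSQL = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18 ORDER BY age")

// Equivalent DataFrame API form.
val viaDSL = peopleDF.filter(peopleDF("age") >= 18).orderBy("age").select("name", "age")

viaSQL.show()
viaDSL.show()
```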
References
- Using Spark SQL (Hortonworks)
- Setting Up HiveServer2 (Apache Hive)
- HiveServer2 (slideshare.net)
- Hive Metastore Administration (Apache)
- HiveServer2 Overview (Apache)
- SQLLine 1.0.2
- Hadoop Cluster Maintenance
- Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing
- Apache Hive—Hive CLI vs Beeline (Xml And More)
- Beeline is a JDBC client based on the SQLLine CLI — although the JDBC driver used communicates with HiveServer2 using HiveServer2’s Thrift APIs.
- Apache Hive Essentials
- Three Benchmarks for SQL Coverage in HiBench Suite ― a Bigdata Micro Benchmark Suite
- Accessing ORC Files from Spark (Hortonworks)
- Using the Spark DataFrame API
- Spark Shell — spark-shell shell script
- Tuning Spark (Hortonworks)
- Spark SQL, DataFrames and Datasets Guide (Spark 1.6.1)
- spark.sql.sources.default (default: parquet)
- Sets the default data source to use in input/output
- Is used when reading or writing data in DataFrameWriter and DataFrameReader, when creating external table from a path (in Catalog.createExternalTable) and in the streaming DataStreamReader and DataStreamWriter.
- Mastering Apache Spark
- Three Benchmarks for SQL Coverage in HiBench Suite ― a Bigdata Micro Benchmark Suite
- Deep Dive Into Catalyst: Apache Spark 2.0’s Optimizer (slideshare.net)
- Accessing Spark SQL through JDBC and ODBC (Hortonworks)
- Using Spark to Virtually Integrate Hadoop with External Systems
- This article focuses on how to use SparkSQL to integrate, expose, and accelerate multiple sources of data from a single "Federation Tier".