Spark 2.0
Spark 2.0 is a major release. It packages structured API improvements (the unification of DataFrame, Dataset, and the new SparkSession entry point), MLlib model export, and various updates across the platform libraries. Spark 2.0 is built on Scala 2.11 and supports the SQL 2003 standard. The theme of the release is making development easier with streamlined APIs, execution faster through optimized code compilation, and the platform smarter overall.
Whole-stage code generation improves performance, running up to 10x faster than the existing engine. I/O operations are also optimized through the built-in cache and Parquet storage.
Spark 2.0 continues to support standard SQL, and its new ANSI SQL parser extends Spark's querying capabilities significantly beyond other libraries. On the API side, Spark 2.0 merges the DataFrame and Dataset APIs.
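As a minimal sketch of the expanded SQL support, the snippet below runs a scalar subquery, one of the constructs the new parser handles; the sales data and view name are purely illustrative:

import spark.implicits._  // already in scope inside spark-shell

// Illustrative data registered as a temporary view.
val sales = Seq(("US", 100), ("UK", 80), ("IN", 120)).toDF("country", "amount")
sales.createOrReplaceTempView("sales")

// A scalar subquery in the WHERE clause, accepted by the new ANSI SQL parser.
spark.sql("SELECT country, amount FROM sales WHERE amount > (SELECT avg(amount) FROM sales)").show()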
RDD, DataFrames & Datasets
The RDD was the standard abstraction of Spark. From Spark 2.0 onward, the Dataset becomes the new abstraction layer. The RDD API will still be available, but it becomes a low-level API used mostly for runtime and library development.
The Dataset is a superset of the DataFrame API, which was released in Spark 1.3. The Dataset, together with the DataFrame API, brings better performance and flexibility to the platform compared to the RDD API. The Dataset will also replace the RDD as the abstraction for streaming in future releases. Spark 2.0 supports infinite (unbounded) DataFrames, which makes it possible to run operations such as aggregations and groupBy over the whole stream of data.
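A minimal sketch of the typed Dataset API and its relationship to DataFrames; the case class and sample values are illustrative:

import spark.implicits._  // already in scope inside spark-shell

// Illustrative case class describing the records.
case class Person(name: String, age: Int)

// A strongly typed Dataset[Person].
val ds = Seq(Person("alice", 30), Person("bob", 25)).toDS()

// Typed operations are checked at compile time.
ds.filter(p => p.age > 26).show()

// In Spark 2.0 a DataFrame is simply a Dataset[Row].
val df = ds.toDF()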
Spark Session
Spark 2.0 brings a single point of entry: the SparkSession. It is essentially a combination of SQLContext, HiveContext, and a future StreamingContext. All the APIs available on those contexts are available on the SparkSession as well. Internally, the SparkSession holds a SparkContext for the actual computation.
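In a standalone application the session is built explicitly; here is a minimal sketch, with the application name and master chosen purely for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Spark2Demo")        // illustrative application name
  .master("local[*]")           // for local testing; normally supplied via spark-submit
  .enableHiveSupport()          // brings in the capabilities of the old HiveContext
  .getOrCreate()

// The underlying SparkContext is still accessible for low-level work.
val sc = spark.sparkContext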
When you start the interactive shell (spark-shell), a SparkSession is created automatically and exposed as the variable spark.
Below is a sample of using Spark SQL through the session. It can be used to read data from various types of input formats such as CSV, JSON, ORC, Avro, Parquet, text, JDBC, tables, etc.
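A quick sketch of a few of those readers; the paths and table name are placeholders:

// CSV with a header row (built into Spark 2.0, no external package needed).
val csvDF = spark.read.option("header", "true").csv("/path/to/data.csv")

// Parquet files, and ORC files (ORC requires Hive support to be enabled).
val parquetDF = spark.read.parquet("/path/to/data.parquet")
val orcDF = spark.read.orc("/path/to/data.orc")

// A table already registered in the catalog.
val tableDF = spark.table("some_table")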
Here is an example of reading data from Amazon S3.
val hadoopConf = sc.hadoopConfiguration
// Use the native S3 filesystem implementation for s3n:// URLs.
hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
// Placeholders for your AWS credentials.
hadoopConf.set("fs.s3n.awsAccessKeyId", "yourKeyId")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "yourSecretKey")
Set the properties with your Amazon credentials and read the data via the bucket URL.
val dF = spark.read.json("s3n://bucket-name/*/*/*/*/*/*/*/*.json.gz")
This reads the files in the S3 bucket and allows you to query them on the fly.
The data is read as a DataFrame, and we can print the schema of the data read from the source.
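For example:

// Inspect the schema Spark inferred from the JSON files.
dF.printSchema()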
It runs up to 10x faster than the previous version, Spark 1.6. The registerTempTable method is now deprecated, so we use its replacement, createOrReplaceTempView.
dF.createOrReplaceTempView("logStash")
Once the temporary view is created, queries can be triggered through the Spark session:
spark.table("logStash")
spark.sql("select Records.eventName,Records.eventTime from logStash").show()
The whole process, from reading the data to querying it, works as shown above.
Spark 2.0 Compiler Update
Performance plays a vital role in Spark's adoption and has led many big data developers to prefer it over existing MapReduce engines. Spark 2.0 updates its execution layer to reduce the number of CPU cycles consumed by I/O operations during query execution. It ships with the second-generation Tungsten engine, which emits optimized bytecode at runtime that collapses an entire query into a single function, eliminating virtual function calls and leveraging CPU registers for intermediate data. This promises around a 10x improvement in performance.
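You can see whole-stage code generation at work with explain(); in the printed plan, operators prefixed with an asterisk have been collapsed into generated code. A minimal sketch:

// Operators marked with '*' in the output are compiled together
// by whole-stage code generation.
spark.range(1000000).selectExpr("sum(id)").explain()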
Spark Structured Streaming
The existing Spark Streaming API provides real-time streaming and data processing. It evolved as one of the first attempts in the big data space to unify batch and streaming computation. Its first streaming abstraction, called DStream, was introduced in Spark 0.7 and offered a highly scalable, fault-tolerant system with high throughput. Spark 2.0 now enables Structured Streaming.
It enables applications to make decisions in real time. The existing system supported only the streaming of data itself. To add more intelligence to real-time processing, Structured Streaming makes it possible to combine business logic with the data stream on the fly. The streaming API now works as a full stack, so developers no longer need to depend on external applications to apply logic to the streams.
Structured Streaming is an extension of the Dataset/DataFrame API, and the newly introduced SparkSession will soon support a streaming context as well. This unification should make adoption easy for existing Spark users, allowing them to leverage their knowledge of the Spark batch API to answer new questions in real time. Key features will include support for event-time-based processing, out-of-order/delayed data, sessionization, and tight integration with non-streaming data sources and sinks.
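A minimal sketch of the unified API, assuming a socket source on localhost:9999 purely for illustration; note that the batch-style DataFrame operations apply unchanged to the unbounded stream:

// An unbounded DataFrame fed by a socket source (illustrative host/port).
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// The same DataFrame operations as in batch: a running count per distinct line.
val counts = lines.groupBy("value").count()

// Continuously write the updated result to the console.
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()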
References
http://www.datanami.com/2016/02/25/spark-2-0-to-introduce-new-structured-streaming-engine/
http://thenewstack.io/spark-2-0-will-offer-interactive-querying-live-data/
https://databricks.com/blog/2016/05/11/apache-spark-2-0-technical-preview-easier-faster-and-smarter.html
https://www.linkedin.com/pulse/experiment-spark-20-session-vishnu-viswanath?trk=mp-reader-card
http://blog.madhukaraphatak.com/introduction-to-spark-two-part-1/