Apache Spark: Resilient Distributed Dataset (RDD)

Apache Spark is a fast, general-purpose engine for large-scale data processing on a cluster.

Advantages of Spark

High-level programming framework

Write applications quickly in Scala, Python or Java.

Cluster computing

Combine SQL, streaming, and complex analytics

Distributed storage

Data in memory

Easier Development

Near real-time processing

In-Memory Data Storage

We can use Apache Spark for

Personalization and ad analytics

Real-time video stream optimization

Real-time analytics for telco clients

Cross-device personalized video experience

Extract/Transform/Load (ETL)

Text mining

Index building

Graph creation and analysis

Pattern recognition

Collaborative filtering

Prediction models

Sentiment analysis

Risk assessment

Spark can be used interactively from the Python shell (pyspark) or the Scala shell (spark-shell).

What is a Resilient Distributed Dataset?

A Resilient Distributed Dataset (RDD) is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

An RDD is stored in memory across the cluster. If data in memory is lost, it can be recreated.
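As a rough illustration of why lost data can be rebuilt: Spark records the chain of transformations (the lineage) that produced each RDD, so it can rerun them when in-memory data disappears. The sketch below assumes an existing SparkContext passed in as `sc` (the pyspark shell provides one). The helper names `is_even` and `cached_evens_with_lineage` are made up for this example.

```python
def is_even(n):
    """Return True for even integers."""
    return n % 2 == 0


def cached_evens_with_lineage(sc):
    """Build a small RDD, keep it in memory, and report how it was derived.

    `sc` is assumed to be a live SparkContext (hypothetical caller-supplied).
    """
    numbers = sc.parallelize(range(10))  # base RDD from data in the driver
    evens = numbers.filter(is_even)      # derived RDD; Spark remembers this step
    evens.cache()                        # ask Spark to keep the partitions in memory
    count = evens.count()                # first action computes and caches the data
    lineage = evens.toDebugString()      # description of the recorded lineage
    if isinstance(lineage, bytes):       # PySpark may return bytes here
        lineage = lineage.decode("utf-8")
    return count, lineage
```

In the pyspark shell this could be called as `cached_evens_with_lineage(sc)`. The exact lineage text depends on the Spark version, so it is best treated as diagnostic output only.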

How to create a Resilient Distributed Dataset?

From a file or set of files

From data in memory

From another RDD
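The three creation routes can be sketched roughly as follows. This assumes a SparkContext is passed in as `sc`, as the pyspark shell provides. The function names `square` and `create_rdds`, the sample numbers, and the `path` argument are illustrative choices, not part of the original example.

```python
def square(n):
    """Return n squared; used to derive one RDD from another."""
    return n * n


def create_rdds(sc, path):
    """Create one RDD per route listed above.

    `sc` is assumed to be a live SparkContext; `path` is any text file
    Spark can read (local file, HDFS path, ...).
    """
    from_file = sc.textFile(path)                  # from a file or set of files
    from_memory = sc.parallelize([1, 2, 3, 4, 5])  # from data in memory
    from_other = from_memory.map(square)           # from another RDD
    return from_file, from_memory, from_other
```

Transformations such as `map` are lazy, so nothing is read or computed until an action such as `count()` or `collect()` is called on one of the returned RDDs.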

Example.

Sports.txt

Bayern Munich missed all four of their penalties as Borussia Dortmund reached the German Cup final after a shootout.

Bayern midfielder Xabi Alonso also slipped at the crucial moment – straight after Lahm

Klopp will now have a chance to win the German Cup for a second time with Dortmund in his last match in charge

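# Load the file as an RDD with one element per line (sc is provided by the pyspark shell)
mydata = sc.textFile("Sports.txt")

# Convert every line to upper case
mydata_uc = mydata.map(lambda line: line.upper())

# Keep only the lines that start with "B"
mydata_filt = mydata_uc.filter(lambda line: line.startswith("B"))

# Count the remaining lines
mydata_filt.count()

2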
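The example above relies on the `sc` variable that the pyspark shell creates automatically. Outside the shell, a standalone script has to build its own SparkContext. The sketch below is one possible way to do that. The function names, the `local[*]` master setting, and the application name are assumptions made for illustration.

```python
def count_lines_starting_with_b(sc, path):
    """Upper-case every line of the file and count those starting with 'B'.

    `sc` is assumed to be a live SparkContext.
    """
    lines = sc.textFile(path)                                 # one element per line
    upper = lines.map(lambda line: line.upper())              # transformation (lazy)
    starts_b = upper.filter(lambda line: line.startswith("B"))
    return starts_b.count()                                   # action: triggers the work


def main():
    """Run the example locally; requires pyspark to be installed."""
    from pyspark import SparkContext  # imported here so the helper above stays importable

    sc = SparkContext(master="local[*]", appName="SportsExample")
    try:
        print(count_lines_starting_with_b(sc, "Sports.txt"))
    finally:
        sc.stop()  # release the local Spark resources
```

Calling `main()` from a script launched with `spark-submit` or plain `python` (with pyspark installed) would run the same three steps as the shell session.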

Download Apache Spark

https://spark.apache.org/downloads.html
