Easier Development
Near real time processing
In-Memory Data Storage
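For example, the in-memory storage point can be illustrated with a short PySpark sketch; the file name here is hypothetical:

mydata = sc.textFile("data.txt")   # hypothetical input file
mydata.cache()                     # ask Spark to keep this RDD in memory across the cluster
mydata.count()                     # the first action reads the file and populates the cache
mydata.count()                     # repeated actions are then served from memory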
We can use Apache Spark for:
Personalization and ad analytics
Real time video stream optimization
Real time analytics for telco clients
Cross device personalized video experience
Extract/Transform/Load (ETL)
Text mining
Index building
Graph creation and analysis
Pattern recognition
Collaborative filtering (see the sketch after this list)
Prediction models
Sentiment analysis
Risk assessment
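As one concrete illustration of the collaborative filtering item above, here is a minimal sketch using MLlib's ALS recommender; the ratings are made-up values:

from pyspark.mllib.recommendation import ALS, Rating

# Made-up (user, product, rating) triples
ratings = sc.parallelize([
    Rating(1, 1, 5.0), Rating(1, 2, 1.0),
    Rating(2, 1, 4.0), Rating(2, 3, 5.0),
])

# Train a matrix factorization model with Alternating Least Squares
model = ALS.train(ratings, rank=10, iterations=10)

# Predict how user 2 might rate product 2
model.predict(2, 2)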
We can use the Python shell (pyspark) or the Scala shell (spark-shell).
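For example, the Python shell is started from the Spark installation directory and provides a ready-made SparkContext in the variable sc:

$ bin/pyspark

(The Scala shell is started the same way with bin/spark-shell.)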
What is a Resilient Distributed Dataset (RDD)?
An RDD is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
RDDs are stored in memory across the cluster; if data in memory is lost, it can be recreated from the operations that produced it.
How do you create a Resilient Distributed Dataset?
– From a file or set of files
– From data in memory
– From another RDD (as in the sketch below)
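A minimal sketch of these three routes; the path and data are illustrative:

rdd1 = sc.textFile("hdfs:///data/sports.txt")  # from a file in external storage (illustrative path)
rdd2 = sc.parallelize([1, 2, 3, 4])            # from data already in memory in the driver
rdd3 = rdd2.map(lambda x: x * 2)               # from another RDD, via a transformation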
Example:
Sports.txt
Bayern Munich missed all four of their penalties as Borussia Dortmund reached the German Cup final after a shootout.
Bayern midfielder Xabi Alonso also slipped at the crucial moment – straight after Lahm
Klopp will now have a chance to win the German Cup for a second time with Dortmund in his last match in charge
mydata = sc.textFile("Sports.txt")  # load the file as an RDD of lines
mydata_uc = mydata.map(lambda line: line.upper())  # upper-case every line
mydata_filt = mydata_uc.filter(lambda line: line.startswith("B"))  # keep lines beginning with "B"
mydata_filt.count()
2
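To see the matching lines themselves rather than just the count, collect() returns the (small) result to the driver:

mydata_filt.collect()

This returns the two upper-cased lines that begin with "B".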
Download Apache Spark: https://spark.apache.org/downloads.html