Near real-time processing
In-memory data storage
We can use Apache Spark for:
Personalization and ad analytics
Real-time video stream optimization
Real-time analytics for telco clients
Cross-device personalized video experiences
Graph creation and analysis
We can use the Python shell (pyspark) or the Scala shell (spark-shell).
What is a Resilient Distributed Dataset (RDD)?
An RDD is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
RDDs are stored in memory across the cluster; if data in memory is lost, it can be recreated from its lineage.
How do you create a Resilient Distributed Dataset?
From a file or set of files – From data in memory – From another RDD
Example input file, sport.txt:
Bayern Munich missed all four of their penalties as Borussia Dortmund reached the German Cup final after a shootout.
Bayern midfielder Xabi Alonso also slipped at the crucial moment – straight after Lahm
Klopp will now have a chance to win the German Cup for a second time with Dortmund in his last match in charge
mydata = sc.textFile("sport.txt")
mydata_uc = mydata.map(lambda line: line.upper())
mydata_filt = \
    mydata_uc.filter(lambda line: \
        line.startswith("BAYERN"))  # example predicate; the original notes truncate here
Download Apache Spark