- General-purpose data processing engine designed for big data.
- Written in Scala.
- Spark is a platform for cluster computing.
- Spark lets you spread data and computations over clusters with multiple nodes (think of each node as a separate computer).
- Very large datasets are split into smaller partitions, so each node works on only its own subset of the data.
- Data processing and computation are performed in parallel over the nodes in the cluster.
- However, with greater computing power comes greater complexity.
- Can be used for analytics, data integration, machine learning, and stream processing.
- Master and Worker:
- Master:
- Connected to the rest of the computers in the cluster, which are called workers
- Sends the workers data and calculations to run
- Worker:
- Run the calculations they receive and send their results back to the master (see the sketch after this list)
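- A minimal PySpark sketch of this master/worker split (assuming a local `SparkSession`; the data, partition count, and function are made up for illustration):

```python
from pyspark.sql import SparkSession

# Assumption: local SparkSession for illustration; on a real cluster the
# master URL would point at the cluster's master instead of "local[*]".
spark = (SparkSession.builder
         .master("local[*]")
         .appName("master-worker-sketch")
         .getOrCreate())
sc = spark.sparkContext

# The driver (master side) splits the data into 4 partitions; each worker
# computes squares for its own partition and sends results back.
rdd = sc.parallelize(range(100), numSlices=4)
squares = rdd.map(lambda x: x * x)
print(squares.sum())  # result collected back on the master/driver

spark.stop()
```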
- Spark's core data structure is the Resilient Distributed Dataset (RDD)
- Instead of RDDs, it is easier to work with the Spark DataFrame abstraction built on top of RDDs (operations on DataFrames are automatically optimized).
- Spark DataFrames are immutable; any modification returns a new DataFrame instance.
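- A small sketch of that immutability (assuming a running `SparkSession`, see the next bullet; the column names and values are made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, 2.0), (2, 3.5)], ["id", "value"])

# withColumn does not change df in place; it returns a brand-new DataFrame,
# so the result must be captured in a new (or reassigned) variable.
df2 = df.withColumn("value_doubled", col("value") * 2)

df.show()   # still only has id, value
df2.show()  # has id, value, value_doubled
```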
- You start by creating a `SparkSession` (or `SparkContext`) entry point
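- A minimal sketch of creating the entry point (the app name is arbitrary):

```python
from pyspark.sql import SparkSession

# SparkSession is the modern entry point; getOrCreate() reuses an existing
# session if one is already running (e.g. inside the shell).
spark = SparkSession.builder.appName("my-notes-app").getOrCreate()

# The lower-level SparkContext is still reachable from the session.
sc = spark.sparkContext
print(spark.version, sc.master)
```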
- 2 modes:
- local mode: a single computer
- cluster mode: a cluster of computers
- You typically build in local mode and then deploy in cluster mode (no code change is required)
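- A sketch of the same script running unchanged in both modes; only the `--master` passed to `spark-submit` changes (the file name and cluster manager below are just examples):

```python
# my_job.py (hypothetical file name)
from pyspark.sql import SparkSession

# No master is hardcoded here, so spark-submit decides where it runs:
#   local mode:   spark-submit --master local[*] my_job.py
#   cluster mode: spark-submit --master yarn my_job.py
spark = SparkSession.builder.appName("same-code-both-modes").getOrCreate()

df = spark.range(1_000_000)
print(df.count())

spark.stop()
```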
- Spark shell :
- interactive environment for Spark jobs
- allows interacting with data on disk or in memory
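- A sketch of a `pyspark` shell session (the CSV path and column names are hypothetical; the shell pre-creates a `SparkSession` named `spark`):

```python
# Launched from a terminal with:  pyspark

# Data on disk: read a (hypothetical) CSV into a DataFrame.
df = spark.read.csv("data/flights.csv", header=True, inferSchema=True)

# Data in memory: cache it, then query it interactively.
df.cache()
df.count()
df.select("origin", "dest").show(5)
```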