Advantages
- Lightning-fast processing – Spark enables applications in Hadoop clusters to run up to 100x faster than Hadoop MapReduce in memory, and up to 10x faster even when running on disk
- Support for sophisticated analytics – Spark supports SQL queries, streaming data, complex analytics such as graph algorithms, and machine learning. Users can combine all these capabilities in a single workflow
- Near-real-time stream processing – Spark Streaming handles incoming data as a series of small batches
- Ability to integrate with Hadoop and existing Hadoop data
- Active and expanding community
Disadvantages
- Data arriving out of time order is a problem for batch-based processing, because records are grouped by the batch in which they arrive rather than by when they were generated
- Batch length restricts window-based analytics – window sizes are tied to the batch interval, yet real data is often of poor quality, with some records missing or late
- Per-server performance is limited by current stream-processing standards; Spark scales out across large numbers of servers to gain overall system performance
- Writing stream processing operations from scratch is not easy – Spark Streaming offers only a limited library of stream functions
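The out-of-order problem above can be illustrated with a minimal sketch in plain Python (no Spark required). The batch interval, the sample records, and the `bucket` helper are all hypothetical, chosen only to show how grouping by arrival time can differ from grouping by the time an event was generated.

```python
from collections import defaultdict

BATCH_SECONDS = 10  # hypothetical batch interval

# (arrival_time, event_time, value); the third record arrives late
events = [
    (1, 1, "a"),
    (4, 3, "b"),
    (12, 8, "c"),   # generated in the first 10 s, arrives in the second batch
    (15, 14, "d"),
]

def bucket(records, key_index):
    """Group values into fixed windows keyed by the chosen timestamp."""
    windows = defaultdict(list)
    for record in records:
        window_start = (record[key_index] // BATCH_SECONDS) * BATCH_SECONDS
        windows[window_start].append(record[2])
    return dict(windows)

by_arrival = bucket(events, 0)     # what batch-based grouping sees
by_event_time = bucket(events, 1)  # when the events were generated

print(by_arrival)     # {0: ['a', 'b'], 10: ['c', 'd']}
print(by_event_time)  # {0: ['a', 'b', 'c'], 10: ['d']}
```

Record "c" lands in the second batch even though it belongs to the first window, so any per-window count computed from batches alone would be off for both windows.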
Components
- Types of cluster managers:
– Standalone: a simple cluster manager included with Spark that makes it easy to set up a cluster
– Apache Mesos: a general cluster manager that can also run Hadoop MapReduce and service applications
– Hadoop YARN: the resource manager in Hadoop 2.0
- Shipping code to the cluster – dynamically adding new files to be sent to executors
- Monitoring – offers information about running executors and tasks
- Job scheduling – Spark gives control over resource allocation both across applications and within a single application
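As a rough sketch of how these components surface when an application is launched, the plain-Python helper below assembles a `spark-submit` argument list. The helper `build_submit_command`, the host name, the port, and the file names are hypothetical; the `--master`, `--files`, and `--conf` options and the `spark.cores.max` and `spark.scheduler.mode` properties are standard Spark settings. Note that `--files` ships files at submit time; files added while the application runs would go through `SparkContext.addFile` instead.

```python
# Master URL formats per cluster manager; host and port are placeholders.
MASTER_URLS = {
    "standalone": "spark://{host}:{port}",
    "mesos": "mesos://{host}:{port}",
    "yarn": "yarn",  # cluster location comes from the Hadoop configuration
}

def build_submit_command(manager, app, host="master.example.com", port=7077,
                         files=(), conf=None):
    """Return a spark-submit argument list (hypothetical helper)."""
    master = MASTER_URLS[manager].format(host=host, port=port)
    command = ["spark-submit", "--master", master]
    if files:
        # --files places the listed files in each executor's working directory
        command += ["--files", ",".join(files)]
    for key, value in sorted((conf or {}).items()):
        command += ["--conf", f"{key}={value}"]
    command.append(app)
    return command

command = build_submit_command(
    "standalone",
    "app.py",
    files=["lookup.csv"],
    conf={
        "spark.cores.max": "8",          # caps cores this application may take
        "spark.scheduler.mode": "FAIR",  # fair sharing between jobs in one application
    },
)
print(" ".join(command))
```

Here `spark.cores.max` limits what one application can claim from a shared standalone cluster (scheduling across applications), while `spark.scheduler.mode` governs how jobs inside that application share its executors.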
Development tools
- IntelliJ
- Eclipse