Spark

Apache Spark is an open-source cluster-computing framework for large-scale data processing. It was originally developed at the University of California, Berkeley's AMPLab in 2009 and was later donated to the Apache Software Foundation. It is part of a broader ecosystem of open-source tools, including Apache Hadoop, that is widely used in today's analytics community.

Advantages

  • Lightning-fast processing – Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and 10x faster even when running on disk, compared with Hadoop MapReduce
  • Support for sophisticated analytics – Spark supports SQL queries, streaming data, complex analytics such as graph algorithms, and machine learning. Users can combine all these capabilities in a single workflow
  • Real-time Stream Processing
  • Ability to integrate with Hadoop and existing Hadoop data
  • Active and expanding community

Disadvantages

  • Data arriving out of time order is a problem for batch-based processing
  • Batch length restricts window-based analytics – real-world data is often of poor quality, some records may be missing, and streams can arrive with data out of time order
  • Per-server performance is limited by current stream-processing standards – Spark relies on scaling out across large numbers of servers to gain overall system performance
  • Writing stream processing operations from scratch is not easy – Spark Streaming offers only a limited library of built-in stream functions
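
The first two points above can be illustrated without Spark at all. The following is a minimal sketch in plain Python, not Spark API: the event list, the 10-second batch length and the function names are all assumptions made for illustration. It counts the same events twice, once by the batch in which each event arrived (roughly what batch-based processing sees) and once by the window in which each event actually occurred.

```python
from collections import Counter

# Hypothetical events as (event_time_s, arrival_time_s) pairs.
# The third event happened at t=4 but was delayed and only arrived at t=11.
EVENTS = [(1, 1), (3, 4), (4, 11), (12, 12), (14, 15)]

BATCH_SECONDS = 10  # assumed batch / window length


def count_by_arrival(events, batch_seconds):
    """Count events per batch, keyed by the batch they arrived in."""
    return Counter(arrival // batch_seconds for _, arrival in events)


def count_by_event_time(events, window_seconds):
    """Count events per window, keyed by when they actually occurred."""
    return Counter(event // window_seconds for event, _ in events)


by_arrival = count_by_arrival(EVENTS, BATCH_SECONDS)
by_event_time = count_by_event_time(EVENTS, BATCH_SECONDS)

print(dict(by_arrival))
print(dict(by_event_time))
```

The late event is counted in the second batch when grouping by arrival, although it belongs to the first window by event time, so the two counts disagree. Correcting for this requires holding state across batches, which is the extra work the list above refers to.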

Components

  • Types of cluster managers:

– Standalone: a simple cluster manager that makes it easy to set up a cluster
– Apache Mesos: a general cluster manager that can run service applications
– Hadoop YARN: the resource manager in Hadoop 2.0

  • Shipping code to the cluster – dynamically adding new files to be sent to executors
  • Monitoring – offers information about running executors and tasks
  • Job scheduling – Spark provides control over resource allocation both across applications and within a single application
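
As a rough illustration of how these components surface when an application is launched, the sketch below assembles a `spark-submit` command line in plain Python. The helper name `build_submit_command`, the host `master.example.com`, and the file names are hypothetical. The `--master`, `--files` and `--conf` options and the `spark.scheduler.mode` property are standard Spark settings, but the exact values depend on the cluster, so treat this as a sketch and not a recommended configuration. Note that `--files` ships files at submit time; adding files while the application is running is done through the Spark API instead.

```python
import shlex

# Master URL forms accepted by spark-submit's --master option.
MASTER_URLS = {
    "standalone": "spark://{host}:{port}",
    "mesos": "mesos://{host}:{port}",
    "yarn": "yarn",  # cluster location comes from the Hadoop configuration
}

# Conventional default ports; a real cluster may use different ones.
DEFAULT_PORTS = {"standalone": 7077, "mesos": 5050}


def build_submit_command(manager, app, host="master.example.com",
                         extra_files=(), fair_scheduling=False):
    """Return a spark-submit argument list for the chosen cluster manager."""
    port = DEFAULT_PORTS.get(manager)
    master = MASTER_URLS[manager].format(host=host, port=port)
    cmd = ["spark-submit", "--master", master]
    if extra_files:
        # Ship the listed files to the working directory of each executor.
        cmd += ["--files", ",".join(extra_files)]
    if fair_scheduling:
        # Share resources between jobs inside one application
        # (the default scheduling mode is FIFO).
        cmd += ["--conf", "spark.scheduler.mode=FAIR"]
    cmd.append(app)
    return cmd


cmd = build_submit_command("standalone", "app.py",
                           extra_files=["lookup.csv"], fair_scheduling=True)
print(shlex.join(cmd))
```

Switching the first argument to `"yarn"` or `"mesos"` changes only the master URL; the application itself stays the same, which is the point of having pluggable cluster managers.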

Development tools

  • IntelliJ
  • Eclipse