Abstract: |
Spark, currently the most active project in the Hadoop ecosystem (Zaharia, 2014), is a fast and general-purpose engine for large-scale data processing. Thanks to its advanced Directed Acyclic Graph (DAG) execution engine and in-memory computing, Spark runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk (Apache, 2016). However, Spark performance is affected by many system software, hardware, and dataset factors, especially those related to memory and the JVM, which makes capacity planning and tuning for Spark clusters extremely difficult. Current planning methods are mostly estimation-based and rely heavily on experience and trial-and-error; such approaches are neither efficient nor accurate, especially as software-stack complexity and hardware diversity increase. Here we propose a novel Spark simulator based on CSMethod (Bian et al., 2014), extended with a fine-grained, multi-layered memory subsystem, and well suited to Spark cluster deployment planning, performance evaluation, and optimization before system provisioning. The simulator models the entire Spark application execution life cycle, including DAG generation, Resilient Distributed Dataset (RDD) processing, and block management. Hardware activities derived from these software operations are dynamically mapped onto architecture models for processors, storage, and network devices. The performance behaviour of the cluster memory system at multiple layers (Spark, JVM, OS, hardware) is modeled in an enhanced, fine-grained, stand-alone global library. Experimental results with several popular Spark micro-benchmarks and a real-world IoT workload demonstrate that our simulator achieves high accuracy, with an average error rate below 7%. With lightweight computing-resource requirements (a laptop is enough), our simulator runs at a speed comparable to native execution on a multi-node, high-end cluster.