Spark

From UFRC
Revision as of 03:57, 17 May 2018 by Giljael (talk | contribs)

Description

Spark website

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Environment Modules

Run module spider spark to find out what environment modules are available for this application.

System Variables

  • HPC_SPARK_DIR - installation directory
  • HPC_SPARK_BIN - executable directory
  • HPC_SPARK_SLURM - SLURM job script examples
  • SPARK_HOME - examples directory
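
As a minimal sketch, a Python script started after loading the spark module could look up these variables as follows. It assumes the module exports them under the names HPC_SPARK_DIR, HPC_SPARK_BIN, HPC_SPARK_SLURM and SPARK_HOME; the helper name spark_module_paths is hypothetical and not part of Spark or the module system.

```python
import os


def spark_module_paths(env=os.environ):
    """Return the Spark-related variables set by the environment module.

    Variables that are not set (for example because the module has not
    been loaded) map to None instead of raising an error.
    """
    names = ("HPC_SPARK_DIR", "HPC_SPARK_BIN", "HPC_SPARK_SLURM", "SPARK_HOME")
    return {name: env.get(name) for name in names}


def missing_variables(paths):
    """List the variable names that are unset, to report a missing module load."""
    return [name for name, value in paths.items() if value is None]
```

Calling missing_variables(spark_module_paths()) before doing any work gives an early hint that the module was not loaded in the current shell or job.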

Running Spark in HiperGator

To run Spark jobs in HiperGator, first create a Spark cluster in HiperGator via SLURM. This section gives a simple example of how to create a Spark cluster in HiperGator and how to submit Spark jobs to that cluster.

  • Spark cluster in HiperGator
  1. Set SLURM parameters for Spark cluster
  2. Set Spark parameters for Spark cluster
  3. Set Spark Master and Workers
  4. Submit the SLURM job script to HiperGator
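
The SLURM job script itself is not reproduced here. As a rough illustration of the information that steps 1 to 3 have to settle, the following Python sketch builds the master URL and reads the allocation size from the job environment. It assumes a Spark standalone cluster whose master listens on the default port 7077, and it assumes SLURM has set SLURM_JOB_NUM_NODES and SLURM_CPUS_PER_TASK for the job. The helper names are hypothetical, and this is not the job script used in HiperGator.

```python
import os
import socket

# Default port of the Spark standalone master (assumed unchanged here).
SPARK_MASTER_PORT = 7077


def spark_master_url(host=None, port=SPARK_MASTER_PORT):
    """Build the spark://host:port URL that workers and jobs connect to.

    If no host is given, the name of the current node is used, which
    assumes this code runs on the node that hosts the master.
    """
    host = host or socket.gethostname()
    return f"spark://{host}:{port}"


def slurm_allocation(env=os.environ):
    """Read the node and CPU counts of the current SLURM job.

    SLURM_CPUS_PER_TASK is only set when --cpus-per-task was requested,
    so both values fall back to 1 when the variable is missing.
    """
    return {
        "nodes": int(env.get("SLURM_JOB_NUM_NODES", 1)),
        "cpus_per_task": int(env.get("SLURM_CPUS_PER_TASK", 1)),
    }
```

The URL returned by spark_master_url is the value a job would later pass as its master address, for example through the --master option of spark-submit.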

Spark interactive shells in Scala and Python.

  • Spark interactive shell in Scala (spark-shell)
  • Spark interactive shell in Python (pyspark)
  1. Pi estimation in pyspark
  2. Pi estimation from file with pyspark
  • Spark-submit
  1. Pi estimation
  2. Wordcount
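
The two examples named above, Pi estimation and word count, can be sketched in Python roughly as follows. This is an illustrative sketch and not the example code shipped with the module: the function names are hypothetical, and both functions expect an existing SparkContext to be passed in as sc.

```python
import random
from operator import add


def inside_unit_circle(_):
    """Draw one random point in the unit square and test whether it
    falls inside the quarter circle of radius 1."""
    x, y = random.random(), random.random()
    return x * x + y * y < 1.0


def estimate_pi(sc, num_samples=1_000_000, num_partitions=None):
    """Monte Carlo estimate of pi.

    The fraction of random points that land inside the quarter circle
    approximates pi / 4, so the estimate is 4 * hits / samples.
    """
    hits = (
        sc.parallelize(range(num_samples), num_partitions)
        .filter(inside_unit_circle)
        .count()
    )
    return 4.0 * hits / num_samples


def count_words(sc, path):
    """Count whitespace-separated words in the text file at `path`."""
    pairs = (
        sc.textFile(path)
        .flatMap(lambda line: line.split())   # one record per word
        .map(lambda word: (word, 1))          # pair each word with a count of 1
        .reduceByKey(add)                     # sum the counts per word
        .collect()
    )
    return dict(pairs)
```

In the pyspark shell a SparkContext is already available as sc, so estimate_pi(sc) can be called directly. A script run through spark-submit would first have to create its own SparkContext and pass it in.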