Spark

Description

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Environment Modules

Run "module spider spark" to find out what environment modules are available for this application.

System Variables

HPC_SPARK_DIR - installation directory
HPC_SPARK_BIN - executable directory
HPC_SPARK_SLURM - SLURM job script examples
SPARK_HOME - examples directory
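
As a quick check that the module set these variables, a minimal Python sketch like the following could print them. The helper names spark_env and report are hypothetical and chosen only for illustration; the variable names are the ones listed above, and they are expected to be defined only after a spark module has been loaded.

```python
import os

# Variable names as listed above; they are only expected to be set
# once a spark environment module has been loaded in the current shell.
SPARK_VARS = ("HPC_SPARK_DIR", "HPC_SPARK_BIN", "HPC_SPARK_SLURM", "SPARK_HOME")


def spark_env(environ=os.environ):
    """Return a dict mapping each Spark-related variable to its value, or None if unset."""
    return {name: environ.get(name) for name in SPARK_VARS}


def report(environ=os.environ):
    """Print one line per variable so that missing ones are easy to spot."""
    for name, value in spark_env(environ).items():
        print(f"{name} = {value if value is not None else '(not set)'}")
```

Calling report() in a Python session started after loading the module should show a path for each variable; "(not set)" suggests the module is not loaded in that shell.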

Running Spark in HiperGator

To run Spark jobs in HiperGator, first create a Spark cluster through SLURM. This section gives a simple example of how to create a Spark cluster in HiperGator and how to submit Spark jobs to that cluster.

Spark cluster in HiperGator

Set SLURM parameters for Spark cluster
Set Spark parameters for Spark cluster
Set Spark Master and Workers
Submit the SLURM job script to HiperGator
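
Once the SLURM job has started the master and workers, an application needs the master's address to connect. The sketch below assumes a standalone-mode cluster listening on Spark's default master port 7077; the helper names master_url and connect, the application name "example-app", and the host argument are hypothetical, and the actual host is whichever node the job script started the master on.

```python
def master_url(host, port=7077):
    """Build a standalone-mode master URL; 7077 is Spark's default master port."""
    return f"spark://{host}:{port}"


def connect(host, port=7077, app_name="example-app"):
    """Open a SparkSession against the standalone master (requires pyspark)."""
    # Imported lazily so that master_url() can be used without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(master_url(host, port))
        .appName(app_name)
        .getOrCreate()
    )
```

If the job script sets a different master port, pass it as the port argument rather than relying on the default.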

Spark interactive shells in Scala and Python

Spark interactive shell in Scala (spark-shell)
Spark interactive shell in Python (pyspark)

Pi estimation in pyspark
Pi estimation from file with pyspark
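
A Pi estimation in pyspark is commonly written as a Monte Carlo simulation: sample random points in the unit square and count how many fall inside the quarter circle. The following is a minimal sketch of that approach, not necessarily the example shipped with the module; all function names are hypothetical, and sc stands for the SparkContext that the pyspark shell predefines.

```python
import random


def hits_in_partition(seed, samples):
    """Count random points in the unit square that fall inside the quarter circle."""
    rng = random.Random(seed)  # a per-partition seed keeps each run repeatable
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def pi_from_hits(hits, total):
    """The quarter circle has area pi/4, so pi is roughly 4 * hits / total."""
    return 4.0 * hits / total


def estimate_pi_local(partitions, samples):
    """Same computation without Spark, useful for checking the logic."""
    hits = sum(hits_in_partition(seed, samples) for seed in range(partitions))
    return pi_from_hits(hits, partitions * samples)


def estimate_pi_spark(sc, partitions, samples):
    """Distribute one seed per partition over SparkContext `sc` and sum the hits."""
    hits = (
        sc.parallelize(range(partitions), partitions)
        .map(lambda seed: hits_in_partition(seed, samples))
        .sum()
    )
    return pi_from_hits(hits, partitions * samples)
```

In the pyspark shell, a call such as estimate_pi_spark(sc, 8, 100000) would spread the sampling over eight tasks; more samples give a tighter estimate at the cost of run time.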

Spark-submit

Pi estimation
Wordcount
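
Word count is the usual first job to run through spark-submit. The sketch below shows one way to write it with the RDD API, assuming a plain-text input file; the function names and the tokenizing rule (lowercase runs of letters, digits, and apostrophes) are illustrative assumptions, and sc is a SparkContext created by the application.

```python
import re
from collections import Counter

# Assumed tokenizing rule: a word is a run of letters, digits, or apostrophes.
WORD = re.compile(r"[A-Za-z0-9']+")


def tokenize(line):
    """Split a line into lowercase words, dropping punctuation."""
    return [w.lower() for w in WORD.findall(line)]


def count_words_local(lines):
    """Reference implementation without Spark, handy for small tests."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return dict(counts)


def count_words_spark(sc, path):
    """Word count over a text file using the RDD API of SparkContext `sc`."""
    return (
        sc.textFile(path)                  # one record per line
        .flatMap(tokenize)                 # one record per word
        .map(lambda word: (word, 1))       # pair each word with a count of 1
        .reduceByKey(lambda a, b: a + b)   # sum the counts per word
        .collectAsMap()                    # bring the result back as a dict
    )
```

A script built around these functions could be passed to spark-submit together with the cluster's master URL through the --master option. Note that collectAsMap pulls every distinct word to the driver, so for very large vocabularies it may be preferable to write the result out with saveAsTextFile instead.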
