This tutorial describes how to install, configure, and run Apache Spark on Clear Linux* OS on a single machine running the master daemon and a worker daemon.
Apache Spark is a fast, general-purpose cluster computing system with the following features:
- Provides high-level APIs in Java*, Scala*, Python*, and R*.
- Includes an optimized engine that supports general execution graphs.
- Supports high-level tools including Spark SQL, MLlib, GraphX, and Spark Streaming.
This tutorial assumes that Clear Linux OS is installed on your host system.
For detailed instructions on installing Clear Linux OS on a bare metal system, visit the bare metal installation guide.
Before installing any new packages, update Clear Linux OS with the following command:
sudo swupd update
Apache Spark is included in the big-data-basic bundle. To install the framework, run the following command:
sudo swupd bundle-add big-data-basic
Create the configuration directory:
sudo mkdir /etc/spark
Copy the default templates from /usr/share/defaults/spark to /etc/spark:
sudo cp /usr/share/defaults/spark/* /etc/spark
Since Clear Linux OS is a stateless system, you should never modify the files under the /usr/share/defaults directory. The software updater overwrites those files.
Copy the template files shown below to create custom configuration files:
sudo cp /etc/spark/spark-defaults.conf.template /etc/spark/spark-defaults.conf
sudo cp /etc/spark/spark-env.sh.template /etc/spark/spark-env.sh
sudo cp /etc/spark/log4j.properties.template /etc/spark/log4j.properties
Edit the /etc/spark/spark-env.sh file and add the SPARK_MASTER_HOST variable, set to the IP address of your host. This tutorial uses 10.300.200.100 as an example address; replace it with your own. View your IP address using the hostname -I command.
This step is optional. It lets you use the master’s web user interface to view information needed later in this tutorial.
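As a rough illustration of this step, the Python sketch below picks the first IPv4 address from hostname -I style output and renders the line to add to spark-env.sh. The helper names pick_master_host and spark_env_line are hypothetical, and 192.168.1.20 is an assumed sample address, not one taken from this tutorial.

```python
import ipaddress


def pick_master_host(hostname_output: str) -> str:
    """Return the first valid IPv4 address found in `hostname -I` style output."""
    for token in hostname_output.split():
        try:
            address = ipaddress.ip_address(token)
        except ValueError:
            continue  # token is not a valid IP address, skip it
        if address.version == 4:
            return token
    raise ValueError("no IPv4 address found")


def spark_env_line(host: str) -> str:
    """Render the assignment to add to /etc/spark/spark-env.sh."""
    return f'SPARK_MASTER_HOST="{host}"'


if __name__ == "__main__":
    # Assumed sample; on a real host, paste the output of `hostname -I` here.
    sample = "2001:db8::1 192.168.1.20 172.17.0.1\n"
    print(spark_env_line(pick_master_host(sample)))
```

With the sample above, the script prints SPARK_MASTER_HOST="192.168.1.20". The example address 10.300.200.100 is rejected by this check because 300 is outside the 0 to 255 range of an IPv4 octet, so it works only as a placeholder.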
Edit the /etc/spark/spark-defaults.conf file and update the spark.master variable with the SPARK_MASTER_HOST address and port 7077, for example spark://10.300.200.100:7077.
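The edit to spark-defaults.conf can be sketched in Python as follows. This is a minimal sketch, assuming the file holds one whitespace-separated key and value per line; the function name set_spark_property and the sample file content are hypothetical, and the sketch works on text in memory rather than on /etc/spark directly.

```python
def set_spark_property(conf_text: str, key: str, value: str) -> str:
    """Return conf_text with exactly one active `key value` line.

    Commented lines are kept untouched; any active line that already
    defines `key` is dropped and the new definition is appended.
    """
    kept = []
    for line in conf_text.splitlines():
        fields = line.split()
        if fields and fields[0] == key:
            continue  # drop the old active definition of this key
        kept.append(line)
    kept.append(f"{key} {value}")
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    # Assumed sample content, not the real template shipped with the bundle.
    sample = (
        "# Example:\n"
        "# spark.master spark://master:7077\n"
        "spark.master local[*]\n"
    )
    print(set_spark_property(sample, "spark.master", "spark://192.168.1.20:7077"), end="")
```

Appending the new line instead of rewriting the commented example keeps the template's documentation intact, which makes later manual edits easier to follow.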
Start the master server:
sudo /usr/share/apache-spark/sbin/start-master.sh
Start one worker daemon and connect it to the master using the spark.master value defined earlier:
sudo /usr/share/apache-spark/sbin/start-slave.sh spark://10.300.200.100:7077
Open a web browser and view the worker daemon information at the master’s IP address and port 8080, for example http://10.300.200.100:8080.
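If no browser is available on the host, a script can do a similar check. The Python sketch below builds the web interface address from the master's IP address and port 8080 and reports whether the page answers. The function names master_ui_url and ui_is_reachable are hypothetical, and 192.168.1.20 is an assumed sample address.

```python
import http.client
import urllib.request


def master_ui_url(host: str, port: int = 8080) -> str:
    """Build the address of the master's web user interface."""
    return f"http://{host}:{port}"


def ui_is_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if the page at `url` answers with HTTP status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, DNS errors, and HTTP errors.
        return False


if __name__ == "__main__":
    url = master_ui_url("192.168.1.20")  # replace with your master's address
    state = "reachable" if ui_is_reachable(url) else "not reachable"
    print(f"{url} is {state}")
```

A True result only shows that something serves a page on that port; the browser view is still needed to confirm that the worker is listed.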
Run the wordcount example using a file on your local host and output the results to a new file with the following command:
sudo spark-submit /usr/share/apache-spark/examples/src/main/python/wordcount.py ~/Documents/example_file > ~/Documents/results
Open a web browser and view the application information at the master’s IP address and port 8080, for example http://10.300.200.100:8080.
View the results of the wordcount application in the results file written by the previous command.
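For a long input file, the results are easier to read once sorted by count. The Python sketch below parses the output and lists the most frequent words. It assumes that each result line has the form "word: count", which is how the bundled example is expected to print its output; the function name parse_results and the sample text are hypothetical.

```python
from collections import Counter


def parse_results(text: str) -> Counter:
    """Parse lines of the assumed form `word: count` into a Counter."""
    counts = Counter()
    for line in text.splitlines():
        # Split on the last ": " so that words containing colons survive.
        word, separator, number = line.rpartition(": ")
        if separator and number.strip().isdigit():
            counts[word] += int(number)
        # Lines that do not match the assumed form are ignored.
    return counts


if __name__ == "__main__":
    # Assumed sample; in practice, read the text of your results file instead.
    sample = "spark: 3\nthe: 10\nnote:: 1\n"
    for word, count in parse_results(sample).most_common(2):
        print(word, count)
```

To use it on real output, read the results file with pathlib.Path(...).expanduser().read_text() and pass the text to parse_results.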
You have successfully installed and set up a standalone Apache Spark cluster and run a simple wordcount example.