Data Analytics Reference Stack

This guide explains how to use the Data Analytics Reference Stack (DARS) and, optionally, how to build your own DARS container image.

Any system that supports Docker* containers can be used with DARS. The steps in this guide use Clear Linux* OS as the host system.

The Data Analytics Reference Stack release

The Data Analytics Reference Stack (DARS) provides developers and enterprises a straightforward, highly optimized software stack for storing and processing large amounts of data. More detail is available on the DARS architecture and performance benchmarks.

The Data Analytics Reference Stack provides two pre-built Docker images, available on Docker Hub: one built with OpenBLAS and one built with Intel® MKL.

We recommend you view the latest component versions for each image in the README found in the Data Analytics Reference Stack GitHub* repository. Because Clear Linux OS is a rolling distribution, the package version numbers in the Clear Linux OS-based containers may not be the latest released by Clear Linux OS.

Note

The Data Analytics Reference Stack is a collective work, and each piece of software within the work has its own license. Please see the DARS Terms of Use for more details about licensing and usage of the Data Analytics Reference Stack.

Using the Docker images

  1. To immediately start using the latest stable DARS images, pull an image directly from Docker Hub. This example uses the DARS with Intel® MKL Docker image.

  2. Once you have downloaded the image, you can run it with:

    docker run -it --ulimit nofile=1000000:1000000 --name mkl <name of image>
    

    This launches a container from the image and drops you into a bash shell inside it. Starting spark-shell from that shell produces output similar to the following:

    root@fd5155b89857 /root # spark-shell
    spark-shell
    Config directory: /usr/share/defaults/spark/
    Welcome to
      ____              __
     / __/__  ___ _____/ /__
     _\ \/ _ \/ _ `/ __/  '_/
    /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
       /_/
    
    Using Scala version 2.12.7 (OpenJDK 64-Bit Server VM, Java 1.8.0-internal)
    Type in expressions to have them evaluated.
    Type :help for more information.
    
    scala>
    

    The --ulimit nofile parameter is currently required to raise the limit on open files, because the Spark engine opens a large number of files at certain points.
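
The two steps above can be sketched as a small script. It assumes the MKL image is published as clearlinux/stacks-dars-mkl; that name is an assumption, so confirm it on Docker Hub or in the DARS README. The run and start_dars helpers and the DRY_RUN variable are hypothetical names introduced for illustration. By default the script only prints the commands it would execute.

```shell
#!/bin/sh
# Hypothetical helper: pull a DARS image and start a container from it.
# ASSUMPTION: the default image name is illustrative; verify it on Docker Hub.
DARS_IMAGE="${DARS_IMAGE:-clearlinux/stacks-dars-mkl}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
    # Print the command; execute it only when DRY_RUN is 0.
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

start_dars() {
    # Pull first, then run with the raised open-file limit from step 2.
    run docker pull "$DARS_IMAGE" &&
    run docker run -it --ulimit nofile=1000000:1000000 --name mkl "$DARS_IMAGE"
}

start_dars
```

Set DRY_RUN=0 to execute the commands instead of printing them. Inside the running container, ulimit -n should report the limit passed with --ulimit.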

Building DARS images

If you choose to build your own DARS container images, you can customize them as needed. Use the provided Dockerfile as a baseline.

To construct images with Clear Linux OS, start with a Clear Linux OS development platform that has the containers-basic-dev bundle installed. Learn more about bundles and installing them by using swupd.
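
As a rough sketch, the bundle can be added with swupd bundle-add. The ensure_bundle helper is a hypothetical name, and the guard for hosts without swupd is an addition made here for illustration.

```shell
#!/bin/sh
# Bundle named in the paragraph above.
BUNDLE="containers-basic-dev"

ensure_bundle() {
    # swupd ships with Clear Linux OS; stop early on any other host.
    if ! command -v swupd >/dev/null 2>&1; then
        echo "swupd not found: this host does not appear to run Clear Linux OS" >&2
        return 1
    fi
    # bundle-add installs the bundle and needs root privileges.
    sudo swupd bundle-add "$BUNDLE"
}
```

Call ensure_bundle on the build host before running the steps below.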

  1. Clone the Data Analytics Reference Stack GitHub* repository.

    git clone -b master https://github.com/clearlinux/dockerfiles.git

  2. Inside the stacks/dars directory of the cloned repository, run make to build the OpenBLAS and MKL images.

    make
    

    Run make baseline to build the baseline CentOS image. Depending on the system, it may take a while to finish building.

    make baseline
    
  3. Once the build completes, list the resulting images with Docker:

    docker images | grep dars
    
  4. You can use any of the resulting images to launch fully functional containers. If you need to customize the containers, you can edit the provided Dockerfile.
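
The build steps above can be collected into one script, sketched below. It assumes the repository is cloned from https://github.com/clearlinux/dockerfiles.git and that the DARS files live under stacks/dars, as the URL in step 1 suggests. The run, list_dars_images, and build_dars helpers and the DRY_RUN variable are hypothetical names. By default the script only prints each step.

```shell
#!/bin/sh
# Hypothetical wrapper around the build steps; prints commands unless DRY_RUN=0.
REPO_URL="https://github.com/clearlinux/dockerfiles.git"
DARS_DIR="dockerfiles/stacks/dars"   # ASSUMPTION: path inferred from the URL in step 1
DRY_RUN="${DRY_RUN:-1}"              # 1 = only print the steps, 0 = execute them

run() {
    # Print the command; execute it only when DRY_RUN is 0.
    printf '+ %s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

list_dars_images() {
    # Same check as step 3: local images whose listing mentions "dars".
    docker images | grep dars
}

build_dars() {
    run git clone -b master "$REPO_URL" &&
    run cd "$DARS_DIR" &&
    run make &&
    run list_dars_images
}

build_dars
```

To also build the baseline CentOS image, add a run make baseline step after run make.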