Database Reference Stack¶
This guide describes the hardware and installation requirements for using the DBRS, along with getting started configuration examples, using Clear Linux* OS as the host system.
Overview¶
The Database Reference Stack is integrated, highly-performant, open source, and optimized for 2nd generation Intel® Xeon® Scalable Processors and Intel® Optane™ DC Persistent Memory. This open source community release is part of an effort to ensure developers have easy access to the features and functionality of Intel Platforms.
Stack Features¶
Current supported database applications are Apache Cassandra* and Redis*, which have been enabled for Intel Optane DC PMM.
DBRS with Apache Cassandra can be deployed as a standalone container or inside a Kubernetes* cluster.
The Redis stack application is enabled for a multinode Kubernetes environment, using AEP persistent memory DIMM in fsdax mode for storage.
The release announcement provides more detail about the stack features, as well as benchmark results.
Note
The Database Reference Stack is a collective work, and each piece of software within the work has its own license. Please see the DBRS Terms of Use for more details about licensing and usage of the Database Reference Stack.
Hardware Requirements¶
- Intel Xeon Scalable Platform with Intel C620 chipset series
- 2nd Gen Intel Xeon Scalable processor (Intel Optane DC PMM-enabled stepping), which provides cache and memory control. Intel Optane DC persistent memory works only on systems powered by 2nd Generation Intel® Xeon® Platinum or Gold processors.
- BIOS with Reference Code
- Intel Optane DC persistent memory
Hardware configuration used in stacks development¶
- Intel® Server System R2208WFTZSR
- BIOS with Reference Code
  - BIOS ID: SE5C620.86B.0D.01.0438.032620191658
  - BMC Firmware: 1.94.6b42b91d
  - Intel® Optane™ DC Persistent Memory Firmware: 1.2.0.5310
- 2x Intel Xeon Platinum 8268 Processor
- Intel SSD DC S5600 Series 960GB 2.5in SATA Drive
- 64 GB RAM - Distributed in 4x 16 GB DDR4 DIMM’s
- 2x Intel Optane DC Persistent Memory 256GB Module
- 1-1-1 layout, 8:1 Optane:DRAM capacity ratio (512 GB DCPMM to 64 GB DRAM)
Channel 2 | Channel 2 | Channel 1 | Channel 1 | Channel 0 | Channel 0 |
---|---|---|---|---|---|
Slot 1 | Slot 0 | Slot 1 | Slot 0 | Slot 1 | Slot 0 |
256 GB DCPMM | 16 GB DRAM | 16 GB DRAM | | | |
Firmware configuration¶
Important
When updating DCPMM Firmware, all DCPMM parts must be in the same mode (you cannot mix 1LM and 2LM parts).
The latest firmware download for the Intel® Server System S2600WF Family is available at the Intel Download Center.
Firmware Update Steps¶
- Unzip the contents of the update package and copy all files to the root directory of a removable media (USB flash drive).
- Insert the USB flash drive to any available USB port on the system to be updated.
- Boot to EFI shell.
- Enter “fsX:” (where X is 0, 1, …) to switch to the USB flash drive.
- Run “startup.nsh”
- After the BMC firmware, system BIOS, ME firmware, FD, and FRUSDR are updated, the system will reboot automatically.
If Intel Optane DC Persistent Memory is installed, run startup.nsh a second time after the first reboot to upgrade Intel Optane DC Persistent Memory Firmware:
- Boot to EFI shell.
- Enter “fsX:” (where X is 0, 1, …) to switch to the USB flash drive.
- Run “startup.nsh” again to update the corresponding AEP FW.
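For reference, a typical EFI shell session for these steps might look like the following; the fs0 index is an assumption, so use whichever fsX maps to your USB drive:
Shell> fs0:
FS0:\> startup.nsh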
Hardware Configuration¶
Online Resources¶
Before going through the configuration steps, we strongly recommend visiting the following resources and wikis to get a broader understanding of what is being done:
- Quick Start Guide Configure Intel Optane DC Persistent Memory Modules on Linux
- Managing NVDIMMs
- Configure, Manage, and Profile Intel Optane DC Persistent Memory Modules
Optane DIMM Configuration¶
The persistent memory DIMMs can be configured in devdax or fsdax mode. The use case of enabling the database stack in a Kubernetes environment currently supports only fsdax mode.
Configuration Steps¶
Important
Run the following steps with root privileges (sudo), as shown in the examples.
To configure the Optane DIMMs for App Direct mode, run this command:
sudo ipmctl create -goal PersistentMemoryType=AppDirect
Verify the Optane configuration by showing the defined regions, then reboot the system for your changes to take effect:
sudo ipmctl show -region
Next, list the defined namespaces for the pmem devices in the system. If they are not defined, create them as shown in the following step.
sudo ndctl list -N
Create namespaces based on the regions and set the mode to fsdax. Use the names of the regions listed in the previous step as the --region parameter (the defaults are region0 and region1, one for each CPU socket):
sudo ndctl create-namespace --region=region0 --mode=fsdax
sudo ndctl create-namespace --region=region1 --mode=fsdax
Create the filesystem and mount it. We are using /mnt/dax{#} as a convention in this guide to mount our devices
sudo mkfs.ext4 /dev/pmem0
sudo mount -o dax /dev/pmem0 /mnt/dax0
sudo mkfs.ext4 /dev/pmem1
sudo mount -o dax /dev/pmem1 /mnt/dax1
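If the mount points do not exist yet, create them first. Optionally, you can also add fstab entries so the mounts persist across reboots; this is a sketch that assumes the device names and mount points used above:
# Create the mount points and make the dax mounts persistent (optional)
sudo mkdir -p /mnt/dax0 /mnt/dax1
echo '/dev/pmem0 /mnt/dax0 ext4 dax 0 0' | sudo tee -a /etc/fstab
echo '/dev/pmem1 /mnt/dax1 ext4 dax 0 0' | sudo tee -a /etc/fstab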
Running DBRS with Apache Cassandra*¶
DBRS with Apache Cassandra can be deployed as a standalone container or inside Kubernetes*. Instructions for both cases are included here. Note that you can use the released Docker image with Apache Cassandra (Docker* examples below); these instructions provide a baseline for creating your own container image. If you are using the released image, skip this section.
Important
At the initial release of DBRS, Apache Cassandra is considered to be Engineering Preview release quality and may not be suitable for production release. Please take this into consideration when planning your project.
Build the DBRS with Apache Cassandra container¶
To build the container with Apache Cassandra, you must build cassandra-pmem, and then build the container using the docker build command. We are using Clear Linux OS as our container host as well as the OS in the container.
Build cassandra-pmem¶
Important
At the initial release of DBRS, the pmem-csi driver is considered to be Engineering Preview release quality and may not be suitable for production release. Please take this into consideration when planning your project.
In the DBRS github repository, there is a file called build-cassandra-pmem.sh, which handles all the requirements for compiling cassandra-pmem for Dockerfile usage. The dependencies for this build can be installed with swupd.
sudo swupd bundle-add c-basic java-basic devpkg-pmdk pmdk
Once installed, we run the script
./build-cassandra-pmem.sh
At the completion of the build you will have a file called cassandra-pmem-build.tar.gz. Place this file in the same directory as the Dockerfile to build the Docker image.
Build the Docker container¶
To build the Docker image, run the docker build command from the directory containing the Dockerfile and cassandra-pmem-build.tar.gz:
docker build --force-rm --no-cache -f Dockerfile -t $build_image_name .
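Here, $build_image_name is the tag you want for the resulting image; for example (the tag itself is illustrative):
build_image_name=cassandra-pmem:latest
docker build --force-rm --no-cache -f Dockerfile -t $build_image_name .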
Once it completes, the Docker image is ready to be used.
Deploy Apache Cassandra PMEM as a standalone container¶
Requirements¶
To deploy Apache Cassandra PMEM, you must meet the following requirements
- PMEM memory must be configured in devdax or fsdax mode. The container image can handle both modes, but the mount points inside the container differ depending on the PMEM mode.
- To make devdax pmem devices available inside the container, you must use the --device directive. Internally the container always uses /dev/dax0.0, so the mapping should be: --device=/dev/<host-device>:/dev/dax0.0
- Similarly, for fsdax the device must be mapped to /mnt/pmem inside the container: --mount type=bind,source=<source-mount-point>,target=/mnt/pmem
Preparing PMEM for container use¶
The cassandra-pmem image is capable of using both fsdax and devdax; the necessary steps to configure the PMEM to work with Cassandra are documented here.
The device we want to use must be in devdax mode. If it is not already, you can (re)create the namespace in devdax mode; note that this destroys any data on the namespace:
sudo ndctl create-namespace -fe namespace0.0 --mode=devdax
{
"dev":"namespace0.0",
"mode":"devdax",
"map":"dev",
"size":"3.94 GiB (4.23 GB)",
"uuid":"cb738cc7-711d-4578-bebf-1f7ba02ca169",
"daxregion":{
"id":0,
"size":"3.94 GiB (4.23 GB)",
"align":2097152,
"devices":[
{
"chardev":"dax0.0",
"size":"3.94 GiB (4.23 GB)"
}
]
},
"align":2097152
}
You can use ndctl create-namespace -fe <namespace-name> --mode=devdax in the same way to reconfigure any other namespace to devdax mode.
Before using a devdax device we need to clear the device:
sudo pmempool rm -vaf /dev/dax0.0
The jvm.options configuration for Apache Cassandra should look like the following:
-Dpmem_path=/dev/dax0.0
-Dpool_size=0
Where:
- pmem_path is the devdax device.
- pool_size=0 indicates to use the entire devdax device.
When using the Docker image with Apache Cassandra, the file jvm.options is automatically populated.
For fsdax mode, verify that the PMEM namespace is in fsdax mode:
sudo ndctl list -u
{
"dev":"namespace0.0",
"mode":"fsdax",
"map":"mem",
"size":"4.00 GiB (4.29 GB)",
"sector_size":512,
"blockdev":"pmem0"
}
If for some reason the device is not in fsdax mode you can reconfigure the namespace as follows:
sudo ndctl create-namespace -fe <namespace-name> --mode=fsdax
Once the PMEM namespace is configured, you will see a device named /dev/pmem[0-9]. We will create a filesystem on that device. The filesystem can be ext4 or xfs; for this example we are going to use ext4.
sudo mkfs.ext4 /dev/pmem0
mke2fs 1.45.2 (27-May-2019)
Creating filesystem with 1031680 4k blocks and 258048 inodes
Filesystem UUID: 303c03f5-ac4e-4462-8bf9-bc6b0fae53fe
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736
Allocating group tables: done
Writing inode tables: done
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done
Once the filesystem is created, we mount it with the dax option
sudo mount /dev/pmem0 /mnt/pmem -o dax
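To confirm that the dax option took effect, you can check the active mount options; this is just a quick sanity check and the exact flags may vary by kernel version:
mount | grep /mnt/pmem
# expected output similar to: /dev/pmem0 on /mnt/pmem type ext4 (rw,relatime,dax)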
When using fsdax mode cassandra-pmem creates a pool file on the pmem mountpoint, so the jvm.options configuration should look like the output below:
-Dpmem_path=/mnt/pmem/cassandra_pool
-Dpool_size=3221225472
Where:
- pmem_path is the path to the pool file, which should include the path itself and the file name.
- pool_size is the size of the pool file in bytes. If you are using the Docker image with Apache Cassandra, you can pass this value as an environment variable to the container runtime in GB and the calculation is done automatically.
It is important to note that when the filesystem is created on the pmem device, a certain amount of space is used by filesystem metadata, so the pool_size should be smaller than the total pmem namespace size.
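As a quick sanity check of the pool_size shown above, 3 GiB expressed in bytes is 3 * 1024 * 1024 * 1024 = 3221225472; a shell one-liner to compute such values:
echo $((3 * 1024 * 1024 * 1024))
# 3221225472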
When using the Docker image with Apache Cassandra, the file jvm.options is automatically populated with the environment variables CASSANDRA_PMEM_POOL_NAME and CASSANDRA_FSDAX_POOL_SIZE_GB.
Run the DBRS Container¶
Replace <image-id> in the following commands with the name of the image you are using.
In devdax mode:
docker run --device=/<devdax-device>:/dev/dax0.0 --ulimit nofile=262144:262144 -p 9042:9042 -p 7000:7000 -it --name cassandra-test <image-id>
In fsdax mode:
docker run --mount type=bind,source=/<fsdax-mountpoint>,target=/mnt/pmem --ulimit nofile=262144:262144 -p 9042:9042 -p 7000:7000 -it -e 'CASSANDRA_FSDAX_POOL_SIZE_GB=<fsdax-pool-size-in-gb>' --name cassandra-test <image-id>
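For example, with the fsdax filesystem mounted at /mnt/pmem as prepared above, a 3 GB pool, and an image tagged cassandra-pmem:latest (the tag and pool size are illustrative):
docker run --mount type=bind,source=/mnt/pmem,target=/mnt/pmem --ulimit nofile=262144:262144 -p 9042:9042 -p 7000:7000 -it -e 'CASSANDRA_FSDAX_POOL_SIZE_GB=3' --name cassandra-test cassandra-pmem:latest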
Container Configuration¶
Using environment variables¶
The container listens on the primary container IP address, but if required, some parameters can be provided as environment variables using --env; an example combining several of them follows the list below.
- CASSANDRA_CLUSTER_NAME: The Cassandra cluster name; the default is Cassandra Cluster.
- CASSANDRA_LISTEN_ADDRESS: The Cassandra listen address.
- CASSANDRA_RPC_ADDRESS: The Cassandra RPC address.
- CASSANDRA_SEED_ADDRESSES: A comma-separated list of hosts in the cluster; if not provided, Cassandra runs as a single node.
- CASSANDRA_SNITCH: The snitch type for the cluster; the default is SimpleSnitch. For more complex snitches you can mount your own cassandra-rackdc.properties file.
- LOCAL_JMX: If set to no, the JMX service listens on all IP addresses; the default is yes, which listens only on localhost (127.0.0.1).
- JVM_OPTS: Passes additional arguments to the JVM for the Cassandra execution, for example to specify memory heap sizes: JVM_OPTS=-Xms16G -Xmx16G -Xmn12G
When using PMEM in fsdax mode, there are some parameters to control the allocation of memory:
- CASSANDRA_FSDAX_POOL_SIZE_GB: The size of the fsdax pool in GB; if it is not specified, the pool size defaults to 1 GB.
- CASSANDRA_PMEM_POOL_NAME: The filename of the pool created in PMEM; the default is cassandra_pool.
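Here is such a sketch; the values are illustrative and <image-id> is the image you built or pulled:
docker run --mount type=bind,source=/mnt/pmem,target=/mnt/pmem --ulimit nofile=262144:262144 -p 9042:9042 -p 7000:7000 -it -e 'CASSANDRA_CLUSTER_NAME=Test Cluster' -e 'CASSANDRA_FSDAX_POOL_SIZE_GB=2' -e 'JVM_OPTS=-Xms4G -Xmx4G' --name cassandra-env-test <image-id>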
Using custom files¶
For more complex deployments it is also possible to provide custom cassandra.yaml and jvm.options files as shown below:
docker run --mount type=bind,source=/<fsdax-mountpoint>,target=/mnt/pmem -it --ulimit nofile=262144:262144 --mount type=bind,source=/<path-to-file>/cassandra.yaml,target=/workspace/cassandra/conf/cassandra.yaml --mount type=bind,source=/<path-to-file>/jvm.options,target=/workspace/cassandra/conf/jvm.options --name cassandra-custom-files <image-id>
Clustering¶
For a simple two node cluster using PMEM in fsdax mode on both containers:
Node 1¶
- IP: 172.17.0.2
- PMEM mountpoint: /mnt/pmem1
docker run --mount type=bind,source=/mnt/pmem1,target=/mnt/pmem --ulimit nofile=262144:262144 -it -e 'CASSANDRA_FSDAX_POOL_SIZE_GB=2' -e 'CASSANDRA_SEED_ADDRESSES=172.17.0.2:7000,172.17.0.3:7000' --name cassandra-node1 <image-id>
Node 2¶
- IP: 172.17.0.3
- PMEM mountpoint: /mnt/pmem2
docker run --mount type=bind,source=/mnt/pmem2,target=/mnt/pmem --ulimit nofile=262144:262144 -it -e 'CASSANDRA_FSDAX_POOL_SIZE_GB=2' -e 'CASSANDRA_SEED_ADDRESSES=172.17.0.2:7000,172.17.0.3:7000' --name cassandra-node2 <image-id>
Once both nodes are running, the gossip eventually settles and we can use nodetool on either container to check the cluster status.
docker exec -it <container-id> bash /workspace/cassandra/bin/nodetool status
The output should look similar to this:
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 172.17.0.3 0 bytes 256 100.0% 22387159-8192-41cf-8b6c-8bf0e1049eb7 rack1
UN 172.17.0.2 0 bytes 256 100.0% 219b56ba-c07c-400b-a018-a5dc20edeb09 rack1
Persistence¶
By default you can access the data written to Apache Cassandra only as long as the container exists. In order to persist the data past that, you can mount volumes or bind mounts on /workspace/cassandra/data and /workspace/cassandra/logs; in this way the data can still be accessed once the container is deleted.
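For example, here is a sketch using named Docker volumes for the data and logs directories; the volume names and pool size are illustrative:
docker run --mount type=bind,source=/mnt/pmem,target=/mnt/pmem -v cassandra-data:/workspace/cassandra/data -v cassandra-logs:/workspace/cassandra/logs --ulimit nofile=262144:262144 -p 9042:9042 -p 7000:7000 -it -e 'CASSANDRA_FSDAX_POOL_SIZE_GB=2' --name cassandra-persistent <image-id>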
Deploy An Apache Cassandra-PMEM cluster on Kubernetes*¶
Many containerized workloads are deployed in clusters and orchestration software like Kubernetes can be useful. We will use the cassandra-pmem-helm Helm* chart in this example.
Requirements¶
- Kubectl* must be configured to access the Kubernetes Cluster
- A Kubernetes cluster with pmem-csi enabled
- The Kubernetes cluster must have helm and tiller installed
- PMEM hardware
Important
When selecting the fsdax pool file size, it is important to consider that when a volume is requested, a certain amount of space on that volume is used by filesystem metadata, so the available space is less than the total amount specified. Taking this into consideration, the size of the fsdax pool file should be ~2G less than the total volume size requested.
Configuration¶
In order to configure the Apache Cassandra PMEM cluster, some variables and values are provided. These values are set in test/cassandra-pmem-helm/values.yaml, and can be modified according to your specific needs. A summary of those parameters is shown below:
- clusterName: The cluster Name set across all deployed nodes
- replicaCount: The number of nodes in the cluster to be deployed
- image.repository: The address of the container registry where the cassandra-pmem image should be pulled
- image.tag: The tag of the image to be pulled during deployment
- image.name: The name of the image to be pulled during deployment
- pmem.containerPmemAllocation: The size of the persistent volume claim to be used as heap; it uses the storage class pmem-csi-sc-ext4 from pmem-csi
- pmem.fsdaxPoolSizeInGB: The size of the fsdax pool to be created inside the persistent volume claim, in practice it should be 1G less than pmem.containerPmemAllocation
- enablePersistence: If set to true, K8s persistent volumes are deployed to store data and logs
- persistentVolumes.logsVolumeSize: The size of the persistent volume used for storing logs on each node, the default is 4G
- persistentVolumes.dataVolumeSize: The size of the persistent volume used for storing data on each node, the default is 4G
- persistentVolumes.logsStorageClass: Storage class used by the logs pvc, by default it uses pmem-csi-sc-ext4
- persistentVolumes.dataStorageClass: Storage class used by the data pvc, by default it uses pmem-csi-sc-ext4
- provideCustomConfig: If set to true, it mounts all the files located on <helm-chart-dir>/files/conf on /workspace/cassandra/conf inside each container in order to provide a way to customize the deployment beyond the options provided here
- exposeJmxPort: When set to true it exposes the JMX port as part of the Kubernetes headless service. It should be used together with enableAdditionalFilesConfigMap in order to provide authentication files needed for JMX when the remote connections are allowed. When set to false only local access through 127.0.0.1 is granted and no additional authentication is needed.
- enableClientToolsPod: If set to true, an additional pod independent from the cluster is deployed, this pod contains various Cassandra client tools and mounts test profiles located under <helm-chart-dir>/files/testProfiles to /testProfiles inside the pod. This pod is useful to test and launch benchmarks
- enableAdditionalFilesConfigMap: When set to true, it takes the files located in <helm-chart-dir>/files/additionalFiles and mounts them in /etc/cassandra inside the pods; some additional files for Cassandra, such as JMX auth files, can be stored here
- jvmOpts.enabled: If set to true the environment variable JVM_OPTS is overridden with the value provided on jvmOpts.value
- jvmOpts.value: Sets the value of the environment variable JVM_OPTS, in this way some java runtime configurations can be provided such as RAM heap usage
- resources.enabled: if set to true, the resource constraints are set on each pod using the values under resources.requests and resources.limits
- resources.requests.memory: Initial memory allocation for each pod in the cluster
- resources.requests.cpu: Initial CPU allocation for each pod in the cluster
- resources.limits.memory: Limits for memory allocation for each pod in the cluster
- resources.limits.cpu: Limits for cpu allocation for each pod in the cluster
Installation¶
Once all the configurations are set, to install the chart inside a given Kubernetes cluster you must run:
helm install ./cassandra-pmem-helm
Eventually, all of the requested nodes will be shown as running by kubectl get pods.
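For example, with Helm 2 (which the tiller requirement implies), you can name the release and override values at install time, then watch the pods come up; the release name and overrides are illustrative:
helm install ./cassandra-pmem-helm --name cassandra-pmem --set replicaCount=3
kubectl get pods -w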
Running DBRS with Redis¶
The Redis stack application is enabled for a multinode Kubernetes environment using Intel Optane DCPMM persistent memory DIMMs in fsdax mode for storage.
The source code used for this application can be found in the GitHub repository.
The following examples use the Docker image with Redis. You can also build your own image with Docker by using the Dockerfile and running this command:
docker build --force-rm --no-cache -f Dockerfile -t ${DOCKER_IMAGE} .
Single node¶
Prior to starting the container, you will need to have the Intel Optane DCPMM module configured in fsdax mode with a file system created and mounted at /mnt/dax0, as shown above.
Use the following to start the container, replacing ${DOCKER_IMAGE} with the name of the image you are using.
docker run --mount type=bind,source=/mnt/dax0,target=/mnt/pmem0 -i -d --name pmem-redis ${DOCKER_IMAGE} --nvm-maxcapacity 200 --nvm-dir /mnt/pmem0 --nvm-threshold 64 --protected-mode no
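To verify the container started correctly, check its logs; if the image bundles redis-cli (an assumption about the image contents), you can also ping the server:
docker logs pmem-redis
docker exec -it pmem-redis redis-cli ping
# expected reply: PONG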
Redis Operator in a Kubernetes cluster¶
After setting up Kubernetes* in Clear Linux OS, you will need to enable it to support DCPMM using the pmem-csi driver. To install the driver, follow the instructions in the pmem-csi repository.
We are using source code from the Redis operator.
Note
If you already have a redis-operator, you will need to delete it before installing a new one.
After installing the operator, you are ready to deploy redisfailover instances using a YAML file, like this example for persistent memory. You can download it and change the source of the image to reflect your environment. We have named our YAML file redis-failover.yml.
To start a redisfailover instance in Kubernetes run the following
kubectl create -f redis-failover.yml
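To watch the deployment come up, you can list the custom resource and its pods; the redisfailovers resource name assumes the CRD registered by the spotahome redis-operator:
kubectl get redisfailovers
kubectl get pods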
Important
There is a known issue in which the sentinels do not have enough memory to create the InitContainer. The current workaround is to build the image, increasing the limit for the InitContainer memory to 32 MB.