Building a compact Pi cluster

March 2021 — first iteration
A Pi 4B with Cluster HAT, and 4 Pi Zero 2 Ws, in a perspex box. Blinkenlights optional.

Configuration

I used several resources to get this rebuilt, but since I was using the new Pis and wanted to update to the latest OS, there were a number of changes along the way. First of all, huge shout-outs to Chris Burton of Cluster HAT fame, the contributors to the Cluster HAT mailing list, Garrett Mills, and (mostly) to Davin L., who wrote the excellent “The Missing Cluster HAT tutorial” that got me up and running both this time around and the previous time.

Pre-work

  • I knew I was going to build a Cluster HAT setup in the “CNAT” configuration again. This means that only the controller is directly accessible / visible on my home network, and the 4 nodes are reachable only from the controller itself.
  • There are 5 OS images (1 controller and 4 nodes) downloadable from the Cluster HAT site, and they are mostly pre-configured. I chose the Bullseye 64-bit builds from the testing directory; recent mailing list conversation suggests that they are pretty solid (though I’m having second thoughts about this; see later).
  • I installed each image to a microSD card using Raspberry Pi Imager, but resisted the urge to change settings via its Cmd-X advanced options, since the images were already customised.
  • Before ejecting each card, I mounted it on my Mac and ran touch /Volumes/boot/ssh to enable SSH.
  • For the controller image, I also pre-configured /Volumes/boot/wpa_supplicant.conf so that it would pop up on the network when it was ready.
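The two boot-partition tweaks above can be scripted so that every freshly imaged card gets the same treatment. This is a minimal sketch, not part of the original setup: the function name prepare_boot, the default mount point /Volumes/boot, the GB country code and the SSID/passphrase arguments are assumptions to replace with your own values. Only the controller card needs the Wi-Fi file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prepare a freshly imaged Raspberry Pi OS boot partition.
# Usage: prepare_boot [BOOT_DIR] [SSID PASSPHRASE COUNTRY]
#   BOOT_DIR defaults to /Volumes/boot (where macOS mounts the card).
#   If an SSID is given, a wpa_supplicant.conf is written too (controller only).
prepare_boot() {
  local boot="${1:-/Volumes/boot}"
  local ssid="${2:-}" psk="${3:-}" country="${4:-GB}"

  [ -d "$boot" ] || { echo "no boot partition at $boot" >&2; return 1; }

  # An empty file named "ssh" tells Raspberry Pi OS to enable SSH on first boot.
  touch "$boot/ssh"

  if [ -n "$ssid" ]; then
    # On first boot the OS moves this file into /etc/wpa_supplicant/.
    cat > "$boot/wpa_supplicant.conf" <<EOF
ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
update_config=1
country=$country

network={
    ssid="$ssid"
    psk="$psk"
}
EOF
  fi
}
```

For example, run prepare_boot /Volumes/boot for each node card, and prepare_boot /Volumes/boot "MySSID" "MyPassphrase" GB for the controller card, before ejecting.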

Controller

  • boot the main Pi (note that I did this and started configuring it while I was still imaging the other node SD cards)
  • SSH in, and change the default pi user password
  • sudo apt install avahi-utils ntpdate file apt-file plocate zip mosquitto mosquitto-clients
  • Using raspi-config, change the hostname
  • Generate an SSH key pair (per the tutorial), then print the public key and copy it for use on the nodes:
    ssh-keygen -t rsa -b 4096
    cat ~/.ssh/id_rsa.pub
  • Enable the fan and reduce GPU memory by setting these options in /boot/config.txt:
    dtoverlay=gpio-fan,gpiopin=18,temp=65000
    gpu_mem=16
  • I chose to disable wifi power saving manually at boot, via /etc/rc.local:
    iw wlan0 set power_save off
  • sudo apt update && sudo apt upgrade -y
  • sudo apt-file update && sudo updatedb
  • sudo rpi-eeprom-update -a
  • reboot…
  • clusterctrl on
    (wait for nodes to start up / autosize their FS)
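Because config.txt and rc.local get edited by hand on both the controller and the nodes, a couple of small idempotent helpers make it safe to re-run the setup. This is a sketch under assumptions: add_line and add_to_rc_local are hypothetical names, and the rc.local helper assumes the stock file that ends with an exit 0 line (anything appended after that line would never run).

```shell
#!/usr/bin/env bash
# Hypothetical helpers for editing config.txt and rc.local safely.

# add_line FILE LINE: append LINE to FILE unless an identical line already
# exists, so re-running the setup does not duplicate settings.
add_line() {
  local file="$1" line="$2"
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# add_to_rc_local FILE CMD: insert CMD just before the final "exit 0" line.
# Commands placed after "exit 0" would never be executed at boot.
add_to_rc_local() {
  local file="$1" cmd="$2" tmp
  [ -f "$file" ] || { echo "missing: $file" >&2; return 1; }
  grep -qxF -- "$cmd" "$file" && return 0   # already present
  tmp="$(mktemp)"
  awk -v cmd="$cmd" '
    { lines[NR] = $0; if ($0 == "exit 0") last = NR }
    END {
      for (i = 1; i <= NR; i++) {
        if (i == last) print cmd
        print lines[i]
      }
      if (!last) print cmd   # no "exit 0" line found: just append
    }' "$file" > "$tmp" && cat "$tmp" > "$file"   # cat keeps rc.local executable
  rm -f "$tmp"
}

# Applying the controller settings (as root on the controller):
#   add_line /boot/config.txt 'dtoverlay=gpio-fan,gpiopin=18,temp=65000'
#   add_line /boot/config.txt 'gpu_mem=16'
#   add_to_rc_local /etc/rc.local 'iw wlan0 set power_save off'
```

The same helpers work for the node settings in the next section (the disable-bt / disable-wifi overlays and tvservice -o).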

Each node

  • mkdir -p ~/.ssh
  • echo [key] >> ~/.ssh/authorized_keys (where [key] is the controller’s public key copied earlier)
  • sudo apt install avahi-utils ntpdate file mosquitto-clients -y
  • Using raspi-config:
    - change the password
    - install the legacy GL driver
    - set the country for wifi (“just in case we use wifi in future”)
  • To save power and resources (the nodes are attached via USB gadget / RNDIS, so they don’t need Bluetooth or wifi), edit /boot/config.txt:
    dtoverlay=disable-bt
    dtoverlay=disable-wifi
    gpu_mem=16
  • Edit /etc/rc.local to disable HDMI at boot and save power (this requires the legacy GL driver):
    tvservice -o
  • sudo apt update && sudo apt upgrade -y && reboot
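The two key-copying steps at the top of this list can be wrapped up so that the permissions also end up right, since OpenSSH’s sshd (with its default StrictModes setting) rejects keys when ~/.ssh or authorized_keys is writable by group or others. A minimal sketch; authorize_key is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: authorise the controller's public key on a node.
# Usage: authorize_key "ssh-rsa AAAA... pi@controller"
authorize_key() {
  local key="$1" dir="$HOME/.ssh"
  [ -n "$key" ] || { echo "usage: authorize_key 'ssh-rsa ...'" >&2; return 1; }
  mkdir -p "$dir"
  chmod 700 "$dir"                      # sshd StrictModes: no group/other write
  touch "$dir/authorized_keys"
  chmod 600 "$dir/authorized_keys"
  # Append the key only if this exact line is not already present.
  grep -qxF -- "$key" "$dir/authorized_keys" ||
    printf '%s\n' "$key" >> "$dir/authorized_keys"
}
```

Alternatively, if password login to the nodes works, OpenSSH’s ssh-copy-id does the same job from the controller, e.g. ssh-copy-id pi@p1 (assuming the first node is reachable as p1).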

Back to the controller

  • edit /etc/hosts to add entries for the nodes
  • set up storage and NFS per the tutorial. Initially I’m using a portable USB drive, which is an additional power drain / concern. I’ve seen a voltage warning in the log once, so I’m keeping an eye on it and may move NFS storage to another machine on the network, but that would mean rethinking the use of CNAT over CBRIDGE.
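For reference, the /etc/hosts and NFS pieces look roughly like the following. This sketch only prints fragments rather than editing files, and it rests on assumptions to verify on your own cluster: the CNAT image is assumed to put the controller at 172.19.181.254 and nodes p1 to p4 at 172.19.181.1 to 172.19.181.4 (check with ip addr), and /media/Storage is a placeholder for wherever the USB drive is mounted.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: print /etc/hosts, /etc/exports and /etc/fstab fragments.
# ASSUMPTIONS: nodes p1..p4 at $SUBNET.1-4, controller at $SUBNET.254,
# shared directory at $SHARE. Override via environment variables.
SUBNET="${SUBNET:-172.19.181}"
SHARE="${SHARE:-/media/Storage}"

hosts_fragment() {
  local n
  for n in 1 2 3 4; do
    printf '%s.%s\tp%s\n' "$SUBNET" "$n" "$n"
  done
}

exports_fragment() {
  # Export to the internal subnet only; no_root_squash lets root on the
  # nodes write as root (needed if, e.g., Slurm jobs run as root).
  printf '%s %s.0/24(rw,sync,no_root_squash,no_subtree_check)\n' "$SHARE" "$SUBNET"
}

fstab_fragment() {
  # Line for /etc/fstab on each node, mounting the controller's share.
  printf '%s.254:%s %s nfs defaults 0 0\n' "$SUBNET" "$SHARE" "$SHARE"
}
```

Review the output, then append it with something like hosts_fragment | sudo tee -a /etc/hosts and exports_fragment | sudo tee -a /etc/exports, followed by sudo exportfs -ra; fstab_fragment gives the matching mount line for each node.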

Munge setup

  • following the tutorial, munge didn’t start at first, complaining about permissions issues. I needed to run chown root / /etc /usr to get things to where munge was happy (they were owned by the pi user)
  • make sure munge.key is chmod 400 on each node. I couldn’t get the munge/unmunge confirmation step to work until I worked this out, then rebuilt, copied, and set permissions for the keyfile.
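Since the key permissions were the sticking point, here is a small check that could be run on each node before trying the munge/unmunge confirmation. A sketch only: key_mode_ok is a hypothetical helper, the default path /etc/munge/munge.key is the usual Debian location (assumed for these images), and the expected mode of 400 comes from the experience above.

```shell
#!/usr/bin/env bash
# Hypothetical check: is the munge key locked down (mode 400)?
# To fix on a node:
#   sudo chown munge: /etc/munge/munge.key && sudo chmod 400 /etc/munge/munge.key
# Then confirm from the controller (p1 is an assumed node name):
#   munge -n | ssh p1 unmunge
key_mode_ok() {
  local key="${1:-/etc/munge/munge.key}" mode
  [ -f "$key" ] || { echo "missing: $key" >&2; return 1; }
  mode="$(stat -c %a "$key")"   # GNU stat, as on Raspberry Pi OS
  if [ "$mode" != 400 ]; then
    echo "$key has mode $mode; expected 400" >&2
    return 1
  fi
}
```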

Slurm setup

  • per the tutorial, except that on Bullseye the config files are in /etc/slurm (not /etc/slurm-llnl)
  • take care when copying and pasting from the tutorial: I needed to fix smart quotes, and the range notation for nodes [-] (this gave me pain)
  • I set CPUs=4 for each node in the config file, since the Zero 2 W is quad-core, and also allocated 2 cores from the controller.
  • Note: the srun --ntasks hostname command errors out, because --ntasks requires a value; I’m not sure whether this is a change in Slurm behaviour between versions.
  • To run upgrades across the cluster (this is neat; previously I’d done it all by hand):
    sudo su -
    srun --nodes=5 apt upgrade -y
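The node definitions described above might look something like the following in /etc/slurm/slurm.conf, together with a check for the smart-quote and dash problems that copy/paste introduces. This is a sketch: the controller hostname cnat and the partition name mycluster are placeholders, and the node names p1 to p4 assume the Cluster HAT defaults.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the Slurm config described above.

# Print slurm.conf node/partition lines: four Zero 2 W nodes with 4 CPUs each,
# plus 2 CPUs on the controller. CONTROLLER is a placeholder hostname; use
# whatever you set with raspi-config.
slurm_nodes() {
  local controller="${CONTROLLER:-cnat}"
  cat <<EOF
NodeName=$controller CPUs=2 State=UNKNOWN
NodeName=p[1-4] CPUs=4 State=UNKNOWN
PartitionName=mycluster Nodes=$controller,p[1-4] Default=YES MaxTime=INFINITE State=UP
EOF
}

# Flag non-ASCII or non-printable characters (smart quotes, en dashes in node
# ranges, ...) that sneak in when copying from a web page.
# Prints offending lines with line numbers; returns 1 if any are found.
check_ascii() {
  local file="${1:-/etc/slurm/slurm.conf}"
  [ -r "$file" ] || { echo "cannot read $file" >&2; return 2; }
  if LC_ALL=C grep -n '[^[:print:][:space:]]' "$file"; then
    echo "non-ASCII or non-printable characters found in $file" >&2
    return 1
  fi
}
```

With the nodes defined and slurmd running everywhere, srun --nodes=5 hostname should print five hostnames, one per node (in no particular order).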
Close-ups of the innards

Bonus blinkenlichten!

The for-fun “bonus” is a blink(1) mk 3 attached via USB to the controller, managed by the blink1-tiny-server software. This is set up to run as a service when the controller boots.
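Running blink1-tiny-server as a service at boot can be done with a systemd unit along these lines. This is a sketch under assumptions: the binary path /usr/local/bin/blink1-tiny-server, the unit name, and running it with default options are guesses to adjust for your install, and write_blink1_unit is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a systemd unit that starts blink1-tiny-server at boot.
# ASSUMPTIONS: binary at /usr/local/bin/blink1-tiny-server (override with
# BLINK1_BIN) and default server options; add flags to ExecStart as needed.
write_blink1_unit() {
  local dir="${1:-/etc/systemd/system}"
  local bin="${BLINK1_BIN:-/usr/local/bin/blink1-tiny-server}"
  cat > "$dir/blink1-tiny-server.service" <<EOF
[Unit]
Description=blink(1) tiny HTTP server
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=$bin
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
}

# On the controller (as root):
#   write_blink1_unit
#   systemctl daemon-reload
#   systemctl enable --now blink1-tiny-server.service
```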

Node-RED flow to set Blink(1) colour from Cheerlights, over the internal network
Look ma! It’s a cluster in a box!

So… what have we learned?

  • The Cluster HAT is a really nice piece of kit (it worked the first time I set it up, and since the new Pis match the form factor, it still works; I’m happy with that)
  • I learned about setting up Munge and Slurm…
  • There was some debugging / trial-and-error in translating from the previous set of instructions to what works now, which is a great opportunity to learn and explore.
  • Somehow, I’ve only just learned about the “Broadcast input” option in iTerm2… super cool for doing the same thing on multiple nodes at once, during the setup stage.

What’s next?

This is primarily an educational project for myself, and there are some potential things to look into already:

  • running something like MicroK8s or K3s (although the memory limitations may remain an issue)
  • the 64-bit performance of Bullseye on the Pi Zero 2 W may not be ideal, and I may need to rebuild the nodes with 32-bit
  • replacing the storage with something that is less of a power suck

Andy Piper

Work: @Twitter; Interests: Open Source | Community | LEGO | MQTT | IoT; Views: entirely my own