SLURM dispatch does not receive jobs

Hi all, I am experimenting with a single-box Arvados deployment with Salt… Now I would like to configure crunch to dispatch jobs to two other dedicated Slurm nodes. I have Slurm up and running on the nodes, and the arvados VM (configured as the Slurm controller for now) can execute sinfo, srun and squeue and get the expected results.

However, crunch-dispatch-slurm never seems to pull any container requests from the API server. The arvados VM (set up using the single-host provision script) is running slurmctld, and my two separate compute nodes are running slurmd. I followed the "set up a node" guide and configured the nodes according to the Arvados "Set up a Slurm compute node" documentation.

crunch-dispatch-local does intercept jobs as expected, and I thought maybe that was keeping jobs from heading to Slurm, but I’ve since tried disabling it with no effect. I suspect that something set somewhere is telling container requests to always go to the local dispatcher. I also tried to understand how Salt configures the crunch dispatcher in general. I can see the parts where it is configured, but I don’t see how to adapt them to use Slurm. I have very little Salt experience, so that might be part of the problem.

As an aside, I’m having a little trouble understanding how the Slurm nodes are supposed to communicate with the API server. I have the

python-arvados-fuse crunch-run arvados-docker-cleaner

packages installed on the Slurm nodes but pretty much nothing else (besides Docker and hosts file entries for the arvados VM). Should I be copying the Arvados config.yml and installing/configuring other Arvados packages on those nodes as well? Or does crunch-dispatch-slurm only talk to slurmctld, so that the Slurm nodes do not need to be integrated into the Arvados cluster in any other way?

This is obviously just for experimentation. In prod I would do a full manual install with proper DNS and actual certificates from our CA.

Could you post some logs from crunch-dispatch-slurm? It should be calling sbatch to queue jobs.
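
If it helps, here is a rough sketch of one way to gather that output. It assumes the dispatcher runs as a systemd unit named crunch-dispatch-slurm (the unit name may differ on a Salt-provisioned box), and the function name collect_dispatch_logs is just made up for this example. It prints the recent dispatcher log followed by whatever Slurm currently has queued.

```shell
#!/bin/sh
# Sketch only: show what the dispatcher logged and what Slurm has queued.
# Assumption: the dispatcher is a systemd unit; pass another unit name as $1
# if yours is called something different.
collect_dispatch_logs() {
    unit="${1:-crunch-dispatch-slurm}"

    echo "== last 200 log lines from ${unit} =="
    # --no-pager keeps the output plain so it can be pasted into a post.
    journalctl --no-pager -n 200 -u "${unit}"

    echo "== current Slurm queue =="
    # If the dispatcher had called sbatch, its jobs should be listed here.
    squeue
}
```

Run it on the arvados VM as a user allowed to read the journal, e.g. `collect_dispatch_logs > dispatch-debug.txt`, and paste the result here. An empty log for the unit would itself suggest the service is not running under that name.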