SLURM dispatch does not receive jobs

Andrei_Wasylyk · 9 June 2021 12:58

Hi all, I am experimenting with a single box arvados deployment with salt… Now I would like to configure crunch to dispatch jobs over to 2 other dedicated Slurm nodes. I have slurm up and running on the nodes and I have arvados (configured as the slurm controller for now) able to execute sinfo, srun, squeue and return expected results.

However, crunch-dispatch-slurm never seems to pull any container requests from the apiserver. The arvados VM (set up using the single-host provision script) is running as the slurmctld and my two separate compute nodes are running slurmd. I followed the set up a node guide and configured the nodes according to the "Set up a Slurm compute node" page of the Arvados documentation.

crunch-dispatch-local does intercept jobs as expected and I thought maybe that was keeping jobs from heading to slurm but I’ve since tried disabling that with no effect. I suspect that there is something set somewhere that is telling container requests to always go to the local dispatch. I also tried to understand how salt is configuring the crunch dispatcher in general and I see the parts where it is configured, but don’t see how to adapt it to use slurm. I have very little salt experience so that might be part of the problem.

As an aside, I’m having a little trouble understanding how the slurm nodes are supposed to communicate with the api server. I have the

python-arvados-fuse crunch-run arvados-docker-cleaner

packages installed on the slurm nodes but pretty much nothing else (besides docker and host file entries for the arvados VM). Should I be copying the arvados config.yml and installing/configuring other arvados packages on those nodes as well? Or is crunch-dispatch-slurm talking to slurmctld and beyond that the slurm nodes do not need to be integrated in any way to the arvados cluster?

This is obviously just for experimentation, in prod I would do a full manual install with proper DNS and actual certificates from our CA.

tetron · 11 June 2021 19:27

Could you post some logs from crunch-dispatch-slurm? It should be calling sbatch to queue jobs.
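
A minimal sketch of one way to collect those logs in full, assuming the dispatcher runs as a systemd unit named crunch-dispatch-slurm and logs to the journal. The helper show_sbatch_lines is hypothetical, and whether the dispatcher's log lines actually contain the string "sbatch" is an assumption; drop the filter to see everything.

```shell
#!/bin/sh
# Assumption: the dispatcher is the systemd unit "crunch-dispatch-slurm".
UNIT=crunch-dispatch-slurm

# Hypothetical helper: reads log lines on stdin and keeps only those that
# mention sbatch; prints a short notice when none are found.
show_sbatch_lines() {
    grep -i 'sbatch' || echo "no sbatch lines found"
}

# --no-pager writes lines at full length instead of chopping them at the
# terminal width, so JSON log entries stay intact when pasted.
if command -v journalctl >/dev/null 2>&1; then
    journalctl -u "$UNIT" --no-pager --since "1 hour ago" | show_sbatch_lines
fi
```

Running journalctl with --no-pager and without the filter gives the complete, unwrapped log, which is the easiest form to paste into a reply.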

Andrei_Wasylyk · 2 August 2021 14:20

Hi Tetron,
Sorry, I was on vacation for a few weeks. I’ll show you what I have thus far:

root@arvdv:~# arv container_request create --container-request '{
  "name":            "test",
  "state":           "Committed",
  "priority":        1,
  "container_image": "arvados/jobs:2.1.2",
  "command":         ["echo", "Hello, Crunch!"],
  "output_path":     "/out",
  "mounts": {
    "/out": {
      "kind":        "tmp",
      "capacity":    1000
    }
  },
  "runtime_constraints": {
    "vcpus": 1,
    "ram": 8388608
  }
}'

root@arvdv:~# systemctl status crunch-dispatch-local.service
● crunch-dispatch-local.service - Arvados Crunch Dispatcher for LOCAL service
   Loaded: loaded (/etc/systemd/system/crunch-dispatch-local.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2021-08-02 10:08:43 EDT; 1min 18s ago
     Docs: https://doc.arvados.org/
 Main PID: 146291 (crunch-dispatch)
    Tasks: 7 (limit: 9431)
   CGroup: /system.slice/crunch-dispatch-local.service
           └─146291 /usr/bin/crunch-dispatch-local -poll-interval=1 -crunch-run-command=/usr/local/bin/crunch-run.sh

Aug 02 10:08:43 arvdv.i.fungalgenomics.ca systemd[1]: Started Arvados Crunch Dispatcher for LOCAL service.
Aug 02 10:08:43 arvdv.i.fungalgenomics.ca crunch-dispatch-local[146291]: {"level":"info","msg":"crunch-dispatch-local 2.2.0 started","time":"2021-08-02T10:08:43.181514188-04:00"}

root@arvdv:~# systemctl status crunch-dispatch-slurm.service
● crunch-dispatch-slurm.service - Arvados Crunch Dispatcher for SLURM
   Loaded: loaded (/lib/systemd/system/crunch-dispatch-slurm.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2021-08-02 09:58:18 EDT; 11min ago
     Docs: https://doc.arvados.org/
 Main PID: 145544 (crunch-dispatch)
    Tasks: 7 (limit: 9431)
   CGroup: /system.slice/crunch-dispatch-slurm.service
           └─145544 /usr/bin/crunch-dispatch-slurm

Aug 02 09:58:18 arvdv.i.fungalgenomics.ca systemd[1]: Starting Arvados Crunch Dispatcher for SLURM...
Aug 02 09:58:18 arvdv.i.fungalgenomics.ca crunch-dispatch-slurm[145544]: {"level":"info","msg":"crunch-dispatch-slurm 2.2.0 started","time":"2021-08-02T09:58:18.239725508-
Aug 02 09:58:18 arvdv.i.fungalgenomics.ca crunch-dispatch-slurm[145544]: {"level":"warning","msg":"deprecated or unknown config entry: Clusters.arvdv.API.RailsSessionSecretToken","time":"2021-08-02T09:58:18.24
Aug 02 09:58:18 arvdv.i.fungalgenomics.ca crunch-dispatch-slurm[145544]: {"level":"warning","msg":"deprecated or unknown config entry: Clusters.arvdv.Containers.PollInterval","time":"2021-08-02T09:58:18.245848
Aug 02 09:58:18 arvdv.i.fungalgenomics.ca crunch-dispatch-slurm[145544]: {"level":"warning","msg":"deprecated or unknown config entry: Clusters.arvdv.ForceLegacyAPI14","time":"2021-08-02T09:58:18.245926669-04:
Aug 02 09:58:18 arvdv.i.fungalgenomics.ca systemd[1]: Started Arvados Crunch Dispatcher for SLURM.

root@arvdv:~# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

root@arvdv:~# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
test*        up   infinite      2   idle sb[1-2]