Salt-install fails

I want to run all the services except compute on a single host node.

Distributor ID: Ubuntu
Description: Ubuntu 20.04.4 LTS
Release: 20.04
Codename: focal

The details of installation.

  1. git clone https://github.com/arvados/arvados and check out the 2.4-release branch.

  2. Run:

cp local.params.example.multiple_hosts local.params
cp -r config_examples/multi_host/aws local_config_dir

  3. Modify local.params:

# Internal IPs for the configuration
CLUSTER_INT_CIDR=10.0.0.0/16

# Note the IPs in this example are shared between roles, as suggested in
# https://doc.arvados.org/main/install/salt-multi-host.html
CONTROLLER_INT_IP=10.0.0.4
WEBSOCKET_INT_IP=10.0.0.4
KEEP_INT_IP=10.0.0.4
# Both for collections and downloads
KEEPWEB_INT_IP=10.0.0.4
KEEPSTORE0_INT_IP=10.0.0.4
WORKBENCH1_INT_IP=10.0.0.4
WEBSHELL_INT_IP=10.0.0.4
DATABASE_INT_IP=10.0.0.4
SHELL_INT_IP=10.0.0.4

INITIAL_USER="admin"
SSL_MODE="self-signed"
RELEASE="production"

  4. Create a certificate and a key, and copy them for all services.

  5. Run sudo ./provision.sh --config local.params --roles database

It prints an error, but Salt reports success (Succeeded: 39 (changed=15), Failed: 0).

Error:


Initializing git_internal_dir /var/lib/arvados/internal.git: directory exists, skipped.
Making sure '/var/lib/arvados/internal.git' has the right permission... done.
Job for nginx.service failed because the control process exited with error code.
See "systemctl status nginx.service" and "journalctl -xe" for details.
dpkg: error processing package arvados-api-server (--configure):
 installed arvados-api-server package post-installation script subprocess returned error exit status 1
Errors were encountered while processing:
 arvados-api-server
E: Sub-process /usr/bin/dpkg returned an error code (1)

-----------------------

● nginx.service - A high performance web server and a reverse proxy server
     Loaded: loaded (/lib/systemd/system/nginx.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Fri 2022-04-29 15:13:23 UTC; 2min 35s ago
       Docs: man:nginx(8)
    Process: 4016248 ExecStartPre=/usr/sbin/nginx -t -q -g daemon on; master_process on; (code=exited, status=1/FAILURE)

Apr 29 15:13:23 pai-master systemd[1]: Starting A high performance web server and a reverse proxy server...
Apr 29 15:13:23 pai-master nginx[4016248]: nginx: [emerg] host not found in upstream "keepproxy_upstream" in /etc/nginx/sites-enabled/arvados_keepproxy_ssl.conf:13
Apr 29 15:13:23 pai-master nginx[4016248]: nginx: configuration file /etc/nginx/nginx.conf test failed
Apr 29 15:13:23 pai-master systemd[1]: nginx.service: Control process exited, code=exited, status=1/FAILURE
Apr 29 15:13:23 pai-master systemd[1]: nginx.service: Failed with result 'exit-code'.
Apr 29 15:13:23 pai-master systemd[1]: Failed to start A high performance web server and a reverse proxy server.

/etc/nginx/sites-enabled/arvados_keepproxy_ssl.conf:


server {
    server_name keep.admin.qc;
    listen 443 http2 ssl;
    index index.html index.htm;
    location / {
        proxy_pass http://keepproxy_upstream;
        proxy_read_timeout 90;
        proxy_connect_timeout 90;
        proxy_redirect off;
        proxy_set_header X-Forwarded-Proto https;
        proxy_set_header Host $http_host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_buffering off;
    }
    client_body_buffer_size 64M;
    client_max_body_size 64M;
    proxy_http_version 1.1;
    proxy_request_buffering off;
    include snippets/ssl_hardening_default.conf;
    ssl_certificate /etc/nginx/ssl/arvados-keepproxy.pem;
    ssl_certificate_key /etc/nginx/ssl/arvados-keepproxy.key;
    access_log /var/log/nginx/keepproxy.admin.qc.access.log combined;
    error_log /var/log/nginx/keepproxy.admin.qc.error.log;
}
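
For anyone hitting the same `host not found in upstream "keepproxy_upstream"` message: it indicates that nginx found neither an upstream block with that name nor a host name it could resolve. In this setup the `*_upstream` names appear to be expected in /etc/hosts, so one quick check is whether the name is listed there. A minimal sketch, assuming a hypothetical helper `name_in_hosts` (it is not part of Arvados or provision.sh):

```shell
# Hypothetical helper: succeed if NAME is listed as a host name (any column
# after the IP address) on a non-comment line of a hosts-style file.
# The file defaults to /etc/hosts.
name_in_hosts() {
  local name="$1" file="${2:-/etc/hosts}"
  awk -v n="$name" '
    /^[ \t]*#/ { next }                                   # skip comment lines
    { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }  # column 1 is the IP
    END { exit !found }                                   # status 0 only when found
  ' "$file"
}

# Example use:
#   name_in_hosts keepproxy_upstream || echo "keepproxy_upstream is missing from /etc/hosts"
```

This only reads the file; `getent hosts keepproxy_upstream` is an alternative that goes through the system resolver instead.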

I thought this error might be caused by the certificates, so I changed SSL_MODE to "lets-encrypt", but got a new error:

+agree-tos = True
+authenticator = dns-route53
+deploy-hook = systemctl reload nginx
+email = admin@master.qc
+expand = True
+keep-until-expiring = True
+max-log-backups = 0
+server = https://acme-v02.api.letsencrypt.org/directory

[INFO    ] Completed state [/etc/letsencrypt/cli.ini] at time 16:36:20.318243 (duration_in_ms=32.369)
[INFO    ] Running state [/usr/bin/certbot certonly --quiet --cert-name controller.admin.qc -d admin.qc --non-interactive] at time 16:36:20.319294
[INFO    ] Executing state cmd.run for [/usr/bin/certbot certonly --quiet --cert-name controller.admin.qc -d admin.qc --non-interactive]
[INFO    ] The functions from module 'ansible' are being loaded from the provided __load__ attribute
[INFO    ] Executing command git in directory '/root'
[INFO    ] Executing command '/usr/bin/certbot' in directory '/root'
[INFO    ] Executing command '/usr/bin/certbot' in directory '/root'
[ERROR   ] Command '/usr/bin/certbot' failed with return code: 1
[ERROR   ] stderr: Unable to register an account with ACME server
[ERROR   ] retcode: 1
[ERROR   ] {'pid': 4018443, 'retcode': 1, 'stdout': '', 'stderr': 'Unable to register an account with ACME server'}
[INFO    ] Completed state [/usr/bin/certbot certonly --quiet --cert-name controller.admin.qc -d admin.qc --non-interactive] at time 16:36:24.525342 (duration_in_ms=4206.043)
[ERROR   ] An un-handled exception was caught by salt's global exception handler:
AttributeError: 'bool' object has no attribute 'get'
Traceback (most recent call last):
  File "/usr/bin/salt-call", line 11, in <module>
    load_entry_point('salt==3004.1', 'console_scripts', 'salt-call')()
  File "/usr/lib/python3/dist-packages/salt/scripts.py", line 432, in salt_call
    client.run()
  File "/usr/lib/python3/dist-packages/salt/cli/call.py", line 55, in run
    caller.run()
  File "/usr/lib/python3/dist-packages/salt/cli/caller.py", line 111, in run
    ret = self.call()
  File "/usr/lib/python3/dist-packages/salt/cli/caller.py", line 218, in call
    ret["return"] = self.minion.executors[fname](
  File "/usr/lib/python3/dist-packages/salt/loader/lazy.py", line 149, in __call__
    return self.loader.run(run_func, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/loader/lazy.py", line 1201, in run
    return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/loader/lazy.py", line 1216, in _run_as
    return _func_or_method(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/executors/direct_call.py", line 10, in execute
    return func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/loader/lazy.py", line 149, in __call__
    return self.loader.run(run_func, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/loader/lazy.py", line 1201, in run
    return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/loader/lazy.py", line 1216, in _run_as
    return _func_or_method(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/modules/state.py", line 794, in apply_
    return highstate(**kwargs)
  File "/usr/lib/python3/dist-packages/salt/modules/state.py", line 1117, in highstate
    ret = st_.call_highstate(
  File "/usr/lib/python3/dist-packages/salt/state.py", line 4541, in call_highstate
    return self.state.call_high(high, orchestration_jid)
  File "/usr/lib/python3/dist-packages/salt/state.py", line 3279, in call_high
    ret = self.call_chunks(chunks)
  File "/usr/lib/python3/dist-packages/salt/state.py", line 2497, in call_chunks
    running = self.call_chunk(low, running, chunks)
  File "/usr/lib/python3/dist-packages/salt/state.py", line 2984, in call_chunk
    running = self.call_chunk(chunk, running, chunks)
  File "/usr/lib/python3/dist-packages/salt/state.py", line 2878, in call_chunk
    status, reqs = self.check_requisite(low, running, chunks, pre=True)
  File "/usr/lib/python3/dist-packages/salt/state.py", line 2651, in check_requisite
    self.reconcile_procs(running)
  File "/usr/lib/python3/dist-packages/salt/state.py", line 2574, in reconcile_procs
    proc = running[tag].get("proc")
AttributeError: 'bool' object has no attribute 'get'

I created a key and a certificate and copied them to the services. Then I got a new error:

----------
          ID: arvados-config-package-install-pkg-installed
    Function: pkg.installed
        Name: arvados-server
      Result: False
     Comment: Problem encountered installing package(s). Additional info follows:

              errors:
                  - Running scope as unit: run-re4f1efe0abbf4902869864e9d43f05f5.scope
                    E: Version '2.4.0' for 'arvados-server' was not found
     Started: 17:22:35.656668
    Duration: 3316.652 ms
     Changes:

Summary for local
-------------
Succeeded: 59 (changed=2)
Failed:     1
-------------
Total states run:     60
Total run time:   13.987 s

For this last failure, are you sure you are on the 2.4-release branch of the source tree?

The problem is that the 2.4.0 Arvados package cannot be found in the apt repository. That could be caused by a network problem, or perhaps the apt source is not configured correctly.

Could you paste the contents of /etc/apt/sources.list.d/*arvados* here?

Only /etc/apt/sources.list.d/arvados.list

Can you paste the contents of /etc/apt/sources.list.d/arvados.list here?

arvados.list:
deb [signed-by=/usr/share/keyrings/arvados-archive-keyring.gpg arch=amd64] http://apt.arvados.org/focal focal main

That looks ok. Does the output of apt-get update show that the arvados apt repository is available and that there are no signature verification errors?

I’ve done some local testing and I can’t replicate this problem. I suspect that the problem is caused by (intermittent?) network issues connecting to apt.arvados.org. If this is really the problem, re-running the provision script a few times may get you past the problem.

I found a problem with your scripts: some files end up with their content repeated.
For example:

/etc/apt/sources.list.d/pgdg.list:1:deb [signed-by=/usr/share/postgresql-common/pgdg/apt.postgresql.org.gpg] http://apt.postgresql.org/pub/repos/apt focal-pgdg maindeb [signed-by=/usr/share/postgresql-common/pgdg/apt.postgresql.org.gpg] http://apt.postgresql.org/pub/repos/apt focal-pgdg maindeb [signed-by=/usr/share/postgresql-common/pgdg/apt.postgresql.org.gpg] http://apt.postgresql.org/pub/repos/apt focal-pgdg maindeb [signed-by=/usr/share/postgresql-common/pgdg/apt.postgresql.org.gpg] http://apt.postgresql.org/pub/repos/apt focal-pgdg maindeb [signed-by=/usr/share/postgresql-common/pgdg/apt.postgresql.org.gpg] http://apt.postgresql.org/pub/repos/apt focal-pgdg maindeb [signed-by=/usr/share/postgresql-common/pgdg/apt.postgresql.org.gpg] http://apt.postgresql.org/pub/repos/apt focal-pgdg maindeb [signed-by=/usr/share/postgresql-common/pgdg/apt.postgresql.org.gpg] http://apt.postgresql.org/pub/repos/apt focal-pgdg maindeb [signed-by=/usr/share/postgresql-common/pgdg/apt.postgresql.org.gpg] http://apt.postgresql.org/pub/repos/apt focal-pgdg maindeb [signed-by=/usr/share/postgresql-common/pgdg/apt.postgresql.org.gpg] http://apt.postgresql.org/pub/repos/apt focal-pgdg maindeb [signed-by=/usr/share/postgresql-common/pgdg/apt.postgresql.org.gpg] http://apt.postgresql.org/pub/repos/apt focal-pgdg maindeb [signed-by=/usr/share/postgresql-common/pgdg/apt.postgresql.org.gpg] http://apt.postgresql.org/pub/repos/apt focal-pgdg maindeb [signed-by=/usr/share/postgresql-common/pgdg/apt.postgresql.org.gpg] http://apt.postgresql.org/pub/repos/apt focal-pgdg maindeb [signed-by=/usr/share/postgresql-common/pgdg/apt.postgresql.org.gpg] http://apt.postgresql.org/pub/repos/apt focal-pgdg maindeb [signed-by=/usr/share/postgresql-common/pgdg/apt.postgresql.org.gpg] http://apt.postgresql.org/pub/repos/apt focal-pgdg maindeb [signed-by=/usr/share/postgresql-common/pgdg/apt.postgresql.org.gpg] http://apt.postgresql.org/pub/repos/apt focal-pgdg maindeb 
[signed-by=/usr/share/postgresql-common/pgdg/apt.postgresql.org.gpg] http://apt.postgresql.org/pub/repos/apt focal-pgdg maindeb [signed-by=/usr/share/postgresql-common/pgdg/apt.postgresql.org.gpg] http://apt.postgresql.org/pub/repos/apt focal-pgdg main
/etc/hosts: 10.0.0.4 ........

I have updated the installation details at the top.

  6. Run sudo ./provision.sh --config local.params --roles api

More errors:


      ID: arvados-config-package-install-pkg-installed
Function: pkg.installed
    Name: arvados-server
  Result: False
 Comment: Problem encountered installing package(s). Additional info follows:

          errors:
              - Running scope as unit: run-r81ab8386e48d439e9df185d2c7ed5582.scope
                W: --force-yes is deprecated, use one of the options starting with --allow instead.
                E: Version '2.4.0' for 'arvados-server' was not found
 Started: 15:51:57.047604
Duration: 6249.007 ms
 Changes:

      ID: listener_nginx_service
Function: service.mod_watch
    Name: nginx
  Result: False
 Comment: Job for nginx.service failed.
          See "systemctl status nginx.service" and "journalctl -xe" for details.
 Started: 15:52:03.297223

For the first error, the requested version (2.4.0) does not match the version available in the repository (2.4.0-1):

$ sudo apt search arvados-server
Sorting… Done
Full Text Search… Done
arvados-server/focal,now 2.4.0-1 amd64 [installed]
Arvados server daemons
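
apt matches a pinned `package=version` request against the full version string, so a request for `2.4.0` does not match a package published as `2.4.0-1`. `apt-cache madison arvados-server` prints the versions apt can see, one per line, in `package | version | source` form. A small sketch of that exact-match check, assuming a hypothetical helper `version_listed` that reads madison-style lines on stdin:

```shell
# Hypothetical helper: read `apt-cache madison`-style lines on stdin
# ("package | version | source") and succeed only if the requested
# version string matches one of the listed versions exactly.
version_listed() {
  awk -F'|' -v want="$1" '
    {
      v = $2
      gsub(/^[ \t]+|[ \t]+$/, "", v)   # trim spaces around the version column
      if (v == want) found = 1
    }
    END { exit !found }
  '
}

# Example use:
#   apt-cache madison arvados-server | version_listed 2.4.0 \
#     || echo "no exact match for 2.4.0"
```

With the repository state shown above, `2.4.0-1` would be reported as listed and `2.4.0` would not.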

For the second error, the script does not add the domains to /etc/hosts correctly. After running provision.sh several times, the file contains:

10.0.0.4 controller_upstream workbench_upstream keepproxy_upstream db.admin.qc database.admin.qc admin.qc ws.admin.qc workbench.admin.qc workbench2.admin.qc keep.admin.qc download.admin.qc collections.admin.qc webshell.admin.qc shell.admin.qc keep0.admin.qc controller_upstream workbench_upstream keepproxy_upstream db.admin.qc database.admin.qc admin.qc ws.admin.qc workbench.admin.qc workbench2.admin.qc keep.admin.qc download.admin.qc collections.admin.qc webshell.admin.qc shell.admin.qc keep0.admin.qc
10.0.0.4 controller_upstream workbench_upstream keepproxy_upstream db.admin.qc database.admin.qc admin.qc ws.admin.qc workbench.admin.qc workbench2.admin.qc keep.admin.qc download.admin.qc collections.admin.qc webshell.admin.qc shell.admin.qc keep0.admin.qc controller_upstream workbench_upstream keepproxy_upstream db.admin.qc database.admin.qc admin.qc ws.admin.qc workbench.admin.qc workbench2.admin.qc keep.admin.qc download.admin.qc collections.admin.qc webshell.admin.qc shell.admin.qc keep0.admin.qc
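
As a stopgap for the repeated entries, the file can be previewed with duplicates collapsed before editing it by hand. This is only a sketch under stated assumptions: `dedupe_hosts` is a hypothetical helper, it prints to stdout rather than modifying anything, and it treats the symptom, not the provisioning step that appends the names again on each run.

```shell
# Hypothetical helper: print a hosts-style file with repeated host names
# removed from each line and identical resulting lines printed only once.
# Comment and blank lines pass through unchanged. Nothing is written back.
dedupe_hosts() {
  awk '
    /^[ \t]*(#|$)/ { print; next }
    {
      split("", seen)                  # portable way to clear the array
      out = $1                         # keep the IP address column
      for (i = 2; i <= NF; i++) {
        if (!($i in seen)) { seen[$i] = 1; out = out " " $i }
      }
      if (!(out in lines)) { lines[out] = 1; print out }
    }
  ' "$1"
}

# Example use (review the output before replacing the real file):
#   dedupe_hosts /etc/hosts > /tmp/hosts.deduped
```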

Hello! I had similar issues; here are the changes I made to fix the install: fix-install.patch - Arvados

Hi @erow, sorry for the late reply. The problem you ran into

E: Version '2.4.0' for 'arvados-server' was not found

was fixed a while ago in the 2.4-release branch. We’ve also just released Arvados 2.4.1.


Hello @cure @mr-c, I have a new problem: I don’t know how to write /etc/arvados/config.yml correctly. Will it be generated automatically?


      ID: arvados-config-file-file-managed
Function: file.managed
    Name: /etc/arvados/config.yml
  Result: False
 Comment: check_cmd execution failed
          transcoding config data: json: cannot unmarshal number into Go struct field TestUser.Clusters.Login.Test.Users.Password of type string
 Started: 15:20:06.088730
Duration: 115.244 ms
 Changes:

Summary for local

Succeeded: 126 (changed=1)
Failed: 1


@erow Did you put an unquoted number for a password? The error seems to be that you put something like Password: 1234 but it interprets that as a number so you need to quote it, e.g. Password: "1234"

Actually, the config file was empty. The script did not generate this file.