arv-put and Workbench2 "Could not write sufficient replicas" with a custom storage class/volume setup

Hello!

I am trying to add new storage classes and local filesystem-backed volumes to a test “single host, single hostname” installation of the Arvados 2.7.3 release.

The problem is that despite the presence of 4 volumes (2 of them backing the default storage classes), the arv-put (and arv-copy) command-line tools always fail: they don’t think enough replicas have been written.

In my config, I defined 3 additional storage classes: test (a second default storage class), plus backup1 and backup2 (both optional), with one extra volume backing each of them. The cluster is configured with a default replication factor of 2.

There is a single keepstore service “endpoint” (is that the correct word?) running on localhost at http://127.0.0.1:25107, as defined by the default config. In addition, I added a keep-balance service running on http://127.0.0.1:9005. All services were restarted after changing the config file.
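For completeness, here is how I registered the two services in the Services section (to the best of my understanding of the config schema; please correct me if keep-balance should be registered differently):

Services config snippet
Clusters:
  xlbrs:
    Services:
      Keepstore:
        InternalURLs:
          http://127.0.0.1:25107: {}
      Keepbalance:
        InternalURLs:
          http://127.0.0.1:9005: {}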

Arvados configuration file snippet
Clusters:
  xlbrs:
    Collections:
      DefaultReplication: 2
    StorageClasses:
      default:
        Default: true
        Priority: 20
      test:
        Default: true
        Priority: 10
      backup1:
        Priority: 5
      backup2:
        Priority: 5
    Volumes:
      xlbrs-nyw5e-000000000000000:
        AccessViaHosts:
          http://127.0.0.1:25107:
            ReadOnly: false
        Driver: Directory
        DriverParameters:
          Root: /var/lib/arvados/keep
        Replication: 1
        StorageClasses:
          default: true
      xlbrs-nyw5e-000000000000001:
        AccessViaHosts:
          http://127.0.0.1:25107:
            ReadOnly: false
        Driver: Directory
        DriverParameters:
          Root: /mnt/vf1/keep
        Replication: 1
        StorageClasses:
          test: true
      xlbrs-nyw5e-000000000000002:
        AccessViaHosts:
          http://127.0.0.1:25107:
            ReadOnly: false
        Driver: Directory
        DriverParameters:
          Root: /mnt/vf2/keep
        Replication: 1
        StorageClasses:
          backup1: true
      xlbrs-nyw5e-000000000000003:
        AccessViaHosts:
          http://127.0.0.1:25107:
            ReadOnly: false
        Driver: Directory
        DriverParameters:
          Root: /mnt/vf3/keep
        Replication: 1
        StorageClasses:
          backup2: true
    Services:
      Keepstore:
        InternalURLs:
          http://127.0.0.1:25107: {}

(Note that in the config above, I changed the Replication property of the default storage volume from 2, as initially set by the fresh installation, to 1. I think the default config fakes a replication factor of 2 for the underlying storage regardless of the actual number of copies on disk, so that the single-host installation can satisfy the default replication target?)
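For comparison, the default volume stanza as shipped by the installer looked roughly like this before my edit (from memory, so treat the exact values as approximate):

Original default volume stanza (approximate)
      xlbrs-nyw5e-000000000000000:
        AccessViaHosts:
          http://127.0.0.1:25107:
            ReadOnly: false
        Driver: Directory
        DriverParameters:
          Root: /var/lib/arvados/keep
        Replication: 2   # single local directory, but reported as 2 replicas
        StorageClasses:
          default: true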

When I run the arv-put command on a test data collection directory, it fails as follows:

(client) arv-put command and error messages
(python3-arvados-python-client) arvdev@user:~$ arv-put test-collection
2024-06-14 03:29:27 arvados.arv_put[9516] INFO: Calculating upload size, this could take some time...
2024-06-14 03:29:27 arvados.arv_put[9516] INFO: Resuming upload from cache file /home/arvdev/.cache/arvados/arv-put/a09afa74011e9afc42559d231b000a44
0M / 0M 0.0% 2024-06-14 03:36:28 arvados.arv_put[9516] ERROR: arv-put: Error writing some blocks: block d8e8fca2dc0f896fd7cb4cb0031ba249+5 raised KeepWriteError ([req-7jbdytugz6kh48204bkc] failed to write d8e8fca2dc0f896fd7cb4cb0031ba249 after 11 attempts (wanted (2, ['default', 'test']) copies but wrote (0, [])): service https://xlbrs.vir-test.home.arpa:8801/ responded with 503 HTTP/1.1 100 Continue
  HTTP/1.1 503 Service Unavailable) (X-Request-Id: req-7jbdytugz6kh48204bkc)

The body of the error message from keepproxy in the server log is:

(server) keepproxy error response message body as JSON
{
  "ClusterID": "xlbrs",
  "PID": 808,
  "RequestID": "req-7jbdytugz6kh48204bkc",
  "err": "Could not write sufficient replicas",
  "expectLength": 5,
  "level": "info",
  "locator": "d8e8fca2dc0f896fd7cb4cb0031ba249+5",
  "msg": "response",
  "priority": 0,
  "queue": "api",
  "remoteAddr": "127.0.0.1:34504",
  "reqBytes": 5,
  "reqForwardedFor": "192.168.60.132",
  "reqHost": "xlbrs.vir-test.home.arpa:8801",
  "reqMethod": "PUT",
  "reqPath": "d8e8fca2dc0f896fd7cb4cb0031ba249",
  "reqQuery": "",
  "respBody": "Could not write sufficient replicas\n",
  "respBytes": 36,
  "respStatus": "Service Unavailable",
  "respStatusCode": 503,
  "time": "2024-06-14T11:36:28.338791625+08:00",
  "timeToStatus": 0.011939,
  "timeTotal": 0.011947,
  "timeWriteBody": 9e-06,
  "userFullName": "",
  "userUUID": "xlbrs-tpzed-5lowdrl04yijbg8",
  "wantReplicas": 2,
  "wroteReplicas": 2
}

I am confused by the fact that both wantReplicas and wroteReplicas in the keepproxy log entry are 2; nevertheless, keepproxy doesn’t seem to think it has written enough copies. (The client-side error above, on the other hand, says it wanted 2 copies in the classes ['default', 'test'] but wrote (0, []).)
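To narrow this down, I was going to bypass keepproxy and PUT the same block directly to the keepstore on localhost, along these lines (the header names below are my reading of the Keep API docs, so apologies if any of them are off; d8e8fca2dc0f896fd7cb4cb0031ba249 is the md5 of the 5-byte payload “test\n” from the failing upload):

(client) direct keepstore PUT, sketch
printf 'test\n' | curl -v -X PUT \
  -H "Authorization: OAuth2 $ARVADOS_API_TOKEN" \
  -H "X-Keep-Desired-Replicas: 2" \
  -H "X-Keep-Storage-Classes: default, test" \
  --data-binary @- \
  http://127.0.0.1:25107/d8e8fca2dc0f896fd7cb4cb0031ba249
# The X-Keep-Replicas-Stored and X-Keep-Storage-Classes-Confirmed response
# headers should show what keepstore itself thinks it wrote.

If that direct write succeeds with the expected classes, I would guess the problem sits between keepproxy’s replica counting and my volume layout rather than in keepstore itself.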

Furthermore, when I try uploading a collection with data in Workbench2, I get essentially the same “Could not write sufficient replicas” error in the keep-web logs. Creating a new collection from WB2 works, and the newly created empty collection has the two default storage classes (default and test, see the setup above). However, uploading new files to the collection always fails.

I don’t know where the problem could be; I suspect my storage setup is incorrect (both keepproxy and keep-web fail, and their common dependency is the keepstore service itself). Your help is appreciated!