Hello!
I am trying to add new storage classes and local filesystem-backed volumes to a test "single-host single-hostname" installation of the Arvados 2.7.3 release.
The problem is that despite the presence of 4 volumes (2 of them configured for the default storage classes), the `arv-put` (and `arv-copy`) command-line tools always fail – they don't think enough replicas have been written.
In my config, I defined 3 additional storage classes (`test`, another default storage class; `backup1` and `backup2`, both optional), with 3 extra volumes, one backing each of them. The "cluster" is configured with a default replication factor of 2.
There is a single keepstore service "endpoint" (is that the correct word?) running on localhost at `http://127.0.0.1:25107`, as defined by the default config. In addition, I also added a keep-balance service running on `http://127.0.0.1:9005`. All services have been restarted after changing the config file.
Arvados configuration file snippet
```yaml
Clusters:
  xlbrs:
    Collections:
      DefaultReplication: 2
    StorageClasses:
      default:
        Default: true
        Priority: 20
      test:
        Default: true
        Priority: 10
      backup1:
        Priority: 5
      backup2:
        Priority: 5
    Volumes:
      xlbrs-nyw5e-000000000000000:
        AccessViaHosts:
          "http://127.0.0.1:25107":
            ReadOnly: false
        Driver: Directory
        DriverParameters:
          Root: /var/lib/arvados/keep
        Replication: 1
        StorageClasses:
          default: true
      xlbrs-nyw5e-000000000000001:
        AccessViaHosts:
          "http://127.0.0.1:25107":
            ReadOnly: false
        Driver: Directory
        DriverParameters:
          Root: /mnt/vf1/keep
        Replication: 1
        StorageClasses:
          test: true
      xlbrs-nyw5e-000000000000002:
        AccessViaHosts:
          "http://127.0.0.1:25107":
            ReadOnly: false
        Driver: Directory
        DriverParameters:
          Root: /mnt/vf2/keep
        Replication: 1
        StorageClasses:
          backup1: true
      xlbrs-nyw5e-000000000000003:
        AccessViaHosts:
          "http://127.0.0.1:25107":
            ReadOnly: false
        Driver: Directory
        DriverParameters:
          Root: /mnt/vf3/keep
        Replication: 1
        StorageClasses:
          backup2: true
    Services:
      Keepstore:
        InternalURLs:
          "http://127.0.0.1:25107": {}
```
(Notice that in the config above, I changed the `Replication` property of the default storage volume from the 2 set by the fresh installation to 1. I think the default config fakes a replication factor of 2 for the underlying storage regardless of the actual value – so that the single-host installation can meet the default replication target?)
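For reference, this is roughly what the fresh installation had for the single volume before my edits (reconstructed from memory, so the exact fragment may not be verbatim) – the `Replication: 2` on a single `Directory` volume is what made me suspect the faking:

```yaml
# Approximate stock single-host config before my changes (from memory):
Volumes:
  xlbrs-nyw5e-000000000000000:
    Driver: Directory
    DriverParameters:
      Root: /var/lib/arvados/keep
    Replication: 2   # a single local directory volume claiming 2 replicas
```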
When I run the `arv-put` command on a test data collection directory, I get the following:
(client) arv-put command and error messages
```
(python3-arvados-python-client) arvdev@user:~$ arv-put test-collection
2024-06-14 03:29:27 arvados.arv_put[9516] INFO: Calculating upload size, this could take some time...
2024-06-14 03:29:27 arvados.arv_put[9516] INFO: Resuming upload from cache file /home/arvdev/.cache/arvados/arv-put/a09afa74011e9afc42559d231b000a44
0M / 0M 0.0% 2024-06-14 03:36:28 arvados.arv_put[9516] ERROR: arv-put: Error writing some blocks: block d8e8fca2dc0f896fd7cb4cb0031ba249+5 raised KeepWriteError ([req-7jbdytugz6kh48204bkc] failed to write d8e8fca2dc0f896fd7cb4cb0031ba249 after 11 attempts (wanted (2, ['default', 'test']) copies but wrote (0, [])): service https://xlbrs.vir-test.home.arpa:8801/ responded with 503 HTTP/1.1 100 Continue
HTTP/1.1 503 Service Unavailable) (X-Request-Id: req-7jbdytugz6kh48204bkc)
```
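To rule out keepproxy, my next step is to try writing the same block straight to the keepstore endpoint. Here is the sketch of what I intend to run (it assumes `ARVADOS_API_TOKEN` is exported in the shell, and that keepstore accepts the `X-Keep-Desired-Replicas` header; the 5-byte block is the `test\n` from the failing upload):

```shell
# Compute the locator hash of the 5-byte block "test\n" from the error above.
hash=$(printf 'test\n' | md5sum | awk '{print $1}')
echo "$hash"   # should match d8e8fca2dc0f896fd7cb4cb0031ba249 from the log

# Uncomment to actually send the PUT to the local keepstore (needs a running
# keepstore and a valid token; bypasses keepproxy entirely):
# printf 'test\n' | curl -sS -X PUT --data-binary @- \
#     -H "Authorization: OAuth2 $ARVADOS_API_TOKEN" \
#     -H "X-Keep-Desired-Replicas: 1" \
#     "http://127.0.0.1:25107/$hash"
```

If the direct write succeeds, that would point the finger at keepproxy's replica accounting rather than the volumes themselves.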
The body of the error message from `keepproxy` in the server log is:
(server) keepproxy error response message body as JSON
```json
{
  "ClusterID": "xlbrs",
  "PID": 808,
  "RequestID": "req-7jbdytugz6kh48204bkc",
  "err": "Could not write sufficient replicas",
  "expectLength": 5,
  "level": "info",
  "locator": "d8e8fca2dc0f896fd7cb4cb0031ba249+5",
  "msg": "response",
  "priority": 0,
  "queue": "api",
  "remoteAddr": "127.0.0.1:34504",
  "reqBytes": 5,
  "reqForwardedFor": "192.168.60.132",
  "reqHost": "xlbrs.vir-test.home.arpa:8801",
  "reqMethod": "PUT",
  "reqPath": "d8e8fca2dc0f896fd7cb4cb0031ba249",
  "reqQuery": "",
  "respBody": "Could not write sufficient replicas\n",
  "respBytes": 36,
  "respStatus": "Service Unavailable",
  "respStatusCode": 503,
  "time": "2024-06-14T11:36:28.338791625+08:00",
  "timeToStatus": 0.011939,
  "timeTotal": 0.011947,
  "timeWriteBody": 9e-06,
  "userFullName": "",
  "userUUID": "xlbrs-tpzed-5lowdrl04yijbg8",
  "wantReplicas": 2,
  "wroteReplicas": 2
}
```
I am confused by the fact that both `wantReplicas` and `wroteReplicas` in the keepproxy log entry are 2; nevertheless, keepproxy doesn't seem to think it has written enough copies.
Furthermore, when I try uploading a collection with data in Workbench 2, I get essentially the same "Could not write sufficient replicas" error in the keep-web logs. Creating a new collection from WB2 works, and the newly created empty collection has the two default storage classes (`default` and `test`, see setup above). However, uploading new files to the collection always fails.
I don’t know where the problem could be – I suspect my storage setup is incorrect? (Both keepproxy and keep-web fail, and the common point seems to be the keepstore service itself.) Your help is appreciated!
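In case it matters, I can also pull keepstore's own log lines for the failing request, where I would expect any per-volume write errors to show up (a sketch; it assumes the systemd unit is named `keepstore`, as in my install):

```shell
# Search keepstore's journal for the request ID from the error above.
reqid="req-7jbdytugz6kh48204bkc"
journalctl -u keepstore --since today 2>/dev/null | grep "$reqid" || true
echo "searched keepstore log for $reqid"
```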