Arvados-cwl-runner DNS error when outside Kubernetes cluster

I have an Arvados cluster running on Kubernetes, and I’m trying to use arvados-cwl-runner from outside the cluster. However, I’m running into what look like DNS errors:

$ arvados-cwl-runner --debug --create-workflow bwa-mem.cwl
INFO /usr/local/bin/arvados-cwl-runner 2.1.0, arvados-python-client 2.1.0, cwltool 3.0.20200807132242
INFO Resolved 'bwa-mem.cwl' to 'file:///home/mludwig/arvados-test/bwa-mem.cwl'
DEBUG Parsed job order from command line: {
    "id": "bwa-mem.cwl",
    "PL": null,
    "group_id": null,
    "read_p1": null,
    "read_p2": null,
    "reference": null,
    "sample_id": null
}
INFO Using cluster 3rzp3 (https://10.8.47.219:444)
INFO Uploading Docker image quay.io/biocontainers/bwa:0.7.17--ha92aebf_3
2020-10-26 19:18:36 arvados.arv_put[3387992] INFO: Resuming upload from cache file /home/mludwig/.cache/arvados/arv-put/14d1decaacaf50a6bea32c5847b666b1
0M / 94M 0.0% 2020-10-26 19:18:36 arvados.keep[3387992] DEBUG: {'3rzp3-bi6l4-soaotya77wddkrk': OrderedDict([('href', '/keep_services/3rzp3-bi6l4-soaotya77wddkrk'), ('kind', 'arvados#keepService'), ('etag', 'b0m1700vrmft9o2ctazbs8o33'), ('uuid', '3rzp3-bi6l4-soaotya77wddkrk'), ('owner_uuid', '3rzp3-tpzed-000000000000000'), ('created_at', '2020-10-21T10:22:32.155265000Z'), ('modified_by_client_uuid', '3rzp3-ozdt8-dy6wkmdfw1qsr7g'), ('modified_by_user_uuid', '3rzp3-tpzed-000000000000000'), ('modified_at', '2020-10-21T10:22:32.358468000Z'), ('service_host', 'arvados-keep-store-0.arvados-keep-store'), ('service_port', 25107), ('service_ssl_flag', False), ('service_type', 'disk'), ('read_only', False), ('_service_root', 'http://arvados-keep-store-0.arvados-keep-store:25107/')]), '3rzp3-bi6l4-t3ed9j6d21r50gc': OrderedDict([('href', '/keep_services/3rzp3-bi6l4-t3ed9j6d21r50gc'), ('kind', 'arvados#keepService'), ('etag', '1pb5lilpukxzjlnl69h305up0'), ('uuid', '3rzp3-bi6l4-t3ed9j6d21r50gc'), ('owner_uuid', '3rzp3-tpzed-000000000000000'), ('created_at', '2020-10-21T10:22:33.118259000Z'), ('modified_by_client_uuid', '3rzp3-ozdt8-dy6wkmdfw1qsr7g'), ('modified_by_user_uuid', '3rzp3-tpzed-000000000000000'), ('modified_at', '2020-10-21T10:22:33.320829000Z'), ('service_host', 'arvados-keep-store-1.arvados-keep-store'), ('service_port', 25107), ('service_ssl_flag', False), ('service_type', 'disk'), ('read_only', False), ('_service_root', 'http://arvados-keep-store-1.arvados-keep-store:25107/')])} (X-Request-Id: req-z1ms5bxcoeon2tmc8kv6)
2020-10-26 19:18:36 arvados.keep[3387992] DEBUG: b09fae35d758cd937a8271dc579a7217+31714304: ['http://arvados-keep-store-0.arvados-keep-store:25107/', 'http://arvados-keep-store-1.arvados-keep-store:25107/']
2020-10-26 19:18:36 arvados.keep[3387992] DEBUG: Pool max threads is 2
2020-10-26 19:18:36 arvados.keep[3387992] DEBUG: Request: PUT http://arvados-keep-store-0.arvados-keep-store:25107/b09fae35d758cd937a8271dc579a7217
2020-10-26 19:18:36 arvados.keep[3387992] DEBUG: Request: PUT http://arvados-keep-store-1.arvados-keep-store:25107/b09fae35d758cd937a8271dc579a7217
2020-10-26 19:18:36 arvados.keep[3387992] DEBUG: Request fail: PUT http://arvados-keep-store-1.arvados-keep-store:25107/b09fae35d758cd937a8271dc579a7217 => <class 'arvados.errors.HttpError'>: (0, "(6, 'Could not resolve host: arvados-keep-store-1.arvados-keep-store')")
2020-10-26 19:18:36 arvados.keep[3387992] DEBUG: Request fail: PUT http://arvados-keep-store-0.arvados-keep-store:25107/b09fae35d758cd937a8271dc579a7217 => <class 'arvados.errors.HttpError'>: (0, "(6, 'Could not resolve host: arvados-keep-store-0.arvados-keep-store')")

DNS resolution works from both the api-server and keep-proxy pods:

root@arvados-api-server-fd84b99dc-px94g:/# nslookup arvados-keep-store-0.arvados-keep-store
Server:         169.254.25.10
Address:        169.254.25.10#53

Name:   arvados-keep-store-0.arvados-keep-store.arvados-demo.svc.cluster.local
Address: 10.233.79.167

root@arvados-api-server-fd84b99dc-px94g:/# nslookup arvados-keep-store-1.arvados-keep-store
Server:         169.254.25.10
Address:        169.254.25.10#53

Name:   arvados-keep-store-1.arvados-keep-store.arvados-demo.svc.cluster.local
Address: 10.233.120.139
root@arvados-keep-proxy-545cd4b664-5x7v5:/# nslookup arvados-keep-store-0.arvados-keep-store 
Server:         169.254.25.10
Address:        169.254.25.10#53

Name:   arvados-keep-store-0.arvados-keep-store.arvados-demo.svc.cluster.local
Address: 10.233.79.167

root@arvados-keep-proxy-545cd4b664-5x7v5:/# nslookup arvados-keep-store-1.arvados-keep-store
Server:         169.254.25.10
Address:        169.254.25.10#53

Name:   arvados-keep-store-1.arvados-keep-store.arvados-demo.svc.cluster.local
Address: 10.233.120.139

The keep-proxy is reachable from the cwl-runner host:

$ nmap 10.8.47.219 -p 25107
Starting Nmap 7.70 ( https://nmap.org ) at 2020-10-26 14:54 EDT
Nmap scan report for 10.8.47.219
Host is up (0.0012s latency).

PORT      STATE SERVICE
25107/tcp open  unknown

But the keep-store pods of course aren’t accessible from the cwl-runner host, which I think is why it’s failing. Is there a way to use the keep-proxy with arvados-cwl-runner?

I also tried running the same command from inside the cluster (on the shell-server pod, after installing cwl-runner), which succeeded, so I’m fairly certain the keep-store pods being inaccessible from outside is the problem.

root@arvados-shell-server-8645664676-n4fwl:/home/mludwig# arvados-cwl-runner --create-workflow bwa-mem.cwl
INFO /usr/local/bin/arvados-cwl-runner 2.1.0, arvados-python-client 2.1.0, cwltool 3.0.20200807132242
INFO Resolved 'bwa-mem.cwl' to 'file:///home/mludwig/bwa-mem.cwl'
INFO Using cluster 3rzp3 (https://10.8.47.219:444)
INFO ['docker', 'pull', 'lh3lh3/bwa']
Using default tag: latest
latest: Pulling from lh3lh3/bwa
Image docker.io/lh3lh3/bwa:latest uses outdated schema1 manifest format. Please upgrade to a schema2 image for better future compatibility. More information at https://docs.docker.com/registry/spec/deprecated-schema-v1/
d56ac91634e2: Pull complete 
a3ed95caeb02: Pull complete 
Digest: sha256:ecb80258bdaebe4d42445eb34adea936c929b3a3439bea154a128939f7cce95d
Status: Downloaded newer image for lh3lh3/bwa:latest
docker.io/lh3lh3/bwa:latest
INFO Uploading Docker image lh3lh3/bwa:latest
2020-10-26 22:51:11 arvados.arv_put[4377] INFO: Creating new cache file at /root/.cache/arvados/arv-put/f0bdda26db6887d5ca906aef92ccdac0
1M / 1M 100.0% 2020-10-26 22:51:11 arvados.arv_put[4377] INFO: 

2020-10-26 22:51:11 arvados.arv_put[4377] INFO: Collection saved as 'Docker image lh3lh3 bwa:latest sha256:c66bf'
3rzp3-4zz18-9cx44yx9hl4ccr3
2020-10-26 22:51:11 cwltool[4377] INFO: ['docker', 'pull', 'arvados/jobs:2.1.0']
2.1.0: Pulling from arvados/jobs
8559a31e96f4: Pull complete 
6880da06a4a9: Pull complete 
72c96cad4268: Pull complete 
8acf86f98e38: Pull complete 
0ce8c1e0dd01: Pull complete 
2b381ae22fdd: Pull complete 
824ec0548c57: Pull complete 
0720cb34bd6e: Pull complete 
f0a6d2641296: Pull complete 
e928bba34ab6: Pull complete 
14a1bd0a41d0: Pull complete 
Digest: sha256:33484303914787c57b8796511c9c394926f1986e832f5936a5c99c93661afaf7
Status: Downloaded newer image for arvados/jobs:2.1.0
docker.io/arvados/jobs:2.1.0
2020-10-26 22:51:22 arvados.cwl-runner[4377] INFO: Uploading Docker image arvados/jobs:2.1.0
2020-10-26 22:51:38 arvados.arv_put[4377] INFO: Creating new cache file at /root/.cache/arvados/arv-put/d8b92189bf59bcc03a995eac3c2725f0
239M / 239M 100.0% 2020-10-26 22:51:41 arvados.arv_put[4377] INFO: 

2020-10-26 22:51:41 arvados.arv_put[4377] INFO: Collection saved as 'Docker image arvados jobs:2.1.0 sha256:e7866'
3rzp3-4zz18-zas3cq19mkkpwf2
3rzp3-7fd4e-pfbb2hwxokpmsbv
2020-10-26 22:51:42 cwltool[4377] INFO: Final process status is success

I’d rather not run the user shell server inside the cluster, since containers aren’t really suited to what it does (a cron job for login-sync plus an SSH server), and it’s less secure.

Hey @mluds, welcome!

But the keep-store pods of course aren’t accessible from the cwl-runner host, which I think is why it’s failing. Is there a way to use the keep-proxy with arvados-cwl-runner?

Yes; there is a mechanism in the nginx configuration for the API server that tries to determine whether a request is coming from inside or outside the cluster.

What happens is that the keep client (in this case, arv_put in the Arvados Python SDK, called from arvados-cwl-runner) asks the API server for the location of the Keep service(s). When the request comes from inside the cluster, the API server should respond with the Keep service IPs. When it comes from outside, the API server is supposed to return the address for Keepproxy.

Based on your description, it sounds like this is not working right. The configuration lives in charts/arvados/templates/api-server-configmap.yaml:

    geo $external_client {
      default     1;
      10.0.0.0/8  0;
    }
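
For context, that geo variable is typically handed to the API server as a request header, which is what lets it decide whether to advertise the internal keepstore addresses or the keepproxy. The exact lines in the chart may differ from this; a rough sketch of the usual pattern (the header name follows the standard Arvados nginx setup, and the upstream name is just a placeholder):

    server {
      # listener/TLS settings elided
      location / {
        # tell the API server whether this request looks external (1) or internal (0)
        proxy_set_header X-External-Client $external_client;
        proxy_pass http://api;   # placeholder upstream name
      }
    }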

In a k8s environment, there can be several layers of address translation between the client and the API service, and I suspect something like that is going on here. You could verify this by looking at the API server logs and checking the source IP address of the requests generated by your arvados-cwl-runner command.
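
For example, something along these lines would show the client address the API server sees for the keep_services lookup (namespace and deployment name taken from the pod names in your output; adjust if yours differ):

    # Tail the API server logs and look for the keep_services request and its remote address
    kubectl -n arvados-demo logs deploy/arvados-api-server --tail=200 | grep keep_services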

It’s also possible that the geo IP discovery is working as intended, but that your machine has a cached Arvados discovery document with old information. You could wipe out the discovery document cache in ~/.cache/arvados/ to make sure you have the latest.
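
A minimal way to do that is just to remove the cache directory; note this also clears the arv-put resume cache shown in your log, which is harmless but means interrupted uploads will start from scratch:

    # clears the cached discovery document (and the arv-put upload cache) for this user
    rm -rf ~/.cache/arvados/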

We can also chat on gitter (https://gitter.im/arvados/community) to help debug, if you like.

Does this help?

Thanks,
Ward.

PS: we are doing bi-weekly Arvados user group calls via Google Meet. We’d be happy to see you there – http://meet.google.com/eig-fvsw-xvd. The next one is this Friday at 10am Eastern time, see https://arvados.org/community/ for more details.

Perfect, thank you!

There were two problems I ran into:

The first was that the external host I was running arvados-cwl-runner from had an IP address in the 10.0.0.0/8 range. I added a setting to the Helm values to make that range configurable.

The second problem was that MetalLB obscures the source IP when using the Cluster external traffic policy, so requests always appeared to come from an internal IP (whichever node happened to be handling the traffic). The Local policy would preserve the source IP, but with MetalLB it doesn’t allow inter-node traffic, which is required here.
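
For reference, that traffic policy is set per Service; roughly like this (the Service name and ports here are placeholders, not the chart’s actual values):

    # Hypothetical Service fragment. With MetalLB, "Local" preserves the client source IP
    # but only routes to pods on the node that received the traffic; the default "Cluster"
    # policy SNATs the source to a node-internal address.
    apiVersion: v1
    kind: Service
    metadata:
      name: arvados-api-server   # placeholder
    spec:
      type: LoadBalancer
      externalTrafficPolicy: Local
      ports:
        - port: 444
          targetPort: 444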

My solution was to make it always use the keep-proxy (by setting internalIPs: []), which seems to work, though I’m not sure whether it will cause other problems.
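
In case it helps anyone else, the relevant bit of my values file looks roughly like this (the exact key path depends on the chart, so treat it as a sketch):

    # values.yaml fragment: with no internal ranges listed, every client is treated as
    # external, so the API server always hands out the keepproxy address.
    internalIPs: []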

Glad you got it working! I’m not entirely surprised. The nginx geo approach doesn’t really translate well to k8s; I think we should come up with a different solution.

In terms of always using keepproxy, that should work. It will have some performance impact: your bandwidth into Keep is now more limited (just one path, via the keepproxy, and keepproxy has to do more work than keepstore to move data in and out of Keep). You could try running more than one keepproxy.
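
For example, something like this (deployment name inferred from the keep-proxy pod name earlier in the thread; adjust to match your chart):

    # run two keepproxy replicas behind the same Service to spread the load
    kubectl -n arvados-demo scale deploy/arvados-keep-proxy --replicas=2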

For a demo/test installation this should work, but I wouldn’t recommend it for a production setup.