Keep error during a CWL run

Hi all,

The “bwa-mem” tutorial (https://doc.arvados.org/master/user/cwl/cwl-runner.html) runs successfully. Please see below:

However the “remove-dups” workflow tutorial (https://doc.arvados.org/master/user/composer/composer.html) fails with the keep error as shown below. Please note that the child processes “sort.cwl” and “uniq.cwl” both completes successfully and I can see their outputs.

Here is the full --debug output:

arvadosgcp@arv-ubuntu-18-04-2:~$ arvados-cwl-runner --debug birworkflow.cwl --input deneme_in
put.txt
INFO /home/arvadosgcp/.local/bin/arvados-cwl-runner 2.0.2, arvados-python-client 2.0.2, cwltool 1.0.20190831161204
INFO Resolved 'birworkflow.cwl' to 'file:///home/arvadosgcp/birworkflow.cwl'
DEBUG Parsed job order from command line: {
    "id": "birworkflow.cwl",
    "input": {
        "class": "File",
        "location": "file:///home/arvadosgcp/deneme_input.txt"
    }
}
INFO Upload local files: "deneme_input.txt"
DEBUG {'arvad-bi6l4-2a82fc92e13d100f': OrderedDict([('href', '/keep_services/arvad-bi6l4-2a82fc92e13d100f'), ('kind', 'arvados#keepService'), ('etag', '3yue5zhfxtvvxqe46ydk8zyw9'), ('uuid', 'arvad-bi6l4-2a82fc92e13d100f'), ('owner_uuid', 'arvad-tpzed-000000000000000'), ('created_at', '2020-06-25T23:55:45.929711000Z'), ('modified_by_client_uuid', None), ('modified_by_user_uuid', 'arvad-tpzed-000000000000000'), ('modified_at', '2020-06-25T23:55:45.929711000Z'), ('service_host', 'keep0.arvad.elmgenomics.com'), ('service_port', 25107), ('service_ssl_flag', False), ('service_type', 'disk'), ('read_only', False), ('_service_root', 'http://keep0.arvad.elmgenomics.com:25107/')]), 'arvad-bi6l4-070387e8b18e199c': OrderedDict([('href', '/keep_services/arvad-bi6l4-070387e8b18e199c'), ('kind', 'arvados#keepService'), ('etag', 'c6v1jyouzt2bzk8pgdh36nz3e'), ('uuid', 'arvad-bi6l4-070387e8b18e199c'), ('owner_uuid', 'arvad-tpzed-000000000000000'), ('created_at', '2020-06-25T23:55:45.929711000Z'), ('modified_by_client_uuid', None), ('modified_by_user_uuid', 'arvad-tpzed-000000000000000'), ('modified_at', '2020-06-25T23:55:45.929711000Z'), ('service_host', 'keep1.arvad.elmgenomics.com'), ('service_port', 25107), ('service_ssl_flag', False), ('service_type', 'disk'), ('read_only', False), ('_service_root', 'http://keep1.arvad.elmgenomics.com:25107/')])}
DEBUG 1624ee8a28860a5107a41d0cfc5c96e6+64: ['http://keep1.arvad.elmgenomics.com:25107/', 'http://keep0.arvad.elmgenomics.com:25107/']
DEBUG Pool max threads is 2
DEBUG Request: PUT http://keep1.arvad.elmgenomics.com:25107/1624ee8a28860a5107a41d0cfc5c96e6
DEBUG Request: PUT http://keep0.arvad.elmgenomics.com:25107/1624ee8a28860a5107a41d0cfc5c96e6
INFO PUT 200: 64 bytes in 33.16497802734375 msec (0.002 MiB/sec)
DEBUG KeepWriterThread <KeepWriterThread(Thread-2, started daemon 140388825073408)> succeeded 1624ee8a28860a5107a41d0cfc5c96e6+64 http://keep0.arvad.elmgenomics.com:25107/
INFO PUT 200: 64 bytes in 35.94660758972168 msec (0.002 MiB/sec)
DEBUG KeepWriterThread <KeepWriterThread(Thread-1, started daemon 140388833466112)> succeeded 1624ee8a28860a5107a41d0cfc5c96e6+64 http://keep1.arvad.elmgenomics.com:25107/
INFO Using collection 59752f525b87e0b7a084ca996f003fc4+60 (arvad-4zz18-av5i8y7d4d81f8j)
INFO Using empty collection d41d8cd98f00b204e9800998ecf8427e+0
INFO Using empty collection d41d8cd98f00b204e9800998ecf8427e+0
INFO Using empty collection d41d8cd98f00b204e9800998ecf8427e+0
INFO Using collection cache size 256 MiB
DEBUG ENTER jobiter 1593141293.7100556
DEBUG EXIT jobiter 1593141293.711501 0.0014452934265136719
DEBUG ENTER run 1593141293.7118447
INFO [container birworkflow] submitted container_request arvad-xvhdp-rsapsuefkn0lgtl
DEBUG EXIT run 1593141293.9711041 0.2592594623565674
DEBUG ENTER jobiter 1593141293.9711988
DEBUG EXIT jobiter 1593141293.9712565 5.7697296142578125e-05
INFO [container birworkflow] arvad-xvhdp-rsapsuefkn0lgtl is Final
DEBUG cef82e2f2d0bb80d84f9a802f378d34f+70978+Ac67555e4213ed84d8fdd321d68b5cccd82444861@5f07de05: ['http://keep1.arvad.elmgenomics.com:25107/', 'http://keep0.arvad.elmgenomics.com:25107/']
DEBUG Request: GET http://keep1.arvad.elmgenomics.com:25107/cef82e2f2d0bb80d84f9a802f378d34f+70978+Ac67555e4213ed84d8fdd321d68b5cccd82444861@5f07de05
INFO GET 200: 70978 bytes in 1.8079280853271484 msec (37.441 MiB/sec)
ERROR [container birworkflow] (arvad-dz642-gtzw0q6aluq9wzc) error log:

  2020-06-26T03:17:38.963580275Z stderr INFO Overall process status is success
  2020-06-26T03:17:38.963580275Z stderr DEBUG ENTER jobiter 1593141458.9602315
  2020-06-26T03:17:38.963580275Z stderr DEBUG EXIT jobiter 1593141458.9603672 0.00013566017150878906
  2020-06-26T03:17:38.993128583Z stderr DEBUG {'arvad-bi6l4-2a82fc92e13d100f': OrderedDict([('href', '/keep_services/arvad-bi6l4-2a82fc92e13d100f'), ('kind', 'arvados#keepService'), ('etag', '3yue5zhfxtvvxqe46ydk8zyw9'), ('uuid', 'arvad-bi6l4-2a82fc92e13d100f'), ('owner_uuid', 'arvad-tpzed-000000000000000'), ('created_at', '2020-06-25T23:55:45.929711000Z'), ('modified_by_client_uuid', None), ('modified_by_user_uuid', 'arvad-tpzed-000000000000000'), ('modified_at', '2020-06-25T23:55:45.929711000Z'), ('service_host', 'keep0.arvad.elmgenomics.com'), ('service_port', 25107), ('service_ssl_flag', False), ('service_type', 'disk'), ('read_only', False), ('_service_root', 'http://keep0.arvad.elmgenomics.com:25107/')]), 'arvad-bi6l4-070387e8b18e199c': OrderedDict([('href', '/keep_services/arvad-bi6l4-070387e8b18e199c'), ('kind', 'arvados#keepService'), ('etag', 'c6v1jyouzt2bzk8pgdh36nz3e'), ('uuid', 'arvad-bi6l4-070387e8b18e199c'), ('owner_uuid', 'arvad-tpzed-000000000000000'), ('created_at', '2020-06-25T23:55:45.929711000Z'), ('modified_by_client_uuid', None), ('modified_by_user_uuid', 'arvad-tpzed-000000000000000'), ('modified_at', '2020-06-25T23:55:45.929711000Z'), ('service_host', 'keep1.arvad.elmgenomics.com'), ('service_port', 25107), ('service_ssl_flag', False), ('service_type', 'disk'), ('read_only', False), ('_service_root', 'http://keep1.arvad.elmgenomics.com:25107/')])}
  2020-06-26T03:17:38.993128583Z stderr DEBUG 5e3733b7373fb88ba7be5eeb7f98a003+147: ['http://keep0.arvad.elmgenomics.com:25107/', 'http://keep1.arvad.elmgenomics.com:25107/']
  2020-06-26T03:17:38.993128583Z stderr DEBUG Pool max threads is 2
  2020-06-26T03:17:38.993128583Z stderr DEBUG Request: PUT http://keep0.arvad.elmgenomics.com:25107/5e3733b7373fb88ba7be5eeb7f98a003
  2020-06-26T03:17:38.993128583Z stderr DEBUG Request: PUT http://keep1.arvad.elmgenomics.com:25107/5e3733b7373fb88ba7be5eeb7f98a003
  2020-06-26T03:17:39.052006916Z stderr DEBUG Request fail: PUT http://keep0.arvad.elmgenomics.com:25107/5e3733b7373fb88ba7be5eeb7f98a003 => <class 'arvados.errors.HttpError'>: (0, "(6, 'Could not resolve host: keep0.arvad.elmgenomics.com')")
  2020-06-26T03:17:39.052047957Z stderr DEBUG Request fail: PUT http://keep1.arvad.elmgenomics.com:25107/5e3733b7373fb88ba7be5eeb7f98a003 => <class 'arvados.errors.HttpError'>: (0, "(6, 'Could not resolve host: keep1.arvad.elmgenomics.com')")
  2020-06-26T03:17:42.992559175Z stderr DEBUG {'arvad-bi6l4-2a82fc92e13d100f': OrderedDict([('href', '/keep_services/arvad-bi6l4-2a82fc92e13d100f'), ('kind', 'arvados#keepService'), ('etag', '3yue5zhfxtvvxqe46ydk8zyw9'), ('uuid', 'arvad-bi6l4-2a82fc92e13d100f'), ('owner_uuid', 'arvad-tpzed-000000000000000'), ('created_at', '2020-06-25T23:55:45.929711000Z'), ('modified_by_client_uuid', None), ('modified_by_user_uuid', 'arvad-tpzed-000000000000000'), ('modified_at', '2020-06-25T23:55:45.929711000Z'), ('service_host', 'keep0.arvad.elmgenomics.com'), ('service_port', 25107), ('service_ssl_flag', False), ('service_type', 'disk'), ('read_only', False), ('_service_root', 'http://keep0.arvad.elmgenomics.com:25107/')]), 'arvad-bi6l4-070387e8b18e199c': OrderedDict([('href', '/keep_services/arvad-bi6l4-070387e8b18e199c'), ('kind', 'arvados#keepService'), ('etag', 'c6v1jyouzt2bzk8pgdh36nz3e'), ('uuid', 'arvad-bi6l4-070387e8b18e199c'), ('owner_uuid', 'arvad-tpzed-000000000000000'), ('created_at', '2020-06-25T23:55:45.929711000Z'), ('modified_by_client_uuid', None), ('modified_by_user_uuid', 'arvad-tpzed-000000000000000'), ('modified_at', '2020-06-25T23:55:45.929711000Z'), ('service_host', 'keep1.arvad.elmgenomics.com'), ('service_port', 25107), ('service_ssl_flag', False), ('service_type', 'disk'), ('read_only', False), ('_service_root', 'http://keep1.arvad.elmgenomics.com:25107/')])}
  2020-06-26T03:17:42.992559175Z stderr DEBUG 5e3733b7373fb88ba7be5eeb7f98a003+147: ['http://keep0.arvad.elmgenomics.com:25107/', 'http://keep1.arvad.elmgenomics.com:25107/']
  2020-06-26T03:17:42.992559175Z stderr DEBUG Pool max threads is 2
  2020-06-26T03:17:42.992559175Z stderr DEBUG Request: PUT http://keep0.arvad.elmgenomics.com:25107/5e3733b7373fb88ba7be5eeb7f98a003
  2020-06-26T03:17:42.992559175Z stderr DEBUG Request: PUT http://keep1.arvad.elmgenomics.com:25107/5e3733b7373fb88ba7be5eeb7f98a003
  2020-06-26T03:17:43.004928484Z stderr DEBUG Request fail: PUT http://keep0.arvad.elmgenomics.com:25107/5e3733b7373fb88ba7be5eeb7f98a003 => <class 'arvados.errors.HttpError'>: (0, "(6, 'Could not resolve host: keep0.arvad.elmgenomics.com')")
  2020-06-26T03:17:43.053745024Z stderr DEBUG Request fail: PUT http://keep1.arvad.elmgenomics.com:25107/5e3733b7373fb88ba7be5eeb7f98a003 => <class 'arvados.errors.HttpError'>: (0, "(6, 'Could not resolve host: keep1.arvad.elmgenomics.com')")
  2020-06-26T03:17:50.993254854Z stderr DEBUG {'arvad-bi6l4-2a82fc92e13d100f': OrderedDict([('href', '/keep_services/arvad-bi6l4-2a82fc92e13d100f'), ('kind', 'arvados#keepService'), ('etag', '3yue5zhfxtvvxqe46ydk8zyw9'), ('uuid', 'arvad-bi6l4-2a82fc92e13d100f'), ('owner_uuid', 'arvad-tpzed-000000000000000'), ('created_at', '2020-06-25T23:55:45.929711000Z'), ('modified_by_client_uuid', None), ('modified_by_user_uuid', 'arvad-tpzed-000000000000000'), ('modified_at', '2020-06-25T23:55:45.929711000Z'), ('service_host', 'keep0.arvad.elmgenomics.com'), ('service_port', 25107), ('service_ssl_flag', False), ('service_type', 'disk'), ('read_only', False), ('_service_root', 'http://keep0.arvad.elmgenomics.com:25107/')]), 'arvad-bi6l4-070387e8b18e199c': OrderedDict([('href', '/keep_services/arvad-bi6l4-070387e8b18e199c'), ('kind', 'arvados#keepService'), ('etag', 'c6v1jyouzt2bzk8pgdh36nz3e'), ('uuid', 'arvad-bi6l4-070387e8b18e199c'), ('owner_uuid', 'arvad-tpzed-000000000000000'), ('created_at', '2020-06-25T23:55:45.929711000Z'), ('modified_by_client_uuid', None), ('modified_by_user_uuid', 'arvad-tpzed-000000000000000'), ('modified_at', '2020-06-25T23:55:45.929711000Z'), ('service_host', 'keep1.arvad.elmgenomics.com'), ('service_port', 25107), ('service_ssl_flag', False), ('service_type', 'disk'), ('read_only', False), ('_service_root', 'http://keep1.arvad.elmgenomics.com:25107/')])}
  2020-06-26T03:17:50.993254854Z stderr DEBUG 5e3733b7373fb88ba7be5eeb7f98a003+147: ['http://keep0.arvad.elmgenomics.com:25107/', 'http://keep1.arvad.elmgenomics.com:25107/']
  2020-06-26T03:17:50.993254854Z stderr DEBUG Pool max threads is 2
  2020-06-26T03:17:50.993254854Z stderr DEBUG Request: PUT http://keep0.arvad.elmgenomics.com:25107/5e3733b7373fb88ba7be5eeb7f98a003
  2020-06-26T03:17:50.993254854Z stderr DEBUG Request: PUT http://keep1.arvad.elmgenomics.com:25107/5e3733b7373fb88ba7be5eeb7f98a003
  2020-06-26T03:17:51.022167680Z stderr DEBUG Request fail: PUT http://keep1.arvad.elmgenomics.com:25107/5e3733b7373fb88ba7be5eeb7f98a003 => <class 'arvados.errors.HttpError'>: (0, "(6, 'Could not resolve host: keep1.arvad.elmgenomics.com')")
  2020-06-26T03:17:51.053570398Z stderr DEBUG Request fail: PUT http://keep0.arvad.elmgenomics.com:25107/5e3733b7373fb88ba7be5eeb7f98a003 => <class 'arvados.errors.HttpError'>: (0, "(6, 'Could not resolve host: keep0.arvad.elmgenomics.com')")
  2020-06-26T03:18:07.010691715Z stderr DEBUG {'arvad-bi6l4-2a82fc92e13d100f': OrderedDict([('href', '/keep_services/arvad-bi6l4-2a82fc92e13d100f'), ('kind', 'arvados#keepService'), ('etag', '3yue5zhfxtvvxqe46ydk8zyw9'), ('uuid', 'arvad-bi6l4-2a82fc92e13d100f'), ('owner_uuid', 'arvad-tpzed-000000000000000'), ('created_at', '2020-06-25T23:55:45.929711000Z'), ('modified_by_client_uuid', None), ('modified_by_user_uuid', 'arvad-tpzed-000000000000000'), ('modified_at', '2020-06-25T23:55:45.929711000Z'), ('service_host', 'keep0.arvad.elmgenomics.com'), ('service_port', 25107), ('service_ssl_flag', False), ('service_type', 'disk'), ('read_only', False), ('_service_root', 'http://keep0.arvad.elmgenomics.com:25107/')]), 'arvad-bi6l4-070387e8b18e199c': OrderedDict([('href', '/keep_services/arvad-bi6l4-070387e8b18e199c'), ('kind', 'arvados#keepService'), ('etag', 'c6v1jyouzt2bzk8pgdh36nz3e'), ('uuid', 'arvad-bi6l4-070387e8b18e199c'), ('owner_uuid', 'arvad-tpzed-000000000000000'), ('created_at', '2020-06-25T23:55:45.929711000Z'), ('modified_by_client_uuid', None), ('modified_by_user_uuid', 'arvad-tpzed-000000000000000'), ('modified_at', '2020-06-25T23:55:45.929711000Z'), ('service_host', 'keep1.arvad.elmgenomics.com'), ('service_port', 25107), ('service_ssl_flag', False), ('service_type', 'disk'), ('read_only', False), ('_service_root', 'http://keep1.arvad.elmgenomics.com:25107/')])}
  2020-06-26T03:18:07.010691715Z stderr DEBUG 5e3733b7373fb88ba7be5eeb7f98a003+147: ['http://keep0.arvad.elmgenomics.com:25107/', 'http://keep1.arvad.elmgenomics.com:25107/']
  2020-06-26T03:18:07.010691715Z stderr DEBUG Pool max threads is 2
  2020-06-26T03:18:07.010691715Z stderr DEBUG Request: PUT http://keep0.arvad.elmgenomics.com:25107/5e3733b7373fb88ba7be5eeb7f98a003
  2020-06-26T03:18:07.010691715Z stderr DEBUG Request: PUT http://keep1.arvad.elmgenomics.com:25107/5e3733b7373fb88ba7be5eeb7f98a003
  2020-06-26T03:18:07.038264934Z stderr DEBUG Request fail: PUT http://keep0.arvad.elmgenomics.com:25107/5e3733b7373fb88ba7be5eeb7f98a003 => <class 'arvados.errors.HttpError'>: (0, "(6, 'Could not resolve host: keep0.arvad.elmgenomics.com')")
  2020-06-26T03:18:07.038264934Z stderr DEBUG Request fail: PUT http://keep1.arvad.elmgenomics.com:25107/5e3733b7373fb88ba7be5eeb7f98a003 => <class 'arvados.errors.HttpError'>: (0, "(6, 'Could not resolve host: keep1.arvad.elmgenomics.com')")
  2020-06-26T03:18:28.388394986Z crunch-run caught signal: terminated
  2020-06-26T03:18:28.388448289Z crunch-run removing container
  2020-06-26T03:18:28.515506897Z crunch-run error closing stdout logs: close /tmp/crunch-run.arvad-dz642-gtzw0q6aluq9wzc.243436082/keep272255977/tmp0/cwl.output.json: software caused connection abort
  2020-06-26T03:18:28.629369809Z crunch-run Container exited with code: 137
  2020-06-26T03:18:28.641122501Z crunch-run caught signal: terminated
  2020-06-26T03:18:28.641948995Z crunch-run removing container
  2020-06-26T03:18:28.646686939Z crunch-run error removing container: Error: No such container: 7ffbc5cb7641f106b55dd1637e18cc75670f5ebad8bb9ab32ee5ec3d3bf32dde
  2020-06-26T03:18:28.707840874Z crunch-run error in CaptureOutput: error scanning files to copy to output: open /tmp/crunch-run.arvad-dz642-gtzw0q6aluq9wzc.243436082/keep272255977/tmp0/.arvados#collection: no such file or directory
  2020-06-26T03:18:28.708096923Z crunch-run Cancelled
ERROR Overall process status is permanentFail
INFO Final output collection None
{}
WARNING Final process status is permanentFail

Thanks a lot!

It seems like arvados-cwl-runner running inside a docker container cannot resolve the hostname for your keep nodes. This is strange because it does seem to be able to access your API server. I think there is something going on with your DNS (or are you using a hosts file?) and Docker.

Yes I am using hosts file that defines:

127.0.0.1 localhost
127.0.0.30 localapi.com
127.0.0.10 keep0.arvad.elmgenomics.com
127.0.0.11 keep1.arvad.elmgenomics.com

And keep.arvad.elmgenomics.com subdomain forwards to my External IP. But don’t you think it is strange that bwa-mem tutorial works perfectly fine. bwa-mem.cwl also generates a container using dockerPull: lh3lh3/bwa and we do not see the same resolve error there?

I see. I think at least part of your problem is that inside the container, the 127.* network is the container’s localhost and not the host’s localhost. The easiest workaround is to have keep0 and keep1 point to your external IP address. You could also set up some kind of virtual bridge device that isn’t 127.* (similar to what Docker), but I don’t know exactly how to do that offhand.