Workbench stuck after successful process run

georgebax · 26 November 2020 12:04

Hi,

I am using arvbox to test arvados locally. I created a container request for a container that had already run, using the API. Then, again using the API, I set its state to “Committed” to make it run, with the following call:

curl -k -X POST \                                                                               
 -H "Authorization: OAuth2 $ARVADOS_API_TOKEN" \
 --data-urlencode container_request@/dev/stdin \
 https://$ARVADOS_API_HOST/arvados/v1/container_requests/ \
 <<EOF
{
  "container_uuid":"x3dm2-xvhdp-c7ge2z8ax89miaa",
  "container_image":"x3dm2-4zz18-q3zuy64urbb1r2v",
  "state":"Committed",
  "cwd":"/",
  "output_path":"/output",
  "runtime_constraints":
    {
      "ram":12000000000,
      "vcpus":2,
      "API":true
    },
  "command":[
    "arvados-cwl-runner",
    "--local",
    "--api=containers",
    "--collection-cache-size=256",
    "/var/lib/cwl/workflow.json#main",
    "/var/lib/cwl/cwl.input.json"
  ]                                                               
}
EOF

to which, I got a normal response as such:

{"href":"/container_requests/x3dm2-xvhdp-3oj97dymux86xpj","kind":"arvados#containerRequest","etag":"3zreu6u2kjy1u1ngtr0eddetx","uuid":"x3dm2-xvhdp-3oj97dymux86xpj","owner_uuid":"x3dm2-tpzed-amlr5fgf7gp7kgb","created_at":"2020-11-25T17:54:11.490567000Z","modified_by_client_uuid":"x3dm2-ozdt8-53qhxfv1x9d8p7l","modified_by_user_uuid":"x3dm2-tpzed-amlr5fgf7gp7kgb","modified_at":"2020-11-25T17:54:11.526004000Z","command":["arvados-cwl-runner","--local","--api=containers","--collection-cache-size=256","/var/lib/cwl/workflow.json#main","/var/lib/cwl/cwl.input.json"],"container_count":1,"container_count_max":3,"container_image":"x3dm2-4zz18-q3zuy64urbb1r2v","container_uuid":"x3dm2-xvhdp-c7ge2z8ax89miaa","cwd":"/","description":null,"environment":{},"expires_at":null,"filters":null,"log_uuid":null,"mounts":{},"name":null,"output_name":null,"output_path":"/output","output_uuid":null,"output_ttl":0,"priority":0,"properties":{},"requesting_container_uuid":null,"runtime_constraints":{"ram":12000000000,"vcpus":2,"API":true},"scheduling_parameters":{},"state":"Final","use_existing":true}

The problem is that now I can’t view the dashboard. https://172.17.0.2/ just gets me to a page saying “Oh… fiddlesticks. Sorry, I had some trouble handling your request. Path not found (req-9jae7foedgv070cs7vv7) [API: 404]”

The weird part is that the “req-*” code changes every time I navigate to this page.

I have tried killing the box, restarting, using a different browser, logging out and back in (my account in arvados), but nothing has worked so far! Any ideas?

Thanks in advance!

Edit: If I send a GET request for all the “state: Final” container requests through the API, I can see the process of the last workflow I sent. Also, I can access “Home” from the “Projects” dropdown, but when I select the “Processes” tab, I get a “Oops, request failed” error…

georgebax · 26 November 2020 16:20

Eventually I had to reset the arvbox because I had to get it working again, but I am leaving this up in case it is a bug.

tetron · 27 November 2020 21:18

The “req-*” code is a request id which can be used to look up the request in the logs for more information. Every time you load the page, it generates a new request, with a new request id.

You are hitting a workbench bug. It is trying to display your container request and it hitting a bug and crashing. The workbench logs have more information, including a stack trace. Unfortunately It sounds like you deleted the evidence but for future reference you can view logs with arvbox log workbench.

However, your container request also looks a little strange. What are you trying to do, exactly?

georgebax · 2 December 2020 08:58

The workflow I am trying to use contains some calculations that take about 5 minutes, nothing too serious. I am trying to re-run an already run container, which I did by sending the request I posted in my first post, but without the container_uuid and the runtime_constraints fields, and with an Uncommitted state instead of Committed. TAt the time being I can’t reproduce the error. If it happens again I will update.

tetron · 2 December 2020 15:15

The way to re-run a container is to submit a new container request that is slightly different (so it doesn’t match the past container) or set use_existing: false in the request.

https://doc.arvados.org/v2.1/api/methods/container_requests.html

georgebax · 2 December 2020 18:43

The problem now is that I get the container_uuid and container_image fields from the API response of an already run workflow/process, I replace them on my API request (first post). Then (even if I add use_existing : false, I get a container request which is “complete”, but it didn’t run now, but 5 days ago (when I ran it using the arvados-cwl-runner) !

Is there a way to see the request that the cwl runner builds before it is sent? Thanks.

tetron · 2 December 2020 20:35

Let me try to clarify a few things -

A container image is a Docker container image. It provides the runtime environment. It is just a special a special type of input.

You submit a container request. When you do this, you are really asking is for Arvados to give you the results of a computation.

Arvados may create a container to fulfill the container request. A container represents an individual execution. It can fail, re-tries happen by creating a new container associated with the same request. It can also fulfill a request with a container that ran in the past.

Normally, when you submit a container request, container_uuid shouldn’t be set. It is the job of the API server to go and find or create a container to fulfill the request.

When you submit a container request with the container_uuid already filled, you are telling the API that the request is actually already fulfilled, which is not what you want. So if you are programmatically copying the container request record and submitting a new one, you should be sure to delete the container_uuid field.

The use_existing flag controls whether the API server will search for existing containers or only create new containers to fulfill the request.

No there isn’t. Although there is a flag --submit-request-uuid that will make it update an existing container request instead of creating a new one, maybe that’s helpful.

georgebax · 3 December 2020 09:05

Ok thanks for clarifying, it really helped. Now every container request I create appears to be a new one, it’s just “On hold” and not running. I am using the address of the container image of a previous successful run, which is in this format : 9d6262c515ee608eeba21c101779ef90+1396. Am I doing something wrong? I am pretty confident that’s what I was doing for previous runs.

tetron · 3 December 2020 14:37

The request state can also be “Committed” or “Uncommitted”. In the uncommitted state the fields can still be updated (used while setting inputs in workbench for example).

There’s also a “priority” field. Priority is between 0 and 1000. If the priority is 0 it won’t run or will stop running if it has previously started.

Make sure that you are setting “state” to “Committed” and that “priority” is 500.

georgebax · 3 December 2020 15:47

Well this way it creates new requests but they are all Queued for a time and then it’s cancelled. I have no other pending containers, or other committed requests.
Furthermore, I get the same result when I submit the request through the java sdk (I have added some support for containers and container requests with a new api client) like this, where containerImage is “9d6262c515ee608eeba21c101779ef90+1396”, as above:

   	ContainerRequest containerRequest = new ContainerRequest();
    	containerRequest.setContainerImage(containerImage);
    	containerRequest.setState("Committed"); // TODO make this an enum
    	containerRequest.setCwd("/");
    	containerRequest.setOutputPath("/output");
    	containerRequest.setPriority(1000);
    	List<String> command = new ArrayList<>();
    	command.add("arvados-cwl-runner");
    	command.add("--local");
    	command.add("--api=containers");
    	command.add("--collection-cache-size=256");
    	command.add("/var/lib/cwl/workflow.json#main");
    	command.add("/var/lib/cwl/cwl.input.json");
    	containerRequest.setCommand(command);
    	RuntimeConstraints runtimeConstraints = new RuntimeConstraints();
    	runtimeConstraints.setApi(Boolean.TRUE);
    	runtimeConstraints.setRam(120000000L);
    	runtimeConstraints.setVcpus(1);
    	containerRequest.setRuntimeConstraints(runtimeConstraints);
    	return containerRequestsApiClient.create(containerRequest);

tetron · 3 December 2020 15:50

Check the container logs, they should probably give you some idea about why they were cancelled.

georgebax · 3 December 2020 15:50

Ha, I almost beat you to it, but I came to post this:

2020-12-03T15:49:52.727157695Z error in Run: While setting up mounts: Output path does not correspond to a writable mount point
2020-12-03T15:49:52.749972050Z error in CaptureOutput: error scanning files to copy to output: cannot output file "/output": not in any mount

tetron · 3 December 2020 15:54

Ah, right, you need to set up the mounts. They are documented on the container requests document page I linked earlier

Also I should have mentioned if you are looking at a record such as a container request, you can go to “Advanced” and open “API response” you can get the whole record. I think that will answer your earlier question about how to see what the request generated by arvados-cwl-runner looks like.

georgebax · 3 December 2020 16:08

I see. When I see a completed and run request (through the cwl runner) I see that any input files defined in my .cwl file are present in the mounts section, but in a format that I don’t know how to reproduce!

"mounts": {
    "/var/spool/cwl/bu_pma_nvt.data": {
      "portable_data_hash": "a9995b4966a0d276b56a6d03bf6e6566+67",
      "kind": "collection",
      "path": "bu_pma_nvt.data"
    },
    "/keep/d45b861e9cde945d9f5885ac8b54bd19+233/test": {
      "portable_data_hash": "d45b861e9cde945d9f5885ac8b54bd19+233",
      "kind": "collection",
      "path": "test"
    },
    "/tmp": {
      "kind": "tmp",
      "capacity": 1073741824
    },
    "/var/spool/cwl": {
      "kind": "tmp",
      "capacity": 1073741824
    }
  },

My question concerning this (and I sure hope it’s the last one!) is how I can submit a container request to run a specific workflow, like I am doing through the cwl runner. Thanks again for your help!

tetron · 3 December 2020 16:13

The example you gave the mounts section of an individual job. Look at the mounts section for a workflow run and you will find /var/spool/cwl/workflow.json and /var/spool/cwl/inputs.json in the mounts.

The inputs.json is your cwl input object. Any files you reference have to already be in Keep.

The workflow.json can be the contents you get when you do arvados-cwl-runner --create-workflow ... and that creates a Workflow object in the API (https://doc.arvados.org/v2.1/api/methods/workflows.html).

georgebax · 3 December 2020 16:23

Ok, so I have 2 workflows now, but how do I schedule container requests based on a workflow? I don’t see it in the API docs. Do I have to copy the mounts section from the API response of a successful run?

tetron · 3 December 2020 16:41

Let’s see if this helps, here’s how Workbench does it (Ruby)

https://dev.arvados.org/projects/arvados/repository/revisions/master/entry/apps/workbench/app/controllers/work_units_controller.rb#L60

Here’s how Workbench 2 does it (Typescript):

https://dev.arvados.org/projects/arvados/repository/arvados-workbench2/revisions/master/entry/src/store/run-process-panel/run-process-panel-actions.ts#L141

Those two functions are basically doing the same thing to construct a container request that runs a workflow.

georgebax · 3 December 2020 18:30

Thanks a lot, I will figure it out!