Arvados 2.7.0 released

The Arvados team is pleased to announce Arvados 2.7.0. Highlights of this release include a new container logging system, scalability and performance improvements, and user interface improvements. We recommend that new and existing installations of 2.6.3 or earlier upgrade to 2.7.0. See Upgrading Arvados for instructions.

New Container Logging System

Arvados 2.7.0 introduces an entirely new system to view logs from running containers. The new API lets clients retrieve logs directly from the running container, rather than storing logs on the API server as log objects. This greatly reduces API server load, network traffic, and database storage requirements for clusters with heavy compute load: live logs are sent only to the clients that request them, instead of being sent from every running container. It also makes it easier for clients to provide users with a consistent and complete view of all logs available, whether a container is running or finished.

The new logging API was implemented in #19889, #20319, and #20647.

Workbench 2 support for this new API was implemented in #20219. The interface is the same as before, but the logs shown should be much more complete.

Log viewing at the command line was added as the arvados-client logs command in #18790.

With the introduction of this API, the default configuration for Containers.Logging.LimitLogBytesPerJob is now 0. This functionally disables old log record creation in Crunch. Those code paths are still available for now, but they are expected to be deprecated and removed in future releases of Arvados. #20894

Workbench

Workbench 2 is now the default for new installs, and Workbench 1 is deprecated. All new development is going into Workbench 2, which means that Workbench 1 cannot take advantage of new APIs like the container logging API described above. Our user guide has been updated to give people instructions based on Workbench 2, not Workbench 1. #20497, #20688, #20731, #20850, #20890

Workbench supports a richer set of copy and move operations when you select multiple files from a collection view. You can copy or move those files elsewhere within the same collection, to a new collection, to an existing collection, or to a separate collection for each selection. The underlying operations are handled by the API server so they’re fast and efficient. #20031

When applicable, the process view links to the collection that contains the CWL workflow definition that was run. #20513

The process view shows both complete workflow and single-container cost in a single, simplified line item. #20454

Workflows correctly display inputs with an optional enum type. #19359

Added a Delete action to the workflow definitions menu. #20477, #20899

New users can be directed to complete a user profile on first login. The user profile, including required fields, is configurable by the administrator. #18946, #20913

Breadcrumbs in Workbench render even when the parent project is not visible in the left-hand navigation menu. #19991

Searching and filtering on the Workbench “Shared with me” view now works as intended. #20617

Scalability and Performance Improvements

The API server prioritizes requests that come from interactive clients, processing them before any others. This keeps the system responsive for users of clients like Workbench while the cluster is under heavy compute load. An “interactive” request is one with the Origin header set. Both versions of Workbench and keep-web set this. #20602
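The prioritization rule above can be illustrated with a minimal sketch. This is not the API server's implementation: the RequestQueue class, the request names, and the example Origin value are hypothetical, and a plain dict stands in for HTTP headers (real header names are case-insensitive). It only shows the idea of serving requests that carry an Origin header before all others.

```python
import heapq
from itertools import count


def is_interactive(headers):
    """Treat a request as interactive when it carries a non-empty Origin header."""
    return bool(headers.get("Origin"))


class RequestQueue:
    """Hypothetical queue: interactive requests are served before all others,
    in arrival order within each class."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker that preserves arrival order

    def push(self, name, headers):
        rank = 0 if is_interactive(headers) else 1
        heapq.heappush(self._heap, (rank, next(self._seq), name))

    def pop(self):
        return heapq.heappop(self._heap)[2]


queue = RequestQueue()
queue.push("dispatcher-poll", {})
queue.push("workbench-list", {"Origin": "https://workbench.example.com"})
queue.push("cwl-runner-update", {})

order = [queue.pop() for _ in range(3)]
print(order)  # ['workbench-list', 'dispatcher-poll', 'cwl-runner-update']
```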

The installer can deploy nginx load balancing in front of multiple controller nodes. This provides an easy way to deploy Arvados clusters with more availability and scale. #20610

Improved the scalability of the Crunch cloud dispatcher by supporting a list of subnet IDs in Containers.CloudVMs.DriverParameters.SubnetIDs. If an attempt to create a compute node fails because a subnet is full, the dispatcher will retry the request in the next subnet in the list, cycling through it as needed. #20755
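The subnet fallback can be sketched as follows. This is an illustration of the cycling idea only, not the dispatcher's code: SubnetFullError, create_instance_with_fallback, and fake_create are hypothetical names, and this sketch tries each subnet at most once per call.

```python
class SubnetFullError(Exception):
    """Hypothetical error meaning the chosen subnet has no free addresses."""


def create_instance_with_fallback(subnet_ids, create_instance, start=0):
    """Try each subnet once, beginning at index `start` and wrapping around.

    Returns (instance, index_of_subnet_used). Re-raises the last
    SubnetFullError if every subnet in the list is full.
    """
    if not subnet_ids:
        raise ValueError("subnet_ids must not be empty")
    last_error = None
    for offset in range(len(subnet_ids)):
        index = (start + offset) % len(subnet_ids)
        try:
            return create_instance(subnet_ids[index]), index
        except SubnetFullError as err:
            last_error = err
    raise last_error


def fake_create(subnet_id):
    """Stand-in for a cloud API call: only subnet-c has room."""
    if subnet_id != "subnet-c":
        raise SubnetFullError(subnet_id)
    return f"instance-in-{subnet_id}"


instance, used = create_instance_with_fallback(
    ["subnet-a", "subnet-b", "subnet-c"], fake_create)
print(instance, used)  # instance-in-subnet-c 2
```

A caller could pass the returned index back as `start` on the next request, so that later attempts begin with the subnet that last succeeded.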

Improved the responsiveness of container update cascades (such as cancelling a running workflow with many children) by optimizing the SQL and reducing back and forth between the database and the API server. #20457, #20529, #20472

Improved the performance of keep-web when users are writing many small files. #20559

Improved the performance of keep-web’s S3 API when listing directories with more than 1,000 items. #20726

Improved the responsiveness of keep-web under heavy cluster load by using a shorter timeout on API requests. #20425

The default Containers.CloudVMs.SupervisorFraction for the Crunch cloud dispatcher has been changed to 0.50 (50%) to allow more progress when the dispatcher just started and there are many workflows waiting to run. #20894

API

The /arvados/v1/groups/contents API supports a select parameter like many others. Clients can use this to request only the fields they need and reduce load on the API server. #20470
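As a rough sketch of what such a request could look like, the snippet below builds a URL for this endpoint using only the Python standard library. The host name and project UUID are placeholders, the contents_url helper is hypothetical, and passing the project as a uuid query parameter with select encoded as a JSON array are assumptions of this sketch; check the API reference for your cluster's version before relying on them.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def contents_url(api_host, project_uuid, fields):
    """Build a groups/contents URL that asks for only the listed fields.

    Assumption: `select` is sent as a JSON-encoded array of field names.
    """
    query = urlencode({
        "uuid": project_uuid,
        "select": json.dumps(fields),
    })
    return f"https://{api_host}/arvados/v1/groups/contents?{query}"


# Placeholder host and project UUID, for illustration only.
url = contents_url("zzzzz.example.com", "zzzzz-j7d0g-0123456789abcde",
                   ["uuid", "name"])

# Decode the query string again to confirm what the server would receive.
decoded = parse_qs(urlsplit(url).query)
selected = json.loads(decoded["select"][0])
print(selected)  # ['uuid', 'name']
```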

An API token with usage limited by scopes can always make a request to GET /arvados/v1/api_client_authorizations/current (that is, the token can always retrieve its own record). This makes authentication across clusters in a federation more reliable. #20750

When the API server is authenticating a remote user, if it fails to get the current user record from the original cluster for any reason, but already has a record of the user in its local database, it will use that local record for the session. This improves the reliability of authentication across clusters in a federation, and makes it easier to issue tokens with more limited scopes (e.g., tokens intended for collection sharing). #20750
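The fallback described above amounts to the following logic, shown here as a minimal sketch. The names RemoteLookupError, resolve_remote_user, and the dict used as a local user table are all hypothetical. The real API server's data model and error handling are not shown in this announcement.

```python
class RemoteLookupError(Exception):
    """Hypothetical error for any failure to reach the user's home cluster."""


def resolve_remote_user(user_uuid, fetch_from_home_cluster, local_users):
    """Prefer the home cluster's record; fall back to the local copy on failure."""
    try:
        return fetch_from_home_cluster(user_uuid)
    except RemoteLookupError:
        cached = local_users.get(user_uuid)
        if cached is None:
            # No local record either, so authentication cannot proceed.
            raise
        return cached


def unreachable_home_cluster(user_uuid):
    """Stand-in for a remote call that fails."""
    raise RemoteLookupError(f"cannot reach home cluster for {user_uuid}")


# Placeholder UUID and record, for illustration only.
local_users = {"zzzzz-tpzed-0123456789abcde": {"username": "example"}}
record = resolve_remote_user(
    "zzzzz-tpzed-0123456789abcde", unreachable_home_cluster, local_users)
print(record["username"])  # example
```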

The API server accepts SSH public keys in any format recognized by OpenSSH. This means it accepts ECDSA and ED25519 keys in addition to RSA and DSA. (Note that Workbench 2 has a separate validation that has not been updated yet.) #20241

Optimized default values for several scale-related settings in the installer to account for changes in API server behavior. #20680

When a project that contains running container requests is trashed, any containers that are running to fulfill the request are cancelled. #20877

The API server returns a more appropriate status when a request for a collection by portable data hash receives a mix of error responses from different clusters in a federation, so clients have better information about whether or not they can retry the request. In particular, the return code is 404 Not Found if all clusters return 404; 422 Unprocessable Entity if all clusters return a 4xx error; or 502 Bad Gateway if any cluster returns a 5xx error. #20425
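The three rules above can be written as a small function. This is a sketch of the rules as stated, not the API server's code; combined_error_status is a hypothetical name, and the function assumes every cluster returned a 4xx or 5xx status.

```python
def combined_error_status(statuses):
    """Combine per-cluster error statuses into one status for the client."""
    if not statuses:
        raise ValueError("at least one cluster response is required")
    if any(500 <= s <= 599 for s in statuses):
        return 502  # Bad Gateway: at least one cluster had a server error
    if all(s == 404 for s in statuses):
        return 404  # Not Found: no cluster has the collection
    if all(400 <= s <= 499 for s in statuses):
        return 422  # Unprocessable Entity: mixed client errors
    raise ValueError("this sketch only handles 4xx and 5xx responses")


print(combined_error_status([404, 404]))  # 404
print(combined_error_status([404, 403]))  # 422
print(combined_error_status([404, 500]))  # 502
```

Because a response set containing any 5xx status can never be "all 4xx", the three rules do not overlap, so the order of the checks does not change the result.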

Improved the performance of API requests with a “property exists” filter by optimizing the underlying SQL query. #20858

Improved the performance of API requests that list collections by name by adding a database index on this column. In our experience this is a common user query. #14070

Crunch

Crunch supports the Linux kernel cgroups v2 API. You can now deploy Crunch on more modern distributions with full compute usage reporting without turning on the older cgroups v1 API. #17244

If a container request does not specify preemptible compute node instances, then Arvados will no longer reuse unfinished containers that used preemptible instances. This situation can occur when a user notices that preemptible instances are failing before Arvados finishes retrying, and resubmits their workflow with preemptible instance use disabled. The change ensures that Arvados runs new containers on reserved instances as the user intended, rather than reusing the preemptible containers that the user expects to fail. #20606
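The new reuse rule can be expressed as a simple predicate. The sketch below models only this one rule and none of the other reuse criteria. The Container class, its fields, and is_reusable are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# States treated as "unfinished" in this sketch.
UNFINISHED_STATES = {"Queued", "Locked", "Running"}


@dataclass
class Container:
    uuid: str
    state: str
    preemptible: bool  # whether the container used a preemptible instance


def is_reusable(candidate, request_allows_preemptible):
    """Reject unfinished preemptible containers for non-preemptible requests."""
    unfinished = candidate.state in UNFINISHED_STATES
    if unfinished and candidate.preemptible and not request_allows_preemptible:
        return False
    return True


running_spot = Container("example-1", "Running", preemptible=True)
finished_spot = Container("example-2", "Complete", preemptible=True)

print(is_reusable(running_spot, request_allows_preemptible=False))   # False
print(is_reusable(running_spot, request_allows_preemptible=True))    # True
print(is_reusable(finished_spot, request_allows_preemptible=False))  # True
```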

The Crunch cloud dispatcher’s internal concurrency limit more closely follows the known cloud quota, to avoid excess thrashing around the limit. A new configuration setting Containers.CloudVMs.InitialQuotaEstimate provides the initial value used by the dispatcher at startup. #20667

The Crunch cloud dispatcher waits longer after hitting a cloud quota limit, to reduce request thrashing and increase the chances that the next attempt to create a compute node will succeed. #20457

To help the user understand why a graph is empty, crunchstat-summary reports a warning when a category of statistics is not available from a container’s logs. This can occur when compute nodes are not configured with cgroup statistics accounting that Crunch can read. #20705

The arvados-server cloudtest diagnostic respects the Containers.CloudVMs.DeployPublicKey setting, so the test more closely mirrors Crunch’s own behavior. #20649

If the Crunch cloud dispatcher encounters an SSH authentication error, that is logged immediately to aid debugging, rather than waiting for the boot probe timeout. #20649

If the Crunch cloud dispatcher times out waiting for a successful boot probe on a newly created instance, it logs the last error in addition to error output from the boot probe command. It also suggests using arvados-server cloudtest to help diagnose the problem. #20649

Improved performance in the Crunch cloud dispatcher and reduced load on the API server by optimizing several queries. #20601

Improved performance in arvados-cwl-runner and reduced load on the API server by optimizing several queries. #20652

SDKs

The writeFile function in the R SDK has been extended to take collectionUUID and fileFormat arguments. This makes the function more extensible, allowing you to write to an existing collection rather than a new one, or to write a file with a particular format but nonstandard extension. Thanks to AnetaSta22 for this contribution. #20660

The collection list method in the Java SDK now supports the include_old_versions and include_trash arguments of the Arvados API. Thanks to Krzysztof Majewski for this contribution. #20664

Optimized the Python SDK’s thread pool that prefetches data from Keep to scale better when fetching from thousands of Collection objects. #20637

Added the arvados.api_resources module to the Python SDK. It documents the API provided by the Arvados API client object, like you get when you call arvados.api('v1'). This documentation should help developers make fuller use of the Python SDK. You can view the documentation on the web or in pydoc (e.g., run pydoc arvados.api_resources on a system with the Python SDK installed). #18799

Updated installation documentation for all the Arvados Python SDKs and tools to recommend installing inside a virtualenv as best practice following the adoption of PEP 668. #20543

Expanded installation documentation for Arvados client tools in the user guide. #20684

Deployment

The installer can reduce cluster downtime by performing rolling upgrades when a cluster is deployed with a load balancer and multiple controllers. #20680

The installer’s Terraform tools can deploy into existing cloud infrastructure (VPC, subnets, etc.) instead of creating a completely new stack. #20482

Administrators can configure which resources are managed by arvados-login-sync: user accounts, group memberships, SSH keys, and Arvados API tokens. Arvados clusters in environments that already have infrastructure to manage some of these resources can configure arvados-login-sync to disregard them and prevent conflicts. A flag for each resource is in the Users section of the configuration. #20663

Updated the default configuration for arvados-login-sync to avoid managing security-sensitive groups on Debian- and Red Hat-based distributions. If you are granting users access to groups like sudo or wheel through Arvados, you may need to configure Users.SyncIgnoredGroups with your own list. #20663

Improved the security of the installer by using a separate file to configure cluster secrets. This file can be managed in more secure environments to better protect these secrets during the deployment process. #20665

Added configuration options to the installer for administrators to adjust:

  • several nginx and Passenger settings that need to be tuned to match cluster size and load - #20468

  • how long Prometheus retains data - #20889

  • names of the Arvados PostgreSQL database, database role, Keep’s S3 bucket, and Keep’s IAM role - #20889

Improved scalability in the installer by configuring more nginx settings based on CONTROLLER_MAX_QUEUED_REQUESTS. #20594

Improved availability in the installer by configuring nginx to allow a few more connections than the API server is willing to handle. This ensures metrics are available even when the API server has no more capacity for requests. #20474

Expanded the installer documentation to cover different certificate modes, optional encryption of the TLS certificate key, and Keep’s S3 backends. The documentation for the previous manual rolling upgrade process has been removed now that the installer natively supports rolling upgrades. #20888, #20889

Improved the reporting of arvados-client diagnostics by extending the test container to make Arvados API requests. This lets users know if compute nodes have trouble making API requests. To do this, arvados-client builds a tiny Arvados Docker image to use to run the test container. If you cannot build this image in your environment, you can select what Docker image is used for diagnostics with the -docker-image option. #20612

API Deprecations

With the release of Arvados 2.7.0, we are formally announcing the deprecation of some older APIs. These are scheduled to be removed in a future major Arvados release. The following API resources and their associated endpoints are all deprecated:

  • jobs, job_tasks, pipeline_instances, and pipeline_templates: These resources were all used by the previous version of Crunch. They have been replaced by containers, container_requests, and workflows.
  • keep_disks: Replaced by keep_services.
  • nodes: This resource was used by the previous version of Crunch. Crunch now better integrates with the underlying dispatcher, so it no longer needs to duplicate this information.
  • repositories: This was meant to support workflow development in the previous version of Crunch. CWL workflows let you deploy software in container images, and arvados-cwl-runner records Git metadata for registered workflows, so this functionality is no longer useful.
  • humans, specimens, and traits: These resources were originally intended to hold metadata for specific kinds of samples. They have been replaced by project and collection properties, which are more flexible and can be enforced with metadata vocabularies.

In addition, when Arvados returns api_client_authorizations, the fields api_client_id, user_id, and default_owner_uuid are all deprecated. The first two are internal fields that are not useful to clients. default_owner_uuid has never been implemented and we have no plans to do so.

Updated the Arvados API documentation to announce these deprecations. #20840, #20951

Some classes and functions in the Python SDK that were built around these APIs, or have been replaced by new functionality in the Python standard library, have been deprecated as well. Calling them will emit a DeprecationWarning with a suggested alternative where possible. Their docstrings note this information too. #20839
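Python ignores DeprecationWarning by default unless it is triggered by code running in __main__, so these warnings can be easy to miss in library code and tests. The sketch below shows one way to surface them with the standard warnings module. The old_helper function is a hypothetical stand-in, not a real SDK function.

```python
import warnings


def old_helper():
    """Hypothetical stand-in for a deprecated SDK function."""
    warnings.warn(
        "old_helper is deprecated; use new_helper instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller
    )
    return "result"


# Record every DeprecationWarning raised while calling the old function.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", DeprecationWarning)
    value = old_helper()

messages = [
    str(w.message) for w in caught
    if issubclass(w.category, DeprecationWarning)
]
print(messages)  # ['old_helper is deprecated; use new_helper instead']
```

Running a script with python -W error::DeprecationWarning turns these warnings into exceptions instead, which can help find deprecated calls in a test suite.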

The Keep S3 driver version 2 became the default driver in Arvados 2.5.0. The version 1 driver has been removed completely from this release. #19620

Bug Fixes and Minor Enhancements

Fixed a websockets server bug which caused it to stop sending updates under high load. #20507

Improved the reporting of various services with a new configuration option SystemLogs.RequestQueueDumpDirectory. If a service is near its configured maximum of concurrent requests, it will write a JSON file to this directory with details about the request queue. This can help diagnose performance problems even when the problem is difficult to catch in real time. #20475

The browser back button correctly navigates to the previous panel after visiting a collection by portable data hash. #19793

The “Trash” view in Workbench shows all items in the trash, not only those owned by the current user. #20603

Fixed a Workbench 2 bug where certain users with “manage” permission on an object were not able to access the sharing UI. #20829

Filtering Processes on “Queued” status lists containers in both “Queued” and “Locked” state. #20845

When you launch a workflow or other process from Workbench 2, it is submitted with the usual default priority 500, rather than the lowest possible priority 1. #20882

Fixed an issue where the API controller could not serve its cached discovery document in some network configurations. Thanks to George Chlipala for contributing this fix. #20919

Fixed a bug where arvados-cwl-runner would crash with an IndexError message when there was exactly one file in a set of related inputs. #20462

The arvados-client shell command reads connection settings from ~/.config/arvados/settings.conf like other client tools. #20757

Documented the collection metadata property arv:workflowMain. #20374

Improved the scalability of the Crunch cloud dispatcher by recalculating the number of allowed supervisor containers after hitting a cloud concurrency limit. #20601, #20667

Fixed a bug where Crunch would retry containers for a workflow that had been cancelled. #20614

Fixed some consistency issues in the installer to prevent “unbound variable” errors. #20889

Dependency Updates and Development Improvements

Arvados 2.7.0 runs on Go 1.20.6 and Ruby 2.7.7. We also upgraded various libraries and services that Arvados works with. #20325, #20735

We publish Arvados packages that are built on Rocky 8. We expect these packages to be compatible with any distribution based on RHEL 8. Note the installer has not been updated to support these distributions yet; that work is coming in a future release. #20797, #20844, #20822, #20878

The web documentation for our Python SDK is built using pdoc, instead of its pdoc3 fork. #20853

Prevented some deprecation warnings coming from regular expressions and use of the pipes module in the Python SDK. #20343, #20710

Fixed a crash in arvados-docker-cleaner by updating the docker library to prevent dependency conflicts. #20754

arvados-docker-cleaner now uses version 1.35 of the Docker API to better match other Crunch tools. #20754

Improved the reliability of the arvados-client Debian package by declaring its dependency on fuse. This client tool has long depended on the FUSE library; this just lets the package manager know so the library can be installed if necessary. #20619