v1.1.4
June 10, 2026

API & Services

  • The Jobs Orchestration Service (JOS) now supports chunked file outputs, exposing a single logical output file per input regardless of how many segments a worker produces. Worker pods tag each segment result with a chunk_of field containing the chunk index, parent filename, and a final-chunk flag. Output URIs are rewritten to the authenticated /files/download endpoint, and files carry a job_output.status attribute (partial while in progress, complete when done) so users can track job state from the file library.
  • JOS now detects and reports stuck jobs through a new background sweep and an OTel jobs_stuck gauge. A job is now correctly marked RUNNING only when its first pod becomes Ready, so jobs queued without available GPU remain in ADMITTED rather than prematurely appearing active to users. The sweep detects two conditions — a job stuck in ADMITTED for over an hour and a RUNNING job with no emitted event or progress for over an hour — and exposes both counts via the new metric.
  • JOS workers now exit gracefully on job cancellation by catching SIGTERM, checkpointing the current segment, and terminating cleanly. The pod termination grace period has been raised from 30 seconds to 2 minutes by default to give workers sufficient time to complete a checkpoint before Kubernetes forcibly terminates the pod. The grace period is configurable per deployment.
  • A manual-generation JOS pipeline has been added for producing structured procedural documentation from video. The pipeline performs audio-grounded batch inference and is available as a new task type configured via a JOS seed deployed to the environment.
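
The stuck-job sweep described in this release can be pictured with a short Python sketch. It is illustrative only, not the JOS implementation: the Job record, its field names, and the count_stuck helper are hypothetical, and only the two conditions and the one-hour threshold come from the note above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

STUCK_AFTER = timedelta(hours=1)  # threshold stated in the release note


@dataclass
class Job:
    # Hypothetical record shape; the real JOS schema is not shown in these notes.
    job_id: str
    status: str               # e.g. "ADMITTED" or "RUNNING"
    status_since: datetime    # when the job entered its current status
    last_activity: datetime   # last emitted event or progress update


def count_stuck(jobs, now):
    """Return counts for the two stuck conditions described above."""
    counts = {"admitted": 0, "running": 0}
    for job in jobs:
        if job.status == "ADMITTED" and now - job.status_since > STUCK_AFTER:
            counts["admitted"] += 1
        elif job.status == "RUNNING" and now - job.last_activity > STUCK_AFTER:
            counts["running"] += 1
    return counts
```

A periodic task could call count_stuck and publish both values through the gauge; how the real sweep is scheduled is not stated here.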

Security & Authentication

  • API keys can now be created with an optional TTL, and expired keys are automatically marked as expired in the database on first use after expiry. A new --ttl-days flag on the atai iam key create and atai iam user key create CLI commands computes an expires_at timestamp at creation time and passes it through to the IAM service. When an expired key is rejected, its status transitions from active to expired. No schema migration is required because the expires_at column already exists on the api_keys table.
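
As a rough illustration of the TTL behavior, the sketch below computes an expiry timestamp from a day count and derives the status a key would have after a use attempt. The function names are hypothetical, and treating the exact expiry instant as already expired is an assumption; only the expires_at field and the active and expired statuses come from the note above.

```python
from datetime import datetime, timedelta, timezone


def compute_expires_at(ttl_days, now=None):
    """Return the expiry timestamp for a key, or None when no TTL is given."""
    if ttl_days is None:
        return None
    if ttl_days <= 0:
        raise ValueError("ttl_days must be positive")
    now = now or datetime.now(timezone.utc)
    return now + timedelta(days=ttl_days)


def status_on_use(status, expires_at, now):
    """Status a key should have after an authentication attempt at `now`."""
    # Assumption: a key counts as expired from the expiry instant onward.
    if status == "active" and expires_at is not None and now >= expires_at:
        return "expired"  # the key is rejected and marked on first use
    return status
```

Keys created without a TTL keep expires_at as None and are never transitioned by this check.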
v1.1.3
May 27, 2026

API & Services

  • Chunked files can now be downloaded via the existing /files/download endpoint, with content streamed directly from S3 in chunk order: The download path detects whether a file is chunked and assembles the response as a streaming byte sequence across all stored S3 objects, with no need for a separate endpoint or client-side logic.
  • A new SegmentedInferenceSession abstraction has been added to the JOS worker client, enabling automatic checkpointing, progress reporting, and resilient resume for batch inference jobs: Workers now iterate over “segments” (subsets of input files) rather than whole files, and the session handles checkpointing between segments, emitting file-started/completed events, and resuming from the last checkpoint on restart. Segment size is configurable with defaults of 50 MB of input or 10 minutes, whichever comes first. The Nano inference worker and the inference phase of the machine state job have been updated to use this abstraction.
  • Stale file uploads stuck in the UPLOADING state are now automatically cleaned up: A configurable scheduler (default: every 6 hours) aborts any upload older than a threshold (default: 24 hours) using the same logic as the /abort endpoint.
  • A new DatasetList port mode and a PortMode::Checkpoints type have been added to JOS to support dataset listing and checkpoint upload workflows: These additions enable workers to declare input/output ports that carry dataset lists and model checkpoints respectively, broadening the range of job types that can be expressed natively in JOS config. Checkpoint files are now staged under a non-tmpfs folder to avoid memory pressure.
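
The chunk-ordered download described above amounts to concatenating stored objects in chunk order without buffering the whole file. The following is a minimal sketch under that reading; stream_chunked_file, the (chunk_index, object_key) pair shape, and the fetch callable are hypothetical stand-ins for the data service's storage layer.

```python
from typing import Callable, Iterable, Iterator, Tuple


def stream_chunked_file(
    chunks: Iterable[Tuple[int, str]],
    fetch: Callable[[str], Iterable[bytes]],
) -> Iterator[bytes]:
    """Yield a file's bytes by reading its stored objects in chunk order.

    `chunks` holds (chunk_index, object_key) pairs; `fetch` streams one object.
    Both are illustrative assumptions, not the service's real interfaces.
    """
    for _, key in sorted(chunks):   # order by chunk_index, not arrival order
        yield from fetch(key)       # pass pieces through without buffering
```

Because the function is a generator, a web framework can hand it directly to a streaming response and memory use stays bounded by the size of one piece.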

Inference & Model Updates

  • The C2.6 model is now available with a new vLLM inference backend that delivers improved throughput over previous transformer-based inference: Multi-image support has been added to Direct Query and the Nano engine, and GPU-specific YAML configurations allow fine-grained tuning per model and hardware combination.
  • GPU memory budget calibration for C2.5 and C2.6 has been improved with a more accurate adaptive probe sweep across batch size and shape dimensions: The calibration now accounts for padding-aware batch peaks, shrinks the memory budget incrementally after each OOM, and warms up the engine at max probe shapes before calibrating. This reduces spurious OOM events during production batch inference workloads.
  • UFM training has been extended with variable-K Stage 1+2 resolution, pad_mode=interpolate retraining, Stage 3 sweep plumbing, and several stability fixes: A deterministic eval fix and a PEP 479 StopIteration correction were also merged, along with pinning flash-linear-attention to avoid a ~3× training slowdown introduced by a dependency update. The UFM builder and model directories were refactored to align with the factory pattern introduced in the updated Nano engine.
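
For readers unfamiliar with PEP 479, the general class of bug looks like the sketch below. It does not reproduce the UFM code, whose details are not given in these notes; both function names are made up for illustration. Since Python 3.7, a StopIteration that escapes inside a generator body is converted to RuntimeError rather than silently ending the generator.

```python
def first_items_buggy(iterables):
    # Pre-PEP 479 idiom: a bare next() inside a generator. On an empty
    # iterable next() raises StopIteration, which Python 3.7+ converts
    # to RuntimeError instead of quietly ending the generator.
    for it in iterables:
        yield next(iter(it))


def first_items_fixed(iterables):
    # Handle exhaustion explicitly so the behavior is a deliberate choice.
    for it in iterables:
        try:
            yield next(iter(it))
        except StopIteration:
            continue  # skip empty inputs; `return` would end the generator
```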

Observability & Telemetry

  • OTel SQL tracing has been extended to cover database transactions via a new TracedTransaction wrapper, and rolled out to the Lens API and Data services: This completes full OTel tracing coverage for data_service and lens_api_service; iam_service and jos_service will follow in subsequent releases.
  • New Python OtelMetricsBase and OtelWorkerMetricsBase base classes have been introduced, and the first batch of active platform services has been migrated from Prometheus to OpenTelemetry metrics: The new base classes use the global MeterProvider directly, dropping the per-service CollectorRegistry and update() chain used by the legacy Prometheus classes. Services migrated in this release include api_service, lens_service, and gpq_service; the previous service_info gauge is replaced by OTel resource attributes.
  • Prometheus business metrics in data_service, iam_service, jos_service, lens_api, health_service, and registry_service have been migrated to the OTel push model: This brings the Python service telemetry pipeline in line with the Rust services and enables unified metric collection through the OTLP exporter. Per-request org_id propagation has also been added to OTLP logs, and per-signal OTLP endpoint overrides are now supported for routing traces, metrics, and logs to different collectors.
  • New Grafana dashboard panels have been added for JOS job counts, node pool sizes, and total file counts and bytes, and an over-counting bug in the file panels has been fixed: The panels provide visibility into per-org file storage volume and JOS queue depth and have been synchronized with the cloud environment. A separate fix corrects an over-counting bug in the file count and bytes panels that was introduced during the initial rollout.
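
Per-signal endpoint overrides usually follow a simple precedence rule: a signal-specific setting wins, otherwise the shared endpoint applies. The sketch below uses the standard OpenTelemetry environment variable names; whether the platform reads exactly these variables is an assumption, and resolve_otlp_endpoints is a hypothetical helper.

```python
import os

SIGNALS = ("traces", "metrics", "logs")


def resolve_otlp_endpoints(env=None):
    """Map each signal to its endpoint: per-signal override first, then shared."""
    env = os.environ if env is None else env
    shared = env.get("OTEL_EXPORTER_OTLP_ENDPOINT")
    return {
        signal: env.get(f"OTEL_EXPORTER_OTLP_{signal.upper()}_ENDPOINT") or shared
        for signal in SIGNALS
    }
```

A signal whose value resolves to None has no endpoint configured at all, which a caller could treat as "export disabled".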

Console & UI

  • Filename search has been added to the files dashboard, backed by new database indexes for performant queries: Users can now filter the file list by partial filename match directly in the UI, and the underlying data service database has been updated with the necessary indexes to keep query latency low even for large file tables.
  • The Console login page, homepage, and top navigation have been migrated to the Archetype design system: Light/dark mode persistence is preserved.
  • The Console dashboard and workbench pages have been migrated to the design system: Dashboard pages now use design system primitives, and the workbench has been updated while keeping all existing functionality intact.

Bug Fixes

  • vLLM batch job result ordering and crash recovery have been fixed to ensure JSONL output is always in input order and that fatal engine failures are handled cleanly: When async vLLM workers complete out of order, results are now staged by submit position and flushed in contiguous runs, keeping output aligned with the original input. Per-record failure messages are sanitized to “inference error” so internal exception details never reach user-downloadable files. When the vLLM engine encounters a fatal error (such as CUDA out-of-memory), the worker now emits an inference.engine_dead event, cancels in-flight work, and exits cleanly so Kubernetes restarts the pod and JOS resumes from the last checkpoint.
  • A TOCTOU race condition and stale counter bug in chunked file writes have been fixed, and chunk data and S3 objects are now properly cleaned up on file deletion: Concurrent appenders previously could race on chunk_index assignment; the fix ensures all writes happen within a single locked transaction. Deleting a chunked file now removes all associated file_chunks rows and their corresponding S3 objects, preventing orphaned storage.
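
The ordering fix for vLLM batch output ("staged by submit position and flushed in contiguous runs") can be sketched as a small reorder buffer. This is one plausible shape for the technique, not the worker's actual code; OrderedResultWriter and its sink callable are hypothetical.

```python
class OrderedResultWriter:
    """Buffer out-of-order results and emit them in submit order."""

    def __init__(self, sink):
        self._sink = sink      # callable that receives each result, in order
        self._pending = {}     # submit position -> result awaiting its turn
        self._next = 0         # next submit position that may be written

    def add(self, position, result):
        self._pending[position] = result
        # Flush the contiguous run that starts at the next expected position.
        while self._next in self._pending:
            self._sink(self._pending.pop(self._next))
            self._next += 1
```

For example, if results arrive for positions 2, 0, 1, nothing is written after the first call, position 0 is written after the second, and positions 1 and 2 are written together after the third.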
v1.1.2
May 13, 2026

API & Services

  • file_type filter added to the list-files endpoint: A new query parameter allows callers to filter the file listing by MIME type or file type category, making it easier to retrieve only the files relevant to a specific workflow.
  • OpenAPI schema for /query expanded with full request/response documentation: The OpenAPI spec for the /query endpoint now includes complete request and response schemas, improving SDK generation quality and developer documentation.
  • ATAI_CA_BUNDLE_PATH environment variable renamed to ATAI_CA_BUNDLE: The legacy environment variable name was simplified for consistency. Deployments using the old name will need to update their configuration.
  • Per-org file count and byte total exposed as Prometheus metrics (PLDEV-784): Two new Prometheus gauges track total file count and aggregate storage usage per organization, enabling capacity planning and billing dashboards.
  • The jos seed apply command now bundles seed YAMLs (PLDEV-909): The jos seed apply CLI command was updated to ship seed YAML files directly with the binary, so applying default data seeds no longer requires a separate file distribution step.
  • Example automatic migration scripts added for 1.0.9 → 1.1.1 and 1.1.1 → 1.1.2: Reference migration examples were committed to the repository so operators have a concrete starting point for upgrading existing deployments across these version boundaries.

Observability & Telemetry

  • Full OpenTelemetry OTLP distributed tracing stack added to all Rust services: The atai_telemetry crate now wires a complete three-signal OTel pipeline — distributed traces, log records with trace/span correlation, and push metrics — all exported via OTLP/gRPC when OTEL_EXPORTER_OTLP_ENDPOINT is set. HTTP spans follow OTel semantic conventions via axum-tracing-opentelemetry, with W3C traceparent extraction and propagation across all four Rust HTTP services. When the endpoint is unset, the crate behaves exactly as before with zero OTel overhead.
  • OTLP log bridge and trace improvements added to console_2_service: The OTLP log bridge is now wired into console_2_service, and trace context handling has been improved, including always-on W3C trace ID generation and X-Request-Id reflection in responses. Static asset requests are excluded from tracing to reduce noise, and root HTTP span naming now uses METHOD /route/[param] with SvelteKit route IDs.
  • init_tracer() added to atai_py and wired into initialize_service_logging: Python services can now initialize OpenTelemetry tracing with a single call, automatically reading OTEL_EXPORTER_OTLP_ENDPOINT and configuring the tracer pipeline (PLDEV-801).
  • PLATFORM_VERSION is now propagated through logs and metrics: The platform version string is attached as a resource attribute on all telemetry signals, making it easier to correlate observability data with specific release versions.
  • Legacy x-trace-id header dropped in favor of the OTel traceparent standard: The proprietary x-trace-id request/response header has been removed and all tracing correlation now relies on the W3C traceparent header (PLDEV-797). Any tooling or dashboards relying on x-trace-id will need to be updated.
  • atai_telemetry_reqwest crate adds OTel HTTP client spans via reqwest-middleware: Outgoing HTTP calls made through reqwest now produce properly attributed OTel client spans (PLDEV-798), enabling end-to-end trace stitching for calls leaving Rust services.
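
Tooling that previously read x-trace-id will need to extract the trace ID from traceparent instead. The sketch below parses a version-00 header as defined by the W3C Trace Context specification; parse_traceparent and TraceContext are illustrative names, not part of any platform SDK.

```python
import re
from typing import NamedTuple, Optional

_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a version-00 W3C traceparent header; return None if invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    if version != "00":
        return None  # this sketch only handles version 00
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid per the spec
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))
```

The trace_id field is the value to use wherever a dashboard or log query previously keyed on x-trace-id.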

Security & Authentication

  • JWT exchange endpoint and JWKS added to the IAM service: A new POST /v1/iam/exchange endpoint accepts an API key and returns a signed RS256 JWT containing Archetype claims (subject, org ID, role, auth method), laying the foundation for moving from per-request API key validation to JWT-based auth. A companion GET /v1/iam/.well-known/jwks.json endpoint serves the RSA public key so downstream services can validate tokens locally. The existing /v1/iam/authenticate endpoint is unchanged for backward compatibility.
  • UPLOADING status is now exposed in the public file API (PLDEV-833): Previously the UPLOADING state was hidden from the public-facing file status endpoint; it is now surfaced so clients can accurately track in-progress uploads.

Data Integrity & Uploads

  • End-to-end server-driven checksum verification added to the direct upload flow (PLDEV-663): The server now selects the checksum algorithm at upload initiation and returns it in the InitiateUploadResponse; clients can optionally compute and submit a whole-file CRC32C in CompleteUploadRequest. If provided and mismatched, the file is marked CORRUPT and HTTP 422 is returned; if absent, the upload succeeds without verification, preserving backward compatibility. S3 CreateMultipartUpload is called with ChecksumType: FullObject so a single whole-object CRC32C is stored and retrievable via HeadObject.
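
A client that opts in to verification needs a whole-file CRC32C. Python's standard library has no CRC32C (zlib.crc32 uses a different polynomial), so the sketch below is a pure-Python, table-driven version for illustration; a production client would more likely use a native library. The helper names are hypothetical, and the exact wire format expected in CompleteUploadRequest is not stated in these notes; the base64 helper follows the encoding S3 uses for its own checksum values.

```python
import base64


def _make_table():
    # Reflected Castagnoli polynomial used by CRC32C.
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data: bytes, crc: int = 0) -> int:
    """Update a running CRC32C with `data`; start with crc=0."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def crc32c_base64(value: int) -> str:
    """Encode the checksum as base64 of its 4 big-endian bytes."""
    return base64.b64encode(value.to_bytes(4, "big")).decode("ascii")
```

Passing the previous result as the crc argument lets a client fold the checksum over a file in fixed-size reads instead of loading it into memory.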

Inference & Model Features

  • vLLM engine support added along with c26 improvements: The vLLM inference engine backend was integrated, including model weight caching and related configuration changes for the c26 hardware generation.
  • Stage 3 task classification expanded with MoteStrain and PAMAP2 datasets for UFM (fixed and variable): The UFM task classification pipeline now supports MoteStrain and PAMAP2 benchmarks across both fixed and variable input configurations.
  • Jobs now fail explicitly if any input fails (PLDEV-940): Previously a job could silently succeed even if one or more of its inputs errored; the job runner now propagates input failures and marks the overall job as failed.

Console & UI

  • Autocomplete label suggestions added to the n-shot file picker (PLDEV-715): As users enter class labels across n-shot files, a session-scoped vocabulary is built up and previously used labels are surfaced as autocomplete suggestions on subsequent inputs. The dropdown supports keyboard navigation (Arrow Up/Down, Enter, Escape), mouse selection, and auto-scrolls the active suggestion into view.
  • Error tooltip in the console stays visible longer and supports text copying: The tooltip that appears on API or service errors now remains on screen long enough to read and can have its content copied, improving the debugging experience for users (PLDEV-786).
  • Content-aware column widths applied to batch manifest tables: Columns in batch manifest views now size themselves based on their content rather than using fixed widths, improving readability for a wide range of payload shapes.
  • Progress chart tooltip is now flipped when near the right edge (PLDEV-889): The tooltip on progress charts was being clipped when the cursor was near the right boundary; it now flips to the left to stay fully visible.
  • “Pipeline” label renamed to “Task Type” throughout the console (PLDEV-905): The UI label used to describe processing pipelines has been standardized to “Task Type” for consistency with the rest of the product terminology.
  • MSJ broken image display fixed: A regression that caused broken image previews in the multi-sensor join UI was resolved.
  • Wrong progress counter in MSJ fixed (RES-272): A display bug that caused incorrect progress percentages to appear in the MSJ task view was corrected.
v1.1.1
April 29, 2026

New Features & Improvements

  • Added direct-to-cloud file upload support to the Rust and Python SDKs (PLDEV-535, PLDEV-16): Both the Rust and Python SDKs now support uploading files directly to cloud storage via presigned S3 URLs, bypassing the data service proxy. The Rust SDK introduces a new builder API (UploadBuilder) as the default path, with the proxy path still available via .using_proxy(); the Python SDK makes the direct path opt-in via use_proxy=False. Both implementations support concurrent multipart uploads with configurable worker counts, per-part retries with exponential backoff, progress callbacks, and cancellation. This enables uploads well beyond the proxy’s previous 500 MB size limit.
  • Added resumable upload support to the Rust and Python SDKs (PLDEV-551, PLDEV-778): Clients that fail mid-upload can now resume from where they left off without re-uploading already-completed parts. A new server-side checkpoint endpoint stores completed part tokens, and the initiate call accepts a resume_if_started flag to reuse an in-progress upload for the same file. On the Rust SDK, resuming is enabled via .with_resume(true) on the upload builder, with checkpointing on by default; on the Python SDK, resuming is enabled by passing allow_resume=True to upload(), with checkpointing also on by default. If a progress callback is set, it will be called once with the already-uploaded byte count when a resume occurs.
  • Increased the maximum upload file size to 250 GB: The platform-wide maximum file size for uploads has been raised to 250 GB to accommodate large dataset and model artifact transfers.
  • Added a session validation step to the Workbench (PLDEV-195, PLDEV-600): The Workbench now sends an explicit session.validate event at the start of every lens session before entering the active streaming state. If validation fails, users receive a clear error notification even when the session log panel is collapsed, and error log entries are highlighted in red for quick visual identification. This addresses a recurring issue where heartbeat timeouts would cause sessions to continue in a degraded state without surfacing clear feedback to users.
  • Added a read-only selected file name field to the Workbench lens tray (PLDEV-609): The lens tray settings panel for the Activity Monitor and Machine State lenses now displays the name of the currently selected input file below the model version field. When no file has been chosen, the field shows “Not selected.”
  • Limited CSV table and graph rendering to 10,000 rows in the Workbench to prevent freezes on large datasets (PLDEV-186): When an uploaded CSV file exceeds 10,000 rows, the Workbench now truncates the preview display and shows a banner informing users of the total row count and directing them to the API for full data access. CSV data is now truncated at the line-split stage before full parsing occurs, preventing memory pressure from very large files.
  • Improved Workbench output panel autoscroll behavior (PLDEV-350): The Workbench output panel now auto-scrolls to the latest response by default but pauses when the user manually scrolls up to review earlier results. A NEWEST button appears when the user has scrolled away from the bottom, and clicking it jumps back to the latest output and resumes auto-scrolling. Autoscroll state is reset at the start of each new session.
  • Added drag-and-drop bulk file upload to the File Manager (ATAI-2938): Users can now upload files to the File Manager by dragging and dropping them directly onto the file list page, or by using a new modal-based upload dialog triggered from the “Add files” button. The upload dialog shows per-file progress, supports individual file cancellation, and retains failed upload placeholders in the list so users can see which uploads did not complete.
  • Added the Batch Manager to the Console with live job creation, listing, and detail pages (PLDEV-575, PLDEV-576, PLDEV-604, PLDEV-689, PLDEV-695): A new Batch Manager page is available in the Console, allowing users to create and monitor batch jobs submitted to the Job Orchestration Service, and to download job artifacts. See the Batch Manager documentation for details.
  • Added Python JOS clients for API access and job container use (PLDEV-348): Two new Python clients for the Job Orchestration Service have been added. JosApiClient covers all 20 REST API endpoints (jobs, components, pipelines) with an async-first design and structured error handling. JosWorkerClient is for use inside JOS-managed pods and provides InputPort/OutputPort abstractions for reading JSONL manifests from S3, uploading outputs, reporting progress via Redis, and saving and restoring checkpoints.
  • Synced CSV window_size and step_size from the Workbench lens tray to the input stream config (PLDEV-629): When a user updates the window_size or step_size fields in the CSV lens tray, the values are now propagated to the underlying input stream config so that inference window boundaries correctly reflect the configured parameters.
  • Added structured error responses to the data service (PLDEV-591): The data service now returns structured, machine-readable error response bodies across its endpoints, replacing unstructured text errors.
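
To make the resumable-upload behavior concrete, the sketch below works out which parts still need uploading and how many bytes are already stored, which is the figure a progress callback would receive once on resume. It is not SDK code: plan_resume and the shape of the checkpoint data (a set of completed part numbers) are assumptions.

```python
import math


def plan_resume(file_size, part_size, completed_parts):
    """Return (remaining part numbers, bytes already uploaded).

    Part numbers are 1-based, as in S3 multipart uploads. `completed_parts`
    is the set of part numbers recorded by the checkpoint (hypothetical shape).
    """
    total_parts = math.ceil(file_size / part_size)
    remaining, uploaded = [], 0
    for number in range(1, total_parts + 1):
        # The final part may be shorter than part_size.
        length = min(part_size, file_size - (number - 1) * part_size)
        if number in completed_parts:
            uploaded += length
        else:
            remaining.append(number)
    return remaining, uploaded
```

With a 25-byte file, 10-byte parts, and parts 1 and 3 already checkpointed, only part 2 remains and 15 bytes are reported as done.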

Bug Fixes

  • Fixed lens worker crash loops from missing model_parameters and stale Redis events (PLDEV-610): A cascading failure was identified where missing model_parameters in a lens config caused a KeyError crash on worker nodes, and stale Redis queue events from crashed sessions caused nodes to re-enter crash loops on restart. Five targeted fixes prevent the KeyError, drain stale Redis events on node restart, skip unknown-session events, and garbage-collect stuck sessions.
  • Fixed narrator memory not being initialized for direct model.query calls without a video stream (PLDEV-632): When model.query was called without a preceding stream.start event — for example, for non-video file inputs — the narrator’s internal memory and buffer fields were uninitialized, causing failures.
  • Fixed deleted files returning stale data from the Files API (PLDEV-614, PLDEV-605): The GET /files/metadata/{id} endpoint now returns 404 Not Found for deleted files instead of the old record with an “unknown” status, and the download endpoint also returns 404 instead of 400 for deleted files. Re-uploading a file to a previously deleted file key is now permitted.
  • Fixed multipart upload completion not accepting unsorted parts (PLDEV-670): The /files/uploads/{id}/complete endpoint no longer requires the uploaded parts list to be submitted in sorted order.
  • Fixed the window_size and step_size values not being returned by the console-2-service (PLDEV-600): The backend was not including window_size and step_size in its responses, causing the lens tray to display stale or missing values after a session was initialized.
  • Fixed the console CSV config tooltip width and wrapping: The tooltip for CSV configuration settings in the console was overflowing its container; width and text-wrapping constraints have been applied.
  • Fixed the console-2-service to correctly apply schema default values for config templates: The console frontend now correctly applies schema-defined default values when rendering config template fields, rather than leaving them empty.
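
The unsorted-parts fix (PLDEV-670) matters because S3's CompleteMultipartUpload requires parts in ascending part-number order, so something has to sort them before the call. One way a server might normalize the list is sketched below; normalize_parts and the (part_number, etag) pair shape are hypothetical, and the actual fix may differ.

```python
def normalize_parts(parts):
    """Sort (part_number, etag) pairs and reject duplicate or invalid numbers."""
    ordered = sorted(parts, key=lambda part: part[0])
    numbers = [number for number, _ in ordered]
    if len(set(numbers)) != len(numbers):
        raise ValueError("duplicate part number")
    if numbers and numbers[0] < 1:
        raise ValueError("part numbers start at 1")
    return ordered
```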
v1.1.0
April 15, 2026

New Features & Improvements

  • Added the Fine-Tuning Node to run fine-tuning jobs on dedicated GPUs: The new Fine-Tuning Node offers an API to create, manage, and monitor fine-tuning jobs for your organization. Each Fine-Tuning Node runs a fine-tuning job on its assigned GPU, acting like a worker that trains a model using the provided dataset and configuration. This produces a fine-tuned model and training metrics.
  • Added the Batch Manager to the Console with live job creation and listing (PLDEV-575, PLDEV-576): A new Batch Manager page is now available in the Console (behind the CONSOLE_FLAG_SHOW_BATCH_MANAGER feature flag), allowing users to create and monitor batch jobs submitted to the Job Orchestration Service. The Batch Manager supports creating jobs with file input selection and YAML/JSON config validation, displays submitted jobs with status badges and relative timestamps, and fetches job and pipeline data server-side.
  • Added a session validation step to the Workbench (PLDEV-195, PLDEV-600): The Workbench now runs an explicit session.validate event at the start of every lens session before entering the active streaming state. If validation fails, users receive a clear error notification even when the session log panel is collapsed, and error log entries are highlighted in red for quick identification.
  • Added a read-only selected file name field to the Workbench lens tray (PLDEV-609): The lens tray settings panel for the Activity Monitor and Machine State lenses now displays the name of the currently selected input file below the model version field. When no file has been chosen, the field shows “Not selected.”
  • Limited CSV table and graph rendering to 10,000 rows in the Workbench to prevent freezes on large datasets (PLDEV-186): When an uploaded CSV file exceeds 10,000 rows, the Workbench now truncates the preview display and shows a banner informing users of the total row count and directing them to the API for full data access.
  • Improved Workbench output panel autoscroll behavior (PLDEV-350): The Workbench output panel now auto-scrolls to the latest response by default but pauses scrolling when the user manually scrolls up to review earlier results. A “NEWEST” button appears when the user has scrolled away from the bottom, allowing them to jump back to the latest output and resume auto-scrolling. Autoscroll state is reset at the start of each new session.
  • Added drag-and-drop bulk file upload to the File Manager (ATAI-2938): Users can now upload files to the File Manager by dragging and dropping them directly onto the file list page or by using the new modal-based upload dialog. The upload dialog supports multiple concurrent file uploads, individual cancellation, progress tracking per file, and persists failed upload placeholders in the file list so users can see which uploads did not complete.
  • Implemented the Machine State Job (RES-226): A new batch job for machine state classification has been added, built on the fine-tuned Omega 1.3 model. The job can run on both CPU and GPU, processes sensor CSV data, and supports n-shot input files for healthy and faulty reference examples.
  • Synced CSV window_size and step_size from the Workbench lens tray to the input stream config (PLDEV-629): When a user updates the window_size or step_size fields in the CSV lens tray configuration, the values are now propagated to the underlying input stream config so that the inference window boundaries correctly reflect the configured parameters.
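
The effect of window_size and step_size on inference window boundaries can be illustrated with a small sliding-window calculation. This is a generic sketch, not the platform's input stream logic: window_bounds is a hypothetical helper, and dropping a trailing partial window is an assumption.

```python
def window_bounds(n_rows, window_size, step_size):
    """Return [start, end) row ranges for full windows over n_rows rows."""
    if window_size <= 0 or step_size <= 0:
        raise ValueError("window_size and step_size must be positive")
    last_start = n_rows - window_size  # negative when the data is too short
    return [
        (start, start + window_size)
        for start in range(0, last_start + 1, step_size)
    ]
```

With 10 rows, a window_size of 4, and a step_size of 3, the windows cover rows 0 to 4, 3 to 7, and 6 to 10 (end-exclusive), so consecutive windows overlap by one row.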

Bug Fixes

  • Fixed narrator memory not being initialized for direct model.query calls without a video stream: When model.query was called without a preceding stream.start event (e.g., with non-video file inputs), the narrator’s memory and buffer fields were uninitialized, causing the lens service to fail.
  • Fixed deleted files returning stale data from the Files API (PLDEV-614, PLDEV-605): The GET /files/metadata/{id} endpoint now returns a 404 Not Found response when called on a deleted file, rather than returning the old file record with an “unknown” status. The GET /files/download/{id} endpoint similarly now returns 404 on deleted files instead of a 400 error. Additionally, the Files API now allows re-uploading a new file over a previously deleted file’s key.
  • Fixed the console CSV config tooltip width and wrapping: The tooltip for CSV configuration settings in the console was overflowing its container; width and text-wrapping constraints have been applied.