API & Services
- The Jobs Orchestration Service (JOS) now supports chunked file outputs, exposing a single logical output file per input regardless of how many segments a worker produces. Worker pods tag each segment result with a `chunk_of` field containing the chunk index, parent filename, and a final-chunk flag. Output URIs are rewritten to the authenticated `/files/download` endpoint, and files carry a `job_output.status` attribute (`partial` while in progress, `complete` when done) so users can track job state from the file library.
- JOS now detects and reports stuck jobs through a new background sweep and an OTel `jobs_stuck` gauge. A job is now correctly marked `RUNNING` only when its first pod becomes `Ready`, so jobs queued without available GPU remain in `ADMITTED` rather than prematurely appearing active to users. The sweep detects two conditions (a job stuck in `ADMITTED` for over an hour, and a `RUNNING` job with no emitted event or progress for over an hour) and exposes both counts via the new metric.
- JOS workers now exit gracefully on job cancellation by catching `SIGTERM`, checkpointing the current segment, and terminating cleanly. The pod termination grace period has been raised from 30 seconds to 2 minutes by default to give workers sufficient time to complete a checkpoint before Kubernetes forcibly terminates the pod. The grace period is configurable per deployment.
- A manual-generation JOS pipeline has been added for producing structured procedural documentation from video. The pipeline performs audio-grounded batch inference and is available as a new task type configured via a JOS seed deployed to the environment.
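The two stuck-job conditions described above can be illustrated with a small classifier. This is a minimal sketch only: the `Job` record, its field names, and the `count_stuck` helper are hypothetical and do not reflect the actual JOS schema or sweep implementation; only the one-hour threshold and the two states come from the notes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable

STUCK_AFTER = timedelta(hours=1)  # threshold stated in the release note


@dataclass
class Job:
    # Hypothetical record shape; the real JOS schema is not shown here.
    job_id: str
    state: str                   # e.g. "ADMITTED" or "RUNNING"
    state_entered_at: datetime   # when the job entered its current state
    last_activity_at: datetime   # last emitted event or progress update


def count_stuck(jobs: Iterable[Job], now: datetime) -> dict[str, int]:
    """Count jobs matching either stuck condition."""
    counts = {"admitted": 0, "running": 0}
    for job in jobs:
        if job.state == "ADMITTED" and now - job.state_entered_at > STUCK_AFTER:
            counts["admitted"] += 1
        elif job.state == "RUNNING" and now - job.last_activity_at > STUCK_AFTER:
            counts["running"] += 1
    return counts
```

A periodic sweep could feed these two counts into a gauge such as `jobs_stuck`, labelled by condition; how the real metric is labelled is not stated in these notes.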
Security & Authentication
- API keys can now be created with an optional TTL, and expired keys are automatically marked in the database on first use after expiry. A new `--ttl-days` flag on the `atai iam key create` and `atai iam user key create` CLI commands computes an `expires_at` timestamp at creation time and passes it through to the IAM service. When an expired key is rejected, its status transitions from `active` to `expired`. No schema migration is required as the `expires_at` column already existed on the `api_keys` table.
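The TTL behaviour above (expiry computed at creation, status flipped lazily on first use after expiry) can be sketched as follows. The function names and the dictionary standing in for an `api_keys` row are assumptions for illustration, not the IAM service's actual code.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def compute_expires_at(created_at: datetime, ttl_days: Optional[int]) -> Optional[datetime]:
    """Return the expiry timestamp, or None for a key without a TTL."""
    if ttl_days is None:
        return None
    return created_at + timedelta(days=ttl_days)


def check_key(key: dict, now: datetime) -> bool:
    """Validate a key record, marking it expired on first use after expiry.

    `key` is a hypothetical in-memory stand-in for an `api_keys` row.
    """
    if key["status"] != "active":
        return False
    expires_at = key.get("expires_at")
    if expires_at is not None and now >= expires_at:
        key["status"] = "expired"  # lazy transition: active -> expired
        return False
    return True
```

Marking the key on first rejected use avoids a background expiry job, at the cost of the stored status lagging behind until the key is next presented.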
API & Services
- Chunked files can now be downloaded via the existing `/files/download` endpoint, with content streamed directly from S3 in chunk order: The download path detects whether a file is chunked and assembles the response as a streaming byte sequence across all stored S3 objects, with no need for a separate endpoint or client-side logic.
- A new `SegmentedInferenceSession` abstraction has been added to the JOS worker client, enabling automatic checkpointing, progress reporting, and resilient resume for batch inference jobs: Workers now iterate over “segments” (subsets of input files) rather than whole files, and the session handles checkpointing between segments, emitting file-started/completed events, and resuming from the last checkpoint on restart. Segment size is configurable with defaults of 50 MB of input or 10 minutes, whichever comes first. The Nano inference worker and the inference phase of the machine state job have been updated to use this abstraction.
- Stale file uploads stuck in the `UPLOADING` state are now automatically cleaned up: A configurable scheduler (default: every 6 hours) aborts any upload older than a threshold (default: 24 hours) using the same logic as the `/abort` endpoint.
- A new `DatasetList` port mode and a `PortMode::Checkpoints` type have been added to JOS to support dataset listing and checkpoint upload workflows: These additions enable workers to declare input/output ports that carry dataset lists and model checkpoints respectively, broadening the range of job types that can be expressed natively in JOS config. Checkpoint files are now staged under a non-tmpfs folder to avoid memory pressure.
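As an illustration of the chunk-ordered download described above, the sketch below yields a logical file's bytes by walking its stored objects in index order. The row shape (`chunk_index`, `s3_key`) and the `fetch_object` callable are assumptions standing in for the real chunk table and S3 client.

```python
from typing import Callable, Iterator


def stream_chunked_file(
    chunks: list[dict],
    fetch_object: Callable[[str], bytes],
) -> Iterator[bytes]:
    """Yield a logical file's bytes by reading stored objects in chunk order.

    `chunks` holds hypothetical rows with `chunk_index` and `s3_key`;
    `fetch_object` stands in for an S3 GetObject call.
    """
    for row in sorted(chunks, key=lambda r: r["chunk_index"]):
        yield fetch_object(row["s3_key"])
```

Because the generator fetches one object at a time, the response can be streamed without holding the whole file in memory, which matches the intent of assembling the download server-side.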
Inference & Model Updates
- The C2.6 model is now available with a new vLLM inference backend that delivers improved throughput over previous transformer-based inference: Multi-image support has been added to Direct Query and the Nano engine, and GPU-specific YAML configurations allow fine-grained tuning per model and hardware combination.
- GPU memory budget calibration for C2.5 and C2.6 has been improved with a more accurate adaptive probe sweep across batch size and shape dimensions: The calibration now accounts for padding-aware batch peaks, shrinks the memory budget incrementally after each OOM, and warms up the engine at max probe shapes before calibrating. This reduces spurious OOM events during production batch inference workloads.
- UFM training has been extended with variable-K Stage 1+2 resolution, `pad_mode=interpolate` retraining, Stage 3 sweep plumbing, and several stability fixes: A deterministic eval fix and a PEP 479 `StopIteration` correction were also merged, along with pinning `flash-linear-attention` to avoid a ~3× training slowdown introduced by a dependency update. The UFM builder and model directories were refactored to align with the factory pattern introduced in the updated Nano engine.
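For readers unfamiliar with PEP 479: since Python 3.7, a `StopIteration` that escapes from inside a generator body is converted to `RuntimeError` instead of silently ending the generator. The UFM fix itself is not shown in these notes; the generic example below only illustrates the class of bug, with hypothetical function names.

```python
def first_items_buggy(iterables):
    """Pre-PEP 479 style: a bare next() can leak StopIteration."""
    for it in iterables:
        yield next(iter(it))  # an empty input raises StopIteration here


def first_items_fixed(iterables):
    """Handle exhaustion explicitly so the generator ends cleanly."""
    for it in iterables:
        try:
            yield next(iter(it))
        except StopIteration:
            return  # stop at the first empty input
```

With an empty iterable in the input, consuming `first_items_buggy` raises `RuntimeError`, while `first_items_fixed` simply stops.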
Observability & Telemetry
- OTel SQL tracing has been extended to cover database transactions via a new `TracedTransaction` wrapper, and rolled out to the Lens API and Data services: This completes full OTel tracing coverage for `data_service` and `lens_api_service`; `iam_service` and `jos_service` will follow in subsequent releases.
- New Python `OtelMetricsBase` and `OtelWorkerMetricsBase` base classes have been introduced, and the first batch of active platform services has been migrated from Prometheus to OpenTelemetry metrics: The new base classes use the global `MeterProvider` directly, dropping the per-service `CollectorRegistry` and `update()` chain used by the legacy Prometheus classes. Services migrated in this release include `api_service`, `lens_service`, and `gpq_service`; the previous `service_info` gauge is replaced by OTel resource attributes.
- Prometheus business metrics in `data_service`, `iam_service`, `jos_service`, `lens_api`, `health_service`, and `registry_service` have been migrated to the OTel push model: This brings the Python service telemetry pipeline in line with the Rust services and enables unified metric collection through the OTLP exporter. Per-request `org_id` propagation has also been added to OTLP logs, and per-signal OTLP endpoint overrides are now supported for routing traces, metrics, and logs to different collectors.
- New Grafana dashboard panels have been added for JOS job counts, node pool sizes, total file counts and bytes, and the sync has been improved to include fixes for over-counting: The panels provide visibility into per-org file storage volume and JOS queue depth and were synchronized with the cloud environment. A separate fix corrects an over-counting bug in the file count and bytes panels introduced during the initial rollout.
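Per-signal endpoint overrides typically resolve with a simple precedence rule: a signal-specific setting wins, otherwise the generic endpoint applies. The sketch below uses the standard OpenTelemetry environment variable names; whether the platform reads exactly these variables is an assumption, and `resolve_endpoint` is a hypothetical helper.

```python
from typing import Mapping, Optional

GENERIC = "OTEL_EXPORTER_OTLP_ENDPOINT"
# Standard OTel per-signal variables; their use here is an assumption.
PER_SIGNAL = {
    "traces": "OTEL_EXPORTER_OTLP_TRACES_ENDPOINT",
    "metrics": "OTEL_EXPORTER_OTLP_METRICS_ENDPOINT",
    "logs": "OTEL_EXPORTER_OTLP_LOGS_ENDPOINT",
}


def resolve_endpoint(signal: str, env: Mapping[str, str]) -> Optional[str]:
    """Prefer the per-signal override, else fall back to the generic endpoint."""
    return env.get(PER_SIGNAL[signal]) or env.get(GENERIC)
```

Passing `os.environ` as `env` resolves against the process environment; returning `None` when nothing is set mirrors the "no endpoint, no export" behaviour described elsewhere in these notes.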
Console & UI
- Filename search has been added to the files dashboard, backed by new database indexes for performant queries: Users can now filter the file list by partial filename match directly in the UI, and the underlying data service database has been updated with the necessary indexes to keep query latency low even for large file tables.
- The Console login page, homepage, and top navigation have been migrated to the Archetype design system: Light/dark mode persistence is preserved.
- The Console dashboard and workbench pages have been migrated to the design system: Dashboard pages now use design system primitives, and the workbench has been updated while keeping all existing functionality intact.
Bug Fixes
- vLLM batch job result ordering and crash recovery have been fixed to ensure JSONL output is always in input order and that fatal engine failures are handled cleanly: When async vLLM workers complete out of order, results are now staged by submit position and flushed in contiguous runs, keeping output aligned with the original input. Per-record failure messages are sanitized to “inference error” so internal exception details never reach user-downloadable files. When the vLLM engine encounters a fatal error (such as CUDA out-of-memory), the worker now emits an `inference.engine_dead` event, cancels in-flight work, and exits cleanly so Kubernetes restarts the pod and JOS resumes from the last checkpoint.
- A TOCTOU race condition and stale counter bug in chunked file writes have been fixed, and chunk data and S3 objects are now properly cleaned up on file deletion: Concurrent appenders previously could race on `chunk_index` assignment; the fix ensures all writes happen within a single locked transaction. Deleting a chunked file now removes all associated `file_chunks` rows and their corresponding S3 objects, preventing orphaned storage.
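The "stage by submit position, flush in contiguous runs" approach from the vLLM ordering fix can be sketched with a small buffer. `OrderedFlusher` is a hypothetical name; the worker's real implementation is not shown in these notes.

```python
from typing import Any, Dict, List


class OrderedFlusher:
    """Stage results by submit position and emit them in input order."""

    def __init__(self) -> None:
        self._staged: Dict[int, Any] = {}
        self._next = 0  # next submit position that may be written

    def add(self, position: int, result: Any) -> List[Any]:
        """Stage one result; return the contiguous run now ready to write."""
        self._staged[position] = result
        ready = []
        while self._next in self._staged:
            ready.append(self._staged.pop(self._next))
            self._next += 1
        return ready
```

A result that arrives early is held until every earlier position has arrived, so the lines appended to the JSONL output always follow the input order, and memory use is bounded by how far completions run ahead of the slowest request.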
API & Services
- `file_type` filter added to the `list-files` endpoint: A new query parameter allows callers to filter the file listing by MIME type or file type category, making it easier to retrieve only the files relevant to a specific workflow.
- OpenAPI schema for `/query` expanded with full request/response documentation: The OpenAPI spec for the `/query` endpoint now includes complete request and response schemas, improving SDK generation quality and developer documentation.
- `ATAI_CA_BUNDLE_PATH` environment variable renamed to `ATAI_CA_BUNDLE`: The legacy environment variable name was simplified for consistency. Deployments using the old name will need to update their configuration.
- Per-org file count and byte total exposed as Prometheus metrics (PLDEV-784): Two new Prometheus gauges track total file count and aggregate storage usage per organisation, enabling capacity planning and billing dashboards.
- The `jos seed apply` command now bundles seed YAMLs (PLDEV-909): The `jos seed apply` CLI command was updated to ship seed YAML files directly with the binary, so applying default data seeds no longer requires a separate file distribution step.
- Example automatic migration scripts added for 1.0.9 → 1.1.1 and 1.1.1 → 1.1.2: Reference migration examples were committed to the repository so operators have a concrete starting point for upgrading existing deployments across these version boundaries.
Observability & Telemetry
- Full OpenTelemetry OTLP distributed tracing stack added to all Rust services: The `atai_telemetry` crate now wires a complete three-signal OTel pipeline (distributed traces, log records with trace/span correlation, and push metrics), all exported via OTLP/gRPC when `OTEL_EXPORTER_OTLP_ENDPOINT` is set. HTTP spans follow OTel semantic conventions via `axum-tracing-opentelemetry`, with W3C `traceparent` extraction and propagation across all four Rust HTTP services. When the endpoint is unset, the crate behaves exactly as before with zero OTel overhead.
- OTLP log bridge and trace improvements added to `console_2_service`: This change wires the OTLP log bridge into `console_2_service` and improves trace context handling, including always-on W3C trace ID generation and `X-Request-Id` reflection in responses. Static asset requests are excluded from tracing to reduce noise, and root HTTP span naming now uses `METHOD /route/[param]` with SvelteKit route IDs.
- `init_tracer()` added to `atai_py` and wired into `initialize_service_logging`: Python services can now initialize OpenTelemetry tracing with a single call, automatically reading `OTEL_EXPORTER_OTLP_ENDPOINT` and configuring the tracer pipeline (PLDEV-801).
- `PLATFORM_VERSION` is now propagated through logs and metrics: The platform version string is attached as a resource attribute on all telemetry signals, making it easier to correlate observability data with specific release versions.
- Legacy `x-trace-id` header dropped in favor of the OTel `traceparent` standard: The proprietary `x-trace-id` request/response header has been removed and all tracing correlation now relies on the W3C `traceparent` header (PLDEV-797). Any tooling or dashboards relying on `x-trace-id` will need to be updated.
- `atai_telemetry_reqwest` crate adds OTel HTTP client spans via `reqwest-middleware`: Outgoing HTTP calls made through `reqwest` now produce properly attributed OTel client spans (PLDEV-798), enabling end-to-end trace stitching for calls leaving Rust services.
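Tooling that previously read `x-trace-id` will need to extract the trace ID from `traceparent` instead. The sketch below parses a version-00 header as defined by the W3C Trace Context specification; the `TraceParent` tuple and `parse_traceparent` helper are illustrative names, not part of any platform SDK.

```python
import re
from typing import NamedTuple, Optional

# version-traceid-parentid-flags, all lowercase hex
_TRACEPARENT = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a version-00 traceparent value; return None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None or match["version"] != "00":
        return None
    # All-zero IDs are explicitly invalid in the W3C Trace Context spec.
    if set(match["trace_id"]) == {"0"} or set(match["parent_id"]) == {"0"}:
        return None
    sampled = bool(int(match["flags"], 16) & 0x01)
    return TraceParent(match["trace_id"], match["parent_id"], sampled)
```

This sketch deliberately rejects versions other than `00`; a production parser may want to follow the specification's forward-compatibility rules for newer versions.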
Security & Authentication
- JWT exchange endpoint and JWKS added to the IAM service: A new `POST /v1/iam/exchange` endpoint accepts an API key and returns a signed RS256 JWT containing Archetype claims (subject, org ID, role, auth method), laying the foundation for moving from per-request API key validation to JWT-based auth. A companion `GET /v1/iam/.well-known/jwks.json` endpoint serves the RSA public key so downstream services can validate tokens locally. The existing `/v1/iam/authenticate` endpoint is unchanged for backward compatibility.
- `UPLOADING` status is now exposed in the public file API (PLDEV-833): Previously the `UPLOADING` state was hidden from the public-facing file status endpoint; it is now surfaced so clients can accurately track in-progress uploads.
Data Integrity & Uploads
- End-to-end server-driven checksum verification added to the direct upload flow (PLDEV-663): The server now selects the checksum algorithm at upload initiation and returns it in the `InitiateUploadResponse`; clients can optionally compute and submit a whole-file CRC32C in `CompleteUploadRequest`. If provided and mismatched, the file is marked `CORRUPT` and HTTP 422 is returned; if absent, the upload succeeds without verification, preserving backward compatibility. S3 `CreateMultipartUpload` is called with `ChecksumType: FullObject` so a single whole-object CRC32C is stored and retrievable via `HeadObject`.
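To make the verification rule concrete, here is a dependency-free sketch of a whole-file CRC32C and the accept/reject decision. The bitwise CRC is slow and meant only for illustration (real clients would use an accelerated library); `verify_upload`, the `"COMPLETE"` status, and the `200` code are hypothetical, while `CORRUPT` and HTTP 422 come from the note above.

```python
from typing import Optional


def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-32C (Castagnoli); pass the previous value to continue."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # 0x82F63B78 is the reflected Castagnoli polynomial
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def verify_upload(stored_crc: int, client_crc: Optional[int]) -> tuple[str, int]:
    """Return a hypothetical (file status, HTTP status) pair."""
    if client_crc is None:
        return ("COMPLETE", 200)  # no checksum submitted: skip verification
    if client_crc != stored_crc:
        return ("CORRUPT", 422)
    return ("COMPLETE", 200)
```

Because `crc32c` accepts a running value, a client can fold in each part as it is read (`crc = crc32c(part, crc)`) and still arrive at the whole-file checksum without a second pass over the data.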
Inference & Model Features
- vLLM engine support added along with c26 improvements: The vLLM inference engine backend was integrated, including model weight caching and related configuration changes for the c26 hardware generation.
- Stage 3 task classification expanded with MoteStrain and PAMAP2 datasets for UFM (fixed and variable): The UFM task classification pipeline now supports MoteStrain and PAMAP2 benchmarks across both fixed and variable input configurations.
- Jobs now fail explicitly if any input fails (PLDEV-940): Previously a job could silently succeed even if one or more of its inputs errored; the job runner now propagates input failures and marks the overall job as failed.
Console & UI
- Autocomplete label suggestions added to the n-shot file picker (PLDEV-715): As users enter class labels across n-shot files, a session-scoped vocabulary is built up and previously used labels are surfaced as autocomplete suggestions on subsequent inputs. The dropdown supports keyboard navigation (Arrow Up/Down, Enter, Escape), mouse selection, and auto-scrolls the active suggestion into view.
- Error tooltip in the console stays visible longer and supports text copying: The tooltip that appears on API or service errors now remains on screen long enough to read and can have its content copied, improving the debugging experience for users (PLDEV-786).
- Content-aware column widths applied to batch manifest tables: Columns in batch manifest views now size themselves based on their content rather than using fixed widths, improving readability for a wide range of payload shapes.
- Progress chart tooltip is now flipped when near the right edge (PLDEV-889): The tooltip on progress charts was being clipped when the cursor was near the right boundary; it now flips to the left to stay fully visible.
- “Pipeline” label renamed to “Task Type” throughout the console (PLDEV-905): The UI label used to describe processing pipelines has been standardized to “Task Type” for consistency with the rest of the product terminology.
- MSJ broken image display fixed: A regression that caused broken image previews in the multi-sensor join UI was resolved.
- Wrong progress counter in MSJ fixed (RES-272): A display bug that caused incorrect progress percentages to appear in the MSJ task view was corrected.
New Features & Improvements
- Added direct-to-cloud file upload support to the Rust and Python SDKs (PLDEV-535, PLDEV-16): Both the Rust and Python SDKs now support uploading files directly to cloud storage via presigned S3 URLs, bypassing the data service proxy. The Rust SDK introduces a new builder API (`UploadBuilder`) as the default path, with the proxy path still available via `.using_proxy()`; the Python SDK makes the direct path opt-in via `use_proxy=False`. Both implementations support concurrent multipart uploads with configurable worker counts, per-part retries with exponential backoff, progress callbacks, and cancellation. This enables uploads well beyond the proxy’s previous 500 MB size limit.
- Added resumable upload support to the Rust and Python SDKs (PLDEV-551, PLDEV-778): Clients that fail mid-upload can now resume from where they left off without re-uploading already-completed parts. A new server-side checkpoint endpoint stores completed part tokens, and the initiate call accepts a `resume_if_started` flag to reuse an in-progress upload for the same file. On the Rust SDK, resuming is enabled via `.with_resume(true)` on the upload builder, with checkpointing on by default; on the Python SDK, resuming is enabled by passing `allow_resume=True` to `upload()`, with checkpointing also on by default. If a progress callback is set, it will be called once with the already-uploaded byte count when a resume occurs.
- Increased the maximum upload file size to 250 GB: The platform-wide maximum file size for uploads has been raised to 250 GB to accommodate large dataset and model artifact transfers.
- Added a session validation step to the Workbench (PLDEV-195, PLDEV-600): The Workbench now sends an explicit `session.validate` event at the start of every lens session before entering the active streaming state. If validation fails, users receive a clear error notification even when the session log panel is collapsed, and error log entries are highlighted in red for quick visual identification. This addresses a recurring issue where heartbeat timeouts would cause sessions to continue in a degraded state without surfacing clear feedback to users.
- Added a read-only selected file name field to the Workbench lens tray (PLDEV-609): The lens tray settings panel for the Activity Monitor and Machine State lenses now displays the name of the currently selected input file below the model version field. When no file has been chosen, the field shows “Not selected.”
- Limited CSV table and graph rendering to 10,000 rows in the Workbench to prevent freezes on large datasets (PLDEV-186): When an uploaded CSV file exceeds 10,000 rows, the Workbench now truncates the preview display and shows a banner informing users of the total row count and directing them to the API for full data access. CSV data is now truncated at the line-split stage before full parsing occurs, preventing memory pressure from very large files.
- Improved Workbench output panel autoscroll behavior (PLDEV-350): The Workbench output panel now auto-scrolls to the latest response by default but pauses when the user manually scrolls up to review earlier results. A NEWEST button appears when the user has scrolled away from the bottom, and clicking it jumps back to the latest output and resumes auto-scrolling. Autoscroll state is reset at the start of each new session.
- Added drag-and-drop bulk file upload to the File Manager (ATAI-2938): Users can now upload files to the File Manager by dragging and dropping them directly onto the file list page, or by using a new modal-based upload dialog triggered from the “Add files” button. The upload dialog shows per-file progress, supports individual file cancellation, and retains failed upload placeholders in the list so users can see which uploads did not complete.
- Added the Batch Manager to the Console with live job creation, listing, and detail pages (PLDEV-575, PLDEV-576, PLDEV-604, PLDEV-689, PLDEV-695): A new Batch Manager page is available in the Console, allowing users to create and monitor batch jobs submitted to the Job Orchestration Service, and to download job artifacts. See the Batch Manager documentation for details.
- Added Python JOS clients for API access and job container use (PLDEV-348): Two new Python clients for the Job Orchestration Service have been added. `JosApiClient` covers all 20 REST API endpoints (jobs, components, pipelines) with an async-first design and structured error handling. `JosWorkerClient` is for use inside JOS-managed pods and provides `InputPort`/`OutputPort` abstractions for reading JSONL manifests from S3, uploading outputs, reporting progress via Redis, and saving and restoring checkpoints.
- Synced CSV `window_size` and `step_size` from the Workbench lens tray to the input stream config (PLDEV-629): When a user updates the `window_size` or `step_size` fields in the CSV lens tray, the values are now propagated to the underlying input stream config so that inference window boundaries correctly reflect the configured parameters.
- Added structured error responses to the data service (PLDEV-591): The data service now returns structured, machine-readable error response bodies across its endpoints, replacing unstructured text errors.
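The resume behaviour in the SDK notes above (skip completed parts, report already-uploaded bytes once) can be sketched as a planning step. `plan_resume` and its parameters are hypothetical; the set of completed part numbers stands in for whatever the checkpoint endpoint actually returns.

```python
from typing import Callable, Optional


def plan_resume(
    file_size: int,
    part_size: int,
    completed_parts: set[int],
    on_progress: Optional[Callable[[int], None]] = None,
) -> list[int]:
    """Return the 1-based part numbers that still need uploading.

    `completed_parts` stands in for the part tokens a checkpoint
    endpoint would return; the real SDK types are not shown here.
    """
    total_parts = -(-file_size // part_size)  # ceiling division
    already_uploaded = 0
    remaining = []
    for part in range(1, total_parts + 1):
        offset = (part - 1) * part_size
        size = min(part_size, file_size - offset)  # last part may be short
        if part in completed_parts:
            already_uploaded += size
        else:
            remaining.append(part)
    if on_progress is not None and already_uploaded:
        on_progress(already_uploaded)  # reported once, before new parts start
    return remaining
```

For a 25-byte file with 10-byte parts where parts 1 and 3 are already stored, only part 2 is returned and the callback receives 15, so a progress bar can start from the right position instead of zero.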
Bug Fixes
- Fixed lens worker crash loops from missing `model_parameters` and stale Redis events (PLDEV-610): A cascading failure was identified where missing `model_parameters` in a lens config caused a `KeyError` crash on worker nodes, and stale Redis queue events from crashed sessions caused nodes to re-enter crash loops on restart. Five targeted fixes prevent the `KeyError`, drain stale Redis events on node restart, skip unknown-session events, and garbage-collect stuck sessions.
- Fixed narrator memory not being initialized for direct `model.query` calls without a video stream (PLDEV-632): When `model.query` was called without a preceding `stream.start` event (for example, for non-video file inputs), the narrator’s internal memory and buffer fields were uninitialized, causing failures.
- Fixed deleted files returning stale data from the Files API (PLDEV-614, PLDEV-605): The `GET /files/metadata/{id}` endpoint now returns 404 Not Found on deleted files instead of the old record with an “unknown” status, and the download endpoint also returns 404 instead of 400 on deleted files. Re-uploading a file to a previously deleted file key is now permitted.
- Fixed multipart upload completion not accepting unsorted parts (PLDEV-670): The `/files/uploads/{id}/complete` endpoint no longer requires the uploaded parts list to be submitted in sorted order.
- Fixed the `window_size` and `step_size` values not being returned by the console-2-service (PLDEV-600): The backend was not including `window_size` and `step_size` in its responses, causing the lens tray to display stale or missing values after a session was initialized.
- Fixed the console CSV config tooltip width and wrapping: The tooltip for CSV configuration settings in the console was overflowing its container; width and text-wrapping constraints have been applied.
- Fixed the console-2-service to correctly apply schema default values for config templates: The console frontend now correctly applies schema-defined default values when rendering config template fields, rather than leaving them empty.
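One plausible way to accept unsorted parts on completion is to normalize the list server-side before finalizing, since S3's `CompleteMultipartUpload` expects parts in ascending order. The sketch below is an assumption about the approach, not the data service's actual code; the `part_number` and `etag` keys are hypothetical field names.

```python
def normalize_parts(parts: list[dict]) -> list[dict]:
    """Sort parts by part number and reject duplicates.

    The `part_number` and `etag` keys are hypothetical field names.
    """
    ordered = sorted(parts, key=lambda p: p["part_number"])
    numbers = [p["part_number"] for p in ordered]
    if len(numbers) != len(set(numbers)):
        raise ValueError("duplicate part numbers in completion request")
    return ordered
```

Concurrent uploaders naturally finish parts out of order, so sorting on the server removes a failure mode that clients would otherwise each have to guard against.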
New Features & Improvements
- Added the Fine-Tuning Node to perform fine tuning jobs on dedicated GPUs: The new Fine-Tuning Node offers an API to create, manage, and monitor fine-tuning jobs for your organization. Each Fine-Tuning Node runs a fine-tuning job on its assigned GPU, acting like a worker that trains a model using the provided dataset and configuration. This produces a fine-tuned model and training metrics.
- Added the Batch Manager to the Console with live job creation and listing (PLDEV-575, PLDEV-576): A new Batch Manager page is now available in the Console (behind the `CONSOLE_FLAG_SHOW_BATCH_MANAGER` feature flag), allowing users to create and monitor batch jobs submitted to the Job Orchestration Service. The Batch Manager supports creating jobs with file input selection and YAML/JSON config validation, displays submitted jobs with status badges and relative timestamps, and fetches job and pipeline data server-side.
- Added a session validation step to the Workbench (PLDEV-195, PLDEV-600): The Workbench now sends an explicit `session.validate` event at the start of every lens session before entering the active streaming state. If validation fails, users receive a clear error notification even when the session log panel is collapsed, and error log entries are highlighted in red for quick identification.
- Added a read-only selected file name field to the Workbench lens tray (PLDEV-609): The lens tray settings panel for the Activity Monitor and Machine State lenses now displays the name of the currently selected input file below the model version field. When no file has been chosen, the field shows “Not selected.”
- Limited CSV table and graph rendering to 10,000 rows in the Workbench to prevent freezes on large datasets (PLDEV-186): When an uploaded CSV file exceeds 10,000 rows, the Workbench now truncates the preview display and shows a banner informing users of the total row count and directing them to the API for full data access.
- Improved Workbench output panel autoscroll behavior (PLDEV-350): The Workbench output panel now auto-scrolls to the latest response by default but pauses scrolling when the user manually scrolls up to review earlier results. A “NEWEST” button appears when the user has scrolled away from the bottom, allowing them to jump back to the latest output and resume auto-scrolling. Autoscroll state is reset at the start of each new session.
- Added drag-and-drop bulk file upload to the File Manager (ATAI-2938): Users can now upload files to the File Manager by dragging and dropping them directly onto the file list page or by using the new modal-based upload dialog. The upload dialog supports multiple concurrent file uploads, individual cancellation, progress tracking per file, and persists failed upload placeholders in the file list so users can see which uploads did not complete.
- Implemented the Machine State Job (RES-226): A new batch job for machine state classification has been added, built on the fine-tuned Omega 1.3 model. The job can run on both CPU and GPU, processes sensor CSV data, and supports n-shot input files for healthy and faulty reference examples.
- Synced CSV `window_size` and `step_size` from the Workbench lens tray to the input stream config (PLDEV-629): When a user updates the `window_size` or `step_size` fields in the CSV lens tray configuration, the values are now propagated to the underlying input stream config so that the inference window boundaries correctly reflect the configured parameters.
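The 10,000-row CSV preview limit above can be illustrated by capping the input before it reaches the parser. The Workbench itself is a web frontend, so this Python version is only a language-neutral sketch; `preview_rows` is a hypothetical helper and it assumes no quoted field spans multiple lines.

```python
import csv
from itertools import islice
from typing import Iterable

MAX_PREVIEW_ROWS = 10_000  # limit stated in the release note


def preview_rows(lines: Iterable[str], limit: int = MAX_PREVIEW_ROWS):
    """Parse the header plus at most `limit` data rows, and count the rest.

    Returns (header, rows, total_rows). Lines past the limit are counted
    but never handed to the CSV parser.
    """
    it = iter(lines)
    head = list(islice(it, limit + 1))        # header + first `limit` rows
    extra = sum(1 for line in it if line.strip())
    parsed = list(csv.reader(head))
    header, rows = (parsed[0], parsed[1:]) if parsed else ([], [])
    return header, rows, len(rows) + extra
```

The returned total gives the banner its row count, while only the capped slice is parsed and rendered.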
Bug Fixes
- Fixed narrator memory not being initialized for direct `model.query` calls without a video stream: When `model.query` was called without a preceding `stream.start` event (e.g., with non-video file inputs), the narrator’s memory and buffer fields were uninitialized, causing the lens service to fail.
- Fixed deleted files returning stale data from the Files API (PLDEV-614, PLDEV-605): The `GET /files/metadata/{id}` endpoint now returns a `404 Not Found` response when called on a deleted file, rather than returning the old file record with an “unknown” status. The `GET /files/download/{id}` endpoint similarly now returns `404` on deleted files instead of a `400` error. Additionally, the Files API now allows re-uploading a new file over a previously deleted file’s key.
- Fixed the console CSV config tooltip width and wrapping: The tooltip for CSV configuration settings in the console was overflowing its container; width and text-wrapping constraints have been applied.