COG-8: Jobs

Status: Draft (Work in Progress)
Version: 0.2
Created: 2025-01-28
Updated: 2026-01-29
Authors: Mike Anderson
Work in Progress

This specification is under active development. Structure and details may change significantly based on implementation experience and community feedback.

This standard specifies Jobs - the execution tracking mechanism for operations on the Covia Grid. A Job represents a running process on the Grid, modelled as a pointer to an evolving chain of immutable state records.

Purpose

Jobs are the runtime counterpart to Operations. While an operation defines what can be executed, a job represents a specific execution of that operation. Jobs provide:

  • Asynchronous Execution: Clients submit work and track progress without blocking
  • Observable Lifecycle: Every job transitions through well-defined states
  • Error Handling: Structured error reporting for failed, rejected, or timed-out work
  • Interactive Workflows: Support for human-in-the-loop patterns where jobs pause for input or authorisation
  • Auditability: Job state chains provide a verifiable, immutable history of all computation performed on the Grid
  • Unified Data Model: Job state records share the same content-addressed, immutable structure as other Grid assets

Terminology

See COG-1: Architecture for definitions of Grid terminology including Job, Operation, Venue, and Client.

Key Concepts

Job as a Pointer to Process State

A Job is not a mutable record. It is a HEAD pointer to the latest entry in a chain of immutable state records. Each state transition appends a new immutable record to the chain, linked to its predecessor by a content-addressed hash.

Job HEAD ──▶ State(N)  ──prev──▶  State(N-1)  ──prev──▶  ...  ──prev──▶  State(0)
             (latest)                                                    (initial)

Each state record in the chain is an immutable, content-addressed value — structurally identical to an Artifact or Operation asset. The only mutable element is the HEAD pointer itself, which advances as the job progresses.

This design provides several properties:

  • Audit trail: The full execution history is preserved as a chain of immutable records, each cryptographically linked to its predecessor
  • Verifiability: Any participant can verify the integrity of a job's history by walking the chain and checking hashes
  • Lattice compatibility: Immutable state records merge naturally using union semantics in the Grid Lattice, eliminating the need for timestamp-based conflict resolution on job data
  • Structural sharing: When state records are stored as native lattice data structures, unchanged fields (e.g. input, op) are automatically deduplicated across state transitions via the lattice's Merkle tree structure

State Record Structure

Each state record in the chain contains:

| Field | Presence | Description |
| --- | --- | --- |
| status | REQUIRED | Current lifecycle status |
| prev | REQUIRED (except initial) | Content-addressed hash of the previous state record |
| op | Initial record | Operation identifier (Asset ID, name, or adapter reference) |
| input | Initial record | Input provided at job creation |
| output | Terminal (COMPLETE) | Result produced by the operation |
| error | Terminal (FAILED, REJECTED, CANCELLED) | Error description |
| message | OPTIONAL | Context about current state (e.g. what input is required) |
| updated | RECOMMENDED | Timestamp in milliseconds since Unix epoch |

Fields that have not changed since the previous state record MAY be omitted from subsequent records, since the full state can be reconstructed by walking the chain. However, implementations MAY include all fields in every record for convenience.
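As a minimal sketch of that reconstruction, the function below walks prev links from HEAD and keeps the newest value seen for each field. The store keys ("h0", "h1", "h2") are placeholder identifiers standing in for real content-addressed hashes, and resolve_state is a hypothetical helper name, not part of the specification.

```python
from typing import Dict, Optional


def resolve_state(store: Dict[str, dict], head: str) -> dict:
    """Walk prev links from HEAD, filling in fields omitted by later records."""
    resolved: dict = {}
    cursor: Optional[str] = head
    while cursor is not None:
        record = store[cursor]
        for key, value in record.items():
            if key != "prev":
                # Records are visited newest first, so the first value seen wins.
                resolved.setdefault(key, value)
        cursor = record.get("prev")
    return resolved


# Placeholder IDs; a real store would be keyed by content-addressed hashes.
store = {
    "h0": {"status": "PENDING", "prev": None, "op": "test:echo",
           "input": {"text": "hello"}, "updated": 1769683717706},
    "h1": {"status": "STARTED", "prev": "h0", "updated": 1769683717708},
    "h2": {"status": "COMPLETE", "prev": "h1",
           "output": {"text": "hello"}, "updated": 1769683717710},
}

state = resolve_state(store, "h2")
print(state["status"], state["op"], state["input"])
```

The resolved object carries op and input from the initial record even though the later records omit them.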

Example Chain

A simple echo operation produces a three-link chain:

State(2) (HEAD)
{
  "status": "COMPLETE",
  "prev": "0xdef...",
  "output": {"text": "hello"},
  "updated": 1769683717710
}
    │
    ▼ prev
State(1)
{
  "status": "STARTED",
  "prev": "0x123...",
  "updated": 1769683717708
}
    │
    ▼ prev
State(0) (initial)
{
  "status": "PENDING",
  "prev": null,
  "op": "test:echo",
  "input": {"text": "hello"},
  "updated": 1769683717706
}
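A chain like the one above could be built roughly as follows. This is an illustrative sketch only: record_id hashes canonical JSON with SHA3-256 as a stand-in for the Grid's actual binary encoding, and the Job class and its method names are hypothetical.

```python
import hashlib
import json
from typing import Dict, Optional


def record_id(record: dict) -> str:
    """Content-addressed ID of a record.

    Assumption: SHA3-256 over canonical JSON stands in for the real
    lattice encoding, which this sketch does not implement.
    """
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return "0x" + hashlib.sha3_256(canonical.encode("utf-8")).hexdigest()


class Job:
    """A mutable HEAD pointer over an append-only store of immutable records."""

    def __init__(self, op: str, job_input: dict, now_ms: int) -> None:
        self.store: Dict[str, dict] = {}
        self.head: Optional[str] = None
        self.append({"status": "PENDING", "op": op,
                     "input": job_input, "updated": now_ms})

    def append(self, fields: dict) -> str:
        # Link the new record to its predecessor (None for the initial record).
        record = {**fields, "prev": self.head}
        rid = record_id(record)
        self.store[rid] = record
        self.head = rid  # the HEAD pointer is the only thing that mutates
        return rid


job = Job("test:echo", {"text": "hello"}, 1769683717706)
job.append({"status": "STARTED", "updated": 1769683717708})
job.append({"status": "COMPLETE", "output": {"text": "hello"},
            "updated": 1769683717710})
print(job.head, job.store[job.head]["status"])
```

Existing records are never rewritten; each call to append only adds a record and moves HEAD.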

Verifiability of Job State

The chain model enables cryptographic verification of job execution history:

  1. Each state record is content-addressed. The record's identifier is derived from its content (see Asset ID Scheme below). Any modification to a record changes its identifier, making tampering detectable.

  2. Chain integrity. Each record contains the hash of its predecessor in the prev field. A verifier can walk the chain from HEAD to the initial state, confirming that no records have been inserted, removed, or modified.

  3. Cross-venue verification. Because state records are immutable and content-addressed, they can be safely replicated to other venues. A client can request the same job history from multiple venues and confirm consistency.

  4. Selective disclosure. A client can share a specific state record (e.g. the terminal COMPLETE record) along with its chain as proof that a computation occurred, without revealing the full job context.
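The chain-walking check in points 1 and 2 can be sketched as below. The hashing scheme (SHA3-256 over canonical JSON) is an assumption standing in for the real encoding, and verify_chain is a hypothetical helper, not an API defined by this specification.

```python
import hashlib
import json
from typing import Dict, Optional


def record_id(record: dict) -> str:
    # Assumption: canonical JSON + SHA3-256 stands in for the real encoding.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return "0x" + hashlib.sha3_256(canonical.encode("utf-8")).hexdigest()


def verify_chain(store: Dict[str, dict], head: str) -> bool:
    """Walk from HEAD to the initial record, re-deriving every identifier."""
    cursor: Optional[str] = head
    seen = set()
    while cursor is not None:
        record = store.get(cursor)
        if record is None or cursor in seen:
            return False  # missing predecessor, or a loop in the links
        if record_id(record) != cursor:
            return False  # content no longer matches its identifier
        seen.add(cursor)
        cursor = record.get("prev")
    return True


initial = {"status": "PENDING", "prev": None, "op": "test:echo"}
started = {"status": "STARTED", "prev": record_id(initial)}
store = {record_id(initial): initial, record_id(started): started}
head = record_id(started)

# Simulate tampering: same key, modified content.
tampered = dict(store)
tampered[record_id(initial)] = {**initial, "op": "test:other"}

print(verify_chain(store, head), verify_chain(tampered, head))
```

Because each prev value is itself a hash of the predecessor's content, changing any earlier record breaks the check at that record.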

Relationship to Assets

Job state records, Artifacts, and Operations are all immutable, content-addressed records stored in the lattice. They differ only in their metadata shape:

| Record Type | Distinguishing Field | Purpose |
| --- | --- | --- |
| Artifact | content | Immutable data |
| Operation | operation | Executable function definition |
| State Record | status + prev | Job execution state |

This unification means that job state records can be stored, replicated, and verified using the same mechanisms as any other asset on the Grid.

One-Shot Jobs and Multi-Turn Jobs

All jobs share the same lifecycle and data model, but differ in their interaction pattern:

One-shot jobs are created with an initial input, execute a single operation, and reach a terminal state. The state chain is short: PENDING → STARTED → COMPLETE (or FAILED). This is the model used by simple tool calls and operation invocations.

Multi-turn jobs cycle through interactive states, accepting additional input between processing steps. The state chain grows with each interaction: PENDING → STARTED → INPUT_REQUIRED → STARTED → INPUT_REQUIRED → ... → COMPLETE. This is the model used by conversational agents, orchestrated workflows, and any process requiring human-in-the-loop interaction.

Both types use the same job lifecycle, the same state chain structure, and the same observation mechanisms (polling, SSE). The difference is purely behavioural — a one-shot job's transition function produces a terminal state on the first invocation, while a multi-turn job's transition function produces an interactive state and waits for the next message.

Clients that require a synchronous result (e.g. a single tool call) simply wait for the terminal state — regardless of how many intermediate transitions occur. The job's internal interaction loop is opaque to a synchronous caller. See COG-9: Agent Messaging for the asynchronous message delivery mechanism that drives multi-turn interactions.

Agents as Persistent Jobs

A Job whose transition function produces a terminal state is a finite process — it runs to completion. A Job whose transition function never reaches a terminal state is an Agent — a persistent, stateful process that accepts repeated interactions.

In this model:

  • The state record holds the agent's current state (e.g. conversation history, workflow position)
  • The transition function is an Operation asset that defines how the agent processes input and produces the next state
  • Each interaction appends a new state record to the chain

Agent HEAD ──▶ State(N)     [status: INPUT_REQUIRED]
                  │
                  │  transition_fn(State(N), user_input) → State(N+1)
                  ▼
               State(N+1)   [status: INPUT_REQUIRED]

The INPUT_REQUIRED and AUTH_REQUIRED interactive statuses (defined below) naturally support agent interaction patterns. An agent waiting for the next user message is simply a job in INPUT_REQUIRED state. When input arrives, the transition function (an LLM call, orchestrator step, or any other operation) produces the next state.

The distinction between a Job and an Agent is purely behavioural — whether the process converges to a terminal state or not. The data model is identical.

See COG-11: Agent Lifecycle for the full agent lifecycle specification, including agent state, the run loop, transition functions, and the three-level architecture.

Specification

Job Lifecycle

Every job follows a lifecycle defined by its status. Status transitions are forward-only, with one exception: a job may move from an interactive state back to STARTED when it resumes, and may repeat that cycle any number of times before reaching a terminal state.

Status Values

| Status | Category | Description |
| --- | --- | --- |
| PENDING | Active | Job created, queued for execution |
| STARTED | Active | Job is currently executing |
| COMPLETE | Terminal | Job finished successfully with output |
| FAILED | Terminal | Job finished with an error |
| CANCELLED | Terminal | Job was cancelled by client or venue |
| REJECTED | Terminal | Job was rejected before execution (e.g. policy violation, invalid input) |
| TIMEOUT | Terminal | Job exceeded its time limit |
| PAUSED | Interactive | Job execution suspended, awaiting a resume signal |
| INPUT_REQUIRED | Interactive | Job requires additional input from the client to continue |
| AUTH_REQUIRED | Interactive | Job requires authorisation or credentials to continue |

Status Categories

Statuses fall into three categories:

  • Active (PENDING, STARTED): The job is progressing. Clients should poll or subscribe for updates.
  • Terminal (COMPLETE, FAILED, CANCELLED, REJECTED, TIMEOUT): The job has finished. No further status changes will occur. The job's output or error is available.
  • Interactive (PAUSED, INPUT_REQUIRED, AUTH_REQUIRED): The job is waiting for external action. The client must respond before the job can continue.
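The three categories can be captured in a small lookup, sketched here with hypothetical names (Category, STATUS_CATEGORY, is_terminal, needs_client_action) that are not defined by this specification.

```python
from enum import Enum


class Category(Enum):
    ACTIVE = "Active"
    TERMINAL = "Terminal"
    INTERACTIVE = "Interactive"


# Mapping taken directly from the status table above.
STATUS_CATEGORY = {
    "PENDING": Category.ACTIVE,
    "STARTED": Category.ACTIVE,
    "COMPLETE": Category.TERMINAL,
    "FAILED": Category.TERMINAL,
    "CANCELLED": Category.TERMINAL,
    "REJECTED": Category.TERMINAL,
    "TIMEOUT": Category.TERMINAL,
    "PAUSED": Category.INTERACTIVE,
    "INPUT_REQUIRED": Category.INTERACTIVE,
    "AUTH_REQUIRED": Category.INTERACTIVE,
}


def is_terminal(status: str) -> bool:
    """True once no further status changes can occur."""
    return STATUS_CATEGORY[status] is Category.TERMINAL


def needs_client_action(status: str) -> bool:
    """True when the job is waiting on the client."""
    return STATUS_CATEGORY[status] is Category.INTERACTIVE


print(is_terminal("TIMEOUT"), needs_client_action("AUTH_REQUIRED"))
```

An unknown status raises KeyError, which in this sketch is a deliberate choice so that unrecognised values are not silently treated as active.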

State Transitions

┌──────────┐          ┌──────────┐          ┌──────────┐
│ PENDING  │ ───────▶ │ STARTED  │ ───────▶ │ COMPLETE │
└──────────┘          └──────────┘          └──────────┘
      │                 │    ▲  │
      │                 │    │  │           ┌──────────┐
      │                 │    │  └─────────▶ │  FAILED  │
      ▼                 ▼    │              └──────────┘
┌──────────┐          ┌─────────────────┐
│ REJECTED │          │ PAUSED /        │
└──────────┘          │ INPUT_REQUIRED /│
(invalid input,       │ AUTH_REQUIRED   │
 policy violation)    └─────────────────┘

┌───────────┐
│ CANCELLED │ ◀── (from any non-terminal state)
└───────────┘

┌───────────┐
│  TIMEOUT  │ ◀── (from any non-terminal state)
└───────────┘

The permitted transitions are:

| From | To | Trigger |
| --- | --- | --- |
| PENDING | STARTED | Venue begins execution |
| PENDING | REJECTED | Venue rejects the job (policy, quota, invalid input) |
| PENDING | CANCELLED | Client cancels before execution begins |
| STARTED | COMPLETE | Operation finishes successfully |
| STARTED | FAILED | Operation encounters an error |
| STARTED | CANCELLED | Client cancels during execution |
| STARTED | TIMEOUT | Execution exceeds time limit |
| STARTED | PAUSED | Operation suspends execution |
| STARTED | INPUT_REQUIRED | Operation needs additional client input |
| STARTED | AUTH_REQUIRED | Operation needs client authorisation |
| PAUSED / INPUT_REQUIRED / AUTH_REQUIRED | STARTED | Client provides required input or authorisation |
| PAUSED / INPUT_REQUIRED / AUTH_REQUIRED | CANCELLED | Client cancels while paused |
| PAUSED / INPUT_REQUIRED / AUTH_REQUIRED | TIMEOUT | Paused job exceeds time limit |

Each transition appends a new immutable state record to the job's chain.
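One way a venue might enforce the table is a simple lookup, sketched below. It encodes only the transitions listed in the table above; PERMITTED and can_transition are hypothetical names.

```python
INTERACTIVE = ("PAUSED", "INPUT_REQUIRED", "AUTH_REQUIRED")

# From-status -> set of permitted to-statuses, as listed in the table.
PERMITTED = {
    "PENDING": {"STARTED", "REJECTED", "CANCELLED"},
    "STARTED": {"COMPLETE", "FAILED", "CANCELLED", "TIMEOUT", *INTERACTIVE},
}
for status in INTERACTIVE:
    PERMITTED[status] = {"STARTED", "CANCELLED", "TIMEOUT"}


def can_transition(current: str, target: str) -> bool:
    """Terminal statuses have no entry, so nothing is permitted from them."""
    return target in PERMITTED.get(current, set())


print(can_transition("PENDING", "STARTED"))   # permitted by the table
print(can_transition("COMPLETE", "STARTED"))  # terminal: never permitted
```

A venue would call such a check before appending a new state record, rejecting any append that is not in the table.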

Job Identification

A job is identified by a unique Job ID within the venue. The Job ID is assigned at creation time and remains stable throughout the job's lifecycle — it identifies the HEAD pointer, not any individual state record.

Job IDs are represented as hex strings. The format is implementation-defined, but MUST be unique within the venue.

Individual state records in the chain are identified by their content-addressed hash (see Asset ID Scheme).

Job Data

The current state of a job is represented as a JSON object when accessed via the API. This object reflects the latest state record in the chain, with all fields resolved (including inherited fields from earlier records).

id (REQUIRED)

The Job ID — the stable identifier for this job within the venue.

{
  "id": "0x12345678901234567890123456789012"
}

status (REQUIRED)

The current lifecycle status of the job. MUST be one of the status values defined above.

{
  "status": "STARTED"
}

operation

The identifier of the operation being executed (Asset ID, operation name, or adapter reference).

{
  "operation": "0x7a8b9c0d1e2f3a4b5c6d7e8f9a0b1c2d3e4f5a6b7c8d9e0f1a2b3c4d5e6f7a8b"
}

output (present when COMPLETE)

The result produced by the operation. The structure of the output is defined by the operation's output schema (see COG-7: Operations).

{
  "status": "COMPLETE",
  "output": {
    "result": "Success",
    "data": { "count": 42 }
  }
}

error (present when FAILED, REJECTED, or CANCELLED)

A human-readable error message describing why the job did not complete.

{
  "status": "FAILED",
  "error": "Connection to upstream service timed out"
}

message (OPTIONAL)

A general-purpose message providing context about the current state. Particularly useful for interactive states to describe what input or authorisation is needed.

{
  "status": "INPUT_REQUIRED",
  "message": "Please provide the API key for the target service"
}

input (OPTIONAL)

The input that was provided when the job was created.

Timestamp in milliseconds since Unix epoch when the job was created.

Timestamp in milliseconds since Unix epoch of the last status change.

Job Creation

Jobs are created when a client invokes an operation on a venue via POST /api/v1/invoke. See COG-7: Operations for the invocation model.

A venue MUST create a job for every accepted invocation request.

A venue MUST create an initial state record with status PENDING (or REJECTED if the job is immediately rejected).

A venue MAY immediately append a STARTED state record if execution begins synchronously.

Job Observation

Clients can observe job progress through two mechanisms:

Polling

Clients poll the job status via GET /api/v1/jobs/{id}. This returns the resolved job data from the latest state record, including current status, output (if complete), or error (if failed).

Clients SHOULD use exponential backoff when polling to avoid overloading the venue. Recommended parameters:

| Parameter | Value |
| --- | --- |
| Initial delay | 300ms |
| Backoff factor | 1.5x |
| Maximum delay | 10s |
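With the recommended parameters, the delay sequence can be generated as in this sketch. The function name backoff_delays is hypothetical, and the schedule omits jitter, which a client may wish to add.

```python
from itertools import islice
from typing import Iterator


def backoff_delays(initial_ms: float = 300.0,
                   factor: float = 1.5,
                   max_ms: float = 10_000.0) -> Iterator[float]:
    """Yield polling delays in milliseconds, growing geometrically up to a cap."""
    delay = initial_ms
    while True:
        yield min(delay, max_ms)
        delay = min(delay * factor, max_ms)


first_delays = list(islice(backoff_delays(), 4))
print(first_delays)  # [300.0, 450.0, 675.0, 1012.5]
```

With these values the delay reaches the 10s cap on the tenth poll and stays there.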

Server-Sent Events (SSE)

Clients can subscribe to real-time updates via GET /api/v1/jobs/{id}/sse. The venue pushes an event each time a new state record is appended to the chain.

SSE is preferred over polling for long-running jobs because it:

  • Reduces network overhead
  • Provides immediate notification of state changes
  • Avoids the latency inherent in polling intervals

Job History

Clients MAY retrieve the full state chain for a job via GET /api/v1/jobs/{id}/history. This returns the ordered sequence of state records from initial to latest, enabling audit and verification of the complete execution history.

Job Pause

Clients can pause a running job via PUT /api/v1/jobs/{id}/pause.

A venue MUST append a state record with status PAUSED if the job is in a non-terminal, non-paused state (PENDING, STARTED, INPUT_REQUIRED, AUTH_REQUIRED).

A venue MUST return 409 Conflict if the job is already in a terminal state or already paused.

Pausing suspends execution. The adapter is not re-invoked until the job is resumed.

Job Resume

Clients can resume a paused job via PUT /api/v1/jobs/{id}/resume.

A venue MUST append a state record with status STARTED if the job is in PAUSED state, and re-engage the adapter to continue execution.

A venue MUST return 409 Conflict if the job is not in PAUSED state.

Note: INPUT_REQUIRED and AUTH_REQUIRED jobs are resumed by delivering a message (see COG-9: Agent Messaging), not by calling the resume endpoint. The resume endpoint is specifically for PAUSED jobs.
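The pause and resume rules can be sketched as two pure functions over the current status. The 409 result follows the text above; the 200 success code and the function names are assumptions, since the specification does not state the success response.

```python
from typing import Tuple

TERMINAL = {"COMPLETE", "FAILED", "CANCELLED", "REJECTED", "TIMEOUT"}


def pause(status: str) -> Tuple[int, str]:
    """Return (http_status, resulting_job_status) for a pause request."""
    if status in TERMINAL or status == "PAUSED":
        return 409, status  # Conflict: nothing to pause
    return 200, "PAUSED"    # 200 is an assumed success code


def resume(status: str) -> Tuple[int, str]:
    """Return (http_status, resulting_job_status) for a resume request."""
    if status != "PAUSED":
        return 409, status  # the resume endpoint only applies to PAUSED jobs
    return 200, "STARTED"   # 200 is an assumed success code


print(pause("STARTED"), resume("PAUSED"), resume("INPUT_REQUIRED"))
```

A real venue would also append the corresponding state record and re-engage the adapter on resume; this sketch covers only the status decision.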

Job Cancellation

Clients can cancel a job via PUT /api/v1/jobs/{id}/cancel.

A venue MUST append a state record with status CANCELLED if the job is in a non-terminal state.

A venue MUST ignore cancel requests for jobs that are already in a terminal state.

Cancellation is best-effort — the underlying operation may have already produced side effects before the cancellation takes effect.

Job Deletion

Clients can remove a job record via PUT /api/v1/jobs/{id}/delete.

Deletion removes the HEAD pointer from the venue's job index. The immutable state records in the chain MAY be retained in lattice storage for audit purposes, or MAY be garbage collected at the venue's discretion.

Deletion does not affect any resources or outputs produced by the job.

Job Finality

Once a job reaches a terminal state, no further state records are appended to its chain:

  • The HEAD pointer MUST NOT advance beyond a terminal state record
  • The terminal state record's output or error fields MUST NOT be modified (they are immutable by construction)
  • The job record MUST remain available for querying until explicitly deleted

The immutable chain structure ensures that job records serve as reliable, verifiable audit evidence of computation performed on the Grid.

Lattice Storage

Job state records are stored in the Grid Lattice as immutable, content-addressed values. Because state records are immutable, they can be stored using union merge semantics — the same strategy used for assets — rather than timestamp-based merge.

The HEAD pointer (Job ID → latest state record hash) is the only mutable element and requires a lightweight mutable index within the venue's lattice state.

When state records are stored as native lattice data structures (rather than serialised JSON), the lattice provides automatic structural sharing: fields common across state records (such as input, op, and unchanged context) are deduplicated in the Merkle tree without any explicit compaction.
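Union merge over content-addressed records can be illustrated with plain dictionaries keyed by record ID. This is a conceptual sketch, not the lattice implementation; merge_stores and the placeholder IDs are hypothetical.

```python
from typing import Dict


def merge_stores(a: Dict[str, dict], b: Dict[str, dict]) -> Dict[str, dict]:
    """Union of two record stores keyed by content-addressed ID.

    Because an ID is derived from content, the same ID must always map to
    the same record, so no timestamp is needed to pick a winner.
    """
    merged = dict(a)
    for rid, record in b.items():
        if rid in merged and merged[rid] != record:
            raise ValueError(f"conflicting content for record {rid}")
        merged[rid] = record
    return merged


# Placeholder IDs stand in for real hashes.
venue_a = {"h0": {"status": "PENDING", "prev": None},
           "h1": {"status": "STARTED", "prev": "h0"}}
venue_b = {"h1": {"status": "STARTED", "prev": "h0"},
           "h2": {"status": "COMPLETE", "prev": "h1"}}

merged = merge_stores(venue_a, venue_b)
print(sorted(merged))  # ['h0', 'h1', 'h2']
```

The merge is commutative and idempotent, so replicas converge regardless of the order in which they exchange records. Only the HEAD pointer needs separate handling.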

Interactive Jobs

Interactive statuses (PAUSED, INPUT_REQUIRED, AUTH_REQUIRED) enable human-in-the-loop, multi-step workflows, and agent interaction patterns where execution cannot proceed without external action.

Use Cases

| Status | Use Case |
| --- | --- |
| PAUSED | Debugging breakpoint; operator-initiated suspension; rate limiting |
| INPUT_REQUIRED | Operation needs additional parameters not provided at invocation; multi-turn conversation flows; agent interaction |
| AUTH_REQUIRED | Operation needs credentials for a downstream service; consent or approval step |

Client Responsibilities

When a job enters an interactive state, the client SHOULD:

  1. Read the message field to understand what is required
  2. Deliver the requested input via POST /api/v1/jobs/{id} (see COG-9: Agent Messaging)
  3. Monitor for resumption to STARTED via polling or SSE

If a client cannot fulfil an interactive request, it SHOULD cancel the job rather than leaving it indefinitely paused.

Venues MAY impose timeouts on interactive states to prevent resource leaks from abandoned jobs.

Agent Interaction Pattern

For persistent agents (jobs that cycle through interactive states), the client interaction loop is:

1. Invoke operation → Job created (PENDING → STARTED → INPUT_REQUIRED)
2. Client sends message → Job resumes (STARTED → INPUT_REQUIRED)
3. Repeat step 2 for each interaction turn
4. Agent terminates → Job reaches terminal state (COMPLETE or FAILED)

The state chain grows with each interaction turn. Each turn's input and output are preserved as immutable records in the chain, providing a complete, verifiable interaction history.

Messages are delivered via COG-9: Agent Messaging, which provides a per-job message queue. Messages can be submitted at any time — including while the job is actively processing (STARTED). Queued messages are processed in order when the job is ready for input. This decoupling means clients do not need to wait for an interactive state before sending the next message.
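The decoupling described above can be pictured with a small in-memory model. This is only a sketch of the queueing behaviour as described in this section; the class JobMailbox and its methods are hypothetical, and the actual delivery mechanism is defined in COG-9.

```python
from collections import deque
from typing import Deque, List


class JobMailbox:
    """Toy model: messages queue at any time, one is consumed per input request."""

    def __init__(self) -> None:
        self.status = "STARTED"
        self.queue: Deque[str] = deque()
        self.processed: List[str] = []

    def submit(self, message: str) -> None:
        self.queue.append(message)  # accepted even while the job is STARTED
        self._drain()

    def set_status(self, status: str) -> None:
        self.status = status
        self._drain()

    def _drain(self) -> None:
        # When the job is ready for input, hand it the oldest queued message.
        if self.status == "INPUT_REQUIRED" and self.queue:
            self.processed.append(self.queue.popleft())
            self.status = "STARTED"


box = JobMailbox()
box.submit("first")                # job busy: message waits
box.submit("second")
box.set_status("INPUT_REQUIRED")   # consumes "first", job resumes
box.set_status("INPUT_REQUIRED")   # consumes "second", job resumes
print(box.processed, box.status)
```

Messages sent while the job is busy are not lost or reordered; they are consumed in submission order as the job becomes ready.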

Synchronous vs. Asynchronous Observation

The same job supports both synchronous and asynchronous interaction:

  • Synchronous callers (e.g. an MCP tools/call bridge) wait for the terminal state. The entire multi-turn interaction — including any intermediate message exchanges — is opaque to them. They submit the initial input and receive the final result.
  • Asynchronous callers (e.g. a conversational UI, an A2A agent) observe each state transition in real time via SSE, and submit messages via POST /api/v1/jobs/{id} as the interaction progresses.

Both patterns operate on the same job, the same state chain, and the same message queue. The choice is the client's, not the job's.
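A synchronous caller can be sketched as a loop that polls until a terminal status appears. The fetch function is injected so the example runs without a network; wait_for_terminal and its parameters are hypothetical, with defaults mirroring the recommended polling values.

```python
import time
from typing import Callable, List

TERMINAL = {"COMPLETE", "FAILED", "CANCELLED", "REJECTED", "TIMEOUT"}


def wait_for_terminal(fetch_job: Callable[[], dict],
                      sleep: Callable[[float], None] = time.sleep,
                      initial_s: float = 0.3,
                      factor: float = 1.5,
                      max_s: float = 10.0) -> dict:
    """Poll until the job is terminal, ignoring intermediate transitions."""
    delay = initial_s
    while True:
        job = fetch_job()  # e.g. the parsed body of GET /api/v1/jobs/{id}
        if job["status"] in TERMINAL:
            return job
        sleep(delay)
        delay = min(delay * factor, max_s)


# Simulated responses in place of real HTTP calls.
responses = iter([
    {"status": "PENDING"},
    {"status": "STARTED"},
    {"status": "INPUT_REQUIRED"},
    {"status": "COMPLETE", "output": {"text": "hello"}},
])
sleeps: List[float] = []
result = wait_for_terminal(lambda: next(responses), sleep=sleeps.append)
print(result["status"], len(sleeps))
```

The caller never inspects the interactive states; it relies on some other party delivering the messages that let the job progress. A production caller would likely add an overall deadline.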

Asset ID Scheme

Job state records, like all content-addressed data on the Grid, are identified by lattice Value IDs — the SHA3-256 hash of their canonical CAD003 binary encoding (see COG-5: Asset Metadata).

This means:

  • The lattice automatically computes Value IDs for all stored data structures, so state record identification is a zero-cost property of storage
  • Structural sharing and deduplication across state records are handled natively by the lattice
  • Unchanged fields across state transitions (such as input, op) are automatically deduplicated in the Merkle tree

Examples

Simple Invocation

Client                               Venue
  │                                    │
  │  POST /api/v1/invoke               │
  │  {"operation": "test:echo",        │
  │   "input": {"text": "hello"}}      │
  │ ─────────────────────────────────▶ │
  │                                    │ Create State(0): PENDING
  │  201 Created                       │ Create State(1): STARTED
  │  {"id": "0xabc...",                │
  │   "status": "PENDING"}             │
  │ ◀───────────────────────────────── │
  │                                    │ Execute → State(2): COMPLETE
  │  GET /api/v1/jobs/0xabc...         │
  │ ─────────────────────────────────▶ │
  │                                    │
  │  200 OK                            │
  │  {"id": "0xabc...",                │
  │   "status": "COMPLETE",            │
  │   "output": {"text": "hello"}}     │
  │ ◀───────────────────────────────── │

Cancellation

Client                               Venue
  │                                    │
  │  POST /api/v1/invoke               │
  │  {"operation": "long-running"}     │
  │ ─────────────────────────────────▶ │
  │                                    │
  │  201 Created                       │ State(0): PENDING
  │  {"id": "0xdef...",                │ State(1): STARTED
  │   "status": "PENDING"}             │
  │ ◀───────────────────────────────── │
  │                                    │
  │  PUT /api/v1/jobs/0xdef.../cancel  │
  │ ─────────────────────────────────▶ │
  │                                    │ State(2): CANCELLED
  │  200 OK                            │
  │  {"id": "0xdef...",                │
  │   "status": "CANCELLED",           │
  │   "error": "Job cancelled"}        │
  │ ◀───────────────────────────────── │

Interactive Job (Input Required)

Client                               Venue
  │                                    │
  │  POST /api/v1/invoke               │
  │  {"operation": "data-import"}      │
  │ ─────────────────────────────────▶ │
  │                                    │ State(0): PENDING
  │  201 Created                       │ State(1): STARTED
  │  {"id": "0x123...",                │
  │   "status": "PENDING"}             │
  │ ◀───────────────────────────────── │
  │                                    │ State(2): INPUT_REQUIRED
  │  GET /api/v1/jobs/0x123...         │
  │ ─────────────────────────────────▶ │
  │                                    │
  │  200 OK                            │
  │  {"id": "0x123...",                │
  │   "status": "INPUT_REQUIRED",      │
  │   "message": "Provide API key"}    │
  │ ◀───────────────────────────────── │
  │                                    │
  │  (Client provides input)           │
  │ ─────────────────────────────────▶ │ State(3): STARTED
  │                                    │ State(4): COMPLETE
  │  GET /api/v1/jobs/0x123...         │
  │ ─────────────────────────────────▶ │
  │                                    │
  │  200 OK                            │
  │  {"id": "0x123...",                │
  │   "status": "COMPLETE",            │
  │   "output": {"imported": 1500}}    │
  │ ◀───────────────────────────────── │

Agent Interaction (Multi-Turn)

Client                               Venue
  │                                    │
  │  POST /api/v1/invoke               │
  │  {"operation": "llm-agent",        │
  │   "input": {"prompt": "Hello"}}    │
  │ ─────────────────────────────────▶ │
  │                                    │ State(0): PENDING
  │  201 Created                       │ State(1): STARTED (LLM call)
  │  {"id": "0x456...",                │ State(2): INPUT_REQUIRED
  │   "status": "PENDING"}             │ (response + awaiting next turn)
  │ ◀───────────────────────────────── │
  │                                    │
  │  GET /api/v1/jobs/0x456...         │
  │ ─────────────────────────────────▶ │
  │                                    │
  │  200 OK                            │
  │  {"id": "0x456...",                │
  │   "status": "INPUT_REQUIRED",      │
  │   "output": {"response": "Hi!"},   │
  │   "message": "Awaiting input"}     │
  │ ◀───────────────────────────────── │
  │                                    │
  │  (Client sends next message)       │
  │ ─────────────────────────────────▶ │ State(3): STARTED (LLM call)
  │                                    │ State(4): INPUT_REQUIRED
  │  GET /api/v1/jobs/0x456...         │ (response + awaiting next turn)
  │ ─────────────────────────────────▶ │
  │                                    │
  │  200 OK                            │
  │  {"id": "0x456...",                │
  │   "status": "INPUT_REQUIRED",      │
  │   "output": {"response": "..."},   │
  │   "message": "Awaiting input"}     │
  │ ◀───────────────────────────────── │
  │                                    │
  │  ... (conversation continues)      │

Security Considerations

Resource Exhaustion

Jobs consume venue resources (memory, connections, compute). Venues SHOULD:

  • Impose limits on the number of concurrent jobs per client
  • Enforce execution timeouts for all jobs
  • Enforce timeouts on interactive states
  • Clean up resources promptly when jobs reach terminal states

Long-lived agent jobs (persistent interactive processes) require particular attention to resource management. Venues SHOULD impose maximum chain lengths or interaction counts for agent jobs.

Job Data Sensitivity

Job records may contain sensitive information in their inputs and outputs. Venues SHOULD:

  • Apply access control so that only the submitting client (or authorised parties) can query a job
  • Redact or encrypt sensitive fields in stored job records
  • Support job deletion to allow clients to remove HEAD pointers for records containing sensitive data

Cancellation Safety

Cancellation does not guarantee rollback. Operations may have produced side effects (network calls, data writes) before cancellation takes effect. Clients SHOULD NOT rely on cancellation as a mechanism for undoing work.

State Chain Integrity

The immutable chain structure provides tamper evidence but not tamper prevention. A malicious venue could fabricate a chain. Clients requiring strong guarantees SHOULD:

  • Verify chain integrity by walking prev links and checking content-addressed hashes
  • Cross-check job state with multiple independent venues
  • Use signed state records where available