Async Batch Processing

A payroll run is a deadline with no flexibility: tens of thousands of records arrive across SFTP drops, object stores, and vendor APIs, and they all have to land in one canonical schema before the bank file cuts. Async batch processing, part of the Multi-Format Payroll Data Ingestion & Normalization framework, is how that volume gets normalized inside the window without a thread-per-file architecture that either starves on blocking I/O or saturates a vendor’s rate limit. The discipline is narrow and specific: overlap the waiting — file fetches, database lookups, tax-table joins — while keeping every monetary transform exact, every write idempotent, and every routing decision recorded in an append-only trail that a Department of Labor auditor could replay.

This pattern sits at the normalization layer, downstream of structural validation and upstream of gross-to-net. Concurrency here is a throughput tool, never a correctness shortcut: the moment two coroutines race to post the same paycheck, or a gather swallows an exception and a chunk silently disappears, the run is no longer auditable. The job of this page is to show how to get the throughput of asyncio while preserving the determinism the rest of the pipeline depends on.

Data Normalization & Boundary Enforcement

Async concurrency multiplies whatever correctness properties the boundary already has — including the bad ones. If a malformed record can slip past validation in the synchronous path, the async path will slip thousands past in parallel. So the boundary contract is stricter here, not looser. Every record materialized from a fetched chunk must satisfy the canonical schema defined by Data Boundary Definitions before any coroutine schedules a write.

Each record entering the processor must carry, at minimum:

employee_id — a stable identifier matched against a fixed format (^[A-Z0-9]{6,12}$), so a vendor’s renumbering shows up as a validation rejection rather than an orphaned paycheck.
idempotency_key — a deterministic key derived from (employee_id, pay_period_start, pay_period_end, source_run_id). This is the single mechanism that makes a retried batch safe; without it, idempotent ingestion is impossible and a re-run double-posts wages.
gross_pay — parsed as a Decimal via Decimal(str(value)). Native floats are rejected at the boundary, never coerced, because Decimal precision is the only arithmetic that reconciles to the cent under IRS Pub 15 withholding rules.
tax_jurisdiction — a resolved authority key, not a free-text work location. An unresolved jurisdiction is a quarantine condition, never a silent default to federal.
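
A deterministic key of this kind can be sketched as follows. The helper name derive_idempotency_key, the pipe-delimited layout, and the choice of SHA-256 are illustrative assumptions, not something the framework prescribes; the only requirement taken from the text is that the same four inputs always produce the same key.

```python
import hashlib
from datetime import date


def derive_idempotency_key(
    employee_id: str,
    pay_period_start: date,
    pay_period_end: date,
    source_run_id: str,
) -> str:
    """Hypothetical helper: identical inputs always yield the same key."""
    # ISO dates and a fixed delimiter give one canonical string per tuple;
    # nothing from the run clock or a random source is mixed in.
    canonical = "|".join(
        [
            employee_id,
            pay_period_start.isoformat(),
            pay_period_end.isoformat(),
            source_run_id,
        ]
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first = derive_idempotency_key(
    "EMP000123", date(2024, 1, 1), date(2024, 1, 15), "run-42"
)
retry = derive_idempotency_key(
    "EMP000123", date(2024, 1, 1), date(2024, 1, 15), "run-42"
)
next_period = derive_idempotency_key(
    "EMP000123", date(2024, 1, 16), date(2024, 1, 31), "run-42"
)
```

Here first and retry are equal, so a retried batch collides with the ledger instead of posting twice, while next_period is a distinct key.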

Quarantine conditions specific to batch processing are explicit: a record that fails schema validation, an idempotency_key collision against the ledger, a gross_pay that arrives as a float or breaches a statutory cap, or any chunk that fails its control-total check against the source manifest. Each condition routes to the same dead-letter path through Fallback Routing Strategies, and the audit record names which condition fired so reconciliation is deterministic rather than a forensic exercise.
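
The control-total condition is the easiest of these to get subtly wrong, so here is a minimal sketch. It assumes the source manifest exposes a row count and a gross-pay total per chunk; the function name and both manifest fields are hypothetical.

```python
from decimal import Decimal


def control_total_matches(
    gross_pays: list[Decimal],
    manifest_count: int,
    manifest_total: Decimal,
) -> bool:
    """Hypothetical chunk gate: row count and Decimal sum must both agree."""
    # Start the sum from Decimal("0") so an empty chunk never yields int 0.
    chunk_total = sum(gross_pays, Decimal("0"))
    return len(gross_pays) == manifest_count and chunk_total == manifest_total


chunk = [Decimal("1000.10"), Decimal("2000.20")]
passes = control_total_matches(chunk, 2, Decimal("3000.30"))
quarantined = not control_total_matches(chunk, 2, Decimal("3000.31"))
```

A one-cent disagreement is enough to quarantine the chunk, which is only a meaningful test because the sum is carried in Decimal.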

Reads happen in fixed-size windows — typically 500 to 2,000 rows — so memory stays bounded regardless of file size and the processor can checkpoint between chunks. The streaming-window technique is the same one that keeps CSV ingestion pipelines from running out of memory on a quarter-end export; async simply lets the next chunk’s fetch overlap the current chunk’s transform.
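
The overlap described above can be sketched with asyncio.create_task: start the next window's fetch before transforming the current one. Both fetch_chunk and transform are hypothetical stand-ins that sleep instead of doing real I/O, and the sketch assumes the transform itself awaits something (a lookup, a write), since a purely synchronous transform would hold the event loop and leave nothing to overlap.

```python
import asyncio


async def fetch_chunk(offset: int, size: int, total: int) -> list[int]:
    """Hypothetical stand-in for an SFTP, object-store, or API read."""
    await asyncio.sleep(0.01)
    return list(range(offset, min(offset + size, total)))


async def transform(chunk: list[int]) -> int:
    """Hypothetical stand-in for validation plus awaited lookups."""
    await asyncio.sleep(0.01)
    return len(chunk)


async def process_with_prefetch(size: int, total: int) -> int:
    processed = 0
    offset = 0
    # Keep exactly one fetch in flight ahead of the chunk being transformed.
    pending = asyncio.create_task(fetch_chunk(offset, size, total))
    while True:
        chunk = await pending
        if not chunk:
            break
        offset += size
        pending = asyncio.create_task(fetch_chunk(offset, size, total))
        # While this awaits, the event loop advances the prefetch above.
        processed += await transform(chunk)
    return processed


total_rows = asyncio.run(process_with_prefetch(size=1000, total=2500))
```

At most two windows are resident at any moment (the one being transformed and the one being fetched), so memory stays bounded by twice the chunk size.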

Jurisdictional Resolution & Effective Dating

Concurrency does not change which rule controls a record — it changes how easy it is to bind the wrong one. When many coroutines resolve jurisdiction in parallel, the resolution function must be a pure function of the record and the rule set, with no shared mutable clock. The override hierarchy is most-protective-wins, evaluated municipal first, then state, then federal:

Municipal > State > Federal

A municipal rule supersedes a state rule, which supersedes the federal baseline — but only for the authority tied to the employee’s primary work location, and only for a rule whose effective window contains the pay period. This is the same precedence the FLSA Threshold Mapping gate applies when it resolves exempt/non-exempt status, reused here so a default selected during ingestion can never contradict the threshold the calculation engine applies downstream.

Effective dating uses half-open intervals so adjacent windows never both claim a date. A rule is in force for an evaluation date $d$ when:

$\text{start} \le d < \text{end}$

with end = None modeled as $+\infty$. The critical async-specific rule: resolution must bind to pay_period_start, never to datetime.now(). A batch reprocessed today for a prior period (a common retry scenario) must select the rule that was in force then, or the retry silently rewrites history. Because every coroutine in the run reads the same wall clock, a single accidental now() call in the resolution path misbinds every record that passes through it, not just one. The canonical selection predicate:

SELECT rule_id
FROM statutory_defaults
WHERE jurisdiction = :tax_jurisdiction
  AND code_prefix  = :code_prefix
  AND effective_start <= :pay_period_start
  AND (effective_end IS NULL OR :pay_period_start < effective_end)
ORDER BY authority_rank DESC   -- municipal=3, state=2, federal=1
LIMIT 1;
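
The same predicate can be mirrored in memory for unit tests. This is a sketch under stated assumptions: the Rule dataclass is a hypothetical mirror of a statutory_defaults row, and the rule list is assumed to be already filtered to the candidate authorities for one employee and one code_prefix.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """Hypothetical in-memory mirror of a statutory_defaults row."""

    rule_id: str
    authority_rank: int  # municipal=3, state=2, federal=1
    effective_start: date
    effective_end: Optional[date]  # None means open-ended


def in_force(rule: Rule, d: date) -> bool:
    # Half-open window: start <= d < end, with end=None treated as +infinity.
    return rule.effective_start <= d and (
        rule.effective_end is None or d < rule.effective_end
    )


def resolve(rules: list[Rule], pay_period_start: date) -> Optional[Rule]:
    """Pure function of its arguments; it never reads the run clock."""
    candidates = [r for r in rules if in_force(r, pay_period_start)]
    return max(candidates, key=lambda r: r.authority_rank, default=None)


rules = [
    Rule("fed-2023", 1, date(2023, 1, 1), None),
    Rule("state-2023", 2, date(2023, 1, 1), date(2024, 1, 1)),
    Rule("state-2024", 2, date(2024, 1, 1), None),
]
last_day_of_2023 = resolve(rules, date(2023, 12, 31))
first_day_of_2024 = resolve(rules, date(2024, 1, 1))
```

The boundary date 2024-01-01 belongs to state-2024 only, because the earlier window excludes its own end date.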

Overlap detection is a load-time concern. Two rule versions that both claim a date for the same (jurisdiction, code_prefix) must fail the rule-set load rather than letting concurrent workers resolve arbitrarily — non-determinism across coroutines is the one bug that is nearly impossible to reproduce after the fact.
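
A load-time overlap check might look like the sketch below. The RuleRow tuple layout and the function name are assumptions for illustration; the technique is to sort each key's windows by start date and compare neighbours, which is sufficient to detect any overlap among half-open intervals.

```python
from datetime import date
from typing import Optional

# (jurisdiction, code_prefix, effective_start, effective_end); None is open.
RuleRow = tuple[str, str, date, Optional[date]]


def assert_no_overlaps(rows: list[RuleRow]) -> None:
    """Raise ValueError if two windows for the same key share any date."""
    by_key: dict[tuple[str, str], list[tuple[date, Optional[date]]]] = {}
    for jurisdiction, code_prefix, start, end in rows:
        by_key.setdefault((jurisdiction, code_prefix), []).append((start, end))

    for key, windows in by_key.items():
        windows.sort(key=lambda w: w[0])
        for (_, prev_end), (next_start, _) in zip(windows, windows[1:]):
            # Half-open windows may touch (prev_end == next_start), but an
            # open-ended or later-ending predecessor overlaps its successor.
            if prev_end is None or next_start < prev_end:
                raise ValueError(f"overlapping rule windows for {key}")


clean = [
    ("CA", "MW", date(2023, 1, 1), date(2024, 1, 1)),
    ("CA", "MW", date(2024, 1, 1), None),
]
assert_no_overlaps(clean)  # adjacent windows that only touch are fine
```

Calling this once when the rule set is loaded means every coroutine resolves against a set that has at most one rule per date and key.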

Production Implementation Pattern

The processor below is I/O-decoupled, bounds concurrency with a semaphore, keeps all monetary state in Decimal, emits structured key=value logs that are copy-paste safe for production, and writes a SHA-256 audit hash per chunk. It uses asyncio.gather(..., return_exceptions=True) so a single failed record never aborts the chunk or hides its siblings’ outcomes, and it routes every failure to an explicit fallback queue. The code is runnable as-is and follows PEP 8.

"""Async batch processor for payroll normalization."""
import asyncio
import hashlib
import json
import logging
from dataclasses import dataclass, field
from datetime import date
from decimal import Decimal
from typing import Any, Optional

from pydantic import BaseModel, ConfigDict, Field, ValidationError, field_validator

logger = logging.getLogger("payroll.async_batch")


class PayrollRecord(BaseModel):
    """Canonical record. gross_pay is Decimal end to end."""

    model_config = ConfigDict(arbitrary_types_allowed=True)

    employee_id: str = Field(..., pattern=r"^[A-Z0-9]{6,12}$")
    pay_period_start: date
    pay_period_end: date
    gross_pay: Decimal = Field(..., ge=Decimal("0"))
    tax_jurisdiction: str
    idempotency_key: str

    @field_validator("gross_pay", mode="before")
    @classmethod
    def reject_float_money(cls, value: Any) -> Decimal:
        # Never let binary float rounding enter monetary state.
        if isinstance(value, float):
            raise TypeError("gross_pay must not be a float; pass str or Decimal")
        return Decimal(str(value))


@dataclass
class BatchResult:
    processed: int = 0
    failed: int = 0
    audit_hash: str = ""
    fallback_routed: int = 0
    errors: list[str] = field(default_factory=list)


class AsyncPayrollBatchProcessor:
    """Bounded-concurrency, Decimal-safe, audit-logged batch processor."""

    def __init__(self, batch_size: int = 1000, max_concurrency: int = 8) -> None:
        self.batch_size = batch_size
        self.semaphore = asyncio.Semaphore(max_concurrency)
        self.audit_log: list[str] = []
        self.fallback_queue: list[dict[str, Any]] = []

    async def _fetch_chunk(
        self, source_path: str, offset: int
    ) -> list[dict[str, Any]]:
        """Bounded async fetch. Replace the sleep with aiohttp/asyncpg I/O."""
        async with self.semaphore:
            await asyncio.sleep(0.01)
            if offset >= self.batch_size * 3:  # demo stop condition
                return []
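            # Demo rows: placeholder values wrapped under "data", the key
            # _validate_and_transform reads; money travels as str, not float.
            return [
                {
                    "data": {
                        "employee_id": f"EMP{offset + i:06d}",
                        "pay_period_start": "2024-01-01",
                        "pay_period_end": "2024-01-15",
                        "gross_pay": "1000.00",
                        "tax_jurisdiction": "US-FED",
                        "idempotency_key": f"demo-run:{offset + i}",
                    }
                }
                for i in range(self.batch_size)
            ]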

    async def _validate_and_transform(
        self, raw: dict[str, Any]
    ) -> Optional[PayrollRecord]:
        """Strict validation. Failures route to fallback, never raise upward."""
        try:
            return PayrollRecord(**raw["data"])
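        except (ValidationError, TypeError, KeyError, ArithmeticError) as exc:
            # ArithmeticError covers decimal.InvalidOperation, raised when
            # Decimal(str(value)) meets unparseable money such as "abc".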
            self._route_to_fallback("validation_failed", raw, str(exc))
            return None

    async def _upsert_record(self, record: PayrollRecord) -> bool:
        """Idempotent upsert keyed on idempotency_key (ON CONFLICT DO NOTHING)."""
        async with self.semaphore:
            await asyncio.sleep(0.005)  # Replace with asyncpg execute.
            return True

    def _compute_audit_hash(self, records: list[PayrollRecord]) -> str:
        """Deterministic SHA-256 over canonical records for reconstruction."""
        payload = json.dumps(
            [r.model_dump(mode="json") for r in records], sort_keys=True
        ).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def _route_to_fallback(
        self, reason: str, context: dict[str, Any], error: str
    ) -> None:
        """Explicit dead-letter routing; the reason names which gate fired."""
        self.fallback_queue.append(
            {"reason": reason, "context": context, "error": error}
        )
        logger.warning(
            "event=fallback_routed reason=%s error=%s", reason, error
        )

    async def process_stream(self, source_path: str) -> BatchResult:
        result = BatchResult()
        offset = 0

        while True:
            chunk = await self._fetch_chunk(source_path, offset)
            if not chunk:
                break

            validated: list[PayrollRecord] = []
            for raw in chunk:
                record = await self._validate_and_transform(raw)
                if record is not None:
                    validated.append(record)

            # One failed upsert must not abort the chunk or hide its siblings.
            upserts = await asyncio.gather(
                *(self._upsert_record(r) for r in validated),
                return_exceptions=True,
            )
            for record, outcome in zip(validated, upserts):
                if isinstance(outcome, Exception):
                    result.failed += 1
                    self._route_to_fallback(
                        "upsert_failed",
                        {"idempotency_key": record.idempotency_key},
                        str(outcome),
                    )
                else:
                    result.processed += 1

            if validated:
                self.audit_log.append(self._compute_audit_hash(validated))

            logger.info(
                "event=chunk_done offset=%s processed=%s failed=%s",
                offset,
                result.processed,
                result.failed,
            )
            offset += self.batch_size

        result.audit_hash = hashlib.sha256(
            "".join(self.audit_log).encode("utf-8")
        ).hexdigest()
        result.fallback_routed = len(self.fallback_queue)
        return result


async def run_pipeline(source_path: str, timeout_s: float = 300.0) -> BatchResult:
    processor = AsyncPayrollBatchProcessor(batch_size=1000, max_concurrency=8)
    # Hard timeout: a stuck batch must surface, not block the run forever.
    return await asyncio.wait_for(
        processor.process_stream(source_path), timeout=timeout_s
    )

Three properties make this safe under concurrency. The semaphore bounds both fetches and writes, so a burst of chunks cannot exhaust the database connection pool or trip a vendor rate limit. The gather(..., return_exceptions=True) plus per-record fallback routing means one bad record is isolated: its exception comes back as an ordinary result, and every sibling’s outcome is still inspected and counted. And the chunk audit hash is a pure function of the canonical records, so a replayed batch produces an identical hash, which is the determinism reconciliation depends on.
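
To make the isolation property concrete, the sketch below contrasts the two gather modes. Per the asyncio documentation, the default mode propagates the first exception to the awaiter while the remaining awaitables keep running; with return_exceptions=True the exception is returned alongside the other results. The upsert coroutine and the key names are hypothetical.

```python
import asyncio


async def upsert(key: str, committed: list[str]) -> str:
    """Hypothetical write: the key "bad" fails fast, others commit later."""
    if key == "bad":
        await asyncio.sleep(0)
        raise RuntimeError(f"upsert failed for {key}")
    await asyncio.sleep(0.02)
    committed.append(key)
    return key


async def with_default_gather(keys: list[str]) -> list[str]:
    committed: list[str] = []
    try:
        await asyncio.gather(*(upsert(k, committed) for k in keys))
    except RuntimeError:
        pass  # the first failure surfaces here with no per-record outcomes
    await asyncio.sleep(0.05)  # siblings keep running and still commit
    return committed


async def with_isolation(keys: list[str]) -> dict[str, str]:
    committed: list[str] = []
    outcomes = await asyncio.gather(
        *(upsert(k, committed) for k in keys), return_exceptions=True
    )
    # gather returns outcomes in input order, so zip pairs key with result.
    return {
        key: "failed" if isinstance(outcome, Exception) else "committed"
        for key, outcome in zip(keys, outcomes)
    }


unobserved = asyncio.run(with_default_gather(["a", "bad", "b"]))
observed = asyncio.run(with_isolation(["a", "bad", "b"]))
```

In the default mode the writes for "a" and "b" still happen, but the caller has already left the gather and has no result to count or route; the isolated mode yields one outcome per key.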

Compliance Verification & Fallback Routing

Async throughput without a verification suite is a faster way to produce an unauditable run. The following checklist is the minimum gate set; run it in CI and against the daily reconciliation job.

Concurrency isolation tests. Inject one record that raises during upsert into a chunk of valid records and assert that every sibling still commits and lands in processed, while the failure lands in fallback_routed. This proves return_exceptions=True is doing its job and no exception is escaping gather and abandoning the chunk.
Backpressure bound tests. Set max_concurrency to a known value and assert the semaphore never permits more than that many simultaneous _upsert_record calls (instrument with a counter). Confirm the bound sits at or below the database connection-pool ceiling so workers never wait on a pool the semaphore has oversubscribed.
Idempotency replay tests. Process the same batch twice and assert the ledger row count is unchanged after the second run and the duplicate idempotency_key values are logged as deduplications, never double-posted.
Effective-date drift tests. Resolve the same (jurisdiction, code_prefix) for a date inside, on, and one day outside each window. Confirm half-open behavior and that selection binds to pay_period_start, not the run clock — critical because every coroutine shares that clock.
Decimal precision checks. Assert that constructing a PayrollRecord with a float gross_pay raises TypeError, and that no code path coerces money through float(). Sum gross_pay across processed records with Decimal and reconcile against the source register to a tolerance of ≤ $0.01.
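Audit determinism. Process an identical batch twice and assert the audit_hash values match. Write the hashes to a write-once store (e.g. S3 Object Lock) and map each to NIST SP 800-53 AU-2 audit-event evidence. Set retention from the governing rules rather than a rule of thumb: IRS Pub 15 calls for keeping employment tax records for at least four years and FLSA recordkeeping regulations require payroll records for three, state rules can run longer, and a window such as seven years is an organizational policy choice rather than a minimum those sources state.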
Timeout and partial-commit checks. Force a batch past the asyncio.wait_for deadline and assert it cancels cleanly, persists the fallback queue, and never leaves a chunk half-committed without an audit record.
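
The backpressure bound test from the checklist can be sketched with a small counting probe. ConcurrencyProbe and measure_peak are hypothetical test helpers, and guarded_upsert only imitates the shape of a semaphore-gated write; in a real suite the probe would wrap the processor's own upsert path.

```python
import asyncio


class ConcurrencyProbe:
    """Hypothetical test helper that records peak simultaneous entries."""

    def __init__(self) -> None:
        self.active = 0
        self.peak = 0

    async def __aenter__(self) -> "ConcurrencyProbe":
        self.active += 1
        self.peak = max(self.peak, self.active)
        return self

    async def __aexit__(self, *exc_info: object) -> None:
        self.active -= 1


async def measure_peak(max_concurrency: int, calls: int) -> int:
    semaphore = asyncio.Semaphore(max_concurrency)
    probe = ConcurrencyProbe()

    async def guarded_upsert() -> None:
        # Same shape as a semaphore-gated write: acquire, then do I/O.
        async with semaphore:
            async with probe:
                await asyncio.sleep(0.01)

    await asyncio.gather(*(guarded_upsert() for _ in range(calls)))
    return probe.peak


peak = asyncio.run(measure_peak(max_concurrency=8, calls=50))
```

The assertion to make in CI is that peak never exceeds max_concurrency, however many calls are scheduled.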

Failure Modes & Gotchas

gather without return_exceptions=True. A single failed upsert propagates out of gather immediately and aborts the chunk loop, while the sibling coroutines keep running unobserved: correct paychecks may still commit, but they are never counted, hashed, or reconciled. Root cause: default gather raises the first exception to the awaiter and discards the remaining results; it does not cancel the siblings (asyncio.TaskGroup does). Fix: always pass return_exceptions=True and inspect each outcome, routing exceptions to the fallback queue while letting successes commit.
Unbounded concurrency. Spawning a coroutine per record with no semaphore opens thousands of simultaneous connections, exhausts the pool, and trips the vendor’s rate limit — so the “fast” path stalls under retries. Root cause: treating asyncio as infinite parallelism. Fix: gate fetches and writes behind a semaphore sized to the connection-pool ceiling, and let backpressure throttle the producer.
Float money through JSON. A vendor gross_pay parsed by json.loads as a native float reintroduces binary rounding, and the run no longer reconciles to the cent. Root cause: json.loads without parse_float=Decimal. Fix: parse with parse_float=Decimal and let the PayrollRecord validator reject any float at the boundary.
Resolving rules against now(). A batch reprocessed after a statute change binds to today’s rule instead of the rule in force during the pay period, and because coroutines share the event loop, every record in that tick is contaminated at once. Root cause: using the run clock instead of pay_period_start. Fix: pass the period start date explicitly into rule resolution; the run clock never touches it.
Idempotency key derived from time. Building the key from datetime.now() or a random UUID (uuid4) makes every retry look like a new record, so a re-run double-posts wages. Root cause: non-deterministic key construction. Fix: derive the key purely from (employee_id, pay_period_start, pay_period_end, source_run_id) and enforce uniqueness with ON CONFLICT DO NOTHING.
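
The float-through-JSON failure and its fix fit in a few lines. The payload below is an invented example; json.loads and its parse_float parameter are standard library.

```python
import json
from decimal import Decimal

payload = '{"employee_id": "EMP000123", "gross_pay": 1234.10}'

# Default parsing turns the JSON number into a binary float.
as_float = json.loads(payload)

# parse_float receives the original digit string, so no rounding occurs.
as_decimal = json.loads(payload, parse_float=Decimal)

float_type = type(as_float["gross_pay"]).__name__
exact_text = str(as_decimal["gross_pay"])
```

parse_float only applies to JSON numbers written with a fraction or exponent; a whole-number amount such as 1000 still arrives as an int, which Decimal(str(value)) converts exactly.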

Frequently Asked Questions

Why async instead of multiprocessing for payroll batches?

Payroll ingestion is overwhelmingly I/O-bound (fetching files, joining tax tables, writing rows), so the bottleneck is waiting, not CPU. asyncio overlaps that waiting on a single thread, without the GIL contention of a thread pool or the process startup and serialization overhead of multiprocessing, which keeps a run inside its SLA. Reserve multiprocessing for the rare CPU-bound stage (large cryptographic hashing of full payloads, heavy parsing) and run it behind a process pool that the async layer awaits, so the I/O path stays cooperative.
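
Awaiting a pool from the async layer can be sketched with loop.run_in_executor. The function names are hypothetical, and the executor is passed in so the same coroutine works with any concurrent.futures executor.

```python
import asyncio
import hashlib
from concurrent.futures import Executor


def digest_payload(payload: bytes) -> str:
    """CPU-bound stage; module-level so a process pool can pickle it."""
    return hashlib.sha256(payload).hexdigest()


async def digest_off_loop(payloads: list[bytes], pool: Executor) -> list[str]:
    """Await CPU-bound work without blocking the event loop."""
    loop = asyncio.get_running_loop()
    futures = [loop.run_in_executor(pool, digest_payload, p) for p in payloads]
    # gather preserves input order, so digests line up with payloads.
    return list(await asyncio.gather(*futures))
```

For a genuinely CPU-bound stage, pass a concurrent.futures.ProcessPoolExecutor created once at startup under an if __name__ == "__main__": guard, and reuse it across chunks rather than creating a pool per call.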

How do I keep audit hashes stable across pipeline retries?

Derive each chunk hash from the canonical record contents only, serialized as model_dump(mode="json") through json.dumps(..., sort_keys=True), and nothing time-varying. Timestamps, run IDs, and worker hostnames belong in surrounding log fields, never in the hash payload, so a replayed batch produces an identical hash. That stability is what lets reconciliation prove a retry did not change any record. Record order within a chunk is part of the payload, so keep chunk boundaries and ordering fixed across retries. See the asyncio task documentation for how gather returns results in the order its awaitables were passed, which is what keeps each record paired with its own outcome.
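
A stripped-down illustration of what does and does not change the hash, using plain dicts rather than the PayrollRecord model. The chunk_hash helper and the sample records are hypothetical.

```python
import hashlib
import json
from decimal import Decimal


def chunk_hash(records: list[dict]) -> str:
    """Hash canonical content only; nothing time-varying enters the payload."""
    # default=str renders Decimal values as exact strings, never floats.
    payload = json.dumps(records, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


first_run = [{"employee_id": "EMP000123", "gross_pay": Decimal("1000.00")}]
# Same record replayed with its keys in a different order.
replay = [{"gross_pay": Decimal("1000.00"), "employee_id": "EMP000123"}]
# Numerically equal, but a different string form, so a different hash.
rescaled = [{"employee_id": "EMP000123", "gross_pay": Decimal("1000.0")}]

stable = chunk_hash(first_run) == chunk_hash(replay)
scale_sensitive = chunk_hash(first_run) != chunk_hash(rescaled)
```

Key order is neutralized by sort_keys=True, but the textual scale of a Decimal is not, so it is worth quantizing monetary fields to a fixed number of places before hashing.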

What size should batch chunks and the concurrency limit be?

Chunk size is a memory-versus-checkpoint tradeoff: 500–2,000 rows keeps the resident set bounded while making checkpoints cheap. The concurrency limit should equal the database connection-pool ceiling minus a small headroom for the rest of the application — over-provisioning starves other workers, under-provisioning misses the SLA. Measure both against a representative quarter-end file rather than guessing; the right numbers are workload-specific.

What happens to records that fail validation mid-batch?

They are routed to an explicit fallback queue with a reason code, never raised up to cancel the chunk. A secondary reconciliation worker drains the queue: transient failures (timeouts, connection drops) get bounded retries, while structural failures (schema violations, missing jurisdiction) escalate to an HRIS exception dashboard. The run continues with the records that passed, and the audit trail records exactly which records were held and why.
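
A drain worker along these lines might look like the sketch below. It assumes the entry shape the processor writes to its fallback queue (reason, context, error), treats upsert_failed as the only transient reason code, and takes the retry operation as a caller-supplied coroutine; all three are assumptions for illustration.

```python
import asyncio
from typing import Any, Awaitable, Callable

TRANSIENT = {"upsert_failed"}  # assumed set of retryable reason codes


async def drain_fallback(
    queue: list[dict[str, Any]],
    retry: Callable[[dict[str, Any]], Awaitable[None]],
    max_attempts: int = 3,
) -> tuple[int, list[dict[str, Any]]]:
    """Retry transient entries; return (recovered, escalated)."""
    recovered = 0
    escalated: list[dict[str, Any]] = []
    for entry in queue:
        if entry["reason"] not in TRANSIENT:
            escalated.append(entry)  # structural: no retry will fix it
            continue
        for attempt in range(1, max_attempts + 1):
            try:
                await retry(entry)
                recovered += 1
                break
            except Exception:
                if attempt == max_attempts:
                    escalated.append(entry)  # retries exhausted
                else:
                    await asyncio.sleep(0.01 * 2 ** attempt)  # backoff
    return recovered, escalated
```

Entries come back either recovered or escalated, never dropped, so the count of held records in the audit trail always reconciles with the dashboard.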

How does this relate to processing very large payroll files?

This page covers the general async batch architecture — bounded concurrency, idempotent upserts, fallback routing, and audit hashing. Multi-gigabyte exports add their own concerns: streaming parse without loading the whole file, partition-aware checkpointing, and back-pressured producers. Those are handled on top of this same pattern in the large-file guide linked below.

Async batch processing for large payroll files — streaming, checkpointing, and back-pressure for multi-gigabyte exports built on this pattern.
CSV Ingestion Pipelines — the flat-file entry vector whose streaming windows the async layer overlaps.
EDI 834 Parsing — structured benefits payloads whose enrollment-to-payroll ordering the batch processor must preserve.
REST API Payroll Sync — rate-limited HCM endpoints whose backpressure the semaphore enforces.
Fallback Routing Strategies — the dead-letter hierarchy every failed record in a batch routes through.