Async batch processing for large payroll files

A single quarter-end HRIS export that exceeds your worker’s memory budget will not fail loudly — it will load partially, drop the tail, and post an incomplete payroll run that reconciles against nothing. This guide is the large-file specialization of the Async Batch Processing pattern: it keeps the same bounded-concurrency, Decimal precision, and audit-trail guarantees, but adds the streaming, checkpointing, and back-pressure machinery a multi-gigabyte file demands.

Problem Framing

The naive approach — pandas.read_csv(path) or csv.reader(open(path).read()) — materializes the entire file in memory before the first record is validated. On a 2 GB export of 1.2 million employee rows, the resident set balloons past container limits, the OOM killer reaps the worker mid-run, and the orchestrator restarts the whole file from byte zero. Worse, a synchronous read blocks the event loop, so every health check, cancellation signal, and circuit breaker the rest of the pipeline relies on goes dark for the duration.

Large files break naive implementations along three axes the general pattern does not have to confront:

Memory is unbounded in file size. Resident memory must be a function of chunk size, not file size. For a fixed window of $w$ rows across $n$ in-flight chunks, the bound is $M \approx w \times n \times s$ where $s$ is the per-record footprint — a constant, regardless of whether the file is 50 MB or 50 GB.
A restart cannot mean “start over.” A 40-minute ingestion that dies at minute 38 must resume from its last committed checkpoint, not replay 1.1 million already-posted rows. Resumption depends on the deterministic idempotency_key design from the parent pattern so re-read rows are deduplicated rather than double-posted.
The producer outruns the consumer. An async reader that yields chunks faster than the database can upsert them simply moves the unbounded-memory problem from the file into the queue. Back-pressure — a bounded channel that blocks the reader when the writer falls behind — is mandatory, not optional.
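The memory bound from the first item can be checked with quick arithmetic. The sketch below is illustrative only: the 512-byte record footprint and the in-flight count of 5 (4 queued windows plus 1 being written) are assumptions, so measure your own records before relying on the figure.

```python
def resident_bound_bytes(window_rows: int, in_flight: int, record_bytes: int) -> int:
    """M ~ w * n * s: rows per window, windows in flight, bytes per parsed record."""
    return window_rows * in_flight * record_bytes


# Assumed figures: 1,000-row windows, 5 windows in flight, ~512 bytes per record.
bound = resident_bound_bytes(window_rows=1_000, in_flight=5, record_bytes=512)
print(f"{bound / 1_000_000:.2f} MB")  # 2.56 MB
```

File size never appears in the calculation, so the same bound applies to a 50 MB file and a 50 GB one.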

None of this relaxes compliance. Overtime aggregation under 29 CFR § 778.107 still requires every hour for an employee to be summed before the regular rate is computed, even when those hours are split across chunk boundaries. The FLSA Threshold Mapping gate still resolves exempt status per record. Streaming changes how the data arrives, never which rules apply.
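Because one employee's hours can land in different windows, the total has to be accumulated across the whole run and evaluated only once every window has been folded in. The following is a minimal sketch, not the parent pattern's implementation: the hours and workweek fields are hypothetical (they are not part of the canonical schema in this guide), and the 40-hour weekly threshold is the general FLSA overtime standard.

```python
from collections import defaultdict
from decimal import Decimal

WEEKLY_THRESHOLD = Decimal("40")  # hours in a workweek before overtime applies


def accumulate_hours(totals: dict, window: list[dict]) -> None:
    """Fold one window's hours into run-wide totals keyed by (employee, workweek)."""
    for row in window:
        totals[(row["employee_id"], row["workweek"])] += Decimal(row["hours"])


def overtime_hours(totals: dict) -> dict:
    """Evaluate overtime only after every window has been accumulated."""
    return {
        key: max(Decimal("0"), hours - WEEKLY_THRESHOLD)
        for key, hours in totals.items()
    }


totals = defaultdict(Decimal)  # Decimal() is zero, so missing keys start at 0
# The same employee and workweek, split across two adjacent windows.
window_a = [{"employee_id": "EMP00001", "workweek": "2024-W14", "hours": "24.5"}]
window_b = [{"employee_id": "EMP00001", "workweek": "2024-W14", "hours": "20.0"}]
for window in (window_a, window_b):
    accumulate_hours(totals, window)

overtime = overtime_hours(totals)
print(overtime[("EMP00001", "2024-W14")])  # 4.5
```

Evaluated per window, neither 24.5 nor 20.0 hours crosses the threshold and the overtime would be missed. The accumulator grows with the number of distinct employee and workweek pairs, not with file size, so it does not break the memory bound.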

The three forces above (bounded memory, resumability, and back-pressure) resolve into a single producer/consumer pipeline whose only persistent state between runs is a checkpoint.

Prerequisites & Data Requirements

Before applying this pattern, the engineer must have:

A line-delimited or record-delimited source. CSV, JSONL, or fixed-width files stream naturally. A monolithic XML document or a single JSON array does not; those require a streaming parser (ijson, lxml.etree.iterparse) and are out of scope here. Confirm the export is newline-terminated per record, the same precondition that governs CSV Ingestion Pipelines.
A stable header contract. The canonical fields validated downstream — employee_id (^[A-Z0-9]{6,12}$), pay_period_start/pay_period_end (ISO 8601 dates), gross_pay (string, parsed to Decimal — never a float), tax_jurisdiction (resolved authority key), and a deterministic idempotency_key — must match the schema enforced by Data Boundary Definitions. Positional indexing is banned; column order drift is a near-certainty across vendor releases.
A checkpoint store. A small, durable key-value record (a single row in Postgres, or a Redis key) holding {source_run_id, last_committed_offset, audit_chain_hash}. This is what makes a restart resumable.
A fallback sink. A dead-letter table or queue reachable through Fallback Routing Strategies, so a malformed row in the middle of a 2 GB file is isolated rather than aborting the run.
Async I/O drivers. aiofiles for non-blocking reads and an async database client (asyncpg, or SQLAlchemy’s async engine) for non-blocking writes. A synchronous driver inside a coroutine reintroduces the blocked event loop you are trying to avoid.
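The header contract above can be enforced by resolving column positions from the header line instead of hard-coding them. This is a minimal standard-library sketch. The function names column_index and validate_row and the sample jurisdiction key US-CA are illustrative assumptions, not part of Data Boundary Definitions.

```python
import csv
import re
from datetime import date
from decimal import Decimal, InvalidOperation

REQUIRED = ("employee_id", "pay_period_start", "pay_period_end",
            "gross_pay", "tax_jurisdiction")
EMPLOYEE_ID = re.compile(r"^[A-Z0-9]{6,12}$")


def column_index(header_line: str) -> dict[str, int]:
    """Map each required field to its position; fail fast if the contract is broken."""
    names = next(csv.reader([header_line.rstrip("\r\n")]))
    missing = [field for field in REQUIRED if field not in names]
    if missing:
        raise ValueError(f"header missing fields: {missing}")
    return {field: names.index(field) for field in REQUIRED}


def validate_row(line: str, index: dict[str, int]) -> dict:
    """Resolve fields by name and check them; raise ValueError on any violation."""
    cols = next(csv.reader([line.rstrip("\r\n")]))
    try:
        row = {field: cols[pos] for field, pos in index.items()}
        if not EMPLOYEE_ID.fullmatch(row["employee_id"]):
            raise ValueError("employee_id does not match the contract")
        date.fromisoformat(row["pay_period_start"])   # raises ValueError if malformed
        date.fromisoformat(row["pay_period_end"])
        row["gross_pay"] = Decimal(row["gross_pay"])  # string in, never float
    except (IndexError, InvalidOperation) as exc:
        raise ValueError(f"row violates the contract: {exc!r}") from exc
    return row


# Column order differs from the canonical order; lookup by name still works.
header = "tax_jurisdiction,employee_id,gross_pay,pay_period_start,pay_period_end\n"
index = column_index(header)
row = validate_row("US-CA,EMP00001,1234.50,2024-04-01,2024-04-15\n", index)
print(row["employee_id"], row["gross_pay"])  # EMP00001 1234.50
```

Resolving the index once per file keeps the per-row cost low while tolerating column reordering between vendor releases.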

Step-by-Step Implementation

Step 1 — Stream the file in fixed-size windows

The reader is an async generator that yields lists of raw lines and never holds more than one window in memory. It also reports the row offset it has reached (a count of data rows, excluding the header), which the checkpoint will record.

import aiofiles
from typing import AsyncIterator

CHUNK_SIZE = 1_000  # rows per window; tune to memory budget, not file size

async def stream_chunks(
    filepath: str, start_offset: int = 0
) -> AsyncIterator[tuple[int, list[str]]]:
    """Yield (row_offset, lines) windows. Resumes at start_offset rows in."""
    async with aiofiles.open(filepath, mode="r", encoding="utf-8-sig") as handle:
        await handle.readline()  # discard header
        row_offset = 0
        buffer: list[str] = []
        async for line in handle:
            row_offset += 1
            if row_offset <= start_offset:
                continue  # already committed in a prior run
            buffer.append(line)
            if len(buffer) >= CHUNK_SIZE:
                yield row_offset, buffer
                buffer = []
        if buffer:
            yield row_offset, buffer

Expected output: iterating a 2.5-million-row file yields 2,500 windows of 1,000 rows (the last window is short whenever the row count is not a multiple of CHUNK_SIZE), and tracemalloc shows traced memory flat across all of them, which is evidence that memory is bounded by CHUNK_SIZE, not file size.

Step 2 — Validate and transform off the event loop

Parsing and Decimal construction are CPU-bound. Run them in the default thread-pool executor so the loop stays free to handle the next fetch and any cancellation. Money is parsed with Decimal directly from the source string; a float never enters monetary state.

import logging
from decimal import Decimal, InvalidOperation

logger = logging.getLogger("payroll.large_file")

def normalize_window(lines: list[str], run_id: str) -> tuple[list[dict], list[dict]]:
    """Return (valid_records, quarantined). Pure function — safe in a thread."""
    valid: list[dict] = []
    quarantined: list[dict] = []
    for raw in lines:
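        # Simplified for illustration: a bare split ignores quoted commas and relies on
        # column position; production code should parse with csv and resolve by header name.
        cols = raw.rstrip("\n").split(",")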
        if len(cols) < 5:
            quarantined.append({"reason": "schema_short_row", "raw": raw})
            continue
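        try:
            gross = Decimal(cols[3])  # str in, never float
        except InvalidOperation:
            quarantined.append({"reason": "non_decimal_gross", "raw": raw})
            continue
        if not gross.is_finite():  # Decimal accepts "NaN" and "Infinity" without raising
            quarantined.append({"reason": "non_finite_gross", "raw": raw})
            continue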
        emp = cols[0]
        record = {
            "employee_id": emp,
            "pay_period_start": cols[1],
            "pay_period_end": cols[2],
            "gross_pay": gross,
            "tax_jurisdiction": cols[4],
            "idempotency_key": f"{emp}|{cols[1]}|{cols[2]}|{run_id}",
        }
        valid.append(record)
    return valid, quarantined

Expected output: a gross_pay written in exponent notation, such as 1234.5e2, parses to an exact Decimal, while a row with gross_pay = "N/A" lands in quarantined with reason=non_decimal_gross and the other 999 rows in the window pass untouched.
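The parsing behaviour described above is easy to confirm in isolation. The sketch below uses a hypothetical helper, parse_gross, to show the three cases worth knowing: exponent notation parses exactly, free text raises InvalidOperation, and the special values NaN and Infinity parse without error, so they need an explicit finite check.

```python
from decimal import Decimal, InvalidOperation


def parse_gross(text: str) -> Decimal:
    """Parse a money string; reject anything that is not a finite number."""
    try:
        value = Decimal(text.strip())
    except InvalidOperation as exc:
        raise ValueError(f"non_decimal_gross: {text!r}") from exc
    if not value.is_finite():
        raise ValueError(f"non_finite_gross: {text!r}")
    return value


print(parse_gross("1234.5e2"))                       # 1.2345E+5
print(parse_gross("1234.5e2") == Decimal("123450"))  # True

for bad in ("N/A", "NaN", "Infinity"):
    try:
        parse_gross(bad)
    except ValueError as exc:
        print(exc)
```

The exponent form prints in scientific notation but is numerically identical to 123450, so totals are unaffected.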

Step 3 — Apply back-pressure between reader and writer

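A bounded asyncio.Queue is the back-pressure channel: the reader blocks on put once the queue is full, so it cannot outrun the writer. A semaphore caps simultaneous database writes at the connection-pool ceiling. With the single consumer shown here the semaphore is never contended; it becomes the effective limit only if several consumers share the pool.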

import asyncio

async def producer(filepath: str, queue: asyncio.Queue, start_offset: int) -> None:
    async for offset, lines in stream_chunks(filepath, start_offset):
        await queue.put((offset, lines))  # blocks when queue is full -> back-pressure
    await queue.put(None)  # sentinel: end of file

async def consumer(
    queue: asyncio.Queue, run_id: str, pool, checkpoint, semaphore: asyncio.Semaphore
) -> None:
    loop = asyncio.get_running_loop()
    while True:
        item = await queue.get()
        if item is None:
            queue.task_done()
            break
        offset, lines = item
        valid, quarantined = await loop.run_in_executor(
            None, normalize_window, lines, run_id
        )
        async with semaphore:
            await upsert_batch(pool, valid)        # ON CONFLICT DO NOTHING
        await route_fallback(quarantined)
        await checkpoint.commit(offset, valid)     # advance only after the write
        logger.info(
            "event=window_committed offset=%s ok=%s quarantined=%s",
            offset, len(valid), len(quarantined),
        )
        queue.task_done()

Expected output: with maxsize=4, instrumenting the queue shows it oscillating between full and near-empty while never exceeding 4 — the reader is being throttled by the writer exactly as intended.
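The throttling can be observed without a database or a file. This sketch is a self-contained stand-in for the pipeline above: the producer does no I/O, the consumer's sleep stands in for a slow upsert, and demo_backpressure is a hypothetical name.

```python
import asyncio


async def demo_backpressure(windows: int = 20, maxsize: int = 4) -> int:
    """Run a fast producer against a slow consumer; return the deepest queue seen."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    deepest = 0

    async def fast_producer() -> None:
        nonlocal deepest
        for n in range(windows):
            await queue.put(n)  # suspends here whenever the queue is full
            deepest = max(deepest, queue.qsize())
        await queue.put(None)   # sentinel: no more windows

    async def slow_consumer() -> int:
        handled = 0
        while (await queue.get()) is not None:
            await asyncio.sleep(0.001)  # stand-in for a slow database write
            handled += 1
        return handled

    _, handled = await asyncio.gather(fast_producer(), slow_consumer())
    assert handled == windows  # nothing was dropped while the producer waited
    return deepest


print(asyncio.run(demo_backpressure()))  # 4
```

The producer fills the queue to maxsize immediately and then advances only as fast as the consumer drains it, which is the behaviour the bound test should assert.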

Step 4 — Checkpoint after every committed window

The checkpoint advances only after the idempotent upsert succeeds, and it folds each window’s hash into a running chain so the audit trail survives a restart.

import hashlib
import json

class Checkpoint:
    def __init__(self, store, run_id: str, chain: str = ""):
        self.store, self.run_id = store, run_id
        self.chain = chain  # seed with the stored audit_chain_hash when resuming

    async def commit(self, offset: int, records: list[dict]) -> None:
        payload = json.dumps(
            [{**r, "gross_pay": str(r["gross_pay"])} for r in records],
            sort_keys=True,
        ).encode("utf-8")
        window_hash = hashlib.sha256(payload).hexdigest()
        self.chain = hashlib.sha256((self.chain + window_hash).encode()).hexdigest()
        await self.store.put(
            self.run_id,
            {"last_committed_offset": offset, "audit_chain_hash": self.chain},
        )

Expected output: kill the worker after offset 1,200,000; on restart, stream_chunks(..., start_offset=1_200_000) skips the committed rows. Provided the resumed Checkpoint is seeded with the stored audit_chain_hash rather than an empty chain, the final audit_chain_hash equals the hash a single uninterrupted run produces, which demonstrates continuity.
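The continuity property depends on where the resumed chain starts. The sketch below isolates the folding step (the fold helper and the byte-string windows are illustrative stand-ins, not the serialized records) and compares an uninterrupted run with two resumes: one seeded from the saved chain and one restarted from the empty string.

```python
import hashlib


def fold(chain: str, window_payload: bytes) -> str:
    """Fold one window's SHA-256 digest into the running chain."""
    window_hash = hashlib.sha256(window_payload).hexdigest()
    return hashlib.sha256((chain + window_hash).encode()).hexdigest()


windows = [b"window-1", b"window-2", b"window-3", b"window-4"]

# Uninterrupted run: start from the empty chain and fold every window.
uninterrupted = ""
for payload in windows:
    uninterrupted = fold(uninterrupted, payload)

# Interrupted run: the worker dies after window 2; the chain was checkpointed.
saved = ""
for payload in windows[:2]:
    saved = fold(saved, payload)

# Correct resume: continue from the saved chain value.
resumed = saved
for payload in windows[2:]:
    resumed = fold(resumed, payload)

# Incorrect resume: the chain is reset to "" and history is lost.
restarted = ""
for payload in windows[2:]:
    restarted = fold(restarted, payload)

print(resumed == uninterrupted)    # True
print(restarted == uninterrupted)  # False
```

The comparison also assumes window boundaries are identical across runs, which holds when the committed offset is a multiple of CHUNK_SIZE.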

Step 5 — Wire the run with a hard timeout

async def run_large_file(filepath: str, run_id: str, pool, store) -> str:
    state = await store.get(run_id) or {"last_committed_offset": 0}
    queue: asyncio.Queue = asyncio.Queue(maxsize=4)
    checkpoint = Checkpoint(store, run_id)
    checkpoint.chain = state.get("audit_chain_hash", "")  # continue the chain on resume
    semaphore = asyncio.Semaphore(8)  # match the connection-pool ceiling
    await asyncio.wait_for(
        asyncio.gather(
            producer(filepath, queue, state["last_committed_offset"]),
            consumer(queue, run_id, pool, checkpoint, semaphore),
        ),
        timeout=3600,  # a stuck run must surface, not block the cutoff forever
    )
    return checkpoint.chain

Expected output: a clean run returns the audit chain hash; a stuck window past 3600 s raises asyncio.TimeoutError, leaving the last checkpoint intact for a resume.
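The timeout behaviour can be seen in miniature. In this sketch the two workers are hypothetical stand-ins for the producer and consumer, each stuck on a long sleep. When the deadline passes, asyncio.wait_for cancels the gathered tasks before raising, so neither is left running in the background.

```python
import asyncio

cancelled: list[str] = []


async def stuck_worker(name: str) -> None:
    try:
        await asyncio.sleep(3600)  # stand-in for a write that never returns
    except asyncio.CancelledError:
        cancelled.append(name)     # record that cancellation reached this task
        raise


async def run_with_deadline(timeout: float) -> str:
    try:
        await asyncio.wait_for(
            asyncio.gather(stuck_worker("producer"), stuck_worker("consumer")),
            timeout=timeout,
        )
    except asyncio.TimeoutError:
        return "timed_out"
    return "completed"


result = asyncio.run(run_with_deadline(0.05))
print(result, sorted(cancelled))  # timed_out ['consumer', 'producer']
```

Because the checkpoint is written only after a successful upsert, a cancellation in the middle of a window leaves the stored offset at the last fully committed window.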

Verification

Confirm correctness with boundary tests aimed at the large-file scenario specifically:

Bounded-memory test. Stream a synthetic 5 GB file under tracemalloc and assert peak traced memory stays within a few multiples of CHUNK_SIZE × record_size and does not grow with file length. The slope of memory-vs-rows-read must be flat.
Resume-from-checkpoint test. Run to completion, record the chain hash, then run again killing the consumer after a known offset and resuming. Assert the resumed run’s final audit_chain_hash is byte-identical and the ledger row count is unchanged — ON CONFLICT DO NOTHING absorbs any re-read overlap.
Back-pressure bound test. Instrument queue.qsize() and assert it never exceeds maxsize. Slow the writer artificially and confirm the producer blocks rather than buffering the file into RAM.
Cross-chunk aggregation test. Place an employee’s hours in two adjacent windows that straddle a chunk boundary; assert overtime under 29 CFR § 778.107 is computed from the summed hours, not per chunk. Aggregation must key on employee_id across the whole run.
Decimal precision check. Sum gross_pay across all committed records as Decimal and reconcile against the source register’s control total to a tolerance of ≤ $0.01, matching IRS Pub 15 cent-exact withholding expectations.
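The bounded-memory test from the list above can be prototyped at small scale before committing to a 5 GB fixture. This sketch makes several simplifying assumptions: it uses a synchronous generator as a stand-in for stream_chunks, synthetic rows with a made-up jurisdiction key, and file sizes of 20,000 and 200,000 rows. Note that tracemalloc reports memory allocated by Python, not the process's resident set.

```python
import os
import tempfile
import tracemalloc


def windows(path: str, chunk_size: int):
    """Synchronous stand-in for stream_chunks: yield lists of at most chunk_size lines."""
    buffer = []
    with open(path, encoding="utf-8") as handle:
        next(handle)  # discard header
        for line in handle:
            buffer.append(line)
            if len(buffer) >= chunk_size:
                yield buffer
                buffer = []
    if buffer:
        yield buffer


def peak_bytes(rows: int, chunk_size: int = 1_000) -> int:
    """Write a synthetic file of `rows` records, stream it, return peak traced memory."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".csv", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write("employee_id,pay_period_start,pay_period_end,gross_pay,tax_jurisdiction\n")
        for n in range(rows):
            tmp.write(f"EMP{n:07d},2024-04-01,2024-04-15,1234.50,US-CA\n")
        path = tmp.name
    try:
        tracemalloc.start()
        seen = sum(len(window) for window in windows(path, chunk_size))
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    finally:
        os.remove(path)
    assert seen == rows  # every row was streamed exactly once
    return peak


small, large = peak_bytes(20_000), peak_bytes(200_000)
# A file 10x larger should need nowhere near 10x the memory.
print(large < small * 2)
```

The same comparison, scaled up and run against the real async reader, gives the flat memory-vs-rows slope the test calls for.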

Failure Modes

Header re-consumed on resume. Symptom: the first data row of every restart lands in quarantine. Root cause: readline() discards the header, but on a resume the start_offset skip starts counting from the wrong baseline, dropping or shifting a row. Remediation: count row_offset from the first data row only (as shown), and store the offset in committed-data-rows, never byte positions, since byte offsets shift if the file is re-encoded.
Queue sentinel lost on consumer crash. Symptom: the run hangs forever at end-of-file after a transient consumer error. Root cause: the consumer died before reading the None sentinel, so queue.join()/gather never completes. Remediation: wrap the whole run in asyncio.wait_for (Step 5) so a hang surfaces as a timeout, and re-raise consumer exceptions instead of swallowing them — a dead consumer must fail the run, not deadlock it.
Checkpoint advanced before the write committed. Symptom: a restart skips rows that were never actually persisted, silently underpaying a batch of employees. Root cause: committing the offset before (or concurrently with) the upsert. Remediation: advance the checkpoint strictly after the idempotent upsert returns success, in that order, so a crash between the two only ever causes a harmless re-process — never a skip.
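For the lost-sentinel failure mode, one way to make a dead consumer fail the run is to wait on both tasks and cancel whichever is still pending as soon as either raises. This is a sketch of that approach under stated assumptions, not the parent pattern's implementation: demo_producer and failing_consumer are hypothetical stand-ins, and a real pipeline would re-raise the error rather than return a string.

```python
import asyncio


async def demo_producer(queue: asyncio.Queue) -> None:
    for n in range(100):
        await queue.put(n)  # would block forever once the consumer is gone
    await queue.put(None)


async def failing_consumer(queue: asyncio.Queue) -> None:
    await queue.get()
    raise RuntimeError("upsert failed")  # stand-in for a database error


async def run_pipeline() -> str:
    queue: asyncio.Queue = asyncio.Queue(maxsize=4)
    tasks = [
        asyncio.create_task(demo_producer(queue)),
        asyncio.create_task(failing_consumer(queue)),
    ]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_EXCEPTION)
    for task in pending:
        task.cancel()  # release the producer blocked on queue.put
    await asyncio.gather(*pending, return_exceptions=True)  # let cancellation finish
    for task in done:
        if task.exception() is not None:
            return f"failed: {task.exception()}"
    return "ok"


print(asyncio.run(run_pipeline()))  # failed: upsert failed
```

Without the explicit cancel, the producer would stay suspended on a full queue and the failure would surface only when the outer timeout fired.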

Related Patterns

Async Batch Processing: the parent pattern, covering the bounded concurrency, idempotent upserts, and audit hashing this guide streams on top of.
CSV Ingestion Pipelines: the flat-file entry vector whose streaming windows large-file ingestion extends.
EDI 834 Parsing: structured benefit payloads that need a streaming state machine rather than line windows.
Multi-Format Payroll Data Ingestion & Normalization: the ingestion framework these patterns normalize into a single canonical schema.

Frequently Asked Questions

How big does a file have to be before I need this streaming pattern?

There is no universal byte threshold — the trigger is whether the file plus its parsed representation fits comfortably inside the worker’s memory budget with headroom for everything else running. A practical rule: if a single read would consume more than roughly a third of the container’s memory limit, switch to streaming. Tuning by CHUNK_SIZE rather than file size means the same code path handles a 50 MB file and a 50 GB file without change.

Why a bounded queue instead of just awaiting each chunk in sequence?

Sequential await gives you correctness but no overlap — the reader sits idle while the writer commits, and vice versa. A bounded queue lets the reader fetch the next window while the writer commits the current one, so I/O overlaps, while the maxsize ceiling still prevents the reader from racing ahead and buffering the whole file into memory. You get throughput and a memory bound at once.

Does resuming a run risk double-paying employees?

No, provided two things hold: the upsert is keyed on a deterministic idempotency_key with ON CONFLICT DO NOTHING, and the checkpoint advances only after that upsert succeeds. On resume, any rows re-read from the overlap are absorbed as conflicts rather than new inserts. The key must derive purely from record identity — employee and pay period — never from a timestamp or UUID, or every re-read looks like a new payment.
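The absorb-on-conflict behaviour can be demonstrated with an in-memory SQLite table standing in for the Postgres ledger. This is a sketch under assumptions: the ledger table and its two columns are hypothetical, the ON CONFLICT clause needs SQLite 3.24 or later, and the key is built from employee and pay period only. If a run identifier is part of the key, as in Step 2, it must be stable across restarts of the same file or re-read rows will not collide.

```python
import sqlite3
from decimal import Decimal


def idempotency_key(employee_id: str, period_start: str, period_end: str) -> str:
    """Derive the key from record identity only: no timestamps, no random UUIDs."""
    return f"{employee_id}|{period_start}|{period_end}"


def upsert(conn: sqlite3.Connection, rows: list[tuple[str, str, str, Decimal]]) -> None:
    """Insert rows; a row whose key already exists is skipped, not duplicated."""
    conn.executemany(
        "INSERT INTO ledger (idempotency_key, gross_pay) VALUES (?, ?) "
        "ON CONFLICT (idempotency_key) DO NOTHING",
        [(idempotency_key(e, s, t), str(g)) for e, s, t, g in rows],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ledger (idempotency_key TEXT PRIMARY KEY, gross_pay TEXT NOT NULL)"
)

window = [
    ("EMP00001", "2024-04-01", "2024-04-15", Decimal("1234.50")),
    ("EMP00002", "2024-04-01", "2024-04-15", Decimal("2750.00")),
]
upsert(conn, window)
upsert(conn, window)  # simulated re-read of the same window after a resume

print(conn.execute("SELECT COUNT(*) FROM ledger").fetchone()[0])  # 2
```

Storing gross_pay as text keeps the Decimal exact in SQLite; a Postgres ledger would typically use a NUMERIC column for the same purpose.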