CSV Ingestion Pipelines

A payroll run that trusts a vendor CSV at face value is one header rename away from a misfiled 941, and CSV ingestion pipelines — part of the Multi-Format Payroll Data Ingestion & Normalization framework — exist to make that trust explicit, verifiable, and reversible. The flat file is the oldest and least disciplined payroll interface still in production: legacy HRIS exports, timekeeping dumps, and hand-edited adjustment sheets all arrive as delimited text with no schema guarantee, no type system, and no contract beyond a header row that a vendor can silently change overnight. The job of this pipeline is to turn that ambiguity into a strictly typed, jurisdictionally anchored, checksum-stamped record set before any gross-to-net engine runs — and to route every record it cannot vouch for into a quarantine it can later defend line by line.

This pattern sits at the normalization boundary, downstream of file receipt and upstream of calculation. It does not calculate pay; it decides which rows are allowed to reach the calculator, what canonical shape they take, and which authority’s rules govern each one. Done correctly, a batch of ten thousand rows yields ten thousand identical decisions on a retry, a quarantine file that names exactly why each rejected row failed, and a hash chain that lets an auditor reconcile the output back to the byte stream that produced it.

Data Normalization & Boundary Enforcement

CSV files rarely honor a single canonical schema. Vendor exports introduce header drift, column reordering, implicit type coercion, and locale-specific number formatting that can silently corrupt a calculation without ever raising an error. The first responsibility of the pipeline is therefore boundary enforcement: validate the file’s shape and every field’s type before a single record materializes, and reject — never coerce — anything that does not fit the contract. This is the same boundary discipline formalized in Data Boundary Definitions, applied to the specific failure surface of delimited text.

The input contract for a payroll CSV is small and non-negotiable. Every accepted row must resolve to:

employee_id — a non-empty, trimmed identifier. An empty or whitespace-only value is a quarantine condition, never a generated placeholder.
pay_period_start / pay_period_end — ISO-8601 dates (%Y-%m-%d) where start <= end. A reversed or unparseable window is rejected; a pipeline that swaps them to “fix” the order corrupts proration silently.
gross_pay / hours_worked — parsed as Decimal via Decimal(str(value)) after stripping thousands separators. Native float must never enter monetary state. This decimal precision requirement is what keeps a penny of binary rounding from compounding across thousands of rows into a reconciliation break.
tax_jurisdiction — a two-character authority code that must resolve against a known jurisdiction table. An unresolved code is a quarantine condition; it is never silently defaulted to federal.
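The monetary part of this contract can be illustrated with a small sketch. The helper name parse_money is hypothetical, and it assumes US-style formatting (comma as thousands separator, period as decimal point); it rejects what it cannot parse instead of coercing it.

```python
from decimal import Decimal, InvalidOperation


def parse_money(raw: str) -> Decimal:
    """Parse a vendor amount such as '1,234.50' into an exact Decimal.

    Hypothetical helper; assumes US-style number formatting.
    """
    cleaned = raw.strip().replace(",", "")
    try:
        value = Decimal(cleaned)
    except InvalidOperation as exc:
        raise ValueError(f"unparseable amount: {raw!r}") from exc
    if not value.is_finite():
        # Decimal() accepts "NaN" and "Infinity"; neither is money.
        raise ValueError(f"non-finite amount: {raw!r}")
    return value


# Binary float cannot represent 0.10 exactly; Decimal can.
float_total = sum([0.10] * 3)
decimal_total = sum([parse_money("0.10")] * 3, Decimal("0"))

print(float_total == 0.30)               # False
print(decimal_total == Decimal("0.30"))  # True
print(parse_money(" 1,234.50 "))         # 1234.50
```

The float comparison fails on only three additions, which is the drift the contract rules out by keeping float away from monetary state entirely.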

Two classes of corruption are unique enough to delimited text that they deserve explicit handling at the boundary. The first is the UTF-8 byte-order mark: some Windows tools prepend the bytes \xef\xbb\xbf when saving, and when the file is read as plain UTF-8 the mark glues itself to the first header name, so employee_id arrives as \ufeffemployee_id and the schema gate rejects a structurally valid file. The fix is deterministic encoding resolution from the first bytes, not a guess. The second is positional fragility: any pipeline that reads row[3] instead of a named column miscalculates the instant a vendor inserts a leading column. Header-name mapping with case-insensitive normalization is mandatory; the detailed treatment of dropped, renamed, and nulled columns lives in handling missing payroll fields in CSV imports.
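The BOM failure is easy to reproduce in memory. The sketch below uses a made-up two-column payload and hypothetical helper names; it only shows how the codec choice changes the first header name that csv.DictReader reports.

```python
import csv
import io

# Bytes as a BOM-writing tool might emit them: UTF-8 BOM, then the header.
payload = b"\xef\xbb\xbfemployee_id,gross_pay\r\nE001,1234.50\r\n"


def first_header(data: bytes, encoding: str) -> str:
    """Decode the bytes and return the first column name csv sees."""
    text = io.StringIO(data.decode(encoding), newline="")
    reader = csv.DictReader(text)
    return reader.fieldnames[0]


def resolve_encoding(data: bytes) -> str:
    """Pick the codec from the leading bytes instead of guessing."""
    return "utf-8-sig" if data[:3] == b"\xef\xbb\xbf" else "utf-8"


print(repr(first_header(payload, "utf-8")))
# '\ufeffemployee_id'
print(repr(first_header(payload, resolve_encoding(payload))))
# 'employee_id'
```

Decoding with utf-8-sig consumes the mark at the codec level, so no header name ever needs string surgery afterwards.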

Sensitive identifiers carry their own boundary rules. Social Security Numbers, EINs, and bank routing numbers must be format-validated on entry and masked in any transit log the moment they are read, so a debug line never persists a raw SSN to disk. The validation layer must also run independently of the I/O layer: decoupling the schema gate from parsing lets the pipeline reject a malformed file before it consumes memory streaming a two-gigabyte payload, while preserving the original bytes for forensic reconciliation. The canonical output schema this stage produces is deliberately identical to the one emitted by EDI 834 Parsing and REST API Payroll Sync, so a record’s downstream treatment never depends on which channel delivered it.
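One way to validate and mask on entry is sketched below. The function name mask_ssn, the accepted formats (NNN-NN-NNNN or nine bare digits), and the choice to keep the last four digits visible are all assumptions made for illustration, not requirements stated above.

```python
import re

# Assumed formats: NNN-NN-NNNN or nine bare digits.
_SSN_PATTERN = re.compile(r"(\d{3})-?(\d{2})-?(\d{4})")


def mask_ssn(raw: str) -> str:
    """Validate an SSN-shaped string and return a log-safe masked form."""
    match = _SSN_PATTERN.fullmatch(raw.strip())
    if match is None:
        # Deliberately do not echo the raw value back in the error.
        raise ValueError("ssn_format_invalid")
    return f"***-**-{match.group(3)}"


print(mask_ssn("123-45-6789"))  # ***-**-6789
```

Only the masked string should ever be handed to a logger; the error path carries a reason code rather than the offending value for the same reason.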

Jurisdictional Resolution & Effective Dating

A two-character tax_jurisdiction code tells you where an employee is taxed; it does not tell you which rule version governs the pay period in the file. CSV adjustment batches are routinely processed weeks after the period they cover — a March correction loaded in April must bind to the rule that was in force in March, not the one in force on the run date. Resolving jurisdiction without effective dating is the most common way a structurally clean CSV pipeline still produces a wrong number.

The override hierarchy is most-protective-wins, evaluated municipal first, then state, then federal:

Municipal > State > Federal

A municipal minimum-wage or local-tax rule supersedes the state rule, which supersedes the federal baseline — but only for the jurisdiction tied to the row and only for a rule whose effective window contains the pay period. This is the same precedence the FLSA Threshold Mapping gate applies when resolving exempt status, and reusing it here means a default selected at ingestion can never contradict the threshold the calculation engine applies later.

Effective windows are half-open so that adjacent rule versions never both claim a boundary date. A rule is in force for an evaluation date $d$ when:

$$\text{effective\_start} \le d < \text{effective\_end}$$

with a missing effective_end modeled as $+\infty$. Resolution must select against pay_period_start, never datetime.now(). The canonical selection is a single indexed query:

SELECT rule_id
FROM jurisdiction_rules
WHERE jurisdiction = :tax_jurisdiction
  AND effective_start <= :pay_period_start
  AND (effective_end IS NULL OR :pay_period_start < effective_end)
ORDER BY authority_rank DESC   -- municipal=3, state=2, federal=1
LIMIT 1;
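The query can be exercised end to end with the standard-library sqlite3 module. This is a sketch under assumptions: the table layout, the rule IDs, and the storage of dates as ISO-8601 text (which compares correctly as strings) are illustrative choices, not part of the contract above.

```python
import sqlite3

QUERY = """
SELECT rule_id
FROM jurisdiction_rules
WHERE jurisdiction = :tax_jurisdiction
  AND effective_start <= :pay_period_start
  AND (effective_end IS NULL OR :pay_period_start < effective_end)
ORDER BY authority_rank DESC
LIMIT 1
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jurisdiction_rules ("
    "rule_id TEXT, jurisdiction TEXT, authority_rank INTEGER, "
    "effective_start TEXT, effective_end TEXT)"
)
# Hypothetical rule versions: one closed window, one open-ended successor.
conn.executemany(
    "INSERT INTO jurisdiction_rules VALUES (?, ?, ?, ?, ?)",
    [
        ("CA-2024", "CA", 2, "2024-01-01", "2024-04-01"),
        ("CA-2024b", "CA", 2, "2024-04-01", None),
    ],
)


def resolve(jurisdiction: str, period_start: str):
    """Return the controlling rule_id, or None if no rule is in force."""
    row = conn.execute(
        QUERY,
        {"tax_jurisdiction": jurisdiction, "pay_period_start": period_start},
    ).fetchone()
    return row[0] if row else None


print(resolve("CA", "2024-03-31"))  # CA-2024
print(resolve("CA", "2024-04-01"))  # CA-2024b
print(resolve("CA", "2023-12-31"))  # None
```

The boundary date 2024-04-01 belongs to the successor only, which is the half-open behavior: the earlier version's effective_end is excluded from its own window.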

Overlap detection belongs at rule-load time, not run time. If two versions of the same rule (same jurisdiction, same authority level) both claim a date, meaning their [start, end) windows overlap, the rule set must fail to load rather than letting the run pick arbitrarily between them. Rules at different authority levels are expected to coexist on a date; the override hierarchy resolves those. That turns a non-deterministic payroll bug, which is nearly impossible to reproduce, into a deploy-time error, which is the only place it is cheap to catch.
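A load-time check along these lines might look like the sketch below. The Rule tuple and assert_no_overlaps are hypothetical names, and grouping by (jurisdiction, authority_rank) is an assumption about what counts as "the same rule". After sorting each group by effective_start, comparing adjacent versions is enough to find any overlap among half-open windows.

```python
from datetime import date
from itertools import groupby
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    rule_id: str
    jurisdiction: str
    authority_rank: int            # municipal=3, state=2, federal=1
    effective_start: date
    effective_end: Optional[date]  # None means open-ended


def _level(rule: Rule) -> tuple[str, int]:
    """Grouping key: overlaps only matter within one level (assumption)."""
    return (rule.jurisdiction, rule.authority_rank)


def assert_no_overlaps(rules: list[Rule]) -> None:
    """Raise ValueError if two half-open windows at one level overlap."""
    ordered = sorted(rules, key=lambda r: (_level(r), r.effective_start))
    for _, versions in groupby(ordered, key=_level):
        previous: Optional[Rule] = None
        for current in versions:
            if previous is not None and (
                previous.effective_end is None
                or current.effective_start < previous.effective_end
            ):
                raise ValueError(
                    f"overlap rule_a={previous.rule_id} "
                    f"rule_b={current.rule_id}"
                )
            previous = current


adjacent = [
    Rule("CA-1", "CA", 2, date(2024, 1, 1), date(2024, 4, 1)),
    Rule("CA-2", "CA", 2, date(2024, 4, 1), None),
]
assert_no_overlaps(adjacent)  # adjacent half-open windows load cleanly
```

Calling this once when the rule set is loaded, and refusing to start on ValueError, keeps the ambiguity out of the run entirely.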

Production Implementation Pattern

The pipeline below uses Python’s standard library only and targets Python 3.9 or later, since it relies on built-in generic annotations such as list[str]. It resolves encoding deterministically, maps headers by name, casts every monetary field through Decimal, resolves the controlling jurisdiction rule against pay_period_start with a Municipal > State > Federal hierarchy, emits structured key=value logs that are copy-paste safe for production, and stamps each row with a checksum derived from SHA-256. Invalid rows route to a quarantine file with their exact rejection reason instead of halting the batch. The code is runnable as-is and follows PEP 8.

"""Deterministic CSV ingestion pipeline for payroll normalization."""
import csv
import hashlib
import logging
from dataclasses import dataclass
from datetime import date, datetime
from decimal import Decimal, ROUND_HALF_UP, InvalidOperation, getcontext
from pathlib import Path
from typing import Iterator, Optional

# Fixed-point context for all monetary arithmetic — never float.
getcontext().prec = 28
getcontext().rounding = ROUND_HALF_UP

logger = logging.getLogger("payroll.csv_ingestion")
DATE_FORMAT = "%Y-%m-%d"


class PayrollIngestionError(Exception):
    """Base exception for CSV pipeline failures."""


class SchemaValidationError(PayrollIngestionError):
    """Header mapping or field-cast failure (wrong shape)."""


class ComplianceViolationError(PayrollIngestionError):
    """PII, sign, or jurisdiction-rule failure (wrong meaning)."""


@dataclass(frozen=True)
class JurisdictionRule:
    """An effective-dated rule keyed by jurisdiction, half-open window."""
    rule_id: str
    jurisdiction: str
    authority_rank: int          # municipal=3, state=2, federal=1
    effective_start: date
    effective_end: Optional[date] = None

    def in_force(self, on: date) -> bool:
        if on < self.effective_start:
            return False
        return self.effective_end is None or on < self.effective_end


@dataclass(frozen=True)
class PayrollRecord:
    employee_id: str
    pay_period_start: date
    pay_period_end: date
    gross_pay: Decimal
    hours_worked: Decimal
    tax_jurisdiction: str
    rule_id: str
    row_checksum: str
    row_index: int


class CSVIngestionPipeline:
    CANONICAL_HEADERS = frozenset({
        "employee_id", "pay_period_start", "pay_period_end",
        "gross_pay", "hours_worked", "tax_jurisdiction",
    })

    def __init__(
        self,
        quarantine_dir: Path,
        jurisdiction_rules: list[JurisdictionRule],
    ) -> None:
        self.quarantine_dir = quarantine_dir
        self.quarantine_dir.mkdir(parents=True, exist_ok=True)
        self.jurisdiction_rules = jurisdiction_rules

    @staticmethod
    def _resolve_encoding(file_path: Path) -> str:
        """Detect a UTF-8 BOM from the first bytes; never guess."""
        with open(file_path, "rb") as fh:
            if fh.read(3) == b"\xef\xbb\xbf":
                return "utf-8-sig"
        return "utf-8"

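    def _normalize_headers(self, raw_headers: list[str]) -> dict[str, str]:
        """Map vendor headers to canonical names by name, not position."""
        normalized: dict[str, str] = {}
        for header in raw_headers:
            key = header.strip().lower().replace(" ", "_")
            if key in self.CANONICAL_HEADERS and key in normalized:
                # Two vendor columns collapsing to one canonical name is
                # ambiguous; reject instead of letting the last one win.
                raise SchemaValidationError(f"duplicate_column={key}")
            normalized[key] = header
        missing = self.CANONICAL_HEADERS - normalized.keys()
        if missing:
            raise SchemaValidationError(f"missing_columns={sorted(missing)}")
        return {c: normalized[c] for c in self.CANONICAL_HEADERS}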

    def _resolve_rule(self, jurisdiction: str, period_start: date) -> str:
        """Most-protective-wins, in force on period_start (never now())."""
        candidates = [
            r for r in self.jurisdiction_rules
            if r.jurisdiction == jurisdiction and r.in_force(period_start)
        ]
        if not candidates:
            raise ComplianceViolationError(
                f"no_rule jurisdiction={jurisdiction} "
                f"period_start={period_start}"
            )
        return max(candidates, key=lambda r: r.authority_rank).rule_id

    def _validate_and_cast(
        self, row: dict[str, str], idx: int, hmap: dict[str, str]
    ) -> PayrollRecord:
        """Strict casting, compliance checks, rule resolution, checksum."""
        try:
            emp_id = row[hmap["employee_id"]].strip()
            if not emp_id:
                raise SchemaValidationError("empty employee_id")

            start_dt = datetime.strptime(
                row[hmap["pay_period_start"]].strip(), DATE_FORMAT
            ).date()
            end_dt = datetime.strptime(
                row[hmap["pay_period_end"]].strip(), DATE_FORMAT
            ).date()
            if end_dt < start_dt:
                raise SchemaValidationError("pay_period_end precedes start")

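            gross = Decimal(row[hmap["gross_pay"]].strip().replace(",", ""))
            hours = Decimal(row[hmap["hours_worked"]].strip().replace(",", ""))
            # Decimal() parses "NaN" and "Infinity"; neither is a valid amount.
            if not (gross.is_finite() and hours.is_finite()):
                raise SchemaValidationError("non-finite gross_pay or hours")
            if gross < 0 or hours < 0:
                raise ComplianceViolationError("negative gross_pay or hours")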

            jurisdiction = row[hmap["tax_jurisdiction"]].strip().upper()
            if len(jurisdiction) != 2:
                raise ComplianceViolationError(
                    f"bad_jurisdiction={jurisdiction!r}"
                )
            rule_id = self._resolve_rule(jurisdiction, start_dt)

            raw = "|".join(row[hmap[c]] for c in sorted(self.CANONICAL_HEADERS))
            checksum = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]

            return PayrollRecord(
                employee_id=emp_id,
                pay_period_start=start_dt,
                pay_period_end=end_dt,
                gross_pay=gross,
                hours_worked=hours,
                tax_jurisdiction=jurisdiction,
                rule_id=rule_id,
                row_checksum=checksum,
                row_index=idx,
            )
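        except (KeyError, ValueError, InvalidOperation,
                AttributeError, TypeError) as exc:
            # A short row leaves missing cells as None, so .strip() raises
            # AttributeError; quarantine that row instead of crashing.
            raise SchemaValidationError(
                f"row={idx} cast_failure={exc}"
            ) from exc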

    def _quarantine(
        self, row: dict[str, str], idx: int, reason: str, stem: str
    ) -> None:
        """Append a rejected row plus its reason; never halt the batch."""
        target = self.quarantine_dir / f"{stem}_quarantine.csv"
        with open(target, "a", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(
                fh, fieldnames=[*row.keys(), "row_index", "rejection_reason"]
            )
            if fh.tell() == 0:
                writer.writeheader()
            writer.writerow({**row, "row_index": idx, "rejection_reason": reason})
        logger.warning("event=quarantine row=%s reason=%s", idx, reason)

    def process_file(self, file_path: Path) -> Iterator[PayrollRecord]:
        """Stream validated records; route failures to quarantine."""
        encoding = self._resolve_encoding(file_path)
        logger.info(
            "event=ingest_start file=%s encoding=%s", file_path.name, encoding
        )
        accepted = rejected = 0
        with open(file_path, "r", encoding=encoding, newline="") as fh:
            reader = csv.DictReader(fh)
            if not reader.fieldnames:
                raise SchemaValidationError("empty or malformed header")
            hmap = self._normalize_headers(list(reader.fieldnames))

            for idx, row in enumerate(reader, start=2):  # row 1 is the header
                try:
                    record = self._validate_and_cast(row, idx, hmap)
                except PayrollIngestionError as exc:
                    self._quarantine(row, idx, str(exc), file_path.stem)
                    rejected += 1
                    continue
                # Count before yielding, and keep the yield outside the try,
                # so consumer-side errors are never mistaken for bad rows.
                accepted += 1
                yield record
        logger.info(
            "event=ingest_done file=%s accepted=%s rejected=%s",
            file_path.name, accepted, rejected,
        )

Three properties make this safe in production. Monetary fields are Decimal from the first cast inward and the date fields are parsed strictly, so no binary float or ambiguous locale string ever reaches calculation. Jurisdiction resolution runs against pay_period_start and ranks by authority, so effective dating and the override hierarchy are enforced in one place rather than scattered through the calculator. And the checksum is a pure function of the row's raw values for the canonical columns, joined in sorted column order and hashed with SHA-256 (the hex digest is truncated to 16 characters), so a re-ingested file produces identical checksums, which is the determinism that reconciliation depends on. For files large enough to dominate the run window, the same generator-based streaming model feeds directly into Async Batch Processing for retry semantics and dead-letter handling.

Compliance Verification & Fallback Routing

Shipping the pipeline without a gate suite turns the quarantine from a safety valve back into a silent failure mode. Run the following checklist in CI and against a per-batch reconciliation job before any record reaches the payroll ledger.

Header-mapping and boundary tests. Feed files with reordered columns, mixed-case headers (Gross_Pay, GROSS PAY), and a leading inserted column; assert every accepted row maps by name and that a missing canonical column raises SchemaValidationError rather than producing a shifted value. Add a BOM-prefixed file and assert it is accepted, confirming encoding resolution. RFC 4180 covers quoting and line breaks rather than encoding, so exercise quoted fields and CRLF line endings as separate cases.
Decimal precision checks. Assert gross_pay and hours_worked are Decimal, that a thousands-separated value like 1,234.50 casts correctly, and that no code path routes money through float(). Reconcile a synthetic batch to the cent; per IRS Publication 15 withholding tables, rounding must be ROUND_HALF_UP, not banker’s rounding.
Effective-date drift tests. Resolve the same jurisdiction for a pay_period_start inside, exactly on, and one day outside each rule window. Confirm half-open behavior — the effective_start date resolves, the effective_end date falls through to the next version — and that resolution binds to the period start, never the run clock.
Override-hierarchy tests. With a municipal, state, and federal rule all in force for one date, assert the resolver returns the municipal rule_id; remove it and assert fallback to state, then federal. Feed two overlapping windows for one jurisdiction at the same authority level and assert the loader rejects the rule set rather than the run picking arbitrarily.
Fallback activation and quarantine integrity. Inject a row with an unmapped jurisdiction, an empty employee_id, a negative gross_pay, and a reversed pay period. Assert each lands in the quarantine file with a distinct rejection_reason and that the batch still completes. The unmapped-jurisdiction case is the handoff into the broader Fallback Routing Strategies tier hierarchy.
Rejection-rate threshold. Compute the batch quarantine rate $r = \frac{n_{\text{rejected}}}{n_{\text{total}}}$ and halt-and-escalate when $r > 0.02$. A spike past two percent almost always means a vendor changed the export, not that the data is genuinely bad: the header mapping needs updating, and the run should not proceed on a guess.
Checksum and audit reconciliation. Re-ingest an unchanged file and assert identical row_checksum values; compare them against vendor-supplied manifest hashes to detect in-transit corruption. Write the run summary to a write-once store and map each checksum to an audit-evidence record, retaining quarantine artifacts and manifests for the federal wage-record minimum under 29 CFR § 516.5 (three years) and the IRS employment-tax minimum of four years — most shops standardize on a seven-year retention to cover both.
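The rejection-rate gate from the checklist reduces to a few lines. In the sketch below, BatchHalt and check_rejection_rate are hypothetical names, and treating an empty batch as a zero rate is an assumption; a real run may prefer to escalate on an empty file. Fraction keeps the comparison exact, so a batch sitting exactly at two percent is not tripped by float rounding.

```python
from fractions import Fraction

# The two percent limit from the checklist; halt only when r > 0.02.
REJECTION_THRESHOLD = Fraction(2, 100)


class BatchHalt(Exception):
    """Raised when the quarantine rate signals a systemic problem."""


def check_rejection_rate(accepted: int, rejected: int) -> Fraction:
    """Return r = rejected / total, raising BatchHalt above the limit."""
    total = accepted + rejected
    if total == 0:
        return Fraction(0)  # assumption: an empty batch is not a breach
    rate = Fraction(rejected, total)
    if rate > REJECTION_THRESHOLD:
        raise BatchHalt(
            f"rejection_rate={float(rate):.4f} "
            f"rejected={rejected} total={total}"
        )
    return rate


print(check_rejection_rate(accepted=9_800, rejected=200))  # 1/50
```

With the pipeline's counters, this would run after the generator is exhausted and before any accepted record is committed to the ledger.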

Failure Modes & Gotchas

Float money entering through read_csv. A pipeline that lets pandas infer gross_pay as float64 reintroduces binary rounding, and the batch stops reconciling to the cent. Root cause: type inference on monetary columns. Fix: cast every monetary field through Decimal(str(value)) at the boundary and forbid float in the record dataclass, exactly as the implementation above does.
Silent default to federal on an unresolved jurisdiction. A blank or unknown tax_jurisdiction is treated as “use federal,” which underwithholds in a state with a higher floor and creates a wage-and-hour exposure. Root cause: defaulting an unresolved key instead of quarantining it. Fix: an unresolved jurisdiction is a quarantine condition; rule resolution runs only for a code that matches a known authority.
Positional column reads. Relying on row[3] for gross_pay miscalculates the moment a vendor prepends a pay_period_type column, and the error is invisible because every value is still a valid number. Root cause: positional indexing instead of header mapping. Fix: map by normalized header name and reject a file whose header set does not contain every canonical column.
Resolving rules against now(). A March correction loaded in April binds to April’s rule, silently rewriting history on a retroactive adjustment. Root cause: using the run clock instead of pay_period_start. Fix: pass the period start explicitly into rule resolution; the run clock never touches it.
The BOM that breaks the schema gate. A structurally perfect file is rejected because \ufeffemployee_id does not match employee_id. Root cause: reading a BOM-prefixed file as plain utf-8. Fix: detect the BOM from the first three bytes and open with utf-8-sig; never strip the mark after the fact with a fragile replace.

Frequently Asked Questions

Why not just use pandas read_csv for payroll ingestion?

read_csv is excellent for analysis and dangerous for payroll, because its convenience defaults work against compliance. It infers numeric monetary columns as float64, converts empty cells and markers such as NA or NULL to NaN instead of rejecting them, and pads short rows with missing values; every one of these hides exactly the corruption a payroll pipeline must surface. If you do use pandas, pin dtype=str on ingestion, disable NaN coercion with keep_default_na=False and na_filter=False, and cast money to Decimal yourself. The standard-library csv module is preferred here precisely because it does nothing you did not ask for.

How should the pipeline handle duplicate rows across re-sent files?

Treat the row_checksum as a natural dedup key and make ingestion idempotent: a row whose checksum already exists for the period is a no-op, not a second paycheck. Vendors routinely re-send a full file after a partial failure, so a pipeline that appends blindly will double-pay. Persist accepted checksums per pay period and skip on collision, logging event=duplicate_skip so the dedup is auditable rather than invisible.
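A minimal sketch of that idempotent skip follows. The Record tuple and skip_duplicates are hypothetical, and the in-memory set stands in for whatever durable store actually holds accepted checksums per pay period.

```python
import logging
from typing import Iterable, Iterator, NamedTuple

logger = logging.getLogger("payroll.csv_ingestion")


class Record(NamedTuple):
    employee_id: str
    pay_period_start: str
    row_checksum: str


def skip_duplicates(
    records: Iterable[Record], seen: set[tuple[str, str]]
) -> Iterator[Record]:
    """Yield only records whose (period, checksum) is not already stored."""
    for record in records:
        key = (record.pay_period_start, record.row_checksum)
        if key in seen:
            # Auditable no-op: the skip is logged, not silent.
            logger.info(
                "event=duplicate_skip checksum=%s", record.row_checksum
            )
            continue
        seen.add(key)
        yield record


seen_keys: set[tuple[str, str]] = set()
batch = [
    Record("E001", "2024-03-01", "aaaa000000000001"),
    Record("E002", "2024-03-01", "aaaa000000000002"),
]
first_pass = list(skip_duplicates(batch, seen_keys))
resend = list(skip_duplicates(batch, seen_keys))
print(len(first_pass), len(resend))  # 2 0
```

Because the key is recorded only when a record is yielded, a vendor re-send of the whole file produces no second set of records.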

What belongs in quarantine versus a hard batch halt?

Individual bad rows go to quarantine so the other ten thousand correct paychecks keep moving; the batch halts only when the rejection rate crosses the threshold (two percent in the checklist above) or when the file itself is unreadable: a missing canonical column, an empty header, or an unresolvable encoding. The principle is row-level isolation with batch-level circuit breaking: one ambiguous record should never stop payroll, but a systemic schema change should never be processed past.

How do we keep PII out of logs while still being able to debug?

Log identifiers, never values. The structured lines in the implementation emit row=, reason=, and counts — never a raw SSN, bank routing number, or gross amount. When you must correlate a quarantined row to its source, reference it by row_index and row_checksum, which point an investigator at the exact bytes in the preserved original file without spilling the protected data into a log aggregator that has a different retention and access model than the payroll store.

Does CSV ingestion need the same canonical schema as EDI and API ingestion?

Yes, and that is the point. CSV, EDI 834, and REST sync all emit the identical PayrollRecord shape so that downstream calculation, validation, and audit logic never branch on the source channel. If the schemas diverge, you end up maintaining three slightly different calculators and three ways to be wrong. Keeping the output contract uniform is what lets one set of verification gates cover every ingestion vector.

Handling missing payroll fields in CSV imports — deterministic handling of dropped, renamed, and nulled columns within this pipeline.
EDI 834 Parsing — the sibling ingestion vector that emits the same canonical record schema.
REST API Payroll Sync — real-time ingestion whose payloads must reconcile with CSV output.
Async Batch Processing — retry semantics and dead-letter routing for large CSV batches.
Data Boundary Definitions — the canonical-record contract every ingested row must satisfy.