File Diff

Compare two flat files in the same data source to see exactly what changed — which rows were added, removed, or modified, down to individual cell values.

File Diff is useful for:

  • Drift detection — compare today's data extract against yesterday's to spot unexpected changes
  • Reconciliation — verify that a transformed or migrated file matches its source
  • QA checks — confirm that a new file version contains only the expected modifications

How It Works

File Diff performs a key-based structural comparison. You choose one or more columns that uniquely identify each row (the key columns), and Precept matches rows across the two files using those keys. Every row is then classified as:

| Classification | Meaning |
| --- | --- |
| Added | Row exists in the target file but not in the base file |
| Removed | Row exists in the base file but not in the target file |
| Modified | Row exists in both files (same key) but one or more non-key values differ |
| Unchanged | Row exists in both files with identical values |

For modified rows, Precept also records cell-level changes — which columns changed, and what the old and new values are.

Both files must be in the same data source but can be different formats (e.g., base is CSV, target is Parquet) as long as they share the same column structure.
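
The classification described above can be sketched in a few lines. This is a minimal in-memory illustration, not Precept's implementation: `classifyRows`, `keyOf`, and the types are hypothetical names, and it assumes both inputs are already loaded as arrays of rows with unique keys.

```typescript
// Hypothetical sketch of a key-based diff; none of these names come from Precept.
type Cell = string | number | boolean | null;
type Row = Record<string, Cell>;
type CellChange = { base: Cell; target: Cell };
type Classified = {
  kind: "added" | "removed" | "modified" | "unchanged";
  row: Row;
  changes?: Record<string, CellChange>;
};

// Serialize the key column values so composite keys compare as one string.
function keyOf(row: Row, keyColumns: string[]): string {
  return JSON.stringify(keyColumns.map((c) => row[c] ?? null));
}

function classifyRows(base: Row[], target: Row[], keyColumns: string[]): Classified[] {
  const keySet = new Set(keyColumns);
  const baseByKey = new Map<string, Row>();
  for (const row of base) baseByKey.set(keyOf(row, keyColumns), row);

  const result: Classified[] = [];
  const seen = new Set<string>();

  for (const row of target) {
    const key = keyOf(row, keyColumns);
    seen.add(key);
    const before = baseByKey.get(key);
    if (before === undefined) {
      result.push({ kind: "added", row });
      continue;
    }
    // Same key in both files: compare every non-key column.
    const changes: Record<string, CellChange> = {};
    for (const col of Object.keys(row)) {
      if (keySet.has(col)) continue;
      const b = before[col] ?? null;
      const t = row[col] ?? null;
      if (b !== t) changes[col] = { base: b, target: t };
    }
    if (Object.keys(changes).length > 0) {
      result.push({ kind: "modified", row, changes });
    } else {
      result.push({ kind: "unchanged", row });
    }
  }

  // Anything left in the base file that the target never matched was removed.
  baseByKey.forEach((row, key) => {
    if (!seen.has(key)) result.push({ kind: "removed", row });
  });
  return result;
}
```

The sketch uses strict equality per cell, which mirrors the exact-comparison behavior noted later on this page; the real comparison runs in the data engine and may differ in detail.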

Using the Ad-Hoc Diff Page

The diff page lets you compare two files interactively from the Data Sources area.

1. Select Two Files

Navigate to the Data Sources page and open the data source containing your files. In the file list, use the checkboxes in the leftmost column to select exactly two files. Once two files are selected, a Compare button appears in the toolbar. Click it to open the diff page.

The first file you select becomes the base (the "before" snapshot), and the second becomes the target (the "after" snapshot). You can swap them on the diff page if needed.

2. Confirm the Model

Precept auto-detects the data model for each file using the data source's path-matching rules (fileTypeMappings). Three outcomes are possible:

  • Both files match the same model — the model is shown as read-only. You can override it if needed.
  • Files match different models — an error is shown. You can override to pick a single model, or go back and select different files.
  • No model matched — a dropdown lets you pick a model manually, or you can proceed without one. Without a model, Precept probes the base file to discover its columns.

3. Pick Key Columns

Choose one or more columns that uniquely identify each row. Precept uses these columns to match rows across files. For example, use account_id alone, or account_id and security_id together as a composite key.

The key column picker auto-completes from the model's field list (or from the probed columns if no model is selected). Precept remembers your last-used key columns per data source and model.

WARNING

Key columns must uniquely identify rows. If duplicate key values exist in either file, the diff is rejected with an error.
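
To check a candidate key before running a diff, a pre-flight duplicate check along these lines may help. It is a sketch only: `findDuplicateKeys` is a hypothetical helper that works on rows already loaded in memory, and it says nothing about how Precept itself detects duplicates.

```typescript
type Row = Record<string, string | number | boolean | null>;

// Hypothetical helper: returns each key tuple that occurs more than once,
// mapped to its occurrence count. An empty map means the key is unique.
function findDuplicateKeys(rows: Row[], keyColumns: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const row of rows) {
    const key = JSON.stringify(keyColumns.map((c) => row[c] ?? null));
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  const duplicates = new Map<string, number>();
  counts.forEach((n, key) => {
    if (n > 1) duplicates.set(key, n);
  });
  return duplicates;
}
```

If account_id alone produces duplicates, adding a second column such as security_id to the key is the usual fix.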

4. Run and Read Results

Click Run diff to execute the comparison. The results page shows:

  • A summary strip with counts: base rows, target rows, added, removed, modified, and unchanged
  • A results table with all changed rows, ordered by type (removed, then modified, then added), then by key columns

Each row has a colored left stripe indicating its type:

  • Green — added rows, showing target values
  • Red — removed rows, showing base values (muted text)
  • Amber — modified rows, showing target values with inline cell-level changes (old value → new value)

NULL values display as a symbol to distinguish them from empty strings.

Results are paginated at 1,000 rows per page. Use the Prev / Next controls below the table to navigate.

The diff page is URL-addressable. You can share or bookmark a comparison:

/data-sources/:sourceId/diff?base=path/to/base.csv&target=path/to/target.csv&keys=account_id,security_id

If keys is omitted from the URL, the page falls back to your last-used keys for that source and model.
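
A link in this shape can be generated programmatically. The helper below is a hypothetical sketch; it percent-encodes each parameter, which assumes the page decodes query parameters in the standard way (so `path%2Fto%2Fbase.csv` is read back as `path/to/base.csv`).

```typescript
// Hypothetical helper: builds a diff-page URL in the documented shape.
function buildDiffUrl(sourceId: string, base: string, target: string, keys: string[] = []): string {
  const params = [
    `base=${encodeURIComponent(base)}`,
    `target=${encodeURIComponent(target)}`,
  ];
  // keys is optional; when omitted, the page falls back to the last-used keys.
  if (keys.length > 0) {
    params.push(`keys=${encodeURIComponent(keys.join(","))}`);
  }
  return `/data-sources/${encodeURIComponent(sourceId)}/diff?${params.join("&")}`;
}
```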

API Usage

The diff is also available as a REST endpoint for programmatic use:

POST /rest/ingestion/diff

Request body:

```json
{
  "sourceId": "my-sftp-source",
  "base": "positions/2024-01-01.csv",
  "target": "positions/2024-01-02.csv",
  "keyColumns": ["account_id", "security_id"],
  "model": "vendor/module/PositionsModel",
  "limit": 1000,
  "offset": 0
}
```

  • sourceId, base, target, keyColumns are required
  • model is optional — when omitted, column validation is deferred to DuckDB
  • limit defaults to 1,000 (max 10,000); offset defaults to 0

The response contains a summary object, a rows array of changed rows, a hasMore pagination flag, and a columns array describing the result schema. Set the Accept header to text/csv for CSV output (summary and cell-level changes are only available in JSON).
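
To collect more changed rows than one page allows, a client can loop on hasMore. The sketch below is an assumption-laden illustration: `fetchAllChangedRows` and `PostJson` are hypothetical names, the transport is injected so the example does not depend on a specific HTTP client, and it assumes offset counts rows within the changed-row set.

```typescript
// Request fields mirror the documented body; everything else is hypothetical.
type DiffRequest = {
  sourceId: string;
  base: string;
  target: string;
  keyColumns: string[];
  model?: string;
  limit?: number;
  offset?: number;
};
type DiffPage = { rows: unknown[]; hasMore: boolean };

// Injected transport: POSTs a JSON body to a path and resolves to the parsed response.
type PostJson = (path: string, body: DiffRequest) => Promise<DiffPage>;

async function fetchAllChangedRows(
  post: PostJson,
  request: DiffRequest,
  pageSize = 1000
): Promise<unknown[]> {
  const limit = Math.min(pageSize, 10000); // documented maximum
  const rows: unknown[] = [];
  let offset = 0;
  while (true) {
    const page = await post("/rest/ingestion/diff", { ...request, limit, offset });
    rows.push(...page.rows);
    // Stop when the server reports no more pages (or returns an empty page).
    if (!page.hasMore || page.rows.length === 0) break;
    offset += page.rows.length;
  }
  return rows;
}
```

In practice `post` would be a thin wrapper around your HTTP client that sets Content-Type: application/json and adds whatever authentication your deployment requires.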

Workflow Usage

The Diff node lets you run file comparisons as part of an automated workflow — scheduled drift checks, reconciliation pipelines, and change-driven notifications.

Adding a Diff Node

Open a workflow in the graph editor and add a Diff node from the toolbar menu. The node card has four required fields:

| Field | Description |
| --- | --- |
| Source | The data source containing both files |
| Base | Path to the base file (e.g., positions/2024-01-01.csv) |
| Target | Path to the target file (e.g., positions/2024-01-02.csv) |
| Key Columns | Comma-separated column names that uniquely identify rows |

An Advanced section provides optional fields:

| Field | Description |
| --- | --- |
| Model | PDM model ID. Auto-resolved from the data source's path rules when you enter the base path. |
| Limit | Maximum number of changed rows to return (default 1,000, max 10,000) |
| Offset | Pagination offset into the changed-row set |
| Timeout | Activity timeout in milliseconds |

Output

The Diff node emits the full DiffResult as its output. Downstream nodes receive it as inputData with this shape:

```json
{
  "summary": {
    "baseRowCount": 1234,
    "targetRowCount": 1251,
    "added": 22,
    "removed": 5,
    "modified": 18,
    "unchanged": 1209
  },
  "rows": [
    {
      "_diff": "added",
      "account_id": "ACC-999",
      "balance": 50000
    },
    {
      "_diff": "modified",
      "account_id": "ACC-100",
      "balance": 75000,
      "_changes": {
        "balance": { "base": 70000, "target": 75000 }
      }
    }
  ],
  "hasMore": false,
  "columns": [
    { "name": "account_id", "type": "string" },
    { "name": "balance", "type": "number" }
  ]
}
```

Each row includes a _diff field (added, removed, or modified) and, for modified rows, a _changes map with base and target values for each changed column.
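
For code that consumes this output, for example in an Expression node, it can help to write the shape down as types. The definitions below are inferred from the example above and are not an official type definition; `rowsOfKind` and `changedColumns` are hypothetical helpers.

```typescript
type CellValue = string | number | boolean | null;
type CellChange = { base: CellValue; target: CellValue };
type DiffKind = "added" | "removed" | "modified";

// Shape inferred from the documented example; treat it as an assumption.
interface DiffResult {
  summary: {
    baseRowCount: number;
    targetRowCount: number;
    added: number;
    removed: number;
    modified: number;
    unchanged: number;
  };
  rows: Array<{ _diff: DiffKind; _changes?: Record<string, CellChange>; [column: string]: unknown }>;
  hasMore: boolean;
  columns: Array<{ name: string; type: string }>;
}

// Rows of one classification, e.g. only the added rows.
function rowsOfKind(result: DiffResult, kind: DiffKind): DiffResult["rows"] {
  return result.rows.filter((row) => row._diff === kind);
}

// Sorted, de-duplicated names of every column that changed in any modified row.
function changedColumns(result: DiffResult): string[] {
  const names = new Set<string>();
  for (const row of result.rows) {
    for (const name of Object.keys(row._changes ?? {})) names.add(name);
  }
  return Array.from(names).sort();
}
```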

Compatible Nodes

The Diff node can connect to:

  • Route — branch on diff summary values (e.g., send an alert only when changes exceed a threshold)
  • Transform — reshape the diff output (e.g., extract just the added rows)
  • Expression — run arbitrary JavaScript against the diff result
  • CallAction — send the diff as a request body to an external API
  • Data Source Write — write the changed rows to a file

Common Patterns

Drift Check with Threshold Alert

Compare two files and alert only when the number of changes exceeds a threshold:

Diff → Route → CallAction

Configure the Route node's condition to branch on the summary. For example, to alert when more than 10 rows were added:

data.summary.added > 10

The "true" branch connects to a CallAction that sends a notification (webhook, email API, etc.). The "false" branch can be left unconnected or wired to a no-op.
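
A threshold on added rows alone can miss removals and modifications. The sketch below shows two alternative predicates, total changes and change ratio; the function names are hypothetical, and whether a Route condition accepts the equivalent inline arithmetic (for example `data.summary.added + data.summary.removed + data.summary.modified > 10`) is an assumption to verify in your environment.

```typescript
type DiffSummary = {
  baseRowCount: number;
  targetRowCount: number;
  added: number;
  removed: number;
  modified: number;
  unchanged: number;
};

// Total number of rows that differ between the two files.
function totalChanges(summary: DiffSummary): number {
  return summary.added + summary.removed + summary.modified;
}

// True when the number of changed rows is strictly above the threshold.
function exceedsThreshold(summary: DiffSummary, maxChanges: number): boolean {
  return totalChanges(summary) > maxChanges;
}

// Changed rows as a fraction of the base file; useful when file sizes vary.
function changeRatio(summary: DiffSummary): number {
  if (summary.baseRowCount === 0) return totalChanges(summary) > 0 ? Infinity : 0;
  return totalChanges(summary) / summary.baseRowCount;
}
```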

Reconciliation — Write Changes to a File

Run a diff and persist the changed rows for downstream processing:

Diff → Data Source Write

The Write node receives the rows array from the diff result. Each written row includes the _diff classification column and, for modified rows, the _changes column with before/after values.

INFO

The diff summary (counts of added, removed, modified, unchanged) is not included in the written file — only per-row data is written. If you need to persist summary stats, add a Transform node between Diff and Write to reshape the summary into a writable row.
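
The reshaping itself is small. The sketch below shows only the logic, assuming the summary shape from the Output section; `summaryToRows` and the snake_case output column names are hypothetical, and how you express this inside a Transform node depends on that node's configuration.

```typescript
type DiffSummary = {
  baseRowCount: number;
  targetRowCount: number;
  added: number;
  removed: number;
  modified: number;
  unchanged: number;
};

// Hypothetical reshaping: one flat, writable row per diff run.
// comparedAt is supplied by the caller (e.g. the run date) so the row is traceable.
function summaryToRows(summary: DiffSummary, comparedAt: string): Array<Record<string, string | number>> {
  return [
    {
      compared_at: comparedAt,
      base_row_count: summary.baseRowCount,
      target_row_count: summary.targetRowCount,
      added: summary.added,
      removed: summary.removed,
      modified: summary.modified,
      unchanged: summary.unchanged,
    },
  ];
}
```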

Reshape with Transform

Extract just the modified rows and flatten the changes:

Diff → Transform → Data Source Write

Use a Transform node to filter data.rows to only _diff === 'modified' entries, or to restructure the output into a different format before writing.
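
One way to flatten the changes is to emit one output row per changed cell. This is a sketch of the logic only: `flattenModified` and the output column names (`column`, `base_value`, `target_value`) are hypothetical, and the row shape is taken from the Output example.

```typescript
type CellValue = string | number | boolean | null;
type DiffRow = {
  _diff: "added" | "removed" | "modified";
  _changes?: Record<string, { base: CellValue; target: CellValue }>;
  [column: string]: unknown;
};

// Keeps only modified rows and emits one flat record per changed cell,
// carrying the key columns along so each record can be traced to its row.
function flattenModified(rows: DiffRow[], keyColumns: string[]): Array<Record<string, unknown>> {
  const out: Array<Record<string, unknown>> = [];
  for (const row of rows) {
    const changes = row._changes;
    if (row._diff !== "modified" || changes === undefined) continue;
    for (const column of Object.keys(changes)) {
      const change = changes[column];
      if (change === undefined) continue;
      const flat: Record<string, unknown> = {};
      for (const key of keyColumns) flat[key] = row[key];
      flat["column"] = column;
      flat["base_value"] = change.base;
      flat["target_value"] = change.target;
      out.push(flat);
    }
  }
  return out;
}
```

A flat layout like this writes cleanly to CSV, where a nested _changes object would otherwise need to be serialized into a single cell.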

Usage Considerations

  • Static configuration: the Diff node's fields (Source, Base, Target, Key Columns) are configured directly on the node card. To compare different file pairs, create separate Diff nodes or update the node configuration.
  • Full result output: downstream nodes receive the complete DiffResult object, meaning the summary, the columns, and every changed row returned (up to Limit). Use a Transform node to filter to a specific subset of rows before passing data downstream.
  • Large result sets: the diff result flows through Temporal's workflow history, which has a ~2 MB default payload ceiling. Use the Limit field to cap the number of returned rows for large comparisons.
  • Writing diff results: a Data Source Write node wired directly to Diff writes the per-row data (including the _diff and _changes columns) but not the summary counts. To persist them, add a Transform node between Diff and Write that reshapes the summary into a writable row.
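
When choosing a Limit, a rough size estimate of a sample result can help. The sketch below measures the UTF-8 size of the JSON serialization; `utf8Length` and `fitsPayloadBudget` are hypothetical helpers, and the bytes Temporal actually stores may differ because of its own encoding overhead, so leave headroom below the ceiling.

```typescript
// UTF-8 byte length of a string, computed without platform-specific APIs.
function utf8Length(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code < 0x80) {
      bytes += 1;
    } else if (code < 0x800) {
      bytes += 2;
    } else if (code >= 0xd800 && code <= 0xdbff) {
      const next = text.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        bytes += 4; // valid surrogate pair encodes one 4-byte code point
        i++;
      } else {
        bytes += 3; // lone surrogate, counted as a 3-byte replacement
      }
    } else {
      bytes += 3;
    }
  }
  return bytes;
}

// True when the JSON form of value fits within budgetBytes (default 2 MiB).
function fitsPayloadBudget(value: unknown, budgetBytes = 2 * 1024 * 1024): boolean {
  return utf8Length(JSON.stringify(value)) <= budgetBytes;
}
```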

General Usage Considerations

  • Same-source comparison: both files must belong to the same data source.
  • Key columns: you specify, at query time, the key columns that identify each row. Precept uses these to match rows across the two files.
  • Exact comparison: values are compared exactly as stored. Numeric precision, date formatting, and string casing all affect whether a cell is reported as changed.
  • Column type hints: the type field in the result's columns array is derived from JavaScript typeof on the first row's value, not from the underlying storage metadata. Treat it as a rough hint.
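
The last point is easy to see in a small sketch. `inferColumnTypes` is a hypothetical re-creation of the documented behavior (typeof applied to the first row only), not Precept's code; note that a null in the first row yields "object", because that is what JavaScript's typeof returns for null.

```typescript
// Hypothetical re-creation: type hints taken from typeof on the first row's values.
function inferColumnTypes(rows: Array<Record<string, unknown>>): Array<{ name: string; type: string }> {
  const first = rows[0];
  if (first === undefined) return [];
  return Object.keys(first).map((name) => ({ name, type: typeof first[name] }));
}

// The balance column is numeric in later rows, but the hint follows row one.
const hints = inferColumnTypes([
  { account_id: "ACC-100", balance: null },
  { account_id: "ACC-101", balance: 75000 },
]);
// hints: account_id is "string", balance is "object"
```

If downstream logic depends on column types, validate against the model or the file's own schema instead of relying on this field.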