File Diff
Compare two flat files in the same data source to see exactly what changed — which rows were added, removed, or modified, down to individual cell values.
File Diff is useful for:
- Drift detection — compare today's data extract against yesterday's to spot unexpected changes
- Reconciliation — verify that a transformed or migrated file matches its source
- QA checks — confirm that a new file version contains only the expected modifications
How It Works
File Diff performs a key-based structural comparison. You choose one or more columns that uniquely identify each row (the key columns), and Precept matches rows across the two files using those keys. Every row is then classified as:
| Classification | Meaning |
|---|---|
| Added | Row exists in the target file but not in the base file |
| Removed | Row exists in the base file but not in the target file |
| Modified | Row exists in both files (same key) but one or more non-key values differ |
| Unchanged | Row exists in both files with identical values |
For modified rows, Precept also records cell-level changes — which columns changed, and what the old and new values are.
Both files must be in the same data source but can be different formats (e.g., base is CSV, target is Parquet) as long as they share the same column structure.
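The classification rules above can be sketched in a few lines of JavaScript. This is an illustration of key-based matching only, not Precept's implementation; the function names keyOf and diffRows are hypothetical, and rows are assumed to be plain objects that share the same columns.

```javascript
// Illustrative sketch only; not Precept's implementation.
// Rows are plain objects; keyColumns is an array of column names.
function keyOf(row, keyColumns) {
  // JSON-encode the key values so composite keys cannot collide
  // (["a,b", "c"] and ["a", "b,c"] produce different strings).
  return JSON.stringify(keyColumns.map((col) => row[col]));
}

function diffRows(baseRows, targetRows, keyColumns) {
  const baseByKey = new Map(baseRows.map((row) => [keyOf(row, keyColumns), row]));
  const seen = new Set();
  const rows = [];
  let unchanged = 0;

  for (const target of targetRows) {
    const key = keyOf(target, keyColumns);
    seen.add(key);
    const base = baseByKey.get(key);
    if (base === undefined) {
      rows.push({ _diff: "added", ...target });
      continue;
    }
    // Compare every non-key column with strict equality.
    const changes = {};
    for (const col of Object.keys(target)) {
      if (!keyColumns.includes(col) && base[col] !== target[col]) {
        changes[col] = { base: base[col], target: target[col] };
      }
    }
    if (Object.keys(changes).length > 0) {
      rows.push({ _diff: "modified", ...target, _changes: changes });
    } else {
      unchanged += 1;
    }
  }

  // Any base key never seen in the target file was removed.
  for (const [key, base] of baseByKey) {
    if (!seen.has(key)) rows.push({ _diff: "removed", ...base });
  }
  return { rows, unchanged };
}
```

The sketch assumes keys are unique in both files and does not reproduce the result ordering used by the product.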
Using the Ad-Hoc Diff Page
The diff page lets you compare two files interactively from the Data Sources area.
1. Select Two Files
Navigate to the Data Sources page and open the data source containing your files. In the file list, use the checkboxes in the leftmost column to select exactly two files. Once two files are selected, a Compare button appears in the toolbar. Click it to open the diff page.
The first file you select becomes the base (the "before" snapshot), and the second becomes the target (the "after" snapshot). You can swap them on the diff page if needed.
2. Confirm the Model
Precept auto-detects the data model for each file using the data source's path-matching rules (fileTypeMappings). Three outcomes are possible:
- Both files match the same model — the model is shown as read-only. You can override it if needed.
- Files match different models — an error is shown. You can override to pick a single model, or go back and select different files.
- No model matched — a dropdown lets you pick a model manually, or you can proceed without one. Without a model, Precept probes the base file to discover its columns.
3. Pick Key Columns
Choose one or more columns that uniquely identify each row. Precept uses these columns to match rows across files. For example, use account_id alone, or account_id and security_id together as a composite key.
The key column picker auto-completes from the model's field list (or from the probed columns if no model is selected). Precept remembers your last-used key columns per data source and model.
WARNING
Key columns must uniquely identify rows. If duplicate key values exist in either file, the diff is rejected with an error.
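If you want to check a candidate key before running a diff, a pre-check along these lines may help. It is a minimal sketch that assumes the rows are already loaded as plain objects; findDuplicateKeys is a hypothetical helper, not part of Precept.

```javascript
// Hypothetical pre-check: returns the key tuples that occur more than once.
function findDuplicateKeys(rows, keyColumns) {
  const counts = new Map();
  for (const row of rows) {
    // Encode the key values as JSON so composite keys stay unambiguous.
    const key = JSON.stringify(keyColumns.map((col) => row[col]));
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return [...counts]
    .filter(([, count]) => count > 1)
    .map(([key]) => JSON.parse(key));
}
```

An empty result means the chosen columns uniquely identify every row in that file. Run the check against both the base and the target file, since a duplicate in either one causes the diff to be rejected.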
4. Run and Read Results
Click Run diff to execute the comparison. The results page shows:
- A summary strip with counts: base rows, target rows, added, removed, modified, and unchanged
- A results table with all changed rows, ordered by type (removed, then modified, then added), then by key columns
Each row has a colored left stripe indicating its type:
- Green: added rows, showing target values
- Red: removed rows, showing base values (muted text)
- Amber: modified rows, showing target values with inline cell-level changes (old value → new value)
NULL values display as a ∅ symbol to distinguish them from empty strings.
Results are paginated at 1,000 rows per page. Use the Prev / Next controls below the table to navigate.
Deep Links
The diff page is URL-addressable. You can share or bookmark a comparison:
/data-sources/:sourceId/diff?base=path/to/base.csv&target=path/to/target.csv&keys=account_id,security_id
If keys is omitted from the URL, the page falls back to your last-used keys for that source and model.
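A deep link can be assembled programmatically. The sketch below uses the standard URLSearchParams API; buildDiffLink is a hypothetical helper name, and it assumes the diff page decodes percent-encoded query values the way a standard URL parser does.

```javascript
// Builds a diff-page deep link. Query values are percent-encoded by
// URLSearchParams, so "/" and "," appear as %2F and %2C in the output.
function buildDiffLink(sourceId, base, target, keys = []) {
  const params = new URLSearchParams({ base, target });
  // Omit keys entirely so the page can fall back to the last-used keys.
  if (keys.length > 0) params.set("keys", keys.join(","));
  return `/data-sources/${encodeURIComponent(sourceId)}/diff?${params}`;
}
```

Passing an empty keys array leaves the parameter off the URL, which triggers the last-used-keys fallback described above.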
API Usage
The diff is also available as a REST endpoint for programmatic use:
POST /rest/ingestion/diff
Request body:
{
"sourceId": "my-sftp-source",
"base": "positions/2024-01-01.csv",
"target": "positions/2024-01-02.csv",
"keyColumns": ["account_id", "security_id"],
"model": "vendor/module/PositionsModel",
"limit": 1000,
"offset": 0
}
- sourceId, base, target, and keyColumns are required
- model is optional; when omitted, column validation is deferred to DuckDB
- limit defaults to 1,000 (max 10,000); offset defaults to 0
The response contains a summary object, a rows array of changed rows, a hasMore pagination flag, and a columns array describing the result schema. Set the Accept header to text/csv for CSV output (summary and cell-level changes are only available in JSON).
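The following sketch shows one way to call the endpoint and page through all changed rows using hasMore. Several details are assumptions: the baseUrl argument, the Content-Type header, the absence of authentication, and the idea that offset advances by limit between pages. The helper names buildDiffRequest and fetchAllChangedRows are hypothetical.

```javascript
// Hypothetical helper: validates inputs and builds the JSON request body.
function buildDiffRequest({ sourceId, base, target, keyColumns, model, limit = 1000, offset = 0 }) {
  if (!sourceId || !base || !target || !Array.isArray(keyColumns) || keyColumns.length === 0) {
    throw new Error("sourceId, base, target, and keyColumns are required");
  }
  if (limit < 1 || limit > 10000) {
    throw new Error("limit must be between 1 and 10,000");
  }
  const body = { sourceId, base, target, keyColumns, limit, offset };
  if (model !== undefined) body.model = model; // model is optional
  return body;
}

// Pages through the changed rows until hasMore is false.
// fetchImpl is injectable so the loop can be exercised without a server.
async function fetchAllChangedRows(baseUrl, params, fetchImpl = fetch) {
  const rows = [];
  const limit = params.limit ?? 1000;
  let offset = params.offset ?? 0;
  for (;;) {
    const res = await fetchImpl(`${baseUrl}/rest/ingestion/diff`, {
      method: "POST",
      headers: { "Content-Type": "application/json", Accept: "application/json" },
      body: JSON.stringify(buildDiffRequest({ ...params, limit, offset })),
    });
    if (!res.ok) throw new Error(`Diff request failed with HTTP ${res.status}`);
    const page = await res.json();
    rows.push(...page.rows);
    if (!page.hasMore) return { summary: page.summary, columns: page.columns, rows };
    offset += limit; // assumption: offset counts rows in the changed-row set
  }
}
```

Collecting every page in memory is convenient for small diffs. For large comparisons, processing each page as it arrives is likely the safer design.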
Workflow Usage
The Diff node lets you run file comparisons as part of an automated workflow — scheduled drift checks, reconciliation pipelines, and change-driven notifications.
Adding a Diff Node
Open a workflow in the graph editor and add a Diff node from the toolbar menu. The node card has four required fields:
| Field | Description |
|---|---|
| Source | The data source containing both files |
| Base | Path to the base file (e.g., positions/2024-01-01.csv) |
| Target | Path to the target file (e.g., positions/2024-01-02.csv) |
| Key Columns | Comma-separated column names that uniquely identify rows |
An Advanced section provides optional fields:
| Field | Description |
|---|---|
| Model | PDM model ID. Auto-resolved from the data source's path rules when you enter the base path. |
| Limit | Maximum number of changed rows to return (default 1,000, max 10,000) |
| Offset | Pagination offset into the changed-row set |
| Timeout | Activity timeout in milliseconds |
Output
The Diff node emits the full DiffResult as its output. Downstream nodes receive it as inputData with this shape:
{
"summary": {
"baseRowCount": 1234,
"targetRowCount": 1251,
"added": 22,
"removed": 5,
"modified": 18,
"unchanged": 1209
},
"rows": [
{
"_diff": "added",
"account_id": "ACC-999",
"balance": 50000
},
{
"_diff": "modified",
"account_id": "ACC-100",
"balance": 75000,
"_changes": {
"balance": { "base": 70000, "target": 75000 }
}
}
],
"hasMore": false,
"columns": [
{ "name": "account_id", "type": "string" },
{ "name": "balance", "type": "number" }
]
}
Each row includes a _diff field (added, removed, or modified) and, for modified rows, a _changes map with base and target values for each changed column.
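As an example of consuming this shape, the sketch below counts how many modified rows changed each column. It relies only on the _diff and _changes fields shown above; countChangesByColumn is a hypothetical helper that could run in an Expression node or in any client of the REST endpoint.

```javascript
// Counts, per column, how many modified rows report a change in that column.
function countChangesByColumn(result) {
  const counts = {};
  for (const row of result.rows) {
    if (row._diff !== "modified") continue; // added/removed rows carry no _changes
    for (const col of Object.keys(row._changes ?? {})) {
      counts[col] = (counts[col] ?? 0) + 1;
    }
  }
  return counts;
}
```

For the sample output above, the result is a single entry for balance with a count of 1.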
Compatible Nodes
The Diff node can connect to:
- Route — branch on diff summary values (e.g., send an alert only when changes exceed a threshold)
- Transform — reshape the diff output (e.g., extract just the added rows)
- Expression — run arbitrary JavaScript against the diff result
- CallAction — send the diff as a request body to an external API
- Data Source Write — write the changed rows to a file
Common Patterns
Drift Check with Threshold Alert
Compare two files and alert only when the number of changes exceeds a threshold:
Diff → Route → CallAction
Configure the Route node's condition to branch on the summary. For example, to alert when more than 10 rows were added:
data.summary.added > 10
The "true" branch connects to a CallAction that sends a notification (webhook, email API, etc.). The "false" branch can be left unconnected or wired to a no-op.
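A relative threshold is often more robust than a fixed count. The sketch below expresses that logic as a plain function so it can be tested in isolation. Whether a Route condition accepts anything beyond a single expression is not stated here, so treat exceedsDriftThreshold as a hypothetical helper; the equivalent inline form would be (data.summary.added + data.summary.removed + data.summary.modified) / data.summary.baseRowCount > 0.05.

```javascript
// Returns true when the share of changed rows, relative to the base file,
// exceeds maxChangedFraction (for example 0.05 for 5%).
function exceedsDriftThreshold(data, maxChangedFraction) {
  const { added, removed, modified, baseRowCount } = data.summary;
  const changed = added + removed + modified;
  // An empty base file has no meaningful ratio; treat any change as drift.
  if (baseRowCount === 0) return changed > 0;
  return changed / baseRowCount > maxChangedFraction;
}
```

With the sample summary shown earlier (45 changed rows against 1,234 base rows), a 5% threshold does not trigger and a 3% threshold does.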
Reconciliation — Write Changes to a File
Run a diff and persist the changed rows for downstream processing:
Diff → Data Source Write
The Write node receives the rows array from the diff result. Each written row includes the _diff classification column and, for modified rows, the _changes column with before/after values.
INFO
The diff summary (counts of added, removed, modified, unchanged) is not included in the written file — only per-row data is written. If you need to persist summary stats, add a Transform node between Diff and Write to reshape the summary into a writable row.
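One possible reshaping is to turn the summary object into a single-row array. This is a sketch of the transformation logic only; how the Transform node is configured is not covered here, and summaryToRows is a hypothetical name.

```javascript
// Reshapes the diff summary into a one-element array of flat rows,
// so a writer that expects rows can persist the counts.
function summaryToRows(data) {
  const { baseRowCount, targetRowCount, added, removed, modified, unchanged } = data.summary;
  return [
    {
      baseRowCount,
      targetRowCount,
      added,
      removed,
      modified,
      unchanged,
      // Record whether the row list was truncated by the Limit setting.
      hasMore: data.hasMore,
    },
  ];
}
```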
Reshape with Transform
Extract just the modified rows and flatten the changes:
Diff → Transform → Data Source Write
Use a Transform node to filter data.rows to only _diff === 'modified' entries, or to restructure the output into a different format before writing.
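The filter-and-flatten step might look like the sketch below, which emits one output row per changed cell. flattenModifiedRows is a hypothetical helper, and the output column names (column, base, target) are illustrative choices that would collide with key columns of the same name.

```javascript
// Keeps only modified rows and emits one flat row per changed cell:
// the key column values, the changed column name, and its base/target values.
function flattenModifiedRows(data, keyColumns) {
  return data.rows
    .filter((row) => row._diff === "modified")
    .flatMap((row) =>
      Object.entries(row._changes).map(([column, change]) => ({
        ...Object.fromEntries(keyColumns.map((k) => [k, row[k]])),
        column,
        base: change.base,
        target: change.target,
      }))
    );
}
```

A flat, one-row-per-change layout tends to be easier to write to CSV than a nested _changes map.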
Usage Considerations
- Static configuration: the Diff node's fields (Source, Base, Target, Key Columns) are configured directly on the node card. To compare different file pairs, create separate Diff nodes or update the node configuration.
- Full result output: downstream nodes receive the complete DiffResult object, including all matched rows. Use a Transform node if you need to filter to a specific subset of rows before passing data downstream.
- Large result sets: the diff result flows through Temporal's workflow history, which has a ~2 MB default payload ceiling. Use the Limit field to control the number of returned rows for large comparisons.
- Writing diff results: when wiring Diff directly into a Data Source Write node, the per-row data (including _diff and _changes columns) is written, but summary counts are not. To persist summary stats, add a Transform node between Diff and Write to reshape the summary into a writable row.
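To get a rough sense of whether a result is near the payload ceiling mentioned under large result sets, you can measure its JSON size. This is an approximation only: it assumes the payload is close in size to the UTF-8 JSON encoding, which may not match how the workflow engine serializes it. Both function names are hypothetical.

```javascript
// Approximates the payload size as the UTF-8 byte length of the JSON encoding.
function estimatePayloadBytes(result) {
  return new TextEncoder().encode(JSON.stringify(result)).length;
}

// Checks the estimate against a budget; the 2 MiB default mirrors the
// approximate ceiling mentioned above and is not an exact limit.
function fitsPayloadBudget(result, budgetBytes = 2 * 1024 * 1024) {
  return estimatePayloadBytes(result) <= budgetBytes;
}
```

If a typical result comes close to the budget, lowering the Limit field is the most direct way to shrink it.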
Usage Considerations
- Same-source comparison: both files must belong to the same data source.
- Key columns: you specify the key columns that identify each row at query time. Precept uses these to match rows across the two files.
- Exact comparison: values are compared exactly as stored. Numeric precision, date formatting, and string casing all affect whether a cell is reported as changed.
- Column type hints: the type field in the result's columns array is derived from JavaScript typeof on the first row's value, not from the underlying storage metadata. Treat it as a rough hint.
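To see why typeof on a single row is only a hint, consider the sketch below. It mimics first-row inference for illustration and is not Precept's code; inferColumnHints is a hypothetical name, and skipping fields that start with an underscore is an assumption made to exclude _diff and _changes.

```javascript
// Derives { name, type } hints from the first row only, using typeof.
function inferColumnHints(rows) {
  const first = rows[0] ?? {};
  return Object.keys(first)
    .filter((name) => !name.startsWith("_")) // assumption: skip _diff / _changes
    .map((name) => ({ name, type: typeof first[name] }));
}
```

Because typeof null is "object" in JavaScript, a numeric column whose first value is NULL would be labeled object under this approach, and a column that mixes strings and numbers would take whichever type appears first.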