Data Types, Filtering & Comparison
This page explains, in detail, how DataRecs interprets your data once it has been extracted: how it decides whether a column is a number or text, how filters behave, how aggregations and tolerances are applied, and what happens at every edge case. If you have ever asked “why did this row match (or not match)?”, the answer is here.
The problem this solves
Section titled “The problem this solves”Different sources represent the same value in different ways. A cell containing the number one hundred might arrive as:
- the typed number
100(most databases, and Google Sheets when read as an unformatted value), or - the text string
"100"(Excel/SharePoint, CSV-style exports, and some spreadsheet configurations).
If DataRecs compared these naively, two problems would appear:
- Numeric comparisons would be wrong. Comparing text sorts lexicographically (character by character), so
"50" > "100"istrue— because the character'5'is greater than'1'. A filter likeAmount > 100would then keep the wrong rows. - The same value from two sources would not match. The number
100from one source and the string"100"from another would be treated as different.
To prevent both, DataRecs infers each column’s intended type from its data and normalises every value to that type before any filtering, aggregation, or comparison happens.
Column type inference
Section titled “Column type inference”Before comparing anything, DataRecs scans every row of each extracted column and assigns it a single type. The rules, in order:
| Condition | Inferred type |
|---|---|
| The column has at least one value, and every non-empty value parses as a finite number | Numeric (number) |
Every non-empty value is a boolean (true/false) | Boolean |
| Every non-empty value is a date/time | Date |
| Anything else (any non-numeric text present) | Text |
| The column is entirely empty | Text |
Key points:
- Inference scans the whole column, not a sample. A column is only treated as numeric if every populated cell is a number. One genuinely textual value is enough to make the whole column text.
- Mixed encodings of the same number are unified. A column where one source supplies
100(number) and another supplies"100"(string) is inferred numeric, and both load as the number100. They compare equal, as they should. - Empty cells never affect inference. An empty cell in an otherwise-numeric column does not make it text — see Empty cells and NULLs.
What counts as “a number”
Section titled “What counts as “a number””A value is numeric if it parses as a finite number: 100, -3.5, 0, 1000000, and their string forms "100", "-3.5". The following are not numeric and force a column to text:
- Anything with letters or symbols:
"N/A","100kg","$100","1,000"(the comma makes it non-numeric),"12%". - The special values
NaNandInfinity. - Blank strings (these are treated as empty, not as text).
Edge case: one stray non-numeric value in a number column
Section titled “Edge case: one stray non-numeric value in a number column”This is the most important edge case to understand.
Suppose an Amount column has 9,999 clean numbers and a single cell containing "N/A". By the inference rule, one non-numeric value makes the whole column text. The consequences are deliberate and safe:
- Equality,
IN,LIKE, and text operations still work on the column as text. - A numeric range filter (
>,<,>=,<=) against that column will fail the run loudly rather than silently produce a wrong answer. DataRecs attempts a numeric conversion for the comparison; because"N/A"cannot convert, the run ends in an ERRORED state with a clear failure, not a misleading MATCHED/UNMATCHED result.
This is a direct consequence of our correctness-first principle: when the data contradicts the operation you asked for, DataRecs surfaces the problem instead of guessing. To fix it, correct the stray value at the source (or remove it with a filter) so the column is cleanly numeric.
Empty cells and NULLs
Section titled “Empty cells and NULLs”Sources are inconsistent about empty cells. Google Sheets typically omits an empty cell (it arrives as a true NULL), while Excel/SharePoint usually delivers an empty string "". DataRecs treats both identically:
- An empty cell (
NULL, missing, or"") always loads asNULLinternally. It staysNULL— DataRecs never rewrites it to0, to an empty string, or to any other placeholder. - A
NULLnever becomes0and never becomes an empty string in a numeric column — so it never silently skews aSUM,AVG, or count. NULLvalues are ignored by aggregates (standard SQL behaviour):SUMandAVGskip them,COUNT(column)does not count them.
This unification (every “empty” encoding → NULL) is what makes IS NULL / IS NOT NULL filters behave consistently across sources — see the filter reference — and what lets an empty dimension group match across sources — see Empty and NULL dimension values.
Filtering
Section titled “Filtering”Filters are applied per source, before aggregation and comparison. Each source extracts only the rows that satisfy the filters, then those filtered rows are aggregated and compared. Multiple filters on the same source are combined with AND — a row must satisfy every filter to be included.
Filter operators
Section titled “Filter operators”| Operator | Example | Behaviour |
|---|---|---|
= | Status = 'Active' | Exact match. On numeric columns, compares numerically; on text columns, exact string match. |
!= | Status != 'Excluded' | Not equal. Keeps every row whose value differs from the given value. |
> | Amount > '100' | Greater than. Numeric when the column is numeric and the bound is a number; otherwise a text comparison. |
< | Amount < '50' | Less than. Same numeric/text rule as >. |
>= | Amount >= '100' | Greater than or equal — includes the boundary value. |
<= | Amount <= '50' | Less than or equal — includes the boundary value. |
IN | Region IN ['US','UK'] | Value is one of the listed values. |
NOT IN | Region NOT IN ['EU','APAC'] | Value is none of the listed values. |
LIKE | Name LIKE 'A%' | Pattern match. % matches any sequence of characters, _ matches a single character. |
IS NULL | Notes IS NULL | Value is empty. An empty string "" counts as NULL. No values needed. |
IS NOT NULL | Notes IS NOT NULL | Value is present and non-empty. An empty string "" does not count as present. No values needed. |
Numeric vs. text comparison
Section titled “Numeric vs. text comparison”For the range operators (>, <, >=, <=), the comparison type depends on the column:
- Numeric column + numeric bound → numeric comparison.
Amount > 100keeps rows where the number is genuinely greater than 100.50is correctly excluded. - Text column (or a non-numeric bound such as a date string) → text comparison, ordered lexicographically.
Because column type is inferred from the data, you normally get numeric comparison automatically on numeric columns. You do not need to declare the type. The one situation where a numeric range filter falls back is a text column — and as described in the stray-value edge case, a numeric bound against a text column fails loudly rather than comparing lexicographically.
IS NULL / IS NOT NULL
Section titled “IS NULL / IS NOT NULL”Because an empty cell may arrive as NULL from one source and "" from another, DataRecs treats them as the same “no value”:
IS NULLmatches rows where the cell isNULLor an empty string.IS NOT NULLmatches rows where the cell has genuine, non-empty content.
These two filters are exact complements: for any column, IS NULL and IS NOT NULL together select every row exactly once.
Date columns and cross-source normalisation
Section titled “Date columns and cross-source normalisation”Different sources represent dates differently:
- Google Sheets returns dates as ISO strings (
"2024-02-20") because the extractor requestsdateTimeRenderOption: 'FORMATTED_STRING'. - Excel/SharePoint returns dates as serial numbers — integers counting the days since 1899-12-30. For example,
2024-02-20is serial45342. - Databases (PostgreSQL, MySQL, etc.) return typed dates that are rendered as ISO strings by the driver.
If DataRecs compared these raw representations, an Excel serial number 45342 would never match the Google Sheets string "2024-02-20" — even though they represent the same date. The reconciliation would show UNMATCHED for identical data.
How DataRecs detects and normalises date columns
Section titled “How DataRecs detects and normalises date columns”DataRecs uses the Excel column’s number format (the format string set in Excel, e.g. yyyy-mm-dd or m/d/yy) as the authoritative signal. This is the same information Excel uses to display the value as a date. The detection works as follows:
- At extraction time, the Excel/SharePoint extractor fetches both the raw
values(serial numbers) and thenumberFormatfor each column from the Microsoft Graph API. - If a column’s number format is a date format (contains
y,m,ddate placeholders — not the accounting/general format), the extractor converts each serial number to a canonical ISO date string (YYYY-MM-DD). - The converted ISO string is what gets stored in the parquet artifact and used as the dimension key for comparison.
This means by the time the comparator pairs dimension keys, both Google Sheets and Excel produce the same "2024-02-20" string.
Date filters
Section titled “Date filters”When a filter uses an ISO date value (e.g. TransactionDate >= '2024-02-01'), DataRecs recognises it as a date comparison and uses proper calendrical ordering rather than lexicographic string ordering. Because the column values are already normalised to ISO format (which sorts correctly as text), this “just works” — but DataRecs applies an explicit DATE cast as a safety measure to ensure correct behaviour even with edge cases.
What about columns that look numeric but are actually dates?
Section titled “What about columns that look numeric but are actually dates?”DataRecs never guesses. If a column arrives as integers (e.g. 45342) from Excel without a date number format, it is treated as a numeric column. Possible reasons this could happen:
- The column was formatted as “General” or “Number” in Excel — the user did not intend it as a date.
- The values genuinely represent counts, amounts, or IDs that happen to be in the same numeric range as date serials.
If you notice a date column being treated as numeric, check the column’s format in the Excel workbook. Setting it to any date format (e.g. Short Date, YYYY-MM-DD, DD/MM/YYYY) will cause DataRecs to correctly normalise it on the next extraction.
Filters that match every row or no rows
Section titled “Filters that match every row or no rows”- A filter that matches no rows (for example
Status = 'NonExistent') is valid. Both sources contribute zero rows, there is nothing to disagree on, and the result is MATCHED with zero rows processed — not an error. - A filter that matches every row (for example
Status = 'Active'when every row is Active) behaves exactly as if no filter were applied.
Parameterised filter values
Section titled “Parameterised filter values”Filter values may contain {{parameter}} placeholders that are resolved at run time from the parameters supplied when the job is triggered. For example, Date >= '{{start_date}}' lets you run the same job for different date ranges without editing it. An unresolved parameter (one not supplied at run time) fails the run rather than comparing against a literal {{...}} string.
Aggregation
Section titled “Aggregation”DataRecs compares aggregated values grouped by your chosen dimensions, not raw rows. You define one or more measures, each pairing a column with an aggregation function.
| Aggregation | What it computes | Column requirement |
|---|---|---|
SUM | Total of the values | Numeric |
AVG | Mean of the values | Numeric |
MIN | Smallest value | Numeric, text, or date |
MAX | Largest value | Numeric, text, or date |
COUNT | Number of non-empty values | Any |
COUNT_DISTINCT | Number of distinct values | Any |
COUNT_STAR | Number of rows (including empties) | — (no column) |
VARIANCE | Statistical variance of the values | Numeric |
STD_DEV | Standard deviation (the square root of variance) | Numeric |
Notes:
SUM,AVG,VARIANCE, andSTD_DEVrequire numeric columns. If the column was inferred as text (because of a stray non-numeric value), these aggregations fail the run loudly rather than skipping or zeroing the bad value.MIN/MAX/COUNT/COUNT_DISTINCTwork on any column type, so they are safe on text and date columns.STD_DEVis always exactly the square root ofVARIANCEfor the same data. When comparing these across sources, allow a tinyABSOLUTEtolerance (for example0.01) to absorb floating-point rounding.
Dimensions and grouping
Section titled “Dimensions and grouping”Dimensions are the columns you group by — the equivalent of GROUP BY in SQL. The measures are computed within each group. For example, with dimension Region and measure SUM(Amount), DataRecs compares the total Amount per Region between the two sources.
Empty and NULL dimension values
Section titled “Empty and NULL dimension values”A row whose dimension cell is empty still belongs to a group — the empty group. DataRecs handles this so that identical data always matches, regardless of how each source encoded “empty”:
- An empty dimension cell (a missing/
NULLcell from Google Sheets, or an empty string""from Excel/SharePoint) loads asNULL, exactly like any other column (see Empty cells and NULLs). The empty value is kept asNULL— it is not converted to an empty string or any other placeholder.NULLmeans “no value”, and DataRecs preserves that meaning rather than inventing a fake key. - Within each source, all rows with an empty dimension collapse into one group keyed by
NULL(standard SQLGROUP BYbehaviour). You get a single empty group per source, never one group per empty row. - When the two sources are compared, the empty (
NULL) group on one side is matched against the empty (NULL) group on the other. DataRecs uses a NULL-safe match for this — two empty groups are treated as the same group (NULLmatchesNULL), so identical data in the empty group reconciles as MATCHED, not as a spurious difference.
Tolerances
Section titled “Tolerances”A tolerance defines how much a measure is allowed to differ between sources before the row is considered a mismatch. Tolerances are configured per measure, per stage.
| Type | Compares against | A row fails when |
|---|---|---|
ABSOLUTE | The raw difference (delta) between the two sources | delta > value |
PERCENT | The percentage difference relative to the smaller source value | percentage_difference > value |
| (none / 0) | — | Any non-zero difference fails (delta > 0) |
How the percentage difference is computed
Section titled “How the percentage difference is computed”For a PERCENT tolerance, DataRecs computes the percentage difference as the delta divided by the smallest non-zero source value, times 100. The special cases:
- Both sources are 0 → percentage difference is
0(they agree exactly). - One source is 0 and the other is non-zero → percentage difference is
100. A zero-versus-non-zero difference can never be absorbed by any finite percentage tolerance, so it always counts as a mismatch.
Mixed tolerances and the “any breach fails” rule
Section titled “Mixed tolerances and the “any breach fails” rule”When a stage has multiple measures with their own tolerances, a row is MATCHED only if every measure passes its tolerance. A single measure exceeding its threshold makes the row a mismatch, even if all other measures pass. Each measure’s pass/fail is reported individually so you can see exactly which one breached.
Comparison and the MATCHED / UNMATCHED result
Section titled “Comparison and the MATCHED / UNMATCHED result”After filtering and aggregation, DataRecs performs a full outer join of the two aggregated result sets on the dimension columns, then evaluates the tolerances. The join is NULL-safe on the dimension keys: a group whose dimension is empty (NULL) on both sides is matched to itself rather than appearing as two one-sided rows (see Empty and NULL dimension values).
- MATCHED — every compared group passed every tolerance check. The data agrees within the limits you set.
- UNMATCHED — at least one group failed at least one tolerance check, or a group exists on one side but not the other.
A run reports counts alongside the result: rows processed, rows matched, and rows unmatched. A MATCHED result always has zero unmatched rows.
Multi-stage drill-down
Section titled “Multi-stage drill-down”Jobs can define multiple stages, each comparing at a progressively finer dimension grouping. The default strategy, DRILL_DOWN_ON_FAILURE, runs the coarsest stage first and only drills into a finer stage for the groups that failed. This makes large reconciliations efficient: if the grand total matches, there is no need to compare every sub-group.
See Configuring a Reconciliation Job for the stage strategies (DRILL_DOWN_ON_FAILURE, SEQUENTIAL, PARALLEL) and how to configure them.
Correctness first
Section titled “Correctness first”Throughout this page you will notice a consistent principle: when DataRecs cannot compute a correct answer, it fails the run rather than returning a wrong one.
- A numeric operation on data that is not actually numeric → ERRORED, not a silently-wrong match.
- An extraction or comparison failure → surfaced as ERRORED, never a partial or approximate result.
- An empty result from a filter → a legitimate MATCHED with zero rows, never a crash.
This is intentional. The value of a reconciliation platform is that you can trust the numbers. A wrong “MATCHED” is far more damaging than a clearly-reported failure, so DataRecs always errs toward telling you when something is wrong.
Summary
Section titled “Summary”- DataRecs infers each column’s type from all its values; a column is numeric only if every populated value is a number.
- The same value supplied as a number by one source and a string by another is unified and compares equal.
- Empty cells (
NULLor"") are treated identically and never skew aggregates or NULL filters. - Numeric range filters compare numerically on numeric columns; a numeric filter on a genuinely textual column fails loudly rather than comparing lexicographically.
- Aggregations, tolerances, and the MATCHED/UNMATCHED outcome behave identically regardless of source.
- When a correct answer is not possible, the run ERRORs instead of returning a misleading result.