Data Types, Filtering & Comparison

This page explains, in detail, how DataRecs interprets your data once it has been extracted: how it decides whether a column is a number or text, how filters behave, how aggregations and tolerances are applied, and what happens at every edge case. If you have ever asked “why did this row match (or not match)?”, the answer is here.

The problem this solves

Different sources represent the same value in different ways. A cell containing the number one hundred might arrive as:

the typed number 100 (most databases, and Google Sheets when read as an unformatted value), or
the text string "100" (Excel/SharePoint, CSV-style exports, and some spreadsheet configurations).

If DataRecs compared these naively, two problems would appear:

Numeric comparisons would be wrong. Comparing text sorts lexicographically (character by character), so "50" > "100" is true — because the character '5' is greater than '1'. A filter like Amount > 100 would then keep the wrong rows.
The same value from two sources would not match. The number 100 from one source and the string "100" from another would be treated as different.

To prevent both, DataRecs settles on one type per column before any filtering, aggregation, or comparison happens — taken from the database’s declared schema where there is one, and inferred from the values for schema-less spreadsheet sources — and normalises every value to that type.

How DataRecs types each column

Every column is classified as numeric, date, or text before anything else happens. How that type is decided depends on the source, because a database describes its columns and a spreadsheet does not.

Database sources — the declared type

For database sources (PostgreSQL, MySQL, SQL Server, Oracle, Db2, and BigQuery), DataRecs uses each column’s declared type from the database catalog (information_schema and the per-engine equivalents). The type is read once, from the schema — DataRecs does not scan the column’s values to guess it. A column declared NUMERIC, INT, DECIMAL, and so on is numeric; a VARCHAR/TEXT column is text; a DATE/TIMESTAMP column is temporal.

Because the declared type is authoritative, filtering and aggregation run inside the source database using that type. A numeric filter on a numeric column compares numerically; the same filter against a text column is handled by the source engine as a text comparison. A single unusual value does not reclassify the column — its type is fixed by the schema.

Spreadsheet sources — inferred from the values

Google Sheets and Excel/SharePoint have no schema to declare column types, so for these sources DataRecs infers each column’s type in-worker by examining its values. The rule:

Condition	Inferred type
The column has at least one value, and every non-empty value is a finite number (and not a leading-zero identifier such as `007`)	Numeric
Anything else — any non-numeric value present, a mix of types, or leading-zero strings	Text
The column is entirely empty	Text

Key points:

Inference scans the whole column, not a sample. A column is treated as numeric only if every populated cell is a number. One genuinely textual value is enough to make the whole column text.
Numbers written as text are still numeric. A cell holding "100" (a string) and one holding 100 (a number) are both recognised as the number 100, so a spreadsheet column mixing the two encodings is numeric. (Dates are handled separately — see Date columns and cross-source normalisation — and compare as canonical date strings.)
Leading-zero strings stay text. A value like "007" is kept as text so identifier digits are never lost to 7.
Empty cells never affect inference. An empty cell in an otherwise-numeric column does not make it text — see Empty cells and NULLs.

When two sources of different kinds are compared — say a numeric database column against a spreadsheet column — DataRecs reconciles their physical types to a single value-preserving type at comparison time, so identical values still match.

What counts as “a number” (spreadsheet sources)

A value is numeric if it parses as a finite number: 100, -3.5, 0, 1000000, and their string forms "100", "-3.5". The following are not numeric and force a spreadsheet column to text:

Anything with letters or symbols: "N/A", "100kg", "$100", "1,000" (the comma makes it non-numeric), "12%".
Leading-zero identifiers such as "007" (kept as text to preserve the digits).
The special values NaN and Infinity.
Blank strings (these are treated as empty, not as text).

Edge case: one stray non-numeric value in a number column

This edge case applies to spreadsheet sources, which infer a column’s type from its values. Database columns keep their declared type regardless of any individual value.

Suppose a spreadsheet Amount column has 9,999 clean numbers and a single cell containing "N/A". By the inference rule, one non-numeric value makes the whole column text. The consequences are deliberate and safe:

Equality, IN, LIKE, and text operations still work on the column as text.
A numeric range filter (>, <, >=, <=), or a numeric aggregation (SUM, AVG, …), against that column will fail the run loudly rather than silently produce a wrong answer. DataRecs applies the numeric operation with an explicit numeric conversion; because "N/A" cannot convert, the run ends in an ERRORED state with a clear failure, not a misleading MATCHED/UNMATCHED result.

This is a direct consequence of our correctness-first principle: when the data contradicts the operation you asked for, DataRecs surfaces the problem instead of guessing. To fix it, correct the stray value at the source (or remove it with a filter) so the column is cleanly numeric.

Empty cells and NULLs

Sources are inconsistent about empty cells. Google Sheets typically omits an empty cell (it arrives as a true NULL), while Excel/SharePoint usually delivers an empty string "". DataRecs treats both identically:

An empty cell (NULL, missing, or "") always loads as NULL internally. It stays NULL — DataRecs never rewrites it to 0, to an empty string, or to any other placeholder.
A NULL never becomes 0 and never becomes an empty string in a numeric column — so it never silently skews a SUM, AVG, or count.
NULL values are ignored by aggregates (standard SQL behaviour): SUM and AVG skip them, COUNT(column) does not count them.

This unification (every “empty” encoding → NULL) is what makes IS NULL / IS NOT NULL filters behave consistently across sources — see the filter reference — and what lets an empty dimension group match across sources — see Empty and NULL dimension values.

Filtering

Filters are applied per source, before aggregation and comparison. Each source extracts only the rows that satisfy the filters, then those filtered rows are aggregated and compared. Multiple filters on the same source are combined with AND — a row must satisfy every filter to be included.

Filter operators

Operator	Example	Behaviour
`=`	`Status = 'Active'`	Exact match. On numeric columns, compares numerically; on text columns, exact string match.
`!=`	`Status != 'Excluded'`	Not equal. Keeps every row whose value differs from the given value.
`>`	`Amount > '100'`	Greater than. Numeric when the column is numeric and the bound is a number; otherwise a text comparison.
`<`	`Amount < '50'`	Less than. Same numeric/text rule as `>`.
`>=`	`Amount >= '100'`	Greater than or equal — includes the boundary value.
`<=`	`Amount <= '50'`	Less than or equal — includes the boundary value.
`IN`	`Region IN ['US','UK']`	Value is one of the listed values.
`NOT IN`	`Region NOT IN ['EU','APAC']`	Value is none of the listed values.
`LIKE`	`Name LIKE 'A%'`	Pattern match. `%` matches any sequence of characters, `_` matches a single character.
`IS NULL`	`Notes IS NULL`	Value is empty. An empty string `""` counts as NULL. No values needed.
`IS NOT NULL`	`Notes IS NOT NULL`	Value is present and non-empty. An empty string `""` does not count as present. No values needed.

Numeric vs. text comparison

For the range operators (>, <, >=, <=), the comparison type depends on the column:

Numeric column + numeric bound → numeric comparison. Amount > 100 keeps rows where the number is genuinely greater than 100. 50 is correctly excluded.
Text column (or a non-numeric bound such as a date string) → text comparison, ordered lexicographically.

You do not declare the type yourself — for database columns it comes from the declared schema type, and for spreadsheet columns it is inferred from the values (see How DataRecs types each column). A numeric range filter therefore compares numerically on any numeric column automatically. The exception is a spreadsheet column that a stray non-numeric value has forced to text: as described in the stray-value edge case, a numeric bound there fails loudly rather than returning a misleading answer.

`IS NULL` / `IS NOT NULL`

Because an empty cell may arrive as NULL from one source and "" from another, DataRecs treats them as the same “no value”:

IS NULL matches rows where the cell is NULL or an empty string.
IS NOT NULL matches rows where the cell has genuine, non-empty content.

These two filters are exact complements: for any column, IS NULL and IS NOT NULL together select every row exactly once.

Date columns and cross-source normalisation

Different sources represent dates differently:

Google Sheets returns dates as ISO strings ("2024-02-20") because the extractor requests dateTimeRenderOption: 'FORMATTED_STRING'.
Excel/SharePoint returns dates as serial numbers — integers counting the days since 1899-12-30. For example, 2024-02-20 is serial 45342.
Databases (PostgreSQL, MySQL, etc.) return typed dates that are rendered as ISO strings by the driver.

If DataRecs compared these raw representations, an Excel serial number 45342 would never match the Google Sheets string "2024-02-20" — even though they represent the same date. The reconciliation would show UNMATCHED for identical data.

How DataRecs detects and normalises date columns

DataRecs uses the Excel column’s number format (the format string set in Excel, e.g. yyyy-mm-dd or m/d/yy) as the authoritative signal. This is the same information Excel uses to display the value as a date. The detection works as follows:

At extraction time, the Excel/SharePoint extractor fetches both the raw values (serial numbers) and the numberFormat for each column from the Microsoft Graph API.
If a column’s number format is a date format (contains y, m, d date placeholders — not the accounting/general format), the extractor converts each serial number to a canonical ISO date string (YYYY-MM-DD).
The converted ISO string is what gets stored in the parquet artifact and used as the dimension key for comparison.

This means by the time the comparator pairs dimension keys, both Google Sheets and Excel produce the same "2024-02-20" string.

Date filters

When a filter uses an ISO date value (e.g. TransactionDate >= '2024-02-01'), DataRecs recognises it as a date comparison and uses proper calendrical ordering rather than lexicographic string ordering. Because the column values are already normalised to ISO format (which sorts correctly as text), this “just works” — but DataRecs applies an explicit DATE cast as a safety measure to ensure correct behaviour even with edge cases.

What about columns that look numeric but are actually dates?

DataRecs never guesses. If a column arrives as integers (e.g. 45342) from Excel without a date number format, it is treated as a numeric column. Possible reasons this could happen:

The column was formatted as “General” or “Number” in Excel — the user did not intend it as a date.
The values genuinely represent counts, amounts, or IDs that happen to be in the same numeric range as date serials.

If you notice a date column being treated as numeric, check the column’s format in the Excel workbook. Setting it to any date format (e.g. Short Date, YYYY-MM-DD, DD/MM/YYYY) will cause DataRecs to correctly normalise it on the next extraction.

Filters that match every row or no rows

A filter that matches no rows (for example Status = 'NonExistent') is valid. Both sources contribute zero rows, there is nothing to disagree on, and the result is MATCHED with zero rows processed — not an error.
A filter that matches every row (for example Status = 'Active' when every row is Active) behaves exactly as if no filter were applied.

Parameterised filter values

Filter values may contain {{parameter}} placeholders that are resolved at run time from the parameters supplied when the job is triggered. For example, Date >= '{{start_date}}' lets you run the same job for different date ranges without editing it. An unresolved parameter (one not supplied at run time) fails the run rather than comparing against a literal {{...}} string.

Aggregation

DataRecs compares aggregated values grouped by your chosen dimensions, not raw rows. You define one or more measures, each pairing a column with an aggregation function.

Aggregation	What it computes	Column requirement
`SUM`	Total of the values	Numeric
`AVG`	Mean of the values	Numeric
`MIN`	Smallest value	Numeric, text, or date
`MAX`	Largest value	Numeric, text, or date
`COUNT`	Number of non-empty values	Any
`COUNT_DISTINCT`	Number of distinct values	Any
`COUNT_STAR`	Number of rows (including empties)	— (no column)
`VARIANCE`	Statistical variance of the values	Numeric
`STD_DEV`	Standard deviation (the square root of variance)	Numeric

Notes:

SUM, AVG, VARIANCE, and STD_DEV require numeric columns. If the column is text — declared that way in the database, or, for a spreadsheet source, forced to text by a stray non-numeric value — these aggregations fail the run loudly rather than skipping or zeroing the bad value.
MIN/MAX/COUNT/COUNT_DISTINCT work on any column type, so they are safe on text and date columns.
STD_DEV is always exactly the square root of VARIANCE for the same data. When comparing these across sources, allow a tiny ABSOLUTE tolerance (for example 0.01) to absorb floating-point rounding.

Dimensions and grouping

Dimensions are the columns you group by — the equivalent of GROUP BY in SQL. The measures are computed within each group. For example, with dimension Region and measure SUM(Amount), DataRecs compares the total Amount per Region between the two sources.

Empty and NULL dimension values

A row whose dimension cell is empty still belongs to a group — the empty group. DataRecs handles this so that identical data always matches, regardless of how each source encoded “empty”:

An empty dimension cell (a missing/NULL cell from Google Sheets, or an empty string "" from Excel/SharePoint) loads as NULL, exactly like any other column (see Empty cells and NULLs). The empty value is kept as NULL — it is not converted to an empty string or any other placeholder. NULL means “no value”, and DataRecs preserves that meaning rather than inventing a fake key.
Within each source, all rows with an empty dimension collapse into one group keyed by NULL (standard SQL GROUP BY behaviour). You get a single empty group per source, never one group per empty row.
When the two sources are compared, the empty (NULL) group on one side is matched against the empty (NULL) group on the other. DataRecs uses a NULL-safe match for this — two empty groups are treated as the same group (NULL matches NULL), so identical data in the empty group reconciles as MATCHED, not as a spurious difference.

How dimension keys are matched

Two rows land in the same group only when their dimension values are genuinely the same. Sources — and the databases behind them — encode “the same” text in cosmetically different ways, so DataRecs normalises every dimension key to a canonical form before grouping. This keeps identical data together without ever merging values that are actually different:

Trailing padding is trimmed. A fixed-width CHAR(n) column pads short values with spaces ("ABC" stored in a CHAR(5) becomes "ABC "). DataRecs trims that trailing padding, so a CHAR key matches the same text stored in a VARCHAR column on another source.
Unicode is normalised (NFC). The same accented character can be stored two ways — a single “é” code point, or a plain “e” followed by a combining accent. These are different bytes but the same character. DataRecs normalises both to one form, so "café" matches "café" across sources regardless of how each stored it.
Keys are case-sensitive and byte-exact. After the normalisation above, keys are compared exactly. "Munich" and "MUNICH" are different groups, on every source — including databases (such as MySQL and SQL Server) whose default collation would otherwise merge them. This is deliberate: silently folding case could hide a real difference between two groups, and the platform never trades correctness for convenience. If you want case-insensitive grouping, normalise the case in your source query or a transformation before reconciling.
Booleans canonicalise. A genuine boolean column canonicalises to true / false, so a boolean matches a source that carries the same flag as the text "true" / "false". (A boolean stored as the numbers 0 / 1 is treated as a number, not a boolean, because the two are indistinguishable in the data.)

Tolerances

A tolerance defines how much a measure is allowed to differ between sources before the row is considered a mismatch. Tolerances are configured per measure, per stage.

Type	Compares against	A row fails when
`ABSOLUTE`	The raw difference (delta) between the two sources	`delta > value`
`PERCENT`	The percentage difference relative to the smaller source value	`percentage_difference > value`
(none / 0)	—	Any non-zero difference fails (`delta > 0`)

How the percentage difference is computed

For a PERCENT tolerance, DataRecs computes the percentage difference as the delta divided by the smallest non-zero source value, times 100. The special cases:

Both sources are 0 → percentage difference is 0 (they agree exactly).
One source is 0 and the other is non-zero → percentage difference is 100. A zero-versus-non-zero difference can never be absorbed by any finite percentage tolerance, so it always counts as a mismatch.

Mixed tolerances and the “any breach fails” rule

When a stage has multiple measures with their own tolerances, a row is MATCHED only if every measure passes its tolerance. A single measure exceeding its threshold makes the row a mismatch, even if all other measures pass. Each measure’s pass/fail is reported individually so you can see exactly which one breached.

Comparison and the MATCHED / UNMATCHED result

After filtering and aggregation, DataRecs performs a full outer join of the two aggregated result sets on the dimension columns, then evaluates the tolerances. The join is NULL-safe on the dimension keys: a group whose dimension is empty (NULL) on both sides is matched to itself rather than appearing as two one-sided rows (see Empty and NULL dimension values).

MATCHED — every compared group passed every tolerance check. The data agrees within the limits you set.
UNMATCHED — at least one group failed at least one tolerance check, or a group exists on one side but not the other.

A run reports counts alongside the result: rows processed, rows matched, and rows unmatched. A MATCHED result always has zero unmatched rows.

Multi-stage drill-down

Jobs can define multiple stages, each comparing at a progressively finer dimension grouping. The default strategy, DRILL_DOWN_ON_FAILURE, runs the coarsest stage first and only drills into a finer stage for the groups that failed. This makes large reconciliations efficient: if the grand total matches, there is no need to compare every sub-group.

See Configuring a Reconciliation Job for the stage strategies (DRILL_DOWN_ON_FAILURE, SEQUENTIAL, PARALLEL) and how to configure them.

Correctness first

Throughout this page you will notice a consistent principle: when DataRecs cannot compute a correct answer, it fails the run rather than returning a wrong one.

A numeric operation on data that is not actually numeric → ERRORED, not a silently-wrong match.
An extraction or comparison failure → surfaced as ERRORED, never a partial or approximate result.
An empty result from a filter → a legitimate MATCHED with zero rows, never a crash.

This is intentional. The value of a reconciliation platform is that you can trust the numbers. A wrong “MATCHED” is far more damaging than a clearly-reported failure, so DataRecs always errs toward telling you when something is wrong.

Summary

DataRecs infers each column’s type from all its values; a column is numeric only if every populated value is a number.
The same value supplied as a number by one source and a string by another is unified and compares equal.
Empty cells (NULL or "") are treated identically and never skew aggregates or NULL filters.
Numeric range filters compare numerically on numeric columns; a numeric filter on a genuinely textual column fails loudly rather than comparing lexicographically.
Aggregations, tolerances, and the MATCHED/UNMATCHED outcome behave identically regardless of source.
When a correct answer is not possible, the run ERRORs instead of returning a misleading result.