
Lakehouse Tower of Babel: Handling Identifier Resolution Rules Across Database Engines


Key Takeaways

  • Open table formats such as Apache Iceberg standardize data and metadata semantics across engines, but they do not provide SQL dialect interoperability, leaving identifier resolution to each engine.
  • In multi-engine lakehouses, identifier resolution has become an architectural concern where a table can exist in shared metadata yet be effectively invisible to some engines or require users to rely on pervasive quoting or escaping.
  • Catalog implementations introduce their own conventions on naming, with some normalizing identifiers to Hadoop-style lowercase and others preserving the casing provided by the engine.
  • Adopting a strict, organization-wide naming convention aligned with the engines and catalogs in the data lakehouse is currently the most reliable way to reduce cross-engine portability failures.
  • Teams should treat identifier normalization as part of their data contract, validating and testing naming behavior across engines, rather than assuming that shared metadata alone provides portability.

The promise of modern lakehouse architecture is a unified data layer where diverse compute engines such as Snowflake, Spark, Trino, and Flink can seamlessly interoperate using open standards like Apache Iceberg.

While significant progress has been made in standardizing data storage and metadata formats, a critical interoperability gap remains: the lack of a shared SQL dialect across different database engines. Supporting a common dialect is hard because vendors differ in the features they support, but inconsistencies exist even in the most fundamental aspects, such as addressing identifiers (databases, schemas, tables, table columns, etc.).

Each engine brings its own historical rules for resolving and normalizing identifiers, creating a "Tower of Babel" effect: an organization's goal of building governed, unified, and modern AI/Data systems is hindered by tooling that cannot agree on a common dialect for addressing its database objects. This is an emerging problem as organizations shift from siloed data lakes to lakehouses, where these subtle identifier mismatches have caused pipeline reliability and consistency issues.

The SQL Dialect Interoperability Gap in the Lakehouse

To understand this problem, let’s walk through a scenario where a data engineer creates a table in Spark with the following command:

SQL 
CREATE TABLE my_lakehouse.MyTable (id INT, value STRING);
SELECT * FROM my_lakehouse.mytable; -- successfully returns results

By default, Spark persists the table name in the Apache Iceberg catalog with the casing exactly as provided: MyTable. Later, a business analyst attempts to query this table from Flink or Trino:

SQL 
-- This subsequent select query fails for both Flink and Trino, for different reasons, even though it succeeds for Spark
SELECT * FROM my_lakehouse.mytable;

Flink preserves identifiers exactly as typed, so when the analyst writes mytable, Flink sends mytable to the catalog, which has the table stored as MyTable. If the catalog performs a case-sensitive lookup, the table is not found. Even if the table is resolved (e.g., through a case-insensitive catalog), column-level access remains strictly case-sensitive: "SELECT Id FROM MyTable" would fail if the column is stored as id in the Iceberg metadata. Flink will return the error: "Column 'Id' not found in any table; did you mean 'id'?"

Trino presents a different challenge within this dialect gap. Because Trino normalizes identifiers to lowercase, a query for MyTable or mytable becomes a lookup for mytable (lowercase). If Spark has persisted the table as MyTable or MYTABLE, Trino's case-sensitive catalog lookup will not find a match.

Trino's lack of first-class support for delimited identifiers means that even when users attempt to specify casing, resolution against non-lowercase metadata entries remains unsuccessful, effectively making tables created with uppercase or mixed-case characters undiscoverable from the Trino engine.

Figure 1. Multi-engine lakehouse setup showcasing identifier issues across popular database engines.

In practice, data architects rely on intermediate tooling like dbt or SQLMesh to transpile their SQL code across different query engines so that their workloads are not tightly coupled to a single query engine. While transpilation helps reduce the friction of rewriting workloads for each engine, it does not protect against architects locking in identifier design choices that are incompatible across engines.
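To make this concrete, the following sketch uses the open-source SQLGlot library (the transpiler underneath SQLMesh) to convert a Spark query into Trino syntax. The identifiers are hypothetical, and the point is illustrative: transpilation adapts syntax, not naming decisions, so a mixed-case identifier is typically carried through as written and any catalog-level casing incompatibility survives the conversion.

Python
# Minimal sketch, assuming SQLGlot is installed (pip install sqlglot).
# The identifiers below are hypothetical examples.
import sqlglot

spark_sql = "SELECT transactionId FROM my_lakehouse.MyTable"

# Transpile Spark SQL to Trino SQL. Only syntax is adapted; the mixed-case
# identifiers are typically emitted as written, so Trino's lowercase
# normalization at resolution time remains a risk.
trino_sql = sqlglot.transpile(spark_sql, read="spark", write="trino")[0]
print(trino_sql)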

Why Does It Matter Now?

This challenge is not new in the database space; historically, database migrations have encountered similar SQL dialect inconsistencies. In traditional siloed environments, however, workflows were governed by a single engine's rules, and one-time migration difficulties were accepted as the norm. In the modern lakehouse, multiple engines operate on the same data simultaneously, so the pain is ever-present. When Spark treats Table1 and table1 as distinct objects while Trino treats them as identical, automated pipelines and cross-engine workflows can fail and cause serious data corruption or contract violations.

Managing this SQL dialect interoperability gap is largely the responsibility of the organization creating its lakehouse strategy, and not addressed by a single catalog or database platform. We will dive into the technical nuances and behavioral differences across catalogs and databases, empowering architects to design a more resilient lakehouse strategy.

Technical Overview

Identifier name resolution involves a set of rules that dictate which characters can be part of an identifier and how identifiers are normalized to their case-normalized form (CNF), both for comparison and for persistence in metadata and storage. In a lakehouse stack, multiple components developed independently by diverse vendors interact to produce an "effective" identifier behavior that can break existing workload assumptions.

The diagram below represents the high-level flow of identifier resolution and persistence in a lakehouse, showing different components that are involved:

Figure 2. High-level flow of identifier resolution.

It is therefore important to survey different vendors across both database engines as well as catalogs to understand the "effective" behavior and common pitfalls.

A Survey of Behavior: Catalogs and Engines

This section covers the different conventions used across database engines and catalogs. We will start by analyzing the behavior for the most popular database engines, followed by analyzing the most popular catalogs.

Database Engine-Level Differences

Engines apply different logic when interacting with the catalog layer.

Comparison of Database Engine-Level Differences

| Engine | Normalization Rule | Resolution Rule |
|---|---|---|
| Spark | Case preserving: persists identifiers as provided | Case sensitive for catalogs that accept all cases (e.g., Apache Polaris); case insensitive for lowercase catalogs (e.g., Apache Hadoop, AWS Glue, Unity) |
| Snowflake (CLD) | Lowercase: normalizes all identifiers to lowercase during creation | Case insensitive: resolves regardless of query casing. mytable, MyTable, and MYTABLE all resolve for a select query |
| Trino | Lowercase: normalizes all identifiers to lowercase | Case insensitive: only lowercase forms are supported |
| Flink | Case preserving: persists identifiers as provided | Case sensitive: only exact casing matches |
| DuckDB | Case preserving: persists identifiers as provided | Case insensitive: resolves regardless of query casing. mytable, MyTable, and MYTABLE all resolve for a select query |
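To make the divergence concrete, the small sketch below encodes the unquoted-identifier normalization behavior from the table as plain Python functions; it is illustrative only and not an API of any engine.

Python
# Illustrative sketch: approximate unquoted-identifier normalization per engine,
# based on the comparison table above (not an official API of any engine).
normalization = {
    "spark": lambda name: name,                  # case preserving
    "flink": lambda name: name,                  # case preserving
    "duckdb": lambda name: name,                 # case preserving (resolution is case insensitive)
    "trino": lambda name: name.lower(),          # normalized to lowercase
    "snowflake_cld": lambda name: name.lower(),  # normalized to lowercase
}

identifier = "MyTable"
for engine, normalize in normalization.items():
    print(f"{engine:>14}: looks up '{normalize(identifier)}' in the catalog")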

Catalog Level Differences

Catalogs serve as the source of truth for entity names, but their implementation details vary.

  • Apache Polaris follows the Apache Iceberg specification. It accepts strings as provided via the REST interface and performs case-sensitive matching for lookups.
  • Databricks Unity Catalog standardizes identifiers to lowercase. This standardization can lead to different results when Spark users, who may expect case sensitivity, interact with objects that Unity Catalog has normalized to lowercase.
  • AWS Glue Data Catalog automatically converts uppercase entity names to lowercase.

These differences in semantics across three layers (user intent, database engine, and catalog) lead to subtle failures and downstream risks that can result in costly rewrites later.
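Because the catalog is the source of truth for entity names, it is worth inspecting what casing it actually persisted rather than assuming the casing that was typed. The sketch below uses the PyIceberg client against a REST catalog such as Polaris; the URI, namespace, and catalog name are hypothetical placeholders.

Python
# Minimal sketch using PyIceberg to inspect what the catalog actually stored.
# The URI and namespace below are hypothetical placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "my_lakehouse",
    **{"type": "rest", "uri": "https://polaris.example.com/api/catalog"},
)

# list_tables returns identifiers exactly as persisted, which reveals whether
# the engine or the catalog preserved or lowercased the names you created.
for identifier in catalog.list_tables("payments"):
    print(identifier)  # e.g. ('payments', 'dailyTransactions')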

Illustrative Scenarios

While the technical specifications of each engine's resolution logic highlight a clear divergence in how identifiers are handled, the real impact of these choices is best understood through their operational consequences. The following two scenarios are illustrative: they are not based on a single customer engagement, but are composites drawn from real case studies. The company names and specific details are fictional, but the failure patterns, error messages, and resolutions reflect issues encountered in production environments.

We analyze how one organization chose a case-preserving catalog and the other a lowercase-normalizing catalog, and how neither choice provides a universally correct solution; each trades one set of failures for another.

Scenario A: NovaPay Using Case-Preserving Catalog (Polaris)

NovaPay is a fintech company processing over two million daily transactions across fifteen countries. Their data stack consists of Apache Polaris (REST catalog), Spark for ETL, Trino for ad-hoc analytics, Snowflake for BI reporting and ML, and Flink for real-time fraud detection. They chose Polaris's default case-preserving behavior to maintain their camel case convention across over four hundred tables.

Initial Spark pipelines work flawlessly as Polaris stores identifiers exactly as typed.

CREATE TABLE payments.dailyTransactions (
    transactionId STRING,
    merchantName STRING,
    paymentAmount DECIMAL(10,2),
    processingDate DATE
); -- Table created via Spark

Problems begin when the analytics team onboards Trino:

SELECT * FROM payments.dailytransactions;
-- Trino lowercases to "dailytransactions", Polaris has "dailyTransactions" → 404

This pattern matches documented behavior where tables with uppercase characters are browsed as lowercase, and where users report that "case sensitive identifiers coming from an Iceberg REST catalog are unable to be referenced in Trino queries because they are passed to the rest catalog as all lowercase". All four hundred plus tables are effectively hidden from the analytics team. Snowflake integration via a Catalog Linked Database (CLD) helps, as CLDs normalize to lowercase with case-insensitive resolution, making the tables discoverable. However, the Flink fraud detection team encounters column-level failures: Flink's case-sensitive resolution means SELECT transactionid fails because the Iceberg metadata stores transactionId. To address this, NovaPay runs a migration to rename all tables to snake_case (a convention in which all letters are lowercase and words are separated by underscores) and adds CI linting to reject non-lowercase identifiers going forward.
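A CI lint like the one NovaPay added can be as simple as a regular-expression check over proposed identifiers. The sketch below is one possible implementation rather than NovaPay's actual tooling, and the identifier list is hypothetical.

Python
# Sketch of a CI guard that rejects identifiers violating lowercase snake_case.
import re
import sys

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")

def find_violations(names):
    """Return the identifiers that break the lowercase snake_case convention."""
    return [name for name in names if not SNAKE_CASE.match(name)]

# Example: table and column names proposed in a pull request (hypothetical values).
proposed = ["dailyTransactions", "transaction_id", "merchant_name", "PaymentAmount"]
violations = find_violations(proposed)
if violations:
    print(f"Rejected non-lowercase identifiers: {violations}")
    sys.exit(1)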

Scenario B: Lowercase-Normalizing Catalog (AWS Glue)

MediStream, a healthcare analytics startup processing patient records from over fifty hospital systems, has the following data stack: AWS Glue (catalog), Spark on EMR for ETL, Athena (Trino-based) for analyst queries, and Flink on Kinesis for real-time alerting.

The startup chose AWS Glue because they were an AWS-native shop and expected that the Glue catalog's lowercase normalization would prevent any cross-engine casing issues.

The team hit a wall immediately because Glue rejected their PascalCase table names:

# PySpark ETL job
spark.sql("""
    CREATE TABLE glue_catalog.clinical.PatientVitals (
        PatientId STRING,
        HeartRate INT,
        BloodPressure STRING,
        RecordedAt TIMESTAMP
    )
""")
Error: `ValidationException: Cannot use PatientVitals as Glue table name, because it must be 1-255 chars of lowercase letters, numbers, underscore`

This matches a reported issue where Spark could not skip Glue table name validation. The team spent two weeks rewriting over eighty tables and ETL scripts to snake_case. Infrastructure as Code (IaC) adds another friction point: Terraform detects perpetual drift because Glue silently lowercases names that Terraform sends with mixed casing.

Even after standardizing table names, column-level issues persist. Glue controls table names, but column names live in the Iceberg metadata, so columns originally created as PatientId by Spark remain mixed-case. Athena (Trino-based) handles this transparently by lowercasing internally, but Flink requires exact casing, causing the same table to work in one engine and fail in another. To address these issues, MediStream rebuilds the early tables with lowercase column names, standardizes Terraform to lowercase, and adds a validation step that rejects non-lowercase column names before any CREATE TABLE.
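MediStream's pre-CREATE validation step could be approximated by checking the DataFrame schema before the table is written; the sketch below is a hedged example rather than their actual code, and df stands for any Spark DataFrame about to be persisted.

Python
# Sketch of a pre-write guard: reject mixed-case column names before they reach
# the Iceberg metadata, where Flink's case-sensitive resolution would later fail.
import re

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")

def assert_lowercase_columns(df):
    bad = [f.name for f in df.schema.fields if not SNAKE_CASE.match(f.name)]
    if bad:
        raise ValueError(f"Column names must be lowercase snake_case, found: {bad}")

# Hypothetical usage before persisting a table:
# assert_lowercase_columns(df)
# df.writeTo("glue_catalog.clinical.patient_vitals").create()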

How to Choose Your Engine Combination

As a practitioner, you can use the following decision framework to configure a data lakehouse that minimizes case-sensitivity-related issues. The single most impactful decision is to choose a combination of engines that work well together in a lakehouse, using the compatibility matrix below:

|  | Spark | Trino/Athena | Flink | Snowflake (CLD) | DuckDB |
|---|---|---|---|---|---|
| Spark | ✅ | ⚠️ | ⚠️ | ✅ | ✅ |
| Snowflake (CLD) | ✅ | ✅ | ⚠️ | ✅ | ✅ |
| Trino | ⚠️ | ✅ | ⚠️⚠️ | ✅ | ⚠️ |
| Flink | ⚠️ | ⚠️⚠️ | ✅ | ⚠️ | ⚠️ |
| DuckDB | ✅ | ⚠️ | ⚠️ | ✅ | ✅ |

✅ = Compatible: works without special effort
⚠️ = Works if you enforce a lowercase naming standard
⚠️⚠️ = Some fundamental incompatibilities remain that lowercase naming alone doesn't fully fix (column-level issues persist)

Spark and Snowflake (CLD) are broadly compatible because both resolve case-insensitively, so tables flow between them without friction. DuckDB shares this same case-insensitive resolution, making it equally compatible. Trino's strict lowercase normalization creates friction with case-preserving engines (e.g., Spark, Flink, and DuckDB) unless all identifiers are already lowercase. Flink is the strictest because its case-sensitive resolution results in any column casing mismatch causing failures, making it the engine that most demands a disciplined naming convention.

Enforce Naming Standards

To avoid discoverability and other casing-related issues within the data stack, it is important to standardize and enforce a naming convention that works for all the tools in your lakehouse setup. One of the most effective strategies is to limit all identifiers to lowercase with underscores by default, which is widely accepted across engines and catalogs. Diverging from this default requires carefully pairing engines and tweaking their settings to guarantee that all scenarios work well.

A lowercase naming convention prevents most issues, but each layer in the lakehouse stack has configuration options that further reduce case-related friction. The defaults are not always optimal for multi-engine environments.

Lakehouse Stack Configuration Options

| Layer | Key Setting | Default | Why Tune It? |
|---|---|---|---|
| Spark | spark.sql.caseSensitive | false | Controls whether Spark treats different casings for table columns as the same or different. |
| Snowflake CLD | CATALOG_CASE_SENSITIVITY | CASE_INSENSITIVE | Controls whether the CLD resolves identifiers case-insensitively or follows strict ANSI SQL casing. |
| Glue | glue.skip-name-validation | false | Rejects mixed-case table names by default (#5768). Setting to true bypasses client-side validation, but Glue still lowercases server-side, which creates metadata drift between the catalog and Iceberg metadata files. |
| Trino | iceberg.rest-catalog.case-insensitive-name-matching | false | When connecting to a REST catalog with mixed-case names, setting to true builds a lowercase-to-original mapping via catalog listing so those tables become visible. |
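As an illustration of the Spark-side settings, the snippet below configures a Spark session with explicit case handling and an Iceberg Glue catalog. The catalog name, warehouse path, and the choice to enable skip-name-validation are assumptions made for the example; the Trino and Snowflake settings from the table live in those systems' own configuration, not in Spark.

Python
# Sketch of Spark-side configuration only; the catalog name and warehouse path
# are placeholders. Trino's iceberg.rest-catalog.case-insensitive-name-matching
# and Snowflake's CATALOG_CASE_SENSITIVITY are configured in those systems.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-case-handling")
    # Column resolution inside Spark: the default (false) ignores casing
    # differences in column references.
    .config("spark.sql.caseSensitive", "false")
    # Hypothetical Iceberg catalog named "glue_catalog" backed by AWS Glue.
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://example-bucket/warehouse/")
    # Bypasses Iceberg's client-side Glue name validation (see #5768); Glue may
    # still lowercase names server-side, so the naming convention still matters.
    .config("spark.sql.catalog.glue_catalog.glue.skip-name-validation", "true")
    .getOrCreate()
)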

Validate Across the Stack

Enforcing a naming standard prevents most issues, but the only way to confirm true cross-engine portability is to test it end-to-end. A lightweight CI job that creates a table through the primary engine and verifies it is discoverable and queryable from every other engine in the stack will help find any integration failures. These cross-engine failures, such as catalog normalization surprises, column-level casing mismatches in Iceberg metadata, and resolution differences, only surface when validation includes interactions across multiple engines.
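A minimal version of such a check could create a table through Spark and then confirm it resolves from Trino. The sketch below assumes a CI environment where both engines share the same Iceberg catalog; the hosts, catalog and table names, and the spark session fixture are placeholders, and each additional engine in the stack would get its own assertion.

Python
# Hedged sketch of a cross-engine smoke test; endpoints and names are placeholders.
# Assumes `spark` is a session configured against the shared Iceberg catalog and
# that a Trino coordinator exposes the same catalog as "lakehouse".
import trino

TABLE = "ci_checks.case_probe"

def test_cross_engine_visibility(spark):
    # Create through the primary engine (Spark persists the casing it is given).
    spark.sql(f"CREATE TABLE IF NOT EXISTS lakehouse.{TABLE} (id INT)")

    # Verify the same table is discoverable and queryable from Trino.
    conn = trino.dbapi.connect(host="trino.ci.internal", port=8080,
                               user="ci", catalog="lakehouse")
    cur = conn.cursor()
    cur.execute(f"SELECT count(*) FROM {TABLE}")
    assert cur.fetchone() is not None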

Conclusion

The promise of the modern lakehouse, bringing any engine to a single copy of data, remains the North Star for data architecture, yet the examples discussed show that shared storage and unified catalogs are not enough to provide a seamless experience. The problem is rooted in a fundamental philosophical divide between two approaches to identifier resolution: case-preserving fidelity and case-normalizing uniformity. Some engines follow a philosophy of high fidelity, preserving and resolving the exact casing with which an object was created in order to maintain metadata richness; this very flexibility creates non-discoverable data for engines that adhere to a philosophy of strict normalization. By forcibly casting identifiers to a normalized case in the name of SQL standard compliance, these engines lose the ability to discover objects that don't fit their logic.

This conflict forces architects into a difficult trade-off, because they must either accept the risk of "shadow tables", or impose rigid, lowercase-only constraints that strip away the naming conventions of traditional systems. Organizations must stop treating identifier naming as a matter of individual engine preference and start treating it as a critical data contract.

By establishing a strict, organization-wide casing convention, teams can minimize friction regardless of how an engine or catalog behaves. Ultimately, solving the "Tower of Babel" problem requires a shift in mindset: True interoperability is achieved not just by sharing the same bytes on disk, but by enforcing a shared language across every engine that touches them.
