[SPARK-56547][SQL][CONNECT] Improve error message when DataFrame column is referenced from a wrong DataFrame by zhengruifeng · Pull Request #55421 · apache/spark

zhengruifeng · 2026-04-20T06:05:15Z

What changes were proposed in this pull request?

In CheckAnalysis.failUnresolvedAttribute, when the unresolved Attribute carries PLAN_ID_TAG (i.e. a Spark Connect DataFrame column reference like df.col_name or df["col_name"]), raise the existing CANNOT_RESOLVE_DATAFRAME_COLUMN error instead of the generic UNRESOLVED_COLUMN. The tag reliably identifies DataFrame column references that the plan-id-based resolution already tried and failed, so the dedicated error's message ("illegal references like df1.select(df2.col("a"))") is always more accurate than a similarity-based suggestion.
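The decision described above can be sketched as a small toy model. This is not the Spark implementation (which is Scala inside `CheckAnalysis`); `UnresolvedAttr`, `plan_id`, and `error_class_for` are hypothetical names used only for illustration, with `plan_id` standing in for `PLAN_ID_TAG`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UnresolvedAttr:
    """Hypothetical stand-in for an unresolved attribute in the plan."""
    name: str
    plan_id: Optional[int] = None  # models the presence of PLAN_ID_TAG


def error_class_for(attr: UnresolvedAttr, has_suggestions: bool) -> str:
    """Pick the error class the way the PR description outlines it."""
    # A tagged attribute is a DataFrame column reference (df.col_name or
    # df["col_name"]) that plan-id-based resolution already failed on.
    if attr.plan_id is not None:
        return "CANNOT_RESOLVE_DATAFRAME_COLUMN"
    # Untagged attributes keep the generic, similarity-based error.
    if has_suggestions:
        return "UNRESOLVED_COLUMN.WITH_SUGGESTION"
    return "UNRESOLVED_COLUMN.WITHOUT_SUGGESTION"
```

Under this model, a tagged reference never reaches the suggestion branch, which is why the confusing "Did you mean `id`?" hint no longer appears for DataFrame column references.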

Why are the changes needed?

The previous error was misleading. For example:

from pyspark.sql import functions as sf
df1 = spark.range(10)
df2 = df1.withColumn("id", sf.col('id') + 1)
df2.select(df1["id"]).show()

produced:

AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `id` cannot be resolved. Did you mean one of the following? [`id`]. SQLSTATE: 42703;
'Project ['id]
+- Project [(id#1L + cast(1 as bigint)) AS id#3L]
   +- Range (0, 10, step=1, splits=Some(1))

The message says id can't be resolved and suggests id as a fix, with no hint that the real issue is a reference to a column from a different DataFrame whose value was overwritten by withColumn.
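For contrast, a minimal sketch of the same pipeline with a reference that does resolve. The function name `select_overwritten_id` is hypothetical, and the sketch assumes an active session is passed in as `spark` and that PySpark is installed. The only change from the failing example is that the column is referenced through `df2`, the DataFrame being selected from.

```python
def select_overwritten_id(spark):
    """Build df1/df2 as in the report, then select `id` through df2 itself."""
    # Imported lazily so this module can be loaded without PySpark present.
    from pyspark.sql import functions as sf

    df1 = spark.range(10)
    df2 = df1.withColumn("id", sf.col("id") + 1)
    # df2["id"] refers to the overwritten column that df2 actually outputs;
    # df1["id"] would point at a column hidden by withColumn.
    return df2.select(df2["id"])
```

Referencing the column by name with `sf.col("id")` is another option, since a name-based reference carries no tie to a specific DataFrame.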

Does this PR introduce any user-facing change?

Yes. For DataFrame column references that fail to resolve in Spark Connect, the error class changes from UNRESOLVED_COLUMN.WITH_SUGGESTION (or UNRESOLVED_COLUMN.WITHOUT_SUGGESTION) to CANNOT_RESOLVE_DATAFRAME_COLUMN:

[CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve dataframe column "id". It's probably because of illegal references like `df1.select(df2.col("a"))`. SQLSTATE: 42704
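Code that matches on the error text may need updating after this change. As a rough sketch, the helper below pulls the error class and SQLSTATE out of a message string shaped like the ones quoted in this description. `parse_error` is a hypothetical helper, not a PySpark API, and it assumes the `[ERROR_CLASS] ... SQLSTATE: xxxxx` layout shown above.

```python
import re
from typing import Optional, Tuple

# Error class: first bracketed token made of upper-case letters, digits,
# underscores, and dots, e.g. [UNRESOLVED_COLUMN.WITH_SUGGESTION].
_ERROR_CLASS_RE = re.compile(r"\[([A-Z][A-Z0-9_.]*)\]")
# SQLSTATE: five alphanumeric characters after the literal "SQLSTATE: ".
_SQLSTATE_RE = re.compile(r"SQLSTATE: ([0-9A-Z]{5})")


def parse_error(message: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (error_class, sqlstate); either may be None if absent."""
    error_class = _ERROR_CLASS_RE.search(message)
    sqlstate = _SQLSTATE_RE.search(message)
    return (
        error_class.group(1) if error_class else None,
        sqlstate.group(1) if sqlstate else None,
    )


old_message = (
    "AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, "
    "or function parameter with name `id` cannot be resolved. "
    "Did you mean one of the following? [`id`]. SQLSTATE: 42703;"
)
new_message = (
    '[CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve dataframe column "id". '
    "It's probably because of illegal references like "
    '`df1.select(df2.col("a"))`. SQLSTATE: 42704'
)
```

Note that, per the two messages quoted here, the SQLSTATE differs along with the error class (42703 before, 42704 after).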

How was this patch tested?

Updated test_invalid_column in python/pyspark/sql/tests/connect/test_connect_error.py:

cdf3.select(cdf1.b) where cdf3 = cdf1.select(cdf1.a) (existing case, error class updated).
Added cdf3.select(cdf1.a) where cdf3 = cdf1.withColumn("a", F.lit(0)) (new case mirroring the reported issue).

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7)

…mn is referenced out of scope

### What changes were proposed in this pull request?

When a user references a column from a different DataFrame whose value was overwritten in the current DataFrame (e.g. `df2.select(df1["id"])` where `df2 = df1.withColumn("id", ...)`), the analyzer raised a misleading `UNRESOLVED_COLUMN.WITH_SUGGESTION` error that suggested the exact column name that was referenced.

This PR detects the case in `ColumnResolutionHelper.resolveDataFrameColumnRecursively`: when a DataFrame column is resolved against the plan node matching the `PLAN_ID_TAG` but its resolved attribute is filtered out because it's not in some ancestor operator's output, we tag the `UnresolvedAttribute`. If the attribute is still unresolved after the remaining analyzer rules (e.g. `ResolveReferencesInSort` may still resolve it by promoting hidden output), `CheckAnalysis` raises the existing `CANNOT_RESOLVE_DATAFRAME_COLUMN` error, which clearly points out the wrong DataFrame reference.

### Why are the changes needed?

The previous error was confusing: it said a column couldn't be resolved and suggested the same column name as a fix, giving no hint that the real issue was a reference to the wrong DataFrame.

### Does this PR introduce _any_ user-facing change?

Yes, the error class for the specific case of referencing a column from a different DataFrame changes from `UNRESOLVED_COLUMN.WITH_SUGGESTION` to `CANNOT_RESOLVE_DATAFRAME_COLUMN`. The latter already contained guidance about illegal DataFrame references.

### How was this patch tested?

Updated `test_invalid_column` in `test_connect_error.py` to cover both the existing case (`cdf3.select(cdf1.b)` where `cdf3 = cdf1.select(cdf1.a)`) and the new `withColumn` case.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code

Co-authored-by: Isaac

Any unresolved attribute with `PLAN_ID_TAG` at `CheckAnalysis` time is a Spark Connect DataFrame column reference that couldn't be reached from the current operator, so raise `CANNOT_RESOLVE_DATAFRAME_COLUMN` directly without introducing a dedicated tag.

Co-authored-by: Isaac

zhengruifeng added 2 commits April 20, 2026 05:09

zhengruifeng changed the title ~~Better error message~~ [SPARK-XXXXX][SQL][CONNECT] Improve error message when DataFrame column is referenced from a wrong DataFrame Apr 20, 2026

zhengruifeng changed the title ~~[SPARK-XXXXX][SQL][CONNECT] Improve error message when DataFrame column is referenced from a wrong DataFrame~~ [SPARK-56547][SQL][CONNECT] Improve error message when DataFrame column is referenced from a wrong DataFrame Apr 20, 2026

zhengruifeng marked this pull request as draft April 20, 2026 06:14

zhengruifeng force-pushed the better-error-message branch from 4a609cc to 0bd0129 Compare April 20, 2026 09:25

zhengruifeng wants to merge 2 commits into apache:master from zhengruifeng:better-error-message
