[SPARK-56547][SQL][CONNECT] Improve error message when DataFrame column is referenced from a wrong DataFrame #55421

Draft
zhengruifeng wants to merge 2 commits into apache:master from zhengruifeng:better-error-message

zhengruifeng commented Apr 20, 2026

What changes were proposed in this pull request?

In CheckAnalysis.failUnresolvedAttribute, when the unresolved Attribute carries PLAN_ID_TAG (i.e. a Spark Connect DataFrame column reference like df.col_name or df["col_name"]), raise the existing CANNOT_RESOLVE_DATAFRAME_COLUMN error instead of the generic UNRESOLVED_COLUMN. The tag reliably identifies DataFrame column references that the plan-id-based resolution already tried and failed, so the dedicated error's message ("illegal references like df1.select(df2.col("a"))") is always more accurate than a similarity-based suggestion.
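
The branch described above can be modelled in a few lines of plain Python. This is a toy sketch rather than Spark's Scala code: `UnresolvedAttribute`, `plan_id`, and `error_class_for` are hypothetical stand-ins for Catalyst's `UnresolvedAttribute`, its `PLAN_ID_TAG`, and the check in `CheckAnalysis.failUnresolvedAttribute`, and the suggestion handling is simplified to "are there any candidates".

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class UnresolvedAttribute:
    """Hypothetical stand-in for Catalyst's UnresolvedAttribute."""

    name: str
    # Models PLAN_ID_TAG: set only for Spark Connect DataFrame column
    # references such as df["col_name"]; None for a plain col("col_name").
    plan_id: Optional[int] = None


def error_class_for(attr: UnresolvedAttribute, candidates: Sequence[str] = ()) -> str:
    """Pick the error class for an attribute that is still unresolved."""
    if attr.plan_id is not None:
        # Plan-id-based resolution was already attempted and failed, so a
        # similarity-based suggestion would only mislead the user.
        return "CANNOT_RESOLVE_DATAFRAME_COLUMN"
    if candidates:
        return "UNRESOLVED_COLUMN.WITH_SUGGESTION"
    return "UNRESOLVED_COLUMN.WITHOUT_SUGGESTION"
```

In this model, `error_class_for(UnresolvedAttribute("id", plan_id=1), ["id"])` takes the first branch regardless of the candidate list, which mirrors why the tagged case no longer produces a suggestion.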

Why are the changes needed?

The previous error was misleading. For example:

from pyspark.sql import functions as sf
df1 = spark.range(10)
df2 = df1.withColumn("id", sf.col('id') + 1)
df2.select(df1["id"]).show()

produced:

AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `id` cannot be resolved. Did you mean one of the following? [`id`]. SQLSTATE: 42703;
'Project ['id]
+- Project [(id#1L + cast(1 as bigint)) AS id#3L]
   +- Range (0, 10, step=1, splits=Some(1))

The message says id can't be resolved and suggests id as a fix, with no hint that the real issue is a reference to a column from a different DataFrame whose value was overwritten by withColumn.

Does this PR introduce any user-facing change?

Yes. For DataFrame column references that fail to resolve in Spark Connect, the error class changes from UNRESOLVED_COLUMN.WITH_SUGGESTION (or UNRESOLVED_COLUMN.WITHOUT_SUGGESTION) to CANNOT_RESOLVE_DATAFRAME_COLUMN:

[CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve dataframe column "id". It's probably because of illegal references like `df1.select(df2.col("a"))`. SQLSTATE: 42704
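
Code or log tooling that matches on the bracketed error class in the message text would see the new value after this change. As a rough illustration, the helper below pulls the error class out of a message string; `extract_error_class` and its regular expression are hypothetical, written only against the two message shapes quoted in this description, and in live code the exception object's own accessors are likely a better source than the text.

```python
import re
from typing import Optional

# Matches the first bracketed, upper-case, dot-separated token,
# e.g. "[UNRESOLVED_COLUMN.WITH_SUGGESTION]".
_ERROR_CLASS = re.compile(r"\[([A-Z_]+(?:\.[A-Z_]+)*)\]")


def extract_error_class(message: str) -> Optional[str]:
    """Return the error class named in an error message, or None."""
    match = _ERROR_CLASS.search(message)
    return match.group(1) if match else None


old_message = (
    "AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, "
    "or function parameter with name `id` cannot be resolved. "
    "Did you mean one of the following? [`id`]. SQLSTATE: 42703;"
)
new_message = (
    '[CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve dataframe column "id". '
    "It's probably because of illegal references like "
    '`df1.select(df2.col("a"))`. SQLSTATE: 42704'
)
```

Applied to the messages above, the helper yields `UNRESOLVED_COLUMN.WITH_SUGGESTION` for the old text and `CANNOT_RESOLVE_DATAFRAME_COLUMN` for the new one; the later "[`id`]" suggestion list is ignored because backticks are not part of the pattern.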

How was this patch tested?

Updated test_invalid_column in python/pyspark/sql/tests/connect/test_connect_error.py:

  • cdf3.select(cdf1.b) where cdf3 = cdf1.select(cdf1.a) (existing case, error class updated).
  • Added cdf3.select(cdf1.a) where cdf3 = cdf1.withColumn("a", F.lit(0)) (new case mirroring the reported issue).

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7)

…mn is referenced out of scope

### What changes were proposed in this pull request?

When a user references a column from a different DataFrame whose value was
overwritten in the current DataFrame (e.g. `df2.select(df1["id"])` where `df2 =
df1.withColumn("id", ...)`), the analyzer raised a misleading
`UNRESOLVED_COLUMN.WITH_SUGGESTION` error that suggested the exact column name
that was referenced.

This PR detects the case in `ColumnResolutionHelper.resolveDataFrameColumnRecursively`:
when a DataFrame column is resolved against the plan node matching the
`PLAN_ID_TAG` but its resolved attribute is filtered out because it's not in
some ancestor operator's output, we tag the `UnresolvedAttribute`. If the
attribute is still unresolved after the remaining analyzer rules (e.g.
`ResolveReferencesInSort` may still resolve it by promoting hidden output),
`CheckAnalysis` raises the existing `CANNOT_RESOLVE_DATAFRAME_COLUMN` error
which clearly points out the wrong DataFrame reference.

### Why are the changes needed?

The previous error was confusing — it said a column couldn't be resolved and
suggested the same column name as a fix, giving no hint that the real issue
was a reference to the wrong DataFrame.

### Does this PR introduce _any_ user-facing change?

Yes, the error class for the specific case of referencing a column from a
different DataFrame changes from `UNRESOLVED_COLUMN.WITH_SUGGESTION` to
`CANNOT_RESOLVE_DATAFRAME_COLUMN`. The latter already contained guidance about
illegal DataFrame references.

### How was this patch tested?

Updated `test_invalid_column` in `test_connect_error.py` to cover both the
existing case (`cdf3.select(cdf1.b)` where `cdf3 = cdf1.select(cdf1.a)`) and
the new `withColumn` case.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code

Co-authored-by: Isaac

Any unresolved attribute with `PLAN_ID_TAG` at `CheckAnalysis` time is a
Spark Connect DataFrame column reference that couldn't be reached from the
current operator, so raise `CANNOT_RESOLVE_DATAFRAME_COLUMN` directly
without introducing a dedicated tag.

Co-authored-by: Isaac

zhengruifeng changed the title from "Better error message" to "[SPARK-XXXXX][SQL][CONNECT] Improve error message when DataFrame column is referenced from a wrong DataFrame" on Apr 20, 2026
zhengruifeng changed the title from "[SPARK-XXXXX][SQL][CONNECT] Improve error message when DataFrame column is referenced from a wrong DataFrame" to "[SPARK-56547][SQL][CONNECT] Improve error message when DataFrame column is referenced from a wrong DataFrame" on Apr 20, 2026
zhengruifeng marked this pull request as draft on April 20, 2026 06:14
zhengruifeng force-pushed the better-error-message branch from 4a609cc to 0bd0129 on April 20, 2026 09:25