[SPARK-38483][PYTHON][CONNECT] Add Column._name property exposing a column's name #56726

Open
AgenticSpark wants to merge 1 commit into apache:master from AgenticSpark:agenticspark/SPARK-38483-column-name
@AgenticSpark

Contributor

What changes were proposed in this pull request?

This adds a _name property to the PySpark Column class that returns the
column's name, alias, or expression as a string -- the same string shown inside
Column.__repr__ (Column<'...'>). It is implemented for both Spark Classic
(self._jc.toString()) and Spark Connect (self._expr.__repr__()).

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([(2, "Alice")], ["age", "name"])
>>> df.age._name
'age'
>>> sf.col("value")._name
'value'
>>> sf.col("a").cast("int")._name
'CAST(a AS INT)'

The leading underscore intentionally avoids a collision with the existing
Column.name method, which is an alias for Column.alias.
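The collision can be illustrated with a minimal sketch that uses a stand-in class rather than the real pyspark.sql.Column. ToyColumn and its internals are hypothetical; only the relationship between alias, name, and _name follows what is described above.

```python
class ToyColumn:
    """Hypothetical stand-in for Column; this is not PySpark code."""

    def __init__(self, expr: str) -> None:
        self._expr = expr

    def alias(self, new_name: str) -> "ToyColumn":
        # Return a new column whose expression carries the alias.
        return ToyColumn(f"{self._expr} AS {new_name}")

    # `name` is already taken: it is a method that behaves like `alias`,
    # so a property called `name` would replace it.
    name = alias

    @property
    def _name(self) -> str:
        # The string that would appear inside Column<'...'>.
        return self._expr

    def __repr__(self) -> str:
        return f"Column<'{self._expr}'>"


col = ToyColumn("age")
renamed = col.name("years")  # method call, same effect as col.alias("years")
print(col._name)      # age
print(renamed._name)  # age AS years
```

In this sketch, col.name stays callable while col._name is read as a plain attribute, which is the distinction the underscore is meant to preserve.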

Why are the changes needed?

Requested in SPARK-38483.
Having the name available as an attribute enables convenient patterns, e.g.
re-aliasing an expression with the source column's name, or branching on a
column's name inside a helper function:

values = sf.col("values")
distinct_values = sf.array_distinct(values).alias(values._name)

def custom_function(col):
    return col.cast("int") if col._name == "my_column" else col.cast("string")

Previously the name was only obtainable by parsing repr(col).
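For comparison, the repr-parsing workaround might look roughly like the sketch below. The helper name_from_repr is hypothetical, and it assumes the repr always has the form Column<'...'> as described above; it operates on plain strings so it can be read without a Spark session.

```python
import re

# Assumed repr layout: Column<'...'> with the name between the quotes.
_REPR_PATTERN = re.compile(r"^Column<'(.*)'>$", re.DOTALL)


def name_from_repr(col_repr: str) -> str:
    """Extract the name from a string shaped like repr(col) (hypothetical helper)."""
    match = _REPR_PATTERN.match(col_repr)
    if match is None:
        raise ValueError(f"unexpected Column repr: {col_repr!r}")
    return match.group(1)


print(name_from_repr("Column<'age'>"))             # age
print(name_from_repr("Column<'CAST(a AS INT)'>"))  # CAST(a AS INT)
```

With the property in place, callers would read col._name directly instead of depending on the repr layout.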

Does this PR introduce any user-facing change?

Yes -- a new Column._name property is available. There is no change to any
existing behavior.

How was this patch tested?

Added test_name_property to ColumnTestsMixin, so it runs under both the
classic (pyspark.sql.tests.test_column) and Spark Connect parity
(pyspark.sql.tests.connect.test_parity_column) suites. It checks concrete
values and the invariant repr(col) == "Column<'%s'>" % col._name. Doctests
were also added on the new property.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: GitHub Copilot CLI (Claude Opus 4.8)

…olumn's name

Adds a `_name` property to the PySpark `Column` class that returns the
column's name, alias, or expression as a string, mirroring what is shown
inside `Column.__repr__`. This makes it easy to reuse a column's name
(e.g. re-aliasing an expression with the source column's name) or to
branch on the name inside a helper function.

The leading underscore intentionally avoids a collision with the existing
`Column.name` method, which is an alias for `Column.alias`.

Implemented for both Spark Classic (`self._jc.toString()`) and Spark
Connect (`self._expr.__repr__()`). Tested with a new case in
`ColumnTestsMixin` (exercised by both the classic and Connect parity
suites) plus doctests on the new property.

name inside a helper function. The leading underscore avoids a collision
with the existing :func:`name` method (an alias for :func:`alias`).

.. versionadded:: 5.0.0

Member

Suggested change
- .. versionadded:: 5.0.0
+ .. versionadded:: 4.3.0


@property
def _name(self) -> str:
    return self._expr.__repr__()

Member

Note: undocumented Classic/Connect output divergence. The docstring example cast("int") -> 'CAST(a AS INT)' happens to agree on both backends, but Spark's own CastExpression.__repr__ (connect/expressions.py:1005-1008) documents that cast("long") renders as CAST(a AS BIGINT) on Classic and as CAST(a AS LONG) on Connect. _name inherits this difference, so the PR's stated motivating use case ("branch on the name inside a helper function") is backend-dependent and unreliable.
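A minimal sketch of how that divergence could bite a name-based branch. The two strings are the renderings quoted in this comment and have not been re-checked against a running session here; is_long_cast is a hypothetical helper.

```python
# Renderings quoted in the review comment for sf.col("a").cast("long").
CLASSIC_NAME = "CAST(a AS BIGINT)"
CONNECT_NAME = "CAST(a AS LONG)"


def is_long_cast(name: str) -> bool:
    """Hypothetical helper that branches on the rendered name."""
    # Written against the Classic rendering only.
    return name == "CAST(a AS BIGINT)"


print(is_long_cast(CLASSIC_NAME))  # True
print(is_long_cast(CONNECT_NAME))  # False
```

The same logical column takes different branches depending on the backend, which is the unreliability being flagged.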

...

@property
def _name(self) -> str:

Member

Note: a leading-underscore name on a documented, versionadded-tagged public API is a design smell: it is private by Python convention, yet it appears in the docs and in tab completion. The collision-avoidance motivation vs. Column.name is real, but an alternative name (col_name or expr_name) would be cleaner.


@property
def _name(self) -> str:
    return self._expr.__repr__()

Member

Suggested change
- return self._expr.__repr__()
+ return repr(self._expr)

def test_name_property(self):
    # SPARK-38483: Column._name exposes the column name/alias shown in repr
    self.assertEqual(sf.col("a")._name, "a")
    self.assertEqual(sf.col("a").cast("int")._name, "CAST(a AS INT)")

Member

The new test has one concrete assert (CAST(a AS INT)) plus a tautological invariant loop (repr(col) == "Column<'%s'>" % col._name); please consider adding a concrete expected string for alias/arithmetic (e.g. col("x").alias("y")._name == "x AS y").
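To show why the invariant alone gives little protection, here is a small sketch with a deliberately broken stand-in class. BuggyColumn is hypothetical, and it assumes (as in the PR) that the repr and _name are produced from the same underlying string.

```python
from typing import Optional


class BuggyColumn:
    """Hypothetical stand-in whose _name drops the alias by mistake."""

    def __init__(self, expr: str, alias: Optional[str] = None) -> None:
        self._expr = expr
        self._alias = alias

    @property
    def _name(self) -> str:
        # Bug on purpose: the alias is ignored.
        return self._expr

    def __repr__(self) -> str:
        # The repr is built from _name, so the invariant holds by construction.
        return "Column<'%s'>" % self._name


col = BuggyColumn("x", alias="y")

invariant_holds = repr(col) == "Column<'%s'>" % col._name
concrete_matches = col._name == "x AS y"

print(invariant_holds)   # True, even though _name is wrong
print(concrete_matches)  # False, the concrete expectation catches the bug
```

Only the comparison against a literal expected string fails here, which is the kind of assertion being requested.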


2 participants