PySpark Part 3: 3 Ways to Select Columns in a Spark DataFrame

Selecting one column or a set of columns from a Spark DataFrame is a routine part of writing good PySpark code, and there are many ways to do it. Here I explain only 3 of them. Pick the one you like best and let us know in the comments.
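All three methods below use the same two-column DataFrame. For reference, here is a minimal setup sketch, assuming no SparkSession is running yet (the app name is just illustrative):

from pyspark.sql import SparkSession

# assumption: no SparkSession exists yet; reuse your own if it does
spark = SparkSession.builder.appName("select-columns-demo").getOrCreate()

# sample DataFrame used in all three examples
df = spark.createDataFrame([("foo", 1), ("bar", 2)], ["c1", "c2"])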

Method 1 – Column is passed as a DataFrame attribute :-

 +---+---+ 
 | c1| c2|
 +---+---+
 |foo|  1| 
 |bar|  2|
 +---+---+

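# column is referenced as a DataFrame attribute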
output = df.select(df.c1)
output.show()
 +---+
 | c1|
 +---+
 |foo| 
 |bar|
 +---+

Method 2 – Column names are passed as a list :-

 +---+---+ 
 | c1| c2|
 +---+---+
 |foo|  1| 
 |bar|  2|
 +---+---+

# column names are passed as a list
cols = ["c1", "c2"]
output = df.select(*cols)
output.show()

 +---+---+ 
 | c1| c2|
 +---+---+
 |foo|  1| 
 |bar|  2|
 +---+---+
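Note that select() also accepts a list of column names directly, so unpacking with * is optional. A quick equivalent sketch:

# equivalent: pass the list itself instead of unpacking it
output = df.select(cols)
output.show()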

Method 3 – Column name is passed as a string :-

 +---+---+ 
 | c1| c2|
 +---+---+
 |foo|  1| 
 |bar|  2|
 +---+---+

# column name is passed as a string
output = df.select("c1")
output.show()
 +---+
 | c1|
 +---+
 |foo| 
 |bar|
 +---+
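
To select more than one column this way, pass each name as a separate string argument, for example:

# multiple column names passed as separate strings
output = df.select("c1", "c2")
output.show()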

Read Part 4:- Delete columns in a PySpark DataFrame
