PySpark Part 4: How to Delete Columns in a PySpark DataFrame

Deleting a column, or a list of columns, from a PySpark DataFrame is often tricky for beginners. Below are the ways to do it using drop().

Let's create a Spark session and a DataFrame before we practise deleting columns.
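The setup step is not shown here, so the following is only a minimal sketch of it, assuming a local Spark installation. The helper names get_spark and make_sample_df, the app name, and the local[*] master are illustrative assumptions; the rows match the tables used in the examples that follow.

```python
def get_spark(app_name="drop-columns-demo"):
    """Return a local SparkSession (hypothetical helper; the app name is arbitrary)."""
    # Imported inside the function so that make_sample_df can be reused
    # on its own without importing Spark.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master("local[*]")  # assumption: run Spark locally on all cores
        .appName(app_name)
        .getOrCreate()
    )


def make_sample_df(spark):
    """Build the three-row sample DataFrame with columns id, data and flag."""
    rows = [(1, "a1", "yes"), (2, "a2", "yes"), (3, "a3", "yes")]
    return spark.createDataFrame(rows, ["id", "data", "flag"])
```

With these in place, df = make_sample_df(get_spark()) gives the three-column DataFrame used in section 2; dropping "flag" from it gives the two-column DataFrame used in section 1.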

1) Deleting a Single Column in a PySpark DataFrame

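Below is a DataFrame with 2 columns, i.e. “id” and “data”. We delete the “data” column using drop(). Note that drop() returns a new DataFrame rather than modifying the existing one, so the result is assigned back to df.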

+---+----+
| id|data|
+---+----+
|  1|  a1|
|  2|  a2|
|  3|  a3|
+---+----+

df = df.drop("data")
df.show()

+---+
| id|
+---+
|  1|
|  2|
|  3|
+---+
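One thing worth knowing is that drop() silently ignores a column name that is not in the DataFrame, so a typo leaves the data unchanged without raising any error. A minimal sketch of a guard against that follows. The name drop_column_checked is a hypothetical helper, not part of PySpark, and it relies only on df.columns and df.drop().

```python
def drop_column_checked(df, name):
    """Drop `name` from df, raising if the column does not exist.

    Hypothetical helper: DataFrame.drop() on its own is a no-op for an
    unknown column name, which can hide typos.
    """
    # Exact (case-sensitive) match against the DataFrame's column names.
    if name not in df.columns:
        raise ValueError(f"Column {name!r} not found; available: {df.columns}")
    # drop() returns a new DataFrame; the original df is left untouched.
    return df.drop(name)
```

Here df = drop_column_checked(df, "data") behaves like df.drop("data"), while drop_column_checked(df, "dat") raises a ValueError instead of quietly returning the DataFrame unchanged.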

2) Deleting Multiple Columns

Below is a DataFrame with 3 columns, i.e. “id”, “data” and “flag”. We delete the “data” and “flag” columns using drop(). Because drop() takes the column names as separate arguments, the list is unpacked with *.

+---+----+----+
| id|data|flag|
+---+----+----+
|  1|  a1| yes|
|  2|  a2| yes|
|  3|  a3| yes|
+---+----+----+

cols_to_drop = ["data", "flag"]
df = df.drop(*cols_to_drop)
df.show()

+---+
| id|
+---+
|  1|
|  2|
|  3|
+---+
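When it is easier to think in terms of which columns to keep, the same result can be reached with select() instead of drop(). The sketch below assumes nothing beyond df.columns and df.select(); keep_all_except is a hypothetical helper name, not a PySpark function.

```python
def keep_all_except(df, cols_to_drop):
    """Return df with every column except those named in cols_to_drop."""
    unwanted = set(cols_to_drop)
    # Filter df.columns so the remaining columns keep their original order.
    keep = [c for c in df.columns if c not in unwanted]
    return df.select(*keep)
```

For the three-column DataFrame above, df = keep_all_except(df, ["data", "flag"]) leaves only the "id" column, matching the result of df.drop(*cols_to_drop).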

