PySpark Part 2: What are Spark DataFrames and RDDs?

DataFrame – a BUZZ word!! People use it across the popular data-analysis languages, such as Python, Scala and R. With this article we aim to make the term VERY CLEAR.

What is a Spark DataFrame?

  • A Spark DataFrame is a distributed collection of data organised into named columns.
  • It is conceptually equivalent to a table in a relational (SQL) database, but with better optimisation techniques under the hood.
A sample DataFrame, as printed by df.show():

+---+---------+-------+
| id|Operation|  Value|
+---+---------+-------+
|  1| Date_Min| 148590|
|  1| Date_Max| 148590|
|  1|   Device| iphone|
|  2| Date_Min| 148590|
+---+---------+-------+
  • We can perform operations such as filter, group by and aggregate on a DataFrame.
  • A DataFrame can be constructed from data files (CSV, TXT), JSON, Hive tables, or tables in an external database.

Features of DataFrame

  • DataFrames are distributed in nature, which makes them fault-tolerant and highly available data structures.
  • DataFrames are immutable. By immutable I mean that their state cannot be modified after they are created. But, as with RDDs, we can derive new values by applying transformations.
  • DataFrames are lazily evaluated. Lazy evaluation in Spark means that execution does not start until an action is triggered. Transformations (groupBy, filter, etc.) only build up the execution plan; actions (collect, count, show, etc.) actually run it.

READ PART 2.1: How to Create a DataFrame
