This is the second post in the PySpark XP series (click for the first), and its main theme is Spark plans.
In this post you will learn how to decipher Spark plans, and how to leverage that knowledge to optimize the performance of your Spark jobs.
Some Spark developers avoid reading the plan printed by
.explain() at all costs.
I get it; it might be frightening.
One may look online for the meaning of each "curse" that appears in the plan; I believe doing so will help you write better Spark applications.
On the other hand, one can…