Decipher Spark plans and leverage them to your benefit

This is the 2nd post in the PySpark XP series (click for the 1st), and its main theme is Spark plans.

In this post you will learn how to decipher Spark plans and how to use that knowledge to optimize the performance of your Spark jobs.

Photo by Med Badr Chemmaoui on Unsplash

Some Spark developers avoid reading the plan printed by .explain() at all costs.

I get it, it might be frightening.

One may look online for the meaning of each curse that appears in the plan; I believe doing so helps in writing better Spark applications.

On the other hand, one can…

This is the first post in a series, PySpark XP, in which each post consists of 5 tips. XP stands for experience points, as the tips relate to things I learnt from my experience with PySpark. Each post will provide tips about a different aspect of that experience.

The first post is about syntax. It will be a valuable lesson for PySpark beginners. More experienced developers can learn from it too (check out Tip#2).

To check this post’s Jupyter notebook click here.

Columns are objects on their own. One can put a Column in a list or…

TL;DR: I optimized Spark joins and reduced runtime from 90 mins to just 7 mins. Use a withColumn operation instead of a join operation and speed up your Spark joins ~10 times.

If you are an experienced Spark developer, you have probably encountered the pain of joining dataframes. It can feel like you must be a true master to join dataframes efficiently. The questions one may ask oneself are:

  • Should I repartition my dataframe?
  • Should I broadcast the smaller dataframe?
  • What about the spark.sql.shuffle.partitions parameter?

These questions have been occupying many people for a long time. Some really…

Dan Flomin
