Saturday, 15 November 2025

Spark 01: Interview Questions - Broadcast Join vs Shuffle Join

 

🔄 Broadcast Join vs Shuffle Join 

🚀 Broadcast Join

Idea: “Send the tiny table to everyone.”

  • When one table is small enough to fit in memory
  • Spark copies (broadcasts) this small table to all worker nodes
  • The big table stays where it is
  • Super fast: the big table is never shuffled!

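Use when:

  • One side is a small dimension table (e.g., country code lookup)
  • The small table is under roughly 10–100 MB (Spark's default auto-broadcast threshold, spark.sql.autoBroadcastJoinThreshold, is 10 MB)
  • You want the fastest join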

Why fast?
Because copying one small table to every executor once is cheaper than shuffling the big table across the network.
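
To make the idea concrete, here is a minimal pure-Python sketch of what a broadcast join does conceptually. It is not Spark code: the partition lists, the `country_lookup` table and the `broadcast_join` helper are all hypothetical names made up for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical big "fact" table, already split into partitions (one per worker).
# Each row is (country_code, amount).
fact_partitions: List[List[Tuple[str, int]]] = [
    [("US", 100), ("IN", 250)],
    [("DE", 75), ("US", 40)],
]

# Hypothetical small dimension table: country code -> country name.
country_lookup: Dict[str, str] = {
    "US": "United States",
    "IN": "India",
    "DE": "Germany",
}


def broadcast_join(partitions, small_table):
    """Join each partition against its own full copy of the small table."""
    joined = []
    for partition in partitions:
        # The "broadcast": every worker receives one copy of the small table.
        local_copy = dict(small_table)
        # Rows of the big table are joined in place and never leave their partition.
        joined.append(
            [
                (code, amount, local_copy[code])
                for code, amount in partition
                if code in local_copy
            ]
        )
    return joined


result = broadcast_join(fact_partitions, country_lookup)
```

In PySpark itself, the usual way to ask for this behaviour is the `broadcast()` hint from `pyspark.sql.functions`, for example `big_df.join(broadcast(small_df), "country_code")`, where `big_df` and `small_df` are placeholder DataFrames.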

 

🔀 Shuffle Join

Idea: “Group both tables by the join key.”

  • Used when both tables are large
  • Spark repartitions (shuffles) both tables on the join key
  • Every node gets matching keys from both tables
  • More network I/O → slower & expensive

Use when:

  • Both tables are big
  • Join key is high-cardinality
  • No table is small enough to broadcast

Why slow?
Because Spark must move rows from both tables across the cluster, and a shuffle is usually one of the most expensive operations in a Spark job.
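
The steps above can be sketched in plain Python. This is a simplified illustration, not Spark's implementation: `shuffle_by_key`, `shuffle_join` and the sample rows are hypothetical, and the per-partition matching uses a hash lookup for brevity, whereas Spark's default shuffle join for large tables typically sorts and merges each partition.

```python
from collections import defaultdict


def shuffle_by_key(rows, num_partitions):
    """Repartition rows so that equal keys always land in the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in rows:
        # In a real cluster this step moves the row over the network.
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def shuffle_join(left_rows, right_rows, num_partitions=3):
    """Inner join: shuffle BOTH sides on the key, then join partition by partition."""
    left_parts = shuffle_by_key(left_rows, num_partitions)
    right_parts = shuffle_by_key(right_rows, num_partitions)

    joined = []
    for left_part, right_part in zip(left_parts, right_parts):
        # Index the right side of this partition by key.
        right_index = defaultdict(list)
        for key, value in right_part:
            right_index[key].append(value)
        # Matching keys are guaranteed to be in the same partition pair.
        for key, left_value in left_part:
            for right_value in right_index.get(key, []):
                joined.append((key, left_value, right_value))
    return joined


# Hypothetical sample data keyed by customer id.
orders = [(1, "order-a"), (2, "order-b"), (1, "order-c"), (4, "order-d")]
payments = [(1, "paid"), (2, "pending"), (3, "refunded")]

shuffle_result = sorted(shuffle_join(orders, payments))
```

Notice that every row of both inputs passes through `shuffle_by_key`, which is exactly the data movement a broadcast join avoids for the big table.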

 

🥊 Quick Comparison

Feature      | Broadcast Join   | Shuffle Join
Table size   | One table small  | Both tables large
Network cost | Low              | High
Execution    | No shuffle       | Full shuffle
Speed        | Very fast        | Slower
Ideal for    | Dim lookup joins | Large fact-fact joins
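
As a rough rule of thumb, the comparison boils down to a size check on the smaller table. The sketch below is a deliberately simplified, hypothetical helper (`choose_join_strategy` is not a Spark API). Spark's real planner also looks at join type, hints and table statistics, so treat this as an interview-level mental model only. The 10 MB default mirrors Spark's default value for `spark.sql.autoBroadcastJoinThreshold`.

```python
# Spark's default spark.sql.autoBroadcastJoinThreshold is 10 MB.
DEFAULT_BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def choose_join_strategy(
    left_bytes: int,
    right_bytes: int,
    threshold_bytes: int = DEFAULT_BROADCAST_THRESHOLD_BYTES,
) -> str:
    """Pick 'broadcast' if the smaller side fits under the threshold, else 'shuffle'."""
    # Simplifying assumption: only the estimated table sizes matter.
    if min(left_bytes, right_bytes) <= threshold_bytes:
        return "broadcast"
    return "shuffle"


MB = 1024 * 1024
GB = 1024 * MB

dim_lookup = choose_join_strategy(5 * MB, 50 * GB)   # small dimension vs big fact
fact_to_fact = choose_join_strategy(2 * GB, 50 * GB)  # both sides large
```

With these hypothetical sizes, the 5 MB lookup table falls under the threshold and gets broadcast, while the two large tables fall back to a shuffle join.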

 
