🔄 Broadcast Join vs Shuffle Join
🚀 Broadcast Join
Idea: “Send the tiny table to everyone.”
- Used when one table is small enough to fit in memory
- Spark copies (broadcasts) this small table to all worker nodes
- The big table stays where it is
- Super fast: the big table is never shuffled!
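To make the mechanics concrete, here is a minimal plain-Python sketch of the idea, not Spark's actual implementation. The sample data and the names `big_partitions`, `small_table`, and `broadcast_join` are made up for illustration; each inner list stands in for the slice of the big table held by one worker.

```python
from typing import Dict, List, Tuple

# Hypothetical sample data: the big table is already split across two workers.
big_partitions: List[List[Tuple[str, int]]] = [
    [("US", 100), ("DE", 250)],            # rows held by worker 0
    [("FR", 75), ("US", 30), ("JP", 10)],  # rows held by worker 1
]

# Hypothetical small dimension table: country code -> country name.
small_table: Dict[str, str] = {
    "US": "United States",
    "DE": "Germany",
    "FR": "France",
}


def broadcast_join(partitions, lookup):
    """Inner-join every partition against its own full copy of the small table."""
    joined_partitions = []
    for partition in partitions:
        local_copy = dict(lookup)  # stands in for the broadcast copy on this worker
        joined = [
            (code, amount, local_copy[code])
            for code, amount in partition
            if code in local_copy  # inner join: rows without a match are dropped
        ]
        joined_partitions.append(joined)
    return joined_partitions


result = broadcast_join(big_partitions, small_table)
for worker_id, rows in enumerate(result):
    print(worker_id, rows)
```

Note that no row of the big table ever changes partition; only the lookup is copied. In PySpark, the equivalent request is usually written with the `broadcast` hint from `pyspark.sql.functions`, for example `big_df.join(broadcast(small_df), "country_code")`, where the DataFrame and column names are placeholders.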
Use when:
✔ Small dimension table (e.g., country code lookup)
✔ Table < ~10–100 MB (Spark auto-broadcasts below spark.sql.autoBroadcastJoinThreshold, 10 MB by default)
✔ Want the fastest join
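The size rule in this checklist can be sketched as a tiny decision function. This is a simplification under stated assumptions: `choose_join_strategy` is a hypothetical helper, and Spark's real planner also weighs join hints, join type, and table statistics. The 10 MB default mirrors `spark.sql.autoBroadcastJoinThreshold`, and a negative threshold models turning auto-broadcast off.

```python
MB = 1024 * 1024
GB = 1024 * MB

# Mirrors the default of spark.sql.autoBroadcastJoinThreshold (10 MB).
DEFAULT_THRESHOLD_BYTES = 10 * MB


def choose_join_strategy(left_bytes: int, right_bytes: int,
                         threshold_bytes: int = DEFAULT_THRESHOLD_BYTES) -> str:
    """Return "broadcast" if the smaller side fits under the threshold, else "shuffle".

    Hypothetical helper for illustration only; it looks at size and nothing else.
    """
    smaller_side = min(left_bytes, right_bytes)
    if smaller_side <= threshold_bytes:
        return "broadcast"
    return "shuffle"


print(choose_join_strategy(5 * MB, 50 * GB))                                # broadcast
print(choose_join_strategy(200 * MB, 50 * GB))                              # shuffle
print(choose_join_strategy(200 * MB, 50 * GB, threshold_bytes=256 * MB))    # broadcast
```

Raising the threshold, as in the last call, is how a table in the upper part of the 10–100 MB range ends up broadcast; the trade-off is that every worker must hold the whole table in memory.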
Why fast?
Because copying one small table to every node is far cheaper than redistributing the big table across the network.
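A back-of-envelope estimate shows the gap. The figures below (a 50 MB lookup, a 100,000 MB fact table, 20 workers) are invented for illustration, and the model is deliberately crude: it assumes uniform hashing and ignores compression, serialization, and how the broadcast is actually distributed.

```python
def broadcast_mb_moved(small_mb: float, num_workers: int) -> float:
    """Broadcast: one full copy of the small table is sent to every worker."""
    return small_mb * num_workers


def shuffle_mb_moved(big_mb: float, small_mb: float, num_workers: int) -> float:
    """Shuffle: both tables are re-partitioned by key.

    Assumes uniform hashing, so a row already sits on its target worker
    with probability 1 / num_workers and must travel otherwise.
    """
    return (big_mb + small_mb) * (num_workers - 1) / num_workers


# Hypothetical sizes, in MB.
small_mb, big_mb, workers = 50, 100_000, 20

broadcast_cost = broadcast_mb_moved(small_mb, workers)
shuffle_cost = shuffle_mb_moved(big_mb, small_mb, workers)

print(f"broadcast moves ~{broadcast_cost:,.0f} MB")
print(f"shuffle moves   ~{shuffle_cost:,.0f} MB")
```

Under these assumptions the broadcast moves about 1,000 MB while the shuffle moves roughly 95,000 MB. The broadcast cost grows with the number of workers, which is one reason the technique only pays off while the small table stays small.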
🔀 Shuffle Join
Idea: “Group both tables by the join key.”
- Used when both tables are large
- Spark repartitions (shuffles) both tables on the join key
- Every node gets matching keys from both tables
- More network I/O → slower and more expensive
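The steps above can be sketched in plain Python. This is an illustration, not Spark's code: `partition_for`, `shuffle`, and `shuffle_join` are hypothetical names, `crc32` stands in for Spark's own hash partitioner, and the per-partition join here is a simple hash join, whereas Spark by default typically sorts and merges each partition after the shuffle.

```python
from collections import defaultdict
from zlib import crc32


def partition_for(key: str, num_partitions: int) -> int:
    """Deterministically map a join key to a partition (stand-in partitioner)."""
    return crc32(key.encode("utf-8")) % num_partitions


def shuffle(rows, num_partitions):
    """Redistribute (key, value) rows so equal keys share a partition."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in rows:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


def shuffle_join(left_rows, right_rows, num_partitions):
    """Inner join: shuffle both sides on the key, then join partition by partition."""
    left_parts = shuffle(left_rows, num_partitions)
    right_parts = shuffle(right_rows, num_partitions)

    joined = []
    for left_part, right_part in zip(left_parts, right_parts):
        # Index the right-hand partition by key for local lookups.
        right_by_key = defaultdict(list)
        for key, value in right_part:
            right_by_key[key].append(value)
        # Matching keys are guaranteed to be in the same partition pair.
        for key, left_value in left_part:
            for right_value in right_by_key.get(key, []):
                joined.append((key, left_value, right_value))
    return joined


# Hypothetical sample data: two tables keyed by user id.
orders = [("u1", "order-1"), ("u2", "order-2"), ("u1", "order-3"), ("u4", "order-4")]
clicks = [("u1", "click-a"), ("u2", "click-b"), ("u3", "click-c")]

result = sorted(shuffle_join(orders, clicks, num_partitions=3))
print(result)
```

Unlike the broadcast case, every row of both tables passes through `shuffle`, which is the step that translates into network traffic on a real cluster.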
Use when:
✔ Both tables are big
✔ Join key is high-cardinality
✔ No table is small enough to broadcast
Why slow?
Because Spark must move rows from both tables across the cluster, and a shuffle is one of the most expensive operations in a Spark job.
🥊 Quick Comparison
| Feature | Broadcast Join | Shuffle Join |
|---|---|---|
| Table size | One table small | Both tables large |
| Network cost | Low | High |
| Execution | No shuffle | Full shuffle |
| Speed | Very fast | Slower |
| Ideal for | Dim lookup joins | Large fact-fact joins |