Sunday, 16 November 2025

Performance Optimization 01: SQL Optimizaiton without materializing data

 

1️ Use Proper Indexing (Most Important)

Indexes allow the database to avoid full table scans.

Create indexes on:

  • JOIN keys
  • WHERE clause columns
  • GROUP BY columns
  • ORDER BY columns

Example:

CREATE INDEX idx_orders_customer_id ON orders(customer_id);

CREATE INDEX idx_customer_country ON customer(country);

Key idea:

Indexes let the database filter and join data efficiently without copying or storing anything.

 

2️.Rewrite Subqueries as JOINs or EXISTS

Avoid IN (subquery) and correlated subqueries when possible.

Slow

SELECT * FROM orders

WHERE customer_id IN (SELECT id FROM customers WHERE country='US');

Faster (JOIN)

SELECT o.*

FROM orders o

JOIN customers c ON o.customer_id = c.id

WHERE c.country='US';

Also fast (EXISTS)

SELECT *

FROM orders o

WHERE EXISTS (

  SELECT 1 FROM customers c

  WHERE c.id = o.customer_id AND c.country='US'

);

 

3️.Choose the Right JOIN Type

Unnecessary join types can degrade performance.

Replace:

  • LEFT JOIN → INNER JOIN (if possible)
  • FULL OUTER JOIN → split logic into UNION ALL
  • CROSS JOIN → avoid unless needed

Fewer rows processed = faster queries.

 

4️.Push Filters Down Early (Predicate Pushdown)

Apply filters on the smallest dataset first.

Slow

SELECT ...

FROM big_table b

JOIN small_table s ON ...

WHERE s.type = 'X';

Fast

Move predicate to small table before join:

SELECT ...

FROM big_table b

JOIN (SELECT * FROM small_table WHERE type='X') s ON ...

This reduces join workload without materializing the data.

 

5️.Avoid Functions on Indexed Columns

This blocks index usage.

Bad

WHERE DATE(created_at) = '2024-01-01'

Good

WHERE created_at >= '2024-01-01'

  AND created_at < '2024-01-02'

 

6️.Use Covering Indexes

A covering index contains all columns needed, so the DB doesn't fetch the table.

Example query:

SELECT amount, created_at

FROM orders

WHERE customer_id = 100;

Create covering index:

CREATE INDEX idx_orders_cover ON orders(customer_id, created_at, amount);

The DB can serve the entire query from the index only
→ faster, no temp storing.

 

7️ **Avoid SELECT ***

Only select columns you need.

Bad

SELECT *

FROM orders o

JOIN customers c ON ...

Good

SELECT o.id, o.amount, c.name

FROM orders o

JOIN customers c ON ...

Less data scanned + less data transferred.

 

8️.Use LIMIT, WINDOWING, and Pagination

Avoid scanning large datasets.

Example Pagination:

SELECT * FROM orders

ORDER BY id

LIMIT 50 OFFSET 0;

Avoid OFFSET for large pages:

Use keyset pagination:

SELECT *

FROM orders

WHERE id > last_seen_id

ORDER BY id

LIMIT 50;

 

9️ Normalize Query Logic (No Redundant Operations)

Avoid repeating the same subquery multiple times.

Bad

SELECT (SELECT price FROM products WHERE id = o.product_id),

       (SELECT category FROM products WHERE id = o.product_id)

FROM orders o;

Good

SELECT p.price, p.category

FROM orders o

JOIN products p ON p.id = o.product_id;

 

10. Use Database-specific Optimizer Hints (When Needed)

These do not materialize data; they influence execution plan.

Examples:

  • MySQL: STRAIGHT_JOIN
  • Oracle: USE_NL, NO_MERGE
  • SQL Server: OPTION (HASH JOIN)
  • Postgres: SET enable_seqscan=off (temporary)

Only use when the optimizer chooses a poor plan.

 

1️1.Partitioning (Logical, Not Materializing)

Partitioning does not materialize data; it splits tables for faster scanning.

Use partitioning on:

  • date columns
  • high-cardinality keys

Improves:

  • scanning
  • filtering
  • aggregation

Without storing extra copies of data.

 

1️2.Use Window Functions Instead of Self-Joins

Window functions compute aggregates without extra joins.

Slow

SELECT o.*,

       (SELECT SUM(amount) FROM orders WHERE customer_id=o.customer_id)

FROM orders o;

Fast (window)

SELECT o.*,

       SUM(amount) OVER (PARTITION BY customer_id) AS customer_total

FROM orders o;

 

🧠 Summary: Optimization Without Materializing Data

Technique

Benefit

Indexes

Fast filtering and joining

Rewriting subqueries

Reduce scans + better execution plans

Join optimization

Process fewer rows

Predicate pushdown

Filter early

Covering indexes

Avoid table lookups

Avoid functions on indexed columns

Enable index usage

Keyset pagination

Avoid large offsets

Window functions

Avoid redundant joins

Partitioning

Faster scans on large datasets

 

No comments:

Post a Comment

Data Engineering - Client Interview question regarding data collection.

What is the source of data How the data will be extracted from the source What will the data format be? How often should data be collected? ...