Saturday, 13 December 2025

Data Engineering - Client Interview questions regarding data collection.

  • What is the source of the data?
  • How will the data be extracted from the source?
  • What will the data format be?
  • How often should the data be collected?
  • How should missing data points be handled, and what rules must be applied to them?
  • How will the data be received by the reporting system (warehouse): pull or push?

Sunday, 16 November 2025

Performance Optimization 01: SQL Optimization Without Materializing Data

 

1. Use Proper Indexing (Most Important)

Indexes allow the database to avoid full table scans.

Create indexes on:

  • JOIN keys
  • WHERE clause columns
  • GROUP BY columns
  • ORDER BY columns

Example:

CREATE INDEX idx_orders_customer_id ON orders(customer_id);

CREATE INDEX idx_customer_country ON customer(country);

Key idea:

Indexes let the database filter and join data efficiently without copying or storing anything.

 

2. Rewrite Subqueries as JOINs or EXISTS

Avoid IN (subquery) and correlated subqueries when possible.

Slow

SELECT * FROM orders

WHERE customer_id IN (SELECT id FROM customers WHERE country='US');

Faster (JOIN)

SELECT o.*

FROM orders o

JOIN customers c ON o.customer_id = c.id

WHERE c.country='US';

Also fast (EXISTS)

SELECT *

FROM orders o

WHERE EXISTS (

  SELECT 1 FROM customers c

  WHERE c.id = o.customer_id AND c.country='US'

);

 

3. Choose the Right JOIN Type

Choosing a broader join type than necessary can degrade performance.

Replace:

  • LEFT JOIN → INNER JOIN (if possible)
  • FULL OUTER JOIN → split logic into UNION ALL
  • CROSS JOIN → avoid unless needed

Fewer rows processed = faster queries.
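
For example, on databases without FULL OUTER JOIN support (such as MySQL), the same result can be produced from two cheaper joins. A minimal sketch using the orders/customers tables from the earlier examples:

-- All orders, with their customer where one exists
SELECT o.id, c.name
FROM orders o
LEFT JOIN customers c ON c.id = o.customer_id

UNION ALL

-- Plus the customers that have no matching order (the rows a LEFT JOIN misses)
SELECT o.id, c.name
FROM orders o
RIGHT JOIN customers c ON c.id = o.customer_id
WHERE o.id IS NULL;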

 

4. Push Filters Down Early (Predicate Pushdown)

Apply filters on the smallest dataset first.

Slow

SELECT ...

FROM big_table b

JOIN small_table s ON ...

WHERE s.type = 'X';

Fast

Move predicate to small table before join:

SELECT ...

FROM big_table b

JOIN (SELECT * FROM small_table WHERE type='X') s ON ...

This reduces join workload without materializing the data.

 

5. Avoid Functions on Indexed Columns

This blocks index usage.

Bad

WHERE DATE(created_at) = '2024-01-01'

Good

WHERE created_at >= '2024-01-01'

  AND created_at < '2024-01-02'

 

6. Use Covering Indexes

A covering index contains all columns needed, so the DB doesn't fetch the table.

Example query:

SELECT amount, created_at

FROM orders

WHERE customer_id = 100;

Create covering index:

CREATE INDEX idx_orders_cover ON orders(customer_id, created_at, amount);

The DB can serve the entire query from the index alone
→ faster, with no temporary storage.

 

7. Avoid SELECT *

Only select columns you need.

Bad

SELECT *

FROM orders o

JOIN customers c ON ...

Good

SELECT o.id, o.amount, c.name

FROM orders o

JOIN customers c ON ...

Less data scanned + less data transferred.

 

8. Use LIMIT, Windowing, and Pagination

Avoid scanning large datasets.

Example Pagination:

SELECT * FROM orders

ORDER BY id

LIMIT 50 OFFSET 0;

Avoid OFFSET for large pages:

Use keyset pagination:

SELECT *

FROM orders

WHERE id > last_seen_id

ORDER BY id

LIMIT 50;

 

9. Normalize Query Logic (No Redundant Operations)

Avoid repeating the same subquery multiple times.

Bad

SELECT (SELECT price FROM products WHERE id = o.product_id),

       (SELECT category FROM products WHERE id = o.product_id)

FROM orders o;

Good

SELECT p.price, p.category

FROM orders o

JOIN products p ON p.id = o.product_id;

 

10. Use Database-specific Optimizer Hints (When Needed)

These do not materialize data; they only influence the execution plan.

Examples:

  • MySQL: STRAIGHT_JOIN
  • Oracle: USE_NL, NO_MERGE
  • SQL Server: OPTION (HASH JOIN)
  • Postgres: SET enable_seqscan=off (temporary)

Only use when the optimizer chooses a poor plan.
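
As an illustration, MySQL's STRAIGHT_JOIN forces tables to be joined in the order they are written, which can help when the optimizer picks a bad join order. A sketch using the earlier orders/customers tables:

-- MySQL: join orders to customers in exactly this order
SELECT STRAIGHT_JOIN o.id, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE c.country = 'US';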

 

11. Partitioning (Logical, Not Materialized)

Partitioning does not materialize data; it splits tables for faster scanning.

Use partitioning on:

  • date columns
  • high-cardinality keys

Improves:

  • scanning
  • filtering
  • aggregation

Without storing extra copies of data.
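
A minimal sketch of declarative range partitioning on a date column, using PostgreSQL syntax (other databases differ; the table and column names here are illustrative):

-- Parent table partitioned by created_at; no extra copy of the data is stored
CREATE TABLE orders_p (
    id          bigint,
    customer_id bigint,
    amount      numeric,
    created_at  date
) PARTITION BY RANGE (created_at);

-- Queries filtering on created_at scan only the matching partition
CREATE TABLE orders_p_2024_q1 PARTITION OF orders_p
    FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');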

 

12. Use Window Functions Instead of Self-Joins

Window functions compute aggregates without extra joins.

Slow

SELECT o.*,

       (SELECT SUM(amount) FROM orders WHERE customer_id=o.customer_id)

FROM orders o;

Fast (window)

SELECT o.*,

       SUM(amount) OVER (PARTITION BY customer_id) AS customer_total

FROM orders o;

 

🧠 Summary: Optimization Without Materializing Data

  • Indexes: fast filtering and joining
  • Rewriting subqueries: fewer scans and better execution plans
  • Join optimization: fewer rows processed
  • Predicate pushdown: filters applied early
  • Covering indexes: no table lookups
  • Avoiding functions on indexed columns: index usage preserved
  • Keyset pagination: no large offsets
  • Window functions: no redundant joins
  • Partitioning: faster scans on large datasets

 

Performance Optimization 02: Debouncing with Elasticsearch

 

🔍 Why Debounce with Elasticsearch?

When building search functionalities (like autocomplete, live search, or suggestions), every keystroke can trigger a request to Elasticsearch.

Elasticsearch queries can be:

  • CPU-intensive

  • Heavy on cluster resources

  • Network-expensive

Without debouncing:

  • Typing “smart” could trigger 5 queries: s → sm → sma → smar → smart

  • This generates unnecessary load

  • Can cause UI lag and slow search results

Debouncing solves this by waiting for users to pause typing before sending an Elasticsearch request.

⚙️ How Debouncing Helps with Elasticsearch

Debouncing ensures:

  • Only one request is sent after the user stops typing (e.g., after 300ms)

  • Fewer queries → Faster UI → Less load on Elasticsearch cluster

  • More reliable results (fewer overlapping requests means fewer stale or out-of-order responses)


🧠 Flow Diagram (Concept)

User types → debounce timer resets → waits X ms → No new keystrokes? → Trigger Elasticsearch query → Show results

🧩 Code Implementations

1. JavaScript Frontend Debouncing + Elasticsearch Query (Common Approach)

function debounce(fn, delay) {
  let timer;
  return function (...args) {
    clearTimeout(timer);                                   // restart the timer on every call
    timer = setTimeout(() => fn.apply(this, args), delay); // fire only after the pause
  };
}

async function searchElastic(query) {
  const response = await fetch(`/api/search?q=${encodeURIComponent(query)}`);
  const data = await response.json();
  console.log("Results:", data);
}

// Attach debounce to the input: the search fires only after typing pauses
const debouncedSearch = debounce(searchElastic, 300);
document.getElementById("search-box").addEventListener("input", (e) => {
  debouncedSearch(e.target.value);
});

How it works:

  • The request fires only after typing stops for 300 ms.

  • Great for autocomplete or suggestions.

2. Node.js Backend Debouncing (Less Common but Possible)

If the server receives too many rapid requests (e.g., microservices), you can debounce on the backend:

const express = require("express");
const debounce = require("lodash.debounce");
const { Client } = require("@elastic/elasticsearch");

const app = express();
const client = new Client({ node: "http://localhost:9200" });

// Only the last call within 300 ms actually queries Elasticsearch.
// Caution: the earlier, skipped requests never receive a response.
const performSearch = debounce(async (query, res) => {
  const result = await client.search({
    index: "products",
    query: { match: { name: query } },
  });
  res.json(result.hits.hits);
}, 300);

app.get("/search", (req, res) => {
  performSearch(req.query.q, res);
});

app.listen(3000);

Note: Backend debouncing is only useful in special, controlled scenarios; in general, debouncing belongs in the frontend.

3. React Autocomplete Search (Popular UI Pattern)

import { useState, useCallback } from "react";
import debounce from "lodash.debounce";

function SearchBox() {
  const [results, setResults] = useState([]);

  // Memoize the debounced function so it is created once and survives re-renders
  const searchElastic = useCallback(
    debounce(async (query) => {
      const res = await fetch(`/api/search?q=${encodeURIComponent(query)}`);
      const data = await res.json();
      setResults(data);
    }, 300),
    []
  );

  return (
    <input
      type="text"
      onChange={(e) => searchElastic(e.target.value)}
      placeholder="Search..."
    />
  );
}

🎯 Best Practices for Debouncing with Elasticsearch

✔ 1. Use 250–500 ms debounce delay

Lower delays cause more frequent calls; higher delays hurt UX.

✔ 2. Use Suggesters or Search-as-you-type fields

Elasticsearch features like:

  • completion suggester

  • search_as_you_type

  • edge N-grams

These field types are optimized for prefix-style instant search and pair naturally with UI debouncing; a minimal mapping sketch follows.
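
A minimal mapping sketch for the search_as_you_type field type (assuming the products index used elsewhere in this note):

PUT products
{
  "mappings": {
    "properties": {
      "name": { "type": "search_as_you_type" }
    }
  }
}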

✔ 3. Cache previous responses

If the user repeats queries, return cached results instantly.
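
A minimal in-memory sketch, using a plain Map keyed by the query string (a real app would cap its size or expire entries):

const cache = new Map();

async function cachedSearch(query) {
  if (cache.has(query)) return cache.get(query); // repeated query: instant hit
  const res = await fetch(`/api/search?q=${encodeURIComponent(query)}`);
  const data = await res.json();
  cache.set(query, data);
  return data;
}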

✔ 4. Use async cancellation

If a new query fires, cancel the previous request so a stale response cannot overwrite newer results.
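
A minimal sketch using AbortController to drop the stale in-flight request (assumes the same /api/search endpoint as the earlier examples):

let controller = null;

async function searchElastic(query) {
  if (controller) controller.abort();   // cancel the previous in-flight request
  controller = new AbortController();
  try {
    const response = await fetch(`/api/search?q=${encodeURIComponent(query)}`, {
      signal: controller.signal,
    });
    const data = await response.json();
    console.log("Results:", data);
  } catch (err) {
    if (err.name !== "AbortError") throw err; // aborted requests are expected
  }
}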

🧾 Example: Elasticsearch Query for Autocomplete

GET products/_search
{
  "query": {
    "match_phrase_prefix": {
      "name": "smart"
    }
  }
}

Useful for autocomplete with debounced calls.

Deep Learning 11: What is Dropout?

 What is Dropout?

  • Dropout is a regularization technique used in deep learning to reduce overfitting.
  • During training, it randomly “drops” (sets to 0) a fraction of neurons in a layer.
  • This forces the network to learn more robust patterns instead of relying too heavily on specific neurons.

⚙️ How It Works

  1. Training phase (forward pass):
    • Each time the model processes a batch, dropout randomly deactivates some neurons.
    • Example:
      • Pass 1 → neurons n1, n3, n4 dropped.
      • Pass 2 → neurons n2, n5 dropped.
    • The pattern changes every batch, so the model can’t depend on fixed neurons.
  2. Testing/Inference phase:
    • Dropout is disabled.
    • All neurons are active, and their outputs are scaled so activations match their training-time expectation. (Most frameworks, including Keras, use "inverted dropout", which scales during training instead; see the sketch after this list.)
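
A minimal NumPy sketch of inverted dropout, the variant most frameworks implement, where the rescaling happens during training so inference needs no change:

import numpy as np

def dropout_train(x, rate=0.5, rng=np.random.default_rng(0)):
    # Keep each unit with probability 1 - rate, then rescale the survivors
    # so the expected activation matches the no-dropout value.
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

x = np.ones(10)
print(dropout_train(x, rate=0.5))  # roughly half the entries are 0, the rest are 2.0
# At inference time, inverted dropout simply passes x through unchanged.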

📌 Why Use Dropout?

  • Prevents overfitting (memorizing training data instead of generalizing).
  • Encourages redundancy in feature learning.
  • Improves generalization to unseen data.
  • Simple and effective — often used with rates like 0.2 (20%) or 0.5 (50%).

Example in Keras


from tensorflow.keras.models import Sequential

from tensorflow.keras.layers import Dense, Dropout

 

model = Sequential([

    Dense(128, activation='relu', input_shape=(784,)),

    Dropout(0.5),  # randomly drop 50% of neurons

    Dense(64, activation='relu'),

    Dropout(0.2),  # randomly drop 20% of neurons

    Dense(10, activation='softmax')

])

In short: Dropout is like making your model “forget” parts of itself during training so it learns to be flexible and generalize better.

 

Deep Learning 10: What is an Epoch?

 

🔄 What is an Epoch?

  • An epoch is one complete pass through the entire training dataset by the model.

  • If you have 1,000 samples and a batch size of 100:

    • One epoch = 10 batches (because 100 × 10 = 1,000).

  • After each epoch, the model has seen all training data once.

⚙️ Why Multiple Epochs?

  • A single epoch usually isn’t enough for the model to learn meaningful patterns.

  • Training for multiple epochs allows the model to gradually adjust weights and improve accuracy.

  • Too few epochs → underfitting (model hasn’t learned enough).

  • Too many epochs → overfitting (model memorizes training data, performs poorly on unseen data).

📌 Epochs vs. Batches vs. Iterations

  • Batch: a subset of the dataset processed at once (e.g., 32 samples).
  • Iteration: one weight-update step (processing a single batch).
  • Epoch: one full pass through the dataset (all batches processed once).

So:

  • Epochs = how many times the model sees the full dataset.

  • Iterations = how many times weights are updated.

  • Batches = how many samples are processed per iteration.

✅ Example

  • Dataset size = 10,000 samples

  • Batch size = 100

  • Epochs = 5

➡️ Each epoch = 100 iterations (10,000 ÷ 100). ➡️ Total training = 500 iterations (100 × 5).
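
The same arithmetic as a tiny Python sketch (numbers taken from the example above):

dataset_size = 10_000
batch_size = 100
epochs = 5

iterations_per_epoch = dataset_size // batch_size  # 100 weight updates per epoch
total_iterations = iterations_per_epoch * epochs   # 500 weight updates in total
print(iterations_per_epoch, total_iterations)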

In short: Epochs are the number of times the model cycles through the entire dataset during training.

Deep Learning 09: What is a Dense Layer?

 

🧩 What is a Dense Layer?

  • Definition: A dense layer is a type of neural network layer where each neuron receives input from all neurons in the previous layer.

  • Structure:

    • Inputs → multiplied by weights

    • Added to biases

    • Passed through an activation function (e.g., ReLU, sigmoid, softmax)

  • Purpose: Transforms input features into higher-level representations and contributes to decision-making in the network.

⚙️ How Dense Layers Work

  • Mathematical operation: for input vector x, weight matrix W, bias vector b, and activation function f:

    y = f(Wx + b)

    (see the NumPy sketch after this list)

  • Connections: Every neuron in the dense layer has a unique weight for each input, making it highly interconnected.

  • Learning: During training, weights and biases are updated via backpropagation to minimize error.
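
A minimal NumPy sketch of a dense layer with 4 inputs and 3 neurons (the weights and input values here are illustrative):

import numpy as np

def relu(z):
    return np.maximum(0, z)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))         # one weight per (neuron, input) pair
b = np.zeros(3)                     # one bias per neuron
x = np.array([1.0, 2.0, 3.0, 4.0])  # input vector

y = relu(W @ x + b)                 # y = f(Wx + b)
print(y)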

📌 Where Dense Layers Are Used

  • Feedforward Neural Networks: Core building blocks for classification and regression tasks.

  • Convolutional Neural Networks (CNNs): Often appear after convolution + pooling layers to interpret extracted features into final predictions.

  • Recurrent Neural Networks (RNNs): Sometimes used at the output stage to map hidden states to predictions.

✅ Key Characteristics

  • Fully connected: Maximum connectivity between layers.

  • Parameter-heavy: Dense layers can have a large number of parameters, especially with big input sizes.

  • Versatile: Suitable for tasks like image classification, text processing, and tabular data.

  • Trade-off: Powerful but computationally expensive compared to sparse layers.

In short: A dense layer is the “decision-making” part of a neural network, where all inputs interact with all outputs, enabling the model to learn complex patterns.

Deep Learning Interview Question 01: Batch Processing and Weight Updates

If we train a model where batches 1–32 produce a weight of 0.2 and batches 33–64 produce a weight of 0.3, will the model keep using the previously updated weights from the earlier batches, or will it start fresh with new weights for each batch range?

  • Batches 1–32 → weight = 0.2: The model processes these batches, computes gradients, and updates parameters. After this step, the model's weights are no longer the initial ones; they have been adjusted to reflect learning from batches 1–32.

  • Batches 33–64 → weight = 0.3: When the model moves on to the next set of batches, it does not reset to the old weights. It continues from the weights updated after batch 32, and the new batches further refine them.

⚙️ Key Principle

  • In training, the model always uses the latest weights (the ones updated after the previous batch).

  • It does not start fresh for each batch range unless you explicitly reinitialize the model.

  • So in your example, batches 33–64 will be processed using the weights that already include learning from batches 1–32.

📌 Analogy

Think of it like writing a book:

  • After chapters 1–32, you’ve already built the storyline (weights = 0.2).

  • When you write chapters 33–64, you don’t throw away the first half — you continue building on it (weights evolve to 0.3).

Answer: The model will always use the previously updated weights from the last batch. It does not start with a new model per batch unless you explicitly reset or reinitialize it.
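
A toy Python sketch of this behavior. The gradient function is a hypothetical stand-in for real backpropagation, rigged so the weight settles near 0.2 and then 0.3, mirroring the numbers in the question:

def gradient(w, batch):
    # Stand-in for a real gradient: pulls w toward 0.2 for batches 1-32,
    # then toward 0.3 for batches 33-64.
    return (w - 0.2) if batch <= 32 else (w - 0.3)

w = 0.0  # initial weight
for batch in range(1, 65):
    w -= 0.5 * gradient(w, batch)  # each update starts from the *current* w
    if batch in (32, 64):
        print(f"after batch {batch}: w = {w:.3f}")

# Prints w ≈ 0.200 after batch 32 and w ≈ 0.300 after batch 64:
# batch 33 continued from 0.2; it did not restart from the initial weight.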
