Saturday, 13 December 2025

Data Engineering - Client Interview questions regarding data collection.

  • What is the source of the data?
  • How will the data be extracted from the source?
  • What will the data format be?
  • How often should the data be collected?
  • How should missing data points be handled, and what rules must be applied to them?
  • How will the data be received by the reporting system (warehouse): pull or push?

Sunday, 16 November 2025

Performance Optimization 01: SQL Optimization Without Materializing Data

 

1. Use Proper Indexing (Most Important)

Indexes allow the database to avoid full table scans.

Create indexes on:

  • JOIN keys
  • WHERE clause columns
  • GROUP BY columns
  • ORDER BY columns

Example:

CREATE INDEX idx_orders_customer_id ON orders(customer_id);

CREATE INDEX idx_customer_country ON customer(country);

Key idea:

Indexes let the database filter and join data efficiently without copying or storing anything.

 

2. Rewrite Subqueries as JOINs or EXISTS

Avoid IN (subquery) and correlated subqueries when possible.

Slow

SELECT * FROM orders

WHERE customer_id IN (SELECT id FROM customers WHERE country='US');

Faster (JOIN)

SELECT o.*

FROM orders o

JOIN customers c ON o.customer_id = c.id

WHERE c.country='US';

Also fast (EXISTS)

SELECT *

FROM orders o

WHERE EXISTS (

  SELECT 1 FROM customers c

  WHERE c.id = o.customer_id AND c.country='US'

);

 

3. Choose the Right JOIN Type

Choosing a broader join type than necessary can degrade performance.

Replace:

  • LEFT JOIN → INNER JOIN (if possible)
  • FULL OUTER JOIN → split logic into UNION ALL
  • CROSS JOIN → avoid unless needed

Fewer rows processed = faster queries.
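
For example, on databases without FULL OUTER JOIN support (such as MySQL), the same result can be produced from two cheaper joins. A minimal sketch using the orders/customers tables from the earlier examples:

-- All orders, with their customer where one exists
SELECT o.id, c.name
FROM orders o
LEFT JOIN customers c ON c.id = o.customer_id

UNION ALL

-- Plus the customers that have no matching order (the rows a LEFT JOIN misses)
SELECT o.id, c.name
FROM orders o
RIGHT JOIN customers c ON c.id = o.customer_id
WHERE o.id IS NULL;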

 

4. Push Filters Down Early (Predicate Pushdown)

Apply filters on the smallest dataset first.

Slow

SELECT ...

FROM big_table b

JOIN small_table s ON ...

WHERE s.type = 'X';

Fast

Move predicate to small table before join:

SELECT ...

FROM big_table b

JOIN (SELECT * FROM small_table WHERE type='X') s ON ...

This reduces join workload without materializing the data.

 

5. Avoid Functions on Indexed Columns

This blocks index usage.

Bad

WHERE DATE(created_at) = '2024-01-01'

Good

WHERE created_at >= '2024-01-01'

  AND created_at < '2024-01-02'

 

6. Use Covering Indexes

A covering index contains all columns needed, so the DB doesn't fetch the table.

Example query:

SELECT amount, created_at

FROM orders

WHERE customer_id = 100;

Create covering index:

CREATE INDEX idx_orders_cover ON orders(customer_id, created_at, amount);

The DB can serve the entire query from the index alone
→ faster, with no temporary storage.

 

7. Avoid SELECT *

Only select columns you need.

Bad

SELECT *

FROM orders o

JOIN customers c ON ...

Good

SELECT o.id, o.amount, c.name

FROM orders o

JOIN customers c ON ...

Less data scanned + less data transferred.

 

8. Use LIMIT, Windowing, and Pagination

Avoid scanning large datasets.

Example Pagination:

SELECT * FROM orders

ORDER BY id

LIMIT 50 OFFSET 0;

Avoid OFFSET for large pages:

Use keyset pagination:

SELECT *

FROM orders

WHERE id > last_seen_id

ORDER BY id

LIMIT 50;

 

9. Normalize Query Logic (No Redundant Operations)

Avoid repeating the same subquery multiple times.

Bad

SELECT (SELECT price FROM products WHERE id = o.product_id),

       (SELECT category FROM products WHERE id = o.product_id)

FROM orders o;

Good

SELECT p.price, p.category

FROM orders o

JOIN products p ON p.id = o.product_id;

 

10. Use Database-specific Optimizer Hints (When Needed)

These do not materialize data; they only influence the execution plan.

Examples:

  • MySQL: STRAIGHT_JOIN
  • Oracle: USE_NL, NO_MERGE
  • SQL Server: OPTION (HASH JOIN)
  • Postgres: SET enable_seqscan=off (temporary)

Only use when the optimizer chooses a poor plan.
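
As an illustration, MySQL's STRAIGHT_JOIN forces tables to be joined in the order they are written, which can help when the optimizer picks a bad join order. A sketch using the earlier orders/customers tables:

-- MySQL: join orders to customers in exactly this order
SELECT STRAIGHT_JOIN o.id, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE c.country = 'US';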

 

11. Partitioning (Logical, Not Materialized)

Partitioning does not materialize data; it splits tables for faster scanning.

Use partitioning on:

  • date columns
  • high-cardinality keys

Improves:

  • scanning
  • filtering
  • aggregation

Without storing extra copies of data.
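
A minimal sketch of declarative range partitioning on a date column, using PostgreSQL syntax (other databases differ; the table and column names here are illustrative):

-- Parent table partitioned by created_at; no extra copy of the data is stored
CREATE TABLE orders_p (
    id          bigint,
    customer_id bigint,
    amount      numeric,
    created_at  date
) PARTITION BY RANGE (created_at);

-- Queries filtering on created_at scan only the matching partition
CREATE TABLE orders_p_2024_q1 PARTITION OF orders_p
    FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');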

 

12. Use Window Functions Instead of Self-Joins

Window functions compute aggregates without extra joins.

Slow

SELECT o.*,

       (SELECT SUM(amount) FROM orders WHERE customer_id=o.customer_id)

FROM orders o;

Fast (window)

SELECT o.*,

       SUM(amount) OVER (PARTITION BY customer_id) AS customer_total

FROM orders o;

 

🧠 Summary: Optimization Without Materializing Data

  • Indexes: fast filtering and joining
  • Rewriting subqueries: fewer scans and better execution plans
  • Join optimization: fewer rows processed
  • Predicate pushdown: filters applied early
  • Covering indexes: no table lookups
  • Avoiding functions on indexed columns: index usage preserved
  • Keyset pagination: no large offsets
  • Window functions: no redundant joins
  • Partitioning: faster scans on large datasets

 

Performance Optimization 02: Debouncing with Elasticsearch

 

🔍 Why Debounce with Elasticsearch?

When building search functionalities (like autocomplete, live search, or suggestions), every keystroke can trigger a request to Elasticsearch.

Elasticsearch queries can be:

  • CPU-intensive

  • Heavy on cluster resources

  • Network-expensive

Without debouncing:

  • Typing “smart” could trigger 5 queries: s → sm → sma → smar → smart

  • This generates unnecessary load

  • Can cause UI lag and slow search results

Debouncing solves this by waiting for users to pause typing before sending an Elasticsearch request.

⚙️ How Debouncing Helps with Elasticsearch

Debouncing ensures:

  • Only one request is sent after the user stops typing (e.g., after 300ms)

  • Fewer queries → Faster UI → Less load on Elasticsearch cluster

  • More reliable results (fewer overlapping requests means fewer stale or out-of-order responses)


🧠 Flow Diagram (Concept)

User types → debounce timer resets → waits X ms → No new keystrokes? → Trigger Elasticsearch query → Show results

🧩 Code Implementations

1. JavaScript Frontend Debouncing + Elasticsearch Query (Common Approach)

function debounce(fn, delay) {
  let timer;
  return function (...args) {
    clearTimeout(timer);                                   // restart the timer on every call
    timer = setTimeout(() => fn.apply(this, args), delay); // fire only after the pause
  };
}

async function searchElastic(query) {
  const response = await fetch(`/api/search?q=${encodeURIComponent(query)}`);
  const data = await response.json();
  console.log("Results:", data);
}

// Attach debounce to the input: the search fires only after typing pauses
const debouncedSearch = debounce(searchElastic, 300);
document.getElementById("search-box").addEventListener("input", (e) => {
  debouncedSearch(e.target.value);
});

How it works:

  • The request fires only after typing stops for 300 ms.

  • Great for autocomplete or suggestions.

2. Node.js Backend Debouncing (Less Common but Possible)

If the server receives too many rapid requests (e.g., microservices), you can debounce on the backend:

const express = require("express");
const debounce = require("lodash.debounce");
const { Client } = require("@elastic/elasticsearch");

const app = express();
const client = new Client({ node: "http://localhost:9200" });

// Only the last call within 300 ms actually queries Elasticsearch.
// Caution: the earlier, skipped requests never receive a response.
const performSearch = debounce(async (query, res) => {
  const result = await client.search({
    index: "products",
    query: { match: { name: query } },
  });
  res.json(result.hits.hits);
}, 300);

app.get("/search", (req, res) => {
  performSearch(req.query.q, res);
});

app.listen(3000);

Note: Backend debouncing is only useful in special, controlled scenarios; in general, debouncing belongs in the frontend.

3. React Autocomplete Search (Popular UI Pattern)

import { useState, useCallback } from "react";
import debounce from "lodash.debounce";

function SearchBox() {
  const [results, setResults] = useState([]);

  // Memoize the debounced function so it is created once and survives re-renders
  const searchElastic = useCallback(
    debounce(async (query) => {
      const res = await fetch(`/api/search?q=${encodeURIComponent(query)}`);
      const data = await res.json();
      setResults(data);
    }, 300),
    []
  );

  return (
    <input
      type="text"
      onChange={(e) => searchElastic(e.target.value)}
      placeholder="Search..."
    />
  );
}

🎯 Best Practices for Debouncing with Elasticsearch

✔ 1. Use 250–500 ms debounce delay

Lower delays cause more frequent calls; higher delays hurt UX.

✔ 2. Use Suggesters or Search-as-you-type fields

Elasticsearch features like:

  • completion suggester

  • search_as_you_type

  • edge N-grams

These field types are optimized for prefix-style instant search and pair naturally with UI debouncing; a minimal mapping sketch follows.
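
A minimal mapping sketch for the search_as_you_type field type (assuming the products index used elsewhere in this note):

PUT products
{
  "mappings": {
    "properties": {
      "name": { "type": "search_as_you_type" }
    }
  }
}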

✔ 3. Cache previous responses

If the user repeats queries, return cached results instantly.
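
A minimal in-memory sketch, using a plain Map keyed by the query string (a real app would cap its size or expire entries):

const cache = new Map();

async function cachedSearch(query) {
  if (cache.has(query)) return cache.get(query); // repeated query: instant hit
  const res = await fetch(`/api/search?q=${encodeURIComponent(query)}`);
  const data = await res.json();
  cache.set(query, data);
  return data;
}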

✔ 4. Use async cancellation

If a new query fires, cancel the previous request so a stale response cannot overwrite newer results.
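
A minimal sketch using AbortController to drop the stale in-flight request (assumes the same /api/search endpoint as the earlier examples):

let controller = null;

async function searchElastic(query) {
  if (controller) controller.abort();   // cancel the previous in-flight request
  controller = new AbortController();
  try {
    const response = await fetch(`/api/search?q=${encodeURIComponent(query)}`, {
      signal: controller.signal,
    });
    const data = await response.json();
    console.log("Results:", data);
  } catch (err) {
    if (err.name !== "AbortError") throw err; // aborted requests are expected
  }
}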

🧾 Example: Elasticsearch Query for Autocomplete

GET products/_search
{
  "query": {
    "match_phrase_prefix": {
      "name": "smart"
    }
  }
}

Useful for autocomplete with debounced calls.

Deep Learning 11: What is Dropout?

 What is Dropout?

  • Dropout is a regularization technique used in deep learning to reduce overfitting.
  • During training, it randomly “drops” (sets to 0) a fraction of neurons in a layer.
  • This forces the network to learn more robust patterns instead of relying too heavily on specific neurons.

⚙️ How It Works

  1. Training phase (forward pass):
    • Each time the model processes a batch, dropout randomly deactivates some neurons.
    • Example:
      • Pass 1 → neurons n1, n3, n4 dropped.
      • Pass 2 → neurons n2, n5 dropped.
    • The pattern changes every batch, so the model can’t depend on fixed neurons.
  2. Testing/Inference phase:
    • Dropout is disabled.
    • All neurons are active, and their outputs are scaled so activations match their training-time expectation. (Most frameworks, including Keras, use "inverted dropout", which scales during training instead; see the sketch after this list.)
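
A minimal NumPy sketch of inverted dropout, the variant most frameworks implement, where the rescaling happens during training so inference needs no change:

import numpy as np

def dropout_train(x, rate=0.5, rng=np.random.default_rng(0)):
    # Keep each unit with probability 1 - rate, then rescale the survivors
    # so the expected activation matches the no-dropout value.
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

x = np.ones(10)
print(dropout_train(x, rate=0.5))  # roughly half the entries are 0, the rest are 2.0
# At inference time, inverted dropout simply passes x through unchanged.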

📌 Why Use Dropout?

  • Prevents overfitting (memorizing training data instead of generalizing).
  • Encourages redundancy in feature learning.
  • Improves generalization to unseen data.
  • Simple and effective — often used with rates like 0.2 (20%) or 0.5 (50%).

Example in Keras


from tensorflow.keras.models import Sequential

from tensorflow.keras.layers import Dense, Dropout

 

model = Sequential([

    Dense(128, activation='relu', input_shape=(784,)),

    Dropout(0.5),  # randomly drop 50% of neurons

    Dense(64, activation='relu'),

    Dropout(0.2),  # randomly drop 20% of neurons

    Dense(10, activation='softmax')

])

In short: Dropout is like making your model “forget” parts of itself during training so it learns to be flexible and generalize better.

 

Deep Learning 10: What is an Epoch?

 

🔄 What is an Epoch?

  • An epoch is one complete pass through the entire training dataset by the model.

  • If you have 1,000 samples and a batch size of 100:

    • One epoch = 10 batches (because 100 × 10 = 1,000).

  • After each epoch, the model has seen all training data once.

⚙️ Why Multiple Epochs?

  • A single epoch usually isn’t enough for the model to learn meaningful patterns.

  • Training for multiple epochs allows the model to gradually adjust weights and improve accuracy.

  • Too few epochs → underfitting (model hasn’t learned enough).

  • Too many epochs → overfitting (model memorizes training data, performs poorly on unseen data).

📌 Epochs vs. Batches vs. Iterations

  • Batch: a subset of the dataset processed at once (e.g., 32 samples).
  • Iteration: one weight-update step (processing a single batch).
  • Epoch: one full pass through the dataset (all batches processed once).

So:

  • Epochs = how many times the model sees the full dataset.

  • Iterations = how many times weights are updated.

  • Batches = how many samples are processed per iteration.

✅ Example

  • Dataset size = 10,000 samples

  • Batch size = 100

  • Epochs = 5

➡️ Each epoch = 100 iterations (10,000 ÷ 100). ➡️ Total training = 500 iterations (100 × 5).
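
The same arithmetic as a tiny Python sketch (numbers taken from the example above):

dataset_size = 10_000
batch_size = 100
epochs = 5

iterations_per_epoch = dataset_size // batch_size  # 100 weight updates per epoch
total_iterations = iterations_per_epoch * epochs   # 500 weight updates in total
print(iterations_per_epoch, total_iterations)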

In short: Epochs are the number of times the model cycles through the entire dataset during training.

Deep Learning 09: What is a Dense Layer?

 

🧩 What is a Dense Layer?

  • Definition: A dense layer is a type of neural network layer where each neuron receives input from all neurons in the previous layer.

  • Structure:

    • Inputs → multiplied by weights

    • Added to biases

    • Passed through an activation function (e.g., ReLU, sigmoid, softmax)

  • Purpose: Transforms input features into higher-level representations and contributes to decision-making in the network.

⚙️ How Dense Layers Work

  • Mathematical operation: for input vector x, weight matrix W, bias vector b, and activation function f:

    y = f(Wx + b)

    (see the NumPy sketch after this list)

  • Connections: Every neuron in the dense layer has a unique weight for each input, making it highly interconnected.

  • Learning: During training, weights and biases are updated via backpropagation to minimize error.
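
A minimal NumPy sketch of a dense layer with 4 inputs and 3 neurons (the weights and input values here are illustrative):

import numpy as np

def relu(z):
    return np.maximum(0, z)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))         # one weight per (neuron, input) pair
b = np.zeros(3)                     # one bias per neuron
x = np.array([1.0, 2.0, 3.0, 4.0])  # input vector

y = relu(W @ x + b)                 # y = f(Wx + b)
print(y)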

📌 Where Dense Layers Are Used

  • Feedforward Neural Networks: Core building blocks for classification and regression tasks.

  • Convolutional Neural Networks (CNNs): Often appear after convolution + pooling layers to interpret extracted features into final predictions.

  • Recurrent Neural Networks (RNNs): Sometimes used at the output stage to map hidden states to predictions.

✅ Key Characteristics

  • Fully connected: Maximum connectivity between layers.

  • Parameter-heavy: Dense layers can have a large number of parameters, especially with big input sizes.

  • Versatile: Suitable for tasks like image classification, text processing, and tabular data.

  • Trade-off: Powerful but computationally expensive compared to sparse layers.

In short: A dense layer is the “decision-making” part of a neural network, where all inputs interact with all outputs, enabling the model to learn complex patterns.

Deep Learning Interview Question 01: Batch Processing and Weight Updates

If we train a model where batches 1–32 produce a weight of 0.2 and batches 33–64 produce a weight of 0.3, will the model keep using the previously updated weights from the earlier batches, or will it start fresh with new weights for each batch range?

  • Batches 1–32 → weight = 0.2: The model processes these batches, computes gradients, and updates parameters. After this step, the model's weights are no longer the initial ones; they have been adjusted to reflect learning from batches 1–32.

  • Batches 33–64 → weight = 0.3: When the model moves on to the next set of batches, it does not reset to the old weights. It continues from the weights updated after batch 32, and the new batches further refine them.

⚙️ Key Principle

  • In training, the model always uses the latest weights (the ones updated after the previous batch).

  • It does not start fresh for each batch range unless you explicitly reinitialize the model.

  • So in your example, batches 33–64 will be processed using the weights that already include learning from batches 1–32.

📌 Analogy

Think of it like writing a book:

  • After chapters 1–32, you’ve already built the storyline (weights = 0.2).

  • When you write chapters 33–64, you don’t throw away the first half — you continue building on it (weights evolve to 0.3).

Answer: The model will always use the previously updated weights from the last batch. It does not start with a new model per batch unless you explicitly reset or reinitialize it.
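
A toy Python sketch of this behavior. The gradient function is a hypothetical stand-in for real backpropagation, rigged so the weight settles near 0.2 and then 0.3, mirroring the numbers in the question:

def gradient(w, batch):
    # Stand-in for a real gradient: pulls w toward 0.2 for batches 1-32,
    # then toward 0.3 for batches 33-64.
    return (w - 0.2) if batch <= 32 else (w - 0.3)

w = 0.0  # initial weight
for batch in range(1, 65):
    w -= 0.5 * gradient(w, batch)  # each update starts from the *current* w
    if batch in (32, 64):
        print(f"after batch {batch}: w = {w:.3f}")

# Prints w ≈ 0.200 after batch 32 and w ≈ 0.300 after batch 64:
# batch 33 continued from 0.2; it did not restart from the initial weight.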
