- What is the source of the data?
- How will the data be extracted from the source?
- What will the data format be?
- How often should the data be collected?
- How should missing data points be handled? What rules must be applied to missing data points?
- How will the data be received by the reporting system (warehouse): a pull or a push approach?
Technical Explanation Made Simple
Saturday, 13 December 2025
Data Engineering - Client Interview question regarding data collection.
Sunday, 16 November 2025
Performance Optimization 01: SQL Optimization without Materializing Data
1. Use Proper Indexing (Most Important)
Indexes allow the database to avoid full table scans.
Create indexes on:
- JOIN keys
- WHERE clause columns
- GROUP BY columns
- ORDER BY columns
Example:
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
CREATE INDEX idx_customer_country ON customer(country);
Key idea:
Indexes let the database filter and join data efficiently
without copying or storing anything.
2. Rewrite Subqueries as JOINs or EXISTS
Avoid IN (subquery) and correlated subqueries when possible.
❌ Slow
SELECT * FROM orders
WHERE customer_id IN (SELECT id FROM customers WHERE country='US');
✅ Faster (JOIN)
SELECT o.*
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country='US';
✅ Also fast (EXISTS)
SELECT *
FROM orders o
WHERE EXISTS (
    SELECT 1 FROM customers c
    WHERE c.id = o.customer_id AND c.country='US'
);
3. Choose the Right JOIN Type
Unnecessary join types can degrade performance.
Replace:
- LEFT JOIN → INNER JOIN (if possible)
- FULL OUTER JOIN → split logic into UNION ALL
- CROSS JOIN → avoid unless needed
Fewer rows processed = faster queries.
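As a runnable illustration of the LEFT JOIN → INNER JOIN rewrite, here is a minimal sketch using Python's sqlite3 with hypothetical customers/orders tables: when the WHERE clause filters on a right-table column, a LEFT JOIN returns exactly the same rows as an INNER JOIN, so the cheaper join type is safe.

```python
import sqlite3

# In-memory database with hypothetical tables (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'US'), (2, 'DE');
    INSERT INTO orders VALUES (10, 1), (11, 2), (12, NULL);
""")

# LEFT JOIN, but the WHERE clause filters on the right table:
left = conn.execute("""
    SELECT o.id FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.id
    WHERE c.country = 'US'
""").fetchall()

# Equivalent INNER JOIN -- NULL-extended rows can never pass the filter:
inner = conn.execute("""
    SELECT o.id FROM orders o
    JOIN customers c ON o.customer_id = c.id
    WHERE c.country = 'US'
""").fetchall()

print(left == inner)  # True: both return [(10,)]
```

The order with a NULL customer and the non-US order are filtered out either way, which is why the optimizer-friendly INNER JOIN is a valid replacement here.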
4. Push Filters Down Early (Predicate Pushdown)
Apply filters on the smallest dataset first.
❌ Slow
SELECT ...
FROM big_table b
JOIN small_table s ON ...
WHERE s.type = 'X';
✅ Fast
Move predicate to small table before join:
SELECT ...
FROM big_table b
JOIN (SELECT * FROM small_table WHERE type='X') s ON ...
This reduces join workload without materializing the data.
5. Avoid Functions on Indexed Columns
Wrapping an indexed column in a function prevents the database from using the index.
❌ Bad
WHERE DATE(created_at) = '2024-01-01'
✅ Good
WHERE created_at >= '2024-01-01'
AND created_at < '2024-01-02'
6. Use Covering Indexes
A covering index contains all columns needed, so the
DB doesn't fetch the table.
Example query:
SELECT amount, created_at
FROM orders
WHERE customer_id = 100;
Create covering index:
CREATE INDEX idx_orders_cover ON orders(customer_id, created_at, amount);
The DB can serve the entire query from the index only
→ faster, no temp storing.
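To see a covering index in action, here is a minimal sketch using Python's sqlite3 (the table and index names follow the example above); SQLite's EXPLAIN QUERY PLAN reports "COVERING INDEX" when the query can be answered from the index alone, with no table lookup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         created_at TEXT, amount REAL);
    CREATE INDEX idx_orders_cover ON orders(customer_id, created_at, amount);
""")

# Ask SQLite how it would execute the query from the example above.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT amount, created_at FROM orders WHERE customer_id = 100
""").fetchall()

for row in plan:
    print(row[-1])  # the detail column mentions the COVERING INDEX
```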
7. Avoid SELECT *
Only select columns you need.
❌ Bad
SELECT *
FROM orders o
JOIN customers c ON ...
✅ Good
SELECT o.id, o.amount, c.name
FROM orders o
JOIN customers c ON ...
Less data scanned + less data transferred.
8. Use LIMIT, Windowing, and Pagination
Avoid scanning large datasets.
Example Pagination:
SELECT * FROM orders
ORDER BY id
LIMIT 50 OFFSET 0;
For deep pages, avoid OFFSET and use keyset pagination instead:
SELECT *
FROM orders
WHERE id > last_seen_id
ORDER BY id
LIMIT 50;
9. Normalize Query Logic (No Redundant Operations)
Avoid repeating the same subquery multiple times.
❌ Bad
SELECT (SELECT price FROM products WHERE id = o.product_id),
       (SELECT category FROM products WHERE id = o.product_id)
FROM orders o;
✅ Good
SELECT p.price, p.category
FROM orders o
JOIN products p ON p.id = o.product_id;
10. Use Database-Specific Optimizer Hints (When Needed)
These do not materialize data; they influence execution
plan.
Examples:
- MySQL: STRAIGHT_JOIN
- Oracle: USE_NL, NO_MERGE
- SQL Server: OPTION (HASH JOIN)
- Postgres: SET enable_seqscan = off (temporary)
Only use when the optimizer chooses a poor plan.
11. Partitioning (Logical, Not Materialized)
Partitioning does not materialize data; it splits tables for
faster scanning.
Use partitioning on:
- date columns
- high-cardinality keys
Improves:
- scanning
- filtering
- aggregation
Without storing extra copies of data.
12. Use Window Functions Instead of Self-Joins
Window functions compute aggregates without extra joins.
❌ Slow
SELECT o.*,
       (SELECT SUM(amount) FROM orders WHERE customer_id = o.customer_id)
FROM orders o;
✅ Fast (window)
SELECT o.*,
       SUM(amount) OVER (PARTITION BY customer_id) AS customer_total
FROM orders o;
Summary: Optimization Without Materializing Data

| Technique | Benefit |
|---|---|
| Indexes | Fast filtering and joining |
| Rewriting subqueries | Fewer scans and better execution plans |
| Join optimization | Process fewer rows |
| Predicate pushdown | Filter early |
| Covering indexes | Avoid table lookups |
| Avoiding functions on indexed columns | Enable index usage |
| Keyset pagination | Avoid large offsets |
| Window functions | Avoid redundant self-joins |
| Partitioning | Faster scans on large datasets |
Performance Optimization 01: Debouncing with Elasticsearch
Why Debounce with Elasticsearch?
When building search functionalities (like autocomplete, live search, or suggestions), every keystroke can trigger a request to Elasticsearch.
Elasticsearch queries can be:
- CPU-intensive
- Heavy on cluster resources
- Network-expensive
Without debouncing:
- Typing “smart” could trigger 5 queries: s → sm → sma → smar → smart
- This generates unnecessary load
- It can cause UI lag and slow search results
Debouncing solves this by waiting for users to pause typing before sending an Elasticsearch request.
⚙️ How Debouncing Helps with Elasticsearch
Debouncing ensures:
- Only one request is sent after the user stops typing (e.g., after 300 ms)
- Fewer queries → faster UI → less load on the Elasticsearch cluster
- Better relevance and reliability in search results
Flow Diagram (Concept)
User types → debounce timer resets → waits X ms →
No new keystrokes? → Trigger Elasticsearch query → Show results
Code Implementations
1. JavaScript Frontend Debouncing + Elasticsearch Query (Common Approach)
function debounce(fn, delay) {
let timer;
return function(...args) {
clearTimeout(timer);
timer = setTimeout(() => fn.apply(this, args), delay);
};
}
async function searchElastic(query) {
const response = await fetch(`/api/search?q=${encodeURIComponent(query)}`);
const data = await response.json();
console.log("Results:", data);
}
// Attach debounce to input
const debouncedSearch = debounce(searchElastic, 300);
document.getElementById("search-box").addEventListener("input", (e) => {
debouncedSearch(e.target.value);
});
How it works:
- The request fires only after typing stops for 300 ms.
- Great for autocomplete or suggestions.
2. Node.js Backend Debouncing (Less Common but Possible)
If the server receives too many rapid requests (e.g., microservices), you can debounce on the backend:
const express = require("express");
const debounce = require("lodash.debounce");
const { Client } = require("@elastic/elasticsearch");

const app = express();
const client = new Client({ node: "http://localhost:9200" });
// Calls superseded within the 300 ms window are dropped, so their `res`
// objects never receive a response; acceptable only in controlled setups.
const performSearch = debounce(async (query, res) => {
const result = await client.search({
index: "products",
query: {
match: { name: query }
}
});
res.json(result.hits.hits);
}, 300);
app.get("/search", (req, res) => {
performSearch(req.query.q, res);
});
Note: Backend debouncing is only useful in special, controlled scenarios; in general, debouncing belongs in the frontend.
3. React Autocomplete Search (Popular UI Pattern)
import { useState, useCallback } from "react";
import debounce from "lodash.debounce";
function SearchBox() {
const [results, setResults] = useState([]);
const searchElastic = useCallback(
debounce(async (query) => {
const res = await fetch(`/api/search?q=${encodeURIComponent(query)}`);
const data = await res.json();
setResults(data);
}, 300),
[]
);
return (
<input
type="text"
onChange={(e) => searchElastic(e.target.value)}
placeholder="Search..."
/>
);
}
Best Practices for Debouncing with Elasticsearch
✔ 1. Use 250–500 ms debounce delay
Lower delays cause more frequent calls; higher delays hurt UX.
✔ 2. Use Suggesters or Search-as-you-type fields
Elasticsearch features like:
- completion suggester
- search_as_you_type
- edge N-grams
These are optimized for instant queries with UI debouncing.
✔ 3. Cache previous responses
If the user repeats queries, return cached results instantly.
✔ 4. Use async cancellation
If a new query fires, cancel the previous promise to avoid race conditions.
Example: Elasticsearch Query for Autocomplete
GET products/_search
{
"query": {
"match_phrase_prefix": {
"name": "smart"
}
}
}
Useful for autocomplete with debounced calls.
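Best practice 4 (async cancellation) can also be sketched on the backend. Below is a minimal Python asyncio analogue (in the browser you would typically cancel an in-flight fetch with an AbortController); `fake_search` is a stand-in for a real Elasticsearch call, and the timings are illustrative.

```python
import asyncio

results = []

async def fake_search(query, delay=0.05):
    # Pretend network + Elasticsearch latency, then record the result.
    await asyncio.sleep(delay)
    results.append(query)

async def main():
    task = None
    for keystroke in ["s", "sm", "sma", "smar", "smart"]:
        if task is not None:
            task.cancel()                       # cancel the pending search
        task = asyncio.create_task(fake_search(keystroke))
        await asyncio.sleep(0.01)               # user types faster than the delay
    await task                                  # let the final search finish

asyncio.run(main())
print(results)  # only the last query ran: ['smart']
```

Each keystroke cancels the previous pending task, so intermediate queries never reach the (fake) search backend, which is exactly the race-condition protection described above.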
Deep Learning 11: What is Dropout?
What is Dropout?
- Dropout is a regularization technique used in deep learning to reduce overfitting.
- During training, it randomly “drops” (sets to 0) a fraction of neurons in a layer.
- This forces the network to learn more robust patterns instead of relying too heavily on specific neurons.
⚙️ How It Works
- Training phase (forward pass):
  - Each time the model processes a batch, dropout randomly deactivates some neurons.
  - Example:
    - Pass 1 → neurons n1, n3, n4 dropped.
    - Pass 2 → neurons n2, n5 dropped.
  - The pattern changes every batch, so the model can’t depend on fixed neurons.
- Testing/inference phase:
  - Dropout is disabled.
  - All neurons are active, but their outputs are scaled to account for dropout during training.
Why Use Dropout?
- Prevents overfitting (memorizing training data instead of generalizing).
- Encourages redundancy in feature learning.
- Improves generalization to unseen data.
- Simple and effective; often used with rates like 0.2 (20%) or 0.5 (50%).
✅ Example in Keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dense(128, activation='relu', input_shape=(784,)),
    Dropout(0.5),   # randomly drop 50% of neurons
    Dense(64, activation='relu'),
    Dropout(0.2),   # randomly drop 20% of neurons
    Dense(10, activation='softmax')
])
In short: Dropout is like making your
model “forget” parts of itself during training so it learns to be flexible and
generalize better.
Deep Learning 10: What is an Epoch?
What is an Epoch?
An epoch is one complete pass through the entire training dataset by the model.
If you have 1,000 samples and a batch size of 100:
One epoch = 10 batches (because 100 × 10 = 1,000).
After each epoch, the model has seen all training data once.
⚙️ Why Multiple Epochs?
A single epoch usually isn’t enough for the model to learn meaningful patterns.
Training for multiple epochs allows the model to gradually adjust weights and improve accuracy.
Too few epochs → underfitting (model hasn’t learned enough).
Too many epochs → overfitting (model memorizes training data, performs poorly on unseen data).
Epochs vs. Batches vs. Iterations
| Term | Meaning |
|---|---|
| Batch | Subset of the dataset processed at once (e.g., 32 samples). |
| Iteration | One update step of weights (processing a single batch). |
| Epoch | One full pass through the dataset (all batches processed once). |
So:
Epochs = how many times the model sees the full dataset.
Iterations = how many times weights are updated.
Batches = how many samples are processed per iteration.
✅ Example
Dataset size = 10,000 samples
Batch size = 100
Epochs = 5
➡️ Each epoch = 100 iterations (10,000 ÷ 100).
➡️ Total training = 500 iterations (100 × 5).
In short: Epochs are the number of times the model cycles through the entire dataset during training.
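The arithmetic above can be checked with a short helper; `training_steps` is a hypothetical name, and `math.ceil` handles the case where the last batch is partial.

```python
import math

def training_steps(num_samples, batch_size, epochs):
    # Iterations per epoch = number of batches needed to cover the dataset.
    iterations_per_epoch = math.ceil(num_samples / batch_size)
    # Total weight updates = iterations per epoch times number of epochs.
    return iterations_per_epoch, iterations_per_epoch * epochs

per_epoch, total = training_steps(10_000, 100, 5)
print(per_epoch, total)  # 100 500
```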
Deep Learning 09: What is a Dense Layer?
What is a Dense Layer?
Definition: A dense layer is a type of neural network layer where each neuron receives input from all neurons in the previous layer.
Structure:
Inputs → multiplied by weights
Added to biases
Passed through an activation function (e.g., ReLU, sigmoid, softmax)
Purpose: Transforms input features into higher-level representations and contributes to decision-making in the network.
⚙️ How Dense Layers Work
Mathematical operation: for an input vector x, weight matrix W, bias vector b, and activation function f, the output is y = f(Wx + b).
Connections: Every neuron in the dense layer has a unique weight for each input, making it highly interconnected.
Learning: During training, weights and biases are updated via backpropagation to minimize error.
Where Dense Layers Are Used
Feedforward Neural Networks: Core building blocks for classification and regression tasks.
Convolutional Neural Networks (CNNs): Often appear after convolution + pooling layers to interpret extracted features into final predictions.
Recurrent Neural Networks (RNNs): Sometimes used at the output stage to map hidden states to predictions.
✅ Key Characteristics
Fully connected: Maximum connectivity between layers.
Parameter-heavy: Dense layers can have a large number of parameters, especially with big input sizes.
Versatile: Suitable for tasks like image classification, text processing, and tabular data.
Trade-off: Powerful but computationally expensive compared to sparse layers.
In short: A dense layer is the “decision-making” part of a neural network, where all inputs interact with all outputs, enabling the model to learn complex patterns.
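The forward pass described above can be sketched in a few lines of NumPy; the sizes (4 inputs, 3 neurons) and the ReLU activation are illustrative choices, and the shapes are the point, not the random values.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=4)        # input vector (4 features)
W = rng.normal(size=(3, 4))   # one weight row per neuron: every neuron
                              # is connected to every input
b = np.zeros(3)               # one bias per neuron

def relu(z):
    # Activation function: keeps positive values, zeroes out the rest.
    return np.maximum(0, z)

y = relu(W @ x + b)           # dense layer forward pass: y = f(Wx + b)
print(y.shape)  # (3,) -- one activation per neuron
```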
Deep Learning Interview Question 01: Batch Processing and Weight Updates
If we train a model in batches, where batches 1–32 result in a weight of 0.2 and batches 33–64 result in a weight of 0.3, will the model continue using the previously updated weights from the earlier batches, or will it start fresh with new weights for each batch range?
Batches 1–32 → weight = 0.2: The model processes these batches, computes gradients, and updates parameters. After this step, the model’s weights are no longer the initial ones; they’ve been adjusted to reflect learning from batches 1–32.
Batches 33–64 → weight = 0.3: When the model moves to the next set of batches, it does not reset to the old weights. Instead, it continues from the updated weights after batch 32. The new batches further refine the parameters.
⚙️ Key Principle
In training, the model always uses the latest weights (the ones updated after the previous batch).
It does not start fresh for each batch range unless you explicitly reinitialize the model.
So in your example, batches 33–64 will be processed using the weights that already include learning from batches 1–32.
Analogy
Think of it like writing a book:
After chapters 1–32, you’ve already built the storyline (weights = 0.2).
When you write chapters 33–64, you don’t throw away the first half — you continue building on it (weights evolve to 0.3).
✅ Answer: The model will always use the previously updated weights from the last batch. It does not start with a new model per batch unless you explicitly reset or reinitialize it.
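The answer can be demonstrated with a toy SGD loop; the 1-D model, squared loss, and learning rate below are hypothetical choices made only to show that the second batch range starts from the weight left by the first.

```python
def sgd(weight, batches, lr=0.1, target=1.0):
    # Minimize (weight * x - target)^2 one batch at a time.
    history = []
    for x in batches:
        grad = 2 * (weight * x - target) * x   # gradient of the squared loss
        weight -= lr * grad                    # update continues from the current weight
        history.append(weight)
    return weight, history

w_after_first, _ = sgd(0.0, [1.0] * 32)                 # batches 1-32
w_after_second, hist = sgd(w_after_first, [1.0] * 32)   # batches 33-64

# The first update of the second range starts from w_after_first,
# not from a freshly initialized weight of 0.0:
print(hist[0] != 0.2)  # True: 0.2 is what a reset weight of 0.0 would give
```

Because `sgd` is called with `w_after_first`, the second batch range refines the already-learned weight, mirroring how batches 33–64 build on the state left after batch 32.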