Friday, 14 November 2025

Databricks 01 : Interview questions

 

What is a Delta Table?

A Delta Table is a type of table used in Delta Lake, an open-source storage layer that sits on top of data lake storage (such as S3, ADLS, or HDFS) and integrates closely with Apache Spark. It helps manage large datasets by combining the benefits of data lakes (flexibility and scalability) with the reliability and performance of data warehouses.

To put it simply:

  • A Delta Table is a table that supports ACID transactions, meaning it can handle updates, deletes, and inserts in a consistent and reliable way. This makes it more robust than a regular data lake, which usually lacks these features.

  • It supports versioning, so you can track changes to the data over time. This allows you to time travel—you can access previous versions of your data.

  • Optimized performance: Delta Tables are optimized for faster read and write operations through transaction-log metadata and features such as data skipping, file compaction (OPTIMIZE), and Z-ordering.

Think of a Delta Table as a highly reliable, performant, and flexible version of a regular table that works well in big data environments.
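As a quick illustration, versioning and time travel can be queried directly in SQL. This is a minimal sketch; the table name `events` is hypothetical:

```sql
-- Query the table as of an earlier version (hypothetical table name)
SELECT * FROM events VERSION AS OF 3;

-- Or as of a point in time
SELECT * FROM events TIMESTAMP AS OF '2025-11-01';

-- Inspect the change history that makes time travel possible
DESCRIBE HISTORY events;
```

Each write to a Delta Table records a new version in its transaction log, which is what these queries read from.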

Difference between managed and external tables?

A managed table in Databricks means Databricks controls both the data files and metadata. An external table means Databricks only manages the metadata, while the actual data files remain in an external location (like S3, ADLS, or Blob Storage).

🗂️ Managed Tables

  • Storage location: Data is stored inside Databricks’ default warehouse directory (usually dbfs:/user/hive/warehouse).
  • Lifecycle: When you drop a managed table, Databricks deletes both the metadata and the underlying data files.
  • Use case: Best when you want Databricks to fully manage the table lifecycle and don’t need to reuse the data outside Databricks.
  • Creation example:

```python
df.write.saveAsTable("my_managed_table")
```

📦 External Tables

  • Storage location: Data resides in an external path you specify (e.g., dbfs:/mnt/mydata/ or s3://bucket/path).
  • Lifecycle: Dropping the table removes only the metadata; the underlying data files remain intact.
  • Use case: Ideal when data is shared across multiple systems, or you want to retain control of the files outside Databricks.
  • Creation example:

```sql
CREATE TABLE my_external_table
USING DELTA
LOCATION 'dbfs:/mnt/mydata/external_table_path';
```

🔑 Key Differences

| Aspect | Managed Table | External Table |
| --- | --- | --- |
| Data location | Databricks warehouse dir | External path (S3, ADLS, Blob, DBFS mount) |
| Lifecycle | Dropping table deletes data and metadata | Dropping table deletes metadata only |
| Control | Databricks controls everything | User controls data files |
| Best for | Full Databricks-managed workflows | Shared/external datasets |
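One way to check which kind a given table is — a sketch, assuming a table named `my_table` exists in the current schema:

```sql
-- The "Type" row in the output shows MANAGED or EXTERNAL
-- (my_table is a hypothetical table name)
DESCRIBE TABLE EXTENDED my_table;
```

The lifecycle difference in the table above follows from this type: because an external table's files survive a DROP, you can later re-create the table over the same LOCATION without losing data.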

 

