Friday, 14 November 2025

Databricks 01 : Interview questions

 

What is a Delta Table?

A Delta Table is a type of table used in Delta Lake, an open-source storage layer that sits on top of data lake storage (such as S3, ADLS, or HDFS) and integrates closely with Apache Spark. It helps manage large datasets by combining the benefits of data lakes (flexibility and scalability) with the reliability and performance of data warehouses.

To put it simply:

  • A Delta Table is a table that supports ACID transactions, meaning it can handle updates, deletes, and inserts in a consistent and reliable way. This makes it more robust than a regular data lake, which usually lacks these features.

  • It supports versioning, so you can track changes to the data over time. This allows you to time travel—you can access previous versions of your data.

  • Optimized performance: Delta Tables are optimized for faster read and write operations through transaction-log metadata and features such as data skipping, file compaction (OPTIMIZE), and Z-ordering.

Think of a Delta Table as a highly reliable, performant, and flexible version of a regular table that works well in big data environments.
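As a quick illustration, versioning and time travel can be queried directly in SQL. This is a minimal sketch; the table name `events` is hypothetical:

```sql
-- Query the table as of an earlier version (hypothetical table name)
SELECT * FROM events VERSION AS OF 3;

-- Or as of a point in time
SELECT * FROM events TIMESTAMP AS OF '2025-11-01';

-- Inspect the change history that makes time travel possible
DESCRIBE HISTORY events;
```

Each write to a Delta Table records a new version in its transaction log, which is what these queries read from.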

Difference between managed and external tables?

A managed table in Databricks means Databricks controls both the data files and metadata. An external table means Databricks only manages the metadata, while the actual data files remain in an external location (like S3, ADLS, or Blob Storage).

🗂️ Managed Tables

  • Storage location: Data is stored inside Databricks’ default warehouse directory (usually dbfs:/user/hive/warehouse).
  • Lifecycle: When you drop a managed table, Databricks deletes both the metadata and the underlying data files.
  • Use case: Best when you want Databricks to fully manage the table lifecycle and don’t need to reuse the data outside Databricks.
  • Creation example:

```python
df.write.saveAsTable("my_managed_table")
```

📦 External Tables

  • Storage location: Data resides in an external path you specify (e.g., dbfs:/mnt/mydata/ or s3://bucket/path).
  • Lifecycle: Dropping the table removes only the metadata; the underlying data files remain intact.
  • Use case: Ideal when data is shared across multiple systems, or you want to retain control of the files outside Databricks.
  • Creation example:

```sql
CREATE TABLE my_external_table
USING DELTA
LOCATION 'dbfs:/mnt/mydata/external_table_path';
```

🔑 Key Differences

| Aspect | Managed Table | External Table |
| --- | --- | --- |
| Data location | Databricks warehouse dir | External path (S3, ADLS, Blob, DBFS mount) |
| Lifecycle | Dropping table deletes data and metadata | Dropping table deletes metadata only |
| Control | Databricks controls everything | User controls data files |
| Best for | Full Databricks-managed workflows | Shared/external datasets |
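One way to check which kind a given table is — a sketch, assuming a table named `my_table` exists in the current schema:

```sql
-- The "Type" row in the output shows MANAGED or EXTERNAL
-- (my_table is a hypothetical table name)
DESCRIBE TABLE EXTENDED my_table;
```

The lifecycle difference in the table above follows from this type: because an external table's files survive a DROP, you can later re-create the table over the same LOCATION without losing data.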

 

