Saturday, 15 November 2025

Snowflake 03: Use a catalog-linked database for Apache Iceberg tables

 

 🧊 First, what is Apache Iceberg?

Apache Iceberg is an open table format. Think of it as a smart filing system for big data tables.
It keeps track of each table's data files, versions, and snapshots, so querying data is fast and reliable.
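
To make the "versions and snapshots" idea concrete, here is a minimal Python sketch. It is a toy model, not Iceberg's real API: the `ToyTable` class and the file names are made up for illustration. Each commit records a new snapshot listing the data files that make up the table at that point, so older versions stay readable.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyTable:
    """Toy stand-in for an Iceberg table (hypothetical, not the real API)."""
    snapshots: list = field(default_factory=list)  # one tuple of files per snapshot

    def commit(self, *new_files: str) -> int:
        """Add data files, record a new snapshot, and return its id."""
        current = self.snapshots[-1] if self.snapshots else ()
        self.snapshots.append(current + new_files)
        return len(self.snapshots) - 1

    def files(self, snapshot_id: Optional[int] = None) -> tuple:
        """List the files in a given snapshot (the latest one by default)."""
        if not self.snapshots:
            return ()
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1
        return self.snapshots[snapshot_id]


table = ToyTable()
first = table.commit("part-0001.parquet")   # made-up file name
second = table.commit("part-0002.parquet")  # made-up file name

print(table.files())        # latest snapshot: both files
print(table.files(first))   # older snapshot: only the first file
```

Because old snapshots are never overwritten in this sketch, reading an earlier version is just a lookup by snapshot id.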

🗂️ What’s a Catalog in Iceberg?

Think of the catalog as the master notebook where Iceberg writes:

  • Where your tables are stored
  • What files belong to each table
  • The schema
  • The table versions
  • The metadata

Examples of catalogs: AWS Glue, Hive Metastore, Nessie, REST Catalog, Snowflake, etc.
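
As a rough illustration of the notebook idea, the sketch below models a catalog as a plain Python dictionary from table name to table details. The `TableEntry` and `ToyCatalog` names, the `s3://example-bucket/...` path, and the fields are all hypothetical. Real catalogs such as AWS Glue or a REST catalog expose their own APIs.

```python
from dataclasses import dataclass, field


@dataclass
class TableEntry:
    """What the toy catalog remembers about one table (hypothetical fields)."""
    location: str                                  # where the table is stored
    schema: dict                                   # column name -> type
    versions: list = field(default_factory=list)   # table metadata versions


class ToyCatalog:
    """Toy 'master notebook': table name -> TableEntry."""

    def __init__(self) -> None:
        self._tables = {}

    def register(self, name: str, entry: TableEntry) -> None:
        """Write a new table into the notebook; refuse duplicates."""
        if name in self._tables:
            raise ValueError(f"table already registered: {name}")
        self._tables[name] = entry

    def load(self, name: str) -> TableEntry:
        """Look up everything the notebook knows about one table."""
        return self._tables[name]

    def list_tables(self) -> list:
        return sorted(self._tables)


catalog = ToyCatalog()
catalog.register(
    "sales.orders",
    TableEntry(
        location="s3://example-bucket/sales/orders",  # made-up path
        schema={"order_id": "long", "amount": "double"},
        versions=["v1.metadata.json"],
    ),
)

print(catalog.list_tables())
print(catalog.load("sales.orders").location)
```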

🏷️ What is a catalog-linked database?

Imagine you want to organize your toys in boxes.
You don’t write directly on the toy box; instead, you write in a notebook:

  • Box 1 → Cars
  • Box 2 → Legos
  • Box 3 → Action figures

In Iceberg, the catalog-linked database is this organized grouping inside the catalog.

It means:

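A database in your query engine (Snowflake in this post; Spark and Flink have a similar idea when you configure a catalog) is connected to an Iceberg catalog. Tables that already exist in that catalog show up in the database, and tables you create inside the database are automatically registered as Iceberg tables managed by that catalog.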

Think of it like this:

  • The catalog = a big library system.
  • A catalog-linked database = a section in that library, like “Kids Books”.
  • Iceberg tables = the actual books.

When you create a table in that database (section):

📘 → It is automatically registered in the catalog (library system)
📚 → Iceberg manages how the data files are stored
🗂️ → Everything stays organized

So instead of manually registering every book in the library system yourself,
you let the catalog-linked database take care of that automatically.
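
The library picture can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Snowflake's or Iceberg's actual API: `LibraryCatalog`, `LinkedDatabase`, and the table names are invented for illustration. The point is that the link to the catalog is set up once, and every table created afterwards is registered without any extra settings.

```python
class LibraryCatalog:
    """Toy catalog: remembers which tables exist in each section."""

    def __init__(self) -> None:
        self.tables = {}  # section name -> set of table names

    def register(self, section: str, table: str) -> None:
        self.tables.setdefault(section, set()).add(table)


class LinkedDatabase:
    """Toy database that is bound to one catalog when it is created."""

    def __init__(self, name: str, catalog: LibraryCatalog) -> None:
        self.name = name
        self.catalog = catalog  # configured once, reused for every table

    def create_table(self, table: str) -> str:
        # No catalog settings needed here: registration happens automatically.
        self.catalog.register(self.name, table)
        return f"{self.name}.{table}"

    def show_tables(self) -> list:
        # The database simply reflects what the catalog knows.
        return sorted(self.catalog.tables.get(self.name, set()))


library = LibraryCatalog()
kids_books = LinkedDatabase("kids_books", library)

kids_books.create_table("picture_books")
kids_books.create_table("fairy_tales")

# A table registered directly in the catalog also shows up in the database.
library.register("kids_books", "comics")

print(kids_books.show_tables())
```

In this sketch the database holds no table list of its own. It always asks the catalog, which is why the catalog and the database cannot drift apart.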

🧐 Why do people use catalog-linked databases?

Because:

  • You don’t have to specify catalog settings every time
  • All tables in that database are Iceberg tables by default
  • Easier to organize tables
  • Cleaner project structure
  • Less code and fewer mistakes

 
