As we discussed in our previous Delta Lake blog, several table formats are already in use, each highly capable and with its own benefits. Iceberg is one of them, and it is the focus of this blog.
What is Apache Iceberg?
Apache Iceberg is an open-source table format for handling large amounts of data stored locally or on various cloud storage platforms. Netflix originally developed Iceberg to solve its big data problems, then donated it to the Apache Software Foundation, where it became open source in 2018. Iceberg now has a large number of contributors around the world on GitHub and is one of the most widely used table formats.
Iceberg addresses the key problems we once faced when using the Hive table format with data stored on cloud storage such as S3.
Iceberg tables offer features and capabilities similar to SQL tables. Because it is open source, multiple engines such as Spark can operate on it to perform transformations and other operations, and it provides full ACID guarantees. This blog is a quick introduction to Iceberg, covering its features and initial setup.
Why go with Iceberg?
The main reason to use Iceberg is that it performs better when we need to load data, or metadata, from cloud storage such as S3. Unlike Hive, which tracks data at the folder level and can therefore lose performance, Iceberg tracks data at the file level; that is why we choose Iceberg. Here is the hierarchy Iceberg uses when saving data into its tables. Each Iceberg table is a combination of four kinds of files: the snapshot metadata file, the manifest list, manifest files, and data files (a short sketch for inspecting these layers follows the list).
Snapshot Metadata File: This file holds the metadata information about the table, such as the schema, partitions, and manifest list.
Manifest List: This list records each manifest file along with the path and metadata information. At this point, Iceberg decides which manifest files to ignore and which to read.
Manifest File: This file contains the paths to the actual data files, along with per-file metadata such as partition values and column statistics.
Data File: The actual data files, stored in formats such as Parquet, ORC, or Avro, which hold the real data.
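These layers can be inspected directly, because Iceberg exposes them as queryable metadata tables. Below is a minimal sketch, assuming a Spark session (spark) already configured with an Iceberg catalog; the catalog name my_catalog and the table db.events are hypothetical names.

```python
# Minimal sketch: inspecting Iceberg's metadata layers via Spark SQL.
# "my_catalog" and "db.events" are hypothetical names; replace them with your own.

# Snapshots: one row per table snapshot, each pointing at a manifest list file.
spark.sql(
    "SELECT snapshot_id, committed_at, manifest_list "
    "FROM my_catalog.db.events.snapshots"
).show(truncate=False)

# Manifests: the manifest files referenced by the current snapshot.
spark.sql(
    "SELECT path, added_data_files_count FROM my_catalog.db.events.manifests"
).show(truncate=False)

# Files: the actual Parquet/ORC/Avro data files tracked by the manifests.
spark.sql(
    "SELECT file_path, record_count FROM my_catalog.db.events.files"
).show(truncate=False)
```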
Features of Iceberg:
Some Iceberg features include:
Schema Evolution: Iceberg allows you to evolve your schema without having to rewrite your data. This means you can easily add, drop, or rename columns, providing flexibility to adapt to changing data requirements without impacting existing queries.
Partition Evolution: Iceberg supports partition evolution, enabling you to modify the partitioning scheme as your data and query patterns evolve. This feature helps maintain query performance and optimize data layout over time.
Time Travel: Iceberg’s time travel feature allows you to query historical versions of your data. This is particularly useful for debugging, auditing, and recreating analyses based on past data states (see the short sketch after this feature list).
Multiple Query Engine Support: Iceberg supports multiple query engines, including Trino, Presto, Hive, and Amazon Athena. This interoperability ensures that you can read and write data across different tools seamlessly, facilitating a more versatile and integrated data ecosystem.
AWS Support: Iceberg is well-integrated with AWS services, making it easy to use with Amazon S3 for storage and other AWS analytics services. This integration helps leverage the scalability and reliability of AWS infrastructure for your data lake.
ACID Compliance: Iceberg ensures ACID (Atomicity, Consistency, Isolation, Durability) transactions, providing reliable data consistency and integrity. This makes it suitable for complex data operations and concurrent workloads, ensuring data reliability and accuracy.
Hidden Partitioning: Iceberg’s hidden partitioning abstracts the complexity of managing partitions from the user, automatically handling partition management to improve query performance without manual intervention.
Snapshot Isolation: Iceberg supports snapshot isolation, enabling concurrent read and write operations without conflicts. This isolation ensures that users can work with consistent views of the data, even as it is being updated.
Support for Large Tables: Designed for high scalability, Iceberg can efficiently handle petabyte-scale tables, making it ideal for large datasets typical in big data environments.
Compatibility with Modern Data Lakes: Iceberg’s design is tailored for modern data lake architectures, supporting efficient data organization, metadata management, and performance optimization, aligning well with contemporary data management practices.
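As a quick taste of two of these features, the sketch below adds a column without rewriting any data and then reads the table as of an earlier snapshot. The table and column names are hypothetical, and the VERSION AS OF syntax assumes Spark 3.3+ with the Iceberg SQL extensions enabled.

```python
# Minimal sketch of schema evolution and time travel.
# "my_catalog.db.events" and the "country" column are hypothetical names.

# Schema evolution: add a column; existing data files are not rewritten.
spark.sql("ALTER TABLE my_catalog.db.events ADD COLUMNS (country STRING)")

# Time travel: pick the oldest snapshot id from the snapshots metadata table
# and query the table as it looked at that point (Spark 3.3+ syntax).
oldest = spark.sql(
    "SELECT snapshot_id FROM my_catalog.db.events.snapshots ORDER BY committed_at"
).first()["snapshot_id"]

spark.sql(f"SELECT * FROM my_catalog.db.events VERSION AS OF {oldest}").show()
```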
These features make Iceberg a powerful and flexible table format for managing data lakes, ensuring efficient data processing, robust performance, and seamless integration with various tools and platforms. By leveraging Iceberg, organizations can achieve greater data agility, reliability, and efficiency, enhancing their data analytics capabilities and driving better business outcomes.
Prerequisites:
PySpark: Ensure that you have PySpark installed and properly configured. PySpark provides the Python API for Spark, enabling you to harness the power of distributed computing with Spark using Python.
Python: Make sure you have Python installed on your system. Python is essential for writing and running your PySpark scripts. It's recommended to use a virtual environment to manage your dependencies effectively.
Iceberg-Spark JAR: Download the appropriate Iceberg-Spark JAR file that corresponds to your Spark version. This JAR file is necessary to integrate Iceberg with Spark, allowing you to utilize Iceberg's advanced table format capabilities within your Spark jobs.
Jars to Configure Cloud Storage: Obtain and configure the necessary JAR files for your specific cloud storage provider. For example, if you are using Amazon S3, you will need the hadoop-aws JAR and its dependencies. For Google Cloud Storage, you need the gcs-connector JAR. These JARs enable Spark to read from and write to cloud storage systems.
Spark and Hadoop Configuration: Ensure your Spark and Hadoop configurations are correctly set up to integrate with your cloud storage. This might include setting the appropriate access keys, secret keys, and endpoint configurations in your spark-defaults.conf and core-site.xml.
Iceberg Configuration: Configure Iceberg settings specific to your environment. This might include catalog configurations (e.g., Hive, Hadoop, AWS Glue) and other Iceberg properties that optimize performance and compatibility.
Development Environment: Set up a development environment with an IDE or text editor that supports Python and Spark development, such as PyCharm, IntelliJ IDEA with the Python plugin, Visual Studio Code, or Jupyter Notebooks.
Data Source Access: Ensure you have access to the data sources you will be working with, whether they are in cloud storage, relational databases, or other data repositories. Proper permissions and network configurations are necessary for seamless data integration.
Basic Understanding of Data Lakes: A foundational understanding of data lake concepts and architectures will help you utilize Iceberg effectively. Knowledge of how data lakes differ from traditional data warehouses, and of their benefits, will also be helpful.
Version Control System: Use a version control system like Git to manage your codebase. This helps in tracking changes, collaborating with team members, and maintaining code quality.
Documentation and Resources: Familiarize yourself with Iceberg documentation and other relevant resources. This will help you troubleshoot issues, understand best practices, and leverage advanced features effectively.
You can download the runtime JAR according to the Spark version installed on your machine or cluster; the setup is the same as for Delta Lake. You can either download these JAR files onto your machine or cluster and pass them with the spark-submit command, or fetch them while initializing the Spark session by listing them as JAR packages, with the appropriate versions, in the Spark config.
To use cloud storage, we use these JARs with an S3 bucket for reading and writing Iceberg tables. Here is a basic example of a Spark session:
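The following is a minimal sketch of such a session, assuming an Iceberg runtime built for your Spark/Scala version. The package versions, catalog name (my_catalog), and bucket path are placeholders to adapt to your environment and AWS credentials setup.

```python
from pyspark.sql import SparkSession

# Minimal sketch of a Spark session configured for Iceberg on S3.
# Package versions, catalog name, and bucket path are placeholders.
spark = (
    SparkSession.builder
    .appName("iceberg-on-s3")
    # Pull the Iceberg runtime and S3 filesystem JARs at session start-up.
    .config(
        "spark.jars.packages",
        "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.4.3,"
        "org.apache.hadoop:hadoop-aws:3.3.4",
    )
    # Enable Iceberg's SQL extensions (MERGE INTO, CALL procedures, etc.).
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    # Register an Iceberg catalog backed by a Hadoop warehouse on S3.
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.my_catalog.type", "hadoop")
    .config("spark.sql.catalog.my_catalog.warehouse", "s3a://my-bucket/iceberg/warehouse")
    .getOrCreate()
)
```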
You can mount the sample data directory into the container or copy it from your local machine into the container. To copy the data into the Docker container, we can use the docker cp command.
We read the data in Spark and create an Iceberg table out of it, storing the Iceberg table in the S3 bucket.
Some Iceberg functionality won’t work if the appropriate Iceberg JAR isn’t installed or used. The Iceberg version should be compatible with the Spark version you are using; otherwise, some features, such as partitioning, will throw a NoSuchMethodError. This must be handled carefully while setting things up, whether on EC2 or EMR.
Step 1
Create an Iceberg table on S3 and write data into that table. The sample data we use was generated by the Spark job from the Delta Lake blog, and we reuse the same data and schema; a small sketch for loading and inspecting it follows.
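The input path and file format below are hypothetical and should point at wherever the data was copied in the previous step.

```python
# Minimal sketch: load the sample data and inspect it before creating the table.
# The path and file format are hypothetical; adjust them to your copied data.
df = spark.read.parquet("/opt/data/sample_data")

df.printSchema()
df.show(5, truncate=False)
```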
Step 2
We create the Iceberg table at the S3 bucket location and write the data, partitioned by the chosen columns, into that bucket.
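A minimal sketch of this step, using the hypothetical catalog, namespace, table, and partition column names from above, might look like this:

```python
from pyspark.sql.functions import col

# Minimal sketch: create a partitioned Iceberg table in the S3 warehouse and
# write the sample data into it. All names here are hypothetical.
spark.sql("CREATE NAMESPACE IF NOT EXISTS my_catalog.db")

(
    df.writeTo("my_catalog.db.sample_table")
    .using("iceberg")
    .partitionedBy(col("country"))   # hypothetical partition column
    .createOrReplace()
)

# Later loads can append to the same table:
# df.writeTo("my_catalog.db.sample_table").append()
```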
This is how we can use Iceberg over S3. There is another option: we can also create Iceberg tables in the AWS Glue catalog. With Delta Lake, most tables created in the Glue catalog using Athena are external tables that we can query only after generating manifest files; Iceberg does not require that extra step.
Step 3
We print the Iceberg table’s data along with the table description.
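A minimal sketch for this, against the hypothetical table created above:

```python
# Minimal sketch: read the Iceberg table back and show its description.
spark.sql("SELECT * FROM my_catalog.db.sample_table LIMIT 10").show(truncate=False)

# DESCRIBE EXTENDED also lists the partition spec, table properties, and
# the current metadata location.
spark.sql("DESCRIBE TABLE EXTENDED my_catalog.db.sample_table").show(truncate=False)
```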
Using Iceberg, we can create the table directly in the Glue catalog using Athena, and it supports all read and write operations on the available data. These are the configurations needed in Spark when using the Glue catalog.
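A minimal sketch of those settings is shown below. The catalog name (glue_catalog), warehouse bucket, and package versions are placeholders, and the Glue and S3 classes assume the iceberg-aws bundle is on the classpath; AWS credentials and region come from the standard AWS environment.

```python
from pyspark.sql import SparkSession

# Minimal sketch: Spark session using the AWS Glue Data Catalog as the Iceberg
# catalog. Names and versions are placeholders; adapt them to your setup.
spark = (
    SparkSession.builder
    .appName("iceberg-glue-catalog")
    .config(
        "spark.jars.packages",
        "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.4.3,"
        "org.apache.iceberg:iceberg-aws-bundle:1.4.3,"
        "org.apache.hadoop:hadoop-aws:3.3.4",
    )
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register a catalog backed by AWS Glue, with data files written to S3.
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-bucket/iceberg/warehouse")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .getOrCreate()
)
```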
Now, we can easily create the Iceberg table using Spark or Athena, and it will be accessible from either engine. We can perform upserts, too.
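For example, upserts can be expressed with MERGE INTO, which the Iceberg SQL extensions enable. In the sketch below, updates_df is a hypothetical DataFrame of new or changed rows keyed by an id column, and the table name is the hypothetical one used earlier.

```python
# Minimal sketch of an upsert via MERGE INTO (requires the Iceberg SQL
# extensions on the session). Table, view, and column names are hypothetical.
updates_df.createOrReplaceTempView("updates")

spark.sql("""
    MERGE INTO glue_catalog.db.sample_table AS t
    USING updates AS u
    ON t.id = u.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```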
Conclusion
We’ve learned the basics of the Iceberg table format, its features, and the reasons for choosing Iceberg. We discussed how Iceberg provides significant advantages such as schema evolution, partition evolution, hidden partitioning, and ACID compliance, making it a robust choice for managing large-scale data. We also delved into the fundamental setup required to implement this table format, including configuration and integration with data processing engines like Apache Spark and query engines like Presto and Trino. By leveraging Iceberg, organizations can ensure efficient data management and analytics, facilitating better performance and scalability. With this knowledge, you are well-equipped to start using Iceberg for your data lake needs, ensuring a more organized, scalable, and efficient data infrastructure.