
The Data Lake Revolution: Unleashing the Power of Delta Lake

Abhishek Sharma

Data Engineering

Once upon a time, in the vast and ever-expanding world of data storage and processing, a new hero emerged. Its name? Delta Lake. This unsung champion was about to revolutionize the way organizations handled their data, and its journey was nothing short of remarkable.

The Need for a Data Savior

In this world, data was king, and it resided in various formats within the mystical realm of data lakes. Two popular formats, Parquet and Hive, had served their purposes well, but they harbored limitations that often left data warriors frustrated.

Enterprises faced a conundrum: they needed to make changes, updates, or even deletions to individual records within these data lakes. But it wasn't as simple as it sounded. Modifying schemas was a perilous endeavor that could potentially disrupt the entire data kingdom.

Why? Because these traditional table formats lacked a vital attribute: ACID transactions. Without these safeguards, every change was a leap of faith.

The Rise of Delta Lake

Amidst this data turmoil, a new contender emerged: Delta Lake. It was more than just a format; it was a game-changer.

Delta Lake brought with it the power of ACID transactions. Every data operation within the kingdom was now imbued with atomicity, consistency, isolation, and durability. It was as if Delta Lake had handed data warriors an enchanted sword, making them invincible in the face of chaos.

But that was just the beginning of Delta Lake's enchantment.

The Secrets of Delta Lake

Delta Lake was no ordinary table format; it was a storage layer that transcended the limits of imagination. It integrated seamlessly with Spark APIs, offering features that left data sorcerers in awe.

  • Time Travel: Delta Lake allowed users to peer into the past, accessing previous versions of data. The transaction log became a portal to different eras of data history.
  • Schema Evolution: It had the power to validate and evolve schemas as data changed. A shapeshifter of sorts, it embraced change effortlessly.
  • Change Data Feed: With this feature, it tracked data changes at the granular level. Data sorcerers could now decipher the intricate dance of inserts, updates, and deletions.
  • Data Skipping with Z-ordering: Delta Lake mastered the art of optimizing data retrieval. It skipped irrelevant files, ensuring that data requests were as swift as a summer breeze.
  • DML Operations: It wielded the power of SQL-like data manipulation language (DML) operations. Updates, deletes, and merges were but a wave of its hand.

Delta Lake's Allies

Delta Lake didn't stand alone; it forged alliances with various data processing tools and platforms. Apache Spark, Apache Flink, Presto, Trino, Hive, DBT, and many others joined its cause. They formed a coalition to champion the cause of efficient data processing.

In the vast landscape of data management, Delta Lake stands as a beacon of innovation, offering a plethora of features that elevate your data handling capabilities to new heights. In this exhilarating adventure, we'll explore the key features of Delta Lake and how they triumph over the limitations of traditional file formats, all while embracing the ACID properties.

ACID Properties: A Solid Foundation

In the realm of data, ACID isn't just a chemical term; it's a set of properties that ensure the reliability and integrity of your data operations. Let's break down how Delta Lake excels in this regard.

A for Atomicity: All or Nothing

Imagine a tightrope walker teetering in the middle of their performance—either they make it to the other side, or they don't. Atomicity operates on the same principle: either all changes happen, or none at all. In the world of Spark, this principle often takes a tumble. When a write operation fails midway, the old data is removed, and the new data is lost in the abyss. Delta Lake, however, comes to the rescue. It creates a transaction log, recording all changes made along with their versions. In case of a failure, data loss is averted, and your system remains consistent.

C for Consistency: The Guardians of Validity

Consistency is the gatekeeper of data validity. It ensures that your data remains rock-solid and valid at all times. Spark sometimes falters here. Picture this: your Spark job fails, leaving your system with invalid data remnants. Consistency crumbles. Delta Lake, on the other hand, is your data's staunch guardian. With its transaction log, it guarantees that even in the face of job failure, data integrity is preserved.

I for Isolation: Transactions in Solitude

Isolation is akin to individual bubbles, where multiple transactions occur in isolation, without interfering with one another. Spark might struggle with this concept. If two Spark jobs manipulate the same dataset concurrently, chaos can ensue. One job overwrites the dataset while the other is still using it—no isolation, no guarantees. Delta Lake, however, introduces order into the chaos. Through its versioning system and transaction log, it ensures that transactions proceed in isolation, mitigating conflicts and ensuring the data's integrity.

D for Durability: Unyielding in the Face of Failure

Durability means that once changes are made, they are etched in stone, impervious to system failures. Spark's Achilles' heel lies in its vulnerability to data loss during job failures. Delta Lake, however, boasts a different tale. It secures your data with unwavering determination. Every change is logged, and even in the event of job failure, data remains intact—a testament to true durability.

Time Travel: Rewriting the Past

Now, let's embark on a fascinating journey through time. Delta Lake introduces a feature that can only be described as "time travel." With this feature, you can revisit previous versions of your data, just like rewinding a movie. All of this magical history is stored in the transaction log, encapsulated within the mystical "_delta_log" folder. When you write data to a Delta table, it's not just the present that's captured; the past versions are meticulously preserved, waiting for your beck and call.

In conclusion, Delta Lake emerges as the hero of the data world, rewriting the rules of traditional file formats and conquering the challenges of the ACID properties. With its robust transaction log, versioning system, and the ability to traverse time, Delta Lake opens up a new dimension in data management. So, if you're on a quest for data reliability, integrity, and a touch of magic, Delta Lake is your trusted guide through this thrilling journey beyond convention.

More Features of Delta Lake:

  • UPSERT
  • Schema Evolution
  • Change Data Feed
  • Data Skipping with Z-ordering
  • DML Operations

The Quest for Delta Lake

Setting up Delta Lake was like embarking on a quest. Data adventurers ventured into the cloud (AWS, GCP, or Azure) or even their local domains. They armed themselves with the delta-spark spells and summoned the JARs of delta-core, delta-contribs, and delta-storage, tailored to their Spark versions.

Requirements:

  • Python
  • The delta-spark pip package
  • The Delta JARs (delta-core, delta-contribs, delta-storage)

You can configure the package names in the Spark session so that they are downloaded at runtime. As mentioned, I am using Spark 3.3, so we will need delta-core, delta-contribs, and delta-storage. You can download them from: https://github.com/delta-io/delta/releases/

To read from and write to the various cloud storage options, there are separate .jars you can use: https://docs.delta.io/latest/delta-storage.html. There, you can find the .jars needed to configure AWS, GCS, and Azure storage.

Run this command to install delta-spark first:

pip install delta-spark

(If you are using Dataproc or EMR, you can install this package while creating the cluster as a startup/bootstrap action; if you are using a serverless environment like Glue or Dataproc batches, you can bake it into a Docker image or pass the .whl file for the package.)

The same applies to the .jars. For a serverless setup, download the .jars, store them in cloud storage such as S3 or GCS, and reference that path while running the job. For a cluster such as Dataproc or EMR, you can download them onto the cluster itself.

You can also have these .jars downloaded at runtime while creating the Spark session.

Now, create the Spark session, and you are ready to play with Delta tables.
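If you want to try this locally first, a minimal sketch of such a session might look like the snippet below. It uses the configure_spark_with_delta_pip helper that ships with the delta-spark package; the app name is just a placeholder.

from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Base builder with the two mandatory Delta properties.
builder = SparkSession.builder.appName("delta-quickstart") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

# configure_spark_with_delta_pip adds the delta-core package that matches the
# installed delta-spark version, so the JAR is fetched at runtime.
spark = configure_spark_with_delta_pip(builder).getOrCreate()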

Environment Setup

How do you add the Delta Lake dependencies to your environment?

  1. You can add them while initializing the Spark session by passing the package coordinates with the specific version; the dependencies are then downloaded at runtime.
  2. You can place the required .jar files on your cluster and reference them while initializing the Spark session.
  3. You can download the .jar files, store them in cloud storage, and pass their paths as a runtime argument if you don’t want to keep the dependencies on your cluster.

CODE: https://gist.github.com/velotiotech/9947ee290851af95e90caa7abf06631f.js

You have to set the following properties to use Delta in Spark:

  • spark.sql.extensions
  • spark.sql.catalog.spark_catalog

You can see these values in the code snippet above. If you want to read and write data from cloud storage such as S3, GCS, or Azure Blob Storage, you have to set a few more configs in the Spark session. Here, I am providing examples for AWS and GCS only.

The next question that will come to your mind: how do you read and write data in cloud storage?

Each cloud storage provider has connector .jar files that handle the connection and I/O operations. See the examples below.

You can make these .jars available to the Spark session using the approaches above: download them at runtime or store them on the cluster itself.

AWS 

from pyspark.sql import SparkSession

# Runtime packages: AWS SDK, hadoop-aws (S3A connector), and Delta Lake core.
spark_jars_packages = "com.amazonaws:aws-java-sdk:1.12.246,org.apache.hadoop:hadoop-aws:3.2.2,io.delta:delta-core_2.12:2.2.0"

builder = SparkSession.builder.appName('delta') \
  .config("spark.jars.packages", spark_jars_packages) \
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
  .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") \
  .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
  .config("spark.hadoop.fs.AbstractFileSystem.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
  .config("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")

spark = builder.getOrCreate()

GCS

from pyspark.sql import SparkSession

# project_id and credential (a parsed service-account JSON) are assumed to be defined above.
spark_session = SparkSession.builder.appName('delta').getOrCreate()

# GCS connector settings (the gcs-connector JAR must be on the classpath).
spark_session.conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
spark_session.conf.set("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
spark_session.conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")

# Service-account authentication.
spark_session.conf.set("fs.gs.auth.service.account.enable", "true")
spark_session.conf.set("fs.gs.project.id", project_id)
spark_session.conf.set("fs.gs.auth.service.account.email", credential["client_email"])
spark_session.conf.set("fs.gs.auth.service.account.private.key.id", credential["private_key_id"])
spark_session.conf.set("fs.gs.auth.service.account.private.key", credential["private_key"])

Write into Delta Tables: In the following example, we use the local filesystem for reading and writing data to and from Delta Lake tables.

Data Set Used: https://media.githubusercontent.com/media/datablist/sample-csv-files/main/files/organizations/organizations-100.zip

For reference, I have downloaded this file to my local machine and unzipped the data:

CODE: https://gist.github.com/velotiotech/6589b2b3beac99c5dff4afc83b05d39a.js

There are two write modes available in Delta Lake and Spark, append and overwrite, when writing data into Delta tables from any source.
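As a rough, hedged illustration of both modes (the gist above has the actual example; the CSV path and table location below are placeholders):

# Read the sample CSV (path is a placeholder for the unzipped organizations file).
df = spark.read.option("header", "true").csv("/tmp/organizations-100.csv")

# Overwrite: replace whatever already exists at the table location.
df.write.format("delta").mode("overwrite").save("/tmp/delta/organizations")

# Append: add the new rows on top of the existing data.
df.write.format("delta").mode("append").save("/tmp/delta/organizations")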

For now, we have enabled the Delta catalog to store all metadata-related information. We could also use the Hive metastore to store the metadata and run SQL queries directly over the Delta tables. You can use a cloud storage path as well.

Read data from the Delta tables:

CODE: https://gist.github.com/velotiotech/3a758a175f7e8498fd3574f7e486c72e.js

Here, you can see the folder structure: after writing data into a Delta table, Delta creates a _delta_log directory whose log files keep track of metadata, partitions, and data files.

Option 2: Create Delta Table and insert data using Spark SQL.

CODE: https://gist.github.com/velotiotech/08c1139be7114e7fdfe488288bc4a744.js

Insert the data:

CODE: https://gist.github.com/velotiotech/716664f1472b39910a931c3dd9f0ba71.js

This way, we can read the Delta table, and you can also use SQL if you have enabled the Hive metastore.
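If the gists above do not load, a pared-down sketch of the SQL route might look like this; the orgs_data table name matches the one referenced later in this post, but the columns shown are illustrative rather than the full CSV schema.

# Create a managed Delta table (illustrative columns only).
spark.sql("""
    CREATE TABLE IF NOT EXISTS orgs_data (
        organization_id STRING,
        name            STRING,
        country         STRING
    ) USING DELTA
""")

# Insert rows with plain SQL and read them back.
spark.sql("INSERT INTO orgs_data VALUES ('O-1', 'Acme Corp', 'US')")
spark.sql("SELECT * FROM orgs_data").show()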

Schema Enforcement: Safeguarding Your Data

In the realm of data management, maintaining the integrity of your dataset is paramount. Delta Lake, with its schema enforcement capabilities, ensures that your data is not just welcomed with open arms but also closely scrutinized for compatibility. Let's dive into the meticulous checks Delta Lake performs when validating incoming data against the existing schema:

Column Presence: Delta Lake checks that every column in your DataFrame matches the columns in the target Delta table. If there's a single mismatch, it won't let the data in and, instead, will raise a flag in the form of an exception.

Data Types Harmony: Data types are the secret language of your dataset. Delta Lake insists that the data types in your incoming DataFrame align harmoniously with those in the target Delta table. Any discord in data types will result in a raised exception.

Name Consistency: In the world of data, names matter. Delta Lake meticulously examines that the column names in your incoming DataFrame are an exact match to those in the target Delta table. No aliases allowed. Any discrepancies will lead to, you guessed it, an exception.

This meticulous schema validation guarantees that your incoming data seamlessly integrates with the target Delta table. If any aspect of your data doesn't meet these strict criteria, it won't find a home in the Delta Lake, and you'll be greeted by an error message and a raised exception.
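As a small, hedged illustration of that enforcement (the path and column names below are made up): appending a DataFrame whose schema has drifted from the table's should fail with an AnalysisException rather than silently corrupting the table.

from pyspark.sql.utils import AnalysisException

# DataFrame with an extra column the target table does not have.
bad_df = spark.createDataFrame(
    [("O-1", "Acme Corp", "US", "oops")],
    ["organization_id", "name", "country", "unexpected_col"],
)

try:
    bad_df.write.format("delta").mode("append").save("/tmp/delta/organizations")
except AnalysisException as err:
    # Schema enforcement rejects the write instead of letting bad data in.
    print("Write rejected:", err)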

Schema Evolution: Adapting to Changing Data

In the dynamic landscape of data, change is the only constant. Delta Lake's schema evolution comes to the rescue when you need to adapt your table's schema to accommodate incoming data. This powerful feature offers two distinct approaches:

Overwrite Schema: You can choose to boldly overwrite the existing schema with the schema of your incoming data. This is an excellent option when your data's structure undergoes significant changes. Just set the "overwriteSchema" option to true, and voila, your table is reborn with the new schema.

Merge Schema: In some cases, you might want to embrace the new while preserving the old. Delta Lake's "Merge Schema" property lets you merge the incoming data's schema with the existing one. This means that if an extra column appears in your data, it elegantly melds into the target table without throwing any schema-related tantrums.

Should you find the need to tweak column names or data types to better align with the incoming data, Delta Lake's got you covered. The schema evolution capabilities ensure your dataset stays in tune with the ever-changing data landscape. It's a smooth transition, no hiccups, and no surprises, just data management at its finest.

CODE: https://gist.github.com/velotiotech/d5379926b4a0e34b39106d55b5997294.js

The above code overwrites the existing Delta table with the new schema along with the new data.
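If the gist does not render, a minimal sketch of the two approaches could look like this (the DataFrame names and the path are placeholders):

# Option 1: replace both the data and the schema of the existing table.
new_df.write.format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .save("/tmp/delta/organizations")

# Option 2: append new data and let extra columns merge into the schema.
incoming_df.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("/tmp/delta/organizations")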

Delta Lake supports automatic schema evolution. For instance, if you add two more columns to a Delta table and then read the existing table, you will still be able to read the data without any error.

There is another way as well. For example, if you have three columns in a Delta table but the incoming data has four columns, you can set spark.databricks.delta.schema.autoMerge.enabled to true. This can also be done for the entire cluster.

CODE: https://gist.github.com/velotiotech/56adfccd321e64bf3926fb13fa46caa6.js


Let’s add one more column and try to access the data again:

spark.sql("alter table orgs_data add columns (extra_col STRING)")

spark.sql("describe table orgs_data").show()


As you can see, the column has been added without impacting the data. You can still read the data seamlessly; the newly created column is simply set to null for existing rows.

What happens if an incoming CSV that we want to append to the existing Delta table contains an extra column? You have to set one option for that:

CODE: https://gist.github.com/velotiotech/0bc719d250d417d75969fb230f842b13.js

You have to add the option mergeSchema=true while appending the data. It merges the schema of the incoming data, which contains some extra columns, into the table schema.

The first figure shows the schema of the incoming data; in the previous one, we already saw the schema of our Delta table.


Here, we can see that the new column from the incoming data has been merged into the existing schema of the table. The Delta table now has the updated schema.

Time Travel 

Delta Lake keeps track of all changes by writing log files into the _delta_log folder. Using this, we can fetch a previous version of the data by specifying its version number.

CODE: https://gist.github.com/velotiotech/5360915975bd2fea306ec39c61cee957.js

Here, we can see the first version of the data, before we added any columns. The Delta table maintains the delta log, which records every commit, so we can fetch the data as of any particular commit.
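A hedged sketch of both ways to time-travel, by version number or by timestamp (the path, version, and timestamp are placeholders):

# Read version 0, i.e., the table as it looked before any later changes.
v0 = spark.read.format("delta") \
    .option("versionAsOf", 0) \
    .load("/tmp/delta/organizations")

# Or pin the read to a point in time instead of a version number.
snapshot = spark.read.format("delta") \
    .option("timestampAsOf", "2023-01-01 00:00:00") \
    .load("/tmp/delta/organizations")

# DESCRIBE HISTORY lists every commit recorded in _delta_log.
spark.sql("DESCRIBE HISTORY delta.`/tmp/delta/organizations`").show()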

Upsert, Delete, and Merge

Unlocking the Power of Upsert with Delta Lake

In the exhilarating realm of data management, upserting shines as a vital operation, allowing you to seamlessly merge new data with your existing dataset. It's the magic wand that updates, inserts, or even deletes records based on their status in the incoming data. However, for this enchanting process to work its wonders, you need a key—a primary key, to be precise. This key acts as the linchpin for merging data, much like a conductor orchestrating a symphony.

A Missing Piece: Copy on Write and Merge on Read

Now, before we delve into the mystical world of upserting with Delta Lake, it's worth noting that Delta Lake dances to its own tune. Unlike table formats such as Hudi and Iceberg, Delta Lake does not expose Copy on Write and Merge on Read as user-selectable table types; those techniques are used in other formats to trade off write and read performance.

Two Paths to Merge: SQL and Spark API

To harness the power of upserting in Delta Lake, you have two pathways at your disposal: SQL and Spark API. The choice largely depends on your Delta version. In the latest Delta version, 2.2.0, you can seamlessly execute merge operations using Spark API. It's a breeze. However, if you're working with an earlier Delta version, say 1.0.0, then Spark SQL is your trusty steed for upserts and merges. Remember, using the right Delta version is crucial, or you might find yourself grappling with the cryptic "Method not found" error, which can turn into a debugging labyrinth.

In the snippet below, we showcase the elegance of upserting using Spark SQL, a technique that ensures your data management journey is smooth and error-free:

CODE: https://gist.github.com/velotiotech/019e176d0a753d9adb0f0cfd5149b8b6.js

CODE: https://gist.github.com/velotiotech/5e02d0829672a62cbbb72fe2ca156776.js

Here, we load the incoming data and show what is inside. We also show the existing rows with the same primary key in the Delta table so that we can compare before and after upserting or merging the data.

CODE: https://gist.github.com/velotiotech/c9a204f4b6efad127d3430204d6b28b6.js

CODE: https://gist.github.com/velotiotech/3e35b7ac34594d2af205f326a20397e6.js

This is an example of how you can do an upsert using the Spark APIs. The merge operation can create lots of small files; you can control this by setting the following properties in the Spark session.

spark.databricks.delta.merge.repartitionBeforeWrite.enabled true

spark.sql.shuffle.partitions 10

This is how merge operations work. Merge supports only one-to-one mapping: each source row must match at most one row in the target Delta table. If multiple source rows try to update the same target row, the merge will fail. Delta Lake matches rows on the basis of the merge key when performing an update.
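For completeness, here is a hedged sketch of the same upsert through the DeltaTable API described above; the key column, DataFrame name, and path are placeholders, and the author's actual code is in the gists.

from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/tmp/delta/organizations")

# Upsert: update matching rows, insert the rest. Each source row must match
# at most one target row, otherwise the merge fails.
target.alias("t").merge(
        incoming_df.alias("s"),
        "t.organization_id = s.organization_id") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()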

Change Data Feed

This is another useful feature of Delta Lake: it tracks and maintains the row-level history of all records in the Delta table across inserts, updates, and deletes. You can enable it at the beginning while setting up the Spark session, or via Spark SQL, by enabling "change events" for the data.

Now, you can see the whole journey of each record in the Delta table, from its insertion to its deletion. It introduces one extra column, _change_type, which contains the type of operation that was performed on that particular row.

To enable this, you can set these configurations: 

CODE: https://gist.github.com/velotiotech/391f9066279b9ad2ecadde1154337cbe.js

Or you can set this conf while reading the delta table as well. 

CODE: https://gist.github.com/velotiotech/68f950c2bf3dff0a76c4cbefc9db17cb.js

CODE: https://gist.github.com/velotiotech/81b9e2038bb04445ab48506b36b112ad.js

Now, after deleting something, you will be able to see the changes: what was deleted and what was updated. If you do upserts on the same Delta table after enabling the change data feed, you will see the updates as well, and if you insert anything, you will see what was inserted into your Delta table.

If we overwrite the complete Delta table, it will mark all past records as deleted:

If you want to record every data change, you have to enable this before creating the table so that the changes are captured for each version. If you have already created the table, you will not be able to see changes for versions prior to enabling the change data feed, but you will see the changes for every version created after this configuration.
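A hedged sketch of enabling and reading the change data feed (the table name and starting version are placeholders):

# Enable CDF at creation time (or later via ALTER TABLE ... SET TBLPROPERTIES).
spark.sql("""
    CREATE TABLE IF NOT EXISTS orgs_cdf (id STRING, name STRING)
    USING DELTA
    TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read every change event recorded since version 0; each row carries the extra
# _change_type, _commit_version, and _commit_timestamp columns.
changes = spark.read.format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingVersion", 0) \
    .table("orgs_cdf")
changes.show()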

Data Skipping with Z-ordering

Data skipping is a technique in Delta Lake where, if you have a large number of records stored across many files, only the files that contain the required information are read and the rest are skipped. This makes reading from Delta tables faster.

Z-ordering is a technique used to colocate related information in the same set of files. If you know which column is used most often in query predicates and has high cardinality, you can Z-order by that column; this reduces the number of files that have to be read. You can pass multiple columns to Z-order by, separated by commas.

For example, suppose a table has one column that is filtered on most frequently. Z-ordering by that column increases the number of files that can be skipped when running those queries. A normal (linear) ordering clusters data along a single dimension, whereas Z-ordering clusters data across multiple dimensions.

CODE: https://gist.github.com/velotiotech/c8c1254f833bcb6183ccbec1b1eb4aa1.js
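If the gist above does not load, a hedged sketch of OPTIMIZE with Z-ordering (available in Delta Lake 2.0 and above; the column and path are placeholders):

from delta.tables import DeltaTable

# Compact small files and colocate rows by the most frequently filtered column.
DeltaTable.forPath(spark, "/tmp/delta/organizations") \
    .optimize() \
    .executeZOrderBy("country")

# The equivalent in SQL:
spark.sql("OPTIMIZE delta.`/tmp/delta/organizations` ZORDER BY (country)")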

DML Operations

Delta Lake can run SQL DML operations directly on the data lake, including update, delete, and merge.

Integrations and Ecosystem Supported in Delta Lake

Read Delta Tables

Unlock the Delta Tables: Tools That Bring Data to Life

Reading data from Delta tables is like diving into a treasure trove of information, and there's more than one way to unlock its secrets. Beyond the standard Spark API, we have a squad of powerful allies ready to assist: SQL query engines like Athena and Trino. But they're not just passive onlookers; they bring their own magic to the table, empowering you to perform data manipulation language (DML) operations that can reshape your data universe.

Athena: Unleash the SQL Sorcery

Imagine Athena as the Oracle of data. With SQL as its spellbook, it delves deep into your Delta tables, fetching insights with precision and grace. But here's the twist: Athena isn't just for querying; like a skilled blacksmith, it can help you hammer your data into a new shape, creating a masterpiece.

Trino: The Shape-Shifting Wizard

Trino, on the other hand, is the shape-shifter of the data realm. It glides through Delta tables, allowing you to perform an array of DML operations that can transform your data into new, dazzling forms. Think of it as a master craftsman who can sculpt your data, creating entirely new narratives and visualizations.

So, when it comes to Delta tables, these tools are not just readers; they are your co-creators. They enable you to not only glimpse the data's beauty but also mold it into whatever shape serves your purpose. With Athena and Trino at your side, the possibilities are as boundless as your imagination.

Read Delta Tables Using Spark APIs

CODE: https://gist.github.com/velotiotech/906ce8b5eedcc25d345d46b28586b6e3.js

Steps to Set Up Delta Lake with S3 on EC2 Or EMR and Access Data through Athena

Data Set Used - We have generated some dummy data of around 100 GB and written it into Delta tables.

Step 1 - Set up a Spark session along with AWS cloud storage and delta-spark. Here, we have used an EC2 instance with Spark 3.3 and Delta version 2.1.1, and we are setting up the Spark config for Delta and S3.

CODE: https://gist.github.com/velotiotech/c7d33e42fadb8592659ba833ab91693d.js

Spark Version - You can use any Spark version (Spark 3.3.1 is what the pip install pulled in). Just make sure the version you use is compatible with your Delta Lake version; otherwise, most of the features won't work.

Step 2 - Here, we are creating a Delta table with an S3 path. We can directly write the data into an S3 bucket as a Delta table, but it is better to create a table first and then write it into S3 to make sure the schema is correct.

Set the Delta location path (if it already exists), then run the Spark SQL query to create the Delta table at the S3 path.

CODE: https://gist.github.com/velotiotech/b9a9b536476812aa40ea559149a14ee8.js
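If the gist does not render, a sketch of such a CREATE TABLE against an S3 location might look like the following; the bucket, table name, and columns are guesses based on the query used later in this post, so treat them as placeholders.

# External Delta table whose files live directly in S3 (s3a:// via hadoop-aws).
spark.sql("""
    CREATE TABLE IF NOT EXISTS sample_events (
        id      BIGINT,
        `date`  INT,
        payload STRING
    )
    USING DELTA
    PARTITIONED BY (`date`)
    LOCATION 's3a://your-bucket/delta-lake-sample-data/'
""")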

Step 3 - Below is the link I used to generate the dummy data, which I then wrote into the S3 bucket as Delta tables. Feel free to look it over. Example write code is given below:

CODE: https://gist.github.com/velotiotech/ae64ea8b9c78b64d6ea01f4579d18f70.js

https://github.com/velotio-tech/delta-lake-iceberg-poc/blob/0396cdbf96230609695a907fdbe8c240042fce9e/delta-data-writer.py#L83 

In the above link, you will find the code for the dummy data generation.

Step 4 - Here, we print the count and select some data from the Delta table we just wrote.

Run the SQL query to check the table data and upsert using S3 bucket data:

CODE: https://gist.github.com/velotiotech/b6b51465919ccdb0bd35793ab26edd5a.js

This is the output of select statement:

This is the schema of the incoming data we are planning to merge into the existing Delta table:

After upsert, let’s see the data for the particular data partition:

spark.sql("select * from delta.`s3://abhishek-test-01012023/delta-lake-sample-data/` where id = 1 and date = 20221206").show()

Access Delta table using Hive or any other external metastore: 

For that, we have to create a link between them. To create this link, go to the Spark code and generate a manifest file at the S3 path where we have already written the data.

CODE: https://gist.github.com/velotiotech/c85624a27653f0151e955b12228b13f3.js

This will create the manifest folder. Now go to Athena and run this query:

CODE: https://gist.github.com/velotiotech/f8c31908ead9756d0b74a623f43e803f.js

Run this command:

CODE: https://gist.github.com/velotiotech/0d6864fb641d26b2944960ffa4cc52a6.js

You will be able to query the data. 
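For reference, a hedged sketch of the manifest-generation step on the Spark side (the bucket path is a placeholder; the Athena DDL itself is in the gist above):

from delta.tables import DeltaTable

# Writes a _symlink_format_manifest/ folder next to the Delta data so that
# engines without native Delta support (Athena, Presto, Hive) can read it.
DeltaTable.forPath(spark, "s3a://your-bucket/delta-lake-sample-data/") \
    .generate("symlink_format_manifest")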

Conclusion: A New Dawn

In a world where data continued to grow in volume and complexity, Delta Lake stood as a beacon of hope. It empowered organizations to manage their data lakes with unprecedented efficiency and extract insights with unwavering confidence.

The adoption of Delta Lake marked a new dawn in the realm of data. Whether dealing with structured or semi-structured data, it was the answer to the prayers of data warriors. As the sun set on traditional formats, Delta Lake emerged as the hero they had been waiting for—a hero who had not only revolutionized data storage and processing but also transformed the way stories were told in the world of data.

And so, the legend of Delta Lake continued to unfold, inspiring data adventurers across the land to embark on their own quests, armed with the power of ACID transactions, time travel, and the promise of a brighter, data-driven future.
