DP-203 Data Engineering on Microsoft Azure – Design and Develop Data Processing – Azure Databricks Part 9

29. Delta Lake Introduction

Hi and welcome back. In this chapter I just want to go through Delta Lake when it comes to Azure Databricks, and we’ll see a couple of examples on Delta Lake in Databricks itself. With the help of Delta Lake, you get some additional features for the tables that are stored in Azure Databricks. One of these features is ACID transactions. Here you can always ensure that you get consistent data. So when someone is, let’s say, performing an update on records stored in a table, you can always be sure that readers never see inconsistent data. So you now have the benefit of transactions on the data in your underlying tables.

Apart from this, Delta Lake also handles the metadata for your data itself. In addition, a table can be used both for your batch jobs and for your streaming jobs. You also have schema enforcement to ensure that no bad records are inserted into your table. You also have the concept of time travel: your data is versioned, which helps when it comes to performing rollbacks. We’ll see an example of this in a later chapter. And finally, you can also perform upserts and deletes on your data. So here I just wanted to give a quick introduction to Delta Lake. In the subsequent chapters we’ll see some examples of implementing Delta Lake.
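Just to illustrate the upsert capability mentioned above, here is a minimal sketch of a MERGE statement run from a notebook cell. The DataFrame, table, and column names (incoming_df, target_table, updates, id, value) are placeholders for this example, not tables from the labs.

    # Illustrative upsert: merge a DataFrame of incoming rows into an existing Delta table.
    # All table and column names here are placeholders.
    incoming_df.createOrReplaceTempView("updates")

    spark.sql("""
        MERGE INTO target_table AS t
        USING updates AS s
          ON t.id = s.id
        WHEN MATCHED THEN UPDATE SET t.value = s.value
        WHEN NOT MATCHED THEN INSERT *
    """)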

30. Lab – Creating a Delta Table

Now in this chapter, I’ll show you how to create a Delta Lake table. So here I have the code in place. First of all, I want to take information from a JSON-based file that I have in my Azure Data Lake Gen2 storage account. So this is my Data Lake Gen2 storage account; you’ve seen this earlier. I have my raw directory, and in it a JSON-based file. Again, this JSON-based file has metrics that are coming in from our database via diagnostic settings. I will make sure to attach this JSON file as a resource to this chapter, and you can upload it onto the raw directory. If you’ve been following along, we already have the Databricks secret scope in place to access our Data Lake Gen2 storage account. Next, we want to create a table in Azure Databricks. Here we are using the saveAsTable option to give the name of the table.
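As a rough sketch of what that code can look like, assuming the storage account is called datalakegen2xx, the container is rawdata, and the secret scope is databricks-scope (all placeholder names, not the exact lab values), the cell might be along these lines:

    # Illustrative only: read the diagnostic metrics JSON from ADLS Gen2
    # and save it as a Delta table. Account, container and secret names are placeholders.
    spark.conf.set(
        "fs.azure.account.key.datalakegen2xx.dfs.core.windows.net",
        dbutils.secrets.get(scope="databricks-scope", key="storage-key"))

    df = spark.read.json(
        "abfss://rawdata@datalakegen2xx.dfs.core.windows.net/raw/metrics.json")

    # format("delta") is stated explicitly; on recent Databricks runtimes Delta is already the default.
    df.write.format("delta").saveAsTable("metrics")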

Now here, in terms of the format, we are saying go ahead and make this a Delta table. So let me take this and execute it in a cell in a notebook that is attached to my cluster. Once this is complete, we have a table in place. Let me now execute another cell, because we can issue SQL commands against this table. Remember, this is the same diagnostic-based information we had seen earlier on; these are all the metrics that were stored in that JSON-based file.
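That SQL cell could be as simple as the following, using the metrics placeholder table name from the snippet above:

    # Illustrative only: query the Delta table from a notebook cell.
    display(spark.sql("SELECT * FROM metrics LIMIT 10"))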

Now, if you want to partition your table, that is also something you can do. Let’s say your queries use a WHERE clause that filters the data on the metric name. If you want faster performance for those queries, you can partition your data on the metric name. So if your query is looking for rows where the metric name is equal to CPU_percent, then, because the table has been partitioned, Azure Databricks only has to go into the partition where the metric name is CPU_percent. In this way you can make your queries much more efficient. Creating a table with partitions is very easy: use the partitionBy option and specify the column on which you want to create the partition, as in the sketch below.
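A minimal sketch, reusing the df DataFrame and placeholder names from the earlier snippet and assuming the JSON data has a metric_name column:

    # Illustrative only: write the same data as a second Delta table, partitioned by metric name.
    (df.write
       .format("delta")
       .partitionBy("metric_name")
       .saveAsTable("partitionmetrics"))

    # Queries that filter or group on metric_name only need to scan the matching partitions.
    display(spark.sql("""
        SELECT metric_name, COUNT(*) AS metric_count
        FROM partitionmetrics
        GROUP BY metric_name
    """))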

Here I am giving another table name of partitionmetrics. So let me take this, go on to a new cell, and run it. Once this is done, you can issue commands against the new table. Here I am selecting the metric name and a count from partitionmetrics and grouping by the metric name. Let me run this, making sure it is a SQL-based command. So we have all of this information in place. In this chapter, so far, we have only been looking at how to create a Delta table. In the next couple of chapters, we’ll see some advantages of using the Delta table. Now, before I wrap this chapter up, I just want to give you a note. If I go on to the Compute section, we have our clusters in place. If I create a cluster here, you can see that, based on our runtime, the Databricks runtime automatically uses Delta Lake as the default table format. So when creating a table, you don’t necessarily have to tell it to be a Delta table; this is the default format that will be used. But from the perspective of the exam, these commands are important and you should know them. That’s why I wanted to show you the commands to actually create a Delta table.

31. Lab – Streaming data into the table

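Since this lab is about streaming data into a table, here is a rough sketch of what that can look like with Structured Streaming, assuming new JSON metric files keep landing in the raw directory. The paths, the checkpoint location, and the newmetrics table name are illustrative placeholders, and df refers to the batch DataFrame from the earlier snippet:

    # Illustrative only: continuously pick up JSON files from the raw folder
    # and append them into a Delta table.
    input_path = "abfss://rawdata@datalakegen2xx.dfs.core.windows.net/raw/"
    checkpoint_path = "abfss://rawdata@datalakegen2xx.dfs.core.windows.net/checkpoints/newmetrics"

    stream_df = (spark.readStream
                   .schema(df.schema)   # reuse the batch schema instead of inferring it
                   .json(input_path))

    query = (stream_df.writeStream
               .format("delta")
               .outputMode("append")
               .option("checkpointLocation", checkpoint_path)
               .toTable("newmetrics"))  # runs continuously until stopped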

32. Lab – Time Travel

So in the prior chapter we saw how we could stream data into a Delta Lake table. In this chapter I just want to give a quick overview of the time travel feature that is available for your Delta Lake tables. Here, let me issue the SQL statement DESCRIBE HISTORY against the new metrics table. Whenever any change is made to the data in a Delta Lake table, a new version of that table is created. Here you can see, for each version, the operation that was performed.

So any time there is a streaming update on the table, you can see the operation and the operation parameters. And if you want to select data from the table as of a particular version, that is something you can do as well. For example, if I run SELECT * on the table as of, let’s say, version one, I can see that I have no results, because there was probably no data in the table at that point. Let me go on to version two, and now you can see some data in place. So you can look at your data at different versions, at different points in time. This is the concept of time travel that is also available for your Delta Lake tables.
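A minimal sketch of those statements, using the newmetrics placeholder table name from the streaming sketch:

    # Illustrative only: inspect the table's version history, then read older versions of it.
    display(spark.sql("DESCRIBE HISTORY newmetrics"))

    # The version numbers below are just examples.
    display(spark.sql("SELECT * FROM newmetrics VERSION AS OF 1"))
    display(spark.sql("SELECT * FROM newmetrics VERSION AS OF 2"))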

33. Quick note on deciding between Azure Synapse and Azure Databricks

So in this chapter, I just want to go through some quick points comparing the Spark pool in Azure Synapse with the Spark engine that is available in Azure Databricks. With Azure Synapse, you have the advantage of having everything in one place. You can host your data warehouse by creating a dedicated SQL pool. You can also create external tables that point to, let’s say, data in an Azure Storage account. You can bring your storage accounts much closer to Azure Synapse by linking them, in this case Azure Data Lake Storage Gen2 accounts. Then you also have the Integrate section, wherein you can develop pipelines, and you can make use of these pipelines to copy data from a source to a destination.

So you have everything in one place in Azure Synapse. Whereas in Azure Databricks, we have seen that there is a lot of functionality available, and this is based on the underlying Spark engine. Also, Azure Databricks is not only for data engineering; it can also be used for data science and machine learning, since a lot of the frameworks available for machine learning are also part of the Databricks service. So it is one complete solution if you are looking at data engineering, data science, and machine learning. And as I mentioned before, because the people who created Spark also created Azure Databricks, whatever changes they make to the Spark engine will always be available in Azure Databricks. So in this chapter I just wanted to go through a few points on both of these services, to help you decide which one best suits your needs.

34. What resources are we taking forward

So, again, a quick note on what we’re taking forward. I still need to discuss some monitoring aspects of Azure Databricks, and we’ll be covering this in the monitoring section. At this point you can go ahead and delete your cluster if it’s no longer required, and then, when we get to the monitoring section, you can recreate it. But we will revisit the monitoring side of the Azure Databricks service.
