DP-203 Data Engineering on Microsoft Azure – Monitor and optimize data storage and data processing

June 30, 2023

1. Best practices for structuring files in your data lake

Now, in this chapter, I want to give some best practices when it comes to structuring your files when building your data lake. So normally when you design a data lake, you might create something known as multiple zones. These zones, for example, can map on to different containers that you have in an Azure Data Lake Gen2 storage account. For example, you might have one container which is defined as the raw zone. So here the container would take all of the files that are being ingested into the Azure Data Lake Gen2 storage account.

So this contains the files in their original format, whether it be Avro, Parquet, JSON, etc. Then you might have another zone or another container where basic filtering has been carried out on the data in the raw zone. So, for example, you could remove columns that are not required, let's say in a Parquet-based file or in a JSON-based file. And then finally, you might have another zone which represents your curated zone. This is the data on which you want to perform the analytics. So this is the data that you might, let's say, transfer on to your data warehouse. Now, apart from that, the hierarchy used for the storage of files is also important. See, normally when it comes to ingesting data, you will be ingesting data at a rapid pace.

You might have data coming in every minute onto your Azure Data Lake Gen2 storage account. That's why it's very important to use a proper hierarchy when it comes to the storage of your files. Here I'm just giving an example. So you might have your raw zone, but scoped to a particular department. So you might have the department first, and then a raw zone for that department. You might then have a folder which denotes the data source, because data can come from a variety of sources. Then you have the year, the month, the day.

You could even have the hour and minute as well. And then you have the file itself. Next, look at using compressed file formats such as Parquet, because here less time is spent on the data transfer, and when it comes to loading data into Azure Synapse, the data warehouse can handle the decompression for you. You can also use multiple source files, because remember that with the MPP (massively parallel processing) architecture of the dedicated SQL pool in Azure Synapse, you can split your source files into different parts, and the multiple compute nodes can each process a file in parallel. So in this chapter, I just wanted to go through some of the best practices when it comes to structuring your files.
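The department / zone / data source / year / month / day hierarchy described above can be sketched as a small path-building helper. This is a minimal illustration only; the department, source, and file names are made up for the example.

```python
from datetime import datetime, timezone

def raw_zone_path(department: str, source: str, filename: str,
                  ingested_at: datetime) -> str:
    """Build a date-partitioned path for the raw zone of a data lake."""
    return "/".join([
        department,            # e.g. a department such as "sales"
        "raw",                 # the zone (raw / filtered / curated)
        source,                # the originating data source
        f"{ingested_at:%Y}",   # year
        f"{ingested_at:%m}",   # month
        f"{ingested_at:%d}",   # day
        f"{ingested_at:%H}",   # optional: the hour as well
        filename,
    ])

ts = datetime(2023, 6, 30, 14, 5, tzinfo=timezone.utc)
print(raw_zone_path("sales", "pos-system", "orders.parquet", ts))
# sales/raw/pos-system/2023/06/30/14/orders.parquet
```

A layout like this keeps rapid, minute-by-minute ingestion organized and lets downstream jobs prune by date instead of scanning the whole container.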

2. Azure Storage accounts – Query acceleration

Now, in this chapter, I briefly want to go through something known as Azure Data Lake Storage query acceleration. Here I don't have any sort of lab on this; I just want to let you know about this particular feature. This feature is used when you have applications, such as your .NET-based applications, that access the files in your Azure Data Lake Gen2 storage account. So when you're using a .NET program, and let's say you're using SQL to work with the data that is present in your Azure Data Lake Gen2 storage account, then in order to get faster results when it comes to filtering rows (predicates) and column projections, you can make use of this feature known as a query acceleration request.

Currently, this only supports CSV- and JSON-based files. Now, when it comes to the exam, what's very important to understand is the purpose of the query acceleration feature and how you make it work. So let me scroll down on to the next steps wherein we filter data by using this acceleration feature. Here, if I scroll down, in order to enable this query acceleration feature, you have to ensure that you register a provider using PowerShell.

So here, the name of the provider is Microsoft.Storage. This page has details on how you can use this query acceleration feature. Since most of this is done from a development language, I do want students to go through the details of how to use this feature on their own. From an exam perspective, it is just important to understand how you enable this feature. That's why I'm just keeping this a quick video. I'll ensure that these links are placed as an external resource on this chapter so that you can look through these documentation pages.
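As a hedged sketch of what a query acceleration request looks like from code, here is an example using the Python `azure-storage-blob` SDK (the course demo uses .NET, but the feature is the same). The connection string, container, blob, and column names are placeholders, and the SDK import is kept inside the function so the sketch is readable without the package installed.

```python
def acceleration_query(columns, predicate):
    # Query acceleration SQL always selects FROM BlobStorage.
    return f"SELECT {', '.join(columns)} FROM BlobStorage WHERE {predicate}"

def filter_large_orders(conn_str: str) -> bytes:
    """Push a row filter and column projection down to the storage service,
    so only the matching rows travel over the network."""
    # Imported lazily; requires `pip install azure-storage-blob`.
    from azure.storage.blob import BlobClient, DelimitedTextDialect

    blob = BlobClient.from_connection_string(
        conn_str, container_name="raw", blob_name="orders.csv")
    dialect = DelimitedTextDialect(delimiter=",", has_header=True)
    reader = blob.query_blob(
        acceleration_query(["OrderId", "Amount"], "Amount > 100"),
        blob_format=dialect, output_format=dialect)
    return reader.readall()
```

The point of the feature is visible in the call: the SQL predicate and projection are evaluated server-side, on a CSV or JSON blob, before any data is returned to the application.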

3. View on Azure Monitor

Hi and welcome back. Now in this chapter, I just want to give an overview when it comes to the monitoring service that is available in Azure. So Azure has its own built-in monitoring service. For example, if you look at the resources we have created so far, let's say the database, in the overview itself, if you scroll down, you can get aspects such as the compute utilization and the DTU percentage max that is being used. All of these are actually coming in from a separate service known as Azure Monitor. Now here I can search for the Monitor service and go on to it. If you go on to the Metrics section, you can plot the metrics for a particular resource. So for example, let's say I want to plot the metrics for my dedicated SQL pool. So I can choose my pool over here, hit on Apply, and let me just hide this. Then I can select the metric. So based on what do I want to see my metrics?

So here, let's say I want to look at the data warehousing units (DWU) that have been used over a period of time. So I can see at one point I used a maximum of 19 data warehousing units. This also gives you a good idea of whether you should increase or maybe even decrease the number of data warehousing units that you are using for your dedicated SQL pool. In my example, I'm using the lowest tier when it comes to the dedicated SQL pool, so there's no way for me to decrease; the only way is to increase. But maybe in your organization you might have dedicated SQL pools that have a higher number of data warehousing units assigned.

If you want to reduce cost, you can look at the utilization of your pool over a period of time and then, based on that, decide whether you want to decrease the number of data warehousing units that have been assigned to your dedicated SQL pool. You can also create alerts. Say that you want your IT administration team to be alerted when a threshold is reached for, let's say, the data warehousing units of your dedicated SQL pool. You can create a new alert rule from here, or you could also go on to Alerts and create alerts from there as well. If you create a new alert rule here, you choose the scope, which has already been defined. So here the scope is set, and the condition is whenever the maximum data warehousing units is greater than something; we have to define this condition.

So I can click on this. And here, if I scroll down, I can say that whenever the maximum DWU percentage goes beyond 60% over, let's say, a period of five minutes, this will be my condition. I can then hit on Done. Then I can scroll down and create something known as an action group. In an action group, you define what action to take. So here you can give a name for the action group. Then if you go next on to Notifications, you can choose a notification type. So if you want to email someone, you can specify what the email address should be, hit on OK, and just give a name for the notification.

Then you can go ahead and select an action. So you could use an action to do something. These are automation tools that are also available as part of Azure that you can make use of. Then you can go on to Tags, go on to Review and create, and create the action group. This group can be reused across multiple alerts. So now you have your action group in place, you have your condition in place, and you have your scope in place. When it comes to the cost, please note that there is a small cost when it comes to defining this alert rule.

And here you can scroll down, give a name for the alert, and then create the alert rule. Here I'll say not to automatically resolve the alerts, and I'll create the alert rule. Now, whenever the data warehousing units go beyond that particular threshold, an alert will be generated and the email notification will be sent. If you go on to the Activity log, and I just hide this, this will give you all of the administrative activities, all of your control plane activities, that occur as part of your Azure account. So for example, if you create a storage account, that will come over here.

If you delete a storage account, that activity will come over here. So all of these activities are recorded in the Activity log section in Azure Monitor. And then apart from that, you also have a lot of other features that are available in the Azure Monitor service.

