DP-203 Data Engineering on Microsoft Azure – Monitor and optimize data storage and data processing

  • June 30, 2023

1. Best practices for structuring files in your data lake

In this chapter, I want to share some best practices for structuring the files in your data lake. When you design a data lake, you normally create multiple zones. These zones can map onto different containers in an Azure Data Lake Gen2 storage account. For example, you might have one container defined as the raw zone. This container takes all of the files that are being ingested into the Azure Data Lake Gen2 storage account.

The raw zone holds the files in their original format, whether that is Avro, Parquet, JSON, and so on. Then you might have another zone, another container, where basic filtering has been carried out on the data from the raw zone. For example, you could remove columns that are not required from a Parquet-based or JSON-based file. And finally, you might have another container that represents your curated zone. This is the data on which you want to perform analytics, the data that you might transfer onto your data warehouse. Apart from that, the hierarchy used for storing the files is also important, because when you ingest data, you are normally ingesting it at a rapid pace.

You might have data coming into your Azure Data Lake Gen2 storage account every minute. That's why it's very important to use a proper hierarchy for storing your files. Here is just an example. Your raw zone might belong to a particular department, so you have the department first and then a raw zone for that department. Next, you might have a folder that denotes the data source, because data can come from a variety of sources. Then you have the year, the month, and the day.
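As a minimal sketch of this kind of hierarchy, the helper below builds a path of the form department/zone/source/year/month/day, with optional finer time folders. The function name, the department, source, and file names are all hypothetical and only serve as an illustration; nothing here calls an Azure API.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def build_lake_path(department, zone, source, ts, filename, include_time=False):
    """Return department/zone/source/yyyy/mm/dd[/hh/mm]/filename as a string."""
    parts = [department, zone, source, f"{ts:%Y}", f"{ts:%m}", f"{ts:%d}"]
    if include_time:
        # Optional finer granularity for data that arrives very frequently.
        parts += [f"{ts:%H}", f"{ts:%M}"]
    return str(PurePosixPath(*parts, filename))


# Hypothetical ingestion timestamp and names, for illustration only.
ingested_at = datetime(2023, 6, 30, 14, 5, tzinfo=timezone.utc)
print(build_lake_path("sales", "raw", "pos-system", ingested_at, "orders.parquet"))
# sales/raw/pos-system/2023/06/30/orders.parquet
```

Zero-padding the month and day keeps the folders in chronological order when they are listed alphabetically.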

You could even have the hour and the minute as well. And then you have the file itself. Next, look at using compressed file formats such as Parquet, because less time is then spent on data transfer. And when it comes to loading data into Azure Synapse, you can take advantage of the compute in the data warehouse for decompression. You can also use multiple source files. Remember the MPP (massively parallel processing) architecture of the dedicated SQL pool in Azure Synapse: there are multiple compute nodes, so if you split your source files into different parts, each compute node can process one file at a time. So in this chapter, I just wanted to go through some of the best practices for structuring your files.
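The idea of splitting one source file into several compressed parts can be sketched as follows. This is only an illustration under stated assumptions: gzip-compressed CSV from the Python standard library stands in for a compressed format (writing Parquet would need a third-party library), and the file names and the choice of four parts are hypothetical.

```python
import gzip


def split_rows(rows, parts):
    """Distribute rows round-robin into `parts` roughly equal chunks."""
    chunks = [[] for _ in range(parts)]
    for i, row in enumerate(rows):
        chunks[i % parts].append(row)
    return chunks


def to_gzip_files(header, chunks):
    """Return {file name: gzip-compressed CSV bytes}, one entry per non-empty chunk."""
    files = {}
    for n, chunk in enumerate(chunks, start=1):
        if chunk:
            text = "\n".join([header, *chunk]) + "\n"
            files[f"orders_part{n:03d}.csv.gz"] = gzip.compress(text.encode("utf-8"))
    return files


# Ten hypothetical rows split into four parts that could be loaded in parallel.
rows = [f"{i},item-{i}" for i in range(10)]
files = to_gzip_files("id,name", split_rows(rows, parts=4))
for name, data in files.items():
    print(name, len(gzip.decompress(data).splitlines()) - 1, "rows")
```

Each part repeats the header so that it can be read on its own, which is what lets separate compute nodes work on separate files.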

2. Azure Storage accounts – Query acceleration

In this chapter, I briefly want to go through something known as Azure Data Lake Storage query acceleration. I don't have a lab on this. I just want to let you know about this particular feature. It is used when you have applications, such as .NET-based applications, that access files in your Azure Data Lake Gen2 storage account. Say you're using a .NET program and you're using SQL to work with the data in your Azure Data Lake Gen2 storage account. To get faster results when it comes to row-filtering predicates and column projections, you can make use of a query acceleration request.

Currently, this only supports CSV and JSON-based files. Now, when it comes to the exam, what's very important to understand is the purpose of the query acceleration feature and how you make it work. So let me scroll down to the next steps, where we filter data by using this acceleration feature. If I scroll down here, you can see that in order to enable the query acceleration feature, you have to register a provider using PowerShell.

Here, the name of the provider is Microsoft.Storage. This page has details on how you can actually use the query acceleration feature. Since most of this is done from a development language, I want to let students go into the details of how to use this feature themselves. From an exam perspective, it is just important to understand how you enable this feature. That's why I just want to have a quick video on this. I'll ensure that these links are placed as external resources for this chapter so that you can go through the documentation pages.
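To make the row filter and column projection idea more concrete, here is a hedged sketch that uses the Python `azure-storage-blob` SDK rather than .NET. It assumes the package is installed, that the CSV file has a header row, and that the feature is enabled on the account; the column names, the predicate, and the `build_query` and `run_query` helpers are hypothetical.

```python
def build_query(columns, predicate=None):
    """Build a query acceleration expression: column projection plus optional row filter."""
    sql = f"SELECT {', '.join(columns)} FROM BlobStorage"
    if predicate:
        sql += f" WHERE {predicate}"
    return sql


def run_query(conn_str, container, blob_name, sql):
    """Run the expression on the storage side against a CSV blob and return the filtered bytes."""
    # Imported lazily so build_query can be used without the SDK installed.
    from azure.storage.blob import BlobClient, DelimitedTextDialect

    client = BlobClient.from_connection_string(
        conn_str, container_name=container, blob_name=blob_name
    )
    # Assumption: comma-delimited input with a header row, same dialect for the output.
    csv_format = DelimitedTextDialect(delimiter=",", has_header=True)
    reader = client.query_blob(sql, blob_format=csv_format, output_format=csv_format)
    return reader.readall()


sql = build_query(["OrderId", "Amount"], "Region = 'EU'")
print(sql)
# SELECT OrderId, Amount FROM BlobStorage WHERE Region = 'EU'
```

Only the rows and columns that match the expression travel back to the application, which is where the faster results come from.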

3. Overview of Azure Monitor

Hi and welcome back. In this chapter, I just want to give an overview of the monitoring service that is available in Azure. Azure has its own built-in monitoring service. For example, look at the resources we have created so far. If I look at, let's say, a database, then in the overview itself, if you scroll down, you can see aspects such as the compute utilization, for example the maximum DTU percentage being used. All of these are coming in from a separate service known as Azure Monitor. Here I can search for the Monitor service and go on to it. If you go on to the Metrics section, you can plot the metrics for a particular resource. For example, let's say I want to plot the metrics for my dedicated SQL pool. So I can choose new pool over here, hit Apply, and let me just hide this. Then I can select the metric, that is, which measurement I want to see plotted.

So here, let's say I want to look at the data warehousing units that have been used over a period of time. I can see that at one point I used a maximum of 19 data warehousing units. This also gives you a good idea of whether you should increase or maybe even decrease the number of data warehousing units that you are using for your dedicated SQL pool. In my example, I'm using the lowest tier of the dedicated SQL pool, so there's no way for me to decrease. The only way is to increase. But maybe in your organization you have dedicated SQL pools with a higher number of data warehousing units assigned.

If you want to reduce the cost, you can look at the utilization of your pool over a period of time and then, based on that, decide whether you want to decrease the number of data warehousing units assigned to your dedicated SQL pool. You can also create alerts. Say that you want your IT administrative team to be alerted when a threshold is reached for, let's say, the data warehousing units of your dedicated SQL pool. You can create a new alert rule from here, or you could go on to Alerts and create alerts from there as well. If you create a new alert rule here, the scope has already been defined. The condition is whenever the maximum data warehousing units used is greater than something, and we have to define this condition.

So I can click on this. If I scroll down, I can say that whenever the maximum data warehousing units used goes beyond 60% over, let's say, a period of five minutes, this will be my condition. I can then hit Done. Then I can scroll down and create something known as an action group. In an action group, you define what action to take. Here you give a name for the action group. Then, if you go next on to Notifications, you can choose a notification type. If you want to email someone, you specify the email address, hit OK, and give a name for the notification.
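The alert condition described above (the maximum value within a five-minute window compared against a 60% threshold) can be illustrated with a small pure-Python sketch. This is not how Azure Monitor is implemented and it calls no Azure API; the function name and the sample values are hypothetical.

```python
from datetime import datetime, timedelta


def alert_fires(samples, now, window=timedelta(minutes=5), threshold=60.0):
    """Return True when the maximum sample value inside the window exceeds the threshold."""
    recent = [value for ts, value in samples if now - window <= ts <= now]
    return bool(recent) and max(recent) > threshold


# Hypothetical utilization samples as (timestamp, percentage) pairs.
now = datetime(2023, 6, 30, 12, 0)
samples = [
    (now - timedelta(minutes=9), 72.0),  # outside the five-minute window, ignored
    (now - timedelta(minutes=4), 41.0),
    (now - timedelta(minutes=2), 63.5),  # inside the window and above 60
    (now - timedelta(minutes=1), 38.0),
]
print(alert_fires(samples, now))  # True
```

Using the maximum as the aggregation means a single spike inside the window is enough to trigger the alert; an average would only react to sustained load.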

Then you can go ahead and select an action, so the alert can do something. These are automation-based tools that are also available as part of Azure, and you can make use of them here. Then you can go on to Tags, go on to Review and create, and create the action group. Once created, the group can be reused across multiple alerts. So now you have your action group in place, your condition in place, and your scope in place. When it comes to the cost, please note that there is a small cost for defining this alert rule.

Here you can scroll down, give a name for the alert, and then create the alert rule. I'll choose not to automatically resolve the alerts, and I'll create the alert rule. Now, whenever the data warehousing units go beyond that particular threshold, an alert will be generated and the email notification will be sent. Next, go on to the Activity log. If I just hide this, it gives you all of the administrative activities, all of the control plane activities that occur as part of your Azure account. For example, if you create a storage account, that will show up here.

If you delete a storage account, that activity will show up here as well. So all of these activities are recorded in the Activity log section in Azure Monitor. Apart from that, there are a lot of other features available in Azure Monitor.
