DP-203 Data Engineering on Microsoft Azure – Monitor and optimize data storage and data processing Part 9


24. Azure Stream Analytics – Diagnostic settings

And welcome back. Now in this chapter I just want to go through the diagnostic settings that are available for your Azure Stream Analytics job. So earlier on we had seen that we could direct the logs of the pipeline runs, when it came to Azure Data Factory, onto a Log Analytics workspace, and all of the logs would be collected in the Logs section, where they would come up as different tables under Log Management. Now, the same thing you can also do with your Azure Stream Analytics job. So here, if I go on to my Stream Analytics job, if I scroll down and go on to Diagnostic settings, I have already enabled a diagnostic setting. Here I am actually directing the logs onto a different Log Analytics workspace.

Now please note that you can direct the logs onto the same Log Analytics workspace, but here I have just gone ahead and directed them onto a different Log Analytics workspace. If I click on Edit setting, you can see I have chosen the Execution log, so everything when it comes to executing the Stream Analytics job is sent on to the Log Analytics workspace. So now if I actually go on to that Log Analytics workspace (it's the DB workspace), I'll go on to the Logs section, I'll just hide this, and I'll expand this. So when it comes to the diagnostic setting, the logs will be written onto the AzureDiagnostics table.

Now AzureDiagnostics is a table that can store the log information about various other resources as well, not only Azure Stream Analytics. So here, if I go on to AzureDiagnostics, let me just hide this and let me run the statement. So here we can see all of the information. Now if I expand one record, here we are seeing the resource provider, which is MICROSOFT.STREAMANALYTICS. Remember, as I mentioned before, you could have other resources, based on other services, that can also send their logs onto the Log Analytics workspace.

So for example, your Azure SQL database and your Azure web apps can all send their logs onto the Log Analytics workspace. So if you only want to look at the Stream Analytics logs from your point of view, then you have to ensure that in your query against the AzureDiagnostics table, you filter based on the resource provider.
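As a minimal sketch, a Kusto query along these lines restricts the AzureDiagnostics table to just the Stream Analytics entries; the provider value shown is the one Stream Analytics records in this table:

    // Keep only the log entries emitted by Azure Stream Analytics
    AzureDiagnostics
    | where ResourceProvider == "MICROSOFT.STREAMANALYTICS"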

Now next, when it comes to Stream Analytics, there is also one more property here, and that's properties_s. Now this actually gives properties about that particular event that occurred in Azure Stream Analytics. Here I have an event wherein it could not deserialize the input event data. Here we have properties such as the data error type, the error category and the error code. Now this property is in JSON, and the properties will not be the same for every type of event in Azure Stream Analytics. So for example, if I just close this, go on to another event and scroll down, I can see this was the operation of starting the streaming job, and if I go on to the properties here, I can see that I have some different properties in place.

So depending upon the type of event that is actually occurring in the Azure Stream Analytics job, that information will be in properties_s. So here I just want to show an example of a query. Here I am saying: please only return the rows where the resource provider is MICROSOFT.STREAMANALYTICS. And next, if I only want to get the rows that resulted in an input deserialization error, then I can get the error code, which is part of properties_s. So this is not a direct column in AzureDiagnostics.

It is part of an event, part of a row, in this particular table in the Log Analytics workspace. So here I'm using the parse_json function to basically get all of the elements in this particular JSON object, and then I use the project statement. So here, if you look at all of the records, I can see the columns of the time generated, the resource ID, the category, the resource group, et cetera. If I only want the columns of the time generated and the message, I can actually use the project statement here.
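Putting that together, the query being described would look something like this sketch. Note that the exact field names inside properties_s, such as DataErrorType and Message, and the error type value are assumptions based on what the recording shows:

    AzureDiagnostics
    | where ResourceProvider == "MICROSOFT.STREAMANALYTICS"
    // properties_s is a JSON string, so parse it first to reach the nested fields
    | extend Props = parse_json(properties_s)
    // keep only the input deserialization errors (error type value assumed)
    | where Props.DataErrorType == "InputDeserializerError.InvalidData"
    // return just the two columns we care about
    | project TimeGenerated, Message = tostring(Props.Message)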

So if I run this now, I will see only those rows that resulted from the input deserialization error. And here you can see I only have two columns: one is the time generated and the other is the message property. So in this chapter, I just wanted to go through the diagnostic settings that are available for your Azure Stream Analytics job.

25. Azure Databricks – Monitoring

Now, in this chapter, I just want to go through the monitoring aspects that are available in Azure Databricks. So firstly, if you go on to the Compute section and on to a running cluster, if you go on to the event log here, you will see all the events when it comes to the cluster itself: so, for example, when the cluster was terminated, when it started running, and the health of the driver node and even your executor nodes as well. If you go on to the driver logs, here you will see all of the logs. So in terms of the jobs, if you want to see whether a job has started, you can actually look at the Spark driver logs here, because all of our jobs are going to run on the driver.

Because remember that in our cluster we only have one node in place, and that node is also working as the driver node. If you go on to the Spark UI (this feature is also available as part of your Spark pools in Azure Synapse), here you will see all of the information about your jobs. So you have jobs, and then you have stages within the jobs, and then you have your storage. So there are various aspects that you can see. So here I can see that I have my Delta tables, and if I scroll to the right, I can see the size in memory. If I go on to Structured Streaming, here I can see all of the streaming jobs that have taken place on my cluster.

Now, apart from that, let's also look at how you can see the different stages when you execute your jobs in Spark. So if I go on to my workspace, let me go on to any notebook that I have and let me create a new cell. And here let me execute a very simple set of commands. So here I'm just getting information from one of my JSON-based files and I'm displaying the data frame. So let me run the cell. The first thing that you can see is that it has run two Spark jobs in order to run this particular set of commands.
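For reference, the cell being run is along these lines; this is just a sketch, and the file path is a hypothetical placeholder:

    # Read a JSON-based file into a DataFrame (path is a placeholder)
    df = spark.read.json("/mnt/datalake/raw/log.json")
    # display() is the Databricks notebook helper that renders the DataFrame
    # and triggers the Spark jobs listed under the cell
    display(df)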

So now if I expand these two Spark jobs, here you can see the different stages of each job. So let me view job number 60. Here you have something known as the DAG Visualization; DAG stands for Directed Acyclic Graph. So this gives you the different stages, or the different events, that have taken place when it came to executing this job. If you click on this, you will see details when it comes to the DAG Visualization. So here you can see it is converting it onto a SQL-based data frame, and here you can see that it's mapping the partitions as it should. If I click on view for job number 61, here we have a very simple stage. If I click on this, it's a very simple stage wherein it is scanning the JSON-based file.

If I take another command, let's say I'm writing onto a partitioned metrics table, so I'm creating a Delta table, and let me run the cell. So you can see there are a lot of Spark jobs that are running here. If you again go on to each of the jobs, you can see there is a whole lot that's actually going on. You can also see an exchange happening here. So if I go on to view for job number 63, here you can see that data is being transferred from one stage onto another. So there's a lot that actually goes on in the background when it comes to how the jobs are getting executed. Here you can see there are some shuffle reads and writes, because now we are making use of partitions.
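The write itself would be something like the following sketch; the table name is hypothetical, and the partition column follows the metric-name partitioning mentioned in this lecture:

    # Write the DataFrame as a Delta table partitioned by the metric name
    (df.write
       .format("delta")
       .mode("overwrite")
       .partitionBy("metric_name")           # partition column per the lecture's example
       .saveAsTable("partitioned_metrics"))  # hypothetical table name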

So if data needs to be made available for a query and the table has been partitioned, data shuffling can happen across the partitions to satisfy your query, unless your query is only targeting one partition. So remember, all our partitions are based on the metric name here. So if you have, let's say, a SELECT * FROM the table WHERE the metric name is equal to just one metric name, it will only go on to that one partition that holds all of that metric name's information. But if you have a query that is scanning across multiple partitions, then you will definitely get shuffle reads, because the Spark engine needs to read data across multiple partitions.
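So a query like this sketch, which targets a single metric name, lets Spark prune down to one partition; without that filter, every partition has to be read (the table and metric name are the hypothetical ones from above):

    # Partition pruning: only the files for this one metric name are scanned
    one_metric = spark.sql(
        "SELECT * FROM partitioned_metrics WHERE metric_name = 'memory percent'"
    )
    one_metric.show()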

So, again, I'm not going into much detail about the DAG visualization; this is just to give you an idea of what happens in the background. If you actually want to see all of the partitions in your table, you can run a cell like the sketch shown after this paragraph. So here you can see all of the partitions: there are eleven partitions in this particular table, so all of the data where the metric name is equal to 'memory percent' will be in partition number one, and so on and so forth. Right? So in this chapter, I just wanted to go through some of the important points when it comes to the monitoring aspects that are available for Spark in Azure Databricks.
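The cell referred to above is simply the partition listing, shown here against the hypothetical table name used earlier:

    # List every partition of the Delta table
    spark.sql("SHOW PARTITIONS partitioned_metrics").show()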
