DP-203 Data Engineering on Microsoft Azure – Design and Develop Data Processing – Azure Databricks Part 7


22. Lab – Azure Data Lake Storage Credential Passthrough

Now in this chapter, I want to go through a scenario wherein you can make use of something known as Azure AD credential passthrough. Earlier on, we saw that if we wanted to fetch data from a Data Lake Gen2 storage account, we had to ensure that we had the access keys defined in a key vault.

But we can also make use of a feature known as Azure Active Directory credential passthrough, wherein the user who is actually working with the notebook is authorized to access the data in the Azure Data Lake Gen2 storage account directly. This is a much more useful security feature. Here, the user executing the notebook does not need to go through the process of having access keys in place; based on their credentials and their permissions, they will have access to the data in the Data Lake Gen2 storage account.
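For contrast, this is roughly what the key-based approach from the earlier chapters looks like inside a notebook; the secret scope, secret name and storage account name below are placeholders, not necessarily the exact ones used earlier:

```python
# Key-based access (the earlier approach): the account key is read from a
# Databricks secret scope backed by Azure Key Vault, and then set on the
# Spark session so that Spark can authenticate to the storage account.
account_key = dbutils.secrets.get(scope="appscope", key="storagekey")

spark.conf.set(
    "fs.azure.account.key.newdatalake.dfs.core.windows.net",
    account_key)
```

With credential passthrough enabled, none of this setup is needed in the notebook.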

Now, I go through in detail how you can give access to the data in your Data Lake Gen2 storage account in the section on security in this course, so I cover that there. But in this chapter, we are going to see how to make use of that security feature, the Azure Active Directory credential passthrough feature. So now, in order to have a clean slate to test this feature, I'm going to create a new storage account. I'll choose a storage account; it'll be a Data Lake Gen2 storage account. Here I'll choose my resource group, give a storage account name, choose North Europe, and make this locally redundant. I'll go on to Next for Advanced, where I'll enable the hierarchical namespace. Then I'll go through Networking, Data protection, Tags, and Review and Create, and let me go ahead and hit on Create.
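As an aside, if you would rather script this step than click through the portal, a minimal sketch using the azure-mgmt-storage Python package could look like the following; the subscription ID, resource group and account name are assumptions:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

# A locally redundant StorageV2 account in North Europe with the hierarchical
# namespace enabled - that is what makes it a Data Lake Gen2 account.
poller = client.storage_accounts.begin_create(
    "app-grp",        # resource group (assumed name)
    "newdatalake",    # storage account name (assumed)
    StorageAccountCreateParameters(
        sku=Sku(name="Standard_LRS"),
        kind="StorageV2",
        location="northeurope",
        is_hns_enabled=True))

account = poller.result()  # block until provisioning completes
print(account.primary_endpoints.dfs)
```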

So here I have listed down all the steps that you need to perform. The first is creating a new Data Lake storage account. Next, we need to upload a file and give the required permissions. This also includes ensuring that we give something known as the Reader role and the Storage Blob Data Reader role to the Azure admin user, and also something known as ACL permissions. Once we have the storage account in place, I'll go on to the resource, go on to my containers, and create a data container. Now I'm going to upload the Log.csv file; we've already seen this earlier on, and I have this Log.csv file in place. Now we need to give permissions to the Azure admin account, because when we run our notebooks here, I'm running as the Azure admin account. So I need to ensure that I give the right permissions on my data in the Data Lake Gen2 storage account.

Even though I'm the Azure admin, I still need to specifically grant these permissions. So the first thing I need to do is to go on to my Data Lake storage account. Here, I have to go on to Access Control, click on Add, and add a role assignment. I need to choose the Reader role, search for the Azure admin user ID, and then click on Save. Then I have to add another role.

So here, I'll add another role assignment. This time I need to choose the role of Storage Blob Data Reader, again search for my admin account, and click on Save. Now, I also need to log into Azure Storage Explorer to give something known as access control list permissions. As I mentioned, in the section on security I go through all of these concepts. So at this point in time, I have logged into Azure Storage Explorer as my Azure admin. Let me go on to the New Data Lake storage account.

I'm just waiting for my containers to load up. I'll go on to my Blob containers, go on to my data container, right-click, and choose Manage Access Control Lists. Here I'll click on Add, search for the user, and choose my user ID; that's the first one. I'll click on Add, choose the permissions of Access and Read, and hit OK. So here, the permissions are successfully saved. What I've done so far is use the option of Manage Access Control Lists. Now I'll choose Propagate Access Control Lists, so that it will propagate the access control onto all of the objects that are there in this container, and I'll hit OK.
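The same ACL step can also be scripted with the azure-storage-file-datalake package; a minimal sketch, assuming the account is named newdatalake and substituting the Azure AD object ID of the user:

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://newdatalake.dfs.core.windows.net",
    credential=DefaultAzureCredential())

# Get a client for the root directory of the data container.
root = service.get_file_system_client("data").get_directory_client("/")

# Add a read + execute ACL entry for the user and propagate it recursively
# to every object already in the container - the scripted equivalent of
# Manage Access Control Lists plus Propagate Access Control Lists.
root.update_access_control_recursive(acl="user:<user-object-id>:r-x")
```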

So normally, this is a more secure way: you could have users defined in Azure Active Directory and give them selective access to the files in the Azure Data Lake Gen2 storage account. So we've done this part; we've given the required permissions and assigned all of these roles. Next, we need to create a new cluster. Now, one very important note: credential passthrough is only available with the Premium plan of Azure Databricks, and we are using the trial Premium plan; also, this only works with Azure Data Lake Gen2 storage accounts. So now in Azure Databricks, I'll go on to the Compute section, go on to my existing cluster, and terminate the cluster.

So we have to create a new cluster. One thing to note is that we will not be able to have multiple clusters in place, because of the limit on the number of virtual cores that we can create as part of our subscription. I know that I have a limit on the number of virtual cores that I can use in a region as part of my subscription, so if I try to create another cluster, I might get an error. So I'll go on to my clusters, and here I'll create a cluster. I'll give a cluster name and again choose Single Node. Now I have to go on to the Advanced options, and here I need to enable credential passthrough for user-level data access. Now, here I am choosing my user; this is actually the login ID for my tech support user.
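For reference, if you were creating this cluster through the Clusters API rather than the UI, the passthrough checkbox corresponds to a Spark configuration entry. A rough sketch follows; the workspace URL, token, runtime version, node type and user name are all placeholders, and the exact fields may differ across Databricks API versions:

```python
import requests

host = "https://<workspace-url>"
headers = {"Authorization": "Bearer <personal-access-token>"}

cluster_spec = {
    "cluster_name": "passthrough-cluster",
    "spark_version": "9.1.x-scala2.12",   # an LTS runtime (placeholder)
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 0,                     # single node
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
        # The setting behind "Enable credential passthrough for
        # user-level data access" in the UI.
        "spark.databricks.passthrough.enabled": "true",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
    # The user whose Azure AD identity the cluster passes through.
    "single_user_name": "<user>@<tenant>.onmicrosoft.com",
}

resp = requests.post(f"{host}/api/2.0/clusters/create",
                     headers=headers, json=cluster_spec)
print(resp.json())  # contains the new cluster_id
```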

And then I'll create the cluster. Let's wait till we have the cluster in place. Once we have the cluster in place, I'll go on and create a new notebook; I'll choose my cluster as the new cluster and hit on Create. Now, here I'll take the code to create a data frame and place it here. I need to replace all of this: the name of my storage account is New Data Lake, so I need to replace it here, just to make sure it's the same. So I have my Log.csv file, and it is in the data container.
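The notebook code itself boils down to a plain abfss read with no credentials attached; a sketch, assuming the account is named newdatalake and the file is Log.csv in the data container:

```python
# No access keys, secret scopes or service principals here: with credential
# passthrough enabled on the cluster, this read is authorized using the
# Azure AD identity of the user running the notebook.
df = (spark.read
      .format("csv")
      .option("header", "true")
      .load("abfss://data@newdatalake.dfs.core.windows.net/Log.csv"))

display(df)
```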

Now, let me run this. So here you can see all of the data. As I said, the difference here is that we have not used any access keys. There are no keys assigned on the cluster, no keys that are part of the notebook itself, and we are not making use of the secrets that are stored in Databricks. We are now purely basing our authorization on the user that is defined in Azure Active Directory. So the same user that is running your notebook also has access to the data in your Data Lake Gen2 storage account. This is another secure way in which you can access your data from your notebooks.

23. Lab – Running an automated job

Now, in this chapter, I want to go through jobs that are available in Azure Databricks. A job is a non-interactive way to run an application in an Azure Databricks cluster. You can run the job immediately or based on a schedule. You can run a notebook or a JAR in a job, and the job will run on a cluster. So, as an example, earlier on we had run this particular notebook that would take the streaming events from Azure Event Hubs onto our table in our dedicated SQL pool. Now, let's say you want to run this as a job. The first thing I need to do is to move this notebook, so I'll click on Move. Here, I'll choose the Shared location and hit on Select, and I'll confirm the move. So currently the notebook is in the detached state. Now, in another tab, let me go on to Jobs.

Let me create a new job. Now, here, the first thing I need to do is to select my notebook. So I'll go on to Shared, choose my app notebook, and hit on Confirm. Next, we need to choose the cluster on which to run our job. You have two options: you can run it on your existing cluster, or you can create a new job cluster, which is specific to running jobs. Since we don't want to reach any sort of limit on the number of virtual cores that we can assign to our clusters, I'll choose my existing App Cluster. Now, if I just quickly go on to another tab and go on to my clusters, you can see we have been working with a couple of clusters in this particular section.

Now, at any point in time, you can go on to a running cluster and terminate it to basically stop it, and then start the cluster again later. This is a cluster we created earlier on; it actually had the library installed for Azure Event Hubs. So if I go back onto Clusters and go on to my terminated cluster, you can again start this cluster at any point in time. What Azure Databricks does is retain the configuration of your cluster for a period of 30 days after it has been terminated, so that you can start your cluster with the same configuration at any point in time.
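Terminating and restarting a cluster can also be driven through the Clusters API; a quick sketch with an assumed workspace URL, token and cluster ID:

```python
import requests

host = "https://<workspace-url>"
headers = {"Authorization": "Bearer <personal-access-token>"}

# Terminate the cluster; its configuration is retained for 30 days.
requests.post(f"{host}/api/2.0/clusters/delete",
              headers=headers, json={"cluster_id": "<cluster-id>"})

# Later, start it again with the same configuration.
requests.post(f"{host}/api/2.0/clusters/start",
              headers=headers, json={"cluster_id": "<cluster-id>"})
```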

If you want to retain the configuration of this cluster for a longer duration of time, you have to choose this icon to pin it in Azure Databricks. Now you can see you don't even have the option to delete this particular cluster. So let me unpin it, because you should have the option to delete the cluster at any point in time. That was just a quick note when it comes to clusters. Going back onto our Jobs page, we have everything in place. Let me give a job name. And here, in the schedule type, you can run it based on a schedule or you can manually trigger the job. We will manually trigger this particular job, so let me hit on Create. Once we have the job in place, I'll go back onto Jobs. Now let me go ahead and start this particular job.
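The same job can be defined programmatically through the Jobs API; a minimal sketch against the 2.1 endpoint, where the notebook path, cluster ID and credentials are placeholders:

```python
import requests

host = "https://<workspace-url>"
headers = {"Authorization": "Bearer <personal-access-token>"}

job_spec = {
    "name": "app-job",
    "tasks": [{
        "task_key": "run-app-notebook",
        "notebook_task": {"notebook_path": "/Shared/AppNotebook"},
        # Run on the existing all-purpose cluster; supply "new_cluster"
        # here instead to run on a dedicated job cluster.
        "existing_cluster_id": "<cluster-id>",
    }],
    # No "schedule" block, so the job is triggered manually.
}

resp = requests.post(f"{host}/api/2.1/jobs/create",
                     headers=headers, json=job_spec)
print(resp.json())  # contains the new job_id
```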

So it has started the job, and you can go on to the job run. Here I'm actually getting an error; I can see there's an internal error if I view the details. It's saying that the notebook is not found, which means we made a mistake in our job configuration. So I can go on to Job A, go on to the configuration, and here let me select the proper notebook: the app notebook in the Shared location. Let me hit on Confirm. This is fine, so I'll click on Save. Let me go back onto Jobs and now run this job again. I'll go back onto Job A, and now we can see it is in the running state.
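Triggering a run and cancelling it later (which we will do shortly, since the streaming notebook never finishes on its own) can likewise be scripted; a sketch with assumed job and run IDs:

```python
import requests

host = "https://<workspace-url>"
headers = {"Authorization": "Bearer <personal-access-token>"}

# Trigger the job manually; the response contains the run_id.
run = requests.post(f"{host}/api/2.1/jobs/run-now",
                    headers=headers, json={"job_id": 123}).json()

# Cancel the run once you have verified data is landing in the table.
requests.post(f"{host}/api/2.1/jobs/runs/cancel",
              headers=headers, json={"run_id": run["run_id"]})
```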

We can see the duration over here, and we can click on View Details. So it has submitted the command on to the cluster for execution. Now we can see it is initializing the stream, and then we can see it is running the stream. So let me now check whether I have any data in my log table, and I can see the data in place. So this is actually running as a job on a general all-purpose cluster. But what you can do in a large organization, if you want to run jobs, is run them on separate job clusters. So just for now, I'll go back on to the job and cancel this running job. Right, so in this chapter I just wanted to go through the jobs feature that is available in Azure Databricks.
