DP-203 Data Engineering on Microsoft Azure – Design and Develop Data Processing – Azure Databricks

  • June 22, 2023

1. What is Azure Databricks

Now, before we go into Azure Databricks, let me first explain the need for Databricks itself. Databricks is a company that was founded by the original creators of Apache Spark. The Databricks service makes use of Apache Spark to provide a unified analytics platform. So let’s understand the use case for Databricks. Let’s say that you want to make use of Apache Spark for your underlying processing needs.

The first thing you would need to do is provision machines. You would then install the Spark engine and the required libraries on these machines, and only then could you use Apache Spark for your data processing needs. In such a scenario, you are responsible for provisioning the underlying machines, and you are responsible for installing the Spark engine and the required libraries.

You also have the responsibility of maintaining the underlying infrastructure. If you need to scale the underlying machines to cater to your data processing needs, that is something you have to take care of yourself. With Databricks, on the other hand, you can create this entire environment with just a few clicks.

First of all, Databricks creates the underlying compute infrastructure for you. In addition to that, it works with the underlying storage layer. So besides having the servers in place, it also provides an abstraction layer that allows Spark to interact with the underlying storage service. It also installs Spark for you, along with other libraries and frameworks that add further capabilities to Spark. For example, you could include machine learning libraries.

All of this is done by Databricks itself. It also provides a workspace for you. In this workspace, you can create notebooks, users can collaborate on these notebooks, and you can create visualizations in the notebook itself. Now, you can launch Databricks either in AWS, that’s Amazon Web Services, or in Azure. And that’s where we come on to Azure Databricks. Azure Databricks is nothing but a completely managed Databricks environment for you.

It makes use of the underlying compute infrastructure and the Virtual Network service that are already available in Azure. So Azure Databricks is nothing but an implementation of Databricks on Azure itself. Here you can also make use of Azure security features, such as integration with Azure Active Directory and role-based access control. Right, so in this chapter, I just wanted to give an introduction to Azure Databricks.

2. Clusters in Azure Databricks

Hi and welcome back. In the previous chapter, I gave an introduction to Azure Databricks. In this chapter, I want to go through some important concepts before we go into the labs, just so that you have an idea of what we are going to do there. So again, Databricks allows you to create the underlying infrastructure, which will have the underlying machines in place. Those machines will have Spark installed, along with the libraries that allow you to perform your data analytics. With the Azure Databricks service, you create clusters in something known as a workspace. This cluster of machines will have the Spark engine and other components installed. Now, when it comes to the cluster itself, there are two types of nodes that get created.

First you have the worker nodes. These are the nodes that actually process the underlying tasks. Let’s say you want to send a particular command to the underlying Spark engine. The work for that command will be sent on to the worker nodes, and the worker nodes have the responsibility of performing the underlying tasks. And then you have the driver node. The driver node has the responsibility of distributing the tasks that we send to the Spark cluster across the worker nodes. So this is one of the key concepts in Azure Databricks: we create a cluster of nodes.
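The split of responsibilities between the driver and the workers can be sketched in plain Python. This is only an analogy and not how Spark is implemented: a thread pool stands in for the worker nodes, and the function names (`split_into_partitions`, `worker_task`, `run_on_cluster`) are made up for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Driver-side step: cut the input into roughly equal chunks."""
    if not data:
        return []
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def worker_task(partition):
    """Worker-side step: process one partition (here, sum the squares)."""
    return sum(x * x for x in partition)


def run_on_cluster(data, num_workers=3):
    """Play the driver: hand partitions to the workers, then combine results."""
    partitions = split_into_partitions(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as workers:
        partial_results = list(workers.map(worker_task, partitions))
    return sum(partial_results)


total = run_on_cluster(list(range(1, 11)))
print(total)  # 385
```

The driver never does the heavy computation itself here. It only decides how the work is divided and merges what the workers send back.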

Now in Azure Databricks, there are two types of clusters. You can create something known as an interactive cluster, or you can create something known as a job cluster. With an interactive cluster, you can analyze your data with the help of interactive notebooks.

Also, multiple users can use the cluster and collaborate on the notebooks that get created. So this is an interactive way of analyzing your data. Whereas if you just want a job to run on the cluster, without any sort of interaction from a user, you can run that job on a job cluster. When the job needs to run, Azure Databricks will automatically start the cluster and run the job. And when the job is complete, the cluster will be terminated. So this is a cost-efficient way of running jobs on a cluster. Now, when it comes to interactive clusters, there are again two types: you have a standard cluster and you have a high concurrency cluster.
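To make the difference between the two cluster types concrete, here is a minimal sketch of how the two kinds of job definitions might look as request payloads. The field names follow my understanding of the Databricks Jobs API (`new_cluster` versus `existing_cluster_id`), so do check them against the current API reference. The job names, notebook path, and every value in angle brackets are placeholders, not real settings.

```python
import json


def job_on_job_cluster(name, notebook_path):
    """Job that gets its own cluster, created for the run and terminated after."""
    return {
        "name": name,
        "tasks": [{
            "task_key": "main",
            "notebook_task": {"notebook_path": notebook_path},
            "new_cluster": {
                "spark_version": "<runtime-version>",  # placeholder
                "node_type_id": "<vm-size>",           # placeholder
                "num_workers": 2,
            },
        }],
    }


def job_on_interactive_cluster(name, notebook_path, cluster_id):
    """Job that reuses an interactive cluster that is already defined."""
    return {
        "name": name,
        "tasks": [{
            "task_key": "main",
            "notebook_task": {"notebook_path": notebook_path},
            "existing_cluster_id": cluster_id,
        }],
    }


job_cluster_job = job_on_job_cluster("nightly-load", "/Shared/nightly_load")
interactive_job = job_on_interactive_cluster(
    "adhoc-load", "/Shared/nightly_load", "<cluster-id>"
)
print(json.dumps(job_cluster_job, indent=2))
```

The only structural difference is which cluster key the task carries, and that is what decides whether a cluster is spun up just for the run.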

The standard cluster is recommended if you are a single user working in Azure Databricks. Yes, you can have multiple users running workloads on a standard cluster, but there is no fault isolation. That means if a fault happens in a workload executed by one user, it might impact the workloads run by other users on the same cluster. Also, the resources of the cluster might all get allocated to a single workload. If that happens and other users are trying to execute their workloads on the cluster, those workloads might not run efficiently, because no resources are being allocated to them.

When it comes to running your notebooks, a standard cluster supports the programming languages Python, R, Scala and SQL. Then you have the high concurrency clusters. These are recommended for multiple users. So if you have multiple data engineering users who need to make use of a cluster in Azure Databricks, you can use a high concurrency cluster.

Here, you have features such as fault isolation. Also, the resources of the cluster are effectively shared across the different user workloads. A high concurrency cluster has support for Python, R and SQL, so there is no support for Scala as of yet. In the high concurrency cluster, you also have something known as table access control. Here you can grant and revoke access to data from either Python or SQL. Right, so in this chapter, I just wanted to go through some important aspects of clusters in Azure Databricks.
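As a rough illustration of table access control, the sketch below builds the kind of `GRANT` and `REVOKE` statements you would issue. The statement shape follows my understanding of Databricks SQL, and the table name and user are hypothetical. The helpers only build the strings, so the example runs anywhere.

```python
def grant(privilege, table, principal):
    """Build a GRANT statement giving a principal a privilege on a table."""
    return f"GRANT {privilege} ON TABLE {table} TO `{principal}`"


def revoke(privilege, table, principal):
    """Build a REVOKE statement taking that privilege away again."""
    return f"REVOKE {privilege} ON TABLE {table} FROM `{principal}`"


# Hypothetical table and user, for illustration only.
grant_stmt = grant("SELECT", "sales.orders", "analyst@example.com")
revoke_stmt = revoke("SELECT", "sales.orders", "analyst@example.com")

print(grant_stmt)
print(revoke_stmt)

# In a Python notebook attached to a cluster with table access control
# enabled, a statement like this would typically be run with:
#   spark.sql(grant_stmt)
```

From a SQL notebook cell, you would run the same statements directly without the Python wrapper.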

3. Lab – Creating a workspace

So now, in this chapter, let’s start working with Azure Databricks. The first thing we need to do is create something known as an Azure Databricks workspace. So let’s do that. In All resources, I’ll hit Create. Here, I’ll search for Azure Databricks, choose it, and hit Create. I’ll choose my resource group, give a workspace name, and choose my region. Here I’ll choose North Europe.

Now, in terms of the pricing tier, there are different pricing tiers in place. I’m going to choose the trial tier, which gives us the premium features along with 14 days of free DBUs. I’ll explain the DBU concept at a later point in time, when we create the cluster in the workspace. For now, I’ll choose this pricing tier. I’ll go on to Networking and leave everything as is. I’ll go on to Advanced, then Tags, then Review and Create. And let’s hit Create.
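The same choices made in the portal can also be scripted. As a sketch, the snippet below assembles an Azure CLI command with the region and pricing tier used in this lab. It assumes the `az databricks workspace create` command and its `--sku` option are available (to my knowledge this needs the Databricks extension for the Azure CLI), and the resource group and workspace names are placeholders. The code only builds and prints the command; it does not run it.

```python
import shlex


def workspace_create_command(resource_group, workspace_name, location, sku):
    """Assemble the CLI arguments for creating a Databricks workspace."""
    return [
        "az", "databricks", "workspace", "create",
        "--resource-group", resource_group,
        "--name", workspace_name,
        "--location", location,
        "--sku", sku,
    ]


# Placeholder names; region and tier match the choices made in the portal.
command = workspace_create_command(
    resource_group="my-resource-group",
    workspace_name="my-databricks-workspace",
    location="northeurope",
    sku="trial",
)
print(shlex.join(command))
```

To actually create the workspace, you would paste the printed command into a shell where you are logged in to Azure.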

This is now going to deploy our Databricks workspace. Let’s come back once we have the workspace in place. Once the workspace is in place, I’ll go to the resource. Here, we need to scroll down and launch our workspace. The workspace is where you’ll actually do all of your work. You’ll create clusters, you’ll create notebooks, and you can create Spark databases and tables. So you will do all of your data engineering work here.

In this workspace in Azure Databricks, you can see that you have the ability to create a new notebook, create a table, create a cluster, and create something known as a job. In the menu options, you can again see that you can create a notebook, a table, a cluster, or a job. Here you can see an overview of your workspace, and here you can see something known as Repos. You can look at your data, at the compute options, and at your jobs. Right, so in this chapter, I just wanted to start with creating a Databricks workspace.
