DP-203 Data Engineering on Microsoft Azure – Design and Develop Data Processing – Azure Databricks Part 3

June 23, 2023

7. Lab – Reading a CSV file

Now, in this chapter, we’ll see how to process our log CSV file. Working with files is one of the most important aspects of any service that we have seen. From the exam perspective, you should be able to understand how to work with CSV files, how to work with your JSON-based files, and how to work with Parquet files. That’s why I’ve covered all of these different file types in each particular section, and we have to do the same over here as well. So, firstly, I’ll close all the cells that I have open. Now, here I can go on to the menu option that is available and click on Upload Data, so that I can upload my log CSV file onto Azure Databricks.

So, Azure Databricks actually has this underlying Databricks File System in place. If you want to work with files locally, you can do so: you can upload your files and work with them. Yes, Azure Databricks can also connect onto your Data Lake Gen2 storage accounts and onto your normal Azure Storage accounts, and you can also create mount points onto those storage accounts. But we’ll see that a little bit later on.

So here, I’ll just click on this so that I can browse for my file, my log CSV file, and I’ll go on to Next. And here it’s giving the way that you can now access this file. Whether you’re working in Python (Spark), in R, or in Scala, you can just copy this particular statement. So let me copy this, I’ll click on Done, I’ll remove everything in the cell, and let me place it here. So, we have our Databricks File System, and here we have our log CSV file.

So there are some folders in between. We’ve already seen this statement before, wherein we can load a CSV file. The format we are mentioning is CSV, and we are using the Spark session to read our file. So here we can also show the contents of the data frame; let me run this. We can also do a display of our data frame. Here we can see again that our column names are coming as a row in our data frame. So we can change this. I’ll copy these two statements to create a new data frame, this time mentioning that the header is true, meaning that the first row holds our column names. Let me run this, and we can now see our data frame being properly displayed. So in this chapter, I wanted to explain the concept of how you can read your CSV files, and at the same time we’ve also been introduced to the Databricks File System.

8. Databricks File System

I just want to quickly cover some aspects when it comes to the Databricks File System. Your workspace gets a Databricks File System (DBFS). This is an abstraction layer on top of scalable object storage: under the covers you are getting object storage which is scalable in nature, and if you want to interact with that object storage, you have the Databricks File System. Here you can store your objects using directories and normal file semantics. These files also persist if the cluster is terminated, so if you terminate your cluster and then recreate it, you still have access to those files. The default storage location is called the DBFS root.

Now, there are some predefined root locations. We have /FileStore, which is used for imported data files, generated plots, and uploaded libraries; we have /databricks-datasets, which is used for some sample public data sets; and we have /user/hive/warehouse, which holds the data and the metadata for non-external Hive tables. So here, if I go on to one of my files, there are some magic commands in place to actually look at the Databricks File System, right in the cell itself. So this is the magic command, and ls is to list all of the contents. I’ll just run the cell, and here you can see the path on the Databricks File System and the name. And if you want to create a new directory, here we can create a new directory and then again list the contents. So here we can see our new directory in place. I just wanted to give you some more ideas when it comes to the Databricks File System.
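As a sketch, the notebook cells described above might look like this; the directory name `mydirectory` is just an illustration:

```
%fs ls /
%fs mkdirs /mydirectory
%fs ls /
```

The same operations are available from Python via `dbutils.fs.ls("/")` and `dbutils.fs.mkdirs("/mydirectory")`, which is useful when you want to mix DBFS calls with other code in the same cell.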

9. Lab – The SQL Data Frame

Now, in continuation of working with data frames, let’s again see some commands when it comes to the SQL-like API that is available on top of your data frames and your RDDs. So again, I’m reading my file here. Perhaps you only want to select some columns, you only want to see some columns in that particular data frame. So here, let me run this, and we can see our output in place. Remember that earlier on, in our previous chapter, we had created a data frame df2; we are reusing that same data frame, and here I am only selecting some of its columns. Now, we can also create a data frame which will actually infer the schema. So, if I look at my data frame... let me do one thing, let me print the schema of the data frame.

So here, in terms of the schema, we can see the Id is a string and the Time is also a string. But we want Spark to actually infer the schema based on the underlying data. So let me copy this, let me run the cell. Now we can see that the Id is coming up as an integer and the Time is coming up as a timestamp. Next, if you only want to show the rows based on a particular filter, this is like having a WHERE condition in place. We can also use the display command, as I’ve shown here; we can see only the rows where the status is equal to Succeeded. And then finally, if you want to use the group by statement, that’s something that you can do as well; here it’s grouped by the status. So again, there are different commands that are available to work with your data frames.

10. Visualizations

In this chapter, I just want to have a quick note when it comes to the visualizations that are available by default in the notebooks. So here I am displaying my data frame, and the entire data frame is coming in a tabular format. If I scroll down, I have the different visualizations available here if I click on the bar chart. By default, it’s stacking it up against the different IDs, and here I have the resource group and the resource type. You can expand the plot here by dragging this. If you go on to the plot options, by default it is plotting against the ID, and the keys are the resource group and the resource type. Let’s say you want to stack it against the operation name: you can drag the operation name onto the keys, and here you can see all of the operation names. It will go ahead and display it again, so here we have a count based on the different operation names. So this is the default visualization that you actually get in the notebooks in Azure Databricks.

11. Lab – Few functions on dates

In this chapter, I just want to go through a few functions when it comes to working with dates. So if I go back onto our data frame and display it in the tabular format here, I should be able to see the timestamps; we do have a column based on the time. So let me take this first set of statements. What am I doing here? I am selecting the column of time, and I’m using the year function to display the year aspect of the time. The same goes with the month, and the same goes with the day of year. These are the built-in functions that are available. To ensure that I can use these date-based functions, I’m using the import statement here.

And then I am selecting all of those different columns. So let’s run this, and here I can see the year, the month, and the day of year. If you want to give more meaningful names, you can actually use an alias; we’ve seen this earlier on to give meaningful names to the columns in the data frame. Let me run this; it’s now giving the different column names. And finally, if you want to convert the timestamp onto a date, you can use the to_date function. Let me run this, and here you can see all of the different dates in place. So in this chapter, I just wanted to go through some important functions when it comes to working with dates.
