Getting Started with Databricks
Introduction to Databricks
Databricks (full name: Databricks Lakehouse Platform) is a unified platform that integrates with cloud storage and lets you build, share, and maintain data, analytics, and AI solutions at enterprise scale. Databricks runs on top of Apache Spark and can be used for dashboards and visualizations, data discovery and exploration, and machine learning modeling; it integrates with developer tools such as PyCharm and Visual Studio Code and with BI solutions such as Power BI and Tableau. It provides a one-stop shop for most data needs in a single platform that abstracts away much of the complicated setup and maintenance its many components would otherwise require. In this tutorial, we’re going to set up Databricks on a cloud platform, do some basic configuration, and start working with some data.
Setting up a Databricks Account
Setting up Databricks on your preferred cloud platform, whether Amazon Web Services (AWS), Google Cloud, or Microsoft Azure, is a simple process. This can be done either through Databricks’ site or via the marketplace on your preferred platform. Note that there is a difference between these methods: signing up via a marketplace typically manages Databricks billing through the cloud account, while signing up via Databricks’ website manages billing through the Databricks account console. Either way, Databricks offers a 14-day free trial to explore before billing begins. In addition, Databricks offers a limited, free community edition that you can sign up for on their website.
The following shows the options on Databricks’ website after you enter your contact information, highlighting the link for the community edition:
Note: In this tutorial, you’ll use a free-tier AWS account to set up a trial version of Databricks. If your usage exceeds AWS’s limits for the basic free tier, or you subscribe to Databricks past the 14-day trial, you will be billed for the service.
To get started, log into your AWS console and go to the AWS Marketplace. Click on the menu button on the left side of the screen to open the Marketplace menu, then click on “Discover products” and type “Databricks” in the search bar.
Select “Databricks Lakehouse Platform”. Clicking on the link takes you to Databricks’ product page in AWS Marketplace. To sign up for Databricks, click on “view purchase options,” which will take you to the subscribe page:
Clicking “subscribe” will open a banner prompting you to set up your Databricks account. Clicking on “set up your account” will take you to Databricks’ signup page. Filling this out and clicking “Sign up” will send you an email to verify your email address and set up a password. Once you do this, Databricks will start guiding you through creating your first workspace.
Setting up a Workspace
After setting your Databricks password and clicking “continue”, you will be brought to a screen to set up your first workspace:
You need to enter a name for your workspace and pick an AWS region.
For this tutorial, you should click on the dropdown and pick a region as close as possible to where you’re working from. (Note that in a production environment, you’ll want a region as close to your users as possible.) In the above image, “Ohio (us-east-2)” happens to be the closest region to where this tutorial was written. You can see a global map of regions on AWS’ Global Infrastructure page.
Once you’ve chosen a name and made your region selection, click on the “start quickstart” button. This routes you back to your AWS console, to a “quick create stack” page. There are several fields on this page, most of which are already filled out for you. All you need to do to proceed is scroll down to “Databricks Account Credentials” and enter your password, then scroll to the bottom of the page and check the box labeled “I acknowledge that AWS CloudFormation might create IAM resources with custom names.” Then you can click on the “create stack” button.
Once you click “create stack”, you will be brought to a page with information on your stack, telling you that its creation is in progress. On the right side, you’ll see the various elements of the stack as they are created and completed. On the left side is a box with the overall progress of your stack.
The process will take a few minutes, but eventually you’ll get the message that the stack creation is complete.
You now have a Databricks workspace in your AWS account!
In the account console, click on “Workspaces”: either the workspace tile on the account home page or the workspace icon on the toolbar menu to the left.
Now you’ll see a list of workspaces in your account that shows the one you just created. It also has a “Create workspace” button in the upper right that will allow you to create new workspaces in the future.
Now you need to create a cluster to run your code.
Creating a Simple Cluster
You need to have a cluster to run any code, so you’ll need to create one for your workspace. To do this, you’ll need to log into the workspace on Databricks. To log into a workspace, click on “Workspaces” on the Databricks Account Console and click on the “Open” link on the row in the workspace list for the workspace you want. (You may be prompted to log in again, remember to use your Databricks credentials, not your AWS credentials.)
This will open the home page for your workspace. If this is your first time logging in, you may have to click “skip onboarding” to get to the home page.
To create your cluster, you need to click on the “compute” item on the menu to the left of the workspace home page.
This will open a list of compute resources in your workspace, which is initially empty.
To create your new cluster, click on the “create compute” button in the upper right. It will open a page where you can configure the details of your cluster. Change the cluster name by clicking the edit icon next to it at the top of the page.
After changing the cluster name, accept the defaults and click on “Create compute”. This will lock the screen for editing, and you’ll see a little progress wheel appear next to the cluster name. If you click on the “compute” icon in the menu on the left again, you’ll be taken back to the list of compute resources, which now has your new cluster listed. The state column will tell you when the cluster is up and running, which may take a few minutes.
Next, let’s load some data!
There are many ways to import data into your workspace, and many sources that data can come from. In this case, you’re going to use the GUI to load a CSV dataset. First, you need a dataset to use; for this tutorial, go to Kaggle and download a dataset of Google Play Store Apps. (If you don’t have a Kaggle account, you’ll have to register. It’s free.) Once you download the archive.zip file, unzip it into three files. The one you’ll be using is googleplaystore.csv. Once you have this file locally, you will import it into the workspace.
To do this, click on the “new” button at the top of the menu on the left of the workspace homepage. Then select “file upload” from the submenu.
This will open a tab where you can drag and drop files or browse for them on your PC. Click on “browse”, navigate to where you saved the googleplaystore.csv file, and select it. This will open a preview tab where you can see the data in the file before you create the table. It may take a few moments for the data to populate the preview. When it’s complete, you should have something that looks like this:
Above the data preview are boxes that select the catalog, schema, and table name for the imported table. You can create a new schema or change the table name. Stick with the default catalog and schema, and “googleplaystore” for the table name.
Next to each column name, there is a little icon that represents the data type Databricks guessed for the column. You can see that it thinks “Reviews” and “Price” are text columns. You can change this by clicking on the icon next to the name; use it to change “Reviews” and “Price” into numeric columns (such as “bigint” for “Reviews” and “decimal” for “Price”).
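If you ever skip this step and import the columns as text, the same conversion can be done in SQL after the fact. The sketch below is illustrative rather than tied to this exact import; note that the raw Kaggle Price values include a leading “$”, which needs stripping before a cast will parse cleanly:

```sql
-- Cast text columns imported from the CSV to numeric types.
SELECT CAST(Reviews AS BIGINT) AS Reviews,
       CAST(REPLACE(Price, '$', '') AS DECIMAL(10, 2)) AS Price
FROM googleplaystore;
```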
Also, on this tab, you can click on a column header name to change it, and you can click on the down arrow next to the column header to see a menu where you can exclude the column from the new table, or reset it, removing all changes. Don’t make any more changes and click on “create table”. Databricks will inform you that your table is being created.
Once the table is created, you’ll be shown the Catalog Explorer tab, focusing on the new table. Here you can add tags and comments to the various columns, and the dataset as a whole. You can also have AI suggest comments for you.
Now that you have some data it’s time to do things with it.
Creating and Using Notebooks
Notebooks are an interactive tool that allows you to run code and query data in a way that can be saved and shared. If you’ve used Jupyter Notebooks before, Databricks’ notebooks should feel familiar, as they work in much the same fashion. You’ll now create a new notebook to manipulate the new googleplaystore data table.
Click on the “new” button at the top of the menu on the left of your workspace homepage. Then select “notebook” from the submenu.
The new notebook will be named “Untitled Notebook” followed by a date stamp; it will default to the most recently used language and attach to the most recently used cluster.
To change the name, click on the title of the notebook and edit it. Change it to something like “Google Play Store Notebook”.
To change the notebook’s default language, you can click on the language drop-down next to the title and choose the default language from the dropdown. Choose Python if it isn’t already chosen.
The cluster being used is shown in the second button to the right. (If the cluster isn’t running, it will read “Terminated”.) To change the cluster (or start it up), just click on the button and choose the cluster from the dropdown menu.
Try out the new notebook by running some Python code. First, make sure the cluster is running by checking that the button above shows a green dot and the name of the cluster. Code will not run unless you’re attached to an active cluster.
Now you can type in a snippet of code and press Shift+Enter to run it. The results will be computed, and the output will be displayed in the cell underneath your code.
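Any Python works here; a throwaway cell like the following (the names are purely illustrative) is enough to confirm the cluster is executing code:

```python
# A quick sanity check that the attached cluster is running your code.
message = "Hello, Databricks!"
squares = [n ** 2 for n in range(1, 6)]

print(message)
print(squares)  # [1, 4, 9, 16, 25]
```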
Now that you have your notebook up and running, you can work with your data. The most straightforward way to do this is with a SQL query. The notebook is using Python, but there are two ways to get a cell to run SQL:
- Use the language dropdown in the upper right of the cell. This can change the cell’s language without affecting other cells.
- Or, use the magic command %sql before your SQL query.
This way you can run a simple select query on your dataset and view the results.
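For example, a notebook cell like the one below takes a quick look at the first few rows using the %sql magic (the column names come from the googleplaystore.csv header):

```sql
%sql
SELECT App, Category, Rating, Reviews
FROM googleplaystore
LIMIT 10;
```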
To give the query results a more modern feel, click on the “new result table” link in the upper right and switch it to “ON”. This produces a nicer output.
Next, try another query using some aggregate functions, calculating the total reviews by category and limiting the results to categories with total reviews of over 100 million. Take note of the “+” in the top left of the table display.
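One way to write that aggregate query (assuming “Reviews” was imported as a numeric column, as above):

```sql
%sql
SELECT Category,
       SUM(Reviews) AS total_reviews
FROM googleplaystore
GROUP BY Category
HAVING SUM(Reviews) > 100000000
ORDER BY total_reviews DESC;
```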
Now if you click on the “+” you get a menu with “Visualization” and “Data Profile”. Click on “Visualization”. This pulls up a Visualization Editor that allows you to play around with various charts to display your data.
If you click on “save”, the visualization will be saved with the table preview in your cell.
This provides a convenient and intuitive way to incorporate data visualizations into your notebooks.
Now, if you’re going to do more serious manipulation of data, things like cleaning and tidying, you probably want more than SQL. This is where Python and dataframes come into play. Spark dataframes support all the SQL operations you might need, but they offer much more for manipulating data. Your notebook’s default language is already set to Python, so you can load the googleplaystore table into a Spark dataframe with the spark.read.table function.
Another way to get a Spark dataframe is to pass a SQL query to the spark.sql function.
Now if you’re familiar with Pandas dataframes, you can convert the Spark dataframe into a Pandas dataframe with the toPandas method.
You’re now able to perform any dataframe operations on your data.
This tutorial covered what you need to get started using Databricks with AWS. We covered signing up for Databricks and creating your first workspace, showed how to create a cluster for your coding operations, and finally showed how to import data into your workspace and create a notebook to work with that data. There is much more to explore with Databricks, from data lakes to machine learning, and this tutorial should be a first step in learning the Databricks platform.