Working in the cloud means getting familiar with a lot of services, terminology, and buzzwords. In this article, I’d like to explain one of the most popular analytical services used in the cloud: Databricks.
What does Databricks do?
Databricks is a Software-as-a-Service data platform that can be used for almost all your (big) data needs. It’s a very scalable data tool mostly used for data processing & analysis.
It’s used across a wide range of industries because its strengths lie in flexibility and performance. Whether you’re a data scientist, engineer, or analyst (or none of these), understanding Databricks will prove helpful if you work with data.
Lastly, as it’s built for the cloud, it integrates very well and runs much the same on Azure, AWS, or GCP. Today we’ll focus on Azure Databricks.
How often does Devoteam work with Azure Databricks?
I’ve been working with Azure Databricks on a daily basis for my clients for about 3 years now. Over that time, I have only become more enthusiastic about it.
Originally, Databricks was a data tool mostly aimed at data scientists and Scala developers (Spark itself is written in Scala), but it has gradually grown into a larger data platform that integrates very well with Azure. You can use it for machine learning, batch processing, streaming, ETL pipelines, data governance, orchestration, and more.
It’s also at the forefront of the ‘lakehouse’ architecture, to which we’ll dedicate a separate blog post. In recent months I’ve been experimenting with Databricks SQL – the ‘new’ environment for data analysts – and the Unity Catalog – Databricks’ soon-to-be-released data governance solution.
Even though I’ve developed quite a few applications with it and obtained my first certifications, I still feel like there’s much more to explore. That’s partly because of the wide range of scenarios in which you could use Databricks (more about this later), but also because Databricks keeps releasing new updates and moving forward. All in all, the direction towards a multifunctional, transparent data platform built on open source is exciting.
Why is Databricks promising for a technical audience?
So far it sounds good, right? But how does it work and what makes Databricks stand out from a technical perspective?
1. Databricks is made for big data
Behind the scenes it runs Apache Spark, which is a very fast open-source big data processing framework. Databricks was actually founded by the team who created Apache Spark at UC Berkeley. This framework performs and scales extremely well by using many tricks like in-memory caching and running processes in parallel.
That’s why it can handle petabytes of data (a petabyte is 1,000 TB), far surpassing its ‘predecessor’ MapReduce. You can simply choose your cluster and the processing power, and then get started. And don’t worry: handling millions of records is easy even on the smallest clusters.
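To make "choose your cluster and the processing power" a bit more concrete, here is a minimal sketch of a cluster definition built in Python. The field names follow the Databricks Clusters API; the helper function `build_cluster_spec`, the cluster name, the runtime label, and the VM size are illustrative assumptions, so check which values your own workspace offers.

```python
import json


def build_cluster_spec(name, num_workers, idle_minutes=30):
    """Hypothetical helper: return a cluster definition as a plain dict."""
    return {
        "cluster_name": name,
        # Example runtime label (assumption): pick one available in your workspace.
        "spark_version": "13.3.x-scala2.12",
        # Example Azure VM size (assumption): this sets the power per node.
        "node_type_id": "Standard_DS3_v2",
        # More workers means more parallel processing power.
        "num_workers": num_workers,
        # Shut the cluster down after this many idle minutes to save cost.
        "autotermination_minutes": idle_minutes,
    }


spec = build_cluster_spec("etl-small", num_workers=2, idle_minutes=20)
print(json.dumps(spec, indent=2))
```

A definition like this could be pasted into the cluster's JSON view or sent to the Clusters API. Scaling up is then mostly a matter of changing `num_workers` or `node_type_id`.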
2. Since it’s high-code software it’s flexible
You can choose to run the code in 4(!) different languages – Scala, Python, SQL, or R – in the same notebook. Performance is broadly comparable, as each language is just a layer on top of the same Apache Spark engine. You can easily create your own functions, use libraries, etc. Databricks comes with many libraries pre-installed, and installing additional packages is easy.
This means you do not have the limitations of low-code/no-code data processing tools such as Azure Data Factory or AWS Glue, where you’re dependent on the provided functionality.
3. Ease of collaboration and administration
These two important elements also have to be mentioned. You can work with multiple users simultaneously on the same notebook, leave comments, and jump back to previous versions. The different types of users – data analysts, engineers, and scientists – can also be separated into different workspaces if you want, making administration relatively hassle-free. And for the DevOps professionals: yes, you can integrate it with your CI/CD pipelines, Docker, and Git.
Why is Azure Databricks good for businesses?
I hope by now you’re starting to get an idea of what’s possible with Databricks, but let me show you some examples of how we’ve used Databricks at a few of my clients. Note that all these use cases were cost-efficient, as you pay only for what you use and you can have clusters shut down automatically after a set number of idle minutes.
- ETL pipelines – Handling millions to billions of records, cleaning them, adding business logic, and writing them to the data lake.
- Loading and transforming large amounts of semi-structured data into readily available tables in a performant, scalable manner.
- Data Vault 2.0 modelling – By creating a template with widgets, users could automatically create and upsert hubs, satellites, and links for Data Vault without writing any code.
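The Data Vault template idea can be sketched roughly as follows. This is not the code we used at the client, just a simplified illustration under stated assumptions: the function `build_hub_merge`, the table names, and the `load_date` and `record_source` columns are all hypothetical, and a real hub would usually carry a hash key as well. The function turns three user-supplied values into a Delta Lake `MERGE` statement that only inserts business keys the hub has not seen yet.

```python
import re

# Accept plain identifiers only, so user input cannot inject arbitrary SQL.
_TABLE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*$")
_COLUMN = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_hub_merge(hub_table, source_table, business_key):
    """Hypothetical helper: build a MERGE that upserts new keys into a hub."""
    if not (_TABLE.match(hub_table) and _TABLE.match(source_table)):
        raise ValueError("table names must be plain identifiers")
    if not _COLUMN.match(business_key):
        raise ValueError("business key must be a plain column name")
    return "\n".join([
        f"MERGE INTO {hub_table} AS hub",
        f"USING (SELECT DISTINCT {business_key} FROM {source_table}) AS src",
        f"ON hub.{business_key} = src.{business_key}",
        "WHEN NOT MATCHED THEN",
        # load_date and record_source are assumed column names.
        f"  INSERT ({business_key}, load_date, record_source)",
        f"  VALUES (src.{business_key}, current_timestamp(), '{source_table}')",
    ])


# In a notebook, these values would come from widgets,
# e.g. dbutils.widgets.get("hub_table"), and the result would go to spark.sql(...).
statement = build_hub_merge("dv.hub_customer", "staging.customers", "customer_id")
print(statement)
```

Because the user only fills in the widget values, the same notebook can serve every hub. Satellites and links would follow the same pattern with their own templates.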
The above are just some personal use cases, and they come on top of the ‘standard’ use cases for Databricks. As you can see, Databricks is becoming close to a one-stop shop for all your (big) data needs. How many tools can handle streaming, batch, semi-structured or unstructured data, and machine learning models on huge amounts of data?
Does Databricks have any limitations?
Despite all the above, you won’t need Azure Databricks everywhere. Below are some scenarios where you might want to use other tools:
- If you’re working on-premises, or your developers have barely any experience with SQL, Python, R, or Scala, then Databricks won’t be a good fit.
- If you just copy-paste some data around or need very basic data manipulation, Databricks is probably overkill.
- If you only need storage. Databricks shines most in the data analytics and processing domain; you’d usually store the data externally on Azure, AWS, or GCP and analyse it directly from Databricks (there’s no need to shuffle the data around).
Top 3 reasons to start experimenting with Databricks
- It works – The framework and its performance are proven, the user interface is very user-friendly, and Databricks is rising in popularity because it delivers on its promises.
- It’s flexible – It can be used for numerous use cases in different languages by different types of users.
- It’s quick – Okay, you need to wait for the cluster to start up (unless you keep your cluster running all the time, which I wouldn’t recommend), but once it’s started, uploading and querying data is fast. Visualising and analysing your outputs is easy, so you can always see what’s going on.
How can I learn more?
This article is part of a larger series centred around the technologies and themes found in the first edition of the Devoteam TechRadar. To read further into these topics, please download the TechRadar.
Want to know more about Databricks? Check out these resources:
Advance your data + AI skills with Databricks Academy – Databricks