Microsoft Azure Data Lake: grow fearlessly

Rapid data growth: be aware and prepared

Continued growth is a central concern for companies. When we talk about growth, we think of things like more employees, more branches and more cash flow. Thanks to today’s technology, we can record all of that information digitally and summarise it into more valuable data. Data gives us better insight into the business, from demand and supply to business performance and customer relations. Nowadays, data goes beyond recording information. It also enables us to correlate and forecast, facilitating short-term planning and long-term strategy making.

Rapid data growth shouldn’t be a problem

Besides all the opportunities and excitement that growth brings, managing fast-growing data becomes challenging. Where and how should data be stored? Which part of the data is useful information and which part is just eating storage resources? How can we extract value from data?

Awareness of data management has been growing in recent years, as the increasing number of data-related positions in the market shows: data analyst, data engineer and data scientist. On the one hand, data professionals extract value from data as the business requires. On the other hand, each step of data transformation, movement, processing and analysis generates more data.

This rapid growth puts pressure on storage. If we think about storage as part of a business growth strategy, what would the ideal situation be?

No bottleneck from storage. Business growth should never feel the limit of storage. Once the business starts to feel the limit, it normally means that business growth has to slow down to put extra energy somewhere else. Ideally, storage should be unlimited.
Centralised management. Data will be hard to manage if it is stored in different software and in different ways. Instead of using different software to handle different data stores and ingestion, it is better to use a single data storage solution that is compatible with various data types and ingestion.
Facilitate business processes. We prefer to put books on a shelf instead of throwing them into a big box, because a shelf lets us organise books into different layers by category and find them easily later. A storage solution should likewise let businesses organise data so that useful information is easy to retrieve.

Azure Data Lake: big data, easy thing

“Azure Data Lake is a scalable data storage and analytics service”. It has two parts: Azure Data Lake Storage and Data Lake Analytics.

AZURE DATA LAKE STORAGE GEN2

The latest version of Microsoft Azure Data Lake Storage is Azure Data Lake Storage Gen2 (ADLS Gen2), which builds on Azure Blob Storage and Azure Data Lake Storage Gen1. ADLS Gen2 offers virtually unlimited storage, but it is more than a big hard drive that gives you a worry-free storage solution. It also helps you to store data strategically and be fully prepared for data ingestion and consumption.

Apart from Storage, Data Lake Analytics is another important part of Azure Data Lake. It is a Software-as-a-Service (SaaS) offering that provides an on-demand analytics job service. You can focus on processing the data itself without configuring a cluster up front. Development can be done not only in the Azure portal but also in the widely used Visual Studio. You can query data with U-SQL, a language that is similar to T-SQL and adds the expressive power of C#.

ADLS and DLA work very well with each other, but that does not mean you have to use both of them. They are two separate components, which can be used individually in combination with other software. For example, ADLS Gen2 also works very well with Azure Synapse Analytics and Azure Databricks.

Why should you consider Azure Data Lake?

SCALABILITY

Unlimited storage empowers rapid business growth.
Easy scaling prepares the business for any big leap in development.

FLEXIBILITY

ADLS Gen2 can adapt to any data format: structured, semi-structured, and unstructured.
Data can be stored in raw form and explored in its native format without the traditional ETL process.
ADLS Gen2 can easily connect with other Azure components. It can serve source data to data warehouses or databases and act as a landing area for transformed data from Azure Data Factory and Azure Databricks. Additionally, Power BI can connect directly to it for analysis.

HIGH AVAILABILITY

ADLS Gen2 always keeps multiple copies of the data so that it is prepared for unexpected failures and disasters. At least three copies of the data are kept within the primary region. Azure storage redundancy offers options ranging from local to global, protecting against local, zone, or regional outages and disasters. Besides disaster recovery, those replicas can also be used to repair corrupted data and so guarantee data integrity.

SECURITY

Auditing: diagnostic logging can be enabled to record data access traces.
Access control: ADLS Gen2 provides access control on individual files and folders and integrates with Azure Active Directory for identity and access management. Permissions on specific files or folders can be granted to specific users. Multi-factor authentication, role-based access control (RBAC), monitoring, and alerting are also available.
Encryption: data stored in ADLS Gen2 is encrypted by default (encryption at rest). Microsoft can manage the encryption key for you, or you can choose to manage it yourself. It is also possible to require secure transfer so that data is encrypted in transit.

BETTER PERFORMANCE

Query performance: ADLS Gen2 has a hierarchical file system. This not only enables the granular security mentioned above but also lets data be partitioned into folders. Therefore, if you connect software that can scan only the relevant partitions, query performance can improve considerably.
Data load: ADLS Gen2 can move or rename data through metadata-only operations, which is fast and cost-efficient.
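
To illustrate why a folder hierarchy helps queries, here is a minimal sketch in plain Python. It does not call any Azure API: the file listing, the folder layout and the prune helper are all hypothetical and only show the idea of skipping folders that a query does not need.

```python
from pathlib import PurePosixPath

# Hypothetical file listing, laid out as <layer>/<source>/<year>/<month>/<file>.
FILES = [
    "raw/webshop/2022/12/orders.parquet",
    "raw/webshop/2023/01/orders.parquet",
    "raw/webshop/2023/02/orders.parquet",
    "raw/pos/2023/01/sales.parquet",
]


def prune(paths, prefix):
    """Keep only the paths that live under the given folder prefix."""
    prefix_parts = PurePosixPath(prefix).parts
    return [
        p for p in paths
        if PurePosixPath(p).parts[:len(prefix_parts)] == prefix_parts
    ]


# A query for webshop data from 2023 only has to touch two of the four files.
selected = prune(FILES, "raw/webshop/2023")
print(selected)
```

Comparing whole path components rather than raw string prefixes avoids accidentally matching a sibling folder such as raw/webshop/20230.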

How to load data to Azure Data Lake

There are different ways to load data into Azure Data Lake, depending on the situation. The following table shows some use cases and the tools that can be used.


USE CASES	TOOLS
Bulk data load	AzCopy
Small dataset load	Azure PowerShell, Azure CLI
Data movement pipeline	Azure Data Factory
Continuous ingestion and incremental load from IoT solution	Azure Data Box Edge and Azure Data Box Gateway
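
For the bulk-load case, the sketch below builds the command line for a recursive AzCopy upload. It only assembles and prints the command and does not run it. The account name, file system name and folder names are hypothetical, and it assumes that authentication has already been handled separately (for example with azcopy login or a SAS token on the destination URL).

```python
import shlex


def azcopy_upload_command(local_dir, account, filesystem, target_dir=""):
    """Build the argument list for a recursive AzCopy upload to ADLS Gen2."""
    # ADLS Gen2 accounts expose a "dfs" endpoint next to the blob endpoint.
    destination = f"https://{account}.dfs.core.windows.net/{filesystem}"
    if target_dir:
        destination += "/" + target_dir.strip("/")
    return ["azcopy", "copy", local_dir, destination, "--recursive"]


# Hypothetical account ("mydatalake") and file system ("raw").
cmd = azcopy_upload_command("./exports", "mydatalake", "raw", "webshop/2023/01")
print(shlex.join(cmd))
```

Returning a list instead of a single string means the command can be passed to subprocess.run without shell quoting problems.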

DATA QUALITY
One of our main priorities is to guarantee that accurate and relevant data is provided to the business in time for decision-making. Together with other services, such as Azure Synapse Analytics, Azure Databricks, and Azure Data Catalog, ADLS Gen2 helps achieve better data quality, including accuracy, timeliness, integrity, and relevance.

Although unlimited storage of different types of data is impressive, unorganised storage of a large amount of data can be a disaster. It is recommended to follow the best practices to maintain a good data lake instead of turning it into a data dump.

Having a data storage plan before loading data is a good start. We can organise the Data Lake into layers and folders to serve different use cases. For example, we can organise data into a raw data layer, an ETL layer, and a consumption layer. Data can be further organised per client, source channel, year, and month.
Use Azure Data Catalog to make data easy to search and recognise.
Purge duplicated and unneeded data. Periodically examine whether data needs to be reorganised to fit business development and adjust if necessary.
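
A storage plan like the one described above can be captured in a small helper so that every pipeline writes to the same folder structure. This is only a sketch: the layer names, the order of the folders and the lake_path function are assumptions made for illustration, not a prescribed layout.

```python
from datetime import date

# Layer names assumed from the example in the text.
LAYERS = ("raw", "etl", "consumption")


def lake_path(layer, client, source, day, filename):
    """Return <layer>/<client>/<source>/<year>/<month>/<filename>."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    # Zero-padded months keep folders in chronological order when sorted by name.
    return f"{layer}/{client}/{source}/{day:%Y}/{day:%m}/{filename}"


# Hypothetical client ("acme") and source channel ("webshop").
path = lake_path("raw", "acme", "webshop", date(2023, 1, 15), "orders.csv")
print(path)
```

Rejecting unknown layer names early keeps ad hoc folders from creeping into the lake.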

SECURITY
As mentioned above, ADLS Gen2 supports auditing, access control, and encryption. Data professionals can make the most of those features by following best practices.

Analyse audit logs periodically to control quality and identify risks.
Apply granular access control per folder/file and role. Update access control promptly when personnel or positions change.
Rotate the encryption key periodically.

LIFECYCLE MANAGEMENT
Managing and consuming data efficiently is what we want to achieve throughout the whole life cycle. ADLS Gen2 provides different access tiers, namely hot, cool, and archive, with decreasing storage cost and increasing retrieval time. Therefore, we can store data more efficiently based on how often it is accessed. ADLS Gen2 also has a built-in lifecycle management feature that lets you define rules to move data to the right tier automatically.
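
As an illustration, the sketch below generates a lifecycle policy document in Python. The field names follow the Azure Storage lifecycle management policy format as we understand it and should be checked against the current documentation before use. The rule name, the folder prefix and the day thresholds are hypothetical.

```python
import json


def lifecycle_rule(name, prefix, cool_after_days, archive_after_days):
    """Build one rule that tiers data to cool and then to archive as it ages."""
    if cool_after_days >= archive_after_days:
        raise ValueError("data must move to cool before it moves to archive")
    return {
        "enabled": True,
        "name": name,
        "type": "Lifecycle",
        "definition": {
            "filters": {
                "blobTypes": ["blockBlob"],
                "prefixMatch": [prefix],  # starts with the container name
            },
            "actions": {
                "baseBlob": {
                    "tierToCool": {
                        "daysAfterModificationGreaterThan": cool_after_days
                    },
                    "tierToArchive": {
                        "daysAfterModificationGreaterThan": archive_after_days
                    },
                }
            },
        },
    }


# Hypothetical thresholds: cool after 30 days, archive after 180 days.
policy = {"rules": [lifecycle_rule("age-out-raw", "raw/webshop", 30, 180)]}
print(json.dumps(policy, indent=2))
```

The resulting JSON can then be reviewed and applied to the storage account through the Azure portal or other management tooling.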

Do you have questions or want more info? Don’t hesitate to reach out to the author of this blog post, Yashan Pang, or contact our Belgian Big Data expert Robin Vanden Ecker via email.