Data lakes are not new, but they are becoming increasingly important, and not just because of full-service cloud data solutions. Good governance reduces the most significant risk associated with data lakes: that the lake becomes a swamp. In this short article, we explain what data lakes are, how they can be used alongside, and not necessarily instead of, a data warehouse, and why governance is so important.

A data lake is a repository that can store all types of data in their original form: structured, semi-structured, and unstructured. Structured data is strictly defined and standardized, stored in traditional tables, and queried using SQL. Semi-structured data has some defined format, e.g., CSV files or image file metadata, but is not bound to a strict table schema. Unstructured data does not follow a predefined format, e.g., images, videos, audio, or free text, and is therefore more difficult to process mechanically.

The need to create a data lake is justified for several reasons: 

The Diversity of Data

In today’s digital world, only some of the data an organization needs is structured and held in an SQL database. Streaming data, social media posts, IoT data, and audio or video recordings also contain valuable business information that the team needs to analyze.

Audiences and Their Needs

Different stakeholders want to work with different data and have very different needs. BI experts and business analysts want processed, structured data so they can gain insights with analytical tools. Data scientists are more interested in the raw form and use languages such as Python or R to create models. A data lake avoids building a separate silo for each audience, so that everyone gets the same data from the same repository, possibly across multiple domains.

Evaluation Time

Because data is stored in its original format when it is added to the data lake, it does not need to be transformed first, so the time needed to make the data available is short. The transformation can be done later, for example when a data warehouse is built on top of the data lake.

However, not all use cases need transformed data. Machine learning and stream processing often work on raw data, and a resource-intensive data warehouse transformation would slow down real-time processing.
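The idea of storing data raw and transforming it only when a consumer needs it can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the sample events, field names, and the functions `read_raw` and `read_for_bi` are all hypothetical.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (one JSON document per line).
RAW_EVENTS = [
    '{"device": "sensor-1", "temp_c": "21.5", "ts": "2024-01-01T10:00:00"}',
    '{"device": "sensor-2", "temp_c": "bad-value", "ts": "2024-01-01T10:00:05"}',
    '{"device": "sensor-1", "temp_c": "22.0"}',
]


def read_raw(lines):
    """Raw consumer (e.g. a data scientist): parse only, no cleaning."""
    return [json.loads(line) for line in lines]


def read_for_bi(lines):
    """BI consumer: apply a schema at read time and skip rows that do not fit."""
    rows = []
    for record in read_raw(lines):
        try:
            rows.append(
                {
                    "device": record["device"],
                    "temp_c": float(record["temp_c"]),  # enforce a numeric type
                    "ts": record["ts"],                 # timestamp is required
                }
            )
        except (KeyError, ValueError):
            # Missing field or unparsable value: not usable for reporting.
            continue
    return rows
```

Both consumers read the same stored lines. Only the BI path pays the cost of validation and type conversion, and the raw records stay untouched for anyone who needs them.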

Can You Replace a Data Warehouse? 

Does the data lake replace the data warehouse? Not necessarily, but it can. A data warehouse is used specifically for data analysis and BI. Raw data from various sources is processed (cleaned, transformed, formatted, merged, etc.) and stored according to a predefined structure. This allows BI teams to work with clean data. When a data warehouse is combined with a data lake, the raw data is first loaded into the lake. The data warehouse then becomes one of many consumers of the lake. This configuration combines both approaches.

Depending on the data warehouse’s complexity and its analytical functions, the data lake can replace the traditional data warehouse. In a data lake architecture, several zones are usually created.

Depending on the concept or the cloud solution, the zones are named differently: landing, raw data, storage, transformation, transport, production, etc. However, the principle is always the same: data moves from zone to zone until it reaches the gold level, i.e., the highest quality and value. A data warehouse can classically build on data of this quality, or parts of the warehouse's data preparation may need to be adapted. If the raw data is still needed in its original form, it remains available in an earlier zone.
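The zone principle can be sketched as a small in-memory model. This is an illustration under stated assumptions: the zone names in `ZONES`, the `Lake` class, and its methods are hypothetical and do not correspond to any particular cloud product.

```python
from dataclasses import dataclass, field

# Hypothetical zone names, ordered from lowest to highest quality.
ZONES = ["landing", "raw", "curated", "gold"]


@dataclass
class Lake:
    # One dictionary of named datasets per zone.
    zones: dict = field(default_factory=lambda: {z: {} for z in ZONES})

    def ingest(self, name, records):
        """New data always enters through the landing zone, unchanged."""
        self.zones["landing"][name] = list(records)

    def promote(self, name, source, target, transform=lambda rows: rows):
        """Copy a dataset to the next zone, optionally transforming it.

        The source copy is kept, so the original form stays available.
        """
        if ZONES.index(target) != ZONES.index(source) + 1:
            raise ValueError("data can only move to the next zone")
        self.zones[target][name] = list(transform(self.zones[source][name]))


lake = Lake()
lake.ingest("orders", [{"id": 1, "amount": "10"}, {"id": None, "amount": "5"}])
lake.promote("orders", "landing", "raw")
# Cleaning step on the way to the curated zone: drop rows without an id.
lake.promote(
    "orders", "raw", "curated",
    transform=lambda rows: [r for r in rows if r["id"] is not None],
)
```

Because `promote` copies rather than moves, a consumer that needs the untouched records can still read them from an earlier zone.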

Data Governance Is Key

Despite all the benefits, the data lake must not become a data swamp, that is, a place where data is loaded and then used haphazardly without further checks or curation. It is therefore important to address data governance issues, i.e., who is responsible for which data, who has access to it and may use it, what internal procedures and policies are in place, etc.

Here are some examples:

Data Catalog

A data catalog is used to document and assess all the data available in the organization. It is a central element of the data lake, because the sheer volume and variety of data must be organized. Once you know what data exists and who is responsible for it, you can determine how to access and use it.
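What a catalog records can be illustrated with a minimal sketch. The `CatalogEntry` fields and the `DataCatalog` class below are hypothetical and far simpler than a real catalog product; they only show the idea of answering "what exists and who is responsible".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    name: str          # dataset name
    owner: str         # person or team responsible for the data
    zone: str          # where in the lake the dataset lives
    data_format: str   # e.g. "csv", "json", "mp4"
    description: str   # free-text documentation


class DataCatalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        """Add a dataset; duplicate names are rejected to keep the catalog unambiguous."""
        if entry.name in self._entries:
            raise ValueError(f"dataset already registered: {entry.name}")
        self._entries[entry.name] = entry

    def owner_of(self, name):
        """Answer the governance question: who is responsible for this data?"""
        return self._entries[name].owner

    def search(self, keyword):
        """Find dataset names whose name or description mentions the keyword."""
        keyword = keyword.lower()
        return [
            e.name
            for e in self._entries.values()
            if keyword in e.name.lower() or keyword in e.description.lower()
        ]


catalog = DataCatalog()
catalog.register(CatalogEntry("orders", "sales-team", "curated", "csv", "Cleaned web shop orders"))
catalog.register(CatalogEntry("call_audio", "support-team", "raw", "mp3", "Recorded support calls"))
```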

Ensuring Access Security

As with the architecture, the access concept must be clearly defined from the outset: at what level access is granted (attribute, file, repository, etc.), how it is granted, and which user groups and roles are used. The diversity of data in one place (one lake) requires a clear access concept. Otherwise, users can gain unauthorized access to data, and misuse can occur.
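A role-based access check at zone level might look like the following sketch. The roles, zone names, and the `is_allowed` function are hypothetical; a real lake would rely on the access control features of its storage platform and might grant access at file or attribute level instead.

```python
# Hypothetical grants: each role maps to the (zone, action) pairs it may perform.
ROLE_GRANTS = {
    "data_scientist": {("raw", "read"), ("curated", "read")},
    "bi_analyst": {("gold", "read")},
    "data_engineer": {
        ("raw", "read"), ("raw", "write"),
        ("curated", "read"), ("curated", "write"),
        ("gold", "read"), ("gold", "write"),
    },
}


def is_allowed(roles, zone, action):
    """Return True if any of the user's roles grants the action on the zone.

    Unknown roles grant nothing, so access is denied by default.
    """
    return any((zone, action) in ROLE_GRANTS.get(role, set()) for role in roles)
```

Denying by default is the important design choice here: a dataset that nobody has thought about yet is inaccessible rather than open to everyone.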

Quality

The fact that a data lake allows unstructured data does not mean that data quality can be neglected. Even within a data lake, transformation and cleaning must be performed at some point. As described above, the data lake can be divided into different zones, and the quality can be improved by transforming the data as it passes through these zones.

Conclusion

When dealing with large amounts of data in different formats, it is worth considering the concept of a data lake. Cloud providers often address data management challenges with data lake solutions that provide tools and methods for creating and managing data stores in a secure and organized way.