Data lake: store your data without drowning in a lake of data
At a time when the mass of information generated by a company can grow by 50 to 150% from one year to the next, it makes sense to want to get the most out of it.
The infrastructure and architecture required to manage Big Data still put off many companies, particularly the component often described as its heart: the data lake.
What is a data lake? How does it differ from a data warehouse? Which data lake solutions should you choose? Answers in this article.
What is a data lake? Definition
A data lake can be defined first and foremost as a reservoir of raw, only marginally qualified data, in structured or unstructured form. This data can include:
- extracts from relational databases,
- images,
- PDFs,
- feeds or events from business applications,
- semi-structured CSV or log files, etc.
Why use a data lake? Advantages of a data lake
The data lake's first mission is to ingest this raw data en masse, in order to preserve its history for future use:
- behavioral analysis (of a customer or an application),
- predictive AI or machine learning engines,
- or, more pragmatically, the monetization of this information with new partners.
In addition to this main characteristic, other key criteria include:
- its structuring, to make it navigable and avoid a data swamp (a minimal zone-layout sketch follows this list),
- its elasticity, enabling it to grow (and in theory shrink) at high speed in terms of storage and computing power,
- its security, to guarantee the proper use of data.
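To give an idea of what this structuring can look like in practice, here is a minimal sketch of a zone-based layout, assuming an HDFS- or S3-style path convention; the zone names and paths are illustrative, not a standard imposed by any particular tool:

```python
# Illustrative zone layout for a data lake; zone and path names
# are assumptions, not a convention mandated by any specific tool.
from datetime import date

ZONES = {
    "raw":     "/datalake/raw",      # data exactly as ingested, kept immutable
    "curated": "/datalake/curated",  # cleaned, schema-checked data
    "serving": "/datalake/serving",  # aggregates ready for BI tools
}

def landing_path(zone: str, source: str, day: date) -> str:
    """Build a partitioned path such as /datalake/raw/crm/2024/05/01."""
    return f"{ZONES[zone]}/{source}/{day:%Y/%m/%d}"

print(landing_path("raw", "crm", date(2024, 5, 1)))
```

Keeping the raw zone immutable is what preserves the history the lake exists for in the first place.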
Data lake, data warehouse: what's the difference?
Unlike the data lake, the primary aim of the data warehouse is to provide refined data for a precise, recurring need; this requires solid aggregation performance and makes it possible to serve reporting, analysis and sometimes new business applications.
But, with a cost per terabyte stored more than 10 times higher, the data warehouse has reached its limits as the cornerstone of enterprise data.
How can we get the best of both worlds?
What data lake solutions should you consider?
Many large companies, having invested significant sums in their data warehouse, have decided to make a smooth transition to the data lake, with an on-premise solution and a tailor-made set of tools to manage it.
An on-premise solution like the Hadoop data lake
The Apache Software Foundation provides the open-source Hadoop framework, the heart of the data lake's mass ingestion capability, thanks to parallelized, distributed storage.
This framework is enriched by a host of open-source tools that have made data lake implementation financially affordable:
- Kafka for ingestion,
- YARN for resource allocation,
- Spark for high-performance processing (see the PySpark sketch after this list),
- MongoDB as a NoSQL database,
- Elasticsearch and Kibana for content indexing and retrieval,
- and a plethora of other tools (graph databases, auditing, security) that emerge and sometimes disappear as this market becomes more concentrated.
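To make the division of labor between these tools more concrete, here is a minimal PySpark sketch of one common pattern: reading semi-structured logs from the raw zone and persisting them as Parquet for faster downstream analysis. The paths and column names are hypothetical.

```python
# Minimal PySpark sketch: ingest semi-structured logs from the raw zone
# and persist them as Parquet. Paths and column names are hypothetical;
# requires the pyspark package.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("raw-log-ingestion").getOrCreate()

# Read raw CSV logs as ingested (no cleansing at this stage).
logs = spark.read.option("header", True).csv("/datalake/raw/app_logs/2024/*")

# Light qualification: derive a date partition column, keep the rest raw.
# Assumes the logs carry a "timestamp" column.
logs = logs.withColumn("event_date", F.to_date("timestamp"))

# Write columnar files, partitioned by date, into the curated zone.
logs.write.mode("append").partitionBy("event_date") \
    .parquet("/datalake/curated/app_logs")
```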
Ultimately, however, the multiplicity of tools and the possibility of building an ultra-customized environment can result in a very high total cost of ownership, particularly if you have bet on a technology with an uncertain future.
Logically, then, packaged solutions are preferred, such as Cloudera, which absorbed Hortonworks and kept an open-source distribution while, of course, offering a better-supported paid model.
A strong partnership with IBM also aims to deliver robust on-premise solutions.
MapR, acquired in 2019 by Hewlett Packard Enterprise, is being integrated into HPE GreenLake, a cloud solution designed to compete with the giants Amazon, Microsoft, Google and Oracle, which are multiplying partnerships, acquisitions and new developments to build cloud platforms that rival the best on-premise data analysis tools.
A cloud solution like the AWS or Azure data lake
Amazon AWS, Microsoft Azure, Google BigQuery and Oracle Cloud Infrastructure Data Flow all integrate more or less sophisticated tools for data management (migration, lineage, tracking) and analysis (real-time transformation, aggregation, classical analytics or AI models), but this time in the cloud.
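As an illustration of what accessing such a cloud data lake can look like, here is a minimal boto3 sketch that browses and fetches raw objects from an S3 bucket; the bucket name and key prefixes are hypothetical, and configured AWS credentials are assumed.

```python
# Minimal boto3 sketch: browse and fetch raw objects from an S3-backed
# data lake. Bucket and prefixes are hypothetical; requires boto3 and
# configured AWS credentials.
import boto3

s3 = boto3.client("s3")
BUCKET = "my-datalake"  # hypothetical bucket name

# List the raw zone for one source system.
resp = s3.list_objects_v2(Bucket=BUCKET, Prefix="raw/crm/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Fetch a single raw file exactly as it was ingested.
body = s3.get_object(Bucket=BUCKET, Key="raw/crm/2024/05/01/extract.csv")["Body"]
print(body.read(200))  # first 200 bytes, just to inspect
```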
The big advantage of the shared cloud is that it puts aside the hardware issue, which can quickly become a headache when data growth is anticipated.
However, the public cloud has shown its limits, with cases of massive data breaches. IBM's private cloud aims to guarantee the integrity of your data (industrial property, confidential contracts, etc.), while Azure Stack offers an on-premise version of Microsoft's main cloud tools.
Teradata, another world leader in data warehousing, has begun its own shift to the cloud, in the hope of winning back customers put off by the cost of its powerful on-premise servers.
The challenge of good governance
All solutions have their advantages and disadvantages. You must not lose sight of your company's commitments to its customers (GDPR, industrial or professional secrecy) and weigh them against the quest for elasticity, which can entail significant structural and human costs.
Assessing this balance must be part of the essential work of data governance, which must define and structure the data lake and therefore:
- provide a human, technical and technological framework for the data engineers who will be handling terabytes of data on a daily basis,
- facilitate the investigative work of data scientists on their AI and machine learning engines,
- enable users to trace and validate their sources to guarantee the results of their analyses (a minimal lineage-record sketch follows this list).
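As an illustration of that last point, here is a minimal sketch of the kind of lineage record a governance catalog might keep so that users can trace a dataset back to its source; the fields are illustrative assumptions, not the schema of any particular tool.

```python
# Minimal sketch of a lineage record a governance catalog might keep;
# the fields are illustrative assumptions, not a specific tool's schema.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class LineageRecord:
    dataset: str                   # e.g. curated/app_logs
    source: str                    # upstream system or raw-zone path
    owner: str                     # accountable team or person
    ingested_at: datetime
    transformations: list[str] = field(default_factory=list)

record = LineageRecord(
    dataset="curated/app_logs",
    source="/datalake/raw/app_logs/2024",
    owner="data-engineering",
    ingested_at=datetime(2024, 5, 1, 2, 0),
    transformations=["derived event_date", "written as Parquet"],
)
print(record)
```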
This governance will make it possible to grasp the real needs of your core business while authorizing broader exploitation of the data. The aim:
- to enable the emergence of new uses and a new understanding of data,
- to provide your customers with the benefits of greater responsiveness, and even anticipation, in complete security.
Good governance can result in architectures that may seem complex at first glance, but can be both technically and financially beneficial.
The choice of data mesh for a successful big data transition
So, while the data lake is useful, it does not mean that other data management structures will disappear. From the data swamp upstream, to the data warehouse and data marts downstream, and even the dialogue between several of these structures in an international context, good data governance can, on the contrary, put a wider range of tools to work.
By fostering dialogue between these data storage and processing elements, companies can get the most out of each of them:
- historical systems, considered indispensable and reliable, will continue their work,
- and will be able to take advantage of the data lake to, for example, archive cold data or secure raw data sources for better auditing and possible recovery (a minimal offloading sketch follows this list).
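As a sketch of what such cold-data offloading might look like, here is a minimal PySpark example that reads historical rows from a warehouse over JDBC and parks them as Parquet in the lake; the connection details, credentials and table names are hypothetical.

```python
# Minimal sketch: archive cold data from a warehouse table into the lake.
# The JDBC URL, credentials and table names are hypothetical; a matching
# JDBC driver must be on the Spark classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cold-data-archive").getOrCreate()

# Read only the historical rows from the warehouse.
cold = (spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://warehouse:5432/dwh")
        .option("dbtable", "(SELECT * FROM sales WHERE year < 2020) AS cold")
        .option("user", "archiver")
        .option("password", "***")  # placeholder, not a real credential
        .load())

# Park the rows as cheap columnar storage in the lake's archive zone.
cold.write.mode("overwrite").parquet("/datalake/archive/sales_pre_2020")
```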
This data mesh, within a framework of strong governance, will prevent a company from ruining an existing system by embarking on an "all data lake" or even "all cloud" migration, which is sometimes impractical and often inappropriate.
The data mesh will then be a guarantee of acceptance and success in the transition to Big Data.