
Lack of a schema or descriptive metadata can make the data hard to consume or query. Lack of semantic consistency across the data can make it challenging to perform analysis on the data, unless users are highly skilled at data analytics. It can be hard to guarantee the quality of the data going into the data lake. Without proper governance, access control and privacy issues can be problems: what information is going into the data lake, who can access that data, and for what uses?
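One common way to reduce the metadata and governance problems described here is to register each dataset in a catalog before anyone consumes it. The following is a minimal sketch of that idea, not a real catalog product: `DatasetEntry`, `Catalog`, and all field names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Hypothetical catalog record describing one dataset in the lake."""
    path: str
    owner: str
    description: str
    columns: dict[str, str]  # column name -> declared type
    allowed_roles: set[str] = field(default_factory=set)
    contains_personal_data: bool = False


class Catalog:
    """Hypothetical in-memory catalog; a real one would be a shared service."""

    def __init__(self) -> None:
        self._entries: dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        # Refuse datasets that arrive without descriptive metadata.
        if not entry.description or not entry.columns:
            raise ValueError(f"{entry.path}: description and columns are required")
        self._entries[entry.path] = entry

    def can_read(self, path: str, role: str) -> bool:
        # Unregistered datasets are treated as unreadable by default.
        entry = self._entries.get(path)
        return entry is not None and role in entry.allowed_roles


catalog = Catalog()
catalog.register(DatasetEntry(
    path="raw/sensors.jsonl",
    owner="iot-team",
    description="Raw temperature readings from field devices",
    columns={"device_id": "string", "temp": "number"},
    allowed_roles={"analyst"},
))
```

With this sketch, `catalog.can_read("raw/sensors.jsonl", "analyst")` is true, while any other role, or any path that was never registered, is denied. Requiring a description and column list at registration time is one possible answer to "what information is going into the data lake".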

When a data lake acts as the data source for a data warehouse, the raw data is ingested into the data lake and then transformed into a structured, queryable format. Typically this transformation uses an ELT (extract-load-transform) pipeline, where the data is ingested and transformed in place. Source data that is already relational may go directly into the data warehouse, using an ETL (extract-transform-load) process, skipping the data lake. Data lake stores are often used in event streaming or IoT scenarios, because they can persist large amounts of relational and nonrelational data without transformation or schema definition. They are built to handle high volumes of small writes at low latency, and are optimized for massive throughput. The following table compares data lakes and data warehouses:
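The ELT flow described here can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a production pipeline: a temporary directory stands in for the data lake, an in-memory SQLite database stands in for the structured, queryable target, and the function names, file layout, and record fields (`device_id`, `temp`) are all hypothetical.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def ingest_raw(lake_dir: Path, source: str, payloads: list[str]) -> Path:
    """Extract + load: append raw payloads unchanged, one per line."""
    raw_path = lake_dir / "raw" / f"{source}.jsonl"
    raw_path.parent.mkdir(parents=True, exist_ok=True)
    with raw_path.open("a", encoding="utf-8") as f:
        for payload in payloads:
            f.write(payload.rstrip("\n") + "\n")
    return raw_path


def transform(raw_path: Path, conn: sqlite3.Connection) -> int:
    """Transform: parse the raw records and apply a schema at read time."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (device_id TEXT, temperature_c REAL)"
    )
    rows = []
    with raw_path.open(encoding="utf-8") as f:
        for line in f:
            try:
                record = json.loads(line)
                rows.append((str(record["device_id"]), float(record["temp"])))
            except (ValueError, KeyError, TypeError):
                continue  # skip malformed records; the raw file still keeps them
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


def run_demo() -> tuple[int, float]:
    """Ingest three raw payloads (one malformed), then transform and query."""
    with tempfile.TemporaryDirectory() as tmp:
        raw_path = ingest_raw(Path(tmp), "sensors", [
            '{"device_id": "a1", "temp": 21.5}',
            '{"device_id": "b2", "temp": "19.5"}',
            "not json at all",
        ])
        conn = sqlite3.connect(":memory:")
        loaded = transform(raw_path, conn)
        avg = conn.execute("SELECT AVG(temperature_c) FROM readings").fetchone()[0]
        conn.close()
        return loaded, avg
```

Note that ingestion never parses or validates the payloads, so the malformed line is still persisted in the raw file. Only the transform step imposes a schema, which is what distinguishes ELT from an ETL process that would reject or reshape the record before loading.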

Data lake storage is designed for fault-tolerance, infinite scalability, and high-throughput ingestion of data with varying shapes and sizes. Data lake processing involves one or more processing engines built with these goals in mind, and can operate on data stored in a data lake at scale. Typical uses for a data lake include data exploration, data analytics, and machine learning. A data lake can also act as the data source for a data warehouse.

A data lake is a storage repository that holds a large amount of data in its native, raw format. Data lake stores are optimized for scaling to terabytes and petabytes of data. The data typically comes from multiple heterogeneous sources, and may be structured, semi-structured, or unstructured. The idea with a data lake is to store everything in its original, untransformed state.
