A data lake is a large storage repository that holds data as objects in their native format. You can store data of practically any volume and keep it for as long as needed.
In a data warehouse, data is stored in files and folders, whereas a data lake has a flat architecture. Every data element in a data lake is given a unique identifier and tagged with a set of metadata.
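The flat layout with unique identifiers and metadata tags can be sketched as follows. This is a minimal in-memory illustration, not how any particular product works: the class `FlatObjectStore`, its methods, and the tag names `source` and `kind` are all hypothetical.

```python
import uuid


class FlatObjectStore:
    """Toy in-memory stand-in for a data lake's flat namespace (hypothetical)."""

    def __init__(self):
        # One flat mapping: object id -> (raw bytes, metadata tags). No folders.
        self._objects = {}

    def put(self, payload: bytes, **metadata) -> str:
        """Store a payload unchanged and return its unique identifier."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = (payload, dict(metadata))
        return object_id

    def get(self, object_id: str) -> bytes:
        """Return the raw payload stored under the given identifier."""
        return self._objects[object_id][0]

    def find(self, **tags) -> list:
        """Return ids of objects whose metadata matches every given tag."""
        return [
            object_id
            for object_id, (_, meta) in self._objects.items()
            if all(meta.get(key) == value for key, value in tags.items())
        ]


store = FlatObjectStore()
log_id = store.put(b"2024-01-01 ERROR disk full", source="app-server", kind="log")
img_id = store.put(b"\x89PNG...", source="camera-7", kind="image")
```

Objects are located by identifier (`store.get(log_id)`) or by their tags (`store.find(kind="log")`), never by a folder path.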
Data lakes are usually configured on a cluster of inexpensive and scalable commodity hardware. This allows data to be dumped in the lake in case there is a need for it later without having to worry about storage capacity. The clusters could either exist on-premises or in the cloud.
How it Works
The sample image below shows how it works.
A data lake is a centralized repository where all types of data are stored. The structure of the data, or schema, is not defined when the data is captured, so a data lake can import any amount of data, including data that arrives in real time. Data is collected from multiple sources and moved into the data lake in its original format. This lets you scale to data of any size while saving the time otherwise spent defining data structures, schemas, and transformations.
A data lake can include different data types:
|Data type||Data format||Sample|
|Structured||Rows and columns||Excel|
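Storing data in its original format and applying a schema only when it is read can be sketched in a few lines. This is a simplified illustration under stated assumptions: the `raw_zone` list, the `read_with_schema` helper, and the `user`/`amount` fields are made up for the example.

```python
import csv
import io
import json

# Raw payloads kept exactly as they arrived; the sources and formats are made up.
raw_zone = [
    {"format": "json", "body": '{"user": "ana", "amount": "12.50"}'},
    {"format": "csv", "body": "user,amount\nbo,7.25\n"},
]


def read_with_schema(entry):
    """Parse one raw entry and apply a (user: str, amount: float) schema at read time."""
    if entry["format"] == "json":
        rows = [json.loads(entry["body"])]
    elif entry["format"] == "csv":
        rows = list(csv.DictReader(io.StringIO(entry["body"])))
    else:
        raise ValueError(f"no reader for format {entry['format']!r}")
    # The schema (field names and types) is imposed here, not at ingestion.
    return [{"user": str(r["user"]), "amount": float(r["amount"])} for r in rows]


records = [row for entry in raw_zone for row in read_with_schema(entry)]
```

Nothing was validated or converted when the payloads were stored. A different reader could apply a different schema to the same raw entries later.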
Advantages of Data Lake:
Adopt Easily : Because all data is stored in raw form, users can explore it beyond any predefined structure. If the result of an exploration proves useful, it can be promoted to a formal schema and further development. If it does not, it can be discarded, and no data structures have been changed and no development resources consumed.
Provide faster insight : Because data lakes contain all data and data types, and because users can access data before it has been transformed, cleansed, and structured, users get to their results faster. This advantage is largely a result of the other advantages listed here.
Ability to derive value from many types of data
Many ways to query the data
Scalability : It offers scalability and is relatively inexpensive compared to a traditional data warehouse when we take scalability into account.
Versatility : A data lake can store multi-structured data from diverse sources. In simple words, a data lake can store logs, XML, multimedia, sensor data, binary, social data, chat, and people data.
Supports not only SQL but more languages : Traditional data warehouse technology mostly supports SQL, which is suitable for simple analytics, but advanced use cases need more ways to analyze data. A data lake provides various options and language support for analysis. Tools such as Hive, Impala, and HAWQ support SQL, and other tools cover more advanced needs. For example, you can use Pig to analyze data in a flow, or Spark MLlib for machine learning.
Advanced Analytics : Unlike a data warehouse, a data lake excels at combining large quantities of coherent data with deep learning algorithms, which helps with real-time decision analytics.
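The point about analyzing the same data through SQL and through other languages can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module purely as a stand-in for a SQL engine such as Hive or Impala, and plain Python as a stand-in for a non-SQL tool. The sensor readings are invented for the example.

```python
import sqlite3
import statistics

# Hypothetical sensor readings, as they might be read out of the lake.
readings = [("s1", 20.5), ("s1", 21.0), ("s2", 19.0), ("s2", 35.0), ("s2", 19.5)]

# SQL path: load into an in-memory SQLite table and aggregate with a query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", readings)
avg_by_sensor = dict(
    conn.execute("SELECT sensor, AVG(value) FROM readings GROUP BY sensor")
)
conn.close()

# Non-SQL path: group the same readings and compute a median in plain Python.
values_by_sensor = {}
for sensor, value in readings:
    values_by_sensor.setdefault(sensor, []).append(value)
median_by_sensor = {s: statistics.median(v) for s, v in values_by_sensor.items()}
```

Both results come from the same underlying readings. Only the tool used to analyze them differs.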
Difference between Data Lake and Data Warehouse
|Data Lake||VS||Data Warehouse|
|Highly accessible and quick to update||Accessibility and flexibility||More complicated and costly to change|
|Less secure compared to a data warehouse||Security||More secure|
|Stores huge volumes of data at a lower price||Cost||High cost|
|Schema on read||Process||Schema on write|
|Data Scientists||Users||Business Professionals, Data Analysts|
|Cost-effective data storage||Purpose||Analytics for business decisions|
|Can use open source tools like Hadoop and MapReduce||Tool||Mostly commercial tools|
|Data at a low level of detail or granularity||Data Granularity||Data at the summary or aggregated level of detail|
|Highly agile; can be configured and reconfigured as needed||Agility||Less agile than a data lake, with a fixed configuration|
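The data granularity row in the table above can be made concrete with a short sketch. The raw events stand for the detailed records a data lake keeps, and the summary stands for the aggregated form a data warehouse typically holds. The event fields and the `summarize` helper are assumptions made for illustration.

```python
from collections import defaultdict

# Fine-grained data as a lake might keep it: one record per event (made-up values).
raw_events = [
    {"day": "2024-01-01", "product": "A", "amount": 10.0},
    {"day": "2024-01-01", "product": "A", "amount": 5.0},
    {"day": "2024-01-01", "product": "B", "amount": 7.5},
    {"day": "2024-01-02", "product": "A", "amount": 2.5},
]


def summarize(events):
    """Aggregate events to one total per (day, product), warehouse-style."""
    totals = defaultdict(float)
    for event in events:
        totals[(event["day"], event["product"])] += event["amount"]
    return dict(totals)


daily_summary = summarize(raw_events)
```

The summary can always be rebuilt from the raw events, but the individual events cannot be recovered from the summary. This is one reason to keep the detailed data in the lake.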
Disadvantages of Data Lakes:
Transition Period : When moving from a data warehouse to a data lake (or just adding a data lake to your existing system), there’s going to be a transition period.
Complex : Installing and operating a data lake solution on-premises can be much more complex than using a cloud-hosted one.
Skillset Learning Curve : The data lake often comes with a new set of tools and services that need to be understood. That requires some additional investment, either from recruiting to get the right team members or doing internal professional development to understand those new tools.
Not Optimized for Query Performance : Although we can very easily store data in the data lake, in order to query that data back out, the data lake doesn’t have the same underlying query engine that a data warehouse does. Those queries may not perform to the level you need.