What is a Data Lake?

A data lake is a huge storage repository where you can store data as objects or in any native format. You can store any amount of data for any length of time.

A data warehouse stores data in a hierarchy of files and folders, whereas a data lake has a flat architecture. Every data element in a data lake is given a unique identifier and tagged with a set of metadata.
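To make the flat architecture concrete, here is a minimal in-memory sketch. The `FlatLake` class and its `put`, `get`, and `find` methods are hypothetical names invented for this illustration; they are not the API of any real data lake product. The sketch only shows the idea of one flat namespace where each element has a unique identifier and a set of metadata tags.

```python
import uuid


class FlatLake:
    """Toy in-memory stand-in for a flat object store (hypothetical, for illustration)."""

    def __init__(self):
        # One flat namespace, no folders: object id -> (raw bytes, metadata tags)
        self._objects = {}

    def put(self, raw, **metadata):
        """Store raw bytes unchanged and return the new unique identifier."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = (raw, dict(metadata))
        return object_id

    def get(self, object_id):
        """Return the raw bytes stored under the given identifier."""
        return self._objects[object_id][0]

    def find(self, **tags):
        """Return ids of objects whose metadata matches every given tag."""
        return [
            oid
            for oid, (_, meta) in self._objects.items()
            if all(meta.get(key) == value for key, value in tags.items())
        ]


lake = FlatLake()
log_id = lake.put(b"2024-01-01 GET /index.html 200", source="web", kind="log")
img_id = lake.put(b"\x89PNG...", source="camera", kind="image")
print(lake.find(kind="log") == [log_id])
```

Elements are located through their identifier or their tags rather than through a folder path, which is what "flat" means here.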

Data lakes are usually configured on a cluster of inexpensive and scalable commodity hardware. This allows data to be dumped in the lake in case there is a need for it later without having to worry about storage capacity. The clusters could either exist on-premises or in the cloud.

How it Works
[Image: sample diagram of how a data lake works]

A data lake is a centralized repository where all types of data are stored. The structure of the data, or schema, is not defined when the data is captured, so a data lake lets you import any amount of data, including data that arrives in real time. Data is collected from multiple sources and moved into the data lake in its original format. This lets you scale to data of any size while saving the time otherwise spent defining data structures, schemas, and transformations up front.

A data lake can include different data types:

Data type | Sample data formats
Structured | Rows and columns, Excel
Semi-structured | CSV
Semi-structured | XML
Semi-structured | JSON
Unstructured | Emails
Unstructured | Documents, PDFs
Unstructured | Images
Unstructured | Audio
Unstructured | Video
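The idea that data lands in its original format and gets a structure only when it is read can be sketched in a few lines. This is a simplified illustration, not a real ingestion pipeline: the `raw_zone` list, the `read_purchases` function, and the sample records are all made up for the example.

```python
import csv
import io
import json

# "Ingest": keep each payload exactly as it arrived; no schema is enforced here.
raw_zone = [
    {"format": "json", "payload": '{"user": "ann", "amount": "12.50"}'},
    {"format": "csv", "payload": "user,amount\nbob,7.25\n"},
]


def read_purchases(zone):
    """Interpret raw payloads as (user, amount) pairs only at read time."""
    for item in zone:
        if item["format"] == "json":
            rows = [json.loads(item["payload"])]
        elif item["format"] == "csv":
            rows = list(csv.DictReader(io.StringIO(item["payload"])))
        else:
            # Formats this particular reader does not understand are skipped;
            # they were still accepted at ingest time.
            continue
        for row in rows:
            yield row["user"], float(row["amount"])


purchases = list(read_purchases(raw_zone))
print(purchases)
```

A different reader could apply a different structure to the same raw payloads later, without re-ingesting anything.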

Advantages of a Data Lake:

Adopt Easily : Because all data is stored in its raw form, users are free to explore beyond any predefined structure. If the result of an exploration proves useful, it can be given a formal schema and developed further. If it does not, it can be discarded, with no changes made to the data structures and no development resources consumed.

Provide faster insight : Because a data lake contains all data and data types, and because users can access data before it has been transformed, cleansed, and structured, users get to their results faster. This benefit is largely a consequence of the other advantages listed here.

Ability to derive value from many types of data

Many ways to query the data, since no single query tool or language is imposed

Scalability : A data lake scales out readily and, at scale, is relatively inexpensive compared to a traditional data warehouse.

Versatility :  A data lake can store multi-structured data from diverse sources. In simple words, a data lake can store logs, XML, multimedia, sensor data, binary, social data, chat, and people data.

Supports not only SQL but more languages : Traditional data warehouse technology mostly supports SQL, which is suitable for simple analytics, but advanced use cases need more ways to analyze data. A data lake provides various options and language support for analysis. Tools such as Hive, Impala, and HAWQ support SQL, and other tools cover more advanced needs. For example, you can use Pig to analyze data in a flow, or Spark MLlib to do machine learning.
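As a small illustration of analyzing the same records in more than one language, the sketch below runs one SQL query and one non-SQL computation over identical data. It assumes nothing about Hive, Impala, HAWQ, Pig, or Spark: Python's built-in `sqlite3` module merely stands in for a SQL engine, and the sensor readings are invented sample data.

```python
import sqlite3
import statistics

# Invented sample records: (sensor name, reading)
readings = [("s1", 20.0), ("s1", 22.0), ("s2", 30.0), ("s2", 35.0), ("s2", 31.0)]

# SQL-style analysis (sqlite3 is only a stand-in for a SQL-on-lake engine).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", readings)
avg_by_sensor = dict(
    conn.execute("SELECT sensor, AVG(value) FROM readings GROUP BY sensor")
)
conn.close()

# Non-SQL analysis of the same records in a general-purpose language.
values_by_sensor = {}
for sensor, value in readings:
    values_by_sensor.setdefault(sensor, []).append(value)
median_by_sensor = {
    sensor: statistics.median(values) for sensor, values in values_by_sensor.items()
}

print(avg_by_sensor, median_by_sensor)
```

The point is only that the data is not tied to one query language; which tool fits depends on the question being asked.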

Advanced Analytics : Unlike a data warehouse, a data lake excels at combining large quantities of coherent data with deep learning algorithms. This helps with real-time decision analytics.

Difference between a Data Lake and a Data Warehouse

Attribute | Data Lake | Data Warehouse
Data structure | Raw | Processed
Accessibility | Highly accessible, flexible, and quick to update | More complicated and costly to make changes
Security | Less secure than a data warehouse | More secure
Cost | Stores huge volumes of data at a lower price | High cost
Process | Schema on read | Schema on write
Users | Data scientists | Business professionals, data analysts
Purpose | Cost-effective data storage | Analytics for business decisions
Tools | Can use open-source tools such as Hadoop and MapReduce | Mostly commercial tools
Data granularity | Data at a low level of detail or granularity | Data at the summary or aggregated level of detail
Agility | Highly agile; can be configured and reconfigured as needed | Less agile than a data lake; fixed configuration
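The granularity difference in the comparison above can be illustrated with a short sketch. The sales events and field names are invented for the example; the summary step is only a rough stand-in for how a warehouse load shapes data up front.

```python
from collections import defaultdict

# Lake side: every raw event is kept at full detail.
raw_events = [
    {"day": "2024-01-01", "store": "north", "item": "tea", "amount": 4.0},
    {"day": "2024-01-01", "store": "north", "item": "coffee", "amount": 3.0},
    {"day": "2024-01-01", "store": "south", "item": "tea", "amount": 5.0},
]

# Warehouse-style load: aggregate up front into a daily total per store.
daily_sales = defaultdict(float)
for event in raw_events:
    daily_sales[(event["day"], event["store"])] += event["amount"]

# A question nobody planned for ("how much tea was sold?") can still be
# answered from the raw events, but not from the per-store summary,
# because the summary no longer records which item was sold.
tea_total = sum(event["amount"] for event in raw_events if event["item"] == "tea")

print(dict(daily_sales), tea_total)
```

The summary is smaller and quicker to read for the question it was built for, while the raw events keep other questions open.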

Disadvantages of Data Lakes:

Transition Period : When moving from a data warehouse to a data lake (or just adding a data lake to your existing system), there’s going to be a transition period.

Complex : Installing a data lake solution on-premises can be much more complex than deploying one in the cloud.

Skillset Learning Curve : The data lake often comes with a new set of tools and services that need to be understood. That requires some additional investment, either from recruiting to get the right team members or doing internal professional development to understand those new tools.

Not Optimized for Query Performance : Although we can very easily store data in the data lake, in order to query that data back out, the data lake doesn’t have the same underlying query engine that a data warehouse does. Those queries may not perform to the level you need.
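A rough way to picture this cost is to compare a lookup that has to scan and parse raw text with a lookup against data that was parsed and indexed at load time. This is a deliberately simplified sketch with invented data and hypothetical names (`raw_lines`, `lake_lookup`, `index`); it does not model any real query engine on either side.

```python
import json

# Raw zone: newline-delimited JSON text, roughly as it might sit in a lake.
raw_lines = [
    json.dumps({"id": i, "city": "Oslo" if i % 2 else "Rome"}) for i in range(1000)
]


def lake_lookup(record_id):
    """Scan and parse raw lines one by one until a match is found."""
    parsed = 0
    for line in raw_lines:
        parsed += 1
        record = json.loads(line)
        if record["id"] == record_id:
            return record, parsed
    return None, parsed


# Warehouse-style: the data was parsed and indexed once, at load time.
index = {rec["id"]: rec for rec in map(json.loads, raw_lines)}

record, lines_parsed = lake_lookup(999)
print(record == index[999], lines_parsed)
```

Here the raw lookup parses every line to find the last record, while the indexed lookup is a single dictionary access; the price of the index was paid earlier, when the data was loaded.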
