The new buzzword in tech-talk land is the data lake, and it sounds cool enough in keynote presentations to make everyone sit up and pay attention. Expect database and data warehouse companies to introduce data lake solutions soon. But how will organizations really benefit from data lakes in the next few years? Get the lowdown on data lakes in this quick 10-minute read.
1> What is a Data Lake?
All kinds of data come into an organization’s information systems, captured from different sources and departments. Structured data, for one, is transactional data that can be stored in tables within databases. It comes from ATM & POS machines, ticketing & reservation systems, and even IoT sensors. Unstructured data comprises graphics, audio, video, social media posts, podcasts and so on. Organizing the storage of unstructured data is a challenge, because traditional data management solutions were not designed for these data types.
Now for the definition.
According to TechTarget, a data lake is a storage repository that holds a vast amount of raw data in its native format, until it is needed for processing by different applications. It is a new way to store massive volumes of heterogeneous data.
But doesn’t a data warehouse do that? Well, there is a difference, and we explain it later in the article.
2> Why would your organization ever need one?
Organizations are increasingly dependent on analytics for timely business insights. However, they grapple with the challenge of data management. For years organizations stored data in silos, and each department had its own database. Database specialists and analysts had to collate all the data in one place before business users could get actual insights from it. This is a slow process that impedes decision making. Of course, that problem was solved to an extent when ERP came along. Today, there are new types of data and new data sources, as we explained at the beginning of the article. The volume, variety and velocity of data have all increased. Traditional databases and data warehousing solutions are not equipped to manage and process data at the speed that business users demand. Organizations want to do timely, ad hoc analyses of their data, and they need more agile, scalable and flexible storage systems to do that.
3> What are the capabilities of a data lake?
According to the AWS (Amazon) S3 data lake architecture, the essential capabilities of a data lake are:
Data Ingestion – Data lakes can quickly ingest heterogeneous data coming in from different sources and store it in its raw or native format. They can ingest data without forcing it into a pre-defined schema. To explain this further: with traditional databases you need to write a data definition in advance, specifying certain attributes of the data that can be stored in the database. These attributes include the table structure, the fields in each table, their data types (numeric, character, alphanumeric), maximum and minimum lengths, and constraints.
Central Storage – A data lake can store data in its native format. The central storage repository is decoupled from the compute, so the data lake can scale up cost-effectively as storage requirements increase. That’s because you are not paying for extraneous compute cores.
Catalog & Search – The data management system needs information about the data in order to make it discoverable to applications, and a catalog system enables this. A data lake builds a record of metadata (data about data). It provides tools (like AWS Lambda) to keep the catalog up to date. It also offers a search capability so that users and applications can find the right kind of data in the data lake.
Access & User Interface – The data lake provides an interface for users and applications to access the data. This is usually done through a website with dynamic components and an API gateway that exposes the various functions for searching and discovering the data, and for running queries and analytics.
Processing & Analytics – The data lake should be able to process both real-time and batch data. It must provide the tools to build BI and data visualization on the platform. It should also support different processing and analytics requirements such as predictive and descriptive analytics, machine learning, data science, deep learning and so on.
A data lake should allow you to do some quick ad hoc analysis without first defining schemas for that data. A data lake built on AWS S3, for instance, supports schema on read: structure is applied when the data is queried rather than when it is stored, which is what makes quick analysis possible. We explained schemas earlier in the article, when we talked about data ingestion. The data lake must also offer the capability to run multiple analyses on the same data at the same time.
Protect & Secure – How do you ensure that the right users have access to the right kind of data? The data lake should address all aspects of security and industry compliance through a set of security services. This is backed by highly secure cloud infrastructure.
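To make the ingestion, catalog and schema-on-read ideas above a little more concrete, here is a minimal Python sketch. It is a toy, in-memory illustration only: the TinyLake class, its method names and the sample records are all hypothetical and are not part of AWS or any other product. The point it tries to show is that data is stored untouched, described in a catalog, and given a schema only when someone reads it.

```python
import csv
import io
import json


class TinyLake:
    """Toy in-memory stand-in for a data lake (hypothetical, not an AWS API)."""

    def __init__(self):
        self.objects = {}   # key -> raw bytes, stored exactly as received
        self.catalog = {}   # key -> metadata ("data about data")

    def ingest(self, key, raw, fmt, source):
        # No schema is enforced here: the payload is kept in its native format.
        self.objects[key] = raw
        self.catalog[key] = {"format": fmt, "source": source, "bytes": len(raw)}

    def search(self, **criteria):
        # Return the keys whose catalog entry matches every criterion.
        return sorted(
            key for key, meta in self.catalog.items()
            if all(meta.get(name) == value for name, value in criteria.items())
        )

    def read(self, key, schema):
        # Schema on read: parse and type the fields only when a consumer asks.
        text = self.objects[key].decode("utf-8")
        fmt = self.catalog[key]["format"]
        if fmt == "csv":
            rows = list(csv.DictReader(io.StringIO(text)))
        elif fmt == "jsonl":
            rows = [json.loads(line) for line in text.splitlines() if line.strip()]
        else:
            raise ValueError(f"no reader for format {fmt!r}")
        return [{field: cast(row[field]) for field, cast in schema.items()} for row in rows]


lake = TinyLake()
# Made-up sample data from two very different sources.
lake.ingest("pos/2024-01-01.csv", b"store,amount\nA,12.50\nB,7.25\n", "csv", "pos")
lake.ingest("iot/sensor-7.jsonl", b'{"sensor": 7, "temp": "21.4"}\n', "jsonl", "iot")

# The reader, not the lake, decides which fields matter and what type they have.
sales = lake.read("pos/2024-01-01.csv", {"store": str, "amount": float})
total = sum(row["amount"] for row in sales)
print(lake.search(source="pos"), total)
```

A different application could read the same POS file with a different schema (say, only the store column) without anything being re-ingested, which is the flexibility that schema on read is meant to give.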
4> Can’t you achieve the same thing with Hadoop clusters and Cassandra?
Yes, you can, but it’s complex. Data lakes built on Hadoop have struggled to deliver value because they are complex to deploy and manage, require scarce programming skills, and lack critical data warehousing and SQL capabilities.
Data lakes that decouple the storage from the compute, on the other hand, offer a lot more flexibility. With fixed-cluster solutions such as Hadoop, a data warehouse or Apache Cassandra, you are limited to the single tool on that cluster. A data lake built on AWS S3, by contrast, offers the flexibility to use any of the tools in the ecosystem. That means you can future-proof your architecture. As new use cases and new tools emerge, you can plug and play the current best-of-breed tools.
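As a rough illustration of what decoupling storage from compute buys you, the Python sketch below keeps the raw objects in one plain dictionary and lets two unrelated "engines" work on them. Everything here is an assumption made for the example: the storage dictionary stands in for an object store, and the function names and click records are invented.

```python
import json
from collections import Counter

# Shared storage layer: raw objects only, no compute attached (hypothetical stand-in).
storage = {
    "clicks/day1.jsonl": b'{"user": "u1", "page": "/home"}\n{"user": "u2", "page": "/buy"}\n',
    "clicks/day2.jsonl": b'{"user": "u1", "page": "/buy"}\n',
}


def scan(prefix):
    # Yield parsed records from every object stored under a key prefix.
    for key in sorted(storage):
        if key.startswith(prefix):
            for line in storage[key].decode("utf-8").splitlines():
                yield json.loads(line)


def page_counts(prefix):
    # "Engine" 1: a batch-style aggregation over all matching objects.
    return Counter(record["page"] for record in scan(prefix))


def users_on(prefix, page):
    # "Engine" 2: an ad hoc lookup over the very same objects.
    return sorted({record["user"] for record in scan(prefix) if record["page"] == page})


print(page_counts("clicks/"))
print(users_on("clicks/", "/buy"))
```

Neither function owns the data, so either one could be replaced by a newer tool later without moving or reloading anything in storage.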
5> Does that mean data warehouses are dead?
The data lake will not render databases or data warehouses obsolete; rather, it will complement traditional data management solutions.
The data lake will be the central repository of data. The data warehouse sits downstream of it: it extracts data from the data lake and then conditions that data for further processing.
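The hand-off from lake to warehouse can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the raw order records are made up, the condition function is a hypothetical cleaning step, and an in-memory SQLite database stands in for a real data warehouse.

```python
import json
import sqlite3

# Raw records as they might sit in the lake: inconsistent and untyped (made-up sample data).
raw_lines = [
    '{"order_id": "1001", "amount": "19.99", "country": "us"}',
    '{"order_id": "1002", "amount": "5.00", "country": "DE "}',
    '{"order_id": "1003", "amount": "n/a", "country": "US"}',
]


def condition(line):
    # Clean one raw record; return None when it cannot be made to fit the schema.
    record = json.loads(line)
    try:
        amount = float(record["amount"])
    except ValueError:
        return None
    return int(record["order_id"]), amount, record["country"].strip().upper()


warehouse = sqlite3.connect(":memory:")
# Schema on write: the warehouse table is defined before any row is loaded.
warehouse.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL NOT NULL, country TEXT NOT NULL)"
)
rows = [row for row in map(condition, raw_lines) if row is not None]
warehouse.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

by_country = warehouse.execute(
    "SELECT country, SUM(amount) FROM orders GROUP BY country ORDER BY country"
).fetchall()
print(by_country)
```

The record that could not be conditioned is left out of the warehouse table, but it would still be available in the lake in its raw form for any application that later finds a use for it.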
According to the Wall Street Journal, a data lake, as opposed to a data warehouse, contains the mess of raw unstructured or multi-structured data that for the most part has unrecognized value for the firm. While traditional data warehouses will clean up and convert incoming data for specific analysis and applications, the raw data residing in lakes are still waiting for applications to discover ways to manufacture insights.
6> Which companies offer data lake technology today?
Many companies will claim to have a data lake solution, but it really boils down to which of them have the capabilities to store, manage and process massive amounts of data. Go back to the definition and capabilities of a data lake (as mentioned above).
To name a few:
AWS S3 Data Lake
Microsoft Azure Data Lake