Today's enterprises need raw data fidelity and long-term storage in the cloud—and that's exactly what the data lake can provide. A data lake solution can be the foundation your organization needs to maximize the value of advanced analytics and data management best practices.

What is a data lake?

A data lake functions as a storage system for raw data, one with the massive scalability and broad compatibility that modern enterprises demand in our data-driven era. The long-term object storage containers that make up data lakes follow a "wide and deep" design pattern.

Data lake architecture is thus capable of storing both structured and unstructured data in their raw formats, with no transformation necessary. Although the data lake lacks the schema and other structural elements common to other data management systems, metadata tags and individual identifiers ensure that all of the information it holds can still be located and retrieved by business users.
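To make the role of metadata tags and identifiers concrete, here is a minimal sketch in Python. The `TinyLake` class, its methods, and the tag names are hypothetical stand-ins for illustration, not the API of any real object store. It shows raw payloads stored untouched and then located by tag or retrieved by identifier.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class RawObject:
    """One item in the lake: untouched bytes plus descriptive metadata."""
    object_id: str
    payload: bytes
    tags: dict = field(default_factory=dict)


class TinyLake:
    """Hypothetical in-memory stand-in for an object store."""

    def __init__(self):
        self._objects = {}

    def put(self, payload: bytes, **tags) -> str:
        # Store the payload as-is; no schema is imposed on write.
        object_id = str(uuid.uuid4())
        self._objects[object_id] = RawObject(object_id, payload, dict(tags))
        return object_id

    def get(self, object_id: str) -> bytes:
        # Retrieve the original bytes by unique identifier.
        return self._objects[object_id].payload

    def find(self, **wanted) -> list:
        # Return IDs of objects whose tags match every requested key/value.
        return [
            obj.object_id
            for obj in self._objects.values()
            if all(obj.tags.get(k) == v for k, v in wanted.items())
        ]


lake = TinyLake()
csv_id = lake.put(b"id,amount\n1,9.99\n", source="billing", fmt="csv")
log_id = lake.put(b"2024-01-01 GET /home 200", source="web", fmt="text")
json_id = lake.put(b'{"device": "t-17", "temp": 21.5}', source="iot", fmt="json")

print(lake.find(source="iot"))  # the one ID tagged as IoT data
```

The structured CSV, the free-text log line, and the JSON reading all sit side by side in their original form; only the tags and IDs make them findable.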

The value in data lakes

Data lakes help enterprise teams economically harness and optimize the exploding volumes of data they regularly generate and collect. Early on, the unstructured data now managed via data lakes puzzled some organizations. It came from multiple new sources: web applications, mobile devices, Internet of Things (IoT) devices, and others. Some organizations were skeptical of this "dark data" or lacked appropriate storage, and discarded it.

Before long, most data teams realized this data needed to be stored somewhere, even if only for later use—hence, data lakes. Many early versions were on-premises repositories based on the open-source Hadoop Distributed File System (HDFS). As cloud went from "important" to "essential" for enterprise operations, data scientists and analysts started developing cloud data lake solutions.

Now, data lakes are invaluable to enterprises' data management ecosystems, especially when deployed in the cloud. They are enabled by cost-efficient technologies such as hybrid cloud architecture and cloud-based object storage, and multiple downstream systems draw upon them, including data marts, data warehouses, and recommendation engines.

Key benefits of the data lake include:

Democratizing data access across the enterprise
Data lake architecture can be difficult to navigate at first because it lacks the familiar tabular structure of data warehouses. Early data lakes were often left for data scientists to operate and maintain. However, the advent of data preparation and management tools that allow self-service access to data lakes changed this. Now, enterprise staff who aren't experts in data science or data engineering can still take advantage of data lakes' vast possibilities for analytics insights, in industries ranging from healthcare to finance. Also, easier data access encourages data sharing.

Enabling a new approach to data integration
While the data management field hasn't moved entirely away from tools that use traditional extract, transform, and load (ETL) methods, mitigating data transformation costs has become a major priority. Data lake solutions, which typically represent the newer extract, load, and transform (ELT) approach, offer greater scalability and flexibility than strictly ETL-based systems at a lower cost. Organizations employing the most leading-edge best practices for data management are rebalancing hundreds of data integration jobs across the data lake, data warehouse, and ETL servers, as each has its own capabilities and economics.
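The difference between the two integration styles can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the event records, field names, and the `etl` and `elt` functions are hypothetical. ETL shapes the data before loading, so fields outside the target schema are lost, while ELT loads the raw lines first and defers transformation until a job needs it.

```python
import json

# Hypothetical raw events as they might arrive from different sources.
raw_events = [
    '{"user": "a", "amount": "12.50", "currency": "USD"}',
    '{"user": "b", "amount": "3", "currency": "USD", "coupon": "SPRING"}',
]


def etl(events):
    """ETL: transform first, then load only the shaped rows."""
    warehouse = []
    for line in events:
        record = json.loads(line)
        # Fields outside the target schema (e.g. "coupon") are dropped here.
        warehouse.append({"user": record["user"], "amount": float(record["amount"])})
    return warehouse


def elt(events):
    """ELT: load the raw lines untouched; return a transform to run later."""
    lake = list(events)  # load step: no parsing, no schema

    def transform(fields):
        # Transform step, run on demand with whatever fields a job needs.
        return [{f: json.loads(line).get(f) for f in fields} for line in lake]

    return lake, transform


warehouse_rows = etl(raw_events)
lake_rows, transform = elt(raw_events)
coupons = transform(["user", "coupon"])  # still answerable from the raw data
```

In this sketch the coupon question can only be answered on the ELT side, because the ETL path discarded that field before loading.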

Strengthening—and simplifying—corporate memory retention
Archiving data that has not been accessed in a long time with the help of a data lake can benefit an enterprise's data warehouse by saving storage space. Until the data lake design pattern came along, there was no place to put "colder" data for occasional access except the high-performing data warehouse or truly archaic offline backup systems such as tape storage. With the aid of virtual query tools and data lakes, users can easily access cold data, in conjunction with the warm and hot data in the data warehouse, through a single query.
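The idea of reaching hot and cold data through a single query can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module purely as a stand-in: the two tables play the roles of the warehouse and the lake archive, and the view plays the role of a virtual query layer. The table names and figures are hypothetical, and a real deployment would query separate systems rather than one in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-in for recent, frequently accessed data in the warehouse.
    CREATE TABLE warehouse_orders (order_id INTEGER, year INTEGER, total REAL);
    -- Stand-in for older data archived to the lake.
    CREATE TABLE lake_orders_archive (order_id INTEGER, year INTEGER, total REAL);

    INSERT INTO warehouse_orders VALUES (3, 2024, 40.0), (4, 2024, 60.0);
    INSERT INTO lake_orders_archive VALUES (1, 2015, 10.0), (2, 2016, 20.0);

    -- The view lets users treat both tiers as one logical table.
    CREATE VIEW all_orders AS
        SELECT order_id, year, total, 'hot' AS tier FROM warehouse_orders
        UNION ALL
        SELECT order_id, year, total, 'cold' AS tier FROM lake_orders_archive;
""")

# One query spans both tiers.
rows = conn.execute(
    "SELECT tier, COUNT(*), SUM(total) FROM all_orders GROUP BY tier ORDER BY tier"
).fetchall()
print(rows)
```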

Embracing new forms of analytics
Technologies ranging from Hadoop to Spark Streaming—along with the cloud era as a whole—have given rise to new forms of analytics. Data lakes allow these new forms of business analytics to be efficiently processed at scale, using graphing methods and machine learning algorithms.

Finding insights in non-curated data
Prior to the big data trend, data integration normalized and aggregated critical enterprise information in a standardized repository, such as a database or data warehouse, and the structure this imposed allowed analysts to derive value from key data. But this approach alone is no longer enough to manage all of the data across an enterprise.

Attempting to structure what was once called "dark data" and homogenize it with a uniform format can actually undermine its value. By contrast, data lakes are an ideal environment for data scientists and analysts to mine this information in its original format for insights. The unstructured, unprocessed raw data is more malleable than its structured counterpart, making it especially valuable for machine learning projects.

Common pitfalls of data lakes

On the surface, data lakes appear straightforward, providing a way to manage and leverage massive volumes of structured and unstructured data. But this doesn't mean they're simple to design or maintain, and failed data lake projects are not uncommon across many types of industries and organizations.

Early data lakes faced challenges because best practices for data governance and management hadn't yet been established. Now, it's more common for lakes to offer an uninspiring return on investment (ROI) due to a lack of solid design.

Further factors contributing to data lake difficulties include:

Cluster proliferation
Self-service business intelligence (BI) tools and similar resources that make data lakes more accessible can be a double-edged sword: The ease of use might make some users believe that lakes can be set up ad hoc in the cloud. Although that's technically true, it's known as cluster proliferation if it becomes too common. Proliferation easily leads to redundancy, inconsistency, synchronization problems, and difficulty reconciling any two lakes. In other words, it has the potential to be just as bad as data siloing.

Lack of end-user adoption
Conversely, if your organization doesn't have those self-service tools, fewer users will reap the data lake's advantages. Non-experts may think getting answers from data lakes requires premium coding skills, which isn't true as long as these users have the right complementary solutions.

Limited commercial off-the-shelf tools
Many vendors of data lake solutions claim that their products are compatible with either Hadoop or cloud object storage services like Amazon S3 and Microsoft Azure Blob Storage. However, a significant number of these offerings lack deep integration capabilities and thus don't provide the democratized access to data that lakes are supposed to facilitate. Moreover, many of these products were built to work with data warehouses rather than data lakes, limiting their ability to maximize value from unstructured data.

Conflicting objectives for data access
In all aspects of data management—including oversight of data lakes—it's critical to strike the proper balance between keeping strict security measures in place and facilitating agile access. One can't be seen as more important than the other, but because data lakes themselves don't necessarily have native security features, it's not uncommon to see organizations err on the side of greater caution. Stakeholders should align on best practices that address both sides of this issue. Zero-trust security tools are a possible solution, as they are extremely protective against unauthorized access but can be programmed to allow unfettered data lake access to authorized users.
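As a rough illustration of how a deny-by-default policy can still give authorized users free access, consider the following Python sketch. The user names, key prefixes, and the `is_allowed` function are hypothetical, and real zero-trust tools evaluate far more signals than these two.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    user: str
    device_verified: bool
    object_key: str


# Hypothetical grants: each user maps to the key prefixes they may read.
GRANTS = {
    "analyst_ana": ("raw/sales/",),
    "engineer_raj": ("raw/sales/", "raw/iot/"),
}


def is_allowed(request: AccessRequest) -> bool:
    """Deny by default; allow only verified devices reading granted prefixes."""
    if not request.device_verified:
        return False
    prefixes = GRANTS.get(request.user, ())
    return any(request.object_key.startswith(p) for p in prefixes)


checks = [
    is_allowed(AccessRequest("engineer_raj", True, "raw/iot/2024/sensor.json")),
    is_allowed(AccessRequest("analyst_ana", True, "raw/iot/2024/sensor.json")),
    is_allowed(AccessRequest("engineer_raj", False, "raw/sales/q1.csv")),
    is_allowed(AccessRequest("unknown", True, "raw/sales/q1.csv")),
]
```

Only the first request succeeds: it comes from a verified device and a user with a matching grant. Every other combination falls through to the default denial.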

Cost and resource concerns
Enterprises essentially must choose between adopting a managed data lake solution and building their own from scratch using open-source tools like Hadoop and its derivatives. With the former, organizations tie themselves to vendors that can increase subscription fees at any time. The latter requires plenty of time and technical expertise to set up and maintain, and will likely be fairly expensive to pull off.

Management and governance issues
For organizations with data teams that have used traditional databases and data warehouses until now, managing a data lake can be difficult at first. Data scientists, engineers, and analysts must work together to carefully manage data partitioning, metadata tagging, data integrity, and infrastructure upgrades to support scalability. Some believe the data lakehouse framework to be an effective workaround for these management issues, but this may not always be the case: See the Frequently Asked Questions section for more information.
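One common way to approach the partitioning task mentioned above is to encode partition values in object keys, in the `key=value` directory style popularized by Hive. The sketch below assumes a date-based layout; the dataset name, file names, and helper functions are hypothetical.

```python
from datetime import date


def partition_key(dataset: str, event_date: date, filename: str) -> str:
    """Build an object key with Hive-style date partitions."""
    return (
        f"{dataset}/year={event_date.year:04d}/month={event_date.month:02d}/"
        f"day={event_date.day:02d}/{filename}"
    )


def keys_in_month(keys, dataset: str, year: int, month: int) -> list:
    """Select one month of data by prefix alone, without opening any files."""
    prefix = f"{dataset}/year={year:04d}/month={month:02d}/"
    return [k for k in keys if k.startswith(prefix)]


keys = [
    partition_key("clickstream", date(2024, 3, 7), "part-0001.json"),
    partition_key("clickstream", date(2024, 3, 8), "part-0001.json"),
    partition_key("clickstream", date(2024, 4, 1), "part-0001.json"),
]
march = keys_in_month(keys, "clickstream", 2024, 3)
print(march)
```

Because the partition values live in the key itself, a job that needs only March data can skip everything else by prefix, which is one reason partition design deserves care up front.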

Proper data governance is also essential for an effective data lake. This may require the use of programmatic administration tools due to the data lake's sheer volume, and these aren't always a part of organizations' existing data governance frameworks. But without that method of governance, a data lake can quickly become a data swamp—difficult to access and near-impossible to navigate.
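A programmatic governance check might look something like the following sketch, which flags objects that are missing required metadata tags. The required tag set, the catalog contents, and the `audit` function are hypothetical; an actual policy would be defined by the organization's own governance framework.

```python
# Hypothetical policy: every object must carry these metadata tags.
REQUIRED_TAGS = {"owner", "source", "retention"}


def audit(catalog: dict) -> dict:
    """Return {object_key: sorted missing tags} for non-compliant objects."""
    report = {}
    for key, tags in catalog.items():
        missing = sorted(REQUIRED_TAGS - set(tags))
        if missing:
            report[key] = missing
    return report


# Hypothetical catalog mapping object keys to their metadata tags.
catalog = {
    "raw/sales/q1.csv": {"owner": "finance", "source": "erp", "retention": "7y"},
    "raw/iot/sensor.json": {"source": "gateway"},
    "tmp/export_final_v2.bin": {},
}

report = audit(catalog)
for key, missing in report.items():
    print(f"{key}: missing {', '.join(missing)}")
```

Run on a schedule across millions of objects, a check of this kind is one way to catch untagged data before it accumulates into a swamp.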

Maximizing the data lake's potential

Making the most of a data lake isn't about adopting it as a be-all, end-all approach. Instead, data teams should consider it another valuable part of their enterprises' data ecosystems, using the data lake alongside the data warehouse and leveraging each for its key strengths. 

Additionally, the design pattern of the data lake is far more important than the technology upon which it's built. Hadoop isn't a prerequisite, nor is any other single cloud data lake platform. Multiple technologies can be used in tandem to form the data lake. Meanwhile, a proper data lake design pattern offers a framework for workloads and data management expectations that will guide successful implementation.

Teradata data lake solutions

Teradata VantageCloud Lake combines the data processing, integration, and analytics capabilities of classic Vantage (the connected multi-cloud data platform for enterprise analytics) with an all-new cloud-native architecture. Using VantageCloud, your enterprise can unearth the nuggets of critical information within unstructured, semi-structured, and structured data, and use them as the driving forces of strategies to support the organization's vision.

The Teradata team is well-versed in the ins and outs of various data lake technologies, including Hadoop, Cassandra, and the object storage products from Amazon Web Services, Microsoft Azure, and Google Cloud. We'll work with your team to ensure VantageCloud Lake integrates smoothly into your data ecosystem.

VantageCloud

Accelerate value with Teradata VantageCloud

VantageCloud, formerly Vantage in the Cloud, offers two deployment options on AWS: Teradata VantageCloud Lake and VantageCloud Enterprise. VantageCloud Lake’s next-generation, cloud-native architecture enables your teams to experiment and innovate while saving money and maintaining governance. With VantageCloud Enterprise, you can leverage the same industry-leading analytics and fast, secure data access for managing your enterprise-level needs.

Frequently asked questions

What is an example of a data lake?

Traditionally, data lakes were repositories of unstructured data hosted on on-premises servers and built on the Hadoop Distributed File System (HDFS). In our cloud-first era, data lakes are just as commonly found in the cloud, with unstructured data housed within object stores like Amazon S3, Azure Blob Storage, and Google Cloud Storage. In a hybrid cloud deployment, Hadoop and cloud tools can support the data lake together.

What is a data lake vs. a data warehouse?

Data lakes are so named because these repositories are wide, deep spaces that don't have much structure. As such, they're most associated with storing unstructured data, but they can contain semi-structured and structured data as well. By contrast, data warehouses are rigidly structured as tables with columns and rows, and contain only structured or semi-structured data.

What is a data lake vs. a database?

Effectively, a database is a "data puddle" compared to a data lake. Data lakes can store numerous databases because of their compatibility with database file formats. However, it's generally easier to access database information if it is aggregated as part of a warehouse—which is where an offering like Teradata VantageCloud can provide an ideal solution.

What is a data lakehouse?

A data lakehouse combines the structured data management and processing ability of a data warehouse with the inexpensive storage capacity of a data lake. Some in the data management field consider the data lakehouse framework to be the way of the future, while others are skeptical about how feasible it would be to implement.
