If you’re wondering about the differences between “data lakes” and “data warehouses” (or other Big Data terminology), you are not alone. In recent months, Pluralsight has received a growing number of queries on data-related topics. This brief overview provides a starting point for planning your Big Data training strategy.
What is Big Data?
Data is growing at a very fast pace. Check out Internet Live Stats, which displays real-time internet activity. Looking at these numbers, it is evident that traditional systems are not able to cope with this massive amount of data.
Traditional relational databases and mainframes are typically sized for a limited amount of structured data, measured in gigabytes or a few terabytes. In contrast, Big Data involves acquiring, storing, processing, and analyzing massive amounts of data (terabytes and beyond) to gain actionable insights. Companies use Big Data to create a competitive advantage.
What’s driving the interest in Big Data?
Picture this: A business user wants to review the last six months of sales results for a specific region. She enters a query into the sales system, which then accesses the requested information from a data warehouse.
In order to get an answer…
- The requested information must be in the warehouse.
- The user must ask a question that the sales system recognizes.
But what if this same business user wants a piece of information that is not currently stored in the warehouse, or asks a question that the system does not recognize?
For example, what if the user wants to drill down and find the sales results for a particular store in the last six months? If that information is not in the data warehouse, the user would not receive an answer.
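The scenario above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the warehouse, and the table `regional_sales`, its columns, the sample figures, and the helper functions are all hypothetical. The point is that a warehouse answers only the questions its schema was designed for.

```python
import sqlite3

# Hypothetical, simplified "warehouse": sales are stored pre-aggregated by region.
# SQLite is used here only as a stand-in for a real data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE regional_sales (region TEXT, month TEXT, total REAL)")
conn.executemany(
    "INSERT INTO regional_sales VALUES (?, ?, ?)",
    [
        ("West", "2024-01", 120.0),
        ("West", "2024-02", 80.0),
        ("East", "2024-01", 95.0),
    ],
)


def sales_for_region(region):
    """A question the schema was designed for: total sales in one region."""
    row = conn.execute(
        "SELECT SUM(total) FROM regional_sales WHERE region = ?", (region,)
    ).fetchone()
    return row[0]


def sales_for_store(store_id):
    """A question the schema cannot answer: no store-level column exists."""
    try:
        row = conn.execute(
            "SELECT SUM(total) FROM regional_sales WHERE store_id = ?", (store_id,)
        ).fetchone()
        return row[0]
    except sqlite3.OperationalError:
        # The warehouse has no 'store_id' column, so the user gets no answer.
        return None


print(sales_for_region("West"))
print(sales_for_store("S-17"))
```

The region query succeeds because the data was modeled for it. The store query fails until someone changes the schema and reloads the data, which is the change request described in this section.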
Before Big Data, making the changes to get this store-level sales information was not a simple undertaking. Implementing a change request like this could impact 50+ enterprise systems, requiring an intensive time and resource investment.
To remain competitive, organizations need fluid, nimble ways to access and analyze data. Companies cannot afford to wait months or more for answers to pressing questions.
Big Data terminology: What’s the difference between a data lake and a data warehouse?
A data warehouse stores business data in a structured way (e.g. relational database tables with well-defined structures/schema). Because warehousing is expensive, organizations limit what they store. They typically utilize a warehouse for very specific use cases, such as historical sales data for analysis and forecasting.
In contrast, a data lake can store all types of data at a centralized location. Format doesn’t matter. In addition to structured data, a data lake can accommodate semi-structured data (e.g. XML, sensor-based data) and unstructured data (e.g. images, videos, emails).
Organizations can build data lakes with inexpensive storage to hold all enterprise data, as well as data from external sources (industry information, social media, and so forth). A data lake enables business users and data analysts to ask a much broader range of questions without waiting for the data to be made available via a change request process.
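As a rough sketch of the contrast, the Python below models a data lake as a plain dictionary of raw payloads that are interpreted only when a question is asked. Everything here is an assumption made for illustration: the `lake` dictionary, the paths, the sample records, and the helper functions are hypothetical, and a real lake would sit on distributed or cloud object storage rather than in memory.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical in-memory "lake": raw payloads are stored as-is, keyed by path.
# No schema is enforced when the data is written.
lake = {
    "sales/2024-01.json": (
        b'[{"region": "West", "store": "S-17", "amount": 70.0},'
        b' {"region": "West", "store": "S-02", "amount": 50.0}]'
    ),
    "sensors/door.xml": b"<readings><reading store='S-17' visitors='340'/></readings>",
    "images/storefront.jpg": b"\xff\xd8\xff\xe0",  # opaque binary, stored untouched
}


def store_sales(store_id):
    """Structure is applied at read time: parse the raw JSON sales files now."""
    total = 0.0
    for path, raw in lake.items():
        if path.startswith("sales/") and path.endswith(".json"):
            for record in json.loads(raw):
                if record["store"] == store_id:
                    total += record["amount"]
    return total


def store_visitors(store_id):
    """Semi-structured data (XML sensor readings) is parsed the same way."""
    root = ET.fromstring(lake["sensors/door.xml"])
    return sum(
        int(reading.get("visitors"))
        for reading in root.iter("reading")
        if reading.get("store") == store_id
    )


print(store_sales("S-17"))
print(store_visitors("S-17"))
```

Because the store-level detail was kept in its raw form, a new question can be answered by writing a new reader instead of changing a schema first. The trade-off is that each reader has to know how to interpret the raw format it touches.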
Big Data terminology: The roles
If you’re new to Big Data, the job titles can be confusing. Here are some important distinctions:
These three roles have distinct training needs. While some consider data engineering a subdomain of software engineering, it requires mastery of different skills and tools.
Also, the work of data scientists can vary widely. For example, data scientists may perform a one-off analysis for a team that wants a better understanding of customer behavior. Or, they may develop machine learning algorithms that software engineers or data engineers then integrate into the code base.
Dive deeper into Big Data with part two in our series and discover how to find actionable insights in your data lake.
About the author
Bhavuk Chawla teaches Big Data, Machine Learning, and Cloud Computing courses for DevelopIntelligence, a Pluralsight Company. He is also an official instructor for Google, Cloudera, and Confluent. For the past ten years, he’s helped implement AI, Big Data Analytics, and Data Engineering projects as a practitioner, utilizing Cloudera/Hortonworks Stack for Big Data, Apache Spark, Confluent Kafka, Google Cloud, Microsoft Azure, Snowflake, and more. He brings this hands-on experience, coupled with more than 25 Data/Cloud/Machine Learning certifications, to each course he teaches. Chawla has delivered knowledge-sharing sessions at Google Singapore, Starbucks Seattle, Adobe India, and many other Fortune 500 companies.