
Big data terminology: A layperson's guide

July 13, 2022

If you’re wondering about the differences between “data lakes” and “data warehouses” (or other Big Data terminology), you are not alone. In recent months, Pluralsight has received a growing number of queries on data-related topics. This brief overview provides a starting point for planning your Big Data training strategy.

What is Big Data?

The volume of data generated online is growing rapidly. Check out Internet Live Stats, which displays real-time internet activity. Looking at these numbers, it is evident that traditional systems cannot cope with this amount of data.

Relational databases and mainframes handle only a limited amount of structured data, measured in gigabytes or small numbers of terabytes. In contrast, Big Data involves acquiring, storing, processing, analyzing, and gaining actionable insights from massive amounts of data—terabytes and beyond. Companies use Big Data to create a competitive advantage.

What’s driving the interest in Big Data?

Picture this: A business user wants to review the last six months of sales results for a specific region. She enters a query into the sales system, which then accesses the requested information from a data warehouse.

In order to get an answer…

  1. The requested information must be in the warehouse.
  2. The user must ask a question that the sales system recognizes.

But what if this same business user wants a different piece of information that is not currently stored in the warehouse, or asks a question that the system does not recognize?

For example, what if the user wants to drill down and find the sales results for a particular store in the last six months? If that information is not in the data warehouse, the user would not receive an answer.
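To make the two conditions concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a warehouse. The table name, columns, store ID, and figures are all hypothetical and chosen only for illustration. The point is that a table aggregated by region can answer the regional question but has nothing to offer for the store-level one.

```python
import sqlite3

# Hypothetical warehouse table: sales pre-aggregated by region and month.
# All names and numbers here are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE regional_sales (region TEXT, month TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO regional_sales VALUES (?, ?, ?)",
    [
        ("West", "2022-01", 120000.0),
        ("West", "2022-02", 135000.0),
        ("East", "2022-01", 98000.0),
    ],
)


def region_total(region):
    """A question the warehouse was designed to answer."""
    row = conn.execute(
        "SELECT SUM(revenue) FROM regional_sales WHERE region = ?", (region,)
    ).fetchone()
    return row[0]


def store_total(store_id):
    """A question the schema cannot answer: the table has no store column."""
    try:
        row = conn.execute(
            "SELECT SUM(revenue) FROM regional_sales WHERE store_id = ?",
            (store_id,),
        ).fetchone()
        return row[0]
    except sqlite3.OperationalError as exc:
        # The data was never loaded at store level, so the query itself is invalid.
        return f"no answer: {exc}"


print(region_total("West"))
print(store_total("S-017"))
```

In this sketch, getting a store-level answer would mean changing the table definition and reloading the data, which is a small-scale version of the change request described next.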

Before Big Data, making the changes to get this store-level sales information was not a simple undertaking. Implementing a change request like this could impact 50+ enterprise systems, requiring an intensive time and resource investment.

To remain competitive, organizations need fluid, nimble ways to access and analyze data. Companies cannot afford to wait months or more for answers to pressing questions.

Big Data terminology: What’s the difference between a data lake and a data warehouse?

A data warehouse stores business data in a structured way (e.g. relational database tables with well-defined structures/schema). Because warehousing is expensive, organizations limit what they store. They typically utilize a warehouse for very specific use cases, such as historical sales data for analysis and forecasting.

In contrast, a data lake can store all types of data at a centralized location. Format doesn’t matter. In addition to structured data, a data lake can accommodate semi-structured data (e.g. XML, sensor-based data) and unstructured data (e.g. images, videos, emails).

Organizations can build data lakes with inexpensive storage to hold all enterprise data, as well as data from external sources (industry information, social media, and so forth). A data lake enables business users and data analysts to ask open-ended questions without waiting for the data to be made available via a change request process.
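As a rough illustration of the lake approach, the sketch below keeps raw files in their original formats and applies structure only when a question is asked (often called "schema-on-read", a term not used above). The file paths, field names, and amounts are hypothetical, and an in-memory dictionary stands in for real object storage.

```python
import csv
import io
import json

# Hypothetical raw files as they might land in a lake, kept in their
# original formats. Paths, field names, and values are assumptions.
raw_lake = {
    "pos/2022-01.jsonl": (
        '{"store": "S-017", "region": "West", "amount": 40.0}\n'
        '{"store": "S-021", "region": "West", "amount": 25.5}\n'
    ),
    "legacy/2022-02.csv": (
        "store_id,region,total\n"
        "S-017,West,60.0\n"
        "S-030,East,12.0\n"
    ),
}


def read_sales(lake):
    """Apply structure at read time, yielding (store, amount) pairs."""
    for path, text in lake.items():
        if path.endswith(".jsonl"):
            # One JSON object per line.
            for line in text.splitlines():
                record = json.loads(line)
                yield record["store"], float(record["amount"])
        elif path.endswith(".csv"):
            # Older export with different column names.
            for record in csv.DictReader(io.StringIO(text)):
                yield record["store_id"], float(record["total"])


def store_total(lake, store_id):
    """Answer a store-level question without changing how the data is stored."""
    return sum(amount for store, amount in read_sales(lake) if store == store_id)


print(store_total(raw_lake, "S-017"))
```

The trade-off in this sketch is that the reading code has to know about each raw format, whereas a warehouse does that cleanup once, up front, for a narrower set of questions.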

This graphic compares a data warehouse and a data lake. A data warehouse is typically a repository of clean data that has already been processed for a specific purpose. It houses mostly structured data, and it's costly to make changes. In contrast, a data lake is a collection of raw data that can be processed later to extract insights. It can include any type of data: structured, semi-structured, or unstructured. And it's highly accessible, quick, and flexible.

Big Data terminology: The roles

If you’re new to Big Data, the job titles can be confusing. Here are some important distinctions:

The image compares three roles: software developers, data engineers, and data scientists. Software developers take part in all phases of the software development lifecycle from design, to writing code, to testing and review. Data engineers create the infrastructure needed for accessing and utilizing data. And data scientists identify the key insights and present them in understandable ways to key stakeholders.

These three roles have distinct training needs. While some consider data engineering a subdomain of software engineering, it requires mastery of different skills and tools.

Also, the work of data scientists can vary widely. For example, data scientists may perform a one-off analysis for a team that wants a better understanding of customer behavior. Or, they may develop machine learning algorithms that software engineers or data engineers implement into the code base.

Dive deeper into Big Data with part two in our series and discover how to find actionable insights in your data lake.

About the author 

Bhavuk Chawla teaches Big Data, Machine Learning, and Cloud Computing courses for DevelopIntelligence, a Pluralsight Company. He is also an official instructor for Google, Cloudera, and Confluent. For the past ten years, he’s helped implement AI, Big Data Analytics, and Data Engineering projects as a practitioner, utilizing Cloudera/Hortonworks Stack for Big Data, Apache Spark, Confluent Kafka, Google Cloud, Microsoft Azure, Snowflake, and more. He brings this hands-on experience, coupled with more than 25 Data/Cloud/Machine Learning certifications, to each course he teaches. Chawla has delivered knowledge-sharing sessions at Google Singapore, Starbucks Seattle, Adobe India, and many other Fortune 500 companies.