Bogged down in a data swamp? Here’s how to get unstuck

March 06, 2023

Swamps: stagnant pools of brackish water, mud, and all manner of unnamed squishy things. Yuck! Unfortunately, in the world of data, there’s not much difference between a real swamp and a data swamp.

In this blog post we’ll be exploring the world of data swamps and data lakes as we navigate the path out of the swamp and into a relaxing world of clean data! Along the way we might even stop to explore some amazing tools like Tableau (and how you can use it to keep your data sparkling clean).

What is a data swamp?

Data swamps are the result of a lack of proper planning or proper implementation. A data swamp is created when multiple business groups decide to add data in any manner they wish to a data lake. The end result is complex folder structures with no observable pattern across the organization. 

Inside the folders, files, and tables of a data swamp, you will find erroneous records, missing data, and non-uniform data. All of this makes attempting to pull multiple files or sources from the data lake into a single report virtually impossible! 

In short, everyone has access and no one follows the rules. It’s complete anarchy.

What is a data lake?

A data lake is the world of data the way it was meant to be. Everyone contributes in a meaningful way. Data is uploaded without error. It is organized, easy to access, and presented uniformly across the data lake. It is the opposite of a data swamp.

A data lake allows you to pull data from multiple organizational silos and quickly mesh it together. You can then use this data, along with advanced analytics, to create new and unique reports that support business decision making.
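To make that concrete, here is a minimal sketch of meshing two silos together, assuming two hypothetical curated-zone files that share a region column:

```python
import pandas as pd

# Hypothetical curated-zone files from two different business groups.
sales = pd.read_csv("curated/sales.csv")          # columns: region, revenue
marketing = pd.read_csv("curated/marketing.csv")  # columns: region, ad_spend

# Because both silos follow the same conventions, joining them is trivial.
combined = sales.merge(marketing, on="region")
print(combined.head())
```

In a well-kept lake, that join really is this short; in a data swamp, mismatched region names and duplicate rows would break it before the first report.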

What are the risks associated with a data swamp?

Data swamps can result in management not getting the reports they need to make business decisions, or getting faulty intelligence. In the former scenario, the organization flies blind, resulting in inevitable mistakes. In the latter, management works with erroneous data, which results in costly errors, backtracking, and potential fines.

The second scenario is worse, since it is better to not have the report than to be misinformed. This often occurs when IT plows ahead and mashes reports together as quickly as possible, not realizing the data quality is poor.

How do you avoid a data swamp?

You need a plan that is easy to implement and carry out, with four key components: data cleaning, paying attention to metadata, using data automation, and building a data catalog. These will help you create a well-kept data lake and avoid your data sinking into a swampy mess.

1. Start with thorough data cleaning

The first critical component is data cleaning. Take country codes as an example: one source might record USA, another United States, and a third #120. The data team needs to decide on a single standard and make sure that zones are created in the data lake to support the cleanup.

Zones might be defined as raw, processed, and curated (this depends on the complexity of your data and your organizational scheme). Business groups drop data off as raw; the data team then processes or cleans it, standardizing country codes and removing missing or duplicate records. This cleaned data lands in the processed zone, where it is refined for reports and combined in unique ways in the curated zone for final business report generation.
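Here is a minimal sketch of that raw-to-processed step, assuming pandas, a country column, and hypothetical file paths and mappings:

```python
import pandas as pd

# Hypothetical mapping from the variants seen in raw data to one standard code.
COUNTRY_MAP = {
    "USA": "US",
    "United States": "US",
    "#120": "US",  # a legacy numeric code, assumed here to mean the same country
}

def clean_raw_file(raw_path: str, processed_path: str) -> None:
    """Read a raw-zone file, standardize country codes, drop bad rows,
    and write the result to the processed zone."""
    df = pd.read_csv(raw_path)

    # Standardize country codes; unmapped values become NaN so they are easy to spot.
    df["country"] = df["country"].map(COUNTRY_MAP)

    # Remove rows with missing data and exact duplicates.
    df = df.dropna().drop_duplicates()

    df.to_csv(processed_path, index=False)

clean_raw_file("raw/sales_2023.csv", "processed/sales_2023.csv")
```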

Data cleaning and zones are critical to the success of your data lake. Without them you cannot possibly avoid the data swamp.

2. Don’t ignore metadata

To assist with your zones, data cleaning, and curation, you will want to make sure to include metadata. Metadata will help your data engineers and data scientists to understand what the data is used for, what version you are using, where your data comes from, and where it belongs. 

All of your data should include metadata. It will speed the processing of your data and ensure it arrives in the right place and is used for the right reports.
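One lightweight way to do this is a metadata sidecar file stored next to each dataset. The sketch below is illustrative; the field names are assumptions, not a standard:

```python
import json
from datetime import datetime, timezone

# A minimal metadata record for one dataset; field names here are illustrative.
metadata = {
    "dataset": "sales_2023",
    "source_system": "regional_crm",  # where the data comes from
    "zone": "processed",              # raw, processed, or curated
    "schema_version": "1.2",          # which version of the layout is in use
    "owner": "data-engineering",      # who to ask about this data
    "last_updated": datetime.now(timezone.utc).isoformat(),
}

# Store the metadata as a sidecar file next to the data itself.
with open("processed/sales_2023.metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```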

3. Utilize data automation

Data automation is also key to increasing the speed of data preparation. Tools such as Tableau can not only increase speed but also enable data democratization: the ability of non-technical employees in your company to access and create critical business intelligence reports.
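Tableau has its own scheduling and prep tooling; as a tool-agnostic sketch of the idea, the example below uses the third-party schedule package and assumes the clean_raw_file function from the earlier cleaning sketch has been saved in a module named pipeline:

```python
import time

import schedule

from pipeline import clean_raw_file  # hypothetical module from the cleaning sketch

def refresh_processed_zone():
    # Re-run the cleaning step so the processed zone stays current.
    clean_raw_file("raw/sales_2023.csv", "processed/sales_2023.csv")

# Run the cleaning job automatically every night at 2 AM.
schedule.every().day.at("02:00").do(refresh_processed_zone)

while True:
    schedule.run_pending()
    time.sleep(60)
```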

4. Build a data catalog

Once you have done all of the steps above, the final piece is a data catalog. The data catalog is like a table of contents in a book. It helps you understand where your data is located and how fresh it is, and it provides rules for determining the fitness of data for various reporting purposes.
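A catalog entry can be as simple as a structured record per dataset. The sketch below is a minimal illustration; real catalog tools have their own schemas:

```python
from dataclasses import dataclass

# One illustrative catalog entry; real catalogs (and their fields) vary by tool.
@dataclass
class CatalogEntry:
    name: str       # dataset name
    location: str   # where the data lives in the lake
    zone: str       # raw, processed, or curated
    refreshed: str  # how fresh the data is (last load date)
    fit_for: list   # reporting purposes this data is approved for

catalog = [
    CatalogEntry(
        name="sales_2023",
        location="processed/sales_2023.csv",
        zone="processed",
        refreshed="2023-03-01",
        fit_for=["regional sales reports", "quarterly forecasts"],
    ),
]

# Find every dataset approved for a given reporting purpose.
matches = [e for e in catalog if "quarterly forecasts" in e.fit_for]
print([e.name for e in matches])
```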

Like the ingested data, a data catalog is only helpful if it is well maintained and enforced. If not, your data catalog might wind up sinking into the data swamp right along with the rest of your data!

Final reflections on the importance of data organization

I get it. At this point you’re probably thinking “All of that sounds great, but I have no idea where to start. How do I take this murky mess and start to clean it up?”

If you find yourself in this boat (and stuck trying to paddle in the middle of your data swamp), ask yourself this question: “What would a data lake even look like?” That might include questions like:

  • What major categories or zones would you have? 

  • What would live in those zones? 

  • What kinds of reports would you like to have generated out of your data? 

Once you have that information, you can start organizing your data. Here are some more steps to follow to do that (a folder-layout sketch follows the list):

  • Start designing and creating the necessary folders. 

  • Take what you have and make that a raw zone (to be edited later). 

  • Create rules for how you will maintain the new data lake and who will own the process. 

  • Get to work creating your processed and curated folders and populating those new folders with clean data.
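As a starting point for the folder work, here is a minimal sketch that lays out zone and category folders; the zone and category names are assumptions to adjust for your own organization:

```python
import pathlib

# Hypothetical zone layout and business categories; adjust both to your needs.
ZONES = ["raw", "processed", "curated"]
CATEGORIES = ["sales", "marketing", "finance"]

lake_root = pathlib.Path("data-lake")

# Create one folder per zone and category, e.g. data-lake/raw/sales.
for zone in ZONES:
    for category in CATEGORIES:
        (lake_root / zone / category).mkdir(parents=True, exist_ok=True)
```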

Before you know it, you will be finding success through data and enjoying new insights for your business to act upon. Great data governance acts as its own reward!


About the Author

Brian Roehm is a multi-cloud certified architect with business certifications and degrees including a Project Management Professional and an MBA. He has been in IT for over a decade, starting out in Technical Project Management. Brian has worked as an Architect on projects ranging from medium-sized businesses to the Fortune 500, and has experience in multiple cloud environments including AWS, Azure, and Oracle. For the past several years he has been an instructor at A Cloud Guru, preparing students for certification exams in DevOps, Data Engineering, and Data Science.