Data pipeline Q&A

These days, organizations need to leverage their data to make smart business decisions. But with large amounts of data scattered across applications and systems, it can be difficult to gather the information needed for meaningful analysis.

That’s where big data pipelines come in. 

In a recent webinar, Andrew Brust, GigaOm’s Category Lead for Data and Analytics and Founder/CEO of Blue Badge Insights, and Durga Halaharvi, Director of Data Product Management with Pluralsight, fielded participant questions about big data transformation. Here’s their advice for any organization looking to create a competitive advantage:

What are recent trends you’ve noticed in cloud-based data pipelines?

Durga Halaharvi (DH): As the world moves toward more real-time use cases, like Netflix's on-demand streaming, I see more organizations using stream processing. This is also influenced by the rise of data science and machine learning (DS/ML), which needs a consistent data feed to power real-time recommendations.

In general, I also see an increase in open-source technologies like Apache Flink and Kafka. Specifically, I've seen a lot more organizations adopt dbt (data build tool) and repurpose it for their own needs. Traditionally, dbt has been an analytics engineering tool, focused on transforming data for display in Tableau or another dashboard. But even at Pluralsight, our data engineering teams have started to adopt the tool more.
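
To make the stream-processing trend Durga mentions a little more concrete, here's a minimal sketch of a consumer that reads events from a Kafka topic and keeps a running aggregate, the kind of low-latency feed that could power real-time recommendations. The broker address, topic name, and event fields are hypothetical, and the example assumes the kafka-python client.

```python
# Minimal stream-processing sketch using the kafka-python client.
# The broker address, topic name, and event fields are hypothetical.
import json
from collections import Counter

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "view-events",                       # hypothetical topic
    bootstrap_servers="localhost:9092",  # hypothetical broker
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="latest",
)

views_per_title = Counter()

for message in consumer:
    event = message.value                 # e.g. {"user_id": 1, "title": "Course A"}
    views_per_title[event["title"]] += 1  # running, low-latency aggregate
    top = views_per_title.most_common(3)  # feed for a simple "trending" recommendation
    print(f"Top titles so far: {top}")
```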

Andrew Brust (AB): One of the bigger trends is the fact that so much data in so many platforms is moving to the cloud. Even the pre-cloud era vendors have moved to the cloud. Informatica, for example, has been around since the ETL days. They began with an on-premises product, but they don’t actively sell it anymore. Most of the action has moved to the cloud. That's where customers want to be. That's where Informatica wants to be. And this goes not just for Informatica, but companies like Oracle and Microsoft, too.

In the industry, some people say that old platforms are obsolete. But it’s more complicated than that. The older companies excel at running data pipelines without fail for mission-critical workloads. They took their experience and added it to new technology. 

On the other hand, startups leverage innovative technology, but they may lack the experience or customer knowledge needed to deliver a quality solution. If they haven’t experienced harsh conditions, they may not have bulletproofed their platforms. You could also run into this if you use open source code and run it yourself rather than on a managed platform.

Do you have any concerns about using low-code and no-code platforms?

DH: When it comes to transferring data from one place to another, low-code and no-code platforms can handle broad, high-level use cases. But when you get into specialized use cases, some of those tools may not be able to fit every single situation. 

As an example, at Pluralsight we use Stitch as a low-code/no-code pipeline to transfer data from the backend of Pluralsight Flow into our data warehouse, Snowflake. But for some of our more specialized queries and needs, our data engineering team will probably need to build out a custom pipeline integration.
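
As a rough illustration of what such a custom pipeline integration might look like when a low-code tool doesn't fit, here's a sketch that pulls records from a hypothetical REST endpoint and loads them into a Snowflake staging table with the snowflake-connector-python library. The endpoint, credentials, and table are placeholders, not Pluralsight's actual setup.

```python
# Sketch of a custom extract-and-load step into Snowflake.
# The API endpoint, credentials, and table are placeholders.
import requests
import snowflake.connector

# 1. Extract: pull raw records from a hypothetical source API.
records = requests.get("https://api.example.com/flow/commits", timeout=30).json()

# 2. Load: write rows into a staging table in Snowflake.
conn = snowflake.connector.connect(
    account="my_account",      # placeholder credentials
    user="pipeline_user",
    password="********",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="STAGING",
)
try:
    cur = conn.cursor()
    cur.executemany(
        "INSERT INTO commit_events (commit_id, author, committed_at) VALUES (%s, %s, %s)",
        [(r["id"], r["author"], r["timestamp"]) for r in records],
    )
    conn.commit()
finally:
    conn.close()
```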

AB: Ultimately, you want to have as much centralized engineering as possible. This way, all the individual parties can spend their time working on unique requirements. Having a low-code or no-code platform can help with this. 

If a use case is fairly common, most platforms can take that on. I see it as a continuum, though. If somebody innovates around a unique scenario, then it may become more commonplace, and platforms may adopt that solution. But new situations emerge all the time, and software vendors are not clairvoyant.

How do tools like Databricks change data pipeline structure or portability?

AB: Databricks is an interesting case because it was founded by the people who created Apache Spark. It's a developer-oriented platform, and the number one use case for Databricks has been a more bespoke approach to data engineering, data transformation, and data pipelines. 

But at this point, even Databricks has introduced a declarative data pipelining facility as a feature in its workspace. As things mature and become more mission-critical, you start to need more infrastructure and more plumbing built around it. The DIY stuff needs to move up the stack, and your baseline platform takes on more and more.
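
For a sense of what "declarative" means here, below is a rough sketch in the style of the Databricks Delta Live Tables Python interface: each table is declared as a function, and the platform works out the dependency graph, scheduling, and plumbing. It only runs inside a Databricks pipeline, and the source path and table names are hypothetical.

```python
# Sketch of a declarative pipeline in the style of Databricks Delta Live Tables.
# Runs only inside a Databricks pipeline; the source path and table names are hypothetical.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw click events loaded from cloud storage.")
def raw_clicks():
    return spark.read.json("/mnt/raw/clicks/")  # `spark` is provided by the Databricks runtime

@dlt.table(comment="Clicks deduplicated and stamped with an event date.")
def clean_clicks():
    return (
        dlt.read("raw_clicks")
        .dropDuplicates(["event_id"])
        .withColumn("event_date", F.to_date("event_timestamp"))
    )

@dlt.table(comment="Daily click counts, ready for reporting.")
def daily_click_counts():
    return dlt.read("clean_clicks").groupBy("event_date").count()
```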

Platforms might automate what you did manually last year, but that just gets you to the next level of sophistication. Pluralsight helps developers upskill, cross-skill, and learn these new domains of software. We're always going to need developers because we can't have a priori knowledge of everything; we need to systematize that knowledge in code.

Confirmation bias can influence data analysis. Can data pipelines help ensure data validity?

AB: In part. We have augmented analytics, natural language generation, and automated insights to avoid modeling the data a certain way and getting a self-fulfilling result. But data will always be manipulated, just like information will always be manipulated. I don't think data pipelines will fix this. They may be able to help, though. 

Confirmation bias can get formalized in things like machine learning models if you’re overfitting to a small amount of training data. If you use a pipeline to bring in fresh data with low latency, you can overcome that challenge, at least in the world of modeling. In the end, a lot of data validity comes down to human intent, goodwill, and some amount of auditing.
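
Here's a small, purely synthetic illustration of the overfitting point using scikit-learn: a model fit on a tiny sample looks near-perfect on its own training data but degrades on held-out data, while retraining on a larger, fresher sample closes much of that gap.

```python
# Illustration of overfitting on a small training sample vs. training on fresher, larger data.
# Synthetic data only; the numbers are illustrative, not from any real pipeline.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# A model fit on only 50 rows: near-perfect on its own training data, weaker on held-out data.
small = RandomForestClassifier(random_state=0).fit(X_train[:50], y_train[:50])
print("small-sample train accuracy:", small.score(X_train[:50], y_train[:50]))
print("small-sample test accuracy: ", small.score(X_test, y_test))

# The same model fit on the full training set closes much of that gap.
full = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("full-sample train accuracy: ", full.score(X_train, y_train))
print("full-sample test accuracy:  ", full.score(X_test, y_test))
```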

DH: At the end of the day, data validity might not be a data problem. It may be more of a problem with how organizations and teams interpret the data and what types of questions they seek to answer. Unfortunately, you're going to run into that regardless of your organization’s data maturity.

A data pipeline might not be able to solve this, but you can develop criteria to guide your analysis. Whether you create your own data pipeline or use a platform, don’t be afraid to debunk some of the assumptions you put forward. Be clear on the questions you’re trying to answer and set specific benchmarks or metrics. If the data shows that you haven’t met those KPIs, acknowledge that. Be open to coming to terms with your own data and the questions you’re trying to answer with it.
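
One lightweight way to keep yourself honest is to write the benchmarks down before looking at the data, then check the results against them mechanically. The metrics and thresholds below are hypothetical.

```python
# Hypothetical KPI check: define the benchmarks up front, then report misses plainly.
targets = {"weekly_active_users": 10_000, "pipeline_freshness_hours": 4, "conversion_rate": 0.05}
actuals = {"weekly_active_users": 8_750, "pipeline_freshness_hours": 6, "conversion_rate": 0.06}

for metric, target in targets.items():
    actual = actuals[metric]
    # For freshness, lower is better; for the other metrics, higher is better.
    met = actual <= target if metric == "pipeline_freshness_hours" else actual >= target
    status = "met" if met else "MISSED"
    print(f"{metric}: target={target}, actual={actual} -> {status}")
```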

How can organizations prepare for big data transformation?

AB: Technology transforms all the time. New startups and ventures pop up almost every day. Keep your eye on big data and the state-of-the-art platforms and methods. Then, make sober decisions about what it is you want to skill yourself in and adopt in your own organization.

DH: I have a lot of empathy for all of you out there with various data needs in your organizations. When you’re thinking about big data and data pipelines, consider what use cases your organization needs, what a platform can do in terms of analytics, and how leadership will push data to the forefront of your organization.

Ready to get started with data pipelines? Check out Andrew's course.

Biography

Pluralsight author Andrew Brust is Founder and CEO of Blue Badge Insights and Category Lead for Data and Analytics at GigaOm, a Pluralsight partner. He provides strategy and advisory services to data, analytics, BI, and AI companies, as well as their partners and customers. Andrew also covers the data and analytics world for VentureBeat and The New Stack, co-chairs the Visual Studio Live! series of developer conferences, and is a Microsoft Regional Director, a Data Platform MVP, an entrepreneur, and a consulting veteran.