Data Product and DataOps

Recently, I have come across the term DataOps multiple times. After some exploration, I will try to explain the concepts of data products and DataOps. This post is intended as a light read, for conceptual understanding.

In the last 10 years, data has come to be considered an asset and has gradually been adopted by more and more companies. After many companies' failed attempts to adopt big data in an enterprise setting, the industry has started to formulate a set of best practices for delivering data products. This set of best practices absorbs successful experience from its close neighbour, software development, as illustrated in the image below.

DataOps is the set of best practices for "manufacturing" data products, just as vehicles are manufactured on an assembly line. Some questions naturally arise:

  • how to deliver data products quickly enough to keep up with BI/DS teams' needs

  • how to design a data architecture that is robust (data engineering), conforms to good security and privacy practices (for compliance and audit purposes), and delivers reliable data quality so stakeholders can trust your data (data quality, observability engineering and reliability engineering)

  • etc
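The data-quality question above can be made concrete with a small gate that refuses to publish a batch failing basic checks. This is only a minimal sketch; the field name `user_id` and the thresholds are hypothetical, not from any particular tool.

```python
# Minimal sketch of a data-quality gate: compute a few batch-level
# metrics, then publish only if they pass agreed thresholds.

def quality_report(rows):
    """Summarise a batch: row count and rate of null user_id values."""
    total = len(rows)
    null_ids = sum(1 for r in rows if r.get("user_id") is None)
    return {
        "row_count": total,
        "null_id_rate": (null_ids / total) if total else 1.0,
    }

def gate(rows, min_rows=100, max_null_rate=0.01):
    """Return True only if the batch is trustworthy enough to publish."""
    report = quality_report(rows)
    return (report["row_count"] >= min_rows
            and report["null_id_rate"] <= max_null_rate)
```

In practice such checks are what observability and reliability tooling automates: the metrics are tracked over time, and a failing gate pages the data team instead of silently shipping bad data downstream.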

To answer and address those questions, DataOps draws inspiration from several practices that have risen in the last decade, including:

  • observability and reliability engineering

  • DevOps

To better deliver the data product in an enterprise setting, one needs to understand DataOps and its close relative DevOps.

Software product vs data product

The difference between an SDE team and a DE team is that SDE delivers software, while DE delivers data products to be consumed by BI and DS teams.

Just as users want to play with cool new features on Instagram, DS and BI teams want new data as fast as possible.

For DevOps of software products, we have the following tools in our pocket:

  • CI/CD

  • version control

  • dev/QA/prod environments
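The CI/CD idea translates to data work as automated tests over pipeline logic: every commit runs the checks, so a breaking change fails the build before it touches production data. Below is a minimal sketch; the transform `clean_orders` and its field names are invented for illustration.

```python
# A unit-testable ETL transform that a CI job (e.g. "run tests on
# every commit") can verify automatically.

def clean_orders(rows):
    """Drop rows with a missing order id and normalise amounts to floats."""
    cleaned = []
    for row in rows:
        if not row.get("order_id"):
            continue  # skip records that fail the basic validity rule
        cleaned.append({
            "order_id": row["order_id"],
            "amount": float(row.get("amount", 0)),
        })
    return cleaned

def test_clean_orders():
    # This is the kind of check a CI pipeline runs on every commit.
    raw = [{"order_id": "a1", "amount": "19.9"},
           {"order_id": None, "amount": "5"}]
    assert clean_orders(raw) == [{"order_id": "a1", "amount": 19.9}]
```

The same structure works with version control and environments: the transform lives in a repo, the test runs against a dev/QA copy of the data, and only a green build gets promoted.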

Why can't we directly copy these best practices from DevOps to DataOps, then? What exactly is holding back DE's development flow?

  • It's hard to duplicate a QA environment for ETL, since it involves many source integrations, infrastructure, cluster and VM configs. It is hard to manage all of them as code unless a platform has been built for that purpose.

  • DE involves too many tech stacks, and the pieces change very quickly. It's a lot of work for the deployment team to maintain the CI/CD pipelines.

The entropy generated by so many tool stacks makes it really hard to adopt these best practices. Delivering a data product involves many components, as illustrated in the figure below.

To address these issues, besides CI/CD and version control, more tooling has been developed around the concept of everything-as-code, including Terraform (sadly, HashiCorp moved it to a business licence), Ansible, etc.
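The core of everything-as-code is that the environment is described declaratively and a tool reconciles reality toward that description. The toy sketch below imitates that reconcile loop in Python; the resource names and specs are invented, and real tools like Terraform do far more (state files, providers, dependency graphs).

```python
# Toy sketch of declarative infrastructure: desired state lives in
# version-controlled "code", and a plan step computes what must change
# (loosely analogous to `terraform plan`).

DESIRED = {
    "warehouse_cluster": {"nodes": 4, "version": "1.8"},
    "etl_scheduler": {"nodes": 1, "version": "2.3"},
}

def plan(current, desired):
    """Return the resources whose real state differs from the desired spec."""
    changes = {}
    for name, spec in desired.items():
        if current.get(name) != spec:
            changes[name] = spec
    return changes

# The cluster is under-sized and the scheduler does not exist yet,
# so both resources show up in the plan.
current_state = {"warehouse_cluster": {"nodes": 2, "version": "1.8"}}
changes = plan(current_state, DESIRED)
```

Because the desired state is plain text in a repo, it gets the same review, diffing and rollback workflow as application code, which is exactly what makes a QA environment reproducible.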

There is no silver bullet for this problem, since the situation varies case by case among enterprises. It is not an easy task either, but some basic design principles stick, such as:

  • minimising the number of tech stacks you use

  • making your data architecture more "microservice-like", so you can easily switch to a new component once an existing one goes stale (say, a review every 2 years). This makes the architecture more extensible.
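The "microservice-like" principle can be sketched as a small interface boundary: downstream code depends only on the interface, so a stale backend can be swapped without touching consumers. The classes below are illustrative, not a real API.

```python
# Sketch of a swappable component boundary for a data architecture:
# consumers depend on TableStore, never on a concrete backend.
from abc import ABC, abstractmethod

class TableStore(ABC):
    @abstractmethod
    def write(self, table: str, rows: list) -> None: ...

    @abstractmethod
    def read(self, table: str) -> list: ...

class InMemoryStore(TableStore):
    """Stand-in backend; a warehouse-backed class would implement the
    same interface and be dropped in with no consumer changes."""
    def __init__(self):
        self._tables = {}

    def write(self, table, rows):
        self._tables.setdefault(table, []).extend(rows)

    def read(self, table):
        return list(self._tables.get(table, []))

def publish_daily_metrics(store: TableStore):
    # This consumer only sees the interface, so replacing the backend
    # (e.g. when a component goes stale) does not require changing it.
    store.write("daily_metrics", [{"day": "2024-01-01", "orders": 42}])
```

Keeping each component behind a boundary like this is what makes the periodic "swap out the stale piece" review cheap instead of a rewrite.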

Summary

We have covered the origin of DataOps and why we need it. This blog mainly serves as an intro to the world of delivering good data products in an enterprise setting. I recommend the book Practical DataOps: Delivering Agile Data Science at Scale as a light read if you want to understand more of the conceptual side of DataOps, how enterprises struggle with big data, and how to improve your chances of delivering good data products with a better team composition.
