Recently, I have come across the term DataOps multiple times. After some exploration, I will try to explain the concepts of data product and DataOps. This post is meant for conceptual understanding and as a light read.
In the last 10 years, data has come to be considered an asset and has gradually been adopted by more and more companies. After many failed attempts to adopt big data in enterprise settings, the industry has started to formulate a set of best practices for delivering data products. This set of best practices absorbs successful experience from its close neighbour, software development, as illustrated in the image below.
DataOps is the set of best practices for "manufacturing" data products, just like manufacturing vehicles on an assembly line. Some questions naturally arise:
- How do we deliver data products fast enough to keep up with BI/DS teams' needs?
- How do we design the data architecture to be robust (data engineering), conform to good security and privacy practices (compliance and audit), and deliver reliable data quality so stakeholders can trust the data (data quality, observability engineering and reliability engineering)? See the quality-check sketch after this list.
- etc.
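To make the data-quality and observability point concrete, here is a minimal sketch of an automated check that could run after each pipeline load. It assumes pandas; the orders table, column names and freshness threshold are all made up for illustration.

```python
import pandas as pd

def check_orders_quality(df: pd.DataFrame) -> list[str]:
    """Return a list of data-quality violations for a hypothetical orders table."""
    violations = []
    # Freshness: the newest record should be recent enough for downstream BI.
    # (Assumes created_at is a timezone-naive datetime column.)
    if (pd.Timestamp.now() - df["created_at"].max()) > pd.Timedelta(days=1):
        violations.append("stale data: no orders loaded in the last 24 hours")
    # Completeness: key columns should not contain nulls.
    for col in ("order_id", "customer_id", "amount"):
        if df[col].isna().any():
            violations.append(f"null values found in column {col}")
    # Validity: order amounts should never be negative.
    if (df["amount"] < 0).any():
        violations.append("negative order amounts found")
    return violations
```

In a DataOps setup, a check like this runs on a schedule, and violations alert the data engineering team before stakeholders notice bad numbers in a dashboard.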
To answer and address those questions, DataOps has absorbed inspiration from several practices that rose to prominence in the last decade, including:
- observability and reliability engineering
- DevOps
To deliver data products well in an enterprise setting, one needs to understand DataOps and its close relative, DevOps.
Software product vs data product
The difference between a software engineering (SDE) team and a data engineering (DE) team is that the SDE team delivers software, while the DE team delivers data products to be consumed by the BI and DS teams.
Just like users who want to play with cool new features on Instagram, the DS and BI teams want new data as fast as possible.
For DevOps of software products, we have the following tools in our pocket (a CI sketch follows the list):
- CI/CD
- version control
- dev/QA/prod environments
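To illustrate the CI/CD item: in software DevOps, a unit test like the one below runs automatically on every commit, blocking bad changes before they ship. The transformation and test are hypothetical, sketched with pandas and pytest.

```python
# test_transform.py -- a test that a CI pipeline would run on every commit.
import pandas as pd

def dedupe_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical ETL step: keep only the latest row per order_id."""
    return (
        df.sort_values("updated_at")
          .drop_duplicates("order_id", keep="last")
          .reset_index(drop=True)
    )

def test_dedupe_orders_keeps_latest():
    df = pd.DataFrame({
        "order_id": [1, 1, 2],
        "updated_at": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-01"]),
        "amount": [10, 12, 7],
    })
    result = dedupe_orders(df)
    assert len(result) == 2
    # Order 1 should keep the amount from its most recent update.
    assert result.loc[result["order_id"] == 1, "amount"].item() == 12
```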
Why can't we directly copy these best practices from DevOps to DataOps, then? What exactly is holding back the DE development flow?
- It is hard to duplicate a QA environment for ETL, since it involves many source integrations, infrastructure, cluster and VM configs. It is hard to manage all of them as code unless a platform has been adopted for exactly that purpose.
- DE involves too many tech stacks, and the pieces change very quickly. It is a lot of work for the deployment team to maintain the CI/CD pipeline.
The entropy generated by so many tool stacks makes it really hard to adopt these best practices. Delivering a data product involves many components, as illustrated in the figure below.
To address these issues, besides CI/CD and version control, more tooling has been developed around the concept of everything-as-code, including Terraform (too sad that HashiCorp moved it to the Business Source License), Ansible, etc.
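Terraform and Ansible come with their own configuration languages; to keep this post in one language, here is the same everything-as-code idea sketched with Pulumi's Python SDK, a Terraform alternative. The bucket is hypothetical, and this is an illustration of the idea, not a tool recommendation.

```python
# __main__.py -- infrastructure-as-code sketch using Pulumi's Python SDK.
# `pulumi up` reconciles real cloud resources with this declaration, so the
# environment can be reviewed, versioned and recreated like any other code.
import pulumi
import pulumi_aws as aws

# Declare a versioned landing bucket for raw data (the name is made up).
raw_bucket = aws.s3.Bucket(
    "raw-data",
    versioning=aws.s3.BucketVersioningArgs(enabled=True),
)

pulumi.export("raw_bucket_name", raw_bucket.id)
```

Because the environment is declared as code, spinning up a QA copy of it becomes running the same program against a different stack, rather than clicking through consoles.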
There is no silver bullet for this problem, since the right answer varies case by case across enterprises. It is not an easy task either, but some basic design principles stick, such as:
- minimising the number of tech stacks you use;
- making your data architecture more "microservice-like", so you can easily swap in a new component once an old one goes stale (say, with a scan every two years). This makes the architecture more extensible; see the sketch after this list.
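One way to read the "microservice-like" principle: hide each component behind a small interface so that swapping the implementation does not ripple through the codebase. The Warehouse protocol and the toy backend below are hypothetical.

```python
from typing import Iterable, Protocol

class Warehouse(Protocol):
    """The minimal interface the rest of the pipeline codes against."""
    def load(self, table: str, rows: Iterable[dict]) -> None: ...

class InMemoryWarehouse:
    """Toy backend for local testing; a Postgres- or Snowflake-backed class
    with the same method could replace it without touching the pipeline."""
    def __init__(self) -> None:
        self.tables: dict[str, list[dict]] = {}

    def load(self, table: str, rows: Iterable[dict]) -> None:
        self.tables.setdefault(table, []).extend(rows)

def run_pipeline(wh: Warehouse) -> None:
    # The pipeline depends only on the Warehouse interface, so replacing
    # the backend is a one-line change where the object is constructed.
    wh.load("daily_sales", [{"day": "2024-01-01", "total": 42}])

if __name__ == "__main__":
    wh = InMemoryWarehouse()
    run_pipeline(wh)
    print(wh.tables)  # {'daily_sales': [{'day': '2024-01-01', 'total': 42}]}
```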
Summary
We have covered the origin of DataOps and why we need it. This blog mainly serves as an intro for you to dive into the world of delivering good data products in an enterprise setting. I recommend the book Practical DataOps: Delivering Agile Data Science at Scale as a light read if you want to learn more about the conceptual side of DataOps, why enterprises struggle with big data, and how to improve your chances of delivering good data products with a better team composition.
References
The book Practical DataOps: Delivering Agile Data Science at Scale. A good, mostly conceptual read.
Lenny Liebmann's "3 reasons why DataOps is essential for big data success" on the IBM Big Data & Analytics Hub. He proposed DataOps first.
Tamr CEO Andy Palmer's posts on what DataOps is, from 2016 and again in 2022. The 2016 post helped make DataOps a buzzword.