Soda.io Checks to Keep Your Data in Line – The New Stack
Much has been said lately about the data mesh, which, rather than a technology or a service, is actually an organizational structure that brings ownership of data closer to those who use it to bring value to the business, as Emily Omier explained in a recent post.
If you have a core data engineering group, how well does it really understand what data sets finance needs? Or the data sets that one of the business units needs? The closer data ownership sits to people who understand business issues and requirements and have domain knowledge, the better prepared those people are to create the right set of data assets to power the right business use case.
Belgian startup Soda.io takes this data ownership one step further to allow enterprise data owners to own data quality as well. Co-founders Tom Baeyens and Martin Masschelein approached the problem from slightly different angles but recognized a common problem, and the company was born.
“There are all these people who work together to enhance the data they have. And it turns out that in production, the biggest problem is actually keeping that data in a clean form. Because once you use data in production, engineers are usually going to do something else, build the next product. And then it breaks down,” Baeyens explained.
There are myriad ways data systems can go wonky (it could be as simple as someone adding a new field in Salesforce), but traditionally engineers have had to write code to create data quality checks in production, which data analysts often lack the skills to do. The Soda team set out to change that, focusing on the needs of data analysts as well as data engineers.
Data as code
To this end, the company released Soda Core, a framework for integrating data reliability checks and quality management into data pipelines, powered by SodaCL (Soda Checks Language), a domain-specific language for data reliability.
Inspired by the concept of data as code, Soda Core is an open source CLI tool and Python library that uses SodaCL to translate user-defined checks into aggregated SQL queries. Its core components include dataset metadata for understanding the shape and health of data, plus built-in metrics and broad check coverage for validating many data quality parameters. These include anomaly detection checks and change-over-time checks that help detect and fix problems in the data and alert the appropriate people. Soda Core is the basis of Soda Cloud, but it can also be used as a standalone tool.
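To make the idea of turning declared checks into one aggregated SQL query concrete, here is a minimal Python sketch. It is not Soda Core's implementation and does not use SodaCL syntax: the check tuples, the metric names, the `build_query` and `run_checks` helpers, and the `customers` table are all assumptions made for illustration, with SQLite standing in for a real data source.

```python
import operator
import sqlite3

# Hypothetical check definitions: (metric, column, comparison, threshold).
# Illustrative only; this is not SodaCL syntax.
CHECKS = [
    ("row_count", None, ">", 0),
    ("missing_count", "email", "=", 0),
    ("duplicate_count", "id", "=", 0),
]

# SQL expression for each illustrative metric.
METRIC_SQL = {
    "row_count": "COUNT(*)",
    "missing_count": "COALESCE(SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END), 0)",
    "duplicate_count": "COUNT({col}) - COUNT(DISTINCT {col})",
}

OPERATORS = {">": operator.gt, "=": operator.eq, "<": operator.lt}


def build_query(table, checks):
    """Fold every metric into a single SELECT so the table is scanned once.

    Table and column names are assumed to be trusted identifiers.
    """
    exprs = [
        f"{METRIC_SQL[metric].format(col=col)} AS m{i}"
        for i, (metric, col, _, _) in enumerate(checks)
    ]
    return f"SELECT {', '.join(exprs)} FROM {table}"


def run_checks(conn, table, checks):
    """Run the aggregated query and compare each metric to its threshold."""
    row = conn.execute(build_query(table, checks)).fetchone()
    return [
        (metric, col, value, OPERATORS[op](value, threshold))
        for value, (metric, col, op, threshold) in zip(row, checks)
    ]


# Small in-memory data set with one missing email and one repeated id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@example.com"), (2, None), (2, "c@example.com")],
)

query = build_query("customers", CHECKS)
results = run_checks(conn, "customers", CHECKS)
for metric, col, value, passed in results:
    print(metric, col, value, "PASS" if passed else "FAIL")
```

The point of the sketch is the shape of the approach: checks are declared as data, and the tool, not the person writing the checks, is responsible for producing the SQL.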
In 2021, the company launched Soda SQL to help data engineers maintain reliable data pipelines in production, and it has continued to develop that work into a domain-specific language that lets data teams verify data as code in every data workload, from ingestion to consumption.
As a more human-readable language, SodaCL removes the need to write SQL, meaning every member of a data team can set thresholds for what good data should look like. Underneath, it still queries SQL-based data sources.
SodaCL includes more than 30 built-in metrics.
Said Tiago Andrade, head of big data, analytics and AI at Brazilian retailer Americanas SA, “The modern retail environment has changed, and for organizations like Americanas to continue to deliver the best possible commerce experience, we rely on digital engines powered by AI and ML that sit behind our retail platform.
“This platform is a dynamically evolving entity that needs to be managed in real time to ensure we adapt to changing conditions and don’t suffer from errors that affect accuracy and degrade overall performance. Soda gives us the end-to-end observability we need to be more confident about the data that powers our engines, which means that instead of being reactive to issues, we can take a much more proactive approach based on a fully accurate picture of the health of our data.”
Baeyens said his users pushed for the idea of a domain-specific language for data reliability. A few companies had already worked on such a language.
“When you want to monitor this data in production, that means you need to create a picture of what good data looks like, so you can monitor that,” he said.
“Normally, this is a field reserved only for engineers. They have to write code, they know how to write code, and then they have to learn the library and all that. But our goal is…to extend this to analysts and non-technical users as well. So the language really allows analysts to become autonomous. They no longer need to rely on programmers to write those checks. [With the language] it’s much simpler than writing code. It’s easy to read. And now many more people can contribute to the picture of what good data looks like.”
For example, you can compare datasets, check data freshness, or set up a programmatic scan that acts as a circuit breaker, stopping data ingestion if a problem is detected.
Soda Core needs two inputs. One holds the data source configuration and the other holds the checks you want to perform. Both are YAML configuration files.
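The circuit-breaker pattern can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Soda's API: the `DATA_SOURCE` and `CHECKS` dictionaries are hypothetical stand-ins for the two YAML inputs, and `scan`, `ingest`, `DataQualityError` and the `staging_orders` and `orders` tables are invented for the example, with SQLite as the data source.

```python
import sqlite3


class DataQualityError(Exception):
    """Raised to halt the pipeline when at least one check fails."""


# Hypothetical stand-ins for the two inputs: a data source
# configuration and a set of checks per table.
DATA_SOURCE = {"database": ":memory:"}
CHECKS = {
    "staging_orders": [
        ("row_count", "SELECT COUNT(*) FROM staging_orders", lambda n: n > 0),
        (
            "missing_amount",
            "SELECT COUNT(*) FROM staging_orders WHERE amount IS NULL",
            lambda n: n == 0,
        ),
    ]
}


def scan(conn, checks):
    """Run every check and return a description of each failure."""
    failures = []
    for table, table_checks in checks.items():
        for name, sql, is_ok in table_checks:
            value = conn.execute(sql).fetchone()[0]
            if not is_ok(value):
                failures.append(f"{table}.{name} = {value}")
    return failures


def ingest(conn, checks):
    """Load staged rows only if the scan finds no problems."""
    failures = scan(conn, checks)
    if failures:
        # Circuit breaker: stop before bad data reaches the target table.
        raise DataQualityError("; ".join(failures))
    conn.execute("INSERT INTO orders SELECT * FROM staging_orders")
    return conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]


conn = sqlite3.connect(DATA_SOURCE["database"])
conn.execute("CREATE TABLE staging_orders (id INTEGER, amount REAL)")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO staging_orders VALUES (?, ?)", [(1, 9.5), (2, None)]
)

# First attempt: one amount is missing, so ingestion is halted.
try:
    ingest(conn, CHECKS)
    halted_reason = None
except DataQualityError as err:
    halted_reason = str(err)
rows_after_halt = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

# After the staged data is repaired, the same step goes through.
conn.execute("UPDATE staging_orders SET amount = 0.0 WHERE amount IS NULL")
loaded = ingest(conn, CHECKS)
print(halted_reason, rows_after_halt, loaded)
```

In an orchestrator, the raised exception would fail the task and keep downstream steps from running on bad data.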
“It’s very easy for engineers to plug into their Airflow or orchestration tools, very early on as the data comes in,” Masschelein said.
The company’s commercial offering is a managed cloud service that includes collaboration tools, incident management, integrations with Slack and other features.
Baeyens, the company’s CTO, previously created the open source projects jBPM, a JBoss-based toolkit for building business applications that automate business processes, and Activiti, a Java-centric Business Process Model and Notation (BPMN) engine for process automation. He also created Effektif, a cloud-based business process management (BPM) solution for process automation that became SAP Signavio Process Governance.
Masschelein, the CEO, came from data governance platform provider Collibra, which used Baeyens’ data tools. The two connected on a community forum and Soda.io was launched almost four years ago. The Brussels-based company now has around forty employees.
Its open source users and contributors include Disney, HelloFresh, Udemy, and St. Jude Children’s Research Hospital.
Disney, for example, contributed connectors to the Trino SQL query engine, and HelloFresh is working with the company on Spark.
“So you can use it on dataframes, which is also very popular,” Masschelein said. “And then in the future we will also go in the direction of streaming. We have made some early prototypes. But we want to make sure that we cover the whole landscape, from streaming to Spark to all SQL sources.”
The New Stack is a wholly owned subsidiary of Insight Partners, an investor in the following companies mentioned in this article: Udemy, HelloFresh.
Feature image via Pixabay.