Data Discovery 1.0 in MoMo



“Humans generate a lot of data”

With nearly 25 million users and hundreds of thousands of transactions every day, we need to build a big data platform to manage the data generated daily across the fintech company.

MoMo E-Wallet

Data discovery is a term used to describe the process for collecting data from various sources by detecting patterns and outliers with the help of guided advanced analytics and visual navigation of data, thus enabling the consolidation of all business information.

Problems

The nature of data usage is problem-driven, meaning data assets (tables, reports, dashboards,…) are aggregated from underlying data assets to help decision making about a particular business problem, feed a machine learning algorithm, or serve as an input to another data asset. The big challenge is also one of the most fundamental: how users find the data they need. As the world creates a significant amount of new data every day, it becomes both harder and more time-consuming to find a new dataset that might be relevant to one’s work. A lack of metadata surrounding these reports and dashboards directly impacts decision making, causes duplication of effort for the Data team, and increases the stakeholders’ reliance on a data-as-a-service model, which in turn inhibits our ability to scale our Data team.

A real example of this problem is the experience of new team members. When they need information about a certain data table, they have to ask many other members of the team and try to piece together useful information from those people. This is often very time-consuming, and frustrating if the results are not as expected.

Goals of our tool

Before we built the tool, answering questions about data at MoMo often involved asking team members in person, reaching out on Google Chat, digging through code changes, digging through DAGs in Airflow, sifting through various job logs,… To kick things off, we spent time conducting user research to learn more about our users, their needs, and their specific pain points regarding data discovery. In doing so, we were able to better understand our users’ intent within the context of data discovery and use this understanding to drive tool development.

Solution

The solution is to hook into the current migration and transform flow, inspect the data as it passes through, and save metadata about it. But what is metadata?

Metadata is simply data about data: it describes the data and gives it context.

There are three kinds of metadata:

To store the metadata, the team chose Google Data Catalog. Why did we choose Data Catalog?

Google Data Catalog. Source: Google

In addition, we use Google Cloud Datastore to store data dependencies, from which we build the upstream and downstream lineage of each data asset.
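To make the idea of upstream and downstream dependencies concrete, here is a minimal sketch of how such lineage can be derived from stored dependency edges. It is an in-memory stand-in, not the actual Cloud Datastore schema used at MoMo: the `LineageGraph` class, its method names, and the table names are all hypothetical.

```python
from collections import defaultdict, deque


class LineageGraph:
    """In-memory stand-in for a dependency store (hypothetical, for illustration only)."""

    def __init__(self):
        self._upstream = defaultdict(set)    # asset -> assets it is built from
        self._downstream = defaultdict(set)  # asset -> assets built from it

    def add_dependency(self, source, target):
        """Record that `target` is built from `source`."""
        self._upstream[target].add(source)
        self._downstream[source].add(target)

    @staticmethod
    def _walk(start, edges):
        """Breadth-first traversal returning every asset reachable from `start`."""
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in edges.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        seen.discard(start)  # only relevant if the graph contains a cycle
        return seen

    def upstream(self, asset):
        """All direct and indirect sources of `asset`."""
        return self._walk(asset, self._upstream)

    def downstream(self, asset):
        """All assets that directly or indirectly depend on `asset`."""
        return self._walk(asset, self._downstream)


# Hypothetical table names, chosen only for the example.
graph = LineageGraph()
graph.add_dependency("raw.transactions", "staging.transactions")
graph.add_dependency("staging.transactions", "report.daily_revenue")
downstream_of_raw = sorted(graph.downstream("raw.transactions"))
upstream_of_report = sorted(graph.upstream("report.daily_revenue"))
```

Storing both directions of each edge keeps upstream and downstream lookups equally cheap, at the cost of writing every dependency twice.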

Architecture

Architecture Overview

This article skips the migration and transform steps from the core database, but they can be summarized briefly as follows:

The flow diagram above shows the metadata sources that our pipeline ingests using Airflow.

Process Metadata Flow

Some metadata fields: Ownership, Updated Time, URL, Description, Tags, Ingest Type, Framework, Statement Type, Storage Type, Table Status, Table Type, Quality Issues, Update Frequency,…
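As an illustration of what one metadata entry might look like, the sketch below models a subset of the fields listed above as a Python dataclass. The class name, field types, default values, and sample values are assumptions made for the example; they are not the schema actually stored in Data Catalog.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class TableMetadata:
    """Hypothetical record covering a few of the metadata fields named in the article."""

    table_name: str
    ownership: str
    description: str = ""
    url: str = ""
    updated_time: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    tags: List[str] = field(default_factory=list)
    table_status: str = "active"     # assumed value, not from the article
    update_frequency: str = "daily"  # assumed value, not from the article
    quality_issues: List[str] = field(default_factory=list)

    def to_record(self) -> dict:
        """Return a plain dict with the timestamp serialized as an ISO 8601 string."""
        record = asdict(self)
        record["updated_time"] = self.updated_time.isoformat()
        return record


# Sample values are invented for the example.
meta = TableMetadata(
    table_name="staging.transactions",
    ownership="data-platform-team",
    description="Cleaned transaction events",
    tags=["finance"],
)
record = meta.to_record()
```

A flat, serializable record like this is easy to push to a catalog or render on a web page, which is why the sketch converts the timestamp to a string rather than keeping a `datetime` object.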

Result

Once the metadata was in place, we built a web landing page that lets all teams discover data assets, lineage, usage, ownership, and other metadata that helps users build the necessary data context.

Data Management Tools

The main functions:

What’s next?

In the mid to long term, we are looking to tackle data asset stewardship, change management, and other aspects of Data Governance.

Data Governance is defined as “the exercise of authority and control (planning, monitoring, and enforcement) over the management of data assets to ensure that those assets are managed properly”.