Data Discovery 1.0 in MoMo
“Humans generate a lot of data”
With nearly 25 million users and hundreds of thousands of transactions every day, we need a big data platform to manage the data generated daily at the fintech company.
MoMo E-Wallet
Data discovery is a term used to describe the process of collecting data from various sources and detecting patterns and outliers with the help of guided advanced analytics and visual navigation, thus enabling the consolidation of all business information.
Problems
The nature of data usage is problem-driven: data assets (tables, reports, dashboards, …) are aggregated from underlying data assets to support decisions about a particular business problem, feed a machine learning algorithm, or serve as an input to another data asset. The biggest challenge is also one of the most fundamental: how do people find the data they need? As the world creates a significant amount of new data every day, it becomes both harder and more time-consuming to find a dataset that might be relevant to one's work. The lack of metadata around these reports and dashboards directly impacts decision making, causes duplicated effort in the Data team, and pushes stakeholders toward a data-as-a-service model that in turn limits our ability to scale the Data team.
A telling example is the experience of new team members. When a newcomer needs information about a certain data table, they have to ask many other members of the team and try to piece together useful answers. This is often time-consuming and frustrating, especially when the results are not as expected.
Goals of our tool
- Be simple to implement: People do not need statistical degrees or an analytical background to use them.
- Be adaptable: Anyone can gain insights from data across all departments without relying on IT experts for information.
- Be quick: You can understand exactly what you need to improve your decision-making abilities without waiting to get the information that you need.
- Easily work with massive amounts of data: Visual discovery is helping expand traditional business intelligence and improve efficiency.
Before we built the tool, finding answers to data questions at MoMo often involved asking team members in person, reaching out on Google Chat, digging through code changes, inspecting DAGs in Airflow, and sifting through various job logs. To kick things off, we spent time conducting user research to learn more about our users, their needs, and their specific pain points around data discovery. In doing so, we were able to better understand our users' intent within the context of data discovery and use this understanding to drive tool development.
Solution
The solution is to hook into the current migration and transform flow to inspect the data and save metadata from it. But what is metadata?
Metadata is simply data about data: it provides description and context for the data.
There are three kinds of metadata:
- Technical metadata: schemas, tables, columns,…
- Business metadata: business descriptions, comments, annotations, classifications,…
- Operational metadata: data owner, dependencies, update frequency,…
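As an illustration, a single table's entry combines all three kinds of metadata. The class and field names below are hypothetical, chosen only to make the distinction concrete, and are not MoMo's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class TableMetadata:
    # Technical metadata: the physical structure of the asset
    schema: str
    table: str
    columns: list
    # Business metadata: meaning and context for humans
    description: str = ""
    classifications: list = field(default_factory=list)
    # Operational metadata: ownership and lifecycle
    owner: str = ""
    update_frequency: str = ""

entry = TableMetadata(
    schema="warehouse",
    table="daily_transactions",
    columns=["user_id", "amount", "created_at"],
    description="Aggregated daily transaction totals per user",
    classifications=["finance"],
    owner="data-team",
    update_frequency="daily",
)
```

Keeping the three kinds of metadata in one record is what later lets a single search answer both technical questions ("which table has a `user_id` column?") and operational ones ("who owns it?").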
To store the metadata, the team chose Google Data Catalog. Why Data Catalog?
Google Data Catalog. Source: Google
- It can catalog the native metadata on data assets from Google Cloud sources such as GCS (Google Cloud Storage), BigQuery, Pub/Sub, …
- It offers powerful, structured search capabilities and predicate-based filtering over both the technical and business metadata of a data asset.
- It lets us build additional applications that consume this contextual metadata about a data asset and take further actions.
In addition, we use Google Cloud Datastore to store data dependencies, from which we build upstream and downstream lineage.
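Dependencies stored as simple parent-to-child pairs are enough to answer upstream and downstream questions. A minimal in-memory sketch of the idea (the table names are invented; the real system reads these edges from Datastore):

```python
from collections import defaultdict

# Each edge says: the second table is built from the first.
edges = [
    ("core.transactions", "staging.transactions"),
    ("staging.transactions", "warehouse.daily_transactions"),
    ("warehouse.daily_transactions", "report.revenue_dashboard"),
]

downstream = defaultdict(set)
upstream = defaultdict(set)
for src, dst in edges:
    downstream[src].add(dst)
    upstream[dst].add(src)

def walk(graph, start):
    """Collect every asset reachable from `start` in `graph`."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# Every asset that depends on the staging table:
impacted = walk(downstream, "staging.transactions")
```

The same traversal run over `upstream` answers the inverse question: which sources feed a given dashboard.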
Architecture
Architecture Overview
This article skips the migration and transform steps from the core database, but they can be summarized as follows:
- Step 1: Apache Spark uses SparkSQL to extract data from the core database; the raw files are then stored in GCS.
- Step 2: GCS acts as the data source where all raw files are stored.
- Step 3: Apache Airflow creates DAGs to load data from GCS into BigQuery staging tables. The ETL process then takes data from those staging tables and creates the data warehouse tables.
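The steps above form a simple dependency chain, which Airflow resolves in topological order. As a sketch of that ordering (the task names are illustrative, not the actual DAG task IDs), the standard library can mimic it:

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on,
# mirroring the three migration/transform steps.
pipeline = {
    "extract_with_spark": set(),                # Step 1: SparkSQL -> raw files
    "store_raw_to_gcs": {"extract_with_spark"}, # Step 2: raw files land in GCS
    "load_to_bigquery": {"store_raw_to_gcs"},   # Step 3a: staging tables
    "build_warehouse": {"load_to_bigquery"},    # Step 3b: ETL to warehouse tables
}

order = list(TopologicalSorter(pipeline).static_order())
```

Because each step has exactly one predecessor here, the resulting order is the chain extract, store, load, build; in the real DAGs, fan-out across many tables makes this ordering non-trivial.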
This flow diagram shows the metadata sources our pipeline ingests using Airflow.
Process Metadata Flow
Some metadata fields: Ownership, Updated Time, URL, Description, Tags, Ingest Type, Framework, Statement Type, Storage Type, Table Status, Table Type, Quality Issues, Update Frequency,…
Result
After we have the metadata, we build a web landing page that lets all teams discover data assets, lineage, usage, ownership, and other metadata that helps users build the necessary data context.
Data Management Tools
The main functions:
- Search data entities by name, schema, or tags.
- Search predicates can be business or technical.
- Tags are an extension of the existing metadata.
- Show all upstream and downstream dependencies.
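A rough sketch of how such a search behaves over the stored metadata. The records and field names here are invented for illustration; the real tool issues these queries against Data Catalog:

```python
records = [
    {"name": "daily_transactions", "schema": "warehouse",
     "tags": ["finance", "daily"]},
    {"name": "user_profiles", "schema": "warehouse",
     "tags": ["pii"]},
    {"name": "revenue_dashboard", "schema": "report",
     "tags": ["finance"]},
]

def search(records, term):
    """Match `term` against entity name, schema, or tags (case-insensitive)."""
    term = term.lower()
    return [
        r["name"] for r in records
        if term in r["name"].lower()
        or term == r["schema"].lower()
        or term in (t.lower() for t in r["tags"])
    ]

print(search(records, "finance"))  # -> ['daily_transactions', 'revenue_dashboard']
```

A tag query like this is what lets a business user find every finance-related asset without knowing any table names up front.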
What’s next?
In the mid to long term, we are looking to tackle data asset stewardship, change management, and other aspects of Data Governance.
Data Governance is defined as “the exercise of authority and control (planning, monitoring, and enforcement) over the management of data assets to ensure that those assets are managed properly”.