
How we built MTA’s internal data catalog

Updated September 6, 2024 3:30 p.m.

The MTA generates a staggering amount of data every minute—from MetroCard swipes and OMNY payments to real-time bus pings and elevator outages. Managing this data effectively requires coordination across various MTA departments. While it is common for data collection and maintenance to be siloed, a centralized, standardized repository is far more effective in helping us use our data to deliver better service to our customers. This data centralization facilitates easier collaboration among engineers, data scientists, analysts, planners, and leadership.

Over the past year, our data engineering team has invested significant effort in centralizing the flow and availability of MTA data in our data lake. This brings us a step closer to our goal of facilitating data-driven decision-making at the MTA. Naturally, as our data lake grows, so does the number of users who access it. These users need to be able to easily find answers to questions like:

  • Is the data I’m looking for already present on the data lake?
  • Is this the right table for me to use?
  • Are these the right columns for me to use? Are they getting populated regularly?
  • If there are issues with the data, who is the data owner I can contact?  

Previously, analysts and data scientists had to navigate a maze of emails, DMs, and face-to-face queries to find data owners and relevant datasets. This approach simply does not scale for an organization the size of the MTA. We needed a solution that could keep pace with the growth of datasets on the data lake while providing accurate information transparently to all users.

Enter Amundsen, an open-source data discovery application. We built a custom data discovery engine on top of this powerful tool, simplifying how MTA staff explore and interact with our data.

Exploring MTA Amundsen

Our Amundsen-based data discovery engine provides a user-friendly search interface, allowing data users to quickly locate tables, dashboards, and users by keyword.

Here’s how it works:

  1. Search and Discover: On the landing page, users can type keywords to find associated tables and dashboards.
Screenshot of the MTA Data Catalog landing page
Screenshot of search results for "lirr" on the MTA Data Catalog
  2. Table Insights: Clicking on a specific table reveals detailed metadata in the left panel, including table descriptions, data owners, tags, grain (i.e., what each row in the table represents), and more. The right panel displays column-level metadata, such as column descriptions and types. From the table page, users can also navigate to the Airflow DAGs that generate the table and access its source code.

  3. Dashboard Links: Some tables also link to related dashboards, which users can open from the Dashboard tab. Users can also see the queries used in a dashboard under the “Queries” tab.

How we generate Amundsen metadata

To keep our data catalog accurate and up to date, we use Amundsen’s databuilder library to run various metadata extraction jobs (a generic job sketch follows the list). These include:

  • Spark Metadata Extractor
  • YAML Metadata Extractor
  • Dashboard Metadata Extractors
  • Airflow Metadata Extractor
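
All of these jobs follow databuilder’s standard extractor → loader → publisher pattern: an extractor emits metadata records, a loader stages them as node and relationship CSVs, and a publisher writes them into Neo4j. The sketch below follows the shape of the public databuilder sample scripts; the CsvExtractor, file paths, model class, and credentials are illustrative stand-ins rather than our production configuration, and config keys can vary by library version.

    from pyhocon import ConfigFactory

    from databuilder.extractor.csv_extractor import CsvExtractor
    from databuilder.job.job import DefaultJob
    from databuilder.loader.file_system_neo4j_csv_loader import FsNeo4jCSVLoader
    from databuilder.publisher.neo4j_csv_publisher import Neo4jCsvPublisher
    from databuilder.task.task import DefaultTask

    # Staging directories for the node/relationship CSVs the loader produces.
    NODE_DIR = '/tmp/amundsen/nodes'
    RELATION_DIR = '/tmp/amundsen/relationships'

    job_config = ConfigFactory.from_dict({
        # Illustrative source: a CSV of table-owner records. Our real jobs use
        # the Spark, YAML, Airflow, and dashboard extractors listed above.
        'extractor.csv.file_location': 'example/sample_table_owner.csv',
        'extractor.csv.model_class': 'databuilder.models.table_owner.TableOwner',
        # The loader stages graph nodes and relationships on disk ...
        'loader.filesystem_csv_neo4j.node_dir_path': NODE_DIR,
        'loader.filesystem_csv_neo4j.relationship_dir_path': RELATION_DIR,
        # ... and the publisher bulk-loads them into Neo4j.
        'publisher.neo4j.node_files_directory': NODE_DIR,
        'publisher.neo4j.relation_files_directory': RELATION_DIR,
        'publisher.neo4j.neo4j_endpoint': 'bolt://localhost:7687',
        'publisher.neo4j.neo4j_user': 'neo4j',
        'publisher.neo4j.neo4j_password': 'password',
        'publisher.neo4j.job_publish_tag': '2024-09-06',  # tag each run uniquely
    })

    DefaultJob(conf=job_config,
               task=DefaultTask(extractor=CsvExtractor(), loader=FsNeo4jCSVLoader()),
               publisher=Neo4jCsvPublisher()).launch()

Swapping the extractor (and its config block) is all it takes to reuse the same task and job scaffolding across the different metadata sources.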

User-defined metadata

Data owners contribute metadata via YAML files, which are read and processed as part of our CI/CD build pipeline. Integrating this step into our build pipelines ensures a seamless update process whenever a pull request is merged, without manual intervention or having to mount metadata files onto our Amundsen Docker containers.
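
As a concrete illustration, a data owner’s YAML file might look something like the sketch below. The table name, schema, and field names here are hypothetical, not our exact production format:

    # Hypothetical metadata file -- fields are illustrative only.
    table: subway_hourly_ridership
    schema: transit
    description: Hourly ridership counts by station, derived from OMNY and MetroCard taps.
    owner: jane.doe@example.com
    tags: [ridership, subway]
    grain: One row per station per hour.
    columns:
      - name: station_id
        description: Unique identifier for the station complex.
      - name: ridership
        description: Count of entries recorded during the hour.

Because these files live alongside the pipeline code, metadata changes go through the same pull-request review process as the pipelines themselves.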

Metadata extraction jobs

Our Amundsen DAG on Airflow runs daily to perform the following metadata extraction jobs (a simplified sketch of the DAG follows the list):

  1. YAML Metadata: Processed as part of our build pipeline, which pushes a metadata JSON blob to our data lake.
  2. Hive Catalog Comparison Spark Job: Ensures no metadata is missed even when YAML files are absent. If a table lacks a YAML file but metadata exists in the Hive catalog (say, from a Spark script someone wrote a while ago), we still pull that metadata into the data catalog.
  3. Table & Column Metadata Extractor: Loads the table and column metadata generated by the first two processes.
  4. Airflow Extractor: Identifies table-DAG and table-source code linkages, which let us display the Airflow and GitLab widgets on each table’s page.
  5. Dashboard Extraction: Retrieves dashboard metadata from the dashboard API, including the dashboard URL and the SQL queries used to generate the dashboard. These queries are also used to identify table-dashboard linkages.
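
Here is a simplified sketch of what such a daily DAG could look like. The DAG ID, task names, and the run_job placeholder are hypothetical; the sketch only mirrors the ordering described above, not our production code.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def run_job(job_name: str) -> None:
        """Placeholder: in production each task launches a databuilder job."""
        print(f'running {job_name}')

    with DAG(dag_id='amundsen_metadata_refresh',
             start_date=datetime(2024, 1, 1),
             schedule_interval='@daily',
             catchup=False) as dag:
        yaml_metadata = PythonOperator(task_id='yaml_metadata',
                                       python_callable=run_job,
                                       op_kwargs={'job_name': 'yaml_metadata'})
        hive_comparison = PythonOperator(task_id='hive_catalog_comparison',
                                         python_callable=run_job,
                                         op_kwargs={'job_name': 'hive_comparison'})
        table_columns = PythonOperator(task_id='table_column_metadata',
                                       python_callable=run_job,
                                       op_kwargs={'job_name': 'table_columns'})
        airflow_links = PythonOperator(task_id='airflow_linkages',
                                       python_callable=run_job,
                                       op_kwargs={'job_name': 'airflow_links'})
        dashboards = PythonOperator(task_id='dashboard_metadata',
                                    python_callable=run_job,
                                    op_kwargs={'job_name': 'dashboards'})
        refresh = PythonOperator(task_id='refresh_neo4j_and_elasticsearch',
                                 python_callable=run_job,
                                 op_kwargs={'job_name': 'refresh'})

        # YAML and Hive-catalog metadata feed the table/column load, followed
        # by the linkage and dashboard jobs and the final index refresh.
        [yaml_metadata, hive_comparison] >> table_columns >> airflow_links >> dashboards >> refresh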

After extraction, we clear and update our Neo4j graph database and index metadata to Elasticsearch to support Amundsen’s metadata and search functionalities.
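
That final refresh can itself be expressed as a databuilder job: an extractor reads search documents out of Neo4j, a filesystem loader stages them as JSON, and an Elasticsearch publisher writes a fresh index and flips the alias. Again, this is a sketch in the shape of the databuilder sample script; endpoints, credentials, and index names are assumptions, and the exact config keys depend on the library version.

    import uuid

    from elasticsearch import Elasticsearch
    from pyhocon import ConfigFactory

    from databuilder.extractor.neo4j_search_data_extractor import Neo4jSearchDataExtractor
    from databuilder.job.job import DefaultJob
    from databuilder.loader.file_system_elasticsearch_json_loader import FSElasticsearchJSONLoader
    from databuilder.publisher.elasticsearch_publisher import ElasticsearchPublisher
    from databuilder.task.task import DefaultTask

    es_client = Elasticsearch(['http://localhost:9200'])  # assumed local ES
    data_path = '/tmp/amundsen/search_data.json'
    # Publish into a uniquely named index, then point the search alias at it.
    new_index = f'tables-{uuid.uuid4().hex}'

    job_config = ConfigFactory.from_dict({
        # Pull table documents for search out of the Neo4j graph.
        'extractor.search_data.entity_type': 'table',
        'extractor.search_data.extractor.neo4j.graph_url': 'bolt://localhost:7687',
        'extractor.search_data.extractor.neo4j.model_class':
            'databuilder.models.table_elasticsearch_document.TableESDocument',
        'extractor.search_data.extractor.neo4j.neo4j_auth_user': 'neo4j',
        'extractor.search_data.extractor.neo4j.neo4j_auth_pw': 'password',
        # Stage the documents as JSON on disk, then publish to Elasticsearch.
        'loader.filesystem.elasticsearch.file_path': data_path,
        'loader.filesystem.elasticsearch.mode': 'w',
        'publisher.elasticsearch.file_path': data_path,
        'publisher.elasticsearch.mode': 'r',
        'publisher.elasticsearch.client': es_client,
        'publisher.elasticsearch.new_index': new_index,
        'publisher.elasticsearch.doc_type': 'table',
        'publisher.elasticsearch.alias': 'table_search_index',
    })

    DefaultJob(conf=job_config,
               task=DefaultTask(extractor=Neo4jSearchDataExtractor(),
                                loader=FSElasticsearchJSONLoader()),
               publisher=ElasticsearchPublisher()).launch()

Publishing to a new index and then swapping the alias keeps search available to users while the refresh runs.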

Screenshot of an Amundsen Airflow DAG
Flow chart of the MTA's Amundsen architecture

Future scope and final thoughts

In the coming months, the data engineering team plans to add several new features to the data catalog, including table and column lineage using the OpenLineage framework, which will give users a clear view of data flows and transformations. We also plan to surface frequent users alongside other useful table metadata.

Data discovery is a growing effort at the MTA. We hope that this work will enable a more robust and intuitive data exploration experience and expand the scope of data usage within the organization.

About the author

Preksha Agarwal is a Senior Data Engineer on the Data & Analytics team.