Leaves and (money) trees

The MTA recently released three budget open datasets on Statement of Operations, Subsidies, and Headcount data reported in the MTA’s budget documents. The MTA’s Open Data Team is excited for users to explore this data and see what insights they come up with. In this blog, though, we aren’t going to talk about the datasets themselves, but the process of putting them together. Specifically, we’ll discuss the (hopefully) interesting technical challenge of turning “long” entity and account hierarchy datasets into more usable “wide” datasets.

The MTA’s budget database

Budget data at the MTA is pulled from the financial planning tool Hyperion. In Hyperion, a database called ESSBASE stores data using “on-line analytical processing” (OLAP) technology. An OLAP cube, more commonly known as a multidimensional cube, is a data structure used in data warehousing and business intelligence to store and analyze multidimensional data. It is essentially an aggregated database where stored data is pre-calculated, making it easy for users to explore and interact with large amounts of data quickly.

You might be asking, “I’m just interested in the budget; why do I care about OLAP?” To produce a single row in a budget table, multiple dimensions (categories) are grouped together and aggregated over the measured values (headcount or dollar value). With regards to financial data, the important questions are what was spent (accounts; think supplies, salaries) and who spent it (entity; think departments, divisions).

An illustration of an OLAP cube — *Figure 1: An example OLAP cube.*

To access budget data in ESSBASE, we created queries to extract the underlying data. Each extract came with two corresponding metadata files: an entity hierarchy and an account hierarchy. To create open datasets that allow users to gather their own insights into the budget, we needed to develop a fundamental understanding of the hierarchical relationships between accounts and entities across the MTA, and how those map to the budget tables MTA presents to the public. This would allow us to produce datasets where each record represents each line item in the public reports, organized by subsets of financial categories.

A table of MTA Bridges and Tunnels total positions from the 2024-27 February Financial Plan — *Figure 2: Bridges and Tunnels Headcount table from the* *2024 Adopted Budget. Entities and accounts in MTA’s open data needed to map to these tables in the budget book.*

Thus began our relationship with the MTA Headquarters Consolidated Analysis Group (CAG), the team responsible for preparing monthly actuals reports and annual financial plans. Their extensive knowledge covers individual MTA agency budgets, downstream Hyperion data pipelines, dimension hierarchies, financial and legal consolidation requirements, and more. Over the course of several months, we worked closely to develop a common language of interpreting the budget data and pieced together the business logic applied to the consolidated reporting system.

These conversations naturally evolved to include the database administrator of ESSBASE, who provided the systems approach for how and when to pull the budget data. Our next step was to incorporate the consolidation requirements into our extract, using the hierarchies to inform their structure. The account and entity dimensions in ESSBASE contain multiple trees for the same lowest-level account or entity, allowing for different ways of reporting the same underlying data (think internal operations reports versus external board material reports). The challenge of having different ways to represent the same budget data is finding the right subset of dimensions for the open datasets.

Multidimensional data hierarchies

In a multidimensional database, every dimension represents a subsection of the database tree. Within each tree, “generations” and “levels” represent the distance from either the “root” (viewed from the top down) or “leaves” (viewed from the bottom up) of the dimension. This position reference allows for locating any member within a database tree and, more broadly, an understanding of the relationships between groups of related members. Some of these hierarchies can be quite deep, as far as 15 entities or accounts from root to leaf.

An illustration of generations and levels — *Figure 3: An illustration from* *Oracle* *of generations and levels.*

An entity tree diagram for the MTA — *Figure 4: Sample MTA entity tree diagram, four generations down.*

To determine where accounts and entities in the budget extract fell in MTA’s hierarchy, we needed to join the metadata files to the extract. However, the original metadata files did not make these hierarchies obvious. Instead, the metadata files had three columns—child, parent, and child_description—and rows for all entities and accounts.

*Figure 5: Snippet of the MTA’s entity metadata file from Hyperion.*

Alone, these files weren’t that useful:

They don’t obviously communicate what level of the hierarchy an entity or account occupies. Is an entity the lowest level in a hierarchy (think NYCTA A Division Transport Operations)? Somewhere in the middle of the hierarchy (think NYCTA Service Delivery)? Or at the top of a hierarchy (think NYCT, or MTA Consolidated)?
They don’t tell you how deep the hierarchy is, or what distinct hierarchies are represented in the metadata file. The same child account can repeat multiple times in the metadata file, with different parents, and appear as a parent for other accounts. This makes sense for representing accounts or entities in different ways, but it makes interpreting the metadata files confusing; if we see two rows with a child of Farebox Revenue, one with a parent of Operating Revenue, and one with a parent of Uncontrollable Revenue, it’s hard to make sense of what function each serves without knowing the rest of the hierarchy.
They don’t provide you an easy way to understand the account or entity’s parent, just the code or number associated with it. Knowing entity “A Division Transport Operations” has a parent of “NYCTA_2398” isn’t that helpful unless you’re already pretty knowledgeable about the MTA.

Unfolding entity/account trees

To create a more useful entity and account hierarchy, we needed to unfold the metadata files so that each row represented one complete hierarchy for a lowest-level account or entity, with columns for each level of the hierarchy. We did this using a function we made (with some help from a Medium blog post) using Python’s polars library called unfold_trees.

A screenshot of a spreadsheet showing an unfolded MTA entity tree — *Figure 6: Unfolded entity tree, created by applying the* `unfold_trees` *function to the entity metadata dataset. The tree root is the first column, and lower-level entities fill subsequent columns*

This process proceeds in a few steps:

The function filters the original metadata dataset to all rows where a parent account/entity never appears as a child (to unfold from the bottom up, you could do the reverse instead). This represents the top-level of a hierarchy for each account/entity in the data (our unfolding dataset), and we subset the dataset to the parent column renamed as [account or entity]_tree_1.
This next part starts incrementally unfolding the hierarchy to subsequent generations from the top-down, and we repeat it until we’ve reached the lowest-level child entity (the leaf).
- We left-join the original dataset back to our unfolding dataset, using the [account or entity]_tree_1 column in the unfolding dataset and the parent column in the original dataset.
- We keep the columns in the unfolding dataset, but rename the joined child and child_description columns as the next level in the hierarchy; for the first join, [account or entity]_tree_2, and for subsequent joins account_or_entity_tree_[n+1], where n represents each iteration.
- As part of each iteration, we check whether all account_or_entity_tree_[n+1]s are Null; essentially, whether there any generations left to unfold. If there aren’t, we stop iterating, and are left with a dataset showing from left to right the top level of each hierarchy to the bottom (in our example earlier, from “MTA Consolidated” in entity_tree_1 to “A Division Transport Operations” in entity_tree_8).
Once we’ve fully unfolded the hierarchy, we look back through each row to find the lowest-level account/entity, and add that as a column to the end of the data as [account or entity]_child. Each hierarchy can be of different lengths, so it’s convenient to have information on the lowest-level account/entity in one place.

This doesn’t resolve all the challenges of working with the budget data. Each row tells us where entities/accounts fall in a hierarchy of other accounts/entities in the same row, but not where accounts/entities fall in relation to accounts/entities in other rows. So Farebox Revenue appears at account_tree_4 for one account, and Health and Welfare at account_tree_7 for another account, even though they exist at similar conceptual levels in the Consolidated Statement of Operations.

A page from the MTA March 2024 Statement of Operations, showing Farebox Revenue as Generation 4 and Health and Welfare Labor Expenses as Generation 7 — *Figure 7:* *March 2024 Statement of Operations* *report, where accounts are displayed according to their generation in their relative hierarchy.*

Lowest-level accounts can also repeat with different hierarchies (e.g., representing Farebox Revenue as Operating or Uncontrollable Revenue). And different queries produce different metadata files; an early, incomplete version of the query produced a discontinuous hierarchy, where entities appeared at levels of the hierarchy they shouldn't have.

But unfolding the entity and account files made it much easier to know 1) what accounts/entities are the lowest level in the data, and 2) where they existed in the account/entity hierarchy, an important starting point for creating an open budget dataset.

Conclusion

This was a just small part of a long and involved project, but we hope a helpful illustration of some of the complexities of working with multidimensional data. And we hope you enjoy exploring the fruits of this project on the NY State Open Data portal—MTA’s Statement of Operations, Subsidies, and Headcount budget datasets.

About the authors

Niki Keramat and Dan Powers are Senior Data Scientists on the Data & Analytics team.

Data and Analytics Blog