Considerations for Integration of the Ed-Fi Ecosystem into Data Lake and Next Generation Data Warehouse Architectures

...

There is no silver bullet for building interoperable systems and delivering actionable insights from educational data. The Ed-Fi Technology Suite helps solve both of these problems by providing a stable, single platform that serves data structured according to the extensive and well-defined Ed-Fi Data Standard. The suite comprises the Data Standard; the ODS/API for exchange and storage of operational data; and analytics starter kits that make the operational data more accessible to analysts and end-users alike. But it is not a perfect or complete solution. Because it is an operational data system, data analysts need to extract granular data from the API, and/or denormalize data in the relational database, before they can begin building reports and analytic dashboards or connecting to machine learning systems. Furthermore, there is typically a long lead time to access data from new sources and/or new domains, and the analytics starter kits only scratch the surface of what is possible.

Data lakes are intended to deliver short lead times for use of new data, with a philosophy of getting the raw data into an analyst's hands as quickly as possible. Many people have asked "why not just use a data lake?" to solve educational data interoperability and delivery problems. This paper makes the case that a pure data lake, without something like the Ed-Fi Data Standard and the Ed-Fi API, could present more problems than it would solve.

...

Ed-Fi's mission statement¹ contains two core technology problems:

System interoperability: "helping every school district and state achieve data interoperability"
Serving data to end users: "empower educators with comprehensive, real-time² insights into their students’ performance and needs"

The diagrams below illustrate these two problems and will serve as a foundation for the many architectural diagrams that follow.

[Diagrams: "Problem: interoperability" and "Problem: serving data"]

...

Data Model Extensions that support exchange and storage of new data elements and domains that are not otherwise covered by the Data Standard;
API Composites that denormalize data, so that data can be extracted from the API with fewer calls; and
Custom Analytics Middle Tier (AMT) views that denormalize more of the ODS tables, compared to what is available out of the box.
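
To make "denormalize" concrete, the following minimal sketch joins two normalized resource collections into flat, analysis-ready rows. The payloads are hypothetical and heavily simplified; they are only loosely modeled on student and enrollment resources and are not the actual API response shapes. The function name `denormalize_enrollments` is likewise an illustration, not part of any Ed-Fi tooling.

```python
from typing import Any, Dict, List

# Hypothetical, simplified payloads (assumed shapes, not real API output).
students: List[Dict[str, Any]] = [
    {"studentUniqueId": "1001", "firstName": "Ana", "lastSurname": "Diaz"},
    {"studentUniqueId": "1002", "firstName": "Ben", "lastSurname": "Okoro"},
]
enrollments: List[Dict[str, Any]] = [
    {
        "studentReference": {"studentUniqueId": "1001"},
        "schoolReference": {"schoolId": 55},
        "entryDate": "2021-08-16",
    },
    {
        "studentReference": {"studentUniqueId": "1002"},
        "schoolReference": {"schoolId": 77},
        "entryDate": "2021-08-17",
    },
]


def denormalize_enrollments(
    students: List[Dict[str, Any]], enrollments: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Join each enrollment to its student, producing one flat row per enrollment."""
    # Index students by their identifier so each lookup is O(1).
    students_by_id = {s["studentUniqueId"]: s for s in students}
    rows = []
    for enrollment in enrollments:
        student_id = enrollment["studentReference"]["studentUniqueId"]
        student = students_by_id.get(student_id)
        if student is None:
            # Skip enrollments whose student record was not extracted.
            continue
        rows.append(
            {
                "studentUniqueId": student_id,
                "firstName": student["firstName"],
                "lastSurname": student["lastSurname"],
                "schoolId": enrollment["schoolReference"]["schoolId"],
                "entryDate": enrollment["entryDate"],
            }
        )
    return rows


flat_rows = denormalize_enrollments(students, enrollments)
```

Each customization listed above moves this kind of join closer to the source, so that analysts do not have to repeat it in every report.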

The Data Lake Hypothesis

Amazon defines a data lake as

"... a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions."

A central premise of the data lake concept is to get the data into the hands of analysts as quickly as possible and let them sort out the details. That brings us to the question, "why not put it in a data lake?"
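
As a rough illustration of the "store as-is" premise, the sketch below lands records from two hypothetical source systems in a local folder that stands in for a data lake. The folder layout, the function names `land_raw` and `read_raw`, and the record fields are all assumptions made for this example. Nothing is restructured on the way in, so the analyst reading the data back is the one who discovers that the two sources describe the same concept with different field names.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def land_raw(lake_root: Path, source: str, records: List[Dict[str, Any]]) -> Path:
    """Write records unchanged, one JSON document per line, under raw/<source>/."""
    target = lake_root / "raw" / source / "records.jsonl"
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
    return target


def read_raw(path: Path) -> List[Dict[str, Any]]:
    """Read the records back exactly as they were landed."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Two hypothetical systems reporting the same absence in different shapes.
sis_records = [{"student_id": "1001", "absent": True}]
lms_records = [{"learnerKey": "1001", "attendance": "A"}]

lake_root = Path(tempfile.mkdtemp())  # stand-in for cloud object storage
sis_path = land_raw(lake_root, "sis", sis_records)
lms_path = land_raw(lake_root, "lms", lms_records)

# Schema-on-read: the differences only surface when someone reads the data.
field_names = sorted(
    {key for path in (sis_path, lms_path) for record in read_raw(path) for key in record}
)
```

Loading is quick, but reconciling `student_id` with `learnerKey`, and `absent` with `attendance`, is deferred to every consumer of the data.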

...

(1) Ed-Fi mission statement, from About Ed-Fi:

"The Ed-Fi Alliance is a nonprofit devoted to helping every school district and state achieve data interoperability. By connecting educational data systems, we empower educators with comprehensive, real-time insights into their students’ performance and needs."

(2) Anecdotally, "real-time" in educational data is not meant literally, as it is in some other industries. In educational settings, the greater concern is that the data are up-to-date within a day or two. A counter-example might be literal real-time notifications on classroom attendance. On the other hand, manually-entered attendance data may be prone to errors or recording delays, unless the school is using automated proximity detection (e.g. RFID). In the manual case, actual real-time may not be desirable, and in the automated case, the proximity system itself likely takes responsibility for notifications.

(3) Colloquially, following the Microsoft terminology, known as Change Data Capture or CDC. Examples: CDC on Microsoft SQL Server and Azure SQL; roll your own with PostgreSQL (1) (2) or use an add-on such as Materialize, Hevo, Aiven, etc.

Version	Old Version 4	New Version 5
Changes made by	Stephen Fuqua	Stephen Fuqua
Saved on	Feb 10, 2022	Feb 10, 2022
