Considerations for Integration of the Ed-Fi Ecosystem into Data Lake and Next Generation Data Warehouse Architectures
...
There is no silver bullet for building interoperable systems and delivering actionable insights from educational data. The Ed-Fi Technology Suite helps solve both of these problems by providing a stable, single platform serving data that are structured according to an extensive and well-defined Ed-Fi Data Standard: the Data Standard itself; the ODS/API for exchange and storage of operational data; and analytics starter kits that make the operational data more accessible to analysts and end users alike. But it is not a perfect or complete solution. Because the ODS/API is an operational data system, data analysts need to extract granular data from the API, and/or denormalize data in the relational database, before they can begin building reports or analytic dashboards, or connecting to machine learning systems. Furthermore, there is a long lead time to access data from new sources and/or new domains, and the analytics starter kits only scratch the surface of what is possible.
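To make the extraction step concrete, here is a minimal sketch of the offset/limit paging pattern commonly used to pull a full resource collection out of a REST API such as the ODS/API. The `fetch_page` callable stands in for an authenticated HTTP GET against a resource endpoint (e.g. a students collection); all names here are illustrative, not part of any Ed-Fi SDK.

```python
# Sketch: page through an API resource until a short (or empty) page
# signals the end of the collection. In practice fetch_page would wrap
# an authenticated HTTP GET with `offset` and `limit` query parameters.

def fetch_all(fetch_page, limit=100):
    """Collect every record by paging in fixed-size chunks."""
    records, offset = [], 0
    while True:
        page = fetch_page(offset=offset, limit=limit)
        records.extend(page)
        if len(page) < limit:  # short page: no more data
            return records
        offset += limit

# Demonstration with an in-memory stand-in for the API:
data = list(range(250))

def fake_page(offset, limit):
    return data[offset:offset + limit]

all_records = fetch_all(fake_page, limit=100)  # three pages: 100, 100, 50
```

The same loop applies to any paged resource; only the `fetch_page` wiring changes per deployment.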
Data lakes are intended to deliver short lead times for the use of new data, with a philosophy of getting the raw data into an analyst's hands as quickly as possible, and many people have asked "why not just use a data lake?" to solve educational data interoperability and delivery problems. This paper makes the case that a pure data lake, without something like the Ed-Fi Data Standard and the Ed-Fi API, could present more problems than it would solve.
...
- System interoperability: "helping every school district and state achieve data interoperability"
- Serving data to end users: "empower educators with comprehensive, real-time (2) insights into their students’ performance and needs"
The diagrams below illustrate these two problems and will serve as a foundation for the many architectural diagrams that follow.
Problem: interoperability | Problem: serving data |
---|---|
[diagram] | [diagram] |
...
- Data Model Extensions that support exchange and storage of new data elements and domains that are not otherwise covered by the Data Standard;
- API Composites that denormalize data, so that data can be extracted from the API with fewer calls; and
- Custom Analytics Middle Tier (AMT) views that denormalize more of the ODS tables, compared to what is available out of the box.
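The kind of denormalization an AMT view performs can be illustrated with a toy example: a SQL view that joins normalized tables into one flat, analyst-friendly row per student. The table, column, and view names below are simplified stand-ins, not the real ODS schema or AMT view definitions.

```python
import sqlite3

# Toy illustration of a denormalizing view over normalized tables.
# Names (Student, School, StudentDim) are invented for the example.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Student (StudentKey INTEGER PRIMARY KEY,
                          LastName TEXT, SchoolKey INTEGER);
    CREATE TABLE School  (SchoolKey INTEGER PRIMARY KEY,
                          SchoolName TEXT);
    INSERT INTO School  VALUES (1, 'Lincoln Elementary');
    INSERT INTO Student VALUES (10, 'Garcia', 1);

    -- The "view" layer: one flat row per student, no joins needed
    -- by the analyst who queries it.
    CREATE VIEW StudentDim AS
      SELECT s.StudentKey, s.LastName, sch.SchoolName
      FROM Student s
      JOIN School sch ON sch.SchoolKey = s.SchoolKey;
""")

rows = con.execute("SELECT * FROM StudentDim").fetchall()
```

An analyst querying `StudentDim` never needs to know how the underlying tables relate, which is the point of providing such views out of the box.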
The Data Lake Hypothesis
Amazon defines a data lake as
"... a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions."
A central premise of the data lake concept is to get the data into the hands of analysts as quickly as possible and let them sort out the details. That brings us to the question, "why not put it in a data lake?"
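The "store as-is, sort it out later" premise is often called schema on read, and a minimal sketch shows both its speed and its hidden cost. The vendor payload shapes and field names below are invented for illustration: two systems land equivalent records with different structures, and the mapping between them is deferred to the analyst at read time.

```python
import json

# Schema on read: raw records land in the lake untouched; structure
# is applied only when someone reads them. Field names are invented.
raw_landing_zone = [
    '{"student_id": "42", "score": "87"}',          # vendor A's shape
    '{"studentId": 42, "assessmentScore": 87.0}',   # vendor B's shape
]

def read_score(line):
    """Analyst-written mapping, applied at read time, per vendor quirk."""
    rec = json.loads(line)
    sid = rec.get("student_id") or rec.get("studentId")
    score = rec.get("score") or rec.get("assessmentScore")
    return int(sid), float(score)

scores = [read_score(line) for line in raw_landing_zone]
```

Loading was instant, but every analyst who touches these records must rediscover and re-implement this mapping, which is exactly the reconciliation work a shared standard performs once, up front.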
...
(1) Ed-Fi mission statement, from About Ed-Fi:
"The Ed-Fi Alliance is a nonprofit devoted to helping every school district and state achieve data interoperability. By connecting educational data systems, we empower educators with comprehensive, real-time insights into their students’ performance and needs."
(2) Anecdotally, "real-time" in educational data is not meant literally, as it is in some other industries. In educational settings, the main concern is that the data are up to date within a day or two. A counter-example might be literal real-time notifications of classroom attendance. On the other hand, manually-entered attendance data may be prone to errors or recording delays, unless the school is using automated proximity detection (e.g. RFID). In the former case, true real-time may not be desirable; in the latter, the proximity system itself likely takes responsibility for notifications.
(3) Known colloquially, following Microsoft terminology, as Change Data Capture or CDC. Examples: CDC on Microsoft SQL Server and Azure SQL; roll your own with PostgreSQL (1) (2); or use an add-on such as Materialize, Hevo, Aiven, etc.
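For contrast with log-based CDC, the simplest roll-your-own alternative is timestamp polling: pull only rows modified since the last sync watermark. The table and column names below are illustrative stand-ins, and this approach, unlike true CDC, misses hard deletes and intermediate versions of a row.

```python
import sqlite3

# Naive timestamp-polling change capture: select rows whose
# last-modified timestamp is newer than the previous sync watermark.
# Table and column names here are invented for the example.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE AttendanceEvent (Id INTEGER, LastModifiedDate TEXT)")
con.executemany(
    "INSERT INTO AttendanceEvent VALUES (?, ?)",
    [(1, "2024-01-01"), (2, "2024-01-15"), (3, "2024-02-01")])

def changed_since(con, watermark):
    """Rows touched after the watermark (ISO dates compare lexically)."""
    return con.execute(
        "SELECT Id FROM AttendanceEvent WHERE LastModifiedDate > ?",
        (watermark,)).fetchall()

changed = changed_since(con, "2024-01-10")
```

Log-based CDC, as offered by the products cited above, avoids this query's blind spots by reading changes from the database's transaction log instead.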