Considerations for Integration of the Ed-Fi Ecosystem into Data Lake and Next Generation Data Warehouse Architectures
...
There is no silver bullet for building interoperable systems and delivering actionable insights from educational data. The Ed-Fi Technology Suite helps solve both problems by providing a stable, single platform: the extensive and well-defined Ed-Fi Data Standard; the ODS/API for exchange and storage of operational data; and analytics starter kits that make the operational data more accessible to analysts and end users alike. But it is not a perfect or complete system. Because it is an operational data system, data analysts need to extract granular data from the API, and/or denormalize data in the relational database, before they can begin building reports, creating analytic dashboards, or connecting to machine learning systems. Furthermore, there is typically a long lead time to access data from new sources and/or new domains, and the analytics starter kits only scratch the surface of what is possible.
Data lakes are intended to deliver short lead times for use of new data, with a philosophy of getting the raw data into an analyst's hands as quickly as possible. Many people have asked "why not just use a data lake?" to solve educational data interoperability and delivery problems. This paper makes the case that a pure data lake, without something like the Ed-Fi Data Standard and the Ed-Fi API, could present more problems than it would solve.
...
- System interoperability: "helping every school district and state achieve data interoperability"
- Serving data to end users: "empower educators with comprehensive, real-time (2) insights into their students’ performance and needs"
The diagrams below illustrate these two problems and will serve as a foundation for the many architectural diagrams that follow.
[Diagram: the interoperability problem | Diagram: the data-serving problem]
...
The Ed-Fi Data Standard ("the Data Standard") creates a common language for storing and exchanging a broad range of educational data. An Ed-Fi-compatible API ("Ed-Fi API") supports the Data Standard by serving as a hub for both data exchange and downstream uses. The Data Standard and the Ed-Fi API specifications are available for anyone to implement. In addition to these standards, the Ed-Fi Alliance produces a Technology Suite of software that includes:
...
- Data Model Extensions that support exchange and storage of new data elements and domains that are not otherwise covered by the Data Standard;
- API Composites that denormalize data, so that data can be extracted from the API with fewer calls; and
- Custom Analytics Middle Tier (AMT) views that denormalize more of the ODS tables, compared to what is available out of the box.
The Data Lake Hypothesis
Amazon defines a data lake as
"... a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions."
A central premise of the data lake concept is to get the data into the hands of analysts as quickly as possible and let them sort out the details. That brings us to the question, "why not put it in a data lake?"
...
Build and schedule an API-based ETL process, using the Changed Record Queries functionality in the ODS/API. The Changed Record Queries feature shortens the data load time by helping the API client retrieve only the records changed since the last extraction. The retrieved data have shapes defined by the Data Standard, since they are standard API payloads. This batch-based process fits well with many systems, though some implementations might prefer a more real-time approach.
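The pagination loop behind such an extraction can be sketched as below. This is an illustrative sketch, not the official client: the `minChangeVersion`/`maxChangeVersion` query parameters come from the ODS/API Changed Record Queries feature, while the `fetch_page` callable is a hypothetical stand-in for an authenticated HTTP GET against a resource endpoint.

```python
def extract_changed_records(fetch_page, min_version, max_version, page_size=500):
    """Pull every record of one resource that changed within a change-version
    window, page by page.

    fetch_page(params) -> list of JSON records. In a real client this would
    issue an OAuth-authenticated GET against an Ed-Fi resource endpoint
    (e.g. .../data/v3/ed-fi/students) with the given query parameters.
    """
    records = []
    offset = 0
    while True:
        page = fetch_page({
            "minChangeVersion": min_version,  # window start: last synced change version
            "maxChangeVersion": max_version,  # window end: snapshot taken at extract time
            "limit": page_size,
            "offset": offset,
        })
        records.extend(page)
        # A short page means the window is exhausted.
        if len(page) < page_size:
            return records
        offset += page_size
```

Pinning `maxChangeVersion` at the start of the run keeps the batch consistent even if new writes arrive mid-extraction; the next scheduled run picks up from that version.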
...
To be secure and trusted as a data source, a data lake strategy needs robust governance practices. Downstream users will have greater confidence in the data when they know that they have already been validated through the Ed-Fi ODS/API. But that is only part of the trust equation. It is also important to know where the data came from (provenance) and what has been done to them (lineage). When confidence is low, the data lake will be under-utilized. In industry parlance, an under-utilized lake becomes a swamp.
An analyst's trust level may differ based on provenance. "Trust" in this sense could refer either to verifiability or to setting appropriate expectations. For example, perhaps a source system writes student records without having any knowledge of student demographics. If the analyst knows the source system, then she will better know how to react to missing demographic information. Alternately, if she sees that demographics have been added where not expected, she will wish to know the lineage: what intermediate steps were taken to enrich the original data with the missing information?
Today's Ed-Fi Data Standard does not record which source system vendor submitted the data, and it may be worthwhile to modify the Data Standard to support provenance. Unfortunately, this will not help existing Ed-Fi implementations where knowledge of the source system of record has already been lost. In that case, "Ed-Fi ODS/API with version number" may have to be used as the source system of record when extracting data out to a lake or warehouse. Lineage information should then be added when extracting and transforming the data. The lineage might belong to an analytics schema or standard, rather than to the operational Ed-Fi Data Standard.
This information would also be critical when using the data lake for auditing purposes.
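One way to attach this metadata is to wrap each extracted payload in an envelope before landing it in the lake. The sketch below assumes the fallback described above, with the ODS/API itself as the source system of record; every field name in the envelope is illustrative and not part of any Ed-Fi standard.

```python
from datetime import datetime, timezone

def wrap_with_lineage(payload, resource, ods_api_version, data_standard_version):
    """Wrap one extracted record with provenance and lineage metadata.

    The envelope shape is hypothetical: it simply demonstrates carrying
    "where the data came from" (provenance) and "what has been done to
    them" (lineage) alongside the record itself.
    """
    return {
        "provenance": {
            # With no upstream vendor recorded, the ODS/API is the best
            # available source system of record.
            "sourceSystem": f"Ed-Fi ODS/API {ods_api_version}",
            "dataStandardVersion": data_standard_version,
            "resource": resource,
        },
        "lineage": [
            # Each transformation step downstream would append an entry here.
            {"step": "extract", "performedAt": datetime.now(timezone.utc).isoformat()},
        ],
        "record": payload,
    }
```

Later transformation jobs append their own entries to the `lineage` list, so an auditor can replay how a row in the lake came to look the way it does.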
...
Some denormalization can be standardized, for example by reshaping the data to match the CEDS reporting data model or another model. The reshaping process can be componentized into either extract-transform-load (ETL) or extract-load-transform (ELT) tools. A tool such as dbt might be very attractive in an ELT scenario; the work to extract and load the initial data into the warehouse ("staging") is routine and can be a black box, with transformation logic handled in the powerful modeling approach offered by dbt. This raises exciting and open questions:
- With significant variation in analytical needs, does it make sense for the Ed-Fi community to try to adopt a shared Analytical schema?
- If so, how far should it go? For example, perhaps the community only agrees on a light level of baseline denormalization while leaving open what the schema should look like to handle detailed use cases.
- Is it better to talk about multiple Analytical schemas, which are fit-to-purpose?
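To make "light baseline denormalization" concrete, the sketch below joins three simplified resources into one wide row per student. In an ELT setup this logic would live in SQL (for example, a dbt model); the input shapes here are illustrative reductions of Ed-Fi resources, not the real payloads.

```python
def denormalize_students(students, school_associations, schools):
    """Produce one flat analytics row per student, attaching the school
    enrollment. A minimal example of baseline denormalization; real Ed-Fi
    resources carry many more fields and reference structures."""
    school_by_id = {s["schoolId"]: s for s in schools}
    assoc_by_student = {a["studentUniqueId"]: a for a in school_associations}
    rows = []
    for student in students:
        assoc = assoc_by_student.get(student["studentUniqueId"], {})
        school = school_by_id.get(assoc.get("schoolId"), {})
        rows.append({
            "studentUniqueId": student["studentUniqueId"],
            "firstName": student.get("firstName"),
            "lastSurname": student.get("lastSurname"),
            "schoolId": assoc.get("schoolId"),
            "schoolName": school.get("nameOfInstitution"),
            "entryDate": assoc.get("entryDate"),
        })
    return rows
```

Even this small example surfaces the open questions above: whether the community standardizes only the flat "row per student" grain, or also the column set for detailed use cases.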
An Opportunity for Platform Providers
...
(1) Ed-Fi mission statement, from About Ed-Fi:
"The Ed-Fi Alliance is a nonprofit devoted to helping every school district and state achieve data interoperability. By connecting educational data systems, we empower educators with comprehensive, real-time insights into their students’ performance and needs."
(2) Anecdotally, "real-time" in educational data is not meant literally, as it is in some other industries. In educational settings, the greater concern is that the data are up-to-date within a day or two. A counter-example might be literal real-time notifications on classroom attendance. On the other hand, manually-entered attendance data may be prone to errors or recording delays, unless the school is using automated proximity detection (e.g. RFID). In the former case, actual real-time may not be desirable, and in the latter case, the proximity system itself likely takes responsibility for notifications.
(3) Colloquially, following the Microsoft terminology, known as Change Data Capture or CDC. Examples: CDC on Microsoft SQL Server and Azure SQL; roll your own with PostgreSQL (1) (2) or use an add-on such as Materialize, Hevo, Aiven, etc.