...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

...

Considerations for Integration of the Ed-Fi Ecosystem into Data Lake and Next Generation Data Warehouse Architectures

Executive Summary

There is no silver bullet for building interoperable systems and delivering actionable insights from educational data. The Ed-Fi Technology Suite helps solve both of these problems by providing a stable, single platform: data structured according to an extensive and well-defined Ed-Fi Data Standard; the ODS/API for exchange and storage of operational data; and analytics starter kits that make the operational data more accessible to analysts and end users alike. But it is not a perfect or complete solution: data analysts must either extract granular data from an API or denormalize data in a relational database before they can begin building reports and analytic dashboards, or connecting to machine learning systems. Furthermore, there is typically a long lead time to access data from new sources and/or new domains, and the analytics starter kits only scratch the surface of what is possible.

Data lakes are intended to deliver short lead times for use of new data, with a philosophy of getting the raw data into an analyst's hands as quickly as possible, and many people have asked "why not just use a data lake?" to solve educational data interoperability and delivery problems. This paper makes the case that a pure data lake, without something like the Ed-Fi Data Standard and the Ed-Fi API, could present more problems than it would solve.

...

  • facilitating data exchange through the Ed-Fi ODS/API,
  • standardizing data storage using the Ed-Fi Data Standard as its schema,
  • providing a landing zone and pipeline for data that are not integrated into the Ed-Fi Data Standard,
  • and allowing data analysts to combine data from new and unstandardized sources with the validated / standardized data from the Ed-Fi ODS/API.

Background

Motivating Questions

Where does the Ed-Fi Data Standard fit in a data lake world? Do data lakes, cloud-based enterprise data warehouses, and next-generation data integration tools make the Data Standard, and/or the Ed-Fi Technology Suite, obsolete? And if there is a role, what architectural patterns might support integration of the Ed-Fi Data Standard and Technology Suite into a next-generation data platform?

Two Core Problems to Solve

Ed-Fi's mission statement1 contains two core technology problems:

  1. System interoperability: "helping every school district and state achieve data interoperability"
  2. Serving data to end users: "empower educators with comprehensive, real-time2 insights into their students’ performance and needs"

The diagrams below illustrate these two problems and will serve as a foundation for the many architectural diagrams that follow.

Problem: interoperability
Problem: serving data

Understanding Interoperability

Project Unicorn defines interoperability as "the seamless, secure, and controlled exchange of data between applications." They also publish a rubric for judging interoperability, which lists eleven different attributes of student data, each with possible technical solutions that are rated on a scale from least to most interoperable. Among their requirements for high marks are the use of a well-defined Data Standard and real-time exchange of data via a web-based Application Programming Interface (API).

Serving Data with Data Governance

When serving data to educators it is essential that the data are not only relevant, but also validated, complete, and trustworthy ("clean", "quality assurance"). An educational agency's data analysts will also want to understand the sources of data, their granularity, and where and how different data sets are compatible (e.g. do different data sets use the same unique identifier for students?). These topics fall under the general heading of "data governance".

Ed-Fi Solution

The Ed-Fi Data Standard ("the Data Standard") creates a common language for storing and exchanging a broad range of educational data. An Ed-Fi-compatible API ("Ed-Fi API") supports the Data Standard by serving as a hub for both data exchange and downstream uses. The Data Standard and the Ed-Fi API specifications are available for anyone to implement. In addition to these standards, the Ed-Fi Alliance produces a Technology Suite of software that includes:

...

The Ed-Fi Technology Suite also includes several add-on products, which will be described below. Before doing so, it will be helpful to step back from this platform and ask: how quickly can it support new data requirements?

Time to Value

Many IT departments and software development groups struggle to keep up with business demand for new systems, features, and fixes. A key question about any technology platform is: how well does it enable rapid development and deployment of new business requirements? Put another way: what is the time to value?

...

Speedbumps in time to value with the Ed-Fi platform

Ed-Fi Technology Suite, Expanded

Many Ed-Fi implementations will include additional tools that help improve the time to value. The next diagram introduces these new components:

...

  • Data Model Extensions that support exchange and storage of new data elements and domains that are not otherwise covered by the Data Standard;
  • API Composites that denormalize data, so that data can be extracted from the API with fewer calls; and
  • Custom Analytics Middle Tier (AMT) views that denormalize more of the ODS tables, compared to what is available out of the box.

The Data Lake Hypothesis

Amazon defines a data lake as

"... a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions."

A central premise of the data lake concept is to get the data into the hands of analysts as quickly as possible and let them sort out the details. That brings us to the question, "why not put it in a data lake?"

...

  • New source system: They probably have some export facility already. Just deliver the CSV / XML / etc. files into the lake.
  • Configuration: No longer an immediate concern, since all data will be arriving in the lake.
  • New data domain: The Data Lake emphasizes the idea of storing what you get and interpreting it as you read it. No need to figure out a single Data Standard up front.
  • Overly principled: No data are rejected; all are available to the data team.
  • Data out scenarios: The Analyst can use federated query tools to build reporting, analytics, or machine learning data models from any source data, in any format.

An Aside: Enterprise Data Warehouses

While this paper primarily concerns itself with data lakes, many of the issues covered here also apply to the use of enterprise data warehouses, especially cloud-based systems such as Snowflake, Google BigQuery, Azure Synapse Analytics, and AWS Redshift. These systems fit into the "Reporting Data Store" box in the architecture diagrams. Because the motivating questions for this paper include explicit data lake features, it will focus on the lake itself. Nonetheless, many of the points raised are equally applicable to any hybrid solution that employs an enterprise data warehouse.

Data Lake Architecture

In reality, the simple diagram above is a sky-high perspective, hiding considerable detail.

...

A typical architecture diagram

This diagram shows only one "cleaning" job, whereas in reality there would be separate jobs for each data source and type. Similarly, there may be many small, stand-alone ELT processes for populating the reporting data store.

Please see the Appendix for explanations of the icons and how they apply to on-premises and cloud-based development.

...

The organize stage described above attempts to bring a degree of standardization that will hide the messy details, allowing the data scientist or data analyst to quickly and easily find the data required for their task. Even with this, the data platform team (engineers + scientists + analysts) must agree on the data model(s). This could mean adopting one or more existing data standards or developing their own.

Data Lake Time to Value

What do we get with a move to a data lake-first approach that does not include something like the Ed-Fi ODS/API? On the positive side, data analysts and/or data scientists may have more rapid access to novel data, thus shortening the time-to-value for rough analyses on "unverified data."

...

Speedbumps with a Data Lake

Proposal: Hybrid Solution

We believe a hub-model RESTful API that enforces a strict Data Standard continues to offer, on average, the best time-to-value for achieving interoperability and delivering meaningful reports and novel data-driven insights.

...

These are discussed in more detail below.

Gap: Getting Data from Ed-Fi into the Lake

What options are available for getting data from the Ed-Fi Platform out to a Data Lake?

Option 1: Read from the Database

Write and schedule a standard ETL process to move data from the ODS to a data lake, using direct queries against the tables. Better yet, utilize the database's transaction log for faster reads that do not compete with the API's use of the ODS database3.

This approach has a significant weakness: the retrieved data will be close to, but not compliant with, the Data Standard. This is because the ODS database contains some normalization and table inheritance. Downstream use cases would have to reshape the data or accept the ODS as the de facto standard, instead of the actual Ed-Fi Data Standard.

Change Data Capture
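As a rough illustration of the transaction-log approach, the Python sketch below polls SQL Server's Change Data Capture functions for a single ODS table and lands the changed rows in the lake. It is only a sketch: the connection string, the edfi_Student capture instance name, and the landing path are assumptions, and a production job would persist the last-processed LSN, cover every table of interest, and reshape the rows toward the Data Standard as noted above.

# Minimal sketch: pull changed rows from a SQL Server CDC capture instance
# and land them in the lake as CSV. Assumes CDC is already enabled on the
# ODS and that a capture instance named "edfi_Student" exists (hypothetical).
import csv
import pyodbc

CONNECTION_STRING = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=EdFi_Ods;Trusted_Connection=yes;"
)
CAPTURE_INSTANCE = "edfi_Student"              # assumption: one capture instance per table
LANDING_FILE = "edfi_student_changes.csv"      # assumption: lake landing path

def extract_changes():
    with pyodbc.connect(CONNECTION_STRING) as conn:
        cursor = conn.cursor()
        # Determine the LSN window to read; a real job would persist the last
        # LSN it processed instead of reading from the minimum each time.
        from_lsn = cursor.execute(
            "SELECT sys.fn_cdc_get_min_lsn(?)", CAPTURE_INSTANCE
        ).fetchval()
        to_lsn = cursor.execute("SELECT sys.fn_cdc_get_max_lsn()").fetchval()

        # Read all inserts, updates, and deletes captured in that window.
        rows = cursor.execute(
            f"SELECT * FROM cdc.fn_cdc_get_all_changes_{CAPTURE_INSTANCE}(?, ?, 'all')",
            from_lsn, to_lsn,
        ).fetchall()
        columns = [col[0] for col in cursor.description]

    # Land the raw changes in the lake; cleaning and reshaping happen downstream.
    with open(LANDING_FILE, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(columns)
        writer.writerows(rows)

if __name__ == "__main__":
    extract_changes()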

Option 2: Read from the API

Build and schedule an API-based ETL process, using the Changed Record Queries functionality in the ODS/API. The Changed Record Queries feature shortens the data load time by helping the API client retrieve only the records changed since the last extraction. The retrieved data have shapes defined by the Data Standard, since they are standard API payloads. This batch-based process fits well with many systems, though some implementations might prefer a more real-time approach.

Changed Record Queries
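To make the batch process more concrete, the following Python sketch pulls changed student records via Changed Record Queries and lands them in the lake in their Data Standard shape. The base URL, credentials, and landing file are placeholders, and the exact routes shown (the OAuth token endpoint, the availableChangeVersions resource, and the minChangeVersion/maxChangeVersion parameters) should be verified against the ODS/API version in use.

# Minimal sketch of an incremental pull using Changed Record Queries.
# Endpoint paths and parameter names follow recent ODS/API releases but
# should be verified against the target version; all URLs and credentials
# below are placeholders.
import json
import requests

BASE_URL = "https://example.org/v5.3/api"   # placeholder ODS/API base URL
CLIENT_ID = "myClientId"                    # placeholder key
CLIENT_SECRET = "myClientSecret"            # placeholder secret
LAST_CHANGE_VERSION = 0                     # a real job would persist this value

def get_token():
    # Client-credentials OAuth token, as used by the Ed-Fi ODS/API.
    response = requests.post(
        f"{BASE_URL}/oauth/token",
        data={"grant_type": "client_credentials"},
        auth=(CLIENT_ID, CLIENT_SECRET),
    )
    response.raise_for_status()
    return response.json()["access_token"]

def extract_students(token, min_version, max_version):
    headers = {"Authorization": f"Bearer {token}"}
    offset, page_size, records = 0, 500, []
    while True:
        # Only records changed within the requested change-version window are returned.
        response = requests.get(
            f"{BASE_URL}/data/v3/ed-fi/students",
            headers=headers,
            params={
                "minChangeVersion": min_version,
                "maxChangeVersion": max_version,
                "limit": page_size,
                "offset": offset,
            },
        )
        response.raise_for_status()
        page = response.json()
        records.extend(page)
        if len(page) < page_size:
            return records
        offset += page_size

if __name__ == "__main__":
    token = get_token()
    # Ask the API how far the change version counter has advanced
    # (property casing may vary by ODS/API version).
    available = requests.get(
        f"{BASE_URL}/changeQueries/v1/availableChangeVersions",
        headers={"Authorization": f"Bearer {token}"},
    ).json()
    students = extract_students(token, LAST_CHANGE_VERSION, available["newestChangeVersion"])
    # Land the payloads as-is, in their Data Standard shape.
    with open("students_changed.jsonl", "w", encoding="utf-8") as f:
        for student in students:
            f.write(json.dumps(student) + "\n")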

Option 3: ODS/API Modification for Streaming

Modify the ODS/API to write API requests to a stream so that they can be saved to the lake in real-time while keeping them in the original Data Standard schema. To preserve the validation and integrity checking provided by the relational ODS database, this modification needs to be introduced into the code after a database operation executes successfully.

While the real-time nature of this event streaming obviates the need for scheduling batch tasks, it does introduce a new set of technologies that are likely to be unfamiliar to most users in the Ed-Fi ecosystem. Furthermore, it would only work with a hypothetical future release (or backport to forked old releases), whereas the Changed Record Queries approach would work with any release of Tech Suite 3 from late 2018 (version 3.1+).

Event Streaming
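The sketch below illustrates only the lake-side half of this option: a small Python consumer that reads API payloads from a stream and appends them to the lake. It assumes a hypothetical Kafka topic (here named edfi-api-events) to which the modified ODS/API would publish after each successful database operation; the producer-side change within the ODS/API itself is not shown.

# Minimal sketch of the lake-side consumer for the streaming option.
# Assumes a hypothetical Kafka topic ("edfi-api-events") carrying each
# successful create/update/delete as a JSON event. Requires kafka-python.
import json
import os
from datetime import datetime, timezone

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "edfi-api-events",                    # hypothetical topic name
    bootstrap_servers="localhost:9092",   # placeholder broker address
    group_id="lake-writer",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

def landing_path(event):
    # Partition the lake by resource and date; the "resource" field is an
    # assumption about what the modified API would include in each event.
    day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    return os.path.join("lake", "raw", event.get("resource", "unknown"), f"{day}.jsonl")

for message in consumer:
    event = message.value
    path = landing_path(event)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    # Events arrive already shaped by the Data Standard, so they are appended
    # as-is; cleaning and denormalization happen downstream.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")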

Gap: Supporting Lineage for Data Governance

To be secure and trusted as a data source, a data lake strategy needs to have robust governance practices. Downstream users will have greater confidence in the data when they know that they have already been validated through the Ed-Fi ODS/API. But that is only part of the trust equation. It is also important to know where the data came from (provenance) and what has been done to them (lineage). When confidence and trust are low, the data lake will be under-utilized. In industry parlance, an under-utilized lake becomes a swamp.

An analyst's trust level may differ based on the provenance. "Trust" in this sense could refer either to verifiability or to setting appropriate expectations. For example, perhaps some source system writes student records without having any knowledge of student demographics. If the analyst knows the source system, then she will better know how to react to missing demographic information. Alternately, if she sees that demographics have been added where not expected, she will wish to know the lineage: what intermediate steps were taken to enrich the original data with the missing information?

Today's Ed-Fi Data Standard does not record which source system vendor submitted the data, and it may be worthwhile to modify the Data Standard to support provenance. Unfortunately, this will not help existing Ed-Fi implementations where knowledge of the source system of record has already been lost. In that case, "Ed-Fi ODS/API with version number" may have to be used as the source system of record when extracting data out to a lake or warehouse. Lineage information should then be added when extracting and transforming the data. The lineage might belong to an Analytics schema or standard, rather than to the operational Ed-Fi Data Standard.

This information would also be critical when using the data lake for auditing purposes.
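One lightweight way to carry this information is to wrap each extracted record in a small metadata envelope as it leaves the Ed-Fi Platform. The Python sketch below shows one possible shape; the field names and version strings are illustrative only and are not defined by any Ed-Fi specification.

# Illustrative metadata envelope for records extracted from the Ed-Fi Platform.
# Field names are examples only; they are not defined by the Ed-Fi Data Standard.
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class LineageStep:
    tool: str           # e.g. "changed-record-queries-extract"
    description: str    # what the step did to the data
    executed_at: str    # ISO 8601 timestamp

@dataclass
class RecordEnvelope:
    source_system: str            # provenance, e.g. "Ed-Fi ODS/API 5.3"
    data_standard_version: str    # e.g. "3.3.1-b"
    resource: str                 # e.g. "students"
    payload: dict                 # the record, in its Data Standard shape
    lineage: list = field(default_factory=list)

    def add_step(self, tool, description):
        self.lineage.append(
            LineageStep(tool, description, datetime.now(timezone.utc).isoformat())
        )

# Example usage: wrap an extracted student record, then note an enrichment step.
envelope = RecordEnvelope(
    source_system="Ed-Fi ODS/API 5.3",
    data_standard_version="3.3.1-b",
    resource="students",
    payload={"studentUniqueId": "12345"},
)
envelope.add_step("demographics-enrichment", "added demographics from assessment file")
print(asdict(envelope))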

Gap: Simplifying the Data Model

The Ed-Fi Data Standard is designed to optimize the process of writing interoperable data. The Data Standard's Unifying Data Model (UDM) is not an ideal data model for analytics, dashboards, etc.

...

Some denormalization can be standardized, for example by reshaping the data to match the CEDS reporting data model or another model. The reshaping process can be componentized into either extract-transform-load (ETL) or extract-load-transform (ELT) tools. A tool such as dbt might be very attractive in an ELT scenario; the work to extract and load the initial data into the warehouse ("staging") is routine and can be a black box, with transformation logic handled in the powerful modeling approach offered by dbt. This raises exciting and open questions:

  • With significant variation in analytical needs, does it make sense for the Ed-Fi community to try to adopt a shared Analytical schema?
  • If so, how far should it go? For example, perhaps the community only agrees on a light level of baseline denormalization while leaving open what the schema should look like to handle detailed use cases.
  • Is it better to talk about multiple Analytical schemas, which are fit-to-purpose? 

An Opportunity for Platform Providers

Most data lakes are operated on a cloud platform, utilizing the cloud provider's managed services to the extent possible for cost optimization. Learning how to operate such a platform cost-efficiently, securely, and with appropriate response times would be a significant burden for any small IT operation that tries to migrate to such a platform.

...

The template would also handle setup of default downstream serverless functions for further data preparation and organization, including pushing refined data into an enterprise data warehouse for high-performance analytics. These functions would generally be relatively simple, since they are purpose-built, and would operate on data that already have a well-defined schema - the Ed-Fi Data Standard. This drives down the cost for an education organization to customize the solution if needed, for example by adding additional data validation rules or adding new warehousing requirements.
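To give a sense of how simple such purpose-built functions can be, the sketch below is written as an AWS Lambda-style handler in Python: it reads a newly landed file of Data Standard-shaped student records and flattens it into rows for a hypothetical warehouse staging table. The bucket names, file layout, and column choices are placeholders; equivalent functions could be written for Google Cloud Functions or Azure Functions.

# Sketch of a purpose-built serverless function (AWS Lambda-style handler)
# that flattens newly landed Data Standard-shaped student records into rows
# for a warehouse staging table. Bucket, key layout, and table are placeholders.
import csv
import io
import json

import boto3

s3 = boto3.client("s3")
STAGING_BUCKET = "example-warehouse-staging"   # placeholder bucket name

def flatten_student(record):
    # Keep only the columns needed by the (hypothetical) staging table.
    return {
        "student_unique_id": record.get("studentUniqueId"),
        "first_name": record.get("firstName"),
        "last_surname": record.get("lastSurname"),
        "birth_date": record.get("birthDate"),
    }

def lambda_handler(event, context):
    # Triggered by an S3 "object created" event when a new file lands in the lake.
    for item in event["Records"]:
        bucket = item["s3"]["bucket"]["name"]
        key = item["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        records = [json.loads(line) for line in body.splitlines() if line.strip()]

        # Write a flat CSV that the warehouse can bulk-load into a staging table.
        buffer = io.StringIO()
        writer = csv.DictWriter(
            buffer,
            fieldnames=["student_unique_id", "first_name", "last_surname", "birth_date"],
        )
        writer.writeheader()
        writer.writerows(flatten_student(r) for r in records)

        s3.put_object(
            Bucket=STAGING_BUCKET,
            Key=f"staging/students/{key.rsplit('/', 1)[-1]}.csv",
            Body=buffer.getvalue().encode("utf-8"),
        )
    return {"processed": len(event["Records"])}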

Acknowledgments

In addition to innumerable blog posts and internal Ed-Fi conversations, this paper benefited greatly from direct conversations on the topic with Gene Garcia of Microsoft, Linda Feng of Unicon, Erik Joranlien and Jordan Mader of Ed Analytics, and Marcos Alcozer. Thank you!

All architectural icons used in this document are courtesy of diagrams.net.

Appendix: Translating the Diagrams into Product Components

In the diagrams above, the orange blocks represent small, independent software programs (typically serverless functions); the green boxes are for file storage, analogous to traditional file folders; the blue box is most likely an enterprise data warehouse; purple boxes are custom applications or managed services; and the machine learning box is a dedicated machine learning software package.

...

Tool and Icon | On-premises | Amazon Web Services | Google Cloud | Microsoft Azure
Function | Custom applications | AWS Lambda | Google Cloud Functions | Azure Functions
File storage | NTFS, NFS, HDFS | Amazon S3 | Google Cloud Storage | Azure Storage
Reporting data store | SQL Analysis Services, Oracle, Greenplum | Amazon Redshift, Snowflake | Google BigQuery, Snowflake | Synapse Analytics (formerly Azure SQL Data Warehouse), Snowflake
Reports and dashboards | Tableau, Power BI, custom applications, ... and many others | Amazon QuickSight, custom applications | Google Data Studio, custom applications | Power BI, custom applications
Machine learning | TensorFlow, PyTorch | Amazon SageMaker | Google Vertex AI | Azure Machine Learning

Back to the text

...

Footnotes

(1) Ed-Fi mission statement, from About Ed-Fi:

"The Ed-Fi Alliance is a nonprofit devoted to helping every school district and state achieve data interoperability. By connecting educational data systems, we empower educators with comprehensive, real-time insights into their students’ performance and needs."

Back to the text

(2) Anecdotally, "real-time" in educational data is not meant literally, as it is in some other industries. In educational settings, there is more concern that the data are up-to-date within a day or two. A counter-example might be literal real-time notifications on classroom attendance. On the other hand, manually-entered attendance data may be prone to errors or recording delays, unless the school is using automated proximity detection (e.g. RFID). In the manual case, actual real-time notification may not be desirable, and in the latter case, the proximity system itself likely takes responsibility for notifications.

Back to the text

(3) Colloquially, following the Microsoft terminology, known as Change Data Capture or CDC. Examples: CDC on Microsoft SQL Server and Azure SQL; roll your own with PostgreSQL (1) (2) or use an add-on such as Materialize, Hevo, Aiven, etc.

Back to the text