...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
Considerations for Integration of the Ed-Fi Ecosystem into Data Lake and Next Generation Data Warehouse Architectures
Executive Summary
There is no silver bullet for building interoperable systems and delivering actionable insights from educational data. The Ed-Fi Technology Suite helps solve both problems by providing a single, stable platform that serves data structured according to the extensive and well-defined Ed-Fi Data Standard. But it is not a perfect solution: data analysts must either extract granular data from an API or denormalize data in a relational database before they can begin building reports and analytic dashboards or connecting to machine learning systems. Furthermore, there is typically a long lead time to access data from new sources and/or new domains.
...
- facilitating data exchange through the Ed-Fi ODS/API,
- standardizing data storage with the Ed-Fi Data Standard,
- providing a landing zone and pipeline for data that are not integrated into the Ed-Fi Data Standard,
- and allowing data analysts to combine data from new and unstandardized sources with the validated / standardized data from the Ed-Fi ODS/API.
Background
Motivating Questions
Where does the Ed-Fi Data Standard fit in a data lake world? Do data lakes, cloud-based enterprise data warehouses, and next-generation data integration tools make the Data Standard, and/or the Ed-Fi Technology Suite, obsolete? And if there is a role, what architectural patterns might support integration of the Ed-Fi Data Standard and Technology Suite into a next generation data platform?
Two Core Problems to Solve
- System interoperability: "helping every school district and state achieve data interoperability"
- Serving data to end users: "empower educators with comprehensive, real-time(2) insights into their students’ performance and needs"
The diagrams below illustrate these two problems and will serve as a foundation for the many architectural diagrams that follow.
[Diagrams: "Problem: interoperability" and "Problem: serving data"]
Understanding Interoperability
Project Unicorn defines interoperability as "the seamless, secure, and controlled exchange of data between applications." They also publish a rubric for judging interoperability, which lists eleven different attributes of student data, each with possible technical solutions that are rated on a scale from least to most interoperable. Among their requirements for high marks is the use of a well-defined Data Standard and real-time exchange of data via a web-based Application Programming Interface (API).
Serving Data with Data Governance
When serving data to educators, it is essential that the data are not only relevant, but also validated, complete, and trustworthy ("clean", "quality assurance"). An educational agency's data analysts will also want to understand the sources of data, their granularity, and where and how different data sets are compatible (e.g. do different data sets use the same unique identifier for students?). These topics fall under the general heading of "data governance".
Ed-Fi Solution
The Ed-Fi Data Standard ("the Data Standard") creates a common language for storing and exchanging a broad range of educational data. An Ed-Fi-compatible API ("Ed-Fi API") supports the Data Standard by serving as a hub for both data exchange and downstream uses. The Data Standard and the Ed-Fi API specifications are available for anyone to implement. In addition to these standards, the Ed-Fi Alliance produces a Technology Suite of software that includes:
...
The Ed-Fi Technology Suite also includes several add-on products, which will be described below. Before doing so, it will be helpful to step back from this platform and ask: how quickly can it support new data requirements?
Time to Value
Many IT departments and software development groups struggle to keep up with business demand for new systems, features, and fixes. A key question about any technology platform is how well it enables rapid development and deployment of new business requirements. Put another way: what is the time to value?
...
[Diagram: Speedbumps in time to value with the Ed-Fi platform]
Ed-Fi Technology Suite, Expanded
Many Ed-Fi implementations will include additional tools that help improve the time to value. The next diagram introduces these new components:
...
- Data Model Extensions that support exchange and storage of new data elements and domains that are not otherwise covered by the Data Standard;
- API Composites that denormalize data, so that data can be extracted from the API with fewer calls; and
- Custom Analytics Middle Tier (AMT) views that denormalize more of the ODS tables, compared to what is available out of the box.
The Data Lake Hypothesis
Amazon defines a data lake as
"... a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions."
A central premise of the data lake concept is to get the data into the hands of analysts as quickly as possible and let them sort out the details. That brings us to the question, "why not put it in a data lake?"
...
- New source system: They probably have some export facility already. Just deliver the CSV / XML / etc. files into the lake.
- Configuration: No longer an immediate concern since all data will be arriving in the lake.
- New data domain: The Data Lake emphasizes the idea of storing what you get and interpreting it as you read it. No need to figure out a single Data Standard up front.
- Overly principled: No data are rejected; all are available to the data team.
- Data out scenarios: The Analyst can use federated query tools to build reporting, analytics, or machine learning data models from any source data, in any format.
An Aside: Enterprise Data Warehouses
While this paper primarily concerns itself with data lakes, many of the issues covered here also apply to the use of enterprise data warehouses, especially cloud-based systems such as Snowflake, Google BigQuery, Azure Synapse Analytics, and AWS Redshift. These systems fit into the "Reporting Data Store" box in the architecture diagrams. Because the motivating questions for this paper include explicit data lake features, it will focus on the lake itself. Nonetheless, many of the points raised are equally applicable to any hybrid solution that employs an enterprise data warehouse.
Data Lake Architecture
In reality, the simple diagram above is a sky-high perspective, hiding considerable detail.
...
[Diagram: A typical architecture diagram]
This diagram shows only one "cleaning" job, where in reality there would be separate jobs for each data source and type. Similarly, there may be many small, stand-alone ELT processes for populating the reporting data store.
...
The organize stage described above attempts to bring a degree of standardization that will hide the messy details, allowing the data scientist or data analyst to quickly and easily find the data required for their task. Even with this, the data platform team (engineers + scientists + analysts) must agree on the data model(s). This could mean adopting one or more existing data standards or developing their own.
Data Lake Time to Value
What do we get with a move to a data lake-first approach that does not include something like the Ed-Fi ODS/API? On the positive side, data analysts and/or data scientists may have more rapid access to novel data, thus shortening the time-to-value for rough analyses on "unverified data."
...
[Diagram: Speedbumps with a Data Lake]
Proposal: Hybrid Solution
We believe a hub-model RESTful API that enforces a strict Data Standard continues to offer, on average, the best time-to-value for achieving interoperability and delivering meaningful reports and novel data-driven insights.
...
These are discussed in more detail below.
Gap: Getting Data from Ed-Fi into the Lake
What options are available for getting data from the Ed-Fi Platform out to a Data Lake?
Option 1: Read from the Database
This approach has a significant weakness: the retrieved data will be close to, but not compliant with, the Data Standard. This is because the ODS database contains some normalization and table inheritance. Downstream use cases would have to reshape the data or accept the ODS as the de-facto standard, instead of the actual Ed-Fi Data Standard.
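As a minimal sketch of what this looks like in practice, the following Python snippet reads directly from the ODS. The connection string is a placeholder, and the table and column names are only illustrative of how a single Data Standard resource is spread across normalized ODS tables.

```python
# Sketch only: direct read from the ODS database (names are illustrative).
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string for an on-premises SQL Server ODS.
engine = create_engine(
    "mssql+pyodbc://user:password@ods-server/EdFi_Ods?driver=ODBC+Driver+17+for+SQL+Server"
)

# A single "students" API resource is reassembled from multiple tables;
# this query touches only the base table.
students = pd.read_sql(
    "SELECT StudentUSI, StudentUniqueId, FirstName, LastSurname, BirthDate FROM edfi.Student",
    engine,
)

# Downstream jobs must either re-join and re-nest these rows to match the
# Data Standard, or accept the ODS table shapes as a de-facto standard.
print(students.head())
```

Either way, the extraction process, rather than the API, now owns the mapping back to the Data Standard.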
[Diagram: Change Data Capture]
Option 2: Read from the API
Build and schedule an API-based ETL process, using the Changed Record Queries functionality in the ODS/API. The Changed Record Queries feature shortens the data load time by helping the API client retrieve only the records changed since the last extraction. The retrieved data have shapes defined by the Data Standard, since they are standard API payloads. This batch-based process fits well with many systems, though some implementations might prefer a more real-time approach.
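The sketch below illustrates one way such a batch extract could be scripted. The base URL, credentials, and stored change version are placeholders, and the exact routes and response keys should be verified against the target ODS/API version.

```python
# Hedged sketch of a Changed Record Queries extract (details vary by ODS/API version).
import requests

BASE = "https://ods.example.edu/v5.3/api"  # placeholder ODS/API base URL

# Authenticate with the API's OAuth client-credentials flow.
token = requests.post(
    f"{BASE}/oauth/token",
    data={"grant_type": "client_credentials"},
    auth=("my_client_id", "my_client_secret"),
).json()["access_token"]
headers = {"Authorization": f"Bearer {token}"}

# 1. Ask the API for the newest change version currently available.
versions = requests.get(
    f"{BASE}/changeQueries/v1/availableChangeVersions", headers=headers
).json()
newest = versions["newestChangeVersion"]

# 2. Pull only records changed since the last successful run.
last_version = 1042  # would normally be read from the previous run's saved state
students = requests.get(
    f"{BASE}/data/v3/ed-fi/students",
    headers=headers,
    params={"minChangeVersion": last_version + 1, "maxChangeVersion": newest, "limit": 500},
).json()

# 3. Land the Data Standard-shaped payloads in the lake, then persist `newest`
#    as the starting point for the next scheduled extraction.
```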
[Diagram: Changed Record Queries]
Option 3: ODS/API Modification for Streaming
Modify the ODS/API to write API requests to a stream so that they can be saved to the lake in real-time while keeping them in the original Data Standard schema. To preserve the validation and integrity checking provided by the relational ODS database, this modification needs to be introduced into the code after a database operation executes successfully.
While the real-time nature of this event streaming obviates the need for scheduling batch tasks, it does introduce a new set of technologies that are likely to be unfamiliar to most users in the Ed-Fi ecosystem. Furthermore, it would only work with a hypothetical future release (or backport to forked old releases), whereas the Changed Record Queries approach would work with any release of Tech Suite 3 from late 2018 (version 3.1+).
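To make the pattern concrete, here is a conceptual sketch only: the ODS/API itself is a C# application, so this Python snippet merely illustrates publishing the validated payload to a stream after the relational write succeeds. The topic name and event shape are assumptions, not part of any shipped Ed-Fi release.

```python
# Conceptual sketch: emit an event only after the ODS transaction has committed,
# so the stream never carries records that failed validation or integrity checks.
import json
from datetime import datetime, timezone
from kafka import KafkaProducer  # assumes the kafka-python package

producer = KafkaProducer(
    bootstrap_servers="broker.example.edu:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def on_resource_upserted(resource_name: str, payload: dict) -> None:
    """Hypothetical hook invoked after a successful database operation."""
    producer.send(
        "edfi-changes",  # hypothetical topic consumed by the lake's landing job
        {
            "resource": resource_name,
            "document": payload,  # still shaped by the Data Standard
            "observedAt": datetime.now(timezone.utc).isoformat(),
        },
    )
```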
[Diagram: Event Streaming]
Gap: Supporting Lineage for Data Governance
To be secure and trusted as a data source, a data lake strategy needs to have robust governance practices. Downstream users will have greater confidence in the data when they know that the data have already been validated through the Ed-Fi ODS/API. But that is only part of the trust equation. It is also important to know the lineage: where the data came from and what has been done to them. When confidence and trust are low, the data lake will be under-utilized. In industry parlance, an under-utilized lake becomes a swamp.
...
This information would also be critical when using the data lake for auditing purposes.
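One lightweight way to carry lineage with the data, sketched below under assumed file layouts and field names, is to write a small metadata record alongside every extract landed in the lake.

```python
# Illustrative sketch: pair each landed extract with a lineage record.
# Paths and field names are assumptions; the point is that source, extraction
# time, and change-version range travel with the data.
import json
from datetime import datetime, timezone

def write_with_lineage(fs, records: list, resource: str, min_version: int, max_version: int) -> None:
    """`fs` is any fsspec-style file system client (s3fs, adlfs, local) with an open() method."""
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    prefix = f"raw/edfi/{resource}/{timestamp}"

    # Write the extract itself.
    with fs.open(f"{prefix}/data.json", "w") as f:
        json.dump(records, f)

    # Write the lineage record next to it.
    with fs.open(f"{prefix}/_lineage.json", "w") as f:
        json.dump({
            "source": "Ed-Fi ODS/API",
            "resource": resource,
            "minChangeVersion": min_version,
            "maxChangeVersion": max_version,
            "recordCount": len(records),
            "extractedAt": timestamp,
        }, f)
```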
Gap: Simplifying the Data Model
The Ed-Fi Data Standard is designed to optimize the process of writing interoperable data. The Data Standard's Unifying Data Model (UDM) is not an ideal data model for analytics, dashboards, and similar read-heavy uses.
...
Some denormalization can be standardized, for example by reshaping the data to match the CEDS reporting data model or another model. The reshaping process can be componentized into either extract-transform-load (ETL) or extract-load-transform (ELT) tools. A tool such as dbt might be very attractive in an ELT scenario; the work to extract and load the initial data into the warehouse ("staging") is routine and can be a black box, with the transformation logic handled in the powerful modeling approach offered by dbt.
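To illustrate the kind of reshaping involved, the short sketch below flattens a nested studentSchoolAssociations payload, as returned by the API, into an analytics-friendly table. In practice this logic might live in dbt models over staged tables; pandas is used here only to keep the example self-contained, and the sample values are made up.

```python
# Sketch of denormalizing a nested Data Standard payload into a flat table.
import pandas as pd

associations = [  # shape follows the API payload; values are invented
    {
        "studentReference": {"studentUniqueId": "604821"},
        "schoolReference": {"schoolId": 255901001},
        "entryDate": "2021-08-23",
        "entryGradeLevelDescriptor": "uri://ed-fi.org/GradeLevelDescriptor#Ninth grade",
    },
]

# json_normalize flattens the nested references into dotted column names,
# which are then renamed to analytics-friendly identifiers.
flat = pd.json_normalize(associations).rename(columns={
    "studentReference.studentUniqueId": "student_unique_id",
    "schoolReference.schoolId": "school_id",
    "entryGradeLevelDescriptor": "entry_grade_level",
})

print(flat[["student_unique_id", "school_id", "entryDate", "entry_grade_level"]])
```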
An Opportunity for Platform Providers
Most data lakes are operated on a cloud platform, utilizing the cloud provider's managed services to the extent possible for cost optimization. Learning how to operate such a platform cost-efficiently, securely, and with appropriate response times would be a significant burden for any small IT operation that tries to migrate to such a platform.
...
The template would also handle setup of default downstream serverless functions for further data preparation and organization, including pushing refined data into an enterprise data warehouse for high-performance analytics. These functions would generally be relatively simple, since they are purpose-built, and would operate on data that already have a well-defined schema - the Ed-Fi Data Standard. This drives down the cost for an education organization to customize the solution if needed, for example by adding additional data validation rules or adding new warehousing requirements.
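As a rough sketch of one such purpose-built function, written here as an AWS Lambda handler purely for concreteness, the snippet below applies an organization-specific validation rule to a newly landed extract and forwards the accepted records toward the warehouse loading zone. The bucket names, paths, and rule are assumptions, not part of a shipped template.

```python
# Hypothetical serverless step: validate a landed Ed-Fi extract and promote it
# to a "refined" zone for warehouse loading. Names and rules are illustrative.
import json
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # The S3 put-object notification carries the bucket and key of the new file.
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]

    documents = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())

    # Example of an extra rule layered on top of the API's own validation.
    accepted = [d for d in documents if d.get("entryDate")]

    s3.put_object(
        Bucket="district-refined-zone",          # hypothetical destination bucket
        Key=key.replace("raw/", "refined/"),
        Body=json.dumps(accepted).encode("utf-8"),
    )
    return {"accepted": len(accepted), "rejected": len(documents) - len(accepted)}
```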
Acknowledgments
In addition to innumerable blog posts and internal Ed-Fi conversations, this paper benefited greatly from direct conversations on the topic with Gene Garcia of Microsoft, Linda Feng of Unicon, Erik Joranlien and Jordan Mader of Ed Analytics, and Marcos Alcozer. Thank you!
All architectural icons used in this document are courtesy of diagrams.net.
Appendix: Translating the Diagrams into Product Components
...
Tool and Icon | On-premises | Amazon Web Services | Google Cloud | Microsoft Azure |
---|---|---|---|---|
Function | Custom applications | AWS Lambda | Google Cloud Functions | Azure Functions |
File storage | NTFS, NFS, HDFS | Amazon S3 | Google Cloud Storage | Azure Storage |
Reporting data store | SQL Server Analysis Services, Oracle, Greenplum | Amazon Redshift, Snowflake | Google BigQuery, Snowflake | Synapse Analytics, Azure SQL Data Warehouse, Snowflake |
Reports and dashboards | Tableau, Power BI, custom applications, and many others | Amazon QuickSight, custom applications | Google Data Studio, custom applications | Power BI, custom applications |
Machine learning | TensorFlow, PyTorch | Amazon SageMaker | Google Vertex AI | Azure Machine Learning |
...
Footnotes
(1) Ed-Fi mission statement, from About Ed-Fi:
"The Ed-Fi Alliance is a nonprofit devoted to helping every school district and state achieve data interoperability. By connecting educational data systems, we empower educators with comprehensive, real-time insights into their students’ performance and needs."
(2) Anecdotally, "real-time" in educational data is not meant literally, as it is in some other industries. In educational settings, the greater concern is that the data are up-to-date within a day or two. A counter-example might be literal real-time notifications on classroom attendance. On the other hand, manually-entered attendance data may be prone to errors or recording delays, unless the school is using automated proximity detection (e.g. RFID). In the former case, actual real-time may not be desirable, and in the latter case, the proximity system itself likely takes responsibility for notifications.
(3) Colloquially, following the Microsoft terminology, known as Change Data Capture or CDC. Examples: CDC on Microsoft SQL Server and Azure SQL; roll your own with PostgreSQL (1) (2) or use an add-on such as Materialize, Hevo, Aiven, etc.