Designs for Flexible Referentiality

Overview

This document looks at the issues caused by strong referentiality in the Ed-Fi data model, and potential options to address them, while considering the benefits this referentiality brings as well.

Background

Early history

The highly referential nature of the Ed-Fi core data model has been critical to its success thus far. That referentiality exerted an important control over an ecosystem that was not accustomed to large-scale, high-quality, system-to-system data exchange.

The referential nature of the Ed-Fi data model – expressed as associations between classes in UML – is reflected in Ed-Fi APIs as required references from one entity to another entity. If the reference cannot be satisfied at the time of the API transaction, the transaction will fail.
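The failure behavior described above can be sketched as follows. This is a minimal in-memory illustration under stated assumptions, not the Ed-Fi API implementation: the `Store` class, the `UnresolvedReferenceError` exception, and the record shapes are hypothetical and exist only for this example.

```python
class UnresolvedReferenceError(Exception):
    """Raised when a required reference does not resolve (hypothetical name)."""


class Store:
    """Toy in-memory store standing in for an API host's database."""

    def __init__(self):
        self.students = {}             # keyed by the student's natural key
        self.student_assessments = []

    def add_student(self, student_unique_id, name):
        self.students[student_unique_id] = {
            "studentUniqueId": student_unique_id,
            "name": name,
        }

    def add_student_assessment(self, record):
        # Strong referentiality: the referenced Student must already exist,
        # otherwise the whole transaction is rejected and nothing is stored.
        student_id = record["studentReference"]["studentUniqueId"]
        if student_id not in self.students:
            raise UnresolvedReferenceError(f"Student {student_id!r} not found")
        self.student_assessments.append(record)


store = Store()
store.add_student("604822", "Example Student")
store.add_student_assessment(
    {"studentReference": {"studentUniqueId": "604822"}, "score": 87}
)

try:
    # The zero-padded ID matches no stored Student, so the write fails.
    store.add_student_assessment(
        {"studentReference": {"studentUniqueId": "0604822"}, "score": 91}
    )
except UnresolvedReferenceError as err:
    rejected = str(err)
```

Note that after the failed call the store holds no trace of the second record, which is the property the rest of this document examines.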

Previous generations of data exchange in particular had been designed assuming human intervention to manipulate and massage data before it was sent to another system or process (indeed, many patterns of data exchange still emphasize this human-enabled step; for example, in some ways data lakes offer this as an advantage). One outcome of these processes was that data extracted or sent from a source system was often difficult to join with, or was simply inconsistent with, data from other extracts, sometimes even with extracts from the same system.

In the earlier years of Ed-Fi data exchange, the main concern was also limited to very few systems or sources of data; in many cases, the only concern was to make SIS data extracts consistent with each other. This meant that occasional issues with joining records due to the high referentiality could generally be overcome, and the cost of doing so was often (but not always) seen as a side benefit.

Current state

While this strong referentiality has been critical thus far, it has also had its drawbacks, and those drawbacks have increased over time as the Ed-Fi ecosystem has grown. In particular, as the Ed-Fi ecosystem expands to include many different categories of systems, it becomes more and more difficult to join records across systems.

In some areas – such as for the principal "person entities" of K12 systems, e.g., students, staff – it is generally possible to join records across systems due to significant investments in unique identity systems and processes covering individuals. However, even in these cases, multiple systems of overlapping person identity often pose problems: state vs district IDs for students or staff, use of system-specific identifiers, use of surrogate IDs to protect identity, etc. 

In other areas, common identity between entities in different systems is even harder to achieve, as these entities lack the support of unique identity systems or other processes to manage identity within larger contexts.

To some extent, rostering systems offer some promise of relief, but those systems tend to be narrowly focused on a thin but important slice of K12 data and on narrower use cases; for example, enormous efforts and funding in K12 go into supplemental programs, yet these are entirely absent from all major roster efforts (except as thin "flags" on students). The other issue is that the current state of rostering is fragmented across several standards, and those standards also do not share the same approaches to management of entity identity within a larger, multi-source-system ecosystem.

The outcome of these problems is that referentiality requirements can become quite painful in field work. To cite some examples:

  • The reference from CourseTranscript to Course has for years caused significant ecosystem pain. While every CourseTranscript record will have some Course to which it should refer, in some cases these references are not carefully maintained, often due to student transfer. This problem was a key reason the Course Transcript Exchange SIG was formed.
  • Quite often, StudentAssessment records will fail to load in the API because the Student cannot be located. Sometimes the issue is differing systems of identity for students (e.g., a state vs a local ID), but generally such cases can and do require a system-level intervention to agree on student identity system use and rostering. More insidious are cases where a smaller number of students can't be joined due to an issue in the assessment rostering process (e.g., a student added late or manually to a scheduled interim assessment, a student ID accidentally transformed by the addition of leading '0' padding, or other data corruption).
  • Most recently, in the data for learning management systems (LMS) we are seeing cases in which the data from the LMS and the SIS are not expected to match (e.g., LMS sections that represent teacher communities of practice or grade-level teacher cohorts) or where the matching may be difficult (e.g., due to rostering being ad hoc or manual, or similar reasons).

Some reasons that small percentages of failed transactions pose large issues:

  • The API host is often able to resolve issues joining records, and in many cases fix the root rostering issues (as the API host is also typically the roster source). However, since transactions fail, the API host has no data at all, and that can make resolution of issues painful.
  • Such issues tend to be more common in data exchange projects early on, and therefore encourage projects to revert to older systems or run duplicate exchange systems longer. While it is arguably always better to find the root cause of reference errors, in some cases, it is also a priority to "just get the data." There are many Ed-Fi ecosystem instances of agencies running Ed-Fi and CSV-file-based exchanges in parallel because they cannot be assured of getting all data via API, and they therefore settle for CSV data, even though that data has the same referential issues as the Ed-Fi API data.

Strategies

"Potentially Logical" References

One strategy that is being piloted by the Ed-Fi technology is the use of references that are "potentially logical."

Tickets:

In this design, the reference (the natural key to the referenced entity) is kept on the entity as required, but the reference is not required to resolve.

One advantage of this design is that the choice of whether a reference must resolve can be made at a number of different levels. For example, a data standards specification could delegate this decision to an implementation, allowing the implementation to decide when the local ecosystem is sufficiently robust to mandate that references resolve.

In this respect, this approach can be used to allow implementations to mature and raise the benchmark on data quality at appropriate times.
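As a rough sketch of how such a configurable choice might look, the example below keeps the reference required on the entity but lets a policy decide whether it must resolve. Everything here is an assumption made for illustration: the `ReferencePolicy` enum, the `accept_record` function, and the single-field `courseCode` key (real natural keys are often composite) are hypothetical and do not describe the piloted implementation.

```python
from enum import Enum


class ReferencePolicy(Enum):
    """Hypothetical per-implementation setting for one reference."""
    MUST_RESOLVE = "must_resolve"
    POTENTIALLY_LOGICAL = "potentially_logical"


def accept_record(record, reference_field, known_keys, policy):
    """Return (accepted, resolved) for a record carrying one reference."""
    key = record.get(reference_field)
    if key is None:
        # The reference itself stays required under either policy.
        return (False, False)
    resolved = key in known_keys
    if policy is ReferencePolicy.MUST_RESOLVE and not resolved:
        return (False, False)
    # Under POTENTIALLY_LOGICAL the record is kept even when unresolved,
    # so the host retains the data and can repair the join later.
    return (True, resolved)


courses = {"ALG-1", "BIO-1"}
transcript = {"courseCode": "ALG-I", "grade": "B"}  # mistyped course key

strict = accept_record(
    transcript, "courseCode", courses, ReferencePolicy.MUST_RESOLVE
)
lenient = accept_record(
    transcript, "courseCode", courses, ReferencePolicy.POTENTIALLY_LOGICAL
)
missing = accept_record(
    {"grade": "B"}, "courseCode", courses, ReferencePolicy.POTENTIALLY_LOGICAL
)
```

Returning the `resolved` flag alongside acceptance is one possible design choice: it would let a host report how many stored references remain unresolved and use that as a signal for when to tighten the policy.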

It is also evident that this is really more of a technology approach than a data standards approach. From a data standards viewpoint, for a particular API specification, a reference is either required to resolve or not. However, it is possible that the overall Ed-Fi data model could be annotated as to which elements are allowed to be made "potentially logical" within other specifications - i.e., used at a meta level of the Ed-Fi suite of data standards.

In part, this design bears similarity to the FHIR healthcare standard, which has a concept of a dual key system, with business identifiers that "identify the underlying concept (also called 'real world entity') consistently across all contexts of use" vs the logical references "assigned by the server responsible for storing" the resource (see https://www.hl7.org/fhir/resource.html). FHIR emphasizes that logical keys are the ones driving connections and lookup within the data exchange context, while business identifiers are more useful for connecting to systems or processes not within the data exchange context. One could imagine that an Ed-Fi reference in an API context could send either or both keys, and that the business key would not be forced to resolve.
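One way to picture the dual-key idea is the sketch below. It is speculative and not drawn from FHIR or Ed-Fi code: the `Reference` dataclass, its `logical_id` and `business_id` fields, and the `resolve` function are hypothetical names, and the rule that a supplied logical key must resolve is an assumption of this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Reference:
    logical_id: Optional[str] = None   # server-assigned key (hypothetical field)
    business_id: Optional[str] = None  # real-world identifier (hypothetical field)


def resolve(ref, by_logical_id, by_business_id):
    """Return the matched resource, or None for an unmatched business key."""
    if ref.logical_id is None and ref.business_id is None:
        raise ValueError("a reference needs at least one key")
    if ref.logical_id is not None:
        # Assumption: logical keys drive lookups inside the exchange,
        # so a supplied logical key must resolve.
        if ref.logical_id not in by_logical_id:
            raise LookupError(f"unknown logical id {ref.logical_id!r}")
        return by_logical_id[ref.logical_id]
    # Business key only: use a match if one exists, but do not fail without it.
    return by_business_id.get(ref.business_id)


student = {"id": "a1", "stateId": "604822"}
by_logical = {"a1": student}
by_business = {"604822": student}

matched = resolve(Reference(logical_id="a1"), by_logical, by_business)
via_business = resolve(Reference(business_id="604822"), by_logical, by_business)
unmatched = resolve(Reference(business_id="0604822"), by_logical, by_business)
```

Here the zero-padded business key yields `None` rather than an error, so a caller could still store the record along with the unresolved key.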