Using the Changed Record Queries
- Vinaya Mayya
The Ed-Fi ODS / API platform contains data that gets updated frequently. The platform tracks inserts, updates, deletes, and key changes, and surfaces those changes to client systems through a feature called changed record queries, or "change queries." Change queries allow client systems to narrow requests for data to only data that has changed since a specified system version number. This allows client systems to stay in sync with the ODS / API without having to pull a complete dataset.
Change queries is an optional feature and is turned on by default - but can be turned off through configuration, so you'll need to check with your target platform host to see if it's enabled.
About Change Queries
The Change Queries feature was designed to provide a lightweight mechanism for tracking relevant API operations for enabling system integration developers to easily retrieve and apply changed data to downstream systems. It is conceptually similar to SQL Server Change Tracking, and thus the solution has many of the same properties and considerations, such as:
- The API will return the most recent version of the resource. (This is different from a change data capture (CDC) system that provides a log of every change.)
- Synchronization processing is greatly simplified when using snapshot isolation (which isolates the API client's work from ongoing changes to the underlying source ODS database). However, this approach is only available if the Ed-Fi ODS / API host takes some extra steps to create and make snapshots of the ODS available.
- When snapshots are not available, the API client bears the ultimate responsibility for ensuring data consistency with downstream systems (guidance for implementing this type of processing is provided later in this document). A periodic full reinitialization of the target system's data (if possible) may be necessary to achieve consistency.
- When snapshots are available and used for processing, this will likely limit the frequency with which processing can be performed (due to the technical constraints around creating and providing new snapshots).
- The article on the Changed Record Queries feature has implementation details which may be of interest to some client system developers.
Overview of Change Query Endpoints
The core operations for Change Queries processing are as follows:
Snapshots
The global Snapshots API resource provides information on the snapshot(s) that the host has made available for change processing.
GET /changeQueries/v1/snapshots
A successful request will contain a response body that looks like the following:
{ "id": "d19c86ced5ff49c19d56e6b5c8f1ec68", "snapshotIdentifier": "abcd", "snapshotDateTime": "2021-02-15T17:19:57.7866667Z" }
If multiple snapshots are returned, the API Client should use the most recent snapshot based on the snapshotDateTime
property. The API client should add the Snapshot-Identifier
HTTP header to each request with the value of the snapshotIdentifier
property.
If no snapshots are returned, it means that the host has not set up snapshots for change query processing. API clients can use the change queries feature without the Snapshot-Identifier
HTTP header, but using snapshots is highly recommended (when available) for data consistency and ease of implementation.
Available Change Versions Resource
The ODS / API uses a change version (as opposed to a date/time) in the form of a sequential long integer. The global Available Change Versions API resource provides information on the current change version. This resource allows clients to request just the resource items that have been created, updated or deleted since the last synchronization processing was performed.
GET /changeQueries/v1/availableChangeVersions
{ "oldestChangeVersion": 0, "newestChangeVersion": 100 }
Minimum and Maximum Change Version Parameters
The Minimum Change Version and Maximum Change Version parameters allow clients to request the latest representation of all resources that were modified within the given change version window. These parameters are available on every data management resource, both as part of the Ed-Fi Data Standard and in extension models. The parameters are also compatible with the existing parameters to support paging using the offset
and limit
parameters. Using paging parameters in conjunction with change version parameters, all records can be retrieved over multiple calls.
GET /data/v3/ed-fi/classPeriods?minChangeVersion=234378&maxChangeVersion=234974&offset=100&limit=500
Deletes Route
The "Deletes" route allows clients to get the id
and the natural key values for deleted resources. This route also supports the existing paging parameters of offset
and limit
.
GET /data/v3/ed-fi/studentProgramAssociations/deletes?minChangeVersion=234378&offset=0&limit=25
[ { "id": "cfa3fea8edf44c50b95e1808e616460a", "changeVersion": 234550, "keyValues": { "beginDate": "2021-08-30", "educationOrganizationId": 255901, "programEducationOrganizationId": 255901, "programName": "Bilingual", "programTypeDescriptor": "uri://ed-fi.org/ProgramTypeDescriptor#Bilingual", "studentUniqueId": "604854" } } ]
KeyChanges Route
The "keyChanges" route allows clients to get the natural key changes for updated resources. Not all resources support key changes. For those unsupported resources, the "keyChanges" route will always return an empty array. If there were multiple key changes in the change window defined on the request, there will be one entry per resource item affected with the initial and final key values for that resource. This route also supports the existing paging parameters of offset
and limit
.
GET /data/v3/ed-fi/classPeriods/keyChanges?minChangeVersion=104030&offset=100&limit=500
[ { "id": "e0f8848b54ad4bddbe81e96737ed7b46", "changeVersion": 104036, "oldKeyValues": { "classPeriodName": "ORIGINAL", "schoolId": 255901001 }, "newKeyValues": { "classPeriodName": "UPDATE 2", "schoolId": 255901001 } } ]
Synchronization Using the Change Query Endpoints
The primary purpose of the change queries feature is to support efficient movement of modified data from an ODS to a downstream source system(s). This section covers the basics and borrows heavily from Microsoft's documentation entitled, Work with change tracking (SQL Server).
Overview
The recommended implementation of a data synchronization solution contains the following elements:
- Identify which API resources are to be processed and the order of processing (informed by Ed-Fi model dependencies).
- Initial Synchronization (a one-time operation).
- Obtain the system's current change version (a.k.a., the synchronization version).
- Obtain and process the initial data for identified resources (with no change version filters applied).
- If successful, save the synchronization version for future change processing.
- Perform Change Processing (repeating, on a schedule).
- Obtain the saved synchronization version from the previous successful processing and add 1 (a.k.a., the starting change version)
- Obtain the system's current change version (a.k.a., the synchronization version).
- Obtain the most recent snapshot identifier (if available) and process the following API calls by applying
Snapshot-Identifier
HTTP header. - Process for key changes on identified resources (as applicable, in dependency order).
- Process for changes on identified resources(in dependency order).
- Process deletes on identified resources (in reverse-dependency order).
- If successful, save the new synchronization version value (for future Change Processing).
Identify resources to be processed
While some use cases may be very focused on a small subset of Ed-Fi data, for more expansive processing consider using the resource dependency metadata endpoint to correctly order the processing for the referential integrity defined in the Ed-Fi model.
Obtain initial data
Before an application can obtain changes for the first time, the application must obtain the system's current change version. This can be obtained using the availableChangeVersions route:
GET /changeQueries/v1/availableChangeVersions
{ "oldestChangeVersion": 0, "newestChangeVersion": 114218 }
With the current (initial) change version identified, the initial data should be retrieved from the API using paged queries against the desired data management API resources, as follows:
GET /data/v3/ed-fi/students?offset=0&limit=500
GET /data/v3/ed-fi/students?offset=500&limit=500
GET /data/v3/ed-fi/students?offset=1000&limit=500
...
If retrieval of the initial data set succeeds, save the change version value obtained at the beginning of this initial processing for use in the next stage – processing for changes.
Use the change version to obtain changes
The minChangeVersion
value used for change processing should be the previously stored (processed) change version value, incremented by 1 to avoid processing any of the same changes again.
For resources that allow for primary/natural key changes, first use the keyChanges route with the minChangeVersion
parameter as follows:
GET /data/v3/ed-fi/sections/keyChanges?minChangeVersion=101
Since the API always returns the current state, for ease of implementation it is recommended that you apply the key changes to the target system before continuing on with processing the other changes.
To obtain changes that have been made to a particular resource, use the main data management API with the minChangeVersion
parameter, as follows:
GET /data/v3/ed-fi/sections?minChangeVersion=101
To obtain resource items that have been deleted, use the deletes route with the minChangeVersion
parameter as follows:
GET /data/v3/ed-fi/sections/deletes?minChangeVersion=101
Obtain consistent and correct results
Obtaining the changed data for a resource requires multiple steps. Inconsistent or incorrect results could be returned if certain issues aren't considered and handled. For example, to obtain the changes that were made to the students and studentSectionAssociations resources, an application would perform the following steps:
- Get the current change version of the system (a.k.a., the synchronization version).
- Obtain and process all the changes for the students resource.
- Obtain and process all the changes for the studentSectionAssociations resource.
During this processing, ongoing operations by other API clients could be occurring in the ODS that could affect the results that are returned by the previous steps, such as the following operations:
- Changes can be made to the resources after the synchronization change version has been obtained. Therefore, more changes can be returned than expected.
- Another API client could add new resource items in the time between the calls to fetch changes from the students resource and the calls to fetch changes from the studentSectionAssociations resource. Therefore, the results for the studentSectionAssociations resource could have items with foreign key values that don't exist in the data already processed for the students resource.
To overcome these challenges, we recommend that you use snapshot isolation functionality provided by the snapshots API resource (which must be supported and maintained by the API host). This will help to ensure consistency of the change information. If you don't (or can't) use snapshots, developing a synchronization process that uses change queries could require significantly more effort.
Use snapshot isolation
Change queries has been designed to work well with snapshot isolation. Snapshot isolation must be supported by the API host through the snapshots API resource. All the steps that are required to obtain changes must be performed against a specific ODS snapshot by applying the Snapshot-Identifier
header to each API request. This will ensure that the changes that are made to the ODS data by other API clients won't be visible to the synchronization processing.
To obtain data inside a prepared ODS snapshot, first get the snapshot identifier as follows:
GET /changeQueries/v1/snapshots
This will return the available snapshots, allowing the most recent snapshot to be identified. The snapshotIdentifier
value should then be supplied to the API via the Snapshot-Identifier
HTTP header with all subsequent API requests. The API will serve the responses to these requests from the snapshot rather than the live ODS so that changes that are made by other API clients will not be visible during processing and data consistency is guaranteed.
Alternatives to snapshot isolation
There are alternatives to using snapshot isolation, but they require more work to make sure all application requirements are met. Changes can occur after the synchronization version for the next enumeration has been obtained. There are two ways to handle this situation. The option that is used depends on the application and how it can handle the side-effects of each approach:
- Include all changes, even those that have a change version larger than the synchronization version
With this approach, all Change Queries API requests will only include theminChangeVersion
parameter (themaxChangeVersion
parameter will not be supplied).
The resource items (including keyChanges and deletes) that have a change version larger than the synchronization version will be obtained again on the next synchronization. This must be expected and handled by the processing logic. - Ignore changes that have a change version larger than the synchronization version
With this approach, all Change Queries API requests will include both theminChangeVersion
and themaxChangeVersion
parameters, the latter of which will be assigned to the synchronization version.
This has the side effect that a new or updated resource item would be skipped if it was created or updated before the synchronization version was obtained, but then updated afterward. If there's a new resource item, a referential integrity problem might occur if there was a resource item in another resource that was created that referenced the skipped resource item. If there's an updated existing resource item, the resource item will be skipped and not synchronized until the next time. - Devise an approach that combines both the previous options
With this approach, you could decide that depending on the operation you might want an application for which it is best to ignore changes newer than the next synchronization version in which the resource item was deleted, but creates/updates aren't ignored.
Adding to the potential list of issues that must be considered with either of the above options is how certain operations performed by other API clients can affect API paging requests. For example, if while paging through a resource's items an item is deleted by another API client, it will impact the items that are included in the paging offsets calculated by the database engine, causing them all to shift. If an item that has already been processed is subsequently deleted during processing, all items will shift up in the page breaks, and the (formerly) first item on an upcoming page could be missed completely (never processed) as it shifts into a page that the client has already retrieved for processing.
The same outcome could result if new changes are ignored and an item is updated after it has been processed (because the item would no longer match the maxChangeVersion
filter criteria and paging would shift as described above, resulting in the same "missed item" condition).
Choosing the approach that will work for the application when you are using change queries (or any custom tracking mechanism), requires significant analysis. Therefore, it is much simpler to use snapshot isolation.
For ODS-to-ODS change processing, consider using the API Publisher utility from the Ed-Fi Exchange which was developed specifically for this use case.