Bulk Data Exchange SIG 2018-06-19

Reference Materials

https://techdocs.ed-fi.org/download/attachments/45777870/Batch%20Modalities%20of%20Data%20Exchange%20-%202018-06-18.docx?api=v2

Agenda

  1. Geoff McElhanon to present changes to the design proposal for bulk/batch import and export, based on information from the 6/5 meeting.
  2. Dependency order
  3. User stories

Attendees

  • Mike Discenza (SchoolLinks)
  • Stephen Fuqua (Ed-Fi Alliance)
  • Eric Jansson (Ed-Fi Alliance)
  • Geoff McElhanon (Student1)
  • Ben Meyers (DoubleLine Partners)
  • Scott Meddles (SchoolZilla)
  • Doug Quinton (PowerSchool)

User Stories

  • As a consumer of a bulk JSON system, I should be able to specify the domains/fields to include in a partial export. (From a data privacy perspective, this would presumably also be useful to the data owner.)
    • Bulk export of records but with specialized profile / data policy? → we should be able to support that
    • What about allowing the query terms to set the expected fields instead of the host profile?
    • Fields parameter is currently "give me only these fields". What about the inverse - you want most of the data, and only want to exclude a few fields? Is that worth pursuing?
  • As a consumer I want to be notified when a request to generate data has been completed so that I can download this data
  • As a consumer, I would like to see metadata associated with the export (mainly the timestamp of the snapshot), so that I know which records were updated after the snapshot time and can quickly tell whether newer versions exist of the entities that were synced (or, in a Clever-events-like scenario, know that events after a certain time need to be played back to bring the data up to date). This is mainly useful for changes that happen between the creation of the bulk dump and its consumption.
  • List of Changes: As a bulk data consumer, I want to get a list of changed resources so that I will know which records need to be updated in my database.
  • Callback: As a bulk data consumer or producer, I want to submit a callback URL in my POST request, so that the Ed-Fi ODS will make a call back to my system when the requested process is done.
  • Secured Status / Data: As a bulk data consumer or producer submitting an asynchronous request, I want the status and output of my request to be secured so that other client applications cannot access my data.
  • Natural Key for Errors: As a bulk data producer, I want to know the natural key for each failed record, so that I can flag the problem in my own system.
  • Upsert: As a bulk data producer, I want to resubmit corrected bulk data with already-processed records present, so that I can reprocess a batch successfully without having to determine which records were already loaded on a prior attempt.
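
The open question above about the fields parameter (inclusion today, exclusion as a possible inverse) can be illustrated with a minimal Python sketch. The `exclude` argument is hypothetical and not part of any agreed design, the sample record is invented, and only top-level fields are handled.

```python
from typing import Any, Dict, Iterable, Optional


def project_fields(
    record: Dict[str, Any],
    include: Optional[Iterable[str]] = None,
    exclude: Optional[Iterable[str]] = None,
) -> Dict[str, Any]:
    """Return a copy of record limited by an inclusion OR an exclusion list.

    include mirrors the existing "give me only these fields" behavior.
    exclude is the hypothetical inverse raised in the discussion.
    """
    if include is not None and exclude is not None:
        raise ValueError("specify include or exclude, not both")
    if include is not None:
        wanted = set(include)
        return {k: v for k, v in record.items() if k in wanted}
    if exclude is not None:
        unwanted = set(exclude)
        return {k: v for k, v in record.items() if k not in unwanted}
    return dict(record)


# Invented sample record, for illustration only.
student = {
    "id": "abc",
    "studentUniqueId": "604822",
    "firstName": "Ana",
    "birthDate": "2010-01-01",
}

only_keys = project_fields(student, include=["id", "studentUniqueId"])
without_birth_date = project_fields(student, exclude=["birthDate"])
```

Rejecting requests that combine both lists keeps the semantics unambiguous; whether a real design would allow mixing them is an open question.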

Discussion

Dependency Order

The Ed-Fi data model is highly referential. Should we enforce dependency order of messages, or should the server side try to resolve dependencies by processing messages out of order? If resolving, what does that mean and how would it work?

  • For messages within a batch. E.g. a single JSON payload containing both student and student data.
  • PowerSchool - no reason we can't send the payload with records in the correct dependency order.
  • Without bulk, vendors are already having to submit in the right dependency order - they're used to this.
  • Implementations could have explicit rules for processing, or just loop through the data and retry each message until it processes successfully.
  • Geoff says the import process can handle this already. But what about export?
  • Export should be in a reasonable order.
  • Consumer will have to write business logic no matter what to match the dependency order in their own data model.
  • How do we resolve dependency order for extensions - Do we need to define a feature for publishing this order for extensions?
  • Other organizations are starting to implement the Ed-Fi standard in their own APIs. Does whatever we design here become a must-have or should-have feature of that standard?
    • Is bulk/batch considered part of the API standard? → not yet.
    • Perhaps we initially add it to the Ed-Fi implementation but not to the standard, and try it out for a while before standardizing. Eventually "resolving dependency order" probably should become a must-have in the standard.
    • Could we express the dependency order in MetaEd so that it can be consumed by other API implementers?
  • Publishing dependency order / relationships through an API endpoint? Are there any systems that would find this valuable?
    • Need to know in advance what data are being pulled, even with extensions.
    • Dependency graph is not obvious when new to this space.
    • List of descriptors as a composite?

Trial conclusion: not unreasonable to have the API resolve the dependency order for both the import and the export in order to support streaming processing.
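
The "loop through the data and retry" option mentioned in the discussion could look roughly like the Python sketch below. It is illustrative only: the `MissingReference` exception, the message shape (`key`, `refs`), and the in-memory `loaded` set are assumptions, not the behavior of the existing import process.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, object]


class MissingReference(Exception):
    """Hypothetical error: a referenced record has not been loaded yet."""


def process_with_retries(
    messages: List[Message], process: Callable[[Message], None]
) -> Tuple[List[Message], List[Message]]:
    """Process messages in passes, deferring those with unmet references.

    Stops when everything is processed or when a full pass makes no
    progress. Returns (processed, unresolved).
    """
    pending = list(messages)
    processed: List[Message] = []
    while pending:
        deferred: List[Message] = []
        for message in pending:
            try:
                process(message)
                processed.append(message)
            except MissingReference:
                deferred.append(message)
        if len(deferred) == len(pending):
            # No message succeeded in this pass; retrying cannot help.
            return processed, deferred
        pending = deferred
    return processed, []


# Toy in-memory "store" standing in for the target system (assumption).
loaded = set()


def process(message: Message) -> None:
    missing = [ref for ref in message["refs"] if ref not in loaded]
    if missing:
        raise MissingReference(", ".join(missing))
    loaded.add(message["key"])


batch = [
    {"key": "ssa:1", "refs": ["student:1", "school:1"]},  # arrives too early
    {"key": "student:1", "refs": []},
    {"key": "school:1", "refs": []},
    {"key": "ssa:2", "refs": ["student:9"]},  # reference never arrives
]
processed, unresolved = process_with_retries(batch, process)
```

In this toy run the association that arrives before its student and school is deferred and succeeds on the second pass, while the message with a reference that never arrives is returned as unresolved instead of looping forever.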

Bulk / Batch Modalities Update

  • Namespace security - the implication is that the namespace would be implicit, unlike when posting descriptors, where the namespace must be provided in the payload. This diverges from the existing pattern, but perhaps it is a better pattern.
  • Obtaining resource identifiers
    • Not currently carrying the bulk ID into the records, so querying by natural key and not by bulk ID.
    • What happens when you've just uploaded thousands of students? Don't want to perform 10,000 queries with natural keys to get each resource ID. → Error handling should return the natural keys that failed, so why re-query to find which records were loaded?
    • Nevertheless, could preserve the bulk ID and allow querying on that field. Or load resource IDs into the bulk database - but would need a cleanup process later on.
    • Alternatively - lookup individual resource IDs as needed instead of pre-loading them into source system after batch process.
    • If storing bulk ID, would it be possible to save only for new records? Or preferable to also save the association for updated records? If main goal is to query for resource IDs by bulk ID, then perhaps only insert is needed.
    • /data/v3/ed-fi/students?fields=id,studentUniqueId&bulkOperationId=abcd..1234
      • Still would need one request per resource type
      • Concern with this being in the core API - preference for being in the bulk API
    • Ultimately need to present errors to a real user in a district, so that they can correct the error in the source system.
  • Callback
    • Should we send the full link to the status details in the payload, instead of just an ID?
    • Would still need to be authorized.
    • Concerned about blindly following links, so preference to have system build the link.
    • Callback absolutely must be optional for systems behind a firewall.
  • Errors
    • Restated need for the natural key in the error. → In the document, look at top of page 11, the sample payload has studentUniqueId natural key for a student resource.
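
As a rough illustration of two points above (the client building the status link itself rather than following a link from the callback, and querying resource IDs by bulkOperationId), here is a minimal Python sketch. The base URL, the `/bulk/operations/` path, the callback payload shape, and the sample operation ID are all assumptions; only the `/data/v3/ed-fi/students` query shape comes from the notes.

```python
from typing import Dict
from urllib.parse import quote, urlencode

# Hypothetical host that the client already trusts and is authorized against.
BASE_URL = "https://api.example.org"


def status_url(operation_id: str) -> str:
    """Build the status link locally from the ID (path is an assumption)."""
    return f"{BASE_URL}/bulk/operations/{quote(operation_id, safe='')}"


def resource_ids_url(resource: str, operation_id: str) -> str:
    """Build a query for resource IDs loaded by one bulk operation."""
    query = urlencode(
        {"fields": "id,studentUniqueId", "bulkOperationId": operation_id},
        safe=",",  # keep the comma-separated field list readable
    )
    return f"{BASE_URL}/data/v3/ed-fi/{resource}?{query}"


def handle_callback(payload: Dict[str, str]) -> str:
    """Use only the ID from the callback; ignore any link it may carry."""
    return status_url(payload["bulkOperationId"])


# Invented callback payload; the extra link is deliberately not followed.
callback = {
    "bulkOperationId": "abcd1234",
    "statusLink": "https://untrusted.example/status",
}
next_request = handle_callback(callback)
ids_request = resource_ids_url("students", "abcd1234")
```

The request to the built URL would still need to carry the client's normal authorization, and one such query would be needed per resource type, as noted above.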

Next Steps and Questions

  • (plus) Consider mandating (standardizing) bulk API support for streamed processing
  • (plus) Querying resources by bulkOperationId - adding to the design
  • (question) Fields parameter as exclusion rather than inclusion
  • (question) What constraints apply to the bulk process (e.g. time)? Performance considerations.
  • (question) Batch is resource-oriented or student-oriented? "Student backpack" bundling together all student data for export.
    Is this what we are discussing?

    Student oriented
    { 
        "students": 
        [
            {
                "id": "...",
                "firstName": "...",
                "edOrgAssociations": [
                    { ... },
                    { ... }
                ]
            }
        ]
    }
    Resource oriented
    { 
        "students": 
        [
            {
                "id": "...",
                "firstName": "..."
            }
        ],
        "edOrgAssociations": [
            { ... },
            { ... }
        ]
    }
  • (question) Concerns re: requirement of synchronous processing/response for batch operations, and whether that is a requirement or a priority. (Ben - shared offline after meeting)
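
To make the difference between the student-oriented and resource-oriented layouts shown above concrete, the following Python sketch converts one into the other. It is only an illustration: the `studentId` back-reference added to each association and the `edOrgId` sample field are assumptions, since the notes elide the association contents.

```python
import copy
from typing import Any, Dict, List


def to_resource_oriented(bundle: Dict[str, Any]) -> Dict[str, List[Dict[str, Any]]]:
    """Flatten a student-oriented bundle into per-resource lists."""
    students: List[Dict[str, Any]] = []
    associations: List[Dict[str, Any]] = []
    for student in bundle.get("students", []):
        # Copy every student field except the nested associations.
        flat = {
            key: copy.deepcopy(value)
            for key, value in student.items()
            if key != "edOrgAssociations"
        }
        students.append(flat)
        for association in student.get("edOrgAssociations", []):
            row = copy.deepcopy(association)
            # Hypothetical back-reference so the link to the student survives.
            row.setdefault("studentId", student["id"])
            associations.append(row)
    return {"students": students, "edOrgAssociations": associations}


# Invented sample data for illustration.
backpack = {
    "students": [
        {
            "id": "s1",
            "firstName": "Ana",
            "edOrgAssociations": [{"edOrgId": 1}, {"edOrgId": 2}],
        }
    ]
}
flattened = to_resource_oriented(backpack)
```

The sketch shows that the resource-oriented layout needs an explicit reference back to the student on each association, which the nested layout carries implicitly.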