Bulk Data Exchange SIG 2018-06-05

Pre-Read Materials

Agenda

When            What                                                                   Who
10:00 - 10:05   Welcoming / call to order                                              Eric Jansson
10:05 - 10:10   Introductions (name, organization, your interest in this topic)        Stephen Fuqua
10:10 - 10:15   High-level overview of Batch Modalities of Data Exchange               Geoff McElhanon
10:15 - 10:20   High-level overview of Change Events                                   Eric Jansson
10:20 - 10:55   Discussion: merits and real-world use cases for bulk functionalities   Eric Jansson
10:55 - 11:00   Wrap-up: next steps / action items                                     Stephen Fuqua

Attendees

  • Mike Discenza (SchoolLinks)
  • Stephen Fuqua (Ed-Fi Alliance)
  • Eric Jansson (Ed-Fi Alliance)
  • Geoff McElhanon (Student1)
  • Ben Meyers (DoubleLine Partners)
  • Scott Meddles (SchoolZilla)
  • Doug Quinton (PowerSchool)

Discussion

Introductions

  • Stephen & Eric: representing Ed-Fi Alliance.
  • Ben: two perspectives - code oversight, and support of field implementations that want bulk operations.
  • Doug: when publishing 1,000 records in bulk, needs to know what errors occurred and to get resource IDs back. Prefers JSON over XML. Interested in the idea of sending all of the data for a single student in a single bulk operation, if resource IDs are returned.
  • Mike: JSON is simpler for us than XML. Agrees about the need for resource IDs for syncing. For data analysis, a true bulk export lowers the infrastructure cost of loading analytics solutions.
  • Scott: interested in differential data sets with less chatter.
  • Geoff: architecture perspective, ensuring that Ed-Fi solutions have appropriate support for bulk operations.

Batch Modalities of Data Exchange

  • Overview of the pre-read document.
  • Code for this bulk process was initially created for a single state, and it is not in the ODS API core.
  • Should this bulk pipeline extend into the change events model? Would provide feature parity with the way Clever handles change events.
  • Keeping import and export formats aligned - allowing export from one API instance and import to another.
  • Batch format for reducing chattiness, while discouraging large-scale synchronous batch operations.
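
As a concrete illustration of the batch modality under discussion, here is a minimal sketch of how a client might submit a JSON batch and poll for completion. The endpoint paths, payload shape, and response fields are illustrative assumptions, not the actual ODS API surface.

    # Hypothetical sketch: submit a JSON batch import asynchronously and poll
    # until the pipeline finishes. Paths and response fields are assumptions.
    import time
    import requests

    BASE = "https://example.ed-fi.local/api"  # assumed base URL
    HEADERS = {"Authorization": "Bearer <token>"}

    # A batch groups many resources into one request to reduce chattiness.
    batch = {
        "students": [
            {"studentUniqueId": "12345", "firstName": "Ada", "lastSurname": "Lovelace"},
        ],
        "studentSchoolAssociations": [
            {"studentUniqueId": "12345", "schoolId": 255901001, "entryDate": "2018-08-20"},
        ],
    }

    # The response carries an operation id, not the individual resource ids.
    resp = requests.post(f"{BASE}/bulk/imports", json=batch, headers=HEADERS)
    resp.raise_for_status()
    operation_id = resp.json()["id"]

    # Poll the status endpoint; a callback URL could replace this loop.
    while True:
        status = requests.get(f"{BASE}/bulk/imports/{operation_id}", headers=HEADERS).json()
        if status["status"] in ("completed", "failed"):
            break
        time.sleep(5)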

Change Events API

  • In Swagger, see the events resource in a Composite
  • Still in development
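
Since the feature was still in development, the following is only a hedged sketch of how a client might poll such an events resource. The composite route and the change-version parameter are assumptions, not a confirmed API.

    # Hypothetical sketch: poll a change-events composite for resources changed
    # since a client-tracked version. Route and parameter names are assumptions.
    import requests

    BASE = "https://example.ed-fi.local/api"
    HEADERS = {"Authorization": "Bearer <token>"}

    last_seen_version = 1042  # persisted by the client between polls

    events = requests.get(
        f"{BASE}/composites/events",
        params={"minChangeVersion": last_seen_version + 1},
        headers=HEADERS,
    ).json()

    for event in events:
        # Each event names the changed resource; per the discussion, clients
        # fetch the current snapshot separately if they need the full payload.
        print(event["resource"], event["changeType"], event["changeVersion"])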

Open Discussion

  • Does export of bulk data include list of change events?
    • Q: asking for a list of events, or the full resource that changed? A: asking more for a list of events, although some might want the resource.
    • Geoff: Right now, when you call for the list of changed resources, you get the current snapshot, cumulative of all changes.
    • If asking for the detailed changes over time, then this gets into the "Temporal ODS", which is a separate problem from bulk and addressed in another SIG.
  • Should we use the Composites engine against data management resources for GET requests? Composites offer more support for field selection, which would allow you to pull just identifiers, for example, without having to create an explicit Composite.
  • Does this proposal return identifiers? Any positional analysis of the file?
    • The file is not processed in a way that could support positional analysis.
    • What about resource key identifier reporting? Yes, that would work. → add id lookup by natural key.
    • This bulk import process is asynchronous; when done, the client needs to call a separate endpoint to fetch the list of IDs (see the first sketch after this list).
    • If doing a synchronous batch POST, would want the same format coming back as in the async case, but as the direct response to the POST.
  • Any discussion around ownership security for bulk imports and exports?
    • Right now, if you have bulk access, then you can see anyone else's bulk operations.
    • Agreed, needs to be ownership-based: use namespace URI and client key correlation to limit clients to seeing only their own bulk operations.
  • Callback URL (web hooks)
    • Very basic POST to let the client know the bulk operation is done. Includes export ID and status only (a minimal receiver is sketched after this list).
    • Should this be on both imports and exports? Yes.
    • This feature is in addition to being able to poll the status endpoint.
  • Errors
    • Could the callback API render the errors? Say an upload failed: would we POST back the details? → no, the callback would only indicate that there are problems; errors are requested via a separate request.
    • This document doesn't dive into the errors, as there was another SIG. → Eric points out that this other SIG was looking at a different problem, and this Bulk API could have its own reporting logic.
    • Vendors want to get the enumerated error details by natural key after the async bulk load process completes. This is in addition to standard 404-type responses.
    • If there were data errors inside the file, then the query should return 200 OK, but with an errors collection containing the details.
      • Better to state that the entire file is in an indeterminate state so that the entire batch can be resubmitted.
      • Bulk process is not transactional.
      • Creates will have some failures on the second submission; clients should just expect that. Ideally the response would state that a failure occurred because the record already exists.
      • Or would we rather just process as an "upsert" instead of rejecting a duplicate?
  • Did not talk about dependency order and automatic resolution.
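
To make the ID-lookup and error-reporting requests above concrete, here is a hedged sketch of fetching assigned resource IDs and enumerated errors by natural key after an async import completes. The /ids and /errors endpoints and the response shapes are assumptions drawn from the discussion, not a settled design.

    # Hypothetical sketch: after an async bulk import completes, fetch the
    # assigned resource ids and any per-record errors, keyed by natural key.
    import requests

    BASE = "https://example.ed-fi.local/api"
    HEADERS = {"Authorization": "Bearer <token>"}
    operation_id = "0f8d-example"  # returned when the batch was submitted

    # Assumed ids endpoint: maps each natural key in the file to a resource id.
    ids = requests.get(f"{BASE}/bulk/imports/{operation_id}/ids", headers=HEADERS).json()
    for record in ids:
        print(record["resource"], record["naturalKey"], record["id"])

    # Assumed errors endpoint: returns 200 OK with an errors collection; an
    # empty collection means every record in the file was processed.
    errors = requests.get(f"{BASE}/bulk/imports/{operation_id}/errors", headers=HEADERS).json()
    for err in errors:
        print(err["resource"], err["naturalKey"], err["message"])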
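
The callback URL discussed above would carry only an operation ID and a status; everything else is fetched in a follow-up request. A minimal receiver might look like the following sketch, where the payload shape is an assumption.

    # Hypothetical sketch: a minimal web-hook receiver for bulk completion
    # callbacks. The payload shape ({"id": ..., "status": ...}) is assumed.
    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class BulkCallbackHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length))
            print("bulk operation", payload["id"], "finished:", payload["status"])
            # Acknowledge receipt; details are requested via the status/errors
            # endpoints rather than carried in the callback itself.
            self.send_response(204)
            self.end_headers()

    HTTPServer(("", 8080), BulkCallbackHandler).serve_forever()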

Action Items

  • Update the design proposal to address:
    • Add ID lookup via natural key
    • Add ownership security
    • Add callback URL to imports, not just exports
    • Details on error handling and failures on processing
    • Dependency order
  • Post-meeting, Eric and Stephen came up with an additional request for Ben, Doug, Mike, and Scott: before the next meeting, can you send in one or more user stories describing how you would interact with this JSON-oriented bulk system?