Bulk Data Exchange SIG 2018-06-26

Agenda

  1. Concerns re: requirement of synchronous processing/response for batch operations, and whether that is a requirement or priority.
  2. Is batch resource-oriented or student-oriented? A "student backpack" would bundle together all student data for export. Or are exports granular? If student-oriented, what is the relationship to composite resource models?
  3. Consider mandating (standardizing) bulk API support for streamed processing
  4. Querying resources by bulkOperationId - need to add to the design
  5. Fields parameter and inclusion/exclusion models to simplify fields selection
  6. What constraints should there be on bulk processing (e.g. time), and what other performance considerations apply?

Attendees

  • Mike Discenza (SchoolLinks)
  • Stephen Fuqua (Ed-Fi Alliance)
  • Eric Jansson (Ed-Fi Alliance)
  • Geoff McElhanon (Student1)
  • Ben Meyers (DoubleLine Partners)
  • Scott Meddles (SchoolZilla)
  • Doug Quinton (PowerSchool)

Discussion

  1. Concerns re: requirement of synchronous processing/response for batch operations, and whether that is a requirement or priority.
    1. Ben: have seen a lot of desire for bulk import and export, but not necessarily batch. Can we prioritize bulk independently from batch? Applies to other items below as well.
    2. Which is more important: import, export, or both? Ben: export, since there is none right now. Import improvements are good and consistent but not absolutely necessary.
    3. Doug: agree on distinguishing bulk and batch. Tends to push more data than consume, so less worried about export.
    4. Mike: represents the flip side - rare to push and mostly pull.
    5. What is the difference between the two? Batch is synchronous, with each batch self-contained transactionally, while bulk is asynchronous. For batch operations, the client needs to get back the resource identifiers. A problem with bulk is that a failure can reject the entire payload; import definitely needs better reporting on what failed.
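
The batch/bulk distinction above might be sketched with two hypothetical response shapes; all field names here are assumptions for illustration, not the actual Ed-Fi API contract.

```python
# Synchronous batch: the response is transactional and returns the
# resource identifiers (or per-item errors) immediately.
batch_response = {
    "status": "completed",
    "items": [
        {"index": 0, "resourceId": "abc123", "status": 201},
        {"index": 1, "status": 409, "error": "natural key conflict"},
    ],
}

# Asynchronous bulk: the client gets a job handle and polls for status;
# failures are reported per record after processing finishes.
bulk_response = {
    "bulkOperationId": "op-42",
    "status": "processing",
    "statusUrl": "/bulkOperations/op-42",
}

# A batch caller can read identifiers right away...
created = [i["resourceId"] for i in batch_response["items"] if i["status"] == 201]
print(created)  # ['abc123']
# ...while a bulk caller must poll statusUrl and then ask what failed.
```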
  2. Is batch resource-oriented or student-oriented? A "student backpack" would bundle together all student data for export. Or are exports granular? If student-oriented, what is the relationship to composite resource models?
    1. "Use-case-oriented" is a better term than "student-oriented".
    2. Another way to state this: is export granular around API resources, or is it use-case-oriented, with a high-level resource and its dependencies included ("tree structure")?
    3. Composites can fulfill the use-case perspective.
    4. Should there be additional filtering? Right now it is basically a GetAll or GetByExample request.
    5. Want to avoid accidentally re-creating a query system like OData, GraphQL. What is the minimal use case?
    6. Mike: GraphQL could be a game changer, not for bulk data transfer, but for allowing direct use of the ODS.
      1. (SF side note: GraphQL on top of missing metrics layer?)
    7. What is the MVP?
      1. GetAll so client can decide what to keep on individual resources.
      2. What about attendance data? Query by last modified date?
      3. Change events handles temporal filtering.
      4. Next priority: where clause. Last priority: projection (composites good enough).
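
The MVP priorities above (GetAll first, then temporal filtering via last-modified date, with a where clause last) might look like the following client-side sketch; the base path and parameter names are assumptions, not a settled design.

```python
from urllib.parse import urlencode

def build_export_url(base, resource, last_modified_after=None, where=None):
    """Build a hypothetical bulk-export URL: GetAll by default, with
    optional temporal filtering and, lowest priority, a where clause."""
    params = {}
    if last_modified_after:
        params["lastModifiedDate"] = last_modified_after  # assumed name
    if where:
        params["where"] = where  # last-priority feature per the notes
    query = ("?" + urlencode(params)) if params else ""
    return f"{base}/{resource}{query}"

# MVP case: plain GetAll, client decides what to keep.
print(build_export_url("https://example.org/data/v3", "students"))
# https://example.org/data/v3/students

# Attendance-style case: only records changed since a given date.
print(build_export_url("https://example.org/data/v3",
                       "studentSectionAttendanceEvents",
                       last_modified_after="2018-06-01"))
```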
  3. Consider mandating (standardizing) bulk API support for streamed processing
    1. This is about API standards that vendors would implement without the ODS.
    2. Keep in mind for the future, once we have proved out the concepts with our implementation.
  4. Querying resources by bulkOperationId - need to add to the design
    1. Problem with adding it to the ODS model - could have overlapping bulk operations.
    2. Need to work on the Bulk operation design and consider a bulk-specific endpoint for retrieving resource IDs, as one final step before loading the data into the Bulk database.
    3. Want to be able to extract Students specifically, perhaps a few other resources.
    4. Respond with the ODS identifier and the natural keys, similar to the bulk exceptions.
    5. While designing next steps for the Bulk API, will consider if recording the resource sequence from input would be feasible.
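
The bulk-specific lookup discussed above could take a shape like the following: given a bulkOperationId, return each resource's ODS identifier alongside its natural keys, similar to bulk exception reporting. The function, store layout, and field names are illustrative assumptions only.

```python
def resources_for_operation(store, bulk_operation_id):
    """Return the ODS identifier and natural keys for every resource
    recorded under one bulk operation (hypothetical shape)."""
    return [
        {"id": rec["odsId"], "naturalKeys": rec["keys"]}
        for rec in store
        if rec["bulkOperationId"] == bulk_operation_id
    ]

# Toy store illustrating two overlapping bulk operations, which is why
# the notes argue the mapping cannot simply live on the ODS model.
store = [
    {"bulkOperationId": "op-1", "odsId": 101,
     "keys": {"studentUniqueId": "S-1"}},
    {"bulkOperationId": "op-2", "odsId": 102,
     "keys": {"studentUniqueId": "S-2"}},
]
print(resources_for_operation(store, "op-1"))
```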
  5. Fields parameter and inclusion/exclusion models to simplify fields selection.
    1. This is both in general and in bulk context.
    2. Came up in the conversation around getting keys.
    3. SQL is inclusion-based, not exclusion-based. Other APIs tend to be inclusion-based as well.
    4. Data profiles can limit access to secure fields if that is needed.
    5. The Composites engine can do field selection and get-by-example; maybe we could re-use that in bulk.
    6. Is the export intended to cover multiple resources? Yes. But that makes projection / field selection in the query more difficult. We can explore this in the design before committing.
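
The inclusion-based model favored above can be sketched in a few lines: the client names the fields to keep, like a SQL column list, rather than the fields to drop. The parameter name and resource shape are assumptions for illustration.

```python
def select_fields(resource, fields):
    """Inclusion-based field selection: keep only the named fields.
    'fields' here stands in for a hypothetical query parameter."""
    return {k: v for k, v in resource.items() if k in fields}

student = {"studentUniqueId": "S-1", "firstName": "Ada",
           "lastSurname": "Lovelace", "birthDate": "1990-12-10"}

# e.g. an export that only needs the keys for matching records later
print(select_fields(student, {"studentUniqueId", "lastSurname"}))
```

An exclusion model would invert the predicate (`k not in fields`); the notes lean toward inclusion since that matches SQL and most other APIs.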
  6. What constraints should there be on bulk processing (e.g. time), and what other performance considerations apply?

    1. Design doc already has size restrictions for synchronous batch.
    2. Does Bulk XML have a 2 GB file size limitation?
    3. It could be useful to identify the largest reasonable size of JSON file.
    4. The memory footprint of this process should be lower than with Bulk, so it should be able to handle larger files; disk space is more of a limitation now.
    5. Maybe add a config option to allow an implementation to set a limit.
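
The configurable limit floated above could be as simple as a host-set maximum with a default; the setting name and the 2 GB default (echoing the Bulk XML figure mentioned earlier) are assumptions for illustration.

```python
DEFAULT_MAX_UPLOAD_BYTES = 2 * 1024**3  # assumed default, per the 2 GB figure

def max_upload_bytes(config):
    """Read the host-configured upload limit, falling back to a default.
    'bulk.maxUploadBytes' is a hypothetical setting name."""
    return int(config.get("bulk.maxUploadBytes", DEFAULT_MAX_UPLOAD_BYTES))

def validate_upload(size_bytes, config):
    """Reject an upload that exceeds the implementation's limit."""
    limit = max_upload_bytes(config)
    if size_bytes > limit:
        raise ValueError(f"upload of {size_bytes} bytes exceeds limit {limit}")

validate_upload(500 * 1024**2, {})  # 500 MB passes under the default limit
```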

Operations discussion

  • Should delete be supported in bulk? Yes.
  • "Create" operation - should that implicitly do an upsert if the record already exists?
  • Should "operation" be "change type"? Or maybe two distinct things:
    • operation: [upsert, delete]
    • changeType: [created, updated, deleted]
  • Should "create" be a different word?
  • What can we learn from HTTP verbs and status codes?
  • Change events reports created or updated.
  • Leaning toward leaving operation in there and defaulting add/create to be an upsert. Or maybe leave the operation field out except for deletes, and just upsert ("POST") all records in the bulk.
  • Want it to be very easy to export and import back in without additional effort.
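
The leaning described above, omit "operation" and upsert by default, with only deletes carrying an explicit operation, can be sketched as follows. The record shape and field names are illustrative, not a settled design.

```python
def apply_record(db, record):
    """Apply one bulk record to a toy key-value store: records without
    an 'operation' field default to upsert; only deletes are explicit."""
    op = record.get("operation", "upsert")  # assumed default behavior
    key = record["naturalKey"]
    if op == "delete":
        db.pop(key, None)
    else:
        # Upsert: create or update in one step, like the API's POST.
        db[key] = record["data"]

db = {}
apply_record(db, {"naturalKey": "S-1", "data": {"name": "Ada"}})       # create
apply_record(db, {"naturalKey": "S-1", "data": {"name": "Ada L."}})    # update
apply_record(db, {"naturalKey": "S-1", "operation": "delete", "data": {}})
print(db)  # {}
```

This shape also serves the round-trip goal in the last bullet: an exported record can be re-imported unchanged, since no operation field is required for the common case.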