Internationalization Work Group 2020-08-06
Participants
- Eric Jansson
- Gene Garcia
- Ed Comer
- Stephen Fuqua
- Gabrielle Garonzik
Materials
- Google check-in sheet - https://docs.google.com/spreadsheets/d/1Fm6ykDhnE0EfmUGMXsFiepQXGf1ec7pin28PrOtlfMI/edit?usp=sharing
Notes
Reviewing Slide Deck
Historically, the Ed-Fi UDM adopted a natural key system for two main reasons: 1) key unification (merges) provides a type of validation, and 2) natural keys enhance discoverability and usability, making it easier to redistribute data to other systems by minimizing the number of surrogate keys.
The REST API introduced a surrogate key (the resource id), so Ed-Fi is really a hybrid model.
Discussion points:
- If each source system has a surrogate key, the identification is clear and provides a one-to-one mapping. When bringing data in from multiple sources, you have to decide which system is the system of record to give that system the "primary" surrogate key.
- The current ecosystem actually behaves this way. They insert data with a natural key and then use the resource id surrogate key to update and delete data.
- There is some difficulty when multiple systems overlap and there is not a clear "primary" for which system should have create, read, update, and delete capabilities. The other concern is around data validation.
- Question: Does it make sense to track the chain of provenance of surrogate keys for transactions? This would be a way to track changes to data.
- Discussed the partial key surrogates that the model has used to address overly complex natural keys (like Assessment and Section).
- Surrogate keys are a way that SISs can be sure the data they are pulling out is what they put in.
- Pros and cons: Data validity is the biggest pro. Discoverability has not really proven itself as a pro in practice. The big con is that the natural key is a barrier in the model. Course Transcript is a good example of this because Course is required as part of the key.
- Another issue is that the key itself is not sufficient to guarantee uniqueness. One example is using Graduation Plan for an individual graduation plan: to make this work, the descriptor values have to explode to cover all possibilities and students.
- Key volatility is an issue: when a field that is part of a key changes (like BeginDate for StudentSchoolAssociation), that record and anything that depends on it downstream (like Grades) have to be deleted and created anew. This is getting better as we track these issues down.
- Downstream complexity is another issue, with all the required references and joins. This can also impact performance.
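To make the hybrid behavior and the key volatility point above concrete, here is a minimal sketch in Python. It is a toy in-memory model, not the Ed-Fi API: the class `ToyResourceStore`, its methods, and the sample key values are all hypothetical. It assumes a record is created under a natural key, receives a generated surrogate resource id, and is then updated or deleted through that id.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ToyResourceStore:
    """Hypothetical store indexed by both natural key and surrogate resource id."""
    by_id: dict = field(default_factory=dict)      # resource id -> record
    id_by_key: dict = field(default_factory=dict)  # natural key tuple -> resource id

    def insert(self, natural_key: tuple, body: dict) -> str:
        # The natural key acts as validation: duplicates are rejected.
        if natural_key in self.id_by_key:
            raise ValueError(f"duplicate natural key: {natural_key}")
        resource_id = uuid.uuid4().hex
        self.by_id[resource_id] = {"key": natural_key, **body}
        self.id_by_key[natural_key] = resource_id
        return resource_id

    def update(self, resource_id: str, changes: dict) -> None:
        # Non-key fields are changed through the surrogate id.
        self.by_id[resource_id].update(changes)

    def delete(self, resource_id: str) -> None:
        record = self.by_id.pop(resource_id)
        del self.id_by_key[record["key"]]


store = ToyResourceStore()

# Insert with a natural key (student, school, begin date), then update by id.
old_key = ("student-1", "school-9", "2020-08-17")
rid = store.insert(old_key, {"entry_grade": "Ninth grade"})
store.update(rid, {"entry_grade": "Tenth grade"})

# Key volatility: the begin date is part of the key, so changing it means
# deleting the record and creating it anew, which yields a new resource id.
body = {k: v for k, v in store.by_id[rid].items() if k != "key"}
store.delete(rid)
new_rid = store.insert(("student-1", "school-9", "2020-08-24"), body)
```

In this sketch, anything that stored the old resource id or the old key as a reference would also have to be rewritten, which is the downstream cost described in the notes.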
Option 1: Do Nothing
Option 2: More Partial Key Surrogates
Option 3: Expand Partial Key Surrogates Throughout
Option 4: Use ResourceId for everything
Option 5: Use Weak References
There are two parts to this question: If we didn't have the history of Ed-Fi, what would be the best approach, knowing what we do now? And are the benefits of a new system worth the pain of shifting to this new approach? For MS, the momentum is behind surrogate keys, so this is the way they are going. For Ed-Fi, this would be a transition and would likely bring greater challenges.
Discussion points:
- Does this need to apply to Ed-Fi in the US or are we more vamping on a separate International Model? I thought the international model is intended to be more flexible so a surrogate key model would work well with that goal. And we can avoid the pain points of trying to change Ed-Fi in the US as well.
- For now, the idea is to keep these as a single model. So they would need to be merged at some point, perhaps slowly over time.
- Let's make a separate, smaller international model and experiment with it.
- Option 4 can be used to create a data lake. The REST id would enter the model and references would include the id. This is probably more aligned with a SIS. We would lose the ability to do unification, but there would be fewer barriers to entry, and some data may be better than no data.
- So have a "click here to turn it off" configuration to remove natural keys.
- The advantage of the lake is not just collection but also centralization, so there is a single place to marry data from multiple systems. Be aware of the data quality issues, but treat them as something that can be addressed later so that the bar to get in is lower.
- So you could "tag" the data to denote what is clean versus dirty.
- From an analytics side, this makes sense, both for cleaning data in the lake and for forming important linkages that did not come in with the other data. When thinking about it in an operational sense, though, a data lake does not seem like the right tool.
- The linkages are important here, though. A centralized place would allow, for example, an LMS to get roster data from the central location rather than going to the SIS directly.
- How is the community using the model operationally?
- There is the state use case, which is the primary operational example. To Gene's point, I have no issues with surrogate ids if the downstream use case is to create a data lake and data quality issues can be handled after data is loaded.
What if we went to a hybrid model with both keys and made the choice an implementation configuration? This seems easy enough to do with the current Ed-Fi API, but implementation would require some interesting choices.
- Ed-Fi today does not have a relaxed way to "write" data. Figuring out how to do a dual key system is making my brain explode.
- Would it help to move the conversation more to the logical level with the API requirements?
- To be fair, you can create secondary validation on a data lake with a surrogate key system that will recreate all the goodness of Ed-Fi today.
- Shared screen of Modern Data Platform Reference Architecture (https://docs.microsoft.com/en-us/azure/architecture/example-scenario/dataplate2e/data-platform-end-to-end#architecture)
- These types of data stores are becoming more mixed over time.
- I think of it like a tolerance for error. Low-tolerance examples are things like enrolling a student in a class or assigning a student a grade. A high-tolerance example is determining whether a Title I program is successful. Each of these examples has a different tolerance for error, and those tolerances are important in a lot of education use cases.
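The ideas of tagging lake data as clean versus dirty and of running secondary validation over a surrogate key system could look roughly like the following Python sketch. Everything here is an assumption made for illustration: the function `tag_records`, the field names `student_id` and `section_id`, and the sample rows are hypothetical and do not come from the Ed-Fi model. Rows are accepted into the lake without key validation, and a later pass checks whether each surrogate reference resolves.

```python
def tag_records(enrollments, students, sections):
    """Return copies of enrollment rows tagged 'clean' or 'dirty'.

    A row is 'dirty' when one of its surrogate references does not
    resolve to a loaded record (hypothetical validation rule).
    """
    student_ids = {s["id"] for s in students}
    section_ids = {s["id"] for s in sections}
    tagged = []
    for row in enrollments:
        problems = []
        if row.get("student_id") not in student_ids:
            problems.append("unknown student_id")
        if row.get("section_id") not in section_ids:
            problems.append("unknown section_id")
        quality = "dirty" if problems else "clean"
        tagged.append({**row, "quality": quality, "problems": problems})
    return tagged


# Sample data loaded from different source systems (illustrative values).
students = [{"id": "s-1"}, {"id": "s-2"}]
sections = [{"id": "sec-10"}]
enrollments = [
    {"id": "e-1", "student_id": "s-1", "section_id": "sec-10"},
    {"id": "e-2", "student_id": "s-3", "section_id": "sec-10"},  # student never loaded
    {"id": "e-3", "student_id": "s-2", "section_id": "sec-99"},  # section never loaded
]

tagged = tag_records(enrollments, students, sections)

# A low-tolerance use case (such as rostering) might read only clean rows,
# while a high-tolerance analysis might read everything.
clean_rows = [row for row in tagged if row["quality"] == "clean"]
```

The design choice in this sketch is to keep dirty rows rather than reject them at load time, so the bar to get data in stays low and each consumer can apply its own tolerance for error.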