- Created by Stephen Fuqua, last modified on May 23, 2022
Introduction
In hindsight, DynamoDB was a poor choice of data store for the first release of Meadowlark for two primary reasons:
- Except for a little-known open source implementation, it is entirely restricted to Amazon Web Services.
- The design model is interesting, but idiosyncratic.
MongoDB would have been a better starting point:
- It is supported, directly and/or through emulation, on all major cloud platforms and on-premises.
- It is a mature product, with strong documentation and design patterns.
- The scalability features, such as replication and sharding, are very attractive for large implementations.
There are other NoSQL databases with similar benefits and other attractive features, such as Couchbase. However, support for them is less widespread, so they will not be investigated at this time.
Although it is one of the traditional relational databases, PostgreSQL has powerful built-in support for NoSQL operations. Because of the Ed-Fi community's growing adoption of PostgreSQL, it will be explored as an alternative to MongoDB. See Meadowlark 0.2.0 - PostgreSQL.
Also see: Meadowlark 0.2.0 - Durable Change Data Capture for more information on streaming data out to OpenSearch.
Design
This proposal takes its cue from the team's experience with DynamoDB. The basic principle remains the same: the API document is stored along with the metadata used for existence and reference validation. However, instead of storing the metadata in columns, it will be part of a single larger document. Fast document lookups continue to be done by id, constructed as before from the API document's project name, entity type, version and body. Transactions will again be used to check for existence and references before performing create, update and delete operations. Reference validation for deletes is greatly simplified compared with the DynamoDB version by taking advantage of MongoDB's indexing features, in particular the indexing of arrays.
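As a rough illustration of the id construction described above, the sketch below hashes the project name, resource name, version and identity elements into one string. The function name, the delimiter layout and the choice of SHA-224 are assumptions made for this example; the design only says the id is a hash of those parts. A 224-bit digest does match the 56-character hex ids shown in the examples on this page.

```typescript
import { createHash } from "node:crypto";

// Hypothetical types for the parts of an API document that feed the id.
interface DocumentIdentityElement {
  name: string;
  value: string | number;
}

interface ResourceInfo {
  projectName: string;
  resourceName: string;
  resourceVersion: string;
}

// Assumption: SHA-224 over a delimited string. The actual hash function and
// input layout are not specified in this design.
function documentIdFor(resource: ResourceInfo, identity: DocumentIdentityElement[]): string {
  // Sort identity elements by name so the id does not depend on element order.
  const identityString = [...identity]
    .sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0))
    .map((element) => `${element.name}=${element.value}`)
    .join("#");
  const input = [
    resource.projectName,
    resource.resourceName,
    resource.resourceVersion,
    identityString,
  ].join("|");
  return createHash("sha224").update(input).digest("hex");
}

const exampleSchoolId = documentIdFor(
  { projectName: "Ed-Fi", resourceName: "School", resourceVersion: "3.3.1-b" },
  [{ name: "schoolId", value: 123 }],
);
console.log(exampleSchoolId.length); // 56 hex characters for a 224-bit digest
```

Because the same inputs always produce the same id, a client-supplied document can be looked up by id without a secondary query on its natural key.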
To support potential deployment to Amazon DocumentDB or Azure CosmosDB, all code and design should match the MongoDB 4.0 API.
Entity Collection
The MongoDB implementation will only need one collection, to be called Entity. The shape of the Entity document (all fields required):
id
- A string hash derived from the project name, resource name, resource version and identity of the API document. This field will be a unique index on the collection.
documentIdentity
- The identity elements extracted from the API document.
projectName
- The MetaEd project name the API document resource is defined in e.g. "EdFi" for a data standard entity.
resourceName
- The name of the resource. Typically, this is the same as the corresponding MetaEd entity name. However, there are exceptions, for example descriptors have a "Descriptor" suffix on their resource name.
resourceVersion
- The resource version as a string. This is the same as the MetaEd project version the entity is defined in e.g. "3.3.1-b" for a 3.3b data standard entity.
isDescriptor
- Boolean indicator.
edfiDoc
- The Ed-Fi ODS/API document itself.
outRefs
- An array of ids extracted from the ODS/API document for all externally referenced documents.
validated
- Boolean indicator.
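The field list above could be expressed as a TypeScript interface along the following lines. This is a sketch only: the interface names, the `string | number` type for identity values (inferred from the examples on this page) and the `missingFields` helper are assumptions introduced for illustration.

```typescript
// One element of the identity extracted from the API document.
interface DocumentIdentityElement {
  name: string;
  // Assumption: the examples on this page show both string and numeric values.
  value: string | number;
}

// Sketch of the Entity document shape. All fields are required.
interface EntityDocument {
  id: string; // hash of project name, resource name, version and identity
  documentIdentity: DocumentIdentityElement[];
  projectName: string;
  resourceName: string;
  resourceVersion: string;
  isDescriptor: boolean;
  edfiDoc: Record<string, unknown>; // the Ed-Fi ODS/API document itself
  outRefs: string[]; // ids of all externally referenced documents
  validated: boolean;
}

const REQUIRED_FIELDS: (keyof EntityDocument)[] = [
  "id",
  "documentIdentity",
  "projectName",
  "resourceName",
  "resourceVersion",
  "isDescriptor",
  "edfiDoc",
  "outRefs",
  "validated",
];

// Hypothetical helper: lists the required fields a candidate object lacks.
function missingFields(candidate: Partial<EntityDocument>): string[] {
  return REQUIRED_FIELDS.filter((field) => candidate[field] === undefined);
}
```

A check such as `missingFields` is one way to enforce the "all fields required" rule before a write, since MongoDB itself does not require a fixed document shape.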
Examples
{ "_id" : ObjectId("6287c039abf2ff4430376b3d"), "documentIdentity" : [ { "name" : "descriptor", "value" : "uri://ed-fi.org/AbsenceEventCategoryDescriptor#Bereavement" } ], "projectName" : "Ed-Fi", "resourceName" : "AbsenceEventCategoryDescriptor", "resourceVersion" : "3.3.1-b", "isDescriptor" : true, "id" : "546c96c905374bed9287409ba1ca77d28fdcd08c9d3ea3e9085d8a10", "edfiDoc" : { "codeValue" : "Bereavement", "shortDescription" : "Bereavement", "description" : "Bereavement", "namespace" : "uri://ed-fi.org/AbsenceEventCategoryDescriptor" }, "outRefs" : [], "validated" : true }
{ "_id" : ObjectId("6287e93cabf2ff4430384af2"), "documentIdentity" : [ { "name" : "schoolId", "value" : 123 } ], "projectName" : "Ed-Fi", "resourceName" : "School", "resourceVersion" : "3.3.1-b", "isDescriptor" : false, "id" : "8d111d14579c51e8aff915e7746cda7e0730ed74837af960b31c4fa6", "edfiDoc" : { "schoolId" : 123, "gradeLevels" : [], "nameOfInstitution" : "abc", "educationOrganizationCategories" : [] }, "outRefs" : [], "validated" : true }
In the following example, note that the outRefs array has the ID of the school from the example above.
{ "_id" : ObjectId("6287e993abf2ff4430384bfd"), "documentIdentity" : [ { "name" : "schoolReference.schoolId", "value" : 123 }, { "name" : "weekIdentifier", "value" : "1st" } ], "projectName" : "Ed-Fi", "resourceName" : "AcademicWeek", "resourceVersion" : "3.3.1-b", "isDescriptor" : false, "id" : "20325050be22032deaeaddeb6a82cc160ce85911c9ad5ca8de5482e2", "edfiDoc" : { "schoolReference" : { "schoolId" : 123 }, "weekIdentifier" : "1st", "beginDate" : "2022-12-01", "endDate" : "2022-12-31", "totalInstructionalDays" : 30 }, "outRefs" : [ "8d111d14579c51e8aff915e7746cda7e0730ed74837af960b31c4fa6" ], "validated" : true }
Separate collections would serve better than a single collection for querying inside an entity or for GET ALL by type in MongoDB. However, the plan is still to have OpenSearch or Elasticsearch in the picture for those functions. Therefore a single "table" (collection) design is appropriate, and it makes sharding easy.
Insert Transaction Steps
Inserting a new Entity document into the collection will follow these steps:
- Check that id does not exist (indexed query)
- Check that external reference ids for the document all exist (indexed query per reference)
- Perform upsert
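The insert steps above can be sketched with an in-memory `Map` standing in for the Entity collection. The function name, result strings and placeholder ids are hypothetical, and the `Map` replaces what would be indexed queries inside a MongoDB transaction in a real deployment.

```typescript
// Reduced Entity shape: only the fields the insert checks need.
interface EntityDocument {
  id: string;
  outRefs: string[];
  edfiDoc: Record<string, unknown>;
}

// Hypothetical result codes for this sketch.
type InsertResult = "INSERTED" | "ALREADY_EXISTS" | "MISSING_REFERENCE";

// The Map stands in for the Entity collection, keyed by id.
function insertEntity(collection: Map<string, EntityDocument>, doc: EntityDocument): InsertResult {
  // Step 1: the id must not already exist.
  if (collection.has(doc.id)) return "ALREADY_EXISTS";
  // Step 2: every externally referenced id must exist (one lookup per reference).
  if (!doc.outRefs.every((ref) => collection.has(ref))) return "MISSING_REFERENCE";
  // Step 3: write the document.
  collection.set(doc.id, doc);
  return "INSERTED";
}

// Placeholder ids are used instead of real hashes for readability.
const insertDemo = new Map<string, EntityDocument>();
const demoSchool: EntityDocument = { id: "school-123", outRefs: [], edfiDoc: { schoolId: 123 } };
const demoWeek: EntityDocument = {
  id: "week-1st",
  outRefs: ["school-123"],
  edfiDoc: { weekIdentifier: "1st" },
};
console.log(insertEntity(insertDemo, demoWeek)); // MISSING_REFERENCE
console.log(insertEntity(insertDemo, demoSchool)); // INSERTED
console.log(insertEntity(insertDemo, demoWeek)); // INSERTED
console.log(insertEntity(insertDemo, demoSchool)); // ALREADY_EXISTS
```

The three steps only give a consistent answer if nothing else writes between the checks and the write, which is why the design wraps them in a transaction.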
Update Transaction Steps
Updating an existing Entity document in the collection will follow these steps:
- Check that id exists (indexed query)
- Check that external reference ids for the document all exist (indexed query per reference)
- Perform overwrite
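A comparable sketch of the update steps follows, again with a `Map` standing in for the collection. The names and result strings are hypothetical. The difference from an insert is that the id must already exist, and the stored document is replaced as a whole.

```typescript
// Reduced Entity shape: only the fields the update checks need.
interface EntityDocument {
  id: string;
  outRefs: string[];
  edfiDoc: Record<string, unknown>;
}

// Hypothetical result codes for this sketch.
type UpdateResult = "UPDATED" | "NOT_FOUND" | "MISSING_REFERENCE";

function updateEntity(collection: Map<string, EntityDocument>, doc: EntityDocument): UpdateResult {
  // Step 1: the id must already exist.
  if (!collection.has(doc.id)) return "NOT_FOUND";
  // Step 2: the new version's references must all exist.
  if (!doc.outRefs.every((ref) => collection.has(ref))) return "MISSING_REFERENCE";
  // Step 3: overwrite the whole stored document, including its outRefs.
  collection.set(doc.id, doc);
  return "UPDATED";
}

// Placeholder ids are used instead of real hashes for readability.
const updateDemo = new Map<string, EntityDocument>([
  ["school-123", { id: "school-123", outRefs: [], edfiDoc: { nameOfInstitution: "abc" } }],
]);
console.log(
  updateEntity(updateDemo, { id: "school-123", outRefs: [], edfiDoc: { nameOfInstitution: "xyz" } }),
); // UPDATED
console.log(updateEntity(updateDemo, { id: "school-999", outRefs: [], edfiDoc: {} })); // NOT_FOUND
```

Overwriting the full document keeps `outRefs` in step with `edfiDoc`, so a later delete check never sees references left over from an earlier version.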
Delete Transaction Steps
Deleting an existing Entity document from the collection will follow these steps:
- Check that id exists (indexed query)
- Check that no other document lists this id in its outRefs array (indexed query)
- Perform delete
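The delete steps can be sketched the same way. The linear scan below stands in for a single MongoDB query on the `outRefs` array field, which MongoDB can serve from a multikey index when one exists on that field. The function name, result strings and placeholder ids are hypothetical.

```typescript
// Reduced Entity shape: only the fields the delete checks need.
interface EntityDocument {
  id: string;
  outRefs: string[];
}

// Hypothetical result codes for this sketch.
type DeleteResult = "DELETED" | "NOT_FOUND" | "STILL_REFERENCED";

function deleteEntity(collection: Map<string, EntityDocument>, id: string): DeleteResult {
  // Step 1: the id must exist.
  if (!collection.has(id)) return "NOT_FOUND";
  // Step 2: no other document may list this id in its outRefs.
  // A scan is used here; the design relies on an indexed array query instead.
  const isReferenced = Array.from(collection.values()).some((doc) => doc.outRefs.includes(id));
  if (isReferenced) return "STILL_REFERENCED";
  // Step 3: delete.
  collection.delete(id);
  return "DELETED";
}

// Placeholder ids are used instead of real hashes for readability.
const deleteDemo = new Map<string, EntityDocument>([
  ["school-123", { id: "school-123", outRefs: [] }],
  ["week-1st", { id: "week-1st", outRefs: ["school-123"] }],
]);
console.log(deleteEntity(deleteDemo, "school-123")); // STILL_REFERENCED
console.log(deleteEntity(deleteDemo, "week-1st")); // DELETED
console.log(deleteEntity(deleteDemo, "school-123")); // DELETED
```

Storing references on the referring document and querying them in reverse is what removes the need for separate bookkeeping of incoming references at delete time.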
Queries
Get all and get-by-key queries will continue to be serviced by OpenSearch. See Meadowlark 0.2.0 - Durable Change Data Capture for more information on how data will flow out to OpenSearch.
Future Considerations
Security
- Investigate adding security annotations based on indexable API document attributes
- Examples: ownership field, extracted education organization field
- Investigate using CASL.js for attribute-based authorization
- https://casl.js.org/v5/en
- Slide deck intro: CASL presentation by author
Improve version migration support
Consider ways we might want to change the id design to make migrating to newer DS versions easier. In the current design, the id includes project name, entity type, version, and natural key.
Let's say a new DS version comes out and a Meadowlark implementation wants to upgrade documents to the newer DS version. Assume School is unchanged between the two DS versions. From the API client's perspective, it would be very nice if the School resource ids didn't change. However, in the current design they would have to, because the version is part of the id hash.
This may lead to changes in how DS versions are incorporated into resource URLs, and/or to keeping each version in its own MongoDB collection so that the id is unique within a collection.
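To make the migration concern concrete, the sketch below compares an id that includes the version with a hypothetical alternative that leaves it out. The hash function (SHA-224), the input layout and the "4.0.0" version string are all assumptions made for illustration, not part of the design.

```typescript
import { createHash } from "node:crypto";

interface IdParts {
  projectName: string;
  resourceName: string;
  resourceVersion: string;
  identity: string; // natural key, already flattened to a string
}

// Assumption: SHA-224 over a delimited string, as a stand-in for the real hash.
function hashParts(parts: string[]): string {
  return createHash("sha224").update(parts.join("|")).digest("hex");
}

// Current design: the version takes part in the hash.
function idWithVersion(p: IdParts): string {
  return hashParts([p.projectName, p.resourceName, p.resourceVersion, p.identity]);
}

// Hypothetical alternative: the version is left out of the hash.
function idWithoutVersion(p: IdParts): string {
  return hashParts([p.projectName, p.resourceName, p.identity]);
}

const schoolOnCurrentDs: IdParts = {
  projectName: "Ed-Fi",
  resourceName: "School",
  resourceVersion: "3.3.1-b",
  identity: "schoolId=123",
};
// "4.0.0" is a made-up later version used only for this comparison.
const schoolOnNewerDs: IdParts = { ...schoolOnCurrentDs, resourceVersion: "4.0.0" };

const idChangesOnUpgrade = idWithVersion(schoolOnCurrentDs) !== idWithVersion(schoolOnNewerDs);
const idStableWithoutVersion =
  idWithoutVersion(schoolOnCurrentDs) === idWithoutVersion(schoolOnNewerDs);
console.log(idChangesOnUpgrade, idStableWithoutVersion);
```

Dropping the version from the hash keeps client-visible ids stable, but two versions of the same document would then share an id, which is why a collection per version is floated as a way to keep ids unique.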