Overview
Historically, the Ed-Fi ODS/API supported only Microsoft SQL Server; recently, PostgreSQL support was added. Both support the relational database modeling that has been popular for several decades now. That data model has proven effective and efficient in countless situations, including in the ODS/API. But there are limitations, some of which have been strongly felt in the ODS/API:
- Normalization of the Data Model results in a proliferation of tables that then must be re-joined in expensive read operations.
- Table schema evolution is slow, painstaking, and difficult to make backwards-compatible.
- Natural key as primary keys is an anti-pattern on a relational database, leading to larger disk space usage and more difficulty in writing accurate queries.
Is there another database engine or modeling paradigm that might be appropriate for a future architecture that supports the Ed-Fi Data Model?
Data Storage and Modeling Requirements for an Ed-Fi API
Key requirements for an Ed-Fi API that is modeled on the Unified Data Model:
- Must have full authentication and authorization support.
- Must be capable of authorizing partial access, directly in the engine or through query filtering.
- Must support queries by natural keys.
- Must have transport encryption.
- Must have audit logging.
- Should support queries on other fields.
- Should have paging.
- Should have horizontal scaling capability.
- Should have a reasonable query language.
- Nice to have optimistic concurrency support.
- Nice to have consistent reads (all queries at same time return same result).
- Nice to have schema flexibility for supporting changes to the Unified Data Model.
- Nice to have field selection.
Additional factors to consider for Project Meadowlark:
- Should use managed services to the extent available.
- Horizontal (preferably "elastic") database scalability to support large school districts and collaboratives.
- Using multiple storage technologies together is potentially viable and may open up additional features, such as relaxed resolution of references during initial write actions.
Out of scope for this analysis:
Moving beyond relational databases, Document and Graph storage most naturally fit the requirements as described above. A Key-value database could be combined with a document database, or it could be used with secondary indexes to turn it, effectively, into a Document store. Because the API will be serving JSON documents, the data modelling will be more natural on a Document store than Graph.
When working with AWS, the DynamoDB solution may be optimal:
- Managed service
- Either employ secondary indexes and/or combine with another data store for query support
- Elastic scalability
However, there is a downside to DynamoDB: it has (by design) limited indexing capabilities. This will make it difficult to support API queries, e.g. GET someResource?column1=value1&colum2=value2
. ElasticSearch or OpenSearch provides fantastic indexing capability. It can be paired with DynamoDB to provide the a full solution that supports all API-based CRUD capabilities.
On Azure, CosmosDB can provide either key-value storage (in Cassandra mode) or document store (in MongoDB mode). While the Cassandra API is different from DynamoDB, the data models would be very similar. If a solution favors using DynamoDB as a Key-value store in combination with another engine, then an Azure solution should use the Cassandra API. Alternately, the MongoDB API is attractive because MongoDB can be provisioned on-premises or through the MongoDB managed service.
Couchbase as a Document store looks objectively better than MongoDB, for reasons of both scalability and query-writing. While there are deployment options for Couchbase on AWS and Azure, they are not fully managed services. Therefore MongoDB looks more favorable at this time.
Recommendation:
- Explore DynamoDB + OpenSsearch
- In parallel, explore use of MongoDB (deferred to milestone 0.2.0)
DynamoDB + OpenSearch
Storage Design
Brief sketch, based on the single table design
Two types of records:
Streaming to OpenSearch
DynamoDB has native change data capture streaming. The change stream can trigger execution of a Lambda function. This function in turn can write data out to OpenSearch.
References