Meadowlark Data Storage Design
- Stephen Fuqua
Related Pages
These pages relate to the data storage for Meadowlark.
- Meadowlark - Cascading Updates
- Meadowlark - Document Shape
- Meadowlark - DynamoDB
- Meadowlark - Experiments in Pagination
- Meadowlark - File System Storage
- Meadowlark - Full Query Support
- Meadowlark - MongoDB
- Meadowlark - PostgreSQL
- Meadowlark - Referential Integrity in Document Databases
Choosing a Database Platform
Overview
Historically, the Ed-Fi ODS/API supported only Microsoft SQL Server; recently, PostgreSQL support was added. Both support the relational database modeling that has been popular for several decades now. That data model has proven effective and efficient in countless situations, including in the ODS/API. But there are limitations, some of which have been strongly felt in the ODS/API:
- Normalization of the Data Model results in a proliferation of tables that then must be re-joined in expensive read operations.
- Table schema evolution is slow, painstaking, and difficult to make backwards-compatible.
- Natural key as primary keys is an anti-pattern on a relational database, leading to larger disk space usage and more difficulty in writing accurate queries.
Is there another database engine or modeling paradigm that might be appropriate for a future architecture that supports the Ed-Fi Data Model?
Data Storage and Modeling Requirements for an Ed-Fi API
Key requirements for an Ed-Fi API that is modeled on the Unified Data Model:
- Must have full authentication and authorization support.
- Must be capable of authorizing partial access, directly in the engine or through query filtering.
- Must support queries by natural keys.
- Must have transport encryption.
- Must have audit logging.
- Must have ACID compliance.
- Must have a row-locking mechanism.
- Should support queries on other fields.
- Should have paging.
- Should have horizontal scaling capability.
- Should have a reasonable query language.
- Nice to have optimistic concurrency support.
- Nice to have consistent reads (all queries at same time return same result).
- Nice to have schema flexibility for supporting changes to the Unified Data Model.
- Nice to have field selection.
Additional factors to consider for Project Meadowlark:
- Should use managed services to the extent available.
- Horizontal (preferably "elastic") database scalability to support large school districts and collaboratives.
- Using multiple storage technologies together is potentially viable and may open up additional features, such as relaxed resolution of references during initial write actions.
Out of scope for this analysis:
- Analytics support.
Utilizing Multiple Databases
In traditional application development, including the Ed-Fi ODS/API Platform, all Create-Read-Update-Delete (CRUD) operations are served by a single database instance. Project Meadowlark has instead adopted the strategy of choosing database engines that are a good fit-to-purpose. "NoSQL" document databases are a good fit for online transaction processing (OLTP) because they enable storage of the raw API payloads (JSON) directly in the database. This improves both the write speed and the speed of retrieving a single object from the database, since there are no joins to perform between different tables.
A key difference between this document storage approach, compared to relational database modeling, comes in the form of searchability. Many key-value and document databases have the ability to add "secondary indexes" that can help find individual items by some global criteria. But these are limited and very different than the indexes found in a relational database, which can be tuned to identify items based on any column. In other words, when storing an entire document, most key-value and document databases fare poorly when trying to search by query terms (e.g. "get all students with last name Doe").
This is where OpenSearch shines. Based on ElasticSearch (but fully open source), OpenSearch is also a NoSQL document store. The key difference is that it indexes everything in the document, and has a powerful search engine across the indexes. OpenSearch is not designed to be a robust solution for high performance write operations, so it does not make sense to write directly to it.
Milestone 0.1.0 - DynamoDB
In the initial release, Meadowlark used AWS's DynamoDB for the primary OLTP data storage. DynamoDB has a native change data capture stream that can trigger a Lambda function; this function wrote the data out to OpenSearch.
More information:
Milestone 0.2.0 - MongoDB and PostgreSQL
The development team liked working with DynamoDB, but realized there are a couple of problems:
- Vendor lock-in with AWS means that we must support at least one other database.
- Insufficient ability to support reference integrity checks.
MongoDB is the obvious choice for an alternative document data storage, as it is robust, scalable, and well-supported across many hosting providers. PostgreSQL also has very strong support for document storage. In the longer, the project could conceivably focus only on PostgreSQL.
It is possible to perform searches directly in these platforms. However, in the 0.2.0 milestone, the application continues to use OpenSearch to power the search queries.
Temporarily, the records are being written to OpenSearch directly from the Meadowlark code instead of through change data capture.
More information:
Milestone ?? - Change Data Capture
In a future milestone, the development team plans to complete the work to use change data capture from PostgreSQL and MongoDB to send data out to OpenSearch.
More information:
Table of Contents