Introduction
Project Meadowlark is a proof-of-concept implementation of the Ed-Fi API Specification, currently supporting Data Standard 3.3b, built on managed services provided by AWS. This document describes the system architecture, including: managed service infrastructure and flow; frameworks used in programming the solution; and notes about potential future direction.
→ More information on Meadowlark
Cloud Managed Services
The big three cloud providers (Amazon, Google, Microsoft) all provide similar managed services that could have been used to build this application. The choice of Amazon Web Services (AWS) is not an endorsement of Amazon per se. Rather, the development team needed to commit to one service in order to remain focused on delivering a usable proof-of-concept without over-engineering up-front. Further development of Meadowlark into a product would require additional effort to ensure that the core software can easily be used on any cloud platform that provides similar managed service capabilities.
→ More information on cloud parity
Infrastructure
The following diagram illustrates the managed service infrastructure utilized by Meadowlark.
What does each of these services provide?
- API Gateway is a front-end web server that acts as a proxy to the separate Lambda functions. With the help of the API Gateway, client applications need to know only a single base URL, and the different resource endpoints can opaquely point back to different back-end services.
- Lambda Functions are small, purpose-built, serverless runtime hosts for application code. In the Meadowlark solution, there are ten different Lambda Functions that handle inbound requests from the API Gateway. For simplicity, only a single icon represents all ten in the diagram above.
- DynamoDB is a high-performance NoSQL database for key-value storage. One of the powerful features of DynamoDB is its Change Data Capture (CDC) Streaming: each change to an item stored in the database creates an event on a stream. Another Lambda function detects this event to provide post-processing.
- OpenSearch is a NoSQL database based on ElasticSearch, providing high-performance indexing and querying capabilities. All of the "GET by query" (aka "GET by example") client requests are served by this powerful search engine.
- CloudWatch provides advanced collection and monitoring capabilities for logs, including the detailed logging written into the Meadowlark Lambda functions.
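To make the request flow concrete, the following is a hypothetical sketch (not Meadowlark's actual code) of the shape of one of the Lambda handlers sitting behind the API Gateway. The type and field names are illustrative assumptions.

```typescript
// Minimal shape of an API Gateway proxy event and Lambda response.
// These interfaces are simplified stand-ins for the real AWS event types.
interface ApiGatewayEvent {
  httpMethod: string;
  path: string;
  body: string | null;
}

interface LambdaResponse {
  statusCode: number;
  headers: Record<string, string>;
  body: string;
}

// Sketch of an upsert handler: parse the JSON payload and acknowledge it.
// A real handler would validate the document and write it to DynamoDB.
async function upsertHandler(event: ApiGatewayEvent): Promise<LambdaResponse> {
  if (event.body == null) {
    return { statusCode: 400, headers: {}, body: JSON.stringify({ error: "Missing body" }) };
  }
  const document = JSON.parse(event.body);
  return {
    statusCode: 201,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ received: document }),
  };
}
```

Because the API Gateway routes each resource endpoint independently, each handler can stay small and purpose-built.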
Utilizing Multiple Databases
In traditional application development, including the Ed-Fi ODS/API Platform, all Create-Read-Update-Delete (CRUD) operations are serviced by a single database server. Project Meadowlark has opted to adopt the strategy of choosing a database that is fit-to-purpose. DynamoDB is incredibly fast and highly scalable for online transaction processing (OLTP), allowing the web API layer to respond to the client very quickly. As a key-value store, a DynamoDB table contains only a small number of columns, and the raw JSON payload received in POST and PUT requests are stored directly in a column. This improves both the write speed and the speed of retrieving a single object from the database.
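The key-value shape described above can be sketched as follows. The attribute names (`pk`, `sk`, `info`) and key format are illustrative assumptions, not Meadowlark's actual schema.

```typescript
// A DynamoDB item with only a few attributes: composite key plus the raw
// JSON payload from the POST/PUT body, stored as-is in a single column.
interface MeadowlarkItem {
  pk: string;   // partition key, e.g. entity type
  sk: string;   // sort key, e.g. natural-key identifier
  info: string; // the raw JSON document, unmodified
}

// Build the item to store; the payload is serialized once and never shredded
// into relational columns, which keeps writes and single-item reads fast.
function buildItem(entityType: string, naturalKey: string, payload: object): MeadowlarkItem {
  return {
    pk: `TYPE#${entityType}`,
    sk: `ID#${naturalKey}`,
    info: JSON.stringify(payload),
  };
}
```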
A key difference between this document storage approach and relational database modeling comes in the form of searchability. DynamoDB can add "secondary indexes" that help find individual items by some global criteria. But these are limited and very different from the indexes found in a relational database, which can be tuned to identify items based on any column. In other words, when storing an entire document, DynamoDB is a poor choice for trying to search by query terms (e.g. "get all students with last name Doe").
This is where OpenSearch shines. Based on ElasticSearch, OpenSearch is also a NoSQL document store. The key difference is that it indexes everything in the document and provides a powerful search engine across those indexes.
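For illustration, this is roughly the OpenSearch query-DSL body that would serve the "get all students with last name Doe" example. The `lastSurname` field name follows the Ed-Fi data model; the index name and exact query shape used by Meadowlark are assumptions here.

```typescript
// Build a simple OpenSearch/ElasticSearch "match" query for one field.
function buildMatchQuery(field: string, value: string): object {
  return {
    query: {
      match: { [field]: value },
    },
  };
}

// Sent as, e.g., POST <index>/_search with this object as the request body.
const doeQuery = buildMatchQuery("lastSurname", "Doe");
```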
And because these are managed services, there is no need to worry about backups.
Data Flow
OpenSearch is not designed to be a robust solution for high-performance write operations, so it does not make sense to write to it directly. DynamoDB's change data capture process fills the gap by triggering another Lambda function that copies the object data, as written to DynamoDB, into OpenSearch (or deletes that object).
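The core of that CDC step can be sketched as a mapping from DynamoDB Streams event names to OpenSearch operations. This is a simplified sketch, not Meadowlark's actual stream handler, which would also unmarshall the stream image and call the OpenSearch client.

```typescript
// The operation to perform against OpenSearch for a given stream record.
type StreamAction = "index" | "delete" | "ignore";

// DynamoDB Streams reports each change as INSERT, MODIFY, or REMOVE.
// Inserts and updates copy the new document image into OpenSearch;
// removals delete the corresponding document from the index.
function actionForEvent(eventName: string): StreamAction {
  switch (eventName) {
    case "INSERT":
    case "MODIFY":
      return "index";
    case "REMOVE":
      return "delete";
    default:
      return "ignore";
  }
}
```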
Eventual Consistency
DynamoDB stores multiple copies of the data for resiliency and high availability, and only one of these copies receives the initial write operation. The service guarantees that all other copies will eventually come up to date with that initial write operation: the data will eventually be consistent. The tradeoff is availability and speed at the cost of read consistency: read queries are never blocked by write operations, but they may briefly return stale data.
Many people find this disturbing at first if they are used to thinking about transaction locking in relational databases. But the reality is less alarming than it sounds.
Amazon states that it typically takes "one second or less" to bring all copies up to date. How often is a vendor system going to write to an Ed-Fi API, with a fixed requirement that another vendor be able to read that data less than one second later? Let's compare the outcomes of the following three scenarios:
Time | Scenario 1 | Scenario 2 | Scenario 3 |
---|---|---|---|
10:01:01.000 AM | Client A reads a record | Client B writes an update to that record | Client B writes an update to that record |
10:01:01.500 AM (half second) | Client B writes an update to that record | Client A reads a record | All DynamoDB copies are up-to-date |
10:01:02.000 AM (full second) | All DynamoDB copies are up-to-date | All DynamoDB copies are up-to-date | Client A reads a record |
Status | Client A has stale data | Client A might have stale data | Client A has the current data |
In Scenario 1, Client A receives stale data because it requested the record half a second before Client B wrote an update. This is no different from a relational database.
In Scenario 2, Client B writes an update half a second before Client A sends a read. Client A might coincidentally read from the first database node that received the update, or it might read from a node that is lagging by half a second. Thus it might get stale data, though this is not guaranteed.
Finally, in Scenario 3, Client A asks for the record a full second after Client B wrote an update, and Client A is nearly guaranteed to get the current (not stale) data. Again, this is the same as with a standard relational database.
The difference between the guaranteed consistency of a relational database and the eventual consistency of a distributed database like DynamoDB is thus more a matter of happenstance than anything. In either case, if Client A reads from the system a millisecond before Client B writes, then Client A will have stale data. If Client A reads shortly after Client B writes, then the window of time for getting stale data extends to perhaps a second. But if Client A does receive stale data, it has no way of knowing that it was not simply in Scenario 1.
Eventual consistency is probably "good enough." But it does deserve further community consideration before using it in a production system.
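It is worth noting that DynamoDB's GetItem operation accepts a ConsistentRead flag that forces a strongly consistent read on a per-request basis, at extra read-capacity cost. Whether Meadowlark should expose this is an open question; the sketch below shows the request parameter shape, with illustrative table and key names.

```typescript
// Build GetItem parameters; setting ConsistentRead to true asks DynamoDB to
// read from an up-to-date copy rather than a possibly lagging replica.
function buildGetItemParams(tableName: string, pk: string, sk: string, consistent: boolean) {
  return {
    TableName: tableName,
    Key: { pk, sk },
    ConsistentRead: consistent,
  };
}
```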
Data Duplication
For many people, this process of copying data into two storage locations (DynamoDB and OpenSearch) may seem very strange. We have always preferred to write "one copy" of the data, avoiding the costs of storing and maintaining duplicate data.
From the storage perspective, there is a false assumption here: when a relational database table has indexes, you are already storing duplicate copies of the data. With paired DynamoDB and OpenSearch, that hidden truth simply comes to the surface.
There is also an eventual consistency challenge here, one that is more significant than with DynamoDB by itself: there is a greater probability of an error in the CDC stream → Lambda function → OpenSearch write process than in the DynamoDB node synchronization process. This bears further scrutiny.