Introduction

PostgreSQL has extensive support for storing and querying JSON documents. In fact, it seems to fare very well compared to MongoDB for features and especially for performance. Each platform has its own advantages. For Meadowlark 0.2.0, the development team will implement CRUD operations using PostgreSQL in addition to MongoDB, thus enabling direct comparison of features and benefits, and demonstrating the flexibility inherent in the design of the Meadowlark code.

Also see: Meadowlark 0.2.0 - Durable Change Data Capture for more information on streaming data out to OpenSearch.

Design

Overview

The PostgreSQL schema would be set up in a similar non-relational design to the other NoSQL designs, but take advantage of PostgreSQL's document store features. The basic principal continues that the API document is stored along with metadata to be used for existence/reference validation. Metadata would continue to be stored alongside the API document in columns. Fast document lookups continue to be done by id, constructed as before from API document project name, entity type, version and natural key. Transactions will again be used to check for existence/references before performing create/update/delete operations.

In order to simplify a PostgreSQL deployment, this design is flexible on the requirement of OpenSearch for queries. This also means change data capture streaming becomes optional.

Instead of using OpenSearch, a "standalone deployment" will take advantage of PostgreSQLs JSON inverted-index support. Rather than split the entities into separate tables, an additional index on project_name/entity_type/entity_version will be required for query support. Once a deployment reaches the performance constraints of this design, all these indexes can be dropped and an Elastic/OpenSearch configuration introduced that will continue to use a single table design.

Documents Table

This implementation will use a single table named Entity.

Columns

Column Name	Data Type	Description
`_pk`	bigserial	Synthetic primary key, analogous to MongoDB's `_id`
`id`	varying	A string hash derived from the project name, resource name, resource version and identity of the API document. This field will be a unique index on the collection.
`document_identity`	varying	The identity elements extracted from the API document.
`project_name`	varying	The MetaEd project name the API document resource is defined in e.g. "EdFi" for a data standard entity.
`resource_name`	varying	The name of the resource. Typically, this is the same as the corresponding MetaEd entity name. However, there are exceptions, for example descriptors have a "Descriptor" suffix on their resource name.
`resource_version`	varying	The resource version as a string. This is the same as the MetaEd project version the entity is defined in e.g. "3.3.1-b" for a 3.3b data standard entity.
`is_descriptor`	Boolean	Indicator
`validated`	Boolean	Indicator
`edfi_doc`	JSONB	The Ed-Fi ODS/API document itself.
`createdBy`	varying	name/ID of the client who created the record, for authorization usage.

Indexes:

On edfi_doc as a GIN jsonb_path_ops index - for query support in standalone deployment
On project_name & entity_type & entity_version - for query support in standalone deployment
- Maybe separate b-tree index, maybe multi-column GIN with api_doc. See https://pganalyze.com/blog/gin-index

References Table

This implementation will also use a reference table for reference validation.

Columns

Column Name	Data Type	Description
`_pk`	bigserial	Synthetic primary key
reference_to	varying	The parent document's `id` (~ foreign key)
reference_from	varying	The child document's `id` (~ document' natural key)

Potential addition:

document_location

varying

JSONPath expression to the external reference in the parent document

Might be useful in API response metadata?

Indexes

Need to be able to look up references in both directions:

reference_to - e.g. when trying to delete a resource, determine if there are any external references to it
reference_from - e.g. when trying to delete a resource, also deletes its own references to a "parent" resource.

Data Processing

Insert Transaction Steps

Inserting a new Entity document into the table will follow these steps:

Check that id does not exist in Entity (indexed query)
Check that external reference ids for the document all exist in Entity (index query per reference)
Perform insert of document into Entity
Perform insert of external references into Reference
Note: PostgreSQL has upsert support, but we may need to know if the outcome was insert or update to return the correct API response.

Update Transaction Steps

Updating an existing Entity document into the table will follow these steps:

Check that id exists in Entity (indexed query)
Check that external reference ids for the document all exist in Entity (index query per reference)
Perform update into Entity
Perform replacement of prior external references in Reference (delete all old + insert)
Note: PostgreSQL has upsert support, but we may need to know if the outcome was insert or update to return the correct API response.

Delete Transaction Steps

Deleting an existing Entity document from the table will follow these steps:

Check that id exists in Entity (indexed query)
Check that there are no external references in Reference for this id (indexed query)
Perform delete

Queries

A PostgreSQL installation will operate in two modes. In standalone mode, get all and get-by-key queries will be done directly on PostgreSQL by project_name/entity_type/entity_version plus the GIN-indexed api_doc. In "normal" mode, get all and get-by-key queries will be serviced by OpenSearch/Elasticsearch via CDC streaming.

Open Issues

Need a partioning / sharding paradigm for large databases. See https://www.percona.com/blog/2019/05/24/an-overview-of-sharding-in-postgresql-and-how-it-relates-to-mongodbs/

This document is for discussion and general guidance. The implementation may vary as needed. The development team will endeavor to keep this document up-to-date, though working software remains a higher priority than comprehensive documentation.

Table of Contents

Meadowlark 0.2.0 - PostgreSQL

Introduction

Design

Overview

Documents Table

Columns

Indexes:

References Table

Columns

Indexes

Data Processing

Insert Transaction Steps

Update Transaction Steps

Delete Transaction Steps

Queries

Open Issues