Converting to PostgreSQL: Analysis

Introduction

This document provides a rough analysis of what will be needed if the Ed-Fi ODS/API were to be converted over to run on another relational database management system (RDBMS). PostgreSQL was used as an anchor; however, many of the findings will be generally applicable to other RDBMS's.

Summary

Using PostgreSQL or another non-SQL Server database provider could bring significant cost savings and operational scalability advantages. Moving the current ODS / API code base to support this is technically possible, but would require significant effort:

  1. Overhaul of the way the database objects are generated in the first place.
    1. The current SQL statements do not use many T-SQL-specific feature, however
    2. They do use many T-SQL-specific bits of syntax, data type names, etc. that require changes instead of lift-and-shift.
    3. MetaEd code generation would need to be able to target the other RDBMS.
  2. NHibernate and Entity Framework (to lesser extent) are used for most database interaction.
    1. They should be able to transparently access PostgreSQL (did not evaluate other RDBMS's).
    2. There is at least one looming issue that would require code changes - use of table-type parameters.
  3. The C# code has hard-coded SQL in various places - especially unit tests.
    1. All hard-coded SQL should be completely rewritten using an ORM.

In the existing code base, trying to support another RDBMS is a non-trivial exercise that could take many months of development and testing effort. However, this effort, carefully planned, could have significant long-term benefits that help lead to zero-downtime deployment and cost optimizations (e.g. through use of serverless cloud functionality).

One further key question: although new features are now targeting Technical Suite Three, we will continue to make changes and maintain Technical Suite Two for the foreseeable future. Changing the development practices to support a set of RDMBS options in only Technical Suite Three sounds initially appealing, but could make support of Suite Two confusing and frustrating. On the other hand, only targeting Suite Three might provide extra incentive for education agencies to upgrade to Suite Three. Alternatives: only target Suite Four. Or perhaps Suite Four is a completely new platform, but supports both data standards?

PostgreSQL Conversion

An entire 2.5 ODS database was scripted out and paintstakingly converted over so that it would execute successfully as pg-sql, thus building out the full ODS schema in a postgres database.

  1. Eliminate boilerplate at the top of the script - things related to SQL Cmd mode, connection properties (e.g. ansi_padding), etc.
  2. Replace all [ ] qualifiers with nothing. Postgres uses quotation marks as qualifiers as needed. However, when you use them to create a object, then you are stuck permanently using them. Best to create without the quotation marks.
  3. Create a dbo schema, belonging to the postgres user.
  4. Alter other schemas so that they belong to the postgres user instead of dbo.
  5. Fix data types

    T-SQL TypePG-SQL TypeNotes
    NVARCHARVARCHARpostgres implicitly converts handles all text as Unicode.
    DATETIMETIMESTAMPNot using TIMESTAMPTZ (with timezone) because the source data has no record of the time zone to begin with.
    DATETIME2TIMESTAMP
    UNIQUEIDENTIFIERCHAR(40)
    BITSMALLINT
    BOOL

    While postgres has a bit data type, it does not implicitly convert integer to bit. Instead it requires use a string '0' or '1' (or 'true', etc.). For maximum compatibility with existing code that uses the integer 0/1 instead of string '0'/'1', we should simply switch to smallint.

    Both Entity Framework and NHibernate work just fine with bool, and in fact they would require customization to convert back and forth from smallint.

  6. Postgres's cluster concept is different and more analysis is needed to know if it is needed (explicitly) for performance benefits. Remove CLUSTERED and NONCLUSTERED keywords from all indexes and primary key constraints. Also remove PAD_INDEX from constraints. Safe to leave in FILLFACTOR.
  7. Remove WITH CHECK and NOCHECK from constraints.
  8. The pg-equivalent of INT IDENTITY is SERIAL. Reg ex: /INT\s+IDENTITY\s*\(1, 1\)/ replace with SERIAL
  9. Remove all PRINT statements and extended properties (table/column descriptions). Extended properties can be rewritten as COMMENTs but the effort was not worthwhile in this POC.
  10.  Views
    1. Remove WITH SCHEMABINDING
    2. Replace COUNT_BIG(*) with COUNT(*), as postgres's COUNT function already returns a big integer.
    3. Remove all indexes on views.

      In a real conversion, some views might need to materialized views for performance optimization.

    4. Ensure that all view definitions end in a semi-colon.
    5. In auth.LocalEducationAgencyIdToStaffUSI, cast null values as integer.
    6. UNPIVOT must be rewritten with UNION ALL statements.

       Click here to expand...
      T-SQLPG-SQL
      CREATE VIEW [auth].[EducationOrganizationIdToStateAgencyId]
      AS
          SELECT    DISTINCT StateEducationAgencyId, EdOrgId as EducationOrganizationId
          FROM
                  (SELECT    * FROM auth.EducationOrganizationIdentifiers) As Source
          UNPIVOT (EdOrgId FOR IdType IN (EducationServiceCenterId, LocalEducationAgencyId, SchoolId)) -- Only ESCs, LEAs and Schools are accessible to State-level claims
                  As UnpivotedSource
          UNION    
          -- State-level claims also can access the State
          SELECT    StateEducationAgencyId, StateEducationAgencyId AS EducationOrganizationId
          FROM    edfi.StateEducationAgency
      GO
      CREATE VIEW auth.EducationOrganizationIdToStateAgencyId
      AS
      
      
          SELECT 
              StateEducationAgencyId,
              EducationServiceCenterId as EducationOrganizationId
          FROM
              auth.EducationOrganizationIdentifiers
          WHERE
              EducationServiceCenterId IS NOT NULL
          UNION
          SELECT 
              StateEducationAgencyId,
              LocalEducationAgencyId as EducationOrganizationId
          FROM
              auth.EducationOrganizationIdentifiers
          WHERE
              LocalEducationAgencyId IS NOT NULL
          UNION
          SELECT 
              StateEducationAgencyId,
              SchoolId as EducationOrganizationId
          FROM
              auth.EducationOrganizationIdentifiers
          WHERE
              SchoolId IS NOT NULL
          UNION   
          -- State-level claims also can access the State
          SELECT  StateEducationAgencyId, StateEducationAgencyId AS EducationOrganizationId
          FROM    edfi.StateEducationAgency;
    7. Views that use common table expressions (CTE) recursively need to be modified by adding the keyword RECURSIVE immediately after WITH. Affected views: StudentAssessmentStudentObjectiveAssessmentH, ObjectiveAssessmentH_CTE, LocalEducationAgencyH, LearningStandardH, LearninbjectiveH, AssessmentFamilyH.
    8. Default values in postgres are not constraints, so the syntax needs to be altered.

      1. Substitute now() for getdate().

        FromTo
        /ALTER TABLE ([^\s]+)[\s\n]+(?:ADD CONSTRAINT (?:[^\s]+)) DEFAULT \(getdate\(\)\) FOR (\w+)/ALTER TABLE $1 ALTER COLUMN $2 SET DEFAULT now()
      2. Replace newid() with uuid_generate_v4(). This requires first configuring uuid support by running CREATE EXTENSION "uuid-ossp".

        FromTo
        /ALTER TABLE ([^\s]+)[\s\n]+(?:ADD CONSTRAINT (?:[^\s]+)) DEFAULT \(newid\(\)\) FOR (\w+)/ALTER TABLE $1 ALTER COLUMN $2 SET DEFAULT uuid_generate_v4()
    9. The maximum identifier length is 63. Longer identifiers are automatically truncated. This results in many foreign key names overlapping. To avoid this, use reasonable abbreviations to shorten these seven problematic beginnings to foreign keys:

      FromTo
      fk_educationorganizationnetworkassociation_educationorganizationfk_edorgnetworkassociation_edorg

      fk_learningstandardprerequisitelearningstandard_learningstandard

      fk_lrnstdprerequisitelrnstd_lrnstd

      fk_educationcontentderivativesourceeducationcontent_educationcontent

      fk_edconderivativesourceedcon_edcon

      fk_studentassessmentstudentobjectiveassessmentperformancelevel

      fk_studassessmentstudobjassessmentperflvl

      fk_postsecondaryeventpostsecondaryinstitutionidentificationcode

      fk_pstscdryeventpstscdryinstidcode

      fk_postsecondaryeventpostsecondaryinstitutionmediumofinstruction

      fk_pstscdryeventpstscdryinstmediumofnstrn

      fk_educationorganizationidentificationcode_educationorganization

      fk_edorgidcode_edorg

Other constructs used in Admin, Bulk, and Security databases

Functions

The util.Split function in the 3.x line can be replaced by a native function, or rewritten (for interface compatibility) to wrap the native function regexp_split_to_table.

Stored Procedures

Stored procedures must be rewritten as functions. Thankfully there is only one: EdFi_Admin database has dbo.AccessTokenIsValid.

Synonyms

Synonyms don't exist in postgres.  There are a number of synonyms in the 3.x database. Example:

CREATE SYNONYM [auth].[LocalEducationAgencyIdToStaffUniqueId] FOR [auth].[LocalEducationAgencyIdToStaff]

Seems simplest to recreate the synonym as a simple view:

CREATE VIEW auth.LocalEducationAgencyIdToStaffUniqueId AS SELECT * FROM auth.LocalEducationAgencyIdToStaff;

Types

The 3.x line has a couple of types in it. These are table types, which postgres doesn't support. These two types UniqueIdentifierTable and IntTable are apparently used for NHibernate table-valued parameters. There are array types which might be a substitute but this could be a real problem.

Change Queries

I was able to evaluate some of the code for the Change Queries feature coming in version 3.1. Thankfully it is possible to convert this code. There are two components to the Change Queries work: a sequence table and triggers to set the ChangeVersion column on a table after update, utilizing the sequence table. The syntax for the s equence itself is the same in both. The trigger structure is a little different in postgres, and it in fact requires creating a function and then calling that function from the trigger.

FromTo
CREATE TRIGGER
edfi.edfi_academichonorcaterytype_TR_UpdateChangeVersion 
ON
edfi.academichonorcaterytype AFTER UPDATE AS
BEGIN
    SET NOCOUNT ON
   
    UPDATE edfi.academichonorcaterytype
	SET changeversion = (NEXT VALUE FOR
changes.ChangeVersionSequence)
    FROM edfi.academichonorcaterytype
    WHERE EXISTS (SELECT 1 from inserted i
where i.id = u.id)

END
GO
CREATE OR REPLACE FUNCTION edfi.edfi_academichonorcaterytype_tr_updatechangeversion()
 RETURNS trigger
 LANGUAGE plpgsql
AS $function$
BEGIN
    UPDATE edfi.academichonorcaterytype
        set changeversion = nextval('changes.ChangeVersionSequence')
        where id = new.id;
        
    RETURN NEW;
END;
$function$

CREATE TRIGGER edfi_academichonorcaterytype_TR_UpdateChangeVersion 
AFTER UPDATE ON edfi.academichonorcaterytype
FOR EACH ROW
WHEN (pg_trigger_depth() = 0)
EXECUTE PROCEDURE edfi.edfi_academichonorcaterytype_TR_UpdateChangeVersion();

Solution Challenges

Potential .NET Application Problems

  • As noted above, COUNT() in postgres returns a big integer, not an integer. This could potentially mean needing to switch return types in C# from int to long - and it might be difficult to find where this is needed.
  • Exceptions coming back from the SQL connections will be of a different type, with different codes. Will need to look closely at the code base to see if this requires any changes.
  • Hard-coded SQL exists in some places in the applications, and in many places in the unit tests. These SQL statements will all have to be evaluated with the conversion rules in mind. Preferably they would be completely removed from the code base, instead relying only on an ORM. Complex queries should be replaced with views. This is just a good practice, as hard-coded SQL embedded in C# is hard to maintain even if using only a single RDBMS.
  • Bulk load might be a problem. The Npgsql library does support bulk copy, but I have not been able to find anything recent about NHibernate support. Old stack overflow discussions appear to pre-date the introduction of the bulk copy facility in Npgsql.
  • NHibernate inserts using table value parameters will need to be rewritten. A clever person has come up with something equivalent and performant, but I think it will require some real coding changes.

EdFi.Ods.CodeGen

The library EdFi.Ods.CodeGen is involved with creating tables from MetaEd entity definitions. One option might be to continue using SQL Server at build time, and then run a post-build process to convert the database to pg-sql.  If trying to modify the CodeGen library...

  • DatabaseSchemaProvider would need to be refactored to use a Npgsql and pg-sql where appropriate (perhaps an opportunity for injection of provider-specific implementations at runtime).
    • NpgsqlConnection instead of SqlConnection
    • Replace T-SQL specific schema queries.
  • IDatabaseTypeTranslator would need a PG-specific implementation

Install/Migration Scripts

Individual SQL scripts in the repositories often have some T-SQL-specific syntax in them. An example I saw while doing a code review of the new operational context work for 3.1:

In this example, lines 1, 2, 3, 4, 5, and 7 are all problematic in pg-sql. There is a clear need to either build parallel scripts for different databases, or create a translation process after the build, as mentioned in the Code Generation section.

MetaEd

The MetaEd build process creates SQL scripts for creating the objects required to support extensions. MetaEd uses HandleBar templates to generate SQL. These templates are currently written to create T-SQL compliant scripts, e.g. with square brackets, clustered indexes, and other features that won't work in pg-sql. Again, we face the possibility of making a significant code change in order to target multiple database platforms, or using an extract-and-translate process from SQL Server to Postgres after the build completes.

Object-Relational Mapping (ORM)

NHibernate

place holder

Entity Framework

place holder

Migrations

The EdFi_Admin database is managed through Entity Framework migrations. In general migrations seem to be supported with Npgsql.EntityFramework, and we should be able to modify EdFi.Ods.Admin.Models to target postgres. This has not been tested yet.

Work Required to Migrate

There is no simple solution for migrating to Postgres or another database. We can split this into two components: generation of the database schema, and runtime execution.

Generation

Raw SQL Scripts

SQL scripts to create the database in Database\Structure under both Ed-Fi-Ods and Ed-Fi-ODS-Implementation. Some options include:

  1. MetaEd
    1. Use MetaEd to define all structures.
    2. DDL: Extend it to output Postgres-compliant (or other) SQL as alternative to T-SQL.
      1. (warning) This needs to be done anyway for extension support.
    3. DML: Initial data sets (e.g. descriptors), either
      1. Ensure the insert statement scripts are generic ANSI-SQL, runnable in both SQL Server and Postgres, or
      2. Define initial data in JSON or another format, and create a utility to iterate over those files and insert into the database at startup.
  2. Manually maintain two different sets of DDL scripts, and adopt the unified DML approach mentioned above.
  3. Convert these setup scripts into a migration framework that generates DDL based on models.
    1. Entity Framework Code-First Migration, NHibernate Schema Update, FluentMigrator
      1. Already using EF code-first for Admin database
      2. NHibernate mechanism might be easy to use, but hard to tell - in a quick search I find references to it, but can't find official documentation. Maybe its just a simple object that you use against your existing NHibernate mappings?
      3. FluentMigrator can target many different databases
      4. Each has some merits and would deserve further investigation.
    2. Consider deployment implications
      1. Embed into separate app that can be run with admin credentials?
      2. Embed directly into the API and temporarily grant admin credentials?
      3. Use backup & restore to migrate from high privilege (staging) to low (production) privilege environment, instead of letting the code install at runtime?
      4. What mechanism(s) would help move toward zero-downtime deployments?

MetaEd

Modify the build process to target different databases.

EF Migrations

Supported via Npgsql.EntityFramework library.

Runtime Execution

NHibernate

Npgsql library can be used to access Postgres. Mostly a matter of changing the config file.

Entity Framework

Npgsql.EntityFramework library, mostly a matter of changes to the config file.

Bulk Load

Npgsql has a bulk copy facility, whose interface differences would force some code changes in the bulk load libraries.