...
A high-level assessment of a wide variety of open source solutions was undertaken that could potentially address all or some of the Ed-Fi Level 2 Data Validation System Requirements. A high-level analysis of the tools that were considered can be found here. However, the final approach taken was to take a deeper look at three different types of solutions from this list. The first one was using a database-centric testing framework. The second was to look at an open-source business rules engine . The second was to experiment with using a database-centric testing framework. The third was to look at what a pure-SQL implementation without external tool dependencies might look like.
Business rule engine solution
Our initial hypothesis when we started this evaluation was that one of the many open source business rules engines would be able to address a significant part of our requirements. During our analysis we settled in on two Java-based business rules engines, Drools (https://www.drools.org/) and Open Rules (https://openrules.com/). For our prototype we decided to try Open Rules, which is geared more towards business analysts instead of Java developers.
Methodology
These solutions require extensive Java infrastructure. Because of time limitations, we approached this prototype from the perspective of a business analyst, without proving out the behind-the-scenes plumbing. This means that we took the most complicated validation rule that we could come up with and modeled it into the spreadsheet template, with the assumption that if an organization was going to go down this route that the supporting infrastructure would work.
The rule that was developed in the spreadsheet was:
If a student is in a program of "special education" type, then they must have an ILP within 30 days of the lesser of the first day of class or the first day they were enrolled at that school.
The resulting spreadsheet can be seen here.
Results
Business rule engines are well suited for complex logic flows and decision trees. However, most of the validation rules that were gathered as examples do not necessarily have complex linear decisions, it is more of a matter of finding specific examples in a complex data set. This means that expressing these rules in the language that the rules engine is looking for might require a level of abstraction and complexity than it would be to do something closer to the data.
Additionally, for a business rules engine to work, Java infrastructure is required, including what would be custom-build data connectors that looked at the database objects and adapted to the API.
Conclusion
If an organization is already using a business rules infrastructure, it could be adapted to address the validation engine requirements. This would be especially well suited for complex validations that had multiple decision points. However, for an organization that is just looking to solve for the requirements identified, the overhead and complexity of this type of solution would be hard to justify.
Testing framework solution
...
- Download, install, and configure tSQLt to run on the Ed-Fi database as described in their getting started page.
Write a test in T-SQL.
Code Block language sql EXEC tSQLt.NewTestClass 'testStudentAddress'; GO CREATE PROCEDURE testStudentAddress.[test that every student has a home or physical address] AS BEGIN DECLARE @NoAddressCount INT; DECLARE @NoAddressThreshold INT; SET @NoAddressThreshold = 0; SELECT @NoAddressCount = count(distinct c.studentUSI) from edfi.student c JOIN edfi.StudentSchoolAssociation on c.studentUSI = StudentSchoolAssociation.StudentUSI LEFT JOIN (SELECT studentUSI, b.CodeValue FROM edfi.studentAddress a JOIN edfi.AddressType b on a.AddressTypeId = b.addressTypeId and b.CodeValue in ( 'Physical', 'Home') ) d on c.studentUSI = d.studentUSI; EXEC tSQLt.AssertEquals @NoAddressCount, @NoAddressThreshold; END; GO
Run the test
Code Block language sql EXEC tSQLt.Run 'testStudentAddress';
Review the results
Code Block language sql 1 row affected) [testStudentAddress].[test that every student has a home or physical address] failed: (Failure) Expected: <21042> but was: <0> +----------------------+ |Test Execution Summary| +----------------------+ |No|Test Case Name |Dur(ms)|Result | +--+-----------------------------------------------------------------------------+-------+-------+ |1 |[testStudentAddress].[test that every student has a home or physical address]| 36|Failure| ----------------------------------------------------------------------------- Msg 50000, Level 16, State 10, Line 27 Test Case Summary: 1 test case(s) executed, 0 succeeded, 1 failed, 0 errored. -----------------------------------------------------------------------------
Results
This proof-of-concept showed that a testing infrastructure could be used for data validation tests, but that there are several reasons why it is not necessarily the best fit.
The primary issue is that testing infrastructures are designed around the concept of a test failing or passing. This is useful when checking new code that has been developed, but in the case of data validation is not sufficient because, per the defined requirements, the infrastructure needs to be identifying the specific instances of data validation errors. Specifically in our example, the test identified that there was at least one instance of a student that did not have an address but did not tell us what student(s) were missing the address. This would not be sufficient.
The other issue is that the testing infrastructures have their own method of outputting validation results. In the case of tSQLt, this output is results returned in a query tool. This does not address the requirement for the results to be sent to the API for downstream consumption by a SIS.
Conclusion
Testing infrastructures do provide a tool set that would allow for a standardized methodology for storing and defining tests. Potentially this could be leveraged for data validation tests, but the toolset would have to be heavily customized so that it would output specific instances of non-compliant data and so that it would output results to the API. If an organization was already heavily invested in a testing infrastructure this work could be justified, otherwise it might be a hard sell because the amount of effort to implement the customizations is probably less then just doing something from the ground up (per the previous example).
Pure SQL Ed-Fi Data Validation Proof-Of-Concept
Overview
This solution shows a simplified version of a data validation engine built using only dynamic SQL.
The proof of concept includes:
- A prototype of the table structure for the validation API
- A prototype of a validation rule table that organizes validation rules
- A T-SQL stored procedure that runs the validation rules and populates the results into the API tables
- Seed data for two validation rules
- Screenshots of the results when run against the Glendale sample Ed-Fi data set
- All SQL used to recreate the prototype
Data Model
The data model includes three tables:
Table | Description |
---|---|
ValidationRun | This table shows the status of runs. It will have one row for each time the validation engine is started. This should evolve into a community standard. |
ValidationResults | This table has all of the validation results. This is the data that would be consumed by the SIS via the API. This table has a foreign key that ties it back to the validation run. This table should also evolve into a community standard. |
ValidationRules | This is a table that has the SQL that produces validation results. This is what drives the T-SQL validation engine, and should serve as a starting point and example. This would NOT evolve into a community standard - different organizations could implement this functionality in their preferred technology. |
Processing Logic
The prototype includes a stored procedure, RunValidationRules, that will iterate through all of the rules in the ValidationRule table for a given ValidationRuleSet and populate the results into the API tables. When called with a ValidationRuleSet it does:
- Inserts a record in the ValidationRun table with a ValidationRunStatus of 'Running' and a ValidationRunSource that is equal to the ValidationRuleSet that was called
- Iterates through each rule in ValidationRule with that RuleSet. For each rule:
- Builds a 'INSERT AS SELECT' SQL statement, substituting in variables from the ValidationRun table and the ValidationRules table
- Runs that SQL statement, potentially populating rows in the ValidationResults table
- Updates the row in ValidationRun to 'Finished'
Seed Data
The prototype implements two validation rules.
- The first one looks for students that do not have either a physical or home address. There are hundreds of these in the Glendale data, so the result set is limited to 10 for the purpose of demonstration.
- The second rule looks for students that have an Exit/Withdraw date but do not have a Exit/Withdraw reason. There are two of these in the Glendale data.
Results
Running the stored procedure generates the tracking row in the ValidationRun table and the detailed results in the ValidationResults, as seen below.
Detailed Steps to recreate the prototype
- Obtain the Glendale sample database from here and restore it to a local SQL Server database.
- Change to the local SQL Server database.
Create a new schema 'dv' for the data validation objects.
Code Block language sql create schema dv;
Create the Validation API tables and constraints.
Code Block language sql Create the rules table.
Code Block language sql Create the stored procedure.
Code Block language sql IF OBJECT_ID('dv.RunValidationRules', 'P') IS NOT NULL BEGIN DROP PROC dv.RunValidationRules; END go CREATE PROC dv.RunValidationRules @ValidationRuleSet varchar(60) AS DECLARE @ValidationRuleId varchar(max); DECLARE @ValidationSQL varchar(max); DECLARE @ValidationSeverity varchar(max); DECLARE @ValidationCategory varchar(max); Declare @RunId INT; -- build a cursor that gets all of the validations for a given rule set DECLARE db_cursor CURSOR for SELECT ValidationRuleId, ValidationSQL, ValidationSeverity, ValidationCategory FROM dv.ValidationRules WHERE ValidationRuleSet = @ValidationRuleSet; -- insert a row in the tracking table insert into dv.ValidationRun( validationRunStatus, ValidationRunSource) VALUES ('Running', @ValidationRuleSet); --capture the ID from that insert SELECT @RunId = SCOPE_IDENTITY() ; OPEN db_cursor; FETCH NEXT FROM db_cursor INTO @ValidationRuleId, @ValidationSQL, @ValidationSeverity, @ValidationCategory; WHILE @@FETCH_STATUS = 0 BEGIN -- sub in the following values set @ValidationSQL= REPLACE(@ValidationSQL,'~SEVERITY~',@ValidationSeverity); set @ValidationSQL= REPLACE(@ValidationSQL,'~VALIDATION_RULE_ID~',@ValidationRuleId); set @ValidationSQL= REPLACE(@ValidationSQL,'~VALIDATION_CATEGORY~',@ValidationCategory); set @ValidationSQL= REPLACE(@ValidationSQL,'~VALIDATION_RUN_ID~',@RunId); set @ValidationSQL = concat(' INSERT into dv.ValidationResult (ValidationRunId, ValidationRuleId, ValidationResourceId, ValidationResourceType, ValidationSeverity, EducationOrganizationId, Namespace, ValidationCategory, ValidationDetails) ', @ValidationSQL); print(@ValidationSQL); exec(@ValidationSQL); FETCH NEXT FROM db_cursor INTO @ValidationRuleId, @ValidationSQL, @ValidationSeverity, @ValidationCategory; -- Tell the tracking table we are done update dv.validationRun set ValidationRunStatus = 'Finished', ValidationRunFinishDateTime = current_timestamp where ValidationRunId = @RunId; END close db_cursor DEALLOCATE db_cursor go
Insert the sample rules into the rules table.
Code Block language sql INSERT INTO dv.ValidationRules (ValidationRuleId, ValidationSQL, ValidationRuleSet, ValidationSeverity, ValidationCategory, RuleDescription) VALUES ('1.1', 'SELECT top(10) ~VALIDATION_RUN_ID~ ValidationRunId, ''~VALIDATION_RULE_ID~'' ValidationRuleId, StudentUniqueId as ValidationResourceId, ''Student'' as ValidationResourceType, ''~SEVERITY~'' as ValidationSeverity, SchoolId as EducationOrganizationId, '''' Namespace, ''~VALIDATION_CATEGORY~'' ValidationCategory, '''' ValidationDetails FROM ( select DISTINCT c.StudentUniqueId , SchoolId from edfi.student c JOIN edfi.StudentSchoolAssociation on c.studentUSI = StudentSchoolAssociation.StudentUSI LEFT JOIN (SELECT studentUSI, b.CodeValue FROM edfi.studentAddress a JOIN edfi.AddressType b on a.AddressTypeId = b.addressTypeId and b.CodeValue in ( ''Physical'', ''Home'') ) d on c.studentUSI = d.studentUSI )e', 'nightly', 'Warning', 'Student', 'Raise a validation warning for every student that does not have a physical address specified. Limit result set to 10.'); INSERT INTO dv.ValidationRules (ValidationRuleId, ValidationSQL, ValidationRuleSet, ValidationSeverity, ValidationCategory, RuleDescription) VALUES ('1.2', 'SELECT top(10) ~VALIDATION_RUN_ID~ ValidationRunId, ''~VALIDATION_RULE_ID~'' ValidationRuleId, StudentUniqueId as ValidationResourceId, ''Student'' as ValidationResourceType, ''~SEVERITY~'' as ValidationSeverity, SchoolId as EducationOrganizationId, '''' Namespace, ''~VALIDATION_CATEGORY~'' ValidationCategory, ValidationDetails FROM ( SELECT distinct studentUniqueId, SchoolId, concat(''exit date '',ExitWithdrawDate) ValidationDetails from edfi.studentSchoolAssociation JOIN edfi.Student on student.StudentUSI = StudentSchoolAssociation.StudentUSI where ExitWithdrawDate is not null and ExitWithdrawTypeDescriptorId is null) tab', 'nightly', 'Major Validation Error', 'Student', 'Raise a major validation error for every StudentSchoolAssociation where there is an ExitWithdrawDate specified but not an ExitWithdrawDescriptor specified. Limit result set to 10.');
- Run the stored procedure and check out the results.
Code Block language sql exec dv.RunValidationRules 'Nightly'; Select * from dv.ValidationRun; Select * from dv.ValidationResult;
Proof-of-concept to-dos
The following list details some short-cuts and known issues that were identified during the prototype.
- There were several reference values that were hard-coded for the sake of time during the prototype. In the API data model the ValidationRunStatus was hard-coded into the ValidationRun Table. In future iterations, it should be represented with a reference table, i.e. - ValidationRunStatusType.
- The validation procedure should run under a user account that has read-only privileges on database objects in the Ed-Fi schema and insert privileges on the API tables. This would prevent possible accidental or malicious SQL in the rules table from corrupting Ed-Fi data.
- An external scheduling interface would be required for initiating the validation runs. Options include the Sql Server internal procedure scheduling infrastructure, the Microsoft task scheduler (with some sort of powershell or dos script), or even a third party tool like Quartz Scheduler.
Overall conclusions and recommendations
When this project was initiated, we had hoped that we would find an open source rules engine that could be configured to address the Ed-Fi community's data validation needs. However, what was found instead, is that once an abstracted infrastructure and proposed validation API structure had been identified, much of the complexity in the problem domain was removed, and the complexity and overhead of those tool sets no longer seems justified for the sole purpose of addressing the data validation requirements. As a result, rather than recommending a specific third party tool we are recommending an architecture that allows for an education agency to implement a toolset that fits well with their internal skill set, scale, budget, and architectural preferences.
For next steps, we suggest that:
- The Ed-Fi community socialize and formalize the validation API and architecture.
- Following that, SIS vendors can work on consuming validation results from that API.
- The pure-SQL implementation proof-of-concept could be expanded to address the to-dos or potentially re-implemented using another language to control the rules, such as Python, Ruby, or Java. This would ultimately depend on the education organization who is likely to be an early adopter and their language preferences.
- Finally, those results should then be shared with the rest of the community.