
Prepared for the Ed-Fi Alliance by: Jonathan Hickam, Learning Tapestry

Introduction

This document was prepared as part of a larger initiative investigating scalable, economical, and reusable solutions for Level 2 validations.  For further context, architecture, and vocabulary, refer to the associated Ed-Fi Validation Architecture document.

A high-level assessment was undertaken of a wide variety of open source solutions that could potentially address all or some of the Ed-Fi Level 2 Data Validation System Requirements.  A high-level analysis of the tools that were considered can be found here.  The final approach was to take a deeper look at three different types of solutions from this list: an open source business rules engine, a database-centric testing framework, and a pure-SQL implementation without external tool dependencies.

Business rule engine solution

Our initial hypothesis when we started this evaluation was that one of the many open source business rules engines would be able to address a significant part of our requirements.  During our analysis we settled on two Java-based business rules engines, Drools (https://www.drools.org/) and OpenRules (https://openrules.com/).  For our prototype we chose OpenRules, which is geared more toward business analysts than Java developers.

Methodology

These solutions require extensive Java infrastructure. Because of time limitations, we approached this prototype from the perspective of a business analyst, without proving out the behind-the-scenes plumbing.  We took the most complicated validation rule we could come up with and modeled it in the spreadsheet template, on the assumption that if an organization went down this route, the supporting infrastructure would work.

The rule that was developed in the spreadsheet was: 

If a student is in a program of "special education" type, then they must have an ILP within 30 days of the earlier of the first day of class or the first day they were enrolled at that school. 

The resulting spreadsheet can be seen here.

Results

Business rule engines are well suited for complex logic flows and decision trees.  However, most of the validation rules that were gathered as examples do not involve complex linear decisions; they are more a matter of finding specific records in a complex data set.  Expressing these rules in the language the rules engine expects might therefore require more abstraction and complexity than doing something closer to the data.
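
To illustrate what "closer to the data" means, here is a rough T-SQL sketch of the ILP rule from the previous section expressed directly as a query. The edfi.StudentSchoolAssociation and edfi.StudentProgramAssociation tables exist in the Ed-Fi schema, but ILPs are not part of the core Ed-Fi model, so ilp.StudentILP, ILPDate, and FirstDayOfClass are hypothetical placeholders:

    -- Students in a special education program with no ILP recorded within
    -- 30 days of the earlier of the first day of class or their enrollment
    -- date. ilp.StudentILP, ILPDate, and FirstDayOfClass are hypothetical.
    SELECT DISTINCT ssa.StudentUSI
    FROM edfi.StudentSchoolAssociation ssa
    JOIN edfi.StudentProgramAssociation spa
        ON spa.StudentUSI = ssa.StudentUSI
       AND spa.ProgramName = 'Special Education'
    LEFT JOIN ilp.StudentILP i
        ON i.StudentUSI = ssa.StudentUSI
       AND i.ILPDate <= DATEADD(DAY, 30,
           CASE WHEN ssa.EntryDate < ssa.FirstDayOfClass
                THEN ssa.EntryDate ELSE ssa.FirstDayOfClass END)
    WHERE i.StudentUSI IS NULL;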

Additionally, for a business rules engine to work, Java infrastructure is required, including custom-built data connectors to read the database objects and adapt them to the engine's API.

Conclusion

If an organization is already using a business rules infrastructure, it could be adapted to address the validation engine requirements.  This would be especially well suited for complex validations with multiple decision points. However, for an organization that is just looking to meet the requirements identified here, the overhead and complexity of this type of solution would be hard to justify.

Testing framework solution

Test-driven and continuous integration development methodologies are widely considered best practices in contemporary software development shops, especially those embracing agile development.  The ubiquity of these approaches has led to a wide variety of automated testing frameworks, some of which are very database-centric. Our hypothesis was that such a tool could be useful for implementing a data validation testing infrastructure.

First try: DbFit

Our research identified a couple of strong candidates for a database-centric testing infrastructure. The first one prototyped was DbFit (http://dbfit.github.io/dbfit/). DbFit is a database-centric testing toolset modeled after the JUnit and NUnit testing frameworks that are commonplace in Java and .NET, respectively.  DbFit was designed for projects that have significant database code and want to embrace test-driven development without requiring a Java or .NET wrapper around all of the tests. DbFit is open source and has an active user and developer community.

DbFit uses JDBC connectivity, so it is database agnostic.  This flexibility proved to be problematic within our time-boxed analysis, however.  The JDBC driver requires a JDK and some changes to the default SQL Server security settings.  After spending an hour on this with little success, we decided to look at something with less initial configuration.

Second try: tSQLt

The second attempt looked at a testing infrastructure written in T-SQL and tightly coupled to Microsoft SQL Server.  tSQLt (https://tsqlt.org/) is a unit testing framework that is also open source, with decent user and developer activity.  This tool installed with minimal configuration and allowed us to get some tests up quickly.

For our proof-of-concept, we implemented a test of the validation rule that students should have a physical or home address.  The tests are written in T-SQL.

Detailed steps to recreate the prototype

  1. Download, install, and configure tSQLt to run on the Ed-Fi database as described on the tSQLt getting started page.
  2. Write a test in T-SQL.

    EXEC tSQLt.NewTestClass 'testStudentAddress';
    GO
    
    CREATE PROCEDURE testStudentAddress.[test that every student has a home or physical address]
    AS
    BEGIN
        DECLARE @NoAddressCount INT;
        DECLARE @NoAddressThreshold INT; SET @NoAddressThreshold = 0;
    
        -- Count enrolled students that have no 'Physical' or 'Home' address on file
        SELECT @NoAddressCount = COUNT(DISTINCT c.studentUSI)
        FROM edfi.student c
        JOIN edfi.StudentSchoolAssociation
            ON c.studentUSI = StudentSchoolAssociation.StudentUSI
        LEFT JOIN
            (SELECT a.studentUSI, b.CodeValue
             FROM edfi.studentAddress a
             JOIN edfi.AddressType b
                 ON a.AddressTypeId = b.addressTypeId
                AND b.CodeValue IN ('Physical', 'Home')
            ) d ON c.studentUSI = d.studentUSI
        WHERE d.studentUSI IS NULL;  -- only students with no matching address row
    
        -- AssertEquals takes (expected, actual)
        EXEC tSQLt.AssertEquals @NoAddressThreshold, @NoAddressCount;
    END;
    GO
  3. Run the test

    EXEC tSQLt.Run 'testStudentAddress';
  4. Review the results 

    (1 row affected)
    [testStudentAddress].[test that every student has a home or physical address] failed: (Failure) Expected: <21042> but was: <0>
     
    +----------------------+
    |Test Execution Summary|
    +----------------------+
     
    |No|Test Case Name                                                               |Dur(ms)|Result |
    +--+-----------------------------------------------------------------------------+-------+-------+
    |1 |[testStudentAddress].[test that every student has a home or physical address]|     36|Failure|
    -----------------------------------------------------------------------------
    Msg 50000, Level 16, State 10, Line 27
    Test Case Summary: 1 test case(s) executed, 0 succeeded, 1 failed, 0 errored.
    -----------------------------------------------------------------------------

Results

This proof-of-concept showed that a testing infrastructure could be used for data validation tests, but that there are several reasons why it is not necessarily the best fit.

The primary issue is that testing infrastructures are designed around the concept of a test passing or failing.  This is useful when checking newly developed code, but for data validation it is not sufficient because, per the defined requirements, the infrastructure needs to identify the specific instances of data validation errors. In our example, the test identified that at least one student did not have an address, but it did not tell us which student(s) were missing one. This would not be sufficient.
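
If row-level detail were needed, tSQLt could get closer by collecting the offending rows into a table and asserting that the table is empty; on failure, tSQLt prints the table's contents. A minimal sketch follows (not part of the original prototype, and assuming tSQLt.AssertEmptyTable accepts the temp table):

    -- Variant test that names the offending students on failure
    CREATE PROCEDURE testStudentAddress.[test no student is missing a home or physical address]
    AS
    BEGIN
        -- Collect students with no 'Physical' or 'Home' address into a temp table
        SELECT c.studentUSI
        INTO #StudentsMissingAddress
        FROM edfi.student c
        JOIN edfi.StudentSchoolAssociation ssa ON c.studentUSI = ssa.StudentUSI
        LEFT JOIN (SELECT a.studentUSI
                   FROM edfi.studentAddress a
                   JOIN edfi.AddressType b
                       ON a.AddressTypeId = b.addressTypeId
                      AND b.CodeValue IN ('Physical', 'Home')) d
            ON c.studentUSI = d.studentUSI
        WHERE d.studentUSI IS NULL;
    
        -- Fails and prints the table contents if any rows are present
        EXEC tSQLt.AssertEmptyTable '#StudentsMissingAddress';
    END;
    GO

Even then, the offending students would only be named in the test output, which leads to the second issue below.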

The other issue is that testing infrastructures have their own methods of outputting validation results. In the case of tSQLt, the output is a result set returned in a query tool. This does not address the requirement for results to be sent to the API for downstream consumption by a SIS.

Conclusion

Testing infrastructures do provide a toolset that would allow for a standardized methodology for defining and storing tests.  Potentially this could be leveraged for data validation tests, but the toolset would have to be heavily customized to output specific instances of non-compliant data and to send results to the API.  If an organization were already heavily invested in a testing infrastructure this work could be justified; otherwise it might be a hard sell, since implementing the customizations would likely take no less effort than building something from the ground up (as in the following proof-of-concept).

Pure SQL Ed-Fi Data Validation Proof-Of-Concept

Overview

This solution shows a simplified version of a data validation engine built using only dynamic SQL. 

The proof of concept includes:

  • A prototype of the table structure for the validation API
  • A prototype of a validation rule table that organizes validation rules
  • A T-SQL stored procedure that runs the validation rules and populates the results into the API tables
  • Seed data for two validation rules
  • Screenshots of the results when run against the Glendale sample Ed-Fi data set
  • All SQL used to recreate the prototype

Data Model

The data model includes three tables:

ValidationRun: Tracks the status of runs, with one row for each time the validation engine is started. This structure should evolve into a community standard.

ValidationResult: Holds all of the validation results; this is the data that would be consumed by the SIS via the API. A foreign key ties each result back to its validation run. This table should also evolve into a community standard.

ValidationRules: Holds the SQL that produces the validation results. This is what drives the T-SQL validation engine, and it should serve as a starting point and example. This table would NOT evolve into a community standard; different organizations could implement this functionality in their preferred technology.

Processing Logic

The prototype includes a stored procedure, RunValidationRules, that iterates through all of the rules in the ValidationRules table for a given ValidationRuleSet and populates the results into the API tables.  When called with a ValidationRuleSet, it:

  1. Inserts a record into the ValidationRun table with a ValidationRunStatus of 'Running' and a ValidationRunSource equal to the ValidationRuleSet that was passed in
  2. Iterates through each rule in ValidationRules with that rule set. For each rule, it:
    1. Builds an 'INSERT ... SELECT' SQL statement, substituting in values from the ValidationRun and ValidationRules tables
    2. Runs that SQL statement, potentially populating rows in the ValidationResult table
  3. Updates the row in ValidationRun to 'Finished'
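
For example, with a run id of 1, rule 1.2 from the seed data (inserted in step 7 below) would expand to roughly the following statement:

    INSERT INTO dv.ValidationResult
        (ValidationRunId, ValidationRuleId, ValidationResourceId, ValidationResourceType, ValidationSeverity,
        EducationOrganizationId, Namespace, ValidationCategory, ValidationDetails)
    SELECT top(10)
        1 ValidationRunId,
        '1.2' ValidationRuleId,
        StudentUniqueId as ValidationResourceId,
        'Student' as ValidationResourceType,
        'Major Validation Error' as ValidationSeverity,
        SchoolId as EducationOrganizationId,
        '' Namespace,
        'Student' ValidationCategory,
        ValidationDetails
    FROM ( /* the rule's detection query */ ) tab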

Seed Data

The prototype implements two validation rules.

  • The first one looks for students that do not have either a physical or home address.  There are hundreds of these in the Glendale data, so the result set is limited to 10 for the purpose of demonstration.
  • The second rule looks for students that have an Exit/Withdraw date but do not have an Exit/Withdraw reason.  There are two of these in the Glendale data.

Results

Running the stored procedure generates the tracking row in the ValidationRun table and the detailed results in the ValidationResult table, as seen below.

Detailed steps to recreate the prototype

  1. Obtain the Glendale sample database from here and restore it to a local SQL Server database.
  2. Change your query context to the restored database.
  3. Create a new schema 'dv' for the data validation objects.

    create schema dv;
  4. Create the Validation API tables and constraints.

     
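    The DDL for this step was not captured in this version of the page. The following is a minimal sketch reconstructed from the columns referenced by the stored procedure and seed data below; the exact data types, lengths, and the start-time column are assumptions:

    CREATE TABLE dv.ValidationRun (
        ValidationRunId INT IDENTITY(1,1) PRIMARY KEY,
        ValidationRunStatus VARCHAR(20) NOT NULL,
        ValidationRunSource VARCHAR(60) NULL,
        -- start time is assumed; only the finish time is set by the procedure
        ValidationRunStartDateTime DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
        ValidationRunFinishDateTime DATETIME NULL
    );
    
    CREATE TABLE dv.ValidationResult (
        ValidationResultId INT IDENTITY(1,1) PRIMARY KEY,
        ValidationRunId INT NOT NULL
            REFERENCES dv.ValidationRun (ValidationRunId),
        ValidationRuleId VARCHAR(60) NOT NULL,
        ValidationResourceId VARCHAR(60) NULL,
        ValidationResourceType VARCHAR(60) NULL,
        ValidationSeverity VARCHAR(60) NULL,
        EducationOrganizationId INT NULL,
        Namespace VARCHAR(255) NULL,
        ValidationCategory VARCHAR(60) NULL,
        ValidationDetails VARCHAR(max) NULL
    );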
  5. Create the rules table.

     
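    The original DDL was likewise not captured; a minimal sketch inferred from the INSERT statements in step 7 (data types are assumptions):

    CREATE TABLE dv.ValidationRules (
        ValidationRuleId VARCHAR(60) NOT NULL PRIMARY KEY,
        ValidationSQL VARCHAR(max) NOT NULL,
        ValidationRuleSet VARCHAR(60) NOT NULL,
        ValidationSeverity VARCHAR(60) NOT NULL,
        ValidationCategory VARCHAR(60) NOT NULL,
        RuleDescription VARCHAR(max) NULL
    );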
  6. Create the stored procedure.

    IF OBJECT_ID('dv.RunValidationRules', 'P') IS NOT NULL
    BEGIN
        DROP PROC dv.RunValidationRules;
    END
    go
    
    CREATE PROC dv.RunValidationRules @ValidationRuleSet varchar(60)
    AS
        DECLARE @ValidationRuleId varchar(max);
        DECLARE @ValidationSQL varchar(max);
        DECLARE @ValidationSeverity varchar(max);
        DECLARE @ValidationCategory varchar(max);
    
        DECLARE @RunId INT;
    
        -- Build a cursor that gets all of the validations for a given rule set
        DECLARE db_cursor CURSOR FOR
            SELECT ValidationRuleId,
                   ValidationSQL,
                   ValidationSeverity,
                   ValidationCategory
            FROM dv.ValidationRules
            WHERE ValidationRuleSet = @ValidationRuleSet;
    
        -- Insert a row in the tracking table
        INSERT INTO dv.ValidationRun
            (ValidationRunStatus, ValidationRunSource)
        VALUES ('Running', @ValidationRuleSet);
    
        -- Capture the identity value from that insert
        SELECT @RunId = SCOPE_IDENTITY();
    
        OPEN db_cursor;
    
        FETCH NEXT FROM db_cursor INTO @ValidationRuleId,
                    @ValidationSQL,
                    @ValidationSeverity,
                    @ValidationCategory;
    
        WHILE @@FETCH_STATUS = 0
        BEGIN
            -- Substitute the run- and rule-level values into the rule's SELECT
            SET @ValidationSQL = REPLACE(@ValidationSQL, '~SEVERITY~', @ValidationSeverity);
            SET @ValidationSQL = REPLACE(@ValidationSQL, '~VALIDATION_RULE_ID~', @ValidationRuleId);
            SET @ValidationSQL = REPLACE(@ValidationSQL, '~VALIDATION_CATEGORY~', @ValidationCategory);
            -- cast the int run id so REPLACE receives a string
            SET @ValidationSQL = REPLACE(@ValidationSQL, '~VALIDATION_RUN_ID~', CONVERT(varchar(12), @RunId));
    
            -- Wrap the rule's SELECT in an INSERT against the results table
            SET @ValidationSQL = CONCAT('
                INSERT INTO dv.ValidationResult
                    (ValidationRunId, ValidationRuleId, ValidationResourceId, ValidationResourceType, ValidationSeverity,
                    EducationOrganizationId, Namespace, ValidationCategory, ValidationDetails)
                ', @ValidationSQL);
    
            PRINT (@ValidationSQL);
            EXEC (@ValidationSQL);
    
            FETCH NEXT FROM db_cursor INTO @ValidationRuleId,
                        @ValidationSQL,
                        @ValidationSeverity,
                        @ValidationCategory;
        END
    
        -- Tell the tracking table we are done
        UPDATE dv.ValidationRun
        SET ValidationRunStatus = 'Finished',
            ValidationRunFinishDateTime = current_timestamp
        WHERE ValidationRunId = @RunId;
    
        CLOSE db_cursor;
        DEALLOCATE db_cursor;
    go
  7. Insert the sample rules into the rules table.

    INSERT INTO dv.ValidationRules (ValidationRuleId,
    ValidationSQL,
    ValidationRuleSet,
    ValidationSeverity,
    ValidationCategory,
    RuleDescription)
    VALUES
    ('1.1',
    'SELECT top(10)
        ~VALIDATION_RUN_ID~ ValidationRunId,
        ''~VALIDATION_RULE_ID~'' ValidationRuleId,
        StudentUniqueId as ValidationResourceId,
        ''Student'' as ValidationResourceType,
        ''~SEVERITY~'' as ValidationSeverity,
        SchoolId as EducationOrganizationId,
        '''' Namespace,
        ''~VALIDATION_CATEGORY~'' ValidationCategory,
        '''' ValidationDetails
    FROM (
        select DISTINCT
            c.StudentUniqueId,
            SchoolId
        from edfi.student c
        JOIN edfi.StudentSchoolAssociation on c.studentUSI = StudentSchoolAssociation.StudentUSI
        LEFT JOIN
            (SELECT a.studentUSI, b.CodeValue
            FROM edfi.studentAddress a
                JOIN edfi.AddressType b on a.AddressTypeId = b.addressTypeId
                and b.CodeValue in (''Physical'', ''Home'')
            ) d on c.studentUSI = d.studentUSI
        where d.studentUSI is null
        ) e',
    'nightly',
    'Warning',
    'Student',
    'Raise a validation warning for every student that does not have a physical or home address specified. Limit result set to 10.');
    
    
    
    INSERT INTO dv.ValidationRules (ValidationRuleId, 
    ValidationSQL,
    ValidationRuleSet,
    ValidationSeverity,
    ValidationCategory,
    RuleDescription)
    VALUES
    ('1.2',
    'SELECT top(10)
    	~VALIDATION_RUN_ID~ ValidationRunId,
    	''~VALIDATION_RULE_ID~'' ValidationRuleId,
    	StudentUniqueId as ValidationResourceId,
    	''Student'' as ValidationResourceType,
    	''~SEVERITY~'' as ValidationSeverity,
    	SchoolId as EducationOrganizationId,
    	'''' Namespace,
    	''~VALIDATION_CATEGORY~'' ValidationCategory,
    	ValidationDetails
    FROM (
    SELECT distinct studentUniqueId, SchoolId,
    concat(''exit date '',ExitWithdrawDate) ValidationDetails
    from edfi.studentSchoolAssociation JOIN
         edfi.Student on student.StudentUSI = StudentSchoolAssociation.StudentUSI
    where ExitWithdrawDate is not null
    and ExitWithdrawTypeDescriptorId is null) tab',
    'nightly',
    'Major Validation Error',
    'Student',
    'Raise a major validation error for every StudentSchoolAssociation where there is an ExitWithdrawDate specified but not an ExitWithdrawDescriptor specified. Limit result set to 10.');
  8. Run the stored procedure and check out the results.

    exec dv.RunValidationRules 'nightly';
    Select * from dv.ValidationRun;
    Select * from dv.ValidationResult;

Proof-of-concept to-dos

The following list details some shortcuts and known issues that were identified during prototyping.

  1. Several reference values were hard-coded for the sake of time during the prototype.  In the API data model, the ValidationRunStatus was hard-coded into the ValidationRun table. In future iterations, it should be represented with a reference table, i.e. ValidationRunStatusType (a sketch follows this list).
  2. The validation procedure should run under a user account that has read-only privileges on database objects in the Ed-Fi schema and insert privileges on the API tables (also sketched below). This would prevent possible accidental or malicious SQL in the rules table from corrupting Ed-Fi data.
  3. An external scheduling interface would be required for initiating the validation runs.  Options include SQL Server Agent, the Windows Task Scheduler (with some sort of PowerShell or batch script), or even a third-party tool like Quartz Scheduler.
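
A minimal sketch of the first two items follows; the ValidationRunStatusType table and the dv_validation_runner role are hypothetical names, not part of the prototype:

    -- To-do 1: a reference table for run statuses (hypothetical), to replace
    -- the hard-coded status strings in dv.ValidationRun
    CREATE TABLE dv.ValidationRunStatusType (
        ValidationRunStatusTypeId INT IDENTITY(1,1) PRIMARY KEY,
        CodeValue VARCHAR(20) NOT NULL UNIQUE
    );
    INSERT INTO dv.ValidationRunStatusType (CodeValue)
    VALUES ('Running'), ('Finished'), ('Failed');
    
    -- To-do 2: a restricted role (hypothetical) that can read Ed-Fi data and
    -- write to the dv schema, but cannot modify anything in the edfi schema
    CREATE ROLE dv_validation_runner;
    GRANT SELECT ON SCHEMA::edfi TO dv_validation_runner;
    GRANT SELECT, INSERT, UPDATE ON SCHEMA::dv TO dv_validation_runner;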

Overall conclusions and recommendations

When this project was initiated, we had hoped to find an open source rules engine that could be configured to address the Ed-Fi community's data validation needs.  What we found instead is that once an abstracted infrastructure and a proposed validation API structure had been identified, much of the complexity in the problem domain was removed, and the complexity and overhead of those toolsets no longer seemed justified for the sole purpose of addressing the data validation requirements. As a result, rather than recommending a specific third-party tool, we recommend an architecture that allows an education agency to implement a toolset that fits its internal skill set, scale, budget, and architectural preferences.

For next steps, we suggest that:

  1. The Ed-Fi community socialize and formalize the validation API and architecture.
  2. Following that, SIS vendors can work on consuming validation results from that API.
  3. The pure-SQL implementation proof-of-concept could be expanded to address the to-dos, or potentially re-implemented using another language to control the rules, such as Python, Ruby, or Java. This would ultimately depend on the language preferences of the education organization that is likely to be an early adopter.
  4. Finally, those results should then be shared with the rest of the community.

