Open Source Rules Engine Survey and Recommendation
Prepared for the Ed-Fi Alliance by: Jonathan Hickam, Learning Tapestry
Contents
Introduction
This document has been prepared as part of a larger initiative that is looking at scalable, economical, and reusable solutions for level 2 validations. For further context, architecture, and vocabulary refer to the associated Ed-Fi Validation Architecture document.
A high-level assessment of a wide variety of open source solutions was undertaken that could potentially address all or some of the Ed-Fi Level 2 Data Validation System Requirements. A high-level analysis of the tools that were considered can be found at Survey of Rule Engines. However, the final approach taken was to take a deeper look at three different types of solutions from this list. The first one was to look at an open-source business rules engine . The second was to experiment with using a database-centric testing framework. The third was to look at what a pure-SQL implementation without external tool dependencies might look like.
Option 1: Business rule engine solution
Our initial hypothesis when we started this evaluation was that one of the many open source business rules engines would be able to address a significant part of our requirements. During our analysis we settled in on two Java-based business rules engines, Drools (https://www.drools.org/) and Open Rules (https://openrules.com/). For our prototype we decided to try Open Rules, which is geared more towards business analysts instead of Java developers.
Methodology
These solutions require extensive Java infrastructure. Because of time limitations, we approached this prototype from the perspective of a business analyst, without proving out the behind-the-scenes plumbing. This means that we took the most complicated validation rule that we could come up with and modeled it into the spreadsheet template, with the assumption that if an organization was going to go down this route that the supporting infrastructure would work.
The rule that was developed in the spreadsheet was:
If a student is in a program of "special education" type, then they must have an ILP within 30 days of the lesser of the first day of class or the first day they were enrolled at that school.
The resulting spreadsheet can be seen .
Results
Business rule engines are well suited for complex logic flows and decision trees. However, most of the validation rules that were gathered as examples do not necessarily have complex linear decisions, it is more of a matter of finding specific examples in a complex data set. This means that expressing these rules in the language that the rules engine is looking for might require a level of abstraction and complexity than it would be to do something closer to the data.
Additionally, for a business rules engine to work, Java infrastructure is required, including what would be custom-build data connectors that looked at the database objects and adapted to the API.
Conclusion
If an organization is already using a business rules infrastructure, it could be adapted to address the validation engine requirements. This would be especially well suited for complex validations that had multiple decision points. However, for an organization that is just looking to solve for the requirements identified, the overhead and complexity of this type of solution would be hard to justify.
Option 2: Testing framework solution
Test driven and continuous integration development methodologies are widely considered best practices in contemporary software development shops, especially those embracing agile development. The ubiquity of these approaches has led to a wide variety of automated testing frameworks, some of which are very database-centric. Our hypothesis is that such a tool could be useful for implementing a data validation testing infrastructure.
First try- DB Fit
Our research identified a couple of strong candidates for a database-centric testing infrastructure. The first one that was prototyped was DB Fit (http://dbfit.github.io/dbfit/). DB fit is a database-centric testing toolset that is modeled after the J-Unit or N-Unit testing infrastructures that are commonplace in Java and .Net respectively. DB Fit was designed for projects that have significant database code and want to embrace test driven development without requiring a Java or .Net wrapper around all of the tests. DB Fit is open source and has an active user and developer community.
DB Fit uses JDBC connectivity, so it is database agnostic. This flexibility proved to be problematic with our time-boxed analysis however. THe JDBC driver requires a JDK and some changes to the default SQL Server security module. After spending an hour on this with little success we decided to look at something with less initial configuration.
Second try- tSQLt
The second attempt looked at a testing infrastructure that was written in T-SQL, and very tightly coupled with Microsoft SQL Server. tSQLt (https://tsqlt.org/) is a unit testing infrastructure that also is open source with decent user and developer activity. This tool installed with minimal configuration and allowed us to get some tests up quickly.
For our proof-of-concept, we implemented a test of the validation rule that students should have a physical or home address. The tests are written in T-SQL.
Detailed steps to recreate the prototype
Download, install, and configure tSQLt to run on the Ed-Fi database as described in their getting started page.
Write a test in T-SQL.
EXEC tSQLt.NewTestClass 'testStudentAddress'; GO CREATE PROCEDURE testStudentAddress.[test that every student has a home or physical address] AS BEGIN DECLARE @NoAddressCount INT; DECLARE @NoAddressThreshold INT; SET @NoAddressThreshold = 0; SELECT @NoAddressCount = count(distinct c.studentUSI) from edfi.student c JOIN edfi.StudentSchoolAssociation on c.studentUSI = StudentSchoolAssociation.StudentUSI LEFT JOIN (SELECT studentUSI, b.CodeValue FROM edfi.studentAddress a JOIN edfi.AddressType b on a.AddressTypeId = b.addressTypeId and b.CodeValue in ( 'Physical', 'Home') ) d on c.studentUSI = d.studentUSI; EXEC tSQLt.AssertEquals @NoAddressCount, @NoAddressThreshold; END; GORun the test
EXEC tSQLt.Run 'testStudentAddress';Review the results
1 row affected) [testStudentAddress].[test that every student has a home or physical address] failed: (Failure) Expected: <21042> but was: <0> +----------------------+ |Test Execution Summary| +----------------------+ |No|Test Case Name |Dur(ms)|Result | +--+-----------------------------------------------------------------------------+-------+-------+ |1 |[testStudentAddress].[test that every student has a home or physical address]| 36|Failure| ----------------------------------------------------------------------------- Msg 50000, Level 16, State 10, Line 27 Test Case Summary: 1 test case(s) executed, 0 succeeded, 1 failed, 0 errored. -----------------------------------------------------------------------------
Results
This proof-of-concept showed that a testing infrastructure could be used for data validation tests, but that there are several reasons why it is not necessarily the best fit.
The primary issue is that testing infrastructures are designed around the concept of a test failing or passing. This is useful when checking new code that has been developed, but in the case of data validation is not sufficient because, per the defined requirements, the infrastructure needs to be identifying the specific instances of data validation errors. Specifically in our example, the test identified that there was at least one instance of a student that did not have an address but did not tell us what student(s) were missing the address. This would not be sufficient.
The other issue is that the testing infrastructures have their own method of outputting validation results. In the case of tSQLt, this output is results returned in a query tool. This does not address the requirement for the results to be sent to the API for downstream consumption by a SIS.
Conclusion
Testing infrastructures do provide a tool set that would allow for a standardized methodology for storing and defining tests. Potentially this could be leveraged for data validation tests, but the toolset would have to be heavily customized so that it would output specific instances of non-compliant data and so that it would output results to the API. If an organization was already heavily invested in a testing infrastructure this work could be justified, otherwise it might be a hard sell because the amount of effort to implement the customizations is probably less then just doing something from the ground up (per the previous example).
Option 3: Pure SQL Ed-Fi Data Validation Proof-Of-Concept
Overview
This solution shows a simplified version of a data validation engine built using only dynamic SQL.
The proof of concept includes:
A prototype of the table structure for the validation API
A prototype of a validation rule table that organizes validation rules
A T-SQL stored procedure that runs the validation rules and populates the results into the API tables
Seed data for two validation rules
Screenshots of the results when run against the Glendale sample Ed-Fi data set
All SQL used to recreate the prototype
Data Model
The data model includes three tables:
Table | Description |
|---|---|
ValidationRun | This table shows the status of runs. It will have one row for each time the validation engine is started. This should evolve into a community standard. |
ValidationResults | This table has all of the validation results. This is the data that would be consumed by the SIS via the API. This table has a foreign key that ties it back to the validation run. This table should also evolve into a community standard. |
ValidationRules | This is a table that has the SQL that produces validation results. This is what drives the T-SQL validation engine, and should serve as a starting point and example. This would NOT evolve into a community standard - different organizations could implement this functionality in their preferred technology. |
Processing Logic
The prototype includes a stored procedure, RunValidationRules, that will iterate through all of the rules in the ValidationRule table for a given ValidationRuleSet and populate the results into the API tables. When called with a ValidationRuleSet it does:
Inserts a record in the ValidationRun table with a ValidationRunStatus of 'Running' and a ValidationRunSource that is equal to the ValidationRuleSet that was called
Iterates through each rule in ValidationRule with that RuleSet. For each rule:
Builds a 'INSERT AS SELECT' SQL statement, substituting in variables from the ValidationRun table and the ValidationRules table
Runs that SQL statement, potentially populating rows in the ValidationResults table
Updates the row in ValidationRun to 'Finished'