This version of the Ed-Fi Dashboards is no longer supported. See the Ed-Fi Technology Version Index for a link to the latest version.
ETL Reject Strategy
- Itzel Torres
- Ian Christopher
Overview
The ETL "Reject Strategy" is a general term used in the code and documentation to describe logical error and exception handling. The goal of the Reject Strategy is to facilitate the loading of data into the database through SSIS packages by redirecting error rows to an alternate, error handling workflow, thus allowing an ETL package to run through to completion despite errors.
The error handling workflow compiles exception data and writes into a centralized table for further troubleshooting. This technique has been implemented on the Lookup Transformation and Destination components for the Ed-Fi Loader packages, although the design is generic enough to handle any package with SSIS Data Flow components that have an error output.
Note: This document addresses ETL Exception handling during the loading of XML data files into an Ed-Fi Operational Data Store (ODS). The same exception handling capability has been implemented for the ETL between the Ed-Fi ODS and Dashboard data marts in Ed-Fi v1.2.
Exception Data
Exception data encountered during ETL processes is collected in a data table composed of the columns below. All except ErrorMessage are provided by a custom Transformation component called EdFiException. The data is stored in a table called edfi.EdFiException.
Table 1. Exception Data Column Definition
Column | Description |
---|---|
TableName | This is the table name that the exception is associated with (e.g.: destination tables, lookup data source tables, etc.). |
ColumnNameList | This is a list of columns in the table that the exception is associated with. For Destination components, this should include all columns that have SSIS input assigned to them in the Mappings tab. For Lookup Transformation components, this should include all lookup columns in the relationship. The list is comma delimited. |
ColumnValueList | This is an ordered list of column values with the same order as the corresponding column name list above. All values that are not NULL are wrapped in single quotes regardless of data type. If the value contains a single quote, it will be escaped by two single quotes. |
ErrorMessage | Contains detail error messages if available. For Destination components, it is the same as run-time error messages returned by SQL Server as if an insert statement was executed. This column is populated at the end of the build (instead of during SSIS execution). |
IdentifierCondition | Contains column names and values that can be used to identify a row in the data flow and they are typically the primary keys or unique keys of the Destination component. It is in a format that is ready to be used in a T-SQL where clause. |
LookupCondition | Contains column names and values that can be used to identify a row in the data source of the Lookup Transformation component. It is in a format that is ready to be used in a T-SQL where clause. This column is applicable to exceptions from Lookup Transformation components only. |
ExceptionLevel | Possible values are Error, Warning, and Information. Exceptions that would have caused the package to fail should be set to Error. Exceptions that were originally ignored should have a level of Warning. |
SSIS package system variables | |
StartTime | Timestamp when the exception occurred. |
PackageName | Name of the SSIS package. |
TaskName | Name of the task in the Control Flow. |
ComponentName | Name of the component that captured the exception data (i.e.: the name property of the EdFiException component). |
SSIS error output information | |
ErrorCode | This is provided by SSIS error output to identify the error. It contains the error number that caused the row to be rejected. |
ErrorDescription | Description of ErrorCode provided by an SSIS API. |
ErrorColumn | This is also provided by SSIS error output. It contains the lineage identifier of the error column which is unique to a data flow within the package. |
ErrorColumnName | This is meant to be the column name associated with ErrorColumn but it is not currently populated due to lack of available SSIS APIs. This might change in future versions of SSIS. For now, to view the metadata of ErrorColumn, click the path that connects the error output to the next component in the Data Flow. |
EdFiException Component
This component is the core part of the Reject Strategy. It is a Transformation component that combines data from input columns, error outputs, and system variables and outputs them as the exception data described above. The EdFiException component editor has three tabs. Follow the steps below to configure the EdFiException component. In order for the component to be validated, the TableName, ErrorCode, and ErrorColumn need to be set properly.
Step 1. Configure the Component Properties
- Set the ExceptionLevel from the dropdown box (Error, Warning, Information). The default value is Error.
- Enter TableName. The TableName value needs to include both schema and table name (e.g., edfi.StudentIndicator).
Figure 1. Component Properties Tab
Step 2. Configure the Input Columns
- Select columns relevant to the exception.
- If the Input Column name is different from the schema of the table specified in the TableName property, rename the column in Output Alias (e.g. from DerivedStudentUSI to StudentUSI). The ErrorMessage column in the edfi.EdFiException table will not be populated properly if the column name in Output Alias does not match the schema.
- Select ErrorCode and ErrorColumn. The EdFiException component requires one and only one ErrorCode and ErrorColumn. Not including or including more than one ErrorCode and ErrorColumn will cause the component validation to fail.
- The component validation looks for the keyword "ErrorCode" and "ErrorColumn" at the end of the column names under Output Alias. Having prefixes is ok but do not change or add anything after the keyword "ErrorCode" and "ErrorColumn".
Figure 2. Input Columns Tab
Step 3. Configure the Input and Output Properties
- Set IsIdentifier to True for columns that are primary keys or unique keys. These columns will be included in the IdentifierCondition column of the output. Note that IsIdentifier is a custom property and the property might not show immediately after selecting the column. In that case, close the EdFiException component editor and re-open to refresh the Custom Properties section of the Input and Output Properties tab.
- Output Columns are created by the EdFiException component and there is no need to modify.
Figure 3. Input and Output Properties Tab
Reject Strategy Development Guide
The strategy can be implemented by adding an exception workflow to each component that has the potential to cause the package fail. This might be driven by whether certain business rules need to be enforced strictly; for example, unexpected results on an inner join using a Lookup Transformation may or may not require the Reject Strategy, depending on requirements. The exception workflow has at least two components, namely an EdFiException component to collect exception data and a Destination component to write the exception data to the edfi.EdFiException table. It is important that the target component is not processing data in batch (i.e., FastLoad), or the entire batch will be treated as an exception when an error occurs.
General Implementation
To add and configure the EdFiException component:
- In the Toolbox, expand Data Flow Transformations and then drag EdFiException component onto the design surface of the data flow tab. Place EdFiException close to the target component.
- Click the target component and then drag the red arrow onto the newly added EdFiException to connect the two components.
- In the Configure Error Output dialog box, in the Error column, select Redirect row and then click OK.
- On the Data Flow design surface, click the newly added EdFiException component and change the name. This name will be displayed in the ComponentName column in the edfi.EdFiException table. Name the component in reference with the section where it is placed or the associated transformation, this is important to easily determine later where to look at while reviewing the edfi.EdFiException table in case of an exception.
- Double click the EdFiException component and follow the steps described in the EdFiException component section.
To add and configure the Destination for the edfi.EdFiException table:
- Add an OLE DB Destination onto the data flow design surface. Put the Destination directly underneath the EdFiException component.
- Click the EdFiException component and then drag the green arrow onto the new OLE DB Destination.
- In the Connection Manager tab of the OLE DB Destination, select edfi.EdFiException for the name of the table.In the Mappings tab, map Input Column to Destination Column by column name. Ignore the EdFiExceptionId, ErrorMessage, and LookupCondition columns as EdFiExceptionId is an identity column and ErrorMessage and LookupCondition are going to be populated by a separate process at the end of the build.
- Name the Destination component according with EdFiException component’s name, for consistency and to easily identify the location of the exception in case that the insertion on the edfi.EdFiException table fails.
To populate the ErrorMessage column:
- Run the edfi.EdFiExceptionEnhancement stored procedure through SQL Server Management Studio.
EXEC edfi.EdFiExceptionEnhancement GO
Figure 4 EdFiException Mappings
Destination Components
If a FastLoad access mode is used, for performance tradeoff, set the FastLoad insert commit size in the original destination component to roughly 10% of the data size. Also, add another destination component with exactly the same settings, except that the data access mode should not be FastLoad so that we capture the exception on a row-by-row basis.
When there is more than one error output in the data flow, there will be multiple ErrorCode and ErrorColumn options available in the EdFiException component Input Columns tab. Make sure to select the right one. For a combination of a FastLoad destination followed by a non-FastLoad destination, select the ErrorCode and ErrorColumn associated with the latter.
Figure 5 Exception Workflow
Lookup Transformation Components
To implement the Reject Strategy for Lookup Transformation components:
- Add the EdFiException component and a destination component to the edfi.EdFiException table.
- In the Lookup Transformation Editor, change the in the dropdown list under “Specify how to handle rows with no matching entries” to "Redirect rows to error output". If the original option is "Ignore Failure" or "Redirect rows to no match output" and no match output data flow exists, there is no need to add the exception workflow.
- Find out which columns are being used in the join between the input and the data source of the Lookup Transformation component. Select those columns in the Input Columns tab of the EdFiException component. If the Input Column name is different from the schema in the data source of the Lookup Transformation component, rename the column in Output Alias of the EdFiIntegration component. Select additional columns that are primary keys or unique keys in the Input Columns tab of the EdFiException component and set IsIdentifier in the Input and Output Properties tab to True. These columns will be included in the IdentifierCondition column but not in ColumnNameList or ColumnValueList.
Troubleshooting
Symptom: Multiple Errors, Warnings and Information messages
Solution
The edfi.EdFiException table can have multiple levels of exceptions. The Error level means the exception would have caused the package to fail had it not been for the Redirect row setting in the error output. The Warning level is currently only used for rows that get dropped when there is no matching value in the Lookup Transformation component. The Information level is not currently used in any packages. Generally speaking, Warning and Information level exceptions are not as critical and you might want to postpone investigating these issues until after all Error level exceptions are resolved.
Symptom: Errors preventing package execution
- Violation of UNIQUE KEY constraint 'IX_StudentAcademicRecord'. Cannot insert duplicate key in object 'edfi.StudentAcademicRecord'.
- Violation of PRIMARY KEY constraint 'PK_Grade'. Cannot insert duplicate key in object 'edfi.Grade'.
- The INSERT statement conflicted with the FOREIGN KEY constraint "FK_CourseOffering_Session". The conflict occurred in database "EdFiIntegrationTemp_XXXXX", table "edfi.Session", column 'SchoolId'.
- Cannot insert the value NULL into column 'CourseAttemptResultTypeId', table 'EdFiIntegrationTemp_XXXXX.edfi.CourseTranscript'; column does not allow nulls. INSERT fails.
- Conversion failed when converting the varchar value 'S1234567' to data type int.
Solution
For exceptions from Destination components, the ErrorMessage column will tell you exactly what went wrong. To investigate the issue in depth, look at the TableName, ColumnNameList, and ColumnValueList columns. For UNIQUE KEY, PRIMARY KEY, and FOREIGN KEY violations, use the IdentifierCondition column to isolate relevant information
Symptom: Lookup Transformation error “No matching during lookup”
Solution
This error is particular for Lookup Transformations. To validate the issue, locate the data source of the Lookup Transformation component in the TableName column and the lookup column(s) in the ColumnNameList column. Confirm that the input value(s) specified in the ColumnValueList is indeed not in the lookup source table. Alternatively use the LookupCondition column to identify data in the lookup source table.
Symptom: How to locate the cause of the error
Solution
Use the PackageName, TaskName, and ComponentName columns to locate the component. The ComponentName is the name of the EdFiException component and it is connected to the component that caused the problem.
Symptom: Need high-level picture of the build status
Solution
Execute the following query in SQL Server Mangement Studio.
SELECT ErrorMessage, COUNT(1) FROM edfi.EdFiException GROUP BY ErrorMessage GO
Possible Abnormalities in the edfi.EdFiException Table
Issue | Resolution/Explanation |
---|---|
The ErrorMessage column is completely NULL. | Run the edfi.EdFiExceptionEnhancement stored procedure. |
The ErrorMessage column is not fully populated. | When there are many exceptions, it is possible that the edfi.EdFiExceptionEnhancement stored procedure will timeout before processing the entire edfi.EdFiException table. The edfi.EdFiExceptionEnhancement stored procedure will process records that have not been processed yet (i.e. ErrorMessage IS NULL) so one option to address the issue is to simply re-run the stored procedure manually. Alternatively, to speed up the process, consider reducing the number of records with similar exception data before re-running the stored procedure since records with the same exception data (i.e. same TableName, PackageName, TaskName, ComponentName, ErrorCode, and ErrorColumn) are likely to share the same root cause. |
The ErrorMessage value is "No Error" | It is likely that the Destination component that connects to the EdFiException component is processing data in batch (i.e. FastLoad) so the entire batch including good rows was treated as an exception. These good rows will be redirected to the exception workflow and end up in the edfi.EdFiException table instead of in the Destination. They will not produce any error messages when the edfi.EdFiExceptionEnhancement stored procedure processes them. |
The PackageName value is NULL. | Every column should be fully populated or at least set to blank. A NULL value typically means that the column is not mapped in the Destination that writes data to the edfi.EdFiException table. This can happen in other columns as well. |
The ColumnValueList value is NULL | Double check whether ColumnValueList is really NULL (''WHERE ColumnValueList IS NULL'') or literal "NULL" (''WHERE ColumnValueList = 'NULL'''). When it is literally "NULL", that means the column specified in ColumnNameList was NULL when the data passed through the data flow. If it is a real NULL, chances are that the ColumnValueList column is not mapped in the Destination that writes data to the edfi.EdFiException table. |
The ErrorMessage column contains messages that don't seem to be error messages. For example, "Invalid column name 'SchoolId'." | Check the EdFiException component configuration. It is likely that the Output Alias in the Input Columns tab is incorrect. To fix the ErrorMessage column, correct information in ColumnNameList, set ErrorMessage to NULL, and re-run the edfi.EdFiExceptionEnhancement stored procedure. To fix the issue permanently, correct the Output Alias in the EdFiException component. |
Debugging the SSIS package yields different error messages. | Check the EdFiException component configuration and make sure that the columns selected in the Input Columns tab are identical to those being mapped in the Destination component that throws the exception. |
After addressing the root cause and re-running the package manually you continue seeing the same exception data. | Make sure you are not looking at old exception data. Use the StartTime column to isolate new data. |