ODS / API Bulk Load Enhancements in v2.2

The previous bulk load process was implemented using the expedient approach of reusing the full API pipeline processing, including NHibernate for database persistence. However, due to the nature of how ORMs work, they do not offer the best approach for loading large amounts of data where performance is a concern. This article describes a new approach implemented in ODS / API v2.2.

The new approach creates an optimized bulk load process that reuses all the existing XML shredding functionality but replaces the existing NHibernate-based “Upsert Entity” implementation with one based on SqlBulkCopy operations and set-based operations for referential validation and authorization (rather than single-row processing). This approach shifts much of the work to SQL Server, and eliminates much of the performance degradation caused by latency.

This approach requires that the bulk load and current ODS tables be located on the same physical datbase server since the validation and merge functionality need to execute queries that join bulk load and current ODS tables.

Previous Implementation

Previously, the bulk load process shredded the XML interchanges into API-level Resource “Upsert” operations. The configuration then reused the existing API pipeline directly (bypassing HTTP) to process each Resource into the ODS via NHibernate, as depicted below.

Figure 1. NHibernate-based repository implementation

The entire aggregate is processed synchronously into the database by NHibernate using multiple, single-record operations to the affected tables, as determined by NHibernate’s change detection. When the operation is complete, the next Resource can be processed.

Authorization

Authorization during bulk processing, when enabled (as in the Bulk Load Service), was done in the pipeline before the resource was persisted by NHibernate by utilizing the same security mechanism as the Web API. This approach required that each record be inspected for security independently before insertion.

Optimized Implementation

Under the new implementation, bulk load processing is done in two phases. The first phase processes the data from Ed-Fi XML files into the Bulk Load ODS. The second phase processes the data in the Bulk Load ODS into the current ODS. The Bulk Load ODS is a modified version of the current ODS. The modifications remove foreign key constraints and introduce new columns, tables, views, and aliases for bulk processing.

Bulk Operation Session

Before a bulk operation begins processing it is now assigned a bulk load ODS in a bulk operation session. By default, this is done by utilizing the EdFi_Ods_Bulk connection string and a templated database name of EdFi_Ods_Bulk# where # is the number of the database. The total number of deployed instances of the EdFi_Ods_Bulk database is specified in the configuration element BulkOdsDatabaseInstanceCount.

The bulk load ODS assignment is done by inserting the bulk load operation ID into the BulkOdsOperation table in the bulk load ODS. If the insert succeeds, that bulk load ODS database is locked for the duration of that operation. When the bulk operation session is finished (whether due to successful completion or failure) the operation ID is deleted. This deletion triggers the bulk load ODS to truncate every table in that bulk load ODS, leaving it in a reusable state.

If, for troubleshooting purposes, the bulk load ODS should remain locked, the configuration entry ReleaseBulkDatabaseAfterProcessing can be set to “false”. Note that this will cause all bulk load ODS operations performed by that process to leave the bulk load ODS in a locked state. Again, this setting is intended for debugging purposes only.

To release a locked bulk load ODS simply delete the row out of the BulkOdsOperation table. The deletion will trigger that bulk load ODS to reset itself to a reusable state.

Phase One

During the first phase, as before, Ed-Fi XML files are processed by the XML shredder as “Upserts” into the processing pipeline. However, the data persistence now utilizes the Tabular Data Stream data format via the SqlBulkCopy class in .NET. This is the fastest mechanism available for loading data into SQL Server. The new approach utilizes bulk context aware caches to ensure that any aggregates that use server assigned identities (SAID), such as Student USI or Descriptor ID, are consistent in the Bulk Load ODS for every natural key reference. These changes also enable the interchanges to be loaded in any order within a bulk load operation.

The data processing, as before, includes standard aggregate resource validations. After validation, the aggregate’s data is persisted into the Bulk Load ODS by persisting the aggregate’s component entities via the corresponding bulk copy operations for each entities’ table. These are supplied with the data as a streaming entity source for the tabular data stream. Primary keys are tracked per table. Duplicate key violations are rejected and logged.

Figure 2. SqlBulkCopy-based repository replacement

When an interchange processing is complete, all the ongoing streaming bulk operations are finalized and all the data is then ready for additional processing within the bulk load ODS.

Phase Two

The second phase utilizes a dependency graph based on a model of the data structure. The data relationships are processed based on operation types of “Create”, “Update” or “Merge”. Abstract base types are processed in the context of their concrete aggregates. Children of an aggregate are always processed after the aggregate’s root. The graph edges determine the aggregate processing order and can be influenced. For example, when security is enabled, a dependency on an entity that has to be authorized through another entity, would result in the dependent entity’s dependency being moved to the entity required for authorization.

As the nodes become available for processing, activities are utilized to ensure data processing integrity and persist the data. This includes association checks between records as well as security checks, when enabled. The last activity is always the Merge Data Activity which selects the appropriate data processing strategy to integrate the data into the current ODS.

Association Validation Activity

Association validation only occurs on aggregate root entities (including slices of abstract bases). The association checks validate the external associations of the aggregate, including associations from a child to another aggregate root. Aggregate children are not permitted to have dependencies on one another directly. Aggregates are processed and validated in Phase One as a complete resource and are internally consistent.

To identify any records that are dependent on other records that are not in the current ODS, cross schema/database queries are used to ensure that all of the records in the source table have their dependencies satisfied. Records that are in violation are flagged as logically deleted in the bulk load ODS and the violation is included as the reason for the failure.

Authorization Activities

There are two authorization activity concepts that are utilized when security is enabled. As with the association validations, only aggregate roots and abstract base’s with concrete context are authorized. The first of these activities validates the API token in the security context for the bulk load operation has authorization to perform the basic create/update operation on the aggregate or abstract base within the slice of a given concrete aggregate. If the basic authorization fails, all of the records for that aggregate (or within that slice of the abstract base) including children are flagged as logically deleted, and the reason logged. No further activities are processed for this aggregate (or slice of the abstract base).

The second authorization activity is authorization filtering. This utilizes the NHibernate based security filters to ensure the API token authorizes not just the update activity, but that it is also permitted based on specific associations. The filters are used to construct queries that validate the authorization for the filtered entity. When the filters are applied, any records in violation are marked as logically deleted in the bulk load ODS, and the reason for the violation logged. When these violations occur while processing an abstract base within a concrete context all records within the concrete entity and the abstract base for that concrete aggregate are logically deleted.

Merge Data Activity

The final activity is the propagation of the data from the bulk load ODS to the current ODS. This is done by the appropriate strategy based on the operation (Create, Update, Merge) and the entity type. The strategies use various SQL operations like update but also utilize SQL Server-specific MERGE statements. The bulk load ODS contains a generated series of views that serve several purposes. The first purpose is to allow resolved sever assigned ids from the current ODS to be included with associated data from in bulk load ODS. The second purpose is to filter out records that are flagged as logically deleted in the bulk load ODS.

The semantics of the Ed-Fi ODS/API’s PUT/POST methods are such that the resource submitted in the request is treated as a wholesale replacement for the target resource, including all child collections. To delete a resource, a DELETE request must be made against the API. However, to delete an item from a resource’s child collection, a version of the updated resource (one without the item in the child collection) must be POSTed to the API.

To stay consistent with the semantics of the current implementation, the strategies do not perform any deletions on the “aggregate root” current ODS tables. The API must still be used to delete resources.

However, given that the strategies operate on each table individually (rather than each aggregate), there is a bit of special logic required to correctly handle the “wholesale replacement” semantics when it comes to child tables within an aggregate. Consider the example below where a Student resource is updated with a newer version that does not include one of the addresses and one of the telephone numbers.

Figure 3. Conceptual demonstration of the effect of resource “replacement” semantics on child records

Since writing an aggregate represents a wholesale replacement, any strategy that processes aggregate children ensures that the child records are represented correctly. The result of processing this example would be that address (A2) and phone number (T3) are deleted from the “Student” aggregate, and phone number (T4) is added.

Error Handling

After the processing activities are completed for an entity, an error handing task is initiated that aggregates the first 500 errors of each type for the entity into the EdFi_Bulk database in the BulkOperationExceptions table.

Console Bulk Loader

The console bulk loader will operate with the same command line arguments as before; however, it does require the bulk load ODS database. This database is automatically generated during the developer initdev process. Additionally, support for a new argument has been added, -t. The -t argument takes an API authorization token and uses it as the security context for the bulk load. When the argument is omitted, as before, no security is applied.