ODS / API Optimized Bulk Load
Bulk load processing in the Ed-Fi ODS / API is done in two phases. The first phase processes the data from Ed-Fi XML files into the Bulk Load ODS. The second phase processes the data in the Bulk Load ODS into the current ODS. The Bulk Load ODS is a modified version of the current ODS. The modifications remove foreign key constraints and introduce new columns, tables, views, and aliases for bulk processing.
Bulk Operation Session
Before a bulk operation begins processing, it is assigned a bulk load ODS through a bulk operation session. By default, this assignment uses the EdFi_Ods_Bulk connection string and a templated database name of EdFi_Ods_Bulk#, where # is the number of the database instance. The total number of deployed instances of the EdFi_Ods_Bulk database is specified in the BulkOdsDatabaseInstanceCount configuration element.
The bulk load ODS assignment is made by inserting the bulk load operation ID into the BulkOdsOperation table in the bulk load ODS. If the insert succeeds, that bulk load ODS database is locked for the duration of the operation. When the bulk operation session finishes (whether through successful completion or failure), the operation ID is deleted. This deletion triggers the truncation of every table in that bulk load ODS, leaving it in a reusable state.
If, for troubleshooting purposes, the bulk load ODS should remain locked, the configuration entry ReleaseBulkDatabaseAfterProcessing can be set to false. Note that this will cause all bulk load ODS operations performed by that process to leave the bulk load ODS in a locked state. Again, this setting is intended for debugging purposes only.
To release a locked bulk load ODS, delete the operation's row from the BulkOdsOperation table. The deletion triggers that bulk load ODS to reset itself to a reusable state.
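As a rough sketch, the acquire/release pattern described above might be implemented as follows. The BulkOdsOperation table name comes from this page; the OperationId column name, the dbo schema, and the error handling are assumptions for illustration.

using System;
using System.Data.SqlClient;

public static class BulkOdsSession
{
    // Attempts to lock a bulk load ODS instance by inserting the operation ID.
    // If the insert fails, the database is assumed to be claimed by another
    // operation and the caller should try the next EdFi_Ods_Bulk# instance.
    public static bool TryAcquire(string connectionString, Guid operationId)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "INSERT INTO dbo.BulkOdsOperation (OperationId) VALUES (@operationId)",
            connection))
        {
            command.Parameters.AddWithValue("@operationId", operationId);
            connection.Open();
            try
            {
                command.ExecuteNonQuery();
                return true;  // Insert succeeded; this bulk load ODS is now locked.
            }
            catch (SqlException)
            {
                return false; // Another operation holds this instance.
            }
        }
    }

    // Releases the bulk load ODS. Per the description above, the DELETE fires
    // a trigger that truncates every table, leaving the database reusable.
    public static void Release(string connectionString, Guid operationId)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "DELETE FROM dbo.BulkOdsOperation WHERE OperationId = @operationId",
            connection))
        {
            command.Parameters.AddWithValue("@operationId", operationId);
            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}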
Phase One
During the first bulk load phase, Ed-Fi XML files are processed by the XML shredder as "upserts" into the processing pipeline. Data persistence uses the Tabular Data Stream (TDS) format via the SqlBulkCopy class in .NET, the fastest mechanism available for loading data into SQL Server. This approach uses context-aware bulk caches to ensure that any aggregates using server-assigned identities (SAIDs), such as Student USI or Descriptor ID, are consistent in the Bulk Load ODS for every natural key reference. As a result, the interchanges can be loaded in any order within a bulk load operation.
Data processing in Phase One includes domain aggregate resource validations. After validation, the domain aggregate's data is written to the Bulk Load ODS by persisting the aggregate's component entities via the corresponding bulk copy operation for each entity's table. These operations are supplied with the data as a streaming entity source for the tabular data stream. Primary keys are tracked per table; duplicate key violations are rejected and logged.
Figure 1. SqlBulkCopy-based repository replacement
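A minimal sketch of the SqlBulkCopy-based persistence shown in Figure 1 follows. The real pipeline supplies a streaming IDataReader over the shredded entities; the DataTable and the example table name are simplifications for illustration.

using System.Data;
using System.Data.SqlClient;

public static class BulkTableWriter
{
    // Writes one entity table's rows into the bulk load ODS over the
    // Tabular Data Stream (TDS) protocol.
    public static void Write(string bulkOdsConnectionString, DataTable rows)
    {
        using (var bulkCopy = new SqlBulkCopy(bulkOdsConnectionString))
        {
            bulkCopy.DestinationTableName = rows.TableName; // e.g., "edfi.StudentAddress"
            bulkCopy.BulkCopyTimeout = 0; // No timeout for large loads.

            // Map source columns to identically named destination columns.
            foreach (DataColumn column in rows.Columns)
                bulkCopy.ColumnMappings.Add(column.ColumnName, column.ColumnName);

            bulkCopy.WriteToServer(rows);
        }
    }
}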
When interchange processing is complete, all ongoing streaming bulk operations are finalized and all the data is then ready for additional processing within the bulk load ODS.
Phase Two
The second bulk load phase uses a dependency graph based on a model of the data structure. The data relationships are processed based on operation types of "Create," "Update," or "Merge." Abstract base types are processed in the context of their concrete aggregates, and children of an aggregate are always processed after the aggregate's root. The graph edges determine the aggregate processing order and can be influenced. For example, when security is enabled, a dependency on an entity that must be authorized through another entity is moved, so that the dependent entity instead depends on the entity required for authorization.
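The ordering behavior can be illustrated with a simple Kahn-style topological sort; the graph representation below is a conceptual sketch, not the actual ODS implementation.

using System;
using System.Collections.Generic;
using System.Linq;

public static class LoadOrder
{
    // Yields aggregates in an order where every dependency is processed before
    // its dependents. Each aggregate must appear as a key, even when its
    // dependency set is empty.
    public static IEnumerable<string> Resolve(
        IDictionary<string, HashSet<string>> dependsOn)
    {
        var remaining = dependsOn.ToDictionary(
            kv => kv.Key, kv => new HashSet<string>(kv.Value));

        while (remaining.Count > 0)
        {
            // Aggregates with no unprocessed dependencies are ready.
            var ready = remaining.Where(kv => kv.Value.Count == 0)
                                 .Select(kv => kv.Key)
                                 .ToList();
            if (ready.Count == 0)
                throw new InvalidOperationException("Cycle in dependency graph.");

            foreach (var aggregate in ready)
            {
                yield return aggregate;
                remaining.Remove(aggregate);
            }

            // Mark the processed aggregates as satisfied dependencies.
            foreach (var dependencies in remaining.Values)
                dependencies.ExceptWith(ready);
        }
    }
}

In these terms, the security-driven adjustment described above amounts to editing an aggregate's dependency set before the sort runs, so that it points at the entity required for authorization.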
As the nodes become available for processing, activities are used to ensure data processing integrity and to persist the data. This includes association checks between records as well as security checks, when enabled. The last activity is always the Merge Data Activity, which selects the appropriate data processing strategy to integrate the data into the current ODS.
Association Validation Activity
Association validation only occurs on aggregate root entities (including slices of abstract bases). The association checks validate the external associations of the aggregate, including associations from a child to another aggregate root. Aggregate children are not permitted to have direct dependencies on one another; aggregates are processed and validated in Phase One as complete resources and are internally consistent.
To identify any records dependent on other records that are not in the current ODS, cross-schema/database queries are used to ensure that all of the records in the source table have their dependencies satisfied. Records in violation are flagged as logically deleted in the bulk load ODS, and the violation is recorded as the reason for the failure.
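Conceptually, such a check is a single set-based, cross-database statement. In the sketch below, the database, table, and column names (including the logical-delete flag and failure-reason column) are assumptions chosen for illustration.

public static class AssociationValidationSql
{
    // Flags bulk load ODS StudentSchoolAssociation rows whose referenced
    // school exists neither in the current ODS nor in the incoming bulk data.
    public const string FlagUnsatisfiedSchoolReferences = @"
        UPDATE bulk_ods
        SET    bulk_ods.IsLogicallyDeleted = 1,
               bulk_ods.ValidationFailureReason = 'School reference not found'
        FROM   EdFi_Ods_Bulk1.edfi.StudentSchoolAssociation AS bulk_ods
        WHERE  bulk_ods.IsLogicallyDeleted = 0
          AND  NOT EXISTS (SELECT 1
                           FROM   EdFi_Ods.edfi.School AS current_ods
                           WHERE  current_ods.SchoolId = bulk_ods.SchoolId)
          AND  NOT EXISTS (SELECT 1
                           FROM   EdFi_Ods_Bulk1.edfi.School AS incoming
                           WHERE  incoming.SchoolId = bulk_ods.SchoolId
                             AND  incoming.IsLogicallyDeleted = 0);";
}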
Authorization Activities
Two authorization activity concepts are used when security is enabled. As with the association validations, only aggregate roots and abstract bases with concrete context are authorized. The first activity validates that the API token in the security context for the bulk load operation has authorization to perform the basic create/update operation on the aggregate, or on the abstract base within the slice of a given concrete aggregate. If this basic authorization fails, all of the records for that aggregate (or within that slice of the abstract base), including children, are flagged as logically deleted, and the reason is logged. No further activities are processed for that aggregate (or slice of the abstract base).
The second authorization activity is authorization filtering. This uses the NHibernate-based security filters to ensure that the API token is authorized not just for the update activity in general, but also for the specific records, based on their associations. The filters are used to construct queries that validate the authorization for the filtered entity. When the filters are applied, any records in violation are marked as logically deleted in the bulk load ODS, and the reason for the violation is logged. When these violations occur while processing an abstract base within a concrete context, all records within the concrete entity and the abstract base for that concrete aggregate are logically deleted.
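As a sketch, a relationship-based filter applied to students might translate into a statement like the following; the authorization view, the claim parameter, and all object names are assumptions for illustration.

public static class AuthorizationFilterSql
{
    // Logically deletes bulk load ODS Student rows that the API token's
    // education organization claim (@claimedEdOrgId) does not authorize.
    public const string ApplyStudentRelationshipFilter = @"
        UPDATE bulk_ods
        SET    bulk_ods.IsLogicallyDeleted = 1,
               bulk_ods.ValidationFailureReason = 'Authorization denied'
        FROM   EdFi_Ods_Bulk1.edfi.Student AS bulk_ods
        WHERE  bulk_ods.IsLogicallyDeleted = 0
          AND  NOT EXISTS
               (SELECT 1
                FROM   EdFi_Ods.auth.EducationOrganizationIdToStudentUSI AS auth_view
                WHERE  auth_view.StudentUSI = bulk_ods.StudentUSI
                  AND  auth_view.EducationOrganizationId = @claimedEdOrgId);";
}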
Merge Data Activity
The final activity is the propagation of the data from the bulk load ODS to the current ODS. This is done by the appropriate strategy based on the operation (Create, Update, Merge) and the entity type. The strategies use standard SQL operations such as UPDATE, as well as SQL Server-specific MERGE statements. The bulk load ODS contains a generated series of views that serve two purposes. The first is to join server-assigned IDs already resolved in the current ODS with the associated data in the bulk load ODS. The second is to filter out records that are flagged as logically deleted in the bulk load ODS.
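A strategy for an aggregate root might therefore be a MERGE that reads from one of these generated views, as in the sketch below (the view and column names are assumptions). Note that, consistent with the API semantics discussed next, there is no WHEN NOT MATCHED BY SOURCE clause: aggregate-root rows are never deleted by the strategies.

public static class MergeStrategySql
{
    // Upserts Student aggregate roots from a generated bulk load ODS view
    // (which resolves server-assigned IDs and excludes logically deleted
    // rows) into the current ODS.
    public const string MergeStudentRoots = @"
        MERGE EdFi_Ods.edfi.Student AS target
        USING EdFi_Ods_Bulk1.dbo.Student_MergeView AS source
           ON target.StudentUniqueId = source.StudentUniqueId
        WHEN MATCHED THEN
            UPDATE SET target.FirstName   = source.FirstName,
                       target.LastSurname = source.LastSurname
        WHEN NOT MATCHED BY TARGET THEN
            INSERT (StudentUniqueId, FirstName, LastSurname)
            VALUES (source.StudentUniqueId, source.FirstName, source.LastSurname);";
}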
The semantics of the Ed-Fi ODS / API’s PUT and POST methods are such that the resource submitted in the request is treated as a wholesale replacement for the target resource, including all child collections. To delete a resource, a DELETE request must be made against the API. However, to delete an item from a resource's child collection, a version of the updated resource (one without the item in the child collection) must be POSTed to the API.
To stay consistent with the semantics of the ODS / API implementation, the strategies do not perform any deletions on the "aggregate root" current ODS tables. The API must still be used to delete resources.
However, given that the strategies operate on each table individually (rather than each aggregate), there is a bit of special logic required to correctly handle the "wholesale replacement" semantics when it comes to child tables within an aggregate. Consider the example below where a Student resource is updated with a newer version that does not include one of the addresses and one of the telephone numbers.
Figure 2. Conceptual demonstration of the effect of resource "replacement" semantics on child records
Since writing an aggregate represents a wholesale replacement, any strategy that processes aggregate children ensures that the child records are represented correctly. The result of processing this example would be that address (A2) and phone number (T3) are deleted from the "Student" aggregate, and phone number (T4) is added.
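A child-table strategy can implement this with a set-based delete scoped to the aggregates present in the load, as in this sketch (all object names are assumptions):

public static class ChildReplacementSql
{
    // Deletes StudentAddress child rows from the current ODS when the parent
    // Student aggregate is being rewritten by this load but the incoming
    // version of the resource no longer contains that address.
    public const string DeleteRemovedStudentAddresses = @"
        DELETE current_child
        FROM   EdFi_Ods.edfi.StudentAddress AS current_child
        WHERE  EXISTS (SELECT 1  -- the parent aggregate is in this load
                       FROM   EdFi_Ods_Bulk1.dbo.Student_MergeView AS incoming_root
                       WHERE  incoming_root.StudentUSI = current_child.StudentUSI)
          AND  NOT EXISTS (SELECT 1  -- but this child row is absent from it
                           FROM   EdFi_Ods_Bulk1.dbo.StudentAddress_MergeView AS incoming_child
                           WHERE  incoming_child.StudentUSI = current_child.StudentUSI
                             AND  incoming_child.AddressTypeId = current_child.AddressTypeId);";
}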
Error Handling
After the processing activities are completed for an entity, an error handling task is initiated that aggregates the first 500 errors of each type for the entity into the BulkOperationExceptions table in the EdFi_Bulk database.
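The per-type cap can be expressed as a single set-based insert. In the sketch below, the EdFi_Bulk database and BulkOperationExceptions table come from this page, while the source table and all column names are assumptions.

public static class ErrorAggregationSql
{
    // Copies at most the first 500 exceptions of each type for an entity
    // into the EdFi_Bulk database.
    public const string AggregateExceptions = @"
        INSERT INTO EdFi_Bulk.dbo.BulkOperationExceptions
                    (OperationId, ElementName, ExceptionType, ExceptionMessage)
        SELECT OperationId, ElementName, ExceptionType, ExceptionMessage
        FROM   (SELECT OperationId, ElementName, ExceptionType, ExceptionMessage,
                       ROW_NUMBER() OVER (PARTITION BY ExceptionType
                                          ORDER BY ExceptionId) AS RowNumber
                FROM   EdFi_Ods_Bulk1.dbo.ValidationFailures
                WHERE  OperationId = @operationId) AS ranked
        WHERE  ranked.RowNumber <= 500;";
}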
Console Bulk Loader
The console bulk loader requires the bulk load ODS database, which is automatically generated during the developer-facing initdev process. Of particular note is the console bulk loader's -t argument, which takes an API authorization token and uses it as the security context for the bulk load. When the argument is omitted, no security is applied.
The console bulk loader implements an optimized bulk load process based on SqlBulkCopy operations and set-based operations for referential validation and authorization (rather than single-row processing). This approach pushes much of the work into SQL Server and avoids the performance degradation caused by per-row round-trip latency.
This approach requires that the bulk load and current ODS databases be located on the same physical database server, since the validation and merge functionality executes queries that join bulk load and current ODS tables.