_Client-Side Bulk Load Utility
- Ian Christopher (Deactivated)
- Chris Moffatt (Deactivated)
The Client-Side Bulk Load Utility is designed to facilitate consistent and efficient loading of XML data files through the API. The primary feature of the tool is the ability to process only those data elements that have changed since the last run. This leads to substantial load time improvements on subsequent loads once an initial full import is completed. Other features include the option to effectively resume a load after stopping as well as the ability to load data using the full security model, including profiles.
While the Client-Side Bulk Load is like the _Console Bulk Loader and Bulk Upload API in that it also loads XML files, it does so with more restrictions because it uses the transactional API with its associated security, which may not be appropriate for every situation. The following table summarizes the differences between the various XML loading mechanisms.
Console Bulk Loader | Bulk Upload API | Client-Side Bulk Load | |
---|---|---|---|
Enforces Security | Yes* | Yes | Yes |
Complete Load Time | Best | Better | Good |
Incremental Load Time | Slightly Better | Good | Best |
Suspend / Resume Loading | No | No | Yes |
Load Types | Yes | No | No |
Profiles Support | No | No | Yes |
Parallel Loading | Yes** | No | Yes** |
Client Memory Use | High | High*** | Low |
* Command line option
** Parallel loading for a shared ODS / API and targeting different data stores
*** Memory use on the server hosting the bulk load services
Overview of Solution Architecture
The Client-Side Bulk Load Utility consists of a .NET 4.6 solution. It uses C# as the primary language, with TPL (Task Parallel Library) for pipeline processing support. Infrastructure libraries include log4net for logging, nUnit for unit test, and Newtonsoft.Json and Microsoft's WebApi client libraries for web API integration. The solution consists of five projects, with the primary output being a console application for use on the command line.
The projects involved are:
- EdFi.LoadTools: Main class library, provides all functionality used by the application, including pipeline management, resource parsing, resource transformation, and resource posting.
- EdFi.ApiLoader.Console: Output project to generate the executable for running the application.
- EdFi.LoadTools.Test: Contains all unit and integration tests for the application.
The general design of the application is a pipeline that takes XML files as input. The pipeline has the following general flow:
Step | Action |
---|---|
Load Resource Hash Cache | Loads the most recently saved file version of the hash in a folder based on configuration or command line parameters. |
Load JSON Metadata From Swagger | Loads the metadata information from a local cache, if available. Otherwise, loads it from the Swagger url provided through configuration or command line parameters. |
Load XML Metadata From XSD | Loads the metadata information from the XSD files contained in a folder specified in the configuration or as a command line parameter. This step is used to provide data type and other metadata used to create mappings between the XML and API. |
Validate XML Against XSD | Schema validates the input file against the provided XSD. The files are validated based on their filename against the provided XSD. The internal declaration is ignored. This step may be skipped by providing a |
Cache All Possible Reference Targets | Scans the XML file to detect any elements with an XML id and adds them to a file specific XmlReferenceCache for later use. |
Pre-load Any Mandatory Reference Sources | Scans the XML file to detect any elements with an xml ref and pre-loads any that are found before the id they reference. |
Get All Resources For The Current Interchange File | Scans the XML file and streams out every top level node as a resource for further processing |
Compute and Compare Resource Hash | Calculates the algorithmic hash of the XML node string representation, after standardizing for formatting and spacing. Determines if the hash value exists in the hash file from the previous run:
|
Resolve XML References | Checks for any XML id elements and loads them into the XmlReferenceCache. Checks for any XML ref elements and resolves them against the XmlReferenceCache. |
Map XML Element to JSON Element Names | Uses a best match string algorithm to identity relevant mappings between the XML metadata and the JSON metadata. |
Submit Resource to Web API | Uses a web client instance to submit the resource to the configured API endpoint using a configured key and secret. If the submission errors, it goes into an auto retry loop based on the configuration. |
Add to Final Output Hash File | Updates the new hash file, which is saved as the application runs. This will be used on the next run of the application. |
Log Error | Any errors are logged based on log4net configuration to provide information for following up on load errors. |
Usage Guidelines
Generally speaking, the Client-Side Bulk Load Utility is designed to be integrated into an existing automation system.
The application supports configuring a variety of settings, including the max retry count and locations for the input files and XSDs. These settings can all be configured through the app.config or by passing them in as command line parameters. Anything not passed in by command line will default back to the app.config settings.
It is recommend to run the tool daily, but it can be run at any frequency based on the data coming in. The tool will only process what has changed since the last time it was run. The tool does rely on receiving full extracts of the data, so partial files might cause it to resubmit some records. It relies on a record existing in every file to know that that entry hasn't changed. This is still fairly low risk, and as the ODS / API will handle upserting the record as needed, the issue this raises is just a time and efficiency cost.
The code for the Client-Side Bulk Load Utility is located in the Ed-Fi-ODS repository in the \Utilities\DataLoading\EdFi.ApiLoader.Console directory.
Input files are loaded based on pattern-based matching derived from the interchange names. As the application goes through each interchange, it checks for valid files that can be imported for the current interchange. This requires files to match one of three patterns, where InterchangeName represents the name of an interchange:
- Named "InterchangeName.xml"
- Matches the file name pattern "InterchangeName-*.xml"
- Files located inside a sub folder named "InterchangeName", with a file name matching the pattern "*.xml"
Command Line Parameter Definitions
Arg | Long Arg | Description |
---|---|---|
a | apiurl | The web API URL (e.g., http://server/data/v3 ) |
d | data | Path to folder containing the data files to be submitted |
f | Force reload of metadata from metadata url | |
i | interchangeorder | Path to a folder containing the Ed-Fi ODS / API Interchange Order metadata files |
k | key | The web API OAuth key |
m | metadataurl | The metadata URL (e.g., http://server/metadata ) |
n | No argument. Do not validate the XML document against the XSD before processing | |
o | oauthurl | The OAuth URL (e.g., http://server/oauth ); the default OAuth URL for the ODS / API is http://server/ |
p | profile | The name of an API profile to use (optional) |
r | retries | The number of times to retry submitting a resource |
s | secret | The web API OAuth secret |
w | working | Path to a writable folder containing the working files, such as the swagger metadata cache and hash cache |
x | xsd | Path to a folder containing the Ed-Fi Data Standard XSD Schema files |
y | year | The target school year for the web API (e.g., 2018 ) |
Use Tips
Hash Files
Each time that the Client-Side Bulk Load tool runs, it creates a new .hash file. This file contains state information for every element submitted to the ODS / API.
When the tool runs, it computes the hash value of the incoming XML elements against the values contained in the file. When a match is found, the element is not submitted, but the hash value is copied into the new .hash file. By doing so, only unchanged elements are loaded. If the actual data in the database changes using any mechanism other than the Client-Side Bulk Load tool, the .hash files are rendered invalid and should be deleted. Some scenarios where it is useful to delete one or more .hash files are:
- A clean database load from a minimal ODS database is desired.
- A database restore occurred (if the exact time of the backup is known, it may be correlated to the corresponding .hash file).
- The underlying database was updated via a SQL client.
- Another tool (Console Bulk Loader, Bulk Upload API client, or Transactional API client) has modified data.
Metadata Files
When the Client-Side Bulk Load tool runs, it checks for a metadata.json file that contains a summary of the Swagger metadata published by the ODS / API. This file is automatically created when the tool runs and takes a few seconds to complete. If the Client-Side Bulk Load tool fails to connect to the ODS / API during this process, an empty metadata.json file may be created. An empty metadata.json file should be deleted before running the tool again. Likewise, if the ODS / API changes for any reason, the metadata.json file should be deleted so that new mapping information can be stored.
An alternate, and possibly preferred, approach is to include the -f
flag to force metadata to be reloaded rather than using the locally cached information.
Debugging
While the Client-Side Bulk Load Utility is considered stable, there can still be issues when loading large chunks of data. Individual resources can fail to load for a variety of reasons. In general though, the application will either fail early in the file processing or it will fail on the API submission of individual resources.
If an input file fails XSD validation, the source data and any subsequent transform processes leading up to the XML file output should be investigated. Common issues that can cause XSD validation failure are malformed XML (missing closing brackets or invalid characters), missing a source element for an embedded XML reference, or data not matching the Data Standard. Other scenarios that can happen at file processing time include not finding input files or metadata configuration. Ensure all the input directories and configuration settings are correctly configured. See above for configuration documentation.
In some cases, the data provided has some issues. These are typically problems with references, such as trying to create a Program at a school that doesn't exist, or are related to API security, such as trying to add parents to a student without first enrolling the student. In this case, the provider of the data needs to review and determine a solution for fixing the data. These issues are typically associated with 400-series HTTP error codes, and occasionally they can be the cause of a 500-series error.
In other cases, the mapping logic can end up misaligned, particularly if the data standard is being changed or new extensions are being added. The mapping uses a best match string algorithm to determine where to line up XSD components with JSON components, and the mappings that are being used can be reviewed by running unit tests in the MetaEd.Tests project (specifically, by running the MetadataMappingFactoryTests). Mapping issues are usually a symptom of 400-series HTTP error codes, and occasionally they can be the cause of a 500-series error.
As a last resort, you can check the API itself. Particularly in 500-series error code scenarios, its worthwhile to check the log file (located at %ProgramData%\Ed-Fi-ODS-API) to determine the details of the reported error. At this point normal debugging rules apply for working through Ed-Fi ODS / API issues.