Data Import Preprocessor Enhancements Design

Purpose

The primary goals for introducing preprocessor enhancements to Data Import are to provide support for a wider range of file import use cases and improve the experience of working with Powershell scripts within Data Import.

Specifically:

  • Provide support for non-CSV files (e.g. tab-delimited, fixed-width, or XML)
  • Provide support for database connections via ODBC
  • Enable reshaping of the row grain of the file (one row becomes multiple or vice-versa)
  • Enable exclusion of rows (e.g. student assessment rows that indicate the student was not tested)
  • Enable invocation of the ODS API from preprocessor scripts (dynamically selecting the API used by the executing agent)
  • Improve the management and sharing capabilities of PowerShell scripts used by preprocessors
    • Maintain scripts in the database instead of the file system and provide a UI for managing
    • Include script dependencies in template sharing and import/export templates

Related epic:  EDFI-411 - Getting issue details... STATUS

Preprocessing Capabilities in Data Import

Data Import already supports two PowerShell integration capabilities for preprocessing of files:

  • Custom Record Processing
    • Provided via an optional selection of a Row Processor (PS1 file) on the agent screen.
    • This feature enables scripting of modifications to field values within a row.
    • More information
  • Custom File Generation
    • Provided via a "File System / PowerShell" agent type choice on the agent screen, which enables the selection of a Generator (PS1 file).
    • This feature enables the scripting of file generation on a predefined schedule.
    • More information

With the proposed enhancements, a third preprocessing capability will be supported:

  • Custom File Processing
    • Provided via an optional selection of a File Processor (PS1 file) on the map screen.
    • This feature integrates the source file into the PowerShell pipeline to support a wide variety of transformations resulting in a tabular output compatible with the existing mapping designer.

Understanding Custom File Processing

The PowerShell script implementation of Custom File Processing works a little differently than the existing Custom Row Processing.

How It Works: Custom Row Processing

The existing Custom Row Processing operates against the agent.

Set the following at the top of the file: 

  param ($row)

The parameter name must be called $row.

Properties of the parameter variable exist for each column. Properties can be read, updated, or new ones created. Examples:

$row.'sasid' = $row.'sasid'.Trim()

$row.'newfield' = $row.'sasid' + '!'

How It Works: Custom File Processing

By contrast, Custom File Processing uses pipeline input via an Advanced Function, and it operates against the data map rather than the agent.

Set the following at the top of the file:

[CmdletBinding()]

Param( 

    [Parameter(Mandatory = $true, ValueFromPipeline = $true)][string[]]$file

)

Alternatively, substitute the array of arrays parameter with a flat array parameter if you want your pipeline processing to be executed per row instead of for the whole file:

    [Parameter(Mandatory = $true, ValueFromPipeline = $true)][string]$line

The PowerShell advanced function uses input process methods, represented by the Begin, Process, and End blocks of the function, to process the stream as it passes through the PowerShell pipeline.

Custom File Processing Examples

This example demonstrates processing the entire file at once. The Begin and End blocks are optional. They're included here for illustration.

Example: Simple Tabs-Delimited to CSV Conversion
[CmdletBinding()]
Param( 
    [Parameter(Mandatory = $true, ValueFromPipeline = $true)][string[]]$file
)

Begin {
    Write-Information "Converting tabs to commas"
}

Process {
    Write-Output $file.Replace("`t", ",")
}

End {
    Write-Information "Finished converting tabs to commas"
}

This example demonstrates processing each line individually.

Example: Fixed Width to CSV Conversion
[CmdletBinding()]
Param( 
    [Parameter(Mandatory = $true, ValueFromPipeline = $true)]string[]$line
)

Begin {
    $fieldDefinitions = @(
        [PSCustomObject]@{ "Name" = "SCHOOL_ID";  "Start" = 0;  "Length" = 10; }
        [PSCustomObject]@{ "Name" = "STUDENT_ID"; "Start" = 10; "Length" = 10; }
        [PSCustomObject]@{ "Name" = "FIRST_NAME"; "Start" = 20; "Length" = 12; }
        [PSCustomObject]@{ "Name" = "LAST_NAME";  "Start" = 32; "Length" = 12; }
        [PSCustomObject]@{ "Name" = "GENDER";     "Start" = 44; "Length" = 6; }
    )

    Write-Output (($fieldDefinitions | %{ $_.Name }) -join ",")
}

Process {
    $builder = [System.Text.StringBuilder]::new()

    $fieldDefinitions | %{ 
        $field = $line.Substring($_.Start, $_.Length)
            
        $needsQuotes = ($Field -match ",")
        $isLast = ($_.Start + $_.Length -ge $line.Length)

        if ($needsQuotes) {
            $builder.Append("`"") | Out-Null
        }
            
        $builder.Append($field.Trim()) | Out-Null
            
        if ($needsQuotes) {
            $builder.Append("`"") | Out-Null
        }
        
        if (!$IsLast) {
            $builder.Append(",") | Out-Null
        }
    }

    Write-Output $builder.ToString()
}

Understanding ODS API Invocation

Note: This example uses the custom cmdlet "Invoke-OdsApiRequest". More information on this is provided in the section "PowerShell Sandbox"

This partial example demonstrates building an associated array of students for later lookups.

Invoking the ODS API from a Preprocessor Script
Write-Information "Preparing to load student data..."
$runStatus.ScriptStartTime = Get-Date
$runStatus.StudentLoadStartTime = Get-Date
$studentIds = @{}
try {
    $continue = $true
    $offset = 0
    $limit = 100
    while ($continue) {
        $response = Invoke-OdsApiRequest -UriPath "/students?limit=$limit&offset=$offset" -UseBasicParsing 
        if ($response.StatusCode -ne 200) { Write-Error "Error invoking the EdFi API: $_"; return }
        
        $students = ConvertFrom-Json $response
        if ($students.Length -gt 0) {
            foreach ($student in $students) {
                $districtId = "NOT SET"
                $stateId = "NOT SET"
                foreach ($idCode in $student.identificationCodes) {
                    if ($idCode.studentIdentificationSystemDescriptor -eq "District") { $districtId = $idCode.identificationCode }
                    if ($idCode.studentIdentificationSystemDescriptor -eq "State") { $stateId = $idCode.identificationCode }
                }
                $studentIds.Add($stateId, $districtId)
            }
        }
        else {
            $continue = $false
        }
        $offset += $limit
    }
}
catch {
    Write-Error "Error loading list of Student State/District IDs from the ODS: $($_.Exception.Message)"
    return
}

Understanding the Big Picture

It is helpful to understand how files are processed today, and how this workflow

Design Considerations

Do we need both Custom Record Processing and Custom File Processing?

There is an overlap between the capabilities of the existing Custom Record Processing and the new Custom File Processing feature:

  • Both can modify values at the row level.
  • Custom File Processing additionally supports adding new fields to the row as well as the processing of the entire file for more complex scenarios.

Since Custom File Processing provides a superset of functionality, there is a consideration to remove (or at least deprecate) the existing Custom Record Processing feature:

  • Benefits of removing:
    • Strategically, if two file preprocessing methods are not needed, then we should progress to that end state.
    • Avoid additional work to improve refactor Custom Record Processing to support improved preprocessing management and sharing capabilities planned for Customer File Processing.
    • However, we will need to identify a migration strategy for customers currently using Custom Record Processing.
  • Benefits of retaining:
    • If there are use cases for agent-specific processing. 
    • Opportunity to incorporate invocation of ODS /API from agent preprocessor (rather than data map preprocessor), which naturally aligns with the concrete API server connection.

Recommendation:

  • Retain Custom Record Processing capability. An example scenario in which we may want to apply a preprocessor script to an agent rather than the map is to resolve state-assigned student ids to student unique ids for select customers depending on how a particular file specification was populated.

Does Custom File Generation need to be considered in this design?

Custom File Generation uses a PowerShell script in an entirely different context than Customer Record Processing and Custom File Processing. Whereas the latter items modify the content of an uploaded file during processing, Custom File Generation executes a script on a schedule to generate a file.

As a result, Custom File Generation is not impacted by the proposed preprocessor enhancements.

However, there is value in consistently managing all PowerShell scripts. Currently, these scripts are discovered and used from known paths in the file system. The proposed enhancements include migrating script storage to the database and provide a UI for managing the scripts. It is likely beneficial if these enhancements are applied to Custom File Generation as well.

One of the goals stated by Jason for preprocessing is to support database access via ODBC. Custom File Generation is a good fit for this.

Where will ODS API invocation support be implemented?

There are three possibilities:

  • API invocation is supported by Custom Record Processing (on Agent screen)
    • This works well if we are confident that any logic requiring the use of the API happens after the file has been converted to tabular data by the Custom File Processor.
    • However, it is of course only applicable if we retain the Custom Record Processing capability.
  • API invocation is supported by Custom File Processing (on Data Map screen)
    • Building a Data Map requires the discovery of source file columns. When a Custom File Processor is used and that script requires the use of the API, then we must prompt for an API Connection for executing the script (per Multi-Connection enhancements).
    • If we use this approach, it may be preferred that ODS API support for the preprocessor is explicitly enabled as a configuration option for each applicable Preprocessor script.
  • API invocation is supported by Custom Record Processing and Custom File Processing
    • This favors flexibility for undiscovered use cases.

Recommendation:

  • We will support API invocation for all preprocessors.

Security concerns?

Providing the capability for an application to dynamically execute user-provided PowerShell scripts adds a potentially significant attack vector to the web app and transform load console app.

While Data Import already supports the execution of PowerShell scripts, those scripts are read from the file system. With these enhancements, scripts can be managed via the UI.

We need to put some thoughts into ensuring effective security measures are considered, whether this is constraints on script capabilities or documenting risks and recommendations for restricting access to the server. We can also consider an option to disable the use of scripts.

Recommendation:

  • Implement PowerShell Sandbox (described in more detail in its own section)

Template Sharing Service compatibility?

TSS storage accepts the payload as a single JSON file that contains all the packaged assets. We can add PowerShell scripts to this without requiring changes to the service.

However, by providing support for PowerShell scripts in templates, Data Import version compatibility becomes a consideration. An older version of Data Import will likely ignore the addition of PowerShell scripts and new properties referencing them, but this will result in the user importing data maps that are non-functional.

Is it sufficient to say that by convention any usage related information will be described in the template description?

It is also proposed that Custom File Processing is the only preprocessor supported by template sharing. 

Recommendation:

  • Custom File Processing scripts are automatically included in the shared template based on map selections.

PowerShell Sandbox

Certica evaluated a number of strategies to provide reduce the security exposure of executing user-maintained PowerShell scripts in Data Import:

  • Script signing (requires a public key infrastructure)
  • Execute in a constrained runspace (with only the cmdlets needed for mapping available)
  • Execute in container (introduces a complicated infrastructure and dependencies)

Striking a balance between maximizing security and retaining deployment and usage simplicity, the recommended is to execute script in-process using a constrained runspace, colloquially a "PowerShell Sandbox"

Characteristics of the sandbox:

  • The runspace is initialized with no available cmdlets.
  • Cmdlets designed specifically for Data Import use are added to the runspace
    • Invoke-OdsApiRequest
    • New-ArrayList
    • New-StringBuilder
    • ...
  • Cmdlets explicitly whitelisted by the administrator are added to the runspace (probably will be defined in a config file)
    • High-risk cmdlets should be avoided (New-Object, Invoke-WebRequest, Invoke-Expression).

Importing modules:

  • Via configuration, a list of modules to be imported into the constrained runspace is defined.
  • Cmdlets designed specifically for Data Import are packaged as a PowerShell binary module and included by default.
  • PowerShell binary modules and PowerShell script modules can be imported into the constrained runspace by configuration.

(warning) Breaking change:

  • Cmdlets used by existing Custom Row Processor and Custom File Generator scripts must be added to the whitelist.

User Interface Changes

Menu

Manage Preprocessors

Add Preprocessor

Add Data Map

Export Template

  • Custom File Processor script associated with selected Data Maps should be automatically included (similar to Lookups).
  • If Custom Row Processors exist, an additional step in the wizard will be shown after selecting Data Maps called to provide an optional selection of Custom Row Processors.
    • NOTE: This requirement was struck because agents are not shareable items, and the added functionality of supporting the export of Custom Row Processors seems unnecessary.
  • Custom File Generators are not supported in templates.
  • No UI changes are required for importing templates.

Database Changes

TableChangeJustification
Scripts

New Table:

  • Id (PK, int)
  • Name (nvarchar(255) not null)
  • ScriptType (nvarchar(20), not null)
  • ScriptContent (nvarchar(max) null)
  • RequireOdsApiAccess (bit, not null)

Supports storage of PowerShell scripts.

EF Core note: Implement ScriptType as an enum.

DataMaps

New column:

  • FileProcessorScriptId (FK, int, null)
Optionally associate a data map with a Custom File Processor.
Agents

Removed columns:

  • RowProcessor (nvarchar(max), null)
  • FileGenerator (nvarchar(max), null)

New columns:

  • RowProcessorScriptId (FK, int, null)
  • FileGeneratorScriptId (FK, int, null)

Replace Custom Row Processor and Custom File Generator filenames with database references.

This will require a specialized migration:

  • Create a Script record with Name of the current value.
  • Reference it from the new column, and drop the old column.
  • On web application startup, for any Script records with a null ScriptContent value, get the file from the file system (by Name and known path per ScriptType) and populate contents into ScriptContent.