Purpose
The primary goals for introducing preprocessor enhancements to Data Import are to provide support for a wider range of file import use cases and to improve the experience of working with PowerShell scripts within Data Import.
Specifically:
- Provide support for non-CSV files (e.g. tab-delimited, fixed-width, or XML)
- Enable reshaping of the row grain of the file (one row becomes multiple, or vice versa)
- Enable exclusion of rows (e.g. student assessment rows that indicate the student was not tested)
- Enable invocation of the ODS API from preprocessor scripts (dynamically selecting the API used by the executing agent)
- Improve management and sharing of the PowerShell scripts used by preprocessors
Preprocessing Capabilities in Data Import
Data Import already supports two PowerShell integration capabilities for preprocessing of files:
- Custom Record Processing
- Provided via an optional selection of a Row Processor (PS1 file) on the agent screen.
- This feature enables scripting of modifications to field values within a row, including adding new fields.
- Custom File Generation
- Provided via a "File System / PowerShell" agent type choice on the agent screen, which enables selection of a Generator (PS1 file).
- This feature enables scripting of file generation on a predefined schedule.
With the proposed enhancements, a third preprocessing capability will be supported:
- Custom File Processing
- Provided via an optional selection of a File Processor (PS1 file) on the map screen.
- This feature integrates the source file into the PowerShell pipeline, supporting a wide variety of transformations that produce tabular output compatible with the existing mapping designer.
Understanding Custom File Processing
The PowerShell script implementation of Custom File Processing works a little differently from the existing Custom Record Processing.
How It Works: Custom Record Processing
The existing Custom Record Processing is configured on the agent and operates row by row.
Set the following at the top of the file:
param ($row)
The parameter must be named $row.
The parameter variable exposes a property for each column in the row. Existing properties can be read or updated, and new properties can be created. Examples:
$row.'sasid' = $row.'sasid'.Trim()
$row.'newfield' = $row.'sasid' + '!'
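Putting these pieces together, a complete row processor is a small script. The sketch below reuses the 'sasid' column from the examples above and assumes a hypothetical 'school' column; substitute the columns present in your own file:
param ($row)

# Normalize an existing field by trimming surrounding whitespace.
$row.'sasid' = $row.'sasid'.Trim()

# Add a new field derived from existing values ('school' is a hypothetical column).
$row.'studentkey' = $row.'school' + '-' + $row.'sasid'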
How It Works: Custom File Processing
By contrast, Custom File Processing uses pipeline input via a PowerShell advanced function, and it is configured on the data map rather than on the agent.
Set the following at the top of the file:
[CmdletBinding()]
Param(
    [Parameter(Mandatory = $true, ValueFromPipeline = $true)][string[]]$file
)
Alternatively, substitute a scalar string parameter for the string array parameter if you want your script to be invoked once per line instead of once for the whole file:
[Parameter(Mandatory = $true, ValueFromPipeline = $true)][string]$line
The PowerShell advanced function uses input processing methods, represented by the Begin, Process, and End blocks of the function, to process the stream as it passes through the PowerShell pipeline.
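A minimal skeleton of that structure (per-line variant shown; the block bodies here are illustrative) looks like this:
[CmdletBinding()]
Param(
    [Parameter(Mandatory = $true, ValueFromPipeline = $true)][string]$line
)
Begin {
    # Runs once, before any input arrives: initialize state or emit a header row.
}
Process {
    # Runs once per pipeline object: transform and emit each line.
    Write-Output $line
}
End {
    # Runs once, after all input has been consumed: emit any trailing output.
}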
Custom File Processing Examples
This example demonstrates processing the entire file at once. The Begin and End blocks are optional. They're included here for illustration.
[CmdletBinding()]
Param(
    [Parameter(Mandatory = $true, ValueFromPipeline = $true)][string[]]$file
)
Begin {
    Write-Information "Converting tabs to commas"
}
Process {
    Write-Output $file.Replace("`t", ",")
}
End {
    Write-Information "Finished converting tabs to commas"
}
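As a rough way to exercise the script locally, outside Data Import (the file and script names below are hypothetical), the whole file can be sent through the pipeline as a single object:
# The unary comma wraps the lines in an outer array, so the entire file
# arrives as one pipeline object and binds to the [string[]]$file parameter.
,(Get-Content .\students.tab) | .\ConvertTabsToCommas.ps1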
This example demonstrates processing each line individually, converting a fixed-width file into CSV output.
[CmdletBinding()]
Param(
    [Parameter(Mandatory = $true, ValueFromPipeline = $true)][string]$line
)
Begin {
    # Describe the fixed-width layout: each field's name, start position, and length.
    $fieldDefinitions = @(
        [PSCustomObject]@{ "Name" = "SCHOOL_ID";  "Start" = 0;  "Length" = 10 }
        [PSCustomObject]@{ "Name" = "STUDENT_ID"; "Start" = 10; "Length" = 10 }
        [PSCustomObject]@{ "Name" = "FIRST_NAME"; "Start" = 20; "Length" = 12 }
        [PSCustomObject]@{ "Name" = "LAST_NAME";  "Start" = 32; "Length" = 12 }
        [PSCustomObject]@{ "Name" = "GENDER";     "Start" = 44; "Length" = 6 }
    )
    # Emit the CSV header row once, before any data rows.
    Write-Output (($fieldDefinitions | ForEach-Object { $_.Name }) -join ",")
}
Process {
    # Convert one fixed-width line into one CSV row.
    $builder = [System.Text.StringBuilder]::new()
    $fieldDefinitions | ForEach-Object {
        $field = $line.Substring($_.Start, $_.Length)
        $needsQuotes = ($field -match ",")
        $isLast = ($_.Start + $_.Length -ge $line.Length)
        if ($needsQuotes) { $builder.Append("`"") | Out-Null }
        $builder.Append($field.Trim()) | Out-Null
        if ($needsQuotes) { $builder.Append("`"") | Out-Null }
        if (!$isLast) { $builder.Append(",") | Out-Null }
    }
    Write-Output $builder.ToString()
}
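For instance, given the field layout above, a hypothetical input line "12345     67890     Jane        Doe         F" (padded with trailing spaces to 50 characters) yields:
SCHOOL_ID,STUDENT_ID,FIRST_NAME,LAST_NAME,GENDER
12345,67890,Jane,Doe,F
The header row is emitted once by the Begin block; each Process invocation then appends one CSV row.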
Design Considerations (WIP)
Do we need both Custom Record Processing and Custom File Processing?
There is an overlap between the capabilities of the existing Custom Record Processing and the new Custom File Processing feature:
- Both can modify values at the row level, including adding new fields
- Custom File Processing additionally supports processing of the entire file
Are there any use cases for the agent-based Custom Record Processing that will not be supported by the new map-based Custom File Processing feature?
Where does Custom File Generation fit in?
Should we implement ODS API invocation support in the existing Custom Record Processing (i.e., on the agent) rather than the new Custom File Processing (on the data map)?