
The Data Import (DI) SIG had 3 goals:

  1. Understand and define pain points that Ed-Fi implementations experience while using Data Import as an Extract-Transform-Load (ETL) solution for non-API based data.

  2. Understand other solutions that may be available for reliable ETL pathways for non-API based data.

  3. Inform Data Import's future development and roadmap priorities based on community blockers, issues, and identified needs.

The outcomes of the SIG against these goals are:

1.) Data Import is reported as serving non-API ready data needs with an active community of users

The DI SIG shed light on how Data Import is used in the field today to incorporate non-API ready data into Ed-Fi data infrastructure.  With an active forum of more than 40 members, we've learned that the tool is meeting real needs for non-API ready data.  These conversations showed that Data Import is actively used in the assessment, educator preparation program (EPP), finance, and other domains where API pathways do not exist.  They also made clear that Data Import carries a maintenance burden for the implementer, one that is weighed against the need to bring such data into Ed-Fi environments.  Finally, the group recognized that direct API connections from education data-producing products remain the ideal and preferred path, since they relieve the maintenance burden of running additional ETL solutions.

2.) Numerous viable open-source and low-cost alternatives exist for serving the ETL need

The DI SIG reviewed a number of alternatives to Data Import, since the education and broader technology markets offer tools and products for extracting, transforming, and loading data of many types.  As a result of this review, the forum discovered and discussed several viable alternatives that can also load non-API data into Ed-Fi environments.

Education Analytics

Education Analytics is an organization that serves education agencies with a range of solutions and approaches for analytics that improve student outcomes.  The team uses Ed-Fi technology in many of its solutions and has deep knowledge of Ed-Fi's data model, ODS / API, and other facets of the platform.  They have built an open-source toolkit to transform and load data into the Ed-Fi API.  The technology is Python-based and designed to work well in cloud environments.

Details of this toolkit are below, with links to GitHub for the source code and documentation of each component:

  • Earthmover - CLI tool for transforming collections of tabular source data into a variety of text-based data formats via YAML configuration and Jinja templates.
  • Lightbeam - CLI tool for validating and transmitting payloads from JSONL files into an Ed-Fi API.
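
To make these concrete, below is a minimal Python sketch of a Lightbeam-style workflow: reading payloads from a JSONL file and posting them to an Ed-Fi API resource using the standard OAuth client-credentials flow.  The base URL, credentials, and the "students" resource are placeholder assumptions for illustration only; this is not Lightbeam's actual implementation (see its GitHub repository for that).

# Minimal sketch of a Lightbeam-style load: read JSONL payloads and POST them
# to an Ed-Fi API resource. Base URL, key/secret, and the "students" resource
# are hypothetical placeholders, not values from the SIG notes.
import json
import requests

BASE_URL = "https://example.org/v5.3/api"  # hypothetical ODS/API base URL
CLIENT_KEY = "my-key"                      # hypothetical credentials
CLIENT_SECRET = "my-secret"

def get_token() -> str:
    # Standard OAuth client-credentials exchange against the ODS/API token endpoint
    resp = requests.post(
        f"{BASE_URL}/oauth/token",
        data={"grant_type": "client_credentials"},
        auth=(CLIENT_KEY, CLIENT_SECRET),
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]

def send_jsonl(path: str, resource: str = "students") -> None:
    # POST each JSON line in the file to the named Ed-Fi resource endpoint
    headers = {"Authorization": f"Bearer {get_token()}"}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            resp = requests.post(
                f"{BASE_URL}/data/v3/ed-fi/{resource}",
                json=json.loads(line),
                headers=headers,
                timeout=30,
            )
            if resp.status_code not in (200, 201):
                print(f"Failed ({resp.status_code}): {resp.text}")

if __name__ == "__main__":
    send_jsonl("students.jsonl")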



  • EPP alternatives evaluation
    • Link to the published report (comes out today)
    • Summary brief of tools used:
      • Talend (free version)
      • Apache NiFi (open source)
      • Azure Data Factory (cloud-based Databricks)
      • AWS Glue (cloud-based Databricks)

Outcomes:  Ed-Fi will continue to rely on non-enterprise tools and ad-hoc integration for now; at some point, at-scale implementations should take a stronger look at the alternatives above.  There is no single "perfect" tool solution in this space.

Should we begin to look at how to work more closely with the tools above?

3.) Data Import users prefer an open-source path forward for the product

  • Based on conversations heard throughout 2022, Ed-Fi moved to open source Data Import
  • In November 2022, we released Data Import 2.0 to an open source repo
  • We've moved the Template Sharing Service to a GitHub Exchange repo

4.) Additional Data Import SIG discussions led to these feature requests

  • John Bailey - has noticed that the first time data is imported, it seems faster than subsequent imports  (SF:  
  • Emilio and Rosh - would like improved logging; the tool logs too much, and they have to truncate the log table often
  • Zurab - duplicated headers in files have posed issues; Emilio - the pre-processor could help with this issue (a sketch of this idea follows this list)
  • DI-1135 - Array Format in CSV
  • John Bailey - 1.3.2 included a Docker container, but it is unclear how to kick off a scheduled import
  • Zurab - the source code uses a library to work with FTP servers; the library does not work if the FTP server has a certain setting turned on. It works fine with SFTP, but not FTP.
  • Mike Werner - Documentation is lacking
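
As a concrete illustration of the pre-processor idea raised above for duplicated headers, here is a small Python sketch that renames repeated CSV column headers to unique names before a file is handed to Data Import.  The file names and renaming scheme are illustrative assumptions; Data Import's own pre-processor mechanism may take a different form.

# Illustrative sketch only: rename duplicated CSV headers (e.g. two "Score"
# columns) so the file can be mapped cleanly. Not part of Data Import itself.
import csv

def dedupe_headers(in_path: str, out_path: str) -> None:
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        headers = next(reader)
        seen = {}
        unique = []
        for name in headers:
            if name in seen:
                seen[name] += 1
                unique.append(f"{name}_{seen[name]}")  # second "Score" becomes "Score_1"
            else:
                seen[name] = 0
                unique.append(name)
        writer.writerow(unique)   # write de-duplicated header row
        writer.writerows(reader)  # copy the remaining rows unchanged

dedupe_headers("assessment_raw.csv", "assessment_clean.csv")  # hypothetical file names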