SDG Multi-Year Support Analysis

This proposed functionality has not been implemented in Sample Data Generator.


In light of the In The Weeds discussion on 4/23/2021, capturing here some additional documents and thoughts on the state of Sample Data Generator. Please note that these comments were derived from analysis towards the end of 2019 so they may be out-of-date.

This analysis is not intended to be actionable, but perhaps the knowledge may be useful to others. It is primarily written from the context of investigating multi-year support for SDG.

Existing Functionality

The Ed-Fi Sample Data Generator (SDG) produces XML files for bulk import into an ODS.

The SDG generates sample data for a single school year. The tools included with SDG have not been engineered with eventual multiple school year support in mind.

At a high level, projects potentially impacted by adding multiple school year support are:

  • EdFi.CalendarGenerator.Console
    • Currently accepts a required school year parameter to generate CalendarDate.csv, GradingPeriod.csv, and Session.csv files.
  • EdFi.MasterScheduleGenerator.Console
    • Currently accepts a required school year parameter to generate CourseOffering.csv and Section.csv files.
  • EdFi.SampleDataGenerator.Core
    • Currently works with a single global TimeConfig XML element to generate data within the specified start and end dates of a single school year.

SDG incorporates a seed data generation capability, which produces core information about students in files that can be used as input for subsequent executions of SDG. This is useful if your usage scenarios benefit from some stability of information when demo data is rebuilt. For example, your E2E tests may require that a particular student name always exists in the demo data.

Available Sample Data Configurations

The SDG includes two sample school districts:

SampleConfig.xml

  • District Name: Grand Bend ISD
  • Number of Schools: 1

NorthridgeConfig.xml

  • District Name: Northridge ISD
  • Number of Schools: 5

Confusingly, the populated template created by the sandbox generator uses the Grand Bend ISD district but has three schools associated with it. It appears that this template was generated from a configuration that was not included in the SDG download.

The Northridge data configuration is newer. It was created for SDG to provide a more realistic sample school district.

Multi-Year Support Analysis

Broadly, there are two approaches we can consider taking for enabling multiple school year support.

  1. The SDG will support multiple school years and will produce a set of files for population into an empty ODS.
  2. The seed capability is enhanced so that the SDG can be run multiple times for each school year individually, and the data collectively represents cohesive and realistic patterns of behavior.

Initially, our analysis led us to pursue the first option, and code changes were made to partially achieve this.

A few issues surfaced:

  1. Code changes would require a significant redesign of the SDG architecture because the concept of a single year is baked in as a global object throughout.
  2. Initialization of generator providers occurred in a constructor design pattern that only made sense in a single school year context.
  3. Some kind of contextual carry-over from one school year to the next (so that students and staff follow realistic patterns of behavior) could not be added without further refactoring of the overall architecture.

The scope of work became much larger than originally anticipated, so other options were considered. A high-level analysis of the second approach listed above led us to conclude that it could be achieved without a major overhaul of the existing architecture. Furthermore, it has the added benefit of potentially allowing an existing demo system to be extended over time.

Note that this analysis did not include any work to identify a strategy for carrying forward logical patterns of behavior from one school year to the next, but our evaluation is that these challenges will be faced no matter which approach is selected.

Usage Goals

This is how we envision SDG to work with the second approach:

  1. Generate Education Organization files.
  2. Generate Calendar files for year 1.
  3. Generate Master Schedule files for year 1.
  4. Configure Config XML file for year 1.
  5. Run SDG in seed mode.
  6. Run SDG to use seeded data to output XML files for year 1.
  7. Load into empty ODS with XML bulk import.

Note that all of the steps above are already supported today. Now we add a second year:

  1. Generate Calendar files for year 2.
  2. Generate Master Schedule files for year 2.
  3. Configure Config XML file for year 2.
  4. Run SDG to use seeded data to output XML files for year 2.
  5. Load into ODS (that already has year 1 data) with XML bulk import.

There will be some inefficiencies with the year 2 bulk import because global data is already loaded. We could optimize for this, but we'll probably just let the unnecessary updates happen.

There is a possibility we will enhance seed data to capture more activities in year 1, and we may also need to generate an incremental seed in subsequent years to support more consistent real-world behaviors. For example, students shouldn't randomly enroll into different schools each school year. This will likely be managed as incremental improvements to the software as needs arise.

Additional school years are added to the ODS by repeating the steps for year 2.

There are a lot of files created and consumed by the SDG tools. SDG documentation does not describe the console apps (instead the data they produce are included as samples to work from). Some thought needs to be given on how to manage this. Likely we will want to change where files are created and read from to include some kind of folder structure by school year.

We may also want to create PowerShell scripts (or enhance the command line capabilities of the SDG tools) to simplify the steps in generating multiple school years at once.
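As a rough illustration of the workflow above, the following Python sketch lists the steps for any number of school years and assigns each year its own working folder. It only produces a plan as text and does not invoke any SDG tool. The function name `plan_steps`, the `sdg-work` root folder, and the year-labelled folder layout are assumptions made for this example, not part of SDG.

```python
from pathlib import Path


def plan_steps(first_year, year_count, root="sdg-work"):
    """Return the ordered steps for generating and loading several school years.

    Hypothetical helper: the folder layout (one folder per school year under
    `root`) is an assumption, not something SDG does today.
    """
    steps = ["Generate Education Organization files"]
    for offset in range(year_count):
        start = first_year + offset
        label = f"{start}-{start + 1}"  # same form as 2016-2017
        folder = (Path(root) / label).as_posix()
        steps.append(f"[{label}] Generate Calendar files in {folder}")
        steps.append(f"[{label}] Generate Master Schedule files in {folder}")
        steps.append(f"[{label}] Configure Config XML file")
        if offset == 0:
            # Seed mode runs once; later years reuse the seeded data.
            steps.append(f"[{label}] Run SDG in seed mode")
        steps.append(f"[{label}] Run SDG using seeded data to output XML files")
        target = "empty ODS" if offset == 0 else "ODS that already holds earlier years"
        steps.append(f"[{label}] Bulk import XML into {target}")
    return steps


if __name__ == "__main__":
    for number, step in enumerate(plan_steps(2016, 2), start=1):
        print(f"{number}. {step}")
```

The first year contributes six steps after the one-time Education Organization step, and each later year contributes five, matching the two lists above.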

Undocumented Console Apps

These console apps are not described in the SDG documentation. They generate input files required by the SDG. The generated files are included in the downloaded SDG samples, so I guess the intent is that you don't need to run them directly so long as you're working with the existing SDG config files. However, since we're extending the SDG, we will need to understand these undocumented console apps.

EdFi.CalendarGenerator.Console Command Options

| Short Option | Long Option | Description | Required | Value |
|---|---|---|---|---|
| -t | -termType | Type of term used by school | | Semester (default). No others currently supported |
| -g | -gradingPeriodLength | Length (in weeks) of grading period. | Y | Must be either 6 or 9 |
| -s | -schoolStartDate | Date on which school year begins | Y | |
| -i | -schoolId | Id(s) of schools for which you want calendar data to be generated | | (space separated) |
| -f | schoolFile | Path to School.csv file that contains target School Ids | | |
| -w | -workDays | Number of teacher-only work days per grading period | | 0 (default) |
| -b | badWeatherDays | Number of bad weather days per grading period | | 0 (default) |
| -o | -outputPath | Path where output files will be stored | | "" (default) |

Note: The included sample data has -outputPath set to "./EducationOrgCalendar".

Produces:

  • /EducationOrgCalendar
    • /CalendarDate.csv
    • /GradingPeriod.csv
    • /Session.csv

Possible PowerShell script:

  • Execute command multiple times: adjust schoolStartDate parameter and optionally other parameters to create a variance.
  • Include school year in output filenames.
  • Concatenate year-specific files into a single file.
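The first two script ideas above could look roughly like the following Python sketch, which builds one argument list per school year from the short options in the table. It only constructs the command lines and does not run them. The executable name `EdFi.CalendarGenerator.Console.exe`, the ISO date format passed to `-s`, and the per-year output subfolder are assumptions, since none of them is confirmed by the SDG documentation.

```python
from datetime import date


def calendar_generator_args(start_dates, grading_period_weeks=6,
                            school_file="School.csv",
                            output_root="EducationOrgCalendar"):
    """Build one EdFi.CalendarGenerator.Console argument list per school year."""
    if grading_period_weeks not in (6, 9):
        # The options table states the value must be either 6 or 9.
        raise ValueError("gradingPeriodLength must be 6 or 9")
    commands = []
    for start in start_dates:
        label = f"{start.year}-{start.year + 1}"
        commands.append([
            "EdFi.CalendarGenerator.Console.exe",  # assumed executable name
            "-g", str(grading_period_weeks),
            "-s", start.isoformat(),               # assumed date format
            "-f", school_file,
            "-o", f"{output_root}/{label}",        # assumed per-year folder
        ])
    return commands


if __name__ == "__main__":
    for args in calendar_generator_args([date(2016, 8, 22), date(2017, 8, 21)]):
        print(" ".join(args))
```

Writing each year to its own folder is one way to keep the year out of the file names themselves, so the tool's fixed output names (CalendarDate.csv and so on) do not collide between runs.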

EdFi.MasterScheduleGenerator.Console Command Options

| Short Option | Long Option | Description | Required | Value |
|---|---|---|---|---|
| -y | -schoolYear | School year to generate (in form 2016-2017) | Y | |
| -s | -schoolFile | Path to the School.csv | | ".\\School.csv" (default) |
| -c | -courseFile | Path to the Course.csv file | | ".\\Course.csv" (default) |
| -p | -classPeriodFile | Path to the ClassPeriod.csv file | | ".\\ClassPeriod.csv" |
| -l | -locationFile | Path to the Location file | | ".\\Location.csv" |
| -o | -outputPath | Path where output files should be written | | ".\" (default) |

Produces:

  • /MasterSchedule
    • /CourseOffering.csv
    • /Section.csv

Possible PowerShell script:

  • Execute command multiple times: adjust schoolYear parameter.
  • Include school year in output filenames.
  • Concatenate year-specific files into a single file.
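The script ideas above might be sketched in Python as follows: one helper builds the per-year arguments from the documented `-y` and `-o` options, and another concatenates year-specific CSV content while keeping a single header row. The executable name, the per-year output folder, and the column names in the usage example are assumptions for illustration. The real Section.csv layout is not described here.

```python
import csv
import io


def master_schedule_args(first_year, year_count, output_root="MasterSchedule"):
    """Build one EdFi.MasterScheduleGenerator.Console argument list per year."""
    commands = []
    for offset in range(year_count):
        label = f"{first_year + offset}-{first_year + offset + 1}"
        commands.append([
            "EdFi.MasterScheduleGenerator.Console.exe",  # assumed executable name
            "-y", label,
            "-o", f"{output_root}/{label}",              # assumed per-year folder
        ])
    return commands


def concatenate_csv(texts_by_year):
    """Merge year-specific CSV texts into one, writing the header only once."""
    merged = io.StringIO()
    writer = csv.writer(merged, lineterminator="\n")
    header = None
    for year in sorted(texts_by_year):
        rows = list(csv.reader(io.StringIO(texts_by_year[year])))
        if not rows:
            continue  # skip empty files
        if header is None:
            header = rows[0]
            writer.writerow(header)
        elif rows[0] != header:
            raise ValueError(f"Header mismatch in file for {year}")
        writer.writerows(rows[1:])
    return merged.getvalue()


if __name__ == "__main__":
    # Hypothetical column names, used only to show the merge.
    sections = {
        "2016-2017": "SchoolId,SectionName\n1,ALG-1\n",
        "2017-2018": "SchoolId,SectionName\n1,ALG-2\n",
    }
    print(concatenate_csv(sections), end="")
```

Rejecting mismatched headers is a deliberate choice in this sketch: silently merging files with different columns would produce a CSV that loads but contains misaligned data.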

Reverse Engineering DataClock Design Goals

The SampleConfig.xml file provided in the SDG download includes the following:

The comments describe DataPeriods as dividing the school year into intervals. Grading period, semester, and year are given as possible alignments. The config files provided with SDG contain only one DataPeriod, so it covers the school year as a whole.

However, the provided GradingPeriod.csv and Session.csv calendaring sample files describe a two-term school year with six six-week grading periods. Furthermore, EdFi.CalendarGenerator.Console currently only supports a two-term (i.e., semester) school year.

The question becomes: how is DataPeriod intended to be used by the SDG? Is the given example, with a single DataPeriod spanning the whole school year, a simplified representation of the actual school calendar that works for the intended use cases, or is it a misconfiguration that makes the generated data less realistic? We need to understand this since our scope of work is to enhance the school calendar to support multiple years.

The documentation for Data Periods states:

  • The SDG has the capability to simulate the passage of time through a given school year via the concept of data periods. At the most basic level, a data period is simply a period of time for which the SDG will generate and output data. Mutators, when present, run at the end of each data period.

  • Data period Name elements will be used when crafting output file names so that data can easily be grouped by time period for bulk loading. The most frequent use case is data periods which neatly align with school calendar periods such as grading periods, terms, or school years.

Also, in the SDG extension example:

  • You'll place any business logic for generating one-time data into GenerateCore; logic for data that should be generated per data period should be placed in GenerateAdditiveData.

The extension example also demonstrates a different kind of DataClock configuration with six DataPeriod elements representing grading periods.

To further understand the purpose of DataPeriod, we examine how it is used. Ignoring configuration, validation, and testing code, the use cases for DataPeriod are all contained in the EdFi.SampleDataGenerator.Core project as follows:

  • DataGeneration\Coordination\GlobalDataGenerationCoordination
    • Call GenerateAdditiveData, RunMutators, OutputGeneratedData on the GlobalDataGenerators for each data period.
    • A review of all generators that derive from GlobalDataGenerationCoordination indicates that GenerateAdditiveData essentially does nothing for these generators.
  • DataGeneration\Coordination\StudentDataGenerationCoordination
    • Call GenerateAdditiveData, RunMutators, OutputGeneratedData on the StudentDataGenerators for each data period (particularly review Run method). Only those that implement GenerateAdditiveData are listed:
      • CourseTranscriptEntityGenerator
        • Uses an internal StudentTranscriptSession collection to determine CourseTranscript information.
      • DisciplineActionEntityGenerator
        • Creates DisciplineActions by DataPeriod within the timeframe that the student is enrolled.
      • DisciplineIncidentEntityGenerator
        • Creates DisciplineIncidents by DataPeriod within the timeframe that the student is enrolled.
        • Q: How are multiple students associated with the same incident?
      • StudentAcademicRecordEntityGenerator
        • Creates student academic records using transcript sessions completed within the data period.
      • StudentAssessmentEntityGenerator
      • StudentDisciplineIncidentAssociationEntityGenerator
      • StudentGradeEntityGenerator
      • StudentProgramAssociationEntityGenerator
      • StudentSchoolAssociationEntityGenerator
      • StudentSectionAssociationEntityGenerator
      • StudentSectionAttendanceEventEntityGenerator
  • DataGeneration\Generators\StudentAssessment\StudentAssessmentEntityGenerator
  • Serialization\Output\StudentDataOutputCoordinator
    • Produces output XML whereby each data area is split into distinct files per data period.
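
The per-data-period flow described above can be pictured with the following Python sketch. It is not the SDG implementation (which is C#). The class names, the placement of the one-time GenerateCore call before the first data period, and the `Student-<Name>.xml` file name pattern are assumptions chosen only to show the call order and the per-period output split.

```python
from dataclasses import dataclass, field


@dataclass
class DataPeriod:
    name: str  # used when crafting output file names


@dataclass
class SketchGenerator:
    """Stand-in for a generator. It records the order in which it is called."""
    log: list = field(default_factory=list)

    def generate_core(self):
        self.log.append("core")  # one-time data

    def generate_additive_data(self, period):
        self.log.append(f"additive:{period.name}")  # per data period

    def run_mutators(self, period):
        self.log.append(f"mutate:{period.name}")  # end of each data period

    def output_generated_data(self, period):
        self.log.append(f"output:{period.name}")
        return f"Student-{period.name}.xml"  # hypothetical file name pattern


def run(generator, periods):
    """Generate one-time data once, then process each data period in order."""
    generator.generate_core()
    files = []
    for period in periods:
        generator.generate_additive_data(period)
        generator.run_mutators(period)
        files.append(generator.output_generated_data(period))
    return files


if __name__ == "__main__":
    periods = [DataPeriod("GradingPeriod1"), DataPeriod("GradingPeriod2")]
    print(run(SketchGenerator(), periods))
```

Under this picture, a config with a single DataPeriod produces one file per data area for the whole year, while a config with six grading-period DataPeriods produces six.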