Versions Compared
Version | Old Version 20 | New Version 21 |
---|---|---|
Changes made by | ||
Saved on |
Key
- This line was added.
- This line was removed.
- Formatting was changed.
Introduction
Analytics Middle Tier 2.0 is becoming a full citizen in the Ed-Fi platform in 2020, rather than just a proof-of-concept on the Ed-Fi Exchange. As it grows up, it needs to correct some architectural concerns that came up as feedback from the field. It also needs to be up to par with the latest release of the ODS/API, version 3.3. This document aims to inform about the challenges and elicit feedback on the real-world usefulness of the proposed solutions.
Naming Convention
Requirement
Hold names to under 63 characters for PostgreSQL compatibility.
Design
Currently, none of the objects have a name that violates this constraint. To help avoid problems with future views, it is proposed to either truncate "Dimension" to "Dim" or drop the word altogether. Generally, clarity should be preferred over length when naming objects, hence the question is: how much clarity is lost if we move away from the "Dimension" suffix?
Old Name | A - Truncate | B - Drop |
---|---|---|
analytics.StudentDimension | analytics.StudentDim | analytics.Student |
Info |
---|
Is there too much risk of confusing the |
Status
Warning |
---|
This is a substantial breaking change compared to 1.0. Since 1.0 adoption is limited, this breaking change would also be very limited - but it needs to happen now if it is going to happen. Requesting additional input before committing. Respond to Stephen Fuqua, #dev-analytics-middle-tier on Slack, or with a comment on this page. |
Multi Data Standard Support
Requirement
Support installing the views on ODS databases supporting multiple data standards (2.2, 3.1, 3.2).
Warning |
---|
Is support for Data Standard 2 really necessary in this next release? Ed-Fi Alliance is committed to both 3.1 and 3.2 in all releases in 2020, but has not decided about 2.x. Welcoming input on this. |
Design
Version 1.3.0 added support for Data Standard 3.1, which was used by ODS/API 3.1.1 and 3.2, through the use of the –dataStandard <Ds2 | Ds31>
argument.
Forcing the user to remember which data standard is installed is sub-optimal. We should be able to detect this implicitly and install the correct version without user input.
- If table
AddressType
exists, then install Data Standard 2.if needed
- Else if table
VersionLevel
exists, then install Data Standard 3.1. - Else if table
DeployJournal
exists, then install Data Standard 3.2. - Else throw an error: "Unable to determine the ODS database version".
Status
Tip |
---|
Will proceed with this design. Stephen Fuqua |
Student, Parent, and Staff Keys
Requirement
The views should expose "Key" fields based on the natural key of the underlying table.
Design
In the case of StudentDimension
, ContactPersonDimension
, and UserDimension
, the original release used StudentUSI
, ParentUSI
, and StaffUSI
respectively. The "USI" columns are primary keys and were used by mistake. The "UniqueId" columns are the correct natural keys.
Change all instances of StudentKey
, ContactPersonKey
, and UserKey
to use the corresponding "UniqueId" column from the source table.
Tip |
---|
Will proceed with this design. Stephen Fuqua |
Descriptor and Type Mapping
Requirement
Decouple the views from hard-coded Descriptor and Type values.
Context
many of the views need to lookup records by Descriptor value - for instance, looking up the Attendance records where a student has an "Excused Absence" or "Unexcused Absence." Because the original developer had access to only a limited dataset, it was not realized that the Descriptor values will vary widely from one implementation to the next. Thus the hard-coding needs to be decoupled, allowing the implementation to provide a mapping from their Descriptor value to the concept used by the Analytics Middle Tier.
In theory, the various "Types" values in Data Standard 2 should provide a more universal constant than the Descriptors. However, some community members report that these too are mutable. Therefore, (a) using Types is not a solution for Data Standard 2, and (b) even those views with hard-coding to Types instead of Descriptors must be modified for greater independence. Note: Type tables were removed in Data Standard 3 precisely because they were not being used in the originally-designed manner.
Design
Summary
- Move hard-coded values to a "Constants" table.
- Create mapping tables that link Descriptors or Types to Constants.
- Modify all views as needed to join to the Constants and new mapping tables.
Expand | ||||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||||||||||||||||||||||||||||
|
Example
In Version 1.x, the StudentEarlyWarningFact
view reports on excused and unexcused absences, looking for StudentSchoolAttendanceEvent
records with attendance descriptor values of either "Excused Absence" or "Unexcused Absence".
In version 2, the view would now search for all StudentSchoolAttendanceEvent
records whose descriptor maps to the relevant constant. Thus there would be two DescriptorMap
values, one each for "Excused Absence" and "Unexcused Absence." Any school who uses a different term than these two would create a DescriptorMap
record mapping that term to the DescriptorConstant
value of "Absent".
Implications
Those who install the Analytics Middle Tier will need to carefully assess their Descriptors and Types, and then manage the DescriptorMap
table (and TypeMap
, for Data Standard 2) accordingly.
Default Mappings
A new command-line Option will be provided to run a script that loads the default Descriptor mapping for the default Ed-Fi descriptors (minimal/populated template descriptors).
Code Block |
---|
.\EdFi.AnalyticsMiddleTier.exe --connectionString "..." --options DefaultMap |
Status
Tip |
---|
Will proceed with this design. Stephen Fuqua |
Changes to the Student Dimension
UniquenessRequirementRequirements
- Create a "Student" dimension with a single unique key.
- Provide intuitive access to student demographics.
Context
Student Dimension Uniqueness
The Early Warning System fact views both assumed that the StudentDimension
would only have a single record for a student, as shown in these diagrams:
However, a student could be enrolled in multiple schools at the same time, resulting in two records in the StudentDimension for the same StudentKey. This is problematic for the PowerBI Starter Kit, which has a hard requirement for unique StudentKeys.
At the same time, the Program-related fact and event views are looking at a Student in a given Local Education Agency, whereas the facts above are looking at a Student at a School. Although the ODS database supports a Student enrolled in a Program associated with a School, our input from the field tells us that Program association is on the LEA not the school. Thus we need to balance these competing tensions in any change to the data model.
Design
Data Standard 3 moved the student demographics into theStudentEducationOrganizationAssociation
table, away from the Student
table. The current StudentDimension
view provides student Demographics in Ed-Fi UDM v3.x
Image Added
Data Standard 3 moved the student demographics into the StudentEducationOrganizationAssociation
table, away from the Student
table. Demographic information is thus closely aligned with school enrollment. Alternately, because this table is linked to Education Organizations instead of Schools, the demographics could be defined for the Local Education Agency instead of or in addition to being defined on the school enrollment.
The current StudentDimension
view provides student name, primary contact, and demographics. The proposal is to split this into two different views, changing the Early Warning System star schema to a snowflake.
The demographics on the view are secondary to the student-school enrollment relationship. Therefore we will have a new view called StudentEnrollment
to show active enrollment/school association information, with a StudentEnrollmentKey
that combines the StudentUniqueId
and SchoolId
(maintaining the principal of needing only one column for a join). The early warning views would then change to use the single StudentEnrollmentKey
in addition to the two separate StudentKey
and SchoolKey
columns. The original keys are retained to facilitate aggregating by student or school.
The StudentDimension
would continue to exist in a reduced form, providing name and primary contact information. The Program views would continue to have the StudentKey
in them (no changes).
Gliffy name StudentEnrollmentDimension pagePin 3
Impact on Security
By leaving StudentKey
and SchoolKey
in the views above, the current row-level security views will not require any changes, which looks intuitive but is misleading: if the student has multiple sets of records, then which do you choose? The Analytics Middle Tier will not be able to solve this problem. However, it should attempt to be clear about the problem so that the data analyst is not led down a false path.
Dimension or Fact?
While gender, race, and ethnicity all have strings associated with them, some elements of the demographic and enrollment data are more fact-oriented than dimension-oriented:
- IsHispanic
- IsEconomicallyDisadvantaged
- LimitedEnglishProficiency
- IsEligibleForSchoolFoodService
- SchoolEnrollmentDate
The ODS does not support slowly-changing dimensions, so there is only ever one current snapshot of these data - one cannot tie them to a date unless referring to the enrollment date.
Race is not included the the current views because it is not relevant in the Balfanz early warning system. It may be included in future use cases. Sex was provided out of convenience but could be removed (starter kits don't use it), or treated as a Key in a snowflake. In both cases, there is no universal binary definition that would become a Fact, without hard-coding "business logic" that should be decided by the use case or by the school/district.
Design Proposals
1. Shrink the Student Dimension
Only the following columns truly belong on a StudentDimension
that has unique primary keys:
Column | Source |
---|---|
StudentKey | edfi.Student.UniqueId |
StudentFirstName | edfi.Student.FirstName |
StudentMiddleName | edfi.Student.MiddleName |
StudentLastName | edfi.Student.LastSurname |
ContactName | from analytics.ContactDimension for the first primary contact record |
ContactRelationship | |
ContactAddress | |
ContactMobilePhoneNumber | |
ContactWorkPhoneNumber | |
ContactEmailAddress | |
LastModifiedDate | edfi.Student or analytics.ContactDimension |
Tip |
---|
Planning to proceed with this. Stephen Fuqua. |
2a. Kick the Can Down the Road
The Early Warning System use case needs to know about a student's school enrollment - which is already provided via the StudentEarlyWarningFact
view. It does not need any demographic information. That information was included in version 1.0 to support ad hoc analytics without having a specific use case to anchor the structure of the data. Therefore we could simply remove all demographic data from 2.0. Re-introduce in 2.1 (or beyond) when there is a specific use case.
Note |
---|
No decision yet, keep reading for alternate proposal. |
2b. Dimensional Demographics for Exploration
To support ad-hoc analytics exploration, continue supporting dimensional demographics data. Why dimension over fact?
- Fact views should generally be use case driven, and we don't have a clearly defined use case for demographics right now.
- A data analyst using a BI tool probably wants to slice data by demographics, which favors dimensional perspective over date-in-time fact or event perspective.
Image AddedWhen slicing or filtering by some bit of demographic data, a BI tool should automatically re-calculate aggregations / metrics. This is different than creating a report - this is about real-time exploration. For example, "how many fifth graders are at risk for dropout based on attendance, behavior, and course performance?" The analyst simply wants to filter on GradeLevel=Fifth. Then drill down to "ask" about limited English proficiency. In Power BI, this looks the image to the right. In this case, none of the fifth graders are recorded as having limited English proficiency, so the filter only shows "not applicable." Thus the ad hoc exploration just uncovered something interesting.
- That filter would not be so easy to achieve with a fact that is tied to a date, unless the data were provided for every single available date. Doing that would greatly increase the size of the analytics database without providing any additional value: since the ODS doesn't store past data, only current records, the demographic data would never change from one day to the next.
Create a new StudentDemographics
dimension that includes the following columns:
Column | Source |
---|---|
StudentKey | edfi.Student.StudentUniqueId via edfi.StudentSchoolAssociation.StudentUSI |
SchoolKey | edfi.StudentSchoolAssociation.SchoolId |
EnrollmentDateKey | edfi.StudentSchoolAssociation.EntryDate formatted as YYYY-MM-DD |
GradeLevel | edfi.Descriptor.CodeValue via edfi.StudentSchoolAssociation.EntryGradeLevelDescriptorId |
LimitedEnglishProficiency | edfi.Descriptor.CodeValue via edfi.StudentEducationOrganizationAssociation → If multiple records available in StudentEducationOrganizationAssociation, let School record's value take precedence over District's value. → Relies on the Descriptor Mapping described above |
IsEconomicallyDisadvantaged | edfi.Descriptor via edfi.StudentEducationOrganizationAssociationStudentCharacteristic → If multiple records available in StudentEducationOrganizationAssociation, let School record's value take precedence over District's value. → If record present, then value is true - not displaying the Descriptor value → Relies on the Descriptor Mapping described above |
IsEligibleForSchoolFoodService |
The proposal is to split this into two different views, changing the Early Warning System star schema to a snowflake.
The demographics on the view are secondary to the student-school enrollment relationship. Therefore we will have a new view called StudentEnrollment
to show active enrollment/school association information, with a StudentEnrollmentKey
that combines the StudentUniqueId
and SchoolId
(maintaining the principal of needing only one column for a join). The early warning views would then change to use the single StudentEnrollmentKey
in addition to the two separate StudentKey
and SchoolKey
columns. The original keys are retained to facilitate aggregating by student or school.
The StudentDimension
would continue to exist in a reduced form, providing name and primary contact information. The Program views would continue to have the StudentKey
in them (no changes).
Gliffy name StudentEnrollmentDimension pagePin 3
Impact on Security
By leaving StudentKey
and SchoolKey
in the views above, the current row-level security views will not require any changes.
Program Views
At the same time, the Program-related fact and event views are looking at a Student in a given Local Education Agency, whereas the facts above are looking at a Student at a School. Although the ODS database supports a Student enrolled in a Program associated with a School, our input from the field tells us that Program association is on the LEA not the school. Thus we need to balance these competing tensions in any change to the data model.
Problem
What if someone wants to build queries that analyze program participation by demographic data?
Suppose we have the following data (partial representation):
ProgramTypeDimension
ProgramTypeKey | ProgramType |
---|---|
3 | Spanish Grammar for Native Speakers |
StudentProgramEvent
StudentKey | LocalEducationAgencyKey | DateKey | ProgramTypeKey | ProgramEventType |
---|---|---|---|---|
1 | 2 | 20191230 | 3 | Enter |
StudentEnrollmentDimension
StudentEnrollmentKey | StudentKey | SchoolKey | EnrollmentDate | GradeLevel | LimitedEnglishProficiency |
---|---|---|---|---|---|
1-4 | 1 | 4 | 20190110 | 8 | Limited |
1-5 | 1 | 5 | 20191230 | 8 | Not applicable |
Here a student is enrolled at two schools, presumably in the same district. Perhaps the program enrollment is due to a specialized class in the student's home language, at a different school than the one attended daily. Since data entry occurred twice - once for each school - there is an opportunity for inconsistency. In this case, one of the school enrollment entries failed to note that the student has limited English Proficiency.
We can write the following query to analyze this program proficiency by LimitedEnglishProficiency.
Code Block |
---|
select ProgramTypeDimension.ProgramType, count(1) as EnrolledCount, sum(case when StudentEnrollmentDimension.LimitedEnglishProficiency <> 'Not applicable' then 1 else 0 end) as LimitedEnglishProficiencyCount from analytics.ProgramTypeDimension inner join analytics.StudentProgramEvent on ProgramTypeDimension.ProgramTypeKey = StudentProgramEvent.ProgramTypeKey inner join analytics.StudentEnrollmentDimension on StudentProgramEvent.StudentKey = StudentEnrollmentDimension.StudentKey where StudentProgramEvent.ProgramEventType = 'Enter' group by ProgramTypeDimension.ProgramType |
The result will be
ProgramType | EnrolledCount | LimitedEnglishProficiencyCount |
---|---|---|
Spanish Grammar for Native Speakers | 2 | 1 |
In reality there is only one student, yet this naive query counted two students. And it will have the odd result of showing that 50% of the students have limited English proficiency!
Arguably, there is a data quality problem that the Analytics Middle Tier cannot solve. But there is also a training issue: the data analyst needs a deeper understanding of the nuances of Ed-Fi data storage and of the Analytics Middle Tier views. Ignoring the data inconsistency problem, the following query would resolve the double-count problem:
Code Block |
---|
select ProgramTypeDimension.ProgramType, count(1) as EnrolledCount, sum(case when StudentDimensionEnrollment.LimitedEnglishProficiency <> 'Not applicable' then 1 else 0 end) as LimitedEnglishProficiencyCount from analytics.ProgramTypeDimension inner join analytics.StudentProgramEvent on ProgramTypeDimension.ProgramTypeKey = StudentProgramEvent.ProgramTypeKey outer apply ( select distinct LimitedEnglishProficiency from analytics.StudentEnrollmentDimension where StudentKey = StudentProgramEvent.StudentKey ) as StudentDimensionEnrollment where StudentProgramEvent.ProgramEventType = 'Enter' group by ProgramTypeDimension.ProgramType |
These problems are rather ugly, and would be difficult to detect unless you're actively looking for them. Furthermore, while writing an updated SQL query was easy, how would one represent this in a business intelligence tool like Tableau or PowerBI?
Status
Warning |
---|
Requesting additional input before committing. Respond to Stephen Fuqua, #dev-analytics-middle-tier on Slack, or with a comment on this page. |
School Year
Requirement
Add SchoolYear to help support longitudinal data.
Design
Data Standard 2's support for School Year is limited compared to Data Standard 3. The following dimension views could have a SchoolYear
column in them:
Data Standard 2 | Data Standard 3 |
---|---|
Student / Student Enrollment | Student / Student Enrollment |
Student Section | Student Section |
Date | |
Grading Period |
The DateDimension
view does not include a SchoolKey
as it was intended as a generic date table. Although it strikes the writer as a nonsensical, there could be two different school years for the same date at different schools. Therefore it does not seem appropriate to add this column to the DateDimension
.
There might be value in having the SchoolYear
in the GradingPeriod
and making the decision that Data Standard 2 would always return null for this column (so that the interfaces stay consistent): one could use this to create a filter hierarchy School Year > Grading Period. However, other than interpolating the begin and end dates on the Grading Period, there is currently no way to build out the hierarchy beyond this. The PeriodSequence
is insufficient for determining a parent-child relationship between grading periods.
When trying to filter the StudentEarlyWarningFact
or StudentSectionGradeFact
views, the available school years could be extracted from either the Student or Student Section dimensions to create a slicer. However, this would be much simpler if the SchoolYear
were simply included in those fact views.
Status
Warning | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Requesting additional input before committing. Respond to Stephen Fuqua, #dev-analytics-middle-tier on Slack, or with a comment on this page. Questions to address. Should we add a
|
Separation Between Core and Use-Case Views
Requirement
Manage a collection of "core" views and separate collections of use-case specific views.
Design
The application already has a concept for installing optional components, which was first created for optional install of additional indexes in the ODS. Proposal:
- Always install a core set of views
- ContactPersonDimension
- DateDimension
- GradingPeriodDimension
- LocalEducationAgencyDimension
- MostRecentGradingPeriod
- SchoolDimension
- SchoolNetworkAssociationDimension
- StudentDimension
- StudentEnrollmentDimension (if created, see above)
- StudentSectionDimension
- Move some of the existing views into new optional collections:
- Row-level Security (RLS)
- StudentDataAuthorization
- UserAuthorization
- UserDimension
- Early Warning System (EWS)
- StudentEarlyWarningFact
- StudentSectionGradeFact
- QuickSight-Early Warning System (Q-EWS)
- Ews_SchoolRiskTrend
- Ews_StudentAttendanceTrend
- Ews_StudentEnrolledSectionGrade
- Ews_StudentEnrolledSectionGradeTrend
- Ews_StudentIndicators
- Ews_StudentIndicatorsByGradingPeriod
- Ews_UserSchoolAuthorization
- Program Analysis (PROGRAM)
- ProgramTypeDimension
- StudentProgramEvent
- StudentProgramFact
- Row-level Security (RLS)
- Establish clear standards for distinguishing view names and avoiding name overlaps
- Option 1: separate by "namespace" (schema). Instead of having a single
analytics
schema, we could create ananalytics_core
schema and other schemas to match use cases:analytics_rls
for Row-level Securityanalytics_ews
for Early Warning Systemanalytics_quicksight
for QuickSightanalytics_program
for Program Analysis
Each new use case, whether created by the Alliance or submitted by the field, would get its own schema name.
- Option 2: keep everything in a single schema, ensuring unique names, so that downstream data models (without namespaces/schemas) do not need to name their models differently than the views.
- Option 1: separate by "namespace" (schema). Instead of having a single
Tip | ||
---|---|---|
Thus to install the Early Warning System and Row-level security collections used by the Power BI Starter Kit v2, the admin user would run this command:
|
For deployment purposes, it would be ideal to separate these by "namespace" - or in SQL terms, by schema.
Warning |
---|
Requesting additional input before committing. Respond to Stephen Fuqua, #dev-analytics-middle-tier on Slack, or with a comment on this page. |
Additional Views
Will not add any new views in the 2.0 release. New views can be added with 2.1, 2.2 etc. This 2.0 release is all about fixing architectural problems and setting the stage for broader adoption.
Note | ||
---|---|---|
| ||
Work-in-progress draft. This notice will be removed when the "final" design decisions are documented. :
|
Table of Contents |
---|