AMT and Data Lake Storage
The Analytics development team has done something very clever when building an API-to-AMT implementation: make use of the link
elements shown in the screenshot below. This is a problem for Project Meadowlark - Exploring Next Generation Technologies because the Meadowlark does not (as of ) respond with these links. Ideally, any "data out" scenario from an Ed-Fi API should be built so that it would work with anyone's implementation of an Ed-Fi API.
API to AMT Algorithm
Here’s the basic algorithm for getting from the API to an AMT-looking “table”. This could be a database table or a filesystem (data lake). This document will illustrate the data lake way.
- Make API calls like
GET .../ed-fi/educationOrganizations
GET .../ed-fi/studentEducationOrganizations
GET .../ed-fi/students
- Each of those requests results in an array of responses, with body something like…
[ { “id”: “some string”, …. Other data …. }, { … }, etc. ]
- For each item in the array, extract the ID value and run
GET …/ed-fi/<resource>/<id>
- Save the resulting JSON document under a structured file system:
/root/ed-fi/v3.3.1-b/<resource>/<id>.json - Now you want to run a query that will generate a denormalized document that includes: Student name and Institution Name
- Open a
/root/ed-fi/v3.3.1-b/studentEducationOrganizations/<id>.json
document - 💡 Use the
educationOrganizationReference.link.href
to find theeducationOrganizations/<id>.json
document and extract theInstitutionName
- 💡 Likewise use the
studentReference.link.href
to find thestudents/<id>.json
document and extract thefirstName
andlastSurname
.
Imagine the link is not there. You open the studentEducationOrganizations
file and you only have the EducationOrganizationId
and the StudentUniqueId
. With only those values, you have absolutely no way of finding the other files that you want to open. You just have to open every single one until you get it right. Not good!
You might ask: why not store the file as <educationOrganizationId>.json
or <studentUniqueId>.json
. Well, what about a Section
document, which has a five part key? Do you want to store the file with all five parts of the key in the file name – which can then get rather long?
Meadowlark
Meadowlark stores exactly what was sent to it, and it returns exactly what it stored. When you POST a resource, of course you aren’t including that link
element – so a GET request does not contain that link. It would be possible to modify the document and add the link, because the id
value is not a random value in Meadowlark: it is a calculated value built from the natural key of the resource. It concatenates the natural key together (again, this could be quite long), then calculates a SHA256 hash code, and then base64 encodes that to have a < 40 character unique ID.
So, Meadowlark could modify the document without having to query and find the id
value for that reference. Or… the same procedure used by Meadowlark could be employed by the AMT solution: store the files using that calculated hash value. But now when you open a file and want to find all related files, you have to recalculate the hash value with every operation. That will become computationally expensive.
AMT Lookup "Table"
Another option is for the AMT data lake solution to create a “lookup table” that stores both the link.href value and a calculated hash value. In SQL terms this would be like having the following tables and joins.
This has the advantage of being useful for Meadowlark. The “IdMappings” table is quick-and-dirty and can probably down to just “href” and “id”. In fact, this is similar to how Meadowlark handles referential integrity checks, through the help of a References lookup table. Side note: the lookup table in the middle of the diagram probably does not need source and destination, since the href
to a student would be the same regardless of whether the source is studentEducationOrganizationAssociation
or any other document that references a student. And the "destination" (students) is already embedded into the href
value and thus easy for a human to understand.
Summary of Options Thus Far
To summarize, right now we have these ideas for how we can have a single data lake structure that works for both the .NET API and Meadowlark:
- Modify Meadowlark so that it enriches the received document with the
link
information when storing data and responding to GET requests. Continue storing files using the document’sid
value. In Meadowlark, thatid
value is the natural key hashcode. - Modify the data lake solution to store files using hash-codes of the natural key as the file name, instead of using the Ed-Fi ID in the file name. Optimize for joins by creating a pre-calculated mapping between the
link.href
(which contains the Ed-Fi ID value) and the calculatedid
value.
Option 2 is more work than option 1, but feels more "pure" in that it is not modifying the received document.
That said, there are other reasons to consider modifying the document on a GET response: enriching it with createdBy
, createDate
, and lastModifiedDate
(and possibly eTag
, unless it goes into a response header). If we decide to add those metadata elements, why not add the link
? And adding the link
would make it easier for someone else to create their own data lake solution. So maybe option 1 really is the other way to go.
Or perhaps someone can come up with an option 3?