AMT and Data Lake Storage

The Analytics development team has done something very clever when building an API-to-AMT implementation: make use of the link  elements shown in the screenshot below. This is a problem for Project Meadowlark - Exploring Next Generation Technologies because the Meadowlark does not (as of ) respond with these links. Ideally, any "data out" scenario from an Ed-Fi API should be built so that it would work with anyone's implementation of an Ed-Fi API.

API to AMT Algorithm

Here’s the basic algorithm for getting from the API to an AMT-looking “table”. This could be a database table or a filesystem (data lake). This document will illustrate the data lake way.

  1. Make API calls like
    1. GET .../ed-fi/educationOrganizations
    2. GET .../ed-fi/studentEducationOrganizations
    3. GET .../ed-fi/students

  2. Each of those requests results in an array of responses, with body something like…
    [ { “id”: “some string”, …. Other data …. }, { … }, etc. ]

  3. For each item in the array, extract the ID value and run GET …/ed-fi/<resource>/<id>

  4. Save the resulting JSON document under a structured file system:

    /root/ed-fi/v3.3.1-b/<resource>/<id>.json

  5. Now you want to run a query that will generate a denormalized document that includes: Student name and Institution Name
    1. Open a /root/ed-fi/v3.3.1-b/studentEducationOrganizations/<id>.json document
    2. 💡 Use the educationOrganizationReference.link.href to find the educationOrganizations/<id>.json document and extract the InstitutionName
    3. 💡 Likewise use the studentReference.link.href to find the students/<id>.json document and extract the firstName and lastSurname.

Imagine the link is not there. You open the studentEducationOrganizations file and you only have the EducationOrganizationId and the StudentUniqueId. With only those values, you have absolutely no way of finding the other files that you want to open. You just have to open every single one until you get it right. Not good!

You might ask: why not store the file as <educationOrganizationId>.json or <studentUniqueId>.json. Well, what about a Section document, which has a five part key? Do you want to store the file with all five parts of the key in the file name – which can then get rather long?

Meadowlark

Meadowlark stores exactly what was sent to it, and it returns exactly what it stored. When you POST a resource, of course you aren’t including that link  element – so a GET request does not contain that link. It would be possible to modify the document and add the link, because the id  value is not a random value in Meadowlark: it is a calculated value built from the natural key of the resource. It concatenates the natural key together (again, this could be quite long), then calculates a SHA256 hash code, and then base64 encodes that to have a < 40 character unique ID.

So, Meadowlark could modify the document without having to query and find the id  value for that reference. Or… the same procedure used by Meadowlark could be employed by the AMT solution: store the files using that calculated hash value. But now when you open a file and want to find all related files, you have to recalculate the hash value with every operation. That will become computationally expensive.

AMT Lookup "Table"

Another option is for the AMT data lake solution to create a “lookup table” that stores both the link.href value and a calculated hash value. In SQL terms this would be like having the following tables and joins.

This has the advantage of being useful for Meadowlark. The “IdMappings” table is quick-and-dirty and can probably down to just “href” and “id”. In fact, this is similar to how Meadowlark handles referential integrity checks, through the help of a References lookup table. Side note: the lookup table in the middle of the diagram probably does not need source and destination, since the href  to a student would be the same regardless of whether the source is studentEducationOrganizationAssociation or any other document that references a student. And the "destination" (students) is already embedded into the href  value and thus easy for a human to understand.

Summary of Options Thus Far

To summarize, right now we have these ideas for how we can have a single data lake structure that works for both the .NET API and Meadowlark:

  1. Modify Meadowlark so that it enriches the received document with the link information when storing data and responding to GET requests. Continue storing files using the document’s id value. In Meadowlark, that id value is the natural key hashcode.

  2. Modify the data lake solution to store files using hash-codes of the natural key as the file name, instead of using the Ed-Fi ID in the file name. Optimize for joins by creating a pre-calculated mapping between the link.href  (which contains the Ed-Fi ID value) and the calculated id value.

Option 2 is more work than option 1, but feels more "pure" in that it is not modifying the received document.

That said, there are other reasons to consider modifying the document on a GET response: enriching it with createdBy, createDate, and lastModifiedDate (and possibly eTag, unless it goes into a response header). If we decide to add those metadata elements, why not add the link? And adding the linkwould make it easier for someone else to create their own data lake solution. So maybe option 1 really is the other way to go.

Or perhaps someone can come up with an option 3?