Extension Spike

After our initial investigation. And now that we are more interested in implementing the unit tests for Postgres, a concern about the permeance was brought up.

In our initial version, and as mentioned before, we are doing a full database reset. In other words, every time a new test scenario is executed, we delete the database, recreate it, install ODS and install AMT. This process is actually consuming too much time. The goal with this extension is to see the possibility of improving this performance reviewing 3 possible options:

  1. Instead of doing this full database reset, what about doing a truncate on all tables?
  2. Instead of doing this full database reset, implement Respawn. More information about this tool here
  3. Determine if we need to start with an empty database for each of our test cases.

In relation to the first 2 options, the main idea is to compare both approaches and see if one of them has a better performance.

This is what I did to compare performance

Truncate all tables

Changed a little the previous implementation and added some validations to check if the database already existed, in such a case, instead of deleting it and recreating it, etc. a query is executed to truncate all the tables (Meaning that the database is created just the first time, assuming that it doesn't exist). Before executing this truncate, there is a query to build (runtime) this truncate. 

Since all the test had been implemented on Postgres for DateDim view. I executed these 51 unit tests and it took 43 seconds (average, because I executed them several times).

Respawn

Similar to the previous approach, I changed the deletion and recreation of the database, to implement Respawn.

I executed the 51 unit tests associated to DateDim and it took 42 seconds (average, because I executed them several times).

So, as you can see, the performance is super similar. Meaning that, whether we implement the All tables truncate or Respawn, the performance should not be a variable in the equation. Rather we should consider other reasons to implement one or the other. 

Additionally I executed the unit test associated to DateDim, SQL Server, just Data Standard 3.2. In other words, the exact same tests (the 51 unit tests). And it took 42 seconds. Essentially the same amount of time.

Conclusion

In relation to the first and second options, we can implement either the all tables truncate or Respawn. Whether we are going to implement one or the other is still an open question. But I think the approach we take should not be based on the performance. 

Determine if we need to start with an empty database for each of our test cases

There are the conclusions based on our review

Right now the test cases use a specific data set for each view. The data used for the tests is designed to be able to verify the different conditions of a view and determine if the view has been altered and can affect the results. To this end, the data must meet certain conditions regarding column relationships and values. According to the above:

  • The data extracted may come from different sample databases, according to the version used by each developer, or it may have been included to model a specific case. Therefore, we are not working with a single data set .
  • The advantage of working separate data sets is the independence of the test cases, they do not depend on the data from other scenarios to execute the tests. It also does not depend on or is affected by the data that another developer uses in test cases, allowing you to work in parallel.
  • For testing purposes, some data is modified to create additional test scenarios. For example, for two different views, a student can have different data, such as null columns or with different data that can affect the tests.
  • Having a single set of test data requires more design effort so that the global data fits all test cases. This can be complex because, for example, the descriptors can vary according to the data in each sample database.
  • I do not recommend using a single or incremental data set. Designing the tests for each of the views can be a complex process, and designing the data considering the effects in other test scenarios would go against the unit test concept. In addition, it can increase the complexity both when designing data and when several developers are working on test scenarios and need to add data, independence and parallelism would be lost.

Branches

We generated 2 branches just for general review. The code here is the code I moved from the initial spike (spike-postgres-tests), including the changes I made in order to accomplish the goal of the spike ( BIA-761 - Getting issue details... STATUS ).

spike-postgres-tests-Truncate

spike-postgres-tests-Respawn

Notice that in both cases, since it was just a spike, this is not the actual postgres implementation of the tests. Using this code as a reference, we have to work on the real implementation, where we take into account maintainability, easy to read and understand, etc.