4  Preprocessing using R

In the previous section I explained how a data instance of the ConcePTION CDM is a complex object.

Defining a model that generates an entire instance is outside the scope of this project; even state-of-the-art models cannot do this yet.

Using an R script, I first extracted all conceptsets defined in the codelist available here

I then created variables based on those conceptsets, and finally a dataset containing records of persons who were positive for one or more of 18 selected events in the year before a specific reference moment. A person may have more than one event in that period.
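The windowing logic of this step can be sketched as follows. This is a minimal illustration in Python, not the R script used in the project: the record layout, the event names, the per-person reference dates, the 365-day window length, and the function name `build_indicator_rows` are all assumptions made for the example.

```python
from datetime import date, timedelta

# Hypothetical event records: (person_id, event_name, event_date).
events = [
    ("p1", "event_a", date(2019, 6, 1)),
    ("p1", "event_b", date(2019, 11, 15)),
    ("p1", "event_a", date(2017, 1, 10)),  # falls outside the lookback window
    ("p2", "event_c", date(2019, 3, 20)),
]

# Hypothetical reference moment for each person.
index_date = {
    "p1": date(2020, 1, 1),
    "p2": date(2020, 1, 1),
    "p3": date(2020, 1, 1),  # no events at all
}

# Three placeholder events stand in for the 18 events used in the project.
event_names = ["event_a", "event_b", "event_c"]


def build_indicator_rows(events, index_date, event_names, lookback_days=365):
    """Return {person_id: [0/1 per event]} for events inside the lookback window.

    Persons without any event in the window are dropped (an assumption,
    based on the dataset containing only persons positive for some event).
    """
    column = {name: i for i, name in enumerate(event_names)}
    rows = {pid: [0] * len(event_names) for pid in index_date}
    for pid, name, when in events:
        ref = index_date.get(pid)
        if ref is None or name not in column:
            continue
        # Window is [ref - lookback_days, ref): the reference day itself is excluded.
        if ref - timedelta(days=lookback_days) <= when < ref:
            rows[pid][column[name]] = 1
    return {pid: flags for pid, flags in rows.items() if any(flags)}


rows = build_indicator_rows(events, index_date, event_names)
```

Each person ends up with one row of binary flags, so a person with several events in the window simply has several flags set to 1.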

Some of the rarer events were excluded: because the dataset is small, retaining them would have made training the model too difficult.
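One simple way to apply such an exclusion is a minimum-count filter on the event columns. The sketch below is an assumption about how this could be done, written in Python for illustration; the threshold `min_count`, the toy data, and the function name `drop_rare_events` are hypothetical and do not reflect the criterion actually used in the project.

```python
def drop_rare_events(rows, event_names, min_count):
    """Keep only event columns that are positive in at least `min_count` persons.

    `rows` maps person_id -> list of 0/1 flags aligned with `event_names`.
    Returns the retained event names and the filtered rows.
    """
    counts = [sum(flags[i] for flags in rows.values())
              for i in range(len(event_names))]
    keep = [i for i, count in enumerate(counts) if count >= min_count]
    kept_names = [event_names[i] for i in keep]
    filtered = {pid: [flags[i] for i in keep] for pid, flags in rows.items()}
    return kept_names, filtered


# Toy data: event_c occurs in one person only and is treated as rare.
toy_names = ["event_a", "event_b", "event_c"]
toy_rows = {
    "p1": [1, 1, 0],
    "p2": [1, 0, 0],
    "p3": [1, 0, 1],
    "p4": [0, 1, 0],
}

kept_names, filtered_rows = drop_rare_events(toy_rows, toy_names, min_count=2)
```

After filtering, a person whose only events were rare would be left with an all-zero row; whether such rows are kept or removed is a separate choice that this sketch does not make.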

These are the characteristics of the final dataset:

To finish the preprocessing, I converted the dataset to a NumPy array, ready to be loaded in Python.