Introduction

The structure of Hypergol is organised around two kinds of functionality: core functionality that enables running Hypergol projects, and code-generating scripts that generate those projects.

Hypergol Data Model

A Hypergol project’s domain data model is described by a set of classes that all derive from BaseData. This base class provides the functionality that enables a class to interoperate with the rest of the framework. The classes can be easily generated by create_data_model() and then further modified. Automated tests are generated as well that check whether the class still serialises correctly after modification. Data model classes recursively serialise into JSON and are stored in gzipped text files, which are organised into datasets.
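The recursive serialisation idea can be sketched in plain Python. This is an illustrative stand-in only: the class names and helper functions below are assumptions for the sketch, not the actual BaseData API.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical domain classes standing in for BaseData subclasses.
@dataclass
class Sentence:
    text: str
    tokens: list

@dataclass
class Article:
    articleId: int
    sentences: list  # list of Sentence objects

def to_json(article):
    # asdict() recurses into nested dataclasses, mirroring the
    # recursive JSON serialisation of composite data model classes
    return json.dumps(asdict(article))

def from_json(data):
    raw = json.loads(data)
    return Article(
        articleId=raw['articleId'],
        sentences=[Sentence(**s) for s in raw['sentences']],
    )
```

A round trip through to_json()/from_json() reconstructs an equal object, which is exactly the property the autogenerated serialisation tests check after the class is modified.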

Datasets

The Dataset object is the primary storage format in Hypergol. It enables parallel processing by the pipeline as well as interactive access from notebooks. The following information about the creation of the file is saved into the .def file:

  • the location of the file

  • the type of data stored in it

  • the git commit at the time

  • the committer’s name.

The content of each gzipped JSON file (before compression) is hashed with SHA1, and the hash is saved into a .chk file so that it can be verified that the file wasn’t changed. The dataset’s checksum is calculated by hashing the entire .chk file with SHA1; because this file contains the hashes of all the files in the dataset, it uniquely verifies the dataset: no part of it can be changed while keeping the same hash. The hash of any dataset used in the creation of a new dataset is added to its .def file as well. This is a recursive operation, so each dataset carries the hashes of its complete history. This way, the entire processing chain is verified by a single hash.
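The two-level checksum scheme can be sketched with the standard library. The file names and .chk layout below are illustrative assumptions; only the hash-of-hashes structure is taken from the description above.

```python
import hashlib

def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

# Uncompressed JSON content of each (hypothetical) file in the dataset
file_contents = {
    'data_0000.json.gz': b'{"id": 0}\n',
    'data_0001.json.gz': b'{"id": 1}\n',
}

# Per-file hashes go into the .chk file, one line per file
chk_lines = [f'{sha1_hex(content)} {name}'
             for name, content in sorted(file_contents.items())]
chk_file = '\n'.join(chk_lines).encode()

# Hashing the whole .chk file yields a single checksum that
# changes if any byte of any file in the dataset changes
dataset_checksum = sha1_hex(chk_file)
```

Because the dataset checksum covers every per-file hash, tampering with any single file (or with the .chk file itself) is detectable from this one value.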

Tasks

Tasks are computational routines that operate on domain data. They stream objects from datasets and stream their output back to disk, which makes it easy to work with data larger than machine memory. Because datasets can be operated on by multiple threads, tasks can run in parallel on the same dataset. Combined with the memory-efficient data handling, this achieves high throughput: completion speed is limited only by the additional memory requirements of a task (e.g., lookup tables, deep learning models, etc.).

Hypergol tasks all derive from the Task class. If the inputs of a task are datasets, only the run() function needs to be overridden; if the inputs are other data (e.g., CSV files or pandas dataframes), the get_jobs() and source_iterator() functions must be overridden as well. When task code is autogenerated, the --source switch helps with the syntax of these functions by generating stubs and comments in the task file to be completed.
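The source-style task can be sketched as follows. This is a minimal stand-in: the real hypergol.Task also provides streaming, threading, and dataset wiring, and the class and field names here are assumptions for the sketch.

```python
# Minimal stand-in for the Task base class (illustrative only)
class Task:
    def run(self, *args):
        raise NotImplementedError

class CsvTask(Task):
    # Non-dataset input style: get_jobs() partitions the source,
    # source_iterator() yields raw records, run() processes each one.
    def __init__(self, rows):
        self.rows = rows

    def get_jobs(self):
        # Naive split into two jobs; the framework would run these in parallel
        return [self.rows[:2], self.rows[2:]]

    def source_iterator(self, job):
        for row in job:
            yield row

    def run(self, row):
        # Per-record processing; the output would be streamed to a dataset
        return row['name'].upper()
```

A dataset-input task would override only run(); the get_jobs()/source_iterator() pair is needed only when the input is not already a dataset.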

Pipeline

Pipelines are used to combine tasks and datasets. A pipeline manages parallel execution and the handling of the data. It executes each task on its own so that no inter-task concurrency problems can occur, and it handles data versioning through git and shell execution. Logically, a Pipeline is a list of Tasks that are executed sequentially, plus the code that handles the threads enabling that execution. It is deliberately simple, as opposed to other frameworks that describe computational tasks as DAGs (Airflow, Prefect.io, dask). This is because most tasks required in ML are IO/memory/CPU heavy but relatively sequential, and the benefit is achieved by parallelising the slowest step in a linear pipeline. At any one time only one task is running, so only that task’s resource limits matter and can be optimised.
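The sequential-execution model can be reduced to a toy sketch (not the Pipeline API; the function below is a hypothetical illustration of the control flow only):

```python
def run_pipeline(tasks, initial_data):
    # Tasks run strictly one after another: no inter-task concurrency.
    # Inside a real task, each step may fan out over threads (not shown).
    data = initial_data
    for task in tasks:
        data = [task(item) for item in data]
    return data

result = run_pipeline([str.strip, str.lower], ['  Foo ', ' BAR'])
# result == ['foo', 'bar']
```

The parallelism lives entirely inside each step, which is why only the currently running task’s resource limits need tuning.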

Testing

Hypergol creates example unit tests for the project, so it is easy to verify assumptions about the code and extend the coverage based on the examples. It adds pylint as well, so linting can be performed. The tests fail at generation time in certain cases because it is too cumbersome to auto-generate tests for some data model types. The intention here is to enable writing tests without setup and to focus only on the “Given-When-Then” style of test writing.
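A Given-When-Then test might look like the following. The tokenize() function is a hypothetical example, not part of Hypergol; only the three-part test structure is the point.

```python
import unittest

def tokenize(text):
    # Hypothetical function under test
    return text.split()

class TestTokenize(unittest.TestCase):
    def test_tokenize_splits_on_whitespace(self):
        # Given: a raw input string
        text = 'hello hypergol world'
        # When: the function under test runs
        tokens = tokenize(text)
        # Then: the output matches the expectation
        self.assertEqual(tokens, ['hello', 'hypergol', 'world'])
```

Each test reads top to bottom as setup, action, assertion, with no shared fixtures to maintain.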

Modelling

Hypergol provides stubs for TensorFlow/Torch models and a BaseBatchProcessor abstraction to connect the model at training and deployment time to the data model (and datasets). To enable iterative, SOLID-style development, an opinionated abstraction is provided through BaseTensorflowModel, derived from keras.Model, and BaseTensorflowModelBlock, derived from keras.layers.Layer. For Torch, BaseTorchModel and BaseTorchModelBlock are derived from torch.nn.Module. When the proposed structure is followed, Hypergol provides TensorflowModelManager and TorchModelManager, which handle training and evaluation, saving the model and any metrics for TensorBoard. Models are packaged with the correct signature, derived from the get_outputs function, which enables deployment as well. The generated scripts ensure full end-to-end data lineage, providing complete transparency from the output (in the client) back to the training source data and the exact version of the code that ran at each point.
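The block/model split can be sketched in plain Python. This is a framework-free illustration, assuming invented class names; the real classes derive from keras.Model / keras.layers.Layer (or torch.nn.Module), and only get_outputs() is taken from the description above.

```python
# Hypothetical reusable block, standing in for a BaseTensorflowModelBlock
class EmbeddingBlock:
    def __init__(self, table):
        self.table = table  # token -> scalar "embedding", for illustration

    def __call__(self, token):
        return self.table[token]

# Hypothetical model composing blocks, standing in for a BaseTensorflowModel
class AverageModel:
    def __init__(self, embedding_block):
        self.embedding_block = embedding_block

    def get_outputs(self, tokens):
        # In Hypergol, get_outputs also determines the serving signature
        vectors = [self.embedding_block(t) for t in tokens]
        return sum(vectors) / len(vectors)
```

Splitting reusable computation into blocks and composing them in a model is what keeps individual pieces small and testable.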

Deployment

Hypergol autogenerates deployment code with the help of FastAPI and uvicorn. The resulting endpoint is typed by recursively converting input and output datamodel classes to pydantic classes on the fly, so composite classes can be used as well. A request is handled in the following way:

  • FastAPI internally converts the request’s data to a list of pydantic classes. This validates the data and returns the relevant error message if the data doesn’t match the required schema.

  • The pydantic classes are converted back to JSON strings.

  • The JSON strings are converted to datamodel classes. Because the pydantic classes were generated from the datamodel classes, they are guaranteed to match.

  • The model’s batchprocessor converts the list of inputs into the appropriate tensors.

  • The model calculates the output tensors.

  • The model’s batchprocessor converts the output tensors to a list of datamodel classes.

  • The list of datamodel classes is converted to JSON strings.

  • These JSON strings are converted to pydantic classes.

  • These pydantic classes are returned in the response.

  • The response header is completed with the time it took to process the request and also the “long name” of the model; this way, the client always knows which version of the model served the request.
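The conversion chain above can be sketched end to end with the standard library. To keep the sketch self-contained, dataclasses stand in for the generated pydantic and datamodel classes, and a trivial length-based "classifier" stands in for the batchprocessor and model; all names are assumptions.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class QueryIn:   # stands in for the generated pydantic request model
    text: str

@dataclass
class LabelOut:  # stands in for the datamodel / pydantic response class
    text: str
    label: str

def serve(requests):
    # pydantic -> JSON strings -> datamodel classes
    inputs = [QueryIn(**json.loads(json.dumps(asdict(r)))) for r in requests]
    # batchprocessor + model, collapsed into one trivial stand-in step
    outputs = [LabelOut(text=i.text, label='long' if len(i.text) > 5 else 'short')
               for i in inputs]
    # datamodel classes -> JSON -> pydantic-shaped response payload
    return [json.loads(json.dumps(asdict(o))) for o in outputs]
```

Each arrow in the bullet list corresponds to one conversion here; the JSON round trips look redundant in the toy version, but in the real endpoint they are what guarantees the pydantic and datamodel representations stay in sync.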

All of this code is autogenerated, so no glue code needs to be written. Because for TensorFlow/Torch models the majority of the time is spent on the calculations, the overhead of the many conversion steps is probably not going to matter. If it does, the endpoint can be rewritten to pass the pydantic classes directly to the batchprocessor, preparing the batchprocessor to handle pydantic classes as well. It is important that this conversion happens in the batchprocessor so that the serving input/output stays in line with the training/evaluation process.