ESAX: Enhancing the Scalability of the Axyon platform
Allowing Fintech AI applications to deal with large customer datasets through GPU-powered supercomputers
Axyon AI is an Italian fintech start-up whose current applications focus mainly on financial time series analysis with machine learning algorithms. More specifically, Axyon AI partners with financial institutions (asset managers, hedge funds, trading desks) to improve the performance and risk profiles of investment strategies. The main objective of the pilot is to work with EOSC DIH on a proof of concept that uses the EOSC infrastructure and competences to raise the technology readiness level (TRL) of the company's services.
Over the years, Axyon AI has developed an internal platform (the “Axyon Platform”) for data scientists and machine learning engineers, enabling them to work more efficiently and removing the need to worry about the management of available computational resources and the physical location of data. In this system, particular attention is placed on data security, which is crucial for a fintech company oftentimes working with proprietary data that cannot leave a certain facility, e.g. a bank’s secure data storage infrastructure.
One limitation of the current workload management system of the Axyon Platform is that each computational job can use only one GPU, which limits the complexity and size of machine learning models as well as the size of the data batches used to train them. The goal of this project is to overcome this limitation by (i) enabling the parallel training of machine learning algorithms on multiple GPUs within the same computational node, and then (ii) assessing the possibility of distributing the training over multiple nodes.
The project is organised into the following activities:
- Dataset creation and validation
- Platform setup and testing on a multi-GPU system
- Profiling of Axyon NN on a single GPU
- Porting of Axyon NN to multiple GPUs
- Test configuration, benchmark and parameter optimisation of multi-GPU NN training
- Enabling multi-node training via Horovod
- Test configuration, benchmark and parameter optimisation of multi-node NN training
- Platform robustness test with the introduction of the new multi-GPU and multi-node workflows
- Final paper
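The multi-GPU and multi-node activities above rely on data-parallel training: each worker computes gradients on its own shard of the batch, and the gradients are averaged across workers before the shared weights are updated (Horovod performs this averaging with an allreduce operation). The sketch below is a toy simulation of that idea in plain Python, not Axyon's code and not the Horovod API. The one-parameter linear model, the data, and all function names are hypothetical and chosen only for illustration; the "workers" are simulated sequentially on the CPU.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (input x, target y)


def gradient(w: float, batch: List[Sample]) -> float:
    """Gradient of the mean squared error for the toy model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def split_batch(batch: List[Sample], n_workers: int) -> List[List[Sample]]:
    """Split a batch into equal shards, one per simulated GPU."""
    if len(batch) % n_workers != 0:
        raise ValueError("batch size must be divisible by the number of workers")
    size = len(batch) // n_workers
    return [batch[i * size:(i + 1) * size] for i in range(n_workers)]


def allreduce_mean(values: List[float]) -> float:
    """Stand-in for an allreduce: every worker ends up with the average."""
    return sum(values) / len(values)


def data_parallel_step(w: float, batch: List[Sample],
                       n_workers: int, lr: float) -> float:
    """One training step: local gradients, averaging, shared weight update."""
    shards = split_batch(batch, n_workers)
    local_grads = [gradient(w, shard) for shard in shards]  # one per worker
    return w - lr * allreduce_mean(local_grads)


def train(batch: List[Sample], n_workers: int,
          lr: float = 0.01, steps: int = 200) -> float:
    """Run gradient descent from w = 0 with the given number of workers."""
    w = 0.0
    for _ in range(steps):
        w = data_parallel_step(w, batch, n_workers, lr)
    return w


# Hypothetical data generated from y = 3 * x.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
w_single = train(data, n_workers=1)  # current setup: one GPU per job
w_multi = train(data, n_workers=4)   # simulated four-GPU data parallelism
```

With equal-sized shards, the average of the per-worker gradients equals the full-batch gradient, so `w_single` and `w_multi` agree up to floating-point rounding. In a real deployment the benefit comes from each GPU processing its shard concurrently, which allows larger global batches than a single device can hold.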