Source: Tinkoff Bank
Tinkoff Group built its own supercomputer in line with its AI First strategy and ambition to develop a platform for machine learning and artificial intelligence.
The Kolmogorov cluster is designed to quickly train models on the large datasets the company has accumulated since its inception 13 years ago. Fast interconnects between compute nodes make more efficient use of hardware resources during distributed training on very large datasets.
Kolmogorov significantly speeds up machine learning and AI tasks, such as:
● distributed training of neural network models for speech recognition, speech synthesis and natural language processing;
● training conventional machine learning models for scoring, acquisition and predictive analytics.
Thanks to the Kolmogorov cluster, neural networks train hundreds of times faster. For example, it took us just 24 hours to retrain a sales probability forecasting model on the entire 13-year set of accumulated data as part of our outgoing calls optimisation effort. By our estimates, a conventional approach to retraining would have taken around six months. The cluster enables the business to test hypotheses, improve services and bring new products to market more quickly and efficiently.
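As a rough illustration of the retraining example, the speed-up can be estimated from the two durations quoted above. This is only a back-of-the-envelope sketch: the text says "around six months", so the 30-day month used here is an assumption, and the variable names are illustrative.

```python
from datetime import timedelta

# Durations quoted in the text. "Six months" is approximated as
# 6 x 30 days, which is an assumption made for this estimate only.
conventional_retraining = timedelta(days=6 * 30)
cluster_retraining = timedelta(hours=24)

# Dividing one timedelta by another yields a plain float ratio.
speedup = conventional_retraining / cluster_retraining

print(f"Estimated speed-up for this model: about {speedup:.0f}x")
```

The figure applies to this single forecasting model under the stated assumption; speed-ups for other workloads will differ.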
Kolmogorov boasts 658.5 TFLOPS of peak double-precision floating-point (FP64) performance. The system includes 10 nodes with cutting-edge NVIDIA Tesla V100 accelerators, powered by tensor cores delivering exceptional AI performance.
The compute nodes of Tinkoff’s supercomputer are connected by a 100 Gb RoCE (RDMA over Converged Ethernet) network. Combining the latest technologies, the cluster achieved 418.9 TFLOPS in the Linpack test, securing a leading position in the national supercomputer ranking.
Kolmogorov uses the same HPC accelerators as Summit (OLCF-4), the world's fastest supercomputer. It is also the most powerful supercomputer in the ranking in terms of per-node performance: each of its servers is itself a very powerful unit, delivering 41.9 TFLOPS.
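The per-node figure follows from the numbers quoted above: dividing the Linpack result by the node count gives roughly 41.9 TFLOPS. The same figures also give the ratio of measured to peak performance. The sketch below simply reproduces that arithmetic; the function and constant names are illustrative, and it assumes the load is spread evenly across the 10 nodes.

```python
# Figures quoted in the text.
PEAK_TFLOPS = 658.5     # peak FP64 performance
LINPACK_TFLOPS = 418.9  # measured Linpack result
NODES = 10


def per_node_tflops(total_tflops: float, nodes: int) -> float:
    """Average performance per node, assuming an even split across nodes."""
    return total_tflops / nodes


def linpack_efficiency(measured_tflops: float, peak_tflops: float) -> float:
    """Fraction of the peak performance reached in the Linpack run."""
    return measured_tflops / peak_tflops


per_node = per_node_tflops(LINPACK_TFLOPS, NODES)
efficiency = linpack_efficiency(LINPACK_TFLOPS, PEAK_TFLOPS)

print(f"Per-node Linpack performance: {per_node:.1f} TFLOPS")
print(f"Linpack efficiency: {efficiency:.1%}")
```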
The Kolmogorov cluster became part of the machine learning and artificial intelligence platform, comprising the following elements:
● infrastructure to collect, store and process data, subsequently pool it and extract features;
● tools to train models, estimate parameters and predict results;
● software and graphical interfaces to visualise results and manage learning artefacts;
● a system to automatically deploy, monitor and manage resources.
Viacheslav Tsyganov, Tinkoff Group CIO, commented: “Tinkoff enjoys a long-standing status as a technology leader in Russia, as the scope of our machine learning and artificial intelligence tasks is growing. This platform was created as part of our AI First strategy, which requires that all the products we bring to the market contain built-in artificial intelligence. The purpose of this platform is to foster a culture of working with data, lower the threshold for entry into this area for our teams and make machine learning accessible to every analyst and developer at Tinkoff.
We did not plan to build a system that would be called ‘super’. In general, it is quite a small part of our infrastructure, but the performance we reached brought the cluster to the top of Russian supercomputers. We can now grant our teams access to one of the most powerful supercomputers in Russia, which will streamline our hypothesis testing and decision-making processes, as well as reduce time-to-market for new products.”
Dmitry Konyagin, Enterprise Business Team Leader at NVIDIA, said: “Artificial intelligence is finding its way into every human activity. It offers businesses totally new opportunities and prospects. We are happy to help Tinkoff Group in creating an effective HPC platform to solve the most ambitious tasks.”