Last year, 2023, was clearly a standout year for advances in AI. It has long been felt that getting the most out of AI requires strong investment in infrastructure and support, but this was never as clear as it became last year with the advent of generative AI. Most AI technology before generative AI performed reasonably well on a handful of GPUs and a modest amount of RAM. All this changed after the release of GPT-3 by OpenAI and the subsequent release of a large number of open-source models. These large language models (LLMs) were large in every sense: they needed massive computational resources in the form of high-performance GPUs and large amounts of memory. The financial services sector in particular is widely seen as one of the top beneficiaries of this technology. The resources this sector spends on analysing and processing data, particularly textual data, can be reduced to a large extent using LLMs. In fact, it is open-source LLMs that have found the most utility in this sector. There are multiple reasons for this:
(a) Criticality of data and its security: Much of the data in the financial sector is sensitive. It must be secured and kept away from public access, since a leak can cause serious problems for the business. This makes the case for open-source or internal solutions instead of proprietary ones, particularly for critical and sensitive use cases.
(b) Customization of LLMs: Most use cases in this sector require customizing the LLM with very specific datasets, which vary from company to company, in order to provide correct responses.
It is quite evident that the applicability of open-source LLMs in the financial sector is increasing, but at the same time there are many challenges in the basic implementation of an LLM solution. The sheer amount of resources required, in terms of both computation and memory, is costly as well as difficult to support. Take, for example, the BigScience project’s unveiling of BLOOM, a model with 176 billion parameters capable of supporting 46 natural languages and 13 programming languages. While the public availability of these 100B+ parameter models has made them easier to use, the challenges of high memory and computational cost persist. Notably, models like OPT-175B and BLOOM-176B demand over 350 GB of accelerator memory for inference, and even more for fine-tuning. Consequently, using such LLMs in practice often requires multiple high-end GPUs or multi-node clusters, whose high cost limits accessibility for many researchers and practitioners.
This makes the case for trying a completely different outlook altogether or, as they say, thinking outside the box.
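The 350 GB figure above can be sanity-checked with simple arithmetic. The sketch below is only a back-of-the-envelope estimate: it counts the memory for the weights alone (not activations or caches), assumes 16-bit weights for the baseline, and assumes a hypothetical accelerator with 80 GB of memory. The function and variable names are illustrative.

```python
import math


def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory needed to hold the model weights, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


BLOOM_PARAMS = 176e9  # parameter count quoted in the text

fp16_gb = weight_memory_gb(BLOOM_PARAMS, 2)  # 16-bit weights: 2 bytes each
int8_gb = weight_memory_gb(BLOOM_PARAMS, 1)  # 8-bit weights: 1 byte each

GPU_MEMORY_GB = 80  # assumption: memory of one high-end accelerator
gpus_needed = math.ceil(fp16_gb / GPU_MEMORY_GB)

print(f"16-bit weights: {fp16_gb:.0f} GB")   # 352 GB
print(f"8-bit weights:  {int8_gb:.0f} GB")   # 176 GB
print(f"80 GB accelerators needed for 16-bit weights: {gpus_needed}")  # 5
```

Even under these generous assumptions, the weights alone do not fit on a single device, which is why multi-GPU or multi-node setups come into the picture.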
Client – Server Approach
This makes the case for a distributed computing setup for LLMs as one possible solution. It also makes sense because we already use distributed computing systems such as cloud and edge computing. Such a setup lets multiple users collaborate over the Internet on inference and fine-tuning of large language models. Participants in the distributed network can take the role of a server, a client, or both. A server hosts a subset of the model’s layers, typically Transformer blocks, and handles requests from clients. A client, in turn, forms a chain of pipeline-parallel consecutive servers to run inference over the entire model. Beyond inference, one can fine-tune the model using parameter-efficient training methods such as adapters, or by training entire layers. Trained submodules can be shared on a model hub, where others can use them for inference or further training. Existing 100B+ models can be run efficiently in this collaborative setting, aided by several optimizations such as dynamic quantization, prioritizing low-latency connections, and load balancing between servers. Let’s discuss this in a bit more detail.
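To make the idea of a client chaining servers more concrete, here is a toy sketch. It is not the routing logic of any particular system: the `Server` class, the greedy rule (pick the server that reaches furthest, breaking ties by lower latency), and all the numbers are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Server:
    name: str
    start: int          # first Transformer block hosted (inclusive)
    end: int            # last Transformer block hosted (exclusive)
    latency_ms: float   # measured round-trip time from the client


def build_chain(servers: List[Server], num_blocks: int) -> Optional[List[Server]]:
    """Greedily pick consecutive servers so that blocks 0..num_blocks-1 are all covered.

    Returns None if some block is not hosted by any server.
    """
    chain: List[Server] = []
    pos = 0  # next block that still needs a host
    while pos < num_blocks:
        candidates = [s for s in servers if s.start <= pos < s.end]
        if not candidates:
            return None
        # Prefer the server that covers the most remaining blocks,
        # then the one with the lowest latency.
        best = max(candidates, key=lambda s: (s.end, -s.latency_ms))
        chain.append(best)
        pos = best.end
    return chain


servers = [
    Server("A", 0, 4, 20.0),
    Server("B", 0, 3, 5.0),
    Server("C", 3, 8, 30.0),
    Server("D", 4, 8, 10.0),
]
chain = build_chain(servers, num_blocks=8)
print([s.name for s in chain])  # ['A', 'D']
```

A real system would also need to handle servers joining and leaving mid-request, which this sketch ignores.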
Design and Technical Overview
Practical applications of large language models can be broadly divided into two main scenarios: inference and parameter-efficient adaptation to downstream tasks. I will try to outline the design of the distributed network, explaining how it handles both scenarios and how it lets system users share trained adapters with one another.
Internal Structure and Optimizations
Performance is paramount for distributed inference, and it depends on three key aspects: computation speed (a 5-year-old gaming GPU vs. a new data center GPU), communication delay due to the distance between nodes (intercontinental vs. local), and communication delay due to bandwidth (10 Mbit/s vs. 10 Gbit/s). In terms of raw compute, even consumer-grade GPUs like the GeForce RTX 3070 could execute a complete inference step of BLOOM-176B in less than a second. The real challenge lies in GPU memory constraints, which call for efficient solutions. One way to address this is to use quantization to store parameters more compactly and dynamic server prioritization to speed up communication.
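As a rough illustration of how quantization saves memory, the sketch below applies simple absmax 8-bit quantization to a small list of weights. This particular scheme is my assumption for the example and is not necessarily what a production system would use. Each value is stored as an integer in [-127, 127] plus one shared scale, at the cost of a small rounding error.

```python
from typing import List, Tuple


def quantize_absmax(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale.

    Assumes `values` is non-empty.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the 8-bit integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 0.9]  # made-up example weights
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)  # [50, -127, 0, 90]
print(max_error <= scale / 2 + 1e-12)  # True: error is at most half a quantization step
```

Storing one byte per weight instead of two halves the memory for the weights, which directly reduces the number of servers needed to host the full model.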
Democratization and Privacy Concerns
We can take inspiration from blockchain to address a potential imbalance between the peers supplying GPU resources (servers) and those using these servers for inference or fine-tuning. A system of incentives could be implemented: peers running servers could earn special points, redeemable for high-priority inference and fine-tuning or for other rewards. This approach aims to encourage active participation and keep the network balanced. An acknowledged limitation of this approach is a potential privacy concern: peers serving the initial layers of the model might use the inputs they receive to recover the input tokens. To mitigate this, users handling sensitive data are advised to limit their clients to trusted servers or to set up their own isolated swarm. Beyond that, we can explore privacy-enhancing technologies such as secure multi-party computation or privacy-preserving hardware from NVIDIA.
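The incentive idea described above could be prototyped with something as simple as a points ledger. The sketch below is purely hypothetical: the `PointsLedger` class, the rule of one point per block served, and the redemption cost are all assumptions for illustration, and it ignores the hard parts such as verifying that work was really done.

```python
from typing import Dict


class PointsLedger:
    """Toy in-memory ledger: peers earn points by serving and spend them on priority."""

    def __init__(self) -> None:
        self._balances: Dict[str, int] = {}

    def reward(self, peer: str, blocks_served: int) -> None:
        """Credit one point per block served (an assumed, simplistic rule)."""
        if blocks_served < 0:
            raise ValueError("blocks_served must be non-negative")
        self._balances[peer] = self._balances.get(peer, 0) + blocks_served

    def balance(self, peer: str) -> int:
        return self._balances.get(peer, 0)

    def redeem(self, peer: str, cost: int) -> bool:
        """Spend points on a high-priority request; return False if not affordable."""
        if cost <= 0 or self.balance(peer) < cost:
            return False
        self._balances[peer] -= cost
        return True


ledger = PointsLedger()
ledger.reward("alice", 10)             # alice served 10 blocks
alice_ok = ledger.redeem("alice", 4)   # True: alice can afford priority
bob_ok = ledger.redeem("bob", 1)       # False: bob has contributed nothing yet
print(alice_ok, bob_ok, ledger.balance("alice"))  # True False 6
```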
Conclusion
My aim in this blog is to introduce my take on distributed computing for AI, explaining why it is needed and giving a brief technical overview of one possible approach to implementing it. I am open to discussing new ideas for implementing this. Given that AI will see massive application in the financial sector in the coming years, we have to start thinking about how to make optimal use of current resources before creating new ones. Another aim is to democratize access to large language models, enabling a broader range of applications, studies, and research questions that were previously challenging or cost-prohibitive.
This content is provided by an external author without editing by Finextra. It expresses the views and opinions of the author.