Over the last couple of years, Coupang has been using machine learning (ML) and AI heavily to improve customer experiences in areas such as search, ads, catalog, and recommendations. ML also drives important decision making in pricing, transportation, and logistics.
Coupang's ML engineers' toolkit has grown significantly in this short time frame, from classical ML techniques to deep learning and now large language models (LLMs) for generative AI. Coupang's ML platform has been at the forefront of this journey, focused on enabling ML engineers to train and serve models in a resource-efficient way. In this blog we describe LLM explorations at Coupang and the technical challenges they pose for the ML platform.
ML at Coupang
There are three main types of ML models trained with Coupang’s ML infrastructure:
Recommendation system models: These are primarily used on personalization and recommendation surfaces such as the main home feed, search, and ads across the Coupang apps for shopping, eats, and play. They are trained on large datasets of user interactions (clicks, views, purchases, add-to-cart events) and human-labeled relevance judgements.
Content understanding models: Coupang has huge datasets of product catalog data (text, images), user-generated content (text: reviews, queries), and user and merchant data (text, images). ML teams across product groups use deep learning techniques to learn product, customer, and merchant representations and then use them to improve the shopping experience.
Forecasting models: Coupang operates more than 100 unique fulfillment centers housing millions of products. Predictive modeling is crucial in the pricing, logistics, and delivery of these products for our customers. While these models are typically statistical in nature, deep learning techniques are increasingly being incorporated.
Foundation models (FMs) are large deep learning models trained on massive datasets. Unlike traditional ML models that are trained for specific tasks, FMs can adapt to multiple tasks. They are trained on text datasets (large language models) or multi-modal datasets (combining modalities such as text and image). These models learn powerful representations and can generate contextually relevant content. LLMs and multi-modal models have been used in several ways to improve existing ML models and customer experiences.
Training and serving LLMs come with significant challenges for ML infrastructure: hardware resources (compute, storage, and networking) and efficiently scaling training and inference. In this article we describe the applications of LLMs at Coupang and our key learnings from the ML infrastructure challenges.
Applications
Coupang's largest presence is in South Korea and Taiwan, and training data for most ML tasks is relatively small in both Korean and Mandarin. Moreover, Coupang runs a vibrant marketplace with global sellers from around the world selling to its customers. This poses unique challenges in several problem areas of e-commerce, especially understanding sellers and products described in different languages, images with embedded text, and customer intent while searching. We describe three areas of application inside Coupang.
Image & Language Understanding
Coupang has large datasets of product and ads images along with corresponding metadata, including titles, descriptions, and user queries. Jointly modeling image and text data through vision and language transformer models yields superior embeddings compared with learning the embeddings separately. These embeddings are then used in various downstream models for more effective ad retrieval and similarity search, and serve as features in recommendation models.
Apart from these, there have been other successful applications of large models in content understanding inside Coupang:
Translating product titles from Korean to Mandarin
Improving image quality in shopping feed
User review summarization
Keyword generation for products and sellers
Generating Weak Labels at Scale
Obtaining labels created by humans is often a challenging and costly task. This issue is magnified when dealing with multilingual content, such as English, Korean, and Mandarin in the context of Coupang.
However, LLMs present a solution to this problem. They have the ability to produce labels for text-based content on a large scale, with a quality that rivals that of human annotators. Once these generated labels pass some quality checks, they can serve as weak supervision labels for training various models.
The labels generated by LLMs are particularly useful when bootstrapping models for new segments where high-quality labels are scarce. Internal experiments have shown that these weak labels can enhance the quality of relevance models and have the potential to overcome label scarcity in under-resourced languages.
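As an illustration, the sketch below shows how an instruction-tuned LLM could be prompted to produce weak relevance labels; the model checkpoint, prompt wording, and label set are assumptions for illustration rather than Coupang's actual pipeline.

```python
from transformers import pipeline

# Minimal weak-labeling sketch. The checkpoint, prompt, and label set are
# illustrative assumptions, not Coupang's production pipeline.
generator = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-7B-Instruct",  # any instruction-tuned multilingual model
    device_map="auto",
)

PROMPT = (
    "You are labeling search relevance for an e-commerce site.\n"
    "Query: {query}\n"
    "Product title: {title}\n"
    "Answer with exactly one word: relevant or irrelevant.\nAnswer:"
)

def weak_label(query: str, title: str) -> str:
    out = generator(
        PROMPT.format(query=query, title=title),
        max_new_tokens=3,
        do_sample=False,
        return_full_text=False,  # return only the newly generated tokens
    )
    text = out[0]["generated_text"].strip().lower()
    return "irrelevant" if text.startswith("irrelevant") else "relevant"

# Weak labels that pass quality checks can then feed relevance-model training.
print(weak_label("무선 이어폰", "Bluetooth wireless earbuds, noise cancelling"))
```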
Categorization & Attribute Extraction
In the domain of product categorization and attribute extraction, the traditional approach involved deploying a single ML model per category, because a unified or multi-class model often yielded noisy predictions for tail categories. However, this imposed an increased operational burden, as teams were required to manage multiple models. LLMs provided a deeper understanding of product data (title, description, reviews, seller info), which resulted in a single LLM-powered categorizer for all categories with precision gains across most categories.
Choice of Model Architectures
Taking open-source (OSS) model architectures and fine-tuning them with domain data has proven an effective way to apply LLMs to business problems. It allows ML teams to leverage state-of-the-art pre-trained models and efficient architectures, saving both time and computational resources.
Naturally, the main interest has been in models that show strong multilingual performance, especially in CJK (Chinese, Japanese, and Korean) languages. Training can differ due to the unique characteristics of these languages: key differences include the use of spacing, their character-based nature as opposed to the word-based structure of English, and larger vocabulary sizes. Each of these factors influences the tokenizer, which in turn affects the quality of the language model. For language/NLP tasks, the most commonly used models have been based on Qwen [1.1], LLaMA [1.2], T5 [1.3], Phi [1.4], and Polyglot [1.5], among others. Parameter sizes from 3B to 20B are favored because they strike a good balance between resource and compute efficiency and quality.
For image-text multi-modal models, CLIP [1.6] (Contrastive Language-Image Pretraining) and TrOCR [1.7] (Transformer-based OCR) were the architectures of choice for their efficiency and performance.
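To make the joint image-text embedding approach concrete, here is a minimal sketch using the Hugging Face CLIP implementation; the checkpoint and inputs are placeholders, not our production setup.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Minimal sketch of joint image-text embeddings with CLIP.
# The checkpoint and inputs are placeholders, not Coupang's production model.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("product.jpg")           # placeholder product image
texts = ["wireless earbuds", "running shoes"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# L2-normalized embeddings usable for retrieval, similarity search,
# or as features in downstream ranking models.
image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
similarity = image_emb @ text_emb.T  # cosine similarity per (image, text) pair
print(similarity)
```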
Patterns of Using LLMs
There are a few commonly used patterns for applying LLMs. We have arranged the techniques below in increasing order of resource requirements and complexity.
In-context learning (ICL): In this mode, a pre-trained LLM is given a prompt or "context" that guides its answers for a specific task. This involves no additional training, and the same model can be reused for different tasks with different prompts. ICL is fast to set up and iterate on, cheap because no training is involved, and versatile because it applies to many tasks. Internally, this remains one of the most popular ways of prototyping and evaluating the use of an LLM in a product.
Retrieval Augmented Generation (RAG): RAG is a technique in which LLM-generated responses are grounded in facts fetched from external sources (knowledge bases such as a corpus of documents, a catalog of products, etc.); a minimal sketch follows this list. Making the retrieval and generation components work together seamlessly in real time is nontrivial and can lead to bottlenecks and errors.
Supervised fine-tuning (SFT): This refers to further training an existing base LLM on small datasets to improve performance on a specific domain or task. A model fine-tuned on a high-quality domain dataset often surpasses the base LLM's performance.
Continued pre-training (CPT): It refers to further pre-training of an existing base LLM on sizable datasets to improve generalized understanding of the model without focusing on any specific task. This is resource intensive but often produces the best results on downstream tasks like attribute extraction.
In-context learning followed by supervised fine-tuning remains the most popular pattern for using LLMs due to its flexibility and resource efficiency.
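To make the RAG pattern above concrete, the sketch below pairs dense retrieval over a tiny in-memory corpus with prompt construction; the encoder checkpoint, corpus, and prompt template are illustrative assumptions, and a production system would add a vector index, caching, and guardrails.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Minimal RAG sketch: dense retrieval over a tiny corpus plus prompt construction.
# The encoder checkpoint, corpus, and prompt template are illustrative assumptions.
corpus = [
    "Rocket Delivery orders placed before midnight arrive the next day.",
    "Fresh groceries are delivered in insulated eco-bags.",
    "Returns can be requested within 30 days from the My Orders page.",
]

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
corpus_emb = encoder.encode(corpus, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = corpus_emb @ q  # cosine similarity (embeddings are normalized)
    return [corpus[i] for i in np.argsort(-scores)[:k]]

def build_prompt(query: str) -> str:
    context = "\n".join(f"- {doc}" for doc in retrieve(query))
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\nAnswer:"
    )

# The assembled prompt is then sent to an LLM (hosted, or served via vLLM/Triton).
print(build_prompt("When will my late-night order arrive?"))
```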
Development Lifecycle & Challenges
In this section we describe how developers write LLM training & inference pipelines.
1. Exploration phase:
In the exploration phase, developers run small experiments to shortlist the model lines they want to explore further. Their main focus is on:
Model architecture
Model size: the return on investment (ROI) of larger sizes, for example a 70B+ variant vs. a <10B-parameter variant
Prompt templates: changing the prompt template can change model outputs and therefore performance
ML infra components:
Most data preparation and processing is done with Apache Zeppelin [2.1] notebooks which delegate the tasks to underlying processing engines such as Spark [2.2] on Kubernetes.
Model architecture & prompt template explorations are done in containerized Jupyter notebooks [2.3] backed by one or more GPUs.
2. Model training:
Based on the shortlist, developers choose between fine-tuning and pre-training from scratch, depending on the compute budget, dataset size, and comparative model performance.
Based on model performance in the application, developers finalize the model to put into production; we will call it the source LLM. There is no process difference from the non-LLM model development lifecycle here.
ML infra components:
We use Polyaxon [2.4] underneath for managing ML training lifecycle on Kubernetes.
LLM training at Coupang uses model-parallel training on top of the Kubernetes distributed training operator for PyTorch (PyTorchJob) [2.5].
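For reference, the per-worker entry point that such a PyTorchJob launches might look like the sketch below; the operator injects the rendezvous environment variables, and the model and training loop shown here are placeholders.

```python
import os
import torch
import torch.distributed as dist

# Sketch of the per-worker entry point launched by a Kubernetes PyTorchJob.
# The operator injects MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE, which
# the default env:// rendezvous reads; the model below is a placeholder.
def main() -> None:
    dist.init_process_group(backend="nccl")  # rendezvous via operator-set env vars
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)

    # In practice the model is wrapped by a sharding engine such as DeepSpeed ZeRO
    # (discussed later) rather than plain DDP; a trivial layer stands in here.
    model = torch.nn.Linear(4096, 4096).cuda()
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

    # ... build dataloaders with a DistributedSampler and run the training loop ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```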
3. Path to production:
For our workloads, we see developers use the following methods to go to production:
Distillation: Distill a smaller model from the trained source LLM; the smaller model is then used for real-time inference (see the sketch after this list).
Embedding: Embeddings exported from the source LLM are used as inputs to smaller models. We see this pattern used in ranking problems.
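A minimal sketch of a classic logit-distillation objective is shown below; the temperature, loss weighting, and classification framing are assumptions, and in practice the teacher outputs come from offline batch inference over the source LLM.

```python
import torch
import torch.nn.functional as F

# Minimal knowledge-distillation sketch: the student matches softened teacher
# logits plus the ground-truth labels. Temperature and loss weight are assumptions.
def distillation_loss(
    student_logits: torch.Tensor,   # [batch, num_classes]
    teacher_logits: torch.Tensor,   # [batch, num_classes], e.g. from batch inference
    labels: torch.Tensor,           # [batch]
    temperature: float = 2.0,
    alpha: float = 0.5,
) -> torch.Tensor:
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The KL term is scaled by T^2 to keep gradient magnitudes comparable.
    kd = F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```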
ML infra components:
Batch and nearline inference on GPUs is the most popular way to extract predictions from the source LLMs at scale; these predictions are then used for distillation or as embeddings.
Developers use Ray + vLLM [2.6, 2.7] to write inference pipelines requiring both CPU and GPU processing.
Figure 1: ML infra supporting LLM development workflows.
The key challenges in enabling our developers for LLM development workflows were:
Resource efficiency and management, largely due to the supply shortage and high cost of GPUs.
Capabilities for training and serving large models. Our training stack was not equipped for distributed training (especially model parallelism), and before LLMs our serving ran entirely on CPUs, which are too slow for multi-billion-parameter models.
We describe below the key learnings and takeaways that enabled us to scale our ML stack to meet these challenges.
Choosing the Right Workhorse: Matching Appropriate GPU for the Workload
Choice of GPUs: Large language models (LLMs) are both compute and memory intensive. When dealing with larger training and serving workloads, we quickly realized that device memory constraints play a crucial role in both. High-memory GPUs offered by cloud vendors, such as the Nvidia A100 and H100, are in heavy demand and come with significant wait times. We conducted regular benchmarking with model-building teams to evaluate the price-to-performance ratio of different GPUs for each model line. For training models with more than 1 billion parameters in mixed precision, we utilized the A100 80 GB variant. For testing and lightweight training, we could employ a substantial quantity of A10G 24 GB [3.3] devices. Given that each LLM family is available in multiple parameter sizes, it is highly cost-effective to use a lower-performance device for testing the smaller versions of an LLM.
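As a rough illustration of why device memory dominates these choices, the back-of-the-envelope estimate below uses the commonly cited figure of about 16 bytes per parameter for mixed-precision Adam training (fp16 weights and gradients plus fp32 master weights and optimizer moments), before accounting for activations; the model sizes are examples.

```python
# Back-of-the-envelope GPU memory estimate for mixed-precision Adam training,
# ignoring activations and framework overhead. Roughly 16 bytes/parameter:
#   2 (fp16 weights) + 2 (fp16 grads) + 4 (fp32 master) + 4 + 4 (Adam moments).
BYTES_PER_PARAM = 16

def training_state_gb(num_params: float) -> float:
    return num_params * BYTES_PER_PARAM / 1e9

for name, params in [("1B", 1e9), ("7B", 7e9), ("20B", 20e9)]:
    print(f"{name}: ~{training_state_gb(params):.0f} GB of weight/optimizer state")
# 1B: ~16 GB, 7B: ~112 GB, 20B: ~320 GB. Anything beyond a few billion
# parameters must be sharded or offloaded even on an 80 GB A100.
```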
Hybrid & Multi-Region AI Clusters
In response to the GPU supply shortage, we implemented a multi-region deployment strategy for our ML infrastructure. By leveraging cloud service clusters across regions (Asia-Pacific & US), we ensure faster access to GPUs, mitigating the wait times that can disrupt execution plans. Additionally, we built an on-prem cluster to provision a significant portion of our compute, especially the higher-end Nvidia GPUs (such as the A100/H100).
This hybrid arrangement has been instrumental in alleviating the shortage of GPUs from the cloud provider and reducing overall cost of training. However, it also presents its own set of challenges, such as ensuring consistent infrastructure (storage & networking) and developer experience.
Embracing Open Source: Frameworks & Tools
At Coupang, all ML training and inference runs on managed containerized services, and the clusters have access to a distributed file system both in the cloud and on-prem. This setup worked for LLM training and inference as well. For training and inference frameworks, we could leverage high-quality open-source projects to our advantage. We describe below the key projects that helped accelerate our journey.
Model Parallel Training:
When it comes to LLM training, one of the key obstacles is that the model cannot fit into a single GPU's memory. As a result, the typical method of distributed training, data parallelism, is by itself insufficient. We support several training frameworks that implement model sharding strategies, the most popular being DeepSpeed ZeRO [2.8] due to its quick setup time and the availability of trainer recipes for popular model architectures through the Hugging Face hub. Developers internally experiment and share recipes with smart defaults for hyperparameters, such as the choice of optimizer, gradient accumulation, and memory pinning.
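A minimal sketch of such a recipe, passing a ZeRO-3 configuration (with optional CPU offload) through the Hugging Face Trainer, is shown below; the checkpoint, data, and hyperparameter values are placeholders rather than an internal recipe.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Sketch of supervised fine-tuning with DeepSpeed ZeRO-3 via the Hugging Face Trainer.
# The checkpoint, data, and hyperparameters are placeholders, not an internal recipe.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu"},      # optional CPU offload lets
        "offload_optimizer": {"device": "cpu"},  # smaller-memory GPUs participate
    },
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

model_name = "meta-llama/Llama-3.1-8B"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

texts = ["Product: wireless earbuds. Category: electronics > audio."]  # placeholder domain data
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./sft-out",
        per_device_train_batch_size=1,
        bf16=True,
        deepspeed=ds_config,  # accepts a dict or a path to a JSON config file
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()  # launched under the distributed setup described above
```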
GPU Inference:
Realtime model serving stack: The compute-intensive nature of LLMs required the use of GPUs for serving. Our existing serving stack was not equipped for GPUs, prompting us to find an appropriate model serving engine. Nvidia Triton [2.9] offers a containerized inference solution, complete with features such as dynamic batching, concurrent multi-model execution on GPUs, and compatibility with a broad range of backends; these features are vital for efficiently serving large models. We do all realtime inference using Nvidia Triton on AWS EKS.
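For illustration, a client-side request against a Triton HTTP endpoint might look like the sketch below; the endpoint, model name, and tensor names are assumptions that depend on each model's Triton configuration.

```python
import numpy as np
import tritonclient.http as httpclient

# Minimal Triton HTTP client sketch. The endpoint, model name, and tensor
# names are assumptions; they depend on each model's Triton config.pbtxt.
client = httpclient.InferenceServerClient(url="triton.internal:8000")

token_ids = np.array([[101, 2023, 2003, 1037, 3231, 102]], dtype=np.int64)
inputs = [httpclient.InferInput("input_ids", token_ids.shape, "INT64")]
inputs[0].set_data_from_numpy(token_ids)

outputs = [httpclient.InferRequestedOutput("embedding")]
result = client.infer(model_name="product_encoder", inputs=inputs, outputs=outputs)

embedding = result.as_numpy("embedding")  # served on GPU, dynamically batched by Triton
print(embedding.shape)
```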
Batch inference: We also realized that batch inference plays a pivotal role in LLM explorations, as it is used to generate LLM responses over datasets after training. Batch inference often involves both GPU and CPU processing: for instance, text and image preprocessing can be carried out in a distributed manner on CPU cores, while the model inference itself runs on GPUs. After experimentation, we settled on Ray + vLLM, which excels at managing this type of heterogeneous computing at scale.
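A simplified sketch of this heterogeneous pattern is shown below, with CPU-side prompt construction as Ray tasks feeding a GPU-bound vLLM actor; the model, prompts, and resource settings are illustrative assumptions.

```python
import ray
from vllm import LLM, SamplingParams

# Sketch of heterogeneous batch inference: CPU preprocessing as Ray tasks,
# GPU generation inside a vLLM actor. Model, prompts, and resources are assumptions.
ray.init()

@ray.remote  # runs on CPU workers
def build_prompt(record: dict) -> str:
    return f"Summarize the following product review in one sentence:\n{record['review']}"

@ray.remote(num_gpus=1)  # pins each actor to a GPU
class VLLMWorker:
    def __init__(self, model: str = "Qwen/Qwen2.5-7B-Instruct"):
        self.llm = LLM(model=model)
        self.params = SamplingParams(temperature=0.0, max_tokens=64)

    def generate(self, prompts: list[str]) -> list[str]:
        outputs = self.llm.generate(prompts, self.params)
        return [o.outputs[0].text for o in outputs]

records = [{"review": "Arrived next day, sound quality is great."}]  # placeholder data
prompts = ray.get([build_prompt.remote(r) for r in records])

worker = VLLMWorker.remote()
summaries = ray.get(worker.generate.remote(prompts))
print(summaries)
```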
Nearline inference: Nearline inference combines the efficiency of batch inference (using small batches) with the responsiveness of near real-time inference (results within a certain time of an event occurring). ML systems in e-commerce applications have several content data streams (user- and seller-generated content, orders, etc.). Using LLMs in nearline inference mode helps teams support diverse downstream applications with a smaller resource footprint.
Rapid Experimentation and Prototyping
The LLM landscape changes fast: models are released frequently, new state-of-the-art techniques are introduced, and performance benchmarks keep being broken. Changes happen on all fronts, including model architecture, training and inference frameworks, hardware, and optimization techniques. The best way to keep up is to experiment rapidly and learn from the failures.
Rapid experimentation with newer techniques yields surprising wins and often a deeper understanding of existing toolkits. For example, through experimentation we observed that vLLM's kernel implementation gave us roughly a 20x throughput improvement across multiple workloads.
Similarly, experimenting with techniques like offloading model parameters to CPU helped us create recipes for fine-tuning LLMs on more widely available GPUs with less memory. This let developers iterate on their training pipelines without waiting for high-end GPUs to become available.
With Nvidia H100s coming online, we see significant opportunities with FP8 quantization and Nvidia's Transformer Engine.
Figure 2: Training and serving stack for rapid prototyping and experimentation.
Conclusion
Using LLMs has improved various production ML systems and shown promise in several areas, including search & discovery, catalog, operations, and ads quality, amongst others. We expect more teams to use LLMs and similar model architectures in the coming quarters and ship wins for our customers.
We are continuously investing in training larger models and improving the resource efficiency of our GPU training and inference stacks. This involves optimization at all levels: hardware (compute, storage, networking), observability, and frameworks (model- and data-sharded training, profiling, utilization).
We thank our partners in product ML teams, especially Search & Discovery for being the early adopters of LLMs in their applications and sharing the progress and pain-points.
We thank our Tech Infrastructure teams especially for their support in provisioning compute resources and cluster health.
While the technical content in this article reflects our company’s research activities, it does not necessarily indicate our actual investment plans or product development directions. This content is purely for research purposes and should not be used as a basis for investment decisions. Please refer to our official announcements on ir.aboutcoupang.com for information on our formal investment plans and product development strategies.