Observability for LLMs

  • Zev Schonberg
  • June 18, 2024

So, your company uses LLMs? You’re not alone. A Gartner survey from October 2023 revealed that 55% of organizations were piloting or releasing generative AI projects, and it’s safe to assume that number has only grown since. From personalized recommendations in e-commerce to automated grading in education and fraud detection in finance, LLMs have helped many organizations level up.

The Enterprise IT industry has also leveraged these models to drastically improve the capabilities of observability solutions, with features like natural language query interfaces, automated root cause analysis, and improved system security. Yet, like any other piece of software, LLMs are fallible. And since observability and other fields rely increasingly on them, a question arises: how do we ensure the observability of the LLMs themselves?

The challenges of observability for LLMs

Observability practices for Machine Learning (ML) have been around for a while, but LLMs, and more complex ML systems in general, have brought an additional set of challenges to the table, such as the need for real-time monitoring of language generation, interpreting complex model outputs, and addressing ethical concerns.

Observability Challenges: Traditional ML vs. Large Language Models (LLMs)

There are other challenges to keep in mind as well. Given their significant resource requirements, LLMs are usually deployed as separate services and across multiple nodes or clusters. On top of that, the models evolve quickly, so keeping track of different versions and their impact on performance becomes paramount. Thankfully, the observability space has evolved concurrently and designed new solutions to meet the demands of LLMs, both in training and production settings.

Emerging observability solutions for LLMs

There are plenty of observability solutions on the market for LLMs and ML in general, but they tend to have slightly different focuses.

ML Model Monitoring involves tracking metrics relating to the performance and behavior of models, such as accuracy, precision, recall, or drift (changes in behavior over time). LLM Ops, on the other hand, focuses specifically on the operational aspects of deploying, managing, and optimizing LLMs in production environments. In this section, we will treat ML Model Monitoring and LLM Ops solutions together, as both pertain to the deployment and management of LLMs specifically.

APM (Application Performance Monitoring) for LLMs focuses on monitoring the performance and behavior of applications that utilize LLMs. This includes real-time monitoring of inference and response times, tracking application performance metrics, and identifying performance bottlenecks within LLM-dependent features.

ML Model Monitoring and LLM Ops

There are, of course, many different metrics to keep track of to ensure LLMs perform in an appropriate way. Let’s take a look at some of them.

Classic Health Metrics

Classic health metrics such as CPU usage, memory consumption, and disk I/O are essential for monitoring the overall health and performance of the infrastructure hosting the LLM. Most ML observability solutions on the market offer these capabilities. 
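
As a rough illustration of how these metrics can be collected, the sketch below samples CPU, memory, and disk I/O on a host serving an LLM using the psutil library (assumed installed); the reporting interval and printed output are placeholders, and in practice you would forward these values to your observability backend.

```python
# Minimal sketch: sampling host health metrics with psutil.
# Assumes `pip install psutil`; the 30-second interval and print output are illustrative.
import time

import psutil

def sample_host_metrics() -> dict:
    """Take a snapshot of CPU, memory, and cumulative disk I/O counters."""
    disk = psutil.disk_io_counters()
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),       # averaged over 1 second
        "memory_percent": psutil.virtual_memory().percent,   # % of RAM in use
        "disk_read_mb": disk.read_bytes / 1e6,                # cumulative since boot
        "disk_write_mb": disk.write_bytes / 1e6,
    }

if __name__ == "__main__":
    while True:
        print(sample_host_metrics())  # replace with a push to your monitoring pipeline
        time.sleep(30)
```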

Versioning and Experiment Tracking

Most models are trained in separate iterations, and versioning helps track changes made to hyperparameters, training data, and architecture. While not the only solution available, Weights & Biases provides a good platform for conducting A/B testing and managing different ML experiments and model updates.
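
As a rough sketch of what experiment tracking looks like in practice, the snippet below logs hyperparameters and a training metric to Weights & Biases; the project name, model identifier, and metric values are placeholders, and a configured wandb API key is assumed.

```python
# Minimal sketch: tracking an LLM fine-tuning run with Weights & Biases.
# Assumes `pip install wandb` and a logged-in API key; names and values are placeholders.
import wandb

run = wandb.init(
    project="llm-finetuning",            # hypothetical project name
    config={
        "base_model": "my-base-llm-v2",  # placeholder model identifier
        "learning_rate": 2e-5,
        "batch_size": 16,
    },
)

for epoch in range(3):
    # ... the actual training step would run here ...
    wandb.log({"epoch": epoch, "train_loss": 1.0 / (epoch + 1)})  # dummy metric values

run.finish()
```

Because each run is stored together with its configuration, it becomes straightforward to compare model versions side by side or trace a performance regression back to a specific change.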

Attention Pattern Analysis

Attention patterns offer insights into how an LLM behaves, by focusing on how it processes and weighs different parts of the input sequence. Fiddler AI, amongst other tools, can help capture attention weights during model inference and analyze attention patterns across input tokens. The tool provides a nice visual analysis of common patterns and biases.
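
To illustrate the underlying mechanism independently of any particular vendor, the sketch below captures per-layer attention weights at inference time with the Hugging Face Transformers library, using GPT-2 as a stand-in for whichever model you actually monitor.

```python
# Minimal sketch: capturing attention weights during inference with Hugging Face
# Transformers. GPT-2 is a stand-in for the LLM being monitored.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("Observability keeps LLMs honest.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions holds one tensor per layer, shaped (batch, heads, seq_len, seq_len).
for layer_idx, layer_attention in enumerate(outputs.attentions):
    print(f"layer {layer_idx:2d}: mean attention weight {layer_attention.mean().item():.4f}")
```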

Pattern Identification and Drift Detection

Changes in data distributions, known as concept drift, can occur over time due to various factors such as shifts in user behavior or in the underlying data generating process. It is important to detect drift early to prevent performance degradation, for example by retraining models on updated datasets or adjusting hyperparameters. Arize AI and WhyLabs are good drift detection tools and both work well with unstructured data entities, such as vectorized embeddings.
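
As a vendor-neutral illustration of the idea (not how Arize AI or WhyLabs compute drift internally), the sketch below compares reference and production embeddings dimension by dimension with a two-sample Kolmogorov-Smirnov test and reports the share of dimensions that shifted significantly; the threshold and synthetic data are placeholders.

```python
# Minimal sketch of embedding drift detection: a per-dimension two-sample
# Kolmogorov-Smirnov test; the alpha threshold and synthetic data are illustrative only.
import numpy as np
from scipy.stats import ks_2samp

def drift_share(reference: np.ndarray, production: np.ndarray, alpha: float = 0.01) -> float:
    """Return the fraction of embedding dimensions whose distribution shifted significantly."""
    drifted = 0
    for dim in range(reference.shape[1]):
        _, p_value = ks_2samp(reference[:, dim], production[:, dim])
        if p_value < alpha:
            drifted += 1
    return drifted / reference.shape[1]

# Dummy data: production embeddings drawn with a shifted mean to simulate drift.
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(1000, 64))
production = rng.normal(0.3, 1.0, size=(1000, 64))
print(f"share of drifted dimensions: {drift_share(reference, production):.2f}")
```

A rising share of drifted dimensions is a signal to investigate the input data and, if needed, retrain or recalibrate the model.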

Accuracy and Evaluation Metrics

Accuracy measures how well the model’s predictions align with ground truth labels or expected outcomes, using evaluation metrics specific to the LLM task, such as precision, recall, and F1-score. Again, Weights & Biases offers a good suite of tools for visualizing model accuracy metrics across different datasets and environments, including customizable dashboards and real-time alerts.
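
For a concrete, platform-agnostic example of these metrics, the snippet below scores a toy LLM classification task (intent detection) with scikit-learn; the labels are dummy data.

```python
# Minimal sketch: standard evaluation metrics for an LLM classification task
# (e.g., intent detection) computed with scikit-learn; the labels are dummy data.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["refund", "refund", "shipping", "other", "shipping", "refund"]
y_pred = ["refund", "shipping", "shipping", "other", "shipping", "other"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"accuracy:  {accuracy_score(y_true, y_pred):.2f}")
print(f"precision: {precision:.2f}")
print(f"recall:    {recall:.2f}")
print(f"f1-score:  {f1:.2f}")
# In practice, these values would be logged per model version, e.g., via wandb.log().
```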

Affordable APM for LLMs

The tools mentioned above are great for monitoring LLMs, but they do not offer APM (Application Performance Monitoring) capabilities. However, because LLMs are usually part of a large application ecosystem, it is critical to integrate them and track how they interact with the rest of the application’s components. APM tools for LLM should include features for real-time monitoring of key performance metrics such as response times, throughput, and error rates, as well as anomaly detection, root cause analysis, and scalability management.
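
As a rough sketch of what such instrumentation can look like, the snippet below wraps a hypothetical call_llm function with the OpenTelemetry API (opentelemetry-api package) to record per-request latency and flag errors; exporter and backend configuration are omitted, and the span, metric, and attribute names are illustrative.

```python
# Minimal sketch: instrumenting an LLM call with the OpenTelemetry API so an APM
# backend can track latency and error rates. `call_llm` is a hypothetical stub;
# exporter/backend setup is omitted, so the API falls back to no-op providers.
import time

from opentelemetry import metrics, trace

tracer = trace.get_tracer("llm.app")
meter = metrics.get_meter("llm.app")
latency_ms = meter.create_histogram("llm.request.duration", unit="ms")

def call_llm(prompt: str) -> str:
    return "stub completion"  # placeholder for the real model or API call

def generate(prompt: str) -> str:
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("llm.prompt.length", len(prompt))
        start = time.monotonic()
        try:
            return call_llm(prompt)
        except Exception:
            span.set_attribute("error", True)  # surfaces error rate to the backend
            raise
        finally:
            latency_ms.record((time.monotonic() - start) * 1000.0)

print(generate("Summarize today's error logs."))
```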

While keeping ingestion and storage costs down is always a concern, it’s perhaps even more important for LLM applications, given that the datasets are continuously expanding and generating a high volume of data to process and store. In contrast to other APM vendors, Coralogix is specifically designed for cost-efficiency. The in-stream analysis and processing of data enables organizations to extract insights from LLM data as soon as it’s ingested, without the need for costly indexing or hot storage. Using Coralogix’s TCO (Total Cost of Ownership) Optimizer, ingested data that demands lightning-fast querying around the clock can be routed to indexing and hot storage. Meanwhile, less time-sensitive data can find its home in archive, with Coralogix Remote Query ensuring swift querying.

The platform doesn’t yet offer an out-of-the-box AI monitoring dashboard, but its flexibility and ease of use make it straightforward for organizations to build custom dashboards tailored to their specific AI monitoring needs. Furthermore, Coralogix values open-source software, with data ingestion via OpenTelemetry and data storage in the customer’s own S3 bucket (or other archive solution) using a Parquet-based data format.

Conclusion

As organizations rely more and more on generative AI and LLMs, the need for specialized observability becomes paramount. Observability solutions tailored to LLMs and AI not only ensure the performance of these systems and their seamless integration into the larger organizational ecosystem, but also enable organizations to harness the full potential of AI technologies. It’s a symbiotic relationship: while AI assists observability efforts, observability is equally crucial for AI’s success, and this interconnected cycle will continue to push the boundaries of observability and propel the field to new heights in the coming years.
