AI-Powered Monitoring: Querying Prometheus in Natural Language with a LangChain Agent (I)

Prometheus is a powerful open-source monitoring tool, widely used to collect and query runtime metrics. In Site Reliability Engineering (SRE), it is also a key tool for keeping systems reliable. For non-engineers, however, and even for junior developers, reading and interpreting these metrics and dashboards is often challenging.

With the arrival of large language models (LLMs), we can use them to build automated AI agents that make monitoring and troubleshooting more efficient. This article walks through building an AI agent with LangChain and Prometheus: the agent understands questions a user asks in natural language, then checks the system's state through Prometheus Query Language (PromQL) queries, greatly improving observability and the day-to-day operations experience.

Prerequisites

To follow along, set up the environment first:

  1. Set up a Python environment; creating a virtual environment with venv or pipenv is recommended. (Setup itself is not covered here.)

  2. Install Prometheus and make sure it is reachable.

    Note
    kube-state-metrics pulls status data about Kubernetes resources (Pods, Nodes, Deployments, ReplicaSets, and so on) from the Kubernetes API Server. If you install Prometheus on Kubernetes, be sure to install this service as well; the examples in this article use the metrics it exposes.
  3. Prepare an LLM API; either of the following will do:

    • OpenAI API: sign up on the OpenAI API site to get an API key. With the recent arrival of GPT-4o mini, per-call costs have dropped sharply, so it is highly recommended.
    • Gemini API: get an API key through Google AI Studio. A free quota is available, which also makes it a good fit for testing.
  4. (Optional) If you use the Gemini API, you can register a GCP account; the LangChain example later also uses GCP features, and newly registered GCP accounts still include a free tier you can use while learning.
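Before wiring anything into LangChain, it can help to confirm that Prometheus is reachable and to see the shape of the data an agent will consume. Below is a minimal sketch using only the standard library, assuming Prometheus is exposed at http://localhost:9090; the helper names `instant_query` and `parse_instant_query` are our own, not from any library:

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumption: Prometheus port-forwarded locally

def instant_query(promql: str, base_url: str = PROMETHEUS_URL) -> dict:
    """Run an instant query against Prometheus's HTTP API and return the JSON payload."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def parse_instant_query(payload: dict) -> list:
    """Flatten an instant-query response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    return [(r["metric"], r["value"][1]) for r in payload["data"]["result"]]

# With a live server you could run, for example:
#   parse_instant_query(instant_query("up"))
```

The JSON structure parsed here (`status`, `data.result`, per-series `metric` and `value`) is the same one the custom tool later in this article hands back to the agent.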

Combining LangChain Agents with Prometheus

LangChain

LangChain is an open-source framework for building applications powered by large language models (LLMs). It aims to let developers build LLM applications easily, with custom prompt chains and seamless integration of external data for highly customized solutions. With its context-gathering and reasoning capabilities, LangChain lets LLMs interact intelligently and adaptively, making it well suited to building conversational systems, intelligent assistants, and content-generation tools.

In Python, install it with:

pip install langchain
Note
Check which version of langchain is installed: v0.2 differs from v0.1 in quite a few ways. This article targets v0.2; see the official documentation for details.

Example usage with OpenAI:

pip install langchain-openai
export OPENAI_API_KEY="YOUR OPENAI API KEY"
from langchain_openai.chat_models import ChatOpenAI
ChatOpenAI(model="gpt-4o-mini", temperature=0, streaming=True, max_tokens=500)

Example usage with Gemini:

pip install langchain-google-genai
export GOOGLE_API_KEY="YOUR GOOGLE API KEY"
from langchain_google_genai import ChatGoogleGenerativeAI
ChatGoogleGenerativeAI(model="gemini-1.5-flash", convert_system_message_to_human=True)

A full LangChain example using OpenAI:

from langchain_openai.chat_models import ChatOpenAI
from langchain_core.prompts import PromptTemplate

# Create an OpenAI chat model
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0, streaming=True, max_tokens=500)

# Define the prompt template
template = """Question: {question}

Answer: Let's think step by step."""
prompt = PromptTemplate.from_template(template)

# Build the chain (prompt piped into the model)
llm_chain = prompt | llm

# Ask a question
question = "What is LangChain?"
response = llm_chain.invoke(question)

# Print the answer
print(response.content)

LangChain Agent

  • What is an AI agent?

    An AI agent is an intelligent system that can complete tasks, learn, and adapt to its environment autonomously, without human intervention. Agents are often seen as the frontier of LLM applications: instead of being a purely reactive generation tool, an LLM gains an expanded range of capabilities, can act independently in complex and dynamic environments, and realizes far more of its potential.

  • And a LangChain Agent?

    A LangChain Agent is the tool that makes all of the above practical. It uses an LLM as a reasoning engine and lets developers plug various tools into the agent, such as web search and database queries. This allows the agent to adjust its behavior dynamically based on the current input and context, adapting to different scenarios and ultimately meeting the user's needs.
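To make the reasoning loop concrete before wiring in a real LLM and real tools, here is a schematic, library-free sketch of what an agent executor does under the hood: the model emits Thought/Action lines, the executor runs the named tool, and the observation is fed back until a final answer appears. The `scripted_llm` below is a fake stand-in for a real model; this is an illustration, not LangChain's actual implementation.

```python
from typing import Callable

def react_loop(llm: Callable[[str], str], tools: dict, question: str,
               max_steps: int = 5) -> str:
    """Minimal ReAct-style loop: parse Action lines, run tools, feed observations back."""
    scratchpad = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(scratchpad)
        scratchpad += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        if "Action:" in step and "Action Input:" in step:
            tool = step.split("Action:", 1)[1].split("\n", 1)[0].strip()
            arg = step.split("Action Input:", 1)[1].split("\n", 1)[0].strip()
            obs = tools.get(tool, lambda q: f"unknown tool: {tool}")(arg)
            scratchpad += f"Observation: {obs}\n"
    return "no answer within step budget"

# A scripted stand-in for the LLM: first asks for a search, then answers.
def scripted_llm(scratchpad: str) -> str:
    if "Observation:" not in scratchpad:
        return ("Thought: I should search.\n"
                "Action: google_search\n"
                "Action Input: kubectl get pods")
    return "Thought: I now know the final answer\nFinal Answer: use kubectl get pods"

toy_tools = {"google_search": lambda q: f"results for: {q}"}
```

Real executors add prompt formatting, parsing-error recovery, and iteration caps, which is exactly what `AgentExecutor`'s `handle_parsing_errors` and `max_iterations` options provide in the examples below.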

Next, let's build an example agent with Google Search capability:

  1. Set up the agent's tool: Google Search

    • In GCP APIs & Services, create an API key with access to the Custom Search API. If you use the same account you registered the Gemini API with, you will see the Gemini API key obtained earlier here; either create a new key dedicated to Google Search or grant the permission to the existing key.
    • In Programmable Search Engine (google.com), set up a custom search engine and note its GOOGLE_CSE_ID.
    Tip
    A free GCP account is limited to 100 searches per day; usage can be checked under GCP's Quotas & System Limits.
  2. Build a LangChain Agent that answers questions with the Google Search tool

    • Install the Google Search package
    pip install langchain-google-community

    • Build the LangChain Agent
    import os
    from langchain import hub
    from langchain_core.tools import Tool
    from langchain.agents import AgentExecutor, create_react_agent
    from langchain_openai.chat_models import ChatOpenAI
    from langchain_google_community import GoogleSearchAPIWrapper
    
    # Set environment variables
    os.environ["OPENAI_API_KEY"] = ""
    os.environ["GOOGLE_CSE_ID"] = ""
    os.environ["GOOGLE_API_KEY"] = ""
    
    # Use a ready-made community prompt
    prompt_template = hub.pull("hwchase17/react")
    
    # Create an OpenAI chat model
    client = ChatOpenAI(model="gpt-4o-mini", temperature=0, streaming=True, max_tokens=500)
    
    # Create the Google Search tool
    google_search = GoogleSearchAPIWrapper()
    tools = [
        Tool(
            name="google_search",
            func=google_search.run,
            description="useful for when you need to ask with search"
        ),
    ]
    
    # Build the LangChain ReAct agent
    search_agent = create_react_agent(client, tools, prompt_template)
    agent_executor = AgentExecutor(
        agent=search_agent,
        tools=tools,
        verbose=True,
        handle_parsing_errors=True,
        return_intermediate_steps=True,
    )
    
    # Ask a question and get the result
    user_input = "Can you check the status of all pods in the default namespace?"
    response = agent_executor.invoke({"input": user_input})
    print(response["output"])
    

Running the agent produces output like the following:

> Entering new AgentExecutor chain...
Thought: I need to understand what the question is asking. It appears to be asking about checking the status of pods in the default namespace. To do this, I need to use a command line tool like kubectl.
Action: google_search
Action Input: "kubectl get pods -n default"May 19, 2021 ... I just want to list pods with their .status.podIP as an extra column. It seems that as soon as I specify -o=custom-colums= the default columns ... Feb 5, 2015 ... thockin@freakshow kubernetes master /$ ./cluster/kubectl.sh get pods ... pod default/hostnames-v5drx. This seems to correspond to lines like ... Mar 29, 2019 ... kubectl does sorting client-side, but when you don't specify a sorting when doing a get command to the api server, it just does not sort. Jul 8, 2019 ... The output above was from kubectl get pods command which lists all pods from default namespace and that's equivalent to what we're doing ... Sep 7, 2022 ... There's always a "default" namespace. When no namespace is specified it's equivalent to: kubectl get pods --namespace=default. Upvote 7 ... pods/security/seccomp/ga/default-pod.yaml. kubectl get pod default-pod. The Pod should be showing as having started successfully: NAME READY STATUS RESTARTS ... Nov 26, 2023 ... ... default.svc.cluster.local Address: 10.244.0.4 Name: mongodb-service ... kubectl get pods -o wide NAME READY STATUS RESTARTS AGE IP NODE ... May 27, 2021 ... [root@node1 ~]# kubectl get pods NAME READY STATUS RESTARTS AGE k8ssandra-cass-operator-766b945f65-ntb9s 1/1 Running 0 24m k8ssandra-dc1-default ... Apr 3, 2023 ... kubectl is now configured to use "minikube" cluster and "default ... ubuntu@ip-172-31-25-66:~$ kubectl get pods -A NAMESPACE NAME READY ... Dec 9, 2022 ... I've been trying to see what's going on using kubectl get pod and kubectl describe pod . ... default-scheduler running PreBind plugin ...Thought: I need to understand what the question is asking. It appears to be asking about checking the status of pods in the default namespace. To do this, I need to use a command line tool like kubectl. 
Action: google_search
Action Input: "kubectl get pods -n default" (the same search returns the same results again) Thought: I now know the final answer
Final Answer: Use the command `kubectl get pods -n default` to view the status of all pods in the default namespace. 


> Finished chain.
Use the command `kubectl get pods -n default` to view the status of all pods in the default namespace.

In this example, the LangChain Agent used the Google Search tool to look things up and, after some reasoning, suggested using the kubectl command to check the pods' status. Next, we will add a custom Prometheus tool to the agent so it can answer our questions more precisely through PromQL.

Custom Prometheus Tool

In the LangChain framework, tools are the key to how an agent interacts with its environment. Through tools it can access all kinds of resources, such as the Google Search tool introduced earlier, as well as others like YouTube and Google Drive. Next, we will build a custom tool for querying Prometheus metrics, so that users can run Prometheus Query Language (PromQL) queries through natural language.

Before building the custom Prometheus tool, make sure you can reach Prometheus. On Kubernetes, the following command works:

# After running this, Prometheus is reachable at localhost:9090
kubectl port-forward pod/prometheus-k8s-0 9090:9090

Next, a quick look at the key components of a Tool:

  • name: the tool's name. It is required, and must be unique among the tools given to an agent.
  • description: optional, but **strongly recommended!** The LLM uses it as context to decide when to call the tool, so it matters a great deal.
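The roles of `name` and `description` can be seen even without an LLM: tool selection is essentially matching the model's chosen action string against registered names, with the descriptions supplied as context in the prompt. Here is a toy registry that makes that mechanic visible; the class and method names are our own illustration, not LangChain's API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ToyTool:
    name: str          # must be unique among the tools handed to an agent
    description: str   # injected into the prompt so the LLM knows when to pick this tool
    func: Callable[[str], str]

class ToyToolbox:
    def __init__(self, tools: list):
        names = [t.name for t in tools]
        if len(names) != len(set(names)):
            raise ValueError("tool names must be unique")
        self._tools = {t.name: t for t in tools}

    def render_descriptions(self) -> str:
        """Roughly what fills the {tools} slot of a ReAct prompt."""
        return "\n".join(f"{t.name}: {t.description}" for t in self._tools.values())

    def run(self, name: str, tool_input: str) -> str:
        """Dispatch the model's chosen action string to the matching tool."""
        return self._tools[name].func(tool_input)
```

A vague description here is the equivalent of a vague docstring for the model: the agent will either skip the tool or call it with the wrong kind of input.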

So that readers can copy and run it directly, explanations are included as comments. The complete example:

import os
from prometheus_api_client import PrometheusConnect
from typing import Optional
from pydantic import BaseModel, Field
from langchain_openai.chat_models import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from langchain.tools import BaseTool
from langchain.agents import AgentExecutor, create_react_agent

# Set environment variables; fill in your own API key
os.environ["OPENAI_API_KEY"] = ""

# Define the prompt template
prompt = """Answer the following questions about Kubernetes service status in Prometheus as best you can. You have access to the following tools:
{tools}

Use the following format:

Question: the input question you must answer about Kubernetes service status
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action (for PrometheusQuery, this should be a valid PromQL query)
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question about Kubernetes service status

Make sure to follow the format exactly and include 'Final Answer:' before your final response.

Begin!

Question: {input}
{agent_scratchpad}
"""
prompt_template = PromptTemplate.from_template(prompt)


class PrometheusQueryToolConfig(BaseModel):
    # Connection settings for Prometheus
    prometheus_url: str = Field(default="http://localhost:9090")  # URL of the Prometheus server
    disable_ssl: bool = Field(default=True)  # whether to disable SSL verification


class PrometheusQueryTool(BaseTool):
    # Tool that executes Prometheus queries
    name: str = "Prometheus Query"
    description: str = "Tool for querying a Prometheus server. Useful for querying metrics and alerts from the Prometheus monitoring system. Input should be a PromQL query string."
    config: Optional[PrometheusQueryToolConfig] = None  # Prometheus connection settings
    prom: Optional[PrometheusConnect] = None  # Prometheus client connection

    def __init__(self, prometheus_url: str = "http://localhost:9090", disable_ssl: bool = True):
        # Initialize the tool and set up the Prometheus connection
        super().__init__()
        self.config = PrometheusQueryToolConfig(prometheus_url=prometheus_url, disable_ssl=disable_ssl)
        self.prom = PrometheusConnect(url=self.config.prometheus_url, disable_ssl=self.config.disable_ssl)

    def _run(self, query: str):
        # Execute a PromQL query
        query = query.replace("`", "")  # strip any backticks the LLM may add
        try:
            result = self.prom.custom_query(query=query)
            return result
        except Exception as e:
            error_message = f"Error querying Prometheus: {str(e)}"
            return error_message

    def _arun(self, query: str):
        # Async execution (not implemented)
        raise NotImplementedError("This tool does not support async")


# Initialize the Prometheus query tool with the server URL
prometheus_tool = PrometheusQueryTool(prometheus_url="http://localhost:9090")

# Set up the LLM
client = ChatOpenAI(model="gpt-4o-mini", temperature=0, streaming=True)

# Include the Prometheus tool in the tool list
tools = [prometheus_tool]

# Define the ReAct agent
custom_agent = create_react_agent(client, tools, prompt_template)

# Set up the agent executor
agent_executor = AgentExecutor.from_agent_and_tools(
    agent=custom_agent,
    tools=tools,
    verbose=True,
    max_iterations=10,  # cap the number of reasoning iterations
    handle_parsing_errors=True,  # retry when the output cannot be parsed
    return_intermediate_steps=True,
)

# Example question
query = "Can you check which services exist in the monitoring namespace?"
response = agent_executor.invoke({"input": query})

# Get the final result
final_result = response.get('output', '')
print("Final result:", final_result)

Finally, running the program produces the following output:

> Entering new AgentExecutor chain...
Thought: To find out which services are available in the "monitoring" namespace, I need to query the relevant metrics in Prometheus that track Kubernetes services. Typically, the `kube_service_info` metric can provide information about services in a specific namespace. 

Action: Prometheus Query  
Action Input: `kube_service_info{namespace="monitoring"}`
[{"metric":{"__name__":"kube_service_info","cluster_ip":"10.84.14.134","container":"kube-rbac-proxy-main","instance":"10.80.3.8:8443","job":"kube-state-metrics","namespace":"monitoring","service":"prometheus-k8s","uid":"7df4d5c0-1b04-4fea-831e-fec3d8937b69"},"value":[1722667568.712,"1"]},{"metric":{"__name__":"kube_service_info","cluster_ip":"10.84.15.21","container":"kube-rbac-proxy-main","instance":"10.80.3.8:8443","job":"kube-state-metrics","namespace":"monitoring","service":"blackbox-exporter","uid":"0c8f5435-3c6a-4d4c-a48a-d6aa6625d2d6"},"value":[1722667568.712,"1"]},{"metric":{"__name__":"kube_service_info","cluster_ip":"10.84.4.217","container":"kube-rbac-proxy-main","instance":"10.80.3.8:8443","job":"kube-state-metrics","namespace":"monitoring","service":"alertmanager-main","uid":"e55f146d-acd4-4c0b-851c-44a9a24e2641"},"value":[1722667568.712,"1"]},{"metric":{"__name__":"kube_service_info","cluster_ip":"10.84.7.223","container":"kube-rbac-proxy-main","instance":"10.80.3.8:8443","job":"kube-state-metrics","namespace":"monitoring","service":"grafana","uid":"a14054cd-0bcf-413e-a861-cd790d21089b"},"value":[1722667568.712,"1"]},{"metric":{"__name__":"kube_service_info","cluster_ip":"None","container":"kube-rbac-proxy-main","instance":"10.80.3.8:8443","job":"kube-state-metrics","namespace":"monitoring","service":"alertmanager-operated","uid":"c2e722b6-f6de-46f2-8c35-3987124e368f"},"value":[1722667568.712,"1"]},{"metric":{"__name__":"kube_service_info","cluster_ip":"None","container":"kube-rbac-proxy-main","instance":"10.80.3.8:8443","job":"kube-state-metrics","namespace":"monitoring","service":"kube-state-metrics","uid":"3d567e79-b402-4e92-aea2-5e410fd7214a"},"value":[1722667568.712,"1"]},{"metric":{"__name__":"kube_service_info","cluster_ip":"None","container":"kube-rbac-proxy-main","instance":"10.80.3.8:8443","job":"kube-state-metrics","namespace":"monitoring","service":"node-exporter","uid":"ba0f41b1-5241-4754-a8cd-127bef4f7dcc"},"value":[1722667568.712,"1"]},{
"metric":{"__name__":"kube_service_info","cluster_ip":"None","container":"kube-rbac-proxy-main","instance":"10.80.3.8:8443","job":"kube-state-metrics","namespace":"monitoring","service":"prometheus-operated","uid":"caf926ec-7b44-4a43-94d1-3572eed52960"},"value":[1722667568.712,"1"]},{"metric":{"__name__":"kube_service_info","cluster_ip":"None","container":"kube-rbac-proxy-main","instance":"10.80.3.8:8443","job":"kube-state-metrics","namespace":"monitoring","service":"prometheus-operator","uid":"d24550d1-0ef2-417c-9a08-6afed1bfb184"},"value":[1722667568.712,"1"]}]I have retrieved the list of services in the "monitoring" namespace. Here are the services found:

1. prometheus-k8s
2. blackbox-exporter
3. alertmanager-main
4. grafana
5. alertmanager-operated
6. kube-state-metrics
7. node-exporter
8. prometheus-operated
9. prometheus-operator

Thought: I now know the final answer.
Final Answer: The services in the monitoring namespace are prometheus-k8s, blackbox-exporter, alertmanager-main, grafana, alertmanager-operated, kube-state-metrics, node-exporter, prometheus-operated, and prometheus-operator.

> Finished chain.
Final result: The services in the monitoring namespace are prometheus-k8s, blackbox-exporter, alertmanager-main, grafana, alertmanager-operated, kube-state-metrics, node-exporter, prometheus-operated, and prometheus-operator.

For our question, the LangChain Agent generated the corresponding PromQL, kube_service_info{namespace="monitoring"}, fetched the live state through the custom Prometheus tool, and, after reasoning over and summarizing the result, produced a response a user can understand at a glance.
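The raw vector a PromQL query returns (the list of `{"metric": ..., "value": ...}` entries visible in the trace above) is verbose, and part of the agent's value is condensing it. If you want to post-process the same structure yourself, a small helper might look like this; it is our own sketch, not part of prometheus_api_client:

```python
def summarize_vector(result: list, label: str) -> list:
    """Pull one label plus the sample value out of each series in an instant-query result."""
    lines = []
    for series in result:
        name = series["metric"].get(label, "<missing>")
        _, value = series["value"]  # value is a [timestamp, value-as-string] pair
        lines.append(f"{name}: {value}")
    return lines
```

For example, `summarize_vector(result, "service")` applied to the trace above would yield lines like `prometheus-k8s: 1`, one per service.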

Conclusion

In this article, we explored how to combine LangChain and Prometheus to build an AI agent that understands natural language and monitors systems. We started with an introduction to the LangChain framework, covered the concept of AI agents, and demonstrated the basics through a simple working example. Finally, we dug into building a custom Prometheus tool so the agent can run PromQL queries, achieving the goal of querying system state in natural language.

Combining large language models (LLMs) with monitoring tools this way not only makes monitoring more efficient but also greatly improves non-technical users' ability to understand and use monitoring data. Through the flexible combination of LangChain Agents and custom tools, we can build smarter, more intuitive monitoring solutions and open up new possibilities for DevOps and SRE practice.

In the next article, we can extend this idea further by integrating more monitoring tools and data sources, improving the agent's decision-making, and refining the user experience, for example by strengthening the agent with RAG or adding a WebUI, to handle more complex monitoring scenarios. This will not only improve observability but also give operations teams more powerful tools to manage and optimize system performance effectively.