LogicAD: Explainable Anomaly Detection via VLM-based Text Feature Extraction

1Institute of Imaging and Computer Vision, RWTH Aachen University; 2Knowledge-Based Systems Group (KBSG), RWTH Aachen University; 3Chair of Computer Science i5 (Information Systems and Databases), RWTH Aachen University; 4Independent Author
*Corresponding authors with equal contribution

Abstract

Logical image understanding involves interpreting and reasoning about the relationships and consistency within an image’s visual content. This capability is essential in applications such as industrial inspection, where logical anomaly detection is critical for maintaining high quality standards and minimizing costly recalls. Previous research in anomaly detection (AD) has relied on prior knowledge for designing algorithms, which often requires extensive manual annotations, significant computing power, and large amounts of training data. Autoregressive, multimodal Vision Language Models (AVLMs) offer a promising alternative due to their exceptional performance in visual reasoning across various domains. Despite this, their application to logical AD remains unexplored. In this work, we investigate the use of AVLMs for logical AD and demonstrate that they are well-suited to the task. Combining AVLMs with format embedding and a logic reasoner, we achieve state-of-the-art (SOTA) performance on the public benchmark MVTec LOCO AD, with an AUROC of 86.0% and an F1-max of 83.7%, along with explanations of the anomalies. This significantly outperforms the existing SOTA method by 18.1% in AUROC and 4.6% in F1-max score.

Methods

Text Extraction

The LogicAD method is built on robust text extraction, whose detailed steps can be summarized as follows:

  1. Generate a set of regions \( \mathbf{w}_i \), \( i \in [1, N] \), where \( N \) is the total number of regions of interest (ROIs), using the function \( f_{\text{GDINO}} \) with feature prompts. Each region, together with the original image, is then processed \( K = 3 \) times by the function \( f_{\text{AVLM}} \), yielding a collection of textual descriptions \( \mathcal{T} = \{ t_1, \dots, t_K \} \).
  2. Construct the text embedding space \( \mathcal{M} \) by applying the text embedding model \( f_{\text{emb}} \) (OpenAI's text-embedding-3-large) to \( \mathcal{T} \): \[ \mathcal{M} = f_{\text{emb}}(\mathcal{T}) = \{ \mathbf{e}_i \}_{i=1}^{K} \] where \( \mathbf{e}_i \) is the embedding of the \( i \)-th extracted description. These embeddings are then fed into an outlier detection model, the Local Outlier Factor (LOF) function \( f_{\text{LOF}} \), which removes outlier descriptions and yields the filtered set \( \mathcal{T}_{\text{filter}} \). Finally, we randomly select one description from \( \mathcal{T}_{\text{filter}} \). A minimal sketch of this pipeline, under stated assumptions, follows the list.
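The sketch below strings the two steps together in Python. The wrappers `gdino_regions` (for \( f_{\text{GDINO}} \)) and `avlm_describe` (for \( f_{\text{AVLM}} \)), as well as the LOF neighborhood size, are assumptions introduced here for illustration; only the OpenAI embedding call and scikit-learn's LocalOutlierFactor are real APIs.

```python
# Minimal sketch of the two-step text extraction, assuming hypothetical
# wrappers `gdino_regions` (f_GDINO) and `avlm_describe` (f_AVLM);
# the embedding call and LOF are real APIs (openai>=1.0, scikit-learn).
import random

import numpy as np
from openai import OpenAI
from sklearn.neighbors import LocalOutlierFactor

client = OpenAI()

def embed_texts(texts):
    """f_emb: embed a list of strings with text-embedding-3-large."""
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.asarray([d.embedding for d in resp.data])

def extract_robust_text(image, prompts, gdino_regions, avlm_describe, K=3):
    # Step 1: propose ROIs with f_GDINO and describe each one K times.
    descriptions = []
    for region in gdino_regions(image, prompts):
        descriptions += [avlm_describe(image, region) for _ in range(K)]

    # Step 2: build the embedding space M and drop outlier descriptions
    # with the Local Outlier Factor (-1 = outlier, 1 = inlier).
    M = embed_texts(descriptions)
    n_neighbors = min(5, len(descriptions) - 1)  # assumed hyperparameter
    labels = LocalOutlierFactor(n_neighbors=n_neighbors).fit_predict(M)
    t_filter = [t for t, lab in zip(descriptions, labels) if lab == 1]

    # Randomly select one description from the filtered set T_filter.
    return random.choice(t_filter)
```

Sampling each region several times and filtering with LOF makes the selected description robust to occasional AVLM hallucinations, since outlying descriptions are discarded before selection.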
[Figure: sample of the text feature extraction pipeline, with panels for Guided CoT and the anomaly detection logic.]

Anomaly Text Detection

Building on the robust text extraction, we propose two directions for anomaly text detection:

1. Format Embedding

After extracting the text features, we use an LLM to summarize the text in JSON format. The JSON summaries of the normal/reference image \(X_{n}\) and the query image \(X_{q}\) are then fed into the embedding function \(f_{\text{emb}}\) to generate the respective embedding features \(\mathbf{\hat{e}}_n\) and \(\mathbf{\hat{e}}_q\). We then compute the anomaly score as one minus the cosine similarity between these normalized embeddings:

\[ \text{ascore} = 1 - \langle \mathbf{\hat{e}}_n, \mathbf{\hat{e}}_q \rangle \]
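As a minimal sketch, the score reduces to a single dot product once the two embeddings are L2-normalized (the normalization is made explicit here):

```python
import numpy as np

def anomaly_score(e_n: np.ndarray, e_q: np.ndarray) -> float:
    """ascore = 1 - <e_n_hat, e_q_hat> on L2-normalized embeddings."""
    e_n_hat = e_n / np.linalg.norm(e_n)
    e_q_hat = e_q / np.linalg.norm(e_q)
    return 1.0 - float(np.dot(e_n_hat, e_q_hat))
```

Near-identical JSON summaries give a score close to 0, while semantically divergent summaries push the score up.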

2. Logic Reasoner

Let \(\Gamma = \Sigma_{\text{norm}} \cup \Sigma_{\text{na}} \cup \Sigma_{\text{fa}} \cup \Sigma_{\text{dca}}\) be the union of all hypotheses. The anomaly detection (\(\text{AD}\)) task is then converted to a theorem-proving task:

  • If \(\Gamma \models \neg \Sigma_0\), then we label the image as abnormal.
  • If \(\Gamma \not\models \neg \Sigma_0\), then we label the image as normal.

We use Prover9 for theorem proving. Here, \(\Gamma \models \neg \Sigma_0\) means \(\Gamma\) logically entails \(\neg \Sigma_0\), i.e., every logical model satisfying \(\Gamma\) also satisfies the negation of \(\Sigma_0\). Since \(\Sigma_0\) is the formal description of the image, such an entailment shows that the image description contradicts the normal cases, and hence that the image is abnormal. A sketch of this check with Prover9's Python bindings follows.
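As an illustration, the entailment check maps directly onto Prover9 through NLTK's inference interface. This requires the Prover9 binary to be installed where NLTK can find it, and the axioms below are a simplified, one-argument rendering of the fruit example discussed next, not the paper's exact \(\Sigma_{\text{norm}}\), \(\Sigma_{\text{na}}\), \(\Sigma_{\text{fa}}\), or \(\Sigma_{\text{dca}}\):

```python
# Minimal sketch of the entailment check Gamma |= -Sigma_0 via NLTK's
# Prover9 bindings (requires a local Prover9 installation).
from nltk.sem import Expression
from nltk.inference import Prover9

read_expr = Expression.fromstring

# Gamma: a normality axiom saying at most one object sits on the left,
# plus a unique-names assumption for the two constants.
gamma = [
    read_expr('all x. all y. ((left(x) & left(y)) -> (x = y))'),
    read_expr('-(apple = nectarine)'),
]

# Sigma_0 is the formalized image description; -Sigma_0 is the goal.
goal = read_expr('-(left(apple) & left(nectarine))')

abnormal = Prover9().prove(goal, gamma)
print('abnormal' if abnormal else 'normal')  # -> abnormal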

To identify the actual anomaly, we look for a minimal subset of \(\Sigma_0\) that causes it: for \(\Sigma_a \subseteq \Sigma_0\), if \(\Gamma \models \neg \Sigma_a\) and \(\Gamma \not\models \neg \Sigma'\) for every \(\Sigma' \subsetneq \Sigma_a\), then \(\Sigma_a\) forms an explanation. Consider \(\Sigma_0\) and \(\Gamma\) as defined above, with all the mentioned formulae:

\[ \Gamma \models \neg \Sigma_0 \quad \text{since both an apple and a nectarine are on the left.} \] Then \(\Sigma_a = \{ \text{left}(\text{nectarine}, 1), \text{left}(\text{apple}, 1) \}\) is a formal explanation of the anomaly, since \(\Sigma_a \subseteq \Sigma_0\) and

\[ \begin{align*} & \Gamma \models \neg (\text{left}(\text{nectarine}, 1) \land \text{left}(\text{apple}, 1)) \\ & \Gamma \not\models \neg \text{left}(\text{nectarine}, 1) \\ & \Gamma \not\models \neg \text{left}(\text{apple}, 1) \end{align*} \]
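A brute-force search makes the minimality condition concrete. The sketch below assumes a hypothetical oracle `entails_negation(subset)` that returns True iff \(\Gamma \models \neg \bigwedge \Sigma'\), e.g. backed by the Prover9 call above; enumerating subsets by increasing size guarantees that each reported \(\Sigma_a\) has no entailed proper subset:

```python
# Brute-force search for minimal explanations, assuming a hypothetical
# oracle `entails_negation(subset)` == True iff Gamma |= -(AND of subset),
# e.g. implemented with the Prover9 call from the previous sketch.
from itertools import combinations

def minimal_explanations(sigma_0, entails_negation):
    explanations = []
    # Enumerate subsets by increasing size: any entailed proper subset of a
    # candidate is found earlier, so surviving candidates are minimal.
    for size in range(1, len(sigma_0) + 1):
        for subset in map(set, combinations(sigma_0, size)):
            if any(e <= subset for e in explanations):
                continue  # not minimal: contains a smaller explanation
            if entails_negation(subset):
                explanations.append(subset)
    return explanations
```

On the fruit example, the only minimal explanation returned would be \(\{\text{left}(\text{nectarine}, 1), \text{left}(\text{apple}, 1)\}\), matching \(\Sigma_a\) above: each singleton is consistent with \(\Gamma\), but their conjunction is not.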
