Logical image understanding involves interpreting and reasoning about the relationships and consistency within an image’s visual content. This capability is essential in applications such as industrial inspection, where logical anomaly detection is critical for maintaining high-quality standards and minimizing costly recalls. Previous research in anomaly detection (AD) has relied on prior knowledge for designing algorithms, which often requires extensive manual annotations, significant computing power, and large amounts of data for training. Autoregressive, multimodal Vision Language Models (AVLMs) offer a promising alternative due to their exceptional performance in visual reasoning across various domains. Despite this, their application in logical AD remains unexplored. In this work, we investigate using AVLMs for logical AD and demonstrate that they are well-suited to the task. Combining AVLMs with format embedding and a logic reasoner, we achieve state-of-the-art (SOTA) performance on the public MVTec LOCO AD benchmark, with an AUROC of 86.0% and an F1-max of 83.7%, along with explanations of the anomalies. This significantly outperforms the existing SOTA method by 18.1% in AUROC and 4.6% in F1-max score.
The LogicAD method is based on robust text extraction. The detailed steps of text extraction can be summarized as follows:
After the robust text extraction, we propose the following two directions for anomaly text detection:
After extracting the text features, we use an LLM to summarize the text in JSON format. Both the normal/reference image \(X_{n}\) and the query image \(X_{q}\) are then fed into the embedding function \(f_{\text{emb}}\) to generate the respective embedding features \(\mathbf{\hat{e}}_n\) and \(\mathbf{\hat{e}}_q\). We then calculate the anomaly score based on the cosine similarity between these normalized embeddings as follows:
\[ \text{ascore} = 1 - \langle \mathbf{\hat{e}}_n, \mathbf{\hat{e}}_q \rangle \]
Let \(\Gamma = \Sigma_{\text{norm}} \cup \Sigma_{\text{na}} \cup \Sigma_{\text{fa}} \cup \Sigma_{\text{dca}}\) be the union of all hypotheses. The anomaly detection (\(\text{AD}\)) task is then converted to a theorem-proving task:
We use Prover9 for theorem proving. Here, \(\Gamma \models \neg \Sigma_0\) means \(\Gamma\) logically entails \(\neg \Sigma_0\), i.e., every logical model satisfying \(\Gamma\) will also satisfy the negation of \(\Sigma_0\). Since \(\Sigma_0\) is the formal description of the image, this entailment shows that the image description contradicts the normal cases, and hence the image is abnormal.
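The entailment check can be illustrated with a small propositional sketch. It is not Prover9 and not the authors' implementation: it swaps in brute-force truth-table enumeration, which only works for a handful of ground atoms. The atoms `p` and `q`, the single rule in `gamma`, and the function name `entails_negation` are all hypothetical and chosen only for illustration.

```python
from itertools import product


def entails_negation(atoms, gamma, sigma0):
    """Check gamma |= not(AND sigma0) by enumerating every truth assignment.

    gamma and sigma0 are lists of functions that map a model (dict) to bool.
    The entailment holds exactly when no model satisfies gamma and sigma0
    at the same time.
    """
    for values in product([False, True], repeat=len(atoms)):
        model = dict(zip(atoms, values))
        if all(f(model) for f in gamma) and all(f(model) for f in sigma0):
            return False  # counter-model: gamma and sigma0 hold together
    return True


# Hypothetical toy knowledge base (not taken from the paper).
atoms = ["p", "q"]
gamma = [lambda m: not (m["p"] and m["q"])]       # p and q never hold together
abnormal = [lambda m: m["p"], lambda m: m["q"]]   # description asserting both
normal = [lambda m: m["p"]]                       # description asserting p only

is_abnormal = entails_negation(atoms, gamma, abnormal)
is_normal_flagged = entails_negation(atoms, gamma, normal)
```

Under these assumptions the first description is flagged as abnormal and the second is not. A first-order prover such as Prover9 answers the same kind of question without enumerating models.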
To identify the actual anomaly, we look for a minimal subset of \(\Sigma_0\) which causes the anomaly, i.e., for \(\Sigma_a \subseteq \Sigma_0\), if \(\Gamma \models \neg \Sigma_a\) and for any \(\Sigma' \subsetneq \Sigma_a\), we have \(\Gamma \not\models \neg \Sigma'\), then \(\Sigma_a\) forms an explanation. Consider \(\Sigma_0\) and \(\Gamma\) defined as above with all the mentioned formulae:
\[ \Gamma \models \neg \Sigma_0 \quad \text{since both an apple and a nectarine are on the left.} \]
Then \(\Sigma_a = \{ \text{left}(\text{nectarine}, 1), \text{left}(\text{apple}, 1) \}\) is a formal explanation of the anomaly since \(\Sigma_a \subseteq \Sigma_0\), and
\[ \begin{aligned} & \Gamma \models \neg (\text{left}(\text{nectarine}, 1) \land \text{left}(\text{apple}, 1)) \\ & \Gamma \not\models \neg \text{left}(\text{nectarine}, 1) \\ & \Gamma \not\models \neg \text{left}(\text{apple}, 1) \end{aligned} \]
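The search for a minimal explanation \(\Sigma_a\) can be sketched as a smallest-first subset search. This is a hedged illustration, not the authors' code: entailment is decided by truth-table enumeration rather than by Prover9, \(\Gamma\) is reduced to a single hypothetical rule (a nectarine and an apple may not both be on the left), and the extra fact `right(granola,1)` is invented to show that irrelevant facts are excluded from the explanation.

```python
from itertools import combinations, product

# Ground atoms treated as propositional variables (the last one is hypothetical).
ATOMS = ["left(nectarine,1)", "left(apple,1)", "right(granola,1)"]


def gamma_holds(model):
    """Hypothetical stand-in for Gamma: not both fruits on the left."""
    return not (model["left(nectarine,1)"] and model["left(apple,1)"])


def entails_negation(facts):
    """Gamma |= not(AND facts): no model of Gamma makes all facts true."""
    for values in product([False, True], repeat=len(ATOMS)):
        model = dict(zip(ATOMS, values))
        if gamma_holds(model) and all(model[a] for a in facts):
            return False
    return True


def minimal_explanations(sigma0):
    """Return all minimal subsets of sigma0 whose negation Gamma entails.

    Subsets are visited smallest first. Because entailment of a negated
    conjunction is preserved when facts are added, a candidate is minimal
    exactly when it contains no previously found explanation.
    """
    found = []
    for size in range(1, len(sigma0) + 1):
        for subset in combinations(sigma0, size):
            candidate = set(subset)
            if any(e <= candidate for e in found):
                continue  # contains a smaller explanation, so not minimal
            if entails_negation(subset):
                found.append(candidate)
    return found


sigma0 = ["left(nectarine,1)", "left(apple,1)", "right(granola,1)"]
explanations = minimal_explanations(sigma0)
```

With this toy \(\Gamma\), the only explanation returned is the pair of `left` facts, matching the three conditions above. The search is exponential in the size of \(\Sigma_0\), so it is only practical for short image descriptions.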