Binary Representation via Jointly Personalized Sparse Hashing

Xiaoqin Wang xqwang@guet.edu.cn , Chen Chen chenchen˙guet@163.com , Rushi Lan rslan@guet.edu.cn Guangxi Key Laboratory of Image and Graphic Intelligent Processing, Guilin University of Electronic Technology Guilin China , Licheng Liu College of Electrical and Information Engineering, Hunan UniversityChangshaChina lichenghnu@gmail.com , Zhenbing Liu Guangxi Key Laboratory of Image and Graphic Intelligent Processing, Guilin University of Electronic TechnologyGuilinChina zbliu2011@163.com , Huiyu Zhou School of Computing and Mathematical Sciences, University of LeicesterLeicesterthe United Kingdom hz143@leicester.ac.uk and Xiaonan Luo Guangxi Key Laboratory of Image and Graphic Intelligent Processing, Guilin University of Electronic TechnologyGuilinChina luoxn@guet.edu.cn

2018

Abstract.

Unsupervised hashing has attracted much attention for binary representation learning due to the requirement of economical storage and efficiency of binary codes. It aims to encode high-dimensional features in the Hamming space with similarity preservation between instances. However, most existing methods learn hash functions in manifold-based approaches. Those methods capture the local geometric structures (i.e., pairwise relationships) of data, and lack satisfactory performance in dealing with real-world scenarios that produce similar features (e.g. color and shape) with different semantic information. To address this challenge, in this work, we propose an effective unsupervised method, namely Jointly Personalized Sparse Hashing (JPSH), for binary representation learning. To be specific, firstly, we propose a novel personalized hashing module, i.e., Personalized Sparse Hashing (PSH). Different personalized subspaces are constructed to reflect category-specific attributes for different clusters, adaptively mapping instances within the same cluster to the same Hamming space. In addition, we deploy sparse constraints for different personalized subspaces to select important features. We also collect the strengths of the other clusters to build the PSH module with avoiding over-fitting. Then, to simultaneously preserve semantic and pairwise similarities in our proposed JPSH, we incorporate the proposed PSH and manifold-based hash learning into the seamless formulation. As such, JPSH not only distinguishes the instances from different clusters, but also preserves local neighborhood structures within the cluster. Finally, an alternating optimization algorithm is adopted to iteratively capture analytical solutions of the JPSH model. We apply the proposed representation learning algorithm JPSH to the similarity search task. Extensive experiments on four benchmark datasets verify that the proposed JPSH outperforms several state-of-the-art unsupervised hashing algorithms.

Binary representation, personalized hashing, manifold hashing, similarity search

Corresponding authors: Rushi Lan and Licheng Liu

^†^†copyright: acmcopyright^†^†journalyear: 2018^†^†doi: XXXXXXX.XXXXXXX^†^†journal: JACM^†^†journalvolume: 37^†^†journalnumber: 4^†^†article: 111^†^†publicationmonth: 8^†^†ccs: Computing methodologies Image representations

1. Introduction

Representation learning has attracted extensive research attention in affective applications, such as affective image retrieval (Yao et al., 2019, 2020; Zhao et al., 2014), emotion classification (Chen et al., 2015; Rao et al., 2020, 2019) and facial recognition (Shi et al., 2020). Traditional representation learning usually bases on real-valued features, which will cause a waste of time and space. Recently, hashing representation also known as binary representation has been popular and achieved promising performances (Xiang et al., 2021; Jin et al., 2018; Tang et al., 2018). Furthermore, unsupervised hashing refers to the technology that maps high-dimensional features to compact binary codes without label information (Li et al., 2022; Liu et al., 2014; Hao et al., 2017; Hu et al., 2020; Zhang et al., 2020), and the similarity relationship between instances can be approximated by the Hamming distance. Therefore, unsupervised hashing with the low storage and efficient computation can be widely used in representation learning tasks, especially in the image similarity search tasks, which aims to retrieve some related samples from the dataset (Zhu et al., 2018; Jin et al., 2021).

Existing unsupervised hashing methods can be categorized into data-independent and learning-based hashing. Previous works mainly focus on finding suitable projections to produce optimal binary codes, e.g., Locality Sensitive Hashing (LSH) (Datar et al., 2004) and Min-wise Hashing (Min-Hash) (Indyk, 2001). Such methods model learning processes without using any data structure and distribution in the original space, and require long hash bits to achieve satisfactory results.

In contrast, learning-based hashing methods are gaining popularity in recent years. These methods preserve similarity relationships between instances via different viewpoints. For example, Iterative Quantization with Principal Component Analysis (PCA) (PCA-ITQ) (Gong et al., 2012) imposes a rotation matrix to iteratively reduce quantization errors between low-dimensional features and binary codes. Sparse Projections (SP) (Xia et al., 2015) incorporates a sparse regularizer to reduce the number of parameters and computational cost. Ordinal Embedding Hashing (OEH) (Liu et al., 2016) and Ordinal Constraint Hashing (OCH) (Liu et al., 2018) preserve the ranking information by embedding the ordinal relation among data points. Concatenation Hashing (CH) (Weng and Zhu, 2020) encourages that any two instances close to the same center point are close, whilst maintaining relative positions in the Hamming space. Recovery of Subspace Structures Hashing (RSSH) (Tian et al., 2020) preserves the semantic similarity in the Hamming space via an unsupervised multi-stage hashing model.

In addition, some manifold-based hashing methods have been proposed to capture the complex structures of data based on graph learning. Spectral Hashing (SH) (Weiss et al., 2008) converted the binary code learning to the graph partitioning. Zhu et al. proposed the Sparse Embedding and Least Variance Encoding (SELVE) (Zhu et al., 2014) to encode the sparse embedding vector over a learned dictionary. Anchor Graph Hashing (AGH) (Liu et al., 2011) built an anchor graph (Liu et al., 2010), and it obtained the tractable low-rank adjacency matrix. The matrix was used to measure the similarity between a pair of data points and a small number of anchor points, which were the K-means clustering centers because of the strong representation power of the category attributes. To preserve the underlying manifold structure with the t-SNE (Van der Maaten and Hinton, 2008) that is a modification of stochastic neighborhood embedding, Inductive Manifold Hashing (IMH) (Shen et al., 2013) was propsed. Some researchers developed a tractable alternating maximization algorithm to preserve the neighborhood structure inherent in the data, called Discrete Graph Hashing (DGH) (Liu et al., 2014). Jiang et al. implicitly computed the similarity graph matrix by feature transformation, and then proposed the Scalable Graph Hashing (SGH) (Jiang and Li, 2015) method. Graph PCA (gPCA) Hashing (Zhu et al., 2017) simultaneously preserved local structures via manifold learning and global structures via PCA. Locally Linear Hashing (LLH) (Irie et al., 2014) reconstructed the locally linear structures of manifolds in the binary Hamming space with locality-sensitive coding. Compared to LLH, Discrete Locality Linear embedding Hashing (DLLH) (Ji et al., 2017) directly reconstructed binary codes by maintaining the local linear relationship of data points. Considering the $l_{2, 1}$ -norm term, Jointly Sparse Hashing (JSH) (Lai et al., 2018) could minimize the information loss. Unsupervised Discrete Hashing (UDH) (Jin et al., 2021) captured the semantic information by a balanced graph semantic loss for exploring both the similar and dissimilar relationships among data. By considering the semantic information, Li et al. proposed a weakly-supervised hashing method for image retrieval and achieved state-of-the-art performance (Li et al., 2020).

Figure 1. The proposed JPSH framework. It constructs a seamless hash function, which consists of twofold properties: semantic and pairwise similarities. JPSH accommodates the proposed PSH module to maintain semantic similarity, and preserves pairwise similarity using a manifold-based hashing method. Thus, we learn discriminative binary codes by combining the two similarities. In this figure, shapes including triangle, square and star are used to describe semantic relationships except for circle, whilst colors are used to describe pairwise relationships. (a) Some real-world instances with similar attributes have different semantic information. (b) Preserving semantic similarity by the proposed PSH, adaptively mapping instances within the same cluster to the same Hamming space. (c) Maintaining the pairwise similarity by a manifold-based hashing method. (d) Jointly learning the above twofold properties in the same Hamming space, and obtaining the discriminative hashing codes for the image binary representation.

However, most of existing manifold-based hashing methods do not always produce satisfactory results in practice because instances with similar features (e.g. color and shape) may have different semantics. An example is shown in Figure 1 (a), where a cat (row 1, column 2) is visually similar to a dog (row 2, column 2) instead of the other cat (row 1, column 1). On the other hand, there is a large difference in color between the black horse (row 3, column 1) and the white horse (row 3, column 2), which may be mistakenly treated as two different animals. Existing methods model the neighboring graph in hash functions to reflect the pairwise relationship between two instances. For example, SH constructs the Laplacian graph to describe the relationship between two instances. AGH creates a modified version of SH by introducing the anchor graph (Liu et al., 2010). JSH uses a $l_{2, 1}$ -regularized term to minimize the information loss of low-dimensional features and binary codes on the basis of AGH. Although the above methods can effectively preserve the local neighborhood structure from the high-dimensional feature space (shown in Figure 2 (a)) to a low-dimensional space (shown in Figure 2 (b)), such methods fail to consider semantics of instances that correspond to the intrinsic geometric structures of data (Zhang et al., 2019; Yang et al., 2018; Lin et al., 2015; Fernandez-Beltran et al., 2020).

Figure 2. Comparison of the standard manifold-based method (Lai et al., 2018) and our JPSH method. (a) the data distribution in the high-dimensional feature space; (b) the low-dimensional features generated by a standard manifold-based hashing method (Lai et al., 2018); (c) the low-dimensional features with discriminative achieved by our proposed JPSH, and each cluster is transformed by a personalized weight produced by the PSH module. Different from the standard scheme (Lai et al., 2018) that only preserves pairwise similarity, the proposed JPSH jointly preserves semantic and pairwise similarities in the low-dimensional space, which allows us to effectively distinguish different clusters, and preserve local intrinsic structures within each cluster.

To tackle the above problems, in this paper, we propose an effective unsupervised hashing method, namely Jointly Personalized Sparse Hashing (JPSH), to learn the binary representation for similarity search. Unlike existing manifold hashing methods that only preserve the pairwise relationship, the proposed JPSH encodes both semantic and pairwise similarities into the Hamming space, which can effectively distinguish different clusters while preserving local structures within clusters, as illustrated in Figure 2 (c). The proposed JPSH framework is shown in Figure 1, and the main contributions of our work can be summarized as follows:

A novel Personalized Sparse Hashing (PSH) module is proposed to preserve the semantics of instances that correspond to the intrinsic structures of data. It learns personalized subspaces that reflect the special categorical characteristics, and then defines sparse constraints for each personalized subspaces. Thus, the PSH can adaptively map instances within the same cluster to the same Hamming space and learn more discriminative features. To our knowledge, this is the first work that preserves semantic similarity in a personalized subspace.
We incorporate PSH and a standard manifold-based hashing method JSH in a seamless formulation to construct the proposed JPSH model, which simultaneously preserves semantic and pairwise similarities in the Hamming space.

The rest of this paper is organized as follows. Section 2 introduces some related works. Section 3 presents the details of our JPSH including the proposed personalized sparse hashing, the overall loss function, the discussions, the optimization algorithm, the out-of-sample extension. Section 4 shows experimental results and analyses, followed by the conclusion in Section 5.

2. Related Works

In this section, we first define some used notations, and then introduce several related works, including Spectral Hashing, Anchor Graph Hashing and Jointly Sparse Hashing.

2.1. Notations

In the following, we use bold uppercase letters (e.g. $Z$ ) for matrices, bold lowercase letters (e.g. $z$ ) for vectors, and normal lowercase letters (e.g. z) for scalars. If there are no special instructions, we define the $i$ -th row of the matrix $Z$ as $Z_{i *}$ , the $j$ -th column of the matrix $Z$ as $Z_{* j}$ , the $i$ -th element of $Z_{* j}$ or the $j$ -th element of $Z_{i *}$ as $Z_{i j}$ . We also denote the transpose of the matrix $Z$ as $Z^{T}$ , and the trace of the matrix $Z$ as $t r (Z)$ if it is a square matrix. $∥ Z ∥_{2}$ , $∥ Z ∥_{2, 1}$ , and $∥ Z ∥_{F}$ represent the $l_{2}$ -norm, $l_{2, 1}$ -norm, and Frobenius norm of the matrix $Z$ , respectively. $Z \otimes H$ denotes the Kronecker product between matrices $Z$ and $H$ . In this paper, the training set is denoted as $X = {x_{i}}_{i = 1}^{n}$ , $x_{i} \in R^{d}$ , where $n$ is the number of training samples, and $d$ is the dimension of features. The binary codes $B = {b_{i}}_{i = 1}^{m}$ , $b_{i} \in R^{l}$ , where $l ≪ d$ , $l$ is the length of bits, and $m ≪ n$ , $m$ is the number of anchor points. $I$ is the identity matrix with the size of $l$ -by- $l$ .

2.2. Spectral Hashing

SH (Weiss et al., 2008) aims to extract the discriminative information from the original space to learn binary codes based on the graph theory. With this objective, it minimizes the errors of binary codes between instances with Laplacian graph as follows:

(1)

\begin{matrix} min \begin{matrix} H \end{matrix} & n \sum i, j = 1 ∥ h_{i} - h_{j} ∥_{2}^{2} M_{i j} s . t . H \in {- 1, 1}^{l \times n}, l H = 0, H H^{T} = n I \end{matrix},

where $l \in R^{1 \times l}$ is an all-ones vector, $M_{i j} = e x p (- ∥ x_{i} - x_{j} ∥^{2} / η^{2})$ , $M \in R^{n \times n}$ is the affinity matrix, $η$ is a pre-defined parameter, and $H = {h_{i}}_{i = 1}^{n}, h_{i} \in R^{l}$ is binary codes of the $i$ -th training sample. It is time-consuming to compute $M$ for large-scale datasets. In addition, the information loss between low-dimensional features and binary codes will increase due to the binarization operation.

2.3. Anchor Graph Hashing

To reduce the computational complexity in Eq. (1), AGH (Liu et al., 2011) designs a truncated similarity matrix $A \in R^{n \times m}$ between all $n$ instances and $m$ anchor points. The $i$ -th element of $A_{* j}$ is defined as:

(2)

A_{i j} = ⎧ ⎪ ⎪ ⎨ ⎪ ⎪ ⎩ \begin{matrix} \frac{e x p (- ∥ x_{i} - c_{j} ∥^{2} / θ)}{\sum_{j^{'} \in {i}} e x p (- ∥ x_{i} - c_{j^{'}} ∥^{2} / θ)}, \forall j \in {i} 0, o t h e r w i s e \end{matrix},

where $c_{j} \in R^{d}$ is the $j$ -th anchor point; ${i}$ is the index set of $k$ neighboring anchor points of $x_{i}$ , and $θ$ is a pre-defined parameter. Although AGH has the linear training time for the developed model, it still fails to control the information loss between low-dimensional features and binary codes in the binary representation learning.

2.4. Jointly Sparse Hashing

To minimize the information loss between low-dimensional feature vectors and binary codes, JSH (Lai et al., 2018) presents $l_{2, 1}$ -regularized regression formulation, and the objective function can be defined as follows:

(3)

\begin{matrix} min \begin{matrix} B, W, V \end{matrix} & m \sum j = 1 n \sum i = 1 ∥ b_{j} - V W^{T} x_{i} ∥_{2}^{2} A_{i j} + α ∥ W ∥_{2, 1} s . t . B \in {- 1, 1}^{l \times m}, V^{T} V = I \end{matrix},

where $α$ is the balance parameter, $W \in R^{d \times l}$ is the pairwise weight matrix; $V \in R^{l \times l}$ is the rotation matrix. Although JSH effectively handles the pairwise projection from a high-dimensional space to the Hamming space, the semantics of instances are unknown and motivate our work reported in this paper.

3. Jointly Personalized Sparse Hashing

In this section, we present the details of the proposed JPSH, including semantic-preserving Personalized Sparse Hashing (PSH), overall objective function of JPSH, discussions, optimization, the out-of-sample extension.

3.1. Personalized Sparse Hashing

The semantic structure can reflect the intrinsic geometric relationship across different data points within a specific category. The proposed PSH thinks that instances with similar semantic structures should be adaptively projected to the same Hamming space, and vice versa, as shown in Figure 1 (b). For this purpose, one simple yet effective way is to build different personalized subspaces (Weinberger et al., 2009) to capture different semantic structures. Retrieving label information is an expensive process in terms of time, labor and human expertise, and we define semantic structures in a high-dimensional space by pseudo labels produced by K-means. The instances with similar semantic information likely share the same anchor point. Therefore, we derive a set of anchor points $C = {c_{j}}_{j = 1}^{m}$ as the training set of PSH and obtaining the following objective function:

(4)

\begin{matrix} min \begin{matrix} B, P, R \end{matrix} & m \sum j = 1 ∥ b_{j} - R P_{j}^{T} c_{j} ∥_{2}^{2} s . t . B \in {- 1, 1}^{l \times m}, R^{T} R = I \end{matrix},

where $P_{j} \in R^{d \times l}$ is a personalized weight matrix for the anchor point $c_{j}$ and $P = [P_{1}; P_{2}; \dots; P_{m}] \in R^{m d \times l}$ . $R \in R^{l \times l}$ is a rotation matrix to minimize quantization errors, driven by ITQ (Gong et al., 2012).

Since the binarization operation for low-dimensional features $R P_{j}^{T} c_{j}$ and binary codes $b_{j}$ helps increase the information loss, we consider using the feature sparsity for personalized weight $P_{j}$ to select important features. Exclusive group lasso (EGL) (Kong et al., 2014, 2016) encourages intra-cluster competition but discourages inter-cluster competition. Inspired by EGL we first impose a $l_{2, 1}$ -norm regularization on personalized weight $P_{j}$ for pursuing the sparsity of intra-cluster features. Then, we create a $l_{2}$ -norm term to constrain the non-sparsity of inter-cluster features. This regularization term enables us to perform optimal feature selection for each cluster. However, moderating the personalized weight $P_{j}$ with a single anchor point $c_{j}$ may over-fit with a poor generalization ability. Therefore, we force the anchor point $c_{j}$ to lend certain strengths from its neighboring anchors to learn the personalized weight $P_{j}$ , motivated by the network lasso penalty (Hallac et al., 2015). Finally, we re-write the loss function of PSH as follows:

(5)

\begin{matrix} min \begin{matrix} B, P, R \end{matrix} & m \sum j = 1 ∥ b_{j} - R P_{j}^{T} c_{j} ∥_{2}^{2} + λ_{1} m \sum j = 1 ∥ P_{j} ∥_{2, 1}^{2} + λ_{2} m \sum i, j = 1 S_{i j} ∥ P_{i} - P_{j} ∥_{F} s . t . B \in {- 1, 1}^{l \times m}, R^{T} R = I \end{matrix},

where $λ_{1}$ and $λ_{2}$ are balance parameters, $S \in R^{m \times m}$ is the similarity matrix of anchors, and the $i$ -th element of $S_{* j}$ is defined as follows:

(6)

where $δ$ is a given parameter, and $N_{ψ} (c_{i})$ is the index set of $ψ$ nearest neighbors of the anchor point $c_{i}$ .

3.2. Overall Objective Function of JPSH

As aforementioned, the PSH model shown in Eq. (5) can effectively preserve particular attributes of each cluster from the high-dimensional space into the Hamming space. Instances within the same cluster can be projected by the same personalized weight matrix, resulting in directly maintaining the semantics of instances that correspond to intrinsic structures of data into the Hamming space.

To further improve the representation capacity of binary codes, we also hope to preserve the local neighborhood structure within the cluster in the Hamming space. To this end, we maintain the pairwise similarity via a manifold-based hashing learning method. Since JSH (Lai et al., 2018) can quantify the similarity between each instance and its neighboring anchor points, and greatly reduce the time complexity, we thereby joint Eqs. (3) and (5) to construct the Jointly Personalized Sparse Hashing (JPSH) model. The overall objective function of JPSH is defined as:

(7)			$min \begin{matrix} B, P, W, R, V \end{matrix} m \sum j = 1 ∥ b_{j} - R P_{j}^{T} c_{j} ∥_{2}^{2} + m \sum j = 1 n \sum i = 1 ∥ b_{j} - V W^{T} x_{i} ∥_{2}^{2} A_{i j}$
			$+ λ_{1} m \sum j = 1 ∥ P_{j} ∥_{2, 1}^{2} + λ_{2} m \sum i, j = 1 S_{i j} ∥ P_{i} - P_{j} ∥_{F} + λ_{3} ∥ W ∥_{2, 1}$
			$s . t . B \in {- 1, 1}^{l \times m}, R^{T} R = V^{T} V = I$

where $λ_{3}$ is the balance parameter for projected matrix $W$ .

3.3. Discussions

In the following, we will discuss how our proposed JPSH learns the binary representation and preserves discriminative information compared with other hashing methods from two aspects.

On the one hand, compared to existing manifold-based hashing methods, such as SH and JSH, our proposed JPSH not only considers local pairwise similarity in the original feature space $X$ , but also the intrinsic semantic similarity of the data. In particular, Eq. (3) shows that JSH can effectively reduce the computational complexity using truncated similarity matrix $A$ , and avoid information loss via $∥ W ∥_{2, 1}$ . Therefore, we maintain local neighborhood structures within each cluster by JSH, as the local structure information in the two clusters (purple points and blue points shown in the Figure 2 (b)). However, JSH has difficulty distinguishing instances with similar features but different semantic information. Comparing Eq. (7) against Eq. (3), we learn a set of personalized weights $P_{j} (j = 1, 2, \dots, m)$ for different anchors $c_{j} (j = 1, 2, \dots, m)$ , and employ the $l_{2, 1}$ -norm sparse constraint on each $P_{j}$ . So, instances with the same pseudo-label can be adaptively mapped to the personalized subspace through the unique projection learning of personalized weights. Therefore, the proposed JPSH can avoid misrepresenting instances of similar features with different semantics by directly retaining the semantic relationship of the original feature space.

On the other hand, compared against those multi-stage hashing methods, such as PCA-ITQ and RSSH, we construct the seamless hash objective function Eq. (7). It can avoid suboptimal solutions generated by using multiple independent learning strategies. For example, PCA-ITQ first applies PCA to perform linear dimensionality reduction, and then uses an alternating algorithm for refining the initial orthogonal transformation to reduce quantization errors. RSSH preserves semantic relationships on three subproblems: multi-subspace learning, similarity matrix construction, and semantics preserved hashing. Different from the above methods, we construct personalized weight matrix $P$ , pairwise weight matrix $W$ and two rotation matrices $R$ and $V$ in the seamless formulation Eq. (7). Thus, our proposed JPSH can significantly avoid the suboptimal solution, and obtain discriminative hash codes in binary representation learning.

3.4. Optimization

The optimization of Eq. (7) is intractable and non-convex w.r.t all parameters simultaneously. Therefore, we adopt an alternating optimization algorithm to iteratively update all parameters till the algorithm converges to an acceptable settlement.

Update Personalized Weight $P$ : First, we update the personalized weight $P$ when $W$ , $R$ , $V$ and $B$ are fixed. The subproblem of Eq. (7) w.r.t $P$ is:

(8)

\begin{matrix} min \begin{matrix} P \end{matrix} & m \sum j = 1 ∥ b_{j} - R P_{j}^{T} c_{j} ∥_{2}^{2} + λ_{1} m \sum j = 1 ∥ P_{j} ∥_{2, 1}^{2} + λ_{2} m \sum i, j = 1 S_{i j} ∥ P_{i} - P_{j} ∥_{F} s . t . B \in {- 1, 1}^{l \times m}, R^{T} R = I \end{matrix} .

For the convenience, we devide Eq. (8) into three functions: $F_{1} (P) = \sum_{j = 1}^{m} ∥ b_{j} - R P_{j}^{T} c_{j} ∥_{2}^{2}$ , $F_{2} (P) = λ_{1} \sum_{j = 1}^{m} ∥ P_{j} ∥_{2, 1}^{2}$ , $F_{3} (P) = λ_{2} \sum_{i, j = 1}^{m} S_{i j} ∥ P_{i} - P_{j} ∥_{F}$ . We denote $Y = d i a g (c_{1}, c_{2}, \dots, c_{m})$ , and $Y \in R^{m d \times m}$ , where $d i a g (\cdot)$ is a diagonal function. Since $R R^{T} = I$ , the derivative of $F_{1} (P)$ w.r.t $P$ can be computed as follows:

(9)

\begin{matrix} \frac{\partial F_{1} (P)}{\partial P} = - 2 Y B^{T} R + 2 Y Y^{T} P \end{matrix} .

According to Refs. (Li et al., 2017a, 2018), we have the derivative of $F_{2} (P)$ w.r.t $P$ as follows:

(10)

\begin{matrix} \frac{\partial F_{2} (P)}{\partial P} = 2 λ_{1} K P \end{matrix},

where $K \in R^{m d \times m d}$ is a diagonal matrix, and the $i$ -th diagonal element can be defined as:

(11)

\begin{matrix} K_{i i} = m \sum j = 1 \frac{I_{i j} ∥ P_{j} ∥_{2, 1}}{∥ P^{i *} ∥_{2} + ϵ} \end{matrix},

where $ϵ$ is a very small constant to ensure the derivative solvable, $I (\cdot)$ is an indicator function, and $I_{i j} = 1$ if $P^{i *}$ belongs to $P_{j}$ , otherwise $I_{i j} = 0$ .

The derivative of $F_{3} (P)$ w.r.t. $P$ is computed as follows:

(12)

\begin{matrix} \frac{\partial F_{3} (P)}{\partial P} = 2 λ_{2} (G \otimes I_{d}) P \end{matrix} .

where $I_{d} \in R^{d \times d}$ is an identity matrix. $G \in R^{m \times m}$ is square matrix, and the $i$ -th element of $G_{* j}$ can be expressed as:

(13)

Combining $\partial F_{1} (P) / \partial P$ , $\partial F_{2} (P) / \partial P$ and $\partial F_{3} (P) / \partial P$ together, and setting it to be zero, we have a closed-form solution of the personalized weight $P$ as follows:

(14)

\begin{matrix} P = {(λ_{1} K + λ_{2} (G \otimes I_{d}) + Y Y^{T})}_{1}^{- 1} Y B^{T} R \end{matrix},

Update Pairwise Weight $W$ : We attempt to update the pairwise weight $W$ when other four parameters are fixed. Removing terms that are irrelevant to $W$ , we have the following function as:

(15)

\begin{matrix} min \begin{matrix} W \end{matrix} & m \sum j = 1 n \sum i = 1 ∥ b_{j} - V W^{T} x_{i} ∥_{2}^{2} A_{i j} + λ_{3} ∥ W ∥_{2, 1} s . t . B \in {- 1, 1}^{l \times m}, V^{T} V = I \end{matrix} .

According to Ref. (Lai et al., 2018), since the $\sum_{i} A_{i j} = 1$ and $V V^{T} = I$ , the term between $X$ and $X^{T}$ is an identity matrix. So, the closed-form solution of $W$ is:

(16)

\begin{matrix} W = {(λ_{3} Q + X X^{T})}_{3}^{- 1} X A B^{T} V \end{matrix},

where $Q \in R^{d \times d}$ is a diagonal matrix, and the $i$ -th element of $Q_{i *}$ is:

(17)

\begin{matrix} Q_{i i} = \frac{1}{2 ∥ w_{i *} ∥_{2}} \end{matrix},

where $w_{i *}$ is the $i$ -th row of the pairwise weight $W$ .

Update Rotation Matrix $R$ : To compute $R$ , we fix $P$ , $W$ , $V$ and $B$ , and then solve the following maximization problem:

(18)

\begin{matrix} max \begin{matrix} R \end{matrix} t r (R P^{T} Y B^{T}) \end{matrix},

According to (Zou et al., 2006; Li et al., 2017b), we compute $P^{T} Y B^{T} = {^U}_{1} D_{1} {^V}_{1}^{T}$ by Singular Value Decomposition (SVD), and the updating scheme of $R$ is described below:

(19)

\begin{matrix} R = {^V}_{1} {^U}_{1}^{T} \end{matrix} .

Update Rotation Matrix $V$ : To derive the solution of $V$ , we fix the other four parameters and re-form the maximization problem as follows:

(20)

\begin{matrix} max \begin{matrix} V \end{matrix} t r (V W^{T} X A B^{T}) \end{matrix} .

We use SVD to solve the problem $W^{T} X A B^{T} = {^U}_{2} D_{2} {^V}_{2}^{T}$ , and then retain the solution of $V$ as follows:

(21)

\begin{matrix} V = {^V}_{2} {^U}_{2}^{T} \end{matrix} .

Update Binary Codes $B$ : Finally, we update the parameter $B$ when others have been fixed and solve the optimization problem as follows:

(22)

max \begin{matrix} B \end{matrix} t r (R P^{T} Y B^{T} + V W^{T} X A B^{T})

Thus, the solution to $B$ is:

(23)

B = s g n (R P^{T} Y + V W^{T} X A)

where $s g n (\cdot)$ is a sign function.

With the above handling, the pseudo codes of our proposed JPSH method is summarized in Algorithm 1.

0: Training set

X

, balance parameters

λ_{1}

λ_{2}

λ_{3}

, the number of anchors

m

k

neighboring anchor points, the length of hash codes

l

, and the iteration time

T

0: Binary codes

B

, personalized weight

P

, pairwise weight

W

, and two rotation matrices

R

and

V

1: Obtain

m

anchor points by the K-means method.

2: Compute truncated similarity matrix

A

and similarity matrix

S

by Eqs. (2) and (6), respectively.

3: Initialize

R

and

V

as orthogonal matrices,

D

and

Q

as identity matrices, and

B

as a random binary matrix.

4: while Loop until converge or reach iteration time

T

5: Update

P

by Eq. (14);

6: Update

W

by Eq. (16);

7: Update

Q

by Eq. (17);

8: Update

R

by Eq. (19);

9: Update

V

by Eq. (21);

10: Update

B

by Eq. (23);

11: end while

12: return

B

P

W

R

and

V

Algorithm 1 Jointly Personalized Sparse Hashing (JPSH)

3.5. Out-of-Sample Extension

After having optimized the overall objective function of JPSH, we perform similarity prediction based on four parameters $P$ , $W$ , $R$ and $V$ . More specifically, for each query instance $^x$ , we first find the personalized weight $P_{j}$ corresponding to the anchor point $c_{j}$ using the minimum Euclidean distance, for the query instance $^x$ . Then, we joint the personalized weight $P_{j}$ , pairwise weight $W$ , and two rotation matrices $R$ and $V$ to learn discriminative binary codes $^b$ , that is: $^b = s g n (R P_{j}^{T} c_{j} + V W^{T}^x)$ .

Observing $^b = s g n (R P_{j}^{T} c_{j} + V W^{T}^x)$ , we can find that each hash code $^b$ is constrained by $R P_{j}^{T} c_{j}$ and $V W^{T}^x$ . The first term $R P_{j}^{T} c_{j}$ transforms the anchor $c_{j}$ corresponding to $^x$ by the personalized weight matrix $P_{j}$ . Since the personalized weight matrices are different for different clusters, the hash codes $^x$ generated by the first trem have the semantic separability. From Eq. (7), we can observe that the pairwise weight matrix $W$ is produced under the constraints of pairwise similarity matrix $A$ . Thus, the second trem $V W^{T}^x$ reflects the local neighborhood structures among data. Integrating the first term $R P_{j}^{T} c_{j}$ and the second term $V W^{T}^x$ into the program of producing $^b$ , we can obtain the binary codes fused the semantic and pairwise similarities.

4. Experimental WorkS

To verify the effectiveness of the proposed binary representation algorithm JPSH, we use it in the task of similarity search. We conduct extensive experiments on four widely-used benchmark image datasets, and compare with several state-of-the-art hash algorithms.

4.1. Datasets

The proposed method and the comparative ones are evaluated on the following four datasets, and some example images of each dataset are shown in Figure 3. MNIST (LeCun et al., 1998) contains 70,000 handwritten digit images from 10 classes in total. Each image is re-shaped to a 784-dimensional feature vector. We construct the testing set by randomly selecting 100 images per class, and the remaining images are composed as the training set (Lin et al., 2018). CIFAR-10 (Krizhevsky et al., 2009) consists of 60,000 tiny images of 10 classes. Following (Hu et al., 2018), 512-dimensional GIST features (Oliva and Torralba, 2001) are retrieved. We randomly select 100 images per class to form the testing set, and the remaining 5900 images per class form the training set. NUS-WIDE (Chua et al., 2009) is a real-world dataset containing 269,648 images related to 81 ground truth concepts. We pick the most 21 frequent concepts for evaluation. For each category, 100 images are randomly sampled to form the testing set, and 100,000 images from the remaining images form the training set. 4096-dimensional CNN features (Krizhevsky et al., 2017) are firstly extracted, and then 1000-dimensional PCA features are generated to represent each image. FLICKR25K (Huiskes and Lew, 2008) is also a real-world dataset including 25,000 images of 24 categories. We randomly select 1000 images to form the testing set, and 19,015 images for the training set (Dong et al., 2021). Following the setting of NUS-WIDE, we extract 1000-dimensional PCA features to represent each image.

Figure 3. Example images of the used datasets, including MNIST, CIFAR-10, NUS-WIDE and FLICKR25K.

4.2. Settings

For the proposed JPSH, the test range of parameters $λ_{1}$ , $λ_{2}$ and $λ_{3}$ is ${0, 1, \dots, 10^{5}, 10^{6}}$ , the test range of parameter $k$ is $[3, 5, 7, 10, 20]$ , and the test range of parameter $m$ is $[100, 200, \dots, 1000]$ . Based on the test results, we empirically set $k = 7$ , $m = 800$ , $λ_{1} = λ_{2} = 1$ and $λ_{3} = 10$ on MNIST and CIFAR-10, and set $k = 5$ , $λ_{1} = λ_{2} = 10$ and $λ_{3} = 10^{5}$ on NUS-WIDE and FLICKR25K, while $m = 1000$ for NUS-WIDE, and $m = 500$ for FLICKR25K. In addition, we set $ψ = 7$ in Eq. (6) for MNIST and CIFAR-10 datasets, and set $ψ = 5$ for FLICKR25K and NUS-WIDE datasets. We test the system with different lengths of codes ranged in the set of $[8, 16, 32, 64, 96, 128]$ . As for the evaluation, five standard evaluation metrics, i.e. mean Average Precision (mAP), the mean average precision of the top 100 testing samples (Pre@100), Precision-Recall curve, precision vs. top-N positions (Precision@N) curve, and recall vs. top-N positions (Recall@N) curve, are used to evaluate the performance. All results shown here are produced with Hamming radius 2, being the average of 10 run times. The baseline methods used in this evaluation are under no domain adaption assumption. The baseline methods consist of 10 state-of-the-art unsupervised hashing methods: LSH (Datar et al., 2004), SH (Weiss et al., 2008), PCA-ITQ (Gong et al., 2012), SP (Xia et al., 2015), SGH (Jiang and Li, 2015), OEH (Liu et al., 2016), OCH (Liu et al., 2018), JSH (Lai et al., 2018), CH (Weng and Zhu, 2020) and RSSH (Tian et al., 2020). All hashing methods are examined using Matlab on a PC with 3.6GHz and 64G RAM.

4.3. Ablation Studies

To analyze the roles of pairwise and semantic similarities played in JPSH, we perform ablation studies on the CIFAR-10 and FLICKR25K datasets. The experimental results are depicted in Table 1. From this table, we have three clear observations: Firstly, JPSH achieves better performance than JSH and PSH. The reason is that JPSH learns and creates a seamless hash function, jointly considering pairwise and semantic similarities. Compared to JSH, the proposed JPSH considers further the semantics of instances corresponding to the intrinsic geometric structures. In addition, the JPSH retains the superior performance of the traditional manifold hashing method JSH comparsed with the PSH. Secondly, closely looking at JSH and PSH, we observe that PSH performs worse than JSH. One possible reason is that pseudo labels determined by K-means cannot accurately represent the semantic information of the high-dimensional data, comparing to the ground truths. Third, only taking PSH into consideration, we discover that PSH can achieve decent performance, especially on real-world FLICKR25K dataset. This verifies that the proposed PSH can retain the semantic similarity of the high-dimensional space into the Hamming space in a certain extent, without any manually tagged labels. Overall, these well-designed studies exhibit the effectiveness of combining JSH-based pairwise and PSH-based semantic similarity to a seamless formulation.

Datasets	CIFAR-10		FLICKR25K
Datasets	mAP	Pre@100	mAP	Pre@100
JSH(Lai et al., 2018)	0.1561	0.3001	0.6667	0.8909
PSH	0.1133	0.1613	0.5632	0.8316
JPSH	0.1606	0.3275	0.6936	0.9176

Table 1. mAP and Pre@100 of JSH, PSH and JPSH on the CIFAR-10 and FLICKR25K datasets with 16 bits. Note that the best results are in bold and the second-best results are underlined.

Datasets	Metrics	Bits	LSH(Datar et al., 2004)	SH(Weiss et al., 2008)	PCA-ITQ(Gong et al., 2012)	SP(Xia et al., 2015)	SGH(Jiang and Li, 2015)	OEH(Liu et al., 2016)	OCH(Liu et al., 2018)	JSH(Lai et al., 2018)	CH(Weng and Zhu, 2020)	RSSH(Tian et al., 2020)	JPSH
MNIST	mAP	8	0.1808	0.2651	0.3589	0.3404	0.3111	0.2920	0.2376	0.3204	0.3030	0.3940	0.4332
		16	0.2094	0.2661	0.4042	0.3780	0.3526	0.3100	0.3073	0.4066	0.3688	0.4128	0.5561
		32	0.2633	0.2606	0.4378	0.4178	0.3898	0.3497	0.3727	0.5118	0.4317	0.4183	0.6184
		64	0.3199	0.2400	0.4490	0.4401	0.4180	0.3727	0.4107	0.5613	0.4112	0.4204	0.6548
		96	0.3592	0.2367	0.4596	0.4472	0.4282	0.3889	0.4253	0.5675	0.4318	0.4222	0.6666
		128	0.3757	0.2330	0.4627	0.4510	0.4358	0.4014	0.4339	0.5758	0.4401	0.4245	0.6659
	Pre@100	8	0.4372	0.4859	0.7165	0.7057	0.6835	0.6040	0.5581	0.7213	0.6644	0.7843	0.8635
		16	0.4871	0.4904	0.7999	0.7630	0.7300	0.6523	0.6786	0.8112	0.7424	0.7122	0.9396
		32	0.6102	0.4732	0.8468	0.8169	0.7828	0.7233	0.7461	0.8954	0.7954	0.6795	0.9614
		64	0.7010	0.4416	0.8533	0.8451	0.8004	0.7470	0.7794	0.9279	0.7641	0.6750	0.9619
		96	0.7565	0.4379	0.8685	0.8524	0.8111	0.7915	0.7931	0.9370	0.7914	0.6737	0.9646
		128	0.7832	0.4370	0.8656	0.8501	0.8139	0.7945	0.8026	0.9447	0.8072	0.6768	0.9656
CIFAR-10	mAP	8	0.1102	0.1232	0.1421	0.1234	0.1382	0.1408	0.1314	0.1290	0.1413	0.1342	0.1380
		16	0.1134	0.1266	0.1472	0.1265	0.1497	0.1489	0.1494	0.1561	0.1447	0.1428	0.1606
		32	0.1230	0.1252	0.1561	0.1281	0.1592	0.1561	0.1618	0.1607	0.1560	0.1485	0.1812
		64	0.1306	0.1254	0.1485	0.1268	0.1684	0.1646	0.1705	0.1785	0.1658	0.1533	0.1848
		96	0.1334	0.1250	0.1547	0.1308	0.1713	0.1671	0.1750	0.1845	0.1713	0.1544	0.1925
		128	0.1373	0.1249	0.1576	0.1333	0.1734	0.1694	0.1765	0.1865	0.1738	0.1574	0.1895
	Pre@100	8	0.1517	0.1899	0.2612	0.2019	0.2382	0.2486	0.2157	0.2187	0.2601	0.2138	0.2413
		16	0.1583	0.1906	0.2730	0.1961	0.2539	0.2652	0.2661	0.3001	0.2740	0.2443	0.3275
		32	0.1915	0.1760	0.3063	0.1979	0.2839	0.2876	0.2989	0.3313	0.2894	0.2626	0.3672
		64	0.2103	0.1771	0.2720	0.1942	0.3171	0.3118	0.3244	0.3657	0.3154	0.2819	0.3910
		96	0.2120	0.1748	0.2988	0.2047	0.3263	0.3177	0.3371	0.3765	0.3318	0.2828	0.4109
		128	0.2235	0.1754	0.3083	0.2047	0.3319	0.3247	0.3424	0.3846	0.3351	0.2949	0.4015
NUS-WIDE	mAP	8	0.3427	0.4293	0.5260	0.5063	0.4725	0.4864	0.4152	0.4425	0.4321	0.3475	0.5521
		16	0.3598	0.4067	0.5243	0.5032	0.4600	0.4780	0.4672	0.5511	0.4712	0.3614	0.6027
		32	0.3873	0.3830	0.5230	0.5055	0.4601	0.4844	0.5046	0.5876	0.4783	0.3952	0.6200
		64	0.4138	0.3689	0.5286	0.5133	0.4815	0.4903	0.5350	0.6138	0.4758	0.4126	0.6271
		96	0.4346	0.3674	0.5344	0.5189	0.4971	0.4964	0.5431	0.6169	0.4855	0.4164	0.6293
		128	0.4479	0.3769	0.5397	0.5224	0.5059	0.5019	0.5487	0.6224	0.4947	0.4211	0.6340
	Pre@100	8	0.6660	0.7121	0.8258	0.8173	0.7485	0.8033	0.7479	0.7588	0.7529	0.6644	0.8313
		16	0.6875	0.6998	0.8243	0.8142	0.7408	0.7991	0.7805	0.8420	0.7796	0.6752	0.8636
		32	0.7144	0.6805	0.8240	0.8105	0.7525	0.8026	0.8107	0.8695	0.7795	0.6935	0.8762
		64	0.7357	0.6809	0.8265	0.8179	0.7729	0.8018	0.8342	0.8929	0.8024	0.6983	0.8841
		96	0.7578	0.6865	0.8322	0.8225	0.7862	0.8001	0.8427	0.8877	0.8026	0.6947	0.8860
		128	0.7638	0.6986	0.8362	0.8255	0.7935	0.8117	0.8439	0.8948	0.8060	0.6979	0.8885
FLICKR25K	mAP	8	0.5802	0.6321	0.6949	0.6828	0.6539	0.6653	0.6360	0.6206	0.6711	0.5674	0.6790
		16	0.5927	0.6174	0.6911	0.6818	0.6476	0.6677	0.6606	0.6667	0.6736	0.5792	0.6936
		32	0.6100	0.6023	0.6907	0.6819	0.6539	0.6698	0.6859	0.6875	0.6735	0.6028	0.7138
		64	0.6246	0.5923	0.6909	0.6834	0.6653	0.6687	0.7017	0.7011	0.6819	0.6420	0.7253
		96	0.6403	0.5928	0.6923	0.6864	0.6742	0.6759	0.7070	0.7042	0.6838	0.6521	0.7313
		128	0.6443	0.5928	0.6950	0.6884	0.6805	0.6799	0.7115	0.7085	0.6866	0.6546	0.7331
	Pre@100	8	0.8417	0.8701	0.9079	0.9027	0.8874	0.8991	0.8773	0.8603	0.8959	0.8361	0.9119
		16	0.8487	0.8633	0.9055	0.9004	0.8754	0.8926	0.8920	0.8909	0.8930	0.8398	0.9176
		32	0.8587	0.8590	0.9039	0.8998	0.8803	0.8941	0.9057	0.9077	0.8938	0.8482	0.9269
		64	0.8684	0.8561	0.9036	0.8992	0.8836	0.8930	0.9127	0.9142	0.9008	0.8652	0.9291
		96	0.8745	0.8565	0.9049	0.9014	0.8892	0.8957	0.9151	0.9167	0.9029	0.8688	0.9307
		128	0.8789	0.8597	0.9060	0.9022	0.8932	0.8978	0.9181	0.9187	0.9050	0.8710	0.9305

Table 2. mAP and Pre@100 for the JPSH and baselines on the MNIST, CIFAR-10, NUS-WIDE and FLICKR25K datasets. Note that the best results are in bold and the second-best results are underlined.

Figure 4. Precision-Recall curve, Precision and Recall vs. the number of searched instances on the MNIST, CIFAR-10, NUS-WIDE and FLICKR25K datasets with 64 bits.

4.4. Results and Discussions

For the MNIST dataset, the first two rows in Table 1 illustrate the mAP and Pre@100 of our proposed JPSH and the baselines with different hash bits. Figure 4 (a, b, c) show the Precision-Recall curve, Precision@N and Recall@N curves on 64 bits, respectively. Going through these results, we have the following observations: (1) JPSH consistently outperforms all baselines in all cases on this dataset with large gaps, demonstrating its effectiveness on the binary representation in dealing with similarity search tasks. (2) Compared against the second-best method JSH, our proposed method improves $23.76 %$ and $8.63 %$ at least for mAP and Pre@100, respectively. This shows that JPSH outperforms the standard manifold-based hashing method JSH, and generates more discriminative binary codes by jointly learning semantic and pairwise similarities in the Hamming space. (3) RSSH shows better performance than manifold-based hashing methods in low hash bits but worse than JPSH. This indicates that semantic preservation helps to learn the discriminative information, especially that for our proposed method. (4) Figure 4 (a, b, c) show that JPSH achieves consistent performance improvement in terms of Precision-Recall curve, Precision@N and Recall curves on 64 bits, indicating the effectiveness of our JPSH.

For the CIFAR-10 dataset, mAP and Pre@100 are reported in the third and fourth rows of Table 1, and Figure 4 (d, e, f) show the Precision-Recall curve, Precision@N and Recall@N curves based on 64 bits, respectively. From these results, we have the observations similar to those of MNIST: (1) Compared with all baselines, JPSH achieves the best results of 16 to 128 bits in terms of mAP and Pre@100, demonstrating that JPSH can learn discriminative binary codes by maintaining semantic structures in different personalized subspaces. (2) Of 8 bits, PCA-ITQ achieves the best performance. However, the performance of JPSH and PCA-ITQ is very close, whilst JPSH is competitive to the other methods. (3) In Figure 4 (d, e, f), JPSH matches the best baseline in terms of Precision-Recall curve and Recall@N curve, but little low than RSSH on Precision@N. It is worth noting that JPSH ourperforms JSH in terms of Precision@N and Recall@N curves. This indicates that PSH can help to map instances within the same cluster to the same Hamming space in a certain extent.

For the NUS-WIDE dataset, we report the results of mAP and Pre@100 in the fifth and sixth rows of Table 1. The Precision-Recall, Precision@N, and Recall@N curves are shown in Figure 4 (g, h, i). Regarding those results, we have similar observations to the above cases: (1) JPSH also outperforms all baselines in all cases in terms of mAP on this real-world dataset. It significantly surpasses all baselines of low hash bits (8 to 32 bits) in terms of Pre@100, and little lower than JSH of high hash bits (64 to 128 bits). As a whole, these results demonstrate the effectiveness and robustness of the semantic- and pairwise-preserving scheme used in JPSH. (2) From Figure 4 (g, h, i), we witness that JPSH has a consistent improvement compared with all baselines. This justifies that our proposed JPSH discovers more discriminative structural information in binary representation learning.

For the FLICKR25K dataset, the last two rows of Table 1 show results of mAP and Pre@100, and curves of Precision-Recall, Precision@N and Recall@N are shown in Figure 4 (j, k, l). Again, we have observations similar to the above three datasets: (1) JPSH obtains the best results compared with all baselines in terms of mAP and Pre@100, demonstrating the effectiveness and robustness of our proposed JPSH in the binary representation learning process. (2) Figure 4 (j, k, l) show that our proposed JPSH outperforms all compared methods in terms of Precision-Recall and Recall@N curves, slightly lower than some compared methods in terms of Precision@N curve. Furthermore, JPSH is always better than JSH in Figure 4 (k). Therefore, our JPSH can learn better representative binary codes for similarity search.

Figure 5. Compared results of mAP between JPSH $_{0}$ and JPSH on (a) CIFAR-10, (b) NUS-WIDE and (c) FLICKR25K datasets.

Figure 6. Compared results of Pre@100 between JPSH $_{0}$ and JPSH on (a) CIFAR-10, (b) NUS-WIDE and (c) FLICKR25K datasets.

4.5. Semantic Analysis

In JPSH, we use pseudo labels to distinguish different clusters, and then map instances within the same cluster to the same Hamming space. In order to verify that our JPSH indeed maintains a certain semantic similarity, the training set of PSH is replaced by random samples called JPSH $_{0}$ . The results of mAP and Pre@100 between JPSH $_{0}$ and JPSH are shown in Fig 5 and Fig 6, respectively. From those figures, we can observe that the mAP and Pre@100 of JPSH $_{0}$ are obviously below that of JPSH, particularly in CIFAR-10 and NUS-WIDE datasets. These results indicate that K-means is good for extracting discriminative information with exploring the semantic structures. Moreover, our proposed JPSH successfully retains those structures into the Hamming space, and JPSH $_{0}$ can not be.

Figure 7. Visualization of personalized weight matrix $P$ processed by $s i g n$ function, (a) $16 \times 784$ matrix, (b) $32 \times 784$ matrix, and (c) $64 \times 784$ matrix on MNIST dataset. The green column indicates that the corresponding features are zero and ignored during the projection operation.

4.6. Sparsity Analysis

As shown in Figure 7, to study the sparsity of JPSH, we depict the visualization of personalized weight matrix $P$ processed by $s i g n$ function with size $16 \times 784$ , $32 \times 784$ and $64 \times 784$ . It can be seen that each matrix contains some green columns that are equal to zero due to the sparsity constraint of $∥ P_{j} ∥_{2, 1}$ . It means that those corresponding features are ignored in a projection operation. Thus, we can eliminate some redundant features for binary representation learning. For example, the first column of matrices in Figure 7 (a), (b) and (c) are all the zero vector, which means that the first feature of data is malfunction to distinguish the differences between data. Another benefit of sparsity lies in that it can increase the interpretability of the JPSH model. We can know that which features provide the discriminative information, while those features do not. Therefore, the sparsity of $∥ P_{j} ∥_{2, 1}$ works in our JPSH model.

Figure 8. Sensitivity analysis of $λ_{1}$ , $λ_{2}$ , $λ_{3}$ , $k$ and $m$ in terms of mAP on the FLICKR25K dataset with 64 bits, and convergence curves of the JPSH on CIFAR-10 and FLICKR25K datasets with 32 bits.

4.7. Sensitivity Analysis

We discuss the sensitivity of our JPSH over parameters $λ_{1}$ , $λ_{2}$ , $λ_{3}$ , $k$ and $m$ . The setting ranges of the corresponding parameters are described in Section 4.2. As an example, Figure 8 (a, b, c) show the sensitivity of the $λ_{1}$ , $λ_{2}$ and $λ_{3}$ on the FLICKR25K dataset with 64 bits. Figure 8 (a) and (b) show that the mAPs are over 0.71 within a large wide ranges w.r.t $λ_{1}$ and $λ_{2}$ . In addition, the mAPs fluctuated within a certain range indicate the terms of $\sum_{j = 1}^{m} ∥ P_{j} ∥_{2, 1}^{2}$ and $\sum_{i, j = 1}^{m} S_{i j} ∥ P_{i} - P_{j} ∥_{F}$ can influence the performance of the JPSH. We set $λ_{1}$ and $λ_{2}$ as 10 for the FLICKR25K dataset because of the relatively higher and more stable performance. From Figure 8 (c), mAPs exceed 0.69 in the range of ${0, 1, \dots, 10^{5}}$ , and when $λ_{3}$ equals to $10^{5}$ , the mAP surpasses 0.72. Therefore, we set the best $λ_{3} = 10^{5}$ for the FLICKR25K dataset. In addition, Figure 8 (d) shows mAPs versus $k$ and $m$ on the FLICKR25K dataset with 64 bits, suggesting that the JPSH is not sensitive to $k$ and $m$ , either. The sensitivity analysis of $λ_{1}$ , $λ_{2}$ , $λ_{3}$ , $k$ and $m$ on MNIST, CIFAR-10 and NUS-WIDE datasets is similar with that on the FLICKR25K dataset.

4.8. Convergence Analysis

Figure 8 (e) and (f) show the convergence curves of our JPSH method on CIFAR-10 and FLICKR25K datasets with 32 bits. The objective function of the JPSH is well converging, and quickly converges to an optimal solution within 20 iterations in all datasets. From Figure 8 (e), we witness that the objective loss function values settle after 4 iterations. Thus, we set the time iteration of our JPSH for CIFAR-10 dataset as $T = 4$ . Analogously, based on Figure 8 (f), we set $T = 10$ for FLICKR25K dataset. For MNIST and NUS-WIDE datasets, we empirically set iteration times to be $T = 10$ and $T = 5$ , respectively.

4.9. Time Complexity

The overall training time complexity of the JPSH is mainly composed by three parts: (1) the building of matrix $S$ , which the computational complexity is $O (m m)$ ; (2) the time-consuming of computing matrix $A$ is $O (n m)$ ; (3) For each iteration, the computational complexity of Eq. (14) is $O (m^{3} d^{3} + m^{3} d^{2} + m^{2} d l + m d l^{2})$ ; of Eq. (16) is $O (d^{3} + n d^{2} + m n d + m d l + d l^{2})$ ; of performing SVD on rotation matrices $R$ and $V$ both are $O (l^{3})$ ; and of Eq. (23) is $O (m d l^{2} + m^{2} d l + d l^{2} + n d l + m n l)$ . Generally, due to the length of binary codes $l$ and anchor points $m$ are usually smaller, the total computational complexity of our JPSH is $O (T d^{3} + T n d^{2})$ . It is linear with the number of training samples. Note that the whole computational process is mainly based on matrix multiplications, which can be computed in parallel for accelerations. It is reasonably efficient on the binary representation learning. To quantitatively compare the computational complexity, we evaluate the training time-consumation of OEH, RSSH and JPSH on CIFAR-10 dataset in terms of 32, 64 and 96 bits. Table 3 shows the time complexity and running times of OEH, RSSH and JPSH. Obviously, we could find that JPSH needs much less time than the multi-stage hashing method RSSH while little higher than OEH. It is noted that the JPSH yields higher mAP values than OEH, which are shown in Table 2. Therefore, the time cost of our proposed JPSH model is acceptable.

Methods	Time Complexity	Running Times (seconds)
Methods	Time Complexity	32 bits	64 bits	96 bits
OEH(Liu et al., 2016)	$\textgreaterO(m2l3k)$	13.3385	21.9931	43.9122
RSSH(Tian et al., 2020)	$\textgreaterO(Tnm2+Tnml)$	75.3624	80.4510	86.0708
JPSH	$O (T d^{3} + T n d^{2})$	45.0945	53.9658	62.6788

Table 3. Time complexity and running times of OEH, RSSH and JPSH on 32, 64 and 96 bits. Note that the best results are in bold and the second-best results are underlined.

5. Conclusion

In this paper, we proposed an effective unsupervised hashing method for binary representation learning, namely Jointly Personalized Sparse Hashing (JPSH), which jointly preserves semantic and pairwise similarities in the Hamming space. Firstly, we developed a novel Personalized Sparse Hashing (PSH) module that adaptively maps different clusters to corresponding personalized subspaces to maintain the identical semantics. Then, we jointed PSH and the manifold-based model JSH to construct JPSH, which can preserve semantic and pairwise similarities in a seamless formulation. Finally, an alternating optimization method was adopted to iteratively solve one variable by fixing the others. Extensive experiments on four benchmark image datasets had been conducted. The results verified that our proposed JPSH outperforms the other state-of-the-art unsupervised hashing methods in the similarity search task.

Acknowledgements.

This work was supported in part by the Guanxi Natural Science Foundation under grants (2019GXNSFFA245014, 2020GXNSFBA238014, 2020GXNSFAA297061), the National Natural Science Foundation of China under grants (62172120, 62002082, 62071174), the Natural Science Foundation of Hunan Province under Grant (2020JJ3014), and the Guangxi Key Research and Development Project (AB21220037).

References

M. Chen, L. Zhang, and J. P. Allebach (2015) Learning deep features for image emotion classification. In Proceedings of the IEEE International Conference on Image Processing, pp. 4491–4495. Cited by: §1.
T. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng (2009) NUS-wide: a real-world web image database from national university of singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval, pp. 1–9. Cited by: §4.1.
M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni (2004) Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the Annual Symposium on Computational Geometry, pp. 253–262. Cited by: §1, §4.2, Table 2.
X. Dong, L. Liu, L. Zhu, Z. Cheng, and H. Zhang (2021) Unsupervised deep k-means hashing for efficient image retrieval and clustering. IEEE Transactions on Circuits and Systems for Video Technology 31 (8), pp. 3266–3277. Cited by: §4.1.
R. Fernandez-Beltran, B. Demir, F. Pla, and A. Plaza (2020) Unsupervised remote sensing image retrieval using probabilistic latent semantic hashing. IEEE Geoscience and Remote Sensing Letters 18 (2), pp. 256–260. Cited by: §1.
Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin (2012) Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (12), pp. 2916–2929. Cited by: §1, §3.1, §4.2, Table 2.
D. Hallac, J. Leskovec, and S. Boyd (2015) Network lasso: clustering and optimization in large graphs. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 387–396. Cited by: §3.1.
Y. Hao, T. Mu, J. Y. Goulermas, J. Jiang, R. Hong, and M. Wang (2017) Unsupervised t-distributed video hashing and its deep hashing extension. IEEE Transactions on Image Processing 26 (11), pp. 5531–5544. Cited by: §1.
D. Hu, F. Nie, and X. Li (2018) Discrete spectral hashing for efficient similarity retrieval. IEEE Transactions on Image Processing 28 (3), pp. 1080–1091. Cited by: §4.1.
H. Hu, L. Xie, R. Hong, and Q. Tian (2020) Creating something from nothing: unsupervised knowledge distillation for cross-modal hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3123–3132. Cited by: §1.
M. J. Huiskes and M. S. Lew (2008) The mir flickr retrieval evaluation. In Proceedings of the ACM International Conference on Multimedia Information Retrieval, pp. 39–43. Cited by: §4.1.
P. Indyk (2001) A small approximately min-wise independent family of hash functions. Journal of Algorithms 38 (1), pp. 84–90. Cited by: §1.
G. Irie, Z. Li, X. Wu, and S. Chang (2014) Locally linear hashing for extracting non-linear manifolds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2115–2122. Cited by: §1.
R. Ji, H. Liu, L. Cao, D. Liu, Y. Wu, and F. Huang (2017) Toward optimal manifold hashing via discrete locally linear embedding. IEEE Transactions on Image Processing 26 (11), pp. 5411–5420. Cited by: §1.
Q. Jiang and W. Li (2015) Scalable graph hashing with feature transformation. In Proceedings of the International Joint Conference on Artificial Intelligence, Vol. 15, pp. 2248–2254. Cited by: §1, §4.2, Table 2.
L. Jin, K. Li, Z. Li, F. Xiao, G. Qi, and J. Tang (2018) Deep semantic-preserving ordinal hashing for cross-modal similarity search. IEEE Transactions on Neural Networks and Learning Systems 30 (5), pp. 1429–1440. Cited by: §1.
S. Jin, H. Yao, Q. Zhou, Y. Liu, J. Huang, and X. Hua (2021) Unsupervised discrete hashing with affinity similarity. IEEE Transactions on Image Processing 30, pp. 6130–6141. Cited by: §1, §1.
D. Kong, R. Fujimaki, J. Liu, F. Nie, and C. Ding (2014) Exclusive feature learning on arbitrary structures via l1,2-norm. Advances in Neural Information Processing Systems 27, pp. 1655–1663. Cited by: §3.1.
D. Kong, J. Liu, B. Liu, and X. Bao (2016) Uncorrelated group lasso. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 1765–1771. Cited by: §3.1.
A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. (7). Cited by: §4.1.
A. Krizhevsky, I. Sutskever, and G. E. Hinton (2017) Imagenet classification with deep convolutional neural networks. Communications of the ACM 60 (6), pp. 84–90. Cited by: §4.1.
Z. Lai, Y. Chen, J. Wu, W. K. Wong, and F. Shen (2018) Jointly sparse hashing for image retrieval. IEEE Transactions on Image Processing 27 (12), pp. 6147–6158. Cited by: Figure 2, §1, §2.4, §3.2, §3.4, §4.2, Table 1, Table 2.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §4.1.
J. Li, L. Wu, H. Dani, and H. Liu (2018) Unsupervised personalized feature selection. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 3514–3521. Cited by: §3.4.
J. Li, L. Wu, O. R. Zaïane, and H. Liu (2017a) Toward personalized relational learning. In Proceedings of the SIAM International Conference on Data Mining, pp. 444–452. Cited by: §3.4.
S. Li, X. Li, J. Lu, and J. Zhou (2022) Structure-adaptive neighborhood preserving hashing for scalable video search. IEEE Transactions on Circuits and Systems for Video Technology 32 (4), pp. 2441–2454. Cited by: §1.
X. Li, D. Hu, and F. Nie (2017b) Large graph hashing with spectral rotation. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 2203–2209. Cited by: §3.4.
Z. Li, J. Tang, L. Zhang, and J. Yang (2020) Weakly-supervised semantic guided hashing for social image retrieval. International Journal of Computer Vision 128 (8), pp. 2265–2278. Cited by: §1.
M. Lin, R. Ji, H. Liu, and Y. Wu (2018) Supervised online hashing via hadamard codebook learning. In Proceedings of the ACM International Conference on Multimedia, pp. 1635–1643. Cited by: §4.1.
Z. Lin, G. Ding, M. Hu, and J. Wang (2015) Semantics-preserving hashing for cross-view retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3864–3872. Cited by: §1.
H. Liu, R. Ji, J. Wang, and C. Shen (2018) Ordinal constraint binary coding for approximate nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (4), pp. 941–955. Cited by: §1, §4.2, Table 2.
H. Liu, R. Ji, Y. Wu, and W. Liu (2016) Towards optimal binary code learning via ordinal embedding. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 1258–1265. Cited by: §1, §4.2, Table 2, Table 3.
W. Liu, J. He, and S. Chang (2010) Large graph construction for scalable semi-supervised learning. In Proceedings of the International Conference on International Conference on Machine Learning, pp. 679–686. Cited by: §1, §1.
W. Liu, C. Mu, S. Kumar, and S. Chang (2014) Discrete graph hashing. Advances in Neural Information Processing Systems 27, pp. 3419–3427. Cited by: §1, §1.
W. Liu, J. Wang, S. Kumar, and S. Chang (2011) Hashing with graphs. In Proceedings of the International Conference on Machine Learning, pp. 1–8. Cited by: §1, §2.3.
A. Oliva and A. Torralba (2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. International Journal of Computer Vision 42 (3), pp. 145–175. Cited by: §4.1.
T. Rao, X. Li, and M. Xu (2020) Learning multi-level deep representations for image emotion classification. Neural Processing Letters 51 (3), pp. 2043–2061. Cited by: §1.
T. Rao, X. Li, H. Zhang, and M. Xu (2019) Multi-level region-based convolutional neural network for image emotion classification. Neurocomputing 333, pp. 429–439. Cited by: §1.
F. Shen, C. Shen, Q. Shi, A. Van Den Hengel, and Z. Tang (2013) Inductive hashing on manifolds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1562–1569. Cited by: §1.
Y. Shi, X. Yu, K. Sohn, M. Chandraker, and A. K. Jain (2020) Towards universal representation learning for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6817–6826. Cited by: §1.
J. Tang, J. Lin, Z. Li, and J. Yang (2018) Discriminative deep quantization hashing for face image retrieval. IEEE Transactions on Neural Networks and Learning Systems 29 (12), pp. 6154–6162. Cited by: §1.
Z. Tian, H. Zhang, Y. Chen, and D. Zhang (2020) Unsupervised hashing based on the recovery of subspace structures. Pattern Recognition 103, pp. 107261. Cited by: §1, §4.2, Table 2, Table 3.
L. Van der Maaten and G. Hinton (2008) Visualizing data using t-sne.. Journal of Machine Learning Research 9 (11). Cited by: §1.
K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg (2009) Feature hashing for large scale multitask learning. In Proceedings of the Annual International Conference on Machine Learning, pp. 1113–1120. Cited by: §3.1.
Y. Weiss, A. Torralba, and R. Fergus (2008) Spectral hashing. Advances in Neural Information Processing Systems 21, pp. 1753–1760. Cited by: §1, §2.2, §4.2, Table 2.
Z. Weng and Y. Zhu (2020) Concatenation hashing: a relative position preserving method for learning binary codes. Pattern Recognition 100, pp. 107151. Cited by: §1, §4.2, Table 2.
Y. Xia, K. He, P. Kohli, and J. Sun (2015) Sparse projections for high-dimensional binary codes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3332–3339. Cited by: §1, §4.2, Table 2.
X. Xiang, Y. Zhang, L. Jin, Z. Li, and J. Tang (2021) Sub-region localized hashing for fine-grained image retrieval. IEEE Transactions on Image Processing 31, pp. 314–326. Cited by: §1.
E. Yang, C. Deng, T. Liu, W. Liu, and D. Tao (2018) Semantic structure-based unsupervised deep hashing. In Proceedings of the International Joint Conference on Artificial Intelligence, pp. 1064–1070. Cited by: §1.
X. Yao, D. She, H. Zhang, J. Yang, M. Cheng, and L. Wang (2020) Adaptive deep metric learning for affective image retrieval and classification. IEEE Transactions on Multimedia 23, pp. 1640–1653. Cited by: §1.
X. Yao, D. She, S. Zhao, J. Liang, Y. Lai, and J. Yang (2019) Attention-aware polarity sensitive embedding for affective image retrieval. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1140–1150. Cited by: §1.
W. Zhang, D. Wu, Y. Zhou, B. Li, W. Wang, and D. Meng (2020) Deep unsupervised hybrid-similarity hadamard hashing. In Proceedings of the ACM International Conference on Multimedia, pp. 3274–3282. Cited by: §1.
Z. Zhang, G. Xie, Y. Li, S. Li, and Z. Huang (2019) SADIH: semantic-aware discrete hashing. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5853–5860. Cited by: §1.
S. Zhao, H. Yao, Y. Yang, and Y. Zhang (2014) Affective image retrieval via multi-graph learning. In Proceedings of the ACM International Conference on Multimedia, pp. 1025–1028. Cited by: §1.
L. Zhu, Z. Huang, Z. Li, L. Xie, and H. T. Shen (2018) Exploring auxiliary context: discrete semantic transfer hashing for scalable image retrieval. IEEE Transactions on Neural Networks and Learning Systems 29 (11), pp. 5264–5276. Cited by: §1.
X. Zhu, X. Li, S. Zhang, Z. Xu, L. Yu, and C. Wang (2017) Graph pca hashing for similarity search. IEEE Transactions on Multimedia 19 (9), pp. 2033–2044. Cited by: §1.
X. Zhu, L. Zhang, and Z. Huang (2014) A sparse embedding and least variance encoding approach to hashing. IEEE Transactions on Image Processing 23 (9), pp. 3737–3750. Cited by: §1.
H. Zou, T. Hastie, and R. Tibshirani (2006) Sparse principal component analysis. Journal of Computational and Graphical Statistics 15 (2), pp. 265–286. Cited by: §3.4.