Bahar Aydemir, Ludo Hoffstetter, Tong Zhang, Mathieu Salzmann, Sabine SĂĽsstrunk

School of Computer and Communication Sciences, EPFL, Switzerland

{bahar.aydemir, tong.zhang, mathieu.salzmann, sabine.susstrunk}@epfl.ch

In this supplementary material, we provide additional qualitative results and ablation studies for our TempSAL method. The document is structured as follows:

Section A: Additional Qualitative Results

Section B: Details on the Statistical Analysis

Section C: Equal Duration vs. Equal Distribution

Section D: Results of the Slicing Alternatives

Section E: The Number of Time Slices

Section F: Approximation of Fixation Timestamps

We provide additional qualitative results for our model on the SALICON validation dataset. We use an animated image format due to the temporal nature of our results. Best viewed on screen.

[Figure 8 animation: columns show the input image, the ground truth, and the TempSAL prediction; panel a) Image Saliency.]

Figure 8. Image saliency and temporal saliency predictions with their respective ground truths from the SALICON dataset. Black-and-white maps are the image saliency maps for the whole observation duration; red-yellow maps are the temporal saliency maps for one-second intervals. Our model predicts the temporal saliency of each slice and tracks attention shifts between regions over time. For input images a) and c), the men are initially salient, and attention then shifts to the inanimate objects: the food and the skateboard become more salient afterwards. In row e), people look at the man on the left first, then at the woman on the right, and eventually at the food; our model follows these transitions. Similarly, in rows g) and i), attention focuses on the humans first and then shifts towards the book and the faucet on the right, and we capture these shifts in our predictions. In comparison, there are fewer shifts in row l); however, in the first second of observation, the bird on the left is the most salient region in the image, which our model also successfully detects.

In this section, we provide details on the calculations presented in Section 3.2 of our paper. We aim to observe the evolution of attention over time and discover temporal patterns in the data.

We calculate the correlation coefficient between slices \(\mathcal{T}_j\) and \(\mathcal{T}_k\) as:

$$\mathrm{CC}(\mathcal{T}_j,\mathcal{T}_k) = \frac{1}{N}\sum_{i=1}^{N} \mathrm{CC}(\mathcal{T}_{ij},\mathcal{T}_{ik}),\quad j,k \in \{1,\ldots,5\},$$

where \(N\) is the total number of images, and \(\mathcal{T}_{ij}\) and \(\mathcal{T}_{ik}\) denote the \(j^{th}\) and \(k^{th}\) slices of the \(i^{th}\) image. We illustrate the similarity calculation of slice \(\mathcal{T}_{1}\) with the other slices for a single image as follows:
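For concreteness, this computation can be sketched as follows, assuming the per-image slice maps are stored as NumPy arrays; `pearson_cc` and `slice_similarity` are illustrative names, not taken from our released code:

```python
import numpy as np

def pearson_cc(a, b):
    """Pearson correlation coefficient between two flattened saliency maps."""
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    return float((a * b).mean())

def slice_similarity(slices_j, slices_k):
    """Average CC between the j-th and k-th temporal slices over all images.

    slices_j, slices_k: arrays of shape (N, H, W) holding the temporal
    saliency maps T_ij and T_ik for i = 1..N.
    Returns the mean CC and the N per-image CC samples.
    """
    ccs = [pearson_cc(tj.ravel(), tk.ravel())
           for tj, tk in zip(slices_j, slices_k)]
    return float(np.mean(ccs)), ccs
```

The per-image CC values are kept because they serve as the samples for the significance tests below.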

By performing the same calculation over the \(N\) images, we obtain \(N\) comparisons for each slice pair \((\mathcal{T}_j,\mathcal{T}_k)\), denoted by the arrows in Figure 9.
We can then determine whether the differences between these comparisons are statistically significant: on the pairwise comparisons, we compute t-test scores with \(N=10000\) samples in each comparison. Table 8 displays the correlation coefficients and standard error values for each pair of slices.
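The t-test itself can be sketched in pure Python; we use Welch's unequal-variance variant here as one plausible choice, since the exact test variant is not specified above:

```python
import math
from statistics import mean

def welch_ttest(xs, ys):
    """Welch's two-sample t-test: returns the t statistic and the
    Welch-Satterthwaite degrees of freedom. A p-value then follows from
    the t-distribution with df degrees of freedom."""
    nx, ny = len(xs), len(ys)
    mx, my = mean(xs), mean(ys)
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)  # sample variances
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    se2 = vx / nx + vy / ny  # squared standard error of the mean difference
    t = (mx - my) / math.sqrt(se2)
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df
```

Here `xs` and `ys` would be the per-image CC samples of two slice pairs.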

With the exception of \((\mathcal{T}_{1},\mathcal{T}_{3})\) and \((\mathcal{T}_{1},\mathcal{T}_{5})\), we find that every difference is statistically significant \((p<0.01)\). That is, the CC scores of most slice pairs come from distributions with different means. This reveals differences between the slices, which we exploit to predict temporal saliency and, consequently, to improve overall image saliency prediction.

We also investigate the deviation of each slice from its respective average slice. We compute CC scores between a single slice and the corresponding average slice as:
$$\mathrm{CC}(\mathcal{T}_j,A_j) = \frac{1}{N}\sum_{i=1}^{N} \mathrm{CC}(\mathcal{T}_{ij},A_{j}),\quad j \in \{1,\ldots,5\},$$
where \(A_j\) denotes the \(j^{th}\) average slice. For a single image, we illustrate the similarity calculation of all slices with the average ones as follows:

By performing the same calculation over the \(N\) images, we obtain \(N\) comparisons for each pair \(\mathcal{T}_j\) and \(A_j\), shown by the arrows in Figure 10.
As in the previous section, we determine whether the differences between these comparisons are statistically significant by computing t-test scores with \(N=10000\) samples. Table 9 displays the correlation coefficients and standard error values.

In this case, we find that all comparisons are statistically significantly different from each other \((p<0.01)\). Note that \(\mathcal{T}_1\) has the highest CC value, meaning it is the most similar to its respective average slice; this suggests that subjects tend to look at similar locations during the first second of observation. The remaining slices \(\mathcal{T}_2,\mathcal{T}_3,\mathcal{T}_4,\mathcal{T}_5\) have lower CC values with their respective average slices and thus exhibit greater variability.

In this section, we describe two slicing alternatives that we introduce in Section 4.1 of our paper. The SALICON dataset contains over 4.9M fixation points distributed across 5 seconds of observation time. Figure 11 shows the distribution of fixations over time.

We represent temporal saliency by dividing fixations into time slices. Using too many slices increases in-slice variance, reduces the number of fixations per slice, and hence reduces predictability. Using too few slices, on the other hand, restricts the observation of attention shifts. Therefore, we use one-second intervals due to their interpretability and in reference to the CodeCharts [1] method, which collects data in one-second increments.
The "Equal duration" slice format simply divides the data into equal-duration time intervals; Section 3.2 uses this slicing with a duration of one second. This approach is straightforward, but it provides no guarantee on the sampling balance between the slices.
As seen from Figure 11, the number of fixations in the first second is lower than in the subsequent seconds. Therefore, we also consider an "Equal distribution" slice format, which equalizes the sample probability and corrects the small skew in the fixation distribution. We illustrate the two slicing alternatives, with five red boxes indicating the slices, in Figure 12 a) and b).
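For illustration, both slicing schemes can be derived from the fixation timestamps alone; `slice_boundaries` is a hypothetical helper, with the equal-distribution boundaries taken at timestamp quantiles:

```python
import numpy as np

def slice_boundaries(timestamps_ms, n_slices=5, mode="equal_duration",
                     total_ms=5000):
    """Boundaries (in ms) of the temporal slices.

    'equal_duration' splits the viewing time into fixed-length intervals;
    'equal_distribution' places boundaries at quantiles of the fixation
    timestamps, so every slice holds the same number of fixations.
    """
    if mode == "equal_duration":
        return np.linspace(0, total_ms, n_slices + 1)
    qs = np.linspace(0, 1, n_slices + 1)  # 0%, 20%, ..., 100% quantiles
    return np.quantile(np.asarray(timestamps_ms), qs)
```

With perfectly uniform timestamps the two schemes coincide; the skew in the first second of the SALICON data is what moves the first equal-distribution boundary.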

We show the boundaries of each temporal slice for two slicing alternatives in Table 10.

The time division of the equal-distribution slices is almost identical to that of the equal-duration method; the greatest difference between the two lies in the first slice. We therefore conduct a statistical analysis similar to that of Section 3.2 on the equal-distribution slices. Table 11 displays the correlation coefficients and standard error values for each pair of slices.

We calculate t-test scores for the pairwise comparisons as in Section B.1. We find that the differences for \((\mathcal{T}_{2},\mathcal{T}_{3})\) and \((\mathcal{T}_{4},\mathcal{T}_{5})\) are not statistically significant. Moreover, Table 11 displays the correlation coefficients and standard error values, which we calculate as in Section B.2.

The mean values are closer to each other than those of the equal-duration method, shown in Table 9. This can be explained by the balanced number of samples in each slice.

We break the fixations down into time slices with the two time-slicing alternatives described in the previous section, namely equal duration and equal distribution. The equal-duration method groups the fixations based on their timestamps, so each slice contains a different total number of fixations. The equal-distribution method, on the other hand, groups an equal number of fixations into each slice, so the slices differ in duration. We train and evaluate one model per sampling method. The results are presented in Table 13.

As an ablation study, we break the fixations down into different numbers of time slices, as mentioned in Section 5.7 of our paper. We train the models with their respective numbers of slices and evaluate them on image saliency. The results are presented in Table 14.

The SALICON dataset provides saliency maps, fixations, and gaze points for each image and observer. Following common practice in eye tracking experiments, Jiang et al. [2] grouped spatially and temporally close gaze points to create fixations. Since these fixations were created by grouping multiple gaze points, they do not have a particular timestamp. SALICON-MD [1] assumes that the fixations are uniformly distributed across the total viewing time. We use a finer approximation for recovering the fixations' timestamps by minimizing the spatial and temporal distance between a fixation and the nearest gaze point. Here, we provide the details of our approximation.

A simple approach to this problem is to match each fixation with the raw gaze point that is spatially closest to it:

$$f_{ts}=\underset{p_{ts},\,\forall p \in \text{GazePoints}}{\operatorname{argmin}}\left\|f_{xy}-p_{xy}\right\|^{2},$$where \(f_{ts}\) is the desired fixation timestamp, and \(f_{xy}\) and \(p_{xy}\) are the spatial coordinates of the fixation and the gaze point, respectively. Although this simple spatial attribution matched well in our initial experiments, it does not account for temporal "boomerang" patterns: the gaze tends to focus first on the most prominent part of the image before investigating the context and returning to the initial attention location [1]. This effect produces gaze points that are close in space but far apart in time. To avoid this issue, we use the fact that the fixations are sorted chronologically in the dataset. Assuming a uniform fixation distribution over time yields the timestamp assignment \(f_{ts}\):

$$f_{ts}=5000 \cdot \frac{\text{index}+1}{\#\text{fixations}+1},$$where the total observation time is 5000 milliseconds, index is the fixation's occurrence order, and #fixations is the total number of fixations in the image. The first method considers only the spatial distance, while the second considers only the temporal distance; neither takes all available information into account. We therefore combine the two approaches for a more accurate spatio-temporal timestamp approximation. For a given fixation, we compute a score for every gaze point that combines its spatial distance to the fixation with its temporal distance to the uniform-time estimate \(f_{ts}\) above. We calculate \(p_{score}\) as follows, where \(w\) is the weighting factor balancing the temporal difference \(p_{time\_diff}\) against the spatial distance \(p_{space\_dist}\):

$$p_{\text{space\_dist}}=\left\|f_{xy}-p_{xy}\right\|^{2}$$ $$p_{\text{time\_diff}}=\left|p_{ts}-f_{ts}\right|$$ $$p_{\text{score}}=p_{\text{space\_dist}}+w \cdot p_{\text{time\_diff}}$$ $$f_{ts}=\underset{p_{ts},\,\forall p \in \text{GazePoints}}{\operatorname{argmin}}\; p_{\text{score}},$$where \(p_{ts}\) denotes the timestamp of a gaze point. This requires optimizing the weighting factor \(w\). Over the 10000 training samples of the SALICON dataset, we empirically found \(w=0.017\) to be the best trade-off between the spatial and temporal distances.
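Putting the pieces together, the combined scoring can be sketched as follows, assuming the uniform-time estimate provides the initial \(f_{ts}\); the function and argument names are illustrative:

```python
import numpy as np

def spatiotemporal_timestamp(fix_xy, fix_index, n_fixations,
                             gaze_xy, gaze_ts, w=0.017, total_ms=5000):
    """Timestamp approximation for one fixation: each gaze point is scored
    by its squared spatial distance to the fixation plus w times its
    temporal distance to the uniform-time estimate f_ts; the timestamp of
    the best-scoring gaze point is returned.

    fix_xy: (2,) fixation coordinates; gaze_xy: (M, 2) gaze coordinates;
    gaze_ts: (M,) gaze timestamps in milliseconds.
    """
    f_ts = total_ms * (fix_index + 1) / (n_fixations + 1)  # uniform prior
    space_dist = np.sum((gaze_xy - fix_xy) ** 2, axis=1)   # p_space_dist
    time_diff = np.abs(gaze_ts - f_ts)                     # p_time_diff
    score = space_dist + w * time_diff                     # p_score
    return gaze_ts[int(np.argmin(score))]
```

Setting `w=0` recovers the purely spatial matching, while a very large `w` recovers the uniform-time assignment.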

[1] Camilo Fosco, Anelise Newman, Pat Sukhum, Yun Bin Zhang, Nanxuan Zhao, Aude Oliva, and Zoya Bylinskii. How much time do you have? Modeling multi-duration saliency. In *IEEE/CVF Conference on Computer Vision and Pattern Recognition*, 2020.

[2] Ming Jiang, Shengsheng Huang, Juanyong Duan, and Qi Zhao. SALICON: Saliency in context. In *IEEE/CVF Conference on Computer Vision and Pattern Recognition*, 2015.