TempSAL - Uncovering Temporal Information for Deep Saliency Prediction

Project Page and Supplementary Material

Bahar Aydemir, Ludo Hoffstetter, Tong Zhang, Mathieu Salzmann, Sabine Süsstrunk
School of Computer and Communication Sciences, EPFL, Switzerland
{bahar.aydemir, tong.zhang, mathieu.salzmann, sabine.susstrunk}@epfl.ch

A. Additional Qualitative Results

We provide additional qualitative results for our model on the SALICON validation dataset. We use an animated image format due to the temporal nature of our results. Best viewed on screen.

[Animated figure. Column labels: Input image, Ground truth. Panels a), c), e), g), i), k): Image Saliency. Panels b), d), f), h), j), l): Temporal Saliency.]

Figure 8. Image saliency and temporal saliency predictions with their respective ground truths from the SALICON dataset. Black-and-white maps are the image saliency maps for the whole observation duration. Red-yellow maps are the temporal saliency maps for one-second intervals. Our model can predict the temporal saliency of each slice and track attention shifts between regions over time. For input images a) and c), the men are initially salient, then the attention shifts to the inanimate objects. That is, the food and the skateboard become more salient afterwards. In row e), people look at the man on the left first, then at the woman on the right, and eventually at the food. Our model is able to follow these transitions. Similarly, in rows g) and i), the attention is focused on the humans first, then shifts towards the book and the faucet on the right. We are able to capture these shifts in our predictions. In comparison, there are fewer shifts in row l). However, in the first second of observation, the bird on the left is the most salient region in the image, which is also successfully detected by our model.

B. Details on the Statistical Analysis

In this section, we provide details on the calculations presented in Section 3.2 of our paper. We aim to observe the evolution of attention over time and discover temporal patterns in the data.

B.1 Inter-slice similarity across time

We calculate the correlation coefficient between slices \(\mathcal{T}_j\) and \(\mathcal{T}_k\) as:

$$\mathrm{CC}(\mathcal{T}_j,\mathcal{T}_k) = \frac{1}{N}\sum_{i=1}^{N} \mathrm{CC}(\mathcal{T}_{ij},\mathcal{T}_{ik}) ,\quad j,k \in \{1,\ldots,5\},$$
where \(N\) is the total number of images, and \( \mathcal{T}_{ij}\) and \(\mathcal{T}_{ik}\) denote the \(j^{th} \) and \(k^{th}\) slice of the \(i^{th}\) image. We illustrate the similarity calculation of slice \( \mathcal{T}_{1}\) with the other slices for a single image as follows:

Figure 9. Calculation of CC score for slice \( \mathcal{T}_{1}\) of Image \( \mathcal{I}_{i}\).

By performing the same calculation over the \(N\) images, we obtain \(N\) comparisons for each slice pair \((\mathcal{T}_j, \mathcal{T}_k)\), which are denoted by the arrows in Figure 9. As a result, we can determine whether the difference between these comparisons is statistically significant. On these pairwise comparisons, we compute t-test scores with \(N=10000\) samples in each comparison. Table 8 displays the correlation coefficients and standard error values for each pair of slices.

Table 8. Correlation coefficients with the standard error values for each pair of slices. The bold values indicate the slices most similar to each other. The highlights indicate statistically insignificant differences.

With the exception of \( (\mathcal{T}_{1},\mathcal{T}_{3} )\) and \(( \mathcal{T}_{1},\mathcal{T}_{5}) \), we find that every difference is statistically significant \((p<0.01)\). That is, CC scores for the majority of the pairs come from different distributions with different mean values. This shows a difference in the slices that we exploit to predict temporal saliency and consequently to improve the overall image saliency prediction.
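The averaging in B.1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names `cc` and `inter_slice_cc`, the array layout `(N, 5, H, W)`, and the 0-based slice indices are assumptions made for the example. CC is taken to be the Pearson correlation between the flattened maps, and the standard error is computed as the sample standard deviation divided by \(\sqrt{N}\).

```python
import numpy as np


def cc(a, b):
    """Pearson correlation coefficient between two saliency maps."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    # Convention chosen for this sketch: a constant map has zero correlation.
    return float((a * b).sum() / denom) if denom > 0 else 0.0


def inter_slice_cc(slices, j, k):
    """Mean CC and standard error between slices j and k over N images.

    slices: array of shape (N, 5, H, W) (assumed layout); j, k are 0-based.
    Returns (mean, standard_error, per_image_scores).
    """
    scores = np.array([cc(img[j], img[k]) for img in slices])
    mean = float(scores.mean())
    sem = float(scores.std(ddof=1) / np.sqrt(len(scores)))
    return mean, sem, scores
```

The per-image scores returned as the third value are the \(N\) samples on which a significance test between two slice pairs could then be run.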

B.2 Intra-slice similarity across images

We also investigate the deviation of each slice from its respective average slice. We compute CC scores between a single slice and the corresponding average slice as: $$\mathrm{CC}(\mathcal{T}_j,A_j) = \frac{1}{N}\sum_{i=1}^{N} \mathrm{CC}(\mathcal{T}_{ij},A_{j}) ,\quad j \in \{1,\ldots,5\},$$ where \(A_j\) denotes the \(j^{th}\) average slice. For a single image, we illustrate the similarity calculation of all slices with the average ones as follows:
Figure 10. Calculation of CC scores for all slices of Image \( \mathcal{I}_{i}\).

By performing the same calculation over the \(N\) images, we obtain \(N\) comparisons for each pair \((\mathcal{T}_j, A_j)\), which are shown by the arrows in Figure 10. As in the previous section, we determine whether the differences between these comparisons are statistically significant. We compute t-test scores with \(N=10000\) samples. Table 9 displays the correlation coefficients and standard error values.
Table 9. CC scores for all slices with the corresponding average slices. This table is also included in the main paper as Table 2 without the standard error values.

In this case, we find that all comparisons are statistically significantly different from each other \((p<0.01)\). Note that \(\mathcal{T}_1\) has the highest CC value. This means that it is the slice most similar to its respective average slice, which suggests that subjects tend to look at similar locations during the first second of observation. The other slices, \(\mathcal{T}_2,\mathcal{T}_3,\mathcal{T}_4,\mathcal{T}_5\), have lower CC values with their respective average slices and thus exhibit greater variety.
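A possible way to compute the intra-slice scores of B.2 is sketched below. It assumes that the average slice \(A_j\) is the pixel-wise mean of slice \(j\) over all images and that all maps share one resolution; neither detail is spelled out in this section, so treat both as assumptions. The names `pearson_cc` and `intra_slice_cc` are hypothetical.

```python
import numpy as np


def pearson_cc(a, b):
    """Pearson correlation coefficient between two maps of equal shape."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0


def intra_slice_cc(slices):
    """CC of every slice with its average slice.

    slices: array of shape (N, S, H, W) (assumed layout).
    Returns (mean_cc, standard_error), each of shape (S,).
    """
    slices = np.asarray(slices, dtype=float)
    n, s = slices.shape[:2]
    # Assumption: A_j is the pixel-wise mean of slice j over the N images.
    avg = slices.mean(axis=0)
    scores = np.array(
        [[pearson_cc(slices[i, j], avg[j]) for j in range(s)] for i in range(n)]
    )
    return scores.mean(axis=0), scores.std(axis=0, ddof=1) / np.sqrt(n)
```

Under this reading, a high mean CC for a slice indicates that observers look at similar locations across images during that interval.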

C. Equal Duration vs Equal Distribution

In this section, we describe two slicing alternatives that we introduce in Section 4.1 of our paper. The SALICON dataset contains over 4.9M fixation points distributed across 5 seconds of observation time. Figure 11 shows the distribution of fixations over time.

Figure 11. Number of fixations with their timestamps.

We represent temporal saliency by dividing fixations into time slices. Using too many slices increases in-slice variance, reduces the number of fixations per slice, and hence reduces predictability. Using too few slices, on the other hand, restricts the observation of attention shifts. Therefore, we use one-second intervals due to their interpretability and in reference to the Codecharts[1] method, which collects data in one-second increments. The "Equal duration" slice format simply divides the data into equal-duration time intervals. In Section 3.2, we have used this slicing with a duration of one second. This fixation processing approach is straightforward to comprehend. It does not, however, provide any guarantee on the sampling balance between the slices. As seen from Figure 11, the number of fixations in the first second is lower than in the subsequent seconds. Therefore, we also consider an "Equal distribution" slice format to have equal sample probability in each slice and to correct the small skew in the fixation distribution. We illustrate these slicing alternatives in Figure 12 a) and b), where five red boxes mark the slices.
Figure 12. a) Time intervals of equal duration slicing; b) Time intervals of equal distribution slicing; c) Number of fixations per slice in equal duration slicing; d) Number of fixations per slice in equal distribution slicing. Note that equal duration slicing has uniform time intervals as shown in a), while equal distribution slicing has an equal number of samples in each slice as shown in d).
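The two slicing schemes can be illustrated with a short NumPy sketch. Equal duration uses evenly spaced boundaries, while equal distribution places the boundaries at quantiles of the fixation timestamps. The function names and the use of `np.quantile` are illustrative assumptions, not a description of the actual implementation.

```python
import numpy as np


def equal_duration_edges(total_ms=5000.0, n_slices=5):
    """Slice boundaries of equal length, e.g. 0, 1000, ..., 5000 ms."""
    return np.linspace(0.0, total_ms, n_slices + 1)


def equal_distribution_edges(timestamps_ms, n_slices=5):
    """Slice boundaries at quantiles, so each slice holds about as many fixations."""
    return np.quantile(timestamps_ms, np.linspace(0.0, 1.0, n_slices + 1))


def assign_slices(timestamps_ms, edges):
    """Map each timestamp to a slice index in 0 .. n_slices - 1."""
    # Only the interior boundaries are needed to separate the slices.
    return np.digitize(timestamps_ms, edges[1:-1])


# Toy timestamps with fewer fixations in the first second (made-up data).
ts = np.concatenate([np.linspace(0, 999, 50), np.linspace(1000, 4999, 450)])
per_slice_duration = np.bincount(assign_slices(ts, equal_duration_edges()), minlength=5)
per_slice_distribution = np.bincount(
    assign_slices(ts, equal_distribution_edges(ts)), minlength=5
)
```

On the toy data, the first equal-duration slice contains fewer fixations than the others, whereas the equal-distribution slices are balanced at the cost of unequal durations.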

We show the boundaries of each temporal slice for two slicing alternatives in Table 10.
Table 10. Time intervals for the two different slicing methods

The time division of the equal distribution slices is almost identical to that formed by the equal duration method. The greatest difference between the two methods is in the first slice. Therefore, we conduct a statistical analysis similar to that of Section 3.2 with the equal distribution slices. Table 11 displays the correlation coefficients and standard error values for each pair of slices.
Table 11. Correlation coefficients with the standard error values for each pair of slices with equal distribution method. The highlights indicate statistically insignificant differences.
We calculate t-test scores for pairwise comparisons as in Section B.1. We also find that the difference between \(( \mathcal{T}_{2},\mathcal{T}_{3}) \) and \(( \mathcal{T}_{4},\mathcal{T}_{5}) \) is not statistically significant. Moreover, Table 12 displays the correlation coefficients and standard error values, which we calculate as in Section B.2.
Table 12. CC scores for all slices with the corresponding average slices with equal distribution method. The bold values indicate the slices most similar to each other.
The mean values are closer to each other than the mean values in the equal duration method which are shown in Table 9. This can be explained by the balanced number of samples in each slice.

D. Results of the Slicing Alternatives

We break down the fixations into time slices with two time-slicing alternatives, namely equal duration and equal distribution as we describe in the previous section. The equal duration method groups the fixations based on their timestamps. Each slice has a different total number of fixations. On the other hand, the equal distribution method groups an equal number of fixations in each slice. Therefore, the duration of each slice is different from that of the other ones. We train and evaluate two models using both sampling methods. The results are presented in Table 13.

Table 13. Results of the equal distribution model (first column) and the equal duration one (second column) across different time slices. The equal distribution model includes an equal number of fixations per slice. The equal duration model achieves better results in 13 out of 20 comparisons.

E. The Number of Time Slices

We break down the fixations into different numbers of time slices as an ablation study, as we mention in Section 5.7 of our paper. We train the models using their respective number of slices and evaluate on image saliency. The results are presented in Table 14.

Table 14. Effect of the number of time slices on the SALICON validation dataset. When we increase the number of slices, the number of fixations per slice decreases. Therefore, the time slices and their predictions become noisy. We observe a decrease in the performance as the number of time slices increases. However, training with 3 or 4 slices yields results similar to using 5 slices, which shows that our approach is not highly sensitive to this number.

F. Approximation of Fixation Timestamps

The SALICON dataset provides saliency maps, fixations, and gaze points for each image and observer. Following common practice in eye tracking experiments, Jiang et al.[2] grouped spatially and temporally close gaze points to create fixations. Since these fixations were created by grouping multiple gaze points, they do not have a particular timestamp. SALICON-MD[1] assumes that the fixations are uniformly distributed across the total viewing time. We use a finer approximation for recovering the fixations’ timestamps by minimizing the spatial and temporal distance between a fixation and the nearest gaze point. Here, we provide the details of our approximation.

A simple approach to this problem is to match the raw gaze point in space that is closest to the fixation point. This approach is as follows:

$$f_{t s}=\underset{p_{t s}, \forall p \in \text { GazePoints } }{\operatorname{argmin}}\left\|f_{x y}-p_{x y}\right\|^{2}$$

where \(f_{ts}\) is the desired fixation timestamp and \(f_{xy}\) and \(p_{xy}\) are the spatial coordinates of the fixation and gaze point, respectively. Although this simple spatial attribution matches well according to our initial experiments, it does not account for the temporal "boomerang" patterns. The gaze tends to focus first on the most prominent part of the image before investigating the context and returning to the initial attention location [1]. This effect therefore produces gaze points that are close in space but far apart in time. To avoid this issue, we use the fact that the fixations are sorted chronologically in the dataset. Assuming a uniform fixation distribution throughout time corresponds to the timestamp assignment \(f_{ts}\):

$$f_{t s}=5000 * \frac{(\text { index }+1)} {(\text{# of fixations} +1)}$$

where the total observation time is 5000 milliseconds, index is the fixation's occurrence order, and # of fixations is the total number of fixations in the image. The first method only considers the spatial distance while the second one only considers the temporal distance. Neither of them takes into account all available information. We combine these two approaches for a more accurate spatio-temporal timestamp approximation. For a given fixation, we compute a distance score for each gaze point. We assign the timestamp of the spatially closest gaze point to the given fixation. Then, we calculate \(p_{score}\) as follows, where \(w\) is the weighting factor responsible for balancing the difference in time \(p_{time\_diff}\) and distance in space \(p_{space\_dist}\):

$$p_{space\_dist}=\left\|f_{xy}-p_{xy}\right\|^{2}$$
$$p_{time\_diff}=\left|p_{ts}-f_{ts}\right|$$
$$p_{score}=p_{space\_dist}+w * p_{time\_diff}$$
$$f_{ts}=\underset{p_{ts},\ \forall p \in \text{GazePoints}}{\operatorname{argmin}}\ p_{score}$$

where \(p_{ts}\) denotes the timestamp of a gaze point. This requires the optimization of the weighting factor \(w\). Over the 10000 training samples in the SALICON dataset, we empirically found that \(w=0.017\) is the best weighting factor between spatial and temporal distances.
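The three timestamp estimates can be sketched as follows. This is a hedged illustration rather than the released code. The function names are hypothetical, and coordinates in pixels and timestamps in milliseconds are assumptions. The sketch also assumes that the reference \(f_{ts}\) in the time-difference term is the uniform estimate; the text could also be read as using the spatially matched timestamp instead.

```python
import numpy as np


def uniform_timestamps(n_fixations, total_ms=5000.0):
    """Uniform assumption: 5000 * (index + 1) / (# of fixations + 1)."""
    index = np.arange(n_fixations)
    return total_ms * (index + 1) / (n_fixations + 1)


def spatial_timestamp(fix_xy, gaze_xy, gaze_ts):
    """Timestamp of the gaze point closest in space to one fixation."""
    gaze_xy = np.asarray(gaze_xy, dtype=float)
    d2 = ((gaze_xy - np.asarray(fix_xy, dtype=float)) ** 2).sum(axis=1)
    return float(np.asarray(gaze_ts, dtype=float)[np.argmin(d2)])


def spatiotemporal_timestamps(fix_xy, gaze_xy, gaze_ts, w=0.017, total_ms=5000.0):
    """Timestamp of the gaze point minimising space_dist + w * time_diff.

    fix_xy: (F, 2) fixations in chronological order; gaze_xy: (G, 2);
    gaze_ts: (G,) gaze timestamps.
    """
    fix_xy = np.asarray(fix_xy, dtype=float)
    gaze_xy = np.asarray(gaze_xy, dtype=float)
    gaze_ts = np.asarray(gaze_ts, dtype=float)
    # Assumption: the uniform estimate serves as the reference f_ts.
    reference = uniform_timestamps(len(fix_xy), total_ms)
    out = np.empty(len(fix_xy))
    for i, (f, t_ref) in enumerate(zip(fix_xy, reference)):
        space_dist = ((gaze_xy - f) ** 2).sum(axis=1)
        time_diff = np.abs(gaze_ts - t_ref)
        out[i] = gaze_ts[np.argmin(space_dist + w * time_diff)]
    return out


# Toy "boomerang": the gaze returns to the first location at the end.
gaze_xy = [(10.0, 10.0), (50.0, 50.0), (10.5, 10.0)]
gaze_ts = [100.0, 2500.0, 4900.0]
fixations = [(10.0, 10.0), (50.0, 50.0), (10.0, 10.0)]
estimated = spatiotemporal_timestamps(fixations, gaze_xy, gaze_ts)
```

In this toy case, purely spatial matching would assign the last fixation the timestamp of the first gaze point, while the combined score prefers the late gaze point near the same location.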



[1] Camilo Fosco, Anelise Newman, Pat Sukhum, Yun Bin Zhang, Nanxuan Zhao, Aude Oliva, and Zoya Bylinskii. How much time do you have? modeling multi-duration saliency. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
[2] Ming Jiang, Shengsheng Huang, Juanyong Duan, and Qi Zhao. SALICON: Saliency in context. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2015.