We propose a novel neural representation for videos (NeRV) which encodes videos in neural networks. Table 4.5 shows results for common activation layers. In Table 4.5, we apply common normalization layers in the NeRV block. Although adopting SSIM alone produces the highest MS-SSIM score, the combination of L1 loss and SSIM loss achieves the best trade-off between PSNR and MS-SSIM. Conclusion. As is common practice, we fine-tune the model after the pruning operation to regain the representation. Video encoding in NeRV is simply fitting a neural network to video frames, and the decoding process is a simple feedforward operation. 21 May 2021, 20:48 (modified: 22 Jan 2022, 15:59). Keywords: neural representation, implicit representation, video compression, video denoising. Given a frame index, NeRV outputs the corresponding RGB image. Figure 6 shows the results of different pruning ratios, where a model with 40% sparsity still reaches performance comparable to the full model. We perform experiments on the Big Buck Bunny sequence from scikit-video, which has 132 frames of 720×1080 resolution, to compare NeRV with pixel-wise implicit representations. The GELU [19] activation function achieves the highest performance and is adopted as our default design. Activation layer. PS-NeRV, which represents videos as a function of patches and the corresponding patch coordinate. What is a video? Given a neural network fit on a video, we first use global unstructured pruning to reduce the model size. Weights are thresholded at $\theta_q$, where $\theta_q$ is the $q$-th percentile value over all parameters in $\theta$. Video compression visualization. As the most popular media format nowadays, videos are generally viewed as sequences of frames. Unlike conventional representations that treat videos as frame sequences, we represent videos as neural networks taking frame index as input.
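The global unstructured pruning step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `global_magnitude_prune`, the dictionary-of-lists weight layout, and the tie-handling rule are all assumptions made for the example.

```python
def global_magnitude_prune(layers, sparsity):
    """Zero the smallest-magnitude weights, ranked across ALL layers at once.

    layers:   dict mapping a (hypothetical) layer name to a flat list of floats
    sparsity: fraction of all weights to zero, e.g. 0.4 for 40% sparsity
    """
    # "Global": one ranking over every parameter, not one ranking per layer.
    magnitudes = sorted(abs(w) for ws in layers.values() for w in ws)
    n_prune = int(sparsity * len(magnitudes))
    if n_prune == 0:
        return {name: list(ws) for name, ws in layers.items()}

    # Threshold = largest magnitude among the weights selected for removal.
    threshold = magnitudes[n_prune - 1]

    # Assumption: weights tied with the threshold are pruned as well,
    # so the realized sparsity can slightly exceed the requested one.
    return {
        name: [0.0 if abs(w) <= threshold else w for w in ws]
        for name, ws in layers.items()
    }


weights = {"stem": [0.5, -0.1, 0.9], "head": [-0.05, 0.3]}
pruned = global_magnitude_prune(weights, sparsity=0.4)
```

Because the threshold is shared across layers, a layer with many small weights can end up sparser than the overall ratio while another layer stays nearly dense. The pruned model would then be fine-tuned, as the text notes.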
Recently, the image-wise implicit neural representation of videos, NeRV, has gained popularity for its promising results and swift speed compared to regular pixel-wise implicit representations. Following prior works, we used ffmpeg [49]. Most notably, we examine the suitability of NeRV for video compression. We compare with H.264 [58], HEVC [47], STAT-SSF-SP [61], HLVC [60], Scale-space [1], and Wu et al. Therefore, video encoding is done by fitting a neural network f to a given video, such that it can map each input timestamp to the corresponding RGB frame. The main differences between our work and pixel-wise implicit representations are the output space and architecture designs. A convolutional neural network (CNN) is a type of artificial neural network used in image recognition and processing that is specifically designed to process pixel data. CNNs are powerful image-processing AI systems that use deep learning to perform both generative and descriptive tasks, often using machine vision that includes ... Such a long pipeline makes the decoding process very complex as well. Loss objective. We take SIREN [44] and NeRF [33] as the baselines, where SIREN [44] takes the original pixel coordinates as input and uses sine activations, while NeRF [33] adds one positional embedding layer to encode the pixel coordinates and uses ReLU activations. In contrast, with NeRV, we can use any neural network compression method as a proxy for video compression, and achieve comparable performance to traditional frame-based video compression approaches (H.264, HEVC, etc.). As an image-wise implicit representation, NeRV outputs the whole image and shows great efficiency compared to pixel-wise implicit representations, improving the encoding speed by 25x to 70x and the decoding speed by 38x to 132x, while achieving better video quality.
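The positional embedding layer mentioned for the NeRF baseline can be illustrated with the standard sinusoidal encoding. This is a sketch under assumptions: the function name, the default number of frequencies, and the base of 2 are illustrative choices and are not taken from the text above.

```python
import math


def positional_embedding(x, num_freqs=4, base=2.0):
    """Map a scalar coordinate x to [sin, cos] pairs at growing frequencies.

    num_freqs and base are illustrative defaults (assumptions), not values
    stated in the source text.
    """
    features = []
    for k in range(num_freqs):
        angle = (base ** k) * math.pi * x
        features.append(math.sin(angle))
        features.append(math.cos(angle))
    return features


# A normalized coordinate (pixel position or frame index) in [0, 1].
embedded = positional_embedding(0.5, num_freqs=4)
```

Feeding the network these features instead of the raw coordinate gives it direct access to high-frequency variation, which is the usual motivation for embedding a low-dimensional input before an MLP.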
First, we concatenate 7 videos into one single video along the time dimension and train NeRV on all the frames from the different videos, which we found to be more beneficial than training a single model for each video. Due to the simple decoding process (feedforward operation), NeRV shows a great advantage, even over carefully optimized H.264. As the first image-wise neural representation, NeRV generally achieves comparable performance with traditional video compression techniques and other learning-based video compression approaches. Therefore, unlike traditional video representations which treat videos as sequences of frames, shown in Figure 1 (a), our proposed NeRV considers a video as a unified neural network with all information embedded within its architecture and parameters, shown in Figure 1 (b). We first conduct an ablation study on the Big Buck Bunny video. DIP emphasizes that its image prior is only captured by the network structure of convolution operations because it only feeds on a single image. In the loss objective, $T$ is the number of frames, $f(t) \in \mathbb{R}^{H \times W \times 3}$ is the NeRV prediction, $v_t \in \mathbb{R}^{H \times W \times 3}$ is the ground-truth frame, and $\alpha$ is a hyper-parameter that balances the weight of each loss component. However, neither pixel-wise nor image-wise representation is the most suitable strategy for video data. Rate distortion plots on the MCL-JCV dataset. For the fine-tuning process after pruning, we use 50 epochs for both UVG and Big Buck Bunny.
These can be viewed as a denoising upper bound for any additional compression process. Input Embedding. To compare with state-of-the-art methods on the video compression task, we run experiments on the widely used UVG [32] dataset, consisting of 7 videos and 3900 frames in total at 1920×1080 resolution. Figure 6 shows the full compression pipeline with NeRV. As a simple and efficient video representation, HNeRV also shows decoding advantages in speed, flexibility, and deployment compared to traditional codecs (H.264, H.265) and learning-based compression methods. At a similar memory budget, NeRV shows image details with better quality. The evaluation metrics for H.264 and HEVC are produced with ffmpeg. With such a representation, we show that by simply applying general model compression techniques, NeRV can match the performance of traditional video compression approaches on the video compression task, without the need to design a long and complex pipeline. By changing the hidden dimension of the MLP and the channel dimension of the NeRV blocks, we can build NeRV models of different sizes. In this work, we present a novel neural representation for videos, NeRV, which encodes videos into neural networks. It is worth noting that when BPP is small, NeRV can match the performance of the state-of-the-art method, showing its great potential in high-rate video compression. In the UVG experiments on the video compression task, we train models of different sizes by changing the values of $(C_1, C_2)$ to (48,384), (64,512), (128,512), (128,768), (128,1024), (192,1536), and (256,2048).
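Since the video is stored as network weights, bits per pixel (BPP) can be estimated from the model size. The sketch below is a back-of-the-envelope calculation under stated assumptions: the parameter count and the 8 bits per parameter are hypothetical, pixels are counted as height × width per frame, and further savings from pruning or entropy coding are ignored. Only the UVG frame count and resolution come from the text.

```python
def bits_per_pixel(num_params, bits_per_param, num_frames, height, width):
    """Estimate BPP as stored model bits divided by total video pixels."""
    total_bits = num_params * bits_per_param
    total_pixels = num_frames * height * width
    return total_bits / total_pixels


# UVG figures from the text: 3900 frames at 1920x1080.
# num_params and bits_per_param are hypothetical values for illustration.
bpp = bits_per_pixel(
    num_params=12_500_000,
    bits_per_param=8,
    num_frames=3900,
    height=1080,
    width=1920,
)
```

Under this view, growing the hidden and channel dimensions raises the parameter count and therefore the BPP, which is how models of different sizes trace out points on a rate-distortion curve.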
Traditional video compression frameworks are quite involved: they specify key frames and inter frames, estimate the residual information, partition the video frames into blocks, apply the discrete cosine transform to the resulting image blocks, and so on. Advances in Neural Information Processing Systems 34 (NeurIPS 2021). (c) and (e) are denoising outputs for DIP. Input embedding ablation. As shown in Table 3, the decoded video quality keeps increasing as the number of training epochs grows. Compression ablation. There are some limitations with the proposed NeRV. Comparison with state-of-the-art methods. The goal of model compression is to simplify an original model by reducing the number of parameters while maintaining its accuracy. A novel neural representation for videos (NeRV) that encodes videos in neural networks taking the frame index as input; it can be used as a proxy for video compression and achieves performance comparable to traditional frame-based video compression approaches. Entropy Encoding. We also show that NeRV can outperform standard denoising methods.
Specifically, we train our model with a subset of frames sampled from one video, and then use the trained model to predict unseen frames given an unseen interpolated frame index. First, to achieve comparable PSNR and MS-SSIM performance, the training time of our proposed approach is longer than the encoding time of traditional video compression methods. Temporal interpolation results for video with small motion. Surprisingly, our model tries to avoid the influence of the noise and regularizes it implicitly, with little harm to the compression task, which can serve well for most partially distorted videos in practice. Acknowledgement. This project was partially funded by the DARPA SAIL-ON (W911NF2020009) program, an independent grant from Facebook AI, and Amazon Research Award to AS. Our method represents the scene as a continuous volumetric function parameterized as MLPs whose inputs are a 3D ... With such a representation, we can treat videos as neural networks, simplifying several video-related tasks. We speed up NeRV by running it in half precision (FP16). In contrast, NeRV can output the frame at any time index independently, which makes parallel decoding much simpler. Pixel-wise representations output the RGB value for each pixel, while NeRV outputs a whole image, as demonstrated in Figure 2. PSNR and MS-SSIM are adopted for evaluation of the reconstructed videos. Inspired by super-resolution networks, we design the NeRV block, illustrated in the corresponding figure. For NeRV, we adopt a combination of L1 and SSIM loss as our loss function for network optimization, which calculates the loss over all pixel locations of the predicted image and the ground-truth image.
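One way to write the combined L1 and SSIM objective is given below. It is a reconstruction under assumptions, not a quotation: it uses the number of frames $T$, the NeRV prediction $f(t)$, the ground-truth frame $v_t$, and a balancing hyper-parameter written here as $\alpha$, and it assumes the two terms are weighted by $\alpha$ and $1 - \alpha$.

$$
L = \frac{1}{T} \sum_{t=1}^{T} \Bigl[ \alpha \, \lVert f(t) - v_t \rVert_1 + (1 - \alpha) \bigl( 1 - \mathrm{SSIM}(f(t), v_t) \bigr) \Bigr]
$$

The SSIM term enters as $1 - \mathrm{SSIM}$ so that both components decrease as the prediction approaches the ground truth. Setting $\alpha = 1$ recovers a pure L1 loss and $\alpha = 0$ a pure SSIM loss, the two extremes that the ablation compares against the combination.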
