3D action recognition has recently received growing attention from the research and industrial communities, thanks to the popularity of depth sensors and the efficiency of skeleton estimation algorithms. Accordingly, a large number of methods have been studied, using either handcrafted features with traditional classifiers or recurrent neural networks. However, these methods cannot fully learn the high-level spatial and temporal features of a whole skeleton sequence. In this paper, we propose a novel encoding technique that transforms the pose features of joint-joint distance and joint-joint orientation into color pixels. By concatenating the features of all frames in a sequence, the spatial joint correlations and temporal pose dynamics of the action appearance are depicted in a single color image. To learn action models, we fine-tune a pre-trained deep convolutional neural network end to end, so that it captures multiple high-level features at multiple scales of the action representation. The proposed method achieves state-of-the-art performance on NTU RGB+D, the largest and most challenging 3D action recognition dataset, under both the cross-subject and cross-view evaluation protocols.
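
The encoding idea can be illustrated with a minimal sketch. It is not the exact formulation used in this work: the function name `encode_sequence`, the use of all unordered joint pairs, the min-max scaling of distances, the mapping of unit-vector components to RGB, and the row/column layout are all assumptions made for illustration.

```python
import numpy as np


def encode_sequence(joints):
    """Encode a skeleton sequence as a color image (illustrative sketch).

    joints: array of shape (T, J, 3) with T frames and J 3D joints.
    Returns a uint8 image of shape (2 * P, T, 3), where P = J * (J - 1) / 2
    is the number of unordered joint pairs. Each column is one frame.
    """
    joints = np.asarray(joints, dtype=np.float64)
    num_joints = joints.shape[1]

    # Index every unordered joint pair (a, b) with a < b.
    a, b = np.triu_indices(num_joints, k=1)

    # Joint-joint difference vectors and Euclidean distances per frame.
    diff = joints[:, b, :] - joints[:, a, :]      # (T, P, 3)
    dist = np.linalg.norm(diff, axis=-1)          # (T, P)

    # Joint-joint orientation as a unit vector (guarded against zero length).
    unit = diff / np.maximum(dist, 1e-8)[..., None]

    # Assumed color mapping for orientation: components in [-1, 1] -> [0, 255].
    orient_img = np.rint((unit + 1.0) * 127.5).astype(np.uint8)

    # Assumed color mapping for distance: min-max scaling over the whole
    # sequence, replicated across the three channels (gray pixels).
    span = dist.max() - dist.min()
    norm = (dist - dist.min()) / span if span > 0 else np.zeros_like(dist)
    dist_img = np.repeat(np.rint(norm * 255.0)[..., None], 3, axis=-1)
    dist_img = dist_img.astype(np.uint8)

    # Stack distance rows above orientation rows, then put frames on columns.
    image = np.concatenate([dist_img, orient_img], axis=1)   # (T, 2P, 3)
    return image.transpose(1, 0, 2)                          # (2P, T, 3)
```

For a sequence of 10 frames with 25 joints, this sketch yields a 600 x 10 color image; such an image could then be resized to the input resolution of a pre-trained convolutional network before fine-tuning.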