Skeleton-based human action recognition has recently attracted growing interest from industry and the research community for many practical applications, thanks to the popularity of depth sensors. Many conventional approaches, which combine handcrafted features with traditional classifiers, cannot learn the high-level spatiotemporal features needed to precisely recognize complex human actions. In this paper, we introduce a novel encoding technique, namely Pose-Transition Feature to Image (PoT2I), to transform skeleton information into an image-based representation for deep convolutional neural networks (CNNs). The spatial joint correlations and temporal pose dynamics of an action are comprehensively depicted by an encoded color image. To learn action models, we fine-tune a pre-trained network end-to-end to capture multiple high-level features from a multi-scale action representation. The proposed method is benchmarked on several challenging 3D action recognition datasets (e.g., UTKinect-Action3D, SBU-Kinect Interaction, and NTU RGB+D) with different parameter configurations for performance analysis. Experimental results, with a highest accuracy of 90.33% on the most challenging dataset, NTU RGB+D, demonstrate that our action recognition method with PoT2I outperforms state-of-the-art approaches.
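To make the idea of an image-based skeleton representation concrete, the following is a minimal sketch of a generic skeleton-to-image encoding. It is not the PoT2I encoding, whose pose and transition features are not specified in this abstract. The function name `skeleton_to_image`, the layout (joints as rows, frames as columns), and the per-axis min-max scaling of x, y, z into the R, G, B channels are all assumptions chosen for illustration.

```python
from typing import List, Sequence

# One frame is a sequence of joints; each joint is an (x, y, z) triple.
Frame = Sequence[Sequence[float]]


def skeleton_to_image(frames: Sequence[Frame]) -> List[List[List[int]]]:
    """Encode a skeleton sequence as a color image (illustrative only).

    Returns a nested list indexed as image[joint][frame][channel], where the
    x, y, z coordinates are min-max scaled per axis into 0..255 and stored
    in the R, G, B channels. This layout is a hypothetical stand-in for an
    encoding such as PoT2I, not a reproduction of it.
    """
    if not frames or not frames[0]:
        raise ValueError("expected at least one frame with at least one joint")

    num_joints = len(frames[0])

    # Per-axis minimum and maximum over every joint in every frame.
    lo = [min(joint[a] for frame in frames for joint in frame) for a in range(3)]
    hi = [max(joint[a] for frame in frames for joint in frame) for a in range(3)]

    def scale(value: float, axis: int) -> int:
        span = hi[axis] - lo[axis]
        if span == 0:
            return 0  # constant axis: map everything to 0
        return round(255 * (value - lo[axis]) / span)

    # Rows are joints (spatial axis), columns are frames (temporal axis).
    return [
        [[scale(frames[t][j][a], a) for a in range(3)] for t in range(len(frames))]
        for j in range(num_joints)
    ]
```

Under these assumptions, a sequence of T frames with J joints yields a J x T color image, which could then be resized to the input resolution of a pre-trained CNN before fine-tuning.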