
Challenges of Human Pose Estimation in AI-Powered Fitness Apps


Key Takeaways

  • Human pose estimation is a popular solution that AI has to offer; it is used to determine the position and orientation of the human body given an image containing a person.
  • Some examples of applying pose estimation in fitness are Kaia, VAI Fitness Coach, Ally apps, or the Millie Fit device. 
  • Powered by computer vision and natural language processing algorithms, the technologies lead end-users through a number of workouts and give real-time feedback.
  • Learn about the failure cases in AI-based fitness apps with examples of powerlifting and squatting exercises.
  • This area can be divided into 2D and 3D pose estimation. While the acceptable level of accuracy in 2D pose estimation has been already reached, 3D pose estimation still requires more work until more accurate models are produced.

Fitness is a trend today. Every year the revenue of the fitness industry grows by 8.7%, according to the Wellness Creatives report, and fitness apps have not spared this field.

There are many cases of how technologies might help to improve your body - from tracking exercise activity to adjusting nutrition. The question is how much better can such apps help improve the performance of physical exercises compared to human coaches?

Artificial Intelligence (AI, a broad name for a group of advanced methods, tools, and algorithms for the automatic execution of various tasks) has invaded practically all functional areas of business over the years. Pose estimation is among the most popular solutions that AI has to offer; it is used to determine the position and orientation of the human body given an image containing a person. Unsurprisingly, such a useful tool has found many use cases: avatar animation in augmented reality, markerless motion capture, worker pose analysis, and many more.

With the arrival of human pose estimation technology, the fitness technology market has been filling up with AI-based personal trainer apps. Some examples of applying pose estimation in fitness are Kaia, VAI Fitness Coach, Ally apps, or the Millie Fit device. Being powered by computer vision, human pose estimation and natural language processing algorithms, these technologies lead end-users through a number of workouts and give real-time feedback.

How Human Pose Estimation Works in Fitness Apps

In order to understand whether modern fitness apps can really help to perform physical exercises properly, let’s review how the human pose estimation-based apps work.

At the core of any human pose estimation application lies a pose estimation algorithm that receives an image of a person as input and outputs the coordinates of specific keypoints or landmarks on the human body (XY coordinates in 2D pose estimation, XYZ coordinates in 3D pose estimation). Modern pose estimation algorithms are almost exclusively based on convolutional neural networks with the hourglass architecture or its variants (see the image below). Such a network consists of two major parts: a convolutional encoder that compresses the input image into a so-called latent representation, and a decoder that constructs N heatmaps from the latent representation, where N is the number of searched keypoints.

Figure 1. Hourglass network architecture (source)

A single heatmap, produced as an output, is a one-channel image with the same resolution as the input. Each pixel value expresses the probability, between 0 and 1, that the pixel contains the target keypoint.

Figure 2. Example of heatmaps produced by convnet for human pose estimation, left to right - input image, right shoulder, right elbow, right hand (source)

After the heatmaps are obtained, calculating the coordinates of the keypoints is as easy as finding the “centers of mass” of each area with high keypoint probability.
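This center-of-mass step can be sketched in a few lines, assuming a single-person heatmap with values in [0, 1]; the function name and synthetic input below are illustrative:

```python
import numpy as np

def keypoint_from_heatmap(heatmap):
    """Estimate a keypoint's (x, y) as the probability-weighted
    center of mass of its heatmap."""
    h, w = heatmap.shape
    ys, xs = np.mgrid[0:h, 0:w]
    total = heatmap.sum()
    if total == 0:
        return None  # keypoint not detected anywhere in the frame
    x = (xs * heatmap).sum() / total
    y = (ys * heatmap).sum() / total
    return x, y

# Synthetic heatmap with all probability mass at column 10, row 20:
hm = np.zeros((64, 64))
hm[20, 10] = 1.0
print(keypoint_from_heatmap(hm))  # -> (10.0, 20.0)
```

In practice the heatmap is usually thresholded first so that low-probability background pixels do not drag the estimate away from the true peak.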

However, matters can sometimes get complicated because the frame can contain more than one person. When this happens, it becomes difficult to understand which keypoints belong to which person, and thus additional post-processing is required. Such cases are called multi-person pose estimation and can be dealt with in a variety of ways. For example, one can use an object detection model (e.g., Single Shot Detector - SSD, Mask R-CNN, etc.) to detect a bounding box around each person and then run the pose estimation model within the detected box. Another approach is to use a fast greedy decoding algorithm that connects the closest keypoints based on the kinematic person graph. This method is used in the PoseNet model by Google.
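As a toy illustration of the top-down (detect-then-estimate) flavor of this post-processing, keypoints can be grouped by the person box that contains them. Real pipelines such as PoseNet's greedy decoder use kinematic graphs instead of this naive containment test, and all names here are illustrative:

```python
def assign_keypoints_to_people(boxes, keypoints):
    """Assign each detected keypoint to the first person bounding box
    that contains it. boxes: list of (x1, y1, x2, y2); keypoints:
    list of (x, y). Keypoints outside every box are dropped."""
    people = {i: [] for i in range(len(boxes))}
    for x, y in keypoints:
        for i, (x1, y1, x2, y2) in enumerate(boxes):
            if x1 <= x <= x2 and y1 <= y <= y2:
                people[i].append((x, y))
                break
    return people
```

A graph-based decoder would additionally enforce that, say, an elbow connects to a plausible shoulder, which handles overlapping people far better than bounding boxes alone.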

Usually, AI-based fitness apps are meant to be used on devices equipped with a camera that can record video at up to 720p and 60 fps, in order to capture more frames during exercise performance. The common algorithm of a human pose estimation-based app is the following:

  1. When the user starts to use the fitness app, the camera captures his/her movements during the exercise and records the video.
  2. The recorded video is split into individual frames that are processed by the human pose estimation model, which detects keypoints on the user’s body and forms a virtual “skeleton” in 2D or 3D.
  3. The virtual “skeleton” is analyzed through geometry-based rules or other means, and the mistakes in the exercise technique are pinpointed (if any).
  4. The user receives the description of mistakes made and recommendations on how to correct them.
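Step 3 above, the geometry-based analysis, often reduces to computing joint angles from the detected keypoints. A minimal sketch for a squat-depth check follows; the keypoint names and the 90-degree threshold are illustrative assumptions, not the rules of any particular app:

```python
import math

def joint_angle(a, b, c):
    """Angle at joint b (in degrees) formed by points a-b-c,
    e.g. hip-knee-ankle for a squat-depth rule."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cos = dot / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

def squat_is_deep_enough(hip, knee, ankle, threshold_deg=90.0):
    """Toy rule: the rep counts if the knee angle closes to the
    threshold or below at the bottom of the squat."""
    return joint_angle(hip, knee, ankle) <= threshold_deg
```

A production app would apply such rules per frame across the whole repetition and smooth the angles over time to suppress single-frame detection noise.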

Libraries and Tools for Human Pose Estimation

Since human pose estimation is a useful tool for creating all sorts of systems (recognition of suspicious activity in surveillance systems, interactive dancing assistants, augmented reality, and more), instruments that can accelerate the development of applications requiring pose estimation have naturally appeared. Let’s briefly discuss some of the available options.


OpenCV

While not directly providing human pose estimation models, OpenCV includes all the necessary modules. It allows creating an instance of a pre-trained pose estimation model (built in the Caffe deep learning framework), reading and pre-processing images (or even video streams), and rendering the output keypoints. See the code example here.


Pros: many tutorials on how to set up and run the models, useful built-in tools for image processing.


Cons: not many customization options available; most likely poor real-time performance.


OpenPose

The system is primarily written in C++ (Caffe framework) and capable of performing body, face, hand, and foot pose estimation (135 keypoints in total). The library supports keypoint detection in both 2D and 3D, as well as multi-person keypoint detection.

Pros: combined full-body keypoint detection, real-time operation.

Cons: non-commercial use only, real-time performance may be difficult to achieve on CPU (the system showed real-time performance when tested on a setup with Nvidia 1080 Ti).


PoseNet

PoseNet is a lightweight model developed by Google using the TensorFlow deep learning framework.

Pros: can be run from a browser (JS support) or a native Android or iOS app (using Tensorflow Lite), multi-person detection, fast inference (31 ms on Pixel 3, CPU).

Cons: moderate accuracy, may be insufficient for some of the pose analysis applications.


Detectron2

Detectron2 is an extensive object detection library from Facebook; its models are implemented in the PyTorch framework. Besides pose estimation, it includes models for a variety of tasks: object detection, instance segmentation, and panoptic segmentation.

Pros: a large number of models available, high accuracy.

Cons: no real-time performance, especially on the CPU.


AlphaPose

AlphaPose is an open-source library for pose estimation with Linux and Windows support, offering balanced accuracy and inference speed.

Pros: multi-person pose estimation, pose tracking, high accuracy (70 mAP on COCO dataset), fast - 23 fps on COCO validation set.

Cons: non-commercial use only, most likely will not perform in real time on CPU (authors tested the speed on TITAN XP).

Risks and Errors in Fitness Apps Based on Human Pose Estimation

It would be good to suppose that a single fitness app could cover all fitness-related questions, but unfortunately, such apps are still imperfect. In this article, I’d like to review some failure cases that may still be encountered in AI-based fitness apps. As an example, I took human pose estimation in squatting exercises, since they illustrate common form errors that can sometimes cause serious health issues.

Among other physical activities, powerlifting is one of the most popular directions in fitness, and squatting is the basic exercise performed by athletes. Although simple at first glance, this exercise is often performed incorrectly due to the heavy weight of the barbell, which is why many athletes need a personal trainer.

Now let’s suppose that a human personal trainer is replaced with an AI-based fitness app. Will it show the same exercise estimation capabilities as humans? Here are the possible errors that we discovered when investigating human pose estimation technology.

Men and Women Body Specifics

When training human pose estimation models for fitness apps, it is necessary to account for the physiological differences between men’s and women’s bodies. If the model was trained only on images of men, it will return accurate results for male users but not for female ones. Since the difference between men’s and women’s body postures during physical training is significant, the model may output incorrect results even if the exercise was performed correctly. Thus, it is important to consider this aspect when developing a fitness app based on human pose estimation.

Physiology Specifics

There are no human bodies with ideal proportions. All of us have disproportions, whether in the length of the legs, arms, or torso. It’s important to understand that the human pose estimation model will analyze the user’s body based on the images of people it was trained on; the model’s perception of the user’s body is shaped by the bodies present in the training images. There is no guarantee that the training dataset included images of people with a similar body structure.

With this in mind, let’s suppose that a user has unusual body features, such as a non-standard length of the legs or arms. When comparing this user’s exercise performance with the reference one, the accuracy of the result may be low even if the exercise was performed correctly.

Detection of the Exercise Start

Exercise estimation involves detecting the start and end of the exercise, so that the fitness app analyzes exactly the period during which the exercise is performed. In squatting with a barbell, the app can analyze the positions of the user's hands and shoulders using arbitrary hard-coded thresholds. An error may occur when the arm angles briefly go above such a threshold.
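One common remedy is to debounce the start condition rather than firing (or resetting) on a single frame. A sketch, under the assumption that we track one arm angle per frame; the threshold and window size are made-up values:

```python
def detect_exercise_start(angles, threshold=120.0, min_frames=5):
    """Declare the exercise started only after the tracked joint angle
    stays below the threshold for min_frames consecutive frames, so a
    brief spike above the threshold does not interrupt detection.
    Returns the index of the first frame of the stable run, or None."""
    run = 0
    for i, angle in enumerate(angles):
        run = run + 1 if angle < threshold else 0
        if run >= min_frames:
            return i - min_frames + 1
    return None
```

The same hysteresis idea applies to detecting the end of the exercise, with the comparison direction reversed.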

Figure 3. Error in detecting the exercise start - the condition randomly interrupts after the person has already begun the exercise. Source

Frontal View Error

The comparison of movements between two people is one more technique to estimate the correctness of exercise performance. The data are taken as frames from two video recordings: the reference video with the correct exercise technique and the input video of a user who uses a fitness app to analyze the exercise's performance. In order to compare these two videos, it is necessary to identify the positions of 3D keypoints in both videos, align them, and measure the distances between the user's joints and the joints of an athlete from the reference video.
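The align-and-measure step can be sketched as follows, assuming both poses come as (N, 3) arrays of 3D keypoints. This simplified normalization removes only translation and scale; a full Procrustes alignment would also solve for rotation:

```python
import numpy as np

def compare_poses(user_kpts, ref_kpts):
    """Align the user's 3D keypoints to the reference pose (translation
    and uniform scale) and return per-joint Euclidean distances.
    Both inputs are (N, 3) arrays in matching joint order."""
    u = user_kpts - user_kpts.mean(axis=0)  # remove translation
    r = ref_kpts - ref_kpts.mean(axis=0)
    u = u * (np.linalg.norm(r) / np.linalg.norm(u))  # match scale
    return np.linalg.norm(u - r, axis=1)
```

The per-joint distances can then be compared against tolerance bands to decide which body part deviates from the reference technique.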

The problem lies in unusual movements and the limitations of existing datasets. When the human pose estimation model processes frames with a strictly frontal view, the quality of the results may be low. This is because the datasets used for training human pose estimation models still do not contain enough images of such movements, poses, and perspectives.

Figure 4. Comparison of complex movement (video 1, video 2) using 3D pose estimation (rightmost view). Problem: poor quality leg pose prediction. Source 

Quick Movements of the Lower Body Part

Squatting is not the only exercise where AI-based fitness apps may return errors. Another example is kicking in martial arts. When a person performs a quick kick, the deep learning model might not catch the movement. The reason is the fast transition of the leg, which can be partially caused by motion blur around the leg keypoints.

Another reason is that the 2D keypoint dataset (COCO in our case) may not contain such limb positions. Thereby, the 3D predictions for the lower part of the body do not reflect the real movements, since the 2D detections used as input for 3D pose prediction were incorrect.

Figure 5. Poor 2D detections (orange dots in the original video; the model fails to properly detect right leg keypoints during the fast movement) that lead to poor 3D detections. Source 

Horizontal Position

Push-ups may also be a challenge for a human pose estimation model. When detecting 2D keypoints of the arms and legs in a video of an athlete performing push-ups, the model returns a significant number of errors. We decided to check whether it would work properly if we rotated the video so that the athlete's movements could be analyzed in a vertical position, and it worked well. This issue once again confirmed our assumption about the insufficiency of visual data in the open datasets.

Figure 6. Poor detections when a person is presented in an unusual horizontal pose (left) and improvement of detections when the pose is changed to a more common vertical (right). Source

Occluded Joints

There are exercises where body parts are occluded by other objects or body parts. An example of such an exercise is snatching. When the barbell overshadows the position of the hands, the 2D detector places keypoints in the wrong location, and thereby the output of the 3D pose estimation model is incorrect.

Figure 7. Detection error when joints are occluded (left hand is occluded by barbell weights in the upper part of the amplitude) Source

The Lack of Accuracy

Some pose estimation algorithms that perform 2D detection are not accurate enough for fitness apps, especially when the app is expected to be used on mobile devices. For example, when using the PoseNet model, which aims to estimate human poses in real time, we can see that the detections cannot keep up with the human movements.

Figure 8. Low accuracy of pose estimation as a trade-off for inference speed when using a lightweight model Source

Since PoseNet is a lightweight model designed to be run from the browser, it may output poor results when used in a fitness app on a mobile device.

Please note that the challenges described above are just examples of the issues that might be faced when developing fitness apps based on human pose estimation. We overcame most of these challenges while researching and experimenting with 3D human pose estimation for AI fitness coach apps. The human pose estimation development strategy in a fitness app depends on many factors; the first and foremost are the business requirements and expected results, as corny as that sounds.

Future Perspectives in Human Pose Estimation

Considering the points discussed above, it would be helpful to get a bigger picture of what’s going on in the pose estimation area and what to expect in the coming years.

Overall, the area can be subdivided into 2D and 3D pose estimation. While an acceptable level of accuracy in 2D pose estimation has already been reached, 3D pose estimation still requires more work until more accurate models are produced, especially for inference from a single image with no depth information (some methods use multiple cameras pointed at the person, or information from depth sensors, to achieve better predictions).

Part of the 3D pose estimation issues lies in the lack of large annotated datasets of people in open environments (for example, Human3.6M, one of the largest datasets for 3D pose estimation, was captured completely indoors). There is an ongoing effort to produce new datasets that include more diverse data in terms of environmental conditions, clothing variety, strong articulations, etc. Hopefully, this will make it possible to achieve better pose estimation performance in such conditions in the future.

Another research direction is concerned with creating models that are both accurate and fast, so that pose estimation models (both 2D and 3D) can run on edge devices despite their hardware restrictions. This is done by exploring more efficient architectures (e.g., MobileNet, DenseNet) as well as techniques that reduce model size and increase speed by removing redundant parameters (network pruning).

All in all, the efforts made to push the field further give strong evidence that there will be more robust models in the future, capable of dealing with 3D pose estimation in the wild and running on a broader range of edge devices.

About the Author

Maksym Tatariants is a Data Science Engineer at MobiDev. Having a background in Environmental and Mechanical Engineering, Materials Science, and Chemistry, he is keen on gaining new insights and experience in the Data Science and Machine Learning sectors. He is particularly interested in Deep Learning-based technologies and their application to business use cases.

