M.Sc. István Sárándi
Room 129
Phone: +49 241 80 20 769
Fax: +49 241 80 22 731
Email: sarandi@vision.rwth-aachen.de

My research interests lie in the automated visual analysis of humans, especially for applications such as human-robot interaction and collaborative robotics. I am currently focusing on estimating articulated 3D body pose using deep learning methods.

Supervised Students

  • Markus Knoche (student assistant, master thesis on "Volumetric Feature Transformation for Pose-Conditioned Human Image Synthesis")
  • Yinglun Liu (student assistant)



Metric-Scale Truncation-Robust Heatmaps for 3D Human Pose Estimation

István Sárándi, Timm Linder, Kai O. Arras, Bastian Leibe
IEEE International Conference on Automatic Face and Gesture Recognition (FG) 2020, to appear

Heatmap representations have formed the basis of 2D human pose estimation systems for many years, but their generalizations for 3D pose have only recently been considered. This includes 2.5D volumetric heatmaps, whose X and Y axes correspond to image space and the Z axis to metric depth around the subject. To obtain metric-scale predictions, these methods must include a separate, explicit post-processing step to resolve scale ambiguity. Further, they cannot encode body joint positions outside of the image boundaries, leading to incomplete pose estimates in case of image truncation. We address these limitations by proposing metric-scale truncation-robust (MeTRo) volumetric heatmaps, whose dimensions are defined in metric 3D space near the subject, instead of being aligned with image space. We train a fully-convolutional network to estimate such heatmaps from monocular RGB in an end-to-end manner. This reinterpretation of the heatmap dimensions allows us to estimate complete metric-scale poses without test-time knowledge of the focal length or person distance and without relying on anthropometric heuristics in post-processing. Furthermore, as the image space is decoupled from the heatmap space, the network can learn to reason about joints beyond the image boundary. Using ResNet-50 without any additional learned layers, we obtain state-of-the-art results on the Human3.6M and MPI-INF-3DHP benchmarks. As our method is simple and fast, it can become a useful component for real-time top-down multi-person pose estimation systems. We make our code publicly available to facilitate further research.

» Show BibTeX

title={Metric-Scale Truncation-Robust Heatmaps for 3{D} Human Pose Estimation},
author={S\'ar\'andi, Istv\'an and Linder, Timm and Arras, Kai O. and Leibe, Bastian},
booktitle={Automatic Face and Gesture Recognition, 2020 IEEE Int. Conf. on},
note={in press}

Reposing Humans by Warping 3D Features

Markus Knoche, István Sárándi, Bastian Leibe
Workshop on Towards Human-Centric Image/Video Synthesis -- Conference on Computer Vision and Pattern Recognition (CVPRW'20)

We address the problem of reposing an image of a human into any desired novel pose. This conditional image-generation task requires reasoning about the 3D structure of the human, including self-occluded body parts. Most prior works are either based on 2D representations or require fitting and manipulating an explicit 3D body mesh. Based on the recent success in deep learning-based volumetric representations, we propose to implicitly learn a dense feature volume from human images, which lends itself to simple and intuitive manipulation through explicit geometric warping. Once the latent feature volume is warped according to the desired pose change, the volume is mapped back to RGB space by a convolutional decoder. Our state-of-the-art results on the DeepFashion and the iPER benchmarks indicate that dense volumetric human representations are worth investigating in more detail.

» Show BibTeX

author = {Markus Knoche and Istv\'an S\'ar\'andi and Bastian Leibe},
title = {Reposing Humans by Warping 3{D} Features},
booktitle = {CVPR Workshop on Towards Human-Centric Image/Video Synthesis},
year = {2020}

Visual Person Understanding through Multi-Task and Multi-Dataset Learning

Kilian Pfeiffer, Alexander Hermans, István Sárándi, Mark Weber, Bastian Leibe
German Conference on Pattern Recognition (GCPR'19)

We address the problem of learning a single model for person re-identification, attribute classification, body part segmentation, and pose estimation. With predictions for these tasks we gain a more holistic understanding of persons, which is valuable for many applications. This is a classical multi-task learning problem. However, no dataset exists that these tasks could be jointly learned from. Hence several datasets need to be combined during training, which in other contexts has often led to reduced performance in the past. We extensively evaluate how the different task and datasets influence each other and how different degrees of parameter sharing between the tasks affect performance. Our final model matches or outperforms its single-task counterparts without creating significant computational overhead, rendering it highly interesting for resource-constrained scenarios such as mobile robotics.

» Show BibTeX

title = {Visual Person Understanding Through Multi-task and Multi-dataset Learning},
author = {Kilian Pfeiffer and Alexander Hermans and Istv\'{a}n S\'{a}r\'{a}ndi and Mark Weber and Bastian Leibe},
booktitle = {German Conference on Pattern Recognition},
pages = {551--566}
date = {2019},

How Robust is 3D Human Pose Estimation to Occlusion?

István Sárándi, Timm Linder, Kai O. Arras, Bastian Leibe
Workshop on Robotic Co-workers 4.0 -- IEEE/RSJ Int. Conference on Intelligent Robots and Systems (IROSW'18)

Occlusion is commonplace in realistic human-robot shared environments, yet its effects are not considered in standard 3D human pose estimation benchmarks. This leaves the question open: how robust are state-of-the-art 3D pose estimation methods against partial occlusions? We study several types of synthetic occlusions over the Human3.6M dataset and find a method with state-of-the-art benchmark performance to be sensitive even to low amounts of occlusion. Addressing this issue is key to progress in applications such as collaborative and service robotics. We take a first step in this direction by improving occlusion-robustness through training data augmentation with synthetic occlusions. This also turns out to be an effective regularizer that is beneficial even for non-occluded test cases.

» Show BibTeX

title={How Robust is 3D Human Pose Estimation to Occlusion?},
author={S{\'a}r{\'a}ndi, Istv{\'a}n and Linder, Timm and Arras, Kai O and Leibe, Bastian},
booktitle={IROS Workshop - Robotic Co-workers 4.0},

Synthetic Occlusion Augmentation with Volumetric Heatmaps for the 2018 ECCV PoseTrack Challenge on 3D Human Pose Estimation

István Sárándi, Timm Linder, Kai O. Arras, Bastian Leibe
Extended abstract for the ECCV PoseTrack Workshop 2018

In this paper we present our winning entry at the 2018 ECCV PoseTrack Challenge on 3D human pose estimation. Using a fully-convolutional backbone architecture, we obtain volumetric heatmaps per body joint, which we convert to coordinates using soft-argmax. Absolute person center depth is estimated by a 1D heatmap prediction head. The coordinates are back-projected to 3D camera space, where we minimize the L1 loss. Key to our good results is the training data augmentation with randomly placed occluders from the Pascal VOC dataset. In addition to reaching first place in the Challenge, our method also surpasses the state-of-the-art on the full Human3.6M benchmark when considering methods that use no extra pose datasets in training. Code for applying synthetic occlusions is availabe at https://github.com/isarandi/synthetic-occlusion.

» Show Videos
» Show BibTeX

author = {S{\'a}r{\'a}ndi, Istv{\'a}n and Linder, Timm and Arras, Kai O and Leibe, Bastian},
title = {Synthetic Occlusion Augmentation with Volumetric Heatmaps for the 2018 {E}{C}{C}{V} {P}ose{T}rack Challenge on 3{D} Human Pose Estimation},
year = {2018}

Pedestrian Line Counting by Probabilistic Combination of Flow and Appearance Information

István Sárándi
Master Thesis

In this thesis we examine the task of estimating how many pedestrians cross a given line in a surveillance video, in the presence of high occlusion and dense crowds. We show that a prior, blob-based pedestrian line counting method fails on our newly annotated private dataset, which is more challenging than those used in the literature.

We propose a new spatiotemporal slice-based method that works with simple low-level features based on optical flow, background subtraction and edge detection and show that it produces good results on the new dataset. Furthermore, presumably due to the very simple and general nature of the features we use, the method also performs well on the popular UCSD vidd dataset without additional hyperparameter tuning, showing the robustness of our approach.

We design new evaluation measures that generalize the precision and recall used in information retrieval and binary classification to continuous, instantaneous pedestrian flow estimations and we argue that they are better suited to this task than currently used measures.

We also consider the relations between pedestrian region counting and line counting by comparing the output of a region counting method with the counts that we derive from line counting. Finally we show a negative result, where a probabilistic method for combining line and region counter outputs does not lead to the hoped result of mutually improved counters.

Quantitative Conjunctival Provocation Test for Controlled Clinical Trials

István Sárándi, Dan Philipp Claßen, Anatoli Astvatsatourov, Oliver Pfaar, Ludger Klimek, Ralph Mösges, Thomas M. Deserno
Methods of Information in Medicine

The conjunctival provocation test (CPT) is a diagnostic procedure for the assessment of allergic diseases. Photographs are taken before and after provocation increasing the redness of the conjunctiva due to hyperemia. We propose and evaluate an automatic image processing pipeline for objective and quantitative CPT. After scale normalization based on intrinsic image features, the conjunctiva region of interest (ROI) is segmented combining thresholding, edge detection and Hough transform. Redness of the ROI is measured from 0 to 1 by the average pixel redness, which is defined by truncated projection in HSV space. In total, 92 images from an observational diagnostic study are processed for evaluation. The database contains images from two visits for assessment of the test-retest reliability (46 images per visit). All images were successfully processed by the algorithm. The relative redness increment correlates between the two visits with Pearson's r=0.672 (p<.001). Linear correlation of the automatic measure is larger than the manual measure (r=0.59). This indicates a higher reproducibility and stability of the automatic method. We presented a robust and effective way to objectify CPT. The algorithm operates on low resolution, is fast and requires no manual input. Quantitative CPT measures can now be established as surrogate endpoint in controlled clinical trials.

» Show BibTeX

title={Quantitative conjunctival provocation test for controlled clinical trials},
author={S\'ar\'andi, I and Cla{\ss}en, DP and Astvatsatourov, A and Pfaar, O and Klimek, L and M{\"o}sges, R and Deserno, TM},
journal={Methods of information in medicine},
publisher={Schattauer GmbH}

Towards Quantitative Assessment of Calciphylaxis

Thomas M. Deserno, István Sárándi, Abin Jose, Daniel Haak, Stephan Jonas, Paula Specht, Vincent Brandenburg
SPIE Medical Imaging 2014

Calciphylaxis is a rare disease that has devastating conditions associated with high morbidity and mortality. Calciphylaxis is characterized by systemic medial calcification of the arteries yielding necrotic skin ulcerations. In this paper, we aim at supporting the installation of multi-center registries for calciphylaxis, which includes a photographic documentation of skin necrosis. However, photographs acquired in different centers under different conditions using different equipment and photographers cannot be compared quantitatively. For normalization, we use a simple color pad that is placed into the field of view, segmented from the image, and its color fields are analyzed. In total, 24 colors are printed on that scale. A least-squares approach is used to determine the affine color transform. Furthermore, the card allows scale normalization. We provide a case study for qualitative assessment. In addition, the method is evaluated quantitatively using 10 images of two sets of different captures of the same necrosis. The variability of quantitative measurements based on free hand photography is assessed regarding geometric and color distortions before and after our simple calibration procedure. Using automated image processing, the standard deviation of measurements is significantly reduced. The coefficients of variations yield 5-20% and 2-10% for geometry and color, respectively. Hence, quantitative assessment of calciphylaxis becomes practicable and will impact a better understanding of this rare but fatal disease.

Disclaimer Home Visual Computing institute RWTH Aachen University