Dr. Carolyn Matl, Research Scientist at Toyota Research Institute, explains why Interactive Perception and soft tactile sensors are critical for manipulating challenging objects such as liquids, grains, and dough. She also dives into “StRETcH,” a Soft to Resistive Elastic Tactile Hand: a variable-stiffness soft tactile end-effector presented by her research group.
Carolyn Matl is a research scientist at the Toyota Research Institute, where she works on robotic perception and manipulation with the Mobile Manipulation Team. She received her B.S.E. in Electrical Engineering from Princeton University in 2016, and her Ph.D. in Electrical Engineering and Computer Sciences from the University of California, Berkeley in 2021. At Berkeley, she was awarded the NSF Graduate Research Fellowship and was advised by Ruzena Bajcsy. Her dissertation work focused on developing and leveraging non-traditional sensors for robotic manipulation of complicated objects and substances like liquids and doughs.
Episode 350 – Improving Perception through Interaction
Shihan Lu: Hi, Dr. Matl, welcome to Robohub. Would you mind introducing yourself?
Carolyn Matl: All right. Uh, so hello. Thank you so much for having me on the podcast. I’m Carolyn Matl, and I’m a research scientist at the Toyota Research Institute, where I work with a really great group of people on the Mobile Manipulation Team on fun and challenging robotic perception and manipulation problems.
I recently graduated from right up the road, from UC Berkeley, where I was advised by the wonderful Ruzena Bajcsy and where, for my dissertation, I worked on interactive perception for robotic manipulation of different materials, like liquids, grains, and doughs.
Shihan Lu: So what is interactive perception?
Carolyn Matl: So in a nutshell, interactive perception is exactly what it sounds like.
It’s perception that requires physical interaction with the surrounding environment, whether this is purposeful or not. This interaction is what ultimately changes the state of the environment, which then allows the actor (this could be the robot or a human) to infer something about the environment that otherwise would not have been observed.
As humans, you know, we use interactive perception all the time to learn about the world around us. In fact, you might be familiar with the work of E.J. and J.J. Gibson, who studied in depth how humans use physical interaction to obtain more accurate representations of objects. So take as an example when you’re at the grocery store, uh, and you’re picking out some things.
You might lightly press an orange, as an example, to see if it’s overripe or, and I don’t know if this is scientifically proven, but you might even knock on a watermelon to then listen to the resulting vibrations, which some people tell me allows you to judge whether it’s a juicy one or not. So, yeah, people use interactive perception all the time to learn about the world.
And in robotics, we would like to equip robots with similar perceptual intelligence.
Shihan Lu: Okay. So using interactive perception means applying, uh, an active action to the object, checking the corresponding feedback, and then using this process to better understand the object’s state. So, how is this helpful for manipulation tasks?
Carolyn Matl: So when we think of traditional perception for robots, often, what comes to mind is pure computer vision, where the robot is essentially this floating head moving around in the world and collecting visual information. And to be clear, the amazing developments in computer vision have enabled robotics as a field, uh, to make tremendous advancements.
And you see this with the success in areas ranging from automated car navigation all the way to bin-picking robots. These robotic systems are able to capture a rich representation of the state of the world through images alone, often without any interaction. But as we all know, robots not only sense, they also act on the world, and through this interaction they can observe important physical properties of objects or the environment that would otherwise not be perceived.
So for example, circling back to the fruits, staring at an orange or statically weighing a watermelon will not necessarily tell you how ripe it is. Instead, robots can take advantage of the fact that they’re not just floating heads in space and use their actuators to prod and press the fruit. And so, quoting a review article on interactive perception by Jeannette Bohg, who was on this podcast, along with many others in this field: this interaction creates a novel sensory signal that would otherwise not be present.
So as an example, these signals could be the way the fruit deforms under the applied pressure or the sounds that the watermelon makes when the robot knocks on its rind. The same review article also points out an additional advantage of interactive perception, which is that the interpretation of the novel signal consequently becomes simpler and more robust.
So, for example, it’s much simpler and more robust to find a correspondence between the measured stiffness of a fruit and its ripeness than to simply predict ripeness from the color of the fruit. The action of pressing the fruit and the resulting signals from that action directly relate to the material property the robot is interested in observing.
Whereas when no action is taken, the relationship between the observation and inference might be less causal. So I do believe that interactive perception is fundamental for robots to tackle challenging manipulation tasks, especially for the manipulation of deformable objects or complex materials.
Whether the robot is trying to directly infer a physical parameter, like the coefficient of friction, or to learn a dynamics function to represent a deformable object, interacting with the object is what ultimately allows the robot to observe parameters that are relevant to the dynamics of that object, thereby helping the robot attain a more accurate representation or model of the object.
This subsequently helps the robot predict the causal effects of different interactions with an object, which then allows the robot to plan a complex manipulation behavior.
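To make the fruit-pressing example concrete, here is a minimal sketch, not taken from the research discussed in this interview, of how a press could be turned into a stiffness estimate. It assumes the object behaves like a linear spring over small indentations; the function name `estimate_stiffness` and the synthetic readings are hypothetical.

```python
import numpy as np

def estimate_stiffness(displacement_m, force_n):
    """Least-squares slope of force vs. displacement (N/m).

    Assumes a linear-spring response, which is only a rough
    approximation for real fruit or dough.
    """
    d = np.asarray(displacement_m, dtype=float)
    f = np.asarray(force_n, dtype=float)
    slope, _intercept = np.polyfit(d, f, 1)
    return slope

# Hypothetical press data: 0 to 4 mm of indentation on two fruits.
depth = np.array([0.000, 0.001, 0.002, 0.003, 0.004])
firm_force = 2000.0 * depth   # synthetic response of a firm fruit
soft_force = 500.0 * depth    # synthetic response of a softer fruit

k_firm = estimate_stiffness(depth, firm_force)
k_soft = estimate_stiffness(depth, soft_force)
print(f"firm: {k_firm:.0f} N/m, soft: {k_soft:.0f} N/m")
```

The point of the sketch is that the action (pressing) produces a signal whose slope is the physical quantity of interest, so no learned mapping from appearance is needed.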
I’d also like to add that an interactive perception framework gives us an opportunity to take advantage of multimodal active sensing.
So aside from vision, other sensory modalities are inherently tied to interaction, or I should say, many of these nontraditional sensors rely on signals that result from forceful interaction with the environment. So for instance, I think sound is quite underexplored within robotics as a useful type of signal. Sound can cue a robot into what sort of granular surface it’s walking on, or it could help a robot confirm a successful assembly task by listening for a click as one part is attached to the other.
Um, Jivko Sinapov, who you interviewed on Robohub, uh, used different exploratory procedures and the resulting sound to classify different types of objects in containers. I should also mention that I noticed one of your own papers with Heather Culbertson, right? Uh, involving modeling the sounds from tool-to-surface interactions, which are indicative of surface texture properties. Right?
Shihan Lu: Yes, and that goes in the opposite direction: we’re trying to model the sound, and here you utilize the sounds in the task. It’s like the two directions of the research.
Carolyn Matl: Yeah, but what’s so interesting is that what they share is that, ultimately, the sound is created through interaction, right? Sound is directly related to event-driven activity, and it signals changes in the state of the environment, in particular when things make and break contact, or in other words, when things interact with each other.
Other modalities that I found to be quite useful in my own research are force and tactile sensing. Like, the amount of force or tactile information you obtain from dynamically interacting with an object is so much richer than if you were to just statically hold it in place. And we can get into this a little bit later, but basically, designing a new tactile sensor that could be used actively allowed us to target the problem of dough manipulation, which I would consider a pretty challenging manipulation task.
So yes, I do believe that interactive perception fundamentally is a benefit to robots for tackling challenging manipulation tasks.
Shihan Lu: Great. And, lastly, you mentioned that you’re trying to use this interactive perception to help with a dough rolling task. Your very recent work is StRETcH, the “Soft to Resistive Elastic Tactile Hand,” which is a soft tactile sensor you designed specifically for this type of task.
Do you still remember where you got the first inspiration for designing a soft tactile sensor for the purpose of dough rolling?
Carolyn Matl: So I think I would say that in general, in my opinion as a roboticist, I like to first find a real-world challenge for an application I want to tackle, define a problem statement to solidify what I’d like to accomplish, and then brainstorm ways to approach that problem.
And so this is the usual approach that I like to take. A lot of my inspiration does come straight from the application space. So for instance, I love to cook, so I often find myself thinking about liquids and doughs, and even as a person, manipulating these types of materials takes quite a bit of dexterity.
And we might not think about it on, on the daily, but even preparing a bowl of cereal requires a fair bit of interactive perception. So a lot of the observations from daily life served as inspiration for my PhD work. On top of that, I thought a lot about the same problems, but from the robot’s perspective.
Why is this task easy for me to do and difficult for the robot? Where do the current limitations lie in robotics that prevent us from having a robot that can handle all of these different unstructured materials? And every single time I ask that question, I find myself revisiting the limitations within robotic perception and what makes it so challenging.
So, yeah, I would say that in general, I take a more applications-forward approach. But sometimes, you know, you might design a cool sensor or algorithm for one application and then realize that it could be really useful for another application. So for example, the first tactile sensor I designed was joint work with Ben McInroe on a project headed by Ron Fearing at Berkeley.
And our goal was to design a soft tactile sensor/actuator that could vary in stiffness. The application space or motivation behind this goal was that soft robots are safe to use in environments that have, for instance, humans or delicate objects, since they’re compliant and can conform to their surrounding environment.
However, they’re difficult to equip with perception capabilities and can be quite limited in their force output, unless they can vary their stiffness. So with that application in mind, we designed a variable-stiffness soft tactile sensor that was pneumatically actuated, which we called SOFTcell. And what was so fun about SOFTcell was being able to study the sensing and actuation duality, which was a capability I hadn’t seen in many other sensors before. SOFTcell could reactively change its own properties in response to what it was sensing in order to exert force on the world.
Seeing these capabilities come to life made me realize that similar technology could be really useful for dough manipulation, which involves a lot of reactive adjustments based on touch. And that’s sort of what inspired the idea of the “Soft to Resistive Elastic Tactile Hand,” or StRETcH.
So in a way, here is a case where the creation of one sensor inspired me to pursue another application space.
Shihan Lu: Gotcha. Would you introduce the basic category of your StRETcH sensor, the soft tactile sensor designed for dough rolling? Uh, which class does it belong to?
Carolyn Matl: Yeah.
So in a general sense, the “Soft to Resistive Elastic Tactile Hand” is a soft tactile sensor. There’s a wide variety of soft sensing technology out there, and they all have their advantages in particular areas. And as roboticists, part of our job is knowing the trade-offs and figuring out which design makes sense for our particular application.
So I can briefly go over maybe some of these types of sensors and how we reached the conclusion of the design for StRETcH.
So for instance, one practical solution I’ve seen is to embed a grid of conductive wires or elastomers into the deformable material, but this then limits the maximum amount of strain the soft material can now undergo, right?
Because now that’s defined by the more rigid conductive material. So to address this, scientists have been developing really neat solutions like conductive hydrogels, but then if you go down that hard material science route, it might become quite complicated to actually manufacture the sensor.
And then it wouldn’t be so practical to test in a robotics setting. Then there are a few soft tactile sensors you can actually purchase, like, for instance, the BioTac sensor, which is basically the size of a human finger and is composed of a conductive fluidic layer inside of a rubbery skin. So that saves you the trouble of making your own sensor, but it’s also quite expensive and the raw signals are difficult to interpret, unless you take a deep learning approach, like Yashraj Narang et al. from NVIDIA’s Seattle Robotics Lab. But soft tactile sensors don’t need to be so complex. They can be as simple as a pressure sensor in a pneumatically actuated finger. A creative way I’ve seen pressure sensors used in soft robots is from Hannah Stuart’s lab at UC Berkeley, where they measured suction flow as a form of underwater tactile sensing.
And finally, you may have seen these become more popular in recent years, but there are also optically based soft tactile sensors. And what I mean by optically based is that these sensors have a soft interface that interacts with objects or the environment, and a photodiode or camera inside the sensor is used to image the deformations experienced by the soft skin.
And from those image deformations, you can infer things like forces, shear, object geometry, and even sometimes if you have a high resolution sensor, you can image the texture of the object.
So some examples of this type of sensor include the OptoForce sensor, the GelSight from MIT, the Soft Bubble from Toyota Research, the TacTip from Bristol Robotics Lab, and finally StRETcH: a Soft to Resistive Elastic Tactile Hand. And what’s nice about this sort of design is that it allows us to decouple the soft skin and the sensing mechanism. So the sensing mechanism doesn’t impose any constraints on the skin’s maximum strain.
And at the same time, if the deformations are imaged by a camera, this gives the robot spatially rich tactile information. So, yeah, ultimately we chose this design for our own soft tactile sensor, since hardware-wise, uh, this sort of design presented a nice balance between complexity and functionality.
Shihan Lu: Your StRETcH sensor is also in the optically based tactile sensor category. During the data collection process, what specific techniques are you using to process this very new, very different data type?
Carolyn Matl: So in general, I tend to lean on the side of using as much knowledge or structure as you can derive from physics or known models before diving completely into, let’s say, an end-to-end latent feature space approach.
Um, I have to say deep learning has taken off within the vision community in part because computer vision scientists spent a great deal of time studying foundational topics like projective 3D reconstruction, optical flow, and how filters work, like the Sobel filter for edge detection and SIFT features for object recognition.
And all of that science and engineering effort laid out a great foundation for all the amazing recent advancements that use deep learning in computer vision. So studying classical computer vision techniques of feature design and filters gives great intuition for interpreting inner layers or designing networks for end-to-end learning, and also great intuition for evaluating the quality of that data.
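As a small illustration of the kind of classical building block mentioned above, the following sketch applies the standard 3x3 Sobel kernels to a synthetic image using plain NumPy. The helper names `filter3x3` and `sobel_magnitude` and the test image are hypothetical and only for illustration.

```python
import numpy as np

# Standard Sobel kernels for horizontal and vertical intensity changes.
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter3x3(image, kernel):
    """Valid-mode 3x3 cross-correlation written with NumPy slicing."""
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * image[i:i + h - 2, j:j + w - 2]
    return out

def sobel_magnitude(image):
    """Gradient magnitude from the two Sobel responses."""
    gx = filter3x3(image, SOBEL_X)
    gy = filter3x3(image, SOBEL_Y)
    return np.hypot(gx, gy)

# Synthetic image: dark left half, bright right half (one vertical edge).
img = np.zeros((6, 6))
img[:, 3:] = 1.0
edges = sobel_magnitude(img)
print(edges)
```

The response is nonzero only in the output columns whose 3x3 window straddles the step, which is the hand-designed analogue of what an early convolutional layer often learns.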
Now, for these new types of data that we’re acquiring with these new sensors, I think similar important work needs to be done, and is being done, before we can leap into completely end-to-end approaches or features.
So especially if this data is collected within an interactive perception framework, there’s usually a clear causal relationship between the action the robot takes, the signal or data that is observed, and the thing that is being inferred.
So why not use existing physical models or physically relevant features to interpret a signal? Especially if you know what caused that signal in the first place. Right? And that’s part of why I believe interactive perception is such a beautiful framework since the robot can actively change the state of an object or the environment to intentionally induce signals that can be physically interpreted.
Now, I don’t think there’s anything wrong with using deep learning approaches to interpret these new data types if you’re using it as a tool to learn a complex dynamics model that’s still grounded in physics. So I can give an example. I mentioned earlier that Yashraj S. Narang et al. from NVIDIA worked with the BioTac sensor to interpret its raw, low-dimensional signals.
And to do this, they collected a dataset of raw BioTac signals observed as the robot used the sensor to physically interact with a force sensor. So in addition to this dataset, they had a corresponding physics-based 3D finite element model of the BioTac, which essentially served as their ground truth, and using a neural net, they were able to map the raw, difficult-to-interpret signals to high-density deformation fields.
And so I think that’s a great example where deep learning is used to help the interpretation of a new data type while still grounding their interpretation in physics.
Shihan Lu: Interesting. Yeah. So since there’s a causal relationship between the action and the sensory output in interactive perception, the role of physics is quite important here.
It’s harder to reduce the dependence on huge amounts of data, right? Because we know the magic of deep learning: usually it gets much better when it has more data. Do you think that with interactive perception, the collection of data is more time-consuming and more difficult compared to traditional, like, passive perception methods?
Carolyn Matl: I think this becomes a real bottleneck only when you actually need a lot of data to train a model, like you alluded to. If you’re able to interpret the sensor signals with a low-dimensional physics-based model, then the amount of data you have shouldn’t be a bottleneck.
In fact, real data is always sort of the gold standard for learning a model, since ultimately you’ll be applying the model to real data, and you don’t want to overfit to any sort of artifacts or weird distributional shifts that might be introduced if you, for instance, augment your data with data that is, as an example, synthetically generated in simulation.
That being said, sometimes you won’t have access to a physics-based model that’s mature enough or complex enough to interpret the data that you’re observing. For instance, in collaboration with NVIDIA’s Seattle Robotics Lab, I was studying robotic manipulation of grains and trying to come up with a way to infer their material properties from a single image of a pile of grains.
Now the motivation behind this was that by inferring the material properties of grains, which ultimately affect their dynamics, the robot can then predict their behavior to perform more precise manipulation tasks, for instance, like pouring grains into a bowl. You can imagine how painful and messy it would be to collect this data in real life. Right?
Because first of all, you don’t have a known model for how these grains will behave. Um, and so yes, pretty painful to collect in real life, but using NVIDIA’s physics simulator and a Bayesian inference framework they called BayesSim, we could generate a lot of data in simulation to then learn a mapping from granular piles to granular material properties.
But of course, the classic challenge with relying on data synthesis or augmentation in simulation, especially with this new data that we’re collecting from new sensors, is the simulation-to-reality gap, which people call the sim-to-real gap, where distributions in simulation don’t quite match those in real life, partly due to lower-complexity representations in simulation, inaccurate physics, and lack of stochastic modelling. So we faced these challenges when, in collaboration, again, with NVIDIA, I studied the problem of trying to close the sim-to-real gap by adding learned stochastic dynamics to a physics simulator.
And another challenge is: what if you want to augment data that isn’t easily represented in simulation? So for example, we were using sound to measure the stochastic dynamics of a bouncing ball. But because the sounds of a bouncing ball are event-driven, we were able to circumvent the problem of simulating sound.
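One way to see why event timing can stand in for a full audio simulation: for an idealized ball with no air drag, the time between successive impacts shrinks by the coefficient of restitution on every bounce. The sketch below is an illustration under that assumption, not the method used in the work described here; `detect_impacts`, `restitution_from_impacts`, the threshold, and the synthetic click track are all hypothetical.

```python
import numpy as np

def detect_impacts(signal, sample_rate, threshold, min_gap_s=0.05):
    """Return the start time (s) of each loud burst in the signal.

    Loud samples separated by less than min_gap_s of quiet are
    treated as part of the same impact.
    """
    loud = np.flatnonzero(np.abs(signal) > threshold)
    times = []
    last_loud = -np.inf
    for idx in loud:
        t = idx / sample_rate
        if t - last_loud > min_gap_s:
            times.append(t)
        last_loud = t
    return np.array(times)

def restitution_from_impacts(times):
    """Ratio of successive flight times.

    For an ideal ball, flight time is 2*v/g and v shrinks by the
    coefficient of restitution e at each bounce, so the ratio is e.
    """
    flights = np.diff(times)
    return flights[1:] / flights[:-1]

# Synthetic click track: impacts at 0.1, 0.6, 1.0 and 1.32 seconds,
# i.e. flight times of 0.5, 0.4 and 0.32 s (e = 0.8).
sample_rate = 1000
audio = np.zeros(2000)
audio[[100, 600, 1000, 1320]] = 1.0

impact_times = detect_impacts(audio, sample_rate, threshold=0.5)
e_estimates = restitution_from_impacts(impact_times)
print(impact_times, e_estimates)
```

Because only the impact times are used, a simulator needs to reproduce contact events rather than realistic waveforms for the two data sources to be comparable.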
So our sim-to-real gap was no longer dependent on this drastic difference in data representation. I also have another example. Um, at Toyota Research, in our mobile manipulation group, there’s been some fantastic work on learning depth from stereo images. They call their simulation framework SimNet. Oftentimes, when you learn from simulated images, models can overfit to weird texture or non-photorealistic rendering artifacts.
So to get really realistic simulation data to match real data, you often have to pay a high price in terms of time, computation, and resources to generate or render that simulated data. However, since the SimNet team was focusing on the problem of perceiving 3D geometry rather than texture, they could get really high-performance learning on non-photorealistic textured simulated images, which could be procedurally generated at a much faster rate.
So this is another example I like of where the simulation and real data formats are not the same, but clever engineering can make synthetic data just as valuable to learn these models of new data.
Shihan Lu: But you also mentioned that with synthesized or augmented data, we sometimes have to pay the cost of overfitting issues and low-fidelity issues.
And it’s not always the best move to just augment. Sometimes we still kind of need to rely on the real data.
Carolyn Matl: Exactly, yeah.
Shihan Lu: Can we talk a little bit, like the reasons part? Where did you get the idea and what kind of physical behaviors you’re trying to mimic or trying to learn in the learning part?
Carolyn Matl: Sure. So maybe for this point, I’ll refer you to my most recent work with the StRETcH sensor,
the Soft to Resistive Elastic Tactile Hand, where we decided to take a model-based reinforcement learning approach to roll a ball of dough into a specific length. And surely, when you think about this problem, it involves highly complex dynamic interactions between a soft elastic sensor and an elastoplastic object. Our data type is complex as well, since it’s a high-dimensional depth image of the sensory skin.
So how do we design an algorithm that can handle such complexity? Well, the model-based reinforcement learning framework was very useful, since we wanted the robot to be able to use its knowledge of stiffness to efficiently roll doughs of different hydration levels. So this gives us our model-based part, but we also wanted it to be able to improve or adjust its model as the material properties of the dough changed.
Hence the reinforcement learning part of the algorithm. And this is necessary since if you’ve ever worked with dough, it can change quite drastically depending on its hydration levels or how much time it has had to rest. And so while we knew we wanted to use model based reinforcement learning, we were stuck with the problem that this algorithm scales poorly with increased data complexity.
So we ultimately decided to simplify both the state space of the dough and the action space of the robot, which allowed the robot to tractably solve this problem. And since the StRETcH sensor was capable of measuring a proxy for stiffness, using data from the camera imaging the deformations of the skin, this estimate of stiffness was essentially used to seed the model of the dough and make the algorithm converge faster to a policy that could efficiently roll out the dough into a specific length.
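The idea of seeding a model with a stiffness estimate and refining it while acting can be sketched with a deliberately simple toy. This is not the algorithm from the StRETcH work: the linear dough response (length gained equals force divided by stiffness), the function `roll_to_length`, and all numbers are hypothetical, and the seed is assumed not to overestimate the true stiffness.

```python
def roll_to_length(true_stiffness, seed_stiffness, start=0.10, target=0.20,
                   max_force=20.0, tol=0.002, max_rolls=50):
    """Toy plan-act-update loop; returns (number of rolls, final length).

    true_stiffness drives the simulated dough, k_hat is the robot's model.
    """
    length, k_hat = start, seed_stiffness
    rolls = 0
    while abs(target - length) > tol and rolls < max_rolls:
        # Plan: pick the force the current model says will close the gap.
        force = max(0.0, min(max_force, k_hat * (target - length)))
        if force <= 0.0:
            break  # overshot the target; rolling cannot shorten the dough
        # Act: toy linear response standing in for the real dough.
        delta = force / true_stiffness
        length += delta
        # Update: blend the stiffness observed on this roll into the model.
        k_hat = 0.5 * k_hat + 0.5 * (force / delta)
        rolls += 1
    return rolls, length

n_good, len_good = roll_to_length(true_stiffness=100.0, seed_stiffness=100.0)
n_bad, len_bad = roll_to_length(true_stiffness=100.0, seed_stiffness=25.0)
print(n_good, n_bad)
```

In this toy, an accurate seed reaches the target length in fewer rolls than a poor one, which mirrors the role the sensed stiffness plays in speeding up convergence.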
Shihan Lu: Okay. Very interesting. So during this model-based reinforcement learning. So is there any specific way you’re trying to design your reward function and, uh, or you’re trying to make your reward function to follow a specific real life goal?
Carolyn Matl: Yeah. So, the overall goal was pretty simple: it was to get the dough into a specific length. It was basically about the shape of the dough, which we were able to compress into a state space of just three dimensions, the bounding box of the dough.
But you can imagine that a more complicated shape would require a higher dimensional more expressive state space. But since we were able to compress the state space into such a low dimension, this allowed us to solve the problem a lot more easily.
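For a sense of how a high-dimensional depth image can collapse into a three-number state, here is a minimal sketch. It assumes a height map already expressed relative to the table and an axis-aligned dough; `dough_bounding_box`, the pixel size, and the height threshold are hypothetical.

```python
import numpy as np

def dough_bounding_box(height_map, pixel_size_m, min_height_m=0.002):
    """Return (length, width, height) in metres of the region above the table.

    length runs along image columns, width along image rows.
    """
    mask = height_map > min_height_m
    if not mask.any():
        return 0.0, 0.0, 0.0
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    length = float((cols[-1] - cols[0] + 1) * pixel_size_m)
    width = float((rows[-1] - rows[0] + 1) * pixel_size_m)
    height = float(height_map[mask].max())
    return length, width, height

# Synthetic height map: a slab 25 px long, 4 px wide, 3 cm tall.
hm = np.zeros((20, 40))
hm[8:12, 5:30] = 0.03
state = dough_bounding_box(hm, pixel_size_m=0.005)
print(state)
```

A more complicated target shape would not be captured by these three numbers, which is the trade-off described above.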
Shihan Lu: And lastly, I saw on your personal webpage that you say you work on unconventional sensors. If we wanted to make those unconventional sensors become conventional and let more researchers and labs use them in their own research, which parts should we allocate more resources to, and which may need more attention?
Carolyn Matl: Yeah. So that’s a great question. I think practically speaking, at the end of the day, we should allocate more resources for developing easier interfaces and packaging for these new unconventional sensors. Like, part of the reason why computer vision is so popular in robotics is that it’s easy to interface with a camera.
There are so many types of camera sensors available that can be purchased for a reasonable price. Camera drivers are packaged nicely. And, there are a ton of image libraries that help take the load off of image processing. And finally, we live in a world that’s inundated with visual data. So for roboticists, who are eager to get right to work on fun manipulation problems, the learning curve to plug in a camera and use it for perception is fairly low.
In fact, I think it’s quite attractive for all those reasons. However, I do believe that if there were more software packages or libraries that were dedicated to interfacing with these new or unconventional sensors on a lower level, this could help considerably in making these sensors seem more appealing to try using within the robotics community.
So for example, for one of my projects, I needed to interface with three microphones. And just the leap from two to three microphones required that I buy an audio interface device to be able to stream this data in parallel. And it took quite a bit of engineering effort to find the right hardware and software interface just to enable my robot to hear.
Yeah. However, if these unconventional sensors were packaged in a way that was intended for robotics, it would take away the step function necessary for figuring out how to interface with the sensor, um, allowing researchers to immediately explore how to use them in their own robotics applications. And that’s how I imagine we can make these unconventional sensors become more conventional in the future.
Shihan Lu: A quick follow-up question. If we just focus on a specific category under soft tactile sensors, do you think we will have a standardized sensor of this type in the future? If there is such a standardized sensor that we use just like cameras, what specifications would you envision?
Carolyn Matl: Well, I imagine, I guess with cameras, you know, there’s still a huge diversity in types of cameras. We have depth cameras, we have LIDAR, we have traditional RGB cameras, we have heat cameras, uh, thermal cameras rather. And so I, I could see tactile sensing for instance, progressing in a similar way where we will have classes of tactile sensors that will sort of be more popular.
because of a specific application. For instance, you can imagine vibration sensors might be more useful for one application. Soft optical tactile sensors, um, we’ve been seeing a lot of their use in robotic applications for manipulation. So I think in the future, we’ll see classes of these tactile sensors becoming more prominent,
um, as we see in the classes of cameras that are available now. Did that answer your question?
Shihan Lu: Yeah. That’s great. For cameras these days, we still have a variety of different cameras, and they have their own strengths for specific tasks. So you envision tactile sensors also being focused on their own specific tasks or specific areas. It’s very hard to have, like, a generalized, standard, or universal tactile sensor that can handle lots of tasks. So we still need to specialize them for smaller areas.
Carolyn Matl: Yes. I think there still needs to be some work in terms of integration of all this new technology.
But at the end of the day as engineers, we care about trade-offs and, um, that’ll ultimately lead us to choose what sensor makes the most sense for our application space.
Shihan Lu: Thanks so much for your interesting talk, for the many stories behind your tactile sensor design, and for sharing a lot of new knowledge and perspectives on interactive perception.
Carolyn Matl: Thanks so much for having me today.
Shihan Lu: Thank you. It was a pleasure.