Speech input is one of the missing features in #Phosh's stevia. I had looked at several possible solutions but didn't want to pull in a ton more dependencies into stevia itself.
While looking for something completely different I stumbled onto #vosk-server which runs fully locally but can be talked to via websocket and so I could punch that into the prototype I had already alying around (video has audio):