Ok, cool! I was actually one of the people on the hyprnote HN thread asking for a headless mode!
I was actually integrating some whisper tools yesterday. I was wondering if there was a way to get a streaming response, and was thinking it'd be nice if you could.
I'm on Linux, so I don't think I can test out owhisper right now, but is that something that's possible?
Also, it looks like the `owhisper run` command gives its output as a TUI. Is there an option for a plain text response so that we can just pipe it to other programs? (Maybe just `kill`/`CTRL+C` to stop the recording and finalize the words.)
Same question for streaming: is there a way to get a streaming text output from owhisper? (It looks like you said you created a Deepgram-compatible API. I had a quick look at the API docs, but I don't know how easy it is to hook into it and get some nice streaming text while speaking.)
Oh yeah, and diarisation (available with a flag?) would be awesome; it's one of the things missing from most of the easiest-to-run tools I can find.
Nice stuff, had a quick test on Linux and it works (built directly, I didn't check out the brew). I ran into a small issue with Moonshine and opened an issue on GitHub.
Great work on this! Excited to keep an eye on things.
philjackson 32 minutes ago [-]
Also had a quick play too. The TUI is garbled thanks to some stderr messages, which can just be /dev/null'd. I don't seem to be able to interact with the transcripts with the arrow or jk keys.
Overall though, it's fast and really impressive. Can't wait for it to progress.
Can you help me out to find where the code you've built is? I can see the folder in GitHub[0], but I can't see the code for the CLI, for instance, unless I'm blind.
[0] https://github.com/fastrepl/hyprnote/tree/main/owhisper
Please find a way to add speaker diarization, with a way to remember the speakers. You can do it with pyannote and get a vector embedding of each speaker that can be compared between audio samples, but that approach is a year old now, so I'm sure there are better options!
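For reference, the pyannote embedding comparison looks roughly like this (sketch from memory; the embedding model is gated on Hugging Face so you need a token, and the current API may differ):

    # Rough sketch: compare speaker embeddings from two audio samples with pyannote.audio.
    # "HF_TOKEN" is a placeholder for a Hugging Face access token (the model is gated).
    from pyannote.audio import Model, Inference
    from scipy.spatial.distance import cdist

    model = Model.from_pretrained("pyannote/embedding", use_auth_token="HF_TOKEN")
    inference = Inference(model, window="whole")  # one embedding per whole file

    emb_a = inference("speaker_a.wav")
    emb_b = inference("speaker_b.wav")

    # Small cosine distance = probably the same speaker; the threshold is up to you.
    distance = cdist(emb_a.reshape(1, -1), emb_b.reshape(1, -1), metric="cosine")[0, 0]
    print(distance)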
yujonglee 16 hours ago [-]
yeah that is on the roadmap!
tempodox 6 hours ago [-]
It seems to use https://api.deepgram.com (and other web endpoints) and apparently needs an API key, so it's not actually local. Why is it being compared to ollama, which does run fully locally?
yujonglee 5 hours ago [-]
It can run Whisper and Moonshine models locally, while also allowing the use of other API providers. Read the docs - or at least this post.
tempodox 5 hours ago [-]
I would want such information accessible without having to go hunt for it. You could improve your presentation by interposing fewer clicks between a reader and the thing they want to know.
0x696C6961 3 hours ago [-]
The information is readily available in the open-your-eyes section.
vinni2 2 hours ago [-]
Very neat project! Congratulations to the founders. I was wondering why no one was working on such a tool.
But I was hoping a couple of features would be supported:
1. Multilingual support. It seems like even if I use a multilingual model like whisper-cpp-large-turbo-q8, the application assumes I am speaking English.
2. Translate feature. Probably already supported, but I didn't see the option.
wanderingmind 14 hours ago [-]
Thank you for taking the time to build something and share it. However, what is the advantage of using this over whisper.cpp's stream example, which can also do real-time transcription?
- It supports other models like Moonshine.
- It also works as a proxy for cloud model providers.
- It can expose local models via a Deepgram-compatible API server.
wanderingmind 11 hours ago [-]
Thank you. Having it operate as a proxy server that other apps can connect to is really useful.
solarkraft 16 hours ago [-]
Wait, this is cool.
I just spent last week researching the options (especially for my M1!) and was left wishing for a standard, full-service (live) transcription server for Whisper like Ollama has been for LLMs.
I’m excited to try this out and see your API (there seems to be a standard vacuum here due to OpenAI not having a real-time transcription service, which I find to be a bummer)!
Edit: They seem to emulate the Deepgram API (https://developers.deepgram.com/reference/speech-to-text-api...), which seems like a solid choice. I’d definitely like to see a standard emerging here.
Very cool. I was reading through the various threads here. I am working on adding STT and TTS to an AI DungeonMaster. Just a personal fun project; I'm working on the adventure part of it now. This will come in handy. I had dungeon navigation via commands working, but started over and left it at the point where I'm ready to merge the navigation back in once I'm happy with a slimmer, one-file version. It will be fun to be able to talk to the DM and have it respond with voice and actions. The diarization will be very helpful if I can create a stream where it can hear all of us conversing at once. But baby steps. Still working on getting the whole campaign working after I get characters created and put in a party :)
fancy_pantser 9 hours ago [-]
I scratched a similar itch and found local LLMs plus Whisper worked really well to listen in and "DJ" a soundtrack while playing tabletop RPGs with a group. If you want to check it out: https://github.com/sean-public/conductor
theanonymousone 4 hours ago [-]
Given how sentiment towards Ollama has shifted, I'm not sure this is a clever marketing line :D
jftuga 3 hours ago [-]
I have not heard about this. Can you please provide more context?
I’m looking for something that is aware of what is being discussed in real time, so if I zone out for a few minutes, I can ask it what I missed or to clarify something. Can this do that? If not, does anybody know of something that can?
koolala 9 hours ago [-]
Why not use an LLM with the speech-to-text output?
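E.g. keep appending the finalized STT segments to a buffer and ask a model to catch you up on demand. Rough sketch (the model name and prompt are placeholders; point it at whatever LLM you actually run):

    # Rough sketch: rolling transcript buffer + "what did I miss?" via an LLM.
    # Model name and prompt are placeholders; any OpenAI-compatible server works
    # if you set base_url accordingly.
    from openai import OpenAI

    client = OpenAI()
    transcript = []  # append each finalized STT segment here as it arrives

    def what_did_i_miss(last_n: int = 50) -> str:
        recent = "\n".join(transcript[-last_n:])
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder
            messages=[
                {"role": "system", "content": "Briefly summarize this meeting transcript excerpt."},
                {"role": "user", "content": recent},
            ],
        )
        return resp.choices[0].message.content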
net_rando 2 hours ago [-]
Is a container version on your roadmap?
JP_Watts 17 hours ago [-]
I’d like to use this to transcribe meeting minutes with multiple people. How could this program work for that use case?
Can you describe how it picks out different voices? Does it need separate audio channels, or does it recognize different voices on the same audio input?
yujonglee 17 hours ago [-]
It separates mic/speaker as 2 channels, so you can reliably get "what you said" vs "what you heard".
For splitting speakers within a channel, we need an AI model to do that. It is not implemented yet, but I think we'll be in good shape sometime in September.
Also, we have a transcript editor where you can easily split segments and assign speakers.
Also fyi - https://docs.hyprnote.com/owhisper/configuration/providers/o...
sxp 17 hours ago [-]
If you want to transcribe meeting notes, Whisper isn't the best tool because it doesn't separate the transcript by speakers. There are some other tools that do that, but I'm not sure what the best local option is. I've used Google's cloud STT with the diarization option and manually renamed "Speaker N" after the fact.
dcreater 7 hours ago [-]
Why can't it just use any OpenAI API endpoint?
yujonglee 7 hours ago [-]
what do you mean? this use-case is not LLM, it is realtime STT.
Just wanna give a shout out to the hyprnote team - I've been running it for about a month now and I love how simple and gimmick-free it is. It's a good app, def recommend! (Team seems like a lovely group of youngins also) :)
rshemet 12 hours ago [-]
THIS IS THE BOMB!!! So excited for this one. Thanks for putting cool tech out there.
yujonglee 12 hours ago [-]
Thank you!
notthetup 11 hours ago [-]
Is there a way to list all the models that are available to be pulled?
yujonglee 10 hours ago [-]
sure. `owhisper pull --help`
yujonglee 17 hours ago [-]
Happy to answer any questions!
These are the local models it supports:
- whisper-cpp-base-q8
- whisper-cpp-base-q8-en
- whisper-cpp-tiny-q8
- whisper-cpp-tiny-q8-en
- whisper-cpp-small-q8
- whisper-cpp-small-q8-en
- whisper-cpp-large-turbo-q8
- moonshine-onnx-tiny
- moonshine-onnx-tiny-q4
- moonshine-onnx-tiny-q8
- moonshine-onnx-base
- moonshine-onnx-base-q4
- moonshine-onnx-base-q8
phkahler 16 hours ago [-]
I thought whisper and others took large chunks (20-30 seconds) of speech, or a complete wave file as input. How do you get real-time transcription? What size chunks do you feed it?
To me, STT should take a continuous audio stream and output a continuous text stream.
yujonglee 16 hours ago [-]
I use VAD to chunk audio.
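Roughly this idea (a simplified Python illustration, not our actual implementation):

    # Simplified VAD chunking: collect speech frames, cut a chunk after ~300 ms of
    # silence, and hand each chunk to the STT model. Assumes 16 kHz, 16-bit mono PCM.
    import webrtcvad  # pip install webrtcvad

    def chunk_by_vad(pcm: bytes, sample_rate=16000, frame_ms=30, max_silence_frames=10):
        vad = webrtcvad.Vad(2)  # aggressiveness 0 (loose) to 3 (strict)
        frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per sample
        chunks, current, silence = [], bytearray(), 0
        for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
            frame = pcm[i:i + frame_bytes]
            if vad.is_speech(frame, sample_rate):
                current.extend(frame)
                silence = 0
            elif current:
                silence += 1
                if silence >= max_silence_frames:  # enough silence: close the chunk
                    chunks.append(bytes(current))
                    current, silence = bytearray(), 0
        if current:
            chunks.append(bytes(current))
        return chunks  # each chunk goes to Whisper/Moonshine for transcription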
Whisper and Moonshine both work on chunks, but for Moonshine:
> Moonshine's compute requirements scale with the length of input audio. This means that shorter input audio is processed faster, unlike existing Whisper models that process everything as 30-second chunks. To give you an idea of the benefits: Moonshine processes 10-second audio segments 5x faster than Whisper while maintaining the same (or better!) WER.
Also, for Kyutai, we can feed continuous audio in and get continuous text out.
- https://github.com/moonshine-ai/moonshine
- https://docs.hyprnote.com/owhisper/configuration/providers/k...
Something like that in a CLI tool, something that just gives text to stdout, would be perfect for a lot of my use cases!
(maybe with an `owhisper serve` somewhere else to start the model running or whatever.)
yujonglee 16 hours ago [-]
Are you thinking about the realtime use-case or batch use-case?
For just transcribing a file/audio,
`owhisper run <MODEL> --file a.wav` or
`curl https://something.com/audio.wav | owhisper run <MODEL>`
might make sense.
mijoharas 16 hours ago [-]
Agreed, both of those make sense, but I was thinking realtime. (Pipes can stream data; I'd find it useful to have something that can stream STT output to stdout in realtime.)
yujonglee 15 hours ago [-]
It's open-source. Happy to review & merge if you can send us a PR!
FYI:
owhisper pull whisper-cpp-large-turbo-q8
Failed to download model.ggml: Other error: Server does not support range requests. Got status: 200 OK
But the base-q8 works (and works quite well!). The TUI is really nice. Speaker diarization would make it almost perfect for me. Thanks for building this.
yujonglee 13 hours ago [-]
we store data in R2 and range queries sometimes glitch...
It might work if you retry it
alkh 15 hours ago [-]
Sorry, maybe I missed it, but I didn't see this list on your website. I think it is a good idea to add this info there. Besides that, thank you for the effort and your work! I will definitely give it a try.
yujonglee 15 hours ago [-]
got it. fyi if you run `owhisper pull --help`, this info is printed
elektor 13 hours ago [-]
Cool tool! Are you guys releasing Hyprnote for Windows this month?
yujonglee 13 hours ago [-]
probably end of this month or early next month. not 100% sure.
pylotlight 11 hours ago [-]
Does this have MPS support for hardware acceleration?
yujonglee 11 hours ago [-]
yes. metal is on
DiabloD3 15 hours ago [-]
I suggest you don't brand this "Ollama for X". They've become a commercial operation that is trying to FOSS-wash their actions by using llama.cpp's code and then throwing their users under the bus when they can't support them.
I see that you are also using llama.cpp's code? That's cool, but make sure you become a member of that community, not an abuser.
yujonglee 15 hours ago [-]
yeah we use whisper.cpp for whisper inference. this is more like a community-focused project, not a commercial product!
reilly3000 9 hours ago [-]
Ya, after spending a decent amount of time in r/localllama I was surprised that a project would want to name itself in association with Ollama; it's got a pretty bad reputation in the community at this point.
> I'm on Linux, so I don't think I can test out owhisper right now, but is that something that's possible?
I haven't tested on Linux yet, but we have a Linux build: http://owhisper.hyprnote.com/download/latest/linux-x86_64
> Also, it looks like the `owhisper run` command gives its output as a TUI. Is there an option for a plain text response so that we can just pipe it to other programs?
`owhisper run` is more of a way to quickly try it out, but I think piping is definitely something that should work.
> Same question for streaming, is there a way to get a streaming text output from owhisper?
You can use a Deepgram client to talk to `owhisper serve` (https://docs.hyprnote.com/owhisper/deepgram-compatibility), so the best resource might be the Deepgram client SDK docs.
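Something like this untested sketch should work against it. The port and the `/v1/listen` path here are assumptions, so check what `owhisper serve` actually prints; the response parsing just follows Deepgram's streaming message shape:

    # Untested sketch of a tiny streaming client for a Deepgram-compatible endpoint.
    # The URL (port, /v1/listen path, query params) is an assumption; adjust it to
    # whatever `owhisper serve` actually listens on. Response parsing follows the
    # Deepgram streaming format: {"channel": {"alternatives": [{"transcript": ...}]}}.
    import asyncio, json, sys
    import websockets  # pip install websockets

    URL = "ws://localhost:8080/v1/listen?encoding=linear16&sample_rate=16000"

    async def main():
        async with websockets.connect(URL) as ws:
            async def send_audio():
                loop = asyncio.get_running_loop()
                while True:
                    # 3200 bytes = 100 ms of 16 kHz, 16-bit mono PCM read from stdin
                    chunk = await loop.run_in_executor(None, sys.stdin.buffer.read, 3200)
                    if not chunk:
                        break
                    await ws.send(chunk)
                await ws.send(json.dumps({"type": "CloseStream"}))

            async def print_text():
                async for message in ws:
                    data = json.loads(message)
                    alt = data.get("channel", {}).get("alternatives", [{}])[0]
                    if alt.get("transcript"):
                        print(alt["transcript"], flush=True)

            await asyncio.gather(send_audio(), print_text())

    asyncio.run(main())

On Linux you could then pipe raw audio in with something like `arecord -t raw -f S16_LE -r 16000 -c 1 | python client.py` and get live text on stdout.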
> diarisation
yeah on the roadmap
> I can't see the code for the CLI, for instance
https://github.com/fastrepl/hyprnote/blob/8bc7a5eeae0fe58625...
> 1. Multilingual support. 2. Translate feature.
https://github.com/ggml-org/whisper.cpp/tree/master/examples...
> I’m excited to try this out and see your API
Let me know how it goes!