IMO they’re way too fixated on making a single model AGI.
Some people have tried combining multiple specialized models (voice recognition + image recognition + LLM + controls + voice synthesis) and gotten quite compelling results.
https://www.youtube.com/watch?v=7Fa3_rH4NcQ
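The glue code for that kind of pipeline ends up being pretty small. A minimal sketch, where transcribe/generate/synthesize are just placeholders for whatever speech recognition, LLM, and voice synthesis models you pick (not any particular library):

```python
# Minimal sketch of chaining specialized models: audio in -> text -> audio out.
# transcribe(), generate(), and synthesize() are hypothetical placeholders for
# whatever speech recognition, LLM, and voice synthesis models you choose.

def transcribe(audio: bytes) -> str:
    """Speech recognition: raw audio -> text."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """LLM: user text -> response text."""
    raise NotImplementedError

def synthesize(text: str) -> bytes:
    """Voice synthesis: response text -> raw audio."""
    raise NotImplementedError

def one_turn(audio: bytes) -> bytes:
    """One conversational turn through the whole pipeline."""
    return synthesize(generate(transcribe(audio)))
```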
I’m just impressed by how snappy it was. I wish he had the ability to let it listen longer without responding right away, though.
I wish I had that ability too.
If you’re the programmer, it’s not hard to use a key press to enable TTS and then send the response to it in chunks (rough sketch below). I made a very similar version of this project, but my GPU didn’t stream the responses nearly as seamlessly.
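Very roughly something like this, where key_is_held, record_chunk, transcribe, stream_reply, and speak are hypothetical stand-ins for whatever input handling, speech recognition, LLM, and TTS you happen to use:

```python
import re

def take_turn() -> None:
    # Keep recording for as long as the push-to-talk key is held, so the
    # model doesn't jump in the moment you pause to think.
    audio = b""
    while key_is_held("space"):          # hypothetical key-state check
        audio += record_chunk()          # hypothetical mic read

    # Stream the LLM reply and hand it to TTS one sentence at a time,
    # so speech starts before the full response has been generated.
    pending = ""
    for token in stream_reply(transcribe(audio)):  # hypothetical ASR + LLM stream
        pending += token
        parts = re.split(r"(?<=[.!?])\s+", pending)
        for sentence in parts[:-1]:
            speak(sentence)              # hypothetical TTS call
        pending = parts[-1]
    if pending.strip():
        speak(pending)
```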
Sorry, I meant the real-life me, crippling AuDHD and all. But I’m also not a programmer.