Synchronized text with voice like karaoke
Hello, I am trying to make a book for children. The idea is that a voice reads the tale aloud while the children see the text, with the word they are currently hearing highlighted, like a karaoke. The voice and the highlighted word must be synchronized word by word (not line by line or sentence by sentence). Does anyone have an idea about how I can achieve this effect efficiently?

My idea: for each book page, create a layer with all the text for that page (the base layer), plus many transparent layers, each one with a single highlighted word. The base layer is shown all the time, and playback is synchronized so that only the layer containing the currently spoken word is shown in front of the base layer. But I think this will consume a lot of resources and won't be efficient or fast enough. What do you think? Any other ideas? Thanks in advance!!
Interesting question. Probably belongs in the forums. UA isn't meant for conceptual stuff.
Unfortunately I doubt there's an "easy" way to accomplish this. I believe you'll ultimately need to edit some kind of timing data for each spoken word. You could perhaps tie this to an animation. You'd want one animation per sentence. The animation could trigger the highlighting of each word over time as it's spoken. The alternative bypasses animation and uses timecodes: you'd keep a collection of timecodes tied to each "sentence" object.
This assumes a "sentence" object is something you've made that holds a single audio clip and either an animation with events or a collection of timecodes.
Your multi-layer solution is overly complicated. You need only a single Text object. You ARE using UI/canvases, right? Good!
Let your animation-events-or-timecodes trigger a method as the audio clip is played which inspects the current sentence object's string. It will display the whole string throughout playback of the clip, and colorize each word in sequence as spoken. No more complicated than that!
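For instance, the colorizing step can be as small as rebuilding the displayed string with rich-text color tags around the current word. This is a minimal sketch under my own naming (the thread doesn't show actual code; the method name and hex color are illustrative), and it assumes "Rich Text" is enabled on the UI Text component:

```csharp
// Rebuild a sentence string with Unity rich-text <color> tags
// wrapped around the word at wordIndex. All other words stay plain.
string HighlightWord(string sentence, int wordIndex)
{
    string[] words = sentence.Split(' ');
    if (wordIndex >= 0 && wordIndex < words.Length)
        words[wordIndex] = "<color=#ffcc00>" + words[wordIndex] + "</color>";
    return string.Join(" ", words);
}

// Usage inside a MonoBehaviour with a UnityEngine.UI.Text field:
// uiText.text = HighlightWord("My house is big", 1);
```

Because the whole string is rebuilt on each call, highlighting the new word automatically removes the highlight from the previous one.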
AlwaysSunny, thank you so much for your response!! :-)
What I don't understand is how I can make each word remain highlighted for exactly as long as it takes the reader to pronounce it, because each word has its own duration, and the effect I am looking for is that the highlighting and the audio are perfectly synchronized, like a karaoke.
My idea: (e.g. text = "My house")
1) Measure the time it takes the reader to pronounce each individual word (e.g. it takes 0.1s to say "My" and 0.3s to say "house").
2) Store each word's time in an array of times (e.g. times = [0.1, 0.3]).
3) Parse the text, splitting it into words and storing each one in an array of words (e.g. words = ["My", "house"]).
4) Process both arrays at the same time, highlighting each word for exactly as long as it is pronounced (e.g. change the color of "My" for 0.1s, then change the color of "house" for 0.3s).
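The four steps above could be sketched as a Unity coroutine (an illustrative sketch, assuming the arrays were filled by hand as described; the class and field names are my own):

```csharp
using System.Collections;
using UnityEngine;
using UnityEngine.UI;

public class KaraokeText : MonoBehaviour
{
    public Text uiText;                       // "Rich Text" enabled in the inspector
    public string[] words = { "My", "house" };
    public float[] times = { 0.1f, 0.3f };    // hand-measured durations per word

    // Walk both arrays in lockstep, colorizing word i for times[i] seconds.
    IEnumerator PlayHighlights()
    {
        for (int i = 0; i < words.Length; i++)
        {
            string[] shown = (string[])words.Clone();
            shown[i] = "<color=#ffcc00>" + shown[i] + "</color>";
            uiText.text = string.Join(" ", shown);
            yield return new WaitForSeconds(times[i]);
        }
        uiText.text = string.Join(" ", words); // clear the highlight at the end
    }
}
```

Started with `StartCoroutine(PlayHighlights())` alongside `audioSource.Play()`, this keeps text and voice roughly in step, though driving the highlight from the audio's own playback time (as suggested below in the thread) is more robust against drift.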
I think this idea is too manual and will take me a lot of time. Any other idea/suggestion?
I am really stuck with this issue and I really appreciate any help!
Thanks in advance!!
http://stemrobot.com:8081/bollywoodbands/public/download I would recommend you read the user manual of this software - AnyWhereCanPerform Karaoke Lyrics Timing Studio.
Answer by AlwaysSunny · May 01, 2015 at 04:13 AM
Unfortunately I imagine that a manual timestamp editing process will be required. Any automated method you could devise would likely involve an equal-or-greater amount of work.
Obviously that depends upon the total length of the spoken dialog. If you've got a dictated novella, an automated analysis program begins to make a lot more sense.
Assuming you want the entirety of each word to highlight the instant the word is spoken, and it should return to normal the instant that piece of audio stops, each word will require two timecodes. If the next spoken word can trigger the de-highlight event of the previous word, you cut your timestamping work in half.
If it were me, I'd use the following structure:
Your application plays through a collection of design-time constructed Sentence objects.
A Sentence object contains:
- A string representing the spoken dialog.
- A reference to a single Text object which is part of a canvas. When the sentence is loaded, the text changes to reflect the new string. By enabling "Rich Text" you will be able to colorize a given word independently with a teensy bit of coding. You can write a handy WrapStringInColorTags method like mine.
- An audio clip of the specific sentence, or a timestamp and duration for accessing that segment from a larger track, whichever is more convenient.
- A List of floats as the timecodes for each word -- more on this below.
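Put together, a Sentence object along those lines might look like this (a sketch under my own naming; the thread's actual WrapStringInColorTags isn't shown, so this is one plausible shape for it):

```csharp
using System.Collections.Generic;
using UnityEngine;

[System.Serializable]
public class Sentence
{
    public string dialog;            // the spoken text for this sentence
    public AudioClip clip;           // or a timestamp + duration into a larger track
    public List<float> timecodes;    // one entry per word: when it starts in the clip

    // Wrap the word at wordIndex in rich-text color tags,
    // leaving every other word plain.
    public string WrapStringInColorTags(int wordIndex, string hexColor)
    {
        string[] words = dialog.Split(' ');
        if (wordIndex >= 0 && wordIndex < words.Length)
            words[wordIndex] = "<color=" + hexColor + ">" + words[wordIndex] + "</color>";
        return string.Join(" ", words);
    }
}
```

Marking the class `[System.Serializable]` lets you author a list of Sentences in the inspector at design time, as the answer suggests.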
The method you'll be invoking at each timecode need only advance an indexer and colorize the corresponding word in the Text object's string. Each call nullifies the previous highlighting, and highlights the appropriate word.
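One way to drive that method is to poll the AudioSource's playback position each frame instead of using animation events (an illustrative sketch, assuming a Sentence type shaped as the bullets above describe, with a timecodes list and a WrapStringInColorTags method):

```csharp
using UnityEngine;
using UnityEngine.UI;

public class SentencePlayer : MonoBehaviour
{
    public AudioSource source;   // plays the current sentence's clip
    public Sentence sentence;    // holds dialog, clip, and per-word timecodes
    public Text uiText;          // "Rich Text" enabled
    int nextWord = 0;            // the indexer the answer mentions

    void Update()
    {
        if (source.isPlaying &&
            nextWord < sentence.timecodes.Count &&
            source.time >= sentence.timecodes[nextWord])
        {
            // Rebuilding the whole string highlights the new word and
            // implicitly nullifies the previous word's highlighting.
            uiText.text = sentence.WrapStringInColorTags(nextWord, "#ffcc00");
            nextWord++;
        }
    }
}
```

Because the next word's timecode triggers the re-colorize, you only need one timecode per word, matching the "cut your timestamping work in half" observation above.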
Write a helper script for yourself which holds and plays Sentences at runtime. Play the clip source at a reduced pitch if necessary for greater precision, but note you'll have to convert the recorded timecodes back relative to a normal pitch if you do this, since pitch control in Unity affects playback time.
As the audio plays, just as each spoken word begins, tap the space bar to make a note of the elapsed time in a List of floats. Save this list of floats as the Sentence's timecodes. This will require some serialization trickery, but you could have a function which saves this list of recorded floats to the corresponding Sentence object. Gotta be the best way, I'd think.
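The tap-the-spacebar recorder could be sketched like this (illustrative names; persisting the recorded list back into the corresponding Sentence object is left out, since that's the serialization trickery mentioned above):

```csharp
using System.Collections.Generic;
using UnityEngine;

public class TimecodeRecorder : MonoBehaviour
{
    public AudioSource source;                       // plays the sentence clip
    public List<float> recorded = new List<float>(); // one float per spoken word

    void Update()
    {
        // Tap space just as each spoken word begins.
        if (source.isPlaying && Input.GetKeyDown(KeyCode.Space))
            recorded.Add(source.time); // position within the clip, in seconds

        // Note: AudioSource.time reports the position within the clip itself.
        // If you instead record wall-clock elapsed time while playing at a
        // reduced pitch, scale the values back by the pitch as noted above.
    }
}
```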