Extract chapters from Youtube Media
YouTube recently introduced a “chapter” concept that splits a long video into chapters. I think this data is parsed from the video description, as any timestamp found there has already been parsed for a while now.
Thanks to youtube-dl, we can download the video and its metadata, which now contains this chapter data.
$ youtube-dl --write-info-json -x --audio-format mp3 https://www.youtube.com/watch?v=HZTStHzWRxM
[youtube] HZTStHzWRxM: Downloading webpage
[info] Writing video description metadata as JSON to: The New Youtube Chapter Timestamp Feature-HZTStHzWRxM.info.json
[download] Destination: The New Youtube Chapter Timestamp Feature-HZTStHzWRxM.webm
[download] 100% of 3.22MiB in 00:00
[ffmpeg] Destination: The New Youtube Chapter Timestamp Feature-HZTStHzWRxM.mp3
Deleting original file The New Youtube Chapter Timestamp Feature-HZTStHzWRxM.webm (pass -k to keep)
We will use https://www.youtube.com/watch?v=HZTStHzWRxM as an example.
The command above downloads the media file, transcodes it to mp3 and also downloads the metadata in JSON format. We now have 2 files:
- The New Youtube Chapter Timestamp Feature-HZTStHzWRxM.info.json, which contains the data
- The New Youtube Chapter Timestamp Feature-HZTStHzWRxM.mp3, which is the media
jq is a wonderful command-line tool to manipulate JSON in bash. For example, we can get the title of the video like this:
$ cat The\ New\ Youtube\ Chapter\ Timestamp\ Feature-HZTStHzWRxM.info.json | jq -r .title | sed -e 's/[^A-Za-z0-9._-]/_/g'
The_New_Youtube_Chapter_Timestamp_Feature
The sed here makes sure we won’t have special characters that might lead to errors later.
The -r option of jq tells it to return “raw text”. By default, jq prints strings JSON-encoded, keeping the surrounding quotes and escaped special characters, which might lead to issues later.
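As a small sketch, the same cleanup can be wrapped in a function so it is easy to try on any string. The name sanitize_name is a hypothetical helper for illustration, not part of the original script; the sed expression is the one used above.

```shell
# Hypothetical helper wrapping the sed expression used in this post:
# every character outside A-Z, a-z, 0-9, '.', '_' and '-' becomes '_'.
sanitize_name() {
    printf '%s\n' "$1" | sed -e 's/[^A-Za-z0-9._-]/_/g'
}

sanitize_name "Problems / suggestions for the future"
# Problems___suggestions_for_the_future
```

Each space and the slash are replaced one by one, which is why three underscores appear in a row.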
If available, the youtube-dl info JSON contains a chapters array that holds all the chapters with their start_time, end_time and title.
$ cat The\ New\ Youtube\ Chapter\ Timestamp\ Feature-HZTStHzWRxM.info.json |\
jq -r '.chapters[]'
{
"start_time": 0,
"end_time": 17,
"title": "The new feature"
}
{
"start_time": 17,
"end_time": 76,
"title": "Slow roll-out"
}
{
"start_time": 76,
"end_time": 124,
"title": "How it works"
}
{
"start_time": 124,
"end_time": 180,
"title": "Problems / suggestions for the future"
}
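Since the chapters array is only there "if available", it may be worth checking for it before going further. The following is a minimal sketch, assuming jq is installed; has_chapters and video.info.json are hypothetical names used for illustration.

```shell
# Hypothetical helper: succeeds only if the info JSON has a non-empty
# "chapters" array. jq -e derives the exit status from the last output
# value (false or null gives a non-zero status).
has_chapters() {
    jq -e '(.chapters // []) | length > 0' "$1" > /dev/null
}

# Usage sketch (video.info.json is a placeholder file name):
# has_chapters "video.info.json" || echo "no chapters, nothing to split"
```

The `// []` fallback covers both a missing key and an explicit null.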
The idea now is to use each entry here as parameters for ffmpeg to split the media according to the chapter data. As we are in bash, the current JSON representation is quite hard to use as is, so we need to transform it a little to feed the output of jq through a pipe into xargs.
We also need to take into consideration that ffmpeg splits a media file with the option -ss to know where to start and -t to know the duration of the cut, not the end time. As the JSON gives us a start and an end time, we need to perform a simple subtraction to get the start time and the duration.
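As a worked example of that subtraction, here is a tiny sketch in plain shell. cut_args is a hypothetical helper, and it assumes the times are whole seconds, as in the example video (shell arithmetic does not handle decimals).

```shell
# Hypothetical helper: turn a chapter's start and end time (whole seconds)
# into the -ss/-t pair that ffmpeg expects.
cut_args() {
    start=$1
    end=$2
    echo "-ss $start -t $((end - start))"
}

cut_args 17 76
# -ss 17 -t 59
```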
$ cat The\ New\ Youtube\ Chapter\ Timestamp\ Feature-HZTStHzWRxM.info.json |\
jq -r '.chapters[] | .start_time,.end_time-.start_time,.title ' |\
sed 's/"//g'
0
17
The new feature
17
59
Slow roll-out
76
48
How it works
124
56
Problems / suggestions for the future
Thanks to jq, we can perform simple math operations directly in the command to compute the duration. sed here again is only for cleaning up special characters.
Now, we can pipe into the wonderful xargs to use the output as parameters and trigger an ffmpeg command:
$ cat The\ New\ Youtube\ Chapter\ Timestamp\ Feature-HZTStHzWRxM.info.json|\
jq -r '.chapters[] | .start_time,.end_time-.start_time,.title ' |\
sed -e 's/[^A-Za-z0-9._-]/_/g' |\
xargs -n3 -t -d'\n' sh -c 'ffmpeg -y -ss $0 -i "The New Youtube Chapter Timestamp Feature-HZTStHzWRxM.mp3" -t $1 -codec:a copy "$2.mp3"'
- -n3 tells xargs to take parameters 3 by 3
- -t is only for debugging, as it prints each command xargs will execute
- -d'\n' indicates that parameters are separated by \n
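To see how xargs groups the lines three by three before involving ffmpeg, a dry run with echo can help. This is only a sketch: preview_chapters is a hypothetical name, and -d is left out because the sample titles are already sanitized and contain no whitespace.

```shell
# Dry run: echo stands in for ffmpeg so the grouping is visible.
# Inside sh -c, the three parameters arrive as $0, $1 and $2.
preview_chapters() {
    xargs -n3 sh -c 'echo "start=$0 duration=$1 title=$2"'
}

printf '%s\n' 0 17 The_new_feature 17 59 Slow_roll-out | preview_chapters
# start=0 duration=17 title=The_new_feature
# start=17 duration=59 title=Slow_roll-out
```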
What is cool is that we could potentially parallelize the process here by adding the parameter -P X to xargs to run multiple ffmpeg invocations in parallel.
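A sketch of what -P changes, again with echo standing in for ffmpeg. preview_parallel is a hypothetical helper that takes the number of parallel jobs as its argument; with more than one job, the order of the output lines is not guaranteed.

```shell
# Hypothetical helper: same grouping by 3, but up to $1 commands at once.
preview_parallel() {
    xargs -n3 -P "$1" sh -c 'echo "start=$0 duration=$1 title=$2"'
}

# Two jobs at a time; the lines may come out in any order.
printf '%s\n' 0 17 The_new_feature 17 59 Slow_roll-out | preview_parallel 2
```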
On the ffmpeg side, nothing tremendous:
- -ss and -t have already been explained as start time and duration
- -codec:a copy indicates that we keep the same codec as the original file, so there is no re-encoding for the output file, which means it’s fast
- -y avoids the prompt and forces overwriting an existing output file
That works quite well. It might be possible to turn it into a full one-liner, but let’s write a proper script to ease the usage of this.
#!/bin/sh
set -x
#Download media + metadata
youtube-dl --write-info-json -x --audio-format mp3 -o "tmp_out.%(ext)s" "$1"
# Maybe a way to get the file name from previous function
INFO="tmp_out.info.json"
AUDIO="tmp_out.mp3"
echo :: $INFO $AUDIO ::
# Fetch the title
TITLE=$(cat "$INFO" | jq -r .title | sed -e 's/[^A-Za-z0-9._-]/_/g' )
# ^--- Remove all weird character as we want to use it as filename
# We will put all chapter into a directory
mkdir "$TITLE"
# Chapterization
cat "$INFO" |\
jq -r '.chapters[] | .start_time,.end_time-.start_time,.title ' |\
sed -e 's/[^A-Za-z0-9._-]/_/g' |\
xargs -n3 -t -d'\n' sh -c "ffmpeg -y -ss \$0 -i \"$AUDIO\" -to \$1 -codec:a copy -f mp3 \"$TITLE/\$2.mp3\""
#Remove tmp file
rm tmp_out*
The script file is here: https://gist.github.com/totetmatt/b4bf50c62642e5a9e1bf6365a47e19c6
No big change in the global approach, just something to be careful about: yes, there is a hellish quote-escaping game to play, and it might not be pleasant.
To explain the last part, as far as I understand it, the string will be evaluated multiple times:
- The first time is at “script level”, which replaces any $VARIABLE present in the script, like $AUDIO and $TITLE.
- The second time is at the xargs / sh -c invocation, where it’s then possible to use $0, $1 and $2. But if we don’t escape them first, these variables will be evaluated in the first round, which is why we need to backslash them: \$0, \$1, \$2.
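The two rounds can be reproduced in isolation with echo in place of ffmpeg. This is a minimal sketch: demo_escape is a hypothetical name, and the input values are made up for the demonstration.

```shell
AUDIO="tmp_out.mp3"   # expanded by the script's shell (first round)

# \$0, \$1 and \$2 survive the first round as literal $0, $1, $2 and are
# expanded by the inner sh -c (second round). echo stands in for ffmpeg.
demo_escape() {
    xargs -n3 sh -c "echo \"cut \$0 +\$1 from $AUDIO into \$2.mp3\""
}

printf '%s\n' 0 17 intro | demo_escape
# cut 0 +17 from tmp_out.mp3 into intro.mp3
```

Removing the backslashes would make the outer shell substitute its own $0, $1 and $2 before xargs ever sees the string.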
You can see the result of the string after the first evaluation thanks to the -t option of xargs:
sh -c 'ffmpeg -y -ss $0 -i "The New Youtube Chapter Timestamp Feature-HZTStHzWRxM.mp3" -to $1 -codec:a copy -f mp3 "The_New_Youtube_Chapter_Timestamp_Feature/$2.mp3"' 124 56 Problems___suggestions_for_the_future
There might be other and better ways to deal with the argument parsing, the string escaping and the string cleanup, but the current solution works well enough :)
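One possible alternative, sketched here under assumptions rather than taken from the gist: let jq emit one tab-separated line per chapter with its @tsv filter and read it with a while loop, which avoids the nested quoting entirely. print_cut_commands is a hypothetical helper, and it only prints the ffmpeg commands instead of running them.

```shell
# Hypothetical helper: reads "start<TAB>duration<TAB>title" lines on stdin
# and prints (does not run) one ffmpeg command per chapter.
print_cut_commands() {
    audio=$1
    tab=$(printf '\t')
    while IFS="$tab" read -r start duration title; do
        # Same cleanup as in the script, applied to the title only
        safe=$(printf '%s\n' "$title" | sed -e 's/[^A-Za-z0-9._-]/_/g')
        printf 'ffmpeg -y -ss %s -i "%s" -t %s -codec:a copy "%s.mp3"\n' \
            "$start" "$audio" "$duration" "$safe"
    done
}

# Usage sketch, assuming jq is installed and $INFO / $AUDIO are set as in
# the script:
# jq -r '.chapters[] | [.start_time, .end_time - .start_time, .title] | @tsv' "$INFO" |
#     print_cut_commands "$AUDIO"
```

If the loop ran ffmpeg directly instead of printing, ffmpeg would need -nostdin (or a redirect from /dev/null) so that it does not consume the remaining chapter lines.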