Subtitle improvements, #build

Changes:
- merges subtitle support for JS video player
- merges hint on what to do when no videos are found
- merges better indexing and error handling of subtitles
2022-02-12 19:11:47 +07:00 · 428cc315e4
parent 385d6bace8 78720b33b7
commit 428cc315e4
14 changed files with 115 additions and 14 deletions
--- a/1
+++ b/1
@@ -9,6 +9,7 @@ ENV PYTHONUNBUFFERED 1
 RUN apt-get clean && apt-get -y update && apt-get -y install --no-install-recommends \
    build-essential \
    nginx \
+    atomicparsley \
    curl && rm -rf /var/lib/apt/lists/*

 # get newest patched ffmpeg and ffprobe builds for amd64 fall back to repo ffmpeg for arm64
--- a/docs/FAQ.md
+++ b/docs/FAQ.md
@@ -0,0 +1,31 @@
+# Frequently Asked Questions
+
+## 1. Scope of this project
+Tube Archivist is *Your self hosted YouTube media server*, which also defines the primary scope of what this project tries to do:
+- **Self hosted**: This assumes you have full control over the underlying operating system and hardware and can configure things to work properly with Docker, its volumes and networks, as well as whatever disk storage and filesystem you choose to use.
+- **YouTube**: Downloading, indexing and playing videos from YouTube. There are currently no plans to expand this to any additional platforms.
+- **Media server**: This project tries to be a standalone media server with its own web interface.
+
+In addition, progress is also happening on:
+- **API**: Endpoints for additional integrations.
+- **Browser Extension**: To integrate between youtube.com and Tube Archivist.
+
+Defining the scope is important for the success of any project:
+- A scope that is too broad will spread development effort too thin and risks that this project tries to do too many things and none of them well.
+- A scope that is too narrow will make this project uninteresting and will exclude audiences that could also benefit from this project.
+- Not defining a scope will easily lead to misunderstandings and false hopes of where this project tries to go.
+
+Of course this is subject to change, as this project continues to grow and more people contribute.
+
+## 2. Emby/Plex/Jellyfin/Kodi integrations
+Although there are similarities between these excellent projects and Tube Archivist, they have a very different use case. Trying to fit the metadata relations and database structure of a YouTube archival project into these media servers that specialize in Movies and TV shows is always going to be limiting.
+
+Part of the scope is to be its own media server, so that's where the focus and effort of this project is. That being said, the nature of self hosted and open source software gives you all the possible freedom to use your media as you wish.
+
+## 3. To Docker or not to Docker
+This project is a classic Docker application: there are multiple moving parts that need to interact with each other and need to be compatible with multiple architectures and operating systems. Additionally, Docker drastically reduces development complexity, which is highly appreciated.
+
+So Docker is the only supported installation method. If you don't have any experience with Docker, consider investing the time to learn this very useful technology.
+
+## 4. Finetuning Elasticsearch
+A minimal configuration of Elasticsearch (ES) is provided in the example docker-compose.yml file. ES is highly configurable and very interesting to learn more about. Refer to the [documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html) if you want to get into it.
--- a/docs/Home.md
+++ b/docs/Home.md
@@ -2,6 +2,7 @@
 Welcome to the official Tube Archivist Wiki. This is an up-to-date documentation of user functionality.

 Table of contents:
+* [FAQ](FAQ): Frequently asked questions about what this project is and tries to do
 * [Channels](Channels): Browse your channels, handle channel subscriptions
 * [Playlists](Playlists): Browse your indexed playlists, handle playlist subscriptions
 * [Downloads](Downloads): Scanning subscriptions, handle download queue
--- a/docs/Settings.md
+++ b/docs/Settings.md
@@ -27,9 +27,16 @@ Additional settings passed to yt-dlp.
 - **Embed Metadata**: This saves the available tags directly into the media file by passing `--embed-metadata` to yt-dlp.
 - **Embed Thumbnail**: This will save the thumbnail into the media file by passing `--embed-thumbnail` to yt-dlp.

+## Subtitles
+- **Download Setting**: Select the subtitle language you'd like to download. Add a comma separated list for multiple languages.
+- **Source Settings**: User created subtitles are provided by the uploader and are usually the video script. Auto generated subtitles come from YouTube; quality varies, particularly for auto translated tracks.
+- **Index Settings**: Enabling subtitle indexing will add the lines to Elasticsearch and will make subtitles searchable. This will increase the index size and is not recommended on low-end hardware.
+
 ## Integrations
 All third party integrations of TubeArchivist will **always** be *opt in*.
- **returnyoutubedislike.com**: This will get dislikes and average ratings for each video back by integarting with the API from [returnyoutubedislike.com](https://www.returnyoutubedislike.com/).
+- **API**: Your access token for the Tube Archivist API. 
+- **returnyoutubedislike.com**: This will get dislikes and average ratings for each video by integrating with the API from [returnyoutubedislike.com](https://www.returnyoutubedislike.com/).
+- **Cast**: Enable Google Cast for videos. Requires a valid SSL certificate and works only in Google Chrome.

 # Scheduler Setup
 Schedule settings expect a cron like format, where the first value is minute, second is hour and third is day of the week. Day 0 is Sunday, day 1 is Monday etc.
@@ -69,7 +76,7 @@ Create a zip file of the metadata and select **Max auto backups to keep** to aut
 Additional database functionality.

 ## Manual Media Files Import
-So far this depends on the video you are trying to import to be still available on YouTube to get the metadata. Add the files you like to import to the */cache/import* folder. Then start the process from the settings page *Manual Media Files Import*. Make sure to follow one of the two methods below.
+So far this depends on the video you are trying to import to be still available on YouTube to get the metadata. Add the files you'd like to import to the */cache/import* folder. Then start the process from the settings page *Manual Media Files Import*. Make sure to follow one of the two methods below.

 ### Method 1:
 Add a matching *.json* file with the media file. Both files need to have the same base name, for example:
@@ -86,6 +93,7 @@ Detect the YouTube ID from filename, this accepts the default yt-dlp naming conv

 ### Some notes:
 - This will **consume** the files you put into the import folder: Files will get converted to mp4 if needed (this might take a long time...) and moved to the archive, *.json* files will get deleted upon completion to avoid having duplicates on the next run.
+- There should be no subdirectories added to */cache/import*, only video files. If your existing video library has video files inside subdirectories, you can get all the files into one directory by running `find ./ -mindepth 2 -type f -exec mv '{}' . \;` from the top-level directory of your existing video library. You can also delete any remaining empty subdirectories with `find ./ -mindepth 1 -type d -delete`.
 - Maybe start with a subset of your files to import to make sure everything goes well...
 - Follow the logs to monitor progress and errors: `docker-compose logs -f tubearchivist`.
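
Review note on the Subtitles **Download Setting** documented in this file: the setting takes a comma separated list of language codes, and the matching `sub_conf_parse` change in this commit splits and strips that string only when it is set. A minimal sketch of that parsing, assuming the unset value is `False` as in `config.json`; the function name `parse_languages` is hypothetical, and returning an empty list plus dropping empty entries are choices made for this sketch, not behavior taken from the commit:

```python
def parse_languages(languages_raw):
    """Split a comma separated setting into clean language codes.

    Returns an empty list when the setting is unset (False or empty),
    which is an assumption of this sketch, not the commit's behavior.
    """
    if not languages_raw:
        return []
    # strip surrounding whitespace and skip entries left empty by stray commas
    return [i.strip() for i in languages_raw.split(",") if i.strip()]


languages = parse_languages("en, de ,fr")
```

Returning a list in every case means callers can iterate without first checking whether the attribute exists.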

--- a/tubearchivist/home/config.json
+++ b/tubearchivist/home/config.json
@@ -25,6 +25,7 @@
        "add_thumbnail": false,
        "subtitle": false,
        "subtitle_source": false,
+        "subtitle_index": false,
        "throttledratelimit": false,
        "integrate_ryd": false
    },
--- a/tubearchivist/home/src/frontend/forms.py
+++ b/tubearchivist/home/src/frontend/forms.py
@@ -70,8 +70,14 @@ class ApplicationSettingsForm(forms.Form):

    SUBTITLE_SOURCE_CHOICES = [
        ("", "-- change subtitle source settings"),
+        ("user", "only download user created"),
        ("auto", "also download auto generated"),
-        ("user", "only download uploader"),
+    ]
+
+    SUBTITLE_INDEX_CHOICES = [
+        ("", "-- change subtitle index settings --"),
+        ("0", "disable subtitle index"),
+        ("1", "enable subtitle index"),
    ]

    subscriptions_channel_size = forms.IntegerField(required=False)
@@ -91,6 +97,9 @@ class ApplicationSettingsForm(forms.Form):
    downloads_subtitle_source = forms.ChoiceField(
        widget=forms.Select, choices=SUBTITLE_SOURCE_CHOICES, required=False
    )
+    downloads_subtitle_index = forms.ChoiceField(
+        widget=forms.Select, choices=SUBTITLE_INDEX_CHOICES, required=False
+    )
    downloads_integrate_ryd = forms.ChoiceField(
        widget=forms.Select, choices=RYD_CHOICES, required=False
    )
--- a/tubearchivist/home/src/index/reindex.py
+++ b/tubearchivist/home/src/index/reindex.py
@@ -204,7 +204,9 @@ class Reindex:
        video.build_json()
        if not video.json_data:
            video.deactivate()
+            return

+        video.delete_subtitles()
        # add back
        video.json_data["player"] = player
        video.json_data["date_downloaded"] = date_downloaded
@@ -218,6 +220,7 @@ class Reindex:
        thumb_handler.delete_vid_thumb(youtube_id)
        to_download = (youtube_id, video.json_data["vid_thumb_url"])
        thumb_handler.download_vid([to_download], notify=False)
+        return

    @staticmethod
    def reindex_single_channel(channel_id):
--- a/tubearchivist/home/src/index/video.py
+++ b/tubearchivist/home/src/index/video.py
@@ -27,7 +27,8 @@ class YoutubeSubtitle:
    def sub_conf_parse(self):
        """add additional conf values to self"""
        languages_raw = self.video.config["downloads"]["subtitle"]
-        self.languages = [i.strip() for i in languages_raw.split(",")]
+        if languages_raw:
+            self.languages = [i.strip() for i in languages_raw.split(",")]

    def get_subtitles(self):
        """check what to do"""
@@ -61,6 +62,9 @@ class YoutubeSubtitle:
        video_media_url = self.video.json_data["media_url"]
        media_url = video_media_url.replace(".mp4", f"-{lang}.vtt")
        all_formats = all_subtitles.get(lang)
+        if not all_formats:
+            return False
+
        subtitle = [i for i in all_formats if i["ext"] == "vtt"][0]
        subtitle.update(
            {"lang": lang, "source": "auto", "media_url": media_url}
@@ -120,8 +124,9 @@ class YoutubeSubtitle:
            parser.process()
            subtitle_str = parser.get_subtitle_str()
            self._write_subtitle_file(dest_path, subtitle_str)
-            query_str = parser.create_bulk_import(self.video, source)
-            self._index_subtitle(query_str)
+            if self.video.config["downloads"]["subtitle_index"]:
+                query_str = parser.create_bulk_import(self.video, source)
+                self._index_subtitle(query_str)

    @staticmethod
    def _write_subtitle_file(dest_path, subtitle_str):
@@ -157,6 +162,7 @@ class SubtitleParser:
        self._parse_cues()
        self._match_text_lines()
        self._add_id()
+        self._timestamp_check()

    def _parse_cues(self):
        """split into cues"""
@@ -179,7 +185,8 @@ class SubtitleParser:
                clean = re.sub(self.stamp_reg, "", line)
                clean = re.sub(self.tag_reg, "", clean)
                cue_dict["lines"].append(clean)
-                if clean and clean not in self.all_text_lines:
+                if clean.strip() and clean not in self.all_text_lines[-4:]:
+                    # remove immediate duplicates
                    self.all_text_lines.append(clean)

        return cue_dict
@@ -199,11 +206,25 @@ class SubtitleParser:
                try:
                    self.all_text_lines.remove(line)
                except ValueError:
-                    print("failed to process:")
-                    print(line)
+                    continue

            self.matched.append(new_cue)

+    def _timestamp_check(self):
+        """check if end timestamp is bigger than start timestamp"""
+        for idx, cue in enumerate(self.matched):
+            # this
+            end = int(re.sub("[^0-9]", "", cue.get("end")))
+            # next
+            try:
+                next_cue = self.matched[idx + 1]
+            except IndexError:
+                continue
+
+            start_next = int(re.sub("[^0-9]", "", next_cue.get("start")))
+            if end > start_next:
+                self.matched[idx]["end"] = next_cue.get("start")
+
    def _add_id(self):
        """add id to matched cues"""
        for idx, _ in enumerate(self.matched):
@@ -404,7 +425,7 @@ class YoutubeVideo(YouTubeItem, YoutubeSubtitle):
            os.remove(file_path)

        self.del_in_es()
-        self._delete_subtitles()
+        self.delete_subtitles()

    def _get_ryd_stats(self):
        """get optional stats from returnyoutubedislikeapi.com"""
@@ -434,7 +455,7 @@ class YoutubeVideo(YouTubeItem, YoutubeSubtitle):
            self.json_data["subtitles"] = subtitles
            handler.download_subtitles(relevant_subtitles=subtitles)

-    def _delete_subtitles(self):
+    def delete_subtitles(self):
        """delete indexed subtitles"""
        data = {"query": {"term": {"youtube_id": {"value": self.youtube_id}}}}
        _, _ = ElasticWrap("ta_subtitle/_delete_by_query").post(data=data)
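
Review note on the `_timestamp_check` hunk in this file: it compares each cue's end with the next cue's start and shortens the end when the two overlap. A minimal standalone sketch of that logic, assuming cues are dicts with `start` and `end` strings in a zero-padded `HH:MM:SS.mmm` form; the names `stamp_to_int` and `clamp_overlaps` and the sample cues are hypothetical and not part of the commit:

```python
import re


def stamp_to_int(stamp):
    """Drop every non-digit character and read the rest as an integer."""
    return int(re.sub("[^0-9]", "", stamp))


def clamp_overlaps(cues):
    """Set a cue's end to the next cue's start when the two overlap."""
    # the last cue has no successor, so only iterate up to the second to last
    for idx, cue in enumerate(cues[:-1]):
        next_cue = cues[idx + 1]
        if stamp_to_int(cue["end"]) > stamp_to_int(next_cue["start"]):
            cue["end"] = next_cue["start"]
    return cues


cues = [
    {"start": "00:00:01.000", "end": "00:00:04.500"},
    {"start": "00:00:03.000", "end": "00:00:06.000"},
]
clamped = clamp_overlaps(cues)
```

Comparing the digit-only integers is only meaningful when both timestamps share the same fixed-width, zero-padded format, which this sketch assumes.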
--- a/tubearchivist/home/templates/home/channel_id.html
+++ b/tubearchivist/home/templates/home/channel_id.html
@@ -133,6 +133,7 @@
            {% endfor %}
        {% else %}
            <h2>No videos found...</h2>
+            <p>Try going to the <a href="{% url 'downloads' %}">downloads page</a> to start the scan and download tasks.</p>
        {% endif %}
    </div>
 </div>
--- a/tubearchivist/home/templates/home/home.html
+++ b/tubearchivist/home/templates/home/home.html
@@ -73,6 +73,7 @@
            {% endfor %}
        {% else %}
            <h2>No videos found...</h2>
+            <p>If you've already added a channel or playlist, try going to the <a href="{% url 'downloads' %}">downloads page</a> to start the scan and download tasks.</p>
        {% endif %}
    </div>
 </div>
--- a/tubearchivist/home/templates/home/playlist_id.html
+++ b/tubearchivist/home/templates/home/playlist_id.html
@@ -114,6 +114,7 @@
            {% endfor %}
        {% else %}
            <h2>No videos found...</h2>
+            <p>Try going to the <a href="{% url 'downloads' %}">downloads page</a> to start the scan and download tasks.</p>
        {% endif %}
    </div>
 </div>
--- a/tubearchivist/home/templates/home/settings.html
+++ b/tubearchivist/home/templates/home/settings.html
@@ -94,6 +94,9 @@
                <i>Embed thumbnail into the mediafile.</i><br>
                {{ app_form.downloads_add_thumbnail }}
            </div>
+        </div>
+        <div class="settings-group">
+            <h2 id="format">Subtitles</h2>
            <div class="settings-item">
                <p>Subtitles download setting: <span class="settings-current">{{ config.downloads.subtitle }}</span><br>
                <i>Choose which subtitles to download, add comma separated two letter language ISO code,<br>
@@ -105,12 +108,20 @@
                <i>Download only user generated, or also less accurate auto generated subtitles.</i><br>
                {{ app_form.downloads_subtitle_source }}
            </div>
+            <div class="settings-item">
+                <p>Index and make subtitles searchable: <span class="settings-current">{{ config.downloads.subtitle_index }}</span></p>
+                <i>Store subtitle lines in Elasticsearch. Not recommended for low-end hardware.</i><br>
+                {{ app_form.downloads_subtitle_index }}
+            </div>
        </div>
        <div class="settings-group">
            <h2 id="integrations">Integrations</h2>
            <div class="settings-item">
-                <p>API token:</p>
-                <p>{{ api_token }}</p>
+                <p>API token: <button type="button" onclick="textReveal()" id="text-reveal-button">Show</button></p>
+                <div id="text-reveal" class="description-text">
+                    <p>{{ api_token }}</p>
+                    <button class="danger-button" type="button" onclick="resetToken()">Revoke</button>
+                </div>
            </div>
            <div class="settings-item">
                <p>Integrate with <a href="https://returnyoutubedislike.com/">returnyoutubedislike.com</a> to get dislikes and average ratings back: <span class="settings-current">{{ config.downloads.integrate_ryd }}</span></p>
--- a/tubearchivist/home/views.py
+++ b/tubearchivist/home/views.py
@@ -715,7 +715,6 @@ class SettingsView(View):
        """get existing or create new token of user"""
        # pylint: disable=no-member
        token = Token.objects.get_or_create(user=request.user)[0]
-        print(token)
        return token

    @staticmethod
@@ -758,6 +757,11 @@ def process(request):
    if request.method == "POST":
        current_user = request.user.id
        post_dict = json.loads(request.body.decode())
+        if post_dict.get("reset-token"):
+            print("revoke API token")
+            request.user.auth_token.delete()
+            return JsonResponse({"success": True})
+
        post_handler = PostData(post_dict, current_user)
        if post_handler.to_exec:
            task_result = post_handler.run_task()
--- a/tubearchivist/static/script.js
+++ b/tubearchivist/static/script.js
@@ -235,6 +235,14 @@ function findPlaylists(button) {
    }, 500);
 }

+function resetToken() {
+    var payload = JSON.stringify({'reset-token': true});
+    sendPost(payload);
+    var message = document.createElement("p");
+    message.innerText = "Token revoked";
+    document.getElementById("text-reveal").replaceWith(message);
+}
+
 // delete from file system
 function deleteConfirm() {
    to_show = document.getElementById("delete-button");