<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>txtv &#8212; Jerry of the Week</title>
    <link>https://write.in0rdr.ch/tag:txtv</link>
    <description>ˈdʒɛri - Individual who lives life against the grain no matter the consequences</description>
    <pubDate>Sun, 26 Apr 2026 15:42:58 +0000</pubDate>
    <item>
      <title>SRF Teletext Parsing with txtv</title>
      <link>https://write.in0rdr.ch/h1srf-teletext-parsing-with-txtv-h1</link>
      <description>I spent some time with txtv, a neat little Python CLI for reading Swedish Teletext news in the terminal, and patched it to read from the Swiss 🇨🇭 Teletext source (SRF) instead: https://github.com/voidcase/txtv/pull/15. This post walks through the undocumented api.teletext.ch API, the parsing rules for overview and content pages, and their remaining edge cases. #coding #python #txtv</description>
      <content:encoded><![CDATA[<p>I spent some time with a neat little CLI application: <code>txtv</code></p>

<p>This app, written in Python, lets you easily read the latest Swedish <a href="https://en.wikipedia.org/wiki/Teletext">Teletext</a> news. It&#39;s “a client for reading swedish text tv in the terminal”.</p>

<p>I slightly modified the code so it reads from the Swiss 🇨🇭 Teletext source (SRF):</p>

<p><a href="https://github.com/voidcase/txtv/pull/15">https://github.com/voidcase/txtv/pull/15</a></p>

<p><a href="https://write.in0rdr.ch/tag:coding" class="hashtag"><span>#</span><span class="p-category">coding</span></a> <a href="https://write.in0rdr.ch/tag:python" class="hashtag"><span>#</span><span class="p-category">python</span></a> <a href="https://write.in0rdr.ch/tag:txtv" class="hashtag"><span>#</span><span class="p-category">txtv</span></a></p>



<p>I could figure out the URL for the API by observing how the <code>.gif</code> pictures for the teletext.ch website are generated from the API while browsing through the txt pages:</p>

<p><img src="https://code.in0rdr.ch/pub/blog/txt-api.png" alt="gif page created from api.teletext.ch"/></p>
<ul><li><a href="https://www.teletext.ch">https://www.teletext.ch</a></li>
<li><code>https://api.teletext.ch/channels/SRF1/pages/{num}</code></li></ul>
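
<p>The per-page URL can be put together from the channel and the page number; a minimal Python sketch (the helper name <code>page_url</code> is mine, not part of txtv):</p>

<pre><code class="language-python">def page_url(num: int, channel: str = "SRF1") -> str:
    # Build the (undocumented) API URL for a given teletext page,
    # e.g. page 130 on channel SRF1.
    return f"https://api.teletext.ch/channels/{channel}/pages/{num}"

print(page_url(130))
# https://api.teletext.ch/channels/SRF1/pages/130
</code></pre>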

<p>Unfortunately, I did not find proper documentation for this API 😢.</p>

<p>Reading a bit of JavaScript in the browser debugging console helped me figure out the path structure of the <a href="https://api.teletext.ch">https://api.teletext.ch</a> API a bit better.</p>

<p>By the way, on my mobile I typically read <a href="https://m.txt.ch/SRF1">https://m.txt.ch/SRF1</a> (but that one does not really work well in a desktop browser).</p>

<p>After playing with the API for a while, I realized that there is not really an easy way to parse the structure of such a txt page from specific API fields.</p>

<p><img src="https://code.in0rdr.ch/pub/blog/txt-structure.png" alt="content and structure of a txt page from the API"/></p>

<p>I decided to only use the <code>content</code> field, because it already displays the Swiss German special characters and umlauts quite well. Also, the <code>header</code> and other fields like <code>commandRow</code> were not really helpful for parsing the content of a page (e.g., for separating out title text):</p>

<pre><code class="language-bash">curl -s https://api.teletext.ch/channels/SRF1/pages/130 | jq -r &#39;.subpages[0].ep1Info.data.ep1Format.header&#39;
ICAgICAgICAgICAgICBQMDEgICAgICAgICAgICAgICAgICAgICAgRA==

curl -s https://api.teletext.ch/channels/SRF1/pages/130 | jq -r &#39;.subpages[0].ep1Info.data.ep1Format.header&#39; | base64 -d
              P01                      D
</code></pre>
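
<p>The same decoding step can be done in Python; a small sketch using the header value from the curl output above:</p>

<pre><code class="language-python">import base64

# The header field is base64-encoded; decoded, it is the fixed-width
# top row of the page (mostly padding spaces around "P01" and "D").
header_b64 = "ICAgICAgICAgICAgICBQMDEgICAgICAgICAgICAgICAgICAgICAgRA=="
header = base64.b64decode(header_b64).decode("utf-8")
print(header.split())
# ['P01', 'D']
</code></pre>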

<p>Teletext pages generally follow a specific structure.</p>

<p>There are some overview pages with links to subpages; the actual content can then be explored on these subpages.</p>

<p>The <code>txtv</code> CLI also roughly supports these two types of pages.</p>

<p>By default, <code>txtv</code> starts in “interactive mode” when not given a specific argument (e.g., a specific page number to read):</p>

<pre><code class="language-bash">[andi@nixos:~]$ txtv
WHO: Kontakt nach Gaza gestört.......133
UNO bestätigt Gaza-Resolution........134
Maine: Mutmasslicher Schütze tot.....140
Frauen-Nati unterliegt Schweden......183
Lugano siegt bei ZSC Lions...........187
05:05 Gredig direkt/UT...............734
05:40 Hoch hinaus/UT.................735
06:25 News-Schlagzeilen 07:30 Wetterkanal KURZÜBERSICHT101
INLAND...............................104
SPORT................................180
AUSLAND..............................130
METEO................................500
WIRTSCHAFT...........................150
TV&amp;RADIO.............................700
</code></pre>

<p>Very nice 😎 Not perfect, but still useful.</p>

<p>Now, for my patch I concentrated on displaying the information on two pages:</p>
<ul><li><code>INLAND 104</code></li>
<li><code>AUSLAND 130</code></li></ul>

<p>I also did some parsing regarding the “TV&amp;RADIO” section but gave up on over-engineering that part after some time 🤓.</p>

<pre><code class="language-bash">[andi@nixos:~]$ txtv 104
Sie folgt der Schwesterkommission....107
Aktionsplan Schweiz und Frankreich...108
Piste in Zermatt.....................109
Eine Stunde länger einkaufen.........110

[andi@nixos:~]$ txtv 130
WHO: Kein Kontakt zu Mitarbeitenden..133
UNO-Vollversammlung Gaza-Resolution..134
Reaktionen auf UNO-Resolution........135
Gaza: Kein Internet nach Angriffen...136
Weitere Hilfsgüter für Gazastreifen..137
Israel greift........................250
Hamas-Ziele an.......................138
Unesco fordert Schutz der Schulen.. 139MAINE/USA: Polizei findet Leiche140
</code></pre>

<p>As you can see from the examples above, the parsing rules still have some issues, because I can only rely on a few basic assumptions about how the overview txt pages (e.g., 104 and 130) are structured. For instance, the current parsing is flawed when a page title includes a three-digit number, for example “Israel greift 250 Hamas-Ziele an”:</p>

<pre><code>Israel greift........................250
Hamas-Ziele an.......................138
</code></pre>

<p>The problem here is that 250 is interpreted as a page number when it should not be.</p>

<p><img src="https://code.in0rdr.ch/pub/blog/txt-pages.png" alt="picture of txt pages 138 and 150"/></p>

<p>Unfortunately, I&#39;m not aware of a solution for this, except for finding a smarter parsing rule:</p>

<pre><code class="language-python"># Find all three-digit numbers; most probably these are page numbers.
# The two groups cover plain numbers as well as "123-456" / "123/456" pairs.
page_nrs = re.findall(r&#39;\s(\d{3})*[-\/]*(\d{3})([^\d]|$)&#39;, stories)
all_page_nrs = []

for p in page_nrs:
    # Either captured number group may be empty, in which case
    # int() raises a ValueError.
    for group in p[:2]:
        try:
            all_page_nrs.append(int(group))
        except ValueError:
            pass

all_page_nrs = [str(p) for p in all_page_nrs]
</code></pre>
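
<p>The false positive can be reproduced with the rule in isolation (the sample line below is hypothetical, with the numbers preceded by whitespace):</p>

<pre><code class="language-python">import re

# A title that itself contains a three-digit number: the rule picks
# up both 250 (part of the title) and 138 (the real page number).
stories = "Israel greift 250 Hamas-Ziele an 138\n"
matches = re.findall(r'\s(\d{3})*[-\/]*(\d{3})([^\d]|$)', stories)
print([m[1] for m in matches])
# ['250', '138']
</code></pre>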

<p>I experimented with quite a few variations, but gave up after some time, because each has its benefits and pitfalls: a rule would help in some cases but perform worse in another edge case.</p>

<p>So, if you have any suggestions on how to solve that problem of “missing intent” while parsing basically “generic text”, let me know. I heard that in this age of AI and machine learning we should be able to do better, shouldn&#39;t we?</p>

<p>The same problem also applies when parsing the titles and subtitles on the overview pages. The <code>txtv ls</code> command is my favorite command. It simply lists all the subpages of pages 104 (INLAND) and 130 (AUSLAND):</p>

<p><img src="https://code.in0rdr.ch/pub/blog/txt-output.png" alt="txtv ls command output"/></p>

<p>Here, as well, my parsing rule is flawed when it comes to actual page titles containing more than a few capital letters in a row (3-4 capital characters are typically used for abbreviations):</p>

<pre><code class="language-python">if self.num == 100 or self.num == 700:
    # Remove actual titles
    stories = re.sub(&#34;Jetzt auf SRF 1&#34;, &#34;&#34;, stories)
    stories = re.sub(&#34;JETZT AUF SRF 1&#34;, &#34;&#34;, stories)
    stories = re.sub(&#34;TELETEXT SRF 1&#34;, &#34;&#34;, stories)
else:
    # Remove all uppercase subtitles. There can be multiple
    # subtitles on a page (subtitle, stories, subtitle, stories, etc)
    stories = regex.sub(r&#39;[\p{Lu}\s-]{9,}[\s:]&#39;, &#39;&#39;, self.content)
</code></pre>
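
<p>The same subtitle rule can be sketched with the stdlib <code>re</code> module and an explicit uppercase class instead of <code>\p{Lu}</code> (the class below only covers A-Z plus German umlauts, so it is an approximation):</p>

<pre><code class="language-python">import re

# Strip runs of nine or more uppercase letters, spaces and hyphens
# followed by a space or colon, i.e. all-caps subtitle lines.
subtitle = re.compile(r'[A-ZÄÖÜ\s-]{9,}[\s:]')
line = "NAHOST-KONFLIKT WHO: Kein Kontakt zu Mitarbeitenden"
print(subtitle.sub('', line).strip())
# Kein Kontakt zu Mitarbeitenden
</code></pre>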

<p>So in general, I am quite happy with the outcome. It gives me a fast and smart way to read the news on any terminal-enabled device with Internet access 🤓. Also, the parsing rules on the individual detail/content subpages for extracting the category and date work flawlessly (except maybe for the more special pages/categories like SPORT and TV&amp;RADIO).</p>

<p>As a future improvement, I am currently thinking about also using the <code>links</code> field of the API output to extract the set of real page numbers, which would at least help with the first problem presented in this article (titles that contain three-digit numbers):</p>

<pre><code class="language-bash">curl -s https://api.teletext.ch/channels/SRF1/pages/130 | jq -r &#39;.subpages[0].ep1Info.links&#39;
</code></pre>
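
<p>With such a set of linked page numbers, the three-digit candidates could be filtered down to real pages; a sketch of that idea (the exact shape of the <code>links</code> field is an assumption here, since the API is undocumented):</p>

<pre><code class="language-python">def filter_page_nrs(candidates, links):
    # Keep only the three-digit candidates that the API actually
    # links to, dropping numbers that are just part of a title.
    known = {str(nr) for nr in links}
    return [nr for nr in candidates if nr in known]

print(filter_page_nrs(["250", "138"], [133, 134, 135, 136, 137, 138]))
# ['138']
</code></pre>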

<p>For any page without links, I could simply apply the “content page” rendering, which has already shown to work pretty well.</p>

<div style="text-align:center; font-size: 0.8em">
<a href="https://write.in0rdr.ch/feed">🛜 RSS</a> | <a href="https://m.in0rdr.ch/in0rdr">🐘 Fediverse</a> | <a href="https://chat.in0rdr.ch/#/guest?join=p0c@conference.in0rdr.ch">💬 XMPP</a>
</div>
]]></content:encoded>
      <guid>https://write.in0rdr.ch/h1srf-teletext-parsing-with-txtv-h1</guid>
      <pubDate>Sat, 28 Oct 2023 02:34:09 +0000</pubDate>
    </item>
  </channel>
</rss>