• Re: The perils of writing your own newsreader - the perils of posting to moderated newsgroups

    From Maria Sophia@mariasophia@comprehension.com to alt.free.newsservers,news.software.readers,alt.comp.os.windows-11 on Thu Mar 12 10:10:31 2026
    From Newsgroup: alt.comp.os.windows-11

    Several kind posters have helped me track down why my replies sometimes
    get corrupted when responding to posts from Winston, so I want to
    summarize the findings for anyone who is interested.

    While anything I say below can be wrong, it's what I "think" is happening...

    1. Winston types his display name using Windows Alt-codes.
    These produce raw Windows-1252 bytes:
    A1 = ¡
    F1 = ñ
    A7 = §
    B1 = ±
    A4 = ¤
    His full display name is literally:
    ...w¡ñ§±¤ñ

    2. These bytes are legal in Windows-1252, but not legal in a Usenet header.
    Usenet headers must be 7-bit ASCII unless they use a MIME encoded-word.
    Winston’s header contains raw 8-bit bytes, not ASCII and not UTF-8.

    3. Thunderbird displays those bytes as-is.
    Thunderbird neither sanitizes nor re-encodes the header on send.
    In the message viewer, Thunderbird shows:
    ...w¡ñ§±¤ñ <winstonmvp@gmail.com>
    When viewing the raw source, Thunderbird shows a MIME-encoded version,
    but that is Thunderbird's internal representation, not what was sent.

    4. My own workflow is strict ASCII.
    I enforce 7-bit output. When I quote Winston, his raw 8-bit bytes get
    copied into my attribution line. That can create a mismatch between the
    declared charset and the actual bytes in my outgoing post, which shows
    up as mojibake.

    5. Some NNTP servers may (apparently) try to 'repair' that mismatch.
    Different NNTP servers may handle illegal bytes differently. Some
    may rewrite the charset, some might re-encode the body, and some
    may simply corrupt the article into mojibake scrambled eggs.
    I think that is why my replies sometimes get mangled on the way out.

    6. By experiment, ASCII mode works better for me than UTF-8 mode.
    When I declare US-ASCII and strip all non-ASCII before posting, the
    article is internally consistent and servers seem to not interfere.
    When I declare UTF-8, servers appear to validate the bytes and try to
    fix whatever is not valid UTF-8, which may lead to unpredictable results.
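
    The byte-level claims in items 1, 2 and 6 can be checked with a short
    Python sketch. This is only an illustration, not the tooling used in
    this thread: the helper names are made up, and the byte values are the
    ones listed in item 1. It shows that the bytes are neither 7-bit ASCII
    nor valid UTF-8, and that an RFC 2047 encoded-word is one way to carry
    the same characters in an ASCII-only header.

```python
from email.header import Header, decode_header

# The six non-ASCII bytes from item 1, as Windows-1252 would store them.
RAW_NAME_BYTES = bytes([0xA1, 0xF1, 0xA7, 0xB1, 0xA4, 0xF1])


def is_seven_bit(data: bytes) -> bool:
    """True when every byte is in the 7-bit ASCII range."""
    return all(b < 0x80 for b in data)


def is_valid_utf8(data: bytes) -> bool:
    """True when the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def to_encoded_word(text: str) -> str:
    """Wrap text in an RFC 2047 encoded-word so the header stays ASCII."""
    return Header(text, charset="utf-8").encode()


def from_encoded_word(value: str) -> str:
    """Decode a header value made of encoded-words back to text."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            parts.append(chunk.decode(charset or "ascii"))
        else:
            parts.append(chunk)
    return "".join(parts)


# Interpreting the raw bytes as Windows-1252 recovers the intended characters.
name = RAW_NAME_BYTES.decode("cp1252")
header_safe = to_encoded_word(name)
```

    Here is_seven_bit(RAW_NAME_BYTES) and is_valid_utf8(RAW_NAME_BYTES) are
    both False, which matches the observation that declaring either
    US-ASCII or UTF-8 for these bytes is inconsistent. header_safe is pure
    ASCII and decodes back to the same characters with from_encoded_word.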

    Since this is one of the perils of writing your own newsreader, I am
    adding a normalization step in my shortcuts.xml so that any non-ASCII
    bytes in the attribution line are removed or replaced before posting.
    This keeps my outgoing articles 7-bit clean and prevents NNTP servers
    from rewriting them. Modern newsreaders already do this automatically,
    so this probably affects only older strict-ASCII workflows like mine.
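
    For illustration only, here is what such a normalization step could
    look like in Python. The real step lives in shortcuts.xml and is not
    shown in this thread, so the function name, the "?" placeholder and the
    "wrote:" ending of the sample attribution line are all assumptions. The
    sketch keeps ASCII characters, reduces accented letters to their base
    letter, and replaces everything else.

```python
import unicodedata


def ascii_clean(line: str, placeholder: str = "?") -> str:
    """Return a 7-bit ASCII version of an attribution line.

    Accented letters are reduced to their base letter; any other
    non-ASCII character is replaced by the placeholder.
    """
    # NFKD splits characters such as U+00F1 into a base letter plus a
    # combining mark, so the base letter can be kept.
    decomposed = unicodedata.normalize("NFKD", line)
    out = []
    for ch in decomposed:
        if ord(ch) < 0x80:
            out.append(ch)
        elif unicodedata.combining(ch):
            continue  # drop the accent, the base letter is already kept
        else:
            out.append(placeholder)
    return "".join(out)


# Hypothetical attribution line built from the display name quoted above.
attribution = "...w\u00a1\u00f1\u00a7\u00b1\u00a4\u00f1 <winstonmvp@gmail.com> wrote:"
cleaned = ascii_clean(attribution)
```

    With this sketch the cleaned line contains only 7-bit characters, so a
    US-ASCII charset declaration matches the bytes actually sent. Passing
    placeholder="" removes the unmappable characters instead of replacing
    them.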

    Thanks to everyone who helped test this from the recipient's side. The
    problem is now better understood & the workaround on my end is ongoing.
    --
    There are 2 types of posters on Usenet, only half of which can add value.
    --- Synchronet 3.21d-Linux NewsLink 1.2
  • From Phil Boutros@philb@philb.ca to alt.free.newsservers,news.software.readers,alt.comp.os.windows-11 on Fri Mar 13 18:57:41 2026
    From Newsgroup: alt.comp.os.windows-11

    Maria Sophia <mariasophia@comprehension.com> wrote:
    <snip>

    > I need to stress that out of a million people, only two or three know
    > what is in that thread I listed above, so I understand that most people
    > wouldn't have any idea that the Linux/macOS/Windows clipboard is so
    > complicated.
    > But it is complicated too.

    You think only "two or three" people out of a million, who hang
    out on Usenet in 2026, in a software-related group no less (I'm
    reading this on news.software.readers) understand how character
    encoding and decoding works?

    <snip a whole heap of macro>

    And that it takes about a thousand lines to move from one encoding
    to another? Most modern programming languages can do this trivially
    in one operation. You should only have to decode based on the
    specified encoding, then encode to your specified encoding. Literally
    one line per operation. This is specifically why Unicode is used.
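
    As a minimal sketch of that decode-then-encode step, here it is in
    Python. Python is only an illustrative choice, and transcode is a
    hypothetical helper name, not part of any newsreader discussed here.

```python
def transcode(data: bytes, source: str, target: str, errors: str = "strict") -> bytes:
    """Decode with the declared source charset, then encode to the target."""
    return data.decode(source).encode(target, errors=errors)


# Two Windows-1252 bytes (inverted exclamation mark, n with tilde).
cp1252_bytes = bytes([0xA1, 0xF1])

# Lossless conversion to UTF-8.
utf8_bytes = transcode(cp1252_bytes, "cp1252", "utf-8")

# Lossy conversion to ASCII: unmappable characters become "?".
ascii_bytes = transcode(cp1252_bytes, "cp1252", "ascii", errors="replace")
```

    The only real decision is the error policy for the target charset:
    "strict" raises an exception on characters the target cannot represent,
    while "replace" substitutes a question mark.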

    Am I missing something obvious? In what language is your
    newsreader written?


    Phil
    --
    AH#61 Wolf#14 BS#89 bus#1 CCB#1 SENS KOTC#4
    philb@philb.ca http://philb.ca