The Need for Project DANSO
I have benchmarked different ways of translating the document.
1. Naive Document Translation
Works pretty well, but sometimes translates the MDX tags. See the <Intro> tag at the beginning of the document.
But we cannot ignore these <tag> lines altogether, because they sometimes include strings, such as <BlogPost title="What is React?" />. In this case, we want to translate only the string inside.
It also translates proper nouns such as "Server Actions" into Korean, which severely confuses the reader.
With some prompting, both can be fixed pretty easily.
However, this approach has one critical flaw: GPT is terrible at transcription, that is, at copying text through unchanged.
In the examples I looked at, it modified a code block, either introducing a syntax error or silently changing what the code does.
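One way to catch this kind of silent corruption is to compare the fenced code blocks before and after translation. The following is a minimal Python sketch, not part of DANSO itself. The names code_blocks and changed_code_blocks are hypothetical, and it assumes code blocks use triple-backtick fences and are expected to come back character-for-character identical.

```python
import re

# Three backticks, built at runtime so this example can itself sit inside a fence.
FENCE = "`" * 3

# A fenced block: an opening fence line, any content, and a closing fence line.
# Assumption: only backtick fences are used (no ~~~ fences or indented code).
FENCE_RE = re.compile(
    rf"^{FENCE}[^\n]*\n.*?^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def code_blocks(markdown: str) -> list[str]:
    """Return every fenced code block, fences included, in document order."""
    return FENCE_RE.findall(markdown)


def changed_code_blocks(source: str, translated: str) -> list[int]:
    """Return the indexes of code blocks that differ after translation.

    If the number of blocks differs, every index is reported, because the
    blocks can no longer be paired up reliably.
    """
    src, out = code_blocks(source), code_blocks(translated)
    if len(src) != len(out):
        return list(range(max(len(src), len(out))))
    return [i for i, (a, b) in enumerate(zip(src, out)) if a != b]
```

A non-empty result means the model touched code it should have copied verbatim, so that translation can be rejected or retried.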
2. Remark Parsing and Translating
This severely limits the context available to the AI translator.
Parsing down to the HTML tag level worsens the translation quality, because the AI sometimes needs to see the sentence as a whole in order to place the Markdown syntax appropriately.
For example, a sentence like the following
You [need to use](/some-supporting-doc) this because of [this](/some-youtube-video)
should be translated to
[이것](/some-youtube-video) 때문에 [이렇게 사용하셔야](/some-supporting-doc) 합니다.
Meanwhile, if you parse it down to the HTML tag level, the AI won't be able to reorder the parts of the sentence. Instead, it will translate it as:
당신은 [사용이 필요하다](/some-supporting-doc) 이것 때문에 [이것](/some-youtube-video)
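To make the limitation concrete, here is a rough Python sketch of what a translator is handed once the sentence has been split at the node level. The regex-based split_into_nodes helper is a hypothetical stand-in for a real Remark parse, and it only handles inline links.

```python
import re

# Inline Markdown link: [label](url). Nested brackets are not handled.
LINK_RE = re.compile(r"\[([^\]]*)\]\(([^)]*)\)")


def split_into_nodes(paragraph: str) -> list[tuple[str, str]]:
    """Split a paragraph into ("text", ...) and ("link", label) nodes, in source order."""
    nodes: list[tuple[str, str]] = []
    pos = 0
    for match in LINK_RE.finditer(paragraph):
        if match.start() > pos:
            nodes.append(("text", paragraph[pos:match.start()]))
        nodes.append(("link", match.group(1)))
        pos = match.end()
    if pos < len(paragraph):
        nodes.append(("text", paragraph[pos:]))
    return nodes


sentence = (
    "You [need to use](/some-supporting-doc) this "
    "because of [this](/some-youtube-video)"
)
for kind, text in split_into_nodes(sentence):
    print(kind, repr(text))
```

Each node is translated in isolation and written back into its original slot, so the two links stay in their English order no matter what the target language's word order calls for.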
3. Solution
I need to write an MDX parser. It will shallowly parse the document into the following elements:
- Frontmatter. Only translate strings, such as title and description.
- MDX Tags. Only translate inner strings.
- Code. Extract the comments and only translate the comments (or just don't).
- Paragraph. Provide the raw Markdown text as a whole.
Then each type of entity will be translated using its corresponding translation logic.
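As a rough illustration of the idea, the sketch below splits a document into those four kinds of blocks and pulls out only the strings that would be sent to the translator. It is written in Python for brevity and is not DANSO's actual implementation. Block, shallow_parse and translatable_strings are hypothetical names, and it assumes that MDX tags sit on their own line, that frontmatter is delimited by lines of three dashes, and that code blocks are left untranslated.

```python
import re
from dataclasses import dataclass

FENCE = "`" * 3  # three backticks, built at runtime so the example nests in a fence
ATTR_STRING_RE = re.compile(r'=\s*"([^"]*)"')  # double-quoted attribute values


@dataclass
class Block:
    kind: str  # "frontmatter", "mdx_tag", "code", or "paragraph"
    text: str


def shallow_parse(doc: str) -> list[Block]:
    """Split a document into top-level blocks without parsing inside paragraphs."""
    lines = doc.split("\n")
    blocks: list[Block] = []
    paragraph: list[str] = []
    i = 0

    def flush() -> None:
        if paragraph:
            blocks.append(Block("paragraph", "\n".join(paragraph)))
            paragraph.clear()

    # Frontmatter: a leading block delimited by lines of three dashes.
    if lines and lines[0].strip() == "---":
        end = next((j for j in range(1, len(lines)) if lines[j].strip() == "---"), None)
        if end is not None:
            blocks.append(Block("frontmatter", "\n".join(lines[: end + 1])))
            i = end + 1

    while i < len(lines):
        stripped = lines[i].strip()
        if stripped.startswith(FENCE):
            flush()
            j = i + 1
            while j < len(lines) and not lines[j].strip().startswith(FENCE):
                j += 1
            blocks.append(Block("code", "\n".join(lines[i : j + 1])))
            i = j + 1
            continue
        if stripped.startswith("<") and stripped.endswith(">"):
            flush()
            blocks.append(Block("mdx_tag", lines[i]))
        elif stripped == "":
            flush()  # a blank line ends the current paragraph
        else:
            paragraph.append(lines[i])
        i += 1
    flush()
    return blocks


def translatable_strings(block: Block) -> list[str]:
    """Return only the strings of a block that should go to the translator."""
    if block.kind == "frontmatter":
        found = []
        for line in block.text.split("\n"):
            key, sep, value = line.partition(":")
            if sep and key.strip() in ("title", "description"):
                found.append(value.strip())
        return found
    if block.kind == "mdx_tag":
        return ATTR_STRING_RE.findall(block.text)
    if block.kind == "paragraph":
        return [block.text]  # the whole paragraph, Markdown syntax included
    return []  # code: take the "just don't" option and leave it untouched
```

Paragraphs are passed through whole, so the translator is still free to reorder links within a sentence, while code and tag names never reach the model at all.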
Thus, I have started Project DANSO.