Preface

This article is the second installment of the 动手学Agent (Hands-On Agents) series: my notes on Andrew Ng's translation agent.

Code structure

The most important code lives in ./src/translation_agent/utils.py, which defines the entire translation_agent library. Its main function is rewritten below as a commented skeleton:

# MAX_TOKENS_PER_CHUNK = 1000, used throughout this post
def translate(source_lang, target_lang, source_text, country,
              max_tokens=MAX_TOKENS_PER_CHUNK):
    # count the tokens of the source text
    num_tokens_in_text = num_tokens_in_string(source_text)
    if num_tokens_in_text < max_tokens:
        # short enough: translate the whole text as one chunk
        return one_chunk_translate_text(source_lang, target_lang, source_text, country)
    else:
        # too long: translate chunk by chunk, then join the results
        token_size = calculate_chunk_size(num_tokens_in_text, max_tokens)
        # define the splitter with langchain (see below) and split the text
        source_text_chunks = text_splitter.split_text(source_text)
        # translate every chunk and join them
        return "".join(multichunk_translation(
            source_lang, target_lang, source_text_chunks, country))

Counting the tokens of a text

def num_tokens_in_string(
    input_str: str,
    encoding_name: str = "cl100k_base",
) -> int

The encoder is cl100k_base, the one GPT-4 uses; it can be swapped for another.

Translating the text as a single chunk

def one_chunk_translate_text(
    source_lang: str,
    target_lang: str,
    source_text: str,
    country: str = ""
) -> str

This function completes the translation in a linear three-stage pipeline: initial, reflect, improve.
Amusingly, Andrew Ng worried that GPT-4 might not return well-formed JSON and guarded against it with an if-else; this could also be addressed by strictly specifying a JSON output format in each of the prompts below.
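The three stages can be sketched as a simple chain. Note this is a minimal sketch, not the repo's code: get_completion below is a hypothetical stand-in for the library's LLM wrapper, injected as a parameter so the sketch stays self-contained, and the prompt strings are heavily abbreviated.

```python
def one_chunk_pipeline(source_lang, target_lang, source_text, get_completion, country=""):
    # Stage 1 (initial): a first-pass translation.
    translation_1 = get_completion(
        f"Translate this {source_lang} text into {target_lang}:\n{source_text}"
    )
    # Stage 2 (reflect): the model critiques its own translation.
    reflection = get_completion(
        f"List suggestions to improve this {target_lang} translation"
        + (f" for readers in {country}" if country else "")
        + f":\n{translation_1}\nSource:\n{source_text}"
    )
    # Stage 3 (improve): the model rewrites the translation using the critique.
    translation_2 = get_completion(
        f"Improve the translation below using these suggestions:\n{reflection}\n"
        f"Translation:\n{translation_1}\nSource:\n{source_text}"
    )
    return translation_2
```

Because the LLM call is injected, the chain can be unit-tested with a stub before wiring in a real API client.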
The prompt for each stage is as follows:

initial

A simple role assignment + output template + constraints:

system_message = f"You are an expert linguist, specializing in translation from {source_lang} to {target_lang}."

translation_prompt = f"""This is an {source_lang} to {target_lang} translation, please provide the {target_lang} translation for this text. Do not provide any explanations or text apart from the translation.
{source_lang}: {source_text}

{target_lang}:"""

reflect

The system_message is:

system_message = f"You are an expert linguist specializing in translation from {source_lang} to {target_lang}. \
You will be provided with a source text and its translation and your goal is to improve the translation."

Depending on whether a country is given, there are two versions of the translation prompt:

When no country is given:

The prompt uses XML tags to delimit its inputs unambiguously; beyond that it is the familiar mix of detailed requirements and format constraints:

reflection_prompt = f"""Your task is to carefully read a source text and a translation from {source_lang} to {target_lang}, and then give constructive criticisms and helpful suggestions to improve the translation. \

The source text and initial translation, delimited by XML tags <SOURCE_TEXT></SOURCE_TEXT> and <TRANSLATION></TRANSLATION>, are as follows:

<SOURCE_TEXT>
{source_text}
</SOURCE_TEXT>

<TRANSLATION>
{translation_1}
</TRANSLATION>

When writing suggestions, pay attention to whether there are ways to improve the translation's \n\
(i) accuracy (by correcting errors of addition, mistranslation, omission, or untranslated text),\n\
(ii) fluency (by applying {target_lang} grammar, spelling and punctuation rules, and ensuring there are no unnecessary repetitions),\n\
(iii) style (by ensuring the translations reflect the style of the source text and take into account any cultural context),\n\
(iv) terminology (by ensuring terminology use is consistent and reflects the source text domain; and by only ensuring you use equivalent idioms {target_lang}).\n\

Write a list of specific, helpful and constructive suggestions for improving the translation.
Each suggestion should address one specific part of the translation.
Output only the suggestions and nothing else."""

When a country is given

This version adds a single extra requirement on top of the prompt above: the translation should match the target language as colloquially spoken in the given country:

The final style and tone of the translation should match the style of {target_lang} colloquially spoken in {country}.

improve

Techniques: XML delimiters, detailed requirements, constraints

system_message = f"You are an expert linguist, specializing in translation editing from {source_lang} to {target_lang}."

prompt = f"""Your task is to carefully read, then edit, a translation from {source_lang} to {target_lang}, taking into account a list of expert suggestions and constructive criticisms.

The source text, the initial translation, and the expert linguist suggestions are delimited by XML tags <SOURCE_TEXT></SOURCE_TEXT>, <TRANSLATION></TRANSLATION> and <EXPERT_SUGGESTIONS></EXPERT_SUGGESTIONS> \
as follows:

<SOURCE_TEXT>
{source_text}
</SOURCE_TEXT>

<TRANSLATION>
{translation_1}
</TRANSLATION>

<EXPERT_SUGGESTIONS>
{reflection}
</EXPERT_SUGGESTIONS>

Please take into account the expert suggestions when editing the translation. Edit the translation by ensuring:

(i) accuracy (by correcting errors of addition, mistranslation, omission, or untranslated text),
(ii) fluency (by applying {target_lang} grammar, spelling and punctuation rules and ensuring there are no unnecessary repetitions), \
(iii) style (by ensuring the translations reflect the style of the source text)
(iv) terminology (inappropriate for context, inconsistent use), or
(v) other errors.

Output only the new translation and nothing else."""

That covers the single-chunk translation flow; next comes the multi-chunk pipeline.

Calculating the chunk size

calculate_chunk_size(token_count=num_tokens_in_text, token_limit=max_tokens)

The function's logic is as follows. Let T be token_count and L be token_limit:

  1. Compute the number of chunks:
    $$N=\left\lceil\frac TL\right\rceil $$
  2. Compute the base size of each chunk:
    $$S_a=\left\lfloor\frac TN\right\rfloor $$
  3. Compute the leftover tokens:
    $$R = T \mod L$$
  4. Spread the leftover tokens evenly over the base size:
    $$S_b=\left\lfloor\frac{R}{N}\right\rfloor$$

$$S=S_a+S_b$$
For example, with $T=100,L=30$:
$$N=\left\lceil\frac TL\right\rceil=\left\lceil\frac{100}{30}\right\rceil=\left\lceil3.33\right\rceil=4$$
$$S_a=\left\lfloor\frac TN\right\rfloor=\left\lfloor\frac{100}4\right\rfloor=\lfloor25\rfloor=25$$
$$R=T\mod L=100\mod30=10$$
$$S_b=\left\lfloor\frac{R}{N}\right\rfloor=\left\lfloor\frac{10}{4}\right\rfloor=\lfloor2.5\rfloor=2$$
$$S=S_a+S_b=25+2=27$$
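The four steps translate directly into code; a sketch following the formulas above (mirroring the repo's calculate_chunk_size):

```python
import math

def calculate_chunk_size(token_count: int, token_limit: int) -> int:
    # Texts that already fit need no splitting.
    if token_count <= token_limit:
        return token_count
    num_chunks = math.ceil(token_count / token_limit)  # N = ceil(T / L)
    chunk_size = token_count // num_chunks             # S_a = floor(T / N)
    remaining = token_count % token_limit              # R = T mod L
    chunk_size += remaining // num_chunks              # S_b = floor(R / N)
    return chunk_size                                  # S = S_a + S_b
```

With the worked example above, calculate_chunk_size(100, 30) returns 27.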

Defining the splitter with langchain

A langchain library call; nothing much to add:

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
model_name="gpt-4",
chunk_size=token_size,
chunk_overlap=0,
)

Splitting the text

A langchain library call; nothing much to add:

source_text_chunks = text_splitter.split_text(source_text)

Multi-chunk translation

As with single-chunk translation, there are the same three steps: initial, reflect, improve.

initial

Techniques: XML delimiters, detailed requirements, constraints

system_message = f"You are an expert linguist, specializing in translation from {source_lang} to {target_lang}."

translation_prompt = """Your task is to provide a professional translation from {source_lang} to {target_lang} of PART of a text.

The source text is below, delimited by XML tags <SOURCE_TEXT> and </SOURCE_TEXT>. Translate only the part within the source text delimited by <TRANSLATE_THIS> and </TRANSLATE_THIS>. You can use the rest of the source text as context, but do not translate any of the other text. Do not output anything other than the translation of the indicated part of the text.

<SOURCE_TEXT>
{tagged_text}
</SOURCE_TEXT>

To reiterate, you should translate only this part of the text, shown here again between <TRANSLATE_THIS> and </TRANSLATE_THIS>:
<TRANSLATE_THIS>
{chunk_to_translate}
</TRANSLATE_THIS>

Output only the translation of the portion you are asked to translate, and nothing else.
"""

The difference is that {tagged_text} is now the full source text with the chunk to translate wrapped in the XML tags <TRANSLATE_THIS></TRANSLATE_THIS>. The implementation:

translation_chunks = []
for i in range(len(source_text_chunks)):
    # Will translate chunk i
    tagged_text = (
        "".join(source_text_chunks[0:i])       # every chunk before i
        + "<TRANSLATE_THIS>"
        + source_text_chunks[i]                # the chunk to translate
        + "</TRANSLATE_THIS>"
        + "".join(source_text_chunks[i + 1:])  # every chunk after i
    )
    # fill {chunk_to_translate} and {tagged_text} into the prompt,
    # then translate chunk by chunk and collect the results

reflect

The prompt is similar to the single-chunk version; omitted.

improve

The prompt is similar to the single-chunk version; omitted.

Improving {tagged_text} for multi-chunk translation

The tagging scheme above does help the LLM keep a global view, but for long source texts it has the following drawbacks:

  1. It burns a large number of tokens. The method effectively repeats the entire source text $\left\lceil\frac TL\right\rceil$ times (T, L as defined above). With a paid API that means overspending; with a local LLM it wastes significant GPU memory.

(An aside: when I wrote this post I was one month into my job, and just last week a mentor on the neighboring team was complaining that our overlong system prompt was eating a lot of GPU memory.)

  2. It consumes context. Even though mainstream LLMs now offer long context windows, overly long input still weakens the model's grip on its content.
  3. It wastes time. This one is obvious.
  4. A chunk boundary may land in the middle of a sentence. The splitter targets a fixed token count, and tokens are distributed unevenly across words and sentences (RecursiveCharacterTextSplitter does prefer paragraph and line breaks, but falls back to finer separators when a block is too long).

For problems 1-3, a simple fix is to include only the chunks adjacent to the one being translated (or only the inner neighbor when that chunk sits at either end of the text).
A simple implementation:

context_length = 1  # number of neighboring chunks to include on each side
translation_chunks = []
for i in range(len(source_text_chunks)):
    # gather the chunks surrounding chunk i as context
    prev_context = "".join(source_text_chunks[max(0, i - context_length):i])
    next_context = "".join(source_text_chunks[i + 1:i + context_length + 1])
    tagged_text = (
        prev_context
        + "<TRANSLATE_THIS>"  # keep the tag names the prompt expects
        + source_text_chunks[i]
        + "</TRANSLATE_THIS>"
        + next_context
    )
    # fill {chunk_to_translate} and {tagged_text} into the prompt,
    # then translate chunk by chunk and collect the results

For problem 4, a possible approach is to split the text into sentences with a regex, count the tokens of each sentence, and then group whole sentences into chunks. This borders on semantic segmentation, which I am not deeply familiar with, but mature solutions likely exist.
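A minimal sketch of this sentence-grouping idea, under stated assumptions: the regex is a naive sentence splitter (it breaks after ., !, ? and their CJK equivalents and will mishandle abbreviations), and count_tokens is a stand-in for a real counter such as a tiktoken-based one.

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after sentence-ending punctuation,
    # keeping the punctuation attached to its sentence.
    return [s for s in re.split(r"(?<=[.!?。!?])\s*", text) if s]

def group_sentences(sentences, token_limit, count_tokens):
    # Greedily pack whole sentences into chunks of at most token_limit
    # tokens, so no chunk boundary falls mid-sentence.
    chunks, current, current_tokens = [], [], 0
    for sent in sentences:
        n = count_tokens(sent)
        if current and current_tokens + n > token_limit:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sent)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A single sentence longer than token_limit still becomes its own oversized chunk here; a production version would need a fallback split for that case.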

Going a step further: if the text is long enough to have clear chapters and paragraphs, split along those natural boundaries first, and only subdivide the blocks that still exceed token_limit.
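This structure-first idea can be sketched as follows; split_long is a hypothetical finer-grained splitter (for example, a sentence- or token-based one) that this sketch simply delegates to:

```python
def split_by_structure(text, token_limit, count_tokens, split_long):
    # First split on blank lines (paragraph boundaries); any paragraph
    # that still exceeds token_limit is handed to a finer splitter.
    chunks = []
    for para in text.split("\n\n"):
        if count_tokens(para) <= token_limit:
            chunks.append(para)
        else:
            chunks.extend(split_long(para))
    return chunks
```

The same pattern nests for chapters above paragraphs: split on the coarser boundary first and recurse into whatever is still too long.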

Summary and asides

This post walked through the pipeline of Andrew Ng's translation agent, studied its design and prompt-writing patterns, and considered a few possible improvements.

Andrew Ng open-sourced this agent in June of this year (2024). On X he used it to illustrate divide-and-conquer in agent design, and it makes a good template: many earlier open-source translation agents relied on a model's long-context ability to brute-force large passages in one shot, and instruction following inevitably degraded toward the end.

While studying it, I also ran into the agent's inherent heavy token consumption, and sketched a few simple remedies; I hope they give readers some ideas.