终于!文生图可以直接生成中文了
女模特的表情温和,目光直视前方,展现出自信与从容。作为当前开源文生图领域的SOTA(最先进技术)模型,CogView4不仅在复杂语义理解、指令跟随能力上表现卓越,还首次实现了对汉字生成的支持,解决了AI图像生成领域长期存在的痛点。例如,输入古诗“野径云俱黑,江船火独明”,模型能精准生成云雾笼罩的江船灯火画。可以看出,在不使用提示词增强的情况下,生成的效果还是差强人意,使用提示词增强后,文字效果和人
之前我有在扩散模型对于文字的渲染已不再是难题这篇文章中有提到通过扩散模型直接生成文字的两种方式,一种是《TextDiffuser: Diffusion Models as Text Painters》,另外一种是网站《Ideogram: Helping people become more creative.》,这两种方式对英文生成比较稳定,但是在中文生成的条件下就一言难尽了。
2025年3月4日,智谱AI正式推出其“开源年”计划的首个重磅模型—CogView4
作为当前开源文生图领域的SOTA(最先进技术)模型,CogView4不仅在复杂语义理解、指令跟随能力上表现卓越,还首次实现了对汉字生成的支持,解决了AI图像生成领域长期存在的痛点。本文将深入解析这一模型的创新点、技术架构及其应用前景。
核心亮点
-
1、汉字生成能力突破
传统文生图模型在生成汉字时常面临字形结构错位、语义断层等问题,CogView4通过分层训练策略(基础阶段学习通用图文映射,泛化阶段适配多分辨率,最终通过人类偏好微调优化审美),成功将汉字笔锋与图像意境自然融合。例如,输入古诗“野径云俱黑,江船火独明”,模型能精准生成云雾笼罩的江船灯火画 -

2、中英双语混合输入与理解
CogView4首次将文本编码器从纯英文的T5升级为支持双语的GLM-4,结合中英双语图文对训练,使其能无缝处理混合提示词。例如,生成“赛博幻彩风格与超现实主义融合的女性侧脸”时,模型可同时解析中英文描述,并输出高一致性的图像
-
3、任意长度提示词与分辨率支持
突破传统模型的固定token限制,CogView4支持输入任意长度的文本(最高达1024 tokens),并生成指定范围内的任意分辨率图像(如512×512至2048×2048)。这种灵活性极大扩展了创作场景,例如用超长故事生成四格漫画,或通过复杂描述生成精细画面。
技术架构
-
1、多模态编码与训练优化
-
GLM-4双语编码器:替换传统T5编码器,增强中文语义捕捉能力。
-
二维旋转位置编码(2D RoPE):建模图像位置信息,支持不同分辨率生成。
-
Flow-matching扩散建模:结合动态噪声规划,优化信噪比以适应多分辨率需求。
-
2、混合训练范式:采用四阶段训练策略:
-
-
基础分辨率训练:覆盖广泛图像分布;
-
泛分辨率训练:适配任意分辨率生成;
-
高质量数据微调:提升图像美感;
-
人类偏好对齐:确保生成结果符合用户审美。
-
3、效率与性能提升
-
通过优化文本token冗余,训练效率提升5%-30%;在DPG-Bench基准测试中综合评分第一,验证了其在复杂语义对齐和指令跟随上的领先性。

本地部署
既然是开源,那么少不了本地部署尝鲜:
git clone https://github.com/THUDM/CogView4.gitcd CogView4python -m pip install -r inference/requirements.txtpython inference/gradio_web_demo.py需要注意的是,CogView4是使用了提示词扩展,当然你也可以在gradio_web_demo.py中修改提示词扩展,默认使用的扩展模型api是https://open.bigmodel.cn/api/paas/v4 ,对应的大模型是glm-4-flash当然你也可以修改成自己用的api,比如deepseek或者Qwen或者你本地部署的模型也行。
def convert_prompt( prompt: str, key: str, retry_times: int = 5,) -> str: os.environ["OPENAI_API_KEY"] = key if not key: return prompt client = OpenAI() prompt = clean_string(prompt) for i in range(retry_times): try: response = client.chat.completions.create( messages=[ { "role": "system", "content": 'You are a bilingual image description assistant that works with an image generation bot. You work with an assistant bot that will draw anything you say . \n For example ,For example, outputting "a beautiful morning in the woods with the sun peaking through the trees" or "阳光透过树林的美丽清晨" will trigger your partner bot to output an image of a forest morning, as described . \n You will be prompted by people looking to create detailed , amazing images . The way to accomplish this is to take their short prompts and make them extremely detailed and descriptive . \n There are a few rules to follow : \n - Input can be in Chinese or English. If input is in English, prompt should be written in English. If input is in Chinese, prompt should be written in Chinese.\n - You will only ever output a single image description per user request .\n - Image descriptions must be detailed and specific, including keyword categories such as subject, medium, style, additional details, color, and lighting. \n - When generating descriptions, focus on portraying the visual elements rather than delving into abstract psychological and emotional aspects. Provide clear and concise details that vividly depict the scene and its composition, capturing the tangible elements that make up the setting.\n - Do not provide the process and explanation, just return the modified description . \n ', }, { "role": "user", "content": 'Create an imaginative image descriptive caption for the user input : "An anime girl stands amidst a dense flower bush."', }, { "role": "assistant", "content": "This image is a beautifully crafted digital illustration in an anime style. It features Suzuran, a character from the video game Arknights. She stands gracefully amidst a picturesque meadow with lush green grass and scattered wildflowers that gently sway in the breeze. Her attire includes a detailed outfit with layered ruffles and intricate fastenings, reflecting both elegance and functionality typical of the game's aesthetic. Suzuran holds a small bouquet of flowers delicately in her hands, adding to the serene atmosphere.\n\nProminent fox ears extend from her head, signifying her animal spirit form or persona within the game. The background showcases rolling hills covered in dense foliage under a brilliant blue sky dotted with fluffy white clouds. Golden leaves float whimsically through the air, enhancing the magical quality of the scene. A distant building partially obscured by vegetation hints at human presence without detracting from the natural beauty surrounding Suzuran. Overall, this artwork captures a moment of tranquility and charm, embodying the essence of the character and the enchanting world she inhabits.", }, { "role": "user", "content": 'Create an imaginative image descriptive caption for the user input : "Draw a bright convertible car with a sense of artistic design."', }, { "role": "assistant", "content": 'The image showcases a meticulously crafted Delahaye 135 M.S. Special Roadster from 1937 in a highly polished and realistic rendering style that highlights its luxurious design and impeccable details. The car\'s body is an elegant deep brown with a glossy finish, exuding sophistication and timeless beauty. Its aerodynamic, streamlined shape features smooth curves accentuated by chrome detailing on the fenders and running boards. The front grille has three prominent circular headlights, adding to its classic charm while the dual exhaust pipes are tastefully integrated into the rear fenders.\n\nThe open cockpit area reveals sumptuous tan leather seats, emphasizing both comfort and elegance for the driver and passenger. The vehicle’s number plate reads "MB 175". Photographed against a dark gradient background, the focus remains solely on this automotive masterpiece, allowing viewers to appreciate every intricate detail without distraction. This image captures not just a car but a piece of art, reflecting the luxury and craftsmanship emblematic of pre-war European automobiles.', }, { "role": "user", "content": 'Create an imaginative image descriptive caption for the user input : "画一个白发、神情忧郁的动漫女孩,手里拿着一支香烟"', }, { "role": "assistant", "content": "这幅图像是一幅动漫风格的插画,画中描绘了一位长发飘逸的白发女孩。她神情忧郁,双眼低垂,脸上带着微微的愁容。女孩穿着浅色外套,里面搭配深色衬衫和领带,增添了她沉静却时尚的外表。她的手靠近嘴边,似乎在拿着一支香烟,画作的说明也表示她在吸烟。背景描绘了一个冬季的城市场景,地面和建筑物上覆盖着积雪。几株稀疏的植物从雪地中探出,街边停着一辆车,似乎是空荡荡的街道,增强了场景的荒凉氛围。这幅作品整体风格细致精美,典型的现代动漫插画风格,既捕捉了角色的情感表达,也呈现了周围环境的静谧氛围", }, { "role": "user", "content": 'Create an imaginative image descriptive caption for the user input : "一张红色的海报,中间写有“开门大吉”"', }, { "role": "assistant", "content": "这张图片采用了海报风格,色彩鲜艳,主要以红色和金色为主,寓意吉祥如意。在画面的正中间是一块红色的大匾,上面用白色大字写着“开门大吉”,四角点缀着金色的装饰图案,显得格外喜庆。匾额上方悬挂着一对红色的灯笼,增添了节日的气氛。背景左右两侧是传统的绿色中式建筑,屋顶呈现出典型的飞檐翘角设计。底部有祥云朵朵和可爱的卡通福袋,象征着好运福气满满。整张海报传达出浓厚的节日氛围。", }, { "role": "user", "content": f"Create an imaginative image descriptive caption for the user input : {prompt}", }, ], model="deepseek-chat", temperature=0.01, top_p=0.7, stream=False, max_tokens=300,实测效果
我们首先不使用提示词增强,提示词是“一个亚洲女模特,手举一块牌子,牌子上写着 “小小少年” 四个字”,可以看到,文字显示不完美,第二个小字缺了一笔

然后我们使用另外的提示词,使用提示词增强,且使用deepseek进行提示词增强,增强后提示词“这幅图像是一幅现代风格的插画,描绘了一位亚洲女模特站在画面中央。她身穿时尚的白色上衣和黑色长裤,搭配一双黑色高跟鞋,整体造型简约而优雅。她的右手高举一块白色牌子,牌子上用黑色字体清晰地写着“小小少年”四个字,字体工整有力。背景是柔和的浅色调,营造出一种清新而宁静的氛围。女模特的表情温和,目光直视前方,展现出自信与从容。画面整体构图简洁,色彩搭配和谐,突出了主题文字与模特的时尚气质。”

可以看出,在不使用提示词增强的情况下,生成的效果还是差强人意,使用提示词增强后,文字效果和人像效果都不错。
-
更多推荐


所有评论(0)