Technology Radar

Synthetic data for testing and training models

Published: Oct 23, 2024

NOT ON THE CURRENT EDITION

This blip is not on the current edition of the Radar. If it was on one of the last few editions, it is likely that it is still relevant. If the blip is older, it might no longer be relevant and our assessment might be different today. Unfortunately, we simply don't have the bandwidth to continuously review blips from previous editions of the Radar.

Oct 2024

Trial

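Synthetic data set creation involves generating artificial data that can mimic real-world scenarios without relying on sensitive or limited-access data sources. While synthetic data for structured data sets has been explored extensively (for example, for performance testing or privacy-safe environments), we're seeing renewed use of synthetic data for unstructured data. Enterprises often struggle with a lack of labeled domain-specific data, especially when training or fine-tuning large language models (LLMs). Tools like Bonito and Microsoft's AgentInstruct can generate synthetic instruction-tuning data from raw sources such as text documents and code files. This helps accelerate model training while reducing costs and the dependency on manual data curation. Another important use case is generating synthetic data to address imbalanced or sparse data, which is common in tasks such as fraud detection or customer segmentation. Techniques like SMOTE help balance data sets by artificially creating minority-class instances. Similarly, in industries like finance, generative adversarial networks (GANs) are used to simulate rare transactions, making models more robust at detecting edge cases and improving overall performance.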
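To make the minority-class oversampling idea concrete, here is a minimal sketch of SMOTE-style interpolation in plain Python. It is an illustration under stated assumptions, not the reference algorithm or any library's implementation: the function name `smote_like_oversample`, its parameters, and the toy "fraud" and "legitimate" rows are all hypothetical and made up for this example. Each synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbours.

    Hypothetical helper for illustration only; a simplified take on the SMOTE idea.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point (the base itself is excluded)
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy, made-up data: 4 "fraud" rows vs 12 "legitimate" rows (amount, hour of day).
fraud = [(950.0, 2.0), (1020.0, 3.0), (980.0, 1.0), (1100.0, 4.0)]
legit = [(20.0 + 5 * i, 9.0 + i % 8) for i in range(12)]

# Generate enough synthetic fraud rows to match the majority class size.
new_fraud = smote_like_oversample(fraud, n_new=len(legit) - len(fraud))
balanced_fraud = fraud + new_fraud
print(len(balanced_fraud), len(legit))
```

Because every synthetic row is a convex combination of two real minority rows, it stays inside the region the minority class already occupies. In practice, a maintained implementation such as the `SMOTE` class in the imbalanced-learn package (used through `fit_resample`) is likely a better starting point than a hand-rolled version, and oversampling should be applied to the training split only.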
