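Table of Contents
- Preface
- I. Data Source Acquisition and Parsing
- II. Data Cleaning
  - 1. Cleaning Logic
  - 2. Rule-Based Cleaning
    - 2.1 Keyword Blacklist
    - 2.2 Text Normalization
  - 3. LLM-Based "Smart" Validation
    - 3.1 Prompt Design
    - 3.2 Caching
    - 3.3 Rate Control
  - 4. Results and Output
    - 4.1 Sample Cleaning Results
    - 4.2 Final Output
- Summary and Outlook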
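Preface
Drug interactions are an important factor in patient safety. Traditional tabular data cannot show the complex, web-like relationships between drugs in an intuitive way.
To identify drug contraindications more intelligently, we need to build a drug interaction knowledge graph.
Before building the graph, we first need an accurate, high-quality dataset.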
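I. Data Source Acquisition and Parsing
- Data source: the official openFDA data source (see the openFDA download page).
- Data format: the raw data is in JSON format.
For example (the "drug_interactions" field of one drug):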
"drug_interactions":["7. DRUG INTERACTIONS \u2022 Potent inhibitors of CYP1A2 should be avoided (7.1). \u2022 Potent inhibitors of CYP2D6 may increase duloxetine concentrations (7.2). \u2022 Duloxetine is a moderate inhibitor of CYP2D6 (7.9). Both CYP1A2 and CYP2D6 are responsible for duloxetine metabolism. 7.1 Inhibitors of CYP1A2 When duloxetine 60 mg was co-administered with fluvoxamine 100 mg, a potent CYP1A2 inhibitor, to male subjects (n=14) duloxetine AUC was increased approximately 6-fold, the C max was increased about 2.5-fold, and duloxetine t 1/2 was increased approximately 3-fold. Other drugs that inhibit CYP1A2 metabolism include cimetidine and quinolone antimicrobials such as ciprofloxacin and enoxacin [see Warnings and Precautions (5.12)] . 7.2 Inhibitors of CYP2D6 Concomitant use of duloxetine (40 mg once daily) with paroxetine (20 mg once daily) increased the concentration of duloxetine AUC by about 60%, and greater degrees of inhibition are expected with higher doses of paroxetine. Similar effects would be expected with other potent CYP2D6 inhibitors (e.g., fluoxetine, quinidine) [see Warnings and Precautions (5.12)] . 7.3 Dual Inhibition of CYP1A2 and CYP2D6 Concomitant administration of duloxetine 40 mg twice daily with fluvoxamine 100 mg, a potent CYP1A2 inhibitor, to CYP2D6 poor metabolizer subjects (n=14) resulted in a 6-fold increase in duloxetine AUC and C max . 7.4 Drugs that Interfere with Hemostasis (e.g., NSAIDs, Aspirin, and Warfarin) Serotonin release by platelets plays an important role in hemostasis. Epidemiological studies of the case-control and cohort design that have demonstrated an association between use of psychotropic drugs that interfere with serotonin reuptake and the occurrence of upper gastrointestinal bleeding have also shown that concurrent use of an NSAID or aspirin may potentiate this risk of bleeding. 
Altered anticoagulant effects, including increased bleeding, have been reported when SSRIs or SNRIs are co-administered with warfarin. Concomitant administration of warfarin (2-9 mg once daily) under steady state conditions with duloxetine 60 or 120 mg once daily for up to 14 days in healthy subjects (n=15) did not significantly change INR from baseline (mean INR changes ranged from 0.05 to +0.07). The total warfarin (protein bound plus free drug) pharmacokinetics (AUC T,ss , C max,ss or t max,ss ) for both R- and S-warfarin were not altered by duloxetine. Because of the potential effect of duloxetine on platelets, patients receiving warfarin therapy should be carefully monitored when duloxetine is initiated or discontinued [see Warnings and Precautions (5.5)] . 7.5 Lorazepam Under steady-state conditions for duloxetine (60 mg Q 12 hours) and lorazepam (2 mg Q 12 hours), the pharmacokinetics of duloxetine were not affected by co-administration. 7.6 Temazepam Under steady-state conditions for duloxetine (20 mg qhs) and temazepam (30 mg qhs), the pharmacokinetics of duloxetine were not affected by co-administration. 7.7 Drugs that Affect Gastric Acidity Duloxetine delayed-release capsules have an enteric coating that resists dissolution until reaching a segment of the gastrointestinal tract where the pH exceeds 5.5. In extremely acidic conditions, duloxetine delayed-release capsules, unprotected by the enteric coating, may undergo hydrolysis to form naphthol. Caution is advised in using duloxetine delayed-release capsules in patients with conditions that may slow gastric emptying (e.g., some diabetics). Drugs that raise the gastrointestinal pH may lead to an earlier release of duloxetine. 
However, co-administration of duloxetine delayed-release capsules with aluminum- and magnesium-containing antacids (51 mEq) or duloxetine delayed-release capsules with famotidine, had no significant effect on the rate or extent of duloxetine absorption after administration of a 40 mg oral dose. It is unknown whether the concomitant administration of proton pump inhibitors affects duloxetine absorption [see Warnings and Precautions (5.14)] . 7.8 Drugs Metabolized by CYP1A2 In vitro drug interaction studies demonstrate that duloxetine does not induce CYP1A2 activity. Therefore, an increase in the metabolism of CYP1A2 substrates (e.g., theophylline, caffeine) resulting from induction is not anticipated, although clinical studies of induction have not been performed. Duloxetine is an inhibitor of the CYP1A2 isoform in in vitro studies, and in two clinical studies the average (90% confidence interval) increase in theophylline AUC was 7% (1%-15%) and 20% (13%-27%) when co-administered with duloxetine (60 mg twice daily). 7.9 Drugs Metabolized by CYP2D6 Duloxetine is a moderate inhibitor of CYP2D6. When duloxetine was administered (at a dose of 60 mg twice daily) in conjunction with a single 50 mg dose of desipramine, a CYP2D6 substrate, the AUC of desipramine increased 3-fold [see Warnings and Precautions (5.12)] . 7.10 Drugs Metabolized by CYP2C9 Results of in vitro studies demonstrate that duloxetine does not inhibit activity. In a clinical study, the pharmacokinetics of S-warfarin, a CYP2C9 substrate, were not significantly affected by duloxetine [see Drug Interactions (7.4)]. 7.11 Drugs Metabolized by CYP3A Results of in vitro studies demonstrate that duloxetine does not inhibit or induce CYP3A activity. Therefore, an increase or decrease in the metabolism of CYP3A substrates (e.g., oral contraceptives and other steroidal agents) resulting from induction or inhibition is not anticipated, although clinical studies have not been performed. 
7.12 Drugs Metabolized by CYP2C19 Results of in vitro studies demonstrate that duloxetine does not inhibit CYP2C19 activity at therapeutic concentrations. Inhibition of the metabolism of CYP2C19 substrates is therefore not anticipated, although clinical studies have not been performed. 7.13 Monoamine Oxidase Inhibitors (MAOIs) [see Dosage and Administration (2.5, 2.6), Contraindications (4.1), and Warnings and Precautions (5.4)]. 7.14 Serotonergic Drugs [see Dosage and Administration (2.5, 2.6), Contraindications (4.1), and Warnings and Precautions (5.4)]. 7.15 Alcohol When duloxetine and ethanol were administered several hours apart so that peak concentrations of each would coincide, duloxetine did not increase the impairment of mental and motor skills caused by alcohol. In the duloxetine clinical trials database, three duloxetine-treated patients had liver injury as manifested by ALT and total bilirubin elevations, with evidence of obstruction. Substantial intercurrent ethanol use was present in each of these cases, and this may have contributed to the abnormalities seen [see Warnings and Precautions (5.2 and 5.12)] . 7.16 CNS Drugs [see Warnings and Precautions (5.12)]. 7.17 Drugs Highly Bound to Plasma Protein Because duloxetine is highly bound to plasma protein, administration of duloxetine delayed-release capsules to a patient taking another drug that is highly protein bound may cause increased free concentrations of the other drug, potentially resulting in adverse reactions. However, co-administration of duloxetine (60 or 120 mg) with warfarin (2-9 mg), a highly protein-bound drug, did not result in significant changes in INR and in the pharmacokinetics of either total S-or total R-warfarin (protein bound plus free drug) [see Drug Interactions (7.4)] ."],
- Relation extraction: locate the "drug_interactions" field of every drug in the JSON, then use regular expressions and drug-name matching to produce a preliminary set of interact_with relations between DrugA and DrugB. (This run took a whole night and produced about 520,000 rows.)
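The extraction step can be sketched roughly as follows. This is only an illustration of regex plus drug-name matching, not the script used for this post: KNOWN_DRUGS, DRUG_PATTERN, and extract_interactions are hypothetical names, and the tiny vocabulary stands in for whatever drug list the real script matched against.

```python
import re

# Hypothetical drug vocabulary; the post does not show how its list was built.
KNOWN_DRUGS = ["duloxetine", "fluvoxamine", "paroxetine", "warfarin"]

# Longest names first so a longer name wins over any shorter name it contains.
_alternatives = "|".join(
    re.escape(name) for name in sorted(KNOWN_DRUGS, key=len, reverse=True)
)
DRUG_PATTERN = re.compile(r"\b(" + _alternatives + r")\b", re.IGNORECASE)


def extract_interactions(drug_a, interaction_texts):
    """Return sorted (drugA, 'interact_with', drugB) triples found in the texts."""
    found = set()
    for text in interaction_texts:
        for match in DRUG_PATTERN.finditer(text):
            drug_b = match.group(1).lower()
            if drug_b != drug_a.lower():  # skip mentions of drugA itself
                found.add(drug_b)
    return [(drug_a, "interact_with", drug_b) for drug_b in sorted(found)]


# Two sentences taken from the sample "drug_interactions" field above.
record = {
    "drug_interactions": [
        "Concomitant use of duloxetine (40 mg once daily) with paroxetine "
        "(20 mg once daily) increased the concentration of duloxetine AUC "
        "by about 60%. Altered anticoagulant effects, including increased "
        "bleeding, have been reported when SSRIs or SNRIs are "
        "co-administered with warfarin."
    ]
}
triples = extract_interactions("Duloxetine", record["drug_interactions"])
print(triples)
```

Plain name matching like this also picks up whatever else the vocabulary contains, which is one reason the extracted rows need the cleaning described in the next section.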
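II. Data Cleaning
The drugA_interact_drugB_final.csv file extracted from the FDA data is far from perfect.
Pain points:
1) Heavy noise: the file contains many endogenous substances and side-effect symptoms.
2) Non-drug entities: many records actually describe "Drug A and Symptom B" rather than "Drug A and Drug B".
Goal: build a high-precision drug interaction dataset.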
1. Cleaning Logic
Raw CSV -> Step 1: rule-based cleaning (Pandas) -> Step 2: LLM validation (Qwen) -> clean data
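The two-stage flow above can be sketched in plain Python (no Pandas, so it runs on its own). Everything here is a hypothetical stand-in: rule_filter, llm_filter, run_pipeline, and the toy checks are illustrative names, and the fake validator only imitates an LLM call. The point is that the cheap rule step runs first, so the expensive step sees fewer rows.

```python
def rule_filter(rows, is_noise):
    """Step 1: drop rows whose drugB is flagged by the cheap rule check."""
    return [row for row in rows if not is_noise(row["drugB"])]


def llm_filter(rows, validate):
    """Step 2: drop only rows the validator explicitly marks INVALID."""
    return [row for row in rows if validate(row["drugA"], row["drugB"]) != "INVALID"]


def run_pipeline(rows, is_noise, validate):
    """Rules first, then the validator on whatever survives."""
    return llm_filter(rule_filter(rows, is_noise), validate)


# Toy stand-ins for the real rule check and the real LLM call.
llm_calls = []


def toy_is_noise(drug_b):
    return "syndrome" in drug_b.lower()


def toy_validate(drug_a, drug_b):
    llm_calls.append((drug_a, drug_b))  # record each simulated LLM call
    return "INVALID" if drug_b == "Dopamine" else "VALID"


rows = [
    {"drugA": "Duloxetine", "drugB": "Serotonin Syndrome"},
    {"drugA": "Duloxetine", "drugB": "Dopamine"},
    {"drugA": "Duloxetine", "drugB": "Adalimumab"},
]
result = run_pipeline(rows, toy_is_noise, toy_validate)
print(result)          # only the Adalimumab row remains
print(len(llm_calls))  # 2: the rule step removed one row before the validator ran
```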
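2. Rule-Based Cleaning
2.1 Keyword Blacklist
I maintain a large NOISE_KEYWORDS list. If the drugB field contains any of these keywords, the record is treated as noise right away.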
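    import pandas as pd

    # NOISE_KEYWORDS is the blacklist defined elsewhere in the script.
    def is_noise(drug_b):
        """Return True if drug_b should be treated as noise."""
        # Missing or non-string values are always noise.
        if pd.isna(drug_b) or not isinstance(drug_b, str):
            return True
        drug_b_lower = drug_b.lower()
        for noise in NOISE_KEYWORDS:
            if noise.lower() in drug_b_lower:
                # Names containing " and " are not rejected on a keyword hit.
                if " and " in drug_b_lower:
                    continue
                return True
        return False

2.2 Text Normalization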
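Strip leading and trailing whitespace, convert everything to lowercase, then capitalize each word (for example, "aspirin" becomes "Aspirin") so that graph node names stay consistent.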
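    def clean_text(text):
        """Normalize a name: strip, lowercase, then capitalize each word."""
        if pd.isna(text):
            return text
        text = text.strip().lower()
        words = text.split()
        cleaned_words = []
        for word in words:
            # Connectives and Roman numerals are left in lowercase.
            if word in ["and", "or", "the", "of", "in", "ii", "iii", "iv"]:
                cleaned_words.append(word)
            else:
                cleaned_words.append(word.capitalize())
        return " ".join(cleaned_words)

3. LLM-Based "Smart" Validation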
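The cleaning script uses Qwen3.6-Plus.
3.1 Prompt Design
- Role: "You are a top drug informatics expert…"
- Criteria: endogenous substances and side effects/symptoms count as noise.
- Output format: the model outputs VALID or INVALID, which is easy for the code to parse.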
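A single-word output format still needs a tolerant parser, because a model may add whitespace, change case, or append punctuation. The helper below is a hypothetical sketch (parse_decision is not a function from the cleaning script, and the UNKNOWN label is my own assumption for replies that fit neither answer).

```python
def parse_decision(reply):
    """Map a raw model reply to 'VALID', 'INVALID', or 'UNKNOWN'."""
    if not isinstance(reply, str):
        return "UNKNOWN"
    token = reply.strip().upper()
    # Check INVALID first: the string "INVALID" also contains "VALID".
    if token.startswith("INVALID"):
        return "INVALID"
    if token.startswith("VALID"):
        return "VALID"
    return "UNKNOWN"


print(parse_decision(" valid\n"))          # VALID
print(parse_decision("INVALID."))          # INVALID
print(parse_decision("It is a hormone."))  # UNKNOWN
```

Returning a third label for unexpected replies lets the caller decide what to do with them instead of silently treating them as one of the two answers.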
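    def build_llm_prompt(drug_a, drug_b, relation="INTERACTS_WITH"):
        """Build the validation prompt for one (drugA, drugB) record."""
        return (
            "You are a top drug informatics expert responsible for cleaning drug interaction data.\n"
            "Format: Drug A, INTERACTS_WITH, Entity B. Decide whether Entity B is a real drug.\n"
            "If it is not a drug, mark it for deletion.\n\n"
            "Delete if:\n"
            "1) Endogenous substances, such as Renin, Angiotensin, Dopamine (unless explicitly an injection), Norepinephrine, Gaba, Nadh/Nad, Alanine/Amino Acids, Iron, Copper.\n"
            "2) Side effects or symptoms, such as Constipation, Hypertension, Diabetes.\n"
            "Keep if:\n"
            "3) Real drugs, including chemical drugs, prescription drugs, biologics/monoclonal antibodies (such as Insulin, Adalimumab), or combination products.\n"
            "Output a single word only: VALID to keep, INVALID to delete.\n\n"
            f"drugA:{drug_a}\n"
            f"drugB:{drug_b}\n"
            f"relation:{relation}\n"
        )

3.2 Caching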
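The code uses cache, a plain key-value dictionary, to record every question already sent to the Qwen model together with its answer. This avoids repeated API calls for the same drug pair, which saves cost and time.
3.3 Rate Control
The code calls time.sleep(0.2) so that the program pauses for 0.2 s after each question. This keeps requests from arriving fast enough to be rate-limited by the API and makes the run more robust.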
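As a variation on the fixed pause, one could sleep only for whatever part of the interval has not already elapsed. This is a sketch under my own assumptions, not what the cleaning script does: MinIntervalThrottle is a hypothetical helper, and the clock and sleep parameters exist only so the behavior can be checked without real waiting.

```python
import time


class MinIntervalThrottle:
    """Ensure at least min_interval seconds pass between consecutive calls."""

    def __init__(self, min_interval=0.2, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock   # injectable for testing
        self._sleep = sleep   # injectable for testing
        self._last = None     # time of the previous call, if any

    def wait(self):
        """Block until min_interval has passed since the previous wait()."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage: call throttle.wait() right before each real API request.
throttle = MinIntervalThrottle(min_interval=0.2)
throttle.wait()  # the first call never sleeps
```

If the API work itself already takes longer than the interval, this version adds no extra delay, whereas a fixed time.sleep(0.2) always does.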
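    import time

    print("Running secondary validation with the Qwen LLM...")
    cache = {}      # maps "drugA|drugB" to the decision already returned
    results = []
    validated = 0
    for _, row in df_filtered.iterrows():
        # Past the validation budget, keep the remaining rows unchecked.
        if validated >= llm_limit:
            results.append(row)
            continue
        key = f"{row['drugA']}|{row['drugB']}"
        if key in cache:
            decision = cache[key]
        else:
            decision = llm_validate(client, row['drugA'], row['drugB'], model=llm_model)
            cache[key] = decision
            time.sleep(0.2)  # pause only after a real API call
        # Drop only explicit INVALID rows; keep VALID and unrecognized answers.
        if decision != "INVALID":
            results.append(row)
        # Count every checked row, including rejected ones.
        validated += 1
    df_filtered = pd.DataFrame(results).reset_index(drop=True)
    print(f"LLM validation complete. Validated {validated} records.")

4. Results and Output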
4.1 Sample Cleaning Results
| Raw data (drugB) | Cleaning result | Reason |
|---|---|---|
| Serotonin Syndrome | INVALID | Symptom/side effect |
| Warfarin | INVALID | Appears in the blacklist |
| Adalimumab | VALID | Biologic/drug |
4.2 Final Output
The result is clean data that can be imported directly into Neo4j.
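Summary and Outlook
- This cleaning pass combines rules with an LLM: the rule engine quickly removes obvious noise, and the LLM handles the finer semantic cleaning. The combination balances throughput against API cost in a practical way, and the same approach can be reused in later data processing work.
- High-quality cleaning is a prerequisite for a trustworthy medical knowledge graph: only accurate entities can support reliable reasoning.
- As a next step, I will use this data to build the knowledge graph and combine it with an LLM for reasoning and debugging.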