
Extracting invention performance data with LangChain's web scraper

1. Introduction

I used LangChain's Research automation to try extracting performance figures from patent specifications.


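2. Background

  • While browsing the LangChain pages, I found a section on BeautifulSoup integration on the Web scraping page.

  • As the diagram shows, it extracts text from a website, puts it into a vector store, and returns it organized in the form you ask for.


  • Some time ago I had to pull performance-related statements (e.g., anode initial overvoltage, anode overvoltage, durability) out of patent specifications and compile them into a table. It was extremely tedious, so I looked into whether this could make that work easier.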

3. Source code

  • The code is almost identical to the official page above; you only need to set url to the Google Patents URL you want.

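from langchain.chains import create_extraction_chain
from langchain.chat_models import ChatOpenAI

# llm is not defined in the original post. This line follows the official
# LangChain example and is an assumption; it needs OPENAI_API_KEY to be set.
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613")

# The descriptions are instructions to the LLM and are left in Japanese:
#   "電圧の値をmVで抽出" = "extract the voltage value in mV"
#   "耐久性について記載されている文章を抽出" = "extract the sentences that describe durability"
schema = {
    "properties": {
        "Anode initial overvoltage": {"type": "string","description":"電圧の値をmVで抽出"},
        "Anode overvoltage": {"type": "string","description":"電圧の値をmVで抽出"},
        "durability": {"type": "string","description":"耐久性について記載されている文章を抽出"},
    },
    #"required": ["陽極初期過電圧(Anode initial overvoltage)", "陽極過電圧(Anode overvoltage)","durability"],
}

def extract(content: str, schema: dict):
    # Run the extraction chain on one chunk of text; returns a list of dicts
    return create_extraction_chain(schema=schema, llm=llm).run(content)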


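  • The only change from the official example is that loader = AsyncChromiumLoader(urls) did not work for me, so I switched to loader = AsyncHtmlLoader(urls). The tags listed in tags_to_extract=[] select which parts of the page are pulled in; on Google Patents the text is mostly inside div elements, so I extract those.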
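To make the idea behind tags_to_extract concrete, here is a minimal standard-library sketch of tag-based text extraction. It is not how BeautifulSoupTransformer is implemented; the class TagTextExtractor, the helper extract_text, and the sample HTML are all hypothetical and only illustrate "keep the text that sits inside the listed tags".

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect text that appears inside any of the requested tags."""

    def __init__(self, tags_to_extract):
        super().__init__()
        self.tags = set(tags_to_extract)
        self.depth = 0      # number of requested tags currently open
        self.chunks = []    # text fragments found inside those tags

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.chunks.append(text)


def extract_text(html, tags_to_extract):
    """Return the text inside the given tags, joined with single spaces."""
    parser = TagTextExtractor(tags_to_extract)
    parser.feed(html)
    return " ".join(parser.chunks)


# Made-up sample page: only the div and li text should survive.
sample_html = (
    "<html><head><title>ignored</title></head><body>"
    "<div>Overvoltage was 60 mV lower.</div>"
    "<p>skipped</p>"
    "<ul><li>Durable coating</li></ul>"
    "</body></html>"
)

print(extract_text(sample_html, ["li", "div"]))
```

With ["li", "div"] the title and the p element are dropped, which is the same kind of filtering the post relies on for Google Patents pages.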


import pprint
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import AsyncHtmlLoader
from langchain.document_transformers import BeautifulSoupTransformer

def scrape_with_playwright(urls, schema):
    
    loader = AsyncHtmlLoader(urls)
    docs = loader.load()
    bs_transformer = BeautifulSoupTransformer()
    docs_transformed = bs_transformer.transform_documents(docs,tags_to_extract=["li","div"])
    #print(docs_transformed)
    print("Extracting content with LLM")
    
    # Grab the first 1000 tokens of the site
    splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=1000, 
                                                                    chunk_overlap=0)
    splits = splitter.split_documents(docs_transformed)
    
    # Process the first split 
    extracted_content = extract(
        schema=schema, content=splits[0].page_content
    )
    pprint.pprint(extracted_content)
    return extracted_content

#urls = ["https://www.wsj.com"]
urls = ["https://patents.google.com/patent/US20190078220A1/en"]#,"https://patents.google.com/patent/US20160237578A1/en","https://patents.google.com/patent/US20160237578A1/en?"]

extracted_content = scrape_with_playwright(urls, schema=schema)

Result:


Fetching pages: 100%|##########| 1/1 [00:00<00:00, 16.71it/s] 
Extracting content with LLM [
{'Anode initial overvoltage': 'high',  
 'Anode overvoltage': 'low',  
 'durability': 'high'}]

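  • Hmm. I wanted the overvoltages in mV and the durability as a sentence, but it is not there yet. My instructions may have been poorly worded, so I am still looking into it.

  • This is the result after switching the model to gpt-4 and running again. The answers are somewhat more reasonable.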


Fetching pages: 100%|##########| 3/3 [00:00<00:00, 26.70it/s] 
Extracting content with LLM [{
'Anode initial overvoltage': '60 mV lower than conventional nickel '                                'electrodes',   
'Anode overvoltage': 'even lower overpotential can be achieved',   
'durability': 'separation of the catalyst component from the catalyst layer '                 'does not occur, the corrosion resistance improves'}]
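One way to check whether a run actually honors the "value in mV" request is a small post-check on the returned records. The sketch below is an assumption-laden helper, not part of LangChain: fields_missing_mv and MV_PATTERN are hypothetical names, and the two records are copied from the outputs shown above.

```python
import re

# Matches a number followed by the unit mV, e.g. "60 mV" or "350mV".
MV_PATTERN = re.compile(r"\d+(?:\.\d+)?\s*mV\b")


def fields_missing_mv(record, voltage_fields):
    """Return the voltage fields whose value has no '<number> mV' in it."""
    return [
        name
        for name in voltage_fields
        if not MV_PATTERN.search(str(record.get(name, "")))
    ]


voltage_fields = ["Anode initial overvoltage", "Anode overvoltage"]

# Records taken from the two runs shown in this post.
first_run = {
    "Anode initial overvoltage": "high",
    "Anode overvoltage": "low",
    "durability": "high",
}
gpt4_run = {
    "Anode initial overvoltage": "60 mV lower than conventional nickel electrodes",
    "Anode overvoltage": "even lower overpotential can be achieved",
    "durability": "separation of the catalyst component from the catalyst layer "
                  "does not occur, the corrosion resistance improves",
}

print(fields_missing_mv(first_run, voltage_fields))
print(fields_missing_mv(gpt4_run, voltage_fields))
```

For the first run both voltage fields are flagged; for the gpt-4 run only "Anode overvoltage" is, which matches the impression that gpt-4 did somewhat better.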


  • Specifying the URLs one at a time produced noticeably more output.

extracted_content = scrape_with_playwright(urls[1], schema=schema)
Fetching pages: 100%|##########| 1/1 [00:00<00:00, 13.23it/s] 
Extracting content with LLM [{'Anode initial overvoltage': '1.7 to 1.9 V',   'Anode overvoltage': '0.3 to 0.4 Acm−2',   'durability': 'more than several tens of years'}]
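A possible reason single URLs work better: scrape_with_playwright passes only splits[0] to extract, so when several URLs are loaded at once only the first chunk of the first document reaches the LLM. The sketch below shows one way to run an extractor over every chunk instead. extract_from_all_chunks and fake_extract are hypothetical names, and fake_extract is a stand-in so the example runs without an API key; the sample chunks are made up.

```python
def extract_from_all_chunks(chunks, extract_fn):
    """Call extract_fn on every chunk and collect the returned records.

    extract_fn is expected to return a list of dicts, like the outputs
    shown in this post. Each record is tagged with its chunk index.
    """
    results = []
    for index, chunk in enumerate(chunks):
        for record in extract_fn(chunk):
            results.append({"chunk": index, **record})
    return results


def fake_extract(text):
    """Stand-in for extract(): returns a record only if 'mV' is in the text."""
    if "mV" in text:
        return [{"Anode overvoltage": text}]
    return []


# Made-up chunks; in the real pipeline these would be split.page_content values.
chunks = ["Abstract and claims", "overvoltage of 250 mV", "References"]
print(extract_from_all_chunks(chunks, fake_extract))
```

In the pipeline above this might be called as extract_from_all_chunks([s.page_content for s in splits], lambda text: extract(content=text, schema=schema)), at the cost of one LLM call per chunk.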



4. Other

  • I also tried Research Automation, the next item on the official page.



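# Run
# llm and web_research_retriever are assumed to be set up beforehand as on the
# official page; that setup is not shown in this post.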
import logging
logging.basicConfig()
logging.getLogger("langchain.retrievers.web_research").setLevel(logging.INFO)
from langchain.chains import RetrievalQAWithSourcesChain
user_input = "Find out the performance of the anode initialization voltage of alkaline water electrolysis technology.Please tell me the best performance."
qa_chain = RetrievalQAWithSourcesChain.from_chain_type(llm,retriever=web_research_retriever,return_source_documents=True)
result = qa_chain({"question": user_input})
result

  • It answered with its sources attached, which is also handy.

  • "Co3Se4/CF electrodes, which can deliver 10 and 100 mA cm−2 for over 3500 and 2000 hours without noticeable degradation"


{'question': 'Find out the performance of the anode initialization voltage of alkaline water electrolysis technology.Please tell me the best performance.',  'answer': 'The best performance of the anode initialization voltage of alkaline water electrolysis technology is achieved using two symmetrical Co3Se4/CF electrodes, which can deliver 10 and 100 mA cm−2 for over 3500 and 2000 hours without noticeable degradation. This performance is superior to electrolyzers consisting of Pt cathodes and RuO2 anodes under similar conditions. (',  'sources': 'https://pubs.rsc.org/en/content/articlehtml/2022/ma/d2ma00185c)',  'source_documents': [Document(page_content='Tafel slope of 44 mV dec−1, and outstanding electrocatalytic stability at\nvarious current densities for the OER. The water electrolyser constructed\nusing two symmetrical Co3Se4/CF electrodes could deliver 10 and 100 mA cm−2\nfor over 3500 and 2000 h without noticeable degradation, respectively (Fig.\n5).164 These alkaline electrolysers showed superior performance to those\nconsisting of the Pt cathode and the RuO2 anode under similar conditions.', metadata={'source': 'https://pubs.rsc.org/en/content/articlehtml/2022/ma/d2ma00185c'}),   Document(page_content='H+ and OH− is 1 M) according to the Nernst equation. To effectively harvest\nthe ENE, an asymmetric electrochemical cell should be obtained with the\ncathodic reaction consuming H+ in an acidic catholyte and anodic reaction\nconsuming OH− in an alkaline anolyte. This asymmetric acid/alkaline\nelectrochemical cell gives rise to a theoretical voltage of 0.0591 × ΔpH,\nwhich equals 0.828 V with ΔpH = 14. 
Therefore, the ENE may reduce the applied\nwater electrolysis voltage through a rational design of electrolysers with\nasymmetric acid/alkaline electrolytes, as the harvested ENE from\nneutralization reaction can provide an additional internal voltage input.316\nNormally, an appropriate ion-selective membrane separator is indispensable to\nmaintain the ionic current flow and conductivity and avoid the direct\nneutralization. On the contrary, the theoretical energy required to dissociate\none mole of water to H+ and OH− is 79.9 kJ mol−1 which is called the\ndissociation energy and translates to an additional external voltage input.\nThe bipolar membrane (BPM) consisting of a cation-exchange membrane (CEM) and\nan anion exchange membrane (AEM) laminated together is a commonly used\nseparator used in the asymmetric acid/alkaline electrochemical cells.317,318\nUnder a fixed external voltage direction, for the given OER and HER\nelectrodes, the placement of two sides of a BPM and sequence of acid and\nalkaline electrolytes influence the operating process and voltage. Fig. 18', metadata={'source': 'https://pubs.rsc.org/en/content/articlehtml/2022/ma/d2ma00185c'}),   Document(page_content='the Ni3S2/Ni foam showed much smaller voltage than full water electrolysis on\ntwo identical electrodes and 100% faradaic efficiency for H2 production in 1.0\nM KOH (Fig. 17). This hybrid water electrolysis demonstrates great potential\nfor H2 production and electrochemical organic reforming.312–314 Normally, this\nhybrid water electrolysis technology produces H2 at the cathode and nongaseous\noxidative products. Very recently, Wang et al. 
reported a novel hybrid water\nelectrolyser that combines the cathodic HER and low-potential anodic oxidation\nof aldehyde with a low onset voltage of merely 0.1 V.315 Unlike conventional\naldehyde electrooxidation at the anode, in which the hydrogen atom of the\naldehyde group is oxidized into H2O at high potentials and nongaseous product\nmolecules are generated, the low-potential aldehyde oxidation can produce H2\nfrom the hydrogen atom of aldehyde at the anode. In other words, H2 can be\nproduced at both the cathode and anode simultaneously. The demonstrated\nelectrolyser requires an electricity input of only ∼0.35 kW h per m3 of H2, in\ncontrast to the ∼5 kW h per m3 of H2 required for conventional water\nelectrolysis. Therefore, the hybrid water electrolysis technology has great\npotential for future application.', metadata={'source': 'https://pubs.rsc.org/en/content/articlehtml/2022/ma/d2ma00185c'}),   Document(page_content='deliver the most significant boost in the current density under various\nmagnetic field intensities ranging from 0.6 to 4.5 T. These results suggest\nthat the integrated magnetic field is likely compatible with the current zero-\ngap configuration of alkaline electrolyser. Note that the influence of FL on\nthe physical movement of OH− and H3O+ ions is negligible, because both ions in\nthe aqueous solutions move by sequential proton hopping/transfer instead of\nphysical motion, which is known as the Grotthuss mechanism.235', metadata={'source': 'https://pubs.rsc.org/en/content/articlehtml/2022/ma/d2ma00185c'}),   Document(page_content="increased ΔpH should have theoretically decreased the voltage. This is likely\nbecause the BPM suffers from poor chemical stability when exposed to strong\nacid and base with increased resistance. 
Later, the same group replaced Ni2P\nwith bifunctional Ru–RuO2 nanoparticles loaded on carbon nanotubes in the\nsimilar electrolyser design, realizing a smaller onset potential of 0.65 V and\nlower voltage of 0.73 V at 10 mA cm−2 due to the higher electrocatalytic\nactivities of Ru–RuO2 for acidic HER and alkaline OER.324 Liu's group used\nbifunctional cobalt nickel phosphide as the acidic HER and alkaline OER\nelectrodes with an “irregular” BPM operating under a forward bias and the\nelectrolyser could be driven to deliver 13 mA cm−2 using a photovoltaic cell\nof 0.908 V.325 Later, they used CoP–CoTe2 composite nanowires as the\nbifunctional HER and OER electrocatalysts with a similar water electrolyser\nconfiguration.326 They compared the BPM-assisted acid/alkaline asymmetric\nelectrolyte water electrolysis under the forward and reverse bias conditions\nand found a voltage decrease by 720 mV at 10 mA cm−2 under the forward bias.\nDespite a smaller water electrolysis voltage enabled by the BPM under a\nforward bias, many studies focus on the reverse bias to achieve longer\nlifetime and reduce BPM delamination. In 2014, McDonald et al. reported the\nacid/alkaline asymmetric electrolyte water electrolysis using the Pt\nelectrodes with a BPM under the reverse bias and proposed that the anion", metadata={'source': 'https://pubs.rsc.org/en/content/articlehtml/2022/ma/d2ma00185c'}),   Document(page_content='off at different field intensities at an applied anodic current of 35 mA. This\ninstantaneous response of the potential to AMF clearly reflects the pronounced\nlocalized heating of Fe2.2C@Ni. The full water electrolysis cell performance\nwas also investigated in a zero-gap and flow-cell setup close to the\ncommercial alkaline water electrolyser in the presence and absence of AMF,\nwhich substantiates that the water electrolysis voltage can be reduced by\nleveraging the magnetothermal effect upon the Fe2.2C@Ni electrodes induced by\nAMF-triggered localized heating. 
Note that the reduced voltage may be ascribed\nto the multiple synergy effects of improved mass transfer and kinetics and\nreduced energy barrier from localized heating and bubble coverage from MHD. As\nsuch, this work provides an interesting proof-of-concept for magnetothermal\nwater electrolysis technology in a relatively non-destructive heating way\nwhich brings new vitality to alkaline water electrolysers. Nonetheless,\nfurther fundamental and engineering investigations are still required. First,\nthe magnetothermal effect and other magnetic field induced effects are not\nexplicitly deconvoluted to elucidate how they separately influence the\nactivity enhancement. Second, unlike the use of a static permanent magnetic\nfield, there is an additional energy input for high-frequency AMF. Therefore,\nthe comparison between overall energy consumption by using AMF and', metadata={'source': 'https://pubs.rsc.org/en/content/articlehtml/2022/ma/d2ma00185c'}),   Document(page_content='electrodes, the placement of two sides of a BPM and sequence of acid and\nalkaline electrolytes influence the operating process and voltage. Fig. 18\nillustrates four configurations using acid/base asymmetric electrolytes in two\ncompartments separated by a BPM for water electrolysis.316Fig. 18a shows the\noptimal configuration that can effectively use ENE to minimize the theoretical\nwater electrolysis voltage. In this case, the HER and the OER take place in\nthe acidic (pH = 0) and alkaline (pH = 14) electrolytes, respectively, and the\nBPM operates under the forward bias, during which the counter ions of acid\n(anions of X−) and base (cations of M+) can penetrate into the BPM providing\nionic transport under the applied voltage. Therefore, the theoretical water\nelectrolysis voltage in this case is only 0.401 V (Fig. 18e). 
However, the\noperation of a BPM under a forward bias may cause the accumulation of salt\nions in the BPM and result in the contamination and delamination of the CEM\nand AEM layers. The configuration shown in Fig. 18b makes the HER and the OER\noccur in alkaline and acidic electrolytes, respectively, while the operation\nof the BPM remains under a forward bias, in which OH− and H+ move into the BPM\nand water is formed inside. This configuration cannot utilize the ENE and\nhence leads to a higher theoretical water electrolysis voltage of 2.057 V. The\nconfiguration shown in Fig. 18c enables the HER and the OER to occur in the', metadata={'source': 'https://pubs.rsc.org/en/content/articlehtml/2022/ma/d2ma00185c'}),   Document(page_content='The potentials of two half reactions (HER and OER) are dependent on the pH\nvalue. Therefore, the entire water electrolysis voltage can be tuned by\ncontrolling the different pH values of electrolytes in the cathodic and anodic\ncompartments. As shown in Fig. 1a, the potential gap between the HER and the\nOER can be theoretically reduced to 0.401 V by using an acidic electrolyte (pH\n= 0) in the cathodic compartment and an alkaline electrolyte (pH = 14) in the\nanodic compartment. On the contrary, the potential gap between the HER and the\nOER can be increased to 2.057 V by coupling an alkaline catholyte (pH = 14)\nfor the HER with an acidic anolyte (pH = 0) for the OER. The water\nelectrolysis voltage in the acid/alkaline asymmetric electrolytes is\nintrinsically related to the electrochemical neutralization energy (ENE) and\ndissociation energy.316 The ENE is related to the converted electrochemical\nvoltage output when the spontaneous acid–base neutralization reaction takes\nplace as follows: | H+ \\+ OH− ↔ H2O| (23)  \n---|---  \nwhere the change of the Gibbs free energy ΔG0 = −79.9 kJ mol−1, the enthalpy\nchange ΔH0 = −55.84 kJ mol−1 and the thermal energy TΔS0 = 24.06 kJ mol−1\nunder standard conditions (298.15 K, 1 atm). 
The neutralization energy can be\nharvested in the electrochemical form, which translates to a theoretical ENE\nvoltage (EENE) of 0.828 V under the standard conditions (concentration of both\nH+ and OH− is 1 M) according to the Nernst equation. To effectively harvest', metadata={'source': 'https://pubs.rsc.org/en/content/articlehtml/2022/ma/d2ma00185c'}),   Document(page_content='for a long time under the acidic OER conditions, the application of developed\nbifunctional Janus materials focuses on alkaline water electrolysis, which\nwill potentially mitigate the incompatibility, simplify the device design,\nimprove the longevity, and reduce the costs.', metadata={'source': 'https://pubs.rsc.org/en/content/articlehtml/2022/ma/d2ma00185c'})]}
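The raw result is hard to read because every source document is printed in full. A minimal sketch for condensing it, assuming the result has the keys seen above ('answer', 'source_documents') and that each document exposes page_content and metadata: summarize_result is a hypothetical helper, and Doc is a stand-in for the LangChain Document class so the example runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document (page_content plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_result(result, preview_chars=60):
    """Reduce a QA result dict to the answer, unique sources, and previews."""
    docs = result.get("source_documents", [])
    return {
        "answer": result.get("answer", "").strip(),
        "sources": sorted({d.metadata.get("source", "") for d in docs}),
        "n_documents": len(docs),
        "previews": [
            d.page_content[:preview_chars].replace("\n", " ") for d in docs
        ],
    }


# Small made-up result with the same shape as the output above.
example = {
    "question": "q",
    "answer": " Example answer. ",
    "source_documents": [
        Doc("first\nchunk", {"source": "https://example.com/a"}),
        Doc("second chunk", {"source": "https://example.com/a"}),
    ],
}
print(summarize_result(example))
```

Calling summarize_result(result) on the real output would list the single pubs.rsc.org URL once instead of repeating it for every document.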


