Patent documents typically use legal and highly technical language, with context-dependent terms that may have meanings quite different from colloquial usage and even between different documents. The process of using traditional patent search methods (e.g., keyword searching) to search through the corpus of over one hundred million patent documents can be tedious and result in many missed results due to the broad and non-standard language used. For example, a "soccer ball" may be described as a "spherical recreation device", "inflatable sportsball" or "ball for ball game". Additionally, the language used in some patent documents may obfuscate terms to their advantage, so more powerful natural language processing (NLP) and semantic similarity understanding can give everyone access to do a thorough search.
The patent domain (and more general technical literature like scientific publications) poses unique challenges for NLP modeling due to its use of legal and technical terms. While there are multiple commonly used general-purpose semantic textual similarity (STS) benchmark datasets (e.g., STS-B, SICK, MRPC, PIT), to the best of our knowledge, there are currently no datasets focused on technical concepts found in patents and scientific publications (the somewhat related BioASQ challenge contains a biomedical question answering task). Moreover, with the continuing growth in size of the patent corpus (millions of new patents are issued worldwide every year), there is a need to develop more useful NLP models for this domain.
Today, we announce the release of the Patent Phrase Similarity dataset, a new human-rated contextual phrase-to-phrase semantic matching dataset, and the accompanying paper, presented at the SIGIR PatentSemTech Workshop, which focuses on technical terms from patents. The Patent Phrase Similarity dataset contains ~50,000 rated phrase pairs, each with a Cooperative Patent Classification (CPC) class as context. In addition to similarity scores, which are typically included in other benchmark datasets, we include granular rating classes similar to WordNet, such as synonym, antonym, hypernym, hyponym, holonym, meronym, and domain related. This dataset (distributed under the Creative Commons Attribution 4.0 International license) was used by Kaggle and USPTO as the benchmark dataset in the U.S. Patent Phrase to Phrase Matching competition to draw more attention to the performance of machine learning models on technical text. Initial results show that models fine-tuned on this new dataset perform substantially better than general pre-trained models without fine-tuning.
The Patent Phrase Similarity Dataset
To better train the next generation of state-of-the-art models, we created the Patent Phrase Similarity dataset, which includes many examples to address the following problems: (1) phrase disambiguation, (2) adversarial keyword matching, and (3) hard negative keywords (i.e., keywords that are unrelated but received a high score for similarity from other models). Some keywords and phrases can have multiple meanings (e.g., the phrase "mouse" may refer to an animal or a computer input device), so we disambiguate the phrases by including CPC classes with each pair of phrases. Also, many NLP models (e.g., bag of words models) will not do well on data with phrases that have matching keywords but are otherwise unrelated (adversarial keywords, e.g., "container section" → "kitchen container", "offset table" → "table fan"). The Patent Phrase Similarity dataset is designed to include many examples of matching keywords that are unrelated through adversarial keyword match, enabling NLP models to improve their performance.
Each entry in the Patent Phrase Similarity dataset contains two phrases, an anchor and a target, a context CPC class, a rating class, and a similarity score. The dataset contains 48,548 entries with 973 unique anchors, split into training (75%), validation (5%), and test (20%) sets. When splitting the data, all of the entries with the same anchor are kept together in the same set. There are 106 different context CPC classes and all of them are represented in the training set.
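One way to keep all entries with the same anchor in the same set is to assign splits per anchor rather than per entry. The sketch below illustrates this with a deterministic hash of the anchor string; this hashing scheme is purely illustrative (the dataset ships with its official split assignments), and the sample entries are hypothetical.

```python
import hashlib

def split_for(anchor: str, train: float = 0.75, val: float = 0.05) -> str:
    """Deterministically map an anchor phrase to a split, so every
    entry sharing that anchor lands in the same set."""
    bucket = int(hashlib.md5(anchor.encode()).hexdigest(), 16) % 10_000 / 10_000
    if bucket < train:
        return "train"
    if bucket < train + val:
        return "validation"
    return "test"

# Hypothetical entries: (anchor, target, rating, score).
entries = [
    ("acid absorption", "absorption of acid", "exact", 1.0),
    ("acid absorption", "acid immersion", "synonym", 0.75),
    ("gasoline blend", "petrol blend", "synonym", 0.75),
]
assignments = [(entry, split_for(entry[0])) for entry in entries]
```

Because the split depends only on the anchor, no anchor can leak between training and test sets.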
Anchor | Target | Context | Rating | Score |
acid absorption | absorption of acid | B08 | exact | 1.0 |
acid absorption | acid immersion | B08 | synonym | 0.75 |
acid absorption | chemically soaked | B08 | domain related | 0.25 |
acid absorption | acid reflux | B08 | not related | 0.0 |
gasoline blend | petrol blend | C10 | synonym | 0.75 |
gasoline blend | fuel blend | C10 | hypernym | 0.5 |
gasoline blend | fruit blend | C10 | not related | 0.0 |
faucet assembly | water faucet | A22 | hyponym | 0.5 |
faucet assembly | water supply | A22 | holonym | 0.25 |
faucet assembly | school assembly | A22 | not related | 0.0 |
A small sample of the dataset with anchor and target phrases, context CPC class (B08: Cleaning, C10: Petroleum, gas, fuels, lubricants, A22: Butchering, processing meat/poultry/fish), a rating class, and a similarity score. |
Generating the Dataset
To generate the Patent Phrase Similarity data, we first process the ~140 million patent documents in the Google Patents corpus and automatically extract important English phrases, which are typically noun phrases (e.g., "fastener", "lifting assembly") and functional phrases (e.g., "food processing", "ink printing"). Next, we filter and keep phrases that appear in at least 100 patents and randomly sample around 1,000 of these filtered phrases, which we call anchor phrases. For each anchor phrase, we find all of the matching patents and all of the CPC classes for those patents. We then randomly sample up to four matching CPC classes, which become the context CPC classes for the specific anchor phrase.
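At a toy scale, the filter-and-sample step above can be sketched as follows. The mini "corpus", its CPC labels, and the threshold of 2 documents are stand-ins for the real pipeline's 100-patent cutoff; the phrase extraction itself is assumed to have already happened.

```python
import random
from collections import Counter

# Toy stand-in for the corpus: each "patent" is a set of pre-extracted
# phrases plus its CPC classes (all values hypothetical).
patents = [
    ({"fastener", "lifting assembly"}, {"B25", "F16"}),
    ({"fastener", "food processing"}, {"A23"}),
    ({"fastener", "ink printing"}, {"B41"}),
    ({"lifting assembly", "ink printing"}, {"B41", "B66"}),
]
MIN_DOCS = 2  # the real pipeline keeps phrases appearing in >= 100 patents

# Count how many patents each phrase appears in, then keep frequent ones.
doc_freq = Counter(p for phrases, _ in patents for p in phrases)
frequent = sorted(p for p, n in doc_freq.items() if n >= MIN_DOCS)

# Sample anchor phrases, then collect (up to four) CPC classes as context.
rng = random.Random(0)
anchors = rng.sample(frequent, k=2)
contexts = {
    a: sorted({c for phrases, cpcs in patents if a in phrases for c in cpcs})[:4]
    for a in anchors
}
```

Here the context classes are truncated deterministically for reproducibility; the actual pipeline samples up to four at random.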
We use two different methods for pre-generating target phrases: (1) partial matching and (2) a masked language model (MLM). For partial matching, we randomly select phrases from the entire corpus that partially match with the anchor phrase (e.g., "abatement" → "noise abatement", "material formation" → "formation material"). For MLM, we select sentences from the patents that contain a given anchor phrase, mask it out, and use the Patent-BERT model to predict candidates for the masked portion of the text. Then, all of the phrases are cleaned up, which includes lowercasing and the removal of punctuation and certain stopwords (e.g., "and", "or", "said"), and sent to expert raters for review. Each phrase pair is rated independently by two raters skilled in the technology area. Each rater also generates new target phrases with different ratings. Specifically, they are asked to generate some low-similarity and unrelated targets that partially match with the original anchor and/or some high-similarity targets. Finally, the raters meet to discuss their ratings and come up with final ratings.
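The cleanup step (lowercasing, stripping punctuation and stopwords) might look like the sketch below. The stopword set contains only the three words named above; the real pipeline presumably uses a longer list.

```python
import string

# Example stopwords from the text; the actual list is likely longer.
STOPWORDS = {"and", "or", "said"}

def clean_phrase(phrase: str) -> str:
    """Lowercase, replace punctuation with spaces, and drop stopwords."""
    table = str.maketrans({c: " " for c in string.punctuation})
    words = phrase.lower().translate(table).split()
    return " ".join(w for w in words if w not in STOPWORDS)
```

Replacing punctuation with spaces (rather than deleting it) keeps hyphenated phrases like "noise-abatement" as two words instead of fusing them.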
Dataset Evaluation
To evaluate its performance, the Patent Phrase Similarity dataset was used in the U.S. Patent Phrase to Phrase Matching Kaggle competition. The competition was very popular, drawing about 2,000 competitors from around the world. A variety of approaches were successfully used by the top scoring teams, including ensemble models of BERT variants and prompting (see the full discussion for more details). The table below shows the best results from the competition, as well as several off-the-shelf baselines from our paper. The Pearson correlation metric was used to measure the linear correlation between the predicted and true scores, which is a helpful metric to target for downstream models so they can distinguish between different similarity ratings.
The baselines in the paper can be considered zero-shot in the sense that they use off-the-shelf models without any further fine-tuning on the new dataset (we use these models to embed the anchor and target phrases separately and compute the cosine similarity between them). The Kaggle competition results demonstrate that by using our training data, one can achieve significant improvements compared with existing NLP models. We have also estimated human performance on this task by comparing a single rater's scores to the combined score of both raters. The results indicate that this is not a particularly easy task, even for human experts.
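The zero-shot baseline procedure reduces to cosine similarity between phrase embeddings, scored against the gold labels with Pearson correlation. A self-contained sketch, with toy two-dimensional vectors standing in for a real encoder's embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pearson(x, y):
    """Pearson correlation coefficient between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy (embedding pair, gold similarity score) examples -- hypothetical values.
pairs = [
    (([1.0, 0.0], [0.9, 0.1]), 1.0),  # near-exact match
    (([1.0, 0.0], [0.5, 0.5]), 0.5),  # partial overlap
    (([1.0, 0.0], [0.0, 1.0]), 0.0),  # unrelated
]
preds = [cosine(u, v) for (u, v), _ in pairs]
golds = [score for _, score in pairs]
r = pearson(preds, golds)
```

In practice one would use `scipy.stats.pearsonr` and a real sentence encoder; the point is only that the zero-shot baseline requires no training, just embedding and correlation.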
Model | Training | Pearson correlation |
word2vec | Zero-shot | 0.44 |
Patent-BERT | Zero-shot | 0.53 |
Sentence-BERT | Zero-shot | 0.60 |
Kaggle 1st place single | Fine-tuned | 0.87 |
Kaggle 1st place ensemble | Fine-tuned | 0.88 |
Human | 0.93 |
Performance of popular models with no fine-tuning (zero-shot), models fine-tuned on the Patent Phrase Similarity dataset as part of the Kaggle competition, and single human performance. |
Conclusion and Future Work
We present the Patent Phrase Similarity dataset, which was used as the benchmark dataset in the U.S. Patent Phrase to Phrase Matching competition, and demonstrate that by using our training data, one can achieve significant improvements compared with existing NLP models.
Additional challenging machine learning benchmarks can be generated from the patent corpus, and patent data has made its way into many of today's most-studied models. For example, the C4 text dataset used to train T5 contains many patent documents. The BigBird and LongT5 models also use patents via the BIGPATENT dataset. The availability, breadth, and open usage terms of full text data (see Google Patents Public Datasets) make patents a unique resource for the research community. Possibilities for future tasks include massively multi-label classification, summarization, information retrieval, image-text similarity, citation graph prediction, and translation. See the paper for more details.
Acknowledgements
This work was possible through a collaboration with Kaggle, Satsyil Corp., USPTO, and MaxVal. Thanks to contributors Ian Wetherbee from Google, Will Cukierski and Maggie Demkin from Kaggle. Thanks to Jerry Ma, Scott Beliveau, and Jamie Holcombe from USPTO and Suja Chittamahalingam from MaxVal for their contributions.