The raw patent data originates from the USPTO.
Google Patents makes historical USPTO patents digitally and publicly available. Google used optical character recognition (OCR) software to read the text of these historical patent documents, digitized them, and hosts them online.
A team of researchers at Utrecht University and UCLA scraped these text files, then cleaned and structured the data. The resulting data are publicly available as HistPat by Petralia, Balland and Rigby (2016) and can be found here at Nature’s Scientific Data.
This database is an extremely rich source. For more than 4 million patents it provides information on: (1) the first inventor, (2) her or his geographical location, (3) the application year, (4) the grant year, (5) the technology class(es) and sometimes (6) an assignee.
However, it does not provide any information on additional inventors who might have collaborated on the patents.
This is where I contribute. I’ve mined the text of more than 4 million HistPat text files. Using complex search and matching algorithms, I examined every word to identify inventors’ names and their exact geographical locations.
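To give a flavor of what such a search step can look like, here is a minimal sketch in Python. It is not the actual pipeline: the pattern `CANDIDATE` and the function `find_candidates` are hypothetical names, and the sketch assumes the common preamble style in which an upper-case name is followed by ", of" and a place name. Real OCR text is far noisier than this toy example.

```python
import re

# Hypothetical pattern: an upper-case name of two or more tokens
# ("JOHN A. SMITH"), followed by ", of " and a capitalized place name.
CANDIDATE = re.compile(
    r"(?P<name>[A-Z][A-Z.]*(?: [A-Z][A-Z.]*)+), of "
    r"(?P<city>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"
)

def find_candidates(text):
    """Return (name, city) pairs for every could-be inventor in the text."""
    return [(m.group("name"), m.group("city")) for m in CANDIDATE.finditer(text)]

sample = (
    "Be it known that we, JOHN A. SMITH, of Boston, and MARY JONES, "
    "of New York, have invented a new and useful Improvement in Looms."
)
candidates = find_candidates(sample)
```

A pattern like this also fires on witnesses and other people named in the same style, which is why the candidates need a separate filtering step.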
After picking up more than 8 million could-be inventors, I used state-of-the-art (fully supervised) machine learning techniques to identify which could-be’s are truly inventors – not witnesses, examiners, assignees, etc.
Finally, building upon work by Ventura et al. (2015), I built a supervised machine learning algorithm to disambiguate inventors, that is, to determine which records refer to the same person.
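The general shape of a disambiguation step can be sketched as pairwise scoring followed by clustering. The Python sketch below is an illustration under loud assumptions, not the algorithm described above: `same_person_score` is a hand-written similarity rule standing in for a trained classifier, the `threshold` value is arbitrary, and the record fields `name` and `city` are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

def same_person_score(a, b):
    """Toy stand-in for a trained pairwise classifier (assumption):
    name string similarity plus a small bonus for a shared city."""
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    city_bonus = 0.1 if a["city"] == b["city"] else 0.0
    return min(1.0, name_sim + city_bonus)

def disambiguate(records, threshold=0.9):
    """Link records whose pairwise score clears the threshold and
    return one cluster label per record (union-find clustering)."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if same_person_score(records[i], records[j]) >= threshold:
            parent[find(j)] = find(i)
    return [find(i) for i in range(len(records))]

records = [
    {"name": "John A. Smith", "city": "Boston"},
    {"name": "John A Smith", "city": "Boston"},
    {"name": "Mary Jones", "city": "New York"},
]
labels = disambiguate(records)
```

Comparing all pairs is quadratic in the number of records, so a sketch like this would only be workable on millions of records after grouping them into small blocks first.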
The end product is an inventor-patent database that holds, for each historical U.S. patent from 1836 to 1975, information on all inventors and their geographical locations. This allows me to generate networks of collaboration that connect inventors within and between U.S. cities, as well as to track the movement of inventors over time and across technological and geographical space.
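As an illustration of how a collaboration network can be derived from an inventor-patent table, the following Python sketch counts co-invention ties. The row layout (patent identifier, inventor identifier), the identifiers themselves, and the function name `coinventor_edges` are assumptions made for the example and do not reflect the actual schema of the database.

```python
from collections import Counter, defaultdict
from itertools import combinations

def coinventor_edges(rows):
    """Count how often each pair of inventors appears on the same patent.

    rows: iterable of (patent_id, inventor_id) tuples (hypothetical layout).
    Returns a Counter mapping sorted inventor pairs to joint patent counts.
    """
    by_patent = defaultdict(set)
    for patent_id, inventor_id in rows:
        by_patent[patent_id].add(inventor_id)

    edges = Counter()
    for inventors in by_patent.values():
        # Every unordered pair of co-inventors on a patent forms one tie.
        for a, b in combinations(sorted(inventors), 2):
            edges[(a, b)] += 1
    return edges

rows = [
    ("P1", "inv_a"), ("P1", "inv_b"),
    ("P2", "inv_a"), ("P2", "inv_b"), ("P2", "inv_c"),
    ("P3", "inv_c"),
]
edges = coinventor_edges(rows)
```

Attaching a city to each inventor would then let the same edge list be aggregated into ties within and between cities.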