Computes a Word2Vec model from an input (text) corpus. Word2Vec is a neural-network-based approach for creating word embeddings (high-dimensional feature vectors) from a given input corpus. It relies on the distributional hypothesis, which states that words occurring in similar contexts tend to have similar meanings.
Note: Changed machine-learning algorithm implementations in Spark 3 may produce slightly different results than Spark 2. Overall performance has improved.
As input, the processor takes a table whose columns contain a corpus (a collection of written texts).
The processor forwards a table with the extracted words and their respective vectors (apart from the words column, the number of columns created equals the vector dimension set in the configuration).
For this example, we use the following corpus as input:
|Trees were swaying, though gently, and their leaves were rustling as if in applause to the change in the weather|
|This had been going on for several days|
|The men and women who gauge the climate on television were exultant over the unusual run of good weather as if it was they who had brought it on|
|how it's supposed to be done " is a trait all of the best deer hunters share|
|Plus , it's a lot of fun to pull off-the-wall stunts that actually work in special situations|
|The protesters here certainly know what they don't like: war, globalization, capitalism, drug laws, immigrant detention centers, a high-speed train line and, inexplicably, the Olympic torch|
|This is a discussion of war, " said Claudio Robba, 25, one of maybe 150 protesters at a piazza|