Overview

Computes a Word2Vec model from an input text corpus. Word2Vec is a neural-network-based approach that creates word embeddings (numeric feature vectors of a configurable dimension) from a given corpus. It relies on the distributional hypothesis, which states that words occurring in similar contexts (within a given context range) tend to have similar meanings.
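
To make the idea of a context range concrete, the following minimal Python sketch lists the (center word, context word) pairs that a skip-gram style training pass would see for one sentence of a corpus. The function name `context_pairs`, the lowercasing and whitespace tokenization, and the window size of 2 are assumptions for illustration only; the processor's actual tokenization and window setting are not described here.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for all tokens within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clamp the window to the sentence boundaries.
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Assumed preprocessing: lowercase and split on whitespace.
sentence = "This had been going on for several days"
tokens = sentence.lower().split()
pairs = context_pairs(tokens, window=2)
```

Words that repeatedly share context words across many such pairs end up with similar vectors, which is the distributional hypothesis in practice.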


Note: The machine learning algorithm implementations changed in Spark 3, so results may differ slightly from those produced with Spark 2. Overall performance has improved.


Input

As input, the processor takes a table with columns containing a corpus (collection of written texts).


Configuration

Output

The processor outputs a table containing each extracted word and its vector. Apart from the word column, the number of columns created equals the vector dimension set in the configuration.
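
As a rough illustration of this layout, the sketch below flattens a word-to-vector mapping into one header row and one data row per word, giving 1 + dimension columns. The function name `to_output_rows`, the column names `word` and `v0`, `v1`, ... and the vector values are all made up for this example; the processor's real column names and values may differ.

```python
def to_output_rows(vectors):
    """Flatten a word -> vector mapping into a header and one row per word."""
    # The dimension is taken from the first vector; all vectors share it.
    dimension = len(next(iter(vectors.values())))
    header = ["word"] + [f"v{i}" for i in range(dimension)]
    rows = [[word] + list(vec) for word, vec in vectors.items()]
    return header, rows


# Made-up vectors of dimension 3; real values come from the trained model.
example_vectors = {
    "weather": [0.12, -0.40, 0.07],
    "war": [-0.31, 0.22, 0.58],
}
header, rows = to_output_rows(example_vectors)
```

With a dimension of 3, each row has four entries: the word followed by its three vector components.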


Example

Example Input

For this example, we use the following corpus as input:


Trees were swaying, though gently, and their leaves were rustling as if in applause to the change in the weather
This had been going on for several days
The men and women who gauge the climate on television were exultant over the unusual run of good weather as if it was they who had brought it on
how it's supposed to be done " is a trait all of the best deer hunters share
Plus , it's a lot of fun to pull off-the-wall stunts that actually work in special situations
The protesters here certainly know what they don't like: war, globalization, capitalism, drug laws, immigrant detention centers, a high-speed train line and, inexplicably, the Olympic torch
This is a discussion of war, " said Claudio Robba, 25, one of maybe 150 protesters at a piazza


Workflow

Example Configuration

Result


Related Articles

Decision Tree Regression Forecast

Decision Tree Classification Forecast

Random Forest Classification Forecast Processor

Random Forest Regression Forecast Processor