Word embedding is one of the fundamental concepts of natural language processing. Posts about word embedding methods are often too technical and can be difficult for beginners to follow. So, rather than explaining how the methods work internally, I will simply talk about what they do and what they can be used for.

A numerical form of data is required to create machine learning models, so data of other types must be converted into numerical form. Categorical data can be converted into numerical values in different ways. The values {“low”, “medium”, “high”} have an ordinal relation and can be represented by the numbers {1, 2, 3} in an ordered way. Or, if we have days of the week, they can be numbered from 1 to 7. In that case, although the 1st and 7th days are actually consecutive, the number 7 ends up far from the number 1. In such cases, the categorical values can be mapped onto a sine or cosine wave, which makes the distances between values more realistic.
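The sine/cosine mapping for days of the week can be sketched in a few lines; the only assumption is a period of 7:

```python
import numpy as np

# Days of the week, numbered 1..7
days = np.arange(1, 8)

# Map each day onto a point on the unit circle so that
# day 7 and day 1 end up adjacent instead of far apart.
day_sin = np.sin(2 * np.pi * days / 7)
day_cos = np.cos(2 * np.pi * days / 7)
```

With this encoding, the distance between day 7 and day 1 is the same as between any other pair of consecutive days.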

Of course, not all categorical data is sortable. If we have movie genres such as {“horror”, “thriller”, “romantic”, “comedy”}, we cannot assign sequential numbers to them. Instead, the values can be represented with 0s and 1s using one-hot encoding. The numerical conversion of a table with columns for the three cases we mentioned can be seen in Figure 1 (code sample*).
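One-hot encoding of the genre example can be sketched with pandas (the genre values are taken from the text above):

```python
import pandas as pd

# Unsortable categorical values: movie genres
genres = pd.Series(["horror", "thriller", "romantic", "comedy"])

# One column per genre; each row has a 1 in its own genre's column
one_hot = pd.get_dummies(genres)
```

Each row now contains a single 1 (or True) marking its genre, and 0s everywhere else.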

Figure 1. Converting categorical data into numerical values

So far, we have talked about assigning numbers to words, which is encoding. But in some cases such a simple transformation is not enough. If we want to cluster the data in the example, encoding tells us nothing about the semantic relationships between the values, even though horror and thriller movies should be close to each other in the genre space.

If our variables were sentences, how would we represent them? The first method that comes to mind is to create a column for each word that occurs in the sentences, similar to encoding (bag of words). Scores can then be assigned according to word frequency, weighted by how rare each word is across documents (TF-IDF). But here, too, we cannot capture the relationships between words. This method also does not scale to more complex problems, since we need a new column for every distinct word (the curse of dimensionality). Let’s move on to embedding.
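Bag of words and TF-IDF can be sketched with scikit-learn; the two example sentences below are my own toy data:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentences = [
    "horror movies are scary",
    "thriller movies are exciting",
]

# Bag of words: one column per distinct word, raw occurrence counts
bow = CountVectorizer().fit_transform(sentences)

# TF-IDF: the same columns, but scores are down-weighted for words
# that appear in many documents (here, "movies" and "are")
tfidf = TfidfVectorizer().fit_transform(sentences)
```

Note how the number of columns equals the vocabulary size; with a realistic vocabulary this grows into thousands of columns, which is exactly the dimensionality problem mentioned above.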

In embedding, words are placed so that they end up closest to the words they are most strongly related to. This relation can belong to many different categories. For example, the country name England should be close to other country names, but also to expressions specific to England. In this way, a structure with hundreds of dimensions can be built while creating an embedding model. The example below shows several countries, their capitals, and their relationship to a football team. Line 1 captures the country relation, line 2 the capitals, and line 3 the football teams; lines 4, 5 and 6 represent each country’s relation to its capital and its football team (Figure 2).

Figure 2. A word space representation

Of course, these relations will not all be equally strong. Countries may be mentioned more often with each other, or with their own capital. Figure 3 is a more accurate representation in this respect.

Figure 3. A more realistic word space representation

After learning what word embedding is, let’s talk about a few embedding methods. As mentioned, the proximity of words to each other is based on the words they occur together with. Let’s take a few articles on “artificial intelligence” from Wikipedia and do an experiment. We are going to use n-grams here: an n-gram is a sequence of n adjacent words, where “n” is a value we choose.

“first example of word embedding”

The 2-grams of this sentence (after removing the stopword “of” in preprocessing):

“first example”

“example word”

“word embedding”
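The 2-gram extraction above can be sketched in a few lines of Python; stopword removal is assumed to be part of the preprocessing:

```python
def ngrams(tokens, n=2):
    """Return the list of n-grams (tuples of n adjacent tokens)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

stopwords = {"of"}  # minimal stopword list, just enough for this example
tokens = [w for w in "first example of word embedding".split()
          if w not in stopwords]

pairs = ngrams(tokens, n=2)
# → [('first', 'example'), ('example', 'word'), ('word', 'embedding')]
```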

After doing the preprocessing steps on the corpus we created with the data retrieved from Wikipedia and putting the content into the required format, let’s find the word pairs that occur together in 2-grams and look at their frequencies. The 8 words that most frequently appear together with the word “artificial” can be seen in Figure 4.
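Counting the 2-gram neighbours of a target word can be sketched as follows; the two sentences are a toy stand-in for the Wikipedia corpus, assumed already preprocessed:

```python
from collections import Counter

# Toy corpus standing in for the preprocessed Wikipedia articles
corpus = [
    "artificial intelligence is the study of intelligent agents".split(),
    "machine learning is a subfield of artificial intelligence".split(),
]

neighbours = Counter()
for tokens in corpus:
    for a, b in zip(tokens, tokens[1:]):          # adjacent pairs = 2-grams
        if "artificial" in (a, b):
            neighbours[b if a == "artificial" else a] += 1

# neighbours.most_common() now ranks the words seen next to "artificial"
```

On the real corpus, the top of this ranking is what Figure 4 shows.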

Figure 4. Words close to the word “artificial”

Although our corpus is small, the results are reasonable. Accordingly, in our vocabulary the word nearest to “artificial” should be “intelligence”.

Word2Vec and FastText, among the most widely used embedding methods, are artificial neural network models in which the words in the corpus are trained with CBOW or skip-gram. Pre-trained embedding models are available for many different languages. However, since these are trained on general text, they may not be very effective for domain-specific work. If there is enough data, training a model specific to the domain is also an option, and in many cases it is even necessary.

Let’s take chemistry-related content from Wikipedia this time and find the words nearest to the word “artificial” with 2-grams. Figure 5 contains the nearest words to “artificial” for the artificial intelligence and chemistry contents. As can be seen, the output is completely context-dependent: the two corpora produce different results.

Figure 5. 2-gram outputs created by the corpus of “artificial intelligence” and “chemistry”

Let’s train a FastText model this time with the corpus we have. In the trained model, words are placed in the word space according to many different categories. Figure 6 shows the 60-dimensional vector created for the word “artificial” (code sample*).

Figure 6. Vector showing the position of the word “artificial” in the word space.

And again, based on the model, words that are close to the word “artificial” are present in Figure 7.

Figure 7. Words that are close to the word “artificial” with the FastText model

The word spaces created by the model established with the words in the corpus can be visualized at https://projector.tensorflow.org/. To do that, you should create files with the .tsv extension (code sample*). The word space of the FastText model we trained is shown in Figure 8. It is a small space because the corpus is not very big. The relationships shown simply in Figure 3 actually consist of many more dimensions in this way.
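Exporting the .tsv files for the projector can be sketched as follows; the random vectors here are a toy stand-in — with a trained gensim model you would iterate over `model.wv.index_to_key` and use `model.wv[word]`:

```python
import numpy as np

# Toy stand-in vocabulary with random 60-dimensional vectors
rng = np.random.default_rng(0)
vectors = {w: rng.random(60) for w in ["artificial", "intelligence", "learning"]}

# The projector expects two files: tab-separated vector values,
# and the matching word labels, one line per word in the same order.
with open("vectors.tsv", "w") as vec_f, open("metadata.tsv", "w") as meta_f:
    for word, vec in vectors.items():
        vec_f.write("\t".join(f"{x:.5f}" for x in vec) + "\n")
        meta_f.write(word + "\n")
```

Uploading the two files via the projector’s “Load” button then renders the word space interactively.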

Figure 8. Word space in the model

To emphasize again that the models we created with the “artificial intelligence” and “chemistry” corpora produce content-specific output, Figure 9 shows the nearest words to the word “artificial” for both models.

Figure 9. Words close to the word “artificial” in “artificial intelligence” and “chemistry” models

Finally, let’s briefly talk about the advantages of the FastText model. FastText, which is based on Word2Vec, also works for words that are not in the vocabulary or are misspelled. When we look for the closest words to the misspelled word “artifcial” with FastText, we get a result like the one in Figure 10.

Figure 10. Words that are close to the misspelled word “artifcial”

When we try the same expression with the Word2Vec model, we get the following error:

KeyError: “word ‘artifcial’ not in vocabulary”

FastText has this ability because, during model training, small parts of the words (character n-grams) are used. Since it is a more complex model, it takes longer to train and occupies more memory than Word2Vec. Which one to choose can be decided according to the needs and the time/memory constraints.

I have tried to explain the word embedding concept in a simple way without going into too much detail. You can experiment on a different corpus with the code samples. In my next article, I will talk about embedding methods in more detail and compare the methods used.

Thanks :)