The Transformer architecture was introduced as a novel, pure attention-only sequence-to-sequence architecture by Vaswani et al. Its ability to train in parallel and its general performance improvements made it a popular option among NLP (and recently CV) researchers. Thanks to the several implementations in common deep learning frameworks, it became an easy option for many students (including myself) to experiment with. Even though making it more accessible is a great thing, on the downside it may cause the details of the model to be overlooked.

In this article, I don’t plan to explain its architecture in depth, as there are currently several great tutorials on this topic (here, and here). Instead, I want to discuss one specific part of the transformer’s architecture: the positional encoding. When I read this part of the paper, it raised some questions in my head which, unfortunately, the authors had not provided sufficient information to answer. So in this article, I want to try to break this module apart and look at how it works. To understand the rest of this post, I highly suggest you read one of those tutorials to get familiar with the transformer architecture.

Header Photo by Susan Yin on Unsplash

What is positional encoding, and why do we need it in the first place?

Position and order of words are essential parts of any language. They define the grammar, and thus the actual semantics, of a sentence. Recurrent Neural Networks (RNNs) inherently take the order of words into account: they parse a sentence word by word in a sequential manner, which integrates the words’ order into the backbone of RNNs.

But the Transformer architecture ditched the recurrence mechanism in favor of a multi-head self-attention mechanism. Avoiding the RNNs’ method of recurrence results in a massive speed-up in training time, and, theoretically, it can capture longer dependencies in a sentence.

As each word in a sentence flows through the Transformer’s encoder/decoder stack simultaneously, the model itself doesn’t have any sense of the position/order of each word. Consequently, there is still a need for a way to incorporate the order of the words into our model. One possible solution to give the model some sense of order is to add a piece of information to each word about its position in the sentence. We call this “piece of information” the positional encoding.

The first idea that might come to mind is to assign a number to each time-step within the [0, 1] range, in which 0 means the first word and 1 is the last time-step.
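This naive scheme can be sketched in a few lines. The helper below is purely illustrative (it is not part of the Transformer paper): it linearly spaces one number per time-step, assigning 0 to the first word and 1 to the last.

```python
def naive_positional_encoding(seq_len):
    """Assign each time-step a number in [0, 1], linearly spaced:
    the first word gets 0.0 and the last word gets 1.0."""
    if seq_len == 1:
        return [0.0]  # a single word is simply assigned position 0
    return [i / (seq_len - 1) for i in range(seq_len)]

print(naive_positional_encoding(5))  # → [0.0, 0.25, 0.5, 0.75, 1.0]
```

Note that under this scheme the spacing between consecutive positions depends on the sentence length, which is one reason to look for something better.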