I loved the article's clarity. It helped me wrap my head around how each development in NLP addressed the shortcomings of the one before it, as a sequential history.
One question: you cite the difficulty of transfer learning with the LSTM as a core flaw, emphasizing that differences in text style make pretrained embeddings hard to reuse. However, I think that problem was largely alleviated by ULMFiT, in which we take a language model pretrained on Wikipedia and fine-tune it on the corpus we are working with. Consequently, we can get really good embeddings even before we train the model on the intended task. Would you agree that the LSTM's drawbacks lie much more in its limited attention span than in its capacity for transfer learning? I would love your take on this.
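For concreteness, here's a minimal sketch of the two-stage process I have in mind, using fastai's ULMFiT pieces; the DataFrame `df` and its `text`/`label` columns are just placeholders for whatever corpus you're working with:

```python
from fastai.text.all import *

# Assumed: df is a pandas DataFrame with 'text' and 'label' columns.

# Stage 1: fine-tune the Wikipedia-pretrained AWD-LSTM language model
# on our own corpus, so the encoder adapts to its style of text.
dls_lm = TextDataLoaders.from_df(df, text_col='text', is_lm=True)
learn_lm = language_model_learner(dls_lm, AWD_LSTM, metrics=Perplexity())
learn_lm.fine_tune(3)
learn_lm.save_encoder('corpus_encoder')

# Stage 2: reuse that fine-tuned encoder for the downstream task,
# so training starts from embeddings already matched to our corpus.
dls_clf = TextDataLoaders.from_df(df, text_col='text', label_col='label',
                                  text_vocab=dls_lm.vocab)
learn_clf = text_classifier_learner(dls_clf, AWD_LSTM, metrics=accuracy)
learn_clf.load_encoder('corpus_encoder')
learn_clf.fine_tune(3)
```

That intermediate language-model step is what seems to sidestep the style mismatch you describe, since the encoder has already seen the target domain before the task-specific training begins.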