Chapter 9 · An Adaptive Data-Driven Approach to Second Language Acquisition

Requirements and prototype implementation for language learning software

Requirements

Basic requirements derived from our framework:

  • Decentralized access, accessible from everywhere (low barrier to access).
  • Optimize the user interface (UI) and user input; avoid UI limitations where possible.
  • Offer vocab training.
  • Give the user the most relevant vocabs first.
  • Vocab training focuses on recall from the user's memory.
  • For early stages of learning, vocab recall focuses only on translation (bilingual word pairs).
  • Offer the user a golden path: all decisions are made by the software.
  • Don't violate the user's autonomy.
  • Offer user insight by showing learning data.
  • Give users option to make their own decisions, even if this contradicts these requirements.
  • Provide feedback to the user.
  • Allow user to minimize time spent on feedback (skip feedback).
  • Reward correct answers and user progress.
  • Don't punish mistakes.
  • Offer reading.
  • Offer selection of text that allows for diverse user interest.
  • Offer tools to minimize extraneous cognitive load.
  • Synergy through synchronization: share user data between different parts of software.
  • Adapt to user based on data.

Additional requirements:

  • Offer additional learning activities that help transfer from vocab to input.
  • Allow user to control privacy.
  • Allow user to set their own goals.
  • Offer informative key performance indicators.
  • Offer social component.
  • Offer virtual classrooms.

Implementation

The concept, design, programming and implementation of the prototype are solely the work of the author of this thesis, Michael Baumgarten. The code base can be found on the accompanying data carrier. While the prototype is still a work in progress, it is sufficient to showcase the concepts discussed in this thesis.

Basic design decision

This prototype is meant to show the challenges, use cases, potential and future of language learning software. Implementations of the requirements are simplified so that, once the underlying problems are solved, the solutions can be generalized and extended. For this reason English was used as the known base language (L1) and Spanish as the target language (L2). Initially, Mandarin Chinese was used as the L2, but it differs so much from English that it produced too many edge cases, and problems solved for Chinese usually cannot be generalized to other languages. This is not the case with Spanish. It is very similar to English, and solved challenges such as matching article words or creating simple sentences can be generalized with moderate effort and applied to other European languages such as Portuguese, French, Italian and German, because all of these are Indo-European languages.

The prototype is web based to enable a low barrier to access. A website (web app) is easily accessible as it only requires a web browser and no extra installation. A web based app can, to some degree, also be accessed via mobile devices. However, the user interface and user input are usually not as good as a native mobile solution. Therefore, ultimately, in addition to a web based app there may also be a supplementary native mobile app.

Different vocab learning modes and article reader

Based on the framework and requirements, an attempt was made to implement four basic features to a desirable degree of quality: First, a mode for deliberately learning and practicing vocabulary. Second, a reading mode for comprehensible input. Third, automatic synchronization to create synergy between both modes. Fourth, all data created by the user is used to make the modes, such as the vocab learning mode, adapt to the user dynamically. This enables a tailored, constantly adapting learning experience for the user. In addition, the user data is made available to the user to some degree to provide feedback and insight into his use and progress.

The vocab mode that most closely resembles a golden path is named 'Endless Mode' because, while the user can start and end this mode any time he wants to, it never ends by itself, to avoid interrupting the flow of the learner. In Endless Mode, the software constantly decides whether the user should learn a new vocab or which vocabs the user should practice. All decisions, except when to start or end this mode, are taken from the user to free him from everything except learning the vocabulary. A second vocab mode is the 'Lesson Mode'. For this mode, all vocabs are split up into lessons and optionally sorted into part of speech categories so that the user can focus on learning only verbs or only nouns if he wishes to do so. These lessons are in turn sorted so that high-frequency vocabulary is learned first. There is, however, no restriction on lessons, meaning the user can do any lesson he wants, e.g. start with the last lesson. During a lesson, the system decides when the user has practiced enough and ends the lesson by itself. The user can then repeat the lesson or do something else. Finally, there is a 'Practice Mode' where the user only practices familiar vocabs and no new vocabs are given.
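
The way Lesson Mode groups vocabs can be illustrated with a minimal Python sketch. The field names, the lesson size and the frequency values are assumptions made for illustration; they are not taken from the prototype.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Vocab:
    word: str
    pos: str          # part of speech, e.g. "noun" or "verb" (hypothetical field)
    frequency: float  # higher means more frequent (illustrative values only)


def build_lessons(vocabs: List[Vocab], lesson_size: int = 10,
                  pos: Optional[str] = None) -> List[List[Vocab]]:
    """Split vocabs into lessons so that the most frequent words come first."""
    # Optionally restrict the pool to one part of speech.
    pool = [v for v in vocabs if pos is None or v.pos == pos]
    # Rank by frequency, most frequent first.
    pool.sort(key=lambda v: v.frequency, reverse=True)
    # Cut the ranked list into chunks of lesson_size.
    return [pool[i:i + lesson_size] for i in range(0, len(pool), lesson_size)]


vocabs = [
    Vocab("casa", "noun", 5.2),
    Vocab("ser", "verb", 7.1),
    Vocab("gato", "noun", 4.3),
    Vocab("tener", "verb", 6.8),
    Vocab("tiempo", "noun", 5.9),
]
noun_lessons = build_lessons(vocabs, lesson_size=2, pos="noun")
```

Because the lessons are only a ranked list of chunks, nothing prevents the user from picking any of them, which matches the unrestricted lesson choice described above.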

In addition to these modes, future work should add the ability to create custom vocab lists that are automatically split up into small lessons. This feature already exists for the developer (it is not accessible to normal users), but the user interface for curating these lists is still incomplete. This feature would give the user full autonomy, as he could curate his own learning path. This customizability can serve as infrastructure for later features such as sharing of lessons and custom classrooms. Currently, only recognized words appearing in articles can be saved as custom vocab lists.

Technology stack

Regarding the technology used in this implementation, JavaScript (JavaScript, 2019) is used in both the front end and the back end. For data set modifications, Python (Python, 2019) was used in addition to JavaScript for tasks such as part of speech tagging and other analysis. For the front end, the React (React, 2019) library is used. The front end communicates with the back end via a Representational State Transfer (REST) application programming interface (API) and uses JSON Web Tokens (Token, 2019) for authentication. The back end is implemented with ExpressJs (ExpressJs, 2019) on NodeJs (NodeJs, 2019). The database is MongoDB (MongoDB, 2019), used without an object-relational mapper (ORM). MongoDB stores data units as 'documents' (similar to rows in SQL-based databases), which are grouped in 'collections' (similar to tables in SQL). A MongoDB document closely resembles a JavaScript object, and the two are automatically converted into each other when using the MongoDB API.

Front end challenge

On the front end, React is used together with ReduxJs (ReduxJs, 2019), a state management library. React separates the web app into components whose data is represented in the form of 'state'. While state can be passed to nested child components via properties ('props'), sharing state globally is limited by design so that components stay isolated, independent and modular. ReduxJs is one way of sharing application state globally: it provides a 'store' that can be accessed and updated by every component connected to it. The management of data loading and state is one of the biggest front end challenges, as it has the biggest effect on page loading speed and overall performance. React also has the benefit that transitioning the web app into a mobile app is comparatively easy, because the code base can be migrated to React Native (React-Native, 2019).

Vocabulary learning and practice

Vocabs are shown to the user, who then translates them via his keyboard. The translation is compared to the correct answer and evaluated accordingly, and the user gets feedback based on this evaluation. The feedback is designed to be fast but, depending on the vocab mode, there is a delay because of the server response: the adaptation to the user happens on the back end, which needs access to the database. On the feedback screen, the user can click on the vocab word (this doesn't work for sentences yet) to see his learning data history for that vocab. He can also press the enter key again, without delay if he wants, to skip ahead to the next vocab question. In future work, among other things, the time spent between answering, feedback and the next question should be minimized.

Because the vocab mode focuses only on vocab recall, the evaluation is case-insensitive, so wrongly capitalized letters have no effect. The same goes for diacritics and punctuation. Ideally, spelling would not be evaluated at all, but it is needed as a proxy to check whether the user remembers the correct translation.
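
A normalization step of this kind could look like the following Python sketch. It uses only the standard library; the function name `normalize` and the exact set of ignored characters are assumptions, and the prototype's own (JavaScript) implementation may differ.

```python
import string
import unicodedata


def normalize(text: str) -> str:
    """Make a string comparable regardless of case, diacritics and punctuation."""
    # Decompose accented characters (e.g. "ó" becomes "o" plus a combining accent),
    # then drop the combining marks. Note: this also folds "ñ" to "n", which may
    # or may not be what the prototype does.
    decomposed = unicodedata.normalize("NFD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Drop ASCII punctuation plus the Spanish inverted marks.
    ignored = set(string.punctuation) | {"¿", "¡"}
    no_punct = "".join(ch for ch in no_marks if ch not in ignored)
    # Lower-case and collapse runs of whitespace.
    return " ".join(no_punct.lower().split())


same = normalize("¿Organización?") == normalize("organizacion")
```

Applying the same function to both the expected answer and the user's answer before comparing them means that only the remaining letters influence the score.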

To some degree, the algorithms can detect accidental misspellings that are meaningless for recall (e.g. 'organisación' instead of 'organización', where the z is wrongly an s). If the user makes such a minor spelling mistake, the app shows him that his answer was imperfect, but it scores the answer differently from a wrong answer and adapts accordingly. This avoids repeating a vocab merely because of a spelling slip when the user actually knows the word schema: the app will not ask him the word again but will give him other words instead.

To decide when the user should learn new words or practice old ones, a Spaced Repetition System (SRS) is paired with a weighted system. If the user is given a new vocab, it should be the next most relevant one. To enable this, all vocabs are ranked by their frequency so that the user learns the 'important' vocabs first (George, 1935). In addition, they are sorted by their part of speech with a focus on nouns, verbs, adjectives and subjectives.

Data sets

The biggest challenge in development was the creation of the different data sets. Almost all problems of the application are due to missing data in the data sets. The implication of this is that a good data set is required for a good application.

Creation of dictionary data set

The first major data set challenge was the creation of a dictionary that contains all vocab words. Here the first design decision was to count related words such as 'cat' and 'cats' as one word. The root word is called a headword, which is why the dictionary collection is called 'headWords'. The initial versions were built from word lists that were combined and supplemented with words from other text sources such as articles. This resulted in a data set that was highly polluted with wrong or obscure words. The current iteration is improved in this regard. As a first step, a list of English words that contains only headwords, without plurals or conjugations, is needed. For this the website Sketch Engine (engine, 2019) was used. Its data contains the 4000 most frequent words sorted into the most common parts of speech such as nouns, verbs, etc. The data was extracted by saving the response of the XHR request in JSON format. After this step the dictionary was extended by translating all the English words into Spanish via the Watson translation API (Watson, 2019) and by web scraping conjugation tables for verbs from SpanishDict (SpanishDict, 2019) (conjugations were not available for all verbs). Other modifications, for which Python was used, consisted of obtaining the word frequency (Wordfreq, 2019) and determining which determiners a noun appears with in the Brown text corpus, accessed via the Natural Language Toolkit (NLTK) (Corpus, 2019; NLTK, 2019). The modifications were done over time as features were implemented. Overall, various ways were used to obtain the needed data, and the techniques for modifying the dictionary were tedious and imperfect. While they worked to some degree, the accuracy was not as high as desired. This imperfection of the data causes serious problems within the prototype. Words might be translated badly or even wrongly, among other things. One such example is the initial translation of the word 'people' into 'personas', when 'gente' would be the more appropriate translation.

Unfortunately, the only two options appear to be to accept those imperfections or to edit them manually when they are found. There are often ways of systematically finding unclean and imperfect data, but these almost never cover all cases, so manual cleaning remains necessary.

As a result, one of the largest weaknesses of the prototype is that its imperfect data makes it unreliable. The user could lose trust in its validity, as he can never be sure whether what the app is teaching him is correct.

Reading

To offer comprehensible input, there is a reading section, for which articles are used. Other types of text such as blog posts, forums or even complete books might also have their use case, but currently they would add too much technical complexity for a prototype. Articles vary in size and simplicity, which enables a progression of the language schema. The amount of text input is maximized to allow for more articles that are better tailored to the individual user's needs. The user can browse the article selection and sort it by how many words he knows. When he selects an article, he can read it in reader mode, which provides relevant tools and information. It is, however, unclear whether the presentation when reading and the available selection of articles are sufficient when compared to original news sites. One weakness of this prototype is the risk that the user is unable to find interesting articles. Another is that the presentation may be so unattractive that he uses this feature less than would be optimal, or stops using it altogether.

When reading an article, the user can translate any word that he doesn't understand, to minimize extraneous load. This translation is limited to words or word chains, so that the user has to try to infer the meaning by himself rather than reading a parallel translation. While we have established that the user should first learn vocab to reduce cognitive load, and only later transfer to reading natural text, this creates a difficult design decision. The only way to guarantee manageable text input would be to disable the reading section and have the user unlock it. This, however, would violate the user's autonomy. A more elegant solution would be to mark articles by their degree of appropriateness, which is, to some degree, accomplished by showing the number of words the user knows. However, a more sophisticated solution seems appropriate.

Finally, it would be desirable if the articles contained images, could be filtered by categories such as 'lifestyle', 'travel', 'technology', 'science' and so on, and if the user could add his own texts. These features would make the articles more attractive and interesting for the user, which would improve his motivation and provide meaning-based value. However, none of these desirable features are implemented in the prototype yet.

Synergy through synchronization

The prototype attempts to detect as many words as possible in the articles by comparing them to the vocab dictionary. To some degree it also recognizes semantically relevant words (as opposed to so-called 'stop words' such as 'the', 'as', 'for', etc.), marks them for the user, and compares them with the words the user has learned, showing when and how well the user has learned them and whether the words are new. To create the learning-acquisition synergy, three things should be possible, and the first two already are:

First, the user can directly learn words that he just read in an article and/or save them as a custom vocab list.

Second, the user can read an article that contains words that he just practiced or learned, or more generally, the user can select an article that is appropriate for him based on his current knowledge. This also allows for the situation that Elgort described: encountering words that the user has just deliberately learned in a real-world situation (the articles are real articles).

Third and not yet implemented:

When the user is reading an article and actively translates words he doesn't know, the software should track how often he clicks on which words and maybe even how much time he spends on reading the article. This data can then be used to adapt the vocab mode by creating specific vocab lists and / or adjusting the learning experience in the Endless Mode.

Creation of the article data set

Another major challenge was the creation of the articles data set. While there are many Spanish articles on the internet, they tend to be copyright protected. The news site NewsUSA (NewsUSA, 2019) offers Spanish articles that are free of copyright. It only has about 500 Spanish articles, which were web scraped for the prototype. More would be better, but this would require a better data model, as the current one is limited. With the article data set, the goal is to match each word used in an article to a potential headword. To do this, besides the article collection 'news_usa_articles', another collection called 'spanish_article_dict' was created, which is a dictionary of all the words that are used in the articles. This data set was extended so that each word has a counter of how often it appears in the articles in total, the titles of the articles it appears in, the English translation obtained via the Watson API and the part of speech tags obtained via NLTK. There is also an attribute that holds an array of headword matches. Its accuracy is a central challenge, because without accurate matches no synchronization is possible. To find matches, the different forms of a headword are collected from its conjugation, plural and synonym data and combined in a document property array called 'wordBag'. Because those sources (such as the conjugations) can contain polluted data, the Levenshtein distance between the sub-word and the headword, divided by the headword's length, is compared against a cutoff to determine whether the sub-word is clean. While this has a high accuracy, edge cases such as 'save' and 'have' still occur, meaning 'have' might be wrongly matched with 'save'. This is because, through 'I have saved', 'have' is placed into the wordBag of 'save'. The wordBag array of a headword is then used to check whether an article word is in it.
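
The wordBag cleaning step can be sketched in Python as follows. The cutoff value of 0.75 and the function names are assumptions chosen for illustration; the value used in the prototype is not stated.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def is_clean(sub_word: str, headword: str, cutoff: float = 0.75) -> bool:
    """Accept a sub-word if its distance relative to the headword length is small."""
    return levenshtein(sub_word, headword) / len(headword) <= cutoff


def filter_word_bag(headword: str, candidates: List[str],
                    cutoff: float = 0.75) -> List[str]:
    """Keep only the candidate forms that pass the relative-distance check."""
    return [word for word in candidates if is_clean(word, headword, cutoff)]


word_bag = filter_word_bag("save", ["saves", "saved", "saving", "have", "been"])
```

With these assumed values, 'been' is rejected (relative distance 1.0), while 'have' is only one substitution away from 'save' (relative distance 0.25) and therefore passes. A purely distance-based check cannot catch this kind of edge case.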

The other related challenge is the performance of checking whether any word used in an article matches a user word. For performance reasons, each article in the articles collection was given an array attribute that contains the entries of the 'match' attribute of all its words.

When the user queries the articles, the matching headwords of all their words are checked against the user's 'userWords' collection. This is needed to obtain statistics on how many words a user knows, which are required for article matching and filtering. It also has to be done every time articles are queried, because the user's learning data might have changed. It is, however, also one of the biggest performance bottlenecks. On the other hand, querying every word from 'spanish_article_dict' for all of the roughly 500 articles would be even worse. Instead, only once the user decides to read a specific article are all its words queried, and their data is used to show the respective information in the reader section UI. In summary, the matching of vocab to text and of text to the user's learning data still has a lot of potential for improvement.
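
The per-article statistic can be illustrated with a small Python sketch that works on plain lists instead of database queries. The dictionary keys 'title' and 'matches' are hypothetical stand-ins for the article attributes described above.

```python
from typing import Dict, Iterable, List


def known_word_stats(article_headwords: Iterable[str],
                     user_headwords: Iterable[str]) -> Dict[str, float]:
    """Count how many distinct headwords of an article the user already knows."""
    unique = set(article_headwords)
    known = unique & set(user_headwords)
    share = len(known) / len(unique) if unique else 0.0
    return {"known": len(known), "total": len(unique), "share": share}


def sort_by_known_share(articles: List[dict], user_headwords: Iterable[str]) -> List[dict]:
    """Return the articles ordered by the share of known headwords, highest first."""
    user_set = set(user_headwords)
    return sorted(
        articles,
        key=lambda article: known_word_stats(article["matches"], user_set)["share"],
        reverse=True,
    )


articles = [
    {"title": "A", "matches": ["casa", "gato", "tiempo", "casa"]},
    {"title": "B", "matches": ["ser", "tener"]},
]
user_words = ["casa", "ser", "tener"]
ranked = sort_by_known_share(articles, user_words)
```

Precomputing the headword array per article, as described above, is what makes this a set intersection per article rather than a dictionary lookup per word.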

Simple sentence generation

When available for the currently selected vocab, a simple sentence that combines familiar words with new or less familiar ones is given to the user instead of the vocab alone.

Generating bilingual sentences based on which words the user needs to review poses the following challenge: to enable flexible sentences whose words can be replaced (adapted) depending on what the user needs, the sentences need to be generated dynamically. One problem with this is that edge cases might violate form and grammar rules. Another problem is that word relations need to be considered; otherwise a sentence such as 'the knife swims to the lamp' might be generated. The easy way to solve these problems is to give the user sentences that are more or less hard coded and written manually, and thus not dynamic. This means that the sentences cannot adapt to the user’s needs and present him with less important words, which wastes his cognitive energy and, if the user realizes that what he is doing might be pointless, also reduces his motivation.

The strategy for minimizing form errors is to exclude error-prone words from substitution and instead hard code them into the sentence templates. The word relationship challenge is a bigger problem. To solve it, ultimately a data set with probabilistic word relations is needed, such as 'knife' to 'cut' and 'chop', or 'man' with 'have' to 'knife', 'car', '...'. This can be used to create sentences such as 'a man has a knife to cut food', 'a man cuts food' or 'a man cuts food with a knife'. It also avoids sentences such as 'a knife drives / swims / flies a car'.

Another, very limited, solution is the use of data structures that are called 'tagRelations' in the current prototype. Each has a tag name such as 'person' and declares the relationships between words of two part of speech (POS) tags, such as a noun with a verb ('man' and 'run'). All verbs and nouns within such a tag relation object are interchangeable ('police' and 'run', 'man' and 'drink'). These relationships have been declared manually, which makes them susceptible to errors in the form of nonsensical sentences. It is also tedious work that scales poorly, and many words are not covered by any tag relation object, which excludes them from sentences created by this method.
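
A tag relation object and its use in a template could be sketched in Python as follows. The dictionary layout, the example words and the one-line template are assumptions for illustration; the actual structure of 'tagRelations' in the prototype is not documented here.

```python
import random
from typing import Dict, List, Optional, Tuple

# Hypothetical layout: each tag groups nouns and verbs that may be combined freely.
tag_relations: Dict[str, Dict[str, List[str]]] = {
    "person": {"nouns": ["man", "woman", "child"], "verbs": ["run", "drink"]},
    "animal": {"nouns": ["cat", "dog"], "verbs": ["sleep", "run"]},
}


def candidate_pairs(relations: Dict[str, Dict[str, List[str]]],
                    target_word: str) -> List[Tuple[str, str]]:
    """All (noun, verb) pairs that contain target_word and share a tag relation."""
    pairs: List[Tuple[str, str]] = []
    for relation in relations.values():
        if target_word in relation["nouns"]:
            pairs += [(target_word, verb) for verb in relation["verbs"]]
        if target_word in relation["verbs"]:
            pairs += [(noun, target_word) for noun in relation["nouns"]]
    return pairs


def make_sentence(relations: Dict[str, Dict[str, List[str]]],
                  target_word: str) -> Optional[str]:
    """Fill a fixed template; the article and the verb ending are hard coded."""
    pairs = candidate_pairs(relations, target_word)
    if not pairs:
        return None  # the word is not covered by any tag relation
    noun, verb = random.choice(pairs)
    return f"the {noun} {verb}s"


sentence = make_sentence(tag_relations, "man")
uncovered = make_sentence(tag_relations, "knife")
```

The `None` result for 'knife' shows the coverage problem described above: a word that appears in no tag relation cannot be placed in a sentence by this method.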

To obtain word relationships automatically, the strategy would be to analyze a text corpus, such as the Brown corpus, and build a relationship data set from it. The idea is to first find the headword in the text (with the matching techniques discussed before) and then analyze its neighboring words. This would work for simple sentences, but in real-world text the sentences are far more complex, which makes it difficult to extract the relationships, e.g. 'the man, whose eyes were cold as ice, slowly got into the car that he wanted to drive to his friend's house'. A solution to this is to first filter for simple sentences based on how many words or commas they contain, and then to extract relevant words from the sentences by their POS tags. Then all words to the right of the headword could be collected, which would still include polluted words (e.g. words that appear by chance but are not in a relationship with the headword). If enough samples are found, a distribution might show which words most commonly appear to the right of the headword within a sentence. This technique or similar ones might be a good approach to finding these relationships. Currently the problem is that this method does not find enough samples, even in a big corpus such as the Brown corpus. Improving factors such as matching accuracy might solve this.
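
The counting idea can be sketched in Python with the standard library only. To stay self-contained, the sketch replaces POS tagging with a small stop-word list and works on a handful of hand-written sentences rather than the Brown corpus; both are simplifying assumptions, as are the function name and the length limit.

```python
from collections import Counter
from typing import FrozenSet, Iterable

STOP_WORDS: FrozenSet[str] = frozenset({"the", "a", "an", "to", "of"})


def right_neighbors(sentences: Iterable[str], headword: str,
                    max_words: int = 8,
                    stop_words: FrozenSet[str] = STOP_WORDS) -> Counter:
    """Count the words that appear to the right of headword in simple sentences."""
    counts: Counter = Counter()
    for sentence in sentences:
        if "," in sentence:
            continue  # a comma is treated as a sign of a complex sentence
        words = sentence.lower().rstrip(".!?").split()
        if len(words) > max_words or headword not in words:
            continue
        position = words.index(headword)
        counts.update(w for w in words[position + 1:] if w not in stop_words)
    return counts


sentences = [
    "The man drives a car.",
    "The man drinks water.",
    "A man drives to work.",
    "The man, whose eyes were cold as ice, slowly got into the car.",
]
distribution = right_neighbors(sentences, "man")
```

In this toy input 'drives' is the most frequent right-hand neighbor of 'man', and the complex fourth sentence is skipped entirely. With a real corpus, the same filter is what makes samples scarce, as noted above.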

For now, the solution is to use simple sentence templates that avoid most of these problems, although erroneous sentences still occur at a certain frequency. Finally, in future work, the user should be able to deviate from the default and translate only sentences, or disable them completely. Similar settings could enable a stronger focus on spelling, capitalization and diacritics.

User skill assessment - scoring

When the user is in vocab learning mode, he is presented with either a single vocab or a sentence. The user is then asked to translate the presented text by typing the correct answer on his keyboard. When he submits his answer, it is sent along with the question object to the back end (API endpoint). There, both the question and the answer are normalized for capitalization and diacritics, and their Levenshtein distance is used as a measure of accuracy. If multiple answers are possible, because the question object contains synonyms (which are in the data set and have to be added manually) or because a sentence was asked, all possible answers are collected, their Levenshtein distances to the user answer are computed and sorted, and the candidate with the lowest distance is taken as the answer the user intended. The reason for this is obvious for synonyms, but for sentences it is a design decision made to detect 'off by one' errors in the order of the words in the user's answer:
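
Selecting the intended answer among several accepted ones can be sketched in Python as follows. The function names and example words are assumptions for illustration, and the sketch expects already normalized strings.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def closest_answer(user_answer: str, accepted_answers: List[str]) -> Tuple[str, int]:
    """Return the accepted answer nearest to the user's answer, with its distance."""
    scored = ((candidate, levenshtein(user_answer, candidate))
              for candidate in accepted_answers)
    return min(scored, key=lambda pair: pair[1])


intended, distance = closest_answer("peple", ["persons", "people"])
```

Here the misspelled 'peple' is attributed to 'people' with a distance of 1, so the error score reflects a small slip rather than a failure to recall a synonym.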

If the user answers a sentence, his answer string is split into a word array at the whitespace characters. This array could be compared position by position to the word array of the question sentence. However, if the user typed an additional word (or just a character) between his first and last word by mistake, all subsequent question-answer pairs would be shifted and thus counted as wrong even though they are not. For example, if 'el zorro salta' is answered with 'the a fox jumps', the resulting pairs are (el, the), (zorro, a), (salta, fox), (undefined, jumps). In this example, the system would assess that the user does not recall any of the words that appear after the inserted letter 'a', when in reality the user did recall all of them.

For this reason, the lowest Levenshtein distance, independent of word order, is used to find the actual pair. This means that the word order of the user's answer is not checked, so the answer 'water a drinks cat' (as opposed to the correct 'a cat drinks water') would count as error-free. This is in line with our primary goal of assessing whether the user remembers the correct translations. Feedback that the order is incorrect can still be provided by mixing scoring methods, so that the user is aware of it.
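
Order-independent pairing can be sketched in Python as follows. This is a simple greedy version in which every expected word picks its nearest answer word, so one answer word may be reused for several expected words; whether the prototype handles that case differently is not stated, and the function names are assumptions.

```python
from typing import List, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def pair_words(expected_words: List[str],
               answer_words: List[str]) -> List[Tuple[str, Optional[str], int]]:
    """Pair each expected word with its closest answer word, ignoring position."""
    pairs = []
    for expected in expected_words:
        best = min(answer_words, key=lambda w: levenshtein(expected, w), default=None)
        # With an empty answer, the whole expected word counts as missing.
        distance = levenshtein(expected, best) if best is not None else len(expected)
        pairs.append((expected, best, distance))
    return pairs


shifted = pair_words("the fox jumps".split(), "the a fox jumps".split())
reordered = pair_words("a cat drinks water".split(), "water a drinks cat".split())
```

Both example answers produce a distance of 0 for every expected word: the stray 'a' no longer shifts the pairs, and the scrambled word order is not penalized.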

In addition to the user answer, the time needed to answer is also available. Moreover, the time could be broken down into the total time from question to answer, from question to typing start, between the input of each letter, and from typing start to typing end (and submission).

The Levenshtein distance of the user answer-question pair is used as a measure of the error score. This score is used to adjust the SRS level and an additional learn score that goes up for correct answers and down (stopping at 0) for wrong answers.
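
How an error score might drive both values is sketched below in Python. All thresholds and increments are assumptions chosen for illustration; the prototype's actual weights are not given and, as noted elsewhere in this chapter, would need calibration with real usage data.

```python
from dataclasses import dataclass


@dataclass
class WordProgress:
    srs_level: int = 0    # position in the spaced repetition schedule
    learn_score: int = 0  # additional score, never below 0


def update_progress(progress: WordProgress, distance: int, answer_length: int,
                    minor_cutoff: float = 0.2) -> WordProgress:
    """Adjust SRS level and learn score from a Levenshtein-based error score."""
    error = distance / max(answer_length, 1)
    if error == 0:
        # Correct answer: advance the schedule and reward.
        progress.srs_level += 1
        progress.learn_score += 2
    elif error <= minor_cutoff:
        # Minor misspelling: treated as recalled, but rewarded less (assumed rule).
        progress.learn_score += 1
    else:
        # Wrong answer: step back, with both values floored at 0.
        progress.srs_level = max(progress.srs_level - 1, 0)
        progress.learn_score = max(progress.learn_score - 1, 0)
    return progress


progress = WordProgress()
update_progress(progress, distance=0, answer_length=12)  # correct
update_progress(progress, distance=1, answer_length=12)  # minor misspelling
after_minor = (progress.srs_level, progress.learn_score)
```

Keeping the minor-misspelling branch separate from the wrong-answer branch is what allows the scheduler to treat a spelling slip differently from a failed recall.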

All this available data is then pushed into an array called 'learnData', which is an attribute of a user word document. The learning data might offer insights into relevant information such as how well a user knows a word. For example, if the user answers fast but with minor spelling mistakes, this might indicate that the user knows the word well and is impatient. It could indicate that he is bored and that his cognitive load is low because he has automated the word schema. Traditional language learning software might interpret the small mistakes as an indication that the user still has not learned the word sufficiently, and present him the word more often than needed. This runs contrary to the actual goal, as we are not interested in perfect spelling but rather in correct recall of meaning.

Another example of informative behavior could be a user who takes a long time but answers correctly. This could indicate that he still has the word in memory but has difficulty recalling it and experiences a high cognitive load, as he needs more time than normal. Again, conventional software might conclude that the word has been practiced sufficiently, when really it needs to be reviewed.

User data, adaptiveness, and analytics

The software attempts to use whatever data it can collect to enable an optimal learning experience. The exact execution of this is still quite lacking, because all relevant values need to be adjusted and calibrated based on real usage of the prototype. The user can see all the data the app collects about him, which improves transparency but mainly offers useful information: he can see which words he is familiar with and how well he knows them. For every word, there is also a complete learning history. All this is paired with data visualization, which is currently rudimentary and could be improved. Not currently implemented, but also desirable, would be ways for the user to find the words he struggles with, see what kinds of mistakes he makes, and so on. There could also be additional information such as the articles in which a word appears, together with the corresponding sentences. Moreover, the user could be able to disable words or mark them as favorites.
