I created a tool to generate decks of i+1 sentences

Hey friends,

I've been working on an algorithm that takes a (preferably very large) set of sentences and turns it into a deck of i+1 cards. That means you can make, for example, an Anki deck where each new card teaches you a new word using only sentences in which you've already seen every other word (with some freebies like names). To give you a better idea of what I'm talking about, I've made a spreadsheet with a Spanish deck for English-speaking learners, built from the Tatoeba sentences:

https://docs.google.com/spreadsheets/d/1i-GllU4FgsYeqMmuQBVX0wToUAUIq6AnvYN_b6emEFw

(The initial 13 cards with a gray background are cards I created manually and seeded the deck with, in order to open up the set of available i+1 sentences a little bit.)
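
If you're curious how the generation works under the hood, here's a simplified Python sketch of the core greedy loop. To be clear, this is an illustration of the idea rather than my actual implementation, which also handles the seed cards, freebies like names, and multiple cards per word:

    def build_i_plus_1_deck(sentences, seed_words=frozenset()):
        """Greedy i+1 ordering: each card teaches exactly one new word using
        only words that are already known. `sentences` is an iterable of
        (text, tokens) pairs; tokenization and freebies are handled upstream."""
        known = set(seed_words)
        deck = []
        pool = list(sentences)
        made_progress = True
        while made_progress:
            made_progress = False
            leftover = []
            for text, tokens in pool:
                unknown = set(tokens) - known
                if len(unknown) == 1:        # exactly one new word: an i+1 sentence
                    new_word = unknown.pop()
                    deck.append((new_word, text))
                    known.add(new_word)
                    made_progress = True
                elif unknown:                # too many unknown words; retry next pass
                    leftover.append((text, tokens))
                # if unknown is empty, the sentence teaches nothing new; drop it
            pool = leftover
        return deck

Each pass picks up sentences that became i+1 thanks to words learned earlier in the same pass, and the loop stops once no sentence with exactly one unknown word is left.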

The "Cards" sheet has the cards. You can export this as a TSV and import that into Anki or the SRS platform of your choice. The "Words" sheet shows the words in the order they are introduced. Now, it's important to note that words are not introduced in order of general frequency. For example, in this deck "saber" is the 34th word that is introduced, but it's ranked 233rd in a frequency list I found. What that means is that, at least in the Tatoeba corpus, "saber" is used in comparatively simpler sentences more often than a word like "tiempo" (word 103 in my list, number 97 in the frequency list). As you see with "tiempo", and if you look around at the spreadsheet, there is at least some correlation between the order in which words are introduced and the order they are found in a typical frequency list, which does make some sense. Frequency = -1 in my spreadsheet means that the word was not found in the 5k words frequency list I found.

Another note is that the word order is quite different from what you'd find in a textbook. For one, it's not themed. For another, you'll get some words much earlier than you would in a textbook, like "dijo" ("he/she said"), which is word 17 in my deck. It's a very frequently used word (number 155 in the frequency list), but more importantly, and more relevant to my algorithm, it's used quite often in very simple sentences, like the example "¿Sabes lo que dijo?".

The last thing I want to mention is that you'll see multiple cards for pretty much every word. I generate at least 2 cards per word, up to 4 as the deck progresses, and I have a primitive scheduler that spaces them out in the deck. This is all configurable, along with other things like minimum and maximum sentence length.
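
To give you a sense of the knobs, the configuration looks roughly like this (these names are made up for illustration; they're not the tool's actual option names):

    from dataclasses import dataclass

    @dataclass
    class DeckConfig:
        # Illustrative names only; the real options differ.
        min_cards_per_word: int = 2    # every word gets at least this many cards
        max_cards_per_word: int = 4    # ceiling reached as the deck progresses
        min_sentence_length: int = 3   # measured in tokens
        max_sentence_length: int = 12
        repeat_gap: int = 20           # target spacing between cards for the same word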

It's my belief that a deck like this can be a great supplement in a language-learning journey. You're learning words in the context of sentences. You use the translated meaning of the sentence, but not a translation of the individual word. And most importantly, you're able to focus on how a single word affects that meaning without also having to contend with other words you don't know.

There are some areas where I can improve this, though:

  • I want to expand the concept of a word to include things like "por qué", so it's not treated as two words ("por" and "qué"). Other examples include "por supuesto", "sin embargo", and "lo siento". I'm sure there are a ton of phrases and other colloquialisms like this that would make more sense to learn as a single unit (see the sketch after this list).
  • Some of the translations in my deck are weird. That's because Tatoeba is imperfect. I'd like to find more sources of sentences, preferably ones with curated translations.
  • I can also turn collections of books (right now just .epubs) into (untranslated) decks. I'd like to find sets of publicly-available texts so I can build and release some ready-made Anki decks (and more spreadsheets for people to generate their own decks). These would unfortunately have to rely on machine translation or massive community effort to get them translated. Or is there another way?
  • I'd like to make this work with more languages than just Spanish, starting with languages with similar scripts and grammar, like French and German. That said, it should probably work mostly out of the box with any language that separates words with spaces. For other languages I'll need to do some research or talk to someone about how to support them.
  • If I get enough interest I might clean up the code and open source it. I have some other tools I've built in conjunction with this, such as a tool that takes a word and gives you a minimal(ish) i+1 path to that word. This was much harder to build than it might sound... The code for it is really gross. If you see it, try not to judge me :)
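
On the multi-word point above: the rough plan is to merge known phrases into single tokens before the i+1 pass, something like this sketch (the phrase list, the underscore joining, and the greedy longest-match strategy are all just illustrative):

    # Merge known multi-word expressions into single tokens so the
    # i+1 pass treats each phrase as one word. Tiny example list only.
    MWES = {("por", "qué"), ("por", "supuesto"), ("sin", "embargo"), ("lo", "siento")}
    MAX_MWE_LEN = max(len(m) for m in MWES)

    def merge_mwes(tokens):
        """Greedily replace runs of tokens that match a known phrase
        with a single underscore-joined token, longest match first."""
        out, i = [], 0
        while i < len(tokens):
            for n in range(min(MAX_MWE_LEN, len(tokens) - i), 1, -1):
                if tuple(tokens[i:i + n]) in MWES:
                    out.append("_".join(tokens[i:i + n]))
                    i += n
                    break
            else:
                out.append(tokens[i])
                i += 1
        return out

    # merge_mwes(["no", "sé", "por", "qué"]) -> ["no", "sé", "por_qué"]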

And this is where you come in. I could use your help with the following:

  • I want to know what you think about this in general.
  • Do you have other ideas for how I might improve this in any way?
  • Do you know of better bilingual sentence corpora or other sources of sentences I can use?
  • Do you know of similar projects I can look to for inspiration or contribute to?

Thank you!
