Sanskrit programming open source project.


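वाक्यकारं वररुचिं, भाष्यकारं पतञ्जलिं, पाणिनिं सूत्रकारञ्च प्रणतोऽस्मि मुनित्रयं!
(I bow to the three sages: Vararuci, author of the vArttikas; Patanjali, author of the bhAShya; and Panini, author of the sUtras.)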


  1. Important resources:
  2. Services and downloads:
  3. About contributions and license:
  4. Programming languages and build system:
  5. Background and plans:
    1. Motivation:
    2. Why open source?
    3. Assessment of the current state of Sanskrit NLP:
    4. Plans:
      1. Long term dream:
      2. Initial foci:

Important resources:

We have a code repository on Bitbucket for an open-source Sanskrit NLP project.

Also, contributors (and users with questions) are welcome to join the sanskrit-programmers mailing list.

Services and downloads:

For downloads, please visit our code repository. For now, the following services are provided:
  1. Indic transliteration. (A side-effect.)
  2. pANinIya pratyAhArI.
  3. saMskRRitaM-DSL: Our member shrI vAsudevan shrInivAsan has made rapid progress with implementing parts of the aShThAdhyAyI as a domain-specific language (DSL) in Groovy, yet another JVM-based functional and object-oriented programming language.
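
To illustrate what a pratyAhAra service has to compute, here is a minimal Scala sketch. It is not the project's actual code: the names Pratyahara, sutras and expand are hypothetical, and the sounds and markers are assumed to be written in a Harvard-Kyoto-style ASCII scheme. Where a marker occurs twice (N closes both the first and the sixth sUtra), the sketch simply picks the nearest one after the starting sound, which does not always match the traditional reading.

```scala
object Pratyahara {
  // The fourteen Shiva sUtras as (sounds, closing marker).
  // Assumption: Harvard-Kyoto-style ASCII (R = vocalic r, z = palatal sibilant, ...).
  val sutras: Vector[(Vector[String], String)] = Vector(
    Vector("a", "i", "u") -> "N",
    Vector("R", "lR") -> "k",
    Vector("e", "o") -> "G",
    Vector("ai", "au") -> "c",
    Vector("h", "y", "v", "r") -> "T",
    Vector("l") -> "N",
    Vector("J", "m", "G", "N", "n") -> "m",
    Vector("jh", "bh") -> "J",
    Vector("gh", "Dh", "dh") -> "S",
    Vector("j", "b", "g", "D", "d") -> "z",
    Vector("kh", "ph", "ch", "Th", "th", "c", "T", "t") -> "v",
    Vector("k", "p") -> "y",
    Vector("z", "S", "s") -> "r",
    Vector("h") -> "l"
  )

  /** Sounds from `start` up to the nearest sUtra closed by `marker`, if any. */
  def expand(start: String, marker: String): Option[Vector[String]] = {
    // First sUtra that contains the starting sound.
    val startIdx = sutras.indexWhere(_._1.contains(start))
    if (startIdx < 0) None
    else {
      // Nearest sUtra at or after startIdx that ends in the requested marker.
      val endIdx = sutras.indexWhere(_._2 == marker, startIdx)
      if (endIdx < 0) None
      else {
        val sounds = sutras.slice(startIdx, endIdx + 1).flatMap(_._1)
        Some(sounds.dropWhile(_ != start))
      }
    }
  }

  def main(args: Array[String]): Unit =
    println(expand("a", "c").map(_.mkString(" ")).getOrElse("no such pratyAhAra"))
}
```

With this sketch, expand("i", "k") yields i, u, R, lR, and expand("y", "N") yields y, v, r, l. A real service would also need to accept the full pratyAhAra name and resolve the ambiguous N marker by context.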

About contributions and license:

Anyone is welcome to contribute any Sanskrit processing code which runs on the Java Virtual Machine. All contributions will be gratefully acknowledged. If you would like check-in access, please let me know.

We use the popular Apache License. Note that, if this is necessary to take advantage of others' contributions, we can have different licenses for subcomponents.

Programming languages and build system:

Contributors may use any language compatible with the Java Virtual Machine, as long as they try their best to write correct code. We prefer Scala, as it combines the speed of Java with the conciseness of Python, though people may be more familiar with Java. However, we are also willing to host code written in other languages that cannot be integrated with JVM code.

The project uses sbt as its build system. All you need to do to run or build your code is invoke bin/sktnlp with the appropriate command (please take a look at that code for details). After getting the code from the repository for the first time, run bin/sktnlp build. Of course, you can privately use any build system you like for development.

Background and plans:

Motivation:

Following discussions (1, 2) in the samskrita mailing list and some positive responses, we have started this project. Volunteer effort is a labor of love, and many of us here have a deep emotional attachment to saMskRta. In fact, many of us yearn for the day saMskRta is revived as a spoken language, even as we use it whenever possible in our daily life. That is why contributing to Sanskrit tools is a natural outlet.

Why open source?

For us to achieve our dreams rapidly, it is important that we are able to build on each other's work without unnecessary duplication of effort. For this reason, open source software and freely available data are important.

Of course, closed source software and restricted data that are not community based can be very valuable and useful to end users (e.g. spokensanskrit.de), but a previous attempt tells me that it can be difficult or slow to elicit cooperation and responsiveness from their owners. Several such websites may well be accessible to end users, but closed source and restricted data remain just that; the fact that their owners make no money from them does not change this.

Assessment of the current state of Sanskrit NLP:

Natural Language Processing in general is a thriving field, with open source projects such as OpenNLP.

Several academics have done valuable work in Sanskrit NLP, and there has been some contribution from enthusiasts. Thanks to separate conversations, we gather the following impressions. Current aims have been to develop tools and algorithms aimed at helping a reader comprehend Sanskrit text by doing the following:

  1. Digitize dictionaries (D, B), sUtras and thesauri (H) and enable online search (B, 2B). Some online dictionaries enable collaborative editing. They do have the following limitations:
  2. Develop tools which model and illustrate application of various sandhi (F, H, C, C2), prAtipadika declension (D, H1, H2, F, B), dhAtu conjugation (F, H, 3B5B, H2, Dl), kRdanta (H1, H2, F) and taddhitAnta (H1, H2) rules. These can in turn be used to analyze inflected words (1F, 2F, B, H, Dl), do sandhi analysis (1H, 2H), and produce dictionaries of inflected words (F) and corpora of text with corresponding word roots (D).
  3. Domain specific languages tailored for the saMskR^ita grammar are beginning to be seen (V).
  4. Mechanically parse (H) Sanskrit text and do part-of-speech tagging (D). Produce and standardize Sanskrit corpora (I, Ms..).
  5. Translate Sanskrit into a more familiar language (H, 2H).
  6. Tools to identify metre(D, M, C, C2).
  7. Tools to help understand grammar sUtras (H, B, 3V, T, D, A, Ar).
  8. Transliteration tools (SLs, H, B, B, Gv, V, D, C, Rd, Rp, Ar...) and IMEs (I, GM, ..) to input Indic script directly without transliteration. Other lists are available at [N, N2, W, W2]. Some tools and websites (1W, 2W) enable viewing text in the script of the reader's choice.
  9. Formal attempts at encoding Indian scripts in Unicode (B, I), and fonts.
  10. Sanskrit optical character recognition (OCR) tools(D1, B, X).

Note that we have focused on computer programs above; more general, curated collections of links and corpora are available elsewhere (F, N, 2N, D, ...). Also, other summaries are available (I).

In some cases above, source code for the Sanskrit tools is available (the links in bold are said to be - our gratitude!); but much good software is not open source, and there is quite a bit of duplication of effort. Besides the limitations noted above, what is conspicuously missing from the above are tools directed at meeting important needs of the popular spoken Sanskrit movement, especially as we increasingly interact with information through computers and the internet:
  1. Reading documents and webpages written in other languages in saMskRRita (there is no Google Translate-like tool at present, nor will there be one in the near future).
  2. Sanskrit UI versions of commonly used software don't exist (unlike for Arabic, Hebrew, etc.).
  3. There are no good Sanskrit browser scripts or extensions to do common things like look up word meanings with a click or a mouse-over.
  4. There is no effort at making it easy to generate Sanskrit content. E.g., the Sanskrit Wikipedia is nowhere close to the English version. The same goes for the Wiktionary.

Plans:

Long term dream:

Because many of us Sanskrit-lovers have both programming skills and some Sanskrit knowledge (if not linguistics or NLP expertise), there is now great opportunity and will to develop useful Sanskrit tools. For example, we dream of a day when we will be able to read any English webpage in Sanskrit, and when the Sanskrit Wikipedia will approach the English version in the richness of its content. (For an inkling as to what is possible with some very simple technology, see screenshots here.)

Of course, machine translation is a very difficult problem. But many open source software efforts are directed at equally difficult, technically challenging problems (designing new languages, search engines, numerical analysis packages, etc.). And it is not as if tool-makers cannot learn the simple ideas behind HMMs and read papers. Machine translation will naturally come when much simpler subproblems are conquered.

Initial foci:

1. As proposed in the earlier thread, we could first focus on developing a Wiktionary bot. Let us first take up this one easier task (the Wiktionary task described earlier). From it we would also gain skill in using the Wikipedia API, and we would set up an extensible, unrestricted dictionary.

2. Encoding pANinIya vyAkaraNa in Scala. Imagine a grammatical tool which, besides producing the output, also produces the sUtra-pramANa (the sUtras that justify that output).
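
To make the idea of output plus sUtra-pramANa concrete, here is a small hedged Scala sketch. It is only an illustration under stated assumptions, not a design the project has adopted: the names SandhiSketch, Derivation and join are hypothetical, words are assumed to be in Harvard-Kyoto-style ASCII, and only two vowel sandhi rules are covered (6.1.101 akaH savarNe dIrghaH for a, i and u, and 6.1.87 Ad guNaH for a or A followed by i or u). Every other case returns None.

```scala
object SandhiSketch {
  /** A joined form together with the sUtra cited as its justification. */
  final case class Derivation(result: String, sutra: String)

  // Long counterpart of each simple vowel handled here (R and lR are left out).
  private val dirgha = Map('a' -> 'A', 'A' -> 'A', 'i' -> 'I', 'I' -> 'I', 'u' -> 'U', 'U' -> 'U')
  // guNa substitute when a or A is followed by i/I or u/U.
  private val guna = Map('i' -> 'e', 'I' -> 'e', 'u' -> 'o', 'U' -> 'o')

  private def isDiphthong(s: String): Boolean = s == "ai" || s == "au"

  def join(first: String, second: String): Option[Derivation] =
    if (first.isEmpty || second.isEmpty) None
    // "ai" and "au" are single sounds written with two letters; this sketch skips them.
    else if (isDiphthong(first.takeRight(2)) || isDiphthong(second.take(2))) None
    else {
      val a = first.last
      val b = second.head
      val stem = first.init
      val rest = second.tail
      if (dirgha.contains(a) && dirgha.get(a) == dirgha.get(b))
        Some(Derivation(stem + dirgha(a) + rest, "6.1.101 akaH savarNe dIrghaH"))
      else if ((a == 'a' || a == 'A') && guna.contains(b))
        Some(Derivation(stem + guna(b) + rest, "6.1.87 Ad guNaH"))
      else None // not covered by this sketch
    }

  def main(args: Array[String]): Unit =
    join("deva", "indra").foreach(d => println(s"${d.result} (by ${d.sutra})"))
}
```

Returning the cited sUtra alongside the result, rather than the bare string, is what would let such a tool show its reasoning; a fuller version would presumably return the whole sequence of sUtras applied in a derivation.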


...