Sanskrit programming open source project.


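वाक्यकारं वररुचिं, भाष्यकारं पतञ्जलिं, पाणिनिं सूत्रकारञ्च प्रणतोऽस्मि
मुनित्रयं!
(I bow to the three sages: Vararuci, author of the vArttikas; Patanjali,
author of the bhAShya; and pANini, author of the sUtras.)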
- Important resources:
- Services and downloads:
- About contributions and license:
- Programming languages and build system:
- Background and plans:
- Motivation:
- Why open source?
- Assessment of the current state of Sanskrit NLP:
- Plans:
- Long term dream:
- Initial foci:
Important resources:
We have a code repository on Bitbucket for an open-source Sanskrit NLP
project. Contributors, and users with questions, are welcome to join the
sanskrit-programmers mailing list.
Services and downloads:
For downloads, please visit our code repository.
For now, the following services are provided:
- Indic transliteration. (A side-effect.)
- pANinIya pratyAhArI.
- saMskRRitaM-DSL: Our member shrI vAsudevan shrInivAsan has made rapid progress
with implementing parts of the aShTAdhyAyI as a domain-specific language
(DSL) in Groovy, another JVM-based language that supports both functional
and object-oriented programming.
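To give a flavour of what a pratyAhAra tool computes, here is a minimal sketch in Python (chosen only for brevity; the project itself targets the JVM). It is not the project's implementation. The name `expand_pratyahara`, the IAST spelling of the sounds, and the rule "take the first matching marker after the starting sound" are all assumptions made for illustration; the last one ignores the traditional ambiguity of the marker ṇ, which closes two of the sUtras.

```python
# The fourteen Shiva sutras in IAST, each as (sounds, closing marker).
# Consonants are listed without their inherent "a".
SHIVA_SUTRAS = [
    (["a", "i", "u"], "ṇ"),
    (["ṛ", "ḷ"], "k"),
    (["e", "o"], "ṅ"),
    (["ai", "au"], "c"),
    (["h", "y", "v", "r"], "ṭ"),
    (["l"], "ṇ"),
    (["ñ", "m", "ṅ", "ṇ", "n"], "m"),
    (["jh", "bh"], "ñ"),
    (["gh", "ḍh", "dh"], "ṣ"),
    (["j", "b", "g", "ḍ", "d"], "ś"),
    (["kh", "ph", "ch", "ṭh", "th", "c", "ṭ", "t"], "v"),
    (["k", "p"], "y"),
    (["ś", "ṣ", "s"], "r"),
    (["h"], "l"),
]


def expand_pratyahara(start, marker):
    """Return the sounds from `start` up to the first sutra closed by `marker`.

    Hypothetical helper: it always picks the first occurrence of `start`
    and the first matching marker after it.
    """
    result = []
    collecting = False
    for sounds, closing in SHIVA_SUTRAS:
        for sound in sounds:
            if not collecting and sound == start:
                collecting = True
            # "h" occurs in two sutras; list it only once.
            if collecting and sound not in result:
                result.append(sound)
        if collecting and closing == marker:
            return result
    raise ValueError(f"no pratyahara {start!r} + {marker!r}")


print(expand_pratyahara("a", "c"))  # ac: all vowels
print(expand_pratyahara("i", "k"))  # ik: i, u, ṛ, ḷ
print(expand_pratyahara("y", "ṇ"))  # yaṇ: y, v, r, l
```

Keeping the sUtras as plain data means the same table can serve other lookups, for example testing whether a given sound belongs to a pratyAhAra.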
About contributions and license:
Anyone is welcome to contribute any Sanskrit processing code which
runs on the Java Virtual Machine. All contributions will be gratefully
acknowledged. If you would like check-in access, please let me
know.
We use the popular Apache license. Note that, if it is necessary in order
to take advantage of others' contributions, we can have different
licenses for subcomponents.
Programming languages and build system:
Contributors may use any language compatible with the Java Virtual
Machine, as long as they try their best to write correct code. We prefer
Scala, as it combines the speed of Java with the conciseness of Python,
though people may be more familiar with Java. However, we are also
willing to host code written in other languages that cannot be
integrated with JVM code.
The project uses sbt as its build system. To run or build your code, you
only need to invoke bin/sktnlp with the appropriate command (please take
a look at that code for details). After getting the code from the
repository for the first time, run bin/sktnlp build. Of course, you can
privately use any build system you like for development.
Background and plans:
Motivation:
Following discussions (1, 2) in the samskrita mailing list and some
positive responses, we have started this project.
Volunteer effort is a labor of love, and many of us here carry a deep
emotional attachment to saMskRta. In fact, many of us yearn for the day
saMskRta is revived as a spoken language, even as we use it whenever
possible in our daily lives. That is why contributing to Sanskrit tools
is a natural outlet.
Why open source?
For us to achieve our dreams rapidly, it is important that we are
able to build quickly on each other's work, without unnecessary
duplication of effort. For this reason, open source software and freely
available data are important.
Of course, closed source software and restricted data that are not
community based can be very valuable and useful to end users (e.g.
spokensanskrit.de), but a previous attempt tells me that it can be
difficult or slow to elicit cooperation and responsiveness from their
maintainers. Several such websites may well be accessible to end users,
but closed source software and restricted data remain just that: the
fact that their owners do not make money from them does not change this.
Assessment of the current state of Sanskrit NLP:
Natural Language Processing in general is a thriving field, with
open source projects such as openNLP.
Several academics have done valuable work in Sanskrit NLP, and there
has been some contribution from enthusiasts. From separate
conversations, we gather the following impressions. So far, the aim has
been to develop tools and algorithms that help a reader comprehend
Sanskrit text by doing the following:
- Digitize dictionaries (D, B), sUtras and thesauri (H), and enable
online search (B, 2B). Some online dictionaries enable collaborative
editing. They do have the following limitations:
- The database updated in this manner is not publicly available.
- They don't currently provide an online API (application
programming interface) to build on them easily.
- Develop tools which model and illustrate the application of various
sandhi (F, H, C, C2), prAtipadika declension (D, H1, H2, F, B), dhAtu
conjugation (F, H, 3B, 5B, H2, Dl), kRdanta (H1, H2, F) and taddhitAnta
(H1, H2) rules. These can in turn be used to analyze inflected words (1F,
2F, B, H, Dl), do sandhi analysis (1H, 2H), and produce dictionaries of
inflected words (F) and corpora of text with corresponding word roots (D).
- Inflected word generation is usually based on the 'word and
paradigm' model, close to works such as the ruupa chandrikaa, which give
the naamaruupaavalii for 'typical' words ending in different var.nas in
different lingas. This is found to be very useful and accurate in
the analysis of classical Sanskrit texts.
- Limitation:
However, as a generative model the above is not perfect: because it is
not based firmly on pANini's rules (which separate saMskR^ita from
apabhraMsha), it may generate wrong inflections.
- Domain specific languages tailored for the saMskR^ita grammar are beginning to be seen (V).
- Mechanically parse (H) Sanskrit text and do part-of-speech tagging (D).
Produce and standardize Sanskrit corpora (I, Ms..).
- Translate Sanskrit into a more familiar language (H, 2H).
- Tools to identify metre (D, M, C, C2).
- Tools to help understand grammar sUtras (H, B, 3V, T, D, A, Ar).
- Transliteration tools (S, Ls, H, B, B, Gv, V, D, C, Rd, Rp, Ar...) and IMEs (I, G, M, ..) to input Indic script directly without transliteration. Other lists are available at [N, N2, W, W2]. Some tools and websites (1W, 2W) enable
viewing text in the script of the reader's choice.
- Formal attempts at encoding Indian scripts in Unicode (B, I), and
fonts.
- Sanskrit optical character recognition (OCR) tools (D1, B, X).
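The 'word and paradigm' approach mentioned in the list above, and its limitation, can be made concrete with a small sketch. The Python below (chosen for brevity; it does not come from any of the tools cited) stores the endings of one 'typical' word, the masculine a-stem deva, and reuses them for other stems of the same type. The names `A_STEM_MASCULINE` and `decline` are hypothetical, and forms are written in IAST.

```python
# Endings of a masculine a-stem (model word: deva), as
# (singular, dual, plural). They replace the stem-final "a".
A_STEM_MASCULINE = {
    "nominative":   ("aḥ", "au", "āḥ"),
    "accusative":   ("am", "au", "ān"),
    "instrumental": ("ena", "ābhyām", "aiḥ"),
    "dative":       ("āya", "ābhyām", "ebhyaḥ"),
    "ablative":     ("āt", "ābhyām", "ebhyaḥ"),
    "genitive":     ("asya", "ayoḥ", "ānām"),
    "locative":     ("e", "ayoḥ", "eṣu"),
    "vocative":     ("a", "au", "āḥ"),
}


def decline(stem, paradigm=A_STEM_MASCULINE):
    """Generate all case forms of `stem` by copying the model's endings."""
    if not stem.endswith("a"):
        raise ValueError("this paradigm only covers stems ending in short a")
    base = stem[:-1]
    return {
        case: tuple(base + ending for ending in endings)
        for case, endings in paradigm.items()
    }


print(decline("deva")["instrumental"][0])  # devena
# Copying the same endings onto rāma yields "rāmena", but the correct
# form is "rāmeṇa": the change of n to ṇ after r follows from a separate
# rule that a bare paradigm table does not encode.
print(decline("rāma")["instrumental"][0])
```

A table like this is cheap to build and works well for analysis, but each such exception has to be patched by hand unless the generator is driven by the underlying rules.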
Note that we have focused on computer programs above; more general,
curated collections of links and corpora are available elsewhere (F, N, 2N, D, ...). Other summaries are also available (I).
In some cases above, source code for the Sanskrit tools is available (the links in bold are said to be, for which we are grateful!);
but much good software is not open source, and there is quite a bit of
duplication of effort. Besides the limitations noted above, what is
conspicuously missing are tools directed at meeting
important needs of the popular spoken Sanskrit movement, especially as
we increasingly interact with information through computers and the
internet:
- Reading documents and webpages written in other languages in
saMskRRita. (There is no Google Translate-like tool at present, nor
will there be one in the near future.)
- Sanskrit UI versions of commonly used software don't exist
(unlike for Arabic, Hebrew, etc.).
- There are no good Sanskrit browser scripts or extensions to do
common things like look up word meanings with a click or a mouse-over.
- There is no effort at making it easy to generate Sanskrit content. E.g., the
Sanskrit Wikipedia is nowhere close to the English version. The same goes
for the Wiktionary.
Plans:
Long term dream:
Because many of us Sanskrit-lovers have both programming skills and
some Sanskrit knowledge (if not linguistics/ NLP expertise), there is
now great opportunity and will to develop useful Sanskrit tools. For
example, we dream of a day when we will be able to read any English
webpage in Sanskrit, and when the Sanskrit Wikipedia will approach the
English version in the richness of its content. (For an inkling as to what
is possible with some very simple technology, see screenshots here.)
Of course, machine translation is a very difficult problem. But
many open source software efforts are directed at equally difficult,
technically challenging problems (designing new languages, search
engines, numerical analysis packages, etc.). And it is not as if
tool-makers cannot learn the simple ideas behind HMMs (hidden Markov
models) and read papers. Machine translation will naturally come when
much simpler subproblems are conquered.
Initial foci:
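1. As proposed in the earlier
thread, we could first focus on developing a Wiktionary bot. prathamaM EkaM sulabhataraM kAryaM gRRihNoma (pUrva-vivRRitaM
wiktionary kAryaM). EtasmAt wikipedia-api upayOga-kaushalaM api
prApnEma, vardhana-yOgyaM bAdhA-rahitaM shabdakOshaM api
vyavasthApayiShyAma. (That is: let us first take up one easier task, the
Wiktionary task described earlier. From it we would also gain skill in
using the Wikipedia API, and we would set up a dictionary that can grow
and is free of restrictions.)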
2. Encoding pANinIya vyAkaraNa (pANini's grammar) in Scala. Imagine a grammatical tool
which, besides producing the output, also cites the sUtras that justify it (sUtra-pramANa).
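As a rough illustration of a tool that returns sUtra-pramANa alongside its output, the sketch below joins two words at a vowel boundary and reports which sUtra licensed the change. It is written in Python only for brevity (the plan above is to use Scala), it covers just four vowel-sandhi sUtras, and it does not model the rule ordering or the many exceptions of the real grammar. The names `Derivation` and `join_words` are hypothetical, and input is assumed to be IAST in composed (NFC) form.

```python
from typing import NamedTuple, Optional


class Derivation(NamedTuple):
    form: str              # the resulting text
    sutra: Optional[str]   # the sutra cited as justification, if any


VOWEL_CLASS = {"a": "a", "ā": "a", "i": "i", "ī": "i", "u": "u", "ū": "u"}
LONG = {"a": "ā", "i": "ī", "u": "ū"}
GUNA = {"i": "e", "u": "o"}
VRDDHI = {"e": "ai", "ai": "ai", "o": "au", "au": "au"}
YAN = {"i": "y", "u": "v"}
# Digraphs come first so that "ai" is not read as "a" followed by "i".
INITIAL_VOWELS = ["ai", "au", "a", "ā", "i", "ī", "u", "ū", "e", "o"]


def join_words(first, second):
    """Join two words, citing the vowel-sandhi sutra applied (toy coverage)."""
    final = first[-1:]
    initial = next((v for v in INITIAL_VOWELS if second.startswith(v)), None)
    if (final not in VOWEL_CLASS or initial is None
            or first.endswith(("ai", "au"))):
        return Derivation(first + " " + second, None)  # nothing applies here

    stem, rest = first[:-1], second[len(initial):]
    fc, ic = VOWEL_CLASS[final], VOWEL_CLASS.get(initial)
    if fc == ic:
        return Derivation(stem + LONG[fc] + rest,
                          "akaḥ savarṇe dīrghaḥ (6.1.101)")
    if fc == "a":
        if initial in VRDDHI:
            return Derivation(stem + VRDDHI[initial] + rest,
                              "vṛddhir eci (6.1.88)")
        return Derivation(stem + GUNA[ic] + rest, "ād guṇaḥ (6.1.87)")
    # i/ī or u/ū before a dissimilar vowel becomes y or v.
    return Derivation(stem + YAN[fc] + second, "iko yaṇ aci (6.1.77)")


print(join_words("deva", "indra"))  # devendra, by ād guṇaḥ
print(join_words("iti", "api"))     # ityapi, by iko yaṇ aci
```

Returning the citation with the form, rather than logging it separately, lets a caller chain several steps and show the whole derivation to a learner.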
...