Entry Saharia:2014:SRP from talip.bib
Last update: Sun Oct 15 02:55:04 MDT 2017
BibTeX entry
@Article{Saharia:2014:SRP,
author = "Navanath Saharia and Utpal Sharma and Jugal Kalita",
title = "Stemming resource-poor {Indian} languages",
journal = j-TALIP,
volume = "13",
number = "3",
pages = "14:1--14:??",
month = sep,
year = "2014",
CODEN = "????",
DOI = "https://doi.org/10.1145/2629670",
ISSN = "1530-0226 (print), 1558-3430 (electronic)",
ISSN-L = "1530-0226",
bibdate = "Sat Oct 4 06:09:41 MDT 2014",
bibsource = "http://portal.acm.org/;
http://www.math.utah.edu/pub/tex/bib/talip.bib",
abstract = "Stemming is a basic method for morphological
normalization of natural language texts. In this study,
we focus on the problem of stemming several
resource-poor languages from Eastern India, viz.,
Assamese, Bengali, Bishnupriya Manipuri and Bodo. While
Assamese, Bengali and Bishnupriya Manipuri are
Indo-Aryan, Bodo is a Tibeto-Burman language. We design
a rule-based approach to remove suffixes from words. To
reduce over-stemming and under-stemming errors, we
introduce a dictionary of frequent words. We observe
that, for these languages, a dominant amount of suffixes
are single letters, creating problems during suffix
stripping. As a result, we introduce an HMM-based
hybrid approach to classify the mis-matched last
character. For each word, the stem is extracted by
calculating the most probable path in four HMM states.
At each step we measure the stemming accuracy for each
language. We obtain 94\% accuracy for Assamese and
Bengali, and 87\% and 82\% for Bishnupriya Manipuri and
Bodo, respectively, using the hybrid approach. We
compare our work with Morfessor [Creutz and Lagus
2005]. As of now, there is no reported work on stemming
for Bishnupriya Manipuri and Bodo. Our results on
Assamese and Bengali show significant improvement over
prior published work [Sarkar and Bandyopadhyay 2008;
Sharma et al. 2002, 2003].",
acknowledgement = ack-nhfb,
articleno = "14",
fjournal = "ACM Transactions on Asian Language Information
Processing",
journal-URL = "http://portal.acm.org/browse_dl.cfm?&idx=J820",
}
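Illustrative note: the abstract describes rule-based suffix stripping guarded by a dictionary of frequent words. The sketch below shows only that general idea. The suffix list, the dictionary contents, the minimum stem length, and the name stem are hypothetical placeholders (toy English-like strings), not the rules, data, or code from the paper, and the HMM stage is not covered.

```python
# Toy illustration only: the suffixes and frequent words below are invented
# placeholders, not the rules or data used in the paper.
SUFFIXES = ["ing", "ed", "es", "s"]        # hypothetical suffix rules
FREQUENT_WORDS = {"this", "news", "red"}   # hypothetical frequent-word dictionary
MIN_STEM_LEN = 2                           # assumed guard against over-stemming


def stem(word: str) -> str:
    """Strip the longest matching suffix unless the word is in the dictionary."""
    if word in FREQUENT_WORDS:
        return word  # dictionary hit: leave the word untouched
    # Try longer suffixes first so that "es" is preferred over "s".
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word  # no rule applied


if __name__ == "__main__":
    for w in ["walking", "boxes", "news", "is"]:
        print(w, stem(w))
```

In this sketch the dictionary lookup runs before any rule, so a word such as "news" is not cut down to "new". Single-letter suffixes remain the risky case, which is the difficulty the abstract says motivated the hybrid approach.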
Related entries
- accuracy,
6(4)1,
7(1)1,
7(1)2,
8(3)11,
8(4)15,
9(1)4,
10(2)7,
10(4)17,
11(2)6,
11(3)9,
12(1)4,
12(2)7,
12(3)10,
12(3)11,
12(3)12,
12(4)15,
13(3)12
- amount,
7(3)8,
8(4)17,
9(1)3,
9(4)15,
10(2)7,
10(2)9,
10(4)21,
13(1)3
- Assamese,
7(3)9,
11(1)1
- based, HMM-,
7(3)10
- based, rule-,
9(4)14,
10(2)8,
11(3)8
- basic,
7(3)10,
8(1)4,
9(1)4,
10(3)15,
11(4)16
- Bengali,
9(3)11,
9(3)12,
10(2)9,
11(1)1
- character,
1(3)269,
2(1)27,
6(2)6,
6(2)8,
7(4)11,
8(2)9,
8(3)11,
9(4)14,
10(2)10,
11(2)7,
12(1)1,
12(1)2,
12(2)6,
12(3)9,
12(4)16,
13(2)6,
13(2)8,
13(3)12,
13(4)18
- classify,
7(3)8,
12(3)9
- creating,
12(4)16
- Creutz,
9(1)3
- design,
8(4)15,
10(2)10,
10(3)14,
10(4)18,
11(4)15,
12(1)2,
12(2)6
- dictionary,
1(4)281,
5(2)121,
6(3)11,
7(3)9,
9(1)4,
10(1)3,
10(2)7,
11(2)6,
11(4)16,
12(2)7
- dominant,
11(2)7,
12(1)4
- during,
7(3)9,
9(2)7,
11(1)1,
12(4)15
- each,
5(2)121,
5(2)146,
5(2)165,
7(3)8,
7(3)10,
7(4)11,
7(4)13,
8(4)16,
8(4)18,
9(1)2,
10(1)2,
10(2)7,
10(2)9,
10(4)20,
12(1)2,
12(1)3,
12(3)9,
12(3)10,
13(1)2,
13(2)9,
13(3)13,
13(4)17
- error,
4(1)18,
6(3)9,
7(1)2,
7(3)10,
9(1)2,
9(2)6,
10(1)2,
10(1)5,
10(1)6,
10(2)7,
10(2)10,
11(1)3,
11(2)7,
11(4)18,
12(1)2,
13(2)8
- extracted,
3(4)227,
4(3)321,
6(2)8,
7(1)1,
7(1)3,
8(3)10,
9(1)1,
11(2)6,
11(3)11,
13(2)9,
13(4)16
- focus,
7(4)11,
8(1)4,
9(2)5,
10(3)12,
10(3)16,
10(4)17,
11(1)2,
12(4)14
- four,
5(2)146,
8(4)19,
10(1)2,
10(2)9,
13(1)1,
13(4)18
- frequent,
8(4)15,
13(2)8
- HMM,
4(1)38,
7(3)10,
11(3)9
- HMM-based,
7(3)10
- hybrid,
3(2)113,
7(2)5,
7(4)13,
9(1)3,
11(3)11
- improvement,
4(3)280,
5(2)121,
6(2)7,
7(1)2,
7(1)3,
7(3)8,
7(3)10,
8(1)4,
8(3)10,
8(4)15,
8(4)16,
9(1)3,
9(3)11,
10(2)8,
11(4)17,
11(4)18,
12(1)1,
12(2)7,
12(3)11,
13(3)12,
13(4)16,
13(4)17
- India,
7(3)9,
11(1)1,
13(2)8
- Indian,
8(2)8,
9(3)9,
9(3)10,
9(3)12,
10(2)8,
10(2)9,
11(1)1,
13(2)8
- introduce,
5(2)121,
7(4)11,
8(2)7,
8(3)12,
10(3)16,
12(3)9,
12(4)15,
13(1)3,
13(1)4
- Kalita, Jugal,
12(2)6
- Lagus,
9(1)3
- last,
7(1)3,
8(4)17,
8(4)18,
11(1)1,
11(2)7,
12(3)10,
13(1)2,
13(1)3
- letters,
9(3)11
- measure,
5(2)89,
6(2)6,
6(4)3,
8(2)7,
9(2)7,
10(1)2,
10(1)6,
10(4)20,
11(2)6,
11(3)9,
11(3)11,
13(3)11,
13(3)13
- Morfessor,
9(1)3
- morphological,
6(4)2,
7(3)9,
8(4)16,
9(1)3,
9(4)15,
10(1)4,
11(3)9,
13(2)9
- most,
6(2)6,
7(1)1,
7(3)8,
7(3)10,
8(4)15,
9(1)1,
9(2)5,
9(3)11,
10(1)5,
10(2)8,
12(1)1,
12(1)2,
13(1)1,
13(1)4,
13(2)6,
13(4)18
- natural,
1(2)123,
3(1)11,
5(2)121,
5(4)291,
6(2)7,
7(1)1,
7(4)13,
8(1)2,
8(2)9,
8(4)13,
8(4)14,
8(4)16,
8(4)19,
9(2)6,
9(3)11,
9(4)15,
10(3)14,
10(4)20,
11(1)2,
11(4)14,
11(4)15,
12(1)3
- normalization,
5(3)245,
8(4)19,
9(3)11,
10(2)8,
12(3)11,
13(2)7
- obtain,
7(2)7,
7(3)8,
7(3)9,
8(4)15,
9(2)5,
9(3)12,
10(3)12,
11(3)8,
12(3)10,
12(4)17
- path,
7(4)13,
10(3)15
- poor, resource-,
8(4)17,
10(2)9,
13(1)3
- prior,
7(3)9,
11(2)4
- probable,
6(2)6
- problem,
6(2)7,
6(3)9,
6(3)11,
6(4)1,
7(1)2,
7(2)7,
7(3)10,
8(1)2,
8(2)9,
8(3)10,
8(4)19,
9(1)1,
9(1)3,
9(2)5,
9(4)13,
10(1)2,
10(1)4,
10(3)14,
10(3)16,
10(4)21,
11(3)8,
11(3)11,
11(4)17,
11(4)18,
12(1)2,
12(1)3,
12(2)7,
12(3)10,
12(3)12,
12(4)16,
13(2)8,
13(4)17
- published,
12(2)7
- reduce,
6(4)2,
8(2)7,
11(3)10,
12(1)2,
12(2)5,
12(3)11,
12(3)12,
13(2)8
- remove,
6(4)2
- reported,
7(4)13,
11(1)1,
11(1)2,
11(2)7,
13(1)4,
13(2)6
- resource-poor,
8(4)17,
10(2)9,
13(1)3
- respectively,
7(4)11,
7(4)13,
10(3)12,
10(3)13,
12(1)1,
12(1)4
- result,
4(2)135,
5(2)121,
5(2)146,
5(2)165,
6(2)6,
6(2)7,
6(3)9,
6(3)11,
6(4)3,
7(1)2,
7(2)5,
7(2)6,
7(2)7,
7(3)8,
7(3)10,
7(4)11,
7(4)12,
7(4)13,
8(1)2,
8(1)3,
8(1)4,
8(2)6,
8(2)9,
8(3)10,
8(3)12,
8(4)14,
8(4)15,
8(4)16,
8(4)17,
8(4)18,
8(4)19,
9(1)1,
9(1)2,
9(2)5,
9(2)6,
9(2)7,
9(3)11,
9(3)12,
9(4)14,
10(1)2,
10(2)7,
11(2)4,
11(2)5,
11(3)8,
11(3)9,
11(3)11,
11(4)13,
11(4)14,
11(4)15,
12(1)3,
12(1)4,
12(2)5,
12(2)7,
12(3)9,
12(3)10,
12(3)11,
12(4)14,
12(4)16,
13(1)1,
13(1)4,
13(2)6,
13(2)7,
13(2)9,
13(3)11,
13(3)12
- rule-based,
9(4)14,
10(2)8,
11(3)8
- several,
6(2)6,
6(2)7,
6(4)3,
7(2)5,
7(2)7,
7(3)10,
8(3)10,
8(4)16,
8(4)17,
8(4)18,
9(3)12,
11(2)6,
11(4)13,
11(4)16,
12(1)2,
13(3)12
- Sharma, Utpal,
7(3)9
- show,
5(2)89,
5(2)146,
7(1)1,
7(1)2,
7(1)3,
7(4)11,
7(4)12,
7(4)13,
8(1)4,
8(2)7,
8(2)9,
8(3)12,
8(4)16,
8(4)17,
9(1)1,
9(1)2,
9(1)3,
9(2)5,
9(2)6,
9(2)7,
9(3)11,
9(3)12,
9(4)14,
10(1)3,
10(3)15,
11(2)4,
11(2)5,
11(2)7,
11(3)8,
11(3)11,
11(4)14,
11(4)15,
11(4)17,
11(4)18,
12(1)2,
12(1)4,
12(2)5,
12(2)7,
12(3)9,
12(3)10,
12(3)11,
12(4)15,
12(4)16,
13(1)3,
13(2)6,
13(2)7,
13(2)9
- significant,
5(2)121,
7(1)3,
8(1)4,
8(4)15,
8(4)16,
8(4)17,
8(4)18,
9(2)5,
9(3)11,
10(1)5,
10(2)8,
10(3)14,
11(2)6,
11(2)7,
12(1)1,
12(4)16,
13(1)3,
13(4)16
- single,
7(3)8,
8(3)11,
9(3)12,
11(2)5,
13(4)17
- state,
5(2)165,
7(3)10,
9(3)8,
11(3)9,
12(3)9,
13(3)15
- stem,
6(3)9,
6(4)2,
8(4)16
- stemming,
6(4)2,
9(3)11,
9(3)12,
10(2)8,
13(2)7
- step,
7(3)8,
8(2)8,
8(4)16,
9(3)12,
10(1)5,
10(3)12,
11(2)6,
12(1)2,
12(2)7,
13(4)17
- study,
4(2)159,
4(3)243,
5(2)121,
5(2)146,
5(2)165,
5(3)209,
6(2)6,
6(2)7,
8(1)3,
8(1)4,
8(4)16,
9(2)5,
9(2)6,
9(2)7,
9(3)11,
10(2)10,
10(3)12,
10(4)17,
10(4)18,
11(1)3,
11(2)6,
11(3)9,
11(3)11,
11(4)14,
13(1)3,
13(2)7,
13(3)11,
13(3)12
- suffix,
6(4)2,
7(3)9,
8(4)16,
9(3)11,
12(3)11
- text,
1(1)34,
1(2)159,
3(3)190,
3(4)215,
4(1)38,
4(2)135,
4(4)435,
5(1)1,
5(2)165,
6(1)z-3,
6(3)10,
6(4)2,
7(2)6,
7(3)8,
7(3)9,
8(1)4,
8(3)11,
8(4)14,
8(4)16,
8(4)18,
9(1)1,
9(3)10,
9(4)15,
10(3)14,
11(1)2,
11(2)4,
11(2)5,
11(4)13,
11(4)14,
11(4)15,
11(4)16,
11(4)17,
11(4)18,
12(1)2,
12(1)3,
12(2)6,
12(3)11,
12(4)15,
13(1)1,
13(1)4,
13(2)7,
13(2)8,
13(2)9,
13(2)10
- there,
7(3)9,
7(4)11,
8(2)7,
8(3)12,
8(4)17,
9(2)5,
9(3)12,
9(4)15,
10(1)2,
10(3)14,
12(1)2,
13(1)1,
13(2)8
- while,
5(2)165,
8(1)2,
8(4)18,
9(4)15,
10(1)4,
10(3)15,
11(2)4,
11(2)5,
12(3)10,
12(3)11,
13(1)1,
13(2)8,
13(3)12
- work,
5(2)121,
6(2)6,
6(3)11,
6(4)2,
7(2)7,
7(3)9,
8(4)19,
9(2)5,
9(4)15,
10(1)4,
10(2)10,
12(1)3,
12(3)9,
13(1)1,
13(2)9,
13(4)18