Writing a tokenizer in Python
Written by: J Dawg
I want to design a custom tokenizer module in Python that lets users specify what tokenizer(s) to use for the input. For instance, consider the following input:
Q: What is a good way to achieve this? A: I am not so sure. I think I
will use Python.
I want to be able to provide NLTK's sentence tokenization, sent_tokenize(), as an option because it works well in many situations and I don't want to re-invent the wheel. In addition to this, I also want to provide a finer-grained tokenization builder (something along the lines of a rule engine). Let me explain:
Assume that I provide a couple of tokenizers:
SENTENCE  # Tokenizes the given input by using sent_tokenize()
WORD      # Tokenizes the given input by using word_tokenize()
QA        # Tokenizes using a custom regular expression, e.g., Q: (.*?) A: (.*?)
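For reference, the NLTK calls behind SENTENCE and WORD are essentially one-liners. A minimal sketch (it assumes the punkt model has been downloaded once via nltk.download('punkt')):

    from nltk.tokenize import sent_tokenize, word_tokenize

    text = "What is a good way to achieve this? I am not so sure."
    print(sent_tokenize(text))  # roughly: ['What is a good way to achieve this?', 'I am not so sure.']
    print(word_tokenize(text))  # roughly: ['What', 'is', 'a', 'good', 'way', ..., '?']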
I want to support rules as follows:
- QA -> SENTENCE: Apply the QA tokenizer first, followed by the sentence tokenizer
- QA: Apply just the QA tokenizer
Therefore, the expected output is as follows:
1. QA -> SENTENCE

[
    ('QUESTION',
        ('SENTENCE', 'What is a good way to achieve this?'),
    ),
    ('ANSWER',
        ('SENTENCE', 'I am not so sure', 'I think I will use Python')
    )
]

2. QA

[
    ('QUESTION', 'What is a good way to achieve this?'),
    ('ANSWER', 'I am not so sure. I think I will use Python')
]
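To make the chaining concrete, here is a rough, self-contained sketch of the behaviour I have in mind (all names here are my placeholders, not a proposed design, and the nesting differs slightly from the expected output above):

    import re

    # Sketch only: each tokenizer maps text -> list of (label, text) pairs,
    # and a rule is a list of tokenizer names applied left to right.
    def qa_tokenizer(text):
        m = re.search(r'Q:\s*(?P<QUESTION>.*?)\s*A:\s*(?P<ANSWER>.*)', text, re.S)
        return [('QUESTION', m.group('QUESTION')), ('ANSWER', m.group('ANSWER'))] if m else []

    def sentence_tokenizer(text):
        return [('SENTENCE', s.strip()) for s in re.findall(r'[^.?!]+[.?!]?', text) if s.strip()]

    TOKENIZERS = {'QA': qa_tokenizer, 'SENTENCE': sentence_tokenizer}

    def apply_rule(rule, text):
        head, rest = rule[0], rule[1:]
        pairs = TOKENIZERS[head](text)
        if not rest:
            return pairs
        return [(label, apply_rule(rest, part)) for label, part in pairs]

    text = 'Q: What is a good way to achieve this? A: I am not so sure. I think I will use Python.'
    print(apply_rule(['QA', 'SENTENCE'], text))  # QA -> SENTENCE
    print(apply_rule(['QA'], text))              # QA alone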
What is a good design to achieve this efficiently?
Since tokenizing is easy in Python, I wonder what your module plans to provide beyond that. I mean that when starting a piece of software, a good design rather comes from thinking about the usage scenarios than from considering data structures first.

Your examples for the expected output are a bit confusing. I assume you want each tokenizer to return its name on the left side and a list of tokens on the right side. I played around a bit to achieve similar results, but using lists for easier handling:
import re
# some tokenizers
def tokzr_WORD(txt): return ('WORD', re.findall(r'(?ms)\W*(\w+)', txt))  # split words
def tokzr_SENT(txt): return ('SENTENCE', re.findall(r'(?ms)\s*(.*?(?:\.|\?|!))', txt))  # split sentences

def tokzr_QA(txt):
    l_qa = []
    for m in re.finditer(r'(?ms)^[\s#\-\*]*(?:Q|Question)\s*:\s*(?P<QUESTION>\S.*?\?)[\s#\-\*]+(?:A|Answer)\s*:\s*(?P<ANSWER>\S.*?)$', txt):  # split (Q, A) sequences
        for k in ['QUESTION', 'ANSWER']:
            l_qa.append(m.groupdict()[k])
    return ('QA', l_qa)

def tokzr_QA_non_canonical(txt):  # Note: not supported by tokenize_recursively() as not canonical.
    l_qa = []
    for m in re.finditer(r'(?ms)^[\s#\-\*]*(?:Q|Question)\s*:\s*(?P<QUESTION>\S.*?\?)[\s#\-\*]+(?:A|Answer)\s*:\s*(?P<ANSWER>\S.*?)$', txt):  # split (Q, A) sequences
        for k in ['QUESTION', 'ANSWER']:
            l_qa.append((k, m.groupdict()[k]))
    return l_qa

dict_tokzr = {  # control string: tokenizer function
    'WORD'    : tokzr_WORD,
    'SENTENCE': tokzr_SENT,
    'QA'      : tokzr_QA,
}
# the core function
def tokenize_recursively(l_tokzr, work_on, lev=0):
    if isinstance(work_on, str):  # plain text still to be tokenized
        ctrl, work_on = dict_tokzr[l_tokzr[0]](work_on)  # tokenize
    else:  # already a (ctrl, tokens...) sequence
        ctrl, work_on = work_on[0], work_on[1:]  # get right part
    ret = [ctrl]
    if len(l_tokzr) == 1:
        ret.append(work_on)  # add right part
    else:
        for wo in work_on:  # dive into tree
            t = tokenize_recursively(l_tokzr[1:], wo, lev + 1)
            ret.append(t)
    return ret
# just for printing
def nestedListLines(aList, ind='    ', d=0):
    """ Returns multi-line string representation of param aList. Use param ind to indent per level. """
    sRet = '\n' + d * ind + '['
    nested = 0
    for i, e in enumerate(aList):
        if i:
            sRet += ', '
        if type(e) == type(aList):  # a nested list: recurse one level deeper
            sRet += nestedListLines(e, ind, d + 1)
            nested = 1
        else:
            sRet += '\n' + (d + 1) * ind + repr(e) if nested else repr(e)
    sRet += '\n' + d * ind + ']' if nested else ']'
    return sRet
# main()
inp1 = """
* Question: I want try something. Should I?
* Answer : I'd assume so. Give it a try.
"""
inp2 = inp1 + 'Q: What is a good way to achieve this? A: I am not so sure. I think I will use Python.'
print(repr(tokzr_WORD(inp1)))
print(repr(tokzr_SENT(inp1)))
print(repr(tokzr_QA(inp1)))
print(repr(tokzr_QA_non_canonical(inp1)))  # Really this way?
print()

for ctrl, inp in [  # example control sequences
    ('SENTENCE-WORD', inp1),
    ('QA-SENTENCE', inp2)
]:
    res = tokenize_recursively(ctrl.split('-'), inp)
    print(nestedListLines(res))
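A note on the design: the control string (e.g. 'QA-SENTENCE') is simply split on '-' into a list of tokenizer names, and tokenize_recursively() consumes that list from left to right: the first tokenizer splits the raw text, and each recursion level re-tokenizes the pieces produced by the level above. Adding a new tokenizer only requires registering one more function in dict_tokzr.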
Btw., Python's own Lib/tokenize.py (the tokenizer for Python source code itself) might be worth a look for how to handle things.
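For instance, this minimal snippet (my own example) dumps the token stream for one line of Python source using only the standard library:

    import io
    import tokenize

    # Print (token type name, token text) for a single line of Python source.
    for tok in tokenize.generate_tokens(io.StringIO('answer = 42  # the reply').readline):
        print(tokenize.tok_name[tok.type], repr(tok.string))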