Conversation
| syms = ['M', 'CM', 'D', 'CD', 'C', 'XC', 'L', 'XL', 'X', 'IX', 'V', 'IV', 'I'] | ||
| roman = '' | ||
| i = 0 | ||
| while n > 0: |
There was a problem hiding this comment.
Will get stuck in an infinite loop if we pass 0, which we can't exclude from happening.
Should rewrite without while-loop, something like this (not tested):
for v, s in zip(val, syms):
count = n // v
roman += s * count
n -= v * count
|
|
||
| def get_style_num_map(styles_root, numId_to_abstract, abstract_to_levels, num_root): | ||
| style_num_map = {} | ||
| for style in styles_root.findall('w:style', NS): |
There was a problem hiding this comment.
Looks like there are two sources for the style_num_map, the non-abstract and the abstract ones, each being handled by a different for-loop.
Are they independent from one another? Does it makes sense to put them in the same map?
Also, intuitively, I would create two separate functions (or sub-functions, depending on whether we want them in the same map or not)
| return style_num_map | ||
|
|
||
| def extract_paragraph_text(para): | ||
| return ''.join([node.text or '' for node in para.findall('.//w:t', NS)]).strip() |
There was a problem hiding this comment.
This might not work for paragraphs that contain track changes (w:ins and w:del), to check.
In that case, we probably want to pass a parameter to decide whether we want the version pre or post changes, with post changes by default I guess.
| continue | ||
| style_id = style.attrib.get(f'{{{NS["w"]}}}styleId') | ||
| pPr = style.find('w:pPr', NS) | ||
| if pPr is not None: |
There was a problem hiding this comment.
Could be if pPr: for a more Pythonic way to write it
There was a problem hiding this comment.
xml.etree.ElementTree: Testing the truth value of an Element is deprecated. In a future release it will always return True. Prefer explicit len(elem) or elem is not None tests instead. zipimport.zipimporter.load_module() is deprecated: use exec_module() instead.
| return style_num_map | ||
|
|
||
| def extract_paragraph_text(para): | ||
| return ''.join([node.text or '' for node in para.findall('.//w:t', NS)]).strip() |
There was a problem hiding this comment.
Also, you can remove the [ ... ] to make it a generator comprehension and avoid creating a list in memory.
| return ''.join([node.text or '' for node in para.findall('.//w:t', NS)]).strip() | |
| return ''.join(node.text or '' for node in para.findall('.//w:t', NS)).strip() |
| return ''.join([node.text or '' for node in para.findall('.//w:t', NS)]).strip() | ||
|
|
||
| def get_list_number(para, style_num_map, counters, numId_to_abstract, abstract_to_levels): | ||
| pPr = para.find('w:pPr', NS) |
There was a problem hiding this comment.
A good argument to integrate that code into the parsing done by python-docx would be that it will parse such info only once, such that we can simply do para.pPr afterwards whenever we need it, and avoid losing performance running .find(...) each time.
| if counters[numId][ilvl] == 0: | ||
| counters[numId][ilvl] = level_def.get('start', 1) | ||
| else: | ||
| counters[numId][ilvl] += 1 |
There was a problem hiding this comment.
Since the function is called get_list_number, I would expect it to return something but not modify its parameters, like counters here.
I would expect the counters mapping to be already constructed already. Here this looks like it create inconsistent numbering if we call get_list_number twice for the same paragraph.
|
|
||
| lvlText = level_def.get('lvlText', '') | ||
| fmt_values = {} | ||
| for level in range(9): |
There was a problem hiding this comment.
To avoid any confusion regarding to fact that "level" starts at 1 and not 0, I would initialize counters as defaultdict(lambda: defaultdict(lambda: 1)) and initialize level here at 1:
for level in range(1, 10):
and remove the + 1s afterwards.
Maybe applies to ilvl as well?
And by the way, if 10 the maximum level that can possibly be defined, per the specs?
| for level in range(9): | ||
| if f'%{level + 1}' in lvlText: | ||
| if counters[numId][level] == 0: | ||
| level_start = abstract_to_levels[abstractId].get(level, {}).get('start', 1) |
There was a problem hiding this comment.
Maybe it is a bit overkill, but we could initialize abstract_to_levels sub-dict with {'start': 1} by default, such that we don't need the double .get
| numbering_str = get_list_number(para, style_num_map, counters, numId_to_abstract, abstract_to_levels) | ||
| text = extract_paragraph_text(para) | ||
| if text: | ||
| lines.append(f'{numbering_str}{text}') |
There was a problem hiding this comment.
Add a space in between? Or maybe f'{numbering_str} {text.lstrip()}' to have one space at most.
No description provided.