Full script for docx extraction by Thomas-Daniel · Pull Request #1 · didmar/python-docx

Thomas-Daniel · 2025-04-14T13:08:28Z

No description provided.

didmar · 2025-04-14T13:18:14Z

+    syms = ['M', 'CM', 'D', 'CD', 'C', 'XC', 'L', 'XL', 'X', 'IX', 'V', 'IV', 'I']
+    roman = ''
+    i = 0
+    while n > 0:


Will get stuck in an infinite loop if we pass 0, which we can't exclude from happening.
Should rewrite without while-loop, something like this (not tested):

for v, s in zip(val, syms): count = n // v roman += s * count n -= v * count

didmar · 2025-04-14T13:29:03Z

+
+def get_style_num_map(styles_root, numId_to_abstract, abstract_to_levels, num_root):
+    style_num_map = {}
+    for style in styles_root.findall('w:style', NS):


Looks like there are two sources for the style_num_map, the non-abstract and the abstract ones, each being handled by a different for-loop.
Are they independent from one another? Does it makes sense to put them in the same map?
Also, intuitively, I would create two separate functions (or sub-functions, depending on whether we want them in the same map or not)

didmar · 2025-04-14T13:33:32Z

+    return style_num_map
+
+def extract_paragraph_text(para):
+    return ''.join([node.text or '' for node in para.findall('.//w:t', NS)]).strip()


This might not work for paragraphs that contain track changes (w:ins and w:del), to check.
In that case, we probably want to pass a parameter to decide whether we want the version pre or post changes, with post changes by default I guess.

didmar · 2025-04-14T13:35:15Z

+            continue
+        style_id = style.attrib.get(f'{{{NS["w"]}}}styleId')
+        pPr = style.find('w:pPr', NS)
+        if pPr is not None:


Could be if pPr: for a more Pythonic way to write it

xml.etree.ElementTree: Testing the truth value of an Element is deprecated. In a future release it will always return True. Prefer explicit len(elem) or elem is not None tests instead. zipimport.zipimporter.load_module() is deprecated: use exec_module() instead.

didmar · 2025-04-14T13:37:10Z

+    return style_num_map
+
+def extract_paragraph_text(para):
+    return ''.join([node.text or '' for node in para.findall('.//w:t', NS)]).strip()


Also, you can remove the [ ... ] to make it a generator comprehension and avoid creating a list in memory.

Suggested change

return ''.join([node.text or '' for node in para.findall('.//w:t', NS)]).strip()

return ''.join(node.text or '' for node in para.findall('.//w:t', NS)).strip()

didmar · 2025-04-14T13:45:12Z

+    return ''.join([node.text or '' for node in para.findall('.//w:t', NS)]).strip()
+
+def get_list_number(para, style_num_map, counters, numId_to_abstract, abstract_to_levels):
+    pPr = para.find('w:pPr', NS)


A good argument to integrate that code into the parsing done by python-docx would be that it will parse such info only once, such that we can simply do para.pPr afterwards whenever we need it, and avoid losing performance running .find(...) each time.

didmar · 2025-04-14T14:07:37Z

+                if counters[numId][ilvl] == 0:
+                    counters[numId][ilvl] = level_def.get('start', 1)
+                else:
+                    counters[numId][ilvl] += 1


Since the function is called get_list_number, I would expect it to return something but not modify its parameters, like counters here.
I would expect the counters mapping to be already constructed already. Here this looks like it create inconsistent numbering if we call get_list_number twice for the same paragraph.

didmar · 2025-04-14T14:15:49Z

+
+                lvlText = level_def.get('lvlText', '')
+                fmt_values = {}
+                for level in range(9):


To avoid any confusion regarding to fact that "level" starts at 1 and not 0, I would initialize counters as defaultdict(lambda: defaultdict(lambda: 1)) and initialize level here at 1:

for level in range(1, 10):

and remove the + 1s afterwards.

Maybe applies to ilvl as well?

And by the way, if 10 the maximum level that can possibly be defined, per the specs?

didmar · 2025-04-14T14:19:07Z

+                for level in range(9):
+                    if f'%{level + 1}' in lvlText:
+                        if counters[numId][level] == 0:
+                            level_start = abstract_to_levels[abstractId].get(level, {}).get('start', 1)


Maybe it is a bit overkill, but we could initialize abstract_to_levels sub-dict with {'start': 1} by default, such that we don't need the double .get

didmar · 2025-04-14T14:23:46Z

+    numbering_str = get_list_number(para, style_num_map, counters, numId_to_abstract, abstract_to_levels)
+    text = extract_paragraph_text(para)
+    if text:
+        lines.append(f'{numbering_str}{text}')


Add a space in between? Or maybe f'{numbering_str} {text.lstrip()}' to have one space at most.

Full script for docx extraction

c8eecef

didmar reviewed Apr 14, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Full script for docx extraction#1

Full script for docx extraction#1
Thomas-Daniel wants to merge 1 commit intodidmar:masterfrom
Thomas-Daniel:wordapi

Thomas-Daniel commented Apr 14, 2025

Uh oh!

didmar Apr 14, 2025

Uh oh!

didmar Apr 14, 2025

Uh oh!

didmar Apr 14, 2025

Uh oh!

didmar Apr 14, 2025

Uh oh!

Thomas-Daniel Apr 14, 2025

Uh oh!

didmar Apr 14, 2025

Uh oh!

didmar Apr 14, 2025

Uh oh!

didmar Apr 14, 2025

Uh oh!

didmar Apr 14, 2025

Uh oh!

didmar Apr 14, 2025

Uh oh!

didmar Apr 14, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	return ''.join([node.text or '' for node in para.findall('.//w:t', NS)]).strip()
	return ''.join(node.text or '' for node in para.findall('.//w:t', NS)).strip()

Conversation

Thomas-Daniel commented Apr 14, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants