SciDocx2WebConversion module

Convert scientific papers in DOCX format to HTML. See this project’s GitHub page for more info: https://github.com/Fulminis-ictus/SciDocx2Web

This is a module that is called by SciDocx2WebUI. It handles all of the actual conversion functions.

Documentation last updated: 2023.06.03

Author: Tim Reichert

Version: 1.0 (first public release)

Uses and is dependent on Mammoth: https://github.com/mwilliamson/python-mammoth

Makes use of dwasyl’s added page break detection functionality: https://github.com/dwasyl/python-mammoth/commit/38777ee623b60e6b8b313e1e63f12dafd82b63a4

SciDocx2WebConversion.abbreviate_footnotes(footnotes, abbreviateFootnotesNumber)[source]

Abbreviates footnotes according to the number in the “Abbreviate tooltips after how many symbols?” input field and adds “[…]” to the end of abbreviated footnotes.

If the input field is empty:

Skips this function.

If something other than an integer is input:

Tooltip abbreviation is skipped with an error message indicating that the abbreviation was unsuccessful but the rest of the conversion process continues as normal.
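
The three cases above (empty field, integer, anything else) can be sketched as follows. This is a minimal illustration, not the module's actual code: the function name `abbreviate`, its return shape and the error text are hypothetical, and it assumes that footnotes no longer than the limit are left untouched.

```python
def abbreviate(footnotes, limit_text):
    """Shorten each footnote to `limit_text` characters and append "[…]".

    Returns (footnotes, error_message); error_message is None on success.
    """
    if limit_text.strip() == "":
        # Empty input field: skip abbreviation entirely.
        return footnotes, None
    try:
        limit = int(limit_text)
    except ValueError:
        # Not an integer: report the problem and keep the footnotes unchanged,
        # so the rest of the conversion can continue.
        return footnotes, "Tooltip abbreviation failed: input is not an integer."
    shortened = [
        text if len(text) <= limit else text[:limit] + "[…]"
        for text in footnotes
    ]
    return shortened, None


notes = ["A short note.", "A considerably longer footnote."]
print(abbreviate(notes, "14"))
```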

SciDocx2WebConversion.add_Head_IDs(headingsIDVar, bodyxml)[source]

If “Add IDs to headings?” is checked:

Adds IDs to all found heading elements (<h1>, <h2> and <h3>) following the form “headingN” where “N” is an integer counting upwards across all heading levels.

Example: <h1 id="heading1">First heading</h1>, <h1 id="heading2">Second heading</h1>, <h2 id="heading3">Subheading of second heading</h2> etc.
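
A minimal sketch of this numbering scheme, using the standard library's `xml.etree.ElementTree` as a stand-in for whatever XML library the module actually uses. The function name `add_heading_ids` is hypothetical, and covering <h1> to <h3> is an assumption based on the example above.

```python
import xml.etree.ElementTree as ET


def add_heading_ids(body):
    """Give every heading an id "headingN", counting in document order."""
    counter = 0
    for element in body.iter():
        if element.tag in ("h1", "h2", "h3"):
            counter += 1
            element.set("id", f"heading{counter}")
    return body


body = ET.fromstring(
    "<body><h1>First heading</h1><h1>Second heading</h1>"
    "<h2>Subheading of second heading</h2></body>"
)
add_heading_ids(body)
print(ET.tostring(body, encoding="unicode"))
```

A single counter is shared by all heading levels, which is why the <h2> in the example receives “heading3” rather than restarting at 1.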

SciDocx2WebConversion.add_cite(tooltiptextPath, bodyxml, footnotes)[source]

Adds a cite attribute to <blockquote> elements. The footnote number at the end of the blockquote, minus one, is used as the index into the footnote list (the list starts counting at 0 but the footnotes start counting at 1), and the footnote text found at that index is inserted as the cite value.
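
The off-by-one lookup can be sketched like this. It is an illustration under assumptions: the helper below is not the module's implementation, it takes only two of the three parameters, and it assumes the blockquote text ends with the footnote marker (with or without square brackets).

```python
import re
import xml.etree.ElementTree as ET


def add_cite(body, footnotes):
    """Set cite on each blockquote from the footnote number at its end."""
    for quote in body.iter("blockquote"):
        text = "".join(quote.itertext())
        match = re.search(r"(\d+)\]?\s*$", text)
        if match is None:
            continue  # no trailing footnote number, nothing to cite
        index = int(match.group(1)) - 1  # footnote 1 lives at list index 0
        if 0 <= index < len(footnotes):
            quote.set("cite", footnotes[index])


body = ET.fromstring(
    '<body><blockquote>Quoted text.<sup><a href="#footnote-2" '
    'id="footnote-ref-2">[2]</a></sup></blockquote></body>'
)
add_cite(body, ["First source.", "Second source."])
print(body.find("blockquote").get("cite"))
```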

SciDocx2WebConversion.add_wbr_footnotes(footnotesAbbr, abbreviateFootnotesNumber)[source]

Finds all slashes within footnotes and adds <wbr> tags after them to ensure that links automatically receive a line break when necessary.

If footnotes are abbreviated:

Apply to all text. Since all HTML tags have been removed in a previous step there’s no need to worry about HTML tags being affected.

Else:

Look for text in <a> tags specifically so that other HTML tags aren’t affected.
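
Both branches can be sketched with the standard library. The function name `add_wbr` and the regular expression are illustrative assumptions, not the module's code; the sketch also assumes that links do not contain nested tags.

```python
import re


def add_wbr(footnote, abbreviated):
    """Insert <wbr> after slashes so long links can wrap."""
    if abbreviated:
        # Plain text without tags: every slash may be followed by <wbr>.
        return footnote.replace("/", "/<wbr>")

    # Tags are still present: only touch the text between <a ...> and </a>,
    # leaving href values and closing tags such as </em> alone.
    def break_link_text(match):
        opening, text, closing = match.groups()
        return opening + text.replace("/", "/<wbr>") + closing

    return re.sub(r"(<a\b[^>]*>)(.*?)(</a>)", break_link_text, footnote, flags=re.DOTALL)


print(add_wbr('<em>x/y</em> <a href="https://a.org/b">a.org/b</a>', False))
```

Restricting the second branch to link text matters because a blanket replacement would also hit the slash in closing tags and inside attribute values.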

SciDocx2WebConversion.add_wbr_text(bodyxml)[source]

Finds all slashes in links within the main text and adds <wbr> tags after them to ensure that links automatically receive line breaks after slashes when necessary.

SciDocx2WebConversion.adjust_footnotes(tooltipsCheckVar, bodyxml)[source]

If “Add tooltips to footnotes?” is checked:

Removes the <sup> elements and creates new ones within the <a> tags. Also removes the square brackets around the footnote numbers, and moves the numbers from the <a> tags into the <sup> tags.

This whole process ensures that only the footnote numbers, not the footnote text within the tooltips, are enclosed by the <sup> elements.

Example: <sup><span class="tooltip"><a href="#footnote-N" id="footnote-ref-N">[N]</a><span class="tooltiptext">Footnote content.</span></span></sup> ->

<span class="tooltippop" aria-describedby="tooltip-N"><a href="#footnote-N" id="footnote-ref-N"><sup>N</sup></a><span role="tooltip" id="tooltip-N">Footnote content.</span></span>

Else:

Only the square brackets around the footnote numbers are removed.

SciDocx2WebConversion.assemble_html(navigationVar, bodyCheckVar, cssCheckVar, cssXML, navGridDiv, bodyxml, javascriptXML, javascriptCheckVar)[source]

Assembles the individual sections (navigation, CSS, body) and writes them to an HTML file.

If “Create navigation?” is unchecked:

Skips the navigation section.

If “Add suggested css?” is unchecked:

Skips the CSS section.

If “Only export the body?” is unchecked:

“<!DOCTYPE html>” is added to the beginning to create a well-formed HTML file. This is not necessary if only the body is exported, since the result is an incomplete HTML file without the <head> element.

SciDocx2WebConversion.create_footnotes_list(bodyxml, abbreviateFootnotesNumber)[source]

Compiles a list of footnotes. Iterates over all <li> elements in the footnote list created by Mammoth and saves their contents to a Python list.

There was one case where footnotes started counting at 0 instead of 1. This case has been accounted for by checking whether the list is empty after searching for list items with the ID “footnote-1”.

An empty footnote list is created if there are no footnotes.

If footnotes are abbreviated:

Get footnote text without HTML tags to prevent tags that are never closed due to the abbreviation process.

Else:

Get the footnote text with all HTML tags.
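
The two extraction modes can be sketched as follows, again with `xml.etree.ElementTree` standing in for the module's XML library. The function name and the assumption that footnote items carry IDs starting with “footnote-” are illustrative; the fallback for footnotes that start counting at 0 is not reproduced here.

```python
import xml.etree.ElementTree as ET


def collect_footnotes(body, abbreviated):
    """Return the contents of all footnote <li> elements as a Python list."""
    footnotes = []
    for li in body.iter("li"):
        if not li.get("id", "").startswith("footnote-"):
            continue
        if abbreviated:
            # Text only: cutting the string later cannot leave a tag unclosed.
            footnotes.append("".join(li.itertext()).strip())
        else:
            # Keep the inner markup of the list item.
            inner = (li.text or "") + "".join(
                ET.tostring(child, encoding="unicode") for child in li
            )
            footnotes.append(inner.strip())
    return footnotes  # stays empty if the document has no footnotes


body = ET.fromstring(
    '<body><ol><li id="footnote-1"><p>See <em>Smith</em> 2020.</p></li></ol></body>'
)
print(collect_footnotes(body, abbreviated=True))
print(collect_footnotes(body, abbreviated=False))
```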

SciDocx2WebConversion.create_navigation(navigationVar, navigationTypeVar, findH1, navigationElement, commentNavigation, h1Navigation, navGridDiv)[source]

If “Create navigation?” is checked:

Creates a navigation by compiling a list of all <h1>, <h2> and <h3> elements. Each navigation item is a paragraph or a button, depending on which radio button option is activated in the GUI. Adds links with href attributes that link each navigation item to its respective heading.

Example: <button><a href="#heading1">Navigation to first heading</a></button>

When compiling the heading text, the <h1>, <h2> and <h3> tags and new page markers are removed from the navigation.

Encloses everything with a <div> with the class “navGrid”.
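
A rough sketch of how such a navigation could be built from headings that already carry IDs. The function `build_navigation` and its `use_buttons` flag are hypothetical stand-ins for the GUI options; removal of new page markers is left out for brevity.

```python
import xml.etree.ElementTree as ET


def build_navigation(body, use_buttons):
    """Build <div class="navGrid"> with one linked item per heading."""
    nav = ET.Element("div", {"class": "navGrid"})
    for heading in body.iter():
        if heading.tag not in ("h1", "h2", "h3") or heading.get("id") is None:
            continue
        item = ET.SubElement(nav, "button" if use_buttons else "p")
        link = ET.SubElement(item, "a", {"href": "#" + heading.get("id")})
        # Only the heading text is copied, not the heading tag itself.
        link.text = "".join(heading.itertext()).strip()
    return nav


body = ET.fromstring(
    '<body><h1 id="heading1">Intro</h1><p>Text</p><h2 id="heading2">Details</h2></body>'
)
print(ET.tostring(build_navigation(body, use_buttons=True), encoding="unicode"))
```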

SciDocx2WebConversion.create_sections(bodyxml)[source]

This code was planned to create sections based on chapters, but it is not functional.

SciDocx2WebConversion.embed_audio(bodyxml)[source]

Embeds audio links that have the “insertaudio” class. The text marked as audio should be a link (not a hyperlink!), which is then inserted into the “src” attribute of the <source> element inside the <audio> element.

SciDocx2WebConversion.embed_images(bodyxml, dimensions)[source]

Embeds image links that have the “insertimage” class. The text marked as an image should be a link (not a hyperlink!), which is then inserted into the “src” attribute of the image.

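If the height and width input consists of two numbers split by a comma:

Use these two numbers as height and width of the image.

Else if the height and width input is empty: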

Don’t insert any width and height parameters.

Else:

Don’t insert any width and height parameters and display an error.

SciDocx2WebConversion.embed_videos(bodyxml, dimensions)[source]

Embeds video links that have the “insertvideo” class. The text marked as a video should be a link (not a hyperlink!), which is then inserted into the “src” attribute of the video.

If the height and width input consists of two numbers split by a comma:

Use these two numbers as height and width of the iframe.

Else if the height and width input is empty:

Don’t insert any width and height parameters.

Else:

Don’t insert any width and height parameters and display an error.
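
The three input cases can be sketched as below. The helper names, the error text and the order of the two numbers (height first, then width) are assumptions made for illustration; the module's actual parsing may differ.

```python
import xml.etree.ElementTree as ET


def parse_dimensions(dimensions):
    """Return ((height, width), error) for input such as "315,560"."""
    if dimensions.strip() == "":
        return None, None  # empty input: no dimensions, no error
    parts = [part.strip() for part in dimensions.split(",")]
    if len(parts) == 2 and all(part.isdecimal() for part in parts):
        return (int(parts[0]), int(parts[1])), None
    return None, "Dimensions must be two numbers separated by a comma."


def embed_video(iframe, dimensions):
    """Move the link text into src and apply the dimensions if valid."""
    pair, error = parse_dimensions(dimensions)
    iframe.set("src", (iframe.text or "").strip())
    iframe.text = None
    if pair is not None:
        iframe.set("height", str(pair[0]))
        iframe.set("width", str(pair[1]))
    return error  # the caller decides how to display it


iframe = ET.fromstring('<iframe class="insertvideo">https://example.org/v</iframe>')
print(embed_video(iframe, "315,560"), ET.tostring(iframe, encoding="unicode"))
```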

SciDocx2WebConversion.enclose_body(input, bodyCheckVar, pageTitleEntryText)[source]

Encloses the imported file with <body> tags to make it navigable with xpath.

If “Only export the body?” is unchecked:

Adds <html>, <head> and <title> elements. Also sets the charset to UTF-8 and adds the page title if the “Page title:” field isn’t empty.

Additionally, the grid container and main grid <div> tags are inserted and new page markers are unescaped.

SciDocx2WebConversion.escape_unescape(exportableBodyxml)[source]

Unescapes the file to make sure HTML tags are applied properly instead of being displayed as escaped HTML. Re-escapes HTML symbols that are marked as example code.

SciDocx2WebConversion.file_insertion_message(bodyxml)[source]

Adds the comment “Insert Media” before <p> elements that have the media caption class to alert the user that they might need to manually insert media at that line.

SciDocx2WebConversion.footnotes_bottom_adjust(bodyxml, commentBottomFootnotes, breakElement, hrElement)[source]

If a footnote list at the bottom of the main text exists:

Adds elements before the footnote list at the bottom to separate it from the rest: <br/>, <hr/> and the comment “Bottom footnotes”.

Also adds “aria-label” attributes that describe the links at the end of the footnotes as links back to the footnotes in the main text.

SciDocx2WebConversion.insert_footnotes(tooltipsCheckVar, bodyxml, footnotesAbbr)[source]

If “Add tooltips to footnotes?” is checked:

Finds <sup> tags that contain links with an ID starting with “footnote-ref”, which denotes footnote numbers in the main text. It then creates a tooltip <span> and appends it to the end of the found <sup> tags. The whole element, including the <sup> element and tooltip <span> element, is then enclosed with a tooltippop <span> element. Structure derived from: https://www.w3schools.com/howto/howto_css_tooltip.asp

Also adds “role” and “aria-describedby” attributes for accessibility.

Conversion example: <sup><a href="#footnote-N" id="footnote-ref-N">[N]</a></sup> ->

<span class="tooltippop" aria-describedby="tooltip-N"><sup><a href="#footnote-N" id="footnote-ref-N">[N]</a><span role="tooltip" id="tooltip-N">Footnote content.</span></sup></span>
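
The wrapping step shown in the conversion example could be sketched like this with `xml.etree.ElementTree`. It is an illustration under assumptions, not the module's code: the function name is hypothetical and the footnote number is read from the link's ID.

```python
import xml.etree.ElementTree as ET


def insert_tooltips(body, footnotes):
    """Wrap each footnote <sup> in a tooltippop span and add a tooltip span."""
    for parent in list(body.iter()):
        for index, child in enumerate(list(parent)):
            link = child.find("a") if child.tag == "sup" else None
            if link is None or not link.get("id", "").startswith("footnote-ref"):
                continue
            number = link.get("id").rsplit("-", 1)[1]
            wrapper = ET.Element(
                "span",
                {"class": "tooltippop", "aria-describedby": f"tooltip-{number}"},
            )
            # Text following the <sup> must now follow the wrapper instead.
            wrapper.tail, child.tail = child.tail, None
            parent[index] = wrapper
            wrapper.append(child)
            tooltip = ET.SubElement(
                child, "span", {"role": "tooltip", "id": f"tooltip-{number}"}
            )
            tooltip.text = footnotes[int(number) - 1]


body = ET.fromstring(
    '<body><p>Text<sup><a href="#footnote-1" id="footnote-ref-1">[1]</a></sup>'
    " more.</p></body>"
)
insert_tooltips(body, ["Footnote content."])
print(ET.tostring(body, encoding="unicode"))
```

Pairing `aria-describedby` on the wrapper with a matching `id` on the `role="tooltip"` element is what lets assistive technology associate the footnote number with its tooltip text.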

SciDocx2WebConversion.move_table_caption(bodyxml)[source]

Moves the <caption> elements to the beginning of the <table> elements for semantic reasons.

SciDocx2WebConversion.page_breaks(pageNumberCheckVar, pageNumberStartCheckVar, bodyxml, bodyCheckVar)[source]

If “Insert page numbers?” is checked:

Inserts a <sub class="pagenumber"> element at the very beginning of the text (only page breaks are marked automatically, so the first page needs to receive an element manually). Finds all <sub class="pagenumber"> elements and inserts a page number following the form {N}, where N is an integer counting upwards. The first N is calculated as 2 - pageNumberStart: if the page number start is set to 1, the page number at the very top receives the number 1; if it is set to 2, the page number at the very top receives the number 0, and so on. Page number indicators that would receive a number <= 0 receive no text instead, which makes them invisible.

Else:

Finds <sub class="pagenumber"> tags that denote a new page beginning and deletes their content.
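
The page number arithmetic for the checked case can be worked through in a few lines. The function name `page_labels` is hypothetical, and the sketch assumes the braces in {N} are written literally.

```python
def page_labels(marker_count, page_number_start):
    """Return the text for each page marker, in document order."""
    labels = []
    number = 2 - page_number_start  # number of the marker at the very top
    for _ in range(marker_count):
        # Markers with a number <= 0 get no text and stay invisible.
        labels.append("{" + str(number) + "}" if number > 0 else "")
        number += 1
    return labels


print(page_labels(3, 1))  # ['{1}', '{2}', '{3}']
print(page_labels(4, 3))  # ['', '', '{1}', '{2}']
```

With a start value of 3, the first two markers fall at -1 and 0 and stay empty, so visible numbering begins at the third marker.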

SciDocx2WebConversion.paragraph_numbering(paragraphNumberCheckVar, bodyxml)[source]

If “Number the paragraphs?” is checked:

Adds numbers to the beginning of each paragraph and each blockquote following the form [N] where N is an integer counting upwards. Skips paragraphs that have the “ignorePNum”, “mediacaption” or “bibliography” classes.
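
A minimal sketch of the numbering and the class-based skip rule, with `xml.etree.ElementTree` as a stand-in library. The function name is hypothetical, and the sketch assumes blockquotes do not contain nested <p> elements (which would otherwise be numbered twice).

```python
import xml.etree.ElementTree as ET

SKIPPED_CLASSES = {"ignorePNum", "mediacaption", "bibliography"}


def number_paragraphs(body):
    """Prefix paragraphs and blockquotes with [N]; return how many were numbered."""
    counter = 0
    for element in body.iter():
        if element.tag not in ("p", "blockquote"):
            continue
        if SKIPPED_CLASSES & set(element.get("class", "").split()):
            continue  # excluded from numbering
        counter += 1
        element.text = f"[{counter}] " + (element.text or "")
    return counter


body = ET.fromstring(
    '<body><p>One</p><p class="mediacaption">Figure</p>'
    "<blockquote>Two</blockquote></body>"
)
number_paragraphs(body)
print(ET.tostring(body, encoding="unicode"))
```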

SciDocx2WebConversion.remove_word_lnks(bodyxml)[source]

Removes various links that Word inserts into the document, including “Table Of Contents”, “_heading” and “_Hlk” links. They are interpreted as non-closed <a> tags, which can lead to display errors or interfere with the navigation.

SciDocx2WebConversion.style_map_func(custom_style_map, headings1, headings2, headings3, images, videos, audio, media, blockquotes, tableCaptions, bibliography, ignorePNum, paragraphNumberCheck, code, addStyleMap)[source]

Generates a style map based on the format template name input fields. Empty input fields are ignored. The style map detects format templates applied to text and encloses the matching text with an HTML element. For more information see: https://github.com/mwilliamson/python-mammoth#custom-style-map

Headings1 -> h1:fresh

Headings2 -> h2:fresh

Headings3 -> h3:fresh

Image -> img.insertimage:fresh

Video -> iframe.insertvideo:fresh

Audio -> audio.insertaudio:fresh

Media -> p.mediacaption:fresh

Blockquotes -> blockquote:fresh

Table Captions -> caption:fresh

Bibliography -> p.bibliography:fresh

Ignore Paragraph Numbering -> p.ignorePNum:fresh

Code -> code
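
As an illustration of how such a mapping can be turned into Mammoth's style map syntax, the sketch below builds lines of the form `p[style-name='…'] => target` for a subset of the fields. The function name, the dictionary keys and the restriction to paragraph styles are assumptions; in particular, whether the “Code” template is matched as a paragraph or a run style is not stated here, so it is left out.

```python
def build_style_map(template_names):
    """Build a Mammoth style map string from template names, skipping empty fields."""
    targets = {
        "headings1": "h1:fresh",
        "headings2": "h2:fresh",
        "headings3": "h3:fresh",
        "blockquotes": "blockquote:fresh",
        "bibliography": "p.bibliography:fresh",
    }
    lines = []
    for field, target in targets.items():
        name = template_names.get(field, "").strip()
        if name:  # empty input fields are ignored
            lines.append(f"p[style-name='{name}'] => {target}")
    return "\n".join(lines)


print(build_style_map({"headings1": "Heading 1", "headings2": "", "blockquotes": "Quote"}))
```

A string like this can be passed to Mammoth through the `style_map` argument of `mammoth.convert_to_html`.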

SciDocx2WebConversion.write_html(exportableBodyxml, outputPath)[source]

Writes the exportable HTML to the file at the output path.