2011-11-03

Mixing HTML and TeX in a StyledTextCtrl in wxPython

I have written my very own editor with syntax highlighting to write these sketches. This is far less impressive than it might appear at first glance: wxWidgets, and therefore also wxPython, provide access to the Scintilla library via the StyledTextCtrl element, and this element can display HTML source code and style the different elements in different colors. Scintilla provides “lexers” for different languages, parsers which are able to determine which fragments of a document are keywords, identifiers, string literals, numerical literals, operators, comments and so on, and tokens from different categories can be styled differently. So “writing my own editor” in fact just means “gluing together some standard pieces”. It looks like this:
In the preceding sketch, I used MathJax to transform TeX source snippets into nicely rendered formulas. It works like this: I include a TeX formula like \(\LaTeX\), and MathJax automagically turns it into \(\LaTeX\) (provided that Javascript is enabled, and your browser isn’t old as dirt). Now it would be really nifty if I could modify my StyledTextCtrl in such a way that any occurrence of such a TeX snippet would be lexed and styled as TeX, not as HTML. In particular, I want those snippets to stand out, to be visually distinct from the surrounding text.

Scintilla has, among lexers for many, many other languages, a lexer for TeX and LaTeX and ConTeXt. Unfortunately, a StyledTextCtrl can use only one Scintilla lexer at a time. You have to decide whether your text is HTML or TeX; you can’t have both at the same time.

HTML, of course, can contain all kinds of other stuff: it can contain PHP fragments, or Javascript code, or CSS instructions, or foreign XML parts containing SVG or MathML, or CDATA sections or perhaps some other stuff I have never heard of. Scintilla actually tries to take care of this. Most other languages Scintilla can lex and style are restricted to at most 32 different styles (for keywords, variables, operators, string literals, numeric literals and so on), but the lexer for HTML has room for up to 128 different styles, so there is a style for Javascript string literals in single quotes and Javascript string literals in double quotes and Python keywords and VB operators and PHP variables and so on and so on.

The right thing to do would be to look at the C++ source code of the HTML lexer of Scintilla, modify it to include TeX in the list of things the HTML lexer knows about and is able to style, and then recompile it and use this within wxPython. Unfortunately, this sounds like a lot of work, and while I’m not completely unwilling to touch C++ code if there is absolutely no other way, I’d prefer to work in a more human-friendly language.
Preferably, I’d simply add some Python code to my editor to make StyledTextCtrl behave the way I want it to. Theoretically, it is possible to implement a Scintilla lexer purely in Python. Unfortunately, this would mean that I would have to completely re-invent the HTML lexer, which sounds like even more work than simply fiddling with some pre-existing C++ code. It is perhaps also out of my league, since writing such a lexer isn’t trivial. And it would probably be slow as hell.

And all I want is to style most of my text as HTML, while styling a few snippets as TeX. So this is what I do: I write some code which splits my text into those parts which are HTML and those parts which are TeX. Then I let Scintilla style each part individually, and finally I glue all those parts back together. Scintilla does all the hard work, and I get what I want.

Unfortunately, wxPython doesn’t expose a static method to style a string: the StyledTextCtrl just offers a method to style its own content, but not some arbitrary foreign text. So I create two dummy StyledTextCtrl elements, one with HTML as its language and one with TeX as its language, and I feed a string I want to be styled to one of those two elements. Since that’s all I want from those elements, I keep them from being shown. This is the actual code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
u"""
Provides a StyledTextCtrl that understands TeX embedded in HTML.

This is useful for editing source code of HTML pages using MathJax_.
MathJax is a Javascript library which parses snippets of the form
``\\(\\LaTeX\\)`` or ``\\[\\LaTeX\\]`` and displays them either as MathML
or some combination of HTML and CSS.

The ``HtmlWithTexInput`` behaves like a ``StyledTextCtrl`` and will lex and
style source code as HTML. TeX snippets will be recognized as such and
styled differently.

If you want to provide your own styles for the TeX snippets, use the
constants ``wx.stc.STC_L_*`` + ``OFFSET_ROUND`` for the styles of the
snippets delimited with round brackets, and ``wx.stc.STC_L_*`` +
``OFFSET_SQUARE`` for square brackets. The style of the delimiters
themselves are stored in the style of ``wx.stc.STC_L_MATH`` + ``OFFSET_*``.

The styles for ASP VBScript code get overwritten and can’t be used in
conjunction with ``HtmlWithTexInput``.

Example::

    my_ctrl = html_with_tex_ctrl.HtmlWithTexInput(my_frame, -1)
    my_ctrl.StyleSetSpec(wx.stc.STC_L_DEFAULT + html_with_tex_ctrl.OFFSET_ROUND,
        "fore:#000099,back:#6666ff,face:Courier New,size:10,eolfilled")

This sets the foreground color of the default style of TeX fragments in
round brackets to dark blue, the background color to light blue.

If the module is called as a stand-alone application, a test frame is
produced.

.. _MathJax: http://www.mathjax.org
"""

__author__ = u"Jan Thor"
__docformat__ = u"restructuredtext de"

import wx, wx.stc
from html_styles import style_control, HTML_KEYWORDS

OFFSET_ROUND = 80 # ASP VBScript styles will be replaced with TeX styles
OFFSET_SQUARE = OFFSET_ROUND + 5
# A delimiter is two bytes long, so its styled form is four bytes (two cells).
MATH_ROUND = chr(wx.stc.STC_L_MATH + OFFSET_ROUND) * 4
MATH_SQUARE = chr(wx.stc.STC_L_MATH + OFFSET_SQUARE) * 4

STATE_DEFAULT = 0
STATE_TAG = 1
STATE_ROUND = 2
STATE_SQUARE = 3
STATE_DELIM = 100
STATE_TRANSITIONS = {
    STATE_DEFAULT: [("\\(", STATE_ROUND, True),
                    ("\\[", STATE_SQUARE, True),
                    ("<", STATE_TAG, False)],
    STATE_TAG: [(">", STATE_DEFAULT, False)],
    STATE_ROUND: [("\\)", STATE_DEFAULT, True)],
    STATE_SQUARE: [("\\]", STATE_DEFAULT, True)],
}


class HtmlWithTexInput(wx.stc.StyledTextCtrl):
    u"""StyledTextCtrl with lexing for HTML source with TeX snippets."""

    def __init__(self, parent, ID=-1, pos=wx.DefaultPosition,
                 size=wx.DefaultSize, style=0):
        u"""Same parameters as ``wx.stc.StyledTextCtrl``."""
        wx.stc.StyledTextCtrl.__init__(self, parent, ID, pos, size, style)
        self.SetLexer(wx.stc.STC_LEX_CONTAINER)
        self.SetStyleBits(7)
        self.Bind(wx.stc.EVT_STC_STYLENEEDED, self.OnStyling)
        style_control(self)
        # === Styles for TeX ===
        self.StyleSetSpec(wx.stc.STC_L_DEFAULT + OFFSET_ROUND,
            "fore:#333333,back:#ffef66,face:Courier New,size:10,eolfilled")
        self.StyleSetSpec(wx.stc.STC_L_COMMAND + OFFSET_ROUND,
            "fore:#000099,back:#ffef66,face:Courier New,size:10,bold")
        self.StyleSetSpec(wx.stc.STC_L_TAG + OFFSET_ROUND,
            "fore:#7f007f,back:#ffef66,face:Courier New,size:10")
        self.StyleSetSpec(wx.stc.STC_L_MATH + OFFSET_ROUND,
            "fore:#bb0000,back:#ffdf66,face:Courier New,size:10,bold")
        self.StyleSetSpec(wx.stc.STC_L_COMMENT + OFFSET_ROUND,
            "fore:#007f00,back:#efff66,face:Courier New,size:10,italic")
        self.StyleSetSpec(wx.stc.STC_L_DEFAULT + OFFSET_SQUARE,
            "fore:#333333,back:#fff8bb,face:Courier New,size:10,eolfilled")
        self.StyleSetSpec(wx.stc.STC_L_COMMAND + OFFSET_SQUARE,
            "fore:#000099,back:#fff8bb,face:Courier New,size:10,bold")
        self.StyleSetSpec(wx.stc.STC_L_TAG + OFFSET_SQUARE,
            "fore:#7f007f,back:#fff8bb,face:Courier New,size:10")
        self.StyleSetSpec(wx.stc.STC_L_MATH + OFFSET_SQUARE,
            "fore:#bb0000,back:#ffdfaa,face:Courier New,size:10,bold")
        self.StyleSetSpec(wx.stc.STC_L_COMMENT + OFFSET_SQUARE,
            "fore:#007f00,back:#f8ffaa,face:Courier New,size:10,italic")
        # === More stylin’ ===
        self.SetEdgeMode(wx.stc.STC_EDGE_LINE)
        self.SetEdgeColumn(213)
        self.SetMarginWidth(1, 0)
        self.SetWrapMode(wx.stc.STC_WRAP_WORD)
        # === Dummy controls as lexers ===
        self.dummyHtml = wx.stc.StyledTextCtrl(self, -1)
        self.dummyHtml.Show(False)
        self.dummyHtml.SetLexer(wx.stc.STC_LEX_HTML)
        self.dummyHtml.SetStyleBits(7)
        self.dummyHtml.SetKeyWords(0, HTML_KEYWORDS)
        self.dummyTex = wx.stc.StyledTextCtrl(self, -1)
        self.dummyTex.Show(False)
        self.dummyTex.SetLexer(wx.stc.STC_LEX_LATEX)

    def _parseHtml(self, fragment):
        self.dummyHtml.SetText(fragment.decode("utf8"))
        fl = len(fragment)
        self.dummyHtml.Colourise(0, fl)
        multiplexed = self.dummyHtml.GetStyledText(0, fl)
        return multiplexed

    def _parseTex(self, fragment, offset):
        self.dummyTex.SetText(fragment.decode("utf8").replace("\n", " "))
        fl = len(fragment)
        self.dummyTex.Colourise(0, fl)
        multiplexed = self.dummyTex.GetStyledText(0, fl)
        multiplexed = [s for s in multiplexed]
        # Shift every style byte (odd positions) into the TeX style range.
        for i in range(1, len(multiplexed), 2):
            multiplexed[i] = chr(ord(multiplexed[i]) + offset)
        return "".join(multiplexed)

    def OnStyling(self, evt):
        u"""Called when the control needs styling."""
        text = self.GetText().encode("utf8")
        # === split text into chunks ===
        splitpoints = [0]
        states = [STATE_DEFAULT]
        state = STATE_DEFAULT
        for i in range(0, len(text)):
            transitions = STATE_TRANSITIONS[state]
            for delim, newstate, bsplit in transitions:
                nd = len(delim)
                if i >= nd - 1 and text[i+1-nd:i+1] == delim:
                    if bsplit:
                        splitpoints.append(i-1)
                        splitpoints.append(i+1)
                        states.append(STATE_DELIM + state + newstate)
                        states.append(newstate)
                    state = newstate
        if splitpoints[-1] != len(text):
            splitpoints.append(len(text))
        parts = [text[splitpoints[i]:splitpoints[i+1]]
                 for i in range(len(splitpoints) - 1)]
        # === lex and style each part ===
        parsed = ""
        for i in range(len(parts)):
            type = states[i]
            fragment = parts[i]
            if type == STATE_DEFAULT:
                parsed += self._parseHtml(fragment)
            elif type == STATE_ROUND:
                parsed += self._parseTex(fragment, OFFSET_ROUND)
            elif type == STATE_SQUARE:
                parsed += self._parseTex(fragment, OFFSET_SQUARE)
            elif type == STATE_ROUND + STATE_DELIM:
                parsed += MATH_ROUND
            elif type == STATE_SQUARE + STATE_DELIM:
                parsed += MATH_SQUARE
        # === style the complete control ===
        self.StartStyling(0, 127)
        # Keep only the style bytes (every second byte of the cell string).
        parsed = "".join([parsed[i] for i in range(1, len(parsed), 2)])
        self.SetStyleBytes(len(parsed), parsed)


# ============================================================================
#
# The remainder of this module is for testing purposes and not really needed.
# You may cut it off in your own projects.
#
# ============================================================================

_TESTSTRING = ur"""\(\)<!DOCTYPE>
<p>This $i$s <error/> \(\LaTeX\) &!</p>
<!--Remark \(x\)-->
\[(\frac{a}{c})^2+(\frac{b}{c})^2=1 \error \no comment %Comment üble Ümläutë
\This \is \still \a \comment\]
<?What’s this?>
<?php echo "$this funny string"; 3++ ?>
<script type="text/javacript">
function f(x) {if("unclos" + '!' return 2*x;}
</script>
\[These\) are \(confusing\) delims\]\(right?\)"""


class _TestApp(wx.App):
    def OnInit(self):
        frame = wx.Frame(None,-1, "Test Ctrl")
        HtmlWithTexInput(frame, -1).SetText(_TESTSTRING)
        frame.Show()
        return True


if __name__ == "__main__":
    _TestApp(0).MainLoop()

You can download this code here: html_with_tex_ctrl.py. It uses an auxiliary module which simply contains some style rules for HTML that I adapted to my personal liking. That is rather boring stuff, so I put it in a separate module, which you can download here: html_styles.py. As you can see, I also added some test code. This test code produces a frame which looks like this:
It’s rather colorful, but that’s because I crammed a lot of test cases into a small space. Typical source code is usually less colorful. TeX snippets are rendered with a yellow background, snippets within a comment are suppressed, non-ASCII characters work, and the rest of the source is still styled like HTML source code, including all the other sections with foreign stuff. Rendering is reasonably fast if the source code isn’t overly long.

A few comments on the code. Well, first a few comments on how styling information is handled within Scintilla. A text is split into tokens, and each token can belong to one of several categories of a language, like being a keyword or an identifier or a string literal or a comment or something like this. Each category gets its own number, like the number STC_L_DEFAULT (which is simply the number 0). In a separate step, a visual style can be assigned to one of these style numbers; for example, we could specify that STC_L_DEFAULT should be styled with a white background and a black foreground color and a monospace font and neither bold nor italic.

The style information produced by the lexer is stored in a string that contains a tangled version of both the original text and the style data. As a first approximation: each character is stored in a “cell” consisting of two bytes. The first byte is the character itself, while the second byte contains the style information. For example, if the character belongs to a token which is to be styled as STC_L_DEFAULT, then this second byte will be STC_L_DEFAULT, that is, the byte 0x00. The character “a”, as a byte, is 0x61, so a cell with this character and the default style would be the string "\x61\x00". In reality, it’s a tad more complicated than that, since one character may need more than one byte. I’m working with UTF8, and my texts usually contain non-ASCII characters.
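A minimal sketch of this cell layout, in plain Python 3 and without wxPython. The helper names multiplex and demultiplex are made up for illustration; they are not part of Scintilla or wxPython, and the sketch assumes one style for the whole string:

```python
def multiplex(text, style):
    """Interleave every UTF-8 byte of `text` with one style byte."""
    cells = bytearray()
    for byte in text.encode("utf8"):
        cells.append(byte)   # first byte of the cell: the text byte
        cells.append(style)  # second byte of the cell: the style number
    return bytes(cells)


def demultiplex(cells):
    """Split a cell string back into (text bytes, style bytes)."""
    return cells[0::2], cells[1::2]


ascii_cell = multiplex("a", 0)   # style 0, the value of STC_L_DEFAULT
accent_cells = multiplex("é", 5) # "é" needs two UTF-8 bytes, so two cells
text_bytes, style_bytes = demultiplex(accent_cells)
```

Here multiplex("a", 0) yields the two bytes 0x61 0x00, and the two UTF-8 bytes of “é” end up in two cells that carry the same style byte.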
A character that needs more than one byte spans more than one cell, but the style byte is the same in all these cells. For example, the letter “ü”, represented as a UTF8 byte sequence, would be "\xC3\xBC". If this letter should be styled using style number 80, the style byte would be "\x50" (0x50 is the hexadecimal representation of the number 80). A byte sequence containing the corresponding two cells for the character “ü” with style 80 would be "\xC3\x50\xBC\x50". So, a lexer takes a string of UTF8 bytes and returns a string of twice its length, with the original text and the style information multiplexed.

For most languages, 32 distinct styles are more than enough. To encode one of 32 possible styles, 5 bits are sufficient, and if we use a full byte for the style information, we have 3 additional bits left for some auxiliary style information. For example, we could set one of those three additional bits to indicate that a word is misspelled, or that a line of source code is responsible for a runtime error, or whatever we want to do with those additional 3 bits. As mentioned above, HTML needs more styles. That’s why Scintilla uses 7 bits of style information for HTML, giving 128 distinct styles a token can have and leaving just one bit for other purposes.

Now I want two things. I want a function that accepts a normal string of text encoded as UTF8 and returns a string of cells with twice as many bytes, containing the styling information. And I want a function that is able to split a text into HTML chunks and TeX chunks. Then I can split a text into chunks, feed those chunks to my styling functions, and then feed the styled strings back into my StyledTextCtrl. Since, as mentioned, wxPython doesn’t expose a method to lex individual strings, I create two invisible dummy StyledTextCtrl elements. First I create a class that inherits from StyledTextCtrl:
class HtmlWithTexInput(wx.stc.StyledTextCtrl):
    u"""StyledTextCtrl with lexing for HTML source with TeX snippets."""

    def __init__(self, parent, ID=-1, pos=wx.DefaultPosition,
                 size=wx.DefaultSize, style=0):
        u"""Same parameters as ``wx.stc.StyledTextCtrl``."""
        wx.stc.StyledTextCtrl.__init__(self, parent, ID, pos, size, style)

I can’t use one of the predefined lexers; instead, I have to provide my own. I do this by setting the lexer of my element to STC_LEX_CONTAINER. Since I’m now using my own lexer, Scintilla won’t do the styling for me, and I have to handle the EVT_STC_STYLENEEDED event myself. And since I’m using more than 32 styles, I’ll have to tell Scintilla that I need 7 bits for my styling information.
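The arithmetic behind the 5 and 7 style bits can be sketched in a few lines of plain Python 3. The function names style_mask and split_style_byte are hypothetical helpers for illustration only, and the sketch assumes that the style number sits in the low bits of the style byte:

```python
def style_mask(bits):
    """Bit mask selecting the lowest `bits` bits of a style byte."""
    return (1 << bits) - 1


def split_style_byte(value, bits):
    """Split a style byte into (style number, remaining flag bits)."""
    return value & style_mask(bits), value >> bits


mask_5_bits = style_mask(5)  # room for 32 styles, 3 bits left over
mask_7_bits = style_mask(7)  # room for 128 styles, 1 bit left over

# Style 80 with the single remaining bit set, split again with 7 style bits:
style, flags = split_style_byte(0x50 | 0x80, 7)
```

With 7 bits the mask is 127, which is the value the full listing passes to StartStyling. In the control itself, the corresponding setup calls are the following.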
        self.SetLexer(wx.stc.STC_LEX_CONTAINER)
        self.SetStyleBits(7)
        self.Bind(wx.stc.EVT_STC_STYLENEEDED, self.OnStyling)

Now I have to implement my own lexer to style this element:
    def OnStyling(self, evt):
        u"""Called when the control needs styling."""
        text = self.GetText().encode("utf8")

I get the text of the element, and now I have to split this text into chunks containing either pure HTML or pure TeX. At this point, I can’t avoid implementing something like a primitive parser. My parser needs to know when to switch between “I’m currently reading HTML” states and “I’m currently reading TeX” states. It’s even more complicated than that: I want to differentiate between TeX in round brackets and TeX in square brackets, and color them differently. The switch between different states is triggered by the presence of delimiters. But when I’m in HTML mode and inside a tag, those delimiters should be ignored. Therefore, I need at least two different states for HTML: the default state, and being inside a tag. The complete model of transitions from one state to the next when encountering a delimiter looks like this:
STATE_TRANSITIONS = {
    STATE_DEFAULT: [("\\(", STATE_ROUND, True),
                    ("\\[", STATE_SQUARE, True),
                    ("<", STATE_TAG, False)],
    STATE_TAG: [(">", STATE_DEFAULT, False)],
    STATE_ROUND: [("\\)", STATE_DEFAULT, True)],
    STATE_SQUARE: [("\\]", STATE_DEFAULT, True)],
}

This means that if we are walking along a string, and we are in default HTML mode, and we encounter the substring "\(", we switch to the round bracket enclosed TeX state. The additional parameter True means that this is also a change of state that triggers a change of lexers. Some code to actually perform this parsing:
        # === split text into chunks ===
        splitpoints = [0]
        states = [STATE_DEFAULT]
        state = STATE_DEFAULT
        for i in range(0, len(text)):
            transitions = STATE_TRANSITIONS[state]
            for delim, newstate, bsplit in transitions:
                nd = len(delim)
                if i >= nd - 1 and text[i+1-nd:i+1] == delim:
                    if bsplit:
                        splitpoints.append(i-1)
                        splitpoints.append(i+1)
                        states.append(STATE_DELIM + state + newstate)
                        states.append(newstate)
                    state = newstate
        if splitpoints[-1] != len(text):
            splitpoints.append(len(text))
        parts = [text[splitpoints[i]:splitpoints[i+1]]
                 for i in range(len(splitpoints) - 1)]

The delimiters themselves get their own chunks with their own state flags: the sum of the state flags for the states before and after the delimiter, plus a special flag for delimiters. So, for example, "\(" gets its own chunk with the flag STATE_DEFAULT + STATE_ROUND + STATE_DELIM. Since STATE_DEFAULT is 0, this is iden
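The splitting step can also be tried on its own, outside the control. The following is a sketch under a few assumptions: it is written for Python 3, it works on a str instead of UTF-8 bytes, the function name split_chunks is made up for illustration, and it adds a break after the first matching delimiter, which the listing above does not have:

```python
STATE_DEFAULT, STATE_TAG, STATE_ROUND, STATE_SQUARE = 0, 1, 2, 3
STATE_DELIM = 100

STATE_TRANSITIONS = {
    STATE_DEFAULT: [("\\(", STATE_ROUND, True),
                    ("\\[", STATE_SQUARE, True),
                    ("<", STATE_TAG, False)],
    STATE_TAG: [(">", STATE_DEFAULT, False)],
    STATE_ROUND: [("\\)", STATE_DEFAULT, True)],
    STATE_SQUARE: [("\\]", STATE_DEFAULT, True)],
}


def split_chunks(text):
    """Return a list of (state flag, chunk) pairs covering `text`."""
    splitpoints, states, state = [0], [STATE_DEFAULT], STATE_DEFAULT
    for i in range(len(text)):
        for delim, newstate, bsplit in STATE_TRANSITIONS[state]:
            start = i + 1 - len(delim)
            if start >= 0 and text[start:i + 1] == delim:
                if bsplit:
                    # The delimiter becomes a chunk of its own.
                    splitpoints += [start, i + 1]
                    states += [STATE_DELIM + state + newstate, newstate]
                state = newstate
                break
    if splitpoints[-1] != len(text):
        splitpoints.append(len(text))
    return [(states[k], text[splitpoints[k]:splitpoints[k + 1]])
            for k in range(len(splitpoints) - 1)]


sample = "a \\(x\\) <b title='\\('>"
chunks = split_chunks(sample)
```

For this sample, the first "\(" opens a TeX chunk and gets the flag 102, while the "\(" inside the tag is ignored and stays part of the trailing HTML chunk.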