Addressing Editor Content

Sep. 30th, 2025 12:00 am
[syndicated profile] marijnhaverbeke_feed

Posted by Marijn Haverbeke

Every text editor system, whether it works with plain or rich text, needs some way to refer to specific positions in the document. The very first reason one runs into this, when implementing such a system, is for representing the cursor and selection. But in a mature editor, you'll be doing a lot more things that refer to document positions—representing changes, displaying user interface elements in the text, or tracking metadata about a piece of content.

Offset Positions

An obvious way to address such positions, especially in plain text, is just by string offset. For rich text documents it also tends to be quite straightforward to define a scheme that gives every document position its own offset. Such offsets increase monotonically, so even in data structures that aren't flat arrays, it is not hard to efficiently look them up.

Because text editors often store their content split by line in some bigger data structure, it is tempting to use {line, character} pairs for your positions. Though this is often a useful thing to present to a user, as an addressing system it is really quite awful. Manipulating such positions tends to get convoluted and awkward. Whereas finding nearby positions in a flat system is just a matter of addition and subtraction, with line-based addresses you have to always special-case line boundaries, and know the length of lines to even know where those boundaries are. This can be made to work, since several editors, including old versions of CodeMirror, do it, but moving to a flat system in CodeMirror 6 was a huge relief.
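To make the contrast concrete, here is a minimal Python sketch (not CodeMirror's code; all function names are made up for illustration). It converts between flat offsets and line/column pairs, and shows that "move one position forward" is a single addition in the flat scheme but needs line lengths and a boundary case in the line-based one.

```python
from bisect import bisect_right


def line_starts(text: str) -> list[int]:
    """Offsets at which each line begins (line 0 starts at offset 0)."""
    starts = [0]
    for i, ch in enumerate(text):
        if ch == "\n":
            starts.append(i + 1)
    return starts


def offset_to_line_col(starts: list[int], offset: int) -> tuple[int, int]:
    """Binary-search the line containing a flat offset."""
    line = bisect_right(starts, offset) - 1
    return line, offset - starts[line]


def line_col_to_offset(starts: list[int], line: int, col: int) -> int:
    return starts[line] + col


def next_offset(offset: int, doc_length: int) -> int:
    """Flat scheme: moving forward is plain addition, clamped to the document."""
    return min(offset + 1, doc_length)


def next_line_col(lines: list[str], line: int, col: int) -> tuple[int, int]:
    """Line-based scheme: the same move needs line lengths and a boundary case."""
    if col < len(lines[line]):
        return line, col + 1
    if line + 1 < len(lines):
        return line + 1, 0  # wrap to the start of the next line
    return line, col  # already at the end of the document
```

The line/column form is still what you would show in a status bar. The sketch only suggests computing it on demand from the flat offset.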

In many cases, just having a position isn't enough. A cursor at a line wrapping point or a jump between right-to-left and left-to-right text, for example, may correspond to multiple different positions on the screen. At least for cursors, you need both a position and a direction, which disambiguates whether the position is attached to the element before or after it.

But other than that, offset positions work well. They just have one big drawback: when the document changes, they need to change with it. So any data that refers to a document position needs to be updated along with the document every time that document is modified. This requires quite a lot of discipline to get right, and can get expensive when you're tracking a lot of positions.
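As a rough illustration of what "updating positions along with the document" involves, here is a minimal Python sketch of mapping one offset through one change. The `Change` class, `map_pos`, and the `assoc` parameter are names invented for this example, and the treatment of positions inside a replaced range is one possible choice, not a description of any particular editor's behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    start: int     # start of the replaced range, in old-document offsets
    end: int       # end of the replaced range, in old-document offsets
    inserted: int  # length of the text inserted in its place


def map_pos(pos: int, change: Change, assoc: int = 1) -> int:
    """Map an old-document offset to the corresponding new-document offset.

    assoc < 0 keeps the position before any inserted text it touches,
    assoc > 0 moves it after. Positions inside a replaced range collapse
    to one side of the replacement (an assumption of this sketch).
    """
    if pos < change.start:
        return pos  # entirely before the change: unaffected
    if pos > change.end:
        # entirely after the change: shift by the net length difference
        return pos + change.inserted - (change.end - change.start)
    # The position touches or lies inside the replaced range.
    return change.start if assoc < 0 else change.start + change.inserted


def map_all(positions: list[int], change: Change) -> list[int]:
    """Every stored position has to go through this on every change."""
    return [map_pos(p, change) for p in positions]
```

The discipline problem is visible in `map_all`: any position that is stored somewhere but not passed through the mapping silently goes stale.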

Unique IDs

So, though both of my editor projects use offset positions, I keep asking myself whether there is a way around the need to map those positions on every change. If we could have some way to represent document positions in a ‘stable’ way, where you can still use them when you come back to a document, even if that document has changed in a bunch of ways since you generated them, that would be so convenient.

To be able to do such a thing, you'd need to assign a stable ID to every single element in the document. There are ways to make this less nauseatingly expensive than it initially sounds. Stretches of text that are loaded or inserted together can be assigned a contiguous region of IDs, allowing the data structure, in many circumstances, to not store every single ID separately, but instead just assign a range to a stretch of text. If you have that, you can now describe the position before element X or after element Y in a stable way. To find it, you just need to look up the element.
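The range idea can be sketched in a few lines of Python. This is an illustrative toy, not an existing implementation: `Run`, `Doc`, and their methods are hypothetical names, the document is a flat list of runs, and lookup is a linear scan over them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    first_id: int  # ID of the first character in this run
    text: str      # characters carry IDs first_id, first_id + 1, ...


class Doc:
    def __init__(self) -> None:
        self.runs: list[Run] = []
        self.next_id = 0

    def insert(self, offset: int, text: str) -> None:
        """Insert text, giving it one contiguous block of fresh IDs."""
        if not text:
            return
        new_run = Run(self.next_id, text)
        self.next_id += len(text)
        pos = 0
        for i, run in enumerate(self.runs):
            if offset <= pos + len(run.text):
                cut = offset - pos
                left = Run(run.first_id, run.text[:cut])
                right = Run(run.first_id + cut, run.text[cut:])
                # Splitting a run keeps every surviving character's ID.
                self.runs[i:i + 1] = [r for r in (left, new_run, right) if r.text]
                return
            pos += len(run.text)
        self.runs.append(new_run)

    def delete(self, start: int, end: int) -> None:
        """Remove the characters at offsets [start, end)."""
        kept: list[Run] = []
        pos = 0
        for run in self.runs:
            a = max(start, pos) - pos
            b = min(end, pos + len(run.text)) - pos
            if a >= b:  # this run does not overlap the deleted range
                kept.append(run)
            else:
                if a > 0:
                    kept.append(Run(run.first_id, run.text[:a]))
                if b < len(run.text):
                    kept.append(Run(run.first_id + b, run.text[b:]))
            pos += len(run.text)
        self.runs = kept

    def offset_of(self, element_id: int) -> Optional[int]:
        """Current offset of an element, or None if it is no longer present."""
        pos = 0
        for run in self.runs:
            if run.first_id <= element_id < run.first_id + len(run.text):
                return pos + (element_id - run.first_id)
            pos += len(run.text)
        return None
```

After typing "hello world" and then inserting a comma at offset 5, this document holds three runs rather than twelve individual IDs, and `offset_of` still finds the `w` by its original ID. Once the element is deleted, `offset_of` has nothing to return.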

Except, of course, if that element has been deleted. When your ID no longer exists in the document, your position has lost its meaning.

One way to handle that is to keep ‘tombstones’ for deleted elements, either directly in your document data structure, or in a separate data structure that maps the IDs of deleted elements to IDs of adjacent elements that are still in the document. This does have the downside that, for some types of editing patterns, your tombstone data can become bigger than your actual document data. It is possible to define schemes where you periodically garbage collect these, but then you reintroduce the issue that you can be invalidating position pointers that may still exist in some other data structure, and you are back to needing to carefully update such pointers.
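The "separate data structure" variant might look like the following Python sketch. It is a deliberately naive model with invented names (`TombstoneDoc`, `resolve`): the document is a plain list of element IDs, and each deleted ID is mapped to the live element that followed it at deletion time.

```python
from typing import Optional


class TombstoneDoc:
    def __init__(self, ids: list[int]) -> None:
        self.ids = list(ids)  # live element IDs, in document order
        # deleted ID -> ID of the element that followed it (None = document end)
        self.tombstones: dict[int, Optional[int]] = {}

    def delete(self, index: int) -> None:
        dead = self.ids.pop(index)
        successor = self.ids[index] if index < len(self.ids) else None
        self.tombstones[dead] = successor

    def resolve(self, element_id: Optional[int]) -> int:
        """Offset of the position before element_id, following tombstones.

        Raises ValueError for an ID this document has never seen.
        """
        # A successor may itself have been deleted later, so follow the chain.
        while element_id in self.tombstones:
            element_id = self.tombstones[element_id]
        if element_id is None:
            return len(self.ids)
        return self.ids.index(element_id)  # linear scan, fine for a sketch
```

Deleting three of four elements leaves one live ID and three tombstone entries, which is the size problem described above in miniature.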

Another issue of such IDs is that going from an ID to an actual position in the document generally needs to be fast. This is not something you get for free. Doing a full document scan every time you need to find a position tends to be too slow.

There are some tricks that you can do with mutable doubly-linked trees or lists, where you keep a map from IDs to objects in those data structures, and then traverse from that object via parent or sibling pointers to figure out where it is. But I am very partial to persistent data structures, where such tricks don't work.

It's probably possible to do something with Bloom filters in a tree structure, or similar rather heavyweight tricks. But in the end, if you're just moving the work that an offset system would do when mapping positions over changes to lookup time, that may not be much of an improvement.

Ordered IDs

One way to avoid the tombstone and lookup issues with regular IDs is to define your ID assignment scheme in such a way that there is an ordering of the IDs that corresponds to their order in the document. If deleted IDs can still be compared to IDs still in the document, that gives you a way to locate their position even though they aren't there anymore. Similarly, if you can compare IDs you can run a binary or tree search through your document to quickly locate a position.

The obvious downside of this approach is that it is tricky to define your IDs in such a way that you can keep making up new IDs that ‘fit’ between any two existing IDs, and this forces you to use a scheme where IDs can grow in size when there's no room left in the sequence space on their current level.
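One simple way to get such IDs, loosely in the spirit of the sequence-CRDT schemes, is to use variable-length tuples of small integers compared lexicographically. The Python sketch below is an assumption-laden toy: `BASE`, `id_between`, and `locate` are made-up names, there is no per-site disambiguation, and `id_between` assumes `lo < hi`.

```python
from bisect import bisect_left
from typing import Optional

BASE = 16  # digits per level; kept small so that growth is easy to observe

Id = tuple[int, ...]


def id_between(lo: Id, hi: Optional[Id]) -> Id:
    """Return an ID strictly between lo and hi in lexicographic order.

    lo = () stands for the start of the document and hi = None for the end.
    When two IDs are adjacent on one level, the result gets an extra digit.
    """
    digits: list[int] = []
    i = 0
    while True:
        a = lo[i] if i < len(lo) else 0
        b = hi[i] if hi is not None and i < len(hi) else BASE
        if b - a > 1:
            digits.append((a + b) // 2)  # room on this level: take the midpoint
            return tuple(digits)
        digits.append(a)  # no room: copy lo's digit and descend a level
        if b > a:
            hi = None  # already below hi, so deeper levels are unconstrained
        i += 1


def locate(sorted_ids: list[Id], target: Id) -> int:
    """Index where target sits, or would sit if it has since been deleted."""
    return bisect_left(sorted_ids, target)
```

With `BASE = 16`, repeatedly inserting at the front of the document yields `(8,)`, `(4,)`, `(2,)`, `(1,)` and then `(0, 8)`: the fifth ID needs a second level. Real schemes use a much larger base and smarter allocation strategies, but the same kind of growth still occurs.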

It also, and this may be a worse issue, makes the position of deleted IDs weirdly dependent on what is inserted in their old place after the deletion. Unless some kind of tombstone data is kept, changes will happily fill in the ID space left empty by a deletion with elements that, more or less randomly, may be above or below (or even equal to) an old but still referenced ID, making its position point at a meaningless position within those inserted elements.

(Readers familiar with sequence CRDTs may notice a lot of similarity between what I'm describing and how such systems work. That's because I stole a lot of these ideas from CRDT literature.)

Conclusion

This problem space appears to be a tricky one where every solution has significant drawbacks. I'm going to keep muddling along with offset positions in my own systems. Though mapping all your document positions is a chore, this approach is relatively easy to understand and reason about, and doesn't require a lot of complicated data structures to maintain and use.

Investigating a forged PDF

Sep. 24th, 2025 12:24 pm
[personal profile] mjg59
I had to rent a house for a couple of months recently, which is long enough in California that it pushes you into proper tenant protection law. As landlords tend to do, they failed to return my security deposit within the 21 days required by law, having already failed to provide the required notification that I was entitled to an inspection before moving out. Cue some tedious argumentation with the letting agency, and eventually me threatening to take them to small claims court.

This post is not about that.

Now, under Californian law, the onus is on the landlord to hold and return the security deposit - the agency has no role in this. The only reason I was talking to them is that my lease didn't mention the name or address of the landlord (another legal violation, but the outcome is just that you get to serve the landlord via the agency). So it was a bit surprising when I received an email from the owner of the agency informing me that they did not hold the deposit and so were not liable - I already knew this.

The odd bit about this, though, is that they sent me another copy of the contract, asserting that it made it clear that the landlord held the deposit. I read it, and instead found a clause reading "SECURITY: The security deposit will secure the performance of Tenant’s obligations. IER may, but will not be obligated to, apply all portions of said deposit on account of Tenant’s obligations. Any balance remaining upon termination will be returned to Tenant. Tenant will not have the right to apply the security deposit in payment of the last month’s rent. Security deposit held at IER Trust Account.", where IER is International Executive Rentals, the agency in question. Why send me a contract that says you hold the money while you're telling me you don't? And then I read further down and found this:
Text reading ENTIRE AGREEMENT: The foregoing constitutes the entire agreement between the parties and may be modified only in writing signed by all parties. This agreement and any modifications, including any photocopy or facsimile, may be signed in one or more counterparts, each of which will be deemed an original and all of which taken together will constitute one and the same instrument. The following exhibits, if checked, have been made a part of this Agreement before the parties’ execution: ۞ Exhibit 1: Lead-Based Paint Disclosure (Required by Law for Rental Property Built Prior to 1978) ۞ Addendum 1 The security deposit will be held by (name removed) and applied, refunded, or forfeited in accordance with the terms of this lease agreement.
Ok, fair enough, there's an addendum that says the landlord has it (I've removed the landlord's name, it's present in the original).

Except. I had no recollection of that addendum. I went back to the copy of the contract I had and discovered:
The same text as the previous picture, but addendum 1 is empty
Huh! But obviously I could just have edited that to remove it (there's no obvious reason for me to, but whatever), and then it'd be my word against theirs. However, I'd been sent the document via RightSignature, an online document signing platform, and they'd added a certification page that looked like this:
A Signature Certificate, containing a bunch of data about the document including a checksum of the original
Interestingly, the certificate page was identical in both documents, including the checksums, despite the content being different. So, how do I show which one is legitimate? You'd think given this certificate page this would be trivial, but RightSignature provides no documented mechanism whatsoever for anyone to verify any of the fields in the certificate, which is annoying but let's see what we can do anyway.

First up, let's look at the PDF metadata. pdftk has a dump_data command that dumps the metadata in the document, including the creation date and the modification date. My file had both set to identical timestamps in June, both listed in UTC, corresponding to the time I'd signed the document. The file containing the addendum? The same creation time, but a modification time of this Monday, shortly before it was sent to me. This time, the modification timestamp was in Pacific Daylight Time, the timezone currently observed in California. In addition, the data included two ID fields, ID0 and ID1. In my document both were identical, in the one with the addendum ID0 matched mine but ID1 was different.
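PDF date strings carry their own UTC offset, which is why the timezone difference is visible at all. The Python sketch below parses the `D:YYYYMMDDHHmmSS` form with an optional `Z` or `+HH'mm'` suffix into an aware `datetime`. The function name is made up, and treating a missing zone as UTC is a simplifying assumption of this example.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYY, then optional MM DD HH mm SS, then optional "Z" or "+HH'mm'" / "-HH'mm'"
PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?"
)


def parse_pdf_date(value: str) -> datetime:
    """Parse a PDF date string into a timezone-aware datetime."""
    m = PDF_DATE.fullmatch(value.strip())
    if m is None:
        raise ValueError(f"not a PDF date string: {value!r}")
    year = int(m.group(1))
    month = int(m.group(2) or 1)
    day = int(m.group(3) or 1)
    hour, minute, second = (int(m.group(i) or 0) for i in (4, 5, 6))
    if m.group(8):
        sign = 1 if m.group(8) == "+" else -1
        offset = sign * timedelta(hours=int(m.group(9)),
                                  minutes=int(m.group(10) or 0))
    else:
        # "Z" or no zone at all: treated as UTC here (assumption of this sketch).
        offset = timedelta(0)
    return datetime(year, month, day, hour, minute, second,
                    tzinfo=timezone(offset))
```

Given a made-up pair of values such as `D:20250601120000Z` and `D:20250901093000-07'00'`, comparing the `utcoffset()` of the two results shows directly that they were written with different timezone settings.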

These ID tags are intended to be some form of representation (such as a hash) of the document. ID0 is set when the document is created and should not be modified afterwards - ID1 is initially identical to ID0, but changes when the document is modified. This is intended to allow tooling to identify whether two documents are modified versions of the same document. The identical ID0 indicated that the document with the addendum was originally identical to mine, and the different ID1 that it had been modified.
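The same comparison can be sketched without pdftk by reading the `/ID` array straight out of the file. The Python below is a rough illustration with invented function names. It assumes the IDs are written as hex strings in an unencrypted trailer or cross-reference stream dictionary, and that the last `/ID` in the file is the current one. A real PDF parser would be more robust.

```python
import re
from typing import Optional

# Matches e.g.  /ID [<0123abcd...> <4567ef01...>]
ID_PATTERN = re.compile(
    rb"/ID\s*\[\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*\]"
)


def pdf_ids(data: bytes) -> Optional[tuple[str, str]]:
    """Return (original ID, current ID) as lowercase hex, or None if absent."""
    matches = ID_PATTERN.findall(data)
    if not matches:
        return None
    # Incrementally updated files can contain several trailers; use the last.
    first, second = matches[-1]
    return first.decode("ascii").lower(), second.decode("ascii").lower()


def relationship(a: tuple[str, str], b: tuple[str, str]) -> str:
    """Classify two documents by their ID pairs."""
    if a[0] != b[0]:
        return "different originals"
    if a[1] == b[1]:
        return "same original, same revision"
    return "same original, different revision"
```

Two files like the ones described here, with matching first IDs and differing second IDs, would come out as "same original, different revision".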

Well, ok, that seems like a pretty strong demonstration. I had the "I have a very particular set of skills" conversation with the agency and pointed these facts out, that they were an extremely strong indication that my copy was authentic and their one wasn't, and they responded that the document was "re-sealed" every time it was downloaded from RightSignature and that would explain the modifications. This doesn't seem plausible, but it's an argument. Let's go further.

My next move was pdfalyzer, which allows you to pull a PDF apart into its component pieces. This revealed that the documents were identical, other than page 3, the one with the addendum. This page included tags entitled "touchUp_TextEdit", evidence that the page had been modified using Acrobat. But in itself, that doesn't prove anything - obviously it had been edited at some point to insert the landlord's name, it doesn't prove whether it happened before or after the signing.

But in the process of editing, Acrobat appeared to have renamed all the font references on that page into a different format. Every other page had a consistent naming scheme for the fonts, and they matched the scheme in the page 3 I had. Again, that doesn't tell us whether the renaming happened before or after the signing. Or does it?

You see, when I completed my signing, RightSignature inserted my name into the document, and did so using a font that wasn't otherwise present in the document (Courier, in this case). That font was named identically throughout the document, except on page 3, where it was named in the same manner as every other font that Acrobat had renamed. Given the font wasn't present in the document until after I'd signed it, this is proof that the page was edited after signing.

But eh this is all very convoluted. Surely there's an easier way? Thankfully yes, although I hate it. RightSignature had sent me a link to view my signed copy of the document. When I went there it presented it to me as the original PDF with my signature overlaid on top. Hitting F12 gave me the network tab, and I could see a reference to a base.pdf. Downloading that gave me the original PDF, pre-signature. Running sha256sum on it gave me an identical hash to the "Original checksum" field. Needless to say, it did not contain the addendum.
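For anyone without `sha256sum` to hand, the same check is a few lines of Python using the standard `hashlib` module. The function names are made up, and the expected value would be whatever hex string the certificate page shows.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with a checksum copied from elsewhere."""
    # Normalize whitespace and case, since copied checksums often vary in both.
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A single changed byte anywhere in the file produces a completely different digest, which is why a matching value is strong evidence that the file is the one the checksum was computed from.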

Why do this? The only explanation I can come up with (and I am obviously guessing here, I may be incorrect!) is that International Executive Rentals realised that they'd sent me a contract which could mean that they were liable for the return of my deposit, even though they'd already given it to my landlord, and after realising this added the addendum, sent it to me, and assumed that I just wouldn't notice (or that, if I did, I wouldn't be able to prove anything). In the process they went from an extremely unlikely possibility of having civil liability for a few thousand dollars (even if they were holding the deposit it's still the landlord's legal duty to return it, as far as I can tell) to doing something that looks extremely like forgery.

There's a hilarious followup. After this happened, the agency offered to do a screenshare with me showing them logging into RightSignature and showing the signed file with the addendum, and then proceeded to do so. One minor problem - the "Send for signature" button was still there, just below a field saying "Uploaded: 09/22/25". I asked them to search for my name, and it popped up two hits - one marked draft, one marked completed. The one marked completed? Didn't contain the addendum.
