Reading and referencing in Emacs Org-Mode.

TODO: I have an interesting thought that I need to write down before I forget it. On a computer you may have “bookmarks” and “recentf”, and the ratio between them is the ratio between exploitation and exploration. There should be a healthy ratio; I am not sure which one, but perhaps about 2 (0.5). If recentf is too big, you are exploring too much and not developing. If recentf is too small, you are not exploring enough.

This document starts as Yet Another HOWTO on keeping your references in Emacs and Org-Mode, but I have a feeling that it might grow into something bigger.

Referencing is a big pain for a scientist. It is painful for two reasons.

Firstly, it is a complex task in itself: when preparing an article, a scientist needs not only to consume a lot of relevant material, but also to filter through a lot of material that is less relevant to the current work but might turn out to be useful later.

Secondly, people who want to profit from scientists’ work while contributing very little to the ecosystem are trying to use various political, economic, and informational means of coercion to restrict scientists’ access to knowledge.

What is “knowledge”? I initially wanted to ask this question as “what is research?” or even “what is a research article?”, but those three seemingly different questions turned out to have the same answer.

To imagine “knowledge”, consider something as popular as a “neuron”. Not the real neuron, but the neuron as it is presented in “Machine Learning” courses. It is a “node” with one output and many inputs. If you think about it, it looks very much like a scientific statement, which is a sub-statement of a larger “thought” and is itself partitioned into many sub-statements. “Nodes” also have so-called “weak links”, that is, references to other nodes that cannot be described by a “part of” relationship, but are rather “associated”.
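
The node picture above can be written down as a small data structure. This is only a sketch in Python, not something the rest of this document depends on; the class name `Statement` and the field names `parent`, `parts`, and `associated` are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# eq=False and repr=False: parent and parts point at each other, so the
# generated comparison and printing methods would recurse forever.
@dataclass(eq=False, repr=False)
class Statement:
    """A node of knowledge: one parent ("output"), many parts ("inputs")."""
    text: str
    parent: Optional["Statement"] = None                         # the larger "thought"
    parts: List["Statement"] = field(default_factory=list)       # "part of" links
    associated: List["Statement"] = field(default_factory=list)  # "weak links"

    def add_part(self, text: str) -> "Statement":
        """Create a sub-statement and attach it under this node."""
        child = Statement(text, parent=self)
        self.parts.append(child)
        return child

    def associate(self, other: "Statement") -> None:
        """Record a weak link; the part-of tree is left untouched."""
        self.associated.append(other)


thought = Statement("Referencing is painful")
complexity = thought.add_part("It is a complex task in itself")
access = thought.add_part("Access to knowledge is restricted")
complexity.associate(access)
```

The “part of” relationship forms a tree, while the weak links are kept in a separate list so that they never affect the tree structure.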

1. TODO Body

1.3. TODO Concepts

1.3.1. Hypertext

1.3.2. Garbage collection and consistency

1.3.3. Capturing

1.3.4. Tracking reading

1.3.5. Keywords

1.3.6. Search

1.3.8. Annotation

1.3.9. Inter-document referencing (citations)

1.3.10. Intra-document referencing (refs)

1.3.11. Indexing

1.3.12. Mathematics

1.3.13. Graphics

1.3.14. Animation

1.3.15. Table of contents

1.3.16. Spell checking

1.3.17. Text Highlighting

1.3.18. Sticky Notes

1.4. Use-Cases and Use-Scenarios

1.4.1. Mental Discipline

“Doctor, when I do it like this, it hurts!” “Then don’t do it. Next!”

1.4.2. A piece of knowledge (document), and its branches

After reviewing 17 out of 50 documents, I started to get some thoughts about what a research system should do.

I am tempted to say that a basic unit of the system is an “article”. An article can have several forms:

PDF
HTML
TeX
Semantic markup, such as org or markdown, which is the most useful option.

The interesting thing is that these kinds of content can be transformed into one another. Semantic markup can be freely compiled into PDF or HTML, and PDF can be converted into semantic markup using OCR (mathpix.com does amazing things).

We need a system that can track at least all of these kinds of content.

But that is not enough.

In addition to “different presentations”, a piece of knowledge has metadata. Presumably, we can use biblatex’s fields to “describe” a document exhaustively.

Among the metadata, at least “annotation” is a very useful field, which needs to be written for each processed paper. You might call it “lightly embedded” into the brain, because the depth needed to write a decent annotation is often still not enough to understand the whole paper. You can speculate that an “annotation” is what “Mathematical Reviews” and “zbMATH” produce. I guess, if a written annotation is not to be openly published, you can just type a URL into the annotation field.

Most papers are incomprehensible, and what to do about it

One important thing to note about modern science is that most papers are written either to pass irrelevant review tests, or for the writers’ own pleasure with no quality control. Therefore we can safely assume that most papers are garbage.

This characterisation is not meant to denigrate the work that has been invested in them, as the authors are playing by the extant rules. But it means that almost no paper is ready to be consumed like a good software library, with a well-defined interface and layered design.

Reading papers is essentially like reverse-engineering binaries. Those were written for the machine, not for you. And therefore we need to use tools that are frequently seen in binary analysis and bytecode debugging.
1. Instrumentation
  
  Admittedly, papers are slightly better than binaries: they are, after all, written in a human language, so the decompilation part can be skipped. But we need the thing that in dynamic languages is called “instrumentation”.
  
  In fact, there is nothing new in instrumentation applied to texts. It is called “interlineation”, and consists of inserting text in between the lines of the text that is being studied. When most studies were in the humanities, especially theology-related ones, this seemed perfectly natural, but for some reason people nowadays seem to have largely forgotten this approach.
2. How to do interlineation if source is available?
  That is already a big question. Even if we have the LaTeX source, this is not a trivial task. While having the LaTeX source lets us edit the document at will, we cannot:
  1. Throw away the old document (as there may be links to it).
  2. Blindly write text in between the sentences of the original document, as it might break indexing and page navigation, and if not made somehow visually different from the old text, might confuse the reader.
  As a quick-and-dirty approach, I have just defined a LaTeX environment that displays its contents in grey.
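```
% Preamble. An assumed definition: the original one is not shown here.
\usepackage{xcolor}
\newenvironment{mycomment}{\par\color{gray}}{\par}
% Document body.
\begin{mycomment}
Interlineation.
\end{mycomment}
```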
  This is not a very good approach though, as it does not work cleanly with LaTeX’s paragraphs.
  
  A better approach is to enumerate all thoughts in a document, giving each thought a separate numbered clause, and writing the explanation for it in the clause body below the clause text. (Yes, I know, it is a little messy to distinguish a “clause body” from a “clause text”, but I have no better wording.) See some thoughts on this subject here: https://gitlab.com/Lockywolf/study_notes/-/tree/master/2023-07-02_numbered-well-structured-LaTeX/2023-04-11_improvised-method
  
  In this write-up I do not want to spend a lot of effort on describing how to transform a bad paper first into an interlineated paper, and later into a good paper. For this I have a separate article, which is not yet finished: How to write papers in LaTeX.
3. How to do interlineation if source is not available?
  That is an even bigger issue, is it not?
  
  I see the following pairwise incomparable options:
  1. Reverse-engineer your paper with mathpix or other OCR.
  2. Use org-noter to attach annotations to certain pieces of the document.
    1. and leave it as-is
    2. and burn in the notes as PDF sticky notes or highlighted text
    3. and burn in the notes as actual interlinear text in the pages, increasing page sizes to be greater than A4
  3. Reverse engineer the paper into a set of image tiles, possibly on the basis of intensity analysis of the lines of text, and typeset your annotations between the tiles, thus keeping the A4 size, but potentially losing some of the page navigation.
  So far, solutions implementing options 2.2, 2.3, and 3 are unknown to me. Options 1 and 2.1 are incomparable, because they require an incomparable amount of work. Option 1 is far more flexible, but option 2 allows you to start annotating right away.
4. Annotating HTML
  
  This is an interesting use-case. I have not seen papers written originally in HTML, with the exception of SRFI documents of the Scheme Community Process. HTML opens up a lot of opportunities for annotation that are better than those of TeX, such as text expandable on click (which is much better than text-on-hover, or text-on-sticky-notes). Still, there probably will be a need for at least three versions of the paper: original, annotated, and improved.
5. Why instrumentation is not a good answer
  
  For the same reason Richard Stallman started the Free Software movement.
  
  Wasting time on reverse-engineering computer games and device drivers, even though it is also stupid, at least has some motivation behind it: after all, computers run binaries.
  
  There is no reason why articles, especially those published as TeX on Arxiv, or those published at the author’s expense under the Open Access model, should be set in stone once a “release” is done.
  
  Articles should follow the software development model, with pull-requests, patch review, automatic testing for consistency, and a set of guidelines on versioning and on what counts as an API/ABI breakage.
  
  Moreover, retracting a paper should not merely be a stamp of disapproval, but a peer-reviewed patch, which highlights exactly the place where there is a flaw, with the typology of the flaw indicated, so that automatic search for similarly-flawed articles can be conducted.
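
Going back to option 3 for papers without source (cutting a page into image tiles on the basis of intensity analysis), the idea can be sketched as follows. This is a minimal sketch in Python and not an existing tool. It assumes the page is already available as rows of greyscale values; the names `text_bands` and `cut_tiles` and both thresholds are hypothetical, and a real scan would also need deskewing and noise handling, which are not shown.

```python
from typing import List, Tuple

Page = List[List[int]]  # greyscale rows, 0 = black ink, 255 = white paper


def text_bands(page: Page, ink_threshold: int = 128,
               min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (first_row, last_row) for every run of rows that contain ink."""
    bands = []
    start = None
    for y, row in enumerate(page):
        ink = sum(1 for pixel in row if pixel < ink_threshold)
        if ink >= min_ink and start is None:
            start = y                      # a line of text begins here
        elif ink < min_ink and start is not None:
            bands.append((start, y - 1))   # the line ended on the previous row
            start = None
    if start is not None:                  # the page ends inside a line of text
        bands.append((start, len(page) - 1))
    return bands


def cut_tiles(page: Page) -> List[Page]:
    """Cut the page into one tile per band; annotations go between the tiles."""
    return [page[top:bottom + 1] for top, bottom in text_bands(page)]


# A toy 6-row "page": two lines of text separated by white rows.
W, B = 255, 0
toy_page = [
    [W, W, W, W],
    [W, B, B, W],
    [W, B, W, W],
    [W, W, W, W],
    [B, B, B, B],
    [W, W, W, W],
]
bands = text_bands(toy_page)   # [(1, 2), (4, 4)]
tiles = cut_tiles(toy_page)
```

Typesetting the annotations between the tiles and re-paginating the result is the harder part, and it is deliberately left out of the sketch.
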
Summing up this section.
When making a database of “pieces of knowledge”, we need an entry to have at least the following fields or field groups:
1. Bibliographic metadata
2. Original PDF/HTML
3. Original TeX (empty unless Arxiv)
4. Reverse-engineered TeX
5. Annotated TeX/HTML
6. org-noter notes
7. Annotated PDF (with burned-in notes)
8. Improved PDF/HTML
9. Semantic Version (org, or sTeX)
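
The field list above maps directly onto a record type. What follows is only a sketch in Python: the class name `KnowledgeEntry`, all field names, and the example file names are invented here, and storing each presentation as a file path is an assumption rather than a decision made in this document.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Optional


@dataclass
class KnowledgeEntry:
    """One "piece of knowledge"; every presentation is optional."""
    bibliography: Dict[str, str]                    # 1. bibliographic metadata
    original: Optional[Path] = None                 # 2. original PDF/HTML
    original_tex: Optional[Path] = None             # 3. original TeX
    reverse_engineered_tex: Optional[Path] = None   # 4. reverse-engineered TeX
    annotated_source: Optional[Path] = None         # 5. annotated TeX/HTML
    noter_notes: Optional[Path] = None              # 6. org-noter notes
    annotated_pdf: Optional[Path] = None            # 7. PDF with burned-in notes
    improved: Optional[Path] = None                 # 8. improved PDF/HTML
    semantic: Optional[Path] = None                 # 9. semantic version

    def missing(self) -> List[str]:
        """Names of the presentation fields that are not filled in yet."""
        return [name for name, value in vars(self).items()
                if name != "bibliography" and value is None]


# A made-up entry: only the PDF and the org-noter notes exist so far.
entry = KnowledgeEntry(
    bibliography={"title": "A hypothetical paper", "year": "2023"},
    original=Path("paper.pdf"),
    noter_notes=Path("paper-notes.org"),
)
todo = entry.missing()
```

Keeping every presentation optional reflects the fact that most entries will start with nothing but a PDF and acquire the other forms over time.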

1.4.3. Linking pieces of knowledge

If we have a database of “articles”, a database of pieces of knowledge, we, quite naturally, might want to interlink them.

This would be mimicking the Web, or the human (or, rather, artificial) brain, or some other semantic network.

Dependent articles

Sometimes articles are released as “version 2.0”, and books quite often get a “Second Edition”. Another example of a dependent article is a solution book for a problem book, or a conference presentation for a paper.

Bibliographic references

That is what bibtex was originally for. If you have tex sources for many articles, with bib files included, you can draw a network of citations. I am not sure how exactly you would do that for articles for which bib files are not available, which is the case for most articles other than Arxiv ones, so the usefulness of this feature is dubious.

What I do want, however, is to be able to cite articles from the database using a hotkey, similar to reftex, and assemble a bib file for later upload to Arxiv.
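
The second half of that wish, assembling a bib file from the database, could look roughly like this. It is a sketch in Python under several assumptions: the database is a plain mapping from citation key to a ready-made BibTeX entry, only `\cite`-style commands (`\cite`, `\citep`, `\citet`, with optional arguments) are recognised, and the function names and example keys are invented for illustration.

```python
import re
from typing import Dict, List

# Matches \cite, \citep, \citet, ... with optional [..] arguments, and
# captures the comma-separated key list inside the braces.
CITE_RE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")


def cited_keys(tex: str) -> List[str]:
    """Citation keys in order of first appearance, without duplicates."""
    keys: List[str] = []
    for group in CITE_RE.findall(tex):
        for key in group.split(","):
            key = key.strip()
            if key and key not in keys:
                keys.append(key)
    return keys


def assemble_bib(tex: str, database: Dict[str, str]) -> str:
    """Concatenate the database entries for every key cited in `tex`."""
    keys = cited_keys(tex)
    missing = [key for key in keys if key not in database]
    if missing:
        raise KeyError("not in the database: " + ", ".join(missing))
    return "\n\n".join(database[key] for key in keys) + "\n"


# Made-up entries, standing in for the real database.
database = {
    "doe2020": "@article{doe2020,\n  title = {A Hypothetical Paper}\n}",
    "roe2021": "@book{roe2021,\n  title = {A Hypothetical Book}\n}",
    "unused": "@misc{unused,\n  title = {Never Cited}\n}",
}
tex = r"See \cite{doe2020, roe2021} and \citep[p.~3]{doe2020}."
bib = assemble_bib(tex, database)
```

Entries that are never cited stay out of the generated file, which keeps the upload small. The hotkey-driven insertion of the citations themselves is a separate problem and is not covered here.
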
Reading lists

Reading lists quite naturally go hand in hand with the concept of a “project”. What is a “project”? It is hard to define a project precisely, but for theorists and for humanities scholars, a “project” is most likely to include a set of books or articles to read, and a set of claims to prove or discursively argue for or against. (For experimental disciplines things are more involved.)

From the paragraph above, it is already quite visible that Org-mode maps quite naturally onto the concept of a project.

When you have a project, say, you want to prove a certain theorem in Engineering Communications Theory (an imaginary field), you might want to grind through a set of articles studying this field, which are usually on Arxiv, so you can annotate them in-place, and more importantly, place indexing markers in some interesting places.

Very often you will not be able to understand some theorems from a paper without background reading, so very soon you will, quite naturally, arrive at a graph of concepts. (I am not sure whether it can be called a “knowledge graph”, as I have seen that term used to describe a specific thing.) A theorem from a paper would require some (linked) reading to be understood. That “linked reading” would be in some other paper or book. If that book is not available as source, linking is likely to be done to the annotation file, or an annotated pdf.

So, a “project” will be a “concept graph”, which will somehow refer to the concepts of the underlying papers/books. Making this graph is, seemingly, much easier than making a bibliographic citation graph, because, even if you have zero metadata about the paper or book you are reading, you are very likely to read through at least the table of contents, and re-coding the table of contents into a file takes negligible time compared to the time needed to understand the concepts themselves.
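
Re-coding a table of contents can be almost mechanical. The sketch below, in Python, turns a list of (depth, heading) pairs into an Org outline with one TODO item per heading. The function name `toc_to_org` and the example headings are invented, and the `[/]` statistics cookie on the top line is just one possible way of tracking reading progress.

```python
from typing import Iterable, Tuple


def toc_to_org(title: str, toc: Iterable[Tuple[int, str]],
               keyword: str = "TODO") -> str:
    """Render a table of contents as an Org outline, one TODO per heading."""
    lines = [f"* {title} [/]"]           # Org fills in the cookie itself
    for depth, heading in toc:
        stars = "*" * (depth + 1)        # depth 1 goes directly under the title
        lines.append(f"{stars} {keyword} {heading}")
    return "\n".join(lines) + "\n"


# A made-up table of contents.
outline = toc_to_org(
    "A hypothetical paper",
    [(1, "Introduction"), (1, "Main theorem"), (2, "Proof")],
)
```

The resulting text can be pasted into the project file as it is, and every heading then becomes a place to hang annotations and links.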

Aha! I have mentioned something without saying it explicitly. A Table of Contents is one of the most natural ways of breaking a paper into a skeleton, similar to org-mode’s outline. See the next paragraph.

Concept maps

So, I have mentioned a few ways of grinding through scientific material, which eventually should lead to the creation of a new piece of knowledge.

A “project” is a set of articles to read, and a set of concepts to define. Ideas for new concepts arise from consumed articles, and the need to read more articles arises from the need to understand concepts, from reading an “incoming” list, and from citations by other articles.

When we want to visualise what is going on, we will quite naturally see three kinds of links between “Pieces of Knowledge”. (I am abusing notation here. From now on, a “Piece of Knowledge” is not just an article, it may be any piece of text that deserves independent study, for example, a chapter, or a section.)

These three kinds of links are:
1. Constituent links: a chapter is linked to its sections. Unidirectional.
2. Soft links: a theorem requires some background knowledge to be understood. “To understand this statement, I needed to read that place in that book”. Might be bi-directional, for example, if a theorem is described in two places, and understanding it might require reading both explanations. (See Scheme’s letrec.)
3. Indexing links: Two “Pieces of Knowledge” are describing the same concept, but I did not actually need to read one to understand the other.
How exactly a “Concept Map” would map onto a “Ready-made article” is a debatable subject. In some sense, its value is that of the debugging symbols for a binary program. It should greatly improve understanding, but most probably will not happen to be the skeleton of the final paper.
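
The three kinds of links can be made concrete with a small typed graph. This is a sketch in Python, with invented names (`LinkKind`, `Link`, `prerequisites`) and invented example identifiers. It answers one question only: what do I have to read before I can understand this piece? It does so by following soft links transitively and tolerates the bi-directional case.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Set


class LinkKind(Enum):
    CONSTITUENT = "constituent"   # chapter -> section, one direction only
    SOFT = "soft"                 # "to understand this, read that"
    INDEXING = "indexing"         # same concept, neither needed for the other


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    kind: LinkKind


def prerequisites(piece: str, links: List[Link]) -> Set[str]:
    """Everything reachable from `piece` by following soft links."""
    seen: Set[str] = set()
    stack = [piece]
    while stack:
        current = stack.pop()
        for link in links:
            if (link.kind is LinkKind.SOFT and link.source == current
                    and link.target not in seen):
                seen.add(link.target)     # also guards against cycles
                stack.append(link.target)
    seen.discard(piece)                   # a piece is not its own prerequisite
    return seen


# Made-up pieces of knowledge.
links = [
    Link("paper/ch2", "paper/ch2/thm1", LinkKind.CONSTITUENT),
    Link("paper/ch2/thm1", "book/ch5", LinkKind.SOFT),
    Link("book/ch5", "book/ch3", LinkKind.SOFT),
    Link("book/ch3", "book/ch5", LinkKind.SOFT),      # bi-directional pair
    Link("paper/ch2/thm1", "survey/sec4", LinkKind.INDEXING),
]
needed = prerequisites("paper/ch2/thm1", links)
```

Constituent and indexing links are stored in the same list but ignored by this particular query; a table-of-contents view or an index would be built from them in the same way.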

1.5. References

Ludwig Wittgenstein, Tractatus

1. TODO Body

1.1. Reading list [52/58]

1.1.1. Fireforg

1.1.2. Emacs Conference talk of 2022

1.1.3. My own old approach, used for ICFP 2020

1.1.4. Other thoughts.

1.1.5. http://gewhere.github.io/org-bibtex

1.1.6. http://www.draketo.de/english/emacs/writing-papers-in-org-mode-acpd

1.1.7. https://orgmode.org/worg/org-tutorials/org-latex-export.html

1.1.8. https://lepisma.xyz/wiki/emacs/org-mode/references.html

1.1.9. https://ogbe.net/emacs/references

1.1.10. http://kitchingroup.cheme.cmu.edu/blog/2014/05/13/Using-org-ref-for-citations-and-references

1.1.11. https://tincman.wordpress.com/2011/01/04/research-paper-management-with-emacs-org-mode-and-reftex/

1.1.12. http://socialdatablog.com/emacs-org-mode-as-outliner-bibliography-and-citation-manager-working-with-zotero-too.html

1.1.13. http://academia.stackexchange.com/questions/1273/use-cases-of-org-mode-as-a-scientific-productivity-tool-for-academics-without-pr

1.1.14. https://cachestocaches.com/2020/3/org-mode-annotated-bibliography/

1.1.15. https://rgoswami.me/posts/org-note-workflow/

1.1.16. https://unixbhaskar.wordpress.com/2023/04/11/bibliography-management-in-emacs-with-bibtex/

1.1.17. https://bastibe.de/2014-09-23-org-cite.html

1.1.18. https://blog.karssen.org/2013/08/22/using-bibtex-from-org-mode/

1.1.19. https://nickgeorge.net/science/org-ref-setup/

1.1.20. https://paul-nameless.com/emacs-org-mode-100-books.html

1.1.21. https://karl-voit.at/2015/12/26/reference-management-with-orgmode/

1.1.22. https://gitlab.inria.fr/compose/include/compose-bibliography

1.1.23. https://rebeja.eu/posts/managing-bibliography-using-emacs-org-mode-and-org-ref/

1.1.24. https://lists.gnu.org/archive/html/emacs-orgmode/2021-04/msg00445.html

1.1.25. https://www-public.imtbs-tsp.eu/~berger_o/weblog/2012/03/23/how-to-manage-and-export-bibliographic-notesrefs-in-org-mode/

1.1.26. https://stackoverflow.com/questions/73790997/emacs-org-mode-latex-export-doesnt-export-bibliography

1.1.27. https://viveks.info/org-mode-academic-writing-bibliographies-org-ref/

1.1.28. https://soham.dev/posts/org-bibliography/

1.1.29. http://doc.norang.ca/org-mode.html

1.1.30. https://koustuvsinha.com/post/emacs_research_workflow/

1.1.31. https://kristofferbalintona.me/posts/202206141852/

1.1.32. https://lucidmanager.org/productivity/bibliographic-notes-in-emacs-with-citar-denote/

1.1.33. https://blog.tecosaur.com/tmio/2021-07-31-citations.html

1.1.34. https://github.com/jkitchin/scimax/blob/master/scimax.org

1.2. Literature review [0/7]

1.2.1. Bibus

1.2.2. refdb-mode

1.2.3. Pure Bibtex / Biblatex

1.2.4. org-inlinetodos

1.2.5. reftex

1.2.6. ol-bibtex / org-bibtex

1.2.7. TODO org-bibtex-extras.el

1.2.8. ox-bibtex

1.2.9. Org-Mode’s own citation machinery, oc.el

1.2.10. bibtex-completion

1.2.11. org-ref

1.2.12. ebib

1.2.13. org-bib-mode

1.2.14. org-ebib

1.2.15. TODO evince

1.2.16. TODO PDF and its annotation tools

1.2.17. TODO denote

1.2.18. amsreftex

1.2.19. org-transclusion

1.2.20. org-sidebar

1.2.21. org-roam

1.2.22. TODO otter