A record of a meeting about news, their sources and development. (Attendees: Vladimir Nikishkin, David Wood, Ilya Vorontsov)
1. Preface
We all love news. At the same time, we are all dissatisfied with the current state of the news ecosystem. At the moment we each cope with this dissatisfaction in different ad-hoc ways. However, we all want a more efficient, controllable, and affordable (both in money and in effort) way of getting news.
In this meeting we wanted to discuss what exactly we do now and what we want to improve, both technologically and socially, with the potential of developing a novel news-related product.
“News” is not a clearly defined thing in itself, so this document also aims to give a working definition and to clarify the differences between the types of news.
2. Definitions
2.1. “Event” – something that physically happened. An intangible thing.
Examples: a tsunami, a piece of software released, a president elected, I took a photo of my cat.
2.2. “News Source” – someone who reported the original piece of news.
Examples: NVidia publishing a press-release, Government office issuing a press-release, A website issuing an update, I uploaded the cat’s photo online.
2.3. “News Provider” – someone who wrote an article about the event and posted it somewhere on the Internet. Sometimes they are a “news source” too, if the content is original.
Examples: Bloomberg, Reuters, a blog, a Facebook profile.
2.4. “Traditional Medium” – a resource (company) that existed in the media business before the advent of the Internet, or is primarily based off-line.
In about 90% of cases they act as news providers, and in about 10% of cases as news sources, when they have “special correspondents”.
Examples: The BBC, The First Channel (of whatever country), The Washington Post, The Echo of Moscow Radio.
2.5. “Internet Medium” – a resource (company) that is primarily or exclusively online-based.
Examples: Cnet.com, TechRadar.com, Habr.{com,ru}, opensource.com, medium.com.
2.5.1. “Social Medium” – an internet medium with a heavy emphasis on letting the users create content.
Examples: Facebook, Twitter, LiveJournal, Mastodon, Parler, VK.
2.6. “News Aggregator” – a resource for collecting news pieces and presenting a unified interface for them.
Examples: Google News, Yandex News, Yandex Zen, news.ycombinator.com, Perl Planet
2.7. “Terminal” – a thing that the end-user is using to access a piece of news.
Examples: Email client, Facebook website, Facebook App, AtomFeed Reader, WeChat messenger, Telegram
2.8. Weight – the amount of energy a customer needs to spend on consuming a piece of news.
It is hard to define precisely, since it depends on the context and on the consumer. The length of a text/video/audio can serve as an estimate, but a bad one.
As extreme examples, scientific papers can easily demand 8 hours per page, whereas pulp fiction can be consumed at a much higher rate. Vladimir’s personal record is the SICP book, which took 9 months to read.
2.9. Medium – a way a piece of news is stored.
- Text
- Audio
- Graphic
- Video
- Digitally-native (3D/Program)
- N-media (e.g. a text with illustrations)
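The definitions above can be pinned down as a small data model. The Python sketch below is only one possible reading of them; all class and field names (`Event`, `NewsPiece`, `weight_minutes` and so on) are our own assumptions, not an agreed schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Medium(Enum):
    """The ways a piece of news is stored (section 2.9)."""
    TEXT = "text"
    AUDIO = "audio"
    GRAPHIC = "graphic"
    VIDEO = "video"
    DIGITAL = "digitally-native"
    N_MEDIA = "n-media"


@dataclass(frozen=True)
class Event:
    """Something that physically happened (section 2.1)."""
    description: str


@dataclass(frozen=True)
class NewsPiece:
    """A report about an event, as published by a news provider."""
    event: Event
    source: str      # who reported the original piece (section 2.2)
    provider: str    # who wrote and posted this article (section 2.3)
    medium: Medium
    weight_minutes: Optional[float] = None  # assumed unit for "weight" (section 2.8)

    def is_original(self) -> bool:
        # A provider is also the source when the content is original.
        return self.source == self.provider


release = Event("a piece of software released")
press = NewsPiece(release, source="NVidia", provider="NVidia", medium=Medium.TEXT)
rewrite = NewsPiece(release, source="NVidia", provider="a blog", medium=Medium.TEXT)
```

Keeping the event separate from the pieces written about it makes it possible to group several providers' articles under the same event later on.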
3. The ways we obtain news at the moment.
3.1. Smartphone news aggregators. (Read, Google News)
Google News is a service that gives you news headlines, as well as full article bodies fetched from news providers. Google does quite a good job at identifying duplicates, which matters because one news source is usually later used by many news providers, each delivering the content, enriched and post-processed, to its own users.
Google News gives you access to the full article body, fetched from one of the providers, but usually not the news source.
Google News uses a sophisticated recommendation method, presumably fully algorithmic (no human involvement), to recommend news to consumers. This algorithm relies heavily on data that Google collects implicitly about its users, most of all their search queries.
The problem with Google News is that it is hard (if not impossible) to “force” Google to show you more of something. You simply cannot make it subscribe to something directly.
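How Google identifies duplicates is not known to us. Purely as an illustration of the idea, the sketch below compares two articles by the Jaccard similarity of their word shingles; the function names and the 0.5 threshold are assumptions made for this example.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of a text, lower-cased."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a: str, text_b: str, threshold: float = 0.5) -> bool:
    # The threshold is an arbitrary value chosen for this sketch.
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold


original = "NVidia announced a new graphics card for data centres today"
rewrite = "NVidia announced a new graphics card for data centres on Monday"
unrelated = "A tsunami warning was issued for the coast this morning"
```

With these inputs the lightly edited `rewrite` counts as a duplicate of `original`, while `unrelated` does not. Heavier augmentation by a provider would defeat such a simple measure.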
3.2. Email
Vladimir gets his news by subscribing to individual RSS/Atom feeds via an email-gateway. He gets ~50 news-related emails daily and is quite overwhelmed with them.
This solution has the difficulty that not all news providers (or news sources) have email or feed gateways. For example, Facebook disabled their gateways circa 2011.
The filtering problem could be solved by crafting various data-collection “sensors”, such as a Chrome extension or a context-sensitive keylogger, and then training a local filtering tool, but Vladimir is so far nowhere near doing that.
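A feed-to-email gateway of the kind described in this section could look roughly like the sketch below. It parses an Atom document with the Python standard library and turns each entry into an email message without sending it. The addresses, the sample feed and the function name `atom_to_emails` are made up for illustration and do not describe Vladimir's actual setup.

```python
import xml.etree.ElementTree as ET
from email.message import EmailMessage

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace

SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <entry>
    <title>New release</title>
    <link href="https://example.org/release"/>
    <summary>Version 1.0 is out.</summary>
  </entry>
</feed>"""


def atom_to_emails(feed_xml: str, recipient: str) -> list:
    """Convert every Atom entry into an (unsent) EmailMessage."""
    root = ET.fromstring(feed_xml)
    feed_title = root.findtext(ATOM + "title", default="(untitled feed)")
    messages = []
    for entry in root.iter(ATOM + "entry"):
        link = entry.find(ATOM + "link")
        url = link.get("href", "") if link is not None else ""
        title = entry.findtext(ATOM + "title", default="")
        summary = entry.findtext(ATOM + "summary", default="")
        msg = EmailMessage()
        msg["From"] = "feeds@example.org"  # hypothetical gateway address
        msg["To"] = recipient
        msg["Subject"] = "[{}] {}".format(feed_title, title)
        msg.set_content("{}\n\n{}".format(summary, url))
        messages.append(msg)
    return messages


emails = atom_to_emails(SAMPLE_FEED, "reader@example.org")
```

Putting the feed title into the subject line is what later allows ordinary mail filters to sort or drop whole feeds.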
3.3. Opting in for just one or several news providers and just following them.
That’s what most people do. They just regularly visit, say, Habr.com, and try to tune the news feed in a way that is as personalised as possible.
- One problem with this approach is that it requires deliberately setting aside time for reading news.
- Another problem is that the user is mostly limited to the news from a single provider. There are ways to break this limit (LiveJournal, for example, lets you introduce RSS feeds into your news feed), but that is not a frequently used feature.
3.4. Summary
Most of the ways above are annoying.
- Too little control.
- Too bad recommendations.
- Bad coverage.
- Hard to tune interface.
- Lack of API.
- Preferences data leaks likely.
What follows will present various thoughts about the news data structures, algorithms and pipelines.
4. News pipeline
4.1. Creation
News creation can be:
- Automatic : a CCTV camera finished recording for today, uploads a video file, and updates an Atom feed.
- Manual : I take a photo of a beautiful flower-bed and upload it to Instagram, by creating a new post.
- Semi-automatic : I just upload a new software release .tar.gz on a web server (I am making an event), and some other bot or a human spots this and makes a piece of news.
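The semi-automatic case can be sketched as a bot that compares two listings of a release directory and produces a headline for every new tarball. The file-name pattern, the sample file names and the function `spot_releases` are hypothetical and only illustrate the idea.

```python
import re

# Assumed naming scheme: <name>-<version>.tar.gz, e.g. frobnicate-1.1.tar.gz
RELEASE_RE = re.compile(r"^(?P<name>[a-z0-9_]+)-(?P<version>\d+(?:\.\d+)*)\.tar\.gz$")


def spot_releases(previous: set, current: set) -> list:
    """Turn files that appeared since the previous listing into headlines."""
    headlines = []
    for filename in sorted(current - previous):
        match = RELEASE_RE.match(filename)
        if match:  # ignore new files that do not look like releases
            headlines.append("{name} {version} released".format(**match.groupdict()))
    return headlines


before = {"frobnicate-1.0.tar.gz", "README"}
after = {"frobnicate-1.0.tar.gz", "frobnicate-1.1.tar.gz", "README", "notes.txt"}
headlines = spot_releases(before, after)
```

Here the upload itself is the event, and the bot is the party that turns it into a piece of news.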
4.2. Augmentation
Often the original piece of news is very terse. There is a process that is called “augmentation” in this document, that makes that piece of news more understandable, more readable, and richer.
Example: Vladimir wrote SRFI-203, which is a technical document. Later, Vladimir wrote an article on Habr.ru, in order to announce the existence of SRFI-203, and in order to provide more context on why it is needed, and to give some examples of its usage.
Augmentation is usually done by the “news providers”, and often is tailored for their audience. This is one of the places where bias is introduced. On the other hand, leaving out augmentation entirely seems not viable, as the readers are often lacking the context.
4.3. Collection
Naturally, there are many news providers, and it is hardly possible to subscribe to each of them individually, particularly since many of them do not have web pages at all, let alone feeds such as RSS.
The news therefore has to be aggregated.
Aggregation can be:
- Automatic
  - Pushed (RSS)
  - Pulled (parsing bots)
- By the employees of a news provider (pulled)
- By crowd-sourcing (pushed)
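The difference between pushed and pulled collection can be shown with a toy aggregator. In the sketch below, `poll` is the pull side (the aggregator asks each registered channel for items) and `receive` is the push side (a channel hands an item over on its own initiative). The class, its method names and the channel names are assumptions made for this example.

```python
from typing import Callable, Dict, List


class Aggregator:
    """Collects news items from pulled and pushed channels into one list."""

    def __init__(self) -> None:
        self.items: List[str] = []
        self._pullers: Dict[str, Callable[[], List[str]]] = {}

    def register_puller(self, name: str, fetch: Callable[[], List[str]]) -> None:
        self._pullers[name] = fetch

    def poll(self) -> int:
        """Pull: ask every registered channel and return the number of new items."""
        added = 0
        for name, fetch in self._pullers.items():
            for item in fetch():
                added += self._add(name, item)
        return added

    def receive(self, name: str, item: str) -> int:
        """Push: a channel or a crowd-sourcing user submits an item."""
        return self._add(name, item)

    def _add(self, name: str, item: str) -> int:
        entry = "{}: {}".format(name, item)
        if entry in self.items:  # skip items that were already collected
            return 0
        self.items.append(entry)
        return 1


agg = Aggregator()
agg.register_puller("parsing-bot", lambda: ["tsunami warning issued"])
first = agg.poll()
second = agg.poll()  # the bot returns the same item, so nothing new is added
agg.receive("crowd", "a president elected")
```

Pulling has to cope with seeing the same item again on every poll, which is why even this toy version needs a duplicate check.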
4.4. Filtering
Naturally, filtering is crucial for any news-related ecosystem, since the amount of noise is enormous.
- Implicit : news that is missed by the collection systems is naturally filtered out. This is not always bad, but many golden nuggets are lost this way too.
- Automatic : Regexps, Natural Language Processing, stop-words, Sieve spam-filters, etc.
- Human : Direct censorship, editorship, class selection by customers, etc.
- Collaborative : That is a mixture of automatic and human. Some bootstrap is made by humans, and then we try to extrapolate the same filtering mechanism on the other news pieces.
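The collaborative variant can be illustrated with a very small naive Bayes classifier: humans label a few pieces as keep or drop, and the same decision is then extrapolated to unseen pieces. This is a minimal sketch under our own assumptions (the class name `BootstrapFilter`, the stop-word list, add-one smoothing), not a recommendation of a particular algorithm.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the", "of", "in", "on", "for", "is", "to", "and"}


def tokens(text: str) -> list:
    """Lower-case words with stop-words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


class BootstrapFilter:
    """Naive Bayes with add-one smoothing, trained on human keep/drop labels."""

    def __init__(self) -> None:
        self.word_counts = {"keep": Counter(), "drop": Counter()}
        self.doc_counts = Counter()

    def label(self, text: str, verdict: str) -> None:
        """Record a human decision; verdict is 'keep' or 'drop'."""
        self.word_counts[verdict].update(tokens(text))
        self.doc_counts[verdict] += 1

    def _log_score(self, words: list, verdict: str) -> float:
        counts = self.word_counts[verdict]
        vocabulary = set(self.word_counts["keep"]) | set(self.word_counts["drop"])
        total = sum(counts.values())
        # Smoothed class prior, so an unlabelled class never yields log(0).
        prior = (self.doc_counts[verdict] + 1) / (sum(self.doc_counts.values()) + 2)
        score = math.log(prior)
        for word in words:
            # The extra +1 in the denominator reserves mass for unseen words.
            score += math.log((counts[word] + 1) / (total + len(vocabulary) + 1))
        return score

    def keep(self, text: str) -> bool:
        words = tokens(text)
        return self._log_score(words, "keep") >= self._log_score(words, "drop")


news_filter = BootstrapFilter()
news_filter.label("New Scheme SRFI published", "keep")
news_filter.label("Scheme compiler release announced", "keep")
news_filter.label("Celebrity wedding photos leaked", "drop")
news_filter.label("Celebrity diet secrets revealed", "drop")
```

After these four labels, `news_filter.keep("Another Scheme release")` is true and `news_filter.keep("Celebrity photos")` is false, although neither headline was labelled by a human.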
4.5. Formatting
Formatting is more important than is usually assumed. Some people are happy with just headlines. Some people prefer abstracts. Other people are into full-length articles, extended articles (long-reads), or even series of articles (a thing that is hard to define!).
A (hypothetical) perfect piece of news supports all the aforementioned levels of abstraction.
The problem here is usually that news is manufactured at a single level of abstraction. We are, therefore, faced with the problem of up-scaling and down-scaling information.
Since this section naturally deals with the problem of news “weight”, apart from up-scaling and down-scaling, we should mention same-scaling, or re-wording a piece of information.
Note that re-wording is tightly connected to lossless compression. Lossless compression reduces the length of a piece of news while preserving its weight. However, its practicality is not self-evident.
Medium conversion also has to be discussed in this section. The formatting operations are therefore:
- Up-scaling
- Down-scaling
- Re-scaling
- Compression
- Generation (faking)
- Medium conversion
- Language translation
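The idea of a piece of news that supports several levels of abstraction can be sketched as a mapping from level to text, plus a lookup that falls back to the closest terser level when the wanted one is missing. The level names and the `render` function are assumptions for this example; the fallback deliberately avoids any real up-scaling or down-scaling, which would need a summariser or an editor.

```python
from typing import Dict, Optional

LEVELS = ["headline", "abstract", "article", "long-read"]  # from terse to detailed


def render(piece: Dict[str, str], wanted: str) -> Optional[str]:
    """Return the piece at the wanted level, or at the closest terser level."""
    start = LEVELS.index(wanted)
    for level in reversed(LEVELS[:start + 1]):
        if level in piece:
            return piece[level]
    return None  # nothing at or below the wanted level is available


piece = {
    "headline": "SRFI-203 published",
    "article": "Vladimir wrote SRFI-203, which is a technical document.",
}
as_abstract = render(piece, "abstract")    # falls back to the headline
as_long_read = render(piece, "long-read")  # falls back to the article
```

A perfect piece would have all four keys filled in; in practice most pieces have one, and the gaps mark where scaling work is needed.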
4.5.1. Automatic
Neither summarising nor elaborating is a solved problem.
Progress on summarising is a little better, as there are word2vec/text2vec embeddings that attempt to solve this.
Elaborating would require access to external data sources and context, and I am not aware of any progress on this matter.
There is some progress on re-wording, at least up to the level of fooling search engines into believing that a piece of news is distinct from the other pieces.
However, this is a GAN-like system.
Search engines are getting increasingly better at detecting auto-rewrites.
All of the progress above is generally concerned with pieces of text.
Compression is basically non-existent.
Conversion exists in the following way:
from \ to | Text | Audio | Image | Video | Digital |
---|---|---|---|---|---|
Text | No need | Good | No | No | No |
Audio | Mediocre | No need | No | No | No |
Image | Bad | Bad (via text) | No need | No | No |
Video | Very bad | Very bad (via text) | ? | No need | No |
Digital | Lossy | Lossy | Lossy | Lossy | No need |
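The “via text” cells suggest treating the table as a graph of direct conversions and chaining them. The sketch below encodes only the direct cells and finds the shortest chain with a breadth-first search; the dictionary layout and the function name `conversion_path` are assumptions, and the “?” cell is treated as absent.

```python
from collections import deque

# Direct conversions from the table above. "No" cells and "via text" cells
# are left out; the "?" cell (Video to Image) is treated as absent.
DIRECT = {
    "Text": {"Audio": "Good"},
    "Audio": {"Text": "Mediocre"},
    "Image": {"Text": "Bad"},
    "Video": {"Text": "Very bad"},
    "Digital": {"Text": "Lossy", "Audio": "Lossy", "Image": "Lossy", "Video": "Lossy"},
}


def conversion_path(source: str, target: str):
    """Breadth-first search for the shortest chain of direct conversions."""
    if source == target:
        return [source]
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        for nxt in DIRECT.get(path[-1], {}):
            if nxt in seen:
                continue
            if nxt == target:
                return path + [nxt]
            seen.add(nxt)
            queue.append(path + [nxt])
    return None  # no chain of conversions exists


image_to_audio = conversion_path("Image", "Audio")
text_to_video = conversion_path("Text", "Video")
```

The search reproduces the table's remarks: Image to Audio goes through Text, and nothing leads from Text to Video.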
Language translation works for an unassuming customer.
4.5.2. Manual
Titles (the highest level of abstraction) are usually available for free.
- Up-scaling : generally possible
- Down-scaling : generally possible (main selling point of news providers)
- Re-scaling : generally possible (usually for fooling search engines)
- Compression : generally possible (another selling point)
- Generation (faking) : generally possible
- Medium conversion : very expensive
- Language translation : is becoming increasingly cheap
4.6. Classification, Importance, Analysis
Classifying the news is also important, but a little bit difficult to define. Classification is not entirely the same thing as filtering, although the class of a piece of the news can be a basis for filtering it out or letting it go through.
4.6.1. Distance to consumer
- Immediately connected to the consumer. (E.g. law updates for accountants.)
- Of general importance. (E.g. the introduction of a curfew.)
- Everything else.
4.6.2. Importance
- Important/Action required
- Important/Action not required
- Ignore-able
4.6.3. Area of Effect
- Single person
- Household
- Locally bound (House/District/Country)
- Universal
- Certain group, e.g. diabetic people, music fans
4.6.4. Areas of life
- too many to list
- Science/Technology
- Society
- Hobby
- Art
- Medicine
4.6.5. Verification
Verification is hard to define. What is true and what is false in our new world of post-truth?
- Manual
- Automatic
- Absent
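Two of the classification axes above can be written down as enumerations, together with an example of a class-based filtering rule. Everything in this sketch, including the names and the particular rule in `passes`, is an assumption chosen for illustration; the other axes could be added in the same way.

```python
from dataclasses import dataclass
from enum import Enum


class Distance(Enum):
    IMMEDIATE = 1  # immediately connected to the consumer
    GENERAL = 2    # of general importance
    OTHER = 3      # everything else


class Importance(Enum):
    ACTION_REQUIRED = 1
    NO_ACTION = 2
    IGNORABLE = 3


@dataclass
class Classified:
    title: str
    distance: Distance
    importance: Importance


def passes(piece: Classified) -> bool:
    """Example rule: drop ignorable pieces unless they are immediately relevant."""
    return (piece.importance is not Importance.IGNORABLE
            or piece.distance is Distance.IMMEDIATE)


law = Classified("Law update for accountants", Distance.IMMEDIATE, Importance.ACTION_REQUIRED)
curfew = Classified("Curfew introduced", Distance.GENERAL, Importance.ACTION_REQUIRED)
gossip = Classified("Celebrity gossip", Distance.OTHER, Importance.IGNORABLE)
```

The class of a piece is stored independently of the rule, so the same classification can feed different filters for different consumers.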
4.7. Delivery
4.7.1. Terminals
Terminals may be:
- Personal Computers
- Smartphones
- (Smart-)Television Sets
- (Smart-)Radio Sets
- Specialised devices : e-Books, media players, in-vehicle thingies, smart speakers
- Unrelated devices : Billboards, digital photo frames, tickers
- Actual paper newspapers, journals, magazines
- Human assistants
Reception tools:
- Specialised : News app, news website, NNTP-reader
- General-purpose (coerced) : Email-reader, LiveJournal feed
By intent:
- Intentional : I subscribe
- Sponsored : Ads
- Subliminal : “Native ads”, “Biased news providers”
4.7.2. Time, selection, batching, conversion.
This section is hard to describe, but it is an important point of attention. The time, place, size (weight/length), grouping, and format of the news together form an important selling point.
For example, a consumer is driving to work. Driving requires relatively low concentration, so one may, perhaps, want to receive some news at that time. However, the sort of attention a driver may dedicate to consuming the information is generally limited to audio.
Selection in this section is not entirely the same thing as filtering. Vaguely speaking, filtering is a partition of the news into useful/useless, whereas selection works with the news that has already been judged useful, and picks from it what is most appropriate for the user at the given time/place/class.
- Manual : presidents have it. Do other people?
- Automatic : Logic-based.
- Collaborative : Extending the manual thing to algorithms.
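A logic-based automatic selection for the driving example could look like the sketch below. It takes pieces that have already passed filtering and picks, shortest first, those that fit the allowed media and the time available. The rule table, the dictionary keys and the greedy strategy are all assumptions made for this example.

```python
# Assumed rule table: which media each activity allows.
ALLOWED_MEDIA = {"driving": {"audio"}}
DEFAULT_MEDIA = {"text", "audio", "video"}


def select(pieces: list, context: dict) -> list:
    """Pick already-filtered pieces that suit the consumer's situation."""
    allowed = ALLOWED_MEDIA.get(context["activity"], DEFAULT_MEDIA)
    budget = context["minutes_available"]
    chosen = []
    # Greedy batching: the shortest pieces first, until the time runs out.
    for piece in sorted(pieces, key=lambda p: p["minutes"]):
        if piece["medium"] in allowed and piece["minutes"] <= budget:
            chosen.append(piece["title"])
            budget -= piece["minutes"]
    return chosen


pieces = [
    {"title": "Morning podcast", "medium": "audio", "minutes": 15},
    {"title": "Long-read on the curfew", "medium": "text", "minutes": 20},
    {"title": "Traffic bulletin", "medium": "audio", "minutes": 5},
    {"title": "Audio documentary", "medium": "audio", "minutes": 40},
]
for_the_drive = select(pieces, {"activity": "driving", "minutes_available": 25})
at_the_desk = select(pieces, {"activity": "desk", "minutes_available": 60})
```

The same four pieces yield different batches for the two contexts, which is the point of separating selection from filtering.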
4.8. Sharing and Feedback
4.8.1. Sharing
Sharing is an important part of the news ecosystem.
- Specialised : the news is forwarded to the receivers (friends) through specialised channels, such as the “Share” button, and is delivered to their expected news terminal.
- Ad-hoc : the news is forwarded to the receivers in an ad-hoc manner.
4.8.2. Feedback
Feedback is semantically close to sharing. In some sense, sharing is the most basic kind of reaction that a user may have.
- Direct : comments, “like” buttons, subscriptions, donations, stuff built into the news pieces themselves, class selection, preferences.
- Indirect
  - Traceable : “Sign a petition”, and the provider sees the number of signatures.
  - Non-traceable : “Lock your door”, and the provider does not know whether the call is heeded.
4.9. Storage and indexing
News pieces have to be stored somewhere. News sources often do not care about making the pieces persistent to any degree.
News providers usually care a bit more about that, but even they often neglect the permanence of web links.
It is not clear by definition which news is worth storing for a long time. CCTV recordings are usually wiped fairly soon after recording, perhaps after half a year at most.
Indexing and searching is also a difficult question. Progress exists for text search (e.g. Elasticsearch), and there are things like “search by image”, but I do not know the state of the art.
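For text, the basic data structure behind search is an inverted index, which maps each word to the pieces that contain it. The sketch below is a toy version with hypothetical names (`InvertedIndex`, `add`, `search`) and no ranking; it is not how any particular search engine is implemented.

```python
import re
from collections import defaultdict


class InvertedIndex:
    def __init__(self) -> None:
        self._postings = defaultdict(set)  # word -> ids of pieces containing it
        self._pieces = {}                  # id -> stored text

    def add(self, piece_id: int, text: str) -> None:
        self._pieces[piece_id] = text
        for word in re.findall(r"\w+", text.lower()):
            self._postings[word].add(piece_id)

    def search(self, query: str) -> list:
        """Return the sorted ids of pieces that contain every word of the query."""
        words = re.findall(r"\w+", query.lower())
        if not words:
            return []
        matches = set.intersection(*(self._postings.get(w, set()) for w in words))
        return sorted(matches)


index = InvertedIndex()
index.add(1, "A tsunami hit the coast")
index.add(2, "Tsunami warning lifted")
index.add(3, "A president elected")
```

With these three pieces, `index.search("tsunami")` returns `[1, 2]` and `index.search("tsunami warning")` returns `[2]`.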
4.10. Systems analysis
4.10.1. Flow analysis
In terms of “selling point”, this section is completely on the back-end side of the industry.
These are the kinds of services that show you which words are “trending” now, what kind of news is dominating the agenda, and which news produces more feedback.
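One simple way to find “trending” words is to compare how often a word occurs in recent headlines with how often it occurred in an earlier baseline. The sketch below does exactly that; the function name `trending`, the sample headlines and the smoothing value for unseen words are assumptions.

```python
import re
from collections import Counter


def trending(recent: list, baseline: list, top: int = 3) -> list:
    """Rank words by how much more frequent they are now than in the baseline."""

    def frequencies(headlines: list) -> dict:
        counts = Counter(w for h in headlines for w in re.findall(r"[a-z]+", h.lower()))
        total = sum(counts.values()) or 1
        return {word: count / total for word, count in counts.items()}

    now, before = frequencies(recent), frequencies(baseline)
    floor = 1e-3  # assumed frequency for words that never occurred in the baseline
    ratios = {word: freq / before.get(word, floor) for word, freq in now.items()}
    # Highest ratio first; ties are broken alphabetically.
    return sorted(ratios, key=lambda word: (-ratios[word], word))[:top]


baseline = ["President visits factory", "Software released", "President speaks"]
recent = ["Tsunami hits coast", "Tsunami warning issued", "President comments on tsunami"]
top_word = trending(recent, baseline, top=1)
```

A frequent but constant word such as “president” scores low here, because it is the change in frequency that is ranked, not the frequency itself.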
4.10.2. KYC/Customer analysis
If a news service is getting feedback, it inevitably has a profile of its users. This information can be used to improve the news flow, as well as to make money from it.
4.10.3. Stories
If the storage is efficient, and flow analysis is advanced, it should be possible to build “books” or “narratives” that tell a story, as a narrated sequence of events, concatenated and re-worded to the same language. Not sure this is really feasible now.
4.11. Sources of income
These kinds of monetisation may be present at any stage of the pipeline.
- Cost-less/Enthusiasm
- (Sub-)Service subscriptions
  - Time-limited
  - Lifetime
- Pay-per-piece
- Advertisement
  - Explicit
  - Native
- Pay-per-publication (from authors, e.g. OpenAccess)
- Information selling (e.g. selling customer data)
- Sponsorship
  - Government
  - Commercial
- Donations
- Merchandise
- Archive access fee
4.12. Computing Technologies and Buzzwords
4.12.1. Technologies:
- RSS, Atom : news-feed supplying format
- NNTP : an old protocol for distributing Usenet news articles
- Email, webmail : a way to receive news
- HTML, HTTP, Web : the way most people create news nowadays
- NLTK : a Natural Language Processing tool for Python
- Elasticsearch, Xapian : search engines and libraries
- Android, iOS, Windows, Linux, macOS : computing environments
- Google Analytics, Yandex Metrica : KYC tools
- Selenium : programmable browser
- grep, awk, bison/yacc/lex/flex/sed, perl : text processing tools
- Siri, Alisa, Cortana, Baidu : a way to get audio training data
- OpenCV : a way to extract something from video
- TTS : Text-to-Speech
- Speech Recognition : Speech to text transcription
- OCR : Optical Character Recognition
4.12.2. Buzzwords
- Cambridge Analytica
- Palantir Technologies
- Semantic Web
- word2vec
- Deep Neural Networks
4.12.3. Companies
- Bloomberg
- Reuters
- Xinhua News
- BBC
- Habr/TM
- Slashdot Media
- LiveJournal
- Microsoft
- Yandex
- Baidu
- Tencent
- Alibaba
4.12.4. Markets
- UK
- USA
- Russia
- China
4.12.5. Languages
- English
- Russian
- Chinese
- (Spanish/French/Arabic/Japanese)
5. Review
Any decent analytic work requires peer-review.
If you are invited to be a peer reviewer for this document, you are encouraged to add your comments to this section.
6. Conclusion
No conclusion so far.