- When classifying images, using the biggest hash set is best. However, for triage, comparatively small datasets can be effective.
- For triage, it is often enough to know whether a device holds any material of a particular type, rather than to identify everything.
- As most online content follows a head-and-long-tail pattern, you may already have all the data required for triage.
Hash Analysis – Bigger is Better for Classification
When images on a device must be classified, every image that can be processed automatically saves time. The bigger the hash set, the less operator time is needed for manual classification, with a corresponding reduction in exposure to disturbing content.
For classification, limited-size hash sets have clear limitations, but what about for triage?
Traditionally, using a comprehensive hash set for triage was impractical because it slowed tools down excessively, presenting a difficult operational trade-off.
With tools that use triage-first technologies, this slowdown may have been eliminated even for comprehensive datasets, but adoption may require access to original material. For example, Cyan Forensics’ tools use Contraband Filters, our equivalent of hash sets. Contraband Filters must be built from original material, but allow fast lookup against almost any dataset size.
This leads to an interesting question: if it is not practical to triage using a comprehensive search set (whether because of performance or data availability), which files should be included in the search set to deliver the most impactful triage results?
Predictable Human Patterns
Human behaviours tend to exhibit striking and predictable patterns when you look at them across large numbers of people.
One example of this is the Long Tail Effect for media content – which is frequently observed to apply to everything from Amazon’s book sales and Google’s search queries to media plays on streaming services.
In this pattern there is a ‘Head’, a small number of popular things viewed or purchased by very many people, and a ‘Long Tail’, a very large number of things that are each viewed or purchased by only a very small number of people.
Given that these relationships are observed in so many areas online, it seems likely that similar behavioural patterns will also be observed in the distribution of illegal content.
Most forensics analysts have seen the same indecent images many times across cases.
Most investigators would agree that there is a core subset of material that they see again and again across many investigations.
— Bruce Ramsay, 10 years of forensic analysis, CTO Cyan Forensics
One of the things I found was that almost all of the images and video clips I have seen in my first 4 or 5 years I had seen in my first 6 months.
— Forensics Focus post – steve862 (Experienced Examiner)
One explanation for this might be the homogenising power of the Internet.
While it is now easier than ever to form communities and share material, “consumers” and “content distribution points” tend to be in many-to-few relationships. This is true both on the web, in the ratio of website visitors to websites, and in peer-to-peer file sharing when comparing number of downloaders to the number of tracker servers.
This many-to-few pattern centralises content, leading to ‘new’ and ‘most popular’ content being promoted, leading to many consumers downloading a central core of material.
As far as we know, no force for good has compiled numbers about which indecent images are encountered most frequently, so the actual pattern of criminal behaviour is unknown.
We hypothesise that the distribution of content across all law enforcement seizures will follow a pattern, and as most of the illegal content is distributed online, the distribution of material is likely to fall into a Head and Long Tail.
This hypothetical graph shows the number of times a known file is detected, vs the file’s rank amongst all the files ever detected.
The total number of file detections in the yellow ‘head’ is equal to those in the ‘long tail’.
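To make the shape of such a curve concrete, here is a minimal Python sketch. It assumes, purely for illustration, that the number of detections of a file is proportional to 1/rank (a Zipf-like distribution). No real detection counts are available, so the exponent and the set sizes are hypothetical, and `head_size` is our own illustrative name. The function finds how many top-ranked files account for half of all detections, which is the boundary between the head and the long tail as described above.

```python
def head_size(num_files: int, exponent: float = 1.0) -> int:
    """Smallest number of top-ranked files that accounts for half of all detections.

    Assumption (illustrative only): detections of the file at rank r are
    proportional to 1 / r**exponent.
    """
    weights = [1.0 / rank ** exponent for rank in range(1, num_files + 1)]
    half_of_total = sum(weights) / 2
    running = 0.0
    for rank, weight in enumerate(weights, start=1):
        running += weight
        if running >= half_of_total:
            return rank
    return num_files


# Hypothetical sizes for the set of all known files.
for n in (1_000, 100_000, 1_000_000):
    h = head_size(n)
    print(f"{n:>9} known files: head = {h} files ({100 * h / n:.3f}% of the set)")
```

Under these assumptions the head of a one-million-file set comes out at fewer than a thousand files. A real distribution could be flatter or steeper, which would make the head larger or smaller.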
Head & Tail & Triage
Effective triage in, that is, finding quick wins, only needs to provide a rapid indication that material of interest is present on a device.
If the Head and Long Tail pattern applies to illegal files, it suggests that we are many times more likely to encounter material in the Head, and even within that subgroup, some items are many, many times more likely to be seen.
However, the inverse is also true: if you were to assemble the files encountered in a small number of cases, the resulting library would have better coverage of files in the head than of files in the tail, as this common material will be present in more cases.
This suggests that an effective set for triage could be assembled from a small number of current cases.
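The reasoning above can be explored with a small simulation. This is a sketch under stated assumptions, not a model of real seizures: file popularity is assumed to follow a 1/rank distribution, every device is assumed to hold 20 known files drawn independently from it, and all names and parameter values (`triage_hit_rate`, `num_files`, `files_per_device` and so on) are hypothetical. It builds a triage set from a handful of simulated cases and then measures how often a fresh device contains at least one file from that set.

```python
import random
from itertools import accumulate


def build_cum_weights(num_files, exponent=1.0):
    """Cumulative Zipf-like weights: the file at rank r has weight 1 / r**exponent."""
    return list(accumulate(1.0 / rank ** exponent for rank in range(1, num_files + 1)))


def simulate_device(rng, cum_weights, files_per_device):
    """Return the set of known-file ranks found on one hypothetical device."""
    ranks = range(1, len(cum_weights) + 1)
    return set(rng.choices(ranks, cum_weights=cum_weights, k=files_per_device))


def triage_hit_rate(num_cases, num_files=100_000, files_per_device=20,
                    trials=300, seed=0):
    """Build a triage set from num_cases devices, then test it on fresh devices.

    Returns (fraction of fresh devices with at least one match, triage set size).
    """
    cum_weights = build_cum_weights(num_files)
    case_rng = random.Random(seed)        # devices seen in past cases
    device_rng = random.Random(seed + 1)  # fresh devices to be triaged
    triage_set = set()
    for _ in range(num_cases):
        triage_set |= simulate_device(case_rng, cum_weights, files_per_device)
    hits = sum(
        1 for _ in range(trials)
        if triage_set & simulate_device(device_rng, cum_weights, files_per_device)
    )
    return hits / trials, len(triage_set)


for cases in (1, 5, 20):
    rate, size = triage_hit_rate(cases)
    print(f"{cases:>2} cases: {size} files in triage set, hit rate {rate:.0%}")
```

Because the same fresh devices are used for every run and each larger triage set contains the smaller ones, the hit rate can only stay level or rise as cases are added. How quickly it rises depends entirely on the assumed distribution, which is exactly the quantity that has not yet been measured for illegal content.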
Cyan Examiner Triage Experience of a UK Police Force
One UK police force recognised the benefits that Cyan Examiner would bring to their workflow, but a Contraband Filter (Cyan Forensics’ alternative to a hash set) for their preferred database (CAID – the UK’s national Child Abuse Image Database) was not yet available.
The force decided to build a Contraband Filter (using Cyan Collector) from the cases it encountered in the course of its day-to-day work, believing that the benefits of any positive rapid triage detections would outweigh the lack of a comprehensive contraband set.
That force is now successfully using this Contraband Filter with Cyan Examiner to get “quick wins” – finding evidence quickly to support suspect interviews and prioritise further examination of devices.
They, like us, believe that this is because the content falls into a Head and Long Tail pattern, where the core of material is present across many investigations.
Past Examples of ‘Most Common’ Useful Results
The introduction of previous tools has also shown that effective results can be obtained from data gathered from a small number of current cases.
In the UK, Operation Notarise (Operation NOCAP in Scotland) made clear the scale of consumption of indecent images of children online, and led to some prosecutions. The initial database supporting this type of work, which was of international origin, was built out from a small number of hashes, with further intelligence added later.
Similarly, when C4All was first introduced, analysts started with zero files classified, but it quickly became an effective time saver by identifying a minority of files: those most commonly seen. Over time, and with collaboration, C4All would come to pre-categorise more than 70% of files on a device.
Effective Triage Sets are Easy to Build
While a Contraband Filter created from the common core or Head of material represents far fewer images than some hash databases, operational success demonstrates that even small datasets can be effective if they include content that is widely circulated and appears on many suspect devices.
For triage in, quickly identifying whether a device contains any known indecent images is valuable: it can inform officers during a search, allow a suspect to be held on remand within the custody window, or simply prioritise devices for earlier full investigation.
Using the largest dataset of contraband will always give the most comprehensive results, and when preparing evidence for court a comprehensive examination may be required. However, a smaller dataset may be sufficient for triage.
While triage out should only be attempted with comprehensive datasets, small datasets are sufficient to find evidence fast, confirming which devices definitely contain indecent images (triage in). While triage in will always benefit from more data, the additional impact of ‘comprehensive’ over a ‘common core’ dataset may be smaller than you might think.
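The difference between triage in and triage out can be sketched in a few lines of Python. This example uses an ordinary set of SHA-256 hashes, which is an assumption made for illustration; it is not how Contraband Filters work internally, and the names `sha256_hex` and `triage_in` are hypothetical. The point it illustrates is that triage in can stop at the first match, while a result of no match from a small set says nothing about whether the device is clean.

```python
import hashlib
from typing import Iterable, Optional, Set, Tuple


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of a file's contents."""
    return hashlib.sha256(data).hexdigest()


def triage_in(files: Iterable[Tuple[str, bytes]],
              known_hashes: Set[str]) -> Optional[str]:
    """Return the name of the first file whose hash is in known_hashes, else None.

    A match is a positive result (triage in). None is inconclusive unless
    known_hashes is comprehensive, so it must not be read as triage out.
    """
    for name, data in files:
        if sha256_hex(data) in known_hashes:
            return name  # stop early: one hit is enough for triage in
    return None


# Placeholder byte strings stand in for file contents.
common_core = {sha256_hex(b"placeholder-known-file")}
device_files = [
    ("holiday.jpg", b"placeholder-unrelated-file"),
    ("img_0042.jpg", b"placeholder-known-file"),
    ("notes.txt", b"placeholder-other-file"),
]
print(triage_in(device_files, common_core))
```

Returning on the first hit is a deliberate design choice for triage in. A tool that needs a full inventory for classification or for court would keep scanning and collect every match instead.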
You may just be a few cases away from being able to build your own effective triage set.