Sunday, December 11, 2016

Sorting Out Photos

My wife and I are (or at least were) shutterbugs of a sort.  At this moment, I have just shy of 1.5 terabytes of photos that she and I have taken over the years.  I’ve also managed to make a hash of them with copies, duplicates and the occasional “I think I have this somewhere, but I can’t say for certain” directory.

I’ve been looking for the past year or two for a solution and still haven’t really found one I liked, so like a good nerd, I’ve rolled my own.  It’s cobbled together using BASH, ImageMagick, dcraw and MediaInfo.

My primary goal was to make sure that I had one copy of every file, not necessarily one high quality version of each picture or video.  Meaning that if I end up with duplicates of a picture in RAW, high res JPEG and a JPEG thumb, I’m ok with that.  Once I have done the initial culling of the photos, I may take another swipe at further deduping them.

Anyway, my script starts by recursively looping through the current path and all subdirectories.  If it encounters a file, it will retrieve the extension and then conditionally call some combination of the above utilities to retrieve the creation/capture/modified timestamp and the camera make/model.  It does this fairly well, but there are some major caveats which I’ll discuss in a bit.
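As a rough sketch of the extension step, the helper below (the name ext_of is mine, not something from the script) takes whatever follows the last dot in the file name and uppercases it, so it can be fed to a case statement.  It prints an empty line when the name has no dot at all.

```bash
#!/bin/bash
# Hypothetical helper: print the upper-cased extension of a path.
ext_of() {
    local name="${1##*/}"          # strip any leading directories
    case "$name" in
        *.*) printf '%s\n' "${name##*.}" | tr '[:lower:]' '[:upper:]' ;;
        *)   printf '\n' ;;        # no dot, so no extension
    esac
}

ext_of "/Volumes/e/Pictures/IMG_0001.jpg"   # prints JPG
```

Working on the file name rather than the whole path means a dot in a directory name can't be mistaken for the start of an extension.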

Once it has retrieved the above, it starts creating the following folder structure:

/<camera>/<year>/<month>/<day>

It then takes the file and tries to copy it into the following:

/<camera>/<year>/<month>/<day>/<year><month><day><hour><minute><second>.<#>.<ext>
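To make the naming scheme concrete, here is a small sketch that assembles that path from its parts.  The function name target_path, the "/Sorted" base and the camera name are placeholders for illustration only.

```bash
#!/bin/bash
# Hypothetical helper: assemble the destination path from its parts.
# Arguments: base camera year month day hour minute second counter ext
target_path() {
    printf '%s/%s/%s/%s/%s/%s%s%s%s%s%s.%s.%s\n' \
        "$1" "$2" "$3" "$4" "$5" \
        "$3" "$4" "$5" "$6" "$7" "$8" \
        "$9" "${10}"
}

# Example values only; prints /Sorted/NIKON D90/2014/07/05/20140705111216.0.NEF
target_path "/Sorted" "NIKON D90" 2014 07 05 11 12 16 0 NEF
```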

This should, in theory, allow me to identify a specific picture taken by a specific type of camera at a specific moment in time.  The initial issue I ran into with this is around time resolution.  The timestamps given to me by the various tools only resolve to the second (not millisecond like I’d prefer).  This means that if you have a camera that can take multiple pictures per second, then you can easily end up with duplicates, hence the <#> at the end.

If I encounter a file with the same timestamp, I then do an MD5 sum on both files to confirm they are actually the same file.  If they are, then off to a duplicates tree the file goes.  If they aren’t the same, then I start an auto increment pass until I can write the file out uniquely in the target folder.
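The comparison itself can be sketched like this.  The helper names file_md5 and same_content are made up for the example, and the fallback between md5sum and md5 is my assumption about what is installed (md5sum on most Linux boxes, md5 on a Mac).

```bash
#!/bin/bash
# Hypothetical helper: print the MD5 digest of a file using whichever
# tool is installed (md5sum on most Linux systems, md5 on macOS).
file_md5() {
    if command -v md5sum >/dev/null 2>&1; then
        md5sum "$1" | awk '{ print $1 }'
    else
        md5 -r "$1" | awk '{ print $1 }'
    fi
}

# Succeeds (exit status 0) when both files have the same digest.
same_content() {
    [ "$(file_md5 "$1")" = "$(file_md5 "$2")" ]
}
```

Used as "if same_content "$SOURCE" "$TARGET" ; then ..." it reads much like the check in the script.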

One issue, though, is that if the 0 file doesn’t match, I don’t check for subsequent matches, so the script could easily end up with files 1 2 3 and 4 all being duplicates.  Maybe I’ll try and fix that in a future edit.
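One possible shape for that fix, as a sketch only: walk every numbered slot and compare against each one before settling on the next free number.  The name find_slot is hypothetical, and I've used cmp -s (a byte-for-byte compare) in place of the MD5 check to keep the example short.

```bash
#!/bin/bash
# Hypothetical helper: walk the numbered slots base.0.ext, base.1.ext, ...
# Prints "dup <path>" if an existing slot is byte-identical to the source,
# otherwise "free <path>" for the first unused slot.
find_slot() {
    local source="$1" base="$2" ext="$3" n=0
    while [ -f "$base.$n.$ext" ]; do
        if cmp -s "$source" "$base.$n.$ext"; then
            printf 'dup %s\n' "$base.$n.$ext"
            return 0
        fi
        n=$((n + 1))
    done
    printf 'free %s\n' "$base.$n.$ext"
}
```

The caller would then send the file to the duplicates tree on "dup" and move it to the printed path on "free".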

As for the tooling, I use the following:

  • BASH
  • ImageMagick’s “identify -verbose” command to get information on JPEGs
  • MediaInfo for details on MP4/M4V/AVI/MOV files
  • dcraw for details on RAW files (such as Nikon’s NEF/NRW)

Probably one of the biggest issues I have is that while your typical JPG/NRW/NEF file includes the camera details, a video typically does not. That means that I’m a bit hard pressed to determine what camera took a specific video.  I also found that the camera metadata for when the file was captured isn’t always that useful, so there are limits.

One other thing to note: ImageMagick isn’t always that fast, so there’s room for improvement on this, specifically around JPEG processing.  I was hoping to use macOS’s mdls for getting the camera data, but that only works if the filesystem is local (not mounted like mine was).
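One cheap saving on the JPEG path, sketched below, is to split the EXIF timestamp with the shell’s own read instead of calling date once per field.  The function name split_exif_timestamp is mine, and it assumes the usual EXIF layout of “YYYY:MM:DD HH:MM:SS”.  It does not make identify itself any faster.

```bash
#!/bin/bash
# Hypothetical helper: split an EXIF-style "YYYY:MM:DD HH:MM:SS" string
# into the YEAR/MONTH/DAY/HOUR/MINUTE/SECOND variables without calling date.
split_exif_timestamp() {
    IFS=': ' read -r YEAR MONTH DAY HOUR MINUTE SECOND <<EOF
$1
EOF
}

split_exif_timestamp "2014:07:05 11:12:16"
echo "$YEAR$MONTH$DAY$HOUR$MINUTE$SECOND"   # prints 20140705111216
```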

If you’re a BASH expert, please be kind.  I’m good at programming, not always good at scripting.  Otherwise, help yourself.

#!/bin/bash
BASE="/Volumes/e/Pictures/Processed"

moveFile()
{
#    local SOURCE="$1"
#    local BASE="$2"
#    local EXT="$3"
#    local SUFFIX="$4"
   
    if [ -f "$2.$4.$3" ] ; then
        moveFile "$1" "$2" "$3" $(($4 + 1 ))
    else
        mv -n "$1" "$2.$4.$3"
    fi
}

moveNonDuplicateFile()
{
    mkdir -p "$BASE/Sorted/$2/$3/$4/$5"

    local T="$BASE/Sorted/$2/$3/$4/$5/$3$4$5$6$7$8"

    moveFile "$1" "$T" "$9" "0"
}

# Usage: moveDuplicateFile "$1" "$CAMERA" "$YEAR" "$MONTH" "$DAY" "$HOUR" "$MINUTE" "$SECOND" "$EXT"
moveDuplicateFile()
{
    mkdir -p "$BASE/Duplicates/$2/$3/$4/$5"

    local T="$BASE/Duplicates/$2/$3/$4/$5/$3$4$5$6$7$8"

    moveFile "$1" "$T" "$9" 0
}

processMOV()
{
    TIMESTAMP=`mediainfo "$1" | grep "Encoded date" | head -n 1 | sed 's/Encoded date//' | awk '{$1=$1;print}' | sed 's/: //'`
   
    YEAR=`date -ujf "%Z %Y-%m-%d %H:%M:%S" "$TIMESTAMP" +%Y`
    MONTH=`date -ujf "%Z %Y-%m-%d %H:%M:%S" "$TIMESTAMP" +%m`
    DAY=`date -ujf "%Z %Y-%m-%d %H:%M:%S" "$TIMESTAMP" +%d`

    HOUR=`date -ujf "%Z %Y-%m-%d %H:%M:%S" "$TIMESTAMP" +%H`
    MINUTE=`date -ujf "%Z %Y-%m-%d %H:%M:%S" "$TIMESTAMP" +%M`
    SECOND=`date -ujf "%Z %Y-%m-%d %H:%M:%S" "$TIMESTAMP" +%S`
   
    if [ -z "$CAMERA" ] ; then
        CAMERA="MOV"
    fi
}

processRAW()
{
    TIMESTAMP=`dcraw -i -v "$1" | grep Timestamp | sed s/Timestamp\:\ //`
   
    YEAR=`date -jf "%a %b %d %H:%M:%S %Y" "$TIMESTAMP" +%Y`
    MONTH=`date -jf "%a %b %d %H:%M:%S %Y" "$TIMESTAMP" +%m`
    DAY=`date -jf "%a %b %d %H:%M:%S %Y" "$TIMESTAMP" +%d`
   
    HOUR=`date -jf "%a %b %d %H:%M:%S %Y" "$TIMESTAMP" +%H`
    MINUTE=`date -jf "%a %b %d %H:%M:%S %Y" "$TIMESTAMP" +%M`
    SECOND=`date -jf "%a %b %d %H:%M:%S %Y" "$TIMESTAMP" +%S`
   
    CAMERA=`dcraw -i -v "$1" | grep 'Camera:' | awk -F\: '{ print $2 }' | tr '[:lower:]' '[:upper:]' | awk '{$1=$1;print}'`
}

processAVI()
{
    TIMESTAMP=`mdls "$1" | grep kMDItemContentCreationDate | sed 's/kMDItemContentCreationDate     = //'`
   
    if [ "$TIMESTAMP" == "" ] ; then
        YEAR="0000"
        MONTH="00"
        DAY="00"
        HOUR="00"
        MINUTE="00"
        SECOND="00"
    else
        YEAR=`date -jf "%Y-%m-%d %H:%M:%S %z" "$TIMESTAMP" +%Y`
        MONTH=`date -jf "%Y-%m-%d %H:%M:%S %z" "$TIMESTAMP" +%m`
        DAY=`date -jf "%Y-%m-%d %H:%M:%S %z" "$TIMESTAMP" +%d`

        HOUR=`date -jf "%Y-%m-%d %H:%M:%S %z" "$TIMESTAMP" +%H`
        MINUTE=`date -jf "%Y-%m-%d %H:%M:%S %z" "$TIMESTAMP" +%M`
        SECOND=`date -jf "%Y-%m-%d %H:%M:%S %z" "$TIMESTAMP" +%S`
    fi
   
    if [ -z "$CAMERA" ] ; then
        CAMERA="AVI"
    fi
}

processJPG()
{
#    TIMESTAMP=`mdls "$1" | grep kMDItemContentCreationDate | sed 's/kMDItemContentCreationDate     = //'`
#    CAMERA=`mdls "$1" | grep kMDItemAcquisitionModel | sed 's/kMDItemAcquisitionModel        = \"//' | sed s/\"//`

    TIMESTAMP=`identify -verbose "$1" | grep DateTimeDigitized | sed 's/    exif:DateTimeDigitized: //'`
    TIMESTAMP="$TIMESTAMP -0000"
    CAMERA=`identify -verbose "$1" | grep "exif:Model" | sed 's/    exif:Model: //'`

    if [ "$TIMESTAMP" == " -0000" ] ; then
        TIMESTAMP=`identify -verbose "$1" | grep "date:modify" | sed 's/    date:modify: //' | sed 's/\(.*\)-\(.*\)-\(.*\)T\(.*\)\([+-]\)\(.*\):\(.*\)/\1:\2:\3 \4 \5\6\7/'`
        #TIMESTAMP=`mdls "$1" | grep kMDItemFSContentChangeDate | sed 's/kMDItemFSContentChangeDate = //'`
    fi
   
    # | awk '{$1=$1;print}'`
       
    #echo $TIMESTAMP / $IMAGE
   
    # 2014-07-05T11:12:16-04:00
   
    if [ "$TIMESTAMP" == "" ] ; then
        YEAR="0000"
        MONTH="00"
        DAY="00"
        HOUR="00"
        MINUTE="00"
        SECOND="00"
    else
#         YEAR=`date -jf "%Y-%m-%d %H:%M:%S %z" "$TIMESTAMP" +%Y`
#         MONTH=`date -jf "%Y-%m-%d %H:%M:%S %z" "$TIMESTAMP" +%m`
#         DAY=`date -jf "%Y-%m-%d %H:%M:%S %z" "$TIMESTAMP" +%d`
#    
#         HOUR=`date -jf "%Y-%m-%d %H:%M:%S %z" "$TIMESTAMP" +%H`
#         MINUTE=`date -jf "%Y-%m-%d %H:%M:%S %z" "$TIMESTAMP" +%M`
#         SECOND=`date -jf "%Y-%m-%d %H:%M:%S %z" "$TIMESTAMP" +%S`
        YEAR=`date -jf "%Y:%m:%d %H:%M:%S %z" "$TIMESTAMP" +%Y`
        MONTH=`date -jf "%Y:%m:%d %H:%M:%S %z" "$TIMESTAMP" +%m`
        DAY=`date -jf "%Y:%m:%d %H:%M:%S %z" "$TIMESTAMP" +%d`

        HOUR=`date -jf "%Y:%m:%d %H:%M:%S %z" "$TIMESTAMP" +%H`
        MINUTE=`date -jf "%Y:%m:%d %H:%M:%S %z" "$TIMESTAMP" +%M`
        SECOND=`date -jf "%Y:%m:%d %H:%M:%S %z" "$TIMESTAMP" +%S`
    fi
   
   
    if [ -z "$CAMERA" ] ; then
        CAMERA="Unidentified"
    fi
}

processMPEG4()
{
    TIMESTAMP=`mediainfo "$1" | grep "Encoded date" | head -n 1 | sed 's/Encoded date//' | awk '{$1=$1;print}' | sed 's/: //'`
   
    YEAR=`date -ujf "%Z %Y-%m-%d %H:%M:%S" "$TIMESTAMP" +%Y`
    MONTH=`date -ujf "%Z %Y-%m-%d %H:%M:%S" "$TIMESTAMP" +%m`
    DAY=`date -ujf "%Z %Y-%m-%d %H:%M:%S" "$TIMESTAMP" +%d`

    HOUR=`date -ujf "%Z %Y-%m-%d %H:%M:%S" "$TIMESTAMP" +%H`
    MINUTE=`date -ujf "%Z %Y-%m-%d %H:%M:%S" "$TIMESTAMP" +%M`
    SECOND=`date -ujf "%Z %Y-%m-%d %H:%M:%S" "$TIMESTAMP" +%S`

    if [ -z "$CAMERA" ] ; then
        CAMERA="MP4"
    fi
}

processFile()
{
    echo "Processing file $1"
   
    local EXT=`echo "$1" | sed 's/.*\.\([A-Za-z0-9]*\)/\1/' | tr '[:lower:]' '[:upper:]'`

    case $EXT in

        # Picture Formats Here

        NEF)
            processRAW "$1"
            ;;

        NRW)
            processRAW "$1"
            ;;

        JPG)
            processJPG "$1"
            ;;

        # Media Formats Here

        AVI)
            processAVI "$1" "$EXT"
            ;;

        MOV)
            processMOV "$1" "$EXT"
            ;;

        MP4)
            processMPEG4 "$1" "$EXT"
            ;;

        M4V)
            processMPEG4 "$1" "$EXT"
            ;;
           
        DB)
            rm "$1"
            return 0
            ;;
           
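        PNG)
            # Park PNGs under Other/PNG, keeping the original base name.
            mkdir -p "$BASE/Other/PNG"
            moveFile "$1" "$BASE/Other/PNG/${2%.*}" "PNG" 0
            return 0
            ;;
           
        PANO)
            # Same treatment for PANO files.
            mkdir -p "$BASE/Other/PANO"
            moveFile "$1" "$BASE/Other/PANO/${2%.*}" "PANO" 0
            return 0
            ;;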
           
        \*)
            return 0
            ;;

        DS_STORE)
            rm "$1"
            return 0
            ;;

        *)
            echo "Unmapped extension $EXT"
            return 0
            ;;

    esac

    # Skip the file if any part of the timestamp is missing.
    if [ -z "$YEAR" ] ; then
        echo "Image with no YEAR"
        return 0
    fi

    if [ -z "$MONTH" ] ; then
        echo "Image with no MONTH"
        return 0
    fi

    if [ -z "$DAY" ] ; then
        echo "Image with no DAY"
        return 0
    fi

    if [ -z "$HOUR" ] ; then
        echo "Image with no HOUR"
        return 0
    fi

    if [ -z "$MINUTE" ] ; then
        echo "Image with no MINUTE"
        return 0
    fi

    if [ -z "$SECOND" ] ; then
        echo "Image with no SECOND"
        return 0
    fi

    TARGET="$BASE/Sorted/$CAMERA/$YEAR/$MONTH/$DAY/$YEAR$MONTH$DAY$HOUR$MINUTE$SECOND.0.$EXT"
   
    if [ -f "$TARGET" ] ; then
   
        SOURCEHASH=`md5 -r "$1" | awk '{ print $1; }'`
        TARGETHASH=`md5 -r "$BASE/Sorted/$CAMERA/$YEAR/$MONTH/$DAY/$YEAR$MONTH$DAY$HOUR$MINUTE$SECOND.0.$EXT" | awk '{ print $1; }'`

        if [ "$SOURCEHASH" == "$TARGETHASH" ] ; then
            moveDuplicateFile "$1" "$CAMERA" "$YEAR" "$MONTH" "$DAY" "$HOUR" "$MINUTE" "$SECOND" "$EXT"
        else
            moveNonDuplicateFile "$1" "$CAMERA" "$YEAR" "$MONTH" "$DAY" "$HOUR" "$MINUTE" "$SECOND" "$EXT"
        fi
    else
        moveNonDuplicateFile "$1" "$CAMERA" "$YEAR" "$MONTH" "$DAY" "$HOUR" "$MINUTE" "$SECOND" "$EXT"
    fi

#    mv -n "$1" "$TARGET"
}

processDirectory()
{
    echo "Processing dir  $1"
   
    cd "$1"

    YEAR=""
    MONTH=""
    DAY=""
    HOUR=""
    MINUTE=""
    SECOND=""
    CAMERA=""

    for FILE in * ; do
   
        if [ -d "$1/$FILE" ] ; then
            processDirectory "$1/$FILE"
#            rmdir "$1/$FILE"
        else
            processFile "$1/$FILE" "$FILE"
        fi
   
    done

    if [ -f ".DS_Store" ] ; then
        rm .DS_Store
    fi   

    cd ..
    rmdir "$1"
}

CURRENT=`pwd`

processDirectory "$CURRENT"
