Case study:

International Federation of the Phonographic Industry 

Text-based detection of infringing URLs of songs that belong to specific artists

Challenge

IFPI represents the recording industry worldwide and has the crucial task of protecting copyright against online infringement. To achieve this objective, IFPI needed to move from a legacy system to a scalable, highly efficient solution capable of ingesting massive amounts of data from automated web crawlers and comparing it against an internal database of artists and tracks.

The customer represents music industry clients and is tasked with enforcing copyright on a large and varied back catalogue. Part of this task entails searching for infringing shares on social media. The objective is to enable large-scale back-catalogue processing and to identify the exact or best match against a particular piece of repertoire. The existing implementation was slow: its design throttled volume and throughput because of resource constraints in the crawling process and the single-threaded nature of the legacy Java application.
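To illustrate what "exact or best match" against a catalogue can look like for text taken from a URL or post, here is a minimal sketch using Python's standard-library difflib. It is not the matching logic used in the project, which the case study does not describe. The catalogue entries, the function names and the 0.6 threshold are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical catalogue entries (artist, track); the real repertoire
# database is not described in this case study.
CATALOGUE = [
    ("Example Artist", "First Song"),
    ("Example Artist", "Second Song"),
    ("Another Band", "Night Drive"),
]


def normalize(text):
    """Lowercase the text and collapse each non-alphanumeric run to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def best_match(candidate, catalogue, threshold=0.6):
    """Return (artist, track, score) for the closest entry, or None.

    The score is difflib's similarity ratio in [0, 1]; 1.0 means the
    normalized strings are identical. The threshold is an assumed value.
    """
    target = normalize(candidate)
    best = None
    for artist, track in catalogue:
        reference = normalize(f"{artist} {track}")
        score = SequenceMatcher(None, target, reference).ratio()
        if best is None or score > best[2]:
            best = (artist, track, score)
    if best is not None and best[2] >= threshold:
        return best
    return None
```

Calling best_match("example-artist_first-song", CATALOGUE) returns the "First Song" entry with a score of 1.0, while text unrelated to the catalogue falls below the threshold and returns None. A production system would likely need a faster candidate lookup than this linear scan.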

The Cloudwick Solution

Cloudwick proposed replacing the legacy on-premises data warehouse with Amazon Redshift, a fast, fully managed data warehouse that supports simple and cost-effective analysis using standard SQL and existing Business Intelligence (BI) tools. AWS Data Pipeline orchestrates the same transformations that were previously run on-premises. The complex ETL processes were migrated to PySpark so that they could use Spark SQL with minimal changes to the original code. Keeping the original logic largely intact maximizes future maintainability, and choosing PySpark as our toolkit future-proofs the work.

Cloudwick employed modern CI/CD pipelines to deploy the code reliably. This allowed us to run test cases verifying that the migrated procedures matched the original results, and gave us confidence in the release management process. Cloudwick used GitLab as our source code management system and used its CI/CD capabilities to automatically test master builds and push passing builds to Amazon S3 for deployment.
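A test case of the kind described above, checking that a migrated procedure reproduces the original results, could be sketched as follows. This is an illustrative helper, not code from the project: the function name and the sample rows are hypothetical. It compares two result sets as multisets, so row order is ignored but duplicate rows still count.

```python
from collections import Counter


def diff_result_sets(legacy_rows, migrated_rows):
    """Compare two result sets, ignoring row order.

    Returns (missing, unexpected): rows that appear only in the legacy
    output and rows that appear only in the migrated output. Duplicates
    are counted, so a row emitted twice must appear twice on both sides.
    """
    legacy = Counter(tuple(row) for row in legacy_rows)
    migrated = Counter(tuple(row) for row in migrated_rows)
    missing = sorted((legacy - migrated).elements(), key=repr)
    unexpected = sorted((migrated - legacy).elements(), key=repr)
    return missing, unexpected


# Hypothetical sample data standing in for the two systems' outputs.
legacy_output = [(1, "a"), (2, "b"), (2, "b")]
migrated_output = [(2, "b"), (1, "a"), (3, "c")]
missing, unexpected = diff_result_sets(legacy_output, migrated_output)
```

With this sample data, missing holds the one (2, "b") row that the migrated output lost and unexpected holds the extra (3, "c") row. In a CI job, a test would assert that both lists are empty.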

We implemented an AWS Data Pipeline that automatically provisioned Amazon EC2 instances with our latest required configuration and ran the latest deployed version of our PySpark code. This automation significantly shortened Cloudwick's code-compile-debug cycle and resulted in a more refined end result. AWS Database Migration Service was used to import the data from the legacy database into Amazon S3. Infrastructure-as-Code allowed us to replicate our test environment precisely in the staging environment, so our developers could work with dummy data and ensure that deployments were reliable.

Finally, Amazon CloudWatch and Amazon Simple Notification Service (Amazon SNS) were used to inform the relevant data owners or responsible parties of successful data loads and to alert them to any problems.
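As a rough sketch of the notification step, the helper below builds a success or failure message and publishes it through an SNS client. The message wording, the function names and the dataset and topic values are assumptions; the case study does not describe the actual message format. The client is passed in as a parameter, so in practice it would be the object returned by boto3.client("sns"), whose publish call accepts TopicArn, Subject and Message.

```python
def build_load_notification(dataset, rows_loaded, error=None):
    """Return (subject, message) describing the outcome of one data load."""
    status = "FAILED" if error else "SUCCEEDED"
    # SNS subjects are limited to 100 characters, so truncate defensively.
    subject = f"[{status}] Data load: {dataset}"[:100]
    if error:
        message = f"Load of '{dataset}' failed: {error}"
    else:
        message = f"Load of '{dataset}' completed; {rows_loaded} rows loaded."
    return subject, message


def notify_load_result(sns_client, topic_arn, dataset, rows_loaded, error=None):
    """Publish the load outcome to an SNS topic and return the API response.

    sns_client is expected to behave like boto3.client("sns").
    """
    subject, message = build_load_notification(dataset, rows_loaded, error)
    return sns_client.publish(TopicArn=topic_arn, Subject=subject, Message=message)
```

Passing the client in, rather than creating it inside the function, keeps the helper testable with a stub and leaves credentials and region configuration to the caller.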

Benefits

  • Improved performance, rapid ingest and the ability to de-duplicate.
  • A solution capable of processing millions of records daily, enabling fast and accurate identification and detection against batch input data.
  • AWS role-based security provided controlled data access.