Developed a system for parsing data from LinkedIn with Facebook pairing.
Application infrastructure:
- Virtual server based on Linux OS;
- Docker-based component containerization;
- Parameterizing the launch of Docker containers;
- DBMS for storing intermediate and final selection options;
- Web server for displaying settings and lists of selected posts;
- Scheduled jobs that start LinkedIn data collection every 2 hours;
- Logging of the container launch procedure;
- On-demand (forced) start of the collector in non-standard situations.
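The launch logic above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the image name, container name, environment variable names and the helper functions `build_docker_command`, `is_due` and `launch` are all hypothetical. It assumes parameters are passed to the container as environment variables, and it only builds and logs the `docker run` command without executing it.

```python
import logging
import shlex
from datetime import datetime, timedelta

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("collector-launcher")  # hypothetical logger name

COLLECT_INTERVAL = timedelta(hours=2)  # interval taken from the description above


def build_docker_command(image, name, params):
    """Build a `docker run` argument list; params become environment variables."""
    cmd = ["docker", "run", "--rm", "--name", name]
    for key in sorted(params):  # sorted so the command is reproducible in logs
        cmd += ["-e", f"{key}={params[key]}"]
    cmd.append(image)
    return cmd


def is_due(last_run, now, forced=False):
    """A run is due every COLLECT_INTERVAL, or immediately when forced."""
    return forced or now - last_run >= COLLECT_INTERVAL


def launch(image, name, params, last_run, now, forced=False):
    """Log the launch decision and return the command, or None if not due."""
    if not is_due(last_run, now, forced):
        log.info("Skipping %s: next run is not due yet", name)
        return None
    cmd = build_docker_command(image, name, params)
    log.info("Starting %s (forced=%s): %s", name, forced, shlex.join(cmd))
    return cmd  # could be handed to subprocess.run(cmd, check=True) on a Docker host


last = datetime(2024, 1, 1, 0, 0)
launch("collector:latest", "collector-keywords",
       {"MODE": "keywords", "MAX_POSTS": 100}, last, datetime(2024, 1, 1, 2, 0))
```

Returning the command instead of running it keeps the sketch testable on a machine without Docker; a forced start simply bypasses the interval check.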
Data collection (implemented as three functional software modules):
- Search of posts by key phrases taken from the input parameters;
- Search within a list of accounts taken from the input parameters, with key phrases set per account;
- Search within groups taken from the input parameters, with key phrases set per group.
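One way to picture the three modules is as producers of uniform search tasks. The sketch below is an assumption about how the input parameters might be modelled; `SearchTask`, `keyword_tasks` and `targeted_tasks` are hypothetical names, and the actual LinkedIn querying is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SearchTask:
    mode: str                      # "keywords", "account" or "group"
    phrase: str                    # key phrase to search for
    target: Optional[str] = None   # account or group identifier; None for plain keyword search


def keyword_tasks(phrases):
    """Module 1: one task per key phrase, searched across all posts."""
    return [SearchTask("keywords", phrase) for phrase in phrases]


def targeted_tasks(mode, phrases_by_target):
    """Modules 2 and 3: each account or group carries its own key phrases."""
    return [
        SearchTask(mode, phrase, target)
        for target, phrases in phrases_by_target.items()
        for phrase in phrases
    ]


tasks = (
    keyword_tasks(["python developer"])
    + targeted_tasks("account", {"acme-corp": ["hiring", "vacancy"]})
    + targeted_tasks("group", {"data-jobs": ["remote"]})
)
```

With this shape, a single collector loop could process all three modes and only the query construction would differ per `mode`.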
Pre-processing and filtering of collected posts, with additional analysis and selection based on text content:
- Filtering of posts by URL and post title to detect duplicates published by other LinkedIn users;
- Filtering and selection by post author;
- Filtering and selection by publication date;
- Filtering and selection by additional phrases, the presence of which means the post is included in the final list;
- Filtering and selection by negative keywords from the negative keyword list;
- Programmatic search for contact data using stop phrases from the stop phrase directory;
- When a stop phrase is found, the text below it is deleted and replaced with a previously prepared text block;
- Saving the selected data in an intermediate database for further use, moderation and publication.
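The filtering steps above might look roughly like this in code. It is a sketch under several assumptions: the `Post` fields and function names are hypothetical, a post counts as a duplicate when either its URL or its normalized title has already been seen, the author filter is treated as an allow-list, and the stop phrase itself is removed together with the text that follows it. The description does not settle these details, so the real system may differ.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Post:
    url: str
    title: str
    author: str
    published: date
    text: str


def deduplicate(posts):
    """Keep the first post per URL or normalized title (assumed duplicate rule)."""
    seen_urls, seen_titles, unique = set(), set(), []
    for post in posts:
        url = post.url.rstrip("/").lower()
        title = " ".join(post.title.lower().split())
        if url in seen_urls or title in seen_titles:
            continue
        seen_urls.add(url)
        seen_titles.add(title)
        unique.append(post)
    return unique


def passes_filters(post, since, authors, required, negative):
    """Apply date, author, required-phrase and negative-keyword checks."""
    text = post.text.lower()
    if post.published < since:
        return False
    if authors and post.author not in authors:
        return False
    if required and not any(p.lower() in text for p in required):
        return False
    return not any(w.lower() in text for w in negative)


def cut_at_stop_phrase(text, stop_phrases, replacement):
    """Replace everything from the first stop phrase onward with a prepared block."""
    phrases = [p for p in stop_phrases if p]
    if not phrases:
        return text
    pattern = re.compile("|".join(re.escape(p) for p in phrases), re.IGNORECASE)
    match = pattern.search(text)
    if match is None:
        return text
    return text[:match.start()].rstrip() + "\n" + replacement
```

For example, `cut_at_stop_phrase("We are hiring.\nContact me: a@example.com", ["contact me"], "Apply via our form.")` drops the contact line and appends the prepared block in its place.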
Outputting data collection results:
- Program module that displays the list of selected posts with details on each of them; in the web interface the user can mark posts to be excluded from the final export;
- Program interface for displaying the list of search accounts, with the ability to edit the list;
- Program interface for displaying the list of search groups, with the ability to edit the list;
- Program interface for displaying the list of stop phrases, with the ability to edit the list;
- Data preparation block that converts the content of posts into a standard format for publication, taking into account the results of manual moderation.
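The data preparation step can be illustrated with a small sketch. The dictionary keys (`id`, `title`, `text`, `url`), the output layout and the function name `prepare_for_publication` are assumptions made for this example; the description does not specify the "standard format", so a simple normalized record is used as a stand-in.

```python
def prepare_for_publication(posts, excluded_ids):
    """Drop posts rejected during manual moderation and normalize the rest."""
    prepared = []
    for post in posts:
        if post["id"] in excluded_ids:  # moderator marked this post for exclusion
            continue
        prepared.append({
            "title": post["title"].strip(),
            "body": " ".join(post["text"].split()),  # collapse runs of whitespace
            "source": post["url"],
        })
    return prepared


selected = [
    {"id": 1, "title": " Python developer ", "text": "Remote  role.\nApply now.", "url": "https://example.com/p/1"},
    {"id": 2, "title": "Spam", "text": "Buy now", "url": "https://example.com/p/2"},
]
ready = prepare_for_publication(selected, excluded_ids={2})
```

Keeping moderation results as a set of excluded identifiers means the stored posts stay untouched, and the export can be regenerated after the moderator changes a decision.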